[#] Mon Jun 29 2009 10:58:03 EDT from IGnatius T Foobar @ Uncensored

UTF-8 is, in my opinion, REALLY F***ING CLEVER.

Unicode came along, with its tens of thousands of characters, and it would be a nice thing to have -- the character set to end all character sets. Literally.

Windows defined all-new APIs to deal with strings using wide (16-bit) characters.
New data types, new functions, new everything. Very ugly and very messy.

Ken Thompson and Rob Pike took a different approach for Unix. They came up with UTF-8, which has the ability to map the entire Unicode set onto strings of 8-bit characters. Strings are still null-terminated, so all of the existing APIs like strcpy(), strcat(), and strlen() still work. As an added bonus, the characters which made up low ASCII still consume only one byte each, so you don't end up doubling all of your string buffers "just in case" someone wants to use international characters. And if you've got mixed US and international text, you automatically get something which resembles compression.

Adding UTF-8 support to a program that previously used only ASCII is fairly easy. You just have to make sure all of your string handling is 8-bit clean.
And there are a couple of situations where you might want to use a different string length function. strlen() still tells you the number of bytes a string consumes, which is usually what the programmer wants. There are also a few functions like mbstowcs(), which (called with a NULL destination on POSIX systems) tells you the number of characters in the string, which in UTF-8 is of course not necessarily the same number, and wcswidth(), which tells you the number of columns a wide-character string will occupy (since Unicode does contain some characters which require more than one column to display if you're using a fixed-width font). wcwidth() does the same for a single character.
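
Here's a rough sketch of those three measurements (bytes, characters, columns), written in Python rather than C to keep it short. Note the assumptions: `display_columns` is a made-up helper, not a library function, and it only approximates what wcwidth() does by treating East Asian wide/fullwidth characters as two columns and combining marks as zero. Real terminals can disagree on the edge cases.

```python
import unicodedata

def display_columns(text):
    """Approximate fixed-width column count (hypothetical helper)."""
    cols = 0
    for ch in text:
        if unicodedata.combining(ch):
            continue  # combining marks sit on top of the previous character
        # 'W' (wide) and 'F' (fullwidth) characters occupy two columns
        cols += 2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1
    return cols

s = "na\u00efve \u65e5\u672c\u8a9e"   # "naive" with a diaeresis, a space, three kanji

byte_len = len(s.encode("utf-8"))  # what strlen() would report on the UTF-8 bytes
char_len = len(s)                  # code points, the count mbstowcs() would give
col_len = display_columns(s)       # approximate columns in a fixed-width font

print(byte_len, char_len, col_len)  # 16 9 12
```

Same string, three different "lengths", which is why you have to decide which one you actually mean.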

(Yes, I know Windows supports UTF-8 now too, but the point was that as always, they went with the ugly way first, because after all these years they still can't design good software on their own. They can only crank out something marginally acceptable when copying someone else.)

Here's a really good writeup, if you want to read more:

[#] Mon Jun 29 2009 15:27:04 EDT from girthta @ Uncensored

Subject: Visual Basic Help

Does anyone have a resource for learning VERY VERY BASIC Visual Basic stuff?

I just want to know how to script a couple of really simple commands. I've tried learning VB before but usually my eyes just glaze over.

Most immediately I need to write a string that will automagically direct someone to where I want the document to be saved. I have no real interest in learning to program but I keep finding myself needing a little VB knowledge here and there.

[#] Mon Jun 29 2009 15:46:53 EDT from girthta @ Uncensored

My bad, it's actually MS Script Editor I need to work with. Jeez.

[#] Mon Jun 29 2009 16:33:27 EDT from fleeb @ Uncensored

Isn't MS Script Editor essentially VBScript?

[#] Mon Jun 29 2009 16:45:48 EDT from Ford II @ Uncensored

wikipedia has a really good article on utf8.
I didn't read the utf16 article, but basically utf8 is really slick, because it can do anything. But it stops being efficient if you represent one string with lots of different language characters in it.
Although I find it hard to understand why people complain about inefficient utf-8 (as in you may burn 4 bytes per character for the whole document in some cases) but they don't complain about the rest of the fucked up shit in the computer industry like.. oh... windows vista for example.
Anyway, so really you only need utf8, but if you're going to do lots of 3 byte characters (CJK text, say), you're better off with utf 16, which stores most of them in 2 bytes and so makes more efficient use of the space for the same thing.

[#] Tue Jun 30 2009 03:00:06 EDT from dothebart @ Uncensored

but if... like odf does... a gzip runs over that text it will compress better, and give you back the extra space.

[#] Tue Jun 30 2009 09:48:19 EDT from fleeb @ Uncensored

I should think UTF16 compresses at least as well as UTF8, but I haven't actually done any benchmarks to make sure.

UTF8 is pretty smart, though. It's truly a shame Microsoft went with UTF16.
It's awkward.
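
For anyone who wants to actually run that benchmark, here's a minimal sketch in Python using zlib (the same deflate algorithm gzip uses). The sample text is made up and very repetitive, so treat the numbers as a toy comparison and swap in a real document before drawing conclusions about which encoding compresses better.

```python
import zlib

# Hypothetical sample: English plus hiragana, repeated so deflate has something to find
sample = ("The quick brown fox jumps over the lazy dog. "
          "いろはにほへと ちりぬるを ") * 200

sizes = {}
for encoding in ("utf-8", "utf-16-le"):
    raw = sample.encode(encoding)
    packed = zlib.compress(raw, 9)   # level 9 = best compression
    sizes[encoding] = (len(raw), len(packed))
    print(f"{encoding}: {len(raw)} bytes raw, {len(packed)} bytes compressed")
```

The interesting number is the compressed size for each encoding on your own data, not on this sample.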

[#] Tue Jun 30 2009 15:51:09 EDT from Peter Pulse @ Uncensored

So, not to derail the character encoding discussions.. but.. I want to replace my laptop hard disk with a bigger one.. but I don't know what brands/models are the reliable ones these days. Any suggestions? It's an old thinkpad.. so a regular 2.5 inch IDE drive, ideally want to put at least 160G in there but looking for more like 250+

[#] Tue Jun 30 2009 15:57:32 EDT from LoanShark @ Uncensored

If it's old old, remember there is a 128 GiB (about 137 GB) barrier for standard IDE with 28-bit LBA, for old BIOSes and operating systems.
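
For the curious, here's where that kind of barrier comes from, as a quick Python sketch. It assumes the limit in question is the classic 28-bit LBA one with traditional 512-byte sectors; `max_capacity_bytes` is just an illustrative helper name.

```python
SECTOR_BYTES = 512  # traditional ATA sector size (assumed)

def max_capacity_bytes(lba_bits):
    """Largest drive addressable with the given number of LBA bits."""
    return (2 ** lba_bits) * SECTOR_BYTES

lba28 = max_capacity_bytes(28)
lba48 = max_capacity_bytes(48)

print(lba28)            # 137438953472 bytes
print(lba28 / 10**9)    # about 137.4 decimal gigabytes
print(lba28 // 2**30)   # 128 binary gigabytes (GiB)
print(lba48 // 2**50)   # 128 PiB with 48-bit LBA
```

So a BIOS or OS that only speaks 28-bit LBA can't see past roughly 137 GB, and 48-bit LBA pushes the ceiling far beyond any laptop drive.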

[#] Tue Jun 30 2009 16:34:42 EDT from Peter Pulse @ Uncensored

Hmm.. good point, it's a T30. I'll have to research it a bit. I should just move to a different laptop (actually I have a T42 I'm trying to get going) but I started using this one when my other machine died and despite its slowness it is a very enjoyable machine.. solid, great keyboard etc.. all the things that made thinkpads great before they started going downhill. So I want to extend its life.

[#] Tue Jun 30 2009 16:36:35 EDT from dothebart @ Uncensored

a good collection.

*prints and puts under the cussion*


[#] Tue Jun 30 2009 16:57:35 EDT from Peter Pulse @ Uncensored

So, it seems that this machine does have LBA48 support in the most recent BIOS, I will just have to check to see if I need to update my BIOS.

[#] Wed Jul 01 2009 06:28:19 EDT from girthta @ Uncensored

Well I didn't want to go to our Tech department because it's outsourced and every request costs money. I thought I could figure it out but the IT Director (who is not an IT person really) overheard me talking to my boss about what I wanted to do and INSISTED that I submit a Sysaid and let them handle it. She said my time was better spent and I'm about the only person in the agency who would come up with a juicy request like that.

I may still try to figure out scripting stuff on my own... just cause I keep needing to use it.

[#] Wed Jul 01 2009 11:54:53 EDT from Peter Pulse @ Uncensored

LOL Girtha he probably doesn't want you showing up his "department" ;)

So, does anyone have any suggestions for me on laptop hard disk brands??

[#] Thu Jul 02 2009 02:48:03 EDT from Dirk Stanley @ Uncensored

So here's another "Oh-god-Dirk-has-no-clue"-type question :

So if UTF-8 essentially burns several bytes per character -

1. Does it use more bytes per character for normal English?
2. Does it use the same for, say, Chinese?

... In other words, would a document with 1000 English characters be smaller than a document with 1000 Chinese characters?

(And do IP packets still get addressed with standard ASCII, or can they use UTF-8 for that too somehow?)

[#] Thu Jul 02 2009 04:45:00 EDT from dothebart @ Uncensored

the chinese will definitely be bigger. how big depends on how high in the unicode namespace the chinese chars are...

you pro'lly mean DNS by "ip" right? since ips are just numbers, they don't have a relation to UTF8.

DNS does something similar to what you might have seen with base64 in your emails: umlaut domains and the like get re-encoded into plain ASCII (Punycode)... so a purely chinese domain might take up to 5 times what a generic english domain would take in glyphs.
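
To see what that ASCII re-encoding looks like, here's a small Python sketch. It uses Python's built-in "idna" and "punycode" codecs; as far as I know the "idna" codec follows the older IDNA 2003 rules, so take it as an illustration and not as what every registrar or browser does today. The domain names are just examples.

```python
# An umlaut domain gets an ASCII-only form with an "xn--" prefix
name = "b\u00fccher.de"              # "bucher" with u-umlaut
ascii_name = name.encode("idna")
print(ascii_name)                    # b'xn--bcher-kva.de'

# A purely Chinese label, encoded with the raw Punycode algorithm
label = "\u4e2d\u6587"               # two Chinese characters
puny = label.encode("punycode")
wire_len = len("xn--") + len(puny)   # length of the ASCII label DNS would carry
print(len(label), wire_len)          # glyph count vs. ASCII characters

# Both directions are lossless
round_trip_ok = (ascii_name.decode("idna") == name
                 and puny.decode("punycode") == label)
```

The non-ASCII name never reaches DNS itself; only the "xn--" form does.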

[#] Thu Jul 02 2009 11:34:29 EDT from IGnatius T Foobar @ Uncensored

Read the document I cited earlier; it'll answer all your questions.

Part of the brilliance of UTF-8 is that the representation of ASCII characters 0-127 does not change at all. It's strictly one byte per character, and characters 0-127 of UTF-8 are identical to characters 0-127 of ASCII. Nothing changes.

It's only when you get into the international character sets (such as Chinese) that you start to insert multibyte characters into the data stream. When you see a byte in the range of 128-255, it means that you're looking at part of a multibyte character in UTF-8. This allows all characters to be representable using (mostly) existing software, without requiring wasted space for representation of ASCII characters.

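Hmm, actually it's 1-127 and 128-244. 0 is still a null terminator, and the bytes 245-255 never appear in valid UTF-8 at all (neither do 192 and 193). The "byte order mark" (U+FEFF) is normally only used in UTF-16, where it shows up as the bytes 254 and 255, but some software naively retains it when downconverting to UTF-8, where it becomes the three bytes EF BB BF.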
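
If you want to poke at the byte values yourself, here's a minimal Python sketch. `utf8_bytes` is a throwaway helper name, and the sample characters are arbitrary picks.

```python
def utf8_bytes(ch):
    """Return the UTF-8 encoding of a string as a list of byte values."""
    return list(ch.encode("utf-8"))

ascii_a = utf8_bytes("A")        # [65], identical to plain ASCII
e_acute = utf8_bytes("\u00e9")   # [195, 169], two bytes, both >= 128
zhong = utf8_bytes("\u4e2d")     # [228, 184, 173], three bytes, all >= 128
bom = utf8_bytes("\ufeff")       # [239, 187, 191], i.e. EF BB BF

for label, value in (("A", ascii_a), ("e-acute", e_acute),
                     ("CJK", zhong), ("BOM", bom)):
    print(label, value)

# A lone 0xFF byte is never valid UTF-8, so decoding it fails
try:
    bytes([0xFF]).decode("utf-8")
    ff_is_valid = True
except UnicodeDecodeError:
    ff_is_valid = False
print(ff_is_valid)  # False
```

Every byte of a multibyte character has the high bit set, so it can never be mistaken for an ASCII character or for the null terminator.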

[#] Thu Jul 02 2009 15:39:44 EDT from fleeb @ Uncensored

When you asked 'does it use more bytes per character for normal English', the answer is 'no'.

When you asked 'does it use the same for, say, Chinese', the answer is 'ASCII doesn't have a way to represent Chinese'. But that's a pedantic answer.

I'm not 100% sure, but I think it's possible to use more bytes to represent some languages in UTF-8 than in UTF-16 (the Windows solution). But for English, UTF-8 wins over UTF-16.

[#] Thu Jul 02 2009 17:03:30 EDT from LoanShark @ Uncensored

Correct. For typical English and Western European languages, UTF-8 is a more compact encoding. However, for "CJK" Chinese/Japanese/Korean, UTF-16 is the more compact encoding. All of those characters are from a codepoint range that encodes to three UTF-8 bytes or (usually) two UTF-16 bytes.
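
A quick way to check this, and to answer the "1000 English characters vs. 1000 Chinese characters" question directly, is a Python sketch like the one below. `encoded_sizes` is a made-up helper, and "utf-16-le" is used so that no byte order mark gets counted.

```python
def encoded_sizes(text):
    """Return (UTF-8 byte count, UTF-16 byte count) for a string."""
    return len(text.encode("utf-8")), len(text.encode("utf-16-le"))

english = "a" * 1000        # 1000 ASCII characters
chinese = "\u4e2d" * 1000   # 1000 copies of one common Chinese character

english_sizes = encoded_sizes(english)
chinese_sizes = encoded_sizes(chinese)

print(english_sizes)  # (1000, 2000)
print(chinese_sizes)  # (3000, 2000)
```

So the English document is one third the size of the Chinese one in UTF-8, and UTF-16 only comes out ahead for the Chinese text.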

[#] Thu Jul 02 2009 19:38:24 EDT from IGnatius T Foobar @ Uncensored

I would say that UTF-8 wins for "all Western languages" rather than just US English / US ASCII, since most of those languages share a large portion of the same character set.
