Unicode came along, with its tens of thousands of characters, and it would be a nice thing to have -- the character set to end all character sets. Literally.
Windows defined all-new APIs to deal with strings of 16-bit wide characters.
New data types, new functions, new everything. Very ugly and very messy.
Ken Thompson and Rob Pike took a different approach for Unix. They came up with UTF-8, which maps the entire Unicode set onto strings of 8-bit bytes. Strings are still null-terminated, so all of the existing APIs like strcpy(), strcat() and strlen() still work. As an added bonus, the characters which make up low ASCII still consume only one byte each, so you don't end up doubling all of your string buffers "just in case" someone wants to use international characters. And if you've got mixed US and international text, you automatically get something which resembles compression.
Adding UTF-8 support to a program that previously used only ASCII is fairly easy. You just have to make sure all of your string handling is 8-bit clean.
And there are a couple of situations where you might want a different string length function. strlen() still tells you the number of bytes a string consumes, which is usually what the programmer wants. There are also functions like mbstowcs() which (called with a NULL destination) tells you the number of characters in the string, which in UTF-8 is of course not necessarily the same as the byte count, and wcswidth() which tells you the number of columns a string will occupy (since Unicode does contain some characters which require more than one column to display, even in a fixed-width font).
(Yes, I know Windows supports UTF-8 now too, but the point was that as always, they went with the ugly way first, because after all these years they still can't design good software on their own. They can only crank out something marginally acceptable when copying someone else.)
Here's a really good writeup, if you want to read more:
Subject: Visual Basic Help
Does anyone have a resource for learning VERY VERY BASIC Visual Basic stuff?
I just want to know how to script a couple of really simple commands. I've tried learning VB before but usually my eyes just gloss over.
Most immediately I need to write a string that will automagically direct someone to where I want the document to be saved. I have no real interest in learning to program but I keep finding myself needing a little VB knowledge here and there.
My bad, it's actually MS Script Editor I need to work with. Jeez.
Isn't MS Script Editor essentially VBScript?
I didn't read the utf16 article, but basically utf8 is really slick, because it can do anything. But it stops being efficient if you represent one string with lots of different language characters in it.
Although I find it hard to understand why people complain about inefficient utf-8 (as in you may burn 4 bytes per character for the whole document in some cases) but they don't complain about the rest of the fucked up shit in the computer industry like.. oh... windows vista for example.
Anyway, so really you only need utf8, but if you're going to have lots of characters that take three bytes in utf8 (like most CJK), you're better off with utf16, which stores those same characters in two bytes each.
but if -- like odf does -- you run gzip over that text, it will compress better and give you back the extra space.
I should think UTF16 compresses at least as well as UTF8, but I haven't actually done any benchmarks to make sure.
UTF8 is pretty smart, though. It's truly a shame Microsoft went with UTF16.
If it's old old, remember there is a 137 GB (28-bit LBA) barrier for standard IDE, for old BIOSes and operating systems.
a good collection.
*prints and puts under the cushion*
Well, I didn't want to go to our Tech department because it's outsourced and every request costs money. I thought I could figure it out, but the IT Director (who is not an IT person really) overheard me talking to my boss about what I wanted to do and INSISTED that I submit a Sysaid and let them handle it. She said my time was better spent, and that I'm about the only person in the agency who would come up with a juicy request like that.
I may still try to figure out scripting stuff on my own... just cause I keep needing to use it.
So, does anyone have any suggestions for me on laptop hard disk brands??
So if UTF-8 essentially burns several bytes per character -
1. Does it use more bytes per character for normal English?
2. Does it use the same for, say, Chinese?
... In other words, would a document with 1000 English characters be smaller than a document with 1000 Chinese characters?
(And do IP packets still get addressed with standard ASCII, or can they use UTF-8 for that too somehow?)
the chinese will definitely be bigger. how big depends on how high in the unicode code space the chinese chars are...
you pro'lly mean DNS by "ip" right? since ips are just numbers, they don't have a relation to UTF8.
DNS does something similar to what you might have seen in your emails with base64, except the encoding is called Punycode, for umlaut domains... so a purely chinese domain might take up to 5 times what a generic english domain would take in glyphs.
Part of the brilliance of UTF-8 is that the representation of ASCII characters 0-127 does not change at all. It's strictly one byte per character, and characters 0-127 of UTF-8 are identical to characters 0-127 of ASCII. Nothing changes.
It's only when you get into the international character sets (such as Chinese) that you start to insert multibyte characters into the data stream. When you see a byte in the range of 128-255, it means that byte is part of a multibyte character in UTF-8. This allows all characters to be representable using (mostly) existing software, without requiring wasted space for the representation of ASCII characters.
Hmm, actually it's not quite all of 128-255: bytes 0xC0, 0xC1 and 0xF5-0xFF never appear in valid UTF-8, and 0 is still a null terminator. The optional "byte order mark" (U+FEFF) isn't a single byte either -- it encodes to the three-byte sequence EF BB BF. The BOM is normally only used in UTF-16, but some software naively retains it when downconverting to UTF-8.
When you asked 'does it use more bytes per character for normal English', the answer is 'no'.
When you asked 'does it use the same for, say, Chinese', the answer is 'ASCII doesn't have a way to represent Chinese'. But that's a pedantic answer.
I'm not 100% sure, but I think some languages take more bytes to represent in UTF-8 than in UTF-16 (the Windows solution). But for English, UTF-8 wins over UTF-16.
Correct. For typical English and Western European languages, UTF-8 is a more compact encoding. However, for "CJK" Chinese/Japanese/Korean, UTF-16 is the more compact encoding. All of those characters are from a codepoint range that encodes to three UTF-8 bytes or (usually) two UTF-16 bytes.