How does a client recognize the difference between the two?
ASCII is an old character encoding format that can address 8 bits worth of characters, but no more. So, that's 256 characters it can encode.
UTF-8, by contrast, is a newer character encoding format that can handle a very large set of characters by using multiple bytes to represent one character (or just using one byte, if that's appropriate). One of the bits acts as a signal to indicate that the character requires another byte.
I don't recall the details of how UTF-8 characters are actually encoded, so I can't really give you much by way of details, but if you want to, by eye, recognize the difference between UTF-8 and ASCII, take a look at a sample of each using a hex editor or something that won't try to encode the UTF-8. You'll get a flavor for how it appears untranslated.
If you are looking for a mechanical means by which to detect UTF-8, I'm not sure that it's possible. Normally, something external to the text you're translating indicates the character encoding (e.g. MIME headers).
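That said, a rough mechanical check is possible as a heuristic. The sketch below (function names are made up for illustration) simply asks whether the bytes decode as strict UTF-8. It cannot prove the text was meant as UTF-8, and pure ASCII passes too, so an external label such as a MIME header is still the authoritative source.

```python
def looks_like_utf8(data: bytes) -> bool:
    """Return True if data decodes cleanly as UTF-8 (pure ASCII also passes)."""
    try:
        data.decode("utf-8")  # strict mode raises on any malformed sequence
    except UnicodeDecodeError:
        return False
    return True


def is_pure_ascii(data: bytes) -> bool:
    """Return True if every byte is in the 7-bit range 0x00-0x7F."""
    return all(b < 0x80 for b in data)


print(looks_like_utf8("naïve".encode("utf-8")))    # bytes that really are UTF-8
print(looks_like_utf8("naïve".encode("latin-1")))  # ISO-8859-1 bytes, malformed as UTF-8
print(is_pure_ascii(b"plain text"))
```

Random ISO-8859-1 text with accented letters rarely happens to form valid UTF-8 sequences, which is why this guess works reasonably well in practice.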
Not quite accurate, fleeb.
ASCII specifies the use of 7 bits, though 8-bit bytes are the usual case today. So only the first 128 characters (0 to 127) are strictly defined; after that the mess starts.
There are charsets built on top of ASCII, like ISO-8859-1, the Windows code pages (CP nnn), or UTF-8.
While ISO and the Windows code pages just put some tables (incompatible with each other) into the upper 128 values, UTF-8 goes a different way.
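To make the incompatibility concrete, here is a small sketch using codecs that ship with Python (the variable names are just for illustration). The same byte value above 127 comes out as a different character depending on which table you pick.

```python
# One byte from the upper half, read through two different tables.
raw = bytes([0xE9])
as_latin1 = raw.decode("latin-1")  # ISO-8859-1
as_cp437 = raw.decode("cp437")     # the old DOS code page

# 0x80 is a control character in ISO-8859-1 but the euro sign in Windows-1252.
as_cp1252_80 = bytes([0x80]).decode("cp1252")
as_latin1_80 = bytes([0x80]).decode("latin-1")

print(repr(as_latin1), repr(as_cp437))
print(repr(as_cp1252_80), repr(as_latin1_80))
```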
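If the highest bit of a byte is set, that byte belongs to a multi-byte sequence: the lead byte says how many continuation bytes follow (up to 3 in today's UTF-8; the original design allowed up to 5), and the bytes are drawn together into one character. That way you can address a wide range of chars from the Unicode code space (like Chinese, Arabic...)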
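A minimal sketch of how the lead byte announces the sequence length, assuming the modern form of UTF-8 (RFC 3629), which caps sequences at four bytes. The function name is hypothetical, and the check is deliberately loose: it only looks at the bit pattern and does not reject every invalid lead byte.

```python
def utf8_sequence_length(lead: int) -> int:
    """Return how many bytes the sequence starting with this lead byte uses."""
    if lead & 0x80 == 0x00:   # 0xxxxxxx: plain ASCII, one byte
        return 1
    if lead & 0xE0 == 0xC0:   # 110xxxxx: one continuation byte follows
        return 2
    if lead & 0xF0 == 0xE0:   # 1110xxxx: two continuation bytes follow
        return 3
    if lead & 0xF8 == 0xF0:   # 11110xxx: three continuation bytes follow
        return 4
    # 10xxxxxx is a continuation byte; anything else is not a valid lead byte
    raise ValueError(f"not a lead byte: {lead:#04x}")


for ch in ("A", "é", "中", "😀"):
    encoded = ch.encode("utf-8")
    print(ch, encoded.hex(), utf8_sequence_length(encoded[0]))
```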
So if you've got a text and you see two (or more) garbled chars in a row where one should be, it's most probably UTF-8 that was interpreted as ISO; if it's the other way around and you have '?' in the text, something tried to interpret an ISO character as the start of a UTF-8 sequence and failed.
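Both failure modes are easy to reproduce. A small sketch (the sample word and variable names are arbitrary):

```python
original = "café"

# UTF-8 bytes read as ISO-8859-1: the two bytes of "é" show up as two chars.
utf8_bytes = original.encode("utf-8")
misread_as_latin1 = utf8_bytes.decode("latin-1")

# ISO-8859-1 bytes read as UTF-8: 0xE9 looks like the start of a sequence
# that never completes, so the decoder substitutes a replacement character
# (many programs display it as '?').
latin1_bytes = original.encode("latin-1")
misread_as_utf8 = latin1_bytes.decode("utf-8", errors="replace")

print(misread_as_latin1)
print(misread_as_utf8)
```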
No, I wasn't accurate... I was trying to convey the sense of things, rather than the details, so a non-technical person would see the major points of differences. But it's good to point out the details, too.
no one loves a smart arse
Unicode came along, with its tens of thousands of characters, and it would be a nice thing to have -- the character set to end all character sets. Literally.
Windows defined all-new APIs to deal with strings using wide (16-bit) characters.
New data types, new functions, new everything. Very ugly and very messy.
Ken Thompson and Rob Pike took a different approach for Unix. They came up with UTF-8, which has the ability to map the entire Unicode set onto strings of 8-bit characters. Strings are still null-terminated, so all of the existing APIs like strcpy(), strcat() and strlen() still work. As an added bonus, the characters which made up low ASCII still consume only one byte each, so you don't end up doubling all of your string buffers "just in case" someone wants to use international characters. And if you've got mixed US and international text, you automatically get something which resembles compression.
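The two properties that keep the old C string functions working can be checked in a few lines. This is only an illustration in Python, with made-up names: ASCII text encodes to the same bytes as before, and every byte of a multi-byte character has its high bit set, so a zero byte never appears inside one.

```python
def multibyte_bytes(text: str) -> list:
    """Collect the bytes that belong to non-ASCII characters of text."""
    collected = []
    for ch in text:
        encoded = ch.encode("utf-8")
        if len(encoded) > 1:
            collected.extend(encoded)
    return collected


ascii_text = "plain text"
mixed_text = "naïve 中文 text"

# Low ASCII is byte-for-byte identical in both encodings.
same_as_ascii = ascii_text.encode("utf-8") == ascii_text.encode("ascii")

# No byte of a multi-byte sequence is below 0x80, so none of them is NUL.
high_bit_always_set = all(b >= 0x80 for b in multibyte_bytes(mixed_text))
contains_nul = 0 in mixed_text.encode("utf-8")

print(same_as_ascii, high_bit_always_set, contains_nul)
```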
Adding UTF-8 support to a program that previously used only ASCII is fairly easy. You just have to make sure all of your string handling is 8-bit clean.
And there are a couple of situations where you might want to use a different string length function. strlen() still tells you the number of bytes a string consumes, which is usually what the programmer wants. There are also a few other functions, like mbstowcs(), which can tell you the number of characters in a string (in UTF-8 that is of course not necessarily the same as the number of bytes), and wcwidth() and wcswidth(), which tell you the number of columns a character or a string will occupy (since Unicode does contain some characters which require more than one column to display if you're using a fixed-width font).
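The three different "lengths" can be seen side by side in a short sketch. It is written in Python rather than C, and display_columns() is a hypothetical helper that only approximates what a wcswidth()-style function does (wide East Asian characters count as two columns, combining marks as zero); real terminals and C libraries may disagree on edge cases.

```python
import unicodedata


def display_columns(text: str) -> int:
    """Rough column count for fixed-width display (an approximation)."""
    columns = 0
    for ch in text:
        if unicodedata.combining(ch):
            continue  # combining marks sit on top of the previous character
        wide = unicodedata.east_asian_width(ch) in ("W", "F")
        columns += 2 if wide else 1
    return columns


sample = "日本語 ok"
byte_count = len(sample.encode("utf-8"))  # what strlen() would report
char_count = len(sample)                  # number of code points
column_count = display_columns(sample)    # width in a fixed-width font

print(byte_count, char_count, column_count)
```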
(Yes, I know Windows supports UTF-8 now too, but the point was that as always, they went with the ugly way first, because after all these years they still can't design good software on their own. They can only crank out something marginally acceptable when copying someone else.)
Here's a really good writeup, if you want to read more:
Subject: Visual Basic Help
Does anyone have a resource for learning VERY VERY BASIC Visual Basic stuff?
I just want to know how to script a couple of really simple commands. I've tried learning VB before but usually my eyes just glaze over.
Most immediately I need to write a string that will automagically direct someone to where I want the document to be saved. I have no real interest in learning to program but I keep finding myself needing a little VB knowledge here and there.
My bad, it's actually MS Script Editor I need to work with. Jeez.
Isn't MS Script Editor essentially VBScript?
I didn't read the UTF-16 article, but basically UTF-8 is really slick, because it can do anything. But it stops being efficient if you represent one string with lots of different language characters in it.
Although I find it hard to understand why people complain about inefficient utf-8 (as in you may burn 4 bytes per character for the whole document in some cases) but they don't complain about the rest of the fucked up shit in the computer industry like.. oh... windows vista for example.
Anyway, so really you only need UTF-8, but if you're going to do lots of 3-byte characters (Chinese or Japanese text, for example), you're better off with UTF-16, which stores those same characters in 2 bytes each.
But if the text gets compressed afterwards (ODF, for example, zips its files), it will compress well and give you most of that extra space back.
I should think UTF16 compresses at least as well as UTF8, but I haven't actually done any benchmarks to make sure.
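If anyone wants to run that benchmark, here is a minimal sketch using zlib (the same deflate algorithm zip and gzip use). The sample strings and the function name are arbitrary, and the repeated text is a toy input that compresses unrealistically well, so treat any numbers it prints as a starting point rather than a result; real documents would be needed for a fair comparison.

```python
import zlib


def encoded_sizes(text: str) -> dict:
    """Map encoding name to (raw size, deflate-compressed size) in bytes."""
    result = {}
    for encoding in ("utf-8", "utf-16-le"):
        raw = text.encode(encoding)
        result[encoding] = (len(raw), len(zlib.compress(raw, 9)))
    return result


english = "The quick brown fox jumps over the lazy dog. " * 200
chinese = "敏捷的棕色狐狸跳过了懒狗。" * 200

for label, text in (("english", english), ("chinese", chinese)):
    for encoding, (raw, packed) in encoded_sizes(text).items():
        print(label, encoding, "raw:", raw, "compressed:", packed)
```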
UTF8 is pretty smart, though. It's truly a shame Microsoft went with UTF16.
If it's old old, remember there is a 137 GB (128 GiB) barrier for standard IDE with old BIOSes and operating systems.
a good collection.
*prints and puts under the cushion*
Well I didn't want to go to our Tech department because it's outsourced and every request costs money. I thought I could figure it out but the IT Director (who is not an IT person really) overheard me talking to my boss about what I wanted to do and INSISTED that I submit a Sysaid and let them handle it. She said my time was better spent and I'm about the only person in the agency who would come up with a juicy request like that.
I may still try to figure out scripting stuff on my own... just cause I keep needing to use it.
So, does anyone have any suggestions for me on laptop hard disk brands??
So if UTF-8 essentially burns several bytes per character -
1. Does it use more bytes per character for normal English?
2. Does it use the same for, say, Chinese?
... In other words, would a document with 1000 English characters be smaller than a document with 1000 Chinese characters?
(And do IP packets still get addressed with standard ASCII, or can they use UTF-8 for that too somehow?)
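On the byte-count questions, the numbers are easy to check. A quick sketch, using a single repeated character to stand in for each 1000-character document (the variable names are just for illustration): plain English letters take one byte each in UTF-8, while common Chinese characters take three.

```python
english_doc = "a" * 1000
chinese_doc = "中" * 1000

english_utf8 = len(english_doc.encode("utf-8"))     # 1 byte per character
chinese_utf8 = len(chinese_doc.encode("utf-8"))     # 3 bytes per character
chinese_utf16 = len(chinese_doc.encode("utf-16-le"))  # 2 bytes per character

print(english_utf8, chinese_utf8, chinese_utf16)
```

So yes, the English document would be smaller: UTF-8 costs nothing extra for normal English, and the Chinese document comes out at about three times the size.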