[#] Sat Jun 27 2009 04:57:26 EDT from Dirk Stanley @ Uncensored

Can someone teach me about the difference between ASCII and UTF-8?

How does a client recognize the difference between the two?

[#] Sat Jun 27 2009 07:20:42 EDT from fleeb @ Uncensored

ASCII is an old character encoding format that can address 8 bits worth of characters, but no more.  So, that's 256 characters it can encode.

UTF-8, by contrast, is a newer character encoding format that can handle a very large set of characters by using multiple bytes to represent one character (or just using one byte, if that's appropriate).  One of the bits acts as a signal to indicate that the character requires another byte.

I don't recall the details of how UTF-8 characters are actually encoded, so I can't really give you much by way of details, but if you want to, by eye, recognize the difference between UTF-8 and ASCII, take a look at a sample of each using a hex editor or something that won't try to encode the UTF-8.  You'll get a flavor for how it appears untranslated.
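If no hex editor is at hand, a few lines of Python give the same raw view. This is only a sketch: the helper name `hex_bytes` and the sample strings are made up for illustration.

```python
def hex_bytes(text: str, encoding: str) -> str:
    """Return the encoded bytes as space-separated hex pairs."""
    return text.encode(encoding).hex(" ")

plain = "cafe"
accented = "caf\u00e9"  # "café", ending in U+00E9

ascii_plain = hex_bytes(plain, "ascii")
utf8_plain = hex_bytes(plain, "utf-8")
utf8_accented = hex_bytes(accented, "utf-8")

print(ascii_plain)    # 63 61 66 65
print(utf8_plain)     # 63 61 66 65  (identical to ASCII)
print(utf8_accented)  # 63 61 66 c3 a9  (the accented letter became two bytes)
```

Pure ASCII text looks the same either way; only characters outside ASCII show up as runs of bytes with values of 0x80 and above.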

If you are looking for a mechanical means by which to detect UTF-8, I'm not sure that it's possible.  Normally, something external to the text you're translating indicates the character encoding (e.g. MIME headers).
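A common heuristic, sketched here in Python, is simply to try a strict UTF-8 decode. The function name `looks_like_utf8` is hypothetical, and a passing result is not proof: pure ASCII and some other byte strings also decode cleanly, which is why external labels such as MIME headers remain the reliable source.

```python
def looks_like_utf8(data: bytes) -> bool:
    """Heuristic: True if the bytes decode cleanly as strict UTF-8."""
    try:
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True

utf8_sample = "na\u00efve".encode("utf-8")      # "naïve" as UTF-8
latin1_sample = "na\u00efve".encode("latin-1")  # same text as ISO-8859-1
ascii_sample = b"plain text"                    # ASCII is also valid UTF-8

print(looks_like_utf8(utf8_sample))    # True
print(looks_like_utf8(latin1_sample))  # False
print(looks_like_utf8(ascii_sample))   # True
```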

[#] Mon Jun 29 2009 08:24:15 EDT from dothebart @ Uncensored

Not quite accurate, fleeb.


ASCII specifies the use of 7 bits, though 8-bit bytes are the usual case today. So only the first 128 characters (0 through 127) are strictly defined; then the mess starts.

There are charsets on top of ASCII, like ISO-8859-1, the Windows code pages (CP nnn), or UTF-8.

While ISO-8859-x and the Windows code pages just put some tables (incompatible with each other) into the upper 128 values, UTF-8 goes a different way.

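In UTF-8, a byte with the highest bit set is part of a multi-byte sequence: the lead byte says how many continuation bytes follow. The original design allowed sequences of up to 6 bytes; RFC 3629 limits them to 4. That way you can address a wide range of characters from the Unicode code space (like Chinese, Cyrillic...).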
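To make the lead-byte idea concrete, here is a small Python sketch that reads the sequence length off the first byte, following the modern RFC 3629 limits (at most 4 bytes). The function name `utf8_sequence_length` is made up for this example.

```python
def utf8_sequence_length(lead: int) -> int:
    """Total bytes in the UTF-8 sequence that starts with this byte.

    Raises ValueError for continuation bytes (10xxxxxx) and for byte
    values that can never start a sequence under RFC 3629.
    """
    if lead < 0x80:            # 0xxxxxxx: plain ASCII, one byte
        return 1
    if 0xC2 <= lead <= 0xDF:   # 110xxxxx (0xC0 and 0xC1 would be overlong)
        return 2
    if 0xE0 <= lead <= 0xEF:   # 1110xxxx
        return 3
    if 0xF0 <= lead <= 0xF4:   # 11110xxx, capped at U+10FFFF
        return 4
    raise ValueError(f"0x{lead:02x} cannot start a UTF-8 sequence")

for ch in ("A", "\u00e9", "\u4e2d", "\U0001F600"):
    encoded = ch.encode("utf-8")
    print(hex(encoded[0]), utf8_sequence_length(encoded[0]), len(encoded))
```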

So if you've got a text and you can see two (or more) jammed characters in sequence, it's most probably UTF-8 that was interpreted as ISO; if it's the other way around and you have '?' in the text, something tried to interpret an ISO character as the start of a UTF-8 sequence and failed.
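Both failure modes are easy to reproduce. A minimal Python sketch, with a sample word chosen only for illustration; a decoder asked to replace errors emits U+FFFD, which many clients show as '?':

```python
text = "caf\u00e9"  # "café"

# UTF-8 bytes read as ISO-8859-1: the accented letter turns into two characters.
utf8_as_latin1 = text.encode("utf-8").decode("iso-8859-1")

# ISO-8859-1 bytes read as UTF-8: 0xE9 looks like the start of a 3-byte
# sequence that never completes, so the decoder substitutes U+FFFD.
latin1_as_utf8 = text.encode("iso-8859-1").decode("utf-8", errors="replace")

print(utf8_as_latin1)         # cafÃ©
print(repr(latin1_as_utf8))   # 'caf\ufffd'
```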


[#] Mon Jun 29 2009 08:28:30 EDT from fleeb @ Uncensored

No, I wasn't accurate... I was trying to convey the sense of things, rather than the details, so a non-technical person would see the major points of differences. But it's good to point out the details, too.

[#] Mon Jun 29 2009 08:28:50 EDT from arabella @ Uncensored

no one loves a smart arse

[#] Mon Jun 29 2009 10:58:03 EDT from IGnatius T Foobar @ Uncensored

UTF-8 is, in my opinion, REALLY F***ING CLEVER.

Unicode came along, with its tens of thousands of characters, and it would be a nice thing to have -- the character set to end all character sets. Literally.

Windows defined all-new APIs to deal with strings using wide (16-bit) characters.
New data types, new functions, new everything. Very ugly and very messy.

Ken Thompson and Rob Pike took a different approach for Unix. They came up with UTF-8, which can map the entire Unicode set onto strings of 8-bit bytes. Strings are still null-terminated, so all of the existing APIs like strcpy(), strcat(), and strlen() still work. As an added bonus, the characters which made up low ASCII still consume only one byte each, so you don't end up doubling all of your string buffers "just in case" someone wants to use international characters. And if you've got mixed US and international text, you automatically get something which resembles compression compared with a fixed-width encoding.
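The reason null-terminated strings keep working can be checked directly: UTF-8 never produces a zero byte unless the text itself contains U+0000, while a 16-bit encoding pads every ASCII letter with one. A short Python sketch with an arbitrary sample string:

```python
text = "Hello, \u4e16\u754c"  # "Hello, 世界"

utf8 = text.encode("utf-8")
utf16 = text.encode("utf-16-le")

utf8_has_nul = 0 in utf8      # False: no embedded zero bytes
utf16_has_nul = 0 in utf16    # True: 'H' is stored as 48 00
ascii_prefix_unchanged = utf8[:7] == b"Hello, "

print(len(utf8), len(utf16))                 # 13 18
print(utf8_has_nul, utf16_has_nul)           # False True
print(ascii_prefix_unchanged)                # True
```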

Adding UTF-8 support to a program that previously used only ASCII is fairly easy. You just have to make sure all of your string handling is 8-bit clean.
And there are a couple of situations where you might want to use a different string length function. strlen() still tells you the number of bytes a string consumes, which is usually what the programmer wants. There are also functions like mbstowcs(), which converts a multibyte string to wide characters and returns how many characters it contains (in UTF-8 of course not necessarily the same number as the byte count), and wcwidth() and wcswidth(), which tell you the number of columns a character or a string will occupy (since Unicode does contain some characters which require more than one column to display if you're using a fixed-width font).

(Yes, I know Windows supports UTF-8 now too, but the point was that as always, they went with the ugly way first, because after all these years they still can't design good software on their own. They can only crank out something marginally acceptable when copying someone else.)

Here's a really good writeup, if you want to read more:

[#] Mon Jun 29 2009 15:27:04 EDT from girthta @ Uncensored

Subject: Visual Basic Help

Does anyone have a resource for learning VERY VERY BASIC Visual Basic stuff?

I just want to know how to script a couple of really simple commands. I've tried learning VB before but usually my eyes just glaze over.

Most immediately I need to write a string that will automagically direct someone to where I want the document to be saved. I have no real interest in learning to program but I keep finding myself needing a little VB knowledge here and there.

[#] Mon Jun 29 2009 15:46:53 EDT from girthta @ Uncensored

My bad, it's actually MS Script Editor I need to work with. Jeez.

[#] Mon Jun 29 2009 16:33:27 EDT from fleeb @ Uncensored

Isn't MS Script Editor essentially VBScript?

[#] Mon Jun 29 2009 16:45:48 EDT from Ford II @ Uncensored

Wikipedia has a really good article on UTF-8.
I didn't read the UTF-16 article, but basically UTF-8 is really slick, because it can do anything. But it stops being efficient when a string is mostly non-ASCII characters, which take 2 to 4 bytes each.
Although I find it hard to understand why people complain about inefficient utf-8 (as in you may burn 4 bytes per character for the whole document in some cases) but they don't complain about the rest of the fucked up shit in the computer industry like.. oh... windows vista for example.
Anyway, so really you only need UTF-8, but if you're going to do lots of 3-byte characters (most Chinese or Japanese text, say), you're better off with UTF-16, which stores those in 2 bytes each. Characters that take 4 bytes in UTF-8 take 4 bytes in UTF-16 as well.
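The size trade-off is easy to measure. A quick Python sketch with arbitrary sample strings; `encoded_sizes` is just an illustrative helper, and "utf-16-le" is used so that no byte order mark is counted:

```python
def encoded_sizes(text: str) -> tuple:
    """Return (UTF-8 byte count, UTF-16 byte count) for the text."""
    return len(text.encode("utf-8")), len(text.encode("utf-16-le"))

samples = {
    "english": "hello",           # ASCII only
    "chinese": "\u4f60\u597d",    # "你好", two BMP characters
    "emoji": "\U0001F600",        # one character outside the BMP
}

sizes = {name: encoded_sizes(text) for name, text in samples.items()}
for name, (utf8_len, utf16_len) in sizes.items():
    print(name, utf8_len, utf16_len)
# english 5 10
# chinese 6 4
# emoji 4 4
```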

[#] Tue Jun 30 2009 03:00:06 EDT from dothebart @ Uncensored

But if gzip-style compression runs over that text, like ODF does with its zip container, it will compress well and give you back the extra space.

[#] Tue Jun 30 2009 09:48:19 EDT from fleeb @ Uncensored

I should think UTF16 compresses at least as well as UTF8, but I haven't actually done any benchmarks to make sure.
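Anyone who wants to run that benchmark could start from something like the sketch below. It uses Python's zlib (the same deflate algorithm gzip and zip use); the function name and the sample text are placeholders, and the outcome depends heavily on the input, so no result is claimed here.

```python
import zlib

def compressed_sizes(text: str) -> dict:
    """Map each encoding to (raw byte count, deflate-compressed byte count)."""
    result = {}
    for encoding in ("utf-8", "utf-16-le"):
        raw = text.encode(encoding)
        packed = zlib.compress(raw, 9)
        # Sanity check: the compressed data must round-trip exactly.
        if zlib.decompress(packed) != raw:
            raise RuntimeError("round trip failed")
        result[encoding] = (len(raw), len(packed))
    return result

# Placeholder input; substitute real documents for a meaningful comparison.
sample = "The quick brown fox jumps over the lazy dog. " * 200
sizes = compressed_sizes(sample)
for encoding, (raw_len, packed_len) in sizes.items():
    print(encoding, raw_len, packed_len)
```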

UTF8 is pretty smart, though. It's truly a shame Microsoft went with UTF16.
It's awkward.

[#] Tue Jun 30 2009 15:51:09 EDT from Peter Pulse @ Uncensored

So, not to derail the character encoding discussions.. but.. I want to replace my laptop hard disk with a bigger one.. but I don't know what brands/models are the reliable ones these days. Any suggestions? It's an old thinkpad.. so a regular 2.5 inch IDE drive, ideally want to put at least 160G in there but looking for more like 250+

[#] Tue Jun 30 2009 15:57:32 EDT from LoanShark @ Uncensored

If it's old old, remember there is a 128 GiB (about 137 GB) barrier for standard IDE with 28-bit LBA, for old BIOSes and operating systems.
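The barrier comes from the width of the sector address. Assuming the usual 512-byte sectors, the arithmetic for 28-bit and 48-bit LBA works out as follows (a sketch; `max_capacity_bytes` is just an illustrative helper):

```python
SECTOR_BYTES = 512  # assumption: traditional 512-byte sectors

def max_capacity_bytes(lba_bits: int) -> int:
    """Largest addressable capacity for a given LBA address width."""
    return (2 ** lba_bits) * SECTOR_BYTES

lba28 = max_capacity_bytes(28)
lba48 = max_capacity_bytes(48)

print(lba28)             # 137438953472 bytes, about 137 GB
print(lba28 // 2 ** 30)  # 128 (GiB)
print(lba48 // 2 ** 50)  # 128 (PiB)
```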

[#] Tue Jun 30 2009 16:34:42 EDT from Peter Pulse @ Uncensored

Hmm.. good point, it's a T30. I'll have to research it a bit. I should just move to a different laptop (actually I have a T42 I'm trying to get going) but I started using this one when my other machine died and despite its slowness it is a very enjoyable machine.. solid, great keyboard etc.. all the things that made ThinkPads great before they started going downhill. So I want to extend its life.

[#] Tue Jun 30 2009 16:36:35 EDT from dothebart @ Uncensored

a good collection.

*prints and puts under the cushion*


[#] Tue Jun 30 2009 16:57:35 EDT from Peter Pulse @ Uncensored

So, it seems that this machine does have LBA48 support in the most recent BIOS, I will just have to check to see if I need to update my BIOS.

[#] Wed Jul 01 2009 06:28:19 EDT from girthta @ Uncensored

Well I didn't want to go to our Tech department because it's outsourced and every request costs money. I thought I could figure it out but the IT Director (who is not an IT person really) overheard me talking to my boss about what I wanted to do and INSISTED that I submit a Sysaid and let them handle it. She said my time was better spent elsewhere and I'm about the only person in the agency who would come up with a juicy request like that.

I may still try to figure out scripting stuff on my own... just cause I keep needing to use it.

[#] Wed Jul 01 2009 11:54:53 EDT from Peter Pulse @ Uncensored

LOL Girtha he probably doesn't want you showing up his "department" ;)

So, does anyone have any suggestions for me on laptop hard disk brands??

[#] Thu Jul 02 2009 02:48:03 EDT from Dirk Stanley @ Uncensored

So here's another "Oh-god-Dirk-has-no-clue"-type question :

So if UTF-8 essentially burns several bytes per character -

1. Does it use more bytes per character for normal English?
2. Does it use the same for, say, Chinese?

... In other words, would a document with 1000 English characters be smaller than a document with 1000 Chinese characters?

(And do IP packets still get addressed with standard ASCII, or can they use UTF-8 for that too somehow?)
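The 1000-character comparison can be tried directly. A small Python sketch, using one repeated letter and one repeated Chinese character as stand-ins for real text:

```python
english = "a" * 1000
chinese = "\u4e2d" * 1000  # "中" repeated 1000 times

english_bytes = len(english.encode("utf-8"))
chinese_bytes = len(chinese.encode("utf-8"))

# English text in UTF-8 is byte-for-byte the same as in ASCII.
same_as_ascii = english.encode("utf-8") == english.encode("ascii")

print(english_bytes)   # 1000
print(chinese_bytes)   # 3000
print(same_as_ascii)   # True
```

So plain English costs nothing extra in UTF-8, while common Chinese characters take three bytes each.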
