A brief introduction to Unicode (chart)

Add date: 10/07/2008 Publishing date: 10/07/2008

Elementary knowledge
Byte and character difference
Big Endian and Little Endian
UCS-2 and UCS-4
UTF-16 and UTF-32
UTF-16
UTF-32
UTF-8

Elementary knowledge
Before introducing Unicode, we must first cover some elementary knowledge. Although it is not directly related to Unicode, it is hard to make sense of Unicode without it.
Byte and character difference
So what is the difference between a byte and a character? Aren't they the same thing? That is right, but only in the old DOS era. Since Unicode appeared, a byte and a character are no longer the same.
A byte (octet) is an 8-bit storage unit, so its value range is always 0~255. A character, on the other hand, is a symbol that carries meaning in a language, and its range is not fixed. For example, the characters defined in UCS-2 range over 0~65535, and each character takes two bytes.
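The difference between counting characters and counting bytes can be seen in a short Python sketch. The sample string and variable names are made up for illustration, and Python's "utf-16-be" codec is used here as a stand-in for UCS-2, since it produces the same two bytes per character for values in 0~65535.

```python
# Illustrative sketch: the sample text and names are assumptions.
text = "A中"  # two characters: U+0041 and U+4E2D

# For characters in the range 0~65535, "utf-16-be" yields the same
# two bytes per character that UCS-2 would use.
encoded = text.encode("utf-16-be")

char_count = len(text)     # number of characters
byte_count = len(encoded)  # number of bytes

print(char_count, byte_count)  # 2 4
print(list(encoded))           # each byte value lies in 0~255
```

Two characters become four bytes, so the length of the text and the length of its stored form are different numbers.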
Big Endian and Little Endian
As mentioned above, a character may take several bytes. How, then, are these bytes stored in the computer? For instance, is the character 0xabcd stored as AB CD or as CD AB?
In fact both are possible, and each has its own name. If the value is stored as AB CD, the order is called Big Endian; if it is stored as CD AB, the order is called Little Endian.
Specifically, the following storage layout is Big Endian, because the high-order byte (0xab) of the value (0xabcd) is stored first:
Address       Value
0x00000000    AB
0x00000001    CD
Conversely, the following storage layout is Little Endian, because the low-order byte (0xcd) is stored first:
Address       Value
0x00000000    CD
0x00000001    AB
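The two layouts above can be reproduced with Python's built-in int.to_bytes and int.from_bytes. This is only a sketch; the helper name hex_dump is hypothetical and exists just to print the bytes in the AB CD style used in this article.

```python
value = 0xABCD

# Serialize the same 16-bit value in both byte orders.
big = value.to_bytes(2, byteorder="big")        # high-order byte first
little = value.to_bytes(2, byteorder="little")  # low-order byte first


def hex_dump(data: bytes) -> str:
    """Hypothetical helper: format bytes as upper-case hex pairs."""
    return " ".join(f"{b:02X}" for b in data)


print(hex_dump(big))     # AB CD
print(hex_dump(little))  # CD AB

# Reading the bytes back with the matching order recovers the value;
# reading them with the wrong order swaps the two bytes.
restored = int.from_bytes(little, byteorder="little")
swapped = int.from_bytes(little, byteorder="big")
```

Reading with the wrong byte order does not fail, it silently yields a different value (here 0xCDAB), which is why the byte order has to be known in advance.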
UCS-2 and UCS-4
Unicode was created to unify all the languages and scripts of the world. Every character corresponds to a value in Unicode, and this value is called the code point. A code point is usually written in the form U+ABCD. The mapping between characters and code points is defined by UCS-2 (Universal Character Set coded in 2 octets). As the name suggests, UCS-2 represents a code point with two bytes, so its range is U+0000~U+FFFF.
To represent more characters, UCS-4 was also proposed, which uses four bytes to represent a code point. Its range is U+00000000~U+7FFFFFFF, where U+00000000~U+0000FFFF is the same as UCS-2.
Note that UCS-2 and UCS-4 only specify the mapping between code points and characters; they do not specify how a code point is stored in the computer. The schemes that specify the storage format are called UTF (Unicode Transformation Format), and the most widely used are UTF-16 and UTF-8.
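To make the distinction between a code point and its stored form concrete, here is a small Python sketch. The sample character is chosen arbitrarily, and the codec names "utf-16-be" and "utf-8" are the ones Python's standard library provides.

```python
ch = "中"  # an arbitrary sample character

# The code point is a number that is independent of any storage format.
code_point = ord(ch)
label = f"U+{code_point:04X}"
print(label)  # U+4E2D

# The same code point stored with two different UTFs.
utf16_be = ch.encode("utf-16-be")
utf8 = ch.encode("utf-8")

print(" ".join(f"{b:02X}" for b in utf16_be))  # 4E 2D
print(" ".join(f"{b:02X}" for b in utf8))      # E4 B8 AD
```

The code point stays U+4E2D in both cases; only the bytes written to memory or disk change with the chosen UTF.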
UTF-16 and UTF-32
UTF-16
UTF-16 is specified by RFC 2781. It uses two bytes to represent a code point in the range U+0000~U+FFFF; code points above U+FFFF are represented by a pair of two-byte units (a surrogate pair).
It is not hard to guess that, within the range U+0000~U+FFFF, UTF-16 corresponds exactly to UCS-2: the code point defined by UCS-2 is stored directly in Big Endian or Little Endian order. UTF-16 comes in three variants: UTF-16, UTF-16BE (Big Endian), and UTF-16LE (Little Endian).
UTF-16BE and UTF-16LE are easy to understand. Plain UTF-16, however, needs a character called the BOM (Byte Order Mark) at the beginning of the file to indicate whether the file is Big Endian or Little Endian. The BOM is the character U+FEFF.
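How a reader might use the BOM can be sketched in Python as follows. The function name utf16_byte_order and the sample data are hypothetical; codecs.BOM_UTF16_BE and codecs.BOM_UTF16_LE are constants from Python's standard library. The sketch assumes the input is already known to be UTF-16, since other encodings can start with similar bytes.

```python
import codecs


def utf16_byte_order(data: bytes) -> str:
    """Return 'big', 'little', or 'unknown' from a leading BOM.

    Assumes the data is UTF-16; this is not a general encoding detector.
    """
    if data.startswith(codecs.BOM_UTF16_BE):  # bytes FE FF
        return "big"
    if data.startswith(codecs.BOM_UTF16_LE):  # bytes FF FE
        return "little"
    return "unknown"


bom = "\ufeff"  # the BOM character U+FEFF

# The same text written in each byte order, with the BOM in front.
be_file = (bom + "A").encode("utf-16-be")  # FE FF 00 41
le_file = (bom + "A").encode("utf-16-le")  # FF FE 41 00

print(utf16_byte_order(be_file))  # big
print(utf16_byte_order(le_file))  # little
```

Because U+FEFF is stored as FE FF in Big Endian and as FF FE in Little Endian, the first two bytes of the file are enough to tell the two orders apart.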
