2025-02-21
Do you know how many ways there are to write the character "茴" in "茴香豆"? After reading this blog, you'll know. This article is written for all readers, regardless of technical background.
The three sections of this article are arranged from general principles to specific applications.
How does a computer store text? This is a fundamental question. Take the simplest case, ASCII, as an example: the character 'A' is mapped to code point 65, which is then stored as an 8-bit unsigned integer, giving the byte 0x41 (binary representation: 01000001). From this we can extract a general process for how computers store text:
Text (Character) -> Natural Number (Codepoint) -> Byte Stream
ASCII has a total of 128 code points, which is far from enough to represent all the characters in the world. To store and represent more characters, the Unicode standard is generally used.
The Unicode standard defines a mapping of many characters in the world to natural numbers in the range [0, 17*2^16). This code point range should be written as U+0000 ~ U+10FFFF according to the Unicode notation standard (the standard specifies that U+ is followed by a hexadecimal number, with leading zeros added to make it four digits if necessary).
Each group of 2^16 consecutive code points is called a plane, and there are 17 planes in total. Plane 0 is the Basic Multilingual Plane, which contains commonly used characters.
Experiment: Convert characters and Unicode code points using Python. (The following code uses Python by default, with ">>>" indicating execution in the interactive interpreter REPL)
>>> ord('字')
23383
>>> chr(23383)
'字'
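The plane of a code point is simply its index divided by 2^16; a quick check (the second example assumes your terminal can display the emoji):

# Which plane does a character live in? Integer-divide its code point by 2**16.
print(ord('字') // 0x10000)   # 0 -> Basic Multilingual Plane
print(ord('😂') // 0x10000)   # 1 -> Supplementary Multilingual Plane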
There are many ways to convert code points to byte streams. For example, we could use 3 bytes to represent each Unicode code point. However, this is not necessarily efficient because some code points (characters) appear more frequently than others. If we can represent them with fewer bytes without causing ambiguity, it is obviously beneficial for storing and transmitting text information in computers.
This is the source coding problem in information theory, where the expected code length is related to the distribution of input characters. Clearly, we want frequently occurring characters to have shorter encodings. This is the inherent reason why different countries and regions use different encoding standards. For example, Chinese characters are not common in English-speaking countries, so they require 3 bytes in the UTF-8 encoding standard, while they only need 2 bytes in China's GB 2312 encoding standard.
However, for computer interoperability, it is necessary to agree on a globally interoperable standard, and UTF-8 is one of them.
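To make the efficiency point concrete, here is a toy comparison between a made-up fixed-width 3-byte scheme (purely illustrative, not a real encoding standard) and UTF-8:

# A toy fixed-width scheme: every code point stored as 3 bytes (big-endian).
def encode_fixed3(text: str) -> bytes:
    return b''.join(ord(ch).to_bytes(3, 'big') for ch in text)

def decode_fixed3(data: bytes) -> str:
    return ''.join(chr(int.from_bytes(data[i:i+3], 'big'))
                   for i in range(0, len(data), 3))

s = 'Hello, 世界'
print(len(encode_fixed3(s)))                 # 27 bytes: 9 characters x 3 bytes each
print(len(s.encode('utf-8')))                # 13 bytes: ASCII takes 1 byte, CJK takes 3
print(decode_fixed3(encode_fixed3(s)) == s)  # True: both schemes are lossless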
UTF-8 can refer both to the encoding standard (a character encoding standard) and to its implementation, the encoder/decoder (codec). It is part of the Unicode standard. Specifically, it represents each Unicode code point with 1 to 4 bytes, using the following conversion rules:
| First code point | Last code point | Byte 1 | Byte 2 | Byte 3 | Byte 4 |
|---|---|---|---|---|---|
| U+0000 | U+007F | 0xxxxxxx | | | |
| U+0080 | U+07FF | 110xxxxx | 10xxxxxx | | |
| U+0800 | U+FFFF | 1110xxxx | 10xxxxxx | 10xxxxxx | |
| U+10000 | U+10FFFF | 11110xxx | 10xxxxxx | 10xxxxxx | 10xxxxxx |
UTF-8 Coding Standard
The number of bytes used for each Unicode code point depends on the code point range:
Code points represented by one byte (128 in total) correspond exactly to the 128 code points of the ASCII standard, making UTF-8 backward compatible with ASCII.
Code points represented by two bytes (U+0080 to U+07FF) cover mainly alphabetic scripts: accented Latin letters, Greek, Cyrillic, Hebrew, Arabic, and others.
Code points representable in at most three bytes (U+0000 to U+FFFF) cover the entire Basic Multilingual Plane; CJK (Chinese, Japanese, Korean) characters fall in this range and require three bytes each.
From this table, we can also derive the following "eyeball decoding UTF-8" techniques:
Bytes whose first hex digit is 8, 9, A, or B are "continuation bytes"; all other bytes are "starting bytes."
Starting bytes whose first hex digit is 0 through 7 are ASCII codes; C and D begin a two-byte sequence, E a three-byte sequence, and F a four-byte sequence.
For example, the UTF-8 encoding of a CJK character has the form \xe? \x{8,9,a,b}? \x{8,9,a,b}?, where each ? is any hex digit from 0 to f (the sketch below checks these rules).
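A small sketch that classifies each byte of a UTF-8 stream by its leading bits, following the rules above:

# Classify each byte of a UTF-8 stream, matching the "eyeball decoding" rules.
def classify(b: int) -> str:
    if b < 0x80:
        return 'ASCII (single-byte sequence)'
    if b < 0xC0:
        return 'continuation byte'
    if b < 0xE0:
        return 'start of a 2-byte sequence'
    if b < 0xF0:
        return 'start of a 3-byte sequence'
    return 'start of a 4-byte sequence'

for b in '哥G😂'.encode('utf-8'):
    print(f'{b:02x}: {classify(b)}')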
From this table, we can also verify that UTF-8 encoding is a prefix code. This good property ensures that when we concatenate the encodings of many characters to form a byte stream, we can decode all characters by scanning linearly from the beginning. For example:
>>> "哥德尔 Gödel".encode('utf-8').hex()
'e593a5e5beb7e5b0942047c3b664656c'
# e5 93 a5 哥
# e5 be b7 德
# e5 b0 94 尔
# 20 (space)
# 47 G
# c3 b6 ö
# 64 d
# 65 e
# 6c l
We manually calculate the encoding of the character 'ö':
Code point for ö is U+00F6, falls in the second category
=> use the 110xxxxx 10xxxxxx format, need 11 x's
=> 00F6 is 00011 110110
=> the two bytes are 11000011 10110110 = C3 B6
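The same calculation can be checked in Python; a small sketch:

# Verify the manual calculation for 'ö' (code point U+00F6).
cp = ord('ö')                           # 0xF6 = 246
bits = format(cp, '011b')               # 11 payload bits: '00011110110'
byte1 = 0b11000000 | int(bits[:5], 2)   # 110xxxxx <- first 5 bits
byte2 = 0b10000000 | int(bits[5:], 2)   # 10xxxxxx <- last 6 bits
print(hex(byte1), hex(byte2))           # 0xc3 0xb6
print('ö'.encode('utf-8').hex())        # 'c3b6', matches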
When faced with a byte stream from a file or network stream, which decoder should be used? Obviously, it should be a decoder that adheres to the same standard as the encoder. If the encoding and decoding standards are inconsistent, garbled text or errors will occur:
>>> '你好'.encode('utf-8')
b'\xe4\xbd\xa0\xe5\xa5\xbd'
>>> b'\xe4\xbd\xa0\xe5\xa5\xbd'.decode('gbk')
'浣犲ソ'
>>> '你好'.encode('gbk')
b'\xc4\xe3\xba\xc3'
>>> b'\xc4\xe3\xba\xc3'.decode('utf-8')
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc4 in position 0: invalid continuation byte
We may need to determine the encoding standard through external information, such as descriptions elsewhere in the file.
We can also guess the encoder from the byte stream itself: as discussed above, the byte stream encoded by a UTF-8 encoder has certain characteristics, and byte streams output by different encoders must have distinguishing features. Many text editors have this functionality built-in, allowing them to guess an encoding standard for decoding when you open a file.
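For example, the third-party chardet library (assumed to be installed via pip install chardet; one of several such tools) makes exactly this kind of guess:

# Guess the encoding of a byte stream with the third-party chardet library.
# The result is a statistical guess and can be wrong, especially for short inputs.
import chardet

data = '你好，世界。这是一段用来测试编码检测的中文文本。'.encode('gbk')
print(chardet.detect(data))
# typically something like {'encoding': 'GB2312', 'confidence': 0.99, 'language': 'Chinese'}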
Having understood the general methods of text encoding, we now turn to several control characters in ASCII, that is, characters that are not printed. If you use the terminal often, you will run into them. Some examples:
| Caret | Hex | Abbr | Name | Effect |
|---|---|---|---|---|
| ^J | 0x0A | LF | Line Feed | Moves to next line. |
| ^M | 0x0D | CR | Carriage Return | Moves the cursor to column zero. |
| ^[ | 0x1B | ESC | Escape | Starts all the escape sequences |
Some Control Characters in ASCII
The first column of the table is caret notation, which means entering invisible characters by holding Ctrl (written ^) together with a letter. (However, it seems this notation conflicts with shortcuts like Ctrl + C?) In any case, this tradition is preserved in Unix shells:
You can use ^C (end of text) to terminate a running process
You can use ^D (end of transmission) to exit the shell or to signal the end of standard input; for example, the C system call read(0, buffer, sizeof(buffer)) stops waiting and returns once ^D is received
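As a small Python illustration of the second point (run it in a terminal and finish the input with ^D):

# Read standard input until EOF; typing ^D on its own line in a Unix terminal
# ends the stream and lets read() return.
import sys

data = sys.stdin.read()
print(f'read {len(data)} characters')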
The 0x1B (ESC) in the table has a special status, as it can start escape sequences: the idea is that several characters following ESC should not be interpreted literally but specially.
The ANSI escape sequence standard is adopted by many Unix-like system terminals. In this standard, CSI commands (Control Sequence Introducer Commands) start with ESC [ and are the most commonly used. In programming language source code, this prefix is often written as \e[ or \033[.
Here are some useful or interesting CSI commands when using the terminal:
echo -e '\e[?25h' shows the cursor
echo -e '\e[?25l' hides the cursor
echo -e '\e[31;46mxx' displays colored 'xx'
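The same CSI commands can be emitted from any language by writing the raw escape bytes; a minimal Python sketch (SGR parameters 31 and 46 mean red foreground and cyan background):

# '\x1b[' is ESC [ , the CSI prefix; 'm' ends an SGR (color/style) command.
CSI = '\x1b['
print(CSI + '31;46m' + 'xx' + CSI + '0m')  # colored 'xx', then reset attributes
print(CSI + '?25l', end='')                # hide the cursor
print(CSI + '?25h', end='')                # show it again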
In different East Asian writing systems, many characters differ only slightly in glyph shape while being, linguistically, the same character; "variant characters" describe a similar phenomenon. This also has a counterpart in computers:
戶 户 戸
This phenomenon generally falls into two situations:
The characters on the page may simply have different Unicode code points
Or the code points may be identical and the difference comes from the software you are using (browsers, MS Word, and similar tools apply fonts and formatting); you can verify this by pasting the text into a command-line terminal
For variant characters, historically, some regions have encoded these variant characters separately, so some variant characters are also separately encoded in Unicode. For example, the three variant characters of "户" are encoded separately:
>>> chr(0x6236)
'戶'
>>> chr(0x6237)
'户'
>>> chr(0x6238)
'戸'
[Note: The actual rendered Emoji may vary depending on the software platform used to read this article]
Unicode supports Emoji:
>>> chr(128514)
'😂'
This is not surprising; what's interesting is that Unicode defines the "composition" rules for Emoji.
base_emoji = '\U0001F926'  # U+1F926, person facepalming
print("Base Face Palm Emoji:", base_emoji)
for i in range(4):
    # U+1F3FB to U+1F3FF are the skin tone modifiers; start from U+1F3FC here
    skin_tone_modifier = chr(ord('\U0001F3FC') + i)
    grinning_face_with_skin_tone = base_emoji + skin_tone_modifier
    print("Skin Tone Modifier:", skin_tone_modifier)
    print("Combined Emoji with Skin Tone Modifier:", grinning_face_with_skin_tone)
Output is as follows:
Base Face Palm Emoji: 🤦
Skin Tone Modifier: 🏼 Combined Emoji with Skin Tone Modifier: 🤦🏼
Skin Tone Modifier: 🏽 Combined Emoji with Skin Tone Modifier: 🤦🏽
Skin Tone Modifier: 🏾 Combined Emoji with Skin Tone Modifier: 🤦🏾
Skin Tone Modifier: 🏿 Combined Emoji with Skin Tone Modifier: 🤦🏿
woman_emoji = '\U0001F469' # Woman
man_emoji = '\U0001F468' # Man
girl_emoji = '\U0001F467' # Girl
zwj = '\U0000200D' # Zero Width Joiner (ZWJ)
family_emoji = woman_emoji + zwj + man_emoji + zwj + girl_emoji
print("Family Emoji:", family_emoji)
Output is as follows: 👩👨👧
How does a language model turn text into tokens? We can imagine a very simple method: each code point corresponds to one token. In this way, a purely English model needs a vocabulary of only ~100 tokens to represent everything it reads and writes.
However, what if we want multilingual input and output? Take Chinese characters as an example: even including only the reasonably common characters already requires ~10,000 tokens (uncovered characters would have to be turned into an "unknown" token such as <unk>).
With the intuition of "matching the number of predictions with the amount of information," we merge tokens that frequently appear together into new tokens, trading a larger alphabet for inference efficiency. This is BPE (Byte-pair Encoding), which can be understood simply as starting from letters and basic symbols and gradually building a vocabulary of "subwords." GPT2's alphabet size of 50257 consists of 256 single-byte basic symbols, plus 50,000 merged "subwords," plus the special token <|endoftext|>.
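A minimal sketch of the core BPE loop (repeatedly merging the most frequent adjacent pair) on a toy character-level corpus; real byte-level tokenizers such as GPT2's start from 256 byte symbols and involve many more engineering details:

from collections import Counter

# Toy BPE: repeatedly merge the most frequent adjacent pair of symbols.
def bpe_train(corpus, num_merges):
    words = [tuple(w) for w in corpus]   # each word as a tuple of single characters
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for w in words:
            for a, b in zip(w, w[1:]):
                pairs[(a, b)] += 1
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent adjacent pair
        merges.append(best)
        merged = best[0] + best[1]
        new_words = []
        for w in words:                   # replace every occurrence of the pair
            out, i = [], 0
            while i < len(w):
                if i + 1 < len(w) and (w[i], w[i + 1]) == best:
                    out.append(merged)
                    i += 2
                else:
                    out.append(w[i])
                    i += 1
            new_words.append(tuple(out))
        words = new_words
    return merges, words

merges, words = bpe_train(['low', 'lower', 'lowest', 'low'], num_merges=3)
print(merges)  # [('l', 'o'), ('lo', 'w'), ('low', 'e')]
print(words)   # [('low',), ('lowe', 'r'), ('lowe', 's', 't'), ('low',)]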
You can experiment with the tokenizer at Tiktokenizer:
This is UTF-8
"This", " is", " UTF", "-", "8"
1212, 318, 41002, 12, 23
You can see that one token corresponds to multiple English letters, so there's no worry about the efficiency of English output.
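The same tokenization can be reproduced locally with OpenAI's tiktoken library (assuming it is installed and can download the GPT-2 vocabulary):

# Reproduce the tokenization with tiktoken (pip install tiktoken).
import tiktoken

enc = tiktoken.get_encoding("gpt2")
ids = enc.encode("This is UTF-8")
print(ids)                             # [1212, 318, 41002, 12, 23]
print([enc.decode([i]) for i in ids])  # ['This', ' is', ' UTF', '-', '8']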
The above method still does not solve the problem of unknown tokens in multilingual input: no matter how many characters from different languages you include, there will always be characters you have not included, unless you exhaust the entire Basic Multilingual Plane (2^16 = 65536 code points) (and then what about emoji?). A method that "stays unchanged in the face of all changes" is to encode all input into a byte stream using UTF-8, treat the 256 possible byte values as the most basic units, apply BPE on top of them to compute subwords, and feed those to the model. This way, the model's input never contains unknown characters.
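A tiny sketch of this byte-level fallback:

# Any string becomes a sequence of integers in range(256); the base vocabulary
# therefore never encounters an unknown symbol.
text = '茴香豆 😂'
byte_tokens = list(text.encode('utf-8'))
print(byte_tokens)                          # every value is between 0 and 255
print(bytes(byte_tokens).decode('utf-8'))   # lossless round trip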
The problem with this method is that the tokens output by the model, when converted into a byte stream, may not be valid UTF-8 bytes (incorrect position or number of continuation bytes). In such cases, special handling is needed during decoding, such as attempting to correct or ignore:
byte_sequence = b'\xe4\xbd'  # a truncated UTF-8 sequence (first two bytes of '你')
decoded_text = byte_sequence.decode('utf-8', errors='replace')
print(decoded_text)  # illegal sequences are replaced with "�" (U+FFFD, the "replacement character")
However, if Chinese text makes up too small a share of the corpus used to train the BPE, the byte pairs that form Chinese characters appear too rarely, few subwords covering the UTF-8 bytes of Chinese characters get merged, and the model's Chinese output becomes very inefficient (in the worst case, each Chinese character costs 3 tokens, one per byte).
Comparing the GPT2 and Deepseek-R1 tokenizers on the phrase below, you can see that Deepseek has made many optimizations for Chinese:
计算机和语言模型的字符编码 ("Character encoding for computers and language models")
GPT2: 164, 106, 94, 163, 106, 245, 17312, 118, 161, 240, 234, 46237, 255, 164, 101, 222, 162, 101, 94, 161, 252, 233, 21410, 27764, 245, 163, 105, 99, 163, 120, 244, 163, 254, 223
Deepseek-R1: 11766, 548, 7831, 52727, 15019, 26263