Character Encoding, Control Characters, and Escape Sequences

2025-02-21

Do you know how many ways there are to write the character "茴" in "茴香豆"? After reading this blog, you'll know. This article is written for all readers, regardless of technical background.

The three sections of this article are arranged from general principles to specific applications.

How Computers Store Text

This is a fundamental question. Take the simplest encoding, ASCII, as an example: the character 'A' maps to code point 65, which is then stored as an 8-bit unsigned integer, giving the byte 0x41 (binary 01000001). From this we can extract the general process by which computers store text:

Text (Character) -> Natural Number (Codepoint) -> Byte Stream
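This pipeline can be traced directly in Python; a quick sketch, where ord performs the first arrow and str.encode produces the final byte stream:

```python
ch = 'A'
cp = ord(ch)                 # character -> code point (a natural number)
stream = ch.encode('ascii')  # code point -> byte stream
print(cp)                    # 65
print(stream.hex())          # 41
```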

ASCII has a total of 128 code points, which is far from enough to represent all the characters in the world. To store and represent more characters, the Unicode standard is generally used.

From Character to Codepoint

The Unicode standard defines a mapping from a great many of the world's characters to natural numbers in the range [0, 17*2^16). In Unicode notation this code point range is written U+0000 ~ U+10FFFF (the standard specifies that U+ is followed by the hexadecimal value, padded with leading zeros to at least four digits).

Each group of 2^16 consecutive code points is called a plane, and there are 17 planes in total. Plane 0 is the Basic Multilingual Plane, which contains commonly used characters.
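A quick sketch: since each plane holds 2^16 code points, the plane index of a code point is just its value shifted right by 16 bits (the helper name plane below is ours, not a standard API):

```python
def plane(cp: int) -> int:
    """Return the Unicode plane (0-16) a code point belongs to."""
    return cp >> 16  # integer division by 2**16

print(plane(ord('A')))   # 0: Basic Multilingual Plane
print(plane(0x1F602))    # 1: Supplementary Multilingual Plane (many emoji)
```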

Experiment: Convert characters and Unicode code points using Python. (The following code uses Python by default, with ">>>" indicating execution in the interactive interpreter REPL)

>>> ord('字')
23383
>>> chr(23383)
'字'

From Codepoint to Byte Stream

There are many ways to convert code points to byte streams. For example, we could use 3 bytes to represent each Unicode code point. However, this is not necessarily efficient because some code points (characters) appear more frequently than others. If we can represent them with fewer bytes without causing ambiguity, it is obviously beneficial for storing and transmitting text information in computers.
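To make the trade-off concrete, here is a minimal sketch of such a hypothetical fixed-width 3-byte scheme (not a real standard; the names fixed3_encode and fixed3_decode are ours):

```python
def fixed3_encode(text: str) -> bytes:
    # Every code point fits in 3 bytes, since the maximum is 0x10FFFF < 2**24.
    return b''.join(ord(c).to_bytes(3, 'big') for c in text)

def fixed3_decode(data: bytes) -> str:
    return ''.join(chr(int.from_bytes(data[i:i + 3], 'big'))
                   for i in range(0, len(data), 3))

encoded = fixed3_encode('Hi字')
print(len(encoded))            # 9 bytes for 3 characters, common or not
print(fixed3_decode(encoded))  # round-trips back to 'Hi字'
```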

This is the source coding problem in information theory, where the expected code length is related to the distribution of input characters. Clearly, we want frequently occurring characters to have shorter encodings. This is the inherent reason why different countries and regions use different encoding standards. For example, Chinese characters are not common in English-speaking countries, so they require 3 bytes in the UTF-8 encoding standard, while they only need 2 bytes in China's GB 2312 encoding standard.

However, for computer interoperability, it is necessary to agree on a globally interoperable standard, and UTF-8 is one of them.

From Codepoint to Byte Stream: UTF-8 Encoder

UTF-8 can refer both to the encoding standard and to its implementation, the encoder/decoder (codec). It is part of the Unicode standard. Specifically, it represents each Unicode code point with 1 to 4 bytes, using the following conversion rules:

First code point  Last code point  Byte 1    Byte 2    Byte 3    Byte 4
U+0000            U+007F           0xxxxxxx
U+0080            U+07FF           110xxxxx  10xxxxxx
U+0800            U+FFFF           1110xxxx  10xxxxxx  10xxxxxx
U+10000           U+10FFFF         11110xxx  10xxxxxx  10xxxxxx  10xxxxxx

UTF-8 Coding Standard

The number of bytes used for each Unicode code point depends on the range in which the code point falls, as the table shows.

From this table, we can also derive the following "eyeball decoding UTF-8" techniques:

- A byte starting with 0 is a complete one-byte (ASCII) character by itself.
- A byte starting with 110, 1110, or 11110 begins a 2-, 3-, or 4-byte character: the number of leading 1s equals the total length of the sequence.
- A byte starting with 10 is a continuation byte; it never starts a character.

From this table, we can also verify that UTF-8 encoding is a prefix code. This good property ensures that when we concatenate the encodings of many characters to form a byte stream, we can decode all characters by scanning linearly from the beginning. For example:

>>> "哥德尔 Gödel".encode('utf-8').hex()
'e593a5e5beb7e5b0942047c3b664656c'
# e5 93 a5 哥
# e5 be b7 德
# e5 b0 94 尔
# 20 (space)
# 47 G
# c3 b6 ö
# 64 d
# 65 e
# 6c l
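This linear scan can be sketched as a small generator that reads each character's byte length from its leading byte (scan_utf8 is our own illustrative helper, not a library function):

```python
def scan_utf8(data: bytes):
    """Yield one decoded character at a time by scanning linearly."""
    i = 0
    while i < len(data):
        b = data[i]
        if b >> 7 == 0:          n = 1  # 0xxxxxxx: ASCII
        elif b >> 5 == 0b110:    n = 2  # 110xxxxx: 2-byte sequence
        elif b >> 4 == 0b1110:   n = 3  # 1110xxxx: 3-byte sequence
        else:                    n = 4  # 11110xxx: 4-byte sequence
        yield data[i:i + n].decode('utf-8')
        i += n

print(list(scan_utf8("哥德尔 Gödel".encode('utf-8'))))
# ['哥', '德', '尔', ' ', 'G', 'ö', 'd', 'e', 'l']
```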

We manually calculate the encoding of the character 'ö':

Code point for ö is U+00F6, falls in the second category
=> use the 110xxxxx 10xxxxxx format, need 11 x's
=> 00F6 is 00011 110110
=> the two bytes are 11000011 10110110 = C3 B6
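The whole table can likewise be turned into a tiny encoder; a sketch (utf8_encode is our own helper, for illustration only):

```python
def utf8_encode(cp: int) -> bytes:
    if cp < 0x80:            # 1 byte:  0xxxxxxx
        return bytes([cp])
    if cp < 0x800:           # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | cp >> 6, 0x80 | cp & 0x3F])
    if cp < 0x10000:         # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | cp >> 12, 0x80 | cp >> 6 & 0x3F,
                      0x80 | cp & 0x3F])
    if cp < 0x110000:        # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        return bytes([0xF0 | cp >> 18, 0x80 | cp >> 12 & 0x3F,
                      0x80 | cp >> 6 & 0x3F, 0x80 | cp & 0x3F])
    raise ValueError("not a Unicode code point")

print(utf8_encode(0x00F6).hex())  # c3b6, matching the manual calculation
```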

Which Decoder to Use?

When faced with a byte stream from a file or network stream, which decoder should be used? Obviously, it should be a decoder that adheres to the same standard as the encoder. If the encoding and decoding standards are inconsistent, garbled text or errors will occur:

>>> '你好'.encode('utf-8')
b'\xe4\xbd\xa0\xe5\xa5\xbd'
>>> b'\xe4\xbd\xa0\xe5\xa5\xbd'.decode('gbk')
'浣犲ソ'
>>> '你好'.encode('gbk')
b'\xc4\xe3\xba\xc3'
>>> b'\xc4\xe3\xba\xc3'.decode('utf-8')
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc4 in position 0: invalid continuation byte

We may need to determine the encoding standard through external information, such as descriptions elsewhere in the file.

We can also guess the encoder from the byte stream itself: as discussed above, a byte stream produced by a UTF-8 encoder has recognizable structure, and byte streams produced by different encoders often have distinguishing statistical features. Many text editors have this functionality built in, letting them guess an encoding standard to decode with when you open a file.
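A crude sketch of such guessing (real editors use richer statistics, and dedicated libraries such as chardet exist): try the strictest candidate first, since random bytes rarely form valid UTF-8, then fall back:

```python
def guess_decode(data: bytes):
    """Try candidate encodings in order of strictness.

    latin-1 maps every byte to a character, so it never fails
    and acts as a last resort.
    """
    for enc in ('utf-8', 'gbk', 'latin-1'):
        try:
            return data.decode(enc), enc
        except UnicodeDecodeError:
            continue

print(guess_decode('你好'.encode('gbk')))    # decoded as gbk
print(guess_decode('你好'.encode('utf-8')))  # decoded as utf-8
```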

Control Characters in ASCII

Having understood the general methods of text encoding, we now turn to several control characters in ASCII: characters that are not printed, but instead trigger some effect. You will encounter them frequently when using a terminal. Some examples:

Caret  Hex   Abbr  Name             Effect
^J     0x0A  LF    Line Feed        Moves to the next line.
^M     0x0D  CR    Carriage Return  Moves the cursor to column zero.
^[     0x1B  ESC   Escape           Starts all escape sequences.

Some Control Characters in ASCII

The first column of the table is caret notation, which means entering invisible characters from the keyboard by holding Ctrl (corresponding to ^) plus a letter. (However, doesn't this notation conflict with shortcuts like Ctrl + C?) In any case, this tradition is preserved in the Unix shell.

The 0x1B (ESC) in the table has a special status, as it can start escape sequences: the idea is that several characters following ESC should not be interpreted literally but specially.

CSI Commands in ANSI Escape Sequences

The ANSI escape sequence standard is adopted by many Unix-like system terminals. In this standard, CSI commands (Control Sequence Introducer Commands) start with ESC [ and are the most commonly used. In programming language source code, this prefix is often written as \e[ or \033[.

Here are some useful or interesting CSI commands when using the terminal (a small selection; n stands for a numeric parameter):

ESC[nA / ESC[nB / ESC[nC / ESC[nD   Move the cursor up / down / forward / back by n cells
ESC[H                                Move the cursor to the top-left corner
ESC[2J                               Clear the entire screen
ESC[2K                               Clear the entire current line
ESC[nm                               Select Graphic Rendition (SGR): colors and styles, e.g. ESC[31m for red, ESC[0m to reset
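As one concrete example, here is a sketch of coloring output with the SGR command (the m command, Select Graphic Rendition), assuming an ANSI-capable terminal:

```python
CSI = '\x1b['  # ESC [ , the Control Sequence Introducer

def red(text: str) -> str:
    # 31 selects a red foreground; 0 resets all attributes afterwards
    return f"{CSI}31m{text}{CSI}0m"

print(red('error:') + ' something went wrong')
```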

Trivia

Unicode Encoding of Variant Characters

In different East Asian writing systems, many characters differ only slightly in glyph shape while being, linguistically, the same character; "variant characters" describes a similar phenomenon. It also has a counterpart in computers:

戶 户 戸

This phenomenon generally falls into two situations: the variants may be unified under a single code point (with the exact glyph left to the font), or they may be encoded as separate code points.

For variant characters, historically, some regions have encoded these variant characters separately, so some variant characters are also separately encoded in Unicode. For example, the three variant characters of "户" are encoded separately:

>>> chr(0x6236)
'戶'
>>> chr(0x6237)
'户'
>>> chr(0x6238)
'戸'

Unicode and Emoji

[Note: The actual rendered Emoji may vary depending on the software platform used to read this article]

Unicode supports Emoji:

>>> chr(128514)
'😂'

This is not surprising; what's interesting is that Unicode defines the "composition" rules for Emoji.

Two Characters Define an Emoji

base_emoji = '\U0001F926'  # U+1F926 FACE PALM
print("Base Face Palm Emoji:", base_emoji)

# Skin tone modifiers occupy U+1F3FB..U+1F3FF; this loop uses the last four.
for i in range(4):
    skin_tone_modifier = chr(ord('\U0001F3FC') + i)
    face_palm_with_skin_tone = base_emoji + skin_tone_modifier
    print("Skin Tone Modifier:", skin_tone_modifier)
    print("Combined Emoji with Skin Tone Modifier:", face_palm_with_skin_tone)

Output is as follows:

Base Face Palm Emoji: 🤦
Skin Tone Modifier: 🏼
Combined Emoji with Skin Tone Modifier: 🤦🏼
Skin Tone Modifier: 🏽
Combined Emoji with Skin Tone Modifier: 🤦🏽
Skin Tone Modifier: 🏾
Combined Emoji with Skin Tone Modifier: 🤦🏾
Skin Tone Modifier: 🏿
Combined Emoji with Skin Tone Modifier: 🤦🏿

Join emoji using ZWJ

woman_emoji = '\U0001F469'  # Woman
man_emoji = '\U0001F468'    # Man
girl_emoji = '\U0001F467'   # Girl
zwj = '\U0000200D'          # Zero Width Joiner

family_emoji = woman_emoji + zwj + man_emoji + zwj + girl_emoji
print("Family Emoji:", family_emoji)

Output is as follows: 👩‍👨‍👧

Extra: How Language Models Represent Text

We can imagine a very simple method: each code point corresponds to a token. In this way, a purely English model only needs ~100 tokens to represent the corpus it needs to input and output.

However, what if we want multilingual input and output? Take Chinese characters as an example: we can include all but the rarest characters, which requires ~10,000 tokens (characters not covered are mapped to a special "unknown" token, commonly written <unk>, on input). But these 10,000 characters occur with wildly different frequencies: the ~3,000 commonly used characters account for almost all usage, yet every character occupies a row of the embedding matrix, incurring computational and storage overhead. Furthermore, when output mixes English and Chinese, two predictions can produce a two-character Chinese word, while the same amount of information in English may require ~10 letters, that is, ten predictions, making English output inefficient.

Byte-pair Encoding (BPE)

With the intuition of "matching the number of predictions to the amount of information," we merge tokens that frequently appear together into new tokens, trading a larger alphabet for inference efficiency. This is Byte-Pair Encoding (BPE), which can be understood simply as starting from letters and basic symbols and gradually building a vocabulary of "subwords." GPT-2's vocabulary size of 50,257 consists of 256 single-byte basic symbols, plus 50,000 learned subword merges, plus one special token (<|endoftext|>).
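A toy sketch of the merge loop (greatly simplified; real tokenizers also record merge ranks, pre-tokenize text, and work on bytes rather than characters):

```python
from collections import Counter

def bpe_train(text: str, num_merges: int):
    """Repeatedly merge the most frequent adjacent token pair."""
    tokens = list(text)  # start from single characters
    merges = []
    for _ in range(num_merges):
        pairs = Counter(zip(tokens, tokens[1:]))
        if not pairs:
            break
        (a, b), count = pairs.most_common(1)[0]
        if count < 2:
            break  # nothing worth merging
        merges.append(a + b)
        out, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == (a, b):
                out.append(a + b)
                i += 2
            else:
                out.append(tokens[i])
                i += 1
        tokens = out
    return tokens, merges

tokens, merges = bpe_train('low lower lowest', 2)
print(merges)  # the first merges: 'lo', then 'low'
print(tokens)
```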

You can experiment with the tokenizer at Tiktokenizer:

This is UTF-8
"This", " is", " UTF", "-", "8"
1212, 318, 41002, 12, 23

You can see that one token corresponds to multiple English letters, so there's no worry about the efficiency of English output.

Byte-level BPE

The above method still does not solve the problem of unknown tokens in multilingual input: no matter how many characters from different languages you include, there will always be characters you haven't covered, unless you exhaust the entire Basic Multilingual Plane (2^16 = 65,536 code points; and even then, what about emoji?). A way to "remain unchanged in the face of all changes" is to encode all input into a byte stream using UTF-8, take the 256 possible byte values as the most basic units, apply BPE on top of them to compute subwords, and feed those to the model. The model's input then never contains an unknown character.
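A quick sketch of this base alphabet at work: any text, in any script, reduces to values in 0-255:

```python
# Every string, regardless of script, becomes a sequence of byte values 0-255.
for text in ('Hi', '你好', '🤦'):
    print(text, list(text.encode('utf-8')))
```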

The problem with this method is that the tokens the model outputs, once converted back into a byte stream, may not form valid UTF-8 (continuation bytes in the wrong position or the wrong number). Such cases need special handling during decoding, such as attempting to correct or to substitute:

byte_sequence = b'\xe4\xbd'  # the first two of the three UTF-8 bytes of '你', truncated
decoded_text = byte_sequence.decode('utf-8', errors='replace')
# illegal sequences are replaced with "�" (U+FFFD, the "replacement character")

However, if the Chinese portion of the corpus used to train the BPE is too small, its byte pairs occur too rarely for merges to assemble the UTF-8 bytes of Chinese characters into subwords, making the model's Chinese output very inefficient (in the worst case, each Chinese character costs 3 tokens, one per UTF-8 byte).

Comparing GPT2 and Deepseek-R1's tokenizer, you can see that Deepseek has made many optimizations for Chinese language processing:

计算机和语言模型的字符编码
GPT2: 164, 106, 94, 163, 106, 245, 17312, 118, 161, 240, 234, 46237, 255, 164, 101, 222, 162, 101, 94, 161, 252, 233, 21410, 27764, 245, 163, 105, 99, 163, 120, 244, 163, 254, 223
Deepseek-R1: 11766, 548, 7831, 52727, 15019, 26263