#998 – UTF-8 Encoding
December 18, 2013
Unicode maps each character to a code point, i.e. a numeric value that represents that character. A character encoding scheme then dictates how each code point is represented as a series of bits so that it can be stored in memory or on disk.
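To make the distinction between a code point and its encoding concrete, here is a small Python sketch. The sample character (e-acute, U+00E9) and the variable names are chosen only for illustration; the point is that one code point yields different byte sequences under different encoding schemes.

```python
# Code point: the number that Unicode assigns to a character.
ch = "\u00e9"                  # e-acute, picked as an arbitrary example
code_point = ord(ch)
print(f"U+{code_point:04X}")   # U+00E9

# Encoding: how that number is turned into bytes.
# The same code point gives different bytes under different schemes.
utf8_bytes = ch.encode("utf-8")
utf16_bytes = ch.encode("utf-16-be")
print(utf8_bytes.hex(" "))     # c3 a9
print(utf16_bytes.hex(" "))    # 00 e9
```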
UTF-8 is the most common character encoding for transferring data over the web, e.g. via XML or HTML. UTF-8 uses 1 to 4 bytes to represent each code point.
The full range of Unicode code points, from U+0000 through U+10FFFF, is encoded with UTF-8 as follows:
- U+0000 to U+007F: 1 byte for 7 bits – 0xxx xxxx, storing the code point exactly (identical to ASCII)
- U+0080 to U+07FF: 2 bytes for 11 bits – 110x xxxx, 10xx xxxx
- U+0800 to U+FFFF: 3 bytes for 16 bits – 1110 xxxx, 10xx xxxx, 10xx xxxx
- U+10000 to U+10FFFF: 4 bytes for 21 bits – 1111 0xxx, 10xx xxxx, 10xx xxxx, 10xx xxxx
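The bit layouts in the list above can be sketched as a hand-written encoder. This is a minimal illustration in Python, not a production encoder: the function name utf8_encode is hypothetical, and for brevity it does not reject the surrogate range U+D800 to U+DFFF, which a strict encoder would. Each result is checked against Python's built-in encoder.

```python
def utf8_encode(cp: int) -> bytes:
    """Encode a single code point using the four bit layouts listed above."""
    if cp < 0 or cp > 0x10FFFF:
        raise ValueError("code point outside U+0000..U+10FFFF")
    if cp <= 0x7F:
        # 0xxx xxxx
        return bytes([cp])
    if cp <= 0x7FF:
        # 110x xxxx, 10xx xxxx
        return bytes([0xC0 | (cp >> 6),
                      0x80 | (cp & 0x3F)])
    if cp <= 0xFFFF:
        # 1110 xxxx, 10xx xxxx, 10xx xxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    # 1111 0xxx, 10xx xxxx, 10xx xxxx, 10xx xxxx
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


# One sample code point from each range: A, e-acute, euro sign, an emoji.
for cp in (0x41, 0xE9, 0x20AC, 0x1F600):
    encoded = utf8_encode(cp)
    assert encoded == chr(cp).encode("utf-8")  # agrees with the built-in encoder
    print(f"U+{cp:04X} -> {encoded.hex(' ')}")
```

The leading bits of the first byte (0, 110, 1110 or 11110) tell a decoder how many bytes the sequence has, and every continuation byte starts with 10.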
English text written entirely with ASCII characters (U+0000 to U+007F) therefore requires 1 byte per character.
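A quick way to check the byte count is to compare the number of characters with the number of encoded bytes. The sample strings below are arbitrary and only serve as an illustration; the second one contains a single non-ASCII character to show where the counts start to differ.

```python
english = "The quick brown fox"
english_bytes = english.encode("utf-8")
print(len(english), len(english_bytes))   # 19 19

# Pure ASCII text produces the same bytes under ASCII and UTF-8.
assert english_bytes == english.encode("ascii")

accented = "caf\u00e9"                    # "cafe" with an e-acute as the last letter
accented_bytes = accented.encode("utf-8")
print(len(accented), len(accented_bytes)) # 4 5
```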