#998 – UTF-8 Encoding

Unicode maps characters into their corresponding code points, i.e. a numeric value that represents that character.  A character encoding scheme then dictates how each code point is represented as a series of bits so that it can be stored in memory or on disk.

UTF-8 is the most common character encoding when transferring data over the web, i.e. via XML or HTML.  UTF-8 uses from 1 to 4 bytes to represent each code point.

The full range of Unicode code points, from U+000000 through U+10FFFF, are encoded with UTF-8 as follows:

  • U+0000 to U+007F: 1 byte, storing code point exactly (identical to ASCII)
  • U+0080 to U+07FF: 2 bytes for 11 bits – 110x xxxx, 10xx xxxx
  • U+0800 to U+FFFF: 3 bytes for 16 bits – 1110 xxxx, 10xx xxxx, 10xx xxxx
  • U+10000 to U+10FFFF: 4 bytes for 21 bits – 1111 0xxx, 10xx xxxx, 10xx xxxx, 10xx xxxx

English text requires 1 byte per character.

Advertisements