#998 – UTF-8 Encoding

Unicode maps characters into their corresponding code points, i.e. a numeric value that represents that character.  A character encoding scheme then dictates how each code point is represented as a series of bits so that it can be stored in memory or on disk.

UTF-8 is the most common character encoding when transferring data over the web, i.e. via XML or HTML.  UTF-8 uses from 1 to 4 bytes to represent each code point.

The full range of Unicode code points, from U+000000 through U+10FFFF, are encoded with UTF-8 as follows:

  • U+0000 to U+007F: 1 byte, storing code point exactly (identical to ASCII)
  • U+0080 to U+07FF: 2 bytes for 11 bits – 110x xxxx, 10xx xxxx
  • U+0800 to U+FFFF: 3 bytes for 16 bits – 1110 xxxx, 10xx xxxx, 10xx xxxx
  • U+10000 to U+10FFFF: 4 bytes for 21 bits – 1111 0xxx, 10xx xxxx, 10xx xxxx, 10xx xxxx

English text requires 1 byte per character.

Advertisement

About Sean
Software developer in the Twin Cities area, passionate about software development and sailing.

3 Responses to #998 – UTF-8 Encoding

  1. Pingback: #999 – Some Examples of UTF-16 and UTF-8 Encoding | 2,000 Things You Should Know About C#

  2. Pingback: #1,000 – UTF-8 and ASCII | 2,000 Things You Should Know About C#

  3. Pingback: #1,002 – Specifying Character Encoding when Writing to a File | 2,000 Things You Should Know About C#

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

%d bloggers like this: