#998 – UTF-8 Encoding

December 18, 2013

Unicode maps each character to a code point, i.e. a numeric value that represents that character. A character encoding scheme then dictates how each code point is represented as a series of bits so that it can be stored in memory or on disk.
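To make the idea of a code point concrete, here is a minimal sketch that prints the code point of a few characters. The class and method names (CodePointDemo, Describe) are made up for illustration; char.ConvertToUtf32 is the standard .NET call that reads the code point starting at a given string index.

```csharp
using System;

public static class CodePointDemo
{
    // Returns the code point of the character that starts at the given index,
    // written in the usual U+XXXX notation (at least four hex digits).
    public static string Describe(string text, int index)
    {
        int codePoint = char.ConvertToUtf32(text, index);
        return "U+" + codePoint.ToString("X4");
    }

    public static void Main()
    {
        Console.WriteLine(Describe("A", 0));           // U+0041
        Console.WriteLine(Describe("\u00E9", 0));      // U+00E9  (é)
        Console.WriteLine(Describe("\u20AC", 0));      // U+20AC  (€)
        Console.WriteLine(Describe("\U0001F600", 0));  // U+1F600 (an emoji, held as a surrogate pair in a C# string)
    }
}
```

Note that the code point is independent of any encoding: the same values come back regardless of how the text is later written to memory or disk.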

UTF-8 is the most common character encoding for transferring data over the web, e.g. via XML or HTML. UTF-8 uses from 1 to 4 bytes to represent each code point.
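As a quick illustration of the 1 to 4 byte range, the sketch below encodes one character from each size class using the built-in Encoding.UTF8 and shows the resulting bytes in hex. The names Utf8LengthDemo and ToUtf8Hex are hypothetical and chosen only for this example.

```csharp
using System;
using System.Text;

public static class Utf8LengthDemo
{
    // Encodes the text as UTF-8 and returns the bytes as hex, e.g. "E2-82-AC".
    public static string ToUtf8Hex(string text)
    {
        byte[] bytes = Encoding.UTF8.GetBytes(text);
        return BitConverter.ToString(bytes);
    }

    public static void Main()
    {
        Console.WriteLine(ToUtf8Hex("A"));           // 41           (1 byte)
        Console.WriteLine(ToUtf8Hex("\u00E9"));      // C3-A9        (2 bytes, é)
        Console.WriteLine(ToUtf8Hex("\u20AC"));      // E2-82-AC     (3 bytes, €)
        Console.WriteLine(ToUtf8Hex("\U0001F600"));  // F0-9F-98-80  (4 bytes, emoji)
    }
}
```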

The full range of Unicode code points, from U+0000 through U+10FFFF, is encoded with UTF-8 as follows:

U+0000 to U+007F: 1 byte for 7 bits – 0xxx xxxx (identical to ASCII)
U+0080 to U+07FF: 2 bytes for 11 bits – 110x xxxx, 10xx xxxx
U+0800 to U+FFFF: 3 bytes for 16 bits – 1110 xxxx, 10xx xxxx, 10xx xxxx
U+10000 to U+10FFFF: 4 bytes for 21 bits – 1111 0xxx, 10xx xxxx, 10xx xxxx, 10xx xxxx
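The bit layouts listed above can be applied by hand with shifts and masks. The following is a rough sketch, not how .NET implements it internally: ManualUtf8.Encode is a hypothetical helper that fills in the x bits for a single code point, and for simplicity it does not reject surrogate code points (U+D800 to U+DFFF), which a real encoder would treat as invalid.

```csharp
using System;

public static class ManualUtf8
{
    // Encodes one code point by filling the "x" positions of the matching
    // layout with the code point's bits, most significant bits first.
    public static byte[] Encode(int cp)
    {
        if (cp < 0 || cp > 0x10FFFF)
            throw new ArgumentOutOfRangeException(nameof(cp));

        if (cp <= 0x7F)                       // 0xxx xxxx
            return new[] { (byte)cp };

        if (cp <= 0x7FF)                      // 110x xxxx, 10xx xxxx
            return new[]
            {
                (byte)(0xC0 | (cp >> 6)),
                (byte)(0x80 | (cp & 0x3F))
            };

        if (cp <= 0xFFFF)                     // 1110 xxxx, 10xx xxxx, 10xx xxxx
            return new[]
            {
                (byte)(0xE0 | (cp >> 12)),
                (byte)(0x80 | ((cp >> 6) & 0x3F)),
                (byte)(0x80 | (cp & 0x3F))
            };

        return new[]                          // 1111 0xxx, then three 10xx xxxx bytes
        {
            (byte)(0xF0 | (cp >> 18)),
            (byte)(0x80 | ((cp >> 12) & 0x3F)),
            (byte)(0x80 | ((cp >> 6) & 0x3F)),
            (byte)(0x80 | (cp & 0x3F))
        };
    }

    public static void Main()
    {
        Console.WriteLine(BitConverter.ToString(Encode(0x41)));     // 41
        Console.WriteLine(BitConverter.ToString(Encode(0xE9)));     // C3-A9
        Console.WriteLine(BitConverter.ToString(Encode(0x20AC)));   // E2-82-AC
        Console.WriteLine(BitConverter.ToString(Encode(0x1F600)));  // F0-9F-98-80
    }
}
```

In practice you would call Encoding.UTF8.GetBytes rather than write this yourself; the point of the sketch is only to show where each group of bits ends up.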

English text that uses only ASCII characters requires 1 byte per character.


About Sean
Software developer in the Twin Cities area, passionate about software development and sailing.

2,000 Things You Should Know About C#


Sean Sexton
