#999 – Some Examples of UTF-16 and UTF-8 Encoding

December 19, 2013

Unicode maps each character to a code point, i.e. a numeric value that identifies that character. A character encoding scheme then dictates how each code point is represented as a sequence of bytes so that it can be stored in memory or on disk. UTF-16 and UTF-8 are the most commonly used encoding schemes for Unicode character data.
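To make the distinction between a code point and its encoded bytes concrete, here is a minimal Python sketch. The helper names `code_point_label` and `encoded_bytes` are hypothetical, introduced only for illustration, and the euro sign is an arbitrary sample character. Passing a separator to `bytes.hex` assumes Python 3.8 or later.

```python
def code_point_label(ch: str) -> str:
    """Return the conventional U+XXXX label for a single character."""
    return f"U+{ord(ch):04X}"


def encoded_bytes(ch: str, encoding: str) -> str:
    """Encode one character and return its bytes as spaced, uppercase hex."""
    return ch.encode(encoding).hex(" ").upper()


ch = "\u20ac"  # euro sign, chosen only as a sample character

print(code_point_label(ch))            # U+20AC
print(encoded_bytes(ch, "utf-16-be"))  # 20 AC
print(encoded_bytes(ch, "utf-8"))      # E2 82 AC
```

The code point is the same in every case; only the byte sequence changes with the encoding scheme. In C#, the classes in `System.Text.Encoding` (for example `Encoding.UTF8.GetBytes`) play a similar role.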

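Below are some examples of how various characters would be encoded in UTF-16 and UTF-8. The UTF-16 bytes are shown in big-endian order. The bracketed patterns show the UTF-8 byte layout, where each x is a bit taken from the code point.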

Latin capital letter ‘A’, code point U+0041
- UTF-16: 2 bytes, 00 41 (hex)
- UTF-8: 1 byte, 41 (hex) [0xxx xxxx]
Latin small letter ‘é’ (e with acute accent), code point U+00E9
- UTF-16: 2 bytes, 00 E9 (hex)
- UTF-8: 2 bytes, C3 A9 (hex) [110x xxxx 10xx xxxx]
Mongolian letter A, code point U+1820
- UTF-16: 2 bytes, 18 20 (hex)
- UTF-8: 3 bytes, E1 A0 A0 (hex) [1110 xxxx 10xx xxxx 10xx xxxx]
Ace of Spades playing card character, code point U+1F0A1
- UTF-16: 4 bytes, D8 3C DC A1 (hex), a surrogate pair
- UTF-8: 4 bytes, F0 9F 82 A1 (hex) [1111 0xxx 10xx xxxx 10xx xxxx 10xx xxxx]
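
The byte values listed above can be reproduced by hand. The following is a rough Python sketch, not a production encoder: `utf8_encode` and `utf16be_encode` are hypothetical helper names, and the code assumes its input is a valid code point (it does not reject the surrogate range U+D800 to U+DFFF or values above U+10FFFF). Each result is checked against Python's built-in codecs.

```python
def utf8_encode(cp: int) -> bytes:
    """Pack a code point into the 1- to 4-byte UTF-8 bit patterns."""
    if cp < 0x80:        # 0xxx xxxx
        return bytes([cp])
    if cp < 0x800:       # 110x xxxx 10xx xxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:     # 1110 xxxx 10xx xxxx 10xx xxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    # 1111 0xxx 10xx xxxx 10xx xxxx 10xx xxxx
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


def utf16be_encode(cp: int) -> bytes:
    """Encode a code point as big-endian UTF-16, using a surrogate pair above U+FFFF."""
    if cp < 0x10000:
        return cp.to_bytes(2, "big")
    offset = cp - 0x10000               # 20 bits remain after subtracting 0x10000
    high = 0xD800 | (offset >> 10)      # top 10 bits go into the high surrogate
    low = 0xDC00 | (offset & 0x3FF)     # bottom 10 bits go into the low surrogate
    return high.to_bytes(2, "big") + low.to_bytes(2, "big")


# The four code points from the examples above.
for cp in (0x41, 0xE9, 0x1820, 0x1F0A1):
    # Cross-check the hand-rolled encoders against the built-in codecs.
    assert utf8_encode(cp) == chr(cp).encode("utf-8")
    assert utf16be_encode(cp) == chr(cp).encode("utf-16-be")
    print(f"U+{cp:04X}",
          utf16be_encode(cp).hex(" ").upper(),
          utf8_encode(cp).hex(" ").upper(),
          sep=" | ")
```

For U+1F0A1, subtracting 0x10000 leaves 0xF0A1. Its top 10 bits (0x03C) and bottom 10 bits (0x0A1) are added to 0xD800 and 0xDC00, which gives the surrogate pair D83C DCA1.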

Filed under Basics Tagged with Basics, C#, Unicode, UTF-16, UTF-8

About Sean
Software developer in the Twin Cities area, passionate about software development and sailing.


2,000 Things You Should Know About C#


Sean Sexton
