#997 – UTF-16 Encoding, Part II

December 17, 2013

Unicode maps each character to a corresponding code point, i.e. a numeric value that represents that character. A character encoding scheme then dictates how each code point is represented as a series of bits so that it can be stored in memory or on disk.

Recall that UTF-16 encoding uses either 2 or 4 bytes to represent each code point.
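To make the 2-or-4-byte rule concrete, here is a minimal sketch that asks the framework how many bytes UTF-16 needs for a given code point. The class and method names (`Utf16Sizes`, `ByteCount`, `HexBytes`) are made up for illustration; `char.ConvertFromUtf32`, `Encoding.Unicode` (little-endian UTF-16) and `BitConverter.ToString` are standard .NET calls.

```csharp
using System;
using System.Text;

public static class Utf16Sizes
{
    // Number of bytes UTF-16 uses for a single code point.
    // char.ConvertFromUtf32 throws for values above 0x10FFFF
    // and for values inside the surrogate range D800-DFFF.
    public static int ByteCount(int codePoint)
    {
        string s = char.ConvertFromUtf32(codePoint);
        return Encoding.Unicode.GetByteCount(s);
    }

    // The little-endian UTF-16 bytes for a code point, as a hex string.
    public static string HexBytes(int codePoint)
    {
        byte[] bytes = Encoding.Unicode.GetBytes(char.ConvertFromUtf32(codePoint));
        return BitConverter.ToString(bytes);
    }
}
```

With these helpers, `Utf16Sizes.ByteCount(0x41)` (the letter A) should give 2 and `Utf16Sizes.ByteCount(0x1F600)` should give 4. Because `Encoding.Unicode` is little-endian, `HexBytes(0x41)` comes back as `41-00` rather than `00-41`.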

Code points larger than U+FFFF are represented using 4 bytes (two 16-bit values), known as a surrogate pair.

The first two bytes (lead surrogate) hold a value in the range D800 – DBFF
The third and fourth bytes (trail surrogate) hold a value in the range DC00 – DFFF
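The two ranges above can be checked in C# with a small sketch like the following. `SurrogateInspector` and its members are hypothetical names used only for this example; the framework's own `char.IsHighSurrogate` and `char.IsLowSurrogate` perform the same range checks on a `char`.

```csharp
using System;

public static class SurrogateInspector
{
    // Splits a code point above U+FFFF into its two 16-bit values.
    public static (int Lead, int Trail) Split(int codePoint)
    {
        string s = char.ConvertFromUtf32(codePoint);
        if (s.Length != 2)
        {
            // Code points up to U+FFFF fit in a single char, so there is no pair.
            throw new ArgumentOutOfRangeException(nameof(codePoint));
        }
        return ((int)s[0], (int)s[1]);
    }

    // Range checks taken directly from the ranges listed above.
    public static bool IsLead(int value) => value >= 0xD800 && value <= 0xDBFF;
    public static bool IsTrail(int value) => value >= 0xDC00 && value <= 0xDFFF;
}
```

For example, `SurrogateInspector.Split(0x1F600)` should return the pair D83D and DE00, which fall in the lead and trail ranges respectively.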

These surrogate pairs in turn represent code points in the range U+10000 – U+10FFFF. This gives 1,048,576 additional values (1,024 possible lead surrogates × 1,024 possible trail surrogates) that can be encoded as surrogate pairs, though the Unicode standard currently defines only a small subset of these values.
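The mapping between a surrogate pair and its code point is plain arithmetic: subtract 10000 (hex) from the code point, put the upper 10 bits of the result in the lead surrogate and the lower 10 bits in the trail surrogate. The sketch below spells this out; `SurrogateMath` and its method names are assumptions made for illustration, and in real code `char.ConvertFromUtf32` and `char.ConvertToUtf32` already do this work.

```csharp
using System;

public static class SurrogateMath
{
    // Code point in U+10000..U+10FFFF -> (lead, trail) surrogate values.
    public static (int Lead, int Trail) Encode(int codePoint)
    {
        if (codePoint < 0x10000 || codePoint > 0x10FFFF)
        {
            throw new ArgumentOutOfRangeException(nameof(codePoint));
        }
        int offset = codePoint - 0x10000;        // 20-bit value
        int lead = 0xD800 + (offset >> 10);      // upper 10 bits
        int trail = 0xDC00 + (offset & 0x3FF);   // lower 10 bits
        return (lead, trail);
    }

    // (lead, trail) surrogate values -> code point. Inputs are assumed to be in range.
    public static int Decode(int lead, int trail)
    {
        return 0x10000 + ((lead - 0xD800) << 10) + (trail - 0xDC00);
    }

    // Number of distinct pairs: 1,024 lead values times 1,024 trail values.
    public static int PairCount()
    {
        return (0xDBFF - 0xD800 + 1) * (0xDFFF - 0xDC00 + 1);
    }
}
```

`SurrogateMath.PairCount()` evaluates to 1,048,576, matching the figure above, and `Decode` applied to the result of `Encode` should return the original code point.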

Filed under Basics Tagged with Basics, C#, Unicode, UTF-16

About Sean
Software developer in the Twin Cities area, passionate about software development and sailing.

2,000 Things You Should Know About C#

Sean Sexton