#997 – UTF-16 Encoding, Part II

Unicode maps characters into their corresponding code points, i.e. a numeric value that represents that character.  A character encoding scheme then dictates how each code point is represented as a series of bits so that it can be stored in memory or on disk.

Recall that UTF-16 encoding uses either 2 or 4 bytes to represent each code point.

Code points larger than FFFF are represented using 4 bytes, known as a surrogate pair.

  • The first two bytes (lead surrogate) are in the range D800 – DBFF
  • The third and fourth bytes (trail surrogate) are in the range DC00 – DFFF

These surrogate pairs in turn represent code points in the range U+10000 – U+10FFFF.  This represents 1,048,576 additional values that can be encoded as surrogate pairs.  (Though the Unicode standard currently defines only a small subset of these values).

 

 

Advertisements

About Sean
Software developer in the Twin Cities area, passionate about .NET technologies. Equally passionate about my own personal projects related to family history and preservation of family stories and photos.

3 Responses to #997 – UTF-16 Encoding, Part II

  1. Pingback: Dew Drop – December 17, 2013 (#1685) | Morning Dew

  2. Pingback: #1,001 – Representing Unicode Surrogate Pairs | 2,000 Things You Should Know About C#

  3. Pingback: #1,007 – Getting Length of String that Contains Surrogate Pairs | 2,000 Things You Should Know About C#

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: