#997 – UTF-16 Encoding, Part II

Unicode maps each character to a code point, i.e. a numeric value that represents that character.  A character encoding scheme then dictates how each code point is represented as a series of bits so that it can be stored in memory or on disk.

Recall that UTF-16 encoding uses either 2 or 4 bytes to represent each code point.
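As a quick check of the 2-or-4-byte rule, here is a minimal Python sketch. The helper name `utf16_byte_length` is made up for this illustration, and the sample characters are arbitrary choices.

```python
def utf16_byte_length(ch):
    """Return the number of bytes UTF-16 uses to encode the given text."""
    # "utf-16-be" is used because plain "utf-16" prepends a 2-byte
    # byte-order mark, which would inflate the count.
    return len(ch.encode("utf-16-be"))

print(utf16_byte_length("A"))           # U+0041, at or below U+FFFF
print(utf16_byte_length("\u20ac"))      # U+20AC, at or below U+FFFF
print(utf16_byte_length("\U0001F600"))  # U+1F600, above U+FFFF
```

The first two characters should each take 2 bytes, while the last one, being above U+FFFF, should take 4.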

Code points larger than U+FFFF are represented using 4 bytes, i.e. two 16-bit values that together are known as a surrogate pair.

  • The first two bytes (lead surrogate) are in the range D800 – DBFF
  • The third and fourth bytes (trail surrogate) are in the range DC00 – DFFF
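The two ranges above follow from the standard UTF-16 arithmetic: subtract 0x10000 from the code point, then split the remaining 20 bits into a high and a low 10-bit half. A rough Python sketch, where the function name `to_surrogate_pair` is hypothetical:

```python
def to_surrogate_pair(code_point):
    """Split a code point above U+FFFF into (lead, trail) surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point does not need a surrogate pair")
    offset = code_point - 0x10000        # fits in 20 bits
    lead = 0xD800 + (offset >> 10)       # top 10 bits    -> D800..DBFF
    trail = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits -> DC00..DFFF
    return lead, trail

lead, trail = to_surrogate_pair(0x1F600)
print(f"{lead:04X} {trail:04X}")
```

For U+1F600 this yields the pair D83D DE00, with each half falling inside its expected range.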

These surrogate pairs in turn represent code points in the range U+10000 – U+10FFFF.  That range covers 1,048,576 additional values (1,024 possible lead surrogates × 1,024 possible trail surrogates) that can be encoded as surrogate pairs, though the Unicode standard currently defines only a small subset of these values.
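The size of that range, and the mapping from a pair back to its code point, can be verified with a few lines of Python. This is only an illustrative sketch, and `from_surrogate_pair` is a made-up helper name.

```python
def from_surrogate_pair(lead, trail):
    """Combine a lead and trail surrogate back into one code point."""
    if not (0xD800 <= lead <= 0xDBFF and 0xDC00 <= trail <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    return 0x10000 + ((lead - 0xD800) << 10) + (trail - 0xDC00)

lead_count = 0xDBFF - 0xD800 + 1    # number of possible lead surrogates
trail_count = 0xDFFF - 0xDC00 + 1   # number of possible trail surrogates
total = lead_count * trail_count    # number of distinct pairs

print(total)
print(hex(from_surrogate_pair(0xD800, 0xDC00)))  # smallest pair
print(hex(from_surrogate_pair(0xDBFF, 0xDFFF)))  # largest pair
```

The smallest and largest pairs map to the two ends of the range, U+10000 and U+10FFFF.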

 

 

About Sean
Software developer in the Twin Cities area, passionate about software development and sailing.
