#996 – UTF-16 Encoding, Part I

Unicode maps characters into their corresponding code points, i.e. a numeric value that represents that character.  A character encoding scheme then dictates how each code point is represented as a series of bits so that it can be stored in memory or on disk.

UTF-16 is one of the more common character encodings used to represent Unicode characters.  UTF-16 uses either 2 or 4 bytes to represent each code point.

All code points in the range of 0 to FFFF are represented directly as a 2-byte value.  This set of code points is known as the Basic Multilingual Plane (BMP).

Code points in the BMP are defined only within the following two ranges (hex values):

  • 0000 – D7FF
  • E000 – FFFF

This results in a total of 63,488 characters that can be represented.  This first set of values is known as the Basic Multilingual Plane (BMP).  In actuality, around 55,000 code points are currently defined in the BMP.

Advertisement

#995 – Unicode Code Points

Unicode maps characters into their corresponding code points, i.e. a numeric value that represents that character.  The full set of possible Unicode values ranges from 0 to 10FFFF (hex), for a total of 1,114,112 possible values.

The full set of possible Unicode values is divided into 17 sets of 65,536 values, known as planes.  Each plane is identified numerically by the first 5 bits of the code point.  For example:

  • Plane 0: 000000 – 00FFFF
  • Plane 1: 010000 – 01FFFF
  • etc.
  • Plane 16: 100000 – 10FFFF

Plane 0, also known as the Basic Multilingual Plane, contains code points representing characters in most of the world’s alphabets.  This include Latin alphabets, as well as Asian scripts like CJK, Katakana and Hiragana.

Other planes contain less common characters.  For example, plane 1, known as the Supplementary Multilingual Plane (SMP) contains characters from ancient alphabets, hieroglyphics, and musical symbols.

The Unicode standard has currently only defined a mapping for about 10% of all of the possible code point values.