#995 – Unicode Code Points

Unicode maps each character to a corresponding code point, i.e. a numeric value that represents that character.  The full set of possible Unicode values ranges from 0 to 10FFFF (hex), for a total of 1,114,112 possible values.

The full set of possible Unicode values is divided into 17 sets of 65,536 values, known as planes.  Each plane is identified numerically by the upper 5 bits of the 21-bit code point, i.e. the first two hex digits when the code point is written with six hex digits.  For example:

  • Plane 0: 000000 – 00FFFF
  • Plane 1: 010000 – 01FFFF
  • etc.
  • Plane 16: 100000 – 10FFFF
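Since each plane spans 65,536 (2^16) values, the plane number can be recovered by shifting a code point right by 16 bits.  Below is a minimal sketch of that idea; the class and method names (UnicodePlanes, PlaneOf) are made up for illustration and are not part of the BCL.

```csharp
using System;

public static class UnicodePlanes
{
    // Hypothetical helper: returns the plane number (0-16) of a code point.
    public static int PlaneOf(int codePoint)
    {
        if (codePoint < 0 || codePoint > 0x10FFFF)
            throw new ArgumentOutOfRangeException(nameof(codePoint));

        // Dropping the low 16 bits leaves the plane number.
        return codePoint >> 16;
    }

    public static void Main()
    {
        Console.WriteLine(PlaneOf(0x0041));    // 0  (Latin capital letter A)
        Console.WriteLine(PlaneOf(0x1D11E));   // 1  (musical symbol G clef)
        Console.WriteLine(PlaneOf(0x10FFFF));  // 16 (last possible code point)
    }
}
```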

Plane 0, also known as the Basic Multilingual Plane (BMP), contains code points representing characters in most of the world’s alphabets.  This includes Latin alphabets, as well as Asian scripts like CJK, Katakana and Hiragana.

Other planes contain less common characters.  For example, plane 1, known as the Supplementary Multilingual Plane (SMP), contains characters from ancient alphabets, hieroglyphics, and musical symbols.

At the time of writing, the Unicode standard has assigned characters to only about 10% of the possible code point values.

#994 – Unicode Basics

To store alphabetic characters or other written characters on a computer system, either in memory or on disk, we need to encode each character so that we can store some numeric value that represents the character.  The numeric values are, ultimately, just bit patterns–where each bit pattern represents some character.

Unicode is a standard that specifies methods for encoding all characters from the written languages of the world.  This includes the ability to encode more than 1 million unique characters.  Unicode is the standard used for all web-based traffic (HTML and XML) and for storing character data on most modern operating systems (e.g. Windows, OS X, Unix).

The Unicode standard defines a number of different character encodings.  The most common are:

  • UTF-8 – Variable number of bytes used, from 1-4 bytes.  ASCII characters, including unaccented English letters, use only 1 byte.
  • UTF-16 – Uses 2 bytes for most common characters, 4 bytes for other characters.
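The difference in size between the two encodings can be checked with the Encoding class in System.Text.  The sketch below counts the bytes needed for a few sample characters; the helper names (EncodingSizes, Utf8Bytes, Utf16Bytes) are invented for this example.

```csharp
using System;
using System.Text;

public static class EncodingSizes
{
    // Number of bytes needed to store s as UTF-8.
    public static int Utf8Bytes(string s)
    {
        return Encoding.UTF8.GetByteCount(s);
    }

    // Number of bytes needed to store s as UTF-16 (Encoding.Unicode).
    public static int Utf16Bytes(string s)
    {
        return Encoding.Unicode.GetByteCount(s);
    }

    public static void Main()
    {
        // A, c with cedilla, euro sign, and one character outside plane 0
        string[] samples = { "A", "\u00E7", "\u20AC", "\U00020213" };

        foreach (string s in samples)
        {
            Console.WriteLine("U+{0:X4}: UTF-8 = {1} bytes, UTF-16 = {2} bytes",
                char.ConvertToUtf32(s, 0), Utf8Bytes(s), Utf16Bytes(s));
        }
    }
}
```

For these four samples, UTF-8 needs 1, 2, 3 and 4 bytes respectively, while UTF-16 needs 2, 2, 2 and 4 bytes.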

#993 – Some Examples of Characters

Given that the System.Char data type (char) can store any Unicode character that fits in a single 16-bit UTF-16 value (i.e. any character in the range U+0000 to U+FFFF), below are some examples of valid characters.

Here is the list that we bind to, to display some characters:

            CharList = new ObservableCollection<char>
            {
                'a', 'Z', '.', '$', '\x00e1', '\x00e7', '\x0398', '\x03a9',
                '\x0429', '\x046c', '\x05d4', '\x05da', '\x0626', '\x0643',
                '\x0b92', '\x0caf', '\x0e0d', '\x1251', '\x13ca', '\x14cf',
                '\x21bb', '\x21e9', '\x222d', '\x2285', '\x2466', '\x2528',
                '\x2592', '\x265a', '\x265e', '\x2738', '\x3062', '\x3063',
                '\x3082', '\x3071', '\x3462', '\x3463', '\x3464', '\x3485',
                '\x5442', '\x54e1', '\x5494', '\xac21', '\xae25', '\xfc45',
                '\xfce8'
            };


#992 – The System.Char Data Type

In C#, the char keyword is a synonym for the System.Char data type in the Base Class Library (BCL).  An instance of a char represents a single character, i.e. a single unit that is part of a written language.

Some examples of characters, all of which can be represented by a char:

  • Uppercase or lowercase alphabetic characters from the English (Latin) alphabet (a..z, A..Z)
  • Accented alphabetic characters
  • Alphabetic characters from other alphabets (e.g. Greek, Cyrillic, Hebrew, Chinese, Arabic, etc).
  • Punctuation marks  (e.g. ! , = [ {, etc).
  • Numeric digits (0..9)
  • Control or formatting characters (e.g. end-of-line, delete)
  • Mathematical and musical symbols
  • Special graphical glyphs (e.g. trademark symbol, smiley face)

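A character stored in an instance of a char takes up 2 bytes (16 bits).  The values are encoded as Unicode characters, using the UTF-16 encoding.  A character outside the range U+0000 to U+FFFF does not fit in a single char and is stored as two char values.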

#773 – Reversing a String that Contains Unicode Characters Expressed as Surrogate Pairs

You can use the Reverse extension method (from System.Linq) to reverse the characters in a .NET-based string.  This method works if the string contains only Unicode characters that can each be expressed as a single 2-byte UTF-16 value.  This subset of Unicode is known as the Basic Multilingual Plane (BMP) and is able to represent 65,536 unique code points.

UTF-16 can represent Unicode code points outside the BMP through the use of surrogate pairs.  Within a series of 16-bit values, a character outside the BMP is stored as a pair of 16-bit values: a high surrogate (D800 to DBFF) followed by a low surrogate (DC00 to DFFF).
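As a rough illustration of how a surrogate pair is derived, the sketch below subtracts 0x10000 from the code point and splits the remaining 20 bits into two 10-bit halves.  The class and method names (SurrogateMath, ToSurrogatePair) are hypothetical; in real code you would normally just call char.ConvertFromUtf32, which the example uses as a cross-check.

```csharp
using System;

public static class SurrogateMath
{
    // Hypothetical helper: encodes a code point in U+10000..U+10FFFF
    // as a two-char string (high surrogate, then low surrogate).
    public static string ToSurrogatePair(int codePoint)
    {
        if (codePoint < 0x10000 || codePoint > 0x10FFFF)
            throw new ArgumentOutOfRangeException(nameof(codePoint));

        int offset = codePoint - 0x10000;               // 20-bit value
        char high = (char)(0xD800 + (offset >> 10));    // upper 10 bits
        char low = (char)(0xDC00 + (offset & 0x3FF));   // lower 10 bits

        return new string(new[] { high, low });
    }

    public static void Main()
    {
        string pair = ToSurrogatePair(0x20213);
        Console.WriteLine("{0:X4} {1:X4}", (int)pair[0], (int)pair[1]);  // D840 DE13

        // Cross-check against the framework's own conversion.
        Console.WriteLine(pair == char.ConvertFromUtf32(0x20213));       // True
    }
}
```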

In practice, it’s fairly uncommon to encounter Unicode characters outside of the BMP, given that this plane can represent characters from most living languages.  Emoji are a notable exception, since most of them are encoded outside the BMP.

To reverse a string that contains surrogate pairs, you can use the Microsoft.VisualBasic.Strings.StrReverse method.

            // 8 byte string includes surrogate pair
            string s = "A𠈓C";

            // Won't handle surrogate pair
            string s2 = new string(s.Reverse().ToArray());

            // Will handle surrogate pair
            string s3 = Microsoft.VisualBasic.Strings.StrReverse(s);
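If you would rather not reference the Microsoft.VisualBasic assembly, one alternative is to split the string into text elements using System.Globalization.StringInfo and reverse those instead of individual char values.  This is only a sketch of that approach, not the way StrReverse is implemented; the TextElementReverser class and its Reverse method are names made up for this example.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

public static class TextElementReverser
{
    // Hypothetical helper: reverses a string one text element at a time,
    // so that a surrogate pair is kept together as a single unit.
    public static string Reverse(string s)
    {
        var elements = new List<string>();

        TextElementEnumerator e = StringInfo.GetTextElementEnumerator(s);
        while (e.MoveNext())
        {
            elements.Add(e.GetTextElement());
        }

        elements.Reverse();   // List<T>.Reverse, reverses in place
        return string.Concat(elements);
    }

    public static void Main()
    {
        string s = "A\U00020213C";   // same string as "A𠈓C"
        string reversed = Reverse(s);

        // The surrogate pair survives the reversal intact.
        Console.WriteLine(reversed == "C\U00020213A");   // True
    }
}
```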


#476 – Don’t Use Unicode Escape Sequences in Identifiers

Although you can embed Unicode escape sequences within identifiers in a C# program, you shouldn’t.  Including escape sequences makes the identifiers in the code less readable.

You can instead type the character itself for any Unicode character in the range of U+0000 to U+FFFF, and the glyph will show up correctly in Visual Studio if you have the proper fonts installed.  You can enter these characters directly using the appropriate international keyboard.

In the example below, we declare an identifier that includes the letter ‘c’ with a “cedilla”, using the escape sequence for U+00E7, the Unicode value of that letter.

            string fa\u00E7ade = "Front of building";

This is valid C# code, but should be avoided. Instead, you can insert the glyph directly into the code:

            string façade = "Front of building";

The compiler treats these as the same identifier.  We get a compiler error if we include both definitions in the same scope, because the name is already defined.

#474 – You Can Include Unicode Characters in Your Source Code

You can enter Unicode characters directly into your source code using the editor in Visual Studio, e.g. by copying and pasting them or by typing them on a keyboard that includes the character.  These characters will be stored and interpreted correctly, if contained in a comment or a string literal.  (C# keywords themselves use only ASCII characters).

You can also enter a hex code for the Unicode character, using an escape sequence, rather than entering the character directly.

        static void Main()
        {
            string money = "€";
            string sameThing = "\u20AC";
        }

#25 – String Literals

A string literal in C# represents a sequence of Unicode characters.  There are two main types of string literals in C#–regular string literals and verbatim string literals.  Verbatim string literals allow including special characters in a string directly, rather than having to specify them using an escape sequence.  All string literals are of type string.

Here are some examples of string literals:

 string s1 = "Hi";
 string s2 = @"Hi";       // Verbatim string literal--same thing
 string s3 = "C:\\Dir";   // C:\Dir  (escape seq for backslash)
 string s4 = @"C:\Dir";   // No escape seq required
 string s5 = "\x48\x69";  // Hi  (hex codes for each character)
 string s6 = "\x20AC 1.99";  // € 1.99
 string s7 = "€ 1.99";    // Unicode directly in string

 // Characters outside the BMP, stored as surrogate pairs
 string s8 = "\U00020213";      // U+20213 (UTF-32)
 string s9 = "\ud840\ude13";    // Equiv surrogate pair
 string s10 = "𠈓";             // Same character

Note:

  • \u is followed by exactly 4 hex digits, giving a single 16-bit UTF-16 value
  • \U is followed by exactly 8 hex digits, giving a 32-bit (UTF-32) code point, which is then converted to the equivalent surrogate pair if it lies outside the BMP
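A quick way to convince yourself of the equivalence is to compare the literals at run time.  The sketch below does that; the EscapeDemo class and its constant names are invented for illustration.

```csharp
using System;

public static class EscapeDemo
{
    // The same character, written two different ways.
    public const string ViaBigU = "\U00020213";   // 8 hex digits
    public const string ViaPair = "\ud840\ude13"; // two 4-digit escapes

    public static void Main()
    {
        Console.WriteLine(ViaBigU == ViaPair);   // True
        Console.WriteLine(ViaBigU.Length);       // 2 (two char values)

        // Recover the original code point from the surrogate pair.
        Console.WriteLine(char.ConvertToUtf32(ViaBigU, 0).ToString("X"));  // 20213

        // For a 4-digit code, \u and \x produce the same character.
        Console.WriteLine("\u20AC" == "\x20AC"); // True
    }
}
```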