#1,007 – Getting Length of String that Contains Surrogate Pairs

You can use the string.Length property to get the length (number of characters) of a string.  This only works as expected, however, when every character in the string is a Unicode code point no larger than U+FFFF.  This set of code points is known as the Basic Multilingual Plane (BMP).

Unicode code points outside of the BMP are represented in UTF-16 using 4-byte surrogate pairs, rather than 2 bytes.

To correctly count the number of characters in a string that may contain code points higher than U+FFFF, you can use the StringInfo class (from System.Globalization).

            // 3 Latin (ASCII) characters
            string simple = "abc";

            // 3 character string where one character
            //  is a surrogate pair
            string containsSurrogatePair = "A𠈓C";

            // Length=3 (correct)
            Console.WriteLine(string.Format("Length 1 = {0}", simple.Length));

            // Length=4 (not quite correct)
            Console.WriteLine(string.Format("Length 2 = {0}", containsSurrogatePair.Length));

            // Better, reports Length=3
            StringInfo si = new StringInfo(containsSurrogatePair);
            Console.WriteLine(string.Format("Length 3 = {0}", si.LengthInTextElements));


#1,006 – Getting the Length of a String

You can use the string.Length property to get an integer representing the length of a string.  In most cases, length means the number of characters.

            // 3 Latin (ASCII) characters
            string simple = "abc";
            // 2 other characters: U+0100, U+4E01
            string other = "Ā丁";

            Console.WriteLine(string.Format("Length 1 = {0}", simple.Length));
            Console.WriteLine(string.Format("Length 2 = {0}", other.Length));


It’s important to note that the Length property returns the number of individual Char objects that make up the string.  This can be different from the actual number of Unicode characters.
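
For example, a string containing a single character outside the BMP reports a Length of 2, since that character is stored as a surrogate pair of two Char objects.  A minimal sketch, using the same U+20213 ideograph as in the post above:

            // One Unicode character, but two Char objects (a surrogate pair)
            string s = "\U00020213";

            // Outputs 2
            Console.WriteLine(s.Length);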

#1,003 – Accessing Underlying Bytes in a String for Different Encodings

Strings in .NET are stored in memory as Unicode character data, using the UTF-16 encoding (2 bytes per character, or 4 bytes for characters represented as surrogate pairs).

If you want access to the underlying data for a string in memory, you can use one of the methods listed below, indicating which encoding to use when converting the Unicode data to a byte array.  If you use Encoding.Unicode, you’ll get the data exactly as it is stored in memory for the String type.

  • System.Text.Encoding.Unicode.GetBytes – UTF-16
  • System.Text.Encoding.UTF8.GetBytes – UTF-8

In the example below, notice the different byte sequences used to encode the CJK character.

            string ideograph = "𠈓";
            byte[] utf16 = Encoding.Unicode.GetBytes(ideograph);
            byte[] utf8 = Encoding.UTF8.GetBytes(ideograph);
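
To see the actual byte values, you can dump each array to the console.  A small sketch; the expected output in the comments assumes the same U+20213 ideograph used above:

            // UTF-16 (little-endian): 40-D8-13-DE  (the surrogate pair D840 DE13)
            Console.WriteLine(BitConverter.ToString(utf16));

            // UTF-8: F0-A0-88-93
            Console.WriteLine(BitConverter.ToString(utf8));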


#1,002 – Specifying Character Encoding when Writing to a File

In .NET, string data is stored in memory as Unicode data encoded as UTF-16 (2 bytes per character, or 4 bytes for surrogate pairs).

When you persist string data out to a file, however, you must be aware of what encoding is being used.  In the example below, we use a StreamWriter to write string data to a file.  By default, StreamWriter uses UTF-8 as the encoding.

            string s1 = "A";             // U+0041
            string s2 = "\u00e9";        // U+00E9 accented e
            string s3 = "\u0100";        // Capital A with bar
            string s4 = "\U00020213";    // CJK ideograph (d840, de13 surrogate)

            using (StreamWriter sw = new StreamWriter(@"C:\Users\Sean\Documents\sometext.txt"))
            {
                sw.WriteLine(s1);
                sw.WriteLine(s2);
                sw.WriteLine(s3);
                sw.WriteLine(s4);
            }


We could also explicitly specify a UTF-16 encoding (Encoding.Unicode) when creating the StreamWriter object.

            using (StreamWriter sw = new StreamWriter(@"C:\Users\Sean\Documents\sometext.txt", false, Encoding.Unicode))
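
When reading the file back in, you'd specify the same encoding on the StreamReader.  A minimal sketch, assuming the same file path as above (in practice, StreamReader can often detect the encoding automatically from the byte order mark at the start of the file):

            using (StreamReader sr = new StreamReader(@"C:\Users\Sean\Documents\sometext.txt", Encoding.Unicode))
            {
                string line;
                while ((line = sr.ReadLine()) != null)
                {
                    Console.WriteLine(line);
                }
            }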


#1,001 – Representing Unicode Surrogate Pairs

UTF-16 encodes Unicode code points above U+FFFF using surrogate pairs that take up 4 bytes.

You can specify a surrogate pair within a string literal by inserting the character directly into the string (provided that you have a keyboard that can insert the character):

            string myString = "𠈓";   // CJK Ideograph

You can also represent the surrogate pair within a string literal using the \Unnnnnnnn syntax (8 hex digits, specifying the code point as a 32-bit value) or the \unnnn\unnnn syntax to specify the two encoded surrogate values.

            string s1 = "\U00020213";    // Code point U+20213
            string s2 = "\uD840\uDE13";  // Surrogate pair

Note that because a surrogate pair requires more than 2 bytes, you cannot represent a surrogate pair within a single character (System.Char) literal.
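
You can, however, build the surrogate pair programmatically from a code point (and go back the other way) using methods on System.Char.  A small sketch, again using U+20213:

            // Build the 2-character string for code point U+20213 ("\uD840\uDE13")
            string s = char.ConvertFromUtf32(0x20213);

            // Recover the code point from the surrogate pair; returns 0x20213
            int codePoint = char.ConvertToUtf32(s, 0);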

#1,000 – UTF-8 and ASCII

UTF-8 is a character encoding scheme for Unicode character data that uses from 1 to 4 bytes to represent each character, depending on the code point of the character to be represented.

In Unicode, code points for ASCII characters are equivalent to the ASCII code for that character.

This mapping is true for all 128 ASCII codes.  

UTF-8 encoding maps these first 128 characters in the set of Unicode code points to a single byte containing the code point.  Because of this:

Characters included in the ASCII character set that are present in a stream of UTF-8 encoded character data will appear the same as if they were encoded as ASCII.

This means that a UTF-8 encoded stream of ASCII characters will be identical to an ASCII-encoded stream of the same characters.  I.e., for English-language text that uses only ASCII characters, UTF-8 is identical to ASCII.
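
A quick way to confirm this from C# (a sketch; SequenceEqual requires System.Linq):

            string english = "Hello, world";

            byte[] asciiBytes = Encoding.ASCII.GetBytes(english);
            byte[] utf8Bytes = Encoding.UTF8.GetBytes(english);

            // Outputs True - the byte sequences are identical for ASCII-only text
            Console.WriteLine(asciiBytes.SequenceEqual(utf8Bytes));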

#999 – Some Examples of UTF-16 and UTF-8 Encoding

Unicode maps characters into their corresponding code points, i.e. a numeric value that represents that character.  A character encoding scheme then dictates how each code point is represented as a series of bits so that it can be stored in memory or on disk.  UTF-16 and UTF-8 are the most commonly used encoding schemes for Unicode character data.

Below are some examples of how various characters would be encoded in UTF-16 and UTF-8.

  • Latin capital ‘A’, code point U+0041
    • UTF-16: 2 bytes, 00 41  (hex)
    • UTF-8: 1 byte, 41 (hex)
  • Latin lowercase ‘é’ with acute accent, code point U+00E9
    • UTF-16: 2 bytes, 00 E9 (hex)
    • UTF-8: 2 bytes, C3 A9 (hex)  [110x xxxx 10xx xxxx]
  • Mongolian letter A, U+1820
    • UTF-16: 2 bytes, 18 20 (hex)
    • UTF-8: 3 bytes, E1 A0 A0 (hex)  [1110 xxxx 10xx xxxx 10xx xxxx]
  • Ace of Spades playing card character, U+1F0A1
    • UTF-16: 4 bytes, D8 3C DC A1
    • UTF-8: 4 bytes, F0 9F 82 A1  [1111 0xxx 10xx xxxx 10xx xxxx 10xx xxxx]
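
If you want to check byte sequences like these yourself, the Encoding classes will produce them.  A small sketch; Encoding.BigEndianUnicode is used so that the UTF-16 bytes appear in the same order as listed above:

            // é (U+00E9): 00-E9 in UTF-16, C3-A9 in UTF-8
            Console.WriteLine(BitConverter.ToString(Encoding.BigEndianUnicode.GetBytes("é")));
            Console.WriteLine(BitConverter.ToString(Encoding.UTF8.GetBytes("é")));

            // Ace of Spades (U+1F0A1): D8-3C-DC-A1 in UTF-16, F0-9F-82-A1 in UTF-8
            string ace = char.ConvertFromUtf32(0x1F0A1);
            Console.WriteLine(BitConverter.ToString(Encoding.BigEndianUnicode.GetBytes(ace)));
            Console.WriteLine(BitConverter.ToString(Encoding.UTF8.GetBytes(ace)));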

#998 – UTF-8 Encoding

Unicode maps characters into their corresponding code points, i.e. a numeric value that represents that character.  A character encoding scheme then dictates how each code point is represented as a series of bits so that it can be stored in memory or on disk.

UTF-8 is the most common character encoding when transferring data over the web, i.e. via XML or HTML.  UTF-8 uses from 1 to 4 bytes to represent each code point.

The full range of Unicode code points, from U+0000 through U+10FFFF, is encoded with UTF-8 as follows:

  • U+0000 to U+007F: 1 byte, storing code point exactly (identical to ASCII)
  • U+0080 to U+07FF: 2 bytes for 11 bits – 110x xxxx, 10xx xxxx
  • U+0800 to U+FFFF: 3 bytes for 16 bits – 1110 xxxx, 10xx xxxx, 10xx xxxx
  • U+10000 to U+10FFFF: 4 bytes for 21 bits – 1111 0xxx, 10xx xxxx, 10xx xxxx, 10xx xxxx

English text requires 1 byte per character.
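
You can see the 1/2/3/4-byte pattern from C# using Encoding.UTF8.GetByteCount.  A small sketch, using one character from each range:

            Console.WriteLine(Encoding.UTF8.GetByteCount("A"));           // 1  (U+0041)
            Console.WriteLine(Encoding.UTF8.GetByteCount("\u00E9"));      // 2  (U+00E9)
            Console.WriteLine(Encoding.UTF8.GetByteCount("\u1820"));      // 3  (U+1820)
            Console.WriteLine(Encoding.UTF8.GetByteCount("\U0001F0A1"));  // 4  (U+1F0A1)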

#997 – UTF-16 Encoding, Part II

Unicode maps characters into their corresponding code points, i.e. a numeric value that represents that character.  A character encoding scheme then dictates how each code point is represented as a series of bits so that it can be stored in memory or on disk.

Recall that UTF-16 encoding uses either 2 or 4 bytes to represent each code point.

Code points larger than U+FFFF are represented using 4 bytes, known as a surrogate pair.

  • The first two bytes (lead surrogate) are in the range D800 – DBFF
  • The third and fourth bytes (trail surrogate) are in the range DC00 – DFFF

These surrogate pairs in turn represent code points in the range U+10000 – U+10FFFF.  This represents 1,048,576 additional values that can be encoded as surrogate pairs.  (Though the Unicode standard currently defines only a small subset of these values).
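
The mapping works by subtracting 0x10000 from the code point, which leaves a 20-bit value; the high 10 bits (added to 0xD800) become the lead surrogate and the low 10 bits (added to 0xDC00) become the trail surrogate.  A small sketch of that arithmetic, using U+1F0A1 as an example:

            int codePoint = 0x1F0A1;
            int offset = codePoint - 0x10000;                  // 20-bit value 0xF0A1

            char lead = (char)(0xD800 + (offset >> 10));       // 0xD83C
            char trail = (char)(0xDC00 + (offset & 0x3FF));    // 0xDCA1

            // Outputs True - the pair decodes back to the original code point
            string s = new string(new[] { lead, trail });
            Console.WriteLine(char.ConvertToUtf32(s, 0) == codePoint);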


#996 – UTF-16 Encoding, Part I

Unicode maps characters into their corresponding code points, i.e. a numeric value that represents that character.  A character encoding scheme then dictates how each code point is represented as a series of bits so that it can be stored in memory or on disk.

UTF-16 is one of the more common character encodings used to represent Unicode characters.  UTF-16 uses either 2 or 4 bytes to represent each code point.

All code points in the range of 0 to FFFF are represented directly as a 2-byte value.  This set of code points is known as the Basic Multilingual Plane (BMP).

Code points in the BMP are defined only within the following two ranges (hex values):

  • 0000 – D7FF
  • E000 – FFFF

This results in a total of 63,488 code points that can be represented directly as 2-byte values.  In actuality, around 55,000 code points are currently defined in the BMP.
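
The gap between D7FF and E000 exists because values D800 through DFFF are reserved for the lead and trail surrogate values used to encode code points above U+FFFF.  You can confirm this from C# with char.IsSurrogate (a small sketch):

            Console.WriteLine(char.IsSurrogate('\uD800'));   // True - reserved surrogate value
            Console.WriteLine(char.IsSurrogate('\uDFFF'));   // True - reserved surrogate value
            Console.WriteLine(char.IsSurrogate('\uE000'));   // False - an ordinary BMP code point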