#1,007 – Getting Length of String that Contains Surrogate Pairs

You can use the string.Length property to get the length of a string.  Length actually returns the number of char values (16-bit UTF-16 code units), so it matches the number of characters only when every code point in the string is no larger than U+FFFF.  This set of code points is known as the Basic Multilingual Plane (BMP).

Unicode code points outside of the BMP are represented in UTF-16 using a 4-byte surrogate pair (two 16-bit code units), rather than a single 2-byte code unit.
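To see the surrogate pair directly, you can walk the individual char values of a string.  The following is a minimal sketch; the SurrogateInspector class and DescribeCodeUnits method are names made up for this illustration, while char.IsHighSurrogate and char.IsLowSurrogate are standard .NET methods.

```csharp
using System;

static class SurrogateInspector
{
    // Hypothetical helper: labels each UTF-16 code unit (char) in a string.
    public static string[] DescribeCodeUnits(string s)
    {
        var result = new string[s.Length];
        for (int i = 0; i < s.Length; i++)
        {
            string kind = char.IsHighSurrogate(s[i]) ? "high surrogate"
                        : char.IsLowSurrogate(s[i]) ? "low surrogate"
                        : "BMP character";
            result[i] = string.Format("[{0}] U+{1:X4} {2}", i, (int)s[i], kind);
        }
        return result;
    }

    static void Main()
    {
        // Three characters, four code units: the middle character
        // occupies indexes 1 (high surrogate) and 2 (low surrogate)
        foreach (string line in DescribeCodeUnits("A𠈓C"))
            Console.WriteLine(line);
    }
}
```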

To correctly count the number of characters in a string that may contain code points higher than U+FFFF, you can use the StringInfo class (from System.Globalization).

            // 3 Latin (ASCII) characters
            string simple = "abc";

            // 3 character string where one character
            //  is a surrogate pair
            string containsSurrogatePair = "A𠈓C";

            // Length=3 (correct)
            Console.WriteLine(string.Format("Length 1 = {0}", simple.Length));

            // Length=4 (counts both halves of the surrogate pair)
            Console.WriteLine(string.Format("Length 2 = {0}", containsSurrogatePair.Length));

            // Better, reports Length=3
            StringInfo si = new StringInfo(containsSurrogatePair);
            Console.WriteLine(string.Format("Length 3 = {0}", si.LengthInTextElements));
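Note that StringInfo counts text elements, which is not always the same as counting code points: a base character followed by a combining mark is a single text element but two code points.  If you need the code point count instead, one possible approach is sketched below.  The TextLength class and CountCodePoints method are hypothetical names used only for this example.

```csharp
using System;
using System.Globalization;

static class TextLength
{
    // Hypothetical helper: counts Unicode code points,
    // treating each surrogate pair as a single code point.
    public static int CountCodePoints(string s)
    {
        int count = 0;
        for (int i = 0; i < s.Length; i++)
        {
            if (char.IsSurrogatePair(s, i))
                i++;   // skip the low surrogate of the pair
            count++;
        }
        return count;
    }

    static void Main()
    {
        // 'e' followed by U+0301 (combining acute accent)
        string accented = "e\u0301";

        Console.WriteLine(accented.Length);                                // 2
        Console.WriteLine(CountCodePoints(accented));                      // 2
        Console.WriteLine(new StringInfo(accented).LengthInTextElements);  // 1
    }
}
```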


#773 – Reversing a String that Contains Unicode Characters Expressed as Surrogate Pairs

You can use the Reverse extension method (from System.Linq) to reverse the characters in a .NET string.  This method works if every character in the string can be expressed as a single 2-byte UTF-16 code unit.  This subset of Unicode is known as the Basic Multilingual Plane (BMP) and contains 65,536 code points.

UTF-16 can represent Unicode code points outside the BMP through the use of surrogate pairs.  Within a series of 16-bit code units, a code point above U+FFFF is stored as a pair of 16-bit values: a high surrogate followed by a low surrogate.
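The arithmetic behind a surrogate pair can be sketched as follows.  The SurrogateMath class and ToSurrogatePair method are hypothetical names, and U+1F600 is simply a sample code point outside the BMP chosen for this illustration; in real code you would normally call char.ConvertFromUtf32 rather than do this by hand.

```csharp
using System;

static class SurrogateMath
{
    // Hypothetical helper: splits a code point in the range
    // U+10000..U+10FFFF into its UTF-16 high and low surrogates.
    public static char[] ToSurrogatePair(int codePoint)
    {
        if (codePoint < 0x10000 || codePoint > 0x10FFFF)
            throw new ArgumentOutOfRangeException("codePoint");

        int offset = codePoint - 0x10000;               // 20-bit value
        char high = (char)(0xD800 + (offset >> 10));    // top 10 bits
        char low = (char)(0xDC00 + (offset & 0x3FF));   // bottom 10 bits
        return new[] { high, low };
    }

    static void Main()
    {
        char[] pair = ToSurrogatePair(0x1F600);

        // D83D DE00
        Console.WriteLine("{0:X4} {1:X4}", (int)pair[0], (int)pair[1]);

        // Matches the built-in conversion: True
        Console.WriteLine(new string(pair) == char.ConvertFromUtf32(0x1F600));
    }
}
```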

In practice, characters outside of the BMP are less common than BMP characters, given that this plane can represent characters from most living languages.  They do occur, however, for example in emoji and rarer CJK ideographs.

To reverse a string that contains surrogate pairs, you can use the Microsoft.VisualBasic.Strings.StrReverse method.

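            // 4 UTF-16 code units (8 bytes), including one surrogate pair
            string s = "A𠈓C";

            // Reverses char by char (requires System.Linq), which swaps
            //  the two halves of the surrogate pair and corrupts it
            string s2 = new string(s.Reverse().ToArray());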

            // Will handle surrogate pair
            string s3 = Microsoft.VisualBasic.Strings.StrReverse(s);
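If you would rather not reference the Microsoft.VisualBasic assembly, one alternative is to reverse the string one text element at a time using StringInfo.  This is a sketch of that idea rather than an equivalent of StrReverse; the TextReverser class and ReverseTextElements method are hypothetical names.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

static class TextReverser
{
    // Hypothetical helper: reverses a string by text element, so the
    // two halves of a surrogate pair stay together and in order.
    public static string ReverseTextElements(string s)
    {
        var elements = new List<string>();
        TextElementEnumerator e = StringInfo.GetTextElementEnumerator(s);
        while (e.MoveNext())
            elements.Add(e.GetTextElement());

        elements.Reverse();   // List<T>.Reverse, in place
        return string.Concat(elements);
    }

    static void Main()
    {
        string reversed = ReverseTextElements("A𠈓C");

        // Still 4 code units, with the surrogate pair intact in the middle
        Console.WriteLine(reversed.Length);
        Console.WriteLine(char.IsSurrogatePair(reversed, 1));
    }
}
```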
