#1,007 – Getting Length of String that Contains Surrogate Pairs

You can use the string.Length property to get the length (number of characters) of a string.  However, Length actually counts UTF-16 code units (Char values), so it matches the number of characters only when every code point in the string is no larger than U+FFFF.  This set of code points is known as the Basic Multilingual Plane (BMP).

Unicode code points outside of the BMP are represented in UTF-16 as a surrogate pair: two 2-byte Char values (4 bytes in total) rather than a single 2-byte Char.  Length therefore counts each such code point as 2.

To correctly count the number of characters in a string that may contain code points higher than U+FFFF, you can use the StringInfo class (from System.Globalization).  Its LengthInTextElements property counts text elements, so a surrogate pair is counted once.

            // 3 Latin (ASCII) characters
            string simple = "abc";

            // 3 character string where one character
            //  is a surrogate pair
            string containsSurrogatePair = "A𠈓C";

            // Length=3 (correct)
            Console.WriteLine("Length 1 = {0}", simple.Length);

            // Length=4 (the surrogate pair counts as 2 Chars)
            Console.WriteLine("Length 2 = {0}", containsSurrogatePair.Length);

            // Better, reports Length=3 (needs: using System.Globalization;)
            StringInfo si = new StringInfo(containsSurrogatePair);
            Console.WriteLine("Length 3 = {0}", si.LengthInTextElements);
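
A text element is not always a single code point: StringInfo also groups a base character with the combining marks that follow it. If you specifically need the number of code points, one possible approach is to walk the string and skip the second half of each surrogate pair. The sketch below assumes a hypothetical helper named CountCodePoints (it is not part of .NET) and relies only on char.IsSurrogatePair.

```csharp
using System;
using System.Globalization;

public static class StringLengths
{
    // Hypothetical helper (not a .NET API): counts Unicode code points,
    // so a valid surrogate pair is counted once.
    public static int CountCodePoints(string s)
    {
        int count = 0;
        for (int i = 0; i < s.Length; i++)
        {
            // If positions i and i+1 form a high/low surrogate pair,
            // skip the low surrogate so it is not counted again.
            if (char.IsSurrogatePair(s, i))
                i++;
            count++;
        }
        return count;
    }

    public static void Main()
    {
        // Same string as above: one character is a surrogate pair
        string pair = "A𠈓C";
        Console.WriteLine(pair.Length);                                // 4
        Console.WriteLine(CountCodePoints(pair));                      // 3
        Console.WriteLine(new StringInfo(pair).LengthInTextElements);  // 3

        // 'e' followed by U+0301 (combining acute accent)
        string combined = "e\u0301";
        Console.WriteLine(combined.Length);                                // 2
        Console.WriteLine(CountCodePoints(combined));                      // 2
        Console.WriteLine(new StringInfo(combined).LengthInTextElements);  // 1
    }
}
```

Which count is the right one depends on what you mean by "character": code points are typically what matters when converting between encodings, while text elements are closer to what a user sees on screen.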

#1,006 – Getting the Length of a String

You can use the string.Length property to get an integer representing the length of a string.  In most cases, the length is the number of characters.

            // 3 Latin (ASCII) characters
            string simple = "abc";
            // 2 other characters: U+0100, U+4E01
            string other = "Ā丁";

            Console.WriteLine("Length 1 = {0}", simple.Length);
            Console.WriteLine("Length 2 = {0}", other.Length);

Note that the Length property returns the number of individual Char objects (UTF-16 code units) that make up the string.  This can differ from the actual number of Unicode characters, for example when the string contains a surrogate pair.
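
To see the individual Char values that Length is counting, you can print each one in hexadecimal. This is a minimal sketch: ToCodeUnits is a hypothetical helper name, and U+1D11E (musical symbol G clef) is just one convenient example of a code point outside the BMP.

```csharp
using System;

public static class CharUnits
{
    // Hypothetical helper: formats each Char (UTF-16 code unit)
    // of s as four hexadecimal digits, separated by spaces.
    public static string ToCodeUnits(string s)
    {
        string[] units = new string[s.Length];
        for (int i = 0; i < s.Length; i++)
            units[i] = ((int)s[i]).ToString("X4");
        return string.Join(" ", units);
    }

    public static void Main()
    {
        // BMP characters: one Char each
        Console.WriteLine(ToCodeUnits("Ā丁"));   // 0100 4E01

        // U+1D11E is stored as a high and a low surrogate
        string clef = "\U0001D11E";
        Console.WriteLine(clef.Length);          // 2
        Console.WriteLine(ToCodeUnits(clef));    // D834 DD1E
    }
}
```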