#1,000 – UTF-8 and ASCII

UTF-8 is a character encoding scheme for Unicode character data that uses from 1 to 4 bytes to represent each character, depending on the character's code point.

In Unicode, the code point for each ASCII character is equal to that character's ASCII code.

This mapping is true for all 128 ASCII codes.  

UTF-8 encoding maps each of these first 128 Unicode code points to a single byte containing the code point value.  Because of this, characters from the ASCII character set that appear in a stream of UTF-8 encoded character data look exactly the same as they would if they had been encoded as ASCII.

This means that a UTF-8 encoded stream of ASCII characters is byte-for-byte identical to an ASCII-encoded stream of the same characters.  In other words, for plain English-language text, UTF-8 is identical to ASCII.
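
As a quick check, here's a minimal sketch (assuming the System.Text namespace is available) that encodes the same ASCII-only string as both ASCII and UTF-8 and dumps the bytes:

            // Assumes: using System; using System.Text;
            string text = "Hello";

            byte[] asciiBytes = Encoding.ASCII.GetBytes(text);
            byte[] utf8Bytes = Encoding.UTF8.GetBytes(text);

            // Both lines print 48-65-6C-6C-6F
            Console.WriteLine(BitConverter.ToString(asciiBytes));
            Console.WriteLine(BitConverter.ToString(utf8Bytes));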

#999 – Some Examples of UTF-16 and UTF-8 Encoding

Unicode maps characters into their corresponding code points, i.e. a numeric value that represents that character.  A character encoding scheme then dictates how each code point is represented as a series of bits so that it can be stored in memory or on disk.  UTF-16 and UTF-8 are the most commonly used encoding schemes for Unicode character data.

Below are some examples of how various characters would be encoded in UTF-16 and UTF-8.

  • Latin capital ‘A’, code point U+0041
    • UTF-16: 2 bytes, 00 41  (hex)
    • UTF-8: 1 byte, 41 (hex)
  • Latin lowercase ‘é’ with acute accent, code point U+00E9
    • UTF-16: 2 bytes, 00 E9 (hex)
    • UTF-8: 2 bytes, C3 A9 (hex)  [110x xxxx 10xx xxxx]
  • Mongolian letter A, U+1820
    • UTF-16: 2 bytes, 18 20 (hex)
    • UTF-8: 3 bytes, E1 A0 A0 (hex)  [1110 xxxx 10xx xxxx 10xx xxxx]
  • Ace of Spades playing card character, U+1F0A1
    • UTF-16: 4 bytes, D8 3C DC A1 (hex)
    • UTF-8: 4 bytes, F0 9F 82 A1 (hex)  [1111 0xxx 10xx xxxx 10xx xxxx 10xx xxxx]
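
You can verify these byte sequences in C#.  Here's a minimal sketch (assuming the System.Text namespace) that encodes the Ace of Spades character with both schemes:

            // Assumes: using System; using System.Text;
            string ace = char.ConvertFromUtf32(0x1F0A1);   // Ace of Spades, U+1F0A1

            byte[] utf16 = Encoding.BigEndianUnicode.GetBytes(ace);
            byte[] utf8 = Encoding.UTF8.GetBytes(ace);

            Console.WriteLine(BitConverter.ToString(utf16));   // D8-3C-DC-A1
            Console.WriteLine(BitConverter.ToString(utf8));    // F0-9F-82-A1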

#998 – UTF-8 Encoding

Unicode maps characters into their corresponding code points, i.e. a numeric value that represents that character.  A character encoding scheme then dictates how each code point is represented as a series of bits so that it can be stored in memory or on disk.

UTF-8 is the most common character encoding when transferring data over the web, e.g. via XML or HTML.  UTF-8 uses from 1 to 4 bytes to represent each code point.

The full range of Unicode code points, from U+0000 through U+10FFFF, is encoded with UTF-8 as follows:

  • U+0000 to U+007F: 1 byte, storing code point exactly (identical to ASCII)
  • U+0080 to U+07FF: 2 bytes for 11 bits – 110x xxxx, 10xx xxxx
  • U+0800 to U+FFFF: 3 bytes for 16 bits – 1110 xxxx, 10xx xxxx, 10xx xxxx
  • U+10000 to U+10FFFF: 4 bytes for 21 bits – 1111 0xxx, 10xx xxxx, 10xx xxxx, 10xx xxxx

English text requires 1 byte per character.
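
Here's a minimal sketch (assuming the System.Text namespace) that shows the byte count growing as the code point gets larger, using the characters from the earlier examples:

            // Assumes: using System; using System.Text;
            Console.WriteLine(Encoding.UTF8.GetByteCount("A"));            // 1  (U+0041)
            Console.WriteLine(Encoding.UTF8.GetByteCount("\u00E9"));       // 2  (U+00E9, é)
            Console.WriteLine(Encoding.UTF8.GetByteCount("\u1820"));       // 3  (U+1820)
            Console.WriteLine(Encoding.UTF8.GetByteCount("\U0001F0A1"));   // 4  (U+1F0A1)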

#997 – UTF-16 Encoding, Part II

Unicode maps characters into their corresponding code points, i.e. a numeric value that represents that character.  A character encoding scheme then dictates how each code point is represented as a series of bits so that it can be stored in memory or on disk.

Recall that UTF-16 encoding uses either 2 or 4 bytes to represent each code point.

Code points larger than U+FFFF are represented using 4 bytes (two 2-byte values), known as a surrogate pair.

  • The first two bytes (lead surrogate) are in the range D800 – DBFF
  • The third and fourth bytes (trail surrogate) are in the range DC00 – DFFF

These surrogate pairs in turn represent code points in the range U+10000 – U+10FFFF.  This represents 1,048,576 additional values that can be encoded as surrogate pairs.  (Though the Unicode standard currently defines only a small subset of these values).
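
Here's a minimal sketch of what this looks like in C#, where a code point above U+FFFF becomes two char values (the surrogate pair):

            string ace = char.ConvertFromUtf32(0x1F0A1);   // Ace of Spades, U+1F0A1

            bool isLead = char.IsHighSurrogate(ace[0]);    // true;  ace[0] == '\uD83C'
            bool isTrail = char.IsLowSurrogate(ace[1]);    // true;  ace[1] == '\uDCA1'

            int codePoint = char.ConvertToUtf32(ace[0], ace[1]);   // 0x1F0A1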

#996 – UTF-16 Encoding, Part I

Unicode maps characters into their corresponding code points, i.e. a numeric value that represents that character.  A character encoding scheme then dictates how each code point is represented as a series of bits so that it can be stored in memory or on disk.

UTF-16 is one of the more common character encodings used to represent Unicode characters.  UTF-16 uses either 2 or 4 bytes to represent each code point.

All code points in the range of 0 to FFFF are represented directly as a 2-byte value.  This set of code points is known as the Basic Multilingual Plane (BMP).

Code points in the BMP are defined only within the following two ranges (hex values):

  • 0000 – D7FF
  • E000 – FFFF

This results in a total of 63,488 characters that can be represented in the BMP.  In actuality, around 55,000 of these code points are currently defined.
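
As a quick illustration (a minimal sketch, assuming the System.Text namespace), a BMP character occupies exactly one 2-byte value when encoded as UTF-16:

            // Assumes: using System; using System.Text;
            char theta = '\u0398';   // Greek capital Theta, U+0398 (in the BMP)

            byte[] utf16 = Encoding.BigEndianUnicode.GetBytes(theta.ToString());
            Console.WriteLine(BitConverter.ToString(utf16));   // 03-98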

#995 – Unicode Code Points

Unicode maps characters into their corresponding code points, i.e. a numeric value that represents that character.  The full set of possible Unicode values ranges from 0 to 10FFFF (hex), for a total of 1,114,112 possible values.

The full set of possible Unicode values is divided into 17 sets of 65,536 values, known as planes.  Each plane is identified numerically by the upper 5 bits of the 21-bit code point, i.e. everything above the low 16 bits.  For example:

  • Plane 0: 000000 – 00FFFF
  • Plane 1: 010000 – 01FFFF
  • etc.
  • Plane 16: 100000 – 10FFFF

Plane 0, also known as the Basic Multilingual Plane, contains code points representing characters in most of the world’s alphabets.  This includes Latin alphabets, as well as Asian scripts like CJK ideographs, Katakana and Hiragana.

Other planes contain less common characters.  For example, plane 1, known as the Supplementary Multilingual Plane (SMP), contains characters from ancient alphabets, hieroglyphics, and musical symbols.

The Unicode standard has currently only defined a mapping for about 10% of all of the possible code point values.
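
Because the plane number is just the upper bits of the code point, you can compute it with a simple shift.  A minimal sketch:

            int codePoint = 0x1F0A1;       // Ace of Spades playing card character

            // The plane is everything above the low 16 bits of the code point
            int plane = codePoint >> 16;   // 1 (the Supplementary Multilingual Plane)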

#994 – Unicode Basics

To store alphabetic characters or other written characters on a computer system, either in memory or on disk, we need to encode each character so that we can store some numeric value that represents the character.  The numeric values are, ultimately, just bit patterns–where each bit pattern represents some character.

Unicode is a standard that specifies methods for encoding all characters from the written languages of the world.  This includes the ability to encode more than 1 million unique characters.  Unicode is the standard used for all web-based traffic (HTML and XML) and for storing character data on most modern operating systems (e.g. Windows, OS X, Unix).

The Unicode standard defines a number of different character encodings.  The most common are:

  • UTF-8 – Variable-length; uses from 1 to 4 bytes per character.  English characters use only 1 byte.
  • UTF-16 – Uses 2 bytes for most common characters, 4 bytes for other characters.

#993 – Some Examples of Characters

Given that the System.Char data type (char) can store any single UTF-16 code unit, which covers all of the characters in the Basic Multilingual Plane, below are some examples of valid characters.

Here is the list that we bind to, to display some characters:

            // Each '\xNNNN' escape specifies a character by its Unicode code point (hex)
            CharList = new ObservableCollection<char>
            {
                'a', 'Z', '.', '$', '\x00e1', '\x00e7', '\x0398', '\x03a9',
                '\x0429', '\x046c', '\x05d4', '\x05da', '\x0626', '\x0643',
                '\x0b92', '\x0caf', '\x0e0d', '\x1251', '\x13ca', '\x14cf',
                '\x21bb', '\x21e9', '\x222d', '\x2285', '\x2466', '\x2528',
                '\x2592', '\x265a', '\x265e', '\x2738', '\x3062', '\x3063',
                '\x3082', '\x3071', '\x3462', '\x3463', '\x3464', '\x3485',
                '\x5442', '\x54e1', '\x5494', '\xac21', '\xae25', '\xfc45',
                '\xfce8'
            };

#992 – The System.Char Data Type

In C#, the char keyword is a synonym for the System.Char data type in the Base Class Library (BCL).  An instance of a char represents a single character, i.e. a single unit that is part of a written language.

Some examples of characters, all of which can be represented by a char:

  • Uppercase or lowercase alphabetic characters from the English (Latin) alphabet (a..z, A..Z)
  • Accented alphabetic characters
  • Alphabetic characters from other alphabets (e.g. Greek, Cyrillic, Hebrew, Chinese, Arabic, etc.)
  • Punctuation marks (e.g. ! , = [ {, etc.)
  • Numeric digits (0..9)
  • Control or formatting characters (e.g. end-of-line, delete)
  • Mathematical and musical symbols
  • Special graphical glyphs (e.g. trademark symbol, smiley face)

A character stored in an instance of a char takes up 2 bytes (16 bits).  The values are encoded as Unicode characters, using the UTF-16 encoding.
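
Here's a minimal sketch showing the 2-byte size of a char and its underlying numeric value, which for a BMP character is just the character's Unicode code point:

            char ch = '\u00E9';                 // é, U+00E9
            int codeUnit = ch;                  // 0x00E9 - the UTF-16 code unit
            int sizeInBytes = sizeof(char);     // 2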

#991 – Using the Round-Trip Format Specifier

Normally, when converting a floating point value to a string, you can lose some precision.  If you later convert the string back to the original floating point data type, you could end up with a different value.  In the example below, we convert from a double to a string and then back again, but the double that we get in the end is not equal to the original value.

            double d1 = 0.123456789123456789;
            string s1 = d1.ToString();          // default conversion can drop precision
            double d1b = double.Parse(s1);
            bool b1 = (d1 == d1b);              // false - d1b differs from d1

You can use the “R” (round-trip) format specifier when converting to a string to indicate that you want to be able to convert the string back to a floating point value without loss of precision.

            double d1 = 0.123456789123456789;
            string s1 = d1.ToString("R");       // round-trip format retains full precision
            double d1b = double.Parse(s1);
            bool b1 = (d1 == d1b);              // true - the original value is recovered
