#1,000 – UTF-8 and ASCII

UTF-8 is a character encoding scheme for Unicode character data that uses 1 to 4 bytes to represent each character, depending on the character's code point.

In Unicode, the code point for each ASCII character is numerically equal to that character's ASCII code.

This mapping holds for all 128 ASCII codes.

UTF-8 encodes each of these first 128 code points as a single byte containing the code point itself.  Because of this:

ASCII characters that appear in a stream of UTF-8 encoded character data look exactly as they would if the stream were encoded as ASCII.

This means that a UTF-8 encoded stream of ASCII characters is byte-for-byte identical to an ASCII-encoded stream of the same characters.  In other words, for text made up entirely of ASCII characters (which covers most English text), UTF-8 is identical to ASCII.
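
This byte-for-byte equivalence is easy to verify; here's a quick check in Python, used here simply as a convenient way to inspect encoded bytes:

```python
text = "Hello, world!"  # ASCII-only text

utf8_bytes = text.encode("utf-8")
ascii_bytes = text.encode("ascii")

# For ASCII-only input, the two encodings produce identical byte sequences
print(utf8_bytes == ascii_bytes)  # True
print(utf8_bytes)                 # b'Hello, world!'
```

Any byte value of 128 or above would indicate a non-ASCII character, so this equivalence breaks as soon as the text contains accented letters or other non-ASCII symbols.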

#999 – Some Examples of UTF-16 and UTF-8 Encoding

Unicode maps each character to a corresponding code point, i.e. a numeric value that represents that character.  A character encoding scheme then dictates how each code point is represented as a series of bits so that it can be stored in memory or on disk.  UTF-16 and UTF-8 are the most commonly used encoding schemes for Unicode character data.

Below are some examples of how various characters are encoded in UTF-16 and UTF-8.  (UTF-16 byte sequences are shown in big-endian order.)

  • Latin capital ‘A’, code point U+0041
    • UTF-16: 2 bytes, 00 41  (hex)
    • UTF-8: 1 byte, 41 (hex)
  • Latin lowercase ‘é’ (e with acute accent), code point U+00E9
    • UTF-16: 2 bytes, 00 E9 (hex)
    • UTF-8: 2 bytes, C3 A9 (hex)  [110x xxxx 10xx xxxx]
  • Mongolian letter A, U+1820
    • UTF-16: 2 bytes, 18 20 (hex)
    • UTF-8: 3 bytes, E1 A0 A0 (hex)  [1110 xxxx 10xx xxxx 10xx xxxx]
  • Ace of Spades playing card character, U+1F0A1
    • UTF-16: 4 bytes, D8 3C DC A1
    • UTF-8: 4 bytes, F0 9F 82 A1  [1111 0xxx 10xx xxxx 10xx xxxx 10xx xxxx]
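
The byte sequences above can be checked with Python's built-in codecs (the "utf-16-be" codec produces the big-endian byte order shown, without a byte order mark):

```python
# Each entry: character -> (expected UTF-16 bytes, expected UTF-8 bytes)
examples = {
    "\u0041":     (b"\x00\x41",         b"\x41"),              # 'A'
    "\u00e9":     (b"\x00\xe9",         b"\xc3\xa9"),          # 'é'
    "\u1820":     (b"\x18\x20",         b"\xe1\xa0\xa0"),      # Mongolian A
    "\U0001f0a1": (b"\xd8\x3c\xdc\xa1", b"\xf0\x9f\x82\xa1"),  # Ace of Spades
}

for ch, (utf16, utf8) in examples.items():
    assert ch.encode("utf-16-be") == utf16
    assert ch.encode("utf-8") == utf8
    print(f"U+{ord(ch):04X}: UTF-16 = {utf16.hex(' ')}, UTF-8 = {utf8.hex(' ')}")
```

Note that U+1F0A1 lies outside the Basic Multilingual Plane, which is why UTF-16 needs a surrogate pair (D83C DCA1) rather than a single 2-byte value.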

#998 – UTF-8 Encoding

Unicode maps each character to a corresponding code point, i.e. a numeric value that represents that character.  A character encoding scheme then dictates how each code point is represented as a series of bits so that it can be stored in memory or on disk.

UTF-8 is the most common character encoding when transferring data over the web, i.e. via XML or HTML.  UTF-8 uses from 1 to 4 bytes to represent each code point.

The full range of Unicode code points, from U+0000 through U+10FFFF, is encoded with UTF-8 as follows:

  • U+0000 to U+007F: 1 byte, storing code point exactly (identical to ASCII)
  • U+0080 to U+07FF: 2 bytes for 11 bits – 110x xxxx, 10xx xxxx
  • U+0800 to U+FFFF: 3 bytes for 16 bits – 1110 xxxx, 10xx xxxx, 10xx xxxx
  • U+10000 to U+10FFFF: 4 bytes for 21 bits – 1111 0xxx, 10xx xxxx, 10xx xxxx, 10xx xxxx

English text, which consists almost entirely of ASCII characters, therefore requires just 1 byte per character.
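
The byte layouts above translate directly into a few shifts and masks.  Here is a sketch of an encoder for a single code point (illustrative only; real code should use a library codec, which also rejects surrogate code points):

```python
def utf8_encode(cp: int) -> bytes:
    """Encode one Unicode code point using the byte layouts above."""
    if cp <= 0x7F:          # 1 byte:  0xxx xxxx
        return bytes([cp])
    if cp <= 0x7FF:         # 2 bytes: 110x xxxx, 10xx xxxx
        return bytes([0xC0 | (cp >> 6),
                      0x80 | (cp & 0x3F)])
    if cp <= 0xFFFF:        # 3 bytes: 1110 xxxx, 10xx xxxx, 10xx xxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    if cp <= 0x10FFFF:      # 4 bytes: 1111 0xxx, then three 10xx xxxx bytes
        return bytes([0xF0 | (cp >> 18),
                      0x80 | ((cp >> 12) & 0x3F),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    raise ValueError("code point out of Unicode range")

# Compare against Python's built-in encoder, one code point from each range
for cp in (0x41, 0xE9, 0x1820, 0x1F0A1):
    assert utf8_encode(cp) == chr(cp).encode("utf-8")
```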

#994 – Unicode Basics

To store alphabetic characters or other written characters on a computer system, either in memory or on disk, we need to encode each character, storing some numeric value that represents the character.  These numeric values are, ultimately, just bit patterns, where each bit pattern represents some character.

Unicode is a standard that specifies methods for encoding all characters from the written languages of the world.  This includes the ability to encode more than 1 million unique characters.  Unicode is the standard used for all web-based traffic (HTML and XML) and for storing character data on most modern operating systems (e.g. Windows, OS X, Unix).

The Unicode standard defines a number of different character encodings.  The most common are:

  • UTF-8 – Variable number of bytes, from 1 to 4.  English (ASCII) characters use only 1 byte.
  • UTF-16 – Uses 2 bytes for the most common characters and 4 bytes (a surrogate pair) for all others.
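
The practical difference between the two encodings shows up in how many bytes a given piece of text occupies.  A small comparison in Python (byte counts only; "utf-16-be" is used so no byte order mark is added):

```python
samples = {
    "English (ASCII)": "Hello",
    "Accented Latin":  "café",
    "Playing card":    "\U0001f0a1",  # Ace of Spades, U+1F0A1
}

for label, s in samples.items():
    u8 = len(s.encode("utf-8"))
    u16 = len(s.encode("utf-16-be"))
    print(f"{label}: UTF-8 = {u8} bytes, UTF-16 = {u16} bytes")
```

For mostly-ASCII text, UTF-8 comes out at roughly half the size of UTF-16; for text dominated by characters in the U+0800 to U+FFFF range (e.g. most East Asian scripts), UTF-16 can be the smaller of the two.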