#994 – Unicode Basics

To store alphabetic characters or other written characters on a computer system, whether in memory or on disk, we need to encode each character as a numeric value that represents it.  The numeric values are, ultimately, just bit patterns, where each bit pattern represents some character.
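To make the character-to-number mapping concrete, here is a small Python sketch (Python is chosen only for illustration; the original text names no language). The built-in `ord()` function returns a character's numeric value, and `chr()` goes the other way:

```python
# Every character maps to a numeric value (its code point).
# ord() gives the number for a character; chr() gives the
# character for a number.
for ch in "Aé€":
    print(ch, ord(ch), hex(ord(ch)))

# → A 65 0x41
# → é 233 0xe9
# → € 8364 0x20ac
```

Printing the value in hex as well is handy because Unicode code points are conventionally written in hexadecimal (e.g. U+20AC for the euro sign).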

Unicode is a standard that specifies methods for encoding all characters from the world's written languages, with room for more than 1 million unique characters.  Unicode is the standard used for web content (HTML and XML) and for storing character data on most modern operating systems (e.g. Windows, OS X, Unix).

The Unicode standard defines a number of different character encodings.  The most common are:

  • UTF-8 – Variable-length encoding in which each character uses 1 to 4 bytes.  ASCII characters, including unaccented English letters, use only 1 byte.
  • UTF-16 – Uses 2 bytes for the most common characters and 4 bytes (a surrogate pair) for all others.
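The size differences between the two encodings can be observed directly; this Python sketch (Python used purely for illustration) encodes the same characters both ways and compares the byte counts:

```python
# Compare how many bytes the same character occupies under
# UTF-8 versus UTF-16.
for text in ("A", "é", "€", "😀"):
    u8 = text.encode("utf-8")
    u16 = text.encode("utf-16-le")  # "-le" skips the 2-byte byte-order mark
    print(f"{text!r}: UTF-8 = {len(u8)} bytes, UTF-16 = {len(u16)} bytes")

# → 'A': UTF-8 = 1 bytes, UTF-16 = 2 bytes
# → 'é': UTF-8 = 2 bytes, UTF-16 = 2 bytes
# → '€': UTF-8 = 3 bytes, UTF-16 = 2 bytes
# → '😀': UTF-8 = 4 bytes, UTF-16 = 4 bytes
```

Note how UTF-8 is smallest for ASCII, UTF-16 can be smaller for some non-Latin text, and both need 4 bytes for characters outside the Basic Multilingual Plane, such as emoji.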