#999 – Some Examples of UTF-16 and UTF-8 Encoding

Unicode maps each character to a code point, i.e. a numeric value that represents that character.  A character encoding scheme then dictates how each code point is represented as a series of bits so that it can be stored in memory or on disk.  UTF-16 and UTF-8 are the most commonly used encoding schemes for Unicode character data.
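As a minimal sketch of these two steps (Python is used here purely for illustration; the variable names are assumptions), the built-in `ord` returns a character's code point and `str.encode` applies an encoding scheme to produce the bytes:

```python
# Step 1: character -> code point (a plain integer).
ch = "A"
code_point = ord(ch)
label = f"U+{code_point:04X}"          # conventional U+XXXX notation

# Step 2: code point -> bytes, under two different encoding schemes.
utf16_bytes = ch.encode("utf-16-be")   # big-endian, no byte order mark
utf8_bytes = ch.encode("utf-8")

print(label, utf16_bytes.hex(), utf8_bytes.hex())
```

The same code point yields different byte sequences depending on the encoding scheme chosen.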

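Below are some examples of how various characters would be encoded in UTF-16 and UTF-8.  The UTF-16 bytes are shown in big-endian order, and the bracketed bit patterns show the UTF-8 layout, where each x is a bit taken from the code point.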

  • Latin capital ‘A’, code point U+0041
    • UTF-16: 2 bytes, 00 41  (hex)
    • UTF-8: 1 byte, 41 (hex)
  • Latin lowercase ‘e’ with acute accent (é), code point U+00E9
    • UTF-16: 2 bytes, 00 E9 (hex)
    • UTF-8: 2 bytes, C3 A9 (hex)  [110x xxxx 10xx xxxx]
  • Mongolian letter A, code point U+1820
    • UTF-16: 2 bytes, 18 20 (hex)
    • UTF-8: 3 bytes, E1 A0 A0 (hex)  [1110 xxxx 10xx xxxx 10xx xxxx]
  • Ace of Spades playing card character, code point U+1F0A1
    • UTF-16: 4 bytes, D8 3C DC A1 (hex)  [surrogate pair D83C DCA1, needed because the code point is above U+FFFF]
    • UTF-8: 4 bytes, F0 9F 82 A1 (hex)  [1111 0xxx 10xx xxxx 10xx xxxx 10xx xxxx]
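The byte values in the list above can be reproduced by hand. The following is a rough sketch in Python, not a production encoder: the function names `utf8_encode` and `utf16_be_encode` are hypothetical, and for brevity the code does not reject invalid input such as surrogate code points (U+D800 to U+DFFF) or values above U+10FFFF. It fills in the x bits of each UTF-8 pattern and builds the UTF-16 surrogate pair, then checks the results against Python's built-in encoders.

```python
def utf8_encode(cp: int) -> bytes:
    """Encode one code point as UTF-8 by filling the x bits of each pattern."""
    if cp < 0x80:        # 0xxx xxxx
        return bytes([cp])
    if cp < 0x800:       # 110x xxxx 10xx xxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:     # 1110 xxxx 10xx xxxx 10xx xxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    # 1111 0xxx 10xx xxxx 10xx xxxx 10xx xxxx
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


def utf16_be_encode(cp: int) -> bytes:
    """Encode one code point as big-endian UTF-16 (no byte order mark)."""
    if cp < 0x10000:
        return cp.to_bytes(2, "big")        # a single 16-bit code unit
    offset = cp - 0x10000                   # 20 bits remain
    high = 0xD800 | (offset >> 10)          # top 10 bits -> high surrogate
    low = 0xDC00 | (offset & 0x3FF)         # bottom 10 bits -> low surrogate
    return high.to_bytes(2, "big") + low.to_bytes(2, "big")


# The four code points from the examples above.
for cp in (0x0041, 0x00E9, 0x1820, 0x1F0A1):
    ch = chr(cp)
    # Cross-check the hand-rolled encoders against the built-in ones.
    assert utf8_encode(cp) == ch.encode("utf-8")
    assert utf16_be_encode(cp) == ch.encode("utf-16-be")
    print(f"U+{cp:04X}  UTF-16: {utf16_be_encode(cp).hex()}  UTF-8: {utf8_encode(cp).hex()}")
```

In everyday code the built-in `str.encode` is the sensible choice; the manual version is only meant to show where each byte comes from.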

About Sean
Software developer in the Twin Cities area, passionate about software development and sailing.
