#1,003 – Accessing Underlying Bytes in a String for Different Encodings

Strings in .NET are stored in memory as Unicode character data, using the UTF-16 encoding.  (2 bytes per character, or 4 bytes for surrogate pairs).

If you want to get access to the underlying data for the string in memory, you can use one of the functions listed below, indicating what encoding to use for the Unicode data when converting it to a byte array.  If you use Encoding.Unicode, you’ll get the data exactly as it is stored in memory for the String type.

  • System.Text.Encoding.Unicode.GetBytes – UTF-16
  • System.Text.Encoding.UTF8.GetBytes – UTF-8

In the example below, notice the different byte sequences used to encode the CJK character.

            string ideograph = "𠈓";
            byte[] utf16 = Encoding.Unicode.GetBytes(ideograph);
            byte[] utf8 = Encoding.UTF8.GetBytes(ideograph);


About Sean
Software developer in the Twin Cities area, passionate about software development and sailing.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: