#1,088 – How 32-Bit Floating Point Numbers Are Stored in .NET, part I

Floating point numbers in .NET (on Intel-based PCs) are stored using the IEEE 754 standard, which defines how to store both 32-bit (float) and 64-bit (double) floating point values.

A floating point number is stored in memory as a binary floating point value written in binary scientific notation.

For example, to store a decimal value of 7.25:

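7.25 (decimal) = 111.01 (binary) = 1.1101 × 2^10 (binary, with the exponent also written in binary)

The exponent is written in binary, so “10” has a value of 2 in decimal.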

We can now store this floating point number in memory by storing three things:

  • The sign of the number (positive)
  • The mantissa (1.1101)
  • The exponent (10)

On Intel-based PCs, 32-bit floating point numbers are stored as follows:

  • 1 bit to store the sign (0 for positive numbers, 1 for negative numbers)
  • 8 bits to store the exponent
  • 23 bits to store the mantissa
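To see this layout for ourselves, here is a minimal sketch that splits a float into its three bit fields. The class name FloatBits and its methods are hypothetical names chosen for illustration, and the tuple syntax assumes C# 7.0 or later. Note that the raw exponent field for 7.25 comes out as 10000001 (129) rather than 10 (2), and the leading 1 of the mantissa does not appear in the 23 mantissa bits. IEEE 754 stores the exponent with an offset of 127 and leaves the leading 1 implicit.

```csharp
using System;

public static class FloatBits
{
    // Splits a 32-bit float into its raw sign, exponent, and mantissa fields.
    public static (int Sign, int Exponent, int Mantissa) Split(float value)
    {
        // Reinterpret the float's four bytes as a 32-bit integer.
        int bits = BitConverter.ToInt32(BitConverter.GetBytes(value), 0);

        int sign = (bits >> 31) & 0x1;       // 1 bit
        int exponent = (bits >> 23) & 0xFF;  // 8 bits (raw, stored with an offset of 127)
        int mantissa = bits & 0x7FFFFF;      // 23 bits (digits after the leading 1)
        return (sign, exponent, mantissa);
    }

    // Formats the three fields as "sign exponent mantissa" in binary.
    public static string ToBitString(float value)
    {
        var (s, e, m) = Split(value);
        return s + " "
            + Convert.ToString(e, 2).PadLeft(8, '0') + " "
            + Convert.ToString(m, 2).PadLeft(23, '0');
    }
}
```

Calling FloatBits.ToBitString(7.25f) returns "0 10000001 11010000000000000000000".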

More coming in part II

#1,087 – Representing Binary Floating Point Numbers Using Scientific Notation

We can represent decimal floating point numbers using scientific notation, using the form:

a × 10^b

(where a and b are both decimal values).

We can also represent binary floating point numbers using scientific notation.  The basic form is:

a × 2^b

(where a and b are both binary values).

Below is an example.  Assume that we have a binary floating point value of 101.101 (5.625 decimal).

101.101 (binary) = 1.01101 × 2^10 (binary, with the exponent also written in binary)

If we have the binary value 1.01101, we need to shift the decimal point two places to the right to get our desired value of 101.101.  This is equivalent to multiplying by 4, or 2 raised to the power of 2.  We write the exponent in binary, so the power of 2 is written as “10”.
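The shifting described above can be sketched in code. This is only an illustration, assuming a positive input value. The class name BinaryScientific and the method Normalize are hypothetical. The method repeatedly divides or multiplies by 2 until the value lies between 1 and 2, counting the shifts as the exponent.

```csharp
using System;

public static class BinaryScientific
{
    // Rewrites a positive value as mantissa × 2^exponent, with 1 <= mantissa < 2.
    // Assumes value > 0; zero and negative values are not handled in this sketch.
    public static (double Mantissa, int Exponent) Normalize(double value)
    {
        double mantissa = value;
        int exponent = 0;

        // Each division by 2 shifts the point one place to the left.
        while (mantissa >= 2.0)
        {
            mantissa /= 2.0;
            exponent++;
        }

        // Each multiplication by 2 shifts the point one place to the right.
        while (mantissa < 1.0)
        {
            mantissa *= 2.0;
            exponent--;
        }

        return (mantissa, exponent);
    }
}
```

For 5.625, Normalize returns a mantissa of 1.40625 (1.01101 in binary) and an exponent of 2 (10 in binary).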

 

#1,086 – Converting Decimal Floating Point to Binary Floating Point

We can represent a particular floating point value as either a decimal floating point number or a binary floating point number.

Below is an example of how we would convert a decimal floating point value (3.25)  to its equivalent binary floating point value.

3.25 (decimal) = 2 + 1 + 1/4 = 11.01 (binary)

This particular example was rather easy, because the decimal value 0.25 represents 1/4, so it can be represented as a binary fraction with only two digits after the decimal point.

The decimal value 1.1 is a bit more difficult to convert.

At each step, we find the largest fractional power of two (e.g. 1/2, 1/4, 1/8) that is no larger than the remaining fractional value.  We then subtract that value and continue.

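1.1 = 1 + 0.1
0.1 - 1/16 (0.0625) = 0.0375, giving 1.0001
0.0375 - 1/32 (0.03125) = 0.00625, giving 1.00011
0.00625 - 1/256 (0.00390625) = 0.00234375, giving 1.00011001

We could continue this process, adding one digit of precision at each step and reducing the error (the difference between the binary representation and the value 1.1).  The remainder never reaches zero, however, so we can’t exactly represent this value with a binary floating point number that has a finite number of digits.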
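Here is a minimal sketch of the subtraction procedure described above. The class name BinaryFraction and the method ToBinary are hypothetical. The sketch assumes a non-negative input and uses the decimal type so that an input like 1.1 is held exactly before conversion.

```csharp
using System;
using System.Text;

public static class BinaryFraction
{
    // Converts a non-negative decimal value to a binary string, producing
    // at most maxFractionDigits digits after the point.
    public static string ToBinary(decimal value, int maxFractionDigits)
    {
        long whole = (long)Math.Floor(value);
        decimal remaining = value - whole;

        var sb = new StringBuilder(Convert.ToString(whole, 2));
        if (remaining == 0m)
        {
            return sb.ToString();
        }

        sb.Append('.');
        decimal power = 0.5m; // 1/2, then 1/4, 1/8, ...
        for (int i = 0; i < maxFractionDigits && remaining > 0m; i++)
        {
            if (power <= remaining)
            {
                // This power of two fits, so emit a 1 and subtract it.
                sb.Append('1');
                remaining -= power;
            }
            else
            {
                sb.Append('0');
            }
            power /= 2m;
        }
        return sb.ToString();
    }
}
```

ToBinary(3.25m, 8) returns "11.01" and stops early because nothing remains.  ToBinary(1.1m, 8) returns "1.00011001", with a remainder still left over after eight digits.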

 

#1,085 – Binary Floating Point Numbers

When we write decimal floating point numbers, digits to the left of the decimal point indicate values of powers of 10 (1, 10, 100) that should be summed together.  Digits to the right of the decimal point indicate values of negative powers of 10 (1/10, 1/100, etc).

We can also write binary floating point numbers.  Numbers represented this way aren’t seen very often in practice, but are important to understand when we talk about how to store floating point numbers in memory.

Digits to the left of the decimal point in a binary floating point number represent powers of 2 (1, 2, 4) to be summed together.

Digits to the right of the decimal point in a binary floating point number represent negative powers of two, e.g. 1/2, 1/4, 1/8, etc.

Here are some examples.

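11.01 (binary) = 2 + 1 + 1/4 = 3.25 (decimal)
101.101 (binary) = 4 + 1 + 1/2 + 1/8 = 5.625 (decimal)
111.01 (binary) = 4 + 2 + 1 + 1/4 = 7.25 (decimal)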
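The rule for reading a binary floating point number can also be sketched in code. The class name BinaryPoint and the method Parse are hypothetical, and the sketch assumes the input contains only the characters 0, 1, and at most one point.

```csharp
using System;

public static class BinaryPoint
{
    // Sums the powers of 2 selected by the 1 digits on each side of the point.
    // Assumes the input contains only '0', '1', and at most one '.'.
    public static double Parse(string binary)
    {
        string[] parts = binary.Split('.');
        double value = 0.0;

        // Digits left of the point: weights 1, 2, 4, ... moving right to left.
        double weight = 1.0;
        for (int i = parts[0].Length - 1; i >= 0; i--)
        {
            if (parts[0][i] == '1')
            {
                value += weight;
            }
            weight *= 2.0;
        }

        // Digits right of the point: weights 1/2, 1/4, 1/8, ... moving left to right.
        if (parts.Length > 1)
        {
            weight = 0.5;
            foreach (char digit in parts[1])
            {
                if (digit == '1')
                {
                    value += weight;
                }
                weight /= 2.0;
            }
        }

        return value;
    }
}
```

Parse("101.101") returns 5.625 and Parse("11.01") returns 3.25.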

#1,080 – Binary Numerals Written as Hexadecimal

Binary data, for example data stored in some location in memory, is typically written as a hexadecimal number.  This can be done by grouping the binary digits into groups of four.  Each group of four digits can then be represented by a single hexadecimal character.

Four binary digits can range from 0000 to 1111, representing values from 0 to 15, corresponding to the 16 available characters in the hexadecimal number system.

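Binary  Hex     Binary  Hex
0000    0       1000    8
0001    1       1001    9
0010    2       1010    A
0011    3       1011    B
0100    4       1100    C
0101    5       1101    D
0110    6       1110    E
0111    7       1111    F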

The example below shows how we can represent a long binary number (16 bits in this case) as a series of hex characters.

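[Figure: a 16-bit binary number split into four groups of four bits, with each group written as one hexadecimal character.]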

The practice of grouping binary data into groups of four digits maps well to data stored in digital computers, since the typical size of a data word in a binary computer is some multiple of four, e.g. 16 bits, 32 bits, or 64 bits.  These groups of bits can then be represented by 4, 8, or 16 hexadecimal characters, respectively.
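The grouping can be sketched in a few lines of code. The class name BinaryToHex, the method ToHex, and the sample bit string are all hypothetical, chosen only for illustration. Input whose length is not a multiple of four is padded with leading zeros.

```csharp
using System;
using System.Text;

public static class BinaryToHex
{
    // Converts a string of binary digits to hex, one character per group of four bits.
    // Assumes the input contains only '0' and '1'.
    public static string ToHex(string bits)
    {
        // Pad on the left so the length is a multiple of four.
        int paddedLength = (bits.Length + 3) / 4 * 4;
        bits = bits.PadLeft(paddedLength, '0');

        var sb = new StringBuilder();
        for (int i = 0; i < bits.Length; i += 4)
        {
            string group = bits.Substring(i, 4);
            int groupValue = Convert.ToInt32(group, 2); // 0 through 15
            sb.Append(groupValue.ToString("X"));        // 0-9 or A-F
        }
        return sb.ToString();
    }
}
```

ToHex("1010111100010010") splits the input into 1010 1111 0001 0010 and returns "AF12".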

#1,079 – The Binary Numeral System

Humans normally represent numeric values using the decimal (base 10) numeral system.  We can also represent numbers as binary, or base 2.

The binary numeral system uses only two digits (numeric symbols), rather than 10.  You represent a numerical value using a string of these binary digits (or bits).  The two digits used are 0 and 1.

Moving from right to left, the digits in a base 2 system represent powers of 2 (1, 2, 4, 8, etc. or 2^0, 2^1, 2^2, 2^3, etc).  The value of a number represented as binary is the sum of the value of each digit multiplied by the appropriate power of 2.  For example:

Digit weights, reading left to right across six digits:  2^5, 2^4, 2^3, 2^2, 2^1, 2^0  (32, 16, 8, 4, 2, 1)

As an example, the binary number 101101 is equivalent to the decimal number 45.

101101 (binary) = 1×32 + 0×16 + 1×8 + 1×4 + 0×2 + 1×1 = 45 (decimal)
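Here is a small sketch of the same calculation in code. The class name BinaryNumeral and the method ToDecimal are hypothetical. The loop walks the digits from right to left, doubling the weight at each position. In practice the built-in call Convert.ToInt32("101101", 2) gives the same result.

```csharp
using System;

public static class BinaryNumeral
{
    // Sums the powers of 2 that correspond to the 1 digits in a binary string.
    // Assumes the input contains only '0' and '1' and fits in a 32-bit int.
    public static int ToDecimal(string bits)
    {
        int value = 0;
        int power = 1; // 2^0 for the rightmost digit

        for (int i = bits.Length - 1; i >= 0; i--)
        {
            if (bits[i] == '1')
            {
                value += power;
            }
            power *= 2; // next digit to the left is worth twice as much
        }

        return value;
    }
}
```

ToDecimal("101101") returns 45.  Going the other way, Convert.ToString(45, 2) returns "101101".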

 

 

 

 

#1,073 – Arithmetic Binary Operators are Left-Associative

All arithmetic binary operators (+, -, *, /, %) are left-associative.  This means that when there are multiple operators having the same precedence, the expression is evaluated from left to right.  Below are some examples.

            // Multiplicative, left-right

            // (10 / 5) * 2 = 4
            // [result would be 1 if right-associative]
            int i = 10 / 5 * 2;

            // (40 % 12) * 2 = 8
            // [result would be 16 if right-associative]
            int i2 = 40 % 12 * 2;

            // Additive, left-right

            // (4 - 3) + 5 = 6
            // [result would be -4 if right-associative]
            int i3 = 4 - 3 + 5;

Note that the multiplicative operators (*, /, %) have a higher precedence than the additive operators (+, -).  This means that if there are no parentheses, the multiplicative operators are evaluated before the additive operators.

            // 1 + (2 * 3) = 7
            int i4 = 1 + 2 * 3;

            // Equivalent to 1 + ((10 / 5) * 2) = 5
            int i5 = 1 + 10 / 5 * 2;