#1,091 – Subnormal Floating Point Numbers

32-bit binary floating point numbers are normally stored in a normalized form, that is:

    1.d × 2^e

where d is the fractional part of the mantissa, consisting of 23 binary digits, and e is the exponent, represented by 8 bits.

In this form, the minimum allowed value for e is -126, which is stored in the 8-bit exponent as a value of 1.  Because the leading 1 is implicit, this means that the minimum positive normalized floating point value is:

    1.0 × 2^-126  (approximately 1.18 × 10^-38)

We could use the 8-bit value of 0 in the exponent to represent an exponent of -127, but that would extend the range by only a single power of two.

Instead, a value of 0 stored in the 8-bit exponent is a signal to drop the leading 1 in the mantissa.  This allows storing a set of much smaller numbers, known as subnormal numbers, of the form:

    0.d × 2^-126

Without the implicit leading 1, the smallest non-zero value sets only the last of the 23 mantissa digits, which gives 2^-23 × 2^-126.  This allows us to store numbers as low as 2^-149.
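
The boundary between normalized and subnormal values can be checked by building the bit patterns directly. The following is a minimal sketch in Python, chosen only for illustration; it assumes that the `f` format of the `struct` module follows the same IEEE 754 32-bit layout, and `bits_to_float32` is a hypothetical helper name.

```python
import struct

def bits_to_float32(bits):
    """Reinterpret a 32-bit integer pattern as an IEEE 754 single-precision value."""
    return struct.unpack(">f", struct.pack(">I", bits))[0]

# Stored exponent 1, fraction all zeros: the smallest normalized value, 1.0 x 2^-126.
smallest_normal = bits_to_float32(0x00800000)

# Stored exponent 0, only the last fraction bit set: the smallest subnormal, 2^-149.
smallest_subnormal = bits_to_float32(0x00000001)

# Stored exponent 0, all 23 fraction bits set: the largest subnormal.
largest_subnormal = bits_to_float32(0x007FFFFF)

print(smallest_normal == 2.0 ** -126)       # True
print(smallest_subnormal == 2.0 ** -149)    # True
print(largest_subnormal < smallest_normal)  # True
```

The largest subnormal sits just below the smallest normalized value, so the subnormals fill the gap between zero and 2^-126 in evenly spaced steps of 2^-149.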


#1,090 – Using Visual Studio to Verify How Floating Point Numbers Are Stored

Recall that floating point numbers are stored in memory by storing the sign bit, exponent and mantissa.

We showed that the decimal value 7.25, as a 32-bit floating point value, is stored as the bit pattern 0x40E80000 (written here in hexadecimal).


We can verify this in Visual Studio by assigning a float to contain the value 7.25 and then looking at that value in memory.


Notice that the bytes appear to be backwards (00 00 E8 40), relative to their order as written above.  This is because Intel processors are little-endian: the least significant byte (the “little” end of the 32-bit word) is stored first, at the lowest address.
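
The same byte order can be reproduced without a debugger. This is a small sketch in Python rather than .NET, used here only because its `struct` module lets us request either byte order explicitly; the variable names are illustrative.

```python
import struct

value = 7.25

# ">f" packs a 32-bit float with the most significant byte first.
big_endian = struct.pack(">f", value)

# "<f" packs the same float with the least significant byte first, as on Intel.
little_endian = struct.pack("<f", value)

print(big_endian.hex())     # 40e80000
print(little_endian.hex())  # 0000e840
```

The little-endian output matches the byte order that a memory view shows on an Intel machine.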

#1,089 – How 32-Bit Floating Point Numbers Are Stored in .NET, part II

(part I)

We store the sign, exponent, and mantissa of a binary floating point number in memory as follows.

The sign is stored using a single bit, 0 for positive, 1 for negative.

The exponent is stored using 8 bits.  The exponent can be positive or negative, but we store it as a positive value by adding an offset (bias) of 127 to the desired exponent and storing the result.  Exponents in the range of -126 to 127 are therefore stored as values in the range 1 to 254.  (Stored exponent values of 0 and 255 have special meanings: 0 is used for zero and subnormal numbers, 255 for infinity and NaN.)

Because we normalize the binary floating point number, it has the form 1.xxx.  Since it always includes the leading 1, we don’t store it, but use 23 bits to store up to 23 digits after the decimal point.

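For example, the decimal value 7.25 is 1.1101 × 2^2 in normalized binary form, so it is stored as:

  • Sign: 0 (positive)
  • Exponent: 2 + 127 = 129, stored as 10000001
  • Mantissa: 1101 followed by 19 zeros (the leading 1 is not stored)

Together, the 32 bits are 0 10000001 11010000000000000000000, or 0x40E80000.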
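
The packing rules above (sign bit, exponent plus 127, 23 mantissa digits without the leading 1) can be sketched in a few lines. This is an illustrative Python sketch, not .NET code; `pack_float32` is a hypothetical helper, and the final comparison assumes that the `struct` module's `f` format uses the same IEEE 754 layout.

```python
import struct

def pack_float32(sign, exponent, fraction_digits):
    """Assemble a 32-bit pattern from a sign bit, an unbiased exponent,
    and the binary digits that follow the implicit leading 1."""
    biased = exponent + 127                              # add the bias of 127
    fraction = int(fraction_digits.ljust(23, "0"), 2)    # pad to 23 digits
    return (sign << 31) | (biased << 23) | fraction

# 7.25 = 1.1101 x 2^2: sign 0, exponent 2, digits after the leading 1 are "1101".
bits = pack_float32(0, 2, "1101")
print(hex(bits))  # 0x40e80000

# Cross-check against the bits Python produces for the same value.
(expected,) = struct.unpack(">I", struct.pack(">f", 7.25))
print(bits == expected)  # True
```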





#1,088 – How 32-Bit Floating Point Numbers Are Stored in .NET, part I

Floating point numbers in .NET (on Intel-based PCs) are stored using the IEEE 754 standard, which defines how to store both 32-bit (float) and 64-bit (double) floating point numbers.

A floating point number is stored in memory as a binary floating point value, expressed in binary scientific notation.

For example, to store a decimal value of 7.25:

    7.25 (decimal) = 111.01 (binary) = 1.1101 × 2^10

The exponent is expressed in binary, so “10” represents a value of 2.

We can now store this floating point number in memory by storing three things:

  • The sign of the number (positive)
  • The mantissa (1.1101)
  • The exponent (10)

On Intel-based PCs, 32-bit floating point numbers are stored as follows:

  • 1 bit to store the sign (0 for positive numbers, 1 for negative numbers)
  • 8 bits to store the exponent
  • 23 bits to store the mantissa
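
The 1/8/23 split can be made concrete by pulling the three fields out of a stored value. The following is a minimal sketch in Python, used only for illustration; `float32_fields` is a hypothetical helper name, and the exponent field it returns is the raw stored value, which includes the offset of 127 described in part II.

```python
import struct

def float32_fields(value):
    """Return (sign, stored_exponent, fraction) for a 32-bit float."""
    (bits,) = struct.unpack(">I", struct.pack(">f", value))
    sign = bits >> 31                # top bit
    exponent = (bits >> 23) & 0xFF   # next 8 bits
    fraction = bits & 0x7FFFFF       # low 23 bits
    return sign, exponent, fraction

sign, exponent, fraction = float32_fields(7.25)
print(sign)                  # 0
print(f"{exponent:08b}")     # 10000001
print(f"{fraction:023b}")    # 11010000000000000000000
```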

More coming in part II

#1,087 – Representing Binary Floating Point Numbers Using Scientific Notation

We can represent decimal floating point numbers in scientific notation, using the form:

    a × 10^b

(where a and b are both decimal values).

We can also represent binary floating point numbers using scientific notation.  The basic form is:

    a × 2^b

(where a and b are both binary values).

Below is an example.  Assume that we have a binary floating point value of 101.101 (5.625 decimal).

    101.101 = 1.01101 × 2^10  (exponent written in binary)

If we have the binary value 1.01101, we need to shift the decimal point two places to the right to get our desired value of 101.101.  This is equivalent to multiplying by 4, or 2 raised to the power of 2.  We write the exponent in binary, so the power of 2 is written as “10”.
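
The normalization step can be checked numerically. This is a small Python sketch for illustration, assuming a positive, non-zero input; `to_binary_scientific` is a hypothetical helper built on `math.frexp`, which returns a mantissa in the range 0.5 to 1, so the result is rescaled to the 1.xxx form used here.

```python
import math

def to_binary_scientific(value):
    """Return (mantissa, exponent) with value == mantissa * 2**exponent
    and 1 <= mantissa < 2 (for positive, non-zero values)."""
    m, e = math.frexp(value)   # value == m * 2**e with 0.5 <= m < 1
    return m * 2, e - 1

mantissa, exponent = to_binary_scientific(5.625)
print(mantissa)        # 1.40625, which is 1.01101 in binary
print(exponent)        # 2
print(bin(exponent))   # 0b10
```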


#1,086 – Converting Decimal Floating Point to Binary Floating Point

We can represent a particular floating point value as either a decimal floating point number or a binary floating point number.

Below is an example of how we would convert a decimal floating point value (3.25) to its equivalent binary floating point value.

    3    = 2 + 1 = 11 (binary)
    0.25 = 1/4   = 0.01 (binary)
    3.25         = 11.01 (binary)

This particular example was rather easy, because the decimal value 0.25 represents 1/4, so it can be represented as a binary fraction with only two digits after the decimal point.

The decimal value 1.1 is a bit more difficult to calculate.

At each step, we find the largest fractional power of two (e.g. 1/2, 1/4, 1/8) that is no larger than the remaining fractional value.  We then subtract that value and continue.

    0.1     - 1/16  = 0.0375
    0.0375  - 1/32  = 0.00625
    0.00625 - 1/256 = 0.00234375
    1.1 ≈ 1.00011001 (binary)

We could continue this process, adding one digit of precision at each step and reducing the error (the difference between the binary representation and the value 1.1).  In the end, we can’t exactly represent this value with a finite number of binary digits, because the digit pattern 0011 repeats forever.
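
The subtract-and-continue procedure described above can be sketched as a short loop. This Python sketch is only an illustration; `fraction_to_binary` is a hypothetical helper, and it uses exact `Fraction` arithmetic so that the demonstration itself is not affected by floating point rounding.

```python
from fractions import Fraction

def fraction_to_binary(frac, digits):
    """Convert a fraction in [0, 1) to the given number of binary digits
    by repeatedly subtracting 1/2, 1/4, 1/8, ... where they fit."""
    out = []
    remaining = frac
    for k in range(1, digits + 1):
        power = Fraction(1, 2 ** k)
        if power <= remaining:
            out.append("1")
            remaining -= power
        else:
            out.append("0")
    return "".join(out), remaining

print(fraction_to_binary(Fraction(1, 4), 2))    # ('01', Fraction(0, 1))
digits, error = fraction_to_binary(Fraction(1, 10), 12)
print(digits)      # 000110011001
print(error > 0)   # True
```

For 0.25 the remainder reaches zero after two digits, while for 0.1 a remainder is still left after twelve digits.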


#1,085 – Binary Floating Point Numbers

When we write decimal floating point numbers, each digit to the left of the decimal point is multiplied by a power of 10 (1, 10, 100, etc.) and the results are summed together.  Digits to the right of the decimal point are multiplied by negative powers of 10 (1/10, 1/100, etc.).

We can also write binary floating point numbers.  Numbers represented this way aren’t seen very often in practice, but are important to understand when we talk about how to store floating point numbers in memory.

Digits to the left of the decimal point in a binary floating point number represent powers of 2 (1, 2, 4) to be summed together.

Digits to the right of the decimal point in a binary floating point number represent negative powers of two, e.g. 1/2, 1/4, 1/8, etc.

Here are some examples.
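
The positional rules above can be turned into a short evaluator for binary strings. This is a minimal Python sketch for illustration; `binary_to_decimal` is a hypothetical helper and handles only unsigned inputs such as "101.101".

```python
def binary_to_decimal(text):
    """Evaluate a binary string such as '101.101' as a decimal value."""
    whole, _, frac = text.partition(".")
    # Digits to the left of the point are ordinary powers of 2.
    value = float(int(whole, 2)) if whole else 0.0
    # Digits to the right are 1/2, 1/4, 1/8, ... in order.
    for position, digit in enumerate(frac, start=1):
        if digit == "1":
            value += 2.0 ** -position
    return value

print(binary_to_decimal("101.101"))  # 5.625
print(binary_to_decimal("11.01"))    # 3.25
print(binary_to_decimal("0.1"))      # 0.5
```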