Understanding floats
 

Binary representation of single precision floating point numbers


Author: Jean-David Gadina
Copyright: © 2024 Jean-David Gadina - www.xs-labs.com - All Rights Reserved
License: This article is published under the terms of the FreeBSD Documentation License
Source: IEEE Standard for Floating-Point Arithmetic - IEEE 754


Table of contents

  1. Theory
  2. Example
  3. Special numbers
    1. Denormalized numbers
    2. Zero
    3. Infinity
    4. NaN
  4. Range
    1. Normalized numbers
    2. Denormalized numbers
  5. C code example

1. Theory

Single precision floating point numbers are usually called 'float', or 'real'. They are 4 bytes long, and are packed the following way, from left to right:

  • Sign: 1 bit
  • Exponent: 8 bits
  • Mantissa: 23 bits

  Sign     Exponent     Mantissa
  1 bit    8 bits       23 bits
  X        XXXX XXXX    XXX XXXX XXXX XXXX XXXX XXXX

The sign indicates if the number is positive or negative (zero for positive, one for negative).

The real exponent is computed by subtracting 127 from the value of the exponent field. It's the exponent of the number as it is expressed in scientific notation.

The full mantissa, which is also sometimes called the significand, should be considered as a 24 bits value. As we are using scientific notation, there is an implicit leading bit (sometimes called the hidden bit), always set to 1, as there is never a leading 0 in scientific notation, and 1 is the only non-zero digit in binary.
For instance, in decimal, you won't say 0.123 · 10^5 but 1.23 · 10^4.

The conversion is performed the following way:

(-1)^S · 1.M · 2^(E - 127)

Where S is the sign, M the mantissa, and E the exponent.
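
As an illustration of the layout described above, the following sketch splits a float into its three fields. It assumes that 'float' is an IEEE 754 single precision number stored on 4 bytes, which is the case on most platforms but is not guaranteed by the C standard. The 'FloatFields' type and the 'splitFloat' function are names made up for this example.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper type: the three raw fields of a single precision float */
typedef struct
{
    uint32_t sign;     /* S: 1 bit   */
    uint32_t exponent; /* E: 8 bits  */
    uint32_t mantissa; /* M: 23 bits */
}
FloatFields;

/* Splits a float into its sign, exponent and mantissa fields */
static FloatFields splitFloat( float value )
{
    uint32_t    bits;
    FloatFields fields;

    /* Copies the raw bits - Assumes sizeof( float ) == sizeof( uint32_t ) */
    memcpy( &bits, &value, sizeof( bits ) );

    fields.sign     = bits >> 31;
    fields.exponent = ( bits >> 23 ) & 0xFF;
    fields.mantissa = bits & 0x7FFFFF;

    return fields;
}
```

Under these assumptions, the real exponent of a normalized number is the 'exponent' member minus 127.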

2. Example

Take for instance the bit pattern 0100 0000 1011 1000 0000 0000 0000 0000, which is 0x40B80000 in hexadecimal.

  Hex    4    0    B    8    0    0    0    0
  Bin    0100 0000 1011 1000 0000 0000 0000 0000

  Sign    Exponent     Mantissa
  0       1000 0001    (1) 011 1000 0000 0000 0000 0000

  • The sign is 0, so the number is positive.
  • The exponent field is 1000 0001, which is 129 in decimal. The real exponent value is then 129 - 127, which is 2.
  • The mantissa, with the leading 1 bit, is 1011 1000 0000 0000 0000 0000.

The final representation of the number in the binary scientific notation is:

(-1)^0 · 1.0111 · 2^2

Mathematically, this means:

1 · ( 1 · 2^0 + 0 · 2^-1 + 1 · 2^-2 + 1 · 2^-3 + 1 · 2^-4 ) · 2^2
( 2^0 + 2^-2 + 2^-3 + 2^-4 ) · 2^2
2^2 + 2^0 + 2^-1 + 2^-2
4 + 1 + 0.5 + 0.25

The floating point value is then 5.75.
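
The result of this example can be cross-checked by letting the machine reinterpret the same bit pattern. The sketch below assumes that 'float' is an IEEE 754 single precision number stored on 4 bytes; 'bitsToFloat' and 'exampleByHand' are hypothetical names used only for this illustration.

```c
#include <stdint.h>
#include <string.h>

/* Reinterprets a 32 bits pattern as a float, without any numeric conversion */
static float bitsToFloat( uint32_t bits )
{
    float value;

    /* Copies the raw bits - Assumes sizeof( float ) == sizeof( uint32_t ) */
    memcpy( &value, &bits, sizeof( value ) );

    return value;
}

/* The sum of powers of two obtained by hand in the example */
static float exampleByHand( void )
{
    return 4.0f + 1.0f + 0.5f + 0.25f;
}
```

With these assumptions, bitsToFloat( 0x40B80000 ) and exampleByHand() should compare equal, both being 5.75.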

3. Special numbers

Depending on the value of the exponent field, some numbers can have special values. They can be:

  • Denormalized numbers
  • Zero
  • Infinity
  • NaN (not a number)

3.1. Denormalized numbers

If the value of the exponent field is 0 and the value of the mantissa field is greater than 0, then the number has to be treated as a denormalized number.
In such a case, the exponent is not -127, but -126, and the implicit leading bit is not 1 but 0.
That allows smaller numbers to be represented.

The scientific notation for a denormalized number is:

(-1)^S · 0.M · 2^-126
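
A minimal sketch of this formula follows. Since 0.M is M divided by 2^23, the value 0.M · 2^-126 is the same as M · 2^-149. The function name 'denormalizedValue' is hypothetical, the hexadecimal floating point constant requires C99, and the code assumes that denormalized results are not flushed to zero by the platform.

```c
#include <stdint.h>

/* Evaluates the denormalized formula for a bit pattern whose exponent field is 0 */
static float denormalizedValue( uint32_t bits )
{
    uint32_t sign     = bits >> 31;
    uint32_t mantissa = bits & 0x7FFFFF;

    /* 0.M · 2^-126 rewritten as M · 2^-149 (0x1p-149f is 2^-149) */
    float magnitude = ( float )mantissa * 0x1p-149f;

    return ( sign == 0 ) ? magnitude : -magnitude;
}
```

The multiplication is exact here, as every multiple of 2^-149 below 2^-126 is representable as a denormalized number.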

3.2. Zero

If the exponent and the mantissa fields are both 0, then the final number is zero. The sign bit is still honored, even if it does not make much sense mathematically, allowing a positive or a negative zero.
Note that zero can be considered as a denormalized number. In that case, it would be 0 · 2^-126, which is zero.
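
The two zeros can be told apart only by looking at the bits, as they compare equal as numbers. The sketch below illustrates this, assuming an IEEE 754 single precision 'float' stored on 4 bytes; 'floatToBits' and 'isNegativeZero' are hypothetical helper names.

```c
#include <stdint.h>
#include <string.h>

/* Returns the raw 32 bits pattern of a float */
static uint32_t floatToBits( float value )
{
    uint32_t bits;

    /* Copies the raw bits - Assumes sizeof( float ) == sizeof( uint32_t ) */
    memcpy( &bits, &value, sizeof( bits ) );

    return bits;
}

/* Returns 1 if the value is a zero whose sign bit is set, 0 otherwise */
static int isNegativeZero( float value )
{
    /* Only the sign bit is set: exponent and mantissa fields are both 0 */
    return floatToBits( value ) == 0x80000000u;
}
```

Under these assumptions, the comparison 0.0f == -0.0f is true, while isNegativeZero() returns 1 only for -0.0f.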

3.3. Infinity

If the value of the exponent field is 255 (all 8 bits are set) and if the value of the mantissa field is 0, the number is an infinity, either positive or negative, depending on the sign bit.

3.4. NaN

If the value of the exponent field is 255 (all 8 bits are set) and if the value of the mantissa field is not 0, then the value is not a number. The sign bit has no meaning in such a case.
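
The rules of this section can be summarized by a small classification function working on the raw bit pattern. This is only a sketch: the 'FloatClass' enumeration and the 'classifyBits' function are names invented for this example, not part of any standard library.

```c
#include <stdint.h>

/* Hypothetical enumeration of the cases described in this section */
typedef enum
{
    FloatClassZero,
    FloatClassDenormalized,
    FloatClassNormalized,
    FloatClassInfinity,
    FloatClassNaN
}
FloatClass;

/* Classifies a 32 bits pattern from its exponent and mantissa fields */
static FloatClass classifyBits( uint32_t bits )
{
    uint32_t exponent = ( bits >> 23 ) & 0xFF;
    uint32_t mantissa = bits & 0x7FFFFF;

    if( exponent == 0 )
    {
        /* Exponent field is 0: zero or denormalized number */
        return ( mantissa == 0 ) ? FloatClassZero : FloatClassDenormalized;
    }

    if( exponent == 255 )
    {
        /* All exponent bits are set: infinity or NaN */
        return ( mantissa == 0 ) ? FloatClassInfinity : FloatClassNaN;
    }

    /* Any other exponent field: regular normalized number */
    return FloatClassNormalized;
}
```

The sign bit is deliberately ignored, as it does not change the category of the number.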

4. Range

The range depends on whether the number is normalized or not. Below are the ranges for these two cases:

4.1. Normalized numbers

  • Min: ±1.1754943508223E-38 / ±1.00000000000000000000000 · 2^-126
  • Max: ±3.4028234663853E+38 / ±1.11111111111111111111111 · 2^127

4.2. Denormalized numbers

  • Min: ±1.4012984643248E-45 / ±0.00000000000000000000001 · 2^-126
  • Max: ±1.1754942106924E-38 / ±0.11111111111111111111111 · 2^-126
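
These boundaries can be obtained from the bit patterns themselves. In the sketch below, the four patterns are derived from the field rules described earlier (smallest and largest exponent and mantissa fields for each case); the function names are hypothetical, and 'float' is assumed to be an IEEE 754 single precision number stored on 4 bytes. The 'FLT_MIN' and 'FLT_MAX' macros from 'float.h' give the normalized boundaries for comparison.

```c
#include <float.h>
#include <stdint.h>
#include <string.h>

/* Reinterprets a 32 bits pattern as a float, without any numeric conversion */
static float bitsToFloat( uint32_t bits )
{
    float value;

    /* Copies the raw bits - Assumes sizeof( float ) == sizeof( uint32_t ) */
    memcpy( &value, &bits, sizeof( value ) );

    return value;
}

/* Exponent field 1, mantissa field 0: expected to equal FLT_MIN */
static float minNormalized( void )
{
    return bitsToFloat( 0x00800000u );
}

/* Exponent field 254, all mantissa bits set: expected to equal FLT_MAX */
static float maxNormalized( void )
{
    return bitsToFloat( 0x7F7FFFFFu );
}

/* Exponent field 0, only the last mantissa bit set */
static float minDenormalized( void )
{
    return bitsToFloat( 0x00000001u );
}

/* Exponent field 0, all mantissa bits set */
static float maxDenormalized( void )
{
    return bitsToFloat( 0x007FFFFFu );
}
```

Printing these values with a format such as "%.13E" is one way to compare them with the decimal figures listed above.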

5. C code example

Below is an example of a C program that converts a binary number to its float representation:

#include <stdlib.h>
#include <stdio.h>
#include <math.h>

/**
 * Converts an integer to its float representation
 *
 * This function converts a 32 bits integer to a single precision floating point
 * number, as specified by the IEEE Standard for Floating-Point Arithmetic
 * (IEEE 754). This standard can be found at the following address:
 * {@link http://ieeexplore.ieee.org/servlet/opac?punumber=4610933}
 *
 * @param   binary  The integer to convert to a floating point value
 *                  (unsigned int is assumed to be at least 32 bits wide)
 * @return  The floating point number
 */
float binaryToFloat( unsigned int binary );
float binaryToFloat( unsigned int binary )
{
    unsigned int sign;
    int          exp;
    unsigned int mantissa;
    float        floatValue;
    int          i;

    /* Gets the sign field */
    /* Bit 0, left to right */
    sign = ( binary >> 31 ) & 0x1;

    /* Gets the exponent field */
    /* Bits 1 to 8, left to right */
    exp = ( int )( ( binary >> 23 ) & 0xFF );

    /* Gets the mantissa field */
    /* Bits 9 to 31, left to right */
    mantissa = ( binary & 0x7FFFFF );

    floatValue = 0;

    /* Checks the values of the exponent and the mantissa fields to handle special numbers */
    if( exp == 0 && mantissa == 0 )
    {
        /* Zero - No need for a computation even if it can be considered as a denormalized number */
        /* The sign bit is kept, giving a positive or a negative zero */
        return ( sign == 0 ) ? 0.0f : -0.0f;
    }
    else if( exp == 255 && mantissa == 0 )
    {
        /* Infinity - INFINITY is a C99 macro from math.h */
        return ( sign == 0 ) ? INFINITY : -INFINITY;
    }
    else if( exp == 255 && mantissa != 0 )
    {
        /* Not a number - NAN is a C99 macro from math.h */
        return NAN;
    }
    else if( exp == 0 && mantissa != 0 )
    {
        /* Denormalized number - Exponent is fixed to -126 */
        /* The implicit bit is 0, so the mantissa is left untouched */
        exp = -126;
    }
    else
    {
        /* Computes the real exponent */
        exp = exp - 127;

        /* Adds the implicit bit to the mantissa */
        mantissa = mantissa | 0x800000;
    }

    /* Process the 24 bits of the mantissa */
    for( i = 0; i > -24; i-- )
    {
        /* Checks if the current bit is set */
        if( mantissa & ( 1u << ( i + 23 ) ) )
        {
            /* Adds the value for the current bit */
            /* This is done by computing two raised to the power of the exponent plus the bit position */
            /* (negative if it's after the implicit bit, as we are using scientific notation) */
            floatValue += ( float )pow( 2, i + exp );
        }
    }

    /* Returns the final float value */
    return ( sign == 0 ) ? floatValue : -floatValue;
}

int main( void )
{
    /* Prints 5.750000 */
    printf( "%f\n", binaryToFloat( 0x40B80000 ) );

    return EXIT_SUCCESS;
}