Author: Jean-David Gadina
Copyright: © 2024 Jean-David Gadina - www.xs-labs.com - All Rights Reserved
License: This article is published under the terms of the FreeBSD Documentation License
Source: IEEE Standard for Floating-Point Arithmetic - IEEE 754
Single precision floating point numbers are usually called 'float' or 'real'. They are 4 bytes (32 bits) long, and are packed the following way, from left to right:
X | XXXX XXXX | XXX XXXX XXXX XXXX XXXX XXXX |
---|---|---|
Sign 1 bit | Exponent 8 bits | Mantissa 23 bits |
The sign indicates if the number is positive or negative (zero for positive, one for negative).
The real exponent is computed by subtracting 127 from the value of the exponent field. It is the exponent of the number as it is expressed in scientific notation.
The full mantissa, which is also sometimes called the significand, should be considered as a 24-bit value. As we are using binary scientific notation, there is an implicit leading bit (sometimes called the hidden bit), always set to 1 for normalized numbers, as there is never a leading 0 in scientific notation.
For instance, you won't say $0.123 \cdot 10^{5}$ but $1.23 \cdot 10^{4}$.
The conversion is performed the following way:
$(-1)^{S} \cdot 1.M \cdot 2^{(E - 127)}$
Where S is the sign, M the mantissa, and E the exponent.
For instance, take 0100 0000 1011 1000 0000 0000 0000 0000, which is 0x40B80000 in hexadecimal.
Hex | 4 | 0 | B | 8 | 0 | 0 | 0 | 0 |
---|---|---|---|---|---|---|---|---|
Bin | 0100 | 0000 | 1011 | 1000 | 0000 | 0000 | 0000 | 0000 |
Sign | Exponent | Mantissa |
---|---|---|
0 | 1000 0001 | (1) 011 1000 0000 0000 0000 0000 |
The sign bit is 0, so the number is positive.
The exponent field is 1000 0001, which is 129 in decimal. The real exponent value is then 129 - 127, which is 2.
The mantissa, with its implicit leading bit, is 1011 1000 0000 0000 0000 0000.
The final representation of the number in binary scientific notation is:
$(-1)^{0} \cdot 1.0111 \cdot 2^{2}$
Mathematically, this means:
$1 \cdot (1 \cdot 2^{0} + 0 \cdot 2^{-1} + 1 \cdot 2^{-2} + 1 \cdot 2^{-3} + 1 \cdot 2^{-4}) \cdot 2^{2}$
$(2^{0} + 2^{-2} + 2^{-3} + 2^{-4}) \cdot 2^{2}$
$2^{2} + 2^{0} + 2^{-1} + 2^{-2}$
$4 + 1 + 0.5 + 0.25$
The floating point value is then 5.75.
Depending on the value of the exponent field, some numbers have a special meaning. They can be denormalized numbers, zeros, infinities, or not a number (NaN).
If the value of the exponent field is 0 and the value of the mantissa field is greater than 0, then the number has to be treated as a denormalized number.
In such a case, the exponent is not -127, but -126, and the implicit leading bit is not 1 but 0.
That allows smaller numbers to be represented.
The scientific notation for a denormalized number is:
$(-1)^{S} \cdot 0.M \cdot 2^{-126}$
If the exponent and the mantissa fields are both 0, then the final number is zero. The sign bit is still taken into account, even if it does not make much sense mathematically, allowing a positive or a negative zero.
Note that zero can be considered as a denormalized number. In that case, it would be $0 \cdot 2^{-126}$, which is zero.
If the value of the exponent field is 255 (all 8 bits are set) and if the value of the mantissa field is 0, the number is an infinity, either positive or negative, depending on the sign bit.
If the value of the exponent field is 255 (all 8 bits are set) and if the value of the mantissa field is not 0, then the value is not a number (NaN). The sign bit has no meaning in such a case.
The range depends on whether the number is normalized or not. Below are the ranges for those two cases:
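Type | Smallest magnitude | Largest magnitude |
---|---|---|
Normalized | ±1.1754943508223E-38 / $\pm 1.00000000000000000000000 \cdot 2^{-126}$ | ±3.4028234663853E+38 / $\pm 1.11111111111111111111111 \cdot 2^{127}$ |
Denormalized | ±1.4012984643248E-45 / $\pm 0.00000000000000000000001 \cdot 2^{-126}$ | ±1.1754942106924E-38 / $\pm 0.11111111111111111111111 \cdot 2^{-126}$ |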
Below is an example of a C program that converts a 32-bit binary value to the float it represents:
#include <stdlib.h>
#include <stdio.h>
#include <math.h>

/**
 * Converts an integer to its float representation
 *
 * This function converts a 32-bit integer to a single precision floating point
 * number, as specified by the IEEE Standard for Floating-Point Arithmetic
 * (IEEE 754). This standard can be found at the following address:
 * {@link http://ieeexplore.ieee.org/servlet/opac?punumber=4610933}
 *
 * @param   binary  The integer to convert to a floating point value
 * @return  The floating point number
 */
float binaryToFloat( unsigned int binary );
float binaryToFloat( unsigned int binary )
{
    unsigned int sign;
    int          exp;
    unsigned int mantissa;
    float        floatValue;
    int          i;

    /* Gets the sign field */
    /* Bit 0, left to right */
    sign = ( binary >> 31 ) & 0x1;

    /* Gets the exponent field */
    /* Bits 1 to 8, left to right */
    exp = ( ( binary >> 23 ) & 0xFF );

    /* Gets the mantissa field */
    /* Bits 9 to 31, left to right */
    mantissa = ( binary & 0x7FFFFF );

    floatValue = 0;
    i          = 0;

    /* Checks the values of the exponent and the mantissa fields to handle special numbers */
    if( exp == 0 && mantissa == 0 )
    {
        /* Zero - No need for a computation even if it can be considered as a denormalized number */
        /* The sign bit is kept, giving a positive or a negative zero */
        return ( sign == 0 ) ? 0.0f : -0.0f;
    }
    else if( exp == 255 && mantissa == 0 )
    {
        /* Infinity - Positive or negative, depending on the sign bit */
        return ( sign == 0 ) ? INFINITY : -INFINITY;
    }
    else if( exp == 255 && mantissa != 0 )
    {
        /* Not a number - The sign bit has no meaning */
        return NAN;
    }
    else if( exp == 0 && mantissa != 0 )
    {
        /* Denormalized number - Exponent is fixed to -126, and there is no implicit bit */
        exp = -126;
    }
    else
    {
        /* Computes the real exponent */
        exp = exp - 127;

        /* Adds the implicit bit to the mantissa */
        mantissa = mantissa | 0x800000;
    }

    /* Processes the 24 bits of the mantissa, starting with the implicit bit (bit 23) */
    for( i = 0; i > -24; i-- )
    {
        /* Checks if the current bit is set */
        if( mantissa & ( 1u << ( i + 23 ) ) )
        {
            /* Adds the value for the current bit */
            /* This is done by computing two raised to the power of the exponent plus the bit position */
            /* (negative if it's after the implicit bit, as we are using scientific notation) */
            floatValue += ( float )pow( 2.0, i + exp );
        }
    }

    /* Returns the final float value */
    return ( sign == 0 ) ? floatValue : -floatValue;
}

int main( void )
{
    /* Prints 5.750000 */
    printf( "%f\n", binaryToFloat( 0x40B80000 ) );

    return EXIT_SUCCESS;
}