In the last episode we talked about the data representation of integers, a kind
of fixed-point number. Today we're going to learn about floating-point numbers.
Floating-point numbers are used to approximate real numbers. Because everything
in a computer is, eventually, just a finite sequence of bits, the representation
of floating-point numbers has to make trade-offs between range and precision.
Due to the computational complexity of floating-point arithmetic, CPUs also have
a dedicated set of instructions to accelerate it.
Terminologies
The terminology of floating-point numbers comes from
scientific notation,
where a real number can be represented like this:

```
1.2345 = 12345 × 10 ** -4
```

- `12345` is the significand, or mantissa
- `10` is the base, or radix
- `-4` is the exponent
So where is the floating point? It's the `.` of `1.2345`. Imagine the dot
floating one position to the left to make the representation `.12345` (with the
exponent adjusted accordingly).
The dot is called the radix point: to us it seems to be a decimal point,
but in computers it's really a binary point.
Now it becomes clear that, to represent a floating-point number in a computer,
we simply assign some bits to the significand and some to the exponent, plus
potentially one bit for the sign, and that's it.
IEEE 754 32-bit Single-Precision Floats
It was called single back in IEEE 754-1985 and is now binary32 in the
relatively new IEEE 754-2008 standard.

```
| sign (1 bit) | exponent (8 bits) | fraction (23 bits) |
```
- The sign part takes 1 bit to indicate the sign of the float (`0` for `+` and `1` for `-`). This is the same treatment as in sign-magnitude representation.
- The exponent part takes 8 bits and uses an offset-binary (biased) form to represent a signed integer. It's a variant form, since it takes out `-127` (all 0s) for zeros and `+128` (all 1s) for non-numbers, so it ranges only over `[-126, 127]` instead of `[-127, 128]`. Among these 254 remaining values, the zero offset is chosen to be `127` (like using `128` in excess-128), a.k.a. the exponent bias in the standard.
- The fraction part takes 23 bits, with an implicit leading bit `1`, and represents the actual significand with a total precision of 24 bits.
Don't be confused by why it's called fraction instead of significand!
It's all because the 23 bits in the representation are indeed representing
the fraction part of the real significand in the scientific notation.
The floating-point version of "scientific notation" is more like:

```
(-1)^S × 1.F × 2^(E - 127)    (the "1." is the implicit leading 1)
```
So what number do the bits `0 01111111 00000000000000000000000` represent?

```
S = 0, E = 0b0111'1111 = 127, F = 0
(-1)^0 × 1.0 × 2^(127 - 127) = 1.0
```

Aha! It's the real number `1`!
Recall that `E = 0b0111 1111` decodes to an unbiased exponent of `0`,
because it uses a biased representation!
We will add more nontrivial examples later.
Demoing Floats in C/C++
Writing sample code to convert between binaries (in hex) and floats is not
as straightforward as it is for integers. Luckily, there are still some hacks to
perform it:

C - Unsafe Cast

We unsafely cast a pointer to enable reinterpretation of the same binaries.

```c
uint32_t u = 0x3f800000; // C doesn't have a floating literal taking hex bits,
float f1 = *(float *)&u; // so reinterpret an integer's bits via a pointer cast
printf("%f\n", f1);      // prints 1.000000
```
C - Union Trick

Oh, I really enjoyed this one... A union in C is not only untagged; its members
also share the exact same chunk of memory. So we are doing the same
reinterpretation, but in a more structural and technically fancier way.

```c
union {
    uint32_t u;
    float f;
} pun;                 // both members alias the same memory

pun.u = 0x3f800000;
printf("%f\n", pun.f); // prints 1.000000
```
N.B. this trick is well-known as type punning:
In computer science, type punning is a common term for any programming technique that subverts or circumvents the type system of a programming language in order to achieve an effect that would be difficult or impossible to achieve within the bounds of the formal language.
C++ - reinterpret_cast

C++ does promote such type punning into the standard language:

```cpp
uint32_t u = 0x40490fdb;
float f = *reinterpret_cast<float *>(&u);
std::cout << f << std::endl; // prints 3.14159
```

N.B. it still needs to be a conversion between pointers,
see https://en.cppreference.com/w/cpp/language/reinterpret_cast.
Besides, C++17 does add a floating-point literal that can take hex, but it
works in a different way, using an explicit radix point in the hex:

```cpp
float f = 0x1.2p3; // hex 1.2 (= 1.125) × 2^3 = 9.0
```
Let's try the other direction:

```cpp
float f = 3.14159274f; // the closest float to pi
uint32_t u = *reinterpret_cast<uint32_t *>(&f);
std::cout << std::hex << u << std::endl; // prints 40490fdb
```
Representation of Non-Numbers

There is more in IEEE 754!
Real numbers don't satisfy the closure property
the way integers do. Notably, the set of real numbers is NOT closed under
division! Arithmetic can produce non-number results such as infinity (e.g. `1/0`)
and NaN (Not-a-Number) (e.g. taking
the square root of a negative number).
It would be algebraically ideal if the set of floating-point numbers could be
closed under all floating-point arithmetic. That would make many people's lives
easier. So the IEEE made it so! Non-number values are squeezed in.
We will also include the two zeros (`+0`/`-0`) in the comparison here,
since they are also special in having an all-zero `0x00` exponent:
| value  | sign | exponent (8 bits) | fraction (23 bits)   | hex               |
| ------ | ---- | ----------------- | -------------------- | ----------------- |
| `+0`   | `0`  | `0000 0000`       | all `0`s             | `0x00000000`      |
| `-0`   | `1`  | `0000 0000`       | all `0`s             | `0x80000000`      |
| `+inf` | `0`  | `1111 1111`       | all `0`s             | `0x7f800000`      |
| `-inf` | `1`  | `1111 1111`       | all `0`s             | `0xff800000`      |
| qNaN   | `0`  | `1111 1111`       | MSB `1`, rest any    | e.g. `0x7fc00000` |
| sNaN   | `0`  | `1111 1111`       | MSB `0`, rest nonzero | e.g. `0x7f800001` |
Exact encodings of qNaN and sNaN are not specified in IEEE 754 and are
implemented differently on different processors. Luckily, both the x86 and ARM
families use the most significant bit of the fraction to indicate whether a NaN
is quiet.
More on NaN
If we look carefully into the IEEE 754-2008 spec, on page 35, §6.2.1, it
actually defines anything with exponent `FF` that is not an infinity (i.e.
whose fraction bits are not all `0`) as a NaN!
All binary NaN bit strings have all the bits of the biased exponent field E set to 1 (see 3.4). A quiet NaN bit string should be encoded with the first bit (d1) of the trailing significand field T being 1. A signaling NaN bit string should be encoded with the first bit of the trailing significand field being 0.
That implies we actually have `2 ** 24 - 2` NaNs in a 32-bit float!
The `24` comes from the `1` sign bit plus the `23` fraction bits, and the `2`
excluded are `+/- inf`.
The contiguous 22 bits inside the fraction look like quite a waste, and there
would be even 51 of them in a `double`! We will see how to make them useful
in later episodes (spoiler: they are known as the NaN payload).
It's also worth noting that it's weird that the IEEE chose to use the MSB of
the fraction instead of the sign bit for NaN quietness/signalingness:
It seems strange to me that the bit which signifies whether or not the NaN is signaling is the top bit of the mantissa rather than the sign bit; perhaps something about how floating point pipelines are implemented makes it less natural to use the sign bit to decide whether or not to raise a signal.
– https://anniecherkaev.com/the-secret-life-of-nan
I guess it might be something related to the CPU pipeline? I don’t know yet.
Equality of NaNs and Zeros

The spec defines comparisons involving NaNs to return an unordered result. That
means any comparison operation except `!=` (i.e. `>=`, `<=`, `>`, `<`, `==`)
between a NaN and any other floating-point number returns `false` (and `!=`
always returns `true`).
No surprise that most (if not all) languages implement this behaviour, e.g.
in JavaScript:

```js
NaN !== NaN // true
```
Positive and negative zeros, however, are defined to be equal!

```js
+0 === -0 // true, using the traditional JS equality
```
In C++, we can tell them apart by looking at the sign bit:

```cpp
std::signbit(+0.0); // false
std::signbit(-0.0); // true
```
IEEE 754 64-bit Double-Precision Floats

Now, the 64-bit version of floating-point numbers, known as `double`, is just a
matter of scale:

```
| sign (1 bit) | exponent (11 bits) | fraction (52 bits) |
```
IEEE 754-2008 16-bit Short Floats

The 2008 edition of IEEE 754 also standardizes the 16-bit format binary16
(a.k.a. half precision), which is in neither the C nor the C++ standard, though
compiler extensions might include it. It looks like:

```
| sign (1 bit) | exponent (5 bits) | fraction (10 bits) |
```
- Title: Data Representation - Floating Point Numbers
- Created: 2021-03-28 00:00:00
- Link: posts/5157.html
- Copyright: unless otherwise stated, all posts on this blog are licensed under BY-NC-SA. Please credit the source when reposting!