Floating point arithmetic and Doubles

Floating point arithmetic and Doubles
Here comes a newbie Java programmer
Concept of floating point numbers and double/float variables
How IEEE people solved it
Q & A
References

Here comes a newbie Java programmer

Imagine you want to split $200 between four people. First, two of them will get 1/3 and the other two will get 1/6.

double a = 200;
double b = a / 3d;
double c = b / 2d;
double d = b + b + c + c;
System.out.println(a); // 200.0
System.out.println(b); // 66.66666666666667
System.out.println(c); // 33.333333333333336
System.out.println(d); // 200.00000000000003

That’s strange. Anyway, then you try something like this:

float x = 1.0f;
int p = 0;
while (x != x + 1) {
    // this should go for ever but
    // it does exit within few loops
    x = x * 3;
    p++;
}
System.out.println(p);
System.out.println(x);

So you check the internet and find that you should be using BigDecimal in Java for handling floating point numbers safely.

So you now try the same example using BigDecimal to see if it works as expected.

BigDecimal a = new BigDecimal(200.0);
BigDecimal b = a.divide(new BigDecimal(3.0));
BigDecimal c = b.divide(new BigDecimal(2.0));
BigDecimal d = b.add(b).add(c).add(c);
System.out.println(b);
System.out.println(c);
System.out.println(d);

But this throws:

java.lang.ArithmeticException: Non-terminating decimal expansion; no exact representable decimal result.

This is because you’re not specifying a precision and a rounding-mode. BigDecimal is complaining that it could use 10, 20, 5000, or infinity decimal places, and it still wouldn’t be able to give you an exact representation of the number. So instead of giving you an incorrect BigDecimal, it just whinges at you.

BigDecimal a = new BigDecimal(200.0);
BigDecimal b = a.divide(new BigDecimal(3.0), 2, RoundingMode.FLOOR);
BigDecimal c = b.divide(new BigDecimal(2.0), 2, RoundingMode.FLOOR);
BigDecimal d = b.add(b).add(c).add(c).setScale(0, RoundingMode.UP);
System.out.println(b);
System.out.println(c);
System.out.println(d);

First we specify decimal precision as 2 and rounding to floor for divisions. Then after the addition we get 199.98, so we round it off to 0 decimals and round it up to the nearest integer.

So far so good. Now that we have used BigDecimal this far, we try a few more scenarios to make sure it is working how we expect it to.

BigDecimal a = new BigDecimal(0.2d);
BigDecimal b = new BigDecimal(String.valueOf(0.2d));
BigDecimal c = new BigDecimal("0.2");
BigDecimal d = new BigDecimal("0.20");
System.out.println(a);   // 0.200000000000000011102230246251565404236316680908203125
System.out.println(b);   // 0.2
System.out.println(c);   // 0.2
System.out.println(d);   // 0.20
System.out.println(a.equals(b));  // false
System.out.println(a.equals(c));  // false
System.out.println(a.equals(d));  // false
System.out.println(b.equals(c));  // true
System.out.println(b.equals(d));  // false
System.out.println(c.equals(d));  // false
System.out.println(a.compareTo(b)); // 1
System.out.println(a.compareTo(c)); // 1
System.out.println(a.compareTo(d)); // 1
System.out.println(b.compareTo(c)); // 0
System.out.println(b.compareTo(d)); // 0
System.out.println(c.compareTo(d)); // 0

As you can see, our expectation was wrong. If you read the Javadoc, then you can see the reasons behind it. Here is my three-point extract from said Javadoc:

The double constructor can be somewhat unpredictable. This has something to do with the binary representation of floating point numbers. String constructors are predictable. So if you want to get an exact number from double, then try taking the path of double -> String -> BigDecimal.
The equals method doesn’t work the same way as natural ordering. This is because BigDecimal is composed of two components called unscaledValue and scale. The value represented by the number is (unscaledValue × 10^-scale). Since both components are part of equals, true is returned only if numbers have the same scale (that means 0.2 and 0.20 are not equal). The hashCode method follows the contract as well (therefore you can expect different hash codes for 0.2 and 0.20).
The compareTo method works in the same way as natural ordering. Number a is just greater than the others due to a different constructor.

Rule: If you are dealing with inputs and you are planning to put them into BigDecimal, don’t cast the incoming data to double — instead use it as String. E.g. new BigDecimal("0.2");

Let’s see why all this fuss about double and BigDecimals.

Concept of floating point numbers and double/float variables

It’s really easy to write integers as binary numbers in 2’s complement form. It’s a lot more difficult to express floating point numbers in a form that a computer can understand. The biggest problem, of course, is keeping track of the decimal point.

There are lots of possible ways to write floating point numbers as strings of binary digits. Here are some things that the original designers might have had to consider when picking a solution:

Range: To be useful, your method should allow very large positive and negative numbers.
Precision: Can you tell the difference between 1.7 and 1.8? How about between 1.700001 and 1.700002? How many decimal places should you remember?
Time Efficiency: Does your solution make comparisons and arithmetic operations fast and easy?
Space Considerations: An extremely precise representation of the square root of 3 is generally a wonderful thing, unless you require a megabyte to store it.
1-1 Relationships: Your solution will be a lot simpler if each floating-point number can be written only one way, and vice versa.

How IEEE people solved it

The method that the original developers finally hit upon uses the idea of scientific notation. Scientific notation is a standard way to express numbers; it makes them easy to read and compare. You’re probably familiar with scientific notation with base-10 numbers. You just factor your number into two parts: a value in the range of 1 <= n < 10, and a power of 10. For example:

Scientific notation base 10 example

The same idea applies here, except that you need to use powers of 2 because the computer deals with binary numbers. Just factor your number into a value in the range 1 <= n < 2, and a power of 2. For example:

Scientific notation base 2 example

You’re almost done when the number is in this form:

$Sign, fraction and power form$

This was a clever move. Instead of one messy value, you now have three important pieces of information that can identify the number:

If the sign bit is a 0, then the number is positive; (-1)^0 = 1. If the sign bit is a 1, the number is negative.
We always factor so that the number in parentheses equals 1 + some fraction. Since we know that the 1 is there, the only important thing is the fraction, which you will write as a binary string.
The power of 2 that you got in the last step is simply an integer. Since you don’t want to deal with any messy negative numbers, you can add a constant value (the bias) to that power.

For single-precision floating-point, the bias = 127. For double-precision, the bias = 1023.

The sum of the bias and the power of 2 is the exponent that actually goes into the IEEE 754 string. Remember, exponent = power + bias (alternatively, power = exponent - bias). When you have calculated these binary values, you can put them into a 32- or 64-bit field.

The digits are arranged like this:

IEEE 754 bit layout

Check out these examples to get a clear idea:

Q & A

Q. What rule does Java use to print out a double?

A. Here’s the java.lang.Double API: it always prints at least one digit after the decimal. After that, it uses as many digits as necessary (but no more) to distinguish the number from the nearest representable double.

Q. How are zero, infinity and NaN represented using IEEE 754?

A. By setting all the exponent bits to 1. Positive infinity = 0x7ff0000000000000 (all exponent bits 1, sign bit 0 and all mantissa bits 0), negative infinity = 0xfff0000000000000 (all exponent bits 1, sign bit 1 and all mantissa bits 0), NaN = 0x7ff8000000000000 (all exponent bits 1, at least one mantissa bit set). Positive zero = all bits 0. Negative zero = all bits 0, except sign bit which is 1.

Q. What is StrictMath?

A. Java’s Math library guarantees its results to be within 1 or 2 ulps of the true answer. Java’s StrictMath guarantees that all results are accurate to within 1/2 ulp of the true answer. A classic tradeoff of speed vs. accuracy.

Q. What are ulps?

// π with 20 decimal digits
BigDecimal π = new BigDecimal("3.14159265358979323846");

// truncate to a double floating point
double p0 = π.doubleValue();
// -> 3.141592653589793  (hex: 0x1.921fb54442d18p1)

// p0 is smaller than π, so find next number representable as double
double p1 = Math.nextUp(p0);
// -> 3.1415926535897936 (hex: 0x1.921fb54442d19p1)

// if you are wondering which number is taken as the next number
// what java does is take
Double.longBitsToDouble(Double.doubleToRawLongBits(number)+1)
// that means incrementing the current binary representation of the number

// ulp(π) is the difference between p1 and p0
BigDecimal ulp = new BigDecimal(p1).subtract(new BigDecimal(p0));
// -> 4.44089209850062616169452667236328125E-16
// (this is precisely 2**(-51))

// same result when using the standard library function
double ulpMath = Math.ulp(p0);
// -> 4.440892098500626E-16 (hex: 0x1.0p-51)

Q. What is the strictfp modifier?

A. It is also possible to use the strictfp modifier when declaring a class or method. This ensures that the floating point results will be bit-by-bit accurate across different JVMs. The IEEE standard allows processors to perform intermediate computations using more precision if the result would overflow. The strictfp requires that every intermediate result be truncated to 64-bit double format. This can be a major performance hit since Intel Pentium processor registers operate using IEEE 754’s 80-bit double-extended format. Not many people have a use for this mode.

Q. What does the compiler flag javac -ffast-math do?

A. It relaxes some of the rounding requirements of IEEE. It makes some floating point computations faster.

Q. Are integers always represented exactly using IEEE floating point?

A. Yes, barring overflow of the 52 bit mantissa. The smallest positive integer that is not represented exactly using type double is 2^53 + 1 = 9,007,199,254,740,993.

Q. Does (a + b) always equal (b + a) when a and b are floating point numbers?

A. Yes.

Q. Does x / y always equal the same value, independent of my platform?

A. Yes. IEEE requires that operations (+ * - /) are performed exactly and then rounded to the nearest floating point number (using Banker’s rounding if there is a tie: round to nearest even number). This improves portability at the expense of efficiency.

Floating point arithmetic and Doubles

Here comes a newbie Java programmer

Concept of floating point numbers and double/float variables

How IEEE people solved it

Q & A

References