Current article illustrates a corner case in regards to loop invariant code motion optimization in JDK10, which at first glance seems to regress in comparison to JDK9. Such optimization tries to move outside the body of a loop statements or expressions which do not depend on the loop itself, without affecting the overall semantics of the program.
I have created a small benchmark which contains a loop of 200,000 iterations that computes, at each iteration, the length of a circle (e.g. 2πR) modulo iteration counter and sums all intermediate results. The R is the radius of the arc (i.e. a constant in our case) and the π is computed based on explicit math computation formula (not using the Math.PI constant).
Apart base scenario, I also added similar benchmark test cases which contain superfluous loop invariant Pi computation formula only to stress the Just in Time Compiler and to check how it can optimize them. Please find the entire source code below.
@BenchmarkMode(Mode.AverageTime) @OutputTimeUnit(TimeUnit.MILLISECONDS) @Warmup(iterations = 5, timeUnit = TimeUnit.MILLISECONDS) @Measurement(iterations = 5, timeUnit = TimeUnit.MILLISECONDS) @Fork(value = 3, warmups = 1) @State(Scope.Benchmark) public class LoopInvariantCodeJmh { @Param({ "200000" }) public int iterations; @Param({ "42" }) public int radius; public static void main(String[] args) throws RunnerException { Options opt = new OptionsBuilder() .include(LoopInvariantCodeJmh.class.getSimpleName()) .build(); new Runner(opt).run(); } @Benchmark // Benchmark method containing 10 loop invariant computePi() method calls per iteration public double circleLengthModulo_10x() { double sum = 0; for (int i = 0; i < iterations; i++) { // 10 loop invariant method calls -> stresses the Compiler computePi(); // 1st computePi(); // 2nd computePi(); // 3rd computePi(); // 4th computePi(); // 5th computePi(); // 6th computePi(); // 7h computePi(); // 8th computePi(); // 9th computePi(); // 10th // compute circle length modulo i -> this really matters ! sum += (2 * radius * computePi()) % i; } return sum; } @Benchmark // Benchmark method containing 5 loop invariant computePi() method calls per iteration public double circleLengthModulo_5x() { double sum = 0; for (int i = 0; i < iterations; i++) { // 5 loop invariant method calls -> stresses the Compiler computePi(); // 1st computePi(); // 2nd computePi(); // 3rd computePi(); // 4th computePi(); // 5th // compute circle length module i -> this really matters ! sum += (2 * radius * computePi()) % i; } return sum; } @Benchmark public double circleLengthModulo() { double sum = 0; for (int i = 0; i < iterations; i++) { // compute circle length module i -> this really matters ! sum += (2 * radius * computePi()) % i; } return sum; } private double computePi() { double Pi = 4; boolean sign = false; // Pi / 4 = 1 - (1/3) + (1/5) - (1/7) + (1/9) - (1/11) + ... // Math.Pi = 3.14159265358979323846 for (int i = 3; i < 1000; i += 2) { if (sign) { Pi += 4.0 / i; } else { Pi -= 4.0 / i; } sign = !sign; } return Pi; } }
I have tested above benchmark with JDK9, JDK10 and JDK11. Even if at the moment of writing the article JDK11 is not released yet, I have downloaded a build from JDK 11 Early-Access Builds web page.
Benchmark Mode Cnt Score Error Units JDK9 circleLengthModulo avgt 15 472.877 ± 7.700 ms/op JDK9 circleLengthModulo_5x avgt 15 255.441 ± 6.711 ms/op JDK9 circleLengthModulo_10x avgt 15 254.129 ± 4.318 ms/op JDK10 circleLengthModulo avgt 15 454.195 ± 7.850 ms/op JDK10 circleLengthModulo_5x avgt 15 2,036.281 ± 57.482 ms/op JDK10 circleLengthModulo_10x avgt 15 3,818.085 ± 91.651 ms/op JDK11 circleLengthModulo avgt 15 465.766 ± 6.461 ms/op JDK11 circleLengthModulo_5x avgt 15 259.395 ± 11.803 ms/op JDK11 circleLengthModulo_10x avgt 15 261.039 ± 5.192 ms/op
Tests triggered on my machine (CPU: Intel i7-6700HQ Skylake; MEMORY: 16GB DDR4 2133 MHz; OS: Ubuntu 16.04.2)
Few conclusions:
- circleLengthModulo() benchmark method seems to behave almost the same without any noticeable difference in performance in case of JDK9, JDK10 and JDK11
- circleLengthModulo_5x() and circleLengthModulo_10x() benchmark methods have almost the same response time per iteration in case of JDK9 and JDK11, however JDK10 adds a huge penalty as follows:
- circleLengthModulo_5x() which contains 5 x computePi() loop invariant methods seems ~8 times slower in JDK10 than JDK9 or JDK11
- circleLengthModulo_10x() which contains 10 x computePi() loop invariant methods seems ~14 times slower in JDK10 than JDK9 or JDK11
Since it might be interesting why it behaves so slow in JDK10 in comparison to JDK9 and JDK11, I tried to write simplistic pseudocode version derived from assembly code generated in case of circleLengthModulo_10x() method (the same for circleLengthModulo_5x()).
JDK9 – rough pseudocode version
// JDK 9 public circleLengthModulo_10x() { sum = 0; 2Radius = 2 * radius; // constant subexpression hoisted out of main loop while (outer_counter < 200_000) { inline computePi() { // loop unroled by a factor of 32 while (inner_counter < 994) { unrolls 32 iterations step from Pi series using vectorized operations inner_counter += 32 } // handle the remaining in a post loop while (inner_counter < 1000) { unrolls 2 iterations step from Pi series using vectorized operations inner_counter += 2 } } outer_counter += 1 sum += (2Radius * Pi) % outer_counter } return sum; }
In essence what it does is to inline the method computePi() in the caller only once, computes the π value using vectorized instructions and unrolling the main Pi loop by a factor of 32 and the post Pi loop by a factor of 2, etc.
I am not targeting to describe in detail all these under the hood Just In Time Compiler optimizations, however if you are interested in the topic you can check my talk Runtime vs. compile time (JIT vs. AOT) optimizations in Java and C++ .
JDK10 – rough pseudocode version
// JDK 10 public circleLengthModulo_10x() { sum = 0; 2Radius = 2 * radius; // constant subexpression hoisted out of main loop while (outer_counter < 200_000) { inline computePi() {...} // useless inlining ! inline computePi() {...} // useless inlining ! inline computePi() {...} // useless inlining ! inline computePi() {...} // useless inlining ! inline computePi() {...} // useless inlining ! inline computePi() {...} // useless inlining ! inline computePi() {...} // useless inlining ! inline computePi() {...} // useless inlining ! inline computePi() {...} // useless inlining ! inline computePi() {...} // useless inlining ! inline computePi() {...} outer_counter += 1 sum += (2Radius * Pi) % outer_counter } return sum; }
To get the full assembly listing you can download it from below:
As we can easily spot, in JDK10 the more “interesting” fact is that method computePi() is inlined multiple times instead of being removed, since it is superfluous and do not impact the semantics of the program (i.e. its return value is not used in case of first 10 calls). This might explain the performance penalty in such case. For JDK9 and JDK11 computePi() method is inlined exactly once within caller method (i.e. removing useless computePi() methods), which leads to almost the same response time for all included JDK versions, as per provided experiment.
Further references:
UPDATE: As per comment from Jean-Philippe Bempel, this optimization is more linked to Dead Code Elimination rather than Loop Invariant Code Motion!
Hi for me it’s not loop invariant, but more dead code elemination 🙂
ComputePi() without using the return value is dead code and by inlining the JIT is able to detect that this code has no side effects.
But yes there is a regression in JDK 10 as it is not able to perform this :/
Hi Jean-Philippe Bempel,
Indeed, you are right, many thanks for the update! Dead code is not only unreachable code, but also code that does not affect the program (e.g. dead stores)
Ionut Balosin