Motivation

This post tackles the problem of chaining (or linking) multiple lambda calls, a pattern that appears to be optimized differently by the HotSpot Just-In-Time compiler C2 (i.e. JIT C2) and the GraalVM JIT compiler. I propose to start by detailing the problem, then run the benchmark and check the results, figure out what is really happening, and finally offer some practical advice.

The problem

Let’s suppose there are a few chained lambda calls which, in the end, return a value, for example:

λ → λ → λ → λ → … → λ

– OR –

 () -> () -> () -> () -> … -> () -> return

This is not far from a real use case scenario: imagine, for example, that we have to dispatch a request from one caller to another (e.g. the Chain of Responsibility pattern), or to compute a mathematical formula by calling a few functions in sequence, each applying a different computation to the incoming request.
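For instance, a pipeline of small computations chained in sequence follows exactly this shape. A minimal sketch (the steps and class name are illustrative) using `Function.andThen` composition:

```java
import java.util.function.Function;

public class ChainDemo {
    // Build the chain once; each step applies one computation
    // to the result of the previous step.
    static final Function<Integer, Integer> CHAIN =
        ((Function<Integer, Integer>) (x -> x + 1))   // step 1
            .andThen(x -> x * 2)                      // step 2
            .andThen(x -> x - 3);                     // step 3

    public static void main(String[] args) {
        System.out.println(CHAIN.apply(5)); // (5 + 1) * 2 - 3 = 9
    }
}
```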

Microbenchmark

I have created a synthetic benchmark around this problem using different call depths, in order to stress the JIT compiler and check how it performs its runtime optimizations.

import java.util.concurrent.TimeUnit;

import org.openjdk.jmh.annotations.*;
import org.openjdk.jmh.runner.Runner;
import org.openjdk.jmh.runner.RunnerException;
import org.openjdk.jmh.runner.options.Options;
import org.openjdk.jmh.runner.options.OptionsBuilder;

@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@Warmup(iterations = 5, time = 1, timeUnit = TimeUnit.SECONDS)
@Measurement(iterations = 5, time = 1, timeUnit = TimeUnit.SECONDS)
@Fork(value = 3, warmups = 1, jvmArgsAppend = {})
@State(Scope.Benchmark)
public class ChainingLambdaBenchmark {

  @Param({"77"})
  public Integer value; // JMH requires @Param fields to be non-static instance fields

  public static void main(String[] args) throws RunnerException {

    Options opt =
      new OptionsBuilder()
        .include(ChainingLambdaBenchmark.class.getSimpleName())
        .build();
    new Runner(opt).run();
  }

  @Benchmark
  public int baseline() {
    return value;
  }

  @Benchmark
  public int depth1() {
    Level9 l9;
    Level10 l10;

    l10 = () -> value;
    l9 = () -> l10;

    return l9.next().get();
  }

  @Benchmark
  public int depth2() {
    Level8 l8;
    Level9 l9;
    Level10 l10;

    l10 = () -> value;
    l9 = () -> l10;
    l8 = () -> l9;

    return l8.next().next().get();
  }

  @Benchmark
  public int depth3() {
    Level7 l7;
    Level8 l8;
    Level9 l9;
    Level10 l10;

    l10 = () -> value;
    l9 = () -> l10;
    l8 = () -> l9;
    l7 = () -> l8;

    return l7.next().next().next().get();
  }

  @Benchmark
  public int depth5() {
    Level5 l5;
    Level6 l6;
    Level7 l7;
    Level8 l8;
    Level9 l9;
    Level10 l10;

    l10 = () -> value;
    l9 = () -> l10;
    l8 = () -> l9;
    l7 = () -> l8;
    l6 = () -> l7;
    l5 = () -> l6;

    return l5.next().next().next().next().next().get();
  }

  @Benchmark
  public int depth10() {
    Level0 l0;
    Level1 l1;
    Level2 l2;
    Level3 l3;
    Level4 l4;
    Level5 l5;
    Level6 l6;
    Level7 l7;
    Level8 l8;
    Level9 l9;
    Level10 l10;

    l10 = () -> value;
    l9 = () -> l10;
    l8 = () -> l9;
    l7 = () -> l8;
    l6 = () -> l7;
    l5 = () -> l6;
    l4 = () -> l5;
    l3 = () -> l4;
    l2 = () -> l3;
    l1 = () -> l2;
    l0 = () -> l1;

    return l0.next().next().next().next().next().next().next().next().next().next().get();
  }
}

Please find below the supporting functional interfaces, one per Level.

@FunctionalInterface
public interface Level0 {
  Level1 next();
}

@FunctionalInterface
public interface Level1 {
  Level2 next();
}

@FunctionalInterface
public interface Level2 {
  Level3 next();
}

@FunctionalInterface
public interface Level3 {
  Level4 next();
}

@FunctionalInterface
public interface Level4 {
  Level5 next();
}

@FunctionalInterface
public interface Level5 {
  Level6 next();
}

@FunctionalInterface
public interface Level6 {
  Level7 next();
}

@FunctionalInterface
public interface Level7 {
  Level8 next();
}

@FunctionalInterface
public interface Level8 {
  Level9 next();
}

@FunctionalInterface
public interface Level9 {
  Level10 next();
}

@FunctionalInterface
public interface Level10 {
  Integer get();
}

I ran the benchmark twice, first with Oracle HotSpot VM and then with GraalVM, as per the details below:

  • Case I – Oracle HotSpot VM with JIT C2 (e.g. jdk-11.0.2_linux-x64)
  • Case II – GraalVM with Graal JIT (e.g. graalvm-ee-1.0.0-rc12-linux-amd64)

Tests were run on the following configuration: CPU: Intel i7-8550U (Kaby Lake R); Memory: 16 GB DDR4 2400 MHz; OS: Ubuntu 18.10.

Case I results – HotSpot VM w/ JIT C2

Benchmark                        Mode Cnt Score Error Units

baseline                         avgt 15 2.259 ± 0.062 ns/op
baseline:·gc.alloc.rate          avgt 15 ≈ 10⁻⁴ MB/sec
baseline:·gc.alloc.rate.norm     avgt 15 ≈ 10⁻⁶ B/op
baseline:·gc.count               avgt 15 ≈ 0 counts

depth1                           avgt 15 2.168 ± 0.082 ns/op
depth1:·gc.alloc.rate            avgt 15 ≈ 10⁻⁴ MB/sec
depth1:·gc.alloc.rate.norm       avgt 15 ≈ 10⁻⁶ B/op
depth1:·gc.count                 avgt 15 ≈ 0 counts

depth2                           avgt 15 4.327 ± 0.134 ns/op
depth2:·gc.alloc.rate            avgt 15 2350.611 ± 72.089 MB/sec
depth2:·gc.alloc.rate.norm       avgt 15 16.000 ± 0.001 B/op
depth2:·gc.count                 avgt 15 171.000 counts

depth3                           avgt 15 6.760 ± 0.185 ns/op
depth3:·gc.alloc.rate            avgt 15 3008.303 ± 81.393 MB/sec
depth3:·gc.alloc.rate.norm       avgt 15 32.000 ± 0.001 B/op
depth3:·gc.count                 avgt 15 204.000 counts

depth5                           avgt 15 12.029 ± 0.978 ns/op
depth5:·gc.alloc.rate            avgt 15 3396.754 ± 261.677 MB/sec
depth5:·gc.alloc.rate.norm       avgt 15 64.000 ± 0.001 B/op
depth5:·gc.count                 avgt 15 176.000 counts

depth10                          avgt 15 26.190 ± 0.947 ns/op
depth10:·gc.alloc.rate           avgt 15 3495.780 ± 121.156 MB/sec
depth10:·gc.alloc.rate.norm      avgt 15 144.000 ± 0.001 B/op
depth10:·gc.count                avgt 15 220.000 counts

Based on the above results we might conclude:

  • the cost of the call chain is related to its depth (i.e. the bigger the depth, the higher the response time)
  • the response time for baseline and depth1 is very fast (around 2.2 ns/op); however, starting with depth2 it becomes slower and slower
  • the allocation rate is likewise negligible for baseline and depth1, but it becomes heavier and heavier for the deeper calls (e.g. depth2, depth3, depth5, and depth10)
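Incidentally, the gc.alloc.rate.norm column fits a simple model: roughly 16 bytes per lambda object that C2 fails to eliminate, i.e. one per level beyond the first. A quick sanity check of that arithmetic (the 16-byte figure assumes a 64-bit JVM with compressed oops: 12-byte object header plus one 4-byte captured reference, aligned to 16 bytes):

```java
public class AllocCheck {
    // One escaping lambda object per level beyond the first,
    // at ~16 bytes each on a 64-bit JVM with compressed oops.
    static int bytesPerOp(int depth) { return 16 * (depth - 1); }

    public static void main(String[] args) {
        System.out.println(bytesPerOp(2));  // 16  (measured: 16.000 B/op)
        System.out.println(bytesPerOp(3));  // 32  (measured: 32.000 B/op)
        System.out.println(bytesPerOp(5));  // 64  (measured: 64.000 B/op)
        System.out.println(bytesPerOp(10)); // 144 (measured: 144.000 B/op)
    }
}
```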

Case II results – GraalVM EE w/ Graal JIT

Benchmark                        Mode Cnt Score Error Units

baseline                         avgt 15 2.582 ± 0.149 ns/op
baseline:·gc.alloc.rate          avgt 15 1.366 ± 4.500 MB/sec
baseline:·gc.alloc.rate.norm     avgt 15 0.006 ± 0.019 B/op
baseline:·gc.count               avgt 15 1.000 counts

depth1                           avgt 15 2.550 ± 0.111 ns/op
depth1:·gc.alloc.rate            avgt 15 5.124 ± 20.939 MB/sec
depth1:·gc.alloc.rate.norm       avgt 15 0.020 ± 0.081 B/op
depth1:·gc.count                 avgt 15 ≈ 0 counts

depth2                           avgt 15 2.431 ± 0.078 ns/op
depth2:·gc.alloc.rate            avgt 15 0.029 ± 0.065 MB/sec
depth2:·gc.alloc.rate.norm       avgt 15 ≈ 10⁻⁴ B/op
depth2:·gc.count                 avgt 15 ≈ 0 counts

depth3                           avgt 15 2.590 ± 0.148 ns/op
depth3:·gc.alloc.rate            avgt 15 0.177 ± 0.733 MB/sec
depth3:·gc.alloc.rate.norm       avgt 15 0.001 ± 0.003 B/op
depth3:·gc.count                 avgt 15 ≈ 0 counts

depth5                           avgt 15 2.497 ± 0.171 ns/op
depth5:·gc.alloc.rate            avgt 15 0.069 ± 0.222 MB/sec
depth5:·gc.alloc.rate.norm       avgt 15 ≈ 10⁻⁴ B/op
depth5:·gc.count                 avgt 15 ≈ 0 counts

depth10                          avgt 15 2.487 ± 0.130 ns/op
depth10:·gc.alloc.rate           avgt 15 3.011 ± 11.355 MB/sec
depth10:·gc.alloc.rate.norm      avgt 15 0.012 ± 0.045 B/op
depth10:·gc.count                avgt 15 ≈ 0 counts

In contrast to HotSpot JIT C2, Graal JIT offers almost the same response time (around 2.5 ns/op) and keeps a very low allocation rate (e.g. the Garbage Collector count is 0), independent of the call depth. This is the outcome of some nice optimizations happening under the hood, emphasizing the differences between HotSpot JIT C2 and Graal JIT.

Are you curious to find out more? Me too, so let’s zoom in and see what happens by examining the generated bytecode and assembly.

Under the Hood

In my opinion, depth1 and depth2 are the most interesting cases to look at, since on HotSpot VM the performance starts degrading at depth2. Understanding these two cases gives us a clue about what is really happening in the other situations (e.g. depth3, depth5, and depth10).

Things look simpler in the case of Graal JIT, since the response time and allocation rate are constant across depths; hence analyzing just one scenario, for example depth2, should be enough.

HotSpot JIT C2 Analysis

Depth1 – bytecode 

...
invokedynamic   #14, 0 // InvokeDynamic #1:get:(LChainingLambdaBenchmark;)LLevel10;
invokedynamic   #15, 0 // InvokeDynamic #2:next:(LLevel10;)LLevel9;
...
invokeinterface #16, 1 // InterfaceMethod Level9.next:()LLevel10;
invokeinterface #17, 1 // InterfaceMethod Level10.get:()Ljava/lang/Integer;
invokevirtual   #13    // Method java/lang/Integer.intValue:()I
...

The bytecode contains two INVOKEDYNAMIC calls that create the lambda object instances; the lambda chain calls (i.e. the next() calls) are then performed by sequential INVOKEINTERFACE invocations (since the methods belong to interfaces), and the final result is returned via INVOKEVIRTUAL, which dispatches to Integer.intValue().
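To make the INVOKEDYNAMIC step concrete, here is a minimal sketch (class and method names are mine, not from the benchmark) of what such a call site does on its first execution: it asks LambdaMetafactory for a CallSite whose target manufactures the lambda instance.

```java
import java.lang.invoke.*;
import java.util.function.Supplier;

public class IndyDemo {
    static Integer seventySeven() { return 77; } // the lambda body

    @SuppressWarnings("unchecked")
    static Supplier<Integer> makeSupplier() {
        try {
            MethodHandles.Lookup lookup = MethodHandles.lookup();
            // Roughly what the JVM does when it first hits the invokedynamic
            // instruction: bootstrap via LambdaMetafactory.metafactory.
            CallSite site = LambdaMetafactory.metafactory(
                lookup,
                "get",                                   // SAM method name on Supplier
                MethodType.methodType(Supplier.class),   // invoked type: () -> Supplier (no captures)
                MethodType.methodType(Object.class),     // erased SAM signature: () -> Object
                lookup.findStatic(IndyDemo.class, "seventySeven",
                    MethodType.methodType(Integer.class)),
                MethodType.methodType(Integer.class));   // instantiated signature: () -> Integer
            return (Supplier<Integer>) site.getTarget().invoke();
        } catch (Throwable t) {
            throw new AssertionError(t);
        }
    }

    public static void main(String[] args) {
        System.out.println(makeSupplier().get()); // prints 77
    }
}
```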

Depth1 – assembly 

...
movabs r10,0x71b8ae1b8 ; {oop(a 'java/lang/Class'{0x000000071b8ae1b8} = 'ChainingLambdaBenchmark')}
mov r11d,DWORD PTR [r10+0x70]
;*getstatic value {reexecute=0 rethrow=0 return_oop=0}
; - ChainingLambdaBenchmark::lambda$depth2$2@0
; - ChainingLambdaBenchmark$Lambda$44/0x0000000800112440::get@0
mov eax,DWORD PTR [r12+r11*8+0xc]
;*getfield value {reexecute=0 rethrow=0 return_oop=0}
; - java.lang.Integer::intValue@1
...

HotSpot JIT C2 was able to remove the intermediate lambda call (superfluous in this synthetic benchmark) and return the final value directly. Implicitly, the JIT C2 compiler also got rid of the intermediate lambda object allocations. This optimization makes depth1 equivalent to the baseline case.

Depth2 – bytecode 

...
invokedynamic   #18, 0 // InvokeDynamic #3:get:(LChainingLambdaBenchmark;)LLevel10;
invokedynamic   #19, 0 // InvokeDynamic #4:next:(LLevel10;)LLevel9;
invokedynamic   #20, 0 // InvokeDynamic #5:next:(LLevel9;)LLevel8;
...
invokeinterface #21, 1 // InterfaceMethod Level8.next:()LLevel9;
invokeinterface #16, 1 // InterfaceMethod Level9.next:()LLevel10;
invokeinterface #17, 1 // InterfaceMethod Level10.get:()Ljava/lang/Integer;
invokevirtual   #13    // Method java/lang/Integer.intValue:()I
...

Quite similar to depth1, there are three INVOKEDYNAMIC invocations for creating the lambda object instances; the call chain then performs a few INVOKEINTERFACE calls, and the final result (i.e. Integer.intValue()) is returned by INVOKEVIRTUAL.

Depth2 – assembly 

...
// Lambda$1::new ;() -> value
mov DWORD PTR [rax+0x8],0x60840 ;{metadata('ChainingLambdaBenchmark$Lambda$1')}
mov r11,rbp
shr r11,0x3
mov DWORD PTR [rax+0xc],r11d ;*new {reexecute=0 rethrow=0 return_oop=0}
; - ChainingLambdaBenchmark$Lambda$1/0x0000000800060840::get$Lambda@0
mov rbp,rax ;*areturn {reexecute=0 rethrow=0 return_oop=0}
; - ChainingLambdaBenchmark$Lambda$1/0x0000000800060840::get$Lambda@8
// Lambda$2::new ;() -> Level10
mov DWORD PTR [rax+0x8],0x60c40 ;{metadata('ChainingLambdaBenchmark$Lambda$2')}
mov r10,rbp
shr r10,0x3
mov DWORD PTR [rax+0xc],r10d ;*new {reexecute=0 rethrow=0 return_oop=0}
; - ChainingLambdaBenchmark$Lambda$2/0x0000000800060c40::get$Lambda@0
mov rbp,rax ;*areturn {reexecute=0 rethrow=0 return_oop=0}
; - ChainingLambdaBenchmark$Lambda$2/0x0000000800060c40::get$Lambda@8

mov r11,rax
shr r11,0x9
movabs r8,0x7f7ad22a0000

// OBS: Heap allocation for Lambda$3 (i.e. () -> Level9) was eliminated due to inlining at this BCI
mov BYTE PTR [r8+r11*1],0x0 ;*invokespecial <init> {reexecute=0 rethrow=0 return_oop=0}
; - ChainingLambdaBenchmark$Lambda$3/0x0000000800061840::get$Lambda@5
//
//
// Level9::next
mov ebp,DWORD PTR [rbp+0xc] ;*getfield arg$1 {reexecute=0 rethrow=0 return_oop=0}
; - ChainingLambdaBenchmark$Lambda$2/0x0000000800060c40::next@1
mov r10d,DWORD PTR [r12+rbp*8+0x8]

// Level10::get
lea r10,[r12+rbp*8] ;*invokeinterface get {reexecute=0 rethrow=0 return_oop=0}
mov r11d,DWORD PTR [r10+0xc] ;*getfield arg$1 {reexecute=0 rethrow=0 return_oop=0}
; - ChainingLambdaBenchmark$Lambda$1/0x0000000800060840::get@1

// get field value
mov r10d,DWORD PTR [r12+r11*8+0xc] ;*getfield value {reexecute=0 rethrow=0 return_oop=0}
; - ChainingLambdaBenchmark$Lambda$1/0x0000000800060840::get@4
; implicit exception: dispatches to 0x00007f7ac4cae52a
// return Integer::intValue()
mov eax,DWORD PTR [r12+r10*8+0xc] ;*getfield value {reexecute=0 rethrow=0 return_oop=0}
; - java.lang.Integer::intValue@1
;implicit exception: dispatches to 0x00007f7ac4cae536
...

At first glance it might look a bit complicated; however, it is not so difficult to understand. It starts by allocating two lambda object instances, corresponding to Lambda$1 (i.e. () -> value) and Lambda$2 (i.e. () -> Level10). The JIT C2 compiler then optimizes the first lambda call in the chain (i.e. Lambda$3, corresponding to () -> Level9) by inlining it and eliminating its heap allocation. Finally, the compiler performs the calls sequentially, as they are declared in the Java source code: Level9::next -> Level10::get -> Integer::intValue.

All these lambda object allocations and virtual calls explain why performance degrades and why the Garbage Collector has more work to do, in comparison to the previous depth1 test case.
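To picture where the per-call allocations come from, here is a hand-written equivalent of a capturing lambda (illustrative names; the real classes are spun at runtime by LambdaMetafactory, and in the benchmark the captured reference is the benchmark instance rather than the boxed value):

```java
public class DesugarSketch {
    interface Level10 { Integer get(); }
    interface Level9 { Level10 next(); }

    static Integer value = 77;

    // Roughly the shape generated for a capturing lambda: a tiny class with
    // one reference field, i.e. one small heap object per instantiation
    // unless escape analysis eliminates it.
    static final class CapturingLambda implements Level10 {
        private final Integer arg$1;
        CapturingLambda(Integer captured) { this.arg$1 = captured; }
        public Integer get() { return arg$1; }
    }

    static Integer run() {
        Level10 l10 = new CapturingLambda(value);
        Level9 l9 = () -> l10; // also capturing: it holds a reference to l10
        return l9.next().get();
    }

    public static void main(String[] args) {
        System.out.println(run()); // prints 77
    }
}
```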

A similar optimization pattern occurs for depth3, depth5, and depth10, which explains the performance penalty also spotted in the benchmark results.

Graal JIT Analysis

Depth2 – bytecode 

The bytecode is identical to the one already described above (see depth2 under HotSpot JIT C2), hence there is no reason to duplicate it.

Depth2 – assembly

mov eax,DWORD PTR [rsi+0xc] ;*aload_0 {reexecute=1 rethrow=0 return_oop=0}
; - ChainingLambdaBenchmark::depth2@0
mov eax,DWORD PTR [rax*8+0xc] ; implicit exception: deoptimizes
; OopMap{rsi=Oop off=51}

The assembly looks extremely nice, doesn’t it? Graal JIT optimizes the call by removing all the chained lambdas together with their associated heap allocations; it simply returns the value belonging to the current object instance.

The other cases (e.g. depth1, depth3, depth5, and depth10) are optimized in a similar manner by Graal JIT.

Conclusions

HotSpot JIT C2

  • The JIT C2 compiler is able to optimize (i.e. inline) chained lambda calls up to a certain extent; beyond that, it is left with lambda object allocations and additional method calls. In our benchmark, depth1 offers performance similar to the baseline; however, for the depth2, depth3, depth5, and depth10 cases, JIT C2 could not fully optimize the chained lambdas.
  • As a matter of precaution, please be careful when writing code which follows this pattern; it might affect the overall performance of your application.
  • I would advise you to limit the chained lambda depth, if possible!
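If the depth is under your control, collapsing intermediate forwarding lambdas is a simple remedy. A minimal sketch (names are illustrative) contrasting a two-level chain with its flattened equivalent:

```java
import java.util.function.Supplier;

public class FlattenDemo {
    static Integer value = 77;

    // A two-level chain in the style of the benchmark: the outer lambda only
    // forwards to the inner one, yet each extra level means one more lambda
    // object and one more virtual call for the JIT to see through.
    static Supplier<Supplier<Integer>> chained() {
        Supplier<Integer> inner = () -> value;
        return () -> inner;
    }

    // The flattened equivalent: one lambda, one call, nothing left to peel.
    static Supplier<Integer> flat() {
        return () -> value;
    }

    public static void main(String[] args) {
        System.out.println(chained().get().get()); // prints 77
        System.out.println(flat().get());          // prints 77
    }
}
```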

Graal JIT

  • With Graal JIT there is not much to worry about, since the compiler optimizes this code in a more efficient way (i.e. it inlines the chained lambdas), offering better response times.

By Ionut Balosin

Software Architect, Independent Technical Trainer

One thought on “Chaining lambda optimizations in HotSpot VM and GraalVM”
  1. Thanks for sharing!
    I would add @CompilerControl(CompilerControl.Mode.DONT_INLINE) on the benchmark methods to have more control over the stack depth
