Loop Unrolling in Java - JVM JIT
The difference between source code and how code is executed
Compilers optimise source code; executed exactly as written, it would often be inefficient. Without optimisation, a loop executes its body once per iteration, checks the termination condition and jumps back to the start of the body. This is inefficient because modern processors overlap the stages of instruction processing (fetch, decode, execute, write-back), so that several instructions are in flight in the same clock cycle. This technique is called pipelining. The number of pipeline stages varies by processor: the Intel Pentium 4 had a 20-stage pipeline, and its Prescott revision had 31 stages. When the loop body is a single operation, execution is not linear: every iteration ends with a jump back to the start of the body, and a mispredicted jump forces the pipeline to be flushed. The performance penalty in that case is comparable to a cache miss. Let's see how a loop compiles in C without optimisation and how the HotSpot JIT optimises it in Java 11 and Java 17. For the experiment, we allocate an array of 1_000_000 long values and fill it with random values.
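Before looking at compiler output, it may help to see the transformation written by hand. The following is a minimal sketch of unrolling a summation loop by four; the class and method names (`ManualUnroll`, `sumSimple`, `sumUnrolledBy4`) are hypothetical and exist only for illustration. The JIT performs the equivalent transformation on machine code, not on Java source.

```java
import java.util.Random;

class ManualUnroll {
    // One addition, one comparison and one backward jump per element.
    static long sumSimple(long[] data) {
        long sum = 0;
        for (int i = 0; i < data.length; i++) {
            sum += data[i];
        }
        return sum;
    }

    // Four additions per comparison and backward jump.
    static long sumUnrolledBy4(long[] data) {
        long sum = 0;
        int i = 0;
        int limit = data.length - 3; // while i < limit, index i + 3 is still in bounds
        for (; i < limit; i += 4) {
            sum += data[i];
            sum += data[i + 1];
            sum += data[i + 2];
            sum += data[i + 3];
        }
        // Remainder loop: at most three leftover elements.
        for (; i < data.length; i++) {
            sum += data[i];
        }
        return sum;
    }

    public static void main(String[] args) {
        // Length deliberately not a multiple of 4, so the remainder loop runs.
        long[] data = new Random(42).longs(1_000_003).toArray();
        System.out.println(sumSimple(data) == sumUnrolledBy4(data)); // prints "true"
    }
}
```

Both methods return the same value for any input, because long addition wraps around consistently regardless of grouping; only the ratio of useful work to loop overhead changes.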
Loops in C code
#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>

/* One step of a xorshift pseudo-random generator; updates the state in place. */
unsigned long xorshift(unsigned long state[static 1]) {
    unsigned long x = state[0];
    x ^= x << 13;
    x ^= x >> 17;
    x ^= x << 5;
    state[0] = x;
    return x;
}

/* Returns a pseudo-random value in [min, max], seeded from /dev/urandom. */
long random_long(long min, long max) {
    int urandom = open("/dev/urandom", O_RDONLY);
    unsigned long state[1];
    read(urandom, state, sizeof(state));
    close(urandom); /* release the descriptor: this function is called once per element */
    unsigned long range = (unsigned long) max - min + 1;
    unsigned long random_value = xorshift(state) % range;
    return (long) (random_value + min);
}

int main(int argc, char** argv) {
    int MAX = 1000000;
    long* data = (long*)calloc(MAX, sizeof(long));
    for (int i = 0; i < MAX; i++) {
        data[i] = random_long(0, MAX);
    }
}
gcc -S loopunrolling.c
Let's look only at the part of the assembly that corresponds to the main function. As expected, there is exactly one call random_long instruction per loop iteration.
pushq %rbp
.cfi_def_cfa_offset 16
.cfi_offset 6, -16
movq %rsp, %rbp
.cfi_def_cfa_register 6
pushq %rbx
subq $40, %rsp
.cfi_offset 3, -24
movl %edi, -36(%rbp)
movq %rsi, -48(%rbp)
movl $1000000, -28(%rbp)
movl -28(%rbp), %eax
movl $8, %esi
movq %rax, %rdi
call calloc@PLT
movq %rax, -24(%rbp)
movl $0, -32(%rbp)
jmp .L7
.L8:
movl -28(%rbp), %eax
movl -32(%rbp), %edx
movslq %edx, %rdx
leaq 0(,%rdx,8), %rcx
movq -24(%rbp), %rdx
leaq (%rcx,%rdx), %rbx
movq %rax, %rsi
movl $0, %edi
call random_long
movq %rax, (%rbx)
addl $1, -32(%rbp)
.L7:
movl -32(%rbp), %eax
cmpl -28(%rbp), %eax
jl .L8
movl $0, %eax
movq -8(%rbp), %rbx
.cfi_def_cfa 7, 8
.size main, .-main
.ident "GCC: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0"
.section .note.GNU-stack,"",@progbits
.section .note.gnu.property,"a"
.align 8
.long 1f - 0f
.long 4f - 1f
.long 5
Now let's do the same in Java and fill a long[]. Unlike the C version, the summation loop goes into a separate method, intStride1, because the JIT compiles whole methods: the method is its minimum compilation unit.
public class LoopUnroll {
    private static int MAX = 1000000;
    private static long[] data = new long[MAX];

    public static void main(String[] args) {
        java.util.Random random = new java.util.Random();
        for (int i = 0; i < MAX; i++) {
            data[i] = random.nextLong();
        }
        final long sum = intStride1();
    }

    private static long intStride1() {
        long sum = 0; // long accumulator: an int would silently truncate every addition
        for (int i = 0; i < MAX; i += 1) {
            sum += data[i];
        }
        return sum;
    }
}
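In the bytecode we are interested in the private static long intStride1() method. There are two ladd operations per iteration: one accumulates the element of data[] (20: ladd) and the other increments the counter (24: ladd), so the bytecode performs exactly one loop-body operation per iteration. Note that in this listing the counter is kept in a long local (lstore_2), which is why its increment is an ladd rather than an iinc. javac does not unroll the loop; optimisations of this kind are applied only at run time by the JIT compiler.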
javap -p -v LoopUnroll.class
// -- omitted
private static long intStride1();
descriptor: ()J
flags: (0x000a) ACC_PRIVATE, ACC_STATIC
stack=5, locals=4, args_size=0
0: lconst_0
1: lstore_0
2: lconst_0
3: lstore_2
4: lload_2
5: getstatic #10 // Field MAX:I
8: i2l
9: lcmp
10: ifge 29
13: lload_0
14: getstatic #16 // Field data:[J
17: lload_2
18: l2i
19: laload
20: ladd
21: lstore_0
22: lload_2
23: lconst_1
24: ladd
25: lstore_2
26: goto 4
29: lload_0
30: lreturn
line 21: 0
line 22: 2
line 24: 13
line 22: 22
line 26: 29
StackMapTable: number_of_entries = 2
frame_type = 253 /* append */
offset_delta = 4
locals = [ long, long ]
frame_type = 250 /* chop */
offset_delta = 24
// -- omitted
SourceFile: "LoopUnroll.java"
Let's consider two loop variants that differ only in the counter type, one with an int counter and the other with a long counter, and see how the counter type affects the code generated by the JIT compiler, the unrolling and the placement of safepoints in the loop.
We add @CompilerControl(CompilerControl.Mode.DONT_INLINE) so that the method is not inlined into the benchmark harness. We will run the benchmark with different VM options to control code generation:
- -XX:+UseCountedLoopSafepoints - controls the presence of safepoints in counted loops.
- -XX:LoopStripMiningIter=number_of_iterations - controls the number of iterations in the inner loop. The safepoint is placed in the outer loop, and the inner loop contains no safepoint. The default is 1,000.
- -XX:LoopStripMiningIterShortLoop=number_of_iterations - loops with fewer than the specified number of iterations will not have a safepoint.
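The effect of strip mining can be sketched by hand. The example below splits one long loop into an outer loop over strips and an inner loop over the elements of one strip; the class and method names (`StripMiningSketch`, `sumStripMined`) and the `stripSize` parameter are hypothetical, and the comments only mark where a JIT would conceptually place the safepoint poll. It is an illustration of the idea, not a description of how HotSpot implements it.

```java
class StripMiningSketch {
    // Sums the array strip by strip; stripSize plays the role of the inner-loop length.
    static long sumStripMined(long[] data, int stripSize) {
        if (stripSize <= 0) {
            throw new IllegalArgumentException("stripSize must be positive");
        }
        long sum = 0;
        int start = 0;
        while (start < data.length) {
            // The last strip may be shorter; the long arithmetic avoids int overflow.
            int end = (int) Math.min((long) start + stripSize, data.length);
            for (int i = start; i < end; i++) {
                sum += data[i]; // inner loop: conceptually free of safepoint polls
            }
            // outer loop: conceptually, one safepoint poll per strip goes here
            start = end;
        }
        return sum;
    }

    public static void main(String[] args) {
        long[] data = {1, 2, 3, 4, 5, 6, 7};
        System.out.println(sumStripMined(data, 3)); // prints 28
    }
}
```

A larger strip means fewer polls and less overhead in the hot loop, at the cost of a longer wait before the thread can reach a safepoint.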
import org.openjdk.jmh.annotations.*;

@State(Scope.Thread)
@Fork(value = 1, jvmArgsPrepend = {"-XX:+UnlockDiagnosticVMOptions", "-XX:-UseCompressedOops", "-XX:PrintAssemblyOptions=intel", "-XX:LoopStripMiningIter=10000", "-XX:-UseCountedLoopSafepoints"})
public class LoopUnrollBenchmark {
    private static final int MAX = 1_000_000;
    private long[] data = new long[MAX];

    @Setup
    public void createData() {
        java.util.Random random = new java.util.Random();
        for (int i = 0; i < MAX; i++)
            data[i] = random.nextLong();
    }

    @Benchmark
    public void baseline() {
    }

    @Benchmark
    @CompilerControl(CompilerControl.Mode.DONT_INLINE)
    public long intStride1() {
        long sum = 0;
        for (int i = 0; i < MAX; i++)
            sum += data[i];
        return sum;
    }

    @Benchmark
    @CompilerControl(CompilerControl.Mode.DONT_INLINE)
    public long longStride1() {
        long sum = 0;
        for (long l = 0; l < MAX; l++)
            sum += data[(int) l];
        return sum;
    }
}
java -jar target/benchmarks.jar -prof perfasm
Java 11 counter loop
Run with the following VM arguments.
@Fork(value = 1, jvmArgsPrepend = {"-XX:+UnlockDiagnosticVMOptions", "-XX:-UseCompressedOops", "-XX:PrintAssemblyOptions=intel"})
c2, level 4, com.rkdeep.LoopUnrollBenchmark::intStride1, version 3, compile id 646
0x00007fdee83d10d0: cmp r10d,0xf423f
0x00007fdee83d10d7: jbe 0x00007fdee83d115d
0x00007fdee83d10dd: mov rax,QWORD PTR [r9+0x18] ;*laload {reexecute=0 rethrow=0 return_oop=0}
; - com.rkdeep.LoopUnrollBenchmark::intStride1@16 (line 72)
0x00007fdee83d10e1: mov r10d,0x1 ;*goto {reexecute=0 rethrow=0 return_oop=0}
; - com.rkdeep.LoopUnrollBenchmark::intStride1@22 (line 70)
0x00007fdee83d10e7: mov r8d,0xfa0
↗ 0x00007fdee83d10ed: mov ecx,0xf423d
│ 0x00007fdee83d10f2: sub ecx,r10d
│ 0x00007fdee83d10f5: cmp ecx,r8d
0.02% │ 0x00007fdee83d10f8: cmovg ecx,r8d
│ 0x00007fdee83d10fc: add ecx,r10d
│ 0x00007fdee83d10ff: nop ;*lload_1 {reexecute=0 rethrow=0 return_oop=0}
│ ; - com.rkdeep.LoopUnrollBenchmark::intStride1@10 (line 72)
0.06% ↗│ 0x00007fdee83d1100: add rax,QWORD PTR [r9+r10*8+0x18]
31.11% ││ 0x00007fdee83d1105: add rax,QWORD PTR [r9+r10*8+0x20]
22.35% ││ 0x00007fdee83d110a: add rax,QWORD PTR [r9+r10*8+0x28]
22.42% ││ 0x00007fdee83d110f: add rax,QWORD PTR [r9+r10*8+0x30]
││ ;*ladd {reexecute=0 rethrow=0 return_oop=0}
││ ; - com.rkdeep.LoopUnrollBenchmark::intStride1@17 (line 72)
22.14% ││ 0x00007fdee83d1114: add r10d,0x4 ;*iinc {reexecute=0 rethrow=0 return_oop=0}
││ ; - com.rkdeep.LoopUnrollBenchmark::intStride1@19 (line 70)
0.02% ││ 0x00007fdee83d1118: cmp r10d,ecx
╰│ 0x00007fdee83d111b: jl 0x00007fdee83d1100 ;*if_icmpge {reexecute=0 rethrow=0 return_oop=0}
│ ; - com.rkdeep.LoopUnrollBenchmark::intStride1@7 (line 70)
│ 0x00007fdee83d111d: mov r14,QWORD PTR [r15+0x108]
│ ; ImmutableOopMap{r11=Oop r9=Oop }
│ ;*goto {reexecute=1 rethrow=0 return_oop=0}
│ ; - com.rkdeep.LoopUnrollBenchmark::intStride1@22 (line 70)
0.01% │ 0x00007fdee83d1124: test DWORD PTR [r14],eax ;*goto {reexecute=0 rethrow=0 return_oop=0}
│ ; - com.rkdeep.LoopUnrollBenchmark::intStride1@22 (line 70)
│ ; {poll}
0.12% │ 0x00007fdee83d1127: cmp r10d,0xf423d
╰ 0x00007fdee83d112e: jl 0x00007fdee83d10ed
0x00007fdee83d1130: cmp r10d,0xf4240
0x00007fdee83d1137: jge 0x00007fdee83d114d
0x00007fdee83d1139: data16 xchg ax,ax ;*lload_1 {reexecute=0 rethrow=0 return_oop=0}
; - com.rkdeep.LoopUnrollBenchmark::intStride1@10 (line 72)
0x00007fdee83d113c: add rax,QWORD PTR [r9+r10*8+0x18]
;*ladd {reexecute=0 rethrow=0 return_oop=0}
; - com.rkdeep.LoopUnrollBenchmark::intStride1@17 (line 72)
98.23% <total for region 1>
....[Hottest Region 1]..............................................................................
c2, level 4, com.rkdeep.LoopUnrollBenchmark::longStride1, version 3, compile id 630
0x00007f8cc43cf124: jne 0x00007f8cbc84b080 ; {runtime_call ic_miss_stub}
0x00007f8cc43cf12a: xchg ax,ax
0x00007f8cc43cf12c: nop DWORD PTR [rax+0x0]
[Verified Entry Point]
0x00007f8cc43cf130: mov DWORD PTR [rsp-0x14000],eax
0x00007f8cc43cf137: push rbp
0x00007f8cc43cf138: sub rsp,0x30 ;*synchronization entry
; - com.rkdeep.LoopUnrollBenchmark::longStride1@-1 (line 81)
0x00007f8cc43cf13c: mov r10,QWORD PTR [rsi+0x10] ;*getfield data {reexecute=0 rethrow=0 return_oop=0}
; - com.rkdeep.LoopUnrollBenchmark::longStride1@14 (line 84)
0.00% 0x00007f8cc43cf140: mov r9d,DWORD PTR [r10+0x10] ;*laload {reexecute=0 rethrow=0 return_oop=0}
; - com.rkdeep.LoopUnrollBenchmark::longStride1@19 (line 84)
; implicit exception: dispatches to 0x00007f8cc43cf1a4
0.01% 0x00007f8cc43cf144: xor eax,eax ;*goto {reexecute=0 rethrow=0 return_oop=0}
; - com.rkdeep.LoopUnrollBenchmark::longStride1@26 (line 82)
0x00007f8cc43cf146: xor r11d,r11d
0x00007f8cc43cf149: xor r8d,r8d
╭ 0x00007f8cc43cf14c: jmp 0x00007f8cc43cf153
│ 0x00007f8cc43cf14e: xchg ax,ax
12.00% │ ↗ 0x00007f8cc43cf150: mov r11d,r8d ;*lload_1 {reexecute=0 rethrow=0 return_oop=0}
│ │ ; - com.rkdeep.LoopUnrollBenchmark::longStride1@12 (line 84)
10.85% ↘ │ 0x00007f8cc43cf153: cmp r11d,r9d
╭│ 0x00007f8cc43cf156: jae 0x00007f8cc43cf184
9.98% ││ 0x00007f8cc43cf158: add rax,QWORD PTR [r10+r11*8+0x18]
││ ;*ladd {reexecute=0 rethrow=0 return_oop=0}
││ ; - com.rkdeep.LoopUnrollBenchmark::longStride1@20 (line 84)
24.21% ││ 0x00007f8cc43cf15d: mov r11,QWORD PTR [r15+0x108]
11.56% ││ 0x00007f8cc43cf164: add r8,0x1 ; ImmutableOopMap{r10=Oop rsi=Oop }
││ ;*goto {reexecute=1 rethrow=0 return_oop=0}
││ ; - com.rkdeep.LoopUnrollBenchmark::longStride1@26 (line 82)
10.62% ││ 0x00007f8cc43cf168: test DWORD PTR [r11],eax ;*goto {reexecute=0 rethrow=0 return_oop=0}
││ ; - com.rkdeep.LoopUnrollBenchmark::longStride1@26 (line 82)
││ ; {poll}
18.83% ││ 0x00007f8cc43cf16b: cmp r8,0xf4240
│╰ 0x00007f8cc43cf172: jl 0x00007f8cc43cf150 ;*ifge {reexecute=0 rethrow=0 return_oop=0}
│ ; - com.rkdeep.LoopUnrollBenchmark::longStride1@9 (line 82)
│ 0x00007f8cc43cf174: add rsp,0x30
│ 0x00007f8cc43cf178: pop rbp
0.01% │ 0x00007f8cc43cf179: mov r10,QWORD PTR [r15+0x108]
│ 0x00007f8cc43cf180: test DWORD PTR [r10],eax ; {poll_return}
│ 0x00007f8cc43cf183: ret
↘ 0x00007f8cc43cf184: mov rbp,rsi
0x00007f8cc43cf187: mov QWORD PTR [rsp],r8
0x00007f8cc43cf18b: mov QWORD PTR [rsp+0x8],rax
0x00007f8cc43cf190: mov QWORD PTR [rsp+0x10],r10
0x00007f8cc43cf195: mov DWORD PTR [rsp+0x18],r11d
0x00007f8cc43cf19a: mov esi,0xffffffe4
0x00007f8cc43cf19f: call 0x00007f8cbc849e00 ; ImmutableOopMap{rbp=Oop [16]=Oop }
;*laload {reexecute=0 rethrow=0 return_oop=0}
98.08% <total for region 1>
Benchmark Mode Cnt Score Error Units
LoopUnrollBenchmark.baseline thrpt 5 420136389.339 ± 61698598.658 ops/s
LoopUnrollBenchmark.baseline:asm thrpt NaN ---
LoopUnrollBenchmark.intStride1 thrpt 5 2457.647 ± 176.800 ops/s
LoopUnrollBenchmark.intStride1:asm thrpt NaN ---
LoopUnrollBenchmark.longStride1 thrpt 5 1391.287 ± 85.554 ops/s
LoopUnrollBenchmark.longStride1:asm thrpt NaN ---
With an int counter the loop is compiled as two nested loops, an outer and an inner one. The body of the inner loop is repeated 4 times, i.e. the loop is unrolled by 4. The safepoint poll is placed after the inner loop. A safepoint is a point in the code where the thread's state is consistent, so the thread can be safely stopped, for example to take a stack trace or to let the GC work. For clarity, the loop that is actually executed can be represented as in the listing below.
for (int j = 0; j < 250; j++) {
    // inner loop: one strip of 4_000 elements, unrolled by 4, no safepoint
    for (int i = j * 4_000; i < (j + 1) * 4_000; i = i+4) {
        sum += data[i];
        sum += data[i+1];
        sum += data[i+2];
        sum += data[i+3];
    }
    // safepoint
}
Unlike the loop with an int counter, the loop with a long counter is compiled without loop unrolling, and the safepoint is polled in every iteration. In pseudocode it looks like the listing below.
for (long l = 0; l < 1_000_000; l++) {
    sum += data[(int) l];
    // safepoint
}
Let's take the Java 11 results as a baseline.
Java 17 counter loop safepoint control
Benchmark without safepoints -XX:-UseCountedLoopSafepoints
Remove safepoint polls from the loop with "-XX:-UseCountedLoopSafepoints" and request an inner loop of 10,000 iterations with "-XX:LoopStripMiningIter=10000".
@Fork(value = 1, jvmArgsPrepend = {"-XX:+UnlockDiagnosticVMOptions", "-XX:-UseCompressedOops", "-XX:+UseSuperWord", "-XX:PrintAssemblyOptions=intel", "-XX:LoopStripMiningIter=10000", "-XX:-UseCountedLoopSafepoints"})
Result "com.rkdeep.LoopUnrollBenchmark.intStride1":
2581.171 ±(99.9%) 14.527 ops/s [Average]
(min, avg, max) = (2575.700, 2581.171, 2585.076), stdev = 3.773
CI (99.9%): [2566.645, 2595.698] (assumes normal distribution)
Secondary result "com.rkdeep.LoopUnrollBenchmark.intStride1:asm":
PrintAssembly processed: 166164 total address lines.
Perf output processed (skipped 59.009 seconds):
Column 1: cycles (49732 events)
Hottest code regions (>10.00% "cycles" events):
Event counts are percents of total event count.
....[Hottest Region 1]..............................................................................
c2, level 4, com.rkdeep.LoopUnrollBenchmark::intStride1, version 3, compile id 721
0.01% 0x00007f3ee4fd7453: mov r8d,DWORD PTR [r10+0xc] ; implicit exception: dispatches to 0x00007f3ee4fd751c
0.01% 0x00007f3ee4fd7457: test r8d,r8d
0x00007f3ee4fd745a: jbe 0x00007f3ee4fd751c
0x00007f3ee4fd7460: cmp r8d,0xf423f
0x00007f3ee4fd7467: jbe 0x00007f3ee4fd751c
0x00007f3ee4fd746d: mov rax,QWORD PTR [r10+0x10] ;*laload {reexecute=0 rethrow=0 return_oop=0}
; - com.rkdeep.LoopUnrollBenchmark::intStride1@16 (line 73)
0x00007f3ee4fd7471: mov r11d,0x1
╭ 0x00007f3ee4fd7477: jmp 0x00007f3ee4fd7483
│ 0x00007f3ee4fd7479: nop DWORD PTR [rax+0x0]
0.01% │↗ 0x00007f3ee4fd7480: mov r11d,r9d ;*lload_1 {reexecute=0 rethrow=0 return_oop=0}
││ ; - com.rkdeep.LoopUnrollBenchmark::intStride1@10 (line 73)
6.03% ↘│ 0x00007f3ee4fd7483: add rax,QWORD PTR [r10+r11*8+0x10]
0.01% │ 0x00007f3ee4fd7488: add rax,QWORD PTR [r10+r11*8+0x18]
5.98% │ 0x00007f3ee4fd748d: add rax,QWORD PTR [r10+r11*8+0x20]
5.69% │ 0x00007f3ee4fd7492: add rax,QWORD PTR [r10+r11*8+0x28]
5.73% │ 0x00007f3ee4fd7497: add rax,QWORD PTR [r10+r11*8+0x30]
5.89% │ 0x00007f3ee4fd749c: add rax,QWORD PTR [r10+r11*8+0x38]
7.75% │ 0x00007f3ee4fd74a1: add rax,QWORD PTR [r10+r11*8+0x40]
5.88% │ 0x00007f3ee4fd74a6: add rax,QWORD PTR [r10+r11*8+0x48]
5.69% │ 0x00007f3ee4fd74ab: add rax,QWORD PTR [r10+r11*8+0x50]
5.94% │ 0x00007f3ee4fd74b0: add rax,QWORD PTR [r10+r11*8+0x58]
6.17% │ 0x00007f3ee4fd74b5: add rax,QWORD PTR [r10+r11*8+0x60]
5.94% │ 0x00007f3ee4fd74ba: add rax,QWORD PTR [r10+r11*8+0x68]
5.84% │ 0x00007f3ee4fd74bf: add rax,QWORD PTR [r10+r11*8+0x70]
5.71% │ 0x00007f3ee4fd74c4: add rax,QWORD PTR [r10+r11*8+0x78]
8.30% │ 0x00007f3ee4fd74c9: add rax,QWORD PTR [r10+r11*8+0x80]
6.04% │ 0x00007f3ee4fd74d1: add rax,QWORD PTR [r10+r11*8+0x88];*ladd {reexecute=0 rethrow=0 return_oop=0}
│ ; - com.rkdeep.LoopUnrollBenchmark::intStride1@17 (line 73)
5.82% │ 0x00007f3ee4fd74d9: mov r9d,r11d
0.00% │ 0x00007f3ee4fd74dc: add r9d,0x10 ;*iinc {reexecute=0 rethrow=0 return_oop=0}
│ ; - com.rkdeep.LoopUnrollBenchmark::intStride1@19 (line 71)
│ 0x00007f3ee4fd74e0: cmp r9d,0xf4231
╰ 0x00007f3ee4fd74e7: jl 0x00007f3ee4fd7480 ;*if_icmpge {reexecute=0 rethrow=0 return_oop=0}
; - com.rkdeep.LoopUnrollBenchmark::intStride1@7 (line 71)
0x00007f3ee4fd74e9: cmp r9d,0xf4240
0x00007f3ee4fd74f0: jge 0x00007f3ee4fd7509
0x00007f3ee4fd74f2: add r11d,0x10
0x00007f3ee4fd74f6: xchg ax,ax ;*lload_1 {reexecute=0 rethrow=0 return_oop=0}
; - com.rkdeep.LoopUnrollBenchmark::intStride1@10 (line 73)
0x00007f3ee4fd74f8: add rax,QWORD PTR [r10+r11*8+0x10];*ladd {reexecute=0 rethrow=0 return_oop=0}
98.44% <total for region 1>
....[Hottest Region 1]..............................................................................
c2, level 4, com.rkdeep.LoopUnrollBenchmark::longStride1, version 3, compile id 719
0.00% 0x00007f72e4fd549a: cmp edx,r11d
0x00007f72e4fd549d: mov r10d,0x80000000
0x00007f72e4fd54a3: cmovl r11d,r10d
0x00007f72e4fd54a7: movsxd r10,r11d
0x00007f72e4fd54aa: cmp r10,rbp
0x00007f72e4fd54ad: cmovg r11d,edi
0x00007f72e4fd54b1: cmp r11d,0x2
0x00007f72e4fd54b5: jle 0x00007f72e4fd55ad
0x00007f72e4fd54bb: mov r10d,0x2 ;*lload_1 {reexecute=0 rethrow=0 return_oop=0}
; - com.rkdeep.LoopUnrollBenchmark::longStride1@12 (line 85)
10.74% ↗ 0x00007f72e4fd54c1: cmp r12d,DWORD PTR [rsp]
│ 0x00007f72e4fd54c5: jae 0x00007f72e4fd5575
0.02% │ 0x00007f72e4fd54cb: add rax,QWORD PTR [rcx+r12*8+0x10]
0.31% │ 0x00007f72e4fd54d0: movsxd rbx,r10d
0.02% │ 0x00007f72e4fd54d3: mov r8,r9
10.86% │ 0x00007f72e4fd54d6: add r8,rbx
│ 0x00007f72e4fd54d9: mov r12,rsi
0.08% │ 0x00007f72e4fd54dc: add r12,rbx
0.02% │ 0x00007f72e4fd54df: mov rbx,QWORD PTR [rcx+r12*8+0x48]
20.31% │ 0x00007f72e4fd54e4: mov rdi,QWORD PTR [rcx+r12*8+0x40]
0.31% │ 0x00007f72e4fd54e9: mov rdx,QWORD PTR [rcx+r12*8+0x38]
0.44% │ 0x00007f72e4fd54ee: mov rbp,QWORD PTR [rcx+r12*8+0x30]
0.19% │ 0x00007f72e4fd54f3: mov r13,QWORD PTR [rcx+r12*8+0x28]
10.46% │ 0x00007f72e4fd54f8: mov r14,QWORD PTR [rcx+r12*8+0x20]
0.03% │ 0x00007f72e4fd54fd: mov r12,QWORD PTR [rcx+r12*8+0x18];*laload {reexecute=0 rethrow=0 return_oop=0}
│ ; - com.rkdeep.LoopUnrollBenchmark::longStride1@19 (line 85)
0.41% │ 0x00007f72e4fd5502: add rax,r12
0.18% │ 0x00007f72e4fd5505: add rax,r14
10.29% │ 0x00007f72e4fd5508: add rax,r13
0.10% │ 0x00007f72e4fd550b: add rax,rbp
0.47% │ 0x00007f72e4fd550e: add rax,rdx
10.77% │ 0x00007f72e4fd5511: add rax,rdi
11.09% │ 0x00007f72e4fd5514: add rax,rbx ;*ladd {reexecute=0 rethrow=0 return_oop=0}
│ ; - com.rkdeep.LoopUnrollBenchmark::longStride1@20 (line 85)
11.06% │ 0x00007f72e4fd5517: add r8,0x8
0.02% │ 0x00007f72e4fd551b: mov r12d,r8d ;*l2i {reexecute=0 rethrow=0 return_oop=0}
│ ; - com.rkdeep.LoopUnrollBenchmark::longStride1@18 (line 85)
│ 0x00007f72e4fd551e: add r10d,0x8
0.01% │ 0x00007f72e4fd5522: cmp r10d,r11d
╰ 0x00007f72e4fd5525: jl 0x00007f72e4fd54c1 ;*ifge {reexecute=0 rethrow=0 return_oop=0}
; - com.rkdeep.LoopUnrollBenchmark::longStride1@9 (line 83)
0x00007f72e4fd5527: cmp r10d,DWORD PTR [rsp+0x4]
0x00007f72e4fd552c: jge 0x00007f72e4fd5556
0x00007f72e4fd552e: xchg ax,ax ;*lload_1 {reexecute=0 rethrow=0 return_oop=0}
; - com.rkdeep.LoopUnrollBenchmark::longStride1@12 (line 85)
0x00007f72e4fd5530: cmp r12d,DWORD PTR [rsp]
0x00007f72e4fd5534: jae 0x00007f72e4fd55cb
0x00007f72e4fd553a: add rax,QWORD PTR [rcx+r12*8+0x10];*ladd {reexecute=0 rethrow=0 return_oop=0}
; - com.rkdeep.LoopUnrollBenchmark::longStride1@20 (line 85)
98.19% <total for region 1>
Benchmark Mode Cnt Score Error Units
LoopUnrollBenchmark.baseline thrpt 5 413725733.066 ± 20385130.808 ops/s
LoopUnrollBenchmark.baseline:asm thrpt NaN ---
LoopUnrollBenchmark.intStride1 thrpt 5 2581.171 ± 14.527 ops/s
LoopUnrollBenchmark.intStride1:asm thrpt NaN ---
LoopUnrollBenchmark.longStride1 thrpt 5 2427.188 ± 9.450 ops/s
LoopUnrollBenchmark.longStride1:asm thrpt NaN ---
As the assembly shows, no inner loop is generated, because there is no point in one once the safepoint has been removed. With an int counter the loop is unrolled by 16. The resulting code can be represented as in the listing below:
for (int i = 0; i < 1_000_000; i += 16) {
    sum += data[i];
    sum += data[i+1];
    sum += data[i+2];
    sum += data[i+3];
    sum += data[i+4];
    sum += data[i+5];
    sum += data[i+6];
    sum += data[i+7];
    sum += data[i+8];
    sum += data[i+9];
    sum += data[i+10];
    sum += data[i+11];
    sum += data[i+12];
    sum += data[i+13];
    sum += data[i+14];
    sum += data[i+15];
}
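The pseudocode above is a simplification. In the int listing the accumulator is first loaded with data[0] and the counter starts at 1, the unrolled loop runs while the counter is below 0xf4231 (999_985, i.e. 1_000_000 - 15), and a separate loop then handles the leftover elements up to 0xf4240 (1_000_000). A rough Java equivalent of that three-part shape might look like the sketch below; the names `PreMainPost` and `sum` are hypothetical, and the structure is read off the listing rather than taken from HotSpot documentation.

```java
class PreMainPost {
    static long sum(long[] data) {
        int n = data.length;
        if (n == 0) {
            return 0;
        }
        // Pre-loop: the first element is peeled off before the main loop.
        long sum = data[0];
        int i = 1;
        // Main loop: 16 additions per iteration; i + 15 stays in bounds while i < n - 15.
        int mainLimit = n - 15;
        for (; i < mainLimit; i += 16) {
            for (int k = 0; k < 16; k++) { // written as a loop only for brevity
                sum += data[i + k];
            }
        }
        // Post-loop: the remaining elements, one per iteration.
        for (; i < n; i++) {
            sum += data[i];
        }
        return sum;
    }

    public static void main(String[] args) {
        long[] data = new long[1_000_000];
        java.util.Arrays.fill(data, 1L);
        System.out.println(sum(data)); // prints 1000000
    }
}
```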
For a long counter, the loop is unrolled by 8. You can also see that with the long type seven elements are first loaded into the registers rbx, rdi, rdx, rbp, r13, r14 and r12, and only then added to the sum. The resulting code can be represented as in the listing below:
for (long l = 0; l < 1_000_000; l += 8) {
    int i = (int) l;
    sum += data[i];
    sum += data[i+1];
    sum += data[i+2];
    sum += data[i+3];
    sum += data[i+4];
    sum += data[i+5];
    sum += data[i+6];
    sum += data[i+7];
}
As you can see, Java 17 has significantly improved the handling of loops with a long counter. The throughput of the long-counter loop relative to the int-counter loop rose from about 57% in Java 11 (1391 vs 2458 ops/s) to about 94% in Java 17 (2427 vs 2581 ops/s).
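The percentages follow directly from the benchmark tables. As a small worked check, the snippet below divides the longStride1 score by the intStride1 score for each JVM; the class and method names (`ThroughputRatio`, `percent`) are hypothetical, and the numbers are copied from the results shown earlier.

```java
import java.util.Locale;

class ThroughputRatio {
    // Throughput of the long-counter loop as a percentage of the int-counter loop.
    static double percent(double longOpsPerSec, double intOpsPerSec) {
        return 100.0 * longOpsPerSec / intOpsPerSec;
    }

    public static void main(String[] args) {
        double java11 = percent(1391.287, 2457.647);
        double java17 = percent(2427.188, 2581.171);
        // Locale.ROOT keeps the decimal separator independent of the machine's locale.
        System.out.println(String.format(Locale.ROOT, "Java 11: %.1f%%", java11)); // Java 11: 56.6%
        System.out.println(String.format(Locale.ROOT, "Java 17: %.1f%%", java17)); // Java 17: 94.0%
    }
}
```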
Benchmark with safepoints -XX:+UseCountedLoopSafepoints and -XX:LoopStripMiningIter=1000
We will run the benchmark with the following parameters:
@Fork(value = 1, jvmArgsPrepend = {"-XX:+UnlockDiagnosticVMOptions", "-XX:-UseCompressedOops", "-XX:PrintAssemblyOptions=intel", "-XX:LoopStripMiningIter=1000"})
....[Hottest Region 1]..............................................................................
c2, level 4, com.rkdeep.LoopUnrollBenchmark::intStride1, version 3, compile id 720
0x00007f27f8fd76da: jbe 0x00007f27f8fd77d8
0x00007f27f8fd76e0: cmp r10d,0xf423f
0x00007f27f8fd76e7: jbe 0x00007f27f8fd77d8
0x00007f27f8fd76ed: mov rax,QWORD PTR [r8+0x10] ;*laload {reexecute=0 rethrow=0 return_oop=0}
; - com.rkdeep.LoopUnrollBenchmark::intStride1@16 (line 72)
0x00007f27f8fd76f1: mov r12d,0x1 ;*goto {reexecute=0 rethrow=0 return_oop=0}
; - com.rkdeep.LoopUnrollBenchmark::intStride1@22 (line 70)
0x00007f27f8fd76f7: mov ebx,0x3e80
0x00007f27f8fd76fc: xor ecx,ecx
╭ 0x00007f27f8fd76fe: jmp 0x00007f27f8fd777e
0.02% │↗ 0x00007f27f8fd7703: mov r12d,r11d ;*lload_1 {reexecute=0 rethrow=0 return_oop=0}
││ ; - com.rkdeep.LoopUnrollBenchmark::intStride1@10 (line 72)
5.60% ││ ↗ 0x00007f27f8fd7706: add rax,QWORD PTR [r8+r12*8+0x10]
0.01% ││ │ 0x00007f27f8fd770b: add rax,QWORD PTR [r8+r12*8+0x18]
5.73% ││ │ 0x00007f27f8fd7710: add rax,QWORD PTR [r8+r12*8+0x20]
5.75% ││ │ 0x00007f27f8fd7715: add rax,QWORD PTR [r8+r12*8+0x28]
5.48% ││ │ 0x00007f27f8fd771a: add rax,QWORD PTR [r8+r12*8+0x30]
5.59% ││ │ 0x00007f27f8fd771f: add rax,QWORD PTR [r8+r12*8+0x38]
9.69% ││ │ 0x00007f27f8fd7724: add rax,QWORD PTR [r8+r12*8+0x40]
5.59% ││ │ 0x00007f27f8fd7729: add rax,QWORD PTR [r8+r12*8+0x48]
5.77% ││ │ 0x00007f27f8fd772e: add rax,QWORD PTR [r8+r12*8+0x50]
5.38% ││ │ 0x00007f27f8fd7733: add rax,QWORD PTR [r8+r12*8+0x58]
6.13% ││ │ 0x00007f27f8fd7738: add rax,QWORD PTR [r8+r12*8+0x60]
5.56% ││ │ 0x00007f27f8fd773d: add rax,QWORD PTR [r8+r12*8+0x68]
5.58% ││ │ 0x00007f27f8fd7742: add rax,QWORD PTR [r8+r12*8+0x70]
5.56% ││ │ 0x00007f27f8fd7747: add rax,QWORD PTR [r8+r12*8+0x78]
9.13% ││ │ 0x00007f27f8fd774c: add rax,QWORD PTR [r8+r12*8+0x80]
5.83% ││ │ 0x00007f27f8fd7754: add rax,QWORD PTR [r8+r12*8+0x88];*ladd {reexecute=0 rethrow=0 return_oop=0}
││ │ ; - com.rkdeep.LoopUnrollBenchmark::intStride1@17 (line 72)
5.61% ││ │ 0x00007f27f8fd775c: mov r11d,r12d
0.00% ││ │ 0x00007f27f8fd775f: add r11d,0x10 ;*iinc {reexecute=0 rethrow=0 return_oop=0}
││ │ ; - com.rkdeep.LoopUnrollBenchmark::intStride1@19 (line 70)
0.00% ││ │ 0x00007f27f8fd7763: cmp r11d,r10d
│╰ │ 0x00007f27f8fd7766: jl 0x00007f27f8fd7703 ;*if_icmpge {reexecute=0 rethrow=0 return_oop=0}
│ │ ; - com.rkdeep.LoopUnrollBenchmark::intStride1@7 (line 70)
│ │ 0x00007f27f8fd7768: mov r9,QWORD PTR [r15+0x350] ; ImmutableOopMap {r8=Oop rdi=Oop }
│ │ ;*goto {reexecute=1 rethrow=0 return_oop=0}
│ │ ; - (reexecute) com.rkdeep.LoopUnrollBenchmark::intStride1@22 (line 70)
0.01% │ │ 0x00007f27f8fd776f: test DWORD PTR [r9],eax ;*goto {reexecute=0 rethrow=0 return_oop=0}
│ │ ; - com.rkdeep.LoopUnrollBenchmark::intStride1@22 (line 70)
│ │ ; {poll}
0.02% │ │ 0x00007f27f8fd7772: cmp r11d,0xf4231
│ ╭│ 0x00007f27f8fd7779: jge 0x00007f27f8fd77a5
│ ││ 0x00007f27f8fd777b: mov r12d,r11d
↘ ││ 0x00007f27f8fd777e: mov r10d,0xf4231
0.01% ││ 0x00007f27f8fd7784: sub r10d,r12d
││ 0x00007f27f8fd7787: cmp r12d,0xf4231
││ 0x00007f27f8fd778e: cmovg r10d,ecx
0.00% ││ 0x00007f27f8fd7792: cmp r10d,0x3e80
││ 0x00007f27f8fd7799: cmova r10d,ebx
0.00% ││ 0x00007f27f8fd779d: add r10d,r12d
0.00% │╰ 0x00007f27f8fd77a0: jmp 0x00007f27f8fd7706
↘ 0x00007f27f8fd77a5: cmp r11d,0xf4240
0x00007f27f8fd77ac: jge 0x00007f27f8fd77c5
0x00007f27f8fd77ae: add r12d,0x10
0x00007f27f8fd77b2: xchg ax,ax ;*lload_1 {reexecute=0 rethrow=0 return_oop=0}
; - com.rkdeep.LoopUnrollBenchmark::intStride1@10 (line 72)
0x00007f27f8fd77b4: add rax,QWORD PTR [r8+r12*8+0x10];*ladd {reexecute=0 rethrow=0 return_oop=0}
; - com.rkdeep.LoopUnrollBenchmark::intStride1@17 (line 72)
0x00007f27f8fd77b9: inc r12d ;*iinc {reexecute=0 rethrow=0 return_oop=0}
; - com.rkdeep.LoopUnrollBenchmark::intStride1@19 (line 70)
0x00007f27f8fd77bc: cmp r12d,0xf4240
98.07% <total for region 1>
....[Hottest Region 1]..............................................................................
c2, level 4, com.rkdeep.LoopUnrollBenchmark::longStride1, version 3, compile id 717
0x00007fe980fd651f: cmovl r10d,esi
0x00007fe980fd6523: movsxd r9,r10d
0x00007fe980fd6526: cmp r9,r11
0x00007fe980fd6529: cmovg r10d,edi
0x00007fe980fd652d: mov DWORD PTR [rsp+0x8],r10d
0x00007fe980fd6532: cmp r10d,0x2
╭ 0x00007fe980fd6536: jle 0x00007fe980fd65e7
│ ↗ 0x00007fe980fd653c: mov r10d,DWORD PTR [rsp+0x8]
│ │ 0x00007fe980fd6541: sub r10d,ecx
│ │ 0x00007fe980fd6544: mov r9d,DWORD PTR [rsp+0x8]
0.01% │ │ 0x00007fe980fd6549: xor r11d,r11d
│ │ 0x00007fe980fd654c: cmp r9d,ecx
0.00% │ │ 0x00007fe980fd654f: cmovl r10d,r11d
0.01% │ │ 0x00007fe980fd6553: cmp r10d,0x1f40
0.00% │ │ 0x00007fe980fd655a: mov r9d,0x1f40
│ │ 0x00007fe980fd6560: cmova r10d,r9d
0.01% │ │ 0x00007fe980fd6564: add r10d,ecx
0.00% │ │ 0x00007fe980fd6567: nop WORD PTR [rax+rax*1+0x0] ;*lload_1 {reexecute=0 rethrow=0 return_oop=0}
│ │ ; - com.rkdeep.LoopUnrollBenchmark::longStride1@12 (line 84)
11.22% │↗│ 0x00007fe980fd6570: cmp ebx,DWORD PTR [rsp]
│││ 0x00007fe980fd6573: jae 0x00007fe980fd6628
0.03% │││ 0x00007fe980fd6579: add rax,QWORD PTR [r8+rbx*8+0x10]
0.40% │││ 0x00007fe980fd657e: movsxd r9,ecx
0.04% │││ 0x00007fe980fd6581: mov rdx,r14
10.88% │││ 0x00007fe980fd6584: add rdx,r9
0.01% │││ 0x00007fe980fd6587: mov r11,rbp
0.10% │││ 0x00007fe980fd658a: add r11,r9
0.05% │││ 0x00007fe980fd658d: mov r9,QWORD PTR [r8+r11*8+0x48]
18.17% │││ 0x00007fe980fd6592: mov r12,QWORD PTR [r8+r11*8+0x40]
0.30% │││ 0x00007fe980fd6597: mov rbx,QWORD PTR [r8+r11*8+0x38]
0.47% │││ 0x00007fe980fd659c: mov rdi,QWORD PTR [r8+r11*8+0x30]
0.20% │││ 0x00007fe980fd65a1: mov rsi,QWORD PTR [r8+r11*8+0x28]
10.59% │││ 0x00007fe980fd65a6: mov r13,QWORD PTR [r8+r11*8+0x20]
0.09% │││ 0x00007fe980fd65ab: mov r11,QWORD PTR [r8+r11*8+0x18];*laload {reexecute=0 rethrow=0 return_oop=0}
│││ ; - com.rkdeep.LoopUnrollBenchmark::longStride1@19 (line 84)
0.44% │││ 0x00007fe980fd65b0: add rax,r11
0.22% │││ 0x00007fe980fd65b3: add rax,r13
10.65% │││ 0x00007fe980fd65b6: add rax,rsi
0.15% │││ 0x00007fe980fd65b9: add rax,rdi
0.57% │││ 0x00007fe980fd65bc: add rax,rbx
10.83% │││ 0x00007fe980fd65bf: add rax,r12
11.50% │││ 0x00007fe980fd65c2: add rax,r9 ;*ladd {reexecute=0 rethrow=0 return_oop=0}
│││ ; - com.rkdeep.LoopUnrollBenchmark::longStride1@20 (line 84)
11.28% │││ 0x00007fe980fd65c5: add rdx,0x8
0.01% │││ 0x00007fe980fd65c9: mov ebx,edx ;*l2i {reexecute=0 rethrow=0 return_oop=0}
│││ ; - com.rkdeep.LoopUnrollBenchmark::longStride1@18 (line 84)
0.02% │││ 0x00007fe980fd65cb: add ecx,0x8
0.04% │││ 0x00007fe980fd65ce: cmp ecx,r10d
│╰│ 0x00007fe980fd65d1: jl 0x00007fe980fd6570 ;*ifge {reexecute=0 rethrow=0 return_oop=0}
│ │ ; - com.rkdeep.LoopUnrollBenchmark::longStride1@9 (line 82)
0.02% │ │ 0x00007fe980fd65d3: mov r10,QWORD PTR [r15+0x350] ; ImmutableOopMap {r8=Oop xmm0=Oop }
│ │ ;*goto {reexecute=1 rethrow=0 return_oop=0}
│ │ ; - (reexecute) com.rkdeep.LoopUnrollBenchmark::longStride1@26 (line 82)
0.01% │ │ 0x00007fe980fd65da: test DWORD PTR [r10],eax ;*goto {reexecute=0 rethrow=0 return_oop=0}
│ │ ; - com.rkdeep.LoopUnrollBenchmark::longStride1@26 (line 82)
│ │ ; {poll}
0.11% │ │ 0x00007fe980fd65dd: cmp ecx,DWORD PTR [rsp+0x8]
│ ╰ 0x00007fe980fd65e1: jl 0x00007fe980fd653c
↘ 0x00007fe980fd65e7: cmp ecx,DWORD PTR [rsp+0x4]
╭ 0x00007fe980fd65eb: jge 0x00007fe980fd660e
0.00% │ 0x00007fe980fd65ed: data16 xchg ax,ax ;*l2i {reexecute=0 rethrow=0 return_oop=0}
│ ; - com.rkdeep.LoopUnrollBenchmark::longStride1@18 (line 84)
│↗ 0x00007fe980fd65f0: cmp ebx,DWORD PTR [rsp]
││ 0x00007fe980fd65f3: jae 0x00007fe980fd666f
││ 0x00007fe980fd65f5: add rax,QWORD PTR [r8+rbx*8+0x10];*ladd {reexecute=0 rethrow=0 return_oop=0}
││ ; - com.rkdeep.LoopUnrollBenchmark::longStride1@20 (line 84)
0.00% ││ 0x00007fe980fd65fa: movsxd rdx,ecx
││ 0x00007fe980fd65fd: add rdx,r14
││ 0x00007fe980fd6600: add rdx,0x1
││ 0x00007fe980fd6604: mov ebx,edx ;*l2i {reexecute=0 rethrow=0 return_oop=0}
││ ; - com.rkdeep.LoopUnrollBenchmark::longStride1@18 (line 84)
0.00% ││ 0x00007fe980fd6606: inc ecx
││ 0x00007fe980fd6608: cmp ecx,DWORD PTR [rsp+0x4]
│╰ 0x00007fe980fd660c: jl 0x00007fe980fd65f0 ;*ifge {reexecute=0 rethrow=0 return_oop=0}
│ ; - com.rkdeep.LoopUnrollBenchmark::longStride1@9 (line 82)
↘ 0x00007fe980fd660e: vmovq r11,xmm0
0x00007fe980fd6613: mov r10d,DWORD PTR [rsp]
0x00007fe980fd6617: cmp rdx,0xf4240
0x00007fe980fd661e: jge 0x00007fe980fd665c
0x00007fe980fd6620: mov r14,rdx
0x00007fe980fd6623: jmp 0x00007fe980fd645e
98.42% <total for region 1>
Benchmark Mode Cnt Score Error Units
LoopUnrollBenchmark.baseline thrpt 5 419882701.472 ± 13085589.188 ops/s
LoopUnrollBenchmark.baseline:asm thrpt NaN ---
LoopUnrollBenchmark.intStride1 thrpt 5 2493.944 ± 102.343 ops/s
LoopUnrollBenchmark.intStride1:asm thrpt NaN ---
LoopUnrollBenchmark.longStride1 thrpt 5 2446.834 ± 231.035 ops/s
LoopUnrollBenchmark.longStride1:asm thrpt NaN ---
As the assembly shows, both loops are unrolled: by 8 for the long counter and by 16 for the int counter. In both cases an inner loop is generated, with a safepoint poll placed after it.
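The strip lengths can be read from the immediates in the listings: 0xfa0 in the Java 11 int loop, and 0x3e80 and 0x1f40 in the Java 17 int and long loops. Dividing each by the observed unroll factor gives the same number in all three cases, which suggests that one strip covers LoopStripMiningIter iterations of the unrolled body. This is an inference from these listings only, not a statement about HotSpot internals; the names `StripSizes` and `bodyIterationsPerStrip` are hypothetical.

```java
class StripSizes {
    // Iterations of the unrolled loop body covered by one strip of stripElements array elements.
    static int bodyIterationsPerStrip(int stripElements, int unrollFactor) {
        return stripElements / unrollFactor;
    }

    public static void main(String[] args) {
        System.out.println(bodyIterationsPerStrip(0xfa0, 4));   // Java 11, int counter:  4000 / 4   = 1000
        System.out.println(bodyIterationsPerStrip(0x3e80, 16)); // Java 17, int counter:  16000 / 16 = 1000
        System.out.println(bodyIterationsPerStrip(0x1f40, 8));  // Java 17, long counter: 8000 / 8   = 1000
    }
}
```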
Java 11 applies optimisations differently to loops with int and long counters: in these measurements the loop with a long counter is roughly 1.8 times slower than the one with an int counter. In Java 17 both loops are unrolled and strip-mined, so each gets an inner loop with the safepoint poll placed after it. The -XX:LoopStripMiningIter and -XX:-UseCountedLoopSafepoints options make it possible to control how often the safepoint is polled while a loop is executing.