The Compact Strings feature might slow down applications with predominantly UTF-16 Strings

(Last Updated On: 6th June 2018)

Motivation

This article aims to show how the Compact Strings feature, added in JDK 9 by JEP 254, behaves in applications where the Strings are predominantly UTF-16.

Context

As you might know, in JDK 9 the internal representation of the String class changed from a UTF-16 char[] array to a byte[] array plus a coder flag field. The new String class stores characters encoded either as ISO-8859-1/Latin-1 (one byte per character) or as UTF-16 (two bytes per character), and the coder field indicates which encoding is used.

// String Class in JDK 9
public final class String 
    implements java.io.Serializable, Comparable<String>, CharSequence {
    @Stable
    private final byte[] value;  // the characters, one or two bytes each
    private final byte coder;    // LATIN1 (0) or UTF16 (1)

    //...
}

This new internal String representation (i.e. a byte[] array instead of a char[] array) enables a new scheme of compacting Strings at construction time: ISO-8859-1/Latin-1 Strings are stored using one byte per character instead of two, reducing the overall String footprint.

By default, when a new String is created, it first attempts to compress the input char[] to Latin-1 (i.e. each character backed by one byte); this succeeds only if every character fits into a single byte. If it fails, UTF-16 encoding is used, where each char spans two bytes. The code looks like below (snippet from the java.lang.String class):

String(char[] value, int off, int len, Void sig) {
    //...
    if (COMPACT_STRINGS) {
        byte[] val = StringUTF16.compress(value, off, len);
        if (val != null) {
            this.value = val;
            this.coder = LATIN1;
            return;
        }
    }
    this.coder = UTF16;
    this.value = StringUTF16.toBytes(value, off, len);
}

Compressing Strings always happens by default, since the COMPACT_STRINGS field is implicitly true. However, it can be disabled by starting the JVM with the -XX:-CompactStrings flag.
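
To see the effect of compaction on memory, below is a small sketch (my own illustration, not part of the JDK or of the benchmark further down) that compares the retained footprint of a Latin-1 String and a UTF-16 String of the same length; it assumes the JOL (Java Object Layout) library is available on the classpath:

// Footprint comparison sketch, assuming the org.openjdk.jol library is on the classpath.
import org.openjdk.jol.info.GraphLayout;

public class StringFootprint {

    public static void main(String[] args) {
        String latin1 = "abcdefgh";  // every character fits into ISO-8859-1/Latin-1
        String utf16  = "абвгдежз";  // Cyrillic characters require UTF-16

        // With the default -XX:+CompactStrings the Latin-1 String is backed by one byte
        // per character, while the UTF-16 String needs two; with -XX:-CompactStrings
        // both backing arrays use two bytes per character.
        System.out.println("Latin-1 footprint: " + GraphLayout.parseInstance(latin1).totalSize() + " bytes");
        System.out.println("UTF-16  footprint: " + GraphLayout.parseInstance(utf16).totalSize() + " bytes");
    }
}

Running it with and without -XX:-CompactStrings should show the Latin-1 footprint growing once compaction is disabled, while the UTF-16 footprint stays the same.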

Microbenchmark

I have created a small JMH benchmark that concatenates multiple UTF-16 Strings, and I measured the elapsed time with the Compact Strings feature enabled (the default JDK 9 setting) and disabled (i.e. -XX:-CompactStrings). The code is below:

import java.util.concurrent.TimeUnit;

import org.openjdk.jmh.annotations.*;
import org.openjdk.jmh.profile.GCProfiler;
import org.openjdk.jmh.runner.Runner;
import org.openjdk.jmh.runner.RunnerException;
import org.openjdk.jmh.runner.options.Options;
import org.openjdk.jmh.runner.options.OptionsBuilder;

@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@Warmup(iterations = 5, timeUnit = TimeUnit.NANOSECONDS)
@Measurement(iterations = 5, timeUnit = TimeUnit.NANOSECONDS)
@Fork(value = 5, warmups = 1)
@State(Scope.Benchmark)
public class CompactStringsJmh {

    @Param({ "Ой,всёÐ" })
    public String utf_16_str1;

    @Param({ "ϑ¿Ñ€Ð¾Ð¿φаϪ" })
    public String utf_16_str2;

    @Param({ "Ðϛζ»Ð¾,шеф"; })
    public String utf_16_str3;

    @Param({ "ΦЀ¾ʬϪлÐΛϼϨЁ" })
    public String utf_16_str4;

    @Param({ "ΏΔΘΞΨθςώϚϠϨϱ" })
    public String utf_16_str5;

    public static void main(String[] args) throws RunnerException {

        Options opts = new OptionsBuilder()
                .include(CompactStringsJmh.class.getName())
                .addProfiler(GCProfiler.class)
                .build();
        new Runner(opts).run();
    }

    @Benchmark
    public String utf16_concat() {
        return utf_16_str1 
             + utf_16_str2 
             + utf_16_str3 
             + utf_16_str4 
             + utf_16_str5;
    }

}
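
The listing above does not show how the two configurations were selected; one possible way (an assumption on my part, not taken from the original code) is to append the JVM flag to the forked benchmark JVMs via JMH's OptionsBuilder:

// Sketch: a variant of the main() method that forks the benchmark JVMs
// with Compact Strings disabled; omit the jvmArgsAppend line for the
// default (-XX:+CompactStrings) run.
Options opts = new OptionsBuilder()
        .include(CompactStringsJmh.class.getName())
        .addProfiler(GCProfiler.class)
        .jvmArgsAppend("-XX:-CompactStrings")
        .build();
new Runner(opts).run();

Alternatively, the flag can be passed with @Fork(jvmArgsAppend = { "-XX:-CompactStrings" }) directly on the benchmark class.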

I ran the above benchmark on JDK 9.

JVM flag              Benchmark                          Mode  Cnt     Score   Error  Units
-XX:+CompactStrings   utf16_concat                       avgt   25    44.469 ± 3.041  ns/op
-XX:+CompactStrings   utf16_concat:·gc.alloc.rate.norm   avgt   25   168.000 ± 0.001   B/op

-XX:-CompactStrings   utf16_concat                       avgt   25    35.785 ± 0.147  ns/op
-XX:-CompactStrings   utf16_concat:·gc.alloc.rate.norm   avgt   25   168.000 ± 0.001   B/op

Tests were run on my machine (CPU: Intel i7-6700HQ Skylake; MEMORY: 16 GB DDR4 2133 MHz; OS: Ubuntu 16.04.2).

Conclusions:
  • with Compact Strings enabled, concatenating the same UTF-16 Strings takes more time (i.e. 44.469 ns/op) than with Compact Strings disabled (i.e. 35.785 ns/op). The overhead may grow with the number of UTF-16 Strings in the application: the more UTF-16 Strings are concatenated or created, the more time is spent, hence the less optimal it gets!
    • this happens because every String construction first tries to compress the characters and always fails, since UTF-16 Strings cannot be compressed. Even though the COMPACT_STRINGS field can be constant folded away by the Just-In-Time compiler, the explicit call to the StringUTF16.compress() method still happens and takes time without any benefit in this case
  • in both cases the normalized allocation rate is the same (i.e. 168 B/op), so the same amount of memory is allocated per concatenation, since UTF-16 Strings cannot be compacted anyway

This leads to an interesting takeaway: for applications that extensively use UTF-16 Strings, it might be worth considering disabling the Compact Strings feature for better performance! However, do not rely on this blindly; my advice is simply to keep it in mind and test whether it fits your application.

Author: ionutbalosin

Software Architect, Technical Trainer
