
CMoveX & Vectorization

July 23, 2025

In the ever-evolving landscape of software development, performance is paramount. For the Java platform, this means continuous innovation in the Just-In-Time (JIT) compiler to generate highly optimized code for a variety of hardware architectures. A recent series of pull requests to the OpenJDK project showcases a significant push to enhance the performance of Java applications on the RISC-V architecture. This blog post will take you through these optimizations, from the introduction of conditional moves to the speedups achieved through vectorization.

The Power of Conditional Moves (CMove)

At the heart of many of these optimizations is the conditional move (CMove) instruction (czero.eqz and czero.nez in the Zicond extension). Unlike traditional branching, which can suffer performance penalties from branch mispredictions, CMove instructions let the processor select a value based on a condition without altering control flow. This leads to more efficient and predictable code execution.
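
As a concrete illustration, consider a Java ternary of the kind C2 may compile to a conditional move rather than a branch. This is a minimal sketch with illustrative names; whether a CMove is actually emitted depends on profiling data and the target platform:

    import java.util.Random;

    public class CMoveSketch {
        // With unpredictable input, C2 may replace the branch implied by
        // this ternary with a conditional move, sidestepping the cost of
        // branch mispredictions.
        static int clampToZero(int x) {
            return (x < 0) ? 0 : x;
        }

        public static void main(String[] args) {
            Random r = new Random(42);
            long sum = 0;
            // Random signs make the branch unpredictable, which is exactly
            // the case where a conditional move beats a conditional branch.
            for (int i = 0; i < 10_000_000; i++) {
                sum += clampToZero(r.nextInt());
            }
            System.out.println(sum);
        }
    }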

A key set of changes in the OpenJDK has been focused on harnessing the power of CMove for RISC-V:

  • Checking for availability – PR #24095: The first step was to detect CMoveX support on riscv64. This was a key initial step, as it enabled the selective use of conditional moves, though not all variants (such as CMoveF/D for floats and doubles) were fully optimized on the platform.
  • Optimizing min/max – PR #24153: This change improved min/max operations by using the Zicond extension on RISC-V, replacing branches with CMove instructions. Avoiding branch mispredictions improved performance and simplified the generated code (see the sketch after this list).
  • Expanding CMove Support – PR #24490: This effort implemented and enabled CMoveI/L (conditional move for integer and long types), along with further C2 compiler optimizations and new benchmarks to validate the improvements. It also left the UseZicond flag off by default, because the larger code generated with Zicond could sometimes lead to performance regressions.
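
To make the min/max case concrete, here is a minimal sketch (names are illustrative) of the kind of reduction loop that benefits: C2 can turn Math.max into its MaxI node, which the Zicond extension lets it lower to branchless conditional moves instead of a compare-and-branch.

    import java.util.Random;

    public class MinMaxReduction {
        // Math.max maps to C2's MaxI node; with Zicond, RISC-V can lower
        // it to czero-based conditional moves instead of a branch that
        // the hardware may mispredict on random data.
        static int maxOf(int[] a) {
            int max = Integer.MIN_VALUE;
            for (int v : a) {
                max = Math.max(max, v);
            }
            return max;
        }

        public static void main(String[] args) {
            int[] a = new Random(1).ints(1_000_000).toArray();
            System.out.println(maxOf(a));
        }
    }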

More details about CMoveF/D on RISC-V

Implementing CMoveF/D on riscv64 is less efficient than other CMoveX variants such as CMoveI. For instance, a CMoveF/D operation with a float/double comparison might involve the following steps:

    feq.s      t0, cmp1, cmp2      # t0 = (cmp1 == cmp2) ? 1 : 0
    fmv.x.w    tmpDst, dst         # move dst's bits into an integer register
    fmv.x.w    tmpSrc, src         # move src's bits into an integer register
    czero.nez  tmpDst, tmpDst, t0  # tmpDst = 0 if the condition holds, else dst bits
    czero.eqz  t0, tmpSrc, t0      # t0 = src bits if the condition holds, else 0
    or         tmpDst, tmpDst, t0  # select: src bits if true, dst bits otherwise
    fmv.w.x    dst, tmpDst         # move the selected bits back to dst

This seven-instruction sequence significantly increases code size, especially when the Hotspot JVM unrolls loops, in sharp contrast to CMoveI/L. However, this concern only applies when the Hotspot JVM generates scalar code. As we will explore in the next section, vectorization can still yield performance benefits even with CMoveF/D.

Vectorization: The Next Frontier

Building upon a strong foundation for conditional moves, the next logical step was to significantly enhance performance through vectorization. This technique allows the processor to execute the same operation on multiple data points concurrently, yielding substantial performance gains, particularly within loops.

Pivotal optimizations were introduced in PR #25336 and PR #25341, enabling the vectorization of conditional expressions in the Hotspot JVM. Specifically, expressions of the form fd_1 bop fd_2 ? res_1 : res_2 (a float/double comparison selecting between two results) can now be transformed into VectorBlend operations. This was achieved by relaxing the previous constraint that operands and results must have the same data type size, a seemingly minor adjustment that unlocked substantial performance gains.

Furthermore, these pull requests also eased the constraint on transforming Op_CMoveI/L to Op_VectorBlend on RISC-V, offering additional benefits when the result is not a float or double type.
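
To illustrate, here is a hypothetical loop shape (illustrative names; a double comparison selecting int results) of the kind the relaxed size constraint now lets the superword pass transform into a VectorBlend:

    public class VectorBlendSketch {
        // A double comparison (8-byte operands) selecting int results
        // (4-byte): a mixed-size case that was previously rejected. With
        // the relaxed constraint, C2's superword pass can turn the
        // per-element conditional move into a single VectorBlend across
        // a whole vector of elements.
        static void select(double[] a, double[] b, int[] x, int[] y, int[] out) {
            for (int i = 0; i < out.length; i++) {
                out[i] = (a[i] < b[i]) ? x[i] : y[i];
            }
        }
    }

Note that loops like this one only vectorize when the flags discussed below are enabled.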

The impact of this vectorization is substantial. With the appropriate flags enabled (-XX:+UseVectorCmov -XX:+UseCMoveUnconditionally), performance improved by over 2.1 times on average, and in some instances by more than 4 times. Crucially, the changes were implemented so that they introduce no performance regressions when the flags are disabled.

What’s next? One gap in the Hotspot JVM still needs to be addressed to fully realize this performance improvement: superword vectorization does not currently support unsigned comparisons. The unsigned information is lost during vectorization, so all comparisons are treated as signed. Because C2’s superword transformation is all-or-nothing, honoring only signed semantics could produce incorrect programs at runtime, so such loops must remain scalar for now. Once this gap is filled, we can finally integrate this performance optimization into the Hotspot JVM.
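
For illustration, a loop shape that falls into this gap might look as follows. This is a hedged sketch: Integer.compareUnsigned is one way an unsigned comparison can reach C2, and the names are illustrative.

    public class UnsignedCompareGap {
        // The unsigned comparison below cannot be represented by the
        // superword pass today: the "unsigned" property is lost and the
        // compare would be treated as signed, so C2 must leave loops
        // like this one scalar to stay correct.
        static void selectUnsignedMin(int[] a, int[] b, int[] out) {
            for (int i = 0; i < out.length; i++) {
                out[i] = (Integer.compareUnsigned(a[i], b[i]) < 0) ? a[i] : b[i];
            }
        }
    }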

It’s also worth noting that this optimization is not limited to RISC-V; it will benefit all other platforms running OpenJDK.

Conclusion: A Faster Future for Java on RISC-V

These major improvements, detailed in several pull requests, significantly boost Java performance on RISC-V. By strategically using conditional moves and fully utilizing vectorization, the OpenJDK community has again pushed performance limits.

Beyond just making Java faster on RISC-V, these improvements highlight the continuous innovation that keeps Java a leading technology. As RISC-V becomes more popular, these optimizations will be crucial for ensuring Java applications provide the speed and efficiency users expect.