Posted by: Florian Loitsch, Ji Qiu, Kasper Lund, Yahan Lu, Zhijin Zeng
In 2020, Kim McMahon wrote a blog post where she announced the open-sourcing of V8 on RISC-V. At that time, the port still lived in a separate repository, and lots of work remained to be done. In the last few years, RISC-V support has been upstreamed to the main V8 repository, and the port is now mostly at feature parity with officially supported architectures like x86_64 and ARM64. The port is continuously tested on V8’s buildbots, where it shines as one of the greenest ports. The RISC-V community is one of the fastest to fix breakages that are introduced by V8 core developers. An independent Jenkins buildbot tests even more configurations.
In this post, we want to highlight a few interesting changes that were made in the last 6 months. These are in no way exhaustive, or even representative of all the work that has gone into the RISC-V port of V8. Instead, we have picked a few changes that we found particularly interesting.
Pool improvements
The RISC-V port uses two kinds of pools: constant pools and trampoline pools. The constant pool is used to load constants that cannot be encoded directly in the instructions, while the trampoline pool is used to handle long jumps that cannot be encoded in a single instruction.
Basically, V8 emits a near-jump for all forward jumps. If the target ends up too far away, the near-jump is retargeted to a trampoline pool entry that contains a long jump to the actual target. Since near jumps have only 13 bits available as offset, we must ensure that the trampoline pool is never emitted too far away from the near-jump instruction. To achieve this, V8 regularly checks during code generation whether the distance between the last trampoline pool and the current position exceeds a certain threshold. If it does, the trampoline pool is emitted at the current position.
```
# Jump over the trampoline pool (for 64 entries).
j 516
# Entry 0:
auipc t6, 0x0
jalr zero_reg, 1552(t6)
# Entry 1:
auipc t6, 0x0
jalr zero_reg, 1632(t6)
...
```
The constant pool used to work similarly. However, because each pool could influence the position of the other pool, some complicated logic was needed to ensure that emitting one pool wouldn’t push the other pool too far away.
Some of this complexity was inherited from the MIPS port, which was used as a starting point for the RISC-V port. Recently, we simplified the logic by taking advantage of the fact that instructions that load from the constant pool can have a larger offset of up to 32 bits. Instead of emitting the constant pool during instruction emission, we now emit it at the end of code generation.
This approach has two big advantages:
- It simplifies the logic for emitting pools.
- It prepares for moving constants into a non-executable section in the future. It is generally bad practice to give users write access to executable memory. However, since constants come from the user, this is exactly what happens today. Now that the constants aren’t intermingled with the code anymore, it is easier to move the whole block to a memory section that isn’t executable.
Atomic jump table patching
This section describes a bug fix that was made in the WebAssembly (Wasm) implementation. Wasm uses a jump table to call functions indirectly. These “tables” look similar to trampoline pools in that they contain a long list of jump instructions. Initially, the jumps go to stubs that compile the actual function on first call. Once the function is compiled, the jump instruction is patched to jump directly to the compiled function. Later, the same jump instruction can be patched again to point to a better-optimized version of the function.
Until recently, these patchable jumps were implemented as follows:
```
auipc t6, high_20    # Load the upper 20 bits of the target address.
jalr  t6, t6, low_12 # Jump to t6 + low_12 offset.
```
Care was taken to ensure that the two instructions were pushed to memory atomically, so that no thread could see a half-updated jump instruction sequence.
However, we missed that, on the reader side, the CPU could have already executed the first instruction before executing the second one. If the code was updated in between, memory was always consistent, but the CPU would have executed two unrelated instructions, leading to jumps to unexpected locations.
To fix this, we changed the patching sequence to load the target from memory instead of constructing it in two parts:
```
auipc t6, 0      # Load the PC into t6.
ld    t6, 16(t6) # Load the target address (below).
jalr  x0, t6     # Jump to the target.
nop
.dw target[0]    # Lower 32 bits of the target address.
.dw target[1]    # Upper 32 bits of the target address.
```
Since the memory update is atomically visible to other threads, the CPU cannot see half-updated instructions anymore.
This change fixes the race condition, but requires more instructions. In many cases, we can do better: when a target is only at a small offset, then the `jal` instruction can be used directly: `jal x0 <imm21>`.
```
Far              | Near
-----------------+-----------------
auipc t6, 0      | jal
ld    t6, 16(t6) | ld    t6, 16(t6)
jalr  x0, t6     | jalr  x0, t6
nop              | nop
.dw target[0]    | .dw target[0]
.dw target[1]    | .dw target[1]
```
The patcher now chooses between the two sequences depending on the target address. If it is a short distance away, the patcher just writes the `jal` instruction directly. A reader that already executed the first instruction isn’t affected by the change and completes the old sequence normally.
If the target is at a longer distance, the patcher updates the target address, and then patches the first instruction to `auipc t6, 0`.
In both cases there is no race condition anymore.
Performance improvements
There are too many performance improvements to list them all, but here are a few highlights.
You can find more of these kinds of optimizations in SpacemiT’s blog post.
SHxADD
The `shxadd` instruction is part of the Zba extension. It fuses a shift and an addition into a single instruction, similar to what `lea` does on x86_64. V8 now uses this instruction when available to speed up address calculations. Together with some reshuffling of the offset, we can sometimes halve the number of instructions for loads:
```
slli t7, t7, 3
addi t1, t7, 15
add  t1, t2, t1
ld   t3, 0(t1)
```
becomes
```
sh3add t1, t7, t2
ld     t3, 15(t1)
```
Optimized pointer decompression
Another instruction from the same extension (Zba) can also be used to dramatically reduce the amount of code that is used to decompress tagged pointers.
Compressed pointers reduce memory usage by storing pointers as 32-bit unsigned offsets relative to a base register. Decompressing a pointer just consists of adding the offset and the base register together. As simple as this sounds, it comes with a small complication on our RISC-V 64-bit port: by construction, 32-bit values are always loaded into 64-bit registers as signed values. This means that we need to zero-extend the 32-bit offset first. Until recently this was done by AND-ing the register with 0xFFFF_FFFF:
```
li   t3, 1
slli t3, t3, 32
addi t3, t3, -1
and  a0, a0, t3
```
Now, this code uses the `zext.w` instruction from the Zba extension:
```
zext.w a0, a0
```
After adding the base pointer (a single instruction) the cost of decompressing a tagged pointer thus went from 5 instructions to 2; a nice and noticeable improvement.
Keeping up with main
V8 is rapidly evolving. Google engineers are constantly adding new features, improving performance, and fixing bugs. It’s up to port maintainers to keep up with these changes and ensure that the port remains functional.
About a quarter of all RISC-V commits in the last 6 months were dedicated to keeping up with main. Sometimes this can be as simple as adding a handful of lines, implementing a peephole optimization. Other times, it can involve more complex changes, requiring significant effort from the port maintainers.
The aforementioned peephole optimization added a shortcut for obtaining the high 32 bits of a 64-bit multiplication. On RISC-V, this can be implemented directly:
```cpp
void Int32MultiplyOverflownBits::SetValueLocationConstraints() {
  UseRegister(left_input());
  UseRegister(right_input());
  DefineAsRegister(this);
}

void Int32MultiplyOverflownBits::GenerateCode(MaglevAssembler* masm,
                                              const ProcessingState& state) {
  Register left = ToRegister(left_input());
  Register right = ToRegister(right_input());
  Register out = ToRegister(result());

  __ Mul32(out, left, right);
  __ srai(out, out, 32);
}
```
Vectors
During the last 6 months, we have also significantly improved the vector support in V8. Wasm supports SIMD operations that had already been mapped to RISC-V’s vector extension (RVV). However, until recently, the implementation was written for 128-bit vectors only, while RVV is more flexible and allows vectors of different lengths. Now, CPUs with larger vectors (256, and 512 bits) are supported as well. We have also tested the implementation on real hardware, and not just on V8’s built-in RISC-V simulator. This testing revealed some bugs that have been fixed. Specifically, we had to ensure that vector registers were correctly saved and restored before and after calls to C++ code.
SIMD instructions are heavily used in some benchmarks of the JetStream benchmark suite. These fixes were the last missing piece to run the suite to completion.
RISC-V 32-bit deprecation
RISC-V 32-bit is primarily used in small, embedded systems. In fact, it’s difficult to even find Linux distributions running on this architecture. Given that V8 requires an operating system like Linux, and typically can’t run on these small embedded systems anyway, the 32-bit port of RISC-V has been deprecated. The port will be maintained until May 2026, but if nobody comes forward with a strong use case, it will be removed.
Conclusion
The RISC-V port of V8 has come a long way in the last few years. While there is still work to be done, the port is now mostly at feature parity with the officially supported architectures. V8 on RISC-V now runs the full JetStream benchmark suite, which consists of ~33 MB of Wasm bytecode and ~2M lines of JavaScript code – and it is ready for your workloads too.