Working with Igalia to improve RISC-V LLVM Continuous Integration

by Alex Bradbury, Igalia S.L.

The LLVM project provides a widely used and actively developed suite of compiler and toolchain technologies, including Clang, the LLVM middle-end optimizer and backend code generator, MLIR, the LLD linker, and much more. While RISC-V LLVM support has progressed significantly, a key area that needed attention was the limited level of continuous integration (CI) for the RISC-V target. To bridge this gap, RISE collaborated with Igalia to provide the engineering expertise and server resources necessary to enhance and maintain robust CI support for RISC-V, ensuring more reliable and efficient development for the ecosystem.

Addressing gaps in LLVM Continuous Integration for RISC-V

As part of its testing infrastructure, LLVM relies primarily on post-commit CI to find target-specific bugs not covered by unit tests. Before this project, RISC-V testing was limited to one QEMU configuration and some hardware builders for LLVM libc. Expanding CI across more RISC-V and compiler build configurations improves LLVM’s quality by catching bugs earlier, when they are easier to fix.

This project adopts QEMU, at least initially, to run tests within a virtual environment. The reasoning is simple: we need to test configurations that aren’t yet available in shipping hardware (such as the forthcoming RVA23 profile), so QEMU has to form part of the testing mix, with hardware runners added as appropriate hardware becomes available. We also benefit from being able to set options like rvv_ta_all_1s to control simulated behaviour in cases where the specification admits multiple valid semantics.
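As a minimal sketch of how such options might be passed, the helper below formats a value for QEMU's -cpu argument. Only rvv_ta_all_1s is named in this article; the base model name (rv64), the vector length, and the rvv_ma_all_1s property are illustrative assumptions, and build_cpu_arg is a hypothetical helper rather than part of any builder script. Setting rvv_ta_all_1s makes the emulator overwrite tail-agnostic vector elements with all 1s, one of the behaviours the vector specification permits, which helps expose code that wrongly relies on tail elements being preserved.

```python
def build_cpu_arg(model, props):
    """Format a QEMU -cpu value, e.g. 'rv64,v=true,vlen=256'."""
    parts = [model]
    for name, value in props.items():
        # QEMU expects boolean properties spelled as true/false.
        if isinstance(value, bool):
            value = "true" if value else "false"
        parts.append(f"{name}={value}")
    return ",".join(parts)


# Assumed example settings; only rvv_ta_all_1s comes from the article.
cpu = build_cpu_arg(
    "rv64",
    {"v": True, "vlen": 256, "rvv_ta_all_1s": True, "rvv_ma_all_1s": True},
)
print(cpu)  # rv64,v=true,vlen=256,rvv_ta_all_1s=true,rvv_ma_all_1s=true
```

The resulting string would be supplied as the argument to -cpu for either qemu-riscv64 or qemu-system-riscv64.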

QEMU can be used either in full system emulation mode (where you boot a RISC-V Linux kernel and emulate devices) or user space emulation mode (where Linux syscalls are translated to the host). Although the latter has a higher chance of issues unique to the emulation environment and fundamentally can’t handle the ptrace syscall as used in debuggers, it can potentially run much faster. There is a huge number of possible builder configurations across different RISC-V subtargets, LLVM build configurations, LLVM tests run, and the choice of qemu-user, qemu-system, hardware, or some combination of these. Part of the project therefore involves exploring some of these options to find a mix that best meets the community’s needs, taking into account the trade-off between the number of tests run and the time to return results.
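To make the difference between the two modes concrete, here is a rough sketch of the command lines involved. The function names, file paths, core count, and memory size are all hypothetical placeholders, not the settings used by the builders; the QEMU options themselves (-L, -machine virt, -kernel, -append, -drive, -device) are standard ones.

```python
def user_mode_cmd(binary, sysroot, cpu="rv64"):
    """qemu-user: run one RISC-V Linux binary, translating syscalls to the host."""
    # -L points QEMU at a RISC-V sysroot so the dynamic loader and libraries resolve.
    return ["qemu-riscv64", "-cpu", cpu, "-L", sysroot, binary]


def system_mode_cmd(kernel, rootfs, cpu="rv64", smp=4, mem="8G"):
    """qemu-system: boot a RISC-V kernel and root filesystem on the 'virt' machine."""
    return [
        "qemu-system-riscv64",
        "-machine", "virt",
        "-cpu", cpu,
        "-smp", str(smp),
        "-m", mem,
        "-nographic",
        "-kernel", kernel,
        "-append", "root=/dev/vda rw console=ttyS0",
        # Attach the root filesystem image as a virtio block device.
        "-drive", f"file={rootfs},format=raw,id=hd0,if=none",
        "-device", "virtio-blk-device,drive=hd0",
    ]


# Placeholder paths, for illustration only.
print(" ".join(user_mode_cmd("./a.out", "/path/to/riscv-sysroot")))
print(" ".join(system_mode_cmd("/path/to/Image", "/path/to/rootfs.img")))
```

The user mode invocation needs no kernel or disk image, which is part of why it can start and run faster, while the system mode invocation emulates a full machine and so behaves more like real hardware.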

Status and progress

An important piece of context is that LLVM is a large project with over 5 million lines of code, and a two-stage build appropriate for our CI needs would take 24 hours or more on current RISC-V single board computers. This completes more quickly on a fast X86 host running full-system RISC-V emulation, but still takes 8 hours or so, meaning there’s huge potential benefit to iterating on alternative build setups.

The starting point for the project was to set up slow but reliable qemu-system based multi-stage builds on the staging buildmaster in several configurations:

rise-clang-riscv-rva20-2stage: Standard RVA20 target.
rise-clang-riscv-rva23-2stage: Standard RVA23 target.
rise-clang-riscv-rva23-mrvv-vec-bits-2stage: RVA23 with the -mrvv-vector-bits=zvl compiler flag set.
rise-clang-riscv-rva23-evl-vec-2stage: RVA23 with the EVL vectorizer enabled.
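The configurations above can be thought of as a table from builder name to extra compiler flags. The sketch below is an illustration only: the article names the -mrvv-vector-bits=zvl flag, but the -march profile strings and the -mllvm options shown for the EVL vectorizer are assumptions about how such builders might be configured, not a copy of the real buildbot settings. BUILDER_FLAGS and c_flags are hypothetical names.

```python
# Assumed flag sets; only -mrvv-vector-bits=zvl is taken from the article.
BUILDER_FLAGS = {
    "rise-clang-riscv-rva20-2stage": ["-march=rva20u64"],
    "rise-clang-riscv-rva23-2stage": ["-march=rva23u64"],
    "rise-clang-riscv-rva23-mrvv-vec-bits-2stage": [
        "-march=rva23u64",
        "-mrvv-vector-bits=zvl",
    ],
    "rise-clang-riscv-rva23-evl-vec-2stage": [
        "-march=rva23u64",
        # Assumed way of requesting EVL-based tail folding in the loop vectorizer.
        "-mllvm", "-force-tail-folding-style=data-with-evl",
        "-mllvm", "-prefer-predicate-over-epilogue=predicate-dont-vectorize",
    ],
}


def c_flags(builder):
    """Return the flags for a builder as a single string, e.g. for CMAKE_C_FLAGS."""
    return " ".join(BUILDER_FLAGS[builder])


for name in BUILDER_FLAGS:
    print(f"{name}: {c_flags(name)}")
```

Keeping the differences between builders in one table like this makes it cheap to add a further configuration later without duplicating the rest of the build recipe.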

This has acted, as intended, as an initial “pipe cleaner”. It has resulted in fixes and improvements both to the CI system itself and to LLVM and its subprojects, as well as detailed exploration of various build types. Effort has since focused on the planned next step: the deployment of a build configuration that can provide faster feedback. This is now out for review prior to a full rollout.

Although a build running fully within qemu-system will be kept in the mix for completeness, the majority of build configurations will be tested using the new configuration, which works as follows:

Build Clang from the current HEAD on the fast X86 host. ccache can be used for this step.
Use the just-built toolchain to cross-compile a “stage 2” toolchain for the chosen RISC-V configuration. ccache can’t be used for this, but it completes within 30 minutes to 1 hour depending on configuration.
Run the test suite for the cross-compiled toolchain within qemu-system. As documented as part of this work, the source directory and build artefacts can be rapidly transferred and accessed. A wrapper script for LLVM’s lit testing tool is responsible for transparently spinning up a single-use virtual machine to do this.
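The three steps above can be sketched as a list of commands. This is a rough outline under stated assumptions rather than the actual builder recipe: the directory layout, the target triple, the sysroot, and the lit wrapper path are placeholders, and plan_pipeline is a hypothetical helper. The CMake options used (LLVM_CCACHE_BUILD, LLVM_HOST_TRIPLE, LLVM_NATIVE_TOOL_DIR, LLVM_EXTERNAL_LIT) are existing LLVM options, but whether the real builders use this exact combination is not stated in the article.

```python
def plan_pipeline(src, build_root, sysroot, lit_wrapper,
                  target="riscv64-linux-gnu"):
    """Return the commands for the three steps; nothing is executed here."""
    s1, s2 = f"{build_root}/stage1", f"{build_root}/stage2"
    stage1 = [
        # Step 1: native build on the x86 host, with ccache enabled.
        ["cmake", "-G", "Ninja", "-S", f"{src}/llvm", "-B", s1,
         "-DCMAKE_BUILD_TYPE=Release",
         "-DLLVM_ENABLE_PROJECTS=clang;lld",
         "-DLLVM_CCACHE_BUILD=ON"],
        ["ninja", "-C", s1],
    ]
    stage2 = [
        # Step 2: cross-compile for RISC-V using the just-built Clang.
        ["cmake", "-G", "Ninja", "-S", f"{src}/llvm", "-B", s2,
         "-DCMAKE_BUILD_TYPE=Release",
         "-DLLVM_ENABLE_PROJECTS=clang;lld",
         "-DCMAKE_SYSTEM_NAME=Linux",
         f"-DCMAKE_SYSROOT={sysroot}",
         f"-DCMAKE_C_COMPILER={s1}/bin/clang",
         f"-DCMAKE_CXX_COMPILER={s1}/bin/clang++",
         f"-DCMAKE_C_COMPILER_TARGET={target}",
         f"-DCMAKE_CXX_COMPILER_TARGET={target}",
         f"-DLLVM_HOST_TRIPLE={target}",
         f"-DLLVM_NATIVE_TOOL_DIR={s1}/bin",
         # Hypothetical wrapper that runs lit inside a qemu-system VM.
         f"-DLLVM_EXTERNAL_LIT={lit_wrapper}"],
        ["ninja", "-C", s2],
    ]
    # Step 3: run the tests; the lit wrapper is what redirects them into the VM.
    test = [["ninja", "-C", s2, "check-all"]]
    return {"stage1": stage1, "stage2": stage2, "test": test}


plan = plan_pipeline("/path/to/llvm-project", "/path/to/build",
                     "/path/to/riscv-sysroot", "/path/to/lit-wrapper")
for step, commands in plan.items():
    for cmd in commands:
        print(step, ":", " ".join(cmd))
```

Only the first step benefits from ccache in this outline, matching the description above: the second step uses a compiler that changes with every commit, so cached results from earlier runs cannot be reused.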

Next steps

The next target is to complete the rollout of the new faster build configuration, then iterate to enable more sub-projects within it, address any issues that arise, and ultimately move these builders from the staging to the ‘main’ buildmaster.

With these new builders in place providing a better balance of coverage, response time, and fidelity, there will be opportunity to go beyond the current set of tests. We also need to expand our test suite – leveraging both LLVM’s own test-suite repository and ideally larger application builds through Yocto or similar. Such long-running tests with more infrequent reporting of results aren’t a good match for the current buildbot infrastructure, so we’ll need to explore how to best integrate such testing with it. It would also be valuable to go beyond post-commit CI and provide feedback on metrics like compiled code performance for pull requests while under review.

This continuous integration work is critical for advancing high quality RISC-V toolchains, and I would like to thank RISE for their ongoing support.

Stay tuned for updates including efforts to address SPEC performance gaps in RISC-V LLVM.