Tutorial
The following tutorial will guide you through the use of the Snitch cluster. You will learn how to develop, simulate, debug and benchmark software for the Snitch cluster architecture.
You can assume the working directory to be target/snitch_cluster. All paths are to be assumed relative to this directory. Paths relative to the root of the repository are prefixed with a slash.
Setup
If you don't have access to an IIS machine and have instead set up the Snitch Docker container as described in the getting started guide, all of the commands presented in this tutorial have to be executed inside the Docker container.
To run the container in interactive mode:
docker run -it -v <path_to_repository_root>:/repo -w /repo ghcr.io/pulp-platform/snitch_cluster:main
Where you should replace <path_to_repository_root> with the path to the root directory of the Snitch cluster repository cloned on your machine.
Warning
As QuestaSim and VCS are proprietary tools and require a license, only Verilator is provided within the container for RTL simulations.
Building the hardware
To run software on Snitch without a physical chip, you will need a simulation model of the Snitch cluster. You can build a cycle-accurate simulation model from the RTL sources directly using QuestaSim, VCS or Verilator, with one of the following commands:
make bin/snitch_cluster.vlt
make DEBUG=ON bin/snitch_cluster.vsim
make bin/snitch_cluster.vcs
These commands compile the RTL sources in work-vlt, work-vsim and work-vcs, respectively. Additionally, common C++ testbench sources (e.g. the frontend server (fesvr)) are compiled under work. Each command will also generate a script or an executable (e.g. bin/snitch_cluster.vsim) which we can use to simulate software on Snitch, as we will see in section Running a simulation.
Info
The variable DEBUG=ON is required when using QuestaSim to preserve the visibility of all internal signals. If you need to inspect the simulation waveforms, you should set this variable when building the simulation model. For faster simulations you can omit the variable assignment, allowing QuestaSim to optimize internal signals away.
Building the Banshee simulator
Instead of building a simulation model from the RTL sources, you can use our instruction-accurate simulator called banshee. To install the simulator, please follow the instructions provided in the Banshee repository.
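For orientation, Banshee is a Rust project, so the installation typically follows a standard Cargo workflow, roughly as sketched below. This is only a sketch: the repository URL and the exact steps, including prerequisites such as a specific LLVM version, should be taken from the Banshee repository itself.
git clone https://github.com/pulp-platform/banshee.git
cd banshee
cargo install --path .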
Configuring the hardware
The Snitch cluster RTL sources are partly automatically generated from a configuration file provided in .hjson format. Several RTL files are templated and use the .hjson configuration file as input to fill in the template. An example is snitch_cluster_wrapper.sv.tpl.
In the cfg folder, different configurations are provided. The cfg/default.hjson configuration instantiates 8 compute cores + 1 DMA core in the cluster.
The command you previously executed automatically generated the RTL sources from the templates, and it implicitly used the default configuration file. In this configuration the FPU is not equipped with a floating-point divide and square-root unit.
To override the default configuration file, e.g. to use the configuration with the FDIV/FSQRT unit, define the following variable when you invoke make:
make CFG_OVERRIDE=cfg/fdiv.hjson bin/snitch_cluster.vlt
If you want to use a custom configuration, just point CFG_OVERRIDE to the path of your configuration file.
Tip
When you override the configuration file on the make command-line, the configuration is stored in the cfg/lru.hjson file. Successive invocations of make will automatically pick up the cfg/lru.hjson file. You can therefore omit the CFG_OVERRIDE definition in successive commands, unless you want to override the least-recently used configuration.
Building the software
To build all of the software for the Snitch cluster, run one of the following commands. Different simulators may require different C runtime or library function implementations, so different options have to be specified to select the appropriate implementation, e.g. for Banshee simulations or OpenOCD semi-hosting:
make DEBUG=ON sw -j
make DEBUG=ON SELECT_RUNTIME=banshee sw -j
make DEBUG=ON OPENOCD_SEMIHOSTING=ON sw -j
This builds all software targets defined in the repository, e.g. the Snitch runtime library and all applications. Artifacts are stored in the build directory of each target. For example, have a look inside sw/apps/blas/axpy/build/ and you will find the artifacts of the AXPY application build, e.g. the compiled executable axpy.elf and a disassembly axpy.dump.
If you only want to build a specific software target, you can do so by replacing sw with the name of that target, e.g. the name of an application:
make DEBUG=ON axpy -j
For this to be possible, we require all software targets to have names that are unique and distinct from any other Make target.
Warning
The RTL is not the only source which is generated from the configuration file. The software stack also depends on the configuration file. Make sure you always build the software with the same configuration of the hardware you are going to run it on.
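For example, if you built the hardware with the FDIV configuration shown earlier, rebuild the software with the same override. This is a sketch combining the variables introduced above; thanks to cfg/lru.hjson the override is usually picked up automatically, but stating it explicitly does no harm:
make CFG_OVERRIDE=cfg/fdiv.hjson DEBUG=ON sw -j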
Info
The DEBUG=ON flag is used to tell the compiler to produce debugging symbols and to disassemble the generated ELF binaries for inspection (.dump files in the build directories). Debugging symbols are required by the annotate target, showcased in the Debugging and benchmarking section of this guide.
Tip
On GVSOC, it is better to use OpenOCD semi-hosting to prevent putchar from disturbing the DRAMSys timing model.
Running a simulation
Run one of the executables which was compiled in the previous step on your Snitch cluster simulator of choice:
bin/snitch_cluster.vlt sw/apps/blas/axpy/build/axpy.elf
bin/snitch_cluster.vsim sw/apps/blas/axpy/build/axpy.elf
bin/snitch_cluster.vcs sw/apps/blas/axpy/build/axpy.elf
banshee --no-opt-llvm --no-opt-jit --configuration src/banshee.yaml --trace sw/apps/blas/axpy/build/axpy.elf
The simulator binaries can be invoked from any directory, just adapt the relative paths in the preceding commands accordingly, or use absolute paths. We refer to the working directory where the simulation is launched as the simulation directory. Within it, you will find several log files produced by the RTL simulation.
Tip
If you don't want your log files to be overridden when you run another simulation, just create separate simulation directories for every simulation whose artifacts you want to preserve, and run the simulations therein.
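For example, assuming your current working directory is target/snitch_cluster, you could create a dedicated directory per run (the runs/axpy_vlt name is just an example) and launch the simulation from there:
mkdir -p runs/axpy_vlt
cd runs/axpy_vlt
../../bin/snitch_cluster.vlt ../../sw/apps/blas/axpy/build/axpy.elf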
The previous commands will launch the simulation on the console. QuestaSim simulations can also be launched with the GUI, e.g. for waveform inspection. Just adapt the previous command to:
bin/snitch_cluster.vsim.gui sw/apps/blas/axpy/build/axpy.elf
Debugging and benchmarking
When you run a simulation, every core logs all the instructions it executes in a trace file. The traces are located in the logs folder within the simulation directory. Every trace is identified by a hart ID, i.e. a unique ID for every hardware thread (hart) in a RISC-V system (and, since all our cores are single-threaded, a unique ID per core).
The simulation dumps the traces in a non-human-readable format with .dasm extension. To convert these to a human-readable form, run:
make traces -j
If the simulation directory does not coincide with the current working directory, you will have to provide the path to the simulation directory explicitly. This holds for all of the commands in this section:
make traces SIM_DIR=<path_to_simulation_directory> -j
This will generate human-readable traces with .txt extension. In addition, several performance metrics will be computed and appended to the end of the trace. These and additional metrics are also dumped to a .json file for further processing. Detailed information on how to interpret the traces and performance metrics can be found in the Trace Analysis page.
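If you want to post-process the dumped metrics yourself, you can simply load the .json files from the logs folder. A minimal Python sketch follows; the glob pattern is an assumption, so adapt it to the file names you actually find in your logs folder:
import glob, json

for path in sorted(glob.glob("logs/*.json")):
    with open(path) as f:
        print(path, json.load(f))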
Debugging a program from the traces alone can be quite tedious and time-consuming, as it would require you to manually work out which lines in your source code every instruction originates from. You can of course help yourself with the disassembly, but we can do better.
You can automatically annotate every instruction with the originating source line using:
make annotate -j
This will produce a .s file from every .txt trace, in which the instructions from the .txt trace are interleaved with comments indicating which source lines those instructions correspond to.
Note
The annotate target uses the addr2line binutil behind the scenes, which needs debugging symbols to correlate instruction addresses with the originating source code lines. The DEBUG=ON flag you specified when building the software is necessary for this step to succeed.
Every performance metric is associated with a region in the trace. You can define regions by instrumenting your code with calls to the snrt_mcycle() function. Every call to this function defines two code regions:
- the code preceding the call, up to the previous snrt_mcycle() call or the start of the program
- the code following the call, up to the next snrt_mcycle() call or the end of the program
If you would like to benchmark a specific part of your program, you would call snrt_mcycle() before and after that part. Performance metrics, such as the IPC, will be extracted for that region separately from other regions.
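As a small, self-contained illustration of this pattern (the loop is just a stand-in for whatever code you actually want to benchmark):
#include "snrt.h"

int main() {
    snrt_mcycle();  // closes the startup region, opens the region of interest

    // Region of interest: this loop gets its own performance metrics
    volatile int acc = 0;
    for (int i = 0; i < 100; i++) acc += i;

    snrt_mcycle();  // closes the region of interest
    return 0;
}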
Sometimes you may want to graphically visualize the regions in your traces, to have a holistic and high-level view over all cores' operations. This can be useful e.g. to visualize if the compute and DMA phases in a double-buffered application overlap correctly and to what extent. To achieve this, you can use the following command, provided a file specifying the regions of interest (ROI) and associating a textual label to each region:
make visual-trace ROI_SPEC=../../sw/blas/axpy/roi.json -j
Where ROI_SPEC points to the mentioned specification file.
This command generates the logs/trace.json file, which you can graphically visualize in your browser. Go to http://ui.perfetto.dev/ and load the trace file. You can now graphically view the compute and DMA transfer regions in your code. If you click on a region, you will be able to see the performance metrics extracted for that region. Furthermore, you can also view the low-level traces of each core, with the individual instructions. Click on an instruction, and you will be able to see the originating source line information, the same you've seen generated by the annotate target.
Note
As mentioned also for the annotate target, the DEBUG=ON flag is required when building the software for the source line information to be extracted.
Info
If you want to dig deeper into the ROI specification file syntax and how the visual trace is built behind the scenes, have a look at the documentation for the roi.py and visualize.py scripts, or at the sources themselves, hosted in the bench folder.
Developing your first Snitch application
In the following you will create your own AXPY kernel implementation as an example of how to develop software for Snitch.
Writing the C code
Create a directory for your AXPY kernel:
mkdir sw/apps/tutorial
And a src subdirectory to host your source code:
mkdir sw/apps/tutorial/src
Here, create a new file named tutorial.c with the following contents:
#include "snrt.h"
#include "data.h"
// Define your kernel
void axpy(uint32_t l, double a, double *x, double *y, double *z) {
int core_idx = snrt_cluster_core_idx();
int offset = core_idx * l;
for (int i = 0; i < l; i++) {
z[offset] = a * x[offset] + y[offset];
offset++;
}
snrt_fpu_fence();
}
int main() {
// Read the mcycle CSR (this is our way to mark/delimit a specific code region for benchmarking)
uint32_t start_cycle = snrt_mcycle();
// DM core does not participate in the computation
if(snrt_is_compute_core())
axpy(L / snrt_cluster_compute_core_num(), a, x, y, z);
// Read the mcycle CSR
uint32_t end_cycle = snrt_mcycle();
}
The snrt.h file implements the snRuntime API, a library of convenience functions to program Snitch-cluster-based systems, and it is automatically referenced by our compilation scripts. Documentation for the snRuntime can be found at the Snitch Runtime pages.
Note
The snRuntime sources only define the snRuntime API and provide a base implementation for a subset of functions. A complete implementation of the snRuntime for RTL simulation can be found under target/snitch_cluster/sw/runtime/rtl.
We will instead have to create the data.h file ourselves. Create a folder to host the data for your kernel to operate on:
mkdir sw/apps/tutorial/data
Here, create a C file named data.h with the following contents:
uint32_t L = 16;
double a = 2;
double x[16] = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15};
double y[16] = {1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1};
double z[16];
In this file we hardcode the data to be used by the kernel. This data will be loaded in memory together with your application code.
Compiling the C code
In your tutorial folder, create a new file named app.mk with the following contents:
APP := tutorial
$(APP)_BUILD_DIR := $(ROOT)/target/snitch_cluster/sw/apps/$(APP)/build
SRCS := $(ROOT)/target/snitch_cluster/sw/apps/$(APP)/src/$(APP).c
$(APP)_INCDIRS := $(ROOT)/target/snitch_cluster/sw/apps/$(APP)/data
include $(ROOT)/target/snitch_cluster/sw/apps/common.mk
This file will be included in the top-level Makefile, compiling your source code into an executable with the name provided in the APP variable.
In order for the top-level Makefile to find your application, add your application's directory to the APPS variable in sw.mk:
APPS += sw/apps/tutorial
Now you can recompile the software, including your newly added tutorial application, as shown in section Building the software.
Note
Only the software targets depending on the sources you have added/modified have been recompiled.
Info
If you want to dig deeper into how our build system works and how these files were generated you can start from the top-level Makefile and work your way through the other Makefiles included within it.
Running your application
You can then run your application as shown in section Running a simulation. Make sure to pick up the right binary, i.e. sw/apps/tutorial/build/tutorial.elf.
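For example, with the Verilator model built earlier:
bin/snitch_cluster.vlt sw/apps/tutorial/build/tutorial.elf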
Generating input data
In general, you may want to randomly generate the data for your application. You may also want to test your kernel on different problem sizes, e.g. varying the length of the AXPY vectors, without having to manually rewrite the file.
The approach we use is to generate the header file with a Python script. An input .json file can be used to configure the data generation, e.g. to set the length of the AXPY vectors. Have a look at the datagen.py and params.json files in our full-fledged AXPY application as an example. As you can see, the data generation script reuses many convenience classes and functions from the data_utils module. We advise you to do the same. Documentation for this module can be found at the auto-generated pages.
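As a starting point, a minimal data generation script could look like the sketch below. It deliberately does not use data_utils, and the hardcoded vector length and output file name are illustrative only:
#!/usr/bin/env python3
# Generate a data.h header with random AXPY inputs.
import numpy as np

def emit_array(name, arr):
    values = ", ".join(str(v) for v in arr)
    return f"double {name}[{len(arr)}] = {{{values}}};\n"

def main(length=16, output="data.h"):
    x = np.random.rand(length)
    y = np.random.rand(length)
    with open(output, "w") as f:
        f.write(f"uint32_t L = {length};\n")
        f.write("double a = 2;\n")
        f.write(emit_array("x", x))
        f.write(emit_array("y", y))
        f.write(f"double z[{length}];\n")

if __name__ == "__main__":
    main()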
Verifying your application
When developing an application, it is good practice to verify the results of your application against a golden model. The traditional approach is to generate expected results in your data generation script, dump these into the header file and extend your application to check its results against the expected results, in simulation! Every cycle spent on verification is simulated, and this may take a significant time for large designs. We refer to this approach as the Built-in self-test (BIST) approach.
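For illustration, a BIST-style version of the tutorial application could end roughly as sketched below. It assumes the data generation step also emits a golden output array g[] into data.h (hypothetical here), uses snrt_cluster_hw_barrier() to wait for all compute cores, and assumes that a non-zero return value from main() is reported as a failure by the simulation:
#include "snrt.h"
#include "data.h"  // assumed to additionally define: double g[16]; (golden results)

int main() {
    // ... computation as in the tutorial kernel ...
    snrt_cluster_hw_barrier();  // wait until all compute cores have written their slice of z

    // Core 0 checks the results against the golden model, in simulation
    uint32_t errors = 0;
    if (snrt_cluster_core_idx() == 0) {
        for (uint32_t i = 0; i < L; i++) {
            if (z[i] != g[i]) errors++;  // in general, compare within a tolerance
        }
    }
    return errors;  // non-zero is assumed to flag a verification failure
}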
A better alternative is to read out the results from your application at the end of the simulation, and compare them outside of the simulation. You may have a look at our AXPY's verify.py script as an example. We can reuse this script to verify our application, by prepending it to the usual simulation command, as:
../../sw/blas/axpy/scripts/verify.py bin/snitch_cluster.vlt sw/apps/tutorial/build/tutorial.elf
You can test if the verification passed by checking that the exit code of the previous command is 0 (e.g. in a bash terminal):
echo $?
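In a script you would typically branch on the exit status directly, e.g. in bash:
if ../../sw/blas/axpy/scripts/verify.py bin/snitch_cluster.vlt sw/apps/tutorial/build/tutorial.elf; then
    echo "Verification passed"
else
    echo "Verification failed"
fi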
Again, most of the logic in the script is implemented in convenience classes and functions provided by the verif_utils module. Documentation for this module can be found at the auto-generated pages.
Info
The verif_utils functions build upon a complex verification infrastructure, which uses inter-process communication (IPC) between the Python process and the simulation process to get the results of your application at the end of the simulation. If you want to dig deeper into how this framework is implemented, have a look at the SnitchSim.py module and the IPC files within the test folder.
Code reuse
As you may have noticed, there is a good deal of code which is independent of the hardware platform we execute our AXPY kernel on. This is true for the data.h file and possible data generation scripts. The Snitch AXPY kernel itself is not specific to the Snitch cluster, but can be ported to any platform which provides an implementation of the snRuntime API. An example is Occamy, with its own testbench and SW development environment.
It is thus preferable to develop the data generation scripts and Snitch kernels in a shared location, from which multiple platforms can take and include the code. The sw directory in the root of this repository was created with this goal in mind. For the AXPY example, shared sources are hosted under the sw/blas/axpy directory.
We recommend that you also follow this approach in your own developments, for as much of the code as can be reused.