Benchmarking

Choosing the right platform

To start developing and benchmarking applications on Snitch, we advise to start developing on banshee and in a later step benchmark them on the RTL for cycle-accurate results. banshee is a good starting point to functionaly verify your application and to get a first impression of the performance. banshee can generate traces for you which allows to already have a rough estimate of the FPU utilization for instance. However, banshee does not model delays for instructions and memory accesses. Therefore, the cycle-accurate results are only possible with RTL simulations or on the FPGA.

Generating traces

To generate traces, spike-dasm must be installed and available in the PATH. Using the source from this repository supports disassembly of Snitch-custom instructions. We refer to the Quick Start to install spike-dasm.

traces are automatically generated if you run the simulation when running the following target for RTL simulations from the build folder:

make run-rtl-my_binary

respectively for banshee:

make run-banshee-my_binary

Alternatively you can also generate traces for the RTL by running the following target in the hw/system/my_platform folder:

make traces

RTL Traces

The traces will be stored in the logs folder. The traces are generated for each core. The trace for core 0 is stored in trace_hart_00000000.txt. The trace for core 1 is stored in trace_hart_00000001.txt and so on. A trace file contains a summary of few statistics for the specific core that is appended at the end of the trace file. The following example shows such a summary:

## Performance metrics

Performance metrics for section 0 @ (11, 3459):
snitch_loads                                    89
snitch_stores                                   89
fpss_loads                                       0
snitch_avg_load_latency                    22.9888
snitch_occupancy                            0.1334
snitch_fseq_rel_offloads                    0.0650
fseq_yield                                     1.0
fseq_fpu_yield                                 1.0
fpss_section_latency                             0
fpss_avg_fpu_latency                           1.0
fpss_avg_load_latency                            0
fpss_occupancy                              0.0093
fpss_fpu_occupancy                          0.0093
fpss_fpu_rel_occupancy                         1.0
cycles                                        3449
total_ipc                                   0.1427

The trace script also allows to split the execution into multiple sections. The sections are defined by reading from the mcycle CSR register. This register will return the current cycle count, but also serves as a trigger for the trace script, to start a new section. The following example shows how to split the execution into two sections:

#include "sw/vendor/riscv-opcodes/encoding.h"
size_t benchmark_get_cycle() { return read_csr(mcycle); }

// End of section 0, Start of section 1
benchmark_get_cycle();

// Execute kernel to be benchmarked
my_kernel();

// End of section 1, Start of section 2
benchmark_get_cycle();