# Applications

This subdirectory contains applications and benchmarks specifically implemented and optimized for Snitch.
## Contents

- Data generation:
  - `data_gen.py`: script to generate the data and expected results for the various benchmarks
  - `data`: output folder of `data_gen.py`, which also contains the configuration used to generate the data
- `src`:
  - `kernels`: basic kernels; currently contains `GEMM`, `BatchNorm`, `Maxpool`, `Fusedconv`
  - `layers`: wraps the kernels to form DNN layers; manages data movement, synchronization, double buffering, etc.
  - `utils`: helper functions for benchmarking, verification, and a fast `memset`
  - `net_layer.c`: various ready-made tests to run the layers
- `include`: contains the `layer` struct
## SW Testbenches

There are currently a few tests for various layer types. Some additional information about these tests is given below:

- `net_maxpool.c`: Naive implementation of a max-pooling layer; not optimized in any way due to its memory-boundness.
- `net-batchnorm.c`: Implementation of a batchnorm layer with SSR streams (both read and write).
- `net-conv2d.c`: Implementation and tiling of a 2D convolution that can be distributed to multiple clusters. The convolution is implemented as an `im2col` transformation (performed by 2D DMA transfers) followed by an optimized GEMM. The memory layout of the input and output feature maps is Height x Width x Channels. The convolution is parallelized globally over output channels; inside a cluster, the output pixels are distributed among the cores. There is an option to load the feature map from a different cluster instead of main memory by setting `cluster2cluster` in the layer struct to `1`. Currently only `fp64` is implemented, but the data movement for `fp32` or lower-precision SIMD would be analogous.
- `net-gemm.c`: Testbench to benchmark the optimized GEMM implementation for different memory layouts, dimensions, and precisions.
- `net-fusedconv.c`: Implementation of a fused Conv2d + BatchNorm + ReLU kernel. The interface of the kernel is compatible with DORY. The parameters of a tile can be specified in `data/fusedconv_param.hjson`. Supported parameters are the input/output dimensions, padding, kernel dimensions & stride, and flags for BatchNorm and ReLU. Furthermore, there are two additional specialized kernels: 1) a CHW kernel for input layers with very few input channels, whose output is in the HWC layout again, and 2) a depthwise kernel.
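The `im2col` + GEMM formulation used by `net-conv2d.c` can be sketched in NumPy as a conceptual model (illustrative only; the actual kernel is written in C, uses 2D DMA transfers for the unfolding, and keeps the HWC layout described above):

```python
import numpy as np

def im2col(x, kh, kw):
    """Unfold an HWC feature map into patch rows so that the
    convolution becomes a single matrix multiplication (GEMM)."""
    h, w, c = x.shape
    oh, ow = h - kh + 1, w - kw + 1  # no padding, stride 1 for simplicity
    cols = np.empty((oh * ow, kh * kw * c))
    for i in range(oh):
        for j in range(ow):
            cols[i * ow + j] = x[i:i + kh, j:j + kw, :].ravel()
    return cols

# Weights flattened to (kh*kw*c_in, c_out); the output stays in HWC order.
x = np.random.rand(8, 8, 3)          # 8x8 input, 3 channels
w = np.random.rand(3 * 3 * 3, 16)    # 3x3 kernel, 16 output channels
out = im2col(x, 3, 3) @ w            # the GEMM performs the convolution
print(out.shape)                     # (36, 16): 6x6 output pixels, 16 channels
```

In the real implementation the rows of the unfolded matrix (the output pixels) are what gets distributed among the cores of a cluster.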
## Usage

To run a specific benchmark, first configure the dimensions and the desired precision in `data/app_params.hjson`:
```hjson
{
    kernel: "GEMM",
    M: 16,
    N: 16,
    K: 16,
    alpha: 0,
    transpose_A: false,
    transpose_B: true,
    prec: 16
}
```
The header file `data/data_app.h` is generated automatically by a CMake macro. The result is also checked against a reference: a golden model written in Python with the help of `torch`.
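For the GEMM configuration shown above, such a golden model could look like the following NumPy sketch (the actual reference lives in `data_gen.py` and uses `torch`; the function name and the `C_out = op(A) @ op(B) + alpha * C` convention are assumptions for illustration):

```python
import numpy as np

def golden_gemm(A, B, C, alpha=0, transpose_A=False, transpose_B=True):
    """Reference GEMM result, assuming op(A) @ op(B) + alpha * C."""
    opA = A.T if transpose_A else A
    opB = B.T if transpose_B else B
    return opA @ opB + alpha * C

M = N = K = 16
A = np.random.rand(M, K)
B = np.random.rand(N, K)   # stored transposed, matching transpose_B: true
C = np.zeros((M, N))
result = golden_gemm(A, B, C)  # with alpha = 0, C does not contribute
print(result.shape)            # (16, 16)
```

With `alpha: 0` as in the configuration above, the accumulator input is ignored and the check reduces to a plain matrix product.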
The applications are compiled into a folder which can be enabled by adding `add_subdirectory(${SNITCH_SOFTWARE_DIR}/applications)` to the `CMakeLists.txt` in the specific `sw` folder.
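In context, the addition to that `CMakeLists.txt` would look like this (a fragment only; the surrounding contents of the file are omitted):

```cmake
# Enable building the Snitch applications in this software tree
add_subdirectory(${SNITCH_SOFTWARE_DIR}/applications)
```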
## Requirements

- `torch`
Updated on 2023-06-19 at 09:43:56 +0000