ara: Top-Level Vector Unit

Overview

The ara module is the top-level vector processing unit that implements the RISC-V Vector 1.0 Extension (RVV). It interfaces directly with the CVA6 scalar core and contains all vector-specific sub-units required to execute RVV instructions, including:

  • A dispatcher (decodes vector instructions, keeps the vector CSR state, injects special micro-operations)

  • A sequencer (controls instruction issue to the units, enforces instruction dependencies, collects results)

  • Lanes (each lane contains a slice of the vector register file plus the arithmetic units)

  • Vector Load/Store Unit (VLSU, initiates AXI AR/AW transactions and handles the dataflow from/to memory)

  • Slide Unit (SLDU, runs vector slides and byte layout reshuffling)

  • Mask Unit (MASKU, assembles and distributes mask (predication) bits, executes bit-level mask instructions, and runs more complex bit-level permutations plus vrgather)

The design is modular and scalable, with configurable parameters for lane count, VLEN, data types, and extended features like segmentation or MMU support.

The most stable configurations use a power-of-two number of lanes from 2 to 16, with VLEN chosen so that each lane holds 1024 bits of every vector register (VLEN = 1024 × NrLanes bits). For example, a 4-lane instance uses VLEN = 4096 bits.
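
The VLEN-per-lane rule above can be written out as a small calculation. This is only a sketch: the helper name `vlen_for_lanes` and the constant name are hypothetical, and the range check mirrors the "most stable" lane counts quoted above rather than any limit enforced by the RTL.

```python
# Illustrative only: derives VLEN from the lane count, assuming the
# 1024 bit/lane ratio quoted in the text. Names are hypothetical.
VLEN_PER_LANE_BITS = 1024


def vlen_for_lanes(nr_lanes: int) -> int:
    """Return VLEN in bits for a power-of-two lane count between 2 and 16."""
    is_pow2 = nr_lanes > 0 and (nr_lanes & (nr_lanes - 1)) == 0
    if not is_pow2 or not 2 <= nr_lanes <= 16:
        raise ValueError(
            f"lane count {nr_lanes} is outside the range described as most stable"
        )
    return nr_lanes * VLEN_PER_LANE_BITS


if __name__ == "__main__":
    for lanes in (2, 4, 8, 16):
        print(f"NrLanes={lanes:2d} -> VLEN={vlen_for_lanes(lanes)} bits")
```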


Parameters

Name                                    Description
--------------------------------------  --------------------------------------
NrLanes                                 Number of parallel vector lanes
VLEN                                    Vector length (in bits)
OSSupport                               Enables MMU and fault-only-first logic
FPUSupport                              Enables FP16/32/64 support
FPExtSupport                            Enables vfrec7, vfrsqrt7
FixPtSupport                            Enables fixed-point support
SegSupport                              Enables segmented memory operations
CVA6Cfg                                 CVA6 configuration record
Axi*Width                               AXI bus widths
axi_*                                   AXI channel and bundle typedefs
exception_t, accelerator_*, acc_mmu_*   Types for accelerator/MMU interfacing


Ports

Port                    Dir     Description
----------------------  ------  ----------------------------------------
clk_i, rst_ni           In      Clock and reset
scan_*                  In/Out  Scan chain (test)
acc_req_i, acc_resp_o   In/Out  Vector accelerator interface (with CVA6)
axi_req_o, axi_resp_i   Out/In  AXI memory interface


Internal Units

1. Dispatcher (ara_dispatcher)

  • Fully decodes RVV instructions

  • Keeps the vector CSR state

  • Handles register reshuffling for element-width (EW) mismatches

  • Injects special vector micro-ops (e.g., reshuffle slides, segment memory micro-ops)

  • Tracks active instructions and completion

2. Sequencer (ara_sequencer)

  • Manages vector instruction lifecycle

  • Tracks instruction dependencies with a scoreboard

  • Handles scalar responses and exception tracking

  • Broadcasts work to lanes and special units
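
The dependency tracking mentioned above can be illustrated with a textbook scoreboard. The sketch below is not Ara's implementation: the `VInsn` and `Scoreboard` classes are hypothetical, and the RAW/WAR/WAW classification is the generic one, shown only to make the idea of tracking register dependencies between in-flight instructions concrete.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class VInsn:
    """A vector instruction reduced to its register reads and writes."""
    name: str
    dst: Optional[int]
    srcs: Tuple[int, ...] = ()


@dataclass
class Scoreboard:
    """Hypothetical scoreboard: remembers in-flight instructions."""
    in_flight: List[VInsn] = field(default_factory=list)

    def hazards(self, insn: VInsn) -> List[Tuple[str, str]]:
        """List (hazard kind, older instruction name) pairs for insn."""
        found = []
        for older in self.in_flight:
            if older.dst is not None and older.dst in insn.srcs:
                found.append(("RAW", older.name))  # reads a pending write
            if insn.dst is not None and insn.dst in older.srcs:
                found.append(("WAR", older.name))  # overwrites a pending read
            if insn.dst is not None and insn.dst == older.dst:
                found.append(("WAW", older.name))  # overwrites a pending write
        return found

    def issue(self, insn: VInsn) -> List[Tuple[str, str]]:
        """Record insn as in flight and return the hazards it must respect."""
        deps = self.hazards(insn)
        self.in_flight.append(insn)
        return deps

    def retire(self, name: str) -> None:
        """Drop a completed instruction from the scoreboard."""
        self.in_flight = [i for i in self.in_flight if i.name != name]


if __name__ == "__main__":
    sb = Scoreboard()
    print(sb.issue(VInsn("vle", dst=1)))
    print(sb.issue(VInsn("vadd", dst=3, srcs=(1, 2))))
    sb.retire("vle")
    print(sb.issue(VInsn("vmul", dst=4, srcs=(1, 3))))
```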

3. Vector Lanes (lane)

  • Contains a slice of the Vector Register File (VRF) and multiple functional units (FPU, ALU, Multiplier)

  • Executes arithmetic and logic vector operations

  • Feeds Ara’s units through the VRF, and receives memory operands from the load unit
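
To give a feel for what "a slice of the Vector Register File" means in numbers, the sketch below computes the VRF storage per lane and maps element indices to lanes. The 32 architectural vector registers come from the RVV specification. The round-robin element-to-lane mapping is an assumption made for illustration and is not stated in this document; both function names are hypothetical.

```python
NR_VREGS = 32  # RVV defines 32 architectural vector registers (v0..v31)


def vrf_bytes_per_lane(vlen_bits: int, nr_lanes: int) -> int:
    """Bytes of VRF storage per lane, splitting every register evenly."""
    total_bytes = (vlen_bits // 8) * NR_VREGS
    return total_bytes // nr_lanes


def lane_of_element(idx: int, nr_lanes: int) -> int:
    """ASSUMPTION: elements are interleaved round-robin across lanes."""
    return idx % nr_lanes


if __name__ == "__main__":
    print(vrf_bytes_per_lane(vlen_bits=4096, nr_lanes=4), "bytes of VRF per lane")
    print([lane_of_element(i, 4) for i in range(8)])
```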

4. Vector Load/Store Unit (vlsu)

  • Handles vector memory access

  • Interfaces with CVA6’s MMU (virtual address translation through a dedicated interface) and with memory (through AXI4)

  • Manages address generation and exception detection for memory operations
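
One constraint any AXI address generator has to respect is that a burst must not cross a 4 KiB address boundary. The sketch below shows that splitting for a contiguous byte range. It is a simplified illustration, not Ara's address generator: the function name is hypothetical, and the per-burst beat limit of AXI4 is ignored.

```python
AXI_BOUNDARY_BYTES = 4096  # AXI bursts must not cross a 4 KiB boundary


def split_at_4k(addr: int, nbytes: int) -> list:
    """Split [addr, addr + nbytes) into (start, length) pieces that each
    stay inside one 4 KiB-aligned region."""
    segments = []
    while nbytes > 0:
        room = AXI_BOUNDARY_BYTES - (addr % AXI_BOUNDARY_BYTES)
        chunk = min(room, nbytes)
        segments.append((addr, chunk))
        addr += chunk
        nbytes -= chunk
    return segments


if __name__ == "__main__":
    # A 64-byte unit-stride access that starts 16 bytes below a boundary.
    for start, length in split_at_4k(0x0FF0, 64):
        print(f"start=0x{start:04X} length={length}")
```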

5. Slide Unit (sldu)

  • Performs byte-reshuffling and slide ops

  • Optimized for power-of-two strides

  • All-to-all lane connectivity
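
The need for all-to-all connectivity can be seen from a reference model of a slide. The sketch below models an unmasked vslidedown with vl equal to the vector length, then shows which source lane feeds each destination lane. The round-robin element-to-lane mapping is an assumption made for illustration, and the function names are hypothetical.

```python
def vslidedown(vs2: list, offset: int) -> list:
    """Reference semantics (unmasked, vl = len(vs2)):
    vd[i] = vs2[i + offset] if that index exists, else 0."""
    n = len(vs2)
    return [vs2[i + offset] if i + offset < n else 0 for i in range(n)]


def source_lanes(nr_lanes: int, offset: int) -> dict:
    """Map destination lane -> source lane for a slide by `offset`,
    ASSUMING element i lives in lane i % nr_lanes."""
    return {dst: (dst + offset) % nr_lanes for dst in range(nr_lanes)}


if __name__ == "__main__":
    print(vslidedown([10, 11, 12, 13], 1))
    for off in range(4):
        print(f"offset={off}: {source_lanes(4, off)}")
```

Across the different offsets, every destination lane ends up reading from every source lane, which is why a unit of this kind needs a path from each lane to each other lane.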

6. Mask Unit (masku)

  • Centralized logic for mask generation and bit-level access

  • Handles bit-level mask instructions that produce scalar results (e.g., vfirst, vcpop)

  • Shares scalar result lines with the sequencer
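
For reference, the two mask instructions named above reduce a mask register to a single scalar, which is why the unit needs a path back to the scalar result lines. The sketch below models their unmasked behaviour as defined by the RVV specification. It is a functional model only and says nothing about how the MASKU computes these values; the Python function names are hypothetical.

```python
def vcpop(mask_bits: list, vl: int) -> int:
    """Count the set mask bits among the first vl elements."""
    return sum(1 for bit in mask_bits[:vl] if bit)


def vfirst(mask_bits: list, vl: int) -> int:
    """Index of the first set mask bit among the first vl elements,
    or -1 if none is set."""
    for idx, bit in enumerate(mask_bits[:vl]):
        if bit:
            return idx
    return -1


if __name__ == "__main__":
    mask = [0, 0, 1, 0, 1, 1, 0, 0]
    print(vcpop(mask, vl=8), vfirst(mask, vl=8))
```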


MMU Support

When OSSupport = 1, the module:

  • Interfaces with CVA6’s MMU (Sv39 virtual memory)

  • Performs virtual-to-physical address translation

  • Supports fault-only-first loads and address exceptions
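
The fault-only-first behaviour listed above follows the RVV specification: a fault on element 0 traps as usual, while a fault on a later element does not trap and instead shortens vl to the index of the faulting element. A minimal functional sketch, with a hypothetical function name and no claim about how Ara implements it:

```python
from typing import Optional, Tuple


def fault_only_first(vl: int, fault_index: Optional[int]) -> Tuple[int, bool]:
    """Return (new_vl, trap_taken) for a fault-only-first load.

    fault_index is the index of the first element whose access faults,
    or None if no element faults.
    """
    if fault_index is None or fault_index >= vl:
        return vl, False          # no fault among the active elements
    if fault_index == 0:
        return vl, True           # element 0 faults: take the trap
    return fault_index, False     # later fault: trim vl, no trap


if __name__ == "__main__":
    print(fault_only_first(vl=16, fault_index=None))
    print(fault_only_first(vl=16, fault_index=0))
    print(fault_only_first(vl=16, fault_index=5))
```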


Interconnect & Communication

The sequencer broadcasts instructions to all units in parallel (VLSU, SLDU, MASKU, and the lanes). In addition, each all-to-all unit (VLSU, SLDU, MASKU) is connected to every lane in parallel.

Each lane has a 64-bit datapath, and every all-to-all unit receives at least one 64-bit bus from every lane.
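
The 64-bit-per-lane figure gives a rough bound on how fast a unit can stream a vector register. The sketch below computes the aggregate bus width and the minimum number of cycles needed to move one full register, assuming exactly one 64-bit bus per lane that carries data every cycle. Both the function names and that idealized assumption are introduced here for illustration.

```python
LANE_BUS_BITS = 64  # per-lane bus width quoted in the text


def aggregate_bus_bits(nr_lanes: int) -> int:
    """Total bits per cycle a unit sees with one 64-bit bus per lane."""
    return nr_lanes * LANE_BUS_BITS


def min_cycles_per_vreg(vlen_bits: int, nr_lanes: int) -> int:
    """Lower bound on cycles to transfer one vector register (ceil division)."""
    per_cycle = aggregate_bus_bits(nr_lanes)
    return -(-vlen_bits // per_cycle)


if __name__ == "__main__":
    for lanes in (2, 4, 8, 16):
        vlen = 1024 * lanes  # 1024 bit/lane configuration
        print(lanes, aggregate_bus_bits(lanes), min_cycles_per_vreg(vlen, lanes))
```

With 1024 bits per lane, the bound is 16 cycles per register regardless of lane count, since both VLEN and the aggregate bus width scale with the number of lanes.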