valu - Instantiate the in-lane SIMD ALU (unpipelined)
Overview
The valu module is a central component of Ara’s vector processing pipeline. It acts as a SIMD (Single Instruction, Multiple Data) integer Arithmetic Logic Unit, capable of executing vector instructions over 64-bit wide data lanes. Its primary role is to execute integer operations across multiple vector lanes in parallel and to manage operations including fixed-point arithmetic, scalar replication, vector reductions, narrowing operations, and interaction with the mask unit and slide unit.
This documentation serves as a golden reference, explaining every functional aspect of the valu module in detail.
Table of Contents
Module Parameters
parameter int unsigned NrLanes;
parameter int unsigned VLEN;
parameter fixpt_support_e FixPtSupport;
parameter type vaddr_t;
parameter type vfu_operation_t;
NrLanes: Number of vector lanes.VLEN: Vector register length.FixPtSupport: Enable or disable fixed-point support.vaddr_t,vfu_operation_t: Type definitions for vector addresses and operations.
Top-Level Interface
The valu module interfaces with several components:
Dispatcher: Provides new vector operations (
vfu_operation_i) and receivesvxsat_flag_o.Lane sequencer: Coordinates the execution flow.
Operand queues: Deliver source operands (
alu_operand_i).VRF: Accepts results for writeback.
Slide Unit: Exchanges operands and results for inter-lane reduction.
Mask Unit: Manages masking for selective operations.
Vector Instruction Queue
The queue tracks in-flight instructions and separates their execution phases:
accept_pnt: Points to the next free entry for acceptance.issue_pnt: Instruction to be executed next.commit_pnt: Instruction whose results are being written back.
Each instruction is described by a vfu_operation_t struct and includes the vector operation, destination ID, and control fields like vm and vsew.
Result Queue
This FIFO queue temporarily stores computation results, including:
wdata: Resulting word.id: Instruction ID.addr: Target vector address.be: Byte enable.mask: Whether this result is to be forwarded to the mask unit.
Two pointers (write_pnt, read_pnt) and a count (result_queue_cnt_q) track the queue status.
Scalar Operand Processing
Scalar values from instructions are replicated to 64-bit words depending on vsew. For instance:
EW8: replicated 8×EW16: replicated 4×etc.
This ensures compatibility with SIMD-wide operations.
Mask Operand Handling
When vector masking is enabled (vm == 0), a spill_register sends result data to the mask unit, filtered via the mask_operand_ready_i signal.
Reduction Support
Reduction operations are divided into:
Intra-lane: Sequential accumulation within a single lane.
Inter-lane: Exchange of partial sums between lanes via the Slide Unit.
The ALU manages:
Internal counters (
reduction_rx_cnt_q)Inter-lane state transitions (e.g.,
INTER_LANES_REDUCTION_TX)Final SIMD reduction in lane 0
Narrowing Instructions
Instructions like VNSRA, VNCLIP produce only half the normal element width per cycle. The module uses a toggle (narrowing_select_q) to alternate writing high/low halves.
SIMD ALU Execution
The core computation is handled by simd_alu, parameterized by fixed-point support and rounding. It takes:
alu_operand_aandalu_operand_bThe operation (
op_i)The mask (
mask_i)A rounding modifier (
rm)
Results are fed to the result queue and selectively masked.
Fixed-Point Rounding and Saturation
When enabled, rounding and saturation are implemented via fixed_p_rounding and internal saturation flags (alu_vxsat_q). The saturation flag vxsat_flag_o is updated during commit.
Control and State Machines
ALU State (alu_state_q)
Defines execution phase:
NO_REDUCTIONINTRA_LANE_REDUCTIONINTER_LANES_REDUCTION_TXINTER_LANES_REDUCTION_RXSIMD_REDUCTIONLN0_REDUCTION_COMMIT
The FSM governs transitions based on instruction type, operand readiness, and SLDU handshake.
Instruction Lifecycle
Acceptance: New vector instructions are stored in the queue.
Issue: Instructions are fetched and executed if operands are valid.
Execution: Results are generated (immediately or over multiple cycles).
Commit: Data is written to the vector register file or sent to the mask unit.
Commit and Writeback
Writeback to VRF is gated by
alu_result_gnt_i.Results for masking go through
mask_operand_o.Queue counters and state pointers are updated accordingly.