valu - Instantiate the in-lane SIMD ALU (unpipelined)
Overview
The valu module is a central component of Ara’s vector processing pipeline. It acts as a SIMD (Single Instruction, Multiple Data) integer Arithmetic Logic Unit, capable of executing vector instructions over 64-bit wide data lanes. Its primary role is to execute integer operations across multiple vector lanes in parallel and to manage operations including fixed-point arithmetic, scalar replication, vector reductions, narrowing operations, and interaction with the mask unit and slide unit.
This documentation serves as a golden reference, explaining every functional aspect of the valu module in detail.
Module Parameters
parameter int unsigned NrLanes;
parameter int unsigned VLEN;
parameter fixpt_support_e FixPtSupport;
parameter type vaddr_t;
parameter type vfu_operation_t;
NrLanes: Number of vector lanes.
VLEN: Vector register length.
FixPtSupport: Enable or disable fixed-point support.
vaddr_t, vfu_operation_t: Type definitions for vector addresses and operations.
Top-Level Interface
The valu module interfaces with several components:
Dispatcher: Provides new vector operations (vfu_operation_i) and receives vxsat_flag_o.
Lane sequencer: Coordinates the execution flow.
Operand queues: Deliver source operands (alu_operand_i).
VRF: Accepts results for writeback.
Slide Unit: Exchanges operands and results for inter-lane reduction.
Mask Unit: Manages masking for selective operations.
Vector Instruction Queue
The queue tracks in-flight instructions and separates their execution phases:
accept_pnt: Points to the next free entry for acceptance.
issue_pnt: Instruction to be executed next.
commit_pnt: Instruction whose results are being written back.
Each instruction is described by a vfu_operation_t struct and includes the vector operation, destination ID, and control fields like vm and vsew.
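The three-pointer scheme can be illustrated with a small behavioural model. This is a Python sketch, not the RTL: the class name VInsnQueue, the queue depth, the two pending counters, and the dictionary used in place of vfu_operation_t are assumptions made for illustration. Only the pointer names come from the description above.

```python
class VInsnQueue:
    """Toy circular instruction queue with accept/issue/commit pointers."""

    def __init__(self, depth=4):      # depth is a hypothetical value
        self.depth = depth
        self.slots = [None] * depth
        self.accept_pnt = 0           # next free entry
        self.issue_pnt = 0            # next instruction to execute
        self.commit_pnt = 0           # instruction being written back
        self.pending_issue = 0        # accepted, not yet issued
        self.pending_commit = 0       # accepted, not yet committed

    def accept(self, insn):
        """Store a new instruction; refuse it when the queue is full."""
        if self.pending_commit == self.depth:
            return False
        self.slots[self.accept_pnt] = insn
        self.accept_pnt = (self.accept_pnt + 1) % self.depth
        self.pending_issue += 1
        self.pending_commit += 1
        return True

    def issue(self):
        """Hand out the next instruction for execution, if any."""
        if self.pending_issue == 0:
            return None
        insn = self.slots[self.issue_pnt]
        self.issue_pnt = (self.issue_pnt + 1) % self.depth
        self.pending_issue -= 1
        return insn

    def commit(self):
        """Retire the oldest instruction, but only once it has been issued."""
        if self.pending_commit == self.pending_issue:
            return None
        insn = self.slots[self.commit_pnt]
        self.slots[self.commit_pnt] = None
        self.commit_pnt = (self.commit_pnt + 1) % self.depth
        self.pending_commit -= 1
        return insn
```

Keeping separate pointers lets a new instruction be accepted while an older one is still executing or committing, which is the separation of phases described above.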
Result Queue
This FIFO queue temporarily stores computation results, including:
wdata: Resulting word.
id: Instruction ID.
addr: Target vector address.
be: Byte enable.
mask: Whether this result is to be forwarded to the mask unit.
Two pointers (write_pnt, read_pnt) and a count (result_queue_cnt_q) track the queue status.
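As a rough illustration of how these fields fit together, the Python sketch below models the result queue as a circular buffer and applies the byte enable when an entry is written back. The class ResultQueue, the helper merge_bytes, the depth of 2, the grant argument, and the dictionary standing in for the VRF are all assumptions for illustration and are not taken from the RTL.

```python
MASK64 = (1 << 64) - 1

def merge_bytes(old, new, be):
    """Overwrite byte i of `old` with byte i of `new` when bit i of `be` is set."""
    out = old
    for i in range(8):
        if (be >> i) & 1:
            byte_mask = 0xFF << (8 * i)
            out = (out & ~byte_mask) | (new & byte_mask)
    return out & MASK64

class ResultQueue:
    """Toy FIFO tracked by write_pnt, read_pnt and an occupancy count."""

    def __init__(self, depth=2):      # depth is a hypothetical value
        self.depth = depth
        self.entries = [None] * depth
        self.write_pnt = 0
        self.read_pnt = 0
        self.cnt = 0

    def push(self, wdata, insn_id, addr, be, mask=False):
        """Store one result word; refuse it when the queue is full."""
        if self.cnt == self.depth:
            return False
        self.entries[self.write_pnt] = dict(
            wdata=wdata, id=insn_id, addr=addr, be=be, mask=mask)
        self.write_pnt = (self.write_pnt + 1) % self.depth
        self.cnt += 1
        return True

    def pop(self, vrf, grant):
        """Retire the head entry only when the consumer grants the request."""
        if self.cnt == 0 or not grant:
            return None
        entry = self.entries[self.read_pnt]
        if not entry["mask"]:
            # Regular results update only the enabled bytes of the VRF word.
            old = vrf.get(entry["addr"], 0)
            vrf[entry["addr"]] = merge_bytes(old, entry["wdata"], entry["be"])
        self.read_pnt = (self.read_pnt + 1) % self.depth
        self.cnt -= 1
        return entry
```

Entries flagged with mask are returned without touching the VRF model, mirroring the idea that such results are forwarded to the mask unit instead.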
Scalar Operand Processing
Scalar values from instructions are replicated to 64-bit words depending on vsew. For instance:
EW8: replicated 8×
EW16: replicated 4×
EW32: replicated 2×
EW64: used as is (1×)
This ensures compatibility with SIMD-wide operations.
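The replication itself is simple bit packing. The following Python sketch shows the idea; the function name replicate_scalar and its integer sew_bits argument are illustrative choices, not names from the module.

```python
def replicate_scalar(scalar, sew_bits):
    """Replicate the low `sew_bits` of `scalar` across a 64-bit word."""
    if sew_bits not in (8, 16, 32, 64):
        raise ValueError("unsupported element width")
    elem = scalar & ((1 << sew_bits) - 1)   # truncate to the element width
    word = 0
    for i in range(64 // sew_bits):         # 8, 4, 2 or 1 copies
        word |= elem << (i * sew_bits)
    return word
```

For example, replicate_scalar(0xAB, 8) yields 0xABABABABABABABAB, so one 64-bit operand carries the same scalar for every element processed in that word.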
Mask Operand Handling
When vector masking is enabled (vm == 0), a spill_register sends result data to the mask unit, filtered via the mask_operand_ready_i signal.
Reduction Support
Reduction operations are divided into:
Intra-lane: Sequential accumulation within a single lane.
Inter-lane: Exchange of partial sums between lanes via the Slide Unit.
The ALU manages:
Internal counters (reduction_rx_cnt_q)
Inter-lane state transitions (e.g., INTER_LANES_REDUCTION_TX)
Final SIMD reduction in lane 0
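The two reduction phases can be sketched in Python as follows. This is a simplified model for a sum reduction under stated assumptions: the inter-lane phase is modelled as a binary tree in which a transmitting lane hands its partial sum to a receiving lane, which is one plausible exchange pattern and is not confirmed by the text above. The final SIMD reduction within lane 0's 64-bit word is folded into the per-lane sum for brevity. The function name reduce_sum is hypothetical.

```python
def reduce_sum(lane_elems, sew_bits=64):
    """Sum-reduce per-lane element lists; the result ends up in lane 0."""
    mask = (1 << sew_bits) - 1

    # Intra-lane phase: each lane accumulates its own elements sequentially.
    partial = [sum(elems) & mask for elems in lane_elems]

    # Inter-lane phase (assumed binary tree): at each step, lane rx + stride
    # transmits its partial sum and lane rx receives and accumulates it.
    n = len(partial)
    stride = 1
    while stride < n:
        for rx in range(0, n, 2 * stride):
            tx = rx + stride
            if tx < n:
                partial[rx] = (partial[rx] + partial[tx]) & mask
        stride *= 2

    return partial[0]
```

With four lanes and elements interleaved across lanes (an assumption about the element mapping), `reduce_sum([list(range(1, 17))[l::4] for l in range(4)])` gives the same result as summing the 16 elements directly.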
Narrowing Instructions
Instructions such as VNSRA and VNCLIP produce results half as wide as their source elements, so each cycle fills only half of a result word. The module uses a toggle (narrowing_select_q) to alternate between writing the high and low halves.
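A behavioural Python sketch of this packing is shown below. It assumes the low half is written first, which the text does not state, and it uses a VNSRA-like arithmetic right shift as the example operation. The names vnsra_word and NarrowingPacker are hypothetical.

```python
def vnsra_word(src_word, shamt, wide_bits=32):
    """Narrow every `wide_bits` element of a 64-bit word to half its width
    using an arithmetic right shift. The result occupies 32 bits."""
    narrow_bits = wide_bits // 2
    out = 0
    for i in range(64 // wide_bits):
        elem = (src_word >> (i * wide_bits)) & ((1 << wide_bits) - 1)
        if elem >> (wide_bits - 1):          # sign-extend the wide element
            elem -= 1 << wide_bits
        res = (elem >> (shamt & (wide_bits - 1))) & ((1 << narrow_bits) - 1)
        out |= res << (i * narrow_bits)
    return out

class NarrowingPacker:
    """Collect two 32-bit half results into one 64-bit result word."""

    def __init__(self):
        self.narrowing_select = 0   # 0: fill low half next, 1: fill high half
        self.word = 0

    def push_half(self, half32):
        half32 &= 0xFFFFFFFF
        if self.narrowing_select == 0:
            self.word = half32          # low half (assumed to come first)
            self.narrowing_select = 1
            return None                 # word not complete yet
        full = self.word | (half32 << 32)
        self.narrowing_select = 0
        self.word = 0
        return full
```

Two source words are therefore consumed for every complete result word, and the toggle records which half the next partial result belongs to.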
SIMD ALU Execution
The core computation is handled by simd_alu, parameterized by fixed-point support and rounding. It takes:
alu_operand_a and alu_operand_b
The operation (op_i)
The mask (mask_i)
A rounding modifier (rm)
Results are fed to the result queue and selectively masked.
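To make the SIMD behaviour concrete, the Python sketch below models a packed addition over one 64-bit word. It covers a single operation only, and the way masking is applied here (inactive elements keep an old value passed in explicitly) is a simplification chosen for illustration. The function name simd_add and the parameters mask_bits and old are hypothetical.

```python
def simd_add(a, b, sew_bits, mask_bits=None, old=0):
    """Element-wise add of two 64-bit words split into `sew_bits` elements.

    mask_bits: one bit per element (bit i enables element i), or None for
    an unmasked operation. Disabled elements are taken from `old`.
    """
    elem_mask = (1 << sew_bits) - 1
    out = 0
    for i in range(64 // sew_bits):
        shift = i * sew_bits
        if mask_bits is None or (mask_bits >> i) & 1:
            ea = (a >> shift) & elem_mask
            eb = (b >> shift) & elem_mask
            r = (ea + eb) & elem_mask       # carries never cross elements
        else:
            r = (old >> shift) & elem_mask  # element left untouched
        out |= r << shift
    return out
```

The same operand words give different results for different element widths: adding 0x0001000100010001 to 0x00FF00FF00FF00FF wraps each byte to zero at 8 bits, but produces 0x0100 in every element at 16 bits.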
Fixed-Point Rounding and Saturation
When enabled, rounding and saturation are implemented via fixed_p_rounding and internal saturation flags (alu_vxsat_q). The saturation flag vxsat_flag_o is updated during commit.
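For reference, the Python sketch below follows the rounding-increment rules that the RISC-V vector specification defines for the four vxrm modes (0: rnu, 1: rne, 2: rdn, 3: rod), together with a signed saturating add that reports saturation. It models the arithmetic only; how fixed_p_rounding and alu_vxsat_q realise it in hardware is not shown, and the function names are hypothetical.

```python
def round_increment(v, d, vxrm):
    """Rounding increment r for shifting `v` right by `d` bits."""
    if d == 0:
        return 0
    bit_d = (v >> d) & 1                     # v[d]
    bit_dm1 = (v >> (d - 1)) & 1             # v[d-1]
    low_dm2 = v & ((1 << (d - 1)) - 1)       # v[d-2:0]
    low_dm1 = v & ((1 << d) - 1)             # v[d-1:0]
    if vxrm == 0:                            # rnu: round to nearest, ties up
        return bit_dm1
    if vxrm == 1:                            # rne: round to nearest, ties even
        return bit_dm1 & int(low_dm2 != 0 or bit_d == 1)
    if vxrm == 2:                            # rdn: truncate
        return 0
    if vxrm == 3:                            # rod: round to odd
        return int(bit_d == 0 and low_dm1 != 0)
    raise ValueError("vxrm must be in 0..3")

def roundoff(v, d, vxrm):
    """Shift right by `d` and apply the rounding increment."""
    return (v >> d) + round_increment(v, d, vxrm)

def sat_add_signed(a, b, sew_bits):
    """Signed saturating add; returns (result, saturated)."""
    lo = -(1 << (sew_bits - 1))
    hi = (1 << (sew_bits - 1)) - 1
    s = a + b
    if s > hi:
        return hi, True
    if s < lo:
        return lo, True
    return s, False
```

In a model of a whole instruction, the saturated flags of all elements would be ORed together, since vxsat is a sticky flag.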
Control and State Machines
ALU State (alu_state_q)
Defines execution phase:
NO_REDUCTION
INTRA_LANE_REDUCTION
INTER_LANES_REDUCTION_TX
INTER_LANES_REDUCTION_RX
SIMD_REDUCTION
LN0_REDUCTION_COMMIT
The FSM governs transitions based on instruction type, operand readiness, and SLDU handshake.
Instruction Lifecycle
Acceptance: New vector instructions are stored in the queue.
Issue: Instructions are fetched and executed if operands are valid.
Execution: Results are generated (immediately or over multiple cycles).
Commit: Data is written to the vector register file or sent to the mask unit.
Commit and Writeback
Writeback to VRF is gated by alu_result_gnt_i.
Results for masking go through mask_operand_o.
Queue counters and state pointers are updated accordingly.