`valu` - Instantiate the in-lane SIMD ALU (unpipelined)

Overview

The valu module is a central component of Ara’s vector processing pipeline. It is a SIMD (Single Instruction, Multiple Data) integer Arithmetic Logic Unit that operates on a 64-bit-wide lane datapath. Each lane instantiates its own valu, so integer operations execute across all lanes in parallel. Beyond basic integer arithmetic, the module handles fixed-point arithmetic, scalar replication, vector reductions, and narrowing operations, and it interacts with the mask unit and the slide unit.

This document describes the functional behavior of the valu module, one block at a time.

Module Parameters

parameter int unsigned NrLanes;
parameter int unsigned VLEN;
parameter fixpt_support_e FixPtSupport;
parameter type vaddr_t;
parameter type vfu_operation_t;

NrLanes: Number of vector lanes.
VLEN: Vector register length, in bits.
FixPtSupport: Enable or disable fixed-point support.
vaddr_t, vfu_operation_t: Type definitions for vector addresses and operations.

Top-Level Interface

The valu module interfaces with several components:

Dispatcher: Provides new vector operations (vfu_operation_i) and receives vxsat_flag_o.
Lane sequencer: Coordinates the execution flow.
Operand queues: Deliver source operands (alu_operand_i).
VRF: Accepts results for writeback.
Slide Unit: Exchanges operands and results for inter-lane reduction.
Mask Unit: Manages masking for selective operations.

Vector Instruction Queue

The queue tracks in-flight instructions and separates their execution phases:

accept_pnt: Points to the next free entry for acceptance.
issue_pnt: Points to the instruction to be executed next.
commit_pnt: Points to the instruction whose results are being written back.

Each instruction is described by a vfu_operation_t struct and includes the vector operation, destination ID, and control fields like vm and vsew.
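The three pointers can be illustrated with a small software model. This is a minimal sketch in Python, not the RTL: the class name VInsnQueue, the depth argument, and the two counters are assumptions introduced only for illustration.

```python
class VInsnQueue:
    """Toy software model of a vector instruction queue (not the RTL)."""

    def __init__(self, depth=4):  # depth is a hypothetical parameter
        self.depth = depth
        self.slots = [None] * depth
        self.accept_pnt = 0   # next free entry
        self.issue_pnt = 0    # instruction to be executed next
        self.commit_pnt = 0   # instruction whose results are being written back
        self.issue_cnt = 0    # accepted, not yet fully issued (assumed counter)
        self.commit_cnt = 0   # accepted, not yet fully committed (assumed counter)

    def accept(self, insn):
        """Store a new instruction; return False if the queue is full."""
        if self.commit_cnt == self.depth:
            return False
        self.slots[self.accept_pnt] = insn
        self.accept_pnt = (self.accept_pnt + 1) % self.depth
        self.issue_cnt += 1
        self.commit_cnt += 1
        return True

    def issue_done(self):
        """Mark the instruction at issue_pnt as fully issued."""
        if self.issue_cnt == 0:
            return None
        insn = self.slots[self.issue_pnt]
        self.issue_pnt = (self.issue_pnt + 1) % self.depth
        self.issue_cnt -= 1
        return insn

    def commit_done(self):
        """Retire the instruction at commit_pnt, if it has finished issuing."""
        if self.commit_cnt == self.issue_cnt:
            return None
        insn = self.slots[self.commit_pnt]
        self.commit_pnt = (self.commit_pnt + 1) % self.depth
        self.commit_cnt -= 1
        return insn
```

In this model an entry is freed only at commit, so an instruction can still be in its commit phase while the next one is already issuing.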

Result Queue

This FIFO queue temporarily stores computation results, including:

wdata: Resulting word.
id: Instruction ID.
addr: Target vector address.
be: Byte enable.
mask: Whether this result is to be forwarded to the mask unit.

Two pointers (write_pnt, read_pnt) and a count (result_queue_cnt_q) track the queue status.
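A minimal Python sketch of such a queue follows. The field names mirror the list above; the class name, the depth of 2, and the push/pop methods are assumptions made for illustration and do not describe the actual hardware implementation.

```python
from collections import namedtuple

# One result-queue entry, using the fields listed above.
ResultEntry = namedtuple("ResultEntry", "wdata id addr be mask")


class ResultQueue:
    """Toy ring buffer tracked by two pointers and an occupancy count."""

    def __init__(self, depth=2):  # depth is a hypothetical value
        self.depth = depth
        self.entries = [None] * depth
        self.write_pnt = 0
        self.read_pnt = 0
        self.cnt = 0  # plays the role of result_queue_cnt_q

    def push(self, entry):
        """Write one result; return False if the queue is full."""
        if self.cnt == self.depth:
            return False
        self.entries[self.write_pnt] = entry
        self.write_pnt = (self.write_pnt + 1) % self.depth
        self.cnt += 1
        return True

    def pop(self):
        """Read the oldest result; return None if the queue is empty."""
        if self.cnt == 0:
            return None
        entry = self.entries[self.read_pnt]
        self.read_pnt = (self.read_pnt + 1) % self.depth
        self.cnt -= 1
        return entry
```

Keeping an explicit count makes it possible to tell a full queue from an empty one even when both pointers hold the same value.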

Scalar Operand Processing

Scalar values from instructions are replicated to 64-bit words depending on vsew. For instance:

EW8: replicated 8×
EW16: replicated 4×
EW32: replicated 2×
EW64: used as is (1×)

This ensures compatibility with SIMD-wide operations.
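The replication can be sketched in a few lines of Python. The function name replicate_scalar and the use of an element width in bits in place of the vsew encoding are assumptions made for this example.

```python
def replicate_scalar(scalar, sew_bits):
    """Replicate the low sew_bits of scalar across a 64-bit word."""
    if sew_bits not in (8, 16, 32, 64):
        raise ValueError("unsupported element width")
    elem = scalar & ((1 << sew_bits) - 1)  # keep one element's worth of bits
    word = 0
    for i in range(64 // sew_bits):        # 8, 4, 2 or 1 copies
        word |= elem << (i * sew_bits)
    return word
```

For example, replicate_scalar(0xAB, 8) yields 0xABABABABABABABAB, so the same 64-bit SIMD datapath can be used for vector-scalar and vector-vector operations.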

Mask Operand Handling

When vector masking is enabled (vm == 0), a spill_register sends result data to the mask unit. The transfer is gated by the mask_operand_ready_i signal.

Reduction Support

Reduction operations are divided into:

Intra-lane: Sequential accumulation within a single lane.
Inter-lane: Exchange of partial sums between lanes via the Slide Unit.

The ALU manages:

Internal counters (reduction_rx_cnt_q)
Inter-lane state transitions (e.g., INTER_LANES_REDUCTION_TX)
Final SIMD reduction in lane 0
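The three phases can be sketched for a sum reduction as follows. This is an illustrative Python model under stated assumptions: the function name reduce_sum is hypothetical, and the pairwise (tree-shaped) exchange between lanes is an assumed pattern, since the exact sequence of transfers through the Slide Unit is not specified here.

```python
def reduce_sum(lanes, sew_bits=32):
    """Sum all elements; lanes is a list of per-lane element lists."""
    mask = (1 << sew_bits) - 1
    epw = 64 // sew_bits  # elements packed in one 64-bit word

    # Intra-lane: each lane keeps one running sum per SIMD slot.
    acc = []
    for elems in lanes:
        slots = [0] * epw
        for i, e in enumerate(elems):
            slots[i % epw] = (slots[i % epw] + e) & mask
        acc.append(slots)

    # Inter-lane: a sending lane (tx) hands its partial sums to a
    # receiving lane (rx); the pairing used here is an assumption.
    stride = 1
    while stride < len(acc):
        for rx in range(0, len(acc), 2 * stride):
            tx = rx + stride
            if tx < len(acc):
                acc[rx] = [(a + b) & mask for a, b in zip(acc[rx], acc[tx])]
        stride *= 2

    # Final SIMD reduction in lane 0: fold the slots of one word.
    total = 0
    for s in acc[0]:
        total = (total + s) & mask
    return total
```

With four lanes holding [1, 2], [3, 4], [5, 6] and [7, 8], the model returns 36, and sums wrap modulo 2^sew_bits as they would in fixed-width hardware.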

Narrowing Instructions

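Narrowing instructions such as VNSRA and VNCLIP write destination elements that are half as wide as their source elements, so each cycle fills only half of a result word. The module uses a toggle (narrowing_select_q) to alternate between writing the low and high halves.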
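The toggle can be modeled as below. This is a hedged Python sketch: the class NarrowingPacker is hypothetical, and filling the low half before the high half is an assumption made for the example.

```python
class NarrowingPacker:
    """Packs two 32-bit half results into one 64-bit result word."""

    def __init__(self):
        self.narrowing_select = 0  # 0: low half is next, 1: high half is next
        self.word = 0

    def push_half(self, half):
        """Add one half result; return the full word once both halves exist."""
        half &= 0xFFFFFFFF
        if self.narrowing_select == 0:
            self.word = half           # assumed order: low half first
            self.narrowing_select = 1
            return None
        full = self.word | (half << 32)
        self.narrowing_select = 0      # toggle back for the next word
        self.word = 0
        return full
```

A complete result word is therefore available only every second cycle in this model.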

SIMD ALU Execution

The core computation is handled by simd_alu, parameterized by fixed-point support and rounding. It takes:

alu_operand_a and alu_operand_b
The operation (op_i)
The mask (mask_i)
A rounding modifier (rm)

Results are fed to the result queue and selectively masked.
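As an illustration of the SIMD principle, the following Python sketch adds two 64-bit words element by element, with no carry crossing an element boundary. The function name simd_add is hypothetical; it models a single operation and leaves out masking and rounding.

```python
def simd_add(a, b, sew_bits):
    """Element-wise wrapping add of two 64-bit words."""
    elem_mask = (1 << sew_bits) - 1
    result = 0
    for i in range(64 // sew_bits):
        shift = i * sew_bits
        ea = (a >> shift) & elem_mask
        eb = (b >> shift) & elem_mask
        # The sum is truncated to the element width, so a carry never
        # propagates into the neighboring element.
        result |= ((ea + eb) & elem_mask) << shift
    return result
```

The same operand pair gives different results depending on the element width: adding 0x0001000100010001 to 0x00FF00FF00FF00FF yields 0 with 8-bit elements and 0x0100010001000100 with 16-bit elements.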

Fixed-Point Rounding and Saturation

When enabled, rounding and saturation are implemented via fixed_p_rounding and internal saturation flags (alu_vxsat_q). The saturation flag vxsat_flag_o is updated during commit.
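The arithmetic behind rounding and saturation can be sketched in Python. The rounding increments follow the four vxrm modes defined by the RISC-V vector specification (rnu, rne, rdn, rod). The function names are hypothetical and the code does not describe the internals of fixed_p_rounding.

```python
def round_increment(v, d, vxrm):
    """Increment to add to (v >> d) for rounding mode vxrm (0..3)."""
    if d == 0:
        return 0
    lsb = (v >> d) & 1                       # v[d]
    half = (v >> (d - 1)) & 1                # v[d-1]
    below_half = v & ((1 << (d - 1)) - 1)    # v[d-2:0]
    dropped = v & ((1 << d) - 1)             # v[d-1:0]
    if vxrm == 0:   # rnu: round to nearest, ties up
        return half
    if vxrm == 1:   # rne: round to nearest, ties to even
        return half & int(below_half != 0 or lsb == 1)
    if vxrm == 2:   # rdn: truncate
        return 0
    if vxrm == 3:   # rod: round to odd
        return int(lsb == 0 and dropped != 0)
    raise ValueError("vxrm must be in 0..3")


def rounding_shift_right(v, d, vxrm):
    """Unsigned right shift by d with rounding."""
    return (v >> d) + round_increment(v, d, vxrm)


def sat_add_signed(a, b, sew_bits):
    """Signed saturating add; returns (result, saturated flag)."""
    hi = (1 << (sew_bits - 1)) - 1
    lo = -(1 << (sew_bits - 1))
    s = a + b
    if s > hi:
        return hi, True
    if s < lo:
        return lo, True
    return s, False
```

The boolean returned by sat_add_signed plays the role of a per-element saturation flag, which in the module is accumulated and reported through vxsat_flag_o.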

Control and State Machines

ALU State (`alu_state_q`)

Defines execution phase:

NO_REDUCTION
INTRA_LANE_REDUCTION
INTER_LANES_REDUCTION_TX
INTER_LANES_REDUCTION_RX
SIMD_REDUCTION
LN0_REDUCTION_COMMIT

The FSM governs transitions based on instruction type, operand readiness, and the handshake with the Slide Unit (SLDU).

Instruction Lifecycle

Acceptance: New vector instructions are stored in the queue.
Issue: Instructions are fetched and executed if operands are valid.
Execution: Results are generated (immediately or over multiple cycles).
Commit: Data is written to the vector register file or sent to the mask unit.

Commit and Writeback

Writeback to VRF is gated by alu_result_gnt_i.
Results for masking go through mask_operand_o.
Queue counters and state pointers are updated accordingly.
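The gating described above can be summarized in a short decision function. This is a minimal Python sketch: the function name commit_step and its boolean arguments are assumptions that stand in for the result-queue mask field, alu_result_gnt_i, and mask_operand_ready_i.

```python
def commit_step(is_mask_result, alu_result_gnt, mask_operand_ready):
    """Decide where the head result goes this cycle.

    Returns (destination, popped): destination is "vrf", "mask_unit"
    or None, and popped tells whether the queue entry is released.
    """
    if is_mask_result:
        # Results for the mask unit wait for its ready signal.
        if mask_operand_ready:
            return "mask_unit", True
        return None, False
    # Regular results wait for the VRF write grant.
    if alu_result_gnt:
        return "vrf", True
    return None, False
```

When the required grant or ready signal is low, the entry stays at the head of the queue and the pointers and counters are left unchanged.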
