valu - Instantiate the in-lane SIMD ALU (unpipelined)

Overview

The valu module is a central component of Ara’s vector processing pipeline. It is the in-lane SIMD (Single Instruction, Multiple Data) integer Arithmetic Logic Unit: each lane instantiates its own valu, which operates on that lane's 64-bit datapath, so integer operations run across all lanes in parallel. Besides plain integer arithmetic, the module handles fixed-point arithmetic, scalar replication, vector reductions, narrowing operations, and the interaction with the mask unit and slide unit.

This document is intended as a reference for the valu module and walks through its functional blocks section by section.


Table of Contents

  1. Module Parameters

  2. Top-Level Interface

  3. Vector Instruction Queue

  4. Result Queue

  5. Scalar Operand Processing

  6. Mask Operand Handling

  7. Reduction Support

  8. Narrowing Instructions

  9. SIMD ALU Execution

  10. Fixed-Point Rounding and Saturation

  11. Control and State Machines

  12. Instruction Lifecycle

  13. Commit and Writeback

  14. Summary


Module Parameters

parameter int unsigned NrLanes;
parameter int unsigned VLEN;
parameter fixpt_support_e FixPtSupport;
parameter type vaddr_t;
parameter type vfu_operation_t;

  • NrLanes: Number of vector lanes.

  • VLEN: Vector register length, in bits.

  • FixPtSupport: Enable or disable fixed-point support.

  • vaddr_t, vfu_operation_t: Type definitions for vector addresses and operations.


Top-Level Interface

The valu module interfaces with several components:

  • Dispatcher: Provides new vector operations (vfu_operation_i) and receives vxsat_flag_o.

  • Lane sequencer: Coordinates the execution flow.

  • Operand queues: Deliver source operands (alu_operand_i).

  • VRF: Accepts results for writeback.

  • Slide Unit: Exchanges operands and results for inter-lane reduction.

  • Mask Unit: Manages masking for selective operations.


Vector Instruction Queue

The queue tracks in-flight instructions and separates their execution phases:

  • accept_pnt: Points to the next free entry for acceptance.

  • issue_pnt: Points to the instruction to be executed next.

  • commit_pnt: Points to the instruction whose results are being written back.

Each instruction is described by a vfu_operation_t struct and includes the vector operation, destination ID, and control fields like vm and vsew.
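
The three-pointer scheme can be illustrated with a small behavioral model in Python. This is only a sketch and not the RTL: the class name VInsnQueue, the depth, and the two counters (issue_cnt, commit_cnt) are assumptions made for illustration; only the pointer names come from the description above.

```python
class VInsnQueue:
    """Behavioral sketch of a circular queue with accept/issue/commit pointers."""

    def __init__(self, depth=4):  # depth is an assumed value
        self.depth = depth
        self.entries = [None] * depth
        self.accept_pnt = 0   # next free entry
        self.issue_pnt = 0    # next instruction to execute
        self.commit_pnt = 0   # next instruction to write back
        self.issue_cnt = 0    # accepted, not yet issued (hypothetical counter)
        self.commit_cnt = 0   # accepted, not yet committed (hypothetical counter)

    def full(self):
        return self.commit_cnt == self.depth

    def accept(self, insn):
        """Store a new instruction; returns False when the queue is full."""
        if self.full():
            return False
        self.entries[self.accept_pnt] = insn
        self.accept_pnt = (self.accept_pnt + 1) % self.depth
        self.issue_cnt += 1
        self.commit_cnt += 1
        return True

    def issue(self):
        """Return the next instruction to execute, or None if none is pending."""
        if self.issue_cnt == 0:
            return None
        insn = self.entries[self.issue_pnt]
        self.issue_pnt = (self.issue_pnt + 1) % self.depth
        self.issue_cnt -= 1
        return insn

    def commit(self):
        """Retire the oldest issued instruction, or None if nothing was issued."""
        if self.commit_cnt == self.issue_cnt:
            return None  # every remaining entry is still waiting to issue
        insn = self.entries[self.commit_pnt]
        self.commit_pnt = (self.commit_pnt + 1) % self.depth
        self.commit_cnt -= 1
        return insn
```

In this model an entry is freed only at commit, so an instruction can be issuing while an older one is still draining its results.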


Result Queue

This FIFO queue temporarily stores computation results, including:

  • wdata: Resulting word.

  • id: Instruction ID.

  • addr: Target vector address.

  • be: Byte enable.

  • mask: Whether this result is to be forwarded to the mask unit.

Two pointers (write_pnt, read_pnt) and a count (result_queue_cnt_q) track the queue status.
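
As a rough illustration, the result queue can be modeled in Python as a ring buffer. The field names follow the list above; the class name, the depth, and the grant argument (standing in for the writeback grant) are assumptions of this sketch, which does not mirror the actual RTL.

```python
from collections import namedtuple

# One result-queue entry; field names taken from the description above.
Result = namedtuple("Result", "wdata id addr be mask")


class ResultQueue:
    """Behavioral sketch of a FIFO tracked by write_pnt, read_pnt and a count."""

    def __init__(self, depth=2):  # depth is an assumed value
        self.depth = depth
        self.slots = [None] * depth
        self.write_pnt = 0
        self.read_pnt = 0
        self.cnt = 0  # plays the role of result_queue_cnt_q

    def push(self, result):
        """Store a result; returns False (stall) when the queue is full."""
        if self.cnt == self.depth:
            return False
        self.slots[self.write_pnt] = result
        self.write_pnt = (self.write_pnt + 1) % self.depth
        self.cnt += 1
        return True

    def pop(self, grant):
        """Release the oldest result only when the consumer grants it."""
        if self.cnt == 0 or not grant:
            return None
        result = self.slots[self.read_pnt]
        self.read_pnt = (self.read_pnt + 1) % self.depth
        self.cnt -= 1
        return result
```

A result stays at the head of the queue until the grant arrives, which is what lets the ALU keep computing while an earlier word waits for writeback.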


Scalar Operand Processing

Scalar values from instructions are replicated to 64-bit words depending on vsew. For instance:

  • EW8: replicated 8×

  • EW16: replicated 4×

  • EW32: replicated 2×

  • EW64: used as is (1×)

This ensures compatibility with SIMD-wide operations.
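
The replication rule can be written down as a short Python function. The function name replicate_scalar and its arguments are hypothetical; the element width is passed in bits rather than as the vsew encoding.

```python
def replicate_scalar(scalar, sew_bits):
    """Replicate the low sew_bits of scalar across a 64-bit word."""
    if sew_bits not in (8, 16, 32, 64):
        raise ValueError("unsupported element width")
    elem = scalar & ((1 << sew_bits) - 1)  # truncate to the element width
    word = 0
    for i in range(64 // sew_bits):        # 8, 4, 2 or 1 copies
        word |= elem << (i * sew_bits)
    return word
```

For example, replicate_scalar(0xAB, 8) yields 0xABABABABABABABAB, and a 64-bit scalar is passed through unchanged.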


Mask Operand Handling

When vector masking is enabled (vm == 0), a spill_register forwards result data to the mask unit. The transfer is handshaked: a word leaves the register only when the mask unit asserts mask_operand_ready_i.


Reduction Support

Reduction operations are divided into:

  • Intra-lane: Sequential accumulation within a single lane.

  • Inter-lane: Exchange of partial sums between lanes via the Slide Unit.

The ALU manages:

  • Internal counters (reduction_rx_cnt_q)

  • Inter-lane state transitions (e.g., INTER_LANES_REDUCTION_TX)

  • Final SIMD reduction in lane 0
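
The two reduction phases can be sketched in Python for a sum reduction. This model is an assumption-laden illustration: it supposes a power-of-two number of lanes and a pairwise (tree-shaped) exchange between lanes, neither of which is stated above, and all function names are hypothetical.

```python
def intra_lane_reduce(lane_elems):
    """Sequentially accumulate the elements held by one lane."""
    acc = 0
    for elem in lane_elems:
        acc += elem
    return acc


def inter_lane_reduce(partials):
    """Combine per-lane partial sums pairwise until lane 0 holds the total.

    Assumes a power-of-two lane count and a tree-shaped exchange.
    """
    n = len(partials)
    if n == 0 or (n & (n - 1)) != 0:
        raise ValueError("lane count must be a power of two")
    vals = list(partials)
    stride = 1
    while stride < n:
        # Lane rx receives the partial sum sent by lane rx + stride.
        for rx in range(0, n, 2 * stride):
            vals[rx] += vals[rx + stride]
        stride *= 2
    return vals[0]


def reduce_sum(lanes, sew_bits=64):
    """Full reduction: intra-lane first, then inter-lane, wrapped to sew_bits."""
    mask = (1 << sew_bits) - 1
    partials = [intra_lane_reduce(lane) & mask for lane in lanes]
    return inter_lane_reduce(partials) & mask
```

With four lanes holding [1, 2], [3, 4], [5, 6] and [7, 8], the partial sums are 3, 7, 11 and 15, and lane 0 ends up with 36.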


Narrowing Instructions

Narrowing instructions such as VNSRA and VNCLIP produce results at half the source element width, so each pass through the ALU fills only half of a result word. The module uses a toggle (narrowing_select_q) to alternate between writing the low and high halves.
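
The toggle can be illustrated with a small Python model. It assumes that the low half of the 64-bit word is written first and that a word is complete after two writes; the class and method names are hypothetical.

```python
class NarrowingPacker:
    """Behavioral sketch: pack two 32-bit half results into one 64-bit word."""

    def __init__(self):
        self.narrowing_select = 0  # 0: write low half, 1: write high half
        self.word = 0

    def write_half(self, half):
        """Write one half; return the full word once both halves are present."""
        half &= 0xFFFFFFFF
        if self.narrowing_select == 0:
            self.word = half            # low half (assumed to come first)
            self.narrowing_select = 1
            return None
        full = self.word | (half << 32)  # high half completes the word
        self.narrowing_select = 0
        self.word = 0
        return full
```

Two consecutive calls with 0x11111111 and 0x22222222 return None and then 0x2222222211111111.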


SIMD ALU Execution

The core computation is handled by simd_alu, parameterized by fixed-point support and rounding. It takes:

  • alu_operand_a and alu_operand_b

  • The operation (op_i)

  • The mask (mask_i)

  • A rounding modifier (rm)

Results are fed to the result queue and selectively masked.
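
As an illustration of what a SIMD operation on a 64-bit word means, the Python sketch below performs an element-wise addition and derives a byte enable from a per-element mask. The function name, the mask layout (one bit per element), and the use of a byte enable to express masking are assumptions of this model, not a description of simd_alu's ports.

```python
def simd_add(a, b, sew_bits, elem_mask_bits):
    """Element-wise add over a 64-bit word.

    Returns (wdata, be): the packed sums and a byte enable that is set
    only for the bytes of elements whose mask bit is 1.
    """
    elem_mask = (1 << sew_bits) - 1
    bytes_per_elem = sew_bits // 8
    wdata = 0
    be = 0
    for i in range(64 // sew_bits):
        shift = i * sew_bits
        ea = (a >> shift) & elem_mask
        eb = (b >> shift) & elem_mask
        # Carries never cross an element boundary.
        wdata |= ((ea + eb) & elem_mask) << shift
        if (elem_mask_bits >> i) & 1:
            be |= ((1 << bytes_per_elem) - 1) << (i * bytes_per_elem)
    return wdata, be
```

Masked-off elements are still computed here but their byte enables stay low, so a consumer honoring be would leave the old destination bytes untouched.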


Fixed-Point Rounding and Saturation

When enabled, rounding and saturation are implemented via fixed_p_rounding and internal saturation flags (alu_vxsat_q). The saturation flag vxsat_flag_o is updated during commit.
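
For reference, the rounding and saturation arithmetic can be sketched in Python. The rounding increment follows the four vxrm modes defined by the RISC-V vector specification (0: rnu, 1: rne, 2: rdn, 3: rod); the function names are hypothetical, and the sketch makes no claim about how fixed_p_rounding is structured internally.

```python
def rounding_increment(v, d, vxrm):
    """Rounding increment r for shifting unsigned v right by d bits."""
    if d == 0:
        return 0
    bit_d = (v >> d) & 1                               # v[d]
    bit_dm1 = (v >> (d - 1)) & 1                       # v[d-1]
    below_nonzero = (v & ((1 << (d - 1)) - 1)) != 0    # v[d-2:0] != 0
    if vxrm == 0:   # rnu: round to nearest, ties up
        return bit_dm1
    if vxrm == 1:   # rne: round to nearest, ties to even
        return bit_dm1 & int(below_nonzero or bit_d == 1)
    if vxrm == 2:   # rdn: truncate
        return 0
    if vxrm == 3:   # rod: round to odd ("jam")
        return int(bit_d == 0 and (v & ((1 << d) - 1)) != 0)
    raise ValueError("vxrm must be in 0..3")


def roundoff_unsigned(v, d, vxrm):
    """Shift right by d with rounding according to vxrm."""
    return (v >> d) + rounding_increment(v, d, vxrm)


def sat_add_signed(a, b, sew_bits):
    """Signed saturating add; returns (result, saturated flag)."""
    lo = -(1 << (sew_bits - 1))
    hi = (1 << (sew_bits - 1)) - 1
    s = a + b
    if s > hi:
        return hi, True
    if s < lo:
        return lo, True
    return s, False
```

The boolean returned by sat_add_signed corresponds to the condition that would set a saturation flag; in this sketch an instruction's flag would be the OR over all of its elements.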


Control and State Machines

ALU State (alu_state_q)

alu_state_q holds the current execution phase, which is one of:

  • NO_REDUCTION

  • INTRA_LANE_REDUCTION

  • INTER_LANES_REDUCTION_TX

  • INTER_LANES_REDUCTION_RX

  • SIMD_REDUCTION

  • LN0_REDUCTION_COMMIT

The FSM governs transitions based on instruction type, operand readiness, and SLDU handshake.


Instruction Lifecycle

  1. Acceptance: New vector instructions are stored in the queue.

  2. Issue: Instructions are fetched and executed if operands are valid.

  3. Execution: Results are generated (immediately or over multiple cycles).

  4. Commit: Data is written to the vector register file or sent to the mask unit.


Commit and Writeback

  • Writeback to VRF is gated by alu_result_gnt_i.

  • Results for masking go through mask_operand_o.

  • Queue counters and state pointers are updated accordingly.