`vector_fus_stage` Module Documentation

Module Name: vector_fus_stage
Authors: Matheus Cavalcante
Description: This is Ara’s vector execution stage. It instantiates and connects the vector functional units (VFUs) for each lane, specifically the Vector ALU (VALU) and the Vector Multiplier/FPU (VMFPU), enabling SIMD-style parallel vector operations.

1. Overview

This module coordinates the operation of all vector functional units for one lane in the Ara vector processor. It interfaces with:

The Dispatcher for configuration (e.g., rounding/saturation).
The Lane Sequencer for operation dispatch and handshaking.
The Vector Register File (VRF) for read/write data movement.
The Slide Unit, to handle reduction and data forwarding.
The Mask Unit, to apply selective operation masking.

2. Parameters

Parameter	Type	Description
`NrLanes`	`int`	Number of lanes in the vector processor
`VLEN`	`int`	Vector register length
`CVA6Cfg`	Struct	Configuration of the CVA6 processor
`FPUSupport`	Enum	Enable/disable support for FP16, FP32, FP64
`FPExtSupport`	Enum	Enable the additional FP estimate instructions (vfrec7, vfrsqrt7)
`FixPtSupport`	Enum	Support for fixed-point arithmetic
`vaddr_t`	Type	Type used to address vector elements
`vfu_operation_t`	Type	Type representing vector functional unit operations

3. Submodule Instantiations

3.1 `VALU` - Vector ALU

Handles all integer and fixed-point arithmetic operations.

Inputs:
- Operands (alu_operand_i)
- Operation type (vfu_operation_i)
- Control flags (alu_vxrm_i, mask info)
Outputs:
- Result back to VRF (alu_result_wdata_o, alu_result_addr_o)
- Done signal per instruction (alu_vinsn_done_o)
- Reductions (alu_red_complete_o)
- Saturation flag (alu_vxsat)
Handshake signals:
- alu_ready_o, alu_result_gnt_i, mask signals
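
The request/grant handshake on the result path can be sketched as follows. This is a minimal illustration under stated assumptions, not Ara's actual implementation: the module name, `DataWidth`, the `result_*` producer-side ports, and `alu_result_req_o` are names introduced here for the example. Only `alu_result_wdata_o` and `alu_result_gnt_i` appear in this document.

```systemverilog
// Sketch: hold one result until the VRF grants the write.
// Hypothetical module; only alu_result_wdata_o and alu_result_gnt_i
// are taken from the documentation above.
module result_handshake_sketch #(
  parameter int unsigned DataWidth = 64 // assumed width, for illustration
) (
  input  logic                 clk_i,
  input  logic                 rst_ni,
  // Producer side (hypothetical names): a new result from the datapath
  input  logic                 result_valid_i,
  input  logic [DataWidth-1:0] result_i,
  output logic                 result_ready_o,
  // VRF side
  output logic                 alu_result_req_o,   // assumed name
  output logic [DataWidth-1:0] alu_result_wdata_o,
  input  logic                 alu_result_gnt_i
);
  logic                 req_q;   // a result is waiting for a grant
  logic [DataWidth-1:0] wdata_q; // the waiting result

  // A new result can be accepted when the slot is empty
  // or when the waiting result is granted in this cycle.
  assign result_ready_o     = !req_q || alu_result_gnt_i;
  assign alu_result_req_o   = req_q;
  assign alu_result_wdata_o = wdata_q;

  always_ff @(posedge clk_i or negedge rst_ni) begin
    if (!rst_ni) begin
      req_q   <= 1'b0;
      wdata_q <= '0;
    end else if (result_ready_o) begin
      // Load the next result, or clear the request if none is offered.
      req_q <= result_valid_i;
      if (result_valid_i) wdata_q <= result_i;
    end
    // Otherwise: request pending and not granted, so hold both registers.
  end
endmodule
```

The point of the sketch is that the write data stays stable while the request is pending, so a stalled VRF port never loses a result.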

3.2 `VMFPU` - Vector Multiplier/FPU

Handles:

Integer multiply and multiply-accumulate
All floating-point operations (add, mul, div, sqrt, etc.)
Optional additional FP estimate instructions (vfrec7, vfrsqrt7)
Fixed-point arithmetic extensions
Inputs:
- 3 operands (mfpu_operand_i)
- Operation type (vfu_operation_i)
Outputs:
- Result and writeback signals (mfpu_result_*)
- Exception flags (fflags_ex_o)
- Completion tracking (mfpu_vinsn_done_o, fpu_red_complete_o)
Handshake and masking signals mirror those of the VALU.

4. Interface Summary

4.1 Input/Output Overview

Inputs from Dispatcher:
- alu_vxrm_i: Fixed-point rounding mode (vxrm)
Inputs from Lane Sequencer:
- vfu_operation_i: Operation type
- vfu_operation_valid_i: Validity of vfu_operation_i
Operand Queues:
- ALU: alu_operand_i[1:0], alu_operand_valid_i
- MFPU: mfpu_operand_i[2:0], mfpu_operand_valid_i
VRF Writeback:
- ALU: alu_result_*
- MFPU: mfpu_result_*
Mask Interface:
- Shared mask_i, mask_valid_i
- Split readiness signals: alu_mask_ready, mfpu_mask_ready
Slide Unit:
- Slide operands (sldu_operand_i)
- Slide handshake per unit
- Reduction request/ack (sldu_*_req_valid_o, sldu_*_gnt_i)
Saturation:
- vxsat_flag_o: Indicates whether saturation occurred
FPU Exceptions:
- fflags_ex_o, fflags_ex_valid_o

5. Control Logic and Signal Routing

5.1 Masking Coordination

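Shared input mask: mask_i, mask_valid_i, broadcast to both the VALU and the VMFPU.
Readiness: mask_ready_o = alu_mask_ready | mfpu_mask_ready, so the mask is consumed as soon as either unit accepts it.
This broadcast is only safe while the instruction queue has a depth of one. With a deeper queue, each mask would need a tag identifying the instruction it belongs to.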
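
The mask broadcast and the merged ready signal can be sketched as below. This is an illustrative reduction of the idea, not the module's real code: the module name, the `MaskWidth` parameter, and the `*_mask_o` / `*_mask_valid_o` outputs are hypothetical, while `mask_i`, `mask_valid_i`, `mask_ready_o`, `alu_mask_ready`, and `mfpu_mask_ready` are the names used in this document.

```systemverilog
// Sketch: one mask source feeding two consumers.
// Hypothetical module; output names and MaskWidth are assumptions.
module mask_broadcast_sketch #(
  parameter int unsigned MaskWidth = 8 // assumed width, for illustration
) (
  input  logic [MaskWidth-1:0] mask_i,
  input  logic                 mask_valid_i,
  input  logic                 alu_mask_ready,   // VALU accepts the mask
  input  logic                 mfpu_mask_ready,  // VMFPU accepts the mask
  output logic [MaskWidth-1:0] alu_mask_o,
  output logic                 alu_mask_valid_o,
  output logic [MaskWidth-1:0] mfpu_mask_o,
  output logic                 mfpu_mask_valid_o,
  output logic                 mask_ready_o
);
  // Both units see the same mask bits and the same valid flag.
  assign alu_mask_o        = mask_i;
  assign mfpu_mask_o       = mask_i;
  assign alu_mask_valid_o  = mask_valid_i;
  assign mfpu_mask_valid_o = mask_valid_i;

  // The mask is acknowledged as soon as either unit takes it.
  // This does not record which unit consumed the mask, which is why
  // deeper instruction queues would need tagged masks.
  assign mask_ready_o = alu_mask_ready | mfpu_mask_ready;
endmodule
```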

5.2 Saturation Flag Aggregation

Both units can set a saturation flag:

assign vxsat_flag_o = mfpu_vxsat | alu_vxsat;

6. Lane-Level Operation and Modular Structure

This module is fully parameterized per-lane, enabling reuse across multiple vector lanes. Each lane runs its own instance of this module, and each instance manages:

One ALU (i_valu)
One MFPU (i_vmfpu)

The control strategy is uniform:

Operand queues → functional units
Results → VRF
Status → dispatcher/sequencer
Mask/Slide → helpers
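
The one-instance-per-lane structure described in this section can be sketched with a generate loop. The stub below is a hypothetical stand-in for the real stage, which has many more ports and parameters; collecting one `vxsat_flag_o` bit per lane into a vector is likewise an assumption made only to give the example an observable output.

```systemverilog
// Hypothetical stand-in for one lane's execution stage.
module fus_stage_stub (
  input  logic clk_i,
  input  logic rst_ni,
  output logic vxsat_flag_o
);
  // Placeholder behaviour: this stub never reports saturation.
  assign vxsat_flag_o = 1'b0;
endmodule

// Sketch: replicate the stage once per lane.
module lanes_sketch #(
  parameter int unsigned NrLanes = 4 // example value
) (
  input  logic               clk_i,
  input  logic               rst_ni,
  output logic [NrLanes-1:0] vxsat_flag_o // one flag per lane (assumed)
);
  for (genvar lane = 0; lane < NrLanes; lane++) begin : gen_lanes
    fus_stage_stub i_fus_stage (
      .clk_i       (clk_i),
      .rst_ni      (rst_ni),
      .vxsat_flag_o(vxsat_flag_o[lane])
    );
  end
endmodule
```

Because every lane instantiates the same module, changing `NrLanes` scales the design without touching the stage itself.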

7. Design Notes

Uses the ara_pkg and rvv_pkg definitions for consistency across Ara.
Designed to be flexible with regard to precision, operation type, and lane width.
Uses explicit handshakes (valid/ready or request/grant) on the operand, result, and mask interfaces, so data is not lost when a consumer stalls.
Clean separation of logic for ALU and MFPU makes it easy to extend or adapt.
Interfacing with the Mask and Slide Units is modular and scalable.
