simd_alu - Ara’s in-lane SIMD ALU

This document explains the simd_alu module of the Ara vector processor in detail. The simd_alu (Single Instruction, Multiple Data Arithmetic Logic Unit) performs element-wise ALU operations on a 64-bit datapath word that holds 8-, 16-, 32-, or 64-bit vector elements. It supports fixed-point and saturating arithmetic, logical and comparison operations, shifts, and narrowing, rounding, and merge instructions.


Summary

  • Module Name: simd_alu

  • Source: simd_alu.sv

  • Author: Matheus Cavalcante

  • License: Solderpad Hardware License, Version 0.51

  • Purpose: Implements vector ALU functionality supporting element-wise operations, fixed-point saturation, rounding, comparisons, and shifts for Ara’s 64-bit SIMD vector datapath.


Inputs

Signal              Type                     Description
------------------  -----------------------  ------------------------------------------------
operand_a_i         elen_t (64-bit)          First operand for the ALU operation
operand_b_i         elen_t (64-bit)          Second operand
valid_i             logic                    Enables processing of a new instruction
vm_i                logic                    Vector mask enable
mask_i              strb_t (byte-wide mask)  Byte-level mask controlling predicate effects
narrowing_select_i  logic                    Select for narrowing results
op_i                ara_op_e                 ALU operation code
vew_i               vew_e                    Vector element width selector (EW8, EW16, etc.)
rm                  strb_t                   Rounding mode (used in fixed-point ops)
vxrm_i              vxrm_t                   Fixed-point rounding mode (VXRM)

Outputs

Signal    Type     Description
--------  -------  ---------------------------------------
result_o  elen_t   Final result after SIMD ALU computation
vxsat_o   vxsat_t  Overflow saturation flags per lane


Internal Types

  • alu_operand_t: A union that lets a 64-bit value be interpreted as 8-, 16-, 32-, or 64-bit elements.

  • alu_sat_operand_t: An extended-width union used for saturation detection.
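The idea behind alu_operand_t can be sketched in C as a union with one view per element width. This is only an analogue for illustration: the type name alu_operand_c, the helper sum_w16, and the field names are assumptions, not taken from simd_alu.sv.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical C analogue of alu_operand_t: one 64-bit word, four views. */
typedef union {
    uint8_t  w8[8];   /* EW8:  eight 8-bit elements */
    uint16_t w16[4];  /* EW16: four 16-bit elements */
    uint32_t w32[2];  /* EW32: two 32-bit elements  */
    uint64_t w64;     /* EW64: one 64-bit element   */
} alu_operand_c;

/* Compile-time check that every view covers exactly 64 bits. */
static_assert(sizeof(alu_operand_c) == 8, "operand must be exactly 64 bits");

/* Reinterpret a 64-bit word and sum its four 16-bit elements,
   wrapping at 16 bits. The sum does not depend on byte order. */
uint16_t sum_w16(uint64_t word) {
    alu_operand_c op;
    op.w64 = word;
    uint16_t acc = 0;
    for (int i = 0; i < 4; i++) {
        acc = (uint16_t)(acc + op.w16[i]);
    }
    return acc;
}
```

Reading a union member other than the one last written is permitted in C, which makes the union a convenient stand-in for the packed SystemVerilog type.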


Main Features and Functionality

1. Vector Element Width Awareness

Operations are performed on the element slots of the 64-bit word, as selected by vew_i:

  • EW8: 8 x 8-bit elements

  • EW16: 4 x 16-bit elements

  • EW32: 2 x 32-bit elements

  • EW64: 1 x 64-bit element

Each operation adapts to the selected width via unpacking the input operands accordingly.
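The relationship between the width selector and the number of elements per word can be written down directly. The sketch below assumes a selector that counts up from EW8 in powers of two; the enum vew_c and its numeric values are hypothetical and may not match the encoding of vew_e.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical width selector; the numeric encoding is an assumption. */
typedef enum { EW8 = 0, EW16 = 1, EW32 = 2, EW64 = 3 } vew_c;

/* Element width in bits: 8, 16, 32, or 64. */
unsigned elem_bits(vew_c vew) {
    assert(vew <= EW64);
    return 8u << vew;
}

/* Number of elements packed into one 64-bit datapath word. */
unsigned elems_per_word(vew_c vew) {
    return 64u / elem_bits(vew);
}
```

For example, elems_per_word(EW16) evaluates to 4, matching the list above.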

2. ALU Operation Decoding

The module uses a large case statement on op_i to implement logic/arithmetic/comparison instructions. Many instructions use nested case statements based on vew_i.
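The two-level decode described above can be illustrated with nested switch statements in C. This is a behavioral sketch only: the enums alu_op_c and vew_c, and the functions add_slots and alu_dispatch, are hypothetical names covering just two operations.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical subsets of the op code and width enums, for illustration only. */
typedef enum { OP_AND, OP_ADD } alu_op_c;
typedef enum { EW8, EW16, EW32, EW64 } vew_c;

/* Add two words slot by slot; carries never cross a slot boundary. */
uint64_t add_slots(uint64_t a, uint64_t b, unsigned bits) {
    assert(bits == 8 || bits == 16 || bits == 32);
    uint64_t slot_mask = (1ULL << bits) - 1;
    uint64_t res = 0;
    for (unsigned sh = 0; sh < 64; sh += bits) {
        uint64_t sum = ((a >> sh) & slot_mask) + ((b >> sh) & slot_mask);
        res |= (sum & slot_mask) << sh;
    }
    return res;
}

uint64_t alu_dispatch(alu_op_c op, vew_c vew, uint64_t a, uint64_t b) {
    switch (op) {                       /* outer decode on the operation */
    case OP_AND:
        return a & b;                   /* bitwise: element width is irrelevant */
    case OP_ADD:
        switch (vew) {                  /* inner decode on the element width */
        case EW8:  return add_slots(a, b, 8);
        case EW16: return add_slots(a, b, 16);
        case EW32: return add_slots(a, b, 32);
        case EW64: return a + b;
        }
        break;
    }
    return 0;                           /* unknown op or width: the sketch returns 0 */
}
```

Width-agnostic operations such as AND need no inner switch, which is why only some instructions nest a second decode on the element width.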

3. Saturation and Fixed-Point Handling

Fixed-point operations (e.g., VSADD, VASUB, VNCLIP) are implemented only when the FixPtSupport parameter enables them. Overflow is detected by inspecting the high-order bits of the extended-width result, and the corresponding flags are set in vxsat_o.
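Saturation detection through a wider intermediate can be sketched for a single 8-bit element as follows. The helpers sat_add_i8 and sat_add_u8 are hypothetical and model only the arithmetic, not the module's actual signal structure.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Signed saturating 8-bit add using a wider intermediate (hypothetical helper). */
int8_t sat_add_i8(int8_t a, int8_t b, bool *sat) {
    assert(sat != NULL);
    int16_t wide = (int16_t)((int16_t)a + (int16_t)b);  /* cannot overflow in 16 bits */
    if (wide > INT8_MAX) { *sat = true; return INT8_MAX; }
    if (wide < INT8_MIN) { *sat = true; return INT8_MIN; }
    *sat = false;
    return (int8_t)wide;
}

/* Unsigned variant: bit 8 of the 9-bit sum is the overflow indicator. */
uint8_t sat_add_u8(uint8_t a, uint8_t b, bool *sat) {
    assert(sat != NULL);
    uint16_t wide = (uint16_t)((uint16_t)a + (uint16_t)b);
    *sat = (wide >> 8) != 0;
    return *sat ? UINT8_MAX : (uint8_t)wide;
}
```

The sat output plays the role of one saturation flag: it is raised exactly when the true sum does not fit in the element width and the result has been clamped.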

4. Mask Logic & Merging

The mask signal (mask_i) interacts with vm_i and is embedded in certain instruction results (e.g., comparisons). Merge and scalar move operations use the mask to choose between operands.
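A byte-granular merge can be modeled as a per-byte select between the two operands. In this sketch the function merge_bytes is hypothetical, and the convention that a set mask bit picks the byte from the first operand is an assumption rather than something confirmed by the source.

```c
#include <assert.h>
#include <stdint.h>

/* Per-byte select: mask bit i picks byte i of a (bit set) or of b (bit clear).
   Which operand a set bit selects is an assumption of this sketch. */
uint64_t merge_bytes(uint64_t a, uint64_t b, unsigned mask) {
    assert(mask <= 0xFFu);              /* one mask bit per byte of the word */
    uint64_t res = 0;
    for (unsigned i = 0; i < 8; i++) {
        uint64_t lane = 0xFFULL << (8 * i);
        res |= ((mask >> i) & 1u) ? (a & lane) : (b & lane);
    }
    return res;
}
```

With a mask of 0x0F, the low four bytes come from a and the high four bytes from b.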

5. Shift & Narrowing Operations

Includes support for:

  • Logical/arithmetic shifts (VSLL, VSRL, VSRA)

  • Narrowing shift with optional rounding (VNSRL, VNSRA)

  • Clip instructions (VNCLIP, VNCLIPU) with saturation
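The difference between a plain narrowing shift and a clip can be shown on one 16-bit source element narrowed to 8 bits. Both helpers below are hypothetical, handle only the unsigned case, and truncate rather than apply a rounding increment.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Plain narrowing shift (VNSRL-like): shift right and keep the low 8 bits. */
uint8_t narrow_srl_16to8(uint16_t v, unsigned shamt) {
    assert(shamt < 16);
    return (uint8_t)(v >> shamt);
}

/* Narrowing clip (VNCLIPU-like): shift right, then saturate to 8 bits.
   Truncating variant: no rounding increment is applied. */
uint8_t narrow_clipu_16to8(uint16_t v, unsigned shamt, bool *sat) {
    assert(shamt < 16 && sat != NULL);
    uint16_t shifted = (uint16_t)(v >> shamt);
    *sat = shifted > UINT8_MAX;         /* value does not fit the narrow type */
    return *sat ? UINT8_MAX : (uint8_t)shifted;
}
```

For the input 0x1234 shifted by 4, the narrowing shift discards the upper bits and yields 0x23, while the clip saturates to 0xFF and raises the flag.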

6. Rounding Modes (VXRM)

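Rounding behavior for fixed-point arithmetic and narrowing instructions is selected via vxrm_i, which encodes the four rounding modes defined by the RVV specification: round-to-nearest-up (rnu), round-to-nearest-even (rne), round-down, i.e. truncate (rdn), and round-to-odd (rod).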
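The RVV specification defines the rounding increment for a right shift by d bits in terms of the bits around position d. A minimal C sketch of that definition follows; the names vxrm_c, round_incr, and roundoff_unsigned are hypothetical, and the code models the specification's formula rather than the module's implementation.

```c
#include <assert.h>
#include <stdint.h>

/* vxrm encodings as given in the RVV specification. */
typedef enum { RNU = 0, RNE = 1, RDN = 2, ROD = 3 } vxrm_c;

/* Rounding increment for v >> d under the given mode. */
unsigned round_incr(uint64_t v, unsigned d, vxrm_c mode) {
    assert(d < 64);
    if (d == 0) return 0;                             /* nothing is discarded */
    unsigned bit_d   = (unsigned)((v >> d) & 1u);       /* LSB of the kept part */
    unsigned bit_dm1 = (unsigned)((v >> (d - 1)) & 1u); /* MSB of the discarded part */
    /* Is any bit below position d-1 set? */
    unsigned low_nz  = (v & ((1ULL << (d - 1)) - 1)) != 0;
    switch (mode) {
    case RNU: return bit_dm1;                         /* ties round up */
    case RNE: return bit_dm1 & (low_nz | bit_d);      /* ties round to even */
    case RDN: return 0;                               /* truncate */
    case ROD: return (!bit_d) & (bit_dm1 | low_nz);   /* force the LSB to 1 if inexact */
    }
    return 0;
}

/* Shift right by d and apply the rounding increment. */
uint64_t roundoff_unsigned(uint64_t v, unsigned d, vxrm_c mode) {
    return (v >> d) + round_incr(v, d, mode);
}
```

As a check, shifting 2 right by 2 bits (the exact value 0.5) gives 1 under rnu, 0 under rne, 0 under rdn, and 1 under rod.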


Assertions

  • The final assertion checks that DataWidth == $bits(alu_operand_t) to ensure 64-bit operation compatibility.


Instruction Categories

Instructions include but are not limited to:

Category          Examples
----------------  ------------------------------
Logical           VAND, VOR, VXOR
Arithmetic        VADD, VSUB
Comparison        VMSEQ, VMSLT, VMAX, VMIN, etc.
Saturating        VSADD, VSADDU, VSSUB, VSSUBU
Fixed-point       VASUB, VNCLIP, VSSRA, VSSRL
Merging/Masking   VMERGE, VMXOR, VMXNOR, etc.
Shift Operations  VSLL, VSRA, VNSRA, etc.


Design Considerations

  • Efficiency: Results are produced combinationally, and each element is computed independently of its neighbors.

  • Flexibility: Supports varied element widths and rounding behavior.

  • Masking Support: Integrated mask control for conditional computation.

  • Saturation Awareness: vxsat flags make it suitable for overflow-sensitive ops.

  • RISC-V RVV Compatible: Aligns with vector instruction format and control conventions.


Example Behavior (Pseudocode)

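// VADD with EW16: four independent 16-bit additions on one 64-bit word.
// opa, opb, and res are alu_operand_t values accessed through their w16 view.
for (int i = 0; i < 4; i++) {
    // Each sum wraps modulo 2^16; no carry crosses into the next element.
    res.w16[i] = opa.w16[i] + opb.w16[i];
}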