sldu — Ara’s slide unit, for permutations, shuffles, and slides
Overview
The Slide Unit (sldu) in Ara’s vector processor implements the vector slide instructions of the RISC-V Vector Extension (RVV). These instructions move elements within a vector register up or down (toward higher or lower element indices) by a configurable slide amount, called the stride in this document, and the unit supports operands with different effective element widths (EEWs). The design is modular and consists of three components:
sldu: The top-level Slide Unit module
sldu_op_dp: The datapath handling element reshuffling and shifting
p2_stride_gen: A utility module that generates power-of-two strides
Data flows from the lanes, through the unit, and back to the result queues under valid/ready handshakes, with any reshuffling performed internally.
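As a point of reference, the element-level semantics of the two basic slide instructions can be captured in a short Python model. It describes the RVV behavior, not Ara's hardware, and assumes unmasked execution, vstart = 0, vl equal to VLMAX, and a zero-initialized destination; the function names are illustrative.

```python
# Reference model of RVV vslideup/vslidedown semantics (not Ara's RTL).
# Assumptions: unmasked, vstart = 0, vl == VLMAX == len(vs2), and the
# destination starts as all zeros unless an explicit vd is passed in.

def vslideup(vs2, offset, vd=None):
    """vd[i] = vs2[i - offset] for offset <= i < vl; vd[i] unchanged for i < offset."""
    vl = len(vs2)
    out = list(vd) if vd is not None else [0] * vl
    for i in range(offset, vl):
        out[i] = vs2[i - offset]
    return out

def vslidedown(vs2, offset):
    """vd[i] = vs2[i + offset] if i + offset < VLMAX, else 0."""
    vlmax = len(vs2)
    return [vs2[i + offset] if i + offset < vlmax else 0 for i in range(vlmax)]

v = [10, 11, 12, 13, 14, 15, 16, 17]
print(vslideup(v, 3))    # [0, 0, 0, 10, 11, 12, 13, 14]
print(vslidedown(v, 3))  # [13, 14, 15, 16, 17, 0, 0, 0]
```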
1. sldu: Top-Level Slide Unit
Purpose
The sldu module serves as the interface and coordinator for the entire slide operation. It connects the operand input/output ports, manages the slide operation control logic, and integrates the datapath (sldu_op_dp) and the stride generator (p2_stride_gen).
Key Interfaces
Clock and Reset
clk_i, rst_ni: Clock and active-low reset (the _ni suffix marks an active-low input).
Operands
sldu_operand_i [NrLanes-1:0]: Operand vector from the lanes.
sldu_operand_valid_i, sldu_operand_ready_o: Valid/ready handshake for the operands.
sldu_result_o, sldu_result_valid_o, sldu_result_ready_i: Slide result vector and its handshake signals.
Control
vinsn_issue_i: Vector instruction information (EEW, SEW, etc.).
stride_valid_i, stride_update_i: Control inputs for stride progression.
stride_i: Incoming stride value.
Utility
stride_valid_o: Asserted if the stride value is a power of two.
popcount_o: Population count (number of set bits) of the stride vector.
Functionality
Integrates:
Datapath (sldu_op_dp) for reshuffling and sliding
Stride generator (p2_stride_gen) for generating the power-of-two stride sequence
Responds to stride updates and dynamically loads new strides.
The slide unit’s datapath can only handle power-of-two strides, so every non-power-of-two stride is broken down into a sequence of power-of-two strides, one per set bit (e.g., a slide by 5 is performed as a slide by 4 and a slide by 1). This keeps the slide unit’s interconnect lightweight while accelerating the common case, since non-power-of-two slides are extremely rare.
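A minimal Python sketch of this decomposition follows. It also checks why the approach is sound: slides compose additively, so a sequence of power-of-two slides produces the same result as one slide by their sum. The helper names, the largest-first order, and the zero-fill model of vacated elements are assumptions made for illustration.

```python
# Sketch: split a slide amount into power-of-two strides and verify that
# applying them in sequence equals one slide by the full amount.
# Assumptions: largest component first (any order works here), and vacated
# elements are zero-filled.

def p2_decompose(stride):
    """Return the power-of-two components of `stride`, largest first."""
    parts = []
    while stride:
        p2 = 1 << (stride.bit_length() - 1)  # highest set bit
        parts.append(p2)
        stride ^= p2                         # clear that bit
    return parts

def slidedown(vec, amount):
    """Element-level slide down with zero fill (simplified model)."""
    n = len(vec)
    return [vec[i + amount] if i + amount < n else 0 for i in range(n)]

def slidedown_p2(vec, amount):
    """Perform the same slide as a sequence of power-of-two slides."""
    for p2 in p2_decompose(amount):
        vec = slidedown(vec, p2)
    return vec

print(p2_decompose(13))  # [8, 4, 1]
v = list(range(1, 17))
assert slidedown_p2(v, 13) == slidedown(v, 13)
```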
The slide unit can also reshuffle, i.e., perform a slide by zero with different input and output element widths. This is used to change the byte layout of a vector register in the vector register file when its EEW changes.
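The following sketch illustrates why a reshuffle moves data at all. It uses a hypothetical, simplified layout (element i stored in lane i mod NrLanes, element bytes packed little-endian within a lane), not Ara's exact VRF mapping, and reports where each architectural byte sits before and after an EEW change.

```python
# Hypothetical, simplified VRF layout (NOT Ara's exact mapping):
# element i lives in lane i % nr_lanes, and its bytes are packed
# little-endian at offset (i // nr_lanes) * eew_bytes within that lane.

def byte_location(b, eew_bytes, nr_lanes):
    """Map architectural byte b of a vector register to (lane, byte offset in lane)."""
    elem, byte_in_elem = divmod(b, eew_bytes)
    lane = elem % nr_lanes
    offset = (elem // nr_lanes) * eew_bytes + byte_in_elem
    return lane, offset

def reshuffle_map(nbytes, eew_src, eew_dst, nr_lanes):
    """For each byte: its location under eew_src and its target under eew_dst."""
    return [(b, byte_location(b, eew_src, nr_lanes), byte_location(b, eew_dst, nr_lanes))
            for b in range(nbytes)]

for b, src, dst in reshuffle_map(8, eew_src=1, eew_dst=2, nr_lanes=4):
    print(f"byte {b}: (lane, offset) {src} -> {dst}")
```

With four lanes, byte 1 sits in lane 1 under EW8 but in lane 0 under EW16, so changing the EEW can require moving bytes across lanes.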
2. sldu_op_dp: Slide Operand Datapath
Purpose
This module implements the actual sliding logic of the operands, depending on:
Source and destination EEW (eew_src_i, eew_dst_i)
Direction (dir_i)
Slide amount (slamt_i)
It operates with flattened vectors (op_i_flat, op_o_flat) for simplified internal manipulation.
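The flattening can be modeled in Python as follows, assuming ELEN = 64 and that op_i_flat is the packed elen_t [NrLanes-1:0] array viewed as one bit vector, which places lane 0 in the least-significant bits. The helper names get_byte and set_byte are hypothetical; they mirror the +: 8 slices used in the datapath.

```python
# Sketch of the lane <-> flat-vector mapping (assumptions: ELEN = 64, and the
# packed array elen_t [NrLanes-1:0] puts lane 0 in the least-significant bits).

ELEN = 64
LANE_MASK = (1 << ELEN) - 1

def flatten(lanes):
    """Concatenate lane words into one integer: lane k -> bits [k*ELEN +: ELEN]."""
    flat = 0
    for k, word in enumerate(lanes):
        flat |= (word & LANE_MASK) << (k * ELEN)
    return flat

def unflatten(flat, nr_lanes):
    """Inverse of flatten()."""
    return [(flat >> (k * ELEN)) & LANE_MASK for k in range(nr_lanes)]

def get_byte(flat, idx):
    """Python equivalent of the SystemVerilog slice flat[8*idx +: 8]."""
    return (flat >> (8 * idx)) & 0xFF

def set_byte(flat, idx, value):
    """Python equivalent of the assignment flat[8*idx +: 8] = value."""
    return (flat & ~(0xFF << (8 * idx))) | ((value & 0xFF) << (8 * idx))

lanes = [0x0706050403020100, 0x0F0E0D0C0B0A0908]
flat = flatten(lanes)
print(hex(get_byte(flat, 9)))  # 0x9 (byte 1 of lane 1)
```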
Operation
Uses a large unique case block over {eew_src_i, eew_dst_i, slamt_i, dir_i} to pattern-match operations.
For each case, byte-wise manipulation (via +: 8 slices) rearranges bytes between source and destination.
The result is assigned to op_o_flat, which drives the module output.
Notable Features
Handles conversions across EEWs (e.g., EW8 → EW16)
To keep the datapath simple, it cannot slide and reshuffle in the same cycle
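A behavioral sketch of a single sldu_op_dp pass is shown below. It models bytes in architectural element order rather than Ara's physical layout, counts slamt in elements, zero-fills vacated bytes, and encodes the two documented restrictions (power-of-two slide amounts only, no slide and reshuffle together). In this element-order view a reshuffle leaves the byte sequence unchanged, because only the physical placement of the bytes changes. The function op_dp and its string-valued direction are hypothetical.

```python
# Behavioral sketch of one sldu_op_dp pass (hypothetical model, not the RTL).
# Assumptions: op_bytes lists bytes in architectural element order (byte k
# belongs to element k // eew), slamt counts elements, vacated bytes are
# zero-filled, and direction is the string "up" or "down".

def op_dp(op_bytes, eew_src, eew_dst, slamt, direction):
    """Slide or reshuffle a flattened byte vector; eew_* are given in bytes."""
    if direction not in ("up", "down"):
        raise ValueError("direction must be 'up' or 'down'")
    if eew_src != eew_dst and slamt != 0:
        # Documented restriction: no slide and reshuffle in the same cycle.
        raise ValueError("cannot slide and reshuffle in the same cycle")
    if slamt & (slamt - 1):
        # Documented restriction: only power-of-two strides reach the datapath.
        raise ValueError("slide amount must be zero or a power of two")
    n = len(op_bytes)
    shift = slamt * eew_src          # slide amount converted to bytes
    out = [0] * n
    for dst in range(n):
        src = dst + shift if direction == "down" else dst - shift
        if 0 <= src < n:
            # Analogous to op_o_flat[8*dst +: 8] = op_i_flat[8*src +: 8]
            out[dst] = op_bytes[src]
    return out

data = list(range(1, 9))
print(op_dp(data, 1, 1, 2, "down"))  # [3, 4, 5, 6, 7, 8, 0, 0]
print(op_dp(data, 2, 2, 1, "up"))    # [0, 0, 1, 2, 3, 4, 5, 6]
print(op_dp(data, 1, 2, 0, "down"))  # [1, 2, 3, 4, 5, 6, 7, 8] (reshuffle)
```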
3. p2_stride_gen: Power-of-Two Stride Generator
Purpose
This utility module generates stride vectors where exactly one bit is high (i.e., a power-of-two stride), and can sequentially generate the next stride on update_i.
Interfaces
Input
stride_i: A stride vector to load.
valid_i: Load enable.
update_i: Trigger to generate the next stride.
Output
stride_p2_o: Power-of-two stride vector.
valid_o: Indicates whether a valid (non-zero) stride is present.
popc_o: Population count of the stride bits.
Functionality
Uses:
popcount module to count the active bits in stride_i
lzc module to detect the first active bit
Computes the next stride by XORing the current stride with the power-of-two stride just emitted, which clears that bit.
Asserts valid_o while the stride is valid (non-zero).
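The behavior of p2_stride_gen can be approximated with the cycle-level Python model below. The class name, the largest-bit-first selection order, and the priority of valid_i over update_i are assumptions; the XOR update follows the description above. The usage also shows that the population count of the loaded stride equals the number of power-of-two passes.

```python
# Cycle-level sketch of p2_stride_gen (hypothetical Python model, not the RTL).
# Assumptions: the highest set bit is emitted first, and a load (valid_i)
# takes priority over an update (update_i) in the same cycle.

def popcount(x):
    """Number of set bits, as reported on popc_o for the incoming stride."""
    return bin(x).count("1")

class P2StrideGen:
    def __init__(self, width):
        self.mask = (1 << width) - 1
        self.stride = 0  # remaining (not yet emitted) stride bits

    @property
    def stride_p2_o(self):
        """One-hot output: highest set bit of the remaining stride, or 0."""
        return 1 << (self.stride.bit_length() - 1) if self.stride else 0

    @property
    def valid_o(self):
        """True while a non-zero stride remains."""
        return self.stride != 0

    def clock(self, stride_i=0, valid_i=False, update_i=False):
        """Model one rising clock edge."""
        if valid_i:
            self.stride = stride_i & self.mask  # load a new stride
        elif update_i:
            self.stride ^= self.stride_p2_o     # XOR clears the bit just emitted

gen = P2StrideGen(width=8)
gen.clock(stride_i=0b1011, valid_i=True)
passes = []
while gen.valid_o:
    passes.append(gen.stride_p2_o)
    gen.clock(update_i=True)
print(passes)                           # [8, 2, 1]
print(len(passes) == popcount(0b1011))  # True
```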
Signal Behavior Across Modules
vinsn_issue_i is propagated across modules to control EEW behavior and operand reshuffling.
sldu_op_dp interprets the slide direction (dir_i) and slide amount (slamt_i) to select the output permutation.
stride_p2_o sets the power-of-two slide amount of the current pass and thus which source element is selected for each destination element.
All data vectors (op_i, op_o) are organized as elen_t [NrLanes-1:0], allowing lane-based parallel operation.