sldu: Ara’s slide unit, for permutations, shuffles, and slides
Overview
The Slide Unit (sldu) in Ara’s vector processor is responsible for implementing vector slide instructions as specified in the RISC-V Vector Extension (RVV). These instructions shift elements within vector registers, either left or right, potentially with a configurable stride, and can support varying effective element widths (EEWs). The design is modular and consists of three components:
sldu: The top-level Slide Unit module
sldu_op_dp: The datapath handling element reshuffling and shifting
p2_stride_gen: A utility module that generates power-of-two strides
This unit supports seamless data flow between the operand lanes and result queues, handling valid/ready handshakes and internal reshuffling, aligning with the RVV specification.
1. sldu: Top-Level Slide Unit
Purpose
The sldu module serves as the interface and coordinator for the entire slide operation. It connects the operand input/output ports, manages the slide operation control logic, and integrates the datapath (sldu_op_dp) and the stride generator (p2_stride_gen).
Key Interfaces
Clock and Reset
clk_i, rst_ni: Clock and active-low reset, the standard synchronous design signals.
Operands
sldu_operand_i [NrLanes-1:0]: Operand vector from the lanes.
sldu_operand_valid_i, sldu_operand_ready_o: Valid/ready handshake.
sldu_result_o, sldu_result_valid_o, sldu_result_ready_i: Slide result vector and handshake signals.
Control
vinsn_issue_i: Vector instruction information (EEW, SEW, etc.).
stride_valid_i, stride_update_i: Control inputs for stride progression.
stride_i: Incoming stride value.
Utility
stride_valid_o: Asserted if the stride value is a power of two.
popcount_o: Population count (number of set bits) of the stride vector.
Functionality
Integrates:
Datapath (sldu_op_dp) for reshuffling and sliding
Stride Generator (p2_stride_gen) for power-of-two stride sequence generation
Responds to stride updates and dynamically loads new strides.
The slide unit’s datapath can only handle power-of-two strides. Every non-power-of-two stride is broken down into a sequence of power-of-two strides that are applied one after another. This keeps the interconnect datapath lightweight while accelerating the common case, since non-power-of-two slides are extremely rare.
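The decomposition can be illustrated with a small behavioural sketch in Python. It is not the RTL: the function names are hypothetical, the order in which the power-of-two components are applied (lowest bit first) is an assumption, and vacated positions are simply filled with zeros.

```python
from typing import List

def decompose_stride(stride: int) -> List[int]:
    """Split a non-negative stride into its power-of-two components.

    Assumption: components are produced lowest set bit first.
    """
    if stride < 0:
        raise ValueError("stride must be non-negative")
    parts = []
    remaining = stride
    while remaining:
        lowest = remaining & -remaining  # isolate the lowest set bit
        parts.append(lowest)
        remaining ^= lowest              # clear that bit
    return parts

def slide_down(elems: List[int], amount: int, fill: int = 0) -> List[int]:
    """Slide-down model: out[i] = elems[i + amount], `fill` past the end."""
    n = len(elems)
    return [elems[i + amount] if i + amount < n else fill for i in range(n)]

def slide_down_p2(elems: List[int], stride: int, fill: int = 0) -> List[int]:
    """Perform an arbitrary slide as a chain of power-of-two slides."""
    out = list(elems)
    for step in decompose_stride(stride):
        out = slide_down(out, step, fill)
    return out
```

For example, a slide by 5 becomes a slide by 1 followed by a slide by 4, and `slide_down_p2(list(range(8)), 5)` gives the same result as a single `slide_down(list(range(8)), 5)`.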
The slide unit can also reshuffle, i.e., perform a slide-by-zero with different input and output data widths. This is used to change the byte layout of a vector register in the vector register file.
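To make the idea of a reshuffle concrete, the following Python sketch uses a deliberately simplified layout in which element i of the vector is stored in lane i mod NrLanes. This layout, and the helper names `to_lanes`, `from_lanes`, and `reshuffle`, are assumptions for illustration only; Ara's actual byte mapping may differ.

```python
def to_lanes(data: bytes, eew_bytes: int, nr_lanes: int) -> list:
    """Distribute a contiguous byte string across lanes, element by element.

    Assumed layout: element i (eew_bytes wide) goes to lane i % nr_lanes.
    """
    assert len(data) % (eew_bytes * nr_lanes) == 0
    lanes = [bytearray() for _ in range(nr_lanes)]
    for i in range(len(data) // eew_bytes):
        lanes[i % nr_lanes] += data[i * eew_bytes:(i + 1) * eew_bytes]
    return lanes

def from_lanes(lanes: list, eew_bytes: int) -> bytes:
    """Inverse of to_lanes: recover the contiguous byte string."""
    nr_lanes = len(lanes)
    n_elems = sum(len(lane) for lane in lanes) // eew_bytes
    out = bytearray()
    for i in range(n_elems):
        off = (i // nr_lanes) * eew_bytes
        out += lanes[i % nr_lanes][off:off + eew_bytes]
    return bytes(out)

def reshuffle(lanes: list, eew_src: int, eew_dst: int) -> list:
    """Slide-by-zero that only changes the element width of the layout."""
    return to_lanes(from_lanes(lanes, eew_src), eew_dst, len(lanes))
```

With two lanes and eight bytes numbered 0 to 7, the 8-bit layout holds bytes 0, 2, 4, 6 in lane 0, while the 16-bit layout holds bytes 0, 1, 4, 5 there. The data is the same; only its placement in the lanes changes.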
2. sldu_op_dp: Slide Operand Datapath
Purpose
This module implements the actual sliding logic of the operands, depending on:
Source and destination EEW (eew_src_i, eew_dst_i)
Direction (dir_i)
Slide amount (slamt_i)
It operates with flattened vectors (op_i_flat, op_o_flat) for simplified internal manipulation.
Operation
Uses a large unique case block over {eew_src_i, eew_dst_i, slamt_i, dir_i} to pattern-match operations.
For each case, byte-wise manipulation (via +: 8 slices) rearranges bytes between source and destination.
The result is assigned back to the op_o_flat register, which is then returned to the module interface.
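The byte-wise movement on the flattened vector can be modelled in Python as below. This is a sketch under stated assumptions, not a transcription of the RTL: elen_t is taken to be 64 bits wide, source and destination EEW are equal (a pure slide), bytes that leave one end wrap around to the other, and the direction names "up" and "down" are hypothetical stand-ins for dir_i.

```python
ELEN_BITS = 64                      # assumption: elen_t is 64 bits wide
ELEN_MASK = (1 << ELEN_BITS) - 1

def flatten(op):
    """Pack per-lane words (lane 0 in the low bits) into one integer."""
    flat = 0
    for lane, word in enumerate(op):
        flat |= (word & ELEN_MASK) << (ELEN_BITS * lane)
    return flat

def unflatten(flat, nr_lanes):
    """Split the flat integer back into per-lane words."""
    return [(flat >> (ELEN_BITS * lane)) & ELEN_MASK for lane in range(nr_lanes)]

def byte_at(flat, idx):
    """Python equivalent of the SystemVerilog slice flat[8*idx +: 8]."""
    return (flat >> (8 * idx)) & 0xFF

def slide_bytes(op, slamt, direction):
    """Move every byte by slamt positions, wrapping around (assumption)."""
    if direction not in ("up", "down"):
        raise ValueError("direction must be 'up' or 'down'")
    nr_bytes = len(op) * ELEN_BITS // 8
    src = flatten(op)
    dst = 0
    offset = -slamt if direction == "up" else slamt
    for out_idx in range(nr_bytes):
        in_idx = (out_idx + offset) % nr_bytes
        dst |= byte_at(src, in_idx) << (8 * out_idx)
    return unflatten(dst, len(op))
```

In this model a "down" slide by one byte makes output byte i take input byte i + 1, and an "up" slide does the opposite. The hardware enumerates these mappings explicitly per case instead of computing indices in a loop.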
Notable Features
Handles conversions across EEWs (e.g., EW8 → EW16)
To keep the datapath simple, it cannot slide and reshuffle in the same cycle
3. p2_stride_gen: Power-of-Two Stride Generator
Purpose
This utility module generates stride vectors where exactly one bit is high (i.e., a power-of-two stride), and can sequentially generate the next stride on update_i.
Interfaces
Input
stride_i: A stride vector to load.
valid_i: Load enable.
update_i: Trigger to generate the next stride.
Output
stride_p2_o: Power-of-two stride vector
valid_o: Indicates if a valid (non-zero) stride is present
popc_o: Population count of stride bits
Functionality
Uses:
popcount module to count active bits in stride_i
lzc module to detect the first active bit
Computes the next stride by XORing the current with the last stride
Asserts valid_o if the stride is valid
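A behavioural model of this load/update sequence might look like the Python class below. It is a sketch, not the module itself: the class and method names are hypothetical, the 16-bit default width is arbitrary, the "first active bit" is assumed to be the lowest set bit, and the population count is assumed to be taken on the stride as loaded.

```python
class P2StrideGen:
    """Behavioural sketch of a power-of-two stride generator."""

    def __init__(self, width=16):
        self.mask = (1 << width) - 1
        self.remaining = 0   # stride bits not yet emitted
        self.popc = 0        # models popc_o (assumed: counted at load time)

    def load(self, stride):
        """Models valid_i: capture a new stride from stride_i."""
        self.remaining = stride & self.mask
        self.popc = bin(self.remaining).count("1")

    @property
    def stride_p2(self):
        """Models stride_p2_o: one-hot value of the first active bit.

        Assumption: the first active bit is the lowest set bit.
        """
        return self.remaining & -self.remaining

    @property
    def valid(self):
        """Models valid_o: a non-zero stride is still pending."""
        return self.remaining != 0

    def update(self):
        """Models update_i: XOR away the stride that was just emitted."""
        self.remaining ^= self.stride_p2
```

Loading the stride 0b1011 and calling `update()` while `valid` holds yields the one-hot strides 1, 2, and 8 in turn, after which `valid` drops.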
Signal Behavior Across Modules
vinsn_issue_i is propagated across modules to control EEW behaviors and operand reshuffling.
sldu_op_dp interprets the sliding direction (dir_i) and index (slamt_i) to select the output permutation.
stride_p2_o controls which element is selected during a stride-slide.
All data vectors (op_i, op_o) are organized as elen_t [NrLanes-1:0], allowing lane-based parallel operation.