vldu: Ara’s Vector Load Unit

The vldu module implements Ara’s Vector Load Unit. It is responsible for loading data from memory into the Vector Register File (VRF) by receiving memory transactions via the AXI R channel and delivering vector data, possibly masked, to the lanes. This unit supports:

  • Masked/unmasked vector loads

  • Multi-instruction pipelining with an internal instruction queue

  • AXI burst handling

  • Exception tracking and safe partial commits


Module Parameters

| Parameter | Description |
|---|---|
| NrLanes | Number of vector lanes. |
| VLEN | Vector register length in bits. |
| vaddr_t | Address type for vector register file addressing. |
| pe_req_t | Vector instruction request type. |
| pe_resp_t | Vector instruction response type. |
| AxiDataWidth | Width of the AXI data channel. |
| AxiAddrWidth | Width of the AXI address channel. |
| axi_r_t | AXI R-channel data type. |


Interfaces

Inputs

  • Clock & Reset: clk_i, rst_ni

  • Memory Load Channel: axi_r_i, axi_r_valid_i

  • Instruction Inputs:

    • pe_req_i, pe_req_valid_i: New vector instruction

    • pe_vinsn_running_i: Tracks active vector instructions

    • axi_addrgen_req_i, axi_addrgen_req_valid_i: Load address metadata

    • addrgen_illegal_load_i: Signals illegal access

  • Masking Support:

    • mask_i, mask_valid_i: Per-lane mask bytes

  • Flush: lsu_ex_flush_i

Outputs

  • AXI Handshake: axi_r_ready_o

  • Instruction Handshake: pe_req_ready_o

  • Memory Completion: load_complete_o

  • Response: pe_resp_o, ldu_current_burst_exception_o

  • Lane Interface:

    • ldu_result_req_o, ldu_result_addr_o, ldu_result_wdata_o

    • ldu_result_id_o, ldu_result_be_o


Internal Structure

1. Mask Cut

  • Uses spill_register_flushable for each lane.

  • Applies masking only when vm=0.

  • Ensures valid masks are acknowledged only when a masked instruction is issued.
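
The masking rule above can be sketched behaviorally in Python. This is an illustration, not Ara's RTL: the function names `byte_enable` and `mask_ready` are hypothetical, and the sketch assumes one mask bit per byte of a lane word. `vm` follows the RISC-V vector convention, where vm=0 marks a masked instruction.

```python
def byte_enable(valid_bytes, mask_bits, vm):
    """Per-byte write enables for one lane word (hypothetical helper).

    valid_bytes: list of bools, True where the byte carries loaded data.
    mask_bits:   list of bools, one mask bit per byte (assumed layout).
    vm:          1 for an unmasked instruction, 0 for a masked one.
    """
    if len(valid_bytes) != len(mask_bits):
        raise ValueError("expected one mask bit per byte")
    if vm == 1:
        # Unmasked: the mask operand is ignored entirely.
        return list(valid_bytes)
    # Masked: a byte is written only if it is valid AND its mask bit is set.
    return [v and m for v, m in zip(valid_bytes, mask_bits)]


def mask_ready(vm, issuing):
    """Acknowledge a mask word only while a masked instruction is issuing."""
    return bool(issuing) and vm == 0
```

With `vm=1` the mask bits have no effect, which is why the sketch acknowledges masks only for masked instructions: an unmasked load must not consume mask words meant for a later instruction.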

2. Vector Instruction Queue (VIQ)

  • Three pointers:

    • accept_pnt, issue_pnt, commit_pnt

  • Accepts instructions and issues them sequentially.

  • Maintains counts of inflight and committed instructions.

  • Separate counters track committed/issued instructions and their remaining byte loads.
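
A minimal Python model of a three-pointer instruction queue may help picture how accept, issue, and commit advance independently. Only the pointer names `accept_pnt`, `issue_pnt`, and `commit_pnt` come from the description above; the class name, the counter names, and the method names are assumptions made for this sketch, and the per-instruction byte counters are left out.

```python
class VectorInsnQueue:
    """Behavioral sketch of a circular queue with three pointers."""

    def __init__(self, depth):
        self.depth = depth
        self.slots = [None] * depth
        self.accept_pnt = 0   # next free slot
        self.issue_pnt = 0    # instruction currently receiving data
        self.commit_pnt = 0   # instruction currently writing to the VRF
        self.issue_cnt = 0    # accepted but not yet fully issued
        self.commit_cnt = 0   # accepted but not yet fully committed

    def accept(self, insn):
        """Store a new instruction; return False if the queue is full."""
        if self.commit_cnt == self.depth:
            return False
        self.slots[self.accept_pnt] = insn
        self.accept_pnt = (self.accept_pnt + 1) % self.depth
        self.issue_cnt += 1
        self.commit_cnt += 1
        return True

    def finish_issue(self):
        """Retire the issue phase of the oldest unissued instruction."""
        if self.issue_cnt == 0:
            raise RuntimeError("nothing to issue")
        insn = self.slots[self.issue_pnt]
        self.issue_pnt = (self.issue_pnt + 1) % self.depth
        self.issue_cnt -= 1
        return insn

    def finish_commit(self):
        """Retire the oldest instruction; commit never overtakes issue."""
        if self.commit_cnt - self.issue_cnt == 0:
            raise RuntimeError("no issued instruction waiting to commit")
        insn = self.slots[self.commit_pnt]
        self.commit_pnt = (self.commit_pnt + 1) % self.depth
        self.commit_cnt -= 1
        return insn
```

A slot is freed only on commit, so the queue reports full based on the commit counter rather than the issue counter.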

3. Result Queue (RQ)

  • Per-lane dual-entry queue buffering data before final commitment.

  • Data is written to VRF only after final grants (ldu_result_final_gnt_i) are received.

  • Supports partial writes for vstart > 0.

4. AXI Data Reception

  • Data is read beat-by-beat.

  • Beat slicing is calculated with beat_lower_byte and beat_upper_byte.

  • Data is shuffled using shuffle_index based on element size (vsew).

  • Per-lane address and ID are calculated and stored in result_queue.
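
Beat slicing can be illustrated with a small Python helper that computes which bytes of each AXI beat carry useful data. This is a sketch under stated assumptions, not the RTL: the function name `beat_byte_range` and its arguments are hypothetical, and the range is returned half-open as `(lower, upper)`, which may differ from how `beat_lower_byte` and `beat_upper_byte` are encoded in the hardware. The shuffle step is not modeled.

```python
def beat_byte_range(addr, length_bytes, beat_idx, axi_bytes):
    """Valid byte range [lower, upper) within beat `beat_idx` of a burst.

    addr:         byte address of the first requested byte.
    length_bytes: number of bytes requested by the burst.
    beat_idx:     zero-based index of the beat within the burst.
    axi_bytes:    width of the AXI data bus in bytes.
    """
    first_offset = addr % axi_bytes
    # Bytes counted from the bus-aligned start of the first beat.
    total = first_offset + length_bytes
    # Only the first beat can start in the middle of the bus word.
    lower = first_offset if beat_idx == 0 else 0
    # The last beat may end before the end of the bus word.
    upper = min(axi_bytes, total - beat_idx * axi_bytes)
    if upper <= lower:
        raise ValueError("beat index is past the end of the burst")
    return lower, upper
```

For example, a 20-byte load starting at address 0x1004 on a 16-byte bus uses bytes 4 to 15 of the first beat and bytes 0 to 7 of the second, for 12 + 8 = 20 bytes in total.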

5. VRF Commit Logic

  • All data must be granted and acknowledged before commit.

  • Updates commit counters and triggers load_complete_o.
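
The "all lanes granted before commit" condition can be sketched as a per-lane grant tracker. The sketch assumes that final grants for different lanes may arrive in different cycles and are therefore latched until every lane has reported; the class and method names are hypothetical.

```python
class FinalGrantTracker:
    """Latch per-lane final grants until every lane has reported."""

    def __init__(self, nr_lanes):
        self.nr_lanes = nr_lanes
        self.granted = [False] * nr_lanes

    def observe(self, final_gnt):
        """Record this cycle's grants; return True when a commit may fire.

        final_gnt: list of bools, one per lane, for the current cycle.
        """
        if len(final_gnt) != self.nr_lanes:
            raise ValueError("expected one grant bit per lane")
        for lane, gnt in enumerate(final_gnt):
            self.granted[lane] = self.granted[lane] or gnt
        if all(self.granted):
            # Every lane acknowledged its write: clear the latches for the
            # next result-queue entry and allow the commit counters to update.
            self.granted = [False] * self.nr_lanes
            return True
        return False
```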

6. Exception Handling FSM

  • States:

    • IDLE

    • VALID_RESULT_QUEUE

    • WAIT_RESULT_QUEUE

    • HANDLE_EXCEPTION

  • Ensures partially buffered results are committed before signaling an exception.

  • Keeps ldu_current_burst_exception_o accurate for safe exception replay.
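
The four states can be arranged into a small transition function to show one way "commit partial results, then signal" might be sequenced. Only the state names come from the list above. The transitions, the function name `next_state`, and its inputs `exception`, `rq_partial`, and `rq_empty` are assumptions made for illustration and may not match the RTL.

```python
from enum import Enum, auto


class LduExState(Enum):
    IDLE = auto()
    VALID_RESULT_QUEUE = auto()
    WAIT_RESULT_QUEUE = auto()
    HANDLE_EXCEPTION = auto()


def next_state(state, exception, rq_partial, rq_empty):
    """Assumed transition function for the exception FSM.

    exception:  an exception was reported for the current burst.
    rq_partial: the result queue holds a partially filled entry.
    rq_empty:   the result queue has been fully written to the VRF.
    """
    if state is LduExState.IDLE:
        if not exception:
            return LduExState.IDLE
        # Partial data must be pushed out before the exception is reported.
        if rq_partial:
            return LduExState.VALID_RESULT_QUEUE
        return LduExState.HANDLE_EXCEPTION
    if state is LduExState.VALID_RESULT_QUEUE:
        # The partial entry has been marked valid; wait for it to drain.
        return LduExState.WAIT_RESULT_QUEUE
    if state is LduExState.WAIT_RESULT_QUEUE:
        if rq_empty:
            return LduExState.HANDLE_EXCEPTION
        return LduExState.WAIT_RESULT_QUEUE
    # HANDLE_EXCEPTION: report the exception, then return to idle.
    return LduExState.IDLE
```

In this arrangement the exception is reported only from HANDLE_EXCEPTION, which is reached after the result queue has drained.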


Instruction Lifecycle

  1. Accept: A valid pe_req_i is accepted if the VIQ has space and the request's VFU field targets the load unit.

  2. Issue: Begins consuming AXI read data. If the instruction is masked (vm=0), it also consumes mask bits from mask_i.

  3. AXI Read: Transfers data beat-by-beat to result queue.

  4. VRF Commit: Writes to VRF after grant. Signals completion.

  5. Exception: If an exception occurs mid-load, the exception handling FSM commits the partially buffered results before the exception is signaled.


Design Considerations

  • Masking Support: Integrated at per-byte level using per-lane strobes.

  • Pipeline Decoupling: Three-phase VIQ lets accept, issue, and commit progress independently.

  • Exception Robustness: Partially buffered results are committed before a fault is signaled, so a fault does not corrupt VRF data.

  • Performance: Decouples address generation, AXI, and VRF phases to maximize throughput.

  • Alignment & vstart: The first beat of a load accounts for a misaligned start address and for vstart > 0, so only the valid bytes are written.