vmfpu — Instantiate in-lane SIMD FPU, SIMD multiplier, and SIMD divider (pipelined or multi-cycle)

Module Name: vmfpu


Summary

The vmfpu module is part of the Ara RISC-V vector processor. It implements the Vector Multiply and Floating-Point Unit (VMFPU), which executes multiplication, division, and floating-point operations. It operates on elements read from the vector register file, is controlled by the vector instruction sequencer, and receives its operands through operand queues.

This module receives instructions, manages multiple operand queues, schedules operations on sub-units (such as FP multipliers and ALUs), and routes results back to the operand requester for write-back to the vector register file. It is the unit through which Ara applies vector-level parallelism to floating-point and multiplication instructions.


Interface Description

Clocking and Reset

Signal   Direction   Description
clk_i    Input       Clock signal
rst_ni   Input       Active-low asynchronous reset

Operand Queue Interface

Signal            Direction   Description
operand_valid_i   Input       Valid flags, one bit per operand queue
operand_i         Input       Operands read from the vector register file
operand_ready_o   Output      Ready handshake for each operand queue

Result Interface

Signal               Direction   Description
result_req_o         Output      Valid signal requesting a result write
result_id_o          Output      ID of the instruction producing the result
result_addr_o        Output      VRF write address
result_wdata_o       Output      Result data to be written
result_be_o          Output      Byte enable signal
result_gnt_i         Input       Grant from operand requester
result_final_gnt_i   Input       Final commit acknowledgment from operand requester


Functional Blocks Overview

1. Operand Queue Decoding

The module reads from several operand queues concurrently. Each bit of operand_valid_i reports whether the corresponding queue holds a valid operand, so several bits can be set in the same cycle. The control FSM uses these bits to accept operand data and latch it into internal registers.

This mechanism ensures that the pipeline only proceeds when all necessary operands are ready.
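
The "all necessary operands are ready" condition can be sketched as a mask comparison. This is a minimal illustration and not the Ara source: the module name operands_ready_sketch, the default of three queues, and the operand_needed_i mask are assumptions made for the example.

```systemverilog
// Hypothetical sketch: an instruction may issue only when every operand
// queue it actually reads holds a valid operand.
module operands_ready_sketch #(
  parameter int unsigned NrOperandQueues = 3 // assumed: up to three source operands
) (
  input  logic [NrOperandQueues-1:0] operand_valid_i,  // one valid bit per queue
  input  logic [NrOperandQueues-1:0] operand_needed_i, // assumed mask: queues this instruction reads
  output logic                       all_ready_o
);
  // A queue that is not needed never blocks issue; a needed queue must be valid.
  assign all_ready_o = &(operand_valid_i | ~operand_needed_i);
endmodule
```

For a two-operand multiply with operand_needed_i = 3'b011, all_ready_o goes high once the two low bits of operand_valid_i are set, whatever the state of the third queue.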

2. FSM (Finite State Machine) Control Logic

The FSM implements several states:

  • IDLE: Waits for operands to be valid.

  • WAIT: Waits for all operands to be latched and for a backend (like FPU/MUL) to be free.

  • EXECUTE: Passes the operands to the compute unit and asserts a valid signal.

  • DONE: Waits for the result grant handshake.

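Each transition is gated by one of three conditions: operand readiness, backend availability, or the result grant handshake. An instruction therefore cannot reach EXECUTE before its operands are latched, and the FSM cannot return to IDLE before the result has been granted.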
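
The four states listed above could be coded roughly as follows. This is a sketch under stated assumptions rather than the module's actual control logic: the module name, the condensed status inputs (operands_valid_i, operands_latched_i, backend_free_i), the single-cycle EXECUTE state, and the reset style are all chosen for illustration.

```systemverilog
// Hypothetical sketch of the IDLE -> WAIT -> EXECUTE -> DONE control flow.
module vmfpu_fsm_sketch (
  input  logic clk_i,
  input  logic rst_ni,             // reset style is an assumption of this sketch
  input  logic operands_valid_i,   // assumed: operands have started to arrive
  input  logic operands_latched_i, // assumed: every required operand is latched
  input  logic backend_free_i,     // assumed: selected compute unit can accept work
  input  logic result_gnt_i,       // result accepted downstream
  output logic backend_valid_o     // high while operands are handed to the unit
);
  typedef enum logic [1:0] {IDLE, WAIT, EXECUTE, DONE} state_e;
  state_e state_q, state_d;

  always_comb begin
    state_d         = state_q; // hold the current state by default
    backend_valid_o = 1'b0;
    unique case (state_q)
      IDLE:    if (operands_valid_i) state_d = WAIT;
      WAIT:    if (operands_latched_i && backend_free_i) state_d = EXECUTE;
      EXECUTE: begin
        backend_valid_o = 1'b1; // assumed: issue takes a single cycle
        state_d         = DONE;
      end
      DONE:    if (result_gnt_i) state_d = IDLE;
      default: state_d = IDLE;
    endcase
  end

  always_ff @(posedge clk_i or negedge rst_ni) begin
    if (!rst_ni) state_q <= IDLE;
    else         state_q <= state_d;
  end
endmodule
```

Keeping the next-state logic in always_comb and the state register in always_ff separates the transition conditions from the storage element, which makes each guard easy to compare against the state list.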

3. Backend Compute Unit Selection

Depending on the instruction:

  • Integer multiplication might be routed to the Mul unit.

  • Floating point computation might use FP ALU, FP MUL, or FP FMA.

These units are instantiated inside the vmfpu module and are connected to its control logic through operand and result interfaces. The selection depends on the target_fu field and the operation code carried by the vector instruction.

4. Stream Registers

Each result-producing unit feeds into a stream register, which buffers results until downstream units acknowledge the result (result_gnt_i), avoiding backpressure to the compute units.

These registers decouple execution and result storage, facilitating pipeline throughput.
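
To make the buffering behaviour concrete, here is a minimal one-entry valid/ready register. It is written out for illustration only and is not the register instantiated in the Ara source; the module name, the default payload type, and the reset style are assumptions.

```systemverilog
// Hypothetical one-entry valid/ready buffer showing how a result is held
// until the consumer accepts it.
module stream_reg_sketch #(
  parameter type T = logic [63:0] // assumed payload type (must be packed)
) (
  input  logic clk_i,
  input  logic rst_ni,
  input  logic valid_i, // producer has a result
  output logic ready_o, // buffer can take it
  input  T     data_i,
  output logic valid_o, // a buffered result is available
  input  logic ready_i, // consumer accepts it (for example, the result grant)
  output T     data_o
);
  logic full_q;
  T     data_q;

  // Accept new data when empty, or when the stored item leaves this cycle.
  assign ready_o = !full_q || ready_i;
  assign valid_o = full_q;
  assign data_o  = data_q;

  always_ff @(posedge clk_i or negedge rst_ni) begin
    if (!rst_ni) begin
      full_q <= 1'b0;
      data_q <= '0;
    end else if (ready_o) begin
      full_q <= valid_i;              // becomes empty if nothing new arrives
      if (valid_i) data_q <= data_i;  // capture the incoming result
    end
  end
endmodule
```

This variant registers the data and valid path but leaves ready_o combinationally dependent on ready_i. Cutting the ready path as well would need a second storage entry.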


Code Walkthrough

Module Declaration and Parameters

module vmfpu import ara_pkg::*; ...

The module imports Ara and RVV packages, ensuring access to operand types, configuration constants, and hardware definitions.

It uses several parameters and types:

  • NrLanes and VLEN (value parameters) and vaddr_t (a type parameter): fix the structural configuration of the hardware.

  • operand_queue_cmd_t, operand_request_cmd_t: encapsulate control signals between scheduler and operand queues.


Registers and Internal Signals

Internal signals such as fsm_state_q, operand_valid, and stream_reg_payload track FSM state, operand readiness, and result data, respectively.

These are updated in sequential logic blocks and control combinational data paths.


Operand Gathering and Arbitration

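Each operand queue's valid bit is checked every cycle, and the module determines from these bits whether the operands required by the current instruction are present.

Once valid operands are detected, the FSM moves to WAIT and then, when all operands are latched and a backend is free, to EXECUTE.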


Computation Dispatch

Operations are dispatched to the correct backend depending on instruction decoding logic:

case (target_fu)
  FPU: begin  // floating-point operation
    ...
  end
  MUL: begin  // integer multiplication
    ...
  end
  default: ;  // no backend selected
endcase

Here, target_fu determines which functional unit receives the operands.
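
A fuller version of this selection, steering one issue request to exactly one backend, might look like the sketch below. The package, the single-bit target_fu_e encoding, and all port names other than target_fu are hypothetical and stand in for whatever the real decode logic provides.

```systemverilog
// Hypothetical sketch: route an issue handshake to the unit named by target_fu.
package dispatch_sketch_pkg;
  typedef enum logic [0:0] {FPU, MUL} target_fu_e; // assumed encoding
endpackage

module fu_dispatch_sketch import dispatch_sketch_pkg::*; (
  input  target_fu_e target_fu_i,
  input  logic       issue_valid_i, // assumed: operands are ready to issue
  input  logic       fpu_ready_i,
  input  logic       mul_ready_i,
  output logic       fpu_valid_o,
  output logic       mul_valid_o,
  output logic       issue_ready_o
);
  always_comb begin
    // Default: no unit receives a request.
    fpu_valid_o   = 1'b0;
    mul_valid_o   = 1'b0;
    issue_ready_o = 1'b0;
    unique case (target_fu_i)
      FPU: begin
        fpu_valid_o   = issue_valid_i;
        issue_ready_o = fpu_ready_i;
      end
      MUL: begin
        mul_valid_o   = issue_valid_i;
        issue_ready_o = mul_ready_i;
      end
      default: ; // no unit selected
    endcase
  end
endmodule
```

Assigning defaults before the case statement guarantees that the unselected unit never sees a valid request and avoids inferred latches.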


Result Capture and Output

Results are written to output ports using stream registers:

stream_register #(...) i_fpu_stream_reg (
  ...
  .valid_i   (fpu_result_valid),
  .data_i    ({id, addr, result}),
  ...
);

Once the result is accepted by the downstream module, the FSM returns to IDLE.
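
The concatenation {id, addr, result} passed through the stream register can also be expressed as a packed struct, so that the fields are recovered by name on the output side. The sketch below assumes placeholder field widths and hypothetical package and module names; only the result_* port names come from the interface description.

```systemverilog
// Hypothetical sketch: a named payload type for the result stream register.
package result_payload_sketch_pkg;
  // Field widths are placeholders chosen for this sketch only.
  typedef struct packed {
    logic [2:0]  id;     // ID of the instruction producing the result
    logic [9:0]  addr;   // VRF write address
    logic [63:0] result; // result data
  } result_payload_t;
endpackage

module result_unpack_sketch import result_payload_sketch_pkg::*; (
  input  result_payload_t payload_i,     // data output of the stream register
  output logic [2:0]      result_id_o,
  output logic [9:0]      result_addr_o,
  output logic [63:0]     result_wdata_o
);
  // The first struct member occupies the most significant bits, so the
  // layout matches the concatenation {id, addr, result}.
  assign result_id_o    = payload_i.id;
  assign result_addr_o  = payload_i.addr;
  assign result_wdata_o = payload_i.result;
endmodule
```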