Hardware Processing Engines: Concept and IPs#

Hardware Processing Engines (HWPEs) are special-purpose, memory-coupled accelerators that can be inserted in the SoC or cluster of a PULP system to amplify its performance and energy efficiency in particular tasks.

Unlike most accelerators in the literature, HWPEs do not rely on an external DMA to feed them input and extract output, and they are not (necessarily) tied to a single core. Rather, they operate directly on the same memory that is shared by other elements in the PULP system (e.g. the L1 TCDM in a PULP cluster, or the shared L2 in PULPissimo). Their control is memory-mapped and accessed through a peripheral bus or interconnect. HW-based execution on an HWPE can be readily intermixed with software code, because all that needs to be exchanged between the two is a set of pointers and, if necessary, a few parameters.
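The software side of such an offload can be sketched as follows. This is a minimal illustration only: the register names and offsets (HWPE_REG_TRIGGER, HWPE_REG_IN_PTR, and so on) are hypothetical and do not describe the register map of any real HWPE.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical register word offsets: NOT taken from any real HWPE. */
enum {
    HWPE_REG_TRIGGER = 0,  /* write 1 to start a job                 */
    HWPE_REG_STATUS  = 1,  /* reads non-zero while a job is running  */
    HWPE_REG_IN_PTR  = 2,  /* address of the input buffer in memory  */
    HWPE_REG_OUT_PTR = 3,  /* address of the output buffer in memory */
    HWPE_REG_LEN     = 4,  /* number of elements to process          */
    HWPE_NUM_REGS    = 5
};

/* Program one job: only pointers and one parameter cross the HW/SW boundary. */
void hwpe_offload(volatile uint32_t *regs,
                  uint32_t in_ptr, uint32_t out_ptr, uint32_t len)
{
    assert(regs);
    regs[HWPE_REG_IN_PTR]  = in_ptr;
    regs[HWPE_REG_OUT_PTR] = out_ptr;
    regs[HWPE_REG_LEN]     = len;
    regs[HWPE_REG_TRIGGER] = 1;  /* written last, once all arguments are set */
}

/* Returns non-zero while the (hypothetical) status register reports a job. */
int hwpe_busy(volatile uint32_t *regs)
{
    assert(regs);
    return regs[HWPE_REG_STATUS] != 0;
}
```

In a real system, regs would point at the memory-mapped base address of the HWPE control interface, and the core would be free to run other code until hwpe_busy returns 0.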

_images/hwpe.png

Fig. 6 Template of a Hardware Processing Engine (HWPE).#

This document defines the interface protocols and modules that are used to connect HWPEs to a PULP system. Typically, an HWPE is divided into a streamer interface towards the memory system, a control/peripheral interface used for programming it, and an engine containing the actual datapath of the accelerator.

HWPE Interface Modules: Data Movement & Marshaling#

Basic modules (HWPE-Stream)#

Basic HWPE-Stream management modules are used to select among multiple streams, merge multiple streams into one, split a stream into multiple ones, synchronize their handshakes, and provide similar basic “morphing” functionality, or to delay and enqueue streams. Modules performing these functions can be found within the rtl/basic and rtl/fifo subfolders of the hwpe-stream repository.

hwpe_stream_merge#

_images/hwpe_stream_merge.sv.png

The hwpe_stream_merge module is used to merge NB_IN_STREAMS input streams into a single, wider stream. The data and strb channels from the input streams are bound in order, and the valid is generated as the AND of the valid signals of all input streams. The ready is broadcast from the output stream to all input streams.

A typical use of this module is to merge NB_IN_STREAMS 32-bit streams coming from a TCDM load interface into a single, wider stream.
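The merge behavior described above can be modeled in a few lines of C. This is a sketch of one combinational evaluation, not the RTL; the type and function names are illustrative, and placing input stream 0 in the least significant word is an assumption.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NB_IN_STREAMS 2  /* default from the parameter table */

typedef struct {
    uint32_t data;   /* 32-bit payload               */
    uint8_t  strb;   /* one strobe bit per data byte */
    bool     valid;
} stream32_t;

typedef struct {
    uint32_t data[NB_IN_STREAMS];  /* data[i]: word i of the merged stream (assumed order) */
    uint8_t  strb[NB_IN_STREAMS];
    bool     valid;
} merged_t;

/* One combinational evaluation; ready_in[] receives the broadcast ready. */
merged_t stream_merge(const stream32_t in[NB_IN_STREAMS], bool out_ready,
                      bool ready_in[NB_IN_STREAMS])
{
    merged_t out;
    assert(in && ready_in);
    out.valid = true;
    for (int i = 0; i < NB_IN_STREAMS; i++) {
        out.data[i] = in[i].data;                /* data bound in order */
        out.strb[i] = in[i].strb;                /* strb bound in order */
        out.valid   = out.valid && in[i].valid;  /* AND of all valids   */
        ready_in[i] = out_ready;                 /* ready is broadcast  */
    }
    return out;
}
```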

The following shows an example of the hwpe_stream_merge operation:

Table 8 hwpe_stream_merge design-time parameters.#

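=============  =======  =======================================
Name           Default  Description
=============  =======  =======================================
NB_IN_STREAMS  2        Number of input HWPE-Stream streams.
DATA_WIDTH_IN  32       Width of the input HWPE-Stream streams.
=============  =======  =======================================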

hwpe_stream_split#

_images/hwpe_stream_split.sv.png

The hwpe_stream_split module is used to split a single stream into NB_OUT_STREAMS 32-bit output streams. The data and strb channels of the input stream are split, in order, across the output streams, and the valid is broadcast to all output streams. The ready is generated as the AND of the ready signals of all output streams.

A typical use of this module is to take a multiple-of-32-bit stream coming from within the HWPE and split it into multiple 32-bit streams that feed a TCDM store interface.
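A behavioral sketch of the split, mirroring the description above: valid is broadcast and ready is the AND of all output readies. The names are illustrative, and NB_OUT_STREAMS is set to 4 here purely as an assumption so that a 128-bit input maps onto 32-bit outputs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NB_OUT_STREAMS 4  /* assumed: 128-bit input = 4 x 32-bit outputs */

typedef struct {
    uint32_t data[NB_OUT_STREAMS];  /* data[i]: word i of the wide stream */
    uint8_t  strb[NB_OUT_STREAMS];
    bool     valid;
} wide_t;

typedef struct {
    uint32_t data;
    uint8_t  strb;
    bool     valid;
} stream32_t;

/* One combinational evaluation; returns the ready seen by the input stream. */
bool stream_split(const wide_t *in, const bool out_ready[NB_OUT_STREAMS],
                  stream32_t out[NB_OUT_STREAMS])
{
    bool in_ready = true;
    assert(in && out_ready && out);
    for (int i = 0; i < NB_OUT_STREAMS; i++) {
        out[i].data  = in->data[i];              /* ordered slices      */
        out[i].strb  = in->strb[i];
        out[i].valid = in->valid;                /* valid is broadcast  */
        in_ready     = in_ready && out_ready[i]; /* AND of all readies  */
    }
    return in_ready;
}
```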

The following shows an example of the hwpe_stream_split operation:

Table 9 hwpe_stream_split design-time parameters.#

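==============  =======  =====================================
Name            Default  Description
==============  =======  =====================================
NB_OUT_STREAMS  2        Number of output HWPE-Stream streams.
DATA_WIDTH_IN   128      Width of the input HWPE-Stream stream.
==============  =======  =====================================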

hwpe_stream_fence#

_images/hwpe_stream_fence.sv.png

The hwpe_stream_fence module is used to synchronize the handshake between NB_STREAMS streams. This is necessary, for example, when multiple 32-bit streams are produced from separate TCDM accesses and have to be joined into a single, wider stream.

Table 10 hwpe_stream_fence design-time parameters.#

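==========  =======  ===========================================
Name        Default  Description
==========  =======  ===========================================
NB_STREAMS  2        Number of input/output HWPE-Stream streams.
DATA_WIDTH  32       Width of the HWPE-Stream streams.
==========  =======  ===========================================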

hwpe_stream_mux_static#

_images/hwpe_stream_mux_static.sv.png

The hwpe_stream_mux_static module is used to statically propagate one of 2 input streams of size DATA_SIZE into a single output stream. The multiplexer is static as the selection bit sel_i cannot be changed when there are transactions in flight; if the selection bit is changed when transactions are in flight, the result is undefined.

The following shows an example of the hwpe_stream_mux_static operation:

hwpe_stream_demux_static#

_images/hwpe_stream_demux_static.sv.png

The hwpe_stream_demux_static module is used to propagate a single input stream of size DATA_SIZE into one of NB_OUT_STREAMS output streams. The non-selected output streams are all invalid. The demultiplexer is static as the selection bit sel_i cannot be changed when there are transactions in flight; if the selection bit is changed when transactions are in flight, the result is undefined.

The following shows an example of the hwpe_stream_demux_static operation:

Table 11 hwpe_stream_demux_static design-time parameters.#

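==============  =======  =====================================
Name            Default  Description
==============  =======  =====================================
NB_OUT_STREAMS  2        Number of output HWPE-Stream streams.
==============  =======  =====================================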

hwpe_stream_fifo#

_images/hwpe_stream_fifo.sv.png

The hwpe_stream_fifo module implements a hardware FIFO queue for HWPE-Stream streams, used to withstand data scarcity (valid=0) or backpressure (ready=0), decoupling two architectural domains. This FIFO is single-clock and therefore cannot be used to cross two distinct clock domains. The FIFO will lower its ready signal on the input stream push_i interface when it is completely full, and will lower its valid signal on the output stream pop_o interface when it is completely empty.
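The handshake behavior at the two ends of the FIFO can be illustrated with a small cycle-level model. It is a sketch under simplifying assumptions (a ring buffer of 32-bit words, with ready and valid derived from the occupancy before the clock edge), and its names are not taken from the RTL.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define FIFO_DEPTH 8  /* default from the parameter table */

typedef struct {
    uint32_t mem[FIFO_DEPTH];
    unsigned head;   /* index of the oldest element */
    unsigned count;  /* number of stored elements   */
} fifo_t;

/* ready on the push side: low only when the FIFO is completely full */
bool fifo_push_ready(const fifo_t *f) { return f->count < FIFO_DEPTH; }

/* valid on the pop side: low only when the FIFO is completely empty */
bool fifo_pop_valid(const fifo_t *f) { return f->count > 0; }

/* One clock edge. Returns true if an element was popped into *pop_data. */
bool fifo_cycle(fifo_t *f, bool push_valid, uint32_t push_data,
                bool pop_ready, uint32_t *pop_data)
{
    assert(f && pop_data);
    /* Both handshakes are evaluated on the state before the edge. */
    bool do_push = push_valid && fifo_push_ready(f);
    bool do_pop  = pop_ready && fifo_pop_valid(f);
    if (do_pop) {
        *pop_data = f->mem[f->head];
        f->head = (f->head + 1) % FIFO_DEPTH;
        f->count--;
    }
    if (do_push) {
        f->mem[(f->head + f->count) % FIFO_DEPTH] = push_data;
        f->count++;
    }
    return do_pop;
}
```

Note that in this model a full FIFO rejects a push even when a pop happens in the same cycle, which matches a ready signal that depends only on the current occupancy.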

Table 12 hwpe_stream_fifo design-time parameters.#

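====================  =======  ========================================
Name                  Default  Description
====================  =======  ========================================
DATA_WIDTH            32       Width of the HWPE-Streams (typically multiple of 32, but this module does not care).
FIFO_DEPTH            8        Depth of the FIFO queue (multiple of 2).
LATCH_FIFO            0        If 1, use latches instead of flip-flops (requires special constraints in synthesis).
LATCH_FIFO_TEST_WRAP  0        If 1 and LATCH_FIFO is 1, wrap latches with BIST wrappers.
====================  =======  ========================================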

Table 13 hwpe_stream_fifo output flags.#

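============  ==========  =================================
Name          Type        Description
============  ==========  =================================
empty         logic       1 if the FIFO is currently empty.
full          logic       1 if the FIFO is currently full.
push_pointer  logic[7:0]  Unused.
pop_pointer   logic[7:0]  Unused.
============  ==========  =================================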

hwpe_stream_fifo_earlystall#

_images/hwpe_stream_fifo_earlystall.sv.png

The hwpe_stream_fifo_earlystall module implements a hardware FIFO queue for HWPE-Stream streams, used to withstand data scarcity (valid=0) or backpressure (ready=0), decoupling two architectural domains. This FIFO is single-clock and therefore cannot be used to cross two distinct clock domains. The only difference with respect to hwpe_stream_fifo is that this version of the FIFO lowers its ready signal one cycle earlier, i.e. when it holds FIFO_DEPTH-1 elements. It will lower its valid signal on the output stream pop_o interface when it is completely empty.
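The difference between the two FIFO variants comes down to the occupancy at which ready is lowered. A minimal sketch, with illustrative function names:

```c
#include <assert.h>
#include <stdbool.h>

/* Regular FIFO: ready is low only when all fifo_depth slots are occupied. */
bool fifo_ready(unsigned count, unsigned fifo_depth)
{
    assert(count <= fifo_depth);
    return count < fifo_depth;
}

/* Early-stall FIFO: ready is already low with fifo_depth - 1 elements stored. */
bool fifo_earlystall_ready(unsigned count, unsigned fifo_depth)
{
    assert(fifo_depth >= 1 && count <= fifo_depth);
    return count + 1 < fifo_depth;
}
```

With FIFO_DEPTH = 8, the regular FIFO still accepts data when it holds 7 elements, whereas the early-stall version accepts data only up to 6.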

Table 14 hwpe_stream_fifo_earlystall design-time parameters.#

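==========  =======  ========================================
Name        Default  Description
==========  =======  ========================================
DATA_WIDTH  32       Width of the HWPE-Streams (multiple of 32).
FIFO_DEPTH  8        Depth of the FIFO queue (multiple of 2).
LATCH_FIFO  0        If 1, use latches instead of flip-flops (requires special constraints in synthesis).
==========  =======  ========================================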

Table 15 hwpe_stream_fifo_earlystall output flags.#

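============  ==========  =================================
Name          Type        Description
============  ==========  =================================
empty         logic       1 if the FIFO is currently empty.
full          logic       1 if the FIFO is currently full.
push_pointer  logic[7:0]  Unused.
pop_pointer   logic[7:0]  Unused.
============  ==========  =================================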

hwpe_stream_fifo_ctrl#

_images/hwpe_stream_fifo_ctrl.sv.png

The hwpe_stream_fifo_ctrl module implements a hardware FIFO queue similar to that implemented by hwpe_stream_fifo, but without any actual interface handshake forced on HWPE-Streams. Instead, it will push its “virtual” handshake on the push_valid_i/push_ready_o and pop_valid_o/pop_ready_i signals. It can be used to operate multiple big FIFO queues (e.g. with latches) in a synchronized fashion without breaking the HWPE-Stream protocol.

Table 16 hwpe_stream_fifo_ctrl design-time parameters.#

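==========  =======  ========================================
Name        Default  Description
==========  =======  ========================================
FIFO_DEPTH  8        Depth of the FIFO queue (multiple of 2).
==========  =======  ========================================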

HCI Core modules#

hci_core_assign#

_images/hci_core_assign.sv.png

The hci_core_assign module implements a simple assignment for HCI-Core streams.

hci_core_fifo#

_images/hci_core_fifo.sv.png

The hci_core_fifo module implements a hardware FIFO queue for HCI-Core interfaces, used to withstand data scarcity (req=0) or backpressure (gnt=0), decoupling two architectural domains. This FIFO is single-clock and therefore cannot be used to cross two distinct clock domains. The FIFO treats a HCI-Core load stream as a combination of two 32-bit HWPE-Streams, one going from the tcdm_initiator to the tcdm_target interface carrying the addr (outgoing stream); the other from the tcdm_target to the tcdm_initiator interface, carrying the r_data (incoming stream).

On the target side, the req and gnt of the HCI-Core interfaces are mapped on valid and ready respectively in the outgoing stream. Backpressure on the incoming stream (target side) cannot be enforced by means of the HCI-Core target interface and thus is carried by a specific input ready_i that must be generated outside of the TCDM FIFO, typically by a hwpe_stream_source module (output tcdm_fifo_ready_o). On the initiator side, req is mapped to the AND of the incoming stream ready signal and the outgoing stream valid signal. gnt is hooked to the outgoing stream ready signal. The r_valid is mapped on valid in the incoming stream. The figure below shows this mapping.

Mapping of HCI-Core and HWPE-Stream signals inside the load FIFO.
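The signal mapping described above can also be written down as a small combinational function. This sketch only restates the text; the struct and field names are invented for illustration, and the externally generated ready_i on the target side is not modeled.

```c
#include <assert.h>
#include <stdbool.h>

typedef struct {
    /* target side (towards the HWPE) */
    bool addr_push_valid;   /* outgoing stream valid <- target req            */
    bool target_gnt;        /* target gnt            <- outgoing stream ready */
    /* initiator side (towards the memory) */
    bool initiator_req;     /* incoming stream ready AND outgoing stream valid */
    bool addr_pop_ready;    /* outgoing stream ready <- initiator gnt          */
    bool rdata_push_valid;  /* incoming stream valid <- initiator r_valid      */
} load_fifo_map_t;

load_fifo_map_t load_fifo_map(bool target_req, bool addr_push_ready,
                              bool addr_pop_valid, bool rdata_push_ready,
                              bool initiator_gnt, bool initiator_r_valid)
{
    load_fifo_map_t m;
    m.addr_push_valid  = target_req;
    m.target_gnt       = addr_push_ready;
    m.initiator_req    = rdata_push_ready && addr_pop_valid;
    m.addr_pop_ready   = initiator_gnt;
    m.rdata_push_valid = initiator_r_valid;
    /* A request is never issued without a pending address. */
    assert(!m.initiator_req || addr_pop_valid);
    return m;
}
```

The AND on initiator_req is what keeps the FIFO from issuing a load whose response it could not store.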

Table 17 hci_core_fifo design-time parameters.#

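==========  =======  ========================================
Name        Default  Description
==========  =======  ========================================
FIFO_DEPTH  8        Depth of the FIFO queue (multiple of 2).
LATCH_FIFO  0        If 1, use latches instead of flip-flops (requires special constraints in synthesis).
==========  =======  ========================================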

Table 18 hci_core_fifo output flags.#

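============  ==========  =================================
Name          Type        Description
============  ==========  =================================
empty         logic       1 if the FIFO is currently empty.
full          logic       1 if the FIFO is currently full.
push_pointer  logic[7:0]  Unused.
pop_pointer   logic[7:0]  Unused.
============  ==========  =================================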

hci_core_mux_dynamic#

_images/hci_core_mux_dynamic.sv.png

The HCI multiplexer can be used to funnel a larger number of input “virtual” HCI channels (in) into a smaller set of initiator ports (out). It uses a round-robin counter to avoid starvation, and differs from the modules used within the logarithmic interconnect in that arbitration depends on the round-robin counter and not on the target port; in other words, its task is to fill all out ports with requests from the in ports, not to route in requests to a specific out port.

Notice that the multiplexer is not “optimal” in the sense that there is no reorder buffer, so transactions cannot be swapped in-flight to optimally fill the downstream available bandwidth. However, in real accelerators many systematic issues with bandwidth sharing can be solved by upstream HCI FIFOs and by clever reordering of channels, since the dataflow schedule is known. For a multiplexer with reorder buffer, see hci_core_mux_ooo.
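The arbitration policy can be illustrated with a simplified software model. This is not the RTL: it assumes that, each cycle, inputs are scanned starting from the round-robin counter and the first requesting inputs are assigned to the available output ports. The channel counts and names are chosen for the example.

```c
#include <assert.h>
#include <stdbool.h>

#define NB_IN_CHAN  4  /* example values, not the defaults */
#define NB_OUT_CHAN 2

/* Fills winner[o] with the input index granted to output o (-1 if none).
 * rr is the round-robin counter, expected to advance every cycle.
 * Returns the number of output ports that were filled. */
int rr_mux_arbitrate(const bool req[NB_IN_CHAN], unsigned rr,
                     int winner[NB_OUT_CHAN])
{
    int granted = 0;
    assert(req && winner);
    for (int o = 0; o < NB_OUT_CHAN; o++)
        winner[o] = -1;
    /* Scan all inputs once, starting at the round-robin position. */
    for (int k = 0; k < NB_IN_CHAN && granted < NB_OUT_CHAN; k++) {
        int i = (int)((rr + (unsigned)k) % NB_IN_CHAN);
        if (req[i])
            winner[granted++] = i;
    }
    return granted;
}
```

Because the starting position rotates, every requesting input eventually reaches the front of the scan, which is what prevents starvation in this model.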

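Table 19 hci_core_mux_dynamic design-time parameters.#

===========  =======  ===================================
Name         Default  Description
===========  =======  ===================================
NB_IN_CHAN   2        Number of input HWPE-Mem channels.
NB_OUT_CHAN  1        Number of output HWPE-Mem channels.
===========  =======  ===================================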

hci_core_mux_ooo#

_images/hci_core_mux_ooo.sv.png

The HCI dynamic out-of-order (OoO) N-to-1 multiplexer funnels multiple HCI ports into a single one. It supports out-of-order responses by means of an ID. As the ID is implemented as a user signal, any FIFO placed after this block (i.e., nearer to the memory side) must respect the id signals; specifically, it must return them unchanged in the response. At the end of the chain, there will typically be a hci_core_r_id_filter block reflecting back all the IDs. This must be placed at the 0-latency boundary with the memory system. Priority is normally round-robin but can also be forced from the outside by setting priority_force_i to 1 and driving the priority_i array to the desired priority values.

Table 20 hci_core_mux_ooo design-time parameters.#

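=======  =======  =============================
Name     Default  Description
=======  =======  =============================
NB_CHAN  2        Number of input HCI channels.
=======  =======  =============================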

hci_core_mux_static#

_images/hci_core_mux_static.sv.png

The HCI static multiplexer can be used in place of the dynamic ones when two sets of ports are guaranteed to be used in a strictly alternative fashion.

Table 21 hci_core_mux_static design-time parameters.#

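=======  =======  =============================
Name     Default  Description
=======  =======  =============================
NB_CHAN  2        Number of input HCI channels.
=======  =======  =============================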

hci_core_r_id_filter#

_images/hci_core_r_id_filter.sv.png

This block filters the id field of the TCDM request, and forwards it to the r_id field of the TCDM response.

hci_core_r_valid_filter#

_images/hci_core_r_valid_filter.sv.png

This block filters the r_valid field of the TCDM response: when enable_i is 1, r_valid=1 is propagated only for responses to read transactions. The block currently only works at the zero-latency boundary between core and memory (it expects the latency between gnt and r_valid to be exactly one cycle).

hci_core_split#

_images/hci_core_split.sv.png

The hci_core_split module uses FIFOs to enqueue a split version of the HCI transactions. The FIFO queues evolve in a synchronized fashion on the accelerator side and evolve freely on the TCDM side. In this way, split transactions that cannot be immediately brought back to the accelerator do not need to be repeated, massively reducing TCDM traffic. The hci_core_split must be followed (not preceded!) by any hci_core_r_id_filter that is used, for example, to implement HCI IDs for the purpose of supporting out-of-order access from a hci_core_mux.

Table 22 hci_core_split design-time parameters.#

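===========  =======  =================================
Name         Default  Description
===========  =======  =================================
NB_OUT_CHAN  2        Number of output channels.
FIFO_DEPTH   0        Depth of internal HCI Core FIFOs.
===========  =======  =================================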

Basic modules (HWPE-Mem / HWPE-MemDecoupled - deprecated)#

Basic HWPE-Mem management modules are used to delay/enqueue HWPE-MemDecoupled interfaces, multiplex multiple HWPE-Mem, or reorder them before hooking the accelerator to a Tightly-Coupled Data Memory (TCDM). Modules performing these functions can be found within the rtl/tcdm subfolder of the hwpe-stream repository.

hwpe_stream_tcdm_fifo_store#

_images/hwpe_stream_tcdm_fifo_store.sv.png

The hwpe_stream_tcdm_fifo_store module implements a hardware FIFO queue for HWPE-MemDecoupled store streams, used to withstand data scarcity (req=0) or backpressure (gnt=0), decoupling two architectural domains. This FIFO is single-clock and therefore cannot be used to cross two distinct clock domains. The FIFO treats a HWPE-MemDecoupled store stream as a wide HWPE-Stream where, on both sides, the data field contains addr, data, be of the input tcdm_slave; the req and gnt of the HWPE-MemDecoupled interfaces are mapped on valid and ready respectively. The FIFO will lower its gnt signal on the slave interface tcdm_slave when it is completely full, and will lower its req signal on the master interface tcdm_master when it is completely empty. Fig. 7 shows this mapping.

_images/hwpe_stream_tcdm_fifo_store.png

Fig. 7 Mapping of HWPE-MemDecoupled and HWPE-Stream signals inside the store FIFO.#

Table 23 hwpe_stream_tcdm_fifo_store design-time parameters.#

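==========  =======  ========================================
Name        Default  Description
==========  =======  ========================================
FIFO_DEPTH  8        Depth of the FIFO queue (multiple of 2).
LATCH_FIFO  0        If 1, use latches instead of flip-flops (requires special constraints in synthesis).
==========  =======  ========================================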

Table 24 hwpe_stream_tcdm_fifo_store output flags.#

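============  ==========  =================================
Name          Type        Description
============  ==========  =================================
empty         logic       1 if the FIFO is currently empty.
full          logic       1 if the FIFO is currently full.
push_pointer  logic[7:0]  Unused.
pop_pointer   logic[7:0]  Unused.
============  ==========  =================================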

hwpe_stream_tcdm_fifo_load#

_images/hwpe_stream_tcdm_fifo_load.sv.png

The hwpe_stream_tcdm_fifo_load module implements a hardware FIFO queue for HWPE-MemDecoupled load streams, used to withstand data scarcity (req=0) or backpressure (gnt=0), decoupling two architectural domains. This FIFO is single-clock and therefore cannot be used to cross two distinct clock domains. The FIFO treats a HWPE-MemDecoupled load stream as a combination of two 32-bit HWPE-Streams, one going from the tcdm_master to the tcdm_slave interface carrying the addr (outgoing stream); the other from the tcdm_slave to the tcdm_master interface, carrying the r_data (incoming stream).

On the slave side, the req and gnt of the HWPE-MemDecoupled interfaces are mapped on valid and ready respectively in the outgoing stream. Backpressure on the incoming stream (slave side) cannot be enforced by means of the HWPE-MemDecoupled slave interface and thus is carried by a specific input ready_i that must be generated outside of the TCDM FIFO, typically by a hwpe_stream_source module (output tcdm_fifo_ready_o). On the master side, req is mapped to the AND of the incoming stream ready signal and the outgoing stream valid signal. gnt is hooked to the outgoing stream ready signal. The r_valid is mapped on valid in the incoming stream. Fig. 8 shows this mapping.

_images/hwpe_stream_tcdm_fifo_load.png

Fig. 8 Mapping of HWPE-MemDecoupled and HWPE-Stream signals inside the load FIFO.#

Table 25 hwpe_stream_tcdm_fifo_load design-time parameters.#

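==========  =======  ========================================
Name        Default  Description
==========  =======  ========================================
FIFO_DEPTH  8        Depth of the FIFO queue (multiple of 2).
LATCH_FIFO  0        If 1, use latches instead of flip-flops (requires special constraints in synthesis).
==========  =======  ========================================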

Table 26 hwpe_stream_tcdm_fifo_load output flags.#

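============  ==========  =================================
Name          Type        Description
============  ==========  =================================
empty         logic       1 if the FIFO is currently empty.
full          logic       1 if the FIFO is currently full.
push_pointer  logic[7:0]  Unused.
pop_pointer   logic[7:0]  Unused.
============  ==========  =================================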

hwpe_stream_tcdm_mux#

_images/hwpe_stream_tcdm_mux.sv.png

The TCDM multiplexer can be used to funnel a larger number of input “virtual” TCDM channels (in) into a smaller set of master ports (out). It uses a round-robin counter to avoid starvation, and differs from the modules used within the logarithmic interconnect in that arbitration depends on the round-robin counter and not on the slave port; in other words, its task is to fill all out ports with requests from the in ports, not to route in requests to a specific out port.

Notice that the multiplexer is not “optimal” in the sense that there is no reorder buffer, so transactions cannot be swapped in-flight to optimally fill the downstream available bandwidth. However, in real accelerators many systematic issues with bandwidth sharing can be solved by upstream TCDM FIFOs and by clever reordering of channels, since the dataflow schedule is known.

Table 27 hwpe_stream_tcdm_mux design-time parameters.#

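===========  =======  ===================================
Name         Default  Description
===========  =======  ===================================
NB_IN_CHAN   2        Number of input HWPE-Mem channels.
NB_OUT_CHAN  1        Number of output HWPE-Mem channels.
===========  =======  ===================================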

hwpe_stream_tcdm_mux_static#

_images/hwpe_stream_tcdm_mux_static.sv.png

The hwpe_stream_tcdm_mux_static module is used to statically share a set of out master ports using the HWPE-Mem protocol between two sets of slave ports in0 and in1. It works similarly to the hwpe_stream_mux_static and similarly requires a strictly static selector sel_i.

Table 28 hwpe_stream_tcdm_mux_static design-time parameters.#

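=======  =======  ===================================
Name     Default  Description
=======  =======  ===================================
NB_CHAN  2        Number of output HWPE-Mem channels.
=======  =======  ===================================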

hwpe_stream_tcdm_reorder#

_images/hwpe_stream_tcdm_reorder.sv.png

The hwpe_stream_tcdm_reorder block can be used to rotate the order of a set of HWPE-Mem channels depending on an order_i input, which can be changed dynamically (e.g. a counter). This is used to “equalize” channels with different probabilities of issuing a request so that the downstream HWPE-Mem channels are used with the same average probability, minimizing the chances for memory starvation.
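The rotation can be sketched as a simple index shift. The direction of the rotation and the use of plain integers to stand in for channels are assumptions made for illustration; the actual block operates on HWPE-Mem interfaces.

```c
#include <assert.h>

#define NB_CHAN 4  /* example value, not the default */

/* Output channel i is driven by input channel (i + order) mod NB_CHAN. */
void tcdm_reorder(const int in[NB_CHAN], unsigned order, int out[NB_CHAN])
{
    assert(in && out && in != out);
    for (unsigned i = 0; i < NB_CHAN; i++)
        out[i] = in[(i + order) % NB_CHAN];
}
```

Driving order from a free-running counter makes each input channel visit every output position in turn, which is how the average load on the downstream channels gets equalized.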

Table 29 hwpe_stream_tcdm_reorder design-time parameters.#

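=======  =======  ============================
Name     Default  Description
=======  =======  ============================
NB_CHAN  2        Number of HWPE-Mem channels.
=======  =======  ============================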

HCI Streamer modules#

Streamer modules constitute the heart of the IPs used to interface HWPEs with a PULP system. They include all the modules that are used to generate HWPE-Streams from address patterns on the TCDM, including the address generation itself, data realignment to enable access to data located at non-word-aligned addresses, strobe generation to selectively disable parts of a stream, and the main streamer source and sink modules used to put these functions together. HCI modules performing these functions can be found within the rtl/core subfolder of the hci repository.

The two main streamer modules (hci_core_source and hci_core_sink) are composed of several other IPs, including the address generation and strobe generation blocks described in this section, as well as basic HWPE-Stream management blocks.

hci_core_source#

_images/hci_core_source.sv.png

The hci_core_source module is the high-level source streamer performing a series of loads on a HCI-Core interface and producing a HWPE-Stream data stream to feed a HWPE engine/datapath. The source streamer is a composite module that makes use of many other fundamental IPs.

Fundamentally, a source streamer acts as a specialized DMA engine acting out a predefined pattern from an hwpe_stream_addressgen_v3 to perform a burst of loads via a HCI-Core interface, producing a HWPE-Stream data stream from the HCI-Core r_data field. By default, the HCI-Core streamer supports delayed accesses using a HCI-Core interface.

Misaligned accesses are supported by making the HCI-Core data width 32 bits wider than the HWPE-Stream that gets produced by the streamer. Unused bytes are simply ignored. This feature can be deactivated by setting the MISALIGNED_ACCESS parameter to 0; in this case, the source will only work correctly if all data is aligned to a word boundary.
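The realignment idea can be sketched in software as follows. The example assumes a 32-bit HWPE-Stream (hence a 64-bit load), little-endian byte order, and a byte array standing in for memory; none of the names come from the RTL.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define STREAM_BYTES 4                   /* 32-bit HWPE-Stream (assumed)    */
#define LOAD_BYTES   (STREAM_BYTES + 4)  /* memory access is 32 bits wider  */

/* Returns the STREAM_BYTES bytes located at byte address addr, obtained
 * from a single word-aligned load of LOAD_BYTES bytes. */
uint32_t source_realign(const uint8_t *mem, uint32_t addr)
{
    uint32_t aligned = addr & ~3u;  /* word-aligned start of the wide load */
    uint32_t offset  = addr & 3u;   /* misalignment in bytes               */
    uint8_t  wide[LOAD_BYTES];
    uint32_t out = 0;
    assert(mem);
    memcpy(wide, mem + aligned, LOAD_BYTES);  /* the wide, aligned load */
    for (int b = 0; b < STREAM_BYTES; b++)    /* little-endian packing  */
        out |= (uint32_t)wide[offset + b] << (8 * b);
    return out;  /* the remaining bytes of the wide load are ignored */
}
```

For a word-aligned address the offset is 0 and the upper 32 bits of the load are simply unused.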

In principle, the source streamer is insensitive to latency. However, when configured to support misaligned memory accesses, the depth of the address FIFO sets the maximum supported latency. This depth is set by the ADDR_MIS_DEPTH parameter (default 8).

Table 30 hci_core_source design-time parameters.#

=================  =======  ========================================
Name               Default  Description
=================  =======  ========================================
LATCH_FIFO         0        If 1, use latches instead of flip-flops (requires special constraints in synthesis).
TRANS_CNT          16       Number of bits supported in the transaction counter of the address generator, which will overflow at 2^TRANS_CNT.
ADDR_MIS_DEPTH     8        Depth of the misaligned address FIFO. This must be equal to the max-latency between the HCI-Core gnt and r_valid.
MISALIGNED_ACCESS  1        If set to 0, the source will not support non-word-aligned HCI-Core accesses.
PASSTHROUGH_FIFO   0        If set to 1, the address FIFO will be capable of fall-through operation (i.e., skipping the FIFO latency entirely).
=================  =======  ========================================

Table 31 hci_core_source input control signals.#

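===============  ====================  ========================================
Name             Type                  Description
===============  ====================  ========================================
req_start        logic                 When 1, the source streamer operation is started if it is ready.
addressgen_ctrl  ctrl_addressgen_v3_t  Configuration of the address generator (see hwpe_stream_addressgen_v3).
===============  ====================  ========================================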

Table 32 hci_core_source output flags.#

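================  =====================  ========================================
Name              Type                   Description
================  =====================  ========================================
ready_start       logic                  1 when the source streamer is ready to start operation, from the first IDLE state cycle on.
done              logic                  1 for one cycle when the streamer ends operation, in the cycle before it goes to IDLE state.
addressgen_flags  flags_addressgen_v3_t  Address generator flags (see hwpe_stream_addressgen_v3).
================  =====================  ========================================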

hci_core_sink#

_images/hci_core_sink.sv.png

The hci_core_sink module is the high-level sink streamer performing a series of stores on a HCI-Core interface from an incoming HWPE-Stream data stream from a HWPE engine/datapath. The sink streamer is a composite module that makes use of many other fundamental IPs.

Fundamentally, a sink streamer acts as a specialized DMA engine acting out a predefined pattern from an hwpe_stream_addressgen_v3 to perform a burst of stores via a HCI-Core interface, consuming a HWPE-Stream data stream into the HCI-Core data field. The sink streamer is insensitive to memory latency. This is due to the nature of store streams, which are unidirectional (i.e. addr and data move in the same direction).

Misaligned accesses are supported by making the HCI-Core data width 32 bits wider than the HWPE-Stream that gets consumed by the streamer. The stream is shifted according to the address alignment and invalid bytes are disabled by clearing their strb. This feature can be deactivated by setting the MISALIGNED_ACCESS parameter to 0; in this case, the sink will only work correctly if all data is aligned to a word boundary.
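The shift-and-strobe step can be sketched as follows, assuming a 32-bit HWPE-Stream stored through a 64-bit access with one strobe bit per byte and little-endian byte order. The names are illustrative only.

```c
#include <assert.h>
#include <stdint.h>

typedef struct {
    uint64_t data;  /* 64-bit store data: stream width plus 32 bits */
    uint8_t  strb;  /* one strobe bit per byte of data              */
} wide_store_t;

/* Places a 32-bit stream word at byte address addr inside the word-aligned
 * 64-bit store that starts at (addr & ~3). */
wide_store_t sink_realign(uint32_t stream_data, uint32_t addr)
{
    unsigned offset = addr & 3u;  /* misalignment in bytes */
    wide_store_t s;
    s.data = (uint64_t)stream_data << (8 * offset);  /* shift by alignment   */
    s.strb = (uint8_t)(0x0Fu << offset);             /* enable 4 valid bytes */
    assert((s.strb >> offset) == 0x0Fu);
    return s;
}
```

Bytes whose strobe bit is 0 are left untouched in memory, so neighboring data is not overwritten by a misaligned store.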

Table 33 hci_core_sink design-time parameters.#

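=================  =======  ========================================
Name               Default  Description
=================  =======  ========================================
TCDM_FIFO_DEPTH    2        If >0, the module produces a HWPE-MemDecoupled interface and includes a TCDM FIFO of this depth.
TRANS_CNT          16       Number of bits supported in the transaction counter of the address generator, which will overflow at 2^TRANS_CNT.
MISALIGNED_ACCESS  1        If set to 0, the sink will not support non-word-aligned HWPE-Mem accesses.
=================  =======  ========================================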

Table 34 hci_core_sink input control signals.#

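===============  ====================  ========================================
Name             Type                  Description
===============  ====================  ========================================
req_start        logic                 When 1, the sink streamer operation is started if it is ready.
addressgen_ctrl  ctrl_addressgen_v3_t  Configuration of the address generator (see hwpe_stream_addressgen_v3).
===============  ====================  ========================================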

Table 35 hci_core_sink output flags.#

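================  =====================  ========================================
Name              Type                   Description
================  =====================  ========================================
ready_start       logic                  1 when the sink streamer is ready to start operation, from the first IDLE state cycle on.
done              logic                  1 for one cycle when the streamer ends operation, in the cycle before it goes to IDLE state.
addressgen_flags  flags_addressgen_v3_t  Address generator flags (see hwpe_stream_addressgen_v3).
================  =====================  ========================================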

hwpe_stream_addressgen_v3#

_images/hwpe_stream_addressgen_v3.sv.png

The hwpe_stream_addressgen_v3 module is used to generate the addresses to load or store a HWPE-Stream stream. In this version of the address generator, the address is itself carried within a HWPE-Stream, making it easily stallable. The address generator can generate addresses from a three-dimensional space, which can be visited with configurable strides in all three dimensions.

The multiple-loop functionality partially overlaps with that provided by the microcode processor hwce_ctrl_ucode that can be embedded in HWPEs. The latter is much more flexible and smaller, but slower.

One iteration is performed in each cycle in which enable_i is 1 and the output addr_o stream is ready. presample_i should be 1 in the first cycle in which the address generator can start generating addresses, and not afterwards. The following piece of C-like code summarizes the basic functionality provided by the address generator.

// One iteration of the address generator. Returns the current address and
// writes 1 to *done once the overall counter has reached tot_len.
int hwpe_stream_addressgen_v3(
  int base_addr,                                          // base address (byte-aligned)
  int d0_len,    int d1_len,    int tot_len,              // d0,d1,total length (in number of transactions)
  int d0_stride, int d1_stride, int d2_stride,            // d0,d1,d2 strides (in bytes)
  int dim_enable,                                         // dim_enable_1h: 00 = 1-d, 01 = 2-d, 11 = 3-d
  int *d0_addr,  int *d1_addr,  int *d2_addr,             // d0,d1,d2 addresses (by reference)
  int *d0_cnt,   int *d1_cnt,   int *ov_cnt,              // d0,d1,overall counters (by reference)
  int *done                                               // done flag (by reference)
) {
  // compute current address
  int current_addr = 0;
  *done = 0;
  if ((dim_enable & 0x1) == 0) { // 1-dimensional streaming
    current_addr = base_addr + *d0_addr;
  }
  else if ((dim_enable & 0x2) == 0) { // 2-dimensional streaming
    current_addr = base_addr + *d1_addr + *d0_addr;
  }
  else { // 3-dimensional streaming
    current_addr = base_addr + *d2_addr + *d1_addr + *d0_addr;
  }
  // update counters and dimensional addresses
  if (*ov_cnt == tot_len) {
    *done = 1;
  }
  if ((*d0_cnt < d0_len) || ((dim_enable & 0x1) == 0)) {
    // still inside the innermost (d0) loop
    *d0_addr = *d0_addr + d0_stride;
    *d0_cnt  = *d0_cnt + 1;
  }
  else if ((*d1_cnt < d1_len) || ((dim_enable & 0x2) == 0)) {
    // d0 loop finished: restart it and advance along d1
    *d0_addr = 0;
    *d1_addr = *d1_addr + d1_stride;
    *d0_cnt  = 1;
    *d1_cnt  = *d1_cnt + 1;
  }
  else {
    // d0 and d1 loops finished: restart both and advance along d2
    *d0_addr = 0;
    *d1_addr = 0;
    *d2_addr = *d2_addr + d2_stride;
    *d0_cnt  = 1;
    *d1_cnt  = 1;
  }
  *ov_cnt = *ov_cnt + 1;
  return current_addr;
}
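
The intended behavior of the loop above can also be written in closed form. Assuming that d0_cnt and d1_cnt start at 1 and that the dimensional addresses start at 0 (the initial values are an assumption, since they are not stated here), the address of the n-th transaction (n = 0, 1, 2, ...) in 3-dimensional mode is:

```latex
\begin{aligned}
i_0 &= n \bmod \mathrm{d0\_len}, \\
i_1 &= \left\lfloor n / \mathrm{d0\_len} \right\rfloor \bmod \mathrm{d1\_len}, \\
i_2 &= \left\lfloor n / (\mathrm{d0\_len} \cdot \mathrm{d1\_len}) \right\rfloor, \\
\mathrm{addr}_n &= \mathrm{base\_addr} + i_0 \, \mathrm{d0\_stride} + i_1 \, \mathrm{d1\_stride} + i_2 \, \mathrm{d2\_stride}.
\end{aligned}
```

In 2-dimensional mode the i_2 term is dropped and i_1 = floor(n / d0_len) is no longer wrapped; in 1-dimensional mode only the i_0 term remains, with i_0 = n.
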
Table 36 hwpe_stream_addressgen_v3 design-time parameters.#

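=========  =======  ========================================
Name       Default  Description
=========  =======  ========================================
TRANS_CNT  32       Number of bits supported in the transaction counter, which will overflow at 2^TRANS_CNT.
CNT        32       Number of bits supported in non-transaction counters, which will overflow at 2^CNT.
=========  =======  ========================================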

Table 37 hwpe_stream_addressgen_v3 input control signals.#

=============  ===========  ========================================
Name           Type         Description
=============  ===========  ========================================
base_addr      logic[31:0]  Byte-aligned base address of the stream in the HWPE-accessible memory.
tot_len        logic[31:0]  Total number of transactions in stream; only the TRANS_CNT LSB are actually used.
d0_len         logic[31:0]  d0 length in number of transactions
d0_stride      logic[31:0]  d0 stride in bytes
d1_stride      logic[31:0]  d1 stride in bytes
d1_len         logic[31:0]  d1 length in number of transactions
d2_stride      logic[31:0]  d2 stride in bytes
dim_enable_1h  logic[1:0]   One-hot switch to enable 3-d counting (11), 2-d (01), or 1-d (00).
=============  ===========  ========================================

Table 38 hwpe_stream_addressgen_v3 output flags.#

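====  =====  ===========================================
Name  Type   Description
====  =====  ===========================================
done  logic  1 when the address generation has finished.
====  =====  ===========================================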

Plain HWPE-Mem Streamer modules (deprecated)#

The “plain” HWPE-Mem Streamer modules, although still functional, have generally been superseded by the HCI Streamer modules. We suggest using those for new designs.

Streamer modules constitute the heart of the IPs used to interface HWPEs with a PULP system. They include all the modules that are used to generate HWPE-Streams from address patterns on the TCDM, including the address generation itself, data realignment to enable access to data located at non-word-aligned addresses, strobe generation to selectively disable parts of a stream, and the main streamer source and sink modules used to put these functions together. Modules performing these functions can be found within the rtl/streamer subfolder of the hwpe-stream repository.

The two main streamer modules (hwpe_stream_source and hwpe_stream_sink) are composed of several other IPs, including the address generation and strobe generation blocks described in this section, as well as basic HWPE-Stream management blocks.

hwpe_stream_source#

_images/hwpe_stream_source.sv.png

The hwpe_stream_source module is the high-level source streamer performing a series of loads on a HWPE-Mem or HWPE-MemDecoupled interface and producing a HWPE-Stream data stream to feed a HWPE engine/datapath. The source streamer is a composite module that makes use of many other fundamental IPs. Its architecture is shown in Fig. 9.

_images/hwpe_stream_source_archi.png

Fig. 9 Architecture of the source streamer.#

Fundamentally, a source streamer acts as a specialized DMA engine acting out a predefined pattern from an hwpe_stream_addressgen to perform a burst of loads via a HWPE-Mem interface, producing a HWPE-Stream data stream from the HWPE-Mem r_data field.

Depending on the DECOUPLED parameter, the streamer supports delayed accesses using a HWPE-MemDecoupled interface. The source streamer does not include a TCDM FIFO of its own; rather, it provides a specific tcdm_fifo_ready_o output signal that can be hooked to an external hwpe_stream_tcdm_fifo_load. tcdm_fifo_ready_o provides a backpressure mechanism from the source streamer to the TCDM FIFO (this is unnecessary in the case of TCDM FIFOs for store).

Table 39 hwpe_stream_source design-time parameters.#

===========  =======  ========================================
Name         Default  Description
===========  =======  ========================================
DECOUPLED    0        If 1, the module expects a HWPE-MemDecoupled interface instead of HWPE-Mem.
DATA_WIDTH   32       Width of input/output streams (multiple of 32).
LATCH_FIFO   0        If 1, use latches instead of flip-flops (requires special constraints in synthesis).
TRANS_CNT    16       Number of bits supported in the transaction counter of the address generator, which will overflow at 2^TRANS_CNT.
REALIGNABLE  1        If set to 0, the source will not support non-word-aligned HWPE-Mem accesses.
===========  =======  ========================================

Table 40 hwpe_stream_source input control signals.#

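===============  =================  ========================================
Name             Type               Description
===============  =================  ========================================
req_start        logic              When 1, the source streamer operation is started if it is ready.
addressgen_ctrl  ctrl_addressgen_t  Configuration of the address generator (see hwpe_stream_addressgen).
===============  =================  ========================================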

Table 41 hwpe_stream_source output flags.#

| Name | Type | Description |
|---|---|---|
| ready_start | logic | 1 when the source streamer is ready to start operation. |
| done | logic | 1 for one cycle when the streamer ends operation. |
| addressgen_flags | flags_addressgen_t | Address generator flags (see hwpe_stream_addressgen). |
| ready_fifo | logic | Unused. |

hwpe_stream_sink#

_images/hwpe_stream_sink.sv.png

The hwpe_stream_sink module is the high-level sink streamer: it performs a series of stores on a HWPE-Mem or HWPE-MemDecoupled interface, consuming an incoming HWPE-Stream data stream produced by a HWPE engine/datapath. The sink streamer is a composite module that makes use of many other fundamental IPs. Its architecture is shown in Fig. 10.

_images/hwpe_stream_sink_archi.png

Fig. 10 Architecture of the sink streamer.#

Fundamentally, a sink streamer acts as a specialized DMA engine: it follows a predefined access pattern generated by an hwpe_stream_addressgen to perform a burst of stores via a HWPE-Mem interface, consuming a HWPE-Stream data stream into the HWPE-Mem data field.

The sink streamer supports both standard HWPE-Mem and delayed HWPE-MemDecoupled accesses. This is possible because store streams are unidirectional (i.e. addr and data move in the same direction) and hence insensitive to latency.
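To make the role of byte strobes in a store more concrete, the following C sketch merges a stream word into a 32-bit memory word, writing only the bytes whose strobe bit is set. The function name strobed_store is hypothetical, and the mapping of strobe bit i to byte i (least-significant byte first) is an assumption made for illustration; it is not taken from the RTL.

```c
#include <stdint.h>

/* Hypothetical byte-enable merge: returns the new content of a 32-bit memory
 * word after a store of `data` in which only the bytes whose strobe bit is 1
 * are written. Bit i of strb is ASSUMED to cover bits [8*i+7 : 8*i]. */
uint32_t strobed_store(uint32_t old_word, uint32_t data, uint8_t strb)
{
    uint32_t mask = 0;
    for (int i = 0; i < 4; i++) {
        if (strb & (1u << i))
            mask |= 0xFFu << (8 * i);      /* enable byte i */
    }
    return (old_word & ~mask) | (data & mask);
}
```

For example, storing 0x11223344 over 0xAABBCCDD with strb = 0x3 leaves the two upper bytes untouched and yields 0xAABB3344.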

Table 42 hwpe_stream_sink design-time parameters.#

| Name | Default | Description |
|---|---|---|
| TCDM_FIFO_DEPTH | 2 | If >0, the module produces a HWPE-MemDecoupled interface and includes a TCDM FIFO of this depth. |
| DATA_WIDTH | 32 | Width of input/output streams. |
| LATCH_FIFO | 0 | If 1, use latches instead of flip-flops (requires special constraints in synthesis). |
| TRANS_CNT | 16 | Number of bits supported in the transaction counter of the address generator, which will overflow at 2^TRANS_CNT. |
| REALIGNABLE | 1 | If set to 0, the sink will not support non-word-aligned HWPE-Mem accesses. |

Table 43 hwpe_stream_sink input control signals.#

| Name | Type | Description |
|---|---|---|
| req_start | logic | When 1, the sink streamer operation is started if it is ready. |
| addressgen_ctrl | ctrl_addressgen_t | Configuration of the address generator (see hwpe_stream_addressgen). |

Table 44 hwpe_stream_sink output flags.#

| Name | Type | Description |
|---|---|---|
| ready_start | logic | 1 when the sink streamer is ready to start operation. |
| done | logic | 1 for one cycle when the streamer ends operation. |
| addressgen_flags | flags_addressgen_t | Address generator flags (see hwpe_stream_addressgen). |
| ready_fifo | logic | Unused. |

hwpe_stream_addressgen#

_images/hwpe_stream_addressgen.sv.png

_**hwpe_stream_addressgen** is DEPRECATED. New designs should use hwpe_stream_addressgen_v3 instead._

The hwpe_stream_addressgen module is used to generate the addresses needed to load or store HWPE-Stream streams, as well as the related byte enable strobes (gen_addr_o and gen_strb_o respectively). The address generator can generate addresses from a three-dimensional space of “words”, “lines” and “features”. Lines and features can be separated by a certain stride, and a roll parameter can be used to reuse the same offsets multiple times.

The multiple-loop functionality partially overlaps with that of the microcode processor hwce_ctrl_ucode, which can be embedded in HWPEs. The latter is much more flexible and smaller, but slower. When using a single loop in the address generator, the HWPE designer should statically set line_stride=0, feat_length=1, feat_stride=0.

The address generation loop considers three-dimensional vectors, whose three dimensions are called packet, line and feature, from the innermost to the outermost. One iteration is performed in each cycle in which enable_i is 1. Feature loops can behave in two different fashions, modeled after the behavior of input/output features in CNNs. The following piece of code summarizes the basic functionality provided by the address generator, ignoring the more complex situations in which the address is misaligned (which results in one more transaction, introduced automatically).

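int line_addr=0, feat_addr=0;
int trans_idx=0; // number of addresses generated so far
while(trans_idx < trans_size) { // generation stops once trans_size addresses have been produced
  if(!enable)
    continue; // stalled cycle: no address is generated
  for(int feat_idx=0; feat_idx<feat_roll; feat_idx++) { // feature loop
    for(int line_idx=0; line_idx<feat_length; line_idx++) { // line loop
      for(int word_idx=0; word_idx<line_length; word_idx++) { // word loop
        gen_addr = base_addr + feat_addr + line_addr + word_idx * STEP;
        trans_idx++; // one transaction per generated address
      }
      line_addr += line_stride;
    }
    if((loop_outer) && (feat_idx == feat_roll-1)) {
      // outer feature loop: after feat_roll features, move on to the next one
      feat_addr += feat_stride;
      feat_idx  = 0;
    }
    else if ((!loop_outer) && (feat_idx < feat_roll-1)){
      // inner feature loop: advance to the next feature
      feat_addr += feat_stride;
    }
    else if ((!loop_outer) && (feat_idx == feat_roll-1)){
      // inner feature loop: roll back to the first feature
      feat_addr = 0;
      feat_idx  = 0;
    }
  }
}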
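
The pseudocode above is not directly executable, so the following C sketch restricts it to the simplest multi-line case: a single feature made of feat_length lines. It is a hypothetical software model (the name addressgen_single_feature and the output array gen_addr are not part of the IP), it ignores the enable stall and the feature roll logic, and it assumes the default STEP of 4.

```c
#include <stddef.h>
#include <stdint.h>

#define STEP 4u  /* default STEP parameter: one 32-bit word */

/* Hypothetical model: enumerate the addresses of a single feature made of
 * feat_length lines, each line_length words long, until trans_size
 * addresses have been produced. Returns the number of addresses written. */
size_t addressgen_single_feature(uint32_t base_addr, uint32_t trans_size,
                                 uint32_t line_length, uint32_t line_stride,
                                 uint32_t feat_length, uint32_t *gen_addr)
{
    size_t trans_idx = 0;
    uint32_t line_addr = 0;
    for (uint32_t line_idx = 0; line_idx < feat_length; line_idx++) {
        for (uint32_t word_idx = 0; word_idx < line_length; word_idx++) {
            if (trans_idx == trans_size)
                return trans_idx;          /* transaction complete */
            gen_addr[trans_idx++] = base_addr + line_addr + word_idx * STEP;
        }
        line_addr += line_stride;          /* move to the next line */
    }
    return trans_idx;
}
```

With base_addr=0x1000, line_length=2, line_stride=16, feat_length=3 and trans_size=6, the sketch produces 0x1000, 0x1004, 0x1010, 0x1014, 0x1020, 0x1024: two consecutive words per line, with lines 16 bytes apart.
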
Table 45 hwpe_stream_addressgen design-time parameters.#

| Name | Default | Description |
|---|---|---|
| REALIGN_TYPE | HWPE_STREAM_REALIGN_SOURCE | Type of realignment, can be set to HWPE_STREAM_REALIGN_{SOURCE,SINK}. |
| STEP | 4 | Step of address generation (untested with != 4). |
| TRANS_CNT | 16 | Number of bits supported in the transaction counter, which will overflow at 2^TRANS_CNT. |
| CNT | 10 | Number of bits supported in non-transaction counters, which will overflow at 2^CNT. |
| DELAY_FLAGS | 0 | If 1, delay the production of flags by one cycle. |

Table 46 hwpe_stream_addressgen input control signals.#

| Name | Type | Description |
|---|---|---|
| base_addr | logic[31:0] | Byte-aligned base address of the stream in the HWPE-accessible memory. |
| trans_size | logic[31:0] | Total size of transaction; only the TRANS_CNT LSB are actually used. |
| line_stride | logic[15:0] | Distance between two adjacent lines in bytes. |
| line_length | logic[15:0] | Length of a line in words, rounded by including also incomplete final words. |
| feat_stride | logic[15:0] | Distance between two adjacent features in bytes. |
| feat_length | logic[15:0] | Length of a feature in number of lines. |
| loop_outer | logic | Whether this corresponds to an outer or inner feature loop. |
| feat_roll | logic[15:0] | After this number of features, depending on loop_outer, feature index will be rolled back or incremented. |
| realign_type | logic | Unused. |
| line_length_remainder | logic[7:0] | Unused. |
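
Since trans_size is a 32-bit field of which only the TRANS_CNT least-significant bits are used, a value that does not fit in TRANS_CNT bits is effectively truncated. The following C sketch shows this arithmetic; the helper name effective_trans_size is hypothetical and the function only illustrates the bit masking, not the counter implementation.

```c
#include <stdint.h>

/* Keep only the trans_cnt least-significant bits of trans_size.
 * trans_cnt is assumed to be in the range 1..32. */
uint32_t effective_trans_size(uint32_t trans_size, unsigned trans_cnt)
{
    if (trans_cnt >= 32)
        return trans_size;                 /* nothing to truncate */
    return trans_size & ((UINT32_C(1) << trans_cnt) - 1u);
}
```

With the default TRANS_CNT of 16, a requested trans_size of 70000 would be seen as 70000 - 65536 = 4464, so transactions longer than 2^TRANS_CNT - 1 have to be split by the controlling logic or TRANS_CNT has to be increased.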

Table 47 hwpe_stream_addressgen output flags.#

| Name | Type | Description |
|---|---|---|
| realign_flags | ctrl_realign_t | Control signals to be used for realignment by hwpe_stream_{source,sink}_realign modules. |
| word_update | logic | 1 when the word loop has been updated. |
| line_update | logic | 1 when the line loop has been updated. |
| feat_update | logic | 1 when the feature loop has been updated. |
| in_progress | logic | 1 when the address generation has progressed. |

hwpe_stream_strbgen#

_images/hwpe_stream_strbgen.sv.png

The hwpe_stream_strbgen module is used to generate strobes for load or store HWPE-Stream streams, in case of incomplete transfers. It uses information passed through the same configuration struct used for the address generator.

Table 48 hwpe_stream_strbgen design-time parameters.#

| Name | Default | Description |
|---|---|---|
| DATA_WIDTH | 32 | Width of input/output streams. |

Table 49 hwpe_stream_strbgen input control signals.#

| Name | Type | Description |
|---|---|---|
| base_addr | logic[31:0] | Unused. |
| trans_size | logic[31:0] | Unused. |
| line_stride | logic[15:0] | Unused. |
| line_length | logic[15:0] | Length of a line in words, rounded by including also incomplete final words. |
| feat_stride | logic[15:0] | Unused. |
| feat_length | logic[15:0] | Unused. |
| loop_outer | logic | Unused. |
| feat_roll | logic[15:0] | Unused. |
| realign_type | logic | Unused. |
| line_length_remainder | logic[7:0] | Number of valid bytes in the final word in a line; if 0, the final word is considered fully valid. |
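
The meaning of line_length_remainder can be illustrated with a small C helper that computes the byte strobe of the final word in a line. This is a sketch under stated assumptions: the name final_word_strb is hypothetical, remainder is assumed not to exceed the number of bytes per word, and the valid bytes are assumed to be the least-significant ones, which the text above does not state explicitly.

```c
#include <stdint.h>

/* Hypothetical helper: byte strobe of the final word of a line.
 * remainder is line_length_remainder (number of valid bytes, 0 = all valid);
 * bytes_per_word is DATA_WIDTH/8 and is assumed to be at most 8 here.
 * Valid bytes are ASSUMED to be the least-significant ones. */
uint8_t final_word_strb(unsigned remainder, unsigned bytes_per_word)
{
    unsigned valid = (remainder == 0) ? bytes_per_word : remainder;
    return (uint8_t)((1u << valid) - 1u);  /* `valid` low bits set */
}
```

For a 32-bit stream, a remainder of 3 gives the strobe 0x7 (three valid bytes), while a remainder of 0 gives 0xF (fully valid word).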

hwpe_stream_sink_realign#

_images/hwpe_stream_sink_realign.sv.png

The hwpe_stream_sink_realign module realigns HWPE-Streams to prepare them for storage in memory. Specifically, it rotates strb signals according to its control interface, produced along with addresses in the address generator.
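One way to picture why strobes must be rotated for misaligned stores is to look at a single 32-bit word written at a byte offset inside an aligned word: it straddles two aligned memory words, and each of the two stores enables only part of its bytes. The C sketch below computes the two strobes. The function names are hypothetical, little-endian byte order is assumed, and the sketch is an illustration of the principle rather than a description of the RTL.

```c
#include <stdint.h>

/* Hypothetical illustration for 32-bit words: a word stored at byte offset
 * `offset` (0..3) within an aligned word straddles two aligned words.
 * sink_first_strb selects the bytes written in the lower-addressed word,
 * sink_second_strb the bytes written in the following one. */
uint8_t sink_first_strb(unsigned offset)
{
    return (uint8_t)((0xFu << offset) & 0xFu);
}

uint8_t sink_second_strb(unsigned offset)
{
    return (uint8_t)(0xFu >> (4u - offset));
}
```

With offset 1, the first store uses strobe 0xE (upper three bytes) and the second uses 0x1 (lowest byte); with offset 0 the second strobe is 0 and no extra store is needed.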

Table 50 hwpe_stream_sink_realign design-time parameters.#

| Name | Default | Description |
|---|---|---|
| DATA_WIDTH | 32 | Width of input/output streams. |

Table 51 hwpe_stream_sink_realign input control signals.#

| Name | Type | Description |
|---|---|---|
| enable | logic | Unused. |
| strb_valid | logic | Unused. |
| realign | logic | If 1, the realigner is actively used to generate strobed HWPE-Streams. If 0, it is bypassed. |
| first | logic | Strobe at 1 for the first packet in a line. |
| last | logic | Strobe at 1 for the last packet in a line. |
| last_packet | logic | Strobe at 1 for the last packet of the transfer. |
| line_length | logic[15:0] | Unused. |

hwpe_stream_source_realign#

_images/hwpe_stream_source_realign.sv.png

The hwpe_stream_source_realign module realigns HWPE-Streams loaded in a misaligned fashion from memory. Specifically, it rotates strb signals according to its control interface, produced along with addresses in the address generator.
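The principle behind realigning a misaligned load can be sketched in C for the 32-bit case: the word located at a byte offset is rebuilt from the two aligned words that contain it. This also shows why a misaligned line costs one extra memory transaction, as noted for the address generator. The function name realign_load is hypothetical and little-endian byte order is assumed; this is an illustration, not the module's implementation.

```c
#include <stdint.h>

/* Hypothetical illustration (32-bit, little-endian byte order ASSUMED):
 * rebuild the word located at byte offset `offset` (0..3) from the two
 * aligned words that contain it. `lo` is the lower-addressed word. */
uint32_t realign_load(uint32_t lo, uint32_t hi, unsigned offset)
{
    if (offset == 0)
        return lo;                         /* already aligned */
    return (lo >> (8u * offset)) | (hi << (32u - 8u * offset));
}
```

For example, with lo = 0x44332211, hi = 0x88776655 and offset 1, the realigned word is 0x55443322: three bytes come from lo and one from hi.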

Table 52 hwpe_stream_source_realign design-time parameters.#

| Name | Default | Description |
|---|---|---|
| DECOUPLED | 0 | If 1, the module expects a HWPE-MemDecoupled interface instead of HWPE-Mem. |
| DATA_WIDTH | 32 | Width of input/output streams. |
| STRB_FIFO_DEPTH | 4 | Depth of the FIFO queue used for strobes; when full, the realigner will lower its ready signal at the input interface. |

Table 53 hwpe_stream_source_realign input control signals.#

| Name | Type | Description |
|---|---|---|
| enable | logic | If 0, the realigner is fully clock-gated. |
| strb_valid | logic | If 1, the strobe at the strb_i interface is considered valid. |
| realign | logic | If 1, the realigner is actively used to generate strobed HWPE-Streams. If 0, it is bypassed. |
| first | logic | Strobe at 1 for the first packet in a line. |
| last | logic | Strobe at 1 for the last packet in a line. |
| last_packet | logic | Strobe at 1 for the last packet of the transfer. |
| line_length | logic[15:0] | Length of a line in words, rounded by including also incomplete final words. |

Table 54 hwpe_stream_source_realign output flags.#

| Name | Type | Description |
|---|---|---|
| decoupled_stall | logic | Do not use. |

HCI Interconnect modules#

hci_router#

_images/hci_router.sv.png

The hci_router is a specialized router used to build interconnects in a heterogeneous PULP cluster. It takes as input a single in HCI channel of width DWH (typically “wide”, i.e., greater than 32 bits), which is routed without arbitration to DWH/32 adjacent out targets from a set of NB_OUT_CHAN out channels (typically, one per memory bank). Routing is performed by splitting the address of the DWH-bit wide word into an index (bits [$clog2(DWH)+2-1:2]) and an offset (bits [AWH:$clog2(DWH)+2]). The index selects which out targets propagate the request, while the offset is used to compute the target-level address for each out channel. Since word interleaving is assumed, the same address is generally propagated to all targeted out channels. However, if index > NB_OUT_CHAN-DWH/32, the set of selected targets “wraps around”: the first index+DWH/32-NB_OUT_CHAN out channels are also activated, with offset+4 propagated as their address. See https://ieeexplore.ieee.org/document/9903915, Sec. II-A (open access), for details; there, the router is called a shallow router.
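The routing rule can be sketched in C as follows. This is a hypothetical software model: the function router_targets and its argument names are not part of the IP, index and offset are taken as already extracted from the address, and index < n_out and n_lanes <= n_out are assumed.

```c
#include <stdint.h>

/* Hypothetical model of the routing rule. n_out is NB_OUT_CHAN, n_lanes is
 * DWH/32, index is the first out channel addressed and offset the bank-level
 * address. For lane l, out_chan[l] receives the selected channel and
 * out_addr[l] the address propagated to it. */
void router_targets(unsigned index, uint32_t offset,
                    unsigned n_out, unsigned n_lanes,
                    unsigned *out_chan, uint32_t *out_addr)
{
    for (unsigned l = 0; l < n_lanes; l++) {
        unsigned chan = index + l;
        int wrapped = chan >= n_out;       /* ran past the last channel */
        out_chan[l] = wrapped ? chan - n_out : chan;
        out_addr[l] = wrapped ? offset + 4u : offset;
    }
}
```

With NB_OUT_CHAN = 8 and DWH = 128 (four 32-bit lanes), index 2 selects channels 2, 3, 4, 5 with the same address, whereas index 6 selects channels 6, 7 with offset and channels 0, 1 with offset+4.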

Table 55 hci_router design-time parameters.#

| Name | Default | Description |
|---|---|---|
| FIFO_DEPTH | 0 | If > 0, insert a HCI FIFO of this depth after the input channel. |
| NB_OUT_CHAN | 8 | Number of output HCI channels. |

hci_arbiter#

_images/hci_arbiter.sv.png

The hci_arbiter is a specialized arbiter used to build interconnects in a heterogeneous PULP cluster, and in particular to arbitrate between two sets of NB_CHAN input channels, one with “default high” priority (in_high) and the other with “default low” priority (in_low). The arbitration is generally meant to be performed directly at the boundary between the interconnect and the tightly-coupled memory banks. The arbiter uses a starvation-free, unbalanced-priority scheme in which one of the two sets of input channels has, by default, access to most of the bandwidth offered by the output channels. To prevent starvation, depending on the control settings, the other set of input channels is always granted after a given number of stall cycles.

Table 56 hci_arbiter design-time parameters.#

| Name | Default | Description |
|---|---|---|
| NB_CHAN | 2 | Number of HCI channels. |

Table 57 hci_arbiter input control signals.#

| Name | Type | Description |
|---|---|---|
| invert_prio | logic | When 1, invert priorities between in_high and in_low. |
| low_prio_max_stall | logic[7:0] | Maximum number of consecutive stalls on low-priority channel. |
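
The interaction between invert_prio and low_prio_max_stall can be sketched with a cycle-level C model of a single pair of conflicting channels. Everything in this sketch is hypothetical (the types grant_t and arb_state_t, the function arbitrate), and the exact counter policy is an assumption: here the disfavoured side is granted once it has been stalled low_prio_max_stall times in a row, and the real arbiter may count differently.

```c
#include <stdint.h>

typedef enum { GRANT_NONE, GRANT_HIGH, GRANT_LOW } grant_t;

/* Hypothetical per-channel arbiter state: consecutive stalls suffered by the
 * channel that currently has low priority. */
typedef struct { uint8_t stall_cnt; } arb_state_t;

/* One arbitration cycle. req_high/req_low are the requests on in_high/in_low.
 * ASSUMPTION: the low-priority side is granted once it has been stalled
 * low_prio_max_stall times in a row; the real counter policy may differ. */
grant_t arbitrate(arb_state_t *s, int req_high, int req_low,
                  int invert_prio, uint8_t low_prio_max_stall)
{
    int req_fav = invert_prio ? req_low : req_high;   /* favoured side */
    int req_oth = invert_prio ? req_high : req_low;   /* other side */
    grant_t fav = invert_prio ? GRANT_LOW : GRANT_HIGH;
    grant_t oth = invert_prio ? GRANT_HIGH : GRANT_LOW;

    if (req_fav && req_oth) {
        if (s->stall_cnt >= low_prio_max_stall) {
            s->stall_cnt = 0;
            return oth;                    /* anti-starvation grant */
        }
        s->stall_cnt++;
        return fav;
    }
    s->stall_cnt = 0;                      /* no conflict: no stall */
    if (req_fav) return fav;
    if (req_oth) return oth;
    return GRANT_NONE;
}
```

Under these assumptions, with low_prio_max_stall = 2 and both sides requesting in every cycle, the grant sequence is high, high, low, high, high, low, and so on, so the low-priority side still receives one access out of three.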

hci_interconnect#

_images/hci_interconnect.sv.png

The hci_interconnect module is a convenience top-level for the PULP heterogeneous cluster interconnect. It wraps a logarithmic interconnect (LIC) and an optional HCI router, which realize the LIC branch and the HWPE branch of the interconnect, respectively. The two branches are (optionally) arbitrated via a HCI arbiter.

Table 58 hci_interconnect design-time parameters.#

| Name | Default | Description |
|---|---|---|
| N_HWPE | 1 | Number of HWPEs attached as initiator to the interconnect (LIC or HWPE branch). |
| N_CORE | 8 | Number of cores attached as initiator to the interconnect (LIC branch). |
| N_DMA | 4 | Number of DMA ports attached as initiator to the interconnect (LIC branch). |
| N_EXT | 4 | Number of external ports attached as initiator to the interconnect (LIC branch). |
| N_MEM | 16 | Number of memory banks attached as target to the interconnect. |
| TS_BIT | 21 | Bit passed to LIC to define test&set aliased memory region. |
| IW | N_HWPE+N_CORE+N_DMA+N_EXT | ID Width. |
| EXPFIFO | 0 | Depth of HCI router FIFO. |
| SEL_LIC | 0 | Kind of LIC to instantiate (0=regular L1, 1=L2). |

Control interface modules (HWPE-Periph)#

The control interface of HWPEs exposes a HWPE-Periph interface that is used to program a memory-mapped register file. Several IPs can be used to compose the control interface, delivering a standard accelerator control interface that is described below. Modules performing these functions can be found within the rtl/ subfolder of the hwpe-ctrl repository.