Bibliography & References

Disclaimer: most of these references reflect my own point of view as Francesco Conti, maintainer and dictator of this site. Most papers relate to work on HWPEs performed in the context of the PULP project during my activity at the University of Bologna (2012-ongoing) and ETH Zurich (2015-2020). Although there is an Other authors section, several papers that use the HWPE IPs and/or a similar template may be missing. If you spot a missing reference, let me know and I’ll be happy to amend the list.

Hardware Accelerators based on HWPE template

Softex: A. Belano, Y. Tortorella, A. Garofalo, L. Benini, D. Rossi, and F. Conti, “A Flexible Template for Edge Generative AI with High-Accuracy Accelerated Softmax and GELU,” IEEE J. Emerg. Sel. Topics Circuits Syst., vol. 15, no. 2, pp. 200–216, 2025. doi: 10.1109/JETCAS.2025.3562734 (IEEE JETCAS 2026 Best Paper Award).
NEureka: A. S. Prasad, L. Benini, and F. Conti, “Specialization meets Flexibility: a Heterogeneous Architecture for High-Efficiency, High-flexibility AR/VR Processing,” in 2023 60th ACM/IEEE Design Automation Conference (DAC), DAC 2023. [IEEE]
SNE: A. Di Mauro, A. S. Prasad, Z. Huang, M. Spallanzani, F. Conti, and L. Benini, “SNE: an Energy-Proportional Digital Accelerator for Sparse Event-Based Convolutions,” in Design, Automation & Test in Europe Conference & Exhibition, DATE 2022. [arXiv]
RedMulE: Y. Tortorella, L. Bertaccini, D. Rossi, L. Benini, and F. Conti, “RedMulE: A Compact FP16 Matrix-Multiplication Accelerator for Adaptive Deep Learning on RISC-V-Based Ultra-Low-Power SoCs,” in Design, Automation & Test in Europe Conference & Exhibition, DATE 2022. [arXiv extension]
IMA: A. Garofalo, G. Ottavi, F. Conti, G. Karunaratne, I. Boybat, L. Benini, and D. Rossi, “A Heterogeneous In-Memory Computing Cluster for Flexible End-to-End Inference of Real-World Deep Neural Networks,” in IEEE Journal on Emerging and Selected Topics in Circuits and Systems, vol. 12, no. 2, pp. 422-435, June 2022, doi: 10.1109/JETCAS.2022.3170152. [arXiv]
FFT: L. Bertaccini, L. Benini, and F. Conti, “To Buffer, or Not to Buffer? A Case Study on FFT Accelerators for Ultra-Low-Power Multicore Clusters”, in IEEE International Conference on Application-Specific Systems, Architectures and Processors, ASAP 2021. [IEEE]
XNE: F. Conti, P. D. Schiavone, and L. Benini, “XNOR Neural Engine: A Hardware Accelerator IP for 21.6-fJ/op Binary Neural Network Inference,” IEEE Trans. Comput. Aided Des. Integr. Circuits Syst., vol. 37, no. 11, pp. 2940–2951, 2018, doi: 10.1109/TCAD.2018.2857019 (ESWEEK CODES+ISSS 2018 Best Paper Award). [arXiv]
HWCE: F. Conti and L. Benini, “A ultra-low-energy convolution engine for fast brain-inspired vision in multicore clusters,” in Proceedings of the 2015 Design, Automation & Test in Europe Conference & Exhibition, DATE 2015, Grenoble, France, March 9-13, 2015, 2015, pp. 683–688. [ACM]

HWPEs on FPGA

G. Bellocchi, A. Capotondi, F. Conti, and A. Marongiu, “A RISC-V-based FPGA Overlay to Simplify Embedded Accelerator Deployment”, in DSD 2021 Conference.
P. Meloni, D. Loi, G. Deriu, M. Carreras, F. Conti, A. Capotondi, and D. Rossi, “Exploring NEURAghe: A Customizable Template for APSoC-Based CNN Inference at the Edge,” IEEE Embed. Syst. Lett., vol. 12, no. 2, pp. 62–65, 2020, doi: 10.1109/LES.2019.2947312.
P. Meloni et al., “Optimization and deployment of CNNs at the edge: the ALOHA experience,” in Proceedings of the 16th ACM International Conference on Computing Frontiers, CF 2019, Alghero, Italy, April 30 - May 2, 2019, 2019, pp. 326–332, doi: 10.1145/3310273.3323435.
P. Meloni, G. Deriu, F. Conti, I. Loi, L. Raffo, and L. Benini, “Curbing the roofline: a scalable and flexible architecture for CNNs on FPGA,” in Proceedings of the ACM International Conference on Computing Frontiers, CF’16, Como, Italy, May 16-19, 2016, 2016, pp. 376–383, doi: 10.1145/2903150.2911715.
P. Meloni, G. Deriu, F. Conti, I. Loi, L. Raffo, and L. Benini, “A high-efficiency runtime reconfigurable IP for CNN acceleration on a mid-range all-programmable SoC,” in International Conference on ReConFigurable Computing and FPGAs, ReConFig 2016, Cancun, Mexico, November 30 - Dec. 2, 2016, 2016, pp. 1–8, doi: 10.1109/ReConFig.2016.7857144.

HWPE-augmented SoCs

Siracusa: A. S. Prasad, M. Scherer, F. Conti, D. Rossi, A. Di Mauro, M. Eggimann, J. T. Gomez, Z. Li, S. S. Sarwar, Z. Wang, B. De Salvo, and L. Benini, “Siracusa: A 16 nm Heterogenous RISC-V SoC for Extended Reality with At-MRAM Neural Engine.” [arXiv]
Siracusa: M. Scherer, M. Eggimann, A. Di Mauro, A. S. Prasad, F. Conti, D. Rossi, J. T. Gomez, Z. Li, S. S. Sarwar, Z. Wang, B. De Salvo, and L. Benini, “Siracusa: A Low-Power On-Sensor RISC-V SoC for Extended Reality Visual Processing in 16nm CMOS.” ESSCIRC 2023-IEEE 49th European Solid State Circuits Conference (ESSCIRC), ESSCIRC 2023.
Marsellus: F. Conti, G. Paulin, A. Garofalo, D. Rossi, A. Di Mauro, G. Rutishauser, G. Ottavi, M. Eggimann, H. Okuhara, and L. Benini, “Marsellus: A Heterogeneous RISC-V AI-IoT End-Node SoC with 2-to-8b DNN Acceleration and 30%-Boost Adaptive Body Biasing,” in IEEE Journal of Solid-State Circuits, doi: 10.1109/JSSC.2023.3318301. [arXiv]
Marsellus: F. Conti, D. Rossi, G. Paulin, A. Garofalo, A. Di Mauro, G. Rutishauser, G. Ottavi, M. Eggimann, H. Okuhara, V. Huard, O. Montfort, L. Jure, N. Exibard, P. Gouedo, M. Louvat, E. Botte, and L. Benini, “A 12.4 TOPS/W@ 136GOPS AI-IoT system-on-chip with 16 RISC-V, 2-to-8b precision-scalable DNN acceleration and 30%-boost adaptive body biasing,” 2023 IEEE International Solid-State Circuits Conference (ISSCC), ISSCC 2023.
Echoes: M. Sinigaglia, L. Bertaccini, L. Valente, A. Garofalo, S. Benatti, L. Benini, F. Conti, and D. Rossi, “ECHOES: a 200 GOPS/W Frequency Domain SoC with FFT Processor and I2S DSP for Flexible Data Acquisition from Microphone Arrays,” 2023 IEEE International Symposium on Circuits and Systems (ISCAS), Monterey, CA, USA, 2023, pp. 1-5, doi: 10.1109/ISCAS46773.2023.10181862. [arXiv]
Darkside: A. Garofalo, Y. Tortorella, M. Perotti, L. Valente, A. Nadalini, L. Benini, D. Rossi, and F. Conti, “DARKSIDE: A Heterogeneous RISC-V Compute Cluster for Extreme-Edge On-Chip DNN Inference and Training,” in IEEE Open Journal of the Solid-State Circuits Society, vol. 2, pp. 231-243, 2022, doi: 10.1109/OJSSCS.2022.3210082. [open access]
Darkside: A. Garofalo, M. Perotti, L. Valente, Y. Tortorella, A. Nadalini, L. Benini, D. Rossi, and F. Conti, “DARKSIDE: 2.6GFLOPS, 8.7mW Heterogeneous RISC-V Cluster for Extreme-Edge On-Chip DNN Inference and Training”, ESSCIRC 2022- IEEE 48th European Solid State Circuits Conference (ESSCIRC), Milan, Italy, 2022, pp. 273-276, doi: 10.1109/ESSCIRC55480.2022.9911384.
Vega: D. Rossi, F. Conti, M. Eggimann, A. Di Mauro, G. Tagliavini, S. Mach, M. Guermandi, A. Pullini, I. Loi, J. Chen, E. Flamand, and L. Benini, “Vega: A 10-Core SoC for IoT End-Nodes with DNN Acceleration and Cognitive Wake-Up from MRAM-Based State-Retentive Sleep Mode,” in IEEE Journal of Solid-State Circuits. [arXiv]
Vega: D. Rossi, F. Conti, M. Eggimann, S. Mach, A. Di Mauro, M. Guermandi, G. Tagliavini, A. Pullini, I. Loi, J. Chen, E. Flamand, and L. Benini, “A 1.3TOPS/W @ 32GOPS Fully Integrated 10-Core SoC for IoT End-Nodes with 1.7μW Cognitive Wake-Up From MRAM-Based State-Retentive Sleep Mode,” in Proceedings of the 2021 IEEE International Solid-State Circuits Conference (ISSCC).
Quentin: A. Di Mauro, F. Conti, P. D. Schiavone, D. Rossi, and L. Benini, “Always-On 674uW @ 4GOP/s Error Resilient Binary Neural Networks With Aggressive SRAM Voltage Scaling on a 22-nm IoT End-Node,” IEEE Trans. Circuits Syst. I Regul. Pap., vol. 67–I, no. 11, pp. 3905–3918, 2020, doi: 10.1109/TCSI.2020.3012576.
GAP8: E. Flamand, D. Rossi, F. Conti, I. Loi, A. Pullini, F. Rotenberg, and L. Benini, “GAP-8: A RISC-V SoC for AI at the Edge of the IoT,” in 29th IEEE International Conference on Application-specific Systems, Architectures and Processors, ASAP 2018, Milano, Italy, July 10-12, 2018, 2018, pp. 1–4, doi: 10.1109/ASAP.2018.8445101.
Mia Wallace: A. Pullini, F. Conti, D. Rossi, I. Loi, M. Gautschi, and L. Benini, “A Heterogeneous Multicore System on Chip for Energy Efficient Brain Inspired Computing,” IEEE Trans. Circuits Syst. II Express Briefs, vol. 65–II, no. 8, pp. 1094–1098, 2018, doi: 10.1109/TCSII.2017.2652982.
Fulmine: F. Conti et al., “An IoT Endpoint System-on-Chip for Secure and Energy-Efficient Near-Sensor Analytics,” IEEE Trans. Circuits Syst. I Regul. Pap., vol. 64–I, no. 9, pp. 2481–2494, 2017, doi: 10.1109/TCSI.2017.2698019 (IEEE CAS Darlington Award 2020). [arXiv]
Fulmine: F. K. Gürkaynak, R. Schilling, M. Muehlberghuber, F. Conti, S. Mangard, and L. Benini, “Multi-core data analytics SoC with a flexible 1.76 Gbit/s AES-XTS cryptographic accelerator in 65 nm CMOS,” in Proceedings of the Fourth Workshop on Cryptography and Security in Computing Systems, CS2 at HiPEAC 2017, Stockholm, Sweden, January 24, 2017, 2017, pp. 19–24, doi: 10.1145/3031836.3031840.
Mia Wallace: A. Pullini, F. Conti, D. Rossi, I. Loi, M. Gautschi, and L. Benini, “A heterogeneous multi-core system-on-chip for energy efficient brain inspired vision,” in IEEE International Symposium on Circuits and Systems, ISCAS 2016, Montréal, QC, Canada, May 22-25, 2016, 2016, p. 2910, doi: 10.1109/ISCAS.2016.7539213.

HWPE template

F. Conti, C. Pilkington, A. Marongiu, and L. Benini, “He-P2012: Architectural heterogeneity exploration on a scalable many-core platform,” in IEEE 25th International Conference on Application-Specific Systems, Architectures and Processors, ASAP 2014, Zurich, Switzerland, June 18-20, 2014, 2014, pp. 114–120, doi: 10.1109/ASAP.2014.6868645.
P. Burgio, G. Tagliavini, F. Conti, A. Marongiu, and L. Benini, “Tightly-coupled hardware support to dynamic parallelism acceleration in embedded shared memory clusters,” in Design, Automation & Test in Europe Conference & Exhibition, DATE 2014, Dresden, Germany, March 24-28, 2014, 2014, pp. 1–6, doi: 10.7873/DATE.2014.169.
F. Conti, A. Marongiu, and L. Benini, “Synthesis-friendly techniques for tightly-coupled integration of hardware accelerators into shared-memory multi-core clusters,” in Proceedings of the International Conference on Hardware/Software Codesign and System Synthesis, CODES+ISSS 2013, Montreal, QC, Canada, September 29 - October 4, 2013, 2013, pp. 5:1–5:10, doi: 10.1109/CODES-ISSS.2013.6658992.
M. Dehyadegari, A. Marongiu, M. R. Kakoee, S. Mohammadi, N. Yazdani and L. Benini, “Architecture Support for Tightly-Coupled Multi-Core Clusters with Shared-Memory HW Accelerators,” in IEEE Transactions on Computers, vol. 64, no. 8, pp. 2132-2144, 1 Aug. 2015, doi: 10.1109/TC.2014.2360522.

Other authors

TinyVers from Marian Verhelst’s team at KU Leuven: V. Jain, S. Giraldo, J. De Roose, B. Boons, L. Mei, M. Verhelst, “TinyVers: A 0.8-17 TOPS/W, 1.7 μW-20 mW, Tiny Versatile System-on-chip with State-Retentive eMRAM for Machine Learning Inference at the Extreme Edge”, VLSI 2022, doi: 10.1109/VLSITechnologyandCir46769.2022.9830409.
DIANA from Marian Verhelst’s team at KU Leuven: K. Ueyoshi et al., “DIANA: An End-to-End Energy-Efficient Digital and ANAlog Hybrid Neural Network SoC,” 2022 IEEE International Solid-State Circuits Conference (ISSCC), 2022, pp. 1-3, doi: 10.1109/ISSCC42614.2022.9731716.
PULPO from Christoph Studer’s team at ETH Zurich: O. Castañeda, L. Benini and C. Studer, “A 283 pJ/b 240 Mb/s Floating-Point Baseband Accelerator for Massive MU-MIMO in 22FDX,” ESSCIRC 2022- IEEE 48th European Solid State Circuits Conference (ESSCIRC), 2022, pp. 357-360, doi: 10.1109/ESSCIRC55480.2022.9911311.
Bit-Serial NE: M. Capra, F. Conti, and M. Martina, “A Multi-Precision Bit-Serial Hardware Accelerator IP for Deep Learning Enabled Internet-of-Things”, in IEEE MWSCAS 2021.
