The ever-increasing adoption of data-driven algorithms across a wide range of applications
has unlocked entirely new possibilities for smarter human-machine interaction, intelligent
features, and autonomous operation. We now expect our battery-powered digital
companions to support features such as image recognition, virtual reality,
always-on natural language understanding, and intelligent health monitoring. However, the
deployment of such features on edge devices is currently hindered by their limited
computing power on the one hand and the bandwidth limitations of the current network
infrastructure on the other. Moving data processing close to the data source and the user is
therefore paramount to enable the seamless yet sustainable diffusion of smart devices in our everyday lives.
This has driven the exploration of new computing paradigms that address the inefficiencies
of the traditional Von Neumann architecture, which serves as the basis for most
microcontroller-based edge devices. In particular, significant effort has been invested in
mitigating the memory bottleneck by bringing computation within the memory subsystem
itself. This approach allows for better exploitation of the available memory bandwidth at the
output of the on-chip SRAM macros, rather than relying on the system bus to move data
elements between the SRAM and the processing elements inside the CPU. This paradigm is
commonly referred to as Processing-in-Memory (PiM) or Compute-in-Memory (CiM) and has
been proven to effectively reduce energy consumption. Further benefits can be achieved by
combining the PiM paradigm with existing data-centric computing approaches such as Single Instruction, Multiple Data (SIMD), where the same operation (i.e., an instruction) operates on multiple data elements (e.g., a vector or matrix), thereby reducing instruction fetch overhead and the overall energy consumption of the system.
Building upon these considerations, the Embedded Systems Laboratory (ESL) at EPFL has
developed NM-Carus, a near-memory computing IP that tightly integrates off-the-shelf SRAM with a programmable controller and a vector-capable execution engine supporting multi-precision integer and fixed-point data types. The resulting circuit exposes a memory-like slave interface to the host system and provides a transparent memory mode together with an autonomous computing mode in which a user-programmed kernel is executed on NM-Carus data. From a software perspective, the NM-Carus CPU-based controller can be programmed using the RISC-V RV32E instruction set, complemented by a custom, PiM-oriented vector extension that uses the private data memory as a large Vector Register File (VRF).
In its current form, NM-Carus delivers 10-100× higher performance than the CPU-only baseline when integrated into a host RISC-V microcontroller (MCU) executing linear algebra workloads on 8-bit data, with 7-60× lower energy consumption at a 25% area overhead. While these results are remarkable compared to CPU-based systems, they do not yet match the performance and energy efficiency of Application-Specific Integrated Circuits (ASICs). One critical reason is the way NM-Carus handles multiply-accumulate (MACC) operations, the fundamental building blocks of many linear algebra routines used in machine learning and other data-driven workloads. Instead of relying on a dedicated accumulator, NM-Carus's Arithmetic-Logic Units (ALUs) store partial results inside the internal SRAM, introducing expensive and redundant memory accesses that inflate the overall energy budget.
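The cost difference can be sketched in plain C (a behavioral illustration, not NM-Carus code): the first routine mimics the current scheme, where every element-wise MACC step reads and writes a partial-result vector in memory, while the second keeps the running sum in a register, as a dedicated accumulator would.

```c
#include <stdint.h>
#include <stddef.h>

/* Memory-backed MACC: each step reads and writes the partial-result
 * vector, analogous to keeping partial sums in the internal SRAM. */
void macc_mem(const int8_t *a, const int8_t *b, int32_t *partial, size_t n) {
    for (size_t i = 0; i < n; i++) {
        int32_t p = partial[i];           /* extra memory read  */
        p += (int32_t)a[i] * b[i];
        partial[i] = p;                   /* extra memory write */
    }
}

/* Accumulator-backed reduction: the running sum never leaves a
 * register, so each step costs only the two operand reads. */
int32_t dot_acc(const int8_t *a, const int8_t *b, size_t n) {
    int32_t acc = 0;                      /* dedicated accumulator */
    for (size_t i = 0; i < n; i++)
        acc += (int32_t)a[i] * b[i];
    return acc;
}
```

Both routines compute the same products; only the traffic to memory differs, which is precisely the overhead the proposed accumulator is meant to remove.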

Project Description and Main Goal

The primary goal of this thesis is to optimize the energy cost and performance of
multiply-accumulate (MACC) operations in NM-Carus. This will be achieved by supporting a
new class of instructions that define operations between the data stored inside NM-Carus’s
private memory and a dedicated accumulation register. The expected outcome is a peak
throughput of four 8-bit operations per cycle in each ALU.
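A simple behavioral model of the target throughput (a sketch under assumed lane widths and accumulator layout, not the actual RTL) treats one ALU as four parallel 8-bit MACC lanes feeding a dedicated accumulation register each cycle:

```c
#include <stdint.h>

/* Hypothetical model of one ALU slice: four 8-bit MACC lanes per
 * "cycle", accumulating into a dedicated register rather than SRAM.
 * Lane count and 32-bit per-lane accumulators are assumptions. */
typedef struct { int32_t lane[4]; } acc_reg_t;

void macc4_step(acc_reg_t *acc, const int8_t a[4], const int8_t b[4]) {
    for (int l = 0; l < 4; l++)           /* four 8-bit ops, one cycle */
        acc->lane[l] += (int32_t)a[l] * b[l];
}
```

In hardware, the loop body corresponds to four multipliers and adders operating in parallel; the model only fixes the intended per-cycle semantics.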

Throughout this project, the student will learn:
● How the NM-Carus Near-Memory Computing (NMC) IP works and how to offload
computationally expensive tasks to it within the X-HEEP MCU framework.
● How to extend NM-Carus’s decoder and execution engine to support optimized
multiply-accumulate instructions operating on a dedicated accumulator to reduce
memory accesses and increase throughput.
● To assess the impact of these additions on the timing, area, and power
characteristics of NM-Carus.
● To add assembler support for the new instructions to the RISC-V toolchain.
● To write optimized assembly kernels to execute common linear-algebra tasks (e.g.,
matrix multiplication, convolution).
● To write comprehensive regression tests to verify the functionality of the added
instructions through extensive RTL simulation.
● To assess the system-level impact of the new instructions in terms of throughput and
area overhead by integrating the updated NM-Carus IP into a host MCU.
● To compare the system-level performance and energy efficiency of an MCU equipped
with NM-Carus against other variants integrating different accelerators and
coprocessors (e.g., RISC-V compliant vector coprocessors, embedded GPUs, systolic
arrays).
● [Optional] To build an MCU system with multiple NM-Carus instances and deploy
complex, real-world applications on it (e.g., neural networks) to extract
performance figures, which can then be compared with other existing embedded
systems in the same domain (low-power edge computing devices).
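As a plain-C reference for the linear-algebra kernels mentioned above (not the optimized assembly the project targets), a matrix multiply naturally reduces each output element into a register-held accumulator, which is the pattern the new instructions are meant to accelerate:

```c
#include <stdint.h>
#include <stddef.h>

/* Reference matrix multiply C = A (m x k) * B (k x n), row-major,
 * on 8-bit inputs with 32-bit accumulation. Each C[i][j] is reduced
 * into a register accumulator instead of a memory-resident partial. */
void matmul_i8(const int8_t *A, const int8_t *B, int32_t *C,
               size_t m, size_t k, size_t n) {
    for (size_t i = 0; i < m; i++)
        for (size_t j = 0; j < n; j++) {
            int32_t acc = 0;              /* accumulator for C[i][j] */
            for (size_t p = 0; p < k; p++)
                acc += (int32_t)A[i * k + p] * B[p * n + j];
            C[i * n + j] = acc;
        }
}
```

An optimized NM-Carus kernel would vectorize the inner reduction, but the accumulation structure stays the same.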

The project will be carried out entirely at the Embedded Systems Laboratory (ESL) of EPFL,
one of the world's top universities. ESL is an active group (25 PhD students among 51
members) working on several research areas, providing a stimulating research and learning environment. The student will be supervised by Prof. David Atienza, Dr. Michele
Caon, and Dr. Davide Schiavone.

Project Objectives

The work is subdivided into three major, sequential milestones, cumulatively contributing to
the final grade. The first milestone is both necessary and sufficient to reach the minimum
grade (4), while the latter two are required to reach the maximum grade (6).

  1. Optimize the performance of MACC operations in NM-Carus (0-4 points)
    a. Define a new class of vector instructions that take as inputs either a
    scalar operand (from the CPU GPRs) or a vector operand (from the SRAM-based
    VRF) and the current value of a special-purpose accumulation vector register.
    The new instruction variants (*.av and *.ax for accumulator-vector and
    accumulator-scalar, respectively) shall be provided for all currently supported
    arithmetic and move/slide operations (i.e., not only MACC operations).
    b. Implement the new instruction class by modifying the RTL description of the
    decode and execution stages inside the vector pipeline of NM-Carus. The
    architecture of NM-Carus’s ALU shall be modified to provide a throughput of
    four 8-bit MACC operations per cycle (e.g., by adding the necessary arithmetic
    units).
    c. Extensively test the modified RTL description of NM-Carus by adding support
    for the newly added instructions to the RISC-V GCC assembler and writing a
    comprehensive set of computing kernels using the new instruction variants, to
    run in RTL simulation. This requires modifying the existing C++
    UVM-like testbench, particularly the reference model inside the scoreboard
    unit.
    d. Assess the cost of hardware support for the new instruction class in terms of
    timing and circuit area by performing the ASIC synthesis of the original and
    modified circuits on a low-power standard cell library. If necessary, modify the
    RTL description of the circuit to achieve better timing and area characteristics.
  2. Evaluate the system-level performance of the updated NM-Carus (0-1 points)
    a. Integrate the updated NM-Carus IP into an already-existing host MCU based
    on the X-HEEP platform.
    b. Synthesize, place, and route the MCU using a low-power standard cell library.
    c. Modify the existing applications to make use of the newly added instructions.
    d. Evaluate the benefits brought by the new instructions at system level in terms
    of throughput and energy consumption by performing RTL and post-layout
    simulations.
  3. Compare the performance of NM-Carus with other types of accelerators (0-1 points)
    a. Extend the MCU assembled in the previous milestone with at least one other existing
    accelerator/coprocessor (e.g., RISC-V compliant vector coprocessors,
    embedded GPUs, systolic arrays).
    b. Select and deploy one or more meaningful real-world applications (e.g.,
    edge-oriented AI workloads) on the MCU to evaluate the system-level
    throughput and energy efficiency when offloading computation to the
    different accelerators/coprocessors.
    c. [Optional] Explore the trade-offs of different NM-Carus
    configurations, varying the number of NM-Carus instances, their memory
    capacity, and computation parallelism.
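The semantics of the *.av and *.ax variants defined in milestone 1a can be captured by a small behavioral model, shown here for an add (element count, data types, and exact behavior are illustrative assumptions, not the final ISA definition):

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical semantics of the proposed variants, for an 8-bit add:
 *   vadd.av  acc[i] <- acc[i] + vs[i]   (accumulator-vector)
 *   vadd.ax  acc[i] <- acc[i] + rs      (accumulator-scalar broadcast)
 * VLEN and the 32-bit accumulator element width are assumptions. */
enum { VLEN = 8 };

void vadd_av(int32_t acc[VLEN], const int8_t vs[VLEN]) {
    for (size_t i = 0; i < VLEN; i++)
        acc[i] += vs[i];                  /* element-wise accumulate */
}

void vadd_ax(int32_t acc[VLEN], int8_t rs) {
    for (size_t i = 0; i < VLEN; i++)
        acc[i] += rs;                     /* scalar broadcast from GPR */
}
```

A model of this kind is also a natural starting point for the reference model inside the testbench scoreboard mentioned in milestone 1c.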

Required knowledge and skills

● Strong background and advanced understanding of computer architecture and
microprocessor design.
● In-depth knowledge of Reduced Instruction Set Computer (RISC) architectures, from
both hardware and software perspectives.
● Extensive experience with the digital implementation flow, from RTL design to place
and route.
● Proficient in low-level programming, ideally with RISC-V assembly.
● Solid experience with object-oriented programming in C++.
● Solid experience in collaborative software and hardware development using Git.
● Strong knowledge of at least one high-level programming language (e.g., Python).
● Good analytical skills.

Appreciated skills:

● Good understanding and experience with machine learning algorithms, optimization
workflows, and deployment frameworks.
● Fundamental experience with hardware acceleration of linear algebra computational
kernels.
● Fundamental experience in writing scientific publications and navigating the state of
the art.
● Good understanding of the digital verification flow, ideally with the UVM.
● Advanced proficiency in English.
● Effective communication skills.

Type of work:

● 70% hardware/software co-design, implementation, verification, and validation.
● 15% software development and deployment.
● 15% theory and state-of-the-art analysis.
