Artificial Intelligence (AI) has been one of the most dominant factors driving technology innovation over the last decade and has been exploited in a huge variety of fields ranging from image recognition and natural language processing to autonomous driving and modeling of complex physical systems. As integrated semiconductor devices become smaller and faster, systems-on-chip (SoCs) become more and more complex, enabling pocket-size, wearable, battery-powered systems to efficiently support the computationally expensive algorithms at the core of complex AI models while running on a limited power budget. Due to the large number of parameters, software-programmable SoCs are preferred thanks to their versatility and short time-to-market. Small cameras that recognize faces, microphones that filter background noise or recognize voice commands, wearable devices that detect epilepsy attacks, and implantable devices that constantly monitor tens of body parameters and release drugs accordingly to prevent organ failures are just a few examples of what smart edge devices represent in the AI revolution we are experiencing.
One of the main bottlenecks for the performance and energy efficiency of next-generation SoCs resides in the limited memory bandwidth inherent in the traditional von Neumann architecture. One idea to overcome this limitation is to bring computation within the memory subsystem, to better exploit the available memory bandwidth and leverage data reuse more efficiently. This computational paradigm is referred to as Processing-in-Memory (PiM) or Compute-in-Memory (CiM). Further benefits can be achieved by leveraging the Single Instruction, Multiple Data (SIMD) approach, where the same operation (i.e., an instruction) operates on a multitude of data (e.g., a vector or matrix), therefore significantly reducing the number of instructions loaded from memory and contributing to reducing the system’s energy consumption.
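To make the SIMD idea concrete, the illustrative C sketch below (not actual Carus code; the function names and the vector instruction shown are only an analogy) contrasts a scalar element-wise addition, which fetches one add instruction per element, with the single vector instruction a SIMD unit would use for the whole array.

```c
#include <stdint.h>
#include <stddef.h>

#define N 8  /* illustrative vector length */

/* Scalar version: the same add instruction is fetched and decoded
 * once per element, N times in total. */
void scalar_add(const int32_t *a, const int32_t *b, int32_t *c)
{
    for (size_t i = 0; i < N; i++)
        c[i] = a[i] + b[i];
}

/* A SIMD/vector unit would express the entire loop above as a single
 * instruction operating on all N elements at once, e.g. (RISC-V vector
 * assembly shown purely as an analogy):
 *
 *     vadd.vv  vc, va, vb
 *
 * so only one instruction is loaded from memory instead of N. */
```

The energy saving comes from amortizing one instruction fetch and decode over many data elements, which is exactly what a near-memory SIMD unit exploits.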
The Embedded Systems Laboratory (ESL) at the Swiss Federal Institute of Technology Lausanne (EPFL) has developed two SRAM-based low-power architectures (known as Caesar and Carus) that normally behave as traditional memories, but also offer scalar and vector computing capabilities (i.e., arithmetic and logic operations such as addition, AND, OR, XOR, multiplication, multiply-add, etc.) between two or more memory words. Because data is processed within the memory layout itself, these smart near-memory IPs eliminate the need to move operands through the system bus into the local memory elements of processing units that are physically far from the memory (e.g., inside the system CPU).
In addition to memory architectures, ESL has also developed X-HEEP (eXtendable Heterogeneous Energy-Efficient Platform). It is an open-source, configurable, and extensible single-core RISC-V 32-bit Microcontroller Unit (MCU), sponsored by the EcoCloud Sustainable Computing center of EPFL. It is based on many third-party open-source IPs as well as in-house IPs developed at the ESL jointly with other EPFL laboratories. X-HEEP provides a framework to configure and extend the MCU and experiment with it as an RTL simulation model (Verilator, Questasim, or VCS), a hardware prototype on a Xilinx FPGA, and even tape it out as a standalone ASIC circuit. The framework also provides the RISC-V software toolchain and the SDK that are necessary to deploy applications on the MCU.
Recently, the X-HEEP system has been extended to integrate Caesar- and Carus-based memories besides traditional SRAM banks. When running in computing mode, they can be programmed or controlled using dedicated software routines that implement application-specific computing kernels (e.g., matrix multiplication). Otherwise, they operate as traditional memories. By definition, these near-memory computing units exclusively process data that is directly mapped inside their private memory space (i.e., the memory banks instantiated within the IP itself).
From a low-level point of view, this approach reduces data movement and memory bandwidth, thus increasing the system’s energy efficiency. However, from an application point of view, it limits the size of the data that can be processed by the in-memory computing kernel (as it must fit inside a single memory bank) and does not allow for multi-memory bank parallelism opportunities.
Many edge AI applications rely on fixed-point operations, replacing the more expensive floating-point operations used when deploying the same machine learning models on more powerful hardware. As of today, Carus and Caesar support an integer datapath with generic 32-, 16-, and 8-bit instructions. None of the operations is specifically designed to deal with the fixed-point data format (such as additions or multiplications followed by rounding and shifting instructions). Therefore, fixed-point operations must be emulated in software in the current implementation.
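As a hypothetical sketch of what this software emulation looks like, the C snippet below shows a 16-bit fixed-point (Q1.15 format) multiplication built from generic integer instructions: a widening multiply, a rounding addition, and an arithmetic right shift. A dedicated fixed-point instruction, as proposed in this project, would fuse these steps into a single hardware operation. The function name and Q-format choice are illustrative, not taken from the Carus codebase.

```c
#include <stdint.h>

/* Q1.15 fixed-point multiply emulated with generic integer ops:
 * values in [-1, 1) are stored as int16_t scaled by 2^15. */
static inline int16_t q15_mul(int16_t a, int16_t b)
{
    int32_t prod = (int32_t)a * (int32_t)b;  /* 32-bit intermediate product (Q2.30) */
    prod += 1 << 14;                         /* round to nearest: add 0.5 ulp        */
    return (int16_t)(prod >> 15);            /* rescale back to Q1.15                */
}
```

For example, multiplying 0.5 by 0.5 (both encoded as 16384 in Q1.15) yields 8192, i.e., 0.25. In hardware, a fused fixed-point multiply would perform all three steps in one instruction instead of three.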
This thesis aims to extend the Instruction Set Architecture (ISA) of the Carus near-memory computing IP with fixed-point instructions to increase performance and energy efficiency.
Throughout the project, the student will learn:
- How the Carus NMC IP works and how to offload computationally expensive tasks to it within the X-HEEP framework.
- How to extend the Carus NMC IP decoder and execution pipeline to support fixed-point instructions such as additions, subtractions, and multiplications with rounding and shift in 32-, 16-, and 8-bit modes.
- How to verify the functionality of the new instructions with randomized inputs.
- How to verify that the introduced modifications do not alter the timing characteristics of the system, and how to iterate on the architecture if they do (e.g., with techniques such as multicycle logic paths).
- [Optional] How to update a few existing applications to use the new fixed-point instructions and test them on the system deployed on an FPGA.
- How to work with version control (Git) and third-party, open-source repositories.
- How to work in a team of people all contributing to the same project.
The project will be carried out at the ESL at EPFL, one of the world’s top-class universities. ESL is an active group (24 PhD students among 45 members) involved in many research aspects, therefore providing a stimulating research environment. The student will be under the supervision of Prof. David Atienza, Dr. Davide Schiavone, and Dr. Michele Caon.
Project objectives:
- Design a new set of fixed-point instructions such as addition, subtraction, multiplication, and multiply-add supporting 32-, 16-, and 8-bit data elements by extending the Carus NMC IP decoder and execution pipeline.
- Verify that such instructions work correctly with randomized tests by extending the Carus testbench.
- Verify that the timing characteristics (e.g., the maximum operating frequency) of the Carus ASIC implementation do not degrade and that the area overhead is negligible by running its existing physical implementation flow. If they do, iterate on the hardware design.
- [Optional] Update existing applications to leverage the new instructions and run the application on the system’s hardware model deployed on an FPGA.
Required knowledge and skills:
- RTL design and FPGA implementation in SystemVerilog
- Good understanding of memory architectures and microcontrollers
- Good analytical skills
- Good background in computer architecture
- Teamwork and Git
Appreciated skills:
- Scientific curiosity
- Good communication skills
- Advanced English
Type of work: 40% theory analysis, 60% SW/HW co-design and simulation