Maxim Integrated MAX78000 AI Accelerator for Embedded IoT Applications
In the previous blog post, we covered the basics of Artificial Intelligence and Machine Learning (AI/ML) and invited you to sign up for a free on-demand webinar, in which Stanislaw Klinke, our AI/ML Application Engineer, introduces the basics of this technology, discusses some popular ML tools and frameworks such as Caffe and TensorFlow, and proposes some practical implementation solutions for Cloud and Edge, based on Xilinx’s powerful heterogeneous programmable hardware.
That blog post also briefly touches on AI/ML in the IoT domain, underlining that in the world of IoT, every milliwatt counts. Constrained by small form factors, tight power budgets, and cost, typical IoT devices cannot afford dedicated inference hardware. Instead, they rely on a software implementation running on the microcontroller (MCU). Because inference is computationally demanding, this restricts them to the most rudimentary AI tasks, such as simple wake-word spotting or gesture detection. In such applications, the neural network (NN) models must be heavily simplified, and inference still carries a high power consumption penalty.
Inference at the (IoT) Edge
As mentioned above, small IoT devices cannot effectively implement advanced AI functions because of their inherent limitations. However, demand is growing for small form factor devices that provide a rich AI experience with long battery life and at minimal cost. To help developers meet this challenge, Maxim Integrated has introduced the MAX78000, a new breed of AI-enabled, ultra-low-power System-on-Chip (SoC) designed to enable more complex AI inference at the IoT Edge. It features a low-power Arm® Cortex®-M4 processor core with FPU, a 32-bit RISC-V (RV) coprocessor, and an ultra-low-power deep NN accelerator highly optimized for Convolutional Neural Networks (CNN). It needs far less energy per inference than equivalent software solutions on a traditional MCU (more than 100 times less in some of Maxim's benchmarks) and delivers significantly better performance, making the MAX78000 a real game-changer in the embedded AI world.
A Closer Look at the MAX78000
The main highlight of the MAX78000 is most certainly its highly optimized hardware CNN accelerator. It occupies a large portion of the silicon die and features 64 parallel processing units divided into four regions, which operate entirely independently of the primary processor. Each processing unit includes a pooling unit and a convolutional engine with dedicated weight memory. This architecture is flexible enough to support a broad range of network topologies. The CNN accelerator has 442 kB of weight memory in total. Weights can be configured per layer as 1-, 2-, 4-, or 8-bit integers, for a maximum of more than 3.5 million weights (reached when all weights are 1-bit). An additional 512 kB of data memory accommodates inputs of up to 181 x 181 pixels per channel when the input is preloaded into memory, without any streaming. Thanks to its advanced streaming capabilities, however, the CNN accelerator can process larger inputs. These take the form of four asynchronous FIFO buffers (plus one synchronous FIFO) and the RV coprocessor acting as a "smart DMA." In real-world terms, the MAX78000 can analyze VGA images at low frame rates. How low? That is hard to estimate, because it depends heavily on the cleverness and size of the model, but Maxim estimates that VGA images can be processed at better than one frame per second.
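The weight-memory figures above can be checked with simple arithmetic. The sketch below assumes "kB" means 1000 bytes, that every layer uses the same bit width (real models mix widths per layer), and that per-processor memory alignment can be ignored. The helper names are illustrative only and are not part of any Maxim SDK.

```c
#include <stdint.h>
#include <stdio.h>

/* Figures from the text; "kB" is taken as 1000 bytes (assumption). */
#define WEIGHT_MEM_BYTES (442u * 1000u) /* total CNN weight memory */
#define NUM_PROCESSORS   64u            /* parallel processing units */
#define NUM_REGIONS      4u             /* independent regions */

/* Upper bound on the weight count if every layer used the same bit width. */
static uint32_t max_weights(uint32_t bits_per_weight)
{
    return (WEIGHT_MEM_BYTES * 8u) / bits_per_weight;
}

/* Bytes needed for one layer of n weights at the given width, rounded up.
   Ignores any per-processor alignment the real hardware may require. */
static uint32_t layer_bytes(uint32_t n_weights, uint32_t bits)
{
    return (n_weights * bits + 7u) / 8u;
}

static uint32_t processors_per_region(void)
{
    return NUM_PROCESSORS / NUM_REGIONS;
}

static void print_capacity(void)
{
    const uint32_t widths[] = {1u, 2u, 4u, 8u};
    for (size_t i = 0; i < sizeof widths / sizeof widths[0]; ++i) {
        printf("%u-bit weights: up to %u\n", (unsigned)widths[i],
               (unsigned)max_weights(widths[i]));
    }
    printf("processors per region: %u\n", (unsigned)processors_per_region());
}
```

With 1-bit weights the bound is 3,536,000, consistent with the "more than 3.5 million" figure. With 8-bit weights it drops to 442,000, so choosing the bit width per layer is a direct trade between model size and precision.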
Data transfer is a very expensive operation in terms of power consumption and performance, especially in embedded applications. The CNN accelerator architecture provides dedicated memory in each region to minimize data movement across the die and thereby conserve energy.
However, data exchange with the outside world can't be avoided altogether. The ultra-low-power 32-bit RISC-V coprocessor can act as an advanced DMA controller. It resides in the same clock domain as the CNN accelerator and runs at 60 MHz. Using the low-power RV coprocessor for signal processing and data transfer is therefore a good way to shorten data paths and reduce power consumption. The RV coprocessor has its own JTAG program/debug interface and can be debugged independently of the primary processor.
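To illustrate why streaming through a FIFO lets the accelerator handle inputs larger than its data memory, here is a conceptual sketch of a single-producer, single-consumer ring buffer. In it, a producer (standing in for the RV coprocessor acting as a smart DMA) pushes input words and a consumer (standing in for the CNN accelerator) drains them. This is not the MAX78000's FIFO register interface. The depth, type, and function names are hypothetical. On real hardware, producer and consumer run concurrently, whereas here they simply take turns.

```c
#include <stdbool.h>
#include <stdint.h>

#define FIFO_DEPTH 8u /* hypothetical depth; must be a power of two */

typedef struct {
    uint32_t buf[FIFO_DEPTH];
    uint32_t head; /* total words written (producer side) */
    uint32_t tail; /* total words read (consumer side) */
} fifo_t;

/* Producer side: returns false when full, so the producer must wait. */
static bool fifo_push(fifo_t *f, uint32_t word)
{
    if (f->head - f->tail == FIFO_DEPTH) /* unsigned wrap-around is safe */
        return false;
    f->buf[f->head & (FIFO_DEPTH - 1u)] = word;
    f->head++;
    return true;
}

/* Consumer side: returns false when empty, so the consumer must wait. */
static bool fifo_pop(fifo_t *f, uint32_t *word)
{
    if (f->head == f->tail)
        return false;
    *word = f->buf[f->tail & (FIFO_DEPTH - 1u)];
    f->tail++;
    return true;
}

/* Stream an input of n words through the FIFO: fill it, drain it, repeat.
   Only FIFO_DEPTH words are ever buffered, regardless of n.
   The checksum stands in for the accelerator consuming the data. */
static uint32_t stream_input(const uint32_t *input, uint32_t n, uint64_t *checksum)
{
    fifo_t f = {{0}, 0, 0};
    uint32_t sent = 0, consumed = 0, w;
    *checksum = 0;
    while (consumed < n) {
        while (sent < n && fifo_push(&f, input[sent]))
            sent++;
        while (fifo_pop(&f, &w)) {
            *checksum += w;
            consumed++;
        }
    }
    return consumed;
}
```

The key property is that buffer occupancy is bounded by the FIFO depth rather than by the input size, which is why streaming relaxes the preloaded-input limit described earlier.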
The MAX78000 SoC uses an Arm® Cortex®-M4F processor core as the primary processor for system control. This well-known architecture combines highly efficient signal processing with low cost and ease of use, facilitating interaction with the rest of the design. The SoC also offers high-speed, low-power communication interfaces, including an I2S serial audio interface and a 12-bit PCIF parallel camera interface, among many others. Up to 52 configurable GPIO pins are available for various purposes. Targeted at IoT applications, the MAX78000 also provides an essential set of security features, such as a unique ID, secure boot, an AES-128/192/256 hardware acceleration engine, and a true random number generator (TRNG).
Increasing the Battery Life
Power management of the MAX78000 is a complex topic. Although it is beyond the scope of this blog post, several key features are worth mentioning.
The MAX78000 SoC features a highly modular design, allowing different regions to be powered down when not in use. It also supports numerous power-saving modes with selective SRAM retention, enabling highly optimized energy consumption. The MAX78000 integrates a single-inductor multiple-output (SIMO) switching-mode power supply (SMPS) with dynamic voltage scaling (DVS), allowing it to be powered by a single Li-Po/Li-Ion battery. An external PMIC can be used for more demanding power management, as demonstrated by the MAX78000EVKIT evaluation kit.
To benchmark the performance of the MAX78000, engineers from Maxim Integrated ran a series of tests comparing it to an equivalent MCU-only solution from their own product line. They used the MAX32650, a low-power Arm Cortex-M4F MCU for wearables running at 120 MHz. The results were astonishing. The KWS20 keyword spotter demo running on the MAX78000 had an inference latency of 2 ms and consumed only 0.14 mJ, while the MAX32650 had a latency of 350 ms and burned 8.37 mJ in the process. The FaceID face recognition demo illustrated the sheer power of the MAX78000 even better: it took only 13.89 ms to complete the face detection task, including input loading time. In comparison, the MAX32650 took 1760 ms, even without input loading time. The difference in energy consumption was just as striking: 0.40 mJ on the MAX78000 vs. 42.1 mJ on the MAX32650.
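The benchmark figures above can be turned into ratios and average power with a few lines of arithmetic. The sketch below uses only the numbers quoted in this post. The battery-budget helper is an illustrative assumption: its 1000 mAh, 3.7 V cell is hypothetical, and the calculation ignores sleep current, regulator losses, and everything else a real device would draw.

```c
#include <stdio.h>

/* One row of the benchmark quoted above. */
typedef struct {
    const char *name;
    double latency_ms;
    double energy_mj;
} bench_t;

/* How many times larger the baseline figure is than the accelerated one. */
static double ratio(double baseline, double accelerated)
{
    return baseline / accelerated;
}

/* Average power during one inference: mJ / ms = W, so scale by 1000 for mW. */
static double avg_power_mw(const bench_t *b)
{
    return b->energy_mj / b->latency_ms * 1000.0;
}

/* Hypothetical battery budget: inferences per full charge if inference energy
   were the only load. mAh -> Ah -> coulombs -> joules -> millijoules. */
static double inferences_per_charge(double battery_mah, double battery_v,
                                    double energy_mj)
{
    double battery_mj = battery_mah / 1000.0 * 3600.0 * battery_v * 1000.0;
    return battery_mj / energy_mj;
}

static void compare(const bench_t *accel, const bench_t *mcu)
{
    printf("%s: %.0fx lower latency, %.0fx less energy\n", accel->name,
           ratio(mcu->latency_ms, accel->latency_ms),
           ratio(mcu->energy_mj, accel->energy_mj));
    printf("  avg power during inference: %.1f mW vs. %.1f mW\n",
           avg_power_mw(accel), avg_power_mw(mcu));
}
```

For KWS20, the ratios come out at about 175x (latency) and 60x (energy). For FaceID, they are about 127x and 105x, and the FaceID latency ratio is conservative because the MAX32650 figure excludes input loading. The average-power figures reveal something the raw numbers hide. During a KWS20 inference, the MAX78000 actually draws more power on average (about 70 mW vs. about 24 mW), yet it finishes so much sooner that its energy per inference, the quantity that drains the battery, is roughly 60 times lower. With the hypothetical 1000 mAh, 3.7 V cell, the 0.14 mJ figure corresponds to an upper bound of roughly 95 million inferences, which real sleep currents and losses will reduce.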
More powerful MCUs may achieve lower latency, but faster MCUs or MPUs usually consume more energy and can cost considerably more. So when latency, power efficiency, and cost are weighed together, the MAX78000 is a clear winner.
AI vs. Embedded: Bridging the Gap
Developers working on AI/ML algorithms are usually focused on their own field of application. They are not very familiar with the hardware they use, or at least not at the lowest level. Instead, they have to rely on libraries and development toolchains provided by the hardware manufacturers. Conversely, embedded developers commonly do not have a profound understanding of various aspects of machine learning and related algorithms. Therefore, the main challenge for hardware manufacturers is to provide a comprehensive toolchain suitable for both the embedded and the AI/ML developers.
Fortunately, the MAX78000's flexible CNN architecture allows AI/ML developers to train their (hardware-aware) NN models in familiar frameworks like PyTorch and TensorFlow, taking advantage of more powerful computing hardware (GPUs, dedicated FPGA accelerators, etc.). However, unlike FPGA-based solutions, the MAX78000's CNN accelerator engine cannot simply be extended with new operators. Maxim's chip developers had to carefully select an optimal feature set for the target applications, balancing flexibility against support for a broad range of NN models. AI/ML developers therefore need to be aware of these limitations to get the most out of the MAX78000.
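One concrete part of being "hardware-aware" is that trained floating-point weights must eventually fit the accelerator's 2-, 4-, or 8-bit integer formats. The sketch below shows generic symmetric per-layer quantization, which is one common way to perform that mapping. It is not Maxim's training or synthesis code, and Maxim's tools may use a different scheme, such as quantization-aware training. The function name is hypothetical. 1-bit weights would need a separate binary mapping, so that width is rejected here.

```c
#include <stddef.h>
#include <stdint.h>

/* Symmetric per-layer quantization of float weights to signed integers of
   2..8 bits. Returns the scale factor (w ~= q * scale), or 0.0f for an
   unsupported width. */
static float quantize_layer(const float *w, size_t n, int bits, int8_t *q)
{
    if (bits < 2 || bits > 8)
        return 0.0f;

    /* Largest magnitude in the layer sets the scale. */
    float max_abs = 0.0f;
    for (size_t i = 0; i < n; ++i) {
        float a = (w[i] < 0.0f) ? -w[i] : w[i];
        if (a > max_abs)
            max_abs = a;
    }

    const int qmax = (1 << (bits - 1)) - 1; /* 127 for 8 bits, 7 for 4 bits */
    const int qmin = -(1 << (bits - 1));    /* -128 for 8 bits, -8 for 4 bits */
    const float scale = (max_abs > 0.0f) ? max_abs / (float)qmax : 1.0f;

    for (size_t i = 0; i < n; ++i) {
        float r = w[i] / scale;
        /* Round half away from zero, then clamp to the representable range. */
        long v = (long)(r >= 0.0f ? r + 0.5f : r - 0.5f);
        if (v > qmax) v = qmax;
        if (v < qmin) v = qmin;
        q[i] = (int8_t)v;
    }
    return scale;
}
```

Quantizing the same layer to fewer bits makes the rounding error visible. For example, a weight of 0.3 in a layer whose largest magnitude is 1.0 becomes 38/127 at 8 bits but only 2/7 at 4 bits. This is why per-layer bit widths are a design decision rather than a free choice.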
Once the NN model has been successfully trained and optimized, Maxim's proprietary synthesizer tool can take the resulting PyTorch checkpoint or ONNX file and generate embedded C code that runs on the MAX78000. From this point on, the embedded engineer can develop the final application around the synthesized code without having to worry about the ML aspects of the workflow.
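To illustrate the division of labor described above, here is a minimal sketch of application code wrapped around a synthesized model. The model_* functions are hypothetical stand-ins, stubbed so the example compiles on its own. They are not the names or signatures the synthesizer actually generates, so check its output for the real API. The stub "model" simply picks a class from the first input byte. The part the embedded engineer writes is the classify/argmax post-processing.

```c
#include <stdint.h>
#include <stdio.h>

#define NUM_CLASSES 4 /* hypothetical model output size */

/* ---- Hypothetical stand-ins for synthesized code (NOT the real API). ---- */
static int32_t fake_output[NUM_CLASSES];

static void model_init(void)
{
    /* Real generated code would enable the accelerator and load
       weights, biases, and layer configuration here. */
}

static void model_run(const uint8_t *input, size_t len)
{
    /* Stub: pretend class (input[0] % NUM_CLASSES) wins. */
    for (int i = 0; i < NUM_CLASSES; ++i)
        fake_output[i] = 0;
    fake_output[len ? input[0] % NUM_CLASSES : 0] = 100;
}

static void model_read_output(int32_t *out)
{
    for (int i = 0; i < NUM_CLASSES; ++i)
        out[i] = fake_output[i];
}

/* ---- Application code written around the synthesized model. ---- */
static int argmax(const int32_t *v, int n)
{
    int best = 0;
    for (int i = 1; i < n; ++i)
        if (v[i] > v[best])
            best = i;
    return best;
}

static int classify(const uint8_t *input, size_t len)
{
    int32_t scores[NUM_CLASSES];
    model_run(input, len);
    model_read_output(scores);
    return argmax(scores, NUM_CLASSES);
}
```

Keeping the model behind a small init/run/read boundary like this means the application logic does not change when the model is retrained and resynthesized.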
Although quite lengthy, this blog post was meant only to give you a glimpse of what the MAX78000 AI accelerator for embedded applications has to offer. For the complete list of features and collateral, please visit the MAX78000 landing page on our portal. You can also register for free on-demand webinars, in which Maxim Integrated experts give a practical introduction to the MAX78000 AI accelerator and explain the workflow in detail with examples.
If this post has piqued your curiosity and you are eager to test the product yourself, all you need is the full-featured MAX78000EVKIT or the small form factor MAX78000FTHR evaluation board, and you're ready to go! Maxim Integrated provides a comprehensive SDK on its GitHub repository, along with several demos, such as the FaceID face recognition and KWS20 keyword spotter demos.
As the leading semiconductor distributor in the EMEA region, EBV Elektronik offers a steady supply of the latest semiconductor solutions from its extensive manufacturer portfolio. Reach out to EBV's AI/ML experts at firstname.lastname@example.org to make sure you are using the optimal solution for your AI application.
Register to Win a MAX78000FTHR Feather Board!
Don't miss this limited-time opportunity: register HERE for a chance to win a MAX78000FTHR Feather form factor rapid development platform, and quickly explore the world of battery-powered, ultra-low-power AI inference at the IoT Edge!