AI Summary • Published on Feb 18, 2026
The increasing integration of artificial intelligence into embedded systems presents a significant challenge: balancing high model performance against strict energy budgets, a trade-off that determines the sustainability and practicality of these systems. Edge AI offers energy efficiency, low latency, and privacy by processing data directly on the device, but edge platforms operate with far fewer resources than their cloud or mobile counterparts. While existing research has benchmarked AI on single-board computers, there is a notable gap in studies addressing bare-metal processors. Single-board computers introduce abstraction layers, such as an operating system, that can obscure true hardware performance, making it difficult to understand the fundamental capabilities of the underlying hardware. This study fills that gap by developing a dedicated test bench that evaluates embedded AI systems at the bare-metal level, measuring key performance indicators such as accuracy, inference time, and energy consumption, and applying Pareto front analysis to identify optimal trade-offs.
The research presents a structured four-stage workflow for systematically evaluating AI models on bare-metal processors to achieve energy-efficient embedded AI design. The first stage, Model Generation, starts from a baseline AI architecture and applies automated multi-objective optimization, structured pruning, and 8-bit static quantization to produce a diverse set of pruned and quantized models in ONNX format, each representing a different performance-resource trade-off (sketched below). The second stage, Model Deployment, converts these ONNX models into executable C binaries tailored to the target hardware (Cortex-M0+, M4, or M7): the ONNX graph is translated into a C model library, integrated with a C++ benchmarking framework, and compiled into a single executable binary. The third stage, System Benchmarking, flashes the compiled binary onto the target processor; the framework then executes multiple model inferences while measuring real-time power consumption and recording execution latency. The final stage, Pareto Optimal Solution, processes all collected benchmarking results, including compilation and inference metrics, together with hardware- and use-case-specific factors such as the idle time between inferences, and identifies the model configurations that best balance energy efficiency and performance for the given resource-constrained embedded environment.

The study adopted several use cases from MLPerf Tiny: Optical Digit Recognition (LeNet-5 on MNIST), Anomaly Detection (an autoencoder on industrial sounds), Compact Image Classification (a custom ResNet on CIFAR-10), and Visual Wake Words (MobileNetV1 on MSCOCO). The experimental setup comprised an ATP carrier board supporting the M0+, M4, and M7 cores, a Segger J-Link debugger (for the M0+ and M4), a USB-C interface (for the M7), and a Power Profiler Kit (PPK) for real-time energy consumption measurements, with power traces precisely synchronized to the inference execution phases.
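As a concrete illustration of the Model Generation stage, here is a minimal sketch of 8-bit static quantization of an ONNX model using ONNX Runtime. The paper does not specify its exact quantization tooling, so this is one plausible realization rather than the authors' implementation; the file names, input name, and calibration data below are hypothetical placeholders.

```python
# Sketch: 8-bit static quantization of a float32 ONNX model (hypothetical
# file names and input name; not taken from the paper).
import numpy as np
from onnxruntime.quantization import (CalibrationDataReader, QuantType,
                                      quantize_static)

class ImageCalibrationReader(CalibrationDataReader):
    """Feeds a small calibration set to the quantizer, one sample at a time."""
    def __init__(self, images: np.ndarray, input_name: str = "input"):
        # images: float32 array of shape (N, C, H, W) matching the model input.
        self._iter = iter({input_name: img[np.newaxis, ...]} for img in images)

    def get_next(self):
        return next(self._iter, None)  # None signals end of calibration data

calib = ImageCalibrationReader(np.load("calibration_images.npy"))
quantize_static(
    "lenet5.onnx",        # float32 baseline (possibly already pruned)
    "lenet5_int8.onnx",   # 8-bit model handed to the C code generator
    calibration_data_reader=calib,
    weight_type=QuantType.QInt8,
    activation_type=QuantType.QInt8,
)
```

The quantized ONNX file then enters the Model Deployment stage, where it is translated into a C model library and compiled for the target core.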
The test bench demonstrated high reliability, with minimal variance across repeated benchmarks, confirming the stability of the experimental setup. Model variation analysis showed that quantization significantly reduces model size, often to one-quarter of the original, consistent with replacing 32-bit floating-point weights with 8-bit integers, which is crucial for deployment on constrained devices. While RAM and ROM usage are critical hardware constraints, they proved to be poor predictors of energy consumption. A strong linear correlation (R² ≥ 0.93) was observed between floating-point operations (FLOPs) and inference time across the M0+, M4, and M7 processors, indicating that FLOPs can reliably predict inference time and letting developers estimate performance early in the design phase once memory constraints are met.

The energy efficiency analysis revealed distinct characteristics for each processor. The M0+ consistently exhibited the highest energy consumption. The M4 became increasingly efficient as inference cycle times grew, thanks to its very low idle current, whereas the M7 achieved the lowest energy consumption during short inference cycles by minimizing the active power-on duration, benefiting from its high computational speed. The Pareto front analysis, comparing inference cycle energy against accuracy for the M4 and M7 processors, showed that the optimal balance depends critically on the inference cycle duration: for frequent inference tasks (short cycle times, e.g., ≤ 0.5 seconds), the M7's high throughput is superior and speed-optimized models are more energy-efficient, while for infrequent inference tasks with long idle periods (long cycle times, e.g., ≥ 2.5 seconds), the M4's ultra-low idle current makes it more energy-efficient, allowing designers to prioritize accuracy over speed with minimal energy penalty.
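The cycle-energy reasoning behind that crossover can be made explicit: energy per cycle is the active power over the inference burst plus the idle power over the remainder of the cycle. The sketch below uses invented placeholder power and timing figures, chosen only to reproduce the qualitative crossover the study reports, not measured values from the paper.

```python
# Sketch: energy per inference cycle = active burst + idle remainder.
# All power/timing numbers are hypothetical placeholders, not measurements.
def cycle_energy_mj(p_active_mw, t_inf_s, p_idle_mw, t_cycle_s):
    """Energy (mJ) for one cycle: inference at active power, rest at idle."""
    return p_active_mw * t_inf_s + p_idle_mw * (t_cycle_s - t_inf_s)

# Hypothetical profiles: the M7 infers faster but idles hotter than the M4.
m7 = dict(p_active_mw=300.0, t_inf_s=0.05, p_idle_mw=40.0)
m4 = dict(p_active_mw=150.0, t_inf_s=0.40, p_idle_mw=2.0)

for t_cycle in (0.5, 2.5):
    e7 = cycle_energy_mj(**m7, t_cycle_s=t_cycle)
    e4 = cycle_energy_mj(**m4, t_cycle_s=t_cycle)
    print(f"cycle {t_cycle}s  M7: {e7:6.1f} mJ  M4: {e4:6.1f} mJ")
# With these placeholders the M7 wins the 0.5 s cycle (33.0 vs 60.2 mJ)
# and the M4 wins the 2.5 s cycle (64.2 vs 113.0 mJ).
```

The same arithmetic explains why the M0+ fares poorly: without a fast core or a low idle current, both terms of the sum stay large.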
This study provides embedded AI developers with a practical, application-driven framework for selecting hardware and optimizing models, moving beyond the simplistic "faster is better" assumption. The optimal choice depends on how a processor's energy profile aligns with the temporal structure of the target application. For applications requiring frequent, low-latency inference (e.g., real-time monitoring), the Cortex-M7 is ideal thanks to its high computational throughput, making compact, fast models the most efficient. Conversely, for battery-powered devices with long idle intervals (e.g., environmental sensors), the Cortex-M4's ultra-low idle current is decisive, allowing developers to prioritize model accuracy without a significant energy penalty. The Cortex-M0+ is deemed unsuitable for frequent AI workloads due to its high idle consumption, but may suit non-AI contexts where cost and simplicity are prioritized. The work establishes two practical tools: FLOPs as a reliable early-stage predictor of inference latency, aiding hardware-aware model selection, and Pareto fronts that visually map energy–accuracy trade-offs per cycle time, guiding optimal processor-model pairings. Future research should refine power-measurement methodology, especially for idle current and wake-up latency, and validate the findings through long-term, real-world deployments across additional architectures and accelerator-based systems.
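To make the second tool concrete, here is a minimal sketch of Pareto-front extraction over (energy per cycle, accuracy) pairs for a fixed cycle time. The benchmark rows are hypothetical examples, not the paper's results; a point survives only if no other point offers lower-or-equal energy together with higher accuracy.

```python
# Sketch: Pareto-front extraction over hypothetical (energy mJ, accuracy)
# benchmark rows for candidate models at one fixed cycle time.
def pareto_front(points):
    """Return the non-dominated points: no alternative has
    lower-or-equal energy and strictly higher accuracy."""
    front = []
    # Sort by ascending energy, breaking ties with the higher accuracy first.
    for energy, acc in sorted(points, key=lambda p: (p[0], -p[1])):
        if not front or acc > front[-1][1]:  # strictly better accuracy
            front.append((energy, acc))
    return front

results = [(33.0, 0.87), (45.0, 0.91), (60.0, 0.90), (70.0, 0.93)]
print(pareto_front(results))  # [(33.0, 0.87), (45.0, 0.91), (70.0, 0.93)]
```

Here the (60.0, 0.90) model drops out because a cheaper model already reaches higher accuracy; repeating this per cycle time yields the processor-model pairings the study recommends.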