Efficient deep learning inference on end devices
Deep Learning (DL) has become a cornerstone of modern Artificial Intelligence (AI), powering applications across healthcare, computer vision, and autonomous systems. However, executing DL inference on end devices such as smartphones and IoT hardware is challenging: these platforms offer limited computational resources, operate under tight energy budgets, and must often meet real-time requirements.
This thesis addresses the optimization of DL inference on Heterogeneous Multi-Processing System-on-Chips (HMPSoCs), which integrate CPUs, GPUs, and Neural Processing Units (NPUs). It explores strategies that use these processors cooperatively to improve inference latency, throughput, and power efficiency.
A layer-wise switching strategy is proposed to assign each layer of a DL model to the processor that executes it with the lowest latency, improving responsiveness for time-sensitive applications such as AR/VR. To enhance power efficiency, Dynamic Voltage and Frequency Scaling (DVFS) is combined with layer-wise switching, keeping performance within the device's power budget. Additionally, selective quantization is introduced to exploit NPUs without sacrificing model accuracy, assigning quantized and full-precision layers to the appropriate processors. For throughput, a pipelined execution approach partitions models across CPU clusters, GPUs, and NPUs so that successive frames are processed concurrently, meeting high frames-per-second (FPS) requirements.
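To make the layer-wise switching idea concrete, the following is a minimal sketch, not the thesis implementation: it assumes per-layer latency profiles (in milliseconds) have been measured offline on each processor, and that moving execution between processors incurs a fixed transfer overhead. The layer names, latency numbers, and `SWITCH_OVERHEAD_MS` value below are illustrative placeholders, not results from this work.

```python
# Illustrative sketch of layer-wise processor switching (hypothetical numbers).
# Each layer is greedily mapped to the processor with the lowest cost,
# charging a fixed overhead whenever execution migrates to a new processor.

PROFILE_MS = {  # assumed offline latency profile for a small model
    "conv1": {"CPU": 4.1, "GPU": 1.8, "NPU": 0.9},
    "conv2": {"CPU": 6.3, "GPU": 2.2, "NPU": 1.1},
    "fc":    {"CPU": 1.2, "GPU": 1.5, "NPU": 2.4},
}

SWITCH_OVERHEAD_MS = 0.5  # assumed cost of moving activations between processors


def assign_layers(profile, switch_overhead):
    """Return a per-layer processor mapping and the estimated end-to-end latency."""
    assignment, total, prev = [], 0.0, None
    for layer, costs in profile.items():
        # Cost of running this layer on processor p, including a switch penalty
        # if p differs from the processor used for the previous layer.
        best = min(
            costs,
            key=lambda p: costs[p] + (switch_overhead if prev and p != prev else 0.0),
        )
        total += costs[best] + (switch_overhead if prev and best != prev else 0.0)
        assignment.append((layer, best))
        prev = best
    return assignment, total


if __name__ == "__main__":
    mapping, latency = assign_layers(PROFILE_MS, SWITCH_OVERHEAD_MS)
    for layer, proc in mapping:
        print(f"{layer:>6} -> {proc}")
    print(f"estimated end-to-end latency: {latency:.1f} ms")
```

A greedy pass like this only approximates the best mapping, since a locally cheap processor choice can force costly switches later; the same profiling data could instead feed an exhaustive or dynamic-programming search over layer-to-processor mappings.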
As a key outcome, the ARM-CO-UP framework is developed to support profiling, processor switching, pipelining, and DVFS, enabling flexible and cooperative execution across processors.
This work contributes to efficient DL deployment on everyday devices by balancing performance, energy, and accuracy. The proposed methods and the ARM-CO-UP framework provide a practical foundation for continued research in efficient AI computing at the edge.