Paper Details
GPU Acceleration Techniques for Optimizing AI-ML Inference in the Cloud
Authors
Charan Shankar Kummarapurugu
Abstract
The demand for real-time Artificial Intelligence (AI) and Machine Learning (ML) inference in cloud environments has grown substantially in recent years. However, delivering high-performance inference at scale remains a challenge due to the computational intensity of AI/ML workloads. General-purpose CPUs often struggle to meet the latency and throughput requirements of modern AI/ML applications. This paper explores the application of Graphics Processing Units (GPUs) to accelerate inference tasks, particularly in cloud environments, where dynamic and scalable resources are essential. We review current GPU-based optimization techniques, focusing on reducing inference latency and enhancing cost-effectiveness. The proposed approach integrates distributed GPU resource management with AI-driven prediction models to balance workloads efficiently across multiple cloud platforms. Experiments conducted on AWS, Azure, and Google Cloud demonstrate that GPU acceleration can reduce inference latency by up to 40% while improving cost efficiency by 30%, compared to CPU-only implementations. These findings highlight the potential of GPU acceleration to transform AI-ML inference in the cloud, making it more scalable and accessible for a wide range of applications.
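The abstract describes balancing inference workloads across cloud platforms using prediction models. As a minimal illustrative sketch only (not the paper's implementation), the idea of prediction-driven routing can be shown with a balancer that forecasts each platform's latency from a moving average of recent observations and sends the next request to the cheapest predicted target; the class name, window size, and platform labels below are assumptions for illustration.

```python
from collections import deque


class PredictiveBalancer:
    """Illustrative sketch: route each inference request to the cloud
    platform with the lowest predicted latency. Here the "prediction
    model" is a simple moving average of recent latency observations;
    the paper's AI-driven predictor would replace this."""

    def __init__(self, platforms, window=5):
        # Keep a bounded history of observed latencies per platform.
        self.history = {p: deque(maxlen=window) for p in platforms}

    def predict(self, platform):
        # Moving-average latency; optimistic 0.0 for unseen platforms
        # so that every platform gets probed at least once.
        h = self.history[platform]
        return sum(h) / len(h) if h else 0.0

    def route(self):
        # Choose the platform with the lowest predicted latency.
        return min(self.history, key=self.predict)

    def observe(self, platform, latency_ms):
        # Feed back the measured latency after a request completes.
        self.history[platform].append(latency_ms)


balancer = PredictiveBalancer(["aws", "azure", "gcp"])
balancer.observe("aws", 42.0)
balancer.observe("azure", 35.0)
balancer.observe("gcp", 55.0)
print(balancer.route())  # azure has the lowest moving-average latency
```

In practice such a predictor would be trained on richer signals (queue depth, GPU utilization, spot pricing) rather than latency alone, but the routing loop retains this shape: predict, route, observe, update.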
Keywords
GPU acceleration, AI/ML inference, cloud computing, performance optimization, cost-efficiency
Citation
GPU Acceleration Techniques for Optimizing AI-ML Inference in the Cloud. Charan Shankar Kummarapurugu. 2022. IJIRCT, Volume 8, Issue 6. Pages 1-7. https://www.ijirct.org/viewPaper.php?paperId=2411046