Paper Details
GPU Acceleration Techniques for Optimizing AI-ML Inference in the Cloud
Authors
Charan Shankar Kummarapurugu
Abstract
The demand for real-time Artificial Intelligence (AI) and Machine Learning (ML) inference in cloud environments has grown substantially in recent years. However, delivering high-performance inference at scale remains a challenge due to the computational intensity of AI/ML workloads. General-purpose CPUs often struggle to meet the latency and throughput requirements of modern AI/ML applications. This paper explores the application of Graphics Processing Units (GPUs) to accelerate inference tasks, particularly in cloud environments, where dynamic and scalable resources are essential. We review current GPU-based optimization techniques, focusing on reducing inference latency and enhancing cost-effectiveness. The proposed approach integrates distributed GPU resource management with AI-driven prediction models to balance workloads efficiently across multiple cloud platforms. Experiments conducted on AWS, Azure, and Google Cloud demonstrate that GPU acceleration can reduce inference latency by up to 40% while improving cost efficiency by 30%, compared to CPU-only implementations. These findings highlight the potential of GPU acceleration to transform AI-ML inference in the cloud, making it more scalable and accessible for a wide range of applications.
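The abstract describes balancing inference workloads across cloud platforms using prediction models. As a minimal illustrative sketch only (not the paper's implementation), the idea of prediction-driven routing can be shown with a balancer that forecasts each platform's latency from a moving average of recent observations and sends the next request to the cheapest predicted target; the class name, window size, and platform labels below are assumptions for illustration.

```python
from collections import deque


class PredictiveBalancer:
    """Illustrative sketch: route each inference request to the cloud
    platform with the lowest predicted latency. Here the "prediction
    model" is a simple moving average of recent latency observations;
    the paper's AI-driven predictor would replace this."""

    def __init__(self, platforms, window=5):
        # Keep a bounded history of observed latencies per platform.
        self.history = {p: deque(maxlen=window) for p in platforms}

    def predict(self, platform):
        # Moving-average latency; optimistic 0.0 for unseen platforms
        # so that every platform gets probed at least once.
        h = self.history[platform]
        return sum(h) / len(h) if h else 0.0

    def route(self):
        # Choose the platform with the lowest predicted latency.
        return min(self.history, key=self.predict)

    def observe(self, platform, latency_ms):
        # Feed back the measured latency after a request completes.
        self.history[platform].append(latency_ms)


balancer = PredictiveBalancer(["aws", "azure", "gcp"])
balancer.observe("aws", 42.0)
balancer.observe("azure", 35.0)
balancer.observe("gcp", 55.0)
print(balancer.route())  # azure has the lowest moving-average latency
```

In practice such a predictor would be trained on richer signals (queue depth, GPU utilization, spot pricing) rather than latency alone, but the routing loop retains this shape: predict, route, observe, update.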
Keywords
GPU acceleration, AI/ML inference, cloud computing, performance optimization, cost-efficiency
Citation
GPU Acceleration Techniques for Optimizing AI-ML Inference in the Cloud. Charan Shankar Kummarapurugu. 2022. IJIRCT, Volume 8, Issue 6. Pages 1-7. https://www.ijirct.org/viewPaper.php?paperId=2411046