Unlocking asynchronicity in continuous batching

TL;DR

Researchers have demonstrated that asynchronous batching in GPU inference workflows significantly reduces GPU idle time, which accounted for 24% of total runtime in one profiled workload. The approach decouples CPU and GPU operations using CUDA streams so that the two run concurrently.

Researchers have introduced a method for asynchronous batching in GPU inference workflows. It targets the idle GPU time that makes up nearly a quarter of runtime during continuous generation, and removing that idle time significantly boosts throughput.

The core idea is to separate CPU batch preparation from GPU computation so that both run concurrently using CUDA streams. Traditional synchronous batching forces the CPU and GPU to take turns, so one sits idle while the other works. This inefficiency has been quantified: profiling an 8B-parameter model generating 8,000 tokens with a batch size of 32 showed that the GPU was idle, waiting on CPU tasks, for 24% of total runtime.
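To put the 24% figure in perspective, the short calculation below gives an upper bound on the throughput gain. It is a sketch under an assumption not stated in the source: that all GPU idle time is removed and the GPU's busy time is unchanged. The function name `max_speedup` is illustrative only.

```python
def max_speedup(idle_fraction: float) -> float:
    """Upper bound on speedup if the idle fraction of runtime is removed entirely.

    Assumption: the remaining (busy) time is unchanged. Real gains will be
    lower if some CPU work cannot be hidden behind GPU compute.
    """
    if not 0.0 <= idle_fraction < 1.0:
        raise ValueError("idle_fraction must be in [0, 1)")
    return 1.0 / (1.0 - idle_fraction)


if __name__ == "__main__":
    idle = 0.24  # profiled idle share of runtime quoted in the article
    print(f"runtime shrinks to {1 - idle:.0%} of the original")
    print(f"throughput bound: {max_speedup(idle):.2f}x")
```

Under these assumptions, removing a 24% idle share caps the gain at roughly 1.32x. Measured improvements depend on how much of the CPU work can actually be overlapped.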

The new approach assigns GPU operations to CUDA streams, each an independent queue of work. Operations within the same stream run sequentially, but operations in different streams can run concurrently, so batch preparation for the next step can proceed while the GPU is still computing the current one. This concurrency reduces idle time and improves overall throughput without requiring changes to the core model or kernels.
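The stream semantics described above can be illustrated with a toy timeline model. This is plain Python, not real CUDA: the stream names (`compute`, `transfer`), the operation names, and the durations are all hypothetical, and the model only captures the rule that work on one stream is ordered while work on different streams may overlap.

```python
from collections import defaultdict


def simulate_streams(ops):
    """Toy timeline for stream semantics (not real CUDA).

    ops: list of (name, stream, duration) in submission order.
    Ops on the same stream run back to back; ops on different streams
    are free to overlap. Returns {name: (start, end)}.
    """
    stream_free_at = defaultdict(float)  # time at which each stream is next idle
    timeline = {}
    for name, stream, duration in ops:
        start = stream_free_at[stream]
        end = start + duration
        stream_free_at[stream] = end
        timeline[name] = (start, end)
    return timeline


if __name__ == "__main__":
    ops = [
        ("forward_step_0", "compute", 3.0),
        ("upload_batch_1", "transfer", 1.0),  # overlaps with forward_step_0
        ("forward_step_1", "compute", 3.0),   # queued behind forward_step_0
    ]
    for name, (start, end) in simulate_streams(ops).items():
        print(f"{name}: {start} -> {end}")
```

In this sketch the upload for step 1 finishes while step 0 is still computing, so step 1 can start as soon as the compute stream is free. A real implementation would also need an explicit cross-stream dependency (for example, a CUDA event) so that the compute stream waits for the upload. The toy does not model that.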

Why It Matters

This development is significant for large-scale inference tasks, where GPU utilization directly impacts operational costs and performance. By reducing idle periods, organizations can achieve faster inference times and more cost-effective GPU usage, especially important given the high hourly costs of inference hardware like the H200. The ability to optimize resource utilization can lead to substantial savings and efficiency gains in deploying large language models at scale.

Background

Previous work on continuous batching improved GPU utilization by scheduling batches tightly to avoid padding waste. However, these methods remained synchronous, leaving the GPU idle for a substantial portion of runtime because CPU and GPU tasks ran one after the other. Asynchronous batching builds on CUDA streams, which allow GPU operations to execute concurrently. The approach is part of ongoing work on inference efficiency, with recent implementations shared within the transformers library ecosystem. The main challenge has been coordinating the launch and completion of GPU tasks while the next batch is being prepared, which is now addressed through careful management of CUDA streams and synchronization barriers.
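The coordination problem can be sketched with a CPU-only analogy. The example below is not the transformers implementation and does not use CUDA. It uses a bounded `queue.Queue` as a stand-in for a synchronization barrier, so the preparing thread can run at most one batch ahead of the computing thread. The names `run_pipeline`, `prepare`, and `compute` are hypothetical.

```python
import queue
import threading

_DONE = object()  # sentinel marking the end of the batch stream


def run_pipeline(num_steps, prepare, compute):
    """Overlap batch preparation with computation using a bounded handoff.

    prepare(step) builds the batch for a step (the CPU-side work);
    compute(batch) consumes it (standing in for the GPU-side work).
    """
    ready = queue.Queue(maxsize=1)  # at most one prepared batch may wait
    results = []

    def producer():
        for step in range(num_steps):
            # put() blocks while the slot is full, so preparation never
            # runs more than one batch ahead of computation.
            ready.put(prepare(step))
        ready.put(_DONE)

    worker = threading.Thread(target=producer)
    worker.start()
    while True:
        batch = ready.get()  # blocks until the next batch is prepared
        if batch is _DONE:
            break
        results.append(compute(batch))
    worker.join()
    return results


if __name__ == "__main__":
    print(run_pipeline(4, lambda step: [step, step + 1], sum))
```

The blocking `put` and `get` calls play the role of the barriers: they keep the two sides in step without forcing them to run strictly one after the other. This sketch assumes each batch can be prepared without waiting for the previous step's output, which is a simplification.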

“By disentangling CPU and GPU workloads using CUDA streams, we can run batch preparation concurrently with GPU computation, reducing idle time significantly.”

— Lead researcher (source from Hugging Face)

“Our profiling shows that nearly a quarter of total inference time is wasted waiting for CPU or GPU to become available, and this method can eliminate that waste.”

— Hugging Face engineer

What Remains Unclear

While initial results are promising, the approach’s effectiveness across different hardware configurations, model sizes, and real-world workloads remains to be fully validated. Implementation complexity and potential synchronization issues could also pose challenges in broader deployment. Further testing and optimization are needed to confirm scalability and robustness.

What’s Next

Next steps include extensive benchmarking across diverse models and hardware setups, refining code implementations for broader adoption, and integrating asynchronous batching techniques into production inference pipelines. Researchers and engineers will also explore automated tools to manage CUDA stream orchestration more effectively.

Key Questions

What is asynchronous batching?

It is a method that separates CPU batch preparation from GPU computation, allowing both to run in parallel using CUDA streams, thus reducing idle time and increasing throughput.

How does this improve GPU utilization?

By enabling concurrent execution of CPU and GPU tasks, it minimizes idle periods where either component waits for the other, leading to faster inference times.

Does implementing this require changes to the model?

No, the approach does not require modifications to the core model or kernels; it relies on managing GPU operation scheduling through CUDA streams.

Are there any limitations or risks?

The technique’s effectiveness across different hardware and workloads needs further validation, and improper synchronization could cause errors or performance issues.
