Undervolting Your GPU for Local Inference: Lower Heat, Same Tokens/sec

TL;DR

Thorsten Meyer AI has published a GPU tuning guide arguing that power limiting and undervolting should be the first step for high-power local AI workstations. The guide says local inference is often memory-bandwidth-bound, so lowering GPU power can reduce heat and fan noise while preserving much of the tokens-per-second output, though results vary by card, model and workload.

Thorsten Meyer AI has published a GPU undervolting and power-limiting guide for local AI inference, arguing that workstation owners can cut heat and noise before buying new cooling hardware while keeping much of their tokens-per-second performance.

The guide presents power limiting as the first and lowest-risk step. It says users can set a GPU power limit, such as moving an RTX 4090-class card from stock behavior toward a 70% cap, then measure temperature, held clock, power draw and tokens per second under the same local inference workload.
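The set-then-measure loop described above can be sketched in Python around the nvidia-smi command-line tool. This is a minimal sketch, not the guide's own tooling: it assumes an NVIDIA card with nvidia-smi on the PATH, and the helper names (cap_watts, parse_sample, read_gpu_sample, set_power_limit) are hypothetical. Changing the limit normally needs administrator rights, and the range a card accepts varies.

```python
import subprocess

# nvidia-smi query fields: board power draw (W), core temperature (C), SM clock (MHz).
QUERY_FIELDS = "power.draw,temperature.gpu,clocks.sm"


def cap_watts(default_limit_w: float, fraction: float) -> int:
    """Convert a fractional cap (0.70 for a 70% limit) into whole watts."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    return round(default_limit_w * fraction)


def parse_sample(csv_line: str) -> dict:
    """Parse one 'power, temp, clock' line of nvidia-smi CSV output."""
    power, temp, clock = (float(x) for x in csv_line.strip().split(","))
    return {"power_w": power, "temp_c": temp, "sm_clock_mhz": clock}


def read_gpu_sample(gpu_index: int = 0) -> dict:
    """Take one reading from the GPU (requires nvidia-smi to be installed)."""
    out = subprocess.run(
        ["nvidia-smi", "-i", str(gpu_index),
         f"--query-gpu={QUERY_FIELDS}", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_sample(out.splitlines()[0])


def set_power_limit(watts: int, gpu_index: int = 0) -> None:
    """Apply a power limit in watts (usually needs root or admin rights)."""
    subprocess.run(["nvidia-smi", "-i", str(gpu_index), "-pl", str(watts)],
                   check=True)
```

Assuming a card whose default limit is 450 W, cap_watts(450, 0.70) gives 315, which could then be passed to set_power_limit before starting the inference run and polling read_gpu_sample every few seconds. Tokens per second has to come from the inference software itself.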

According to the source material, a sustained RTX 4090 workload showed stock behavior at about 390 watts, 72 degrees Celsius and full baseline speed. At a 70% power limit, the guide reports about 300 watts, 67 degrees Celsius and 93.4% of baseline speed. At 80%, it reports 330 watts, 70 degrees Celsius and 98.6% of baseline speed.
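The trade-off in those figures is easier to see as throughput per watt. The short Python sketch below only does arithmetic on the numbers the guide reports; the function names are hypothetical and nothing here is an additional measurement.

```python
# Figures as reported in the guide for a sustained RTX 4090 workload.
REPORTED = {
    "stock": {"watts": 390, "speed_pct": 100.0},
    "80% limit": {"watts": 330, "speed_pct": 98.6},
    "70% limit": {"watts": 300, "speed_pct": 93.4},
}


def relative_efficiency(setting: str, baseline: str = "stock") -> float:
    """Speed per watt of a setting, relative to the baseline (1.0 = same)."""
    s, b = REPORTED[setting], REPORTED[baseline]
    return (s["speed_pct"] / s["watts"]) / (b["speed_pct"] / b["watts"])


def power_saved_pct(setting: str, baseline: str = "stock") -> float:
    """Percentage reduction in power draw compared with the baseline."""
    s, b = REPORTED[setting], REPORTED[baseline]
    return 100.0 * (b["watts"] - s["watts"]) / b["watts"]


for name in ("80% limit", "70% limit"):
    print(f"{name}: {power_saved_pct(name):.1f}% less power, "
          f"{relative_efficiency(name):.2f}x speed per watt")
```

On the reported numbers, the 70% limit draws about 23% less power for 6.6% less speed, which works out to roughly 1.21 times the throughput per watt of the stock setting.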

The guide distinguishes between power limiting and manual undervolting. Power limiting is described as a one-slider change that restricts the card and lets the GPU reduce voltage and clocks on its own. Manual undervolting edits the voltage-frequency curve directly, with the guide naming roughly 0.9 to 0.95 volts as a starting range for some cards, while warning that stability should be tested under real workloads.
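Why a modest voltage drop can matter can be illustrated with the standard first-order approximation that dynamic switching power scales with voltage squared times clock frequency. This is a textbook model, not a calculation from the guide: it ignores leakage and memory power, and the 1.05 V stock voltage used below is an assumed placeholder, not a figure for any specific card.

```python
def relative_dynamic_power(v_stock: float, v_new: float,
                           f_stock: float = 1.0, f_new: float = 1.0) -> float:
    """First-order estimate: dynamic power is proportional to V^2 * f.

    Returns the new dynamic power as a fraction of the stock value.
    """
    if v_stock <= 0 or f_stock <= 0:
        raise ValueError("stock voltage and frequency must be positive")
    return (v_new / v_stock) ** 2 * (f_new / f_stock)


# Assumed stock voltage of 1.05 V; 0.95 V and 0.90 V are the guide's
# suggested starting range for some cards. Clock held constant.
for volts in (0.95, 0.90):
    share = relative_dynamic_power(1.05, volts)
    print(f"{volts:.2f} V -> about {share:.0%} of stock dynamic power")
```

Under those assumptions, holding the same clock at 0.95 V would need roughly 82% of the stock dynamic power, and 0.90 V roughly 73%. Whether a given chip is actually stable at that point is exactly what the real-workload testing is for.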

Why It Matters

The report matters for local AI users because high-end GPUs can put hundreds of watts of heat into a room during long inference or fine-tuning runs. If the guide’s reported pattern holds for a user’s hardware, a power cap could reduce heat output, noise and electricity use without requiring a new case, cooler or fan layout.

The claim is also relevant because tokens per second, rather than peak benchmark performance, is the performance measure many local inference users care about. Thorsten Meyer AI argues that many local LLM workloads are memory-bandwidth-bound, meaning the GPU core may not need peak clocks to maintain useful output speed.
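A back-of-envelope estimate shows what memory-bandwidth-bound means in practice. In single-stream decoding, each generated token requires reading roughly all of the model weights once, so memory bandwidth divided by weight size gives a rough ceiling on tokens per second regardless of core clock. The sketch below is an illustration under stated assumptions, not the guide's method: 1008 GB/s is the RTX 4090's spec-sheet bandwidth, the 4 GB model size is a hypothetical example, and KV-cache reads, batching and kernel overheads are ignored.

```python
def decode_tokens_per_sec_ceiling(bandwidth_gb_s: float, weights_gb: float) -> float:
    """Rough upper bound on single-stream decode speed.

    Assumes every token needs one full read of the weights from VRAM.
    """
    if weights_gb <= 0:
        raise ValueError("weights_gb must be positive")
    return bandwidth_gb_s / weights_gb


# Hypothetical 4 GB quantized model on a 1008 GB/s card.
print(decode_tokens_per_sec_ceiling(1008, 4))
```

If measured throughput already sits near this kind of ceiling, extra core clock has little to add, which is the situation in which a power cap costs the least. Large batches or long prompt processing shift the balance back toward compute.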

Background

Modern consumer GPUs are generally shipped with voltage and clock settings intended to meet advertised performance across a wide range of chips and conditions. The guide says that factory voltage curves include extra margin so weaker chips in a production batch remain stable at rated clocks.

For gaming, lowering core power can reduce frame rates when a workload is compute-bound. The guide argues that local inference is often different because VRAM bandwidth, not raw core compute, is frequently the limiting factor. It frames undervolting and power limiting as the first lever in a broader series on reducing heat and noise in high-power AI workstations.

The source also includes a consumer disclosure, saying the article contains affiliate links and that buyers should confirm current prices and specs. It states that undervolting and power limiting are widely used and reversible, but users make changes at their own risk.

“This is the first thing you should do to a high-power AI workstation, and it costs nothing.”

— Thorsten Meyer AI guide

“Local inference is memory-bound — the GPU core spends much of its time waiting on VRAM.”

— Thorsten Meyer AI guide

“Power limiting moves one slider and can’t damage anything.”

— Thorsten Meyer AI guide

“Test under your real workload — a curve stable for 10 min can fail on hour 3.”

— Thorsten Meyer AI guide

What Remains Unclear

The source material does not provide an independent lab validation, a full test setup, model list or complete methodology for every figure cited. Results may differ by GPU model, chip quality, case airflow, driver version, model size, quantization and batch settings.

It is also unclear how much of the reported benefit carries over to every local inference task. Workloads that become more compute-bound may lose more speed when power is reduced.

What’s Next

The guide’s recommended next step is practical testing: set a conservative power limit, run the user’s actual inference workload for a sustained period, record tokens per second, temperature and power draw, then save the setting so it persists after reboot. Users who need more tuning can then try manual undervolting with longer stability checks.
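The recording step can be sketched as a small comparison between a baseline run and a capped run. This is a minimal sketch with hypothetical names and an assumed 90% speed threshold; the samples are expected to come from whatever logging the user already has, with one entry per interval over a long run.

```python
from statistics import mean


def summarize_run(samples: list) -> dict:
    """Reduce per-interval samples to the three numbers worth comparing.

    Each sample is a dict with 'tokens_per_sec', 'temp_c' and 'power_w'.
    """
    if not samples:
        raise ValueError("no samples recorded")
    return {
        "tokens_per_sec": mean(s["tokens_per_sec"] for s in samples),
        "max_temp_c": max(s["temp_c"] for s in samples),
        "avg_power_w": mean(s["power_w"] for s in samples),
    }


def compare_to_baseline(baseline: dict, candidate: dict,
                        min_speed_ratio: float = 0.90) -> dict:
    """Compare two summaries; min_speed_ratio is an assumed acceptance bar."""
    speed_ratio = candidate["tokens_per_sec"] / baseline["tokens_per_sec"]
    return {
        "speed_ratio": speed_ratio,
        "watts_saved": baseline["avg_power_w"] - candidate["avg_power_w"],
        "temp_drop_c": baseline["max_temp_c"] - candidate["max_temp_c"],
        "acceptable": speed_ratio >= min_speed_ratio,
    }
```

Using the maximum temperature rather than the average keeps the comparison honest for long runs, where heat soak tends to show up late.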

Key Questions

What is the actual development here?

Thorsten Meyer AI has published a guide and interactive infographic arguing that GPU power limiting and undervolting can reduce heat and noise in local inference workloads while preserving much of the throughput.

Is the performance claim confirmed?

Not independently. The source reports measured RTX 4090 workload figures, including about 93.4% of baseline speed at a 70% power limit, but it does not provide independent lab validation or a full methodology. Those figures are attributed to the guide and should be checked on each user’s own hardware and workload.

Why might inference tolerate lower GPU power?

The guide says many local LLM inference tasks are limited by memory bandwidth, so the GPU core may spend time waiting on VRAM rather than using full compute capacity. In that case, lowering core power can cut heat faster than it cuts tokens per second.

Is power limiting the same as undervolting?

No. Power limiting sets a wattage ceiling and lets the GPU manage clocks and voltage. Manual undervolting edits the voltage-frequency curve and can preserve more speed at a given heat level, but it requires more testing.

What remains unclear?

The guide does not establish a universal result for all GPUs or all inference jobs. Users still need to test stability, power draw and tokens per second under their own long-running workload.

Source: Thorsten Meyer AI

Author

Artificial Intelligence
