TL;DR

Thorsten Meyer AI has published a GPU tuning guide arguing that power limiting and undervolting should be the first step for high-power local AI workstations. The guide says local inference is often memory-bandwidth-bound, so lowering GPU power can reduce heat and fan noise while preserving much of tokens-per-second output, though results vary by card, model and workload.

Thorsten Meyer AI has published a GPU undervolting and power-limiting guide for local AI inference, arguing that workstation owners can cut heat and noise before buying new cooling hardware while keeping much of their tokens-per-second performance.

The guide presents power limiting as the first and lowest-risk step. It says users can set a GPU power limit, such as moving an RTX 4090-class card from stock behavior toward a 70% cap, then measure temperature, held clock, power draw and tokens per second under the same local inference workload.
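On NVIDIA cards, a percentage cap like the one described above is usually applied as a wattage through the `nvidia-smi` command-line tool. The following is a minimal sketch, not the guide's own procedure: the helper names are hypothetical, and the 450 W default limit is an assumed example value that should be read from the actual card.

```python
import subprocess


def cap_to_watts(default_limit_w: float, cap_percent: float) -> int:
    """Convert a percentage cap (e.g. 70) into a whole-watt power limit."""
    if not 0 < cap_percent <= 100:
        raise ValueError("cap_percent must be in (0, 100]")
    return round(default_limit_w * cap_percent / 100)


def read_default_limit_w(gpu_index: int = 0) -> float:
    """Ask nvidia-smi for the card's default power limit in watts."""
    out = subprocess.run(
        [
            "nvidia-smi", "-i", str(gpu_index),
            "--query-gpu=power.default_limit",
            "--format=csv,noheader,nounits",
        ],
        capture_output=True, text=True, check=True,
    ).stdout
    return float(out.strip())


def power_limit_command(gpu_index: int, watts: int) -> list[str]:
    """Build the nvidia-smi call that sets the limit (usually needs admin rights)."""
    return ["nvidia-smi", "-i", str(gpu_index), "-pl", str(watts)]


# Assumed example: a card whose default limit is 450 W, capped at 70%.
example_watts = cap_to_watts(450, 70)
print(power_limit_command(0, example_watts))
```

To apply the cap, the command list can be passed to `subprocess.run(..., check=True)` from an elevated shell. The driver rejects values outside the card's supported range, and a limit set this way is generally not retained across reboots unless it is reapplied at startup.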

According to the source material, a sustained RTX 4090 workload showed stock behavior at about 390 watts, 72 degrees Celsius and full baseline speed. At a 70% power limit, the guide reports about 300 watts, 67 degrees Celsius and 93.4% of baseline speed. At 80%, it reports 330 watts, 70 degrees Celsius and 98.6% of baseline speed.
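The trade-off in those figures is easier to see as throughput per watt. The sketch below only does arithmetic on the numbers reported above; the function names are illustrative, and the figures remain the guide's, not independent measurements.

```python
# Figures as reported by the guide for a sustained RTX 4090 workload.
reported = {
    "stock": {"watts": 390, "temp_c": 72, "speed_pct": 100.0},
    "80% limit": {"watts": 330, "temp_c": 70, "speed_pct": 98.6},
    "70% limit": {"watts": 300, "temp_c": 67, "speed_pct": 93.4},
}


def power_saved_pct(row: dict, baseline: dict) -> float:
    """Percentage reduction in power draw relative to the baseline row."""
    return 100 * (1 - row["watts"] / baseline["watts"])


def relative_efficiency(row: dict, baseline: dict) -> float:
    """Speed per watt, normalized so the baseline row equals 1.0."""
    speed_ratio = row["speed_pct"] / baseline["speed_pct"]
    power_ratio = row["watts"] / baseline["watts"]
    return speed_ratio / power_ratio


base = reported["stock"]
for name, row in reported.items():
    print(
        f"{name}: {power_saved_pct(row, base):.1f}% less power, "
        f"{relative_efficiency(row, base):.2f}x speed per watt"
    )
```

On the reported numbers, the 70% limit draws about 23% less power and delivers roughly 21% more throughput per watt than stock, while the 80% limit draws about 15% less power for roughly 17% more throughput per watt.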

The guide distinguishes between power limiting and manual undervolting. Power limiting is described as a one-slider change that restricts the card and lets the GPU reduce voltage and clocks on its own. Manual undervolting edits the voltage-frequency curve directly, with the guide naming roughly 0.9 to 0.95 volts as a starting range for some cards, while warning that stability should be tested under real workloads.

Why It Matters

The report matters for local AI users because high-end GPUs can put hundreds of watts of heat into a room during long inference or fine-tuning runs. If the guide’s reported pattern holds for a user’s hardware, a power cap could reduce heat output, noise and electricity use without requiring a new case, cooler or fan layout.

The claim is also relevant because tokens per second, rather than peak benchmark performance, is the performance measure many local inference users care about. Thorsten Meyer AI argues that many local LLM workloads are memory-bandwidth-bound, meaning the GPU core may not need peak clocks to maintain useful output speed.

As an affiliate, we earn on qualifying purchases.

Background

Modern consumer GPUs are generally shipped with voltage and clock settings intended to meet advertised performance across a wide range of chips and conditions. The guide says that factory voltage curves include extra margin so weaker chips in a production batch remain stable at rated clocks.

For gaming, lowering core power can reduce frame rates when a workload is compute-bound. The guide argues that local inference is often different because VRAM bandwidth, not raw core compute, is frequently the limiting factor. It frames undervolting and power limiting as the first lever in a broader series on reducing heat and noise in high-power AI workstations.
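A common back-of-envelope way to see the memory-bandwidth argument: if generating one token requires reading every model weight from VRAM once, single-stream speed cannot exceed memory bandwidth divided by model size. The sketch below is a simplification under that assumption (it ignores the KV cache, batching and kernel overhead), and the 4 GB model size is a hypothetical example, not a figure from the guide.

```python
def decode_ceiling_tokens_per_s(bandwidth_gb_s: float, weights_gb: float) -> float:
    """Upper bound on single-stream tokens/s if all weights are read once per token."""
    if weights_gb <= 0:
        raise ValueError("weights_gb must be positive")
    return bandwidth_gb_s / weights_gb


RTX_4090_BANDWIDTH_GB_S = 1008.0  # published memory bandwidth of the RTX 4090
EXAMPLE_WEIGHTS_GB = 4.0          # hypothetical quantized model size (assumption)

ceiling = decode_ceiling_tokens_per_s(RTX_4090_BANDWIDTH_GB_S, EXAMPLE_WEIGHTS_GB)
print(f"Bandwidth-bound ceiling: about {ceiling:.0f} tokens/s")
```

Core clock speed does not appear in this bound, which is the intuition behind the guide's claim: as long as the core keeps up with the memory system, trimming core power should cost relatively little output speed.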

The source also includes a consumer disclosure, saying the article contains affiliate links and that buyers should confirm current prices and specs. It states that undervolting and power limiting are widely used and reversible, but users make changes at their own risk.

“This is the first thing you should do to a high-power AI workstation, and it costs nothing.”

— Thorsten Meyer AI guide

“Local inference is memory-bound — the GPU core spends much of its time waiting on VRAM.”

— Thorsten Meyer AI guide

“Power limiting moves one slider and can’t damage anything.”

— Thorsten Meyer AI guide

“Test under your real workload — a curve stable for 10 min can fail on hour 3.”

— Thorsten Meyer AI guide

What Remains Unclear

The source material does not provide independent lab validation, a full test setup, a model list or a complete methodology for every figure cited. Results may differ by GPU model, chip quality, case airflow, driver version, model size, quantization and batch settings.

It is also unclear how much of the reported benefit carries over to every local inference task. Workloads that are more compute-bound, such as long-prompt processing, large-batch inference or fine-tuning, may lose more speed when power is reduced.

What’s Next

The guide’s recommended next step is practical testing: set a conservative power limit, run the user’s actual inference workload for a sustained period, record tokens per second, temperature and power draw, then save the setting so it persists after reboot. Users who need more tuning can then try manual undervolting with longer stability checks.
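That measurement loop can be scripted around `nvidia-smi` on NVIDIA hardware. The following is a minimal sketch, assuming the inference tool reports how many tokens it generated and how long the run took; the function names and the shape of the summary are hypothetical, not taken from the guide.

```python
import statistics
import subprocess

# nvidia-smi query fields: board power (W), GPU temperature (C), SM clock (MHz).
QUERY_FIELDS = "power.draw,temperature.gpu,clocks.sm"


def parse_sample(line: str) -> dict:
    """Parse one CSV line such as '298.45, 67, 2520' into named floats."""
    power_w, temp_c, sm_mhz = (float(field) for field in line.strip().split(","))
    return {"power_w": power_w, "temp_c": temp_c, "sm_mhz": sm_mhz}


def read_sample(gpu_index: int = 0) -> dict:
    """Take a single reading from nvidia-smi (fails if a field is unsupported)."""
    out = subprocess.run(
        [
            "nvidia-smi", "-i", str(gpu_index),
            f"--query-gpu={QUERY_FIELDS}",
            "--format=csv,noheader,nounits",
        ],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_sample(out.splitlines()[0])


def summarize(samples: list[dict], tokens_generated: int, elapsed_s: float) -> dict:
    """Reduce a run to the numbers worth comparing between power limits."""
    return {
        "tokens_per_s": tokens_generated / elapsed_s,
        "avg_power_w": statistics.mean(s["power_w"] for s in samples),
        "max_temp_c": max(s["temp_c"] for s in samples),
        "min_sm_mhz": min(s["sm_mhz"] for s in samples),
    }
```

Calling `read_sample()` every few seconds while the real workload runs, then passing the collected samples to `summarize()`, gives one row per power limit. Comparing rows only makes sense if the model, prompt, quantization and batch settings are identical between runs.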

Key Questions

What is the actual development here?

Thorsten Meyer AI has published a guide and interactive infographic arguing that GPU power limiting and undervolting can reduce heat and noise in local inference workloads while preserving much of the throughput.

Is the performance claim confirmed?

Not independently. The source reports measured RTX 4090 workload figures, including about 93.4% of baseline speed at a 70% power limit, but those figures come from the guide itself and should be checked on each user’s own hardware and workload.

Why might inference tolerate lower GPU power?

The guide says many local LLM inference tasks are limited by memory bandwidth, so the GPU core may spend time waiting on VRAM rather than using full compute capacity. In that case, lowering core power can cut heat faster than it cuts tokens per second.

Is power limiting the same as undervolting?

No. Power limiting sets a wattage ceiling and lets the GPU manage clocks and voltage. Manual undervolting edits the voltage-frequency curve and can preserve more speed at a given heat level, but it requires more testing.

What remains unclear?

The guide does not establish a universal result for all GPUs or all inference jobs. Users still need to test stability, power draw and tokens per second under their own long-running workload.

Source: Thorsten Meyer AI
