Practical Guide: Estimating Compute Needs for On‑Site ML Features with Accelerator Models

Maya Thompson
2026-05-13
18 min read

Build a practical calculator for on-site ML compute, GPU sizing, latency targets, and cost per thousand users using accelerator-model thinking.

On-site machine learning features can turn a product from useful to indispensable, but they can also turn your infrastructure bill into a mystery. If you are shipping recommendations, personalization, ranking, or dynamic content selection, you need a reliable way to estimate compute before production traffic arrives. That is where a structured compute estimation process, grounded in an accelerator model and a datacenter model, becomes valuable for both marketers and engineers.

This guide shows you how to build a practical calculator for on-site ML workloads using concepts inspired by SemiAnalysis, especially its Accelerator Industry Model and Datacenter Industry Model. We will translate business goals like “improve personalization” into concrete sizing inputs: requests per second, model size, latency targets, batching strategy, CPU/GPU needs, and monthly cost per thousand users.

For teams trying to connect infrastructure planning with ROI, this matters. A recommendation system that is technically elegant but misses your latency target can hurt conversion. A personalization layer that is too expensive per user can destroy margin. And a feature that silently overprovisions GPUs may look fast in a demo while wasting cash at scale. For a useful analogy, think of this like the discipline behind designing cost-optimal inference pipelines, where the best architecture is the one that meets quality and latency constraints at the lowest sustainable cost.

Why On-Site ML Needs a Different Capacity Plan

1. Product traffic is bursty, not flat

On-site ML traffic rarely behaves like a neatly averaged dashboard line. Homepage loads spike at the top of the hour, campaign launches create sudden surges, and personalized experiences often have a much higher request fan-out than ordinary page views. That means simple average-QPS thinking underestimates peak demand, especially for interactive use cases like recommendations on category pages or real-time ranking on search result pages. If you are building around event-driven traffic patterns, it helps to think the way infrastructure teams do in why AI traffic makes cache invalidation harder: dynamic systems break naive assumptions quickly.

2. Latency is a product metric, not just an engineering metric

For many ML-powered site features, latency is inseparable from conversion. If the recommendation block takes too long, the page may render without it, or the user may scroll past before the model returns. If personalization adds enough delay to hurt Core Web Vitals or perceived responsiveness, the product gains from relevance can be offset by friction. That is why you need explicit latency targets for p50, p95, and sometimes p99, not just “fast enough.”

3. Cost per user must be legible to marketers

Marketing teams need to know whether an ML feature improves revenue more than it costs to run. A strong calculator should translate infrastructure into cost per user or cost per thousand users, because that is the language of campaign ROI, audience economics, and retention planning. This is similar to the discipline used in turning earnings data into smarter buy boxes: the metric only matters if it helps decision-makers compare options and protect margin. A good ML capacity model turns opaque GPU spend into a business input.

What SemiAnalysis Models Teach Us About Sizing

1. The accelerator model gives you supply-side context

The SemiAnalysis Accelerator Industry Model is designed to gauge historical and future accelerator production by company and type. For your internal planning, the lesson is not that you must forecast the global accelerator market. The lesson is that accelerator supply, generation mix, and price bands matter. Different GPUs have different memory capacities, throughput characteristics, and cost structures, and your sizing calculator should reflect that instead of assuming “one GPU is like another.”

2. The datacenter model forces power and hosting realism

The Datacenter Industry Model focuses on current and forecast datacenter critical IT power capacity across colocation and hyperscale environments, driven by AI accelerator deployments. That framing is useful because ML capacity planning is not only about FLOPS; it is also about rack power, thermal headroom, and deployment constraints. In other words, the best theoretical GPU plan is worthless if your hosting environment cannot support it. This is one reason teams that are serious about production AI also pay attention to network and power planning, much like the kind of thinking in hosting for AgTech resilient platforms or hosting when connectivity is spotty.

3. Models are decision aids, not crystal balls

A critical mindset: the value of the accelerator model is not perfect prediction. Its value is structure. It helps you break down the problem into measurable pieces and compare scenarios. That is exactly what your calculator should do: let users move from business assumptions to capacity outcomes, while understanding trade-offs between lower latency, higher throughput, and greater spend. This is the same reason analysts favor scenario-based planning in high-risk tech acquisitions—you do not need certainty to make better decisions; you need better bounds.

The Inputs Your Calculator Must Capture

1. Traffic assumptions

Start with sessions, page views, and ML-invoking events per user. A recommendation module on a homepage may fire once per visit, while a personalized feed may request scores multiple times per session. Capture the number of monthly active users, average sessions per user, average pages per session, and the fraction of page views that trigger inference. You should also include the peak-to-average traffic ratio so the calculator can size for realistic bursts rather than monthly averages.

2. Model characteristics

Next, capture model size, token or feature footprint, inference steps, and whether the workload uses CPU-only scoring, GPU-accelerated vector search, or a mixed pipeline. Smaller ranking models may run comfortably on CPU, especially if batched. Larger embedding or reranking systems often benefit from GPU acceleration, particularly when latency targets are strict. If your architecture resembles a multi-stage pipeline, the advice from cost-optimal inference pipeline design becomes especially relevant: put the expensive hardware only where it adds measurable value.

3. Performance and SLO inputs

Every calculator needs an explicit service-level target. That means p95 latency target, maximum acceptable queueing delay, and the percentage of requests that can be served from cache or fallback. If the user experience can tolerate a stale recommendation for 10 minutes, your compute needs may drop dramatically. If freshness matters more than depth, the reverse is true. Don’t forget to define a degradation policy for overload: no ML response, cached response, or simplified model path.
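
To make these three groups of inputs concrete, here is a minimal sketch of how they could be captured in one structure. Every field name and default value below is an illustrative assumption, not a benchmark:

```python
from dataclasses import dataclass

@dataclass
class CalculatorInputs:
    # Traffic assumptions
    monthly_active_users: int = 100_000
    ml_actions_per_user: float = 20.0     # inference-triggering events per user per month
    request_coverage: float = 1.0         # fraction of those events that reach the model
    peak_multiplier: float = 5.0          # peak-to-average traffic ratio

    # Model characteristics
    cpu_ms_per_request: float = 8.0       # CPU time per inference, in milliseconds
    gpu_ms_per_request: float = 0.0       # GPU kernel time per inference, 0 for CPU-only

    # Performance and SLO inputs
    p95_latency_target_ms: float = 100.0
    cache_hit_rate: float = 0.0           # fraction of requests served without inference
    headroom_factor: float = 1.25         # 25% spare capacity for spikes and failover
```

A structure like this keeps the business-facing sliders (users, actions, coverage) separate from the engineering constants (per-request cost, SLO targets), which makes the formulas in the next section easier to audit.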

A Practical Estimation Framework You Can Use Today

1. Step 1: Convert business traffic into inference requests

Begin with monthly active users and the average number of ML requests per user per month. For example, 100,000 active users with 20 recommendation-triggering actions each produces 2,000,000 monthly inference events. Divide by the number of seconds in a month to get average requests per second, then apply your peak multiplier. If your traffic spikes 5x during campaigns, your production sizing should reflect that peak, not the average.
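
Written out as a quick sketch, that worked example looks like this (2,592,000 is simply the number of seconds in a 30-day month):

```python
SECONDS_PER_MONTH = 30 * 24 * 3600              # 2,592,000 for a 30-day month

monthly_requests = 100_000 * 20                 # 2,000,000 inference events per month
avg_qps = monthly_requests / SECONDS_PER_MONTH  # ~0.77 requests per second on average
peak_qps = avg_qps * 5                          # ~3.9 requests per second at a 5x campaign spike
```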

2. Step 2: Estimate work per request

Measure or estimate the compute cost of one inference in milliseconds of CPU time or GPU kernel time. For example, a lightweight ranking model might take 8 ms CPU time per request, while a dual-stage recommendation system could take 3 ms on CPU for retrieval and 20 ms on GPU for reranking. Multiply work per request by concurrent requests and adjust for batching efficiency. This is where many teams underestimate cost: batching helps throughput, but it can also add queueing delay and threaten latency targets.
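
One way to sanity-check this step is to express demand as milliseconds of compute consumed per second of wall clock, which approximates how many fully busy cores or GPU streams you need. The per-request times and the batching gain below are illustrative assumptions:

```python
peak_qps = 3.9                   # from Step 1
cpu_ms_per_request = 3.0         # retrieval stage on CPU
gpu_ms_per_request = 20.0        # reranking stage on GPU
batching_gain = 1.5              # assumed throughput improvement from batching

cpu_busy = peak_qps * cpu_ms_per_request / 1000                  # ~0.012 CPU-seconds per second
gpu_busy = peak_qps * gpu_ms_per_request / 1000 / batching_gain  # ~0.052 GPU-seconds per second
# A value near 1.0 means one fully saturated core or GPU; in practice,
# queueing starts to hurt p95 latency well before full saturation.
```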

3. Step 3: Convert to instance counts and hardware class

Once you know peak requests per second and effective milliseconds per request, you can estimate the number of cores, vCPUs, or GPUs required. The rule of thumb is simple: if one worker can process X requests per second while keeping p95 under your target, then peak demand divided by X gives you the number of workers, plus headroom. The right answer is usually not “maximize utilization,” but “stay below the point where queueing harms the user experience.” This is why many teams revisit guidance like right-sizing GPUs, ASICs, and inference stacks before signing cloud commitments.
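
As a sketch, the rule of thumb translates directly into code. The sustainable throughput figure here is a hypothetical staging benchmark, not a published number:

```python
import math

peak_qps = 250.0                   # hypothetical peak demand
sustainable_qps_per_worker = 40.0  # measured per worker while p95 stays under target
headroom_factor = 1.25             # 25% cushion for spikes, retries, and failover

workers = math.ceil(peak_qps / sustainable_qps_per_worker * headroom_factor)
print(workers)  # 8 workers; a zero-headroom plan would have suggested 7
```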

CPU vs GPU Sizing for Recommendations and Personalization

1. When CPU is enough

CPU is often sufficient for simple scoring, rule-based personalization, feature lookups, and smaller tree-based models. It is also attractive when traffic is moderate and latency targets are forgiving. CPU-first systems are generally easier to deploy, cheaper to autoscale, and simpler to observe. For marketing teams launching a first personalization project, a CPU-first architecture can be the fastest route to value without the operational complexity of GPU scheduling.

2. When GPU becomes the better bet

GPU is usually justified when the model is larger, the request volume is high, or the latency target is tight enough that CPU scaling becomes inefficient. This is common in modern recommendations where embeddings, ANN retrieval, reranking, and contextual features all sit in one path. GPUs can reduce response time and increase throughput, but they also change the economics: you need to think in terms of utilization, batch size, queue depth, and duty cycle. For broader context on hardware tradeoffs, see alternatives to high-bandwidth memory, which reinforces that hardware choices should follow workload economics, not hype.

3. Mixed architectures often win

In production, many teams use a hybrid design: CPU for candidate generation, GPU for reranking, and cache for repeated requests. This reduces total GPU time while preserving high-quality personalization where it matters most. You can also place expensive models behind eligibility rules, serving them only to logged-in or high-value users. The resulting system tends to deliver a better cost per thousand users than a pure GPU strategy. That is why capacity planning should support architecture scenarios, not a single answer.

How to Build the Calculator: A Marketer-Friendly Formula

1. Core formula structure

Your calculator should expose a few friendly inputs and hide the complexity behind formulas. A simple version looks like this:

Monthly inference requests = active users × ML-triggering actions per user × request coverage

Peak QPS = monthly inference requests ÷ seconds per month × peak multiplier

Required workers = peak QPS ÷ sustainable QPS per worker × headroom factor

Cost per thousand users = monthly infrastructure cost ÷ active users × 1,000

That last metric is what marketers can actually use. It allows a campaign manager to compare the cost of personalization across segments, markets, or landing page experiments.
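
Putting the four formulas together, a minimal end-to-end sketch might look like the following. The function name and every number in the example call are placeholders to replace with your own measurements:

```python
import math

SECONDS_PER_MONTH = 30 * 24 * 3600

def size_feature(active_users, actions_per_user, coverage, peak_multiplier,
                 qps_per_worker, headroom, cost_per_worker_month):
    """Apply the four core formulas and return business-facing outputs."""
    monthly_requests = active_users * actions_per_user * coverage
    peak_qps = monthly_requests / SECONDS_PER_MONTH * peak_multiplier
    workers = math.ceil(peak_qps / qps_per_worker * headroom)
    monthly_cost = workers * cost_per_worker_month
    return {
        "monthly_requests": monthly_requests,
        "peak_qps": round(peak_qps, 2),
        "workers": workers,
        "cost_per_thousand_users": round(monthly_cost / active_users * 1000, 2),
    }

# 500k users, 30 events each, 40% coverage, $150/month per worker (all assumed)
print(size_feature(500_000, 30, 0.4, 5, 40, 1.25, 150))
```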

2. Add scenario sliders

The most useful calculators let users change latency target, cache hit rate, and model class. For example, a slider that moves p95 from 80 ms to 200 ms should show how much CPU or GPU capacity is saved. A cache hit rate slider can demonstrate the value of memoizing repeated recommendations. A model quality slider can show the cost delta between a lightweight ranker and a more sophisticated reranker. This is the same principle behind good decision support in budget-conscious decision-making: make the trade-off visible before the purchase.
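
A slider is just a parameter sweep. Building on the size_feature sketch above, a cache hit rate slider could be modeled by shrinking the share of requests that actually reach the model:

```python
for cache_hit_rate in (0.0, 0.3, 0.6):
    result = size_feature(
        active_users=500_000, actions_per_user=30,
        coverage=0.4 * (1 - cache_hit_rate),  # cached requests skip inference entirely
        peak_multiplier=5, qps_per_worker=40,
        headroom=1.25, cost_per_worker_month=150,
    )
    print(cache_hit_rate, result["workers"], result["cost_per_thousand_users"])
```

The same pattern works for the latency slider: re-benchmark sustainable QPS per worker at each p95 target and feed the new number in.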

3. Show the answer in business terms

Do not stop at “you need 4 GPUs.” Show “this equals $0.42 per thousand users at your current traffic profile.” Then show what happens if traffic doubles, or if you loosen latency from 50 ms to 120 ms. That helps stakeholders understand whether to optimize for margin, speed, or quality. It also creates a shared language for product, marketing, and infrastructure teams.

Comparison Table: Common On-Site ML Sizing Patterns

| Workload Pattern | Typical Hardware | Latency Target | Best For | Cost Risk |
| --- | --- | --- | --- | --- |
| Rule-based personalization | CPU only | 100-300 ms | Simple content swaps, audience targeting | Low |
| Light ranking model | CPU or small GPU | 50-150 ms | Homepage modules, next-best-action prompts | Low to medium |
| Embedding retrieval + rerank | Mixed CPU/GPU | 40-120 ms | Recommendations, related content | Medium |
| Real-time session personalization | GPU preferred | 20-80 ms | Adaptive feeds, high-frequency commerce UX | Medium to high |
| Large candidate generation pipeline | Multi-GPU or GPU cluster | 30-100 ms | High-scale marketplaces and media products | High |

The practical value of the table is that it makes architecture selection easier. A smaller site with moderate traffic should not overbuild for a real-time GPU cluster if a CPU-based ranker can meet the business target. Conversely, a high-scale product that depends on instant relevance may find CPU-only scoring too fragile under load. This kind of decision tree is also useful in adjacent planning work like testing app stability after major UI changes, because the product experience is only as good as the weakest operational assumption.

Cost per Thousand Users: The Metric Everyone Should Use

1. Why CPM-like thinking works for infrastructure

Marketing teams are accustomed to thinking in cost per thousand impressions, cost per click, and cost per acquisition. You can apply the same discipline to ML infrastructure by measuring cost per thousand active users or cost per thousand personalized sessions. That creates an apples-to-apples way to compare one model strategy against another. It also lets finance teams forecast spend more accurately when user growth changes.

2. A sample calculation

Imagine 500,000 monthly active users, 30 personalized events per user, and a 40% request coverage rate. That gives 6,000,000 monthly inference requests. If your blended infrastructure cost is $900 per month, then cost per thousand users is $1.80. If a more advanced model raises the bill to $2,700 but improves conversion enough to justify it, the business case becomes visible immediately. Without this metric, teams often debate model quality without understanding total economics.
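
For clarity, the arithmetic in that example works out as follows:

```python
monthly_requests = 500_000 * 30 * 0.40   # 6,000,000 inference requests per month
baseline = 900 / 500_000 * 1000          # $1.80 per thousand users
upgraded = 2_700 / 500_000 * 1000        # $5.40 with the more advanced model
```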

3. Tie cost to revenue and margin

For commercial teams, the final question is not “is it expensive?” but “does it pay back?” You should connect cost per thousand users to uplift in average order value, retention, click-through rate, or lead quality. That is the same kind of cause-and-effect reasoning used when teams evaluate impact reports that drive action instead of vanity metrics. If personalization adds 2% revenue lift at a cost of $1.80 per thousand users, you have a framework for deciding whether to scale, optimize, or sunset the feature.

Latency Targets and Capacity Planning in the Real World

1. Set separate targets for user-facing and internal systems

Not all latency is equal. A user-facing recommendation response should hit a tight p95 target, while an asynchronous personalization refresh job may tolerate much longer execution time. Your calculator should distinguish synchronous path latency from offline job latency, because mixing them leads to overprovisioning. The best operational plans clearly separate live serving from batch enrichment.

2. Use headroom deliberately

Capacity planning should always include headroom for traffic spikes, noisy neighbors, model version changes, and failover scenarios. A 20-30% headroom factor is common, but the right number depends on how critical the feature is and how expensive overload would be. If the personalized module is a core revenue driver, more headroom may be justified. If it is a secondary enhancement, a lighter cushion may be acceptable. This is similar to the way teams approach vendor AI spend procurement: you buy resilience when the downside of undercapacity is larger than the incremental cost.

3. Plan for degradation modes

Good calculators should include a fallback policy under overload. If GPU utilization exceeds a threshold, can the system serve cached results, simplify candidate generation, or suppress the feature entirely? Each fallback path changes the effective compute requirement and the user experience. That makes degradation strategy a first-class capacity planning topic, not a postscript.
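
As a sketch of how a degradation policy might be encoded, with thresholds and tier names that are purely illustrative:

```python
def choose_serving_path(gpu_utilization: float, cache_available: bool) -> str:
    """Pick a degradation tier based on current load (illustrative thresholds)."""
    if gpu_utilization < 0.70:
        return "full_model"         # normal path: full retrieval plus GPU rerank
    if gpu_utilization < 0.90:
        return "lightweight_model"  # skip the reranker, serve the CPU ranker only
    if cache_available:
        return "cached_response"    # serve a recent result instead of queueing
    return "feature_suppressed"     # render the page without the ML module
```

Each tier maps to a different effective compute requirement, which is why the policy belongs inside the calculator rather than next to it.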

Checklist for Deploying Your Own Estimation Model

1. Gather baseline measurements

Before you size anything, instrument your current traffic. Measure request rates, latency distribution, cache hit rates, and average inference time in staging or shadow traffic. If you do not have measurements, start with conservative benchmarks and note the uncertainty. You can also borrow strategic thinking from competitive intelligence playbooks: observe what high-performing peers track, then adapt the framework to your own business.

2. Test three scenarios

Your calculator should support at least three cases: conservative, expected, and aggressive growth. The conservative case tests minimal traffic and simple models. The expected case represents your current roadmap. The aggressive case models a successful campaign or product launch. If all three scenarios remain within budget and latency targets, you have a robust plan. If one scenario breaks, that is where you focus optimization.
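
Building on the size_feature sketch from earlier, the three cases can be expressed as a simple sweep; every figure here is a placeholder:

```python
scenarios = {
    "conservative": dict(active_users=200_000, actions_per_user=10, peak_multiplier=3),
    "expected":     dict(active_users=500_000, actions_per_user=30, peak_multiplier=5),
    "aggressive":   dict(active_users=1_000_000, actions_per_user=45, peak_multiplier=8),
}
for name, s in scenarios.items():
    print(name, size_feature(coverage=0.4, qps_per_worker=40,
                             headroom=1.25, cost_per_worker_month=150, **s))
```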

3. Review economics with stakeholders

Once the calculator is ready, use it in product planning and quarterly reviews. It should answer questions like: Which segment gets personalized first? Which model class fits our budget? What is the cost of tightening latency from 150 ms to 60 ms? The planning process becomes much better when technical teams and revenue teams share the same assumptions. For inspiration on building clearer internal processes, see building an API strategy with governance, which emphasizes that sustainable systems need both architecture and operating rules.

Common Mistakes to Avoid

1. Sizing off averages instead of peaks

The most frequent mistake is using average traffic and average latency to justify too-small infrastructure. That approach fails the moment a campaign lands or a page layout changes and the traffic pattern shifts. Capacity planning must include burst behavior and concurrency, or your “successful” model will fail under real load. This kind of error is common enough that it deserves to sit alongside the practical warnings in timing-sensitive purchasing guides: the window matters, not the average day.

2. Ignoring the cost of experimentation

ML features are rarely static. Teams test new embeddings, adjust feature sets, and retrain models regularly. If your calculator only covers one stable model version, it will understate real costs. Include a factor for experimentation, A/B tests, and shadow deployments, especially when personalization is still evolving.

3. Forgetting observability and control plane costs

Inference is not the whole bill. Logging, tracing, monitoring, retries, rollout safety, feature stores, and data pipelines also consume budget. A mature capacity plan should account for these overheads, especially when the feature is business-critical. That operational realism is the difference between a demo and a durable system.

Frequently Asked Questions

How do I estimate GPU needs for recommendations if I only know monthly active users?

Start by estimating how many actions per user trigger inference, then convert those actions into total monthly requests. From there, calculate peak QPS using a realistic burst multiplier, and divide by the sustainable throughput of one GPU worker at your latency target. Add headroom for retries and traffic spikes. If you do not know throughput yet, benchmark a representative model path in staging before committing to production sizing.

What latency target should I use for on-site personalization?

It depends on how visible the ML response is to the user. For in-page, synchronous personalization, p95 often needs to stay below 100 ms, and lower is better if the feature blocks rendering. For less critical modules, 150-300 ms may still be acceptable if the page can render a placeholder or cached result first. The right answer is not one number; it is a target tied to user experience and fallback design.

Is a CPU-only architecture enough for modern recommendation systems?

Sometimes yes, especially for simple ranking, rule-based personalization, or moderate traffic. CPU-only systems are easier to operate and often cheaper early on. But once your model complexity rises, your traffic becomes bursty, or your latency target tightens, GPU or mixed architectures may offer better economics. The best decision comes from scenario testing, not assumption.

How do I translate infrastructure spend into cost per thousand users?

Take the total monthly infrastructure cost for the feature and divide it by monthly active users, then multiply by 1,000. If your spend is $1,200 and you have 400,000 active users, your cost per thousand users is $3.00. You can refine this further by segment or feature usage rate if only part of your audience receives the ML experience.

Why reference SemiAnalysis models for an internal calculator?

Because they provide a disciplined way to think about supply, deployment constraints, and datacenter power economics. The accelerator model helps you reason about hardware classes and availability, while the datacenter model keeps your plan grounded in power and infrastructure realities. Even if you are not buying at hyperscale, the framework improves the quality of your assumptions and reduces overengineering.

Conclusion: Turn ML Sizing into a Repeatable Business Process

On-site ML only becomes scalable when teams can predict its compute cost with enough accuracy to make smart product and budget decisions. A practical calculator built around accelerator modeling gives you a repeatable way to estimate CPU/GPU needs, align latency targets with user experience, and express cost in terms marketers can evaluate. That combination is what turns personalization from a technical experiment into a controllable growth lever.

If you are formalizing this process, start with the measurement inputs, then layer in scenario modeling, then connect the output to cost per thousand users and revenue impact. For more perspective on the broader technical and commercial tradeoffs around AI infrastructure, it is worth revisiting cost-optimal inference pipeline design, hardware alternatives for AI workloads, and AI traffic and cache behavior. Those ideas will help you stay grounded as your model portfolio grows.


Maya Thompson

Senior SEO Editor

Senior editor and content strategist. Writing about technology, design, and the future of digital media. Follow along for deep dives into the industry's moving parts.
