Rate limits that protect upstreams—without punishing partners who did nothing wrong
Fairness tiers, burst behavior, and gateway-level enforcement so one noisy client does not become everyone’s incident.
- api-management
- reliability
- developer-experience
Rate limiting is often implemented as a panic button: one big number at the edge, tuned after an outage, that makes every client pay for one bad integration—or one runaway batch job. The better framing is capacity allocation: who is entitled to which slice of upstream time, how bursts behave, and where enforcement sits relative to authentication and expensive work.
Automated clients and tool loops (including AI-driven callers) amplify the problem: retries multiply load, and naive or synchronized backoff can produce retry storms. A flat global cap turns localized noise into a multi-tenant incident—exactly when your status page should not say “we are investigating elevated error rates” for everyone.
Why global caps optimize the wrong objective
A single limit is easy to operate and hard to defend in a post-incident review:
- Partners with contractual throughput expect predictable behavior during month-end or settlement windows—not a mystery 429 because another tenant spiked.
- Internal callers often burst during deploys, cache cold start, or reconciliation jobs; they need separate budgets from external traffic.
- Sandbox and try-it flows should not consume the same budget as production integration tests—or your demo environment becomes a denial-of-service lever against prod-adjacent paths sharing infrastructure.
“Fair” usually means multiple dimensions: identity (client, partner, credential), API product or route class, and environment (sandbox vs live). Limits applied after you can authenticate and classify the caller stay explainable: “partner A exhausted their tier,” not “the internet got slower.”
Algorithms in brief: what operators actually choose
You do not need a PhD to choose a strategy—you need clarity about fairness and burst behavior:
| Approach | Strength | Risk |
|---|---|---|
| Fixed window | Simple | Thundering herd at window boundaries |
| Sliding window / log | Smoother | More memory per key |
| Token bucket | Natural burst allowance | Tune burst vs sustained rates carefully |
| Leaky bucket | Smooth output | Can feel unfair to bursty legit traffic |
Many gateways combine token-bucket-style burst with per-key partitions. The product decision is usually what the key is (IP-only vs client ID vs partner ID vs API key), not which formula wins a benchmark.
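To make the keying decision concrete, here is a minimal in-process sketch of a token bucket partitioned by a composite key. The key shape, tier numbers, and `check` helper are illustrative, not any particular gateway's API:

```python
import time
from dataclasses import dataclass, field

@dataclass
class TokenBucket:
    capacity: float                 # burst allowance
    rate: float                     # sustained tokens per second
    tokens: float = 0.0
    updated: float = field(default_factory=time.monotonic)

    def __post_init__(self):
        self.tokens = self.capacity  # start full: burst available immediately

    def allow(self, cost: float = 1.0) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at burst capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False

# One bucket per (partner, route-class) key: noise stays inside its partition.
buckets: dict[tuple[str, str], TokenBucket] = {}

def check(partner: str, route_class: str) -> bool:
    key = (partner, route_class)
    if key not in buckets:
        # Illustrative tier: burst of 10, sustained 5 requests/second.
        buckets[key] = TokenBucket(capacity=10, rate=5)
    return buckets[key].allow()
```

With the key set to `(partner, route_class)`, one partner exhausting its bucket leaves every other partition untouched; switching the key to IP-only would collapse many callers behind a NAT into one shared budget.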
Enforcement order: the detail that decides whether limits work
If you throttle before you can identify the caller, you may still shed load—but you cannot attribute it, and you cannot grant targeted relief (e.g. raise a partner tier during a negotiated batch).
If you throttle after expensive synchronous work—large body parsing, deep auth chains, database fan-out—you have already burned the resources you were trying to protect.
A durable mental model:
- Reject cheap — Malformed requests, oversize bodies, obvious abuse—where policy allows early drop.
- Authenticate — Establish client/partner context; mTLS or tokens as your architecture requires.
- Classify — Map the caller to API product, profile, and quota tier.
- Spend — Forward to origin only when budget remains.
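The four steps above can be sketched as a single handler with injected hooks. The hook names (`authenticate`, `classify`, `budget_for`, `forward`) and the 1 MiB body threshold are hypothetical stand-ins, not a real gateway's interface:

```python
MAX_BODY_BYTES = 1 << 20  # 1 MiB, an arbitrary example threshold

def handle(request: dict, authenticate, classify, budget_for, forward) -> dict:
    # 1. Reject cheap: drop oversize requests before any expensive work.
    if request.get("body_bytes", 0) > MAX_BODY_BYTES:
        return {"status": 413}
    # 2. Authenticate: establish client/partner context.
    caller = authenticate(request)
    if caller is None:
        return {"status": 401}
    # 3. Classify: map the caller to API product, profile, and quota tier.
    tier = classify(caller, request["route"])
    # 4. Spend: forward to origin only when budget remains.
    if not budget_for(caller, tier).allow():
        return {"status": 429, "tier": tier}
    return forward(request)
```

The ordering is the point: a 413 costs almost nothing, a 401 costs one auth check, and only budgeted, classified traffic ever touches the origin.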
IP-only limits are a weak default: NATs, carrier-grade NAT, and partner egress IPs multiplex many real clients. Per-IP still has a place as a coarse DDoS signal—but not as the only dimension for B2B APIs.
Zerq’s model routes access control, logging, rate limits, and audit on the same path; see Capabilities under Access & Security and Observability & Metrics for product- and partner-scoped visibility.
Retry storms: coordination without synchronized pain
When clients retry on 5xx or timeouts, uncapped retry multipliers can overload a recovering system. Mitigations that work in production:
- Jittered exponential backoff—not fixed intervals that align across clients.
- Retry budgets per operation in client libraries—especially for batch jobs.
- Idempotency keys for writes so safe retries do not double business effects.
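A client-side sketch of the first two mitigations, assuming full-jitter backoff (each delay drawn uniformly below an exponentially growing ceiling) and a simple per-operation retry budget; the parameter defaults are illustrative:

```python
import random

def backoff_delays(base: float = 0.5, cap: float = 30.0,
                   attempts: int = 5, rng=random.random) -> list[float]:
    """Full-jitter backoff: each delay is uniform in [0, min(cap, base * 2**n)],
    so retrying clients desynchronize instead of aligning on fixed intervals."""
    delays = []
    for attempt in range(attempts):
        ceiling = min(cap, base * (2 ** attempt))
        delays.append(rng() * ceiling)
    return delays

class RetryBudget:
    """Caps total retries per operation so a batch job cannot multiply load."""
    def __init__(self, max_retries: int):
        self.remaining = max_retries

    def spend(self) -> bool:
        if self.remaining <= 0:
            return False
        self.remaining -= 1
        return True
```

A batch job would call `spend()` before each retry and give up when it returns `False`, rather than hammering a recovering upstream.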
At the gateway, consistent 429 semantics help: partners can decode policy vs overload—see below.
429 as contract: partners integrate against behavior
Partners ship code to your errors, not your intentions. A useful 429 story includes:
- Stable machine-readable codes and human-readable messages—no string scraping in regex hell.
- Retry-After or explicit backoff hints where you can commit; ambiguous retries fuel storms.
- Separation of “you exceeded your quota” vs “the platform is capacity-limited”; mixing them erodes trust in status communication.
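A hedged sketch of what that contract could look like on the wire: two distinct 429 payloads with a Retry-After header. The `code` values and JSON shape here are illustrative, not a standard; what matters is that partners can branch on a stable machine-readable field instead of scraping messages:

```python
import json

def quota_429(retry_after_s: int, tier: str) -> dict:
    """The caller exhausted their own tier: their problem to pace."""
    return {
        "status": 429,
        "headers": {"Retry-After": str(retry_after_s)},
        "body": json.dumps({
            "code": "quota_exceeded",  # illustrative stable code
            "message": f"Tier '{tier}' quota exhausted; retry after {retry_after_s}s.",
        }),
    }

def capacity_429(retry_after_s: int) -> dict:
    """The platform is shedding load: not the caller's fault."""
    return {
        "status": 429,
        "headers": {"Retry-After": str(retry_after_s)},
        "body": json.dumps({
            "code": "platform_capacity",  # illustrative stable code
            "message": "Platform is capacity-limited; back off with jitter.",
        }),
    }
```

Keeping the two codes distinct lets a partner page their own team for `quota_exceeded` and trust your status page for `platform_capacity`.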
If you use workflows at the edge for scope checks or early exits, you keep expensive paths cold for unauthorized traffic—see Design gateway workflows without shipping another microservice.
Capacity planning: limits are not a substitute for sizing
Rate limits protect shared pools—they do not fix undersized databases. Healthy programs pair limits with:
- SLOs per API product (latency, availability).
- Load testing that includes retry and burst behavior.
- Autoscaling where safe—with guardrails so runaway scale does not mask abuse.
Exercise: burst isolation for your top three routes
For your three highest-cost routes, document:
- Who may call (partner tiers, internal vs external).
- Where limits apply (gateway vs origin—prefer gateway for shaping).
- What a well-formed 429 looks like (JSON shape, headers).
Then simulate a burst from one partner and confirm others stay within SLO. If you cannot isolate noise, your keys or partitioning strategy may be wrong.
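The burst simulation can be done offline before touching the gateway. A minimal sketch, using fixed per-partner budgets as a stand-in for whatever partitioned limiter you actually run (the budget size is arbitrary):

```python
def simulate(requests: list[str], budget_per_partner: int = 20) -> dict:
    """requests: partner ids in arrival order. Returns per-partner
    (accepted, rejected) counts under isolated per-partner budgets."""
    spent: dict[str, int] = {}
    results: dict[str, tuple[int, int]] = {}
    for partner in requests:
        ok = spent.get(partner, 0) < budget_per_partner
        if ok:
            spent[partner] = spent.get(partner, 0) + 1
        accepted, rejected = results.get(partner, (0, 0))
        results[partner] = (accepted + ok, rejected + (not ok))
    return results

# One partner bursts 100 requests while another sends 5: the burst is
# clipped inside its own partition and the quiet partner is untouched.
out = simulate(["a"] * 100 + ["b"] * 5)
# out == {"a": (20, 80), "b": (5, 0)}
```

If the same simulation with your real keying strategy shows partner B absorbing rejections during partner A's burst, the partition key (not the limit number) is what needs fixing.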
Summary: Fair limiting is multi-dimensional, authenticated, and explainable. Global caps are a blunt instrument—useful for survival, not for partner relationships.
Request a demo if you want to walk tiers and portal-visible products on the platform.