This page describes how Vast.ai API errors and rate limits currently work, with practical retry guidance.

Error Responses

Error responses vary slightly by endpoint. The most common error response shape is:
{
  "success": false,
  "error": "invalid_args",
  "msg": "Human-readable description of the problem."
}
Some endpoints omit the boolean success. Some omit error and return only msg or message.
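
Because the exact shape varies, client code should not assume every field is present. A minimal normalization sketch (the helper name and the field precedence are illustrative, not part of the API):
def extract_error(payload: dict) -> tuple[bool, str]:
    """Normalize the varying error shapes into (is_error, message)."""
    # An error is signaled by success == false or by the presence of an error code.
    is_error = payload.get("success") is False or "error" in payload
    # Prefer msg, then message, then the error code itself.
    message = payload.get("msg") or payload.get("message") or payload.get("error") or ""
    return is_error, message

# Example with the shape shown above (the message text is a made-up illustration):
err, msg = extract_error({"success": False, "error": "invalid_args", "msg": "Bad argument."})
# err is True, msg is "Bad argument."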

Rate Limits

How rate limits work

Vast.ai enforces rate limits using a token bucket model at multiple levels:
  1. Infrastructure level (per IP): protects against high-volume traffic before it reaches the API.
  2. Account level (per API key): a global token bucket shared across all endpoints for your account, enforced via Redis.
  3. Endpoint level (per endpoint and method): an independent token bucket for each API endpoint and HTTP method combination.

Token bucket model

Each rate limit is defined by a token bucket with these parameters:
  • max_tokens (capacity): the maximum number of tokens the bucket can hold. This is your burst allowance — how many requests you can make in rapid succession before being throttled.
  • token_refresh_rate (tokens/sec): how quickly tokens refill. A rate of 2.0 means you regain 2 tokens per second.
  • penalty_tokens: extra tokens deducted when a request is rejected (429). This pushes the bucket into “debt,” requiring additional recovery time before the next request is accepted.
How it works:
  1. Each request consumes 1 token from the bucket.
  2. Tokens refill continuously at the configured token_refresh_rate, up to max_tokens.
  3. If less than one full token is available, the request is rejected with HTTP 429.
  4. On rejection, penalty_tokens (if configured) push the bucket into negative balance, extending the cooldown period.
For example, an endpoint configured with max_tokens=5 and token_refresh_rate=1.0 allows a burst of 5 rapid requests, then sustains 1 request/second thereafter.
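
As a concrete illustration of the model above, here is a minimal token bucket simulation in Python (a sketch, not Vast.ai's implementation; only the parameter names mirror the ones documented here):
import time

class TokenBucket:
    """Continuous-refill token bucket (illustrative sketch of the model above)."""
    def __init__(self, max_tokens: float, token_refresh_rate: float, penalty_tokens: float = 0.0):
        self.max_tokens = max_tokens
        self.token_refresh_rate = token_refresh_rate
        self.penalty_tokens = penalty_tokens
        self.tokens = max_tokens            # start full: the whole burst allowance is available
        self.last = time.monotonic()

    def _refill(self) -> None:
        now = time.monotonic()
        self.tokens = min(self.max_tokens, self.tokens + (now - self.last) * self.token_refresh_rate)
        self.last = now

    def allow(self) -> bool:
        self._refill()
        if self.tokens >= 1.0:
            self.tokens -= 1.0              # each accepted request consumes one token
            return True
        self.tokens -= self.penalty_tokens  # rejection: penalty pushes the bucket into debt
        return False

# max_tokens=5, token_refresh_rate=1.0: a burst of 5, then about 1 request/second sustained.
bucket = TokenBucket(max_tokens=5, token_refresh_rate=1.0)
print([bucket.allow() for _ in range(7)])   # roughly: five True values, then False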

Two-tier enforcement

Rate limits are enforced at two tiers that work together:
  • Local (in-process): each API server process maintains its own token buckets. This handles per-endpoint limits with zero network overhead.
  • Global (Redis): a shared token bucket across all server processes, enforced via a Redis lease mechanism. Local processes lease tokens from Redis in small batches to minimize round-trips while maintaining a consistent global budget.
The response headers always reflect the most restrictive constraint between the two tiers.
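
The lease mechanism can be pictured with a small in-memory sketch (purely illustrative: the real global tier is Redis-backed, token refill is omitted for brevity, and the class names and batch size below are assumptions):
import threading

class GlobalBudget:
    """Stand-in for the shared global bucket; hands out tokens in batches."""
    def __init__(self, tokens: int):
        self._tokens = tokens
        self._lock = threading.Lock()

    def lease(self, batch: int) -> int:
        with self._lock:
            granted = min(batch, self._tokens)
            self._tokens -= granted
            return granted

class LocalLimiter:
    """Per-process limiter that spends leased tokens locally, refetching in batches."""
    def __init__(self, budget: GlobalBudget, batch: int = 10):
        self.budget = budget
        self.batch = batch
        self.leased = 0

    def allow(self) -> bool:
        if self.leased == 0:
            # One round-trip per batch instead of per request keeps overhead low.
            self.leased = self.budget.lease(self.batch)
        if self.leased == 0:
            return False                    # global budget exhausted -> reject (429)
        self.leased -= 1
        return True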

Identity and scope

  • Rate limits are tracked per API key. If no key is provided, your client IP is used instead.
  • Each endpoint and HTTP method combination (e.g., GET /api/v0/instances/ vs PUT /api/v0/instances/{id}/) has its own independent token bucket.
  • Rate limit policies are configurable per endpoint, per permission group, or globally via a wildcard.

Response headers

For requests where rate-limiting logic is evaluated, the response includes these headers:
  • X-RateLimit-Limit
  • X-RateLimit-Remaining
  • X-RateLimit-Reset
  • Retry-After (on 429 responses)
If multiple rate-limit layers apply, headers represent the most restrictive active constraint.
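
A small sketch of acting on these headers with the requests library (the interpretation of X-RateLimit-Reset as a Unix timestamp is an assumption; check the values your own responses return):
import time
import requests

def seconds_to_wait(resp: requests.Response) -> float:
    """Read the rate-limit headers and return how long to pause before the next call."""
    limit = resp.headers.get("X-RateLimit-Limit")
    remaining = resp.headers.get("X-RateLimit-Remaining")
    reset = resp.headers.get("X-RateLimit-Reset")
    print(f"limit={limit} remaining={remaining} reset={reset}")
    if remaining is not None and float(remaining) <= 0 and reset is not None:
        # Assumption: reset is a Unix timestamp; if it is a relative delay, use it directly.
        return max(0.0, float(reset) - time.time())
    return 0.0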

429 response behavior

When you hit a rate limit, you receive HTTP 429 with a JSON body:
{
  "error": "HTTPTooManyRequests",
  "msg": "API requests too frequent",
  "retry_after": 3,
  "limit": 5,
  "remaining": 0
}
  • retry_after: seconds to wait before retrying (matches the Retry-After header).
  • limit: the bucket’s max_tokens capacity.
  • remaining: tokens remaining (always 0 on a 429).
If penalty_tokens are configured for the endpoint, repeated 429 responses will increase retry_after as the bucket accrues debt.
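
A simple retry sketch that honors the 429 fields above (the helper is illustrative; pass in whatever URL and auth headers your client already uses):
import time
import requests

def get_with_retry(url: str, headers: dict, max_attempts: int = 5) -> requests.Response:
    """Retry GETs on 429, waiting Retry-After seconds (falling back to the JSON retry_after)."""
    for _ in range(max_attempts):
        resp = requests.get(url, headers=headers)
        if resp.status_code != 429:
            return resp
        wait = resp.headers.get("Retry-After") or resp.json().get("retry_after", 1)
        # Repeated 429s can lengthen retry_after if penalty_tokens apply, so always re-read it.
        time.sleep(float(wait))
    return resp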

Probabilistic output model

The token bucket limiter is deterministic for a given request timeline, but if request arrivals are modeled as a random process (Poisson with rate r requests/second), a useful approximation is:
  • Single-token bucket (max_tokens=1, refresh rate R tokens/sec): P(429) = 1 - exp(-r / R). This is equivalent to the threshold model where T = 1/R.
  • Burst-capable bucket (max_tokens=B, refresh rate R tokens/sec): for sustained traffic at rate r > R, the probability of rejection approaches 1 once the initial burst of B tokens is consumed; for r <= R, the bucket stays full and P(429) ≈ 0.
[Chart: Modeled probability of rate-limit responses by request rate and threshold]
This graph models single-token buckets (the most common configuration). For burst-capable buckets, the initial burst absorbs spikes before the sustained rate takes effect. Actual outcomes depend on real request timing, burst shape, penalty debt, and which limit tier (local or global) binds first.
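
To make the single-token approximation concrete, a quick check of the formula above (the values match the representative table below):
import math

def p_429(r: float, R: float) -> float:
    """P(429) for a single-token bucket under Poisson arrivals at rate r req/s."""
    return 1.0 - math.exp(-r / R)

print(round(p_429(1.0, 2.0), 4))   # 0.3935
print(round(p_429(1.0, 0.5), 4))   # 0.8647
print(round(p_429(0.5, 0.2), 4))   # 0.9179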

Chart data used

The exact plotted data points are available as a CSV. CSV details:
  • Request-rate domain: 0.000 to 5.000 requests/second
  • Step size: 0.025 requests/second
  • Rows: 201 points (plus header)
  • Curves: R = 2.0 (T=0.5s), R = 1.0 (T=1s), R = 0.5 (T=2s), R = 0.2 (T=5s)
Representative points from the plotted data:
Request rate r (req/s)   P(429) at R=2.0   P(429) at R=1.0   P(429) at R=0.5   P(429) at R=0.2
0.0                      0.0000            0.0000            0.0000            0.0000
0.5                      0.2212            0.3935            0.6321            0.9179
1.0                      0.3935            0.6321            0.8647            0.9933
1.5                      0.5276            0.7769            0.9502            0.9994
2.0                      0.6321            0.8647            0.9817            1.0000
2.5                      0.7135            0.9179            0.9933            1.0000
3.0                      0.7769            0.9502            0.9975            1.0000
3.5                      0.8262            0.9698            0.9991            1.0000
4.0                      0.8647            0.9817            0.9997            1.0000
4.5                      0.8946            0.9889            0.9999            1.0000
5.0                      0.9179            0.9933            1.0000            1.0000
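
The full set of plotted points can be regenerated from the same formula (a sketch; the column layout is an assumption, but the domain, step size, and curves match the CSV details listed above):
import math

R_VALUES = [2.0, 1.0, 0.5, 0.2]                     # the four curves (T = 1/R)

print("r," + ",".join(f"R={R}" for R in R_VALUES))  # header row
for i in range(201):                                # 0.000 to 5.000 in 0.025 steps
    r = i * 0.025
    row = [r] + [1.0 - math.exp(-r / R) for R in R_VALUES]
    print(",".join(f"{v:.4f}" for v in row))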

Endpoint/method limit data

Per-endpoint rate limit data is documented separately.

How to reduce rate limit errors

  • Batch requests where supported, rather than calling many single-item endpoints.
  • Reduce polling: use longer polling intervals, or cache results client-side.
  • Spread traffic over time: avoid bursts; use a queue or scheduler.
  • Honor headers: use Retry-After and X-RateLimit-Reset to pace retries.
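
One way to act on "spread traffic over time" is a small client-side throttle sized below the documented bucket rate (a sketch; the 0.5 requests/second figure is only an example, not a recommended value):
import time

class ClientThrottle:
    """Spaces calls so the client stays under a target sustained rate (requests/second)."""
    def __init__(self, rate_per_sec: float):
        self.min_interval = 1.0 / rate_per_sec
        self.next_allowed = 0.0

    def wait(self) -> None:
        now = time.monotonic()
        if now < self.next_allowed:
            time.sleep(self.next_allowed - now)
        self.next_allowed = max(now, self.next_allowed) + self.min_interval

throttle = ClientThrottle(rate_per_sec=0.5)          # at most one call every 2 seconds
for _ in range(3):
    throttle.wait()
    # ... make one API call here ...
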
If you need higher limits for legitimate production usage, contact support with the endpoint(s), your expected call rate, and your account details.