A direct HTTP API at api.perchy.ai. OpenAI-compatible request shape with Perchy lane controls.
Overview
What Perchy gives you.
Perchy is a routing layer in front of a marketplace of GPU hosts. The HTTP API mirrors the OpenAI chat-completions shape, so most existing client code works by changing the base URL and key. The main Perchy-specific addition is the optional lane object on each request, which controls how long a clear position is held for your app. An optional metadata object for free-form labels is also accepted.
OpenAI-compatible chat, streaming, and tool-calling shapes.
Lane reservations for steadier first-token latency under load.
Per-second billing on lanes; per-token billing on traditional models.
Signed webhooks for payment, allowance, and usage events.
Base URL
One host, versioned paths.
All Perchy requests target a single base URL. The path prefix is versioned so that breaking changes don't affect existing integrations.
https://api.perchy.ai/v1
Use HTTPS only. Plaintext requests are rejected at the edge.
POST /v1/chat/completions, GET /v1/models, GET /v1/usage, and GET /v1/lanes are stable.
We never break a v1 endpoint without a sunset notice on the changelog.
Authentication
Bearer keys, server-side only.
Each request must include a Perchy API key in the Authorization header. Keys are created in the console under Settings → API keys and are shown exactly once.
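A minimal sketch of attaching the key on the server. The helper name buildPerchyHeaders is hypothetical; the PERCHY_API_KEY environment variable follows the lane example elsewhere on this page.

```javascript
// Hypothetical helper: builds the headers for a Perchy request.
// Throws early so a missing key fails loudly instead of producing a 401.
function buildPerchyHeaders(apiKey) {
  if (typeof apiKey !== "string" || apiKey.trim() === "") {
    throw new Error("Missing Perchy API key; set PERCHY_API_KEY on the server.");
  }
  return {
    Authorization: `Bearer ${apiKey.trim()}`,
    "Content-Type": "application/json",
  };
}

// Intended use (server-side only; never ship the key to a browser):
async function listModels() {
  const headers = buildPerchyHeaders(process.env.PERCHY_API_KEY);
  const res = await fetch("https://api.perchy.ai/v1/models", { headers });
  return res.json();
}
```

Because keys are shown exactly once, loading them from an environment variable or secret store keeps them out of source control.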
Chat completions
POST /v1/chat/completions
The core endpoint. The request shape is OpenAI-compatible, with two optional Perchy-specific fields: lane and metadata.
Request fields
Field | Type | Description
model (required) | string | Model id from /v1/models, e.g. anthropic/claude-sonnet-4-6.
messages (required) | array<Message> | Ordered chat history. Same shape as OpenAI chat-completions.
stream | boolean | When true, response is text/event-stream of delta chunks.
max_tokens | integer | Hard cap on output tokens.
temperature | number | 0–2. Defaults to model default (typically 1).
top_p | number | Nucleus sampling cutoff.
tools | array<Tool> | Optional tool/function calling spec.
lane | object | Perchy-specific. Reserves a clear position. See Lane reservations.
metadata | object | Free-form labels echoed back in usage and webhooks.
Response fields
Field | Type | Description
id | string | Unique completion id, e.g. gen_01HZS...
model | string | Resolved model id (canonical form).
choices | array<Choice> | One choice per generation. choices[0].message.content has the text.
usage | Usage | prompt_tokens, completion_tokens, total_tokens, and cost_cents (the amount actually billed to the account for this call).
perchy_request_id | string | Tracing id (also surfaced via the X-Perchy-Request-Id response header). Include this when filing support tickets.
Perchy normalizes vendor-specific roles (e.g. system, developer) and handles tool-call shapes consistently across model families. Perchy-specific request fields (lane, metadata) are stripped before forwarding; usage.cost_cents and perchy_request_id are added to every response.
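As an illustration of the response fields above, the following sketch pulls out the values most integrations log. The function name summarizeCompletion and the sample values are hypothetical; only the field names come from the table.

```javascript
// Hypothetical helper: extracts the commonly used fields from a parsed
// chat-completions response body. Missing fields come back as null.
function summarizeCompletion(body) {
  return {
    text: body?.choices?.[0]?.message?.content ?? null,
    totalTokens: body?.usage?.total_tokens ?? null,
    costCents: body?.usage?.cost_cents ?? null,
    requestId: body?.perchy_request_id ?? null, // quote this in support tickets
  };
}

// Illustrative body only; real ids, token counts, and costs will differ.
const sampleBody = {
  id: "gen_example",
  model: "anthropic/claude-haiku-4.5",
  choices: [{ message: { role: "assistant", content: "Hello!" } }],
  usage: { prompt_tokens: 3, completion_tokens: 2, total_tokens: 5, cost_cents: 1 },
  perchy_request_id: "req_example",
};

console.log(summarizeCompletion(sampleBody).text); // Hello!
```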
Streaming (SSE)
Server-sent events for token-by-token delivery.
Set stream: true and read the response as Server-Sent Events. Each event is a data: line containing a JSON delta. The stream ends with data: [DONE].
When using lanes, the lane is held continuously across the stream. Closing the connection releases it — no explicit cleanup call is needed.
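A minimal sketch of consuming the stream described above. It assumes Node 18+ (where response.body from fetch can be iterated with for await) and assumes each delta follows the OpenAI chunk shape choices[0].delta.content; both function names are hypothetical.

```javascript
// Splits buffered SSE text into complete events. Returns the parsed JSON
// deltas, whether data: [DONE] was seen, and any trailing partial event
// that should stay in the buffer until more bytes arrive.
function parseSseBuffer(buffer) {
  const deltas = [];
  let done = false;
  const parts = buffer.replace(/\r\n/g, "\n").split("\n\n");
  const rest = parts.pop(); // the last piece may be an incomplete event
  for (const part of parts) {
    for (const line of part.split("\n")) {
      if (!line.startsWith("data:")) continue; // skip comments and other fields
      const payload = line.slice(5).trim();
      if (payload === "[DONE]") {
        done = true;
      } else if (payload) {
        deltas.push(JSON.parse(payload));
      }
    }
  }
  return { deltas, done, rest };
}

// Reads a streaming fetch response and returns the concatenated text.
// onText, if given, is called with each new piece as it arrives.
async function readStream(response, onText) {
  const decoder = new TextDecoder();
  let buffer = "";
  let text = "";
  for await (const chunk of response.body) {
    buffer += decoder.decode(chunk, { stream: true });
    const { deltas, done, rest } = parseSseBuffer(buffer);
    buffer = rest;
    for (const delta of deltas) {
      const piece = delta.choices?.[0]?.delta?.content ?? "";
      text += piece;
      if (piece && onText) onText(piece);
    }
    if (done) break; // leaving the loop stops reading from the stream
  }
  return text;
}
```

Keeping the parser separate from the network loop makes it easy to unit-test against recorded streams.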
Lane reservations
How clear-lane scheduling works.
In development. The lane object is accepted on every request and validated, but lane modes (clear / shared / burst) currently route identically while we finish the scheduler. Per-second lane billing is not active yet — only per-token billing applies for now. Existing integrations that pass lane will not break when scheduling lights up.
A "lane" is a temporary reservation on a host that's operating below its congestion point. While your lane is held, your requests are routed to the same warm position and aren't queued behind general traffic.
javascript — lane controls
// Reserve a clear lane that drops automatically after idle.
await fetch("https://api.perchy.ai/v1/chat/completions", {
  method: "POST",
  headers: {
    "Authorization": `Bearer ${process.env.PERCHY_API_KEY}`,
    "Content-Type": "application/json",
  },
  body: JSON.stringify({
    model: "anthropic/claude-haiku-4.5",
    messages: [{ role: "user", content: "Hi" }],
    // Lane controls: append to any chat/completions request.
    lane: {
      mode: "clear",           // "clear" | "shared" | "burst"
      idle_timeout_ms: 4000,   // release the lane after N ms idle
      max_hold_ms: 60_000,     // hard cap on how long to hold
      priority: "interactive", // "interactive" | "batch"
    },
  }),
});
Modes
clear — reserve a clear position. Steadier first-token latency. Billed per second of presence.
shared — best-effort routing. No reservation. Token billing only.
burst — short-lived clear lane intended for one request, released immediately on completion.
Pricing
Lane time is metered per second at the rate published on the pricing page. Token usage on top of a lane is still billed at the model's per-token rate. Idle holds are charged until idle_timeout_ms elapses, then the lane is released automatically.
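To make the lane arithmetic concrete, here is a rough estimator. It is a sketch under stated assumptions, not the billing implementation: the function name, the rate of 500 micro-USD per second, and the rule that partial seconds round up are all hypothetical. Token charges are billed separately and are not included.

```javascript
// Hypothetical estimator for lane time only (token charges come on top).
function estimateLaneCharge({ activeMs, idleTimeoutMs, maxHoldMs, rateMicroUsdPerSecond }) {
  // The lane is held while requests run, plus the idle tail before the
  // automatic release, but never longer than the hard cap.
  const heldMs = Math.min(activeMs + idleTimeoutMs, maxHoldMs);
  // Assumption: partial seconds round up to a whole billed second.
  const billedSeconds = Math.ceil(heldMs / 1000);
  return { heldMs, billedSeconds, laneMicroUsd: billedSeconds * rateMicroUsdPerSecond };
}

// 2.5 s of activity followed by a 4 s idle timeout, at a made-up rate.
const example = estimateLaneCharge({
  activeMs: 2500,
  idleTimeoutMs: 4000,
  maxHoldMs: 60_000,
  rateMicroUsdPerSecond: 500,
});
console.log(example); // { heldMs: 6500, billedSeconds: 7, laneMicroUsd: 3500 }
```

Under these assumptions, a shorter idle_timeout_ms directly reduces the idle tail you pay for, at the price of re-acquiring a lane more often.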
Models endpoint
GET /v1/models
List the models currently routable, with pricing and context window. Cache the response on your side; the catalog updates a few times a week.
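One way to follow the caching advice is a small time-based cache around the request. This is a sketch: createCachedLoader and fetchModels are hypothetical names, the six-hour TTL is an arbitrary choice, and the response body is returned as parsed JSON without assuming its shape.

```javascript
// Wraps an async loader with a time-based cache. Concurrent callers share
// one in-flight request; a failed load is not cached.
function createCachedLoader(load, ttlMs, now = Date.now) {
  let pending = null;
  let expiresAt = 0;
  return function get() {
    if (!pending || now() >= expiresAt) {
      expiresAt = now() + ttlMs;
      const attempt = load().catch((err) => {
        if (pending === attempt) pending = null; // retry on the next call
        throw err;
      });
      pending = attempt;
    }
    return pending;
  };
}

// Hypothetical loader for the catalog.
async function fetchModels() {
  const res = await fetch("https://api.perchy.ai/v1/models", {
    headers: { Authorization: `Bearer ${process.env.PERCHY_API_KEY}` },
  });
  if (!res.ok) throw new Error(`GET /v1/models failed with HTTP ${res.status}`);
  return res.json();
}

// The catalog changes a few times a week, so a TTL of a few hours is plenty.
const getModels = createCachedLoader(fetchModels, 6 * 60 * 60 * 1000);
```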
Usage endpoint
GET /v1/usage
Aggregated usage and remaining balance for the calling key's account. Pass ?days=N (1–90, defaults to 30) to change the window. Costs are metered in micro-USD and returned here as USD strings for easy display.
Want a real-time, per-request signal? Every chat completion response includes a usage.cost_cents field with the exact amount billed for that call.
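A short sketch of querying the usage window. The helper names are hypothetical, and the client-side check simply mirrors the documented 1–90 range so that an out-of-range value fails before a request is sent.

```javascript
const PERCHY_BASE_URL = "https://api.perchy.ai/v1";

// Builds the GET /v1/usage URL, validating the documented 1-90 day range.
function buildUsageUrl(days = 30) {
  if (!Number.isInteger(days) || days < 1 || days > 90) {
    throw new RangeError("days must be an integer from 1 to 90");
  }
  const url = new URL(`${PERCHY_BASE_URL}/usage`);
  url.searchParams.set("days", String(days));
  return url.toString();
}

// Hypothetical wrapper; returns the parsed body without assuming its shape.
async function fetchUsage(apiKey, days = 30) {
  const res = await fetch(buildUsageUrl(days), {
    headers: { Authorization: `Bearer ${apiKey}` },
  });
  if (!res.ok) throw new Error(`GET /v1/usage failed with HTTP ${res.status}`);
  return res.json();
}

console.log(buildUsageUrl(7)); // https://api.perchy.ai/v1/usage?days=7
```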
Errors
Standard error envelope.
Every non-2xx response uses the same JSON shape. Inspect error.code for programmatic handling and surface error.message to humans.
json — error body
// 429 (rate limit)
{
  "error": {
    "type": "rate_limit_error",
    "message": "You hit the per-key request rate limit. Retry after 1.2s.",
    "code": "rate_limited",
    "retry_after_ms": 1200,
    "request_id": "req_01HZS9C1V7NBYDZTQB39P4WEJ8"
  }
}
400 — request was malformed (missing field, invalid model id).
401 — missing, invalid, or revoked API key.
402 — usage allowance exhausted; extend it from the billing page.
429 — rate-limited. Retry after retry_after_ms.
5xx — temporary upstream issue. Safe to retry with backoff.
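The status codes above suggest a simple retry policy: honor retry_after_ms on 429, back off exponentially on 5xx, and fail fast on everything else. The sketch below is one possible client-side policy, not a Perchy-provided one; the function names, the 500 ms base delay, the 30 s cap, and the four-attempt limit are all assumptions.

```javascript
// Returns how long to wait before retrying, or null if a retry will not help.
function retryDelayMs(status, body, attempt, baseMs = 500, capMs = 30_000) {
  const backoff = Math.min(capMs, baseMs * 2 ** attempt);
  if (status === 429) {
    const hinted = body?.error?.retry_after_ms;
    return Number.isFinite(hinted) && hinted >= 0 ? hinted : backoff;
  }
  if (status >= 500 && status <= 599) return backoff;
  return null; // 400, 401, 402: fix the request, key, or allowance instead
}

// Calls doRequest (a function returning a fetch Response promise) until it
// succeeds, the error is not retryable, or maxAttempts is reached.
async function requestWithRetry(doRequest, maxAttempts = 4) {
  for (let attempt = 0; ; attempt += 1) {
    const res = await doRequest();
    if (res.ok) return res.json();
    const body = await res.json().catch(() => null); // tolerate non-JSON bodies
    const delay = retryDelayMs(res.status, body, attempt);
    if (delay === null || attempt + 1 >= maxAttempts) {
      const err = new Error(body?.error?.message ?? `HTTP ${res.status}`);
      err.code = body?.error?.code; // use for programmatic handling
      err.status = res.status;
      throw err;
    }
    await new Promise((resolve) => setTimeout(resolve, delay));
  }
}
```

Adding random jitter to the backoff is a common refinement when many clients may retry at once.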
Rate limits
Per-key, with burst headroom.
In development. Hard rate limiting is not enforced yet — the only gate today is the per-account balance check. The X-RateLimit-* headers below are reserved and will start emitting once the limiter is live; design against the table now so your client retries when 429s appear.
Each API key has an independent budget. Limits scale with your active plan:
Plan | Requests / min | Concurrent lanes | Tokens / min
Free dev | 60 | 1 | 120k
Plus | 300 | 1 | 600k
Pro | 1,200 | 2 | 2.4M
Premium | 4,800 | 4 | 9.6M
Once the limiter is live, every response will include X-RateLimit-Remaining-Requests, X-RateLimit-Remaining-Tokens, and X-RateLimit-Reset-Ms headers. Need higher limits? Email support@perchy.ai.
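Since the X-RateLimit-* headers may be absent while the limiter is in development, a client can read them defensively. In this sketch the helper name is hypothetical and the header values are assumed to be plain numbers; absent or non-numeric values come back as null.

```javascript
// Reads the rate-limit headers from anything with a get(name) method,
// such as the headers object on a fetch Response.
function readRateLimit(headers) {
  const num = (name) => {
    const raw = headers.get(name);
    if (raw == null || String(raw).trim() === "") return null;
    const n = Number(raw);
    return Number.isFinite(n) ? n : null;
  };
  return {
    remainingRequests: num("X-RateLimit-Remaining-Requests"),
    remainingTokens: num("X-RateLimit-Remaining-Tokens"),
    resetMs: num("X-RateLimit-Reset-Ms"),
  };
}

// Intended use: const limits = readRateLimit(res.headers);
// Slow down proactively when limits.remainingRequests gets close to zero.
```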
Billing
Two pricing modes.
Perchy supports one-time concurrency plan purchases and traditional token pricing. Plans and usage extensions use the embedded Stripe Payment Element with signed webhook fulfillment.
Concurrency plans are one-time purchases — no recurring charges, no auto-renew.
Token usage draws down your usage allowance (see GET /v1/usage).
Payment history lives in Stripe; payment methods are managed in-console.
Refund eligibility is described in our refund policy.
Webhooks
Signed events for billing and usage.
In development. Outbound webhooks are not firing yet. The signature format and event payloads below are the shape we're committing to; webhook registration in the console and event delivery will turn on together. Until then, poll GET /v1/usage if you need allowance signals.
Configure a webhook URL in Settings → Account. Each event is signed with HMAC-SHA256; verify the Perchy-Signature header before trusting the body.
Common event types: payment.succeeded, payment.refunded, allowance.low, usage.threshold_reached, api_key.created, api_key.revoked.
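A sketch of signature verification in Node. The exact signature format is not spelled out here, so this example assumes the Perchy-Signature header carries a hex-encoded HMAC-SHA256 of the raw request body, keyed with your webhook secret. Treat that encoding, and the function name, as assumptions to confirm once webhooks ship.

```javascript
// Verifies a webhook body against its signature header.
// ASSUMPTION: the header is hex(HMAC-SHA256(secret, rawBody)).
async function verifyPerchySignature(rawBody, signatureHeader, secret) {
  // Dynamic import keeps the snippet usable from CommonJS and ES modules.
  const { createHmac, timingSafeEqual } = await import("node:crypto");
  const expected = createHmac("sha256", secret).update(rawBody, "utf8").digest();
  const received = Buffer.from(String(signatureHeader ?? ""), "hex");
  // timingSafeEqual throws on length mismatch, so check lengths first.
  return received.length === expected.length && timingSafeEqual(received, expected);
}

// Intended use inside a handler: verify against the raw bytes as received,
// before JSON.parse, because re-serializing the body changes the digest.
```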
Compute hosts
Bring spare compute online.
In development. The perchy host CLI and the marketplace it connects to are not released yet — running these commands today will not work. Stripe Connect onboarding at Settings → Payouts is live so you can prepare a payout account in advance; host enrollment opens with the CLI.
Hosts connect outbound to Perchy, advertise available positions, and earn when applications occupy those positions. No public address required and no inbound port opens on your network.
shell — perchy host CLI
# Bring spare compute online (outbound only — no public address).
npx perchy host login
npx perchy host start --earn --max-vram 80%
# Pause earning whenever you want the GPU back.
npx perchy host pause
# Inspect lane occupancy and earnings.
npx perchy host status --json
Outbound-only WebSocket. Plays nicely with NAT, residential ISPs, and most firewalls.
You set the cap. --max-vram 80% limits Perchy to 80% of VRAM and leaves the rest as headroom for your own work.
Prompts and completions are not used to train models. See privacy policy.
Logs are retained for 30 days for debugging. In development: zero-retention mode (Premium and up) is not toggleable yet; request it via support@perchy.ai in the meantime.