Developer documentation

A direct HTTP API at api.perchy.ai. OpenAI-compatible request shape with Perchy lane controls.

Overview

What Perchy gives you.

Perchy is a routing layer in front of a marketplace of GPU hosts. The HTTP API mirrors the OpenAI chat-completions shape, so most existing client code works after changing the base URL and key. The main Perchy-specific addition is the optional lane object on each request, which controls how long a clear position is held for your app. An optional metadata object for free-form labels is also accepted.

  • OpenAI-compatible chat, streaming, and tool-calling shapes.
  • Lane reservations for steadier first-token latency under load.
  • Per-second billing on lanes; per-token billing on traditional models.
  • Signed webhooks for payment, allowance, and usage events.

Base URL

One host, versioned paths.

All Perchy requests target a single base URL. The path prefix is versioned so that breaking changes don't affect existing integrations.

https://api.perchy.ai/v1
  • Use HTTPS only. Plaintext requests are rejected at the edge.
  • POST /v1/chat/completions, GET /v1/models, GET /v1/usage, and GET /v1/lanes are stable.
  • We never break a v1 endpoint without a sunset notice on the changelog.

Authentication

Bearer keys, server-side only.

Each request must include a Perchy API key in the Authorization header. Keys are created in the console under Settings → API keys and are shown exactly once.

HTTP header
Authorization: Bearer pk_live_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
  • Use a separate key per app or environment so you can rotate one without downtime.
  • Keys are stored as one-way hashes — we cannot recover a lost key, only re-issue.
  • Never ship keys to a browser or mobile bundle. Use a server, an edge function, or Cloudflare Workers as a proxy.
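
As a rough sketch of the server-side pattern described above, the helper below attaches the key to a request using only Python's standard library. The name build_request is illustrative and not part of the API; reading the key from a PERCHY_API_KEY environment variable follows the Quickstart convention.

```python
import json
import urllib.request
from typing import Optional

BASE_URL = "https://api.perchy.ai/v1"


def build_request(path: str, api_key: str, payload: Optional[dict] = None) -> urllib.request.Request:
    """Build an authenticated request. Nothing is sent until urlopen() is called."""
    if not api_key:
        raise ValueError("API key is empty; load it from server-side configuration")
    headers = {"Authorization": f"Bearer {api_key}"}
    data = None
    if payload is not None:
        headers["Content-Type"] = "application/json"
        data = json.dumps(payload).encode("utf-8")
    # urllib issues a POST when a body is present and a GET otherwise.
    return urllib.request.Request(f"{BASE_URL}{path}", data=data, headers=headers)
```

On a server you would call build_request("/models", os.environ["PERCHY_API_KEY"]) and pass the result to urllib.request.urlopen. Because the key is a function argument, each app or environment can supply its own key without code changes.
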

Quickstart

Make your first call.

The fastest way to test the API is a single curl. Replace $PERCHY_API_KEY with a key from the console.

shell — first request
curl https://api.perchy.ai/v1/chat/completions \
  -H "Authorization: Bearer $PERCHY_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "anthropic/claude-sonnet-4.6",
    "messages": [
      { "role": "system", "content": "You are a concise assistant." },
      { "role": "user", "content": "Write the launch email for our beta." }
    ],
    "lane": { "mode": "clear", "idle_timeout_ms": 3000 }
  }'

From an application, send the same request with any HTTP client. No SDK is required.

javascript — fetch
// Node 20+, Bun, Deno, edge runtimes — no SDK required.
const response = await fetch("https://api.perchy.ai/v1/chat/completions", {
  method: "POST",
  headers: {
    "Authorization": `Bearer ${process.env.PERCHY_API_KEY}`,
    "Content-Type": "application/json",
  },
  body: JSON.stringify({
    model: "openai/gpt-5.4-mini",
    messages: [{ role: "user", content: "Summarize Q3 in two sentences." }],
    lane: { mode: "clear", idle_timeout_ms: 3000 },
  }),
});

if (!response.ok) {
  const error = await response.json();
  throw new Error(`Perchy ${response.status}: ${error.error?.message}`);
}

const payload = await response.json();
console.log(payload.choices[0].message.content);
python — httpx
# pip install httpx  (or use requests / aiohttp / urllib)
import os

import httpx

response = httpx.post(
    "https://api.perchy.ai/v1/chat/completions",
    headers={
        "Authorization": f"Bearer {os.environ['PERCHY_API_KEY']}",
        "Content-Type": "application/json",
    },
    json={
        "model": "qwen/qwen3.6-flash",
        "messages": [
            {"role": "user", "content": "Draft a release note for v1.4."}
        ],
        "lane": {"mode": "clear", "idle_timeout_ms": 3000},
    },
    timeout=60,
)

response.raise_for_status()
print(response.json()["choices"][0]["message"]["content"])

Chat completions

POST /v1/chat/completions

The core endpoint. The request shape is OpenAI-compatible, with two optional Perchy-specific fields: lane and metadata.

Request fields

Field | Type | Description
model (required) | string | Model id from /v1/models, e.g. anthropic/claude-sonnet-4.6.
messages (required) | array<Message> | Ordered chat history. Same shape as OpenAI chat-completions.
stream | boolean | When true, the response is a text/event-stream of delta chunks.
max_tokens | integer | Hard cap on output tokens.
temperature | number | 0–2. Defaults to the model default (typically 1).
top_p | number | Nucleus sampling cutoff.
tools | array<Tool> | Optional tool/function calling spec.
lane | object | Perchy-specific. Reserves a clear position. See Lane reservations.
metadata | object | Free-form labels echoed back in usage and webhooks.

Response fields

Field | Type | Description
id | string | Unique completion id, e.g. gen_01HZS...
model | string | Resolved model id (canonical form).
choices | array<Choice> | One choice per generation. choices[0].message.content has the text.
usage | Usage | prompt_tokens, completion_tokens, total_tokens, and cost_cents (the amount actually billed to the account for this call).
perchy_request_id | string | Tracing id (also surfaced via the X-Perchy-Request-Id response header). Include this when filing support tickets.

Perchy normalizes vendor-specific roles (e.g. system, developer) and handles tool-call shapes consistently across model families. Perchy-specific request fields (lane, metadata) are stripped before forwarding; usage.cost_cents and perchy_request_id are added to every response.
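
To make the response fields concrete, here is a small sketch that pulls the commonly used values out of a parsed response body. The function name summarize_completion and the sample values are made up for illustration, and treating cost_cents as a plain number is an assumption.

```python
from typing import Any, Dict


def summarize_completion(payload: Dict[str, Any]) -> Dict[str, Any]:
    """Extract the text, token count, billed cost, and tracing id from a response."""
    choice = payload["choices"][0]
    usage = payload.get("usage", {})
    return {
        "text": choice["message"]["content"],
        "total_tokens": usage.get("total_tokens"),
        # Assumption: cost_cents is numeric, so dividing by 100 gives dollars.
        "cost_usd": usage.get("cost_cents", 0) / 100,
        "request_id": payload.get("perchy_request_id"),
    }


# Hand-written sample shaped like the response fields listed above; values are invented.
sample = {
    "id": "gen_example",
    "model": "anthropic/claude-sonnet-4.6",
    "choices": [{"message": {"role": "assistant", "content": "Hello!"}}],
    "usage": {"prompt_tokens": 12, "completion_tokens": 3, "total_tokens": 15, "cost_cents": 2},
    "perchy_request_id": "req_example",
}
print(summarize_completion(sample))
```

Logging request_id next to each call makes it easy to quote the tracing id in a support ticket later.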

Streaming (SSE)

Server-sent events for token-by-token delivery.

Set stream: true and read the response as Server-Sent Events. Each event is a data: line containing a JSON delta. The stream ends with data: [DONE].

shell — sse stream
curl https://api.perchy.ai/v1/chat/completions \
  -H "Authorization: Bearer $PERCHY_API_KEY" \
  -H "Content-Type: application/json" \
  -H "Accept: text/event-stream" \
  -d '{
    "model": "deepseek/deepseek-v4-flash",
    "stream": true,
    "lane": { "mode": "clear", "idle_timeout_ms": 2500 },
    "messages": [
      { "role": "user", "content": "Stream a haiku about steady inference." }
    ]
  }'
javascript — readable stream
const response = await fetch("https://api.perchy.ai/v1/chat/completions", {
  method: "POST",
  headers: {
    "Authorization": `Bearer ${process.env.PERCHY_API_KEY}`,
    "Content-Type": "application/json",
    "Accept": "text/event-stream",
  },
  body: JSON.stringify({
    model: "google/gemini-3.5-flash",
    stream: true,
    lane: { mode: "clear", idle_timeout_ms: 2500 },
    messages: [{ role: "user", content: "Stream three product taglines." }],
  }),
});

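if (!response.ok || !response.body) {
  throw new Error(`Perchy ${response.status}: stream could not be opened`);
}

const reader = response.body.getReader();
const decoder = new TextDecoder();
let buffer = "";

// Labeled loop so the inner loop can stop reading once [DONE] arrives.
read: while (true) {
  const { value, done } = await reader.read();
  if (done) break;
  buffer += decoder.decode(value, { stream: true });

  // Events are separated by a blank line; a partial event stays in the buffer.
  let boundary;
  while ((boundary = buffer.indexOf("\n\n")) !== -1) {
    const event = buffer.slice(0, boundary);
    buffer = buffer.slice(boundary + 2);
    if (!event.startsWith("data: ")) continue; // skip anything that is not a data line
    const data = event.slice("data: ".length);
    if (data === "[DONE]") break read;
    const json = JSON.parse(data);
    process.stdout.write(json.choices[0]?.delta?.content ?? "");
  }
}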

When using lanes, the lane is held continuously across the stream. Closing the connection releases it — no explicit cleanup call is needed.
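
For Python clients, the same parsing logic can be sketched as a generator over decoded lines. This assumes each event carries a single data: line holding one JSON object, as described above; the function name iter_content_deltas and the sample stream are illustrative only.

```python
import json
from typing import Iterable, Iterator


def iter_content_deltas(lines: Iterable[str]) -> Iterator[str]:
    """Yield text deltas from decoded SSE lines, stopping at the [DONE] sentinel."""
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line.startswith("data:"):
            continue  # blank separators, comments, and other SSE fields
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            return
        chunk = json.loads(data)
        choices = chunk.get("choices") or []
        if not choices:
            continue
        text = choices[0].get("delta", {}).get("content")
        if text:
            yield text


# Invented sample stream; the line after [DONE] is never read.
sample_stream = [
    'data: {"choices":[{"delta":{"role":"assistant"}}]}\n',
    "\n",
    'data: {"choices":[{"delta":{"content":"Steady "}}]}\n',
    "\n",
    'data: {"choices":[{"delta":{"content":"lanes"}}]}\n',
    "\n",
    "data: [DONE]\n",
    'data: {"choices":[{"delta":{"content":"ignored"}}]}\n',
]
print("".join(iter_content_deltas(sample_stream)))  # prints "Steady lanes"
```

With httpx, the lines could come from response.iter_lines() inside a client.stream("POST", ...) block. Leaving that block closes the connection, which per the note above also releases the lane.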

Lane reservations

How clear-lane scheduling works.

In development. The lane object is accepted on every request and validated, but lane modes (clear / shared / burst) currently route identically while we finish the scheduler. Per-second lane billing is not active yet — only per-token billing applies for now. Existing integrations that pass lane will not break when scheduling lights up.

A "lane" is a temporary reservation on a host that's operating below its congestion point. While your lane is held, your requests are routed to the same warm position and aren't queued behind general traffic.

javascript — lane controls
// Reserve a clear lane that drops automatically after idle.
await fetch("https://api.perchy.ai/v1/chat/completions", {
  method: "POST",
  headers: {
    "Authorization": `Bearer ${process.env.PERCHY_API_KEY}`,
    "Content-Type": "application/json",
  },
  body: JSON.stringify({
    model: "anthropic/claude-haiku-4.5",
    messages: [{ role: "user", content: "Hi" }],
    // Lane controls — append to any chat/completions request.
    lane: {
      mode: "clear",            // "clear" | "shared" | "burst"
      idle_timeout_ms: 4000,    // release the lane after N ms idle
      max_hold_ms: 60_000,      // hard cap on how long to hold
      priority: "interactive",  // "interactive" | "batch"
    },
  }),
});

Modes

  • clear — reserve a clear position. Steadier first-token latency. Billed per second of presence.
  • shared — best-effort routing. No reservation. Token billing only.
  • burst — short-lived clear lane intended for one request, released immediately on completion.
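
A client can catch typos in lane settings before a request leaves the process. The sketch below builds a lane object from the fields shown in the lane controls example. The helper name make_lane and the client-side checks (positive millisecond values, known modes and priorities) are assumptions for illustration; the server's own validation rules may differ.

```python
from typing import Optional

LANE_MODES = ("clear", "shared", "burst")
LANE_PRIORITIES = ("interactive", "batch")


def make_lane(
    mode: str = "clear",
    idle_timeout_ms: Optional[int] = None,
    max_hold_ms: Optional[int] = None,
    priority: Optional[str] = None,
) -> dict:
    """Build a lane object, leaving out any field the caller did not set."""
    if mode not in LANE_MODES:
        raise ValueError(f"mode must be one of {LANE_MODES}, got {mode!r}")
    if priority is not None and priority not in LANE_PRIORITIES:
        raise ValueError(f"priority must be one of {LANE_PRIORITIES}, got {priority!r}")
    lane: dict = {"mode": mode}
    for name, value in (("idle_timeout_ms", idle_timeout_ms), ("max_hold_ms", max_hold_ms)):
        if value is None:
            continue
        if value <= 0:
            raise ValueError(f"{name} must be a positive number of milliseconds")
        lane[name] = value
    if priority is not None:
        lane["priority"] = priority
    return lane


payload = {
    "model": "anthropic/claude-haiku-4.5",
    "messages": [{"role": "user", "content": "Hi"}],
    "lane": make_lane("clear", idle_timeout_ms=4000, max_hold_ms=60_000, priority="interactive"),
}
```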

Pricing

Lane time is metered per second at the rate published on pricing. Token usage on top of a lane is still billed at the model's per-token rate. Idle holds are charged until idle_timeout_ms elapses, then the lane is released automatically.
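
As a worked example of how the two charges combine, the sketch below estimates an upper bound for one clear-lane session. The lane rate of 0.001 USD per second is a placeholder, not a published price, and the function name is hypothetical. The estimate assumes the lane idles for the full idle_timeout_ms after the last request and ignores any rounding the meter may apply.

```python
def estimate_clear_lane_usd(
    active_seconds: float,
    idle_timeout_ms: int,
    lane_usd_per_second: float,
    input_tokens: int,
    output_tokens: int,
    input_per_mtok_usd: float,
    output_per_mtok_usd: float,
) -> dict:
    """Upper-bound estimate: lane time (including the trailing idle hold) plus tokens."""
    # After the last request the lane is assumed to idle until the timeout elapses.
    held_seconds = active_seconds + idle_timeout_ms / 1000
    lane_usd = held_seconds * lane_usd_per_second
    token_usd = (
        input_tokens * input_per_mtok_usd + output_tokens * output_per_mtok_usd
    ) / 1_000_000
    return {
        "held_seconds": held_seconds,
        "lane_usd": round(lane_usd, 6),
        "token_usd": round(token_usd, 6),
        "total_usd": round(lane_usd + token_usd, 6),
    }


estimate = estimate_clear_lane_usd(
    active_seconds=20,
    idle_timeout_ms=4000,
    lane_usd_per_second=0.001,  # placeholder rate, see the pricing page for real values
    input_tokens=10_000,
    output_tokens=2_000,
    input_per_mtok_usd=2.7,
    output_per_mtok_usd=13.5,
)
print(estimate)
```

Here 24 held seconds cost about 0.024 USD of lane time and the tokens about 0.054 USD, so a shorter idle_timeout_ms mainly trims the lane portion.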

Models endpoint

GET /v1/models

List the models currently routable, with pricing and context window. Cache the response on your side; the catalog updates a few times a week.

shell — list models
curl https://api.perchy.ai/v1/models \
  -H "Authorization: Bearer $PERCHY_API_KEY"

# {
#   "object": "list",
#   "discount_percent": 10,
#   "data": [
#     {
#       "id": "anthropic/claude-sonnet-4.6",
#       "object": "model",
#       "context_window": "1M",
#       "lane": "General purpose",
#       "input_per_mtok_usd": 2.7,
#       "output_per_mtok_usd": 13.5,
#       "input_per_mtok_usd_list": 3.0,
#       "output_per_mtok_usd_list": 15.0,
#       "discount_percent": 10
#     },
#     ...
#   ]
# }
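
The pricing fields in a catalog entry can be used to estimate what a call will cost before sending it. The sketch below works on one entry from the sample response above. The function names are illustrative, and the relationship between list price and discounted price (list price reduced by discount_percent) is inferred from the sample values rather than documented.

```python
def estimate_request_usd(model: dict, input_tokens: int, output_tokens: int) -> float:
    """Estimate the token cost of one call from a /v1/models catalog entry."""
    return (
        input_tokens * model["input_per_mtok_usd"]
        + output_tokens * model["output_per_mtok_usd"]
    ) / 1_000_000


def apply_discount(list_price_usd: float, discount_percent: float) -> float:
    """Assumed relationship between the *_list fields and the discounted fields."""
    return round(list_price_usd * (100 - discount_percent) / 100, 4)


# Entry copied from the sample response above.
sonnet = {
    "id": "anthropic/claude-sonnet-4.6",
    "input_per_mtok_usd": 2.7,
    "output_per_mtok_usd": 13.5,
    "input_per_mtok_usd_list": 3.0,
    "output_per_mtok_usd_list": 15.0,
    "discount_percent": 10,
}
print(estimate_request_usd(sonnet, input_tokens=1_000, output_tokens=500))
```

For 1,000 input and 500 output tokens this comes to roughly 0.00945 USD; the usage.cost_cents field on the actual response remains the authoritative figure.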

Usage endpoint

GET /v1/usage

Aggregated usage and remaining balance for the calling key's account. Pass ?days=N (1–90, defaults to 30) to change the window. Costs are billed in micro-USD and surfaced here as USD strings for easy display.

shell — usage summary
curl "https://api.perchy.ai/v1/usage?days=30" \
  -H "Authorization: Bearer $PERCHY_API_KEY"

# {
#   "object": "usage.summary",
#   "range": { "days": 30, "since": "2026-04-25T00:00:00.000Z" },
#   "totals": {
#     "input_tokens": 184273,
#     "output_tokens": 41902,
#     "cost_usd": "1.4382",
#     "request_count": 217
#   },
#   "balance_cents": 8562,
#   "by_model": [
#     {
#       "model": "anthropic/claude-sonnet-4.6",
#       "input_tokens": 92110,
#       "output_tokens": 21043,
#       "cost_usd": "0.5917",
#       "request_count": 118
#     }
#   ]
# }

Want a real-time, per-request signal? Every chat completion response includes a usage.cost_cents field with the exact amount billed for that call.
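
Because cost_usd arrives as a string, parsing it with Decimal avoids float rounding when you aggregate. The sketch below condenses a usage summary into a few display values; summarize_usage is a hypothetical helper, and the input is trimmed from the sample response above.

```python
from decimal import Decimal


def summarize_usage(summary: dict) -> dict:
    """Condense a usage.summary body into spend, average cost, balance, and top model."""
    totals = summary["totals"]
    cost = Decimal(totals["cost_usd"])
    requests = totals["request_count"]
    by_model = summary.get("by_model", [])
    top = max(by_model, key=lambda m: Decimal(m["cost_usd"])) if by_model else None
    return {
        "cost_usd": cost,
        "avg_cost_per_request_usd": cost / requests if requests else Decimal(0),
        "balance_usd": Decimal(summary["balance_cents"]) / 100,
        "top_model": top["model"] if top else None,
    }


# Trimmed from the sample response above.
sample_usage = {
    "object": "usage.summary",
    "totals": {"input_tokens": 184273, "output_tokens": 41902, "cost_usd": "1.4382", "request_count": 217},
    "balance_cents": 8562,
    "by_model": [
        {"model": "anthropic/claude-sonnet-4.6", "cost_usd": "0.5917", "request_count": 118},
    ],
}
report = summarize_usage(sample_usage)
print(report["cost_usd"], report["balance_usd"], report["top_model"])
```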

Errors

Standard error envelope.

Every non-2xx response uses the same JSON shape. Inspect error.code for programmatic handling and surface error.message to humans.

json — error body
// 429 — rate limit
{
  "error": {
    "type": "rate_limit_error",
    "message": "You hit the per-key request rate limit. Retry after 1.2s.",
    "code": "rate_limited",
    "retry_after_ms": 1200,
    "request_id": "req_01HZS9C1V7NBYDZTQB39P4WEJ8"
  }
}
  • 400 — request was malformed (missing field, invalid model id).
  • 401 — missing, invalid, or revoked API key.
  • 402 — usage allowance exhausted; extend it at billing.
  • 429 — rate-limited. Retry after retry_after_ms.
  • 5xx — temporary upstream issue. Safe to retry with backoff.
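
The status codes above suggest a simple retry policy: honor retry_after_ms on a 429, back off on 5xx, and fail fast on everything else. A minimal sketch follows; the function name, base delay, cap, and jitter strategy are illustrative choices, not values prescribed by the API.

```python
import random
from typing import Optional


def retry_delay_ms(
    status: int,
    body: Optional[dict],
    attempt: int,
    base_ms: int = 500,
    cap_ms: int = 30_000,
) -> Optional[int]:
    """Return how long to wait before retrying, or None if the error is not retryable."""
    if status == 429:
        hinted = ((body or {}).get("error") or {}).get("retry_after_ms")
        if isinstance(hinted, (int, float)) and hinted >= 0:
            return int(hinted)  # the server's hint takes priority
    if status == 429 or 500 <= status <= 599:
        # Exponential backoff with jitter; attempt counts from 0.
        backoff = min(cap_ms, base_ms * 2 ** attempt)
        return random.randint(backoff // 2, backoff)
    return None  # 400, 401, 402, and other client errors need a fix, not a retry
```

A caller would loop over attempts, sleep for the returned delay, and stop as soon as the function returns None or an attempt limit is reached.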

Rate limits

Per-key, with burst headroom.

In development. Hard rate limiting is not enforced yet — the only gate today is the per-account balance check. The X-RateLimit-* headers below are reserved and will start emitting once the limiter is live; design against the table now so your client retries when 429s appear.

Each API key has an independent budget. Limits scale with your active plan:

Plan | Requests / min | Concurrent lanes | Tokens / min
Free dev | 60 | 1 | 120k
Plus | 300 | 1 | 600k
Pro | 1,200 | 2 | 2.4M
Premium | 4,800 | 4 | 9.6M

Every response includes X-RateLimit-Remaining-Requests, X-RateLimit-Remaining-Tokens, and X-RateLimit-Reset-Ms headers. Need higher limits? Email support@perchy.ai.
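
Since the headers are reserved but not emitted yet, a client should tolerate their absence. The sketch below reads them defensively. It assumes the values will be plain non-negative integers, which the source does not confirm, and the helper names are hypothetical.

```python
from typing import Mapping, Optional


def parse_rate_limit_headers(headers: Mapping[str, str]) -> dict:
    """Read the X-RateLimit-* headers, returning None for any that are missing."""
    lowered = {name.lower(): value for name, value in headers.items()}

    def as_int(name: str) -> Optional[int]:
        value = lowered.get(name.lower(), "").strip()
        # Assumption: values are plain non-negative integers.
        return int(value) if value.isdigit() else None

    return {
        "remaining_requests": as_int("X-RateLimit-Remaining-Requests"),
        "remaining_tokens": as_int("X-RateLimit-Remaining-Tokens"),
        "reset_ms": as_int("X-RateLimit-Reset-Ms"),
    }


def should_pause(limits: dict, min_requests: int = 1) -> bool:
    """True when the request budget is known and has dropped below min_requests."""
    remaining = limits["remaining_requests"]
    return remaining is not None and remaining < min_requests
```

When should_pause returns True, waiting for reset_ms (if present) before the next call avoids an avoidable 429.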

Billing

Two pricing modes.

Perchy supports one-time concurrency plan purchases and traditional token pricing. Plans and usage extensions use the embedded Stripe Payment Element with signed webhook fulfillment.

  • Concurrency plans are one-time purchases — no recurring charges, no auto-renew.
  • Token usage draws down your usage allowance (see GET /v1/usage).
  • Payment history lives in Stripe; payment methods are managed in-console.
  • Refund eligibility is described in our refund policy.

Webhooks

Signed events for billing and usage.

In development. Outbound webhooks are not firing yet. The signature format and event payloads below are the shape we're committing to; webhook registration in the console and event delivery will turn on together. Until then, poll GET /v1/usage if you need allowance signals.

Configure a webhook URL in Settings → Account. Each event is signed with HMAC-SHA256; verify the Perchy-Signature header before trusting the body.

http — webhook delivery
POST https://your-app.example.com/webhooks/perchy
Content-Type: application/json
Perchy-Signature: t=1735689600,v1=2c9b8...
Perchy-Event-Id: evt_01HZSAGNZ2...

{
  "id": "evt_01HZSAGNZ2",
  "type": "payment.succeeded",
  "created": "2026-05-08T15:21:09Z",
  "data": {
    "payment_id": "pay_01HZSAFXR1",
    "amount_cents": 2000,
    "currency": "usd",
    "kind": "topup"
  }
}

Common event types: payment.succeeded, payment.refunded, allowance.low, usage.threshold_reached, api_key.created, api_key.revoked.
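
Verification could look roughly like the sketch below. The header layout (t=...,v1=...) comes from the delivery example above, but the signed payload is an ASSUMPTION: this code signs the timestamp, a dot, and the raw body, a scheme used by some other webhook providers. The hex encoding of the digest, the 300-second tolerance, and the function names are also illustrative. Confirm the real scheme before relying on this.

```python
import hashlib
import hmac
import time
from typing import Optional


def parse_signature_header(header: str) -> dict:
    """Split 't=...,v1=...' into a dict of its key/value parts."""
    pairs = (item.split("=", 1) for item in header.split(",") if "=" in item)
    return {key.strip(): value.strip() for key, value in pairs}


def verify_signature(
    raw_body: bytes,
    header: str,
    secret: str,
    tolerance_s: int = 300,
    now: Optional[float] = None,
) -> bool:
    """Check the HMAC-SHA256 signature and reject timestamps outside the tolerance."""
    fields = parse_signature_header(header)
    timestamp, received = fields.get("t"), fields.get("v1")
    if not timestamp or not received or not timestamp.isdigit():
        return False
    current = time.time() if now is None else now
    if abs(current - int(timestamp)) > tolerance_s:
        return False  # too old or too far in the future: possible replay
    # ASSUMPTION: the signed payload is "<t>.<raw body>"; not confirmed by the docs.
    signed = timestamp.encode("ascii") + b"." + raw_body
    expected = hmac.new(secret.encode("utf-8"), signed, hashlib.sha256).hexdigest()
    # Constant-time comparison avoids leaking how many characters matched.
    return hmac.compare_digest(expected.encode("ascii"), received.encode("utf-8"))
```

Always pass the raw request bytes rather than a re-serialized JSON object, since any change in whitespace or key order changes the digest.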

Compute hosts

Bring spare compute online.

In development. The perchy host CLI and the marketplace it connects to are not released yet — running these commands today will not work. Stripe Connect onboarding at Settings → Payouts is live so you can prepare a payout account in advance; host enrollment opens with the CLI.

Hosts connect outbound to Perchy, advertise available positions, and earn when applications occupy those positions. No public address is required, and no inbound port is opened on your network.

shell — perchy host CLI
# Bring spare compute online (outbound only — no public address).
npx perchy host login
npx perchy host start --earn --max-vram 80%

# Pause earning whenever you want the GPU back.
npx perchy host pause

# Inspect lane occupancy and earnings.
npx perchy host status --json
  • Outbound-only WebSocket. Plays nicely with NAT, residential ISPs, and most firewalls.
  • You set the cap. --max-vram 80% reserves headroom for your own work.
  • Earnings are paid via Stripe Connect (configure payouts). Payout terms are set out in our host terms.

Security

How we handle requests and data.

  • TLS 1.3 termination at the edge. HSTS preloaded.
  • Prompts and completions are not used to train models. See privacy policy.
  • Logs are retained for 30 days for debugging. Zero-retention mode (Premium and up) is in development and not toggleable yet; request it via support@perchy.ai in the meantime.
  • Read the full posture on the security overview.