A workload review for LLM platform teams

See where your LLM serving breaks before your customers do.

Send 24 hours of request logs. Receive a routing and admission policy within 48 hours. Free for the first 3 qualified reviews.

Start a review Read the research

How we know

We started by measuring one boundary in full.

One H100 SXM5. Llama 3.1 70B FP8. vLLM 0.19.1 with FlashAttention v3 and FP8 KV cache. Five rounds. Bootstrap confidence intervals. Source label: measured_by_axcob.

Throughput, concurrency 4 to 16

83.05 → 83.20

tokens per second

A 4x concurrency increase moved throughput by 0.18 percent. Adding load did not buy throughput.

P99 TTFT, concurrency 4 to 16

4.6s → 40.1s

time to first token, P99

The same 4x concurrency increase moved tail latency by 772 percent. The boundary lives in the tail.

Mixed workload penalty (env-specific)

4.3x

short-request P99 increase (Spheron run)

When 10 percent of traffic carried long-context input, the other 90 percent of users saw P99 rise by 4.3x on Spheron. A Lambda H100 SXM5 replay of the same protocol showed only a 1.08x degradation at 90/10 mix, so fairness behavior appears scheduler/runtime-sensitive, not universal. Customer-specific replay required.

Cross-provider update

The throughput plateau and P99 TTFT explosion above reproduced on a second H100 SXM5 provider (Lambda Cloud, fresh vLLM server) under the same protocol. The critical ctx 8K c=4/c=16 cells matched the Spheron baseline within 1.007x and 1.004x P99 respectively. The ctx 28K long-context series also matched within 0.95-1.00x. The boundary structure is not a single-provider artifact. Full cross-provider replication note & raw artifacts.

Average throughput is what your
dashboard shows. P99 is what your customer feels.

Most platform teams measure the first. Few measure the second. Almost no one measures what happens when the workload mix shifts mid-shift. That is the gap a workload review closes.

What you send us

One file. No prompts. No PII.

request_idany unique identifier. Hashed is fine.

timestampISO 8601 or Unix epoch. UTC preferred.

input_tokenstokenized input length. Estimate is fine.

output_tokenstokenized output length, or completion length.

statussuccess, error, timeout, or your equivalent.

latency_msend-to-end latency. TTFT and TPOT optional.

CSV or JSONL. 24 hours of representative traffic is the minimum. More is better. We do not retain prompt content. We do not require PII. Sample de-identified rows are welcome before you send the full log.

If you do not log token counts yet, send prompt length or anonymized samples. We can help estimate.

What you receive

Three artifacts. 48 hours.

Boundary match report

Where your measured workload lives inside the boundary corpus. Coverage status per request bucket: measured, rounded up for safety, outside measured range, or unsupported model and hardware.

Routing and admission policy

A JSON policy your platform team can replay. Interactive, async, and reject splits with input-size thresholds. Reviewer-locked, with source labels per rule.

SLO dry-run

The same log replayed against a customer-facing SLO profile (chat, RAG, batch summarization, or your own thresholds). Verdict distribution and a single slo_risk percent.

Why us

The toolkit you receive is the toolkit we built our research on.

BoundaryOS is an early workload-review toolkit built from our measured H100 research corpus. The measurement methodology, the reference corpus, and the simulator that produces your review are all published on GitHub under MIT and CC-BY-4.0. Your review is generated by the same code path you can read.

github.com/jacob-sunho-kim/llm-boundary-research

The honest part

Our reference corpus is one measured environment so far. If your GPU, model family, quantization, or serving stack diverges, the review will flag those cells as outside_measured_boundary. For these first 3 reviews we treat that as a measurement opportunity and discuss what additional measurement is feasible together. These 3 reviews are free because we are still learning what real workloads look like.

Start a review.

48-hour turnaround. Free for the first 3 qualified reviews.

Send your log

No prompt content required. Token counts or estimates are enough.

[email protected]

Throughput delta (+0.18 percent) and TTFT delta (+772 percent) are derived from measured values on the stated hardware and serving stack. Mixed-workload penalty (4.3x) is measured. Reference corpus source label: measured_by_axcob. These numbers describe one measured environment. Replay against your traffic is required before enforcement.