Most "build with AI" tutorials show you a toy example. A todo app, a weather widget, a chatbot that waves. They prove nothing — of course AI can build a todo app. The question is whether AI can build something real.
This post is a real walkthrough. The project: an internal tool for a product-ops team at a mid-market SaaS company to triage customer escalations. The client: an actual paying Sprintt client, details anonymized. The builder: one person, using Claude as the primary agent, over three working days. The outcome: a working system deployed to production and used daily.
Every prompt, every decision, every shortcut — as they actually happened. If you're wondering what the real shape of this work looks like, here it is.
The ask
The client's VP of Customer Success came to us with a specific problem: their support team received ~200 high-priority escalations per week from customer-facing teams. Each escalation was a loose-form message, often missing context. The on-call engineer had to triage — figure out if the issue was real, find the relevant customer record, find the relevant product area, assign priority, route to the right team.
Current state: the on-call engineer spent ~40% of their week on triage. That was ~15 hours/week of senior-engineer time spent on routing work, not solving problems.
Desired state: a system that reads the incoming escalation, pulls context, suggests a routing and priority, and presents it to the on-call engineer as a one-click approval with full supporting context.
Measured outcome: we agreed on a target of reducing triage time by 60%, measured over four weeks of use.
Budget: one week of Sprintt time. That was the real constraint. We had three working days to ship, plus buffer for review.
The architecture, decided in 30 minutes
On day one, I opened a Claude Code session and asked for an architecture proposal.
First prompt (paraphrased from the actual):
We need a triage assistant. Input: a text escalation from an internal tool. Output: for a human reviewer — suggested customer match, suggested product area, suggested priority, suggested routing, and a one-paragraph summary. Must integrate with: Linear (their tracker), Salesforce (customer records), and their internal docs. Budget: 3 days of build. Stack: they use Node/TypeScript and Vercel.
Propose an architecture. No code yet. 300 words max.
Response came back with a three-layer design:
- Ingest: webhook from their internal tool → serverless function → staging queue.
- Enrichment: pull customer record (Salesforce), pull related Linear tickets, pull any relevant internal doc snippets. Compose a prompt.
- Classify and summarize: send to Claude with the enriched context → get structured JSON output → surface in a simple web UI for the on-call to approve or adjust.
The design was close to what I'd have drawn myself, with one useful addition I hadn't considered: a "confidence threshold" on the classification. Below a threshold, the system would flag "low confidence" and not pre-fill any routing. I added that to the plan.
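For orientation, this is roughly the shape of the data moving through those three layers. A minimal TypeScript sketch; the type and field names are my illustration, not the client's actual schema:

```typescript
// Illustrative types only; names and fields are assumptions, not the client's schema.

// Layer 1: what the webhook delivers.
interface RawEscalation {
  id: string;
  rawMessage: string;   // the loose-form escalation text
  reportedBy: string;
  receivedAt: Date;
}

// Layer 2: the same escalation after the Salesforce, Linear, and docs lookups.
interface EnrichedEscalation extends RawEscalation {
  customerCandidates: { id: string; name: string; tier: string }[];
  relatedTickets: { id: string; title: string; url: string }[];
  docSnippets: string[];
}

// Layer 3: structured output from the classification call. Fields below the
// confidence threshold come back as null and are never pre-filled in the UI.
interface TriageSuggestion {
  suggestedCustomer: { id: string; confidence: number } | null;
  suggestedProductArea: { name: string; confidence: number } | null;
  suggestedPriority: "P0" | "P1" | "P2" | "P3" | null;
  suggestedTeam: string | null;
  summary: string;
}
```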
Total time for architecture: 30 minutes, including pushback on two points.
Day 1: Build the ingest and enrichment layer
9:00 AM: Scaffolding the project
I gave Claude the stack and let it scaffold:
Create a new Next.js 14 App Router project with TypeScript. Set up Vercel deployment config. Create a /api/escalations/ingest POST endpoint that accepts the webhook payload (schema in @spec.md) and writes to a Postgres table. Use Drizzle ORM.
Ten minutes later I had a scaffolded project, a working ingest endpoint, and a schema for the escalations table. I reviewed the diff, made one correction (Claude had used a generic message column name; I renamed it to raw_message to match their existing conventions).
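For reference, the escalations table was nothing exotic. A sketch of what it looked like with Drizzle; beyond raw_message, the exact columns here are my assumption, not the client's schema:

```typescript
import { pgTable, text, timestamp, uuid, jsonb } from "drizzle-orm/pg-core";

// Sketch only; beyond raw_message, the column set is illustrative.
export const escalations = pgTable("escalations", {
  id: uuid("id").defaultRandom().primaryKey(),
  rawMessage: text("raw_message").notNull(),            // the escalation text, as received
  source: text("source").notNull(),                     // which internal tool fired the webhook
  status: text("status").notNull().default("pending"),  // pending -> enriched -> classified -> routed
  enrichment: jsonb("enrichment"),                      // filled by the enrichment step
  suggestion: jsonb("suggestion"),                      // filled by the classification step
  receivedAt: timestamp("received_at").defaultNow().notNull(),
});
```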
10:00 AM: Salesforce and Linear integration
This is where most AI-assisted builds go sideways. Integrations with real APIs require reading actual documentation, handling auth correctly, dealing with rate limits and error cases.
I gave Claude the docs explicitly:
Read @docs/salesforce-api.md and @docs/linear-api.md (both are internal summaries the client provided). Build two modules: lib/salesforce.ts and lib/linear.ts. Each should expose findRecentRecords(query: string). Handle auth via the env vars listed in @.env.example. Handle rate limiting with exponential backoff. Return typed results.
This produced two solid modules, but the Salesforce one had a subtle bug — it was using SOQL for a full-text search but Salesforce's full-text search requires SOSL. I noticed it when I read the diff. Asked Claude to switch to SOSL. Fixed.
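The distinction, for anyone who hasn't hit it: SOQL filters structured fields on a single object, while cross-object full-text search goes through SOSL and the /search endpoint. Roughly what the corrected lookup looked like; a sketch, with the API version and helper signature being mine:

```typescript
// Sketch of the corrected lookup. The first draft used SOQL (the /query endpoint),
// which only supports structured field filters, not full-text search.
export async function findRecentRecords(term: string, instanceUrl: string, token: string) {
  // SOSL: full-text search across fields and objects. The search term should be
  // sanitized for SOSL's reserved characters before interpolation.
  const sosl = `FIND {${term}} IN ALL FIELDS RETURNING Account(Id, Name), Case(Id, Subject) LIMIT 20`;
  const res = await fetch(
    `${instanceUrl}/services/data/v59.0/search/?q=${encodeURIComponent(sosl)}`,
    { headers: { Authorization: `Bearer ${token}` } }
  );
  if (!res.ok) throw new Error(`Salesforce search failed: ${res.status}`);
  const body = await res.json();
  return body.searchRecords;
}
```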
Lesson: the integration I'd have gotten done in a day myself, I got done in two hours with Claude. But the verification step — reading the actual generated code against the actual API — was non-negotiable.
11:30 AM: Enrichment layer
The enrichment step, given an escalation, needed to run both the Salesforce and Linear lookups plus a third lookup against the client's internal docs (already indexed in a vector store we'd set up earlier).
I wrote a brief:
Create lib/enrichment.ts with a function enrichEscalation(escalation). It should: extract key entities from the message (customer mentions, product area mentions, severity signals) using Claude; run parallel lookups against Salesforce, Linear, and the docs index; return a structured EnrichedEscalation object. Gracefully handle timeouts on any single source.
This produced a clean module. The entity-extraction call was itself a Claude call with a focused prompt. I tested it with three sample messages. All three pulled reasonable entities.
One edge case I caught in review: the function didn't handle the case where the docs vector store returned zero results. Added a fallback. Moved on.
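Structurally the module looked something like this. A sketch reusing the illustrative types from earlier; extractEntities and the three lookup clients are stand-ins for the real modules:

```typescript
// Race a lookup against a timeout so one slow source degrades to an empty result
// instead of stalling the whole enrichment.
function withTimeout<T>(lookup: Promise<T>, ms: number, fallback: T): Promise<T> {
  return Promise.race([
    lookup,
    new Promise<T>((resolve) => setTimeout(() => resolve(fallback), ms)),
  ]);
}

export async function enrichEscalation(escalation: RawEscalation): Promise<EnrichedEscalation> {
  // 1. A focused Claude call pulls out customer mentions, product areas, severity signals.
  const entities = await extractEntities(escalation.rawMessage);

  // 2. Run the three lookups in parallel, each with its own timeout and fallback.
  const [customerCandidates, relatedTickets, docSnippets] = await Promise.all([
    withTimeout(salesforce.findRecentRecords(entities.customerQuery), 5_000, []),
    withTimeout(linear.findRecentRecords(entities.productQuery), 5_000, []),
    withTimeout(docsIndex.search(entities.topicQuery), 5_000, []),
  ]);

  // 3. Zero doc hits is a normal case, not an error; the classifier just gets less context.
  return { ...escalation, customerCandidates, relatedTickets, docSnippets };
}
```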
End of Day 1: ingest and enrichment working against staging data. No UI yet, no classification yet. But the data was flowing.
Day 2: Build the classification and the UI
9:00 AM: The classification prompt
This was the highest-risk piece. The prompt had to reliably produce useful routing, priority, and summary.
Draft one (short version):
You are a customer-escalation triage assistant for [company]. Given:
- The raw escalation text
- Related customer record
- Related Linear tickets
- Relevant internal doc snippets
Produce a JSON object with fields:
- suggestedCustomer: { id, confidence }
- suggestedProductArea: { name, confidence }
- suggestedPriority: "P0" | "P1" | "P2" | "P3"
- suggestedTeam: string
- summary: string (under 2 sentences)
If confidence for any field is below 0.5, return null for that field.
I tested against 15 historical escalations the client shared. Results:
- 11 clean matches with correct priority.
- 2 where the priority was wrong (too conservative — classified P2 when it should have been P1).
- 2 where the system returned low-confidence nulls even though a human could have identified the answer.
Refined the prompt with three additions:
- Explicit priority definitions (P0 = revenue-at-risk, P1 = blocking multiple customers, etc.)
- Two few-shot examples of P0 and P1 classifications
- A specific instruction to use the customer's tier (Enterprise, Growth, Starter) as a signal for priority
Re-ran the 15 historical examples. 14/15 correct. The 15th was genuinely ambiguous and a human reviewer also disagreed with our target answer.
Total prompt engineering time: 2 hours.
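On the implementation side, I didn't trust the JSON to come back well-formed; the model's output got parsed against a schema before anything touched the database. A sketch assuming zod, with fields mirroring the prompt above:

```typescript
import { z } from "zod";

// Illustrative schema for the classifier's output; malformed JSON fails loudly here
// instead of silently pre-filling the review UI.
export const TriageSuggestionSchema = z.object({
  suggestedCustomer: z.object({ id: z.string(), confidence: z.number().min(0).max(1) }).nullable(),
  suggestedProductArea: z.object({ name: z.string(), confidence: z.number().min(0).max(1) }).nullable(),
  suggestedPriority: z.enum(["P0", "P1", "P2", "P3"]).nullable(),
  suggestedTeam: z.string().nullable(),
  summary: z.string(),
});

export function parseSuggestion(raw: string) {
  return TriageSuggestionSchema.parse(JSON.parse(raw));
}
```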
11:00 AM: Wiring the classification
Built the end-to-end path: escalation arrives → enrichment runs → classification runs → result stored. No UI yet. Tested through the database.
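The wiring is just sequencing. A sketch of the shape, reusing the illustrative pieces above; nothing routes automatically, the result only gets stored for review:

```typescript
import { eq } from "drizzle-orm";
// db, escalations, enrichEscalation, and classifyEscalation are the illustrative pieces sketched above.

export async function processEscalation(raw: RawEscalation) {
  const enriched = await enrichEscalation(raw);            // Salesforce + Linear + docs context
  const suggestion = await classifyEscalation(enriched);   // Claude call, parsed against the schema
  await db
    .update(escalations)
    .set({ suggestion, status: "classified" })
    .where(eq(escalations.id, raw.id));                    // stored for review; routing waits for a human
  return suggestion;
}
```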
1:00 PM: The review UI
This had to be simple. The on-call engineer needed to see the full picture, approve or adjust the suggestion, and commit. Nothing more.
The spec:
Build a /review/[id] page. Show: the raw escalation at the top; the suggested routing fields as a form (pre-filled from the classification, editable); the supporting context (customer record summary, linked Linear tickets, doc snippets); an "Approve and route" button and an "Adjust and route" flow. Server action posts back to update the record and fire the downstream routing.
Claude built this in about ninety minutes. I reviewed. Adjustments I made:
- The customer record summary was too verbose; I shortened it.
- The "Adjust and route" flow was modal; I changed it to inline editing on the form directly.
- The button states weren't disabled during the API call; added loading states.
By end of Day 2, a reviewer could open the page, see a suggestion, approve or adjust, and route.
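The heart of the page is the server action behind "Approve and route". A sketch; the helper and table names are illustrative, not the shipped code:

```typescript
"use server";
import { eq } from "drizzle-orm";
// db, escalations, TriageSuggestion, and routeToTeam are illustrative names.

export async function approveAndRoute(escalationId: string, finalRouting: TriageSuggestion) {
  // Persist what the human actually decided, approved as-is or adjusted inline.
  await db
    .update(escalations)
    .set({ suggestion: finalRouting, status: "routed" })
    .where(eq(escalations.id, escalationId));

  // Only then fire the downstream routing, e.g. creating and assigning the Linear issue.
  await routeToTeam(escalationId, finalRouting);
}
```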
4:00 PM: The feedback loop
Critical piece: I wanted every human correction to be logged, so we could evaluate the system's accuracy over time.
Add logging: when a reviewer approves or adjusts, log the original suggestion, the human decision, and the delta (which fields changed) to a classification_reviews table. Also expose a simple /admin/accuracy dashboard that shows: overall accuracy rate, accuracy by field, and the most common "overrides" (where the human consistently disagrees with the AI).
This took an hour. The dashboard wasn't pretty, but it was useful. If the human-AI accuracy dropped below a threshold, we'd know.
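The delta itself is a few lines; it's what makes the dashboard's "most common overrides" view possible. A sketch with assumed field names:

```typescript
// Sketch: record which fields the human changed so the dashboard can aggregate overrides.
const REVIEWED_FIELDS = [
  "suggestedCustomer",
  "suggestedProductArea",
  "suggestedPriority",
  "suggestedTeam",
] as const;

export function computeDelta(suggested: TriageSuggestion, decided: TriageSuggestion): string[] {
  return REVIEWED_FIELDS.filter(
    (field) => JSON.stringify(suggested[field]) !== JSON.stringify(decided[field])
  );
}

// Each row in classification_reviews then looks roughly like:
// { escalationId, suggested, decided, changedFields: computeDelta(suggested, decided), reviewedAt }
```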
End of Day 2: working end-to-end system, including feedback tracking.
Day 3: Deploy, harden, hand off
9:00 AM: Production deployment
Deploying a Next.js app to Vercel is well-worn territory. Claude handled the Vercel config, the environment variable setup, and the production database migration (against their Neon Postgres). Smooth.
10:30 AM: Security review
I ran our internal security-review skill against the full repo. It surfaced:
- One endpoint missing auth (the admin dashboard). Added auth middleware.
- Customer data was being logged in full in one place. Scrubbed PII from the logs.
- The Salesforce API key used by the serverless function could have been scoped more narrowly. Rotated to a scoped key.
Spent about two hours on this pass. This is the step that always gets skipped in "AI-built" demos. It's the step that matters for shipping to production.
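For the curious, the admin-auth fix was the boring kind: a route-scoped middleware check. A sketch; the actual session verification depends on the client's auth provider:

```typescript
// middleware.ts; sketch only, the real session check belongs to the client's auth provider.
import { NextResponse } from "next/server";
import type { NextRequest } from "next/server";

export function middleware(req: NextRequest) {
  // Presence of a cookie isn't enough on its own; verify the session with the auth provider here.
  const session = req.cookies.get("session")?.value;
  if (!session) {
    return NextResponse.redirect(new URL("/login", req.url));
  }
  return NextResponse.next();
}

// Guard only the routes that need it, including the admin dashboard that was missed.
export const config = { matcher: ["/admin/:path*", "/review/:path*"] };
```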
1:00 PM: Integration into their workflow
The client's on-call engineers had to actually use the thing. I ran a thirty-minute walkthrough with the lead on-call engineer. We reviewed five live escalations together through the new system. Minor tweaks:
- He wanted the "Approve and route" button to require an explicit confirmation for P0 routings (to prevent accidents).
- He wanted keyboard shortcuts for "approve as-is" (A) and "adjust" (E).
Both added inside an hour. Went live at 3:00 PM.
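The shortcuts were a dozen lines of client code along these lines (a sketch; the hook and handler names are mine):

```typescript
"use client";
import { useEffect } from "react";

// Sketch of the review-page shortcuts: A approves as-is, E jumps into inline editing.
export function useReviewShortcuts(onApprove: () => void, onAdjust: () => void) {
  useEffect(() => {
    const handler = (e: KeyboardEvent) => {
      // Ignore keystrokes while the reviewer is typing in a form field.
      if (e.target instanceof HTMLInputElement || e.target instanceof HTMLTextAreaElement) return;
      if (e.key === "a" || e.key === "A") onApprove();
      if (e.key === "e" || e.key === "E") onAdjust();
    };
    window.addEventListener("keydown", handler);
    return () => window.removeEventListener("keydown", handler);
  }, [onApprove, onAdjust]);
}
```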
4:00 PM: Monitoring and alerting
I set up alerts for: classification errors, integration timeouts, and any sudden drop in human-AI agreement rate. Routed to their engineering Slack channel.
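The alerting itself was a Slack incoming-webhook post, nothing fancier. A sketch; the env var name is a stand-in:

```typescript
// Sketch: post an alert to the client's engineering channel via a Slack incoming webhook.
export async function alertSlack(message: string) {
  await fetch(process.env.SLACK_ALERT_WEBHOOK_URL!, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ text: `Triage assistant alert: ${message}` }),
  });
}

// Example triggers: classification JSON fails to parse, an integration lookup keeps timing out,
// or the rolling human-AI agreement rate drops below the threshold.
```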
Handed off at 5:00 PM. Total build time: roughly 22 working hours across three days. The result: a production system in use by their team.
What I learned
The leverage was real, but earned
The 22-hour build would have been a 2-week build without Claude. That is not because the model is magic — it's because the handoffs that normally bleed time (spec to code, code to tests, tests to deploy) happened inside one loop instead of across people. The speed gain came from the unified loop, not from the code generation.
The review burden shifted, didn't disappear
I didn't write most of the code. I read and reviewed most of it. That's a different cognitive load but it isn't lighter. Three days of intense reviewing is still three days of focused work. The myth that AI makes building "easy" is wrong. It makes it faster, and it shifts the shape of the work.
The quality of the initial brief dictated the quality of the output
For every prompt above, the correctness of what came back was proportional to how clearly I had scoped the ask, attached the right docs, and specified the outputs. When I was lazy with a prompt, I got work I had to redo. When I invested 90 seconds in a careful brief, I got 90 minutes of good work.
The parts that require real engineering judgment still required it
Security review. Auth design. Priority-classification tuning. The "does this integration pattern actually match the API's behavior" checks. These required actual senior-engineer judgment. Claude executed them after I scoped them, and I reviewed the output; it didn't do them instead of me. If you don't have the judgment to catch when the system is making the wrong call, the system will make wrong calls that ship.
The outcome metric is what made it an engagement, not a demo
The "60% reduction in triage time" target was the anchor for the entire project. Without it, the system would have been a cool demo that nobody committed to using. With it, there was a real reason to roll it out, a real reason to measure, and a real reason to iterate. Every AI project should have a number like this written down before the build starts.
Four weeks later
The system has handled ~800 escalations. Triage time per escalation is down from an average of 7.5 minutes to 2.2 minutes — a 70% reduction. The on-call engineer's weekly hours on triage dropped from ~15 to ~4.
The accuracy rate (humans approving the AI's suggestion without adjustment) is holding at 82%. The 18% that get adjusted still benefit from the pre-enrichment — the human spends seconds, not minutes, on each.
The client extended us for a second engagement to do the same pattern on their incident response triage.
This is what a real AI build looks like. Not magic. Careful scoping, careful briefing, careful review, tight feedback loops, and a measurable outcome. It works every time we do it this way. It fails every time we don't.
If you have a workflow like this one — high-volume, well-scoped, human-attention-expensive — it's probably a good AI project. Book a 30-minute call and we'll help you figure out if and how to ship it.

