
Building an AI Product with Claude: A Complete Walkthrough

From empty folder to shipped production feature, using Claude as the primary builder. A real example, with every prompt, every decision, every shortcut we actually used.

Ricardo Ramirez

Founder · Sprintt

March 4, 2026 · 11 min read
Claude · Example Project · Walkthrough

Most "build with AI" tutorials show you a toy example. A todo app, a weather widget, a chatbot that waves. They prove nothing — of course AI can build a todo app. The question is whether AI can build something real.

This post is a real walkthrough. The project: an internal tool for a product-ops team at a mid-market SaaS company to triage customer escalations. The client: an actual paying Sprintt client, details anonymized. The builder: one person, using Claude as the primary agent, over three working days. The outcome: a working system deployed to production and used daily.

Every prompt, every decision, every shortcut — as they actually happened. If you're wondering what the real shape of this work looks like, here it is.

The ask

The client's VP of Customer Success came to us with a specific problem: their support team received ~200 high-priority escalations per week from customer-facing teams. Each escalation was a free-form message, often missing context. The on-call engineer had to triage each one — figure out whether the issue was real, find the relevant customer record and product area, assign a priority, and route it to the right team.

Current state: the on-call engineer spent ~40% of their week on triage. That was ~15 hours/week of senior-engineer time spent on routing work, not solving problems.

Desired state: a system that reads the incoming escalation, pulls context, suggests a routing and priority, and presents it to the on-call engineer as a one-click approval with full supporting context.

Measured outcome: we agreed on a target of reducing triage time by 60%, measured over four weeks of use.

Budget: one week of Sprintt time. That was the real constraint. We had three working days to ship, plus buffer for review.

The architecture, decided in 30 minutes

On day one, I opened a Claude Code session and asked for an architecture proposal.

First prompt (paraphrased from the actual):

We need a triage assistant. Input: a text escalation from an internal tool. Output: for a human reviewer — suggested customer match, suggested product area, suggested priority, suggested routing, and a one-paragraph summary. Must integrate with: Linear (their tracker), Salesforce (customer records), and their internal docs. Budget: 3 days of build. Stack: they use Node/TypeScript and Vercel.

Propose an architecture. No code yet. 300 words max.

Response came back with a three-layer design:

  1. Ingest: webhook from their internal tool → serverless function → staging queue.
  2. Enrichment: pull customer record (Salesforce), pull related Linear tickets, pull any relevant internal doc snippets. Compose a prompt.
  3. Classify and summarize: send to Claude with the enriched context → get structured JSON output → surface in a simple web UI for the on-call to approve or adjust.

The design was close to what I'd have drawn myself, with one useful addition I hadn't considered: a "confidence threshold" on the classification. Below a threshold, the system would flag "low confidence" and not pre-fill any routing. I added that to the plan.

Total time for architecture: 30 minutes, including pushback on two points.

Day 1: Build the ingest and enrichment layer

9:00 AM: Scaffolding the project

I gave Claude the stack and let it scaffold:

Create a new Next.js 14 App Router project with TypeScript. Set up Vercel deployment config. Create a /api/escalations/ingest POST endpoint that accepts the webhook payload (schema in @spec.md) and writes to a Postgres table. Use Drizzle ORM.

Ten minutes later I had a scaffolded project, a working ingest endpoint, and a schema for the escalations table. I reviewed the diff, made one correction (Claude had used a generic message column name; I renamed it to raw_message to match their existing conventions).
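The route handler and Drizzle schema aren't shown here, but the shape of the validation in front of the table write is worth sketching. A minimal, hypothetical payload guard (field names are illustrative — the real schema lived in spec.md):

```typescript
// Minimal shape check for the incoming webhook payload before it is
// written to the escalations table. Illustrative only; the actual
// fields came from the client's spec.
interface IngestPayload {
  externalId: string;
  rawMessage: string; // matches the raw_message column convention
  reporter: string;
}

function parseIngestPayload(body: unknown): IngestPayload | null {
  if (typeof body !== "object" || body === null) return null;
  const b = body as Record<string, unknown>;
  if (typeof b.externalId !== "string" || b.externalId.length === 0) return null;
  if (typeof b.rawMessage !== "string" || b.rawMessage.length === 0) return null;
  if (typeof b.reporter !== "string") return null;
  return {
    externalId: b.externalId,
    rawMessage: b.rawMessage,
    reporter: b.reporter,
  };
}
```

Rejecting malformed payloads at the door keeps the staging queue clean, which matters once real webhooks start arriving with missing fields.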

10:00 AM: Salesforce and Linear integration

This is where most AI-assisted builds go sideways. Integrations with real APIs require reading actual documentation, handling auth correctly, dealing with rate limits and error cases.

I gave Claude the docs explicitly:

Read @docs/salesforce-api.md and @docs/linear-api.md (both are internal summaries the client provided). Build two modules: lib/salesforce.ts and lib/linear.ts. Each should expose: findRecentRecords(query: string). Handle auth via the env vars listed in @.env.example. Handle rate limiting with exponential backoff. Return typed results.

This produced two solid modules, but the Salesforce one had a subtle bug — it was using SOQL for a full-text search but Salesforce's full-text search requires SOSL. I noticed it when I read the diff. Asked Claude to switch to SOSL. Fixed.

Lesson: the integration I'd have gotten done in a day myself, I got done in two hours with Claude. But the verification step — reading the actual generated code against the actual API — was non-negotiable.
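The backoff requirement in that brief is simple to state and easy to get subtly wrong. A sketch of the shape the generated modules used, with assumed constants (the real values and retry conditions were tuned per API):

```typescript
// Exponential backoff: delays double per attempt, capped. The base
// delay and cap here are assumptions, not the client's actual values.
const BASE_DELAY_MS = 500;
const MAX_DELAY_MS = 8_000;

function backoffDelay(attempt: number): number {
  // attempt 0 -> 500ms, 1 -> 1000ms, 2 -> 2000ms, ... capped at 8s
  return Math.min(BASE_DELAY_MS * 2 ** attempt, MAX_DELAY_MS);
}

async function withBackoff<T>(
  fn: () => Promise<T>,
  maxAttempts = 5,
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
      await new Promise((resolve) => setTimeout(resolve, backoffDelay(attempt)));
    }
  }
  throw lastError;
}
```

A production version would retry only on rate-limit responses (429s) rather than every error, and ideally add jitter; the point is that the retry policy lives in one wrapper, not scattered across both clients.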

11:30 AM: Enrichment layer

Given an escalation, the enrichment step had to run both the Salesforce and Linear lookups, plus a third lookup against the client's internal docs (already indexed in a vector store we'd set up earlier).

I wrote a brief:

Create lib/enrichment.ts with a function enrichEscalation(escalation). It should: extract key entities from the message (customer mentions, product area mentions, severity signals) using Claude; run parallel lookups against Salesforce, Linear, and the docs index; return a structured EnrichedEscalation object. Gracefully handle timeouts on any single source.

This produced a clean module. The entity-extraction call was itself a Claude call with a focused prompt. I tested it with three sample messages. All three pulled reasonable entities.

One edge case I caught in review: the function didn't handle the case where the docs vector store returned zero results. Added a fallback. Moved on.
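The "gracefully handle timeouts on any single source" requirement maps naturally onto `Promise.allSettled`: each lookup races a per-source timeout, and a failed source degrades to an empty result instead of failing the whole enrichment. A sketch under those assumptions (the real module also did the entity extraction via a Claude call first):

```typescript
// A failed or timed-out source becomes an empty result instead of
// failing the whole enrichment. Types are illustrative.
interface LookupResults {
  customerRecord: unknown | null;
  relatedTickets: unknown[];
  docSnippets: string[];
}

function mergeSettled(
  [sf, linear, docs]: PromiseSettledResult<unknown>[],
): LookupResults {
  return {
    customerRecord: sf.status === "fulfilled" ? sf.value : null,
    relatedTickets:
      linear.status === "fulfilled" ? (linear.value as unknown[]) : [],
    docSnippets:
      docs.status === "fulfilled" ? (docs.value as string[]) : [],
  };
}

// Each lookup races against a per-source timeout before settling.
function withTimeout<T>(p: Promise<T>, ms: number): Promise<T> {
  return Promise.race([
    p,
    new Promise<T>((_, reject) =>
      setTimeout(() => reject(new Error("lookup timed out")), ms),
    ),
  ]);
}
```

With this shape, the zero-results edge case from the review is just another empty `docSnippets` array — the classification prompt has to tolerate it anyway.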

End of Day 1: ingest and enrichment working against staging data. No UI yet, no classification yet. But the data was flowing.

Day 2: Build the classification and the UI

9:00 AM: The classification prompt

This was the highest-risk piece. The prompt had to reliably produce useful routing, priority, and summary.

Draft one (short version):

You are a customer-escalation triage assistant for [company]. Given:
- The raw escalation text
- Related customer record
- Related Linear tickets
- Relevant internal doc snippets

Produce a JSON object with fields:
- suggestedCustomer: { id, confidence }
- suggestedProductArea: { name, confidence }
- suggestedPriority: "P0" | "P1" | "P2" | "P3"
- suggestedTeam: string
- summary: string (under 2 sentences)

If confidence for any field is below 0.5, return null for that field.
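The prompt asks the model for nulls below 0.5 confidence, but models don't always follow their own rules, so it's worth enforcing the threshold server-side as well. A sketch of the output type and the enforcement step (names mirror the prompt; the parsing and error handling around `JSON.parse` are omitted):

```typescript
// Enforce the confidence threshold on the model's JSON output rather
// than trusting the prompt alone. Illustrative, not the exact module.
type Priority = "P0" | "P1" | "P2" | "P3";

interface TriageResult {
  suggestedCustomer: { id: string; confidence: number } | null;
  suggestedProductArea: { name: string; confidence: number } | null;
  suggestedPriority: Priority | null;
  suggestedTeam: string | null;
  summary: string;
}

const THRESHOLD = 0.5;

function enforceThreshold(raw: TriageResult): TriageResult {
  return {
    ...raw,
    suggestedCustomer:
      raw.suggestedCustomer && raw.suggestedCustomer.confidence >= THRESHOLD
        ? raw.suggestedCustomer
        : null,
    suggestedProductArea:
      raw.suggestedProductArea &&
      raw.suggestedProductArea.confidence >= THRESHOLD
        ? raw.suggestedProductArea
        : null,
  };
}
```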

I tested against 15 historical escalations the client shared. Results:

  • 11 clean matches with correct priority.
  • 2 where the priority was wrong (too conservative — classified P2 when it should have been P1).
  • 2 with low-confidence nulls that were actually identifiable to a human.

Refined the prompt with three additions:

  1. Explicit priority definitions (P0 = revenue-at-risk, P1 = blocking multiple customers, etc.)
  2. Two few-shot examples of P0 and P1 classifications
  3. A specific instruction to use the customer's tier (Enterprise, Growth, Starter) as a signal for priority

Re-ran the 15 historical examples. 14/15 correct. The 15th was genuinely ambiguous and a human reviewer also disagreed with our target answer.
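One of those refinements can be made concrete in code rather than left entirely to the prompt. If you treat customer tier as a priority floor — a hypothetical reading of "tier as a signal"; the actual build expressed this inside the prompt — the post-processing looks like:

```typescript
// Hypothetical: never let the suggested priority fall below a floor
// implied by the customer's tier. Mapping values are assumptions.
type Tier = "Enterprise" | "Growth" | "Starter";
type Priority = "P0" | "P1" | "P2" | "P3";

const PRIORITY_FLOOR: Record<Tier, Priority> = {
  Enterprise: "P1",
  Growth: "P2",
  Starter: "P3",
};

// Lower index = more urgent.
const ORDER: Priority[] = ["P0", "P1", "P2", "P3"];

function applyTierSignal(suggested: Priority, tier: Tier): Priority {
  const floor = PRIORITY_FLOOR[tier];
  // Demote-proof: an Enterprise escalation classified P2 becomes P1,
  // but a P0 suggestion is never downgraded.
  return ORDER.indexOf(suggested) > ORDER.indexOf(floor) ? floor : suggested;
}
```

A deterministic rule like this would have directly fixed the two "too conservative" misclassifications in the first test run without relying on the model to weigh the signal correctly.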

Total prompt engineering time: 2 hours.

11:00 AM: Wiring the classification

Built the end-to-end path: escalation arrives → enrichment runs → classification runs → result stored. No UI yet. Tested through the database.

1:00 PM: The review UI

This had to be simple. The on-call engineer needed to see the full picture, approve or adjust the suggestion, and commit. Nothing more.

The spec:

Build a /review/[id] page. Show: the raw escalation at the top; the suggested routing fields as a form (pre-filled from the classification, editable); the supporting context (customer record summary, linked Linear tickets, doc snippets); an "Approve and route" button and an "Adjust and route" flow. Server action posts back to update the record and fire the downstream routing.

Claude built this in about ninety minutes. I reviewed. Adjustments I made:

  • The customer record summary was too verbose; I shortened it.
  • The "Adjust and route" flow was modal; I changed it to inline editing on the form directly.
  • The button states weren't disabled during the API call; added loading states.

By end of Day 2, a reviewer could open the page, see a suggestion, approve or adjust, and route.

4:00 PM: The feedback loop

Critical piece: I wanted every human correction to be logged, so we could evaluate the system's accuracy over time.

Add logging: when a reviewer approves or adjusts, log the original suggestion, the human decision, and the delta (which fields changed) to a classification_reviews table. Also expose a simple /admin/accuracy dashboard that shows: overall accuracy rate, accuracy by field, and the most common "overrides" (where the human consistently disagrees with the AI).

This took an hour. The dashboard wasn't pretty, but it was useful. If the human-AI accuracy dropped below a threshold, we'd know.
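The core of that feedback loop is two small functions: compute which fields the human changed, and aggregate overrides into a per-field accuracy rate. A sketch with assumed shapes (the real `classification_reviews` table had more columns):

```typescript
// Delta between the AI suggestion and the human decision, plus the
// per-field accuracy the /admin/accuracy dashboard would display.
// Field shapes are illustrative.
type Fields = Record<string, string | null>;

function fieldDelta(suggestion: Fields, decision: Fields): string[] {
  return Object.keys(decision).filter((k) => suggestion[k] !== decision[k]);
}

interface Review {
  delta: string[]; // field names the human changed
}

function accuracyByField(reviews: Review[], field: string): number {
  if (reviews.length === 0) return 1;
  const overridden = reviews.filter((r) => r.delta.includes(field)).length;
  return 1 - overridden / reviews.length;
}
```

Storing the delta per review, rather than just an approved/adjusted flag, is what makes "most common overrides" queryable later — you can see exactly which field the humans keep disagreeing on.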

End of Day 2: working end-to-end system, including feedback tracking.

Day 3: Deploy, harden, hand off

9:00 AM: Production deployment

Deploying a Next.js app to Vercel is well-worn territory. Claude handled the Vercel config, the environment variable setup, and the production database migration (against their Neon Postgres). Smooth.

10:30 AM: Security review

I ran our internal security-review skill against the full repo. It surfaced:

  • One endpoint missing auth (the admin dashboard). Added auth middleware.
  • Customer data was being logged verbatim in one place. Scrubbed PII from the logs.
  • The Salesforce API key used by the serverless function could have been scoped more narrowly. Rotated it to a scoped key.

Spent about two hours on this pass. This is the step that always gets skipped in "AI-built" demos. It's the step that matters for shipping to production.

1:00 PM: Integration into their workflow

The client's on-call engineers had to actually use the thing. I ran a thirty-minute walkthrough with the lead on-call engineer. We reviewed five live escalations together through the new system. Minor tweaks:

  • He wanted the "Approve and route" button to require an explicit confirmation for P0 routings (to prevent accidents).
  • He wanted keyboard shortcuts for "approve as-is" (A) and "adjust" (E).

Both added inside an hour. Went live at 3:00 PM.

4:00 PM: Monitoring and alerting

I set up alerts for: classification errors, integration timeouts, and any sudden drop in human-AI agreement rate. Routed to their engineering Slack channel.

Handed off at 5:00 PM. Total build time: roughly 22 working hours across three days. The result: a production system in use by their team.

What I learned

The leverage was real, but earned

The 22-hour build would have been a 2-week build without Claude. That is not because the model is magic — it's because the handoffs that normally bleed time (spec to code, code to tests, tests to deploy) happened inside one loop instead of across people. The speed gain came from the unified loop, not from the code generation.

The review burden shifted, didn't disappear

I didn't write most of the code. I read and reviewed most of it. That's a different cognitive load but it isn't lighter. Three days of intense reviewing is still three days of focused work. The myth that AI makes building "easy" is wrong. It makes it faster, and it shifts the shape of the work.

The quality of the initial brief dictated the quality of the output

For every prompt above, the correctness of what came back was proportional to how clearly I had scoped the ask, attached the right docs, and specified the outputs. When I was lazy with a prompt, I got work I had to redo. When I invested 90 seconds in a careful brief, I got 90 minutes of good work.

The parts that require real engineering judgment still required it

Security review. Auth design. Priority-classification tuning. The "does this integration pattern actually match the API's behavior" checks. These required actual senior-engineer judgment. Claude did them after I scoped them and reviewed the output; it didn't do them instead of me. If you don't have the judgment to catch when the system is making the wrong call, the system will make wrong calls that ship.

The outcome metric is what made it an engagement, not a demo

The "60% reduction in triage time" target was the anchor for the entire project. Without it, the system would have been a cool demo that nobody committed to using. With it, there was a real reason to roll it out, a real reason to measure, and a real reason to iterate. Every AI project should have a number like this written down before the build starts.

Four weeks later

The system has handled ~800 escalations. Triage time per escalation is down from an average of 7.5 minutes to 2.2 minutes — a 70% reduction. The on-call engineer's weekly hours on triage dropped from ~15 to ~4.

The accuracy rate (humans approving the AI's suggestion without adjustment) is holding at 82%. The 18% that get adjusted still benefit from the pre-enrichment — the human spends seconds, not minutes, on each.

The client extended us for a second engagement to do the same pattern on their incident response triage.

This is what a real AI build looks like. Not magic. Careful scoping, careful briefing, careful review, tight feedback loops, and a measurable outcome. It works every time we do it this way. It fails every time we don't.


If you have a workflow like this one — high-volume, well-scoped, human-attention-expensive — it's probably a good AI project. Book a 30-minute call and we'll help you figure out if and how to ship it.

Written by

Ricardo Ramirez

Founder of Sprintt. Product leader, practitioner, and operator — not an academic or a theorist. Writes about the gap between AI strategy and shipped production systems, because closing that gap is the only thing Sprintt does.
