If you're evaluating AI agencies in 2026, you're navigating a category that changed faster than the buyers in it could keep up. Two years ago, AI was a niche capability. Today every marketing agency has rebranded as an AI agency, every consulting firm has an AI practice, and every freelancer with a ChatGPT subscription is selling "AI strategy." Most of it is noise.
The signal-to-noise problem isn't a vendor problem. It's a buyer problem. The agencies that can't actually build AI systems will tell you exactly what you want to hear, because they don't know enough to know what you should be asking. The agencies that can build will sound less polished, push back more, and quote higher. Counter-intuitively, the latter is what you want.
Here's a field guide for operators evaluating AI partners. Five questions that separate signal from noise, what "shipping" actually means, pricing realities, and the red flags that should kill any deal in the first call.
01 The five questions
Most operators evaluating an AI agency ask the wrong questions. They ask about the team's AI experience (everyone says "deep"). They ask about case studies (everyone has logos). They ask about pricing (everyone is opaque).
The questions that actually surface signal:
1. Show me an agent you built that's in production right now.
Not a demo. Not a Loom video. Not a slide. An agent that's running, with users depending on it, with measurable output. If the agency can't show you one in 90 seconds, they don't build agents. They sell strategy decks about agents.
2. What's the first thing that broke?
Real builders have war stories. Real builders know that the demo is the easy part and the hard part is keeping the agent running when the underlying API changes, prompt drift sets in, edge cases pile up, and costs spike. If they say "nothing has broken," they haven't built anything that matters.
3. Where does the AI run?
Cloud or on-prem. Claude or GPT or open-source. Hosted by them or by you. The right answer depends on your situation. The wrong answer is "don't worry about it." If they can't articulate the deployment architecture, they're probably reselling somebody else's stack.
4. Who operates it after launch?
Production AI requires ongoing operations — model swaps, prompt tuning, cost monitoring, eval cycles. The build phase is 4–8 weeks. The operate phase is forever. Agencies that stop at "launch" are leaving you with a fragile system.
5. What's your eval methodology?
If they don't have a serious answer about evaluating agent outputs systematically — golden datasets, regression testing, drift detection — they're shipping based on vibes. That's fine for prototypes, fatal in production.
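Question 5 has a concrete shape. Below is a minimal golden-dataset regression check, a sketch only: `GOLDEN`, `run_agent`, and the canned answers are hypothetical stand-ins for a real model call and a real labeled dataset, but the structure — fixed inputs, required facts in the output, a pass count compared run over run — is what a serious eval answer describes.

```python
# Minimal golden-dataset regression check (illustrative).
# GOLDEN pairs an input with substrings a correct answer must
# contain; run_agent stands in for the real agent call.

GOLDEN = [
    {"input": "What is our refund window?",
     "must_contain": ["30 days"]},
    {"input": "Do you ship to Canada?",
     "must_contain": ["yes", "5-7 business days"]},
]

def run_agent(prompt: str) -> str:
    # Placeholder for the real model call.
    canned = {
        "What is our refund window?":
            "Refunds are accepted within 30 days of purchase.",
        "Do you ship to Canada?":
            "Yes, we ship to Canada in 5-7 business days.",
    }
    return canned[prompt]

def run_evals() -> dict:
    """Return pass/fail counts; a drop versus the previous
    run flags a regression before users see it."""
    passed = 0
    for case in GOLDEN:
        answer = run_agent(case["input"]).lower()
        if all(s.lower() in answer for s in case["must_contain"]):
            passed += 1
    return {"passed": passed, "total": len(GOLDEN)}
```

Run on every prompt change and model swap; the point is not the scoring function but that the dataset is versioned and the score is tracked over time.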
02 What "shipping" actually means
There's a vocabulary gap between agencies that talk about AI and agencies that build AI. "We shipped an AI campaign" can mean almost anything. The questions to clarify what you're actually getting:
Did a system go into production? Production means real users, real usage volume, real consequences when it breaks. A pilot with five internal testers is not production.
Is it integrated with existing systems? A standalone agent in a sandbox is a science fair project. An agent that reads from your CRM, writes to your support queue, and triggers downstream workflows is shipped infrastructure.
Does it have observability? Logs, traces, evals, dashboards. If the agency can't tell you what the agent is doing right now, they're not running it. They're hoping it works.
Is there an on-call rotation? Production systems need someone who picks up the phone when something breaks at 11pm. If the agency can't articulate their incident response, the system isn't really production-grade.
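The observability bar above can be made concrete with structured, trace-tagged logging: every agent step emits one machine-readable record tied to a trace ID, so "what is the agent doing right now" is a query, not a guess. A minimal sketch in Python, assuming nothing about the agent framework; `log_step` and its field names are illustrative, not a real library's API.

```python
import json
import time
import uuid

def log_step(trace_id: str, step: str, **fields) -> str:
    """Emit one structured log line per agent step.

    All steps in one agent run share a trace_id, so a single
    request can be reconstructed end to end. Returns the line
    so callers can also ship it to a log sink.
    """
    record = {
        "trace_id": trace_id,
        "step": step,
        "ts": time.time(),  # wall-clock timestamp for ordering
        **fields,           # latency, token counts, doc counts, ...
    }
    line = json.dumps(record)
    print(line)
    return line

# One trace per agent invocation; each phase logs its own step.
trace_id = str(uuid.uuid4())
log_step(trace_id, "retrieve", docs=3, latency_ms=120)
log_step(trace_id, "generate", tokens_in=900, tokens_out=250,
         latency_ms=1800)
```

Token counts and latencies per step are also what make the cost-monitoring and drift questions answerable, which is why "does it have observability" and "what's your eval methodology" are really the same question asked twice.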
03 Pricing realities
AI agency engagements in 2026 cluster in three tiers, and operators should know which tier they're shopping in:
Sub-$25K ("strategy" tier). You're buying a deck and a roadmap. You're not getting code. The agency may staff it with senior people, but they won't ship anything you can use without a separate build phase. Useful for board prep and getting un-stuck. Not useful for actually deploying AI.
$50K–$150K (build tier). You're buying a working agent — research, build, deploy. This is where most real AI work happens. Custom-fit to a specific workflow. Operates in production. Replicable across similar workflows in your business.
$250K+ (program tier). Multi-agent systems, deep integration, and an ongoing operate phase. This is for operators with multiple workflows to address, or one workflow at significant scale. Usually annual.
If an agency quotes you $100K for what sounds like "strategy + a workshop," they're either inexperienced or hoping you don't know better. If they quote you $10K for a custom AI sales agent, they're either offshoring everything or planning to disappear after the build phase. Both are red flags.
04 Red flags
Patterns that should kill any agency conversation in the first call:
- They lead with the LLM brand. "We're a Claude shop" or "We build on GPT-5" is a vendor relationship, not a capability. Real agencies pick the model based on the use case, not the other way around.
- They can't show you working code. If everything is decks and demos, they're not builders.
- They promise "AI transformation." Real builders promise specific outcomes for specific workflows. Vague transformation language is the calling card of consulting firms with no engineering capability.
- They don't ask you about your data. Production AI runs on your data. If they don't immediately get curious about where your data lives, what's clean, what isn't — they're not thinking about deployment.
- They quote without scoping. Real engagements get scoped before priced. A flat rate quoted in the first call is a sales tactic.
- They've never said no to a client. The agencies worth hiring turn down work that won't compound. The ones that will say yes to anything are the ones that ship slop.
05 What good looks like
The right AI agency in 2026 looks less like a marketing agency and more like a small product team. Senior engineers, senior strategists, working in the same room. Comfortable with model APIs, prompt engineering, evals, and the operational realities of running production AI. Comfortable saying "no" to scope that won't ship.
They'll spend the first call asking about your workflow, your data, and your team — not pitching theirs. They'll quote a build phase and an operate phase separately. They'll tell you that ongoing prompt tuning and model swaps are part of the deal, not an upsell.
And the work they show you will be specific. Not "we built an AI tool for a Fortune 500." Closer to: "we built an AI sales agent for a B2B logistics company that researches accounts, drafts personalized outreach, and qualifies inbound leads. It's been in production for 11 months and currently handles 70% of pre-call research."
If the conversation feels too smooth, it's a marketing agency. If the conversation feels concrete and slightly inconvenient, it might be the real thing.
Common questions
How much does it cost to build a custom AI agent in 2026?
$50K–$150K is the typical build-tier range for a single production agent. Strategy-only engagements run sub-$25K. Multi-agent programs and significant scale work run $250K+. Watch out for sub-$25K "build" engagements — usually offshored or unfinished.
What's the difference between an AI agency and an AI consultancy?
An AI consultancy delivers strategy and roadmaps. An AI agency builds and ships. Both are legitimate businesses. The mistake is buying one when you needed the other.
How long does an AI agent take to build?
Most production agents ship in 4–8 weeks. Deep integration or multi-agent systems take longer. If an agency quotes 2 weeks, they're either prototyping or skipping production-grade work.
Should I build with Claude, GPT, or open-source?
Depends on the workflow. Claude is strong on long-context and code. GPT is strong on broad coverage. Open-source is right when data sovereignty or cost-at-scale matter. The agency that picks before scoping the workflow is doing it backwards.
How do I know if an AI agency is legit?
Ask them to show you an agent in production — running right now, with real users. Ask what broke. Ask about evals. Ask about who operates after launch. If they can answer all four with specifics, they're builders. If they can't, they sell decks.