What an AI Shopping Agent Demo Won't Show You

A buyer's guide to the three gaps between an impressive demo and an agent you can trust with revenue

Every few weeks, another vendor shows you the same demo. A chat box on a storefront. A shopper types something vague and human — "I need a warm jacket for winter, around 250 francs" — and back comes a fluent, confident answer with three product recommendations and prices. It is genuinely impressive. As a basis for a buying decision, it is also almost worthless.

Not because the demo is faked. Because the demo is the easy 10%. Everything that decides whether an AI shopping agent earns money or quietly loses it sits in the other 90% — and none of it shows up in a demo. A demo runs on a handful of products and a handful of questions the presenter already knows work. Your shop is tens of thousands of products and customers who ask things nobody anticipated.

This is a buyer's guide to that gap. It will not teach you to build an agent — the full engineering walkthrough is linked at the end, and it is open. What it will give you is the three questions that turn a vendor demo from a magic trick into something you can actually evaluate.

What an AI shopping agent actually is

Strip away the branding and an "AI shopping agent" is a small, well-understood thing: a language model that has been given a few tools — search the catalogue, look up a product — and is allowed to decide for itself which tool to use, in which order, until it can answer. That is the whole mechanism. It is a loop, and the core of it is roughly forty lines of code.

Why does that matter to you as a buyer? Because once you know the core is small and not magic, the right questions change. You stop asking "how does the AI work" and start asking "what did you build around it" — which is where all of the cost, the risk, and the value actually live. The forty-line loop is the free part. The 90% around it is the engagement you are paying for.

That 90% comes down to three questions.

Gap 1 — Can it actually find your products?

A demo agent searches a dozen products, and the presenter can pick words that work. Your catalogue has tens of thousands of products, and your customers do not use your words.

A customer asks for "a waterproof jacket for heavy rain." Your best rain jacket is described — accurately, by someone who knows the gear — as a "three-layer hardshell with taped seams." The word "waterproof" appears nowhere in it. A plain keyword search, the kind in most webshop search bars, simply cannot connect the two. Your best product is invisible to your own shop, and an invisible product is a lost sale you will never see in any report — there is no error log for "a customer searched, found nothing relevant, and left."

The fix has a name: retrieval, or semantic search. The agent searches by meaning rather than by spelling, so the customer's words and your catalogue's words can be completely different and still match. It is the single highest-value capability in a shopping agent — the difference between an agent that understands your range and one that just matches strings.

The question to ask a vendor: Does the agent search our catalogue by meaning or by keyword — and can you show it finding a product whose description does not contain the customer's words? If the answer is hand-wavy, the agent will be blind to a real slice of your range.

Gap 2 — Will it invent things?

Here is the uncomfortable fact no demo will state out loud: a fluent answer and a correct answer look exactly the same. An agent that invents a price quotes it as smoothly as a true one. An agent that recommends a product you sold out of last week recommends it with the same warmth as one in stock. Confidence is free — the model produces it whether or not the content is true.

In most demos, the only thing standing between a customer and an invented price is a polite sentence in the agent's instructions: "do not make up prices." That is a request, not a guarantee. On a real storefront, across thousands of conversations, requests get ignored. And a wrong price shown to a customer is not a "low-quality answer" — it is a mis-sale, a support ticket, and a dent in trust.

The fix is guardrails: deterministic checks that run on every answer and catch the things that must never happen — an invented product, a wrong price, a sold-out item presented as available. The important word is run. A guardrail is code that executes and blocks a bad answer before the customer sees it; it is not a sentence the model read once and may or may not honour.

The question to ask: What actually runs to stop a wrong price or a made-up product reaching a customer — and does it block the answer, or merely log it afterwards?

Gap 3 — Do you know it's right?

The third gap is the one buyers fall into most often. The demo went well — four questions, four good answers, the room was impressed — so the agent ships.

Four good answers are not evidence. They are four samples, chosen by the person least motivated to find a flaw, on a system that produces a confident answer every single time regardless of whether it is right. A demo that goes well tells you the agent can succeed. It tells you nothing about how often it fails, how badly, or on what.

Before an agent talks to paying customers, you need that as a number: how often is it right, measured against a real set of test questions, and re-measured every time something changes. The discipline is called evaluation, and it is the difference between "we tested it and it works" and "it answers nine of every ten test questions correctly, and here are the ones it does not."

The question to ask: How is the agent measured — against how many real cases, how often, and what is the number today? "We tested it and it works" is not an answer.

The demo-to-production gap

Put the three gaps together and you can draw the whole map. This is what sits between the demo you were shown and an agent you can trust with revenue:

The demo	What production actually needs
A handful of sample products	Your real catalogue — tens of thousands of products, cleaned up
A few questions that work	Continuous evaluation against a growing test set
"Don't invent prices" in the instructions	Guardrails that block a wrong answer, wired into every response
Runs in a sales meeting	Deployment, with a latency and cost budget on every query
No data-protection context	Swiss nDSG compliance and data residency
A standalone chat box	Integration into your real checkout and commerce systems

The left column is an afternoon's work. The right column is the engineering — and it is the right column that decides whether the agent becomes an asset or a liability.

A buyer's checklist

If you are evaluating an AI shopping agent — to buy, or to commission a build — these are the questions that separate substance from a thin wrapper:

Does it search our catalogue by meaning, not just keywords? Show me it working.
What runs to stop an invented price or product — and does it block the answer, or only log it?
How is accuracy measured: how many test cases, how often, and what is the number today?
What happens on the questions it gets wrong — does it fail safely?
What is the cost and the latency per query, at our real query volume?
How does it stay current as our prices and stock change through the day?
Where does our customer data go, and is that nDSG-compliant?
When you change the agent, how do you know you did not quietly break it?

If a vendor cannot answer most of these concretely, you are looking at a demo, not a product.

The proof is open

A fair question back to me: why trust this framing? Because it is not theory. The mechanics behind all three gaps — building the agent loop, adding retrieval, building the evaluation harness — are written up in full and runnable. We built a small, open tutorial, build-an-agent, that runs on a normal laptop with no GPU and no cloud account, and a three-part engineering series that walks through every line of it. If you have a developer, point them there: they can have the whole thing running in an afternoon and judge it for themselves.

build-an-agent — the open tutorial repository: https://github.com/MehmetGoekce/build-an-agent
An AI Agent Is Just a Loop — Part 1, the agent mechanism: https://mehmetgoekce.substack.com/p/an-ai-agent-is-just-a-loop
The Product Your Search Bar Can't Find — Part 2, retrieval: https://mehmetgoekce.substack.com/p/the-product-your-search-bar-cant
A Fluent Answer and a Correct Answer Look the Same — Part 3, evaluation: https://mehmetgoekce.substack.com/p/a-fluent-answer-and-a-correct-answer

The takeaway

An AI shopping agent demo will always look good — that is what demos are for. The decision in front of you is not whether the demo is impressive. It is whether the vendor has built the three things a demo never shows: an agent that finds the right product, one that cannot invent a wrong one, and a measured number telling you how often it is right.

Ask those three questions. Whether the answers come back concrete or hand-wavy will tell you everything you need to know.

Mehmet Gökçe is a software & data engineer with IT experience since 1998. He runs MEMOTECH (Swiss-based, St. Gallen) and publishes regularly on agentic AI, multi-agent architectures, and e-commerce engineering.

Evaluating an AI shopping agent — to buy, or to build — and want the three questions asked properly? Working out what stands between a vendor's demo and an agent you can trust with revenue is exactly what we do at MEMOTECH. If you want a technical second opinion before you commit, get in touch.

I also publish engineering deep-dives on agentic AI, retrieval, and the economics of LLM-backed e-commerce roughly twice a month. Direct to your inbox.

Subscribe to the MEMOTECH Newsletter →

What an AI Shopping Agent Demo Won't Show You

What an AI Shopping Agent Demo Won't Show You

What an AI shopping agent actually is

Gap 1 — Can it actually find your products?

Gap 2 — Will it invent things?

Gap 3 — Do you know it's right?

The demo-to-production gap

A buyer's checklist

The proof is open

The takeaway

Mehmet Gökçe

Weitere Artikel

BDI Agents für Shopware: Wenn Multi-Agent Systeme zu Chatbots werden

KI-Agenten sicher einsetzen: Policy-as-Code mit NVIDIA OpenShell

ChatGPT in der Treuhandkanzlei: Was nDSG und EXPERTsuisse wirklich verlangen