How AI Works

How ChatGPT, Claude, Perplexity & Gemini Decide Which Brands to Name

AI recommendations feel like magic. They're not. Under the hood there are two mechanisms — training and retrieval — and understanding them tells you exactly where to spend your effort.

Ozvor Research · 26 June 2026 · 9 min read

Key takeaways

AI answers are built from two pools: what a model learned in training, and what it retrieves live from the web at answer time.
Retrieval is the fast lane for small businesses — content can be cited within days, not after a year-long retraining cycle.
Retrieval matches on meaning: if your content doesn't name the specific service, place, and details a buyer would ask about, it won't be pulled in.
Crawlability is a hard gate — a large share of the pages AI wants to cite are blocked or unreadable, and that's a fixable own-goal.

When an AI assistant names three businesses in answer to a question, it can feel arbitrary — or worse, rigged. It's neither. There are two distinct mechanisms deciding who gets named, and once you can see them, the whole discipline of GEO stops feeling mysterious and starts feeling like a checklist.

Mechanism 1: training data (the slow pool)

When a model is trained, it ingests an enormous slice of the public internet. Sources that appear often, and are cited often, get baked into the model's parameters. This is why Wikipedia, Reddit, and major publishers echo through so many AI answers — they were represented at scale during training.

For a small business, this pool is real but slow. Anything you publish today won't enter a model's training data until its next major training run, which can be a year or more away. You can influence it — by being mentioned, consistently, on sources that get trained on — but you can't rush it.

Mechanism 2: live retrieval (the fast pool)

This is where the opportunity lives. Most modern assistants now fetch live web content at answer time — Perplexity was built around it, ChatGPT Search and Gemini do it routinely, and Claude can search the web. The model runs a search, pulls a set of pages, and synthesises an answer that cites them. This is retrieval-augmented generation, and it means content you publish this week can be cited this month.
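The search, fetch, and cite loop described above can be sketched in a few lines. This is a minimal illustration, not how any particular assistant is built: the `WEB` list, the example URLs, and the word-overlap ranking are all hypothetical stand-ins for a real search index, and the language model step is reduced to a comment.

```python
from dataclasses import dataclass


@dataclass
class Page:
    url: str
    text: str


# Hypothetical stand-in for the live web; a real assistant queries a search index.
WEB = [
    Page("https://example.com/invisalign", "Invisalign consultation in the city centre, from 45 minutes"),
    Page("https://example.org/about", "We love helping our patients smile"),
    Page("https://example.net/reviews", "Reviews of city centre dentists offering Invisalign"),
]


def search(query: str, pages: list[Page], top_k: int = 2) -> list[Page]:
    """Rank pages by how many query words they share (a crude relevance proxy)."""
    query_words = set(query.lower().split())

    def overlap(page: Page) -> int:
        return len(query_words & set(page.text.lower().replace(",", " ").split()))

    ranked = sorted(pages, key=overlap, reverse=True)
    # Pages with no overlap are never candidates, however well written.
    return [p for p in ranked if overlap(p) > 0][:top_k]


def answer(query: str) -> str:
    """Retrieve first, then build a reply that cites what was retrieved."""
    sources = search(query, WEB)
    if not sources:
        return "No sources found."
    citations = ", ".join(f"[{i}] {p.url}" for i, p in enumerate(sources, start=1))
    # A real system would pass the retrieved text to a language model here.
    return f"Answer drawn from {len(sources)} retrieved pages. Sources: {citations}"


print(answer("Invisalign consultation city centre"))
```

The point of the sketch is the ordering: retrieval happens before the answer is written, so a page that never makes it into `sources` cannot be cited.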

Where do those retrieved sources come from? Studies of cited domains find a consistent cast: Reddit leads, with community and Q&A sites, Wikipedia, LinkedIn, YouTube, and review platforms close behind — plus the brand's own site when it's clear and crawlable.

Profound — the data on Reddit & AI search (4B+ citations analysed) — tryprofound.com/blog/the-data-on-reddit-and-ai-search; Peec AI — AI search engines cite Reddit, YouTube and LinkedIn most, via Search Engine Land — searchengineland.com; Semrush — the most-cited domains in AI: a 3-month study — semrush.com/blog/most-cited-domains-ai/

Retrieval matches meaning — so phrasing matters

Retrieval systems match your content to a query by semantic relevance. If a customer asks for an "Invisalign consultation in the city centre" and your page says exactly that, you're a candidate. If your page says "we love helping our patients smile," there's nothing for the query to match, and you're invisible — no matter how lovely the sentiment.

This is the single most common own-goal. Owners write warm, generic copy that connects with humans skimming a homepage but gives a retrieval engine nothing concrete to grab. The fix is to write the way customers ask: name the service, the place, the price range, the timeline, the specifics.
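To make the matching in this section concrete, here is a toy scoring example. It assumes a simple word-count cosine similarity, which is only a rough substitute for the learned embeddings real retrieval systems use (those can also match synonyms and paraphrases). The page copy is hypothetical, built from the example above.

```python
import math
import re
from collections import Counter


def vectorise(text: str) -> Counter:
    """Turn text into word counts (a toy substitute for a learned embedding)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two word-count vectors, from 0.0 to 1.0."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


query = "Invisalign consultation in the city centre"
specific_page = "Book an Invisalign consultation in the city centre"  # hypothetical copy
generic_page = "We love helping our patients smile"

specific_score = cosine(vectorise(query), vectorise(specific_page))
generic_score = cosine(vectorise(query), vectorise(generic_page))

print(f"specific page: {specific_score:.2f}")
print(f"generic page:  {generic_score:.2f}")
```

The specific page scores well above the generic one, which shares no words with the query and scores zero. A production system would be more forgiving of wording, but the direction of the effect is the same.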

The hard gate: can the engine even read you?

Before any of this matters, the engine has to be able to fetch and parse your page. Ahrefs' analysis of ChatGPT's most-cited pages reported that 67% of the content models want to cite is effectively off-limits: blocked by robots rules, hidden behind scripts, or otherwise unreadable to crawlers. That's a self-inflicted wound. Pages that could be cited aren't, because the door is shut.

Ahrefs — ChatGPT's most-cited pages (67% off-limits to crawlers) — ahrefs.com/blog/chatgpts-most-cited-pages/

Two practical checks: make sure your important pages are server-rendered or otherwise readable without running JavaScript, and make sure you're not accidentally blocking the AI crawlers you actually want.
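Both checks can be roughed out with Python's standard library. Treat this as a sketch under assumptions: the crawler names in `AI_CRAWLERS` are illustrative and should be confirmed against each vendor's current documentation, and the sample robots.txt and HTML are hypothetical. The second check only inspects raw HTML, so it approximates what a crawler that does not run JavaScript would see.

```python
import re
from urllib.robotparser import RobotFileParser

# Crawler names are assumptions for illustration; confirm them in each vendor's docs.
AI_CRAWLERS = ["GPTBot", "ClaudeBot", "PerplexityBot"]


def blocked_crawlers(robots_txt: str, url: str, agents: list[str]) -> list[str]:
    """Return the crawlers that this robots.txt forbids from fetching url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [agent for agent in agents if not parser.can_fetch(agent, url)]


def visible_without_js(html: str, phrase: str) -> bool:
    """Check whether phrase appears in the raw HTML once scripts and tags are stripped."""
    no_scripts = re.sub(r"<script\b.*?</script>", " ", html, flags=re.DOTALL | re.IGNORECASE)
    text = re.sub(r"<[^>]+>", " ", no_scripts)
    return phrase.lower() in " ".join(text.split()).lower()


# Hypothetical robots.txt that blocks one AI crawler by accident.
sample_robots = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""

# Hypothetical page whose key copy only exists inside a script.
sample_html = (
    "<html><body><div id='app'></div>"
    "<script>render('Invisalign consultation')</script></body></html>"
)

print(blocked_crawlers(sample_robots, "https://example.com/invisalign", AI_CRAWLERS))
print(visible_without_js(sample_html, "Invisalign consultation"))
```

In practice you would feed in your own robots.txt and the HTML returned by a plain HTTP request to each important page, then fix anything that shows up as blocked or invisible.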

Trust signals tip the balance

Among readable, relevant candidates, engines favour sources that look credible: clear authorship, structured data, corroboration across multiple sites, recency, and real-world reputation signals like reviews. Practitioner research consistently finds freshness and structured markup associated with higher citation rates, and schema markup measurably helps engines understand and surface your content.

Ahrefs — fresh content and AI citations — ahrefs.com/blog/fresh-content/; Otterly.ai — schema markup's real impact on AI search — otterly.ai/blog/schema-markup-real-impact-ai-search/; Cyrus Shepard / Zyppy — AI citation ranking factors (synthesis of 54 experiments) — signal.zyppy.com/p/ai-citation-ranking-factors
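As an illustration of what structured data looks like, the sketch below generates a small JSON-LD description using schema.org's `LocalBusiness` type. The business name, URL, and service are hypothetical, and the choice of properties is one reasonable option rather than a recommended template; which fields matter most to any given engine is not something the sources above settle.

```python
import json


def local_business_jsonld(name: str, url: str, service: str, area: str) -> str:
    """Build a schema.org LocalBusiness description as a JSON-LD string."""
    data = {
        "@context": "https://schema.org",
        "@type": "LocalBusiness",
        "name": name,
        "url": url,
        "areaServed": area,
        "makesOffer": {
            "@type": "Offer",
            "itemOffered": {"@type": "Service", "name": service},
        },
    }
    return json.dumps(data, indent=2)


# Hypothetical business details, for illustration only.
markup = local_business_jsonld(
    name="Example Dental",
    url="https://example.com",
    service="Invisalign consultation",
    area="City centre",
)

# JSON-LD is embedded in a page inside a script tag of this type.
print(f'<script type="application/ld+json">\n{markup}\n</script>')
```

The markup repeats, in machine-readable form, the same specifics the page copy should already state: the service, the place, and who offers it.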

What this means for your effort

Stack the mechanisms and the to-do list writes itself: be readable (crawlable, parseable), be relevant (say the specific words buyers use), be retrievable across the third-party sources AI trusts (reviews, communities, professional networks), and be credible (sourced, structured, fresh). You can't control the model. You can control every one of those inputs — and measure the result.

Sources

Profound — the data on Reddit & AI search (4B+ citations analysed) — tryprofound.com/blog/the-data-on-reddit-and-ai-search
Peec AI — AI search engines cite Reddit, YouTube and LinkedIn most, via Search Engine Land — searchengineland.com
Semrush — the most-cited domains in AI: a 3-month study — semrush.com/blog/most-cited-domains-ai/
Ahrefs — ChatGPT's most-cited pages (67% off-limits to crawlers) — ahrefs.com/blog/chatgpts-most-cited-pages/
Ahrefs — fresh content and AI citations — ahrefs.com/blog/fresh-content/
Otterly.ai — schema markup's real impact on AI search — otterly.ai/blog/schema-markup-real-impact-ai-search/
Cyrus Shepard / Zyppy — AI citation ranking factors (synthesis of 54 experiments) — signal.zyppy.com/p/ai-citation-ranking-factors
Semrush — LinkedIn AI Visibility Study (89K cited URLs across 325K prompts) — semrush.com/blog/linkedin-ai-visibility-study/