The conflation, the study, and the wrong question

There is no topic in contemporary SEO where the gap between primary evidence and folk consensus is wider than Schema. The industry has been carrying a single word — schema — through three jobs at three timescales for three different consumer systems, and the resulting confusion is structural, not incidental. The equation that costs the most is the familiar one: Schema = JSON-LD = rich results. Collapse those three into one, and you get the recurring pathology I have elsewhere called the Schema Sacrifice: practitioners over-engineer markup against the narrow target of rich-result eligibility, then conclude “Schema is dead” the moment Google retires a display feature.

Schema is not dead. What dies, on a roughly biennial schedule, are display features.

The conflation became empirically measurable on 11 May 2026, when Ahrefs published a causal study of 1,885 pages tracking what happens to AI citations after adding JSON-LD markup. The result was a null finding: ChatGPT citations rose 2.2%, Google AI Mode 2.4%, both indistinguishable from zero, and Google AI Overviews actually fell 4.6%. Within forty-eight hours, half of SEO Twitter declared schema dead. The other half was already busy quoting vendor blogs claiming Schema produces a 2.5× citation boost.

Both camps are wrong, and they are wrong in the same way. They are answering a question Schema was never built to answer, measured at the layer where Schema does least, on a timescale that cannot detect the layer where Schema does most.

The Ahrefs study is methodologically clean and tests the wrong thing: a formulation from my own analysis of the study, one that Suganthan Mohanadasan has since carried forward and that this article will adopt and extend. To see why both camps miss the point, we need to separate three things the industry insists on fusing:

  1. The architectural layers of structured data (vocabulary, serialisation, output).
  2. The temporal phases at which different systems read it (index time, pretraining time, query time).
  3. The consumer systems that actually do the reading (Google’s index pipeline, LLM pretraining corpora, LLM runtime retrieval).

The same JSON-LD block on the same page is read by all three, for completely different purposes, on completely different clocks. Treat them as one thing, and every claim about Schema becomes partially right and mostly meaningless.

Three layers, not one: vocabulary, serialisation, output

Start with the architecture.

Schema.org is a vocabulary: a set of types (Person, Organization, Product, Recipe, Article, thousands more) and properties (name, author, sameAs, priceCurrency, knowsAbout) maintained since 2011 by a consortium founded by Google, Microsoft, Yahoo, and Yandex, with R.V. Guha as the architectural lead and Dan Brickley as its long-running curator. It sits, formally, on top of RDF Schema (RDFS) and is therefore expressible in any RDF-compatible syntax.

JSON-LD, Microdata, and RDFa are serialisations: alternative syntactic vessels for the same semantic content. The analogy is closer than it sounds. A vocabulary is a language. A syntax is a grammar. A serialisation is an alphabet — Roman, Cyrillic, Greek — different scripts writing the same words. A parser converting Microdata, RDFa, or JSON-LD to RDF triples will produce semantically equivalent output in all three cases.

JSON-LD became Google’s preferred serialisation not because it is semantically richer than the others but because it is operationally cleaner:

  • It lives in a separate <script type="application/ld+json"> block.
  • It doesn’t entangle itself with the rendered DOM.
  • It can be injected through tag managers or server-side templates without touching front-end code.

The preference is a deployment recommendation, not a semantic ruling. Microdata and RDFa remain valid and parsed.
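
The operational point is easy to demonstrate: because JSON-LD sits in its own script element, a consumer can lift it out and parse it without touching the rendered DOM. A minimal sketch using only Python's standard library (the page and entity here are invented for illustration):

```python
import json
from html.parser import HTMLParser

class JSONLDExtractor(HTMLParser):
    """Collect and parse <script type="application/ld+json"> blocks."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buf = []
        self.blocks = []  # parsed JSON-LD objects, in document order

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True

    def handle_data(self, data):
        if self._in_jsonld:
            self._buf.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self.blocks.append(json.loads("".join(self._buf)))
            self._buf = []
            self._in_jsonld = False

html_doc = """<html><body>
<h1>Tratos</h1>
<script type="application/ld+json">
{"@context": "https://schema.org", "@type": "Organization", "name": "Tratos"}
</script>
</body></html>"""

parser = JSONLDExtractor()
parser.feed(html_doc)
print(parser.blocks[0]["@type"])  # Organization
```

The whole semantic payload comes out in one cheap pass, with the visible HTML left untouched; Microdata and RDFa, by contrast, have to be disentangled from the rendered markup itself.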

Rich results are outputs, one possible consumer-facing manifestation, gated on a small subset of the vocabulary, available in a limited set of jurisdictions, and subject to revocation at any time. Google supports something on the order of eight hundred Schema.org types for content understanding; its Rich Results Gallery covers a few dozen. The implication is direct: most of the vocabulary does work that never produces a visible SERP change.

Google’s own Organization documentation is explicit on this point. It lists no required properties and instead says “we recommend adding as many properties that are relevant to your organization” — twenty-eight suggested properties for an Organization, against Schema.org’s own forty-plus direct properties on the type. The vocabulary is materially broader than the rich result documentation surfaces. Practitioners who only mark up rich-result-eligible types and properties are not implementing Schema; they are implementing one narrow slice of one consumer’s documentation.

The three layers fail and evolve independently. A serialisation can be deprecated without retiring the vocabulary. A rich-result eligibility can be revoked without invalidating the underlying markup. This independence is precisely what the FAQ/HowTo episode of 2023–2026 has now demonstrated in policy terms, and we will return to it.

Three lives, not one: index time, pretraining time, query time

Architecture tells you what Schema is. To explain when it does what for whom, we need a second axis, which is what Suganthan Mohanadasan has usefully labelled “the three lives” of schema markup, each running on a different clock for a different consumer system.

Life 1 is Google’s index pipeline. This is the original use case, the one Schema.org was designed for in 2011, and the one that still pays the highest dividend. Googlebot extracts both the visible HTML and the JSON-LD block on its first fetch (with JavaScript-injected markup queued for a second render pass through the Web Rendering Service); the visible content feeds the ranking pipeline, while the structured data feeds an entity pipeline that identifies which entity the page describes, checks whether that entity exists in the Knowledge Graph, follows sameAs links to anchor the entity across the web, and either creates or updates the entity record. This whole process runs offline, in batch, before any user has searched for anything.

The outputs are rich results where types are eligible, Knowledge Panels for organisations and people, author attribution, entity disambiguation in branded queries, and feedstock for Google’s own generative systems. At Search Central Live Toronto in April 2026, Google structured data engineer Ryan Levering reaffirmed that this pipeline also feeds AI Overviews and AI Mode, listing four reasons structured data still matters:

  1. It is more precise than LLM extraction on complex content like product pricing.
  2. It can express information not visible on the page.
  3. Parsing structured data is cheaper than repeatedly inferring meaning from prose.
  4. The right markup focuses machine attention so irrelevant context is excluded.

Life 2 is LLM pretraining. Foundation models — GPT, Claude, Gemini, and the rest — train on enormous text corpora. While the largest single public input is Common Crawl, which captures full HTML pages, modern cleaning and text-extraction pipelines widely strip out JSON-LD blocks. The data curation heuristics for datasets like C4 strip lines containing curly brackets, removing most JSON-LD. Highly optimized modern scraping pipelines, such as FineWeb, explicitly isolate the main visible text prose and discard script tags entirely.
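
The curly-bracket heuristic described above can be sketched in a few lines. This is a deliberately crude, line-level approximation (the page text and product are invented; real cleaning pipelines chain many more filters):

```python
def c4_style_line_filter(text: str) -> str:
    """Drop any line containing curly brackets.
    A crude stand-in for one of C4's cleaning heuristics; real
    pipelines also apply language ID, deduplication, and quality filters."""
    return "\n".join(
        line for line in text.splitlines()
        if "{" not in line and "}" not in line
    )

page_text = """Acme Widgets are precision-machined in Sheffield.
{"@context": "https://schema.org", "@type": "Product", "name": "Acme Widget"}
Each widget ships with a ten-year guarantee."""

print(c4_style_line_filter(page_text))
# The JSON-LD line is dropped; only the visible prose survives into the corpus.
```

Whatever facts the markup carried reach the training set only if the prose also states them, which is exactly why Life 2's effects are indirect.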

The JSON-LD code itself does not directly survive cleaning to train foundation models. Instead, Schema’s intervention at Life 2 is structural and indirect. For over a decade, clean structured data has fed canonical entity stores and Google’s Knowledge Graph. Because these optimized entity profiles, search panels, and authoritative surfaces are heavily represented in public text and deeply ingested into training sets, a site’s historical structured data compounds into an LLM’s parametric memory via the entity stores it originally populated. This layer is invisible to short-term measurement; you cannot run a thirty-day test against it. Every new model version trained on a fresh snapshot carries forward, or fails to carry forward, the entity representation your site has been emitting for years.

Life 3 is LLM runtime retrieval. This is the layer that the Ahrefs study measured. When you ask ChatGPT, Claude, Perplexity, or Google AI Mode a question that triggers a web search, the system fetches live pages, parses them, and uses the extracted content to ground its answer. The searchVIU experiment of October 2025, documented in Search Engine Land, tested this behaviour across all five major systems and found a consistent pattern: when an assistant fetched a live page through its standard retrieval pipeline, it stripped the JSON-LD blocks and relied entirely on the visible HTML. The schema was ignored as structured data at this layer.

The Williams-Cook test of February 2026 clarified the exact mechanical reason: LLMs tokenise the entire HTML response, including script blocks, so they can read the characters inside schema as plain text, but they do not parse them as structured data. At Life 3, Schema is functionally treated as raw text characters, not interpreted structure. While industry discussion often conflates this with direct schema parsing by LLMs, there remains no verified first-party confirmation from providers like OpenAI, Anthropic, or Perplexity that runtime retrieval engines interpret raw JSON-LD blocks natively.

This is why the Ahrefs study produced a null result. It measured Life 3 on a thirty-day window. Life 3 is the layer where Schema is most likely to be stripped, and even when it survives, it is read as plain text rather than a parsed structure. Life 1’s effects — Knowledge Graph membership, entity disambiguation, rich-result eligibility, author attribution — operate on indexing and re-indexing cycles that the study’s pages had already passed through. Life 2’s indirect entity effects compound across snapshots and model releases, on a clock measured in years, not weeks. The study is methodologically right, the layer it tests is real, and the layer it tests is the one where Schema does the least.

Schema is one species in a wider genus

Before going further, a brief piece of taxonomic hygiene is in order.

“Structured data” is the genus; Schema.org is one species.

The family includes:

  • Dublin Core (a fifteen-element library and digital-archive vocabulary from the mid-1990s).
  • FOAF, “Friend of a Friend” (Dan Brickley and Libby Miller’s vocabulary for describing people and social relations, conceptually upstream of schema:Person/knows).
  • SKOS, the Simple Knowledge Organization System (a W3C standard for thesauri and topical hierarchies).
  • GoodRelations (Martin Hepp’s e-commerce ontology, formally folded into Schema.org around 2012 and surviving inside the product/offer model).
  • Microformats and Microformats2 (h-card, h-recipe, h-product — Tantek Çelik’s lightweight class-attribute lineage, methodologically distinct from RDF).
  • In the broader linked-data ecosystem, Wikidata (Denny Vrandečić’s multilingual, collaboratively-edited factual graph, increasingly the neutral knowledge graph against which entity disambiguation is best anchored).

Then some formats are structured data without being vocabularies at all: XML, CSV, JSON proper, RSS (which Guha himself originated in its earliest form), and XML sitemaps. They define shape and serialisation but say nothing about the conceptual model they carry.

The reason this genus/species distinction matters is that the SEO industry (and Google itself in its documentation) routinely uses “Schema” and “structured data” interchangeably, and the substitution hides three independent decisions: which vocabulary, which serialisation, and what real-world referents the markup actually identifies.

Collapse all three into “add Schema,” and you get markup that validates against a schema but does no entity disambiguation, embeds no sameAs links, and contributes nothing to any knowledge graph beyond decoration.

What Schema is actually for

Aaron Bradley’s canonical formulation — that structured data shifted SEO “from strings to things” — is right, but only if practitioners are willing to identify the things. Five purposes, in descending order of evidential support.

1) Entity disambiguation.

When a page declares @type: Organization, name: "Tratos", and sameAs: ["https://www.wikidata.org/wiki/Q…", "https://www.linkedin.com/company/tratos", "https://www.tratosgroup.com"], it is doing what prose cannot do: asserting, in machine-resolvable terms, which Tratos this page is about, and pointing to authoritative external nodes any consumer can join against.
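
Written out as an actual block, the assertion looks like this (the Wikidata Q-identifier is left elided exactly as in the prose, and the @id fragment is an illustrative convention, not a requirement):

```json
{
  "@context": "https://schema.org",
  "@type": "Organization",
  "@id": "https://www.tratosgroup.com/#organization",
  "name": "Tratos",
  "url": "https://www.tratosgroup.com",
  "sameAs": [
    "https://www.wikidata.org/wiki/Q…",
    "https://www.linkedin.com/company/tratos"
  ]
}
```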

This is the architectural core of what Jason Barnard calls the Entity Home. The discipline became materially more consequential in June 2025, when Google executed what he has documented as the “Great Clarity Cleanup”: roughly three billion entities removed from the Knowledge Graph, a 6.26% contraction of Google’s world understanding, with the largest share of removals targeting “multityped” and ambiguous “Thing” classifications.

Surviving the cleanup requires unambiguous links to a specific category (Organization, Person, Product) supported by consistent external validation. sameAs discipline is now the principal mechanism by which this is asserted, and the targets of the property have shifted upward in authority: Wikipedia, Wikidata, government registries (Companies House, LEI, INPI), DUNS numbers, and primary-source social profiles.

2) Knowledge graph contribution.

Google’s Knowledge Graph, Microsoft’s Satori, and the various vertical graphs consume Schema.org markup as one of many inputs alongside Wikidata, licensed datasets, structured feeds, web extraction, and human curation.

Google’s published work on the Knowledge Vault describes a probabilistic knowledge base that accretes facts about entities by combining text extraction with structured-data extraction and prior knowledge from existing graphs, assigning confidence scores rather than treating any single source as ground truth.

Bill Slawski consistently emphasised that the entity layer is where Google has invested its most expensive engineering for over a decade, and Schema is one of the cheapest, cleanest inputs to that pipeline.

3) Machine-readable semantic explication.

Brickley and Guha’s original framing in their 2016 ACM paper:

  1. A page has an underlying meaning that humans extract from prose.
  2. Schema markup makes a subset of that meaning explicit, so search engines, voice assistants (and, I would hypothesise, LLM-based agents) do not need to reconstruct it lossily.

The point is cooperation, not gaming.

4) Type and property constraints on interpretation.

Declaring @type: MedicalCondition rather than @type: Article is a strong commitment about how a consumer should model the page, which property set to expect, which kinds of authority to require, and which factual claims to extract.

The discipline cascades: a Product constrains for brand, manufacturer, offers; a Person constrains for worksFor, knowsAbout, sameAs; an Article constrains for author, datePublished, publisher.

5) Multilingual disambiguation.

A sameAs link to a Wikidata Q-identifier is language-neutral.

An Italian page about Roma and an English page about Rome, both sameAs-linked to the same Q-id, are joined into a single entity node.
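
The join itself is mechanically trivial once both pages assert the same identifier. A toy resolver (the page URLs are hypothetical; Q220 is Rome's actual Wikidata identifier):

```python
def join_by_sameas(pages):
    """Group page URLs into entity clusters by shared sameAs targets.
    A toy resolver: real knowledge graphs weigh many more signals."""
    clusters = {}
    for url, sameas_links in pages.items():
        for target in sameas_links:
            clusters.setdefault(target, set()).add(url)
    return clusters

pages = {
    "https://example.it/roma":  ["https://www.wikidata.org/wiki/Q220"],
    "https://example.com/rome": ["https://www.wikidata.org/wiki/Q220"],
}

print(sorted(join_by_sameas(pages)["https://www.wikidata.org/wiki/Q220"]))
# ['https://example.com/rome', 'https://example.it/roma']
```

No language detection, no translation, no string matching on "Roma" versus "Rome": the Q-identifier does all the work.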

Two properties deserve specific mention because they are dramatically underused: about and mentions.

The first names the principal entity a page is about; the second names entities that are referenced without being central. Together they allow a page to declare its semantic topology — what it argues, what it cites, what it merely brushes past — in terms a Knowledge Graph can act on.
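
A sketch of the shape, with a hypothetical headline: about declares the page's principal subject, while mentions lists entities the page references in passing:

```json
{
  "@context": "https://schema.org",
  "@type": "Article",
  "headline": "Why structured data is entity infrastructure",
  "about": { "@type": "Organization", "name": "Schema.org" },
  "mentions": [
    { "@type": "Person", "name": "Dan Brickley" },
    { "@type": "Organization", "name": "Google" }
  ]
}
```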

Most pages that carry Schema markup carry none of this. They declare a type and a name and stop, resulting in markup that satisfies a validator but contributes nothing to the entity layer.

The synthesis across these five purposes is the one I have been pushing in my Ahrefs-study post, and the one Suganthan now carries forward: Schema is entity infrastructure, not a citation lever. The work is patient, compounds across consumer systems, and resists single-window measurement.

Where Schema actually intervenes in the IR pipeline

With both axes walked through, the architectural and the temporal, the picture clarifies.

  1. Crawling and parsing (Life 1). Schema extracted on first fetch for hardcoded JSON-LD; JavaScript-injected markup queued for the Web Rendering Service. Both paths end in the same indexing pipeline.
  2. Indexing (Life 1). Facts derived from structured data flow into Google’s document representations alongside many other features. Some Schema-derived attributes are demonstrably consumed and surface in displays. Others appear to be retained more weakly. The 2024 Content Warehouse API leak surfaced entity-attribute and type-assignment modules consistent with this picture, but did not surface a “schema is a ranking signal” feature.
  3. Entity resolution and Knowledge Graph integration (Life 1). The highest-value contribution. sameAs, type declarations, and disambiguating properties materially help the entity resolution pipeline. This is the layer where the work compounds, and the layer where the June 2025 Great Clarity Cleanup raised the bar.
  4. Ranking. The evidence runs strongly against the direct effect. John Mueller and Danny Sullivan have repeatedly stated on the record that Schema has nothing to do with rankings and does not give you a ranking boost. Ryan Levering added a useful piece of mechanism: Schema is treated as a signal, not absolute truth, and is overridden when markup conflicts with visible content or established authority signals. The indirect path — rich result → CTR → behavioural signal → ranking — is real, but compressing it into “Schema is a ranking factor” is a four-step chain misrepresented as one direct effect.
  5. Presentation layer (Life 1). Rich results, Knowledge Panels, sitelinks, and the rest. Type-specific, jurisdiction-specific, and subject to revocation.
  6. LLM retrieval and grounding (Lives 2 and 3). Google has confirmed that Life 1 structured-data work feeds AI Overviews and AI Mode. Across OpenAI, Anthropic, Perplexity, and Microsoft, there is no verified first-party confirmation that runtime engines parse JSON-LD natively; the empirical evidence points to Life 3 as the layer where Schema’s parsed structure is stripped or ignored.

The Schema Sacrifice: the FAQ/HowTo episode as cautionary precedent

The most concrete demonstration that the three architectural layers fail independently — and that strategies tightly coupled to a single rich-result eligibility are structurally fragile — is the FAQ/HowTo deprecation cycle of the past three years.

The timeline is worth keeping in front of any 2026 Schema strategy discussion:

  • August/September 2023: Google restricts FAQ rich results to authoritative sites and deprecates HowTo rich results.
  • 7 May 2026: Full FAQ rich-result deprecation across all categories, with the official documentation explicitly carrying a sentence that most trade-press coverage buried: Google will continue to use FAQ structured data to help understand pages, even though the rich-result feature is gone.
  • June/August 2026: Search Console UI and API support for FAQ rich results retired.

The lesson is the architecture in policy form. The display feature (output) was retired. The vocabulary (Schema.org FAQPage) remains valid. The markup is still parsed at Life 1 for understanding, still feeds Life 2 indirectly across long-run training snapshots, and is still read as plain text characters at Life 3.

Practitioners who built FAQ strategies as ends in themselves are stranded. Practitioners who treated FAQPage as one of many semantic assertions about a page have lost a CTR enhancement and kept everything else. This is the Schema Sacrifice in its purest form. Any current advice that treats rich-result eligibility as the dominant return on Schema investment is one product decision away from obsolescence. Anyone watching the cycle for a third time should stop sacrificing.

AI search and Schema: what the evidence actually shows

This is the co-equal pillar of the argument: the three lives, walked through honestly.

At Life 2 (Pretraining), data cleaning heavily strips JSON-LD blocks, meaning structured markup code does not directly survive into text corpora. The mechanism is structural and indirect: pages carrying clean, valid structured data correlate with the technical and editorial markers that quality classifiers favour during corpus filtering, and the structured data actively builds the canonical Knowledge Graphs that LLM datasets reference.

At Life 3 (Query-Time Retrieval), the searchVIU experiment showed JSON-LD stripped across major assistants when fetching a live page, and the Williams-Cook test showed schema content read as plain text characters rather than interpreted structure. The Ahrefs causal study of May 2026 measured 1,885 pages over thirty days and produced a null result on AI citation uplift. These pieces of evidence converge: at Life 3, Schema does not directly drive runtime citation selection on short timescales.

None of this means Schema fails; it means Life 3 is the wrong layer to measure. Dan Petrovic’s work on query fan-out shows that internal sub-queries used by AI systems to retrieve candidate sources are themselves entity-and-attribute-driven. The cleaner a candidate page makes its entity and attribute claims at Life 1, the higher the probability it survives the structural fan-out filters into the final citation set.

This shapes the frontier between parametric and dynamic visibility. Parametric visibility is what the model knows when web search is off: a function of entity representation strength in the data layers that populate training corpora (Life 2 territory). Dynamic visibility is what the model finds when web search is on (Life 3 territory, with significant Life 1 prerequisites). Andrea Volpini’s concept of “ghost citations” names the phenomenon that emerges when Life 1 entity infrastructure feeds into retrieval through strong entity connections, even when a document’s traditional ranking history doesn’t predict it.

The honest summary across all three lives is this:

  1. Google has confirmed that Life 1 structured-data work feeds AI Overviews and AI Mode.
  2. Other major providers remain silent on Schema specifically.
  3. The cleanest causal study to date produced exactly the null result that Life 3’s plain-text limitations predict.

Vendor “Schema increases AI citations by X%” claims should be read with heavy skepticism. The defensible systemic claim is narrower and stronger: AI search systems are entity-and-attribute-aware; Schema is the cheapest, cleanest way to communicate entity and attribute information; and building for the layers that compound is the only sustainable strategy.

Governance and the trust asymmetry

The reason Schema can be trusted as a long-term semantic contract even when individual rich-result eligibilities cannot is governance. Schema.org operates through a W3C Community Group, and its formal Steering Group is chaired by R.V. Guha in an individual capacity and includes named representatives from Google, Microsoft, Yahoo, and Yandex, plus key community contributors. Proposals move through the GitHub repository and reach the core vocabulary through Steering Group approval.

The Knowledge Graph itself, meanwhile, is governed by Google purely as a product. It expands and contracts according to product decisions, as the June 2025 cleanup demonstrated when it removed 6.26% of its entity inventory in a single update.

This is the asymmetry that explains why the Schema Sacrifice keeps happening: practitioners build on top of product features (rich results) that the platform can revoke at will, when they could be building on top of the vocabulary contract, which is community-governed and stable across product cycles.

What you build on stable infrastructure survives; what you build on the cleanup floor doesn’t.

Implementation discipline

The pathologies cluster into recognisable types:

  1. Markup that contradicts visible content.
  2. Deceptive or manufactured markup.
  3. Over-engineering Schema at the expense of foundational SEO: Jarno van Driel’s now-canonical critique that “stuffing pages with all of schema.org is a strategy-killing tactic”.

The positive discipline is the inverse. The framing I have used in my work — HTML for indexing, JSON-LD for calculation — captures the productive distinction. Schema markup is the surface of a graph, not a per-page sticker. The graph has persistent @id URIs that don’t change once set, because those identifiers are how Knowledge Graphs and systems learn to anchor your entity across pages. Nested structures express actual relationships (Person → worksFor → Organization). Isolated blocks on the same page tell separate, fractured stories; a nested graph with @id references tells one.
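
A minimal sketch of that nested-graph pattern, with hypothetical names and URIs: the Person node references the Organization node by @id, so the two blocks tell one connected story instead of two fractured ones:

```json
{
  "@context": "https://schema.org",
  "@graph": [
    {
      "@type": "Organization",
      "@id": "https://example.com/#organization",
      "name": "Example Ltd",
      "sameAs": ["https://www.linkedin.com/company/example-ltd"]
    },
    {
      "@type": "Person",
      "@id": "https://example.com/#jane-doe",
      "name": "Jane Doe",
      "jobTitle": "Head of Engineering",
      "worksFor": { "@id": "https://example.com/#organization" }
    }
  ]
}
```

Because the @id values are persistent, other pages on the site can reference the same nodes, and a consumer can merge everything into a single entity graph.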

The Games Workshop / Darren Latham example from the Branded SEO Guide 2026 I wrote for Advanced Web Ranking makes the pattern concrete: a Person node with an additionalName, jobTitle, worksFor linking to a stable Organization @id, knowsAbout listing specific expertise, and sameAs links pointing to authoritative external profiles.

The markup identifies the person unambiguously, anchors them to a verifiable organisation, declares their expertise in terms a topical relevance system can act on, links to external corroboration, and survives the deprecation of any specific rich-result feature because it isn’t asking for one.

This is the difference between treating Schema as decoration and treating it as the entity infrastructure of a brand.

Closing: registration, not advertising

The cleanest mental model for what Schema is doing is the analogy I presented in my previous analysis of the Ahrefs study: business registration, not advertising. You don’t register your company with the government because you expect an overnight sales boost. You register because being a formally recognisable legal entity is the foundation that other things sit on top of: signing contracts, opening bank accounts, and being referenced in legal documents.

Schema works the same way. Being a formally recognisable structured entity is the foundation Google’s Knowledge Graph indexes against, retrieval systems use to disambiguate when prose is ambiguous, and consumer-facing surfaces draw from when the moment comes to feature you. Adding Schema does not directly cause citations or rankings; it makes you legible to the systems that decide whether to feature you. Those are very different things.

Andrea Volpini’s “structure is the moat” formulation is the contemporary endpoint of Aaron Bradley’s “strings to things”. The discipline is patient construction: a stable entity home, persistent @id URIs, sameAs discipline against authoritative registries, nested graphs, and an honest assessment that no single rich-result eligibility is permanent.

A page is a narrative; a narrative has characters, settings, and relationships. Schema markup is the page declaring its own dramatis personae directly — this is the protagonist, this is the setting, this is the relationship to that other entity. When the markup is honest and aligned with the prose, both human and machine readers leave with the same reconstruction.

Schema is not dead. It is also not a magic AI citation lever. It is the entity infrastructure that makes you formally recognisable to whichever system happens to be reading at whichever timescale. Add Schema cleanly, keep it consistent, don’t expect overnight wins from the layer where overnight wins aren’t possible, and let all three lives work together.