Chapter 7: Knowledge-Grounded Text Analytics
About This Chapter
The preceding chapters have built a progressively more capable analytics stack. Chapter 3 showed how large language models read and encode text by predicting the next token, learning extraordinary fluency in the process. Chapter 7 showed how those same models, equipped with tools and memory, become agents capable of multi-step reasoning. Chapter 9 showed how foundation models — pretrained at web scale and adapted cheaply via LoRA — serve as universal building blocks for nearly every downstream NLP task. Together, these three chapters describe a system that is fluent, adaptive, and generative. They describe, in short, most of what makes the current generation of AI systems remarkable.
But they also describe a system with a structural blind spot: knowledge. An LLM stores everything it knows implicitly, in the weights of a neural network trained on a frozen snapshot of the internet taken at some point in the past. Ask it who leads a company, what regulation governs a transaction, or which entity owns a subsidiary, and it will answer confidently — drawing on patterns absorbed during pretraining. The answer may be right. It may also be wrong in a way that the model cannot detect, because it has no mechanism for distinguishing between a remembered fact and a hallucinated one. It may be stale, because the company changed its CEO six months after the model’s training cutoff. And it is essentially uncitable: the model cannot tell you which document it used.
For casual conversation, these limitations are tolerable. For the analytics applications that matter in finance, law, compliance, and brand intelligence — domains where a wrong fact has real consequences — they are disqualifying.
The missing piece is structure. Specifically, it is the knowledge graph: an explicit, queryable representation of entities and the relations between them. A knowledge graph (KG) can tell you, with certainty, that as of a specified timestamp, Jensen Huang is the CEO of NVIDIA, that NVIDIA is headquartered in Santa Clara, that its primary SIC code is 3674, and that its board includes four members flagged as politically exposed persons in an AML database. Every one of those facts is a typed, time-stamped triple that can be sourced, updated, and audited. No hallucination. No staleness beyond the last update cycle. Full attribution.
This chapter shows how to combine LLMs with knowledge graphs to build text analytics that is both fluent and factually grounded. The combination is not simply additive — putting an LLM next to a KG and hoping they cooperate. It requires a specific pipeline: named entity recognition to find the entities in text, entity linking to resolve ambiguous mentions to canonical KG nodes, relation extraction to discover new triples from text, and KG-augmented retrieval to ground LLM generation in verified structured facts. Each of these steps is a discipline in its own right; together, they constitute the emerging paradigm of knowledge-grounded NLP.
The social media context makes this doubly important. Social media text is where new facts first appear — before they reach Wikipedia, before they reach regulatory filings, before they reach any structured database. The challenge of building knowledge graphs from noisy, informal, rapidly evolving social text is harder than building them from news wires or encyclopaedias; it is also more valuable, because the freshest facts are the highest-signal ones. We treat this hard case explicitly, with its own section and its own set of tools.
The chapter closes with a fully worked compliance-grade case study: an analyst receives fifty news headlines about corporate M&A and governance events, and the pipeline extracts entities, identifies relations, validates against an ontology, and performs graph traversal to surface ownership and sanctions information — all in structured, auditable form.
This chapter builds on the KG structural concepts introduced in the NetworkBook companion volume (Chapter 9: Knowledge Graphs and Semantic Networks) and on the LLM foundations from Chapters 3, 7, and 9 of this book. Key external references are: Mintz et al. (2009), “Distant Supervision for Relation Extraction without Labeled Data” (ACL 2009); Petroni et al. (2019), “Language Models as Knowledge Bases?” (EMNLP 2019); Lewis et al. (2020), “Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks” (NeurIPS 2020); Edge et al. (2024), “From Local to Global: A Graph RAG Approach to Query-Focused Summarization” (arXiv 2404.16130); Yasunaga et al. (2022), “Deep Bidirectional Language-Knowledge Graph Pretraining” (NeurIPS 2022). Non-executing python blocks in this chapter require spacy, transformers, langchain, and rdflib; they cannot run in the browser and are clearly marked throughout.
Table of Contents
- The Problem with Pure-LLM Knowledge
- Named Entity Recognition — The First Mile
- Entity Linking
- Relation Extraction
- Ontology-Aware Information Extraction
- KG-Augmented RAG — The Modern Frontier
- Neuro-Symbolic NLP
- Industry Applications
- Building a KG from Social Media — The Hard Case
- Evaluation
- Limits and Open Problems
- Mini Case Study — Compliance-Grade News Triage
- Closing — The Symbolic Comeback
The Problem with Pure-LLM Knowledge
Knowledge stored in weights
Every large language model is, among other things, a compressed representation of the text it was trained on. Because that text includes encyclopaedias, news archives, financial disclosures, scientific papers, and billions of web pages, the model has absorbed an enormous quantity of factual content. Petroni et al. (2019) demonstrated this explicitly: BERT, prompted with cloze-style templates, could recall factual associations with reasonable accuracy across dozens of knowledge categories — capitals, inventors, birth dates, chemical compositions. For certain categories, BERT’s implicit recall was competitive with purpose-built fact retrieval systems.
This is genuinely impressive. It is also genuinely dangerous. The model’s knowledge is implicit — it is not stored in a table of (subject, relation, object) triples that can be queried, updated, or audited. It is distributed across billions of parameters in a way that is inscrutable. Ask the model a fact, and it produces an answer by pattern-matching across everything it learned; if the training data contained conflicting accounts of the same fact, the model will settle on the most frequent pattern, not necessarily the most recent or the most authoritative one. The model has no epistemic access to its own uncertainty about any specific fact.
Three failure modes follow directly from this architecture.
Hallucination. The model generates confident, fluent statements that are factually false. In large-scale benchmarks, state-of-the-art LLMs hallucinate at rates of 15–30% on closed-domain factual questions — meaning that between one in seven and one in three generated facts is simply wrong. In TruthfulQA, a benchmark specifically designed to probe factual accuracy, even the best models as of 2025 fail on questions that pivot on recently updated facts or on topics where plausible-sounding misinformation is common in training data. The problem is not that the model is careless; it is that the model cannot distinguish between a pattern absorbed from a reliable source and one absorbed from a confident but wrong one.
Staleness. Every LLM has a training cutoff — a date beyond which no new information was incorporated. For GPT-4o, the cutoff is October 2023; for Claude 3.5 Sonnet, it is April 2024. Ask a model about events that occurred after its cutoff, and it will either decline to answer (if aligned to do so) or produce a plausible-sounding answer extrapolated from pre-cutoff patterns — which is a hallucination of a different kind. In financial and compliance analytics, where the identity of a company’s controlling shareholder, the status of a regulatory proceeding, or the sanctions designation of an individual can change in a matter of days, a model with a months-old training cutoff is not a reliable fact source.
Attribution. A model cannot tell you where a fact came from. This is not a minor convenience failure; it is a fundamental accountability failure. In a compliance context, the analyst who relies on a model’s assertion that a counterparty is not sanctions-designated needs to be able to cite the authoritative source of that assertion to a regulator. “The model told me” is not an acceptable citation. The model’s training data cannot be queried after the fact; the specific document from which the fact was learned cannot be recovered.
A toy demonstration: KG lookup vs. implicit recall
The live cell below makes the contrast concrete. We construct a small knowledge graph of ten company–CEO pairs as a Python dictionary — this is, in miniature, exactly what a production knowledge graph does: store explicit, typed, queryable facts. We then define a mock “language model” that “remembers” these pairs with some deliberate errors introduced (representing the stale or hallucinated knowledge that a real LLM might encode). The demonstration shows that the KG lookup is always correct; the implicit model recall fails on several entries.
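The cell itself is short enough to sketch here. The pairs and the deliberate errors below are assumed, illustrative data (the live cell's exact companies may differ), but the mechanics are identical.

# Minimal sketch of the live cell: explicit KG lookup vs. "implicit" recall (assumed data).
kg = {  # explicit, queryable facts: company -> CEO as of the KG's last update
    "NVIDIA": "Jensen Huang", "Apple": "Tim Cook", "Microsoft": "Satya Nadella",
    "Alphabet": "Sundar Pichai", "Amazon": "Andy Jassy", "Meta": "Mark Zuckerberg",
    "Tesla": "Elon Musk", "Netflix": "Ted Sarandos", "Disney": "Bob Iger",
    "IBM": "Arvind Krishna",
}
mock_llm = dict(kg)          # a "model" whose memory is partly stale
mock_llm.update({
    "Amazon": "Jeff Bezos",        # stale: pre-2021 CEO
    "Netflix": "Reed Hastings",    # stale
    "Alphabet": "Larry Page",      # stale
    "Disney": "Bob Chapek",        # stale
    "IBM": "Ginni Rometty",        # stale
})
correct = 0
for company, ceo in kg.items():
    kg_answer = kg[company]        # exact lookup; a real KG also carries a source and timestamp
    llm_answer = mock_llm[company] # fluent but unverifiable recall
    mark = "OK" if llm_answer == ceo else "WRONG"
    correct += (llm_answer == ceo)
    print(f"{company:10s} KG: {kg_answer:15s} | mock LLM: {llm_answer:15s} [{mark}]")
print(f"\nKG lookup accuracy: {len(kg)}/{len(kg)}   mock-LLM recall accuracy: {correct}/{len(kg)}")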
Interpretation. The knowledge graph achieves perfect recall because it stores facts explicitly. The mock LLM achieves 50% accuracy because it retains stale information from before leadership changes — and crucially, it does not signal uncertainty. Every wrong answer is delivered with the same syntactic confidence as every correct one. In a real LLM, the mechanism is more subtle (the model does assign internal probabilities to its outputs) but the failure pattern is identical: stale or confused facts emerge fluently, without a visible reliability signal attached. The KG is the remedy: every fact has a source, a timestamp, and a defined update procedure.
Named Entity Recognition — The First Mile
The task
Before a knowledge graph can be populated from text, and before a document can be linked to the entities it discusses, the text must be segmented into named entities — the proper nouns that refer to specific people, organisations, locations, products, dates, and other distinguishable things in the world. Named entity recognition (NER) is the task of identifying these spans and classifying each one into a type category.
Formally, let a document consist of a sequence of tokens \(x = (x_1, x_2, \ldots, x_T)\). NER assigns a label \(y_t\) to each token from a tag set that encodes both the entity type and the position of the token within the entity span. The most common encoding is the BIO scheme: B-TYPE marks the beginning of an entity of a given type, I-TYPE marks a continuation, and O marks a non-entity token. A four-class NER system over PERSON, ORG, LOC, and MISC produces a label set of size \(2 \times 4 + 1 = 9\).
The tag sequence for the sentence “Tim Cook of Apple briefed the Cupertino board” would be:
| Token | Tag |
|---|---|
| Tim | B-PERSON |
| Cook | I-PERSON |
| of | O |
| Apple | B-ORG |
| briefed | O |
| the | O |
| Cupertino | B-LOC |
| board | O |
The evaluation metric is the entity-level F1 score: an entity is counted as correctly predicted only if both its span boundaries and its type label match the ground truth exactly. No partial credit is awarded for getting the span right but the type wrong, or vice versa. This strictness reflects the downstream requirement: a linker that receives a mistyped entity will search the wrong namespace.
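To make the strict matching rule concrete, here is a small self-contained sketch (not one of the chapter's live cells) that decodes BIO tag sequences into spans and scores them by exact span-and-type match.

def bio_to_spans(tags):
    """Decode a BIO tag sequence into a set of (start, end_exclusive, type) spans."""
    spans, start, etype = set(), None, None
    for i, tag in enumerate(tags + ["O"]):           # sentinel "O" flushes the final span
        if tag.startswith("B-") or tag == "O" or (tag.startswith("I-") and tag[2:] != etype):
            if start is not None:
                spans.add((start, i, etype))
                start, etype = None, None
        if tag.startswith("B-"):
            start, etype = i, tag[2:]
        elif tag.startswith("I-") and start is None:  # ill-formed I- treated as a new span
            start, etype = i, tag[2:]
    return spans

def entity_f1(gold_tags, pred_tags):
    gold, pred = bio_to_spans(gold_tags), bio_to_spans(pred_tags)
    tp = len(gold & pred)                             # exact boundary AND type match only
    p = tp / len(pred) if pred else 0.0
    r = tp / len(gold) if gold else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0

gold = ["B-PERSON", "I-PERSON", "O", "B-ORG", "O", "O", "B-LOC", "O"]
pred = ["B-PERSON", "I-PERSON", "O", "B-ORG", "O", "O", "B-ORG", "O"]   # Cupertino mistyped
print(f"Entity-level F1: {entity_f1(gold, pred):.2f}")   # 0.67: the mistyped span earns no credit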
Classical sequence labeling models
The dominant classical approach is the Conditional Random Field (CRF), which models the joint probability of the entire label sequence given the input:
\[p(y \mid x) = \frac{1}{Z(x)} \exp\!\left(\sum_{t=1}^{T} \sum_k \lambda_k f_k(y_{t-1}, y_t, x, t)\right)\]
where \(f_k\) are feature functions (indicator functions over token shapes, suffixes, contextual words, gazetteer membership) and \(\lambda_k\) are learned weights. The normalising constant \(Z(x)\) is computed via the forward algorithm; decoding uses the Viterbi algorithm to find the most probable label sequence in \(O(T \cdot |Y|^2)\) time. CRFs explicitly model label transition probabilities — the constraint that I-ORG can follow B-ORG but not O, for instance — which is important for producing valid BIO sequences.
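For intuition about the decoding step, the sketch below runs the Viterbi recursion over hand-set toy scores; the transition matrix simply forbids invalid BIO moves such as I-ORG immediately after O. It is an illustration of the \(O(T \cdot |Y|^2)\) dynamic programme, not a trained CRF.

import numpy as np

tags = ["O", "B-ORG", "I-ORG"]                        # tiny tag set for illustration
NEG = -1e9
trans = np.zeros((3, 3))                              # transition[i, j]: score of tag i -> tag j
trans[tags.index("O"), tags.index("I-ORG")] = NEG     # I-ORG cannot follow O
start = np.array([0.0, 0.0, NEG])                     # a sentence cannot start with I-ORG
# toy emission scores for "shares of NVIDIA Corp rose": rows = tokens, columns = tags
emit = np.array([
    [2.0, 0.1, 0.1],   # shares
    [2.0, 0.1, 0.1],   # of
    [0.2, 1.5, 1.0],   # NVIDIA
    [0.2, 0.4, 1.8],   # Corp
    [2.0, 0.1, 0.1],   # rose
])
T, K = emit.shape
score = start + emit[0]
back = np.zeros((T, K), dtype=int)
for t in range(1, T):                                 # the O(T * K^2) recursion
    cand = score[:, None] + trans + emit[t][None, :]
    back[t] = cand.argmax(axis=0)
    score = cand.max(axis=0)
path = [int(score.argmax())]
for t in range(T - 1, 0, -1):                         # follow backpointers to recover the path
    path.append(int(back[t][path[-1]]))
path.reverse()
print([tags[k] for k in path])    # expected: ['O', 'O', 'B-ORG', 'I-ORG', 'O']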
The BiLSTM-CRF architecture, introduced by Lample et al. (2016) and dominant through the early 2020s, replaces the hand-crafted feature functions with a bidirectional LSTM encoder. Each token representation is the concatenation of a forward LSTM hidden state \(\overrightarrow{h}_t\) and a backward LSTM hidden state \(\overleftarrow{h}_t\), capturing both left and right context:
\[h_t = [\overrightarrow{h}_t;\, \overleftarrow{h}_t]\]
These contextual representations feed into a linear CRF output layer. On the CoNLL-2003 English NER benchmark, BiLSTM-CRF achieved F1 scores in the 91–92% range, a substantial improvement over feature-based CRFs at 88–90%.
Modern transformer-based models (BERT fine-tuned for token classification, and its successors) have pushed CoNLL-2003 entity F1 above 93%, and domain-adapted models on financial or biomedical text exceed 90% F1 on those specialised benchmarks. The core mechanism is the same: each token receives a contextual embedding, a linear classifier maps it to tag scores, and a CRF or greedy decoder produces the final label sequence.
Live demo: regex and dictionary NER on financial news
The live cell below implements a lightweight NER system using two complementary strategies: a regex pattern for ticker symbols (which have a very regular form — one to five uppercase letters, optionally preceded by a dollar sign), and a dictionary lookup for person names and organisation names from an inlined gazetteer. This is not a competitive NER system — modern spaCy or HuggingFace models are far more accurate — but it is entirely transparent, requires no external models, and illustrates the core mechanics of span extraction and type classification.
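A compact sketch in the spirit of that cell follows; the gazetteer entries, ticker regex, and test sentences here are assumed for illustration.

import re

ORG_GAZETTEER = {"NVIDIA", "Apple", "Microsoft", "AMD"}
PERSON_GAZETTEER = {"Jensen Huang", "Tim Cook", "Lisa Su"}
TICKER_RE = re.compile(r"\$?\b[A-Z]{1,5}\b")          # one to five capitals, optional leading $

def tag_entities(text):
    spans = []
    for name in PERSON_GAZETTEER | ORG_GAZETTEER:      # dictionary (gazetteer) lookup
        label = "PERSON" if name in PERSON_GAZETTEER else "ORG"
        for m in re.finditer(re.escape(name), text):
            spans.append((m.start(), m.end(), label, m.group()))
    covered = [(s, e) for s, e, _, _ in spans]
    for m in TICKER_RE.finditer(text):
        # skip ticker matches that overlap a gazetteer entity (no double-tagging)
        if not any(s < m.end() and m.start() < e for s, e in covered):
            spans.append((m.start(), m.end(), "TICKER", m.group()))
    return sorted(spans)

text = ("Jensen Huang said NVIDIA ($NVDA) will expand capacity; AMD fell 2%. "
        "The Santa Clara chipmaker and the Cupertino company both report next week.")
for s, e, label, surface in tag_entities(text):
    print(f"[{label:7s}] {surface!r} (chars {s}-{e})")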
Interpretation. The gazetteer NER extracts named entities wherever their canonical string form appears in text; the ticker regex catches financial symbols in context. Notice that the system correctly avoids double-tagging “NVIDIA” as both an ORG (from the gazetteer) and a TICKER (from the regex). The fundamental limitation of this approach is immediately clear: it cannot handle entity mentions that deviate from the gazetteer string — “the Santa Clara chipmaker” is not caught as a reference to NVIDIA, and “the Cupertino company” is not caught as a reference to Apple. These indirect mentions require a model with commonsense and world knowledge, which is precisely what the neural models in the non-executing block below provide.
The code below requires spacy (with the en_core_web_lg model downloaded) and the HuggingFace transformers library. Neither can run in the browser; both need a full Python environment (CPU is sufficient, though a GPU speeds up the transformer model). Run in a Colab or local environment with !pip install spacy transformers && python -m spacy download en_core_web_lg.
import spacy
from transformers import pipeline
# ── Option 1: spaCy NER (statistical, fast, broad-coverage) ───
nlp = spacy.load("en_core_web_lg") # 685MB model
text = (
"Jensen Huang unveiled NVIDIA's next-generation Blackwell architecture. "
"The Santa Clara company's shares rose 4% in Hong Kong trading."
)
doc = nlp(text)
for ent in doc.ents:
print(f" [{ent.label_:8s}] '{ent.text}' (chars {ent.start_char}–{ent.end_char})")
# Typical tags: Jensen Huang (PERSON), NVIDIA (ORG), Santa Clara (GPE), 4% (PERCENT), Hong Kong (GPE)
# ── Option 2: HuggingFace token classification (BERT-based) ───
ner_pipe = pipeline(
"token-classification",
model="dslim/bert-large-NER",
aggregation_strategy="simple"
)
results = ner_pipe(text)
for r in results:
    print(f"  [{r['entity_group']:8s}] '{r['word']}' (score {r['score']:.3f})")
Entity Linking
The disambiguation problem
Named entity recognition produces spans and types; it does not produce identities. The mention “Apple” in a technology news article almost certainly refers to Apple Inc. (ticker: AAPL, Wikidata: Q312). But in a music licensing story, “Apple” may refer to Apple Records (Wikidata: Q216645), the label founded by The Beatles in 1968. In an agriculture story, it refers to the fruit. The NER tagger assigns the type ORG or MISC; it cannot resolve which specific entity in the world is being referred to.
Entity linking (also called entity disambiguation or entity resolution) is the task of mapping a mention — a string in context — to a canonical identifier in a reference knowledge base. Formally, given a mention \(m\) with surrounding context \(c\), entity linking assigns \(m\) to an entity \(e \in \mathcal{E}\), where \(\mathcal{E}\) is the full set of entities in the KG. The candidate set for “Apple” in \(\mathcal{E}\) may contain dozens of entities; linking selects the correct one.
Two families of features drive disambiguation.
Local context: the words immediately surrounding the mention often disambiguate it. “Apple Inc. announced record iPhone sales” — the words “Inc.”, “iPhone”, and “sales” all point strongly toward Apple Inc. A TF-IDF or BM25 model that represents the document context and each candidate entity’s description can score candidates by context similarity.
Global coherence: all mentions in a document should be linked consistently. If “Cook” is linked to Tim Cook (PERSON, Apple Inc.’s CEO), then the “Apple” mention later in the same document is almost certainly Apple Inc. — not Apple Records. Joint inference over all mentions simultaneously exploits this coherence. The classic model is the coherence graph: build a complete graph over all entity mentions; edge weights represent the semantic relatedness between candidate entities for adjacent mentions; the optimal joint assignment maximises the sum of local evidence plus global coherence edge weights. This is an NP-hard optimisation in general, but greedy approximations work well in practice.
TF-IDF entity disambiguation — a live implementation
The cell below implements a minimal entity linker using TF-IDF similarity between the query context and inlined entity descriptions. For each of five candidate entities, we compute a TF-IDF document vector and score it against three different query sentences containing the mention “Apple.” The correct entity should score highest in each context.
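A sketch of that linker, using scikit-learn's TfidfVectorizer over assumed candidate descriptions and query sentences, looks like this:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

candidates = {   # assumed candidate entities with short descriptions
    "Apple Inc. (Q312)": "American technology company, maker of the iPhone iPad and Mac, "
                         "led by CEO Tim Cook, listed on Wall Street as AAPL",
    "Apple Records": "record label founded by the Beatles in 1968, released Abbey Road, "
                     "associated with Paul McCartney and John Lennon",
    "apple (fruit)": "edible fruit grown in orchards, Washington State organic farming, "
                     "harvest irrigation and cider production",
    "Apple Corps": "multimedia company founded by the Beatles, parent of the Apple Records label",
    "Big Apple": "nickname for New York City used in tourism and journalism",
}
queries = [
    "Apple reported record iPhone sales as Tim Cook addressed Wall Street analysts.",
    "Apple re-released Abbey Road; Paul McCartney discussed the Beatles' back catalogue.",
    "Washington State growers say organic apple yields depend on late-summer irrigation.",
]
vec = TfidfVectorizer(stop_words="english")
doc_matrix = vec.fit_transform(list(candidates.values()) + queries)
cand_vecs, query_vecs = doc_matrix[: len(candidates)], doc_matrix[len(candidates):]
sims = cosine_similarity(query_vecs, cand_vecs)        # cosine similarity of TF-IDF vectors
names = list(candidates.keys())
for q, row in zip(queries, sims):
    best = row.argmax()
    print(f"{q[:60]}...\n  -> {names[best]} (sim {row[best]:.3f})")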
Interpretation. The formal score computed here is:
\[\text{sim}(m, e) = \frac{\mathbf{v}_{context}(m) \cdot \mathbf{v}_{desc}(e)}{\|\mathbf{v}_{context}(m)\|\, \|\mathbf{v}_{desc}(e)\|}\]
where \(\mathbf{v}_{context}(m)\) is the TF-IDF representation of the query context and \(\mathbf{v}_{desc}(e)\) is the TF-IDF representation of the candidate entity’s description. The term-overlap analysis shows exactly why each query resolves correctly: “iPhone”, “Tim Cook”, and “Wall Street” drive similarity toward Apple Inc. in Query 1; “Abbey Road”, “Beatles”, and “Paul McCartney” drive it toward Apple Records in Query 2; “Washington State”, “organic”, and “irrigation” drive it toward the fruit in Query 3. Production linkers augment this TF-IDF baseline with dense bi-encoder embeddings (as in entity linking models such as BLINK and ReFinED) to handle cases where the vocabulary overlap is low but the semantic similarity is high.
When you search for “Apple” on Google, the Knowledge Panel on the right side of the results page is powered by an entity linker operating at internet scale. Google’s Knowledge Graph, which underlies Knowledge Panels, contains over 500 billion facts about 8 billion entities. The entity linker resolves every query against this graph — determining that “Apple” in a query about “Apple stock earnings” maps to the Apple Inc. entity, not the Apple Records or apple fruit entity — and renders a structured card with the entity’s canonical name, key attributes, related entities, and a brief description. The disambiguation signal comes from query context, user location, search history, and the co-occurrence statistics of query terms with entity attributes accumulated across billions of search sessions. The basic mechanism — score candidates by context similarity, select the highest-scoring candidate — is identical to the TF-IDF linker above; the scale and the representation model are radically different.
Relation Extraction
Typed relations between linked entities
Entity linking produces a mapping from text spans to KG nodes. The next step is to discover the relations that hold between pairs of linked entities — the edges that, together with the nodes, constitute the knowledge graph. Relation extraction (RE) is the task of, given two linked entities \(e_1\) and \(e_2\) in a sentence, classifying the relation \(r \in \mathcal{R}\) that holds between them, or determining that no relation in the schema holds.
The output is a triple \((e_1, r, e_2)\): a subject entity, a relation type, and an object entity. For the sentence “Jensen Huang founded NVIDIA in 1993,” the extracted triple is (Jensen Huang, FOUNDED_BY\(^{-1}\), NVIDIA) or equivalently (NVIDIA, FOUNDED_BY, Jensen Huang), depending on the relation direction convention. The schema \(\mathcal{R}\) is defined by the ontology: it might include FOUNDED_BY, CEO_OF, SUBSIDIARY_OF, LOCATED_IN, ACQUIRED_BY, BOARD_MEMBER_OF, and hundreds of others for a financial KG.
A brief taxonomy of extraction methods
Pattern-based extraction exploits the observation that many relations in English are expressed with characteristic lexical patterns. Hearst (1992) introduced the most famous such patterns for the IS-A relation: “X such as Y, Z, and W” implies that X is a hypernym of Y, Z, and W; “X including Y and Z” implies the same. Analogous patterns exist for other relations: “X, the CEO of Y” implies the CEO_OF relation between X and Y; “X acquired Y for $Z” implies ACQUIRED_BY.
Supervised relation classification treats RE as a sentence-level classification problem: given a sentence with two entity spans marked, a classifier assigns a label from \(\mathcal{R} \cup \{\)NO_RELATION\(\}\). Standard models use the [CLS] token of a fine-tuned BERT encoder, augmented with entity-type embeddings, as the sentence representation. The loss is the standard cross-entropy:
\[\mathcal{L}_{RE} = -\sum_{i=1}^{N} \log p_\theta(r_i^* \mid x_i, e_{i,1}, e_{i,2})\]
where \(r_i^*\) is the ground-truth relation for the \(i\)-th training example. On the TACRED benchmark, fine-tuned BERT achieves F1 scores in the 72–75% range; the best models as of 2025 exceed 85% on this benchmark.
Distant supervision (Mintz et al., 2009) scales RE to domains where manual annotation is prohibitively expensive. The key assumption: if an entity pair \((e_1, e_2)\) is related by relation \(r\) in a known KG (say, Wikidata records that Jensen Huang FOUNDED NVIDIA), then every sentence that mentions both entities is a training example for the relation \(r\). This generates millions of labelled examples automatically — at the cost of introducing significant label noise, because a sentence mentioning Jensen Huang and NVIDIA may be about a product launch, not about founding.
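The labelling heuristic itself fits in a few lines; the toy seed KG and corpus below are assumed for illustration.

kg_relations = {                      # assumed seed KG: (subject, object) -> relation
    ("Jensen Huang", "NVIDIA"): "FOUNDED",
    ("Tim Cook", "Apple"): "CEO_OF",
}
corpus = [
    "Jensen Huang founded NVIDIA in 1993 with Chris Malachowsky and Curtis Priem.",
    "Jensen Huang presented NVIDIA's quarterly results on Wednesday.",   # noisy label!
    "Tim Cook said Apple would expand its services business.",
    "Analysts expect Apple to report record revenue.",                   # no known pair: unlabelled
]
silver_examples = []
for sentence in corpus:
    for (subj, obj), relation in kg_relations.items():
        if subj in sentence and obj in sentence:       # the distant supervision assumption
            silver_examples.append((sentence, subj, relation, obj))
for sent, subj, rel, obj in silver_examples:
    print(f"[{rel:8s}] ({subj}, {obj})  <-  {sent}")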
Generative extraction is the frontier approach as of 2025. Models such as REBEL (Cabot and Navigli, 2021) and GenIE (Josifoski et al., 2022) generate triples directly from input text using a sequence-to-sequence architecture, producing structured output without any entity pre-marking. The generative approach handles overlapping triples and novel entity pairs naturally.
Live demo: Hearst pattern relation extraction
The cell below implements pattern-based relation extraction using four Hearst-style patterns, applied to an inlined corpus. This illustrates the key trade-off of rule-based extraction: high precision, limited recall.
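A sketch of that cell follows, with an assumed five-sentence corpus and the four patterns in simplified form.

import re

HYPONYM_PATTERNS = [
    re.compile(r"(?P<hyper>\w[\w ]*?),? such as (?P<hypos>[A-Z]\w*(?:, [A-Z]\w*)*(?: and [A-Z]\w*)?)"),
    re.compile(r"(?P<hyper>\w[\w ]*?),? including (?P<hypos>[A-Z]\w*(?:, [A-Z]\w*)*(?: and [A-Z]\w*)?)"),
]
CEO_PATTERN = re.compile(r"(?P<person>[A-Z]\w+(?: [A-Z]\w+)*), the CEO of (?P<org>[A-Z]\w+(?: [A-Z]\w+)*)")
ACQ_PATTERN = re.compile(r"(?P<acquirer>[A-Z]\w+(?: [A-Z]\w+)*) acquired (?P<target>[A-Z]\w+(?: [A-Z]\w+)*)")

def extract_triples(sentence):
    triples = []
    for pat in HYPONYM_PATTERNS:
        for m in pat.finditer(sentence):
            for hypo in re.split(r", | and ", m.group("hypos")):
                triples.append((hypo.strip(), "HYPONYM_OF", m.group("hyper").strip()))
    for m in CEO_PATTERN.finditer(sentence):
        triples.append((m.group("person"), "CEO_OF", m.group("org")))
    for m in ACQ_PATTERN.finditer(sentence):
        triples.append((m.group("acquirer"), "ACQUIRED", m.group("target")))
    return triples

corpus = [
    "Semiconductor companies such as NVIDIA and AMD reported strong quarterly earnings.",
    "Large technology firms including Microsoft and Alphabet face new antitrust scrutiny.",
    "Tim Cook, the CEO of Apple, outlined the company's services strategy.",
    "Microsoft acquired Activision Blizzard for $68.7 billion.",
    "Musk's electric vehicle maker closed the Activision deal discussion.",  # paraphrase: nothing extracted
]
for sentence in corpus:
    for triple in extract_triples(sentence):
        print(triple, "  <-", sentence[:48] + "...")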
Interpretation. The extraction correctly identifies HYPONYM_OF relations from “such as” constructions (NVIDIA and AMD are hyponyms of semiconductor companies), CEO_OF from the appositive pattern, and ACQUIRED relations from the verb pattern. The output triples are the raw material for KG construction. Notice the limitation that the pattern-based approach cannot handle paraphrase: “Musk’s electric vehicle maker” is not caught as a reference to Tesla, and “the Activision deal” does not trigger the ACQUIRED relation unless the sentence also contains the predicate “acquired.” This precision-recall trade-off is the core motivation for moving to statistical and generative models at scale.
The code below uses transformers and requires the Babelscape/rebel-large model (~1.6GB), which cannot run in the browser. Run in Google Colab with a GPU runtime: !pip install transformers. REBEL generates triples end-to-end from raw text without pre-specifying entity spans.
import re
from transformers import pipeline
# REBEL: Relation Extraction By End-to-end Language generation
rebel_pipe = pipeline(
    "text2text-generation",
    model="Babelscape/rebel-large",
    tokenizer="Babelscape/rebel-large"
)
def parse_rebel_output(text: str) -> list:
    """Parse REBEL's special-token output into (subj, rel, obj) triples.
    Simplified: handles one (object, relation) pair per <triplet> marker."""
    triples = []
    # REBEL output format: <triplet> SUBJECT <subj> OBJECT <obj> RELATION ...
    triplet_re = re.compile(
        r'<triplet>\s*(.*?)\s*<subj>\s*(.*?)\s*<obj>\s*(.*?)(?=<triplet>|$)'
    )
    for m in triplet_re.finditer(text):
        subj, obj, rel = m.group(1).strip(), m.group(2).strip(), m.group(3).strip()
        if subj and rel and obj:
            triples.append((subj, rel, obj))
    return triples
sentence = (
    "Jensen Huang, the founder and CEO of NVIDIA, announced that the company "
    "had acquired Mellanox Technologies in 2020 to expand its data-centre networking business."
)
# The triplet markers are special tokens, so decode the generated ids directly
# rather than using the pipeline's generated_text, which strips them.
raw = rebel_pipe(sentence, max_length=256, num_beams=4,
                 return_tensors=True, return_text=False)
decoded = rebel_pipe.tokenizer.batch_decode([raw[0]["generated_token_ids"]])[0]
decoded = decoded.replace("<s>", "").replace("</s>", "").replace("<pad>", "")
triples = parse_rebel_output(decoded)
print(f"Input: {sentence}\n")
print("Extracted triples:")
for (s, r, o) in triples:
    print(f"  ({s}, {r}, {o})")
# Expected output (approximately):
#   (Jensen Huang, position held, CEO)
#   (Jensen Huang, employer, NVIDIA)
#   (NVIDIA, founded by, Jensen Huang)
#   (NVIDIA, acquired, Mellanox Technologies)
Ontology-Aware Information Extraction
What an ontology adds
A knowledge graph without an ontology is a collection of triples. An ontology gives those triples semantic structure: it defines entity types, specifies which relation types are valid between which entity type pairs, declares inverse and symmetric relations, and encodes transitivity constraints. The ontology converts a flat triple store into a reasoned knowledge base.
Formally, an ontology \(\mathcal{O} = (\mathcal{C}, \mathcal{R}, \mathcal{A})\) consists of:
- \(\mathcal{C}\): a set of concept classes (PERSON, ORG, LOCATION, FINANCIAL_INSTRUMENT, etc.), arranged in a subsumption hierarchy \(\sqsubseteq\) (LISTED_COMPANY \(\sqsubseteq\) ORG).
- \(\mathcal{R}\): a set of relation types, each with a domain and range: \(\text{CEO\_OF} : \text{PERSON} \times \text{ORG}\).
- \(\mathcal{A}\): a set of axioms: symmetry (\(\text{SIBLING\_OF}(x, y) \Rightarrow \text{SIBLING\_OF}(y, x)\)), inverse (\(\text{FOUNDED\_BY}(x, y) \Leftrightarrow \text{FOUNDER\_OF}(y, x)\)), transitivity (\(\text{SUBSIDIARY\_OF}(x, y) \wedge \text{SUBSIDIARY\_OF}(y, z) \Rightarrow \text{SUBSIDIARY\_OF}(x, z)\)).
Type checking and rule application
Type checking rejects triples that violate domain or range constraints. A triple (Apple Inc., CEO_OF, Tim Cook) is invalid because CEO_OF has domain PERSON and range ORG — the arguments are swapped. The corrected triple is (Tim Cook, CEO_OF, Apple Inc.). Automated extraction produces both valid and swapped triples; the ontology catches the swap.
Inverse and symmetric axioms generate additional triples for free. If the schema declares that CEO_OF and HAS_CEO are inverses, then extracting (Tim Cook, CEO_OF, Apple Inc.) immediately entails (Apple Inc., HAS_CEO, Tim Cook) — without any additional extraction. If BUSINESS_PARTNER_OF is declared symmetric, then extracting (Microsoft, BUSINESS_PARTNER_OF, OpenAI) immediately entails (OpenAI, BUSINESS_PARTNER_OF, Microsoft).
Transitivity enables inference across the graph. If the schema declares SUBSIDIARY_OF transitive, and the graph contains (YouTube, SUBSIDIARY_OF, Google LLC) and (Google LLC, SUBSIDIARY_OF, Alphabet), then the reasoner can infer (YouTube, SUBSIDIARY_OF, Alphabet) without any additional extraction.
Live demo: schema validation and rule application
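Because the live cell is not reproduced here, the sketch below shows the same logic over an assumed toy schema and seed set: domain and range type checking, then inverse and transitivity rules applied as a post-processing pass.

# Assumed toy ontology: relation -> (domain type, range type), plus inverse and transitivity declarations.
SCHEMA = {
    "CEO_OF": ("PERSON", "ORG"), "HAS_CEO": ("ORG", "PERSON"),
    "FOUNDED_BY": ("ORG", "PERSON"), "FOUNDER_OF": ("PERSON", "ORG"),
    "SUBSIDIARY_OF": ("ORG", "ORG"), "PARENT_OF": ("ORG", "ORG"),
    "HEADQUARTERED_IN": ("ORG", "LOCATION"),
}
INVERSES = {"CEO_OF": "HAS_CEO", "FOUNDED_BY": "FOUNDER_OF", "SUBSIDIARY_OF": "PARENT_OF"}
TRANSITIVE = {"SUBSIDIARY_OF"}
ENTITY_TYPES = {
    "Tim Cook": "PERSON", "Jensen Huang": "PERSON", "Santa Clara": "LOCATION",
    "Apple Inc.": "ORG", "NVIDIA": "ORG", "YouTube": "ORG", "Google LLC": "ORG", "Alphabet": "ORG",
}
raw_triples = [
    ("Tim Cook", "CEO_OF", "Apple Inc."), ("Apple Inc.", "CEO_OF", "Tim Cook"),         # second is swapped
    ("NVIDIA", "FOUNDED_BY", "Jensen Huang"), ("Jensen Huang", "FOUNDED_BY", "NVIDIA"), # second is swapped
    ("YouTube", "SUBSIDIARY_OF", "Google LLC"), ("Google LLC", "SUBSIDIARY_OF", "Alphabet"),
    ("NVIDIA", "HEADQUARTERED_IN", "Santa Clara"),
]
valid, rejected = [], []
for s, r, o in raw_triples:                           # type checking against domain and range
    dom, rng = SCHEMA[r]
    (valid if (ENTITY_TYPES[s] == dom and ENTITY_TYPES[o] == rng) else rejected).append((s, r, o))
inferred = [(o, INVERSES[r], s) for s, r, o in valid if r in INVERSES]    # inverse axioms
for rel in TRANSITIVE:                                                     # one round of transitivity
    edges = {(s, o) for s, r, o in valid if r == rel}
    inferred += [(a, rel, d) for a, b in edges for c, d in edges if b == c and (a, d) not in edges]
print("Rejected (type violation):", rejected)
print("Inferred by inverse + transitivity:", inferred)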
Interpretation. The pipeline catches both the swapped CEO_OF triple (Apple Inc. cannot be the subject of a relation whose domain type is PERSON) and the swapped FOUNDED_BY triple. From the five valid seed triples, the inverse rules generate four additional triples for free — including (Apple Inc., HAS_CEO, Tim Cook) and (Jensen Huang, FOUNDER_OF, NVIDIA) — and transitivity infers (YouTube, SUBSIDIARY_OF, Alphabet), which was not explicitly stated in the raw extraction but follows from the two SUBSIDIARY_OF triples that were. This ontology-driven inference is the mechanism by which a modest number of extracted facts can support a much richer set of downstream queries.
KG-Augmented RAG — The Modern Frontier
The limits of text-chunk RAG
Chapter 3 of this book introduced retrieval-augmented generation (RAG): rather than relying on an LLM’s implicit knowledge, retrieve relevant text passages from a document corpus and present them as additional context to the model before generation. RAG dramatically reduces hallucination on questions that are well-covered by the corpus, and it provides natural attribution — the model’s answer can be traced to specific retrieved passages.
Text-chunk RAG has a fundamental limitation, however: it retrieves by semantic similarity, not by logical structure. The retriever does not know that Apple Inc. and Alphabet are both members of the S&P 500 Information Technology sector; it does not know that YouTube is a subsidiary of Alphabet; it does not know that “the Santa Clara chipmaker” and “NVIDIA” refer to the same entity. When a query requires multi-hop reasoning — “What is the principal business location of the company that owns the most-watched video platform?” — text-chunk RAG must hope that a single retrieved passage happens to contain all the hops in the chain. Usually it does not.
KG-RAG replaces text chunks with structured graph paths. The retriever does not score passages; it traverses the knowledge graph. Because the graph encodes explicit typed relations, a two-hop query (“company that owns YouTube” → Alphabet → “principal business location” → Mountain View) becomes a deterministic graph traversal, not a probabilistic similarity search. The LLM’s role is then to generate fluent prose from the retrieved subgraph, not to perform the logical inference itself.
The KG-RAG pattern
The pipeline has four stages:
1. Query entity linking. The user’s natural language query is processed by the entity linker to identify anchor nodes in the KG. “Who is the CEO of the company that owns YouTube?” links to the YouTube entity.
2. Subgraph retrieval. Starting from the anchor node(s), retrieve the \(k\)-hop neighbourhood. For \(k = 1\), this means all triples where the anchor appears as subject or object. For \(k = 2\), extend one additional hop from each 1-hop neighbour. The retrieved subgraph is the union of all retrieved triples.
3. Subgraph serialization. The retrieved subgraph is converted to natural language. One simple serialization: for each triple \((s, r, o)\) in the subgraph, generate the sentence “According to the knowledge graph, [s] [r] [o].” More sophisticated serializers generate coherent multi-sentence paragraphs. Formally, the serialization function \(\phi\) maps a subgraph \(G_\text{sub} \subseteq G\) to a text string \(\phi(G_\text{sub})\).
4. Grounded generation. The LLM receives the query, the serialized subgraph, and (optionally) retrieved text passages. It generates an answer that is explicitly grounded in the KG evidence. The factuality of the answer can be verified by checking whether each claim in the output corresponds to a triple in \(G_\text{sub}\).
The end-to-end factuality metric for KG-RAG compares the answer against the retrieved subgraph:
\[\text{Factuality}(a, G_\text{sub}) = \frac{|\{\text{claims in } a \text{ supported by } G_\text{sub}\}|}{|\{\text{claims in } a\}|}\]
Live demo: 2-hop KG-RAG retrieval
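A reduced sketch of the live cell is shown below, covering only the multi-hop query over an assumed mini-KG; subgraph retrieval and serialization follow the four-stage pattern above, and the final deterministic traversal stands in for the grounded generation step.

# Assumed mini-KG as a list of (subject, relation, object) triples.
KG = [
    ("YouTube", "SUBSIDIARY_OF", "Alphabet"),
    ("Sundar Pichai", "CEO_OF", "Alphabet"),
    ("Alphabet", "HEADQUARTERED_IN", "Mountain View"),
    ("Tim Cook", "CEO_OF", "Apple Inc."),
    ("Apple Inc.", "HEADQUARTERED_IN", "Cupertino"),
]

def neighbourhood(anchor, hops=2):
    """Return all triples within `hops` edges of the anchor node (undirected expansion)."""
    frontier, subgraph = {anchor}, set()
    for _ in range(hops):
        new_triples = {t for t in KG if t[0] in frontier or t[2] in frontier}
        subgraph |= new_triples
        frontier |= {t[0] for t in new_triples} | {t[2] for t in new_triples}
    return subgraph

def serialize(subgraph):
    return " ".join(f"According to the knowledge graph, {s} {r.replace('_', ' ').lower()} {o}."
                    for s, r, o in sorted(subgraph))

def ceo_of_owner(entity):
    """Deterministic 2-hop traversal: entity --SUBSIDIARY_OF--> parent <--CEO_OF-- person."""
    parents = [o for s, r, o in KG if s == entity and r == "SUBSIDIARY_OF"]
    return [s for s, r, o in KG for p in parents if r == "CEO_OF" and o == p]

query = "Who is the CEO of the company that owns YouTube?"
subgraph = neighbourhood("YouTube", hops=2)
print("Query:", query)
print("Serialized context handed to the LLM:\n ", serialize(subgraph))
print("Traversal answer:", ceo_of_owner("YouTube"))   # ['Sundar Pichai'], with full provenance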
Before running the cell above, consider: a text-chunk RAG system retrieving passages about “YouTube” would likely return passages describing the platform’s features, revenue, and history — none of which directly states who its CEO is. The KG-RAG system, by contrast, traverses the SUBSIDIARY_OF edge from YouTube to Alphabet, then the CEO_OF edge from Sundar Pichai to Alphabet, producing the correct 2-hop answer with full provenance. Predict whether the simulation correctly resolves the third query (“CEO of the company that owns YouTube”) by traversal, and check your prediction against the output.
Interpretation. The three queries illustrate the core value of structured retrieval. Query 3 — “Who is the CEO of the company that owns YouTube?” — cannot be answered by a text chunk retriever without extraordinary luck, because no single passage is likely to contain both “YouTube is owned by Alphabet” and “Sundar Pichai is the CEO of Alphabet.” The KG traversal resolves the query in two deterministic hops. In production KG-RAG systems (such as Microsoft’s GraphRAG, described in Edge et al. 2024), this traversal is combined with text-chunk retrieval: the KG provides structured precision; the text corpus provides richer contextual detail.
Perplexity AI’s search product differs from a conventional RAG system in a specific way: every sentence in its generated answer is attributed to a numbered source, and clicking any source shows the specific passage that supports the claim. This is KG-RAG in spirit, even if implemented without an explicit graph: each factual claim in the answer is grounded in a retrievable, cited document. The system effectively enforces a policy that no claim is made without an associated evidence span — reducing hallucination by making ungrounded claims visible to the reader. In a financial or compliance analytics context, this attribution model is not optional: every fact in an analyst’s report must be traceable to a specific data source, and the citation mechanism is the mechanism that ensures that traceability.
Neuro-Symbolic NLP
Two paradigms and why neither is sufficient alone
The deep learning revolution of the 2010s produced a compelling argument for abandoning symbolic AI: learn everything from data. Symbols, rules, and ontologies are brittle, require expensive expert design, and fail on inputs outside their scope. Neural networks, by contrast, learn robust representations from millions of examples, generalise gracefully, and require no hand-crafted rules. The argument was largely correct for perceptual tasks — image classification, speech recognition, machine translation — where the function to be learned is smooth and continuous, and where correctness is a matter of degree rather than of binary logical validity.
Natural language understanding in high-stakes domains is different. A contract is either valid or invalid; a sanctions designation either applies or it does not; a medical diagnostic criterion is either met or it is not. These are discrete, categorical facts governed by explicit rules — exactly the domain where symbolic systems excel. The question is not whether to use rules or learning, but how to combine them.
Neuro-symbolic NLP is the umbrella term for systems that integrate learned representations with symbolic reasoning. The learned component (the neural network, the LLM, the embedding model) handles the perceptual and linguistic surface: understanding language, identifying relevant passages, representing entities. The symbolic component (the rule engine, the ontology, the theorem prover) handles the logical structure: inferring consequences, checking consistency, verifying compliance.
Representative frameworks
DeepProbLog (Manhaeve et al., 2018) extends ProbLog — a probabilistic logic programming language — with neural predicates. A neural predicate maps a data object (an image, a sentence) to a probability distribution over logical facts. The logical program then reasons over those distributions using standard probabilistic inference. In an NLP context: a neural NER model assigns probabilities to entity type labels; the DeepProbLog program uses those probabilities to reason about whether a sentence violates a regulatory constraint.
Logic Tensor Networks (Badreddine et al., 2022) implement a differentiable satisfiability framework: logical constraints over a domain are compiled into a loss function that penalises assignments violating the constraints. This loss is added to the standard task loss during training, so the network learns representations that simultaneously minimize prediction error and maximise logical consistency.
Semantic parsing converts natural language queries into formal logical representations — SQL, SPARQL, Prolog — that can be executed against a structured knowledge base or database. A user’s question “Which companies in the S&P 500 are headquartered in Hong Kong?” is parsed into a SPARQL query against a financial KG, returning a precise, exhaustive, verifiable answer — not a generated paragraph that may contain errors. Text-to-SQL systems built on fine-tuned or prompted LLMs (such as DIN-SQL and C3) exceed 80% execution accuracy on the Spider benchmark as of 2025, with lower but steadily improving scores on the harder BIRD benchmark.
Neural theorem provers (Rocktäschel and Riedel, 2017; Yang et al., 2017) learn embeddings of logical rules simultaneously with entity and relation embeddings, enabling soft generalisation across rules: if the model has learned that the FOUNDED_BY relation implies the FOUNDER_OF relation, it can apply this inference even when the inverse axiom was not explicitly encoded.
Why this matters for analytics
In medicine, law, and finance, the cost of a wrong answer is not symmetric. A hallucinated drug interaction is categorically worse than a missing one. A missed sanctions match is categorically worse than a false positive. An LLM operating without symbolic grounding treats these as continuous probability outputs; a neuro-symbolic system can enforce hard constraints — “never answer without a cited source,” “flag all outputs with confidence below threshold for human review,” “reject any entity classification that conflicts with the type ontology” — that transform soft probabilistic outputs into auditable, explainable decisions.
Industry Applications
Knowledge-grounded search
Google’s Knowledge Graph, introduced in 2012 and expanded continuously since, now contains over 500 billion facts about 8 billion entities. It powers the Knowledge Panel sidebar visible in search results, the structured “quick answers” for queries about prominent entities, and the entity disambiguation underlying Google Search’s understanding of what a query is about. When a user searches “Apple CEO 2026,” the Knowledge Graph — not an LLM generating text — supplies the answer. The LLM’s role (in Google’s AI Overviews) is to synthesise a readable paragraph from structured KG facts plus retrieved text. Microsoft’s Bing Copilot follows the same architecture: structured KG facts ground the model’s claims about entities, while LLM generation produces the prose. Both systems invest heavily in KG freshness: the Knowledge Graph is updated continuously from web crawl results, structured data markup, and trusted source feeds.
Customer-service assistants grounded in product KGs
Klarna’s customer-service agent, widely reported in 2024 as handling the equivalent of 700 full-time agents’ workload, operates over a product knowledge graph that encodes every product variant, policy rule, return condition, and account state. Rather than asking a general-purpose LLM what Klarna’s return policy is — and risk getting a stale or hallucinated answer — the system queries the product KG for the specific policy applicable to the specific product purchased by the specific customer, and presents that structured fact to the LLM for natural-language generation. The KG ensures that policy answers are always current and always account-specific; the LLM ensures that the answer is expressed naturally and contextually. Neither component alone is sufficient; the combination defines the architecture.
Compliance and AML entity screening
Anti-money-laundering (AML) and sanctions screening are canonically KG problems. A sanctions graph contains entities (individuals, companies, vessels, aircraft) with their aliases, associated addresses, controlling shareholders, beneficial owners, and jurisdictional flags. Screening a counterparty means traversing this graph: the counterparty is linked to its canonical entity record; the record’s 1-hop and 2-hop neighbours are examined for sanctions designations, politically exposed persons (PEP) flags, and adverse media tags. A pure-LLM approach cannot perform this traversal reliably: aliases, transliterations, and shell company chains are exactly the kind of precise, structured knowledge that LLMs hallucinate. The KG traversal is exact; the LLM’s role is to summarise the findings and generate a human-readable screening report.
Bloomberg’s enterprise AI product integrates the Bloomberg Knowledge Graph (BBKG) — a curated KG of financial entities, relationships, filings, and events — with a large language model to power natural-language analyst Q&A. An analyst can ask “What are the major shareholders of the company that recently acquired SVB Financial’s assets?” and receive a structured answer that traces through the acquisition event (a dated triple in the KG), identifies the acquirer (First Citizens BancShares), retrieves its ownership structure from the KG’s shareholder data, and presents the result with citations to the underlying data sources. The key property that Bloomberg’s enterprise clients pay for is precisely this: the answer is not a plausible-sounding LLM generation — it is a verifiable graph traversal with dated, sourced facts. For a portfolio manager or compliance officer, the distinction between a generated answer and a sourced answer is the distinction between a conversation and an audit trail.
Biomedical question answering
Med-PaLM 2 (Singhal et al., 2023), Google’s medical question-answering system, achieved performance at or above the pass mark on the United States Medical Licensing Examination (USMLE). The system grounds its answers in UMLS (Unified Medical Language System) and SNOMED-CT — two large biomedical ontologies containing millions of medical concepts and their relationships. When a clinician asks about drug interactions, the system queries the ontology for known interaction triples before generating a response; when it answers a diagnostic question, it checks its claim against the disease classification hierarchy. Grounding in these authoritative ontologies is not optional in a medical context — it is the mechanism that reduces the hallucination rate from clinically unacceptable to clinically usable.
The UMLS Metathesaurus contains over 3.7 million biomedical concepts drawn from more than 200 source vocabularies — including SNOMED-CT, MeSH, ICD-10, and RxNorm. When a clinical NLP system receives the query “Can a patient on warfarin safely take ibuprofen?”, a KG traversal in UMLS identifies that ibuprofen is a non-steroidal anti-inflammatory drug (NSAID), that NSAIDs have a known drug–drug interaction of type INCREASES_BLEEDING_RISK with warfarin anticoagulants, and that the interaction is flagged as SEVERE in the clinical interaction database. The LLM then generates a natural-language response grounded in this structured evidence: “Ibuprofen and other NSAIDs significantly increase the risk of bleeding when taken with warfarin. This combination should be avoided unless specifically directed by a physician.” The structured knowledge ensures that the model cannot generate a response that contradicts the clinical evidence in the ontology — which is precisely the safety guarantee that clinical deployment requires.
Financial analysis
The Refinitiv Knowledge Graph (now LSEG’s entity data product) and Bloomberg’s BBKG each contain hundreds of millions of financial facts: company hierarchies, instrument terms, analyst estimates, filing events, corporate action histories, and credit ratings. Investment analysts querying these KGs via natural language — “What was SoftBank’s debt-to-equity ratio in Q3 2024, and how did it compare to its five-year average?” — receive structured answers traced to specific data points in the filing database, not to an LLM’s parametric memory of financial history. The combination of structured query precision with LLM prose generation is the emerging architecture for next-generation analyst workstations.
Evaluation
What to measure — and why it is hard
Evaluating a KG construction and KG-RAG pipeline requires metrics at multiple levels of the stack. Each level measures a different failure mode.
Entity F1 measures the accuracy of the NER component. Precision is the fraction of predicted entity spans that are correct (both boundary and type); recall is the fraction of true entity spans that were predicted. The entity-level F1 is their harmonic mean:
\[F1_\text{entity} = \frac{2 \cdot P_\text{entity} \cdot R_\text{entity}}{P_\text{entity} + R_\text{entity}}\]
State-of-the-art NER on CoNLL-2003 achieves entity F1 above 93% for formal English text. On Twitter data, the same models drop to 70–75% due to abbreviations, slang, and missing context.
Relation F1 measures the accuracy of the relation extraction component over the set of valid entity pair spans. Because NER errors propagate — a missed entity span means the relation involving that entity cannot be extracted — relation F1 is always bounded above by entity recall. The TACRED benchmark reports micro-averaged relation F1 over 42 relation types; best models as of 2025 achieve approximately 85%.
KG-RAG factuality measures the end-to-end correctness of answers generated by a KG-RAG system. The standard evaluation datasets are TriviaQA, NaturalQuestions, and HotpotQA. For TriviaQA, exact match against the gold answer is the primary metric. For HotpotQA — which explicitly requires multi-hop reasoning — both answer exact match and supporting-fact F1 are reported. Lewis et al. (2020) showed that RAG improved exact match on NaturalQuestions by approximately 11 percentage points over a closed-book LLM baseline; subsequent work on KG-RAG has shown additional gains of 5–8 points on multi-hop benchmarks where text-chunk RAG struggles.
Hallucination rate measures the fraction of generated claims that are not supported by retrieved evidence. Human evaluation remains the gold standard; automated approximations use an NLI model to check whether each sentence in the generated answer is entailed by the retrieved context. A hallucination rate below 5% is generally considered acceptable for analyst-facing applications; below 1% is required for compliance-facing applications.
The canonicalization problem
A persistent difficulty in KG evaluation is canonicalization: determining that “IBM,” “International Business Machines,” “Big Blue,” and “IBM Corp.” all refer to the same entity. When evaluating whether a predicted triple matches a gold triple, a naive string comparison will mark (Arvind Krishna, CEO_OF, IBM Corp.) as incorrect relative to a gold triple (Arvind Krishna, CEO_OF, International Business Machines), even though they are semantically identical. Production evaluation pipelines must apply an entity canonicalization step — mapping all surface forms to canonical identifiers — before comparing predicted and gold triple sets. This is itself an entity linking problem, creating a circularity that makes KG evaluation distinctly harder than standard classification evaluation.
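In code, the remedy is an alias-normalisation pass applied to both triple sets before the exact comparison; the alias table and canonical identifiers below are assumed for illustration.

ALIASES = {   # surface form -> canonical identifier (assumed alias table)
    "IBM": "ORG_001", "IBM Corp.": "ORG_001", "International Business Machines": "ORG_001",
    "Big Blue": "ORG_001", "Arvind Krishna": "PER_001",
}
def canonicalize(triples):
    return {(ALIASES.get(s, s), r, ALIASES.get(o, o)) for s, r, o in triples}

predicted = {("Arvind Krishna", "CEO_OF", "IBM Corp.")}
gold = {("Arvind Krishna", "CEO_OF", "International Business Machines")}
print("Naive string match:    ", predicted == gold)                              # False
print("Canonicalized match:   ", canonicalize(predicted) == canonicalize(gold))  # True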
Limits and Open Problems
The construction bottleneck
Building a production-quality KG from scratch requires three investments that are each expensive in their own right: domain expert time to design and validate the ontology; data engineering time to build and maintain the extraction pipelines; and ongoing operations to update the KG as the world changes. The ontology design problem is particularly under-estimated: a financial KG designed by data engineers without domain expertise will capture simple ownership and leadership relations, but miss the complex structures that compliance and legal teams actually care about — beneficial ownership chains, voting trust arrangements, regulatory enforcement actions, material adverse change events. Getting the schema right requires extended collaboration between domain experts and knowledge engineers, and getting it wrong means the KG is useless for the highest-value queries.
Staleness
A KG that is not updated is a liability. Leadership changes, acquisitions, regulatory designations, and sanctions listings happen continuously. The gap between world-state and KG-state — the staleness window — is a function of update frequency, update cost, and update completeness. For a company like Bloomberg, whose clients pay for the freshness of financial data, the KG update cycle is measured in minutes for market-moving events. For a research-grade KG maintained by an academic team, the update cycle may be months. For any compliance-facing application, the staleness window is a regulatory risk: a sanctions match that was missed because the KG had not been updated since the designation was issued is not a technical failure — it is a compliance failure.
Combining soft and hard knowledge
The most active research area at the intersection of LLMs and KGs is the question of how to combine the LLM’s “soft knowledge” — the distributional patterns absorbed from billions of documents — with the KG’s “hard facts” — the explicit, typed, time-stamped triples. The naive approach (feed the KG facts to the LLM as additional context) works but is fragile: the LLM may ignore the structured context in favour of its parametric memory, may fail to integrate the context with the query correctly, or may generate text that superficially cites the KG facts but contradicts them in substance. The correct solution likely involves training approaches that teach the model to treat structured KG evidence as authoritative — higher priority than parametric memory — and to express uncertainty when the KG evidence is absent rather than falling back to implicit recall. This training approach is an open problem as of 2026.
Agents as KG maintainers
Chapter 7 of this book showed that LLM agents can plan and execute multi-step tasks. An attractive deployment for agentic systems is autonomous KG maintenance: an agent that monitors news feeds, social media, and financial filings; identifies events that imply KG updates (a CEO resignation implies a HAS_CEO relation change; an acquisition announcement implies a new ACQUIRED relation); extracts the relevant triple; validates it against the ontology; and inserts it into the KG pending human review. This architecture would dramatically reduce the staleness window while keeping human experts in the loop for quality assurance. Several research prototypes exist as of 2025; robust production deployment remains an open engineering challenge.
Mini Case Study — Compliance-Grade News Triage
The scenario
An analyst at a compliance team receives fifty news headlines relating to M&A activity and corporate governance. The task is to identify which headlines describe relationships between specific entities, validate those relationships against the company’s entity ontology, check whether the involved entities have beneficial ownership structures that touch sanctioned jurisdictions, and produce a structured analyst-ready report with full provenance. We implement this pipeline end-to-end in four live cells.
Step 1: NER and entity extraction
Step 2: Relation extraction — identifying acquisition and leadership triples
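A reduced sketch of the first two cells (Steps 1 and 2 together) is shown below; a handful of fictional headlines stands in for the fifty, and the gazetteer and extraction patterns are assumed.

import re

# Assumed sample of the headline feed (fictional companies; the live cells process the full fifty).
HEADLINES = [
    (1, "Helios Capital Partners acquired Coral Reef Holdings, a Cayman Islands investment vehicle, for $2.1 billion."),
    (2, "Maria Chen, the CEO of Northgate Semiconductor, announced a share buyback."),
    (3, "Northgate Semiconductor acquired Lumen Photonics to expand its optics unit."),
    (4, "Regulators are reviewing several cross-border deals announced this quarter."),  # no entities of interest
]
ORG_GAZETTEER = {"Helios Capital Partners", "Coral Reef Holdings", "Northgate Semiconductor", "Lumen Photonics"}
PERSON_GAZETTEER = {"Maria Chen"}
ACQ_PATTERN = re.compile(r"(?P<acquirer>[A-Z][\w ]+?) acquired (?P<target>[A-Z][\w ]+?)(?: for| to|\.|,)")
CEO_PATTERN = re.compile(r"(?P<person>[A-Z]\w+ [A-Z]\w+), the CEO of (?P<org>[A-Z][\w ]+?)[,\.]")

mentions, triples = [], []
for hid, text in HEADLINES:
    # Step 1: gazetteer NER over each headline
    for name in PERSON_GAZETTEER | ORG_GAZETTEER:
        if name in text:
            mentions.append((hid, name, "PERSON" if name in PERSON_GAZETTEER else "ORG"))
    # Step 2: pattern-based relation extraction
    for m in ACQ_PATTERN.finditer(text):
        triples.append((hid, m.group("acquirer").strip(), "ACQUIRED", m.group("target").strip()))
    for m in CEO_PATTERN.finditer(text):
        triples.append((hid, m.group("person"), "CEO_OF", m.group("org").strip()))

print("Entity mentions:", mentions)
print("Candidate triples:", triples)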
Step 3: Ontology validation
Step 4: KG-RAG sanctions and ownership lookup
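A matching sketch of Steps 3 and 4 follows; the triples from the previous sketch are re-declared so the cell is self-contained, one deliberately swapped triple is added to exercise the validator, and the screening graph is assumed, fictional data.

# Steps 3 and 4 over the triples extracted above (re-declared here; fictional data).
TRIPLES = [
    (1, "Helios Capital Partners", "ACQUIRED", "Coral Reef Holdings"),
    (2, "Maria Chen", "CEO_OF", "Northgate Semiconductor"),
    (2, "Northgate Semiconductor", "CEO_OF", "Maria Chen"),            # deliberately swapped extraction
    (3, "Northgate Semiconductor", "ACQUIRED", "Lumen Photonics"),
]
ENTITY_TYPES = {"Helios Capital Partners": "ORG", "Coral Reef Holdings": "ORG",
                "Northgate Semiconductor": "ORG", "Lumen Photonics": "ORG",
                "Maria Chen": "PERSON"}
SCHEMA = {"ACQUIRED": ("ORG", "ORG"), "CEO_OF": ("PERSON", "ORG")}

# Step 3: ontology validation, rejecting triples that violate domain/range constraints
valid, rejected = [], []
for hid, s, r, o in TRIPLES:
    dom, rng = SCHEMA[r]
    ok = ENTITY_TYPES.get(s) == dom and ENTITY_TYPES.get(o) == rng
    (valid if ok else rejected).append((hid, s, r, o))

# Step 4: screen acquisition parties against an assumed sanctions / ownership KG
SCREENING_KG = {
    "Coral Reef Holdings": {"jurisdiction": "Cayman Islands", "sanctioned": False,
                            "ultimate_owner": "undisclosed trust"},
    "Helios Capital Partners": {"jurisdiction": "United Kingdom", "sanctioned": False,
                                "ultimate_owner": "Helios Group plc"},
    "Northgate Semiconductor": {"jurisdiction": "United States", "sanctioned": False,
                                "ultimate_owner": "public float"},
}
HIGH_RISK_JURISDICTIONS = {"Cayman Islands"}
report = []
for hid, s, r, o in valid:
    if r != "ACQUIRED":
        continue
    for role, entity in (("ACQUIRER", s), ("TARGET", o)):
        record = SCREENING_KG.get(entity)
        if record is None:                      # conservative default: unknown entities go to a human
            report.append((hid, role, entity, "MANUAL_REVIEW_REQUIRED: not in screening KG"))
        elif record["sanctioned"] or record["jurisdiction"] in HIGH_RISK_JURISDICTIONS:
            report.append((hid, role, entity,
                           f"ESCALATE: {record['jurisdiction']}, owner: {record['ultimate_owner']}"))
        else:
            report.append((hid, role, entity, "clear"))

print("Rejected by ontology:", rejected)
for row in report:
    print(row)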
Before running Step 4, identify from the headlines which acquisition involves an entity headquartered in the Cayman Islands — a jurisdiction that triggers enhanced due diligence under most AML frameworks. Check whether the pipeline correctly identifies and escalates this entity. The output should surface one escalation from the screened entities; verify that it corresponds to the correct acquisition and the correct screened role (ACQUIRER or TARGET).
Interpretation of the full pipeline. The four steps above constitute a minimal but complete compliance-grade news triage system. Step 1 identifies which headlines involve named entities of interest. Step 2 extracts the structured relationships asserted by those headlines. Step 3 validates those relationships against the ontology, rejecting extractions that violate type constraints. Step 4 performs KG-RAG over a sanctions and ownership graph, surfacing the entities that require enhanced due diligence.
Three design choices are worth noting explicitly. First, the pipeline is fully auditable: every output can be traced back to a specific headline, a specific extraction pattern, a specific schema rule, and a specific KG record. No step involves a black-box model whose reasoning cannot be inspected. Second, the pipeline is conservative: when an entity is not found in the sanctions KG, the system flags it as MANUAL_REVIEW_REQUIRED rather than assuming it is clean. False negatives are more costly than false positives in a compliance context. Third, the pipeline is modular: each step can be upgraded independently — replacing the gazetteer NER with a fine-tuned BERT tagger, replacing the Hearst patterns with a REBEL extraction model, replacing the rule-based schema validator with an RDF reasoner — without redesigning the overall architecture. This modularity is the principal engineering advantage of the knowledge-grounded pipeline over an end-to-end LLM approach.
Closing — The Symbolic Comeback
For approximately fifteen years — from roughly 2010 to 2025 — the dominant narrative in artificial intelligence was one of relentless neural advance and corresponding symbolic retreat. Every year, another task that symbolic systems had once monopolised — machine translation, image recognition, game playing, code generation — fell to deep learning models that learned from data rather than reasoning from rules. The symbolic AI community, once the field’s mainstream, was reduced to a rearguard action: arguing for explainability, for sample efficiency, for the brittleness of purely learned representations in high-stakes domains. These were correct arguments, but they were not winning arguments, because the empirical performance of neural systems was simply too compelling.
Something changed in 2024 and 2025. It was not that LLMs became less impressive — they continued to improve at a striking rate. What changed was the deployment context. As LLMs moved from research demonstrations to production systems in finance, medicine, law, and compliance, the hallucination problem became not a benchmark number but an operational reality. A model that generates a confident but wrong answer in a research demo is embarrassing; a model that does the same in a compliance report is a regulatory incident. The demand for verifiable, attributable, grounded answers — exactly the properties that symbolic systems provide — emerged from customers in the highest-stakes industries.
The pendulum is not swinging back to pure symbolic AI. Nobody is proposing to replace LLMs with Prolog. The architecture that is emerging is a genuine synthesis: LLMs handle the perceptual and linguistic surface — understanding natural language, generating fluent prose, reasoning over implicit patterns — while symbolic systems handle the logical structure — enforcing ontological constraints, performing exact graph traversal, maintaining verifiable audit trails. Each component does what it does best. Neither can be eliminated.
The 5–10 year outlook is one of progressive integration. Knowledge graphs will become richer, more dynamic, and more automatically maintained — partly by agents (as discussed in Chapter 7) that continuously extract new facts from text and validate them against existing structures. Neuro-symbolic training approaches will reduce the boundary between the learned and the symbolic: models will be fine-tuned to treat KG evidence as authoritative, to express uncertainty when facts are missing, and to generate outputs that are structurally consistent with the ontology. The evaluation infrastructure will mature to measure factuality as a first-class metric alongside fluency and task accuracy.
For the analyst, the practitioner, and the student, the practical message is this: a text analytics pipeline that terminates at the LLM — that asks a foundation model to answer factual questions from its parametric memory and trusts the result — is a pipeline that will eventually produce a wrong answer at a cost that the organisation was not prepared to pay. The architecture described in this chapter — NER, entity linking, relation extraction, ontology validation, KG-RAG — is the architecture that makes the LLM’s outputs verifiable. It adds engineering complexity. It requires domain expertise to build the ontology. It requires operational discipline to maintain the KG. It is worth every unit of that investment in any context where facts matter.
The symbolic AI community was not wrong about the importance of structure. It was simply early.
Prof. Xuhu Wan · HKUST · Modern AI Stack for Social Data · 2026 Edition