Chapter 6: Knowledge Graphs and Ontologies
About This Chapter
Every chapter in this book so far has placed the same quiet assumption at the bottom of the graph: an edge is an edge. When we drew the Karate Club friendship network in Chapter 1, we did not ask what kind of friendship — casual acquaintance or close confidant, shared workplace or shared neighborhood. When we built the interbank exposure network in Chapter 8, the edges carried dollar amounts but not the legal character of the claim — was the exposure a repo agreement, an unsecured loan, a derivative novation? When we modeled the Florentine Families in Chapter 2, the marriage edges and the banking edges were analyzed separately, as if they were two different graphs, rather than two different relation types woven into a single multi-relational reality.
That simplification is consequential. The real world does not issue single-relation graphs. A corporate group is not just a set of entities connected by some edge; it is a lattice of SUBSIDIARY_OF, CONTROLS, DIRECTOR_OF, REGISTERED_IN, and OWNS relations, each with its own logical properties and its own implications for liability, governance, and risk. A biomedical knowledge base is not a gene network or a disease network or a drug network — it is all three simultaneously, connected by typed relations like TREATS, INHIBITS, UPREGULATES, ASSOCIATED_WITH, and CAUSES. A financial data platform cannot serve its clients by returning a list of nodes; it must return structured facts: Tim_Cook --CEO_OF--> Apple, Apple --LISTED_ON--> NASDAQ, NASDAQ --REGULATED_BY--> SEC.
A knowledge graph (KG) is the data structure designed for exactly this reality. It represents the world as a set of typed entities connected by typed, directed relations — formally, a collection of triples \((h, r, t)\) where \(h\) and \(t\) are entities and \(r\) is the relation between them. The name “knowledge graph” entered mainstream vocabulary when Google published its Knowledge Graph in 2012, powering the information cards that appear alongside search results. Wikidata, Freebase, DBpedia, Hetionet, Bloomberg’s BBKG, and Refinitiv’s PermID are all instantiations of the same fundamental idea.
This chapter develops the theory and practice of knowledge graphs from the ground up. We begin with the formal definition and show how the triple-store model generalizes single-relation graphs. We cover the Resource Description Framework (RDF) and the SPARQL query language that sits on top of it. We introduce ontologies — the logical schemas that give a knowledge graph its inferential power — and implement a small rule-based reasoner. We then move to the chapter’s central computational contribution: knowledge graph embeddings, which represent entities and relations as vectors so that missing triples can be predicted by arithmetic in \(\mathbb{R}^d\). We implement TransE and DistMult from scratch in NumPy and evaluate them using the standard ranking metrics MRR and Hits@\(k\). We close with a mini case study: a beneficial-ownership investigation that uses a small KG and a trained TransE model to surface a hidden ownership chain connecting a flagged corporate entity to a sanctioned individual.
The mathematics is linear algebra, which you have used in every chapter since Chapter 2. The economics is corporate structure, compliance, and drug repurposing — three domains where the inability to reason over typed multi-relational data costs real money and, in the biomedical case, real lives.
Table of Contents
- From Graph to Knowledge Graph — Adding Types
- RDF and SPARQL — The Web of Data Layer
- Schema vs. Ontology
- Building a KG From Text — The IE Pipeline
- KG Embeddings — The Key Idea
- TransE: Translation in Embedding Space
- DistMult, ComplEx, and RotatE
- Link Prediction and the Open-World Assumption
- Applications
- GNNs Meet KGs (Briefly)
- Limits and Critiques
- Mini Case Study — A Tiny Beneficial-Ownership Investigation
From Graph to Knowledge Graph — Adding Types
The triple as the atomic unit of knowledge
The simplest possible graph is a binary relation: a set of pairs \((u, v)\) drawn from a node set \(V\). Every chapter so far has worked within this framework. A knowledge graph breaks out of it by assigning a type to every entity and every relation. The formal object is a multi-relational directed graph, also called a knowledge graph, defined as follows.
Let \(\mathcal{E}\) be a finite set of entities — the nodes — and let \(\mathcal{R}\) be a finite set of relation types — the edge labels. A knowledge graph \(\mathcal{G}\) is a set of triples:
\[\mathcal{G} \subseteq \mathcal{E} \times \mathcal{R} \times \mathcal{E} = \{(h, r, t) : h \in \mathcal{E},\; r \in \mathcal{R},\; t \in \mathcal{E}\} \tag{9.1}\]
Each triple \((h, r, t)\) asserts a fact: the head entity \(h\) stands in relation \(r\) to the tail entity \(t\). Examples drawn from corporate finance:
| Head \(h\) | Relation \(r\) | Tail \(t\) |
|---|---|---|
| Lehman_Brothers | SUBSIDIARY_OF | Lehman_Holdings |
| Tim_Cook | CEO_OF | Apple |
| Apple | LISTED_ON | NASDAQ |
| NASDAQ | REGULATED_BY | SEC |
| aspirin | TREATS | headache |
The standard single-relation graph is the special case \(|\mathcal{R}| = 1\). As soon as \(|\mathcal{R}| > 1\), you need a knowledge graph.
Four variants of typed graphs
It is worth distinguishing the knowledge graph from superficially similar constructs that appear in the literature:
| Model | Typed nodes? | Typed edges? | Node attributes? | Edge attributes? |
|---|---|---|---|---|
| Simple graph (Ch 1–8) | No | No | No | No |
| Multi-relational graph | No | Yes | No | No |
| Typed graph | Yes | Yes | No | No |
| Knowledge graph (RDF triples) | Yes | Yes | No | No |
| Labeled property graph (Neo4j) | Yes | Yes | Yes | Yes |
The knowledge graph in its RDF form does not natively support node or edge attributes — a fact like “Apple’s market cap is $3 trillion” must itself be reified as a triple, with the dollar amount as an RDF literal. Neo4j’s labeled property graph adds first-class support for key–value attributes on both nodes and edges, which is more expressive but sacrifices the elegant algebraic properties of the pure triple-store model that make KG embeddings tractable.
Live cell: constructing a corporate-ownership knowledge graph
The cell below builds a small but structurally realistic corporate-ownership KG with 12 entities, 4 relation types, and 18 triples. We construct it as a list of (head, relation, tail) tuples — the native representation — and then lift it into a NetworkX MultiDiGraph for visualization. Throughout the chapter, this KG serves as the working example on which all algorithms are demonstrated.
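A minimal sketch of such a construction. The triple list below is an illustrative reconstruction: only Alice, ParentCorp, HoldCo_A, HoldCo_B, OpCo_Alpha, Cayman_Islands, and Luxembourg are named in the text, so Bob, OpCo_Beta, FinCo, Delaware, and British_Virgin_Islands are assumed names added to reach the stated counts. The NetworkX lifting and the colored plot described below are omitted here.

```python
# Illustrative reconstruction of the corporate-ownership KG.
# Entities beyond those named in the text (Bob, OpCo_Beta, FinCo,
# Delaware, British_Virgin_Islands) are assumptions.
triples = [
    # SUBSIDIARY_OF: the ownership cascade up to the parent
    ("OpCo_Alpha", "SUBSIDIARY_OF", "HoldCo_A"),
    ("OpCo_Beta",  "SUBSIDIARY_OF", "HoldCo_A"),
    ("FinCo",      "SUBSIDIARY_OF", "HoldCo_B"),
    ("HoldCo_A",   "SUBSIDIARY_OF", "ParentCorp"),
    ("HoldCo_B",   "SUBSIDIARY_OF", "ParentCorp"),
    # CONTROLS: Alice's control chain through the holding structure
    ("Alice", "CONTROLS", "ParentCorp"),
    ("Alice", "CONTROLS", "HoldCo_A"),
    ("Alice", "CONTROLS", "HoldCo_B"),
    # DIRECTOR_OF: board seats
    ("Alice", "DIRECTOR_OF", "ParentCorp"),
    ("Bob",   "DIRECTOR_OF", "HoldCo_A"),
    ("Bob",   "DIRECTOR_OF", "HoldCo_B"),
    ("Bob",   "DIRECTOR_OF", "OpCo_Alpha"),
    # REGISTERED_IN: jurisdictional registrations
    ("ParentCorp", "REGISTERED_IN", "Cayman_Islands"),
    ("HoldCo_A",   "REGISTERED_IN", "Luxembourg"),
    ("HoldCo_B",   "REGISTERED_IN", "Luxembourg"),
    ("OpCo_Alpha", "REGISTERED_IN", "Delaware"),
    ("OpCo_Beta",  "REGISTERED_IN", "Delaware"),
    ("FinCo",      "REGISTERED_IN", "British_Virgin_Islands"),
]

entities  = sorted({h for h, _, _ in triples} | {t for _, _, t in triples})
relations = sorted({r for _, r, _ in triples})
print(len(entities), "entities |", len(relations), "relation types |",
      len(triples), "triples")
```

The list-of-tuples form is deliberately primitive: every algorithm in this chapter — pattern matching, forward chaining, embedding training — consumes exactly this representation.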
Reading the figure. The color of each edge encodes the relation type: dark blue for SUBSIDIARY_OF edges that trace the ownership cascade from operating companies up to the parent, red for CONTROLS edges from Alice down through the holding structure, orange for DIRECTOR_OF board seats, and green for REGISTERED_IN jurisdictional registrations. The two individuals (purple) and the four jurisdictions (grey) are qualitatively different entity types from the corporate entities, though in a pure RDF triple-store they are all just URIs. This small graph already illustrates the core richness of a knowledge graph: the same pair of entities (Alice and ParentCorp) can participate in multiple distinct relations simultaneously, and the type of the relation determines what inferences are legally or logically warranted.
RDF and SPARQL — The Web of Data Layer
The Resource Description Framework
The Resource Description Framework (RDF) is the W3C standard for representing triple-store knowledge graphs on the web. Published in 1999 and revised in 2004 and 2014, RDF represents every entity and relation as a Uniform Resource Identifier (URI) — a globally unique string that provides an unambiguous name for the concept regardless of which database or document it appears in. This is the property that makes the web of data federated: Wikidata’s URI for Apple Inc. (wd:Q312) and Refinitiv’s PermID URI for the same company refer to the same real-world entity, and the two identifiers can be aligned via owl:sameAs triples.
An RDF triple \(\langle s, p, o \rangle\) — subject, predicate, object — corresponds directly to \((h, r, t)\) in the notation of equation (9.1). In Turtle syntax, the corporate-ownership KG writes as:
@prefix corp: <http://example.org/corp/> .
@prefix rel: <http://example.org/rel/> .
corp:OpCo_Alpha rel:SUBSIDIARY_OF corp:HoldCo_A .
corp:HoldCo_A rel:SUBSIDIARY_OF corp:ParentCorp .
corp:Alice rel:CONTROLS corp:ParentCorp .
corp:Alice rel:CONTROLS corp:HoldCo_A .
corp:ParentCorp rel:REGISTERED_IN corp:Cayman_Islands .
The indirection through URIs looks verbose, but it is what allows two organizations to independently publish RDF datasets and then merge them automatically: the URIs serve as the join key across databases.
SPARQL: querying a knowledge graph
SPARQL (SPARQL Protocol and RDF Query Language) is the standard query language for RDF graphs, analogous to SQL for relational databases. A SPARQL SELECT query specifies a graph pattern — a set of triple patterns with some variables (prefixed with ?) — and returns all substitutions of the variables that match triples in the graph.
The following SPARQL query — written here against our example namespace; an analogous query against Wikidata’s public SPARQL endpoint (https://query.wikidata.org) would use the wd:/wdt: prefixes instead — retrieves all companies that Alice controls and the jurisdictions in which those companies are registered:
# Non-executing SPARQL illustration
# (requires rdflib or a live SPARQL endpoint — not available in Pyodide)
import rdflib
from rdflib.plugins.sparql import prepareQuery
SPARQL_QUERY = """
PREFIX rel: <http://example.org/rel/>
PREFIX corp: <http://example.org/corp/>
SELECT ?company ?jurisdiction
WHERE {
corp:Alice rel:CONTROLS ?company .
?company rel:REGISTERED_IN ?jurisdiction .
}
"""

RDFLib and SPARQL endpoint clients are not available in Pyodide. The code block above is illustrative only. In a local Python environment, you would install rdflib via pip install rdflib and either parse a Turtle file or connect to a remote SPARQL endpoint. For Wikidata queries, the SPARQLWrapper library provides a thin HTTP client. The live cell below re-implements the same logic as a pure-Python pattern matcher over our in-memory triple list, which does run in Pyodide.
Live cell: a tiny SPARQL-like pattern matcher
The cell below implements a minimal in-memory triple store with a query function that accepts a list of (subject_pattern, predicate_pattern, object_pattern) triple patterns — where None acts as a wildcard — and returns all satisfying variable bindings. This captures the core semantics of SPARQL’s basic graph pattern matching.
The pattern matcher handles all four standard SPARQL triple-pattern shapes: bound-bound-bound (exact fact lookup), bound-bound-unbound (find the object given subject and predicate), unbound-bound-bound (find the subject given predicate and object), and the two-hop join that simulates a SELECT with two joined triple patterns. The only difference from full SPARQL is the absence of aggregation, OPTIONAL clauses, and FILTER expressions — all of which are syntactic sugar over the same basic graph-pattern-matching semantics.
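A sketch of such a matcher. Strings beginning with `?` act as variables, `None` acts as an unbound wildcard, and the triple list here is a small excerpt of the chapter's KG (the helper names `match_pattern` and `query` are our own, not SPARQL terminology).

```python
# A minimal SPARQL-like basic-graph-pattern matcher over an in-memory
# triple list. "?x" terms are variables; None matches anything without binding.
TRIPLES = [
    ("Alice",      "CONTROLS",      "ParentCorp"),
    ("Alice",      "CONTROLS",      "HoldCo_A"),
    ("ParentCorp", "REGISTERED_IN", "Cayman_Islands"),
    ("HoldCo_A",   "REGISTERED_IN", "Luxembourg"),
]

def is_var(term):
    return isinstance(term, str) and term.startswith("?")

def match_pattern(pattern, triple, binding):
    """Try to unify one (s, p, o) pattern with one triple under a binding."""
    new = dict(binding)
    for pat, val in zip(pattern, triple):
        if pat is None:                      # wildcard: matches, no binding
            continue
        if is_var(pat):                      # variable: bind or check consistency
            if pat in new and new[pat] != val:
                return None
            new[pat] = val
        elif pat != val:                     # constant: must match exactly
            return None
    return new

def query(patterns, triples=TRIPLES):
    """Return every variable binding that satisfies all patterns (a join)."""
    bindings = [{}]
    for pattern in patterns:
        bindings = [b2 for b in bindings for t in triples
                    if (b2 := match_pattern(pattern, t, b)) is not None]
    return bindings

# Two-hop join: companies Alice controls, and where they are registered.
results = query([("Alice", "CONTROLS", "?company"),
                 ("?company", "REGISTERED_IN", "?jurisdiction")])
for b in results:
    print(b["?company"], "->", b["?jurisdiction"])
```

The join is implemented by threading the binding dictionary through successive patterns — exactly the semantics of a SPARQL basic graph pattern, minus FILTER, OPTIONAL, and aggregation.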
Wikidata as a research substrate. Wikidata (wikidata.org) is the largest publicly accessible general-purpose knowledge graph, maintained by the Wikimedia Foundation. As of 2025 it contains over 107 million items and 1.5 billion statements. Its public SPARQL endpoint (query.wikidata.org) accepts standard SPARQL 1.1 queries and returns results in JSON, CSV, or Turtle. For academic research, Wikidata is an extraordinary resource: you can retrieve the full board-of-directors history of every Fortune 500 company, every CEO appointment and departure date, every subsidiary relationship, and every jurisdiction of incorporation — all from a single SPARQL query. The Wikidata Query Service imposes a 60-second timeout per query and a rate limit, but for exploratory work these constraints are rarely binding. For production-scale extractions, the full Wikidata dump (approximately 80 GB compressed) is released weekly in JSON format and can be loaded into a local RDFLib or Virtuoso triple store.
Schema vs. Ontology
What a schema does — and what it cannot do
A schema is a type-level constraint on what triples are permitted. In a relational database, a schema says: the employees table has a foreign key into departments. In a knowledge graph, a schema says: the domain of CEO_OF must be a Person and its range must be a Company. This is sometimes called a type signature, and enforcing it prevents nonsense triples like (NASDAQ, CEO_OF, Germany).
A schema answers the question can this triple exist? It does not answer the question what can I logically deduce from these triples? That is the job of an ontology.
OWL and the inferential layer
The Web Ontology Language (OWL), built on top of RDF and published by the W3C, extends the schema with logical axioms that enable automated reasoning. The key axiom types relevant to corporate and financial KGs are:
Transitivity. If SUBSIDIARY_OF is declared transitive, then from \[(\text{OpCo\_Alpha}, \text{SUBSIDIARY\_OF}, \text{HoldCo\_A}) \quad \text{and} \quad (\text{HoldCo\_A}, \text{SUBSIDIARY\_OF}, \text{ParentCorp})\] the reasoner infers \[(\text{OpCo\_Alpha}, \text{SUBSIDIARY\_OF}, \text{ParentCorp}).\]
Formally, a relation \(r\) is transitive if: for all \(a, b, c \in \mathcal{E}\), \[[(a, r, b) \in \mathcal{G}] \wedge [(b, r, c) \in \mathcal{G}] \;\Rightarrow\; (a, r, c) \in \mathcal{G}. \tag{9.2}\]
SUBSIDIARY_OF, ANCESTOR_OF, LOCATED_IN, and PART_OF are canonical transitive relations in corporate and geographic ontologies.
Symmetry. If SPOUSE_OF is declared symmetric, then \((a, \text{SPOUSE\_OF}, b) \Rightarrow (b, \text{SPOUSE\_OF}, a)\). Formally: \[[(a, r, b) \in \mathcal{G}] \;\Rightarrow\; (b, r, a) \in \mathcal{G}. \tag{9.3}\]
Inverse relations. PARENT_OF and CHILD_OF are mutually inverse: \((a, \text{PARENT\_OF}, b) \Leftrightarrow (b, \text{CHILD\_OF}, a)\). More generally, if \(r^{-1}\) denotes the inverse of relation \(r\): \[[(a, r, b) \in \mathcal{G}] \;\Leftrightarrow\; (b, r^{-1}, a) \in \mathcal{G}. \tag{9.4}\]
Functional properties. CEO_OF is (approximately) functional — a company has at most one current CEO. If the reasoner finds \((A, \text{CEO\_OF}, \text{Apple})\) and \((B, \text{CEO\_OF}, \text{Apple})\) simultaneously, it can either infer \(A = B\) (OWL makes no unique name assumption, so two identifiers may denote the same individual) or, if \(A\) and \(B\) are declared distinct, flag an inconsistency.
Disjoint classes. In a well-designed corporate ontology, Person and Company are disjoint: no entity can be both. This lets the reasoner detect inconsistent data — for example, a triple asserting that Apple is a natural person.
Live cell: transitive-closure reasoner
The cell below implements a minimal rule-based reasoner that applies the transitivity axiom (equation 9.2) to the SUBSIDIARY_OF relation in the corporate-ownership KG. It iterates to a fixed point, inferring new triples until no further inferences are possible. This is the simplest form of forward chaining — applying rules to generate new facts from existing ones.
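A sketch of the forward-chaining loop, on a four-triple excerpt of the chapter's KG:

```python
# Forward chaining of the transitivity axiom (equation 9.2) for
# SUBSIDIARY_OF, iterated to a fixed point.
kg = {
    ("OpCo_Alpha", "SUBSIDIARY_OF", "HoldCo_A"),
    ("HoldCo_A",   "SUBSIDIARY_OF", "ParentCorp"),
    ("HoldCo_B",   "SUBSIDIARY_OF", "ParentCorp"),
    ("Alice",      "CONTROLS",      "ParentCorp"),
}

def transitive_closure(triples, relation):
    """Apply (a r b) & (b r c) => (a r c) until no new facts appear."""
    facts = set(triples)
    while True:
        inferred = {(a, relation, c)
                    for (a, r1, b) in facts if r1 == relation
                    for (b2, r2, c) in facts if r2 == relation and b2 == b}
        new = inferred - facts
        if not new:           # fixed point reached: nothing left to infer
            return facts
        facts |= new

closed = transitive_closure(kg, "SUBSIDIARY_OF")
for triple in sorted(closed - kg):
    print("inferred:", triple)
```

Each pass joins the relation with itself and adds any fresh conclusions; the loop terminates because the entity set is finite, so the set of possible triples is too.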
Reading the output. The base KG asserts only direct subsidiary relationships — OpCo_Alpha is a direct subsidiary of HoldCo_A, and HoldCo_A is a direct subsidiary of ParentCorp. The reasoner applies equation (9.2) iteratively and infers that OpCo_Alpha is also a subsidiary of ParentCorp, two hops away. Without this inference, a compliance analyst searching for “all subsidiaries of ParentCorp” would miss the operating companies entirely. This is precisely the gap that ontological reasoning closes: the data does not need to be exhaustive, because the ontology fills in the logical consequences.
The Google Knowledge Graph. When you search for a person, company, or place on Google, the information card on the right side of the results page draws from the Google Knowledge Graph — a proprietary KG first announced in May 2012. Google has never disclosed its full scale, but it was reported to contain 570 million entities and 18 billion facts at launch; the 2025 figure is likely an order of magnitude larger. The KG underpins not just information cards but also question answering in Google Search, the structured data in Google Maps, and the entity-recognition layer in Google Ads targeting. Google’s KG combines human-curated data from sources like CIA World Factbook, Wikipedia, and Wikidata with machine-learned extractions from web documents. The result is a hybrid: a hand-maintained ontology (entity types, key relations, constraints) layered on top of a statistically extracted fact base. The tension between these two layers — logical rigor versus statistical coverage — is a theme we will return to in the limits section.
Building a KG From Text — The IE Pipeline
The information extraction challenge
The knowledge graphs we have been working with treat triples as given. In practice, most of the world’s structured knowledge begins as unstructured text: news articles, regulatory filings, earnings calls, scientific abstracts, contracts. Extracting triples from text is the problem of information extraction (IE), and it divides into three coupled subproblems:
1. Named Entity Recognition (NER). Identify and classify spans of text as entities of known types. In a financial document, NER tags Lehman Brothers as ORG, September 2008 as DATE, \$639 billion as MONEY, and New York as GPE (geopolitical entity). Modern NER systems use fine-tuned transformer models (BERT, RoBERTa) and achieve F1 scores above 90% on standard benchmarks like CoNLL-2003.
2. Entity Linking (EL). Map each recognized mention to a canonical entity in a target KG. The mention “Apple” may refer to Apple Inc. (Wikidata: Q312), Apple Records (Q187714), or a fruit. EL resolves the ambiguity by computing a similarity score between the mention’s context and candidate entities, typically using dense vector retrieval over the KG’s entity embeddings.
3. Relation Extraction (RE). Given two linked entities within a sentence or document, classify the relation between them. For the sentence “Tim Cook has served as Apple’s chief executive since 2011,” a relation extractor should produce (Tim_Cook, CEO_OF, Apple). State-of-the-art RE systems — including the REBEL model (Cabot and Navigli, 2021) and OpenIE 6 (Kolluru et al., 2020) — use sequence-to-sequence transformers to jointly perform entity and relation extraction.
The canonical pipeline
The full IE pipeline requires spaCy, transformers, and GPU compute — none of which are available in Pyodide. The code block below illustrates the canonical three-stage pipeline and is intended for a local environment with pip install spacy transformers. It will not execute in the browser.
# Non-executing IE pipeline illustration
# Requires: pip install spacy transformers
# and: python -m spacy download en_core_web_trf
import spacy
from transformers import pipeline
# Stage 1 — NER with a fine-tuned transformer
nlp = spacy.load("en_core_web_trf")
doc = nlp("Lehman Brothers, a subsidiary of Lehman Holdings, "
"filed for bankruptcy in September 2008.")
for ent in doc.ents:
print(f" [{ent.label_}] '{ent.text}'")
# Stage 2 — Entity Linking (EL) via Wikidata SPARQL
# (simplified: in practice, use a dense retrieval model over entity embeddings)
# Stage 3 — Relation Extraction with REBEL
# REBEL: https://huggingface.co/Babelscape/rebel-large
triplet_extractor = pipeline(
"text2text-generation",
model="Babelscape/rebel-large",
tokenizer="Babelscape/rebel-large",
)
text = ("Lehman Brothers was a subsidiary of Lehman Holdings. "
"Richard Fuld served as CEO of Lehman Brothers.")
extracted = triplet_extractor(text, return_tensors=True,
return_text=False)[0]
# Parse REBEL's linearized output into (head, relation, tail) triples
# ...

The IE pipeline is the bridge between the unstructured world — the trillion-word corpus of financial news, regulatory filings, and scientific literature — and the structured world of knowledge graph triples. The NLP details belong to the companion course on social media analytics; what matters for this chapter is understanding that every triple in a real-world KG was either manually curated or machine-extracted, and that machine extraction introduces noise which the embedding models of the following sections must be robust to.
KG Embeddings — The Key Idea
Why arithmetic in vector space?
The corporate-ownership KG we built in Section 1 has 18 triples. Wikidata has 1.5 billion. At that scale, even a sophisticated SPARQL query engine struggles with multi-hop inference, and the forward-chaining reasoner of Section 3 — which is correct but exponential in the worst case — is completely infeasible.
The solution the research community converged on between 2013 and 2019 is to represent entities and relations as vectors in \(\mathbb{R}^d\) — called embeddings — and to score the plausibility of a triple using arithmetic on those vectors. The idea is precise: we want a function \(f: \mathcal{E} \times \mathcal{R} \times \mathcal{E} \to \mathbb{R}\) such that
\[f(h, r, t) > f(h', r, t') \quad \text{whenever } (h, r, t) \text{ is a true fact and } (h', r, t') \text{ is not.} \tag{9.5}\]
If we can learn such a function from the observed triples, then:
- Link prediction: given a query (Alice, CONTROLS, ?), score all candidate tail entities and return the top-ranked one — this is how missing facts are predicted.
- Triple classification: given a candidate triple not in the KG, threshold the score to decide whether it is likely true or likely false.
- Entity alignment: if two KGs use different entity identifiers for the same real-world entity, their embeddings will be nearby in \(\mathbb{R}^d\) after joint training.
- Question answering: a natural-language question can be mapped to a (head, relation, ?) query via semantic parsing, and the embedding model resolves it.
The training objective: contrastive learning
All KG embedding methods share the same training setup. Observed triples are positive examples. We generate negative examples by corrupting each positive triple: replace \(h\) with a randomly sampled entity \(h'\) (head corruption) or replace \(t\) with \(t'\) (tail corruption), producing triples \((h', r, t)\) or \((h, r, t')\) that are assumed to be false.
The canonical training loss is the margin-based hinge loss:
\[\mathcal{L} = \sum_{(h, r, t) \in \mathcal{G}} \;\sum_{(h', r, t') \in \mathcal{G}'_{(h,r,t)}} \left[\gamma + f(h', r, t') - f(h, r, t)\right]_+ \tag{9.6}\]
where \([\cdot]_+ = \max(0, \cdot)\) is the hinge function, \(\gamma > 0\) is the margin hyperparameter, and \(\mathcal{G}'_{(h,r,t)}\) is the set of negative triples generated by corrupting \((h, r, t)\). The loss is zero when the score of every true triple exceeds the score of every negative triple by at least \(\gamma\). Minimizing this loss via stochastic gradient descent with respect to the entity and relation embedding matrices is the universal training recipe; only the scoring function \(f\) differs across methods.
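The corruption-and-hinge recipe can be sketched independently of any particular scoring function. The toy scorer, entity list, and helper names (`corrupt`, `hinge_loss`) below are illustrative assumptions; a trained model such as TransE would supply the real \(f\).

```python
import numpy as np

rng = np.random.default_rng(0)
entities = ["Alice", "ParentCorp", "HoldCo_A", "Luxembourg"]
positives = [("Alice", "CONTROLS", "ParentCorp"),
             ("HoldCo_A", "REGISTERED_IN", "Luxembourg")]

def corrupt(h, r, t, n=2):
    """Head- or tail-corrupt a positive triple (may produce false negatives)."""
    out = []
    for _ in range(n):
        e = entities[rng.integers(len(entities))]
        out.append((e, r, t) if rng.random() < 0.5 else (h, r, e))
    return out

def hinge_loss(score, triples, gamma=1.0):
    """Equation (9.6): sum of margin violations over corrupted triples."""
    total = 0.0
    for (h, r, t) in triples:
        for neg in corrupt(h, r, t):
            total += max(0.0, gamma + score(*neg) - score(h, r, t))
    return total

# Stand-in scorer: +1 for observed triples, 0 otherwise.
score = lambda h, r, t: 1.0 if (h, r, t) in positives else 0.0
loss = hinge_loss(score, positives)
print("hinge loss:", loss)
```

Note that uniform corruption can regenerate a true triple (a false negative) — an unavoidable consequence of the open-world assumption discussed later in the chapter.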
TransE: Translation in Embedding Space
The translation hypothesis
TransE, introduced by Bordes, Usunier, Garcia-Duran, Weston, and Yakhnenko at NeurIPS 2013, is the simplest and most influential KG embedding method. Its central hypothesis is geometric:
If \((h, r, t)\) is a true triple, then the vector sum of the head embedding and the relation embedding should approximately equal the tail embedding: \(\mathbf{h} + \mathbf{r} \approx \mathbf{t}\).
This is the translation model — the relation vector \(\mathbf{r}\) acts as a translation operator that moves \(\mathbf{h}\) toward \(\mathbf{t}\) in embedding space. The scoring function is the negative distance:
\[f_{\text{TransE}}(h, r, t) = -\|\mathbf{h} + \mathbf{r} - \mathbf{t}\|_2 \tag{9.7}\]
High scores (small distances) indicate plausible triples; low scores (large distances) indicate implausible ones. The embeddings \(\mathbf{h}, \mathbf{r}, \mathbf{t} \in \mathbb{R}^d\) are the parameters to be learned; \(d\) is typically 50–200 in practice.
The translation analogy is elegant. Consider the celebrated word2vec analogy: king - man + woman ≈ queen. TransE discovers the same structure for knowledge graph relations: Lehman_Brothers + SUBSIDIARY_OF ≈ Lehman_Holdings. The relation vector encodes the direction in embedding space that transforms one entity into its relatum.
TransE has one important limitation: it cannot handle symmetric relations. If \((a, r, b)\) is true, then \(\mathbf{a} + \mathbf{r} \approx \mathbf{b}\). If \((b, r, a)\) is also true (as it would be for SPOUSE_OF), then \(\mathbf{b} + \mathbf{r} \approx \mathbf{a}\), which means \(\mathbf{r} \approx \mathbf{0}\) — the relation vector collapses to zero. TransE also struggles with 1-to-N relations like CEO_OF_COMPANY when a company has multiple historical CEOs. These limitations motivated the later methods covered in Section 7.
Live cell: TransE from scratch
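A from-scratch sketch of TransE training in NumPy, under illustrative assumptions: squared-distance scoring (which gives cleaner gradients than the raw \(\ell_2\) norm in equation 9.7), tail-only corruption, unit-ball projection of entity embeddings, and toy hyperparameters (\(d = 16\), \(\gamma = 1\), 200 epochs). The triple list is an excerpt of the chapter's KG.

```python
import numpy as np

triples = [
    ("OpCo_Alpha", "SUBSIDIARY_OF", "HoldCo_A"),
    ("HoldCo_A",   "SUBSIDIARY_OF", "ParentCorp"),
    ("HoldCo_B",   "SUBSIDIARY_OF", "ParentCorp"),
    ("Alice",      "CONTROLS",      "ParentCorp"),
    ("Alice",      "CONTROLS",      "HoldCo_A"),
    ("ParentCorp", "REGISTERED_IN", "Cayman_Islands"),
    ("HoldCo_A",   "REGISTERED_IN", "Luxembourg"),
    ("HoldCo_B",   "REGISTERED_IN", "Luxembourg"),
]
ents = sorted({x for h, _, t in triples for x in (h, t)})
rels = sorted({r for _, r, _ in triples})
eid = {e: i for i, e in enumerate(ents)}
rid = {r: i for i, r in enumerate(rels)}
idx = [(eid[h], rid[r], eid[t]) for h, r, t in triples]

rng = np.random.default_rng(42)
d, gamma, lr = 16, 1.0, 0.05
E = rng.normal(scale=0.1, size=(len(ents), d))   # entity embeddings
R = rng.normal(scale=0.1, size=(len(rels), d))   # relation embeddings

def sqdist(h, r, t):
    v = E[h] + R[r] - E[t]
    return v @ v                                  # squared ||h + r - t||

for epoch in range(200):
    # project entity embeddings back into the unit ball (Bordes et al. trick)
    E /= np.maximum(np.linalg.norm(E, axis=1, keepdims=True), 1.0)
    for h, r, t in idx:
        tn = int(rng.integers(len(ents)))         # tail corruption
        if gamma + sqdist(h, r, t) - sqdist(h, r, tn) > 0:  # hinge violated
            gp = 2 * (E[h] + R[r] - E[t])         # grad of positive distance
            gn = 2 * (E[h] + R[r] - E[tn])        # grad of negative distance
            E[h] -= lr * (gp - gn)
            R[r] -= lr * (gp - gn)
            E[t] += lr * gp                       # pull the true tail closer
            E[tn] -= lr * gn                      # push the corrupted tail away

# Analogy check: E[HoldCo_A] + R[SUBSIDIARY_OF] should land near ParentCorp.
q = E[eid["HoldCo_A"]] + R[rid["SUBSIDIARY_OF"]]
ranking = sorted(ents, key=lambda e: np.linalg.norm(q - E[eid[e]]))
print("nearest entities to HoldCo_A + SUBSIDIARY_OF:", ranking[:3])
```

On a KG this small the outcome is sensitive to initialization; the seed is fixed so the cell is reproducible.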
Reading the output. After 200 epochs the hinge loss should have declined substantially from its initial value. The analogy check shows how close the query vector E[HoldCo_A] + R[SUBSIDIARY_OF] is to each entity embedding. If ParentCorp appears in the top-1 or top-2 positions, the translation hypothesis has been learned successfully: the model has discovered that the SUBSIDIARY_OF relation vector points from child to parent in embedding space. On a KG this small, you may see ParentCorp appear in rank 1–3 after 200 epochs; on larger KGs with hundreds of thousands of triples, TransE reliably places the correct entity in the top 10 across the full spectrum of relation types.
DistMult, ComplEx, and RotatE
DistMult: the bilinear model (Yang et al., 2014)
DistMult replaces the geometric translation with a bilinear inner product. The entity embeddings \(\mathbf{h}, \mathbf{t} \in \mathbb{R}^d\) and the relation embedding \(\mathbf{r} \in \mathbb{R}^d\) (a diagonal weight vector) enter through a three-way dot product:
\[f_{\text{DistMult}}(h, r, t) = \langle \mathbf{h},\, \mathbf{r},\, \mathbf{t} \rangle = \sum_{k=1}^{d} \mathbf{h}_k\, \mathbf{r}_k\, \mathbf{t}_k \tag{9.8}\]
The diagonal restriction on \(r\) — as opposed to a full matrix — keeps the model parameter-efficient and makes the score symmetric in \((h, t)\): \(f(h, r, t) = f(t, r, h)\) for all \(r\). This is simultaneously DistMult’s strength (symmetric relations like SPOUSE_OF and CO_AUTHOR_OF are handled correctly) and its fundamental limitation: it cannot assign different scores to \((h, r, t)\) and \((t, r, h)\) when the relation is asymmetric. CEO_OF, SUBSIDIARY_OF, and PARENT_OF are all asymmetric, and DistMult will systematically fail to rank them correctly.
ComplEx: moving to complex space (Trouillon et al., 2016)
ComplEx retains the bilinear form but lifts the embeddings from \(\mathbb{R}^d\) to \(\mathbb{C}^d\). The scoring function is the real part of the complex three-way inner product:
\[f_{\text{ComplEx}}(h, r, t) = \text{Re}\!\left(\sum_{k=1}^{d} \mathbf{h}_k\, \mathbf{r}_k\, \overline{\mathbf{t}}_k\right) \tag{9.9}\]
where \(\overline{\mathbf{t}}_k\) denotes the complex conjugate of \(\mathbf{t}_k\). The real part of \(\mathbf{h}_k \mathbf{r}_k \overline{\mathbf{t}}_k\) is symmetric in \((h, t)\) when \(r\) is real, and antisymmetric when \(r\) is imaginary. By allowing \(\mathbf{r}_k \in \mathbb{C}\), ComplEx can simultaneously model symmetric relations (through the real part of \(\mathbf{r}\)) and antisymmetric relations (through the imaginary part). Trouillon et al. proved that ComplEx is a universal approximator for any combination of symmetric and antisymmetric relations on a finite entity set — a theoretical guarantee that TransE and DistMult lack.
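The symmetry/antisymmetry split can be verified numerically. The sketch below, on random toy vectors, checks that a purely real relation vector makes the ComplEx score of equation (9.9) symmetric in \((h, t)\), while a purely imaginary one makes it antisymmetric.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
h = rng.normal(size=d) + 1j * rng.normal(size=d)
t = rng.normal(size=d) + 1j * rng.normal(size=d)

def complex_score(h, r, t):
    """Equation (9.9): Re(sum_k h_k r_k conj(t_k))."""
    return float(np.real(np.sum(h * r * np.conj(t))))

r_real = rng.normal(size=d).astype(complex)   # Im(r) = 0: symmetric part
r_imag = 1j * rng.normal(size=d)              # Re(r) = 0: antisymmetric part

sym_ok = np.isclose(complex_score(h, r_real, t), complex_score(t, r_real, h))
asym_ok = np.isclose(complex_score(h, r_imag, t), -complex_score(t, r_imag, h))
print("real r is symmetric:", sym_ok)
print("imaginary r is antisymmetric:", asym_ok)
```

A general \(\mathbf{r} \in \mathbb{C}^d\) mixes the two behaviors dimension by dimension, which is exactly what gives ComplEx its expressiveness.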
RotatE: relations as rotations (Sun et al., 2019)
RotatE interprets each relation as an element-wise rotation in the complex plane. The entity embeddings \(\mathbf{h}, \mathbf{t}\) live in \(\mathbb{C}^d\), and each relation embedding \(\mathbf{r} \in \mathbb{C}^d\) is constrained to unit modulus: \(|\mathbf{r}_k| = 1\), i.e., \(\mathbf{r}_k = e^{i\theta_k}\) for some angle \(\theta_k\). The translation hypothesis becomes:
\[\mathbf{t} = \mathbf{h} \circ \mathbf{r} \quad \text{with } |\mathbf{r}_k| = 1 \quad \forall k \tag{9.10}\]
where \(\circ\) denotes element-wise multiplication (Hadamard product). The scoring function is:
\[f_{\text{RotatE}}(h, r, t) = -\|\mathbf{h} \circ \mathbf{r} - \mathbf{t}\|_2 \tag{9.11}\]
The unit-modulus constraint on \(\mathbf{r}\) means that each relation dimension rotates the head vector by angle \(\theta_k\) in the complex plane. The beauty of RotatE is that it handles all four relational patterns simultaneously:
- Symmetry: \(\mathbf{r} \circ \mathbf{r} = \mathbf{1}\) (rotating twice returns to the start) iff \(\theta_k \in \{0, \pi\}\), i.e., \(r_k = \pm 1\).
- Antisymmetry: \(\mathbf{r} \ne \mathbf{r}^{-1}\) in general.
- Inversion: the inverse relation corresponds to \(\overline{\mathbf{r}}\) (complex conjugation = rotation by \(-\theta_k\)).
- Composition: two consecutive rotations compose as \(\mathbf{r}_1 \circ \mathbf{r}_2\), capturing paths like CEO_OF composed with SUBSIDIARY_OF.
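These algebraic properties are easy to check numerically. The sketch below, on random toy vectors, verifies the RotatE score of equation (9.11) for symmetry (all phases \(\pi\), i.e., \(r_k = -1\)), inversion (conjugation undoes a rotation), and composition (rotations multiply).

```python
import numpy as np

rng = np.random.default_rng(1)
d = 8
h = rng.normal(size=d) + 1j * rng.normal(size=d)
t = rng.normal(size=d) + 1j * rng.normal(size=d)

def rotate_score(h, r, t):
    """Equation (9.11): negative distance after rotating h by r."""
    return -float(np.linalg.norm(h * r - t))

# Symmetry: r_k = -1 for all k gives r o r = 1, so both orderings score equally.
r_sym = np.full(d, -1.0, dtype=complex)
sym_ok = np.isclose(rotate_score(h, r_sym, t), rotate_score(t, r_sym, h))

# Inversion: the conjugate rotation undoes the original (since |r_k| = 1).
r = np.exp(1j * rng.uniform(0, 2 * np.pi, size=d))
inv_ok = np.allclose(h * r * np.conj(r), h)

# Composition: rotating by r then r2 equals one rotation by r o r2.
r2 = np.exp(1j * rng.uniform(0, 2 * np.pi, size=d))
comp_ok = np.allclose((h * r) * r2, h * (r * r2))

print("symmetry:", sym_ok, "| inversion:", inv_ok, "| composition:", comp_ok)
```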
Before running the cell below: DistMult uses a symmetric scoring function \(f(h,r,t) = f(t,r,h)\). Given this, which type of relation will DistMult fail to rank correctly — symmetric ones like a hypothetical PARTNER_OF, or asymmetric ones like SUBSIDIARY_OF and CONTROLS? Write down your prediction, then examine the MRR comparison between TransE and DistMult.
Live cell: DistMult from scratch and comparison with TransE
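A minimal sketch of the two scoring functions side by side, on random toy embeddings rather than trained ones, demonstrating the symmetry property at the heart of the comparison:

```python
import numpy as np

rng = np.random.default_rng(7)
d = 16
h, r, t = rng.normal(size=(3, d))   # untrained toy embeddings

def distmult(h, r, t):
    """Equation (9.8): three-way product, symmetric in h and t."""
    return float(np.sum(h * r * t))

def transe(h, r, t):
    """Equation (9.7): negative translation distance, asymmetric in h and t."""
    return -float(np.linalg.norm(h + r - t))

print("DistMult (h,r,t):", distmult(h, r, t))
print("DistMult (t,r,h):", distmult(t, r, h))   # identical by commutativity
print("TransE   (h,r,t):", transe(h, r, t))
print("TransE   (t,r,h):", transe(t, r, h))     # differs when r != 0
```

Training changes the numbers but not the algebra: no amount of gradient descent can make DistMult distinguish \((h, r, t)\) from \((t, r, h)\).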
Reading the output. The symmetry test makes the DistMult limitation concrete. Because \(f_{\text{DistMult}}(h, r, t) = \sum_k \mathbf{h}_k \mathbf{r}_k \mathbf{t}_k = f_{\text{DistMult}}(t, r, h)\) by commutativity of multiplication, the scores for (OpCo_Alpha, SUBSIDIARY_OF, HoldCo_A) and (HoldCo_A, SUBSIDIARY_OF, OpCo_Alpha) are identical. TransE, by contrast, assigns different scores to the two orderings because the \(\ell_2\) distance is not symmetric in \(h\) and \(t\) when the relation vector is nonzero.
Link Prediction and the Open-World Assumption
The evaluation protocol
The standard benchmark for KG embedding models is link prediction: given a triple \((h, r, ?)\) with the tail entity held out, rank all candidate tail entities by their score under the model, and measure how well the true tail entity is recovered. The two canonical metrics are:
Mean Reciprocal Rank (MRR). Let \(\text{rank}_{(h,r,t)}\) denote the rank of the true tail entity \(t\) among all \(|\mathcal{E}|\) candidates when scored in descending order. MRR is:
\[\text{MRR} = \frac{1}{|\mathcal{G}|} \sum_{(h,r,t) \in \mathcal{G}} \frac{1}{\text{rank}_{(h,r,t)}} \tag{9.12}\]
MRR lies in \((0, 1]\); MRR \(= 1\) means the true entity is ranked first for every query. The reciprocal rank penalizes lower ranks strongly — ranking the true entity 10th contributes \(0.1\), while ranking it first contributes \(1.0\).
Hits@\(k\). The fraction of test triples for which the true tail entity appears in the top \(k\) candidates:
\[\text{Hits@}k = \frac{1}{|\mathcal{G}|} \sum_{(h,r,t) \in \mathcal{G}} \mathbf{1}[\text{rank}_{(h,r,t)} \leq k] \tag{9.13}\]
Common choices are \(k \in \{1, 3, 10\}\). Hits@10 is perhaps the most widely reported, because many real-world applications require only that the correct entity appear on the first “page” of results.
The evaluation is always reported under the filtered setting (Bordes et al., 2013): before ranking, remove from the candidate list all entities \(t'\) other than the true tail \(t\) for which \((h, r, t')\) is a known true triple in the train or test set. This prevents penalizing the model for correctly ranking other true facts above the specific held-out triple.
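The filtered protocol can be sketched in a few lines. The scores and helper names (`filtered_rank`, `mrr_and_hits`) below are illustrative; in practice the scores come from a trained embedding model.

```python
import numpy as np

def filtered_rank(scores, true_tail, known_tails):
    """Rank of true_tail after dropping other known-true tails (filtered setting)."""
    kept = {e: s for e, s in scores.items()
            if e == true_tail or e not in known_tails}
    ordered = sorted(kept, key=kept.get, reverse=True)
    return ordered.index(true_tail) + 1

def mrr_and_hits(ranks, k=3):
    """Equations (9.12) and (9.13) over a list of filtered ranks."""
    r = np.asarray(ranks, dtype=float)
    return float(np.mean(1.0 / r)), float(np.mean(r <= k))

# Toy query: the model ranks a known-true tail (Cayman_Islands) above the
# held-out tail (Luxembourg); filtering removes it, so the rank is 1, not 2.
scores = {"Cayman_Islands": 0.9, "Luxembourg": 0.8,
          "Delaware": 0.5, "Alice": 0.1}
rank = filtered_rank(scores, "Luxembourg", known_tails={"Cayman_Islands"})
mrr, hits3 = mrr_and_hits([rank, 2, 4])
print("filtered rank:", rank)
print(f"MRR = {mrr:.3f}, Hits@3 = {hits3:.3f}")
```

The same two functions, applied over every held-out triple, produce the benchmark numbers reported throughout the KG embedding literature.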
The open-world assumption
The open-world assumption (OWA) is a foundational distinction between knowledge graphs and relational databases. In a closed-world database, any fact not explicitly recorded is assumed to be false. In a knowledge graph, absence of a triple means only that the fact is unknown — it may be true but unrecorded. This matters profoundly for evaluation:
- Under closed-world semantics, a corrupted triple \((h', r, t)\) that does not appear in \(\mathcal{G}\) is a true negative — it is definitely false, and a good model should score it low.
- Under open-world semantics, the same triple may be true but simply missing from the KG, and scoring it high is the correct behavior (it is a candidate link prediction). The filtered evaluation protocol accounts for this by removing known positives from the negative set, but imperfect coverage of the true KG means residual noise is unavoidable.
Before running the cell: when 20% of triples are held out as the test set, leaving only 80% for training, what MRR do you expect TransE to achieve? Consider that the KG has only 18 triples and 12 entities — the training set has only about 14 triples, which is barely enough data for embedding learning. Predict whether MRR will be above or below 0.30, then run the cell.
Live cell: held-out evaluation
Reading the output. On a KG this small, the held-out evaluation is very noisy — three or four held-out triples is too few to estimate MRR with any precision. The main purpose of this cell is to demonstrate the evaluation protocol rather than benchmark a model. Notice that the filtered ranking removes known correct answers from the candidate pool: if (HoldCo_B, REGISTERED_IN, Luxembourg) is in the training set and the query is (HoldCo_A, REGISTERED_IN, ?), Luxembourg should not be penalized for appearing highly ranked even though it is not the query triple’s tail. The true benchmark performance of TransE — MRR of 0.22 on FB15k (Bordes et al., 2013), rising to 0.31 on FB15k-237 with improved negative sampling — comes from training on hundreds of thousands of triples over millions of gradient steps.
Applications
Wikidata and the Google Knowledge Graph
The public knowledge graph ecosystem is anchored by two systems. Wikidata is the collaborative, editable KG maintained by the Wikimedia Foundation, covering general human knowledge across hundreds of millions of items. Its SPARQL endpoint is the single most useful research tool for social scientists, economists, and network analysts working with entity-centric data — it provides structured access to corporate histories, biographical data, geographic hierarchies, and political relationships that would take months to assemble from primary sources.
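Querying the endpoint from Python takes only a few lines. A minimal sketch — the query below uses Wikidata's `P749` ("parent organization") property and `Q312` (Apple Inc.); the network call is shown but commented out so the snippet runs offline:

```python
# Build a SPARQL query for the Wikidata endpoint: all items whose
# parent organization (P749) is the given entity.
ENDPOINT = "https://query.wikidata.org/sparql"

def subsidiaries_query(parent_qid: str) -> str:
    """SPARQL query string for items with parent organization `parent_qid`."""
    return f"""SELECT ?sub ?subLabel WHERE {{
  ?sub wdt:P749 wd:{parent_qid} .
  SERVICE wikibase:label {{ bd:serviceParam wikibase:language "en". }}
}}"""

query = subsidiaries_query("Q312")  # Q312 = Apple Inc.
print(query)

# To execute (requires network access, so not run here):
# import requests
# r = requests.get(ENDPOINT, params={"query": query, "format": "json"})
# for row in r.json()["results"]["bindings"]:
#     print(row["subLabel"]["value"])
```

Every fact retrieved this way is a triple with a stable URI for each entity — exactly the \((h, r, t)\) structure developed in this chapter.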
The Google Knowledge Graph is the proprietary descendant of Google’s 2010 acquisition of Metaweb, the company behind Freebase. It powers the information panels visible in Google Search, the structured answers in Google Assistant, and the entity-disambiguation layer in Google Ads. Its scale — over a trillion facts as of 2024, by internal estimates — is achievable only through a combination of human curation, automated extraction from the web, and alignment with third-party structured data sources. The interface between the Google KG and its user-facing search product is the most commercially significant application of knowledge graph technology in existence.
The Google Knowledge Graph and information cards. The information card (“Knowledge Panel”) that appears when you search for a public company, a CEO, or a scientific concept draws its structured facts directly from the Google Knowledge Graph. The fields displayed — founded date, number of employees, ticker symbol, parent organization — are triples retrieved from the KG, cross-validated across multiple data sources, and formatted for display. For corporate analysts, this matters because it means Google’s KG is effectively performing entity disambiguation at web scale: when a user searches for “Apple CEO,” the KG resolves the query to the triple (Apple_Inc, CURRENT_CEO, Tim_Cook) and displays it without any document retrieval. Errors in the KG surface as errors in the information card, which is why major corporations employ data-quality teams specifically to monitor and correct their KG representation.
Biomedical knowledge graphs
The biomedical domain has produced some of the most mature and scientifically significant knowledge graphs. Hetionet (Himmelstein et al., 2017, published in eLife) is a heterogeneous network integrating 11 types of biological entities — genes, diseases, drugs, biological processes, pathways, anatomical structures, and others — with 24 relation types. With roughly 47,000 nodes and 2.25 million edges, it is small by internet standards but enormous by the standards of curated biological knowledge.
Himmelstein et al. used graph-based features derived from Hetionet to predict which approved drugs could be repurposed to treat diseases for which they had not been approved. The model identified that metformin (a diabetes drug) had significant path-score signal for multiple sclerosis — a prediction that was subsequently supported by independent clinical evidence. This is link prediction as drug repurposing: the KG embedding model predicts a missing (metformin, TREATS, multiple_sclerosis) triple and thereby generates a testable biomedical hypothesis.
Hetionet and drug repurposing as link prediction. The Himmelstein et al. (2017) result is perhaps the clearest real-world demonstration that KG link prediction can generate economically and scientifically valuable outputs. Their method — Project Rephetio — computed a “degree-weighted path count” for every (drug, disease) pair across all meta-paths in Hetionet, then trained a logistic regression model to predict which pairs had prior clinical evidence of repurposing. The resulting predictions were published openly at het.io and have since been validated for several drug-disease pairs. The key insight is that the KG’s multi-relational structure encodes biological mechanism: a drug that inhibits a gene that is upregulated in a disease, and that gene participates in a pathway that is dysregulated in the same disease — this path structure, invisible in any single data type alone, becomes a predictive feature in the KG.
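The degree-weighted path count can be sketched on a toy graph. This is a simplified rendering of Rephetio's feature (damping exponent \(w = 0.4\), as in the paper); the entities, triples, and two-edge metapath below are illustrative, not drawn from Hetionet:

```python
# Simplified degree-weighted path count (DWPC) over a toy multi-relational KG.
from collections import defaultdict

triples = [
    ("drug_X", "INHIBITS", "gene_A"),            # hypothetical entities
    ("drug_X", "INHIBITS", "gene_B"),
    ("gene_A", "ASSOCIATED_WITH", "disease_Y"),
    ("gene_B", "ASSOCIATED_WITH", "disease_Y"),
    ("gene_B", "ASSOCIATED_WITH", "disease_Z"),
]

# Adjacency and per-relation node degrees.
adj = defaultdict(list)
degree = defaultdict(int)
for h, r, t in triples:
    adj[(h, r)].append(t)
    degree[(h, r)] += 1   # out-degree of h under relation r
    degree[(t, r)] += 1   # in-degree of t under relation r

def dwpc(source, target, metapath, w=0.4):
    """Sum, over metapath-conforming paths, of the degree-damped path weight."""
    paths = [(source, 1.0)]
    for r in metapath:
        nxt = []
        for node, weight in paths:
            for nb in adj[(node, r)]:
                # Damp each edge by both endpoint degrees, as in Rephetio.
                nxt.append((nb, weight * degree[(node, r)]**-w
                                       * degree[(nb, r)]**-w))
        paths = nxt
    return sum(wt for node, wt in paths if node == target)

print(dwpc("drug_X", "disease_Y", ["INHIBITS", "ASSOCIATED_WITH"]))
```

The damping is the point: a path through a promiscuous hub gene contributes less than a path through a specific one, so the score rewards mechanistically specific connections.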
Financial entity graphs
The financial services industry has deployed knowledge graphs for three interrelated compliance problems: anti-money laundering (AML), sanctions screening, and beneficial ownership tracing. Each requires answering the same fundamental question that our corporate-ownership KG was designed to illustrate: given a flagged entity, what other entities are connected to it through ownership, control, or directorship chains, and does any of those connected entities appear on a watch list?
Bloomberg’s BBKG (Bloomberg Business Knowledge Graph) integrates structured data from regulatory filings, stock exchange records, and news extractions to maintain a continuously updated graph of corporate entities and their relationships. It underpins Bloomberg’s entity-disambiguation layer in the Terminal and powers the automated population of corporate profiles.
Refinitiv PermID is an open entity identifier system that assigns stable URIs to financial entities — companies, individuals, instruments, organizations, and quotes. PermID triples can be accessed via API and form the backbone of Refinitiv’s (now LSEG’s) entity graph, which is used for NLP entity linking in news analytics and for constructing corporate family trees for index construction.
OpenCorporates is the largest open database of company information, aggregating official registry data from over 140 jurisdictions and exposing it as a queryable API. Its corporate ownership data, combined with data from the ICIJ’s Offshore Leaks database, formed the analytical substrate for the Panama Papers and Pandora Papers investigations.
The Panama Papers and OpenCorporates. In April 2016, the International Consortium of Investigative Journalists (ICIJ) released the Panama Papers — 11.5 million documents leaked from the Panamanian law firm Mossack Fonseca — detailing the use of offshore shell companies to conceal wealth. The scale of the data made manual analysis impossible; the ICIJ used a combination of Neo4j (a labeled property graph database), OpenCorporates beneficial-ownership data, and sanctions-list matching to build a KG of approximately 320,000 entities and 1.3 million relations. Multi-hop graph queries over this KG surfaced previously unknown connections between shell companies, intermediaries, and sanctioned individuals — connections that were invisible in any single document. The Panama Papers investigation won the Pulitzer Prize for Explanatory Reporting in 2017 and directly led to regulatory reforms in the EU’s Anti-Money Laundering Directive.
Enterprise knowledge graphs
Beyond the public and financial domains, enterprise KGs have become infrastructure for large organizations seeking to make their internal data interoperable. Diffbot, Stardog, and Ontotext offer commercial KG platforms that combine web extraction, ontology management, and SPARQL querying. Neo4j, which uses a labeled property graph model rather than pure RDF, is the most widely deployed graph database, with production deployments at NASA, eBay, and Airbus for use cases ranging from supply-chain mapping to fraud detection. The common thread is that KGs dissolve the silo problem: once data is represented as typed triples with stable entity identifiers, information from different source systems can be merged without schema alignment.
GNNs Meet KGs (Briefly)
The relational GCN
Chapter 5 of this book introduced the Graph Convolutional Network (GCN), where each node aggregates representations from its neighbors through a shared linear transformation. The limitation for KGs is obvious: GCNs operate on single-relation graphs, and the shared weight matrix cannot distinguish a SUBSIDIARY_OF neighbor from a DIRECTOR_OF neighbor.
The Relational Graph Convolutional Network (R-GCN), introduced by Schlichtkrull, Kipf, Bloem, van den Berg, Titov, and Welling (2018), resolves this by using a separate weight matrix \(\mathbf{W}_r\) for each relation type. The message-passing update for entity \(i\) at layer \(l\) is:
\[\mathbf{h}_i^{(l+1)} = \sigma\!\left( \mathbf{W}_0^{(l)} \mathbf{h}_i^{(l)} + \sum_{r \in \mathcal{R}} \sum_{j \in \mathcal{N}_r(i)} \frac{1}{|\mathcal{N}_r(i)|} \mathbf{W}_r^{(l)} \mathbf{h}_j^{(l)} \right) \tag{6.14}\]
where \(\mathcal{N}_r(i)\) is the set of neighbors of entity \(i\) connected by relation \(r\), and \(\sigma\) is a nonlinear activation. R-GCN recovers classical GCNs when \(|\mathcal{R}| = 1\) and recovers shallow KG embeddings (without neighborhood aggregation) in the zero-layer case. The link prediction variant of R-GCN uses the learned entity embeddings as inputs to a DistMult decoder — the first end-to-end model to combine structural message-passing with multi-relational scoring.
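The R-GCN update is simple enough to sketch directly in NumPy. This is a single forward pass with random weights — not a trainable implementation, and the toy graph and dimensions are illustrative:

```python
# One R-GCN layer as a NumPy forward pass: self-loop term plus a
# relation-specific mean aggregation over each typed neighborhood.
import numpy as np

rng = np.random.default_rng(0)
n_entities, d = 4, 8
H = rng.normal(size=(n_entities, d))        # entity features h_i^(l)

# Typed neighborhoods N_r(i): relation -> {entity index: [neighbor indices]}
neighbors = {
    "SUBSIDIARY_OF": {0: [1], 1: [2]},
    "DIRECTOR_OF":   {3: [0, 1]},
}
W0 = rng.normal(size=(d, d))                              # self-loop weight W_0
W = {r: rng.normal(size=(d, d)) for r in neighbors}       # one W_r per relation

def rgcn_layer(H, W0, W, neighbors):
    out = H @ W0.T                          # self-connection term W_0 h_i
    for r, nbrs in neighbors.items():
        for i, js in nbrs.items():
            # Mean of relation-transformed neighbor features: 1/|N_r(i)| factor.
            out[i] += np.mean(H[js] @ W[r].T, axis=0)
    return np.maximum(out, 0.0)             # sigma = ReLU

H_next = rgcn_layer(H, W0, W, neighbors)
print(H_next.shape)  # (4, 8)
```

With a single relation and shared weights, the inner loop collapses to the ordinary GCN aggregation — the substitution described in the text.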
CompGCN (Vashishth et al., 2020) generalizes this further by allowing relation embeddings to participate in the message-passing, so that each message carries both an entity feature and a relation-transformed representation. This is the natural completion of the KG embedding story: TransE and DistMult are shallow models that learn entity and relation embeddings from co-occurrence statistics alone; R-GCN and CompGCN are deep models that learn the same embeddings while also propagating structural information through the graph topology. The two perspectives are not in competition — they are two layers of the same generative model.
The GNN machinery for KGs is out of scope for Pyodide computation, because the required libraries (PyTorch Geometric, DGL) cannot run in the browser. The conceptual point is this: everything you learned about GNNs in Chapter 5 applies to KGs, with the simple substitution of a relation-specific weight matrix for the shared weight matrix. The KG-specific scoring functions of Sections 6 and 7 then serve as the output layer.
Limits and Critiques
Coverage is uneven — and the gap is not random
Every knowledge graph reflects the biases of its construction process. Wikidata’s coverage of European and American companies is orders of magnitude denser than its coverage of African or Southeast Asian entities, because its contributors are predominantly English-speaking and headquartered in OECD countries. Hetionet’s gene-disease associations are dominated by the diseases studied in published literature — which are themselves dominated by diseases prevalent in wealthy countries. A link prediction model trained on such a KG will generate predictions that are reliable for well-covered entity types and unreliable, or simply absent, for under-covered ones. The OWA says that the absence of a triple is uninformative in principle; in practice, triples are missing systematically rather than at random, and a model trained on the observed graph inherits that skew.
The static-snapshot problem
Real-world knowledge changes continuously. A CEO appointment, a merger, a bankruptcy, a change of registered address — these are dynamic events that a static KG captures only at a point in time. Most public KGs are updated periodically (Wikidata is updated in near-real-time by human contributors; Bloomberg BBKG is updated with a lag measured in hours to days) but are not truly temporal databases. Temporal KG models, which associate each triple with a validity interval \((t_{\text{start}}, t_{\text{end}})\), are an active research area, but most deployed systems still treat the KG as a snapshot.
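The validity-interval representation described above can be sketched in a few lines. The entities and dates here are illustrative:

```python
# Temporal triples: each fact carries a validity interval (t_start, t_end).
# t_end = None marks a fact that is still valid.
from datetime import date

temporal_triples = [
    ("Jane_Doe", "CEO_OF", "ExampleCorp", date(2015, 3, 1), date(2021, 6, 30)),
    ("John_Roe", "CEO_OF", "ExampleCorp", date(2021, 7, 1), None),
]

def facts_at(triples, when):
    """Return the plain (h, r, t) triples valid at the given date."""
    return [(h, r, t) for h, r, t, start, end in triples
            if start <= when and (end is None or when <= end)]

print(facts_at(temporal_triples, date(2018, 1, 1)))  # Jane_Doe's tenure
print(facts_at(temporal_triples, date(2024, 1, 1)))  # John_Roe's tenure
```

A static snapshot is the special case where every query uses the same `when` — which is exactly what most deployed KGs implicitly do.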
KG vs. LLM: the knowledge representation debate
The rise of large language models since 2020 has reopened a long-standing debate in AI about the right substrate for knowledge representation. LLMs like GPT-4 and Llama 3 have demonstrated that a neural network trained on web text implicitly acquires an enormous amount of factual knowledge — effectively a form of fuzzy, probabilistic KG stored in the model’s weights. This raises the question: why maintain a separate, expensive-to-curate KG when an LLM can answer many of the same queries?
The answer, as of 2025, is that KGs and LLMs have complementary failure modes. LLMs hallucinate: they generate confident but false factual claims at a rate that is currently too high for high-stakes compliance, legal, or scientific applications. KGs are reliable within their scope but have limited coverage and require expensive manual curation. Retrieval-augmented generation (RAG), which grounds LLM responses with facts retrieved from a KG or other structured source, is the most promising current synthesis — and it is precisely the architecture that systems like Bing Chat and Google Gemini deploy for factual queries. This chapter equips you to understand the KG side of that architecture; the LLM side belongs to the next course.
Mini Case Study — A Tiny Beneficial-Ownership Investigation
The scenario
A compliance analyst at a major bank has received a suspicious activity report flagging Corp_Flagged — a corporate entity in an offshore jurisdiction — as a potential front for sanctions evasion. The bank’s internal system has identified Individual_Z as an individual on a sanctions list. The question is: does a beneficial-ownership chain exist connecting Corp_Flagged to Individual_Z?
The analyst has access to a beneficial-ownership KG assembled from company registry filings, corporate intelligence databases, and leaked documents. The KG has 20 entities — a mix of companies, individuals, and jurisdictions — and four relation types: OWNS, CONTROLS, DIRECTOR_OF, and REGISTERED_IN. The complication: the ownership chain may not be explicit in the KG. Beneficial owners typically interpose multiple shell companies to obscure the connection. Several OWNS triples that would reveal the chain are missing from the KG — either because the filings were never made, were made in a jurisdiction with no public registry, or because the data simply has not been scraped yet.
The analyst trains a TransE model on the available triples and uses link prediction to surface the most likely missing OWNS triples. The result is a ranked list of candidate ownership connections that human investigators can then verify through targeted data requests.
Live cell: beneficial-ownership KG and TransE investigation
Interpreting the investigation
The cell above demonstrates the complete analytical workflow of a KG-based compliance investigation. The two hidden triples — (Individual_Z, CONTROLS, HoldCo_3) and (Shell_C, OWNS, Corp_Flagged) — represent the links that connect the sanctioned individual to the flagged corporate entity through a three-hop chain: Individual_Z → CONTROLS → HoldCo_3 → OWNS → Shell_C → OWNS → Corp_Flagged. These links are absent from the explicit registry data, either because the intermediate holding companies are registered in jurisdictions without public ownership registers or because the filings were deliberately structured to avoid disclosure.
TransE recovers these missing links through the geometry of the embedding space. Because Individual_Z and HoldCo_3 share the embedding-space signature of a CONTROLS relationship (learned from the other control pairs in the KG), and because Shell_C and Corp_Flagged share the embedding-space signature of an OWNS relationship (learned from the other ownership pairs), the model scores these missing triples highly even though they were never explicitly observed during training.
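The scoring geometry is worth making concrete. The live cell above trains the real model; the sketch below uses random vectors as stand-ins for trained embeddings, so only the mechanics — not the ranking itself — are meaningful:

```python
# Schematic of TransE scoring: a candidate triple (h, r, t) scores highly
# when e_h + w_r lands close to e_t. Embeddings here are random stand-ins.
import numpy as np

rng = np.random.default_rng(42)
entities = ["Individual_Z", "HoldCo_3", "Shell_C", "Corp_Flagged"]
relations = ["CONTROLS", "OWNS"]
E = {e: rng.normal(size=8) for e in entities}
R = {r: rng.normal(size=8) for r in relations}

def transe_score(h, r, t):
    """Higher is better: negative L2 distance ||e_h + w_r - e_t||."""
    return -np.linalg.norm(E[h] + R[r] - E[t])

# Rank all candidate tails for the query (Shell_C, OWNS, ?).
ranked = sorted(entities,
                key=lambda t: transe_score("Shell_C", "OWNS", t),
                reverse=True)
print(ranked)
```

In a trained model, the vectors are fit so that observed triples score near zero distance — which is why unobserved triples that fit the same geometric pattern surface at the top of the ranking.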
This is the compliance interpretation of the open-world assumption: the absence of an ownership record in the KG does not mean the ownership does not exist. It may mean only that the record was not filed, not scraped, or not yet ingested. A TransE-based system surfaces the candidates; human investigators then direct targeted regulatory inquiries or legal discovery requests to verify them. The KG-based workflow is not a substitute for legal process — it is a prioritization tool that focuses investigative resources on the highest-probability missing links.
Beneficial ownership and the 2024 EU reforms. The EU’s 6th Anti-Money Laundering Directive (6AMLD), combined with the 2024 revision of the EU Anti-Money Laundering Regulation (AMLR), mandates that member states maintain central registers of beneficial ownership — the natural persons who ultimately own or control legal entities — and that financial institutions conduct enhanced due diligence on customers with complex ownership structures. The technical challenge of identifying beneficial owners through multi-hop corporate chains — exactly the problem modeled in this case study — is formally recognized in the regulation as a compliance obligation. KG-based beneficial-ownership tools from companies such as Moody’s Analytics, LSEG, and Dun & Bradstreet are now standard components of the due-diligence workflows required by the directive. The Panama Papers investigation was, in part, the empirical event that accelerated the political consensus behind these regulatory changes.
What this chapter has built
We started with the observation that single-relation graphs discard the semantic richness of the real world, then developed the formal language — triples, ontologies, RDF — for representing multi-relational knowledge. We showed that rule-based reasoning over an ontology can recover logically implied facts (transitive closure), but that it scales poorly and requires costly manual axiom specification. We then showed that KG embeddings — TransE, DistMult, and their successors — provide a statistical alternative: learn entity and relation vectors from observed triples and use vector arithmetic to score and rank missing ones.
The two frameworks are not competitors. In a production compliance system, the rule-based reasoner fills in all logically necessary facts (adding inferred SUBSIDIARY_OF edges before embedding), and the embedding model then operates on the enriched, ontology-expanded KG to surface statistically probable missing links. The combination gives you both logical soundness and statistical coverage — the best of the symbolic and the statistical worlds.
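The symbolic half of that pipeline — materializing logically necessary facts before embedding — is a fixpoint computation. A minimal sketch for the transitive closure of SUBSIDIARY_OF, with illustrative entities:

```python
# Transitive-closure materialization: add (a, r, c) whenever (a, r, b)
# and (b, r, c) hold, repeating until no new facts appear.
def transitive_closure(triples, relation):
    facts = set(triples)
    changed = True
    while changed:
        changed = False
        pairs = {(h, t) for h, r, t in facts if r == relation}
        for a, b in pairs:
            for b2, c in pairs:
                if b == b2 and (a, relation, c) not in facts:
                    facts.add((a, relation, c))
                    changed = True
    return facts

kg = [
    ("Shell_C", "SUBSIDIARY_OF", "HoldCo_3"),
    ("HoldCo_3", "SUBSIDIARY_OF", "ParentCo"),
]
enriched = transitive_closure(kg, "SUBSIDIARY_OF")
print(("Shell_C", "SUBSIDIARY_OF", "ParentCo") in enriched)  # True
```

The enriched triple set — here one inferred edge added to two explicit ones — is what the embedding model would then be trained on.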
Prof. Xuhu Wan · HKUST · Modern AI Stack for Social Data · 2026 Edition