Side-Quest - What Multilingual Wikipedia Reveals About Open Embeddings

June 12, 2026 · 4 min read

Someone at 5L Labs

A nice rabbit hole for data-nerdery

In a conversation with the affable Rafi on Private AI, we riffed on an interesting side-quest for embedding spaces. As most polyglots know, certain concepts are more easily expressed through cultural or semantic "embeddings" that do not carry over smoothly across languages.

Japanese よろしくお願いします (or its even more polite form, よろしくお願い致します) has no direct equivalent in English, although English does have comparable notions of politeness and deference. So what interesting concepts are hiding in Wikipedia across languages?

Looking at Wikipedia

I have been testing that question with a deliberately messy corpus: multilingual Wikipedia.

Do embedding spaces converge across languages L1...LN (which would, oddly, support the Platonic Representation Hypothesis)?
If we take the differences between the converged spaces, does something interesting pop out?

Unanswered questions - Longer term

Can we discover useful semantic structure from a publisher without downloading and re-embedding the original content?

The short version: the scatter plots are useful, but the better product primitive is a publisher-controlled concept card: an English-normalized claim, backed by source-language evidence and searchable embeddings.

Parlor Tricks with Data

The current local experiment pulls Wikipedia summaries for ten topics.

Corpus slice	Topics	Language coverage
Expanded baseline	`Turkey`, `Europe`, `Jesus`, `United States`	61 languages
Candidate pass	`Earth`, `Cat`, `Dog`, `Christianity`, `Islam`, `World War II`	10 languages
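
A minimal sketch of how one corpus row could be pulled, assuming the public Wikimedia REST page-summary endpoint. The source does not say which API the experiment uses, so the endpoint choice, the function names, and the User-Agent string are all illustrative. Page titles differ per language edition, so the caller is assumed to supply the local title.

```python
import json
import urllib.parse
import urllib.request

# Assumption: the Wikimedia REST "page summary" endpoint. The experiment's
# actual data source is not stated in the article.
SUMMARY_URL = "https://{lang}.wikipedia.org/api/rest_v1/page/summary/{title}"


def summary_url(lang: str, title: str) -> str:
    """Build the summary URL for one language edition and one local page title."""
    slug = urllib.parse.quote(title.replace(" ", "_"), safe="")
    return SUMMARY_URL.format(lang=lang, title=slug)


def fetch_summary(lang: str, title: str, timeout: float = 10.0) -> dict:
    """Fetch one summary and keep only the fields a corpus row needs."""
    request = urllib.request.Request(
        summary_url(lang, title),
        # Hypothetical User-Agent; replace with real contact details.
        headers={"User-Agent": "embedding-side-quest/0.1 (hypothetical contact)"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        payload = json.load(response)
    # "extract" is the plain-text summary field of the response.
    return {"lang": lang, "title": title, "text": payload.get("extract", "")}
```

Keeping URL construction separate from the network call makes the title encoding easy to check offline before running a full pass.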

Each row is embedded in several spaces using open models. Short of anyone sending us free API credits, these local models make the experiment cheap enough to run repeatedly.

Family	Spaces
Raw local models	`Nomic v1.5`, `Nomic v2 MoE`, `Qwen3 0.6B`, `Granite 97M multilingual`
Shared latent projections	Qwen and Granite at `128d` and `1024d`
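
One hedged way to put a number on the convergence question for a single space: compare how similar the same topic is across languages with how similar different topics are across languages. This is a sketch with hypothetical names (`convergence_gap`, the `(topic, lang, vector)` row shape), not the measurement used in the experiment.

```python
import math
from itertools import combinations


def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def convergence_gap(rows):
    """rows: list of (topic, lang, vector) tuples from ONE embedding space.

    Returns mean same-topic cross-language cosine minus mean
    different-topic cross-language cosine. A larger gap suggests the
    space groups by topic rather than by language.
    """
    same, different = [], []
    for (t1, l1, v1), (t2, l2, v2) in combinations(rows, 2):
        if l1 == l2:
            continue  # only cross-language pairs are informative here
        (same if t1 == t2 else different).append(cosine(v1, v2))
    if not same or not different:
        raise ValueError("need both same-topic and cross-topic language pairs")
    return sum(same) / len(same) - sum(different) / len(different)


# Toy vectors for illustration only, not real embeddings.
toy_rows = [
    ("Cat", "en", [1.0, 0.0]),
    ("Cat", "pl", [1.0, 0.0]),
    ("Dog", "en", [0.0, 1.0]),
    ("Dog", "pl", [0.0, 1.0]),
]
gap = convergence_gap(toy_rows)
```

Running the same function once per space gives a comparable score for each raw model and each shared-latent projection.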

What concept appears in one language's article that does not appear in another language's article, and can we expose that concept in English?

A publisher should not publish only raw vectors. It can also publish concept cards.

Field	Why it matters
English concept label	A consumer can search without knowing every source language.
Source language and snippet	The card stays auditable.
L0 (English) translation	English readers can read the evidence quickly.
Embedding representation	Consumers can choose native, prefix, or shared-latent search.
Recoverability metadata	The publisher can show which representation actually finds the evidence.
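
The fields in the table could be carried by a small record type. This is a sketch only: the class name, field names, and JSON shape are assumptions rather than a published format, and the example values are placeholders.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class ConceptCard:
    english_label: str        # searchable without knowing the source language
    source_lang: str          # language code of the evidence
    source_snippet: str       # original-language text, keeps the card auditable
    english_translation: str  # quick-read version of the snippet
    embeddings: dict = field(default_factory=dict)      # space name -> vector
    recoverability: dict = field(default_factory=dict)  # space name -> bool

    def to_json(self) -> str:
        """Serialize without escaping non-ASCII source text."""
        return json.dumps(asdict(self), ensure_ascii=False, sort_keys=True)


# Placeholder values; a real card would carry actual snippets and vectors.
card = ConceptCard(
    english_label="Istanbul as a global city",
    source_lang="tr",
    source_snippet="<source-language sentence>",
    english_translation="<English translation>",
    embeddings={"example-space": [0.1, 0.2]},
    recoverability={"example-space": True},
)
card_json = card.to_json()
```

Keying both `embeddings` and `recoverability` by space name lets a consumer pick the representation the publisher reports as actually finding the evidence.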

More work is needed, but this is a potential path to letting a consumer search a publisher's multilingual concept layer without translating or downloading every page.
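
On the consumer side, searching such a layer could be as small as a cosine top-k over published card vectors. The sketch below assumes cards are plain dicts with `label` and `embeddings` keys (a hypothetical shape) and that the query vector comes from the same space as the card vectors.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search_cards(query_vec, cards, space, k=3):
    """Return the top-k (score, label) pairs for one embedding space.

    Cards that do not publish a vector in `space` are skipped.
    """
    scored = [
        (cosine(query_vec, card["embeddings"][space]), card["label"])
        for card in cards
        if space in card["embeddings"]
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Toy cards and vectors for illustration only.
toy_cards = [
    {"label": "card A", "embeddings": {"toy-space": [1.0, 0.0]}},
    {"label": "card B", "embeddings": {"toy-space": [0.0, 1.0]}},
    {"label": "card C", "embeddings": {"other-space": [1.0, 1.0]}},
]
hits = search_cards([0.9, 0.1], toy_cards, "toy-space", k=2)
```

Only the card vectors and labels are needed for this step; the source pages themselves never have to be fetched or re-embedded.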

warning

The rest of this article is fully AI-written for expediency. Caveat lector.

Concept Cards

Interactive research note

Language-specific concept cards

Select any processed topic to see the kind of English-normalized concept layer a publisher could expose alongside Open Embeddings.

304 rows · 8 spaces · 10 topics

Topic: Turkey (61 languages)

Rare concepts inside one multilingual topic cluster become searchable English cards.

geopolitics and state identity

zh / ka / tr

Istanbul as a global city

Several leads emphasize Istanbul as a superlative city spanning continents, not just as a former capital or largest city.

zh tr

bg / cy / pl / zh

Ottoman succession

Some leads frame modern Turkey through the collapse of the Ottoman Empire and the founding of the republic in 1923.

pl zh

ar / da / ko / mk

Kurdish population framing

Some language leads include ethnic and minority-population framing that does not appear uniformly.

ar ko

Examples from the current run:

Russian World War II foregrounds nuclear weapons and civilian casualties.
Polish Cat includes an invasive-species caveat.
Catalan Europe expands the continent through nearby islands and archipelagos.
Turkey pages differ on whether they foreground Istanbul, Ottoman succession, Kurdish population framing, secular governance, or strategic waterways.
Other processed topics include Earth, Dog, Christianity, Islam, Jesus, and United States; those are in the selector too.
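
Examples like these could be surfaced as candidates for human curation with a simple novelty score: for each chunk, find its best match in any other language and flag the chunks with the weakest match. This is a hypothetical triage heuristic, not how the cards in this run were produced; the function name and row shape are assumptions.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def novelty_ranking(chunks):
    """chunks: list of (lang, text, vector) for one topic in one space.

    Novelty is 1 minus the best cosine match in any OTHER language, so a
    chunk that no other language echoes scores near 1.
    """
    ranked = []
    for lang, text, vec in chunks:
        best = max(
            (cosine(vec, other) for other_lang, _, other in chunks
             if other_lang != lang),
            default=None,
        )
        if best is not None:  # skip when only one language is present
            ranked.append((1.0 - best, lang, text))
    return sorted(ranked, key=lambda item: item[0], reverse=True)


# Toy chunks for illustration only, not real article text or embeddings.
toy_chunks = [
    ("en", "shared fact", [1.0, 0.0]),
    ("pl", "shared fact (pl)", [1.0, 0.0]),
    ("pl", "pl-only caveat", [0.0, 1.0]),
]
top = novelty_ranking(toy_chunks)[0]
```

A high score only means "nothing similar was embedded elsewhere", which can also reflect a terse lead or a model blind spot, so the output is a review queue rather than a list of finished cards.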

This is the beginning of a publisher-controlled semantic layer.

What about Chunking?

One caveat: Wikipedia leads are long enough to carry several concepts at once. So the experiment checks whether retrieval improves when the searchable unit changes.

Chunking strategy	Question it answers
Full lead	Can one vector represent the whole article intro?
Sentence	Does a smaller unit recover rare concepts better?
Two-sentence window	Is nearby context useful without swallowing the whole lead?
Clause-like chunks	Does fine segmentation rescue concepts hidden inside dense prose?

This matters because a miss may not mean the embedding model failed. It may mean we asked one vector to represent too many concepts.
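
The four strategies in the table can be sketched with a few regular expressions. The boundary rules here are rough assumptions (punctuation-based splitting, including common CJK marks), not the segmentation used in the experiment, and the sample text is a made-up toy lead.

```python
import re

# Latin sentence enders need trailing whitespace (so "3.5" is not split);
# CJK enders split immediately because no space follows them.
_SENTENCE_BOUNDARY = re.compile(r"(?<=[.!?])\s+|(?<=[。！？])")
_CLAUSE_BOUNDARY = re.compile(r"(?<=[,;:])\s+|(?<=[、，；：])")


def _split(pattern, text):
    """Split on a boundary pattern and drop empty pieces."""
    return [part.strip() for part in pattern.split(text) if part.strip()]


def chunk_lead(lead: str) -> dict:
    """Return the lead chunked four ways, keyed by strategy name."""
    sentences = _split(_SENTENCE_BOUNDARY, lead)
    # Sliding window over adjacent sentence pairs; a one-sentence lead
    # falls back to that single sentence.
    windows = [
        " ".join(sentences[i:i + 2]) for i in range(len(sentences) - 1)
    ] or sentences
    clauses = [c for s in sentences for c in _split(_CLAUSE_BOUNDARY, s)]
    return {
        "full_lead": [lead.strip()],
        "sentence": sentences,
        "two_sentence_window": windows,
        "clause": clauses,
    }


# Toy text for illustration only.
toy_lead = (
    "Cats are small, furry animals. They hunt mice; they sleep a lot. "
    "Many people keep them."
)
chunks = chunk_lead(toy_lead)
```

Each list can then be embedded separately, so the same query is scored against one vector per lead, per sentence, per window, or per clause.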

In the current Turkey concept search, smaller chunks help some models but do not fully explain the gap. That is good to know before turning this into a format recommendation.

Current Caveats

This is still a research demo.

Wikipedia leads are not true translations.
Some leads are terse, others are dense.
The candidate topics are not all at the same language count yet.
Granite needed a 700-character cap in this run because the local server had a 512-token physical batch limit.
Concept cards are currently curated from captured rows, not fully automated.
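
The Granite character cap mentioned above could look something like this. It is a sketch: the 700-character default comes from the caveat, but cutting at a word boundary is an assumption, and a character count only approximates a token limit.

```python
def cap_text(text: str, max_chars: int = 700) -> str:
    """Truncate to at most max_chars, preferring a whitespace boundary."""
    if len(text) <= max_chars:
        return text
    clipped = text[:max_chars]
    if text[max_chars].isspace():
        return clipped.rstrip()  # the cut already falls between words
    cut = clipped.rfind(" ")
    # Fall back to a hard cut when there is no space to break on,
    # for example in unsegmented CJK text.
    return clipped[:cut] if cut > 0 else clipped
```

Because tokenizers split scripts very differently, a cap that fits comfortably for English may still overflow for other languages, so the limit is worth checking per language.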

Still, the shape is promising: Open Embeddings can be more than "publish vectors." It can become a publisher-controlled semantic index for the open web.
