Side-Quest - What Multilingual Wikipedia Reveals About Open Embeddings

· 4 min read
Nick Lange
Someone at 5L Labs

A nice rabbit hole for data-nerdery

While talking to the affable Rafi on Private AI, we riffed on an interesting side-quest for embedding spaces. As most polyglots know, some concepts carry cultural or semantic "embeddings" that do not translate smoothly across languages.

Japanese よろしくお願いします (or its even more polite form, よろしくお願い致します) has no direct equivalent in English, although English does have the underlying concepts of politeness and deference. So what interesting concepts are hiding in Wikipedia across languages?

Looking at Wikipedia

I have been testing that question with a deliberately messy corpus: multilingual Wikipedia.

  1. Do embedding spaces converge across languages L1...LN? (Convergence would, oddly, support the Platonic Representation Hypothesis.)
  2. If we take the differences between languages within the converged space, does something interesting pop out?
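
As a rough illustration of both questions, here is a minimal sketch rather than the code behind the experiment. It assumes every language's lead for one topic has already been embedded into the same space, and the function names (`convergence_score`, `outlier_ranking`) and toy vectors are hypothetical. The first function measures how tightly the languages agree; the second ranks languages by how far they sit from the group.

```python
import math
from itertools import combinations


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def convergence_score(vectors_by_lang):
    """Mean pairwise cosine similarity across languages for one topic."""
    pairs = list(combinations(sorted(vectors_by_lang), 2))
    if not pairs:
        raise ValueError("need at least two languages")
    total = sum(cosine(vectors_by_lang[a], vectors_by_lang[b]) for a, b in pairs)
    return total / len(pairs)


def outlier_ranking(vectors_by_lang):
    """Languages sorted by cosine distance from the topic centroid, farthest first."""
    vectors = list(vectors_by_lang.values())
    centroid = [sum(column) / len(vectors) for column in zip(*vectors)]
    distances = {
        lang: 1.0 - cosine(vec, centroid) for lang, vec in vectors_by_lang.items()
    }
    return sorted(distances.items(), key=lambda item: item[1], reverse=True)


# Toy 2-d vectors standing in for real embeddings of one topic.
toy = {"en": [1.0, 0.0], "de": [1.0, 0.0], "ja": [0.0, 1.0]}
print(convergence_score(toy))
print(outlier_ranking(toy)[0][0])
```

A high score suggests the languages land in roughly the same place; the languages at the top of the ranking are the ones worth reading for concepts the others leave out.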

Unanswered questions - Longer term

  1. Can we discover useful semantic structure from a publisher without downloading and re-embedding the original content?

The short version: the scatter plots are useful, but the better product primitive is a publisher-controlled concept card: an English-normalized claim, backed by source-language evidence and searchable embeddings.

Parlor Tricks with Data

The current local experiment pulls Wikipedia summaries for ten topics.

| Corpus slice | Topics | Language coverage |
| --- | --- | --- |
| Expanded baseline | Turkey, Europe, Jesus, United States | 61 languages |
| Candidate pass | Earth, Cat, Dog, Christianity, Islam, World War II | 10 languages |
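
The post does not say how the summaries are pulled, so treat the following as one plausible sketch. It assumes the public Wikimedia REST summary endpoint (`/api/rest_v1/page/summary/{title}`) and reads the `extract` field of the response. The helper names and the User-Agent string are hypothetical, and the sketch does not solve the real problem that article titles differ per language.

```python
import json
import urllib.parse
import urllib.request


def summary_url(lang, title):
    """Build the REST summary URL for one language edition of Wikipedia."""
    safe_title = urllib.parse.quote(title.replace(" ", "_"), safe="")
    return f"https://{lang}.wikipedia.org/api/rest_v1/page/summary/{safe_title}"


def fetch_summary(lang, title, timeout=10.0):
    """Fetch one summary row. Needs network access; raises on HTTP errors."""
    request = urllib.request.Request(
        summary_url(lang, title),
        # Hypothetical identifier; use a real contact string for real runs.
        headers={"User-Agent": "oe-side-quest-sketch/0.1"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        payload = json.load(response)
    return {
        "lang": lang,
        "title": payload.get("title", title),
        "text": payload.get("extract", ""),
    }


print(summary_url("en", "World War II"))
```

Each returned dictionary corresponds to one row in the corpus: a language, a title, and the summary text that gets embedded.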

Each row is embedded in several spaces using open models. Short of anyone sending us free API credits, these local models make the experiment cheap enough to run repeatedly.

| Family | Spaces |
| --- | --- |
| Raw local models | Nomic v1.5, Nomic v2 MoE, Qwen3 0.6B, Granite 97M multilingual |
| Shared latent projections | Qwen and Granite at 128d and 1024d |

The underlying question: what concept appears in one language's article but not in another's, and can we expose that concept in English?

A publisher should not only publish raw vectors. A publisher can also publish concept cards.

| Field | Why it matters |
| --- | --- |
| English concept label | A consumer can search without knowing every source language. |
| Source language and snippet | The card stays auditable. |
| L0 (English) translation | English readers can read the evidence quickly. |
| Embedding representation | Consumers can choose native, prefix, or shared-latent search. |
| Recoverability metadata | The publisher can show which representation actually finds the evidence. |
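
To make the fields concrete, here is a minimal sketch of a card as a Python dataclass. The class name, field names, and JSON serialization are assumptions for illustration, not a published Open Embeddings schema, and the snippet and translation values are placeholders rather than captured data.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class ConceptCard:
    english_label: str        # what a consumer searches for
    source_lang: str          # language code of the evidence
    source_snippet: str       # original-language text, kept for auditing
    english_translation: str  # L0 (English) rendering of the snippet
    # Space name -> vector, e.g. a native, prefix, or shared-latent space.
    embeddings: dict = field(default_factory=dict)
    # Space name -> whether a search in that space recovered the snippet.
    recoverability: dict = field(default_factory=dict)

    def to_json(self):
        """Serialize the card, keeping non-ASCII source text readable."""
        return json.dumps(asdict(self), ensure_ascii=False)


card = ConceptCard(
    english_label="Invasive-species caveat",
    source_lang="pl",
    source_snippet="<source-language snippet goes here>",
    english_translation="<English translation goes here>",
    embeddings={"toy-space": [0.1, 0.9]},
    recoverability={"toy-space": True},
)
print(card.to_json())
```

Keeping the embeddings and recoverability maps keyed by space name lets a publisher add or retire a representation without changing the shape of the card.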

More work is needed, but this is a potential path to letting a consumer search across the publisher's multilingual concept layer without translating or downloading every page.
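
On the consumer side, the search could be as simple as the sketch below. It assumes the consumer has already embedded an English query in the same space the publisher used, and that cards arrive as plain dictionaries; `search_cards`, the `toy-space` name, and the 2-d vectors are all hypothetical stand-ins. The labels are borrowed from the Turkey cards in this post.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search_cards(query_vector, cards, space, top_k=3):
    """Rank cards by similarity to the query within one embedding space."""
    scored = []
    for card in cards:
        vector = card["embeddings"].get(space)
        if vector is None:
            continue  # the publisher did not expose this card in that space
        scored.append((cosine(query_vector, vector), card["english_label"]))
    scored.sort(reverse=True)
    return scored[:top_k]


# Toy vectors only; real cards would carry model-sized embeddings.
cards = [
    {"english_label": "Istanbul as a global city", "embeddings": {"toy-space": [0.9, 0.1]}},
    {"english_label": "Ottoman succession", "embeddings": {"toy-space": [0.1, 0.9]}},
    {"english_label": "Kurdish population framing", "embeddings": {"other-space": [1.0, 0.0]}},
]
for score, label in search_cards([1.0, 0.0], cards, "toy-space"):
    print(round(score, 3), label)
```

Nothing here touches the source pages: the consumer only needs the cards and a query vector in a space the publisher supports.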

warning

The remaining article is fully AI-written for expediency. Caveat lector.

Concept Cards

Interactive research note

Language-specific concept cards

Select any processed topic to see the kind of English-normalized concept layer a publisher could expose alongside Open Embeddings.

304 rows · 8 spaces · 10 topics · 61 languages

Turkey

Rare concepts inside one multilingual topic cluster become searchable English cards.

geopolitics and state identity
zh / ka / tr
Istanbul as a global city

Several leads emphasize Istanbul as a superlative city spanning continents, not just as a former capital or largest city.

bg / cy / pl / zh
Ottoman succession

Some leads frame modern Turkey through the collapse of the Ottoman Empire and the founding of the republic in 1923.

ar / da / ko / mk
Kurdish population framing

Some language leads include ethnic and minority-population framing that does not appear uniformly.

Examples from the current run:

  • Russian World War II foregrounds nuclear weapons and civilian casualties.
  • Polish Cat includes an invasive-species caveat.
  • Catalan Europe expands the continent through nearby islands and archipelagos.
  • Turkey pages differ on whether they foreground Istanbul, Ottoman succession, Kurdish population framing, secular governance, or strategic waterways.
  • Other processed topics include Earth, Dog, Christianity, Islam, Jesus, and United States; those are in the selector too.

This is the beginning of a publisher-controlled semantic layer.

What about Chunking?

One caveat: Wikipedia leads are long enough to carry several concepts at once. So the experiment checks whether retrieval improves when the searchable unit changes.

| Chunking strategy | Question it answers |
| --- | --- |
| Full lead | Can one vector represent the whole article intro? |
| Sentence | Does a smaller unit recover rare concepts better? |
| Two-sentence window | Is nearby context useful without swallowing the whole lead? |
| Clause-like chunks | Does fine segmentation rescue concepts hidden inside dense prose? |

This matters because a miss may not mean the embedding model failed. It may mean we asked one vector to represent too many concepts.
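
The four strategies can be sketched with nothing more than regular expressions. This is a deliberately naive illustration, not the segmentation used in the experiment: the punctuation rules are assumptions, they will split things like "1,000" at the comma, and a real multilingual run would want a proper sentence segmenter. All function names are hypothetical.

```python
import re

# Split after Latin sentence punctuation followed by whitespace,
# or directly after CJK sentence punctuation (no whitespace needed).
_SENTENCE_BREAK = re.compile(r"(?<=[.!?])\s+|(?<=[。！？])")
_CLAUSE_BREAK = re.compile(r"[,;:，；、]\s*")


def full_lead(text):
    """One chunk: the whole lead."""
    return [text.strip()]


def sentences(text):
    """One chunk per sentence."""
    return [part.strip() for part in _SENTENCE_BREAK.split(text.strip()) if part.strip()]


def two_sentence_windows(text):
    """Sliding windows of two adjacent sentences."""
    parts = sentences(text)
    if len(parts) < 2:
        return parts
    return [f"{parts[i]} {parts[i + 1]}" for i in range(len(parts) - 1)]


def clause_chunks(text):
    """Sentences split further at commas, semicolons, and similar marks."""
    chunks = []
    for sentence in sentences(text):
        chunks.extend(p.strip() for p in _CLAUSE_BREAK.split(sentence) if p.strip())
    return chunks


sample = "Turkey spans two continents. Istanbul is its largest city, and Ankara is the capital."
for strategy in (full_lead, sentences, two_sentence_windows, clause_chunks):
    print(strategy.__name__, len(strategy(sample)))
```

Each chunk would then be embedded on its own, so a rare concept buried in one clause gets a vector that is not averaged together with the rest of the lead.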

In the current Turkey concept search, smaller chunks help some models but do not fully explain the gap. That is good to know before turning this into a format recommendation.

Current Caveats

This is still a research demo.

  • Wikipedia leads are not true translations.
  • Some leads are terse, others are dense.
  • The candidate topics are not all at the same language count yet.
  • Granite needed a 700-character cap in this run because the local server had a 512-token physical batch limit.
  • Concept cards are currently curated from captured rows, not fully automated.
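
For reference, the 700-character cap from the Granite caveat could look like the sketch below. The function name and the choice to cut at the last space are assumptions; the post only states the cap itself. A character cap is also only a rough proxy for a token limit, since the characters-per-token ratio varies by language and tokenizer.

```python
def cap_text(text, max_chars=700):
    """Truncate text to max_chars, backing up to a space when one is available."""
    if len(text) <= max_chars:
        return text
    cut = text[:max_chars]
    # Avoid ending mid-word when the cut lands inside one.
    if not text[max_chars].isspace():
        last_space = cut.rfind(" ")
        if last_space > 0:
            cut = cut[:last_space]
    return cut.rstrip()


print(len(cap_text("word " * 200)))
```

Scripts written without spaces simply get a hard cut at the cap, which is one more reason to treat this as a stopgap rather than a format recommendation.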

Still, the shape is promising: Open Embeddings can be more than "publish vectors." It can become a publisher-controlled semantic index for the open web.
