Side-Quest - What Multilingual Wikipedia Reveals About Open Embeddings

June 12, 2026 · 4 min read

Someone at 5L Labs

A nice rabbit hole for data-nerdery

While I was talking with the affable Rafi on Private AI, we riffed on an interesting side-quest for embedding spaces. As most polyglots know, certain concepts are more easily expressed through cultural or semantic "embeddings" that do not carry over smoothly across languages.

The Japanese よろしくお願いします (or its even more polite form, よろしくお願い致します) has no direct equivalent in English, although English does have the underlying concepts of politeness and deference. So what interesting concepts are hiding in Wikipedia across languages?

Looking at Wikipedia

I have been testing that question with a deliberately messy corpus: multilingual Wikipedia.

Do embedding spaces converge across languages L1...LN? (Oddly, that would support the Platonic Representation Hypothesis.)
If we take the difference between the converged spaces, does something interesting pop out?
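
One way to make these two questions concrete is sketched below. It is a minimal illustration under stated assumptions, not the method used for the experiments in this post: it assumes each row of two matrices holds the embedding of the same article in two different languages, aligns one space onto the other with an orthogonal Procrustes rotation, and then ranks rows by how badly they still disagree. The function names `procrustes_align` and `alignment_residuals` are hypothetical, and the data is synthetic.

```python
import numpy as np


def procrustes_align(src: np.ndarray, tgt: np.ndarray) -> np.ndarray:
    """Orthogonal matrix R that minimises ||src @ R - tgt|| (Frobenius norm)."""
    u, _, vt = np.linalg.svd(src.T @ tgt)
    return u @ vt


def alignment_residuals(src: np.ndarray, tgt: np.ndarray) -> np.ndarray:
    """Cosine distance, row by row, between the rotated src and tgt."""
    aligned = src @ procrustes_align(src, tgt)
    aligned = aligned / np.linalg.norm(aligned, axis=1, keepdims=True)
    unit_tgt = tgt / np.linalg.norm(tgt, axis=1, keepdims=True)
    return 1.0 - np.sum(aligned * unit_tgt, axis=1)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Synthetic stand-ins for paired article embeddings (assumption: rows are paired).
    lang_a = rng.normal(size=(200, 16))
    rotation, _ = np.linalg.qr(rng.normal(size=(16, 16)))
    lang_b = lang_a @ rotation      # same geometry, different basis
    lang_b[7] = -lang_b[7]          # one concept that does not carry over
    # Rows with the largest residual are the candidates worth reading by hand.
    worst = np.argsort(alignment_residuals(lang_a, lang_b))[::-1][:3]
    print(worst)
```

Small residuals everywhere would suggest the two spaces share most of their geometry; the rows that stay far apart after alignment are the candidates for "something interesting". A rotation is only one possible choice of alignment, and real multilingual embeddings may need something less rigid.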

Unanswered questions - Longer term

Can we discover useful semantic structure from a publisher without downloading and re-embedding the original content?

The short version: the scatter plots are useful, but the better product primitive is a publisher-controlled concept card: an English-normalized claim, backed by source-language evidence and searchable embeddings.
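
To give the concept card a shape, here is a minimal sketch of what such a record might look like. Everything in it is an assumption made for illustration: the class names `Evidence` and `ConceptCard`, the field names, and the brute-force cosine `search` helper are hypothetical, and a real publisher would choose its own schema and index.

```python
import math
from dataclasses import dataclass, field
from typing import List


@dataclass
class Evidence:
    lang: str   # language code of the source passage, e.g. "ja"
    text: str   # passage quoted in its original language


@dataclass
class ConceptCard:
    claim_en: str                  # English-normalized claim
    embedding: List[float]         # vector the publisher exposes for search
    evidence: List[Evidence] = field(default_factory=list)


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search(cards: List[ConceptCard], query: List[float], k: int = 3) -> List[ConceptCard]:
    """Return the k cards whose embeddings are most similar to the query."""
    ranked = sorted(cards, key=lambda c: cosine(c.embedding, query), reverse=True)
    return ranked[:k]
```

The point of the shape is that a consumer can search the embeddings and read the English claim without ever fetching the source text, while the evidence entries keep the claim traceable to the original language.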
