Side-Quest - What Multilingual Wikipedia Reveals About Open Embeddings
A nice rabbit hole for data-nerdery
While talking to the affable Rafi on Private AI, we riffed on an interesting side-quest for embedding spaces. As most polyglots know, some concepts carry cultural and semantic "embeddings" of their own and do not translate smoothly across languages.
The Japanese よろしくお願いします (or its even more polite form, よろしくお願い致します) has no direct equivalent in English, although English does have the underlying concepts of politeness and deference. So what interesting concepts are hiding in Wikipedia across languages?
Looking at Wikipedia
I have been testing that question with a deliberately messy corpus: multilingual Wikipedia.
- Do embedding spaces converge across languages L1...LN (which would, oddly enough, support the Platonic Representation Hypothesis)?
- If we take the differences that remain between languages inside that converged space, does something interesting pop out?
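To make those two questions concrete, here is a minimal sketch of how they could be measured, assuming you already have embeddings for the same set of concepts in two languages, row-aligned so that row i in each matrix describes the same article. The function name `alignment_report`, the metric names, and the toy vectors are all illustrative assumptions, not the pipeline used for this post, and the random data stands in for real embeddings only so the snippet runs.

```python
import numpy as np


def l2_normalize(x):
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)


def alignment_report(emb_a, emb_b):
    """Compare two row-aligned embedding matrices (one per language).

    Assumes emb_a[i] and emb_b[i] embed the same concept.
    """
    a = l2_normalize(np.asarray(emb_a, dtype=float))
    b = l2_normalize(np.asarray(emb_b, dtype=float))

    # Question 1: convergence. Cosine similarity of each aligned pair,
    # plus how often a concept's nearest neighbour in the other
    # language is its own counterpart.
    paired_cosine = np.sum(a * b, axis=1)
    nearest = np.argmax(a @ b.T, axis=1)
    top1_accuracy = float(np.mean(nearest == np.arange(len(a))))

    # Question 2: the difference. Per-concept residual between the two
    # languages; the largest residuals are the candidates worth reading.
    residual_norm = np.linalg.norm(a - b, axis=1)
    most_divergent = np.argsort(-residual_norm)

    return {
        "paired_cosine": paired_cosine,
        "top1_accuracy": top1_accuracy,
        "residual_norm": residual_norm,
        "most_divergent": most_divergent,
    }


# Toy data only: five "concepts" in 16 dimensions. Language B is a
# lightly perturbed copy of language A, except concept 3, which is
# replaced by an unrelated vector to mimic a concept that does not
# carry over cleanly.
rng = np.random.default_rng(0)
lang_a = rng.normal(size=(5, 16))
lang_b = lang_a + 0.05 * rng.normal(size=(5, 16))
lang_b[3] = rng.normal(size=16)

report = alignment_report(lang_a, lang_b)
print("top-1 accuracy:", report["top1_accuracy"])
print("most divergent concept index:", int(report["most_divergent"][0]))
```

High paired similarity and retrieval accuracy would point toward convergence; the concepts at the top of `most_divergent` are where the second question starts to get interesting.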
Unanswered questions - Longer term
- Can we discover useful semantic structure from a publisher without downloading and re-embedding the original content?
The short version: the scatter plots are useful, but the better product primitive is a publisher-controlled concept card: an English-normalized claim, backed by source-language evidence and searchable embeddings.
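As a rough illustration of what such a concept card could look like as a data structure, here is a minimal sketch. Every field name (`claim_en`, `evidence`, `embedding`, `embedding_model`), the `Evidence` type, and the `search` helper are hypothetical choices made for this example, and the URL is a placeholder; nothing here is a published schema.

```python
import math
from dataclasses import dataclass, field
from typing import List


@dataclass
class Evidence:
    """One source-language passage backing a claim (hypothetical shape)."""
    language: str    # language code of the source text, e.g. "ja"
    source_url: str  # where the publisher hosts the original
    excerpt: str     # short quote in the source language


@dataclass
class ConceptCard:
    """English-normalized claim plus its evidence and a searchable vector."""
    claim_en: str
    evidence: List[Evidence] = field(default_factory=list)
    embedding: List[float] = field(default_factory=list)
    embedding_model: str = ""  # which model produced the vector

    def languages(self):
        """Set of source languages that back this claim."""
        return {e.language for e in self.evidence}


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is zero."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def search(cards, query_vec, k=3):
    """Return the k cards whose embeddings are closest to query_vec."""
    ranked = sorted(cards, key=lambda c: cosine(c.embedding, query_vec), reverse=True)
    return ranked[:k]


# Placeholder card; the two-dimensional vector is a stand-in for a real embedding.
card = ConceptCard(
    claim_en="A set phrase expressing politeness and deference when asking a favor.",
    evidence=[Evidence("ja", "https://example.org/placeholder", "よろしくお願いします")],
    embedding=[1.0, 0.0],
    embedding_model="placeholder-model",
)
print(search([card], [0.9, 0.1], k=1)[0].claim_en)
```

The point of the shape is that a consumer can search and read the English claim while the publisher keeps control of the source-language text behind `source_url`.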
