Investigating UForm Matryoshka Embeddings

December 12th, 2024 • 10 min read

Dense embeddings are a key component of modern information retrieval, with vector-based dense retrieval now powering most retrieval-augmented generation applications. You get dense embeddings from an embedding model — today, this is usually a neural network trained specifically to learn useful representations.

Given the right embedding model, you can turn any type of data into a vector representation that can be used for similarity search, clustering, classification, and other downstream tasks. When computational efficiency matters, you want an embedding model that is lightweight and produces particularly compact representations. Unum's UForm library and tiny embedding models are great for this.

In this blog post, we'll explore what Matryoshka representations are, and investigate the nested embedding performance of UForm models.

Background

During my time at Ogment AI, we were working on a vector-based dense retrieval system and embedding pipeline to allow for efficient similarity search over the world's job profiles. As part of the solution, I designed a profile attribute embedding pipeline that would allow for semantic clustering of components of job profiles. This pipeline would need to process billions of attributes overall, and many thousands of attributes per hour, so we needed a fast model and a compact representation that would allow for efficient similarity search while keeping vector index compute requirements low. The UForm embedding model family is lightweight and seemed like a great fit for this task.


Why UForm?

UForm provides very lightweight quantized Transformer models for dealing with image media and multi-lingual text. These models are multi-modal and fuse information from different modalities into a single embedding space. On Unum's multi-lingual adaptation of the Common Objects in Context (COCO) evaluation dataset, UForm models achieve state-of-the-art performance while being 2-4x faster than alternative open-source models. Out of the box, UForm embeddings are 256-dimensional, but the README suggests using 64-dimensional Matryoshka-style embeddings. At Ogment we could afford to use the full 256-dimensional embeddings, but couldn't afford to rigorously evaluate the 64-dimensional embeddings. So this blog post is an attempt to do just that.

What are Matryoshka Representations?

When applying dense embeddings to solve a problem, you care about the representational capacity and how efficiently it accommodates the problem at hand. General-purpose embedding models are designed to be flexible and accommodate a wide range of problems. In doing so, they may not be the most efficient representation for a specific problem. Matryoshka representations encode information at different granularities in coarse-to-fine embeddings. This allows a single embedding to be adapted to the computational constraints of a specific problem by simply slicing the embedding.

Using Matryoshka Representation Learning, any representation learning setup can be adapted to learn multi-granular Matryoshka representations.
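As a rough sketch of what "simply slicing the embedding" means in practice (the 256-dimensional embedding and the 64-dimensional target here are illustrative assumptions), you keep the first $m$ dimensions and re-normalize before computing similarities:

import numpy as np

def slice_matryoshka(embedding: np.ndarray, m: int) -> np.ndarray:
    """Keep the first m (coarsest) dimensions and re-normalize for cosine similarity."""
    sliced = embedding[..., :m]
    return sliced / np.linalg.norm(sliced, axis=-1, keepdims=True)

# Hypothetical full-resolution embedding.
full = np.random.randn(256).astype(np.float32)

# 64-dimensional coarse view of the same embedding.
coarse = slice_matryoshka(full, 64)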

Formal Example: Fully Supervised Representation Learning

To get Matryoshka representations, we modify the representation learning objective to incentivise the model to capture information at multiple granularities. Here the typical representation learning task is turned into a multi-scale representation learning problem on the same task. In the case of a fully supervised representation learning task using multi-class classification, we modify the typical multi-class classification loss into multiple layers of losses for each pre-defined level of granularity.
Suppose we're trying to learn a $d$-dimensional representation $z \in \mathbb{R}^d$ for some input $x$ in domain $\mathcal{X}$ using regular multi-class classification. Here we use a neural network $F(\cdot; \theta_F): \mathcal{X} \to \mathbb{R}^d$ that maps an input $x$ to our representation such that $z = F(x; \theta_F)$. Given a dataset $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^N$ where $y_i \in [L]$ is the label for input $x_i$ for all $i \in [N]$, we have the following minimization problem: $$ \min_{\theta_F} \frac{1}{N} \sum_{i \in [N]} \mathcal{L}(F(x_i; \theta_F), y_i) $$ where $\mathcal{L}$ is the multi-class softmax cross-entropy loss. This is a standard multi-class classification objective which penalizes the mismatch between the prediction derived from $F(x_i; \theta_F)$ and the true label $y_i$.
Now, to learn Matryoshka multi-granularity representations we modify this to instead optimize the multi-class classification loss for each nested Matryoshka granularity $m \in \mathcal{M}$ using separate linear classifiers, parameterized by $W^{(m)} \in \mathbb{R}^{L \times m}$, optimized using standard empirical risk minimization: $$ \require{color}\min_{\displaystyle{\colorbox{YellowGreen}{ $\{ \mathbf{W}^{(m)} \}_{m \in \mathcal{M}}$}}, \theta_F} \frac{1}{N} \sum_{i \in [N]} \displaystyle{\colorbox{YellowGreen}{ $\sum_{m \in \mathcal{M}}$ }} \mathcal{L} \left( \displaystyle{\colorbox{YellowGreen}{ $\mathbf{W}^{(m)} \cdot$}} F(x_i; \theta_F)\displaystyle{\colorbox{YellowGreen}{ $_{1:m}$ }}, y_i \right) $$

Green marks Matryoshka modifications.
Here, the intuition is that we're essentially training $|\mathcal{M}|$ separate models all in one by minimizing the aggregate of the loss functions. However, despite optimizing around discrete Matryoshka granularities, what we see in practice is that models trained with Matryoshka Representation Learning diffuse information in an interpolating manner across all $d$ dimensions.
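To make the modified objective concrete, here is a minimal PyTorch sketch of the Matryoshka classification loss described above. The encoder, the granularities, and the per-granularity classifier heads are illustrative assumptions, not a reference implementation:

import torch
import torch.nn as nn

class MatryoshkaClassificationLoss(nn.Module):
    """Sum of cross-entropy losses over nested prefixes of the representation."""

    def __init__(self, num_classes: int, granularities: list[int]):
        super().__init__()
        self.granularities = granularities
        # One linear classifier W^(m) per Matryoshka granularity m.
        self.heads = nn.ModuleDict(
            {str(m): nn.Linear(m, num_classes) for m in granularities}
        )
        self.ce = nn.CrossEntropyLoss()

    def forward(self, z: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
        # z: (batch, d) representation from F(x; theta_F), y: (batch,) labels.
        return sum(
            self.ce(self.heads[str(m)](z[:, :m]), y) for m in self.granularities
        )

# Illustrative usage with a toy encoder F(x; theta_F).
encoder = nn.Sequential(nn.Linear(32, 256), nn.ReLU(), nn.Linear(256, 256))
criterion = MatryoshkaClassificationLoss(num_classes=10, granularities=[64, 128, 256])
x, y = torch.randn(8, 32), torch.randint(0, 10, (8,))
loss = criterion(encoder(x), y)
loss.backward()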

Practical Example: Sentence Transformers

Sentence Transformers is a great and popular library for using and training text and image embedding models.

In the example below, we'll use Sentence Transformers to fine-tune an MPNet-based model (all-mpnet-base-v2) with a triplet loss function on the MS MARCO information retrieval dataset using Matryoshka Representation Learning, with the aim of nesting a coarser 384-dimensional granularity inside the full 768-dimensional embedding.

"""Triplet Loss with Matryoshka Representation Learning."""

import datasets

from sentence_transformers import (
    SentenceTransformer,
    SentenceTransformerTrainer,
    losses,
)

# 1. Load the base MPNet model.
model: SentenceTransformer = SentenceTransformer("all-mpnet-base-v2")

# 2. Load the MS MARCO triplet dataset.
dataset: datasets.DatasetDict = datasets.load_dataset(
    "sentence-transformers/msmarco-msmarco-MiniLM-L-6-v3",
)

# 3. Get dataset in triplet format.
train_dataset: datasets.Dataset = dataset["train"].map(
    lambda x: {
        "anchor": x["query"],
        "positive": x["positive"],
        "negative": x["negative"]
    },
    remove_columns=["query"],
)

# 4. Define triplet loss function.
loss = losses.TripletLoss(model=model)

# 5. Wrap with Matryoshka loss modifier to optimize for Matryoshka granularities of 768 and 384.
loss = losses.MatryoshkaLoss(model, loss, matryoshka_dims=[768, 384])

# ... the rest of the owl.
trainer = SentenceTransformerTrainer(
    model=model,
    train_dataset=train_dataset,
    loss=loss,
)
trainer.train()
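Once trained, the nested granularity can be used directly at inference time. Below is a minimal sketch using Sentence Transformers' truncate_dim option; the checkpoint path is hypothetical and assumes the trainer above saved the model there:

from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

# Hypothetical path to the checkpoint saved by the trainer above.
model = SentenceTransformer("output/mpnet-matryoshka", truncate_dim=384)

# Embeddings are sliced to the nested 384-dimensional granularity.
query_embedding = model.encode("what are matryoshka embeddings?")
doc_embedding = model.encode("Matryoshka representations nest coarse-to-fine information.")
print(query_embedding.shape, cos_sim(query_embedding, doc_embedding))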

Experiments

We'll be extending the existing UForm benchmark suite to include evaluations of Matryoshka-style embeddings at different levels of granularity. Additionally, we'll take a look at the information sensitivity of UForm embeddings to see how this compares to that of OpenAI's text-embedding-3-small model.


Matryoshka Benchmarks

Existing UForm model benchmarks are conducted on the synthetic multi-lingual COCO-SM dataset which consists mainly of text-image retrieval tasks. Using this same dataset, we'll evaluate the text-to-image COCO-SM retrieval recall of UForm (uform3-image-text-multilingual-base) embeddings at different levels of Matryoshka coarseness, ranging from 16-256 dimensions with a stride of 16. Similar to the official benchmarks, we'll be comparing UForm performance against OpenCLIP (CLIP ViT-B/32 XLM RoBERTa base) as the baseline.

We compare the mean proportion of relevant entries retrieved in the top-$k$ across all languages in the COCO-SM dataset. Here "OpenCLIP @ 10" denotes OpenCLIP's recall at $k=10$. In this comparison it is worth noting that while both models are Vision Transformers, the OpenCLIP model has roughly half a billion parameters and a full 512 dimensions, whereas the UForm model is quantized and has 206M parameters.
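As a rough sketch of how the per-slice recall can be computed, assuming pre-computed UForm text and image embedding matrices where pair $i$'s correct image sits at row $i$ (the variable names and pairing convention are illustrative assumptions):

import numpy as np

def recall_at_k(text_emb: np.ndarray, image_emb: np.ndarray, dims: int, k: int = 10) -> float:
    """Text-to-image recall@k when both embeddings are sliced to the first `dims` dimensions."""
    t = text_emb[:, :dims].copy()
    im = image_emb[:, :dims].copy()
    # Re-normalize the slices so the dot product is cosine similarity.
    t /= np.linalg.norm(t, axis=1, keepdims=True)
    im /= np.linalg.norm(im, axis=1, keepdims=True)
    scores = t @ im.T  # (num_texts, num_images)
    top_k = np.argsort(-scores, axis=1)[:, :k]
    hits = (top_k == np.arange(len(t))[:, None]).any(axis=1)
    return float(hits.mean())

# Recall@10 across Matryoshka slices from 16 to 256 with a stride of 16,
# where text_embeddings and image_embeddings would come from the UForm model.
# results = {d: recall_at_k(text_embeddings, image_embeddings, d) for d in range(16, 257, 16)}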

The results are shown in Figure 1 below.

Figure 1: Text-to-image retrieval recall of UForm at 16-256 dimensions compared to full 512-dim OpenCLIP

Despite the parameter difference, UForm embeddings strictly outperform the OpenCLIP embeddings on COCO-SM text-to-image retrieval recall at granularities greater than 200 dimensions (39% of the size of OpenCLIP). This is a good result, and shows that the UForm model is able to learn compact representations that are competitive with models well out of its weight class, even at 78% of its full resolution. However, from there, with each omitted dimension, the UForm text-to-image retrieval recall decreases roughly linearly by approximately 0.27 percentage points on average. This linear degradation is fairly steep, so we'll take a look at the information sensitivity of UForm embeddings to see how it compares to that of OpenAI's text-embedding-3-small model.

The COCO-SM dataset contains synthetic translations from three different translation methods. In evaluating on the COCO-SM dataset, we're using 40,504 text-image pairs across 21 languages, computing the per-language mean across all three translation methods. In this setting, there is potentially a significant number of distractors, especially compared to the original COCO dataset. With current state-of-the-art performance on similar retrieval benchmarks sitting well above 60%, I would expect a good model to have at least 60% recall among its top-10 results. With this threshold, UForm dimensionalities below ~192 would be deemed unviable.


For a full breakdown of the results, see Table 1 below.

Table 1: Recall of COCO-SM text-to-image retrieval of UForm at 16-256 dimensions compared to full 512-dim OpenCLIP

Embedding Dimensions on COCO-SM

In order to better understand the composition of the UForm Matryoshka representations, let's take a closer look at the individual dimensions. One way to analyze the representations is to look at the variance across the embedding dimensions. This gives us insights into the sensitivity of each dimension, serving as a proxy for the amount of information carried in each individual dimension relative to others. When using Matryoshka Representation Learning, the variance of the dimensions should be distributed across the different granularities, with the highest variance being in the coarsest granularity and with variance diffusing evenly across the nested dimensions optimized for during training.
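A minimal sketch of this per-dimension analysis, assuming a pre-computed matrix of caption embeddings of shape (num_captions, embedding_dim) — the file name is hypothetical:

import numpy as np
import matplotlib.pyplot as plt

# Hypothetical dump of UForm caption embeddings computed over COCO-SM.
embeddings = np.load("uform_cocosm_caption_embeddings.npy")

# Standard deviation of each dimension, a rough proxy for how much information it carries.
per_dim_std = embeddings.std(axis=0)

plt.bar(np.arange(per_dim_std.shape[0]), per_dim_std)
plt.xlabel("Embedding dimension")
plt.ylabel("Std. dev.")
plt.show()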

Prior work by Weaviate shows that it is possible to use this proxy to easily identify the Matryoshka granularity of OpenAI's text-embedding-3-large model. To provide a similar baseline for comparison, we'll also look at the variance of the text-embedding-3-small model as computed on the full COCO-SM multi-lingual captions dataset.
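For this baseline, the text-embedding-3-small embeddings can be fetched via the OpenAI API. A rough sketch, assuming the COCO-SM captions are already loaded as a list of strings (the batch size is an arbitrary choice):

import numpy as np
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

def embed_captions(captions: list[str], batch_size: int = 256) -> np.ndarray:
    """Embed captions with text-embedding-3-small and stack them into a matrix."""
    vectors = []
    for i in range(0, len(captions), batch_size):
        response = client.embeddings.create(
            model="text-embedding-3-small",
            input=captions[i : i + batch_size],
        )
        vectors.extend(item.embedding for item in response.data)
    return np.asarray(vectors, dtype=np.float32)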

The standard deviations of the dimensions in UForm embeddings computed on COCO-SM data are shown in Figure 2.

Figure 2: Standard deviation (std. dev.) of dimensions in UForm embeddings computed on COCO-SM data.

Here it is not immediately obvious that there would be any distinct regions of variance, as you would expect from a model trained using Matryoshka Representation Learning.

The standard deviations of the dimensions in text-embedding-3-small embeddings computed on COCO-SM text caption data are shown in Figure 3.

Figure 3: Standard deviation (std. dev.) of dimensions in text-embedding-3-small embeddings computed on COCO-SM text caption data.

Comparing Figure 2 and Figure 3, we see that the variance of the UForm model in Figure 2 is consistently around three times higher than that of the text-embedding-3-small model in Figure 3. In Figure 3, it is possible to spot how information diffuses evenly across the Matryoshka granularities of 512, 1024 and 1536, as you would expect from a model trained using Matryoshka Representation Learning. In contrast, the UForm model in Figure 2 shows a uniform distribution of variance across all dimensions, similar to what you would expect from a model trained using standard non-Matryoshka representation learning. These findings suggest that the UForm model, while presenting an impressively compact representation, was not trained using Matryoshka Representation Learning.

Conclusion

We explored the fundamentals of Matryoshka Representation Learning, and evaluated the UForm model (specifically uform3-image-text-multilingual-base) on the COCO-SM dataset. We found that the UForm model is highly effective for text-to-image retrieval in a multi-lingual setting, and that its performance holds up to some extent when reducing the embedding resolution. In investigating the UForm embeddings, we've seen that they differ from OpenAI's text-embedding-3-small and -large models in that information does not diffuse evenly within any noticeable Matryoshka ranges across the nested dimensions. This, along with the steep degradation in performance when reducing the embedding resolution, suggests that the UForm model likely does not have Matryoshka-style embeddings.

Disclaimer

I am a big fan of Unum's work, and I'm a big fan of the UForm models. This post is not meant to be a critique of the UForm models nor Unum's work, but rather a curious investigation into the UForm model's embeddings and how they relate to the Matryoshka Representation Learning framework. Going into writing this post, I was under the impression that all UForm models had Matryoshka-style embeddings, and I expected to use the UForm model as a nice example of a Matryoshka Representation Learning model. However, through my investigations, and after subsequently confirming with Unum, I found that not all UForm models have Matryoshka-style embeddings. While the UForm README mentions "64-dimensional Matryoshka-style embeddings for extremely fast search.", this does not refer to the whole UForm model family but only to the newest generation of English-only embedding models.

I had fun learning more about Matryoshka Representation Learning and writing this post. Further investigation into these models is left for future work.