Investigating UForm Matryoshka Embeddings

December 12th, 2024 • 10 min read

Dense embeddings are a key component of modern information retrieval, with vector-based dense retrieval now powering most retrieval-augmented generation applications. You get dense embeddings from an embedding model — today, this is usually a neural network trained specifically to learn useful representations.

Given the right embedding model, you can turn any type of data into a vector representation that can be used for similarity search, clustering, classification, and other downstream tasks. When computational efficiency matters, you want an embedding model that is lightweight and produces particularly compact representations. Unum's UForm library and tiny embedding models are great for this.

In this blog post, we'll explore what Matryoshka representations are, and investigate the nested embedding performance of UForm models.

Background

During my time at Ogment AI, we were working on a vector-based dense retrieval system and embedding pipeline to allow for efficient similarity search over the world's job profiles. As part of the solution, I designed a profile attribute embedding pipeline that would allow for semantic clustering of components of job profiles. This pipeline would need to process billions of attributes overall, and many thousands of attributes per hour, so we needed a fast model and a compact representation that would allow for efficient similarity search while keeping vector index compute requirements low. The UForm embedding model family is lightweight and seemed like a great fit for this task.


Why UForm?

UForm provides very lightweight quantized Transformer models for dealing with image media and multi-lingual text. These models are multi-modal and fuse information from different modalities into a single embedding space. On Unum's multi-lingual adaptation of the Common Objects in Context (COCO) evaluation dataset, UForm models achieve state-of-the-art performance while being 2-4x faster than alternative open-source models. Out of the box, UForm embeddings are 256-dimensional, but the README suggests using 64-dimensional Matryoshka-style embeddings. At Ogment we could afford to use the full 256-dimensional embeddings, but couldn't afford to rigorously evaluate the 64-dimensional embeddings. So this blog post is an attempt to do just that.

What are Matryoshka Representations?

When applying dense embeddings to solve a problem, you care about the representational capacity and how efficiently it accommodates the problem at hand. General-purpose embedding models are designed to be flexible and accommodate a wide range of problems. In doing so, they may not be the most efficient representation for a specific problem. Matryoshka representations encode information at different granularities in coarse-to-fine embeddings. This allows a single embedding to be adapted to the computational constraints of a specific problem by simply slicing the embedding.

Using Matryoshka Representation Learning, any representation learning setup can be adapted to learn multi-granular Matryoshka representations.
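As a rough sketch of what "simply slicing the embedding" means in practice (the 256-dimensional embedding and the 64-dimensional target here are illustrative assumptions), you keep the first $m$ dimensions and re-normalize before computing similarities:

import numpy as np

def slice_matryoshka(embedding: np.ndarray, m: int) -> np.ndarray:
    """Keep the first m (coarsest) dimensions and re-normalize for cosine similarity."""
    sliced = embedding[..., :m]
    return sliced / np.linalg.norm(sliced, axis=-1, keepdims=True)

# Hypothetical full-resolution embedding.
full = np.random.randn(256).astype(np.float32)

# 64-dimensional coarse view of the same embedding.
coarse = slice_matryoshka(full, 64)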

Formal Example: Fully Supervised Representation Learning

To get Matryoshka representations, we modify the representation learning objective to incentivise the model to capture information at multiple granularities. Here the typical representation learning task is turned into a multi-scale representation learning problem on the same task. In the case of a fully supervised representation learning task using multi-class classification, we modify the typical multi-class classification loss into multiple layers of losses for each pre-defined level of granularity.
Suppose we're trying to learn a $d$-dimensional representation $z \in \mathbb{R}^d$ for some input $x$ in domain $\mathcal{X}$ using regular multi-class classification. Here we use a neural network $F(\cdot; \theta_F): \mathcal{X} \to \mathbb{R}^d$ that maps an input $x$ to our representation such that $z = F(x; \theta_F)$. Given a dataset $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^N$ where $y_i \in [L]$ is the label for input $x_i$ for all $i \in [N]$, we have the following minimization problem: $$ \min_{\theta_F} \frac{1}{N} \sum_{i \in [N]} \mathcal{L}(F(x_i; \theta_F), y_i) $$ where $\mathcal{L}$ is the multi-class softmax cross-entropy loss. This is a standard multi-class classification objective which penalizes the mismatch between the prediction derived from $F(x_i; \theta_F)$ and the true label $y_i$.
Now, to learn Matryoshka multi-granularity representations we modify this to instead optimize the multi-class classification loss for each nested Matryoshka granularity $m \in \mathcal{M}$ using separate linear classifiers, parameterized by $W^{(m)} \in \mathbb{R}^{L \times m}$, optimized using standard empirical risk minimization: $$ \require{color}\min_{\displaystyle{\colorbox{YellowGreen}{ $\{ \mathbf{W}^{(m)} \}_{m \in \mathcal{M}}$}}, \theta_F} \frac{1}{N} \sum_{i \in [N]} \displaystyle{\colorbox{YellowGreen}{ $\sum_{m \in \mathcal{M}}$ }} \mathcal{L} \left( \displaystyle{\colorbox{YellowGreen}{ $\mathbf{W}^{(m)} \cdot$}} F(x_i; \theta_F)\displaystyle{\colorbox{YellowGreen}{ $_{1:m}$ }}, y_i \right) $$

Green marks Matryoshka modifications.
Here, the intuition is that we're essentially training $|\mathcal{M}|$ separate models all in one by minimizing the aggregate of the loss functions. However, despite optimizing around discrete Matryoshka granularities, what we see in practice is that models trained with Matryoshka Representation Learning diffuse information in an interpolating manner across all $d$ dimensions.
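To make the modified objective concrete, here is a minimal PyTorch sketch of the Matryoshka classification loss described above. The encoder, the granularities, and the per-granularity classifier heads are illustrative assumptions, not a reference implementation:

import torch
import torch.nn as nn

class MatryoshkaClassificationLoss(nn.Module):
    """Sum of cross-entropy losses over nested prefixes of the representation."""

    def __init__(self, num_classes: int, granularities: list[int]):
        super().__init__()
        self.granularities = granularities
        # One linear classifier W^(m) per Matryoshka granularity m.
        self.heads = nn.ModuleDict(
            {str(m): nn.Linear(m, num_classes) for m in granularities}
        )
        self.ce = nn.CrossEntropyLoss()

    def forward(self, z: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
        # z: (batch, d) representation from F(x; theta_F), y: (batch,) labels.
        return sum(
            self.ce(self.heads[str(m)](z[:, :m]), y) for m in self.granularities
        )

# Illustrative usage with a toy encoder F(x; theta_F).
encoder = nn.Sequential(nn.Linear(32, 256), nn.ReLU(), nn.Linear(256, 256))
criterion = MatryoshkaClassificationLoss(num_classes=10, granularities=[64, 128, 256])
x, y = torch.randn(8, 32), torch.randint(0, 10, (8,))
loss = criterion(encoder(x), y)
loss.backward()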

Practical Example: Sentence Transformers

Sentence Transformers is a great and popular library for using and training text and image embedding models.

In the example below, we'll use Sentence Transformers to fine-tune an MPNet-based model (all-mpnet-base-v2) with a triplet loss function on the MS MARCO information retrieval dataset using Matryoshka Representation Learning, with the aim of nesting a coarser 384-dimensional granularity inside the full 768-dimensional embedding.

"""Triplet Loss with Matryoshka Representation Learning."""

import datasets

from sentence_transformers import (
    SentenceTransformer,
    SentenceTransformerTrainer,
    losses,
)

# 1. Load the base MPNet model.
model: SentenceTransformer = SentenceTransformer("all-mpnet-base-v2")

# 2. Load the MS MARCO triplet dataset.
dataset: datasets.DatasetDict = datasets.load_dataset(
    "sentence-transformers/msmarco-msmarco-MiniLM-L-6-v3",
)

# 3. Get dataset in triplet format.
train_dataset: datasets.Dataset = dataset["train"].map(
    lambda x: {
        "anchor": x["query"],
        "positive": x["positive"],
        "negative": x["negative"]
    },
    remove_columns=["query"],
)

# 4. Define triplet loss function.
loss = losses.TripletLoss(model=model)

# 5. Wrap with Matryoshka loss modifier to optimize for Matryoshka granularities of 768 and 384.
loss = losses.MatryoshkaLoss(model, loss, matryoshka_dims=[768, 384])

# ... the rest of the owl.
trainer = SentenceTransformerTrainer(
    model=model,
    train_dataset=train_dataset,
    loss=loss,
)
trainer.train()
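Once trained, the nested granularity can be used directly at inference time. Below is a minimal sketch using Sentence Transformers' truncate_dim option; the checkpoint path is hypothetical and assumes the trainer above saved the model there:

from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

# Hypothetical path to the checkpoint saved by the trainer above.
model = SentenceTransformer("output/mpnet-matryoshka", truncate_dim=384)

# Embeddings are sliced to the nested 384-dimensional granularity.
query_embedding = model.encode("what are matryoshka embeddings?")
doc_embedding = model.encode("Matryoshka representations nest coarse-to-fine information.")
print(query_embedding.shape, cos_sim(query_embedding, doc_embedding))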

Experiments

We'll be extending the existing UForm benchmark suite to include evaluations of Matryoshka-style embeddings at different levels of granularity. Additionally, we'll take a look at the information sensitivity of UForm embeddings to see how this compares to that of OpenAI's text-embedding-3-small model.


Matryoshka Benchmarks

Existing UForm model benchmarks are conducted on the synthetic multi-lingual COCO-SM dataset which consists mainly of text-image retrieval tasks. Using this same dataset, we'll evaluate the text-to-image COCO-SM retrieval recall of UForm (uform3-image-text-multilingual-base) embeddings at different levels of Matryoshka coarseness, ranging from 16-256 dimensions with a stride of 16. Similar to the official benchmarks, we'll be comparing UForm performance against OpenCLIP (CLIP ViT-B/32 XLM RoBERTa base) as the baseline.

We compare the mean proportion of relevant entries retrieved in the top-$k$ across all languages in the COCO-SM dataset. Here "OpenCLIP @ 10" denotes OpenCLIP's recall at $k=10$. In this comparison it is worth noting that while both models are Vision Transformers, the OpenCLIP model has roughly half a billion parameters and a full 512 dimensions, whereas the UForm model is quantized and has 206M parameters.
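As a rough sketch of how the per-slice recall can be computed, assuming pre-computed UForm text and image embedding matrices where pair $i$'s correct image sits at row $i$ (the variable names and pairing convention are illustrative assumptions):

import numpy as np

def recall_at_k(text_emb: np.ndarray, image_emb: np.ndarray, dims: int, k: int = 10) -> float:
    """Text-to-image recall@k when both embeddings are sliced to the first `dims` dimensions."""
    t = text_emb[:, :dims].copy()
    im = image_emb[:, :dims].copy()
    # Re-normalize the slices so the dot product is cosine similarity.
    t /= np.linalg.norm(t, axis=1, keepdims=True)
    im /= np.linalg.norm(im, axis=1, keepdims=True)
    scores = t @ im.T  # (num_texts, num_images)
    top_k = np.argsort(-scores, axis=1)[:, :k]
    hits = (top_k == np.arange(len(t))[:, None]).any(axis=1)
    return float(hits.mean())

# Recall@10 across Matryoshka slices from 16 to 256 with a stride of 16,
# where text_embeddings and image_embeddings would come from the UForm model.
# results = {d: recall_at_k(text_embeddings, image_embeddings, d) for d in range(16, 257, 16)}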

The results are shown in Figure 1 below.

Figure 1: Text-to-image retrieval recall of UForm at 16-256 dimensions compared to full 512-dim OpenCLIP

Despite the parameter difference, UForm embeddings strictly outperform the OpenCLIP embeddings on COCO-SM text-to-image retrieval recall at granularities greater than 200 dimensions (39% of the size of OpenCLIP). This is a good result, and shows that the UForm model is able to learn compact representations that are competitive with models well out of its weight class, even at 78% of its full resolution. However, from there, with each omitted dimension, the UForm text-to-image retrieval recall decreases roughly linearly by approximately 0.27 percentage points on average. This linear degradation is fairly steep, so we'll take a look at the information sensitivity of UForm embeddings to see how it compares to that of OpenAI's text-embedding-3-small model.

The COCO-SM dataset contains synthetic translations from three different translation methods. In evaluating on the COCO-SM dataset, we're using 40,504 text-image pairs across 21 languages, computing the per-language mean across all three translation methods. In this setting, there is potentially a significant number of distractors, especially compared to the original COCO dataset. With current state-of-the-art performance on similar retrieval benchmarks sitting well above 60%, I would expect a good model to have at least 60% recall among its top-10 results. With this threshold, UForm dimensionalities below ~192 would be deemed unviable.


For a full breakdown of the results, see Table 1 below.

Table 1: Recall of COCO-SM text-to-image retrieval of UForm at 16-256 dimensions compared to full 512-dim OpenCLIP

Embedding Dimensions on COCO-SM

In order to better understand the composition of the UForm Matryoshka representations, let's take a closer look at the individual dimensions. One way to analyze the representations is to look at the variance across the embedding dimensions. This gives us insights into the sensitivity of each dimension, serving as a proxy for the amount of information carried in each individual dimension relative to others. When using Matryoshka Representation Learning, the variance of the dimensions should be distributed across the different granularities, with the highest variance being in the coarsest granularity and with variance diffusing evenly across the nested dimensions optimized for during training.
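A minimal sketch of this per-dimension analysis, assuming a pre-computed matrix of caption embeddings of shape (num_captions, embedding_dim) — the file name is hypothetical:

import numpy as np
import matplotlib.pyplot as plt

# Hypothetical dump of UForm caption embeddings computed over COCO-SM.
embeddings = np.load("uform_cocosm_caption_embeddings.npy")

# Standard deviation of each dimension, a rough proxy for how much information it carries.
per_dim_std = embeddings.std(axis=0)

plt.bar(np.arange(per_dim_std.shape[0]), per_dim_std)
plt.xlabel("Embedding dimension")
plt.ylabel("Std. dev.")
plt.show()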

Prior work by Weaviate shows that it is possible to use this proxy to easily identify the Matryoshka granularity of OpenAI's text-embedding-3-large model. To provide a similar baseline for comparison, we'll also look at the variance of the text-embedding-3-small model as computed on the full COCO-SM multi-lingual captions dataset.
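For this baseline, the text-embedding-3-small embeddings can be fetched via the OpenAI API. A rough sketch, assuming the COCO-SM captions are already loaded as a list of strings (the batch size is an arbitrary choice):

import numpy as np
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

def embed_captions(captions: list[str], batch_size: int = 256) -> np.ndarray:
    """Embed captions with text-embedding-3-small and stack them into a matrix."""
    vectors = []
    for i in range(0, len(captions), batch_size):
        response = client.embeddings.create(
            model="text-embedding-3-small",
            input=captions[i : i + batch_size],
        )
        vectors.extend(item.embedding for item in response.data)
    return np.asarray(vectors, dtype=np.float32)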

The standard deviations of the dimensions in UForm embeddings computed on COCO-SM data are shown in Figure 2.

Figure 2: Standard deviation (std. dev.) of dimensions in UForm embeddings computed on COCO-SM data.

Here it is not immediately obvious that there would be any distinct regions of variance, as you would expect from a model trained using Matryoshka Representation Learning.

The standard deviations of the dimensions in text-embedding-3-small embeddings computed on COCO-SM text caption data are shown in Figure 3.

Figure 3: Standard deviation (std. dev.) of dimensions in text-embedding-3-small embeddings computed on COCO-SM text caption data.

Comparing Figure 2 and Figure 3, we see that the variance of the UForm model in Figure 2 is consistently around three times higher than that of the text-embedding-3-small model in Figure 3. In Figure 3, it is possible to spot how information diffuses evenly across the Matryoshka granularities of 512, 1024 and 1536, as you would expect from a model trained using Matryoshka Representation Learning. In contrast, the UForm model in Figure 2 shows a uniform distribution of variance across all dimensions, similar to what you would expect from a model trained using standard non-Matryoshka representation learning. These findings suggest that the UForm model, while presenting an impressively compact representation, was not trained using Matryoshka Representation Learning.

Conclusion

We explored the fundamentals of Matryoshka Representation Learning, and evaluated the UForm model (specifically uform3-image-text-multilingual-base) on the COCO-SM dataset. We found that the UForm model is highly effective for text-to-image retrieval in a multi-lingual setting, and that its performance holds up to some extent when reducing the embedding resolution. In investigating the UForm embeddings, we've seen that they differ from OpenAI's text-embedding-3-small and -large models in that information does not diffuse evenly within any noticeable Matryoshka ranges across the nested dimensions. This, along with the steep degradation in performance when reducing the embedding resolution, suggests that the UForm model likely does not have Matryoshka-style embeddings.

Disclaimer

I am a big fan of Unum's work, and I'm a big fan of the UForm models. This post is not meant to be a critique of the UForm models nor Unum's work, but rather a curious investigation into the UForm model's embeddings and how they relate to the Matryoshka Representation Learning framework. Going into writing this post, I was under the impression that all UForm models had Matryoshka-style embeddings, and I expected to use the UForm model as a nice example of a Matryoshka Representation Learning model. However, through my investigations, and after subsequently confirming with Unum, I found that not all UForm models have Matryoshka-style embeddings. While the UForm README mentions "64-dimensional Matryoshka-style embeddings for extremely fast search.", this does not refer to the whole UForm model family but only to the newest generation of English-only embedding models.

I had fun learning more about Matryoshka Representation Learning and writing this post. Further investigation into these models is left for future work.