A Cold-Start Dual-Encoder Embedding Pipeline
for Bootstrapping Graph-Based Recommendation and Retrieval Systems
This is a clean, simple solution we put together in a few hours to try and solve a problem we ran into in an exercise. It is not intended as a formal write-up. Code and demo coming soon... :)
1 Introduction
We ran into a simple cold-start problem: graph-based recommendation works best when user interactions already exist, but in a new setting there may be no useful graph at all. Our workaround was to use product images and text to build a shared embedding space, then use that space for retrieval, lightweight preference elicitation, and early feed generation. The goal here is simply to explain the idea clearly rather than present a formal experimental write-up.
2 Problem Setting
Consider a product catalogue in which each product $i$ has an image $x_i$ and a text description $t_i$. The objective is to construct a shared metric space in which semantic similarity between products, or between a product and a query-like preference vector, can be computed as a scalar score.
3 Dual-Encoder Formulation
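A dual-encoder architecture is used, with an image encoder $f_I$ and a text encoder $f_T$. Here $f_I$ is a Vision Transformer and $f_T$ is an autoregressive Transformer over tokenized text. For a product with image $x_i$ and text description $t_i$, their raw outputs are
$$\tilde{u}_i = f_I(x_i) \in \mathbb{R}^{d}, \qquad \tilde{v}_i = f_T(t_i) \in \mathbb{R}^{d}.$$
The two modalities are commensurable, as both share the same latent space of dimension $d$. Both embeddings are then $\ell_2$-normalized,
$$u_i = \frac{\tilde{u}_i}{\lVert \tilde{u}_i \rVert_2}, \qquad v_i = \frac{\tilde{v}_i}{\lVert \tilde{v}_i \rVert_2},$$
so that all points lie on the unit hypersphere $S^{d-1}$ and dot products coincide with cosine similarity.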
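As a quick check of the normalization step, here is a minimal sketch in plain Python. The vectors are toy stand-ins for raw encoder outputs, and the helper names (`l2_normalize`, `dot`, `cosine`) are ours for illustration, not part of any released code.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize the zero vector")
    return [x / norm for x in vec]

def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    """Cosine similarity computed from the unnormalized vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

# Toy stand-ins for raw encoder outputs (hypothetical values).
raw_image = [3.0, 4.0, 0.0]
raw_text = [1.0, 2.0, 2.0]

u = l2_normalize(raw_image)
v = l2_normalize(raw_text)

# After normalization the dot product equals the cosine similarity.
similarity = dot(u, v)
```

Once everything is normalized, the index only needs inner products, which is why the later retrieval steps never compute norms again.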
4 Contrastive Training Objective
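The encoders and projection layers are trained jointly on paired observations $\{(x_i, t_i)\}_{i=1}^{N}$ using a symmetric contrastive objective. For normalized embeddings $u_i$ and $v_j$, define
$$s_{ij} = \frac{u_i^{\top} v_j}{\tau},$$
where $\tau > 0$ is a learnable temperature parameter. The image-to-text and text-to-image losses are
$$\mathcal{L}_{I \to T} = -\frac{1}{N} \sum_{i=1}^{N} \log \frac{\exp(s_{ii})}{\sum_{j=1}^{N} \exp(s_{ij})}, \qquad \mathcal{L}_{T \to I} = -\frac{1}{N} \sum_{i=1}^{N} \log \frac{\exp(s_{ii})}{\sum_{j=1}^{N} \exp(s_{ji})},$$
and the total objective is
$$\mathcal{L} = \tfrac{1}{2}\left(\mathcal{L}_{I \to T} + \mathcal{L}_{T \to I}\right).$$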
Minimizing this loss aligns matched image-text pairs while separating mismatched pairs, making cosine similarity a useful proxy for semantic relatedness.
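To make the objective concrete, the sketch below computes a symmetric contrastive loss on a tiny batch in plain Python. It assumes the usual form (softmax cross-entropy over rows and columns of the scaled similarity matrix, averaged with equal weight) and a fixed temperature of 0.1. The function names and the toy batch are illustrative only.

```python
import math

def log_softmax(row):
    """Numerically stable log-softmax of a list of logits."""
    m = max(row)
    log_z = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_z for x in row]

def symmetric_contrastive_loss(img, txt, tau):
    """img, txt: lists of unit vectors where img[i] and txt[i] are a matched pair."""
    n = len(img)
    # logits[i][j] = <img_i, txt_j> / tau
    logits = [[sum(a * b for a, b in zip(img[i], txt[j])) / tau
               for j in range(n)] for i in range(n)]
    # Image-to-text: softmax over each row, the target is the diagonal entry.
    loss_i2t = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    # Text-to-image: softmax over each column, the target is the diagonal entry.
    cols = [[logits[i][j] for i in range(n)] for j in range(n)]
    loss_t2i = -sum(log_softmax(cols[j])[j] for j in range(n)) / n
    # Equal weighting of the two directions is an assumption of this sketch.
    return 0.5 * (loss_i2t + loss_t2i)

# Toy batch: two orthogonal image embeddings with matched or swapped texts.
aligned_img = [[1.0, 0.0], [0.0, 1.0]]
aligned_txt = [[1.0, 0.0], [0.0, 1.0]]
swapped_txt = [[0.0, 1.0], [1.0, 0.0]]

loss_aligned = symmetric_contrastive_loss(aligned_img, aligned_txt, tau=0.1)
loss_swapped = symmetric_contrastive_loss(aligned_img, swapped_txt, tau=0.1)
```

The aligned batch gives a loss near zero while the swapped batch is heavily penalized, which is the behaviour the objective is meant to produce.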
5 Product Embedding Construction
Each product yields an image embedding $u_i$ and a text embedding $v_i$. They are combined into a single product representation by convex combination followed by re-normalization:
$$e_i = \frac{\alpha\, u_i + (1 - \alpha)\, v_i}{\lVert \alpha\, u_i + (1 - \alpha)\, v_i \rVert_2},$$
where $\alpha \in [0, 1]$ controls the balance between visual and textual signal. The resulting embeddings are stored in a vector index supporting cosine-similarity search.
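The fusion step is small enough to sketch directly. The snippet below assumes both inputs are already unit vectors and uses a hypothetical mixing weight named `alpha`. The function name `fuse` is ours.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize the zero vector")
    return [x / norm for x in vec]

def fuse(image_emb, text_emb, alpha):
    """Convex combination of two embeddings, re-normalized onto the unit sphere."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    mixed = [alpha * a + (1.0 - alpha) * b for a, b in zip(image_emb, text_emb)]
    return l2_normalize(mixed)

# Toy unit vectors standing in for one product's image and text embeddings.
image_emb = [1.0, 0.0]
text_emb = [0.0, 1.0]

product_emb = fuse(image_emb, text_emb, alpha=0.5)    # equal weight
image_only = fuse(image_emb, text_emb, alpha=1.0)     # ignores the text
```

The re-normalization matters because a convex combination of two unit vectors is generally shorter than unit length. In the degenerate case where the two embeddings cancel exactly, the sketch raises an error instead of returning an arbitrary direction.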
6 Geometry of the Embedding Space
Two geometric properties are central to the downstream system. First, because all embeddings lie on the unit hypersphere $S^{d-1}$, cosine similarity is directly comparable across the space. Second, averaging a set of nearby embeddings and re-normalizing yields a meaningful centroid, which allows clusters and user selections to be summarized by a single query vector.
7 Offline Clustering via Stochastic Probing
In a cold-start setting, the catalogue is often still small enough to fit comfortably in memory, so direct clustering of the full embedding set is both feasible and preferable whenever bulk export is available. The stochastic probing procedure described here is therefore only an approximation for settings where the vector database does not expose the full embedding set.
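We sample $M$ random unit vectors
$$r_1, \dots, r_M \sim \mathrm{Unif}(S^{d-1}),$$
and query the index with each probe $r_m$ to obtain a top-$k$ neighbourhood $\mathcal{N}_k(r_m)$. The union of these neighbourhoods forms a sampled subset
$$\hat{E} = \bigcup_{m=1}^{M} \mathcal{N}_k(r_m).$$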
This approximation has a clear weakness: in high dimensions, each probe covers only a very small region of the hypersphere $S^{d-1}$, so sparse or isolated product groups are more likely to be missed entirely. The number of probes needed for reliable coverage depends on the unknown data distribution, so it cannot be determined analytically in any general way.
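The probing loop can be sketched as follows. This is a rough illustration under two assumptions: probes are drawn uniformly on the sphere by normalizing Gaussian samples, and a brute-force `top_k` function stands in for whatever query call the real vector database exposes. The catalogue is random toy data.

```python
import math
import random

def random_unit_vector(dim, rng):
    """Uniform direction on the sphere via a normalized Gaussian draw."""
    g = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in g))
    return [x / norm for x in g]

def top_k(query, embeddings, k):
    """Brute-force stand-in for the index: ids of the k largest dot products."""
    ranked = sorted(
        range(len(embeddings)),
        key=lambda i: sum(q * e for q, e in zip(query, embeddings[i])),
        reverse=True,
    )
    return ranked[:k]

def probe_sample(embeddings, num_probes, k, seed=0):
    """Union of the top-k neighbourhoods of num_probes random probes."""
    rng = random.Random(seed)
    dim = len(embeddings[0])
    sampled = set()
    for _ in range(num_probes):
        probe = random_unit_vector(dim, rng)
        sampled.update(top_k(probe, embeddings, k))
    return sorted(sampled)

# Toy catalogue: 50 random unit vectors in 16 dimensions.
catalogue_rng = random.Random(42)
catalogue = [random_unit_vector(16, catalogue_rng) for _ in range(50)]

sampled_ids = probe_sample(catalogue, num_probes=8, k=5)
coverage = len(sampled_ids) / len(catalogue)  # fraction of the catalogue reached
```

Tracking `coverage` as the number of probes grows is one cheap way to see the weakness described above: overlapping neighbourhoods mean the sampled subset grows more slowly than the query count.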
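$K$-means with $K$-means++ initialization is then run on the sampled subset $\hat{E}$ to obtain centroids $c_1, \dots, c_K$, which are re-normalized to remain on $S^{d-1}$. For each centroid, the nearest real product is retrieved,
$$i^{*}_j = \arg\max_{i} \; e_i^{\top} c_j, \qquad j = 1, \dots, K,$$
where $e_i$ are the product embeddings, producing one representative item per semantic cluster.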
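For readers who want to see the clustering step end to end, here is a compact, dependency-free sketch: k-means++ seeding, a fixed number of Lloyd iterations, centroid re-normalization, and nearest-product lookup by dot product. It is written for clarity rather than speed, a library implementation would be the sensible choice in practice, and the six toy points are made up.

```python
import math
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def l2_normalize(vec):
    norm = math.sqrt(dot(vec, vec))
    return [x / norm for x in vec]

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans_pp_init(points, k, rng):
    """k-means++ seeding: each new centre is drawn with probability
    proportional to its squared distance from the nearest existing centre.
    Assumes the points contain at least k distinct vectors."""
    centres = [rng.choice(points)]
    while len(centres) < k:
        d2 = [min(sq_dist(p, c) for c in centres) for p in points]
        centres.append(rng.choices(points, weights=d2, k=1)[0])
    return centres

def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd iterations in Euclidean space, started from k-means++ seeds."""
    rng = random.Random(seed)
    centres = kmeans_pp_init(points, k, rng)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: sq_dist(p, centres[c]))
            groups[nearest].append(p)
        # Recompute means; keep the old centre if a cluster ends up empty.
        centres = [[sum(col) / len(g) for col in zip(*g)] if g else centres[j]
                   for j, g in enumerate(groups)]
    return centres

def representatives(points, centres):
    """Re-normalize each centroid and return the index of its nearest real point."""
    unit_centres = [l2_normalize(c) for c in centres]
    return [max(range(len(points)), key=lambda i: dot(points[i], c))
            for c in unit_centres]

# Toy data: two tight groups of unit vectors (indices 0-2 and 3-5).
points = [l2_normalize(p) for p in [
    [1.0, 0.0], [0.9, 0.1], [0.95, 0.05],
    [0.0, 1.0], [0.1, 0.9], [0.05, 0.95],
]]

centres = kmeans(points, k=2)
rep_ids = representatives(points, centres)  # one product index per cluster
```

Returning a real product index rather than the centroid itself is what lets the interface show an actual item for each cluster.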
8 Two-Stage Preference Elicitation
User modeling begins without historical interactions, so we estimate preferences through two lightweight selection stages.
In Stage 1, the user selects from the $K$ representative products. Let $C \subseteq \{1, \dots, K\}$ denote the chosen cluster indices. This identifies the semantic regions of the embedding space that match the user's broad taste.
In Stage 2, for each selected cluster $j \in C$, we query with centroid $c_j$ and retrieve a small refinement set $R_j$. The user then selects from $R_j$, producing a second-stage set $S_j \subseteq R_j$. This stage sharpens preferences within the broad regions identified in Stage 1.
9 Centroid Refinement
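For each cluster $j \in C$, let $S_j$ denote the second-stage selections from that cluster. The refined centroid is defined by the re-normalized mean of the selected product embeddings $e_i$,
$$\hat{c}_j = \frac{\sum_{i \in S_j} e_i}{\left\lVert \sum_{i \in S_j} e_i \right\rVert_2}.$$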
This moves the cluster query from a generic representative toward the subregion the user actually prefers.
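A minimal sketch of the refinement step, assuming the refined centroid is the re-normalized mean of the user's second-stage selections (the averaging property from Section 6). The fallback to the original centroid when nothing was selected is our own assumption, as are the function names.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize the zero vector")
    return [x / norm for x in vec]

def refine_centroid(centroid, selected):
    """Re-normalized mean of the selected embeddings.

    Assumption: if the user picked nothing in this cluster, the original
    centroid is kept unchanged.
    """
    if not selected:
        return list(centroid)
    mean = [sum(col) / len(selected) for col in zip(*selected)]
    return l2_normalize(mean)

# Toy example: the generic centroid points along the first axis, but the
# user's picks sit on the diagonal.
centroid = [1.0, 0.0]
selected = [l2_normalize([1.0, 1.0]), l2_normalize([1.0, 1.0])]

refined = refine_centroid(centroid, selected)
unchanged = refine_centroid(centroid, [])
```

The refined vector can then be used as the query for that cluster in place of the generic centroid.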
10 Proportional Slot Allocation
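Let $B$ denote the desired number of products shown on a page or retrieval step. These slots are allocated across preferred clusters in proportion to user votes from both elicitation stages. Let
$$w_j = 1 + \lvert S_j \rvert, \qquad q_j = B \cdot \frac{w_j}{\sum_{l \in C} w_l}, \qquad j \in C,$$
where $w_j$ counts one Stage 1 vote plus the Stage 2 selections $S_j$ for each chosen cluster $j \in C$. The value of $B$ is application-dependent and may be set by interface design, latency constraints, or user preference. The fractional quotas $q_j$ are converted into integer allocations $n_j$ using the largest remainder method, yielding counts that sum exactly to $B$.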
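The largest remainder step is easy to get subtly wrong, so here is a short sketch. The vote counts and cluster labels are made up, and ties between equal remainders are broken by insertion order, which is an arbitrary choice on our part.

```python
import math

def allocate_slots(votes, total_slots):
    """Largest remainder allocation.

    votes: dict mapping cluster id -> non-negative vote count.
    Returns a dict mapping cluster id -> integer slot count summing to total_slots.
    """
    total_votes = sum(votes.values())
    if total_votes <= 0:
        raise ValueError("need at least one vote")
    # Fractional quota for each cluster.
    quotas = {c: total_slots * v / total_votes for c, v in votes.items()}
    # Start from the floor of each quota.
    counts = {c: math.floor(q) for c, q in quotas.items()}
    leftover = total_slots - sum(counts.values())
    # Hand the remaining slots to the largest fractional remainders.
    by_remainder = sorted(quotas, key=lambda c: quotas[c] - counts[c], reverse=True)
    for c in by_remainder[:leftover]:
        counts[c] += 1
    return counts

# Hypothetical votes for three preferred clusters and a page of 10 slots.
votes = {"A": 3, "B": 2, "C": 1}
slots = allocate_slots(votes, total_slots=10)
# Quotas are 5.0, 3.33 and 1.67, so the one leftover slot goes to "C".
# slots == {"A": 5, "B": 3, "C": 2}
```

Simple rounding would also give 5, 3 and 2 here, but in general it can overshoot or undershoot the page size, which is the reason to use largest remainder.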
11 Evaluation Framework
This comparison is based on feedback from 5 people and is intended only as a practical sanity check across three systems: a random feed, an unnamed cold-start recommender baseline, and the proposed dual-encoder pipeline. It is not presented as a formal experimental study. With 5 participants, each person accounts for 20 percentage points, so the preference split should be read only as an early directional product signal.
| Feed | Share of Subjects |
|---|---|
| Random feed | 0% |
| Unnamed cold-start baseline | 20% |
| Proposed dual-encoder pipeline | 80% |
Table 1: Informal preference split across three feeds using feedback from a set of 5 people.
12 Closing Notes
This approach is only as good as the metadata behind it. If the images or text are weak, noisy, or biased, the recommendations will reflect that. It is also a fairly simple retrieval pipeline, so in settings where precise ranking matters most, more expensive models would likely do better. Still, it gave us a simple way to make cold-start recommendations before any useful interaction graph existed.
