A Cold-Start Dual-Encoder Embedding Pipeline
for Bootstrapping Graph-Based Recommendation and Retrieval Systems

Robert Hoang · Eric Wang · Randy Ren

{rhoang01, ewang55, rren05}@student.ubc.ca

April 14, 2026

This is a clean, simple solution we put together in a few hours to try to solve a problem we ran into during an exercise. It is not intended as a formal write-up. Code and demo coming soon... :)

1 Introduction

We ran into a simple cold-start problem: graph-based recommendation works best when user interactions already exist, but in a new setting there may be no useful graph at all. Our workaround was to use product images and text to build a shared embedding space, then use that space for retrieval, lightweight preference elicitation, and early feed generation. The goal here is simply to explain the idea clearly rather than present a formal experimental write-up.

2 Problem Setting

Consider a product catalogue in which each product $p_i$ has an image $I_i$ and text description $T_i$. The objective is to construct a shared metric space in which semantic similarity between products, or between a product and a query-like preference vector, can be computed as a scalar score.

3 Dual-Encoder Formulation

A dual-encoder architecture is used, with an image encoder $f_\theta : \mathcal{I} \to \mathbb{R}^{d_i}$ and a text encoder $g_\phi : \mathcal{T} \to \mathbb{R}^{d_t}$. Here $f_\theta$ is a Vision Transformer and $g_\phi$ is an autoregressive Transformer over tokenized text. Their raw outputs are

\mathbf{h}_I = f_\theta(I), \qquad \mathbf{h}_T = g_\phi(T).

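The raw outputs have different dimensions $d_i$ and $d_t$, so the two modalities are made commensurable by learned projection layers $P_I : \mathbb{R}^{d_i} \to \mathbb{R}^{d}$ and $P_T : \mathbb{R}^{d_t} \to \mathbb{R}^{d}$ that map both into a shared latent space of dimension $d$. The projected embeddings are then $\ell_2$-normalized,

\mathbf{x} = \frac{P_I(\mathbf{h}_I)}{\|P_I(\mathbf{h}_I)\|_2}, \qquad \mathbf{t} = \frac{P_T(\mathbf{h}_T)}{\|P_T(\mathbf{h}_T)\|_2},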

so that all points lie on the unit hypersphere $\mathcal{S}^{d-1}$ and dot products coincide with cosine similarity.
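As a quick check of the normalization step, here is a minimal NumPy sketch. The random arrays are only stand-ins for vectors that already live in the shared $d$-dimensional space, and the helper name `l2_normalize` is ours, not part of any library.

```python
import numpy as np

def l2_normalize(v: np.ndarray, eps: float = 1e-12) -> np.ndarray:
    """Scale vectors along the last axis to unit Euclidean length."""
    norms = np.linalg.norm(v, axis=-1, keepdims=True)
    return v / np.maximum(norms, eps)  # eps guards against a zero vector

rng = np.random.default_rng(0)
h_img = rng.normal(size=(4, 8))  # stand-in image-side vectors (assumed shared dimension d = 8)
h_txt = rng.normal(size=(4, 8))  # stand-in text-side vectors

x = l2_normalize(h_img)
t = l2_normalize(h_txt)

# Dot products of the normalized vectors ...
dot = x @ t.T
# ... match cosine similarity computed from the unnormalized vectors.
cos = (h_img @ h_txt.T) / (
    np.linalg.norm(h_img, axis=1)[:, None] * np.linalg.norm(h_txt, axis=1)[None, :]
)
```

Here `dot` and `cos` agree up to floating-point error, which is why the rest of the pipeline can use plain dot products.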

4 Contrastive Training Objective

The encoders and projection layers are trained jointly on paired observations $\{(I_i, T_i)\}_{i=1}^N$ using a symmetric contrastive objective. For normalized embeddings $\mathbf{x}_i$ and $\mathbf{t}_j$, define

s_{i,j} = \frac{\mathbf{x}_i^\top \mathbf{t}_j}{\tau},

where $\tau > 0$ is a learnable temperature parameter. The image-to-text and text-to-image losses are

\mathcal{L}_{I \to T} = -\frac{1}{N}\sum_{i=1}^{N} \log \frac{\exp(s_{i,i})}{\sum_{j=1}^{N} \exp(s_{i,j})},

\mathcal{L}_{T \to I} = -\frac{1}{N}\sum_{i=1}^{N} \log \frac{\exp(s_{i,i})}{\sum_{j=1}^{N} \exp(s_{j,i})},

and the total objective is

\mathcal{L} = \frac{1}{2}\left(\mathcal{L}_{I \to T} + \mathcal{L}_{T \to I}\right).

Minimizing this loss aligns matched image-text pairs while separating mismatched pairs, making cosine similarity a useful proxy for semantic relatedness.
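The objective above can be sketched in a few lines of NumPy. This is a forward-pass illustration only (no gradients, and the temperature is passed in as a fixed number rather than learned); the function names are ours.

```python
import numpy as np

def log_softmax(logits: np.ndarray, axis: int) -> np.ndarray:
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def symmetric_contrastive_loss(x: np.ndarray, t: np.ndarray, tau: float) -> float:
    """x, t: (N, d) unit-norm image and text embeddings; row i of each is a matched pair."""
    s = (x @ t.T) / tau  # s[i, j] = x_i . t_j / tau
    # Image-to-text: for image i, normalize over all texts j (row-wise).
    loss_i2t = -np.mean(np.diag(log_softmax(s, axis=1)))
    # Text-to-image: for text i, normalize over all images j (column-wise).
    loss_t2i = -np.mean(np.diag(log_softmax(s, axis=0)))
    return float(0.5 * (loss_i2t + loss_t2i))

# Toy check: orthonormal embeddings, first correctly paired, then mis-paired.
x_toy = np.eye(4)
aligned = symmetric_contrastive_loss(x_toy, x_toy, tau=0.07)
shuffled = symmetric_contrastive_loss(x_toy, np.roll(x_toy, 1, axis=0), tau=0.07)
```

With correctly paired embeddings the loss is close to zero, and with every pair mismatched it is large, which is the behaviour the objective is meant to enforce.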

5 Product Embedding Construction

Each product yields an image embedding $\mathbf{x}_i$ and a text embedding $\mathbf{t}_i$. They are combined into a single product representation by convex combination followed by re-normalization:

\mathbf{e}_i = \text{norm}\!\left(\alpha \mathbf{x}_i + (1-\alpha)\mathbf{t}_i\right),

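where $\alpha \in [0,1]$ controls the balance between visual and textual signal and $\text{norm}(\mathbf{v}) = \mathbf{v} / \|\mathbf{v}\|_2$ denotes $\ell_2$ normalization. The resulting embeddings $\{\mathbf{e}_i\}_{i=1}^M$, one per product in a catalogue of size $M$, are stored in a vector index supporting cosine-similarity search.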
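A minimal sketch of the fusion step, assuming the image and text embeddings are already unit-norm arrays of the same shape. The function name `fuse_embeddings` and the default $\alpha = 0.5$ are illustrative choices, not values taken from our setup.

```python
import numpy as np

def fuse_embeddings(x: np.ndarray, t: np.ndarray, alpha: float = 0.5) -> np.ndarray:
    """Convex combination of image (x) and text (t) embeddings, re-normalized to unit length."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    mixed = alpha * x + (1.0 - alpha) * t
    norms = np.linalg.norm(mixed, axis=-1, keepdims=True)
    # The floor avoids division by zero if x and t happen to cancel exactly.
    return mixed / np.maximum(norms, 1e-12)

x_demo = np.array([[1.0, 0.0]])
t_demo = np.array([[0.0, 1.0]])
e_demo = fuse_embeddings(x_demo, t_demo, alpha=0.5)  # points halfway between the two
```

Setting `alpha=1.0` recovers the image embedding and `alpha=0.0` the text embedding, so the same function covers image-only and text-only catalogues.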

6 Geometry of the Embedding Space

Two geometric properties are central to the downstream system. First, because all embeddings lie on $\mathcal{S}^{d-1}$, cosine similarity reduces to a dot product bounded in $[-1, 1]$, so scores are directly comparable across the space. Second, averaging a set of nearby embeddings and re-normalizing yields a meaningful centroid, which allows clusters and user selections to be summarized by a single query vector.

7 Offline Clustering via Stochastic Probing

In a cold-start setting, the catalogue is often still small enough to fit comfortably in memory, so direct clustering of the full embedding set is both feasible and preferable whenever bulk export is available. The stochastic probing procedure described here is therefore only an approximation for settings where the vector database does not expose the full embedding set.

We sample $Q$ random unit vectors, uniformly distributed on $\mathcal{S}^{d-1}$, by normalizing isotropic Gaussian draws,

\mathbf{v}_q \sim \mathcal{N}(\mathbf{0}, I_d), \qquad \mathbf{r}_q = \text{norm}(\mathbf{v}_q),

and query the index with each probe to obtain a top-$K$ neighbourhood $\mathcal{N}_q$. The union of these neighbourhoods forms a sampled subset

\mathcal{P}_{\text{sample}} = \bigcup_{q=1}^{Q} \mathcal{N}_q.

This approximation has a clear weakness: in high dimensions, each probe covers only a very small region of $\mathcal{S}^{d-1}$, so sparse or isolated product groups are more likely to be missed entirely. The number of probes needed for reliable coverage depends on the unknown data distribution, so it cannot be determined analytically in any general way.
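The probing procedure can be sketched as follows. The brute-force `top_k` function is only a stand-in for whatever top-$K$ query the real vector database exposes; both function names and the seed handling are our own illustrative choices.

```python
import numpy as np

def top_k(index_vectors: np.ndarray, query: np.ndarray, k: int) -> np.ndarray:
    """Brute-force stand-in for a vector index: ids of the k items with highest cosine score."""
    scores = index_vectors @ query  # cosine similarity, since all vectors are unit-norm
    k = min(k, len(scores))
    top = np.argpartition(-scores, k - 1)[:k]
    return top[np.argsort(-scores[top])]

def probe_sample(index_vectors: np.ndarray, num_probes: int, k: int, seed: int = 0) -> np.ndarray:
    """Union of top-k neighbourhoods of random unit probes, returned as sorted item ids."""
    rng = np.random.default_rng(seed)
    d = index_vectors.shape[1]
    sampled: set[int] = set()
    for _ in range(num_probes):
        v = rng.normal(size=d)       # v_q ~ N(0, I_d)
        r = v / np.linalg.norm(v)    # r_q = norm(v_q), uniform on the sphere
        sampled.update(int(i) for i in top_k(index_vectors, r, k))
    return np.array(sorted(sampled), dtype=int)
```

The sample size is at most $Q \cdot K$ and usually smaller, since neighbourhoods overlap. Comparing the sample size for increasing $Q$ gives a rough, empirical sense of when coverage stops growing.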

$k$-means with $k$-means++ initialization is then run on $\{\mathbf{e}_i : i \in \mathcal{P}_{\text{sample}}\}$ to obtain centroids $\{\boldsymbol{\mu}_c\}_{c=1}^k$, which are re-normalized to remain on $\mathcal{S}^{d-1}$. For each centroid, the nearest real product is retrieved,

r_c = \arg\min_{i \in [M]} \left(1 - \mathbf{e}_i^\top \boldsymbol{\mu}_c\right),

producing one representative item per semantic cluster.
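A NumPy-only sketch of this step is given below, assuming the catalogue embeddings fit in memory as an array `E` of shape $(M, d)$. The plain Lloyd iterations and all function names are our own simplification; in practice an off-the-shelf $k$-means implementation would do the same job.

```python
import numpy as np

def kmeans_pp_init(X: np.ndarray, k: int, rng: np.random.Generator) -> np.ndarray:
    """k-means++ seeding: each new centre is drawn with probability proportional
    to its squared distance from the nearest centre chosen so far."""
    n = X.shape[0]
    centres = [X[rng.integers(n)]]
    for _ in range(1, k):
        diffs = X[:, None, :] - np.array(centres)[None, :, :]
        d2 = (diffs ** 2).sum(axis=-1).min(axis=1)
        total = d2.sum()
        probs = d2 / total if total > 0 else np.full(n, 1.0 / n)
        centres.append(X[rng.choice(n, p=probs)])
    return np.array(centres)

def kmeans(X: np.ndarray, k: int, iters: int = 50, seed: int = 0) -> np.ndarray:
    """Plain Lloyd iterations from a k-means++ start; returns the (k, d) centroids."""
    rng = np.random.default_rng(seed)
    centres = kmeans_pp_init(X, k, rng)
    for _ in range(iters):
        d2 = ((X[:, None, :] - centres[None, :, :]) ** 2).sum(axis=-1)
        labels = d2.argmin(axis=1)
        # An empty cluster keeps its previous centre.
        new = np.array([
            X[labels == c].mean(axis=0) if np.any(labels == c) else centres[c]
            for c in range(k)
        ])
        if np.allclose(new, centres):
            break
        centres = new
    return centres

def pick_representatives(E: np.ndarray, sample_ids: np.ndarray, k: int, seed: int = 0):
    """Cluster the sampled embeddings, re-normalize centroids, and return
    (centroids, ids of the nearest real product for each centroid)."""
    mu = kmeans(E[sample_ids], k, seed=seed)
    mu = mu / np.maximum(np.linalg.norm(mu, axis=1, keepdims=True), 1e-12)
    # Minimizing 1 - e.mu is the same as maximizing e.mu; search the full catalogue.
    reps = np.argmax(E @ mu.T, axis=0)
    return mu, reps
```

Note that the clustering runs on the sampled subset only, while the representative lookup searches all $M$ products, matching the $\arg\min$ over $[M]$ above.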

8 Two-Stage Preference Elicitation

User modeling begins without historical interactions, so we estimate preferences through two lightweight selection stages.

In Stage 1, the user selects from the $k$ representative products. Let $\mathcal{A}_1 \subseteq \{1, \ldots, k\}$ denote the chosen cluster indices. This identifies the semantic regions of the embedding space that match the user's broad taste.

In Stage 2, for each selected cluster $c \in \mathcal{A}_1$, we query with centroid $\boldsymbol{\mu}_c$ and retrieve a small refinement set $\mathcal{R}_c$. The user then selects from $\bigcup_{c \in \mathcal{A}_1} \mathcal{R}_c$, producing a second-stage set $\mathcal{A}_2$. This stage sharpens preferences within the broad regions identified in Stage 1.
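The Stage 2 retrieval can be sketched as one top-$m$ query per chosen cluster. The brute-force scoring stands in for the vector index, and the refinement-set size `m` is a hypothetical parameter that the text above leaves open.

```python
import numpy as np

def refinement_sets(E: np.ndarray, mu: np.ndarray, chosen_clusters: list[int], m: int) -> dict[int, list[int]]:
    """Map each chosen cluster c to the ids of the m catalogue items most similar to mu[c]."""
    sets: dict[int, list[int]] = {}
    for c in chosen_clusters:
        scores = E @ mu[c]               # cosine similarity to the cluster centroid
        order = np.argsort(-scores)[:m]  # highest-scoring items first
        sets[c] = [int(i) for i in order]
    return sets

# Toy catalogue of four orthogonal items; clusters 0 and 1 are centred on items 0 and 2.
E_toy = np.eye(4)
mu_toy = E_toy[[0, 2]]
stage2 = refinement_sets(E_toy, mu_toy, chosen_clusters=[1], m=1)
```

Keeping the sets keyed by cluster makes it easy to recover which cluster each second-stage selection came from in the steps that follow.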

9 Centroid Refinement

For each selected cluster $c \in \mathcal{A}_1$, let $\mathcal{B}_c \subseteq \mathcal{A}_2$ denote the second-stage selections drawn from its refinement set $\mathcal{R}_c$. The refined centroid is defined by

\boldsymbol{\mu}'_c = \begin{cases} \text{norm}\!\left(\frac{1}{|\mathcal{B}_c|}\sum_{i \in \mathcal{B}_c} \mathbf{e}_i\right), & \text{if } \mathcal{B}_c \neq \emptyset, \\ \boldsymbol{\mu}_c, & \text{otherwise.} \end{cases}

This moves the cluster query from a generic representative toward the subregion the user actually prefers.
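A minimal sketch of the refinement rule, again assuming the catalogue embeddings are a NumPy array `E`. The extra fallback for a zero-length mean is our own defensive assumption and is not part of the definition above.

```python
import numpy as np

def refine_centroid(E: np.ndarray, mu_c: np.ndarray, selected_ids: list[int]) -> np.ndarray:
    """Return the re-normalized mean of the selected embeddings, or mu_c if nothing was selected."""
    if len(selected_ids) == 0:
        return mu_c  # B_c is empty: keep the original centroid
    mean = E[list(selected_ids)].mean(axis=0)
    norm = np.linalg.norm(mean)
    # Assumption: if the selections cancel out exactly, also keep the original centroid.
    return mean / norm if norm > 0 else mu_c

E_small = np.array([[1.0, 0.0], [0.0, 1.0]])
mu_old = np.array([1.0, 0.0])
mu_new = refine_centroid(E_small, mu_old, [0, 1])  # halfway between the two selected items
```

Because the result is re-normalized, the refined centroid can be used as a query vector in exactly the same way as the original one.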

10 Proportional Slot Allocation

Let $N_{\text{feed}}$ denote the desired number of products shown on a page or retrieval step. These slots are allocated across preferred clusters in proportion to user votes from both elicitation stages. Let

v_c = \mathbf{1}\{c \in \mathcal{A}_1\} + |\mathcal{B}_c|, \qquad q_c = \frac{v_c}{\sum_{c'} v_{c'}} N_{\text{feed}}.

The value of $N_{\text{feed}}$ is application-dependent and may be set by interface design, latency constraints, or user preference. The fractional quotas $q_c$ are converted into integer allocations using the largest remainder method: each cluster first receives $\lfloor q_c \rfloor$ slots, and the remaining slots go one at a time to the clusters with the largest fractional parts, yielding counts $n_c$ that sum exactly to $N_{\text{feed}}$.
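The allocation can be sketched in plain Python. Integer arithmetic is used so that remainders are exact; the tie-breaking rule (lowest cluster id first) is our own assumption, since the method itself does not fix one.

```python
def allocate_slots(votes: dict[int, int], n_feed: int) -> dict[int, int]:
    """Largest remainder method: split n_feed slots across clusters in proportion to votes."""
    total = sum(votes.values())
    if total <= 0:
        raise ValueError("at least one vote is required")
    counts: dict[int, int] = {}
    remainders: dict[int, int] = {}
    for c, v in votes.items():
        # q_c = v * n_feed / total; keep the floor and the exact integer remainder.
        counts[c], remainders[c] = divmod(v * n_feed, total)
    leftover = n_feed - sum(counts.values())
    # Remaining slots go to the largest remainders; ties broken by cluster id (assumption).
    order = sorted(votes, key=lambda c: (-remainders[c], c))
    for c in order[:leftover]:
        counts[c] += 1
    return counts

# Example: cluster 0 picked in Stage 1 with two Stage 2 picks (v = 3),
# cluster 1 with one Stage 2 pick (v = 2), cluster 2 with none (v = 1).
feed_counts = allocate_slots({0: 3, 1: 2, 2: 1}, n_feed=10)
```

In this example the quotas are $5$, $3.33$, and $1.67$, so the floors give $5 + 3 + 1 = 9$ slots and the one remaining slot goes to cluster 2, which has the largest fractional part.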

11 Evaluation Framework

This comparison is based on a small set of 5 people and is intended only as a practical sanity check between three systems: a random feed, an unnamed cold-start recommender baseline, and the proposed dual-encoder pipeline. It is not presented as a formal experimental study, but the preference split is still useful as an early directional product signal.

Feed	Share of Subjects
Random feed	0%
Unnamed cold-start baseline	20%
Proposed dual-encoder pipeline	80%

Table 1: Informal preference split across three feeds using feedback from a set of 5 people.

12 Closing Notes

This approach is only as good as the metadata behind it. If the images or text are weak, noisy, or biased, the recommendations will reflect that. It is also a fairly simple retrieval pipeline, so in settings where precise ranking matters most, more expensive models would likely do better. Still, it gave us a simple way to make cold-start recommendations before any useful interaction graph existed.
