A Cold-Start Dual-Encoder Embedding Pipeline
for Bootstrapping Graph-Based Recommendation and Retrieval Systems

Robert Hoang  ·  Eric Wang  ·  Randy Ren
{rhoang01, ewang55, rren05}@student.ubc.ca
April 14, 2026

This is a clean, simple solution we put together in a few hours to try and solve a problem we ran into in an exercise. It is not intended as a formal write-up. Code and demo coming soon... :)

1  Introduction

We ran into a simple cold-start problem: graph-based recommendation works best when user interactions already exist, but in a new setting there may be no useful graph at all. Our workaround was to use product images and text to build a shared embedding space, then use that space for retrieval, lightweight preference elicitation, and early feed generation. The goal here is simply to explain the idea clearly rather than present a formal experimental write-up.

2  Problem Setting

Consider a product catalogue in which each product $p_i$ has an image $I_i$ and text description $T_i$. The objective is to construct a shared metric space in which semantic similarity between products, or between a product and a query-like preference vector, can be computed as a scalar score.

3  Dual-Encoder Formulation

A dual-encoder architecture is used, with an image encoder $f_\theta : \mathcal{I} \to \mathbb{R}^{d_i}$ and a text encoder $g_\phi : \mathcal{T} \to \mathbb{R}^{d_t}$. Here $f_\theta$ is a Vision Transformer and $g_\phi$ is an autoregressive Transformer over tokenized text. Their raw outputs are

$$\mathbf{h}_I = f_\theta(I), \qquad \mathbf{h}_T = g_\phi(T).$$

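The raw outputs live in spaces of dimension $d_i$ and $d_t$, so learned projection layers (trained jointly with the encoders, see Section 4) first map each into a shared latent space of dimension $d$, which makes the two modalities commensurable. With $\mathbf{h}_I$ and $\mathbf{h}_T$ from here on denoting the projected vectors, both embeddings are then $\ell_2$-normalized,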

$$\mathbf{x} = \frac{\mathbf{h}_I}{\|\mathbf{h}_I\|_2}, \qquad \mathbf{t} = \frac{\mathbf{h}_T}{\|\mathbf{h}_T\|_2},$$

so that all points lie on the unit hypersphere $\mathcal{S}^{d-1}$ and dot products coincide with cosine similarity.
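As a quick check of the normalization step, here is a minimal NumPy sketch. The helper name `l2_normalize` and the random stand-in features are our own illustration, not part of the pipeline code; the point is only that after normalization a plain dot product gives the cosine similarity.

```python
import numpy as np

def l2_normalize(h: np.ndarray, eps: float = 1e-12) -> np.ndarray:
    """Scale each row of h to unit Euclidean length (eps guards against zero rows)."""
    norms = np.linalg.norm(h, axis=-1, keepdims=True)
    return h / np.maximum(norms, eps)

rng = np.random.default_rng(0)
h_img = rng.normal(size=(4, 8))  # stand-in for projected image features (hypothetical data)
h_txt = rng.normal(size=(4, 8))  # stand-in for projected text features (hypothetical data)

x = l2_normalize(h_img)
t = l2_normalize(h_txt)

# Dot products of unit vectors ...
dot = x @ t.T
# ... match cosine similarity computed from the raw vectors.
cos = (h_img @ h_txt.T) / (
    np.linalg.norm(h_img, axis=1)[:, None] * np.linalg.norm(h_txt, axis=1)[None, :]
)
```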

4  Contrastive Training Objective

The encoders and projection layers are trained jointly on paired observations $\{(I_i, T_i)\}_{i=1}^N$ using a symmetric contrastive objective. For normalized embeddings $\mathbf{x}_i$ and $\mathbf{t}_j$, define

$$s_{i,j} = \frac{\mathbf{x}_i^\top \mathbf{t}_j}{\tau},$$

where $\tau > 0$ is a learnable temperature parameter. The image-to-text and text-to-image losses are

$$\mathcal{L}_{I \to T} = -\frac{1}{N}\sum_{i=1}^{N} \log \frac{\exp(s_{i,i})}{\sum_{j=1}^{N} \exp(s_{i,j})},$$

$$\mathcal{L}_{T \to I} = -\frac{1}{N}\sum_{i=1}^{N} \log \frac{\exp(s_{i,i})}{\sum_{j=1}^{N} \exp(s_{j,i})},$$

and the total objective is

$$\mathcal{L} = \frac{1}{2}\left(\mathcal{L}_{I \to T} + \mathcal{L}_{T \to I}\right).$$

Minimizing this loss aligns matched image-text pairs while separating mismatched pairs, making cosine similarity a useful proxy for semantic relatedness.
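To make the indexing in the two losses concrete, the sketch below evaluates the objective in NumPy for a batch of already-normalized embeddings. It is a forward pass only: the temperature is passed in as a fixed number rather than learned, there are no gradients, and the function names are ours.

```python
import numpy as np

def log_softmax(z: np.ndarray, axis: int) -> np.ndarray:
    """Numerically stable log-softmax along the given axis."""
    z = z - z.max(axis=axis, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=axis, keepdims=True))

def symmetric_contrastive_loss(x: np.ndarray, t: np.ndarray, tau: float) -> float:
    """x, t: (N, d) unit-norm image and text embeddings; row i of each is a matched pair."""
    s = (x @ t.T) / tau  # s[i, j] = x_i^T t_j / tau
    # Image-to-text: for image i, the denominator sums over texts j (along each row).
    loss_i2t = -np.mean(np.diag(log_softmax(s, axis=1)))
    # Text-to-image: for text i, the denominator sums over images j (along each column).
    loss_t2i = -np.mean(np.diag(log_softmax(s, axis=0)))
    return float(0.5 * (loss_i2t + loss_t2i))
```

Two sanity cases follow directly from the formulas: if every embedding in the batch is identical, both softmaxes are uniform and the loss equals $\log N$; if matched pairs are far more similar than mismatched ones, the loss approaches zero.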

5  Product Embedding Construction

Each product yields an image embedding $\mathbf{x}_i$ and a text embedding $\mathbf{t}_i$. They are combined into a single product representation by convex combination followed by re-normalization:

$$\mathbf{e}_i = \text{norm}\!\left(\alpha \mathbf{x}_i + (1-\alpha)\mathbf{t}_i\right),$$

where $\alpha \in [0,1]$ controls the balance between visual and textual signal. The resulting embeddings $\{\mathbf{e}_i\}_{i=1}^M$ are stored in a vector index supporting cosine-similarity search.
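The fusion step is short enough to show directly. This is a minimal sketch with a hypothetical helper name `fuse`; the default `alpha=0.5` is only a placeholder, since the notes do not fix a value.

```python
import numpy as np

def fuse(x: np.ndarray, t: np.ndarray, alpha: float = 0.5) -> np.ndarray:
    """Convex combination of image and text embeddings, re-normalized to unit length.

    x, t: (M, d) unit-norm embeddings; alpha in [0, 1] weights the image side.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    mixed = alpha * x + (1.0 - alpha) * t
    norms = np.linalg.norm(mixed, axis=-1, keepdims=True)
    # The guard covers the degenerate case where x and t cancel exactly.
    return mixed / np.maximum(norms, 1e-12)
```

Re-normalizing matters because a convex combination of two unit vectors generally has norm below one, which would otherwise bias dot-product scores toward products whose image and text embeddings agree.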

6  Geometry of the Embedding Space

Two geometric properties are central to the downstream system. First, because all embeddings lie on $\mathcal{S}^{d-1}$, cosine similarity is directly comparable across the space. Second, averaging a set of nearby embeddings and re-normalizing yields a meaningful centroid, which allows clusters and user selections to be summarized by a single query vector.

7  Offline Clustering via Stochastic Probing

In a cold-start setting, the catalogue is often still small enough to fit comfortably in memory, so direct clustering of the full embedding set is both feasible and preferable whenever bulk export is available. The stochastic probing procedure described here is therefore only an approximation for settings where the vector database does not expose the full embedding set.

We sample random unit vectors

$$\mathbf{v}_q \sim \mathcal{N}(\mathbf{0}, I_d), \qquad \mathbf{r}_q = \text{norm}(\mathbf{v}_q),$$

and query the index with each probe to obtain a top-$K$ neighbourhood $\mathcal{N}_q$. The union of these neighbourhoods forms a sampled subset

$$\mathcal{P}_{\text{sample}} = \bigcup_{q=1}^{Q} \mathcal{N}_q.$$

This approximation has a clear weakness: in high dimensions, each probe covers only a very small region of $\mathcal{S}^{d-1}$, so sparse or isolated product groups are more likely to be missed entirely. The number of probes needed for reliable coverage depends on the unknown data distribution, so it cannot be determined analytically in any general way.
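The probing procedure can be sketched as follows. A real deployment would call the vector database's top-$K$ query; here a brute-force matrix product stands in for the index, and the function name `probe_sample` and its parameters are assumptions made for illustration.

```python
import numpy as np

def probe_sample(E: np.ndarray, num_probes: int, top_k: int, seed: int = 0) -> np.ndarray:
    """Return the sorted, de-duplicated product indices hit by random probes.

    E: (M, d) unit-norm product embeddings, used here as a brute-force
       stand-in for a vector index that only exposes top-K search.
    """
    rng = np.random.default_rng(seed)
    M, d = E.shape
    v = rng.normal(size=(num_probes, d))               # v_q ~ N(0, I_d)
    r = v / np.linalg.norm(v, axis=1, keepdims=True)   # r_q = norm(v_q)
    sims = r @ E.T                                     # cosine scores, shape (Q, M)
    k = min(top_k, M)
    # Indices of the k highest-scoring products per probe (order within the k is arbitrary).
    top = np.argpartition(-sims, k - 1, axis=1)[:, :k]
    return np.unique(top)                              # union over all probes
```

Comparing `len(probe_sample(...))` against the catalogue size for increasing `num_probes` gives a rough empirical feel for coverage, which is about the best available given that the required probe count cannot be derived in general.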

$k$-means with $k$-means++ initialization is then run on $\{\mathbf{e}_i : i \in \mathcal{P}_{\text{sample}}\}$ to obtain centroids $\{\boldsymbol{\mu}_c\}_{c=1}^k$, which are re-normalized to remain on $\mathcal{S}^{d-1}$. For each centroid, the nearest real product is retrieved,

$$r_c = \arg\min_{i \in [M]} \left(1 - \mathbf{e}_i^\top \boldsymbol{\mu}_c\right),$$

producing one representative item per semantic cluster.
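For readers who want to try this without a clustering library, below is a compact NumPy sketch of the clustering and representative-selection steps. It is a plain Lloyd's-algorithm $k$-means with $k$-means++ seeding written for illustration; the function names, iteration cap, and seed are our assumptions, and a library implementation would be the more sensible choice in practice.

```python
import numpy as np

def kmeans_pp_init(X: np.ndarray, k: int, rng: np.random.Generator) -> np.ndarray:
    """k-means++ seeding: later centres are drawn with probability proportional to squared distance."""
    n = X.shape[0]
    centers = [X[rng.integers(n)]]
    for _ in range(1, k):
        diffs = X[:, None, :] - np.array(centers)[None, :, :]
        d2 = (diffs ** 2).sum(-1).min(axis=1)  # squared distance to nearest chosen centre
        total = d2.sum()
        probs = d2 / total if total > 0 else np.full(n, 1.0 / n)
        centers.append(X[rng.choice(n, p=probs)])
    return np.array(centers)

def kmeans_on_sphere(X: np.ndarray, k: int, iters: int = 50, seed: int = 0) -> np.ndarray:
    """Run k-means on unit-norm rows of X, then re-normalize the centroids."""
    if k > X.shape[0]:
        raise ValueError("k cannot exceed the number of points")
    rng = np.random.default_rng(seed)
    mu = kmeans_pp_init(X, k, rng)
    for _ in range(iters):
        d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(-1)
        labels = d2.argmin(axis=1)
        new_mu = mu.copy()
        for c in range(k):
            members = X[labels == c]
            if len(members):  # keep the old centroid if a cluster empties out
                new_mu[c] = members.mean(axis=0)
        if np.allclose(new_mu, mu):
            break
        mu = new_mu
    norms = np.linalg.norm(mu, axis=1, keepdims=True)
    return mu / np.maximum(norms, 1e-12)

def representatives(E: np.ndarray, mu: np.ndarray) -> np.ndarray:
    """Index of the real product with the smallest cosine distance to each centroid."""
    return np.argmin(1.0 - E @ mu.T, axis=0)
```

Note that `representatives` searches the full embedding matrix `E` here; with a real vector index this would be a single top-1 query per centroid.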

8  Two-Stage Preference Elicitation

User modeling begins without historical interactions, so we estimate preferences through two lightweight selection stages.

In Stage 1, the user selects from the $k$ representative products. Let $\mathcal{A}_1 \subseteq \{1, \ldots, k\}$ denote the chosen cluster indices. This identifies the semantic regions of the embedding space that match the user's broad taste.

In Stage 2, for each selected cluster $c \in \mathcal{A}_1$, we query with centroid $\boldsymbol{\mu}_c$ and retrieve a small refinement set $\mathcal{R}_c$. The user then selects from $\bigcup_{c \in \mathcal{A}_1} \mathcal{R}_c$, producing a second-stage set $\mathcal{A}_2$. This stage sharpens preferences within the broad regions identified in Stage 1.
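The Stage 2 retrieval can be sketched as one top-$n$ query per chosen cluster. The function name `refinement_sets`, the default set size, and the brute-force scoring are all illustrative assumptions; the notes do not specify how large each $\mathcal{R}_c$ should be.

```python
import numpy as np

def refinement_sets(E: np.ndarray, mu: np.ndarray, chosen_clusters, size: int = 5) -> dict:
    """For each Stage 1 cluster, return the indices of the `size` products closest to its centroid.

    E: (M, d) unit-norm product embeddings; mu: (k, d) unit-norm centroids.
    """
    sets = {}
    for c in chosen_clusters:
        sims = E @ mu[c]                  # cosine similarity to centroid c
        order = np.argsort(-sims)[:size]  # highest similarity first
        sets[c] = order.tolist()
    return sets
```

The union of the returned lists is what the user picks from in Stage 2; a product near two chosen centroids may appear in both lists, so an interface would likely de-duplicate before display.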

9  Centroid Refinement

For each cluster $c$, let $\mathcal{B}_c \subseteq \mathcal{A}_2$ denote the second-stage selections from that cluster. The refined centroid is defined by

$$\boldsymbol{\mu}'_c = \begin{cases} \text{norm}\!\left(\frac{1}{|\mathcal{B}_c|}\sum_{i \in \mathcal{B}_c} \mathbf{e}_i\right), & \text{if } \mathcal{B}_c \neq \emptyset, \\ \boldsymbol{\mu}_c, & \text{otherwise.} \end{cases}$$

This moves the cluster query from a generic representative toward the subregion the user actually prefers.
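In code, the refinement rule is a mean followed by re-normalization, with a fallback to the original centroid. A minimal sketch (the helper name `refine_centroid` is ours):

```python
import numpy as np

def refine_centroid(mu_c: np.ndarray, E: np.ndarray, selected) -> np.ndarray:
    """Return the refined centroid for one cluster.

    mu_c: (d,) original centroid; E: (M, d) unit-norm product embeddings;
    selected: product indices the user picked from this cluster in Stage 2 (B_c).
    """
    selected = list(selected)
    if not selected:
        return mu_c  # no second-stage picks: keep the original centroid
    mean = E[selected].mean(axis=0)
    return mean / max(np.linalg.norm(mean), 1e-12)
```

With a single selection the refined centroid is exactly that product's embedding, so one confident pick is enough to redirect the query for that cluster.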

10  Proportional Slot Allocation

Let $N_{\text{feed}}$ denote the desired number of products shown on a page or retrieval step. These slots are allocated across preferred clusters in proportion to user votes from both elicitation stages. Let

$$v_c = \mathbf{1}\{c \in \mathcal{A}_1\} + |\mathcal{B}_c|, \qquad q_c = \frac{v_c}{\sum_{c'} v_{c'}} N_{\text{feed}}.$$

The value of $N_{\text{feed}}$ is application-dependent and may be chosen by interface design, latency constraints, or user preference. The fractional quotas $q_c$ are converted into integer allocations using the largest remainder method, yielding counts $n_c$ that sum exactly to $N_{\text{feed}}$.
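A small worked example of the largest remainder step, as a sketch. The function name `allocate_slots` and the cluster labels are made up for illustration; the arithmetic uses integers throughout so the floors and remainders are exact.

```python
def allocate_slots(votes: dict, n_feed: int) -> dict:
    """Largest remainder allocation of n_feed slots in proportion to integer vote counts."""
    total = sum(votes.values())
    if total <= 0:
        raise ValueError("at least one cluster needs a positive vote count")
    # Integer part of each quota q_c = v_c * n_feed / total.
    counts = {c: (v * n_feed) // total for c, v in votes.items()}
    # Fractional parts, kept as exact integer remainders over the common denominator.
    remainders = {c: (v * n_feed) % total for c, v in votes.items()}
    leftover = n_feed - sum(counts.values())
    # Hand out the remaining slots to the largest remainders (ties keep input order).
    for c in sorted(votes, key=lambda c: remainders[c], reverse=True)[:leftover]:
        counts[c] += 1
    return counts

# Hypothetical votes: v_c = 1 (picked in Stage 1) + number of Stage 2 picks.
print(allocate_slots({"A": 3, "B": 2, "C": 1}, 10))  # {'A': 5, 'B': 3, 'C': 2}
```

Here the quotas are 5, 3.33, and 1.67; flooring gives 5 + 3 + 1 = 9 slots, and the one leftover slot goes to cluster C, which has the largest fractional part.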

11  Evaluation Framework

This comparison is based on a small set of 5 people and is intended only as a practical sanity check between three systems: a random feed, an unnamed cold-start recommender baseline, and the proposed dual-encoder pipeline. It is not presented as a formal experimental study, but the preference split is still useful as an early directional product signal.

| Feed | Share of Subjects |
|---|---|
| Random feed | 0% |
| Unnamed cold-start baseline | 20% |
| Proposed dual-encoder pipeline | 80% |

Table 1: Informal preference split across three feeds using feedback from a set of 5 people.

12  Closing Notes

This approach is only as good as the metadata behind it. If the images or text are weak, noisy, or biased, the recommendations will reflect that. It is also a fairly simple retrieval pipeline, so in settings where precise ranking matters most, more expensive models would likely do better. Still, it gave us a simple way to make cold-start recommendations before any useful interaction graph existed.