
Word embeddings: from one-hot to dense representations

In this series (25 parts)
  1. Neural networks: the basic building block
  2. Forward pass and backpropagation
  3. Training neural networks: a practical guide
  4. Convolutional neural networks
  5. Recurrent neural networks and LSTMs
  6. Attention mechanism and transformers
  7. Word embeddings: from one-hot to dense representations
  8. Transfer learning and fine-tuning
  9. Optimization techniques for deep networks
  10. Regularization for deep networks
  11. Encoder-decoder architectures
  12. Generative models: an overview
  13. Restricted Boltzmann Machines
  14. Deep Belief Networks
  15. Variational Autoencoders
  16. Generative Adversarial Networks: training and theory
  17. DCGAN, conditional GANs, and GAN variants
  18. Representation learning and self-supervised learning
  19. Domain adaptation and fine-tuning strategies
  20. Distributed representations and latent spaces
  21. AutoML and hyperparameter optimization
  22. Neural architecture search
  23. Network compression and efficient inference
  24. Graph neural networks
  25. Practical deep learning: debugging and tuning

Prerequisites: Introduction to neural networks and Norms and distances.

Words are not numbers. Before a neural network can process language, you need a way to turn words into vectors. The naive approach, one-hot encoding, treats every word as equally different from every other word. That is almost never what you want. Dense word embeddings fix this by placing similar words close together in a learned vector space.

Why embeddings matter

Dense embeddings capture meaning through proximity. Words with similar roles end up near each other. Even more striking, the vector arithmetic encodes relationships:

| Analogy | Operation | Result |
| --- | --- | --- |
| king is to queen as man is to woman | king - man + woman | queen |
| Paris is to France as Rome is to Italy | Paris - France + Italy | Rome |
| walking is to walked as swimming is to swam | walking - walked + swimming | swam |

These are not hand-coded rules. The relationships emerge from training on raw text.

From sparse one-hot vectors to dense embeddings

graph LR
  A["One-hot vector
Sparse, dim 50000
No similarity info"] --> B["Embedding matrix
Learned from data"]
  B --> C["Dense vector
dim 300
Similar words nearby"]

One-hot encoding gives every word a unique axis. No two words share any structure. Dense embeddings compress words into a small space where distance reflects meaning.

Here is what a small embedding space might look like. Each word gets a vector where dimensions capture aspects of meaning:

| Word | Dim 1 (royalty) | Dim 2 (gender) | Dim 3 (animate) |
| --- | --- | --- | --- |
| king | 0.9 | 0.1 | 0.8 |
| queen | 0.9 | 0.9 | 0.8 |
| man | 0.1 | 0.1 | 0.9 |
| car | 0.0 | 0.0 | 0.0 |
| truck | 0.0 | 0.0 | 0.0 |

“King” and “queen” share high royalty and animate values but differ on gender. “Car” and “truck” cluster near zero on all three. Real embeddings have 50 to 300 dimensions, and the dimensions do not have clean labels, but the principle holds.

Now let’s see exactly why one-hot encoding fails and how training produces these dense vectors.

The problem with one-hot encoding

Suppose your vocabulary has V = 5 words: {cat, dog, fish, car, truck}. A one-hot encoding assigns each word a vector of length V with a single 1 and the rest 0s:

\text{cat} = \begin{bmatrix} 1 \\ 0 \\ 0 \\ 0 \\ 0 \end{bmatrix}, \quad \text{dog} = \begin{bmatrix} 0 \\ 1 \\ 0 \\ 0 \\ 0 \end{bmatrix}, \quad \text{car} = \begin{bmatrix} 0 \\ 0 \\ 0 \\ 1 \\ 0 \end{bmatrix}

Three problems stand out:

  1. No similarity information. The dot product between any two distinct one-hot vectors is zero. “Cat” is as far from “dog” as it is from “car.” The encoding tells the model nothing about word relationships.

  2. Huge, sparse vectors. Real vocabularies have 50,000 to 500,000 words. Each vector has exactly one non-zero entry. That is a massive waste of memory.

  3. No generalization. If the model learns something about “cat,” that knowledge does not transfer to “dog” at all. The two vectors share no structure.

We need a representation where similar words have similar vectors. That is what dense embeddings provide.
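You can verify the first problem directly. A short NumPy sketch with a toy 5-word vocabulary (the word-to-index mapping here is arbitrary):

```python
import numpy as np

# One-hot vectors for the toy 5-word vocabulary {cat, dog, fish, car, truck}
V = 5
cat, dog, car = np.eye(V)[0], np.eye(V)[1], np.eye(V)[3]

# The dot product between any two distinct one-hot vectors is zero:
# "cat" is exactly as dissimilar from "dog" as it is from "car".
print(cat @ dog)  # 0.0
print(cat @ car)  # 0.0
```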

Dense embeddings: the core idea

Word embeddings projected to 2D, showing semantic clusters. Male-female pairs are separated along a consistent direction.

Instead of a sparse vector of length V, map each word to a dense vector of much smaller dimension d (typically 50 to 300). You store these vectors in an embedding matrix W \in \mathbb{R}^{V \times d}, where row i is the embedding for the i-th word.

Looking up a word embedding is a matrix multiply in disguise. Multiplying the one-hot vector x \in \mathbb{R}^V by W just selects the corresponding row:

e = x^T W

The key property: these vectors are learned from data. Words that appear in similar contexts end up with similar embeddings. “Cat” and “dog” both appear near “pet,” “feed,” and “vet,” so their vectors will be close. “Car” and “truck” cluster together for the same reason. This is the distributional hypothesis: a word is defined by the company it keeps.
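The lookup-as-matrix-multiply identity is easy to check with toy numbers (the matrix here is random, just to show the mechanics):

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 5, 3                    # toy vocabulary size and embedding dimension
W = rng.normal(size=(V, d))    # embedding matrix: one row per word

word_index = 2                 # pretend this is the index of "fish"
x = np.zeros(V)
x[word_index] = 1.0            # one-hot vector for the word

e = x @ W                      # the "matrix multiply in disguise"
assert np.allclose(e, W[word_index])  # identical to just indexing the row
```

In practice, frameworks skip the multiply entirely and implement the embedding layer as a row lookup.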

Word2Vec: skip-gram

Word2Vec, introduced by Mikolov et al. in 2013, learns embeddings by training a shallow neural network on a simple task. The skip-gram variant works like this: given a center word, predict the surrounding context words.

The objective

Take a sentence: “the cat sat on the mat.” With a context window of size 2, the center word “sat” should predict “the,” “cat,” “on,” and “the.” For each (center, context) pair, the model maximizes:

P(w_{\text{context}} \mid w_{\text{center}}) = \frac{\exp(v_{w_{\text{context}}}' \cdot v_{w_{\text{center}}})}{\sum_{w=1}^{V} \exp(v_w' \cdot v_{w_{\text{center}}})}

Here v_w is the input embedding and v_w' is the output embedding for word w. The denominator is a softmax over the entire vocabulary. Training maximizes the log-likelihood over all (center, context) pairs in the corpus.
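A minimal NumPy sketch of this probability, with small random embeddings standing in for trained ones (V, d, and the matrix names are toy values):

```python
import numpy as np

rng = np.random.default_rng(1)
V, d = 6, 4
V_in = rng.normal(scale=0.1, size=(V, d))   # input embeddings v_w
V_out = rng.normal(scale=0.1, size=(V, d))  # output embeddings v_w'

def context_probs(center):
    """P(context word | center word): softmax of v_w' . v_center over all w."""
    scores = V_out @ V_in[center]
    exp = np.exp(scores - scores.max())     # subtract max for stability
    return exp / exp.sum()

p = context_probs(2)
assert np.isclose(p.sum(), 1.0)             # a proper distribution over V words
```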

The architecture

The skip-gram model is a single hidden-layer network. No activation function, no bias. The architecture is deliberately simple because the goal is not classification accuracy; it is learning good embeddings.

flowchart LR
  A["Center word
(one-hot, dim V)"] --> B["W_embed
(V × d)"]
  B --> C["Hidden layer
(dim d)"]
  C --> D["W_context
(d × V)"]
  D --> E["Softmax
(dim V)"]
  E --> F["Context word
probabilities"]

The embedding matrix W_{\text{embed}} transforms the one-hot input into a d-dimensional hidden vector. The context matrix W_{\text{context}} projects back to vocabulary size. After training, you discard W_{\text{context}} and keep W_{\text{embed}} as your word embeddings.

Backpropagation pushes gradients through both matrices. The chain rule is straightforward here because the network has no non-linearities.

Word2Vec: CBOW

Continuous Bag of Words (CBOW) flips the skip-gram task. Given the context words, predict the center word. You take the embeddings of all context words, average them, and use that average to predict the center.

flowchart LR
  A1["Context word 1
(one-hot)"] --> B["W_embed
(V × d)"]
  A2["Context word 2
(one-hot)"] --> B
  A3["Context word 3
(one-hot)"] --> B
  A4["Context word 4
(one-hot)"] --> B
  B --> C["Average
(dim d)"]
  C --> D["W_context
(d × V)"]
  D --> E["Softmax
(dim V)"]
  E --> F["Center word
probability"]

For the center word w_c and context words w_1, w_2, \ldots, w_{2m} (window size m), the objective maximizes:

P(w_c \mid w_1, \ldots, w_{2m}) = \frac{\exp\left(v_{w_c}' \cdot \bar{v}\right)}{\sum_{w=1}^{V} \exp\left(v_w' \cdot \bar{v}\right)}

where \bar{v} = \frac{1}{2m} \sum_{i=1}^{2m} v_{w_i} is the average context embedding.

CBOW tends to train faster and works well for frequent words. Skip-gram handles rare words better because each word appears as a center word in its own training examples.
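The CBOW forward pass in miniature, again with random stand-in embeddings (the indices and sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)
V, d = 8, 4
V_in = rng.normal(scale=0.1, size=(V, d))   # input embeddings
V_out = rng.normal(scale=0.1, size=(V, d))  # output embeddings

context_ids = [1, 3, 5, 6]                  # the 2m context word indices
v_bar = V_in[context_ids].mean(axis=0)      # average context embedding

scores = V_out @ v_bar                      # one score per candidate center word
probs = np.exp(scores - scores.max())
probs /= probs.sum()
predicted = int(np.argmax(probs))           # most probable center word
```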

Negative sampling

Both skip-gram and CBOW have a problem: the softmax denominator sums over the entire vocabulary. For V = 100,000, that means 100,000 exponentials per training example. Far too slow.

Negative sampling replaces the full softmax with a binary classification task. For each real (center, context) pair, sample k random “negative” words that are not the true context. Then train a binary logistic regression to tell real context from noise.

How negative sampling works

graph TD
  A["Training pair:
center word, true context"] --> B["Positive example
cat, sat
Push closer"]
  A --> C["Sample k random negatives"]
  C --> D["Negative 1
cat, elephant"]
  C --> E["Negative 2
cat, bicycle"]
  C --> F["Negative k
cat, quantum"]
  D --> G["Push apart"]
  E --> G
  F --> G

Instead of normalizing over the entire vocabulary, you only update embeddings for the true context word and a handful of negatives. This cuts the cost from O(V) to O(k) per training pair.

The negative sampling objective for a positive pair (w, c) with negative samples n_1, \ldots, n_k is:

\mathcal{L} = -\log \sigma(v_c' \cdot v_w) - \sum_{i=1}^{k} \log \sigma(-v_{n_i}' \cdot v_w)

where \sigma(x) = \frac{1}{1 + e^{-x}} is the sigmoid function. The first term pushes the context embedding close to the center embedding. The second term pushes negative samples away.

Negative words are sampled from a noise distribution, typically the unigram distribution raised to the 3/4 power:

P_{\text{noise}}(w) \propto f(w)^{3/4}

where f(w) is the word frequency. The 3/4 exponent upweights rare words relative to pure frequency sampling. In practice, k = 5 to 20 negatives works well, making training orders of magnitude faster than full softmax.
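A quick sketch of the noise distribution, using made-up unigram counts:

```python
import numpy as np

# Toy unigram counts for a 5-word vocabulary, most to least frequent.
freq = np.array([1000.0, 500.0, 100.0, 10.0, 1.0])

p_unigram = freq / freq.sum()
p_noise = freq**0.75 / (freq**0.75).sum()   # unigram distribution to the 3/4 power

# The 3/4 exponent flattens the distribution: rare words are upweighted,
# frequent words are downweighted relative to raw frequency.
assert p_noise[-1] > p_unigram[-1]
assert p_noise[0] < p_unigram[0]

rng = np.random.default_rng(3)
k = 5
negatives = rng.choice(len(freq), size=k, p=p_noise)  # k negative word indices
```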

GloVe: global vectors from co-occurrence

GloVe (Global Vectors for Word Representation) takes a different approach. Instead of predicting context from a sliding window, it builds a global word-word co-occurrence matrix X, where X_{ij} counts how often word j appears in the context of word i across the entire corpus.

The GloVe objective learns embeddings such that their dot product approximates the log co-occurrence:

J = \sum_{i,j=1}^{V} f(X_{ij}) \left( w_i^T \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2

Here w_i and \tilde{w}_j are word and context vectors, b_i and \tilde{b}_j are bias terms, and f(X_{ij}) is a weighting function that caps the influence of very frequent pairs:

f(x) = \begin{cases} (x / x_{\max})^\alpha & \text{if } x < x_{\max} \\ 1 & \text{otherwise} \end{cases}

Typically x_{\max} = 100 and \alpha = 0.75. This prevents common word pairs like “the, of” from dominating the loss.
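The weighting function itself is a one-liner; here is a sketch with a few example counts:

```python
import numpy as np

def glove_weight(x, x_max=100.0, alpha=0.75):
    """GloVe weighting f(X_ij): rises with the co-occurrence count, capped at 1."""
    return np.where(x < x_max, (x / x_max) ** alpha, 1.0)

counts = np.array([1.0, 10.0, 100.0, 5000.0])
w = glove_weight(counts)
# Rare pairs get small weight; anything at or above x_max is clipped to 1.
assert w[2] == 1.0 and w[3] == 1.0
assert w[0] < w[1] < 1.0
```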

GloVe combines the strengths of count-based methods (using global statistics) with the strengths of predictive methods (learning low-dimensional representations). You optimize J with gradient descent, often using AdaGrad or Adam.

Word2Vec vs GloVe

graph TD
  A["Word2Vec"] --> B["Learns from local context
Sliding window over text"]
  A --> C["Predictive objective:
predict neighbors"]
  D["GloVe"] --> E["Learns from global statistics
Full co-occurrence matrix"]
  D --> F["Reconstruction objective:
approximate log counts"]
  B --> G["Scales to huge corpora
Online updates"]
  E --> H["Leverages corpus-wide
patterns in one pass"]

Word2Vec processes one window at a time, making it easy to train on streaming text. GloVe pre-computes a co-occurrence matrix, then optimizes over all pairs. In practice, both produce similar quality embeddings. GloVe tends to do better on analogy tasks; Word2Vec can be faster to train on very large corpora.

FastText: subword information

Word2Vec and GloVe both assign one vector per word. If a word is not in the vocabulary (out-of-vocabulary, or OOV), you are stuck. FastText, developed by Facebook Research, solves this with subword embeddings.

FastText represents each word as a bag of character n-grams. For example, with n = 3 to 6, the word “running” produces n-grams like: <ru, run, unn, nni, nin, ing, ng>, <run, runn, unni, nnin, ning, ing>, and so on (angle brackets mark word boundaries).

The embedding for a word is the sum of its n-gram embeddings:

v_{\text{running}} = \sum_{g \in G(\text{running})} z_g

where G(w) is the set of n-grams for word w and z_g is the learned vector for n-gram g.

This gives FastText two advantages:

  1. OOV handling. A word never seen during training still has n-grams that overlap with known words. “Unfriendliness” shares n-grams with “unfriendly,” “friend,” and “friendliness.”

  2. Morphological patterns. Words with shared roots or suffixes naturally get similar embeddings. “Running,” “runner,” and “ran” all share the “run” n-gram.

The training objective is the same skip-gram with negative sampling, just with the modified embedding lookup.
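A sketch of the n-gram extraction (a simplified version; real FastText also hashes n-grams into a fixed number of buckets to bound memory):

```python
def char_ngrams(word, n_min=3, n_max=6):
    """All character n-grams of `word`, with < and > marking word boundaries."""
    w = f"<{word}>"
    return {w[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(w) - n + 1)}

running = char_ngrams("running")
assert "run" in running and "ing" in running

# OOV handling: a never-seen word still shares subword units with known words.
overlap = char_ngrams("unfriendliness") & char_ngrams("unfriendly")
assert len(overlap) > 0
```

The word vector is then the sum of the learned vectors for these n-grams, so related forms share most of their representation.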

Comparison table

| Method | Training objective | Handles OOV? | Captures morphology? | Training data | Typical dimension |
| --- | --- | --- | --- | --- | --- |
| Word2Vec (skip-gram) | Predict context from center word | No | No | Local context windows | 100 to 300 |
| Word2Vec (CBOW) | Predict center from context words | No | No | Local context windows | 100 to 300 |
| GloVe | Approximate log co-occurrence | No | No | Global co-occurrence matrix | 50 to 300 |
| FastText | Skip-gram on subword n-grams | Yes | Yes | Local context windows + n-grams | 100 to 300 |

Worked examples

Example 1: cosine similarity

Cosine similarity measures the angle between two vectors, ignoring magnitude. It is defined as:

\cos(v_a, v_b) = \frac{v_a \cdot v_b}{\|v_a\| \, \|v_b\|}

where \| \cdot \| is the L2 norm.

Let v_{\text{king}} = [0.8, 0.3, 0.5] and v_{\text{queen}} = [0.7, 0.4, 0.6].

Step 1: dot product.

v_{\text{king}} \cdot v_{\text{queen}} = 0.8 \times 0.7 + 0.3 \times 0.4 + 0.5 \times 0.6 = 0.56 + 0.12 + 0.30 = 0.98

Step 2: norms.

\|v_{\text{king}}\| = \sqrt{0.8^2 + 0.3^2 + 0.5^2} = \sqrt{0.64 + 0.09 + 0.25} = \sqrt{0.98} \approx 0.990

\|v_{\text{queen}}\| = \sqrt{0.7^2 + 0.4^2 + 0.6^2} = \sqrt{0.49 + 0.16 + 0.36} = \sqrt{1.01} \approx 1.005

Step 3: cosine similarity.

\cos(v_{\text{king}}, v_{\text{queen}}) = \frac{0.98}{0.990 \times 1.005} = \frac{0.98}{0.995} \approx 0.985

A value close to 1 means the vectors point in nearly the same direction. “King” and “queen” are very similar in this embedding space.
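The same computation in NumPy, confirming the hand-worked result:

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity: dot product normalized by both L2 norms."""
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

v_king = np.array([0.8, 0.3, 0.5])
v_queen = np.array([0.7, 0.4, 0.6])
print(round(cosine(v_king, v_queen), 3))  # 0.985
```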

Example 2: word analogy

The classic analogy test: king - man + woman ≈ queen. We compute a result vector and find the nearest neighbor among candidates.

Vectors (4-dimensional for simplicity):

v_{\text{king}} = [0.9, 0.2, 0.8, 0.1], \quad v_{\text{man}} = [0.8, 0.1, 0.2, 0.1], \quad v_{\text{woman}} = [0.7, 0.3, 0.2, 0.9]

Step 1: compute the result vector.

v_{\text{result}} = v_{\text{king}} - v_{\text{man}} + v_{\text{woman}}

= [0.9 - 0.8 + 0.7, \; 0.2 - 0.1 + 0.3, \; 0.8 - 0.2 + 0.2, \; 0.1 - 0.1 + 0.9]

= [0.8, \; 0.4, \; 0.8, \; 0.9]

Step 2: candidate vectors.

v_{\text{queen}} = [0.8, 0.4, 0.8, 0.9], \quad v_{\text{princess}} = [0.6, 0.5, 0.7, 0.8], \quad v_{\text{prince}} = [0.9, 0.1, 0.7, 0.2]

Step 3: cosine similarity with each candidate.

First, \|v_{\text{result}}\| = \sqrt{0.64 + 0.16 + 0.64 + 0.81} = \sqrt{2.25} = 1.5.

Queen:

v_{\text{result}} \cdot v_{\text{queen}} = 0.64 + 0.16 + 0.64 + 0.81 = 2.25

\|v_{\text{queen}}\| = \sqrt{0.64 + 0.16 + 0.64 + 0.81} = \sqrt{2.25} = 1.5

\cos = \frac{2.25}{1.5 \times 1.5} = \frac{2.25}{2.25} = 1.000

Princess:

v_{\text{result}} \cdot v_{\text{princess}} = 0.48 + 0.20 + 0.56 + 0.72 = 1.96

\|v_{\text{princess}}\| = \sqrt{0.36 + 0.25 + 0.49 + 0.64} = \sqrt{1.74} \approx 1.319

\cos = \frac{1.96}{1.5 \times 1.319} = \frac{1.96}{1.979} \approx 0.990

Prince:

v_{\text{result}} \cdot v_{\text{prince}} = 0.72 + 0.04 + 0.56 + 0.18 = 1.50

\|v_{\text{prince}}\| = \sqrt{0.81 + 0.01 + 0.49 + 0.04} = \sqrt{1.35} \approx 1.162

\cos = \frac{1.50}{1.5 \times 1.162} = \frac{1.50}{1.743} \approx 0.861

Result: Queen (1.000) > Princess (0.990) > Prince (0.861). The analogy king - man + woman produces a vector identical to queen, confirming the analogy holds perfectly in this example.
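The same analogy check in NumPy, using the vectors above:

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity: dot product normalized by both L2 norms."""
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

vecs = {
    "king":     np.array([0.9, 0.2, 0.8, 0.1]),
    "man":      np.array([0.8, 0.1, 0.2, 0.1]),
    "woman":    np.array([0.7, 0.3, 0.2, 0.9]),
    "queen":    np.array([0.8, 0.4, 0.8, 0.9]),
    "princess": np.array([0.6, 0.5, 0.7, 0.8]),
    "prince":   np.array([0.9, 0.1, 0.7, 0.2]),
}

result = vecs["king"] - vecs["man"] + vecs["woman"]
scores = {w: cosine(result, vecs[w]) for w in ("queen", "princess", "prince")}
best = max(scores, key=scores.get)
print(best)  # queen
```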

Example 3: negative sampling loss

Compute the negative sampling loss for a skip-gram training example with 2 negative samples.

Let the center word u have embedding v_u = [0.5, 0.3, -0.2]. The positive context word c has embedding v_c = [0.4, 0.1, 0.3]. Two negative samples have embeddings v_{n_1} = [-0.3, 0.2, 0.1] and v_{n_2} = [0.1, -0.4, 0.2].

The loss is:

\mathcal{L} = -\log \sigma(v_c \cdot v_u) - \log \sigma(-v_{n_1} \cdot v_u) - \log \sigma(-v_{n_2} \cdot v_u)

Step 1: dot products.

v_c \cdot v_u = 0.4 \times 0.5 + 0.1 \times 0.3 + 0.3 \times (-0.2) = 0.20 + 0.03 - 0.06 = 0.17

v_{n_1} \cdot v_u = (-0.3) \times 0.5 + 0.2 \times 0.3 + 0.1 \times (-0.2) = -0.15 + 0.06 - 0.02 = -0.11

v_{n_2} \cdot v_u = 0.1 \times 0.5 + (-0.4) \times 0.3 + 0.2 \times (-0.2) = 0.05 - 0.12 - 0.04 = -0.11

Step 2: apply sigmoid.

Recall \sigma(x) = \frac{1}{1 + e^{-x}}.

\sigma(0.17) = \frac{1}{1 + e^{-0.17}} = \frac{1}{1 + 0.844} = \frac{1}{1.844} \approx 0.5424

\sigma(-(-0.11)) = \sigma(0.11) = \frac{1}{1 + e^{-0.11}} = \frac{1}{1 + 0.896} = \frac{1}{1.896} \approx 0.5275

\sigma(-(-0.11)) = \sigma(0.11) \approx 0.5275

Step 3: compute log terms.

-\log(0.5424) \approx 0.6118

-\log(0.5275) \approx 0.6397

-\log(0.5275) \approx 0.6397

Step 4: total loss.

\mathcal{L} = 0.6118 + 0.6397 + 0.6397 = 1.8912

This loss is relatively high because the dot product between the center and positive context word is small (0.17), meaning their embeddings are not yet well aligned. Training with SGD will adjust the embeddings to increase v_c \cdot v_u and decrease v_{n_1} \cdot v_u and v_{n_2} \cdot v_u, which lowers the loss.
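The same loss in NumPy, computed at full floating-point precision (tiny differences from hand-rounded intermediates are expected):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

v_u  = np.array([0.5, 0.3, -0.2])   # center word embedding
v_c  = np.array([0.4, 0.1, 0.3])    # positive context
v_n1 = np.array([-0.3, 0.2, 0.1])   # negative sample 1
v_n2 = np.array([0.1, -0.4, 0.2])   # negative sample 2

# Negative sampling loss: pull the positive pair together, push negatives apart.
loss = (-np.log(sigmoid(v_c @ v_u))
        - np.log(sigmoid(-(v_n1 @ v_u)))
        - np.log(sigmoid(-(v_n2 @ v_u))))
print(round(loss, 3))  # 1.891
```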

Evaluation

How do you know if your embeddings are any good? Two approaches.

Intrinsic evaluation

Test the embeddings directly on word-level tasks.

Word analogy. Given “a is to b as c is to __,” find the word d that maximizes \cos(v_b - v_a + v_c, v_d). Standard benchmarks include the Google analogy dataset (syntactic: “run” to “running” as “swim” to “swimming”; semantic: “Paris” to “France” as “Berlin” to “Germany”).

Word similarity. Compare the cosine similarity ranking of word pairs against human judgments. Datasets like WordSim-353 and SimLex-999 provide human-rated similarity scores. You compute the Spearman rank correlation between your model’s similarities and the human scores.

Extrinsic evaluation

Use the embeddings as input features in a downstream task and measure task performance. Common tasks include named entity recognition (NER), sentiment analysis, and text classification. Better embeddings generally lead to better downstream performance.

Extrinsic evaluation is more expensive but more meaningful. Embeddings that score well on analogy tasks do not always produce the best results on real applications.

Limitations of static embeddings

All the methods above produce a single, fixed vector per word. This is a serious limitation because many words have multiple meanings.

Consider “bank”:

  • “I walked along the river bank.” (riverbank)
  • “I deposited money at the bank.” (financial institution)

A static embedding gives “bank” one vector that blends both meanings. The model cannot distinguish the two senses from context.

This is the polysemy problem. Static embeddings also struggle with less obvious cases. “Apple” (fruit vs. company) and “play” (theater vs. game) get single vectors that average across all their uses.

Contextual embeddings solve this. Models like ELMo and BERT produce different vectors for the same word depending on the surrounding sentence. The attention mechanism in transformers is central to this capability. These models also benefit from transfer learning: you train once on a large corpus and fine-tune on your specific task.

Still, static embeddings remain useful. They are fast to train, fast to look up, and work well for many tasks. Dimensionality reduction techniques like PCA can visualize them in 2D or 3D, making them valuable for exploratory analysis. For tasks where context matters less (information retrieval, simple classification), Word2Vec and GloVe are solid choices.

What comes next

Static embeddings give every word a fixed vector regardless of context. The next article on transfer learning covers how pretrained representations, including contextual embeddings, can be adapted to new tasks with minimal data. That shift from training embeddings from scratch to fine-tuning pretrained models is one of the biggest practical advances in NLP.
