
Word embeddings: from one-hot to dense representations

In this series (25 parts)
  1. Neural networks: the basic building block
  2. Forward pass and backpropagation
  3. Training neural networks: a practical guide
  4. Convolutional neural networks
  5. Recurrent neural networks and LSTMs
  6. Attention mechanism and transformers
  7. Word embeddings: from one-hot to dense representations
  8. Transfer learning and fine-tuning
  9. Optimization techniques for deep networks
  10. Regularization for deep networks
  11. Encoder-decoder architectures
  12. Generative models: an overview
  13. Restricted Boltzmann Machines
  14. Deep Belief Networks
  15. Variational Autoencoders
  16. Generative Adversarial Networks: training and theory
  17. DCGAN, conditional GANs, and GAN variants
  18. Representation learning and self-supervised learning
  19. Domain adaptation and fine-tuning strategies
  20. Distributed representations and latent spaces
  21. AutoML and hyperparameter optimization
  22. Neural architecture search
  23. Network compression and efficient inference
  24. Graph neural networks
  25. Practical deep learning: debugging and tuning

Prerequisites: Introduction to neural networks and Norms and distances.

Words are not numbers. Before a neural network can process language, you need a way to turn words into vectors. The naive approach, one-hot encoding, treats every word as equally different from every other word. That is almost never what you want. Dense word embeddings fix this by placing similar words close together in a learned vector space.

Why embeddings matter

Dense embeddings capture meaning through proximity. Words with similar roles end up near each other. Even more striking, the vector arithmetic encodes relationships:

| Analogy | Operation | Result |
| --- | --- | --- |
| king is to queen as man is to woman | king - man + woman | queen |
| Paris is to France as Rome is to Italy | Paris - France + Italy | Rome |
| walking is to walked as swimming is to swam | walking - walked + swimming | swam |

These are not hand-coded rules. The relationships emerge from training on raw text.

From sparse one-hot vectors to dense embeddings

graph LR
  A["One-hot vector
Sparse, dim 50000
No similarity info"] --> B["Embedding matrix
Learned from data"]
  B --> C["Dense vector
dim 300
Similar words nearby"]

One-hot encoding gives every word a unique axis. No two words share any structure. Dense embeddings compress words into a small space where distance reflects meaning.

Here is what a small embedding space might look like. Each word gets a vector where dimensions capture aspects of meaning:

| Word | Dim 1 (royalty) | Dim 2 (gender) | Dim 3 (animate) |
| --- | --- | --- | --- |
| king | 0.9 | 0.1 | 0.8 |
| queen | 0.9 | 0.9 | 0.8 |
| man | 0.1 | 0.1 | 0.9 |
| car | 0.0 | 0.0 | 0.0 |
| truck | 0.0 | 0.0 | 0.0 |

“King” and “queen” share high royalty and animate values but differ on gender. “Car” and “truck” cluster near zero on all three. Real embeddings have 50 to 300 dimensions, and the dimensions do not have clean labels, but the principle holds.

Now let’s see exactly why one-hot encoding fails and how training produces these dense vectors.

The problem with one-hot encoding

Suppose your vocabulary has V = 5 words: {cat, dog, fish, car, truck}. A one-hot encoding assigns each word a vector of length V with a single 1 and the rest 0s:

\text{cat} = \begin{bmatrix} 1 \\ 0 \\ 0 \\ 0 \\ 0 \end{bmatrix}, \quad \text{dog} = \begin{bmatrix} 0 \\ 1 \\ 0 \\ 0 \\ 0 \end{bmatrix}, \quad \text{car} = \begin{bmatrix} 0 \\ 0 \\ 0 \\ 1 \\ 0 \end{bmatrix}

Three problems stand out:

  1. No similarity information. The dot product between any two distinct one-hot vectors is zero. “Cat” is as far from “dog” as it is from “car.” The encoding tells the model nothing about word relationships.

  2. Huge, sparse vectors. Real vocabularies have 50,000 to 500,000 words. Each vector has exactly one non-zero entry. That is a massive waste of memory.

  3. No generalization. If the model learns something about “cat,” that knowledge does not transfer to “dog” at all. The two vectors share no structure.

We need a representation where similar words have similar vectors. That is what dense embeddings provide.
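You can verify the first problem directly. A short NumPy sketch with a toy 5-word vocabulary (the word-to-index mapping here is arbitrary):

```python
import numpy as np

# One-hot vectors for the toy 5-word vocabulary {cat, dog, fish, car, truck}
V = 5
cat, dog, car = np.eye(V)[0], np.eye(V)[1], np.eye(V)[3]

# The dot product between any two distinct one-hot vectors is zero:
# "cat" is exactly as dissimilar from "dog" as it is from "car".
print(cat @ dog)  # 0.0
print(cat @ car)  # 0.0
```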

Dense embeddings: the core idea

Word embeddings projected to 2D, showing semantic clusters. Male-female pairs are separated along a consistent direction.

Instead of a sparse vector of length V, map each word to a dense vector of much smaller dimension d (typically 50 to 300). You store these vectors in an embedding matrix W \in \mathbb{R}^{V \times d}, where row i is the embedding for the i-th word.

Looking up a word embedding is a matrix multiply in disguise. Multiplying the one-hot vector x \in \mathbb{R}^V by W just selects the corresponding row:

e = x^T W

The key property: these vectors are learned from data. Words that appear in similar contexts end up with similar embeddings. “Cat” and “dog” both appear near “pet,” “feed,” and “vet,” so their vectors will be close. “Car” and “truck” cluster together for the same reason. This is the distributional hypothesis: a word is defined by the company it keeps.
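The lookup-as-matrix-multiply identity is easy to check with toy numbers (the matrix here is random, just to show the mechanics):

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 5, 3                    # toy vocabulary size and embedding dimension
W = rng.normal(size=(V, d))    # embedding matrix: one row per word

word_index = 2                 # pretend this is the index of "fish"
x = np.zeros(V)
x[word_index] = 1.0            # one-hot vector for the word

e = x @ W                      # the "matrix multiply in disguise"
assert np.allclose(e, W[word_index])  # identical to just indexing the row
```

In practice, frameworks skip the multiply entirely and implement the embedding layer as a row lookup.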

Word2Vec: skip-gram

Word2Vec, introduced by Mikolov et al. in 2013, learns embeddings by training a shallow neural network on a simple task. The skip-gram variant works like this: given a center word, predict the surrounding context words.

The objective

Take a sentence: “the cat sat on the mat.” With a context window of size 2, the center word “sat” should predict “the,” “cat,” “on,” and “the.” For each (center, context) pair, the model maximizes:

P(w_{\text{context}} \mid w_{\text{center}}) = \frac{\exp(v_{w_{\text{context}}}' \cdot v_{w_{\text{center}}})}{\sum_{w=1}^{V} \exp(v_w' \cdot v_{w_{\text{center}}})}

Here v_w is the input embedding and v_w' is the output embedding for word w. The denominator is a softmax over the entire vocabulary. Training maximizes the log-likelihood over all (center, context) pairs in the corpus.
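A minimal NumPy sketch of this probability, with small random embeddings standing in for trained ones (V, d, and the matrix names are toy values):

```python
import numpy as np

rng = np.random.default_rng(1)
V, d = 6, 4
V_in = rng.normal(scale=0.1, size=(V, d))   # input embeddings v_w
V_out = rng.normal(scale=0.1, size=(V, d))  # output embeddings v_w'

def context_probs(center):
    """P(context word | center word): softmax of v_w' . v_center over all w."""
    scores = V_out @ V_in[center]
    exp = np.exp(scores - scores.max())     # subtract max for stability
    return exp / exp.sum()

p = context_probs(2)
assert np.isclose(p.sum(), 1.0)             # a proper distribution over V words
```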

The architecture

The skip-gram model is a single hidden-layer network. No activation function, no bias. The architecture is deliberately simple because the goal is not classification accuracy; it is learning good embeddings.

flowchart LR
  A["Center word
(one-hot, dim V)"] --> B["W_embed
(V × d)"]
  B --> C["Hidden layer
(dim d)"]
  C --> D["W_context
(d × V)"]
  D --> E["Softmax
(dim V)"]
  E --> F["Context word
probabilities"]

The embedding matrix W_{\text{embed}} transforms the one-hot input into a d-dimensional hidden vector. The context matrix W_{\text{context}} projects back to vocabulary size. After training, you discard W_{\text{context}} and keep W_{\text{embed}} as your word embeddings.

Backpropagation pushes gradients through both matrices. The chain rule is straightforward here because the network has no non-linearities.

Word2Vec: CBOW

Continuous Bag of Words (CBOW) flips the skip-gram task. Given the context words, predict the center word. You take the embeddings of all context words, average them, and use that average to predict the center.

flowchart LR
  A1["Context word 1
(one-hot)"] --> B["W_embed
(V × d)"]
  A2["Context word 2
(one-hot)"] --> B
  A3["Context word 3
(one-hot)"] --> B
  A4["Context word 4
(one-hot)"] --> B
  B --> C["Average
(dim d)"]
  C --> D["W_context
(d × V)"]
  D --> E["Softmax
(dim V)"]
  E --> F["Center word
probability"]

For the center word w_c and context words w_1, w_2, \ldots, w_{2m} (window size m), the objective maximizes:

P(w_c \mid w_1, \ldots, w_{2m}) = \frac{\exp\left(v_{w_c}' \cdot \bar{v}\right)}{\sum_{w=1}^{V} \exp\left(v_w' \cdot \bar{v}\right)}

where \bar{v} = \frac{1}{2m} \sum_{i=1}^{2m} v_{w_i} is the average context embedding.

CBOW tends to train faster and works well for frequent words. Skip-gram handles rare words better because each word appears as a center word in its own training examples.
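The CBOW forward pass in miniature, again with random stand-in embeddings (the indices and sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)
V, d = 8, 4
V_in = rng.normal(scale=0.1, size=(V, d))   # input embeddings
V_out = rng.normal(scale=0.1, size=(V, d))  # output embeddings

context_ids = [1, 3, 5, 6]                  # the 2m context word indices
v_bar = V_in[context_ids].mean(axis=0)      # average context embedding

scores = V_out @ v_bar                      # one score per candidate center word
probs = np.exp(scores - scores.max())
probs /= probs.sum()
predicted = int(np.argmax(probs))           # most probable center word
```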

Negative sampling

Both skip-gram and CBOW have a problem: the softmax denominator sums over the entire vocabulary. For V = 100,000, that means 100,000 exponentials per training example. Far too slow.

Negative sampling replaces the full softmax with a binary classification task. For each real (center, context) pair, sample k random “negative” words that are not the true context. Then train a binary logistic regression to tell real context from noise.

How negative sampling works

graph TD
  A["Training pair:
center word, true context"] --> B["Positive example
cat, sat
Push closer"]
  A --> C["Sample k random negatives"]
  C --> D["Negative 1
cat, elephant"]
  C --> E["Negative 2
cat, bicycle"]
  C --> F["Negative k
cat, quantum"]
  D --> G["Push apart"]
  E --> G
  F --> G

Instead of normalizing over the entire vocabulary, you only update embeddings for the true context word and a handful of negatives. This cuts the cost from O(V) to O(k) per training pair.

The negative sampling objective for a positive pair (w, c) with negative samples n_1, \ldots, n_k is:

\mathcal{L} = -\log \sigma(v_c' \cdot v_w) - \sum_{i=1}^{k} \log \sigma(-v_{n_i}' \cdot v_w)

where \sigma(x) = \frac{1}{1 + e^{-x}} is the sigmoid function. The first term pushes the context embedding close to the center embedding. The second term pushes negative samples away.

Negative words are sampled from a noise distribution, typically the unigram distribution raised to the 3/4 power:

P_{\text{noise}}(w) \propto f(w)^{3/4}

where f(w) is the word frequency. The 3/4 exponent upweights rare words relative to pure frequency sampling. In practice, k = 5 to 20 negatives works well, making training orders of magnitude faster than full softmax.
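A quick sketch of the noise distribution, using made-up unigram counts:

```python
import numpy as np

# Toy unigram counts for a 5-word vocabulary, most to least frequent.
freq = np.array([1000.0, 500.0, 100.0, 10.0, 1.0])

p_unigram = freq / freq.sum()
p_noise = freq**0.75 / (freq**0.75).sum()   # unigram distribution to the 3/4 power

# The 3/4 exponent flattens the distribution: rare words are upweighted,
# frequent words are downweighted relative to raw frequency.
assert p_noise[-1] > p_unigram[-1]
assert p_noise[0] < p_unigram[0]

rng = np.random.default_rng(3)
k = 5
negatives = rng.choice(len(freq), size=k, p=p_noise)  # k negative word indices
```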

GloVe: global vectors from co-occurrence

GloVe (Global Vectors for Word Representation) takes a different approach. Instead of predicting context from a sliding window, it builds a global word-word co-occurrence matrix X, where X_{ij} counts how often word j appears in the context of word i across the entire corpus.

The GloVe objective learns embeddings such that their dot product approximates the log co-occurrence:

J = \sum_{i,j=1}^{V} f(X_{ij}) \left( w_i^T \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2

Here w_i and \tilde{w}_j are word and context vectors, b_i and \tilde{b}_j are bias terms, and f(X_{ij}) is a weighting function that caps the influence of very frequent pairs:

f(x) = \begin{cases} (x / x_{\max})^\alpha & \text{if } x < x_{\max} \\ 1 & \text{otherwise} \end{cases}

Typically x_{\max} = 100 and \alpha = 0.75. This prevents common word pairs like “the, of” from dominating the loss.
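The weighting function itself is a one-liner; here is a sketch with a few example counts:

```python
import numpy as np

def glove_weight(x, x_max=100.0, alpha=0.75):
    """GloVe weighting f(X_ij): rises with the co-occurrence count, capped at 1."""
    return np.where(x < x_max, (x / x_max) ** alpha, 1.0)

counts = np.array([1.0, 10.0, 100.0, 5000.0])
w = glove_weight(counts)
# Rare pairs get small weight; anything at or above x_max is clipped to 1.
assert w[2] == 1.0 and w[3] == 1.0
assert w[0] < w[1] < 1.0
```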

GloVe combines the strengths of count-based methods (using global statistics) with the strengths of predictive methods (learning low-dimensional representations). You optimize J with gradient descent, often using AdaGrad or Adam.

Word2Vec vs GloVe

graph TD
  A["Word2Vec"] --> B["Learns from local context
Sliding window over text"]
  A --> C["Predictive objective:
predict neighbors"]
  D["GloVe"] --> E["Learns from global statistics
Full co-occurrence matrix"]
  D --> F["Reconstruction objective:
approximate log counts"]
  B --> G["Scales to huge corpora
Online updates"]
  E --> H["Leverages corpus-wide
patterns in one pass"]

Word2Vec processes one window at a time, making it easy to train on streaming text. GloVe pre-computes a co-occurrence matrix, then optimizes over all pairs. In practice, both produce similar quality embeddings. GloVe tends to do better on analogy tasks; Word2Vec can be faster to train on very large corpora.

FastText: subword information

Word2Vec and GloVe both assign one vector per word. If a word is not in the vocabulary (out-of-vocabulary, or OOV), you are stuck. FastText, developed by Facebook Research, solves this with subword embeddings.

FastText represents each word as a bag of character n-grams. For example, with n = 3 to 6, the word “running” produces n-grams like: <ru, run, unn, nni, nin, ing, ng>, <run, runn, unni, nnin, ning, ing>, and so on (angle brackets mark word boundaries).

The embedding for a word is the sum of its n-gram embeddings:

v_{\text{running}} = \sum_{g \in G(\text{running})} z_g

where G(w) is the set of n-grams for word w and z_g is the learned vector for n-gram g.

This gives FastText two advantages:

  1. OOV handling. A word never seen during training still has n-grams that overlap with known words. “Unfriendliness” shares n-grams with “unfriendly,” “friend,” and “friendliness.”

  2. Morphological patterns. Words with shared roots or suffixes naturally get similar embeddings. “Running,” “runner,” and “ran” all share the “run” n-gram.

The training objective is the same skip-gram with negative sampling, just with the modified embedding lookup.
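A sketch of the n-gram extraction (a simplified version; real FastText also hashes n-grams into a fixed number of buckets to bound memory):

```python
def char_ngrams(word, n_min=3, n_max=6):
    """All character n-grams of `word`, with < and > marking word boundaries."""
    w = f"<{word}>"
    return {w[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(w) - n + 1)}

running = char_ngrams("running")
assert "run" in running and "ing" in running

# OOV handling: a never-seen word still shares subword units with known words.
overlap = char_ngrams("unfriendliness") & char_ngrams("unfriendly")
assert len(overlap) > 0
```

The word vector is then the sum of the learned vectors for these n-grams, so related forms share most of their representation.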

Comparison table

| Method | Training objective | Handles OOV? | Captures morphology? | Training data | Typical dimension |
| --- | --- | --- | --- | --- | --- |
| Word2Vec (skip-gram) | Predict context from center word | No | No | Local context windows | 100 to 300 |
| Word2Vec (CBOW) | Predict center from context words | No | No | Local context windows | 100 to 300 |
| GloVe | Approximate log co-occurrence | No | No | Global co-occurrence matrix | 50 to 300 |
| FastText | Skip-gram on subword n-grams | Yes | Yes | Local context windows + n-grams | 100 to 300 |

Worked examples

Example 1: cosine similarity

Cosine similarity measures the angle between two vectors, ignoring magnitude. It is defined as:

\cos(v_a, v_b) = \frac{v_a \cdot v_b}{\|v_a\| \, \|v_b\|}

where \| \cdot \| is the L2 norm.

Let v_{\text{king}} = [0.8, 0.3, 0.5] and v_{\text{queen}} = [0.7, 0.4, 0.6].

Step 1: dot product.

v_{\text{king}} \cdot v_{\text{queen}} = 0.8 \times 0.7 + 0.3 \times 0.4 + 0.5 \times 0.6 = 0.56 + 0.12 + 0.30 = 0.98

Step 2: norms.

\|v_{\text{king}}\| = \sqrt{0.8^2 + 0.3^2 + 0.5^2} = \sqrt{0.64 + 0.09 + 0.25} = \sqrt{0.98} \approx 0.990

\|v_{\text{queen}}\| = \sqrt{0.7^2 + 0.4^2 + 0.6^2} = \sqrt{0.49 + 0.16 + 0.36} = \sqrt{1.01} \approx 1.005

Step 3: cosine similarity.

\cos(v_{\text{king}}, v_{\text{queen}}) = \frac{0.98}{0.990 \times 1.005} = \frac{0.98}{0.995} \approx 0.985

A value close to 1 means the vectors point in nearly the same direction. “King” and “queen” are very similar in this embedding space.
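The same computation in NumPy, confirming the hand-worked result:

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity: dot product normalized by both L2 norms."""
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

v_king = np.array([0.8, 0.3, 0.5])
v_queen = np.array([0.7, 0.4, 0.6])
print(round(cosine(v_king, v_queen), 3))  # 0.985
```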

Example 2: word analogy

The classic analogy test: king - man + woman ≈ queen. We compute a result vector and find the nearest neighbor among candidates.

Vectors (4-dimensional for simplicity):

v_{\text{king}} = [0.9, 0.2, 0.8, 0.1], \quad v_{\text{man}} = [0.8, 0.1, 0.2, 0.1], \quad v_{\text{woman}} = [0.7, 0.3, 0.2, 0.9]

Step 1: compute the result vector.

v_{\text{result}} = v_{\text{king}} - v_{\text{man}} + v_{\text{woman}}

= [0.9 - 0.8 + 0.7, \; 0.2 - 0.1 + 0.3, \; 0.8 - 0.2 + 0.2, \; 0.1 - 0.1 + 0.9]

= [0.8, \; 0.4, \; 0.8, \; 0.9]

Step 2: candidate vectors.

v_{\text{queen}} = [0.8, 0.4, 0.8, 0.9], \quad v_{\text{princess}} = [0.6, 0.5, 0.7, 0.8], \quad v_{\text{prince}} = [0.9, 0.1, 0.7, 0.2]

Step 3: cosine similarity with each candidate.

First, \|v_{\text{result}}\| = \sqrt{0.64 + 0.16 + 0.64 + 0.81} = \sqrt{2.25} = 1.5.

Queen:

v_{\text{result}} \cdot v_{\text{queen}} = 0.64 + 0.16 + 0.64 + 0.81 = 2.25

\|v_{\text{queen}}\| = \sqrt{0.64 + 0.16 + 0.64 + 0.81} = \sqrt{2.25} = 1.5

\cos = \frac{2.25}{1.5 \times 1.5} = \frac{2.25}{2.25} = 1.000

Princess:

v_{\text{result}} \cdot v_{\text{princess}} = 0.48 + 0.20 + 0.56 + 0.72 = 1.96

\|v_{\text{princess}}\| = \sqrt{0.36 + 0.25 + 0.49 + 0.64} = \sqrt{1.74} \approx 1.319

\cos = \frac{1.96}{1.5 \times 1.319} = \frac{1.96}{1.979} \approx 0.990

Prince:

v_{\text{result}} \cdot v_{\text{prince}} = 0.72 + 0.04 + 0.56 + 0.18 = 1.50

\|v_{\text{prince}}\| = \sqrt{0.81 + 0.01 + 0.49 + 0.04} = \sqrt{1.35} \approx 1.162

\cos = \frac{1.50}{1.5 \times 1.162} = \frac{1.50}{1.743} \approx 0.861

Result: Queen (1.000) > Princess (0.990) > Prince (0.861). The analogy king - man + woman produces a vector identical to queen, confirming the analogy holds perfectly in this example.
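The same analogy check in NumPy, using the vectors above:

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity: dot product normalized by both L2 norms."""
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

vecs = {
    "king":     np.array([0.9, 0.2, 0.8, 0.1]),
    "man":      np.array([0.8, 0.1, 0.2, 0.1]),
    "woman":    np.array([0.7, 0.3, 0.2, 0.9]),
    "queen":    np.array([0.8, 0.4, 0.8, 0.9]),
    "princess": np.array([0.6, 0.5, 0.7, 0.8]),
    "prince":   np.array([0.9, 0.1, 0.7, 0.2]),
}

result = vecs["king"] - vecs["man"] + vecs["woman"]
scores = {w: cosine(result, vecs[w]) for w in ("queen", "princess", "prince")}
best = max(scores, key=scores.get)
print(best)  # queen
```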

Example 3: negative sampling loss

Compute the negative sampling loss for a skip-gram training example with 2 negative samples.

Let the center word u have embedding v_u = [0.5, 0.3, -0.2]. The positive context word c has embedding v_c = [0.4, 0.1, 0.3]. Two negative samples have embeddings v_{n_1} = [-0.3, 0.2, 0.1] and v_{n_2} = [0.1, -0.4, 0.2].

The loss is:

\mathcal{L} = -\log \sigma(v_c \cdot v_u) - \log \sigma(-v_{n_1} \cdot v_u) - \log \sigma(-v_{n_2} \cdot v_u)

Step 1: dot products.

v_c \cdot v_u = 0.4 \times 0.5 + 0.1 \times 0.3 + 0.3 \times (-0.2) = 0.20 + 0.03 - 0.06 = 0.17

v_{n_1} \cdot v_u = (-0.3) \times 0.5 + 0.2 \times 0.3 + 0.1 \times (-0.2) = -0.15 + 0.06 - 0.02 = -0.11

v_{n_2} \cdot v_u = 0.1 \times 0.5 + (-0.4) \times 0.3 + 0.2 \times (-0.2) = 0.05 - 0.12 - 0.04 = -0.11

Step 2: apply sigmoid.

Recall \sigma(x) = \frac{1}{1 + e^{-x}}.

\sigma(0.17) = \frac{1}{1 + e^{-0.17}} = \frac{1}{1 + 0.844} = \frac{1}{1.844} \approx 0.5424

\sigma(-(-0.11)) = \sigma(0.11) = \frac{1}{1 + e^{-0.11}} = \frac{1}{1 + 0.896} = \frac{1}{1.896} \approx 0.5275

\sigma(-(-0.11)) = \sigma(0.11) \approx 0.5275

Step 3: compute log terms.

-\log(0.5424) \approx 0.6118

-\log(0.5275) \approx 0.6397

-\log(0.5275) \approx 0.6397

Step 4: total loss.

\mathcal{L} = 0.6118 + 0.6397 + 0.6397 = 1.8912

This loss is relatively high because the dot product between the center and positive context word is small (0.17), meaning their embeddings are not yet well aligned. Training with SGD will adjust the embeddings to increase v_c \cdot v_u and decrease v_{n_1} \cdot v_u and v_{n_2} \cdot v_u, which lowers the loss.
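The same loss in NumPy, computed at full floating-point precision (tiny differences from hand-rounded intermediates are expected):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

v_u  = np.array([0.5, 0.3, -0.2])   # center word embedding
v_c  = np.array([0.4, 0.1, 0.3])    # positive context
v_n1 = np.array([-0.3, 0.2, 0.1])   # negative sample 1
v_n2 = np.array([0.1, -0.4, 0.2])   # negative sample 2

# Negative sampling loss: pull the positive pair together, push negatives apart.
loss = (-np.log(sigmoid(v_c @ v_u))
        - np.log(sigmoid(-(v_n1 @ v_u)))
        - np.log(sigmoid(-(v_n2 @ v_u))))
print(round(loss, 3))  # 1.891
```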

Evaluation

How do you know if your embeddings are any good? Two approaches.

Intrinsic evaluation

Test the embeddings directly on word-level tasks.

Word analogy. Given “a is to b as c is to __,” find the word d that maximizes \cos(v_b - v_a + v_c, v_d). Standard benchmarks include the Google analogy dataset (syntactic: “run” to “running” as “swim” to “swimming”; semantic: “Paris” to “France” as “Berlin” to “Germany”).

Word similarity. Compare the cosine similarity ranking of word pairs against human judgments. Datasets like WordSim-353 and SimLex-999 provide human-rated similarity scores. You compute the Spearman rank correlation between your model’s similarities and the human scores.

Extrinsic evaluation

Use the embeddings as input features in a downstream task and measure task performance. Common tasks include named entity recognition (NER), sentiment analysis, and text classification. Better embeddings generally lead to better downstream performance.

Extrinsic evaluation is more expensive but more meaningful. Embeddings that score well on analogy tasks do not always produce the best results on real applications.

Limitations of static embeddings

All the methods above produce a single, fixed vector per word. This is a serious limitation because many words have multiple meanings.

Consider “bank”:

  • “I walked along the river bank.” (riverbank)
  • “I deposited money at the bank.” (financial institution)

A static embedding gives “bank” one vector that blends both meanings. The model cannot distinguish the two senses from context.

This is the polysemy problem. Static embeddings also struggle with less obvious cases. “Apple” (fruit vs. company) and “play” (theater vs. game) get single vectors that average across all their uses.

Contextual embeddings solve this. Models like ELMo and BERT produce different vectors for the same word depending on the surrounding sentence. The attention mechanism in transformers is central to this capability. These models also benefit from transfer learning: you train once on a large corpus and fine-tune on your specific task.

Still, static embeddings remain useful. They are fast to train, fast to look up, and work well for many tasks. Dimensionality reduction techniques like PCA can visualize them in 2D or 3D, making them valuable for exploratory analysis. For tasks where context matters less (information retrieval, simple classification), Word2Vec and GloVe are solid choices.

What comes next

Static embeddings give every word a fixed vector regardless of context. The next article on transfer learning covers how pretrained representations, including contextual embeddings, can be adapted to new tasks with minimal data. That shift from training embeddings from scratch to fine-tuning pretrained models is one of the biggest practical advances in NLP.
