Word embeddings: from one-hot to dense representations
In this series (25 parts)
- Neural networks: the basic building block
- Forward pass and backpropagation
- Training neural networks: a practical guide
- Convolutional neural networks
- Recurrent neural networks and LSTMs
- Attention mechanism and transformers
- Word embeddings: from one-hot to dense representations
- Transfer learning and fine-tuning
- Optimization techniques for deep networks
- Regularization for deep networks
- Encoder-decoder architectures
- Generative models: an overview
- Restricted Boltzmann Machines
- Deep Belief Networks
- Variational Autoencoders
- Generative Adversarial Networks: training and theory
- DCGAN, conditional GANs, and GAN variants
- Representation learning and self-supervised learning
- Domain adaptation and fine-tuning strategies
- Distributed representations and latent spaces
- AutoML and hyperparameter optimization
- Neural architecture search
- Network compression and efficient inference
- Graph neural networks
- Practical deep learning: debugging and tuning
Prerequisites: Introduction to neural networks and Norms and distances.
Words are not numbers. Before a neural network can process language, you need a way to turn words into vectors. The naive approach, one-hot encoding, treats every word as equally different from every other word. That is almost never what you want. Dense word embeddings fix this by placing similar words close together in a learned vector space.
Why embeddings matter
Dense embeddings capture meaning through proximity. Words with similar roles end up near each other. Even more striking, the vector arithmetic encodes relationships:
| Analogy | Operation | Result |
|---|---|---|
| king is to queen as man is to woman | king - man + woman | queen |
| Paris is to France as Rome is to Italy | Paris - France + Italy | Rome |
| walking is to walked as swimming is to swam | walked - walking + swimming | swam |
These are not hand-coded rules. The relationships emerge from training on raw text.
From sparse one-hot vectors to dense embeddings
```mermaid
graph LR
  A["One-hot vector<br/>Sparse, dim 50000<br/>No similarity info"] --> B["Embedding matrix<br/>Learned from data"]
  B --> C["Dense vector<br/>dim 300<br/>Similar words nearby"]
```
One-hot encoding gives every word a unique axis. No two words share any structure. Dense embeddings compress words into a small space where distance reflects meaning.
Here is what a small embedding space might look like. Each word gets a vector where dimensions capture aspects of meaning:
| Word | Dim 1 (royalty) | Dim 2 (gender) | Dim 3 (animate) |
|---|---|---|---|
| king | 0.9 | 0.1 | 0.8 |
| queen | 0.9 | 0.9 | 0.8 |
| man | 0.1 | 0.1 | 0.9 |
| car | 0.0 | 0.0 | 0.0 |
| truck | 0.0 | 0.0 | 0.0 |
“King” and “queen” share high royalty and animate values but differ on gender. “Car” and “truck” cluster near zero on all three. Real embeddings have 50 to 300 dimensions, and the dimensions do not have clean labels, but the principle holds.
Now let’s see exactly why one-hot encoding fails and how training produces these dense vectors.
The problem with one-hot encoding
Suppose your vocabulary has five words: {cat, dog, fish, car, truck}. A one-hot encoding assigns each word a vector of length $V = 5$ with a single 1 and the rest 0s:

$$\mathbf{x}_{\text{cat}} = (1, 0, 0, 0, 0), \quad \mathbf{x}_{\text{dog}} = (0, 1, 0, 0, 0), \quad \ldots, \quad \mathbf{x}_{\text{truck}} = (0, 0, 0, 0, 1)$$
Three problems stand out:
- No similarity information. The dot product between any two distinct one-hot vectors is zero. “Cat” is as far from “dog” as it is from “car.” The encoding tells the model nothing about word relationships.
- Huge, sparse vectors. Real vocabularies have 50,000 to 500,000 words. Each vector has exactly one non-zero entry. That is a massive waste of memory.
- No generalization. If the model learns something about “cat,” that knowledge does not transfer to “dog” at all. The two vectors share no structure.
We need a representation where similar words have similar vectors. That is what dense embeddings provide.
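The no-similarity problem is easy to see in code. A minimal sketch with the five-word vocabulary above:

```python
import numpy as np

vocab = ["cat", "dog", "fish", "car", "truck"]
one_hot = np.eye(len(vocab))  # row i is the one-hot vector for vocab[i]

cat, dog, car = one_hot[0], one_hot[1], one_hot[3]

# Every pair of distinct one-hot vectors is orthogonal: the encoding
# says "cat" is exactly as different from "dog" as it is from "car".
print(cat @ dog, cat @ car)  # 0.0 0.0
```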
Dense embeddings: the core idea
Word embeddings projected to 2D, showing semantic clusters. Male-female pairs are separated along a consistent direction.
Instead of a sparse vector of length $V$, map each word to a dense vector of much smaller dimension $d$ (typically 50 to 300). You store these vectors in an embedding matrix $W_{\text{embed}} \in \mathbb{R}^{V \times d}$, where row $i$ is the embedding for the $i$-th word.
Looking up a word embedding is a matrix multiply in disguise. Multiplying the one-hot vector $\mathbf{x}_i$ by $W_{\text{embed}}$ just selects the corresponding row:

$$\mathbf{e}_i = \mathbf{x}_i^\top W_{\text{embed}}, \quad \text{the } i\text{-th row of } W_{\text{embed}}$$
The key property: these vectors are learned from data. Words that appear in similar contexts end up with similar embeddings. “Cat” and “dog” both appear near “pet,” “feed,” and “vet,” so their vectors will be close. “Car” and “truck” cluster together for the same reason. This is the distributional hypothesis: a word is defined by the company it keeps.
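The lookup-as-matrix-multiply view is easy to verify in numpy. A minimal sketch with toy sizes (the values are random and purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 5, 3                        # toy vocabulary size and embedding dim
W_embed = rng.normal(size=(V, d))  # embedding matrix, one row per word

word_index = 2                     # e.g. "fish" in {cat, dog, fish, car, truck}
one_hot = np.zeros(V)
one_hot[word_index] = 1.0

via_matmul = one_hot @ W_embed     # full matrix multiply
via_lookup = W_embed[word_index]   # direct row selection -- same result
```

This is why real libraries implement embedding layers as an indexed lookup rather than an actual multiplication.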
Word2Vec: skip-gram
Word2Vec, introduced by Mikolov et al. in 2013, learns embeddings by training a shallow neural network on a simple task. The skip-gram variant works like this: given a center word, predict the surrounding context words.
The objective
Take a sentence: “the cat sat on the mat.” With a context window of size 2, the center word “sat” should predict “the,” “cat,” “on,” and “the.” For each (center, context) pair, the model maximizes:

$$P(o \mid c) = \frac{\exp(\mathbf{u}_o^\top \mathbf{v}_c)}{\sum_{w=1}^{V} \exp(\mathbf{u}_w^\top \mathbf{v}_c)}$$

Here $\mathbf{v}_c$ is the input embedding for the center word $c$ and $\mathbf{u}_w$ is the output embedding for word $w$. The denominator is a softmax over the entire vocabulary. Training maximizes the log-likelihood over all (center, context) pairs in the corpus.
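The softmax over context scores can be sketched in a few lines of numpy (toy sizes, random values for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
V, d = 5, 3
V_in = rng.normal(scale=0.1, size=(V, d))   # input embeddings v_c (center words)
U_out = rng.normal(scale=0.1, size=(V, d))  # output embeddings u_w (context words)

def skipgram_probs(center):
    """P(o | c): softmax over the scores u_w . v_c for every word w."""
    scores = U_out @ V_in[center]   # one score per vocabulary word
    scores -= scores.max()          # subtract max for numerical stability
    e = np.exp(scores)
    return e / e.sum()

probs = skipgram_probs(center=1)
```

The full sum in the denominator is exactly the cost that negative sampling, discussed below, avoids.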
The architecture
The skip-gram model is a single hidden-layer network. No activation function, no bias. The architecture is deliberately simple because the goal is not classification accuracy; it is learning good embeddings.
```mermaid
flowchart LR
  A["Center word<br/>(one-hot, dim V)"] --> B["W_embed (V × d)"]
  B --> C["Hidden layer (dim d)"]
  C --> D["W_context (d × V)"]
  D --> E["Softmax (dim V)"]
  E --> F["Context word probabilities"]
```
The embedding matrix $W_{\text{embed}}$ transforms the one-hot input into a $d$-dimensional hidden vector. The context matrix $W_{\text{context}}$ projects back to vocabulary size. After training, you discard $W_{\text{context}}$ and keep $W_{\text{embed}}$ as your word embeddings.
Backpropagation pushes gradients through both matrices. The chain rule is straightforward here because the network has no non-linearities.
Word2Vec: CBOW
Continuous Bag of Words (CBOW) flips the skip-gram task. Given the context words, predict the center word. You take the embeddings of all context words, average them, and use that average to predict the center.
```mermaid
flowchart LR
  A1["Context word 1 (one-hot)"] --> B["W_embed (V × d)"]
  A2["Context word 2 (one-hot)"] --> B
  A3["Context word 3 (one-hot)"] --> B
  A4["Context word 4 (one-hot)"] --> B
  B --> C["Average (dim d)"]
  C --> D["W_context (d × V)"]
  D --> E["Softmax (dim V)"]
  E --> F["Center word probability"]
```
For the center word $c$ and context words $o_1, \dots, o_{2m}$ (window size $m$), the objective maximizes:

$$P(c \mid o_1, \dots, o_{2m}) = \frac{\exp(\mathbf{u}_c^\top \bar{\mathbf{v}})}{\sum_{w=1}^{V} \exp(\mathbf{u}_w^\top \bar{\mathbf{v}})}$$

where $\bar{\mathbf{v}} = \frac{1}{2m} \sum_{j=1}^{2m} \mathbf{v}_{o_j}$ is the average context embedding.
CBOW tends to train faster and works well for frequent words. Skip-gram handles rare words better because each word appears as a center word in its own training examples.
Negative sampling
Both skip-gram and CBOW have a problem: the softmax denominator sums over the entire vocabulary. For $V = 100{,}000$, that means 100,000 exponentials per training example. Far too slow.
Negative sampling replaces the full softmax with a binary classification task. For each real (center, context) pair, sample $k$ random “negative” words that are not the true context. Then train a binary logistic regression to tell real context from noise.
How negative sampling works
```mermaid
graph TD
  A["Training pair: center word, true context"] --> B["Positive example<br/>cat, sat<br/>Push closer"]
  A --> C["Sample k random negatives"]
  C --> D["Negative 1<br/>cat, elephant"]
  C --> E["Negative 2<br/>cat, bicycle"]
  C --> F["Negative k<br/>cat, quantum"]
  D --> G["Push apart"]
  E --> G
  F --> G
```
Instead of normalizing over the entire vocabulary, you only update embeddings for the true context word and a handful of negatives. This cuts the cost from O(V) to O(k) per training pair.
The negative sampling objective for a positive pair $(c, o)$ with $k$ negative samples $w_1, \dots, w_k$ is:

$$\log \sigma(\mathbf{u}_o^\top \mathbf{v}_c) + \sum_{i=1}^{k} \log \sigma(-\mathbf{u}_{w_i}^\top \mathbf{v}_c)$$

where $\sigma(x) = \frac{1}{1 + e^{-x}}$ is the sigmoid function. The first term pushes the context embedding close to the center embedding. The second term pushes negative samples away.
Negative words are sampled from a noise distribution, typically the unigram distribution raised to the 3/4 power:

$$P_n(w) = \frac{f(w)^{3/4}}{\sum_{w'} f(w')^{3/4}}$$

where $f(w)$ is the word frequency. The 3/4 exponent upweights rare words relative to pure frequency sampling. In practice, $k = 5$ to $20$ negatives works well, making training orders of magnitude faster than full softmax.
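Both pieces, the binary loss and the 3/4-power noise distribution, fit in a short numpy sketch (toy frequencies, purely illustrative):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def neg_sampling_loss(v_c, u_pos, U_neg):
    """-log sigma(u_pos . v_c) - sum_i log sigma(-u_neg_i . v_c)"""
    pos = -np.log(sigmoid(u_pos @ v_c))
    neg = -np.log(sigmoid(-(U_neg @ v_c))).sum()
    return pos + neg

def noise_distribution(freqs, power=0.75):
    """Unigram frequencies raised to the 3/4 power, renormalized."""
    p = np.asarray(freqs, dtype=float) ** power
    return p / p.sum()

# A very common word (freq 1000) gets less probability mass than its raw
# frequency share would give; the rare words gain mass.
P_n = noise_distribution([1000, 100, 10, 1])
```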
GloVe: global vectors from co-occurrence
GloVe (Global Vectors for Word Representation) takes a different approach. Instead of predicting context from a sliding window, it builds a global word-word co-occurrence matrix $X$, where $X_{ij}$ counts how often word $j$ appears in the context of word $i$ across the entire corpus.
The GloVe objective learns embeddings such that their dot product approximates the log co-occurrence:

$$J = \sum_{i,j=1}^{V} f(X_{ij}) \left( \mathbf{w}_i^\top \tilde{\mathbf{w}}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2$$

Here $\mathbf{w}_i$ and $\tilde{\mathbf{w}}_j$ are word and context vectors, $b_i$ and $\tilde{b}_j$ are bias terms, and $f$ is a weighting function that caps the influence of very frequent pairs:

$$f(x) = \begin{cases} (x / x_{\max})^{\alpha} & \text{if } x < x_{\max} \\ 1 & \text{otherwise} \end{cases}$$

Typically $x_{\max} = 100$ and $\alpha = 3/4$. This prevents common word pairs like “the, of” from dominating the loss.
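The weighting function is simple enough to write directly; a sketch with the typical defaults:

```python
import numpy as np

def glove_weight(x, x_max=100.0, alpha=0.75):
    """f(x): grows as (x / x_max)^alpha below x_max, capped at 1 above it."""
    x = np.asarray(x, dtype=float)
    return np.where(x < x_max, (x / x_max) ** alpha, 1.0)

# Counts of 100 and 1000 get the same weight; a count of 1 gets very little.
w = glove_weight([1, 10, 100, 1000])
```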
GloVe combines the strengths of count-based methods (using global statistics) with the strengths of predictive methods (learning low-dimensional representations). You optimize with gradient descent, often using AdaGrad or Adam.
Word2Vec vs GloVe
```mermaid
graph TD
  A["Word2Vec"] --> B["Learns from local context<br/>Sliding window over text"]
  A --> C["Predictive objective: predict neighbors"]
  D["GloVe"] --> E["Learns from global statistics<br/>Full co-occurrence matrix"]
  D --> F["Reconstruction objective: approximate log counts"]
  B --> G["Scales to huge corpora<br/>Online updates"]
  E --> H["Leverages corpus-wide patterns in one pass"]
```
Word2Vec processes one window at a time, making it easy to train on streaming text. GloVe pre-computes a co-occurrence matrix, then optimizes over all pairs. In practice, both produce similar quality embeddings. GloVe tends to do better on analogy tasks; Word2Vec can be faster to train on very large corpora.
FastText: subword information
Word2Vec and GloVe both assign one vector per word. If a word is not in the vocabulary (out-of-vocabulary, or OOV), you are stuck. FastText, developed by Facebook Research, solves this with subword embeddings.
FastText represents each word as a bag of character n-grams. For example, with $n = 3$ to $6$, the word “running” produces n-grams like: <ru, run, unn, nni, nin, ing, ng> (for $n = 3$), <run, runn, unni, nnin, ning, ing> (for $n = 4$), and so on (angle brackets mark word boundaries).
The embedding for a word is the sum of its n-gram embeddings:

$$\mathbf{u}_w = \sum_{g \in G_w} \mathbf{z}_g$$

where $G_w$ is the set of n-grams for word $w$ and $\mathbf{z}_g$ is the learned vector for n-gram $g$.
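Extracting the n-gram set is a small exercise; a sketch using the 3-to-6 range:

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Character n-grams of a word, with < and > marking word boundaries."""
    w = f"<{word}>"
    grams = set()
    for n in range(n_min, n_max + 1):
        for i in range(len(w) - n + 1):
            grams.add(w[i:i + n])
    return grams

# "running" and "runner" overlap on n-grams like <ru, run, runn
shared = char_ngrams("running") & char_ngrams("runner")
```

In real FastText each n-gram is hashed into a fixed-size table of vectors, and the word embedding is the sum of those vectors.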
This gives FastText two advantages:
- OOV handling. A word never seen during training still has n-grams that overlap with known words. “Unfriendliness” shares n-grams with “unfriendly,” “friend,” and “friendliness.”
- Morphological patterns. Words with shared roots or suffixes naturally get similar embeddings. “Running” and “runner” both contain the “run” n-gram.
The training objective is the same skip-gram with negative sampling, just with the modified embedding lookup.
Comparison table
| Method | Training objective | Handles OOV? | Captures morphology? | Training data | Typical dimension |
|---|---|---|---|---|---|
| Word2Vec (skip-gram) | Predict context from center word | No | No | Local context windows | 100 to 300 |
| Word2Vec (CBOW) | Predict center from context words | No | No | Local context windows | 100 to 300 |
| GloVe | Approximate log co-occurrence | No | No | Global co-occurrence matrix | 50 to 300 |
| FastText | Skip-gram on subword n-grams | Yes | Yes | Local context windows + n-grams | 100 to 300 |
Worked examples
Example 1: cosine similarity
Cosine similarity measures the angle between two vectors, ignoring magnitude. It is defined as:

$$\cos(\mathbf{u}, \mathbf{v}) = \frac{\mathbf{u} \cdot \mathbf{v}}{\|\mathbf{u}\| \, \|\mathbf{v}\|}$$

where $\|\cdot\|$ is the L2 norm.

Take the toy three-dimensional vectors from the embedding table above: $\mathbf{u}_{\text{king}} = (0.9, 0.1, 0.8)$ and $\mathbf{u}_{\text{queen}} = (0.9, 0.9, 0.8)$.

Step 1: dot product. $0.9 \cdot 0.9 + 0.1 \cdot 0.9 + 0.8 \cdot 0.8 = 0.81 + 0.09 + 0.64 = 1.54$.

Step 2: norms. $\|\mathbf{u}_{\text{king}}\| = \sqrt{0.81 + 0.01 + 0.64} = \sqrt{1.46} \approx 1.208$ and $\|\mathbf{u}_{\text{queen}}\| = \sqrt{0.81 + 0.81 + 0.64} = \sqrt{2.26} \approx 1.503$.

Step 3: cosine similarity. $1.54 / (1.208 \cdot 1.503) \approx 0.848$.

A value close to 1 means the vectors point in nearly the same direction; at roughly 0.85, “king” and “queen” are quite similar in this toy embedding space.
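The same computation in numpy, using the toy vectors for “king” and “queen” from the embedding table earlier:

```python
import numpy as np

def cosine_similarity(u, v):
    """cos(theta) = (u . v) / (||u|| ||v||)"""
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

king = np.array([0.9, 0.1, 0.8])   # from the toy embedding table
queen = np.array([0.9, 0.9, 0.8])

sim = cosine_similarity(king, queen)  # ~0.848, matching the hand calculation
```

Note that the zero vector (like “car” in the toy table) has no direction, so cosine similarity is undefined for it; real code should guard against zero norms.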
Example 2: word analogy
The classic analogy test: king - man + woman ≈ queen. We compute a result vector and find the nearest neighbor among candidates.
Using 4-dimensional embeddings for simplicity:

Step 1: compute the result vector $\mathbf{r} = \mathbf{u}_{\text{king}} - \mathbf{u}_{\text{man}} + \mathbf{u}_{\text{woman}}$.

Step 2: gather the candidate vectors $\mathbf{u}_{\text{queen}}$, $\mathbf{u}_{\text{princess}}$, $\mathbf{u}_{\text{prince}}$.

Step 3: compute the cosine similarity between $\mathbf{r}$ and each candidate:

- Queen: 1.000
- Princess: 0.990
- Prince: 0.861

Result: queen (1.000) > princess (0.990) > prince (0.861). The analogy king - man + woman produces a vector identical to queen, confirming the analogy holds perfectly in this example.
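The same procedure can be run end to end in numpy. The 4-d vectors below are made up for this sketch (dimensions roughly: royalty, femininity, animate, adult), chosen so that king - man + woman lands exactly on queen; the exact similarity values will differ from the numbers above, but the ranking is the same:

```python
import numpy as np

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

# Hypothetical 4-d embeddings, illustrative only
emb = {
    "king":     np.array([0.9, 0.1, 0.8, 0.7]),
    "man":      np.array([0.1, 0.1, 0.9, 0.7]),
    "woman":    np.array([0.1, 0.9, 0.9, 0.7]),
    "queen":    np.array([0.9, 0.9, 0.8, 0.7]),
    "princess": np.array([0.7, 0.9, 0.8, 0.3]),
    "prince":   np.array([0.7, 0.1, 0.8, 0.3]),
}

result = emb["king"] - emb["man"] + emb["woman"]
ranked = sorted(["queen", "princess", "prince"],
                key=lambda w: cosine(result, emb[w]), reverse=True)
```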
Example 3: negative sampling loss
Compute the negative sampling loss for a skip-gram training example with 2 negative samples.
Let the center word have embedding $\mathbf{v}_c$, the positive context word have embedding $\mathbf{u}_+$, and the two negative samples have embeddings $\mathbf{u}_{n_1}$ and $\mathbf{u}_{n_2}$.

The loss (the negated objective) is:

$$L = -\log \sigma(\mathbf{u}_+^\top \mathbf{v}_c) - \log \sigma(-\mathbf{u}_{n_1}^\top \mathbf{v}_c) - \log \sigma(-\mathbf{u}_{n_2}^\top \mathbf{v}_c)$$

Step 1: compute the three dot products $\mathbf{u}_+^\top \mathbf{v}_c$, $\mathbf{u}_{n_1}^\top \mathbf{v}_c$, and $\mathbf{u}_{n_2}^\top \mathbf{v}_c$. Here the positive pair gives $\mathbf{u}_+^\top \mathbf{v}_c = 0.17$.

Step 2: apply the sigmoid. Recall $\sigma(x) = \frac{1}{1 + e^{-x}}$, so $\sigma(0.17) \approx 0.542$ for the positive pair.

Step 3: compute the log terms: $-\log \sigma(0.17) \approx 0.612$ for the positive pair, plus the corresponding $-\log \sigma(-\cdot)$ term for each negative.

Step 4: sum the three terms to get the total loss.

This loss is relatively high because the dot product between the center and positive context word is small (0.17), meaning their embeddings are not yet well aligned. Training with SGD will adjust the embeddings to increase $\mathbf{u}_+^\top \mathbf{v}_c$ and decrease $\mathbf{u}_{n_1}^\top \mathbf{v}_c$ and $\mathbf{u}_{n_2}^\top \mathbf{v}_c$, which lowers the loss.
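A quick numeric check, using hypothetical vectors chosen so that the positive dot product is 0.17 as in the walkthrough (the negative vectors are likewise made up for illustration):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Hypothetical vectors; u_c . v_pos = 0.17 by construction
u_c    = np.array([0.5, 0.3, -0.2])   # center word
v_pos  = np.array([0.4, 0.1, 0.3])    # true context word
v_negs = np.array([[0.3, -0.4, 0.2],  # two sampled negatives
                   [-0.2, 0.5, 0.1]])

pos_term  = -np.log(sigmoid(u_c @ v_pos))       # pull the true pair together
neg_terms = -np.log(sigmoid(-(v_negs @ u_c)))   # push the negatives away
loss = pos_term + neg_terms.sum()               # ~2.01 for these vectors
```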
Evaluation
How do you know if your embeddings are any good? Two approaches.
Intrinsic evaluation
Test the embeddings directly on word-level tasks.
Word analogy. Given “a is to b as c is to __,” find the word $d$ that maximizes $\cos(\mathbf{u}_d, \mathbf{u}_b - \mathbf{u}_a + \mathbf{u}_c)$. Standard benchmarks include the Google analogy dataset (syntactic: “run” to “running” as “swim” to “swimming”; semantic: “Paris” to “France” as “Berlin” to “Germany”).
Word similarity. Compare the cosine similarity ranking of word pairs against human judgments. Datasets like WordSim-353 and SimLex-999 provide human-rated similarity scores. You compute the Spearman rank correlation between your model’s similarities and the human scores.
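The Spearman correlation is just the Pearson correlation of the ranks. A small sketch with toy data (the word pairs and scores are invented for illustration; this simple version assumes no tied scores):

```python
import numpy as np

def spearman_rho(a, b):
    """Spearman correlation: Pearson correlation of the ranks (no ties assumed)."""
    ranks_a = np.argsort(np.argsort(a)).astype(float)
    ranks_b = np.argsort(np.argsort(b)).astype(float)
    return np.corrcoef(ranks_a, ranks_b)[0, 1]

# Toy word pairs, e.g. (cat, dog), (car, truck), (cup, idea), (noon, string)
human = np.array([9.0, 7.5, 3.0, 1.0])      # human similarity ratings
model = np.array([0.82, 0.64, 0.30, 0.11])  # model cosine similarities
rho = spearman_rho(human, model)  # 1.0: the model ranks the pairs like humans
```

Because only the ranking matters, the model's raw similarity scale is irrelevant; a model that orders the pairs correctly scores 1.0 even if its cosine values look nothing like the human ratings.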
Extrinsic evaluation
Use the embeddings as input features in a downstream task and measure task performance. Common tasks include named entity recognition (NER), sentiment analysis, and text classification. Better embeddings generally lead to better downstream performance.
Extrinsic evaluation is more expensive but more meaningful. Embeddings that score well on analogy tasks do not always produce the best results on real applications.
Limitations of static embeddings
All the methods above produce a single, fixed vector per word. This is a serious limitation because many words have multiple meanings.
Consider “bank”:
- “I walked along the river bank.” (riverbank)
- “I deposited money at the bank.” (financial institution)
A static embedding gives “bank” one vector that blends both meanings. The model cannot distinguish the two senses from context.
This is the polysemy problem. Static embeddings also struggle with less obvious cases. “Apple” (fruit vs. company) and “play” (theater vs. game) get single vectors that average across all their uses.
Contextual embeddings solve this. Models like ELMo and BERT produce different vectors for the same word depending on the surrounding sentence. The attention mechanism in transformers is central to this capability. These models also benefit from transfer learning: you train once on a large corpus and fine-tune on your specific task.
Still, static embeddings remain useful. They are fast to train, fast to look up, and work well for many tasks. Dimensionality reduction techniques like PCA can visualize them in 2D or 3D, making them valuable for exploratory analysis. For tasks where context matters less (information retrieval, simple classification), Word2Vec and GloVe are solid choices.
What comes next
Static embeddings give every word a fixed vector regardless of context. The next article on transfer learning covers how pretrained representations, including contextual embeddings, can be adapted to new tasks with minimal data. That shift from training embeddings from scratch to fine-tuning pretrained models is one of the biggest practical advances in NLP.