Word embeddings turned the messy, symbolic world of words into something we can calculate with: continuous vectors where geometry encodes meaning. Words that occur in similar contexts end up near each other; arithmetic on vectors can reflect analogies like king − man + woman ≈ queen. In this guide, you'll implement classic word embeddings from scratch—not by downloading a library and calling a function, but by building key components step by step. You’ll learn how the data is prepared, how objectives are designed, how gradients flow, and how to evaluate whether your vectors learned anything useful at all.
Along the way, we’ll balance intuition with concrete implementation details and reliable practices that make small projects work. If you stick with it, you'll finish with a lightweight, reproducible, and understandable embedding pipeline you can adapt to your own corpus.
The core idea behind word embeddings is distributional semantics: words used in similar contexts tend to have similar meanings. A numerical embedding compresses distributional information into a dense vector of, say, 50–300 dimensions so we can compute similarities and feed them to machine learning models. Unlike one-hot vectors, embeddings are compact, continuous, and expressive.
What 'from scratch' means here:
It doesn’t have to mean zero libraries. You can absolutely use NumPy and SciPy for arrays and SVD; what you avoid is an off-the-shelf embedding trainer that hides the learning process. You’ll understand where each matrix comes from, what its rows/columns mean, and how training progresses.
Practical benefits of doing it this way include a clear view of where every matrix comes from, the ability to debug and modify each stage, and a pipeline you can adapt to your own corpus and constraints.
Embeddings are only as good as your data. A small, noisy corpus yields weak vectors no matter how clever the model is. Before training, you need to:
Choose your corpus
Normalize and tokenize
Build a vocabulary
Create training pairs with a sliding window
Example: a minimal tokenizer and pair builder in Python, enough for small experiments:
import re
from collections import Counter

# Basic tokenizer: splits on non-word characters, keeps simple words
TOKEN_RE = re.compile(r"\w+")

def tokenize(text):
    tokens = TOKEN_RE.findall(text.lower())
    # Map numbers to a placeholder
    return ['<num>' if t.isdigit() else t for t in tokens]

# Build vocabulary with min_count
def build_vocab(tokenized_docs, min_count=5):
    counter = Counter()
    for doc in tokenized_docs:
        counter.update(doc)
    vocab = {'<unk>': 0}
    for w, c in counter.items():
        if c >= min_count and w != '<unk>':
            vocab[w] = len(vocab)
    return vocab

# Convert tokens to ids, mapping out-of-vocabulary words to <unk>
def to_ids(tokens, vocab):
    unk = vocab['<unk>']
    return [vocab.get(t, unk) for t in tokens]

# Create (center, context) pairs using a symmetric window
def make_pairs(ids, window=5):
    pairs = []
    for i, center in enumerate(ids):
        left = max(0, i - window)
        right = min(len(ids), i + window + 1)
        for j in range(left, right):
            if j != i:
                pairs.append((center, ids[j]))
    return pairs
In real projects, consider spaCy or NLTK for tokenization, and clean aggressively (deduplicate, remove boilerplate).
Before learning embeddings, build mental models with simple baselines:
Key insights: a word can be characterized by the words that appear near it, so two words with similar co-occurrence profiles should receive similar vectors; raw counts overweight frequent words, which motivates the reweighting used in the next section.
Toy example:
Suppose your corpus is 'cats chase mice' and 'dogs chase cats'. With window size 1, the contexts of 'chase' include 'cats', 'mice', and 'dogs'. The co-occurrence counts will reflect that 'cats' and 'dogs' both appear around 'chase'; on a larger corpus with many sentences like these, their co-occurrence profiles would look similar, so they would be embedded closer together than, say, 'mice' and 'dogs', which share fewer contexts.
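You can verify the toy example with the helpers defined above (tokenize, build_vocab, to_ids, make_pairs); this quick sketch just counts the (center, context) pairs for the two sentences with window=1:

from collections import Counter

toy_docs = [tokenize('cats chase mice'), tokenize('dogs chase cats')]
toy_vocab = build_vocab(toy_docs, min_count=1)
toy_ids = [to_ids(doc, toy_vocab) for doc in toy_docs]

pair_counts = Counter()
for ids in toy_ids:
    pair_counts.update(make_pairs(ids, window=1))

id_to_word = {i: w for w, i in toy_vocab.items()}
for (c, o), n in pair_counts.most_common():
    print(id_to_word[c], '->', id_to_word[o], n)
# Both 'cats' and 'dogs' show up as contexts of 'chase',
# which is what will pull them toward each other later.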
Limitations: raw count vectors are huge and sparse (one dimension per vocabulary word), frequent words dominate the counts, and nothing is learned for words outside the training vocabulary.
Count-based embeddings compress a large co-occurrence matrix into a dense representation using matrix factorization. The steps: build a word-by-word co-occurrence matrix with a sliding window, reweight the counts (positive pointwise mutual information, PPMI, is a standard choice), and factorize the result with a truncated SVD to obtain low-dimensional vectors.
Pros and cons: on the plus side, the method is simple, deterministic given the counts, and a surprisingly strong baseline on small-to-medium corpora; on the minus side, the co-occurrence matrix can become memory-heavy for large vocabularies and the SVD step gets expensive as the matrix grows.
Minimal implementation sketch:
import numpy as np
from scipy.sparse import dok_matrix, csr_matrix
from sklearn.decomposition import TruncatedSVD

def build_cooc(docs_ids, vocab_size, window=5):
    # Accumulate symmetric co-occurrence counts in a sparse matrix
    X = dok_matrix((vocab_size, vocab_size), dtype=np.float32)
    for ids in docs_ids:
        for i, wi in enumerate(ids):
            left = max(0, i - window)
            right = min(len(ids), i + window + 1)
            for j in range(left, right):
                if j != i:
                    X[wi, ids[j]] += 1.0
    return X.tocsr()

def ppmi_matrix(X):
    # X is a csr sparse count matrix
    total = X.sum()
    row_sums = np.array(X.sum(axis=1)).ravel()
    col_sums = np.array(X.sum(axis=0)).ravel()
    # PMI(i,j) = log(P(i,j)/(P(i)P(j))) = log(X_ij * total / (row_i * col_j))
    X = X.tocoo()
    data = []
    for i, j, xij in zip(X.row, X.col, X.data):
        pmi = np.log((xij * total) / (row_sums[i] * col_sums[j] + 1e-10) + 1e-10)
        data.append(max(pmi, 0.0))  # positive PMI: clip negative values to zero
    return csr_matrix((data, (X.row, X.col)), shape=X.shape)

def svd_embed(P, dim=100, alpha=0.5):
    svd = TruncatedSVD(n_components=dim, random_state=42)
    US = svd.fit_transform(P)             # fit_transform returns U * Sigma
    S = svd.singular_values_
    U = US / np.maximum(S, 1e-12)         # recover U
    W = U * (S ** alpha)                  # rescale by Sigma^alpha (0.5 is a common choice)
    return W
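Putting the pieces together, assuming you have already converted your corpus to id lists with build_vocab and to_ids (called vocab and docs_ids here):

# Count-based pipeline: counts -> PPMI -> truncated SVD
X = build_cooc(docs_ids, vocab_size=len(vocab), window=5)
P = ppmi_matrix(X)
W_svd = svd_embed(P, dim=100)   # rows of W_svd align with vocab indices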
This approach echoes ideas from classic latent semantic analysis and modern variants like GloVe (which fits vectors to log co-occurrence counts with a weighted least-squares objective rather than using PMI explicitly). For small-to-medium corpora, SVD-based embeddings can be surprisingly strong baselines.
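For reference, the GloVe objective from the original paper has the form

J = sum over (i, j) of f(X_ij) * (w_i · w~_j + b_i + b~_j − log X_ij)^2

where w_i and w~_j are word and context vectors, b_i and b~_j are biases, and f(x) = (x / x_max)^0.75 for x < x_max (and 1 otherwise) damps the influence of very frequent pairs.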
Predictive embeddings learn to predict a word from its context (CBOW) or predict context words from a center word (Skip-gram). Two parameter matrices are learned: an input matrix W_in with one row per vocabulary word (used when the word is the center/input) and an output matrix W_out with one row per word (used when it appears as a context/output).
CBOW (Continuous Bag-of-Words): average (or sum) the context word vectors and use the result to predict the center word; it trains quickly and smooths over noisy contexts.
Skip-gram (SG): use the center word's vector to predict each surrounding context word; it is slower per token but tends to do better on rare words and smaller corpora.
In both cases, naive training with softmax over |V| classes per prediction is expensive. Approximations—negative sampling or hierarchical softmax—reduce cost dramatically.
What do W_in and W_out represent? After training, W_in rows are typically used as the embeddings. W_out may capture complementary information; some implementations average them or use W_in only. Empirically, W_in vectors are standard for downstream use.
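In code, the choice is simply which matrix you read out after training. A sketch, where model is a trainer exposing both matrices (like the SGNS class implemented below):

# Standard choice: use the input vectors as the embeddings
W = model.W_in
# Optional variant some implementations use: average input and output vectors
W_avg = 0.5 * (model.W_in + model.W_out)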
You want a loss that makes true (center, context) pairs score higher than random pairs. Three options:
Full softmax: model P(context | center) with a softmax over the entire vocabulary; exact, but every update costs O(|V|).
Negative Sampling (SGNS): replace the softmax with binary classifications that separate the true (center, context) pair from k words drawn from a noise distribution; each update costs O(k).
Hierarchical Softmax: arrange the vocabulary as a binary tree (often Huffman-coded by frequency) so each prediction is a path of O(log |V|) binary decisions.
When to use which: negative sampling is the usual default and the simplest to implement well; hierarchical softmax helps when you want (approximately) normalized probabilities or have a very large vocabulary with many rare words; full softmax is practical only for small vocabularies. A sketch of the per-pair SGNS loss follows.
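To make the objective concrete before diving into the trainer, here is a minimal sketch of the loss for a single (center, context) pair with k sampled negatives; h, u_pos, and u_negs stand in for the corresponding rows of W_in and W_out, and the implementation below takes gradient steps on exactly this quantity:

import numpy as np

# Per-pair SGNS loss: -log sigmoid(h.u_pos) - sum_k log sigmoid(-h.u_neg_k)
def sgns_pair_loss(h, u_pos, u_negs):
    def log_sigmoid(x):
        # log(sigmoid(x)), computed stably as -log(1 + exp(-x))
        return -np.logaddexp(0.0, -x)
    pos_term = log_sigmoid(h @ u_pos)             # reward the true pair
    neg_term = log_sigmoid(-(u_negs @ h)).sum()   # penalize high scores for noise pairs
    return -(pos_term + neg_term)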
Below is a compact but complete SGNS implementation for learning word embeddings with NumPy. It is optimized for clarity rather than speed, so you can understand each step.
What it does: builds a unigram noise distribution raised to the 0.75 power, samples k negatives per positive pair, and applies SGD updates to W_in and W_out for every (center, context) pair, with a slowly decaying learning rate.
import numpy as np

class SGNS:
    def __init__(self, vocab_size, dim=100, neg_k=5, lr=0.025, seed=42):
        rng = np.random.default_rng(seed)
        self.vocab_size = vocab_size
        self.dim = dim
        self.neg_k = neg_k
        self.lr0 = lr
        # Initialize embeddings: small random values
        self.W_in = (rng.random((vocab_size, dim)) - 0.5) / dim
        self.W_out = (rng.random((vocab_size, dim)) - 0.5) / dim
        self.rng = rng

    @staticmethod
    def sigmoid(x):
        # Numerically stable sigmoid for arrays
        out = np.empty_like(x, dtype=np.float64)
        pos = x >= 0
        neg = ~pos
        out[pos] = 1.0 / (1.0 + np.exp(-x[pos]))
        expx = np.exp(x[neg])
        out[neg] = expx / (1.0 + expx)
        return out

    def build_unigram_table(self, token_counts, power=0.75, table_size=100_000):
        # token_counts: array of length |V|
        counts = np.array(token_counts, dtype=np.float64)
        counts[0] = max(counts[0], 1.0)  # ensure <unk> has some mass
        probs = counts ** power
        probs /= probs.sum()
        # Build a sampling table: word i occupies a share of slots proportional to probs[i]
        table = np.zeros(table_size, dtype=np.int32)
        cumulative = np.cumsum(probs)
        j = 0
        for i in range(table_size):
            # find the smallest j with cumulative[j] >= (i + 1) / table_size
            r = (i + 1) / table_size
            while j < len(cumulative) and cumulative[j] < r:
                j += 1
            table[i] = min(j, len(cumulative) - 1)
        self.unigram_table = table

    def sample_negatives(self, exclude, size):
        # Sample from the precomputed table, resampling if equal to the true context
        samples = []
        while len(samples) < size:
            idx = int(self.unigram_table[self.rng.integers(0, len(self.unigram_table))])
            if idx != exclude:
                samples.append(idx)
        return np.array(samples, dtype=np.int32)

    def train(self, pairs, epochs=1, batch_size=512):
        # pairs: list of (center, context) ints
        n = len(pairs)
        for epoch in range(epochs):
            # Shuffle pairs each epoch
            self.rng.shuffle(pairs)
            lr = self.lr0
            for start in range(0, n, batch_size):
                end = min(start + batch_size, n)
                batch = pairs[start:end]
                # process each pair independently (could be vectorized further)
                for c, o in batch:
                    h = self.W_in[c].copy()   # center vector (copy: the row is updated below)
                    u_pos = self.W_out[o]     # positive context vector
                    negs = self.sample_negatives(exclude=o, size=self.neg_k)
                    u_negs = self.W_out[negs]
                    # Gradient coefficients of the SGNS loss
                    score_pos = float(h @ u_pos)
                    grad_pos = float(self.sigmoid(np.array([-score_pos]))[0])  # = 1 - sigmoid(score_pos)
                    score_negs = u_negs @ h
                    grad_negs = self.sigmoid(score_negs)
                    # Step direction for the center vector (note the minus on the negative term),
                    # computed before W_out changes
                    grad_h = grad_pos * u_pos - grad_negs @ u_negs
                    # Update the positive and negative output vectors using the old h
                    self.W_out[o] += lr * (grad_pos * h)
                    self.W_out[negs] -= lr * (grad_negs[:, None] * h[None, :])
                    # Update the center vector
                    self.W_in[c] += lr * grad_h
                # simple multiplicative decay of the learning rate
                lr = max(1e-4, lr * 0.999)
        return self.W_in
Usage sketch with the preprocessing utilities above:
# Suppose tokenized_docs is a list of token lists
vocab = build_vocab(tokenized_docs, min_count=5)
docs_ids = [to_ids(doc, vocab) for doc in tokenized_docs]

# Build (center, context) pairs for all docs
pairs = []
for ids in docs_ids:
    pairs.extend(make_pairs(ids, window=5))

# Compute counts for the noise distribution
counts = [0] * len(vocab)
for ids in docs_ids:
    for w in ids:
        counts[w] += 1

model = SGNS(vocab_size=len(vocab), dim=100, neg_k=5, lr=0.025)
model.build_unigram_table(counts, power=0.75, table_size=200_000)
W = model.train(pairs, epochs=2, batch_size=1024)
# W is the embedding matrix; rows align with vocab indices
Notes:
Learning rate schedule
Subsampling frequent words (a sketch follows this list)
Mini-batching and vectorization
Negative sampling cache
Initialization
Regularization
Reproducibility
Profiling
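On subsampling: a common recipe, following the word2vec paper, is to randomly drop very frequent tokens before pair generation, with a drop probability that grows with the word's corpus frequency. A minimal sketch, assuming the docs_ids and counts structures built in the usage sketch above:

import numpy as np

# Word2vec-style subsampling: drop token w with probability
# p_drop(w) = 1 - sqrt(t / f(w)), where f(w) is its relative frequency
# and t is a small threshold (values around 1e-5 to 1e-3 are typical).
def subsample(docs_ids, counts, t=1e-4, seed=42):
    rng = np.random.default_rng(seed)
    total = float(sum(counts))
    freqs = [c / total for c in counts]
    kept_docs = []
    for ids in docs_ids:
        kept = []
        for w in ids:
            p_drop = max(0.0, 1.0 - np.sqrt(t / freqs[w])) if freqs[w] > 0 else 0.0
            if rng.random() >= p_drop:
                kept.append(w)
        kept_docs.append(kept)
    return kept_docs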
Rare words are challenging because they offer few contexts. You can mitigate this:
min_count threshold (an OOV-rate sketch follows this list)
Subword modeling (character n-grams)
Byte-Pair Encoding (BPE) or unigram LM tokenization
Backoff strategies
Domain adaptation
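One way to pick the min_count threshold is to measure how much of your text falls out of vocabulary at each setting. A quick sketch using the build_vocab helper above; the train/held-out split (train_docs, heldout_docs) is a hypothetical split of your tokenized documents:

# Estimate the out-of-vocabulary (OOV) rate for different min_count values
def oov_rate(train_docs, heldout_docs, min_count):
    vocab = build_vocab(train_docs, min_count=min_count)
    total, oov = 0, 0
    for doc in heldout_docs:
        for tok in doc:
            total += 1
            if tok not in vocab:
                oov += 1
    return oov / max(total, 1)

for mc in (1, 2, 5, 10):
    print(mc, oov_rate(train_docs, heldout_docs, mc))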
Evaluation tells you whether your vectors reflect meaningful semantics. Two broad categories:
Intrinsic evaluation
Word similarity/relatedness
Analogies (a:b :: c:?), with a 3CosAdd sketch after the neighbor utility below
Neighborhood inspection
Extrinsic evaluation
Simple nearest neighbor utility:
import numpy as np

def normalize_rows(W):
    norms = np.linalg.norm(W, axis=1, keepdims=True) + 1e-10
    return W / norms

def nearest_neighbors(W, vocab, query_word, topk=10):
    # W is the embedding matrix; vocab is a word->id dict
    if query_word not in vocab:
        return []
    Wn = normalize_rows(W)
    qid = vocab[query_word]
    sims = Wn @ Wn[qid]
    # exclude the query itself
    sims[qid] = -np.inf
    ids = set(np.argpartition(-sims, topk)[:topk])
    return sorted([(w, float(sims[i])) for w, i in vocab.items() if i in ids],
                  key=lambda x: -x[1])
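For the analogy test mentioned above, the standard 3CosAdd recipe looks for the word closest to vec(b) − vec(a) + vec(c); a minimal sketch reusing normalize_rows:

# 3CosAdd analogy solver: a is to b as c is to ? (e.g., 'man' : 'king' :: 'woman' : ?)
def analogy(W, vocab, a, b, c, topk=5):
    if any(w not in vocab for w in (a, b, c)):
        return []
    Wn = normalize_rows(W)
    target = Wn[vocab[b]] - Wn[vocab[a]] + Wn[vocab[c]]
    target /= (np.linalg.norm(target) + 1e-10)
    sims = Wn @ target
    # exclude the three query words themselves
    for w in (a, b, c):
        sims[vocab[w]] = -np.inf
    ids = set(np.argpartition(-sims, topk)[:topk])
    return sorted([(w, float(sims[i])) for w, i in vocab.items() if i in ids],
                  key=lambda x: -x[1])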
Track metrics over epochs; if cosine neighborhoods improve and loss decreases, you’re on the right path.
Visualization helps you internalize what the model learned:
Example with scikit-learn:
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

# Pick a subset of words to visualize (e.g., top 1000 frequent)
ids_to_plot = list(range(1000))
W_subset = W[ids_to_plot]
id_set = set(ids_to_plot)
labels = [w for w, i in sorted(vocab.items(), key=lambda x: x[1]) if i in id_set]

X2 = TSNE(n_components=2, perplexity=30, learning_rate='auto',
          init='random', random_state=42).fit_transform(W_subset)

plt.figure(figsize=(10, 10))
plt.scatter(X2[:, 0], X2[:, 1], s=2, alpha=0.6)
for i, lbl in enumerate(labels[:200]):  # annotate a subset for readability
    plt.annotate(lbl, (X2[i, 0], X2[i, 1]), fontsize=8, alpha=0.7)
plt.title('t-SNE of word embeddings')
plt.show()
Look for coherent clusters: months, countries, technology terms, emotions. If the plot is a noisy blob, re-check preprocessing and training.
Loss not decreasing (a loss-monitoring sketch follows this list)
Degenerate vectors (all zeros or exploding norms)
Negative sampling bugs
Window construction mistakes
Off-by-one errors in vocab mapping
Memory overload with large co-occurrence
Evaluation mismatches
Reproducibility drift
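One simple check for the first item above: track the average SGNS loss on a fixed sample of pairs between epochs, reusing the sgns_pair_loss sketch from the loss-design section. Here model is an SGNS instance with its unigram table built, and eval_pairs is any fixed sample of (center, context) pairs:

def average_loss(model, eval_pairs):
    # Average per-pair SGNS loss; this should trend downward across epochs
    total = 0.0
    for c, o in eval_pairs:
        negs = model.sample_negatives(exclude=o, size=model.neg_k)
        total += sgns_pair_loss(model.W_in[c], model.W_out[o], model.W_out[negs])
    return total / max(len(eval_pairs), 1)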
From-scratch embeddings make sense when: your text is highly domain-specific (clinical notes, legal filings, source code, logs), your vocabulary overlaps poorly with general web text, your data cannot leave your environment, or the goal is to understand the method itself.
Pretrained embeddings are better when: your corpus is small, your domain is close to general-purpose text, or you need solid results quickly without building and tuning a training pipeline.
Hybrid approach: initialize from pretrained vectors where the vocabularies overlap and continue training on your own corpus, so domain terms are learned while common words keep their general-purpose geometry.
Beyond classic word embeddings:
Subword (fastText-style)
Sketch for character n-grams:
def char_ngrams(word, min_n=3, max_n=6):
    # Wrap the word in boundary markers so prefixes and suffixes become distinct n-grams
    word = f'<{word}>'
    grams = set()
    for n in range(min_n, max_n + 1):
        for i in range(len(word) - n + 1):
            grams.add(word[i:i + n])
    return grams
You would maintain an embedding table for all n-grams, hash them into a fixed-size bucket array, and use the sum as the word representation during training and inference.
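A sketch of that lookup, assuming a bucket array of hashed n-gram vectors; the bucket count and names here are illustrative (fastText's default is about 2 million buckets, reduced here to keep the example light):

import zlib
import numpy as np

N_BUCKETS = 100_000   # hash buckets for n-grams (fastText uses ~2M)
DIM = 100
rng = np.random.default_rng(42)
# One trainable vector per bucket; in a real trainer these receive gradient updates too
NGRAM_TABLE = (rng.random((N_BUCKETS, DIM)) - 0.5) / DIM

def ngram_bucket(gram, n_buckets=N_BUCKETS):
    # zlib.crc32 is stable across runs, unlike Python's built-in hash()
    return zlib.crc32(gram.encode('utf-8')) % n_buckets

def subword_vector(word, table=NGRAM_TABLE):
    # Sum the hashed n-gram vectors to form the word representation
    buckets = [ngram_bucket(g) for g in char_ngrams(word)]
    return table[buckets].sum(axis=0)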
Contextual embeddings (ELMo, BERT, GPT)
When to switch: if your task depends on context-dependent word meaning (polysemy, long documents, strong accuracy targets) and you have the compute budget, contextual models are usually worth it; static embeddings remain a good fit for lightweight features, fast lookups, and small systems.
Follow this compact plan to go from raw text to usable vectors:
Data and preprocessing
Pair generation
Modeling choice
Optimization
Monitoring
Evaluation
Refinement
Packaging
Documentation
Reusability
A minimalist example of saving vectors:
import json
import numpy as np

# vocab: dict word->id; W: np.ndarray
id_to_word = {i: w for w, i in vocab.items()}
np.save('embeddings.npy', W)
with open('vocab.json', 'w') as f:
    json.dump(id_to_word, f)

# Later, load and query
W = np.load('embeddings.npy')
with open('vocab.json') as f:
    id_to_word = json.load(f)
word_to_id = {w: int(i) for i, w in id_to_word.items()}
As you go through these steps, be patient with iteration. Embeddings improve with better data curation and careful hyperparameter tuning.
Training word embeddings from scratch is an empowering exercise: you see how textual co-occurrence becomes geometry, how objectives sculpt the space, and how simple tricks like negative sampling and subsampling make the difference between theory and practice. With a small set of utilities and a clear pipeline, you can produce vectors tailored to your domain, inspect what they’ve learned, and deploy them as compact, effective features in downstream systems. Keep your code simple, your measurements honest, and your corpus clean, and your vectors will tell a compelling story about your text.