Embeddings Explained
Turning Words and Sentences into Vectors
Embeddings are a way for computers to turn words, sentences, or other data into lists of numbers called vectors. These vectors let a machine compare meaning mathematically instead of treating text as isolated symbols. Embeddings matter because they power search, recommendation systems, translation, chatbots, and many other AI tools. They give a computer a way to place similar ideas close together in a geometric space.
The basic idea is that a model learns a mapping from text to coordinates in a high dimensional space. If two words or sentences appear in similar contexts, their vectors often point to nearby locations. Once text becomes vectors, a computer can measure similarity with tools like distance or cosine similarity. This makes it possible to cluster related documents, retrieve useful information, and feed language into larger machine learning systems.
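To make this concrete, here is a minimal sketch in plain Python. The four 3-dimensional vectors are invented for illustration only; real embeddings are learned by a model and usually have hundreds of dimensions.

```python
import math

# Toy, hand-made 3-dimensional "embeddings" (real embeddings are learned
# by a model, not written by hand).
vectors = {
    "cat":   [0.9, 0.1, 0.0],
    "dog":   [0.8, 0.2, 0.1],
    "car":   [0.1, 0.9, 0.3],
    "truck": [0.0, 0.8, 0.4],
}

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def nearest(word):
    # Rank every other word by cosine similarity to the query word.
    query = vectors[word]
    others = [(w, cosine_similarity(query, v))
              for w, v in vectors.items() if w != word]
    return max(others, key=lambda pair: pair[1])

print(nearest("cat"))   # ('dog', ...)   -- "dog" is closest to "cat"
print(nearest("car"))   # ('truck', ...) -- "truck" is closest to "car"
```

Because the invented vectors place the two animals near each other and the two vehicles near each other, the nearest-neighbour lookup behaves the way retrieval over real embeddings is meant to: related items are found by geometric closeness rather than by exact word match.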
Key Facts
- An embedding is a vector x = [x1, x2, ..., xn] that represents text as numbers in n dimensions.
- Similar meaning often corresponds to small distance: d(a,b) = sqrt(sum_i (ai - bi)^2).
- Cosine similarity compares direction: cos(theta) = (a · b) / (||a|| ||b||). Both this and the distance formula are turned into code in the sketch after this list.
- Words used in similar contexts tend to learn similar embeddings.
- Sentence embeddings combine information from many tokens into one vector for the whole sentence.
- Higher dimension can capture more features, but it also increases storage and computation cost.
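The distance and cosine formulas above translate almost line for line into code. A minimal sketch with NumPy, using two arbitrary example vectors:

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 2.0, 4.0])

# Euclidean distance: d(a, b) = sqrt(sum_i (a_i - b_i)^2)
distance = np.sqrt(np.sum((a - b) ** 2))   # equivalent to np.linalg.norm(a - b)

# Cosine similarity: cos(theta) = (a . b) / (||a|| ||b||)
cosine = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(round(float(distance), 3))  # 1.414
print(round(float(cosine), 3))    # 0.982
```

If vectors are normalized to unit length, cosine similarity reduces to a plain dot product, which is one reason many systems store embeddings in normalized form.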
Vocabulary
- Embedding: A numerical vector that represents the meaning or features of a word, sentence, or other item.
- Vector space: A mathematical space where each item is placed at coordinates so distances and directions can be compared.
- Dimension: One component or feature of a vector, such as one position in a list of numbers.
- Cosine similarity: A measure of how similar two vectors are based on the angle between them.
- Clustering: The grouping of nearby vectors so that similar items end up in the same region.
Common Mistakes to Avoid
- Treating embeddings as random number lists, which is wrong because each vector is learned to preserve useful patterns of meaning or context.
- Assuming close vectors always mean identical meaning, which is wrong because similar position usually suggests related usage, not perfect synonymy.
- Comparing vectors with raw length only, which is wrong because direction often matters more and cosine similarity is usually more useful (see the sketch after this list).
- Thinking a 2D plot is the full embedding, which is wrong because most real embeddings live in many dimensions and 2D diagrams are simplified projections.
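To illustrate the third mistake, here is a minimal sketch with two arbitrarily chosen vectors that point in exactly the same direction but have very different lengths: Euclidean distance says they are far apart, while cosine similarity treats their directions as identical.

```python
import numpy as np

a = np.array([1.0, 2.0])
b = np.array([10.0, 20.0])   # same direction as a, ten times longer

euclidean = np.linalg.norm(a - b)
cosine = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(euclidean)  # ~20.1 -- "far apart" by raw distance
print(cosine)     # ~1.0  -- identical direction
```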
Practice Questions
1. A word embedding has 128 dimensions. If each dimension is stored as one number, how many numbers are needed to store embeddings for 250 words?
2. Two vectors are a = [1, 2] and b = [4, 6]. Find the Euclidean distance d(a,b) = sqrt((1 - 4)^2 + (2 - 6)^2).
3. Why can embeddings help a search engine return useful results even when the query and the document do not use exactly the same words?
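Questions 1 and 2 can be checked with a short snippet (question 3 is conceptual and has no single numeric answer):

```python
import math

# Question 1: 128 numbers per word embedding, stored for 250 words.
print(128 * 250)

# Question 2: Euclidean distance between a = [1, 2] and b = [4, 6].
print(math.sqrt((1 - 4) ** 2 + (2 - 6) ** 2))
```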