RAG - Dense Embedding

#ai #rag #llm #beginners

Ramya Perumal


"Dense" here means that nearly every entry of the vector carries a value, and those values are continuous (real-valued) numbers.

When text is converted into such a vector (a point in space), the result is called a dense embedding.

Unlike sparse vectors, where many values are zero, dense vectors contain meaningful numerical values across most dimensions.

Example

A dense vector may look like:
[0.123, -0.456, 0.789, 0.245, ...]

Multi-Dimensional Representation

Each vector is represented in an n-dimensional space.
This means:

Every value in the vector represents one dimension
Most dimensions hold a non-zero numerical value
Similar meanings are stored closer together in vector space

The mathematical space in which these vectors live is called the latent space (or embedding space).

Words or sentences with similar meanings are usually positioned closer together inside this latent space.

How Dense Embeddings are Generated

To convert text into vectors, we can use:

Embedding Models
Examples:

nomic-embed-text
BGE (BAAI General Embedding) models, from the Beijing Academy of Artificial Intelligence

Transformer Models
Examples:

all-MiniLM-L6-v2
Nomic Transformer

These models are commonly available through:

Hugging Face
Ollama
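As a rough illustration of the conversion step, the sketch below wraps an embedding model behind a small helper. It assumes the `sentence-transformers` package is installed and that the `all-MiniLM-L6-v2` weights can be downloaded from Hugging Face. The helper name `embed_texts` and its `model` parameter are made up for this example.

```python
def embed_texts(texts, model=None, model_name="all-MiniLM-L6-v2"):
    """Return one dense vector (a list of floats) per input text.

    `model` may be any object with an `encode(list_of_str)` method. When it is
    omitted, a SentenceTransformer is loaded (assumes the package is installed).
    """
    if model is None:
        # Imported lazily so the helper can be exercised without the dependency.
        from sentence_transformers import SentenceTransformer
        model = SentenceTransformer(model_name)
    vectors = model.encode(texts)
    # Convert to plain Python floats so the result is easy to store or print.
    return [[float(value) for value in vector] for vector in vectors]


# Usage (downloads the model on first run):
# vectors = embed_texts(["What is RAG?", "Dense embeddings capture meaning."])
# len(vectors[0])  # all-MiniLM-L6-v2 produces 384-dimensional vectors
```

Keeping the model behind one function makes it easier to guarantee that ingestion and retrieval go through the same code path.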

Relationship Between LLMs and Transformers

LLMs are built on the transformer architecture.

The original transformer design contains two main parts:

Encoder
Decoder

Encoder
The encoder converts text into embeddings (vectors).

Decoder
The decoder processes embeddings and generates human-readable text.

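Many embedding models, such as all-MiniLM-L6-v2 and the BGE family, use only the encoder part to generate vector representations. Many generative LLMs, by contrast, use only the decoder part.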

Methods to Generate Embeddings

Embeddings can be generated in two ways:

1. Using Dedicated Embedding Models

These models are specifically trained for embedding generation.

Examples

nomic-embed-text
BGE models

This is the most common and efficient approach in RAG systems.

2. Using General LLMs Through Prompting

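A general-purpose LLM can also be prompted to convert text into vector representations, although it is not trained for that task the way a dedicated embedding model is.

Relying on an LLM in place of a dedicated embedding step is also the idea behind vectorless RAG systems, which skip vector similarity search altogether.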

Disadvantages
Higher computational cost
Slower performance
More token consumption

Measuring Embedding and Retrieval Accuracy

To measure retrieval accuracy effectively, unit tests should be written for the RAG pipeline.

The test cases should include:

Expected inputs
Expected outputs
Different query scenarios
Edge cases
Semantic similarity checks

This helps evaluate how accurately the pipeline retrieves relevant information with a given embedding model.
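The kind of test described above can be sketched as a small hit-rate check. Everything here is a toy assumption: `toy_embed` is a word-count stand-in for a real embedding model, and the chunks, vocabulary, and queries are invented for illustration.

```python
import math


def toy_embed(text, vocab):
    """Hypothetical stand-in for an embedding model: word counts over a fixed vocabulary."""
    words = text.lower().split()
    return [float(words.count(term)) for term in vocab]


def cosine(a, b):
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return sum(x * y for x, y in zip(a, b)) / norm if norm else 0.0


def retrieve(query, chunks, vocab, k):
    """Return the ids of the top-k chunks with a non-zero similarity to the query."""
    q = toy_embed(query, vocab)
    scored = [(cosine(q, toy_embed(text, vocab)), cid) for cid, text in chunks.items()]
    ranked = sorted((s, cid) for s, cid in scored if s > 0)
    return [cid for _, cid in reversed(ranked)][:k]


def hit_rate_at_k(test_cases, chunks, vocab, k):
    """Fraction of queries whose expected chunk appears in the top-k results."""
    hits = sum(1 for query, expected in test_cases
               if expected in retrieve(query, chunks, vocab, k))
    return hits / len(test_cases)


VOCAB = ["refund", "shipping", "password", "reset", "days", "order"]
CHUNKS = {
    "refunds": "refund requests are accepted within 30 days of the order",
    "shipping": "shipping takes 5 days for every order",
    "account": "reset your password from the account page",
}
TEST_CASES = [
    ("how do I get a refund", "refunds"),
    ("password reset steps", "account"),
    ("shipping time", "shipping"),
    ("I want my money back", "refunds"),  # semantic match with no shared keyword
]

print(hit_rate_at_k(TEST_CASES, CHUNKS, VOCAB, k=1))
assert retrieve("", CHUNKS, VOCAB, k=1) == []  # edge case: empty query
```

The word-count stand-in misses the "money back" query because it shares no keyword with the refund chunk. That is exactly the kind of gap a semantic similarity check is meant to expose when comparing embedding models.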

Similarity Methods Used in Dense Embeddings

Dense embeddings are usually compared with one of the following similarity measures:

Cosine Similarity

This is the most commonly used similarity method in RAG applications.

It measures the cosine of the angle between two vectors, so it depends on their direction rather than their length or the distance between them.

The more closely two vectors point in the same direction, the closer the score gets to 1.

Euclidean Distance

Measures the straight-line distance between vectors in vector space. A smaller distance means the vectors are more similar.

Dot Product

Measures similarity by multiplying corresponding vector values and summing them. For vectors normalized to unit length, it gives the same result as cosine similarity.
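To make the three measures concrete, here is a minimal pure-Python sketch. The vectors are made-up toy values, not real embeddings, and the function names are chosen for this example.

```python
import math


def dot(a, b):
    """Sum of element-wise products."""
    return sum(x * y for x, y in zip(a, b))


def cosine_similarity(a, b):
    """Cosine of the angle between a and b; ignores vector length."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


def euclidean_distance(a, b):
    """Straight-line distance; smaller means more similar."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


a = [1.0, 2.0, 3.0]
b = [2.0, 4.0, 6.0]   # same direction as a, twice the length
c = [3.0, 0.0, -1.0]  # perpendicular to a

print(cosine_similarity(a, b))   # close to 1: same direction
print(cosine_similarity(a, c))   # 0: unrelated directions
print(euclidean_distance(a, b))  # non-zero, even though the direction matches
print(dot(a, b))                 # grows with vector length
```

Vectors `a` and `b` show the difference in practice: cosine similarity treats them as a near-perfect match, while Euclidean distance and dot product are both affected by the difference in length.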

Why the Same Embedding Model Must Be Used

The same embedding model should be used for both:

Data ingestion phase
Retrieval phase

If different embedding models are used, the generated vectors live in different latent spaces and may even have different numbers of dimensions.

As a result:

Similarity calculations become inaccurate
Retrieval quality decreases
Relevant chunks may not be retrieved correctly

Using the same embedding model ensures that both stored documents and user queries are represented consistently in the same vector space.
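One way to enforce this rule is to store the model name next to the vectors and check it on every call. The class below is a hypothetical in-memory sketch, not the API of any particular vector database. The dimensions used (384 for all-MiniLM-L6-v2, 768 for nomic-embed-text) are the output sizes of those models.

```python
class VectorIndex:
    """Hypothetical in-memory index that records which model produced its vectors."""

    def __init__(self, model_name, dimension):
        self.model_name = model_name
        self.dimension = dimension
        self.vectors = {}

    def _check(self, vector, model_name):
        # Matching dimensions alone are not enough: two models with the same
        # output size still produce vectors in unrelated spaces.
        if model_name != self.model_name:
            raise ValueError(
                f"index built with {self.model_name!r}, but vector came from {model_name!r}"
            )
        if len(vector) != self.dimension:
            raise ValueError(f"expected {self.dimension} dimensions, got {len(vector)}")

    def add(self, chunk_id, vector, model_name):
        """Ingestion phase: store a chunk vector after checking its origin."""
        self._check(vector, model_name)
        self.vectors[chunk_id] = vector

    def check_query(self, vector, model_name):
        """Retrieval phase: refuse query vectors from a different model."""
        self._check(vector, model_name)
        return True


index = VectorIndex(model_name="all-MiniLM-L6-v2", dimension=384)
index.add("chunk-1", [0.0] * 384, model_name="all-MiniLM-L6-v2")

try:
    index.check_query([0.0] * 768, model_name="nomic-embed-text")
except ValueError as error:
    print("rejected:", error)
```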

Sparse Embeddings

Sparse embeddings are built with term-weighting schemes such as TF-IDF and BM25.

In sparse embeddings, vectors are generated mainly based on keyword frequency and importance rather than semantic meaning.

The combination of BM25 and vector search is called hybrid search.

Tools such as OpenSearch and Elasticsearch support hybrid search by combining:

Traditional keyword-based retrieval
Semantic vector-based retrieval
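The article does not say how the two result lists are merged. One common option is reciprocal rank fusion (RRF), sketched below as an assumption rather than a description of what OpenSearch or Elasticsearch do by default. The document ids and the constant `k=60` are illustrative.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists of ids; each list adds 1 / (k + rank) per id."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken alphabetically for stable output.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


keyword_ranking = ["doc_a", "doc_b", "doc_c"]  # e.g. from BM25
vector_ranking = ["doc_c", "doc_a", "doc_d"]   # e.g. from dense similarity

print(reciprocal_rank_fusion([keyword_ranking, vector_ranking]))
```

RRF works on ranks only, so it avoids having to put BM25 scores and cosine similarities on a common scale.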

As in one-hot encoding, each dimension of a sparse vector corresponds to one term in the vocabulary. The values are derived from term frequency, so most entries remain 0 and only terms that actually occur in the text receive non-zero weights.

Example

[3.91, 0, 0, 1.62]

In this representation:

Higher values indicate more important or frequently occurring terms
Zero values indicate terms that are absent or not important in the document
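A vector like this can be reproduced with a plain TF-IDF calculation. The sketch below uses the simplest textbook weighting (term count multiplied by `log(N / document frequency)`). Real libraries usually apply smoothed variants, and BM25 adds further adjustments such as document-length normalization. The three documents are made up.

```python
import math
from collections import Counter


def tfidf_vectors(documents):
    """Return (vocabulary, vectors) with one TF-IDF vector per document."""
    tokenized = [doc.lower().split() for doc in documents]
    vocab = sorted({term for doc in tokenized for term in doc})
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for doc in tokenized for term in set(doc))
    vectors = []
    for doc in tokenized:
        counts = Counter(doc)
        vectors.append(
            [counts[term] * math.log(n_docs / doc_freq[term]) for term in vocab]
        )
    return vocab, vectors


docs = ["the cat sat", "the dog barked", "the cat purred"]
vocab, vectors = tfidf_vectors(docs)

print(vocab)
print([round(value, 2) for value in vectors[0]])
```

In the first vector, "sat" gets the highest weight because it appears in only one document, "the" gets 0 because it appears in all of them, and terms that are absent from the document are 0 as well.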

Sparse embeddings mainly focus on exact keyword matching and are highly effective for traditional search use cases.