Vector Database — Semantic Search Architecture for AI

Posted on: 4/22/2026 5:12:31 AM

As AI applications increasingly rely on semantic search — from RAG (Retrieval-Augmented Generation) to recommendation systems — the question is no longer "should we use a vector database?" but "which one, with what indexing strategy, and how to deploy it?" This article dives deep into vector database internals: how embeddings work, indexing algorithms (HNSW, IVF, LSH), quantization techniques for memory reduction, and a detailed comparison of the most popular solutions in 2026.

  • $10.6B: projected vector DB market size by 2032
  • 4-32x: memory reduction via Product Quantization
  • <1 ms: query latency with HNSW on millions of vectors
  • 95%+: recall achievable with well-tuned HNSW

1. Embeddings — From Raw Data to Vector Space

Before discussing databases, we need to understand embeddings — the process of converting unstructured data (text, images, audio) into numerical vectors in high-dimensional space. Each vector represents the "semantic meaning" of the original data.

graph LR
    A["Text: 'Caching reduces latency'"] --> B["Embedding Model<br/>(OpenAI, Cohere, BGE)"]
    B --> C["Vector<br/>[0.12, -0.45, 0.78, ..., 0.33]<br/>1536 dimensions"]
    D["Text: 'Cache speeds up queries'"] --> E["Embedding Model"]
    E --> F["Vector<br/>[0.11, -0.43, 0.76, ..., 0.31]<br/>1536 dimensions"]
    C --> G["Cosine Similarity<br/>= 0.97 (very close)"]
    F --> G

    style A fill:#f8f9fa,stroke:#e94560,color:#2c3e50
    style D fill:#f8f9fa,stroke:#e94560,color:#2c3e50
    style B fill:#e94560,stroke:#fff,color:#fff
    style E fill:#e94560,stroke:#fff,color:#fff
    style C fill:#f8f9fa,stroke:#2c3e50,color:#2c3e50
    style F fill:#f8f9fa,stroke:#2c3e50,color:#2c3e50
    style G fill:#4CAF50,stroke:#fff,color:#fff

Two sentences with different wording but close in vector space because they share the same semantics

Popular embedding models in 2026:

| Model | Dimensions | Use Case | Key Feature |
|---|---|---|---|
| OpenAI text-embedding-3-large | 3072 | General purpose | Matryoshka support (flexible dimension reduction) |
| Cohere embed-v4 | 1024 | Multilingual search | 100+ languages, built-in binary quantization |
| BGE-M3 | 1024 | Hybrid (dense + sparse) | Open-source, multi-granularity retrieval |
| Voyage-3 | 1024 | Code & technical docs | Optimized for code search |
| GTE-Qwen2 | 768–8192 | Long context embedding | Supports up to 128K tokens |

Matryoshka Representation Learning (MRL)

A recent technique that allows embedding models to produce "nested" vectors — you can truncate a 3072-d vector down to 256-d while retaining ~90% quality. Extremely useful for saving memory at the cache/filter layer before reranking with full vectors.
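The mechanics are simple to sketch with NumPy. The random vectors below are a stand-in for real MRL embeddings (a model like text-embedding-3-large would supply them); the sketch only illustrates the truncate-and-renormalize step and the memory math, not retrieval quality:

```python
import numpy as np

rng = np.random.default_rng(42)

# Stand-in for a batch of 3072-d MRL embeddings (random data: mechanics only).
full = rng.normal(size=(1000, 3072)).astype(np.float32)
full /= np.linalg.norm(full, axis=1, keepdims=True)

# Matryoshka truncation: keep the leading 256 dims, then re-normalize
# so cosine similarity still works on the short vectors.
short = full[:, :256].copy()
short /= np.linalg.norm(short, axis=1, keepdims=True)

print(full.nbytes // short.nbytes)  # 12: 12x less memory at the filter layer
```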

2. Approximate Nearest Neighbor (ANN) — Why Not Brute Force?

Exact nearest neighbor search over 1 million 1536-d vectors requires roughly 1.5 billion floating-point operations per query, which can take hundreds of milliseconds on a single machine. With 1 billion vectors it becomes infeasible. The solution: Approximate Nearest Neighbor (ANN) search, which accepts recall below 100% in exchange for a 100-1000x speedup.
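The brute-force cost is easy to see in a scaled-down sketch (20K vectors instead of 1M; the operation count grows linearly with dataset size):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 20_000, 1536                     # scaled down from 1M for a quick demo
db = rng.normal(size=(n, d)).astype(np.float32)
db /= np.linalg.norm(db, axis=1, keepdims=True)

# A query very close to vector 123.
q = (db[123] + 0.01 * rng.normal(size=d)).astype(np.float32)
q /= np.linalg.norm(q)

# Brute force: one dot product per stored vector -> n * d multiply-adds
# per query (~31M here, ~1.5B at 1M vectors).
scores = db @ q
top10 = np.argsort(-scores)[:10]
print(int(top10[0]))  # 123
```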

2.1 HNSW (Hierarchical Navigable Small World)

HNSW is the most popular ANN algorithm today, used as the default in most vector databases. Core idea: build a multi-layer hierarchical graph where each node is a vector and each edge connects "nearby" vectors.

graph TD
    subgraph "Layer 2 (Sparse - Long-range links)"
        L2A["Node A"] --- L2D["Node D"]
        L2D --- L2G["Node G"]
    end

    subgraph "Layer 1 (Medium density)"
        L1A["Node A"] --- L1B["Node B"]
        L1B --- L1D["Node D"]
        L1D --- L1F["Node F"]
        L1F --- L1G["Node G"]
    end

    subgraph "Layer 0 (Dense - All nodes)"
        L0A["Node A"] --- L0B["Node B"]
        L0B --- L0C["Node C"]
        L0C --- L0D["Node D"]
        L0D --- L0E["Node E"]
        L0E --- L0F["Node F"]
        L0F --- L0G["Node G"]
        L0A --- L0C
        L0B --- L0D
        L0E --- L0G
    end

    L2A -.-> L1A
    L2D -.-> L1D
    L2G -.-> L1G
    L1A -.-> L0A
    L1B -.-> L0B
    L1D -.-> L0D
    L1F -.-> L0F
    L1G -.-> L0G

    style L2A fill:#e94560,stroke:#fff,color:#fff
    style L2D fill:#e94560,stroke:#fff,color:#fff
    style L2G fill:#e94560,stroke:#fff,color:#fff
    style L1A fill:#2c3e50,stroke:#fff,color:#fff
    style L1B fill:#2c3e50,stroke:#fff,color:#fff
    style L1D fill:#2c3e50,stroke:#fff,color:#fff
    style L1F fill:#2c3e50,stroke:#fff,color:#fff
    style L1G fill:#2c3e50,stroke:#fff,color:#fff
    style L0A fill:#f8f9fa,stroke:#e94560,color:#2c3e50
    style L0B fill:#f8f9fa,stroke:#e94560,color:#2c3e50
    style L0C fill:#f8f9fa,stroke:#e94560,color:#2c3e50
    style L0D fill:#f8f9fa,stroke:#e94560,color:#2c3e50
    style L0E fill:#f8f9fa,stroke:#e94560,color:#2c3e50
    style L0F fill:#f8f9fa,stroke:#e94560,color:#2c3e50
    style L0G fill:#f8f9fa,stroke:#e94560,color:#2c3e50

HNSW structure: search starts at the top layer (sparse), progressively "zooms in" to layer 0 (dense) for more accurate nearest neighbors

Key HNSW parameters:

  • M (max connections per node): Higher → better recall but more RAM. Default typically 16–64.
  • efConstruction: Number of candidates considered during graph construction. Higher → better index quality but slower build.
  • efSearch: Number of candidates considered during query. This is the main "knob" for recall vs latency trade-off at runtime.
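To make the efSearch knob concrete, here is a toy greedy best-first search over layer 0, using a brute-force-built kNN graph as a stand-in for real HNSW construction (real implementations build the graph incrementally and search through multiple layers):

```python
import heapq
import numpy as np

rng = np.random.default_rng(1)
vecs = rng.normal(size=(2000, 64)).astype(np.float32)
vecs /= np.linalg.norm(vecs, axis=1, keepdims=True)

# Toy layer-0 graph: each node linked to its M nearest neighbors (built by
# brute force here; real HNSW builds these links during inserts).
M = 16
neighbors = np.argsort(-(vecs @ vecs.T), axis=1)[:, 1:M + 1]

def search_layer(query, entry, ef):
    """Greedy best-first search with a candidate beam of size ef (efSearch)."""
    dist = lambda i: 1.0 - float(vecs[i] @ query)   # cosine distance
    visited = {entry}
    candidates = [(dist(entry), entry)]             # min-heap: closest first
    best = [(-dist(entry), entry)]                  # max-heap: current top-ef
    while candidates:
        d, node = heapq.heappop(candidates)
        if d > -best[0][0]:                         # nothing can improve top-ef
            break
        for nb in map(int, neighbors[node]):
            if nb in visited:
                continue
            visited.add(nb)
            dn = dist(nb)
            if len(best) < ef or dn < -best[0][0]:
                heapq.heappush(candidates, (dn, nb))
                heapq.heappush(best, (-dn, nb))
                if len(best) > ef:
                    heapq.heappop(best)             # evict current worst
    return sorted((-negd, i) for negd, i in best)

top = search_layer(vecs[0], entry=500, ef=64)  # larger ef: better recall, slower
```

Raising ef widens the beam, so the search escapes more local minima at the cost of more distance computations.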

Complexity Analysis

Build time: O(N × log(N)) — each vector insert traverses the existing graph.
Query time: O(log(N)) — hops through layers from sparse to dense.
Memory: O(N × M × D) — stores both vectors and adjacency lists. This is the biggest drawback: 1 billion vectors × 1536-d × float32 ≈ 6TB for data alone, not counting graph structure.

2.2 IVF (Inverted File Index)

IVF partitions the vector space into nlist clusters using k-means. At query time, only the nearest nprobe clusters are searched instead of the entire dataset.

graph TD
    Q["Query Vector"] --> R["Find nprobe=3<br/>nearest clusters"]
    R --> C1["Cluster 1<br/>50K vectors"]
    R --> C3["Cluster 3<br/>45K vectors"]
    R --> C5["Cluster 5<br/>52K vectors"]
    C2["Cluster 2<br/>48K vectors"]
    C4["Cluster 4<br/>55K vectors"]
    C1 --> RES["Top-K Results"]
    C3 --> RES
    C5 --> RES

    style Q fill:#e94560,stroke:#fff,color:#fff
    style R fill:#2c3e50,stroke:#fff,color:#fff
    style C1 fill:#4CAF50,stroke:#fff,color:#fff
    style C3 fill:#4CAF50,stroke:#fff,color:#fff
    style C5 fill:#4CAF50,stroke:#fff,color:#fff
    style C2 fill:#f8f9fa,stroke:#e0e0e0,color:#999
    style C4 fill:#f8f9fa,stroke:#e0e0e0,color:#999
    style RES fill:#e94560,stroke:#fff,color:#fff

IVF only scans clusters near the query (green), skipping the rest (grey) — reducing vectors to compare by 90%+

IVF advantages: Uses less RAM than HNSW since it only loads needed clusters. Suitable for datasets too large for RAM.
Disadvantages: Requires a training step (k-means) before indexing. Recall depends on clustering quality.
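A minimal IVF sketch in NumPy, with a toy Lloyd's k-means for the training step (synthetic data and unoptimized clustering; real systems use tuned k-means and usually pair IVF with compressed codes):

```python
import numpy as np

rng = np.random.default_rng(2)
n, d, nlist, nprobe = 20_000, 64, 50, 5
db = rng.normal(size=(n, d)).astype(np.float32)

def nearest_centroid(X, C):
    d2 = (X ** 2).sum(1)[:, None] - 2 * X @ C.T + (C ** 2).sum(1)[None, :]
    return np.argmin(d2, axis=1)

# Train: a few rounds of Lloyd's k-means gives the nlist cluster centroids.
centroids = db[rng.choice(n, nlist, replace=False)].copy()
for _ in range(10):
    labels = nearest_centroid(db, centroids)
    for k in range(nlist):
        members = db[labels == k]
        if len(members):
            centroids[k] = members.mean(axis=0)

# Index: inverted lists of vector ids, grouped by nearest centroid.
labels = nearest_centroid(db, centroids)
inverted = [np.where(labels == k)[0] for k in range(nlist)]

# Query: rank centroids, then scan only the nprobe closest clusters.
q = (db[7] + 0.05 * rng.normal(size=d)).astype(np.float32)
probe = np.argsort(((centroids - q) ** 2).sum(axis=1))[:nprobe]
cand = np.concatenate([inverted[k] for k in probe])
found = int(cand[np.argmin(((db[cand] - q) ** 2).sum(axis=1))])
print(found)  # 7: the perturbed source vector sits in a probed cluster
```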

2.3 HNSW vs IVF Comparison

| Criteria | HNSW | IVF |
|---|---|---|
| Query speed | Very fast (sub-ms) | Fast (1-10ms) |
| Memory | High (stores graph + vectors) | Lower (only centroids + vectors) |
| Build time | Slow (sequential insert) | Faster (batch k-means) |
| Updates | Supports realtime insert/delete | Needs re-training when data distribution shifts |
| Scale | Good up to ~100M vectors | Better for 1B+ vectors |
| Recall@10 | 95-99% (ef tuning) | 90-97% (nprobe tuning) |

3. Quantization — Compressing Vectors Without Losing Much Quality

1 billion vectors × 1536 dimensions × 4 bytes (float32) = ~6TB RAM. Quantization solves this by compressing vectors into smaller representations.

3.1 Scalar Quantization (SQ)

Converts each dimension from float32 (4 bytes) to int8 (1 byte). 4x memory reduction, ~2-5% recall loss.
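A sketch of per-vector symmetric int8 quantization (one of several possible scaling schemes; production systems often calibrate scales per dimension or per segment):

```python
import numpy as np

rng = np.random.default_rng(3)
v = rng.normal(size=1536).astype(np.float32)

# Symmetric scalar quantization: map each float32 dim into the int8 range.
scale = float(np.abs(v).max()) / 127.0
q8 = np.round(v / scale).astype(np.int8)        # 1 byte/dim instead of 4
v_hat = q8.astype(np.float32) * scale           # dequantize to approximate v

cos = float(v @ v_hat) / float(np.linalg.norm(v) * np.linalg.norm(v_hat))
print(q8.nbytes, v.nbytes)  # 1536 6144: 4x smaller
```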

3.2 Product Quantization (PQ)

Splits each vector into m subvectors, each quantized independently using its own codebook. 4-32x memory reduction.

Product Quantization Example

A 1536-d vector is split into 192 subvectors × 8-d each. Each subvector maps to a codebook of 256 entries (8-bit). Result: instead of storing 1536 × 4 = 6144 bytes, only 192 × 1 = 192 bytes — a 32x reduction.
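The same idea in miniature: 64-d vectors split into 8 subspaces, 256 codewords per subspace, so each vector compresses from 256 bytes to 8 one-byte codes (the same 32x ratio as the example above; the dimensions are shrunk only so the demo trains quickly):

```python
import numpy as np

rng = np.random.default_rng(4)
n, d, m, k = 5000, 64, 8, 256       # m subvectors of d//m dims, k codewords each
sub = d // m
db = rng.normal(size=(n, d)).astype(np.float32)

def kmeans(X, k, iters=8):
    """Tiny Lloyd's k-means, enough to learn one subspace codebook."""
    C = X[rng.choice(len(X), k, replace=False)].copy()
    for _ in range(iters):
        d2 = (X ** 2).sum(1)[:, None] - 2 * X @ C.T + (C ** 2).sum(1)[None, :]
        labels = np.argmin(d2, axis=1)
        for j in range(k):
            members = X[labels == j]
            if len(members):
                C[j] = members.mean(axis=0)
    return C

# One independent codebook per subspace.
codebooks = [kmeans(db[:, i * sub:(i + 1) * sub], k) for i in range(m)]

# Encode: each vector becomes m one-byte codeword ids.
codes = np.empty((n, m), dtype=np.uint8)
for i, C in enumerate(codebooks):
    X = db[:, i * sub:(i + 1) * sub]
    d2 = (X ** 2).sum(1)[:, None] - 2 * X @ C.T + (C ** 2).sum(1)[None, :]
    codes[:, i] = np.argmin(d2, axis=1)

print(db.nbytes // codes.nbytes)  # 32: 256 bytes/vector -> 8 bytes/vector
```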

3.3 Binary Quantization

The most extreme: each dimension keeps only 1 bit (positive = 1, negative = 0). A 1536-d vector → 192 bytes. Hamming distance replaces cosine similarity — extremely fast on modern CPUs (POPCNT instruction). Cohere embed-v4 is specifically designed to work well with binary quantization.
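A sketch with NumPy bit-packing (np.unpackbits stands in for the hardware POPCNT a real engine would use):

```python
import numpy as np

rng = np.random.default_rng(5)
vecs = rng.normal(size=(10_000, 1536)).astype(np.float32)

# Binary quantization: 1 bit per dimension (sign), packed 8 dims per byte.
bits = np.packbits((vecs > 0).astype(np.uint8), axis=1)    # shape (10000, 192)

q = vecs[42] + 0.1 * rng.normal(size=1536)
qbits = np.packbits((q > 0).astype(np.uint8))

# Hamming distance = popcount(xor); unpackbits stands in for the CPU's POPCNT.
hamming = np.unpackbits(bits ^ qbits[None, :], axis=1).sum(axis=1)
best = int(np.argmin(hamming))
print(best, vecs.nbytes // bits.nbytes)  # 42 32
```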

| Technique | Compression | Recall Loss | Speed Gain | Use Case |
|---|---|---|---|---|
| No compression (float32) | 1x | 0% | Baseline | Small datasets, maximum recall required |
| Scalar (int8) | 4x | 2-5% | 2-3x | Best balance for most use cases |
| Product (PQ) | 8-32x | 5-15% | 5-10x | Billion-scale, disk-based index |
| Binary | 32x | 10-20% | 20-40x | Pre-filter / first-stage retrieval |

Two-stage Retrieval Pattern

In production, many systems use binary/PQ quantization to quickly filter top-1000 candidates, then rerank with full-precision vectors for the final top-10. This pattern combines quantization speed with exact distance quality.
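A sketch of the pattern with synthetic data: binary codes drive the cheap first stage, full-precision float32 vectors the exact rerank:

```python
import numpy as np

rng = np.random.default_rng(6)
n, d = 50_000, 256
db = rng.normal(size=(n, d)).astype(np.float32)
db /= np.linalg.norm(db, axis=1, keepdims=True)
bits = np.packbits((db > 0).astype(np.uint8), axis=1)      # compressed index

q = db[999] + 0.02 * rng.normal(size=d)
q = (q / np.linalg.norm(q)).astype(np.float32)
qbits = np.packbits((q > 0).astype(np.uint8))

# Stage 1: cheap Hamming scan over binary codes -> top-1000 candidates.
hamming = np.unpackbits(bits ^ qbits[None, :], axis=1).sum(axis=1)
cands = np.argsort(hamming)[:1000]

# Stage 2: exact cosine on the candidates only, using full-precision vectors.
top10 = cands[np.argsort(-(db[cands] @ q))[:10]]
print(int(top10[0]))  # 999
```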

4. Vector Database Comparison 2026

graph TB
    subgraph "Purpose-Built Vector DB"
        Q["Qdrant<br/>Rust, HNSW+"]
        M["Milvus<br/>Go/C++, Disaggregated"]
        W["Weaviate<br/>Go, Knowledge Graph"]
        P["Pinecone<br/>Managed, Serverless"]
        CH["Chroma<br/>Python, Lightweight"]
    end

    subgraph "Vector Extensions on Traditional DBs"
        PG["pgvector<br/>PostgreSQL"]
        ES["Elasticsearch 9<br/>kNN + BM25"]
        RE["Redis 8<br/>Vector Search"]
    end

    style Q fill:#e94560,stroke:#fff,color:#fff
    style M fill:#2c3e50,stroke:#fff,color:#fff
    style W fill:#4CAF50,stroke:#fff,color:#fff
    style P fill:#e94560,stroke:#fff,color:#fff
    style CH fill:#2c3e50,stroke:#fff,color:#fff
    style PG fill:#f8f9fa,stroke:#e94560,color:#2c3e50
    style ES fill:#f8f9fa,stroke:#e94560,color:#2c3e50
    style RE fill:#f8f9fa,stroke:#e94560,color:#2c3e50

Two main categories: purpose-built vector DBs and extensions on traditional databases

| Database | Language | Index | Filtering | Scale | Key Differentiator |
|---|---|---|---|---|---|
| Qdrant | Rust | HNSW + quantization | ACORN (in-graph filter) | 100M+ vectors | Sub-100ms at 100M vectors, 95% recall. Fastest filtering via ACORN integrated into HNSW traversal |
| Milvus | Go/C++ | HNSW, IVF, DiskANN | Post-filter | 1B+ vectors | Disaggregated architecture (separate compute/storage). GPU-accelerated indexing. Strongest horizontal scaling |
| Weaviate | Go | HNSW + flat | Pre-filter + vector | 500M+ vectors | Hybrid search (vector + keyword). Knowledge graph integration. Rich module ecosystem |
| Pinecone | Managed | Proprietary | Metadata filter | 1B+ vectors | Fully managed serverless. Zero ops. Ideal for small teams wanting to ship fast |
| pgvector | C (PG ext) | HNSW, IVF | SQL WHERE clause | 10-50M vectors | Use your existing PostgreSQL. No extra infra. Best when vector search is a secondary feature |
| Chroma | Python/Rust | HNSW | Metadata filter | 1-10M vectors | Developer-friendly, embed in process. Great for prototyping and small-scale RAG |

4.1 Qdrant — Superior Filtering Performance

Qdrant's biggest strength is its ACORN algorithm — instead of filtering before search (pre-filter) or after search (post-filter), ACORN integrates filtering directly into the HNSW graph traversal. Result: filtered queries remain fast even when filters eliminate 99% of candidates — something pre/post-filter systems struggle with.

4.2 Milvus — Disaggregated Architecture for Billions of Vectors

Milvus completely separates four tiers: access layer (proxy), coordinator (metadata), worker nodes (query/data/index), and storage (object store + message queue). Each tier scales independently — you can add query nodes without affecting indexing, or vice versa.

graph TD
    Client["Client SDK"] --> Proxy["Access Layer<br/>(Proxy / Load Balancer)"]
    Proxy --> QN["Query Nodes<br/>(Search)"]
    Proxy --> DN["Data Nodes<br/>(Insert/Delete)"]
    Proxy --> IN["Index Nodes<br/>(Build Index)"]
    Coord["Coordinator<br/>(Root + Query + Data + Index Coord)"] --> QN
    Coord --> DN
    Coord --> IN
    QN --> S3["Object Storage<br/>(S3 / MinIO)"]
    DN --> MQ["Message Queue<br/>(Pulsar / Kafka)"]
    IN --> S3
    MQ --> S3

    style Client fill:#f8f9fa,stroke:#e94560,color:#2c3e50
    style Proxy fill:#e94560,stroke:#fff,color:#fff
    style QN fill:#4CAF50,stroke:#fff,color:#fff
    style DN fill:#4CAF50,stroke:#fff,color:#fff
    style IN fill:#4CAF50,stroke:#fff,color:#fff
    style Coord fill:#2c3e50,stroke:#fff,color:#fff
    style S3 fill:#f8f9fa,stroke:#2c3e50,color:#2c3e50
    style MQ fill:#f8f9fa,stroke:#2c3e50,color:#2c3e50

Milvus disaggregated architecture: each tier scales independently, storage uses object store (S3/MinIO)

4.3 pgvector — When You Already Have PostgreSQL

If your application already uses PostgreSQL and vector search is just a supplementary feature (e.g., similar product search, recommendation widget), pgvector is the most practical choice. No additional services, no data sync, direct JOINs with business logic tables.

-- Create table with vector column
CREATE TABLE documents (
    id SERIAL PRIMARY KEY,
    content TEXT,
    category_id INT REFERENCES categories(id),
    status TEXT,
    embedding vector(1536)
);

-- Create HNSW index
CREATE INDEX ON documents
USING hnsw (embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 200);

-- Semantic search + business logic filter in a single query
SELECT d.id, d.content, 1 - (d.embedding <=> query_embedding) AS similarity
FROM documents d
JOIN categories c ON d.category_id = c.id
WHERE c.name = 'technology' AND d.status = 'published'
ORDER BY d.embedding <=> query_embedding
LIMIT 10;

pgvector Limitations

pgvector works well under 10M vectors. Beyond that threshold, performance degrades noticeably compared to purpose-built vector DBs. If your dataset is large or growing fast, start with pgvector then migrate to Qdrant/Milvus when needed — API patterns are similar.

5. Production Architecture — RAG Pipeline with Vector Database

The most common use case in 2026: RAG (Retrieval-Augmented Generation) — augmenting LLM responses with context from your own knowledge base for better accuracy and reduced hallucination.

graph TD
    subgraph "Indexing Pipeline (Offline)"
        DOC["Documents<br/>(PDF, HTML, Markdown)"] --> CHUNK["Chunking<br/>(512-1024 tokens)"]
        CHUNK --> EMB["Embedding Model"]
        EMB --> VDB["Vector Database<br/>(Qdrant / Milvus)"]
        CHUNK --> META["Metadata Store<br/>(PostgreSQL)"]
    end

    subgraph "Query Pipeline (Online)"
        USER["User Query"] --> QEMB["Embed Query"]
        QEMB --> SEARCH["Vector Search<br/>+ Metadata Filter"]
        SEARCH --> VDB
        SEARCH --> RERANK["Reranker<br/>(Cohere / ColBERT)"]
        RERANK --> PROMPT["Prompt Assembly<br/>(Query + Context)"]
        PROMPT --> LLM["LLM<br/>(Claude / GPT)"]
        LLM --> ANS["Answer<br/>+ Citations"]
    end

    style DOC fill:#f8f9fa,stroke:#e94560,color:#2c3e50
    style CHUNK fill:#2c3e50,stroke:#fff,color:#fff
    style EMB fill:#e94560,stroke:#fff,color:#fff
    style VDB fill:#4CAF50,stroke:#fff,color:#fff
    style META fill:#f8f9fa,stroke:#2c3e50,color:#2c3e50
    style USER fill:#f8f9fa,stroke:#e94560,color:#2c3e50
    style QEMB fill:#e94560,stroke:#fff,color:#fff
    style SEARCH fill:#2c3e50,stroke:#fff,color:#fff
    style RERANK fill:#2c3e50,stroke:#fff,color:#fff
    style PROMPT fill:#e94560,stroke:#fff,color:#fff
    style LLM fill:#e94560,stroke:#fff,color:#fff
    style ANS fill:#4CAF50,stroke:#fff,color:#fff

Complete RAG pipeline: offline indexing + online query with reranking

5.1 Chunking Strategy

How you split documents into chunks directly impacts retrieval quality:

| Strategy | Description | Pros | Cons |
|---|---|---|---|
| Fixed-size | Split by fixed token count (512) | Simple, predictable | May cut mid-sentence/paragraph |
| Recursive character | Split by boundary (paragraph → sentence → word) | Better context preservation | Uneven chunk sizes |
| Semantic chunking | Use embedding similarity to detect topic boundaries | Semantically coherent chunks | Slow, expensive embedding cost |
| Document-aware | Split by structure (heading, section, table) | Preserves original structure | Needs per-format parser |
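A minimal recursive-character chunker, as a sketch (the separator list and max_len are illustrative; production chunkers count tokens rather than characters):

```python
def recursive_chunks(text, max_len=200, seps=("\n\n", "\n", ". ", " ")):
    """Split on the coarsest boundary first; recurse only on oversized pieces."""
    if len(text) <= max_len:
        return [text]
    for sep in seps:
        parts = [p for p in text.split(sep) if p]
        if len(parts) > 1:
            chunks, buf = [], ""
            for part in parts:
                candidate = (buf + sep + part) if buf else part
                if len(candidate) <= max_len:
                    buf = candidate          # greedily pack pieces together
                else:
                    if buf:
                        chunks.append(buf)
                    buf = part
            if buf:
                chunks.append(buf)
            # Recurse on any piece that still exceeds max_len.
            return [c for chunk in chunks
                      for c in recursive_chunks(chunk, max_len, seps)]
    # No separator found at all: fall back to hard character slicing.
    return [text[i:i + max_len] for i in range(0, len(text), max_len)]

doc = "Para one. Still para one.\n\nPara two is here.\n\n" + "x" * 450
chunks = recursive_chunks(doc)
```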

5.2 Hybrid Search — Combining Vector + Keyword

Vector search excels at capturing semantics ("how to speed up web" → finds articles about "performance optimization"), but struggles with exact matches (product names, error codes). Combining with BM25 keyword search yields the best results:

from qdrant_client import QdrantClient
from qdrant_client.models import (
    Prefetch, FusionQuery, Fusion, SparseVector
)

client = QdrantClient("localhost", port=6333)

# Hybrid search: dense + sparse (BM25) with Reciprocal Rank Fusion
results = client.query_points(
    collection_name="documents",
    prefetch=[
        Prefetch(
            query=dense_embedding,    # semantic search
            using="dense",
            limit=100
        ),
        Prefetch(
            query=SparseVector(       # keyword search (BM25)
                indices=sparse_indices,
                values=sparse_values
            ),
            using="sparse",
            limit=100
        ),
    ],
    query=FusionQuery(fusion=Fusion.RRF),  # Reciprocal Rank Fusion
    limit=10
)

6. Production Performance Optimization

6.1 Capacity Planning

RAM Estimation Formula for HNSW

RAM ≈ num_vectors × (dim × bytes_per_dim + M × 2 × 4 + overhead)

Example: 10M vectors × 1536-d × float32, M=16:
10M × (1536 × 4 + 16 × 2 × 4 + 100) ≈ 10M × 6372 ≈ 63.7 GB RAM

With scalar quantization (int8): 10M × (1536 × 1 + 128 + 100) ≈ 17.6 GB RAM — ~3.6x reduction.
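The formula above as a quick calculator:

```python
def hnsw_ram_gb(num_vectors, dim, bytes_per_dim=4, M=16, overhead=100):
    """Vector data + graph links (M * 2 edges * 4 bytes) + per-vector overhead."""
    per_vector = dim * bytes_per_dim + M * 2 * 4 + overhead
    return num_vectors * per_vector / 1e9

print(round(hnsw_ram_gb(10_000_000, 1536), 1))                    # 63.7 (float32)
print(round(hnsw_ram_gb(10_000_000, 1536, bytes_per_dim=1), 1))   # 17.6 (int8)
```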

6.2 Multi-Tenancy Patterns

When multiple customers (tenants) share a vector search system:

  • Collection per tenant: Full isolation, but high overhead with thousands of tenants. Suitable when each tenant has >100K vectors.
  • Partition key per tenant: Single collection, filter by tenant_id. More efficient for many small tenants. Qdrant supports payload_index on tenant_id for optimization.
  • Namespace (Pinecone): Logical separation within the same index. Zero overhead, but less isolation.

6.3 Monitoring Metrics

| Metric | Description | Target |
|---|---|---|
| p99 query latency | Latency at 99th percentile | <100ms for interactive search |
| Recall@K | % of ground-truth results in top-K | >95% for RAG |
| QPS (queries/second) | Throughput | Scale-dependent, typically 100-10K QPS |
| Index build time | Time to build index after adding data | Hours for million-scale, days for billion |
| Memory utilization | RAM usage vs provisioned | 70-85% (buffer for spikes) |

7. When to Use a Purpose-Built Vector Database?

graph TD
    START["Need vector search?"] -->|Yes| Q1["Dataset > 10M vectors?"]
    Q1 -->|No| Q2["Already have PostgreSQL?"]
    Q2 -->|Yes| PG["pgvector<br/>No extra infra"]
    Q2 -->|No| Q3["Prototyping?"]
    Q3 -->|Yes| CHROMA["Chroma<br/>Embed in-process"]
    Q3 -->|No| QDRANT["Qdrant<br/>Self-host or cloud"]
    Q1 -->|Yes| Q4["Need disaggregated<br/>scaling?"]
    Q4 -->|Yes| MILVUS["Milvus<br/>Separate compute/storage"]
    Q4 -->|No| Q5["Complex filtering?"]
    Q5 -->|Yes| QDRANT2["Qdrant<br/>ACORN filtering"]
    Q5 -->|No| Q6["Zero ops?"]
    Q6 -->|Yes| PINE["Pinecone<br/>Fully managed"]
    Q6 -->|No| WEAV["Weaviate<br/>Hybrid search"]

    style START fill:#e94560,stroke:#fff,color:#fff
    style Q1 fill:#2c3e50,stroke:#fff,color:#fff
    style Q2 fill:#2c3e50,stroke:#fff,color:#fff
    style Q3 fill:#2c3e50,stroke:#fff,color:#fff
    style Q4 fill:#2c3e50,stroke:#fff,color:#fff
    style Q5 fill:#2c3e50,stroke:#fff,color:#fff
    style Q6 fill:#2c3e50,stroke:#fff,color:#fff
    style PG fill:#4CAF50,stroke:#fff,color:#fff
    style CHROMA fill:#4CAF50,stroke:#fff,color:#fff
    style QDRANT fill:#4CAF50,stroke:#fff,color:#fff
    style MILVUS fill:#4CAF50,stroke:#fff,color:#fff
    style QDRANT2 fill:#4CAF50,stroke:#fff,color:#fff
    style PINE fill:#4CAF50,stroke:#fff,color:#fff
    style WEAV fill:#4CAF50,stroke:#fff,color:#fff

Decision tree for choosing a vector database based on practical needs

Trends to watch:

  • Quantization-Aware Training: Embedding models are now trained to work well with binary/scalar quantization out of the box. Cohere embed-v4 and Nomic embed-v2 lead the way: recall loss drops from 10-20% to 2-5% with binary quantization.
  • Disaggregated Architecture: Separating compute and storage is becoming standard. Milvus, Weaviate Cloud, and Pinecone serverless all follow this pattern, reducing idle costs and enabling elastic scaling.
  • Multi-Vector & Late Interaction: Instead of one vector per document, use multiple vectors (per token or paragraph). ColBERT and ColPali represent this trend, with significantly higher recall for long documents.
  • Serverless Vector Search: Pinecone serverless, Qdrant Cloud, Turbopuffer: pay only per query, no cluster provisioning needed. Lowers the barrier for startups and side projects.
  • Hybrid Database Convergence: PostgreSQL (pgvector), Elasticsearch 9, and Redis 8 all add powerful vector search. The boundary between "vector DB" and "traditional DB with vector support" is increasingly blurred.

Conclusion

A vector database is not a silver bullet — it's one component in the semantic search pipeline. Understanding the trade-offs between HNSW (fast, RAM-hungry) and IVF (memory-efficient, needs training), knowing when to apply quantization (PQ for billion-scale, SQ for balance), and choosing the right database for your use case (pgvector for simplicity, Qdrant for filtering, Milvus for scale) — that's the key to building effective AI search systems in production.
