Vector Database — Semantic Search Architecture for AI

Posted on: 4/22/2026 5:12:31 AM

As AI applications increasingly rely on semantic search — from RAG (Retrieval-Augmented Generation) to recommendation systems — the question is no longer "should we use a vector database?" but "which one, with what indexing strategy, and how to deploy it?" This article dives deep into vector database internals: how embeddings work, indexing algorithms (HNSW, IVF, LSH), quantization techniques for memory reduction, and a detailed comparison of the most popular solutions in 2026.

  • $10.6B: projected vector DB market size by 2032
  • 4-32x: memory reduction via Product Quantization
  • <1 ms: query latency with HNSW on millions of vectors
  • 95%+: recall achievable with well-tuned HNSW

1. Embeddings — From Raw Data to Vector Space

Before discussing databases, we need to understand embeddings — the process of converting unstructured data (text, images, audio) into numerical vectors in high-dimensional space. Each vector represents the "semantic meaning" of the original data.

graph LR
    A["Text: 'Caching reduces latency'"] --> B["Embedding Model<br/>(OpenAI, Cohere, BGE)"]
    B --> C["Vector<br/>[0.12, -0.45, 0.78, ..., 0.33]<br/>1536 dimensions"]
    D["Text: 'Cache speeds up queries'"] --> E["Embedding Model"]
    E --> F["Vector<br/>[0.11, -0.43, 0.76, ..., 0.31]<br/>1536 dimensions"]
    C --> G["Cosine Similarity<br/>= 0.97 (very close)"]
    F --> G

    style A fill:#f8f9fa,stroke:#e94560,color:#2c3e50
    style D fill:#f8f9fa,stroke:#e94560,color:#2c3e50
    style B fill:#e94560,stroke:#fff,color:#fff
    style E fill:#e94560,stroke:#fff,color:#fff
    style C fill:#f8f9fa,stroke:#2c3e50,color:#2c3e50
    style F fill:#f8f9fa,stroke:#2c3e50,color:#2c3e50
    style G fill:#4CAF50,stroke:#fff,color:#fff

Two sentences with different wording but close in vector space because they share the same semantics

Popular embedding models in 2026:

| Model | Dimensions | Use Case | Key Feature |
|---|---|---|---|
| OpenAI text-embedding-3-large | 3072 | General purpose | Matryoshka support (flexible dimension reduction) |
| Cohere embed-v4 | 1024 | Multilingual search | 100+ languages, built-in binary quantization |
| BGE-M3 | 1024 | Hybrid (dense + sparse) | Open-source, multi-granularity retrieval |
| Voyage-3 | 1024 | Code & technical docs | Optimized for code search |
| GTE-Qwen2 | 768–8192 | Long context embedding | Supports up to 128K tokens |

Matryoshka Representation Learning (MRL)

A recent technique that allows embedding models to produce "nested" vectors — you can truncate a 3072-d vector down to 256-d while retaining ~90% quality. Extremely useful for saving memory at the cache/filter layer before reranking with full vectors.
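The mechanics are simple to sketch with NumPy. The random vectors below are a stand-in for real MRL embeddings (a model like text-embedding-3-large would supply them); the sketch only illustrates the truncate-and-renormalize step and the memory math, not retrieval quality:

```python
import numpy as np

rng = np.random.default_rng(42)

# Stand-in for a batch of 3072-d MRL embeddings (random data: mechanics only).
full = rng.normal(size=(1000, 3072)).astype(np.float32)
full /= np.linalg.norm(full, axis=1, keepdims=True)

# Matryoshka truncation: keep the leading 256 dims, then re-normalize
# so cosine similarity still works on the short vectors.
short = full[:, :256].copy()
short /= np.linalg.norm(short, axis=1, keepdims=True)

print(full.nbytes // short.nbytes)  # 12: 12x less memory at the filter layer
```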

2. Approximate Nearest Neighbor (ANN) — Why Not Brute Force?

Exact nearest neighbor search over 1 million 1536-d vectors requires roughly 1.5 billion floating-point operations per query, which can take hundreds of milliseconds on a single machine. With 1 billion vectors it becomes infeasible. The solution: Approximate Nearest Neighbor (ANN) search, which accepts recall below 100% in exchange for a 100-1000x speedup.
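The brute-force cost is easy to see in a scaled-down sketch (20K vectors instead of 1M; the operation count grows linearly with dataset size):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 20_000, 1536                     # scaled down from 1M for a quick demo
db = rng.normal(size=(n, d)).astype(np.float32)
db /= np.linalg.norm(db, axis=1, keepdims=True)

# A query very close to vector 123.
q = (db[123] + 0.01 * rng.normal(size=d)).astype(np.float32)
q /= np.linalg.norm(q)

# Brute force: one dot product per stored vector -> n * d multiply-adds
# per query (~31M here, ~1.5B at 1M vectors).
scores = db @ q
top10 = np.argsort(-scores)[:10]
print(int(top10[0]))  # 123
```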

2.1 HNSW (Hierarchical Navigable Small World)

HNSW is the most popular ANN algorithm today, used as the default in most vector databases. Core idea: build a multi-layer hierarchical graph where each node is a vector and each edge connects "nearby" vectors.

graph TD
    subgraph "Layer 2 (Sparse - Long-range links)"
        L2A["Node A"] --- L2D["Node D"]
        L2D --- L2G["Node G"]
    end

    subgraph "Layer 1 (Medium density)"
        L1A["Node A"] --- L1B["Node B"]
        L1B --- L1D["Node D"]
        L1D --- L1F["Node F"]
        L1F --- L1G["Node G"]
    end

    subgraph "Layer 0 (Dense - All nodes)"
        L0A["Node A"] --- L0B["Node B"]
        L0B --- L0C["Node C"]
        L0C --- L0D["Node D"]
        L0D --- L0E["Node E"]
        L0E --- L0F["Node F"]
        L0F --- L0G["Node G"]
        L0A --- L0C
        L0B --- L0D
        L0E --- L0G
    end

    L2A -.-> L1A
    L2D -.-> L1D
    L2G -.-> L1G
    L1A -.-> L0A
    L1B -.-> L0B
    L1D -.-> L0D
    L1F -.-> L0F
    L1G -.-> L0G

    style L2A fill:#e94560,stroke:#fff,color:#fff
    style L2D fill:#e94560,stroke:#fff,color:#fff
    style L2G fill:#e94560,stroke:#fff,color:#fff
    style L1A fill:#2c3e50,stroke:#fff,color:#fff
    style L1B fill:#2c3e50,stroke:#fff,color:#fff
    style L1D fill:#2c3e50,stroke:#fff,color:#fff
    style L1F fill:#2c3e50,stroke:#fff,color:#fff
    style L1G fill:#2c3e50,stroke:#fff,color:#fff
    style L0A fill:#f8f9fa,stroke:#e94560,color:#2c3e50
    style L0B fill:#f8f9fa,stroke:#e94560,color:#2c3e50
    style L0C fill:#f8f9fa,stroke:#e94560,color:#2c3e50
    style L0D fill:#f8f9fa,stroke:#e94560,color:#2c3e50
    style L0E fill:#f8f9fa,stroke:#e94560,color:#2c3e50
    style L0F fill:#f8f9fa,stroke:#e94560,color:#2c3e50
    style L0G fill:#f8f9fa,stroke:#e94560,color:#2c3e50

HNSW structure: search starts at the top layer (sparse), progressively "zooms in" to layer 0 (dense) for more accurate nearest neighbors

Key HNSW parameters:

  • M (max connections per node): Higher → better recall but more RAM. Default typically 16–64.
  • efConstruction: Number of candidates considered during graph construction. Higher → better index quality but slower build.
  • efSearch: Number of candidates considered during query. This is the main "knob" for recall vs latency trade-off at runtime.
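To make the efSearch knob concrete, here is a toy greedy best-first search over layer 0, using a brute-force-built kNN graph as a stand-in for real HNSW construction (real implementations build the graph incrementally and search through multiple layers):

```python
import heapq
import numpy as np

rng = np.random.default_rng(1)
vecs = rng.normal(size=(2000, 64)).astype(np.float32)
vecs /= np.linalg.norm(vecs, axis=1, keepdims=True)

# Toy layer-0 graph: each node linked to its M nearest neighbors (built by
# brute force here; real HNSW builds these links during inserts).
M = 16
neighbors = np.argsort(-(vecs @ vecs.T), axis=1)[:, 1:M + 1]

def search_layer(query, entry, ef):
    """Greedy best-first search with a candidate beam of size ef (efSearch)."""
    dist = lambda i: 1.0 - float(vecs[i] @ query)   # cosine distance
    visited = {entry}
    candidates = [(dist(entry), entry)]             # min-heap: closest first
    best = [(-dist(entry), entry)]                  # max-heap: current top-ef
    while candidates:
        d, node = heapq.heappop(candidates)
        if d > -best[0][0]:                         # nothing can improve top-ef
            break
        for nb in map(int, neighbors[node]):
            if nb in visited:
                continue
            visited.add(nb)
            dn = dist(nb)
            if len(best) < ef or dn < -best[0][0]:
                heapq.heappush(candidates, (dn, nb))
                heapq.heappush(best, (-dn, nb))
                if len(best) > ef:
                    heapq.heappop(best)             # evict current worst
    return sorted((-negd, i) for negd, i in best)

top = search_layer(vecs[0], entry=500, ef=64)  # larger ef: better recall, slower
```

Raising ef widens the beam, so the search escapes more local minima at the cost of more distance computations.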

Complexity Analysis

Build time: O(N × log(N)) — each vector insert traverses the existing graph.
Query time: O(log(N)) — hops through layers from sparse to dense.
Memory: O(N × M × D) — stores both vectors and adjacency lists. This is the biggest drawback: 1 billion vectors × 1536-d × float32 ≈ 6TB for data alone, not counting graph structure.

2.2 IVF (Inverted File Index)

IVF partitions the vector space into nlist clusters using k-means. At query time, only the nearest nprobe clusters are searched instead of the entire dataset.

graph TD
    Q["Query Vector"] --> R["Find nprobe=3<br/>nearest clusters"]
    R --> C1["Cluster 1<br/>50K vectors"]
    R --> C3["Cluster 3<br/>45K vectors"]
    R --> C5["Cluster 5<br/>52K vectors"]
    C2["Cluster 2<br/>48K vectors"]
    C4["Cluster 4<br/>55K vectors"]
    C1 --> RES["Top-K Results"]
    C3 --> RES
    C5 --> RES

    style Q fill:#e94560,stroke:#fff,color:#fff
    style R fill:#2c3e50,stroke:#fff,color:#fff
    style C1 fill:#4CAF50,stroke:#fff,color:#fff
    style C3 fill:#4CAF50,stroke:#fff,color:#fff
    style C5 fill:#4CAF50,stroke:#fff,color:#fff
    style C2 fill:#f8f9fa,stroke:#e0e0e0,color:#999
    style C4 fill:#f8f9fa,stroke:#e0e0e0,color:#999
    style RES fill:#e94560,stroke:#fff,color:#fff

IVF only scans clusters near the query (green), skipping the rest (grey) — reducing vectors to compare by 90%+

IVF advantages: Uses less RAM than HNSW since it only loads needed clusters. Suitable for datasets too large for RAM.
Disadvantages: Requires a training step (k-means) before indexing. Recall depends on clustering quality.
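A minimal IVF sketch in NumPy, with a toy Lloyd's k-means for the training step (synthetic data and unoptimized clustering; real systems use tuned k-means and usually pair IVF with compressed codes):

```python
import numpy as np

rng = np.random.default_rng(2)
n, d, nlist, nprobe = 20_000, 64, 50, 5
db = rng.normal(size=(n, d)).astype(np.float32)

def nearest_centroid(X, C):
    d2 = (X ** 2).sum(1)[:, None] - 2 * X @ C.T + (C ** 2).sum(1)[None, :]
    return np.argmin(d2, axis=1)

# Train: a few rounds of Lloyd's k-means gives the nlist cluster centroids.
centroids = db[rng.choice(n, nlist, replace=False)].copy()
for _ in range(10):
    labels = nearest_centroid(db, centroids)
    for k in range(nlist):
        members = db[labels == k]
        if len(members):
            centroids[k] = members.mean(axis=0)

# Index: inverted lists of vector ids, grouped by nearest centroid.
labels = nearest_centroid(db, centroids)
inverted = [np.where(labels == k)[0] for k in range(nlist)]

# Query: rank centroids, then scan only the nprobe closest clusters.
q = (db[7] + 0.05 * rng.normal(size=d)).astype(np.float32)
probe = np.argsort(((centroids - q) ** 2).sum(axis=1))[:nprobe]
cand = np.concatenate([inverted[k] for k in probe])
found = int(cand[np.argmin(((db[cand] - q) ** 2).sum(axis=1))])
print(found)  # 7: the perturbed source vector sits in a probed cluster
```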

2.3 HNSW vs IVF Comparison

| Criteria | HNSW | IVF |
|---|---|---|
| Query speed | Very fast (sub-ms) | Fast (1-10ms) |
| Memory | High (stores graph + vectors) | Lower (only centroids + vectors) |
| Build time | Slow (sequential insert) | Faster (batch k-means) |
| Updates | Supports realtime insert/delete | Needs re-training when data distribution shifts |
| Scale | Good up to ~100M vectors | Better for 1B+ vectors |
| Recall@10 | 95-99% (ef tuning) | 90-97% (nprobe tuning) |

3. Quantization — Compressing Vectors Without Losing Much Quality

1 billion vectors × 1536 dimensions × 4 bytes (float32) = ~6TB RAM. Quantization solves this by compressing vectors into smaller representations.

3.1 Scalar Quantization (SQ)

Converts each dimension from float32 (4 bytes) to int8 (1 byte). 4x memory reduction, ~2-5% recall loss.
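A sketch of per-vector symmetric int8 quantization (one of several possible scaling schemes; production systems often calibrate scales per dimension or per segment):

```python
import numpy as np

rng = np.random.default_rng(3)
v = rng.normal(size=1536).astype(np.float32)

# Symmetric scalar quantization: map each float32 dim into the int8 range.
scale = float(np.abs(v).max()) / 127.0
q8 = np.round(v / scale).astype(np.int8)        # 1 byte/dim instead of 4
v_hat = q8.astype(np.float32) * scale           # dequantize to approximate v

cos = float(v @ v_hat) / float(np.linalg.norm(v) * np.linalg.norm(v_hat))
print(q8.nbytes, v.nbytes)  # 1536 6144: 4x smaller
```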

3.2 Product Quantization (PQ)

Splits each vector into m subvectors, each quantized independently using its own codebook. 4-32x memory reduction.

Product Quantization Example

A 1536-d vector is split into 192 subvectors × 8-d each. Each subvector maps to a codebook of 256 entries (8-bit). Result: instead of storing 1536 × 4 = 6144 bytes, only 192 × 1 = 192 bytes — a 32x reduction.
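The same idea in miniature: 64-d vectors split into 8 subspaces, 256 codewords per subspace, so each vector compresses from 256 bytes to 8 one-byte codes (the same 32x ratio as the example above; the dimensions are shrunk only so the demo trains quickly):

```python
import numpy as np

rng = np.random.default_rng(4)
n, d, m, k = 5000, 64, 8, 256       # m subvectors of d//m dims, k codewords each
sub = d // m
db = rng.normal(size=(n, d)).astype(np.float32)

def kmeans(X, k, iters=8):
    """Tiny Lloyd's k-means, enough to learn one subspace codebook."""
    C = X[rng.choice(len(X), k, replace=False)].copy()
    for _ in range(iters):
        d2 = (X ** 2).sum(1)[:, None] - 2 * X @ C.T + (C ** 2).sum(1)[None, :]
        labels = np.argmin(d2, axis=1)
        for j in range(k):
            members = X[labels == j]
            if len(members):
                C[j] = members.mean(axis=0)
    return C

# One independent codebook per subspace.
codebooks = [kmeans(db[:, i * sub:(i + 1) * sub], k) for i in range(m)]

# Encode: each vector becomes m one-byte codeword ids.
codes = np.empty((n, m), dtype=np.uint8)
for i, C in enumerate(codebooks):
    X = db[:, i * sub:(i + 1) * sub]
    d2 = (X ** 2).sum(1)[:, None] - 2 * X @ C.T + (C ** 2).sum(1)[None, :]
    codes[:, i] = np.argmin(d2, axis=1)

print(db.nbytes // codes.nbytes)  # 32: 256 bytes/vector -> 8 bytes/vector
```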

3.3 Binary Quantization

The most extreme: each dimension keeps only 1 bit (positive = 1, negative = 0). A 1536-d vector → 192 bytes. Hamming distance replaces cosine similarity — extremely fast on modern CPUs (POPCNT instruction). Cohere embed-v4 is specifically designed to work well with binary quantization.
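A sketch with NumPy bit-packing (np.unpackbits stands in for the hardware POPCNT a real engine would use):

```python
import numpy as np

rng = np.random.default_rng(5)
vecs = rng.normal(size=(10_000, 1536)).astype(np.float32)

# Binary quantization: 1 bit per dimension (sign), packed 8 dims per byte.
bits = np.packbits((vecs > 0).astype(np.uint8), axis=1)    # shape (10000, 192)

q = vecs[42] + 0.1 * rng.normal(size=1536)
qbits = np.packbits((q > 0).astype(np.uint8))

# Hamming distance = popcount(xor); unpackbits stands in for the CPU's POPCNT.
hamming = np.unpackbits(bits ^ qbits[None, :], axis=1).sum(axis=1)
best = int(np.argmin(hamming))
print(best, vecs.nbytes // bits.nbytes)  # 42 32
```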

| Technique | Compression | Recall Loss | Speed Gain | Use Case |
|---|---|---|---|---|
| No compression (float32) | 1x | 0% | Baseline | Small datasets, maximum recall required |
| Scalar (int8) | 4x | 2-5% | 2-3x | Best balance for most use cases |
| Product (PQ) | 8-32x | 5-15% | 5-10x | Billion-scale, disk-based index |
| Binary | 32x | 10-20% | 20-40x | Pre-filter / first-stage retrieval |

Two-stage Retrieval Pattern

In production, many systems use binary/PQ quantization to quickly filter top-1000 candidates, then rerank with full-precision vectors for the final top-10. This pattern combines quantization speed with exact distance quality.
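A sketch of the pattern with synthetic data: binary codes drive the cheap first stage, full-precision float32 vectors the exact rerank:

```python
import numpy as np

rng = np.random.default_rng(6)
n, d = 50_000, 256
db = rng.normal(size=(n, d)).astype(np.float32)
db /= np.linalg.norm(db, axis=1, keepdims=True)
bits = np.packbits((db > 0).astype(np.uint8), axis=1)      # compressed index

q = db[999] + 0.02 * rng.normal(size=d)
q = (q / np.linalg.norm(q)).astype(np.float32)
qbits = np.packbits((q > 0).astype(np.uint8))

# Stage 1: cheap Hamming scan over binary codes -> top-1000 candidates.
hamming = np.unpackbits(bits ^ qbits[None, :], axis=1).sum(axis=1)
cands = np.argsort(hamming)[:1000]

# Stage 2: exact cosine on the candidates only, using full-precision vectors.
top10 = cands[np.argsort(-(db[cands] @ q))[:10]]
print(int(top10[0]))  # 999
```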

4. Vector Database Comparison 2026

graph TB
    subgraph "Purpose-Built Vector DB"
        Q["Qdrant<br/>Rust, HNSW+"]
        M["Milvus<br/>Go/C++, Disaggregated"]
        W["Weaviate<br/>Go, Knowledge Graph"]
        P["Pinecone<br/>Managed, Serverless"]
        CH["Chroma<br/>Python, Lightweight"]
    end

    subgraph "Vector Extensions on Traditional DBs"
        PG["pgvector<br/>PostgreSQL"]
        ES["Elasticsearch 9<br/>kNN + BM25"]
        RE["Redis 8<br/>Vector Search"]
    end

    style Q fill:#e94560,stroke:#fff,color:#fff
    style M fill:#2c3e50,stroke:#fff,color:#fff
    style W fill:#4CAF50,stroke:#fff,color:#fff
    style P fill:#e94560,stroke:#fff,color:#fff
    style CH fill:#2c3e50,stroke:#fff,color:#fff
    style PG fill:#f8f9fa,stroke:#e94560,color:#2c3e50
    style ES fill:#f8f9fa,stroke:#e94560,color:#2c3e50
    style RE fill:#f8f9fa,stroke:#e94560,color:#2c3e50

Two main categories: purpose-built vector DBs and extensions on traditional databases

| Database | Language | Index | Filtering | Scale | Key Differentiator |
|---|---|---|---|---|---|
| Qdrant | Rust | HNSW + quantization | ACORN (in-graph filter) | 100M+ vectors | Sub-100ms at 100M vectors, 95% recall. Fastest filtering via ACORN integrated into HNSW traversal |
| Milvus | Go/C++ | HNSW, IVF, DiskANN | Post-filter | 1B+ vectors | Disaggregated architecture (separate compute/storage). GPU-accelerated indexing. Strongest horizontal scaling |
| Weaviate | Go | HNSW + flat | Pre-filter + vector | 500M+ vectors | Hybrid search (vector + keyword). Knowledge graph integration. Rich module ecosystem |
| Pinecone | Managed | Proprietary | Metadata filter | 1B+ vectors | Fully managed serverless. Zero ops. Ideal for small teams wanting to ship fast |
| pgvector | C (PG ext) | HNSW, IVF | SQL WHERE clause | 10-50M vectors | Use your existing PostgreSQL. No extra infra. Best when vector search is a secondary feature |
| Chroma | Python/Rust | HNSW | Metadata filter | 1-10M vectors | Developer-friendly, embed in process. Great for prototyping and small-scale RAG |

4.1 Qdrant — Superior Filtering Performance

Qdrant's biggest strength is its ACORN algorithm — instead of filtering before search (pre-filter) or after search (post-filter), ACORN integrates filtering directly into the HNSW graph traversal. Result: filtered queries remain fast even when filters eliminate 99% of candidates — something pre/post-filter systems struggle with.

4.2 Milvus — Disaggregated Architecture for Billions of Vectors

Milvus completely separates four tiers: access layer (proxy), coordinator (metadata), worker nodes (query/data/index), and storage (object store + message queue). Each tier scales independently — you can add query nodes without affecting indexing, or vice versa.

graph TD
    Client["Client SDK"] --> Proxy["Access Layer<br/>(Proxy / Load Balancer)"]
    Proxy --> QN["Query Nodes<br/>(Search)"]
    Proxy --> DN["Data Nodes<br/>(Insert/Delete)"]
    Proxy --> IN["Index Nodes<br/>(Build Index)"]
    Coord["Coordinator<br/>(Root + Query + Data + Index Coord)"] --> QN
    Coord --> DN
    Coord --> IN
    QN --> S3["Object Storage<br/>(S3 / MinIO)"]
    DN --> MQ["Message Queue<br/>(Pulsar / Kafka)"]
    IN --> S3
    MQ --> S3

    style Client fill:#f8f9fa,stroke:#e94560,color:#2c3e50
    style Proxy fill:#e94560,stroke:#fff,color:#fff
    style QN fill:#4CAF50,stroke:#fff,color:#fff
    style DN fill:#4CAF50,stroke:#fff,color:#fff
    style IN fill:#4CAF50,stroke:#fff,color:#fff
    style Coord fill:#2c3e50,stroke:#fff,color:#fff
    style S3 fill:#f8f9fa,stroke:#2c3e50,color:#2c3e50
    style MQ fill:#f8f9fa,stroke:#2c3e50,color:#2c3e50

Milvus disaggregated architecture: each tier scales independently, storage uses object store (S3/MinIO)

4.3 pgvector — When You Already Have PostgreSQL

If your application already uses PostgreSQL and vector search is just a supplementary feature (e.g., similar product search, recommendation widget), pgvector is the most practical choice. No additional services, no data sync, direct JOINs with business logic tables.

-- Create table with vector column
CREATE TABLE documents (
    id SERIAL PRIMARY KEY,
    content TEXT,
    category_id INT REFERENCES categories(id),
    status TEXT,
    embedding vector(1536)
);

-- Create HNSW index
CREATE INDEX ON documents
USING hnsw (embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 200);

-- Semantic search + business logic filter in a single query
SELECT d.id, d.content, 1 - (d.embedding <=> query_embedding) AS similarity
FROM documents d
JOIN categories c ON d.category_id = c.id
WHERE c.name = 'technology' AND d.status = 'published'
ORDER BY d.embedding <=> query_embedding
LIMIT 10;

pgvector Limitations

pgvector works well under 10M vectors. Beyond that threshold, performance degrades noticeably compared to purpose-built vector DBs. If your dataset is large or growing fast, start with pgvector then migrate to Qdrant/Milvus when needed — API patterns are similar.

5. Production Architecture — RAG Pipeline with Vector Database

The most common use case in 2026: RAG (Retrieval-Augmented Generation) — augmenting LLM responses with context from your own knowledge base for better accuracy and reduced hallucination.

graph TD
    subgraph "Indexing Pipeline (Offline)"
        DOC["Documents<br/>(PDF, HTML, Markdown)"] --> CHUNK["Chunking<br/>(512-1024 tokens)"]
        CHUNK --> EMB["Embedding Model"]
        EMB --> VDB["Vector Database<br/>(Qdrant / Milvus)"]
        CHUNK --> META["Metadata Store<br/>(PostgreSQL)"]
    end

    subgraph "Query Pipeline (Online)"
        USER["User Query"] --> QEMB["Embed Query"]
        QEMB --> SEARCH["Vector Search<br/>+ Metadata Filter"]
        SEARCH --> VDB
        SEARCH --> RERANK["Reranker<br/>(Cohere / ColBERT)"]
        RERANK --> PROMPT["Prompt Assembly<br/>(Query + Context)"]
        PROMPT --> LLM["LLM<br/>(Claude / GPT)"]
        LLM --> ANS["Answer<br/>+ Citations"]
    end

    style DOC fill:#f8f9fa,stroke:#e94560,color:#2c3e50
    style CHUNK fill:#2c3e50,stroke:#fff,color:#fff
    style EMB fill:#e94560,stroke:#fff,color:#fff
    style VDB fill:#4CAF50,stroke:#fff,color:#fff
    style META fill:#f8f9fa,stroke:#2c3e50,color:#2c3e50
    style USER fill:#f8f9fa,stroke:#e94560,color:#2c3e50
    style QEMB fill:#e94560,stroke:#fff,color:#fff
    style SEARCH fill:#2c3e50,stroke:#fff,color:#fff
    style RERANK fill:#2c3e50,stroke:#fff,color:#fff
    style PROMPT fill:#e94560,stroke:#fff,color:#fff
    style LLM fill:#e94560,stroke:#fff,color:#fff
    style ANS fill:#4CAF50,stroke:#fff,color:#fff

Complete RAG pipeline: offline indexing + online query with reranking

5.1 Chunking Strategy

How you split documents into chunks directly impacts retrieval quality:

| Strategy | Description | Pros | Cons |
|---|---|---|---|
| Fixed-size | Split by fixed token count (512) | Simple, predictable | May cut mid-sentence/paragraph |
| Recursive character | Split by boundary (paragraph → sentence → word) | Better context preservation | Uneven chunk sizes |
| Semantic chunking | Use embedding similarity to detect topic boundaries | Semantically coherent chunks | Slow, expensive embedding cost |
| Document-aware | Split by structure (heading, section, table) | Preserves original structure | Needs per-format parser |
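A minimal recursive-character chunker, as a sketch (the separator list and max_len are illustrative; production chunkers count tokens rather than characters):

```python
def recursive_chunks(text, max_len=200, seps=("\n\n", "\n", ". ", " ")):
    """Split on the coarsest boundary first; recurse only on oversized pieces."""
    if len(text) <= max_len:
        return [text]
    for sep in seps:
        parts = [p for p in text.split(sep) if p]
        if len(parts) > 1:
            chunks, buf = [], ""
            for part in parts:
                candidate = (buf + sep + part) if buf else part
                if len(candidate) <= max_len:
                    buf = candidate          # greedily pack pieces together
                else:
                    if buf:
                        chunks.append(buf)
                    buf = part
            if buf:
                chunks.append(buf)
            # Recurse on any piece that still exceeds max_len.
            return [c for chunk in chunks
                      for c in recursive_chunks(chunk, max_len, seps)]
    # No separator found at all: fall back to hard character slicing.
    return [text[i:i + max_len] for i in range(0, len(text), max_len)]

doc = "Para one. Still para one.\n\nPara two is here.\n\n" + "x" * 450
chunks = recursive_chunks(doc)
```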

5.2 Hybrid Search — Combining Vector + Keyword

Vector search excels at capturing semantics ("how to speed up web" → finds articles about "performance optimization"), but struggles with exact matches (product names, error codes). Combining with BM25 keyword search yields the best results:

from qdrant_client import QdrantClient
from qdrant_client.models import (
    Prefetch, FusionQuery, Fusion, SparseVector
)

client = QdrantClient("localhost", port=6333)

# Hybrid search: dense + sparse (BM25) with Reciprocal Rank Fusion
results = client.query_points(
    collection_name="documents",
    prefetch=[
        Prefetch(
            query=dense_embedding,    # semantic search
            using="dense",
            limit=100
        ),
        Prefetch(
            query=SparseVector(       # keyword search (BM25)
                indices=sparse_indices,
                values=sparse_values
            ),
            using="sparse",
            limit=100
        ),
    ],
    query=FusionQuery(fusion=Fusion.RRF),  # Reciprocal Rank Fusion
    limit=10
)

6. Production Performance Optimization

6.1 Capacity Planning

RAM Estimation Formula for HNSW

RAM ≈ num_vectors × (dim × bytes_per_dim + M × 2 × 4 + overhead)

Example: 10M vectors × 1536-d × float32, M=16:
10M × (1536 × 4 + 16 × 2 × 4 + 100) ≈ 10M × 6372 ≈ 63.7 GB RAM

With scalar quantization (int8): 10M × (1536 × 1 + 128 + 100) ≈ 17.6 GB RAM — ~3.6x reduction.
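The formula above as a quick calculator:

```python
def hnsw_ram_gb(num_vectors, dim, bytes_per_dim=4, M=16, overhead=100):
    """Vector data + graph links (M * 2 edges * 4 bytes) + per-vector overhead."""
    per_vector = dim * bytes_per_dim + M * 2 * 4 + overhead
    return num_vectors * per_vector / 1e9

print(round(hnsw_ram_gb(10_000_000, 1536), 1))                    # 63.7 (float32)
print(round(hnsw_ram_gb(10_000_000, 1536, bytes_per_dim=1), 1))   # 17.6 (int8)
```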

6.2 Multi-Tenancy Patterns

When multiple customers (tenants) share a vector search system:

  • Collection per tenant: Full isolation, but high overhead with thousands of tenants. Suitable when each tenant has >100K vectors.
  • Partition key per tenant: Single collection, filter by tenant_id. More efficient for many small tenants. Qdrant supports payload_index on tenant_id for optimization.
  • Namespace (Pinecone): Logical separation within the same index. Zero overhead, but less isolation.

6.3 Monitoring Metrics

| Metric | Description | Target |
|---|---|---|
| p99 query latency | Latency at 99th percentile | <100ms for interactive search |
| Recall@K | % of ground-truth results in top-K | >95% for RAG |
| QPS (queries/second) | Throughput | Scale-dependent, typically 100-10K QPS |
| Index build time | Time to build index after adding data | Hours for million-scale, days for billion |
| Memory utilization | RAM usage vs provisioned | 70-85% (buffer for spikes) |

7. When to Use a Purpose-Built Vector Database?

graph TD
    START["Need vector search?"] -->|Yes| Q1["Dataset > 10M vectors?"]
    Q1 -->|No| Q2["Already have PostgreSQL?"]
    Q2 -->|Yes| PG["pgvector<br/>No extra infra"]
    Q2 -->|No| Q3["Prototyping?"]
    Q3 -->|Yes| CHROMA["Chroma<br/>Embed in-process"]
    Q3 -->|No| QDRANT["Qdrant<br/>Self-host or cloud"]
    Q1 -->|Yes| Q4["Need disaggregated<br/>scaling?"]
    Q4 -->|Yes| MILVUS["Milvus<br/>Separate compute/storage"]
    Q4 -->|No| Q5["Complex filtering?"]
    Q5 -->|Yes| QDRANT2["Qdrant<br/>ACORN filtering"]
    Q5 -->|No| Q6["Zero ops?"]
    Q6 -->|Yes| PINE["Pinecone<br/>Fully managed"]
    Q6 -->|No| WEAV["Weaviate<br/>Hybrid search"]

    style START fill:#e94560,stroke:#fff,color:#fff
    style Q1 fill:#2c3e50,stroke:#fff,color:#fff
    style Q2 fill:#2c3e50,stroke:#fff,color:#fff
    style Q3 fill:#2c3e50,stroke:#fff,color:#fff
    style Q4 fill:#2c3e50,stroke:#fff,color:#fff
    style Q5 fill:#2c3e50,stroke:#fff,color:#fff
    style Q6 fill:#2c3e50,stroke:#fff,color:#fff
    style PG fill:#4CAF50,stroke:#fff,color:#fff
    style CHROMA fill:#4CAF50,stroke:#fff,color:#fff
    style QDRANT fill:#4CAF50,stroke:#fff,color:#fff
    style MILVUS fill:#4CAF50,stroke:#fff,color:#fff
    style QDRANT2 fill:#4CAF50,stroke:#fff,color:#fff
    style PINE fill:#4CAF50,stroke:#fff,color:#fff
    style WEAV fill:#4CAF50,stroke:#fff,color:#fff

Decision tree for choosing a vector database based on practical needs

Trends to watch:

  • Quantization-Aware Training: Embedding models are now trained to work well with binary/scalar quantization out of the box. Cohere embed-v4 and Nomic embed-v2 lead the way: recall loss drops from 10-20% to 2-5% with binary quantization.
  • Disaggregated Architecture: Separating compute and storage is becoming standard. Milvus, Weaviate Cloud, and Pinecone serverless all follow this pattern, reducing idle costs and enabling elastic scaling.
  • Multi-Vector & Late Interaction: Instead of one vector per document, use multiple vectors (per token or paragraph). ColBERT and ColPali represent this trend, with significantly higher recall for long documents.
  • Serverless Vector Search: Pinecone serverless, Qdrant Cloud, Turbopuffer: pay only per query, no cluster provisioning needed. Lowers the barrier for startups and side projects.
  • Hybrid Database Convergence: PostgreSQL (pgvector), Elasticsearch 9, and Redis 8 all add powerful vector search. The boundary between "vector DB" and "traditional DB with vector support" is increasingly blurred.

Conclusion

A vector database is not a silver bullet — it's one component in the semantic search pipeline. Understanding the trade-offs between HNSW (fast, RAM-hungry) and IVF (memory-efficient, needs training), knowing when to apply quantization (PQ for billion-scale, SQ for balance), and choosing the right database for your use case (pgvector for simplicity, Qdrant for filtering, Milvus for scale) — that's the key to building effective AI search systems in production.
