Vector Database — Semantic Search Architecture for AI
Posted on: 4/22/2026 5:12:31 AM
Table of contents
- 1. Embeddings — From Raw Data to Vector Space
- 2. Approximate Nearest Neighbor (ANN) — Why Not Brute Force?
- 3. Quantization — Compressing Vectors Without Losing Much Quality
- 4. Vector Database Comparison 2026
- 5. Production Architecture — RAG Pipeline with Vector Database
- 6. Production Performance Optimization
- 7. When to Use a Purpose-Built Vector Database?
- 8. 2026 Trends
- Conclusion
As AI applications increasingly rely on semantic search — from RAG (Retrieval-Augmented Generation) to recommendation systems — the question is no longer "should we use a vector database?" but "which one, with what indexing strategy, and how to deploy it?" This article dives deep into vector database internals: how embeddings work, indexing algorithms (HNSW, IVF, LSH), quantization techniques for memory reduction, and a detailed comparison of the most popular solutions in 2026.
1. Embeddings — From Raw Data to Vector Space
Before discussing databases, we need to understand embeddings — the process of converting unstructured data (text, images, audio) into numerical vectors in high-dimensional space. Each vector represents the "semantic meaning" of the original data.
graph LR
A["Text: 'Caching reduces
latency'"] --> B["Embedding Model
(OpenAI, Cohere, BGE)"]
B --> C["Vector
[0.12, -0.45, 0.78, ..., 0.33]
1536 dimensions"]
D["Text: 'Cache speeds
up queries'"] --> E["Embedding Model"]
E --> F["Vector
[0.11, -0.43, 0.76, ..., 0.31]
1536 dimensions"]
C --> G["Cosine Similarity
= 0.97 (very close)"]
F --> G
style A fill:#f8f9fa,stroke:#e94560,color:#2c3e50
style D fill:#f8f9fa,stroke:#e94560,color:#2c3e50
style B fill:#e94560,stroke:#fff,color:#fff
style E fill:#e94560,stroke:#fff,color:#fff
style C fill:#f8f9fa,stroke:#2c3e50,color:#2c3e50
style F fill:#f8f9fa,stroke:#2c3e50,color:#2c3e50
style G fill:#4CAF50,stroke:#fff,color:#fff
Two sentences with different wording but close in vector space because they share the same semantics
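In code, the similarity shown in the diagram boils down to a dot product of L2-normalized vectors. A minimal NumPy sketch, using toy 4-d vectors in place of real 1536-d model output:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity: dot product divided by the vectors' L2 norms."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 4-d "embeddings" standing in for 1536-d model output.
v_cache   = np.array([0.12, -0.45, 0.78, 0.33])   # "Caching reduces latency"
v_speedup = np.array([0.11, -0.43, 0.76, 0.31])   # "Cache speeds up queries"
v_other   = np.array([-0.80, 0.20, -0.10, 0.55])  # unrelated sentence

print(cosine_similarity(v_cache, v_speedup))  # close to 1.0 (same semantics)
print(cosine_similarity(v_cache, v_other))    # much lower
```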
Popular embedding models in 2026:
| Model | Dimensions | Use Case | Key Feature |
|---|---|---|---|
| OpenAI text-embedding-3-large | 3072 | General purpose | Matryoshka support (flexible dimension reduction) |
| Cohere embed-v4 | 1024 | Multilingual search | 100+ languages, built-in binary quantization |
| BGE-M3 | 1024 | Hybrid (dense + sparse) | Open-source, multi-granularity retrieval |
| Voyage-3 | 1024 | Code & technical docs | Optimized for code search |
| GTE-Qwen2 | 768–8192 | Long context embedding | Supports up to 128K tokens |
Matryoshka Representation Learning (MRL)
A recent technique that allows embedding models to produce "nested" vectors — you can truncate a 3072-d vector down to 256-d while retaining ~90% quality. Extremely useful for saving memory at the cache/filter layer before reranking with full vectors.
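Mechanically, MRL truncation is just slicing and re-normalizing. A sketch with a random vector standing in for a real Matryoshka embedding (real MRL models front-load information into the leading dimensions; a random vector only demonstrates the mechanics):

```python
import numpy as np

def truncate_mrl(vec: np.ndarray, dim: int) -> np.ndarray:
    """Keep the first `dim` components of a Matryoshka embedding and
    re-normalize so cosine similarity remains well-defined."""
    head = vec[:dim]
    return head / np.linalg.norm(head)

full = np.random.default_rng(42).normal(size=3072)  # stand-in for a 3072-d embedding
small = truncate_mrl(full, 256)
print(small.shape)  # (256,)
```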
2. Approximate Nearest Neighbor (ANN) — Why Not Brute Force?
Exact nearest neighbor search over 1 million 1536-d vectors requires ~1.5 billion floating-point operations per query — tens to hundreds of milliseconds even with optimized SIMD code. With 1 billion vectors, interactive latency becomes infeasible. The solution: Approximate Nearest Neighbor (ANN) — accept recall <100% in exchange for a 100-1000x speedup.
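For reference, the brute-force baseline that ANN replaces is a single matrix-vector product plus a sort — exact, but O(N × D) work per query:

```python
import numpy as np

def brute_force_topk(corpus: np.ndarray, query: np.ndarray, k: int) -> np.ndarray:
    """Exact top-k by scoring every vector: O(N x D) multiply-adds per query."""
    # Rows are assumed L2-normalized, so dot product equals cosine similarity.
    scores = corpus @ query
    return np.argsort(-scores)[:k]

rng = np.random.default_rng(0)
corpus = rng.normal(size=(10_000, 128))
corpus /= np.linalg.norm(corpus, axis=1, keepdims=True)

top = brute_force_topk(corpus, corpus[123], k=5)
print(top[0])  # 123 -- a vector's nearest neighbor is itself
```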
2.1 HNSW (Hierarchical Navigable Small World)
HNSW is the most popular ANN algorithm today, used as the default in most vector databases. Core idea: build a multi-layer hierarchical graph where each node is a vector and each edge connects "nearby" vectors.
graph TD
subgraph "Layer 2 (Sparse - Long-range links)"
L2A["Node A"] --- L2D["Node D"]
L2D --- L2G["Node G"]
end
subgraph "Layer 1 (Medium density)"
L1A["Node A"] --- L1B["Node B"]
L1B --- L1D["Node D"]
L1D --- L1F["Node F"]
L1F --- L1G["Node G"]
end
subgraph "Layer 0 (Dense - All nodes)"
L0A["Node A"] --- L0B["Node B"]
L0B --- L0C["Node C"]
L0C --- L0D["Node D"]
L0D --- L0E["Node E"]
L0E --- L0F["Node F"]
L0F --- L0G["Node G"]
L0A --- L0C
L0B --- L0D
L0E --- L0G
end
L2A -.-> L1A
L2D -.-> L1D
L2G -.-> L1G
L1A -.-> L0A
L1B -.-> L0B
L1D -.-> L0D
L1F -.-> L0F
L1G -.-> L0G
style L2A fill:#e94560,stroke:#fff,color:#fff
style L2D fill:#e94560,stroke:#fff,color:#fff
style L2G fill:#e94560,stroke:#fff,color:#fff
style L1A fill:#2c3e50,stroke:#fff,color:#fff
style L1B fill:#2c3e50,stroke:#fff,color:#fff
style L1D fill:#2c3e50,stroke:#fff,color:#fff
style L1F fill:#2c3e50,stroke:#fff,color:#fff
style L1G fill:#2c3e50,stroke:#fff,color:#fff
style L0A fill:#f8f9fa,stroke:#e94560,color:#2c3e50
style L0B fill:#f8f9fa,stroke:#e94560,color:#2c3e50
style L0C fill:#f8f9fa,stroke:#e94560,color:#2c3e50
style L0D fill:#f8f9fa,stroke:#e94560,color:#2c3e50
style L0E fill:#f8f9fa,stroke:#e94560,color:#2c3e50
style L0F fill:#f8f9fa,stroke:#e94560,color:#2c3e50
style L0G fill:#f8f9fa,stroke:#e94560,color:#2c3e50
HNSW structure: search starts at the top layer (sparse), progressively "zooms in" to layer 0 (dense) for more accurate nearest neighbors
Key HNSW parameters:
- M (max connections per node): higher → better recall but more RAM. Default typically 16–64.
- efConstruction: number of candidates considered during graph construction. Higher → better index quality but slower build.
- efSearch: number of candidates considered during a query. This is the main "knob" for the recall vs latency trade-off at runtime.
Complexity Analysis
Build time: O(N × log(N)) — each vector insert traverses the existing graph.
Query time: O(log(N)) — hops through layers from sparse to dense.
Memory: O(N × M × D) — stores both vectors and adjacency lists. This is the biggest drawback: 1 billion vectors × 1536-d × float32 ≈ 6TB for data alone, not counting graph structure.
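The routing HNSW performs inside each layer can be sketched as a greedy descent: hop to whichever neighbor is closest to the query until no neighbor improves. A toy single-layer version (the real algorithm keeps a beam of efSearch candidates and repeats this from the top layer down to layer 0):

```python
def l2(a, b):
    """Squared Euclidean distance between two tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def greedy_search(graph, vectors, query, entry, dist):
    """Greedy routing within one layer: move to the neighbor closest to the
    query until no neighbor improves -- a local minimum of the distance."""
    current = entry
    current_d = dist(vectors[current], query)
    improved = True
    while improved:
        improved = False
        for nb in graph[current]:
            d = dist(vectors[nb], query)
            if d < current_d:
                current, current_d = nb, d
                improved = True
    return current

# Tiny 1-d "index": 10 points on a line, each linked to its immediate neighbors.
vectors = {i: (float(i),) for i in range(10)}
graph = {i: [j for j in (i - 1, i + 1) if 0 <= j < 10] for i in range(10)}

print(greedy_search(graph, vectors, (7.2,), entry=0, dist=l2))  # 7
```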
2.2 IVF (Inverted File Index)
IVF partitions the vector space into nlist clusters using k-means. At query time, only the nearest nprobe clusters are searched instead of the entire dataset.
graph TD
Q["Query Vector"] --> R["Find nprobe=3
nearest clusters"]
R --> C1["Cluster 1
50K vectors"]
R --> C3["Cluster 3
45K vectors"]
R --> C5["Cluster 5
52K vectors"]
C2["Cluster 2
48K vectors"]
C4["Cluster 4
55K vectors"]
C1 --> RES["Top-K Results"]
C3 --> RES
C5 --> RES
style Q fill:#e94560,stroke:#fff,color:#fff
style R fill:#2c3e50,stroke:#fff,color:#fff
style C1 fill:#4CAF50,stroke:#fff,color:#fff
style C3 fill:#4CAF50,stroke:#fff,color:#fff
style C5 fill:#4CAF50,stroke:#fff,color:#fff
style C2 fill:#f8f9fa,stroke:#e0e0e0,color:#999
style C4 fill:#f8f9fa,stroke:#e0e0e0,color:#999
style RES fill:#e94560,stroke:#fff,color:#fff
IVF only scans clusters near the query (green), skipping the rest (grey) — reducing vectors to compare by 90%+
IVF advantages: Uses less RAM than HNSW since it only loads needed clusters. Suitable for datasets too large for RAM.
Disadvantages: Requires a training step (k-means) before indexing. Recall depends on clustering quality.
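A toy IVF index fits in a few dozen lines of NumPy — k-means for the coarse quantizer, one inverted list per centroid, and an nprobe-limited scan at query time. Illustrative only; production implementations (FAISS, Milvus) are far more optimized:

```python
import numpy as np

def build_ivf(corpus, nlist, iters=10, seed=0):
    """Toy IVF: k-means training, then one inverted list per centroid."""
    rng = np.random.default_rng(seed)
    centroids = corpus[rng.choice(len(corpus), nlist, replace=False)]
    for _ in range(iters):
        assign = np.argmin(
            np.linalg.norm(corpus[:, None] - centroids[None], axis=2), axis=1)
        for c in range(nlist):
            members = corpus[assign == c]
            if len(members):
                centroids[c] = members.mean(axis=0)
    # Final assignment with the trained centroids.
    assign = np.argmin(
        np.linalg.norm(corpus[:, None] - centroids[None], axis=2), axis=1)
    lists = {c: np.where(assign == c)[0] for c in range(nlist)}
    return centroids, lists

def ivf_search(corpus, centroids, lists, query, nprobe, k):
    """Scan only the nprobe clusters whose centroids are nearest the query."""
    nearest = np.argsort(np.linalg.norm(centroids - query, axis=1))[:nprobe]
    cand = np.concatenate([lists[c] for c in nearest])
    dists = np.linalg.norm(corpus[cand] - query, axis=1)
    return cand[np.argsort(dists)[:k]]

rng = np.random.default_rng(1)
corpus = rng.normal(size=(2_000, 32))
centroids, lists = build_ivf(corpus, nlist=20)
print(ivf_search(corpus, centroids, lists, corpus[42], nprobe=3, k=3)[0])  # 42
```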
2.3 HNSW vs IVF Comparison
| Criteria | HNSW | IVF |
|---|---|---|
| Query speed | Very fast (sub-ms) | Fast (1-10ms) |
| Memory | High (stores graph + vectors) | Lower (only centroids + vectors) |
| Build time | Slow (sequential insert) | Faster (batch k-means) |
| Updates | Supports realtime insert/delete | Needs re-training when data distribution shifts |
| Scale | Good up to ~100M vectors | Better for 1B+ vectors |
| Recall@10 | 95-99% (ef tuning) | 90-97% (nprobe tuning) |
3. Quantization — Compressing Vectors Without Losing Much Quality
1 billion vectors × 1536 dimensions × 4 bytes (float32) = ~6TB RAM. Quantization solves this by compressing vectors into smaller representations.
3.1 Scalar Quantization (SQ)
Converts each dimension from float32 (4 bytes) to int8 (1 byte). 4x memory reduction, ~2-5% recall loss.
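A minimal sketch of the idea: calibrate each dimension's observed min/max range, then map it onto 256 integer levels (here uint8 for simplicity; real engines typically use signed int8 with similar calibration):

```python
import numpy as np

def sq_encode(vecs: np.ndarray):
    """Per-dimension scalar quantization: map each float32 dimension's
    observed [min, max] range onto 256 integer levels."""
    lo = vecs.min(axis=0)
    scale = (vecs.max(axis=0) - lo) / 255.0
    codes = np.round((vecs - lo) / scale).astype(np.uint8)
    return codes, lo, scale

def sq_decode(codes, lo, scale):
    """Approximate reconstruction; error is bounded by ~scale/2 per dimension."""
    return codes.astype(np.float32) * scale + lo

rng = np.random.default_rng(0)
vecs = rng.normal(size=(1_000, 64)).astype(np.float32)
codes, lo, scale = sq_encode(vecs)
recon = sq_decode(codes, lo, scale)

print(codes.nbytes / vecs.nbytes)          # 0.25 -> exactly 4x smaller
print(float(np.abs(vecs - recon).max()))   # small, bounded by scale/2
```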
3.2 Product Quantization (PQ)
Splits each vector into m subvectors, each quantized independently using its own codebook. 4-32x memory reduction.
Product Quantization Example
A 1536-d vector is split into 192 subvectors × 8-d each. Each subvector maps to a codebook of 256 entries (8-bit). Result: instead of storing 1536 × 4 = 6144 bytes, only 192 × 1 = 192 bytes — a 32x reduction.
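The same mechanism in miniature: 64-d vectors split into 8 subvectors of 8-d, each replaced by a 1-byte codebook index — the same 32x ratio as the example above (toy, unoptimized k-means; FAISS-style implementations use far faster training):

```python
import numpy as np

def pq_train(vecs, m, ncodes=256, iters=5, seed=0):
    """Train one k-means codebook per subvector block."""
    rng = np.random.default_rng(seed)
    books = []
    for block in np.split(vecs, m, axis=1):
        cb = block[rng.choice(len(block), ncodes, replace=False)]
        for _ in range(iters):
            a = np.argmin(np.linalg.norm(block[:, None] - cb[None], axis=2), axis=1)
            for c in range(ncodes):
                if (a == c).any():
                    cb[c] = block[a == c].mean(axis=0)
        books.append(cb)
    return books

def pq_encode(vecs, books):
    """Replace each subvector with the index of its nearest codeword (1 byte)."""
    codes = [np.argmin(np.linalg.norm(s[:, None] - b[None], axis=2), axis=1)
             for s, b in zip(np.split(vecs, len(books), axis=1), books)]
    return np.stack(codes, axis=1).astype(np.uint8)

rng = np.random.default_rng(0)
vecs = rng.normal(size=(2_000, 64)).astype(np.float32)
books = pq_train(vecs, m=8)
codes = pq_encode(vecs, books)
print(codes.shape, vecs.nbytes // codes.nbytes)  # (2000, 8) 32
```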
3.3 Binary Quantization
The most extreme: each dimension keeps only 1 bit (positive = 1, negative = 0). A 1536-d vector → 192 bytes. Hamming distance replaces cosine similarity — extremely fast on modern CPUs (POPCNT instruction). Cohere embed-v4 is specifically designed to work well with binary quantization.
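Binary quantization is simple enough to sketch directly with NumPy's bit-packing (NumPy's `unpackbits` stands in here for the hardware POPCNT a real engine would use):

```python
import numpy as np

def binarize(vecs: np.ndarray) -> np.ndarray:
    """1 bit per dimension: positive -> 1, else 0, packed 8 dims per byte."""
    return np.packbits(vecs > 0, axis=1)

def hamming(a: np.ndarray, b: np.ndarray) -> int:
    """Hamming distance between two packed bit vectors (XOR + popcount)."""
    return int(np.unpackbits(a ^ b).sum())

rng = np.random.default_rng(0)
v = rng.normal(size=1536).astype(np.float32)
near = v + rng.normal(scale=0.1, size=1536).astype(np.float32)  # small perturbation
far = rng.normal(size=1536).astype(np.float32)                  # unrelated vector

bv, bnear, bfar = (binarize(x[None])[0] for x in (v, near, far))
print(bv.nbytes)                               # 192 bytes vs 6144 for float32
print(hamming(bv, bnear) < hamming(bv, bfar))  # True
```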
| Technique | Compression | Recall Loss | Speed Gain | Use Case |
|---|---|---|---|---|
| No compression (float32) | 1x | 0% | Baseline | Small datasets, maximum recall required |
| Scalar (int8) | 4x | 2-5% | 2-3x | Best balance for most use cases |
| Product (PQ) | 8-32x | 5-15% | 5-10x | Billion-scale, disk-based index |
| Binary | 32x | 10-20% | 20-40x | Pre-filter / first-stage retrieval |
Two-stage Retrieval Pattern
In production, many systems use binary/PQ quantization to quickly filter top-1000 candidates, then rerank with full-precision vectors for the final top-10. This pattern combines quantization speed with exact distance quality.
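The pattern described above can be sketched end to end: a cheap Hamming scan over binary codes picks the candidates, then exact cosine scores rerank only those (a simplified illustration; real systems run the first stage inside the index):

```python
import numpy as np

def two_stage_search(corpus, packed, query, first_k=100, final_k=10):
    """Stage 1: Hamming scan over packed binary codes (cheap, approximate).
    Stage 2: exact cosine rerank of the surviving candidates only."""
    q_bits = np.packbits(query > 0)
    # XOR then popcount per corpus row, vectorized via unpackbits.
    ham = np.unpackbits(packed ^ q_bits, axis=1).sum(axis=1)
    cand = np.argsort(ham)[:first_k]
    exact = corpus[cand] @ query  # rows pre-normalized -> cosine similarity
    return cand[np.argsort(-exact)[:final_k]]

rng = np.random.default_rng(0)
corpus = rng.normal(size=(5_000, 256)).astype(np.float32)
corpus /= np.linalg.norm(corpus, axis=1, keepdims=True)
packed = np.packbits(corpus > 0, axis=1)

hits = two_stage_search(corpus, packed, corpus[7])
print(hits[0])  # 7 -- the query vector itself survives both stages
```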
4. Vector Database Comparison 2026
graph TB
subgraph "Purpose-Built Vector DB"
Q["Qdrant
Rust, HNSW+"]
M["Milvus
Go/C++, Disaggregated"]
W["Weaviate
Go, Knowledge Graph"]
P["Pinecone
Managed, Serverless"]
CH["Chroma
Python, Lightweight"]
end
subgraph "Vector Extensions on Traditional DBs"
PG["pgvector
PostgreSQL"]
ES["Elasticsearch 9
kNN + BM25"]
RE["Redis 8
Vector Search"]
end
style Q fill:#e94560,stroke:#fff,color:#fff
style M fill:#2c3e50,stroke:#fff,color:#fff
style W fill:#4CAF50,stroke:#fff,color:#fff
style P fill:#e94560,stroke:#fff,color:#fff
style CH fill:#2c3e50,stroke:#fff,color:#fff
style PG fill:#f8f9fa,stroke:#e94560,color:#2c3e50
style ES fill:#f8f9fa,stroke:#e94560,color:#2c3e50
style RE fill:#f8f9fa,stroke:#e94560,color:#2c3e50
Two main categories: purpose-built vector DBs and extensions on traditional databases
| Database | Language | Index | Filtering | Scale | Key Differentiator |
|---|---|---|---|---|---|
| Qdrant | Rust | HNSW + quantization | Filterable HNSW (in-graph filter) | 100M+ vectors | Sub-100ms at 100M vectors, 95% recall. Fast filtering via filter-aware HNSW traversal |
| Milvus | Go/C++ | HNSW, IVF, DiskANN | Post-filter | 1B+ vectors | Disaggregated architecture (separate compute/storage). GPU-accelerated indexing. Strongest horizontal scaling |
| Weaviate | Go | HNSW + flat | Pre-filter + vector | 500M+ vectors | Hybrid search (vector + keyword). Knowledge graph integration. Rich module ecosystem |
| Pinecone | Managed | Proprietary | Metadata filter | 1B+ vectors | Fully managed serverless. Zero ops. Ideal for small teams wanting to ship fast |
| pgvector | C (PG ext) | HNSW, IVF | SQL WHERE clause | 10-50M vectors | Use your existing PostgreSQL. No extra infra. Best when vector search is a secondary feature |
| Chroma | Python/Rust | HNSW | Metadata filter | 1-10M vectors | Developer-friendly, embed in process. Great for prototyping and small-scale RAG |
4.1 Qdrant — Superior Filtering Performance
Qdrant's biggest strength is its filterable HNSW index — instead of filtering before search (pre-filter) or after search (post-filter), filter conditions are evaluated during the HNSW graph traversal itself, with additional graph links keeping the index navigable under selective filters. Result: filtered queries remain fast even when filters eliminate 99% of candidates — something pre/post-filter systems struggle with.
4.2 Milvus — Disaggregated Architecture for Billions of Vectors
Milvus completely separates four tiers: access layer (proxy), coordinator (metadata), worker nodes (query/data/index), and storage (object store + message queue). Each tier scales independently — you can add query nodes without affecting indexing, or vice versa.
graph TD
Client["Client SDK"] --> Proxy["Access Layer
(Proxy / Load Balancer)"]
Proxy --> QN["Query Nodes
(Search)"]
Proxy --> DN["Data Nodes
(Insert/Delete)"]
Proxy --> IN["Index Nodes
(Build Index)"]
Coord["Coordinator
(Root + Query + Data + Index Coord)"] --> QN
Coord --> DN
Coord --> IN
QN --> S3["Object Storage
(S3 / MinIO)"]
DN --> MQ["Message Queue
(Pulsar / Kafka)"]
IN --> S3
MQ --> S3
style Client fill:#f8f9fa,stroke:#e94560,color:#2c3e50
style Proxy fill:#e94560,stroke:#fff,color:#fff
style QN fill:#4CAF50,stroke:#fff,color:#fff
style DN fill:#4CAF50,stroke:#fff,color:#fff
style IN fill:#4CAF50,stroke:#fff,color:#fff
style Coord fill:#2c3e50,stroke:#fff,color:#fff
style S3 fill:#f8f9fa,stroke:#2c3e50,color:#2c3e50
style MQ fill:#f8f9fa,stroke:#2c3e50,color:#2c3e50
Milvus disaggregated architecture: each tier scales independently, storage uses object store (S3/MinIO)
4.3 pgvector — When You Already Have PostgreSQL
If your application already uses PostgreSQL and vector search is just a supplementary feature (e.g., similar product search, recommendation widget), pgvector is the most practical choice. No additional services, no data sync, direct JOINs with business logic tables.
-- Create table with vector column
CREATE TABLE documents (
id SERIAL PRIMARY KEY,
content TEXT,
embedding vector(1536)
);
-- Create HNSW index
CREATE INDEX ON documents
USING hnsw (embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 200);
-- Semantic search + business logic filter in a single query
-- ($1 is the query embedding passed as a bind parameter, e.g. '[0.12, -0.45, ...]'::vector)
SELECT d.id, d.content, 1 - (d.embedding <=> $1) AS similarity
FROM documents d
JOIN categories c ON d.category_id = c.id
WHERE c.name = 'technology' AND d.status = 'published'
ORDER BY d.embedding <=> $1
LIMIT 10;
pgvector Limitations
pgvector works well under 10M vectors. Beyond that threshold, performance degrades noticeably compared to purpose-built vector DBs. If your dataset is large or growing fast, start with pgvector then migrate to Qdrant/Milvus when needed — API patterns are similar.
5. Production Architecture — RAG Pipeline with Vector Database
The most common use case in 2026: RAG (Retrieval-Augmented Generation) — augmenting LLM responses with context from your own knowledge base for better accuracy and reduced hallucination.
graph TD
subgraph "Indexing Pipeline (Offline)"
DOC["Documents
(PDF, HTML, Markdown)"] --> CHUNK["Chunking
(512-1024 tokens)"]
CHUNK --> EMB["Embedding Model"]
EMB --> VDB["Vector Database
(Qdrant / Milvus)"]
CHUNK --> META["Metadata Store
(PostgreSQL)"]
end
subgraph "Query Pipeline (Online)"
USER["User Query"] --> QEMB["Embed Query"]
QEMB --> SEARCH["Vector Search
+ Metadata Filter"]
SEARCH --> VDB
SEARCH --> RERANK["Reranker
(Cohere / ColBERT)"]
RERANK --> PROMPT["Prompt Assembly
(Query + Context)"]
PROMPT --> LLM["LLM
(Claude / GPT)"]
LLM --> ANS["Answer
+ Citations"]
end
style DOC fill:#f8f9fa,stroke:#e94560,color:#2c3e50
style CHUNK fill:#2c3e50,stroke:#fff,color:#fff
style EMB fill:#e94560,stroke:#fff,color:#fff
style VDB fill:#4CAF50,stroke:#fff,color:#fff
style META fill:#f8f9fa,stroke:#2c3e50,color:#2c3e50
style USER fill:#f8f9fa,stroke:#e94560,color:#2c3e50
style QEMB fill:#e94560,stroke:#fff,color:#fff
style SEARCH fill:#2c3e50,stroke:#fff,color:#fff
style RERANK fill:#2c3e50,stroke:#fff,color:#fff
style PROMPT fill:#e94560,stroke:#fff,color:#fff
style LLM fill:#e94560,stroke:#fff,color:#fff
style ANS fill:#4CAF50,stroke:#fff,color:#fff
Complete RAG pipeline: offline indexing + online query with reranking
5.1 Chunking Strategy
How you split documents into chunks directly impacts retrieval quality:
| Strategy | Description | Pros | Cons |
|---|---|---|---|
| Fixed-size | Split by fixed token count (512) | Simple, predictable | May cut mid-sentence/paragraph |
| Recursive character | Split by boundary (paragraph → sentence → word) | Better context preservation | Uneven chunk sizes |
| Semantic chunking | Use embedding similarity to detect topic boundaries | Semantically coherent chunks | Slow, expensive embedding cost |
| Document-aware | Split by structure (heading, section, table) | Preserves original structure | Needs per-format parser |
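The simplest strategy from the table, fixed-size chunking, is usually extended with an overlap so boundary context appears in both neighboring chunks. A sketch (words stand in for model tokens here; real pipelines count tokenizer tokens):

```python
def chunk_fixed(tokens, size=512, overlap=64):
    """Fixed-size chunking with overlap: consecutive chunks share
    `overlap` tokens so no boundary context is lost."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than chunk size")
    step = size - overlap
    return [tokens[i:i + size]
            for i in range(0, max(len(tokens) - overlap, 1), step)]

doc = [f"w{i}" for i in range(1000)]
chunks = chunk_fixed(doc)
print(len(chunks), chunks[1][0], chunks[-1][-1])  # 3 w448 w999
```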
5.2 Hybrid Search — Combining Vector + Keyword
Vector search excels at capturing semantics ("how to speed up web" → finds articles about "performance optimization"), but struggles with exact matches (product names, error codes). Combining with BM25 keyword search yields the best results:
from qdrant_client import QdrantClient
from qdrant_client.models import (
Prefetch, FusionQuery, Fusion, SparseVector
)
# dense_embedding, sparse_indices, sparse_values are assumed to be
# precomputed by your embedding model and BM25/sparse encoder (omitted here)
client = QdrantClient("localhost", port=6333)
# Hybrid search: dense + sparse (BM25) with Reciprocal Rank Fusion
results = client.query_points(
collection_name="documents",
prefetch=[
Prefetch(
query=dense_embedding, # semantic search
using="dense",
limit=100
),
Prefetch(
query=SparseVector( # keyword search (BM25)
indices=sparse_indices,
values=sparse_values
),
using="sparse",
limit=100
),
],
query=FusionQuery(fusion=Fusion.RRF), # Reciprocal Rank Fusion
limit=10
)
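Under the hood, Reciprocal Rank Fusion is a simple formula: each document's score is the sum of 1/(k + rank) over every result list it appears in, where k = 60 is the constant from the original RRF paper (whether Fusion.RRF uses exactly this constant is an implementation detail of Qdrant):

```python
def rrf_fuse(rankings, k=60):
    """Reciprocal Rank Fusion: score(d) = sum over result lists of 1 / (k + rank)."""
    scores = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

dense_hits  = ["doc3", "doc1", "doc7"]   # semantic ranking
sparse_hits = ["doc1", "doc9", "doc3"]   # BM25 ranking
print(rrf_fuse([dense_hits, sparse_hits]))  # doc1 first: ranked high in both lists
```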
6. Production Performance Optimization
6.1 Capacity Planning
RAM Estimation Formula for HNSW
RAM ≈ num_vectors × (dim × bytes_per_dim + M × 2 × 4 + overhead)
Example: 10M vectors × 1536-d × float32, M=16:
10M × (1536 × 4 + 16 × 2 × 4 + 100) ≈ 10M × 6372 ≈ 63.7 GB RAM
With scalar quantization (int8): 10M × (1536 × 1 + 128 + 100) ≈ 17.6 GB RAM — ~3.6x reduction.
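The estimation formula above translates directly into a helper you can use during capacity planning:

```python
def hnsw_ram_bytes(num_vectors, dim, bytes_per_dim=4, m=16, overhead=100):
    """RAM estimate for an HNSW index: raw vectors + bidirectional graph
    links (M x 2 links x 4-byte ids) + fixed per-vector overhead."""
    return num_vectors * (dim * bytes_per_dim + m * 2 * 4 + overhead)

print(hnsw_ram_bytes(10_000_000, 1536) / 1e9)                    # 63.72 GB
print(hnsw_ram_bytes(10_000_000, 1536, bytes_per_dim=1) / 1e9)   # 17.64 GB with int8
```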
6.2 Multi-Tenancy Patterns
When multiple customers (tenants) share a vector search system:
- Collection per tenant: full isolation, but high overhead with thousands of tenants. Suitable when each tenant has >100K vectors.
- Partition key per tenant: single collection, filter by tenant_id. More efficient for many small tenants. Qdrant supports a payload index on tenant_id for optimization.
- Namespace (Pinecone): logical separation within the same index. Zero overhead, but less isolation.
6.3 Monitoring Metrics
| Metric | Description | Target |
|---|---|---|
| p99 query latency | Latency at 99th percentile | <100ms for interactive search |
| Recall@K | % ground-truth results in top-K | >95% for RAG |
| QPS (queries/second) | Throughput | Scale-dependent, typically 100-10K QPS |
| Index build time | Time to build index after adding data | Hours for million-scale, days for billion |
| Memory utilization | RAM usage vs provisioned | 70-85% (buffer for spikes) |
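Of these metrics, Recall@K is the one you compute yourself, by comparing the ANN results against brute-force ground truth on a sample of queries:

```python
def recall_at_k(retrieved, ground_truth, k):
    """Share of the true top-k neighbors that the ANN index returned in its top-k."""
    return len(set(retrieved[:k]) & set(ground_truth[:k])) / k

ann_top10   = [4, 8, 15, 16, 23, 42, 7, 99, 3, 12]   # from the ANN index
exact_top10 = [4, 8, 15, 16, 23, 42, 5, 99, 3, 12]   # from brute-force ground truth
print(recall_at_k(ann_top10, exact_top10, k=10))  # 0.9
```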
7. When to Use a Purpose-Built Vector Database?
graph TD
START["Need vector search?"] -->|Yes| Q1["Dataset > 10M vectors?"]
Q1 -->|No| Q2["Already have PostgreSQL?"]
Q2 -->|Yes| PG["pgvector
No extra infra"]
Q2 -->|No| Q3["Prototyping?"]
Q3 -->|Yes| CHROMA["Chroma
Embed in-process"]
Q3 -->|No| QDRANT["Qdrant
Self-host or cloud"]
Q1 -->|Yes| Q4["Need disaggregated
scaling?"]
Q4 -->|Yes| MILVUS["Milvus
Separate compute/storage"]
Q4 -->|No| Q5["Complex filtering?"]
Q5 -->|Yes| QDRANT2["Qdrant
In-graph filtering"]
Q5 -->|No| Q6["Zero ops?"]
Q6 -->|Yes| PINE["Pinecone
Fully managed"]
Q6 -->|No| WEAV["Weaviate
Hybrid search"]
style START fill:#e94560,stroke:#fff,color:#fff
style Q1 fill:#2c3e50,stroke:#fff,color:#fff
style Q2 fill:#2c3e50,stroke:#fff,color:#fff
style Q3 fill:#2c3e50,stroke:#fff,color:#fff
style Q4 fill:#2c3e50,stroke:#fff,color:#fff
style Q5 fill:#2c3e50,stroke:#fff,color:#fff
style Q6 fill:#2c3e50,stroke:#fff,color:#fff
style PG fill:#4CAF50,stroke:#fff,color:#fff
style CHROMA fill:#4CAF50,stroke:#fff,color:#fff
style QDRANT fill:#4CAF50,stroke:#fff,color:#fff
style MILVUS fill:#4CAF50,stroke:#fff,color:#fff
style QDRANT2 fill:#4CAF50,stroke:#fff,color:#fff
style PINE fill:#4CAF50,stroke:#fff,color:#fff
style WEAV fill:#4CAF50,stroke:#fff,color:#fff
Decision tree for choosing a vector database based on practical needs
8. 2026 Trends
Conclusion
A vector database is not a silver bullet — it's one component in the semantic search pipeline. Understanding the trade-offs between HNSW (fast, RAM-hungry) and IVF (memory-efficient, needs training), knowing when to apply quantization (PQ for billion-scale, SQ for balance), and choosing the right database for your use case (pgvector for simplicity, Qdrant for filtering, Milvus for scale) — that's the key to building effective AI search systems in production.
Disclaimer: The opinions expressed in this blog are solely my own and do not reflect the views or opinions of my employer or any affiliated organizations. The content provided is for informational and educational purposes only and should not be taken as professional advice. While I strive to provide accurate and up-to-date information, I make no warranties or guarantees about the completeness, reliability, or accuracy of the content. Readers are encouraged to verify the information and seek independent advice as needed. I disclaim any liability for decisions or actions taken based on the content of this blog.