Elasticsearch 9 and Hybrid Search 2026 — BBQ, ELSER, Retrievers API, and a Production Search System Architecture
Posted on: 4/17/2026 11:15:07 AM
Table of contents
- 1. Why Hybrid Search is the future of search
- 2. Elasticsearch 9 architecture — Core changes
- 3. HNSW deep dive — The algorithm behind vector search
- 4. Designing a production search system
- 5. Score fusion — RRF vs Linear Combination
- 6. ColBERT and multi-stage re-ranking
- 7. Scaling Elasticsearch for billion-scale
- 8. Search UX — Frontend integration with Vue
- 9. Monitoring and performance tuning
- 10. Elasticsearch vs alternatives in 2026
- Conclusion
In 2026, search is no longer just "type a keyword and return results": users expect systems to understand query intent and find relevant content even when the keywords don't match, while remaining dead-on accurate for specific queries like product SKUs or proper names. Elasticsearch 9 answers this with a Hybrid Search architecture that combines the power of the traditional Inverted Index (BM25) with Vector Search (HNSW + BBQ) in the same engine, eliminating the need to run two separate systems.
1. Why Hybrid Search is the future of search
Before diving into Elasticsearch 9, understand why we need Hybrid Search instead of either traditional paradigm alone.
1.1 The limits of Keyword Search (BM25)
BM25 — the term-frequency-based ranking algorithm (an improved TF-IDF) — remains the foundation of every search engine. It uses an Inverted Index to map every term to the list of documents containing it, enabling extremely fast queries (microseconds) across billions of documents.
However, BM25 has a fundamental weakness: vocabulary mismatch. When a user searches "how to slim down fast" but the document says "methods for managing body weight", BM25 finds no shared terms and cannot match them. That is the intrinsic limit of lexical matching.
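To make the lexical behavior concrete, here is a minimal BM25 scorer in Python. The corpus and parameters are toy values for illustration; Lucene's production implementation adds refinements, but the shape of the formula is the same.

```python
import math

def bm25_score(query_terms, doc_terms, corpus, k1=1.2, b=0.75):
    """Score one document against a query with classic BM25."""
    N = len(corpus)
    avgdl = sum(len(d) for d in corpus) / N
    score = 0.0
    for term in query_terms:
        tf = doc_terms.count(term)
        if tf == 0:
            continue  # no lexical overlap -> the term contributes nothing
        df = sum(1 for d in corpus if term in d)
        idf = math.log(1 + (N - df + 0.5) / (df + 0.5))
        # length normalization: long documents are penalized via b
        norm = 1 - b + b * len(doc_terms) / avgdl
        score += idf * (tf * (k1 + 1)) / (tf + k1 * norm)
    return score

corpus = [
    "tips to slim down fast".split(),
    "methods for managing body weight".split(),
    "weight loss diet plan".split(),
]
query = "slim down fast".split()

print(bm25_score(query, corpus[0], corpus))  # positive: three shared terms
print(bm25_score(query, corpus[1], corpus))  # → 0.0 (vocabulary mismatch)
```

The second document is semantically on-topic but scores exactly zero, which is the mismatch problem hybrid search exists to fix.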
1.2 The limits of Vector Search (Semantic)
Vector search solves the vocabulary-mismatch problem by turning text into embedding vectors (arrays of real-valued numbers in many dimensions), then searching by cosine/dot-product distance in that vector space. Two semantically similar sentences end up close together in that space, regardless of the exact words they use.
But vector search has its own weakness: it's bad at exact matching. Searching for order ID "ORD-2026-78543" with vector search performs terribly — embedding models aren't trained to distinguish random character strings.
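Both behaviors fall out of distance in embedding space. A tiny illustration with hand-made 4-dimensional vectors (real models emit hundreds of dimensions; these numbers are invented for the example):

```python
def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sum(x * x for x in a) ** 0.5
    norm_b = sum(y * y for y in b) ** 0.5
    return dot / (norm_a * norm_b)

# Toy "embeddings": the two weight-loss sentences land close together...
lose_weight   = [0.9, 0.8, 0.1, 0.0]
manage_weight = [0.8, 0.9, 0.2, 0.1]
# ...while a random ID string gets an unrelated vector.
order_id      = [0.1, 0.0, 0.9, 0.8]

print(cosine(lose_weight, manage_weight))  # high, ~0.99: semantic neighbors
print(cosine(lose_weight, order_id))       # low: unrelated content
```

The model places paraphrases near each other, but nothing guarantees that "ORD-2026-78543" lands anywhere meaningful, which is why exact identifiers still need BM25.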
Key insight
Hybrid Search isn't "pick one or the other" — it's combining both. Use BM25 for exact match, use vectors for semantic understanding, and fuse the results for the best relevance. Elasticsearch 9 does this natively in a single query.
| Criterion | BM25 (Keyword) | Vector Search | Hybrid Search |
|---|---|---|---|
| Exact match (SKU, proper name) | ✔ Excellent | ✘ Poor | ✔ Excellent |
| Semantic understanding | ✘ None | ✔ Excellent | ✔ Excellent |
| Latency | ✔ < 1 ms | ⚠ 5-50 ms | ⚠ 10-60 ms |
| Memory | ✔ Low | ✘ Very high | ⚠ High (BBQ cuts 95%) |
| Vocabulary mismatch | ✘ Unresolved | ✔ Handled well | ✔ Handled well |
| Overall relevance | ⚠ Moderate | ⚠ Moderate-Good | ✔ Best |
2. Elasticsearch 9 architecture — Core changes
Elasticsearch 9.0 (GA early 2025, continuously updated through 9.3 in 2026) delivers a wave of improvements that turn it from a keyword-only engine into a unified search platform.
graph TD
A[Client Query] --> B[Query DSL / Retrievers API]
B --> C{Query Router}
C --> D[BM25 Inverted Index]
C --> E[ELSER Sparse Vector]
C --> F[Dense Vector HNSW + BBQ]
D --> G[Score Normalization]
E --> G
F --> G
G --> H{Fusion Method}
H --> I[RRF - Reciprocal Rank Fusion]
H --> J[Linear Combination]
I --> K[Final Ranked Results]
J --> K
K --> L[Re-ranking ColBERT/ColPali]
L --> M[Response]
2.1 BBQ — Better Binary Quantization
This is the single biggest Elasticsearch 9 improvement for vector search. BBQ compresses vectors from float32 down to a binary representation (1 bit per dimension), slashing memory by up to 95% compared with raw float32.
The savings are dramatic at scale: on a 1-billion-vector, 768-dimension dataset, raw float32 vectors alone occupy close to 3 TiB, while BBQ's binary representation needs under 100 GiB.
From Elasticsearch 9.1 onward, BBQ is the default for every new dense_vector index. The algorithm runs in two steps: (1) quantize the vector to binary and use SIMD (POPCNT + XOR) for fast comparisons, (2) re-score the top candidates with the original vector to preserve recall.
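The two-step flow can be sketched in Python. This is a deliberate simplification: real BBQ uses learned per-dimension thresholds and correction terms rather than a plain sign function, but the quantize, compare, then re-score pattern is the same.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def binarize(vec):
    # 1 bit per dimension (simplified: real BBQ learns thresholds,
    # it does not just take the sign of each component)
    bits = 0
    for i, x in enumerate(vec):
        if x > 0:
            bits |= 1 << i
    return bits

def hamming(a, b):
    # XOR + popcount: the cheap comparison BBQ accelerates with SIMD
    return bin(a ^ b).count("1")

def bbq_search(query, docs, top_k=2, extra=1):
    qbits = binarize(query)
    # Step 1: rank everything by binary distance (fast, approximate)
    candidates = sorted(range(len(docs)),
                        key=lambda i: hamming(qbits, binarize(docs[i])))
    # Step 2: re-score a slightly larger shortlist with the
    # full-precision vectors to recover recall
    shortlist = candidates[: top_k + extra]
    shortlist.sort(key=lambda i: -dot(query, docs[i]))
    return shortlist[:top_k]

query = [1.0, 1.0, -1.0, 1.0]
docs = [
    [0.9, 1.1, -0.8, 1.2],    # near-duplicate of the query
    [-1.0, -1.0, 1.0, -1.0],  # opposite direction
    [1.0, -1.0, -1.0, 1.0],   # partial match
]
print(bbq_search(query, docs))  # → [0, 2]
```

The binary pass does almost all the filtering; the float vectors are only touched for the shortlist, which is what makes the 95% memory cut affordable in practice.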
{
"mappings": {
"properties": {
"content_embedding": {
"type": "dense_vector",
"dims": 768,
"index": true,
"similarity": "cosine",
"index_options": {
"type": "bbq_hnsw"
}
},
"content": {
"type": "text",
"analyzer": "vietnamese_custom"
},
"semantic_content": {
"type": "semantic_text"
}
}
}
}
2.2 ELSER v2 — Elastic Learned Sparse Encoder
ELSER is a sparse vector model trained by Elastic themselves that produces sparse embeddings you can index with the familiar inverted index while still getting semantic understanding. Unlike dense vectors (every dimension has a value), sparse vectors only have a few non-zero dimensions — each representing an "expanded term" the model learned.
Notable detail: ELSER ships inside Elasticsearch 9 — no external model server, no GPU, CPU inference runs directly on the node. When you create a semantic_text field without specifying an inference endpoint, Elasticsearch uses ELSER automatically.
When to use ELSER vs dense vectors
ELSER (sparse): ideal when you want semantic search but have no GPU, want a simple deployment, or have mostly English/common multilingual data. Dense vectors: ideal when you need deep multilingual (e.g. Jina v3), cross-modal search (text-to-image), or a RAG pipeline that needs custom embeddings.
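A sketch of why sparse embeddings fit the inverted index so well: each document stores only its non-zero expanded terms, and scoring is a dot product over the overlap. The terms and weights below are invented for illustration, not actual ELSER output.

```python
# Invented expansions: real ELSER output has hundreds of weighted terms.
query_expansion = {"optimize": 1.8, "database": 2.1, "performance": 1.5,
                   "tuning": 0.9, "index": 0.7}
doc_expansion = {"database": 1.9, "tuning": 1.2, "index": 1.1,
                 "postgres": 0.8}

def sparse_score(q, d):
    # Only overlapping expanded terms contribute, which is exactly the
    # term-at-a-time evaluation an inverted index already performs.
    return sum(w * d[t] for t, w in q.items() if t in d)

print(sparse_score(query_expansion, doc_expansion))  # ≈ 5.84
```

Because the "dimensions" are terms, the engine can reuse its existing posting lists and skip-list machinery, which is why ELSER needs no separate vector index.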
2.3 Retrievers API — Composable search pipelines
Retrievers is a new Query DSL abstraction that lets you build multi-stage search pipelines declaratively. Instead of writing complex logic in the application layer, you compose it directly in the query.
{
"retriever": {
"rrf": {
"retrievers": [
{
"standard": {
"query": {
"match": {
"content": {
"query": "ways to optimize database performance",
"analyzer": "english_custom"
}
}
}
}
},
{
"standard": {
"query": {
"semantic": {
"field": "semantic_content",
"query": "ways to optimize database performance"
}
}
}
},
{
"knn": {
"field": "content_embedding",
"query_vector_builder": {
"text_embedding": {
"model_id": "jina-embeddings-v3",
"model_text": "ways to optimize database performance"
}
},
"k": 50,
"num_candidates": 200
}
}
],
"rank_window_size": 100
}
},
"_source": ["title", "content", "url"],
"size": 10
}
This query combines three retrievers: BM25 match, ELSER semantic, and dense kNN — fused with RRF. All of them run in a single roundtrip to Elasticsearch, with no orchestration in the application.
3. HNSW deep dive — The algorithm behind vector search
HNSW (Hierarchical Navigable Small World) is the ANN (Approximate Nearest Neighbor) algorithm Elasticsearch uses for dense vector search. Understanding how it works lets you tune performance for production.
graph TD
subgraph "Layer 2 - Coarsest"
A2[Node A] --- B2[Node B]
end
subgraph "Layer 1 - Medium"
A1[Node A] --- B1[Node B]
B1 --- C1[Node C]
A1 --- D1[Node D]
end
subgraph "Layer 0 - Finest"
A0[Node A] --- B0[Node B]
B0 --- C0[Node C]
A0 --- D0[Node D]
C0 --- E0[Node E]
D0 --- F0[Node F]
E0 --- F0
B0 --- E0
end
3.1 How HNSW works
HNSW builds a multi-layer graph in which:
- Layer 0 contains every node (vector); each node is connected to its M nearest neighbors
- Layers 1, 2, … contain a random subset of nodes (probability decreases with each layer), forming "express highways" for fast navigation
- Search starts at the highest layer (few nodes, long hops), greedily moves toward neighbors closest to the query, then drops down a layer to refine
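The greedy traversal on a single layer can be sketched as follows. This is a simplification: real HNSW descends layer by layer and runs this routine with beam width ef on each; the graph and vectors here are toy data.

```python
import heapq

def greedy_search(graph, vectors, entry, query, ef=3):
    """Greedy beam search on one HNSW layer (simplified)."""
    def dist(i):
        return sum((a - b) ** 2 for a, b in zip(vectors[i], query))
    visited = {entry}
    candidates = [(dist(entry), entry)]       # min-heap by distance
    results = [(dist(entry), entry)]          # best `ef` found so far
    while candidates:
        d, node = heapq.heappop(candidates)
        if d > max(r[0] for r in results) and len(results) >= ef:
            break  # nothing left that can improve the result set
        for nb in graph[node]:
            if nb not in visited:
                visited.add(nb)
                dn = dist(nb)
                heapq.heappush(candidates, (dn, nb))
                results.append((dn, nb))
                results = sorted(results)[:ef]
    return [n for _, n in sorted(results)]

vectors = [(0, 0), (1, 0), (5, 5), (6, 5), (2, 1)]
graph = {0: [1, 4], 1: [0, 4, 2], 2: [1, 3], 3: [2], 4: [0, 1, 2]}
print(greedy_search(graph, vectors, 0, (6, 6)))  # → [3, 2, 4]
```

Starting from a far-away entry point, the search hops through intermediate nodes toward the query's true neighbors without ever scoring the whole dataset; a larger ef widens the beam and trades latency for recall.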
Two hyperparameters matter most:
| Parameter | Meaning | ES 9 default | Production recommendation |
|---|---|---|---|
| m | Connections per node | 16 | 16-32 (higher = better recall, more RAM) |
| ef_construction | Beam width when building the graph | 100 | 100-200 (higher = slower build, better graph) |
| ef_search (num_candidates) | Beam width at search time | — | Recall-dependent: 100 (fast) to 500 (high recall) |
HNSW memory note
With 1 billion 768-dim float32 vectors, HNSW needs ~3-4 TiB of RAM for the vector data and the graph structure. That's why BBQ quantization (-95%) is a game-changer at production scale. If the dataset exceeds 100M vectors, also consider IVF or tiered storage.
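The arithmetic behind that note, sketched in Python. The graph estimate assumes roughly 2×m int32 links per node at layer 0 with m=16, which is a rough rule of thumb rather than an exact figure:

```python
def gib(nbytes):
    return nbytes / 2**30

n, dims = 1_000_000_000, 768

float32_vectors = n * dims * 4   # 4 bytes per float32 dimension
bbq_vectors = n * dims // 8      # 1 bit per dimension after BBQ
hnsw_graph = n * 2 * 16 * 4      # ~2*m int32 links/node (rough assumption)

print(f"float32 vectors: {gib(float32_vectors):,.0f} GiB")  # ≈ 2,861 GiB
print(f"HNSW graph:      {gib(hnsw_graph):,.0f} GiB")       # ≈ 119 GiB
print(f"BBQ vectors:     {gib(bbq_vectors):,.0f} GiB")      # ≈ 89 GiB
```

Even with the quantized vectors down to double-digit GiB, the graph itself still wants RAM, which is why the note above warns that BBQ alone does not make billion-scale free.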
4. Designing a production search system
This section covers an end-to-end production search architecture — from data ingestion to serving.
graph LR
subgraph Data Sources
DB[(Database)]
CMS[CMS/API]
S3[Object Storage]
end
subgraph Ingestion Pipeline
CDC[CDC / Change Stream]
ENR[Enrichment Service]
EMB[Embedding Service]
ING[Ingest Pipeline]
end
subgraph Elasticsearch Cluster
COORD[Coordinating Nodes]
DATA1[Data Node - Hot]
DATA2[Data Node - Warm]
ML[ML Node - ELSER]
end
subgraph Serving
API[Search API .NET]
CACHE[Redis Cache]
CLIENT[Vue Frontend]
end
DB --> CDC
CMS --> CDC
S3 --> CDC
CDC --> ENR
ENR --> EMB
EMB --> ING
ING --> COORD
COORD --> DATA1
COORD --> DATA2
COORD --> ML
CLIENT --> API
API --> CACHE
API --> COORD
4.1 Data ingestion — CDC + embedding pipeline
Rather than batch reindex, production systems should use CDC (Change Data Capture) — capturing every change from the source database (INSERT/UPDATE/DELETE) and pushing it to Elasticsearch in near real time. For SQL Server use Debezium or Change Tracking; for PostgreSQL use logical replication.
The pipeline processes each document in three steps:
- Enrichment: add metadata, normalize text, detect language
- Embedding: call the embedding model (Jina v3, OpenAI text-embedding-3-large, or self-hosted) to produce the dense vector
- Ingest: the Elasticsearch Ingest Pipeline handles ELSER sparse encoding + document indexing
public class SearchDocumentIndexer
{
private readonly ElasticsearchClient _client;
private readonly IEmbeddingService _embeddingService;
public async Task IndexDocumentAsync(ProductDocument doc)
{
var embedding = await _embeddingService
.GenerateEmbeddingAsync(doc.Title + " " + doc.Description);
var indexDoc = new SearchableProduct
{
Id = doc.Id,
Title = doc.Title,
Description = doc.Description,
ContentEmbedding = embedding,
SemanticContent = doc.Title + " " + doc.Description,
Category = doc.Category,
Price = doc.Price,
UpdatedAt = DateTime.UtcNow
};
await _client.IndexAsync(indexDoc, idx => idx
.Index("products-v2")
.Id(doc.Id.ToString())
.Pipeline("product-enrichment")
);
}
}
4.2 Index design — Multi-field strategy
A single document needs multiple "views" to power hybrid search:
{
"index_patterns": ["products-*"],
"template": {
"settings": {
"number_of_shards": 3,
"number_of_replicas": 1,
"analysis": {
"analyzer": {
"en_analyzer": {
"type": "custom",
"tokenizer": "icu_tokenizer",
"filter": ["icu_folding", "lowercase", "english_stop"]
}
}
}
},
"mappings": {
"properties": {
"title": {
"type": "text",
"analyzer": "en_analyzer",
"fields": {
"keyword": { "type": "keyword" },
"suggest": {
"type": "completion",
"analyzer": "en_analyzer"
}
}
},
"description": {
"type": "text",
"analyzer": "en_analyzer"
},
"semantic_content": {
"type": "semantic_text"
},
"content_embedding": {
"type": "dense_vector",
"dims": 768,
"index": true,
"similarity": "cosine",
"index_options": {
"type": "bbq_hnsw",
"m": 24,
"ef_construction": 150
}
},
"sku": { "type": "keyword" },
"category": { "type": "keyword" },
"price": { "type": "float" },
"updated_at": { "type": "date" }
}
}
}
}
4.3 Search API — .NET integration
Elastic ships Elastic.Clients.Elasticsearch (the official .NET client) with full support for the Retrievers API and hybrid search.
public class HybridSearchService
{
private readonly ElasticsearchClient _client;
public async Task<SearchResult> HybridSearchAsync(
string query, string? category = null, int page = 1, int size = 10)
{
var response = await _client.SearchAsync<ProductDoc>(s => s
.Index("products-v2")
.From((page - 1) * size)
.Size(size)
.Retriever(r => r
.Rrf(rrf => rrf
.RankWindowSize(100)
.Retrievers(
ret => ret.Standard(std => std
.Query(q => q
.Bool(b => b
.Must(m => m
.MultiMatch(mm => mm
.Query(query)
.Fields(new[] { "title^3", "description" })
.Type(TextQueryType.BestFields)
.Fuzziness(new Fuzziness("AUTO"))
)
)
.Filter(BuildCategoryFilter(category))
)
)
),
ret => ret.Standard(std => std
.Query(q => q
.Semantic(sem => sem
.Field("semantic_content")
.Query(query)
)
)
),
ret => ret.Knn(knn => knn
.Field("content_embedding")
.QueryVectorBuilder(qvb => qvb
.TextEmbedding(te => te
.ModelId("jina-v3")
.ModelText(query)
)
)
.K(50)
.NumCandidates(200)
)
)
)
)
.Highlight(h => h
.Fields(f => f
.Add("title", new HighlightField
{
PreTags = new[] { "<mark>" },
PostTags = new[] { "</mark>" }
})
.Add("description", new HighlightField
{
PreTags = new[] { "<mark>" },
PostTags = new[] { "</mark>" }
})
)
)
);
return MapToSearchResult(response);
}
}
5. Score fusion — RRF vs Linear Combination
When combining results from multiple retrievers, you need a fusion strategy to merge and re-rank. Elasticsearch 9 supports two primary methods.
5.1 Reciprocal Rank Fusion (RRF)
RRF is rank-based — it only cares about a document's position (rank) in each retriever, not the absolute score. The formula:
RRF_score(d) = Sum( 1 / (k + rank_i(d)) )
// k = constant (default 60), rank_i = position in retriever i
Biggest advantage: no need to normalize scores across retrievers. BM25 might score 15.7, cosine similarity 0.89 — RRF handles it because it only looks at ranks.
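A minimal Python version of RRF makes the rank-only property visible (the doc IDs and input rankings are invented):

```python
def rrf_fuse(rankings, k=60):
    """rankings: list of ranked doc-id lists, best first (ranks are 1-based)."""
    scores = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            # absolute scores never enter the formula, only positions
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_ranking = ["doc_sku", "doc_a", "doc_b"]     # raw scores might be 15.7, 9.2, ...
vector_ranking = ["doc_a", "doc_c", "doc_sku"]   # cosine scores 0.89, 0.77, ...
print(rrf_fuse([bm25_ranking, vector_ranking]))
# → ['doc_a', 'doc_sku', 'doc_c', 'doc_b']
```

Note how doc_a, ranked well by both retrievers, beats doc_sku, which one retriever ranked first: RRF rewards consensus across retrievers rather than a single high score.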
5.2 Linear Combination (Weighted)
With Linear Combination, you assign weights to each retriever and sum the scores (after normalization). A fit when you already know which retriever matters most for a use case.
{
"retriever": {
"linear": {
"retrievers": [
{
"retriever": {
"standard": {
"query": { "match": { "content": "database optimization" } }
}
},
"weight": 0.3
},
{
"retriever": {
"knn": {
"field": "content_embedding",
"query_vector_builder": {
"text_embedding": {
"model_id": "jina-v3",
"model_text": "database optimization"
}
},
"k": 50,
"num_candidates": 200
}
},
"weight": 0.7
}
],
"normalizer": "min_max"
}
}
}
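A Python sketch of what the min_max normalizer plus weighted sum does under the hood (the scores here are invented for illustration):

```python
def min_max(scores):
    """Rescale raw scores to [0, 1] so different retrievers are comparable."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {d: 1.0 for d in scores}
    return {d: (s - lo) / (hi - lo) for d, s in scores.items()}

def linear_fuse(retriever_scores, weights):
    """retriever_scores: list of {doc: raw_score}; weights: matching floats."""
    fused = {}
    for scores, w in zip(retriever_scores, weights):
        for doc, s in min_max(scores).items():
            fused[doc] = fused.get(doc, 0.0) + w * s
    return sorted(fused, key=fused.get, reverse=True)

bm25 = {"a": 15.7, "b": 9.2, "c": 3.1}    # unbounded BM25 scores
knn  = {"b": 0.91, "c": 0.88, "a": 0.52}  # cosine similarities
print(linear_fuse([bm25, knn], [0.3, 0.7]))  # → ['b', 'c', 'a']
```

Without the normalization step, BM25's unbounded scores would drown out the cosine similarities regardless of the weights, which is why the normalizer is required rather than optional.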
| Criterion | RRF | Linear Combination |
|---|---|---|
| Score normalization | Not needed | Required (min-max or z-score) |
| Tuning | Minimal (just k) | Weights tuned per use case |
| When to use | Default, when you don't know the data well | After A/B tests with known optimal weights |
| Performance | Good for most cases | Can be better when tuned correctly |
6. ColBERT and multi-stage re-ranking
Elasticsearch 9 supports multi-stage interaction models like ColBERT and ColPali via the MaxSim operator. A major step forward for search quality.
A 3-stage production pipeline:
graph LR
A[Query] --> B[Stage 1: Candidate Retrieval]
B --> |Top 1000| C[Stage 2: Hybrid Fusion RRF]
C --> |Top 100| D[Stage 3: ColBERT Re-ranking]
D --> |Top 10| E[Final Results]
ColBERT (Contextualized Late Interaction over BERT) produces an embedding for each token of the document and query, then applies MaxSim to compute an interaction score. More expensive than a bi-encoder but delivers relevance close to a cross-encoder at much higher speed.
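MaxSim itself is a small computation. A Python sketch with toy 2-dimensional token embeddings (real ColBERT token embeddings are typically 128-dimensional, and the vectors here are invented):

```python
def maxsim(query_tokens, doc_tokens):
    """ColBERT late interaction: for each query-token embedding, take the
    max dot product over all doc-token embeddings, then sum over the query."""
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))
    return sum(max(dot(q, d) for d in doc_tokens) for q in query_tokens)

query = [[1.0, 0.0], [0.0, 1.0]]
doc_relevant   = [[0.9, 0.1], [0.2, 0.95], [0.5, 0.5]]
doc_irrelevant = [[-0.8, 0.1], [0.1, -0.7]]

print(maxsim(query, doc_relevant))    # high: every query token finds a match
print(maxsim(query, doc_irrelevant))  # low: no doc token aligns with either
```

Because each query token independently searches for its best-matching document token, MaxSim captures fine-grained term interactions that a single pooled vector averages away, at the cost of storing one vector per token.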
7. Scaling Elasticsearch for billion-scale
7.1 Shard strategy
Basic production rules:
- Shard size: target 20-40 GiB/shard for search workloads, 40-60 GiB for logging
- Shard count: avoid too many small shards — each shard has overhead (segment metadata memory, thread pools)
- Shard count = total data size / target shard size, rounded up
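The sizing rule above in code form (the target sizes are the ranges from the list, picked as illustrative defaults):

```python
import math

def shard_count(total_size_gib, target_shard_gib=30):
    # total data size / target shard size, rounded up
    return math.ceil(total_size_gib / target_shard_gib)

print(shard_count(250))        # search workload, ~30 GiB target → 9
print(shard_count(1200, 50))   # logging workload, 50 GiB target → 24
```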
7.2 Tiered architecture
| Tier | Hardware | Data | Characteristics |
|---|---|---|---|
| Hot | NVMe SSD, high RAM | 0-7 days / active index | Highest speed, search + indexing |
| Warm | Standard SSD | 7-30 days | Search only, force merge |
| Cold | HDD / Searchable Snapshots | 30+ days | Archival, occasional search |
| Frozen | Object Storage (S3) | Long-term storage | Searchable Snapshots, high latency |
7.3 Filtered vector search — ACORN
Since Elasticsearch 9.1, the ACORN algorithm significantly improves filtered kNN search performance. Previously, combining vector search with a filter (e.g. finding similar products only within the "electronics" category) forced Elasticsearch to scan many more candidates than necessary. ACORN integrates the filter directly into HNSW graph traversal, reducing nodes visited.
8. Search UX — Frontend integration with Vue
Good search isn't only backend — frontend UX decides the user experience.
8.1 Search-as-you-type with debounce
import { ref, watch } from 'vue'
import { useDebounceFn } from '@vueuse/core'
export function useHybridSearch() {
const query = ref('')
const results = ref<SearchResult[]>([])
const suggestions = ref<string[]>([])
const isLoading = ref(false)
const fetchSuggestions = useDebounceFn(async (q: string) => {
if (q.length < 2) {
suggestions.value = []
return
}
const res = await fetch(
`/api/search/suggest?q=${encodeURIComponent(q)}`
)
suggestions.value = await res.json()
}, 150)
const executeSearch = useDebounceFn(async (q: string) => {
if (!q.trim()) {
results.value = []
return
}
isLoading.value = true
try {
const res = await fetch(
`/api/search?q=${encodeURIComponent(q)}`
)
const data = await res.json()
results.value = data.hits
} finally {
isLoading.value = false
}
}, 300)
watch(query, (val) => {
fetchSuggestions(val)
executeSearch(val)
})
return { query, results, suggestions, isLoading }
}
8.2 Highlight and snippet
Elasticsearch returns highlight fragments — text chunks containing matches wrapped in <mark> tags. The frontend renders the sanitized HTML:
<template>
<div class="search-result" v-for="hit in results" :key="hit.id">
<h3 v-html="hit.highlight?.title?.[0] || hit.title" />
<p
class="snippet"
v-html="hit.highlight?.description?.[0]
|| truncate(hit.description, 200)"
/>
<div class="meta">
<span class="category">{{ hit.category }}</span>
<span class="score">Relevance: {{ hit.score.toFixed(2) }}</span>
</div>
</div>
</template>
9. Monitoring and performance tuning
9.1 Metrics to track
Search latency
P50 < 50 ms, P99 < 200 ms for the search API. Track per-retriever latency inside a hybrid query to pinpoint the bottleneck.
Indexing throughput
Target: > 5,000 docs/s for real-time indexing. Monitor indexing_pressure and rejected_requests to detect backpressure.
Recall & relevance
Use nDCG@10 and MRR to measure search quality. A/B test RRF weights regularly with real user clicks.
Resource usage
Monitor heap usage, segment memory, vector-index memory. BBQ cuts vector memory 95% but the HNSW graph still needs significant RAM.
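The two relevance metrics above can be computed offline from judged result lists. A compact Python sketch (the graded relevance labels and ranks are invented examples):

```python
import math

def ndcg_at_k(relevances, k=10):
    """relevances: graded relevance of the returned docs, in ranked order."""
    def dcg(rels):
        # gains discounted by log2 of position (positions are 1-based)
        return sum(r / math.log2(i + 2) for i, r in enumerate(rels[:k]))
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0

def mrr(first_relevant_ranks):
    """first_relevant_ranks: 1-based rank of the first relevant hit per
    query, or None when no relevant result was returned."""
    total = sum(1.0 / r for r in first_relevant_ranks if r is not None)
    return total / len(first_relevant_ranks)

print(ndcg_at_k([3, 2, 1, 0]))  # ideally ordered → 1.0
print(ndcg_at_k([0, 1, 2, 3]))  # same docs, worst order → lower
print(mrr([1, 2, None, 4]))     # → 0.4375
```

Feeding these metrics with click-derived judgments per query set gives a single number to compare RRF variants or weight settings in an A/B test.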
9.2 Common pitfalls
Common hybrid-search deployment mistakes
- Embedding model mismatch: using different models for indexing vs searching produces meaningless cosine similarity. Always use the same model and version.
- Too many shards: every shard's kNN search is its own HNSW traversal, so excessive shard counts multiply the total work and the result-merging overhead. Consolidate shards when you can.
- Cold cache: the HNSW graph must be loaded into the OS page cache. After a restart the first query is very slow — use warming queries.
- Skipping re-ranking: hybrid retrieval delivers great recall, but precision needs re-ranking (ColBERT/cross-encoder) for top results.
10. Elasticsearch vs alternatives in 2026
| Criterion | Elasticsearch 9 | OpenSearch 2.x | Milvus/Qdrant | pgvector |
|---|---|---|---|---|
| Hybrid Search | ✔ Native (RRF, Linear) | ✔ Native (RRF) | ⚠ Limited | ⚠ Manual |
| Sparse Vector (ELSER) | ✔ Built-in | ✘ None | ✘ None | ✘ None |
| BBQ Quantization | ✔ Default | ✘ None | ✔ SQ/PQ | ✘ None |
| Full-text Search | ✔ Excellent | ✔ Good | ✘ Limited | ⚠ Basic |
| Ecosystem/.NET SDK | ✔ Mature | ⚠ Moderate | ✔ Good | ✔ Native (EF Core) |
| Operations complexity | ⚠ Moderate | ⚠ Moderate | ✘ High | ✔ Low |
| License | SSPL + Elastic | Apache 2.0 | Apache 2.0 | PostgreSQL |
How to choose
Elasticsearch 9: when you need both full-text and vector search in the same system, or already run the Elastic Stack. pgvector: when you already use PostgreSQL, have fewer than 10M vectors, and want the simplest option. Milvus/Qdrant: when vector search is the primary use case (RAG, recommendations) and you don't need full-text.
Conclusion
Elasticsearch 9 marks the shift from "search engine" to "unified search & vector platform". With BBQ quantization slashing memory 95%, ELSER enabling semantic search without GPUs, the Retrievers API for composable hybrid queries, and ColBERT/ColPali re-ranking — it's the most complete production search stack in 2026. The key isn't choosing keyword or vector, but combining both correctly in a single pipeline with the fusion strategy best suited to your use case.
References:
- Elasticsearch 9.0 Release — What's New (Elastic Blog)
- BBQ — Better Binary Quantization in Lucene & Elasticsearch (Elastic Labs)
- Elasticsearch 9.1: BBQ default & ACORN filtered vector search (Elastic Labs)
- Hybrid Search in Elasticsearch — Overview & Queries (Elastic Labs)
- From Vector Hype to Hybrid Reality: Is Elasticsearch Still the Right Bet? (Pureinsights 2026)
- Hybrid Search — OpenSearch Documentation