Elasticsearch 9 and Hybrid Search 2026 — BBQ, ELSER, Retrievers API, and a Production Search System Architecture

Posted on: 4/17/2026 11:15:07 AM

In 2026, search is no longer just "type a keyword and return results". Users expect systems to understand query intent and surface relevant content even when the keywords don't match, while still being exact for queries like product SKUs or proper names. Elasticsearch 9 answers this with a hybrid search architecture: the traditional inverted index (BM25) and vector search (HNSW + BBQ) combined in the same engine, so you no longer need to run two separate systems.

  • 95% memory reduction vs float32 with BBQ
  • 30x faster throughput with SIMD-accelerated BBQ
  • 5x faster than OpenSearch (Elastic's BBQ benchmark)
  • 9.0 GA: the latest major Elasticsearch version

1. Why Hybrid Search?

Before diving into Elasticsearch 9, it helps to understand why we need hybrid search instead of either traditional paradigm alone.

1.1 The limits of Keyword Search (BM25)

BM25 — a term-frequency-based ranking function that improves on classic TF-IDF — remains the foundation of every search engine. It relies on an inverted index that maps every term to the list of documents containing it, enabling sub-millisecond lookups across billions of documents.
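The ranking function can be sketched in TypeScript. This is a toy single-term score with the common k1 = 1.2, b = 0.75 defaults, not Lucene's exact implementation (which adds boosts and norm compression):

```typescript
// Toy single-term BM25. idf rewards rare terms; the tf part saturates,
// and the length term penalizes documents longer than average.
function bm25Term(
  tf: number,        // occurrences of the term in this document
  docLen: number,    // length of this document (tokens)
  avgDocLen: number, // average document length in the index
  docCount: number,  // total documents in the index
  docFreq: number,   // documents containing the term
  k1 = 1.2,
  b = 0.75
): number {
  const idf = Math.log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5));
  const norm = k1 * (1 - b + (b * docLen) / avgDocLen);
  return (idf * (tf * (k1 + 1))) / (tf + norm);
}
```

A full document score sums this over every query term the document contains — which is exactly why zero shared terms means a zero score.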

However, BM25 has a fundamental weakness: vocabulary mismatch. When a user searches "how to lose weight effectively" but the document says "methods for managing body weight", BM25 can't match them — there's no shared vocabulary. That's the intrinsic limit of lexical matching.

1.2 The limits of Vector Search (Semantic)

Vector search solves the vocabulary-mismatch problem by turning text into embedding vectors (arrays of real-valued numbers in many dimensions), then searching by cosine/dot-product distance in that vector space. Two semantically similar sentences end up close together in that space, regardless of the exact words they use.

But vector search has its own weakness: it's bad at exact matching. Searching for order ID "ORD-2026-78543" with vector search performs terribly — embedding models aren't trained to distinguish random character strings.
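The distance computation itself is trivial; the hard parts are the embedding model and the ANN index. A minimal cosine similarity in TypeScript, for reference:

```typescript
// Cosine similarity: dot product divided by the product of magnitudes.
// 1 = same direction (semantically close), 0 = orthogonal (unrelated).
function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}
```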

Key insight

Hybrid Search isn't "pick one or the other" — it's combining both. Use BM25 for exact match, use vectors for semantic understanding, and fuse the results for the best relevance. Elasticsearch 9 does this natively in a single query.

| Criterion                      | BM25 (Keyword) | Vector Search    | Hybrid Search          |
|--------------------------------|----------------|------------------|------------------------|
| Exact match (SKU, proper name) | ✔ Excellent    | ✘ Poor           | ✔ Excellent            |
| Semantic understanding         | ✘ None         | ✔ Excellent      | ✔ Excellent            |
| Latency                        | ✔ < 1 ms       | ⚠ 5-50 ms        | ⚠ 10-60 ms             |
| Memory                         | ✔ Low          | ✘ Very high      | ⚠ High (BBQ cuts 95%)  |
| Vocabulary mismatch            | ✘ Unresolved   | ✔ Handled well   | ✔ Handled well         |
| Overall relevance              | ⚠ Moderate     | ⚠ Moderate-Good  | ✔ Best                 |

2. Elasticsearch 9 architecture — Core changes

Elasticsearch 9.0 (GA early 2025, continuously updated through 9.3 in 2026) delivers a wave of improvements that turn it from a keyword-only engine into a unified search platform.

graph TD
    A[Client Query] --> B[Query DSL / Retrievers API]
    B --> C{Query Router}
    C --> D[BM25 Inverted Index]
    C --> E[ELSER Sparse Vector]
    C --> F[Dense Vector HNSW + BBQ]
    D --> G[Score Normalization]
    E --> G
    F --> G
    G --> H{Fusion Method}
    H --> I[RRF - Reciprocal Rank Fusion]
    H --> J[Linear Combination]
    I --> K[Final Ranked Results]
    J --> K
    K --> L[Re-ranking ColBERT/ColPali]
    L --> M[Response]
Hybrid Search pipeline architecture in Elasticsearch 9

2.1 BBQ — Better Binary Quantization

This is the single biggest Elasticsearch 9 improvement for vector search. BBQ compresses vectors from float32 down to a binary representation (1 bit per dimension), slashing memory by up to 95% compared with raw float32.

On a 1-billion-vector, 768-dimension dataset:

  • ~3 TiB raw float32 memory
  • ~150 GiB BBQ memory
  • 20% higher recall vs PQ
  • 8-30x faster throughput (SIMD)

From Elasticsearch 9.1 onward, BBQ is the default for every new dense_vector index. The algorithm runs in two steps: (1) quantize the vector to binary and use SIMD (POPCNT + XOR) for fast comparisons, (2) re-score the top candidates with the original vector to preserve recall.
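The core of step (1) can be sketched in TypeScript. This is a naive illustration of 1-bit quantization and Hamming distance, not Elastic's actual BBQ code (which adds learned thresholds, corrective terms, and hardware SIMD):

```typescript
// 1-bit quantization: keep only the sign of each dimension.
// A 768-dim float32 vector (3072 bytes) becomes 96 bytes.
function quantize(vec: number[]): Uint8Array {
  const bits = new Uint8Array(Math.ceil(vec.length / 8));
  vec.forEach((v, i) => {
    if (v > 0) bits[i >> 3] |= 1 << (i & 7);
  });
  return bits;
}

// Distance between binary vectors is Hamming distance: XOR the bytes,
// then count set bits. This is the XOR + POPCNT loop SIMD accelerates.
function hamming(a: Uint8Array, b: Uint8Array): number {
  let d = 0;
  for (let i = 0; i < a.length; i++) {
    let x = a[i] ^ b[i];
    while (x) { x &= x - 1; d++; } // Kernighan popcount
  }
  return d;
}
```

Step (2) then re-scores the top Hamming-distance candidates against the original float32 vectors, which is how BBQ keeps recall high despite the lossy compression.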

elasticsearch-mapping.json
{
  "mappings": {
    "properties": {
      "content_embedding": {
        "type": "dense_vector",
        "dims": 768,
        "index": true,
        "similarity": "cosine",
        "index_options": {
          "type": "bbq_hnsw"
        }
      },
      "content": {
        "type": "text",
        "analyzer": "vietnamese_custom"
      },
      "semantic_content": {
        "type": "semantic_text"
      }
    }
  }
}

2.2 ELSER v2 — Elastic Learned Sparse Encoder

ELSER is a sparse retrieval model trained by Elastic that produces sparse embeddings you can store in the familiar inverted index while still getting semantic matching. Unlike dense vectors, where every dimension carries a value, a sparse vector has only a few non-zero dimensions, each representing an "expanded term" the model learned.

Notable detail: ELSER ships inside Elasticsearch 9 — no external model server, no GPU, CPU inference runs directly on the node. When you create a semantic_text field without specifying an inference endpoint, Elasticsearch uses ELSER automatically.

When to use ELSER vs dense vectors

ELSER (sparse): ideal when you want semantic search but have no GPU and want a simple deployment; note that ELSER is trained on English text, so quality degrades on other languages. Dense vectors: ideal when you need strong multilingual coverage (e.g. Jina v3), cross-modal search (text-to-image), or a RAG pipeline that needs custom embeddings.

2.3 Retrievers API — Composable search pipelines

Retrievers is a new Query DSL abstraction that lets you build multi-stage search pipelines declaratively. Instead of writing complex logic in the application layer, you compose it directly in the query.

hybrid-search-query.json
{
  "retriever": {
    "rrf": {
      "retrievers": [
        {
          "standard": {
            "query": {
              "match": {
                "content": {
                  "query": "ways to optimize database performance",
                  "analyzer": "english_custom"
                }
              }
            }
          }
        },
        {
          "standard": {
            "query": {
              "semantic": {
                "field": "semantic_content",
                "query": "ways to optimize database performance"
              }
            }
          }
        },
        {
          "knn": {
            "field": "content_embedding",
            "query_vector_builder": {
              "text_embedding": {
                "model_id": "jina-embeddings-v3",
                "model_text": "ways to optimize database performance"
              }
            },
            "k": 50,
            "num_candidates": 200
          }
        }
      ],
      "rank_window_size": 100
    }
  },
  "_source": ["title", "content", "url"],
  "size": 10
}

This query combines three retrievers: BM25 match, ELSER semantic, and dense kNN — fused with RRF. All of them run in a single roundtrip to Elasticsearch, with no orchestration in the application.

3. HNSW deep dive — The algorithm behind vector search

HNSW (Hierarchical Navigable Small World) is the ANN (Approximate Nearest Neighbor) algorithm Elasticsearch uses for dense vector search. Understanding how it works lets you tune performance for production.

graph TD
    subgraph "Layer 2 - Coarsest"
        A2[Node A] --- B2[Node B]
    end
    subgraph "Layer 1 - Medium"
        A1[Node A] --- B1[Node B]
        B1 --- C1[Node C]
        A1 --- D1[Node D]
    end
    subgraph "Layer 0 - Finest"
        A0[Node A] --- B0[Node B]
        B0 --- C0[Node C]
        A0 --- D0[Node D]
        C0 --- E0[Node E]
        D0 --- F0[Node F]
        E0 --- F0
        B0 --- E0
    end
The multi-layer HNSW graph — search starts at the top layer and drills down to layer 0

3.1 How HNSW works

HNSW builds a multi-layer graph in which:

  • Layer 0 contains every node (vector); each node is connected to its M nearest neighbors
  • Layers 1, 2, … contain a random subset of nodes (probability decreases with each layer), forming "express highways" for fast navigation
  • Search starts at the highest layer (few nodes, long hops), greedily moves toward neighbors closest to the query, then drops down a layer to refine
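The greedy descent within a single layer can be sketched in TypeScript. This toy version uses squared Euclidean distance and follows one node until no neighbor improves; the real algorithm keeps a beam of ef candidates instead of a single node, and repeats the descent layer by layer:

```typescript
type Vec = number[];

// Squared Euclidean distance (monotonic in true distance, cheaper).
function dist(a: Vec, b: Vec): number {
  let s = 0;
  for (let i = 0; i < a.length; i++) s += (a[i] - b[i]) ** 2;
  return s;
}

// Greedy search on one layer: hop to whichever neighbor is closer to
// the query until we reach a local minimum.
function greedySearch(
  vectors: Vec[],        // node id -> vector
  neighbors: number[][], // node id -> adjacency list (the M links)
  entryPoint: number,
  query: Vec
): number {
  let current = entryPoint;
  let best = dist(vectors[current], query);
  let improved = true;
  while (improved) {
    improved = false;
    for (const n of neighbors[current]) {
      const d = dist(vectors[n], query);
      if (d < best) { best = d; current = n; improved = true; }
    }
  }
  return current;
}
```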

Two hyperparameters matter most:

| Parameter                  | Meaning                            | ES 9 default  | Production recommendation                     |
|----------------------------|------------------------------------|---------------|-----------------------------------------------|
| m                          | Connections per node               | 16            | 16-32 (higher = better recall, more RAM)      |
| ef_construction            | Beam width when building the graph | 100           | 100-200 (higher = slower build, better graph) |
| ef_search (num_candidates) | Beam width at search time          | Set per query | Recall-dependent: 100 (fast) to 500 (high recall) |

HNSW memory note

With 1 billion 768-dim float32 vectors, HNSW needs ~3-4 TiB of RAM for the vector data and the graph structure. That's why BBQ quantization (-95%) is a game-changer at production scale. If the dataset exceeds 100M vectors, also consider IVF or tiered storage.
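The arithmetic behind those figures is simple (raw vector storage only; the HNSW graph links and the float32 copies kept for re-scoring come on top):

```typescript
// Back-of-envelope memory for 1 billion 768-dim vectors.
const vectors = 1e9;
const dims = 768;

const float32Bytes = vectors * dims * 4; // 4 bytes per dimension
const bbqBytes = (vectors * dims) / 8;   // 1 bit per dimension

console.log((float32Bytes / 2 ** 40).toFixed(2) + " TiB"); // ≈ 2.79 TiB
console.log((bbqBytes / 2 ** 30).toFixed(2) + " GiB");     // ≈ 89.41 GiB
```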

4. Designing a production search system

This section covers an end-to-end production search architecture — from data ingestion to serving.

graph LR
    subgraph Data Sources
        DB[(Database)]
        CMS[CMS/API]
        S3[Object Storage]
    end
    subgraph Ingestion Pipeline
        CDC[CDC / Change Stream]
        ENR[Enrichment Service]
        EMB[Embedding Service]
        ING[Ingest Pipeline]
    end
    subgraph Elasticsearch Cluster
        COORD[Coordinating Nodes]
        DATA1[Data Node - Hot]
        DATA2[Data Node - Warm]
        ML[ML Node - ELSER]
    end
    subgraph Serving
        API[Search API .NET]
        CACHE[Redis Cache]
        CLIENT[Vue Frontend]
    end
    DB --> CDC
    CMS --> CDC
    S3 --> CDC
    CDC --> ENR
    ENR --> EMB
    EMB --> ING
    ING --> COORD
    COORD --> DATA1
    COORD --> DATA2
    COORD --> ML
    CLIENT --> API
    API --> CACHE
    API --> COORD
End-to-end search system architecture: from data source to frontend

4.1 Data ingestion — CDC + embedding pipeline

Rather than batch reindex, production systems should use CDC (Change Data Capture) — capturing every change from the source database (INSERT/UPDATE/DELETE) and pushing it to Elasticsearch in near real time. For SQL Server use Debezium or Change Tracking; for PostgreSQL use logical replication.

The pipeline processes each document in three steps:

  1. Enrichment: add metadata, normalize text, detect language
  2. Embedding: call the embedding model (Jina v3, OpenAI text-embedding-3-large, or self-hosted) to produce the dense vector
  3. Ingest: the Elasticsearch Ingest Pipeline handles ELSER sparse encoding and document indexing

IngestPipeline.cs
public class SearchDocumentIndexer
{
    private readonly ElasticsearchClient _client;
    private readonly IEmbeddingService _embeddingService;

    public async Task IndexDocumentAsync(ProductDocument doc)
    {
        var embedding = await _embeddingService
            .GenerateEmbeddingAsync(doc.Title + " " + doc.Description);

        var indexDoc = new SearchableProduct
        {
            Id = doc.Id,
            Title = doc.Title,
            Description = doc.Description,
            ContentEmbedding = embedding,
            SemanticContent = doc.Title + " " + doc.Description,
            Category = doc.Category,
            Price = doc.Price,
            UpdatedAt = DateTime.UtcNow
        };

        await _client.IndexAsync(indexDoc, idx => idx
            .Index("products-v2")
            .Id(doc.Id.ToString())
            .Pipeline("product-enrichment")
        );
    }
}

4.2 Index design — Multi-field strategy

A single document needs multiple "views" to power hybrid search:

index-template.json
{
  "index_patterns": ["products-*"],
  "template": {
    "settings": {
      "number_of_shards": 3,
      "number_of_replicas": 1,
      "index.knn": true,
      "analysis": {
        "analyzer": {
          "en_analyzer": {
            "type": "custom",
            "tokenizer": "icu_tokenizer",
            "filter": ["icu_folding", "lowercase", "english_stop"]
          }
        }
      }
    },
    "mappings": {
      "properties": {
        "title": {
          "type": "text",
          "analyzer": "en_analyzer",
          "fields": {
            "keyword": { "type": "keyword" },
            "suggest": {
              "type": "completion",
              "analyzer": "en_analyzer"
            }
          }
        },
        "description": {
          "type": "text",
          "analyzer": "en_analyzer"
        },
        "semantic_content": {
          "type": "semantic_text"
        },
        "content_embedding": {
          "type": "dense_vector",
          "dims": 768,
          "index": true,
          "similarity": "cosine",
          "index_options": {
            "type": "bbq_hnsw",
            "m": 24,
            "ef_construction": 150
          }
        },
        "sku": { "type": "keyword" },
        "category": { "type": "keyword" },
        "price": { "type": "float" },
        "updated_at": { "type": "date" }
      }
    }
  }
}

4.3 Search API — .NET integration

Elastic ships Elastic.Clients.Elasticsearch (the official .NET client) with full support for the Retrievers API and hybrid search.

SearchService.cs
public class HybridSearchService
{
    private readonly ElasticsearchClient _client;

    public async Task<SearchResult> HybridSearchAsync(
        string query, string? category = null, int page = 1, int size = 10)
    {
        var response = await _client.SearchAsync<ProductDoc>(s => s
            .Index("products-v2")
            .From((page - 1) * size)
            .Size(size)
            .Retriever(r => r
                .Rrf(rrf => rrf
                    .RankWindowSize(100)
                    .Retrievers(
                        ret => ret.Standard(std => std
                            .Query(q => q
                                .Bool(b => b
                                    .Must(m => m
                                        .MultiMatch(mm => mm
                                            .Query(query)
                                            .Fields(new[] { "title^3", "description" })
                                            .Type(TextQueryType.BestFields)
                                            .Fuzziness(new Fuzziness("AUTO"))
                                        )
                                    )
                                    .Filter(BuildCategoryFilter(category))
                                )
                            )
                        ),
                        ret => ret.Standard(std => std
                            .Query(q => q
                                .Semantic(sem => sem
                                    .Field("semantic_content")
                                    .Query(query)
                                )
                            )
                        ),
                        ret => ret.Knn(knn => knn
                            .Field("content_embedding")
                            .QueryVectorBuilder(qvb => qvb
                                .TextEmbedding(te => te
                                    .ModelId("jina-v3")
                                    .ModelText(query)
                                )
                            )
                            .K(50)
                            .NumCandidates(200)
                        )
                    )
                )
            )
            .Highlight(h => h
                .Fields(f => f
                    .Add("title", new HighlightField
                    {
                        PreTags = new[] { "<mark>" },
                        PostTags = new[] { "</mark>" }
                    })
                    .Add("description", new HighlightField
                    {
                        PreTags = new[] { "<mark>" },
                        PostTags = new[] { "</mark>" }
                    })
                )
            )
        );

        return MapToSearchResult(response);
    }
}

5. Score fusion — RRF vs Linear Combination

When combining results from multiple retrievers, you need a fusion strategy to merge and re-rank. Elasticsearch 9 supports two primary methods.

5.1 Reciprocal Rank Fusion (RRF)

RRF is rank-based — it only cares about a document's position (rank) in each retriever, not the absolute score. The formula:

RRF_score(d) = Sum( 1 / (k + rank_i(d)) )
// k = constant (default 60), rank_i = position in retriever i

Biggest advantage: no need to normalize scores across retrievers. BM25 might score 15.7, cosine similarity 0.89 — RRF handles it because it only looks at ranks.
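The whole fusion fits in a few lines. A TypeScript sketch of RRF over per-retriever rankings (each an array of document IDs, best first):

```typescript
// RRF: each retriever contributes 1/(k + rank) per document; documents
// ranked well by several retrievers accumulate the highest totals.
function rrfFuse(rankings: string[][], k = 60): [string, number][] {
  const scores = new Map<string, number>();
  for (const ranking of rankings) {
    ranking.forEach((docId, idx) => {
      const rank = idx + 1; // ranks are 1-based
      scores.set(docId, (scores.get(docId) ?? 0) + 1 / (k + rank));
    });
  }
  return [...scores.entries()].sort((a, b) => b[1] - a[1]);
}
```

For example, a document ranked 2nd by BM25 and 1st by kNN scores 1/62 + 1/61, beating one that only a single retriever found.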

5.2 Linear Combination (Weighted)

With Linear Combination, you assign weights to each retriever and sum the scores (after normalization). A fit when you already know which retriever matters most for a use case.

linear-combination-query.json
{
  "retriever": {
    "linear": {
      "retrievers": [
        {
          "retriever": {
            "standard": {
              "query": { "match": { "content": "database optimization" } }
            }
          },
          "weight": 0.3
        },
        {
          "retriever": {
            "knn": {
              "field": "content_embedding",
              "query_vector_builder": {
                "text_embedding": {
                  "model_id": "jina-v3",
                  "model_text": "database optimization"
                }
              },
              "k": 50,
              "num_candidates": 200
            }
          },
          "weight": 0.7
        }
      ],
      "normalizer": "min_max"
    }
  }
}
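What the linear retriever does can be sketched in TypeScript: min-max normalize each retriever's scores into [0, 1], then take the weighted sum. The document IDs and raw scores below are purely illustrative:

```typescript
// Min-max normalization: map a retriever's scores onto [0, 1] so BM25
// scores (~15.7) and cosine scores (~0.89) become comparable.
function minMax(scores: Map<string, number>): Map<string, number> {
  const vals = [...scores.values()];
  const lo = Math.min(...vals);
  const range = Math.max(...vals) - lo || 1; // guard: all-equal scores
  return new Map([...scores].map(([id, s]) => [id, (s - lo) / range]));
}

// Weighted sum of normalized scores across retrievers.
function linearFuse(
  retrievers: { scores: Map<string, number>; weight: number }[]
): [string, number][] {
  const fused = new Map<string, number>();
  for (const { scores, weight } of retrievers) {
    for (const [id, s] of minMax(scores)) {
      fused.set(id, (fused.get(id) ?? 0) + weight * s);
    }
  }
  return [...fused.entries()].sort((a, b) => b[1] - a[1]);
}
```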

| Criterion           | RRF                                        | Linear Combination                       |
|---------------------|--------------------------------------------|------------------------------------------|
| Score normalization | Not needed                                 | Required (min-max or z-score)            |
| Tuning              | Minimal (just k)                           | Weights tuned per use case               |
| When to use         | Default; when you don't know the data well | After A/B tests with known optimal weights |
| Performance         | Good for most cases                        | Can be better when tuned correctly       |

6. ColBERT and multi-stage re-ranking

Elasticsearch 9 supports late-interaction models like ColBERT and ColPali via the MaxSim operator, a major step forward for search quality.

A 3-stage production pipeline:

graph LR
    A[Query] --> B[Stage 1: Candidate Retrieval]
    B --> |Top 1000| C[Stage 2: Hybrid Fusion RRF]
    C --> |Top 100| D[Stage 3: ColBERT Re-ranking]
    D --> |Top 10| E[Final Results]
Multi-stage search pipeline: retrieval, fusion, re-ranking

ColBERT (Contextualized Late Interaction over BERT) produces an embedding for each token of the document and query, then applies MaxSim to compute an interaction score. More expensive than a bi-encoder but delivers relevance close to a cross-encoder at much higher speed.
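The MaxSim operator itself is straightforward. A TypeScript sketch over raw token-embedding matrices (assuming the embeddings are already L2-normalized, so a dot product equals cosine similarity):

```typescript
// ColBERT-style MaxSim: for each query token embedding, take its best
// dot product over all document token embeddings, then sum the maxima.
function maxSim(queryTokens: number[][], docTokens: number[][]): number {
  let score = 0;
  for (const q of queryTokens) {
    let best = -Infinity;
    for (const d of docTokens) {
      let dot = 0;
      for (let i = 0; i < q.length; i++) dot += q[i] * d[i];
      if (dot > best) best = dot;
    }
    score += best;
  }
  return score;
}
```

The per-token maxima are what make late interaction expensive: a 32-token query against a 200-token document is 6,400 dot products per candidate, which is why it belongs in the final re-ranking stage, not first-pass retrieval.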

7. Scaling Elasticsearch for billion-scale

7.1 Shard strategy

Basic production rules:

  • Shard size: target 20-40 GiB/shard for search workloads, 40-60 GiB for logging
  • Shard count: avoid too many small shards — each shard has overhead (segment metadata memory, thread pools)
  • Shard count = total data size / target shard size, rounded up
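The sizing rule as a one-liner, with hypothetical numbers (500 GiB of search data at a 30 GiB target gives 17 primary shards):

```typescript
// Primary shard count = ceil(total data size / target shard size).
function shardCount(totalGiB: number, targetGiB = 30): number {
  return Math.ceil(totalGiB / targetGiB);
}
```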

7.2 Tiered architecture

| Tier   | Hardware                    | Data                     | Characteristics                      |
|--------|-----------------------------|--------------------------|--------------------------------------|
| Hot    | NVMe SSD, high RAM          | 0-7 days / active index  | Highest speed, search + indexing     |
| Warm   | Standard SSD                | 7-30 days                | Search only, force merge             |
| Cold   | HDD / Searchable Snapshots  | 30+ days                 | Archival, occasional search          |
| Frozen | Object Storage (S3)         | Long-term storage        | Searchable Snapshots, high latency   |

7.3 Filtered vector search — ACORN

Since Elasticsearch 9.1, the ACORN algorithm significantly improves filtered kNN search performance. Previously, combining vector search with a filter (e.g. finding similar products only within the "electronics" category) forced Elasticsearch to scan many more candidates than necessary. ACORN integrates the filter directly into HNSW graph traversal, reducing nodes visited.

8. Search UX — Frontend integration with Vue

Good search isn't only backend — frontend UX decides the user experience.

8.1 Search-as-you-type with debounce

useSearch.ts
import { ref, watch } from 'vue'
import { useDebounceFn } from '@vueuse/core'

export function useHybridSearch() {
  const query = ref('')
  const results = ref<SearchResult[]>([])
  const suggestions = ref<string[]>([])
  const isLoading = ref(false)

  const fetchSuggestions = useDebounceFn(async (q: string) => {
    if (q.length < 2) {
      suggestions.value = []
      return
    }
    const res = await fetch(
      `/api/search/suggest?q=${encodeURIComponent(q)}`
    )
    suggestions.value = await res.json()
  }, 150)

  const executeSearch = useDebounceFn(async (q: string) => {
    if (!q.trim()) {
      results.value = []
      return
    }
    isLoading.value = true
    try {
      const res = await fetch(
        `/api/search?q=${encodeURIComponent(q)}`
      )
      const data = await res.json()
      results.value = data.hits
    } finally {
      isLoading.value = false
    }
  }, 300)

  watch(query, (val) => {
    fetchSuggestions(val)
    executeSearch(val)
  })

  return { query, results, suggestions, isLoading }
}

8.2 Highlight and snippet

Elasticsearch returns highlight fragments: text chunks containing matches wrapped in <mark> tags. The frontend renders them with v-html, so sanitize the fragments first (for example, allow only <mark> tags), because v-html does not escape HTML:

SearchResult.vue
<template>
  <div class="search-result" v-for="hit in results" :key="hit.id">
    <h3 v-html="hit.highlight?.title?.[0] || hit.title" />
    <p
      class="snippet"
      v-html="hit.highlight?.description?.[0]
        || truncate(hit.description, 200)"
    />
    <div class="meta">
      <span class="category">{{ hit.category }}</span>
      <span class="score">Relevance: {{ hit.score.toFixed(2) }}</span>
    </div>
  </div>
</template>

9. Monitoring and performance tuning

9.1 Metrics to track

Search latency

P50 < 50 ms, P99 < 200 ms for the search API. Track per-retriever latency inside a hybrid query to pinpoint the bottleneck.

Indexing throughput

Target: > 5,000 docs/s for real-time indexing. Monitor indexing_pressure and rejected_requests to detect backpressure.

Recall & relevance

Use nDCG@10 and MRR to measure search quality. A/B test RRF weights regularly with real user clicks.

Resource usage

Monitor heap usage, segment memory, vector-index memory. BBQ cuts vector memory 95% but the HNSW graph still needs significant RAM.

9.2 Common pitfalls

Common hybrid-search deployment mistakes

  • Embedding model mismatch: using different models for indexing vs searching produces meaningless cosine similarity. Always use the same model and version.
  • Too many shards: a kNN query runs a separate HNSW traversal on every shard, so 100 shards mean 100 graph searches per query and heavy fan-in/merge overhead. Consolidate shards when you can.
  • Cold cache: the HNSW graph must be loaded into the OS page cache. After a restart the first query is very slow — use warming queries.
  • Skipping re-ranking: hybrid retrieval delivers great recall, but precision needs re-ranking (ColBERT/cross-encoder) for top results.

10. Elasticsearch vs alternatives in 2026

| Criterion              | Elasticsearch 9        | OpenSearch 2.x  | Milvus/Qdrant | pgvector            |
|------------------------|------------------------|-----------------|---------------|---------------------|
| Hybrid Search          | ✔ Native (RRF, Linear) | ✔ Native (RRF)  | ⚠ Limited     | ⚠ Manual            |
| Sparse Vector (ELSER)  | ✔ Built-in             | ✘ None          | ✘ None        | ✘ None              |
| BBQ Quantization       | ✔ Default              | ✘ None          | ✔ SQ/PQ       | ✘ None              |
| Full-text Search       | ✔ Excellent            | ✔ Good          | ✘ Limited     | ⚠ Basic             |
| Ecosystem/.NET SDK     | ✔ Mature               | ⚠ Moderate      | ✔ Good        | ✔ Native (EF Core)  |
| Operations complexity  | ⚠ Moderate             | ⚠ Moderate      | ✘ High        | ✔ Low               |
| License                | SSPL + Elastic License | Apache 2.0      | Apache 2.0    | PostgreSQL          |

How to choose

Elasticsearch 9: when you need both full-text and vector search in the same system, or already run the Elastic Stack. pgvector: when you already use PostgreSQL, have fewer than 10M vectors, and want the simplest option. Milvus/Qdrant: when vector search is the primary use case (RAG, recommendations) and you don't need full-text.

Conclusion

Elasticsearch 9 marks the shift from "search engine" to "unified search & vector platform". With BBQ quantization slashing memory 95%, ELSER enabling semantic search without GPUs, the Retrievers API for composable hybrid queries, and ColBERT/ColPali re-ranking — it's the most complete production search stack in 2026. The key isn't choosing keyword or vector, but combining both correctly in a single pipeline with the fusion strategy best suited to your use case.
