Elasticsearch 9 and Hybrid Search 2026 — BBQ, ELSER, Retrievers API, and a Production Search System Architecture

Posted on: 4/17/2026 11:15:07 AM

In 2026, search is no longer just "type a keyword and return results". Users expect systems to understand query intent and surface relevant content even when the keywords don't match, while still being exact for queries like product SKUs or proper names. Elasticsearch 9 answers this with a hybrid search architecture: the traditional inverted index (BM25) and vector search (HNSW + BBQ) combined in the same engine, so you no longer need to run two separate systems.

  • 95% memory reduction vs float32 with BBQ
  • 30x faster throughput with SIMD-accelerated BBQ
  • 5x faster than OpenSearch (Elastic's BBQ benchmark)
  • 9.0 GA: the latest major Elasticsearch version

1. Why Hybrid Search?

Before diving into Elasticsearch 9, it helps to understand why we need hybrid search instead of either traditional paradigm alone.

1.1 The limits of Keyword Search (BM25)

BM25 — a term-frequency-based ranking function that improves on classic TF-IDF — remains the foundation of every search engine. It relies on an inverted index that maps every term to the list of documents containing it, enabling sub-millisecond lookups across billions of documents.
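The ranking function can be sketched in TypeScript. This is a toy single-term score with the common k1 = 1.2, b = 0.75 defaults, not Lucene's exact implementation (which adds boosts and norm compression):

```typescript
// Toy single-term BM25. idf rewards rare terms; the tf part saturates,
// and the length term penalizes documents longer than average.
function bm25Term(
  tf: number,        // occurrences of the term in this document
  docLen: number,    // length of this document (tokens)
  avgDocLen: number, // average document length in the index
  docCount: number,  // total documents in the index
  docFreq: number,   // documents containing the term
  k1 = 1.2,
  b = 0.75
): number {
  const idf = Math.log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5));
  const norm = k1 * (1 - b + (b * docLen) / avgDocLen);
  return (idf * (tf * (k1 + 1))) / (tf + norm);
}
```

A full document score sums this over every query term the document contains — which is exactly why zero shared terms means a zero score.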

However, BM25 has a fundamental weakness: vocabulary mismatch. When a user searches "how to lose weight effectively" but the document says "methods for managing body weight", BM25 can't match them — there's no shared vocabulary. That's the intrinsic limit of lexical matching.

1.2 The limits of Vector Search (Semantic)

Vector search solves the vocabulary-mismatch problem by turning text into embedding vectors (arrays of real-valued numbers in many dimensions), then searching by cosine/dot-product distance in that vector space. Two semantically similar sentences end up close together in that space, regardless of the exact words they use.

But vector search has its own weakness: it's bad at exact matching. Searching for order ID "ORD-2026-78543" with vector search performs terribly — embedding models aren't trained to distinguish random character strings.
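The distance computation itself is trivial; the hard parts are the embedding model and the ANN index. A minimal cosine similarity in TypeScript, for reference:

```typescript
// Cosine similarity: dot product divided by the product of magnitudes.
// 1 = same direction (semantically close), 0 = orthogonal (unrelated).
function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}
```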

Key insight

Hybrid Search isn't "pick one or the other" — it's combining both. Use BM25 for exact match, use vectors for semantic understanding, and fuse the results for the best relevance. Elasticsearch 9 does this natively in a single query.

| Criterion                      | BM25 (Keyword) | Vector Search    | Hybrid Search          |
|--------------------------------|----------------|------------------|------------------------|
| Exact match (SKU, proper name) | ✔ Excellent    | ✘ Poor           | ✔ Excellent            |
| Semantic understanding         | ✘ None         | ✔ Excellent      | ✔ Excellent            |
| Latency                        | ✔ < 1 ms       | ⚠ 5-50 ms        | ⚠ 10-60 ms             |
| Memory                         | ✔ Low          | ✘ Very high      | ⚠ High (BBQ cuts 95%)  |
| Vocabulary mismatch            | ✘ Unresolved   | ✔ Handled well   | ✔ Handled well         |
| Overall relevance              | ⚠ Moderate     | ⚠ Moderate-Good  | ✔ Best                 |

2. Elasticsearch 9 architecture — Core changes

Elasticsearch 9.0 (GA early 2025, continuously updated through 9.3 in 2026) delivers a wave of improvements that turn it from a keyword-only engine into a unified search platform.

graph TD
    A[Client Query] --> B[Query DSL / Retrievers API]
    B --> C{Query Router}
    C --> D[BM25 Inverted Index]
    C --> E[ELSER Sparse Vector]
    C --> F[Dense Vector HNSW + BBQ]
    D --> G[Score Normalization]
    E --> G
    F --> G
    G --> H{Fusion Method}
    H --> I[RRF - Reciprocal Rank Fusion]
    H --> J[Linear Combination]
    I --> K[Final Ranked Results]
    J --> K
    K --> L[Re-ranking ColBERT/ColPali]
    L --> M[Response]
Hybrid Search pipeline architecture in Elasticsearch 9

2.1 BBQ — Better Binary Quantization

This is the single biggest Elasticsearch 9 improvement for vector search. BBQ compresses vectors from float32 down to a binary representation (1 bit per dimension), slashing memory by up to 95% compared with raw float32.

On a 1-billion-vector, 768-dimension dataset:

  • ~3 TiB raw float32 memory
  • ~150 GiB BBQ memory
  • 20% higher recall vs PQ
  • 8-30x faster throughput (SIMD)

From Elasticsearch 9.1 onward, BBQ is the default for every new dense_vector index. The algorithm runs in two steps: (1) quantize the vector to binary and use SIMD (POPCNT + XOR) for fast comparisons, (2) re-score the top candidates with the original vector to preserve recall.
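The core of step (1) can be sketched in TypeScript. This is a naive illustration of 1-bit quantization and Hamming distance, not Elastic's actual BBQ code (which adds learned thresholds, corrective terms, and hardware SIMD):

```typescript
// 1-bit quantization: keep only the sign of each dimension.
// A 768-dim float32 vector (3072 bytes) becomes 96 bytes.
function quantize(vec: number[]): Uint8Array {
  const bits = new Uint8Array(Math.ceil(vec.length / 8));
  vec.forEach((v, i) => {
    if (v > 0) bits[i >> 3] |= 1 << (i & 7);
  });
  return bits;
}

// Distance between binary vectors is Hamming distance: XOR the bytes,
// then count set bits. This is the XOR + POPCNT loop SIMD accelerates.
function hamming(a: Uint8Array, b: Uint8Array): number {
  let d = 0;
  for (let i = 0; i < a.length; i++) {
    let x = a[i] ^ b[i];
    while (x) { x &= x - 1; d++; } // Kernighan popcount
  }
  return d;
}
```

Step (2) then re-scores the top Hamming-distance candidates against the original float32 vectors, which is how BBQ keeps recall high despite the lossy compression.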

elasticsearch-mapping.json
{
  "mappings": {
    "properties": {
      "content_embedding": {
        "type": "dense_vector",
        "dims": 768,
        "index": true,
        "similarity": "cosine",
        "index_options": {
          "type": "bbq_hnsw"
        }
      },
      "content": {
        "type": "text",
        "analyzer": "vietnamese_custom"
      },
      "semantic_content": {
        "type": "semantic_text"
      }
    }
  }
}

2.2 ELSER v2 — Elastic Learned Sparse Encoder

ELSER is a sparse retrieval model trained by Elastic that produces sparse embeddings you can store in the familiar inverted index while still getting semantic matching. Unlike dense vectors, where every dimension carries a value, a sparse vector has only a few non-zero dimensions, each representing an "expanded term" the model learned.

Notable detail: ELSER ships inside Elasticsearch 9 — no external model server, no GPU, CPU inference runs directly on the node. When you create a semantic_text field without specifying an inference endpoint, Elasticsearch uses ELSER automatically.

When to use ELSER vs dense vectors

ELSER (sparse): ideal when you want semantic search but have no GPU and want a simple deployment; note that ELSER is trained on English text, so quality degrades on other languages. Dense vectors: ideal when you need strong multilingual coverage (e.g. Jina v3), cross-modal search (text-to-image), or a RAG pipeline that needs custom embeddings.

2.3 Retrievers API — Composable search pipelines

Retrievers is a new Query DSL abstraction that lets you build multi-stage search pipelines declaratively. Instead of writing complex logic in the application layer, you compose it directly in the query.

hybrid-search-query.json
{
  "retriever": {
    "rrf": {
      "retrievers": [
        {
          "standard": {
            "query": {
              "match": {
                "content": {
                  "query": "ways to optimize database performance",
                  "analyzer": "english_custom"
                }
              }
            }
          }
        },
        {
          "standard": {
            "query": {
              "semantic": {
                "field": "semantic_content",
                "query": "ways to optimize database performance"
              }
            }
          }
        },
        {
          "knn": {
            "field": "content_embedding",
            "query_vector_builder": {
              "text_embedding": {
                "model_id": "jina-embeddings-v3",
                "model_text": "ways to optimize database performance"
              }
            },
            "k": 50,
            "num_candidates": 200
          }
        }
      ],
      "rank_window_size": 100
    }
  },
  "_source": ["title", "content", "url"],
  "size": 10
}

This query combines three retrievers: BM25 match, ELSER semantic, and dense kNN — fused with RRF. All of them run in a single roundtrip to Elasticsearch, with no orchestration in the application.

3. HNSW deep dive — The algorithm behind vector search

HNSW (Hierarchical Navigable Small World) is the ANN (Approximate Nearest Neighbor) algorithm Elasticsearch uses for dense vector search. Understanding how it works lets you tune performance for production.

graph TD
    subgraph "Layer 2 - Coarsest"
        A2[Node A] --- B2[Node B]
    end
    subgraph "Layer 1 - Medium"
        A1[Node A] --- B1[Node B]
        B1 --- C1[Node C]
        A1 --- D1[Node D]
    end
    subgraph "Layer 0 - Finest"
        A0[Node A] --- B0[Node B]
        B0 --- C0[Node C]
        A0 --- D0[Node D]
        C0 --- E0[Node E]
        D0 --- F0[Node F]
        E0 --- F0
        B0 --- E0
    end
The multi-layer HNSW graph — search starts at the top layer and drills down to layer 0

3.1 How HNSW works

HNSW builds a multi-layer graph in which:

  • Layer 0 contains every node (vector); each node is connected to its M nearest neighbors
  • Layers 1, 2, … contain a random subset of nodes (probability decreases with each layer), forming "express highways" for fast navigation
  • Search starts at the highest layer (few nodes, long hops), greedily moves toward neighbors closest to the query, then drops down a layer to refine
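The greedy descent within a single layer can be sketched in TypeScript. This toy version uses squared Euclidean distance and follows one node until no neighbor improves; the real algorithm keeps a beam of ef candidates instead of a single node, and repeats the descent layer by layer:

```typescript
type Vec = number[];

// Squared Euclidean distance (monotonic in true distance, cheaper).
function dist(a: Vec, b: Vec): number {
  let s = 0;
  for (let i = 0; i < a.length; i++) s += (a[i] - b[i]) ** 2;
  return s;
}

// Greedy search on one layer: hop to whichever neighbor is closer to
// the query until we reach a local minimum.
function greedySearch(
  vectors: Vec[],        // node id -> vector
  neighbors: number[][], // node id -> adjacency list (the M links)
  entryPoint: number,
  query: Vec
): number {
  let current = entryPoint;
  let best = dist(vectors[current], query);
  let improved = true;
  while (improved) {
    improved = false;
    for (const n of neighbors[current]) {
      const d = dist(vectors[n], query);
      if (d < best) { best = d; current = n; improved = true; }
    }
  }
  return current;
}
```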

Two hyperparameters matter most:

| Parameter                  | Meaning                            | ES 9 default  | Production recommendation                     |
|----------------------------|------------------------------------|---------------|-----------------------------------------------|
| m                          | Connections per node               | 16            | 16-32 (higher = better recall, more RAM)      |
| ef_construction            | Beam width when building the graph | 100           | 100-200 (higher = slower build, better graph) |
| ef_search (num_candidates) | Beam width at search time          | Set per query | Recall-dependent: 100 (fast) to 500 (high recall) |

HNSW memory note

With 1 billion 768-dim float32 vectors, HNSW needs ~3-4 TiB of RAM for the vector data and the graph structure. That's why BBQ quantization (-95%) is a game-changer at production scale. If the dataset exceeds 100M vectors, also consider IVF or tiered storage.
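The arithmetic behind those figures is simple (raw vector storage only; the HNSW graph links and the float32 copies kept for re-scoring come on top):

```typescript
// Back-of-envelope memory for 1 billion 768-dim vectors.
const vectors = 1e9;
const dims = 768;

const float32Bytes = vectors * dims * 4; // 4 bytes per dimension
const bbqBytes = (vectors * dims) / 8;   // 1 bit per dimension

console.log((float32Bytes / 2 ** 40).toFixed(2) + " TiB"); // ≈ 2.79 TiB
console.log((bbqBytes / 2 ** 30).toFixed(2) + " GiB");     // ≈ 89.41 GiB
```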

4. Designing a production search system

This section covers an end-to-end production search architecture — from data ingestion to serving.

graph LR
    subgraph Data Sources
        DB[(Database)]
        CMS[CMS/API]
        S3[Object Storage]
    end
    subgraph Ingestion Pipeline
        CDC[CDC / Change Stream]
        ENR[Enrichment Service]
        EMB[Embedding Service]
        ING[Ingest Pipeline]
    end
    subgraph Elasticsearch Cluster
        COORD[Coordinating Nodes]
        DATA1[Data Node - Hot]
        DATA2[Data Node - Warm]
        ML[ML Node - ELSER]
    end
    subgraph Serving
        API[Search API .NET]
        CACHE[Redis Cache]
        CLIENT[Vue Frontend]
    end
    DB --> CDC
    CMS --> CDC
    S3 --> CDC
    CDC --> ENR
    ENR --> EMB
    EMB --> ING
    ING --> COORD
    COORD --> DATA1
    COORD --> DATA2
    COORD --> ML
    CLIENT --> API
    API --> CACHE
    API --> COORD
End-to-end search system architecture: from data source to frontend

4.1 Data ingestion — CDC + embedding pipeline

Rather than batch reindex, production systems should use CDC (Change Data Capture) — capturing every change from the source database (INSERT/UPDATE/DELETE) and pushing it to Elasticsearch in near real time. For SQL Server use Debezium or Change Tracking; for PostgreSQL use logical replication.

The pipeline processes each document in three steps:

  1. Enrichment: add metadata, normalize text, detect language
  2. Embedding: call the embedding model (Jina v3, OpenAI text-embedding-3-large, or self-hosted) to produce the dense vector
  3. Ingest: the Elasticsearch Ingest Pipeline handles ELSER sparse encoding and document indexing

IngestPipeline.cs
public class SearchDocumentIndexer
{
    private readonly ElasticsearchClient _client;
    private readonly IEmbeddingService _embeddingService;

    public async Task IndexDocumentAsync(ProductDocument doc)
    {
        var embedding = await _embeddingService
            .GenerateEmbeddingAsync(doc.Title + " " + doc.Description);

        var indexDoc = new SearchableProduct
        {
            Id = doc.Id,
            Title = doc.Title,
            Description = doc.Description,
            ContentEmbedding = embedding,
            SemanticContent = doc.Title + " " + doc.Description,
            Category = doc.Category,
            Price = doc.Price,
            UpdatedAt = DateTime.UtcNow
        };

        await _client.IndexAsync(indexDoc, idx => idx
            .Index("products-v2")
            .Id(doc.Id.ToString())
            .Pipeline("product-enrichment")
        );
    }
}

4.2 Index design — Multi-field strategy

A single document needs multiple "views" to power hybrid search:

index-template.json
{
  "index_patterns": ["products-*"],
  "template": {
    "settings": {
      "number_of_shards": 3,
      "number_of_replicas": 1,
      "index.knn": true,
      "analysis": {
        "analyzer": {
          "en_analyzer": {
            "type": "custom",
            "tokenizer": "icu_tokenizer",
            "filter": ["icu_folding", "lowercase", "english_stop"]
          }
        }
      }
    },
    "mappings": {
      "properties": {
        "title": {
          "type": "text",
          "analyzer": "en_analyzer",
          "fields": {
            "keyword": { "type": "keyword" },
            "suggest": {
              "type": "completion",
              "analyzer": "en_analyzer"
            }
          }
        },
        "description": {
          "type": "text",
          "analyzer": "en_analyzer"
        },
        "semantic_content": {
          "type": "semantic_text"
        },
        "content_embedding": {
          "type": "dense_vector",
          "dims": 768,
          "index": true,
          "similarity": "cosine",
          "index_options": {
            "type": "bbq_hnsw",
            "m": 24,
            "ef_construction": 150
          }
        },
        "sku": { "type": "keyword" },
        "category": { "type": "keyword" },
        "price": { "type": "float" },
        "updated_at": { "type": "date" }
      }
    }
  }
}

4.3 Search API — .NET integration

Elastic ships Elastic.Clients.Elasticsearch (the official .NET client) with full support for the Retrievers API and hybrid search.

SearchService.cs
public class HybridSearchService
{
    private readonly ElasticsearchClient _client;

    public async Task<SearchResult> HybridSearchAsync(
        string query, string? category = null, int page = 1, int size = 10)
    {
        var response = await _client.SearchAsync<ProductDoc>(s => s
            .Index("products-v2")
            .From((page - 1) * size)
            .Size(size)
            .Retriever(r => r
                .Rrf(rrf => rrf
                    .RankWindowSize(100)
                    .Retrievers(
                        ret => ret.Standard(std => std
                            .Query(q => q
                                .Bool(b => b
                                    .Must(m => m
                                        .MultiMatch(mm => mm
                                            .Query(query)
                                            .Fields(new[] { "title^3", "description" })
                                            .Type(TextQueryType.BestFields)
                                            .Fuzziness(new Fuzziness("AUTO"))
                                        )
                                    )
                                    .Filter(BuildCategoryFilter(category))
                                )
                            )
                        ),
                        ret => ret.Standard(std => std
                            .Query(q => q
                                .Semantic(sem => sem
                                    .Field("semantic_content")
                                    .Query(query)
                                )
                            )
                        ),
                        ret => ret.Knn(knn => knn
                            .Field("content_embedding")
                            .QueryVectorBuilder(qvb => qvb
                                .TextEmbedding(te => te
                                    .ModelId("jina-v3")
                                    .ModelText(query)
                                )
                            )
                            .K(50)
                            .NumCandidates(200)
                        )
                    )
                )
            )
            .Highlight(h => h
                .Fields(f => f
                    .Add("title", new HighlightField
                    {
                        PreTags = new[] { "<mark>" },
                        PostTags = new[] { "</mark>" }
                    })
                    .Add("description", new HighlightField
                    {
                        PreTags = new[] { "<mark>" },
                        PostTags = new[] { "</mark>" }
                    })
                )
            )
        );

        return MapToSearchResult(response);
    }
}

5. Score fusion — RRF vs Linear Combination

When combining results from multiple retrievers, you need a fusion strategy to merge and re-rank. Elasticsearch 9 supports two primary methods.

5.1 Reciprocal Rank Fusion (RRF)

RRF is rank-based — it only cares about a document's position (rank) in each retriever, not the absolute score. The formula:

RRF_score(d) = Sum( 1 / (k + rank_i(d)) )
// k = constant (default 60), rank_i = position in retriever i

Biggest advantage: no need to normalize scores across retrievers. BM25 might score 15.7, cosine similarity 0.89 — RRF handles it because it only looks at ranks.
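The whole fusion fits in a few lines. A TypeScript sketch of RRF over per-retriever rankings (each an array of document IDs, best first):

```typescript
// RRF: each retriever contributes 1/(k + rank) per document; documents
// ranked well by several retrievers accumulate the highest totals.
function rrfFuse(rankings: string[][], k = 60): [string, number][] {
  const scores = new Map<string, number>();
  for (const ranking of rankings) {
    ranking.forEach((docId, idx) => {
      const rank = idx + 1; // ranks are 1-based
      scores.set(docId, (scores.get(docId) ?? 0) + 1 / (k + rank));
    });
  }
  return [...scores.entries()].sort((a, b) => b[1] - a[1]);
}
```

For example, a document ranked 2nd by BM25 and 1st by kNN scores 1/62 + 1/61, beating one that only a single retriever found.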

5.2 Linear Combination (Weighted)

With Linear Combination, you assign weights to each retriever and sum the scores (after normalization). A fit when you already know which retriever matters most for a use case.

linear-combination-query.json
{
  "retriever": {
    "linear": {
      "retrievers": [
        {
          "retriever": {
            "standard": {
              "query": { "match": { "content": "database optimization" } }
            }
          },
          "weight": 0.3
        },
        {
          "retriever": {
            "knn": {
              "field": "content_embedding",
              "query_vector_builder": {
                "text_embedding": {
                  "model_id": "jina-v3",
                  "model_text": "database optimization"
                }
              },
              "k": 50,
              "num_candidates": 200
            }
          },
          "weight": 0.7
        }
      ],
      "normalizer": "min_max"
    }
  }
}
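What the linear retriever does can be sketched in TypeScript: min-max normalize each retriever's scores into [0, 1], then take the weighted sum. The document IDs and raw scores below are purely illustrative:

```typescript
// Min-max normalization: map a retriever's scores onto [0, 1] so BM25
// scores (~15.7) and cosine scores (~0.89) become comparable.
function minMax(scores: Map<string, number>): Map<string, number> {
  const vals = [...scores.values()];
  const lo = Math.min(...vals);
  const range = Math.max(...vals) - lo || 1; // guard: all-equal scores
  return new Map([...scores].map(([id, s]) => [id, (s - lo) / range]));
}

// Weighted sum of normalized scores across retrievers.
function linearFuse(
  retrievers: { scores: Map<string, number>; weight: number }[]
): [string, number][] {
  const fused = new Map<string, number>();
  for (const { scores, weight } of retrievers) {
    for (const [id, s] of minMax(scores)) {
      fused.set(id, (fused.get(id) ?? 0) + weight * s);
    }
  }
  return [...fused.entries()].sort((a, b) => b[1] - a[1]);
}
```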

| Criterion           | RRF                                        | Linear Combination                       |
|---------------------|--------------------------------------------|------------------------------------------|
| Score normalization | Not needed                                 | Required (min-max or z-score)            |
| Tuning              | Minimal (just k)                           | Weights tuned per use case               |
| When to use         | Default; when you don't know the data well | After A/B tests with known optimal weights |
| Performance         | Good for most cases                        | Can be better when tuned correctly       |

6. ColBERT and multi-stage re-ranking

Elasticsearch 9 supports late-interaction models like ColBERT and ColPali via the MaxSim operator, a major step forward for search quality.

A 3-stage production pipeline:

graph LR
    A[Query] --> B[Stage 1: Candidate Retrieval]
    B --> |Top 1000| C[Stage 2: Hybrid Fusion RRF]
    C --> |Top 100| D[Stage 3: ColBERT Re-ranking]
    D --> |Top 10| E[Final Results]
Multi-stage search pipeline: retrieval, fusion, re-ranking

ColBERT (Contextualized Late Interaction over BERT) produces an embedding for each token of the document and query, then applies MaxSim to compute an interaction score. More expensive than a bi-encoder but delivers relevance close to a cross-encoder at much higher speed.
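The MaxSim operator itself is straightforward. A TypeScript sketch over raw token-embedding matrices (assuming the embeddings are already L2-normalized, so a dot product equals cosine similarity):

```typescript
// ColBERT-style MaxSim: for each query token embedding, take its best
// dot product over all document token embeddings, then sum the maxima.
function maxSim(queryTokens: number[][], docTokens: number[][]): number {
  let score = 0;
  for (const q of queryTokens) {
    let best = -Infinity;
    for (const d of docTokens) {
      let dot = 0;
      for (let i = 0; i < q.length; i++) dot += q[i] * d[i];
      if (dot > best) best = dot;
    }
    score += best;
  }
  return score;
}
```

The per-token maxima are what make late interaction expensive: a 32-token query against a 200-token document is 6,400 dot products per candidate, which is why it belongs in the final re-ranking stage, not first-pass retrieval.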

7. Scaling Elasticsearch for billion-scale

7.1 Shard strategy

Basic production rules:

  • Shard size: target 20-40 GiB/shard for search workloads, 40-60 GiB for logging
  • Shard count: avoid too many small shards — each shard has overhead (segment metadata memory, thread pools)
  • Shard count = total data size / target shard size, rounded up
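The sizing rule as a one-liner, with hypothetical numbers (500 GiB of search data at a 30 GiB target gives 17 primary shards):

```typescript
// Primary shard count = ceil(total data size / target shard size).
function shardCount(totalGiB: number, targetGiB = 30): number {
  return Math.ceil(totalGiB / targetGiB);
}
```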

7.2 Tiered architecture

| Tier   | Hardware                    | Data                     | Characteristics                      |
|--------|-----------------------------|--------------------------|--------------------------------------|
| Hot    | NVMe SSD, high RAM          | 0-7 days / active index  | Highest speed, search + indexing     |
| Warm   | Standard SSD                | 7-30 days                | Search only, force merge             |
| Cold   | HDD / Searchable Snapshots  | 30+ days                 | Archival, occasional search          |
| Frozen | Object Storage (S3)         | Long-term storage        | Searchable Snapshots, high latency   |

7.3 Filtered vector search — ACORN

Since Elasticsearch 9.1, the ACORN algorithm significantly improves filtered kNN search performance. Previously, combining vector search with a filter (e.g. finding similar products only within the "electronics" category) forced Elasticsearch to scan many more candidates than necessary. ACORN integrates the filter directly into HNSW graph traversal, reducing nodes visited.

8. Search UX — Frontend integration with Vue

Good search isn't only backend — frontend UX decides the user experience.

8.1 Search-as-you-type with debounce

useSearch.ts
import { ref, watch } from 'vue'
import { useDebounceFn } from '@vueuse/core'

export function useHybridSearch() {
  const query = ref('')
  const results = ref<SearchResult[]>([])
  const suggestions = ref<string[]>([])
  const isLoading = ref(false)

  const fetchSuggestions = useDebounceFn(async (q: string) => {
    if (q.length < 2) {
      suggestions.value = []
      return
    }
    const res = await fetch(
      `/api/search/suggest?q=${encodeURIComponent(q)}`
    )
    suggestions.value = await res.json()
  }, 150)

  const executeSearch = useDebounceFn(async (q: string) => {
    if (!q.trim()) {
      results.value = []
      return
    }
    isLoading.value = true
    try {
      const res = await fetch(
        `/api/search?q=${encodeURIComponent(q)}`
      )
      const data = await res.json()
      results.value = data.hits
    } finally {
      isLoading.value = false
    }
  }, 300)

  watch(query, (val) => {
    fetchSuggestions(val)
    executeSearch(val)
  })

  return { query, results, suggestions, isLoading }
}

8.2 Highlight and snippet

Elasticsearch returns highlight fragments: text chunks containing matches wrapped in <mark> tags. The frontend renders them with v-html, so sanitize the fragments first (for example, allow only <mark> tags), because v-html does not escape HTML:

SearchResult.vue
<template>
  <div class="search-result" v-for="hit in results" :key="hit.id">
    <h3 v-html="hit.highlight?.title?.[0] || hit.title" />
    <p
      class="snippet"
      v-html="hit.highlight?.description?.[0]
        || truncate(hit.description, 200)"
    />
    <div class="meta">
      <span class="category">{{ hit.category }}</span>
      <span class="score">Relevance: {{ hit.score.toFixed(2) }}</span>
    </div>
  </div>
</template>

9. Monitoring and performance tuning

9.1 Metrics to track

Search latency

P50 < 50 ms, P99 < 200 ms for the search API. Track per-retriever latency inside a hybrid query to pinpoint the bottleneck.

Indexing throughput

Target: > 5,000 docs/s for real-time indexing. Monitor indexing_pressure and rejected_requests to detect backpressure.

Recall & relevance

Use nDCG@10 and MRR to measure search quality. A/B test RRF weights regularly with real user clicks.

Resource usage

Monitor heap usage, segment memory, vector-index memory. BBQ cuts vector memory 95% but the HNSW graph still needs significant RAM.

9.2 Common pitfalls

Common hybrid-search deployment mistakes

  • Embedding model mismatch: using different models for indexing vs searching produces meaningless cosine similarity. Always use the same model and version.
  • Too many shards: a kNN query runs a separate HNSW traversal on every shard, so 100 shards mean 100 graph searches per query and heavy fan-in/merge overhead. Consolidate shards when you can.
  • Cold cache: the HNSW graph must be loaded into the OS page cache. After a restart the first query is very slow — use warming queries.
  • Skipping re-ranking: hybrid retrieval delivers great recall, but precision needs re-ranking (ColBERT/cross-encoder) for top results.

10. Elasticsearch vs alternatives in 2026

| Criterion              | Elasticsearch 9        | OpenSearch 2.x  | Milvus/Qdrant | pgvector            |
|------------------------|------------------------|-----------------|---------------|---------------------|
| Hybrid Search          | ✔ Native (RRF, Linear) | ✔ Native (RRF)  | ⚠ Limited     | ⚠ Manual            |
| Sparse Vector (ELSER)  | ✔ Built-in             | ✘ None          | ✘ None        | ✘ None              |
| BBQ Quantization       | ✔ Default              | ✘ None          | ✔ SQ/PQ       | ✘ None              |
| Full-text Search       | ✔ Excellent            | ✔ Good          | ✘ Limited     | ⚠ Basic             |
| Ecosystem/.NET SDK     | ✔ Mature               | ⚠ Moderate      | ✔ Good        | ✔ Native (EF Core)  |
| Operations complexity  | ⚠ Moderate             | ⚠ Moderate      | ✘ High        | ✔ Low               |
| License                | SSPL + Elastic License | Apache 2.0      | Apache 2.0    | PostgreSQL          |

How to choose

Elasticsearch 9: when you need both full-text and vector search in the same system, or already run the Elastic Stack. pgvector: when you already use PostgreSQL, have fewer than 10M vectors, and want the simplest option. Milvus/Qdrant: when vector search is the primary use case (RAG, recommendations) and you don't need full-text.

Conclusion

Elasticsearch 9 marks the shift from "search engine" to "unified search & vector platform". With BBQ quantization slashing memory 95%, ELSER enabling semantic search without GPUs, the Retrievers API for composable hybrid queries, and ColBERT/ColPali re-ranking — it's the most complete production search stack in 2026. The key isn't choosing keyword or vector, but combining both correctly in a single pipeline with the fusion strategy best suited to your use case.
