Database Indexing — Nghệ thuật tối ưu query cho hệ thống Production

Posted on: 4/20/2026 11:12:43 AM

Table of contents

1. Index hoạt động như thế nào?
1. 1.1. Cấu trúc B-Tree — Nền tảng của hầu hết index
  1. Tại sao B-Tree hiệu quả?
2. Clustered vs Non-Clustered Index
3. Composite Index và quy tắc Leftmost Prefix
1. Sai lầm phổ biến với Composite Index
2. 3.1. Chiến lược sắp xếp cột trong Composite Index
4. Covering Index — Tránh Key Lookup
1. Khi nào nên dùng Covering Index?
5. Filtered Index — Index có điều kiện
6. Index đặc biệt trong PostgreSQL
1. BRIN Index — Vũ khí bí mật cho bảng lớn
7. Đọc hiểu Execution Plan
1. 7.1. Các operator quan trọng trong Execution Plan
8. Index Anti-patterns — Những sai lầm phải tránh
9. Chiến lược Index cho Production
10. Một số pattern index thường gặp
11. Tổng kết
1. Checklist trước khi deploy index mới
2. Tài liệu tham khảo

Bạn đã bao giờ thấy một query đơn giản chạy mất 30 giây trên production, rồi sau khi thêm đúng một dòng CREATE INDEX thì chỉ còn 5ms? Đó không phải phép màu — đó là sức mạnh của database indexing. Tuy nhiên, index không phải "thêm càng nhiều càng tốt". Hiểu sai về index có thể khiến hệ thống chậm hơn, tốn dung lượng vô ích, và biến mỗi lần INSERT thành cơn ác mộng.

Bài viết này sẽ đi sâu vào cách database index thực sự hoạt động bên trong, các loại index phổ biến trong SQL Server và PostgreSQL, cách đọc Execution Plan, và chiến lược đánh index cho hệ thống production triệu query mỗi ngày.

1000x Cải thiện tốc độ query khi index đúng cách

80% Vấn đề performance DB liên quan đến index

10-30% Dung lượng bổ sung cho mỗi index

O(log n) Độ phức tạp tìm kiếm B-Tree Index

1. Index hoạt động như thế nào?

Hãy tưởng tượng một cuốn sách dày 1000 trang. Nếu bạn muốn tìm từ "Deadlock", bạn có hai cách: lật từng trang (Table Scan) hoặc mở mục lục ở cuối sách (Index Lookup). Mục lục sách chính là nguyên lý cơ bản của database index.

Trong database, index là một cấu trúc dữ liệu riêng biệt được duy trì song song với bảng gốc. Khi bạn tạo index trên cột Email, database sẽ tạo một "bản sao thu gọn" chứa giá trị Email đã sắp xếp kèm pointer trỏ về row gốc trong bảng.

1.1. Cấu trúc B-Tree — Nền tảng của hầu hết index

B-Tree (Balanced Tree) là cấu trúc dữ liệu phổ biến nhất cho database index. Điểm đặc biệt của B-Tree là chiều cao rất thấp — một B-Tree chứa 1 triệu bản ghi thường chỉ cao 3-4 tầng. Nghĩa là chỉ cần 3-4 lần đọc đĩa (page read) để tìm được bất kỳ bản ghi nào.

graph TD
    Root["Root Node
Pointer: 30 | 60"]
    I1["Internal
10 | 20"]
    I2["Internal
40 | 50"]
    I3["Internal
70 | 80 | 90"]
    L1["Leaf
1,3,5,8,10"]
    L2["Leaf
12,15,18,20"]
    L3["Leaf
22,25,28,30"]
    L4["Leaf
32,35,38,40"]
    L5["Leaf
42,45,48,50"]
    L6["Leaf
55,58,60"]

    Root --> I1
    Root --> I2
    Root --> I3
    I1 --> L1
    I1 --> L2
    I1 --> L3
    I2 --> L4
    I2 --> L5
    I2 --> L6

    style Root fill:#e94560,stroke:#fff,color:#fff
    style I1 fill:#2c3e50,stroke:#fff,color:#fff
    style I2 fill:#2c3e50,stroke:#fff,color:#fff
    style I3 fill:#2c3e50,stroke:#fff,color:#fff
    style L1 fill:#f8f9fa,stroke:#e94560,color:#2c3e50
    style L2 fill:#f8f9fa,stroke:#e94560,color:#2c3e50
    style L3 fill:#f8f9fa,stroke:#e94560,color:#2c3e50
    style L4 fill:#f8f9fa,stroke:#e94560,color:#2c3e50
    style L5 fill:#f8f9fa,stroke:#e94560,color:#2c3e50
    style L6 fill:#f8f9fa,stroke:#e94560,color:#2c3e50

Cấu trúc B-Tree Index — Root → Internal → Leaf nodes. Tìm giá trị 45 chỉ cần 3 bước: Root → Internal (40|50) → Leaf

Mỗi node trong B-Tree chứa nhiều key và pointer. Khi tìm kiếm giá trị 45:

Root node: 45 nằm giữa 30 và 60 → đi vào nhánh giữa
Internal node: 45 nằm giữa 40 và 50 → đi vào leaf tương ứng
Leaf node: Tìm thấy 45, trả về pointer đến row trong bảng

Tại sao B-Tree hiệu quả?

Với branching factor 100 (mỗi node chứa tối đa 100 key), một B-Tree 4 tầng có thể index tới 100 triệu bản ghi. Mỗi page read trong SSD mất khoảng 0.1ms, nên tìm kiếm 100 triệu bản ghi chỉ mất ~0.4ms thay vì quét tuần tự hàng phút.

2. Clustered vs Non-Clustered Index

Đây là khái niệm quan trọng nhất cần nắm, đặc biệt trong SQL Server. Hai loại index này hoạt động hoàn toàn khác nhau về mặt vật lý.

Đặc điểm	Clustered Index	Non-Clustered Index
Dữ liệu vật lý	Sắp xếp lại row trong bảng theo key	Tạo cấu trúc riêng, chứa pointer về row
Số lượng/bảng	Tối đa 1	Tối đa 999 (SQL Server) / không giới hạn (PostgreSQL)
Leaf node chứa	Toàn bộ data row	Index key + RID hoặc Clustered Key
Tốc độ range scan	Rất nhanh (data liền kề vật lý)	Chậm hơn (cần random I/O về bảng)
Chi phí INSERT	Cao hơn (phải giữ thứ tự vật lý)	Thấp hơn
Mặc định	Primary Key (SQL Server)	Phải tạo thủ công

graph LR
    subgraph CI["Clustered Index"]
        direction TB
        CR["Root: 50"]
        CL1["Leaf: Row 1-50
(chứa toàn bộ data)"]
        CL2["Leaf: Row 51-100
(chứa toàn bộ data)"]
        CR --> CL1
        CR --> CL2
    end

    subgraph NCI["Non-Clustered Index"]
        direction TB
        NR["Root: 'M'"]
        NL1["Leaf: Email A-M
+ Clustered Key"]
        NL2["Leaf: Email N-Z
+ Clustered Key"]
        NR --> NL1
        NR --> NL2
    end

    NL1 -.->|Key Lookup| CL1
    NL2 -.->|Key Lookup| CL2

    style CR fill:#e94560,stroke:#fff,color:#fff
    style NR fill:#2c3e50,stroke:#fff,color:#fff
    style CL1 fill:#f8f9fa,stroke:#e94560,color:#2c3e50
    style CL2 fill:#f8f9fa,stroke:#e94560,color:#2c3e50
    style NL1 fill:#f8f9fa,stroke:#2c3e50,color:#2c3e50
    style NL2 fill:#f8f9fa,stroke:#2c3e50,color:#2c3e50

Non-Clustered Index cần Key Lookup để lấy dữ liệu không có trong index → chi phí I/O bổ sung

Trong PostgreSQL, khái niệm hơi khác: tất cả table đều là heap (không sắp xếp vật lý), và mọi index đều là "non-clustered" theo nghĩa SQL Server. Tuy nhiên, PostgreSQL có lệnh CLUSTER để sắp xếp lại data một lần theo index — nhưng không tự duy trì thứ tự sau đó.

-- SQL Server: Clustered Index tự động với Primary Key
CREATE TABLE Users (
    Id INT IDENTITY(1,1) PRIMARY KEY, -- Clustered Index tự động
    Email NVARCHAR(255),
    FullName NVARCHAR(500),
    CreatedDate DATETIME2
);

-- Non-Clustered Index trên Email
CREATE NONCLUSTERED INDEX IX_Users_Email
ON Users (Email);

-- PostgreSQL: Tương đương
CREATE TABLE users (
    id SERIAL PRIMARY KEY,
    email VARCHAR(255),
    full_name VARCHAR(500),
    created_date TIMESTAMPTZ
);

CREATE INDEX idx_users_email ON users (email);

3. Composite Index và quy tắc Leftmost Prefix

Composite index (index trên nhiều cột) là vũ khí mạnh nhất trong toolkit của bạn, nhưng cũng là nơi dễ mắc sai lầm nhất. Quy tắc vàng: thứ tự cột trong composite index quyết định query nào được hưởng lợi.

-- Tạo composite index
CREATE INDEX IX_Orders_Status_Date_Customer
ON Orders (Status, OrderDate, CustomerId);

Với index trên (Status, OrderDate, CustomerId), các query sau sẽ sử dụng được index:

Query WHERE clause	Sử dụng Index?	Giải thích
`WHERE Status = 1`	✅ Index Seek	Dùng cột đầu tiên
`WHERE Status = 1 AND OrderDate > '2026-01-01'`	✅ Index Seek	Dùng 2 cột đầu (leftmost prefix)
`WHERE Status = 1 AND OrderDate > '2026-01-01' AND CustomerId = 5`	✅ Index Seek	Dùng đủ 3 cột
`WHERE OrderDate > '2026-01-01'`	❌ Index Scan hoặc Table Scan	Bỏ qua cột đầu → không dùng được
`WHERE CustomerId = 5`	❌ Không dùng được	Bỏ qua 2 cột đầu
`WHERE Status = 1 AND CustomerId = 5`	⚠️ Partial Seek	Seek trên Status, scan phần còn lại

Sai lầm phổ biến với Composite Index

Nhiều developer tạo composite index theo thứ tự cột trong bảng thay vì thứ tự sử dụng trong query. Quy tắc đúng: cột có selectivity cao nhất (lọc được nhiều row nhất) nên đứng trước, nhưng phải đảm bảo nó luôn xuất hiện trong WHERE clause. Nếu Status chỉ có 3 giá trị nhưng luôn được filter, nó vẫn nên đứng trước OrderDate (millions giá trị) nếu query luôn lọc theo Status trước.

3.1. Chiến lược sắp xếp cột trong Composite Index

Thứ tự ưu tiên khi thiết kế composite index:

Equality first: Cột dùng = đứng trước cột dùng >, <, BETWEEN
Range sau: Cột dùng range filter đứng sau equality
Sort cuối: Cột dùng trong ORDER BY đứng cuối cùng

-- Query pattern:
SELECT * FROM Orders
WHERE Status = 'shipped'        -- equality
  AND OrderDate >= '2026-01-01' -- range
ORDER BY CustomerId;            -- sort

-- Index tối ưu: equality → range → sort
CREATE INDEX IX_Orders_Optimal
ON Orders (Status, OrderDate, CustomerId);

4. Covering Index — Tránh Key Lookup

Đây là kỹ thuật tối ưu mạnh mẽ nhất mà nhiều developer bỏ qua. Khi query cần trả về cột không nằm trong index key, database phải thực hiện thêm bước Key Lookup (quay về bảng gốc lấy data). Mỗi Key Lookup là một random I/O — với query trả về 10.000 rows, nghĩa là 10.000 random reads bổ sung.

2.3s

→

12ms

Index thường + Key Lookup vs Covering Index trên bảng 5 triệu rows với query trả về 8.000 kết quả

Covering Index giải quyết vấn đề này bằng cách INCLUDE thêm cột cần thiết vào leaf node của index — không ảnh hưởng thứ tự sắp xếp, nhưng loại bỏ hoàn toàn Key Lookup.

-- SQL Server: INCLUDE clause
CREATE NONCLUSTERED INDEX IX_Orders_Status_Covering
ON Orders (Status, OrderDate)
INCLUDE (CustomerId, TotalAmount, ShippingAddress);

-- Giờ query này được cover hoàn toàn:
SELECT CustomerId, TotalAmount, ShippingAddress
FROM Orders
WHERE Status = 'shipped' AND OrderDate >= '2026-01-01';
-- → Index Seek, KHÔNG Key Lookup

-- PostgreSQL: Tương đương (từ v11+)
CREATE INDEX idx_orders_status_covering
ON orders (status, order_date)
INCLUDE (customer_id, total_amount, shipping_address);

graph LR
    subgraph Without["Không có Covering Index"]
        Q1["Query"] --> IS1["Index Seek
tìm key"]
        IS1 --> KL["Key Lookup
10,000 random I/O"]
        KL --> R1["Result"]
    end

    subgraph With["Có Covering Index"]
        Q2["Query"] --> IS2["Index Seek
tìm key + INCLUDE data"]
        IS2 --> R2["Result"]
    end

    style Q1 fill:#f8f9fa,stroke:#e94560,color:#2c3e50
    style Q2 fill:#f8f9fa,stroke:#4CAF50,color:#2c3e50
    style KL fill:#e94560,stroke:#fff,color:#fff
    style IS1 fill:#2c3e50,stroke:#fff,color:#fff
    style IS2 fill:#4CAF50,stroke:#fff,color:#fff
    style R1 fill:#f8f9fa,stroke:#e0e0e0,color:#2c3e50
    style R2 fill:#f8f9fa,stroke:#e0e0e0,color:#2c3e50

Covering Index loại bỏ hoàn toàn bước Key Lookup — giảm I/O đáng kể cho query trả về nhiều rows

Khi nào nên dùng Covering Index?

Kiểm tra Execution Plan: nếu bạn thấy Key Lookup chiếm >50% cost trong plan, đó là dấu hiệu rõ ràng cần Covering Index. Tuy nhiên, không INCLUDE quá nhiều cột vì index sẽ phình to, tốn dung lượng và chậm hơn khi INSERT/UPDATE.

5. Filtered Index — Index có điều kiện

Filtered Index (SQL Server) hay Partial Index (PostgreSQL) cho phép bạn chỉ index một subset của data. Đặc biệt hữu ích khi bạn thường xuyên query trên một phần nhỏ dữ liệu.

-- SQL Server: Filtered Index
-- Chỉ index orders chưa xử lý (5% tổng data)
CREATE NONCLUSTERED INDEX IX_Orders_Pending
ON Orders (OrderDate, CustomerId)
WHERE Status = 'pending';

-- Query sử dụng filtered index:
SELECT * FROM Orders
WHERE Status = 'pending' AND OrderDate > '2026-04-01';

-- PostgreSQL: Partial Index
CREATE INDEX idx_orders_pending
ON orders (order_date, customer_id)
WHERE status = 'pending';

95% Giảm dung lượng index so với full index

20x Nhanh hơn cho query trên subset data

~0 Ảnh hưởng INSERT cho data ngoài filter

6. Index đặc biệt trong PostgreSQL

PostgreSQL cung cấp nhiều loại index ngoài B-Tree, mỗi loại tối ưu cho use case khác nhau.

Loại Index	Use Case	Ví dụ
B-Tree	Comparison operators (=, <, >, BETWEEN)	Hầu hết các trường hợp
Hash	Chỉ equality (=), nhẹ hơn B-Tree	Lookup chính xác theo ID
GIN	Full-text search, JSONB, Array contains	`WHERE tags @> '{vue}'`
GiST	Geometric, range, full-text (ranking)	`WHERE location <@> point`
BRIN	Dữ liệu tự nhiên có thứ tự (time-series)	Log table sắp xếp theo timestamp
SP-GiST	Dữ liệu không cân bằng (IP, phone prefix)	Trie cho prefix search

-- GIN Index cho JSONB (PostgreSQL)
CREATE INDEX idx_products_attrs ON products
USING GIN (attributes jsonb_path_ops);

-- Query JSONB hiệu quả:
SELECT * FROM products
WHERE attributes @> '{"color": "red", "size": "L"}';

-- BRIN Index cho time-series data
-- Cực kỳ nhỏ (~0.1% kích thước B-Tree) nhưng hiệu quả
-- khi data insert theo thứ tự thời gian
CREATE INDEX idx_logs_created ON logs
USING BRIN (created_at) WITH (pages_per_range = 32);

-- Full-text Search với GIN
CREATE INDEX idx_articles_fts ON articles
USING GIN (to_tsvector('english', title || ' ' || body));

BRIN Index — Vũ khí bí mật cho bảng lớn

BRIN (Block Range Index) lưu min/max value cho mỗi block range (mặc định 128 pages). Với bảng log 500 triệu rows, BRIN index trên created_at chỉ tốn ~200KB so với B-Tree ~5GB. Hiệu quả nhất khi data insert tuần tự — đúng pattern của log/event tables.

7. Đọc hiểu Execution Plan

Execution Plan là "bản đồ" cho thấy database thực thi query như thế nào. Đọc được plan là kỹ năng bắt buộc để tối ưu index.

-- SQL Server: Xem Estimated Plan
SET SHOWPLAN_XML ON;
GO
SELECT * FROM Orders WHERE Status = 'pending';
GO
SET SHOWPLAN_XML OFF;

-- Hoặc Actual Plan (chạy query thật)
SET STATISTICS XML ON;
SELECT * FROM Orders WHERE Status = 'pending';
SET STATISTICS XML OFF;

-- PostgreSQL: EXPLAIN ANALYZE
EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT)
SELECT * FROM orders WHERE status = 'pending';

7.1. Các operator quan trọng trong Execution Plan

graph TD
    TS["Table Scan / Seq Scan
Đọc toàn bộ bảng
❌ Chậm nhất"]
    IS["Index Scan
Đọc toàn bộ index
⚠️ Tốt hơn Table Scan"]
    IK["Index Seek / Index Only Scan
Nhảy đến vị trí chính xác
✅ Nhanh nhất"]
    KL["Key Lookup / Heap Fetch
Quay về bảng lấy cột thiếu
⚠️ Chi phí bổ sung"]

    TS -.->|"Thêm Index"| IS
    IS -.->|"WHERE selective"| IK
    IK -.->|"Cần cột ngoài index"| KL
    KL -.->|"INCLUDE cột"| IK

    style TS fill:#e94560,stroke:#fff,color:#fff
    style IS fill:#ff9800,stroke:#fff,color:#fff
    style IK fill:#4CAF50,stroke:#fff,color:#fff
    style KL fill:#ff9800,stroke:#fff,color:#fff

Hành trình tối ưu: Table Scan → Index Scan → Index Seek → Covering Index (loại bỏ Key Lookup)

-- PostgreSQL EXPLAIN output mẫu:
EXPLAIN (ANALYZE) SELECT email, full_name FROM users WHERE status = 'active';

-- KẾT QUẢ KHI KHÔNG CÓ INDEX:
Seq Scan on users  (cost=0.00..25432.00 rows=50000 width=64)
                   (actual time=0.021..245.123 rows=48532 loops=1)
  Filter: (status = 'active')
  Rows Removed by Filter: 951468
Planning Time: 0.085 ms
Execution Time: 267.456 ms      ← CHẬM

-- KẾT QUẢ SAU KHI TẠO COVERING INDEX:
-- CREATE INDEX idx_users_status ON users (status) INCLUDE (email, full_name);
Index Only Scan using idx_users_status on users
                   (cost=0.42..1523.45 rows=50000 width=64)
                   (actual time=0.031..8.234 rows=48532 loops=1)
  Index Cond: (status = 'active')
  Heap Fetches: 0               ← Không cần quay về bảng!
Planning Time: 0.112 ms
Execution Time: 12.567 ms       ← NHANH 21x

8. Index Anti-patterns — Những sai lầm phải tránh

8.1. Dùng hàm trên cột indexed

-- ❌ SAI: Index trên Email không được sử dụng
SELECT * FROM Users WHERE UPPER(Email) = 'ADMIN@EXAMPLE.COM';
SELECT * FROM Users WHERE YEAR(CreatedDate) = 2026;

-- ✅ ĐÚNG: Giữ cột indexed nguyên vẹn
SELECT * FROM Users WHERE Email = 'admin@example.com'; -- collation CI
SELECT * FROM Users
WHERE CreatedDate >= '2026-01-01' AND CreatedDate < '2027-01-01';

-- ✅ PostgreSQL: Expression Index nếu cần
CREATE INDEX idx_users_email_upper ON users (UPPER(email));

8.2. Over-indexing — Nhiều index quá

Mỗi index là một bản copy dữ liệu phải được cập nhật khi INSERT/UPDATE/DELETE. Bảng có 10 index nghĩa là mỗi INSERT phải ghi 11 lần (1 bảng + 10 index).

INSERT: 0.5ms

→

INSERT: 8ms

Chi phí INSERT trên bảng 0 index vs bảng 15 index — chậm 16x cho mỗi write operation

8.3. Index trên cột low-cardinality

-- ❌ Ít hiệu quả: Cột Gender chỉ có 2-3 giá trị
CREATE INDEX IX_Users_Gender ON Users (Gender);
-- Index trả về 50% rows → Optimizer chọn Table Scan thay vì Index

-- ✅ Tốt hơn: Composite index hoặc Filtered Index
CREATE INDEX IX_Users_Gender_Date ON Users (Gender, CreatedDate);
-- Hoặc nếu query phổ biến chỉ lọc 1 giá trị:
CREATE INDEX IX_Users_Active ON Users (CreatedDate) WHERE IsActive = 1;

8.4. Implicit conversion giết index

-- ❌ Cột UserId là VARCHAR nhưng truyền INT → implicit conversion
-- Database phải convert TỪNG ROW → Table Scan
SELECT * FROM Users WHERE UserId = 12345;

-- ✅ Truyền đúng kiểu
SELECT * FROM Users WHERE UserId = '12345';

8.5. LIKE với wildcard đầu

-- ❌ Index không dùng được: wildcard ở đầu
SELECT * FROM Products WHERE Name LIKE '%phone%';

-- ✅ Index dùng được: wildcard chỉ ở cuối
SELECT * FROM Products WHERE Name LIKE 'phone%';

-- ✅ Giải pháp cho full-text: dùng Full-Text Index
-- SQL Server:
CREATE FULLTEXT INDEX ON Products (Name) KEY INDEX PK_Products;
SELECT * FROM Products WHERE CONTAINS(Name, 'phone');

-- PostgreSQL:
CREATE INDEX idx_products_fts ON products USING GIN (to_tsvector('simple', name));
SELECT * FROM products WHERE to_tsvector('simple', name) @@ to_tsquery('phone');

9. Chiến lược Index cho Production

9.1. Quy trình 4 bước

Bước 1: Thu thập query patterns

Phân tích slow query log và workload thực tế. SQL Server có Query Store, PostgreSQL có pg_stat_statements. Tập trung vào top 20 query tốn resource nhất — chúng thường chiếm 80% tổng load.

-- SQL Server: Top queries by CPU
SELECT TOP 20
    qs.total_worker_time / qs.execution_count AS avg_cpu,
    qs.execution_count,
    SUBSTRING(qt.text, 1, 200) AS query_text
FROM sys.dm_exec_query_stats qs
CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) qt
ORDER BY avg_cpu DESC;

-- PostgreSQL: Top queries (cần enable pg_stat_statements)
SELECT query, calls, mean_exec_time, total_exec_time
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 20;

Bước 2: Phân tích Execution Plan

Với mỗi slow query, đọc Execution Plan để xác định: Table Scan? Key Lookup? Sort tốn kém? Từ đó quyết định cần tạo index mới hay mở rộng index có sẵn.

Bước 3: Thiết kế index theo nguyên tắc ESR

Equality → Sort → Range. Sắp xếp cột trong composite index theo thứ tự: cột equality trước, cột sort giữa, cột range cuối. Thêm INCLUDE cho cột cần SELECT nhưng không filter/sort.

Bước 4: Đo lường và tinh chỉnh

Tạo index trên staging, benchmark với production-like data, rồi mới deploy. Theo dõi index usage stats để phát hiện index không được dùng.

9.2. Phát hiện index thừa và thiếu

-- SQL Server: Index không bao giờ được dùng
SELECT
    OBJECT_NAME(i.object_id) AS table_name,
    i.name AS index_name,
    s.user_seeks, s.user_scans, s.user_lookups, s.user_updates
FROM sys.indexes i
LEFT JOIN sys.dm_db_index_usage_stats s
    ON i.object_id = s.object_id AND i.index_id = s.index_id
WHERE OBJECTPROPERTY(i.object_id, 'IsUserTable') = 1
    AND (s.user_seeks + s.user_scans + s.user_lookups) = 0
    AND s.user_updates > 0
ORDER BY s.user_updates DESC;

-- SQL Server: Missing index suggestions
SELECT
    d.statement AS table_name,
    d.equality_columns,
    d.inequality_columns,
    d.included_columns,
    s.avg_user_impact,
    s.user_seeks
FROM sys.dm_db_missing_index_details d
JOIN sys.dm_db_missing_index_groups g ON d.index_handle = g.index_handle
JOIN sys.dm_db_missing_index_group_stats s ON g.index_group_handle = s.group_handle
ORDER BY s.avg_user_impact * s.user_seeks DESC;

-- PostgreSQL: Index không được dùng
SELECT
    schemaname, tablename, indexname,
    idx_scan AS times_used,
    pg_size_pretty(pg_relation_size(indexrelid)) AS index_size
FROM pg_stat_user_indexes
WHERE idx_scan = 0
ORDER BY pg_relation_size(indexrelid) DESC;

9.3. Bảo trì Index định kỳ

Index bị fragmentation theo thời gian khi data INSERT/UPDATE/DELETE. Index fragmented nghĩa là leaf pages không liên tục trên disk, gây tăng I/O.

-- SQL Server: Kiểm tra fragmentation
SELECT
    OBJECT_NAME(ips.object_id) AS table_name,
    i.name AS index_name,
    ips.avg_fragmentation_in_percent,
    ips.page_count
FROM sys.dm_db_index_physical_stats(DB_ID(), NULL, NULL, NULL, 'LIMITED') ips
JOIN sys.indexes i ON ips.object_id = i.object_id AND ips.index_id = i.index_id
WHERE ips.avg_fragmentation_in_percent > 10
ORDER BY ips.avg_fragmentation_in_percent DESC;

-- Fragmentation 10-30%: REORGANIZE (online, nhẹ)
ALTER INDEX IX_Orders_Status ON Orders REORGANIZE;

-- Fragmentation > 30%: REBUILD (tốn hơn nhưng triệt để)
ALTER INDEX IX_Orders_Status ON Orders REBUILD WITH (ONLINE = ON);

-- PostgreSQL: REINDEX (tương đương REBUILD)
REINDEX INDEX CONCURRENTLY idx_orders_status;

-- PostgreSQL: VACUUM + ANALYZE (bảo trì routine)
VACUUM ANALYZE orders;

Lưu ý khi REBUILD Index trên Production

Trong SQL Server Enterprise, dùng ONLINE = ON để tránh lock bảng. Standard Edition không hỗ trợ online rebuild — phải lên lịch maintenance window. Trong PostgreSQL, luôn dùng CONCURRENTLY để tránh exclusive lock.

10. Một số pattern index thường gặp

10.1. Pagination hiệu quả

-- ❌ OFFSET lớn = chậm (phải đọc rồi bỏ N rows đầu)
SELECT * FROM Products ORDER BY CreatedDate DESC
OFFSET 100000 ROWS FETCH NEXT 20 ROWS ONLY;

-- ✅ Keyset Pagination (cursor-based) + Index
-- Index: CREATE INDEX IX_Products_Date ON Products (CreatedDate DESC, Id DESC);
SELECT TOP 20 * FROM Products
WHERE (CreatedDate, Id) < (@lastDate, @lastId)
ORDER BY CreatedDate DESC, Id DESC;

-- PostgreSQL tương đương:
SELECT * FROM products
WHERE (created_date, id) < ($1, $2)
ORDER BY created_date DESC, id DESC
LIMIT 20;

10.2. Soft Delete pattern

-- Hầu hết query chỉ cần active records
-- Filtered Index rất hiệu quả ở đây:
CREATE INDEX IX_Users_Active
ON Users (Email, CreatedDate)
WHERE IsDeleted = 0;

-- 95% queries chỉ truy cập data chưa xóa
-- Index chỉ chiếm 5% kích thước so với full index

10.3. Multi-tenant SaaS

-- TenantId luôn là cột đầu trong composite index
CREATE INDEX IX_Orders_Tenant
ON Orders (TenantId, Status, OrderDate)
INCLUDE (TotalAmount);

-- Mỗi tenant chỉ scan subset nhỏ của bảng
SELECT Status, COUNT(*), SUM(TotalAmount)
FROM Orders
WHERE TenantId = @tenantId AND OrderDate >= '2026-01-01'
GROUP BY Status;

11. Tổng kết

Database Indexing không chỉ là CREATE INDEX rồi xong. Đó là quá trình liên tục: phân tích workload → thiết kế index → đo lường → tinh chỉnh. Những điểm cần nhớ:

ESR Equality → Sort → Range: thứ tự cột trong composite index

INCLUDE Loại bỏ Key Lookup bằng Covering Index

30% Fragmentation threshold để REBUILD

DMV Dynamic Management Views: công cụ phát hiện index thừa/thiếu

Checklist trước khi deploy index mới

1. Đã test trên staging với production-like data volume? 2. Index có cover đủ cột SELECT/WHERE/ORDER BY không? 3. Composite index có đúng thứ tự ESR không? 4. Có index trùng lặp hoặc subset của index mới không? 5. Đã tính toán ảnh hưởng đến write performance chưa?

Index là trade-off giữa read performance và write cost + storage. Không có index hoàn hảo cho mọi trường hợp — chỉ có index phù hợp với workload cụ thể của bạn. Hãy bắt đầu từ slow query log, đọc Execution Plan, và áp dụng nguyên tắc ESR. Kết quả sẽ thấy ngay.

Tài liệu tham khảo

#Database Indexing #SQL Server #PostgreSQL #Query Optimization #Performance Tuning #system design

# Database Indexing — Nghệ thuật tối ưu query cho hệ thống Production

Bạn đã bao giờ thấy một query đơn giản chạy mất 30 giây trên production, rồi sau khi thêm đúng một dòng `CREATE INDEX` thì chỉ còn 5ms? Đó không phải phép màu — đó là sức mạnh của database indexing. Tuy nhiên, index không phải "thêm càng nhiều càng tốt". Hiểu sai về index có thể khiến hệ thống chậm hơn, tốn dung lượng vô ích, và biến mỗi lần INSERT thành cơn ác mộng.

1000x Cải thiện tốc độ query khi index đúng cách

80% Vấn đề performance DB liên quan đến index

10-30% Dung lượng bổ sung cho mỗi index

O(log n) Độ phức tạp tìm kiếm B-Tree Index

## 1. Index hoạt động như thế nào?

Trong database, index là một cấu trúc dữ liệu riêng biệt được duy trì song song với bảng gốc. Khi bạn tạo index trên cột `Email`, database sẽ tạo một "bản sao thu gọn" chứa giá trị Email đã sắp xếp kèm pointer trỏ về row gốc trong bảng.

### 1.1. Cấu trúc B-Tree — Nền tảng của hầu hết index

```
graph TD
    Root["Root Node  
Pointer: 30 | 60"]
    I1["Internal  
10 | 20"]
    I2["Internal  
40 | 50"]
    I3["Internal  
70 | 80 | 90"]
    L1["Leaf  
1,3,5,8,10"]
    L2["Leaf  
12,15,18,20"]
    L3["Leaf  
22,25,28,30"]
    L4["Leaf  
32,35,38,40"]
    L5["Leaf  
42,45,48,50"]
    L6["Leaf  
55,58,60"]

Root --> I1
    Root --> I2
    Root --> I3
    I1 --> L1
    I1 --> L2
    I1 --> L3
    I2 --> L4
    I2 --> L5
    I2 --> L6

style Root fill:#e94560,stroke:#fff,color:#fff
    style I1 fill:#2c3e50,stroke:#fff,color:#fff
    style I2 fill:#2c3e50,stroke:#fff,color:#fff
    style I3 fill:#2c3e50,stroke:#fff,color:#fff
    style L1 fill:#f8f9fa,stroke:#e94560,color:#2c3e50
    style L2 fill:#f8f9fa,stroke:#e94560,color:#2c3e50
    style L3 fill:#f8f9fa,stroke:#e94560,color:#2c3e50
    style L4 fill:#f8f9fa,stroke:#e94560,color:#2c3e50
    style L5 fill:#f8f9fa,stroke:#e94560,color:#2c3e50
    style L6 fill:#f8f9fa,stroke:#e94560,color:#2c3e50

```
Cấu trúc B-Tree Index — Root → Internal → Leaf nodes. Tìm giá trị 45 chỉ cần 3 bước: Root → Internal (40|50) → Leaf

Mỗi node trong B-Tree chứa nhiều key và pointer. Khi tìm kiếm giá trị 45:

1. **Root node**: 45 nằm giữa 30 và 60 → đi vào nhánh giữa
2. **Internal node**: 45 nằm giữa 40 và 50 → đi vào leaf tương ứng
3. **Leaf node**: Tìm thấy 45, trả về pointer đến row trong bảng

#### Tại sao B-Tree hiệu quả?

## 2. Clustered vs Non-Clustered Index

Đây là khái niệm quan trọng nhất cần nắm, đặc biệt trong SQL Server. Hai loại index này hoạt động hoàn toàn khác nhau về mặt vật lý.

| Đặc điểm | Clustered Index | Non-Clustered Index |
| --- | --- | --- |
| **Dữ liệu vật lý** | Sắp xếp lại row trong bảng theo key | Tạo cấu trúc riêng, chứa pointer về row |
| **Số lượng/bảng** | Tối đa 1 | Tối đa 999 (SQL Server) / không giới hạn (PostgreSQL) |
| **Leaf node chứa** | Toàn bộ data row | Index key + RID hoặc Clustered Key |
| **Tốc độ range scan** | Rất nhanh (data liền kề vật lý) | Chậm hơn (cần random I/O về bảng) |
| **Chi phí INSERT** | Cao hơn (phải giữ thứ tự vật lý) | Thấp hơn |
| **Mặc định** | Primary Key (SQL Server) | Phải tạo thủ công |

```
graph LR
    subgraph CI["Clustered Index"]
        direction TB
        CR["Root: 50"]
        CL1["Leaf: Row 1-50  
(chứa toàn bộ data)"]
        CL2["Leaf: Row 51-100  
(chứa toàn bộ data)"]
        CR --> CL1
        CR --> CL2
    end

subgraph NCI["Non-Clustered Index"]
        direction TB
        NR["Root: 'M'"]
        NL1["Leaf: Email A-M  
+ Clustered Key"]
        NL2["Leaf: Email N-Z  
+ Clustered Key"]
        NR --> NL1
        NR --> NL2
    end

NL1 -.->|Key Lookup| CL1
    NL2 -.->|Key Lookup| CL2

style CR fill:#e94560,stroke:#fff,color:#fff
    style NR fill:#2c3e50,stroke:#fff,color:#fff
    style CL1 fill:#f8f9fa,stroke:#e94560,color:#2c3e50
    style CL2 fill:#f8f9fa,stroke:#e94560,color:#2c3e50
    style NL1 fill:#f8f9fa,stroke:#2c3e50,color:#2c3e50
    style NL2 fill:#f8f9fa,stroke:#2c3e50,color:#2c3e50

```
Non-Clustered Index cần Key Lookup để lấy dữ liệu không có trong index → chi phí I/O bổ sung

Trong PostgreSQL, khái niệm hơi khác: tất cả table đều là heap (không sắp xếp vật lý), và mọi index đều là "non-clustered" theo nghĩa SQL Server. Tuy nhiên, PostgreSQL có lệnh `CLUSTER` để sắp xếp lại data một lần theo index — nhưng không tự duy trì thứ tự sau đó.

```sql
-- SQL Server: Clustered Index tự động với Primary Key
CREATE TABLE Users (
    Id INT IDENTITY(1,1) PRIMARY KEY, -- Clustered Index tự động
    Email NVARCHAR(255),
    FullName NVARCHAR(500),
    CreatedDate DATETIME2
);

-- Non-Clustered Index trên Email
CREATE NONCLUSTERED INDEX IX_Users_Email
ON Users (Email);

-- PostgreSQL: Tương đương
CREATE TABLE users (
    id SERIAL PRIMARY KEY,
    email VARCHAR(255),
    full_name VARCHAR(500),
    created_date TIMESTAMPTZ
);

CREATE INDEX idx_users_email ON users (email);
```

## 3. Composite Index và quy tắc Leftmost Prefix

Composite index (index trên nhiều cột) là vũ khí mạnh nhất trong toolkit của bạn, nhưng cũng là nơi dễ mắc sai lầm nhất. Quy tắc vàng: **thứ tự cột trong composite index quyết định query nào được hưởng lợi**.

```sql
-- Tạo composite index
CREATE INDEX IX_Orders_Status_Date_Customer
ON Orders (Status, OrderDate, CustomerId);
```
Với index trên `(Status, OrderDate, CustomerId)`, các query sau sẽ sử dụng được index:

| Query WHERE clause | Sử dụng Index? | Giải thích |
| --- | --- | --- |
| `WHERE Status = 1` | ✅ Index Seek | Dùng cột đầu tiên |
| `WHERE Status = 1 AND OrderDate > '2026-01-01'` | ✅ Index Seek | Dùng 2 cột đầu (leftmost prefix) |
| `WHERE Status = 1 AND OrderDate > '2026-01-01' AND CustomerId = 5` | ✅ Index Seek | Dùng đủ 3 cột |
| `WHERE OrderDate > '2026-01-01'` | ❌ Index Scan hoặc Table Scan | Bỏ qua cột đầu → không dùng được |
| `WHERE CustomerId = 5` | ❌ Không dùng được | Bỏ qua 2 cột đầu |
| `WHERE Status = 1 AND CustomerId = 5` | ⚠️ Partial Seek | Seek trên Status, scan phần còn lại |

#### Sai lầm phổ biến với Composite Index

Nhiều developer tạo composite index theo thứ tự cột trong bảng thay vì thứ tự sử dụng trong query. Quy tắc đúng: **cột có selectivity cao nhất (lọc được nhiều row nhất) nên đứng trước**, nhưng phải đảm bảo nó luôn xuất hiện trong WHERE clause. Nếu `Status` chỉ có 3 giá trị nhưng luôn được filter, nó vẫn nên đứng trước `OrderDate` (millions giá trị) nếu query luôn lọc theo Status trước.

### 3.1. Chiến lược sắp xếp cột trong Composite Index

Thứ tự ưu tiên khi thiết kế composite index:

1. **Equality first**: Cột dùng `=` đứng trước cột dùng `>`, `<`, `BETWEEN`
2. **Range sau**: Cột dùng range filter đứng sau equality
3. **Sort cuối**: Cột dùng trong ORDER BY đứng cuối cùng

```sql
-- Query pattern:
SELECT * FROM Orders
WHERE Status = 'shipped'        -- equality
  AND OrderDate >= '2026-01-01' -- range
ORDER BY CustomerId;            -- sort

-- Index tối ưu: equality → range → sort
CREATE INDEX IX_Orders_Optimal
ON Orders (Status, OrderDate, CustomerId);
```

## 4. Covering Index — Tránh Key Lookup

Đây là kỹ thuật tối ưu mạnh mẽ nhất mà nhiều developer bỏ qua. Khi query cần trả về cột không nằm trong index key, database phải thực hiện thêm bước **Key Lookup** (quay về bảng gốc lấy data). Mỗi Key Lookup là một random I/O — với query trả về 10.000 rows, nghĩa là 10.000 random reads bổ sung.

2.3s

→

12ms

**Index thường + Key Lookup** vs **Covering Index** trên bảng 5 triệu rows với query trả về 8.000 kết quả

Covering Index giải quyết vấn đề này bằng cách `INCLUDE` thêm cột cần thiết vào leaf node của index — không ảnh hưởng thứ tự sắp xếp, nhưng loại bỏ hoàn toàn Key Lookup.

```sql
-- SQL Server: INCLUDE clause
CREATE NONCLUSTERED INDEX IX_Orders_Status_Covering
ON Orders (Status, OrderDate)
INCLUDE (CustomerId, TotalAmount, ShippingAddress);

-- Giờ query này được cover hoàn toàn:
SELECT CustomerId, TotalAmount, ShippingAddress
FROM Orders
WHERE Status = 'shipped' AND OrderDate >= '2026-01-01';
-- → Index Seek, KHÔNG Key Lookup

-- PostgreSQL: Tương đương (từ v11+)
CREATE INDEX idx_orders_status_covering
ON orders (status, order_date)
INCLUDE (customer_id, total_amount, shipping_address);
```

```
graph LR
    subgraph Without["Không có Covering Index"]
        Q1["Query"] --> IS1["Index Seek  
tìm key"]
        IS1 --> KL["Key Lookup  
10,000 random I/O"]
        KL --> R1["Result"]
    end

subgraph With["Có Covering Index"]
        Q2["Query"] --> IS2["Index Seek  
tìm key + INCLUDE data"]
        IS2 --> R2["Result"]
    end

style Q1 fill:#f8f9fa,stroke:#e94560,color:#2c3e50
    style Q2 fill:#f8f9fa,stroke:#4CAF50,color:#2c3e50
    style KL fill:#e94560,stroke:#fff,color:#fff
    style IS1 fill:#2c3e50,stroke:#fff,color:#fff
    style IS2 fill:#4CAF50,stroke:#fff,color:#fff
    style R1 fill:#f8f9fa,stroke:#e0e0e0,color:#2c3e50
    style R2 fill:#f8f9fa,stroke:#e0e0e0,color:#2c3e50

```
Covering Index loại bỏ hoàn toàn bước Key Lookup — giảm I/O đáng kể cho query trả về nhiều rows

#### Khi nào nên dùng Covering Index?

Kiểm tra Execution Plan: nếu bạn thấy **Key Lookup** chiếm >50% cost trong plan, đó là dấu hiệu rõ ràng cần Covering Index. Tuy nhiên, không INCLUDE quá nhiều cột vì index sẽ phình to, tốn dung lượng và chậm hơn khi INSERT/UPDATE.

## 5. Filtered Index — Index có điều kiện

```sql
-- SQL Server: Filtered Index
-- Chỉ index orders chưa xử lý (5% tổng data)
CREATE NONCLUSTERED INDEX IX_Orders_Pending
ON Orders (OrderDate, CustomerId)
WHERE Status = 'pending';

-- Query sử dụng filtered index:
SELECT * FROM Orders
WHERE Status = 'pending' AND OrderDate > '2026-04-01';

-- PostgreSQL: Partial Index
CREATE INDEX idx_orders_pending
ON orders (order_date, customer_id)
WHERE status = 'pending';
```

95% Giảm dung lượng index so với full index

20x Nhanh hơn cho query trên subset data

~0 Ảnh hưởng INSERT cho data ngoài filter

## 6. Index đặc biệt trong PostgreSQL

PostgreSQL cung cấp nhiều loại index ngoài B-Tree, mỗi loại tối ưu cho use case khác nhau.

| Loại Index | Use Case | Ví dụ |
| --- | --- | --- |
| **B-Tree** | Comparison operators (=, <, >, BETWEEN) | Hầu hết các trường hợp |
| **Hash** | Chỉ equality (=), nhẹ hơn B-Tree | Lookup chính xác theo ID |
| **GIN** | Full-text search, JSONB, Array contains | `WHERE tags @> '{vue}'` |
| **GiST** | Geometric, range, full-text (ranking) | `WHERE location <@> point` |
| **BRIN** | Dữ liệu tự nhiên có thứ tự (time-series) | Log table sắp xếp theo timestamp |
| **SP-GiST** | Dữ liệu không cân bằng (IP, phone prefix) | Trie cho prefix search |

```sql
-- GIN Index cho JSONB (PostgreSQL)
CREATE INDEX idx_products_attrs ON products
USING GIN (attributes jsonb_path_ops);

-- Query JSONB hiệu quả:
SELECT * FROM products
WHERE attributes @> '{"color": "red", "size": "L"}';

-- BRIN Index cho time-series data
-- Cực kỳ nhỏ (~0.1% kích thước B-Tree) nhưng hiệu quả
-- khi data insert theo thứ tự thời gian
CREATE INDEX idx_logs_created ON logs
USING BRIN (created_at) WITH (pages_per_range = 32);

-- Full-text Search với GIN
CREATE INDEX idx_articles_fts ON articles
USING GIN (to_tsvector('english', title || ' ' || body));
```

#### BRIN Index — Vũ khí bí mật cho bảng lớn

BRIN (Block Range Index) lưu min/max value cho mỗi block range (mặc định 128 pages). Với bảng log 500 triệu rows, BRIN index trên `created_at` chỉ tốn ~200KB so với B-Tree ~5GB. Hiệu quả nhất khi data insert tuần tự — đúng pattern của log/event tables.

## 7. Đọc hiểu Execution Plan

Execution Plan là "bản đồ" cho thấy database thực thi query như thế nào. Đọc được plan là kỹ năng bắt buộc để tối ưu index.

```sql
-- SQL Server: Xem Estimated Plan
SET SHOWPLAN_XML ON;
GO
SELECT * FROM Orders WHERE Status = 'pending';
GO
SET SHOWPLAN_XML OFF;

-- Hoặc Actual Plan (chạy query thật)
SET STATISTICS XML ON;
SELECT * FROM Orders WHERE Status = 'pending';
SET STATISTICS XML OFF;

-- PostgreSQL: EXPLAIN ANALYZE
EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT)
SELECT * FROM orders WHERE status = 'pending';
```

### 7.1. Các operator quan trọng trong Execution Plan

```
graph TD
    TS["Table Scan / Seq Scan  
Đọc toàn bộ bảng  
❌ Chậm nhất"]
    IS["Index Scan  
Đọc toàn bộ index  
⚠️ Tốt hơn Table Scan"]
    IK["Index Seek / Index Only Scan  
Nhảy đến vị trí chính xác  
✅ Nhanh nhất"]
    KL["Key Lookup / Heap Fetch  
Quay về bảng lấy cột thiếu  
⚠️ Chi phí bổ sung"]

style TS fill:#e94560,stroke:#fff,color:#fff
    style IS fill:#ff9800,stroke:#fff,color:#fff
    style IK fill:#4CAF50,stroke:#fff,color:#fff
    style KL fill:#ff9800,stroke:#fff,color:#fff

```
Hành trình tối ưu: Table Scan → Index Scan → Index Seek → Covering Index (loại bỏ Key Lookup)

```text
-- PostgreSQL EXPLAIN output mẫu:
EXPLAIN (ANALYZE) SELECT email, full_name FROM users WHERE status = 'active';

-- KẾT QUẢ KHI KHÔNG CÓ INDEX:
Seq Scan on users  (cost=0.00..25432.00 rows=50000 width=64)
                   (actual time=0.021..245.123 rows=48532 loops=1)
  Filter: (status = 'active')
  Rows Removed by Filter: 951468
Planning Time: 0.085 ms
Execution Time: 267.456 ms      ← CHẬM

-- KẾT QUẢ SAU KHI TẠO COVERING INDEX:
-- CREATE INDEX idx_users_status ON users (status) INCLUDE (email, full_name);
Index Only Scan using idx_users_status on users
                   (cost=0.42..1523.45 rows=50000 width=64)
                   (actual time=0.031..8.234 rows=48532 loops=1)
  Index Cond: (status = 'active')
  Heap Fetches: 0               ← Không cần quay về bảng!
Planning Time: 0.112 ms
Execution Time: 12.567 ms       ← NHANH 21x
```

## 8. Index Anti-patterns — Những sai lầm phải tránh

### 8.1. Dùng hàm trên cột indexed

```sql
-- ❌ SAI: Index trên Email không được sử dụng
SELECT * FROM Users WHERE UPPER(Email) = 'ADMIN@EXAMPLE.COM';
SELECT * FROM Users WHERE YEAR(CreatedDate) = 2026;

-- ✅ ĐÚNG: Giữ cột indexed nguyên vẹn
SELECT * FROM Users WHERE Email = 'admin@example.com'; -- collation CI
SELECT * FROM Users
WHERE CreatedDate >= '2026-01-01' AND CreatedDate < '2027-01-01';

-- ✅ PostgreSQL: Expression Index nếu cần
CREATE INDEX idx_users_email_upper ON users (UPPER(email));
```

### 8.2. Over-indexing — Nhiều index quá

Mỗi index là một bản copy dữ liệu phải được cập nhật khi INSERT/UPDATE/DELETE. Bảng có 10 index nghĩa là mỗi INSERT phải ghi 11 lần (1 bảng + 10 index).

INSERT: 0.5ms

→

INSERT: 8ms

Chi phí INSERT trên bảng **0 index** vs bảng **15 index** — chậm 16x cho mỗi write operation

### 8.3. Index trên cột low-cardinality

```sql
-- ❌ Ít hiệu quả: Cột Gender chỉ có 2-3 giá trị
CREATE INDEX IX_Users_Gender ON Users (Gender);
-- Index trả về 50% rows → Optimizer chọn Table Scan thay vì Index

-- ✅ Tốt hơn: Composite index hoặc Filtered Index
CREATE INDEX IX_Users_Gender_Date ON Users (Gender, CreatedDate);
-- Hoặc nếu query phổ biến chỉ lọc 1 giá trị:
CREATE INDEX IX_Users_Active ON Users (CreatedDate) WHERE IsActive = 1;
```

### 8.4. Implicit conversion giết index

```sql
-- ❌ Cột UserId là VARCHAR nhưng truyền INT → implicit conversion
-- Database phải convert TỪNG ROW → Table Scan
SELECT * FROM Users WHERE UserId = 12345;

-- ✅ Truyền đúng kiểu
SELECT * FROM Users WHERE UserId = '12345';
```

### 8.5. LIKE với wildcard đầu

```sql
-- ❌ Index không dùng được: wildcard ở đầu
SELECT * FROM Products WHERE Name LIKE '%phone%';

-- ✅ Index dùng được: wildcard chỉ ở cuối
SELECT * FROM Products WHERE Name LIKE 'phone%';

-- ✅ Giải pháp cho full-text: dùng Full-Text Index
-- SQL Server:
CREATE FULLTEXT INDEX ON Products (Name) KEY INDEX PK_Products;
SELECT * FROM Products WHERE CONTAINS(Name, 'phone');

-- PostgreSQL:
CREATE INDEX idx_products_fts ON products USING GIN (to_tsvector('simple', name));
SELECT * FROM products WHERE to_tsvector('simple', name) @@ to_tsquery('phone');
```

## 9. Chiến lược Index cho Production

### 9.1. Quy trình 4 bước

Bước 1: Thu thập query patterns

Phân tích slow query log và workload thực tế. SQL Server có **Query Store**, PostgreSQL có **pg_stat_statements**. Tập trung vào top 20 query tốn resource nhất — chúng thường chiếm 80% tổng load.

```sql
-- SQL Server: Top queries by CPU
SELECT TOP 20
    qs.total_worker_time / qs.execution_count AS avg_cpu,
    qs.execution_count,
    SUBSTRING(qt.text, 1, 200) AS query_text
FROM sys.dm_exec_query_stats qs
CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) qt
ORDER BY avg_cpu DESC;

-- PostgreSQL: Top queries (cần enable pg_stat_statements)
SELECT query, calls, mean_exec_time, total_exec_time
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 20;
```

Bước 2: Phân tích Execution Plan

Với mỗi slow query, đọc Execution Plan để xác định: Table Scan? Key Lookup? Sort tốn kém? Từ đó quyết định cần tạo index mới hay mở rộng index có sẵn.

Bước 3: Thiết kế index theo nguyên tắc ESR

**E**quality → **S**ort → **R**ange. Sắp xếp cột trong composite index theo thứ tự: cột equality trước, cột sort giữa, cột range cuối. Thêm INCLUDE cho cột cần SELECT nhưng không filter/sort.

Bước 4: Đo lường và tinh chỉnh

Tạo index trên staging, benchmark với production-like data, rồi mới deploy. Theo dõi index usage stats để phát hiện index không được dùng.

### 9.2. Phát hiện index thừa và thiếu

```sql
-- SQL Server: Index không bao giờ được dùng
SELECT
    OBJECT_NAME(i.object_id) AS table_name,
    i.name AS index_name,
    s.user_seeks, s.user_scans, s.user_lookups, s.user_updates
FROM sys.indexes i
LEFT JOIN sys.dm_db_index_usage_stats s
    ON i.object_id = s.object_id AND i.index_id = s.index_id
WHERE OBJECTPROPERTY(i.object_id, 'IsUserTable') = 1
    AND (s.user_seeks + s.user_scans + s.user_lookups) = 0
    AND s.user_updates > 0
ORDER BY s.user_updates DESC;

-- SQL Server: Missing index suggestions
SELECT
    d.statement AS table_name,
    d.equality_columns,
    d.inequality_columns,
    d.included_columns,
    s.avg_user_impact,
    s.user_seeks
FROM sys.dm_db_missing_index_details d
JOIN sys.dm_db_missing_index_groups g ON d.index_handle = g.index_handle
JOIN sys.dm_db_missing_index_group_stats s ON g.index_group_handle = s.group_handle
ORDER BY s.avg_user_impact * s.user_seeks DESC;

-- PostgreSQL: Index không được dùng
SELECT
    schemaname, tablename, indexname,
    idx_scan AS times_used,
    pg_size_pretty(pg_relation_size(indexrelid)) AS index_size
FROM pg_stat_user_indexes
WHERE idx_scan = 0
ORDER BY pg_relation_size(indexrelid) DESC;
```

### 9.3. Bảo trì Index định kỳ

Index bị **fragmentation** theo thời gian khi data INSERT/UPDATE/DELETE. Index fragmented nghĩa là leaf pages không liên tục trên disk, gây tăng I/O.

```sql
-- SQL Server: Kiểm tra fragmentation
SELECT
    OBJECT_NAME(ips.object_id) AS table_name,
    i.name AS index_name,
    ips.avg_fragmentation_in_percent,
    ips.page_count
FROM sys.dm_db_index_physical_stats(DB_ID(), NULL, NULL, NULL, 'LIMITED') ips
JOIN sys.indexes i ON ips.object_id = i.object_id AND ips.index_id = i.index_id
WHERE ips.avg_fragmentation_in_percent > 10
ORDER BY ips.avg_fragmentation_in_percent DESC;

-- Fragmentation 10-30%: REORGANIZE (online, nhẹ)
ALTER INDEX IX_Orders_Status ON Orders REORGANIZE;

-- Fragmentation > 30%: REBUILD (tốn hơn nhưng triệt để)
ALTER INDEX IX_Orders_Status ON Orders REBUILD WITH (ONLINE = ON);

-- PostgreSQL: REINDEX (tương đương REBUILD)
REINDEX INDEX CONCURRENTLY idx_orders_status;

-- PostgreSQL: VACUUM + ANALYZE (bảo trì routine)
VACUUM ANALYZE orders;
```

#### Lưu ý khi REBUILD Index trên Production

Trong SQL Server Enterprise, dùng `ONLINE = ON` để tránh lock bảng. Standard Edition không hỗ trợ online rebuild — phải lên lịch maintenance window. Trong PostgreSQL, luôn dùng `CONCURRENTLY` để tránh exclusive lock.

## 10. Một số pattern index thường gặp

### 10.1. Pagination hiệu quả

```sql
-- ❌ OFFSET lớn = chậm (phải đọc rồi bỏ N rows đầu)
SELECT * FROM Products ORDER BY CreatedDate DESC
OFFSET 100000 ROWS FETCH NEXT 20 ROWS ONLY;

-- ✅ Keyset Pagination (cursor-based) + Index
-- Index: CREATE INDEX IX_Products_Date ON Products (CreatedDate DESC, Id DESC);
SELECT TOP 20 * FROM Products
WHERE (CreatedDate, Id) < (@lastDate, @lastId)
ORDER BY CreatedDate DESC, Id DESC;

-- PostgreSQL tương đương:
SELECT * FROM products
WHERE (created_date, id) < ($1, $2)
ORDER BY created_date DESC, id DESC
LIMIT 20;
```

### 10.2. Soft Delete pattern

```sql
-- Hầu hết query chỉ cần active records
-- Filtered Index rất hiệu quả ở đây:
CREATE INDEX IX_Users_Active
ON Users (Email, CreatedDate)
WHERE IsDeleted = 0;

-- 95% queries chỉ truy cập data chưa xóa
-- Index chỉ chiếm 5% kích thước so với full index
```

### 10.3. Multi-tenant SaaS

```sql
-- TenantId luôn là cột đầu trong composite index
CREATE INDEX IX_Orders_Tenant
ON Orders (TenantId, Status, OrderDate)
INCLUDE (TotalAmount);

-- Mỗi tenant chỉ scan subset nhỏ của bảng
SELECT Status, COUNT(*), SUM(TotalAmount)
FROM Orders
WHERE TenantId = @tenantId AND OrderDate >= '2026-01-01'
GROUP BY Status;
```

## 11. Tổng kết

Database Indexing không chỉ là `CREATE INDEX` rồi xong. Đó là quá trình liên tục: phân tích workload → thiết kế index → đo lường → tinh chỉnh. Những điểm cần nhớ:

ESR Equality → Sort → Range: thứ tự cột trong composite index

INCLUDE Loại bỏ Key Lookup bằng Covering Index

30% Fragmentation threshold để REBUILD

DMV Dynamic Management Views: công cụ phát hiện index thừa/thiếu

#### Checklist trước khi deploy index mới

### Tài liệu tham khảo

- [Use The Index, Luke — A Guide to Database Performance](https://use-the-index-luke.com/)
- [Microsoft SQL Server Index Architecture and Design Guide](https://learn.microsoft.com/en-us/sql/relational-databases/indexes/indexes)
- [PostgreSQL Documentation — Indexes](https://www.postgresql.org/docs/current/indexes.html)
- [Hussein Nasser — Database Indexing Explained](https://www.youtube.com/watch?v=HubezKbFL7E)
- [pganalyze — Efficient Search in PostgreSQL](https://www.pganalyze.com/blog/efficient-search-in-postgres)

Meilisearch 2026 — Deep dive: Hybrid Search, Fragments API và tương lai AI Retrieval

Cloudflare Developer Platform — Xây Full-Stack App Miễn Phí trên Edge

Disclaimer: The opinions expressed in this blog are solely my own and do not reflect the views or opinions of my employer or any affiliated organizations. The content provided is for informational and educational purposes only and should not be taken as professional advice. While I strive to provide accurate and up-to-date information, I make no warranties or guarantees about the completeness, reliability, or accuracy of the content. Readers are encouraged to verify the information and seek independent advice as needed. I disclaim any liability for decisions or actions taken based on the content of this blog.