# BM25 Full-Text Search
Fluree provides integrated full-text search using the BM25 (Best Matching 25) ranking algorithm. BM25 indexes are implemented as graph sources: they index text content from a source ledger and expose search results that can be joined with structured graph queries.
## What is BM25?
BM25 is a probabilistic ranking function that scores documents based on query term frequency and document length normalization. It's widely used in search engines and information retrieval systems.
Key features:
- Term frequency with saturation (controlled by k1)
- Inverse document frequency weighting
- Document length normalization (controlled by b)
- English stemming and stopword filtering (default analyzer)
- Block-Max WAND for efficient top-k queries (early termination)
- Incremental index updates
- Time-travel: query the index as of any past transaction
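The ranking function itself is compact. The sketch below is a self-contained illustration of the standard BM25 formula with the `k1` and `b` parameters listed above; it is not Fluree's internal implementation.

```rust
// Standard BM25 score of one term in one document (illustrative only).
fn bm25_score(tf: f64, doc_len: f64, avg_doc_len: f64, df: f64, num_docs: f64, k1: f64, b: f64) -> f64 {
    // Inverse document frequency with the usual +0.5 smoothing
    let idf = (((num_docs - df + 0.5) / (df + 0.5)) + 1.0).ln();
    // Length normalization: b interpolates between none (0) and full (1)
    let norm = k1 * (1.0 - b + b * doc_len / avg_doc_len);
    // Term-frequency saturation: grows sublinearly in tf, controlled by k1
    idf * (tf * (k1 + 1.0)) / (tf + norm)
}

fn main() {
    // Term appears 3 times in a 100-token doc; corpus: 1000 docs, term in 50.
    let s = bm25_score(3.0, 100.0, 120.0, 50.0, 1000.0, 1.2, 0.75);
    println!("score = {:.4}", s);
    // Doubling tf shows saturation: the score grows, but sublinearly.
    let s2 = bm25_score(6.0, 100.0, 120.0, 50.0, 1000.0, 1.2, 0.75);
    assert!(s2 > s && s2 < 2.0 * s);
}
```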
## Creating a BM25 Index
BM25 indexes are created via the Rust API using `Bm25CreateConfig`. There are no HTTP endpoints for index management yet; indexes are managed programmatically.
### Basic Index

```rust
use fluree_db_api::{Bm25CreateConfig, FlureeBuilder};
use serde_json::json;

let fluree = FlureeBuilder::file("/path/to/data").build()?;

// Create a ledger and insert some data
let ledger = fluree.create_ledger("docs:main").await?;
let tx = json!({
    "@context": { "ex": "http://example.org/" },
    "@graph": [
        { "@id": "ex:doc1", "@type": "ex:Article", "ex:title": "Rust programming guide" },
        { "@id": "ex:doc2", "@type": "ex:Article", "ex:title": "Python for beginners" },
        { "@id": "ex:doc3", "@type": "ex:Article", "ex:title": "Systems programming in Rust" }
    ]
});
let ledger = fluree.insert(ledger, &tx).await?.ledger;

// Define the indexing query
let query = json!({
    "@context": { "ex": "http://example.org/" },
    "where": [{ "@id": "?x", "@type": "ex:Article", "ex:title": "?title" }],
    "select": { "?x": ["@id", "ex:title"] }
});

// Create the BM25 index
let config = Bm25CreateConfig::new("article-search", "docs:main", query);
let result = fluree.create_full_text_index(config).await?;
println!("Indexed {} documents", result.doc_count);
println!("Graph source: {}", result.graph_source_id); // "article-search:main"
```
The graph source ID has the form `{name}:{branch}`, for example `article-search:main`.
### Indexing Query
The indexing query defines what to index. It's a standard Fluree JSON-LD query with these requirements:
- Must include `@id` in the select (to identify documents)
- Must use `select` with a map form: `{"?x": ["@id", "ex:prop1", "ex:prop2"]}`
- All selected text properties are extracted and tokenized for search
The query can filter by type, filter by property values, or use any valid Fluree where clause:
```json
{
  "@context": { "ex": "http://example.org/" },
  "where": [
    { "@id": "?x", "@type": "ex:Article", "ex:title": "?title" },
    { "@id": "?x", "ex:status": "published" }
  ],
  "select": { "?x": ["@id", "ex:title", "ex:content", "ex:tags"] }
}
```
### Configuration Options
| Parameter | Default | Description |
|---|---|---|
| `name` | (required) | Graph source name. Cannot contain `:`. |
| `ledger` | (required) | Source ledger alias (e.g., `"docs:main"`) |
| `query` | (required) | Indexing query (JSON-LD, must have `select`) |
| `branch` | `"main"` | Branch name for the graph source |
| `k1` | 1.2 | Term frequency saturation; higher gives more weight to term frequency. Must be > 0. Typical range: 1.2-2.0. |
| `b` | 0.75 | Document length normalization: 0 = none, 1 = full. Must be in 0.0-1.0. |
```rust
let config = Bm25CreateConfig::new("search", "docs:main", query)
    .with_branch("dev")
    .with_k1(1.5)
    .with_b(0.5);
```
## Text Analysis
Fluree uses a default English analyzer that applies:
- Tokenization: Unicode-aware word boundary splitting
- Lowercasing: All tokens converted to lowercase
- Stopword filtering: Common English words removed (the, a, an, is, etc.)
- Stemming: Snowball English stemmer reduces words to root forms (e.g., "programming" -> "program")
The analyzer is not configurable — it always uses the English pipeline for consistency.
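To make the pipeline concrete, here is a toy sketch of the four stages. The tiny stopword list and the suffix-stripping "stemmer" are simplified stand-ins for Fluree's full English stopword set and the Snowball stemmer, so real output will differ for many words.

```rust
// Toy stemmer: strips a couple of common suffixes (stand-in for Snowball).
fn stem(token: &str) -> String {
    for suffix in ["ming", "es", "s"] {
        if let Some(root) = token.strip_suffix(suffix) {
            if root.len() >= 3 {
                return root.to_string();
            }
        }
    }
    token.to_string()
}

// The four stages: tokenize -> lowercase -> stopword filter -> stem.
fn analyze(text: &str) -> Vec<String> {
    const STOPWORDS: &[&str] = &["the", "a", "an", "is", "in", "for"];
    text.split(|c: char| !c.is_alphanumeric())        // Unicode-ish word boundary split
        .filter(|t| !t.is_empty())
        .map(str::to_lowercase)                        // lowercase every token
        .filter(|t| !STOPWORDS.contains(&t.as_str()))  // drop stopwords
        .map(|t| stem(&t))                             // reduce to root form
        .collect()
}

fn main() {
    let terms = analyze("The Rust Programming Guide");
    // "the" is a stopword; "programming" stems to "program"
    assert_eq!(terms, vec!["rust", "program", "guide"]);
}
```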
## Querying BM25 Indexes

### JSON-LD Query Syntax
BM25 search is integrated into Fluree's query system via the `f:` namespace predicates:
```json
{
  "@context": {
    "ex": "http://example.org/",
    "f": "https://ns.flur.ee/db#"
  },
  "from": "docs:main",
  "where": [
    {
      "f:graphSource": "article-search:main",
      "f:searchText": "rust programming",
      "f:searchLimit": 10,
      "f:searchResult": {
        "f:resultId": "?doc",
        "f:resultScore": "?score"
      }
    },
    { "@id": "?doc", "ex:author": "?author" }
  ],
  "select": ["?doc", "?score", "?author"]
}
```
Pattern fields:
| Field | Description |
|---|---|
| `f:graphSource` | Graph source ID (e.g., `"article-search:main"`) |
| `f:searchText` | Query text (analyzed with the same pipeline as indexing) |
| `f:searchLimit` | Maximum number of search results |
| `f:searchResult` | Binding object for results |
| `f:resultId` | Variable binding for the document IRI |
| `f:resultScore` | Variable binding for the BM25 relevance score |
| `f:resultLedger` | (Optional) Variable binding for ledger provenance |
### Combining Search with Structured Queries
The search pattern produces `?doc` and `?score` bindings, which can be joined with ledger data using normal where clauses:
```json
{
  "@context": {
    "ex": "http://example.org/",
    "f": "https://ns.flur.ee/db#"
  },
  "from": "docs:main",
  "where": [
    {
      "f:graphSource": "article-search:main",
      "f:searchText": "rust",
      "f:searchLimit": 20,
      "f:searchResult": { "f:resultId": "?doc", "f:resultScore": "?score" }
    },
    { "@id": "?doc", "ex:title": "?title" },
    { "@id": "?doc", "ex:author": "?author" }
  ],
  "select": ["?doc", "?title", "?author", "?score"]
}
```
The BM25 search runs first and produces candidate bindings. The subsequent where clauses join those candidates with the source ledger to retrieve additional properties.
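Conceptually, the engine behaves like the following sketch: the candidate `(doc, score)` pairs from BM25 are joined against ledger data, and any candidate without a matching triple drops out of the result set. The data structures here are illustrative only, not Fluree's internals.

```rust
use std::collections::HashMap;

// Join BM25 candidates with a doc -> author map (stand-in for ledger triples).
fn join(candidates: &[(&str, f64)], authors: &HashMap<&str, &str>) -> Vec<(String, f64, String)> {
    candidates.iter()
        // a candidate with no matching ledger triple drops out of the join
        .filter_map(|(doc, score)| {
            authors.get(doc).map(|a| (doc.to_string(), *score, a.to_string()))
        })
        .collect()
}

fn main() {
    let candidates = [("ex:doc1", 2.4), ("ex:doc3", 1.1)];
    let authors = HashMap::from([("ex:doc1", "Alice")]);
    let rows = join(&candidates, &authors);
    // ex:doc3 has no ex:author triple, so only ex:doc1 survives the join
    assert_eq!(rows.len(), 1);
    assert_eq!(rows[0].2, "Alice");
}
```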
### Rust API: Direct Search
You can also use the Rust API directly for programmatic search without the query engine:
```rust
use fluree_db_query::bm25::{Analyzer, Bm25Scorer};

// Load the index
let index = fluree.load_bm25_index("article-search:main").await?;

// Analyze query terms (same pipeline as indexing)
let analyzer = Analyzer::english_default();
let terms = analyzer.analyze_to_strings("rust programming");
let term_refs: Vec<&str> = terms.iter().map(|s| s.as_str()).collect();

// Score and rank
let scorer = Bm25Scorer::new(&index, &term_refs);
let results = scorer.top_k(10);
for (doc_key, score) in &results {
    println!("{}: {:.2}", doc_key.subject_iri, score);
}
```
### Rust API: Query with BM25

Use `query_connection_with_bm25` for integrated queries:
```rust
let query = json!({
    "@context": { "ex": "http://example.org/", "f": "https://ns.flur.ee/db#" },
    "from": "docs:main",
    "where": [
        {
            "f:graphSource": "article-search:main",
            "f:searchText": "rust",
            "f:searchLimit": 10,
            "f:searchResult": { "f:resultId": "?doc", "f:resultScore": "?score" }
        },
        { "@id": "?doc", "ex:author": "?author" }
    ],
    "select": ["?doc", "?score", "?author"]
});
let result = fluree.query_connection_with_bm25(&query).await?;
```
## Index Maintenance

### Syncing
BM25 indexes are not automatically updated when the source ledger changes. You must explicitly sync them:
```rust
// Incremental sync (detects changes since the last watermark)
let sync_result = fluree.sync_bm25_index("article-search:main").await?;
println!("Upserted: {}, Removed: {}", sync_result.upserted, sync_result.removed);

// Force a full resync (rebuilds the entire index)
let sync_result = fluree.resync_bm25_index("article-search:main").await?;
```
Incremental sync uses property dependency tracking to identify which subjects changed since the last indexed commit. Only affected documents are re-queried and re-indexed. If no affected subjects are detected, it falls back to a full resync.
### Background Maintenance Worker
For production use, the `Bm25MaintenanceWorker` can be configured to automatically sync indexes when source ledgers change:
- Watches for commit events on source ledgers
- Debounces rapid commits (configurable interval)
- Bounded concurrency for concurrent sync operations
- Registers/unregisters graph sources dynamically
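The debounce behavior can be illustrated with a tick-based sketch: a sync fires only once commits have been quiet for the configured interval. The `Debouncer` type and its tick-based clock are hypothetical stand-ins; the worker's actual API is not shown here.

```rust
// Hypothetical debounce decision: defer syncing while commits keep arriving.
struct Debouncer {
    interval: u64,           // quiet period (in ticks) required before a sync
    last_commit: Option<u64> // tick of the most recent un-synced commit
}

impl Debouncer {
    fn on_commit(&mut self, now: u64) {
        // each commit restarts the quiet window
        self.last_commit = Some(now);
    }

    fn should_sync(&mut self, now: u64) -> bool {
        match self.last_commit {
            // quiet window elapsed: sync once and clear the pending marker
            Some(t) if now >= t + self.interval => {
                self.last_commit = None;
                true
            }
            _ => false,
        }
    }
}

fn main() {
    let mut d = Debouncer { interval: 3, last_commit: None };
    d.on_commit(0);
    d.on_commit(1);             // rapid commits reset the timer
    assert!(!d.should_sync(2)); // still inside the quiet window
    assert!(d.should_sync(4));  // 3 ticks after the last commit: sync once
    assert!(!d.should_sync(5)); // nothing pending afterwards
}
```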
### Staleness Checking
Check whether an index is behind its source ledger:
```rust
let check = fluree.check_bm25_staleness("article-search:main").await?;
println!("Index at t={}, ledger at t={}, stale: {}, lag: {}",
    check.index_t, check.ledger_t, check.is_stale, check.lag);
```
### Time-Travel
Load an index at a specific historical transaction time:
```rust
// Load the index as of transaction t=5
let (index, actual_t) = fluree.load_bm25_index_at("article-search:main", 5).await?;
println!("Loaded snapshot at t={}, docs: {}", actual_t, index.num_docs());
```
BM25 maintains a manifest of historical snapshots. The manifest is stored in content-addressed storage and tracks all snapshot versions. `load_bm25_index_at` selects the snapshot with the largest `index_t <= as_of_t`.
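Snapshot selection reduces to a simple rule, sketched below with an illustrative `Snapshot` type (not Fluree's actual manifest structs): among snapshots at or before the requested time, take the latest.

```rust
// Illustrative stand-in for a manifest entry.
struct Snapshot {
    index_t: u64, // transaction time this snapshot was built at
}

// Pick the snapshot with the largest index_t <= as_of_t, if any exists.
fn select_snapshot(manifest: &[Snapshot], as_of_t: u64) -> Option<&Snapshot> {
    manifest.iter()
        .filter(|s| s.index_t <= as_of_t)
        .max_by_key(|s| s.index_t)
}

fn main() {
    let manifest = [Snapshot { index_t: 2 }, Snapshot { index_t: 5 }, Snapshot { index_t: 9 }];
    // as_of_t = 7 falls between snapshots at t=5 and t=9, so t=5 is chosen
    assert_eq!(select_snapshot(&manifest, 7).unwrap().index_t, 5);
    // asking for a time before the first snapshot yields nothing
    assert!(select_snapshot(&manifest, 1).is_none());
}
```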
### Dropping an Index
```rust
let drop_result = fluree.drop_full_text_index("article-search:main").await?;
println!("Deleted {} snapshots", drop_result.deleted_snapshots);

// Drop is idempotent
let drop_again = fluree.drop_full_text_index("article-search:main").await?;
assert!(drop_again.was_already_retracted);
```
Dropping marks the graph source as retracted in the nameservice and deletes all snapshot blobs from storage. The index can be recreated with the same name afterward.
## Scoring and Top-K Optimization
For top-k queries (the typical case via `f:searchLimit`), BM25 uses Block-Max WAND (Weak AND) to avoid scoring every matching document. Posting lists are divided into fixed-size blocks (128 postings each) with per-block metadata (maximum term frequency). WAND uses these to compute score upper bounds, skipping entire blocks that cannot contribute to the current top-k results.
This makes `top_k(10)` on a 100K-document index significantly faster than scoring all matches: the algorithm terminates early once it can prove no remaining document can displace the current top results.
When block metadata is unavailable (e.g., during index building before the first snapshot), scoring falls back to dense accumulation over all postings.
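The skip decision can be sketched for a single term: each block's maximum term frequency yields a score upper bound, and blocks whose bound cannot beat the current best are passed over without scoring any of their postings. Real Block-Max WAND coordinates bounds across multiple terms; this single-term, top-1 version with length normalization omitted only illustrates the core idea.

```rust
// A block of postings with its max term frequency (the "block-max" metadata).
struct Block {
    postings: Vec<(u64, u32)>, // (doc_id, term frequency)
    max_tf: u32,
}

// Simplified per-posting score: monotone in tf, length norm omitted.
fn score(tf: u32, k1: f64) -> f64 {
    let tf = tf as f64;
    tf * (k1 + 1.0) / (tf + k1)
}

// Top-1 search that skips blocks whose upper bound can't beat the current best.
fn top_1(blocks: &[Block], k1: f64) -> Option<(u64, f64)> {
    let mut best: Option<(u64, f64)> = None;
    for block in blocks {
        // Upper bound from block metadata: skip the whole block if it's too low
        let bound = score(block.max_tf, k1);
        if let Some((_, s)) = best {
            if bound <= s {
                continue; // early termination for this block
            }
        }
        for &(doc, tf) in &block.postings {
            let s = score(tf, k1);
            if best.map_or(true, |(_, b)| s > b) {
                best = Some((doc, s));
            }
        }
    }
    best
}

fn main() {
    let blocks = [
        Block { postings: vec![(1, 5), (2, 2)], max_tf: 5 },
        Block { postings: vec![(3, 1), (4, 1)], max_tf: 1 }, // skipped: bound too low
    ];
    let (doc, _) = top_1(&blocks, 1.2).unwrap();
    assert_eq!(doc, 1); // doc 1 has the highest tf, so the highest score
}
```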
## Storage Format

### V4 Chunked Format
Large BM25 indexes use a chunked storage format (v4) that splits the index into:
- Root blob: Terms dictionary, document metadata, BM25 statistics, routing table
- Posting leaflet blobs: Compressed posting lists (~2MB each), stored as separate content-addressed objects. Each posting list includes block metadata (128 postings per block, with `max_doc_id` and `max_tf`) used for WAND score upper bounds and block-level navigation.
This enables selective loading: queries only fetch the leaflets containing terms that match the search query, rather than loading the entire index.
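Selective loading boils down to a routing lookup: map each analyzed query term to the leaflet holding its posting list, then fetch only the distinct leaflets needed. The structures below are illustrative; the actual routing table format is not shown here.

```rust
use std::collections::{BTreeSet, HashMap};

// Map query terms to the set of leaflet blobs that must be fetched.
// Terms absent from the routing table match no documents and fetch nothing.
fn leaflets_to_fetch(routing: &HashMap<String, String>, terms: &[&str]) -> BTreeSet<String> {
    terms.iter()
        .filter_map(|t| routing.get(*t).cloned())
        .collect() // BTreeSet deduplicates leaflets shared by several terms
}

fn main() {
    let routing = HashMap::from([
        ("rust".to_string(), "leaflet-a".to_string()),
        ("program".to_string(), "leaflet-a".to_string()),
        ("python".to_string(), "leaflet-b".to_string()),
    ]);
    // Both query terms route to the same leaflet, so only one blob is fetched
    let needed = leaflets_to_fetch(&routing, &["rust", "program"]);
    assert_eq!(needed.len(), 1);
    assert!(needed.contains("leaflet-a"));
}
```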
### Leaflet Caching
Posting leaflets are cached in the global `LeafletCache` (shared with core index leaflets). Cache entries are keyed by content ID hash and are immutable, since content-addressed data never changes. The cache uses moka's TinyLFU eviction and is governed by the global cache budget (`--cache-max-mb` / `FLUREE_CACHE_MAX_MB`; the default is a tiered fraction of RAM: 30% below 4GB, 40% for 4-8GB, 50% at 8GB and above).
### Parallel I/O
Both reads and writes use bounded-concurrency parallel I/O (`buffer_unordered(32)`) for leaflet operations. This caps socket pressure when working with object stores like S3 while still providing a significant throughput improvement over sequential access.
### Format Selection
The storage format is selected automatically based on the storage backend:
- File storage: V3 single-blob format (optimized for local filesystem)
- Memory / S3 / object store: V4 chunked format (enables selective loading and caching)
## Deployment Modes

### Embedded Mode (Default)
In embedded mode, the BM25 index is loaded and searched within the same process as Fluree. This is the default behavior.
### Remote Mode
In remote mode, search queries are delegated to a dedicated search service (fluree-search-httpd):
```bash
fluree-search-httpd \
  --storage-root file:///var/fluree/data \
  --nameservice-path file:///var/fluree/ns \
  --listen 0.0.0.0:9090
```
Both modes use identical analyzer configuration, BM25 scoring algorithm, and time-travel semantics — queries return identical results regardless of deployment mode.
See BM25 Graph Source for details on the remote search protocol.
## Related Documentation
- BM25 Graph Source - Graph source integration and remote search protocol
- Background Indexing - Core index architecture
- Vector Search - Similarity search
- Graph Sources Overview - Graph source concepts