BM25 Full-Text Search

Fluree provides integrated full-text search using the BM25 (Best Matching 25) ranking algorithm. BM25 indexes are implemented as graph sources: they index text content from a source ledger and expose search results that can be joined with structured graph queries.

What is BM25?

BM25 is a probabilistic ranking function that scores documents by query term frequency, inverse document frequency, and document length normalization. It is widely used in search engines and information retrieval systems.

Key features:

  • Term frequency with saturation (controlled by k1)
  • Inverse document frequency weighting
  • Document length normalization (controlled by b)
  • English stemming and stopword filtering (default analyzer)
  • Block-Max WAND for efficient top-k queries (early termination)
  • Incremental index updates
  • Time-travel: query the index as of any past transaction
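
The score a single query term contributes can be sketched directly from the k1 and b parameters above. This is a generic BM25 formulation (using one common IDF variant), not Fluree's exact implementation:

```rust
// Simplified per-term BM25 score, showing how k1 and b shape the result.
// All inputs are hypothetical; the real index computes these statistics internally.
fn bm25_term_score(tf: f64, doc_len: f64, avg_doc_len: f64, idf: f64, k1: f64, b: f64) -> f64 {
    // Length normalization: longer-than-average documents are penalized when b > 0.
    let norm = 1.0 - b + b * (doc_len / avg_doc_len);
    // Term-frequency saturation: the score grows sublinearly in tf,
    // approaching idf * (k1 + 1) as tf grows.
    idf * (tf * (k1 + 1.0)) / (tf + k1 * norm)
}

fn idf(num_docs: f64, doc_freq: f64) -> f64 {
    // A common smoothed BM25 IDF variant: rare terms score higher.
    ((num_docs - doc_freq + 0.5) / (doc_freq + 0.5) + 1.0).ln()
}
```

With the defaults (k1 = 1.2, b = 0.75), a term appearing three times scores more than one appearing once, but far less than three times as much: that is the saturation effect.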

Creating a BM25 Index

BM25 indexes are created via the Rust API using Bm25CreateConfig. There are no HTTP endpoints for index management yet — indexes are managed programmatically.

Basic Index

use fluree_db_api::{Bm25CreateConfig, FlureeBuilder};
use serde_json::json;

let fluree = FlureeBuilder::file("/path/to/data").build()?;

// Create a ledger and insert some data
let ledger = fluree.create_ledger("docs:main").await?;
let tx = json!({
    "@context": { "ex": "http://example.org/" },
    "@graph": [
        { "@id": "ex:doc1", "@type": "ex:Article", "ex:title": "Rust programming guide" },
        { "@id": "ex:doc2", "@type": "ex:Article", "ex:title": "Python for beginners" },
        { "@id": "ex:doc3", "@type": "ex:Article", "ex:title": "Systems programming in Rust" }
    ]
});
let ledger = fluree.insert(ledger, &tx).await?.ledger;

// Define the indexing query
let query = json!({
    "@context": { "ex": "http://example.org/" },
    "where": [{ "@id": "?x", "@type": "ex:Article", "ex:title": "?title" }],
    "select": { "?x": ["@id", "ex:title"] }
});

// Create the BM25 index
let config = Bm25CreateConfig::new("article-search", "docs:main", query);
let result = fluree.create_full_text_index(config).await?;

println!("Indexed {} documents", result.doc_count);
println!("Graph source: {}", result.graph_source_id); // "article-search:main"

The graph source ID is {name}:{branch} — for example, article-search:main.

Indexing Query

The indexing query defines what to index. It's a standard Fluree JSON-LD query with these requirements:

  • Must include @id in the select (to identify documents)
  • Must use select with a map form: {"?x": ["@id", "ex:prop1", "ex:prop2"]}
  • All selected text properties are extracted and tokenized for search

The query can filter by type, filter by property values, or use any valid Fluree where clause:

{
    "@context": { "ex": "http://example.org/" },
    "where": [
        { "@id": "?x", "@type": "ex:Article", "ex:title": "?title" },
        { "@id": "?x", "ex:status": "published" }
    ],
    "select": { "?x": ["@id", "ex:title", "ex:content", "ex:tags"] }
}

Configuration Options

Parameter | Default    | Description
--------- | ---------- | -----------
name      | (required) | Graph source name. Cannot contain ":".
ledger    | (required) | Source ledger alias (e.g., "docs:main").
query     | (required) | Indexing query (JSON-LD; must have a select).
branch    | "main"     | Branch name for the graph source.
k1        | 1.2        | Term frequency saturation; higher gives more weight to term frequency. Must be > 0. Typical range: 1.2-2.0.
b         | 0.75       | Document length normalization; 0 = no normalization, 1 = full normalization. Must be between 0.0 and 1.0.

let config = Bm25CreateConfig::new("search", "docs:main", query)
    .with_branch("dev")
    .with_k1(1.5)
    .with_b(0.5);

Text Analysis

Fluree uses a default English analyzer that applies:

  1. Tokenization: Unicode-aware word boundary splitting
  2. Lowercasing: All tokens converted to lowercase
  3. Stopword filtering: Common English words removed (the, a, an, is, etc.)
  4. Stemming: Snowball English stemmer reduces words to root forms (e.g., "programming" -> "program")

The analyzer is not configurable — it always uses the English pipeline for consistency.
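
The four stages can be illustrated with a simplified sketch. The real analyzer uses Unicode word boundaries and the Snowball English stemmer; both are approximated here with basic rules:

```rust
// Illustrative analysis pipeline, not Fluree's actual implementation.
fn stem(t: String) -> String {
    // Crude suffix stripping standing in for the Snowball English stemmer.
    for suf in ["ming", "ing", "ed"] {
        if let Some(s) = t.strip_suffix(suf) {
            if s.len() >= 3 {
                return s.to_string();
            }
        }
    }
    t
}

fn analyze(text: &str) -> Vec<String> {
    const STOPWORDS: &[&str] = &["the", "a", "an", "is", "and", "of", "for", "in"];
    text.split(|c: char| !c.is_alphanumeric())        // 1. tokenize on word boundaries
        .filter(|t| !t.is_empty())
        .map(|t| t.to_lowercase())                    // 2. lowercase
        .filter(|t| !STOPWORDS.contains(&t.as_str())) // 3. drop stopwords
        .map(stem)                                    // 4. reduce to root form
        .collect()
}
```

For example, "The Rust programming guide" analyzes to the terms rust, program, and guide; the same pipeline is applied to both indexed text and query text, so they match.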

Querying BM25 Indexes

JSON-LD Query Syntax

BM25 search is integrated into Fluree's query system via the f: namespace predicates:

{
    "@context": {
        "ex": "http://example.org/",
        "f": "https://ns.flur.ee/db#"
    },
    "from": "docs:main",
    "where": [
        {
            "f:graphSource": "article-search:main",
            "f:searchText": "rust programming",
            "f:searchLimit": 10,
            "f:searchResult": {
                "f:resultId": "?doc",
                "f:resultScore": "?score"
            }
        },
        { "@id": "?doc", "ex:author": "?author" }
    ],
    "select": ["?doc", "?score", "?author"]
}

Pattern fields:

Field          | Description
-------------- | -----------
f:graphSource  | Graph source ID (e.g., "article-search:main")
f:searchText   | Query text (analyzed with the same pipeline as indexing)
f:searchLimit  | Maximum number of search results
f:searchResult | Binding object for results
f:resultId     | Variable binding for the document IRI
f:resultScore  | Variable binding for the BM25 relevance score
f:resultLedger | (Optional) Variable binding for ledger provenance

Combining Search with Structured Queries

The search pattern produces ?doc and ?score bindings. These can be joined with ledger data using normal where clauses:

{
    "@context": {
        "ex": "http://example.org/",
        "f": "https://ns.flur.ee/db#"
    },
    "from": "docs:main",
    "where": [
        {
            "f:graphSource": "article-search:main",
            "f:searchText": "rust",
            "f:searchLimit": 20,
            "f:searchResult": { "f:resultId": "?doc", "f:resultScore": "?score" }
        },
        { "@id": "?doc", "ex:title": "?title" },
        { "@id": "?doc", "ex:author": "?author" }
    ],
    "select": ["?doc", "?title", "?author", "?score"]
}

The BM25 search runs first and produces candidate bindings. The subsequent where clauses join those candidates with the source ledger to retrieve additional properties.
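
Conceptually, the join behaves like this sketch: candidates without a matching triple drop out, just as in a normal graph join. The document IRIs and titles here are hypothetical:

```rust
use std::collections::HashMap;

// Join BM25 candidates (doc IRI, score) with a property lookup,
// keeping only candidates that have a matching triple.
fn join_candidates<'a>(
    hits: &[(&'a str, f64)],
    titles: &HashMap<&str, &'a str>,
) -> Vec<(&'a str, &'a str, f64)> {
    hits.iter()
        .filter_map(|&(doc, score)| titles.get(doc).map(|&t| (doc, t, score)))
        .collect()
}
```

A candidate returned by the search that lacks, say, an ex:author triple simply produces no row, which is why adding where clauses can shrink the result set below f:searchLimit.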

Rust API: Direct Search

You can also use the Rust API directly for programmatic search without the query engine:

use fluree_db_query::bm25::{Analyzer, Bm25Scorer};

// Load the index
let index = fluree.load_bm25_index("article-search:main").await?;

// Analyze query terms (same pipeline as indexing)
let analyzer = Analyzer::english_default();
let terms = analyzer.analyze_to_strings("rust programming");
let term_refs: Vec<&str> = terms.iter().map(|s| s.as_str()).collect();

// Score and rank
let scorer = Bm25Scorer::new(&index, &term_refs);
let results = scorer.top_k(10);

for (doc_key, score) in &results {
    println!("{}: {:.2}", doc_key.subject_iri, score);
}

Rust API: Query with BM25

Use query_connection_with_bm25 for integrated queries:

let query = json!({
    "@context": { "ex": "http://example.org/", "f": "https://ns.flur.ee/db#" },
    "from": "docs:main",
    "where": [
        {
            "f:graphSource": "article-search:main",
            "f:searchText": "rust",
            "f:searchLimit": 10,
            "f:searchResult": { "f:resultId": "?doc", "f:resultScore": "?score" }
        },
        { "@id": "?doc", "ex:author": "?author" }
    ],
    "select": ["?doc", "?score", "?author"]
});

let result = fluree.query_connection_with_bm25(&query).await?;

Index Maintenance

Syncing

BM25 indexes are not automatically updated when the source ledger changes. You must explicitly sync them:

// Incremental sync (detects changes since last watermark)
let sync_result = fluree.sync_bm25_index("article-search:main").await?;
println!("Upserted: {}, Removed: {}", sync_result.upserted, sync_result.removed);

// Force full resync (rebuilds the entire index)
let sync_result = fluree.resync_bm25_index("article-search:main").await?;

Incremental sync uses property dependency tracking to identify which subjects changed since the last indexed commit. Only affected documents are re-queried and re-indexed. If no affected subjects are detected, it falls back to a full resync.
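
That decision can be sketched as follows; the SyncPlan type and plan_sync function are illustrative, not part of the API:

```rust
// Illustrative sync planning: re-index only the affected subjects when
// dependency tracking identifies them, otherwise fall back to a full resync.
#[derive(Debug)]
enum SyncPlan {
    Incremental(Vec<String>),
    FullResync,
}

fn plan_sync(affected_subjects: Vec<String>) -> SyncPlan {
    if affected_subjects.is_empty() {
        SyncPlan::FullResync
    } else {
        SyncPlan::Incremental(affected_subjects)
    }
}
```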

Background Maintenance Worker

For production use, the Bm25MaintenanceWorker can be configured to automatically sync indexes when source ledgers change:

  • Watches for commit events on source ledgers
  • Debounces rapid commits (configurable interval)
  • Runs sync operations with bounded concurrency
  • Registers/unregisters graph sources dynamically
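
The debouncing behavior can be sketched as a small state machine; this is an illustration of the idea, not the worker's actual implementation:

```rust
use std::time::{Duration, Instant};

// Debounce rapid commits: a sync is due only after the commit stream
// has been quiet for at least `interval`.
struct Debouncer {
    interval: Duration,
    last_commit: Option<Instant>,
}

impl Debouncer {
    fn record_commit(&mut self, now: Instant) {
        self.last_commit = Some(now);
    }

    fn sync_due(&self, now: Instant) -> bool {
        match self.last_commit {
            Some(t) => now.duration_since(t) >= self.interval,
            None => false,
        }
    }
}
```

Each new commit resets the clock, so a burst of transactions triggers one sync after the burst ends rather than one sync per commit.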

Staleness Checking

Check whether an index is behind its source ledger:

let check = fluree.check_bm25_staleness("article-search:main").await?;
println!("Index at t={}, ledger at t={}, stale: {}, lag: {}",
    check.index_t, check.ledger_t, check.is_stale, check.lag);

Time-Travel

Load an index at a specific historical transaction time:

// Load index as of transaction t=5
let (index, actual_t) = fluree.load_bm25_index_at("article-search:main", 5).await?;
println!("Loaded snapshot at t={}, docs: {}", actual_t, index.num_docs());

BM25 maintains a manifest of historical snapshots. The manifest is stored in content-addressed storage and tracks all snapshot versions. load_bm25_index_at selects the snapshot with the largest index_t <= as_of_t.
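
The selection rule can be sketched directly, with manifest entries modeled as hypothetical (index_t, content_id) pairs:

```rust
// Pick the snapshot with the largest index_t that does not exceed as_of_t;
// None means no snapshot existed that early.
fn select_snapshot<'a>(
    manifest: &'a [(u64, &'a str)],
    as_of_t: u64,
) -> Option<&'a (u64, &'a str)> {
    manifest
        .iter()
        .filter(|(index_t, _)| *index_t <= as_of_t)
        .max_by_key(|(index_t, _)| *index_t)
}
```

This is why load_bm25_index_at returns the actual_t alongside the index: the snapshot found may be older than the requested as_of_t.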

Dropping an Index

let drop_result = fluree.drop_full_text_index("article-search:main").await?;
println!("Deleted {} snapshots", drop_result.deleted_snapshots);

// Drop is idempotent
let drop_again = fluree.drop_full_text_index("article-search:main").await?;
assert!(drop_again.was_already_retracted);

Dropping marks the graph source as retracted in the nameservice and deletes all snapshot blobs from storage. The index can be recreated with the same name afterward.

Scoring and Top-K Optimization

For top-k queries (the typical case via f:searchLimit), BM25 uses Block-Max WAND (Weak AND) to avoid scoring every matching document. Posting lists are divided into fixed-size blocks (128 postings each) with per-block metadata (maximum term frequency). WAND uses these to compute score upper bounds, skipping entire blocks that cannot contribute to the current top-k results.

This makes top_k(10) on a 100K-document index significantly faster than scoring all matches — the algorithm terminates early once it can prove no remaining document can displace the current top results.

When block metadata is unavailable (e.g., during index building before the first snapshot), scoring falls back to dense accumulation over all postings.
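
The block-skipping idea can be sketched with a simplified per-block score upper bound. Real Block-Max WAND combines per-term bounds across posting lists; this illustration collapses that into one precomputed bound per block:

```rust
// Each block carries an upper bound on the score of any posting inside it.
struct Block {
    max_score_bound: f64,
    postings: Vec<(u64, f64)>, // (doc_id, score)
}

fn top_k_with_skipping(blocks: &[Block], k: usize) -> Vec<(u64, f64)> {
    let mut top: Vec<(u64, f64)> = Vec::new();
    for block in blocks {
        // Current k-th best score; until k results exist, nothing can be skipped.
        let threshold = if top.len() == k { top.last().unwrap().1 } else { f64::NEG_INFINITY };
        if block.max_score_bound <= threshold {
            continue; // the entire block provably cannot enter the top-k
        }
        for &(doc, score) in &block.postings {
            if top.len() < k || score > top.last().unwrap().1 {
                top.push((doc, score));
                top.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap());
                top.truncate(k);
            }
        }
    }
    top
}
```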

Storage Format

V4 Chunked Format

Large BM25 indexes use a chunked storage format (v4) that splits the index into:

  • Root blob: Terms dictionary, document metadata, BM25 statistics, routing table
  • Posting leaflet blobs: Compressed posting lists (~2MB each), stored as separate content-addressed objects. Each posting list includes block metadata (128 postings per block with max_doc_id and max_tf) used for WAND score upper bounds and block-level navigation.

This enables selective loading: queries only fetch the leaflets containing terms that match the search query, rather than loading the entire index.
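
Selective loading amounts to a routing-table lookup, sketched here with hypothetical term-to-leaflet assignments:

```rust
use std::collections::{HashMap, HashSet};

// Map each analyzed query term to the leaflet holding its posting list;
// only the resulting set of leaflets is fetched from storage.
fn leaflets_to_fetch(routing: &HashMap<&str, &str>, terms: &[&str]) -> HashSet<String> {
    terms
        .iter()
        .filter_map(|t| routing.get(t)) // terms absent from the index match nothing
        .map(|cid| cid.to_string())
        .collect()
}
```

A multi-term query whose terms happen to live in the same leaflet costs a single fetch, and the HashSet deduplicates that naturally.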

Leaflet Caching

Posting leaflets are cached in the global LeafletCache (shared with core index leaflets). Cache entries are keyed by content ID hash and are immutable (content-addressed data never changes). The cache uses moka's TinyLFU eviction and is governed by the global cache budget (--cache-max-mb / FLUREE_CACHE_MAX_MB, default: tiered fraction of RAM — 30% <4GB, 40% 4-8GB, 50% ≥8GB).
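
The tiered default budget can be expressed as a small function; this is an illustration of the rule above, not the actual configuration code:

```rust
// Default cache budget as a fraction of RAM: 30% below 4 GB,
// 40% from 4 GB up to (but not including) 8 GB, 50% at 8 GB or more.
fn default_cache_budget_mb(ram_mb: u64) -> u64 {
    let pct = if ram_mb < 4 * 1024 {
        30
    } else if ram_mb < 8 * 1024 {
        40
    } else {
        50
    };
    ram_mb * pct / 100
}
```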

Parallel I/O

Both reads and writes use bounded-concurrency parallel I/O (buffer_unordered(32)) for leaflet operations. This caps socket pressure when working with object stores like S3 while still providing significant throughput improvement over sequential access.

Format Selection

The storage format is selected automatically based on the storage backend:

  • File storage: V3 single-blob format (optimized for local filesystem)
  • Memory / S3 / object store: V4 chunked format (enables selective loading and caching)
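
The selection rule reduces to a match on the backend type; the Backend enum here is an illustrative sketch, not the API's actual type:

```rust
// Storage format chosen automatically from the backend.
enum Backend {
    File,
    Memory,
    S3,
}

fn storage_format(backend: &Backend) -> &'static str {
    match backend {
        Backend::File => "v3",                 // single blob, fast on local filesystems
        Backend::Memory | Backend::S3 => "v4", // chunked: selective loading + caching
    }
}
```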

Deployment Modes

Embedded Mode (Default)

In embedded mode, the BM25 index is loaded and searched within the same process as Fluree. This is the default behavior.

Remote Mode

In remote mode, search queries are delegated to a dedicated search service (fluree-search-httpd):

fluree-search-httpd \
  --storage-root file:///var/fluree/data \
  --nameservice-path file:///var/fluree/ns \
  --listen 0.0.0.0:9090

Both modes use identical analyzer configuration, BM25 scoring algorithm, and time-travel semantics — queries return identical results regardless of deployment mode.

See BM25 Graph Source for details on the remote search protocol.

Related Documentation