FlureeLabs

BM25 Graph Source

BM25 indexes in Fluree are implemented as graph sources, so full-text search can be combined with structured graph queries through the standard query interface.

Overview

A BM25 graph source:

  • Indexes text content from a source ledger using a configurable query
  • Provides relevance-ranked search results via BM25 scoring
  • Integrates with JSON-LD queries through f: namespace predicates
  • Supports time-travel (query the index at any historical point)
  • Maintains a manifest of snapshots for incremental sync

For index creation, configuration, and lifecycle management, see BM25 Full-Text Search.

Querying BM25 Graph Sources

JSON-LD Search Pattern

BM25 search uses the f: (Fluree) namespace predicates in where clauses:

{
    "@context": {
        "ex": "http://example.org/",
        "f": "https://ns.flur.ee/db#"
    },
    "from": "docs:main",
    "where": [
        {
            "f:graphSource": "article-search:main",
            "f:searchText": "rust programming",
            "f:searchLimit": 10,
            "f:searchResult": {
                "f:resultId": "?doc",
                "f:resultScore": "?score"
            }
        },
        { "@id": "?doc", "ex:title": "?title" }
    ],
    "select": ["?doc", "?title", "?score"]
}

Pattern Fields

| Field | Required | Description |
|---|---|---|
| f:graphSource | Yes | Graph source ID (e.g., "article-search:main") |
| f:searchText | Yes | Query text. Analyzed with the same tokenizer/stemmer as indexing. |
| f:searchLimit | Yes | Maximum number of search results to return |
| f:searchResult | Yes | Object with variable bindings for results |
| f:resultId | Yes | Variable for the matched document IRI (e.g., "?doc") |
| f:resultScore | No | Variable for the BM25 relevance score (e.g., "?score") |
| f:resultLedger | No | Variable for the source ledger alias (for multi-ledger provenance) |

How It Works

  1. The search pattern is parsed and turned into a Bm25SearchOperator
  2. The operator loads the BM25 index from storage (using the leaflet cache when available)
  3. Query text is analyzed (tokenized, lowercased, stopwords removed, stemmed)
  4. The top-k results are computed using Block-Max WAND, which skips posting list segments whose upper-bound scores cannot enter the result set, then returns the highest-scoring documents
  5. Results produce variable bindings (?doc, ?score) that flow into subsequent where clauses
  6. Subsequent patterns join against the source ledger to retrieve additional properties
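
Steps 3 to 5 can be illustrated with a small, self-contained Rust sketch. It is not Fluree's implementation. The stopword list is a hypothetical stand-in, stemming is omitted, and the names analyze and bm25_top_k are invented for illustration. Scoring here is exhaustive rather than Block-Max WAND, which is expected to produce the same top-k while skipping work. The IDF formula is the common Lucene-style variant, which is an assumption about the scorer.

```rust
use std::collections::{HashMap, HashSet};

/// Hypothetical, tiny stopword list; the real analyzer's list is not shown in this document.
const STOPWORDS: [&str; 6] = ["a", "an", "and", "of", "the", "to"];

/// Lowercase, split on non-alphanumeric characters, and drop stopwords.
/// Stemming is intentionally omitted to keep the sketch short.
fn analyze(text: &str) -> Vec<String> {
    text.to_lowercase()
        .split(|c: char| !c.is_alphanumeric())
        .filter(|tok| !tok.is_empty() && !STOPWORDS.contains(tok))
        .map(str::to_string)
        .collect()
}

/// Score every document against the query with BM25 and return the best `limit`
/// (id, score) pairs, highest score first. Documents matching no term are omitted.
fn bm25_top_k(docs: &[(&str, &str)], query: &str, limit: usize, k1: f64, b: f64) -> Vec<(String, f64)> {
    let analyzed: Vec<(&str, Vec<String>)> =
        docs.iter().map(|(id, text)| (*id, analyze(text))).collect();
    let n = analyzed.len() as f64;
    let avg_len = analyzed.iter().map(|(_, toks)| toks.len() as f64).sum::<f64>() / n.max(1.0);

    // Deduplicate query terms so a repeated word is only counted once.
    let terms: HashSet<String> = analyze(query).into_iter().collect();

    let mut scored: Vec<(String, f64)> = Vec::new();
    for (id, tokens) in &analyzed {
        // Term frequencies for this document.
        let mut tf: HashMap<&str, f64> = HashMap::new();
        for t in tokens {
            *tf.entry(t.as_str()).or_insert(0.0) += 1.0;
        }
        let mut score = 0.0;
        for term in &terms {
            let f = tf.get(term.as_str()).copied().unwrap_or(0.0);
            if f == 0.0 {
                continue;
            }
            // Document frequency: how many documents contain the term.
            let df = analyzed.iter().filter(|(_, toks)| toks.contains(term)).count() as f64;
            let idf = ((n - df + 0.5) / (df + 0.5) + 1.0).ln();
            let len_norm = 1.0 - b + b * tokens.len() as f64 / avg_len;
            score += idf * f * (k1 + 1.0) / (f + k1 * len_norm);
        }
        if score > 0.0 {
            scored.push((id.to_string(), score));
        }
    }
    scored.sort_by(|x, y| y.1.total_cmp(&x.1));
    scored.truncate(limit);
    scored
}
```

Calling bm25_top_k(&docs, "rust", 10, 1.2, 0.75) over the two seed documents used elsewhere on this page ("Rust guide", "Python intro") yields a single (id, score) pair for the Rust document, which corresponds to one (?doc, ?score) binding.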

Joining with Ledger Data

The primary use case is combining search results with structured graph data:

{
    "@context": {
        "ex": "http://example.org/",
        "f": "https://ns.flur.ee/db#"
    },
    "from": "docs:main",
    "where": [
        {
            "f:graphSource": "article-search:main",
            "f:searchText": "database design",
            "f:searchLimit": 20,
            "f:searchResult": { "f:resultId": "?doc", "f:resultScore": "?score" }
        },
        { "@id": "?doc", "ex:title": "?title" },
        { "@id": "?doc", "ex:author": "?author" },
        { "@id": "?doc", "ex:year": "?year" }
    ],
    "select": ["?doc", "?title", "?author", "?year", "?score"]
}

The BM25 search runs first, producing a set of (?doc, ?score) bindings. The remaining where clauses join those bindings against the source ledger to enrich results with structured data.
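
The join described above behaves like an inner join keyed on ?doc. The following Rust sketch models it with plain collections. The type alias and the function join_titles are hypothetical, and the lookup table stands in for the source ledger. It assumes, as with an ordinary required where pattern, that a search hit with no matching ex:title is dropped from the results.

```rust
use std::collections::HashMap;

/// One solution row produced by the search pattern: (?doc, ?score).
type SearchBinding = (String, f64);

/// Join search bindings against a `?doc -> ?title` lookup, preserving search order.
/// Rows whose `?doc` has no title are dropped.
fn join_titles(
    bindings: &[SearchBinding],
    titles: &HashMap<String, String>,
) -> Vec<(String, String, f64)> {
    bindings
        .iter()
        .filter_map(|(doc, score)| {
            titles
                .get(doc)
                .map(|title| (doc.clone(), title.clone(), *score))
        })
        .collect()
}
```

Because the search pattern bounds the number of bindings with f:searchLimit, the join only has to look up at most that many documents in the ledger.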

Rust API

Creating and Querying

use fluree_db_api::{Bm25CreateConfig, FlureeBuilder};
use serde_json::json;

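// The statements below use `.await?`, so they are assumed to run inside an
// async function that returns a Result.
let fluree = FlureeBuilder::memory().build_memory();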

// Seed ledger
let ledger0 = fluree.create_ledger("docs:main").await?;
let tx = json!({
    "@context": { "ex": "http://example.org/" },
    "@graph": [
        { "@id": "ex:doc1", "@type": "ex:Doc", "ex:title": "Rust guide", "ex:author": "Alice" },
        { "@id": "ex:doc2", "@type": "ex:Doc", "ex:title": "Python intro", "ex:author": "Bob" }
    ]
});
let ledger = fluree.insert(ledger0, &tx).await?.ledger;

// Create index
let query = json!({
    "@context": { "ex": "http://example.org/" },
    "where": [{ "@id": "?x", "@type": "ex:Doc", "ex:title": "?title" }],
    "select": { "?x": ["@id", "ex:title"] }
});
let config = Bm25CreateConfig::new("search", "docs:main", query);
let created = fluree.create_full_text_index(config).await?;

// Query with BM25 search + ledger join
let search_query = json!({
    "@context": { "ex": "http://example.org/", "f": "https://ns.flur.ee/db#" },
    "from": "docs:main",
    "where": [
        {
            "f:graphSource": &created.graph_source_id,
            "f:searchText": "rust",
            "f:searchLimit": 10,
            "f:searchResult": { "f:resultId": "?doc", "f:resultScore": "?score" }
        },
        { "@id": "?doc", "ex:author": "?author" }
    ],
    "select": ["?doc", "?score", "?author"]
});

let result = fluree.query_connection_with_bm25(&search_query).await?;

Using FlureeIndexProvider

FlureeIndexProvider implements the Bm25IndexProvider and Bm25SearchProvider traits, which the query engine uses to resolve graph sources:

use fluree_db_api::FlureeIndexProvider;
use fluree_db_query::bm25::{Bm25IndexProvider, Bm25Scorer, Analyzer};

let provider = FlureeIndexProvider::new(&fluree);

// Load index through the provider (with optional sync and time-travel)
let index = provider
    .bm25_index("search:main", Some(ledger.t()), false, None)
    .await?;

// Direct search
let analyzer = Analyzer::english_default();
let terms = analyzer.analyze_to_strings("rust");
let term_refs: Vec<&str> = terms.iter().map(|s| s.as_str()).collect();
let scorer = Bm25Scorer::new(&index, &term_refs);
let results = scorer.top_k(10);

Remote Search Service

For large indexes or multi-instance deployments, BM25 (and vector) search can be delegated to a standalone search service: the fluree-search-httpd binary.

Important: the search service is a separate process with its own listen port and its own HTTP API. It is not mounted under the main Fluree server's api_base_url (/v1/fluree/...). It needs read access to the same storage and nameservice paths the main server writes to, so the typical deployment is to share a storage volume.

Prerequisite: the index must already exist

fluree-search-httpd only serves queries against existing indexes; it does not create them. Today, BM25 and vector graph-source indexes are created via the Rust API (Bm25CreateConfig + create_full_text_index, or VectorCreateConfig + create_vector_index). HTTP endpoints for index creation are not yet available — see the note in API endpoints.

The recommended workflow is:

  1. Run the Fluree server (or use the Rust API directly) to create the BM25 / vector index on a shared storage path.
  2. Run fluree-search-httpd against the same --storage-root and --nameservice-path.
  3. Point clients (or the main Fluree server's SearchDeploymentConfig) at the search service's /v1/search endpoint.

Running the Search Service

fluree-search-httpd \
  --storage-root file:///var/fluree/data \
  --nameservice-path file:///var/fluree/ns \
  --listen 0.0.0.0:9090

Configuration options (CLI flag / env var):

| Flag | Env var | Default | Description |
|---|---|---|---|
| --storage-root | FLUREE_STORAGE_ROOT | (required) | Path to Fluree storage (where indexes are persisted). file:// prefix optional. |
| --nameservice-path | FLUREE_NAMESERVICE_PATH | (required) | Path to nameservice data. |
| --listen | FLUREE_SEARCH_LISTEN | 0.0.0.0:9090 | Address and port to bind. |
| --cache-max-entries | FLUREE_SEARCH_CACHE_MAX_ENTRIES | 100 | Maximum cached indexes. |
| --cache-ttl-secs | FLUREE_SEARCH_CACHE_TTL_SECS | 300 | Cache TTL in seconds. |
| --max-limit | FLUREE_SEARCH_MAX_LIMIT | 1000 | Maximum results per query. |
| --default-timeout-ms | FLUREE_SEARCH_DEFAULT_TIMEOUT_MS | 30000 | Default request timeout in milliseconds. |
| --max-timeout-ms | FLUREE_SEARCH_MAX_TIMEOUT_MS | 300000 | Maximum allowed request timeout in milliseconds. |

Vector search is feature-gated: build/run a binary that includes the vector feature to enable the vector backend. When enabled, GET /v1/capabilities reports "vector" in supported_query_kinds.

Docker Deployment

Run the search service in Docker against a shared volume that the main Fluree server also mounts:

docker run -d --name fluree-search \
  -p 9090:9090 \
  -v fluree-data:/var/lib/fluree \
  -e FLUREE_STORAGE_ROOT=/var/lib/fluree/storage \
  -e FLUREE_NAMESERVICE_PATH=/var/lib/fluree/ns \
  fluree/search-httpd:latest

For a full Compose example showing the main server + search service sharing a volume, see Running with Docker › Search service.

Search Protocol

The remote search service uses a JSON-based protocol on POST /v1/search. The request is the same shape regardless of backend; the query.kind discriminator selects BM25 vs. vector.

BM25 request:

{
  "protocol_version": "1.0",
  "graph_source_id": "article-search:main",
  "query": { "kind": "bm25", "text": "rust programming" },
  "limit": 20,
  "as_of_t": 150,
  "sync": false,
  "timeout_ms": 5000
}

Vector request (requires the vector feature):

{
  "protocol_version": "1.0",
  "graph_source_id": "doc-embeddings:main",
  "query": { "kind": "vector", "vector": [0.12, -0.34, ...], "metric": "cosine" },
  "limit": 10
}

A vector_similar_to variant takes a to_iri instead of an explicit vector — the server resolves the entity's embedding from the source ledger.

Response:

{
  "protocol_version": "1.0",
  "index_t": 150,
  "hits": [
    { "iri": "http://example.org/doc1", "ledger_id": "docs:main", "score": 8.75 },
    { "iri": "http://example.org/doc2", "ledger_id": "docs:main", "score": 7.32 }
  ],
  "took_ms": 12
}

Endpoints:

  • POST /v1/search — execute a search query (BM25 or vector)
  • GET /v1/capabilities — protocol version, supported query kinds, max limit/timeout
  • GET /v1/health — health check
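
As a rough illustration of the BM25 request shape, the Rust sketch below assembles the JSON body by hand using only the standard library. The helper names json_escape and bm25_request_json are hypothetical, and treating as_of_t as optional is an assumption based on the vector example omitting it. Real client code would more likely build the body with serde_json, which the Rust API examples on this page already use.

```rust
/// Escape a string for embedding in a JSON document
/// (quotes, backslashes, and control characters).
fn json_escape(s: &str) -> String {
    let mut out = String::with_capacity(s.len());
    for c in s.chars() {
        match c {
            '"' => out.push_str("\\\""),
            '\\' => out.push_str("\\\\"),
            '\n' => out.push_str("\\n"),
            '\r' => out.push_str("\\r"),
            '\t' => out.push_str("\\t"),
            c if (c as u32) < 0x20 => out.push_str(&format!("\\u{:04x}", c as u32)),
            c => out.push(c),
        }
    }
    out
}

/// Build the body of a BM25 search request for POST /v1/search.
/// `as_of_t` is only emitted when a historical point is requested.
fn bm25_request_json(graph_source_id: &str, text: &str, limit: u32, as_of_t: Option<u64>) -> String {
    let mut body = format!(
        "{{\"protocol_version\":\"1.0\",\"graph_source_id\":\"{}\",\"query\":{{\"kind\":\"bm25\",\"text\":\"{}\"}},\"limit\":{}",
        json_escape(graph_source_id),
        json_escape(text),
        limit
    );
    if let Some(t) = as_of_t {
        body.push_str(&format!(",\"as_of_t\":{}", t));
    }
    body.push('}');
    body
}
```

The resulting string can be sent with any HTTP client as the body of POST /v1/search with a JSON content type.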

Time-travel: BM25 supports as_of_t (the service walks the manifest to find the newest snapshot ≤ t). Vector indexes are head-only and reject as_of_t.
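
The "newest snapshot ≤ t" rule can be sketched in a few lines of Rust. The SnapshotEntry struct, its fields, and the select_snapshot function are hypothetical; the actual manifest format is not described here, and the integer type used for t is an assumption.

```rust
/// Hypothetical manifest entry: the transaction time a snapshot covers
/// and an opaque storage address for it.
#[derive(Debug, Clone, PartialEq)]
struct SnapshotEntry {
    t: u64,
    address: String,
}

/// Pick the newest snapshot whose `t` is less than or equal to `as_of_t`.
/// Returns None when every snapshot in the manifest is newer than `as_of_t`.
fn select_snapshot(manifest: &[SnapshotEntry], as_of_t: u64) -> Option<&SnapshotEntry> {
    manifest
        .iter()
        .filter(|s| s.t <= as_of_t)
        .max_by_key(|s| s.t)
}
```

With snapshots at t = 100, 150, and 200, a request with as_of_t = 175 would select the snapshot at 150, and a request with as_of_t = 50 would find no usable snapshot.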

Auth: the standalone service does not enforce auth itself — front it with a reverse proxy (or a network policy) if it shouldn't be publicly reachable. The auth_token field on the main server's SearchDeploymentConfig is sent as a Bearer token, so any proxy you put in front can validate it.

Where this fits in your architecture

Two ways to use the search service today:

  1. Direct client → search service. Your application sends BM25 / vector requests straight to fluree-search-httpd and joins the resulting IRIs back to the main Fluree server's query API on the application side. This is the path that works end-to-end today and is appropriate when search traffic dominates and you want it isolated from your main Fluree process.
  2. Main Fluree server → search service (transparent delegation). The query path inside the main server has the plumbing to consult a per-graph-source SearchDeploymentConfig and forward to a remote endpoint. This wiring is not yet exposed end-to-end through the create APIs — Bm25CreateConfig has no deployment builder, and the deployment field is not persisted to the nameservice config record by today's create flow. Track this as a near-term gap; until then, query the search service directly.

Parity Guarantee

Embedded and remote modes use the same:

  • Analyzer configuration (tokenization, stemming, stopwords)
  • BM25 scoring algorithm and parameters
  • Time-travel and sync semantics

Queries return identical results regardless of deployment mode.

Time-travel note: snapshot selection for BM25 time-travel is handled by the BM25 graph source itself, using a manifest/root kept in storage. The nameservice stores only a head pointer to the latest BM25 manifest (an opaque address) and does not store BM25 snapshot history.

Graph Source Identity

BM25 graph sources are registered in the nameservice as @type: "f:GraphSourceDatabase" records:

  • ID format: {name}:{branch} (e.g., article-search:main)
  • Name: Cannot contain : (reserved for ID formatting)
  • Branch: Defaults to "main"
  • Dependencies: Tracked for the source ledger(s) the index draws from
  • Config: Stores the indexing query and BM25 parameters (k1, b)
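
The ID rules above can be captured in a small validation helper. This Rust sketch is illustrative only: parse_graph_source_id is a hypothetical function, and it assumes that a bare name is normalized to the "main" branch and that the branch, like the name, may not contain ":".

```rust
/// Split a graph source ID of the form `{name}:{branch}` into (name, branch).
/// A bare `{name}` is given the default branch "main".
fn parse_graph_source_id(id: &str) -> Result<(String, String), String> {
    let mut parts = id.split(':');
    let name = parts.next().unwrap_or("");
    let branch = parts.next();

    // ':' is reserved as the separator, so a second ':' is rejected.
    if parts.next().is_some() {
        return Err(format!("'{}' contains more than one ':'", id));
    }
    if name.is_empty() {
        return Err(format!("'{}' has an empty name", id));
    }
    match branch {
        None => Ok((name.to_string(), "main".to_string())),
        Some("") => Err(format!("'{}' has an empty branch", id)),
        Some(b) => Ok((name.to_string(), b.to_string())),
    }
}
```

Under these assumptions, "article-search:main" and "article-search" both resolve to the name "article-search" on branch "main", while "a:b:c" is rejected.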

To discover BM25 graph sources, list ledgers and graph sources:

curl http://localhost:8090/v1/fluree/ledgers

Related Documentation