BM25 Graph Source
BM25 indexes in Fluree are implemented as graph sources, allowing full-text search to be seamlessly integrated with structured graph queries through the standard query interface.
Overview
A BM25 graph source:
- Indexes text content from a source ledger using a configurable query
- Provides relevance-ranked search results via BM25 scoring
- Integrates with JSON-LD queries through f: namespace predicates
- Supports time-travel (query the index at any historical point)
- Maintains a manifest of snapshots for incremental sync
For index creation, configuration, and lifecycle management, see BM25 Full-Text Search.
Querying BM25 Graph Sources
JSON-LD Search Pattern
BM25 search uses the f: (Fluree) namespace predicates in where clauses:
{
"@context": {
"ex": "http://example.org/",
"f": "https://ns.flur.ee/db#"
},
"from": "docs:main",
"where": [
{
"f:graphSource": "article-search:main",
"f:searchText": "rust programming",
"f:searchLimit": 10,
"f:searchResult": {
"f:resultId": "?doc",
"f:resultScore": "?score"
}
},
{ "@id": "?doc", "ex:title": "?title" }
],
"select": ["?doc", "?title", "?score"]
}
Pattern Fields
| Field | Required | Description |
|---|---|---|
| f:graphSource | Yes | Graph source ID (e.g., "article-search:main") |
| f:searchText | Yes | Query text. Analyzed with the same tokenizer/stemmer as indexing. |
| f:searchLimit | Yes | Maximum number of search results to return |
| f:searchResult | Yes | Object with variable bindings for results |
| f:resultId | Yes | Variable for the matched document IRI (e.g., "?doc") |
| f:resultScore | No | Variable for the BM25 relevance score (e.g., "?score") |
| f:resultLedger | No | Variable for the source ledger alias (for multi-ledger provenance) |
How It Works
- The search pattern is parsed and turned into a Bm25SearchOperator
- The operator loads the BM25 index from storage (using the leaflet cache when available)
- Query text is analyzed (tokenized, lowercased, stopwords removed, stemmed)
- The top-k results are computed using Block-Max WAND, which skips posting list segments whose upper-bound scores cannot enter the result set, then returns the highest-scoring documents
- Results produce variable bindings (?doc, ?score) that flow into subsequent where clauses
- Subsequent patterns join against the source ledger to retrieve additional properties
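To make the analysis and scoring steps above concrete, here is a minimal, self-contained sketch in plain Rust. It is not Fluree's implementation: the function names analyze and bm25_top_k are hypothetical, stemming is left out, the IDF variant (the Lucene-style ln(1 + (N - n + 0.5) / (n + 0.5))) is an assumption, and every document is scored exhaustively instead of using Block-Max WAND. Block-Max WAND is expected to return the same top-k set while skipping work; this sketch only illustrates what is being ranked.

```rust
use std::collections::{HashMap, HashSet};

/// Hypothetical analyzer: lowercases, splits on non-alphanumeric characters,
/// and drops a tiny stopword list. Stemming is omitted to keep the sketch short.
fn analyze(text: &str) -> Vec<String> {
    const STOPWORDS: [&str; 5] = ["a", "an", "and", "of", "the"];
    text.split(|c: char| !c.is_alphanumeric())
        .filter(|tok| !tok.is_empty())
        .map(|tok| tok.to_lowercase())
        .filter(|tok| !STOPWORDS.contains(&tok.as_str()))
        .collect()
}

/// Scores every document with the textbook BM25 formula and returns the `k`
/// best (document index, score) pairs, highest score first.
/// This is exhaustive scoring, not Block-Max WAND.
fn bm25_top_k(docs: &[&str], query: &str, k: usize, k1: f64, b: f64) -> Vec<(usize, f64)> {
    let analyzed: Vec<Vec<String>> = docs.iter().map(|d| analyze(d)).collect();
    let n = analyzed.len() as f64;
    let avg_len = analyzed.iter().map(|d| d.len() as f64).sum::<f64>() / n.max(1.0);

    // Document frequency: how many documents contain each term at least once.
    let mut df: HashMap<&str, f64> = HashMap::new();
    for doc in &analyzed {
        let unique: HashSet<&str> = doc.iter().map(|t| t.as_str()).collect();
        for term in unique {
            *df.entry(term).or_insert(0.0) += 1.0;
        }
    }

    let query_terms = analyze(query);
    let mut scored: Vec<(usize, f64)> = Vec::new();
    for (i, doc) in analyzed.iter().enumerate() {
        let mut score = 0.0;
        for term in &query_terms {
            let tf = doc.iter().filter(|t| *t == term).count() as f64;
            if tf == 0.0 {
                continue;
            }
            let n_t = df.get(term.as_str()).copied().unwrap_or(0.0);
            // Assumed IDF variant; always positive.
            let idf = (1.0 + (n - n_t + 0.5) / (n_t + 0.5)).ln();
            // Length normalization relative to the average document length.
            let norm = k1 * (1.0 - b + b * doc.len() as f64 / avg_len);
            score += idf * tf * (k1 + 1.0) / (tf + norm);
        }
        if score > 0.0 {
            scored.push((i, score));
        }
    }
    scored.sort_by(|x, y| y.1.total_cmp(&x.1));
    scored.truncate(k);
    scored
}
```

Calling bm25_top_k with k1 = 1.2 and b = 0.75 (conventional BM25 defaults, not values confirmed by this page) returns only documents that contain at least one analyzed query term, ordered by descending score.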
Joining with Ledger Data
The primary use case is combining search results with structured graph data:
{
"@context": {
"ex": "http://example.org/",
"f": "https://ns.flur.ee/db#"
},
"from": "docs:main",
"where": [
{
"f:graphSource": "article-search:main",
"f:searchText": "database design",
"f:searchLimit": 20,
"f:searchResult": { "f:resultId": "?doc", "f:resultScore": "?score" }
},
{ "@id": "?doc", "ex:title": "?title" },
{ "@id": "?doc", "ex:author": "?author" },
{ "@id": "?doc", "ex:year": "?year" }
],
"select": ["?doc", "?title", "?author", "?year", "?score"]
}
The BM25 search runs first, producing a set of (?doc, ?score) bindings. The remaining where clauses join those bindings against the source ledger to enrich results with structured data.
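The join described above can be pictured as a lookup from each search binding into the ledger's data. The sketch below is only an illustration in plain Rust: the Hit struct, the join_titles function, and the in-memory map standing in for the source ledger are all hypothetical, and it assumes a where clause behaves like an inner join (a hit with no ex:title produces no row).

```rust
use std::collections::HashMap;

/// One (?doc, ?score) binding produced by the search step (hypothetical type).
#[derive(Debug, Clone, PartialEq)]
struct Hit {
    doc: String,
    score: f64,
}

/// Joins search hits against a map of document IRI -> title.
/// Hit order is preserved; hits without a matching title are dropped,
/// which mirrors inner-join semantics (an assumption in this sketch).
fn join_titles(hits: &[Hit], titles: &HashMap<String, String>) -> Vec<(String, String, f64)> {
    hits.iter()
        .filter_map(|h| {
            titles
                .get(&h.doc)
                .map(|title| (h.doc.clone(), title.clone(), h.score))
        })
        .collect()
}
```

Because the search step already limits the number of bindings via f:searchLimit, the join only has to look up that many subjects, regardless of ledger size.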
Rust API
Creating and Querying
use fluree_db_api::{Bm25CreateConfig, FlureeBuilder};
use serde_json::json;
// The snippets below assume an async context whose return type supports `?`.
let fluree = FlureeBuilder::memory().build_memory();
// Seed ledger
let ledger0 = fluree.create_ledger("docs:main").await?;
let tx = json!({
"@context": { "ex": "http://example.org/" },
"@graph": [
{ "@id": "ex:doc1", "@type": "ex:Doc", "ex:title": "Rust guide", "ex:author": "Alice" },
{ "@id": "ex:doc2", "@type": "ex:Doc", "ex:title": "Python intro", "ex:author": "Bob" }
]
});
let ledger = fluree.insert(ledger0, &tx).await?.ledger;
// Create index
let query = json!({
"@context": { "ex": "http://example.org/" },
"where": [{ "@id": "?x", "@type": "ex:Doc", "ex:title": "?title" }],
"select": { "?x": ["@id", "ex:title"] }
});
let config = Bm25CreateConfig::new("search", "docs:main", query);
let created = fluree.create_full_text_index(config).await?;
// Query with BM25 search + ledger join
let search_query = json!({
"@context": { "ex": "http://example.org/", "f": "https://ns.flur.ee/db#" },
"from": "docs:main",
"where": [
{
"f:graphSource": &created.graph_source_id,
"f:searchText": "rust",
"f:searchLimit": 10,
"f:searchResult": { "f:resultId": "?doc", "f:resultScore": "?score" }
},
{ "@id": "?doc", "ex:author": "?author" }
],
"select": ["?doc", "?score", "?author"]
});
let result = fluree.query_connection_with_bm25(&search_query).await?;
Using FlureeIndexProvider
The FlureeIndexProvider implements the Bm25IndexProvider and Bm25SearchProvider traits, used by the query engine for graph source resolution:
use fluree_db_api::FlureeIndexProvider;
use fluree_db_query::bm25::{Bm25IndexProvider, Bm25Scorer, Analyzer};
let provider = FlureeIndexProvider::new(&fluree);
// Load index through the provider (with optional sync and time-travel)
let index = provider
.bm25_index("search:main", Some(ledger.t()), false, None)
.await?;
// Direct search
let analyzer = Analyzer::english_default();
let terms = analyzer.analyze_to_strings("rust");
let term_refs: Vec<&str> = terms.iter().map(|s| s.as_str()).collect();
let scorer = Bm25Scorer::new(&index, &term_refs);
let results = scorer.top_k(10);
Remote Search Service
For large indexes or multi-instance deployments, BM25 (and vector) search can be delegated to a standalone search service: the fluree-search-httpd binary.
Important: the search service is a separate process with its own listen port and its own HTTP API. It is not mounted under the main Fluree server's api_base_url (/v1/fluree/...). It needs read access to the same storage and nameservice paths the main server writes to, so the typical deployment is to share a storage volume.
Prerequisite: the index must already exist
fluree-search-httpd only serves queries against existing indexes; it does not create them. Today, BM25 and vector graph-source indexes are created via the Rust API (Bm25CreateConfig + create_full_text_index, or VectorCreateConfig + create_vector_index). HTTP endpoints for index creation are not yet available — see the note in API endpoints.
The recommended workflow is:
- Run the Fluree server (or use the Rust API directly) to create the BM25 / vector index on a shared storage path.
- Run fluree-search-httpd against the same --storage-root and --nameservice-path.
- Point clients (or the main Fluree server's SearchDeploymentConfig) at the search service's /v1/search endpoint.
Running the Search Service
fluree-search-httpd \
--storage-root file:///var/fluree/data \
--nameservice-path file:///var/fluree/ns \
--listen 0.0.0.0:9090
Configuration options (CLI flag / env var):
| Flag | Env var | Default | Description |
|---|---|---|---|
| --storage-root | FLUREE_STORAGE_ROOT | (required) | Path to Fluree storage (where indexes are persisted). file:// prefix optional. |
| --nameservice-path | FLUREE_NAMESERVICE_PATH | (required) | Path to nameservice data. |
| --listen | FLUREE_SEARCH_LISTEN | 0.0.0.0:9090 | Address and port to bind. |
| --cache-max-entries | FLUREE_SEARCH_CACHE_MAX_ENTRIES | 100 | Maximum cached indexes. |
| --cache-ttl-secs | FLUREE_SEARCH_CACHE_TTL_SECS | 300 | Cache TTL in seconds. |
| --max-limit | FLUREE_SEARCH_MAX_LIMIT | 1000 | Maximum results per query. |
| --default-timeout-ms | FLUREE_SEARCH_DEFAULT_TIMEOUT_MS | 30000 | Default request timeout. |
| --max-timeout-ms | FLUREE_SEARCH_MAX_TIMEOUT_MS | 300000 | Maximum allowed request timeout. |
Vector search is feature-gated: build/run a binary that includes the vector feature to enable the vector backend. When enabled, GET /v1/capabilities reports "vector" in supported_query_kinds.
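As a way to reason about how the limit and timeout options in the table interact with a request, the sketch below applies them as simple caps. This is an assumption for illustration only: the Limits struct and effective_limits function are hypothetical, and the real service may reject an over-limit request instead of clamping it. Only the default values are taken from the table.

```rust
/// Hypothetical mirror of the limit-related options, using the documented defaults.
struct Limits {
    max_limit: usize,
    default_timeout_ms: u64,
    max_timeout_ms: u64,
}

impl Default for Limits {
    fn default() -> Self {
        Limits {
            max_limit: 1000,
            default_timeout_ms: 30_000,
            max_timeout_ms: 300_000,
        }
    }
}

/// Returns the (limit, timeout_ms) a request would effectively run with,
/// assuming the service clamps to its maxima (not confirmed behavior).
fn effective_limits(limits: &Limits, limit: usize, timeout_ms: Option<u64>) -> (usize, u64) {
    let limit = limit.min(limits.max_limit);
    let timeout = timeout_ms
        .unwrap_or(limits.default_timeout_ms)
        .min(limits.max_timeout_ms);
    (limit, timeout)
}
```

GET /v1/capabilities reports the configured max limit and timeout, so a client can check the actual values instead of relying on the defaults.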
Docker Deployment
Run the search service in Docker against a shared volume that the main Fluree server also mounts:
docker run -d --name fluree-search \
-p 9090:9090 \
-v fluree-data:/var/lib/fluree \
-e FLUREE_STORAGE_ROOT=/var/lib/fluree/storage \
-e FLUREE_NAMESERVICE_PATH=/var/lib/fluree/ns \
fluree/search-httpd:latest
For a full Compose example showing the main server + search service sharing a volume, see Running with Docker › Search service.
Search Protocol
The remote search service uses a JSON-based protocol on POST /v1/search. The request is the same shape regardless of backend; the query.kind discriminator selects BM25 vs. vector.
BM25 request:
{
"protocol_version": "1.0",
"graph_source_id": "article-search:main",
"query": { "kind": "bm25", "text": "rust programming" },
"limit": 20,
"as_of_t": 150,
"sync": false,
"timeout_ms": 5000
}
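A client could assemble this request body without any JSON library, as in the std-only sketch below. The helper names escape_json and bm25_request are hypothetical, and treating as_of_t as optional is an inference from the vector example (which omits it), not something this page states outright. In real code a JSON library such as serde_json would usually be the safer choice.

```rust
/// Escapes a string for embedding inside a JSON string literal.
fn escape_json(s: &str) -> String {
    let mut out = String::with_capacity(s.len());
    for c in s.chars() {
        match c {
            '"' => out.push_str("\\\""),
            '\\' => out.push_str("\\\\"),
            '\n' => out.push_str("\\n"),
            '\r' => out.push_str("\\r"),
            '\t' => out.push_str("\\t"),
            c if (c as u32) < 0x20 => out.push_str(&format!("\\u{:04x}", c as u32)),
            c => out.push(c),
        }
    }
    out
}

/// Builds a BM25 request body for POST /v1/search.
/// `as_of_t` is only emitted when provided (assumed to be optional).
fn bm25_request(graph_source_id: &str, text: &str, limit: u32, as_of_t: Option<u64>) -> String {
    let mut body = format!(
        r#"{{"protocol_version":"1.0","graph_source_id":"{}","query":{{"kind":"bm25","text":"{}"}},"limit":{}"#,
        escape_json(graph_source_id),
        escape_json(text),
        limit
    );
    if let Some(t) = as_of_t {
        body.push_str(&format!(r#","as_of_t":{}"#, t));
    }
    body.push('}');
    body
}
```

The resulting string can be sent with any HTTP client as the body of a POST to the search service's /v1/search endpoint with a JSON content type.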
Vector request (requires the vector feature):
{
"protocol_version": "1.0",
"graph_source_id": "doc-embeddings:main",
"query": { "kind": "vector", "vector": [0.12, -0.34, ...], "metric": "cosine" },
"limit": 10
}
A vector_similar_to variant takes a to_iri instead of an explicit vector — the server resolves the entity's embedding from the source ledger.
Response:
{
"protocol_version": "1.0",
"index_t": 150,
"hits": [
{ "iri": "http://example.org/doc1", "ledger_id": "docs:main", "score": 8.75 },
{ "iri": "http://example.org/doc2", "ledger_id": "docs:main", "score": 7.32 }
],
"took_ms": 12
}
Endpoints:
- POST /v1/search: execute a search query (BM25 or vector)
- GET /v1/capabilities: protocol version, supported query kinds, max limit/timeout
- GET /v1/health: health check
Time-travel: BM25 supports as_of_t (the service walks the manifest to find the newest snapshot ≤ t). Vector indexes are head-only and reject as_of_t.
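The snapshot selection rule for as_of_t can be stated in a few lines. The sketch below assumes the manifest can be reduced to a list of snapshot t values; the function name select_snapshot and the u64 type are hypothetical, and the real manifest format is not shown on this page.

```rust
/// Returns the newest snapshot `t` that is less than or equal to `as_of_t`,
/// or None when every snapshot is newer than the requested point in time.
fn select_snapshot(snapshot_ts: &[u64], as_of_t: u64) -> Option<u64> {
    snapshot_ts
        .iter()
        .copied()
        .filter(|&t| t <= as_of_t)
        .max()
}
```

With snapshots at t = 100, 150, and 200, a request with as_of_t = 175 would be served from the snapshot at 150, and a request with as_of_t = 50 would find no usable snapshot.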
Auth: the standalone service does not enforce auth itself — front it with a reverse proxy (or a network policy) if it shouldn't be publicly reachable. The auth_token field on the main server's SearchDeploymentConfig is sent as a Bearer token, so any proxy you put in front can validate it.
Where this fits in your architecture
Two ways to use the search service today:
- Direct client → search service. Your application sends BM25 / vector requests straight to fluree-search-httpd and joins the resulting IRIs back to the main Fluree server's query API on the application side. This is the path that works end-to-end today and is appropriate when search traffic dominates and you want it isolated from your main Fluree process.
- Main Fluree server → search service (transparent delegation). The query path inside the main server has the plumbing to consult a per-graph-source SearchDeploymentConfig and forward to a remote endpoint. This wiring is not yet exposed end-to-end through the create APIs: Bm25CreateConfig has no deployment builder, and the deployment field is not persisted to the nameservice config record by today's create flow. Track this as a near-term gap; until then, query the search service directly.
Parity Guarantee
Both embedded and remote modes use identical:
- Analyzer configuration (tokenization, stemming, stopwords)
- BM25 scoring algorithm and parameters
- Time-travel and sync semantics
Queries return identical results regardless of deployment mode.
Time-travel note: BM25 time-travel selection is implemented by BM25 itself via a manifest/root in storage. The nameservice stores only a head pointer to the latest BM25 manifest (an opaque address) and does not store BM25 snapshot history.
Graph Source Identity
BM25 graph sources are registered in the nameservice as @type: "f:GraphSourceDatabase" records:
- ID format: {name}:{branch} (e.g., article-search:main)
- Name: Cannot contain : (reserved for ID formatting)
- Branch: Defaults to "main"
- Dependencies: Tracked for the source ledger(s) the index draws from
- Config: Stores the indexing query and BM25 parameters (k1, b)
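The ID rules above can be expressed as a small parser. This is a sketch under stated assumptions, not Fluree's own parsing code: the function name parse_graph_source_id is hypothetical, and rejecting a branch that itself contains ":" or is empty is a choice made here for simplicity.

```rust
/// Splits a graph source ID into (name, branch).
/// A missing branch defaults to "main"; the name must be non-empty and,
/// because the split happens at the first ':', can never contain ':'.
fn parse_graph_source_id(id: &str) -> Result<(String, String), String> {
    let (name, branch) = match id.split_once(':') {
        Some((name, branch)) => (name, branch),
        None => (id, "main"),
    };
    if name.is_empty() {
        return Err(format!("graph source ID '{}' has an empty name", id));
    }
    // Assumption: an empty branch or a second ':' is treated as invalid.
    if branch.is_empty() || branch.contains(':') {
        return Err(format!("graph source ID '{}' has an invalid branch", id));
    }
    Ok((name.to_string(), branch.to_string()))
}
```

Under these assumptions, "article-search" and "article-search:main" resolve to the same (name, branch) pair.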
List ledgers and graph sources to discover BM25 graph sources:
curl http://localhost:8090/v1/fluree/ledgers
Related Documentation
- BM25 Full-Text Search - Index creation, configuration, maintenance, and storage internals
- Graph Sources Overview - Graph source concepts
- Query Datasets - Multi-graph queries