Graph Sources and Integrations

Graph sources extend Fluree's query capabilities by integrating specialized indexes and external data sources. Graph sources appear as queryable ledgers but are backed by different storage and indexing systems.

Graph Source Types

Overview

Introduction to graph sources:

What are graph sources
Architecture and design
Use cases
Performance characteristics
Creating and managing graph sources

Iceberg / Parquet

Apache Iceberg data lake integration:

Querying Iceberg tables
Parquet file support
Schema mapping
Partition pruning
Performance optimization

R2RML

Relational database mapping:

R2RML standard
Mapping relational data to RDF
SQL query generation
Join optimization
Supported databases (PostgreSQL, MySQL, etc.)

BM25 Graph Source

Full-text search as graph source:

BM25 index as queryable ledger
Search predicates
Combining with structured queries
Real-time index updates

What are Graph Sources?

Graph sources are queryable data sources that appear as Fluree ledgers but are backed by specialized storage:

Standard Ledger:

mydb:main → RDF triple store → SPOT/POST/OPST/PSOT indexes

Graph Source:

products-search:main → BM25 index → Inverted text index
products-vector:main → HNSW → Vector similarity index
warehouse-data:main → Iceberg → Parquet files
sql-db:main → R2RML → PostgreSQL tables

Query Transparency

Graph sources are queried like regular ledgers:

{
  "@context": {"f": "https://ns.flur.ee/db#"},
  "from": "products:main",
  "select": ["?product", "?score"],
  "where": [
    {
      "f:graphSource": "products-search:main",
      "f:searchText": "laptop",
      "f:searchLimit": 20,
      "f:searchResult": { "f:resultId": "?product", "f:resultScore": "?score" }
    }
  ]
}

Note: SPARQL queries use the same f: namespace pattern (f:graphSource, f:searchText, etc.) within JSON-LD query syntax.
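When queries are assembled programmatically, the search pattern above can be built with a small helper. The following is a minimal sketch that only reuses the keys shown in the example; the function names `bm25_pattern` and `search_query` are illustrative and not part of any Fluree client library.

```python
import json

F_NS = "https://ns.flur.ee/db#"


def bm25_pattern(graph_source, text, limit, id_var="?product", score_var=None):
    """Build the f:graphSource search pattern for a where clause."""
    result = {"f:resultId": id_var}
    if score_var is not None:
        # Only bind the score when the caller asks for it.
        result["f:resultScore"] = score_var
    return {
        "f:graphSource": graph_source,
        "f:searchText": text,
        "f:searchLimit": limit,
        "f:searchResult": result,
    }


def search_query(ledger, pattern, select):
    """Wrap a search pattern in a complete JSON-LD query document."""
    return {
        "@context": {"f": F_NS},
        "from": ledger,
        "select": select,
        "where": [pattern],
    }


query = search_query(
    "products:main",
    bm25_pattern("products-search:main", "laptop", 20, score_var="?score"),
    ["?product", "?score"],
)
print(json.dumps(query, indent=2))
```

The printed document has the same structure as the query shown above, so it can be sent to the server as the request body.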

Multi-Graph Queries

Combine regular ledgers with graph sources:

{
  "@context": {"f": "https://ns.flur.ee/db#", "schema": "http://schema.org/"},
  "from": "products:main",
  "select": ["?product", "?name", "?price", "?score"],
  "where": [
    {
      "f:graphSource": "products-search:main",
      "f:searchText": "laptop",
      "f:searchLimit": 20,
      "f:searchResult": { "f:resultId": "?product", "f:resultScore": "?score" }
    },
    { "@id": "?product", "schema:name": "?name" },
    { "@id": "?product", "schema:price": "?price" }
  ],
  "orderBy": ["-?score"]
}

This query joins structured data from products:main with search results from the products-search:main graph source, binding both on ?product.

Graph Source Lifecycle

1. Create Graph Source

Define mapping/configuration:

curl -X POST "http://localhost:8090/index/bm25?ledger=mydb:main" \
  -H "Content-Type: application/json" \
  -d '{"name": "products-search", "fields": [...]}'
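The same request can be prepared from Python using only the standard library. This is a sketch under assumptions: the endpoint and the `name`/`fields` keys come from the curl example, the `Content-Type` header is assumed, and the helper name `build_create_bm25_request` is hypothetical. The contents of `fields` are elided in the example, so the caller must supply them; the empty list below is only a placeholder.

```python
import json
import urllib.parse
import urllib.request


def build_create_bm25_request(base_url, ledger, name, fields):
    """Prepare (but do not send) the POST request that creates a BM25 graph source."""
    # Keep ':' unescaped so the URL matches the curl example (ledger=mydb:main).
    query = urllib.parse.urlencode({"ledger": ledger}, safe=":")
    body = json.dumps({"name": name, "fields": fields}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/index/bm25?{query}",
        data=body,
        method="POST",
        headers={"Content-Type": "application/json"},
    )


# Placeholder field list: the real format is not specified in this document.
request = build_create_bm25_request(
    "http://localhost:8090", "mydb:main", "products-search", fields=[]
)
print(request.get_method(), request.full_url)
```

Passing the prepared request to `urllib.request.urlopen(request)` would send it to a running server.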

2. Initial Indexing

Build index from source data:

Load data from source ledger
Transform to target format
Build specialized index
Publish to nameservice

3. Incremental Updates

Keep synchronized with source:

Monitor source ledger for changes
Update graph source incrementally
Maintain consistency
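One way to picture incremental synchronization is an index that remembers the last source commit it applied and ignores anything older. The sketch below is a toy in-memory model, not Fluree's implementation; the commit shape (`t`, `assert`, `retract`) and the class name `ToyTextIndex` are assumptions made for illustration.

```python
from collections import defaultdict


class ToyTextIndex:
    """In-memory inverted index that tracks the last source commit applied."""

    def __init__(self):
        self.postings = defaultdict(set)  # term -> ids of documents containing it
        self.doc_terms = {}               # document id -> set of indexed terms
        self.indexed_t = 0                # last commit number applied

    def _remove(self, doc_id):
        for term in self.doc_terms.pop(doc_id, set()):
            self.postings[term].discard(doc_id)
            if not self.postings[term]:
                del self.postings[term]

    def _upsert(self, doc_id, text):
        self._remove(doc_id)  # drop stale terms before re-indexing
        terms = set(text.lower().split())
        self.doc_terms[doc_id] = terms
        for term in terms:
            self.postings[term].add(doc_id)

    def apply_commit(self, commit):
        """Apply one source commit; return False if it was already applied."""
        if commit["t"] <= self.indexed_t:
            return False
        for doc_id in commit.get("retract", []):
            self._remove(doc_id)
        for doc_id, text in commit.get("assert", {}).items():
            self._upsert(doc_id, text)
        self.indexed_t = commit["t"]
        return True

    def search(self, term):
        return sorted(self.postings.get(term.lower(), set()))


index = ToyTextIndex()
index.apply_commit({"t": 1, "assert": {"p1": "gaming laptop", "p2": "office laptop"}})
index.apply_commit({"t": 2, "retract": ["p2"], "assert": {"p1": "gaming desktop"}})
print(index.indexed_t, index.search("desktop"))
```

Comparing `indexed_t` with the source ledger's latest commit number gives a simple measure of how far the graph source lags behind.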

4. Query Execution

Execute queries against graph source:

Parse query
Route to appropriate backend
Execute specialized query
Return results

Supported Graph Sources

BM25 Full-Text Search

Purpose: Keyword search with relevance ranking

Backend: Inverted index

Use Cases:

E-commerce product search
Document search
Knowledge base search

Example:

{
  "@context": {"f": "https://ns.flur.ee/db#"},
  "from": "docs:main",
  "where": [
    {
      "f:graphSource": "docs-search:main",
      "f:searchText": "quarterly report",
      "f:searchLimit": 20,
      "f:searchResult": { "f:resultId": "?doc" }
    }
  ]
}

See BM25 Graph Source and BM25 Indexing.
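For intuition about what relevance ranking means here, the sketch below scores documents with the standard Okapi BM25 formula. It is a toy: whitespace tokenization, the parameter defaults `k1=1.2` and `b=0.75`, and the function name `bm25_rank` are assumptions, and the tokenizer and parameters Fluree actually uses are not described in this document.

```python
import math
from collections import Counter


def bm25_rank(docs, query, k1=1.2, b=0.75, limit=20):
    """Rank documents (id -> text) against a query with Okapi BM25."""
    if not docs:
        return []
    tokenized = {doc_id: text.lower().split() for doc_id, text in docs.items()}
    n_docs = len(tokenized)
    avg_len = sum(len(terms) for terms in tokenized.values()) / n_docs

    # Document frequency: in how many documents each term appears.
    doc_freq = Counter()
    for terms in tokenized.values():
        doc_freq.update(set(terms))

    scores = {}
    for doc_id, terms in tokenized.items():
        term_freq = Counter(terms)
        score = 0.0
        for term in query.lower().split():
            tf = term_freq[term]
            if tf == 0:
                continue
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            # Length normalization: long documents need more matches to score well.
            norm = tf + k1 * (1 - b + b * len(terms) / avg_len)
            score += idf * tf * (k1 + 1) / norm
        if score > 0:
            scores[doc_id] = score
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)[:limit]


ranked = bm25_rank(
    {
        "doc1": "quarterly report for the finance team",
        "doc2": "annual report",
        "doc3": "lunch menu",
    },
    "quarterly report",
)
print([doc_id for doc_id, _ in ranked])
```

The `limit` argument plays the same role as `f:searchLimit` in the query: only the top-scoring matches are returned.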

Vector Similarity Search

Purpose: Semantic search using embeddings

Backend: HNSW index (embedded or remote)

Use Cases:

Semantic search
Recommendations
Image similarity
Clustering

See Vector Search for details.
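To illustrate what a similarity query computes, the sketch below ranks vectors by cosine similarity with an exhaustive scan. This is deliberately not HNSW: an HNSW index approximates the same nearest-neighbor result without comparing against every vector. Cosine is used here as one common metric, and the function names and sample vectors are made up for the example.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def nearest(vectors, query, k=3):
    """Return the k (id, similarity) pairs most similar to the query vector."""
    scored = [(item_id, cosine(vec, query)) for item_id, vec in vectors.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


embeddings = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
top = nearest(embeddings, [1.0, 0.1], k=2)
print([item_id for item_id, _ in top])
```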

Apache Iceberg

Purpose: Query data lake tables

Backend: Apache Iceberg / Parquet files

Use Cases:

Analytics on historical data
Data warehouse integration
Large-scale batch data

Example:

{
  "from": "warehouse-sales:main",
  "select": ["?date", "?revenue"],
  "where": [
    { "@id": "?sale", "warehouse:date": "?date" },
    { "@id": "?sale", "warehouse:revenue": "?revenue" }
  ],
  "filter": "?date >= '2024-01-01'"
}

See Iceberg / Parquet.
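A date filter like the one above is where file pruning pays off: Iceberg keeps per-file column statistics, so files whose value range cannot match the filter need not be read. The sketch below models that idea only; the `DataFile` class, file paths, and statistics are hypothetical and do not reflect how Fluree or Iceberg represent this metadata.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataFile:
    path: str
    min_date: str  # smallest date value in the file (ISO 8601)
    max_date: str  # largest date value in the file (ISO 8601)


def prune_files(files, min_date):
    """Keep only files that may contain rows with date >= min_date.

    ISO 8601 date strings sort lexicographically in date order, so plain
    string comparison is sufficient here.
    """
    return [f for f in files if f.max_date >= min_date]


files = [
    DataFile("sales/2023-h2.parquet", "2023-07-01", "2023-12-31"),
    DataFile("sales/2024-q1.parquet", "2024-01-01", "2024-03-31"),
    DataFile("sales/2023-2024.parquet", "2023-12-15", "2024-01-15"),
]
to_scan = prune_files(files, "2024-01-01")
print([f.path for f in to_scan])
```

Files that straddle the boundary are kept, because pruning can only rule a file out, and the row-level filter still runs on whatever is scanned.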

R2RML (Relational Databases)

Purpose: Query relational databases as RDF

Backend: SQL databases (PostgreSQL, MySQL, etc.)

Use Cases:

Existing database integration
Incremental adoption of graph queries
Unified queries across systems

Example:

{
  "from": "sql-customers:main",
  "select": ["?name", "?email"],
  "where": [
    { "@id": "?customer", "schema:name": "?name" },
    { "@id": "?customer", "schema:email": "?email" }
  ]
}

See R2RML.
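Conceptually, an R2RML mapping turns each relational row into triples: a subject built from a template and one triple per mapped column. The sketch below imitates that with plain Python. The template IRI, the column names, and the function `row_to_triples` are illustrative assumptions, and real mappings are written in the R2RML vocabulary rather than as Python dictionaries.

```python
def row_to_triples(row, subject_template, predicate_columns):
    """Map one relational row to (subject, predicate, object) triples.

    subject_template uses {column} placeholders, in the style of an
    R2RML rr:template. Columns holding NULL (None) produce no triple.
    """
    subject = subject_template.format(**row)
    return [
        (subject, predicate, row[column])
        for predicate, column in predicate_columns.items()
        if row.get(column) is not None
    ]


row = {"id": 7, "name": "Ada", "email": None}
triples = row_to_triples(
    row,
    "http://example.org/customer/{id}",
    {"schema:name": "name", "schema:email": "email"},
)
print(triples)
```

A query pattern such as `{ "@id": "?customer", "schema:name": "?name" }` then matches against triples of this shape, while the data itself stays in the relational database.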

Architecture

Graph Source Registry

Graph sources are registered in the nameservice:

{
  "graph_source_id": "products-search:main",
  "type": "bm25",
  "source": "products:main",
  "backend": "inverted_index",
  "status": "ready"
}

Query Routing

The query engine looks up the graph source's type and routes the query to the matching backend:

Query: FROM <products-search:main>
  ↓
Nameservice lookup: type=bm25
  ↓
Route to BM25 query engine
  ↓
Execute against inverted index
  ↓
Return results
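The routing flow above amounts to a lookup followed by a dispatch on the record's type. The sketch below models it with two dictionaries; the registry record reuses the fields from the registry example, while the `route` function and the handler signature are hypothetical.

```python
def route(registry, handlers, graph_source_id, query):
    """Look up a graph source and dispatch the query to its backend handler."""
    record = registry.get(graph_source_id)
    if record is None:
        raise LookupError(f"Graph source not found: {graph_source_id}")
    handler = handlers[record["type"]]
    return handler(record, query)


registry = {
    "products-search:main": {
        "graph_source_id": "products-search:main",
        "type": "bm25",
        "source": "products:main",
        "backend": "inverted_index",
        "status": "ready",
    }
}

# Stand-in handler: a real backend would execute the search.
handlers = {
    "bm25": lambda record, query: f"bm25 search for {query!r} on {record['backend']}",
}

print(route(registry, handlers, "products-search:main", "laptop"))
```

An unknown identifier fails at the lookup step, which corresponds to the GraphSourceNotFound error described under Troubleshooting.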

Result Integration

Results from graph sources are joined with results from regular ledgers on their shared variables:

FROM <products:main>, <products-search:main>
  ↓
Execute subquery on products:main → Results A
Execute subquery on products-search:main → Results B
  ↓
Join Results A + B on ?product
  ↓
Return combined results
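The join step can be sketched as a hash join over variable bindings. This is a simplified model that assumes each subquery yields a list of dictionaries keyed by variable name; how Fluree's engine actually joins results is not described here, and `hash_join` is an illustrative name.

```python
from collections import defaultdict


def hash_join(left, right, var):
    """Join two lists of variable bindings on a shared variable."""
    # Build a lookup table from the right-hand side.
    table = defaultdict(list)
    for row in right:
        table[row[var]].append(row)

    joined = []
    for left_row in left:
        # .get avoids creating empty entries for keys that do not match.
        for right_row in table.get(left_row[var], []):
            joined.append({**left_row, **right_row})
    return joined


results_a = [
    {"?product": "p1", "?name": "Laptop A"},
    {"?product": "p2", "?name": "Mouse"},
]
results_b = [
    {"?product": "p1", "?score": 2.5},
    {"?product": "p3", "?score": 1.0},
]
print(hash_join(results_a, results_b, "?product"))
```

Only bindings present on both sides survive, which is why a selective search on the graph source side keeps the joined result small.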

Performance Considerations

Query Planning

Graph sources affect query optimization:

Specialized indexes enable efficient filtering
Push filters down to graph source when possible
Minimize data transfer between graphs

Data Transfer

Minimize data movement:

Filter in graph source before joining
Use selective projections
Leverage graph source's native capabilities

Caching

Some graph source backends support caching:

BM25: Results cacheable
Vector: Similar queries share computation
Iceberg: Parquet file caching
R2RML: SQL query plan caching

Best Practices

1. Choose Appropriate Graph Source Type

Match graph source to use case:

Keyword search → BM25
Semantic search → Vector
Analytics → Iceberg
Relational database integration → R2RML

2. Filter Early

Push filters to graph sources:

Good (the search narrows the candidate set before the price filter is applied):

{
  "@context": {"f": "https://ns.flur.ee/db#", "schema": "http://schema.org/"},
  "from": "products:main",
  "where": [
    {
      "f:graphSource": "products-search:main",
      "f:searchText": "laptop",
      "f:searchLimit": 50,
      "f:searchResult": { "f:resultId": "?p" }
    },
    { "@id": "?p", "schema:price": "?price" }
  ],
  "filter": "?price < 1000"
}

3. Monitor Graph Source Lag

Check synchronization status:

curl http://localhost:8090/index/status/products-search:main
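For automated checks, the status call can be wrapped in a small polling loop. This sketch assumes the status endpoint returns a JSON object with a `status` field like the one in the registry record (for example "ready"); the response format is not documented here, so treat that field, and the helper names, as assumptions.

```python
import json
import time
import urllib.request


def fetch_status_http(graph_source_id, base_url="http://localhost:8090"):
    """Fetch the status document for a graph source (assumed to be JSON)."""
    url = f"{base_url}/index/status/{graph_source_id}"
    with urllib.request.urlopen(url) as response:
        return json.load(response)


def wait_until_ready(fetch_status, graph_source_id, attempts=5,
                     delay_seconds=1.0, sleep=time.sleep):
    """Poll until the graph source reports status "ready" or attempts run out."""
    for attempt in range(attempts):
        record = fetch_status(graph_source_id)
        if record.get("status") == "ready":
            return True
        if attempt < attempts - 1:
            sleep(delay_seconds)  # back off before the next check
    return False
```

Against a running server this would be called as `wait_until_ready(fetch_status_http, "products-search:main")`; the fetch function is injectable so the loop can be exercised without a server.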

4. Use Appropriate Limits

Limit results from graph sources:

{
  "@context": {"f": "https://ns.flur.ee/db#"},
  "from": "products:main",
  "where": [
    {
      "f:graphSource": "products-search:main",
      "f:searchText": "query",
      "f:searchLimit": 100,
      "f:searchResult": { "f:resultId": "?p" }
    }
  ]
}

5. Test Performance

Profile queries combining graph sources:

curl -X POST http://localhost:8090/v1/fluree/explain \
  -H "Content-Type: application/json" \
  -d '{...}'

Troubleshooting

Graph Source Not Found

{
  "error": "GraphSourceNotFound",
  "message": "Graph source not found: products-search:main"
}

Solution: Create the graph source, or check the spelling of its name (including the branch suffix, such as :main).

Synchronization Lag

If a graph source falls out of sync with its source ledger, check its status and, if it does not catch up, trigger a rebuild:

# Check status
curl http://localhost:8090/index/status/products-search:main

# Trigger rebuild
curl -X POST http://localhost:8090/index/rebuild/products-search:main

Poor Performance

If a query that combines graph sources is slow:

Check explain plan
Add filters to reduce result set
Ensure indexes are up-to-date
Consider query rewrite
