Graph Sources and Integrations
Graph sources extend Fluree's query capabilities by integrating specialized indexes and external data sources. Graph sources appear as queryable ledgers but are backed by different storage and indexing systems.
Graph Source Types
Overview
Introduction to graph sources:
- What are graph sources
- Architecture and design
- Use cases
- Performance characteristics
- Creating and managing graph sources
Iceberg / Parquet
Apache Iceberg data lake integration:
- Querying Iceberg tables
- Parquet file support
- Schema mapping
- Partition pruning
- Performance optimization
R2RML
Relational database mapping:
- R2RML standard
- Mapping relational data to RDF
- SQL query generation
- Join optimization
- Supported databases (PostgreSQL, MySQL, etc.)
BM25 Graph Source
Full-text search as a graph source:
- BM25 index as a queryable ledger
- Search predicates
- Combining with structured queries
- Real-time index updates
What are Graph Sources?
Graph sources are queryable data sources that appear as Fluree ledgers but are backed by specialized storage:
Standard Ledger:
mydb:main → RDF triple store → SPOT/POST/OPST/PSOT indexes
Graph Source:
products-search:main → BM25 index → Inverted text index
products-vector:main → HNSW → Vector similarity index
warehouse-data:main → Iceberg → Parquet files
sql-db:main → R2RML → PostgreSQL tables
Query Transparency
Graph sources are queried like regular ledgers:
{
"@context": {"f": "https://ns.flur.ee/db#"},
"from": "products:main",
"select": ["?product", "?score"],
"where": [
{
"f:graphSource": "products-search:main",
"f:searchText": "laptop",
"f:searchLimit": 20,
"f:searchResult": { "f:resultId": "?product", "f:resultScore": "?score" }
}
]
}
Note: SPARQL queries use the same f: namespace pattern (f:graphSource, f:searchText, etc.) within JSON-LD query syntax.
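When queries are assembled in application code, the search pattern can be produced by a small helper. The Python sketch below is illustrative only: `bm25_pattern` is a hypothetical function, not part of any Fluree client library, and it simply reproduces the `f:` keys used in the example above.

```python
import json

F_NS = "https://ns.flur.ee/db#"  # namespace IRI taken from the example above


def bm25_pattern(graph_source, text, limit=20, id_var="?product", score_var=None):
    """Build one search pattern for a query's "where" clause (hypothetical helper)."""
    if limit < 1:
        raise ValueError("limit must be a positive integer")
    result = {"f:resultId": id_var}
    if score_var is not None:
        # Only request the score when the caller wants it bound to a variable.
        result["f:resultScore"] = score_var
    return {
        "f:graphSource": graph_source,
        "f:searchText": text,
        "f:searchLimit": limit,
        "f:searchResult": result,
    }


query = {
    "@context": {"f": F_NS},
    "from": "products:main",
    "select": ["?product", "?score"],
    "where": [bm25_pattern("products-search:main", "laptop", score_var="?score")],
}
print(json.dumps(query, indent=2))
```

The printed document has the same shape as the hand-written query, so the helper only centralizes the key names and the limit check.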
Multi-Graph Queries
Combine regular ledgers with graph sources:
{
"@context": {"f": "https://ns.flur.ee/db#", "schema": "http://schema.org/"},
"from": "products:main",
"select": ["?product", "?name", "?price", "?score"],
"where": [
{
"f:graphSource": "products-search:main",
"f:searchText": "laptop",
"f:searchLimit": 20,
"f:searchResult": { "f:resultId": "?product", "f:resultScore": "?score" }
},
{ "@id": "?product", "schema:name": "?name" },
{ "@id": "?product", "schema:price": "?price" }
],
"orderBy": ["-?score"]
}
This query joins structured data from products:main with search results from the products-search:main graph source. The shared ?product variable is the join key.
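The join in a multi-graph query happens through variables that appear in more than one pattern. The following Python sketch, which is a hypothetical client-side sanity check and not something Fluree requires, finds those shared variables and confirms that every selected variable is bound somewhere in the where clause. It treats any string value starting with "?" as a variable, so a literal search text that begins with "?" would be misread.

```python
def variables_in(node):
    """Collect every string value starting with "?" in a JSON-like structure."""
    if isinstance(node, str):
        return {node} if node.startswith("?") else set()
    if isinstance(node, dict):
        node = list(node.values())  # keys are predicates, so only values are scanned
    if isinstance(node, list):
        found = set()
        for item in node:
            found |= variables_in(item)
        return found
    return set()


def shared_variables(where):
    """Return variables used by more than one pattern, i.e. the join keys."""
    seen, shared = set(), set()
    for pattern in where:
        here = variables_in(pattern)
        shared |= seen & here
        seen |= here
    return shared


where = [
    {
        "f:graphSource": "products-search:main",
        "f:searchText": "laptop",
        "f:searchLimit": 20,
        "f:searchResult": {"f:resultId": "?product", "f:resultScore": "?score"},
    },
    {"@id": "?product", "schema:name": "?name"},
    {"@id": "?product", "schema:price": "?price"},
]
select = ["?product", "?name", "?price", "?score"]

join_keys = shared_variables(where)
unbound = [var for var in select if var not in variables_in(where)]
print(join_keys, unbound)
```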
Graph Source Lifecycle
1. Create Graph Source
Define mapping/configuration:
curl -X POST 'http://localhost:8090/index/bm25?ledger=mydb:main' \
-d '{"name": "products-search", "fields": [...]}'
2. Initial Indexing
Build index from source data:
- Load data from source ledger
- Transform to target format
- Build specialized index
- Publish to nameservice
3. Incremental Updates
Keep synchronized with source:
- Monitor source ledger for changes
- Update graph source incrementally
- Maintain consistency
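The update steps above can be pictured with a toy index. This Python sketch assumes, purely for illustration, that each source commit carries a monotonically increasing transaction number `t` plus the documents it upserts or deletes; the class, the commit shape, and the whitespace tokenizer are all hypothetical and do not describe Fluree's actual indexer.

```python
from collections import defaultdict


class ToyTextIndex:
    """Toy inverted index that remembers the last source transaction it applied."""

    def __init__(self):
        self.postings = defaultdict(set)  # term -> ids of documents containing it
        self.doc_terms = {}               # document id -> terms currently indexed
        self.indexed_t = 0                # last source transaction applied

    def _remove(self, doc_id):
        for term in self.doc_terms.pop(doc_id, set()):
            self.postings[term].discard(doc_id)
            if not self.postings[term]:
                del self.postings[term]

    def apply(self, commit):
        """Apply {"t": int, "upserts": {id: text}, "deletes": [id]} exactly once."""
        if commit["t"] <= self.indexed_t:
            return False  # already applied, so replaying a commit is a no-op
        for doc_id in commit.get("deletes", []):
            self._remove(doc_id)
        for doc_id, text in commit.get("upserts", {}).items():
            self._remove(doc_id)  # drop stale terms before re-indexing
            terms = set(text.lower().split())
            self.doc_terms[doc_id] = terms
            for term in terms:
                self.postings[term].add(doc_id)
        self.indexed_t = commit["t"]
        return True

    def lag(self, source_t):
        """Number of source transactions not yet reflected in the index."""
        return max(0, source_t - self.indexed_t)


index = ToyTextIndex()
index.apply({"t": 1, "upserts": {"p1": "gaming laptop", "p2": "office chair"}})
index.apply({"t": 2, "upserts": {"p1": "gaming desktop"}, "deletes": ["p2"]})
print(dict(index.postings), index.lag(source_t=5))
```

Tracking the last applied `t` is what makes the update incremental and idempotent: only newer commits are processed, and the gap to the source's latest `t` gives a simple measure of lag.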
4. Query Execution
Execute queries against graph source:
- Parse query
- Route to appropriate backend
- Execute specialized query
- Return results
Supported Graph Sources
BM25 Full-Text Search
Purpose: Keyword search with relevance ranking
Backend: Inverted index
Use Cases:
- E-commerce product search
- Document search
- Knowledge base search
Example:
{
"@context": {"f": "https://ns.flur.ee/db#"},
"from": "docs:main",
"where": [
{
"f:graphSource": "docs-search:main",
"f:searchText": "quarterly report",
"f:searchLimit": 20,
"f:searchResult": { "f:resultId": "?doc" }
}
]
}
See BM25 Graph Source and BM25 Indexing.
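For intuition about relevance ranking, here is a minimal Python sketch of the standard Okapi BM25 formula. It assumes whitespace tokenization, the commonly used defaults k1 = 1.2 and b = 0.75, and the non-negative IDF variant; none of these choices are confirmed as Fluree's analyzer or parameters, and `bm25_rank` is a hypothetical name.

```python
import math
from collections import Counter


def bm25_rank(docs, query, k1=1.2, b=0.75, limit=20):
    """Rank docs (id -> text) against query with Okapi BM25; returns [(id, score)]."""
    tokenized = {doc_id: text.lower().split() for doc_id, text in docs.items()}
    n_docs = len(tokenized)
    if n_docs == 0:
        return []
    avg_len = sum(len(tokens) for tokens in tokenized.values()) / n_docs
    doc_freq = Counter()  # term -> number of documents containing it
    for tokens in tokenized.values():
        doc_freq.update(set(tokens))

    scores = {}
    for doc_id, tokens in tokenized.items():
        tf = Counter(tokens)
        score = 0.0
        for term in set(query.lower().split()):
            if tf[term] == 0:
                continue
            # Rare terms get a higher inverse document frequency.
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1)
            # Term frequency saturates via k1 and is normalized by length via b.
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        if score > 0:
            scores[doc_id] = score
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)[:limit]


docs = {
    "d1": "quarterly report for sales",
    "d2": "annual report",
    "d3": "team lunch menu",
}
ranked = bm25_rank(docs, "quarterly report")
print(ranked)
```

In this small corpus the document matching both query terms ranks first, the one matching only "report" ranks second, and the unrelated document is not returned at all.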
Vector Similarity Search
Purpose: Semantic search using embeddings
Backend: HNSW index (embedded or remote)
Use Cases:
- Semantic search
- Recommendations
- Image similarity
- Clustering
See Vector Search for details.
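To show what a similarity query returns, the Python sketch below ranks items by cosine similarity with an exhaustive scan. This is deliberately not HNSW: an HNSW index approximates the same ranking without comparing against every vector. The three-dimensional embeddings and item names are made up for the example.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def nearest(vectors, query, k=3):
    """Exact top-k by cosine similarity over every stored vector."""
    scored = [(item_id, cosine(vec, query)) for item_id, vec in vectors.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Hypothetical embeddings; real ones come from an embedding model.
vectors = {
    "laptop": [0.9, 0.1, 0.0],
    "notebook": [0.8, 0.2, 0.1],
    "banana": [0.0, 0.1, 0.9],
}
top = nearest(vectors, [1.0, 0.0, 0.0], k=2)
print(top)
```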
Apache Iceberg
Purpose: Query data lake tables
Backend: Apache Iceberg / Parquet files
Use Cases:
- Analytics on historical data
- Data warehouse integration
- Large-scale batch data
Example:
{
"from": "warehouse-sales:main",
"select": ["?date", "?revenue"],
"where": [
{ "@id": "?sale", "warehouse:date": "?date" },
{ "@id": "?sale", "warehouse:revenue": "?revenue" }
],
"filter": "?date >= '2024-01-01'"
}
See Iceberg / Parquet.
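A date filter like the one above is where pruning pays off: Iceberg metadata records value bounds per data file, so files that cannot contain matching rows need not be read. The Python sketch below illustrates the idea only; the `DataFile` class, the file paths, and the `prune` function are hypothetical and do not reflect Fluree's or Iceberg's actual APIs.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class DataFile:
    path: str       # hypothetical Parquet file path
    min_date: date  # smallest value of the date column in this file
    max_date: date  # largest value of the date column in this file


def prune(files, start=None, end=None):
    """Keep only files whose [min_date, max_date] range can overlap [start, end]."""
    kept = []
    for data_file in files:
        if start is not None and data_file.max_date < start:
            continue  # every row is older than the requested range
        if end is not None and data_file.min_date > end:
            continue  # every row is newer than the requested range
        kept.append(data_file)
    return kept


files = [
    DataFile("sales/2023-q4.parquet", date(2023, 10, 1), date(2023, 12, 31)),
    DataFile("sales/2024-q1.parquet", date(2024, 1, 1), date(2024, 3, 31)),
    DataFile("sales/2024-q2.parquet", date(2024, 4, 1), date(2024, 6, 30)),
]
to_scan = prune(files, start=date(2024, 1, 1))
print([data_file.path for data_file in to_scan])
```

With the filter `?date >= '2024-01-01'`, the 2023 file is skipped entirely and only the two 2024 files are scanned.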
R2RML (Relational Databases)
Purpose: Query relational databases as RDF
Backend: SQL databases (PostgreSQL, MySQL, etc.)
Use Cases:
- Existing database integration
- Incremental adoption of graph queries
- Unified queries across systems
Example:
{
"from": "sql-customers:main",
"select": ["?name", "?email"],
"where": [
{ "@id": "?customer", "schema:name": "?name" },
{ "@id": "?customer", "schema:email": "?email" }
]
}
See R2RML.
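The core idea behind R2RML is that each table row becomes a subject and each mapped column becomes a predicate. The Python sketch below imitates that with an in-memory SQLite table. The `MAPPING` dictionary is a simplified, hypothetical stand-in for a real R2RML mapping document, and the example.org IRIs and customer rows are invented.

```python
import sqlite3

# Hypothetical mapping: a subject IRI template plus column -> predicate pairs.
MAPPING = {
    "table": "customers",
    "subject_template": "http://example.org/customer/{id}",
    "columns": {"name": "schema:name", "email": "schema:email"},
}


def rows_to_triples(conn, mapping):
    """Yield (subject, predicate, object) triples for every row of the mapped table."""
    columns = ["id", *mapping["columns"]]
    # Table and column names come from trusted configuration, not user input.
    sql = f"SELECT {', '.join(columns)} FROM {mapping['table']} ORDER BY id"
    for row in conn.execute(sql):
        record = dict(zip(columns, row))
        subject = mapping["subject_template"].format(id=record["id"])
        for column, predicate in mapping["columns"].items():
            if record[column] is not None:  # NULL values produce no triple
                yield (subject, predicate, record[column])


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, email TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [(1, "Ada", "ada@example.org"), (2, "Grace", None)],
)
triples = list(rows_to_triples(conn, MAPPING))
for triple in triples:
    print(triple)
```

A query such as the one above can then match `schema:name` and `schema:email` patterns against these triples, while the data itself stays in the relational database.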
Architecture
Graph Source Registry
Graph sources are registered in the nameservice:
{
"graph_source_id": "products-search:main",
"type": "bm25",
"source": "products:main",
"backend": "inverted_index",
"status": "ready"
}
Query Routing
The query engine looks up the graph source type and routes the query to the matching backend:
Query: FROM <products-search:main>
↓
Nameservice lookup: type=bm25
↓
Route to BM25 query engine
↓
Execute against inverted index
↓
Return results
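The routing flow can be sketched as a registry lookup followed by a dispatch on the record's type. In this Python illustration the registry record reuses the fields shown in the Graph Source Registry section, while `route`, `run_bm25`, and the `GraphSourceNotFound` exception class are hypothetical names and not Fluree internals.

```python
class GraphSourceNotFound(LookupError):
    """Raised when no registry record exists for the requested graph source."""


# Stand-in for the nameservice; the record fields mirror the registry example.
REGISTRY = {
    "products-search:main": {
        "type": "bm25",
        "source": "products:main",
        "backend": "inverted_index",
        "status": "ready",
    },
}


def run_bm25(record, request):
    """Placeholder backend that only reports what it would execute."""
    return {"engine": "bm25", "source": record["source"], "text": request["text"]}


BACKENDS = {"bm25": run_bm25}


def route(graph_source_id, request, registry=REGISTRY, backends=BACKENDS):
    """Look up the graph source, then hand the request to the engine for its type."""
    record = registry.get(graph_source_id)
    if record is None:
        raise GraphSourceNotFound(f"Graph source not found: {graph_source_id}")
    handler = backends.get(record["type"])
    if handler is None:
        raise ValueError(f"No backend registered for type: {record['type']}")
    return handler(record, request)


result = route("products-search:main", {"text": "laptop"})
print(result)
```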
Result Integration
Results from graph sources join with regular graphs:
FROM <products:main>, <products-search:main>
↓
Execute subquery on products:main → Results A
Execute subquery on products-search:main → Results B
↓
Join Results A + B on ?product
↓
Return combined results
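The final join step can be illustrated with a hash join over two lists of variable bindings. A hash join is one common strategy, chosen here for clarity; the source does not state which join algorithm the engine actually uses, and the binding values below are invented.

```python
def hash_join(left, right, key):
    """Join two lists of variable bindings on one shared variable."""
    buckets = {}
    for row in right:
        buckets.setdefault(row[key], []).append(row)
    joined = []
    for row in left:
        for match in buckets.get(row[key], []):
            joined.append({**row, **match})  # merge bindings from both sides
    return joined


# Results A: bindings from the regular ledger (hypothetical values).
results_a = [
    {"?product": "ex:p1", "?name": "Laptop Pro"},
    {"?product": "ex:p2", "?name": "Standing Desk"},
]
# Results B: bindings from the search graph source (hypothetical values).
results_b = [
    {"?product": "ex:p1", "?score": 7.2},
    {"?product": "ex:p3", "?score": 3.1},
]
combined = hash_join(results_a, results_b, "?product")
print(combined)
```

Only ex:p1 appears on both sides, so it is the single combined row; bindings without a partner are dropped, as in any inner join.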
Performance Considerations
Query Planning
Graph sources affect query optimization:
- Specialized indexes enable efficient filtering
- Push filters down to graph source when possible
- Minimize data transfer between graphs
Data Transfer
Minimize data movement:
- Filter in graph source before joining
- Use selective projections
- Leverage graph source's native capabilities
Caching
Some graph source backends support caching:
- BM25: Results cacheable
- Vector: Similar queries share computation
- Iceberg: Parquet file caching
- R2RML: SQL query plan caching
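As a client-side illustration of result caching, the Python sketch below memoizes a search call with `functools.lru_cache`. The `run_search` function is a stand-in for a real backend call and returns placeholder values; how Fluree's backends cache internally is not described in the source, so treat this as an assumption-laden sketch.

```python
from functools import lru_cache

CALLS = {"count": 0}  # counts how often the backend stand-in really runs


def run_search(graph_source, text, limit):
    """Stand-in for an expensive backend call; returns placeholder result ids."""
    CALLS["count"] += 1
    return tuple(f"{graph_source}/{text}/{i}" for i in range(min(limit, 3)))


@lru_cache(maxsize=256)
def cached_search(graph_source, text, limit):
    """Identical (graph_source, text, limit) requests are answered from memory."""
    return run_search(graph_source, text, limit)


first = cached_search("products-search:main", "laptop", 20)
second = cached_search("products-search:main", "laptop", 20)
print(CALLS["count"], first is second)
```

Cached results become stale when the index is updated, so a cache like this should be emptied with `cached_search.cache_clear()` after each index update.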
Best Practices
1. Choose Appropriate Graph Source Type
Match graph source to use case:
- Keyword search → BM25
- Semantic search → Vector
- Analytics → Iceberg
- Relational database integration → R2RML
2. Filter Early
Push filters to graph sources:
Good:
{
"@context": {"f": "https://ns.flur.ee/db#", "schema": "http://schema.org/"},
"from": "products:main",
"where": [
{
"f:graphSource": "products-search:main",
"f:searchText": "laptop",
"f:searchLimit": 50,
"f:searchResult": { "f:resultId": "?p" }
},
{ "@id": "?p", "schema:price": "?price" }
],
"filter": "?price < 1000"
}
3. Monitor Graph Source Lag
Check synchronization status:
curl http://localhost:8090/index/status/products-search:main
4. Use Appropriate Limits
Limit results from graph sources:
{
"@context": {"f": "https://ns.flur.ee/db#"},
"from": "products:main",
"where": [
{
"f:graphSource": "products-search:main",
"f:searchText": "query",
"f:searchLimit": 100,
"f:searchResult": { "f:resultId": "?p" }
}
]
}
5. Test Performance
Profile queries combining graph sources:
curl -X POST http://localhost:8090/v1/fluree/explain \
-d '{...}'
Troubleshooting
Graph Source Not Found
{
"error": "GraphSourceNotFound",
"message": "Graph source not found: products-search:main"
}
Solution: Create the graph source, or check the spelling of its name (including the branch suffix, such as :main).
Synchronization Lag
If a graph source is out of sync with its source ledger, check its status and, if needed, trigger a rebuild:
# Check status
curl http://localhost:8090/index/status/products-search:main
# Trigger rebuild
curl -X POST http://localhost:8090/index/rebuild/products-search:main
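The same two calls can be prepared from Python with the standard library. This sketch only constructs the requests, using the host, port, and paths from the curl commands above; the helper names are hypothetical, and because the response format is not shown in this document, no response parsing is attempted.

```python
from urllib.request import Request

BASE_URL = "http://localhost:8090"  # host and port from the curl examples above


def status_request(graph_source_id, base_url=BASE_URL):
    """Build the GET request that checks a graph source's index status."""
    return Request(f"{base_url}/index/status/{graph_source_id}", method="GET")


def rebuild_request(graph_source_id, base_url=BASE_URL):
    """Build the POST request that triggers a rebuild of a graph source."""
    return Request(f"{base_url}/index/rebuild/{graph_source_id}", method="POST")


status = status_request("products-search:main")
rebuild = rebuild_request("products-search:main")
print(status.get_method(), status.full_url)
print(rebuild.get_method(), rebuild.full_url)
# To send a request against a running server:
#   from urllib.request import urlopen
#   with urlopen(status) as response:
#       body = response.read()
```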
Poor Performance
If a query combining graph sources is slow:
- Check explain plan
- Add filters to reduce result set
- Ensure indexes are up-to-date
- Consider query rewrite
Related Documentation
- Overview - Graph source concepts
- BM25 - Full-text search
- Vector Search - Similarity search
- Iceberg - Data lake integration
- R2RML - Relational mapping
- Query Datasets - Multi-graph queries