Data Lake: Which RDF Data Store?
A number of triple stores are listed here. GraphDB looks interesting, but I suspect its enterprise edition tips the scale towards an open source alternative. In the world of big data, I suspect what you really want is SPARQL over Hadoop:
store data in a key value store in Hadoop [HBase], but then index it with our graph database so that we can do these SPARQL queries
This is echoed by “Avoiding Three Common Pitfalls of Data Lakes”:
Smart data technologies substantially reduce the complexity of data lake implementations while accelerating the time-to-value they produce. The graph-based model and detailed descriptions of data elements they enable substantially enhance integration efforts, enabling business users to link data according to relevant attributes that provide pivotal context across data sources and business models. Resource Description Framework (RDF) graphs are designed to incorporate new data and sources without having to re-configure existing representations. The result is considerably decreased time to a more profound form of analytics, in which users can not only ask more questions more expediently than before, but also determine relationships and data context to issue ad-hoc queries for specific needs.
Although older, “Storing (and querying) RDF in NoSQL database managers” is still of interest:
The paper then describes the storage and querying of RDF using HBase with Jena for querying, HBase with Hive as the query engine (with Jena’s ARQ to parse the queries before converting them to HiveQL), CumulusRDF (Cassandra with Sesame), and Couchbase
Which leads to the following possible options:
- RDF data store used to index your data lake
- D2R on top of your data lake
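
To make the first option a bit more concrete, here is a rough sketch in plain Python of the "key value store plus graph index" idea from the quote above. Everything in it is an assumption for illustration: the dict `kv_store` stands in for HBase, the toy `TripleIndex` class stands in for a real triple store, and the row keys and field names are made up. A real setup would issue SPARQL against the triple store rather than call a `match` method.

```python
# Stand-in for the key-value store (HBase in the quote): row key -> raw record.
# The keys and fields here are invented sample data.
kv_store = {
    "row:1": {"name": "Alice", "dept": "Finance"},
    "row:2": {"name": "Bob", "dept": "Finance"},
    "row:3": {"name": "Carol", "dept": "Risk"},
}


class TripleIndex:
    """Toy in-memory triple index; a real triple store would sit here."""

    def __init__(self):
        self.triples = set()

    def add(self, subject, predicate, obj):
        self.triples.add((subject, predicate, obj))

    def match(self, subject=None, predicate=None, obj=None):
        """Return sorted triples matching the pattern; None is a wildcard."""
        return sorted(
            t for t in self.triples
            if (subject is None or t[0] == subject)
            and (predicate is None or t[1] == predicate)
            and (obj is None or t[2] == obj)
        )


def build_index(store):
    """Index every field of every record as a (row key, field, value) triple."""
    index = TripleIndex()
    for row_key, record in store.items():
        for field, value in record.items():
            index.add(row_key, field, value)
    return index


def find_records(index, store, predicate, obj):
    """Use the index to find row keys, then fetch the full records by key."""
    row_keys = [s for s, _, _ in index.match(predicate=predicate, obj=obj)]
    return [store[key] for key in row_keys]


index = build_index(kv_store)
finance = find_records(index, kv_store, "dept", "Finance")
print(finance)
```

The pattern passed to `match` plays roughly the role of a single SPARQL triple pattern such as `?s :dept "Finance"`: the index answers the graph question, and the bulk data is only read from the key-value store by key.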
Anyone got a view?