Semantic Data Model
“When Hadoop Simply Isn’t Enough: How to Purpose-Build Architecture for Industrial Data” offers an interesting read, but makes no mention of RDF. Elasticsearch and Hadoop are mentioned as part of the solution, but I’m not clear on how linkage of data is achieved. Am I missing something?
“The Data Lake Concept Is Maturing”, however, provides a more interesting read, with the Apache Hadoop Distributed File System being called out for storage, coupled with a choice of NoSQL database:
when selecting a NoSQL database with which to work with their Hadoop clusters. MongoDB, he said, is typically used for department-level cache applications, Apache Cassandra for highly distributed interactive applications, and Apache Hbase for analytic applications which Bodkin said “can tolerate a bit more latency, having a smaller number of places where machine-learned models sit right next to your compute cluster in Hadoop.”
Sempala research is interesting, but I don’t see the code anywhere.
Which still leads to the question: what is the latest data lake software stack? Is the way forward HBase holding the data and/or the RDF, as in Jena-HBase? Or is HBase paired with a separate graph database offering SPARQL queries?
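To make the linkage question a bit more concrete, here is a minimal sketch of how RDF triples could be laid out in a key-value or wide-column store and then joined. It is only an illustration: plain Python dictionaries stand in for HBase tables, the `TripleStore` class and the `ex:` identifiers are made up for this example, and the three-index layout (subject, predicate, and object as row keys) is one common approach, not necessarily the schema Jena-HBase or Sempala actually uses.

```python
from collections import defaultdict


class TripleStore:
    """In-memory stand-in for three wide-column tables (hypothetical layout)."""

    def __init__(self):
        # Each index maps a row key to {column: set of values},
        # loosely mimicking rows and columns in a wide-column store.
        self.spo = defaultdict(lambda: defaultdict(set))  # subject -> predicate -> objects
        self.pos = defaultdict(lambda: defaultdict(set))  # predicate -> object -> subjects
        self.osp = defaultdict(lambda: defaultdict(set))  # object -> subject -> predicates

    def add(self, s, p, o):
        # Every triple is written three times so any bound term can act as a row key.
        self.spo[s][p].add(o)
        self.pos[p][o].add(s)
        self.osp[o][s].add(p)

    def match(self, s=None, p=None, o=None):
        """Return a sorted list of (s, p, o) triples; None acts as a wildcard."""
        if s is not None:
            candidates = ((s, p2, o2)
                          for p2, objs in self.spo.get(s, {}).items() for o2 in objs)
        elif p is not None:
            candidates = ((s2, p, o2)
                          for o2, subs in self.pos.get(p, {}).items() for s2 in subs)
        elif o is not None:
            candidates = ((s2, p2, o)
                          for s2, preds in self.osp.get(o, {}).items() for p2 in preds)
        else:
            candidates = ((s2, p2, o2)
                          for s2, preds in self.spo.items()
                          for p2, objs in preds.items() for o2 in objs)
        # Apply whatever bound terms the chosen index did not already cover.
        return sorted(t for t in candidates
                      if (p is None or t[1] == p) and (o is None or t[2] == o))


store = TripleStore()
store.add("ex:pump1", "ex:locatedIn", "ex:plantA")
store.add("ex:pump2", "ex:locatedIn", "ex:plantB")
store.add("ex:sensor7", "ex:attachedTo", "ex:pump1")
store.add("ex:sensor9", "ex:attachedTo", "ex:pump2")

# Two triple patterns joined on the pump, roughly what a SPARQL query such as
#   SELECT ?sensor WHERE { ?pump ex:locatedIn ex:plantA . ?sensor ex:attachedTo ?pump }
# would ask for.
pumps = [s for s, _, _ in store.match(p="ex:locatedIn", o="ex:plantA")]
sensors = [s for pump in pumps for s, _, _ in store.match(p="ex:attachedTo", o=pump)]
print(sensors)
```

The point of the sketch is that the linkage lives in the triples themselves, so the store only needs fast lookups by whichever term is bound. Whether that lookup layer should be HBase itself or a separate graph database with a SPARQL endpoint is exactly the open question.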