Semantic Data Lake
Continuing down the data lake road, Kafka Connect looks very interesting as a way to feed a data lake. This leads to “Prototype of Data Processing Infrastructure”, and the use of RDF and CumulusRDF. RDF is interesting since it has gained uptake through SPARQL and the semantic web.
Further searching of the web yields a very interesting article in the data analytics space, “Advanced Real-Time Healthcare Analytics with Apache Spark”. Thankfully the article provides an appropriate architecture diagram, which nicely uses Apache Kafka 🙂 More interesting is that it's using ontologies:
The architecture is hybrid and also includes a production rule engine and an ontology reasoner. This is done in order to leverage existing clinical domain knowledge available from evidence-based clinical practice guidelines (CPGs) and biomedical ontologies like SNOMED. This approach complements machine learning algorithms’ probabilistic approach to clinical decision making under uncertainty. The production rule system can translate CPGs into executable rules which are fully integrated with clinical processes (workflows) and events. Drools supports both forward and backward chaining as well as the modeling of business processes (clinical workflows) with the business process modeling notation (BPMN). There are patterns for integrating rules and processes.
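To make the forward chaining idea from that excerpt a bit more concrete, here is a minimal sketch in plain Python. It is not Drools and not the article's implementation; the `Rule` type, the `forward_chain` function and the two toy rules are made up for illustration and are not real clinical guidance. The loop keeps firing any rule whose premises are all known until no new facts appear.

```python
from __future__ import annotations

from typing import NamedTuple


class Rule(NamedTuple):
    """Hypothetical rule shape: if all premises are known, assert the conclusion."""
    name: str
    premises: frozenset
    conclusion: str


def forward_chain(facts: set, rules: list) -> set:
    """Fire rules repeatedly until no rule adds a new fact (a fixed point)."""
    derived = set(facts)
    changed = True
    while changed:
        changed = False
        for rule in rules:
            # A rule fires when every premise is already among the known facts.
            if rule.premises <= derived and rule.conclusion not in derived:
                derived.add(rule.conclusion)
                changed = True
    return derived


# Toy rules, invented for this sketch only.
RULES = [
    Rule("infection", frozenset({"fever", "cough"}), "suspect_respiratory_infection"),
    Rule(
        "xray",
        frozenset({"suspect_respiratory_infection", "low_oxygen_saturation"}),
        "order_chest_xray",
    ),
]

result = forward_chain({"fever", "cough", "low_oxygen_saturation"}, RULES)
print(sorted(result))
```

The second rule only fires because the first one derived its premise, which is the chaining part. A real rule engine does this far more efficiently (Drools uses a Rete-style network rather than rescanning every rule), so treat this purely as a picture of the idea.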
“What we’re investing most of our time in now is the semantic data lake, where we store data in a key value store in Hadoop [HBase], but then index it with our graph database so that we can do these SPARQL queries,”
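The pattern in that quote, raw records in a key value store plus a graph index over them for querying, can be sketched roughly like this. Everything here is an assumption for illustration: a dict stands in for HBase, a set of triples stands in for the graph database, and `match` is only a crude analogue of a single SPARQL triple pattern, not SPARQL itself. The class and method names are hypothetical.

```python
from typing import Dict, List, Optional, Set, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)


class SemanticLakeSketch:
    """Toy model: records live in a key-value store, triples index them."""

    def __init__(self) -> None:
        self.rows: Dict[str, dict] = {}   # stand-in for the key-value store
        self.triples: Set[Triple] = set()  # stand-in for the graph index

    def put(self, row_key: str, record: dict, facts: List[Tuple[str, str]]) -> None:
        """Store the raw record and index it with (predicate, object) facts."""
        self.rows[row_key] = record
        for predicate, obj in facts:
            self.triples.add((row_key, predicate, obj))

    def match(
        self,
        s: Optional[str] = None,
        p: Optional[str] = None,
        o: Optional[str] = None,
    ) -> List[Triple]:
        """Return triples matching the pattern; None acts as a wildcard."""
        return sorted(
            t for t in self.triples
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)
        )

    def fetch(self, p: str, o: str) -> List[dict]:
        """Query the index first, then pull the full records by row key."""
        return [self.rows[subject] for subject, _, _ in self.match(p=p, o=o)]


lake = SemanticLakeSketch()
lake.put("patient:1", {"name": "A", "age": 64}, [("hasDiagnosis", "diabetes")])
lake.put("patient:2", {"name": "B", "age": 41}, [("hasDiagnosis", "asthma")])
lake.put("patient:3", {"name": "C", "age": 70}, [("hasDiagnosis", "diabetes")])

diabetic = lake.fetch("hasDiagnosis", "diabetes")
print(diabetic)
```

The point of the split is that the index answers "which rows?" and the key value store answers "what is in them?", so the bulky data never has to live in the graph database.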