Data Lake Architecture: Stream Centric
There are numerous ways to create and feed your data lake. One particularly interesting approach leverages Apache Kafka, and is well documented in “Putting Apache Kafka To Use: A Practical Guide to Building a Stream Data Platform”. The article does a good job of explaining the ad-hoc road:
“piping between systems and applications on an as needed basis and shoe-horned any asynchronous processing into request-response web services.”
Which turns into an interesting diagram 🙂
The article then goes on to Version 2, appropriately named “Kafka stuff”, which has an improved architecture with well-defined flows and patterns (a “stream-centric data architecture”) and these benefits:
Data Integration: The stream data platform captures streams of events or data changes and feeds these to other data systems such as relational databases, key-value stores, Hadoop, or the data warehouse.
Stream processing: It enables continuous, real-time processing and transformation of these streams and makes the results available system-wide.
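The two benefits above can be illustrated with a small in-memory sketch. This is not Kafka's API: `EventLog`, `Consumer`, `append`, `read_from` and `poll` are hypothetical names standing in for a topic, a consumer, and their basic operations. The point it tries to show is that each consumer tracks its own offset, so a loader feeding another data system and a stream processor can read the same events independently.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class EventLog:
    """Hypothetical stand-in for a single topic: an append-only list."""
    events: List[Dict[str, Any]] = field(default_factory=list)

    def append(self, event: Dict[str, Any]) -> int:
        """Append an event and return its offset in the log."""
        self.events.append(event)
        return len(self.events) - 1

    def read_from(self, offset: int) -> List[Dict[str, Any]]:
        """Return all events at or after the given offset."""
        return self.events[offset:]


@dataclass
class Consumer:
    """Reads from the log and remembers its own position."""
    log: EventLog
    offset: int = 0

    def poll(self) -> List[Dict[str, Any]]:
        batch = self.log.read_from(self.offset)
        self.offset += len(batch)
        return batch


def run_demo():
    log = EventLog()
    warehouse = Consumer(log)   # data integration: copy events to another store
    processor = Consumer(log)   # stream processing: derive a running total

    for amount in (10, 25, 5):
        log.append({"type": "order", "amount": amount})

    warehouse_rows = warehouse.poll()
    running_total = sum(e["amount"] for e in processor.poll())

    # A later event is picked up by both consumers on their next poll.
    log.append({"type": "order", "amount": 60})
    warehouse_rows += warehouse.poll()
    running_total += sum(e["amount"] for e in processor.poll())
    return warehouse_rows, running_total
```

Calling `run_demo()` returns the rows the loader copied and the total the processor computed. Neither consumer needs to know the other exists, which is the contrast with the ad-hoc point-to-point piping.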
In the case of H2O, this offers the ability to use Flow through Sparkling Water on top of Apache Spark and the Data Lake (HDFS), and also to score directly off Apache Kafka streams using H2O POJOs, opening up the opportunity to push real-time business insight to the User Experience.
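As a rough sketch of what scoring off a stream looks like in shape: each event is passed through a model as it arrives, and the enriched event is handed on. An H2O POJO is a generated Java class, so the Python below is only an illustration. `score_stream` and `toy_model` are hypothetical names, and `toy_model` is a placeholder rule rather than a trained model.

```python
from typing import Any, Callable, Dict, Iterable, Iterator

Event = Dict[str, Any]


def score_stream(events: Iterable[Event],
                 score: Callable[[Event], float]) -> Iterator[Event]:
    """Lazily attach a score to each event as it is consumed."""
    for event in events:
        # Copy the event so the original record is left untouched.
        yield {**event, "score": score(event)}


def toy_model(event: Event) -> float:
    """Hypothetical placeholder; a real deployment would call the exported model here."""
    return 1.0 if event["amount"] > 50 else 0.0


if __name__ == "__main__":
    incoming = [{"amount": 10}, {"amount": 60}]
    for scored in score_stream(incoming, toy_model):
        print(scored)
```

Because `score_stream` is a generator, it works the same whether `events` is a list or an unbounded iterator fed by a consumer loop.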
Curious whether any readers have found an approach that improves on Apache Kafka for the Data Lake data integration problem, and likewise for the Machine Learning side.