Machine Learning: Spark vs Flink
Interesting slide deck from Capital One comparing Flink vs Spark. H2O has Spark integration through Sparkling Water, which is all well and good, but Flink looks more interesting 🙂
I haven't done much work with TensorFlow, but I see it has its own cluster for distribution. However, Databricks has integrated it with Spark.
Deeplearning4j provides an interesting comparison of a number of ML libraries, with mention of Spark, but no mention of Flink.
There is some benchmarking of Spark and Flink here, with outcomes that are in many ways expected:
Apache Flink outperforms Apache Spark in processing machine learning & graph algorithms and relational queries but not in batch processing!
Some searching, and we get to Full Metal Data Lake. Interesting, but not quite what I’m looking for. However, it does point me at Apache Drill.
“Data Lake Architecture Considerations & Composition” provides direction on a Data Lake architecture being composed of three layers and three tiers. Extremely helpful, and one of the better articles I’ve found on data lakes. Probably also worth a read is “2nd Version of Data Lake vs. Data Warehouse”.
Back to Flink: Zalando’s next-generation data integration and distribution platform, Saiki, provides some thoughts on architecture. Nice to see Saiki’s unified log uses Apache Kafka to feed the data lake – great choice 🙂