Machine Learning: Spark vs Flink

April 26, 2016

Interesting slide deck from Capital One comparing Flink vs Spark.  H2O has Spark integration through Sparkling Water, which is all well and good, but Flink looks more interesting:)

I have not done much work with TensorFlow, but I see it has its own cluster mechanism for distribution.  However, Databricks has integrated it with Spark.

Deeplearning4j provides an interesting comparison of a number of ML libraries, with mention of Spark, but no mention of Flink.

There is some benchmarking of Spark and Flink here, with outcomes that are, in many ways, expected:

Apache Flink outperforms Apache Spark in processing machine learning & graph algorithms and relational queries but not in batch processing!

Some searching, and we get to Full Metal Data Lake.  Interesting, but not quite what I’m looking for.  However, it does point me at Apache Drill.

“Data Lake Architecture Considerations & Composition” describes a Data Lake architecture composed of three layers and three tiers.  Extremely helpful, and one of the better articles I’ve found on data lakes.  Probably also worth a read is “2nd Version of Data Lake vs. Data Warehouse”.

Back to Flink, Zalando’s next generation data integration and distribution platform Saiki provides some thoughts on architecture.  Nice to see Saiki’s unified log uses Apache Kafka to feed the data lake – great choice:)

Kafka Connect – Data Lake Integration

April 26, 2016

“Hello World, Kafka Connect + Kafka Streams” offers a very interesting read on standardising the way to connect to Kafka, and thus deliver data into a streaming data platform (feeding a Data Lake and more).

“Announcing Kafka Connect: Building large-scale low-latency data pipelines” provides more detail on why you probably want to look at Connect if you are using Kafka.

Kafka Connect is a framework for large scale, real-time stream data integration using Kafka

Documentation can be found here.  Kafka Connect JDBC Connector here.
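
To make the Connect idea a little more concrete, here is a minimal sketch of the kind of configuration a JDBC source connector takes, built as a JSON payload.  The connector name, database URL, column name and topic prefix are all hypothetical placeholders, and the property names should be checked against the connector documentation for the version in use.

```python
import json


def jdbc_source_config(name, connection_url, id_column, topic_prefix):
    """Build a JSON-style body describing a JDBC source connector."""
    return {
        "name": name,
        "config": {
            "connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
            "tasks.max": "1",
            "connection.url": connection_url,
            # "incrementing" mode picks up rows whose id column has grown
            "mode": "incrementing",
            "incrementing.column.name": id_column,
            # each table is published to a topic named <prefix><table>
            "topic.prefix": topic_prefix,
        },
    }


# All values below are hypothetical placeholders.
payload = jdbc_source_config(
    name="orders-source",
    connection_url="jdbc:postgresql://localhost:5432/shop",
    id_column="id",
    topic_prefix="lake-",
)
print(json.dumps(payload, indent=2))
```

The printed JSON is the sort of body one would POST to the /connectors endpoint of a Connect worker's REST API (port 8083 by default).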

Data Lake Architecture: Stream Centric

April 26, 2016

There are numerous ways to create and feed your data lake.  One theme that is particularly interesting leverages Apache Kafka, and is well documented in “Putting Apache Kafka To Use: A Practical Guide to Building a Stream Data Platform”.  The article does a good job of explaining the ad-hoc road:

“piping between systems and applications on an as needed basis and shoe-horned any asynchronous processing into request-response web services.”

Which turns into an interesting diagram:)

The article then goes on to Version 2, appropriately named “Kafka stuff”, which has an improved architecture with well-defined flows and patterns – a “stream-centric data architecture” – and these benefits:

  1. Data Integration: The stream data platform captures streams of events or data changes and feeds these to other data systems such as relational databases, key-value stores, Hadoop, or the data warehouse.

  2. Stream processing: It enables continuous, real-time processing and transformation of these streams and makes the results available system-wide.
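
A toy sketch of those two roles, assuming nothing beyond the Python standard library: an append-only log that several consumers read at their own pace.  The class and group names are made up for illustration, and this is an in-memory stand-in for the idea, not Kafka's API.

```python
from collections import defaultdict


class EventLog:
    """Append-only log; each consumer group tracks its own read offset."""

    def __init__(self):
        self._events = []
        self._offsets = defaultdict(int)

    def append(self, event):
        self._events.append(event)

    def poll(self, group):
        """Return the events this group has not yet seen and advance its offset."""
        start = self._offsets[group]
        batch = self._events[start:]
        self._offsets[group] = len(self._events)
        return batch


log = EventLog()
for amount in (10, 25, 5):
    log.append({"type": "order", "amount": amount})

# Data integration: one consumer copies the raw events to a downstream store.
warehouse = list(log.poll("warehouse-loader"))

# Stream processing: another consumer derives a total from the same events.
running_total = sum(e["amount"] for e in log.poll("revenue-processor"))

# A later event is seen by the loader without re-reading the earlier ones.
log.append({"type": "order", "amount": 60})
late_batch = log.poll("warehouse-loader")

print(len(warehouse), running_total, len(late_batch))
```

The point of the design is that producers write once, and each downstream system reads independently rather than being wired point-to-point.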

In the case of leveraging H2O, this offers the ability to use Flow through Sparkling Water on top of Apache Spark and the Data Lake (HDFS), and also to score directly off the Apache Kafka stream using H2O POJOs, opening up the opportunity for real-time business insight pushed to the User Experience.

Both articles are well worth a read.

Curious if any readers have found an improved approach over Apache Kafka to solve the Data Lake data integration problem, and likewise the Machine Learning solution.

Artificial Intelligence for Humans

April 25, 2016

Artificial Intelligence for Humans Volume 1: Fundamental Algorithms offers a fairly easy read on a topic that has generated a lot of interest in the last few years.  Although not an overly detailed or deep book, it’s worth a read for the beginner, if only to capture a few important concepts such as:

  • Page 7 – Four ways to classify model problems – Data classification (determine the class in which the input data falls), regression analysis (training with input data), Clustering (similar to classification, but for clusters) and Time Series (mapping input values to output values)
  • Page 23 – Supervised and Unsupervised Training
  • Page 83 – K-Means clustering – breaking observations into a specified number of groups
  • Page 163 – Linear Regression.  Establishing relationships between input and output vectors
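
As a rough illustration of the last two concepts (my own minimal sketch, not code from the book), K-Means and a one-variable linear regression each fit in a few lines of plain Python.  The function names, the naive "first k points" initialisation and the sample data are all assumptions made for the example.

```python
import math


def kmeans(points, k, iterations=10):
    """Group points into k clusters; returns (centroids, labels)."""
    centroids = [points[i] for i in range(k)]  # naive init: first k points
    labels = [0] * len(points)
    for _ in range(iterations):
        # assignment step: nearest centroid for each point
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # update step: move each centroid to the mean of its members
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = tuple(
                    sum(axis) / len(members) for axis in zip(*members)
                )
    return centroids, labels


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    return slope, mean_y - slope * mean_x


points = [(1, 1), (1.5, 2), (8, 8), (9, 9.5), (1, 0.5), (8.5, 9)]
centroids, labels = kmeans(points, k=2)
slope, intercept = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
print(labels, slope, intercept)
```

Real libraries pick starting centroids more carefully and handle many input columns, but the assign-then-update loop and the least-squares fit are the core of each idea.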

Data Science: Problem Formulation

April 25, 2016

One of the issues with data science is ensuring you know what you’re attempting to solve – think of it as the ROI.  Like the constant refactoring of code that never makes it to production, hours/days/weeks can be spent on data frame construction, modelling, tuning and refinement.  However, at some point you need to step back from the cycle of modelling, revisit the problem, and validate that the problem you thought you were looking at is still the right problem, and that your solution is moving you towards a conclusion.

In my experience this follows a certain pipeline:

  • Discuss problem
  • Write down problem
  • Identify data sources
  • Refine problem with data sources in mind
  • Build data frame
  • Refine problem
  • Model
  • Capture evidence of results to construct the story of solution – useful for management and discussion
  • Refine problem
  • Tune Model
  • Write summary, next steps, present to business with ROI

Or at least something similar to the above.  Clearly I’d run this work through Kanban:)  The business can now see the value add (ROI), and decide on next steps.  This avoids questions like “What are those data scientists actually doing?”

Get a Data Lake – ELT not ETL

April 24, 2016

“Data Lake” is one of the buzzwords that has been going around for some time in the “big data” era.  Many companies/people have figured out what a data lake is, have created one, and are using it to great effect.  Others are still confused or unsure.

There are many articles and blog posts these days which provide clarity on data lakes.  Here’s one definition:

A Data Lake is a data store used for storing and processing large volumes of data. They are often used to collect raw data in native format before datasets are used for analytics purposes

Which leads to, in many ways, a pivotal line, “ELT not ETL” – thanks to James Serra’s posting.

ELT instead of ETL (loading the data into the data lake and then processing it). This can speed up transformations as the data lake is usually in a Hadoop cluster that can transform data much faster than an ETL tool

Which then leads to identifying all the data sources in your organisation, and deciding how best to extract and load the data from those sources – spreadsheets, REST services, relational databases, etc.
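
A minimal sketch of the ELT ordering, assuming two hypothetical sources (a CSV export and a REST response) and using an in-memory dict as a stand-in for the lake: the raw payloads are landed untouched, and parsing only happens later when an analysis asks for it.

```python
import csv
import io
import json

# Stand-in for the lake: raw payloads stored untouched, keyed by source name.
lake = {}


def extract_and_load(source_name, raw_text):
    """E + L: land the data in its native format, with no parsing yet."""
    lake.setdefault(source_name, []).append(raw_text)


def transform_orders():
    """T: parse the raw payloads only when an analysis needs them."""
    rows = []
    for raw in lake.get("orders_csv", []):
        for rec in csv.DictReader(io.StringIO(raw)):
            rows.append({"id": int(rec["id"]), "amount": float(rec["amount"])})
    for raw in lake.get("orders_rest", []):
        for rec in json.loads(raw):
            rows.append({"id": int(rec["id"]), "amount": float(rec["amount"])})
    return rows


# Hypothetical sample payloads from two different sources.
extract_and_load("orders_csv", "id,amount\n1,9.50\n2,20.00\n")
extract_and_load("orders_rest", '[{"id": 3, "amount": 4.25}]')

orders = transform_orders()
print(len(orders), sum(o["amount"] for o in orders))
```

Because the raw text stays in the lake, a different transformation can be written later without going back to the source systems.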

H2OFrame – Loading Data

April 20, 2016

“Apache Spark is a fast and general engine for large-scale data processing”.  Clearly this is the reason H2O has been married to Spark through Sparkling Water.  This integration offers:

  • Utilities to publish Spark data structures (RDDs, DataFrames) as H2O’s frames and vice versa
  • DSL to use Spark data structures as input for H2O’s algorithms
  • Basic building blocks to create ML applications utilizing Spark and H2O APIs
  • Python interface enabling use of Sparkling Water directly from pySpark

“How-to: Build a Machine-Learning App Using Sparkling Water and Apache Spark” provides insight into the conversion of data between Spark and H2O – from a Spark resilient distributed dataset (RDD) to an H2OFrame and vice versa.

H2O offers a few other ways to load data into H2OFrames, as listed on this documentation page:

  • local filesystems
  • HDFS
  • S3

I’m therefore wondering whether, if the data set is relatively small, it’s easier to expose the data through a REST endpoint rather than downloading it to CSV just to load the file into an H2OFrame.  Maybe in the scenario where, at the end of a time period, I want a snapshot of the data?  I can see the advantage of Sparkling Water when I’ve got data in HDFS or I need the Spark cluster’s power on a particular problem.  However, for small datasets, I’m not sure one needs Sparkling Water.
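
A rough sketch of the REST idea, assuming the endpoint returns a JSON array of flat records: reshape the response into a dict of columns.  The payload below is a hypothetical stand-in for a real response body, and the helper name is made up; a dict of column lists should be a shape the H2O Python client can turn into a frame, but that step is left out here and is worth checking against the H2O documentation.

```python
import json


def records_to_columns(json_text):
    """Turn a JSON array of flat records into a dict of column lists."""
    records = json.loads(json_text)
    columns = {}
    # first pass: collect every column name, in order of first appearance
    for rec in records:
        for key in rec:
            columns.setdefault(key, [])
    # second pass: fill each column, using None where a record lacks the key
    for rec in records:
        for key, values in columns.items():
            values.append(rec.get(key))
    return columns


# Stand-in for the body of a REST response (hypothetical payload).
body = '[{"id": 1, "amount": 9.5}, {"id": 2, "amount": 20.0, "region": "EU"}]'
columns = records_to_columns(body)
print(columns)
```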

Which leads to the following thought process:

  • If you’re streaming data, then decide where the POJO model is going to run – it’s Java, so this should be easy – as a data subscriber.
  • If you have lots of data, Sparkling Water and a cluster are needed to ensure performance
  • If you’re spiking, RStudio or similar for accessing H2O is good enough
  • Hooking in streaming data to Apache Spark and H2O doesn’t seem to be required, given the first bullet point above.
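
The thought process above can be written down as a tiny decision helper.  This is only a sketch: the function name, the two boolean inputs and the order in which the checks are applied are my own assumptions, not anything prescribed by H2O.

```python
def choose_h2o_setup(streaming, large_data):
    """Map the rough decision rules onto a suggested setup."""
    if streaming:
        # the exported model is plain Java, so it can score inside a subscriber
        return "POJO model inside a Java data subscriber"
    if large_data:
        return "Sparkling Water on a Spark cluster"
    # small data or a quick spike
    return "RStudio (or similar) talking to H2O directly"


print(choose_h2o_setup(streaming=False, large_data=False))
```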
