H2OFrame – Loading Data

“Apache Spark is a fast and general engine for large-scale data processing.” Clearly this is the reason H2O has been married to Spark through Sparkling Water. This integration offers:

  • Utilities to publish Spark data structures (RDDs, DataFrames) as H2O’s frames and vice versa
  • DSL to use Spark data structures as input for H2O’s algorithms
  • Basic building blocks to create ML applications utilizing Spark and H2O APIs
  • Python interface enabling use of Sparkling Water directly from pySpark

“How-to: Build a Machine-Learning App Using Sparkling Water and Apache Spark” provides insight into the conversion of data between Spark and H2O: from a Spark resilient distributed dataset (RDD) to an H2OFrame and vice versa.

H2O offers a few other ways to load data into H2OFrames, as described on this documentation page:

  • local filesystems
  • HDFS
  • S3

I’m therefore wondering whether, if the data set is relatively small, it’s probably easier to expose the data through a REST endpoint rather than downloading it to CSV just to load the file into an H2OFrame. Maybe in the scenario where, at the end of a time period, I want a snapshot of the data? I can see the advantage of Sparkling Water when I’ve got data in HDFS or I need the power of the Spark cluster on a particular problem. However, for small datasets, I’m not sure one needs Sparkling Water.
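As a rough sketch of the REST idea, here is how a small JSON response might be reshaped into columns ready for an H2OFrame, with no CSV step in between. This assumes the endpoint returns a JSON array of flat records; the sample payload, its field names, and the `records_to_columns` helper are all made up for illustration. The commented-out lines at the end assume the `h2o` Python package and a running H2O instance, where `h2o.H2OFrame` accepts a dictionary of column lists.

```python
import json


def records_to_columns(payload):
    """Turn a JSON array of flat records into a dict of column lists."""
    records = json.loads(payload)
    if not records:
        return {}
    columns = {}
    # Use the keys of the first record as the column names.
    for name in records[0]:
        # Records missing a key contribute None for that column.
        columns[name] = [record.get(name) for record in records]
    return columns


# Hypothetical response body from a small REST endpoint.
sample = '[{"symbol": "ABC", "price": 10.5}, {"symbol": "XYZ", "price": 20.0}]'
columns = records_to_columns(sample)
print(columns)

# With the h2o package installed and an H2O instance available (assumption):
# import h2o
# h2o.init()
# frame = h2o.H2OFrame(columns)
```

For a data set of this size the whole payload fits comfortably in memory, which is the point: nothing here needs Spark.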

Which leads to the following thought process:

  • If you’re streaming data, decide where the POJO model is going to run. It’s Java, so this should be easy: a data subscriber.
  • If you have lots of data, Sparkling Water and a cluster are needed to ensure performance.
  • If you’re spiking, RStudio or similar for accessing H2O is good enough.
  • Hooking streaming data into Apache Spark and H2O doesn’t seem to be required, given the first bullet point above.

~ by mdavey on April 20, 2016.
