H2OFrame – Loading Data
- Utilities to publish Spark data structures (RDDs, DataFrames) as H2O’s frames and vice versa
- DSL to use Spark data structures as input for H2O’s algorithms
- Basic building blocks to create ML applications utilizing Spark and H2O APIs
- Python interface enabling use of Sparkling Water directly from pySpark
“How-to: Build a Machine-Learning App Using Sparkling Water and Apache Spark” provides insight into the conversion of data between Spark and H2O – from a Spark resilient distributed dataset (RDD) to an H2OFrame and vice versa.
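As a quick illustration, here is a minimal PySparkling sketch of that round trip. The conversion method names have shifted between Sparkling Water releases (recent ones use asH2OFrame/asSparkFrame), so treat this as indicative rather than definitive:

```python
from pyspark.sql import SparkSession
from pysparkling import H2OContext

spark = SparkSession.builder.appName("sw-conversion").getOrCreate()
hc = H2OContext.getOrCreate()  # starts H2O nodes inside the Spark cluster

# A tiny Spark DataFrame standing in for real data
spark_df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "label"])

# Publish the Spark DataFrame as an H2OFrame ...
h2o_frame = hc.asH2OFrame(spark_df)

# ... and convert it back to a Spark DataFrame
spark_df_again = hc.asSparkFrame(h2o_frame)
```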
H2OFrame offers a few other ways to load data, as described on its documentation page:
- local filesystems
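For the local-filesystem case, plain h2o-py is enough and no Spark is involved; a short sketch (the file path is just a placeholder):

```python
import h2o

h2o.init()  # connect to (or start) a local H2O instance

# import_file parses the file on the H2O cluster side;
# replace the path with a real CSV of your own
frame = h2o.import_file("/tmp/example.csv")
frame.describe()
```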
I’m therefore wondering: if the data set is relatively small, it’s probably easier to expose the data through a REST endpoint rather than downloading it to CSV just to load the file into an H2OFrame. Maybe in the scenario where, at the end of a time period, I want a snapshot of data? I can see the advantage of Sparkling Water when I’ve got data in HDFS or I need the Spark cluster’s power on a particular problem. However, for small datasets, I’m not sure one needs Sparkling Water.
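On the REST-endpoint idea: h2o.import_file also accepts URLs, so a small dataset exposed over HTTP can be pulled straight into an H2OFrame with no intermediate CSV download step. A sketch, with a hypothetical endpoint serving the end-of-period snapshot:

```python
import h2o

h2o.init()

# Hypothetical HTTP endpoint returning the snapshot as CSV
snapshot = h2o.import_file("http://example.com/metrics/snapshot.csv")
print(snapshot.dim)  # [rows, columns]
```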
Which leads to the following thought process:
- If you’re streaming data, then decide where the POJO model is going to run – it’s Java, so this should be easy – most likely in the data subscriber (see the sketch after this list).
- If you have lots of data, Sparkling Water and a cluster are needed to ensure performance.
- If you’re spiking, RStudio or similar for accessing H2O is good enough.
- Hooking streaming data into Apache Spark and H2O doesn’t seem to be required, given the first bullet point above.
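For that first point, the POJO can be exported from Python once a model is trained; h2o.download_pojo is the real call, but the GLM setup and the public prostate demo dataset below are purely illustrative:

```python
import h2o
from h2o.estimators.glm import H2OGeneralizedLinearEstimator

h2o.init()

# Illustrative training run on an H2O public demo dataset
frame = h2o.import_file(
    "https://s3.amazonaws.com/h2o-public-test-data/smalldata/prostate/prostate.csv"
)
frame["CAPSULE"] = frame["CAPSULE"].asfactor()  # make it a classification target

model = H2OGeneralizedLinearEstimator(family="binomial")
model.train(x=["AGE", "PSA", "GLEASON"], y="CAPSULE", training_frame=frame)

# Export the model as a plain Java POJO that a data subscriber can embed
h2o.download_pojo(model, path="/tmp")
```

The exported .java file compiles against h2o-genmodel.jar, which is what makes the “run it in the subscriber” option straightforward for streaming scoring.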