Streaming data into H2O
This doesn't seem to be widely publicised, which I find strange given the real-time world we live in these days, but after some web searching I found a relevant article on the H2O World 2015 Training site.
- Databricks streaming
- Diving into Spark Streaming’s Execution Model
- Improvements to Kafka integration of Spark Streaming
- How-to: Build a Machine-Learning App Using Sparkling Water and Apache Spark
Real-time Predictions With H2O on Storm isn't quite what I was looking for, as I'm more interested in Apache Kafka. However, it's close enough to the problem to be useful 🙂
Source code can be found here. TestH2ODataSpout.java generates the fake real-time data. H2OStormStarter.java is the interesting code: it consumes the real-time data and pushes results out via the collector's emit() and ack() functions.
PredictionBolt emits tuples via the collector to ClassifierBolt, which writes the data to a file (out) and also emits the classification result to the collector.
The cheat is that ClassifierBolt writes to a file (out), which is then read by the JS code. In reality the results from ClassifierBolt should probably go via a websocket to the HTML user interface.
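To make the flow between the two bolts easier to picture, here is a rough sketch in plain Java with no Storm dependency. The classes PredictionStage and ClassifierStage are hypothetical stand-ins for PredictionBolt and ClassifierBolt, the scoring rule and the "positive"/"negative" labels are made up, and the thresholding step is my assumption about what "classification" means here. The point is only that the classifier pushes to a pluggable sink, so the file could later be swapped for a websocket without touching the rest.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

public class BoltChainSketch {

    /** Hypothetical stand-in for PredictionBolt: scores a row and hands the score downstream. */
    static final class PredictionStage {
        private final Consumer<Double> downstream;

        PredictionStage(Consumer<Double> downstream) {
            this.downstream = downstream;
        }

        void execute(double[] row) {
            // Placeholder scoring (assumption): mean of the features, clamped to [0, 1].
            // A real bolt would call the exported model here instead.
            // Rows are assumed to be non-empty.
            double sum = 0.0;
            for (double v : row) {
                sum += v;
            }
            double score = Math.max(0.0, Math.min(1.0, sum / row.length));
            downstream.accept(score);
        }
    }

    /** Hypothetical stand-in for ClassifierBolt: thresholds the score and pushes a label to a sink. */
    static final class ClassifierStage implements Consumer<Double> {
        private final double threshold;
        private final Consumer<String> sink;

        ClassifierStage(double threshold, Consumer<String> sink) {
            this.threshold = threshold;
            this.sink = sink;
        }

        @Override
        public void accept(Double score) {
            // The sink could be a file writer, a websocket session, or (as in main) a list.
            sink.accept(score >= threshold ? "positive" : "negative");
        }
    }

    public static void main(String[] args) {
        List<String> received = new ArrayList<>();
        ClassifierStage classifier = new ClassifierStage(0.5, received::add);
        PredictionStage prediction = new PredictionStage(classifier);

        prediction.execute(new double[] {0.9, 0.7});
        prediction.execute(new double[] {0.1, 0.2});
        System.out.println(received);
    }
}
```

Because the sink is just a Consumer<String>, replacing the in-memory list with something that writes to a socket is a one-line change at the point where ClassifierStage is constructed.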
Net out: whether you use Storm, Kafka or some other technology is irrelevant. The key is exporting the model as a Java POJO, which can be done from R with h2o.download_pojo().
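As a rough illustration of why the transport doesn't matter, the sketch below scores rows from any Iterator, which could just as easily be backed by a Kafka consumer as by a Storm spout. Everything here is an assumption made for illustration: RowScorer is a hypothetical interface and not the actual class that h2o.download_pojo() generates, and ToyModel uses made-up coefficients in place of a real exported model.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

public class PojoScoringSketch {

    /** Hypothetical interface standing in for an exported model class. */
    interface RowScorer {
        /** Returns {P(class 0), P(class 1)} for one row of numeric features. */
        double[] score(double[] row);
    }

    /** Toy logistic model with made-up coefficients (not a real H2O model). */
    static final class ToyModel implements RowScorer {
        @Override
        public double[] score(double[] row) {
            double z = -1.0 + 2.0 * row[0] - 0.5 * row[1];
            double p1 = 1.0 / (1.0 + Math.exp(-z));
            return new double[] {1.0 - p1, p1};
        }
    }

    /** Transport-agnostic scoring loop: the source of rows is just an Iterator. */
    static List<String> scoreStream(RowScorer model, Iterator<double[]> rows, double threshold) {
        List<String> labels = new ArrayList<>();
        while (rows.hasNext()) {
            double[] probs = model.score(rows.next());
            labels.add(probs[1] >= threshold ? "positive" : "negative");
        }
        return labels;
    }

    public static void main(String[] args) {
        // An in-memory list plays the role of the stream in this sketch.
        Iterator<double[]> rows = Arrays.asList(
                new double[] {1.0, 0.0},
                new double[] {0.0, 2.0}).iterator();
        System.out.println(scoreStream(new ToyModel(), rows, 0.5));
    }
}
```

The model is an ordinary Java object with no dependency on a running cluster, so the same scoring loop can sit inside a bolt, a Kafka consumer loop, or a unit test.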