RStudio H2O Data Munging

The following is probably already well known to RStudio H2O experts, but for anyone not familiar with H2O, it may help.

Some context to begin: I’ve got various data sets and want to start experimenting with the various H2O algorithms, ideally to gain insight from the data through predictions and through how the data is clustered. The initial data set isn’t large, which I suspect will inhibit algorithms such as deep learning.

When loading a CSV file with headers via h2o.uploadFile, make sure each column header is unique 🙂 Obvious, but not always 🙂

data.hex <- h2o.uploadFile(path = "/Users/Matt/Downloads/sampledata.csv", header = TRUE, sep = ",", destination_frame = "data.hex")
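One way to catch repeated headers before uploading is to read only the header line in base R. This is a minimal sketch rather than anything H2O provides: duplicated_headers is a hypothetical helper name, and the temporary file with its values is made up purely for illustration.

```r
# Hypothetical helper, not part of H2O: list header names that appear more
# than once in the first line of a delimited file.
duplicated_headers <- function(path, sep = ",") {
  # scan() reads just the header line and splits it on the separator.
  header <- scan(path, what = character(), sep = sep, nlines = 1, quiet = TRUE)
  unique(header[duplicated(header)])
}

# Illustrative file with a repeated "age" column (made-up values).
csv.path <- tempfile(fileext = ".csv")
writeLines(c("id,age,age,score", "1,30,31,0.5"), csv.path)

dupes <- duplicated_headers(csv.path)
if (length(dupes) > 0) {
  message("Duplicate column headers: ", paste(dupes, collapse = ", "))
}
```

Pointing the helper at the real CSV path before calling h2o.uploadFile gives a quick check that the headers are safe to load.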

colnames and summary are sometimes useful:
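A minimal sketch of those two calls, assuming the frame is loaded as data.hex with the same h2o.uploadFile call used earlier in this post and that an H2O cluster can be started locally:

```r
library(h2o)
h2o.init()  # start or connect to a local H2O cluster

# Same upload call as earlier in the post; the path is specific to that machine.
data.hex <- h2o.uploadFile(path = "/Users/Matt/Downloads/sampledata.csv",
                           header = TRUE, sep = ",",
                           destination_frame = "data.hex")

colnames(data.hex)  # column names as a character vector
summary(data.hex)   # per-column summary statistics
```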


RStudio Tips can be useful.

Histograms can be easily displayed for a column using:

h2o.hist(data.hex$"<column name>")

Get to the help page of an H2O algorithm using help:
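A minimal sketch of what that looks like, assuming the h2o package is attached and taking h2o.glm as the example function (any other H2O function name works the same way):

```r
library(h2o)

help(h2o.glm)          # open the help page for the GLM function
?h2o.glm               # shorthand for the same call

help(package = "h2o")  # index of every documented function in the package
```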


Cross Validated is a question and answer site for people interested in statistics, machine learning, data analysis, data mining, and data visualization – I’ve found it useful a few times for answers to questions.

The R H2O tutorial is helpful, but what is missing is more detail on understanding the output from the algorithms. As an example, take the sample code for prostate.csv. Once the generalized linear model is applied to the dataset, more information on the output, including the summary() output, would help beginners. I suspect there is some book or other reading material that fills this knowledge gap.

Rewriting the prostate code for H2O 3.8.06 leads to the following:

library(h2o)
h2o.init()
prostate.hex <- h2o.importFile(path = "")
data.split <- h2o.splitFrame(data = prostate.hex, ratios = 0.8)
prostate.train <- data.split[[1]]
prostate.test <- data.split[[2]]
prostate.glm <- h2o.glm(y = "CAPSULE", x = c("AGE", "RACE", "PSA", "DCAPS"), training_frame = prostate.train, validation_frame = prostate.test, family = "binomial", nfolds = 10, alpha = 0.5)
# The name assigned here was lost in the original; prostate.fit is an assumed name.
prostate.fit <- h2o.predict(object = prostate.glm, newdata = prostate.hex)
summary(prostate.fit)

Which provides the following output from summary():

predict           p0                  p1
Min.   :0.0000    Min.   :0.001468    Min.   :0.1590
1st Qu.:0.0000    1st Qu.:0.538552    1st Qu.:0.2911
Median :1.0000    Median :0.663140    Median :0.3369
Mean   :0.5737    Mean   :0.588795    Mean   :0.4112
3rd Qu.:1.0000    3rd Qu.:0.708884    3rd Qu.:0.4614
Max.   :1.0000    Max.   :0.840978    Max.   :0.9985

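Based on this posting, the prediction frame provides “the predicted label with the probabilities of all possible outcomes (or numeric outputs for regression problems)”. Here predict is the predicted label, and p0 and p1 are the probabilities of CAPSULE being 0 and 1.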
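To dig a little further into the model and its predictions, something along these lines may help. It is a sketch under a few assumptions that are not in the code above: it reads the copy of prostate.csv bundled with the h2o R package, trains on the whole frame for brevity, converts CAPSULE to a factor (newer H2O releases expect a categorical response for a binomial model, so the summary will differ from the one shown), and uses prostate.fit as an assumed name for the prediction frame.

```r
library(h2o)
h2o.init()

# Assumption: use the prostate.csv that ships with the h2o R package.
prostate.path <- system.file("extdata", "prostate.csv", package = "h2o")
prostate.hex <- h2o.importFile(path = prostate.path)

# Assumption: treat the response as categorical for family = "binomial".
prostate.hex$CAPSULE <- as.factor(prostate.hex$CAPSULE)

prostate.glm <- h2o.glm(y = "CAPSULE", x = c("AGE", "RACE", "PSA", "DCAPS"),
                        training_frame = prostate.hex, family = "binomial",
                        alpha = 0.5)

# prostate.fit is an assumed name for the prediction frame.
prostate.fit <- h2o.predict(object = prostate.glm, newdata = prostate.hex)

head(prostate.fit)         # one row per observation: predict, p0, p1
h2o.coef(prostate.glm)     # fitted coefficient for each predictor

perf <- h2o.performance(prostate.glm, newdata = prostate.hex)
h2o.auc(perf)              # area under the ROC curve
h2o.confusionMatrix(perf)  # confusion matrix at H2O's default threshold
```

In each row p0 and p1 sum to 1, and the predict column is the label H2O picks after applying its classification threshold to p1.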

~ by mdavey on April 4, 2016.
