For all of us who have hit the proverbial “R” wall due to memory size limitations, H2O is a welcome relief. H2O (www.h2o.ai) is an open-source, in-memory, distributed machine learning platform.
H2O’s core code is written in Java. Inside H2O, a Distributed Key/Value store is used to access and reference data, models, objects, etc., across all nodes and machines. The algorithms are implemented on top of H2O’s distributed Map/Reduce framework and utilize the Java Fork/Join framework for multi-threading. [see: http://docs.h2o.ai/h2o/latest-stable/h2o-docs/welcome.html]
The biggest advantage I found was the ease of switching back and forth between what is called an H2O frame and the R data frame. The moment we switch to an H2O frame, the code runs on the H2O cluster that we set up. Setting up the H2O cluster, even on your own laptop, is a breeze, and the commands to invoke H2O from within RStudio are very straightforward; this tutorial walks through connecting RStudio to Sparkling Water: https://github.com/h2oai/h2o-3/blob/master/h2o-docs/src/product/howto/Connecting_RStudio_to_Sparkling_Water.md
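As a minimal sketch (assuming the h2o package is installed and a local single-node cluster is fine for your purposes), the round trip between an R data frame and an H2O frame looks roughly like this:

```r
library(h2o)

# Start (or connect to) a local H2O cluster; nthreads = -1 uses all cores
h2o.init(nthreads = -1, max_mem_size = "4G")

# Push an R data frame into the cluster as an H2O frame
cars_hex <- as.h2o(mtcars, destination_frame = "cars_hex")

# Operations on the H2O frame run on the H2O cluster, not in the R session
summary(cars_hex)

# Pull results back into a plain R data frame when needed
cars_df <- as.data.frame(cars_hex)
```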
You can quickly get started with machine learning in H2O from RStudio with this easy-to-follow tutorial: http://docs.h2o.ai/h2o/latest-stable/h2o-docs/booklets/RBooklet.pdf
H2O does many of the things R does: transformations, aggregations, and so on (a short example follows the list below). It also claims a rapidly expanding library of machine learning algorithms. The documentation is easy to follow, which is a big plus, and some of the world’s largest firms are quoted on H2O’s website as users of the product. H2O also includes an interesting suite of tools with cool-sounding names:
- Base H2O
- Sparkling Water (combining Spark and H2O…nice wordplay)
- Steam (end-to-end AI engine to streamline deployment of apps)
- Deep Water (state-of-the-art deep learning models in H2O)
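As an example of the aggregation side, here is a rough sketch of a group-by computed directly on an H2O frame via h2o.group_by from the h2o R package (the file path and column names are placeholders for illustration):

```r
library(h2o)
h2o.init()

# Import a CSV straight into the cluster as an H2O frame
# ("flights.csv" and the columns below are illustrative placeholders)
flights_hex <- h2o.importFile("flights.csv", destination_frame = "flights_hex")

# The group-by aggregation runs on the H2O cluster, not in R's memory
by_carrier <- h2o.group_by(data = flights_hex,
                           by = "carrier",
                           nrow("carrier"),
                           mean("dep_delay"))

# Bring the small summary table back into R
as.data.frame(by_carrier)
```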
I ran a random forest model with 500 trees on 1.8 million records, and it finished pretty quickly on my laptop. Obviously, the real computational power can be harnessed only when H2O runs on a large cluster with several nodes: the H2O billion-row machine learning benchmark for a logistic regression problem is said to take ~35 seconds on 16 EC2 nodes, and performance supposedly improves as more nodes are added (see http://www.stat.berkeley.edu/~ledell/docs/h2o_hpccon_oct2015.pdf for a detailed performance assessment).
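For reference, a random forest along those lines looks something like the sketch below (the file path, column names, and dataset are placeholders, not the data I used):

```r
library(h2o)
h2o.init(nthreads = -1)

# Load the training data into the cluster
# ("train.csv", "feature*" and "target" are illustrative placeholders)
train_hex <- h2o.importFile("train.csv")

# Fit a distributed random forest with 500 trees
rf_model <- h2o.randomForest(x = c("feature1", "feature2", "feature3"),
                             y = "target",
                             training_frame = train_hex,
                             ntrees = 500,
                             seed = 1234)

# Inspect variable importances and training performance
h2o.varimp(rf_model)
h2o.performance(rf_model)
```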
All in all, H2O is a great alternative to try out when you need to crunch extremely large datasets that R alone cannot handle.