For all of us who have hit the proverbial “R” wall due to memory size limitations, H2O is a welcome relief. H2O (www.h2o.ai) is an open-source, in-memory, distributed machine learning platform.

H2O’s core code is written in Java. Inside H2O, a Distributed Key/Value store is used to access and reference data, models, objects, etc., across all nodes and machines. The algorithms are implemented on top of H2O’s distributed Map/Reduce framework and utilize the Java Fork/Join framework for multi-threading. [see: http://docs.h2o.ai/h2o/latest-stable/h2o-docs/welcome.html]

The biggest advantage I found was the ease of switching back and forth between what is called an H2O frame and an R data frame. The moment we switch to an H2O frame, the code runs on the H2O cluster that we set up. Setting up the H2O cluster, even on your own laptop, is a breeze, and the commands to invoke H2O from within RStudio are very straightforward; see this tutorial: https://github.com/h2oai/h2o-3/blob/master/h2o-docs/src/product/howto/Connecting_RStudio_to_Sparkling_Water.md
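As a minimal sketch (assuming the h2o R package is installed and a Java runtime is available — H2O runs on the JVM), spinning up a local cluster and moving data back and forth looks roughly like this:

```r
library(h2o)

# Start (or connect to) a local H2O cluster, using all available cores
h2o.init(nthreads = -1)

# Push an R data frame into the cluster as an H2O frame
cars_hf <- as.h2o(mtcars)

# ...distributed transformations and models run on the cluster here...

# Pull results back out as a plain R data frame
cars_df <- as.data.frame(cars_hf)
```

Note that as.h2o() copies the data onto the cluster, so from that point on operations on cars_hf execute in H2O, not in your R session.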

You can quickly get started with machine learning in H2O within RStudio with this easy-to-use tutorial: http://docs.h2o.ai/h2o/latest-stable/h2o-docs/booklets/RBooklet.pdf

H2O does many things that R does: transformations, aggregations, etc. It also claims to have a rapidly expanding library for machine learning. The documentation is easy to follow, which is a big plus. Some of the world’s largest firms are quoted on H2O’s website as users of the product. H2O also includes an interesting suite of tools with cool-sounding names:

  • Base H2O
  • Sparkling Water (combining Spark and H2O…nice wordplay)
  • Steam (end-to-end AI engine to streamline deployment of apps)
  • Deep Water (state-of-the-art deep learning models in H2O)

I ran a random forest model with 500 trees on 1.8 million records, and it ran quite quickly on my laptop. Obviously, the real computational power can be harnessed and experienced only when it is run on a large cluster with several nodes. The H2O billion-row machine learning benchmark for solving a logistic regression problem is said to take ~35 seconds on 16 EC2 nodes, and the performance supposedly gets better as more nodes are added (see: http://www.stat.berkeley.edu/~ledell/docs/h2o_hpccon_oct2015.pdf for a detailed performance assessment).
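For reference, the kind of run described above can be sketched as follows. The frame name train and the response column "target" are hypothetical stand-ins for your own data, not names from my actual experiment:

```r
library(h2o)
h2o.init()

# 'train' is assumed to be an H2O frame already loaded on the cluster,
# with a response column named "target" (both names are hypothetical)
rf <- h2o.randomForest(
  x              = setdiff(names(train), "target"),  # all other columns as predictors
  y              = "target",
  training_frame = train,
  ntrees         = 500
)

h2o.performance(rf)  # performance metrics for the fitted model
```

Because the frame lives on the cluster, the 500 trees are built in parallel across whatever nodes and cores H2O has available.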

All in all, H2O is a great alternative to try as you crunch those extremely large datasets where R alone cannot help.


Analytics Strategy: Consulting v/s Hub & Spoke v/s The Embedded Model

How should organizations define an analytics strategy? There are three usual models I have come across, with differing penetration, reach, and success. What works for one organization may or may not work for others; that much is axiomatic.

The Consulting Model:  In this model, a tight group of data science and analytics professionals works closely with business units to understand challenges and to design, develop, and deliver the analytics. As the name suggests, the approach is consultative in nature (shorter-term projects). This approach delivers the most bang for the buck, since the team can achieve quick wins for business leaders and demonstrate the value of analytics, leading to potential sustained consultative engagements with the business units. This model works well in organizations relatively new to analytics.

The Hub and Spoke Model:  This model relies on a central hub or center of excellence, which builds an entire team of data scientists, data engineers, and analytics professionals. Such hubs/COEs are given the mandate to serve as a clearing house for analytics in the organization. Many examples of this model exist in very large, mature data organizations (IBM and Microsoft, among others). The spoke refers to small teams dispatched from the hub to design, develop, and deliver analytics. This provides a more sustainable approach for organizations dealing with external accounts/clients/partners for deploying analytics. The COE will continue to serve as the delivery arm for the analytics, since it has all the data, infrastructure, and personnel in one place.

The Embedded Model:  This model has embedded analytics teams within business units. Usually, this is an approach taken by financial companies where specialized teams work with the business in continuously delivering insights. Obviously, this is not a scalable approach for organizations, albeit successful “locally.” This provides business units with analysts who do not have to be coached on the ins and outs of, say, quantitative trading strategies. However, it does have the limitation of ‘tunnel vision’ with respect to solving analytics challenges.

Obviously, there are organizations that use a mix of all the above, or a subset of these approaches. Personally, I believe the hub and spoke model works best, since it is in the interest of any organization to have a long-term vision of what analytics can do. If an organization as a whole wishes to be data-driven in everything it does, the hub-and-spoke / COE model is the way to go. It also allows expert generalists to be developed over time, gaining experience and expertise across multiple business functions. This may take time to set up, but I believe the investment is worth it. The age-old adage “If you build it, they will come” works!


“The Hammer and Nail” philosophy in Analytics

In the world of data science and analytics these days, we are all faced with the key question of which technique or methodology solves which problem. During many of the interviews I have conducted over the last several years, I have heard all sorts of fancy algorithms being paraded around, without candidates really understanding why they should be used. The most common ones I hear from candidates are: support vector machines without understanding what support vectors are, deep learning without understanding what neural networks are, random forests without understanding what really makes them random, and the naive Bayes classifier without understanding why it is called ‘naive.’

The package-driven languages of data science, like R, Python, and SAS, have made it extraordinarily easy for people to use all these complex algorithms without actually understanding the underlying statistics, mathematics, and optimization principles. This is the famous “hammer and nail” analogy: when you have a hammer, everything looks like a nail. It has become commonplace for people to fire up R or Python, try every algorithm, and simply pick the one with the highest accuracy measures, without really understanding what the business and the problem need. Not all problems require deep learning. No, really, they don’t! Some of the common challenges in the insurance industry, for example, may simply need association rule mining or decision trees. Some may need more complex modeling and simulation for risk-based analyses.

One of the first exercises I used to give to my doctoral assistants in research, or my team in industry, was to code an entire algorithm without using any packages in R or Python. This gives candidates a deep understanding of the internal workings of algorithms. Data science and analytics are part art and part science. Use them wisely. Your goal is to solve a business challenge and drive business value, not to show off whatever technique is the latest and greatest trend on social media. Do not use a chainsaw where a scalpel will do.
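To illustrate the spirit of that exercise, here is a sketch of logistic regression fitted by plain batch gradient descent in base R, with no packages. The toy data (the logical AND of two binary inputs) is my own stand-in, not part of the original exercise:

```r
sigmoid <- function(z) 1 / (1 + exp(-z))

# Fit logistic regression by batch gradient descent on the average log-loss
fit_logistic <- function(X, y, lr = 0.1, iters = 5000) {
  X <- cbind(1, X)                      # prepend an intercept column
  w <- rep(0, ncol(X))
  for (i in seq_len(iters)) {
    p    <- sigmoid(X %*% w)            # predicted probabilities
    grad <- t(X) %*% (p - y) / nrow(X)  # gradient of the average log-loss
    w    <- w - lr * grad
  }
  as.vector(w)
}

# Toy data: logical AND of two binary inputs
X <- matrix(c(0, 0,
              0, 1,
              1, 0,
              1, 1), ncol = 2, byrow = TRUE)
y <- c(0, 0, 0, 1)

w     <- fit_logistic(X, y)
preds <- as.integer(sigmoid(cbind(1, X) %*% w) > 0.5)
```

Writing out the sigmoid, the gradient, and the update loop by hand is exactly the kind of thing that exposes whether someone understands what the packaged one-liner is actually doing.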

Review of “The New Digital Age” by Eric Schmidt and Jared Cohen

[Book cover image; source: http://on.thestar.com/2jcpuYI]

Have you seen the movies Minority Report, the Terminator series, The Net, and Live Free or Die Hard (Die Hard 4.0)? This book is a print version of all those movies and numerous other flicks rolled into one. Sure, it’s written by the head honchos of Google, so I guess it deserves (or rather expects) to be read by people all over. But it could just as well have been written by someone anonymous down the street. The reason I say that is that, unless you have been living under a rock, it is common knowledge that there are no disconnected devices or disconnected individuals, at least in the developed world. And in the developing world, generations of people are leapfrogging into the connected era with smartphones, by virtue of having entirely skipped the personal computer revolution. We are already in the era of smart money (bitcoin), smart homes, smart phones, smart cars, the Hyperloop and Mars colonization on the horizon, Amazon Echo, Apple Siri, and whatever else Schmidt and his team at Google are thinking up.

Don’t get me wrong; it is a decent book for understanding the inherent dualities that come with everything around us going digital. Each chapter examines how the many facets of our lives will be fundamentally transformed: ourselves, the people around us, institutions, and governments. Schmidt and Cohen also theorize on how the digitized world would influence terrorism and counter-terrorism efforts, and how it can influence repressive regimes and the people who would rebel against them. There is also a dedicated chapter on how environmental and man-made catastrophes in the digitized world can unleash innovation to speed up reconstruction efforts. A chapter that stood out was the one on the “Future of Revolution.” It discusses how ordinary citizens in the Arab Spring used technology to spread the message of freedom and brotherhood, and to coordinate peaceful protests despite technological and physical oppression by their respective regimes.

Each chapter in the book examines the pros and cons of the digital world, and each has its “protagonists”: ourselves, governments, good people, bad people. By the end of Chapter 2 or 3, it gets rather repetitive and, quite frankly, a little depressing. I am sure the book was intended to be thought-provoking as we step into the connected digital future, and it did its job! At the end of the book, I was wondering if I should relocate to a small village in a serene corner of the world, disconnect from the internet, grow my own food, and live out a simpler life with my family.

My rating is 3.0/5.0. I finished it only because I had started it. It was not a compelling read.


Much Ado about Some Things…

Originally written in 2016, edited in Jan 2017.


I cannot turn a page in a newspaper or browse for news on the internet without reading (and rolling my eyes) about the emergence, reemergence, takeover, new era, new age, and deluge of Big Data Analytics, or how machine learning algorithms like deep learning or some other cognitive, neuro, learning thing are going to save the world. We have all heard some banalities being bandied about quite a bit…

  • Big data is the next oil, the next soil, etc.
  • Data matures like wine and applications like fish
  • IoT is going to disrupt the way we live, it is a lot bigger than Big Data
  • Artificial intelligence is going to take over the planet and all our jobs (Terminator style… well, not quite; I made that one up. The other three are real, by the way.)

My personal favorite rebuttal for all of this is “Not everything that counts can be counted, and not everything that can be counted counts.” This quote was said to be hanging in Einstein’s office in Princeton (I am not sure whether that is true, but the saying makes sense). With all due respect to data scientists and other analytics professionals (I am one of them), can we all please go easy on the hype and not make it all sound so cheesy? It’s like dotcom-bubble déjà vu all over again. Every startup I hear about is using some fancy ‘new’ algorithm; every company is talking about how analytics will change the world. Sure, some will and should… for the better.

Let’s get a few things in order. Analytics and data science have been around for decades. They were just known by different, less appealing names: statistics, optimization, computer science, algorithms, etc. Clearly, none of these sound as appealing as “Data Science” or “Analytics.” We should all be thankful that industry as well as academia woke up and took notice of “smart decision-making,” and I guess some amount of branding was necessary for it to be taken seriously. Duly noted.

Now, can we get back to doing good work and not selling snake oil? Otherwise we all end up sounding ridiculous, naïve, and, quite frankly, a little annoying. The field runs the risk of being turned into a sham by some used-car salesmen (no offense to them). Let me give you a personal anecdote. I approached a conference organizer (in India) about submitting a proposal to speak at a conference. He unabashedly sent me a brochure with a detailed price list of how I could buy slots to talk about my ideas. Never once did he ask about my proposal, what the idea was, or even what the model / algorithm / application was. All he cared about was $$$.

The brochure even said I could pay extra to talk more (buy an entire session, that is). This is what knowledge in our world has come to: who can sell the snake oil better, who can market things and make them sound better, who can come up with more cool-sounding jargon, who can create entire fake conferences where people pay to talk and ideas go to die. Sure, conferences cost money, but a conference has to have a rigorous review process, as KDD or most of the IEEE conferences do.

So how do we stop this madness? I have a few pointers that some of you may agree with. I have already spoken to a few serious data scientists and they share my views.

  • Refrain from saying and posting stuff unless it makes scientific sense (do not do it just to get more likes and shares on your social media feeds)
  • Reputed websites should have strict editorial and review processes and not publish garbage
  • Serious data scientists should refrain from giving talks where you have to pay to simply buy a slot without any formal review process
  • When recruiting for your teams, consider hiring analytics professionals who are certified or those who have demonstrable skills

If we do not give any value to our own profession, trust me, no one else will. We will all end up looking like used-car salespeople.