Once again I have my head in the clouds and am quite serious about freeing up some time and retiring other commitments to get there. Essentially that means less time on the Xbox and negotiating a graceful end to all of my political writing. This weekend I'll be at Caltech's conference on the subject.
For as long as I can remember, probably going back to '03 when I got some interest from the Google people, I've wanted to know what Google was going to do about DW and BI. It turns out that their interest in me was strictly for Essbase financials, and it wasn't worth it to try and hack the economic environment of Northern Cal at that time to do what I'd been doing since '96. Besides, they gave me no hint that I would be talking to their database people. But there was little coming out of Google that let me know what they were thinking.
Even as they distributed Google Finance with their stock chart widget, which was pretty cool, they seemed a long way away from publishing either the Google OS or the Google FS. Then came BigTable and MapReduce. It was awesome.
I can recall a conference in Orlando for Solutions. It was the first time that I met Gerry Villamil's twin brother. Gerry and Leah Wheelan had just started Beacon. I shoulda joined then, but as usual my head was elsewhere. I had a great deal of respect for Gerry and I didn't want to join his organization unless and until I was certain that Essbase was what I ultimately wanted to do, essentially forever. I also met Heba Betata, who was by far one of the sharpest, no-nonsense programmers I ever met. She was presenting the evolution of what we two started at I2, an Essbase-based application for supply chain analytics. But the most memorable event was the two guys from Chase (I'm pretty sure it was Chase) who had implemented MapReduce to pre-process data into an Essbase cluster. The first toe had been put in the water towards Google/BI integration.
For some time afterwards, I followed some of the developments of BigTable via delicious and some here in the blog. I have taken up that burden again to find a great deal of progress.
In my head, as some of you may know, is a menagerie of ideas, a small fraction of which would make awesome applications. It is my estimation that I have, probably, three excellent ideas. Unfortunately they are way too big for me to handle and I don't know how to broker them. On the other hand, they really energize me and keep me going, and nobody else has come close. That is in part because they require what Big Data does, and the Big Data technologies are new. Sorta.
See, I've also paid attention to the LHC at a surface level. And I am astounded by the amount of data those guys are processing. I figure that sooner or later some of that technology is going to trickle down at an open source level. In the meantime, I have this itch about corporate IT: that basically they are ill-equipped to handle the volume and quality of data systems that the next generation will demand. Kids growing up on videogames today are going to be extraordinarily pissed at the quality of enterprise software if it stays on the current path.
It's all very cool to make jokes about Facebook if you're an ERP wizard until you start considering what it takes to handle 50 million users. And me as a non-Teradata database guy, well, I'm just embarrassed at having been leapfrogged in analytics by people sending quizzes about their zodiac signs. From the very beginnings of my career, I expected that the most demanding database applications would be coming from Wall Street and then from the Fortune 500. I didn't even consider NSA (nor did I know that it existed). My cousin started working that angle in '96. I certainly didn't expect that, outside of search engines, the web would be the place to be. As it stands, I prefer the structured data and the sorts of data that people *must* pay attention to, so I'm hardly interested in the generic web. But it's clear to me that new technologies are opening up new analytic vistas. I think that finally, the sorts of analyses some of us were brainstorming in 2000 at Hyperion are once again open for consideration. In other words, we are at the dawn of a new era of BI and DW.
There are three names in my head: Greenplum, Vertica and Cloudera. As I investigate the ways and means by which corporate IT is going to meet the challenges of dealing with ever expanding sets of data, I have gravitated towards them at the same time I have looked at other Cloud enablers. As for Oracle, well, every important enterprise software development that applies to business will either die or end up at Oracle, SAP or EMC. Or it will remain free. So I'm not concerned about conflict at all. Oracle will develop an in-house strategy and product set as well, because the expansion of business data is essentially inevitable. Those three names are about databases, but there's a lot more to this cloudy stuff. There is also cloud storage. There I'm thinking about Amazon (of course), Mosso (now the Rackspace Cloud), Dropbox, JungleDisk and Nirvanix. There's another startup that's doing green storage strategies - it's the name of an island - I forget. But I like them too.
All of this points to directions in storage management and data management that extend the data supply chain for enterprise performance management beyond its traditional boundaries. Obviously the best way to work this data today is in the analytical frameworks set up in the best-of-breed apps of Hyperion. But maximizing the value of corporate structured and unstructured data requires more than just snapping those large toolsets into place. And aside from the excitement of new database technologies, what interests me is where and how all that integration takes place. The great news is that Hadoop is probably the answer.
Hadoop is the Apache project that establishes a parallel equivalent to the Google architecture. It is a petascale data management system that takes advantage of cheap redundant hardware. I think of it as the final realization of the ultimate expression of the Beowulf cluster, but what do I know? At the core of Hadoop is Hadoop MapReduce, a programming model built on a pair of functions, map and reduce, that enables massively parallel processing over a distributed file system spread across dozens, hundreds or thousands of machine nodes with redundancy. The Hadoop project also contains a goodly number of other systems that assist in managing all that data, in particular Hive and Pig. These are the systems that will enable that angular two-way arrow between Hadoop and BI applications.
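For those who haven't seen the "pair of functions" idea up close, here is a minimal single-machine sketch in Python. It is not Hadoop's API; the names `map_fn`, `reduce_fn` and `run_mapreduce`, and the two-column "path status" log format, are all made up for illustration. The point is only the shape: map emits key/value pairs, the framework groups them by key, and reduce collapses each group.

```python
from collections import defaultdict
from itertools import chain

# Hypothetical log format for illustration: "<path> <http_status>"
LOG_LINES = [
    "/index.html 200",
    "/missing 404",
    "/index.html 200",
    "/report 500",
]

def map_fn(line):
    """Map step: emit one (status, 1) pair per well-formed log line."""
    parts = line.split()
    if len(parts) >= 2:
        yield parts[1], 1

def reduce_fn(key, values):
    """Reduce step: collapse all values seen for one key into a total."""
    return key, sum(values)

def run_mapreduce(records, mapper, reducer):
    """Toy driver: map every record, group by key, reduce each group.

    A real cluster would run the mappers and reducers on many nodes and
    shuffle the pairs across the network; here it all happens in memory.
    """
    groups = defaultdict(list)
    for key, value in chain.from_iterable(mapper(r) for r in records):
        groups[key].append(value)
    return dict(reducer(k, v) for k, v in groups.items())

counts = run_mapreduce(LOG_LINES, map_fn, reduce_fn)
print(counts)
```

Because each call to `map_fn` looks at one line and each call to `reduce_fn` looks at one key, neither needs to know about the rest of the data, which is what lets the work be spread over as many cheap machines as you care to rent.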
It comes as something of a surprise to me, but the wave of the future seems to be SQL. I'm not sure that there is much of a demand or a point for MDX or for custom open APIs to Analytic Services. I might be wrong, but the number of people learning those APIs was small, and the advantages gleaned and products generated were minimal. Perhaps Ms. Biskantaui has evidence to the contrary, but I think that SQL extensions are it. It will remain 4GL or Java, and the 4GL is SQL. In any case, Hive is an augmented SQL and I think that Vertica and Greenplum will use similar dialects in order to leverage their front-end compatibility.
All this means that hybridization is afoot. The low cost of entry to this new sort of DW, which is extraordinarily good at handling masses of logfiles, for example, while retaining analytic capacity, means that a lot of experimentation is possible. The pay-as-you-go availability of Amazon EC2 means people can pump it or dump it as they please and expense the entire exercise.
What will be interesting to see is how each of the many prongs of market strategy (SOA, SaaS, CDN) comes to deal with the phenomenon of the new distributed parallel technologies when it comes to the next generation of enterprise applications.