I have just completed some work with a rather massive amount of data, quite enough to say that I have some experience with big data. There are several things I have learned.
Data Management Matters
The first priority in dealing with big data: the less you touch it, the better off you are. Which means, unfortunately, that an iterative approach does not work very well. You need to know exactly how you are going to deal with your data, and you need to optimize the processes before you commit to them. Many assumptions you would normally take for granted no longer hold. For example, if you have two or three ways to eliminate duplicate records within your data, you probably need to find *the* most efficient one. That means watching explain plans. That means timing a shell script against a Ruby script. That means working 'close to the iron' rather than remotely.
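To make the "time it, don't assume" point concrete, here is a minimal sketch of benchmarking two dedup strategies against each other in Ruby. The data is synthetic and the record format is hypothetical; the point is only that you measure rather than guess which approach wins at your scale.

```ruby
require 'benchmark'
require 'set'

# Synthetic stand-in for a duplicate-heavy batch of sensor records
# (hypothetical data: 500k rows, 100k distinct keys).
records = Array.new(500_000) { |i| "sensor-#{i % 100_000}" }

Benchmark.bm(10) do |x|
  # Strategy 1: Ruby's built-in hash-based dedup.
  x.report('uniq') { records.uniq }
  # Strategy 2: accumulate into a Set.
  x.report('set')  { Set.new(records).to_a }
  # Strategy 3: sort, then collapse adjacent duplicates
  # (the shape of a `sort | uniq` shell pipeline).
  x.report('sort') { records.sort.chunk_while { |a, b| a == b }.map(&:first) }
end
```

All three produce the same distinct set; which is fastest depends on data size, duplication rate, and whether the data is already sorted, which is exactly why you time it on your own data before committing.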
It's rather obvious in retrospect, but when you are dealing with 'sensor data', most of it is numbers. That means there are not a lot of attributes and keys to work with, and compression is not going to take you as far as you might expect. Therefore the grain of interest for your end-user queries is of paramount importance. Your time dimension means everything. Know the time scope of your queries and partition your data accordingly.
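In the hand-built ETL spirit described later, time partitioning can be as simple as a small Ruby script that emits range-partition DDL for Postgres. This is a sketch under assumptions: the `sensor_readings` table name and monthly grain are hypothetical, and the `PARTITION OF` syntax shown is Postgres's declarative range partitioning.

```ruby
require 'date'

# Emit monthly range-partition DDL for a hypothetical Postgres table
# of sensor readings, partitioned on its timestamp column.
def monthly_partition_ddl(parent, from, months)
  (0...months).map do |i|
    lo = from >> i          # first day of month i (Date#>> adds months)
    hi = from >> (i + 1)    # first day of the following month
    name = format('%s_%04d%02d', parent, lo.year, lo.month)
    "CREATE TABLE #{name} PARTITION OF #{parent} " \
    "FOR VALUES FROM ('#{lo}') TO ('#{hi}');"
  end
end

puts monthly_partition_ddl('sensor_readings', Date.new(2014, 1, 1), 3)
```

Pick the partition grain from the time scope of your actual queries: daily partitions for queries that scan hours, monthly for queries that scan weeks.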
I have been playing with Postgres for about two years now, and there have been a lot of practical benefits to doing so, not the least of which is that it is rather standard and has decent free tools. Now that I do so much more work with Vertica and Redshift, Postgres is my best friend. And I suspect it will remain that way as I start playing with VoltDB. Postgres on my Mac is everything that was once Access, Excel and MSSQL Express. And 95% of my code is portable. It makes a huge difference.
Everything I need in an ETL, I hand-build with Bash, Ruby and regex. My partner in this exercise in crime is Rubular.com. It is the single most irreplaceable tool in my kit. I now have a large library of parsing templates in Ruby, many of which need only the right regular expression. But let me go into detail for a moment.
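A parsing template of the kind described above can be sketched as a named-capture regex plus a tiny wrapper that turns a line into a hash. The line format here is hypothetical (timestamp, sensor id, numeric reading); in practice only the regex changes from feed to feed.

```ruby
# One named-capture regex per feed format, in extended (/x) mode so it
# can be commented. The sample line format is hypothetical.
SENSOR_LINE = /\A
  (?<ts>\d{4}-\d{2}-\d{2}\s\d{2}:\d{2}:\d{2})  # timestamp
  \s+(?<sensor>[A-Z]\d{3})                      # sensor id, e.g. A101
  \s+(?<value>-?\d+(?:\.\d+)?)                  # numeric reading
\z/x

# Parse one raw line into a hash, or nil if it doesn't match.
def parse_line(line)
  m = SENSOR_LINE.match(line.strip)
  return nil unless m
  { ts: m[:ts], sensor: m[:sensor], value: Float(m[:value]) }
end

puts parse_line('2014-03-01 12:00:05  A101  -3.25').inspect
```

This is where something like Rubular.com earns its keep: you paste a few raw lines, iterate on the expression until the captures land, then drop the finished regex into the template.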
I am convinced that the Digital Renaissance is just about over. There are a lot of innovations left in our imagination, but the fundamental leap forward in handiwork is just about complete. In a way, we coders are all like stonemasons in a world increasingly dominated by poured concrete. There are two ways to look at this, and I look at it, despite my proclivities towards hand-building, as a good thing. On the one hand, you don't build cathedrals with poured concrete; on the other hand, you do build giant factories and skyscrapers. Cathedrals are spectacular and fascinating. Skyscrapers and factories are mundane and boring. But the work that runs the world happens in the utility buildings. You may be inspired to pray for safety in the cathedral, but you build airbags in the factory.
So now that our tools are all grown up, we are moving from the Digital Renaissance to the Digital Industrial Revolution. I'd like to think of myself (with any luck) as a digital industrialist. This would be in contrast to a Digital Impresario or a Digital Tourguide. I have to say that at long last I look at Social Media as an evolutionary dead end. (Another Story)
What I mean by having my tools all grown up is precisely what I've been talking about with database interoperability and multiple database systems. We at Full360 are very serious about building open platform systems in which we ingest data into massive storage, manage data and security with key-value store databases, process transactions with tech like VoltDB and serve big data analytics with Redshift and Vertica. Is that wild or what?
By the way, we look to be on the verge of hacking virtual platforms as well. We have already seen AMIs working in VirtualBox and VMware. It's only a matter of time before we can spin up whole platforms of databases. We are deeply into that kind of customization. So yeah, that's another one of my old geek dreams coming true - what I used to call '3DB systems', thinking then of multidimensional, relational and object-oriented databases in one system. (The impetus for this dream was a lack of a CORBA toolset for requirements we got from I2 Technologies 15 years ago - prior to the release of the Java API for custom Essbase functions, but I digress). The dream was downgraded to a desire to get 'double DBAs' working for Arberion...
So I want to be a digital industrialist now that I'm all grown up and that means realizing the dreams of 'operational business intelligence' or in other words, dealing with the larger scale structured transactional data previously off limits to non-shardable analytic database engines. OMG it's like rural electrification. There is so much data we can now deal with. Not long ago we in the industry talked about 'subject specific data warehouses', now we're talking about multiples of that at a price mid-market companies can handle.
For me, I've always preferred working with aerospace and manufacturing companies. Why? Because when you have tooled up a factory, you already know what data matters and the analytic value of it (if you could only capture and use it all). So now we can put the equivalent of a mainframe on the factory floor, in the engineering labs, and connect marketing and finance people too. Awesome.
The more I live with my iPhone, the more I realize that all I really need when it comes to mobile business computing is alerts. So I'm going to go out and find something that delivers iOS badges. Something a bit more specific and sophisticated than Twitter, but not much.