I keep worrying about being formal vs being casual when I talk about what I'm doing professionally. So I have a whole lot of half-finished outlines describing the shift in my work over the past three years. But I suppose I can speak with enough authority to be casual now as I continue to talk about big data - and yes my biases will creep in.
Let's identify big data.
What makes big data big? The fact that you can no longer take operating with it for granted. It's that simple. There are several dimensions along which your data gets too big for the britches you bought it. And if you've heard this before, please forgive me, but consensus has to do with independent validation of what others are experiencing. So my aim is to find the pain as I start diagnosing what kind of data headaches you are dealing with.
In general, I find that people are much more specific and consistent when they talk about pain than when they talk about what they hope for or what they think other people might be thinking - so let's take the headache approach to defining specific types of big data.
1. Velocity
Data gets out of hand when you have to process it faster. Think of those processing windows for your SLA. What is real time, near real time? How much data do you need to see that is very recent? Is there a use case for the analysis of events that have taken place in the past two hours? 30 minutes? 5 minutes? Every database can process data in near real time, but how much do you need processed every minute? When this gets to be a headache, you are dealing with big data.
2. Veracity / Validity
When you have to process a billion records how many of them are perfect? Meaning that if one field is bogus, you have to identify that cell within the mass of data you are processing, record by record. When it comes to database ingestion, chances are you can't just segregate out a million errors and then re-run them separately as a standard function of the database. So you have to engineer a process that will do so for you. When this gets to be a headache, you are dealing with big data.
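To make that concrete, here is a minimal sketch of the kind of process you end up engineering - stream the records, route anything that fails validation to an error file, and keep the clean rows moving so the bad ones can be repaired and re-run as their own batch. The file names and the validation rule here are hypothetical.

```ruby
# Minimal sketch: split a delimited feed into clean rows and an error file so
# the bad records can be repaired and re-run as a separate batch.
# File names and the validation rule are hypothetical.
EXPECTED_FIELDS = 12

clean  = File.open('feed_clean.psv', 'w')
errors = File.open('feed_errors.psv', 'w')

File.foreach('feed_raw.psv') do |line|
  fields = line.chomp.split('|', -1)
  # A record is 'bogus' here if the field count is off or the key field is empty.
  if fields.size == EXPECTED_FIELDS && !fields[0].empty?
    clean.puts(line)
  else
    errors.puts(line)
  end
end

clean.close
errors.close
```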
3. Variety
In your current system, there is some data you have to deal with that makes you cringe because according to the current business rules you have to make six passes over the table in order to categorize it properly. Fortunately there are only 500,000 of those records. If that section of the data gets big, you have a problem. Or how about this. Some data you have requires 255 characters and there are no short descriptions. Or some data you have possesses two short keys and 275 floating point values. Big data means that these typical 'outliers' may now exist in significant sizes. If your weird data is getting larger, you are dealing with big data.
4. Virtual Hardware
Obviously, some numbers just sound big. Since I am learning to think *really* big when I consider the capacities of Amazon Redshift, things change. "More than I ever had to deal with" is actually quite big enough. But let's talk machine size for a minute. Once upon a time, The Gap was tapping their feet waiting for an OLAP engine that could handle a million-member dimension - for their marketing. They made available Sun's largest single machine, an E9000 or some such, and re-niced our processes so we could have more of the box than any single program before. Once upon another time, HP made a 36-processor Superdome available for my team as we processed data for a design at Boeing. In those days, that was huge. Today both the Sun and the HP could fit inside the processing space of a five-node 8XL Redshift cluster with plenty of room to spare. If you cannot conceive of compute power above single boxes, no matter how large those boxes may be, your data may soon become a headache.
Let me expand upon this point because it is the point on which some rather serious economics hinge, especially the economics of your job in the enterprise.
In business computing, there will always be unpredictable market conditions, evolving product performance data, changing skill and knowledge sets within the management and rank and file, and wildly varying customer information, not to mention regulatory compliance reports. That means keeping your business on top of the data being generated in your industry will always be beyond your ability - even if you have all the resources of Fort Meade. So what you have been allowed to build has always been constrained by budget, time and resources.
The Big Data revolution is all about putting better tools and processes in place which allow you to do more with less. If new products and technology don't expand your capacity, then you are wasting time, money and effort. So the opportunity should be considered in terms of what more you can do with these new advances.
If your organization's aims are to cut costs and keep the same functionality, your job is at risk anyway. It's not the cloud that threatens your position, it is your management's view of the value of your work to the company. So if you feel threatened by the existence of cloud computing, what you need to do is start considering the unpredictable market conditions, evolving product performance data, changing skillsets and regulations as they relate to your unique understanding of what your company is experiencing. If you can't come up with ideas for improvement your management will buy into, that says more about your company than it does about the cloud. The cloud is happening. It presents new economies of scale that your company does not have to pay to research and innovate on. How your company takes advantage depends upon the culture of your company.
Let's go back to basics. Why do we build systems and how do we judge their effectiveness? I have golden rules for end-user computing. We'll talk about those in the next part where we can identify where change is happening, understand if we need big data technologies to solve those big headaches and then demonstrate the value you can bring to your business.
What is big data?
I want to eliminate some confusion by giving a history of big data from my perspective. I don't know how long this will take, but I'd guess about 5000 words. Let's get started.
Big Data is clumsy. Because it is big, the habits you have with non-big data become impractical. Data becomes 'big' when it forces you, by its size, to adopt new processes, hardware or software to deal with it in a reasonable period of time.
Part One - Personal Data Growing Pains
I started thinking about big data a long time ago. I'd say it was somewhere around 2002. At the time, I used a laptop with about a 5GB drive. This was about as much as I would ever need to run my 32-bit OS (Win 2K Pro, which soon became Win XP Pro). But I had collected about 4000 names in my Palm Pilot contacts. As I had to move and sync this contact information with Outlook, the system started to fail. It got slower and slower. The same thing was happening with my email, which I had all in Eudora. Migrating this all-important data from laptop to laptop began to be a serious problem. One day something went wrong. Somehow my address sync created duplicate records. Now I had 8000 contacts, Outlook slowed to a crawl and there was no way for me to deduplicate the records except by writing a custom program (I didn't know how) or manually. It took months.
This simple mistake put me in a class of user that the builders of Outlook didn't consider mainstream. Even years later, as I uploaded my contacts to Google's new mail service, GMail, it was very slow to look up a contact. I could have paid $200 to purchase the popular contact manager Act!, but I didn't want to. The point was that my data was too big for Outlook - the way I had managed contacts all along.
So the first notion I want to communicate is that no data is really too big. Somewhere there is a system that can handle your data no matter how big you think it is. The question is whether or not you must endure a paradigm shift to accommodate outgrowing your old system. Let us recall archive.org. The Internet Archive has been eating and saving the entire static WWW for a decade now.
What really began to give me problems was my music collection. At some point I began using iTunes to rip my 600+ compact discs. This filled up my laptop in no time. I started buying external hard drives. At first I got an expensive LaCie Porsche Design external USB drive. You may remember them from that Will Smith movie I Am Legend - in his laboratory basement... OK that's just me. Every six months my music, movies and pictures would outgrow my capacity and I bought a new drive. Fortunately drive capacities were going up. 100, 200, 250GB. It was hard to imagine I would fill these up, but I did. Soon I was dragging around the Seagate drives that eventually became the FreeAgent Pro series. A blue one for my business information and a red one for my music. Capacity, finally, but capricious.
Like clockwork, every year for three years straight, one of my major hard drives would fail. By 2007 I had a terabyte at home spinning in various disks attached to my Dell tower. By then I had become familiar with two vital tools. The first was a disk recovery system called, simply, Recover My Files, and the second was a disk management program called File Tree Pro, which had a deduplication facility as well as a heatmap visualizer.
Disaster recovery became an issue. When your collection of data becomes larger, it becomes more precious, because you realize you've been collecting it for years and you'll do anything to retain it. Over this period of failing drives, I realized nothing was foolproof, so I had to make friends with redundancy. I kept a copy of all my iTunes music on my Dell tower and also on an external hard drive. And because of the way Recover My Files worked, I had up to half of my files named Recovered 039203.mp3 or some such. So some of my files I knew were redundant; some I didn't. But now I had a class of files I couldn't identify but dared not throw away. I knew that the next time a drive failed I would want a duplicate, but now I had all these unidentified files in iTunes, and that slowed iTunes down. So literally for years I would see a song 'Recovered 034930.mp3' and I would have to listen to it, identify it and change the tags. Metadata management hell.
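For what it's worth, the custom program I didn't know how to write back then is only a few lines of Ruby today. A hedged sketch, assuming everything lives under one directory (the path is made up): hash every file and group the byte-identical ones, so the 'Recovered' copies can at least be matched to their originals.

```ruby
require 'digest'

# Sketch: find byte-identical duplicates (e.g. the 'Recovered 039203.mp3' copies)
# by grouping files under a directory by their MD5 digest. The path is made up.
dupes = Hash.new { |h, k| h[k] = [] }

Dir.glob('/Volumes/Whitebox/Music/**/*.mp3') do |path|
  dupes[Digest::MD5.file(path).hexdigest] << path
end

dupes.each_value do |paths|
  puts paths.inspect if paths.size > 1
end
```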
Recovery meant I ran out to Fry's on a Thursday, bought a new drive, plugged it in and dedicated 100% of that machine's processes to scouring the drive. Maybe by Sunday I would have gotten through 200GB. Recovery is non-trivial. Deduplication is both necessary and risky.
In the normal course of dealing with non-big data, having multiple versions makes sense. Deleting and restoring doesn't take much thought. Backup is simple, disk is cheap. Dealing with big data, deleting may not make sense, restoring takes time, and yes, disk is cheap but now it seems very slow.
I've had websites for a long time. I've also had FTP. But I never considered FTPing anything but the most important files up to my website for safekeeping. Like most folks I used Google Docs. Then I got Flickr and Picasa to hold my pictures. It was a long time before a web-based music storage service became available. But I did start using Carbonite. But for all of these services, if I wasn't creating content at that web location, I always had to deal with copying data from my machine over the web. Painstaking and slow. The first real breakthrough came with Dropbox, and then with Backblaze. These two services (finally) let me keep my data locally and worked in the background without intervention from me to make copies of my data. Now I had smart duplication, the kind of redundancy that didn't have the failings of my own recovery and backups. Files were duplicated in a separate place, I could identify all of them, and I didn't have to spend 100% of my processing time and attention making it so.
When you are dealing with big data, you need to have this kind of backend process working in your favor. Hardware will eventually fail, networks are too slow, metadata management is an awful thing to do manually.
Today I have the following personal setup.
I run everything from my MacBook Pro, to which I have attached:
OneBox - 1TB laptop hard drive
Bluebox - 320GB of business related files
Silverbox - 3TB of personal, non-media files
Whitebox - 2TB of media files
BBox - 1TB of backup of my main 1TB hard-drive
ZBox - encrypted information
Backblaze - backup of all the above except for BBox.
DropBox - 10GB of documents including ZBox (redundant to OneBox)
BlueKey - External USB key including portions of DropBox & ZBox
iTunes Match - 25,000 duplicated music tracks
PicasaWeb - some ungodly amount of duplicated pictures
Flickr - some unique fraction of same pictures + more
Amazon Music Player - about 5000 MP3s
Audible - all of my audiobooks
Amazon - all of my eBooks
O'Reilly - the rest of my non-Amazon eBooks
Evernote - some 5000 documents
iCloud - Contacts, notes, calendars (also on iPad, iPhone)
So I have all of my data locally, most of it 1X redundant locally, the more important data 2X redundant locally. I have all of my data remotely redundant, and some of my music and some of my pictures 2X remotely redundant. I manually initiate incremental BBox backups (Apple Time Machine) about 15 times a week.
My most important pictures in the world are on my keychain.
Additionally I burn various USB sticks, DVD-ROMS and SD Cards with the more important fractions of important files and stash them around the house and in the garage, and physically offsite. Plus I have some stuff in Amazon S3 but I forget what.
What I don't have redundantly is all my old software, which lies in Case Logic CD cases on my bookshelf. It turns out that almost nothing is as useless as old software. Even Eudora.
I have come to trust the cloud for backup, but there isn't much that I have there that I do not keep locally, which basically means my blogs and what I've said on FB, Twitter and the dozens of webchat forums I have attended over the years.
What I have learned is that at some point, you will outgrow your current systems, even if you didn't plan to. When you do there will be barriers you cross (for me it was 100GB collections) that force you to deal with new systems, new problems, new costs, but mostly new habits & new processes.
Then guess what. It's just data again. Yeah, I have about 6-7TB of personal data, not counting redundant backups. I spend maybe 2 minutes a day making sure it gets backed up, and almost no time worrying about it.
I have just completed some work with a rather massive amount of data, quite enough to say that I have some experience with big data. There are several things I have learned.
Data Management Matters
As a priority in dealing with big data, the less you touch it, the better off you are. Which means, unfortunately, that an iterative approach does not work very well. You need to know exactly how to deal with your data, and you need to optimize the processes before you commit to them. A significant number of assumptions one normally takes for granted can no longer be made. For example, if you have two or three ways to eliminate duplicate records within your data, you probably need to find *the* most efficient method. That means watching explain plans. That means timing a shell script vs a Ruby script. That means working processes 'close to the iron' rather than remotely.
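To illustrate the 'time it before you commit' point, here is a rough sketch with Ruby's Benchmark module, pitting a shell pipeline against an in-process approach for deduplicating a flat file. The file names are hypothetical; on real volumes, the gap between the two is exactly what you want to know before wiring either into the pipeline.

```ruby
require 'benchmark'

# Rough sketch: time two ways of deduplicating a flat file before committing
# either one to the pipeline. File names are hypothetical.
Benchmark.bm(12) do |x|
  x.report('sort -u:')   { system("sort -u events_raw.psv > events_dedup_a.psv") }
  x.report('ruby uniq:') do
    rows = File.readlines('events_raw.psv').uniq
    File.write('events_dedup_b.psv', rows.join)
  end
end
```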
It's rather obvious in retrospect, but when you are dealing with 'sensor data', most of it is numbers. That means there are not a lot of attributes and keys for you to deal with, and compression is not going to go as far as you normally think. Therefore the grain of interest for your end-user queries is of paramount importance. Your time dimension means everything. Know the time scope of your queries and partition your data appropriately.
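As a hedged illustration of what that looks like on Redshift (cluster endpoint, credentials, table and column names are all invented), the lever you get is distributing on the natural join key and sorting on the timestamp, so queries scoped to a recent window touch as few blocks as possible:

```ruby
require 'pg'  # Redshift speaks the Postgres wire protocol

# Hedged sketch: a sensor-style fact table sorted on its timestamp so queries
# scoped to a recent window scan as few blocks as possible.
# Endpoint, credentials, table and column names are all invented.
conn = PG.connect(host: 'example-cluster.redshift.amazonaws.com',
                  port: 5439, dbname: 'dev', user: 'admin', password: '...')

conn.exec(<<-SQL)
  CREATE TABLE sensor_readings (
    device_id   INTEGER   NOT NULL,
    reading_ts  TIMESTAMP NOT NULL,
    metric_code SMALLINT  NOT NULL,
    value       DOUBLE PRECISION
  )
  DISTKEY (device_id)
  SORTKEY (reading_ts);
SQL
```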
I have been playing with Postgres for about two years now, and there have been a lot of theoretical benefits to doing so, not the least of which is that it is rather standard and has decent free tools. Now that I do so much more work with Vertica and Redshift, Postgres is my best friend. And I suspect it will remain that way as I start playing with VoltDB. Postgres on my Mac is everything that was once Access, Excel and MSSQL Express. And 95% of my code is portable. It makes a huge difference.
Everything I need in an ETL, I hand-build with Bash, Ruby and regex. My partner in crime in this exercise is Rubular.com. It is the single most irreplaceable tool in my kit. I now have a large library of parsing templates in Ruby, many of which simply require just the right regular expression. But let me go into detail for a moment.
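Here is the shape of one of those templates, heavily hedged - the log format and field names are invented - but the pattern is the same every time: work out the regular expression in Rubular, drop it into the template, and let Ruby turn each line into a record.

```ruby
# Sketch of a parsing template: the regex (worked out in Rubular) does the real
# work; the surrounding Ruby barely changes from feed to feed.
# The log format and field names are invented for illustration.
LINE = /\A(?<ts>\d{4}-\d{2}-\d{2}\s\d{2}:\d{2}:\d{2})\s+
         (?<level>[A-Z]+)\s+
         (?<source>\S+)\s+
         (?<message>.*)\z/x

records = []
File.foreach('app.log') do |line|
  m = LINE.match(line.chomp)
  records << { ts: m[:ts], level: m[:level], source: m[:source], message: m[:message] } if m
end

puts records.first.inspect
```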
I am convinced that the Digital Renaissance is just about over. There are a lot of innovations left in our imagination, but the fundamental leap forward in handiwork is just about complete. In a way, we coders are all like stonemasons in a world increasingly dominated by poured concrete. There are two ways to look at this, and I look at it, despite my proclivities towards hand-building, as a good thing. On the one hand, you don't build cathedrals with poured concrete; on the other hand, you do build giant factories and skyscrapers. Cathedrals are spectacular and fascinating. Skyscrapers and factories are mundane and boring. But the work that runs the world happens in the utility buildings. You may be inspired to pray for safety in the cathedral, but you build airbags in the factory.
So now that our tools are all grown up, we are moving from the Digital Renaissance to the Digital Industrial Revolution. I'd like to think of myself (with any luck) as a digital industrialist. This would be in contrast to a Digital Impresario or a Digital Tourguide. I have to say that at long last I look at Social Media as an evolutionary dead end. (Another story.)
What I mean by having my tools all grown up is precisely what I've been talking about with database interoperability and multiple database systems. We at Full360 are very serious about building open platform systems in which we ingest data into massive storage, manage data and security with key-value store databases, process transactions with tech like VoltDB and serve big data analytics with Redshift and Vertica. Is that wild or what?
By the way, we look to be on the verge of hacking virtual platforms as well. We have already seen AMIs working in VirtualBox and VMware. It's only a matter of time before we can spin up databases of platforms for databases. We are deeply into that kind of customization. So yeah, that's another one of my old geek dreams coming true - what I used to call '3DB systems', thinking then of multidimensional, relational and object-oriented databases in one system. (The impetus for this dream was the lack of a CORBA toolset for requirements we got from i2 Technologies 15 years ago - prior to the release of the Java API for custom Essbase functions, but I digress.) The dream was downgraded to a desire to get 'double DBAs' working for Arberion...
So I want to be a digital industrialist now that I'm all grown up, and that means realizing the dream of 'operational business intelligence' - in other words, dealing with the larger-scale structured transactional data previously off limits to non-shardable analytic database engines. OMG it's like rural electrification. There is so much data we can now deal with. Not long ago we in the industry talked about 'subject specific data warehouses'; now we're talking about multiples of that at a price mid-market companies can handle.
For me, I've always preferred working with aerospace and manufacturing companies. Why? Because when you have tooled up a factory, you already know what data matters and the analytic value of it (if you could only capture and use it all). So now we can put the equivalent of a mainframe on the factory floor, in the engineering labs, and connect marketing and finance people too. Awesome.
The more I live with my iPhone, the more I realize that all I really need when it comes to business mobile computing is alerts. So I'm going to go out and find something that delivers me iOS badges. Something a bit more specific and sophisticated than Twitter, but not much.
I have this really difficult time writing about my industry and I hope to resolve it sometime this year. In the meantime I do attend a lot of Meetups here in LA to try and catch the flavor of a distribution of souls who operate where they do. It's rather tough to do all this working at home since most of my career I've had sales and marketing folks arranging those sorts of meetings for me. So these days I'm feeling the odd pain of not having a sales and marketing department and an army of fellow employees with new acronyms and strategies on their lips after three days of team building at the big hotel and convention center...
What I do have is the rare opportunity to meet practitioners out here on what seems to be the edge of technologies and practices the market may or may not come to grips with in the next few years. That this is also, for me, the relatively new field of open source makes it doubly complicated. I mean, everybody can know the same tools, have the same capabilities. Scary. Weird. So I'm always a bit hesitant to talk about exactly what it is I do, what it is I want to do, and I tend to stick to what I think is worth doing. That generally makes me sound like an idiot because I take clues from so many places. Cool when you're already a millionaire, silly when you have a day job. Nevertheless I'm a shameless data guy and I like talking with data guys. I kind of assume that everyone can build what I would like to see built, and all that's keeping us from it is... well, the same lack of appropriately funded sales and marketing people.
Tonight was a big meet in West LA sponsored by an outfit I happen to think a great deal of - that being Factual, where all the men have long hair. OK 75% of the men have long hair. I do trust my gut, and my gut tells me they're going to do very well in the long term. I had that same gut feeling about Informatica and Comscore in my previous lives. Then again, I was wrong about Acta, or did they get acquired by SAP? I forget. I have said it before, if I had to work any other place it would be there or Evernote or SpaceX. Jeez. I have expensive taste. There is actually one other place that ranks close to my current paradise and that is in the CUSP program at NYU directed by none other than Steven E. Koonin, one of the T50.
For those of you who don't know, I have decided that there are probably only 50 people alive in the world whose books and speeches are worth reading and hearing. I call them the T50 among whom are Koonin, Nassim Taleb, Clay Shirky, Thomas PM Barnett, Neal Stephenson, David P Goldman and Danny Hillis. Hmm. I've got to put up a T50 page.
Anyway, in all things data, I'm trying to find a secure way to share everything I know without giving away my lunch, and quite frankly I don't have the math, the money or the market share to be too loose-lipped. So I find it a precarious business, and I'm not quite secure enough. Still, I know what I like and what I like is what I am exemplifying in a couple semi-open projects. H/T goes to someone who will probably never read Cubegeek about staking out territory whether or not you can defend your claim.
I am compelled by my history and my interests to value highly that data which might serve to inform the public about the real statistics behind every claim putatively made for the public benefit. The example I gave tonight was the price of shrimp in the Gulf of Mexico. And I am interested in seeing the next generation of journalists participate in the curation of open source data on some platform and standard that I have something to do with. That is the aim of Project Longitude. If you recall the transfixing tragedy that was the failure of the Deepwater Horizon and all of the conspiracy theories that arose, you may recall the arguments for and against the seemingly arbitrary figure the President sought in damages against BP, the owner (?) operator (?) of the underwater gusher. One of these was the matter of the depletion of wildlife and economic impact to the region. I happen to think there ought to be a goodly number of people who would like to curate that information and who might communicate with commodities traders. At some point, especially with journalists, information that is no longer breaking news loses its value, and so could be open sourced. The problem is making a good format. I would like to create a market for 'antique' data and make some hub for structured time series data on a relatively low cost, on demand cloud platform.
Right now, there is structured data of high value currently churning through corporate America. I know - that's how I paid for everything I own. I have been fortunate enough to start in the market before it got specialized. So I have done backends, front-ends, sales, tech support, consulting and management, all around corporate financial data. This has allowed me the dubious distinction of being able to see that whole big picture from experience, rather than shut up and be a deeply experienced Java programmer. (Yeah, I saw that coming too... before they even had Swing.) But now the falling price of data care and feeding is making a lot more data reasonably available to a lot more people, and I've packed my Conestoga and hitched it to what I hope to be a bull market.
I talked to (that) Randal Schwartz about my old, old idea of XRepublic and tried not to bore him with the details. As I explained it, I realized that I'm a bit more cynical about the prospects myself. This prepared me for his response, which was that we need a better quality of citizen before a better set of tools will produce a better quality of politics. I agree and disagree. We had a good laugh about a parliamentarian bot, or something programmed to autoboot defiers of Godwin's Law, or at least I thought it was funny enough to mention here. But I do believe that certain behaviors can be created with the proper UX. I mean, even if Farmville doesn't make one a good actual farmer, the virtual farms do actually grow and they do represent the results of your efforts. Good democratic practices can be gamified even if the result isn't real democracy. The experience only has to be more instructive than what currently exists, which is exactly nothing. But sadly, it reminds me that I had a much less pleasant conversation on the same topic with Dave Winer 6 years ago. Maybe that's my clue that this may not be a good idea. Yet.
Most edifying by far were conversations with Lars and Jason about the shared pain of data management in AWS. I've been going through a bit of that, and I think the ideas I've been kicking around for a nicely kitted out VM or appliance spec will find a lot of uptake. That one is Project Kleiglight, now internal to Full360 as I push for it. In the meantime, I suspect that there are a number of smarter ways for me to accomplish a bit of that without springing for DMExpress from Syncsort. Or maybe not. But this goes back to thinking about the dimensions of the product / service offering and interpreting what Factual and Infochimps are doing in their architectures that might have already covered that ground. TBD.
All told it was a great Meetup at Steingarten, and cheap parking too.
I'm going to do my piece for improving debate in the US by designing a standard for longitudinal data in public debates. I'm not going to over-smith the words right now, I just want to get off to a quick start. I'm hoping to gather collaborators who will assist in telling me what they need from nice, quick statistical data of the various sorts we want to open source.
The intro is here.
The first basic idea is to provide a kind of package that contains data, schema and validation, dropped into an S3 repo. Think of it as GitHub for data.
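A first pass at what that package might look like, strictly as a sketch - the layout, field names and checksum choice are my own placeholders, not a settled spec: a manifest that names the data file, its schema and a couple of cheap validation checks, all in one prefix that can be dropped into S3 as a unit.

```ruby
require 'yaml'
require 'digest'

# Sketch only: a placeholder layout for a 'data package' - data, schema and
# validation bundled so the whole prefix can be pushed to S3 as a unit.
# Field names and the checksum choice are illustrative, not a settled spec.
data_file = 'gulf_shrimp_prices.csv'

manifest = {
  'name'       => 'gulf-shrimp-prices',
  'version'    => '0.1.0',
  'data'       => data_file,
  'schema'     => {
    'observation_date' => 'date',
    'port'             => 'string',
    'price_per_lb_usd' => 'decimal(8,2)'
  },
  'validation' => {
    'row_count' => File.foreach(data_file).count,
    'md5'       => Digest::MD5.file(data_file).hexdigest
  }
}

File.write('datapackage.yml', manifest.to_yaml)
# The data file plus datapackage.yml is what gets synced up to the S3 repo.
```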
What tumultuous times!
I've got to say it has been really weird looking at the empty space that is Cubegeek for the past 60 days or so. But one of my goals for the year is actually going to be convergence. I've finally put so many things in philosophical order in my life, at long last, that I don't feel the necessity to separate all of my avocations from my vocations. Plus I worked out a good deal with my own skills, my intentions and the good folks at Dreamhost. But there's a lot of ground to cover.
The huge news, of course, is Amazon Redshift, which throws a big 500-pound gorilla wrench into everybody's business model. A number of pundits have pooh-poohed the whole thing, but I have to tell you, this is a major part of the future, whether you like it or not. Redshift is Moore's Law for databases. It's impossible to ignore. Quite frankly I'm not even sure how I can deal with the fact of its existence, because basically somebody with a reasonable amount of skill can put together a data warehouse, quickly. The upshot is that a lot of consulting can be done at home, the way I do it, and a lot of cheap - even throwaway - DWs can be built. This has scary implications for the quality of said DWs, and nobody knows exactly what sections of the market Redshift will come to dominate, but I can tell you this: our friends in the database world are defecating building materials.
I have been working with Redshift for several weeks now and its strengths are many. Primarily, I'm focused on its elasticity and its price. Additionally, I like that I can script everything at the API level. I haven't done all that yet, but I know that I can. It is lacking some nice developer tools at the Toad level, and if I were one of the guys at Panic Software, I'd make that my next project. As much as I love the Bootstrap web interface that Amazon has got running, nothing beats a finely honed fat client. Anyway, the biggest strength of Redshift right now is its ability to load data from S3, and we're thinking up some techniques and product designs that are going to take advantage of that. So check back with me in six months and ask about Project Kleiglight. In the meantime, we are learning by doing in Redshift.
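For the curious, the load path looks roughly like this - cluster endpoint, credentials, bucket and table names are placeholders, and the COPY options shown are just the common ones, not a recipe:

```ruby
require 'pg'  # the pg gem works because Redshift speaks the Postgres wire protocol

# Rough sketch of the S3 load path. Endpoint, credentials, bucket and table
# names are placeholders; the COPY options are just the common ones.
conn = PG.connect(host: 'example-cluster.redshift.amazonaws.com',
                  port: 5439, dbname: 'analytics', user: 'admin', password: '...')

conn.exec(<<-SQL)
  COPY events
  FROM 's3://example-bucket/events/2013-02/'
  CREDENTIALS 'aws_access_key_id=...;aws_secret_access_key=...'
  DELIMITER '|'
  GZIP;
SQL
```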
Here's my first opinion. Everybody who is using MySQL or MSSQL should migrate to Redshift as soon as they think they're ready for more performance. Period. Whatever market that is, I'll take it.
Here's my second opinion. Teradata is toast.
My Ruby-fu is up marginally. I picked up the Nokogiri gem and am now working a bit smarter with File. I've done some nice integration with standard Unix commands and also with loggers. So I would call myself competent with XML, YAML and JSON. I still haven't swung back to improve my Cucumber, but I am plenty comfortable with RSpec. I'm working on a utility gem of my own for some text manipulation stuff that I do all the time. Next I'm going to play with the parallel gem to see how I can scale up certain ops.
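The parallel experiment, sketched under obvious assumptions (the glob and the per-file work are stand-ins): fan the same routine out across worker processes and see whether the wall-clock time actually drops.

```ruby
require 'parallel'

# Sketch: fan a per-file routine out across worker processes with the parallel
# gem and compare wall-clock time against the serial version.
# The glob and the work inside the block are stand-ins.
files = Dir.glob('incoming/*.log')

results = Parallel.map(files, in_processes: 4) do |path|
  [path, File.foreach(path).count]  # stand-in for the real parsing work
end

results.each { |path, lines| puts "#{path}: #{lines} lines" }
```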
I'm lagging on my seal book - the O'Reilly on Exploring with R - and I'm finally getting rid of the paranoia that sent me reeling two months ago. Nevertheless, I still read Darkside and attend a couple of security hacker meetups.
I've seriously upped my Chef game in the past couple of months. Working on our elasticPM code has gotten me fairly deep into the implementation end of orchestration. It is now clear to me that much of what we have been doing is utterly advanced - we've been on the edge, in many ways, of what Chef can do with Windows, and our unorthodox approach is what has been making Chef's learning curve more difficult than I expected. In addition, working more with Vagrant has improved my capabilities with virtualization. The next version of Vagrant is going to be awesome, I hear. So I've got about a dozen VMs here on my Mac. As they become migratable into AWS AMIs, it's going to be awesome.
Speaking of which, I did get a chance to migrate an AMI across a region via the (two-month-old) way of moving core snapshots. Before I could write code to automate that (I've been busy), Amazon introduced a way to do it directly. So I haven't done it the newest way, but there's one more barrier to internationalization knocked over.
You have to realize that these days I consider myself to be something of an IT guy in the biggest IT shop on the planet, which is AWS. The new architecture is improving every month. More on this separately.
The reintegration project starts with me getting into a couple, three web architectures. I've gotten the static blog thing worked out with Jekyll and Octopress. So I'll probably migrate all this Cubegeek stuff under the single new site. But I really have to get this Node.js and Rails thing knocked out so I can speak that language to customers as well. You see, a lot of our business comes from people with low resistance to moving their assets to the cloud - since a lot of them have used colocation before. I'm going to try to head a lot of them off at the pass, since Amazon has DynamoDB, RDS, Redshift and Hadoop, three of which I've had my hands on. So a lot of the confusion I used to have over MongoDB, Cassandra, Riak, CouchDB and SOLR, I no longer have. I just ape the party line and say go Dynamo.
Vertica has been very very good to me. So my take, splitting the difference, goes something like this. Redshift is for when you have invested *some* time into your DW and you want something low maintenance. Vertica is for when you need to tune the crap out of your system and you want near-realtime stuff. Basically, Vertica has all the bells and whistles for extreme computing. Redshift is more like MSSQL to Vertica's Oracle. Sorry, I hate analogies too, but that's about as close as I want to get to a hardball assessment in this post. I've played with a lot of databases in my time and I love Essbase and Vertica for the same reasons - their internals are beautiful and they enable an entirely new class of computing. However, I like Redshift for the same reason I like MSSQL - simplicity and elegance - except I know Redshift has a lot more upside than MSSQL.
I have worked with just about every major database technology going back to something called BCC out of Utah. Right now is the golden age, because today we have all the major technologies available in stacks that can be built on Amazon. It's a very exciting time to be a data architect.
It's the first week of the month, so I have a number of tasks to do, but it's also the first week of a quarter and a year, so the administrative work piles up. And of course my system crashed. There is so much to relate since my last entry over here.
I have some technical goals for the year, many of which have to do with hacking and security. Essentially, I have put together a vision of the future which involves my working in several capacities beyond what I am doing now. I expect that my current company will grow with the idea. So here's the idea. My nickel says that five years from now, there will be a global systems integrator specializing in building 'enterprise' applications and migrating current such technologies to the Amazon cloud. It will be the equivalent of the Cambridge Technology Partners of the 90s. In other words, really sharp, really reputable and really profitable. And shortly thereafter it will morph into something massive and more mediocre, like Avanade or the old Andersen Consulting. In other words, cloud computing is on its way to becoming standardized and cheap (first with AWS), and then big CIOs will take the leap. Developers with API programming skills, network guys with security skills, database guys with infrastructure skills - these will be the core movers and shakers in global enablement. I'm doubling down on the cloud by incrementally adding to my security props. Why?
Because *this* year I'm going to be doing instant data warehouses, limited only by the reluctance of people to give up the data. So I'm going to have to get fairly familiar with what attackers do and what processes ensure data integrity. So when we go big at Full360, I will be well along the full enterprise migration practice path.
So I am spending a lot more time with technical meetup groups.
The first discipline that I will have to master is the end-to-end usage of VMs. At Full360 we have created a gem that takes some generic images and makes them usable locally through VMware Fusion and/or VirtualBox - Vagrant. So a little ubiquity of VMs is a good start, as we can run our Chef configurations on them locally or remotely hosted on Amazon. The next step is getting some portability with these VMs from local, to local region, to international region. The recent addition of an EBS Snapshot Migration capability to the AWS bag of tricks (not quite standardized into the awscli or into right_aws) allows me to conjure up a VM on my laptop, configure it locally (free), then formalize the Chef recipe, recreate the same thing on a micro instance in an AWS US region (free), then move it over into the EU or Australia or elsewhere for basically the cost of data transfer.
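A Vagrantfile for the local leg of that journey might look roughly like this (box name, memory and recipe are placeholders); the point is that the same Chef run that configures the VM on my laptop is what gets replayed on the micro instance and again in the remote region.

```ruby
# Vagrantfile (sketch): box name, memory and recipe are placeholders.
# The same Chef run used here locally gets replayed later on the AWS instance.
Vagrant.configure("2") do |config|
  config.vm.box = "precise64"

  config.vm.provider :virtualbox do |vb|
    vb.customize ["modifyvm", :id, "--memory", "1024"]
  end

  config.vm.provision :chef_solo do |chef|
    chef.cookbooks_path = "cookbooks"
    chef.add_recipe "example_db_stack"   # hypothetical recipe name
  end
end
```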
I have yet to get some objective statistics on exactly how much of a difference it makes to run queries against a local vs international region (I suspect it's not a whole lot depending on the heaviness of the query) but it's a strong selling point to be able to move a customer stack to and fro. Not only that but it will make cross-regional applications something else we can do, another level of DR.
This weekend I got a head full of security and paranoia, which is my general heading for understanding the evil that men do. I ran into a guy who sits in NOCs and figures out where the bad guys are likely to hit. His hardnosed attitude was a refreshing wakeup call, and frankly more useful than any number of Hak5 videos. Not saying that Hak5 isn't useful - they very much are - but only when you talk to a security professional can you get an idea of what can't be said nicely on a freely available YouTube video. A lot. What M reminded me was that in the military, a 'threat' is always considered a possibility, because a threat assessment deals with enemy capacity regardless of intention or likelihood. So threat assessment is very different from risk management. Risk management is guessing what can hurt you if you do X. Threat assessment is dealing with what your worst enemy can do to you no matter what you do. Managing risk is small peanuts compared to handling threat contingencies. Or to put it more crudely, risk management is guessing which skirt is a bit too short to wear to the party. Threat contingency planning is practicing plunging the knife into the throat of the rapist when he's on top of you.
So part of the security business talks about stuff that sounds like FUD, until you see it. That's what's new in my world this week. I saw it. I watched a tool snatch a password out of memory in 2 seconds flat, in plaintext. I saw a tool that... well, to save a lot of noise, I saw a small fraction of the BackTrack Linux suite in action. It's faster than Alka-Seltzer. And even if you think you are an uninteresting target (which I probably am, considering the people who may have (deniably) already hacked me), seeing these tools being worked by a pro is kind of awesome scary.
I also got a taste of what goes on in dox attacks, which is a kind of second stage mayhem that may be put upon you once your box has been owned. Or, maybe you don't have to have had your network penetrated - there's plenty enough free and easy information to get on you via various services, and a world of not so free information that can be gotten for the price of a little deception.
In all of these matters, what's fascinating are the tools. Tools like Pipl, Nessus, Intelius, Freemind, Armitage/Metasploit, nmap, Maltego, KnowEm, Ancestry, Expose, Monav2... and a bunch of other flavors of kiddie candy that I'm a little silly about at the moment.
I've also had the decent fortune to do more administrative tape-ape kind of work of the sort I guess most college kids get nowadays, but I never did. Don't ask me why my college had NOS and little to no UNIX that I ever saw, but that was the 80s. I always thought that admin-type duties were grunt work, but I'm seeing the use of grunting. It's no longer enough for me to know all those two-letter commands; I've got to see what they do in odd combinations. So I got a pretty thorough hash through ssh this past week, learning just a bit more than I think you can learn from a book. So now a bit of that tunnelling and reverse shell stuff makes a bit more sense - I can review it now.
The school of hard knocks continues. What fun!
Every day I find another reason to celebrate that I am living in the future. Today I heard about Markdown for the fifth or sixth time; it clicked in my head and I started using it. Markdown is RTF for the web.
If you don't remember RTF, it was the predominant word processing format before the great Microsoft Word-WordPerfect wars of the 80s and 90s resulted in the triumph of Microsoft Word. RTF has always been great, but there has been no need to use it, considering the ubiquity of Word and the periodic need to use tables and pictures.
The web has changed all that, with HTML and with blogging in particular. I've been blabbing and blogging for about a decade now, and 99% of what I need to do is handled with this little two-level toolbar that's above the small text window in my browser. I'm talking about the Typepad editor. Well, you take some combination of that and a wiki editor and you have what Markdown is and does. It's perfectly lightweight for the overwhelming majority of writing I need to do, especially considering that I pretty much want to get everything out onto the web. Everything else that I want more permanently, I print *from* the web into a PDF and send off to Evernote.
But I can tell you this. After more than 20 years on the web, I need embedded graphics and fancy formats about 4% of the time, if that. And the best fancy graphics are put together as an embedded graphic that speaks for itself, regardless of where some layout genius decides to put it. If you're like me, you love Evernote Clearly and other tools that de-slickify web pages and give me the text I need in a clear, readable format.
Titles, headers, bolds, links, bullets, blockquotes, italics and the occasional table, strikethrough and graphic. That's what I need. That's what Markdown delivers. I'm sold.
It turns out that for Mac, Mou is the prizewinner. Perfect, simple, effective. This is where everything goes from now on. Let somebody clever turn my Markdown into HTML5 or XML stuff. Let some CMS put together the CSS template and fancify. I've got my new editor.
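As it happens, the 'somebody clever' can be a one-liner. A hedged sketch with the kramdown gem (one of several Ruby Markdown-to-HTML converters; the file names are placeholders):

```ruby
require 'kramdown'

# Sketch: turn a Markdown file into an HTML fragment with the kramdown gem.
# File names are placeholders; any of the Ruby Markdown converters would do.
text = File.read('post.md')
File.write('post.html', Kramdown::Document.new(text).to_html)
```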