Maciej Mróz Personal Blog

Because why not

Mar 9, 2013 - 5 minute read - Technology

The structure in unstructured data

Working with data is something I personally consider one of the coolest things one can do right now in the tech industry. Typically when someone thinks or talks about “Big Data” it implies petabyte data sets and Hadoop clusters spanning 100+ nodes. The truth is, much smaller data and humbler infrastructure can provide great insights and drive product/service innovation. To me it’s really a lot closer to “Smart Data” than “Big Data” - it doesn’t matter how big your data set is, but how much you are able to do with it, and how quickly.

In my workplace we run our own data gathering, storage, and processing. Our own code, our own infrastructure. Managed, architected, and even partially implemented by yours truly. What is special about our system is that, compared to most commercial systems, we allow our products to report pretty much arbitrary data. There are very basic guidelines that must be met for the standard metrics to work, but what is actually possible goes way beyond that. Product teams have access to the raw data and can create their own processing tools if they want or need to. They can also use Hive to query the data - very often you can just brute-force your answer out of the data, so why not do it? Programmer time is expensive, after all.
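To give a feel for what that arbitrary data looks like, here are two hypothetical events written as Python dicts. The field names and values below are invented for this post, not our actual schema - the point is only the shape: a handful of common fields the guidelines ask for, plus whatever payload the product team decides to attach.

```python
# Hypothetical events, shown as Python dicts. Field names are made up for
# illustration: a few common fields (timestamp, product, platform, country,
# event name) plus an arbitrary, product-specific payload.
event_from_game_a = {
    "ts": "2013-03-01T12:34:56Z",
    "product": "game_a",
    "platform": "ios",
    "country": "PL",
    "event": "level_7_completed",
    "payload": {"duration_s": 182, "stars": 3},
}

event_from_game_b = {
    "ts": "2013-03-01T12:35:02Z",
    "product": "game_b",
    "platform": "android",
    "country": "DE",
    "event": "purchase",
    "payload": {"item": "coin_pack_small", "price_usd": 0.99},
}
```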

The problem is that if you are not willing to throw a lot of money at scaling your Hadoop installation, the queries start to become very slow, very quickly. And that really kills organizational agility: having to wait many hours for a query result is not what anyone wants. The cycle from running a set of queries to releasing a new version of the game (or whatever the organization is doing) should optimally fit within one business day. Running a single Hive query for many hours prevents that. It is a problem I consider strategically critical, so I decided to attack it myself.

Because my target was reducing query times from hours to minutes, I needed a solution that’s radically different from simply getting more hardware. I needed to get smarter about how the data is processed. A word of warning: if you know what an OLAP cube is and have explored how OLAP systems work, what I am about to say will likely sound very basic. You may stop reading now :) The first time I got to play with a system like that, I wasn’t really impressed by what it could do, but I was deeply impressed by how quickly it was coming up with responses. Seriously, if a system can answer a question about a data set faster than it takes to read that data from disk, there’s something quite sophisticated happening behind the scenes. Obviously, it couldn’t just index the data for queries it didn’t yet know I was going to ask - classic B-tree indexes were not the answer. But it was still very quick. To understand why, one needs to realize a very simple thing: yes, you can slice & dice multidimensional data very quickly, but at the same time you are limited to quite simple queries. The data is preprocessed and stored in a way that makes a specific type of query run fast, and the vast majority of analytical queries fall into the category of exploring the OLAP cube.
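Here is a toy sketch of the idea, in Python rather than any real OLAP engine, with made-up dimensions and numbers: once events are pre-aggregated into a cube keyed by a few dimensions, a slice & dice question becomes a roll-up over a handful of rows instead of a scan over the raw data.

```python
from collections import defaultdict

# A tiny pre-aggregated "cube": one row per (date, country, platform) with an
# additive measure (event count). It is built once from the raw data; queries
# never touch the raw events again.
cube = {
    ("2013-03-01", "PL", "ios"):     12500,
    ("2013-03-01", "PL", "android"):  8300,
    ("2013-03-01", "DE", "ios"):     22100,
    ("2013-03-02", "PL", "ios"):     13900,
}

def roll_up(cube, keep):
    """Sum the measure over every dimension not listed in `keep`."""
    out = defaultdict(int)
    for (date, country, platform), count in cube.items():
        row = {"date": date, "country": country, "platform": platform}
        out[tuple(row[d] for d in keep)] += count
    return dict(out)

# "Events per country" is a roll-up over a few cube rows, not a full scan.
print(roll_up(cube, keep=("country",)))
# {('PL',): 34700, ('DE',): 22100}
```

The trade-off is exactly the one described above: anything that isn’t expressible in terms of the chosen dimensions and additive measures cannot be answered from the cube at all.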

My first thought about the problem of preprocessing the data was that we cannot do it because it has no fixed structure. It’s all JSON objects, the product teams can put anything they want inside, so how am I going to do anything meaningful with it? 100 milliseconds later I realized that calling our data unstructured is just not true. Yes, it would be hard to fit into a fixed-schema SQL table, but it has tons of regularities. In fact, our technical guidelines for analytics instrumentation specify a recommended list of dimensions, we just call them differently! And, because I typically acted as an outside expert to the teams when planning their instrumentation, the guidelines are actually followed!

Realizing that our data has a lot of structure was the tipping point. The idea looked good, so I started some Python programming to implement a system that would process the raw data and store an aggregated cache for later querying. I decided to cut out some less important stuff (very noisy dimensions), and obviously leave out data that makes little sense to aggregate. Coding, coding, coding … done. Ok, perhaps it was slightly more complicated. For example, some dimensions were not known before processing and I had to use a bit of regexp magic :) Still, nothing I would really call hard. After launching the aggregator script I was very pleased with the results: depending on the project, I was getting a 50-100x reduction in size compared to the raw data set. That pretty much linearly translates into shorter query times. While the preprocessing is rather costly, it happens only once per data chunk so it’s not really an issue - just another cron job to run on one of the servers (or more, should the need arise :P ). It’s not a solution that fits every single use case out there. In fact, I think it is just another great example of the Pareto principle - I solved 80% of the problem with 20% of the effort. What’s good is that now we have more time and computing resources to focus on the remaining 20% of data problems, which are a lot tougher than quickly generating simple reports :)
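For the curious, the aggregator boils down to something like the sketch below. This is not the actual script - the dimension names, the noisy fields being dropped, and the regexp are placeholders - but the shape is the same: stream the raw JSON events, keep only the dimensions worth keeping, and accumulate additive measures per dimension combination.

```python
import json
import re
from collections import Counter

# Dimensions worth keeping; everything else (user ids, raw timestamps, other
# very noisy fields) is dropped or coarsened before aggregation.
DIMENSIONS = ("product", "platform", "country", "event")

# Example of the "regexp magic" for dimensions not known up front: collapse
# event names like "level_7_completed" into "level_N_completed".
LEVEL_RE = re.compile(r"level_\d+")

def dimension_key(event):
    event = dict(event)
    event["event"] = LEVEL_RE.sub("level_N", event.get("event", ""))
    day = event.get("ts", "")[:10]   # coarsen the timestamp to a day
    return (day,) + tuple(event.get(d, "unknown") for d in DIMENSIONS)

def aggregate(path):
    counts = Counter()
    with open(path) as raw:          # newline-delimited JSON, one event per line
        for line in raw:
            line = line.strip()
            if line:
                counts[dimension_key(json.loads(line))] += 1
    return counts

# Each output row is one dimension combination plus a count - this is where
# the 50-100x reduction over the raw event log comes from.
for key, count in aggregate("events.json").items():
    print("\t".join(key + (str(count),)))
```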