BackType has built a powerful system for analyzing realtime social data. By analyzing tweets, it gives you insight into your social influence on Twitter and Y Combinator's Hacker News. The following graph shows the (not so impressive) stats for my blog:
Because BackType scans every tweet for URLs to calculate your influence, they have to process a massive amount of data from the Twitter Firehose. The Firehose can peak at nearly 7,000 tweets per second, as it did during New Year's Eve in Tokyo. That's more than 400,000 tweets per minute!
The big problem with realtime data is that you can't easily process it in batches, because the data keeps coming. When you batch this amount of data, you either have to process it faster than realtime or you build an ever-growing backlog. During last year's World Cup final, when the Netherlands played Spain, we (Mobypicture's MobyNow) had a small flaw in the code processing tweets tagged #ned and #wk2010. In roughly 90 minutes we built up a backlog of 18 hours' worth of processing. Because people kept using #ned and #wk2010, it was almost impossible to clear: we had to go many times faster than realtime to work it off. When displaying realtime tweets, you don't want to be more than a couple of seconds behind. With batch processing, clearing the backlog is a fight you have to win every time you process a batch.
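The backlog dynamics above are simple arithmetic: when tweets arrive faster than you process them, the backlog grows by the difference, and draining it is only possible by going faster than realtime. A minimal sketch (the rates are made-up illustrative numbers, not Mobypicture's actual figures):

```python
def backlog_after(ingest_rate, process_rate, minutes):
    """Backlog (in tweets) built up after `minutes` of falling behind."""
    return max(0, (ingest_rate - process_rate) * minutes)

def drain_time(backlog, ingest_rate, process_rate):
    """Minutes needed to clear a backlog while new tweets keep arriving.

    Only possible when we now process faster than realtime.
    """
    surplus = process_rate - ingest_rate
    if surplus <= 0:
        raise ValueError("can never catch up: processing is not faster than ingest")
    return backlog / surplus

# Illustrative: 100 tweets/min arriving, a buggy worker managing only 40/min.
backlog = backlog_after(100, 40, 90)   # 5400 tweets behind after 90 minutes
# After the fix, processing at 250/min while 100/min keep arriving:
minutes_to_catch_up = drain_time(backlog, 100, 250)  # 36 minutes to drain
```

The point of the second function is exactly the fight described above: if the fixed worker only matches the ingest rate, the backlog never shrinks.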
BackType also recognized that batch processing wasn't the way to go for their analytics, so they recently developed a new system for realtime processing called Storm to replace their old setup of queues and workers:
Storm is a distributed, reliable, and fault-tolerant stream processing system. Its use cases are so broad that we consider it to be a fundamental new primitive for data processing. That’s why we call it the Hadoop of realtime: it does for realtime processing what Hadoop does for batch processing. We are planning to open-source Storm sometime in the next few months.
Mobypicture's realtime processing system consists, just as BackType's used to, of a series of queues and workers. We kick a piece of content from one queue to the next, trying not to slow down the whole pipeline when more data comes in than we can handle. The system isn't really fault tolerant: every queue has to stay online, and we can't process anything when one of them goes down. It's a complex system, and I understand why BackType replaced it with a more generic solution they can deploy anywhere.
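Neither Mobypicture's pipeline nor BackType's old one is public, but the queues-and-workers pattern itself is easy to sketch: each stage pops from one queue, does its step, and pushes to the next. This toy, single-process version uses in-memory queues; all stage names and processing steps are hypothetical:

```python
import queue
import threading

# Two stages chained by queues. In production these would be separate
# worker processes talking to external queue servers, which is the
# fragile part: if any one queue goes down, the whole chain stalls.
raw = queue.Queue()
parsed = queue.Queue()

def parse_worker():
    """Stage 1: pull raw tweets, extract URLs, push to the next queue."""
    while True:
        tweet = raw.get()
        if tweet is None:                # sentinel: shut the stage down
            parsed.put(None)
            return
        urls = [w for w in tweet.split() if w.startswith("http")]
        parsed.put({"text": tweet, "urls": urls})

results = []

def count_worker():
    """Stage 2: pull parsed tweets and record the URL count per tweet."""
    while True:
        item = parsed.get()
        if item is None:
            return
        results.append(len(item["urls"]))

threads = [threading.Thread(target=parse_worker),
           threading.Thread(target=count_worker)]
for t in threads:
    t.start()
raw.put("check out http://example.com")
raw.put("no links here")
raw.put(None)                            # signal end of stream
for t in threads:
    t.join()
# results now holds the URL count per tweet: [1, 0]
```

Storm's pitch is that it takes over exactly the plumbing this sketch does by hand: the message passing between stages, the worker lifecycle, and recovery when a stage dies.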
[Storm] abstracts the message passing away, automatically parallelizes the stream computation on a cluster of machines, and lets you focus on your realtime processing logic. Even more interesting, Storm enables a whole new range of applications we didn’t anticipate when we initially designed it.
Properties of Storm
I won't quote them all; you can find the longer descriptions in BackType's original blog post. These are the main characteristics:
- Simple programming model: Just like how MapReduce dramatically lowers the complexity for doing parallel batch processing, Storm’s programming model dramatically lowers the complexity for doing realtime processing.
- Runs any programming language
- Fault-tolerant: To launch a processing topology on Storm, all you have to do is provide a jar containing all your code. Storm then distributes that jar, assigns workers across the cluster to execute the topology, monitors the topology, and automatically reassigns workers that go down.
- Horizontally scalable
- Fast: Storm is built with speed in mind. ZeroMQ is used for the underlying message passing, and care has been taken so that messages are processed extremely quickly.
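Storm's API wasn't public when this was written, so the following is only a toy illustration of the stream-processing model BackType describes: a source ("spout") emits a stream of tuples, and processing steps ("bolts") consume and re-emit them, while the framework, not your code, handles distribution across the cluster. All names here are made up:

```python
# Toy, single-process illustration of a stream topology. In real Storm,
# each stage would run parallelized across many machines.
def tweet_spout(tweets):
    """Source of the stream: emits one tuple per tweet."""
    for t in tweets:
        yield t

def url_bolt(stream):
    """Emits one tuple per URL found in each incoming tweet."""
    for tweet in stream:
        for word in tweet.split():
            if word.startswith("http"):
                yield word

def count_bolt(stream):
    """Maintains a running count per URL (a continuous computation)."""
    counts = {}
    for url in stream:
        counts[url] = counts.get(url, 0) + 1
    return counts

tweets = ["read http://a.example now",
          "also http://a.example and http://b.example"]
counts = count_bolt(url_bolt(tweet_spout(tweets)))
# counts == {"http://a.example": 2, "http://b.example": 1}
```

The MapReduce analogy in the list above maps cleanly onto this: the spout plays the role of the input reader, `url_bolt` of the mapper, and `count_bolt` of the reducer, except that the stream never ends and the counts are updated continuously.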
I'm really curious about Storm. The use cases are numerous, from stream processing to continuous computation. I'd love to get my hands on it and do some testing. Maybe in a couple of months.
In the meantime, check out Preview of Storm: The Hadoop of Realtime Processing by BackType Technology. They have more interesting tech articles, as well as some great tech presentations.