Engineering

Storm: The Hadoop of Realtime Processing

Backtype has built a powerful system to analyze realtime social data. They help you with insights about your social influence on Twitter and YCombinator’s Hacker News by analyzing tweets.The following graph shows the (not so impressive) stats for my blog:

Because Backtype is processing all tweets for URLs to calculate your influence, they have to process a massive amount of data from the Twitter Firehose. The Firehose can go as fast as 7000 tweets per minute during New Years Eve in Tokyo. That’s a massive 117 tweets per second!

The big problem with realtime is that you can not or not easily process it in batches, because the data keeps coming. When you batch this amount of data you have to be able to process the data faster than realtime or create an always growing backlog. During the World Cup finals last year when The Netherlands was playing against Spain, we (Mobypicture’s MobyNow) had a small flaw in our code processing the tweets with #ned and #wk2010. During roughly 90 minutes we had built a backlog of 18 hours worth of processing. Because people kept using #ned and #wk2010 it was almost impossible to remove the backlog, we had to go many times faster than realtime to remove it. While displaying realtime tweets you don’t want to be more then a couple of seconds behind. With batched processing this process of removing the backlog is a fight you are fighting every time you process a batch.

Backlog also recognized that batched processing wasn’t the way to go for their analytics. So they recently developed a new system for doing realtime processing called Storm to replace their old system of queues and workers: Read the rest of ‘Storm: The Hadoop of Realtime Processing’ »

The four categories of NoSQL databases

Very interesting read on the Monitis Blog about picking the right NoSQL tool. They dive into what it is, what’s possibly wrong with RDBMS, describe the different categories of NoSQL and the pros and cons of the different types.

Most people just see one big pile of NoSQL databases, while there are quite some differences. You couldn’t use a Key-Value store when you need a Graph database for example, while Relational database systems are all quite compatible.

Montis describes the following categories:

1. Key-values Stores

The main idea here is using a hash table where there is a unique key and a pointer to a particular item of data. The Key/value model is the simplest and easiest to implement. But it is inefficient when you are only interested in querying or updating part of a value, among other disadvantages.

Examples: Tokyo Cabinet/Tyrant, Redis, Voldemort, Oracle BDB, Amazon SimpleDBRiak Read the rest of ‘The four categories of NoSQL databases’ »

Route 53 and Elastic Load Balancing integration

Another batch of great new AWS features, this time for Route 53 (DNS) and ELB. One in particular is very interesting. Route 53 and Elastic Load Balancing integration. Werner Vogel explains why this is a big deal:

Due to restrictions in DNS a root domain “zone apex” cannot be mapped to another domain name using a CNAME record. This has caused customers who wanted to have the root of their domain e.g. allthingsdistributed.com point to same location where for example www.allthingsdistributed.com is pointing to jump through complex redirect hoops. Through a better integration between Elastic Load Balancing and Route 53 we can now offer the ability to map the zone apex to ELB without all the redirection muck. ELB and Route 53 work together closely to ensure that if the address of the load balancer changes this is quickly reflected in Route 53.

That’s why Mobypicture is running on www.mobypicture.com instead of mobypicture.com and we had to run our own loadbalancers and scaling metrics for our moby.to url shortener. From now on we can completely switch to ELB, including all other benefits.

More information can found on the Elastic Load Balancing and the Route 53 detail pages or on Werner’s blogpost “New Route 53 and ELB features: IPv6, Zone Apex, WRR and more

Custom Metrics for Amazon CloudWatch

You can now store your business and application metrics in Amazon CloudWatch. You can view graphs, set alarms, and initiate automated actions based on these metrics, just as you can for the metrics that CloudWatch already stores for your AWS resources.

On first sight I was not really impressed. It does not sound really useful, until you dig deeper into the possibilities. For example adding page latency to CloudWatch gives you far better insight WHY your latency rises compared to free server resources.

Here’s an example of the kinds of insights that can be gained when you use custom metrics. The following chart shows the amount of free memory (the jagged line) and the application’s page latency:

As you can see, page latency increases as free memory decreases (looks like there’s some garbage collection going on as well).

Read the rest of ‘Custom Metrics for Amazon CloudWatch’ »