The technology behind Tweetrad.io.

Wow, it’s been an exciting couple of days! Tweetrad.io has received nearly 28,000 pageviews in the past 2 days thanks to a successful Hacker News article, a blog post from Mashable, a celebrity tweet from Alyssa Milano, and subsequent viral action on Twitter. I’m pleased to say that Tweetrad.io has, for the most part, stood up well to this traffic spike.

[Figure: traffic]

Here’s a brief write-up on the technology behind Tweetrad.io, how the system evolved and the reasons behind our technology choices.

Tweetrad.io was born out of a weekend hacking session in which I was playing around with the idea of a Ruby daemon that would search Twitter and then fire off a configurable Ruby script. The first script I wrote to demonstrate what I was then calling “Twobots” (Twitter robots) simply ran the tweet text through OS X’s say command. I brought the project into work on Monday to show my coworkers and we all got a good laugh out of listening to the voices read various humorous and mostly NSFW tweets. Edwin immediately saw the entertainment potential and we decided to partner on it as a side project.

Our First-pass Architecture

We knew early on that running text-to-speech on a high volume of tweets called for a queue-based architecture. The initial plan for Tweetrad.io called for a lightweight application server and three types of asynchronous daemon services (Searchers, Converters, Monitors), all running on Amazon Web Services.

[Figure: first-pass architecture]

When a user ran a query on this version of Tweetrad.io, the web application added a row to the database with the search query the user provided. A Searcher daemon would see the unprocessed query in the database and fire off a query to the Twitter search API (via the excellent Grackle library). The retrieved tweets were then pushed onto an SQS queue based on the MD5 hash of the search query. Next, a Converter daemon would see the new queue and burn through it, converting each tweet to an MP3 using the open source Festival TTS engine. The MP3 was then written to a public S3 bucket, from which it could be served directly to the JavaScript client. A pool of Monitor services was responsible for making sure that tweet MP3s were periodically cleaned up from S3 as they reached a maximum age threshold. This was necessary to avoid incurring high S3 storage bills.
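As a rough illustration of the per-query queue idea, here is a minimal Ruby sketch that derives a queue name from the MD5 hash of a search query. The helper name `queue_name_for`, the `tweets-` prefix, and the whitespace and case normalization are assumptions for the example, not details from our actual code. A hex digest is convenient here because it contains only characters that are safe in a queue name.

```ruby
require "digest/md5"

# Hypothetical helper: map a search query to a queue name.
# The "tweets-" prefix and the normalization step are assumptions.
def queue_name_for(query)
  normalized = query.strip.downcase
  "tweets-#{Digest::MD5.hexdigest(normalized)}"
end

# Queries that differ only in case or surrounding whitespace
# end up on the same queue.
puts queue_name_for("  Ruby ")
puts queue_name_for("ruby")
```

With a scheme like this, any Converter that knows the query can compute the queue name on its own, without a lookup table.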

This architecture was cleanly organized and allowed us to tune the conversion process by managing the numbers of each type of daemon; however, as we tested we quickly discovered that some of our initial ideas would not work in production.

Problems We Encountered

Twitter Search API Rate Limiting
Although running a pool of Searcher bots on EC2 seemed like a good way to scale up to handle lots of concurrent queries, we quickly ran up against rate limiting on our search API calls. My coworker Aaron suggested that we move querying down to the client to distribute the API hits. This worked like a charm and eliminated the need for the Searcher service pool altogether. Now when a user searches, the query is run directly on the client via a JSONP call to the Twitter search API. The JSON for the retrieved tweets is posted directly to our Sinatra web application, which handles putting the tweets into the queue for conversion.

Problems with Festival TTS
Initially we planned to do the majority of our text-to-speech conversion using Festival running on our app server, potentially scaling out to additional EC2 instances as needed. Although Festival had some excellent voices and its support for SABLE allowed us to generate fairly natural-sounding speech, we found that running it on the server caused our load average to spike as query intensity increased. We also found that OS X’s say command provided somewhat more natural-sounding voices. Our solution was to run the conversion processes on a cluster of several OS X boxes running at our homes. This allowed us to alleviate load on the web server and leverage the high-quality text-to-speech in OS X while simultaneously reducing our EC2 bill.
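For readers who have not used say from a script, here is a minimal Ruby sketch of how a converter might invoke it. The helper names `say_command` and `speak_to_file` are hypothetical, and the voice name is only an example, since installed voices vary by machine. The `-v` (voice) and `-o` (write audio to a file) options are standard say flags. The step that turns the resulting audio file into an MP3 is not shown.

```ruby
# Build the argument list for OS X's say command.
# -v selects a voice; -o writes audio to a file instead of the speakers.
def say_command(text, out_path, voice: nil)
  cmd = ["say"]
  cmd += ["-v", voice] if voice
  cmd + ["-o", out_path, text]
end

# Run the conversion (only works on OS X). Passing an argument array to
# system bypasses the shell, so quotes or backticks in a tweet are not
# interpreted as shell syntax.
def speak_to_file(text, out_path, voice: nil)
  system(*say_command(text, out_path, voice: voice))
end

# Show the command that would be run, without executing it.
puts say_command("Hello from Tweetrad.io", "tweet.aiff", voice: "Alex").inspect
```

Keeping the command construction separate from the call to system makes it easy to check the arguments on a machine that does not have say installed.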

Concurrency Issues with ActiveRecord
Each converter service runs up to 20 concurrent conversions. I initially had a lot of problems getting ActiveRecord to work properly in a multithreaded Ruby script. Eventually I discovered the problem was with the Mysql Ruby gem. The solution came in the form of the Mysqlplus gem from neverblock (more info). If you are tearing your hair out trying to get a multithreaded Ruby script to interact with a MySQL database, I highly recommend checking out this project.
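To make the "up to 20 concurrent conversions" idea concrete, here is a minimal Ruby sketch of a bounded worker pool. It is not our converter code: the names `convert_all` and `next_job` are hypothetical, the block stands in for the real conversion, and all database access is left out.

```ruby
MAX_CONCURRENT = 20 # the per-converter limit mentioned above

# Non-blocking pop: returns nil once the queue is empty.
def next_job(queue)
  queue.pop(true)
rescue ThreadError # raised by Queue#pop(true) on an empty queue
  nil
end

# Run the given block over every job using at most `workers` threads.
# Returns a hash of job => result.
def convert_all(jobs, workers: MAX_CONCURRENT, &convert)
  pending = Queue.new
  jobs.each { |job| pending << job }
  finished = Queue.new

  threads = Array.new([workers, jobs.size].min) do
    Thread.new do
      while (job = next_job(pending))
        finished << [job, convert.call(job)]
      end
    end
  end
  threads.each(&:join)

  Array.new(finished.size) { finished.pop }.to_h
end

# Stand-in "conversion": measure the length of each tweet text.
lengths = convert_all(%w[hello twitter world]) { |text| text.length }
puts lengths.inspect
```

Both queues are thread-safe, so the workers need no explicit locking. The number of threads is the only knob that controls concurrency.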

Cost of running on AWS
Although AWS provides excellent scalable infrastructure at a reasonable price, Tweetrad.io, as a side project with no funding, is operating on a shoestring budget. In development our AWS bills were pretty small, but I had concerns that if we started to generate real traffic the bills could go up quickly. Notably, our architecture relied heavily on SQS, and the daemon jobs constantly polling the queue for updates made SQS a surprisingly significant percentage of our bill. To alleviate these billing concerns we decided to see if we could run the service on hardware freely available to us. Edwin owns a colocated rackmount server that he uses for testing from time to time. We decided to use this machine as our web application server. Since our Monitor service ensures the number of tweet MP3s is kept relatively small, we realized we didn’t really need the scalability of S3. We opted to just use the local disk on our application server. Further, the Converter services were distributed across several home computers, accessing the database queue over the internet and using scp to write the converted MP3s back to the web server.
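The Monitor's cleanup job is simple once the MP3s live on local disk. Here is a minimal Ruby sketch of the idea, assuming the files sit in a single directory and that a file's modification time is a good enough proxy for its age. The function name `purge_old_mp3s` and the one-hour threshold in the demo are illustrative, not our real settings.

```ruby
require "tmpdir"
require "fileutils"

# Delete every .mp3 in dir last modified more than max_age seconds ago.
# Returns the list of paths that were removed.
def purge_old_mp3s(dir, max_age, now: Time.now)
  Dir.glob(File.join(dir, "*.mp3")).select do |path|
    next false unless now - File.mtime(path) > max_age
    File.delete(path)
    true
  end
end

# Demo in a throwaway directory: one stale file, one fresh file.
Dir.mktmpdir do |dir|
  stale = File.join(dir, "1.mp3")
  fresh = File.join(dir, "2.mp3")
  FileUtils.touch([stale, fresh])
  two_hours_ago = Time.now - 7200
  File.utime(two_hours_ago, two_hours_ago, stale) # backdate the stale file

  removed = purge_old_mp3s(dir, 3600)
  puts removed.map { |path| File.basename(path) }.inspect
end
```

Run periodically, a job like this keeps disk usage bounded by roughly the conversion rate times the maximum age.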

This brings us to…

Our Current Architecture

Our new architecture, while less impressive on paper, is simpler and more cost-effective than the initial architecture we designed. Although it may be necessary to scale out on AWS at some point, the current solution is standing up to load nicely at the moment.

[Figure: current architecture]

With our current architecture, when a user hits Tweetrad.io with a search query, our page-cached JavaScript and HTML client is returned to the browser. The JavaScript client then fires off a JSONP request to the Twitter search API and loads the results into a client-side tweet cache. Meanwhile, a player built with SoundManager2 starts checking the local tweet cache for unplayed tweets. When an unplayed tweet is found, the player makes a GET request to a canonical URL based on the tweet ID. If the file is found, it is streamed from the server and played.

If a 404 is received, the tweet JSON is posted to the Sinatra service at /convert. The Sinatra service checks whether this tweet has already been queued for conversion. If not, the tweet is written to the MySQL queue. (This is the only dynamic action in the web application. Everything else is page cached.) Converter processes running on various OS X boxes outside the datacenter poll the MySQL queue directly. When a converter finds a row to be processed, it locks the row so other converters won’t pick up the job. The converter then runs its conversion process and pushes the file back to our application server using scp.

Back on the client, the player has been polling periodically for the MP3 while it is being converted. Once the tweet is available in the app server’s audio tweet cache, the next request for the tweet MP3 returns a 200 and SoundManager2 begins playing the tweet. The player keeps track of how many unplayed tweets are in the queue and periodically goes back to Twitter for more recent tweets, or for the previous page of tweets, to convert.
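The queue handling described above boils down to two operations: enqueue a tweet unless it is already queued, and let exactly one converter claim each job. The Ruby sketch below models both with an in-memory stand-in for the MySQL table, so it can run anywhere. The class name `ConversionQueue`, the `Job` fields, the `audio_path` helper, and the `/audio` URL prefix are all assumptions for illustration. In the real system the claim step is a row lock in the database, not a Mutex.

```ruby
# In-memory stand-in for the MySQL queue table (illustration only).
class ConversionQueue
  Job = Struct.new(:tweet_id, :text, :locked_by)

  def initialize
    @jobs = {}        # tweet_id => Job
    @lock = Mutex.new # plays the role of the database's row locking
  end

  # What the /convert action does: add the tweet unless it is already
  # queued. Returns true if a new job was created, false otherwise.
  def enqueue(tweet_id, text)
    @lock.synchronize do
      return false if @jobs.key?(tweet_id)
      @jobs[tweet_id] = Job.new(tweet_id, text, nil)
      true
    end
  end

  # What a converter does when polling: atomically mark one unclaimed
  # job as its own. Returns the job, or nil if nothing is waiting.
  def claim(converter_name)
    @lock.synchronize do
      job = @jobs.values.find { |candidate| candidate.locked_by.nil? }
      job.locked_by = converter_name if job
      job
    end
  end
end

# Canonical path the player requests; the "/audio" prefix is an assumption.
def audio_path(tweet_id)
  "/audio/#{tweet_id}.mp3"
end

queue = ConversionQueue.new
queue.enqueue(1234, "first tweet")
queue.enqueue(1234, "first tweet") # duplicate POST, ignored
job = queue.claim("converter-a")
puts "#{job.locked_by} will write #{audio_path(job.tweet_id)}"
```

Because the canonical path depends only on the tweet ID, the client can keep requesting the same URL until the converter has uploaded the file and the request stops returning a 404.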

Open Source Projects we use
A big thank you to the developers of all the open source software that Tweetrad.io runs on. Without these projects we’d never have gotten this bird off the ground.

Ruby – our language
Mysql – our database
nginx – our webserver
vlad the deployer
passenger – easy deployment for rack based applications
Rack – ruby webserver interface
Sinatra – framework for building lightweight web applications and services
sinatra-cache – page cache plugin for Sinatra
Daemons – Ruby gem we use to daemonize our various services
ActiveRecord – domain model for our database queue
Mysqlplus – allows us to do thread-safe MySQL access in Ruby
Prototype – javascript extensions and dom utilities
Scriptaculous – effects for morphing css
SoundManager2 – javascript library for playing audio
Raphael – javascript vector drawing library (provides fun radio wave animation)
Grackle – not currently using but was a big part of the early prototypes
Festival TTS – not currently using, but we will use it if our scalability needs grow beyond what we can support on our local OS X cluster

Thanks again for checking out Tweetrad.io and remember “At Tweetrad.io, we read the tweets so you don’t have to!”
