How I Launched 4sqtransit in Two Weeks on Windows Azure
My latest project, 4sqtransit, is a small application that delivers real-time public transit schedules to your phone via text message whenever you check in at a transit stop on Foursquare. I’m still a little uncertain about how word got out about what I was doing, but I woke up one morning to find out that 72 people were using my service. Later that day, I got an email from the co-founder of Foursquare. The following week, it was on the front page of Mashable. In the days following that, my user base surpassed 700 and my inbox was flooded with transit agencies asking to be added to my service. Today, 4sqtransit supports over 157 transit agencies around the globe and churns out nearly 400 text messages per day.
I’d like to focus on the technical side of 4sqtransit today. I’ll start by walking through the complete workflow from start to finish, then cover the technologies that I am using and some of the challenges that I ran into, and finally discuss how I was able to scale out my application using Windows Azure as more users signed up and my request rate climbed.
So first things first, here is a top-down view of the 4sqtransit workflow. When a user checks in on Foursquare, I receive a notification from the Foursquare Push API with details about which of my users checked in and where. My service then matches this Foursquare user to the user in my database to determine which transit agency they use, which they specified when they signed up for my application. I then query that transit agency for the nearest transit stop, based on the GPS coordinates of the user’s check-in location from Foursquare, and calculate the distance from the user to that stop. If the stop is within 100 meters of the check-in location, I move forward and deliver the stop times; otherwise I ignore the check-in. To deliver stop times, I again query the user’s transit agency for the stop times in the next 2 hours and send this information to the user by text message, using Twilio.
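To make the 100-meter check concrete, here is a minimal sketch in Python (not the code 4sqtransit actually runs). It assumes the candidate stops are already available as a local list of dictionaries with `lat` and `lon` keys, and it uses the haversine formula for the distance; the names `haversine_m`, `stop_in_range`, and `MAX_DISTANCE_M` are made up for illustration.

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in meters
MAX_DISTANCE_M = 100        # threshold described in the workflow above


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def stop_in_range(check_in, stops, max_distance_m=MAX_DISTANCE_M):
    """Return the nearest stop if it is within max_distance_m, otherwise None.

    check_in is a (lat, lon) tuple; stops is a list of dicts with "lat" and
    "lon" keys (a hypothetical shape chosen for this sketch).
    """
    lat, lon = check_in
    nearest = min(
        stops,
        key=lambda s: haversine_m(lat, lon, s["lat"], s["lon"]),
        default=None,
    )
    if nearest is None:
        return None  # no stops to compare against
    distance = haversine_m(lat, lon, nearest["lat"], nearest["lon"])
    return nearest if distance <= max_distance_m else None
```

A check-in that returns `None` here corresponds to the "ignore the check-in" branch; anything else would go on to the stop-time lookup.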
While this might sound fairly simple, consider this: 4sqtransit currently has around 800 users. A recent tweet from Foursquare indicated that the average Foursquare user checks in around 3 to 4 times per day. Remember, the Foursquare Push API sends me every check-in for all of my users, in real time. At 4 check-ins per user, that means 4sqtransit processes nearly 3,200 check-ins every day, and at peak times, that’s roughly 150 check-ins per hour. Keep that number in mind as I discuss how exactly I “query” these transit agencies for stop locations and times.
4sqtransit aims to provide real-time data to its users. To stay accurate, real-time data has to be delivered as a web service, in some shape or form. Each agency’s API is completely unique, and certain methods and parameters that I use for one agency don’t always exist for another. I basically had to hard-code a unique consumer for each real-time transit agency’s API. For the agencies that did not support real-time data, I had to rely on scheduled data delivered in GTFS (General Transit Feed Specification) format, which, even then, not every transit agency provides. On occasion, I had to use a combination of GTFS and a real-time API if an agency didn’t provide a meaningful way of finding stop locations.
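One way to keep a pile of agency-specific consumers manageable is to put them behind a shared interface. The sketch below is only an illustration of that idea in Python, not how 4sqtransit is actually organized; `AgencyConsumer`, `ExampleRealtimeConsumer`, `CONSUMERS`, and `handle_check_in` are all hypothetical names, and the example consumer returns canned data instead of calling a real agency API.

```python
from abc import ABC, abstractmethod


class AgencyConsumer(ABC):
    """Common interface that every agency-specific consumer would implement."""

    @abstractmethod
    def nearest_stop(self, lat, lon):
        """Return a dict describing the nearest stop, or None if there is none."""

    @abstractmethod
    def upcoming_departures(self, stop_id, minutes):
        """Return a list of departure descriptions for the next `minutes`."""


class ExampleRealtimeConsumer(AgencyConsumer):
    """Stand-in for one agency's real-time API (canned data, no network calls)."""

    def nearest_stop(self, lat, lon):
        return {"stop_id": "1001", "lat": lat, "lon": lon}

    def upcoming_departures(self, stop_id, minutes=120):
        return ["Route 10 in 4 min", "Route 20 in 11 min"]


# Hypothetical registry: agency key stored on the user record -> consumer.
CONSUMERS = {"example-agency": ExampleRealtimeConsumer()}


def handle_check_in(agency_key, lat, lon):
    """Look up the user's agency and fetch departures for the nearest stop."""
    consumer = CONSUMERS.get(agency_key)
    if consumer is None:
        return None  # agency not supported
    stop = consumer.nearest_stop(lat, lon)
    if stop is None:
        return None  # nothing nearby, so ignore the check-in
    return consumer.upcoming_departures(stop["stop_id"], minutes=120)
```

A GTFS-backed agency, or one that mixes GTFS stop locations with a real-time departures API, would simply be another subclass registered under its own key.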
Querying a real-time API is a no-brainer: I have the user’s GPS coordinates, so I just need to find the nearest stop location and then the upcoming departure times for that stop. The real challenge arises when I try to query a GTFS feed. A GTFS feed is basically a ZIP file of roughly 10 CSV text files, which I store locally. Compressed, this ZIP file can range anywhere from 2 MB to 200 MB. Uncompressed, the CSV files can be anywhere from 6 MB to 600 MB. While the GTFS format is very thorough and standardized, it’s not exactly convenient. To find a stop time, I have to query the list of stop times, match that against the list of trips, and match that against the list of routes. I use a massive LINQ query to get the data that I’m looking for. Depending on the transit agency, this request can take anywhere from 10 to 30 seconds to execute from start to finish.
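The stop_times, trips, and routes lookup can be sketched as follows. This is a simplified Python illustration rather than the LINQ query itself: it joins `stop_times.txt` to `trips.txt` to `routes.txt` using standard GTFS column names, builds the join once into an in-memory index keyed by stop, and deliberately ignores service calendars (`service_id`), which a real query would also need to check. The function names and the choice to pass the CSV contents as strings are assumptions made for the example.

```python
import csv
import io
from collections import defaultdict


def to_seconds(hhmmss):
    """Convert a GTFS time such as '08:30:00' (may exceed 24:00:00) to seconds."""
    hours, minutes, seconds = (int(part) for part in hhmmss.strip().split(":"))
    return hours * 3600 + minutes * 60 + seconds


def load_rows(csv_text):
    """Parse the text of one GTFS CSV file into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(csv_text)))


def build_index(stop_times_rows, trips_rows, routes_rows):
    """Join stop_times -> trips -> routes once, keyed by stop_id."""
    route_name = {r["route_id"]: r["route_short_name"] for r in routes_rows}
    trip_route = {t["trip_id"]: t["route_id"] for t in trips_rows}
    by_stop = defaultdict(list)
    for row in stop_times_rows:
        route_id = trip_route[row["trip_id"]]
        by_stop[row["stop_id"]].append(
            (to_seconds(row["departure_time"]), route_name[route_id])
        )
    for departures in by_stop.values():
        departures.sort()  # earliest departure first
    return by_stop


def upcoming_departures(index, stop_id, now_s, window_s=2 * 3600):
    """Return (departure_seconds, route_name) pairs within the next window_s seconds."""
    return [
        (t, name)
        for t, name in index.get(stop_id, [])
        if now_s <= t <= now_s + window_s
    ]
```

Building the index up front trades memory for speed: the join over the large `stop_times.txt` file happens once per feed instead of once per check-in.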
I initially decided to host this application on Windows Azure because I needed a cloud hosting provider that supported SSL certificates with a custom domain name for free, something that Azure handled flawlessly, and something that AppHarbor has only just started supporting. Azure integrates very nicely with Visual Studio and makes deployment a breeze. One of the nice things about Windows Azure is that you have full access to the server your application is running on. I often open up a Remote Desktop connection to my server to monitor performance or review log files.
As my application’s user base grew and the number of requests I was handling increased, I was able to seamlessly scale my application with Windows Azure. Just by upgrading from an Extra Small compute instance to a Small compute instance, the average response time for my GTFS queries dropped by over 50%. I didn’t modify any code; I simply upgraded my server in Windows Azure.
Aside from the computational challenges, this application was quickly becoming fairly sizeable. With 157 transit agencies, some of which use GTFS in some form or another, the size of my project solution was easily in the multiple-gigabyte range. Deploying this to Azure Compute was quickly becoming an all-day affair, literally. I decided to add a Windows Azure Storage account, which would allow me to host static files in Azure, separate from my application, at nearly local hard-disk I/O performance. I uploaded my GTFS files, removed them from my project solution in Visual Studio, and made a few small changes to my code to read them from my Storage account.
I was stunned by the performance of Windows Azure. To be able to deploy my application to Azure, set up an SSL certificate, set up a SQL Server database, scale up my compute instances, and provision a storage account in real time, without any training on this platform whatsoever, was amazing. Azure truly is a “command center” for web applications. If there’s something you need, chances are that Windows Azure is already doing it. I’ll admit I was a little concerned about deploying my application on a platform that I had never used before, but I am thoroughly impressed with how easy it is to develop for and how powerful it is. Azure has easily become a permanent asset to my business.
You should follow @mbmccormick on Twitter right now
Because who wouldn't want to keep tabs on this passionate software developer, elite hacker, food connoisseur, and dog lover? No one, that's who.