Your Network Broke My App: A Cautionary Tale of Placing Blame
I’d like to imagine the call happened late at night. Pygora Boer, the CEO of the popular “Find my Goat” (FMG)* app, called an AT&T executive. I imagine the AT&T exec rubbing his or her eyes, trying to wake up and understand what is going on.
Pygora the CEO (in an agitated voice): Hey AT&T executive. We have a problem. We are seeing a 10-20% drop in Find my Goat traffic on AT&T’s mobile network. There is obviously an issue with your network, and we need to resolve it ASAP! People are losing their goats and can’t find them with our app. It is only a matter of time until a goat dies because your network is blocking our application!
AT&T Exec: Hi Pygora, I’m sorry to hear that. I too care deeply about goats. Let’s get a bridge set up with our network team and your crack developers, and solve the problem before any goats are harmed.
I imagine that this is the worry that Pygora had:
(Note, no goats were harmed during this outage)
Over the next several days, long triage calls were held (and we discovered that FMG was having similar calls with other carriers), tracing the path the packets took through the networks and painstakingly identifying potential issues. On day three of the calls, I (as a mobile application performance specialist) was brought in. Once I was up to speed, I felt that the tools we have in the AT&T Developer program could certainly help.
Now, dear readers, you know that I work with AT&T’s Application Resource Optimizer (ARO). ARO allows developers (or me) to look at how mobile applications behave on the network, and it provides suggestions on how to improve them. Being a goat lover, I had actually tested Find My Goat several times in the past, and tried to work with the FMG developers to make their mobile application run faster and use less battery. I am also an FMG user, as my family has a pair of goats in our barn.
Here is Bodhi the dog playing “Find My Goat”
It probably makes sense now to tell you how Find my Goat works. You see, goats are (to quote Gollum from Tolkien’s masterworks) ‘tricksy.’
Goats are always escaping, and you want to find your goats as quickly as possible before they can be attacked by a boa constrictor (or more likely in the Pacific Northwest, a coyote or a car). So, we have outfitted our goats with GPS tags on their collars, and we can use the Find my Goat app to find our escapees. The app refreshes the data from the server every 5 seconds, so that you can see your goat’s movements in real time.
All the goats are at home, and safe!
In my initial tests of Find My Goat, I could see that the goat position updates were all sent to my phone on one connection. This means that each Find My Goat user has one IP address at the FMG servers (and each goat is utilizing one IP as well). For my family, one phone plus two goats would be three IPs. The three devices communicate a lot, which keeps the device radio busy: when the application is active it can cause a lot of battery drain, and we have to charge our goats’ collars fairly regularly.
While I was reviewing the old data, I was also collecting a new dataset with ARO on the latest incarnation of FMG. When I completed my study, I looked at the new data. I saw similar 5-second pings updating my goats’ positions in real time. However, there was a slight difference. In the latest version of FMG, each update was utilizing a NEW connection to the Find My Goat servers. One update every 5 seconds is 12 updates per minute, so my three devices were now creating 3*12 = 36 new connections per minute to the Find My Goat servers, where before they held just three. This 12x jump in connectivity might be the problem.
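To make that arithmetic concrete, here is a small back-of-the-envelope sketch. The function name and parameters are my own, purely for illustration; the only inputs taken from the story are the 5-second refresh interval and my three devices.

```python
def new_connections_per_minute(devices: int, refresh_seconds: int) -> int:
    """Connections opened per minute if every update opens a new connection.

    Hypothetical helper for illustration only; not part of FMG or ARO.
    """
    updates_per_minute = 60 // refresh_seconds  # a 5 s refresh gives 12 updates
    return devices * updates_per_minute


# Old behavior: each device holds one long-lived connection.
persistent_connections = 3

# New behavior: one phone + two goat collars, each opening a
# fresh connection on every 5-second update.
per_minute = new_connections_per_minute(devices=3, refresh_seconds=5)

print(persistent_connections)  # 3
print(per_minute)              # 36
print(per_minute * 60)         # 2160 new connections per hour
```

Multiply that by every herd using the app, and the load on the servers adds up quickly.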
I raised this point with the teams on the triage bridge. They looked into it, and indeed discovered that the connection failures were due to the Find My Goat servers running out of IP addresses. The servers were finding multiple users connecting on the same IPs, and rejecting the newer ones, preventing goats (and their worried owners) from connecting to the server.
This is how hackers create a Distributed Denial of Service (DDoS) attack on servers. You get thousands of connections from all over the world hitting a specific set of servers until they exhaust all their connections/ports, and legitimate users can no longer connect. However, in this case, due to a missed test case, a new version of an application (with a passionate user base of goat lovers) was able to instigate a DDoS attack on ITSELF!
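The difference between the two app versions can be reproduced in miniature. The sketch below is only an illustration: I am assuming plain HTTP with keep-alive, and the `/position` path, the handler, and the `count_connections` helper are all made up (I don't know what protocol FMG actually uses). It starts a throwaway local server that counts the connections it accepts, then sends a minute's worth of updates (12) either over one reused connection or over a fresh connection each time.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


class PositionHandler(BaseHTTPRequestHandler):
    """Stand-in for a position server (hypothetical, not FMG's real API)."""

    protocol_version = "HTTP/1.1"  # lets clients keep a connection open
    connections_opened = 0
    lock = threading.Lock()

    def setup(self):
        # One handler instance is created per accepted TCP connection,
        # so counting here counts connections, not requests.
        super().setup()
        with PositionHandler.lock:
            PositionHandler.connections_opened += 1

    def do_GET(self):
        body = b'{"goat": "home"}'
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the demo quiet


def count_connections(updates: int, reuse: bool) -> int:
    """Send `updates` requests and return how many connections the server saw."""
    PositionHandler.connections_opened = 0
    server = ThreadingHTTPServer(("127.0.0.1", 0), PositionHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    host, port = server.server_address
    try:
        shared = http.client.HTTPConnection(host, port) if reuse else None
        for _ in range(updates):
            conn = shared or http.client.HTTPConnection(host, port)
            conn.request("GET", "/position")
            conn.getresponse().read()
            if not reuse:
                conn.close()  # new-version behavior: throw the connection away
        if shared:
            shared.close()
    finally:
        server.shutdown()
        server.server_close()
    return PositionHandler.connections_opened


if __name__ == "__main__":
    print("reused connection:", count_connections(12, reuse=True))
    print("new connection per update:", count_connections(12, reuse=False))
```

The requests and responses are identical in both runs; only the number of connections the server has to accept changes, which is exactly the kind of difference that is easy to miss without a tool that looks at the traffic.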
The moral of this story is: Test your application prior to release. If an issue arises soon after the release of a new version of your app, Occam’s Razor will tell you that perhaps you should look to your own application before placing blame on other parties.
The first weekend in January 2015, the AT&T Developer Summit is kicking off in Las Vegas. Come hear Tiffany Kinkade and me give a talk on “App Performance: What these Apps Did Will Shock You” where we highlight some practices to avoid when building mobile applications!
*Note: The app is not really Find my Goat, and the name of the CEO has been changed to the names of goat breeds to preserve his or her anonymity.