You're looking at the wrong planes
#business I discuss why obvious problems often have non-obvious solutions
During WWII, airplanes were in scarce supply but were critical in fighting a war that was not on the North American continent. Among US industries, aircraft manufacturing ranked number 41 in 1939. While the US would go on to manufacture more than 300,000 planes in the next five years, the country needed to find ways to keep the aircraft they had airborne.
To help keep more planes in the air, the government formed a research team to study the bullet holes on the planes that returned from combat. The logic was that if there were a discernible pattern, the military could reinforce aircraft in the areas that tended to attract the most gunfire. Despite a year-long program of research and reinforcement, the team was dismayed to learn that the enemy shot planes down at the same rate. At this point, Abraham Wald, a mathematician at Columbia University, was brought onto the project to validate the mathematics behind the group's conclusions. After examining the data, Wald found the mathematics sound but noted that there were no bullet holes on the panels covering the engines of the planes that were studied. Bullets surely hit those engines, so where were those planes? Studying only the planes that survived had inadvertently seduced the researchers into drawing the wrong conclusions. The planes they needed to study were missing - they had been shot down.
With an abundance of data, it can be baffling when there's an obvious problem, and yet despite our efforts mining the data, the explanation is nowhere to be found. How could obvious problems have solutions that are so hard to find? I often observe this in product and engineering; we frequently look at the wrong planes.
I've had many experiences with this phenomenon, notably right before Christmas in 2011 while working on the mobile apps at Netflix. Our last release of the year proved problematic: as soon as the new apps were available, customers began flooding Twitter and App Store review channels complaining that they could not log in to the application. Unfortunately, the production telemetry suggested everything was fine - people were logging in at roughly the same rate, within a margin of error. For our part, we could not reproduce the problems customers seemed to be complaining about. Our password rejection rates were within the expected range, and our automation test scripts had found no issues in the login process when we shipped the software. As a result, many of our colleagues believed there were no issues in production, but the mobile team I managed was convinced there was one. The iOS app descended from a solid 4-star rating to just 1.5 stars in a matter of hours. If the problem was so obvious, why couldn't we detect it? We were missing data.
The first step in addressing these types of issues is knowing when to cut your losses and stop looking at your existing data. As the hours went by, I concluded that we needed to understand the customers who were specifically struggling with this login problem. Maybe some people legitimately were mistyping their passwords, but maybe something else was happening. We were missing information beyond our failed login counters. Was there something special about certain passwords? We could not answer any of these questions by interrogating the data we had - it was all encrypted. However, by asking ourselves what data we were missing, we started to understand what the missing plane might look like.
This leads to the second step in the solution: determining the data you need. Controversially, I was convinced the problem had something to do with the passwords themselves, because the vast majority of people were logging in just fine.
Luckily, the team had a few tricks in its back pocket. To launch quickly and operate close to the metal, the team had exposed a Twitter account, @NetflixMobile, in the App Store description (strangely, it still exists to this day but appears to have been dormant since I left). We quickly mobilized and advertised the account in the App Store, and the complaints came flooding into Twitter. As users complained, I engaged our customers through the account and asked them to DM me their password and then change it immediately to see if that solved the problem. In almost every case, it did! Was it the password change that fixed it, or something else? When I stared at the now-changed passwords, I noticed something.
012ae 0f0292acd 0da2a 0f2128e 0ead345
(made up passwords for illustration purposes)
The passwords looked like hexadecimal numbers. It turned out that the team had integrated a string library that would intelligently identify strings that looked like numbers (including those in hexadecimal format) and convert them into integers, which we then passed to our authentication systems. For passwords, we should never have converted the string; we should have passed the user's input through untouched.
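The failure mode can be sketched in a few lines. This is not the actual library the team used - just a hypothetical Python illustration of how a "smart" parsing layer can silently mangle a hex-looking password (the function name and sample values are made up):

```python
def normalize_field(value: str):
    """Hypothetical 'smart' string handling: anything that parses as a
    number (including bare hexadecimal) gets converted to an integer."""
    try:
        return int(value, 0)   # handles prefixed forms like "0x1f"
    except ValueError:
        pass
    try:
        return int(value, 16)  # bare hex like "0da2a" also converts
    except ValueError:
        return value           # everything else passes through unchanged

# A password that happens to look like hex is silently turned into an int,
# so the authentication system never sees what the user actually typed:
print(normalize_field("0da2a"))     # 55850 - an integer, not the password
print(normalize_field("hunter2!"))  # hunter2! - unaffected users log in fine

# The fix: never run password input through the normalizer at all -
# send the raw string straight to the authentication call.
```

This also shows why the bug hid so well: any password containing even one non-hex character sails through untouched, so only the small slice of users with all-hex passwords ever hit the conversion path.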
In the end, the team moved quickly to fix the bug without a new app release, and we saved Christmas for people worldwide eager to stream Netflix on their new phones. Why didn't our existing approaches detect the issue? With millions of customers, users who happened to have a password in this format were rare - we guessed perhaps 1%. Yet that 1% still represented a large number of people in absolute terms, a vocal minority that could have damaged the brand.
We were looking at the wrong planes more than a decade ago, but I learned a valuable lesson that day - one I continue to re-learn periodically, because it's so easy to fall into the trap of thinking you have all the answers simply because you have data, even when it's not the right data. So, if you're going through some struggles at work right now, maybe consider the possibility that you're looking at the wrong planes.