Guest post by Daniel Barth-Jones
For anyone who follows the increasingly critical topic of data privacy closely, it would have been impossible to miss the remarkable chain reaction that followed the New York TLC’s (Taxi and Limousine Commission) recent release of data on more than 173 million taxi rides in response to a FOIL (Freedom of Information Law) request by urbanist and self-described “data junkie” Chris Whong. Not long after the data went public, the sharp eyes and keen wit of software engineer Vijay Pandurangan detected that taxi drivers’ license numbers and taxi plate (or medallion) numbers hadn’t been anonymized properly and could be decoded because the encryption process had failed.
Soon after Pandurangan’s revelation of the botched unsalted MD5 cryptographic hash in the TLC data, Anthony Tockar, working on a summer Data Science internship with Neustar, posted his blog “Riding with the Stars: Passenger Privacy in the NYC Taxicab Dataset” with the aim of introducing the concept of “differential privacy” and announcing Neustar’s expertise in this area. (It’s well worth checking out both Tockar’s short, but informative, tutorial on differential privacy and his application of the method to the maps of the TLC taxi data, as his smartly designed graphics allow you to interactively adjust differential privacy’s “epsilon” parameter and see its impact on the results.)
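The weakness Pandurangan exploited is easy to sketch: an unsalted MD5 hash of a value drawn from a small, structured keyspace can be reversed by simply hashing every possible value and building a reverse-lookup table. The single medallion pattern enumerated below (digit-letter-digit-digit) is illustrative only; the real attack enumerated every valid NYC medallion and license format, which is still only a tiny keyspace:

```python
import hashlib
import string
from itertools import product

def md5_hex(s):
    return hashlib.md5(s.encode()).hexdigest()

# Build a reverse-lookup table for one illustrative medallion pattern,
# digit-letter-digit-digit (e.g. "5X55"): only 26,000 possibilities.
lookup = {}
for d1, letter, d2, d3 in product(string.digits, string.ascii_uppercase,
                                  string.digits, string.digits):
    plain = d1 + letter + d2 + d3
    lookup[md5_hex(plain)] = plain

# Any "anonymized" medallion hash in the released data matching this
# pattern is now reversible with a single dictionary lookup.
recovered = lookup[md5_hex("5X55")]
```

A salted hash, or better still, random pseudonyms with no functional relationship to the original identifiers, would have closed this hole entirely.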
To illustrate possible rider privacy risks for the TLC taxi data, Tockar, armed with some celebrity paparazzi photos and some clever insights as to when, where, and how to find potential vulnerabilities, produced a blog post replete with attention-grabbing tales of miserly celebrities who stiffed drivers on their tips and cyber-stalking strip club patrons, which quickly went viral. And so as to up the fear, uncertainty, and doubt (FUD) factors surrounding his attacks, Tockar further gravely warned us all in his post that:
Equipped with this [TLC Taxi] dataset, and just a little auxiliary information about you, it would be quite trivial for someone to follow your movements, collecting data on your whereabouts and habits, while you remain blissfully unaware. A stalker could find out where you live and work. Your partner may spy on you. A thief could work out when you’re away from home, based on your habits.
However, as I’ll explain in more detail, sorting out these quite concerning claims in a rational fashion, one that enables us to weigh the complex trade-offs between Freedom of Information and open government principles on the one hand and data privacy concerns on the other, requires that we move beyond mere citation of anecdotes (or worse, collections of anecdotes in which carefully targeted, especially vulnerable, non-representative cases have been repackaged as “anecdata”). Instead, we must base our risk assessment on a systematic investigation founded in the principles of scientific study design and statistically representative samples. Regrettably, that wasn’t the case here, and it has quite often not been the case for the many headline-snatching re-identification attacks that have repeatedly made the news in recent years.
The ensuing TLC taxi headlines in the follow-up press for Tockar’s blog (“If you think you’ve anonymized a data set, you’re probably wrong” or “How Big Brother watches you with metadata”) and accompanying Twitter buzz (ranging from “yet another amazing piece on finding out detailed picture of people’s lives from anonymized data” to “It’s virtually impossible to anonymize large data sets”) soon conveyed that the verdict was in on this latest data re-identification attack, and we should all (to paraphrase the hype) “be afraid, be very afraid”.
When Does 99.9999936% Equal Zero Percent?
Somehow, in the mind’s eye of many readers, even though Tockar’s taxi ride re-identifications were selectively focused on a particularly vulnerable and almost unimaginably small proportion of the 173 million rides, his blog was seen as yet another demonstration that “anonymized data really isn’t”. It should be clear, however, that examining a minuscule proportion of cases from a population of 173 million rides couldn’t possibly form any meaningful basis of evidence for broad assertions about the risks that taxi riders might face from such a data release (at least with the taxi medallion/license data removed, as will now be the practice for FOIL request data). Even though no evidence had been presented for at least 99.9999…% of the rides, the wisdom of the crowd as conveyed by Twitter buzz had somehow reached the conclusion that “It’s virtually impossible to anonymize large data sets”. The irony of this wayward conclusion is that, as with most of the famous re-identification attacks which have made headlines (Governor Weld, AOL, Netflix, the Personal Genome Project, etc.), the more accurate conclusion should be “anonymized data really wasn’t”, at least not anonymized to the demanding de-identification standards routinely imposed on health data under the HIPAA Privacy Rule.
On close examination, it appears that these supposed re-identifications from the TLC taxi data may be built more on smoke and mirrors than on any unavoidable privacy threats posed by the released data; at least, that would have been the case if the TLC data had been properly anonymized using rigorous de-identification standards.
Making a Splash with Celebrity Targets and Failed Hash
Tockar was apparently able to re-identify his two celebrity targets (Bradley Cooper and Jessica Alba) by using the cab’s medallion or license plate data which was left exposed by the failed cryptographic hash in the TLC data. Using the failed hash and additional clues on time and location of the pick-ups or drop-offs obtained from celebrity blogs posting photos taken by paparazzi who were tailing the celebs, Tockar was able to fairly easily look up their rides. By exploiting this clear anonymization failure, he showed that he could re-identify two out of 173+ million taxi rides. Not long after this, J.K. Trotter from Gawker then took this a step further and added another nine re-identifications to the celebrity ride re-identification tally.
So, amazingly, by re-identifying a mere 11 celebrity rides out of 173 million (using data which hadn’t been properly de-identified), readers were swayed to conclude that the taxi data demonstrated some innate inability to reliably anonymize data. It should instead have been perceived as the result of a failure to implement appropriate anonymization methods, and as a reminder of a simple fact: if you have packs of paparazzi trailing you and photographing your every move, you just won’t have much privacy. In the Alba case, for example, it seems pretty clear from the photos that she was photographed at both the pick-up and the drop-off of her ride. So her scandalous “privacy loss” attributable to the TLC data boils down to just a questionable insinuation that she failed to tip, considering that cash tips aren’t captured by the TLC data.
The 11 in 173 million risk demonstrated for this celebrity ride re-identification (or 1 in 15,743,614) is truly infinitesimal. To put this in perspective, this risk is over 1,000 times smaller than one’s lifetime risk of being hit by lightning. With proper de-identification applied and the cryptographic hash problem fixed in any future data releases, this spooky specter of celebrity cyber-stalking using TLC taxi data is likely to vanish as soon as one turns on the lights.
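As a quick sanity check of that comparison (the lifetime lightning-strike odds of roughly 1 in 12,000 are a commonly cited estimate, assumed here; the exact ride total is back-calculated from the stated odds):

```python
# Back-of-the-envelope check of the figures above.
reidentified = 11
total_rides = 173_179_754        # implied by the stated 1-in-15,743,614 odds
lightning_lifetime_odds = 12_000  # commonly cited estimate (an assumption)

odds = total_rides / reidentified          # demonstrated re-identification odds
ratio = odds / lightning_lifetime_odds     # how much rarer than lightning
```

With these figures, `odds` works out to roughly 15.7 million to one, which is indeed more than 1,000 times longer odds than the lightning estimate.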
The situation is similar with the purported (but, in my opinion after detailed examination, highly dubious) re-identifications of Hustler Club patron rides. For this next cyber-stalking attack, Tockar mapped out all rides starting near the Hustler Club between midnight and 6 a.m. He then promptly discarded over 80 percent of the rides because they were less than 5 miles in length, recognizing that Manhattan is generally too densely populated to isolate individuals. He proceeded to map the remaining rides and search for clusters of drop-off locations. Through this process he identified 23 drop-off clusters (including a huge “mega-cluster” surrounding Wall Street) which he attributed to individuals who frequented the Hustler Club. What wasn’t mentioned in his blog was the 3,000-seat concert venue and two nightclubs nearby, or the NYC taxi cab stand located at the corner in front of the strip club.
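The filter-and-cluster procedure described above can be sketched in a few lines. The ride records and field layout below are invented for illustration and are not the actual TLC schema:

```python
import math
from collections import defaultdict

# Hypothetical rides: (pickup_hour, trip_miles, dropoff_lat, dropoff_lon)
rides = [
    (1, 7.2, 40.7007, -74.0132),
    (2, 6.8, 40.7009, -74.0130),  # near-duplicate drop-off: forms a cluster
    (3, 1.1, 40.7500, -73.9900),  # short Manhattan hop: discarded
    (4, 9.5, 40.6404, -73.9807),  # isolated drop-off: no cluster
]

def candidate_rides(rides):
    """Keep only late-night (midnight-6am) rides of at least 5 miles."""
    return [r for r in rides if 0 <= r[0] < 6 and r[1] >= 5.0]

def drop_off_clusters(rides, cell=0.002):
    """Bucket drop-offs into a coarse lat/lon grid (~200 m cells);
    any cell receiving 2+ rides becomes a candidate 'home' cluster."""
    cells = defaultdict(list)
    for r in rides:
        key = (math.floor(r[2] / cell), math.floor(r[3] / cell))
        cells[key].append(r)
    return [v for v in cells.values() if len(v) >= 2]

clusters = drop_off_clusters(candidate_rides(rides))
```

Of course, finding such a cluster singles out an individual only when almost no one else lives near the clustered drop-off point, which is precisely where the census-block analysis that follows becomes relevant.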
NYC Taxi Stand in front of the Hustler Club
I’ve examined the U.S. Census Block data (the smallest geographic unit used by the Census) for the blocks surrounding each of these drop-off clusters in some detail. Most of the Census block areas surrounding the clusters have populations of well over 1,000 persons, and all but two of the 23 clusters are likely to have populations of at least a few dozen persons within an easy one-to-two-minute walk from the cluster. (And if Tockar is right that these rides belong to strip club patrons coming home in the middle of the night, it seems wise to suppose that such patrons might well have the cab stop short of their actual residence.)
Having studied Tockar’s allegations in some detail, I personally don’t believe them to be any more than conjecture. But to avoid further promoting any possible privacy intrusions resulting from Tockar’s publicizing this attack (a journalistic and research ethics issue which I’ve written about elsewhere), let’s just go with the assertion that he re-identified someone who was at the Hustler Club.
Still, it’s safe to say that, if Tockar did actually re-identify anyone using this approach, the re-identification risk associated with this demonstration is very small and could only plausibly be achieved by “cherry-picking”: limiting the attack strategy to pick-up and drop-off areas where the residential population density was quite low.
And, of course, given that such attacks are only feasible when one selectively focuses on areas with very low population densities, the proportion of individuals within New York who could possibly be impacted by them must also be very small indeed, simply because New York City’s population density is very high in most areas, which protects the vast majority of people from any possibility of such an attack. In fact, within the New York City limits, more than 90% of the population lives in Census Blocks with at least 50 people residing within an area equivalent to a one-minute walking distance, and more than 99% lives in Census Blocks with at least 16 people within that same area. And because taxi service naturally concentrates toward the areas providing the greatest ride opportunities, taxi rides from one very low population density area to another are comparatively quite rare.
It’s worth further pointing out that, with such minuscule re-identification risks, Tockar’s fear-inducing admonitions that the TLC taxi data would be used by your suspicious partner, a thief, or some other malicious actor out to bring you harm just ring hollow. As was recognized for statistical disclosure risk assessments decades ago, ensuring a very low probability of re-identification success further dissuades data intruders from even attempting an attack, because the chance of success is so remote. It takes some real mental gymnastics to suppose that someone who means you harm will bother to file a FOIL data request with the TLC and typically wait months to get the data in order to achieve their end. If someone is out to get you, your problem isn’t the extremely unlikely event that they’d go through all the effort of using TLC data despite having virtually no chance of success. Your problem is that someone is out to get you, and they’d likely find many easier opportunities than the TLC data to achieve their ends.
Still, the TLC data attacks (like so many data re-identification demonstrations) effectively invoke fear because they succeed with a logical diversion, a probabilistic equivalent of a game of Three Card Monte. You start off contemplating the probability that a given individual (i.e., you) might be targeted in such a data attack; but what you are actually shown is the probability (still remote) that somebody (i.e., anyone among everyone who could be targeted) could be attacked. And even though the revealed risk is exceedingly remote, our brains don’t process it rationally: we have an empirically demonstrated reduced capacity to assess probabilities and respond rationally to risks once fear has been invoked. Having witnessed evidence of a successful attack, our subconscious assessment of the probability of its actually being implemented in the real world may become 100 percent, which is highly distortive of the true risk calculation we face.
Getting Beyond Anecdata
So, how can we avoid undue influence from such targeted, non-representative re-identification demonstration attacks? A first step is to give the highest credence to re-identification studies that use scientifically valid research designs and statistical random sampling methods to ensure their results are representative of the true re-identification risks posed by the data. Had a statistically valid random sample been used instead of selectively targeting especially vulnerable opportunities in the data, it’s very unlikely that any re-identifications would have been demonstrated from the TLC data, given the very high population and taxi densities in most of New York. This isn’t to say that Tockar’s quite clever insights into how the data might be attacked had no value. His focused attacks do provide valuable insight into some potential (but very rare) vulnerabilities, which point us to some fairly straightforward de-identification steps that might be useful for future data releases. But I would contend that, in order to rightly claim the title of “data science” or “re-identification science”, the methods used must also provide some systematic and generalizable means of properly assessing re-identification probabilities for the entire population at risk.
By doing this, the discipline of consistently using statistical and scientific methods to examine data privacy risks can help us separate data privacy facts from folklore. Clearly, the antidote that will get us past anecdote (and its much more deceptive sibling, “anecdata”) is to rest instead on a solid foundation of scientific study design and statistics.
Then, once we are properly armed with solid evidence regarding the typical re-identification risks posed by the data under a variety of relevant data intrusion/data privacy threat models, we’ll be much better equipped to make some challenging, but highly necessary, decisions about possible trade-offs between our values regarding open information and government and the also vitally important issue of protecting individuals’ privacy. Unfortunately, as I’ve written about and illustrated elsewhere, the “unfortunate truth” is that we will inevitably be forced to make trade-offs between data utility and data privacy because of fundamental mathematical constraints.
Fortunately, by performing systematic, scientifically sound risk evaluations and using realistic threat models, we can get past falling prey to simplistic and flawed reasoning about the complex questions which surround the challenge of balancing the public good stemming from Freedom of Information laws and open data initiatives with the equally important issue of substantively protecting against possible data privacy risks and harms.
Daniel Barth-Jones, M.P.H., Ph.D., is a HIV/Infectious Disease Epidemiologist on the faculty at the Mailman School of Public Health at Columbia University. His work in the area of statistical disclosure control and data privacy under the HIPAA Privacy Rule provisions for de-identification is focused on the importance of properly balancing competing goals of protecting patient privacy and preserving the accuracy of scientific research and statistical analyses conducted with de-identified data.
This post has been updated to add detail and fix links.
This suggests what is likely to be a data-utility-preserving solution for protecting against such remote risks. A fairly simple redaction of taxi rides picking up or dropping off in areas with low population densities, while retaining certain high taxi-density areas (such as the adjoining streets at the Times Square pedestrian plaza or Central Park), would leave the vast majority of the taxi data intact. This approach might also be combined with some further well-designed time and location noise-injection perturbations, accounting for the densities of pick-up/drop-off times and locations and their associated correlations with trip distances, fares, etc., in order to further protect against re-identification while producing little meaningful reduction in data utility for the vast majority of rides.
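A minimal sketch of such a redaction rule follows. The thresholds and density figures are invented for illustration; a real release would derive them from Census block populations and observed taxi volumes:

```python
def keep_ride(pickup_pop, dropoff_pop,
              pickup_taxi_volume, dropoff_taxi_volume,
              min_pop=50, min_volume=500):
    """Retain a ride only if each endpoint is either in a reasonably
    populated block or in a high taxi-traffic area (e.g. Times Square),
    where no one ride can be tied to a resident. Thresholds are
    illustrative placeholders, not calibrated values."""
    def endpoint_ok(pop, volume):
        return pop >= min_pop or volume >= min_volume
    return (endpoint_ok(pickup_pop, pickup_taxi_volume)
            and endpoint_ok(dropoff_pop, dropoff_taxi_volume))

# Dense residential pickup, busy low-population drop-off: kept.
kept = keep_ride(800, 30, 10, 2000)
# Both endpoints in sparse, low-traffic blocks: redacted.
redacted = keep_ride(12, 8, 40, 15)
```

Because the clusters discussed earlier arise only at sparsely populated endpoints, a rule of this shape removes exactly the rare vulnerable rides while leaving the bulk of the data untouched.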
As proposed by Tockar, differential privacy provides yet another possible alternative for protecting data privacy, one which often works well for very large data sets such as the TLC taxi data. It is not, however, without important misinterpretations, limitations, and criticisms, one of the most important of which is the critical problem of how to properly set the value of the privacy parameter epsilon in order to reduce the probability of re-identification to some acceptable threshold value.
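For readers unfamiliar with epsilon’s role, here is a minimal sketch of the Laplace mechanism, the workhorse behind most differentially private counting queries: noise with scale sensitivity/epsilon is added to each answer, so a smaller epsilon means more noise and stronger privacy (the query and parameter values below are illustrative):

```python
import math
import random

def laplace_noise(scale):
    """Sample a Laplace(0, scale) variate by inverse-CDF."""
    u = random.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1 - 2 * abs(u))

def private_count(true_count, epsilon, sensitivity=1.0):
    """Release a count with Laplace noise of scale sensitivity/epsilon.
    Adding or removing a single ride changes a ride count by at most
    `sensitivity`, which is what calibrates the noise."""
    return true_count + laplace_noise(sensitivity / epsilon)

# Smaller epsilon -> noisier answers: compare typical error magnitudes
# for a count of 1,000 rides at a loose vs. a strict privacy setting.
random.seed(0)
loose = [abs(private_count(1000, epsilon=1.0) - 1000) for _ in range(500)]
strict = [abs(private_count(1000, epsilon=0.01) - 1000) for _ in range(500)]
```

The tension Barth-Jones notes is visible directly in the scale formula: driving epsilon toward zero makes the released counts arbitrarily noisy, so choosing an “acceptable” epsilon is a policy question, not a purely technical one.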
Statisticians involved in data disclosure risk assessments have been using such threat-modeling approaches for at least the past 15 years. See Elliot, M. and Dale, A., “Scenarios of attack: the data intruder’s perspective on statistical disclosure risk.”