Data Re-Identification and the End of Privacy

The number of websites and apps that require users to enter personal information to access their service, no matter how insignificant or unrelated that information is, is quite disturbing. Recently, I was involved in a cyber forensics case that began when a client innocently entered their address and birthday for a free trial on a subscription-based site, ultimately leading to disaster on an unrelated site millions of web miles away. Why would a digital newspaper want your birthday? Or perhaps you should ask the broader question - why does any site or app ask for personal information that it has no business storing and that is unrelated to the service being provided? Why do apps request access to your camera and microphone but make no use of them? Yet people click through without thinking twice. More importantly, knowing that a site stores your personal data without a specific usage scenario, why agree to hand it over in the first place? Users add profile pictures to accounts simply to display them, without any real understanding of where the pictures will ultimately end up or how they will be used.

The answer to the question of why is founded on the concept of anonymizing your data. Everyone has surely seen verbiage on sites assuring users that they need not be concerned about giving away personal data because, even though it will be stored, it will be stored anonymously. Without getting too technical about the mechanism of storage, this means that when the site adds your information to a database as a new record, it does not use your SSN or other PII (Personally Identifiable Information) as the key to the data. It uses a randomized key instead, commonly a GUID (Globally Unique Identifier). Additional methods are also employed, such as encrypting ID numbers instead of storing them in their original form, known as plaintext. For example, if your SSN was 123-45-6789, it would be encrypted, or simply encoded using some widely accepted method, and stored in the database as something like “MTIzLTQ1LTY3ODk=” (simple Base64). Keep in mind that Base64 is an encoding, not encryption: anyone can reverse it without a key, so it only obscures the value. Storing data this way is meant to give the individual a basic level of privacy. The idea is that if the database were somehow breached, the data associated with the ID would have no associated context, hence no real meaning. For example, if I hacked a server and retrieved some records, I would just end up with rows and rows of birthdays, addresses, pet names, purchasing data, and other data that was being collected but deemed not important enough to encrypt. Although I managed to obtain personal data, without a related identifier or other semantic context, it would most likely be of little value.
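To make the storage idea concrete, here is a minimal Python sketch of what such a record might look like. The field names and the pseudonymize and decode_ssn helpers are my own assumptions for illustration, not any particular site's schema. It also shows why plain Base64 should not be mistaken for encryption: decoding needs no key at all.

```python
import base64
import uuid


def pseudonymize(record):
    """Build the row a site might store (hypothetical schema).

    The row key is a random GUID with no relation to the person, and the
    SSN is Base64-encoded. Base64 is an encoding, NOT encryption.
    """
    encoded_ssn = base64.b64encode(record["ssn"].encode("ascii")).decode("ascii")
    return {
        "id": str(uuid.uuid4()),         # random key instead of the SSN
        "ssn": encoded_ssn,              # obscured, but trivially reversible
        "birthday": record["birthday"],  # left in the clear
        "zip": record["zip"],            # left in the clear
    }


def decode_ssn(stored_row):
    """Reverse the Base64 step; note that no secret key is involved."""
    return base64.b64decode(stored_row["ssn"]).decode("ascii")


# Made-up example values.
row = pseudonymize({"ssn": "123-45-6789", "birthday": "1980-04-12", "zip": "14450"})
print(row["ssn"])       # MTIzLTQ1LTY3ODk=
print(decode_ssn(row))  # 123-45-6789
```

The printed value carries the trailing "=" padding character that standard Base64 adds. Anyone who obtains the row can run the decode step, so real protection for an ID number would need actual encryption or a keyed hash. Notice too that the birthday and zip code sit in the row untouched.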

The process of sanitizing your information with the primary intent of privacy is known as data anonymization. Because the data is encrypted and/or all identifiable information has been stripped out, it becomes possible to move it across boundaries: sharing it between company departments, sending it over the web, or selling it to data mining companies for the purpose of analytics. For example, in the case of medical data, it would allow someone in research to study important trends and perhaps perform disease analytics without knowing who the data ultimately belongs to. In this case, the scientist could obtain huge sets of data containing everything related to the disease being studied, but without names, addresses, or any other PII.

In today's web-based world, the amount of data about you on the web or stored in company-owned databases is tremendous. If you think about the domains that currently have unfettered access to your personal data, in less than a minute you would most likely identify Facebook (including Instagram), Twitter, LinkedIn, Pinterest, the DMV, the Apple or Samsung Store, Google (including YouTube, Google Maps, Gmail, and Google Music), Yahoo!, every mail service you utilize, Pandora, Amazon (in multiple databases), Reddit, Windows Live, Office 365, Adobe, your local government, the IRS, Target, Walmart, Walgreens, Rite-Aid, Wegmans, Netflix, Hulu, Imgur, Verizon, AT&T, T-Mobile, restaurants you frequent, dating sites, pornography sites (if you are naive enough to register), and hundreds of others. In time, a basic list of locations would number in the thousands. Add the companies who purchase your information from data brokers and you may begin to lose sleep.

If you have read this far, your primary concern is probably how well that anonymity holds up, and the purpose of this post is to introduce the concepts of de-anonymization and re-identification. De-anonymization is the reverse process, whereby anonymous data is cross-referenced with other data sources to re-identify the anonymous data source. (Generalization and perturbation, by contrast, are two of the most popular anonymization approaches for relational data: the first coarsens values, such as replacing a birthdate with a birth year, and the second adds noise to them.) Data re-identification is the practice of matching de-identified data with publicly available information, or auxiliary data, in order to discover the individual to whom the data belongs. In other words, it gives meaning and context to data, removes privacy constraints, and breaks any promises made to keep your data safe.
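The cross-referencing step can be sketched in a few lines of Python. This is only an illustration under my own assumptions: the link_records function, the field names, and every record are made up, and real attacks use far larger datasets and fuzzier matching. The idea is simply to join an "anonymous" dataset to a named auxiliary one on the attributes they share.

```python
def link_records(anonymous_rows, auxiliary_rows, keys=("gender", "birthdate", "zip")):
    """Match anonymous rows to named auxiliary rows on shared attributes.

    Returns {anonymous id: name} for rows with exactly one candidate.
    """
    # Index the named (auxiliary) data by the shared attributes.
    index = {}
    for aux in auxiliary_rows:
        index.setdefault(tuple(aux[k] for k in keys), []).append(aux["name"])

    matches = {}
    for row in anonymous_rows:
        candidates = index.get(tuple(row[k] for k in keys), [])
        if len(candidates) == 1:  # a unique match means the row is re-identified
            matches[row["id"]] = candidates[0]
    return matches


# Made-up "anonymized" records: no names, only a random-looking id.
anonymous = [
    {"id": "a1", "gender": "F", "birthdate": "1975-03-02", "zip": "14534", "diagnosis": "asthma"},
    {"id": "a2", "gender": "M", "birthdate": "1988-11-20", "zip": "14450", "diagnosis": "diabetes"},
    {"id": "a3", "gender": "M", "birthdate": "1990-01-01", "zip": "14450", "diagnosis": "flu"},
]

# Made-up auxiliary data, standing in for any public or leaked list with names.
auxiliary = [
    {"name": "Jane Doe", "gender": "F", "birthdate": "1975-03-02", "zip": "14534"},
    {"name": "John Roe", "gender": "M", "birthdate": "1988-11-20", "zip": "14450"},
    {"name": "Sam Poe", "gender": "M", "birthdate": "1990-01-01", "zip": "14450"},
    {"name": "Alex Poe", "gender": "M", "birthdate": "1990-01-01", "zip": "14450"},
]

reidentified = link_records(anonymous, auxiliary)
print(reidentified)  # {'a1': 'Jane Doe', 'a2': 'John Roe'}
```

Record a3 stays anonymous only because two people in the auxiliary list share the same attributes. The moment a third attribute separates them, that row falls too.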

These techniques are major concerns due to the amount of data that is being released by and to various outlets. If you only have a sparse amount of information in several places, the chances of someone connecting data and discovering relationships between different sources are relatively low. If you read my post from yesterday, you would see how difficult it would be to create any type of relationship graph with such data. However, as a general rule, as the amount of data released grows, even though it may be anonymous, it becomes increasingly easy to make connections between the data and ultimately tie it back to an individual. Finally, every time a company storing your private data is breached, the chances of connecting that data to other databases, regardless of location, continue to increase.

If you have ever played the game Jenga, think of all the blocks stacked on top of each other as companies that are storing your private data. If the tower gets tall enough, all it takes is one little mistake and the entire thing comes tumbling down. If a single company is breached and suffers the loss of anonymous data, that data may contain the key that unlocks and makes sense of a significant amount of formerly unrelated information. Since a tremendous amount of data is already in the public domain, the problem becomes a large puzzle that is missing a single piece. Information in the public domain, even if it is anonymized, may be re-identified in combination with other pieces of available data and some basic algorithms. On the positive side, this is one of the techniques used for tracking criminals, sex offenders, and the distribution of illegal weapons and narcotics. On the downside, it is used to make sense of the many large repositories of personal data, which could then be tied to every other piece of data you produce in the future. With the amount of “big data” flooding the web and application providers, re-identification is becoming steadily easier. This is due to the abundance and constant collection and analysis of information, along with the evolution of technologies and the advancement in algorithms for connecting this data.

One widely cited study found that over 80% of the U.S. population can be uniquely identified using a combination of their gender, birthdate, and zip code. This is more than a serious threat. Once again, to continue the theme from yesterday - we are not the consumer anymore, but we are the data. The next time you are bored, just hang out at the local pharmacy and within minutes, you will hear the pharmacist ask the next customer for their name and birthday. From this, you could easily guess their zip code (which is either the same as yours or that of a nearby town) and you would have enough to start the re-identification process.
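You can measure this kind of exposure on any dataset you hold. The Python sketch below, using made-up records and my own hypothetical helper names (unique_fraction and generalize_birthdate), counts the fraction of rows that are the only one with their combination of gender, birthdate, and zip code. It then shows how coarsening the birthdate to a year lowers that fraction.

```python
from collections import Counter


def unique_fraction(rows, keys=("gender", "birthdate", "zip")):
    """Fraction of rows whose combination of `keys` appears exactly once."""
    counts = Counter(tuple(r[k] for k in keys) for r in rows)
    unique = sum(1 for r in rows if counts[tuple(r[k] for k in keys)] == 1)
    return unique / len(rows)


def generalize_birthdate(rows):
    """Return copies of the rows with the birthdate coarsened to its year."""
    return [{**r, "birthdate": r["birthdate"][:4]} for r in rows]


# Made-up records for illustration only.
people = [
    {"gender": "F", "birthdate": "1975-03-02", "zip": "14534"},
    {"gender": "F", "birthdate": "1975-09-15", "zip": "14534"},
    {"gender": "M", "birthdate": "1988-11-20", "zip": "14450"},
    {"gender": "M", "birthdate": "1990-01-01", "zip": "14450"},
    {"gender": "M", "birthdate": "1990-01-01", "zip": "14450"},
]

print(unique_fraction(people))                        # 0.6
print(unique_fraction(generalize_birthdate(people)))  # 0.2
```

In this toy data, three of the five people are unique on the full birthdate, but only one remains unique once only the birth year is kept. That is the trade-off a data holder faces: coarser data is safer to release, but less useful for analysis.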

Although state laws are beginning to protect the consumer in more ways, it may be too late. In the last couple of years, the explosion in big data has scattered your personal information all over the web like shrapnel. The fact remains - it only takes a couple of scraps of data about a person to begin piecing together the puzzle that could eventually grant an attacker access to that person's medical records, bank accounts, social media connections, cell phone bills, pharmacy history, private activity from a dating site, and just about any other database in which they exist.

In the race between technology and the law, technology is the hare and the law is the tortoise. Unfortunately, we are simply spectators standing on the sidelines, scratching our heads and wondering when it all went wrong. We have become too desensitized to worry about privacy.

Outstanding post. Am in work developing a hyper personalized life coach app, which of course needs all the personal data it can get to be most effective; yet our company wants to seriously protect the data and develop a trust relationship with the user (monetizing the app does not depend on the personal data). Your insights to our users to severely limit public accessible data and therefore help prevent de-anonymizing encrypted information is most helpful.

By Michael Richardson