Data Modeling in Times of GDPR
This article is based on a presentation held at Data Modeling Meetup Munich.
As you may have noticed from your inbox, the General Data Protection Regulation (GDPR) became applicable a few weeks ago. This article first gives you some background information on the GDPR (what this GDPR thing is all about and why it can be a problem for many organizations) and then suggests some approaches for dealing with it from a data modeling/data architecture perspective.
Background
GDPR
GDPR is short for General Data Protection Regulation, which is in turn short for REGULATION (EU) 2016/679 OF THE EUROPEAN PARLIAMENT AND OF THE COUNCIL of 27 April 2016 on the protection of natural persons with regard to the processing of personal data and on the free movement of such data, and repealing Directive 95/46/EC (General Data Protection Regulation). For obvious reasons, the abbreviation is used more often than the full name.
On 25 January 2012, the European Commission published the first proposal for what would later become the GDPR. After extensive consultations and negotiations, on 15 December 2015, the European Parliament, the European Council and the European Commission agreed on a joint proposal. The GDPR came into effect on 24 May 2016 and, after a two-year grace period, became applicable on 25 May 2018. Many organizations only dealt with the new regulation at the very last moment, leading to overflowing inboxes in May of 2018.
This article can only scratch the surface of what the GDPR is all about. If you want to know all the details and/or have too much free time, please look here for the full English text.
Personal Data
The GDPR regulates the processing of personal data. In Article 4, it defines personal data as “any information relating to an identified or identifiable natural person”. In this context, natural persons (i.e., living people) are also called data subjects.
As far as the GDPR is concerned, a data subject is identifiable if this person “can be identified, directly or indirectly, in particular by reference to an identifier such as a name, an identification number, location data, an online identifier or to one or more factors specific to the physical, physiological, genetic, mental, economic, cultural or social identity of that natural person”.
This means that data is not only considered personal by the GDPR if it has your name or social security number on it but in any case where it is sufficiently specific to allow for the conclusion that this data is about you.
Data Subjects’ Rights
As a data subject, you have several rights concerning your personal data:
- right of access (Article 15),
- right to rectification (Article 16),
- right to erasure (sometimes called right to be forgotten, Article 17),
- right to restriction of processing (Article 18),
- right to data portability (Article 20), and
- right to object (Article 21).
While these rights are probably a nice thing to have, they can mean a lot of work for the organizations that process personal data (i.e., all organizations).
Problem
All these data subjects’ rights can be a problem for your organization because personal data is basically everywhere:
- operative systems,
- data warehouses and data marts,
- data lakes,
- analyses (i.e., Excel files),
- the organization’s website, and
- social media,
among other places.
This means that if someone contacted your organization to exercise their right of access or their right to erasure, you’d better have a way to find their personal data in all these places (and make sure it is theirs and not someone else’s before you give it to them or delete it from your systems).
Possible Solutions
There are several approaches from the data architecture/data modeling realm that can make it easier for an organization to find and identify your data (or to obfuscate it sufficiently so that it isn’t your data anymore but still useful to them).
Metadata
An important factor in all of this is metadata, often (somewhat unhelpfully) defined as data about data.
This includes data lineage, i.e., documenting where data elements came from, how they were transformed along the way, and where (and by whom) they will be used. Data lineage gives you traceability, which is very helpful in the GDPR context: you can tell where the data is, why it is how it is, and who else got this data.
Some special categories of personal data are more sensitive than others. This includes data “revealing racial or ethnic origin, political opinions, religious or philosophical beliefs, or trade union membership, and the processing of genetic data, biometric data for the purpose of uniquely identifying a natural person, data concerning health or data concerning a natural person’s sex life or sexual orientation” (Article 9). As part of your metadata, you should mark all data elements that belong to one of these categories and deal with them accordingly.
In addition, it is always a good idea to document why you think there is some lawfulness of processing in what you are doing with certain sets of data (Article 6). As part of your metadata, you should store the reason why you are allowed to use the data (consent, legal obligation, etc.) and the purposes for which you may use it.
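As a minimal sketch of what such metadata could look like, here is a hypothetical record for a single data element. The field names (`source`, `lawful_basis`, `purposes`, etc.) are illustrative choices, not anything prescribed by the GDPR itself:

```python
from dataclasses import dataclass, field

# Hypothetical metadata record for one data element.
# Field names are illustrative, not prescribed by the GDPR.
@dataclass
class DataElementMetadata:
    name: str                       # e.g. "customer.date_of_birth"
    source: str                     # lineage: where the element came from
    transformations: list = field(default_factory=list)  # lineage: what happened to it
    consumers: list = field(default_factory=list)        # lineage: who uses it
    special_category: bool = False  # Article 9 data needs extra protection
    lawful_basis: str = ""          # Article 6: consent, contract, legal obligation, ...
    purposes: list = field(default_factory=list)         # what it may be used for

meta = DataElementMetadata(
    name="customer.date_of_birth",
    source="CRM system",
    transformations=["converted to ISO 8601"],
    consumers=["marketing data mart"],
    lawful_basis="contract",
    purposes=["age verification"],
)
```

With records like this in place, answering "where is this data, why do we have it, and who got it?" becomes a lookup instead of an archaeology project.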
Surrogate Keys
Personal data should never be used to identify or link data elements. In database terms, it is not recommended to use personal data as primary or foreign keys. Use surrogate keys instead. There are several kinds of surrogate keys, for example:
- sequences,
- hashes, and
- UUIDs (often called GUIDs).
It may pay off to invest some time thinking about which kind of surrogate key to use and in what way. Hashes and some UUID versions might be too close to the actual personal data: name-based UUIDs (versions 3 and 5) are derived from an input value, so if you know the original data element and the algorithm, you can easily find the corresponding hash or UUID value. In addition, if the surrogate key becomes widely used in your organization, you might accidentally have created a new type of personal data and be back where you started.
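The difference between these key kinds is easy to demonstrate. In this sketch (the email address is just an example of personal data), the hash is fully deterministic, which is exactly the weakness described above, while a version 4 UUID is not derived from the input at all:

```python
import hashlib
import itertools
import uuid

# Three kinds of surrogate keys for the same natural key.
natural_key = "jane.doe@example.com"  # example personal data

# 1. Sequence: a meaningless number that reveals nothing about the person.
sequence = itertools.count(start=1)
seq_key = next(sequence)

# 2. Hash: deterministic -- anyone who knows the natural key and the
# algorithm can recompute it, so it stays "close" to the personal data.
hash_key = hashlib.sha256(natural_key.encode()).hexdigest()

# 3. Random UUID (version 4): not derived from the input at all.
uuid_key = uuid.uuid4()

# The weakness of the hash: it is trivially reproducible from the input.
assert hash_key == hashlib.sha256(natural_key.encode()).hexdigest()
```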
Key Tables
Many newer data modeling methodologies (e.g., data vault and anchor modeling) include key tables that can make it easier to find and identify your data elements. These tables are used as integration points for data about the same thing that comes from different sources. While data vault usually includes both a surrogate key and the natural key in these tables (called hubs here), other methodologies (like anchor modeling) just include a surrogate key.
For GDPR purposes, it might make sense to follow this approach to its logical extreme and create a key table for all natural persons your organization deals with (customers, vendor representatives, employees, their dependents, ...). Then, all personal data is linked to this person key table and you can tell exactly which data subjects might come to you and want to exercise their rights.
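A person key table can be sketched in a few lines. This toy version (table and function names are made up for illustration) maps each natural key to one stable surrogate key, so every system that stores personal data can link back to a single integration point:

```python
import uuid

# A minimal person key table (a "hub" in data vault terms):
# it maps each natural key to a stable surrogate key.
person_hub = {}  # natural key -> surrogate key

def get_person_key(natural_key: str) -> str:
    """Return the surrogate key for a person, registering them on first sight."""
    if natural_key not in person_hub:
        person_hub[natural_key] = str(uuid.uuid4())
    return person_hub[natural_key]

# All personal data, wherever it lives, links back to this one key.
key1 = get_person_key("customer-4711")
key2 = get_person_key("customer-4711")  # same person, same key
```

When a data subject shows up, the hub tells you whether you know them at all, and the surrogate key is the thread you pull to find their data everywhere else.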
Splitting Tables
If people invoke their rights to erasure or restriction of processing, it could be helpful if the data is already split into different tables depending on how personal (or otherwise special) it is:
- personal and non-personal data,
- permanently and temporarily stored data (i.e., data that you have to delete at some point), or
- regular and special-categories data (see above).
This approach could also be followed to its logical extreme with anchor modeling or, more generally, binary tables where there is a dedicated table for each attribute. That way, you could annotate, historize (and, if necessary, correct or delete) your data at the attribute level.
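As a sketch of the attribute-level extreme (all names and values here are made up), each attribute lives in its own table keyed by the person's surrogate key, so an erasure or restriction request touches exactly one table per attribute:

```python
# One table per attribute, each keyed by the person's surrogate key.
attribute_tables = {
    "name":          {"p1": "Jane Doe"},
    "city":          {"p1": "Munich"},
    "health_status": {"p1": "..."},  # special-category data (Article 9)
}

def erase_attribute(person_key: str, attribute: str) -> None:
    """Honor an erasure request for a single attribute of a single person."""
    attribute_tables[attribute].pop(person_key, None)

# Delete only the special-category attribute; everything else stays intact.
erase_attribute("p1", "health_status")
```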
Historization
The GDPR might be the final nudge to make you realize that for proper historization, you need more than one timeline. Only with bitemporal (or, if you want to be absolutely precise, tritemporal) data can you tell at all times what data you stored about someone at which point in time.
With bitemporal data, you have two timelines: state time (when it happened in the real world) and assertion time (when you said it happened at that time in the real world). With tritemporal data, you have three: state time, speech act time (when someone said it happened at that time) and transcription time (when you stored in the database that someone said it happened at that time).
This kind of multitemporal storage can be helpful in a variety of use cases:
- error correction (both from your side and via the right to rectification),
- late-arriving data,
- advance planning, and
- time travel (how did the numbers for 2017 look in May of 2018?).
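The time-travel case can be sketched with two timelines and a handful of made-up figures. Each fact carries a state time (the year it describes) and an assertion time (when we recorded it), and an "as of" query replays what we believed at a given date:

```python
from datetime import date

# Bitemporal facts: (state_year, asserted_on, value).
# The revenue figures are invented for illustration.
revenue_facts = [
    (2017, date(2018, 1, 15), 100),  # first figure reported for 2017
    (2017, date(2018, 7, 1),  105),  # corrected later
]

def as_of(state_year: int, as_of_date: date) -> int:
    """Time travel: what did we believe about state_year on as_of_date?"""
    candidates = [(asserted, value)
                  for (year, asserted, value) in revenue_facts
                  if year == state_year and asserted <= as_of_date]
    return max(candidates)[1]  # the latest assertion known at that time

# "How did the numbers for 2017 look in May of 2018?"
may_2018_view = as_of(2017, date(2018, 5, 31))  # before the correction
current_view = as_of(2017, date(2019, 1, 1))    # after the correction
```

Because the correction was asserted only in July 2018, the May 2018 view still returns the original figure, while the current view returns the corrected one.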
If your head isn’t already spinning, there is a book by Tom Johnston called Bitemporal Data you might want to look into.
Anonymization
There is an alternative to using actual personal data and having to deal with all this GDPR stuff: anonymization. If it’s not possible to identify the data subject anymore, you won’t have to delete the data and can more or less do everything with it that you ever wanted to. It’s also possible to anonymize personal data instead of erasing it.
Among other things, the GDPR introduces the concept of privacy by design. In practice, this means that you work with anonymized data whenever possible and only switch to actual personal data when it is absolutely necessary. This reduces the amount of personal data you produce (and that could get lost or would have to be deleted later).
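A deliberately naive sketch of the idea: drop the direct identifiers and generalize the quasi-identifiers (exact age becomes an age band, city becomes a region), keeping only the value you actually want to analyze. All names and values are invented, and real anonymization needs much more care (k-anonymity, suppression of small groups, and so on):

```python
# Naive anonymization: drop identifiers, generalize quasi-identifiers.
# NOT sufficient on its own for real-world anonymization.
def anonymize(record: dict) -> dict:
    decade = record["age"] // 10 * 10
    return {
        "age_band": f"{decade}-{decade + 9}",
        "region": "Bavaria" if record["city"] in ("Munich", "Nuremberg") else "other",
        "revenue": record["revenue"],  # the value we actually want to analyze
    }

anonymized = anonymize(
    {"name": "Jane Doe", "age": 34, "city": "Munich", "revenue": 1200}
)
```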
Virtualization
One last approach to help you deal with the GDPR is data virtualization. By using database views or special virtualization tools, you can store personal or otherwise sensitive data just once, in some central location where you can keep an eye on it. Copies that don’t exist don’t use up any space, won’t have to be corrected, and won’t have to be deleted.
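The view idea can be sketched outside a database too (the store and the consumer are invented for illustration): consumers call a projection over the single central store instead of keeping their own extract, so a correction or deletion happens in exactly one place:

```python
# The single central store for personal data (illustrative).
central_store = {
    "p1": {"name": "Jane Doe", "city": "Munich", "revenue": 1200},
}

def marketing_view():
    """A projection exposing only what marketing needs -- computed on demand,
    so no standing copy of the personal data exists."""
    return {key: {"city": rec["city"], "revenue": rec["revenue"]}
            for key, rec in central_store.items()}

# A correction in the central store is immediately visible in the view.
central_store["p1"]["city"] = "Berlin"
```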
Conclusion
Hopefully, this article has shown that while the GDPR might demand a lot from you in terms of data subjects’ rights and other legal obligations, there are many approaches and tools available to make dealing with them easier. They might even change your data architecture and your data quality for the better.
If you have any questions, suggestions or remarks, please feel free to comment.