Data Security
As data engineers or data architects, we need to understand data security. Whether you work on Azure, AWS, Snowflake or Databricks, you will need to implement it.
First up it's RBAC, or Role Based Access Control.
Not all users should be able to access all tables, so we need to restrict which users can access which tables. Granting access directly to individual users is called "User Based Access Control" or UBAC, and it should be avoided because it becomes hard to manage as the number of users grows. Instead we use "Role Based Access Control". A group of users is assigned to a role, such as "Data Analyst", and it is this role that is given access to certain tables. RBAC is more flexible than UBAC, and simpler to manage.
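To make the idea concrete, here is a minimal sketch of RBAC in plain Python. The role names, user names and table names are all made up for illustration; a real platform would express this with its own GRANT statements rather than dictionaries.

```python
# Hypothetical roles, users and tables, for illustration only.
ROLE_TABLES = {
    "data_analyst": {"sales", "customers"},
    "data_engineer": {"sales", "customers", "raw_events"},
}

# Users are assigned to roles; tables are never granted to a user directly.
USER_ROLES = {
    "alice": {"data_analyst"},
    "bob": {"data_engineer"},
}


def can_access(user: str, table: str) -> bool:
    """Return True if any role assigned to the user grants the table."""
    return any(
        table in ROLE_TABLES.get(role, set())
        for role in USER_ROLES.get(user, set())
    )


print(can_access("alice", "sales"))       # True
print(can_access("alice", "raw_events"))  # False
```

Notice that onboarding a new analyst only means adding one entry to USER_ROLES; the table grants do not change.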
Next it's privilege.
Privilege is about WHAT can be done to an object. For example, INSERT into a TABLE, MODIFY a DATABASE, MANAGE a COMPUTE CLUSTER, and so on. A role can have different privileges on different objects.
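One way to picture this is a lookup keyed by role and object, where the value is the set of privileges. The sketch below assumes hypothetical role and object names; the privilege names simply echo the examples above.

```python
# (role, object) -> privileges. All names are hypothetical.
GRANTS = {
    ("data_analyst", "TABLE sales"): {"SELECT"},
    ("data_engineer", "TABLE sales"): {"SELECT", "INSERT"},
    ("platform_admin", "COMPUTE CLUSTER etl"): {"MANAGE"},
}


def has_privilege(role: str, obj: str, privilege: str) -> bool:
    """Check whether a role holds a given privilege on a given object."""
    return privilege in GRANTS.get((role, obj), set())


print(has_privilege("data_engineer", "TABLE sales", "INSERT"))  # True
print(has_privilege("data_analyst", "TABLE sales", "INSERT"))   # False
```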
Discretionary Access Control or DAC
DAC means that the owner of an object has the power to grant access to that object.
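A small sketch of the DAC rule, assuming a hypothetical SecuredObject class: only the owner of the object is allowed to grant privileges on it.

```python
class SecuredObject:
    """Hypothetical object (table, view, ...) with an owning role."""

    def __init__(self, name, owner):
        self.name = name
        self.owner = owner
        self.grants = {}  # grantee role -> set of privileges

    def grant(self, grantor, grantee, privilege):
        # DAC: only the owner decides who gets access to the object.
        if grantor != self.owner:
            raise PermissionError(f"{grantor} does not own {self.name}")
        self.grants.setdefault(grantee, set()).add(privilege)


sales = SecuredObject("sales", owner="data_engineer")
sales.grant("data_engineer", "data_analyst", "SELECT")  # allowed
print(sales.grants)  # {'data_analyst': {'SELECT'}}
```

Calling sales.grant("data_analyst", "intern", "SELECT") would raise PermissionError, because data_analyst is not the owner.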
User Authentication
Authentication means proving that a user really is who they say they are. The simplest method of user authentication is a user name and password. Alternatively, users can sign in through an identity provider such as Microsoft Entra ID or Google authentication (single sign-on), so they do not need to type a separate password for the data platform.
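For the user name and password case, the sketch below shows one common way to check a password without ever storing it: keep a salted hash and compare against it. It uses only the Python standard library; the iteration count is an assumed value, and the function names are made up for this example.

```python
import hashlib
import hmac
import os
from typing import Optional, Tuple

ITERATIONS = 200_000  # assumed work factor, tune it for your own hardware


def hash_password(password: str, salt: Optional[bytes] = None) -> Tuple[bytes, bytes]:
    """Return (salt, hash). A fresh random salt is created if none is given."""
    if salt is None:
        salt = os.urandom(16)
    digest = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, ITERATIONS)
    return salt, digest


def verify_password(password: str, salt: bytes, expected: bytes) -> bool:
    """Re-hash the supplied password and compare in constant time."""
    _, digest = hash_password(password, salt)
    return hmac.compare_digest(digest, expected)


salt, stored = hash_password("correct horse")
print(verify_password("correct horse", salt, stored))  # True
print(verify_password("wrong guess", salt, stored))    # False
```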
IP Based Security
In Azure or Snowflake you can specify which IP range(s) users are allowed to log in from. This is good practice and increases security. In Snowflake, for example, this is configured via something called a "Network Policy".
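The check behind an IP allow list is simple. Here is a sketch using Python's ipaddress module; the ranges are documentation-only example networks, not real office addresses.

```python
import ipaddress

# Hypothetical allowed ranges (reserved documentation networks).
ALLOWED_NETWORKS = [
    ipaddress.ip_network("203.0.113.0/24"),
    ipaddress.ip_network("198.51.100.16/28"),
]


def is_allowed(client_ip: str) -> bool:
    """Return True if the client IP falls inside any allowed range."""
    address = ipaddress.ip_address(client_ip)
    return any(address in network for network in ALLOWED_NETWORKS)


print(is_allowed("203.0.113.42"))  # True
print(is_allowed("192.0.2.1"))     # False
```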
Data Encryption (at rest)
Data needs to be encrypted both when it is travelling from one system to another ("in transit") and when it is stored in a database or in a file ("at rest"). Encryption uses a key, usually a 256-bit key. Each table and file should be encrypted using a different key, which is created from a "parent key". This way, if there were a security breach and one key was compromised, the intruder would only be able to access one table, and not the others. Keys must be rotated (recreated) regularly.
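To illustrate the parent key idea, the sketch below derives a separate 256-bit key per table from one parent key using HMAC-SHA256. This is a simplification under my own assumptions: real platforms manage their key hierarchy internally and may instead generate random keys and wrap them with the parent key. The sketch only shows the key derivation, not the encryption itself, and the object labels are hypothetical.

```python
import hashlib
import hmac
import os


def derive_object_key(parent_key: bytes, object_name: str) -> bytes:
    """Derive a 256-bit key for one table or file from the parent key."""
    return hmac.new(parent_key, object_name.encode("utf-8"), hashlib.sha256).digest()


parent_key = os.urandom(32)  # 32 bytes = 256 bits
sales_key = derive_object_key(parent_key, "table:sales")
customers_key = derive_object_key(parent_key, "table:customers")

print(len(sales_key) * 8)          # 256
print(sales_key != customers_key)  # True

# Rotation: a new parent key yields new keys for every object.
rotated_parent_key = os.urandom(32)
rotated_sales_key = derive_object_key(rotated_parent_key, "table:sales")
```

Leaking sales_key on its own does not reveal customers_key, which is the point of giving each table its own key.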
Data Encryption (in transit)
When a user accesses a database, the connection between that user and the database often goes across the internet, so it must be encrypted using HTTPS (TLS). Data must also be encrypted during ingestion (say from S3 into the database) and during unloading.
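On the client side, encryption in transit usually comes down to connecting with a properly configured TLS context. A minimal sketch with Python's ssl module follows; requiring TLS 1.2 as the minimum is my assumption, not something a particular platform mandates here.

```python
import ssl


def make_tls_context() -> ssl.SSLContext:
    """Build a client TLS context that verifies the server certificate."""
    context = ssl.create_default_context()  # certificate and hostname checks on
    context.minimum_version = ssl.TLSVersion.TLSv1_2  # assumed minimum version
    return context


context = make_tls_context()
print(context.verify_mode == ssl.CERT_REQUIRED)  # True
print(context.check_hostname)                    # True
```

A context like this can then be passed to a client, for example urllib.request.urlopen(url, context=context).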
Row Level Security
Not every user can access every row in a table; which rows a user can see depends on the role the user belongs to. This is configured via something called a "Row Access Policy".
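Conceptually, a row access policy is a filter that is applied for you based on your role. Below is a sketch in plain Python, with made-up roles, regions and rows; on a real platform the policy lives in the database, not in application code.

```python
# Hypothetical table content.
ROWS = [
    {"order_id": 1, "region": "EMEA", "amount": 120},
    {"order_id": 2, "region": "APAC", "amount": 75},
    {"order_id": 3, "region": "EMEA", "amount": 310},
]

# Hypothetical policy: which regions each role may see.
ROLE_REGIONS = {
    "emea_analyst": {"EMEA"},
    "global_analyst": {"EMEA", "APAC"},
}


def visible_rows(role: str) -> list:
    """Return only the rows the role is allowed to see."""
    allowed = ROLE_REGIONS.get(role, set())
    return [row for row in ROWS if row["region"] in allowed]


print([row["order_id"] for row in visible_rows("emea_analyst")])    # [1, 3]
print([row["order_id"] for row in visible_rows("global_analyst")])  # [1, 2, 3]
```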
Column Security
Not every user can access every column in a table. This is configured via a secure view. And even if a user can access a column, they might not be allowed to see the value, for example the employee salary or the customer address. So role A can see the salary, but role B can only see *** (the data is masked). Data can be partially masked too: for example, a role can see only the domain part of an email address, or only the year of a date of birth.
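The masking rules described above can be sketched as small functions. The role name hr_manager is hypothetical, and a real platform would apply these rules through its own masking policies rather than in Python.

```python
from datetime import date

# Hypothetical roles that may see the real salary.
UNMASKED_SALARY_ROLES = {"hr_manager"}


def mask_salary(role: str, salary: int) -> str:
    """Full masking: show the value only to permitted roles."""
    return str(salary) if role in UNMASKED_SALARY_ROLES else "***"


def mask_email(email: str) -> str:
    """Partial masking: keep only the domain part of the address."""
    _, _, domain = email.partition("@")
    return "***@" + domain


def mask_date_of_birth(dob: date) -> str:
    """Partial masking: keep only the year."""
    return str(dob.year)


print(mask_salary("hr_manager", 52000))        # 52000
print(mask_salary("data_analyst", 52000))      # ***
print(mask_email("jane.doe@example.com"))      # ***@example.com
print(mask_date_of_birth(date(1990, 5, 17)))   # 1990
```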
Data Sharing
When sharing data we need to give permission to external users. Everything above comes into play in order to do that: RBAC, privilege, authentication, encryption, row level security and column level security.
PII (Personally Identifiable Information)
PII means data about a person, such as address, date of birth and mobile number. We need to tag sensitive data such as PII and PHI (Protected Health Information) so we can manage it. Before users (and developers!) are allowed to access sensitive data, sign-off from the DPO (Data Protection Officer) is usually required. Where we store the database (and data files) is a serious consideration. Usually PII has to be stored in the same country as the company's jurisdiction, except in the EU, where it can be stored in any of the 27 member states.
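Tagging is what makes sensitive data manageable, because rules can then be written against the tag instead of against each column. A minimal sketch, with hypothetical column names and a hypothetical helper for the DPO sign-off check:

```python
# Hypothetical column -> tags catalogue.
COLUMN_TAGS = {
    "customers.address": {"PII"},
    "customers.mobile_number": {"PII"},
    "patients.diagnosis": {"PHI"},
    "orders.amount": set(),
}

SENSITIVE_TAGS = {"PII", "PHI"}


def requires_dpo_sign_off(column: str) -> bool:
    """Return True if the column carries any sensitive tag."""
    return bool(COLUMN_TAGS.get(column, set()) & SENSITIVE_TAGS)


def sensitive_columns() -> list:
    """List every tagged sensitive column, sorted by name."""
    return sorted(c for c in COLUMN_TAGS if requires_dpo_sign_off(c))


print(requires_dpo_sign_off("customers.address"))  # True
print(requires_dpo_sign_off("orders.amount"))      # False
```

Note that this sketch treats an untagged or unknown column as not sensitive; in practice you may prefer the opposite default until the column has been reviewed.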
List of my articles: https://lnkd.in/eRTNN6GP