We are in the early phase of an era in which, thanks to the growth of cloud technologies, the capacity to store, process, analyze and exploit data has become a gold mine. Data has become the most valuable asset for companies, and many enterprises create value with Big Data, Data Science, IoT or Machine Learning. Even though these fields have become outstandingly popular, there are still many entry barriers and challenges for companies that try to use cloud environments to create business value. Data privacy and security compliance requirements, as well as a general mistrust towards public cloud service providers, are obstacles that must be overcome in order to leverage cloud environments securely.
In this three-part article series we will address the question of how European enterprises can meet data security and privacy requirements and still leverage the opportunities of public clouds and hyperscalers for big data analytics. To achieve this we will focus on data de-identification and apply data governance techniques.
In the first part of the series, we will cover the foundations for this challenge. We will find out where the problem originates and what impact it has on European enterprises, and define which strategies can be applied to leverage the public cloud for big data analytics and general cloud adoption.
The second part of this article series applies the theoretical knowledge framework and presents a real life example of possible solutions. We will cover how data security and privacy can be achieved in public or private cloud environments with cloud native and open source technologies.
Finally, in the third part of the series we will present an example implementation of the open source approach. We will take a deep dive into data de-identification, data cataloging and de-identified information in data lakes.
Before we begin, let’s first focus on understanding the problem. Almost every company in Europe faces the same challenges regarding data security and privacy compliance when it comes to public cloud adoption. Especially in big data analytics, security and privacy goals conflict with the utility of the data for analytics [TOM19]. Therefore the following question arises:
“How can data be protected in big data cloud environments while enabling a maximum of processing functionality and minimizing performance constraints as well as utility loss?” [BON20]
As explained by Bondel et al. in [BON20], this problem originates from the general mistrust towards the public cloud service provider (abbr. CSP) and is reinforced by the introduction of data protection laws like the General Data Protection Regulation (abbr. GDPR) and the Clarifying Lawful Overseas Use of Data Act (abbr. CLOUD Act). For the sake of completeness let’s recap what the GDPR and the CLOUD Act are and how they affect European enterprises.
In simple terms, GDPR is a legal framework for the collection and processing of personal information of individuals in the European Union [GDP21]. GDPR especially highlights the importance of protecting personally identifiable information (abbr. PII). The CLOUD Act is a United States federal law that allows US authorities to access company and customer data from cloud and communication providers when the provider is based in the USA or is subject to US law [WIK21]. With this in mind, European companies may be held accountable for data access requests by US authorities, even though they are compliant with GDPR in the first place [SSC18].
Therefore European enterprises are facing a critical decision: either adopt and leverage new technologies, like hyperscalers and public cloud environments, or risk becoming less competitive due to a lack of agility.
The cost and flexibility advantages of public clouds cannot be ignored, so forgoing them is not a desirable option. Therefore, this article works out a concept to tackle this problem and enable European companies to leverage public cloud environments and compete in the market. In doing so we will follow a zero trust approach, protecting sensitive data from all third parties such as CSPs, government authorities and third-party attackers.
Data Privacy and Security Approaches
There are several approaches covering data privacy and security compliance for public clouds. They can be divided into four categories:
- Data security against external attackers
- Privacy enabling through cryptographic encryption
- Data splitting in sensitive and non-sensitive datasets
- Data de-identification / anonymization
As described in [DOM19] and [BON20], securing data against external attackers does not protect the data against the CSP and government authorities and therefore violates our zero trust approach. Enabling privacy through cryptographic encryption is a fundamental part of data security and privacy. However, encrypted data needs to be decrypted for processing, and a leaked encryption key can lead to critical exposure of the sensitive data. Furthermore, splitting data into sensitive and non-sensitive datasets may introduce utility loss and organisational overhead, since the sensitive data has to be processed in a trusted environment such as on-prem or in a private cloud. Finally, the data de-identification and anonymization approach is the most promising one for maximizing cloud processing functionality and minimizing performance constraints as well as utility loss [BON20, DOM19].
It is important to mention that all of the above approaches are relevant and necessary for meeting data privacy compliance requirements. However, most scientific and practical works do not follow the zero trust approach as proposed in this article [BON20, DOM19]. Therefore, for the following case studies of this article we will assume that general data security and privacy practices are in place and focus on how to complement these with data de-identification and anonymization to deliver value for the zero trust approach.
Before we continue, let us distinguish between data anonymization and de-identification. In data anonymization, all PII is removed or replaced by placeholder strings [EDU15]. It is not possible to re-identify the actual PII from the anonymised data. Example:
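As a minimal sketch of anonymization, the snippet below replaces every PII field of a record with the same placeholder string. The record layout, the `PII_FIELDS` set and the `<REDACTED>` placeholder are assumptions made for this illustration only.

```python
# Minimal anonymization sketch. Field names and placeholder are hypothetical.
PII_FIELDS = {"name", "email"}  # assumed to be known to contain PII
PLACEHOLDER = "<REDACTED>"


def anonymize(record: dict) -> dict:
    """Return a copy of the record with every PII value replaced by a placeholder.

    The original values are discarded, so nothing in the output allows
    them to be recovered or linked back to a person.
    """
    return {
        field: (PLACEHOLDER if field in PII_FIELDS else value)
        for field, value in record.items()
    }


record = {"name": "Jane Doe", "email": "jane.doe@example.com", "purchase_total": 129.90}
anonymized = anonymize(record)
print(anonymized)
```

Because every person ends up with the same placeholder, two records of the same customer can no longer be related to each other, which is exactly what limits the analytical value of purely anonymised data.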
Data de-identification removes PII from the dataset by applying masking, tokenization or other de-identification techniques, but preserves the relationship between the data set and the data containing the PII. De-identified data may be re-associated with the data set at a later time. It also allows us to easily apply machine learning techniques and pursue our big data objectives. Example:
Please be aware that the terms de-identification, anonymization, tokenization, pseudonymization and others refer to different data treatments and should not be used interchangeably. To prevent confusion we will use the term data de-identification to refer to the process of removing PII from data while preserving the relationship between the data set and the data containing the PII. For further reading, see the Guideline for Data De-Identification or Anonymization as well as other publications covering de-identification techniques.
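A hedged sketch of de-identification could look as follows: each PII value is replaced by a deterministic token, and the link between token and original value is kept separately. The key, the `tok_` prefix, the field names and the in-memory `vault` dictionary are assumptions for this illustration; in practice the key and the mapping would have to stay in a trusted environment.

```python
import hashlib
import hmac

# Hypothetical values for illustration only.
SECRET_KEY = b"demo-key-kept-in-a-trusted-environment"
PII_FIELDS = {"name", "email"}


def tokenize(value: str, key: bytes = SECRET_KEY) -> str:
    """Derive a deterministic token from a PII value with a keyed hash (HMAC-SHA256)."""
    digest = hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return "tok_" + digest[:16]


def de_identify(record: dict, vault: dict) -> dict:
    """Replace PII values by tokens and remember token -> original value in the vault."""
    result = {}
    for field, value in record.items():
        if field in PII_FIELDS:
            token = tokenize(value)
            vault[token] = value  # this mapping must never leave the trusted environment
            result[field] = token
        else:
            result[field] = value
    return result


vault = {}
order_1 = de_identify({"email": "jane.doe@example.com", "total": 20.0}, vault)
order_2 = de_identify({"email": "jane.doe@example.com", "total": 35.5}, vault)

# The same person yields the same token, so joins and aggregations still work.
same_customer = order_1["email"] == order_2["email"]
original_email = vault[order_1["email"]]
```

In contrast to anonymization, the tokens keep records of the same person linkable, and the separately stored mapping allows the data to be re-associated with the PII later on.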
So we have narrowed down the problem and identified the most promising approach to tackle this challenge. In the next section, let’s create a framework for data de-identification for big data analytics on cloud environments and reduce the risk of sensitive data exposure with a zero trust approach.
Data De-identification Framework
So is data de-identification the magic key that enables full data utility without any downside? Almost, but not on its own. In practice it requires more components than just the de-identification itself. Thinking about the problem raises questions like: How do you de-identify your data? Which fields should be de-identified, and what should the de-identified data look like so that it is still usable for big data analytics in the public cloud? How do you track de-identification changes to your data, and is it possible to re-identify the data if needed?
Let's try to identify the core components. The foundation for all these questions is data governance. You will find many great articles about data governance and its relevance for big data analytics out there. For our case we will refer to the Google Cloud Platform definition of data governance.
“Data governance is everything you do to ensure data is secure, private, accurate, available, and usable. It includes the actions people must take, the processes they must follow, and the technology that supports them throughout the data life cycle.” [GOO21]
Besides other objectives, data governance also covers the security and privacy of data. This means that we not only have to de-identify the data, we also have to make the de-identification traceable and understandable and include it in the data life cycle within the data governance framework.
After some research, the core components for a data governance platform regarding data de-identification for big data analytics can be identified as follows:
- Data catalog
- PII identification/cataloging
- Data de-identification
- Data re-identification
- De-identified data lake
A data catalog is an organized inventory of data assets in the organization [ORA21]. It holds metadata describing dataset properties like available fields, field types and whether the data contains PII or other sensitive information. Data catalogs usually offer additional functionality, like data lineage information and data discovery [ORA21]. As you probably guessed, we will use such a data catalog for our de-identification platform. In our case we want to persist metadata describing which fields contain PII, including the type of the PII and a confidence score. We will use this metadata later on to automatically select de-identification treatments based on defined privacy and security guidelines.
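To make this more concrete, the following sketch shows what such a catalog entry and the treatment selection could look like. All class names, PII type labels, treatment names and the confidence threshold are assumptions for illustration and do not refer to a specific data catalog product.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class FieldMetadata:
    name: str
    data_type: str
    pii_type: Optional[str] = None  # e.g. "EMAIL"; None means no PII was detected
    confidence: float = 0.0         # how sure the PII identification is (0.0 to 1.0)


@dataclass
class CatalogEntry:
    dataset: str
    columns: List[FieldMetadata] = field(default_factory=list)


# Hypothetical privacy guideline: which treatment applies to which PII type.
TREATMENT_BY_PII_TYPE: Dict[str, str] = {
    "EMAIL": "tokenize",
    "NAME": "mask",
    "IBAN": "remove",
}


def select_treatment(meta: FieldMetadata, threshold: float = 0.8) -> str:
    """Pick a treatment for one field based on its catalog metadata."""
    if meta.pii_type is None or meta.confidence < threshold:
        return "keep"
    # Unknown PII types fall back to the strictest treatment.
    return TREATMENT_BY_PII_TYPE.get(meta.pii_type, "remove")


entry = CatalogEntry(
    dataset="customers",
    columns=[
        FieldMetadata("email", "string", pii_type="EMAIL", confidence=0.97),
        FieldMetadata("full_name", "string", pii_type="NAME", confidence=0.91),
        FieldMetadata("country", "string"),
    ],
)
plan = {column.name: select_treatment(column) for column in entry.columns}
```

Keeping the guideline in a lookup table separate from the catalog entries means that a change in the privacy guidelines does not require the datasets to be scanned again.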
It is not always known which data assets contain sensitive information. Therefore the second data governance component is responsible for PII identification. The idea is to get a sample of the data set, infer the data schema as well as identify which fields of this data asset contain PII and finally create a metadata entry in the data catalog. Ideally, this component should offer the ability to process large samples, and also to query the original data and apply filters and transforms. Later on we will refer to this metadata and treat the actual data according to our data security and privacy guidelines.
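The idea can be sketched in a few lines: take a sample, infer a simple schema and estimate for each field how likely it is to contain PII. The regular expressions, the type labels and the definition of the confidence score (share of sampled values that match a pattern) are simplifying assumptions; real PII detection is considerably more involved.

```python
import re

# Hypothetical, deliberately simple detectors for two PII types.
PII_PATTERNS = {
    "EMAIL": re.compile(r"^[\w.+-]+@[\w-]+(\.[\w-]+)+$"),
    "PHONE": re.compile(r"^\+?\d[\d\s/-]{6,}\d$"),
}


def scan_sample(sample):
    """Infer field types and PII candidates from a list of sample records."""
    result = {}
    field_names = sorted({name for row in sample for name in row})
    for name in field_names:
        values = [row[name] for row in sample if row.get(name) is not None]
        data_type = type(values[0]).__name__ if values else "unknown"
        best_type, best_score = None, 0.0
        for pii_type, pattern in PII_PATTERNS.items():
            hits = sum(1 for v in values if isinstance(v, str) and pattern.match(v))
            score = hits / len(values) if values else 0.0
            if score > best_score:
                best_type, best_score = pii_type, score
        # This dictionary is what would be written to the data catalog.
        result[name] = {
            "data_type": data_type,
            "pii_type": best_type,
            "confidence": best_score,
        }
    return result


sample = [
    {"email": "jane.doe@example.com", "age": 34},
    {"email": "john@example.org", "age": 51},
    {"email": "n/a", "age": 28},
]
metadata = scan_sample(sample)
```

In this sample two of the three `email` values match the pattern, so the field is reported as an `EMAIL` candidate with a confidence of about 0.67, while `age` is not flagged at all.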
The metadata of the data catalog allows us to accept sensitive data and transform it by removing, masking or tokenizing PII. Therefore the data de-identification component is responsible for accepting data, retrieving the corresponding metadata from the data catalog and finally applying de-identification treatments. This component should be able to process large amounts of data in batches but also through streaming. The de-identified data is finally stored in a cloud environment for processing purposes.
Please be aware that even though many data de-identification techniques exist, there is still some risk of re-identification. We will not cover this in detail, but we encourage you to read up on the risk analysis of de-identified data, as covered by Li et al. in [LI07] and in many other publications.
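A minimal sketch of such a component is shown below. It assumes that the catalog metadata has already been turned into a per-field treatment plan; the treatment names, the demo key and the `tok_` prefix are hypothetical. Writing the component as a generator lets the same code handle a batch (a list) and a stream (any iterator).

```python
import hashlib
import hmac
from typing import Callable, Dict, Iterable, Iterator

KEY = b"demo-key"  # hypothetical; a real key would come from a key management system


def mask(value) -> str:
    """Hide the value but keep its length."""
    return "*" * len(str(value))


def tokenize(value) -> str:
    """Replace the value by a deterministic keyed hash (HMAC-SHA256)."""
    digest = hmac.new(KEY, str(value).encode("utf-8"), hashlib.sha256).hexdigest()
    return "tok_" + digest[:16]


TREATMENTS: Dict[str, Callable] = {"mask": mask, "tokenize": tokenize}


def de_identify_stream(records: Iterable[dict], plan: Dict[str, str]) -> Iterator[dict]:
    """Apply the per-field treatment plan to every incoming record."""
    for record in records:
        treated = {}
        for name, value in record.items():
            action = plan.get(name, "keep")  # fields without a plan entry pass through
            if action == "remove":
                continue
            treated[name] = TREATMENTS[action](value) if action in TREATMENTS else value
        yield treated


plan = {"email": "tokenize", "full_name": "mask", "iban": "remove"}
batch = [
    {"email": "jane.doe@example.com", "full_name": "Jane Doe", "iban": "DE00", "country": "DE"},
]
de_identified = list(de_identify_stream(batch, plan))
```

Passing fields without a plan entry through unchanged is a design choice of this sketch; a stricter component could reject records with fields that are unknown to the catalog.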
You may wonder whether the de-identification can be reversed. This process is called re-identification and is the main purpose of the fourth component for our data de-identification framework. This component is responsible for providing re-identification techniques for the de-identified data. Please note that not all de-identification techniques are reversible. Only those that use cryptographic keys to create a tokenized value of the original PII can be reversed. We need to store this key in case we want to re-identify the data at a later time.
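The snippet below illustrates key-based reversibility with a small Feistel-style construction built from the Python standard library. It is a teaching sketch and not a vetted cipher: the round count, the key and all function names are assumptions, and a production system would rely on a reviewed encryption or tokenization service instead.

```python
import base64
import hashlib

ROUNDS = 8  # assumed round count for this illustration


def _round_fn(key: bytes, round_no: int, data: bytes, n: int) -> bytes:
    """Keyed pseudo-random bytes of length n, derived with SHAKE-256."""
    if n == 0:
        return b""
    return hashlib.shake_256(key + bytes([round_no]) + data).digest(n)


def _xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))


def _feistel(data: bytes, key: bytes, rounds) -> bytes:
    """Alternately XOR each half with a keyed function of the other half.

    Every round is its own inverse, so running the rounds in reverse
    order with the same key undoes the transformation.
    """
    half = len(data) // 2
    left, right = data[:half], data[half:]
    for r in rounds:
        if r % 2 == 0:
            left = _xor(left, _round_fn(key, r, right, len(left)))
        else:
            right = _xor(right, _round_fn(key, r, left, len(right)))
    return left + right


def tokenize(value: str, key: bytes) -> str:
    cipher = _feistel(value.encode("utf-8"), key, range(ROUNDS))
    return base64.urlsafe_b64encode(cipher).decode("ascii")


def re_identify(token: str, key: bytes) -> str:
    cipher = base64.urlsafe_b64decode(token.encode("ascii"))
    return _feistel(cipher, key, reversed(range(ROUNDS))).decode("utf-8")


key = b"hypothetical-key-stored-in-a-trusted-environment"
token = tokenize("jane.doe@example.com", key)
restored = re_identify(token, key)
```

The token is deterministic, so equal PII values map to equal tokens, and only the holder of the key can restore the original value. Masking or plain hashing, by contrast, keeps no information from which the original could be recovered.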
Of course we need to store our de-identified data. There are many possible data sinks to use depending on the use case for the de-identified data. In our case, to achieve big data analytics objectives, we will store the de-identified data in a data lake. A data lake enables organizations to store massive amounts of data in a central location. Diverse groups in an organization can access it easily to categorize, process, analyze, and consume the data. Since the data can be stored as-is, there is no need to convert it to a predefined schema. In addition, querying this data for analytics no longer requires knowing the questions beforehand.
A de-identified data lake (abbr. DIDL) solves the data privacy problem by storing de-identified data. Therefore sensitive information does not even enter the data lake. By using a de-identified data lake we reduce the risk of exposing sensitive information through data breaches or the misuse of data. Additionally we preserve the data utility that allows us to achieve our big data objectives.
Finally, the following diagram shows all de-identification framework components in a technology-agnostic architecture. Even though this is quite theoretical, we will cover it in more depth in the second part of this article series. There we will apply our theoretical de-identification framework to a real-life public cloud example and also try to build a de-identification platform from scratch with open source technologies.
Let’s sum up. In this first part of the series we worked out that European enterprises are caught in a conflict between leveraging the public cloud for business value generation and complying with data security and privacy requirements. We investigated the origin of this issue and traced it back to the GDPR and the CLOUD Act as well as a general mistrust towards cloud service providers. Based on this, we worked out a framework for data de-identification that enables cloud adoption for big data analytics. We identified its core components: the data catalog, PII identification and cataloging, the de-identification and re-identification of the actual data, and the de-identified data lake.
So we have covered the theoretical part, but how would you solve this in the real world if you were an actual company? If you are curious, stay tuned and follow us as we cover these questions in the second part of our article series “Meet European Data Security and Privacy Compliances with Big Data Analytics in Cloud Environments – Part 2: Case Studies in Business Environments“. We will apply our theoretical framework and present solutions for real business environments on public and private clouds, following cloud native and open source approaches.
List of abbreviations
CLOUD Act – Clarifying Lawful Overseas Use of Data Act
CSP – Cloud Service Provider
DIDL – De-identified Data Lake
GDPR – General Data Protection Regulation
PII – Personally Identifiable Information
Image: unsplash | @blakeconnally