The famous quote of Watts S. Humphrey, the father of quality in software, states that every company is a software company. In 2021 it is safe to say that this concept can be applied to big data analytics. Data has become the most valuable asset for companies, and many enterprises create value with Big Data, Data Science, IoT or Machine Learning. But how do you leverage the public cloud for big data analytics in a modern world with strict data privacy and security regulations and a general mistrust toward cloud service providers?
In this three-part article series we investigate how European enterprises can meet data security and privacy requirements, yet leverage the opportunities of public clouds and hyperscalers for big data analytics. To achieve this we focus on data de-identification and apply data governance techniques. This is the second part of the article series, where we apply our previously defined theoretical data de-identification framework to real-life business case studies. We will present the conflict European enterprises are facing when moving to the public cloud for big data analytics. Furthermore, we will cover the upcoming possibilities of sovereign clouds, and finally we will present how to design the de-identification platform on public clouds with cloud-native services. After reading this article you will know how to leverage the public cloud while preserving data utility, minimizing overhead and remaining compliant with data security and privacy requirements.
Before we start, we kindly encourage you to check out part one of this article series “Meet European Data Security and Privacy Compliance with Big Data Analytics in Public Cloud Environments – Part 1: Foundations”. There we have covered the origin of the problem, identified the most promising approach to tackle the challenge and defined a technology agnostic data de-identification framework for big data analytics.
In this part of the article series we will focus on how to actually apply this framework to real business solutions. First, imagine we are an actual company with the objective of building a de-identified data lake (abbr. DIDL) for big data analytics as introduced in the first part of the article series. Then let’s recap that we pursue the zero trust approach, by protecting sensitive data from all third parties like cloud service providers, government authorities and third party attackers.
We immediately realise the following problem: the physical location of the de-identification process matters, since moving data to the public cloud and applying de-identification treatments only after the data movement violates our zero trust approach. With this in mind we are left with the following options:
- Apply data de-identification treatments before public cloud movement to remain zero trust compliant
- Apply data de-identification treatments after public cloud movement and relax our zero trust approach
Each option has its pros and cons. De-identification after public cloud movement allows us to leverage cloud native de-identification services and accelerate the value generation from big data analytics. However, we violate our zero trust approach by moving data containing PII and trusting the cloud service providers with handling our data security and privacy compliance.
De-identification before public cloud movement allows us to own the data completely, including the de-identification process itself. This approach fully lives up to the zero trust criteria and minimises the risk of exposing sensitive data. On the other hand, we need to implement the de-identification treatments ourselves in a trusted environment like a private cloud. This introduces overhead and costs, and slows down the value generation from big data insights.
There is no golden solution that ticks all the boxes. However, recent developments and announcements may open up possibilities that could change the cloud industry in Europe as we currently know it. Let’s check out what these opportunities are and how they might change our approach in the future.
Upcoming Opportunities with Sovereign Clouds
For years now, the European Union has been a driving force behind the quest for cloud sovereignty in Europe [ECT21]. The objective is to identify strategic investment to enable the development and adoption of competitive, trusted, and sustainable cloud and edge services across the EU [ECT21]. The term ‘data sovereignty’ is strongly emphasized and goes beyond storing the data in a European data center. Data sovereignty is also about who can access it, a particularly hot topic when it comes to cloud providers listed in the US and therefore subject to the CLOUD Act. A European sovereign cloud might accelerate cloud adoption and big data analytics for European enterprises and the public sector by providing a trusted environment and full data ownership. Let’s investigate the recent announcements and happenings around sovereign clouds and their possible market effects.
But first, what is a sovereign cloud exactly? Even though [ECT21] mentions ‘sovereignty’ and ‘sovereign cloud’ over 50 times in the European report for next-generation cloud-edge, a definition of the terms is not provided. We will refer to the most detailed description provided by [OVH21], one of the founding members of trusted cloud initiatives in Europe:
“A sovereign cloud […] ensures that its infrastructure and processing operations are carried out in strict compliance with the rules in effect. […] These rules are enforced in whichever countries the provider operates and offers its services in. […] The provider complies with regulations, and ensures that data is protected from any interventions other than what the customer carries out, […] ensuring that no extraterritorial rights apply to data, and that the data is not used by third parties — whether it is to power AI algorithms, or contribute to the enrichment of monolithic platforms” [OVH21].
Now that we know what a sovereign cloud is, let’s identify the main players for cloud sovereignty in Europe. First, there is Gaia-X, a federated and secure framework for a decentralized network of cloud service providers [GAI21]. Gaia-X is less the network itself than the set of rules that cloud service providers need to follow to become a node of the Gaia-X network [GAI21]. In other words, Gaia-X is a European cloud ecosystem based on common standards and open source technology stacks like OpenStack and Sovereign Cloud Stack. Gaia-X was launched by Germany and France and has become the cornerstone of the European cloud strategy, backed by institutional and many commercial partners [GAI21, PLU21].
One of these partners is Deutsche Telekom AG, a founding member of Gaia-X. Deutsche Telekom is also a trusted cloud service provider within the Gaia-X network with their OPEN TELEKOM CLOUD [OTC21]. Other relevant founding members and contributors are Beckhoff Automation, BMW, Bosch, DE-CIX, German Edge Cloud, PlusServer, SAP and Siemens [WIK21].
This development has introduced new aspects for the cloud market. Therefore it is not surprising that public cloud providers aspire to compete and provide trusted cloud environments too. And if you read closely, you will find that all three major cloud providers, Amazon, Microsoft and Google, are listed as partners in the members list of Gaia-X [GAI21]. Beyond this, the recently announced partnership between T-Systems and Google Cloud to deliver a sovereign version of the Google Cloud Platform for the German market may indicate the future development of the European cloud industry [TSY21]. This assumption is strengthened by a second partnership announcement between Thales and Google Cloud for sovereign cloud offerings in France [GOO21].
Under the new T-Systems and Google agreement, T-Systems will be responsible for aspects such as encryption and customer identity management, and both companies will have to monitor any access to their joint facilities in Germany [TSY21]. The vendors promise the full functionality and scale of the public cloud, as well as version and feature parity with the global network [TSY21]. In case you wonder how sovereignty is assured, see the corresponding announcement by Google, which explains how data sovereignty is to be achieved and which role trusted partners like Telekom and Thales will play in this concept.
For our case studies, let’s assume the promising outlook becomes reality. Let’s assume the major public cloud providers offer a sovereign cloud version where we can perform data de-identification and leverage the full cloud potential without violating our zero trust approach.
De-identification Framework on Public Cloud – AWS
Now for the interesting part: let us apply our de-identification framework and build a real business solution on the public cloud. We will first cover Amazon Web Services (abbr. AWS). AWS provides suitable services to cover our de-identification requirements, such as AWS Glue – Data Catalog, a serverless data integration service that provides data catalog functionality and acts as a central metadata repository. Additionally, AWS Glue allows us to automatically infer schemas and populate the data catalog with metadata as well as schedule ETL jobs. We can use AWS Glue to create a metadata repository of the ingested data. The metadata will hold information about the schema of the datasets, whether a dataset contains PII and the type of the PII itself. We want to use this metadata to treat the ingested data, remove the PII and finally store it in the de-identified data lake. Additionally, AWS Glue DataBrew offers data lineage functionality.
Now that we have a data catalog, we need to infer metadata about our datasets. For the creation of the de-identified data lake, we especially want to detect whether a dataset contains any PII. AWS provides a PII detection service called Amazon Macie, a passive service that scans your data in an S3 bucket to detect and report PII findings on dataset level. The results of the scans can be persisted in the data catalog metadata. Please be aware that the results of Amazon Macie are aggregated on dataset level and not on data record level. This is exactly what we intend, since the metadata inside our data catalog describes the dataset, not its individual records.
Alternatively you can enrich your catalog with metadata on your own by using Amazon Comprehend, a natural language processing service that provides functionality to detect personally identifiable information. The idea is to retrieve a sample from the dataset, actively analyse this sample with Amazon Comprehend and create your custom catalog metadata entry that describes whether the analysed dataset contains any PII.
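The sampling idea can be sketched roughly as follows. This is a minimal sketch, not a definitive implementation: the function names, the score threshold of 0.8 and the layout of the resulting metadata entry are assumptions made for illustration. The detector is passed in as a callable so the logic can be tried without AWS credentials; `comprehend_detector` shows how such a callable might wrap the `detect_pii_entities` call of the boto3 Comprehend client.

```python
import random
from typing import Callable, Dict, Iterable, List

# A detector takes a text and returns entities shaped like the "Entities"
# list in the response of Amazon Comprehend's detect_pii_entities call.
Detector = Callable[[str], List[Dict]]


def comprehend_detector(language_code: str = "en") -> Detector:
    """Wrap Amazon Comprehend. Requires boto3 and AWS credentials."""
    import boto3  # imported lazily so the rest runs without AWS access

    client = boto3.client("comprehend")

    def detect(text: str) -> List[Dict]:
        response = client.detect_pii_entities(Text=text, LanguageCode=language_code)
        return response["Entities"]

    return detect


def build_catalog_entry(
    dataset_name: str,
    records: Iterable[str],
    detect: Detector,
    sample_size: int = 100,   # assumption: how many records to analyse
    min_score: float = 0.8,   # assumption: confidence threshold
    seed: int = 0,
) -> Dict:
    """Analyse a random sample and summarise the findings on dataset level."""
    records = list(records)
    sample = random.Random(seed).sample(records, min(sample_size, len(records)))
    pii_types = set()
    for text in sample:
        for entity in detect(text):
            if entity["Score"] >= min_score:
                pii_types.add(entity["Type"])
    # Hypothetical metadata layout; adapt it to your own catalog schema.
    return {
        "dataset": dataset_name,
        "contains_pii": bool(pii_types),
        "pii_types": sorted(pii_types),
        "sampled_records": len(sample),
    }
```

The returned dictionary describes the dataset as a whole, which matches the dataset-level granularity of the catalog. Writing it into the catalog itself is left out here, since that depends on how the catalog is set up.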
So far our AWS solution covers the data catalog and PII detection/cataloging components of our de-identification framework. Now we want to utilise our rich catalog metadata and de-identify the ingested data on the cloud. Even though Amazon Comprehend provides functionality to anonymize PII, our research shows that one central requirement is missing: Amazon Comprehend does not offer a native integration that uses the catalog metadata for the de-identification process. Put simply, the Amazon Comprehend service expects string data as input, analyses the input for PII, applies anonymization to it and finally returns the treated data. As of right now, it is not possible to pass metadata information from the data catalog as additional parameters to the API call. This is unfortunate, since our intent in building the data catalog was to utilize its metadata for automated de-identification treatments based on defined privacy policies.
In case you want to use the catalog metadata for de-identification as proposed by our de-identification framework, you need to implement this on your own. Also please note that anonymization is the only de-identification technique Amazon Comprehend provides. In case you want to apply tokenization to preserve the relationship of the PII to the dataset, you need to implement this functionality yourself.
Finally, we will use the Amazon S3 cloud storage service for our de-identified data lake. The full AWS solution is shown in figure 2. Besides the de-identification framework components, we also use operational components for handling batch and stream ingestion, database migration, media data and analytics itself. Due to the scope of this article we will not cover these services in depth.
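A thin layer of your own that reads the catalog metadata and selects the treatment per column is one way to approach this gap. The following is a minimal sketch under assumptions: the layout of the catalog entry, the policy table and the treatment functions are hypothetical and not part of any AWS API.

```python
from typing import Callable, Dict

# Hypothetical catalog metadata: column name -> PII type (None = no PII).
CATALOG_ENTRY = {
    "dataset": "customers",
    "columns": {"name": "NAME", "email": "EMAIL", "country": None},
}


def redact(value: str) -> str:
    """Replace every character so that no information survives."""
    return "*" * len(value)


def mask_email(value: str) -> str:
    """Hide the local part of an e-mail address but keep the domain."""
    _, _, domain = value.partition("@")
    return "***@" + domain


# Hypothetical privacy policy: which treatment applies to which PII type.
POLICY: Dict[str, Callable[[str], str]] = {"NAME": redact, "EMAIL": mask_email}


def deidentify_record(record: Dict[str, str], catalog_entry: Dict, policy: Dict) -> Dict[str, str]:
    """Treat one record column by column, driven by the catalog metadata."""
    columns = catalog_entry["columns"]
    treated = {}
    for column, value in record.items():
        if column not in columns:
            # Fail closed: columns unknown to the catalog are redacted.
            treated[column] = redact(value)
        elif columns[column] is None:
            treated[column] = value
        else:
            # PII types without a policy entry fall back to redaction.
            treated[column] = policy.get(columns[column], redact)(value)
    return treated
```

With this structure, a change to the privacy policy only touches the policy table, while the catalog remains the single place that records where PII lives.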
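To illustrate what such a self-built tokenization could look like, here is a minimal sketch based on a keyed hash (HMAC-SHA256) from the Python standard library. The class name, the token format and the in-memory vault are assumptions for illustration; in practice the key would come from a key management service and the vault would be a protected store.

```python
import hashlib
import hmac


class Tokenizer:
    """Deterministic tokenization: equal inputs yield equal tokens."""

    def __init__(self, secret_key: bytes):
        self._key = secret_key
        # token -> original value; an in-memory stand-in for a protected
        # vault that would allow authorised re-identification.
        self._vault = {}

    def tokenize(self, value: str) -> str:
        digest = hmac.new(self._key, value.encode("utf-8"), hashlib.sha256).hexdigest()
        token = "tok_" + digest[:32]
        self._vault[token] = value
        return token

    def detokenize(self, token: str) -> str:
        # Only possible with access to the vault; the token alone
        # does not reveal the original value.
        return self._vault[token]


tokenizer = Tokenizer(b"demo-key-do-not-use-in-production")
first = tokenizer.tokenize("alice@example.com")
second = tokenizer.tokenize("alice@example.com")
print(first == second)  # the same person maps to the same token
```

Because the mapping is deterministic, records belonging to the same person can still be joined and counted after treatment, which plain anonymization does not allow.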
De-identification Framework on Public Cloud – GCP
Let’s apply our theoretical framework to the Google Cloud Platform (abbr. GCP) and compare it against the AWS solution. Again, we start with the data catalog component. The corresponding service within GCP is Data Catalog, a fully managed and serverless data discovery and metadata management service. Data Catalog provides automated inference of data asset schemas; however, this functionality solely covers GCP-internal services like BigQuery, Pub/Sub and Dataproc Metastore with their databases and tables. Metadata for other sources has to be maintained by yourself. That said, Data Catalog offers community-maintained connectors for various on-premises source types like RDBMS, BI tools and Hive. Finally, GCP’s Data Catalog offers further functionality like data discovery, tagging and a search interface for data stewards.
Similar to Amazon Macie, GCP provides a service to automatically scan data for PII on dataset level. We suggest using GCP’s Cloud Data Loss Prevention (abbr. DLP), which allows you to scan data stored in Cloud Storage, BigQuery and Datastore by scheduling automated scan jobs. The results of the scans can be persisted in the data catalog metadata. Please be aware that, similar to Amazon Macie, the results of DLP automated scans are aggregated on dataset level and not on data record level. This is exactly how we intend it to be for our data catalog.
Cloud Data Loss Prevention is able to detect over 150 different PII types and also allows us to treat the data with various de-identification techniques. Provided options are anonymization/redaction, various types of tokenization, generalization and bucketing as well as date shifting. Please be aware that GCP documentation uses a slightly different terminology than introduced in the first article of this article series. It is also worth mentioning that DLP provides the functionality to define custom PII identifiers by using dictionaries and regular expressions.
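To make these techniques more tangible, the following plain Python sketch shows what bucketing, generalization and date shifting do conceptually. It does not call the DLP API; all function names and parameters are hypothetical and only illustrate the idea.

```python
import hashlib
import hmac
from datetime import date, timedelta


def bucket_age(age: int, width: int = 10) -> str:
    """Bucketing: replace an exact value with the range it falls into."""
    lower = (age // width) * width
    return f"{lower}-{lower + width - 1}"


def generalize_postcode(postcode: str, keep: int = 2) -> str:
    """Generalization: keep only the coarse leading part of a value."""
    return postcode[:keep] + "*" * (len(postcode) - keep)


def shift_date(day: date, entity_id: str, key: bytes, max_days: int = 30) -> date:
    """Date shifting: move all dates of one entity by the same secret offset.

    The offset is derived from a keyed hash of the entity id, so intervals
    between events of the same entity are preserved.
    """
    digest = hmac.new(key, entity_id.encode("utf-8"), hashlib.sha256).digest()
    offset = int.from_bytes(digest[:4], "big") % (2 * max_days + 1) - max_days
    return day + timedelta(days=offset)


print(bucket_age(37))                # 30-39
print(generalize_postcode("70173"))  # 70***
```

All three treatments trade precision for privacy while keeping the data usable for aggregate analytics, which is the property we are after for the de-identified data lake.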
It appears that GCP’s DLP is the more complete service in terms of data de-identification compared to Amazon Comprehend. First, GCP provides a native integration between DLP and Data Catalog, allowing automated PII detection on dataset level and persistence of the results in the data catalog. Second, DLP provides re-identification functionality, including a native integration with GCP’s key management service, Cloud Key Management Service (Cloud KMS). Finally, DLP offers functionality for measuring the re-identification risk of de-identified data. Please be aware that, similar to AWS, the Google Cloud Platform does not provide a native integration of the catalog metadata into the de-identification process. Exactly like Amazon Comprehend, DLP accepts string data, analyses it for PII without the option of data catalog integration and finally returns the de-identified data. In case you intend to utilise the rich metadata from the data catalog, you need to implement this yourself.
Finally, we are using operational components for handling batch and stream ingestion, database migration, media data and analytics itself. We will not cover these components in depth here. The de-identified data is stored in GCP’s Cloud Storage.
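One widely used metric for re-identification risk is k-anonymity: the size of the smallest group of records that share the same combination of quasi-identifier values. The sketch below computes it in plain Python to illustrate the concept; it is not the DLP API, and the column names are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Sequence, Tuple


def group_sizes(records: List[Dict], quasi_identifiers: Sequence[str]) -> Counter:
    """Count how many records share each quasi-identifier combination."""
    return Counter(tuple(r[c] for c in quasi_identifiers) for r in records)


def k_anonymity(records: List[Dict], quasi_identifiers: Sequence[str]) -> int:
    """Return the size of the smallest group (0 for an empty dataset)."""
    if not records:
        return 0
    return min(group_sizes(records, quasi_identifiers).values())


def risky_groups(records: List[Dict], quasi_identifiers: Sequence[str], k: int) -> List[Tuple]:
    """List the combinations that occur fewer than k times."""
    sizes = group_sizes(records, quasi_identifiers)
    return sorted(group for group, n in sizes.items() if n < k)


records = [
    {"age": "30-39", "zip": "70***"},
    {"age": "30-39", "zip": "70***"},
    {"age": "40-49", "zip": "80***"},
]
print(k_anonymity(records, ["age", "zip"]))      # 1
print(risky_groups(records, ["age", "zip"], 2))  # [('40-49', '80***')]
```

A value of 1 means at least one record is unique with respect to the chosen columns and could be singled out, so coarser generalization or suppression of that group would be needed before the data is released for analytics.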
De-identification Framework on Public Cloud – Azure
Finally we will apply our theoretical de-identification framework on Microsoft Azure Cloud Computing Services (abbr. Azure). Starting with the most recent announcement, in September 2021 Microsoft officially launched their new data governance service named Azure Purview, a unified data governance solution that helps to manage and govern on-premises, multi-cloud, and software-as-a-service data.
Azure Purview consists of three components: Data Map, Data Catalog and Data Insights. Applying our de-identification framework, we immediately realize that the Azure Purview Data Catalog is the data catalog component, providing an enterprise-wide metadata repository for data assets. Data Catalog also provides extended catalog functionality, like data asset discovery and search, data lineage and labeling. Please note that the standalone service Azure Data Catalog is a legacy service. Microsoft recommends using Azure Purview over Azure Data Catalog.
Similar to the AWS and GCP solutions, Azure provides Data Map as part of the Azure Purview service. With Data Map you can automatically retrieve technical metadata from data assets stored in hybrid sources, classify the data assets based on automated scans and label your data accordingly. Compared to the above solutions, Azure Purview Data Map delivers functionality similar to Amazon Macie and GCP DLP. Again, please note that the results of Data Map refer to the dataset, not the actual data records.
Finally, Azure Purview Data Insights provides a centralised view of the whole company data estate and its distribution by asset dimensions such as source type, classification and file size. Please be aware that this service is in preview.
To treat data containing PII, we will use Azure Cognitive Services for PII detection and treatment. As of right now, Azure Cognitive Services for PII detection detects over 12 different PII types and offers anonymization as the only de-identification technique. Therefore it is also not possible to re-identify the data with the cloud native approach. In case you want to use tokenization and other data de-identification treatments as well as re-identification, you will need to implement them on your own. It is worth noting that, exactly like AWS and GCP, Azure does not provide a native integration of the catalog metadata into the de-identification process. You have to implement this functionality yourself.
Similar to the other cloud service providers we have used other operational components for the solution. It is worth pointing out that compared to AWS and GCP, Azure Cognitive Services also provides functionality for media data processing and therefore the possibility to detect and treat PII in images, video and audio.
Let’s wrap up. In this article we have covered how to meet European data security and privacy compliance with big data analytics in public cloud environments by applying our de-identification framework on real life business case studies. We understand the predicament that European enterprises are in, i.e., it is considerably easier to perform the required de-identification treatments after data movement to the public cloud providers as you can leverage their native services as described above. However, doing so violates our zero trust approach to data privacy. That said, de-identification before public cloud movement allows us to own the data de-identification process and remain compliant with our zero trust approach but introduces overhead, increases costs significantly and slows down the value generation of big data insights.
Then we have covered the upcoming opportunities with sovereign clouds. We have identified a current cloud market trend that points toward the introduction of sovereign cloud offerings. The most relevant projects like Gaia-X are backed by institutional organisations and the big industry players. Additionally, we pointed out that public cloud providers also follow the trend of sovereign cloud offerings, driven by the recent competitive dynamics of the market. Announcements like the sovereign cloud offerings based on the partnerships of Google Cloud with T-Systems and Thales might change the cloud industry in Europe as we know it and encourage cloud adoption.
Finally, we have covered the cloud native approach by applying our de-identification framework to real-life public cloud environments like AWS, GCP and Azure. We have not only provided a full architectural solution for each cloud provider but also compared the solutions in terms of data de-identification maturity. In conclusion, we can state that every covered cloud provider offers services for data de-identification and therefore for the creation of a de-identified data lake. However, comparing the provided functionality against our de-identification framework requirements, we found that all clouds show significant shortcomings when it comes to the native integration of the data catalog with the de-identification services. It appears the existing cloud native data catalog services mainly target data discovery and tagging. But how can a rich metadata repository help if we cannot include it in the native services for de-identification? In our opinion, there is a lot of room for improvement.
Overall, Google Cloud Platform offers the most complete service ecosystem, since it provides the full range of component requirements like data catalog, PII detection, PII de-identification and re-identification as well as some extras like measurement of the re-identification risk and the native integration of the key storage services with the re-identification functionality. We have summarized the results of this article in the following table.
We have covered the cloud native approach with the assumption that public sovereign cloud offerings are available and therefore we will not violate our zero trust approach. However, the reality is that as of right now, sovereign clouds are not available and ready to use for enterprises. So what if you don’t want to wait for the sovereign cloud to become available? What if you need to act now? In this case, it is required to follow our proposed approach of data de-identification before public cloud movement. If you are curious about how to achieve this, stay tuned and follow us as we cover these questions in the third part of our article series.
List of abbreviations
CLOUD Act – Clarifying Lawful Overseas Use of Data Act
CSP – Cloud Service Provider
DIDL – De-identified Data Lake
GDPR – General Data Protection Regulation
PII – Personally Identifiable Information
Image: unsplash | @flyd2069