When designing and building software architecture (such as Big Data, microservices, Service Mesh) it is both necessary and useful to know the good practices applicable to use cases as well as reference architecture that confirms that we are doing things the right way to guarantee the optimal functioning of the system.
In this article, as well as talking about good practices, we also wish to address the opposite: the bad practices. In other words, the issues to avoid or mitigate when designing Big Data architecture, otherwise known as AntiPatterns.
Listed below are those most frequently encountered (and suffered).
1. IGNORING FILE SIZE
Choosing the right file size, block size and replication factor is critical to making the Big Data architecture we are building as efficient as possible. Files that are too small will cause frequent reads and writes (with the associated penalty), while files that are too large will make processing less parallelisable and demand more memory and time to handle so much information at once. What is the correct file and block size? This is the big question that must be analysed and answered for every use case and technology being used.
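To make the trade-off concrete, here is a minimal sketch (the 128 MiB default block size is an assumption, matching common HDFS configurations) of how file layout drives the number of read tasks a job must schedule:

```python
# Estimate how many read tasks a data set produces under HDFS-style block
# splitting: each file contributes ceil(size / block_size) splits, minimum 1.
def estimate_tasks(file_sizes_bytes, block_size=128 * 1024 * 1024):
    return sum(max(1, -(-size // block_size)) for size in file_sizes_bytes)

# One 10 GiB file yields ~80 well-sized, parallelisable splits.
big = estimate_tasks([10 * 1024**3])        # 80 tasks
# The same 10 GiB spread over 10,000 files of ~1 MiB each yields 10,000
# tiny tasks, each paying the full open/read/schedule overhead.
small = estimate_tasks([1024**2] * 10_000)  # 10000 tasks
```

The same total volume can cost two orders of magnitude more scheduling overhead purely because of how it is split into files.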
2. A TOOL FOR EVERYTHING
Fortunately, in the Big Data ecosystem we have a large number of tools that allow us to tackle many different tasks. Each tool is adapted to a specific use case and we must ensure that we use it for what it is designed for and thus avoid falling into the temptation of the Golden Hammer. For example, if we were to use a database like DynamoDB or CosmosDB to store a relational model, sooner or later we would find ourselves with a rather difficult wall to get around. The same would happen if we tried to use AWS RedShift or Azure SQL Data Warehouse to manage massive real-time data insertions concurrently.
3. FAILURE TO TAKE SCHEMA EVOLUTION INTO ACCOUNT
Data evolves over time, and the schema or model we are working with today will probably not be the same as the one we will be working with next month. Failure to plan and design for schema evolution from the beginning will require an investment of time and effort that, in some cases, will be very difficult to manage or reverse.
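The idea can be sketched in a few lines: if new fields always ship with defaults, records written under the old schema remain readable. The field names below are illustrative, not from the article (this is the same compatibility rule that formats such as Avro and Parquet support natively):

```python
# Forward-compatible reads: upgrade records written under an old schema by
# filling any missing new fields with declared defaults.
SCHEMA_V2_DEFAULTS = {"country": "unknown", "opt_in": False}

def upgrade(record: dict) -> dict:
    """Return the record with missing v2 fields filled with their defaults."""
    return {**SCHEMA_V2_DEFAULTS, **record}

old = {"user_id": 42, "name": "Ada"}   # written before the schema change
new = upgrade(old)                     # gains country="unknown", opt_in=False
```

Adding fields with defaults is cheap if planned from the start; renaming or retyping fields without such a plan forces the costly rewrites the paragraph above warns about.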
4. NOT CHOOSING THE CORRECT PARTITIONING KEY
When querying a huge volume of data, defining an efficient and appropriate partitioning scheme for the data we are managing will help make searches more accurate and consume less time and fewer resources. Defining partitioning keys is one of the tasks to which we must pay the greatest attention, especially during the initial phases of data architecture design and modelling.
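A small sketch of why the key matters, using Hive-style `year=/month=` partition paths (the layout and event names are illustrative): a query that matches the partitioning key touches only its partition, while a query on an unpartitioned attribute degrades to a full scan.

```python
# Data laid out by date partition, Hive-style.
data = {
    "year=2023/month=01": ["evt-a", "evt-b"],
    "year=2023/month=02": ["evt-c"],
    "year=2024/month=01": ["evt-d", "evt-e", "evt-f"],
}

def read(prefix: str):
    """Partition pruning: load only the partitions matching the key prefix."""
    return [e for part, events in data.items() if part.startswith(prefix)
            for e in events]

one_month = read("year=2023/month=02")  # touches exactly 1 partition
everything = read("year=")              # no usable prefix: reads all of them
```

If queries mostly filter by customer rather than by date, a date-based key gives no pruning at all; the key must match the dominant access pattern.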
5. FAILURE TO CONSIDER THE INFORMATION ACCESS SECURITY MODEL IN THE DESIGN
Security by Design is the motto we should always follow when facing data architecture. Since GDPR came into force, ensuring security and correct processing of data has become more important than ever, especially with regard to personal data. At the very least, the following data security measures must be guaranteed:
- Encryption: Both at rest and in transit. The ideal is to enable this by default.
- Masking, obfuscation or anonymisation in required cases.
- Permissions: Who has access to the information, which part of the information do they have access to, what operations can be performed on the data (RBAC)?
- Users’ rights (in the case of dealing with personal data): rectification, deletion, portability.
Defining and applying these policies from the start will pave the way considerably when it comes to implementing our architecture. Furthermore, we must be aware that it is not only the data managed by our architecture that must be protected, because data generated by our architecture are just as important as the data we are handling: log data, transaction data, activity records, network records.
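Two of the measures listed above, field-level masking and role-based access (RBAC), can be sketched together in a few lines. The roles, field names and the choice of a truncated SHA-256 pseudonym are all assumptions for illustration, not a prescribed implementation:

```python
import hashlib

# Which fields each role may see in the clear; everything else is masked.
ROLE_FIELDS = {
    "analyst": {"country", "signup_date"},       # no direct identifiers
    "dpo": {"email", "country", "signup_date"},  # full access
}

def mask(value: str) -> str:
    """Pseudonymise a value with a one-way hash (not reversible)."""
    return hashlib.sha256(value.encode()).hexdigest()[:12]

def view(record: dict, role: str) -> dict:
    """Return the record as the given role is allowed to see it."""
    allowed = ROLE_FIELDS.get(role, set())
    return {k: (v if k in allowed else mask(str(v))) for k, v in record.items()}

row = {"email": "ada@example.com", "country": "UK", "signup_date": "2024-01-05"}
# view(row, "analyst") hides the email; view(row, "dpo") sees everything.
```

Applying the policy at read time like this, rather than per consumer, keeps a single enforcement point, which is what makes retrofitting security so much harder than designing it in.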
6. NOT TESTING ALGORITHMS/PROCESSES WITH REAL VOLUME
When we design algorithms or processes, we always make sure we have enough unit and integration tests to allow us to meet a minimum coverage threshold, but when these algorithms have to deal with a very large volume of data, we must also ensure that our tests include real volumetric testing to ensure the proper functioning of the entire system and avoid unexpected errors in production environments.
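A volumetric test can be as simple as generating synthetic data at a production-like scale and asserting the process finishes within a time budget. The process under test, the record count and the budget below are all stand-ins to be tuned per system:

```python
import random
import time

def dedupe(events):
    """The process under test: drop duplicate event ids, keeping order."""
    seen, out = set(), []
    for e in events:
        if e["id"] not in seen:
            seen.add(e["id"])
            out.append(e)
    return out

def test_dedupe_at_volume(n=1_000_000, budget_s=30.0):
    """Run the process against n synthetic records and enforce a time budget."""
    events = [{"id": random.randrange(n // 2)} for _ in range(n)]
    start = time.monotonic()
    result = dedupe(events)
    elapsed = time.monotonic() - start
    assert elapsed < budget_s, f"too slow at volume: {elapsed:.1f}s"
    assert len(result) <= n // 2  # ids drawn from a pool of n // 2
```

Unit tests with ten records would pass for an O(n²) version of `dedupe` just as happily; only running at realistic volume exposes the difference before production does.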
7. FULL SCANS
There will be times when we have no choice but to go through our entire data set to perform a particular operation (for example, deleting a particular record and updating the data set or obfuscating fields that have become sensitive), but we must bear in mind that this operation consumes a great deal of resources and time and, depending on the size of the data set, may be impractical to perform. On those occasions when it is necessary to carry out a full scan, we must make sure that we allow sufficient time for the processes to be completed and without impacting the rest of the architecture.
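When a full scan is unavoidable, processing in batches with a pause between them is one way to keep it from starving the rest of the architecture. This is a minimal sketch; the batch size, pause interval and obfuscation rule are assumptions to tune per system:

```python
import time

def full_scan(records, transform, batch_size=1000, pause_s=0.0):
    """Apply `transform` to every record, batch by batch, backing off
    between batches so concurrent workloads keep getting resources."""
    out = []
    for i in range(0, len(records), batch_size):
        out.extend(transform(r) for r in records[i:i + batch_size])
        time.sleep(pause_s)  # throttle between batches
    return out

# Example: obfuscate a field that has become sensitive across the whole set.
rows = [{"id": i, "ssn": f"000-00-{i:04d}"} for i in range(2500)]
cleaned = full_scan(rows, lambda r: {**r, "ssn": "***"}, batch_size=1000)
```

Batching also makes the scan resumable: if the time window closes, the job can record the last completed batch and continue in the next window instead of restarting from zero.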
8. BIG DATA?
As a last point, we wish to reflect on these questions: What is Big Data? When can we consider that architecture is working with a volume of data big enough to be called Big Data?
As time goes by and cloud services become more efficient and affordable, the cost of storing and processing our data decreases, so we can take on greater challenges with a greater volume of data at a lower cost. This means that the answer to our initial question evolves at the same time as cloud services evolve and hardware capacities increase.
What is termed Big Data today may not fit that definition next year, and we will have to rethink the criteria again. Perhaps we should ask the question from another perspective: if all your data fits in RAM, are you doing Big Data?
For more information on this topic, see the following articles:
- Best Practices and Tips for Optimizing AWS EMR
- How we built a big data platform on AWS for 100 users for under $2 a month
- Schema Evolution with Hive and Parquet using partitioned views
- Apache Parquet: How to be a hero with the open source columnar data format on Google, Azure and Amazon cloud
- AWS Reference Architectures
- Azure Reference Architectures
- Best practices between block size, file size and replication factor in HDFS
Image: Unsplash | Tim Gouw