SUCCESS CASE #AI #MachineLearning
Unstructured Data Extraction and Automation
BRANDDOCS is a global company based in New York with a strong presence in Europe that specializes in the orchestration and custody of secure digital transactions including identification, signature, payments and electronic custody.
As a Qualified Trusted Service Provider and Trusted Third Party worldwide, it provides its clients with the highest degree of technological, legal and compliance coverage for their secure digital transactions.
It is estimated that around 70-80% of the information and content generated in companies is unstructured, i.e. its format is not homogeneous and is not optimized for easy and quick classification.
This makes such information difficult to store, process and even interpret within existing data tables and across the different units of the organization. What does this mean? Loss of important information or data, reduced productivity and management efficiency, added complexity in digitization…
Automating these processes with artificial intelligence and machine learning technologies, powered by public cloud computing environments, reduces the time and resources spent on these tasks and increases productivity and the efficiency of results.
Branddocs needs to build a scalable system for the automatic extraction and interpretation of relevant information from banking and financial documents. The system responds to the need to consolidate large volumes of information from unstructured sources such as payrolls, statements and accounting reports.
This system must have the following characteristics:
- A centralized data repository built on an intelligent data system that can extract defined entities from documents and interpret them, so that business rules can subsequently be triggered.
- Complete scalability, flexibility and modularity, so that different steps can be defined within the same workflow. These steps can be requests to systems or databases, or approvals.
- The ability to interpret the data, which will be structured and available for exploitation and export, for example through BI dashboards or Excel files.
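As an illustration of the modular workflow requirement, a minimal sketch could chain independent steps over a shared context. The `Workflow` class, the step functions, the placeholder salary value and the 2000.0 threshold are all hypothetical and are not taken from the actual Branddocs system.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

# A step receives the shared context dict and returns it, possibly updated.
Step = Callable[[dict], dict]


@dataclass
class Workflow:
    """Runs named steps in order and records which ones were executed."""
    steps: List[Tuple[str, Step]] = field(default_factory=list)

    def add(self, name: str, step: Step) -> "Workflow":
        self.steps.append((name, step))
        return self

    def run(self, context: dict) -> dict:
        for name, step in self.steps:
            context = step(context)
            context.setdefault("history", []).append(name)
        return context


# Hypothetical steps; real ones would call external systems or databases.
def extract_entities(ctx: dict) -> dict:
    ctx["entities"] = {"net_salary": 1850.0}  # placeholder value, not real data
    return ctx


def apply_business_rule(ctx: dict) -> dict:
    # Illustrative rule: request approval when net salary is below a threshold.
    ctx["needs_approval"] = ctx["entities"]["net_salary"] < 2000.0
    return ctx


workflow = Workflow().add("extract", extract_entities).add("rule", apply_business_rule)
result = workflow.run({"document": "payroll.pdf"})
```

Because each step only depends on the context it receives, steps can be added, removed or reordered without touching the others.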
In this proof of concept we focus on extracting information from the OCR (Optical Character Recognition) output of the documents. That output is then processed and summarized, and finally delivered in the desired format.
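One simple way to turn OCR output into structured fields is to search each recognized line for a known label followed by an amount. The sketch below assumes the OCR output is already available as a list of text lines; the field labels, the sample lines and the number format (comma as thousands separator, dot as decimal separator) are hypothetical, and the approach used in the actual proof of concept may differ.

```python
import re

# Hypothetical field labels; real payroll layouts and terminology vary.
FIELD_LABELS = {
    "gross_salary": r"gross\s+salary",
    "net_salary": r"net\s+salary",
}

# Amounts such as 2,500.00 or 1850.5 (assumes "." as the decimal separator).
AMOUNT = r"(\d[\d,]*(?:\.\d+)?)"


def extract_fields(ocr_lines):
    """Return the first amount found after each known label."""
    found = {}
    for line in ocr_lines:
        for name, label in FIELD_LABELS.items():
            if name in found:
                continue
            # Skip any non-digit characters between the label and the amount.
            match = re.search(label + r"\D*" + AMOUNT, line, flags=re.IGNORECASE)
            if match:
                found[name] = float(match.group(1).replace(",", ""))
    return found


# Made-up OCR lines for illustration only.
sample_lines = [
    "ACME Corp payroll",
    "Gross salary: 2,500.00",
    "Net Salary   1,850.50 EUR",
]
fields = extract_fields(sample_lines)
```

A rule-based approach like this is easy to inspect, but each new document type needs its own labels, which is why every format is effectively a separate use case.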
We faced several types of financial and accounting documents, such as payrolls, bank statements, balance sheets, profit and loss accounts… Each had its own layout and terminology, and therefore posed a different challenge and a different use case. In addition, the documents arrived in different file formats (pdf, png…).
The AWS services used were as follows:
- Amazon Textract as the OCR engine, to extract text from images.
- Amazon S3 as the storage system, both for the original files (images) and for the processed data (Excel documents).
- Jupyter notebooks in Amazon SageMaker, to process and interpret the information.
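As a rough sketch of how these services can fit together from a notebook, the code below calls Textract's synchronous `detect_document_text` operation through boto3 on a file stored in S3 and keeps the recognized lines. The function names, the 80.0 confidence threshold and the sample response are assumptions made for illustration; the project's actual notebooks are not shown in this document.

```python
def lines_from_textract(response, min_confidence=80.0):
    """Collect the text of LINE blocks from a Textract response dict."""
    return [
        block["Text"]
        for block in response.get("Blocks", [])
        if block.get("BlockType") == "LINE"
        and block.get("Confidence", 0.0) >= min_confidence
    ]


def ocr_s3_document(bucket, key):
    """Run synchronous text detection on a file stored in S3."""
    import boto3  # imported lazily so the parser above works without AWS access

    textract = boto3.client("textract")
    response = textract.detect_document_text(
        Document={"S3Object": {"Bucket": bucket, "Name": key}}
    )
    return lines_from_textract(response)


# Trimmed, made-up response in the shape Textract returns (Blocks with BlockType).
sample_response = {
    "Blocks": [
        {"BlockType": "PAGE"},
        {"BlockType": "LINE", "Text": "Gross salary: 2,500.00", "Confidence": 99.1},
        {"BlockType": "WORD", "Text": "Gross", "Confidence": 99.3},
        {"BlockType": "LINE", "Text": "sm#dged t3xt", "Confidence": 41.7},
    ]
}
ocr_lines = lines_from_textract(sample_response)
```

Keeping the response parsing separate from the AWS call means the parsing logic can be tried on saved responses without credentials.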
Keepler is a boutique professional technology services company specialized in the design, construction, deployment and operation of Big Data and Machine Learning software solutions for large clients. It uses Agile and DevOps methodologies and native public cloud services to build sophisticated, data-focused business applications integrated with different sources in batch mode and in real time. It holds Advanced Consulting Partner level, and 90% of the professionals in its technical workforce are certified in AWS. Keepler currently works for large clients in different markets, such as financial services, industry, energy, telecommunications and media.