SUCCESS CASE #AI #MachineLearning 

Unstructured Data Extraction and Automation

BRANDDOCS is a global company based in New York with a strong presence in Europe that specializes in the orchestration and custody of secure digital transactions including identification, signature, payments and electronic custody.

As a Qualified Trust Service Provider and Trusted Third Party worldwide, it provides its clients with the highest degree of technological, legal and compliance coverage for their secure digital transactions.

The challenge of managing unstructured data

It is estimated that around 70-80% of the information and content generated in companies is unstructured, i.e. its format is not homogeneous and is not optimized for easy and quick classification.

This makes the information difficult to store, process and even interpret within existing data tables and across the different units of the organization. What does this mean in practice? Loss of important information or data, lower productivity and management efficiency, added complexity in digitization…

Automating these processes with artificial intelligence and machine learning technologies, running on public cloud computing environments, makes it possible to reduce the time and resources devoted to these tasks and to increase productivity and the efficiency of results.

Solution on Amazon Web Services

Branddocs needs to build a scalable system for automatically extracting and interpreting relevant information from banking and financial documents. The aim of this system is to consolidate large volumes of information from unstructured data such as payroll, statements and accounting reports.

This system must have the following characteristics:

  • It must be a centralized data repository built on an intelligent data system that can extract defined entities from documents and interpret them, so that business rules can then be triggered.
  • It must be fully scalable, flexible and modular, so that different steps can be defined within the same workflow. These steps can be requests to systems or databases, or approvals.
  • It must be able to interpret the data, which will be structured and available for exploitation and export, for example through BI dashboards or Excel files.

In this proof of concept we focused on extracting information from the OCR (Optical Character Recognition) output of the documents, then processing and summarizing it so that it could be delivered in the desired format.

We faced several types of financial and accounting documents, such as payroll, bank statements, balance sheets, profit and loss accounts… Each had its own layout and terminology, so each posed a different challenge and a different use case. In addition, the documents arrived in different file formats (PDF, PNG…).

  • Payrolls:

    The objective for payroll documents was to extract certain key fields, such as the company name, the employee’s ID, the net amount payable, the date… and record them in an Excel file.

  • Bank statements:

    For this type of document, we needed to identify which entries referred to loans, credit card movements… in order to generate totals, and also to calculate the minimum, average and maximum balance of the account over the period in question.

  • Financial statements:

    These documents are prepared from balance sheets and annual accounts. We had a set of keywords for each entry, which we used to search for matches in the documents. This gave us the values of those entries, from which we made the calculations needed to obtain the desired results.
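As a rough illustration of the bank statement case above, the following Python sketch tags each entry by keyword, totals the tagged amounts and computes the minimum, average and maximum balance. The `Entry` structure, the keyword table and the sample values are hypothetical and chosen only for this example; the actual rules used in the proof of concept are not described here, and the average is a plain mean over the listed balances rather than a time-weighted one.

```python
from __future__ import annotations

from dataclasses import dataclass
from statistics import mean

# Hypothetical keyword table: category -> substrings to look for in the description.
CATEGORY_KEYWORDS = {
    "loan": ("loan", "mortgage"),
    "credit_card": ("credit card", "visa", "mastercard"),
}


@dataclass
class Entry:
    """One statement line as it might look after OCR parsing (assumed shape)."""
    description: str
    amount: float   # negative for debits
    balance: float  # account balance after the movement


def categorize(description: str) -> str:
    """Return the first category whose keyword appears in the description."""
    text = description.lower()
    for category, keywords in CATEGORY_KEYWORDS.items():
        if any(keyword in text for keyword in keywords):
            return category
    return "other"


def summarize(entries: list[Entry]) -> dict:
    """Total the amounts per category and compute balance statistics."""
    if not entries:
        raise ValueError("no entries to summarize")
    totals: dict[str, float] = {}
    for entry in entries:
        category = categorize(entry.description)
        totals[category] = totals.get(category, 0.0) + entry.amount
    balances = [entry.balance for entry in entries]
    return {
        "totals": totals,
        "min_balance": min(balances),
        "avg_balance": mean(balances),
        "max_balance": max(balances),
    }


# Invented sample data, for illustration only.
sample = [
    Entry("LOAN REPAYMENT 0042", -300.0, 1700.0),
    Entry("VISA CREDIT CARD SETTLEMENT", -200.0, 1500.0),
    Entry("SALARY", 2000.0, 3500.0),
    Entry("Mortgage instalment", -500.0, 3000.0),
]
summary = summarize(sample)
```

Simple substring matching like this is easy to audit but sensitive to OCR misreads, so a real system would probably need some tolerance for spelling variations in the descriptions.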

The AWS services used were as follows:

  • Amazon Textract as the OCR engine, to extract text from images.
  • Amazon S3 as the storage system, both for the original files (images) and for the processed data (Excel documents).
  • Jupyter notebooks on Amazon SageMaker to process and interpret the information.
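
To make the first step of this pipeline more concrete, here is a minimal Python sketch of reading text lines out of an Amazon Textract response. It assumes the synchronous `detect_document_text` operation from boto3 and an image already stored in S3; the bucket and object names passed to `detect_lines_from_s3` are placeholders, the sample response is invented, and this document does not say which Textract operation the proof of concept actually used.

```python
def extract_lines(response: dict) -> list[str]:
    """Collect the text of every LINE block in a Textract response, in order."""
    return [
        block["Text"]
        for block in response.get("Blocks", [])
        if block.get("BlockType") == "LINE" and "Text" in block
    ]


def detect_lines_from_s3(bucket: str, key: str) -> list[str]:
    """Run Textract on an image stored in S3 (requires boto3 and AWS credentials)."""
    import boto3  # imported here so extract_lines can be used without AWS access

    client = boto3.client("textract")
    response = client.detect_document_text(
        Document={"S3Object": {"Bucket": bucket, "Name": key}}
    )
    return extract_lines(response)


# Trimmed-down response with the same shape Textract returns (values invented).
sample_response = {
    "Blocks": [
        {"BlockType": "PAGE"},
        {"BlockType": "LINE", "Text": "Company: ACME S.L."},
        {"BlockType": "WORD", "Text": "Company:"},
        {"BlockType": "LINE", "Text": "Net pay: 1,850.00"},
    ]
}
lines = extract_lines(sample_response)
```

Keeping the parsing separate from the AWS call means the interpretation logic can be developed in a notebook against saved responses. Multi-page PDFs would need Textract's asynchronous operations instead of this synchronous call.
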

Benefits

  • This functionality automates a task that, done manually, would be far more costly and time-consuming, and much more prone to error.

  • It provides the ability to have the data of these financial documents stored in a structured manner.

  • Once the data is structured, it can be analyzed in dashboards or even used to make forecasts with Machine Learning models.

Keepler is a boutique professional technology services company specialized in the design, construction, deployment and operation of Big Data and Machine Learning software solutions for large clients. They use Agile and DevOps methodologies and native public cloud services to build sophisticated, data-centric business applications integrated with different sources in batch mode and in real time. They hold Advanced Consulting Partner level, and 90% of their technical workforce is AWS certified. Keepler currently works for large clients in different markets, such as financial services, industry, energy, telecommunications and media.

Let’s talk!

If you want to know more, or if you want us to develop a proposal for your specific use case, contact us and we’ll talk.