Branddocs needs to build a scalable system that automatically extracts and interprets relevant information from banking and financial documents. The system addresses the need to consolidate large volumes of information held in unstructured documents such as payrolls, bank statements and accounting reports.
The system must:
- Act as a centralized data repository built on an intelligent data system that can extract defined entities from documents and interpret them, so that business rules can subsequently be triggered.
- Be fully scalable, flexible and modular, so that different steps can be defined within the same workflow. These steps can be requests to systems or databases, or approvals.
- Interpret the data and make it available in structured form for exploitation and export, for example through BI dashboards or Excel files.
In this proof of concept we focus on extracting information from documents, starting from the output of an OCR (Optical Character Recognition) engine. The extracted information is then processed and summarized, and finally delivered in the desired format.
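To make the flow concrete, here is a minimal Python sketch of the idea: take OCR text lines, extract defined entities, run them through a list of workflow steps, and export the result. It is an illustration only and not the implementation used in the proof of concept, which relies on AWS services. The sample payslip lines, the regular expressions, and every function name (`extract_entities`, `parse_amounts`, `add_deductions`, `run_workflow`, `to_csv`) are assumptions made up for this example.

```python
import csv
import io
import re
from typing import Any, Callable, Dict, List

# Hypothetical OCR output: one string per recognized line of a payslip.
OCR_LINES = [
    "Employee: Jane Doe",
    "Period: 2021-03",
    "Gross salary: 2,500.00 EUR",
    "Net salary: 1,950.25 EUR",
]

# Hypothetical entity definitions: entity name -> pattern with one capture group.
ENTITY_PATTERNS = {
    "employee": re.compile(r"^Employee:\s*(.+)$"),
    "period": re.compile(r"^Period:\s*(\d{4}-\d{2})$"),
    "gross_salary": re.compile(r"^Gross salary:\s*([\d,]+\.\d{2})"),
    "net_salary": re.compile(r"^Net salary:\s*([\d,]+\.\d{2})"),
}

Record = Dict[str, Any]
Step = Callable[[Record], Record]


def extract_entities(lines: List[str]) -> Record:
    """Match each OCR line against the entity patterns and collect the hits."""
    record: Record = {}
    for line in lines:
        for name, pattern in ENTITY_PATTERNS.items():
            match = pattern.match(line.strip())
            if match:
                record[name] = match.group(1)
    return record


def parse_amounts(record: Record) -> Record:
    """Workflow step: convert amount strings such as '2,500.00' to floats."""
    out = dict(record)
    for key in ("gross_salary", "net_salary"):
        if key in out:
            out[key] = float(str(out[key]).replace(",", ""))
    return out


def add_deductions(record: Record) -> Record:
    """Workflow step: a toy business rule deriving deductions from the amounts."""
    out = dict(record)
    if "gross_salary" in out and "net_salary" in out:
        out["deductions"] = round(out["gross_salary"] - out["net_salary"], 2)
    return out


def run_workflow(record: Record, steps: List[Step]) -> Record:
    """Apply the steps in order; steps can be added, removed or reordered."""
    for step in steps:
        record = step(record)
    return record


def to_csv(record: Record) -> str:
    """Export a single record as CSV text that a spreadsheet tool can open."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(record))
    writer.writeheader()
    writer.writerow(record)
    return buffer.getvalue()


record = run_workflow(extract_entities(OCR_LINES), [parse_amounts, add_deductions])
print(to_csv(record))
```

Keeping each step as a plain function that takes and returns a record is one simple way to get the modularity described above: a new document type would mainly need its own patterns and steps, while `run_workflow` stays unchanged.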
We worked with several types of financial and accounting documents, such as payrolls, bank statements, balance sheets and profit and loss accounts. Each type has its own layout and terminology, so each one posed a different challenge and a different use case. In addition, the documents arrived in different file formats (PDF, PNG…).
The AWS services used were as follows: