Govzilla is a leading data processing company that uses Big Data and AI to make government data accessible, usable, and valuable to top pharma companies, food manufacturers, medical device companies, and service firms around the globe. Govzilla required a central data hub to collect and unify highly varied data so that end users could track and identify compliance issues in their sectors. It needed a modern data environment with automated ingestion to support parsing, reconciliation, and classification of a large and varied set of documents.
Vertical: Big Data Processing
Infrastructure: Data Lake
To begin, Blue Orange developed a custom data lake to provide a high-throughput, fault-tolerant, and performant data infrastructure as the foundation on which to build the rest of the project. The pipeline had to support variable ingestion frequencies across dozens of data sources. The document data ranged from structured (data feeds) to semi-structured (unlocked PDFs) to unstructured (images/scans), which required a range of ingestion jobs as well as OCR and advanced data science techniques. Blue Orange replaced manual, outsourced data scrapers with automated ingestion jobs to improve accuracy, scalability, and efficiency.
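As a minimal sketch of how sources of the three kinds described above might be routed to different ingestion jobs, the following Python groups incoming files by kind. The extension mapping and the job names (`parse_feed`, `extract_pdf_text`, `run_ocr`) are hypothetical placeholders, not the names used in the actual project, and a production pipeline would likely inspect file contents rather than trust extensions.

```python
from enum import Enum
from pathlib import Path


class SourceKind(Enum):
    STRUCTURED = "structured"            # data feeds (CSV, JSON, XML)
    SEMI_STRUCTURED = "semi_structured"  # unlocked PDFs with a text layer
    UNSTRUCTURED = "unstructured"        # images and scans that need OCR


# Hypothetical extension-to-kind mapping, for illustration only.
_KIND_BY_SUFFIX = {
    ".csv": SourceKind.STRUCTURED,
    ".json": SourceKind.STRUCTURED,
    ".xml": SourceKind.STRUCTURED,
    ".pdf": SourceKind.SEMI_STRUCTURED,
    ".png": SourceKind.UNSTRUCTURED,
    ".jpg": SourceKind.UNSTRUCTURED,
    ".tiff": SourceKind.UNSTRUCTURED,
}

# Hypothetical job names; each would be a separate ingestion job.
_JOB_BY_KIND = {
    SourceKind.STRUCTURED: "parse_feed",
    SourceKind.SEMI_STRUCTURED: "extract_pdf_text",
    SourceKind.UNSTRUCTURED: "run_ocr",
}


def classify_source(filename: str) -> SourceKind:
    """Pick a source kind from the file extension (case-insensitive)."""
    suffix = Path(filename).suffix.lower()
    # Unknown types fall back to the OCR path, the most general one.
    return _KIND_BY_SUFFIX.get(suffix, SourceKind.UNSTRUCTURED)


def plan_ingestion(filenames):
    """Group filenames by the ingestion job that should handle them."""
    plan = {}
    for name in filenames:
        job = _JOB_BY_KIND[classify_source(name)]
        plan.setdefault(job, []).append(name)
    return plan
```

Calling `plan_ingestion(["a.csv", "b.pdf", "c.png"])` yields one batch per job, which a scheduler could then run at whatever frequency each source requires.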
We evolved the traditional system of rule-based text extraction by incorporating Natural Language Processing (NLP) and a leading Optical Character Recognition (OCR) tool, Tesseract. We discovered that OCR accuracy depended heavily on classifying documents before OCR and on processing the extracted data afterward. We created automation tasks for each document class, along with post-OCR parsing jobs ranging from simple string matching to complex NLP applications including topic modeling, keyword extraction, and semantic understanding.
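The classification and post-processing steps can be illustrated with a small standard-library sketch. It assumes the input is text already produced by an OCR engine; the OCR call itself is left out. The document classes, trigger phrases, and stopword list are hypothetical, and the frequency-based keyword extraction is a simple stand-in for the more advanced NLP techniques mentioned above.

```python
import re
from collections import Counter

# Hypothetical document classes and trigger phrases, for illustration only.
CLASS_RULES = {
    "inspection_report": ("inspection", "observation"),
    "warning_letter": ("warning letter", "violation"),
}

# Small illustrative stopword list; a real job would use a fuller one.
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "was", "were", "for", "on", "is"}


def normalize_ocr_text(text: str) -> str:
    """Rejoin words hyphenated across line breaks, then collapse whitespace."""
    text = re.sub(r"-\s*\n\s*", "", text)
    return re.sub(r"\s+", " ", text).strip()


def classify_document(text: str) -> str:
    """Simple string matching: pick the class whose phrases occur most often."""
    lowered = text.lower()
    scores = {
        label: sum(lowered.count(phrase) for phrase in phrases)
        for label, phrases in CLASS_RULES.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "unknown"


def extract_keywords(text: str, top_n: int = 3) -> list[str]:
    """Return the most frequent non-stopword tokens of three or more letters."""
    tokens = re.findall(r"[a-z]{3,}", text.lower())
    counts = Counter(t for t in tokens if t not in STOPWORDS)
    return [word for word, _ in counts.most_common(top_n)]
```

In a pipeline of this shape, the predicted class would select which parsing job runs next, so that each document type gets rules tuned to its layout.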
We used Robotic Process Automation (RPA) to route, store, and index data files through our advanced ETL jobs. These processes automatically moved files, created the file system structure, and indexed the data for search and retrieval.
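A rough sketch of the route, store, and index steps, using only the Python standard library and a local directory in place of the real storage layer. The function names and the folder-per-class layout are assumptions made for illustration; the actual ETL jobs are not described in enough detail to reproduce.

```python
import shutil
from collections import defaultdict
from pathlib import Path


def route_file(src: Path, root: Path, doc_class: str) -> Path:
    """Move a file into root/<doc_class>/, creating the folder if needed."""
    target_dir = root / doc_class
    target_dir.mkdir(parents=True, exist_ok=True)
    dest = target_dir / src.name
    shutil.move(str(src), str(dest))
    return dest


def build_index(files_with_text):
    """Map each lowercase word to the set of file paths that contain it."""
    index = defaultdict(set)
    for path, text in files_with_text:
        for word in set(text.lower().split()):
            index[word].add(str(path))
    return index


def search(index, term: str):
    """Return the sorted paths of files containing the term."""
    return sorted(index.get(term.lower(), set()))
```

The inverted index here is the simplest structure that supports keyword retrieval; a production system would more likely delegate this to a search service or a data catalog.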
Because of the large and varied data volume and number of sources, Govzilla required detailed data cataloging. To improve document processing times and increase accuracy, we also set up automatic parsing. These tools make it straightforward to scale when more processing power is needed. We implemented this using AWS Glue and stored our formations in S3.
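One common way to catalog S3 data with AWS Glue is a crawler that scans a bucket path and registers tables in the Glue Data Catalog. Whether the project used crawlers is not stated, so the following is only a sketch under that assumption. The crawler name, role ARN, database name, and bucket path passed in are hypothetical, and `register_crawler` needs `boto3` and valid AWS credentials to run.

```python
def crawler_request(name, role_arn, database, s3_path, schedule=None):
    """Build keyword arguments for the Glue `create_crawler` API call."""
    request = {
        "Name": name,
        "Role": role_arn,
        "DatabaseName": database,
        "Targets": {"S3Targets": [{"Path": s3_path}]},
    }
    if schedule is not None:
        # Glue cron expression, for example "cron(0 2 * * ? *)".
        request["Schedule"] = schedule
    return request


def register_crawler(request):
    """Create and start the crawler. Requires boto3 and AWS credentials."""
    import boto3  # imported lazily so the sketch loads without boto3 installed

    glue = boto3.client("glue")
    glue.create_crawler(**request)
    glue.start_crawler(Name=request["Name"])
```

Keeping the request construction separate from the API call makes the configuration easy to review and test without touching an AWS account.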
Our team supported the existing data engineering team in learning the modern AWS data architecture and helped them get up to speed on the newly implemented data patterns.
NLP technologies enabled text analysis and speech recognition applications. Powered by NLP, we developed solutions that:
Similarly, modern data lake patterns streamlined their document processing workflows. The use of modern cloud resources provided the following advantages:
RPA freed human workers from time-consuming, high-volume, repetitive tasks, allowing them to focus on strategic business tasks.
The key takeaway is clear: NLP, OCR, and RPA made it possible to streamline advanced data throughput and improve operational efficiency. Blue Orange implemented a modern data pattern that assisted business stakeholders in their data processing while helping them reduce operational costs. This enabled them to scale to additional strategic verticals and manage the related data complexity.
Can NLP and OCR solutions be developed for your business?
Do you have any related questions? From IoT to Energy, the Blue Orange Digital team has extensive experience with OCR- and NLP-based solutions.
Get in touch! We are happy to provide you with answers!