OpenLineage and Airflow Simplify Data Lineage
The GDPR (General Data Protection Regulation) asks organizations to implement data lineage for a clear understanding of the data used...
A trillion-dollar asset manager wanted to automate data collection and processing for an internal Client Services tool. Its Operations Analyst team spent multiple hours every day of Q4 manually checking 350+ asset manager websites (8,000 tickers) for newly posted capital gains. The main challenge of this Robotic Process Automation (RPA) project was building a data scrape orchestration framework and a logic-based validation framework to automate data accuracy checks.
RPA Solution: Blue Orange implemented a custom data extraction framework using Selenium and Scrapy to collect and ingest over 300 distinct data sources.
Data Pipeline: Blue Orange managed, orchestrated, and ran the nightly project workload using Prefect. All pipeline errors, retries, and timeouts were handled through Prefect, providing high fault tolerance with low DevOps overhead.
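To make the retry behavior concrete, here is a minimal sketch in plain Python. It does not use Prefect and is not the project's pipeline code; it only illustrates the retry-with-delay pattern that an orchestrator applies to a failing task. The names `run_with_retries` and `flaky_scrape`, and the retry count, are hypothetical.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def run_with_retries(task: Callable[[], T], max_retries: int = 3,
                     delay_seconds: float = 0.0) -> T:
    """Run task(); if it raises, retry up to max_retries more times."""
    attempt = 0
    while True:
        try:
            return task()
        except Exception:
            if attempt >= max_retries:
                raise  # retries exhausted: surface the original error
            attempt += 1
            time.sleep(delay_seconds)  # wait before the next attempt


# Hypothetical flaky scrape: fails twice, then succeeds on the third call.
calls = {"count": 0}


def flaky_scrape() -> str:
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("site temporarily unavailable")
    return "capital gains page"


result = run_with_retries(flaky_scrape, max_retries=3)
```

In an orchestrator such as Prefect, this policy is declared on the task rather than hand-written, which is what keeps the operational overhead low.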
Rule-Based Validation Framework: The Blue Orange developer team worked with the Operations Analyst team daily to ensure that the web scraping spiders were extracting the desired data. More than 55 unique validation rules were applied to the scraped data.
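A rule-based validation layer of this kind can be sketched as a list of named checks run against each scraped record. The example below is an illustration under assumptions, not the project's actual framework: the field names (`ticker`, `capital_gain`) and the three rules are hypothetical stand-ins for the 55+ rules mentioned above.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Record = Dict[str, object]


def _is_number(value: object) -> bool:
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(value, (int, float)) and not isinstance(value, bool)


@dataclass(frozen=True)
class Rule:
    name: str
    check: Callable[[Record], bool]  # returns True when the record passes


# Hypothetical rules; a real rule set would be agreed with the analysts.
RULES: List[Rule] = [
    Rule("ticker_present", lambda r: bool(r.get("ticker"))),
    Rule("gain_is_number", lambda r: _is_number(r.get("capital_gain"))),
    Rule("gain_non_negative",
         lambda r: _is_number(r.get("capital_gain")) and r["capital_gain"] >= 0),
]


def validate(record: Record, rules: List[Rule] = RULES) -> List[str]:
    """Return the names of all rules that the record fails."""
    return [rule.name for rule in rules if not rule.check(record)]


good = validate({"ticker": "ABCDX", "capital_gain": 1.25})
bad = validate({"ticker": "", "capital_gain": -1})
```

Keeping each rule as a named, independent check means a failed record can be reported with the exact rules it violated, which makes daily review with a non-developer team more practical.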
Result: Blue Orange provided a web scraping tool that delivered daily data updates with computed delta logic and notifications. This is expected to save over 250 hours of manual input per person, per year.
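The "computed delta logic" can be pictured as a comparison between the previous and current scrape snapshots. The sketch below is one simple way to do that, assuming each snapshot maps a ticker to its posted capital gain; the function name, data shape, and tickers are hypothetical and not taken from the project.

```python
from typing import Dict, Tuple

Snapshot = Dict[str, float]  # ticker -> posted capital gain (assumed shape)


def compute_delta(previous: Snapshot, current: Snapshot) -> Dict[str, dict]:
    """Compare two snapshots and report added, removed, and changed tickers."""
    added = {t: current[t] for t in current if t not in previous}
    removed = {t: previous[t] for t in previous if t not in current}
    changed: Dict[str, Tuple[float, float]] = {
        t: (previous[t], current[t])
        for t in current
        if t in previous and previous[t] != current[t]
    }
    return {"added": added, "removed": removed, "changed": changed}


# Hypothetical tickers and values, for illustration only.
yesterday = {"AAA": 1.0, "BBB": 2.0}
today = {"AAA": 1.0, "BBB": 2.5, "CCC": 0.3}
delta = compute_delta(yesterday, today)
```

A notification step would then only need to fire when any of the three groups in the result is non-empty.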
Josh Miramant is the CEO and founder of Blue Orange Digital, a data science and machine learning agency with offices in New York City and Washington DC.
Miramant is a popular speaker, futurist, and strategic business and technology advisor to enterprise companies and startups. He has been featured in IBM ThinkLeaders, Dell Technologies, Global Banking & Finance Review, and the IoT Council of Europe, among others. He can be reached at firstname.lastname@example.org.
Blue Orange Digital is recognized as a “Top AI Development and Consultant Agency” by Clutch and YahooFinance for innovations in predictive analytics, automation, and optimization with machine learning in NYC.
They help organizations optimize and automate their businesses, implement data-driven analytic techniques, and understand the implications of new technologies such as artificial intelligence, big data, and the Internet of Things.