At the start of every data project, there are numerous separate data sources that, on their own, offer limited analytical value, but that become powerful resources for data scientists when combined and enriched with public or third-party data. More data becomes available every day, and businesses need all of it to have a full picture when making decisions. Joining this data together leads to a fundamental problem: identifying how to associate records from different data sources that do not share a unique record key or identifier. This problem has existed for as long as dissimilar datasets have needed to be consolidated, but with modern machine learning algorithms we can unify datasets far more effectively. We train algorithms to select the right record pairs based on a limited set of clearly associated examples, which allows us to build these unified datasets accurately and efficiently.

Because the problem is so fundamental, we have solved it across a number of data types and domains. Blue Orange has built high-accuracy record linkage to match geomagnetic disturbance data with energy grid readings. We have also addressed the issue in a recruiting application, associating more than ten different datasets with no common key to establish robust candidate profiles.
- Machine Learning
- People Analytics
An example of disparate messy data sources:
|Data Source|Job Record 1|Job Record 2|Job Record 3|
|---|---|---|---|
|Dataset 1|Citadel LLC|Blackrock Group| |
|Dataset 2|citadel Investment Group, LLC| | |
|Dataset 3|Citadel|ctadel|JP Morgan|
|Dataset 5|jpmorgan|Citadel Capital Mgmt| |
A human reviewing this limited example can weigh all of the information in context and determine the accurate combined job history:
|Employer 1|Employer 2|Employer 3|
|---|---|---|
|J. P. Morgan|Citadel Investment Group|BlackRock|
However, when we expand the challenge to more extensive datasets with many more attributes, the judgments a human makes to arrive at these selections would translate into an exceedingly complex set of rules. We need a better way to determine which data to select, discard, merge, or fix.
Leveraging modern predictive algorithms increases both accuracy and efficiency, which in turn decreases the cost of unifying datasets.
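To see why hand-written rules fall short, consider a naive baseline that relies on edit-distance similarity alone. A small sketch using Python's standard `difflib`, applied to company strings from the table above:

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Edit-distance-style similarity in [0, 1], case-insensitive."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

# Pure spelling variants of the same employer score high...
print(similarity("jpmorgan", "JP Morgan"))       # well above 0.9

# ...but a typo plus a legal suffix drags a true match down...
print(similarity("Citadel LLC", "ctadel"))       # roughly 0.7

# ...and a true match written out differently scores lower still,
# so no single cutoff cleanly separates matches from non-matches.
print(similarity("Citadel LLC", "citadel Investment Group, LLC"))
```

Any threshold high enough to reject false pairs also rejects many true ones, which is exactly the gap a learned model has to close.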
The solution was implemented in 4 parts:
- Model verification
- Standardization of data schemas for semantic associations to find likely duplicates
- Deriving a training data set with manual labeling
- Designing and training a mirrored LSTM using semantic representations
1 Model Verification
To save time and money, we opted to verify our model on a synthetically generated dirty dataset with characteristics similar to our target data. The benefit of generated data is that we have an accurately labeled dataset, which lets us isolate model accuracy from data accuracy. We used FEBRL (Freely Extensible Biomedical Record Linkage) from the Australian National University. Because our real training data had to be labeled manually, this allowed us to test and train multiple models before investing effort in manual data curation.
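The key property of a generated dataset is that every pair carries known ground truth. The idea can be sketched as follows; this is a simplified stand-in for FEBRL's record generator, not FEBRL itself, and the corruption operations are illustrative:

```python
import random

def corrupt(value: str, rng: random.Random) -> str:
    """Apply one random FEBRL-style corruption: a character swap,
    a deletion, or a case change (the fallback for short strings)."""
    op = rng.choice(["swap", "delete", "lowercase"])
    if op == "swap" and len(value) > 1:
        i = rng.randrange(len(value) - 1)
        return value[:i] + value[i + 1] + value[i] + value[i + 2:]
    if op == "delete" and len(value) > 1:
        i = rng.randrange(len(value))
        return value[:i] + value[i + 1:]
    return value.lower()

def make_labeled_pairs(clean_records, n_dupes, seed=0):
    """Return (record_a, record_b, is_match) triples with known ground truth."""
    rng = random.Random(seed)
    pairs = []
    for rec in clean_records:
        for _ in range(n_dupes):
            pairs.append((rec, corrupt(rec, rng), 1))   # true duplicate
        other = rng.choice([r for r in clean_records if r != rec])
        pairs.append((rec, other, 0))                   # known non-match
    return pairs

pairs = make_labeled_pairs(["Citadel LLC", "JP Morgan", "BlackRock"], n_dupes=2)
```

Because the labels are known by construction, any error a model makes on these pairs is attributable to the model rather than to the data.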
2 Standardization of data schemas for semantic associations to find likely duplicates
Cleaning and standardizing the input datasets (data processing) is the initial step of most projects. We used Natural Language Processing (NLP) to generate semantic representations of each field value, both to determine potential pairings and to automate the ingestion of subsequent data sources. Based on field labels alone, our system can ingest a new data source and suggest how it should be linked to the canonical data model.
3 Deriving a training data set with manual labeling
Because of corporate data restrictions, we were not able to crowdsource a training data set with common crowdsourcing tools like MTurk or Prolific. We did, however, have many internal business stakeholders who could help generate an accurate training data set. To internalize the crowdsourcing, we took the matched pairs from our standardization step and exposed them in PyBossa, an open-source data-classification application. This allowed us to reach our training-data goals quickly and collaboratively, without exposing proprietary data.
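The candidate pairs from the standardization step can be packaged as labeling tasks for the internal reviewers. A sketch of building the task payloads; the `info` structure here is illustrative rather than a fixed PyBossa schema, and the actual upload would go through PyBossa's task API:

```python
import json

def build_labeling_tasks(candidate_pairs):
    """Turn candidate record pairs into PyBossa-style task payloads.

    Each task shows a reviewer two records and asks a single yes/no
    question: do they refer to the same entity?
    """
    tasks = []
    for pair_id, (record_a, record_b) in enumerate(candidate_pairs):
        tasks.append({
            "info": {
                "pair_id": pair_id,
                "record_a": record_a,
                "record_b": record_b,
                "question": "Do these records refer to the same entity?",
            }
        })
    return tasks

tasks = build_labeling_tasks([
    ("Citadel LLC", "citadel Investment Group, LLC"),
    ("jpmorgan", "JP Morgan"),
])
print(json.dumps(tasks[0], indent=2))
```

Because reviewers only ever see one pair at a time, the full proprietary datasets never leave the internal system.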
4 Designing and training a mirrored LSTM using semantic representations
The model itself is a recurrent neural network that uses a semantic representation of each entity to determine potential linkages. We used language-level transfer learning, leveraging FastText embeddings to capture the semantic meaning of potentially related field values. Industry-standard linkage using TF-IDF can be expected to reach 50%–75% resolution accuracy, but its performance degrades as data quality drops and data size grows. Even the cutting-edge system built by MassMutual's team reported accuracy figures in the 75%–80% range. Initial implementations of our model achieved up to 93% accuracy.
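The "mirrored" architecture can be sketched as a Siamese network: one LSTM encoder applied with identical, shared weights to both records, followed by a similarity score over the two encodings. A minimal, untrained NumPy sketch of the forward pass (dimensions, seed, and random weights are illustrative; the production model is trained on the labeled pairs and consumes FastText embeddings):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class MirroredLSTM:
    """A tiny Siamese ("mirrored") LSTM: the SAME weights encode both
    inputs, and the match score is the cosine similarity of the encodings."""

    def __init__(self, input_dim, hidden_dim, seed=0):
        rng = np.random.default_rng(seed)
        shape = (hidden_dim, input_dim + hidden_dim)
        # One weight matrix per gate: input, forget, output, candidate.
        self.Wi, self.Wf, self.Wo, self.Wc = (rng.normal(0, 0.1, shape) for _ in range(4))
        self.hidden_dim = hidden_dim

    def encode(self, sequence):
        """Run the LSTM over a sequence of embedding vectors; return the final hidden state."""
        h = np.zeros(self.hidden_dim)
        c = np.zeros(self.hidden_dim)
        for x in sequence:
            z = np.concatenate([x, h])
            i = sigmoid(self.Wi @ z)             # input gate
            f = sigmoid(self.Wf @ z)             # forget gate
            o = sigmoid(self.Wo @ z)             # output gate
            c = f * c + i * np.tanh(self.Wc @ z)  # cell state update
            h = o * np.tanh(c)
        return h

    def score(self, seq_a, seq_b):
        """Cosine similarity of the two shared-weight encodings."""
        a, b = self.encode(seq_a), self.encode(seq_b)
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

model = MirroredLSTM(input_dim=4, hidden_dim=8)
seq = [np.array([0.1, 0.2, 0.3, 0.4]), np.array([0.5, 0.1, 0.0, 0.2])]
seq_other = [np.array([0.9, 0.0, 0.1, 0.7]), np.array([0.2, 0.2, 0.2, 0.2])]
print(model.score(seq, seq))        # identical sequences encode identically
print(model.score(seq, seq_other))
```

Sharing one set of weights across both inputs is what makes the network "mirrored": it guarantees that two identical records always receive identical encodings, and training only has to learn one encoder.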
Though this example explores applicant data, we have applied the model to record linkage problems in healthcare and e-commerce. By applying these cutting-edge algorithms directly to business record linkage challenges, Blue Orange has been able to vastly expand the knowledge base that powers key business decisions for our clients. Learn more about the ways Blue Orange can enrich and unlock value in your data.