SageMaker is one of the earliest Machine Learning as a Service (MLaaS) offerings that supports end-to-end ML workflows. It offers developers, researchers, and data scientists a way to build, train, and deploy models on managed cloud infrastructure. Since its initial release in 2017, SageMaker has become more than a mere cloud service: it is an entire ecosystem that nowadays offers a variety of functionalities.
The list of SageMaker features is impressive and sees new additions every year. Earlier this year, three new services were added to the SageMaker ecosystem: the Data Wrangler, the Feature Store, and Pipelines. Each addresses a common pain point of ML workflows, enabling data scientists and ML engineers to be more productive.
Built into SageMaker Studio, the Data Wrangler enables developers to tackle one of the most time-consuming ML steps: data pre-processing. It provides a visual interface that makes it possible to import, prepare, transform, featurize, and analyze data without writing any code. The aim is to speed up data exploration and preparation so that developers can focus more on model training and tuning.
Without the Data Wrangler, data pre-processing is handled by code running in Jupyter notebooks. Depending on the ML problem at hand, this may include any or all of the following: enriching the data with external data sources, engineering custom features, merging attributes, cleaning, and transforming the data. Data scientists need a variety of libraries and tools for this work. Common choices are scikit-learn, NumPy, SciPy, and pandas for analysis, and matplotlib for visualization. Needless to say, these pre-processing pipelines (and the associated code) can become complex, messy, and hard to maintain.
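To make the hand-written approach concrete, here is a minimal sketch of the kind of notebook code described above: cleaning, enriching with an external source, and engineering a feature. It deliberately uses only the Python standard library so that it runs anywhere; in practice pandas or scikit-learn would do the same job. The records, the `regions` lookup, and the `preprocess` helper are all hypothetical and exist only for illustration.

```python
from statistics import mean

# Hypothetical raw records; None marks a missing value.
orders = [
    {"customer_id": 1, "amount": 120.0, "items": 3},
    {"customer_id": 2, "amount": None, "items": 2},
    {"customer_id": 3, "amount": 80.0, "items": 4},
]

# Hypothetical external data source used to enrich the records.
regions = {1: "east", 2: "west", 3: "east"}


def preprocess(rows, region_by_customer):
    """Clean, enrich, and featurize raw order rows."""
    known = [r["amount"] for r in rows if r["amount"] is not None]
    fill = mean(known)  # cleaning: impute missing amounts with the mean
    out = []
    for r in rows:
        amount = r["amount"] if r["amount"] is not None else fill
        out.append({
            "customer_id": r["customer_id"],
            "amount": amount,
            # enrichment: attach an attribute from the external source
            "region": region_by_customer.get(r["customer_id"], "unknown"),
            # feature engineering: derive a new attribute
            "amount_per_item": amount / r["items"],
        })
    return out


features = preprocess(orders, regions)
```

Even at this toy scale, three separate concerns are tangled together in one function, which is the maintenance problem a visual tool aims to reduce.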
The Data Wrangler offers a few core functionalities that eliminate the need to write data pre-processing code. Firstly, it makes it possible to connect to and import data from a variety of sources: Amazon S3, Amazon Athena, and Amazon Redshift. Secondly, it allows creating so-called Data Flows, where data preparation steps can be arranged using a drag-and-drop interface. Data Flows can automatically handle joins among multiple datasets, which means fewer database queries and less data manipulation code for developers to write. The Data Wrangler also provides a set of predefined data transformation methods (formatting, vectorization, and various embedding methods). Lastly, it offers built-in visualization tools that make it possible to perform exploratory data analysis and understand crucial feature characteristics such as feature correlation and importance scores.
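For comparison, the two operations a Data Flow takes off the developer's hands, joining datasets and inspecting feature correlation, look roughly like this when written by hand. This is a plain-Python sketch, not Data Wrangler code; the `inner_join` and `pearson` helpers and the sample datasets are assumptions made for the example.

```python
from math import sqrt


def inner_join(left, right, key):
    """Join two lists of dicts on a shared key, keeping only matching rows."""
    index = {row[key]: row for row in right}
    return [{**row, **index[row[key]]} for row in left if row[key] in index]


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Hypothetical datasets coming from two different sources.
customers = [{"id": 1, "age": 25}, {"id": 2, "age": 40}, {"id": 3, "age": 55}]
spend = [
    {"id": 1, "spend": 100.0},
    {"id": 2, "spend": 160.0},
    {"id": 3, "spend": 220.0},
    {"id": 4, "spend": 50.0},  # no matching customer, dropped by the join
]

joined = inner_join(customers, spend, "id")
correlation = pearson(
    [row["age"] for row in joined],
    [row["spend"] for row in joined],
)
```

In this made-up data, spend grows linearly with age, so the correlation comes out at essentially 1; real features are rarely this clean.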
The Feature Store comes in handy for developers in the model training and tuning phase of an ML workflow. It is a repository that makes it possible to create, share, and manage curated data (features) across different teams and ML tasks. Such functionality is useful in scenarios in which multiple teams are training multiple models based on a common set of features.
Without the Feature Store, data science and ML teams face a daunting challenge: they need to keep track of the features used for training their models, all the way from initial development to deployment. It is not uncommon for teams to develop multiple models in parallel and to migrate features from one training session to another. During inference, it is equally crucial to know which features the models need to make predictions. Throughout this process, datasets (and hence feature sets) keep changing: external data sources provide new features, attributes get merged into single features, and so on. Handling the evolution of features throughout the data pre-processing and model training phases is similar to maintaining code history without a version control system: risky, messy, and at some point simply impossible.
The Feature Store provides functionality that makes it easy to handle features all throughout the ML pipeline. The promise is that data is only pre-processed once and after features are extracted, they can be reused, shared, and managed across teams, according to their custom needs. It is then possible to index and search through features, and the consistency and standardization of features are ensured. Another crucial function of the Feature Store is that it is compatible with various other AWS services: features can be exported from Athena, Glue, and even the Data Wrangler.
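The idea of "pre-process once, reuse everywhere" can be sketched with a toy in-memory repository. This is not the SageMaker Feature Store API; `ToyFeatureStore`, `FeatureGroup`, and their methods are hypothetical names chosen for illustration, and a real store would add persistence, access control, and versioning.

```python
from dataclasses import dataclass, field


@dataclass
class FeatureGroup:
    """A named, described collection of feature records."""
    name: str
    description: str
    records: dict = field(default_factory=dict)  # record id -> feature dict


class ToyFeatureStore:
    """In-memory stand-in for a shared feature repository."""

    def __init__(self):
        self._groups = {}

    def create_group(self, name, description):
        if name in self._groups:
            raise ValueError(f"feature group {name!r} already exists")
        self._groups[name] = FeatureGroup(name, description)

    def put_record(self, group, record_id, features):
        # Store a copy so later changes by the caller do not leak in.
        self._groups[group].records[record_id] = dict(features)

    def get_record(self, group, record_id):
        # Return a copy so consumers cannot mutate the shared record.
        return dict(self._groups[group].records[record_id])

    def search(self, keyword):
        """Return the names of groups whose name or description matches."""
        kw = keyword.lower()
        return sorted(
            g.name for g in self._groups.values()
            if kw in g.name.lower() or kw in g.description.lower()
        )


store = ToyFeatureStore()
store.create_group("customer_profile", "Curated customer attributes for churn models")
store.put_record("customer_profile", 42, {"age": 31, "amount_per_item": 40.0})

# Two teams read the same curated record instead of recomputing it.
team_a_view = store.get_record("customer_profile", 42)
team_b_view = store.get_record("customer_profile", 42)
```

Both teams get identical values for record 42, and `store.search("churn")` finds the group by its description, which is the consistency and discoverability the paragraph above describes.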
The Pipelines feature is meant to enable the automation of the different ML pipeline steps. It provides a Continuous Integration & Delivery service, which is adapted to ML pipelines and makes it possible to maintain code, data, and models all throughout development and deployment. For developers, this means less time (and code) spent on the orchestration of all SageMaker jobs and easier maintenance of custom training and deployment workflows.
Datasets, models, and ML workflows are all dynamic by nature. Data sources are constantly evolving, pre-processing steps are continuously being improved, and countless models are being trained and deployed in parallel. Reproducing results and keeping track of development and production models, along with the workflows involved in their training and tuning, are becoming time-consuming and risky tasks. A well-functioning MLOps pipeline is essential for quality assurance and continuous integration, and can make the difference between Proof of Concept projects and ML projects running at scale.
SageMaker Pipelines offers MLOps teams the tools to get a handle on the variety of SageMaker workflows involved in model training and deployment. Pipelines can be defined from scratch using the Python SDK or built from built-in templates. Most importantly, workflow pipelines can be visualized, organized, and shared using SageMaker Studio. For each workflow, various metadata is collected and stored (with respect to datasets, model hyperparameters, and even training platform configurations), making ML workflows searchable and reusable.
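The core mechanics described above, steps with dependencies plus metadata recorded for every run, can be illustrated with a small standard-library sketch. It is not the SageMaker Python SDK; `ToyPipeline`, its methods, and the three lambda steps are hypothetical and stand in for real processing, training, and evaluation jobs.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


class ToyPipeline:
    """Runs named steps in dependency order and records run metadata."""

    def __init__(self, name):
        self.name = name
        self._steps = {}    # step name -> callable(inputs, params)
        self._deps = {}     # step name -> set of upstream step names
        self.metadata = []  # one entry per executed step

    def add_step(self, name, func, depends_on=()):
        self._steps[name] = func
        self._deps[name] = set(depends_on)

    def run(self, params):
        results = {}
        # static_order() yields every step after all of its dependencies.
        for name in TopologicalSorter(self._deps).static_order():
            inputs = {dep: results[dep] for dep in self._deps[name]}
            results[name] = self._steps[name](inputs, params)
            self.metadata.append({"step": name, "params": dict(params)})
        return results


pipeline = ToyPipeline("churn-demo")
pipeline.add_step(
    "preprocess",
    lambda inputs, p: [x * p["scale"] for x in (1, 2, 3)],
)
pipeline.add_step(
    "train",
    lambda inputs, p: sum(inputs["preprocess"]) / 3,
    depends_on=["preprocess"],
)
pipeline.add_step(
    "evaluate",
    lambda inputs, p: inputs["train"] > 1.0,
    depends_on=["train"],
)

results = pipeline.run({"scale": 2})
```

Because each run stores the parameters alongside the step name, `pipeline.metadata` can later answer which settings produced which result, a small-scale version of the searchable, reusable workflows mentioned above.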
The three new features added to SageMaker benefit all stakeholders involved in ML workflows: engineers, scientists, and analysts. The Data Wrangler minimizes the time spent massaging data and allows developers to focus more on model training and testing. The Feature Store accelerates model development and removes the usual inconsistencies that arise from maintaining personal feature repositories. SageMaker Pipelines brings CI/CD to machine learning and supports reproducibility of results, as well as easier workflow maintenance.
With the ongoing development of SageMaker, AWS stays at the front of MLaaS innovation. The three additions to SageMaker’s offerings are living proof that AWS sticks to its mission: putting machine learning tools in the hands of every developer and data scientist.
Schedule a 15-minute call with a Blue Orange Digital Solution Architect to discuss the possibilities of AWS and SageMaker. Blue Orange Digital is a certified AWS Development Partner.
About the Author
Josh Miramant is the CEO and founder of Blue Orange Digital, a data science and machine learning agency with offices in New York City and Washington DC.
Miramant is a popular speaker, futurist, and strategic business and technology advisor to enterprise companies and startups. He has been featured in IBM ThinkLeaders, Dell Technologies, Global Banking & Finance Review, and the IoT Council of Europe, among others. He can be reached at email@example.com.
About Blue Orange Digital
Blue Orange Digital is recognized as a “Top 10 AI Development and Consultant Agency,” by Clutch and YahooFinance, for innovations in predictive analytics, automation, and optimization with machine learning in NYC.
They help organizations optimize and automate their businesses, implement data-driven analytic techniques, and understand the implications of new technologies such as artificial intelligence, big data, and the Internet of Things.