Why Consider a Hadoop to Databricks Lakehouse Migration?
Hadoop offers the option to maintain huge on-prem workloads but enterprises need to migrate this data into cloud-based managed services...
CLTV, the projected revenue of a customer over the lifetime of engagements, is one of the most important metrics that help businesses keep track of their performance. On the customer-facing front, it provides a way to better understand customers and their relationship to the business. Marketing teams have a powerful tool for better channeling their efforts, while sales teams can easily identify and nurture high-value customers. At the same time, CLTV also offers businesses a way to improve their internal decision-making process: from allocating marketing spend, to minimizing potential losses and preventing churning.
Modern CLTV systems are already able to predict future relationships between individual customers and a business. With the advance of predictive modeling, algorithms for estimating CLTV have become increasingly popular and even more so, increasingly powerful. Both research and industry efforts have helped shape large-scale systems that can score hundreds of millions of users on a daily basis. Since CLTV estimations benefit countless business and product use cases, interest in CLTV estimation models is only increasing.
We’ve shortly touched on this topic before, but without giving too many technical details. Back then, we made the distinction between traditional RFM models and machine learning models for calculating the CLTV. We’ve emphasized the importance of unified data architectures, which enable ML models. But today we are not talking about data. Instead, we will focus on the ML models that can be used to predict CLTV values.
A way to think about the CLTV estimation problem is to frame it as a clustering problem. In this case, the problem is an unsupervised learning problem: there are no labels available for training, meaning that the models will simply learn how to distinguish between different groups of users, depending on the features that represent them.
The features describing each user usually go well beyond their purchase history. Web activity, engagement scores, and even third-party data sources are commonly used for training clustering models. One common challenge when implementing a clustering model arises when the number of features is larger than the number of samples. This is also known as the curse of dimensionality and is likely to result in low model performance.
The prediction of an exact CLTV value for each user is out of scope when tackling CLTV as a clustering problem. However, that does not make the clustering insights any less valuable. Clustering users makes it possible to identify different user segments and to take action accordingly. When looking closely at users within the same group, they are likely to also share common spending habits, which have a direct influence on their CLTV.
A popular approach to clustering is the k-means algorithm. This is provided out-of-the-box by most providers of ML services in the cloud. AWS SageMaker provides an improved version of the algorithm, which is optimized to run in a distributed manner on large-scale datasets. One decision that needs to be taken before running the k-means algorithm is the number of clusters. This can depend on various business goals and dynamics.
In the case of the CLTV clustering using k-means, an idea would be to cluster users into 3 different groups: LowCLTV, MediumCLTV, HighCLTV. Knowing that users within the same group share a set of common characteristics can enable more accurate marketing and sales strategies.
But perhaps the most exciting use-case for clustering is to use clustering as a preliminary step before training other machine learning models. Each of these clusters can be assigned a label, and further multi-classification models can be used to classify users into one of the already found clusters. This brings us to a slightly different formulation of the CLTV estimation problem.
Tackling the CLTV estimation problem as multi-class classification assumes having access to a labeled dataset. This would make the problem a supervised learning problem, in which each user entry in the training dataset has a label assigned to it. Remember the three clusters that we have proposed above? That could be a way to generate labels for an unlabelled dataset.
Another way to generate labels for training data is to define a number of distinct classes and come up with a formula (based on existing user features) that computes a new feature (e.g. called class). By adding that newly computed feature to the dataset, the data becomes labeled and ready for model training. The most common caveat when tackling multi-class classification problems is the so-called class imbalance problem, which happens when some classes are less present in the dataset. This makes it hard for the model to learn the representative features of that class.
Random Forests are a powerful approach to tackling multi-class classification problems (and beyond!). They are not sensitive to missing values and unlike other algorithms, they can handle nonlinear features. However, the results of random forests are not as easily interpretable by humans, since the result is an average built over the outcome of many decision trees. This may make training Random Forests models hard to debug.
Formulating the CLTV estimation as a multi-class classification problem may not be the most appropriate way to formulate the CLTV problem. Since the problem is naturally a regression problem (the goal being to estimate a continuous value), classification usually serves only as an intermediate learning step in multi-stage training pipelines. Also commonly encountered is the practice of chaining multiple models as part of the training phase.
An example of a working industry system is the one implemented at Groupon as early as 2016. They used a two-stage learning system, in which Random Forests were first used for tackling a classification task and then regression (yes, this model can do both). At first, the classification is done to classify between users, predicting whether or not they are going to make a purchase (binary classification). Secondly, regression is done to estimate the exact value of their next purchase but only trained on the dataset with users who have been classified as positive in the first classification task.
Let’s find out more about the state-of-the-art regression models used for CLTV estimation.
When formulating the problem as a regression problem, we are training models that are meant to learn and predict continuous values. This is the most natural formulation of the CLTV estimation problem.
A high performing algorithm for tackling regression problems is the XGBoost algorithm. This algorithm belongs to a specific class of learning algorithms, known as ensemble learning. The main idea behind ensemble learning is that multiple weak learners are trained, and their predictions are combined before further training more weak learners. This leads to robust and well-performing estimators, which benefit from the weak learners’ ability to cancel out their erroneous estimations.
There are many advantages to using the XGBoost algorithm for regression tasks. It is inherently built for parallel, distributed computation and it is optimized for model performance and computational speed. On top of that, it is open source and provided by all cloud-based ML service providers. No wonder that the algorithm has been performing well in competitions across a variety of ML tasks: from computer vision to NLP and to speech processing. For CLTV estimation, XGBoost has already been implemented for applications in gaming, banking, and even in the telecom and IT sectors.
An important decision that impacts the accuracy of the regression models is the size of the time window that is used for training them. This, once again, depends on various business dynamics and use cases. Some companies may find it relevant to train on historical data going back as much as 2 years. Others may want to train on very recent data, only a few months old. Some may even choose to train and evaluate multiple models. Like this, a better understanding of the seasonality of time series data can be obtained.
The Machine Learning methods for solving the CLTV estimation problem continue to evolve and gain interest from both industry and academia. There are different perspectives from which this problem can be tackled, offering businesses the opportunity to model CLTV according to their specific needs and use cases.
However, the choice of the Machine Learning method for solving CLTV estimation needs to take into account the different aspects of the machine learning systems in which it is implemented. Like we’ve seen above, the availability and dimensions of the training datasets, the required accuracy of the predictions, and the integration of the methods with other models are all common decisions that need to be taken. End-to-end machine learning solutions are only precise (and useful) when they directly match the nature of the data they are built on and the nature of the business use cases that need to be improved.
Blue Orange Digital can develop custom machine learning pipelines, adapted to solve your individual business needs and to perform in the most demanding environments. We can deploy large-scale, distributed solutions using state-of-the-art methods and algorithms. Let us explore your data and get in touch with our team of engineers today!
Schedule 15-min with a Blue Orange Digital Solution Architect to discuss which option is right for your data sources and future goals.
Josh Miramant is the CEO and founder of Blue Orange Digital, a data science and machine learning agency with offices in New York City and Washington DC.
Miramant is a popular speaker, futurist, and a strategic business & technology advisor to enterprise companies and startups. As an example of thought leadership, Miramant has been featured in IBM ThinkLeaders, Dell Technologies, Global Banking & Finance Review, the IoT Council of Europe, among others. He can be reached at email@example.com.
Blue Orange Digital is recognized as a “Top AI Development and Consultant Agency,” by Clutch and YahooFinance, for innovations in predictive analytics, automation, and optimization with machine learning in NYC.
They help organizations optimize and automate their businesses, implement data-driven analytic techniques, and understand the implications of new technologies such as artificial intelligence, big data, and the Internet of Things.