August 3, 2021

Sales teams are continuously looking for ways to improve their performance, but with limited resources, they can’t always find what they’re looking for. Machine learning offers a solution by providing insights into how customers behave and predicting which products or services are most likely to succeed. This allows sales reps to reach out more confidently, knowing what is likely to work best for each individual customer, which helps them close more deals.

Enterprise sales prediction can be quite challenging. Predicting sales requires sales teams to understand what comes next and how they should work towards closing the deal, which is a job of its own and can take weeks or months. Machine learning provides a solution to this problem by using past customer behavior data to forecast future outcomes.

Linear Regression is a statistical approach in which relationships between variables are explored using a line of best fit. The line is chosen to minimize the sum of squared vertical deviations from the data points, which provides insight into the relationships that exist between the variables.

A linear regression model is used to predict future values for a variable based on other related variables, e.g., predicting quarterly sales for product A based on daily sales of products B and C. Linear regression models are also used to measure how changes in one or more independent variables affect a dependent variable, e.g., what would be the impact on quarterly sales if the prices of products B and C were to increase?
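
As a rough illustration of fitting a line of best fit, here is a minimal sketch of one-variable least squares in plain Python. The numbers and the names `daily_sales_b` and `quarterly_sales_a` are made up for illustration, and a real forecast would normally use several predictors and a statistics library.

```python
def fit_line(xs, ys):
    """Ordinary least squares for a single predictor: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance of x and y, and variance of x (both unnormalized)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical data: average daily sales of product B vs. quarterly sales of product A
daily_sales_b = [10, 20, 30, 40]
quarterly_sales_a = [25, 45, 65, 85]

slope, intercept = fit_line(daily_sales_b, quarterly_sales_a)

def predict(x):
    """Predicted quarterly sales of A for a given daily sales level of B."""
    return slope * x + intercept

print(slope, intercept, predict(50))
```

The slope answers the second question in the paragraph above: it estimates how much the dependent variable changes for a one-unit change in the predictor.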

Logistic Regression is also known as logit regression. This is used when you want to predict whether an event will occur (binary outcome) by using existing numerical input variables which act as predictor variables. For example, what is the chance that customer X will purchase product Y?

Logistic regression is used in medical diagnosis, risk assessment, and fraud detection, e.g., diagnosing whether a patient has an illness based on symptoms, assessing the likelihood of credit card fraud by analyzing the transactions made on the card, or estimating the probability of loan default from customer financials.
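
To make the "chance that customer X will purchase product Y" concrete, here is a small sketch of logistic regression with one predictor, trained by plain gradient descent. The data, the learning rate, and the names `past_purchases` and `bought` are assumptions chosen for illustration, not a recommended setup.

```python
import math

def sigmoid(z):
    """Map any real number to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(xs, ys, lr=0.1, steps=5000):
    """Fit weight w and bias b by gradient descent on the log loss."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        grad_w = sum((sigmoid(w * x + b) - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(sigmoid(w * x + b) - y for x, y in zip(xs, ys)) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Hypothetical data: number of past purchases, and whether the customer bought product Y
past_purchases = [0, 1, 2, 3, 4, 5, 6, 7]
bought = [0, 0, 0, 0, 1, 1, 1, 1]

w, b = fit_logistic(past_purchases, bought)

def purchase_probability(x):
    """Estimated probability that a customer with x past purchases buys product Y."""
    return sigmoid(w * x + b)

print(purchase_probability(1), purchase_probability(6))
```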

A decision tree is a set of rules determined by repeatedly splitting data into segments so that, as far as possible, each segment is dominated by one category. The key attributes act as filters at each decision point: every path starts at the root node and ends at a leaf node (terminal node). A simple example would be determining whether or not customer X will buy product Y by first checking if they own product Z: if they do, the chance is 70%, and if they don’t, it is 30%.

Decision trees can be used to determine which customers are most likely to buy a product, e.g., identifying the 20% of customers who are most likely to purchase product Z next month.
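
The product Z example can be sketched as a one-level tree (a single split). The records below are invented so that the leaf probabilities come out at 70% and 30%; the Gini impurity is one common way to score a split, used here as an assumption rather than as the only option.

```python
# Hypothetical records: (owns_product_z, bought_product_y)
customers = (
    [(True, True)] * 7 + [(True, False)] * 3
    + [(False, True)] * 3 + [(False, False)] * 7
)

def purchase_rate(records):
    """Fraction of records in which product Y was bought."""
    return sum(bought for _, bought in records) / len(records)

def gini(records):
    """Gini impurity of a segment: 0 means the segment is pure."""
    p = purchase_rate(records)
    return 1 - p ** 2 - (1 - p) ** 2

# Split on the root attribute "owns product Z"
owners = [r for r in customers if r[0]]
non_owners = [r for r in customers if not r[0]]

# Leaf nodes store the purchase rate of their segment
tree = {True: purchase_rate(owners), False: purchase_rate(non_owners)}

gini_before = gini(customers)
gini_after = (len(owners) * gini(owners) + len(non_owners) * gini(non_owners)) / len(customers)

print(tree, gini_before, gini_after)
```

A split is worth keeping when the weighted impurity after the split is lower than before it, which is the case here.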

K-means clustering allows you to partition n observations into k clusters, in which each observation belongs to the cluster with the nearest mean and that mean serves as the cluster’s representative value. The number of clusters, k, must be specified before running the algorithm, the result depends on the initial choice of cluster centers, and it tends to perform poorly when the clusters are non-convex. For example, let’s say we want to identify which type of customer is most likely to purchase our product: those who have a lot of disposable income or those with a moderate amount of disposable income.

K-means clustering is used to identify customer segments, e.g., identifying two groups with distinct purchase behaviors that would require different messaging or offers.
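
Here is a minimal sketch of the k-means loop on a single feature, assuming k = 2 and hand-picked starting centers. The income figures and the name `disposable_income` are hypothetical.

```python
def kmeans_1d(values, centroids, max_iter=100):
    """Alternate between assigning points to the nearest centroid and recomputing means."""
    centroids = list(centroids)
    clusters = []
    for _ in range(max_iter):
        clusters = [[] for _ in centroids]
        for v in values:
            nearest = min(range(len(centroids)), key=lambda i: abs(v - centroids[i]))
            clusters[nearest].append(v)
        # An empty cluster keeps its previous centroid
        new_centroids = [
            sum(c) / len(c) if c else centroids[i] for i, c in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break  # assignments no longer change
        centroids = new_centroids
    return centroids, clusters

# Hypothetical disposable income per customer, in thousands
disposable_income = [20, 22, 25, 80, 85, 90]

# k = 2, starting from the smallest and largest values
centroids, clusters = kmeans_1d(
    disposable_income, [min(disposable_income), max(disposable_income)]
)
print(centroids, clusters)
```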

Naïve Bayes is an efficient algorithm that applies Bayes’ theorem under the assumption that each feature (dimension) in your dataset is conditionally independent of every other feature, given the class. In practice, this works well as long as you don’t run into problems with data sparsity (lack of data). For example, we could use it to predict which products best suit customers who like products A and B and compare that against another group of customers who like products C and D: which products are the best fit for each customer segment?

Naïve Bayes is used to predict which customers are most likely to respond to a particular message, e.g., identifying the 20% of customers who are most likely to make a purchase in the next month based on their behavior in previous months. The data would be analyzed by segmenting customers into “light”, “medium” and “heavy” users based on how much they spend per month.
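
The sketch below shows a small categorical Naïve Bayes classifier with add-one (Laplace) smoothing, which is one common way to soften the data sparsity problem. The two features (usage tier and whether the last email was opened) and all records are invented for illustration.

```python
from collections import Counter, defaultdict

def train_nb(rows, labels):
    """Count how often each feature value occurs within each class."""
    class_counts = Counter(labels)
    feature_counts = defaultdict(Counter)  # (feature index, label) -> value counts
    values = defaultdict(set)              # feature index -> distinct values seen
    for row, label in zip(rows, labels):
        for i, v in enumerate(row):
            feature_counts[(i, label)][v] += 1
            values[i].add(v)
    return class_counts, feature_counts, values

def predict_proba(model, row):
    """Posterior probability of each class, treating features as independent given the class."""
    class_counts, feature_counts, values = model
    total = sum(class_counts.values())
    scores = {}
    for label, count in class_counts.items():
        score = count / total  # prior
        for i, v in enumerate(row):
            # Add-one smoothing so unseen values do not zero out the product
            score *= (feature_counts[(i, label)][v] + 1) / (count + len(values[i]))
        scores[label] = score
    norm = sum(scores.values())
    return {label: s / norm for label, s in scores.items()}

# Hypothetical records: (usage tier, opened last email), label 1 = responded
rows = [
    ("heavy", "yes"), ("heavy", "yes"), ("heavy", "no"), ("medium", "yes"),
    ("medium", "no"), ("light", "no"), ("light", "no"), ("light", "yes"),
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

model = train_nb(rows, labels)
p_heavy = predict_proba(model, ("heavy", "yes"))[1]
p_light = predict_proba(model, ("light", "no"))[1]
print(p_heavy, p_light)
```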

A Support Vector Machine (SVM) is a binary classification algorithm that finds the hyperplane with the largest margin, that is, the largest distance to the nearest observations of either class. It does this while penalizing the errors made in classifying observations into each class, and it can be extended to multi-class classification by combining several binary classifiers. For example, let’s say we have two customer segments that are most likely to buy product Y: those who own product X and those who don’t. Support Vector Machine could be used to identify how customers should be segmented so the chance of selling product Y is maximized for both segments.

Support Vector Machine can be used to identify which products are best suited for a particular customer segment based on their behavior in previous months, e.g., identifying whether there is any interaction between buying one product and another (is there an association or dependence?) by looking at what other products customers buy during the same time period as the first purchase. This would allow us to see if receiving offer A increases or decreases the chance of buying offer B.
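
As a rough sketch of the idea, the code below trains a linear soft-margin SVM by sub-gradient descent on the hinge loss. This is a simplified training method chosen to keep the example short; production SVM libraries use more sophisticated solvers. The two features (imagined as centered spend and engagement scores), the data, and the hyperparameters are all assumptions.

```python
def train_linear_svm(points, labels, lr=0.1, reg=0.01, epochs=200):
    """Minimize reg/2 * |w|^2 + mean hinge loss with full-batch sub-gradient steps."""
    w = [0.0, 0.0]
    b = 0.0
    n = len(points)
    for _ in range(epochs):
        grad_w = [reg * w[0], reg * w[1]]
        grad_b = 0.0
        for (x0, x1), y in zip(points, labels):
            # Only points inside the margin contribute to the hinge-loss gradient
            if y * (w[0] * x0 + w[1] * x1 + b) < 1:
                grad_w[0] -= y * x0 / n
                grad_w[1] -= y * x1 / n
                grad_b -= y / n
        w = [w[0] - lr * grad_w[0], w[1] - lr * grad_w[1]]
        b -= lr * grad_b
    return w, b

# Hypothetical customers: (spend score, engagement score), label +1 = bought product Y
points = [(2, 2), (3, 3), (3, 2), (-2, -2), (-3, -3), (-2, -3)]
labels = [1, 1, 1, -1, -1, -1]

w, b = train_linear_svm(points, labels)

def classify(point):
    """Return +1 or -1 depending on which side of the hyperplane the point falls."""
    score = w[0] * point[0] + w[1] * point[1] + b
    return 1 if score >= 0 else -1

print(w, b, classify((4, 4)), classify((-4, -4)))
```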

Association learning (AL) looks at all possible rules between items and then calculates support (the proportion of transactions in which the items of a rule appear together), confidence (of the transactions containing the first item, the proportion that also contain the second), and lift (the ratio of the confidence to the overall frequency of the second item, where a value above 1 means the items co-occur more often than they would if they were independent). A rule is considered more interesting if it has high confidence, high lift, and enough support. This method can be used to identify which products are often purchased together in order to find new relationships between them.

For example, let’s say we have two dimensions A and B with 5 items each, giving us 10 total items (A1, A2…A5 and B1, B2…B5). We could use the following rules to check item relationships:

- If a customer purchases A3, they are also likely to purchase B4
- If a customer purchases A4, they are more likely to purchase B4 than any other item on dimension B
- If a customer purchases both A3 and A4 then it is unlikely that they will purchase B5 — there exists an association rule where if you buy product A3 then you are less likely to buy product B5

You could also use this algorithm to identify customer segments based on which products customers purchase. For example, if we want to identify the 20% of customers who are most likely to buy products from both group 1 and group 2, it would look at all possible rules between items in each group and calculate the frequency of every rule. This algorithm is used for market basket analysis: finding groups of products that frequently appear together in a single transaction. Market basket analysis is an important factor when deciding which items should be discounted or included in a promotion because it shows how different product combinations affect each sale. For instance, if promotional offer X was only ever purchased with product Y, offering a discount on just product Y probably won’t help sales that much, but if promotional offer X is only ever purchased with a combination of products A, B and C, then a discount on all three would likely increase average order value.
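
To show how a single rule such as "A3 implies B4" is scored, here is a small sketch that computes support, confidence, and lift using their usual definitions (lift being confidence divided by the overall frequency of the second item). The transaction list is invented for illustration.

```python
# Hypothetical transactions, each a set of purchased items
transactions = [
    {"A3", "B4"}, {"A3", "B4"}, {"A3", "B4"}, {"A4"},
    {"B5"}, {"A3"}, {"A4", "B5"}, {"A4", "B4"},
]

def support(itemset):
    """Proportion of transactions that contain every item in the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent):
    """Of the transactions containing the antecedent, the share that also contain the consequent."""
    return support(antecedent | consequent) / support(antecedent)

def lift(antecedent, consequent):
    """Confidence relative to how often the consequent appears overall (above 1 = positive association)."""
    return confidence(antecedent, consequent) / support(consequent)

rule_support = support({"A3", "B4"})
rule_confidence = confidence({"A3"}, {"B4"})
rule_lift = lift({"A3"}, {"B4"})
print(rule_support, rule_confidence, rule_lift)
```

A full market basket analysis would repeat this for every candidate rule and keep only those above chosen thresholds.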

Graphical models (GM) are used when you have an outcome variable (e.g., purchase or not purchase), one or more predictor variables (e.g., age, income level, etc.), and several potential interactions among them. They can be used in situations where there may be multiple causative factors influencing a decision process, for example, which customers are most likely to respond to a mailing campaign? Graphical models use probability theory to model dependencies among different events by constructing a graph in which nodes represent the events and edges represent the dependencies between them.

Ensemble Methods create an aggregate prediction by combining the predictions of several machine learning models, each of which may have used a different algorithm to generate its results. These are generally built from decision trees using boosting or bagging (see below). The accuracy of ensemble methods tends to be higher than that of individual models because the errors of the individual models partly cancel out: if one model predicts that someone will purchase but another predicts that they won’t, there is less chance of both being wrong. However, the error rate can still be high if the models in the ensemble are not properly trained. It’s important to note that ensemble methods are only as good as the underlying algorithms used, so it’s best practice to match ensembles to the data being used.

Boosting is a machine learning technique that builds a model iteratively from a base learner: each iteration attempts to correct the mistakes of the previous ones by giving more weight to the training examples that were misclassified. The main idea is that each individual learner is weak and unlikely to pick up on every pattern, so combining many learners gives a stronger model, although adding too many can lead to overfitting. The downside is that boosting can be costly in terms of time and resources: each iteration builds on the previous one, so the learners cannot be trained in parallel, and if your model takes 30 hours to build, then results may not be available for a long time. Boosting is most often used with decision trees as base learners, for example in gradient-boosted trees for tabular data such as customer records.
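
The reweighting idea can be sketched with AdaBoost, one well-known boosting algorithm, using one-split decision stumps as the weak learners. The data is a toy example (imagine `xs` as months since the last purchase and `ys` as responded or not); both the data and the choice of AdaBoost are assumptions for illustration.

```python
import math

def stump_predict(x, threshold, polarity):
    """Weak learner: predict `polarity` below the threshold and its opposite above."""
    return polarity if x < threshold else -polarity

def best_stump(xs, ys, weights):
    """Pick the threshold and polarity with the lowest weighted error."""
    ordered = sorted(xs)
    thresholds = [(a + b) / 2 for a, b in zip(ordered, ordered[1:])]
    best = None
    for threshold in thresholds:
        for polarity in (1, -1):
            error = sum(
                w for x, y, w in zip(xs, ys, weights)
                if stump_predict(x, threshold, polarity) != y
            )
            if best is None or error < best[0]:
                best = (error, threshold, polarity)
    return best

def adaboost(xs, ys, rounds):
    n = len(xs)
    weights = [1 / n] * n
    ensemble = []
    for _ in range(rounds):
        error, threshold, polarity = best_stump(xs, ys, weights)
        error = max(error, 1e-12)  # guard against division by zero
        alpha = 0.5 * math.log((1 - error) / error)  # say of this learner
        ensemble.append((alpha, threshold, polarity))
        # Misclassified examples gain weight, correctly classified ones lose weight
        weights = [
            w * math.exp(-alpha * y * stump_predict(x, threshold, polarity))
            for x, y, w in zip(xs, ys, weights)
        ]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble

def predict(ensemble, x):
    """Weighted vote of all weak learners."""
    score = sum(alpha * stump_predict(x, t, p) for alpha, t, p in ensemble)
    return 1 if score >= 0 else -1

# Hypothetical data that no single stump can separate
xs = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
ys = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1]

ensemble = adaboost(xs, ys, rounds=3)
print([predict(ensemble, x) for x in xs])
```

On this toy data a single stump misclassifies three examples, while the weighted vote of three stumps classifies all ten correctly.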

Bagging (bootstrap aggregating) is an ensemble method that, like boosting, combines many models. It involves creating multiple samples of the training data by resampling with replacement, building an individual model from each sample, and then working out the aggregate prediction using either voting or averaging. Bagging mainly reduces model variance and can improve predictions, but it may still suffer from overfitting if there is a lot of noise in the training data.
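
A minimal sketch of bagging with majority voting is shown below, using a one-threshold stump as the base model. The data (imagine `xs` as store visits last month and `ys` as bought or not), the number of models, and the seed are assumptions for illustration.

```python
import random

def fit_stump(xs, ys):
    """Return the threshold t that minimizes errors of the rule 'predict 1 if x > t'."""
    candidates = [min(xs) - 1] + sorted(set(xs))

    def errors(t):
        return sum((x > t) != bool(y) for x, y in zip(xs, ys))

    return min(candidates, key=errors)

def bagging_fit(xs, ys, n_models=25, seed=0):
    """Train one stump per bootstrap sample (drawn with replacement)."""
    rng = random.Random(seed)
    n = len(xs)
    thresholds = []
    for _ in range(n_models):
        idx = [rng.randrange(n) for _ in range(n)]
        thresholds.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return thresholds

def bagging_predict(thresholds, x):
    """Majority vote across all stumps."""
    votes = sum(x > t for t in thresholds)
    return 1 if votes * 2 > len(thresholds) else 0

# Hypothetical data: customers with 5 or more visits bought the product
xs = list(range(10))
ys = [0] * 5 + [1] * 5

thresholds = bagging_fit(xs, ys)
print(bagging_predict(thresholds, 9), bagging_predict(thresholds, 0))
```

Because each stump is trained independently, the models could be built in parallel, unlike the sequential iterations of boosting.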