How can business leaders collaborate productively and effectively with data science teams? Winning with Data Science takes a narrative approach with fictitious characters to showcase different approaches to projects involving data science teams. In this excerpt, David is a senior data scientist working for Stardust Health Insurance, a medium-sized insurance company working primarily in California and Florida. David works closely with Kamala, a rising star at the company who has both an MD and an MBA and, most importantly for our story, a keen interest in using data to succeed.
Examining residuals
“I’d like to welcome you all to the inaugural Stardust Health Insurance data science team hackathon.”
David was having fun as the emcee of the event. “Our teams have been hard at work to optimise our model to predict healthcare expenditure for patients with back pain who receive nonsurgical treatment. Before we get started, let’s recap the current state of our model.
“The outcome for this model was predicted back pain–related healthcare expenditure in the year following a request for prior authorisation. On average, the absolute difference between the model’s predicted healthcare expenditure and the true healthcare expenditure was USD$12,000 for the first year following the prior authorisation request. The actual differences ranged from $50 to $50,000.
“Those on the prior authorisation team have been using this model for the past several months with success: they are seeing reduced healthcare expenditure as a result of their prior authorisation decisions and initial analyses show improvements in patient outcomes as well. Although the model was designed to predict expenditures for two years, these initial results are promising and now we want to see how well we can improve the performance of this model.
“The first team we will be hearing from didn’t change the model architecture: the team members used linear regression, but they did some clever analysis to figure out how to improve the model. I’ll hand it over to them.”
The team lead walked to the front of the room and flashed their slides on the screen. “The performance of the original model was good, but we felt that there was still more that could be done within the confines of linear regression. We decided to do a deep dive into what types of patients the model performs well on and what types of patients it doesn’t perform as well on. To do this, we examined the residuals.”
The team lead continued: “What we found was that some patients had very small residuals, but for other patients, the model performed very poorly – on average, the model was off by about $25,000. We ran some descriptive statistics to see whether those groups of patients were different in any meaningful way. It turned out that the patients with higher residuals tended to have jobs that involve more manual labour – for example, farming or construction. So instead of changing the model architecture, we decided to add more features regarding a patient’s job.
“When we added in these new features, we saw that the average residual for the previously poor-performing group dropped from $25,000 to $15,000. In other words, we were able to improve the performance of our model for that group of patients by about 40 per cent. Just by adding a few well-selected features, we were able to bring down the overall average residual from $12,000 to $9,000.”
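The kind of residual check the team describes can be sketched in a few lines of Python. The records, job categories and dollar figures below are invented for illustration and are not Stardust data; the point is only to show how averaging the absolute residual within each group can expose where a model struggles.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical records: (job type, actual spend, predicted spend) in USD.
records = [
    ("office", 12_000, 10_000),
    ("office", 8_000, 9_000),
    ("office", 15_000, 12_000),
    ("construction", 60_000, 32_000),
    ("farming", 45_000, 20_000),
    ("construction", 10_000, 32_000),
]

# Collect the absolute residual |actual - predicted| for each job type.
abs_residuals = defaultdict(list)
for job, actual, predicted in records:
    abs_residuals[job].append(abs(actual - predicted))

# A large gap between groups hints at a feature the model is missing.
mean_abs_residual = {job: mean(vals) for job, vals in abs_residuals.items()}
overall = mean(r for vals in abs_residuals.values() for r in vals)

print(mean_abs_residual)
print(overall)
```

In this made-up sample the manual-labour groups have far larger average residuals than the office group, which is the pattern that would prompt adding job-related features.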
Interaction terms and transformations
David took the mic and welcomed the next team: “The next team also stuck with linear regression but took a clever approach to feature engineering.”
The team lead took to the stage and said: “We also felt that the biggest limitation of our model was its features. It doesn’t matter if you use linear regression, random forest, or a neural network; if your features aren’t good, your predictions won’t be good. So we decided to engineer new features using the ones that we already had.”
One of the simplest approaches to automated feature engineering is calculating interaction terms. An interaction term captures how the effect of one feature on the outcome depends on the value of another feature. For example, patient age and patient gender may be included in the original healthcare expenditure model as main effects. But what if the relationship between healthcare spending and age depends on gender?
We can add an interaction term by including the product of age and gender (with gender coded numerically, for example as 0 or 1) as a new feature in our model. This interaction term allows for different slopes for the relationship between expenditure and age for men versus women.
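To make the "different slopes" idea concrete, here is a minimal sketch using NumPy's least-squares solver. The data are invented and noise-free, and the 0/1 coding of gender, the variable names and the coefficients are all assumptions chosen so the result is easy to read.

```python
import numpy as np

# Hypothetical, noise-free data: gender is coded 0/1 and spending rises
# faster with age in group 1 than in group 0.
age = np.array([30, 40, 50, 60, 30, 40, 50, 60], dtype=float)
gender = np.array([0, 0, 0, 0, 1, 1, 1, 1], dtype=float)
spend = 1_000 + 100 * age + 500 * gender + 50 * age * gender

# Design matrix: intercept, the two main effects, and their product.
X = np.column_stack([np.ones_like(age), age, gender, age * gender])
coef, *_ = np.linalg.lstsq(X, spend, rcond=None)
b0, b_age, b_gender, b_inter = coef

# The age slope is b_age when gender = 0 and b_age + b_inter when gender = 1.
slope_group0 = b_age
slope_group1 = b_age + b_inter
print(round(slope_group0), round(slope_group1))
```

Without the product column, a linear model would be forced to use a single age slope for both groups.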
The team lead summed up the results: “By including all possible two-way interaction terms, we gave our model access to relationships between variables that we hadn’t considered including in the original model. We were able to bring the average residual down to $8,000!”
K-nearest neighbours
“This next team moved away from linear regression and tried out a different model architecture,” said David, as he stepped down from the stage.
The team lead picked up the mic: “One characteristic of linear regression is that it uses all the data points in the dataset to make a prediction. This can be good in some cases, but we thought it might be better to use an approach where the prediction is based on only the handful of data points that are most similar to the patient in question. Therefore, we used a ‘K-nearest neighbour’ (K-NN) regression model.”
The first step of the modelling process is to choose the parameter ‘K’, which represents the number of observations the model will use to make a prediction. For example, if K=20, the model will look at the 20 data points that are most similar to the data point it is making a prediction for. The next step is to determine how to define ‘similar’. The most common approach is to calculate the Euclidean distance between datapoints. This amounts to squaring the difference in each dimension, adding up those squares and taking the square root of the total (for text variables, Hamming distance is often used).
The team lead went on. “We determined through cross-validation that the optimal value for K was 100. Simply put, our model makes predictions for new patients by averaging the expenditures of the 100 most similar patients in the dataset.” For categorical outcomes, K-NN finds the K observations that are closest to the datapoint and assigns the datapoint to the class that receives the most votes among those neighbours.
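A bare-bones version of K-NN regression fits in a few lines of standard-library Python. This is a sketch under simplifying assumptions: the features are taken to be already normalised, the five training patients are invented, and K is set to 3 rather than 100 so the arithmetic can be followed by hand.

```python
import math

def knn_predict(train_X, train_y, query, k):
    """Average the outcomes of the k training points closest to query."""
    # Pair each training point's Euclidean distance to the query with its outcome.
    distances = [(math.dist(x, query), y) for x, y in zip(train_X, train_y)]
    # Keep the k smallest distances.
    nearest = sorted(distances)[:k]
    return sum(y for _, y in nearest) / k

# Hypothetical, already-normalised features: (age, prior-year visits).
train_X = [(0.1, 0.2), (0.2, 0.1), (0.15, 0.25), (0.9, 0.8), (0.85, 0.95)]
train_y = [4_000, 5_000, 6_000, 40_000, 50_000]

prediction = knn_predict(train_X, train_y, query=(0.12, 0.18), k=3)
print(prediction)
```

The query sits among the three low-cost patients, so the two high-cost patients play no part in the prediction.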
Another consideration is what features you should include in K-NN. You want to use only features that are relevant in predicting the outcome of interest. Often, there must be some filtering of input variables to first remove the ones that aren’t predictive. The features then need to be normalised so each feature has about the same range. If you don’t normalise the features, then those that have a much larger range will dominate the distance computation, and other features with smaller ranges won’t contribute much to the modelling.
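The effect of normalisation on the distance computation can be seen in a small sketch. Min-max scaling is used here as one common choice (the text does not say which method the team used), and the two features and four patients are invented for illustration.

```python
import math

def min_max_scale(column):
    """Rescale a list of numbers to the range [0, 1]."""
    lo, hi = min(column), max(column)
    return [(v - lo) / (hi - lo) for v in column]

def nearest_to_first(points):
    """Index of the point closest to points[0], excluding itself."""
    return min(range(1, len(points)),
               key=lambda i: math.dist(points[0], points[i]))

# Hypothetical raw features: age in years and prior-year spend in USD.
ages = [25, 65, 27, 70]
prior_spend = [50_000, 51_000, 60_000, 150_000]

raw = list(zip(ages, prior_spend))
scaled = list(zip(min_max_scale(ages), min_max_scale(prior_spend)))

# Unscaled, the dollar amounts swamp the age differences.
raw_neighbour = nearest_to_first(raw)
# Scaled, both features contribute on a comparable footing.
scaled_neighbour = nearest_to_first(scaled)
print(raw_neighbour, scaled_neighbour)
```

On the raw data the nearest neighbour of the 25-year-old is the 65-year-old with almost identical spend; after scaling it becomes the 27-year-old, because age is no longer drowned out by the larger dollar range.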
The team lead concluded: “One of the advantages of K-NN is that it can make predictions for nonlinear relationships because it makes no assumptions about linearity. The model reduced the average residual from $12,000 to $10,000, which is not as big an improvement as our colleagues achieved. But our team’s guess is that K-NN performed better on the patients who were harder for linear regression to predict.”
Decision trees
The next team lead took to the stage. “Similar to the last group, we wanted to account for possible nonlinear relationships between variables. We also wanted the output of the model to correspond to easily interpretable patient profiles. A regression tree model was the perfect choice for our goals.”
Classification and regression tree (CART) modelling consists of dividing the population into smaller subpopulations and then making predictions on those smaller subpopulations.
For example, assume we’re building a regression tree that splits the population into two groups for each node. To predict healthcare expenditure, the algorithm would find the best variable and value of that variable to separate out the high- and low-cost patients. Once this first split is done, there are now two groups, the low- and the high-cost patients. The algorithm is then applied to each of these two groups to again find the best variable and value of that variable to split each of these groups into two more groups, so there are now four groups. The algorithm will stop when there are not enough patients in each group to split again or when it reaches some other stopping rule. When the CART model is completed, every patient in the population will be assigned to one and only one group, called a leaf, and the characteristics of each leaf can be easily read.
For example, the highest-cost leaf may be male patients over 65 years old who live in the US Northeast, and the lowest-cost leaf may be female patients under 18 who live in the South. A prediction for a new patient would be made by identifying what leaf the patient belongs to and assigning the average value of that leaf to that patient.
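The splitting step can be illustrated with a single feature and a single split. The sketch below assumes squared error as the splitting criterion (a common choice for regression trees, though the text does not specify one) and uses invented ages and expenditures; a full CART implementation would repeat the search over every feature and recurse into each group.

```python
from statistics import mean

def sse(values):
    """Sum of squared deviations from the group mean."""
    m = mean(values)
    return sum((v - m) ** 2 for v in values)

def best_split(xs, ys):
    """Return (threshold, error) for the split 'x <= threshold' with the lowest total squared error."""
    best = None
    # Skip the largest value: splitting there would leave one side empty.
    for threshold in sorted(set(xs))[:-1]:
        left = [y for x, y in zip(xs, ys) if x <= threshold]
        right = [y for x, y in zip(xs, ys) if x > threshold]
        error = sse(left) + sse(right)
        if best is None or error < best[1]:
            best = (threshold, error)
    return best

# Hypothetical patients: age and annual expenditure in USD.
ages = [22, 30, 41, 58, 66, 73]
spend = [3_000, 4_000, 5_000, 6_000, 30_000, 34_000]

threshold, _ = best_split(ages, spend)
# Each leaf predicts the average expenditure of the patients it contains.
left_leaf = mean(y for x, y in zip(ages, spend) if x <= threshold)
right_leaf = mean(y for x, y in zip(ages, spend) if x > threshold)

def predict(age):
    """Assign a new patient to a leaf and return that leaf's average."""
    return left_leaf if age <= threshold else right_leaf

print(threshold, predict(35), predict(70))
```

The resulting rule reads as a plain if/then statement, which is what makes tree models easy to inspect.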
CART modelling has several advantages. First, the data scientist does not have to make assumptions about the features and their supposed relationship with the outcome; only the splitting and stopping rules need to be defined for the tree to be produced. Second, CART models can be used to predict binary variables, categorical variables and continuous variables, and they are not as sensitive to outliers and missing data as other regression methods are. Third, CART models can easily represent nonlinear relationships and interaction terms without the need for the modeller to specify them in the model itself. Lastly, CART modelling is easily interpretable in that it produces subpopulations of high and low values based on a set of if/then statements, allowing you to look at the rules and ask whether they make sense.
The team lead summarised the results. “Our tree-based model didn’t reduce the average residual by much: we went from $12,000 to $11,000. But we discovered we could develop this tree-based model much more quickly than we did using the original model because the data did not require much pre-processing. Since tree-based models can handle missing values and variables of very different scales, it’s relatively quick to train a model once you have the data collected.”
Boosting, bagging and ensembling
David clapped for the last team and went back up on stage. “Now that we’ve heard from all our teams, I want to present a surprise that the data team leaders have been working on since the end of the hackathon. We’ve all heard the old saying ‘two heads are better than one’. But have you heard the more recent saying ‘two machine learning models are better than one’?” The audience laughed.
“After all the teams submitted their models, we had a secret team working in the background to combine all the models into a single, more accurate and more robust prediction using ensembling methods.”
One of the most common ensembling methods is stacking. In model stacking, different models are created and their predictions are then used as input variables to a new model, which is used to make the final prediction. The first models are known as level one models and these level one model predictions serve as inputs to the level two model. Stacked models can be thought of as a method to weight different models to produce a final result. The level two model, also called the ‘stacked model’, can outperform each of the individual models by more heavily weighting the level one models where they perform best and giving those models less weight where they perform poorly.
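A stripped-down illustration of stacking, with a linear regression as the level two model, might look like the following. The numbers are deliberately contrived so that the two level one models err in opposite directions and the errors cancel exactly; real models will not be this obliging. The sketch also fits the level two model on the same data the level one predictions came from, whereas in practice the level two model is usually fitted on held-out predictions to avoid overfitting.

```python
import numpy as np

# Hypothetical true expenditures (in $1,000s) and two level one models'
# predictions. Model A overshoots exactly where model B undershoots.
y = np.array([10.0, 20.0, 30.0, 40.0, 50.0, 60.0])
sign = np.array([1.0, -1.0, 1.0, -1.0, 1.0, -1.0])
pred_a = y + 3.0 * sign  # off by 3 on every patient
pred_b = y - 1.0 * sign  # off by 1 on every patient

# Level two model: linear regression on the level one predictions.
X = np.column_stack([np.ones_like(y), pred_a, pred_b])
weights, *_ = np.linalg.lstsq(X, y, rcond=None)
stacked = X @ weights

def mae(pred):
    """Mean absolute difference between predictions and true values."""
    return float(np.mean(np.abs(y - pred)))

print(weights)
print(mae(pred_a), mae(pred_b), mae(stacked))
```

The level two model learns to lean mostly on the more accurate model B while still using model A, and the combination is more accurate than either model alone.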
A special case of ensembling is called ‘bootstrap aggregation’ or ‘bagging’ for short. Bootstrapping refers to making several datasets from the original dataset by resampling observations with replacement, so that some observations appear more than once in a resampled dataset and others not at all. For example, from a starting dataset of 1,000 observations, you may create 10 bootstrapped datasets, each containing 1,000 observations resampled from the original 1,000. A model is fitted to each of the bootstrapped datasets and the resulting predictions are aggregated. It has been shown that bagging can improve prediction accuracy and help avoid overfitting.
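The bootstrap-then-aggregate recipe can be sketched with the standard library alone. The base model here is a one-nearest-neighbour predictor, chosen only because it is short to write; the dataset, the seed and the query age are all illustrative assumptions.

```python
import random

def one_nn_predict(sample, x):
    """Predict with the outcome of the single closest point in sample."""
    return min(sample, key=lambda point: abs(point[0] - x))[1]

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Hypothetical (age, expenditure in USD) pairs.
data = [(25, 3_000), (32, 4_500), (41, 5_000), (48, 9_000),
        (55, 12_000), (63, 20_000), (70, 31_000), (78, 35_000)]

n_models = 10
# Each bootstrapped dataset has the same size as the original and is
# drawn with replacement, so some rows repeat and others are left out.
bootstraps = [rng.choices(data, k=len(data)) for _ in range(n_models)]

# Fit one model per bootstrapped dataset (for 1-NN, "fitting" is just
# keeping the sample) and aggregate by averaging the predictions.
predictions = [one_nn_predict(sample, 60) for sample in bootstraps]
bagged_prediction = sum(predictions) / n_models
print(bagged_prediction)
```

Each individual model sees a slightly different dataset and may give a different answer; averaging smooths out that variability.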
Another technique commonly used to improve the performance of machine learning models is gradient boosted machine learning, also known as GBML. The word ‘boosted’ is critical here. Boosting involves training models in sequence, with each new model concentrating on the observations that the previous ones got wrong: misclassified observations are given more weight in the next iteration of the training or, in gradient boosting, the next model is fitted to the errors that remain. The more accurate trees may also be given more weight in the final prediction. There are many boosting algorithms; two of the most commonly used are AdaBoost and Arcboost.
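For a regression problem like the expenditure model, the boosting idea can be sketched as follows. Note that this sketch does not reweight observations as described above; it uses the gradient boosting formulation for squared error, in which each new one-split tree (a "stump") is fitted to the residuals left by the ensemble so far. The data, learning rate and number of rounds are illustrative assumptions.

```python
from statistics import mean

def fit_stump(xs, targets):
    """Best one-split regression tree: returns (threshold, left_mean, right_mean)."""
    best = None
    for t in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, targets) if x <= t]
        right = [r for x, r in zip(xs, targets) if x > t]
        lm, rm = mean(left), mean(right)
        err = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or err < best[0]:
            best = (err, t, lm, rm)
    return best[1:]

def stump_predict(stump, x):
    t, lm, rm = stump
    return lm if x <= t else rm

def mse(preds, ys):
    return mean((y - p) ** 2 for y, p in zip(ys, preds))

# Hypothetical patients: age and annual expenditure in USD.
ages = [22, 30, 41, 58, 66, 73]
spend = [3_000, 4_000, 9_000, 12_000, 30_000, 34_000]

learning_rate = 0.5
preds = [mean(spend)] * len(spend)  # start from the overall average
initial_mse = mse(preds, spend)
mse_history = []
for _ in range(20):
    # Fit the next stump to what the current ensemble still gets wrong.
    residuals = [y - p for y, p in zip(spend, preds)]
    stump = fit_stump(ages, residuals)
    # Add a damped version of the stump's correction to the ensemble.
    preds = [p + learning_rate * stump_predict(stump, x)
             for p, x in zip(preds, ages)]
    mse_history.append(mse(preds, spend))

print(initial_mse, mse_history[-1])
```

Each round chips away at the remaining training error; in practice the number of rounds and the learning rate are tuned on held-out data, because driving the training error towards zero invites overfitting.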
David continued: “We gave each model access to all the features that were engineered by the different teams and we applied bagging and boosting to the regression tree model to improve its performance. Finally, we developed a stacked model, which combined the predictions from the three level one models using a level two linear regression model. Our stacked ensemble vastly outperformed each of the individual models: we were able to decrease the average residual from $12,000 to $5,000.”
Kamala was floored. How could the stacked model be that much better? “David, I’m wondering how the stacked model performed so well,” she said. “Before the ensembling, the best model was one of the linear regression models that had an average residual of $8,000. All the other models had residuals higher than that. How is it that adding a model with residuals of $10,000 or more to a model with residuals at $8,000 leads to a model with residuals at $5,000?”
“That’s the beauty of ensembling,” David replied. “Imagine you’re on a trivia game show with a partner. Your partner is a quiz bowl whiz. They know just about everything – except they’ve never had a penchant for pop music, so they’re completely useless when it comes to any question relating to music. You, on the other hand, know absolutely nothing about geography, sports, science, or history. But you’re a huge music and film fan, so any question on pop music that comes your way is a piece of cake. Individually, your partner would perform way better than you. They may get 90 per cent of all questions right if you assume the remaining 10 per cent are about music. You, on the other hand, would do terribly on your own. You’d be lucky to get 15 per cent if you assume those are the questions about music and films. Now, what if we put you two together on the same team? You can complement your partner’s knowledge on music and films and they can carry the team for all the remaining questions. Individually, it’s unlikely either of you would get 100 per cent, but together you have a real shot at a perfect score.”
This made sense to Kamala. “It’s like forming a committee. The different backgrounds of the committee members complement each other and, collectively, they’re able to make a better decision than they would individually.”
“Exactly,” David concluded. “Two models are better than one.”
Excerpted from Winning with Data Science by Howard Steven Friedman and Akshay Swaminathan, published by Columbia Business School Publishing. Copyright (c) 2024 Howard Steven Friedman and Akshay Swaminathan. Used by arrangement with the Publisher. All rights reserved. Further information is available at winningwithdatascience.com.