Automatic linear modeling: Ensemble models

Today, we will look into ensemble model methods.

Multi-model approach

Ensemble model methods gained popularity in the 1990s. The main idea is to combine several different models, each with its own prediction error, so that the errors of the individual models partly offset one another and the error of the combined prediction is reduced. This improves model stability and prediction accuracy.

Ensemble models are used for both regression problems (estimating a quantitative variable) and classification problems. They are especially popular with classification trees and neural networks. The base models are combined using an aggregation function. For regression models, this is often the average of the predictions from the base models. For classification problems, it is often a vote among the base models.
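The two aggregation functions mentioned above can be sketched in a few lines of Python. This is only an illustration: the function names and the sample predictions are hypothetical and do not come from IBM SPSS Statistics / PS IMAGO PRO.

```python
from collections import Counter
from statistics import mean


def aggregate_regression(predictions):
    """Average the numeric predictions of the base models."""
    return mean(predictions)


def aggregate_classification(votes):
    """Return the class predicted by the largest number of base models."""
    # most_common(1) returns [(label, count)] for the most frequent label;
    # on a tie, the label that was seen first wins.
    return Counter(votes).most_common(1)[0][0]


# Hypothetical outputs of five base models for a single case.
price_predictions = [310.0, 295.0, 305.0, 300.0, 290.0]
class_votes = ["buyer", "non-buyer", "buyer", "buyer", "non-buyer"]

print(aggregate_regression(price_predictions))   # 300.0
print(aggregate_classification(class_votes))     # buyer
```

With an odd number of base models and two classes, a majority vote can never end in a tie, which is one reason to prefer an odd ensemble size for binary classification.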

For example, you build ten classification models that segment customers into those interested in purchasing a product and those who are not. The final classification is the result of a majority vote by all the models. How do the base models differ from one another? They may be built on different datasets (samples) or use different sets of variables. Depending on the specific method, the random samples may or may not be disjoint. In the case of classification and regression trees, the random forest algorithm is a popular multi-model approach that combines random samples of observations with random subsets of predictors. The random forest algorithm, as well as bagging and AdaBoost mentioned below, is discussed in detail in the post by Ewa Grabowska (How to boost a decision tree algorithm).

Model aggregation methods

Model aggregation methods may have different architectures: parallel or serial. In a parallel ensemble, consecutive random samples (sets) and models are independent of each other. In serial ensembles, successive steps make use of feedback from the model built in the previous step.

There are many model aggregation methods. The most popular and the simplest one is bagging. Its name is an abbreviation of bootstrap aggregating. It involves building base models on training samples of size n, each sampled with replacement from the training set. IBM SPSS Statistics / PS IMAGO PRO always uses all observations, but each observation is assigned a simulated weight determined using the binomial distribution. In an ensemble model built using bagging, the prediction is the mean or the median of the base model predictions. For example, when building an apartment valuation model using bagging, you build 10 consecutive regression models on different simulated samples. Each model predicts a slightly different price for an apartment with specific parameters. The averaged result (or the median price) is the prediction of the ensemble model.
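The bagging procedure can be sketched as follows. This is a minimal illustration under stated assumptions: it uses plain resampling with replacement rather than the simulated binomial weights used by IBM SPSS Statistics / PS IMAGO PRO, a one-predictor least-squares line as the base model, and made-up apartment data (area in square metres, price in thousands). All names are hypothetical.

```python
import random
from statistics import mean, median


def fit_line(xs, ys):
    """Ordinary least squares for one predictor: returns (intercept, slope)."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


def bagged_models(xs, ys, n_models, rng):
    """Fit one line per bootstrap sample of the same size as the data."""
    n = len(xs)
    models = []
    while len(models) < n_models:
        idx = [rng.randrange(n) for _ in range(n)]  # sampling with replacement
        sample_x = [xs[i] for i in idx]
        if len(set(sample_x)) < 2:
            continue  # degenerate sample: a slope cannot be estimated
        models.append(fit_line(sample_x, [ys[i] for i in idx]))
    return models


def predict(models, x, combine=median):
    """Combine the base-model predictions with the median (or the mean)."""
    return combine([a + b * x for a, b in models])


# Made-up data for illustration only.
areas = [30, 45, 50, 62, 75, 80, 95, 110]
prices = [152, 210, 248, 300, 371, 389, 470, 545]

rng = random.Random(42)  # fixed seed so that the run is repeatable
models = bagged_models(areas, prices, n_models=10, rng=rng)

print(round(predict(models, 60), 1))                # median of 10 predictions
print(round(predict(models, 60, combine=mean), 1))  # mean of 10 predictions
```

Each of the ten lines gives a slightly different price for a 60-square-metre apartment; the ensemble prediction is the median or the mean of those ten values.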

The other popular method you can find in IBM SPSS Statistics / PS IMAGO PRO is boosting or, more specifically, AdaBoost, formulated by Y. Freund and R. Schapire, who received the 2003 Gödel Prize for it.

This method uses two types of weights. The first one is the weight assigned to observations when the sample for the next model is drawn: observations receive higher weights if the previous model classified them incorrectly (in the case of classification problems) or predicted them with a larger error (for regression problems). The other type of weight is the model weight. Each model is assigned a weight that decreases as its prediction error grows. This way, less accurate models have less influence on the prediction of the ensemble model. IBM SPSS Statistics / PS IMAGO PRO uses the weighted median method to determine the prediction. As the name suggests, the goal of boosting is to improve the prediction accuracy of the model.
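The two kinds of weights and the weighted median can be illustrated with a short sketch. The article does not give the exact formulas used by IBM SPSS Statistics / PS IMAGO PRO, so the formulas below are an assumption: they follow the textbook AdaBoost style, where a factor beta = error / (1 - error) shrinks the weights of well-predicted observations and log(1 / beta) serves as the model weight. All names and numbers are hypothetical.

```python
import math


def reweight(weights, misclassified, error_rate):
    """Shrink the weights of correctly classified observations, then normalise.

    Assumption: AdaBoost-style factor beta = error / (1 - error), below 1
    whenever the model's error rate is under 0.5.
    """
    beta = error_rate / (1.0 - error_rate)
    updated = [w if wrong else w * beta
               for w, wrong in zip(weights, misclassified)]
    total = sum(updated)
    return [w / total for w in updated]


def model_weight(avg_loss):
    """Model weight log((1 - L) / L): the lower the loss, the larger the weight."""
    return math.log((1.0 - avg_loss) / avg_loss)


def weighted_median(values, weights):
    """Smallest value at which the cumulative weight reaches half the total.

    Assumes all weights are positive.
    """
    half = sum(weights) / 2.0
    cumulative = 0.0
    for value, weight in sorted(zip(values, weights)):
        cumulative += weight
        if cumulative >= half:
            return value
    raise ValueError("need at least one value")


# One misclassified observation out of four equally weighted ones.
new_weights = reweight([0.25] * 4, [True, False, False, False], error_rate=0.25)
print(new_weights[0] > new_weights[1])  # True

# Hypothetical predictions of three base models and their average losses.
predictions = [280.0, 300.0, 420.0]
losses = [0.30, 0.10, 0.45]
weights = [model_weight(loss) for loss in losses]
print(weighted_median(predictions, weights))  # 300.0
```

The least accurate model (loss 0.45) predicts 420, but its small weight keeps it from pulling the ensemble prediction away from the value supported by the more accurate models.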

Note that building and assessing an ensemble may take longer than building a single standard model.

Model aggregation in IBM SPSS Statistics/PS IMAGO PRO

Let's see how it is done in IBM SPSS Statistics/PS IMAGO PRO. Our dataset contains information about women's clothing sales volume (in USD thousands) from various partners (variable: Women clothing sales volume). For each partner, we also know the advertising budget in USD thousands (advertising budget), the number of sales hotlines (No. of sale hotlines), the printed catalogue volume (No. of catalogues), the number of pages in the catalogues (No. of catalogue pages), and the number of direct sales consultants (No. of consultants). We will use these variables as predictors and build a model to estimate the sales volume. We will take the multi-model approach. Go to Analyze > Regression > Automatic linear modeling.

Figure 1. Selection of variables for the model

Move the sales variable to Target and the other variables to Predictors, then go to the Build options tab.

The first item on the list, Objectives, is where you decide to use the multi-model approach. Here, you will find the two methods we discussed, boosting and bagging, as well as an option to build a regression model from large datasets using IBM SPSS Statistics Server.

Figure 2

Select Enhance model stability (bagging). Go to Ensembles.

Figure 3. Setup of ensemble methods

Here, you can control the options for ensemble methods: the prediction method (the default combining rule for continuous targets) and the number of base models to build. Select the median of the sales predicted by the base models and increase the number of models to 15.

You can now click Run. The first table in the report presents information on the built models. The table should have 16 rows which include 15 base models plus a reference model. In the case of bagging, the reference model is the model for the whole dataset, while for boosting it is the first base model.

Table 1. The first four rows of Analyzed data information

As mentioned above, each bootstrap aggregation model was built using the same number of observations. Double-click the result to take a closer look.

Figure 4. Model quality – comparison of the reference model and the ensemble

The first chart shows the model quality measure calculated for the ensemble and for the reference model. The ensemble has a slightly better result. The accuracy of the individual models can be found on the fourth chart in the viewer.

Figure 5. Accuracy of the ensemble model

Each point in the chart represents one base model, and the labels are model numbers. You can find out more about the models (such as the number of predictors used in each of them) in the next table, which is the penultimate one.

Figure 6. Details of the ensemble model – base models

The table contains a list of models that make up the ensemble model, their accuracy, and the number of predictors and model parameters. Have a look at the predictor frequency chart.

 

Figure 7. Frequency

As individual base models can have different predictors, the predictor frequency chart shows how the independent variables are distributed across the base models. Every point in the scatterplot represents one or more base models that contain the predictor.

The predictors on the y-axis are sorted in descending order of frequency. The chart tells you that the number of catalogue pages, the advertising budget, and the number of distributed catalogues are used in the largest number of models (in all of them). The most frequently used predictors are usually the most important ones.
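The idea behind the predictor frequency chart can be reproduced with a simple count. The sketch below is only an illustration: the predictor names come from the dataset described earlier, but the composition of the three base models is made up and does not reflect the actual results.

```python
from collections import Counter

# Hypothetical predictor sets of three base models (not the real output).
base_models = [
    ("No. of catalogue pages", "advertising budget", "No. of catalogues"),
    ("No. of catalogue pages", "advertising budget", "No. of catalogues",
     "No. of consultants"),
    ("No. of catalogue pages", "advertising budget", "No. of catalogues",
     "No. of sale hotlines", "No. of consultants"),
]

# Count in how many base models each predictor appears.
frequency = Counter(p for model in base_models for p in model)

# List the predictors from the most to the least frequently used.
for predictor, count in frequency.most_common():
    print(f"{predictor}: used in {count} of {len(base_models)} models")
```

Predictors that appear in every base model end up at the top of the list, just as they appear at the top of the chart.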

To sum up, the multi-model methods we discussed are both advocated and criticised. They yield the expected result if you can build a decent or good individual model. An ensemble model can then improve the stability or accuracy of prediction.

