Ensemble model methods gained popularity in the 1990s. The main idea is to combine several different models, each with its own prediction error, so that the combined error is reduced. This improves model stability and prediction accuracy.
Ensemble models are used both for regression problems (estimation of a quantitative variable) and for classification problems. They are particularly popular with classification trees and neural networks. Models are aggregated using an aggregation function: for regression models, it is often the average of the base models' predictions; for classification problems, it is often a classification vote.
For example, you build ten customer classification models to segment customers into those interested in purchasing a product and those who are not. The final classification is the result of a majority vote of all the models. How do the base models differ? They may be built on different datasets (samples) or use different sets of variables. Random samples may or may not be disjoint, depending on the specific method. For classification and regression trees, the random forest algorithm is a popular multi-model approach that combines random sampling of observations with random selection of predictors. The random forest algorithm, as well as bagging and AdaBoost discussed below, are covered in detail in the post by Ewa Grabowska (How to boost a decision tree algorithm).
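The two aggregation functions mentioned above can be sketched in a few lines of Python (a minimal illustration of the idea, not SPSS code; the function names and sample values are hypothetical):

```python
from collections import Counter
from statistics import mean

def aggregate_classification(votes):
    """Majority vote over the base models' class predictions."""
    return Counter(votes).most_common(1)[0][0]

def aggregate_regression(predictions):
    """Average the base models' numeric predictions."""
    return mean(predictions)

# Ten hypothetical base models classify one customer:
votes = ["buyer", "non-buyer", "buyer", "buyer", "non-buyer",
         "buyer", "buyer", "non-buyer", "buyer", "buyer"]
print(aggregate_classification(votes))  # buyer (7 of 10 votes)

# Five hypothetical base models estimate a price (USD thousands):
prices = [201.5, 198.0, 205.2, 199.8, 202.1]
print(round(aggregate_regression(prices), 2))  # 201.32
```

The aggregation step is deliberately simple; the diversity of the base models is what makes the ensemble useful.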
Model aggregation methods
Model aggregation methods may have different architectures: parallel or serial. In a parallel ensemble, consecutive random samples (sets) and models are independent of each other. In serial ensembles, successive steps make use of feedback from the model built in the previous step.
There are many model aggregation methods. The most popular and simplest one is bagging, short for bootstrap aggregating. It involves generating base models from training samples of size n, each drawn with replacement from the original training set (also of size n). IBM SPSS Statistics / PS IMAGO PRO always samples all observations, but each observation is assigned a simulated weight determined using the binomial distribution. In an aggregate model built with bagging, the prediction is based on the mean or median. For example, when building an apartment valuation model using bagging, you build 10 consecutive regression models on diverse simulated samples. Each model predicts a slightly different price for an apartment with specific parameters. The averaged result (or the median price) is the result of the ensemble model.
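The bagging procedure can be sketched as follows (a simplified Python illustration of bootstrap sampling plus aggregation; the toy "model" that predicts the sample mean and all names are hypothetical, and this is not how SPSS implements it internally):

```python
import random
from statistics import mean, median

def bootstrap_sample(data, rng):
    """Draw n observations with replacement from a dataset of size n."""
    return [rng.choice(data) for _ in data]

def bagging_predict(data, fit, x, n_models=10, use_median=False, seed=0):
    """Fit a base model on each bootstrap sample and aggregate predictions."""
    rng = random.Random(seed)
    preds = [fit(bootstrap_sample(data, rng))(x) for _ in range(n_models)]
    return median(preds) if use_median else mean(preds)

# Toy base "model": predict the mean apartment price seen in its sample.
def fit_mean_model(sample):
    m = mean(price for _, price in sample)
    return lambda x: m

# (area in m2, price in USD thousands) -- made-up data
data = [(50, 300), (65, 380), (80, 450), (95, 520), (120, 640)]
print(bagging_predict(data, fit_mean_model, x=70, n_models=10))
```

Each bootstrap sample omits some observations and repeats others, so each base model sees a slightly different dataset, which is what gives the ensemble its stability.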
The other popular method available in IBM SPSS Statistics / PS IMAGO PRO is boosting; more specifically, AdaBoost, formulated by Y. Freund and R. Schapire, who received the 2003 Gödel Prize for this work.
This method uses two types of weights. The first is the weight assigned to observations sampled for the next model: observations are given higher weights if the previous model classified them incorrectly (for classification problems) or predicted them with a larger error (for regression problems). The second type is the model weight. Each model is assigned a weight reflecting its prediction accuracy, so less accurate models have less influence on the ensemble prediction. IBM SPSS Statistics / PS IMAGO PRO uses the weighted median method to determine the prediction. As the name suggests, the goal of boosting is to improve prediction accuracy.
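The reweighting idea behind boosting can be sketched as below (one step in the style of the AdaBoost.R2 regression variant; this is an illustrative assumption about the mechanism, not the exact formula used by IBM SPSS Statistics / PS IMAGO PRO):

```python
from math import log

def reweight(weights, losses):
    """One AdaBoost.R2-style step: raise the relative weight of poorly
    predicted observations so the next sample focuses on them.

    losses are per-observation prediction losses scaled to [0, 1].
    Returns the new observation weights and the model's own weight.
    """
    lbar = sum(w * l for w, l in zip(weights, losses)) / sum(weights)
    beta = lbar / (1 - lbar)               # small beta => accurate model
    new = [w * beta ** (1 - l) for w, l in zip(weights, losses)]
    total = sum(new)
    new = [w / total for w in new]         # normalize to sum to 1
    model_weight = log(1 / beta)           # accurate models get more say
    return new, model_weight

weights = [0.25, 0.25, 0.25, 0.25]
losses  = [0.1, 0.8, 0.2, 0.1]   # observation 2 was predicted poorly
new_w, mw = reweight(weights, losses)
print(new_w)  # observation 2 now carries the largest weight
```

Well-predicted observations (loss near 0) are multiplied by beta, while poorly predicted ones (loss near 1) keep their weight, so after normalization the hard cases gain influence in the next sample.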
Note that building and assessing ensembles may take longer than building a standard model.
Model aggregation in IBM SPSS Statistics/PS IMAGO PRO
Let's see how it is done in IBM SPSS Statistics / PS IMAGO PRO. Our dataset contains information about women's clothing sales volume (in USD thousands) for various partners (variable: Women clothing sales volume). Each partner has a different advertising budget (advertising budget) in USD thousands, number of sales hotlines (No. of sale hotlines), printed catalogue volume (No. of catalogues), number of pages in the catalogues (No. of catalogue pages), and number of direct sales consultants (No. of consultants). We will use these variables as predictors and build a model estimating the sales volume, taking the multi-model approach. Go to Analyze > Regression > Automatic linear modeling.
Move the sales variable to Target and the other variables to Predictors, then go to the Build options tab.
The first item on the list, Objectives, is where you decide to use the multi-model approach. Here, you will find the two methods we discussed, boosting and bagging, as well as an option to build a regression model from large datasets using IBM SPSS Statistics Server.
Select Enhance model stability (bagging). Go to Ensembles.
Here, you can control the options for ensemble methods: the default combining rule for a continuous target and the number of ensemble models to be built. Select the median of the sales predicted by the base models and increase the number of models to 15.
You can now click Run. The first table in the report presents information on the built models. It should have 16 rows: 15 base models plus a reference model. In the case of bagging, the reference model is the model built on the whole dataset, while for boosting it is the first base model.
As was mentioned above, each bootstrap aggregation model was built using the same number of observations. Double-click the result to take a look.
Figure 4. Model quality – comparison of the reference model and the ensemble
The first chart shows the model quality measure calculated for the ensemble and the reference model. The ensemble has a slightly better result. The accuracy of individual models is shown on the fourth chart in the viewer.
Individual points in the chart are models. The labels are model numbers. You can find out more about the models in the next, penultimate, table (such as the number of predictors used in each model).
The table contains a list of models that make up the ensemble model, their accuracy, and the number of predictors and model parameters. Have a look at the predictor frequency chart.
As individual base models can have different predictors, the predictor frequency chart shows the distribution of independent variables in each model. Every point in the scatterplot represents one (or more) base model(s) that contain the predictor.
Predictors on the y-axis are sorted in descending order of frequency. The chart tells you that the number of catalogue pages, the advertising budget, and the number of distributed catalogues are used in the largest number of models (all of them). The most popular predictors are usually also the most important.
To sum up, the multi-model methods we discussed are both advocated and criticised. They yield the expected result if you can build a decent or good individual model. An ensemble model can then improve the stability or accuracy of prediction.