LINEAR (in the IBM SPSS Statistics command language) is the little sister of REGRESSION. One of the key differences between them is the variable selection method, which will be the focus here. Other differences are discussed in the first post in the series, Swift predictive analysis.
If you have a set of variables you selected or prepared and believe they influence the predicted variable, you may use automatic variable selection to build the model. Automatic linear modeling offers two types of variable selection: Forward stepwise and Best subsets. If you don’t want to skip any predictor variable, you can always force the use of all predictors you selected.
The Forward stepwise method
The first method, Forward stepwise, is also available in the traditional regression procedure. The idea behind the stepwise method is to add another predictor to the model if doing so improves its goodness of fit. For example, if your model with one predictor allows you to ‘understand’ 40% of the variability of the predicted variable, 60% remains to be accounted for. The Forward stepwise method looks for another predictor to explain part of the remaining 60%.
The algorithm can be generally presented as:
Step 1: The null model, initially the mean model (an intercept-only model that predicts the mean of the target for every case), is assessed with the chosen goodness-of-fit criterion.
Step 2: Starting from the step 1 model, candidate models are built by adding one potential predictor at a time, and the goodness-of-fit measure is calculated for each. The predictor whose addition improves on the step 1 result the most is added to the model. If no predictor improves the result, the algorithm stops, and the step 1 model is the optimum model.
Step 3: After the model is expanded, the least ‘useful’ predictor included in the model is tested and can be removed. After this adjustment, the algorithm returns to step 1 with the expanded model as the null model.
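The three steps above can be sketched in a few lines of Python. This is only a minimal illustration of the loop, not the SPSS implementation: the function name forward_stepwise, the score callable (lower is better, as with AICC), and the toy numbers are all assumptions made for the example.

```python
from typing import Callable, FrozenSet, Iterable, Optional, Tuple


def forward_stepwise(
    candidates: Iterable[str],
    score: Callable[[FrozenSet[str]], float],
    max_steps: Optional[int] = None,
) -> Tuple[FrozenSet[str], float]:
    """Greedy forward stepwise search; a lower score is better (as with AICC)."""
    pool = list(candidates)
    if max_steps is None:
        max_steps = 3 * len(pool)  # default step limit quoted in the article
    selected: FrozenSet[str] = frozenset()
    best = score(selected)  # Step 1: assess the null (mean) model
    steps = 0
    while steps < max_steps:
        # Step 2: score every model with one more predictor added.
        trials = {c: score(selected | {c}) for c in pool if c not in selected}
        if not trials:
            break  # nothing left to add
        add = min(trials, key=trials.get)
        if trials[add] >= best:
            break  # no improvement: the current model is final
        selected, best = selected | {add}, trials[add]
        steps += 1
        # Step 3: check whether dropping an earlier predictor now helps.
        drops = {c: score(selected - {c}) for c in selected if c != add}
        if drops:
            drop = min(drops, key=drops.get)
            if drops[drop] < best:
                selected, best = selected - {drop}, drops[drop]
                steps += 1
    return selected, best


# Made-up numbers for illustration only: how much each predictor lowers the score.
TOY_GAIN = {"hp": 30, "length": 10, "kerb_weight": 5, "mileage": 1}


def toy_score(subset: FrozenSet[str]) -> float:
    """Hypothetical criterion: fit improves with gain, each predictor costs 2."""
    return 100 - sum(TOY_GAIN[c] for c in subset) + 2 * len(subset)


chosen, value = forward_stepwise(TOY_GAIN, toy_score)
print(sorted(chosen), value)
```

With these toy numbers the search adds hp, length, and kerb_weight and then stops, because the gain from mileage (1) does not cover the penalty for an extra predictor (2). In a real run, score would fit a regression on the given subset and return the chosen criterion.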
The evaluation criterion can be based on one of four measures. Traditionally, the model (or, more precisely, the significance of all predictors) is evaluated with the F-test. If you use the F statistic as the criterion, the effect with the smallest p-value is added to the model at each step. With the F-test, you can set the significance thresholds below which a predictor is included in the model (the default value is 0.05) or above which it is removed (the default value is 0.1).
To avoid relying on the F-distribution assumption, you may use a criterion based on a different measure: adjusted R², the corrected Akaike information criterion (AICC), or the overfit prevention criterion, the average squared error (ASE).
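To make these three measures more concrete, here is a rough Python sketch using their common textbook definitions. The function names and argument conventions are assumptions for this example, and the AICC form drops additive constants, so its values will not necessarily match the numbers reported by SPSS. Only comparisons between models fitted to the same data are meaningful.

```python
import math
from typing import Sequence


def adjusted_r2(sse: float, sst: float, n: int, p: int) -> float:
    """Adjusted R-squared; p is the number of predictors, excluding the intercept."""
    return 1.0 - (sse / (n - p - 1)) / (sst / (n - 1))


def aicc(sse: float, n: int, p: int) -> float:
    """Small-sample corrected AIC for a Gaussian linear model (constants dropped).

    k counts the estimated parameters: intercept, p slopes, and error variance.
    Lower values indicate a better trade-off between fit and complexity.
    """
    k = p + 2
    aic = n * math.log(sse / n) + 2 * k
    return aic + 2 * k * (k + 1) / (n - k - 1)


def ase(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Average squared error, meant to be computed on held-out cases."""
    return sum((a - b) ** 2 for a, b in zip(actual, predicted)) / len(actual)


# Illustrative numbers only: same fit (SSE), different numbers of predictors.
print(adjusted_r2(sse=40.0, sst=100.0, n=21, p=4))
print(aicc(sse=50.0, n=30, p=2) < aicc(sse=50.0, n=30, p=5))
```

The second print shows the penalty at work: with an identical sum of squared errors, the model with fewer predictors gets the lower (better) AICC.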
For large datasets, you can restrict the stepwise search by limiting the maximum number of predictors in the final model or the maximum number of steps in the algorithm. By default, the number of steps is 3 times the number of starting predictors.
The Best subsets method
The Best subsets method tests all possible combinations of available predictors if there are fewer than 20 of them. Otherwise, a hybrid stepwise-best-subsets method is used. As a result, the method tests all possible models, or at least a larger subset of possible models than the Forward stepwise method.
The number of models to be tested grows exponentially with the number of predictors. Hence the method requires substantial computing power, and the process takes longer than the stepwise method. To handle the correlation matrices efficiently, the sequence of tested models follows the Schatzoff (1968) algorithm. You can learn more about that in the IBM SPSS Statistics / PS IMAGO PRO documentation.
The model structure can be evaluated using three statistics: AICC, adjusted R², and the overfit prevention criterion ASE.
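To get a feel for the exponential growth, the exhaustive search can be sketched in Python as follows. This is a naive enumeration for illustration only. It does not use the Schatzoff ordering or the hybrid method, and the names all_subsets, best_subsets, and the toy score are assumptions made for the example.

```python
from itertools import combinations
from typing import Callable, Iterable, Iterator, List, Sequence, Tuple


def all_subsets(predictors: Sequence[str]) -> Iterator[Tuple[str, ...]]:
    """Yield every subset of the predictors, from the empty (mean) model up."""
    for size in range(len(predictors) + 1):
        yield from combinations(predictors, size)


def best_subsets(
    predictors: Iterable[str],
    score: Callable[[Tuple[str, ...]], float],
    top: int = 10,
) -> List[Tuple[str, ...]]:
    """Score every subset and return the `top` ones; lower score is better."""
    return sorted(all_subsets(list(predictors)), key=score)[:top]


# p predictors give 2**p candidate models, the empty model included.
for p in (4, 9, 19):
    print(p, 2 ** p)

# Made-up numbers for illustration only: how much each predictor lowers the score.
TOY_GAIN = {"hp": 30, "length": 10, "kerb_weight": 5, "mileage": 1}


def toy_score(subset: Tuple[str, ...]) -> float:
    """Hypothetical criterion: fit improves with gain, each predictor costs 2."""
    return 100 - sum(TOY_GAIN[c] for c in subset) + 2 * len(subset)


for subset in best_subsets(TOY_GAIN, toy_score, top=3):
    print(subset, toy_score(subset))
```

With 9 predictors there are 512 candidate models, and with 19 there are already 524,288, which is why an exhaustive search becomes costly so quickly.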
Now, to action
Let's put both methods to use. We will build a regression model of car price. We have a dataset with information on 155 car models. Our predicted variable will be the price, while engine type, engine displacement, horsepower, wheelbase, car width, car length, vehicle kerb weight, fuel tank volume, and mileage are potential predictors. The dataset also contains two string variables, make and model, which we will not use.
Let's navigate to the Automatic linear modeling wizard in Analyze > Regression > Automatic linear modeling. In the Fields tab, move type, engine_displ, HP, wheel_base, width, length, kerb_weight, fuel_tank, and mileage to the list of predictors, and price to the Target field.
Figure 1. Selection of variables for the model
Then go to Build options. Select Model Selection on the left-hand side. Here, you will find everything we discussed above. First, we will test the Forward stepwise method. Select it in the Model selection method list. Leave Information Criterion (AICC) as the entry/removal criterion. As the dataset is not large, the Maximum number of effects in the final model and Maximum number of steps can be left as they are (unticked checkboxes).
Leave all the other options as default, click Run, and proceed to the report. Remember to double-click a result object in the report to see the whole model in the model viewer. You can navigate the report by selecting the result object on the left-hand side.
The table contains information about the chosen selection method and the criterion value (here, AICC) for the ‘winning’ model. The graph shows the adjusted R² as a percentage. The model is based on four variables: horsepower, length of the vehicle, the kerb weight of the vehicle, and engine displacement. All the predictors and their impact on the predicted values are shown on the predictor importance chart.
The predictors and regression coefficients can be viewed in two visualisations. The first one is the Effects chart.
If more predictors are used, you can control the displayed number. The thickness of the line represents the significance of the predictor. You can toggle between the diagram and table view in this window. Table view shows the traditional ANOVA table for the model.
The other interesting visualisation is the diagram of coefficients.
Just like in the Effects diagram, the width of each line depends on the significance of the parameter. In addition, blue lines mark variables with a positive impact on the price, while orange lines mark a negative impact.
Let's go back to the predictor selection methods. The last table in the viewer summarises the model building. As you selected the Forward stepwise method, successive steps are described there.
The best model in step 1 predicted price from horsepower alone, with an AICC of 636.305. In the second step, with car length added, the model’s goodness of fit improved: AICC dropped to 609.747. Finally, the best model turned out to be a model with four predictors built in four steps.
Let's open the Automatic linear modeling window again and build a model based on the Best subsets method.
The criterion (this time selected at the bottom of the window) can be left the same as for the stepwise method, namely Information Criterion (AICC). Click Run and proceed to the report.
The Best subsets method yielded slightly better goodness of fit. It also used more predictors in the model.
This time, the model predicts the price of the car using horsepower, length, kerb weight, mileage, and engine displacement. Proceed to the model building summary table.
The table lists the ten subsets with the lowest AICC values. The model selected earlier by the Forward stepwise method came in third.
Assuming the right potential predictors were selected, so that the model makes substantive sense, the method that tested more combinations of variables produced a model with better goodness of fit. Using both techniques lets you narrow the potential models down to a group that is satisfactorily accurate by the statistical criterion and ready to be tested for business usefulness. To sum up, the procedure lets you select predictors with the Forward stepwise method under various statistical criteria, or with the Best subsets method. In the next post, we will look into the building of ensemble models.