This procedure is intended to make life easier for those who work on large datasets and want to use regression models. Automatic linear models lack many advanced settings and options for model exploration that can be found in other regression procedures. On the other hand, as with other procedures of this kind, it speeds up and streamlines data processing.
What are the differences?
Regression analysis in PS IMAGO PRO traditionally employs the Linear Regression procedure found in SPSS Statistics (REGRESSION). However, from version 19 onwards, SPSS Statistics has an additional procedure, namely LINEAR. This procedure is used widely in predictive analyses and will be the focus of this and subsequent blog posts. Both regression methods have pros and cons, advocates and opponents. Today, we will focus on the potential benefits of using the LINEAR procedure and on how to build a simple model and use its results.
The traditional regression procedure offers numerous techniques for choosing variables for the model. These techniques belong to the family of progressive methods (such as stepwise, forward selection, or backward elimination). The variables are selected automatically using statistical criteria, namely a sequence of t or F tests. The LINEAR procedure additionally provides a method referred to as all possible subsets.
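To give a feel for what "all possible subsets" means, here is a minimal Python sketch that simply enumerates the candidate predictor sets. It is not how the LINEAR procedure is implemented; the function name is hypothetical, and the predictor names are the three used in the example later in this post.

```python
from itertools import combinations


def all_subsets(predictors):
    """Return every non-empty subset of the predictors, smallest first."""
    subsets = []
    for size in range(1, len(predictors) + 1):
        subsets.extend(combinations(predictors, size))
    return subsets


# Three predictors give 2**3 - 1 = 7 candidate models to compare.
candidates = all_subsets(["advertising", "radio", "score"])
print(len(candidates))
```

The number of candidate models doubles with every added predictor, which is why this method is practical only for a moderate number of variables.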
The traditional procedure (REGRESSION) enables the analyst to carry out an in-depth analysis of outliers and influential observations. Statistics such as Cook's distance or DFBETAS can be saved to the dataset, something which is not possible when a model is generated automatically using the LINEAR procedure. Instead, such observations are handled when the model is being built, i.e., the application automatically decides which observations should be considered outliers.
The third feature of the LINEAR procedure is the option to build so-called ensemble models, for example, through bagging or boosting.
Its fourth advantage is the ability to adapt it to process large datasets. This, however, requires PS IMAGO PRO in a client/server setup.
What's in the box?
First, we will have a look at the interface of the Automatic linear modeling window and its result objects. We will also look at options for automatic preparation of data available for the procedure. Then, we will focus on the methods for selecting predictors and on building ensemble models.
By way of illustration, we will build a sample regression model to verify whether or not there is a linear relationship between the sales of music CDs and such features as:
- audience artist score
- advertising spending
- number of radio plays
The procedure can be found in the menu under Analyze > Regression > Automatic Linear Modeling. Go to the Fields tab and select variables for the analysis.
Move the variable Sales to the Target field and the three remaining variables (advertising, radio, and score) to the list of Predictors. Readers who use IBM SPSS Modeler / PS CLEMENTINE PRO will notice the window looks familiar. Another similarity is the possibility to declare a role for a variable (Data Editor > Variable View > Role). If you declare a role as Input or Target, Automatic linear modeling will automatically place the variables into the relevant fields.
Go to the Build Options tab. The list on the left-hand side shows a number of option groups. Today, we will focus on Basics, which is where you can enable automatic data preparation, as shown in Fig. 2. Most of the transformations improve the predictive power of the model. If you enable this option, the model will be built from processed values, not the original variables. The transformations used are saved with the model. The transformations applied when this option is enabled are:
- Date and time handling – date and time predictors will be converted into a duration, a number, for example, of months from today.
- Adjustment of measurement level – variables declared as quantitative that have fewer than five unique values will be treated as ordinal variables. Ordinal variables with more than ten categories will be treated as quantitative variables.
- Outlier handling – values not within +/- three standard deviations from the mean are considered outliers.
- Missing value handling – missing values of qualitative variables are replaced with the mode for a nominal scale and the median for an ordinal scale. Missing values in quantitative variables are replaced with the mean.
- Supervised merging – before qualitative variables are handled in the model, the system verifies whether or not it is important for predicting the target variable to retain information on all categories in a variable. If the predictor is a qualitative variable (such as education), you can check whether the identified categories are correct. Do the categories of education differentiate earnings well? Too many detailed categories can make it more difficult to identify general dependencies. In the case of regression models, qualitative variables are converted into a set of Boolean variables before they are used. By using variables with fewer categories, you simplify and generalise the model. Variables whose categories do not differentiate the predicted variable are not used in the model.
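Three of the rules listed above (measurement level adjustment, outlier detection, and missing value replacement) can be sketched in a few lines of Python. This is only an illustration of the rules as described here, not the code used by the procedure. The function names are hypothetical, and the use of the sample standard deviation is an assumption.

```python
from statistics import mean, median, mode, stdev


def adjust_level(values, level):
    """Apply the measurement level rule to a declared level."""
    distinct = len({v for v in values if v is not None})
    if level == "continuous" and distinct < 5:
        return "ordinal"
    if level == "ordinal" and distinct > 10:
        return "continuous"
    return level


def flag_outliers(values, cutoff=3.0):
    """Flag values more than `cutoff` standard deviations from the mean."""
    m, s = mean(values), stdev(values)  # sample SD: an assumption
    return [abs(v - m) > cutoff * s for v in values]


def impute(values, level):
    """Replace None with the mode (nominal), median (ordinal) or mean."""
    observed = [v for v in values if v is not None]
    fill = {"nominal": mode, "ordinal": median, "continuous": mean}[level](observed)
    return [fill if v is None else v for v in values]


# Made-up data for illustration only.
print(adjust_level([1, 2, 3, 1, 2], "continuous"))
print(flag_outliers([10] * 19 + [100]))
print(impute([10.0, None, 20.0], "continuous"))
```

The sketch only flags outliers; what happens to a flagged value afterwards is left out.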
You can additionally set the confidence level at which the interval estimation of the model's parameters will be carried out. It usually ranges from 0.90 to 0.99.
Go to the Model Options tab. Here, you can make decisions regarding the saving of the model. If you want the predicted sales volume to be included in the dataset, you need to enable the first option, which is disabled by default: Save predicted values to the dataset, and select the name for the predicted value variable.
You can also export the model to an .xml file. This way, you can use the model to score data using, for example, the Scoring Wizard (Utilities > Scoring Wizard). After you click Run, the resulting report will contain two result objects: a standard table with information on the number of records used in the model, and a model summary.
The model summary
The summary shows general information about the model and the adjusted R² as a percentage bar chart. We get 65.2%, which may be satisfactory depending on the field.
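As a reminder, the adjusted coefficient of determination corrects the ordinary one for the number of predictors. Below is a minimal Python sketch of the standard textbook formula. The function name is hypothetical, and the raw value and sample size in the example are made up for illustration; they are not taken from the CD sales model.

```python
def adjusted_r2(r2, n, p):
    """Adjusted R-squared for n observations and p predictors (plus intercept)."""
    if n - p - 1 <= 0:
        raise ValueError("need more observations than parameters")
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)


# Hypothetical values: raw R-squared of 0.5, 11 records, 2 predictors.
print(adjusted_r2(0.5, 11, 2))
```

The penalty grows with the number of predictors, so adding a variable that explains nothing lowers the adjusted value even though the raw value cannot fall.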
By double-clicking on the report, you open a model viewer window where you can see other results. You navigate the report by clicking on the objects on the left-hand side. We have already seen the first object, so let's see what is further down.
The second object shows a summary of automatic data preparation. All three predictors were accepted by the outlier detectors and were used to build the model. The next object shows a predictor importance graph.
Predictor importance is the measure of the influence of the variable on the predicted values (not the accuracy of prediction). The sum of importance values of all predictors is 1. This model is dominated mainly by advertising budget (0.48) and the number of plays on the radio (0.47). The artist score is less important for the prediction (0.05).
Other graphic result objects
Additional results are shown with three other forms of visualisation. The first is the scatter plot of predicted sales against actual sales. If the model predicted the value of each observation perfectly, the points would lie on a 45-degree line. Among the diagnostic graphs, there are two for verifying the normality of the residual distribution. You can select a histogram of (Studentised) residuals with a fitted normal distribution curve or a P-P plot for the distribution. As we use training data, both graphs confirm a normal distribution, which is not always the case.
The list of outliers contains the IDs of individual records that strongly affect the model and were considered outliers by the algorithm. If we assigned the role of record ID to a variable, for example, album name (Data Editor > Variable View > Role), it would be used in the Record ID column. A high value of Cook's distance indicates that removing the album from the analysis could change the model parameters substantially. The criterion for considering an observation a strong influencer is Fox's rule of thumb, which says that Cook's distance should not exceed 4/(N - pc), where N is the sample size and pc is the number of model parameters.
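The rule of thumb is easy to apply by hand. The Python sketch below computes the cut-off and picks out the records that exceed it. The function names, album names, and Cook's distance values are all hypothetical; in practice the distances would come from the fitted model.

```python
def cooks_cutoff(n, pc):
    """Fox's rule of thumb: 4 / (N - pc)."""
    if n <= pc:
        raise ValueError("sample size must exceed the number of parameters")
    return 4 / (n - pc)


def influential(records, n, pc):
    """Return IDs of records whose Cook's distance exceeds the cut-off."""
    cutoff = cooks_cutoff(n, pc)
    return [record_id for record_id, distance in records if distance > cutoff]


# Made-up example: 104 albums, 4 parameters (intercept and 3 predictors),
# so the cut-off is 4 / 100.
sample = [("Album A", 0.01), ("Album B", 0.09), ("Album C", 0.04)]
print(influential(sample, n=104, pc=4))
```

Because the cut-off shrinks as the sample grows, a distance that looks harmless in a small dataset can still be flagged in a large one.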
Although automatic procedures are controversial, they can be useful. This approach is usually justifiable for large datasets where automatic procedures facilitate using the computing power for searching and preliminary exploration of data. In future posts, we will look more into model building and the options you can change when building models using Automatic linear modeling.