Auto Classification – automatic model selection for data in PS CLEMENTINE PRO

Reading time: 6 minutes

When working with data, the analyst is often faced with the challenge of selecting appropriate statistical tests to provide valuable answers to the research questions posed. PS CLEMENTINE PRO can provide a solution to this problem. This tool offers a wide range of modelling methods that are based on statistics, artificial intelligence and machine learning. Thanks to advanced procedures, users can efficiently process data and create precise predictive models that are tailored to specific problems and analytical challenges.

Statistical models can be divided into two main categories: supervised and unsupervised. Supervised models are a type of statistical modelling that uses inputs to predict specific outcomes, such as the value of an output variable. Examples of statistical techniques used in supervised modelling include C&RT, QUEST, CHAID decision trees, various forms of regression (linear, logistic, generalised linear or Cox regression), neural networks, SVM algorithms and Bayesian networks, among others. These models are particularly helpful in predicting specific outcomes, such as the decision to abandon a purchase or the identification of a fraudulent transaction. 

To make this easier to understand, let's look at an example. A data analyst, starting work on a new dataset, is tasked with solving a specific business problem, e.g. customers cancelling the renewal of a service contract. To do this, he or she wants to build a classification model that assigns individual customers to the ‘will renew contract’ or ‘will cancel’ categories. Besides properly checking and preparing the data, the analyst faces the question of which type of model to choose for the task and which will perform best.

To facilitate the selection of an appropriate model, the Auto Classification node in PS CLEMENTINE PRO can be used to create and compare different models in terms of classification. The analyst can choose the optimal method of analysis from the available modelling algorithms. This node generates a set of models based on the specified options and then creates a ranking according to the selected criteria, which facilitates accurate analytical decisions.

Parameterisation of models for comparison

What does the Auto Classification node offer, and what are the different options used for? To begin with, in the Variables tab of the Auto Classification node, the analyst can customise which variables are used to generate the models. By default, the pre-defined variable roles are used. The user can also assign the variables manually, specifying which one is the target (output) variable and which are the predictors (inputs).

Next, the Model tab allows the user to define how many models are to be generated in the analysis process. The analyst can also define the criteria against which the models will be compared, allowing the most appropriate methods to be selected to solve a given problem.

Model name – the ability to automatically generate a model name based on the target or identification (ID) variable. If no such variables are specified, the name can be based on the model type. The user can also give the model a custom name, making it easier to organise and identify the results.

Use split-subset data – if a split-subset (partitioning) variable has been defined in the data, this option builds the model using only the records from the training subset. The remaining, held-out subset can then provide an unbiased estimate of how the model performs on data it has not seen.

Cross-validation – cross-validation repeatedly splits the data into a set on which the model is trained (training set) and a set of new, previously unseen data on which it is tested (test or validation set). Its main purpose is to assess the model's ability to predict outcomes for data that was not used when the model was built, which helps to detect problems such as overfitting. The analyst can choose the number of subsets to be created and, to keep the subset assignment reproducible, can set a starting value (seed) for the random draw.
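PS CLEMENTINE PRO handles this partitioning internally; the idea behind the option can be sketched in a few lines of Python. The function names and the scoring callback below are illustrative, not part of the tool:

```python
import random

def kfold_indices(n_records, n_folds=5, seed=42):
    """Assign each record to one of n_folds subsets; the fixed seed
    makes the assignment reproducible across runs."""
    rng = random.Random(seed)
    indices = list(range(n_records))
    rng.shuffle(indices)
    # Deal the shuffled indices round-robin into folds of (nearly) equal size.
    return [indices[i::n_folds] for i in range(n_folds)]

def cross_validate(records, labels, train_and_score, n_folds=5, seed=42):
    """Train on n_folds-1 subsets, evaluate on the held-out one, and
    average the scores; a large gap between training and held-out
    performance is a hint of overfitting."""
    folds = kfold_indices(len(records), n_folds, seed)
    scores = []
    for k in range(n_folds):
        test_idx = set(folds[k])
        train = [(records[i], labels[i]) for i in range(len(records)) if i not in test_idx]
        test = [(records[i], labels[i]) for i in sorted(test_idx)]
        scores.append(train_and_score(train, test))
    return sum(scores) / n_folds
```

The same seed always yields the same fold assignment, which is exactly what the node's starting value for the draw is for.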

Build models for each split – this function allows you to create separate models for each value of the input variables that you have specified as split variables. This is particularly useful in situations where different data segments require different modelling approaches. 

Rank models by – this option allows you to specify the criterion used to compare and rank the models. Choices include overall accuracy, area under the ROC curve, profit, lift and the number of variables. It is worth noting that all of these measures appear in the summary report, regardless of which one has been selected as the main criterion.
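Conceptually, the node keeps every metric for every model and simply orders the list by the chosen one. A minimal Python sketch (the metric values and model names here are made up for illustration):

```python
# Hypothetical per-model metrics, as they might appear in the summary report.
models = [
    {"name": "CHAID",    "accuracy": 0.86, "auc": 0.90, "n_variables": 12},
    {"name": "C&RT",     "accuracy": 0.84, "auc": 0.91, "n_variables": 8},
    {"name": "Logistic", "accuracy": 0.88, "auc": 0.89, "n_variables": 15},
]

def rank_models(models, criterion, descending=True):
    """Return the models ordered by the chosen criterion; every metric
    stays attached to each entry, mirroring the node's summary report."""
    return sorted(models, key=lambda m: m[criterion], reverse=descending)

by_auc = rank_models(models, "auc")
# Fewer variables is better, so that criterion sorts ascending.
by_size = rank_models(models, "n_variables", descending=False)
```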

Model ranking – the analyst can specify which data set the model ranking is to be based on: the training set or the test set. This is particularly useful when working with large datasets, where using a subset of the data to pre-screen the models can significantly increase the efficiency of the modelling process and identify the most promising models more quickly.

Number of models to include – sets the maximum number of models to be included in the final ensemble model generated by the node. The models are ranked according to the chosen criteria and the highest-ranked ones are selected automatically. Note, however, that increasing the number of models considered may slow down processing; the maximum allowed is 100.
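One common way such an ensemble combines its members is by majority vote among the top-ranked models. The sketch below illustrates that idea only; the actual combining methods in PS CLEMENTINE PRO are configurable and not shown here:

```python
from collections import Counter

def ensemble_predict(models, record, n_best=3):
    """Majority vote among the n_best highest-ranked models.
    Each model is a (rank_score, predict_fn) pair; the label with
    the most votes wins."""
    top = sorted(models, key=lambda m: m[0], reverse=True)[:n_best]
    votes = Counter(predict(record) for _, predict in top)
    return votes.most_common(1)[0][0]
```

With four ranked models voting ‘renew’ or ‘cancel’, only the three best take part, and the majority label among them is returned.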

Calculating predictor importance – for models that provide a measure of predictor importance, the user can display a table showing the relative contribution of each predictor to the model. Predictor importance is crucial because it lets the user focus on the most relevant variables and omit the less relevant ones, which can simplify the model. Note, however, that calculating predictor importance may increase the time needed to process the models, so it is recommended when the analysis focuses on a smaller number of models that warrant closer examination.
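One generic, model-agnostic way to estimate predictor validity (importance) is permutation importance: shuffle one predictor's values and measure how much the model's score drops. This is a sketch of the general technique, not necessarily the exact method the tool uses:

```python
import random

def permutation_importance(model, X, y, metric, n_repeats=5, seed=42):
    """Estimate each predictor's importance as the average drop in the
    model's score when that predictor's column is shuffled."""
    rng = random.Random(seed)
    baseline = metric(y, [model(row) for row in X])
    importances = {}
    for col in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            shuffled = [row[col] for row in X]
            rng.shuffle(shuffled)
            X_perm = [row[:col] + [v] + row[col + 1:]
                      for row, v in zip(X, shuffled)]
            drops.append(baseline - metric(y, [model(row) for row in X_perm]))
        importances[col] = sum(drops) / n_repeats
    return importances
```

A predictor the model ignores scores an importance of zero, which is exactly the kind of variable the table helps you drop.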

Profit criteria – these options allow for more detailed and realistic modelling of customer-related activities, taking into account both costs and potential profits, which helps to better predict the profitability of different marketing or sales activities. The cost is incurred for every customer contacted; if the action is unsuccessful, no revenue is counted for that customer, only the cost.

  • Costs – this option allows you to specify what costs are associated with an action. They can be fixed (the same for each case) or variable (varying from record to record). For example, the cost of sending an offer to a customer can be set as a fixed value or it can be determined by some variable, such as the history of previous interactions with the customer.

  • Revenue – specifies the profit earned if the action is successful. The revenue can also be fixed or variable. In this way, you can predict the profit made if, for example, a customer is successfully persuaded to take up an offer.

  • Weights – if the data being analysed represents more than one customer, weights can be used to adjust the results accordingly. You can set fixed weights for each record or variable weights, depending on the number of customers in a record.
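Putting costs, revenue and weights together, the profit calculation can be sketched as below. The field names and the campaign data are purely illustrative, not the tool's schema:

```python
def expected_profit(records, cost=2.0):
    """Per-record profit: revenue counts only when the action succeeds,
    while the cost is always incurred; the weight scales records that
    represent more than one customer."""
    total = 0.0
    for r in records:
        revenue = r["revenue"] if r["success"] else 0.0
        total += (revenue - cost) * r.get("weight", 1.0)
    return total

campaign = [
    {"success": True,  "revenue": 10.0, "weight": 1.0},
    {"success": False, "revenue": 10.0, "weight": 1.0},  # cost only, no revenue
    {"success": True,  "revenue": 10.0, "weight": 3.0},  # record covers 3 customers
]
```

For this hypothetical campaign the total is (10-2)·1 + (0-2)·1 + (10-2)·3 = 30.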

Lift criteria – an option available only for flag-type target variables (i.e. variables with two possible values, such as ‘yes’ or ‘no’). It allows you to specify the percentage of records (percentile) taken into account when calculating lift, i.e. how much better the model predicts within that group than random selection would. Suppose an analyst builds a model that predicts whether a customer will renew a contract. The lift criterion can be used to focus on the top 10% of customers who are most likely to renew, which helps to assess how effective the model is at predicting the behaviour of this key group of customers.
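The growth (lift) measure for a chosen percentile can be sketched as follows: sort records by model score, take the top slice, and compare its hit rate with the overall hit rate (a sketch of the standard lift calculation, with illustrative data):

```python
def lift_at_percentile(scores, outcomes, percentile=10):
    """Lift = hit rate within the top `percentile`% of records, ranked
    by model score, divided by the overall hit rate."""
    ranked = sorted(zip(scores, outcomes), key=lambda p: p[0], reverse=True)
    cutoff = max(1, len(ranked) * percentile // 100)
    top_rate = sum(o for _, o in ranked[:cutoff]) / cutoff
    base_rate = sum(outcomes) / len(outcomes)
    return top_rate / base_rate
```

A lift of 3 at the 10th percentile, for instance, means the model's top decile contains renewals at three times the overall rate.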

 

Fig. 1

Auto Classification node window with the options to parameterise the created models for comparison

Selection of models for comparison

 

The Advanced tab allows the selection of the specific model types (algorithms) to include in the comparison. Selecting more model types increases the number of models generated and the processing time. Default settings can be used for each algorithm, or the options can be specified individually, allowing multiple variants to be compared in a single run. A maximum of 17 different model types can be selected for comparison. In addition, the Limit maximum time to build one model option sets a cap on the time spent building any single model of the selected types, which prevents long delays on complex models.

 

Fig. 2

The Advanced tab allows you to select the type of models to compare

Rejection of non-compliant models

The Reject tab of the Auto Classification node allows models that do not meet the specified criteria to be rejected automatically. These models will not be listed in the summary report.

A threshold for minimum overall accuracy and a threshold for the maximum number of variables used in the model can be specified. In addition, for flag-type target variables, thresholds for minimum lift, profit and area under the ROC curve can be set; lift and profit are calculated as specified on the Model tab.
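The filtering itself amounts to checking each candidate model against every configured threshold, as in this sketch (metric values and model names are hypothetical; the growth/lift and AUC checks are optional because they apply only to flag targets):

```python
def passes_thresholds(model, min_accuracy=0.8, max_variables=20,
                      min_lift=None, min_auc=None):
    """Keep a model only if it clears every configured threshold;
    thresholds set to None are simply not checked."""
    if model["accuracy"] < min_accuracy:
        return False
    if model["n_variables"] > max_variables:
        return False
    if min_lift is not None and model["lift"] < min_lift:
        return False
    if min_auc is not None and model["auc"] < min_auc:
        return False
    return True

candidates = [
    {"name": "CHAID", "accuracy": 0.86, "n_variables": 12, "lift": 2.1, "auc": 0.90},
    {"name": "C&RT",  "accuracy": 0.76, "n_variables": 8,  "lift": 2.4, "auc": 0.88},
]
kept = [m for m in candidates if passes_thresholds(m, min_accuracy=0.8)]
```

Only the models in `kept` would go on to appear in the summary report.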

Optionally, the node can be configured to stop execution when the first model that meets all the specified criteria is generated. 

 

Fig. 3

Reject tab options that allow you to specify criteria for rejecting models that do not meet the analyst's requirements 

Results for Auto Classification node

Let us look at the results of the Auto Classification node. Figure 4 shows an analysis stream prepared in PS CLEMENTINE PRO, in which a data file and some basic settings were defined. The settings for model selection, costs and test and learning sets were then made in the Auto Classification node. Once the changes had been made, the entire stream was run.  

 

Fig. 4

Example of an analytical stream with the Auto Classification node running

The result of running the stream is a generated Model node. Clicking it opens a summary of the prepared models, which can be sorted according to the criteria that matter to us. In addition, for each model the analyst can view details such as the variables used, the importance of the predictors, or other additional information. It is worth noting that the detail view differs by model type; it can include, for example, a classification tree or a neural network diagram. The Summary tab provides details of the procedure's execution, including a list of the variables used, the settings for creating each model, and the total time it took to build all the models.

 

Fig. 5

The result of the Auto Classification node showing a summary for the 5 models that achieved the best score

Summary

The Auto Classification node in PS CLEMENTINE PRO is a useful tool that significantly speeds up and facilitates work with data. With this node, different predictive models can be automatically created and compared, allowing you to quickly find the best solution for a given problem. Auto Classification automates the process of selecting algorithms, parameters and evaluation criteria, which saves time and reduces the risk of errors. As a result, analysts can focus on interpreting results and making decisions, rather than manually creating and testing many different models.
