The Auto Data Preparation node allows you to quickly and easily prepare data for model building, without manually checking and analysing individual variables. As a result, models are built and evaluated faster, and a model trained on the prepared data can perform better than one trained on raw data. In addition, automatic data preparation improves the flexibility of automated modelling processes, such as model refreshing in PS CLEMENTINE PRO.
The analyst can use the node in a fully automated manner, allowing the node to select and apply corrections, or can preview proposed changes before making them and accept or reject them as required.
Fig. 1
Selecting one of the data preparation objectives in the Auto Data Preparation node
Parameterisation of the Auto Data Preparation node
After selecting one of the objectives, it is useful to switch to the Variables tab. Here you can specify whether the predefined variable roles are to be used in the data preparation process, or whether you want to assign custom roles in this window. When the custom option is selected, it is possible to indicate which variable is to be the predicted (target) variable (optional) and which variables are to serve as inputs.
Let's move on to the Settings tab, where we can define many settings related to the exclusion of variables, the selection of predictors and other important issues.
Variable settings
The options available here allow weighting variables to be used. The first option (Use frequency variable) selects a frequency-weighting variable. It should be used if each record in the training data represents more than one unit, e.g. when aggregated data is used; the variable's values should correspond to the number of units represented by each record.
The Use weighted variable option, on the other hand, allows a variable to be selected as an observation weight. These are used to account for differences in variance between levels of the output variable.
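To make the frequency-weighting idea concrete, here is a minimal pandas sketch (not the node's actual implementation; the column names are hypothetical). Expanding each aggregated record by its frequency value reproduces the unit-level data that a frequency-weighted algorithm effectively operates on:

```python
import pandas as pd

# Aggregated data: each row stands for several identical units,
# with "freq" holding the number of units the record represents.
agg = pd.DataFrame({
    "age_group": ["18-30", "18-30", "31-50"],
    "churned":   [1, 0, 0],
    "freq":      [3, 5, 2],
})

# Repeating each row "freq" times recovers the unit-level dataset.
expanded = agg.loc[agg.index.repeat(agg["freq"])].drop(columns="freq")
expanded = expanded.reset_index(drop=True)

print(len(expanded))               # 10 unit-level records
print(expanded["churned"].mean())  # 0.3 — the weighted churn rate (3/10)
```

In practice, algorithms do not expand the data physically; they weight each record's contribution by the frequency value, which yields the same statistics.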
In addition, we also have the option to indicate how variables that are excluded from the modelling are handled. It is necessary to specify whether the excluded variables are to be filtered out of the data or whether they are to be omitted.
The last option specifies the action to take for input variables that do not match those of the previous analysis, including what happens if at least one required input variable is unavailable in the input dataset when the node is executed.
Fig. 2
Auto Data Preparation node settings tab
Preparation of date and time
Many modelling algorithms cannot process date and time data directly. These settings allow new duration variables, which can serve as model inputs, to be derived from the date and time information available in the existing data. Date and time variables must be predefined with the appropriate data types, such as date or time. The original date and time variables will not be used as model inputs by the automatic data preparation process.
Options are available to calculate the elapsed time since a reference date (how many years, months and days have passed since that date for each date variable) and the elapsed time since a reference time (hours, minutes or seconds).
In addition, the analyst can also extract cyclical time elements. These settings allow a single variable containing a date or time to be split into several new variables. For example, if all three options for date are selected, for an input variable such as ‘2024-09-23’, it will be split into three separate variables: year (2024), month (09) and day (23). Each of these will be prefixed with the prefix defined in the Variable Names options.
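The two derivations above can be sketched in pandas (a simplified illustration, not the node's implementation; the column names and the reference date are hypothetical):

```python
import pandas as pd

df = pd.DataFrame({"signup_date": pd.to_datetime(["2024-09-23", "2020-01-15"])})
reference = pd.Timestamp("2025-01-01")

# Elapsed time since the reference date, expressed here in days.
df["days_since_signup"] = (reference - df["signup_date"]).dt.days

# Cyclical date elements split into separate variables, named with a
# prefix in the spirit of the node's Variable Names settings.
df["signup_year"] = df["signup_date"].dt.year
df["signup_month"] = df["signup_date"].dt.month
df["signup_day"] = df["signup_date"].dt.day
```

For ‘2024-09-23’ this yields year 2024, month 9 and day 23, plus a duration of 100 days relative to the reference date.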
Exclusion of input variables
Data quality is a key factor affecting prediction accuracy. These options allow you to set quality requirements that input variables must meet. All variables that are constant or have 100% missing data are automatically excluded.
The first option allows you to define a maximum percentage of missing data (Exclude input variables with too many missing data). Variables with more than the specified percentage of missing data are removed from further analysis.
In addition, the Exclude nominal variables with too many unique categories option allows you to specify the number of categories above which nominal variables are excluded from further analysis. This is useful for automatically removing from modelling variables that contain information unique to each record, such as an identifier, address or name.
The last option, Exclude categorical variables with too many values in one category, allows you to specify a maximum percentage above which ordinal and nominal variables with a category that contains more records than the specified percentage are removed from further analysis. By default, this option is disabled. An example of such a variable would be the gender variable, where the majority of records are of one gender (e.g. 95% male and 5% female).
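The screening rules described above can be illustrated with a short pandas sketch (a simplified stand-in for the node's behaviour; the function, thresholds and sample data are hypothetical):

```python
import pandas as pd

def exclude_inputs(df, max_missing_pct=50.0, max_categories=100, max_dominant_pct=95.0):
    """Drop constants, variables with too many missing values, nominal
    variables with too many categories, and categorical variables
    dominated by a single category."""
    keep = []
    for col in df.columns:
        s = df[col]
        if s.nunique(dropna=True) <= 1:                 # constant (or all missing)
            continue
        if s.isna().mean() * 100 > max_missing_pct:     # too many missing values
            continue
        if s.dtype == object:
            if s.nunique() > max_categories:            # e.g. IDs, addresses, names
                continue
            # share of the single most frequent category
            if s.value_counts(normalize=True).iloc[0] * 100 > max_dominant_pct:
                continue
        keep.append(col)
    return df[keep]

df = pd.DataFrame({
    "constant":       [1, 1, 1, 1],
    "mostly_missing": [None, None, 4.0, 5.0],
    "customer_id":    ["a1", "b2", "c3", "d4"],  # unique per record
    "income":         [30, 45, 52, 61],
})
cleaned = exclude_inputs(df, max_missing_pct=40, max_categories=3)
print(list(cleaned.columns))  # ['income']
```

Only the income variable survives: the constant, the mostly missing variable and the identifier-like variable are all screened out.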
Preparation of input and predictor variables
The options for preparing input and predicted variables are divided into two sections. The first is Adjusting types and improving data quality. Different transformations can be applied to input and predicted variables while maintaining the integrity of the predicted values. For example, predicting revenue in EUR may make more sense than predicting the logarithm of that revenue.
If missing data occur in a predicted or input variable, you can specify whether they should be substituted, which allows some algorithms to continue processing these variables and avoids losing relevant information. In addition, the analyst can tick an option that identifies and corrects outliers, either by removing these values or by replacing them with a cut-off value.
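A minimal pandas sketch of these two corrections, assuming mean substitution for missing values and a 3-standard-deviation cut-off for outliers (the data and thresholds are hypothetical, not the node's exact defaults):

```python
import pandas as pd

# Hypothetical quantitative variable with one missing value and one outlier.
s = pd.Series([12, 15, 14, 13, 12, 14, 15, 13, 12, 14, None, 400], dtype="float64")

# Substitute missing data with the mean so algorithms that cannot
# handle missing values can keep using the variable.
s = s.fillna(s.mean())

# Identify outliers beyond 3 standard deviations and replace them
# with the cut-off value instead of discarding the records.
mean, std = s.mean(), s.std()
s = s.clip(lower=mean - 3 * std, upper=mean + 3 * std)
```

After this step the series contains no missing values, and the extreme value 400 has been pulled back to the upper cut-off rather than dropped.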
The second section is the Quantitative Variable Transformation. These options allow all quantitative variables to be brought to a common scale. The analyst can choose between standardisation, Min/Max normalisation and Box-Cox transformations to reduce skewness.
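The three rescaling choices can be written out as follows (a sketch with hypothetical data; for Box-Cox the node estimates the lambda parameter itself, whereas here it is supplied by hand):

```python
import numpy as np
import pandas as pd

x = pd.Series([1.0, 2.0, 4.0, 8.0, 64.0])  # right-skewed values

# Z-score standardisation: mean 0, standard deviation 1.
z = (x - x.mean()) / x.std()

# Min/Max normalisation to the [0, 1] range.
minmax = (x - x.min()) / (x.max() - x.min())

# Box-Cox transformation for a fixed lambda; lambda = 0 is the
# log transform, a common choice for reducing right skew.
def box_cox(values, lam):
    return np.log(values) if lam == 0 else (values**lam - 1) / lam

reduced_skew = box_cox(x, lam=0)
```

Standardisation and Min/Max bring variables to a common scale; the Box-Cox family additionally reshapes the distribution, so the transformed series is noticeably less skewed than the original.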
Fig. 3
Input and predicted variable settings for the Auto Data Preparation node
Creation and selection of predictors
The options here are active when Transformation is enabled, allowing for the creation and selection of input variables to improve the quality of the prediction. It is worth remembering, however, that if the values on this tab are changed, the Objectives tab will be automatically updated, and the Custom Analysis option will be selected. When the analyst decides to modify these options, three sections will be available: Qualitative Input Variables, Quantitative Input Variables and Selecting and Creating Predictors.
The first section allows the efficiency of the model to be improved by combining categories with a small number of observations. This makes the model simpler by reducing the number of variables to be analysed. By default, a significance level of 0.05 is set to determine which categories to combine. If there is no variable we want to predict, we can combine small-volume categories based on their number. A minimum percentage of records must then be set for the categories to be combined; the default is 10%.
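The no-target case (combining categories below a minimum share of records) can be sketched like this; the function and data are hypothetical, and the node's supervised merging additionally uses a significance test that is omitted here:

```python
import pandas as pd

def merge_small_categories(s, min_pct=10.0, other_label="merged"):
    """Combine categories whose share of records falls below min_pct
    into a single category (mirroring the node's 10% default)."""
    shares = s.value_counts(normalize=True) * 100
    small = shares[shares < min_pct].index
    return s.where(~s.isin(small), other_label)

plans = pd.Series(["basic"] * 60 + ["premium"] * 35 + ["trial"] * 3 + ["legacy"] * 2)
merged = merge_small_categories(plans, min_pct=10)
print(merged.value_counts())  # basic 60, premium 35, merged 5
```

The two rare categories (3% and 2% of records) are pooled into one, reducing the number of category levels the model has to estimate.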
The second section concerns quantitative input variables. When the variable to be predicted is qualitative, these options allow quantitative inputs with strong associations with it to be categorised (binned) to increase processing efficiency.
The last section concerns the selection and creation of predictors. The analyst can select this option to remove predictors weakly correlated with the predicted variable; if necessary, the default significance level of 0.05 can be changed. This option applies only to quantitative input variables when the predicted variable is quantitative, and to qualitative input variables.
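A simplified sketch of predictor screening: the node tests correlations at a significance level, whereas here a plain absolute-correlation threshold stands in for that test (the function, threshold and data are hypothetical):

```python
import pandas as pd

def select_predictors(df, target, min_abs_corr=0.1):
    """Keep only inputs whose absolute correlation with the target
    reaches the threshold — a stand-in for the node's significance test."""
    corr = df.corr()[target].drop(target)
    return corr[corr.abs() >= min_abs_corr].index.tolist()

df = pd.DataFrame({
    "tenure":  [1, 2, 3, 4, 5, 6, 7, 8],          # strongly related to revenue
    "noise":   [5, 3, 5, 3, 3, 5, 3, 5],          # essentially unrelated
    "revenue": [10, 21, 29, 41, 49, 62, 70, 81],
})
print(select_predictors(df, target="revenue"))  # ['tenure']
```

Dropping predictors with no detectable relationship to the target shrinks the model and tends to make the remaining estimates more stable.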
Variable names
To easily identify new and transformed predictors, an automatic data preparation process generates and adds basic names, prefixes or suffixes. The user can change these to better suit their needs and the characteristics of the data. If we want to use other labels, this can be done in a node later in the stream.
Results of the Auto Data Preparation node
Once the automatic data preparation settings - including the changes made on the Objectives, Variables and Settings tabs - are ready, we can start the whole process by clicking the ‘Analyse Data’ button. The algorithm applies the selected settings to the input data and displays the results on the Analysis tab.
Here, you will find a summary of the data processing in the form of a table and graph, as well as recommendations for possible modifications or additional improvements to the data. In addition, it is also possible to review and reject the changes made and, for example, use the unchanged data for modelling.
Fig. 4
The Analysis tab of the Auto Data Preparation node presents in a basic view the most important information concerning the preparation of data for analysis
In the following example, the Auto Data Preparation node was used in PS CLEMENTINE PRO to analyse “Churn”. In short, Churn is a customer departure rate that measures how many people abandon a company's services over a certain period of time. In this case, logistic regression, which determines the probability of a customer leaving based on various variables, was used to analyse Churn.
Logistic regression, with its ability to model binary variables, is an ideal tool for predicting whether a customer will leave or stay. As can be seen in Figure 5, the logistic regression was performed on data processed with the Auto Prep node and without processing.
Fig. 5
Analysis stream in which the Auto Data Preparation node was used for Churn analysis
Looking at the results, it can be seen that the logistic regression model for which the data from the Auto Data Preparation node was used is better than the regression performed on the data without processing. The percentage of correct classification of customers is close to 80% in the model that used the Auto Data Preparation node.
| Logistic regression classification results | Model based on raw data | Model based on Auto Data Prep node data |
|---|---|---|
| Correct | 10.6% | 78.8% |
| Incorrect | 89.4% | 21.2% |
| Total | 100% | 100% |
Table 1. Comparison of results of logistic regression performed on processed and unprocessed data
Summary
The Auto Data Preparation node in PS CLEMENTINE PRO is crucial in the data analysis process, as it automates and simplifies the preparation of the dataset for modelling. This step can sometimes be time-consuming and error-prone, especially when the analyst has to manually check each variable. Auto Data Prep speeds up this process by automatically identifying problems such as missing data, outliers or variables with too many unique values that can negatively impact the quality of the model.
The node enables automatic transformation of quantitative and qualitative variables, standardisation, removal of problematic variables and creation of new optimal predictors. In this way, the automation of data preparation ensures better prediction quality, optimising the model-building process. Compared to raw data, data processed by this node often results in better analysis results, as shown in the example of Churn analysis, where automated data preparation improved the accuracy of the predictive model.