Trees that grow from tables - Predictive Solutions

When stepping outside the domain of distributive and descriptive statistics for individual variables, we usually take interest in correlations between variables.

To investigate them, we use various measures of relationship strength, chosen according to the type of data and the measurement level. Apart from determining the strength of a correlation, we are interested in the nature of the dependence. Ratio and interval variables call for different analytical tools than nominal and ordinal ones. In the latter case, crosstabs are the most basic method for investigating relations between two variables. We will focus on them now.

Crosstabs are used to check whether there is a correlation between a pair of variables and to find out which categories of the qualitative variables drive the dependence. We often also need to verify theoretical assumptions about the direction of causality between the variables.

Figure 1 Table of the influence of having children on the inclination to buy

The table above illustrates the impact of having children on the inclination to buy a certain product. We obviously assume that having children affects the inclination to buy, not the other way round. When we look at the table, we see an ideal situation: having children determines the inclination to buy the product. This is clear both in counts (N) and in percentages (%).
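A crosstab of this kind is easy to build by hand. The snippet below is a minimal sketch using only the Python standard library. The records, the category labels and the helper names `crosstab` and `column_percentages` are made up for illustration, because the figures behind Figure 1 are not reproduced in the text.

```python
from collections import Counter

# Hypothetical records (has_children, decision); not the data from Figure 1.
records = (
    [("Yes", "Buy")] * 40 + [("Yes", "No buy")] * 10
    + [("No", "Buy")] * 15 + [("No", "No buy")] * 35
)

def crosstab(pairs):
    """Count (column_category, row_category) pairs into table[row][column]."""
    counts = Counter(pairs)
    rows = sorted({r for _, r in pairs})
    cols = sorted({c for c, _ in pairs})
    return {r: {c: counts[(c, r)] for c in cols} for r in rows}

def column_percentages(table):
    """Express each cell as a percentage of its column total."""
    cols = list(next(iter(table.values())).keys())
    totals = {c: sum(table[r][c] for r in table) for c in cols}
    return {r: {c: 100.0 * table[r][c] / totals[c] for c in cols} for r in table}

table = crosstab(records)
percent = column_percentages(table)
for row in table:
    for col in table[row]:
        print(f"{row:7s} | children={col:3s} | N={table[row][col]:3d} | {percent[row][col]:5.1f}%")
```

Column percentages are used here because the columns hold the assumed cause (having children), so each column answers the question "what share of this group buys?".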

The same dependence can be found if a classification tree algorithm is employed. It offers a different manner of visualisation of the relationship.

Figure 2 Tree of the influence of having children on the inclination to buy

The tree starts with the main node (also called the trunk or root). It holds the same information as the General column of the cross table. Below the root, there is the variable we used to test the dependency, i.e. having children. Having children splits the tree into two branches, Yes and No. They show the same statistics as the columns of the table.

This form of classification tree is nothing but a different way of showing the same data as in the crosstab. Is this the end of similarities between tables and classification trees? No, if we consider them not only as a form of visualisation but also as an analytical algorithm, which gives more insight into our data.

If you want to stick to tabular analytical methods, use the CHAID classification algorithm. It is based on the Chi-square independence test, which is often employed in crosstabs to assess the significance of the dependence found in the data. Let's perform the Chi-square test for our crosstab.

Figure 3 Table with Chi-square independence test
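As a rough illustration of what such a test computes, here is a sketch of the Pearson Chi-square statistic written with the standard library only. The observed counts are hypothetical, not the values from Figure 3, and the function names are made up for this example. No continuity correction is applied, and the p-value helper covers only the 2x2 case (one degree of freedom), where the upper tail of the Chi-square distribution reduces to the complementary error function.

```python
import math

def chi_square_independence(table):
    """Pearson chi-square for a table given as a list of rows of counts.

    Returns (statistic, degrees_of_freedom, expected_counts).
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    # Expected count under independence: row total * column total / grand total.
    expected = [[r * c / n for c in col_totals] for r in row_totals]
    stat = sum(
        (obs - exp) ** 2 / exp
        for obs_row, exp_row in zip(table, expected)
        for obs, exp in zip(obs_row, exp_row)
    )
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return stat, df, expected

def p_value_df1(stat):
    """Upper-tail p-value of the chi-square distribution with 1 degree of freedom."""
    return math.erfc(math.sqrt(stat / 2.0))

# Hypothetical counts: rows = buys / does not buy, columns = has children yes / no.
observed = [[40, 15], [10, 35]]
stat, df, expected = chi_square_independence(observed)
print(f"chi2 = {stat:.2f}, df = {df}, p = {p_value_df1(stat):.2g}")
```

For larger tables the statistic and degrees of freedom are computed the same way, but the p-value needs the general Chi-square distribution, which a statistics package would normally provide.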

The test results may be presented in a separate table, as in Figure 3, or moved elsewhere, for example to the foot of the cross table. Using CHAID, we can present the same statistics on our tree.

Figure 4 Tree with an additional Chi-square independence test

The example above shows an artificial relationship, because real life hardly ever presents such perfect correlations. Let’s look at some more realistic data: how education affects the inclination to buy.

Figure 5 Table of the influence of education on the inclination to buy

With such data, crosstab analysis gets more complicated. While employing logic similar to the one in the table analysis, we can use the CHAID algorithm to build a tree with the same data.

Figure 6 Tree of the influence of education on the inclination to buy

With classification trees we can go one step further. We can allow the algorithm to merge ‘similar’ categories. What constitutes ‘similar’, and which principles the CHAID algorithm applies when merging categories, is a separate issue described in detail in our e-bulletin in the article on classification trees. The analyst could decide which categories to merge after studying the content of the crosstab; the CHAID algorithm does it for you. Note that we do not have to agree with the decisions made by the algorithm. The program is supposed to support the decision-making process, not to replace the analyst.

Figure 7 Tree of primary and vocational education merged automatically

The CHAID algorithm automatically merged the Primary and Vocational categories. Additionally, for visualisation purposes bar charts were added to the tree on request.
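A merge like this one can be illustrated with a simplified sketch of a single CHAID-style merge step. It assumes a binary target, treats the predictor as nominal (any pair of categories may be merged), and uses an assumed merge threshold `alpha_merge=0.05`. The counts and function names are hypothetical, and a full CHAID implementation does more than this (repeated merging, adjusted p-values, handling of ordinal predictors), so treat it as an illustration of the idea rather than the algorithm your software runs.

```python
import math
from itertools import combinations

def chi2_2x2_p(a, b):
    """p-value of the Pearson chi-square test (df = 1) comparing two categories,
    each given as (buy_count, no_buy_count). Assumes no empty row or column."""
    table = [list(a), list(b)]
    row_totals = [sum(r) for r in table]
    col_totals = [sum(c) for c in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            exp = row_totals[i] * col_totals[j] / n
            stat += (obs - exp) ** 2 / exp
    return math.erfc(math.sqrt(stat / 2.0))

def merge_most_similar(categories, alpha_merge=0.05):
    """One merge step: find the pair of categories whose difference is least
    significant (largest p-value) and merge it if p exceeds alpha_merge."""
    best_pair, best_p = None, -1.0
    for x, y in combinations(categories, 2):
        p = chi2_2x2_p(categories[x], categories[y])
        if p > best_p:
            best_pair, best_p = (x, y), p
    if best_pair is None or best_p <= alpha_merge:
        return dict(categories)  # every pair differs significantly: no merge
    x, y = best_pair
    merged = {k: v for k, v in categories.items() if k not in best_pair}
    merged[f"{x}+{y}"] = tuple(cx + cy for cx, cy in zip(categories[x], categories[y]))
    return merged

# Hypothetical (buy, no buy) counts per education level.
education = {
    "Primary": (12, 38),
    "Vocational": (14, 36),
    "Secondary": (30, 30),
    "Higher": (45, 15),
}
result = merge_most_similar(education)
print(result)
```

With these made-up counts, Primary and Vocational have nearly identical buying rates, so they form the pair with the largest p-value and are merged. As the article stresses, the analyst is free to overrule such a merge.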

Of course, the table can be ‘enhanced’ visually by adding, for example, a heat map.

Figure 8 Contingency or heat map of the influence of education on the inclination to buy

The dependency was visualised with the Primary and Vocational categories of the education variable merged, as suggested by the CHAID algorithm.

The next step in the tabular analysis is to introduce an additional variable to the table in order to move beyond a simple crosstab with two variables. The possible goal may be a more in-depth verification of the dependence between the two variables. The third variable, a control variable, may help check whether the detected correlation is not an ‘illusory correlation’. It may also help detect an ‘illusory non-correlation’ or be used to test interactions between variables.

Figure 9 Table of the influence of education on the inclination to buy by gender

In the table above, education was ‘nested’ under the gender variable. The result is two tables that describe the effect of education on the inclination to buy, one for women and one for men. Similar data may be presented with a classification tree.

Figure 10 Tree of the influence of education on the inclination to buy by gender
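The nesting used in Figures 9 and 10, where the effect of education is examined separately within each gender, can be sketched in a few lines. The records below are hypothetical and the helper name `buy_rate_by_layer` is made up for this example; only the standard library is used.

```python
from collections import defaultdict

# Hypothetical records: (gender, education, buys). Not the article's data.
records = (
    [("Female", "Higher", True)] * 30 + [("Female", "Higher", False)] * 10
    + [("Female", "Secondary", True)] * 15 + [("Female", "Secondary", False)] * 25
    + [("Male", "Higher", True)] * 20 + [("Male", "Higher", False)] * 20
    + [("Male", "Secondary", True)] * 18 + [("Male", "Secondary", False)] * 22
)

def buy_rate_by_layer(rows):
    """Share of buyers per education level, computed separately for each gender."""
    counts = defaultdict(lambda: [0, 0])  # (gender, education) -> [buyers, total]
    for gender, education, buys in rows:
        cell = counts[(gender, education)]
        cell[0] += int(buys)
        cell[1] += 1
    rates = defaultdict(dict)
    for (gender, education), (buyers, total) in counts.items():
        rates[gender][education] = buyers / total
    return {g: dict(r) for g, r in rates.items()}

rates = buy_rate_by_layer(records)
for gender, by_education in rates.items():
    for education, rate in by_education.items():
        print(f"{gender:6s} | {education:9s} | buy rate = {rate:.0%}")
```

In this made-up data the gap between education levels is much wider among women than among men, which is the kind of interaction a control variable is meant to reveal.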

Ultimately, trees may be used to build more complex models of variable relationships. Tables may be used to this end as well, but the presentation of results would be neither clear nor effective. Additionally, the option to build a tree automatically (in the case of the CHAID algorithm, using the significance level of the Chi-square test) saves the analyst's time. It does not, however, excuse us from the need to verify the factual correctness of the final model. Even the best analytical tools follow statistical criteria rather than factual ones, which is crucially important when the goal is to build an explanatory model.

Figure 11 ‘Automatic’ tree describing the influence of multiple variables on the inclination to buy

You can steer clear of mistakes when using the CHAID algorithm by using your crosstabular analytical skills and a good comprehension of relevant measures and statistics.

To sum up, a good understanding of the logic of crosstabs will help you use more advanced classification algorithms such as CHAID more effectively. By reaching beyond crosstabs and using classification tree algorithms, we can:

remove the limit on the number of variables that can be used with crosstabs (although it is possible to analyse a system of three or more variables in a table, the process is formidable);
get a clear visualisation of results despite multiple variables;
simplify the models by reducing the number of variable categories through the automatic merging of ‘similar’ categories;
check and control variable dependencies to investigate interactions and verify ‘illusory correlations’ and ‘illusory non-correlations’.

Finally, the centrepiece: by using classification trees we can build and test complex causal models with multiple variables quickly and effectively.
