Analysts have a rich set of statistical techniques at their disposal that allow them to find distinct patterns in a customer’s behavior that are important from a business point of view.
Decision trees are among the most popular methods of data exploration and analysis. What is the reason for this appeal? Trees are a very flexible technique and can be applied to a wide variety of data and research problems. The variables used in the analysis do not have to meet restrictive assumptions, as is the case with linear regression models or analysis of variance. Moreover, variables do not need to be transformed. Additionally, decision trees provide visually attractive results, namely decision rules in the form of a tree, whose interpretation is fairly intuitive and does not require advanced knowledge of complex statistical terms.
Of course, decision trees are not a universal technique that solves every problem and leads us straight to a final decision. As with any analysis, the technique is only a tool, which should be applied with full awareness of its advantages and limitations. First of all, decision trees require large samples to produce reliable results. In small samples, outliers or chance relationships may have an excessive impact on the results. If the number of observations is not sufficient, the tree may also fail to grow properly.
Trees and decision rules: what does the tree want to tell us?
The term "decision trees" in fact covers two groups of analytical techniques. Depending on the problem that the tree has to face (i.e. the level of measurement of the predicted variable), we distinguish between classification trees and regression trees. Classification trees are a tool for classification in the classic sense. Given known values of a qualitative (nominal or ordinal) predicted variable, the algorithm tries to predict membership of its categories as accurately as possible using the available predictors. In tree terminology, the algorithm tries to divide the data set using the independent variables in such a way that, ideally, each terminal node (leaf) contains only one category of the predicted variable. Based on the values of the independent variables, the classification tree builds a rule for each terminal node, for example: "if gender is female and education is higher, then the predicted value of X is 'consumer will buy'". Some of these rules will be 100% accurate, others will be slightly less effective. How do you measure accuracy? The tree predicts the category of the dependent variable based on the mode (the most frequent category) in each terminal node. Cases that belong to any other category are classified incorrectly, and the lower their percentage, the better.
Like a classification tree, a regression tree aims to provide a set of rules with the lowest possible prediction error. For a quantitative variable the mode would not give reliable results, so the tree predicts the mean value of the dependent variable. An example rule is: "if gender is female and education is higher, then the mean of X equals 3.75". As you can guess, the prediction error in this procedure is defined as the sum of squared deviations from the mean in the terminal nodes. The smaller the variation of the dependent variable within the resulting groups, the better the model.
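The two kinds of leaf prediction described above can be illustrated with a minimal Python sketch. It is not the procedure of any particular tool: the function names and the sample values are hypothetical and were chosen only to mirror the example rules.

```python
from collections import Counter


def classification_leaf(labels):
    """Predict the mode of a leaf and return the share of misclassified cases."""
    counts = Counter(labels)
    predicted, hits = counts.most_common(1)[0]
    error_rate = 1 - hits / len(labels)
    return predicted, error_rate


def regression_leaf(values):
    """Predict the mean of a leaf and return the sum of squared deviations."""
    mean = sum(values) / len(values)
    sse = sum((v - mean) ** 2 for v in values)
    return mean, sse


# Hypothetical leaf of a classification tree: three buyers, one non-buyer.
print(classification_leaf(["buy", "buy", "buy", "no"]))

# Hypothetical leaf of a regression tree with a mean of 3.75.
print(regression_leaf([3.5, 4.0, 3.75]))
```

In both cases a smaller error value means a more homogeneous leaf, which is what the tree tries to achieve.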
Interpretation of the results of the regression tree
How does the regression tree work in practice? Consider an example. A chain of grocery stores has collected information on transactions made by its customers. A random sample of 351 purchase transactions was analyzed. The dependent variable is the value of the transaction. The analysis may draw on a number of independent variables. These are:
- gender of the person making the purchase
- purpose of shopping (for yourself, for friends, for family)
- shopping style (every day, once or several times a week, less than once a week)
- size of the shop (small, medium, large)
- use of purchase coupons (none, newspaper, email, mobile app)
Let's take a look at the predicted variable, namely the value of purchases. The histogram and the descriptive statistics table are shown below.
Figure 1. Distribution and descriptive statistics of the value of purchases
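Statistics of the kind summarized in Figure 1 can also be checked outside the tool. The sketch below is only an illustration: the function names are hypothetical, and the skewness and kurtosis formulas are the simple moment-based versions, which may differ slightly from the sample-adjusted values reported by statistical packages. The figures 199 and 46 are the mean and standard deviation reported for this sample.

```python
def coefficient_of_variation(avg, sd):
    """Standard deviation expressed as a fraction of the mean."""
    return sd / avg


def central_moment(values, order):
    """Average of the deviations from the mean raised to the given power."""
    avg = sum(values) / len(values)
    return sum((v - avg) ** order for v in values) / len(values)


def skewness(values):
    """Moment-based skewness: 0 for a perfectly symmetric sample."""
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5


def excess_kurtosis(values):
    """Moment-based kurtosis minus 3, so a normal distribution scores about 0."""
    return central_moment(values, 4) / central_moment(values, 2) ** 2 - 3


# Standard deviation relative to the mean for the purchase values.
print(f"{coefficient_of_variation(199, 46):.0%}")  # prints 23%
```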
The average transaction value was US$ 199. The standard deviation tells us, in simplified terms, that if we forecast the transaction value using the mean, we will be off by about US$ 46 on average. This seemingly small value accounts for about 23% of the average value of purchases, which sounds much worse. The most expensive purchase was worth US$ 343, and the smallest transaction US$ 97. The difference between the median and the mean, as well as the values of kurtosis and skewness (close to 0), confirm the impression from the histogram that the variable has a symmetrical distribution, close to a normal distribution. We also do not observe any outliers. Such a situation is of course possible, although it is rare.
Let's try to analyze the relationship between the value of purchases and the available independent variables using decision trees. They are available in PS IMAGO PRO in the menu ANALYZE > CLASSIFY > TREE. We will use CRT as the Growing Method for the tree. (Additional algorithm settings will be discussed in Part 2 of this article.) Below you can find the resulting regression tree for the purchase value variable.
Figure 2. Regression tree describing the value of purchases
How should the results be interpreted? Our tree is in fact a visualization of decision rules. They are read from the root at the very top of the tree, labeled Node 0 (unlike real trees, decision trees grow downwards!), through the individual branches down to the terminal nodes, which have not been divided further. Each branching divides the observations into two subgroups according to a given independent variable (binary splits are a key characteristic of the CRT algorithm). Above each node is a description of the categories that define it. In the first stage, the tree was divided according to the purpose of shopping. The first group (Node 1) includes people who shop for their family, and the second group includes people who bought products for friends or for themselves. It is worth remembering that a tree has a hierarchical structure: each subsequent, deeper division takes place only within the observations assigned to a node at an earlier stage. In addition, a different predictor can be used for further divisions in each subgroup. In this case, the left-hand side of the tree contains only people who shopped for their family, and this group was divided according to the use of coupons. On the right-hand side of the tree, the division was made according to gender. Successively deeper divisions illustrate successive levels of decision rules. We can think of them as increasingly complex filters for selecting observations. Each node also shows the descriptive statistics of the predicted variable. Optionally, we can request a visualization in the form of charts in the analysis wizard. (More details of the algorithm will be discussed in Part 2 of this article.)
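The hierarchical structure of the first two levels can be sketched in a few lines of Python. This is only an illustration of how nested divisions work, not output from PS IMAGO PRO: the function name, the dictionary keys and the category labels are hypothetical.

```python
def splits_applied(customer):
    """Return the predictors used on the way down the first two tree levels."""
    path = ["purpose"]              # Node 0 is divided by the purpose of shopping
    if customer["purpose"] == "family":
        path.append("coupons")      # left branch: divided by the use of coupons
    else:                           # shopping for friends or for oneself
        path.append("gender")       # right branch: divided by gender
    return path


print(splits_applied({"purpose": "family"}))   # ['purpose', 'coupons']
print(splits_applied({"purpose": "friends"}))  # ['purpose', 'gender']
```

The second condition is evaluated only for cases that passed the first one, which is why a different predictor can appear on each side of the tree.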
For now, let us simply say that the tree chooses each division on the basis of the mean of the dependent variable and the reduction of the dispersion around it. In the whole sample, the average value of purchases was US$ 199. In the first stage, the tree divided the analyzed sample into two subgroups: in the first node the average was US$ 247, in the second US$ 182. The standard deviations also decreased. The final division of cases into subgroups can be read from the terminal nodes, or leaves. These are the nodes that have simply not been subdivided further. At this point we should therefore focus on the seven terminal nodes, numbered 3, 5, 7, 8, 10, 11 and 12.
For example, consider Node 7. It includes men who shop for a family (first-level division) and use coupons from an email or app. These people spend an average of US$ 283 (+/- 31). The regression tree algorithm not only selected a group of customers spending significantly more than the average for the whole sample, but also managed to significantly reduce the prediction error based on the average. At the other end of the spending continuum are women who shop for themselves and use either no coupons or newspaper coupons. This is Node 11 (at the very bottom of the tree). These people spend an average of US$ 136 on shopping, and the standard deviation in this case is 20. This is also a significant reduction in error. This node also shows that a variable that has already been used (here, the purpose of shopping) can be reused by the CRT algorithm, provided, of course, that the number of categories allows it.
After the cases belonging to each of the seven terminal nodes have been saved to the data set, they can be analyzed further, for example in a post-stratification process. Descriptions of individual nodes and their descriptive statistics are presented in the table below.
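The idea of choosing a division that reduces the dispersion around the mean can be sketched as follows. This is a simplified illustration, assuming that dispersion is measured as the sum of squared deviations within the two resulting groups; the actual settings of the CRT algorithm are not covered here. The function names and the six transactions are invented for the example.

```python
from itertools import combinations


def sse(values):
    """Sum of squared deviations from the mean (0.0 for an empty group)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_binary_split(rows, predictor, target):
    """Try every two-way grouping of categories and keep the lowest total SSE."""
    categories = sorted({row[predictor] for row in rows})
    first, rest = categories[0], categories[1:]
    best = None
    # Keeping the first category on the left avoids testing mirrored splits.
    for size in range(len(rest)):
        for extra in combinations(rest, size):
            left_cats = {first, *extra}
            left = [r[target] for r in rows if r[predictor] in left_cats]
            right = [r[target] for r in rows if r[predictor] not in left_cats]
            total = sse(left) + sse(right)
            if best is None or total < best[0]:
                best = (total, left_cats)
    return best


# Invented transactions, for illustration only.
rows = [
    {"purpose": "family", "amount": 250},
    {"purpose": "family", "amount": 240},
    {"purpose": "friends", "amount": 190},
    {"purpose": "friends", "amount": 180},
    {"purpose": "self", "amount": 175},
    {"purpose": "self", "amount": 185},
]
print(best_binary_split(rows, "purpose", "amount"))
```

With these invented values the lowest within-group dispersion is obtained by separating "family" from the other two categories, which mirrors the kind of first division discussed in the article.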
The visualization of the error bars is one of the graphic procedures available in PS IMAGO PRO in the menu: PREDICTIVE SOLUTIONS > GRAPHS > TABLE > ERROR BARS. These so-called “table charts”, which are unique to PS IMAGO PRO, allow you to present the size of subgroups and various descriptive statistics of the analyzed quantitative variable alongside a wide selection of visualizations.
Figure 3. Description of the target groups identified by the regression tree
This concludes Part 1 of the article on regression trees. In this article, we focused primarily on the interpretation of the results. In the next article we will discuss in detail how the algorithm works, how to control the division of the tree, and which additional output objects enrich the interpretation.