Correspondence analysis: Calculating distances in a table

Correspondence analysis is one of the most convenient and popular exploratory techniques in data analysis.

It offers a quick and clear visualisation of a relationship in a contingency table.

An in-depth description of the procedure for transforming tabulated data into a perceptual map can be found in an earlier article. Using the SVD algorithm discussed in that article, the PS IMAGO PRO analytical engine transforms the values of the individual row and column profiles into coordinates in an ordinary coordinate system.

Knowing how the similarity between categories of two qualitative variables is represented geometrically is vital for understanding how the correspondence analysis algorithm works, for selecting the right normalisation and, as a result, for interpreting the resulting perceptual map correctly.

Let’s start with a short reminder of how to calculate the distance between two objects. There are surprisingly many ways to measure it, but in this article we will focus on the most intuitive one: the Euclidean distance we all know from geometry classes, i.e., the length of the straight line joining two data points on a two-dimensional surface or in an n-dimensional space.

The Euclidean distance is the square root of the sum of squared differences in the individual dimensions. Some readers may have rightly noticed the similarity to calculating the length of the hypotenuse of a right-angled triangle. Note, however, that the distance can also be calculated in spaces with three or more dimensions.

Calculating distance on a plane: Euclidean distance

What is the distance between the data points in the chart shown? Answer: (6 - 2)² + (10 - 4)² = 16 + 36 = 52. The distance is the square root of 52, which is about 7.21.
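The same calculation can be sketched in a few lines of Python. This is only an illustration: the function name `euclidean_distance` is made up for this example, and the points are assumed to be (2, 4) and (6, 10), which is what the differences in the calculation above suggest.

```python
import math


def euclidean_distance(p, q):
    """Straight-line distance between two points with the same number of coordinates."""
    if len(p) != len(q):
        raise ValueError("points must have the same number of dimensions")
    # Square the difference in every dimension, add them up, take the square root.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


# Assumed coordinates of the two points in the chart.
point_a = (2, 4)
point_b = (6, 10)

distance = euclidean_distance(point_a, point_b)
print(round(distance, 2))  # 7.21
```

Because the function loops over all coordinates, it works unchanged for points in three or more dimensions.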

What does this have to do with distances between categories of variables in a table? Have a look at an example. Consumers were asked about their primary criterion when buying beer and about their education. The table below contains a summary of the responses. To make it clearer, a gradient was applied using a table colouring procedure in PS IMAGO PRO.

 Table 1. Primary criterion for beer selection vs. education


The table contains column profiles (percentage in columns), which show what percentage of respondents in individual categories of education applied a specific criterion when selecting a beer (they were asked to select the most important). You can see that educational background clearly differentiated buyer motivation: Price and alcohol content were the key criteria for people with primary and vocational education. People with secondary education were motivated by nice packaging, whereas beer drinkers with university degrees used brand name and price as the criteria most often.

The categories of the column variable differ, but it is clear (and emphasised by the table colouring) that respondents with primary and vocational education had similar choice patterns, while those with secondary and higher education had different motivations when selecting a beer. People with primary education will therefore be closer to those with vocational education than to university graduates.

But how can we calculate this? Imagine (although it might be a slight challenge) that each column is a data point in a six-dimensional space. The individual row categories (price, taste, etc.) form the axes of the coordinate system, and the values in the primary education column are coordinates in this system. The ‘primary’ data point has the following coordinates: [0.402; 0.287; 0.080; 0.046; 0.080; 0.103]. The coordinates of every other data point, that is, of every category of the column variable, are read the same way.

When calculating a distance, we have to take into account one more vital piece of information. Take a look at the average column profile (the last column, ‘total’), which contains the share of the individual categories of the row variable in the sample. You can see that these marginal proportions differ significantly: in total, as many as 29.4% of respondents used price as the key criterion, while ingredients were the most important for only 7.9% of the participants. A similar situation occurs when variables with different units (such as height in cm and mass in kg) are used to calculate a distance: the solution would be dominated by the variable with the larger value range. This is why you have to apply good old standardisation before calculating the distance. How do we standardise the dimensions we are discussing? Simply divide the squared difference in each row dimension by the marginal proportion of that row, taken from the last column. Such a standardised distance is called the ‘weighted Euclidean distance’, or chi-square distance.
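Written as a formula, the weighted distance between two column categories could look as follows. The symbols are introduced here only for illustration and do not come from the article: p_ij is the share of row category i in column profile j, and r_i is the share of row category i in the average profile (the ‘total’ column).

```latex
d(j, j') = \sqrt{\sum_{i=1}^{6} \frac{\left(p_{ij} - p_{ij'}\right)^{2}}{r_i}}
```

Without the division by r_i this is the ordinary Euclidean distance between the two profiles.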

What is, then, the value of the weighted Euclidean distance between the primary and vocational education? Let’s calculate it:

(0.402 - 0.337)²/0.294 + (0.287 - 0.337)²/0.178 + (0.080 - 0.120)²/0.105 + (0.046 - 0.048)²/0.192 + (0.080 - 0.060)²/0.079 + (0.103 - 0.096)²/0.153 ≈ 0.049.

The square root of this value is 0.222 and, however abstract it may sound, this is the distance we are looking for. In comparison, the distance between the data point for primary education and higher education is 0.889 so it is located much further away in the six-dimensional space. You can do the calculations yourself as an exercise.
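For readers who prefer to check the arithmetic in code, here is a minimal Python sketch of the calculation above. The profile values are copied from the calculation in the text; the function name `chi_square_distance` and the variable names are assumptions made for this example. Because the published percentages are rounded to three decimals, the result agrees with the figures in the text only up to rounding.

```python
import math

# Column profiles for two categories of education, as used in the calculation above.
primary = [0.402, 0.287, 0.080, 0.046, 0.080, 0.103]
vocational = [0.337, 0.337, 0.120, 0.048, 0.060, 0.096]
# Average column profile (the 'total' column), used as weights.
average = [0.294, 0.178, 0.105, 0.192, 0.079, 0.153]


def chi_square_distance(p, q, weights):
    """Weighted Euclidean distance: each squared difference is divided by its weight."""
    squared = sum((a - b) ** 2 / w for a, b, w in zip(p, q, weights))
    return math.sqrt(squared)


d = chi_square_distance(primary, vocational, average)
print(round(d ** 2, 3))  # 0.049
print(round(d, 2))       # 0.22
```

The profile for higher education is not listed in the text, so that comparison is left out here; with the full Table 1 at hand, the same function can be reused for any pair of columns.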

How do we calculate distances between data points on a perceptual map? Below you can see part of a table that resulted from a correspondence analysis of the data in Table 1. It contains the coordinates (‘value in dimension’) that were assigned to the individual categories of education during the analysis. The analysis used column normalisation, so the column points are in principal coordinates. This means the distances between them can be compared, because the Euclidean distance between two categories of the same variable can be interpreted as a chi-square distance. Note that the solution below has three dimensions, which is enough to fully represent the differentiation in the initial contingency table (with six criteria and four categories of education, at most min(6 - 1, 4 - 1) = 3 dimensions are needed), so the distances between data points are comparable. Correspondence analysis transformed a six-dimensional table into a three-dimensional perceptual map.

Table 2. Column point coordinates, correspondence analysis


In order to calculate the distance between the data points for primary and vocational education, you need to sum up the squared differences in the individual dimensions and calculate the square root of the result. This is the calculation: (-0.479 + 0.520)² + (0.020 + 0.067)² + (-0.102 - 0.099)² = 0.049. The square root of this value is 0.222, which is exactly the same distance as the one calculated from the column profiles in the contingency table. What about the distance between primary and higher education? (-0.479 - 0.318)² + (0.020 - 0.391)² + (-0.102 - 0.030)² = 0.790, the square root of which is 0.889, meaning the data point for higher education is much further from the ‘primary’ category on the map than the data point for vocational education.
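The map distances can be checked the same way. The sketch below uses the coordinates that appear in the calculation above; the dictionary `coords` and the helper `map_distance` are names invented for this example, and `math.dist` requires Python 3.8 or later. As the published coordinates are rounded, the results match the figures in the text only up to rounding.

```python
import math

# Coordinates of three education categories in the three dimensions of the map,
# taken from the calculation in the text.
coords = {
    "primary": (-0.479, 0.020, -0.102),
    "vocational": (-0.520, -0.067, 0.099),
    "higher": (0.318, 0.391, 0.030),
}


def map_distance(a, b):
    """Plain Euclidean distance between two category points on the map."""
    return math.dist(coords[a], coords[b])


print(round(map_distance("primary", "vocational"), 2))  # 0.22
print(round(map_distance("primary", "higher"), 2))      # 0.89
```

Note that no weighting is needed here: the chi-square weighting has already been absorbed into the coordinates by the analysis.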

Let's wrap up by solving a mystery: why is the weighted Euclidean distance called the chi-square distance? Remember that the chi-square value for a table is the sum of squares of standardised residuals. A standardised residual is the difference between the observed count and the expected count, divided by the square root of the expected count. Let’s go back to the first table. To simplify things a little, we will calculate the distance between one category of education and the average profile (column ‘total’) in the first dimension only. In the ‘price’ dimension, the distance between primary education and the average profile is the square root of (0.402 - 0.294)²/0.294. Have a look at these values. What would the percentages in each column profile be if there were no relation between educational background and the motivation for beer selection? Exactly the same as the values in the average profile. The distance formula therefore contains the difference between the actual column percentage (the empirical value) and the percentage in the average profile (the value expected if there were no relation). To standardise the dimensions, this difference is divided by the square root of the average profile percentage (the expected value again), which is exactly analogous to the way the standardised residual and the chi-square statistic are calculated. Case solved!
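The analogy can be put side by side in symbols. The notation is an assumption made for this illustration: O and E are the observed and expected counts in a cell, p_ij is the share of row category i in column profile j, and r_i is the share of row category i in the average profile.

```latex
% Standardised residual and the chi-square statistic of a table
z = \frac{O - E}{\sqrt{E}}, \qquad \chi^{2} = \sum z^{2} = \sum \frac{(O - E)^{2}}{E}

% The same pattern for the distance between column profile j and the average profile
\frac{p_{ij} - r_i}{\sqrt{r_i}}, \qquad d^{2}(j, \text{average}) = \sum_{i} \frac{\left(p_{ij} - r_i\right)^{2}}{r_i}

% The 'price' term for primary education
\frac{(0.402 - 0.294)^{2}}{0.294} \approx 0.0397, \qquad \sqrt{0.0397} \approx 0.199
```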

This text presents a method for calculating distances between categories of a variable in a contingency table and on a perceptual map. The similarity between profiles apparent in the table can be translated, after some transformations, into results shown on a perceptual map resulting from correspondence analysis. This technique has many more interesting connections to geometry and physics, which is another incentive to study its secrets and practical use in data analysis.

