From crosstab to correspondence analysis: how exactly are the rows and columns of a crosstab transformed into coordinates used in a chart?

Reading the article will take you: 5 minutes.

Correspondence analysis is a technique for presenting visually the relationships typically found in a crosstab. Its main output is a chart, or perceptual map, in which data points represent the categories of the variables in question.

I am often asked how the values in a crosstab are transformed into the data points used in a perceptual map. In order to make the algorithm behind this easier to understand, we shall use a simple example. Let us imagine that in three European countries – Sweden, Poland, and Italy – we draw a sample of 100 people per country, and we note the hair color of each person. Here are hypothetical results that we might obtain:

We see that in Sweden the most common hair color is blond, whereas in Italy it is brown. In Poland, hair colors are distributed rather evenly, although the intermediate color is the most common. Now, we shall proceed to creating a perceptual map in which the data points represent the three countries and the three hair colors. The data points are arranged so that similar categories are located closer to each other, and those that differ are farther apart.

If we think about the countries, we intuitively feel that, as far as their inhabitants’ hair color is concerned, Poland is closer to Italy than to Sweden. Furthermore, blond is more common among Swedes than among Italians. Therefore, the data point representing Sweden should lie closer to the data point representing blond than the one representing Italy does. This is reflected in the perceptual map shown in Figure 1, created using the so-called column normalization method [1].

Figure 1. Overall setting chart made using column normalization


The data points presented in the chart have the following coordinates:

The question remains: how are the coordinates of each data point derived?

Standardizing the input matrix

The first step performed by the algorithm is standardization of the input matrix. A new matrix is created – let us call it the Z matrix – which will then be transformed further. The formula used to calculate the value in a single Z matrix cell is presented below:
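Assuming the standardization commonly used in correspondence analysis (the deviation of the observed count from the count expected under independence, divided by the square root of the product of the row and column totals), the formula can be written as follows. The subscripted symbols match the definitions given next.

```latex
\[
z_{ij} = \frac{f_{ij} - \dfrac{f_{i+}\, f_{+j}}{N}}{\sqrt{f_{i+}\, f_{+j}}}
\]
```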

fij is a single input matrix cell
fi+ is the sum of cell values in a row
f+j is the sum of cell values in a column
N is the sum of all cell values in the table

If I want to calculate the value of the first Z matrix cell (i.e. the one at the intersection of Sweden and blond), I have to read the actual data from the input matrix:

fij = 70, as it is the number of blond Swedes in the sample group
fi+ = 110, as it is the number of all blonds
f+j = 100, as it is the number of all Swedes in the table
N = 300, as it is the total size of the sample group

Now, I introduce those values into the formula and I obtain the value of the first Z matrix cell.
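As a quick arithmetic check, here is a minimal sketch of that substitution. It assumes the usual standardization, (observed - expected) / sqrt(row total × column total). The four numbers come from the text; the variable names are only illustrative.

```python
import math

# Values quoted in the text for the Sweden/blond cell.
f_ij = 70        # blond Swedes
f_i_plus = 110   # row total: all blonds
f_plus_j = 100   # column total: all Swedes
n = 300          # total sample size

# Assumed standardization: (observed - expected) / sqrt(row total * column total)
expected = f_i_plus * f_plus_j / n
z_11 = (f_ij - expected) / math.sqrt(f_i_plus * f_plus_j)

print(round(z_11, 4))  # 0.3178
```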

I need to use the same method to calculate values for all cells. The resulting Z matrix is presented below.
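The same calculation can be applied to the whole table at once. The sketch below uses NumPy and a HYPOTHETICAL crosstab: only the Sweden/blond cell (70), the blond total (110), and the 100 people per country come from the text. The remaining counts are invented to fit the description and are not the article's actual data. The standardization is again the assumed (observed - expected) / sqrt(row total × column total).

```python
import numpy as np

# HYPOTHETICAL counts, chosen only to be consistent with the text.
# Rows: blond, intermediate, brown. Columns: Sweden, Poland, Italy.
F = np.array([[70, 30, 10],
              [20, 40, 30],
              [10, 30, 60]], dtype=float)

row_totals = F.sum(axis=1)   # f_i+
col_totals = F.sum(axis=0)   # f_+j
n = F.sum()                  # N

# Product of the margins for every cell, then the assumed standardization.
margins = np.outer(row_totals, col_totals)
Z = (F - margins / n) / np.sqrt(margins)

print(np.round(Z, 4))
```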

With the matrix thus transformed, we may proceed to the second, and most important step of the algorithm.

Singular value decomposition of the matrix

Up until now, we have been preparing the data for analysis. Now we shall proceed to the very essence of correspondence analysis: the method used to reduce dimensions. The Z matrix we have obtained has to be decomposed: we have to ‘break’ it into three separate matrices – not just any matrices, but matrices that meet specific requirements (we shall deal with that shortly). The method used is called singular value decomposition, or SVD.

The three matrices obtained as a result of the decomposition will be denoted U, S, and V. Multiplying U by S and then by the transpose of V reconstructs the decomposed matrix (in our case, the Z matrix). But this is merely one of the requirements that the U, S, and V matrices should meet. The other conditions are the following:

  • Multiplying the transpose of the U matrix by the U matrix gives an identity matrix (i.e. a matrix that has ones on the diagonal, and zeros in all the other cells);
  • Likewise, multiplying the transpose of the V matrix by the V matrix gives an identity matrix;
  • The S matrix is a diagonal matrix – non-zero values should appear only on the diagonal, and all other cells should be filled with zeros.
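These requirements can be checked numerically. The sketch below relies on `numpy.linalg.svd`, which returns the singular values as a one-dimensional array and the V matrix already transposed. The crosstab is HYPOTHETICAL (only the value 70 for Sweden/blond and the margins of 110 and 100 come from the text), and the standardization is the assumed (observed - expected) / sqrt(row total × column total).

```python
import numpy as np

# HYPOTHETICAL counts. Rows: blond, intermediate, brown.
# Columns: Sweden, Poland, Italy.
F = np.array([[70, 30, 10],
              [20, 40, 30],
              [10, 30, 60]], dtype=float)
margins = np.outer(F.sum(axis=1), F.sum(axis=0))
Z = (F - margins / F.sum()) / np.sqrt(margins)

# s is a 1-D array of singular values; Vt is the transpose of V.
U, s, Vt = np.linalg.svd(Z, full_matrices=False)
S = np.diag(s)  # place the singular values on the diagonal

reconstructs = np.allclose(U @ S @ Vt, Z)                  # Z = U S V^T
u_orthonormal = np.allclose(U.T @ U, np.eye(U.shape[1]))   # U^T U = I
v_orthonormal = np.allclose(Vt @ Vt.T, np.eye(Vt.shape[0]))  # V^T V = I

print(reconstructs, u_orthonormal, v_orthonormal)
```

All three checks should print `True`. Note that the signs of the columns of U and V are not unique, so another SVD routine may return mirrored axes.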

As a result of applying the SVD method, we obtain the following matrices:

The requirement concerning the S matrix is the easiest to verify: non-zero values appear only on the diagonal, so it is a diagonal matrix. Those values are referred to as singular values. They are sorted in descending order: the first column contains the highest value, and the values decrease in the following columns. This is no coincidence; it happens with the decomposition of every table. It is very important, as the S matrix corresponds to the dimensions that we will see in the final result of correspondence analysis. Thanks to the ordering of the singular values, we may be sure that the first dimension shown in the resulting chart will always be the most important one.

The maximum number of dimensions for a table is equal to the number of rows or columns (whichever is lower), minus one. In this case, given that the table has 3 rows and 3 columns, the maximum number of dimensions is 3 - 1 = 2. Thus, our S matrix contains only two singular values.
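Both properties, the descending order and the number of usable dimensions, can be observed in a short sketch. It again uses a HYPOTHETICAL crosstab and the assumed standardization. A general-purpose SVD routine such as NumPy's returns three singular values for a 3×3 matrix, but the last one is zero up to rounding error, which leaves two dimensions.

```python
import numpy as np

# HYPOTHETICAL counts. Rows: blond, intermediate, brown.
# Columns: Sweden, Poland, Italy.
F = np.array([[70, 30, 10],
              [20, 40, 30],
              [10, 30, 60]], dtype=float)
margins = np.outer(F.sum(axis=1), F.sum(axis=0))
Z = (F - margins / F.sum()) / np.sqrt(margins)

# Only the singular values are needed here.
s = np.linalg.svd(Z, compute_uv=False)

is_descending = bool(np.all(np.diff(s) <= 0))
max_dims = min(F.shape) - 1            # min(rows, columns) - 1
n_nonzero = int(np.sum(s > 1e-10))     # singular values that are not numerically zero

print(np.round(s, 4), is_descending, max_dims, n_nonzero)
```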

Let us now check the requirements for U and V matrices.

Multiplying UT (the transpose of the U matrix) by the U matrix gives the following matrix:

Using the obtained U and V matrices to calculate initial coordinates

The final coordinates of the data points representing row and column categories depend on the normalization method selected. The names of normalization methods may differ from one statistical package to another; in IBM SPSS Statistics, as well as in PS IMAGO PRO, we may choose from symmetrical, row, column, and row-column normalization.

Before we select the normalization method and calculate the values of the coordinates, we must determine the initial coordinates (standard coordinates). For each row and column data point, we have to find two values – for the 1st and the 2nd dimension (x and y axis).
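Assuming the standard definition, the standard coordinates are obtained by dividing the entries of U (for rows) and V (for columns) by the square root of the corresponding row or column share of the total. The first formula applies to row data points and the second to column data points; the index s denotes the dimension, and the symbols r and c are introduced here only for illustration.

```latex
\[
r_{is} = \frac{u_{is}}{\sqrt{f_{i+} / N}}
\qquad\qquad
c_{js} = \frac{v_{js}}{\sqrt{f_{+j} / N}}
\]
```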

Let’s assume that I would like to calculate coordinates for Sweden. As we can see in the crosstab, Sweden is a column variable category, therefore I shall use the second formula. First, I calculate the x axis coordinate.

Next, I calculate the y axis coordinate:

We may use the same method to calculate standard coordinates for the other two countries and, using the first formula, for the row data points, i.e., hair color.
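A minimal sketch of this step for all six data points is shown below. It assumes that standard coordinates are the entries of U and V divided by the square roots of the row and column shares of the total. The crosstab is HYPOTHETICAL, so the printed numbers will not match the article's table.

```python
import numpy as np

# HYPOTHETICAL counts. Rows: blond, intermediate, brown.
# Columns: Sweden, Poland, Italy.
F = np.array([[70, 30, 10],
              [20, 40, 30],
              [10, 30, 60]], dtype=float)
n = F.sum()
row_mass = F.sum(axis=1) / n   # f_i+ / N
col_mass = F.sum(axis=0) / n   # f_+j / N

margins = np.outer(F.sum(axis=1), F.sum(axis=0))
Z = (F - margins / n) / np.sqrt(margins)

U, s, Vt = np.linalg.svd(Z, full_matrices=False)
k = min(F.shape) - 1           # keep the two meaningful dimensions
U, V = U[:, :k], Vt[:k].T

# Assumed definition of standard coordinates.
row_std = U / np.sqrt(row_mass)[:, None]   # hair colors
col_std = V / np.sqrt(col_mass)[:, None]   # countries

print(np.round(col_std, 3))  # one row per country: x and y
```

With this definition, the coordinates on each axis have a weighted mean of 0 and a weighted variance of 1, which is a convenient sanity check. The sign of each axis is arbitrary.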

From initial to final coordinates

We finally arrive at the stage where we may calculate the final coordinates of the data points. To do that, we multiply the standard coordinates by the singular values raised to a certain power. The exponent depends on the normalization method we want to use.
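In symbols, this multiplication can be written as below. Here r and c stand for the standard coordinates of row i and column j on dimension s (illustrative names), and the exponent α applies to row data points while β applies to column data points.

```latex
\[
\tilde{r}_{is} = r_{is}\, \lambda_s^{\alpha}
\qquad\qquad
\tilde{c}_{js} = c_{js}\, \lambda_s^{\beta}
\]
```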

λ – singular value for a specific dimension

α and β values depend on the normalization method as follows:

  • Row normalization: α = 1; β = 0
  • Column normalization: α = 0; β = 1
  • Symmetrical normalization: α = ½; β = ½
  • Row and column normalization: α = 1; β = 1

At the beginning of this entry, I presented a perceptual map created using column normalization. Let us stick to that choice and calculate the coordinates for all data points. The coordinates of the row data points (hair color) remain unchanged, as λ raised to the power of 0 equals 1; for them, we simply keep the standard coordinates. For the column data points (countries), the situation is different: their coordinates have to be multiplied by the singular values corresponding to the respective dimensions. That is how we obtain the principal coordinates.
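The whole procedure can be put together in one short function. This is a sketch under the assumptions used throughout: the standardization (observed - expected) / sqrt(row total × column total), standard coordinates obtained by dividing U and V by the square roots of the row and column shares, and a HYPOTHETICAL crosstab. The function name `ca_coordinates` is made up for this example.

```python
import numpy as np

def ca_coordinates(F, alpha, beta):
    """Return row coordinates, column coordinates, and singular values.

    alpha is the exponent applied to row points, beta to column points.
    """
    F = np.asarray(F, dtype=float)
    n = F.sum()
    row_totals, col_totals = F.sum(axis=1), F.sum(axis=0)
    margins = np.outer(row_totals, col_totals)
    Z = (F - margins / n) / np.sqrt(margins)          # standardized matrix

    U, s, Vt = np.linalg.svd(Z, full_matrices=False)
    k = min(F.shape) - 1                              # number of dimensions
    U, s, V = U[:, :k], s[:k], Vt[:k].T

    row_std = U / np.sqrt(row_totals / n)[:, None]    # standard coordinates
    col_std = V / np.sqrt(col_totals / n)[:, None]
    return row_std * s**alpha, col_std * s**beta, s

# HYPOTHETICAL counts. Rows: blond, intermediate, brown.
# Columns: Sweden, Poland, Italy.
F = [[70, 30, 10],
     [20, 40, 30],
     [10, 30, 60]]

# Column normalization: alpha = 0 (rows unchanged), beta = 1 (columns scaled).
rows, cols, s = ca_coordinates(F, alpha=0, beta=1)
print(np.round(rows, 3))
print(np.round(cols, 3))
```

Passing other values of `alpha` and `beta` gives the other normalization methods. Because the signs of singular vectors are arbitrary, a chart produced by another package may appear mirrored along one or both axes.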

In this article, I set out to explain the way in which correspondence analysis determines the coordinates of the data points representing the rows and columns of a crosstab. As you can see, the process is rather complicated, but once you understand it, you will be more confident when using the correspondence analysis technique in practice.

[1] The choice of normalization type is a subject that should be discussed separately.

