Impact of data gaps on the results of the analysis
Data gaps can have a major impact on the results of data analysis. They can distort results by omitting relevant information. If data gaps are not properly addressed, the analysis may be incomplete or even wrong. On the other hand, excluding observations with data gaps can lead to the loss of valuable information, which can reduce the reliability of the results. It is also worth bearing in mind that imputing incomplete information is not always error-free and carries its own risk of distortion.
The main problems that arise from data gaps include:
- Reduction in the accuracy of analyses and statistical power: lack of data often leads to a smaller sample used in the analysis, which can reduce the statistical power of the tests. Lower statistical power means a greater risk of making a type II error and failing to detect significant effects or differences that actually occur. When there are large amounts of missing data, the accuracy of the analysis can be significantly reduced, limiting the reliability of the results.
- Distortion of results: when the observations with missing values differ systematically from the rest, the remaining data no longer represent the phenomenon under study, so estimates can be biased and conclusions erroneous. Missing data also mean the loss of information that could be relevant to the analysis.
- Impact on predictive models: missing data can degrade the quality of predictions, because the models are trained on incomplete data.
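The link between sample size and statistical power described in the first point can be illustrated with a rough calculation. The sketch below assumes a two-sided, two-sample z-test with equal group sizes and a standardised effect size of 0.5; the function name `approx_power` and all numbers are illustrative assumptions, not values from the analyses in this article.

```python
from statistics import NormalDist


def approx_power(effect_size: float, n_per_group: int, alpha: float = 0.05) -> float:
    """Approximate power of a two-sided, two-sample z-test with equal group sizes."""
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)
    # Expected value of the test statistic when the effect really exists.
    shift = effect_size * (n_per_group / 2) ** 0.5
    # Probability of rejecting the null hypothesis in either tail.
    return z.cdf(shift - z_crit) + z.cdf(-shift - z_crit)


full = approx_power(0.5, 64)     # complete sample
reduced = approx_power(0.5, 32)  # half of the cases lost to missing data

print(f"power with 64 cases per group: {full:.2f}")
print(f"power with 32 cases per group: {reduced:.2f}")
print(f"type II error risk rises from {1 - full:.2f} to {1 - reduced:.2f}")
```

Under these assumptions, losing half of the observations roughly doubles the risk of a type II error, which is the mechanism the first point describes.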
In order to choose the right method to handle data deficiencies, it is important to understand the nature and cause of the deficiency in detail.
Basic types of data deficiencies
The most common division of data deficiencies found in the literature relates to whether or not they are the result of chance - whether they are random or not, and whether other factors influence their occurrence. Identifying the mechanisms of their occurrence and indicating which type of deficiencies the analyst is dealing with is important in order to choose appropriate methods to deal with them.
We divide data deficiencies into:
- MCAR (Missing completely at random): missing data are completely random and do not depend on any other values in the data set. This means that the probability of missing data is the same for all observations, regardless of the values of both observed and unobserved data.
Let us look at Table 1, in which we have several variables: age, education and income. Missing data in the ‘Age’ column could be an example of MCAR, assuming that the omission of this value is completely random and does not depend on other variables in the dataset. An example would be a respondent accidentally skipping the age question in a survey.
- MAR (Missing at random): the occurrence of missing data may depend on other observed data in the set, but not on the value of the missing data itself.
Incomplete information in the ‘Income’ column may be an example of MAR-type missing values if we assume that the probability of skipping the income question increases with the age of the respondents. The probability of a gap, however, does not depend directly on the unknown income values.
- MNAR (Missing not at random): missing data are not random, and the probability of their occurrence depends on the missing values themselves.
Incomplete information in the ‘Education’ column could be an example of MNAR if we assume that the lower the education, the more likely it is that education data will not be provided. In this case, the missing value is directly related to the value that should be given, introducing a systematic error in the missing data.
| No | Age | Education | Income |
|----|-----|-----------|--------|
| 1 | 25 | NA | 4 000 |
| 2 | NA | NA | 4 500 |
| 3 | 35 | Secondary | NA |
| 4 | 45 | Secondary | 5 500 |
| 5 | 55 | Secondary | NA |
* NA (Not Available)
Recognising the type of data gaps in the dataset being analysed is key to choosing the right method to handle them. Each type requires a different strategy if the analysis is to be reliable and the conclusions sound.
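The practical difference between the three mechanisms can be seen in a small simulation. The sketch below uses synthetic data, assuming that income rises with age; all coefficients, probabilities and names such as `observed_mean` are assumptions made for illustration. For simplicity, the MNAR case is shown on income rather than education.

```python
import random
from statistics import mean

random.seed(42)

# Synthetic respondents: income rises with age (an assumption of this sketch).
n = 20_000
ages = [random.uniform(20, 65) for _ in range(n)]
incomes = [2000 + 60 * a + random.gauss(0, 500) for a in ages]


def observed_mean(values, missing_prob):
    """Mean of the values that remain, given each row's probability of being missing."""
    kept = [v for v, p in zip(values, missing_prob) if random.random() >= p]
    return mean(kept)


true_mean = mean(incomes)

# MCAR: every income has the same 30% chance of being missing.
mcar_mean = observed_mean(incomes, [0.3] * n)

# MAR: missingness depends only on the observed age, not on income itself.
mar_mean = observed_mean(incomes, [0.6 if a > 45 else 0.1 for a in ages])

# MNAR: missingness depends on the unobserved income value itself.
mnar_mean = observed_mean(incomes, [0.6 if y > true_mean else 0.1 for y in incomes])

print(f"true mean income:    {true_mean:8.1f}")
print(f"observed under MCAR: {mcar_mean:8.1f}")
print(f"observed under MAR:  {mar_mean:8.1f}")
print(f"observed under MNAR: {mnar_mean:8.1f}")
```

In this setup the MCAR mean stays close to the true value, while the MAR and MNAR means are both too low. The difference is that the MAR bias can be corrected using the observed age, whereas the MNAR bias cannot be removed from the observed data alone.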
Methods of dealing with data gaps
Ways of dealing with data gaps range from very simple approaches to complex imputation methods, each with its own advantages and limitations. The most basic approach is to delete or omit the data that contain gaps. Data imputation is the more complex and widely used alternative. Whether imputation relies on statistical models or on machine learning algorithms, the method requires careful selection and validation to ensure that the imputation process does not introduce additional errors or biases. It is also crucial to understand the mechanism that led to missing data, as different mechanisms may require different coping strategies.
Popular methods for dealing with missing data:
- Deleting and skipping data: when observations with missing data are deleted, they are removed from the dataset being analysed. With data omission, by contrast, such observations remain in the dataset but are not considered in the analysis, so the analyst retains the option of using them in further research. Both variants lose information, and a statistical model built on the reduced data may be inaccurate. The approach is only valid if the gaps are of the MCAR type. If the missing data are not completely random, it is not safe to remove observations with missing values or to rely on single imputation; in such situations, multiple imputation should be used. When the data are MCAR and only a few values are missing, excluding the affected observations will not noticeably bias parameter estimates, so deletion may be used.
- Single imputation: missing data are replaced by values calculated from existing data. Missing values can be replaced by a constant, e.g., the mean, median or mode of a given variable. They can also be predicted from the available data by linear regression or other regression techniques. However, it is worth bearing in mind that substituting the mean or median distorts the distribution of the variable (for example, by understating its variance) and can make the data unrepresentative, especially if the data are not randomly missing.
- Multiple imputation: for each missing value, several possible surrogate values are generated, thus creating multiple complete data sets. Each is then analysed separately, and the results of these analyses are combined to produce final estimates and inferences that take into account the uncertainty associated with the missing data. The advantage of multiple imputation is that it allows the estimation of not only the value of missing data, but also the uncertainty associated with these estimates. This approach makes the results of the analyses more reliable and less prone to errors caused by the arbitrary selection of a single value for each missing item. Multiple imputation is particularly useful in studies where missing data are unavoidable and their omission or inappropriate handling could lead to erroneous conclusions. The method takes into account the interdependencies between variables in the dataset, which allows missing values to be estimated more accurately.
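The first two approaches can be tried out on the five rows of Table 1. The sketch below is a minimal illustration in plain Python, with `None` standing for NA; the variable names are chosen for this example only.

```python
from statistics import mean, variance

# Rows of Table 1; None stands for NA.
rows = [
    {"age": 25, "education": None, "income": 4000},
    {"age": None, "education": None, "income": 4500},
    {"age": 35, "education": "Secondary", "income": None},
    {"age": 45, "education": "Secondary", "income": 5500},
    {"age": 55, "education": "Secondary", "income": None},
]

# Deletion (complete-case analysis): keep only rows with no missing value.
complete_cases = [r for r in rows if all(v is not None for v in r.values())]

# Omission (available-case analysis): use every observed value of one variable.
observed_income = [r["income"] for r in rows if r["income"] is not None]
income_mean = mean(observed_income)

# Single imputation: replace each missing income with the observed mean.
imputed_income = [
    r["income"] if r["income"] is not None else income_mean for r in rows
]

print(f"complete cases left after deletion: {len(complete_cases)} of {len(rows)}")
print(f"mean of observed incomes: {income_mean:.2f}")
print(f"income variance before imputation: {variance(observed_income):.0f}")
print(f"income variance after mean imputation: {variance(imputed_income):.0f}")
```

Deletion leaves a single complete row, which shows how quickly information is lost. Mean imputation keeps all five rows and the same mean, but the variance of income drops, because the imputed values add no spread of their own.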
The analyses presented in this article were carried out with the help of PS IMAGO PRO.
Imputation of data gaps in PS IMAGO PRO
PS IMAGO PRO offers a number of methods for dealing with missing data, including those mentioned above. The analyst can skip observations with missing data directly in the windows of the analysis procedures; if this approach is chosen, observations with an excessive number of data gaps can be quickly found and removed. For single imputation, the Missing Value Analysis procedure helps to safeguard data quality and to verify and replace missing data reliably. This functionality allows you to:
- obtain accurate statistics on patterns of data deficiencies,
- obtain estimates of statistics for different methods of determining deficiencies,
- check whether data deficiencies are random,
- perform single substitution of missing data (using regression or EM methods).
Figure 1: Layout of data gaps - a graph showing the patterns of data gaps, created in PS IMAGO PRO. The graph shows that pattern 1 has no missing data and pattern 17 has missing data for the variables culture, gender and age.
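The idea behind such a pattern summary can be sketched in a few lines. The example below does not reproduce the output of PS IMAGO PRO; it simply counts, for the five rows of Table 1, how often each combination of missing variables occurs, with `None` standing for NA.

```python
from collections import Counter

variables = ("age", "education", "income")

# Rows of Table 1 in the order (age, education, income); None stands for NA.
rows = [
    (25, None, 4000),
    (None, None, 4500),
    (35, "Secondary", None),
    (45, "Secondary", 5500),
    (55, "Secondary", None),
]

# A pattern marks each variable as missing (True) or observed (False).
patterns = Counter(tuple(value is None for value in row) for row in rows)

for pattern, count in patterns.most_common():
    missing = [name for name, is_missing in zip(variables, pattern) if is_missing]
    label = ", ".join(missing) if missing else "no missing data"
    print(f"{count} row(s): {label}")
```

Each distinct tuple corresponds to one pattern of the kind shown in the graph; rows 3 and 5 of Table 1 share the same pattern, in which only income is missing.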
PS IMAGO PRO also offers a fully automatic multiple imputation mode that selects the most appropriate imputation method based on the characteristics of the data, while leaving the user free to customise the model.
This procedure generates several possible values for each missing item and creates several complete data sets. For each of these, the user receives a summary of the results, together with one pooled result combined from all the created sets.
The Multiple Imputation procedure is useful when the missing data are not completely random and makes it possible to obtain:
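A standard way of combining results from several imputed data sets is known as Rubin's rules. The text does not state how PS IMAGO PRO computes its combined result, so the sketch below should be read only as an illustration of the general principle; the function name `pool_rubin` and the input numbers are hypothetical.

```python
from statistics import mean, variance


def pool_rubin(estimates, variances):
    """Combine per-dataset estimates and their squared standard errors."""
    m = len(estimates)
    pooled = mean(estimates)       # combined point estimate
    within = mean(variances)       # average sampling variance within the data sets
    between = variance(estimates)  # variance between data sets, caused by imputation
    total = within + (1 + 1 / m) * between
    return pooled, total


# Hypothetical mean-income estimates from m = 3 imputed data sets (made-up numbers).
estimates = [4600.0, 4700.0, 4800.0]
variances = [2500.0, 2500.0, 2500.0]

pooled, total = pool_rubin(estimates, variances)
print(f"pooled estimate: {pooled:.1f}")
print(f"pooled standard error: {total ** 0.5:.1f}")
```

The between-imputation term is what single imputation cannot provide: the more the imputed data sets disagree, the larger the pooled standard error, which reflects the uncertainty caused by the missing data.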
- the most accurate substitution of missing data through multiple imputation,
- automation of the imputation process,
- visualisation and diagnostics that help in understanding the patterns of missing data.
The use of PS IMAGO PRO in dealing with missing data increases the accuracy of analyses in the presence of incomplete information by automating the imputation process and preserving key statistical assumptions. The Missing Value Analysis and Multiple Imputation procedures reduce data preparation time and facilitate the assessment of imputation quality through detailed reports. This approach not only facilitates accurate estimation, but also helps in the clear interpretation and effective communication of research results.
Summary
Data gaps are a common problem in quantitative data analysis. They have the potential to distort results and reduce the reliability of conclusions. Various methods are used to deal with missing values, ranging from simple deletion of observations to more complex imputation techniques such as single or multiple imputation. Multiple imputation, distinguished by its ability to generate multiple potential data sets with different estimated values for missing data, allows for a more comprehensive and reliable analysis. The Multiple Imputation procedure in PS IMAGO PRO allows the analyst to automate the imputation process and provides detailed reports and diagnostics that help in understanding the nature and patterns of missing data. This approach not only makes missing data easier to handle, but also increases the accuracy and reliability of the analysis performed.