Chapter 4 Missing values

In this part, we do some study on the missing value of the dataset. Having better understanding of the missing value and dealing with missing values in feasible ways can help our analysis to be more accurate.

First we can have a quick review of our dataset, we can see clearly that their are many missing values (NA entries) in our dataframe.

4.1 Number of Missing values for each variable

x
X 0
date 0
state 0
tot_cases 0
new_case 0
tot_death 0
new_death 0
hospital_onset_covid 276
inpatient_beds 201
inpatient_beds_used 275
inpatient_beds_used_covid 210
previous_day_admission_adult_covid_confirmed 6073
previous_day_admission_adult_covid_suspected 6214
previous_day_admission_pediatric_covid_confirmed 6478
previous_day_admission_pediatric_covid_suspected 6474
staffed_adult_icu_bed_occupancy 6485
staffed_icu_adult_patients_confirmed_and_suspected_covid 6411
staffed_icu_adult_patients_confirmed_covid 6261
total_adult_patients_hospitalized_confirmed_and_suspected_covid 5936
total_adult_patients_hospitalized_confirmed_covid 5811
total_pediatric_patients_hospitalized_confirmed_and_suspected_covid 6476
total_pediatric_patients_hospitalized_confirmed_covid 6476
total_staffed_adult_icu_beds 6333
inpatient_beds_utilization 275
percent_of_inpatients_with_covid 298
inpatient_bed_covid_utilization 275
adult_icu_bed_covid_utilization 6644
adult_icu_bed_utilization 6637

From this table we can see clearly that there are 5 columns have about 200 missing values while 14 columns have more than 5,000 missing values.

This plot provides a specific visualization of the amount of missing data, showing in black the location of missing values, and also providing information on the overall percentage of missing values overall (in the legend), and in each variable. From this graph we can see that most missing values happen together, they are not random missing values. After diving deeper into the data source, we know that the missing pattern is because states don’t always have consistent data collection methods over time.

4.2 Patterns of Missing values

From the plot above, we can see more clear about the percentage of missing values in each variables.

An upset plot from the UpSetR package can be used to visualize the patterns of missingness, or rather the combinations of missingness across cases. To see combinations of missingness and intersections of missingness amongst variables, use the gg_miss_upset function:

  • 6467 cases missing those five columns together;

  • 152 cases missing adult_icu_bed_occupancy and adult_icu_bed_covid_utilization_NA.

What we need to note is that most missing values are from those five variables and they always show missing values together. This is what we need to consider in our following analysis.