Chapter 4 Missing values
In this part, we study the missing values in the dataset. Understanding where values are missing and handling them in a feasible way helps make our analysis more accurate.
First, a quick review of our dataset shows clearly that there are many missing values (NA entries) in our data frame.
4.1 Number of Missing values for each variable
Variable | Missing values |
---|---|
X | 0 |
date | 0 |
state | 0 |
tot_cases | 0 |
new_case | 0 |
tot_death | 0 |
new_death | 0 |
hospital_onset_covid | 276 |
inpatient_beds | 201 |
inpatient_beds_used | 275 |
inpatient_beds_used_covid | 210 |
previous_day_admission_adult_covid_confirmed | 6073 |
previous_day_admission_adult_covid_suspected | 6214 |
previous_day_admission_pediatric_covid_confirmed | 6478 |
previous_day_admission_pediatric_covid_suspected | 6474 |
staffed_adult_icu_bed_occupancy | 6485 |
staffed_icu_adult_patients_confirmed_and_suspected_covid | 6411 |
staffed_icu_adult_patients_confirmed_covid | 6261 |
total_adult_patients_hospitalized_confirmed_and_suspected_covid | 5936 |
total_adult_patients_hospitalized_confirmed_covid | 5811 |
total_pediatric_patients_hospitalized_confirmed_and_suspected_covid | 6476 |
total_pediatric_patients_hospitalized_confirmed_covid | 6476 |
total_staffed_adult_icu_beds | 6333 |
inpatient_beds_utilization | 275 |
percent_of_inpatients_with_covid | 298 |
inpatient_bed_covid_utilization | 275 |
adult_icu_bed_covid_utilization | 6644 |
adult_icu_bed_utilization | 6637 |
From this table we can see clearly that 7 columns have no missing values, 7 columns have between 201 and 298 missing values, and 14 columns have more than 5,000 missing values (between 5,811 and 6,644).
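A per-variable count like the one in the table can be computed in base R with `colSums(is.na(...))`. The sketch below is only an illustration: the data frame `covid` and its values are made up, and only a few of the column names are borrowed from the table.

```r
# Toy stand-in for the chapter's data frame (values are invented for illustration).
covid <- data.frame(
  state = c("NY", "NY", "CA", "CA"),
  tot_cases = c(10, 20, 5, 8),
  inpatient_beds = c(100, NA, 80, 85),
  adult_icu_bed_utilization = c(NA, NA, 0.7, NA)
)

# is.na() gives a logical matrix; summing each column counts its NA entries.
na_counts <- colSums(is.na(covid))

# Columns without any missing values.
complete_cols <- names(na_counts)[na_counts == 0]

print(na_counts)
print(complete_cols)
```

Sorting `na_counts` with `sort(na_counts, decreasing = TRUE)` makes the groups of columns with similar counts easier to spot.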
This plot visualizes the amount of missing data: it shows the location of missing values in black, the overall percentage of missing values (in the legend), and the percentage of missing values in each variable. From this graph we can see that most missing values occur together in the same rows, so they are not missing at random. After diving deeper into the data source, we found that this pattern arises because states do not always use consistent data collection methods over time.
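The description above matches the kind of plot drawn by `vis_miss`, which the naniar package re-exports from visdat. Whether that function was used here is an assumption on our part. The sketch below uses an invented toy data frame, computes the overall percentage in base R, and draws the plot only if naniar is installed.

```r
# Toy data (invented); the two ICU columns tend to be missing in the same rows.
covid <- data.frame(
  tot_cases = c(10, 20, 5, 8),
  total_staffed_adult_icu_beds = c(40, NA, NA, 35),
  adult_icu_bed_utilization = c(0.6, NA, NA, NA)
)

# Share of all cells that are NA, as a percentage.
overall_pct <- mean(is.na(covid)) * 100

# Share of NA entries within each column, as a percentage.
pct_by_var <- colMeans(is.na(covid)) * 100

print(overall_pct)
print(pct_by_var)

# Draw the missingness map only when the package is available.
if (requireNamespace("naniar", quietly = TRUE)) {
  print(naniar::vis_miss(covid))
}
```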
4.2 Patterns of Missing values
The plot above shows more clearly the percentage of missing values in each variable.
An upset plot from the UpSetR package can be used to visualize the patterns of missingness, that is, the combinations of missingness across cases. To see these combinations and the intersections of missingness among variables, we use the gg_miss_upset function from the naniar package:
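A call of this kind might look like the following sketch. The data frame is an invented toy example, not the chapter's dataset, and the plot is drawn only if naniar and UpSetR are installed.

```r
# Toy data (invented) with three variables that contain missing values.
covid <- data.frame(
  tot_cases = c(10, 20, 5, 8, 12),
  staffed_adult_icu_bed_occupancy = c(NA, NA, 30, NA, 25),
  adult_icu_bed_utilization = c(NA, NA, 0.7, NA, NA),
  adult_icu_bed_covid_utilization = c(NA, NA, 0.2, NA, NA)
)

# An upset plot of missingness is only meaningful when at least two
# variables contain missing values, so check that first.
vars_with_na <- names(which(colSums(is.na(covid)) > 0))
print(vars_with_na)

if (length(vars_with_na) >= 2 &&
    requireNamespace("naniar", quietly = TRUE) &&
    requireNamespace("UpSetR", quietly = TRUE)) {
  # Extra arguments (for example nsets) are passed on to UpSetR::upset().
  print(naniar::gg_miss_upset(covid))
}
```

By default, `UpSetR::upset()` shows five sets, which would explain why the plot discussed here covers five variables.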
6,467 cases are missing those five columns together;
152 cases are missing adult_icu_bed_occupancy and adult_icu_bed_covid_utilization_NA.
Note that most missing values come from those five variables and that these variables are usually missing together. We need to take this into account in the following analysis.
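The counts behind such patterns can also be tallied without a plot. The following base R sketch, again on invented toy data, labels each row by the set of columns it is missing and counts how often each combination occurs.

```r
# Toy data (invented for illustration).
covid <- data.frame(
  tot_cases = c(10, 20, 5, 8, 12),
  total_staffed_adult_icu_beds = c(NA, NA, 30, NA, 25),
  adult_icu_bed_utilization = c(NA, NA, 0.7, NA, NA)
)

# Logical matrix: TRUE where a value is missing.
miss <- is.na(covid)

# Label each row with the names of the columns it is missing.
pattern <- apply(miss, 1, function(r) paste(colnames(miss)[r], collapse = " + "))
pattern[pattern == ""] <- "(complete)"

# Count how often each combination occurs, most frequent first.
pattern_counts <- sort(table(pattern), decreasing = TRUE)
print(pattern_counts)
```

A dominant combination in `pattern_counts` corresponds to a tall bar in the upset plot. Applied to the real data, a tally like this is one way to check how often the variables really are missing together.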