The dataset from The Washington Post about fatal police shootings in the U.S. has 17 different types of information, or “features,” like ID numbers, names, and the date of the shooting. Some of these features have missing data.
Here’s a simple breakdown of what we have:
– ID number: A unique number for each case. There are 8002 of these.
– Name: The name of the person shot. There are 7548 names listed.
– Date of occurrence: The date when the shooting happened, with 8002 dates given.
– Manner of death: Tells us if the person was shot or shot and tasered. All 8002 cases include this info.
– Age: How old the person was. We know this for 7499 people.
– Sex: Male or female. This is noted for 7971 cases.
– Race: Categorized as White, Asian, Hispanic, or Black.
– City: The city where it happened, listed for all 8002 cases.
– State: The state where it happened, also listed for all 8002 cases.
– Signs of mental illness: Yes or no answer, available for all 8002 cases.
– Threat level: Describes if the person was attacking or not, given in all 8002 records.
– Flee: Tells us if the person tried to run away, noted in 7037 cases.
– Body camera: Yes or no to whether there was a body camera, known for all 8002 cases.
– Longitude and Latitude: These tell us the specific location, but we only have these for 7163 cases.
– Is geocoding exact: A yes or no answer about whether the recorded location is exact, available for all 8002 cases.
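Counts like the ones above can be produced with a few lines of pandas. The sketch below is only an illustration: the small `sample` table and its values are made up to stand in for the real file, and the column names are assumptions rather than the dataset's exact headers.

```python
import pandas as pd

# Tiny made-up sample standing in for the real dataset (values are illustrative only).
sample = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "name": ["A", None, "C", "D"],
    "age": [34.0, None, 51.0, None],
    "flee": ["Not fleeing", "Car", None, "Foot"],
})

# Number of non-missing entries per feature (the kind of count listed above).
non_missing = sample.count()

# Number of missing entries per feature.
missing = sample.isna().sum()

# Put both counts side by side for a quick overview.
summary = pd.DataFrame({"non_missing": non_missing, "missing": missing})
print(summary)
```

On the real data, the same two calls would show at a glance which features are complete and which have gaps.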
We have a few ways to deal with the incomplete data:
– Cut down the data: Keep only the cases (rows) where every feature is filled in. The result is complete but has fewer cases.
– Leave out incomplete parts: Remove every feature (column) that has any missing values. All 8002 cases stay, but we lose a lot of information about each one.
– Fill in the gaps: Estimate the missing values (for example, with the median age) so that every feature has 8002 entries. This keeps all the data, but the estimated values might not be accurate.
– Ignore less useful information: For example, a person’s name might not tell us much about the shooting, so we might not use it in our analysis.
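The four options above can be sketched in pandas as follows. This is a minimal illustration on a made-up `sample` table, not the analysis itself; the column names, the use of the median for age, and the "Unknown" placeholder for names are all assumptions chosen for the example.

```python
import pandas as pd

# Tiny made-up sample standing in for the real dataset (values are illustrative only).
sample = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "name": ["A", None, "C", "D"],
    "age": [34.0, None, 51.0, None],
    "state": ["CA", "TX", "NY", "FL"],
})

# 1. Cut down the data: keep only rows with no missing values.
complete_rows = sample.dropna(axis=0)

# 2. Leave out incomplete parts: keep only columns with no missing values.
complete_cols = sample.dropna(axis=1)

# 3. Fill in the gaps: replace missing values with an estimate.
#    Median age and an "Unknown" label are assumptions, one choice among many.
filled = sample.copy()
filled["age"] = filled["age"].fillna(filled["age"].median())
filled["name"] = filled["name"].fillna("Unknown")

# 4. Ignore less useful information: drop a feature we do not plan to analyze.
without_name = sample.drop(columns=["name"])

print(len(complete_rows), list(complete_cols.columns))
```

In this toy table, option 1 keeps two of the four rows, option 2 keeps only the `id` and `state` columns, and option 3 keeps everything at the cost of two estimated ages.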
What we do with the data depends on what we want to find out from it. Each choice has its upsides and downsides, like having all the information versus keeping the data accurate.