Cohen’s d
Cohen’s d is a statistic that helps to understand the size of an effect, like the difference in age, by standardizing the difference between two group means. It’s calculated by taking the difference between the two means (for example, the average age of Black and White individuals killed by police) and dividing it by the combined standard deviation of both groups.
This standard deviation is a “pooled” value, which means it takes into account the number of people in each group and their respective standard deviations to get an overall measure of variability.
In the case we’re looking at, Cohen’s d was calculated to be 0.577485. By the standard guidelines (roughly 0.2 small, 0.5 medium, 0.8 large), this is a medium effect size. What this means is that the age difference of 7.3 years between Black and White individuals killed by police is neither large nor small, but moderate: it is a noticeable difference that is unlikely to be due to chance, but not an overwhelmingly large one.
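Since we plan a Python equivalent of the Mathematica analysis, here is a minimal sketch of the computation described above, using made-up toy numbers rather than the actual dataset:

```python
import math

def cohens_d(group_a, group_b):
    """Cohen's d: difference in group means divided by the pooled standard deviation."""
    n1, n2 = len(group_a), len(group_b)
    m1 = sum(group_a) / n1
    m2 = sum(group_b) / n2
    # Sample variances (n - 1 in the denominator).
    v1 = sum((x - m1) ** 2 for x in group_a) / (n1 - 1)
    v2 = sum((x - m2) ** 2 for x in group_b) / (n2 - 1)
    # The pooled standard deviation weights each group's variance by its size.
    pooled_sd = math.sqrt(((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2))
    return (m1 - m2) / pooled_sd
```

With the real age lists for the two groups in place of toy data, this function reproduces the kind of value reported above.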
Extension of the previous update
K-means clustering is a method that groups data into k clusters by assigning each data point to the cluster with the closest mean value. A report illustrates that when k-means is applied to a dataset shaped like a lemniscate (an infinity symbol), it successfully divides the data into two clear clusters when k is set to 2. Increasing k to 4 still yields a reasonable outcome, splitting the data into smaller, more defined clusters.
K-medoids clustering is similar to k-means but it uses the most central point of a cluster, known as the medoid, instead of the mean. In the case of the lemniscate data, k-medoids also effectively separates the data into two clusters for k = 2. For k = 4, k-medoids creates clusters around the most central data points, providing a result that’s similar to k-means but focused on the medoids rather than means.
DBSCAN clustering works differently by forming clusters based on areas of high data point density. It is less influenced by outliers compared to k-means and k-medoids. With the lemniscate dataset, DBSCAN identified four clusters, recognizing areas of density that are separated by less dense regions.
Some key observations from the report include:
– K-means and k-medoids will partition data into exactly k clusters, even if the natural number of clusters is different. This makes the choice of k crucial.
– DBSCAN doesn’t require setting the number of clusters beforehand. It can find a varying number of clusters based on the density of the dataset, potentially providing a more natural clustering for certain types of data.
– For data with clear separations, like the lemniscate shape, k-means and k-medoids can be effective if the correct value of k is chosen.
– DBSCAN’s strength lies in its ability to manage noisy data and discover clusters without needing a pre-specified number of clusters. This can be particularly useful for datasets where the number of clusters is not known in advance or is uneven.
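To make the k-means mechanics concrete, here is a minimal pure-Python sketch of the two steps the algorithm alternates (assign points to the nearest centroid, then move each centroid to the mean of its points). This is an illustrative toy, not the Mathematica code used in class:

```python
import math
import random

def kmeans(points, k, iters=100, seed=0):
    """Minimal k-means for 2-D points: alternate an assignment step and
    an update step until the centroids stop moving."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iters):
        # Assignment step: each point joins the nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            idx = min(range(k),
                      key=lambda i: (p[0] - centroids[i][0]) ** 2
                                    + (p[1] - centroids[i][1]) ** 2)
            clusters[idx].append(p)
        # Update step: centroid = mean of its cluster (keep old if empty).
        new_centroids = [
            (sum(p[0] for p in c) / len(c), sum(p[1] for p in c) / len(c))
            if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
        if new_centroids == centroids:  # converged
            break
        centroids = new_centroids
    return centroids, clusters
```

Note how the code always produces exactly k clusters, which is the key observation above: the algorithm imposes k on the data whether or not k matches the natural structure.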
Update
In today’s session, we discussed the decision tree algorithm, a form of supervised learning that can be applied to both classification and regression tasks. This algorithm constructs a tree-like model of decisions, where:
– Internal nodes represent tests on the features of the dataset.
– Branches represent decision rules.
– Leaf nodes represent the outcome or decision made after computing all features.
A decision tree is built by splitting the source set into subsets based on an attribute value test. This process is repeated on each derived subset in a recursive manner called recursive partitioning. The recursion is completed when the subset at a node has the same value of the target variable, or when splitting no longer adds value to the predictions.
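Recursive partitioning can be sketched in a few lines of Python. The toy below works on a single numeric feature, picks the threshold with the fewest misclassifications, and recurses until a node is pure; it is an illustration of the idea, not the algorithm we will actually apply to our dataset:

```python
def build_tree(rows, depth=0, max_depth=5):
    """rows: list of (value, label) pairs for one numeric feature.
    Recursive partitioning: split on the best threshold, recurse on
    each side, and stop when a node is pure (or too deep)."""
    labels = [y for _, y in rows]
    if len(set(labels)) == 1 or depth == max_depth:
        # Leaf: predict the majority label of this subset.
        return max(set(labels), key=labels.count)

    def errors(t):
        left = [y for x, y in rows if x <= t]
        right = [y for x, y in rows if x > t]
        if not left or not right:
            return len(rows)  # disallow splits that leave one side empty
        # Misclassifications if each side predicts its majority label.
        return (len(left) - left.count(max(set(left), key=left.count))
                + len(right) - right.count(max(set(right), key=right.count)))

    t = min((x for x, _ in rows), key=errors)
    left = [r for r in rows if r[0] <= t]
    right = [r for r in rows if r[0] > t]
    if not left or not right:
        return max(set(labels), key=labels.count)
    return {"threshold": t,
            "left": build_tree(left, depth + 1, max_depth),
            "right": build_tree(right, depth + 1, max_depth)}

def predict(tree, x):
    """Follow the decision rules from the root down to a leaf."""
    while isinstance(tree, dict):
        tree = tree["left"] if x <= tree["threshold"] else tree["right"]
    return tree
```

The recursion stops exactly as described above: either the subset at a node all has the same label, or further splitting (here, past the depth limit) no longer adds value.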
This algorithm is particularly powerful because:
– It includes automatic feature selection.
– It doesn’t require much data pre-processing.
– It is easy to interpret and understand.
Decision trees also form the building blocks of Random Forests, which are an ensemble of decision trees trained on various sub-samples of the dataset. This makes Random Forest one of the most robust machine learning algorithms available, capable of performing both classification and regression tasks with high accuracy.
During training, a decision tree looks for the feature that best separates the data into classes, using measures like Gini impurity or entropy to quantify the quality of each candidate split.
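Both split measures are simple to compute from the class proportions at a node. A short Python sketch (illustrative, not tied to our dataset):

```python
import math

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions.
    0 for a pure node, 0.5 for a balanced two-class node."""
    n = len(labels)
    return 1 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def entropy(labels):
    """Shannon entropy in bits: 0 for a pure node,
    1 for a balanced two-class node."""
    n = len(labels)
    return -sum((labels.count(c) / n) * math.log2(labels.count(c) / n)
                for c in set(labels))
```

A split is scored by the weighted impurity of the child nodes it creates; the tree greedily picks the split with the lowest score.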
The specifics of how we will apply the decision tree algorithm to our dataset will be detailed in future updates.
Locations of shootings that took place
Today’s class was centered on analyzing geographical data pertaining to police shootings using a dataset with about 7,000 entries. We looked at the longitude and latitude details to understand the spread of these incidents.
Our analysis was conducted in Mathematica, where we applied several functions to process and visualize the data:
– We used GeoPosition to convert the latitude and longitude data into geographic positions that Mathematica can work with.
– GeoListPlot was the tool of choice for creating maps that display the locations of the police shootings.
– To visualize the density of the events, we created geographic histograms using GeoHistogram and GeoSmoothHistogram.
– We calculated the distances between shooting locations with GeoDistance to understand the spatial distribution.
– For the clustering analysis, we explored the spatial distribution using Mathematica’s FindClusters function. We also delved into the DBSCAN clustering method, which revealed four distinct clusters for the state of California.
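For the planned Python equivalent, the distance step can be reproduced with the haversine formula, which gives the great-circle distance between two latitude/longitude points (this is a standard approximation on a spherical Earth, not the exact geodesic GeoDistance uses by default):

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points given
    as (latitude, longitude) in degrees, assuming a spherical Earth."""
    r = 6371.0  # mean Earth radius in km
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * r * math.asin(math.sqrt(a))
```

For example, one degree of longitude along the equator comes out to about 111.2 km.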
Age distribution of people killed by the police
In today’s class, we analyzed the age distribution of individuals killed by police with a specific focus on the differences between Black and White individuals, using Mathematica for our analysis and planning a Python equivalent for future sessions.
Here are the outcomes of our findings:
Overall Age Distribution:
The youngest individual was 6 years old and the oldest was 91 years old. The average age was 37.1 years, the median 35 years, and the standard deviation 13 years. The distribution showed a slight right skew of 0.73, meaning ages cluster toward the younger end with a longer tail of older individuals, and a kurtosis close to 3, indicating a shape that resembles the normal curve.
Age Distribution for Black Individuals:
The ages ranged from 13 to 88 years, with an average age of 32.7 years and a median of 31 years. The standard deviation for this group was 11.4 years. This distribution displayed a right skewness of approximately 1 and a kurtosis of 3.9, hinting at a slightly more pronounced tail in the distribution.
Age Distribution for White Individuals:
The age range for this group was from 6 to 91 years, with an average age of 40 years and a median of 38 years. The standard deviation here was 13.3 years. Similar to the overall distribution, this group’s age pattern was slightly right-skewed, with a skewness of 0.53, and had a kurtosis of 2.86, suggesting slightly lighter tails than a normal distribution.
Comparison Between Black and White Individuals:
There was an average age difference of approximately 7.3 years, with White individuals being older on average compared to Black individuals. A Monte Carlo simulation confirmed that the age difference was statistically significant, indicating a low likelihood that this result is due to chance. The effect size, measured by Cohen’s d, was calculated to be 0.58, indicating that the age difference is of a medium magnitude.
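The Monte Carlo check described above can be sketched as a permutation test: pool the two groups, shuffle the labels many times, and count how often a shuffled split produces a mean difference at least as large as the observed one. A toy Python version (illustrative data only; the class analysis was done in Mathematica):

```python
import random

def permutation_test(group_a, group_b, trials=10000, seed=0):
    """Monte Carlo permutation test for a difference in means.
    Returns the fraction of shuffles whose absolute mean difference
    is at least as extreme as the observed one (the p-value estimate)."""
    rng = random.Random(seed)
    observed = abs(sum(group_a) / len(group_a) - sum(group_b) / len(group_b))
    pooled = list(group_a) + list(group_b)
    n = len(group_a)
    extreme = 0
    for _ in range(trials):
        rng.shuffle(pooled)  # randomly reassign group labels
        diff = abs(sum(pooled[:n]) / n - sum(pooled[n:]) / (len(pooled) - n))
        if diff >= observed:
            extreme += 1
    return extreme / trials
```

A small returned value means the observed difference rarely arises from random relabeling, i.e., it is unlikely to be due to chance.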
The session provided a detailed examination of age distributions within the context of police shootings, highlighting notable differences between racial groups and the general age tendencies within the data.
Difference between k-means, k-medoids, and DBSCAN clustering algorithms
In the class, we explored the differences between k-means, k-medoids, and DBSCAN clustering algorithms by applying them to geometrically structured datasets. We employed Mathematica for coding and visualizing the results of these clustering methods on three distinct examples.
**Example 1** involved a lemniscate (infinity symbol shape) populated with 200 random points:
– DBSCAN discerned 4 distinct clusters from these points.
– The k-means algorithm was implemented with both k=2 and k=4, and the resulting clusters were visualized.
– The k-medoids algorithm was also demonstrated for k=2 and k=4, illustrating how the clusters are formed around central points.
**Example 2** used a composite shape of a circle and an annulus, with 400 random points scattered within:
– DBSCAN successfully identified 2 clusters within this shape.
– We applied the k-means and k-medoids methods with k=2 and k=4 to see how they group the points.
**Example 3** was designed with a square area from which a maximal circle was subtracted, filled with 400 random points:
– Here, DBSCAN found 4 clusters, indicating its sensitivity to spatial density rather than geometric shapes.
– Again, we visualized how k-means and k-medoids behaved with k=2 and k=4.
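Datasets like Example 3 are easy to generate by rejection sampling: draw uniform points in the square and keep only those outside the inscribed circle. A Python sketch (the class used Mathematica for this; the unit square and circle radius here are assumptions for illustration):

```python
import random

def sample_square_minus_circle(n, seed=0):
    """Draw points uniformly in the square [-1, 1]^2 and keep only
    those outside the inscribed unit circle (rejection sampling)."""
    rng = random.Random(seed)
    points = []
    while len(points) < n:
        x, y = rng.uniform(-1, 1), rng.uniform(-1, 1)
        if x * x + y * y > 1:  # reject points inside the circle
            points.append((x, y))
    return points
```

The surviving points fall into the four corner regions of the square, which is consistent with DBSCAN finding four density-based clusters in this example.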
The upcoming update will provide a comparative analysis and delve deeper into each method, offering insights into their applications and limitations based on the clustering outcomes observed in these examples.
The class focused on analyzing the age statistics of individuals of different racial backgrounds who were fatally shot by police. We looked at Asians (A), Blacks (B), Hispanics (H), Native Americans (N), Others (O), and Whites (W). Using Mathematica, we calculated several statistical measures for each racial category, such as median, mean, standard deviation, variance, skewness, and kurtosis. Here are the results:
For Asian (A) victims:
– Median Age: 35 years
– Mean Age: 35.96 years
– Standard Deviation: 11.59
– Variance: 134.38
For Black (B) victims:
– Median Age: 31 years
– Mean Age: 32.93 years
– Standard Deviation: 11.39
– Variance: 129.70
For Hispanic (H) victims:
– Median Age: 32 years
– Mean Age: 33.59 years
– Standard Deviation: 10.74
– Variance: 115.42
For Native American (N) victims:
– Median Age: 32 years
– Mean Age: 32.65 years
– Standard Deviation: 8.99
– Variance: 80.90
For Other (O) victims:
– Median Age: 31 years
– Mean Age: 33.47 years
– Standard Deviation: 11.80
– Variance: 139.15
For White (W) victims:
– Median Age: 38 years
– Mean Age: 40.13 years
– Standard Deviation: 13.16
– Variance: 173.24
This analysis of age statistics is a part of a larger project, which will include hypothesis testing and additional reporting in subsequent classes.
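These per-group statistics can be reproduced in Python with the standard library alone. The records below are hypothetical stand-ins for the real dataset rows, just to show the grouping pattern:

```python
import statistics
from collections import defaultdict

# Hypothetical toy records (race code, age) standing in for the real rows.
records = [("B", 25), ("B", 31), ("B", 40), ("W", 35), ("W", 38), ("W", 50)]

# Group the ages by race code.
ages_by_race = defaultdict(list)
for race, age in records:
    ages_by_race[race].append(age)

# Compute the same measures reported above for each group.
stats = {
    race: {
        "median": statistics.median(ages),
        "mean": statistics.mean(ages),
        "stdev": statistics.stdev(ages),        # sample standard deviation
        "variance": statistics.variance(ages),  # sample variance
    }
    for race, ages in ages_by_race.items()
}
```

Skewness and kurtosis are not in the `statistics` module, but can be added as standardized third and fourth moments.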
The dataset from The Washington Post about fatal police shootings in the U.S. has 17 different types of information, or “features,” like ID numbers, names, and the date of the shooting. Some of these features have missing data.
Here’s a simple breakdown of what we have:
– ID number: A unique identifier for each case. There are 8002 of these.
– Name: The name of the person shot. There are 7548 names listed.
– Date of occurrence: The date when the shooting happened, with 8002 dates given.
– Manner of death: Tells us if the person was shot or shot and tasered. All 8002 cases include this info.
– Age: How old the person was. We know this for 7499 people.
– Sex: Male or female. This is noted for 7971 cases.
– Race: Categorized as White, Asian, Hispanic, or Black.
– City: The city where it happened, listed for all 8002 cases.
– State: The state where it happened, also listed for all 8002 cases.
– Signs of mental illness: A yes or no answer, available for all 8002 cases.
– Threat level: Describes if the person was attacking or not, given in all 8002 records.
– Flee: Tells us if the person tried to run away, noted in 7037 cases.
– Body camera: A yes or no to whether there was a body camera, known for all 8002 cases.
– Longitude and latitude: These give the specific location, but we only have them for 7163 cases.
– Is geocoding exact: A yes or no answer about location accuracy, available for all 8002 cases.
We have a few ways to deal with the incomplete data:
– Cut down the data: Keep only the cases where every feature is present, giving a complete but smaller dataset.
– Leave out incomplete features: Drop any feature that has missing values, which means we lose that information entirely.
– Fill in the gaps: Impute the missing values (for example, with the mean or median) so every feature covers all 8002 cases. This keeps all the data, but the imputed values may not be accurate.
– Ignore less useful information: For example, a person’s name might not tell us much about the shooting, so we might not use it in our analysis.
What we do with the data depends on what we want to find out from it. Each choice has its upsides and downsides, like having all the information versus keeping the data accurate.
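Two of these strategies are a few lines each in Python. The rows below are hypothetical, with `None` marking a missing age:

```python
import statistics

# Hypothetical rows with missing ages (None), standing in for the dataset.
rows = [{"age": 34}, {"age": None}, {"age": 28}, {"age": 45}, {"age": None}]

# Strategy 1: complete cases only -- smaller, but fully observed.
complete = [r for r in rows if r["age"] is not None]

# Strategy 3: fill in the gaps -- impute missing ages with the observed median.
median_age = statistics.median(r["age"] for r in complete)
imputed = [{"age": r["age"] if r["age"] is not None else median_age}
           for r in rows]
```

The trade-off is exactly the one described above: `complete` keeps only accurate values but shrinks the dataset, while `imputed` keeps every row at the cost of introducing estimated values.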
In today’s session, we worked with three variables from a dataset: ‘age,’ which is a numerical variable, ‘sex,’ categorized as ‘m’ (male) or ‘w’ (female), and ‘race,’ designated as ‘w’ (white), ‘b’ (black), ‘h’ (Hispanic), and ‘a’ (Asian).
We conducted basic statistical analyses on the ‘age’ feature, calculating the maximum, minimum, median, and mode, and visualized the distribution of ages with histograms.
Furthermore, we used the ‘race’ variable to determine the average ages within each racial group. Similarly, we analyzed age in relation to ‘sex/gender’ to compare the average ages between males and females.