Cohen’s d

Cohen’s d is a statistic for gauging the size of an effect, such as a difference in average age, by standardizing the difference between two group means. It is calculated by taking the difference between the two means (for example, the average ages of Black and White individuals killed by police) and dividing it by the combined standard deviation of both groups.

This standard deviation is a “pooled” value, which means it takes into account the number of people in each group and their respective standard deviations to get an overall measure of variability.
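
Written out, with x̄₁, x̄₂ the group means, s₁, s₂ the group standard deviations, and n₁, n₂ the group sizes, the standard formula is:

```latex
d = \frac{\bar{x}_1 - \bar{x}_2}{s_p},
\qquad
s_p = \sqrt{\frac{(n_1 - 1)\,s_1^2 + (n_2 - 1)\,s_2^2}{n_1 + n_2 - 2}}
```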

In the case we’re looking at, Cohen’s d was calculated to be 0.577485, which standard guidelines classify as a medium effect size. In other words, the age difference of 7.3 years between Black and White individuals killed by police is neither trivial nor overwhelming, but moderate. Note that effect size and statistical significance are separate questions: whether the difference could plausibly be due to chance is addressed by the Monte Carlo simulation described later in these updates.

Previous update extension

K-means clustering is a method that groups data into k clusters by assigning each data point to the cluster with the closest mean value. A report illustrates that when k-means is applied to a dataset shaped like a lemniscate (an infinity symbol), it cleanly divides the data into two clusters when k is set to 2. Increasing k to 4 still yields a reasonable outcome, splitting the data into smaller, more defined clusters.

K-medoids clustering is similar to k-means, but each cluster is represented by its medoid, the actual data point most central within the cluster, rather than by a mean, which need not coincide with any data point. On the lemniscate data, k-medoids also effectively separates the data into two clusters for k = 2. For k = 4, k-medoids forms clusters around the four most central data points, giving a result similar to k-means but anchored to medoids rather than means.

DBSCAN clustering works differently by forming clusters based on areas of high data point density. It is less influenced by outliers compared to k-means and k-medoids. With the lemniscate dataset, DBSCAN identified four clusters, recognizing areas of density that are separated by less dense regions.
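
The report’s exact code isn’t reproduced here; a minimal Mathematica sketch of the comparison (the lemniscate sampling and noise level are my assumptions) looks like this:

```wolfram
(* Sample 200 noisy points from a lemniscate, then cluster them three ways. *)
pts = Table[
   With[{t = RandomReal[{0, 2 Pi}]},
    {Cos[t]/(1 + Sin[t]^2), Sin[t] Cos[t]/(1 + Sin[t]^2)} +
     RandomVariate[NormalDistribution[0, 0.02], 2]],
   {200}];

kmeans   = FindClusters[pts, 2, Method -> "KMeans"];    (* k must be given *)
kmedoids = FindClusters[pts, 2, Method -> "KMedoids"];  (* k must be given *)
dbscan   = FindClusters[pts, Method -> "DBSCAN"];       (* no k: density decides *)

GraphicsRow[
 ListPlot[#, AspectRatio -> Automatic] & /@ {kmeans, kmedoids, dbscan}]
```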

Some key observations from the report include:

– K-means and k-medoids will partition data into exactly k clusters, even if the natural number of clusters is different. This makes the choice of k crucial.
– DBSCAN doesn’t require setting the number of clusters beforehand. It can find a varying number of clusters based on the density of the dataset, potentially providing a more natural clustering for certain types of data.
– For data with clear separations, like the lemniscate shape, k-means and k-medoids can be effective if the correct value of k is chosen.
– DBSCAN’s strength lies in its ability to manage noisy data and discover clusters without needing a pre-specified number of clusters. This can be particularly useful for datasets where the number of clusters is not known in advance or is uneven.

Update-

In today’s session, we discussed the decision tree algorithm, a form of supervised learning that can be applied to both classification and regression tasks. This algorithm constructs a tree-like model of decisions, where:

– Internal nodes represent tests on the features of the dataset.
– Branches represent the outcomes of those tests, i.e., the decision rules.
– Leaf nodes represent the final outcome or prediction reached after evaluating the features along the path.

A decision tree is built by splitting the source set into subsets based on an attribute value test. This process is repeated on each derived subset in a recursive manner called recursive partitioning. The recursion is completed when the subset at a node has the same value of the target variable, or when splitting no longer adds value to the predictions.
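
As a small, self-contained illustration in Mathematica (toy data, not our police-shootings dataset, which we apply the algorithm to in a future update):

```wolfram
(* Toy training set: {feature1, feature2} -> class label. *)
training = {{1.0, 2.0} -> "A", {1.2, 1.9} -> "A", {0.8, 2.2} -> "A",
            {5.0, 8.0} -> "B", {5.5, 7.5} -> "B", {6.0, 9.0} -> "B"};

(* Classify builds the tree by recursively partitioning the feature space. *)
tree = Classify[training, Method -> "DecisionTree"];

tree[{1.1, 2.1}]  (* -> "A": the new point falls in the "A" partition *)
```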

This algorithm is particularly powerful because:

– It includes automatic feature selection.
– It doesn’t require much data pre-processing.
– It is easy to interpret and understand.

Decision trees also form the building blocks of Random Forests, which are an ensemble of decision trees trained on various sub-samples of the dataset. This makes Random Forest one of the most robust machine learning algorithms available, capable of performing both classification and regression tasks with high accuracy.
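
In Mathematica’s Classify, switching from a single tree to a Random Forest is a one-line change (reusing the toy training set from the sketch above):

```wolfram
forest = Classify[training, Method -> "RandomForest"];
forest[{1.1, 2.1}]  (* the prediction is now aggregated over many trees *)
```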

During training, the algorithm looks for the feature (and split point) that best separates the data into classes. This is done using measures like Gini impurity or entropy, which quantify how good a candidate split is.
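
For reference, the standard definitions for a node whose class proportions are p_i; a split is chosen to maximize the reduction in impurity, weighted by child-node size:

```latex
\text{Gini} = 1 - \sum_i p_i^2,
\qquad
\text{Entropy} = -\sum_i p_i \log_2 p_i
```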

The specifics of how we will apply the decision tree algorithm to our dataset will be detailed in future updates.

Locations of the shootings

Today’s class was centered on analyzing geographical data pertaining to police shootings using a dataset with about 7,000 entries. We looked at the longitude and latitude details to understand the spread of these incidents.

Our analysis was conducted in Mathematica, where we applied several functions to process and visualize the data:

– We used GeoPosition to convert the latitude and longitude data into geographic positions that Mathematica can work with.

– GeoListPlot was the tool of choice for creating maps that display the locations of the police shootings.

– To visualize the density of the events, we created geographic histograms using GeoHistogram and GeoSmoothHistogram.

– We calculated the distances between shooting locations with GeoDistance to understand the spatial distribution.

– For the clustering analysis, we explored the spatial distribution using Mathematica’s FindClusters function. We also delved into the DBSCAN clustering method, which revealed four distinct clusters for the state of California.
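
A condensed sketch of this pipeline; the file name and the lat/lon column positions are placeholders, not those of the actual dataset:

```wolfram
raw = Import["shootings.csv"];                 (* hypothetical file name *)
latlon = raw[[2 ;;, {5, 6}]];                  (* assumed {lat, lon} columns *)
positions = GeoPosition /@ latlon;

GeoListPlot[positions]                         (* map of incident locations *)
GeoHistogram[positions]                        (* binned event density *)
GeoSmoothHistogram[positions]                  (* smoothed event density *)
GeoDistance[positions[[1]], positions[[2]]]    (* distance between two incidents *)

FindClusters[latlon, Method -> "DBSCAN"]       (* density-based spatial clusters *)
```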


Age distribution of people killed by the police

In today’s class, we analyzed the age distribution of individuals killed by police with a specific focus on the differences between Black and White individuals, using Mathematica for our analysis and planning a Python equivalent for future sessions.

Here are the outcomes of our findings:

Overall Age Distribution:
The youngest individual was 6 years old and the oldest was 91 years old. The average age came out to 37.1 years, with the median at 35 years and a standard deviation of 13 years. The distribution was slightly right-skewed (skewness 0.73), meaning most individuals were relatively young with a longer tail toward older ages, and its kurtosis was close to 3, indicating a shape that resembles the normal curve.
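
All of these summaries are built-in functions in Mathematica; a minimal sketch, with placeholder data standing in for the real age column:

```wolfram
ages = RandomVariate[NormalDistribution[37, 13], 1000];  (* placeholder ages *)
{Min[ages], Max[ages], Mean[ages], Median[ages],
 StandardDeviation[ages], Skewness[ages], Kurtosis[ages]}
```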

Age Distribution for Black Individuals:
The ages ranged from 13 to 88 years, with an average age of 32.7 years and a median of 31 years. The standard deviation for this group was 11.4 years. This distribution was right-skewed (skewness approximately 1) with a kurtosis of 3.9, indicating a somewhat heavier tail of older ages than a normal curve would have.

Age Distribution for White Individuals:
The age range for this group was from 6 to 91 years, with an average age of 40 years and a median of 38 years. The standard deviation here was 13.3 years. Similar to the overall distribution, this group’s age pattern was slightly right-skewed with a skewness of 0.53, and its kurtosis of 2.86, just below that of a normal curve, suggests marginally lighter tails and a less pronounced peak.

Comparison Between Black and White Individuals:
There was an average age difference of approximately 7.3 years, with White individuals being older on average compared to Black individuals. A Monte Carlo simulation confirmed that the age difference was statistically significant, indicating a low likelihood that this result is due to chance. The effect size, measured by Cohen’s d, was calculated to be 0.58, indicating that the age difference is of a medium magnitude.
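
A minimal sketch of such a Monte Carlo (permutation) test together with the Cohen’s d computation; the variable names and the placeholder samples below are illustrative assumptions, not the class notebook:

```wolfram
blackAges = RandomVariate[NormalDistribution[32.7, 11.4], 2000]; (* placeholder *)
whiteAges = RandomVariate[NormalDistribution[40.0, 13.3], 3000]; (* placeholder *)

observedDiff = Mean[whiteAges] - Mean[blackAges];
pooledData = Join[blackAges, whiteAges];
n = Length[blackAges];

(* Shuffle the group labels many times and see how often a difference at
   least as large as the observed one arises by chance. *)
simDiffs = Table[
   With[{s = RandomSample[pooledData]},
    Mean[Drop[s, n]] - Mean[Take[s, n]]], {10000}];
pValue = N[Count[simDiffs, d_ /; Abs[d] >= Abs[observedDiff]]/Length[simDiffs]]

(* Cohen's d with the pooled standard deviation. *)
sp = Sqrt[((Length[blackAges] - 1) Variance[blackAges] +
     (Length[whiteAges] - 1) Variance[whiteAges])/(Length[pooledData] - 2)];
cohenD = observedDiff/sp
```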

The session provided a detailed examination of age distributions within the context of police shootings, highlighting notable differences between racial groups and the general age tendencies within the data.

Difference between k-means, k-medoids, and DBSCAN clustering algorithms

In the class, we explored the differences between k-means, k-medoids, and DBSCAN clustering algorithms by applying them to geometrically structured datasets. We employed Mathematica for coding and visualizing the results of these clustering methods on three distinct examples.

**Example 1** involved a lemniscate (infinity symbol shape) populated with 200 random points:
– DBSCAN discerned 4 distinct clusters from these points.
– The k-means algorithm was implemented with both k=2 and k=4, and the resulting clusters were visualized.
– The k-medoids algorithm was also demonstrated for k=2 and k=4, illustrating how the clusters are formed around central points.

**Example 2** used a composite shape of a circle and an annulus, with 400 random points scattered within:
– DBSCAN successfully identified 2 clusters within this shape.
– We applied the k-means and k-medoids methods with k=2 and k=4 to see how they group the points.
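
A sketch of Example 2’s setup, with assumed radii chosen so the disk and annulus are separated by a gap:

```wolfram
region2 = RegionUnion[Disk[{0, 0}, 1], Annulus[{0, 0}, {2, 3}]];
pts2 = RandomPoint[region2, 400];
ListPlot[FindClusters[pts2, Method -> "DBSCAN"], AspectRatio -> Automatic]
```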

**Example 3** was designed with a square area from which a maximal circle was subtracted, filled with 400 random points:
– Here, DBSCAN found 4 clusters, indicating its sensitivity to spatial density rather than geometric shapes.
– Again, we visualized how k-means and k-medoids behaved with k=2 and k=4.
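
Example 3’s region can be sketched the same way; subtracting the maximal (inscribed) circle from the square leaves four disconnected corner pockets, which matches the four density-based clusters DBSCAN reports:

```wolfram
region3 = RegionDifference[Rectangle[{-1, -1}, {1, 1}], Disk[{0, 0}, 1]];
pts3 = RandomPoint[region3, 400];
ListPlot[FindClusters[pts3, Method -> "DBSCAN"], AspectRatio -> Automatic]
```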

The upcoming update will provide a comparative analysis and delve deeper into each method, offering insights into their applications and limitations based on the clustering outcomes observed in these examples.