The class focused on analyzing the age statistics of individuals from different racial backgrounds who were fatally shot by police. We looked at Asians (A), Blacks (B), Hispanics (H), Native Americans (N), Others (O), and Whites (W). Using Mathematica, we calculated several statistical measures for each racial category, such as the median, mean, standard deviation, variance, skewness, and kurtosis. Here are the results (a rough pandas equivalent of these calculations is sketched after the list):

For Asian (A) victims:
– Median Age: 35 years
– Mean Age: 35.96 years
– Standard Deviation: 11.59
– Variance: 134.38

For Black (B) victims:
– Median Age: 31 years
– Mean Age: 32.93 years
– Standard Deviation: 11.39
– Variance: 129.70

For Hispanic (H) victims:
– Median Age: 32 years
– Mean Age: 33.59 years
– Standard Deviation: 10.74
– Variance: 115.42

For Native American (N) victims:
– Median Age: 32 years
– Mean Age: 32.65 years
– Standard Deviation: 8.99
– Variance: 80.90

For Other (O) victims:
– Median Age: 31 years
– Mean Age: 33.47 years
– Standard Deviation: 11.80
– Variance: 139.15

For White (W) victims:
– Median Age: 38 years
– Mean Age: 40.13 years
– Standard Deviation: 13.16
– Variance: 173.24
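For anyone following along in Python rather than Mathematica, a minimal pandas sketch of the same per-race summaries is below. The file name and the `age`/`race` column names are assumptions about how the dataset is laid out.

```python
import pandas as pd

# Load the Washington Post fatal police shootings data (file name assumed).
df = pd.read_csv("fatal-police-shootings-data.csv")

# Per-race summary of victim age: median, mean, standard deviation, variance.
summary = df.groupby("race")["age"].agg(
    median="median",
    mean="mean",
    std="std",
    var="var",
)
print(summary.round(2))
```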

This analysis of age statistics is a part of a larger project, which will include hypothesis testing and additional reporting in subsequent classes.

The dataset from The Washington Post about fatal police shootings in the U.S. has 17 different types of information, or “features,” like ID numbers, names, and the date of the shooting. Some of these features have missing data.

Here’s a simple breakdown of what we have (a quick way to check these counts in pandas is sketched after the list):

– ID number: A unique number for each case. There are 8002 of these.
– Name: The name of the person shot. There are 7548 names listed.
– Date of occurrence: The date when the shooting happened, with 8002 dates given.
– Manner of death: Tells us if the person was shot or shot and tasered. All 8002 cases include this info.
– Age: How old the person was. We know this for 7499 people.
– Sex: Male or female. This is noted for 7971 cases.
– Race: Categorized as White, Black, Hispanic, Asian, Native American, or Other.
– City: The city where it happened, listed for all 8002 cases.
– State: The state where it happened, also listed for all 8002 cases.
– Signs of mental illness: Yes or no answer, available for all 8002 cases.
– Threat level: Describes if the person was attacking or not, given in all 8002 records.
– Flee: Tells us if the person tried to run away, noted in 7037 cases.
– Body camera: Yes or no to whether there was a body camera, known for all 8002 cases.
– Longitude and Latitude: These tell us the specific location, but we only have these for 7163 cases.
– Is geocoding exact: A yes or no answer about location accuracy, available for all 8002 cases.
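As a quick check of these per-feature counts, one hedged pandas sketch (same assumed file name as above) is to count the non-missing entries in each column:

```python
import pandas as pd

df = pd.read_csv("fatal-police-shootings-data.csv")  # file name assumed

# How many values are present, and how many are missing, for every feature.
present = df.count()
missing = df.isna().sum()
print(pd.DataFrame({"present": present, "missing": missing}))
```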

We have a few ways to deal with the incomplete data:

– Cut down the data: Use only the records where every value is present, which gives a complete but smaller dataset.
– Leave out incomplete parts: Remove any information that isn’t complete, which means we’ll lose a lot of data.
– Fill in the gaps: Impute the missing values so everything adds up to 8002. This keeps all the data, but the filled-in values might not be accurate.
– Ignore less useful information: For example, a person’s name might not tell us much about the shooting, so we might not use it in our analysis.

What we do with the data depends on what we want to find out from it. Each choice has its upsides and downsides, such as keeping all the information versus keeping the data accurate.
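As an illustration of the first and third options, a small pandas sketch is below; the column names are assumptions, and the median is just one possible choice of fill value.

```python
import pandas as pd

df = pd.read_csv("fatal-police-shootings-data.csv")  # file name assumed

# Option 1: keep only the records where every feature is present (complete but smaller).
complete_rows = df.dropna()

# Option 3: fill the gaps instead, e.g. impute missing ages with the median age
# so that 'age' has a value for all 8002 records.
imputed = df.copy()
imputed["age"] = imputed["age"].fillna(imputed["age"].median())

print(len(df), len(complete_rows), imputed["age"].isna().sum())
```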

In today’s session, we worked with three variables from a dataset: ‘age,’ which is a numerical variable; ‘sex,’ categorized as ‘m’ (male) or ‘f’ (female); and ‘race,’ designated as ‘w’ (white), ‘b’ (black), ‘h’ (Hispanic), and ‘a’ (Asian).

We conducted basic statistical analyses on the ‘age’ feature, calculating the maximum, minimum, median, and mode, and visualized the distribution of ages with histograms.

Furthermore, we used the ‘race’ variable to determine the average ages within each racial group. Similarly, we analyzed age in relation to ‘sex/gender’ to compare the average ages between males and females.
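A minimal sketch of these steps, assuming the same dataframe as above and columns named `age`, `race`, and `sex` (the real CSV headers may differ):

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("fatal-police-shootings-data.csv")  # file name assumed

# Basic statistics on age.
ages = df["age"].dropna()
print(ages.max(), ages.min(), ages.median(), ages.mode().iloc[0])

# Histogram of the age distribution.
ages.plot.hist(bins=30, title="Age distribution")
plt.xlabel("age")
plt.show()

# Average age within each racial group and for each sex.
print(df.groupby("race")["age"].mean().round(2))
print(df.groupby("sex")["age"].mean().round(2))
```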

Clustering

Clustering a dataset that contains heterogeneous data (different types such as numerical, categorical, and textual elements) introduces additional challenges. To effectively identify significant groupings within such a varied dataset, specialized approaches and algorithms are necessary.

Therefore, we might transform all data into a uniform type for analysis or focus on subsets of the data that are more directly comparable. In this instance, I’ve chosen to work exclusively with the latitude and longitude information to apply the DBSCAN algorithm. DBSCAN, which stands for Density-Based Spatial Clustering of Applications with Noise, excels at identifying clusters with irregular shapes in data of uneven densities.

This algorithm differentiates clusters based on areas of higher density, separated by regions of lower density, and does not require pre-defining the number of clusters, which is advantageous for datasets with an unknown number of clusters.

The two principal parameters of DBSCAN are ‘eps’, the maximum distance between two points for them to be considered neighbors, and ‘min_samples’, the minimum number of points required for a region to be dense enough to be recognized as a cluster.
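A minimal scikit-learn sketch of this approach is below; the coordinate column names are taken from the feature list, and the `eps` and `min_samples` values are placeholders to tune rather than recommended settings.

```python
import pandas as pd
from sklearn.cluster import DBSCAN

df = pd.read_csv("fatal-police-shootings-data.csv")  # file name assumed

# Keep only the records that have coordinates.
coords = df[["latitude", "longitude"]].dropna().copy()

# eps: maximum distance (here in raw degrees) for two points to count as neighbors.
# min_samples: minimum number of points needed to form a dense region.
db = DBSCAN(eps=0.5, min_samples=10).fit(coords[["latitude", "longitude"]])

coords["cluster"] = db.labels_  # label -1 marks noise points
print(coords["cluster"].value_counts().head())
```

Because latitude and longitude are angular coordinates, a haversine metric on radians would measure geographic distance more faithfully than plain Euclidean distance on degrees; the sketch keeps the default for simplicity.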

Project update – 2/10

After analyzing the provided data, there are various ways to look at it and to implement a regression model.

Data Description:

The provided dataset contains 2018 information on obesity rates, diabetes rates, and inactivity rates, categorized by county within each state. Each county is identified by a unique FIPS code. The source website also included several other factors related to the economy, health, and more; however, those additional factors had significantly fewer data points than the main variables, making them less suitable for inclusion in our analysis.

Regarding the dataset itself, it’s important to note that not all factors had the same number of samples. Specifically, there were only 354 samples that had values for all three of these factors. This discrepancy can be addressed in several ways:

1. Duplicate Points: One approach is to duplicate data points so that all features have the same number of samples. However, duplicating a large number of values can introduce inaccuracies and potentially lead to poor model performance due to the presence of artificial information.

2. Smaller Common Set: Another option is to work with the smaller set of 354 samples that have values for all three features. While having more samples generally improves model training, collecting a larger dataset in a short time frame may be unrealistic. Therefore, the decision was made to go with the second option, even though it may result in less robust predictions due to the limited amount of data.

Data Extraction:

Extracting the common 354 data points that include all three variables can be achieved through the following process using Excel:

1. Identify the FIPS codes of the counties that have data for all three variables; doing this first makes the rest of the task somewhat easier.

2. Create a new column and copy all the common FIPS codes into it.

3. Compare the original FIPS column with the new common-FIPS column and mark duplicates. This identifies the common counties and assigns them a color code. Repeat this step for all three sheets containing the relevant data.

4. Filter the data based on the color-coded duplicates to create a dataset that will be used for analysis in Python.

In addition, scatter plots of the data were generated to provide an initial visualization and a rough idea of what the regression lines might look like; a pandas sketch of the extraction and these plots follows.
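A pandas version of the same extraction, merging the three sheets on their FIPS codes so that only the counties present in all three remain, is sketched below; the workbook name, sheet names, and column headers are assumptions about the CDC file’s layout.

```python
import pandas as pd
import matplotlib.pyplot as plt

# File, sheet, and column names are assumptions about the 2018 CDC workbook.
obesity = pd.read_excel("cdc-diabetes-2018.xlsx", sheet_name="Obesity")
diabetes = pd.read_excel("cdc-diabetes-2018.xlsx", sheet_name="Diabetes")
inactivity = pd.read_excel("cdc-diabetes-2018.xlsx", sheet_name="Inactivity")

# Inner merges on the FIPS code keep only the counties that appear in all
# three sheets (the common 354 data points).
common = obesity.merge(diabetes, on="FIPS").merge(inactivity, on="FIPS")
print(len(common))

# Quick scatter plots for a rough idea of the regression lines.
common.plot.scatter(x="% OBESE", y="% DIABETIC")
common.plot.scatter(x="% INACTIVE", y="% DIABETIC")
plt.show()
```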

For further reference, you can access the dataset using the following link:

cdc-diabetes-2018-1

9/27

After today’s session, I gained insight into 5-fold cross-validation. We applied this technique to a dataset comprising 354 data points. First, we divided the dataset into five approximately equal-sized segments: four folds contained 71 data points each, and the last fold had 70. The procedure involved five iterations; in each one, a different fold was set aside as the test set while the remaining four folds were used to train our polynomial regression model. This process allowed us to evaluate the model’s performance and calculate its average performance across folds. We also worked through a small example: we generate some example data with input features (x) and a target variable (y), initialize a linear regression model, create a 5-fold cross-validator using KFold from scikit-learn, store the MSE values for each fold in a list, and calculate the average MSE across all folds to assess the model’s overall performance.
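A sketch of that small example with scikit-learn, using synthetic data and a plain linear regression as placeholders:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold

# Generate some example data: one input feature x and a noisy linear target y.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=(354, 1))
y = 3.0 * x.ravel() + rng.normal(0, 2, size=354)

model = LinearRegression()
kf = KFold(n_splits=5, shuffle=True, random_state=0)

mse_values = []
for train_idx, test_idx in kf.split(x):
    # Train on four folds, test on the held-out fold.
    model.fit(x[train_idx], y[train_idx])
    preds = model.predict(x[test_idx])
    mse_values.append(mean_squared_error(y[test_idx], preds))

print("MSE per fold:", np.round(mse_values, 3))
print("Average MSE:", np.mean(mse_values))
```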

9/25

After today’s class session, I understood the topics of estimating prediction error and the validation-set approach. In the validation-set approach, the provided data is divided into two sets: training data (80%) and validation data (20%). The validation set is used to estimate the model’s prediction error, also known as the test error. With this model we make predictions, compare the predicted values with the actual values to measure accuracy on the dataset, and then try to increase the accuracy by choosing a more suitable algorithm.
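A minimal sketch of the 80/20 validation-set approach, again with synthetic data and a linear model standing in for the real dataset and algorithm:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic data standing in for the provided dataset.
rng = np.random.default_rng(1)
x = rng.uniform(0, 10, size=(354, 1))
y = 3.0 * x.ravel() + rng.normal(0, 2, size=354)

# 80% training data, 20% validation data.
x_train, x_val, y_train, y_val = train_test_split(x, y, test_size=0.2, random_state=1)

model = LinearRegression().fit(x_train, y_train)

# The error on the held-out validation set estimates the test (prediction) error.
val_mse = mean_squared_error(y_val, model.predict(x_val))
print("Estimated prediction error (validation MSE):", val_mse)
```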

I have also worked on K-fold cross-validation with a few examples and applied it to the provided data; I will report the progress I obtain in the following sessions.

09/22

Cross-validation: a method used in machine learning to analyze model performance. It essentially entails breaking the provided data into several subsets, training models on some of them and evaluating them on the others. Cross-validation decreases the risk of overfitting and helps estimate how well the model will generalize to fresh, untested data.

Different kinds of cross-validation

Leave-One-Out Cross-Validation (LOOCV): each data point in turn is used as the test set while the remaining data is used for training. The method is repeated for every data point, so there are as many iterations as there are data points. This approach offers a robust estimate of a model’s performance, though it can be computationally expensive because one model is fit per data point.
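A small scikit-learn sketch of LOOCV on synthetic data (a deliberately small dataset, since one model is fit per data point):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import LeaveOneOut

# Small synthetic dataset.
rng = np.random.default_rng(2)
x = rng.uniform(0, 10, size=(30, 1))
y = 3.0 * x.ravel() + rng.normal(0, 2, size=30)

loo = LeaveOneOut()
errors = []
for train_idx, test_idx in loo.split(x):
    # Train on all points except one, test on the single held-out point.
    model = LinearRegression().fit(x[train_idx], y[train_idx])
    pred = model.predict(x[test_idx])
    errors.append(mean_squared_error(y[test_idx], pred))

print("LOOCV estimate of test MSE:", np.mean(errors))
```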

K-fold cross-validation:

This is a method for evaluating the effectiveness of a machine learning model. The data are divided into ‘k’ sections of equal size, and the process is repeated ‘k’ times, with each fold acting as the test set once: the model is trained on the other ‘k-1’ folds and tested on the held-out fold. The results are averaged to assess the model’s performance more accurately while making effective use of the available data. This broadens the applicability of the evaluation and helps us identify potential problems like overfitting.

Time series cross-validation:

This approach is used to assess how well predictive models work on time-dependent data. The time-ordered dataset is broken into sequential chunks, with historical data used for training and later data used for testing. This mirrors real situations where predictions are made from historical data. Variants such as rolling-window and expanding-window cross-validation are frequently used to make sure that models generalize effectively to new time periods.
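A sketch using scikit-learn’s `TimeSeriesSplit`, which implements an expanding-window scheme, on a synthetic trend series:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import TimeSeriesSplit

# Synthetic time-ordered data: a linear trend plus noise.
t = np.arange(100).reshape(-1, 1)
y = 0.5 * t.ravel() + np.random.default_rng(3).normal(0, 1, size=100)

tscv = TimeSeriesSplit(n_splits=5)
for fold, (train_idx, test_idx) in enumerate(tscv.split(t)):
    # Training indices always precede the test indices: the past predicts the future.
    model = LinearRegression().fit(t[train_idx], y[train_idx])
    mse = mean_squared_error(y[test_idx], model.predict(t[test_idx]))
    print(f"fold {fold}: train up to t={train_idx[-1]}, test MSE={mse:.3f}")
```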

Stratified Cross-Validation:

This method makes sure that the class distribution of the original dataset is preserved in each subset used for testing in k-fold cross-validation. It helps when working with imbalanced datasets in which certain classes have far fewer samples, and it ensures that each fold more accurately reflects the class distribution, improving model evaluation and lowering the possibility of biased results.
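A sketch showing how `StratifiedKFold` keeps class proportions roughly constant across folds, with synthetic imbalanced labels and logistic regression as a placeholder model:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold

# Imbalanced synthetic data: roughly 10% positive labels.
rng = np.random.default_rng(4)
X = rng.normal(size=(200, 3))
y = (rng.uniform(size=200) < 0.1).astype(int)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=4)
for fold, (train_idx, test_idx) in enumerate(skf.split(X, y)):
    # Each test fold keeps roughly the same share of the minority class.
    model = LogisticRegression().fit(X[train_idx], y[train_idx])
    acc = model.score(X[test_idx], y[test_idx])
    print(f"fold {fold}: positives in test = {y[test_idx].mean():.2f}, accuracy = {acc:.2f}")
```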