Project update – 2/10

After analyzing the provided data, there are several ways to look at it and several ways a regression model can be implemented.

Data Description:

The provided dataset contains 2018 data on obesity rates, diabetes rates, and inactivity rates, organized by county within each state. Each county is identified by a unique FIPDS number. The website also included several other factors related to the economy, health, and more. However, far fewer data points were available for these additional factors than for the three main variables, making them less suitable for inclusion in our analysis.

Within the dataset itself, the three factors do not all have the same number of samples. Only 354 samples have values for all three factors. This discrepancy can be addressed in several ways:

1. Duplicate Points: One approach is to duplicate data points so that all features have the same number of samples. However, duplicating a large number of values introduces artificial data that does not reflect real counties, which can lead to poor model performance.

2. Smaller Common Set: Another option is to work with the smaller set of 354 samples that have values for all three features. While having more samples generally improves model training, collecting a larger dataset in a short time frame may be unrealistic. Therefore, the decision was made to go with the second option, even though it may result in less robust predictions due to the limited amount of data.

Data Extraction:

Extracting the common 354 data points that include all three variables can be achieved through the following process using Excel:

1. First, identify the FIPDS codes of the counties that have data for all three variables. Doing this up front makes the remaining steps somewhat easier.

2. Create a new column and copy all the common FIPDS codes into it.

3. Compare the original FIPDS column with the new common FIPDS column and mark duplicates. This process identifies the common counties and assigns them a color code. Repeat this step for all three sheets containing the relevant data.

4. Filter the data based on the color-coded duplicates to create a dataset that will be used for analysis in Python.
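
The same filtering can also be sketched directly in Python instead of Excel. The example below is a minimal illustration, not the workflow actually used: the three small tables and their values are made up, and the names `common_counties` and `build_common_rows` are hypothetical. It keeps only the counties whose FIPDS code appears in all three tables, which is the same idea as marking and filtering the duplicates above.

```python
# Toy stand-ins for the three sheets: FIPDS code -> rate (%).
# These codes and values are invented for illustration only.
diabetes = {"01001": 9.1, "01003": 8.4, "01005": 11.2, "01007": 10.0}
obesity = {"01001": 33.0, "01005": 37.5, "01009": 31.8}
inactivity = {"01001": 26.4, "01003": 22.1, "01005": 29.9}


def common_counties(*tables):
    """Return the sorted FIPDS codes that appear in every table."""
    codes = set(tables[0])
    for table in tables[1:]:
        codes &= set(table)  # keep only codes also present in this table
    return sorted(codes)


def build_common_rows(diabetes, obesity, inactivity):
    """Build one row per county that has all three values."""
    return [
        {
            "FIPDS": code,
            "diabetes": diabetes[code],
            "obesity": obesity[code],
            "inactivity": inactivity[code],
        }
        for code in common_counties(diabetes, obesity, inactivity)
    ]


rows = build_common_rows(diabetes, obesity, inactivity)
print(len(rows), "counties have all three values")
```

On the real data, the length of `rows` should match the 354 common samples if the sheets are loaded correctly, which gives a quick check against the Excel result.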

In addition, scatter plots of the data were generated to provide an initial visualization and gain a rough idea of what the regression lines might look like.
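
To give a rough idea of what a regression line through such a scatter plot looks like, here is a minimal sketch of an ordinary least-squares fit for one predictor. It makes two assumptions that are not from the analysis itself: the choice of inactivity as the predictor and diabetes as the response is only an example pairing, and the numbers are toy values chosen so the fit is easy to check by hand. The function name `fit_line` is hypothetical.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-deviations and sum of squared deviations of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Toy values (not from the CDC data): the points lie exactly on y = 0.5 * x - 1.
inactivity_pct = [20.0, 22.0, 24.0, 26.0]
diabetes_pct = [9.0, 10.0, 11.0, 12.0]

slope, intercept = fit_line(inactivity_pct, diabetes_pct)
print(f"diabetes = {slope:.2f} * inactivity + {intercept:.2f}")
```

The fitted slope and intercept define the straight line that would be drawn over the corresponding scatter plot.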

For further reference, the dataset can be accessed using the following link:

cdc-diabetes-2018-1