9/27

After today’s session, I gained insights into the concept of 5-fold cross-validation. We applied this technique to a dataset comprising 354 data points. Initially, we divided the dataset into five approximately equal-sized segments: each of the first four folds contained 71 data points, with the last fold having 70. The procedure involved conducting five iterations. In each iteration, one fold was set aside as the test set, while the remaining four folds were used to train our polynomial regression model. This process allowed us to evaluate the model’s performance on each fold and calculate its average performance.

In code, the steps were: we generate some example data with input features (x) and the target variable (y); we initialize a linear regression model; we create a 5-fold cross-validator using KFold from scikit-learn; we store the MSE value for each fold in an MSE list; and we calculate the average MSE across all folds to assess the model’s overall performance.
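Below is a minimal sketch of that workflow in scikit-learn; the synthetic data stand in for the 354-point class dataset, and the data-generating coefficients are made-up assumptions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold

# Generate example data: input features (x) and target variable (y)
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=(354, 1))
y = 2.0 * x.ravel() + rng.normal(0, 1, size=354)  # arbitrary relationship

# Initialize a linear regression model and a 5-fold cross-validator
model = LinearRegression()
kf = KFold(n_splits=5, shuffle=True, random_state=0)

# Store the MSE value for each fold
mse_values = []
for train_idx, test_idx in kf.split(x):
    model.fit(x[train_idx], y[train_idx])
    preds = model.predict(x[test_idx])
    mse_values.append(mean_squared_error(y[test_idx], preds))

# The average MSE across all folds estimates overall performance
print("MSE per fold:", mse_values)
print("Average MSE:", np.mean(mse_values))
```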

9/25

After today’s class session I have understood the topics of estimating prediction error and the validation set approach. For the validation set approach, the provided data is divided into two sets: training data (80%) and validation data (20%). The validation set is used to estimate the model’s prediction error, which is also known as test error. With this model we make predictions, compare the predicted values against the actual values to measure accuracy on the dataset, and then try to increase that accuracy by choosing a more suitable algorithm.
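As a rough sketch, the 80/20 validation set approach might look like this with scikit-learn’s train_test_split; the data here are synthetic stand-ins for the provided dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic data standing in for the provided dataset
rng = np.random.default_rng(1)
X = rng.uniform(0, 10, size=(200, 1))
y = 3.0 * X.ravel() + rng.normal(0, 2, size=200)

# Split: 80% training data, 20% validation data
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=1)

# Fit on the training set, then estimate the test error on the validation set
model = LinearRegression().fit(X_train, y_train)
val_mse = mean_squared_error(y_val, model.predict(X_val))
print("Estimated prediction (test) error:", val_mse)
```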

I have also worked on K-fold cross-validation with a few examples and have applied it to the provided data; I will post updates on my progress in the following sessions.

09/22

Cross-validation: This is a method used in machine learning to analyze model performance. It essentially entails breaking up the provided data into several subsets, training models on some of them and evaluating them on the others. Cross-validation decreases the risk of overfitting and also aids in estimating how well the model will generalize to fresh, untested data.

Different kinds of cross-validation

Leave-One-Out Cross-Validation (LOOCV): Each data point is used once as the test set while the remaining data is used for training. There are as many iterations of this method as there are data points, because it is repeated for each and every data point. This approach offers a robust estimate of a model’s performance, although it can be computationally expensive since the model must be retrained once per data point.
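A small sketch of LOOCV using scikit-learn’s LeaveOneOut splitter; the dataset is made up and kept tiny, since the model is refit once per data point.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import LeaveOneOut

# Tiny synthetic dataset (LOOCV refits the model n times)
rng = np.random.default_rng(2)
X = rng.uniform(0, 10, size=(30, 1))
y = 2.0 * X.ravel() + rng.normal(0, 1, size=30)

errors = []
for train_idx, test_idx in LeaveOneOut().split(X):
    # Train on all points except one, test on the held-out point
    model = LinearRegression().fit(X[train_idx], y[train_idx])
    pred = model.predict(X[test_idx])
    errors.append(mean_squared_error(y[test_idx], pred))

# One iteration per data point; average the squared errors
print("LOOCV estimate of MSE:", np.mean(errors))
```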

K-fold cross-validation:

This is a method for evaluating the effectiveness of a machine learning model. The data are divided into ‘k’ sections of roughly equal size. The process is repeated ‘k’ times, with each fold acting as the test set once: the model is trained on ‘k-1’ folds and tested on the remaining one. To assess the model’s performance more accurately while effectively utilizing the data at hand, the results are averaged. It broadens the applicability of the model and enables us to more effectively identify potential problems like overfitting.
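As a compact alternative to the loop written out in the 9/27 entry above, scikit-learn’s cross_val_score can run the whole k-fold procedure in one call; the data are synthetic again.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic data for illustration
rng = np.random.default_rng(3)
X = rng.uniform(0, 10, size=(100, 1))
y = 4.0 * X.ravel() + rng.normal(0, 1, size=100)

# k=5 folds; scores are negated MSE by scikit-learn convention
scores = cross_val_score(LinearRegression(), X, y, cv=5,
                         scoring="neg_mean_squared_error")
print("Average MSE across the 5 folds:", -scores.mean())
```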

Time series cross-validation:

This approach is used to assess how well predictive models work for time-dependent data. It entails breaking the time-ordered dataset into sequential chunks, using earlier data for training and later data for testing. This mirrors real situations where predictions are made from historical data. Variants such as rolling-window and expanding-window cross-validation are frequently used to make sure that models generalize effectively to new time periods.
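A quick sketch with scikit-learn’s TimeSeriesSplit, which implements the expanding-window variant; the twelve-point series is just a placeholder.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# A simple time-ordered series (placeholder data)
data = np.arange(12)

# Each split trains on past observations and tests on the ones that
# follow, with the training window expanding over time
tscv = TimeSeriesSplit(n_splits=3)
for train_idx, test_idx in tscv.split(data):
    print("train:", data[train_idx], "test:", data[test_idx])
```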

Stratified Cross-Validation:

This specific method is used to make sure that the class distribution of the original dataset is preserved in each subset used for testing in k-fold cross-validation. This helps when working with imbalanced datasets in which certain classes have far fewer samples. It ensures that each fold more accurately reflects the class distribution, improving model evaluation and lowering the possibility of biased outcomes.
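A short sketch using scikit-learn’s StratifiedKFold on a deliberately imbalanced, made-up label set, showing that each test fold keeps the original class ratio.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Imbalanced labels: 90 samples of class 0, 10 of class 1
X = np.arange(100).reshape(-1, 1)
y = np.array([0] * 90 + [1] * 10)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, test_idx in skf.split(X, y):
    # Each test fold preserves roughly the 90/10 class ratio
    print("test fold class counts:", np.bincount(y[test_idx]))
```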

09/20 About Crab Molt model and T-test.

In today’s class, we discussed two important things: the Crab Molt Model and the T-Test.

The Crab Molt Model is a way to make predictions when we have data that doesn’t follow a normal pattern. Imagine you have information about crabs whose sizes don’t follow the usual pattern. This model helps us predict how big a crab was before it shed its old shell (pre-molt) by looking at its size after the molt (post-molt).

Post-molt data is what we collect when crabs have just shed their old shells and are growing new ones. Pre-molt data is information we gather from crabs just before they shed their old shells, which can show us how they are changing.

The T-Test is a statistical tool used to figure out if the differences we see between two groups are real or just random. This is really useful in research and data analysis when we want to compare two things to make sure our conclusions are reliable.
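Here is a hedged sketch of a two-sample t-test with SciPy. The pre-molt and post-molt samples below are randomly generated stand-ins, not the actual class data.

```python
import numpy as np
from scipy import stats

# Hypothetical pre-molt and post-molt crab sizes (illustration only)
rng = np.random.default_rng(4)
pre_molt = rng.normal(130, 10, size=50)
post_molt = rng.normal(145, 10, size=50)

# Two-sample t-test: is the difference in group means real or random?
t_stat, p_value = stats.ttest_ind(pre_molt, post_molt)
print("t-statistic:", t_stat, "p-value:", p_value)
```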

9/18 - Quadratic model and Overfitting

Quadratic model:

A quadratic model is a type of mathematical model used in statistics and many other fields to describe the relationship between a dependent and an independent variable by fitting a quadratic equation. It is a form of polynomial regression where the relationship between the variables is modeled as a quadratic function.

The general form of a quadratic model is as follows:

y = ax² + bx + c

In this equation:

  • y represents the dependent variable
  • x represents the independent variable
  • a, b, and c are constants, with a ≠ 0 (otherwise the model reduces to a linear one)

Quadratic models are applied when the relationship between the dependent and independent variables is not linear but rather follows a curved, U-shaped, or parabolic pattern. Once the model is fitted, it can be used for making predictions or for understanding the relationship between the variables.
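A minimal sketch of fitting a quadratic model with NumPy’s polyfit; the parabolic data and its true coefficients are invented for illustration.

```python
import numpy as np

# Synthetic data following a parabolic pattern (true a, b, c are arbitrary)
rng = np.random.default_rng(5)
x = np.linspace(-5, 5, 50)
y = 1.5 * x**2 - 2.0 * x + 3.0 + rng.normal(0, 2, size=50)

# Fit y = ax² + bx + c; polyfit returns coefficients highest-degree first
a, b, c = np.polyfit(x, y, deg=2)
print("a, b, c:", a, b, c)

# Use the fitted model to make a prediction at a new x value
x_new = 2.0
print("prediction at x = 2:", a * x_new**2 + b * x_new + c)
```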

Overfitting:

Overfitting occurs when a model learns the noise in the training data rather than the underlying pattern, leading to poor performance on unseen data. It often results from a model that is too complex or from insufficient data, and it leaves the model with minimal generalization to new, unseen data. Preventing overfitting includes simplifying the model, collecting more data, selecting relevant features, using regularization, and applying cross-validation.
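One way to see overfitting concretely is to compare training and test error for a low-degree versus a very high-degree polynomial fit; everything below is synthetic and purely illustrative.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures

# Synthetic quadratic data with noise
rng = np.random.default_rng(6)
x = rng.uniform(-3, 3, size=(40, 1))
y = x.ravel() ** 2 + rng.normal(0, 1, size=40)

x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.5, random_state=0)

for degree in (2, 15):
    feats = PolynomialFeatures(degree)
    model = LinearRegression().fit(feats.fit_transform(x_train), y_train)
    train_mse = mean_squared_error(y_train, model.predict(feats.transform(x_train)))
    test_mse = mean_squared_error(y_test, model.predict(feats.transform(x_test)))
    # The degree-15 fit typically has much lower training error but
    # higher test error: the signature of overfitting
    print(f"degree {degree}: train MSE {train_mse:.2f}, test MSE {test_mse:.2f}")
```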

Heteroscedasticity and Calculating the P-Value

Heteroscedasticity

  • Heteroscedasticity is a statistical term used in regression analysis.
  • It describes a situation where the variability of the errors (residuals) in a regression model is not constant across all levels of the independent variable(s).
Heteroscedasticity can take different forms or types:

1) Increasing Heteroscedasticity: In this form, the variance of the residuals increases as the values of the independent variables increase. This means that as we move along the predictor variable, the spread of the residuals becomes wider.

2) Decreasing Heteroscedasticity: In contrast to increasing heteroscedasticity, this form involves the variance of the residuals decreasing as the values of the independent variable increase. The spread of the residuals narrows as you move along the predictor variable.

3) U-shaped Heteroscedasticity: U-shaped heteroscedasticity occurs when the spread of residuals forms a U shape as you move along the independent variables. The variance of the residuals is not constant and varies in a systematic manner.

The Breusch-Pagan test

  • It is a statistical test applied in regression analysis to check for the presence of heteroscedasticity in a regression model.
  • The test determines whether the variance of the residuals is constant across all levels of the independent variables.

Null Hypothesis: There is no heteroscedasticity; the variance of the residuals remains constant.

Alternative Hypothesis: Heteroscedasticity is present; the variance of the residuals is not constant and may vary across observations.

  • If the P-value is greater than the given significance level, we fail to reject the null hypothesis.
  • On the other hand, if the P-value is less than the given significance level, we reject the null hypothesis and conclude that heteroscedasticity is present.
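A sketch of the Breusch-Pagan test using statsmodels, on synthetic data deliberately built with increasing heteroscedasticity; the variable names and values are assumptions.

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan

# Synthetic data where the error spread grows with x
# (increasing heteroscedasticity)
rng = np.random.default_rng(7)
x = rng.uniform(1, 10, size=200)
y = 2.0 * x + rng.normal(0, x)

# Fit an OLS regression, then test its residuals
X = sm.add_constant(x)
results = sm.OLS(y, X).fit()

# het_breuschpagan returns (LM stat, LM p-value, F stat, F p-value)
lm_stat, lm_pvalue, f_stat, f_pvalue = het_breuschpagan(results.resid, X)
print("Breusch-Pagan p-value:", lm_pvalue)  # small p-value => reject H0
```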

9/11 Linear regression update

Linear regression is basically finding the best-fitting line for the data points provided.

  • The equation for linear regression is Y = β0 + β1X1 + β2X2 + … + βnXn + ε
  • Where:
    • Y is the dependent variable.
    • X1, X2, …, Xn are the independent variables.
    • β0 is the intercept.
    • β1, β2, …, βn are the coefficients.
    • ε represents the noise.
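A small sketch fitting this equation with scikit-learn on made-up data with two independent variables; the generating coefficients are arbitrary assumptions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data with two independent variables X1, X2
rng = np.random.default_rng(8)
X = rng.uniform(0, 10, size=(100, 2))
y = 1.5 + 2.0 * X[:, 0] - 0.5 * X[:, 1] + rng.normal(0, 1, size=100)

# Fit Y = β0 + β1X1 + β2X2 + ε and recover the estimates
model = LinearRegression().fit(X, y)
print("Intercept (β0):", model.intercept_)
print("Coefficients (β1, β2):", model.coef_)
```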

Observations from the Dataset:

The dataset exhibits factors such as diabetes, obesity, and inactivity for all the states in the USA for a particular year, 2018. However, the numbers of samples for diabetes, obesity, and inactivity are not the same.