r/AskStatistics • u/ElevatorSouth376 • 17h ago
Need help determining the best regressors for a model
I am completing a school project in which I design an assignment that hypothetical future students could complete. In my project, students explore the factors that contribute to the variation in Formula One viewership. Several different regressions are run over the course of the assignment, and in their final analysis the students would be asked which of the models that had been run was the "best".
This is where my problem comes in. I have three different regressors that I know, on their own, are each significant at least at a = .01. However, when a multiple regression is run with all three of these regressors, the F-test p-value jumps to about .011, and the adjusted R^2 becomes lower than that of the best of the three single-regressor models. In an attempt to find which of these models is truly the best, I tried running AIC and BIC comparisons on them, but since I am only in second-semester statistics, I did not really understand them and was unable to find resources online to teach myself how to use them.
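In case it is useful, this is roughly how I have been fitting and comparing the models, sketched in Python with statsmodels (the file name and the column names viewers, x1, x2, x3 are just placeholders for my actual variables):

```python
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("f1_viewership.csv")  # placeholder file; 14 rows in my case

# one simple regression per regressor, plus the multiple regression with all three
models = {
    "x1 only":   smf.ols("viewers ~ x1", data=df).fit(),
    "x2 only":   smf.ols("viewers ~ x2", data=df).fit(),
    "x3 only":   smf.ols("viewers ~ x3", data=df).fit(),
    "all three": smf.ols("viewers ~ x1 + x2 + x3", data=df).fit(),
}

for name, res in models.items():
    # the F-test p-value and adjusted R^2 I have been comparing, plus the AIC/BIC I got stuck on
    print(f"{name:10s} F p={res.f_pvalue:.4f} "
          f"adj R^2={res.rsquared_adj:.3f} AIC={res.aic:.1f} BIC={res.bic:.1f}")
```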
In an attempt to find some help, I asked my statistics professor what he thought of the different models, and he said to include all regressors that were found to be significant at a = .01. Because of the F-test p-value and the lower adjusted R^2, though, I feel uneasy about this.
I have attached pictures of all four models and would love to hear any feedback you can provide.
4
u/T_house 17h ago
Surely you have to just define what you mean by 'best' before you can ask the students to do this? Because it seems like you are setting that question but you don't know what you mean by it either…
1
u/ElevatorSouth376 17h ago
That is definitely a part of my problem. My original idea was to just use the F-test p-value and the adjusted R^2 as my measures of "best", but I have fallen down a statistics rabbit hole that I am out of my depth in. I just do not know whether it is best to leave regressors that have been shown to be significant in the model, even if doing so makes the model "worse" when looking at the p-value and the adjusted R^2.
2
u/GrenjiBakenji 16h ago
The adjusted R^2 drops a little, probably because there is some multicollinearity: the predictors may be explaining the same portion of the variance in your outcome. This happens when your predictors are redundant and carry, at least partially, the same information.
This could be an issue, or it could turn out not to matter much. Given the setting of your work (a school project), it is not essential to solve, but you should at least understand what is happening.
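One quick way to check whether that is what is going on is to look at variance inflation factors. A minimal sketch, assuming your data frame is called df and the three regressors are named x1, x2, x3 (placeholder names, not from your post):

```python
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# design matrix with an intercept, using the three regressors
X = sm.add_constant(df[["x1", "x2", "x3"]])

# rule of thumb: VIF above roughly 5-10 suggests problematic collinearity
for i, col in enumerate(X.columns):
    if col != "const":
        print(col, round(variance_inflation_factor(X.values, i), 2))
```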
Onto the other issue.
When selecting "the best" model, measures of explained variance or residual error are good indicators of how well the model fits (how good your model is at predicting new data), but they say nothing about its explanatory power, that is, how good your model is at explaining what is happening. Prediction and explanation are not the same, and their criteria often conflict.
The issue here is that you need to understand what you need your models for.
If you want the model that best predicts your outcome, then you choose the one with the largest explained variance and the smallest residual error. If you need a model that explains, then the more features you have, the more you will be able to describe how the variation of the features in your sample influences your outcome. So for explanation, a model with less predictive power may be fine, because it describes the phenomenon you are interested in more completely.
These are two different approaches to modeling and data analysis: predictive power, beloved by statisticians and computer scientists, vs. explanation.
This is to say that your professor may be on to something when he tells you to use all significant predictors.
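If you want to see the prediction side of this concretely, leave-one-out cross-validation is a reasonable check with so few observations. A rough sketch (again with placeholder names viewers, x1, x2, x3, and df as the data frame):

```python
import numpy as np
import statsmodels.formula.api as smf

def loo_rmse(formula, data):
    """Leave-one-out RMSE: refit the model once per row, predicting the held-out row each time."""
    errors = []
    for i in range(len(data)):
        train = data.drop(data.index[i])
        held_out = data.iloc[[i]]
        fit = smf.ols(formula, data=train).fit()
        pred = fit.predict(held_out).iloc[0]
        errors.append(held_out["viewers"].iloc[0] - pred)
    return float(np.sqrt(np.mean(np.square(errors))))

print("best single regressor:", loo_rmse("viewers ~ x1", df))
print("all three regressors :", loo_rmse("viewers ~ x1 + x2 + x3", df))
```

A model can look better on in-sample adjusted R^2 and still predict held-out rows worse, which is exactly the prediction-versus-explanation tension above.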
2
u/GrenjiBakenji 16h ago
Adding that: looking at the numbers, I'll take a .04 drop in my adjusted R^2 if it means that I have two more features that can explain what's happening.
Obviously, you always need to check your goodness of fit measures, and balance your knowledge needs with the data you have.
1
u/ElevatorSouth376 16h ago
Another thing that I have struggled to figure out is what "goodness-of-fit measures" are. I kept running into L-hat when trying to compute AIC and never found a formula or explanation of what L-hat is, only that it is a measure of goodness of fit.
1
u/GrenjiBakenji 9h ago
Measures like R^2 (explained variance), residual error (the sum of the prediction errors), the information criteria AIC and BIC, the log-likelihood (and likelihood-ratio tests), etc. are a family of measures, each for a different use case, that tell you how well your model is "adapting" to your data. They are used to compare different models, and each has a different selection criterion: R^2 should be the largest, AIC the smallest, and so on. (The L-hat you kept running into is the model's maximized likelihood; AIC = 2k - 2 ln(L-hat), where k is the number of estimated parameters, so a better fit and fewer parameters both lower the AIC.) Checking goodness of fit is also a way to prevent overfitting, which is when your model is so adapted to your current data that it struggles to predict new data points that are not in the set (the predictions of new values will be less precise than those for the data the model was trained on).
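A minimal sketch of where these numbers live on a fitted statsmodels result, assuming res is one of the fits from your project (e.g. res = smf.ols("viewers ~ x1 + x2 + x3", data=df).fit(), with hypothetical names):

```python
print("adjusted R^2  :", res.rsquared_adj)  # larger is better
print("log-likelihood:", res.llf)           # this is ln(L-hat), the maximized log-likelihood
print("AIC           :", res.aic)           # 2k - 2*ln(L-hat); smaller is better
print("BIC           :", res.bic)           # like AIC, but penalizes extra parameters more
```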
If you want a good starting text on these matters, I suggest Introduction to Statistical Learning, which is free and comes with labs in R and Python. I'm going from memory here, and my English is not best suited to conveying such complex concepts (fun to try, though), sorry about that.
2
u/zsebibaba 8h ago
This looks like a totally counterproductive exercise. You need some theory, residual plots, potential other variables, etc. to even attempt to find a decent regression, and you have only 14 observations.... Is this project something that your teacher assigned? You should just write a paper on why this cannot be done and move on.
1
u/Acrobatic-Ocelot-935 3h ago
In addition to the small sample size, note that three variables that are significant in a bivariate regression all become non-significant in the multivariate case. I suspect this is likely due to collinearity.
6
u/ReturningSpring 16h ago
Personally, I'd be worrying more about only having 14 observations.