Error in Calculating RMSE in Linear Regression

DavoFromDapto
Posts: 5
Joined: Wed Jun 02, 2021 4:38 am

Post by DavoFromDapto »

The web version has an error in its calculation of the root mean squared error (RMSE), which is reported as a measure of fit in the linear regression procedure. Presumably the same error is also in the current download version?

The usual formula is RMSE = sqrt( Sigma e_i ^2 / (n - p - 1) )

I can replicate the jamovi calculations by using n, the sample size, as the divisor in the formula above rather than (n - p - 1), where p is the number of predictors. The degrees-of-freedom divisor, (n - p - 1), gives an unbiased estimate of the error variance.
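For anyone who wants to check the two divisors side by side, here is a minimal sketch in plain Python. It assumes a single predictor (p = 1) and made-up data, and the helper names (fit_simple_ols, rmse_both_ways) are hypothetical; it is not jamovi's code.

```python
import math

def fit_simple_ols(x, y):
    """Least-squares intercept and slope for a single predictor."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

def rmse_both_ways(x, y, p=1):
    """Return (RMSE with divisor n, RMSE with divisor n - p - 1)."""
    intercept, slope = fit_simple_ols(x, y)
    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    sse = sum(e ** 2 for e in residuals)  # Sigma e_i^2
    n = len(y)
    rmse_n = math.sqrt(sse / n)             # divisor n
    rmse_df = math.sqrt(sse / (n - p - 1))  # divisor n - p - 1
    return rmse_n, rmse_df

# Illustrative data only (not taken from jamovi or any real study).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
rmse_n, rmse_df = rmse_both_ways(x, y)
print("divisor n:        ", rmse_n)
print("divisor n - p - 1:", rmse_df)
```

The two results come from the same sum of squared residuals, so the value with divisor (n - p - 1) is always the larger one, and the gap shrinks as n grows relative to p.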

Is it possible that the same error occurs in some of the other linear modeling routines?
jonathon
Posts: 2613
Joined: Fri Jan 27, 2017 10:04 am

Re: Error in Calculating RMSE in Linear Regression

Post by jonathon »

i've just taken a quick look, and it looks as though we're calculating RMSE the expected way.

lots of resources seem to use n as the denominator.

cheers

jonathon
DavoFromDapto
Posts: 5
Joined: Wed Jun 02, 2021 4:38 am

Re: Error in Calculating RMSE in Linear Regression

Post by DavoFromDapto »

Jonathon,
Thanks for your quick response.

All the major stats software packages use (n - p - 1) to calculate RMSE in regression. The list includes SPSS, Stata, SAS, lm() in R, Excel, and SHAZAM. Surely this list makes the use of (n - p - 1) "the expected way"?

Which major stats software packages use n as the divisor for RMSE in regression?

If n is used in jamovi, perhaps there should be a note somewhere alerting users that it takes a different approach, in case they try to replicate the results in another package.

--Davo
jonathon
Posts: 2613
Joined: Fri Jan 27, 2017 10:04 am

Re: Error in Calculating RMSE in Linear Regression

Post by jonathon »

let me hand this over to ravi.

jonathon
Ravi
Posts: 194
Joined: Sat Jan 28, 2017 11:18 am

Re: Error in Calculating RMSE in Linear Regression

Post by Ravi »

Hi Davo,

So what we do is literally take the Root of the Mean of the Squared Errors (aka RMSE), see https://github.com/jamovi/jmv/blob/04b9 ... g.b.R#L598. I think the confusion is that the RMSE is used as an estimator of the residual standard error. And you are right that the RMSE is a biased estimator of the residual standard error, and that the way you calculate it is an unbiased estimator. However, I think it would be undesirable to use the name "RMSE" for the unbiased estimator of the residual standard error because it's not actually giving you the Root Mean Squared Error. This is also why lm in R calls it "residual standard error" and SPSS calls it "Std. Error of the Estimate" and not RMSE.
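To put the two quantities next to each other, here is a short sketch in LaTeX. The symbols are notation chosen for illustration: e_i are the residuals, n is the sample size, p is the number of predictors, and s stands for what R labels the "residual standard error".

```latex
\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n} e_i^{2}},
\qquad
s = \sqrt{\frac{1}{n-p-1}\sum_{i=1}^{n} e_i^{2}},
\qquad
s = \mathrm{RMSE}\,\sqrt{\frac{n}{n-p-1}}
```

The last identity means either figure can be recovered from the other as long as n and p are known.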

Cheers,
Ravi
DavoFromDapto
Posts: 5
Joined: Wed Jun 02, 2021 4:38 am

Re: Error in Calculating RMSE in Linear Regression

Post by DavoFromDapto »

Ravi and Jonathon,
Perhaps it may be helpful to disentangle two related issues.

1. the formula for the RMSE -- jamovi uses n as the divisor, while every other major package uses (n - p - 1). Do you know of another package that uses n? Even though they use different labels for many features of their output, they all use the same formula.

2. jamovi and full disclosure. When jamovi departs from common statistical practice, it should disclose that to the user. Put differently, how is the user to know what jamovi has done? SAS and Stata label it the "RMSE" and so many users will conclude that when jamovi labels it "RMSE", it's computing the same thing as SAS and Stata. That confusion creates difficulties for replication.

I have a solution to the issue: Why not use n - p - 1 as the divisor, and then, if you object to calling that quantity the RMSE (like SAS and Stata label it), use the R label for it: "residual standard error."

My larger concern is that there may be other stats procedures where jamovi departs from standard statistical practice, and users are blissfully unaware of it. I think jamovi has an obligation to advise users about these departures.

Jamovi is a great idea; I like what you've done, and I know how much hard work it has taken. However, as a matter of strategy, consider making the jamovi default selections replicate the major / well established stats packages (IBM-SPSS, SAS, Stata ... ) in order to promote confidence in the numerical accuracy, and make it easier for users to transition to jamovi.

--Davo
jonathon
Posts: 2613
Joined: Fri Jan 27, 2017 10:04 am

Re: Error in Calculating RMSE in Linear Regression

Post by jonathon »

there's often a tension between doing it the "correct way", and doing it the way that others do it ... when the way others do it is demonstrably wrong.

you've made the case for doing it "the wrong way", but there's also a case for doing it "the right way".

one group of users may expect RMSE to match stata and sas, but another group of users will expect it to match wikipedia.

i'm not sure what the best way to handle this is ... but it's an issue with multiple facets. your suggestion to add a footnote explaining what other software does isn't unreasonable.

the best solution in most of these cases is to have sas, stata, et al. add footnotes to their software ...

cheers

jonathon
jonathon
Posts: 2613
Joined: Fri Jan 27, 2017 10:04 am

Re: Error in Calculating RMSE in Linear Regression

Post by jonathon »

conspicuously absent from this discussion is whether the denominator n-p-1 actually provides a more useful measure :P

jonathon
DavoFromDapto
Posts: 5
Joined: Wed Jun 02, 2021 4:38 am

Re: Error in Calculating RMSE in Linear Regression

Post by DavoFromDapto »

Jonathon,
You pose the dilemma: "one group of users may expect RMSE to match stata and sas, but another group of users will expect it to match wikipedia." To resolve this dilemma I'd benchmark against Stata and SAS [and SPSS, R, Excel, ...] and use (n - p - 1), rather than against Wikipedia, because major stats packages have higher standards of statistical rigor than Wikipedia, and they reflect (and define) current statistical practice.

Apparently, you don't accept that jamovi has an obligation to inform users when it calculates a measure differently from all other major stats packages. As far as I know, no stats package other than jamovi uses n as the divisor for RMSE.

You ask: Is the version with (n - p - 1) a more useful measure?

1. The divisor (n - p - 1) yields the unbiased estimator of Var[e] (sigma^2), and so it has "nice", arguably superior, statistical properties.

2. The "RMSE" is superior to the other measures of goodness of fit used in regression (such as R^2 and the adjusted R^2) because it is calibrated in the units of Y, the dependent variable. (By the way, the adjusted R^2 uses (n - p - 1) to estimate Var[e].)
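The unbiasedness claim in point 1 can be checked with a small simulation. The sketch below is an illustration under stated assumptions, not jamovi's code: one predictor (p = 1), normal errors with a known sigma, a fixed seed, and hypothetical function names. Note that unbiasedness holds for the variance estimate; the square root of an unbiased variance estimate is not itself exactly unbiased for sigma.

```python
import random

def sse_simple_ols(x, y):
    """Sum of squared residuals from a one-predictor least-squares fit."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))

def average_variance_estimates(n=10, sigma=2.0, reps=5000, seed=1):
    """Average SSE/n and SSE/(n - p - 1) over many simulated samples."""
    rng = random.Random(seed)
    p = 1  # one predictor
    x = [float(i) for i in range(n)]
    total_n = 0.0
    total_df = 0.0
    for _ in range(reps):
        # True model (assumed for the simulation): y = 1 + 0.5 x + e, e ~ N(0, sigma^2)
        y = [1.0 + 0.5 * xi + rng.gauss(0.0, sigma) for xi in x]
        sse = sse_simple_ols(x, y)
        total_n += sse / n
        total_df += sse / (n - p - 1)
    return total_n / reps, total_df / reps

mean_var_n, mean_var_df = average_variance_estimates()
print("true error variance:       ", 2.0 ** 2)
print("mean of SSE / n:           ", mean_var_n)
print("mean of SSE / (n - p - 1): ", mean_var_df)
```

With these settings the average of SSE / (n - p - 1) should land close to the true variance of 4, while the average of SSE / n should sit near 4 * (n - p - 1) / n = 3.2, that is, biased downward.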

It seems I've failed to convince you and Ravi on the statistical issues.

--Davo
jonathon
Posts: 2613
Joined: Fri Jan 27, 2017 10:04 am

Re: Error in Calculating RMSE in Linear Regression

Post by jonathon »

it sounds as though we disagree on several points ... but there does seem to be merit in using n-p-1 ... so we'll look into that a bit more. i'm more persuaded by the argument that it's a better measure than by the argument that other stats packages are using it. in our experience stats software (admittedly, mostly spss) is often a complex mess of bad design decisions.

jonathon