Unbalanced Data, Continuous vs Categorical Coding


by Whirly123 » Fri Jul 17, 2020 8:24 pm

I am running a generalized mixed model with GAMLj in Jamovi and have an issue with unbalanced data. First, the experiment looks like this:

Participants take a personality measure and then play 50 rounds of a repeated dictator game in which they are receivers. In each round they are offered one of three payouts: 0 points, 25 points, or 50 points (bad, medium, and good). They can react to any of these by Punishing the dictator or by Switching to a new dictator.

My long-format data look something like this; each row represents a round:
[Image: dat.jpg]


So I have a mixed-model logistic regression predicting Punish or Switch that looks like this:
Punish or Switch ~ Personality * Round Payout + (1 + Personality | Subject)

If I code Round Payout as continuous, I do not get a significant interaction effect, but if I code it as categorical, I do. The results can be seen here: the interaction is significant for Medium payouts (in the predicted direction, and there are strong theoretical reasons why it would show up specifically for medium payouts).
[Image: Plot.jpg]
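
For readers following along, here is a minimal Python sketch of how the two codings of Round Payout differ in the fixed-effects part of the model. It is only an illustration: the function names, column names, and example values are hypothetical, and it uses plain dummy (treatment) coding with the bad payout as the reference level, which is an assumption and may not match the contrast scheme GAMLj applies by default.

```python
# Illustration only: names and values below are hypothetical, not from the data.

def continuous_terms(personality, payout):
    """Payout treated as a number: a single interaction column."""
    return {
        "personality": personality,
        "payout": float(payout),
        "personality:payout": personality * payout,
    }

def categorical_terms(personality, payout, levels=(0, 25, 50), reference=0):
    """Payout treated as a factor: one dummy and one interaction column
    per non-reference level (dummy coding assumed for this sketch)."""
    terms = {"personality": personality}
    for level in levels:
        if level == reference:
            continue
        indicator = 1.0 if payout == level else 0.0
        terms[f"payout{level}"] = indicator
        terms[f"personality:payout{level}"] = personality * indicator
    return terms

# One hypothetical round: personality score 0.8, medium payout (25 points).
print(continuous_terms(0.8, 25))
print(categorical_terms(0.8, 25))
```

With the continuous coding there is one interaction coefficient, so the effect of Personality is forced to change linearly from 0 to 25 to 50 points. With the categorical coding there are two interaction coefficients, so the effect of Personality at medium payouts can differ from the other levels without following a straight line.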


We are concerned that what prevents us from reaching significance when payout is coded as continuous is how unbalanced the data are. We have 740 data points for bad payouts, 280 for medium payouts, and only 24 for good payouts. This is understandable: one would expect most people to be happy to continue to the next round after a good payout and to be most unhappy after a bad payout. With this in mind, could anyone provide some recommendations on how to approach this and what is appropriate here?
Whirly123
 
Posts: 22
Joined: Mon May 06, 2019 3:07 pm

by mcfanda@gmail.com » Sun Jul 19, 2020 1:54 pm

Yes, I would agree with your reasoning. If the "group" of data points for "good payouts" is much smaller than the others, then when you code the variable as continuous its effect will be driven mostly by the bad and medium payouts, because that is where the majority of the points are. When it is coded as categorical, the means of the three groups are evaluated even if one group is much smaller than the others, so the weight of the good payout points becomes stronger. I would leave it as categorical, which is indeed the nature of your variable.
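
The point about which groups drive the continuous effect can be sketched numerically. The example below is a simplification and not the logistic mixed model itself: it uses ordinary least squares, where the slope on a predictor with only three distinct values is a weighted sum of the three group means. The group counts are the ones reported in the first post; the balanced counts and the function names are hypothetical and chosen only for comparison.

```python
def slope_weights(x, counts):
    """Weight each group's mean receives in the least-squares slope.

    For a predictor with a few distinct values x_i and n_i points at each,
    slope = sum_i w_i * mean_i, where
    w_i = n_i * (x_i - xbar) / sum_j n_j * (x_j - xbar) ** 2.
    """
    total = sum(counts)
    xbar = sum(n * xi for n, xi in zip(counts, x)) / total
    denom = sum(n * (xi - xbar) ** 2 for n, xi in zip(counts, x))
    return [n * (xi - xbar) / denom for n, xi in zip(counts, x)]

def slope(x, counts, means):
    """Least-squares slope implied by the group means and group sizes."""
    return sum(w * m for w, m in zip(slope_weights(x, counts), means))

payout = [0, 25, 50]               # bad, medium, good
counts_thread = [740, 280, 24]     # counts reported in the first post
counts_balanced = [348, 348, 348]  # hypothetical balanced design, same total

print("weights, thread counts:  ", slope_weights(payout, counts_thread))
print("weights, balanced counts:", slope_weights(payout, counts_balanced))
```

With the counts from the thread, the good payout group receives the smallest weight in absolute value, so the fitted slope mostly reflects the contrast between bad and medium payouts. With balanced counts, the slope is instead determined by the two end groups (bad versus good), and the medium group gets a weight of zero.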
mcfanda@gmail.com
 
Posts: 207
Joined: Thu Mar 23, 2017 9:24 pm

by Whirly123 » Sun Jul 19, 2020 6:13 pm

So you are saying to make it categorical and remove the good payout data, right?

