What to do with Not applicable, Don't know, Refusal (spontaneous) responses in survey data for regression

01 Apr 2022, 10:23

In the survey data I am using for my regression, most of the variables have one or more of the following responses: Not applicable, Don't know, Refusal, which are coded as out-of-range values.
E.g. one of my variables, self-reported job satisfaction is coded 1=very satisfied; 2=satisfied; 3=not very satisfied; 4=not at all satisfied; 8=DK/no opinion (spontaneous); 9=Refusal (spontaneous).

My question is how should I deal with such observations?
Should I just leave them as is, so that they are included in the regression? Or should I drop them from my dataset? Or does it depend on how large a proportion of the observations for that variable such responses make up?

Rich Goldstein

Join Date: Mar 2014
Posts: 4370

01 Apr 2022, 11:12

What it depends on is your research question and how these responses relate to it. Since you have told us nothing about that, there is no way to give good advice without writing a text.

Clyde Schechter

Join Date: Apr 2014
Posts: 29136

01 Apr 2022, 11:17

Leaving them as is and including them in regression is about the worst possible thing you could do. In effect, that would say that somebody who expressed no opinion or said he or she didn't know is twice as dissatisfied as somebody who said he or she is not at all satisfied! Clearly that is nonsense.
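To make the distortion concrete, here is a minimal sketch on a six-row toy dataset (the variable name jobsat and the data are made up for illustration, one respondent per response code). It only shows how the 8 and 9 codes drag a simple summary statistic; a regression coefficient would be affected by the same mechanism.

```stata
* Toy data: one respondent per response code (hypothetical, for illustration)
clear
input jobsat
1
2
3
4
8
9
end

* Codes 8 and 9 are treated as if they were real points on the scale
summarize jobsat                  // mean = 4.5, above the top of the 1-4 scale

* Restricting to the substantive answers gives a meaningful value
summarize jobsat if jobsat <= 4   // mean = 2.5
```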

What you should do is replace those values with Stata missing values. If you wish to specifically maintain the distinction between the DK/no opinion and Refusal categories, you can do something like this:

mvdecode list_of_applicable_variables, mv(8 = .d \ 9 = .r)

That will replace the values 8 and 9 in those variables by Stata's "extended" missing values .d and .r, respectively. And in all calculations, not just regressions, Stata will understand that these values are excluded. (Note, it doesn't have to be specifically .d and .r; I chose those because they have mnemonic value. Stata has 26 "extended" missing values, .a through .z, and you can use whichever ones you like for this.)
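As a rough sketch of how this plays out, the following applies the command to a toy variable and then attaches value labels to the extended missing values so the two kinds of nonresponse stay readable in tables. The variable name jobsat, the label name jobsat_lbl, and the data are assumptions for illustration only.

```stata
* Toy data (hypothetical): one respondent per response code
clear
input id jobsat
1 1
2 2
3 3
4 4
5 8
6 9
end

* Recode 8 and 9 to distinct extended missing values
mvdecode jobsat, mv(8 = .d \ 9 = .r)

* Extended missing values can carry value labels too
label define jobsat_lbl 1 "very satisfied" 2 "satisfied"      ///
    3 "not very satisfied" 4 "not at all satisfied"           ///
    .d "DK/no opinion" .r "Refusal"
label values jobsat jobsat_lbl

* The missing option shows the .d and .r rows in the table
tabulate jobsat, missing

* The distinction is still available when you need it
count if jobsat == .d
count if missing(jobsat)
```

Without the missing option, tabulate would drop the .d and .r rows, which is the same exclusion that estimation commands apply.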

Now, depending on your situation, you may or may not have any need to maintain the difference between those two categories, and it may be simpler to just lump them together as "missing." In that case, you can simply run

mvdecode list_of_applicable_variables, mv(8 9)

and both 8 and 9 will be replaced by Stata's "system" missing value (which shows up in listings as a period).
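On the original question of how large a share of the observations these responses make up, a quick check after the recode might look like the sketch below. The data are simulated and the variable name jobsat is hypothetical; by construction 15 of the 200 toy observations carry code 8 or 9.

```stata
* Simulated toy data (hypothetical): 200 respondents, 15 with codes 8 or 9
clear
set seed 2022
set obs 200
generate jobsat = ceil(4*runiform())   // substantive answers 1-4
replace jobsat = 8 in 1/10             // DK/no opinion
replace jobsat = 9 in 11/15            // Refusal

* Lump both codes into system missing
mvdecode jobsat, mv(8 9)

* Report how many observations are now missing
misstable summarize jobsat
count if missing(jobsat)
display "share missing: " %5.3f r(N)/_N
```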

Rich Goldstein

01 Apr 2022, 12:01

Note that Clyde Schechter and I have made different assumptions here: Clyde's response assumes you will enter that variable as a quantitative variable, while I assumed possible entry as a categorical variable. Note also that if you follow Clyde's advice and you have numerous missing values, you will need to do something about this (e.g., multiple imputation).
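For readers who have not used it, a bare-bones multiple imputation run might look like the sketch below. Everything in it is an assumption for illustration: the data are simulated, wage and age are hypothetical variables, and an ordered logit imputation model is just one plausible choice for a 1-4 scale. One detail worth knowing is that Stata's mi commands treat extended missing values (.a through .z) as "hard" missing and do not impute them, so the sketch recodes 8 and 9 to system missing.

```stata
* Simulated toy data (hypothetical variables: wage, age, jobsat)
clear
set seed 2022
set obs 500
generate age    = 20 + floor(45*runiform())
generate jobsat = ceil(4*runiform())
generate wage   = 30 - 2*jobsat + 0.2*age + rnormal(0, 3)
replace jobsat = 8 in 1/40             // DK/no opinion
replace jobsat = 9 in 41/60            // Refusal

* System missing (.) is "soft" missing, which mi impute will fill in
mvdecode jobsat, mv(8 9)

* Declare the mi structure and the roles of the variables
mi set wide
mi register imputed jobsat
mi register regular age wage

* Impute jobsat with an ordered logit model; include the analysis outcome
mi impute ologit jobsat age wage, add(5) rseed(2022)

* Fit the analysis model on each imputed dataset and combine the results
mi estimate: regress wage i.jobsat age
```

Whether imputation is defensible at all depends on why the values are missing, which ties back to the research question.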

Last edited by Rich Goldstein; 01 Apr 2022, 12:03.

Clyde Schechter

01 Apr 2022, 12:20

First, something weird happened here on Statalist. I was not aware of Rich Goldstein's response in #2 when I wrote what is now #3. But strangely, after I posted that reply, it showed up here as #2 and there was still no sign of Rich Goldstein's first response on this thread, even though it is timestamped earlier than mine! Anyway, this is the first I have seen of what he wrote here.

And he is right. I did assume you intend to treat the 1-4 response scale as a quantitative (ordinal or interval-level) variable. If you are going to treat these variables as categorical, then you might well preserve the coding as 8 and 9. You might even choose to do that for some analyses, but convert them to missing values for others, or to handle Refused one way and Don't know another way. It requires some thought. Consider the "Refused" response group. In one sense, they are a distinct group from those who responded somewhere on the 1 through 4 scale. In another sense, however, you might reason that they must have had some level of satisfaction that fits somewhere along that 1 to 4 scale--they're just withholding that information. From that perspective, this "Refused" group is actually a mixture of people from each of the 1 through 4 response options. By including that as a separate category, you may be biasing the estimates associated with the 1 through 4 categories themselves. So, it's complicated. (This same argument would not be so readily applicable to the Don't Know category as, if we take them at their word that they don't know, then they really aren't a mixture of 1 through 4 people who are just withholding the information.)
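To illustrate the options just described, here is a minimal sketch with simulated data. The variables wage and jobsat_b are hypothetical, and the simulated outcome has no substantive meaning. Version (a) keeps both codes as their own categories; version (b) keeps Don't know as a category but sets Refusal to missing.

```stata
* Simulated toy data (hypothetical variables: wage, jobsat)
clear
set seed 2022
set obs 300
generate jobsat = ceil(4*runiform())   // substantive answers 1-4
replace jobsat = 8 in 1/20             // DK/no opinion
replace jobsat = 9 in 21/30            // Refusal
generate wage = 30 - 2*min(jobsat, 4) + rnormal(0, 3)

* (a) Categorical treatment: 8 and 9 enter as their own levels
regress wage i.jobsat

* (b) Keep DK as a category, but exclude Refusal as missing
generate jobsat_b = jobsat
mvdecode jobsat_b, mv(9 = .r)
regress wage i.jobsat_b
```

In (b) the estimation sample shrinks by the 10 simulated refusals, since observations with missing values are dropped from the regression.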

My point is, it's complicated!