Checking survey data (2): change variable type and inspect missingness

4 minute read

In my last post in this series, I walked through the first few steps I usually take when getting acquainted with survey data. A few issues remained outstanding, including mis-categorized variable types and inconsistent missing values.
In this post, I'll address these two problems as I continue checking the data.

import numpy as np
import pandas as pd
df = pd.read_parquet('../output/df') # read the dataset back in from the parquet file

Change variable type

List all the columns in the dataset.

df.columns
Index(['self', 'employee', 'type', 'IT', 'benefit', 'options', 'discussion',
       'learn', 'anoymity', 'leave', 'mental_employer', 'physical_employer',
       'mental_coworker', 'mental_super', 'as_serious', 'consequence',
       'coverage', 'resources', 'reveal_client', 'impact', 'reveal_coworker',
       'impact_neg', 'productivity', 'time_affected', 'ex', 'coverage_ex',
       'options_ex', 'discussion_ex', 'learn_ex', 'anoymity_ex',
       'mental_employer_ex', 'physical_employer_ex', 'mental_coworker_ex',
       'mental_super_ex', 'as_serious_ex', 'consequence_ex',
       'physical_interview', 'physical_interview_y', 'mental_interview',
       'mental_interview_y', 'hurt', 'negative', 'share', 'unsupport',
       'experience', 'history', 'past', 'mental_disorder',
       'conditions_diagosed1', 'conditions_believe', 'diagnose',
       'conditions_diagosed2', 'treatment', 'interfere',
       'interfere_nottreated', 'age', 'gender', 'country_live', 'state',
       'country_work', 'state_work', 'position', 'remote', 'ID'],
      dtype='object')

Using df.dtypes and df['variable'].unique() in the last post, we confirmed each variable's type and its unique levels/values. For example, for the “whether you have any mental health coverage” question below, the data type shows a float, yet the variable only takes the values 1, 0, and nan (missing), so it should really be a categorical variable.
Note that although it doesn’t display as we intended, this is not necessarily an error - Python simply treats nan as a float.

df['coverage'].dtypes
dtype('float64')
df['coverage'].unique()
array([nan,  1.,  0.])

What about checking all variables in the dataset at the same time?

df.dtypes[0:10]
self            int64
employee       object
type          float64
IT            float64
benefit        object
options        object
discussion     object
learn          object
anoymity       object
leave          object
dtype: object
df_variables_list = df.columns.to_list()
variable_values_list = [df[col].unique() for col in df_variables_list] # using a list comprehension here instead of a for loop
variable_values_list[0:10]
[array([0, 1], dtype=int64),
 array(['26-100', '6-25', None, 'More than 1000', '100-500', '500-1000',
        '1-5'], dtype=object),
 array([ 1., nan,  0.]),
 array([nan,  1.,  0.]),
 array(['Not eligible for coverage / N/A', 'No', None, 'Yes',
        "I don't know"], dtype=object),
 array([None, 'Yes', 'I am not sure', 'No'], dtype=object),
 array(['No', 'Yes', None, "I don't know"], dtype=object),
 array(['No', 'Yes', None, "I don't know"], dtype=object),
 array(["I don't know", 'Yes', None, 'No'], dtype=object),
 array(['Very easy', 'Somewhat easy', 'Neither easy nor difficult', None,
        'Very difficult', 'Somewhat difficult', "I don't know"],
       dtype=object)]
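If scanning two separate outputs feels clumsy, the dtype and unique values of every column can also be pulled into a single summary table. The sketch below (using the hypothetical name summary, and assuming the same df as above) is just one way to do this:

# collect dtype, number of unique values, and the values themselves for each column
summary = pd.DataFrame({
    'dtype': [df[col].dtype for col in df.columns],
    'n_unique': [df[col].nunique(dropna=False) for col in df.columns],
    'values': [df[col].unique() for col in df.columns],
}, index=df.columns)
summary.head(10)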

After checking the data types and values of all variables, we can decide which variables should actually be categorical vs. numerical vs. text. In Python, these map to integer, float, or string types. For survey data, the string type is usually reserved for open-ended question text. We can use .astype('type') to modify variable types.

# converting variable types
shouldbeInt = ['type', 'IT', 'coverage']
for col in shouldbeInt:
    df[col] = df[col].astype('Int64') # note the capitalized Int64 instead of int64 - pandas 0.24+ offers this nullable integer type, which allows missing values

Check whether one of the variables has been converted correctly:

df['type'].value_counts(dropna=False).sort_index()
0      263
1      883
NaN    287
Name: type, dtype: int64
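We could also confirm all three columns at once (nothing new here beyond the shouldbeInt list defined above):

df[shouldbeInt].dtypes # should now report Int64 for type, IT and coverage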

Some variables simply have the response-option text recorded as their values, such as the scale question below, which asks “If a mental health issue prompted you to request a medical leave from work, asking for that leave would be”:

df['leave'].unique()
array(['Very easy', 'Somewhat easy', 'Neither easy nor difficult', None,
       'Very difficult', 'Somewhat difficult', "I don't know"],
      dtype=object)

The object dtype suggests that it is a string variable. That is true for now, since the values are text only, but we want it to be a categorical integer variable, which we can get by recoding:

# create a new variable for the recode so we can compare it with the original; recoding in place also works if you are comfortable with it
df['leave_r'] = 0 # the default value 0 covers both the "I don't know" and NaN responses
df.loc[df['leave'] == 'Very easy', ['leave_r']] = 1
df.loc[df['leave'] == 'Somewhat easy', ['leave_r']] = 2
df.loc[df['leave'] == 'Neither easy nor difficult', ['leave_r']] = 3
df.loc[df['leave'] == 'Somewhat difficult', ['leave_r']] = 4
df.loc[df['leave'] == 'Very difficult', ['leave_r']] = 5
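As an aside, the same recode can be written more compactly with a mapping dictionary. This is only an alternative sketch (the names leave_map and leave_r2 are mine, not part of the dataset):

# alternative recode via a dict; values not in the dict (NaN and "I don't know") map to NaN, then fillna(0) sets them to 0
leave_map = {'Very easy': 1, 'Somewhat easy': 2, 'Neither easy nor difficult': 3,
             'Somewhat difficult': 4, 'Very difficult': 5}
df['leave_r2'] = df['leave'].map(leave_map).fillna(0).astype('Int64')
(df['leave_r2'] == df['leave_r']).all() # sanity check: both recodes should agree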

Compare the newly recoded variable with the original one:

df['leave_r'].value_counts().sort_index()
0    437
1    220
2    281
3    178
4    199
5    118
Name: leave_r, dtype: int64
df['leave'].value_counts(dropna=False)
NaN                           287
Somewhat easy                 281
Very easy                     220
Somewhat difficult            199
Neither easy nor difficult    178
I don't know                  150
Very difficult                118
Name: leave, dtype: int64
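To see the mapping explicitly rather than eyeballing two frequency tables, a cross-tabulation of the original and the recoded variable makes any mis-recoded category obvious (rows where leave is NaN are not shown by crosstab, so they still need the dropna=False check above):

# cross-tabulate original vs. recoded values; each original level should fall into exactly one recode value
pd.crosstab(df['leave'], df['leave_r'])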

Different missingness

leave_r is a quick-and-dirty cleaned-up version of the original string variable. It’s worth noting that in leave_r the value 0 lumps together the original NaN values and the respondents who answered I don't know to this question. There is a difference between the two.
In the last post, I illustrated with a bar plot that more than one variable has 287 missing values, including this leave variable. Looking at the list of questions, we can infer the logic that respondents who indicated they were self-employed (variable self) in the first question were not asked this question.
In other words, we can distinguish between intentional skips (I don’t know) and programming skips (e.g. if Q1 = 2, skip to Q3).

df['self'].value_counts()
0    1146
1     287
Name: self, dtype: int64
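If the skip logic holds, the 287 missing values of leave should belong exactly to these 287 self-employed respondents. A quick check of that assumption:

# count missing vs. non-missing leave responses within each self group
pd.crosstab(df['self'], df['leave'].isna())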

How to proceed?
We could code the programming skips as -1, so that when looking at the frequencies of a variable we know how many respondents were not eligible for the question. For other analyses, we could simply ignore this difference and use self to subset the data (e.g. subsetting to self = 0 keeps the 1146 respondents who have a non-missing value for leave).
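As a sketch of both options (the subset name df_eligible is mine), the programming skips could be flagged with -1, or an analysis could be restricted to the eligible respondents:

# option 1: mark programming skips with -1 so they stay distinguishable from "I don't know" (0)
df.loc[df['self'] == 1, 'leave_r'] = -1

# option 2: keep only respondents who were actually asked the question
df_eligible = df[df['self'] == 0]
df_eligible['leave_r'].value_counts().sort_index()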