Working with survey data (3): more data cleaning

6 minute read

I originally wanted to move on to descriptive statistics, but realized there is actually a lot more data cleaning work that needs to be done and is worth sharing. This is usually the case in the data science work I'm involved in, where about 80% of my time goes into some sort of cleaning to get the data into the desired format for analyses.
Therefore, in this post, I'll cover some techniques and handy functions for specific types of data cleaning.
Check out my first post and second post in the same series.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import pycountry_convert as pc
df = pd.read_parquet('../output/df')

Suppose I now have a more targeted analysis goal, which is to look at how employer policies and the workplace environment affect workers' mental health. In other words, I want to exclude those who are self-employed and consider only variables related to the employer and workplace.

Subsetting data

On the row: There are many ways to subset data, and I'm used to what's listed below: put the row filter condition inside the .loc square brackets and we are good to go.

df_s = df.loc[df['self']==0]
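The same filter can also be written with the boolean mask as a named intermediate step, which I find easier to read once a filter involves several conditions:

# Equivalent: build the mask first, then index with it
mask = df['self'] == 0
df_s = df[mask]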

On the column: to select columns/variables, just use double square brackets [['variable name']]:

df_s = df_s[['ID', 'diagnose', 'employee', 'type', 'benefit', 'discussion', 'learn', 
             'anoymity', 'leave', 'mental_employer', 'physical_employer', 'mental_coworker', 'mental_super', 
             'as_serious', 'consequence', 'age', 'gender', 'country_live', 'state', 'position', 'remote']]
df_s.shape
(1146, 21)
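Note that .loc can also do the row filter and the column selection in one call; a quick sketch with just a few of the columns for brevity:

# Row condition and column list in a single .loc call
df.loc[df['self'] == 0, ['ID', 'diagnose', 'remote']].head()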

Check raw frequencies

After getting the data into the shape I want, I usually check the raw frequencies of the variables to see if there's anything unusual that needs my attention.

I like Python lists and think they are great for storing and organizing different objects. Here I take out the variables that are especially relevant to employer policies and the workplace environment and save them in a list called employer_cols; the demographic variables are saved in the demographic_cols list.

employer_cols = ['employee', 'type', 'benefit', 'discussion', 'learn', 'anoymity', 'leave', 
                 'mental_employer', 'physical_employer', 'mental_coworker', 'mental_super', 
                 'as_serious', 'consequence']

demographic_cols = ['remote', 'age', 'gender', 'country_live', 'state', 'position']

A quick for loop with value_counts() gives some raw frequency output:

for col in employer_cols:
    print(f"--------------frequency table of {col}: ---------------------")
    print(df_s[col].value_counts(dropna=False).sort_index())
    if col == employer_cols[0]:  # end the loop early here to save space; remove the if/break to see all variables
        break
--------------frequency table of employee: ---------------------
1-5                60
100-500           248
26-100            292
500-1000           80
6-25              210
More than 1000    256
Name: employee, dtype: int64
for col in demographic_cols:
    print(f"--------------frequency table of {col}: ---------------------")
    print(df_s[col].value_counts(dropna=False).sort_index())
    if col == demographic_cols[0]:  # same here; remove the if/break to get the full frequency results
        break
--------------frequency table of remote: ---------------------
Always       217
Never        318
Sometimes    611
Name: remote, dtype: int64
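Rather than ending the loop early, you could also stack all the frequencies into a single output; here's one sketch that melts the columns into (variable, value) pairs and counts each combination:

# Stack every employer variable into (variable, value) pairs and count each pair
freq = (df_s[employer_cols]
        .melt(var_name='variable', value_name='value')
        .groupby(['variable', 'value'], dropna=False)
        .size())
print(freq)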

Finally, the diagnose variable is the target of interest: whether respondents have ever been diagnosed with a mental health condition.

df_s['diagnose'].value_counts()
No     579
Yes    567
Name: diagnose, dtype: int64

Cleaning: numeric variables

After reviewing the raw frequencies, I noticed that age has some extreme values. Let’s take a look:

df['age'].mean()
34.28611304954641
sns.set(style="ticks", font_scale=1.2)
plt.figure(figsize=(8,4))
ax = sns.boxplot(x=df_s["age"])
sns.despine()
plt.title('Boxplot: age')
plt.show()

[Figure: boxplot of age]

Though the average age is around 34, there are some outliers, as the boxplot above suggests. Here we can decide on a reasonable range for this numeric variable. For age, and for this particular study topic, I would limit the range to [15, 90].
Usually when we design surveys and program the web instrument, we add value checks like this to prompt respondents to enter a valid value. For this particular dataset, since I was not given the study instrument, it's hard to think from the designer's perspective.
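Still, we can apply the same range rule on the data side; .between gives a quick count of the responses falling outside [15, 90]:

# Count ages outside the plausible [15, 90] range (inclusive)
(~df_s['age'].between(15, 90)).sum()
3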

We can take a look at the respondents who reported being over 90 or under 15 years old.

df_s.loc[df_s['age']>90,:]
ID diagnose employee type benefit discussion learn anoymity leave mental_employer ... mental_coworker mental_super as_serious consequence age gender country_live state position remote
372 373 Yes 6-25 1.0 Yes No I don't know Yes Somewhat easy No ... Yes Yes Yes No 99 Other United States of America Michigan Supervisor/Team Lead Sometimes
564 565 No 100-500 1.0 Yes I don't know I don't know I don't know I don't know No ... Maybe Maybe I don't know No 323 Male United States of America Oregon Back-end Developer Sometimes

2 rows × 21 columns

df_s.loc[df_s['age']<15,:]
ID diagnose employee type benefit discussion learn anoymity leave mental_employer ... mental_coworker mental_super as_serious consequence age gender country_live state position remote
656 657 Yes More than 1000 1.0 Yes Yes Yes I don't know Somewhat easy Maybe ... Maybe Yes I don't know No 3 Male Canada None Back-end Developer Sometimes

1 rows × 21 columns

In survey research, when we see extreme values like this, we flag the responses and check for further evidence of survey satisficing. This is when respondents speed through the survey and/or give careless answers. Some well-known behaviors include speeding, straightlining, and giving random open-ended answers. It is an interesting methodological topic that is worth a separate post.

Here, I'll remove these few records from any analyses that involve the age variable by manually assigning NaN to the corresponding values.

df_s['age'] = df_s['age'].astype('float')  # np.nan is only representable in a float dtype
df_s['age'] = df_s['age'].replace([3, 99, 323], np.nan)  # the three out-of-range values found above
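The same cleanup can be written as a range rule instead of listing the offending values one by one, which would also catch any out-of-range value I might have missed; a sketch using the [15, 90] range from above:

# Keep ages within [15, 90]; everything else becomes NaN
df_s['age'] = df_s['age'].where(df_s['age'].between(15, 90))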

Though there isn’t much change in the average age value, I’m more comfortable now knowing that there were extreme values that I cleaned up and that I should re-inspect the data in terms of response quality.

df_s['age'].mean()
33.37182852143482

Cleaning: categorical variables

Again, many things could go under this section, as the title implies. Here I'll illustrate the recoding of a special variable: country.
In the survey, the question asked which country respondents currently reside in, and I believe the answer options were standard country names that respondents selected from a drop-down list.

df_s['country'] = df_s['country_live'].replace(['Other'], 'United States of America')
df_s['country'].unique()
array(['United Kingdom', 'United States of America', 'Canada', 'Germany',
       'Netherlands', 'Australia', 'France', 'Belgium', 'Brazil',
       'Denmark', 'Sweden', 'Russia', 'Spain', 'India', 'Mexico',
       'Switzerland', 'Norway', 'Argentina', 'Ireland', 'Italy',
       'Colombia', 'Czech Republic', 'Vietnam', 'Finland', 'Bulgaria',
       'South Africa', 'Slovakia', 'Bangladesh', 'Pakistan',
       'New Zealand', 'Afghanistan', 'Romania', 'Poland', 'Iran',
       'Hungary', 'Israel', 'Japan', 'Ecuador', 'Bosnia and Herzegovina',
       'Austria', 'Chile', 'Estonia'], dtype=object)

Suppose I now want to convert these countries to the continents they belong to. There is a handy Python library, pycountry-convert, that we can use for exactly this purpose.
Step 1: convert each country name to its standard country code:

country_code = []
for country_name in df_s['country']:
    country_code.append(pc.country_name_to_country_alpha2(country_name))
country_code[0:5]
['GB', 'US', 'GB', 'US', 'GB']

Step 2: convert country code into continent:

continent = []
for code in country_code:
    continent.append(pc.country_alpha2_to_continent_code(code))
continent[0:5]
['EU', 'NA', 'EU', 'NA', 'EU']
df_s['continent'] = continent
df_s['continent'].value_counts()
NA    775
EU    301
OC     32
AS     18
SA     16
AF      4
Name: continent, dtype: int64
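The two conversion steps can also be wrapped into a small helper and applied with .map; a sketch, where falling back to NaN on a KeyError is my assumption about how pycountry-convert signals an unrecognized name:

# Country name -> continent code, NaN if the lookup fails
def to_continent(name):
    try:
        code = pc.country_name_to_country_alpha2(name)
        return pc.country_alpha2_to_continent_code(code)
    except KeyError:  # assumed failure mode for unrecognized names
        return np.nan

df_s['continent'] = df_s['country'].map(to_continent)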

Now, instead of analyzing at the country level, we have the continent variable, which is a little more balanced. It's worth noting, however, that we don't necessarily have to use pycountry-convert for the country-to-continent transformation. One could, for example, just use basic recodes to assign each country to its continent. This is doable, but I wouldn't recommend it: why plant a tree from scratch when you can stand in the shade of trees others have planted?

Assign numeric values to the variables

Text values in variables are fine for analyses like frequencies, crosstabs, and visualization work. For example, here's a crosstab of how often respondents work remotely (always, sometimes, or never) against whether they were ever diagnosed with a mental health condition:

pd.crosstab(df_s['remote'], df_s['diagnose'])
diagnose    No  Yes
remote
Always     116  101
Never      177  141
Sometimes  286  325
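Since the row totals differ quite a bit, row proportions are often easier to compare; pd.crosstab can normalize within rows directly:

pd.crosstab(df_s['remote'], df_s['diagnose'], normalize='index').round(2)
diagnose     No   Yes
remote
Always     0.53  0.47
Never      0.56  0.44
Sometimes  0.47  0.53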

You could imagine that if I were to throw both variables into a logistic regression, the text values would be hard to work with.
In cases like this, we can simply use the .map method to map the text values to numeric values; .replace would also work in this setting.

df_s['remote'] = df_s['remote'].map({'Always': 1, 'Sometimes': 2, 'Never':3})
df_s['remote'].unique()
array([2, 3, 1], dtype=int64)

Now the remote variable takes on the values 1, 2, and 3 rather than its text values, which will be much more convenient to deal with in models.
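For the logistic regression mentioned above, the diagnose target needs the same treatment; a minimal sketch that encodes it into a new column (diagnose_bin is my name, not from the original survey):

# Encode the Yes/No target as 1/0 for modeling
df_s['diagnose_bin'] = df_s['diagnose'].map({'No': 0, 'Yes': 1})
df_s['diagnose_bin'].value_counts()
0    579
1    567
Name: diagnose_bin, dtype: int64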