Cleaning and Tidying Survey Data

Working with Survey Data: Processing for Use and Visualization

Depending on the survey software used, the results exported as excel or .csv can often be fairly messy, and processing that data manually in excel can be time consuming and is prone to copy/paste or other munging errors. Using a real example (questions and stakeholder names changed), I present here one fairly replicable way to process survey data exported from SurveyMonkey. I should note here that I didn’t design this survey, there are several methodological flaws regarding response options and questions themselves, but that conversation is for another day.

Three Challenges with Survey Data

survey <- read_excel("./anon_survey.xlsx")

1. Multiple header rows

SurveyMonkey results are exported with two or more header rows. The first row is descriptive, for “parent” main questions, the second for the “child” subquestions, the data required is stored in the latter.

In addition to Questions, the first 1-10 columns contain descriptive information on the respondent, and other descriptive elements. These descriptive aspects are stored in the first row (along with the parent questions), which read_excel() correctly sets as the header.

Getting the import right is half the battle

Because of this structure, name() can’t be set on import, and some manipulation is required. This method, which I’ve taken from Lisa DeBruine, is helpful in that columns don’t have to be manually identified, and is reusable for any survey.

new_names <- survey %>% 
  t() %>% 
  as_tibble() %>% 
  mutate(name = paste(V1, sep = "_")) %>% 
  pull(name)

## Warning: The `x` argument of `as_tibble.matrix()` must have column names if `.name_repair` is omitted as of tibble 2.0.0.
## Using compatibility `.name_repair`.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_warnings()` to see where this warning was generated.

new_names[1:4] <- c("stakeholder_group", "gender", "position", "sector")

Which can now be applied to the survey data, and the first row dropped.

names(survey) <- new_names
survey <- survey[-1,]

There are duplicate columns names which will cause trouble if you want to analyze right away, but they’ll get dealt with in the next step.

2. Numeric and text responses combined

Separating the quantitative response values from the qualitative is a necessary step for future gathering and tallying responses. The qualitative responses vary in length, and sometimes are their own question, or are an addition to a numeric question.

In this case, the qualitative questions are all the same, there are “Other” categories sometimes used when none of the options are suitable but a response is required. I generally reccommend avoiding these, as the single-value responses like this take longer to analyze and more often than not are closely aligned to one of the options.

survey_text <- survey %>% 
  setNames(make.unique(names(.))) %>%
  select(stakeholder_group:sector, contains("Other"))

survey <- survey %>% 
    setNames(make.unique(names(.))) %>%
    select(everything(), -contains("Other"))

###3. Removing the non-responsive respondents

Some respondents exported from SurveyMonkey often appear to have dropped the survey, and these respondents have NA values for all responses. Any one of the respondent characteristics can be filtered out to remove these non-respondents.

survey <- survey %>% filter(!is.na(gender))

Now that the data is renamed and cleaned, the tidying and counting can begin!

Tidying the dataset: gather, group, tally, spread

The problem with the way the data is presently structured is that each question “value” has its own column (opposed to being combined in the same “question” variable column), the current ratings (e.g. good or very good) are text values, but we need a count, and there are values where respondents could choose not to answer, giving a Value.

All of these can be sorted out with the following code chunk:

survey_counts <-  survey %>% 
  gather(key = Question, value = Response,
         "1.1 Refugees have been consulted in the design of AGENCY-funded projects":"1.4 A gender-sensitive needs assessment, or a gender analysis, was undertaken in the design and implementation of interventions") %>% 
  group_by(stakeholder_group, Question, Response) %>% 
  tally %>% 
  spread(Response, n, fill = 0) %>% 
  select(stakeholder_group, Question,`Strongly disagree`, Disagree, `Do not know/ Not applicable`, Agree,
         `Strongly agree`, no_response = `<NA>`) %>%
  filter(!is.na(stakeholder_group))

the weakness of this approach is that the gather questions need to be specified (will need to be changed for every survey), and the variables selected will need to be changed according to each question type. The latter will be experimented with further to see how different question response scales can be analyzed differently.

Table 1: All respondent values have now been tallied
stakeholder_group	Question	Strongly disagree	Disagree	Do not know/ Not applicable	Agree	Strongly agree	no_response
AGENCY	1.1 Refugees have…	0	5	3	9	4	0
EU Member State	1.1 Refugees have…	0	1	5	1	0	4
INGO	1.1 Refugees have…	0	6	5	14	4	1
Local NGO	1.1 Refugees have…	1	2	0	6	7	8
Other EU	1.1 Refugees have…	0	1	2	1	1	0

Measuring Proportionate Responses

Counts are helpful, but most of the time what we really want is to know the proportion of their response compared to the overall responses. This is also a helpful step to get out of the way when preparing graphics to display results (as found here).

Considering how often I work with surveys, I created a function to determine the proportions. Like above, it requires some minor tweaking on a project-by-project basis, and I hope to improve it in the not-so-distant future.

#create a function to determine proportions
Proportions_Function <- function(tbl){
  tbl %>% 
    summarise(`Number of Responses` = sum(`Strongly disagree` + `Disagree`+`Do not know/ Not applicable`+
                                            `Agree`+`Strongly agree` + no_response, na.rm = TRUE),
              `Strongly disagree`= sum(`Strongly disagree`/`Number of Responses`, na.rm = TRUE) * 100,
              `Disagree` = sum(`Disagree` / `Number of Responses`, na.rm = TRUE) * 100,
              `Do not know/ Not applicable` = sum(`Do not know/ Not applicable` / `Number of Responses`, na.rm = TRUE)* 100,
              `Agree` = sum(`Agree` / `Number of Responses`, na.rm = TRUE) * 100,
              `Strongly agree` = sum(`Strongly agree` / `Number of Responses`, na.rm = TRUE) * 100,
              `no_response` = sum(no_response / `Number of Responses`, na.rm = TRUE) * 100
    ) %>% 
    mutate_if(is.numeric, round, digits = 2) %>% 
    arrange(desc(`Number of Responses`))
}

stakeholder_proportions <- survey_counts %>% 
  Proportions_Function() %>% 
  arrange(Question)

## `mutate_if()` ignored the following grouping variables:
## Column `stakeholder_group`

And with that, we now have two tables, one for stakeholder response proportions, and another for stakeholder responses by counts. In my experience, this workflow can bring on massive time savings, is fairly replicable across survey projects, and lends very well to excel table or graphic exports with anonymized findings for clients.

To find out how to export these as tabbed or single-page excel documents, check out my other post on that topic, and to learn how to quickly produce graphics for those proportion tables, be sure to check this purrr::map2() workflow.

Key lessons learned

Creating a set of tables which count the results from surveys can be a straight-forward process. The more I work with this type of data, the more I improve the efficiency of the script itself. learning to add tally without the brackets was a breakthrough, and to change the columns with the same name “Other - please respond” with setNames(make.unique(names(.))) is an unelegant solution I would like to improve with rename_at(), but haven’t figured out as of yet. Lastly, the proportions function needs to become more functional. it’s currently prone to error, and needs to be able to better incorporate changes, for example to account for different sums, unique column names and question types, and generally to be more flexible.

Cleaning and Tidying Survey Data

Working with Survey Data: Processing for Use and Visualization

Three Challenges with Survey Data

1. Multiple header rows

2. Numeric and text responses combined

Tidying the dataset: gather, group, tally, spread

Measuring Proportionate Responses

Key lessons learned

Corey Pembleton

How a leftist twitter trend played out

Catching up with dplyr 1.0 with fivethirtyeight data

Making Beautiful Streetmaps with ggplot2

Makeover Monday Data Visualization (Montreal dataviz meetup)

Use dplyr like its meant to be used

Scrape it 'til you make it

MAPS Crowdsourced Open Science Project: Data Imputation

How to render parameterized pdfs in rmarkdown

MAPS Crowdsourced Open Science Project: Data Exploration

OpenNorth Team Launches Contribution to University of Bristol MAPS crowdsourced analysis