---
title: "Survey Data Analysis with gooseR"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Survey Data Analysis with gooseR}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  eval = FALSE
)
```

```{r setup}
library(gooseR)
library(dplyr)
library(ggplot2)
```

## Introduction

Survey data often comes with unwieldy column names - long questions that make analysis difficult. gooseR's intelligent survey tools transform these into clean, meaningful variable names while preserving the original mapping for documentation.

## The Challenge

Typical survey exports look like this:

```{r}
# Example of raw survey data columns
raw_columns <- c(
  "ResponseId",
  "How satisfied are you with our customer service on a scale of 1-5?",
  "On a scale of 0-10, how likely are you to recommend our product to a friend or colleague?",
  "What is your annual household income before taxes?",
  "How often do you use our product? (Daily, Weekly, Monthly, Rarely, Never)",
  "Please rate your agreement: The product meets my needs",
  "In which age range do you fall? (18-24, 25-34, 35-44, 45-54, 55-64, 65+)",
  "What is your primary reason for using our product? (Please select all that apply)"
)

print(raw_columns)
```

Working with these names is painful:

- Hard to type
- Difficult to read in code
- Problematic in formulas
- Messy in visualizations

## The Solution: Intelligent Renaming

### Basic Usage

```{r}
# Load your survey data
survey_data <- read.csv("survey_export.csv")

# Automatically rename columns with intelligent abbreviations
clean_data <- goose_rename_columns(survey_data)

# View what happened
goose_view_column_map(clean_data)
```

### How It Works

gooseR recognizes common survey patterns and applies intelligent abbreviations:

```{r}
# Pattern Recognition Examples:

# Satisfaction questions
"How satisfied are you with our customer service?"
# → "sat_cust_serv"

# NPS (Net Promoter Score)
"On a scale of 0-10, how likely are you to recommend..."
# → "nps"

# Demographics
"What is your annual household income before taxes?"
# → "hh_income"

"In which age range do you fall?"
# → "age_range"

# Frequency questions
"How often do you use our product?"
# → "use_freq"

# Likert scales
"Please rate your agreement: The product meets my needs"
# → "agree_meets_needs"

# Multiple choice
"What is your primary reason for using our product?"
# → "primary_reason"
```

### Custom Abbreviations

You can provide domain-specific abbreviations:

```{r}
# Create custom dictionary for your domain
custom_dict <- list(
  "customer service" = "cs",
  "artificial intelligence" = "ai",
  "machine learning" = "ml",
  "return on investment" = "roi",
  "key performance indicator" = "kpi"
)

# Apply with custom dictionary
clean_data <- goose_rename_columns(
  survey_data,
  custom_abbrev = custom_dict,
  max_length = 20  # Maximum variable name length
)
```

## Complete Workflow Example

### Step 1: Load and Clean

```{r}
# Load raw survey data
survey <- read.csv("qualtrics_export.csv", stringsAsFactors = FALSE)

# Check the messy column names
names(survey)[1:5]

# Clean the column names
survey_clean <- goose_rename_columns(survey)

# Check the clean names
names(survey_clean)[1:5]

# Save the mapping for documentation
mapping <- goose_view_column_map(survey_clean)
write.csv(mapping, "column_mapping.csv", row.names = FALSE)
```

### Step 2: Get Analysis Guidance

```{r}
# Share a sample with goose for context
goose_give_sample(survey_clean)

# Get an analysis plan
plan <- goose_make_a_plan("exploratory")
cat(plan)

# Ask specific questions
goose_ask("What's the best way to analyze Likert scale data in this survey?")
goose_ask("How should I handle missing data in the income field?")
```

### Step 3: Perform Analysis

```{r}
# Now you can use clean names in your analysis
survey_clean %>%
  group_by(age_range) %>%
  summarise(
    avg_satisfaction = mean(sat_overall, na.rm = TRUE),
    avg_nps = mean(nps, na.rm = TRUE),
    n = n()
  ) %>%
  arrange(desc(avg_satisfaction))

# Create visualizations with clean labels
ggplot(survey_clean, aes(x = age_range, y = sat_overall)) +
  geom_boxplot() +
  theme_brand("block") +
  labs(
    title = "Satisfaction by Age Group",
    x = "Age Range",
    y = "Overall Satisfaction"
  )
```

### Step 4: Get Feedback and Document

```{r}
# Get feedback on your analysis approach
goose_honk(severity = "moderate")

# Create documentation
goose_handoff()

# Save your work
goose_save(
  survey_clean,
  category = "survey_data",
  tags = c("cleaned", "q3_2024", "customer_satisfaction")
)

# Create a continuation prompt for next session
goose_continuation_prompt()
```

## Advanced Features

### Handling Multiple Survey Waves

```{r}
# Process multiple survey files consistently
files <- c("survey_q1.csv", "survey_q2.csv", "survey_q3.csv")

all_surveys <- lapply(files, function(file) {
  data <- read.csv(file)
  goose_rename_columns(data)
})

# Combine with consistent naming
combined <- bind_rows(all_surveys, .id = "quarter")
```

### Pattern Detection Details

gooseR detects these patterns automatically:

1. **NPS Questions**: "likely to recommend", "scale of 0-10"
2. **Satisfaction**: "how satisfied", "satisfaction with"
3. **Agreement**: "rate your agreement", "strongly agree"
4. **Frequency**: "how often", "frequency of"
5. **Demographics**: "age", "income", "education", "gender"
6. **Importance**: "how important", "importance of"
7. **Likelihood**: "how likely", "likelihood of"

### Preserving Original Questions

The mapping is always preserved:

```{r}
# After renaming
clean_data <- goose_rename_columns(survey_data)

# Access the mapping
attr(clean_data, "column_map")

# Or use the helper function
mapping <- goose_view_column_map(clean_data)

# Use in reports
library(knitr)
kable(
  mapping,
  caption = "Survey Variable Mapping",
  col.names = c("Variable Name", "Original Question")
)
```

## Tips and Best Practices

1. **Always Save the Mapping**: Keep the column mapping for documentation and methodology sections
2. **Use Custom Dictionaries**: Add industry-specific abbreviations for consistent naming
3. **Check the Results**: Review renamed columns to ensure they make sense
4. **Combine with Memory**: Save cleaned datasets with descriptive tags
5. **Document Your Work**: Use `goose_handoff()` to create documentation

## Common Issues and Solutions

### Issue: Names Too Long

```{r}
# Set maximum length
clean <- goose_rename_columns(survey, max_length = 15)
```

### Issue: Duplicate Names After Abbreviation

```{r}
# gooseR automatically handles duplicates by adding numbers
# "satisfaction_1", "satisfaction_2", etc.
```

### Issue: Special Characters in Questions

```{r}
# gooseR automatically removes special characters
# "What's your opinion?" → "opinion"
# "Rate 1-5: Service" → "rate_service"
```

## Integration with Other gooseR Features

```{r}
# Combine with AI analysis
goose_give_sample(clean_data)
advice <- goose_ask("What statistical tests should I use for this Likert scale data?")

# Get code review
goose_honk(severity = "gentle")

# Save for later
goose_save(clean_data, category = "surveys", tags = c("2024", "cleaned"))

# Create formatted output
results <- clean_data %>%
  group_by(age_range) %>%
  summarise(mean_sat = mean(sat_overall, na.rm = TRUE))

goose_format_table(results)
```

## Conclusion

gooseR's survey tools eliminate the tedious work of cleaning survey data, letting you focus on analysis and insights. The intelligent pattern recognition ensures consistent, meaningful variable names while preserving the full context of your original questions.

Next: Check out the [Code Review and Testing](code-review-testing.html) vignette to learn about gooseR's development tools.
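
As a supplement to the pattern list under "Pattern Detection Details", the sketch below shows one way keyword-based detection of that kind could work in base R. It is an illustration only and is not gooseR's actual implementation: `detect_pattern()` and the `"other"` fallback label are hypothetical, and the keyword order (first match wins) is an assumption chosen so that NPS wording takes precedence over the more general "how likely" pattern.

```r
# Hypothetical helper, NOT part of gooseR: classify a survey question by
# matching it against the keyword groups listed in this vignette.
detect_pattern <- function(question) {
  # One regular expression per pattern; order matters because the
  # first matching pattern is returned (an assumption of this sketch).
  patterns <- c(
    nps          = "likely to recommend|scale of 0-10",
    satisfaction = "how satisfied|satisfaction with",
    agreement    = "rate your agreement|strongly agree",
    frequency    = "how often|frequency of",
    demographics = "\\b(age|income|education|gender)\\b",
    importance   = "how important|importance of",
    likelihood   = "how likely|likelihood of"
  )

  q <- tolower(question)
  matched <- vapply(patterns, function(p) grepl(p, q), logical(1))

  # Fall back to a generic label when no keyword group matches
  if (any(matched)) names(patterns)[which(matched)[1]] else "other"
}

questions <- c(
  "ResponseId",
  "How satisfied are you with our customer service on a scale of 1-5?",
  "On a scale of 0-10, how likely are you to recommend our product to a friend or colleague?",
  "What is your annual household income before taxes?",
  "How often do you use our product? (Daily, Weekly, Monthly, Rarely, Never)",
  "Please rate your agreement: The product meets my needs"
)

detected <- vapply(questions, detect_pattern, character(1), USE.NAMES = FALSE)
data.frame(pattern = detected, question = questions)
```

A classifier like this only labels the question type; turning the label into a short variable name (for example `nps` or `hh_income`) is a separate step that this sketch does not attempt.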