---
title: "End-to-End Pipeline: From API to Multi-Sport Analysis"
author: "vald.extractor"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{End-to-End Pipeline: From API to Multi-Sport Analysis}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  eval = FALSE
)
```

## Introduction

The `vald.extractor` package provides a robust, production-ready pipeline for extracting, cleaning, and analyzing VALD ForceDecks data across multiple sports. This vignette demonstrates the complete workflow from API authentication to publication-ready visualizations.

### Key Problems Solved

1. **Stability**: Chunked batch processing prevents timeout errors when working with large datasets (1000+ tests)
2. **Data Cleaning**: Automated sports taxonomy mapping standardizes inconsistent team/group names
3. **Reproducibility**: All processing steps are documented and version-controlled
4. **Generic Analysis**: Test-type suffix removal enables writing analysis code once that works for all test types

## Step 1: Authentication and Data Extraction

First, set your VALD API credentials and extract test/trial data:

```{r extract-data}
library(vald.extractor)

# Set credentials
valdr::set_credentials(
  client_id = "your_client_id",
  client_secret = "your_client_secret",
  tenant_id = "your_tenant_id",
  region = "aue"
)

# Fetch data from 2020 onwards in chunks of 100 tests
vald_data <- fetch_vald_batch(
  start_date = "2020-01-01T00:00:00Z",
  chunk_size = 100,
  verbose = TRUE
)

# Extract components
tests_df <- vald_data$tests
trials_df <- vald_data$trials

cat("Extracted", nrow(tests_df), "tests and", nrow(trials_df), "trials\n")
```

**Why chunking matters**: Without chunking, large organizations with 5000+ tests will experience API timeout errors.
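The idea behind chunked, fault-tolerant extraction can be sketched in plain R. This is an illustration of the pattern, not the package internals: `fetch_chunk()` is a hypothetical stand-in for a single API call, rigged to fail once so the error handling is visible.

```r
# Illustrative sketch of chunked, fault-tolerant fetching (not the package
# internals). fetch_chunk() is a hypothetical stand-in for one API call.
fetch_chunk <- function(offset, size) {
  if (offset == 200) stop("API timeout")  # simulate one failing chunk
  data.frame(testId = seq(offset + 1, offset + size))
}

fetch_in_chunks <- function(total, chunk_size = 100) {
  results <- list()
  for (offset in seq(0, total - 1, by = chunk_size)) {
    chunk <- tryCatch(
      fetch_chunk(offset, chunk_size),
      error = function(e) {
        # Log the failure and keep going instead of aborting the whole run
        message("ERROR on chunk starting at row ", offset + 1, ": ",
                conditionMessage(e), " -- continuing to next chunk")
        NULL
      }
    )
    if (!is.null(chunk)) results[[length(results) + 1]] <- chunk
  }
  do.call(rbind, results)
}

out <- fetch_in_chunks(500)
nrow(out)  # 400: the failed chunk (rows 201-300) is skipped, the rest survive
```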
The chunked approach processes 100 tests at a time, with fault-tolerant error handling that logs issues without halting the entire extraction.

## Step 2: Fetch and Standardize Metadata

Retrieve athlete profiles and group memberships via OAuth2:

```{r fetch-metadata}
# Fetch raw metadata
metadata <- fetch_vald_metadata(
  client_id = "your_client_id",
  client_secret = "your_client_secret",
  tenant_id = "your_tenant_id",
  region = "aue"
)

# Standardize: unnest group memberships and create unified athlete records
athlete_metadata <- standardize_vald_metadata(
  profiles = metadata$profiles,
  groups = metadata$groups
)

head(athlete_metadata)
```

### Understanding the Metadata Structure

The VALD API stores group memberships as a nested array (`groupIds`). The `standardize_vald_metadata()` function:

1. Unnests the array so each athlete-group pair gets its own row
2. Joins with group names from the Groups API
3. Collapses back to one row per athlete with all group names concatenated

**Result**: A clean metadata table where `all_group_names` contains "Football, U18, Elite" for an athlete in multiple groups.

## Step 3: Apply Sports Taxonomy

Map inconsistent team names to standardized sports categories:

```{r classify-sports}
athlete_metadata <- classify_sports(
  data = athlete_metadata,
  group_col = "all_group_names",
  output_col = "sports_clean"
)

# Inspect the mapping
table(athlete_metadata$sports_clean)
```

**The Value Add**: This regex-based classification is the core innovation. Organizations often have:

- "Football", "Soccer", "FSI" → all mapped to "Football"
- "Track", "Field", "T&F" → all mapped to "Track & Field"

Without this automation, analysts spend hours manually categorizing athletes. The package includes patterns for 15+ sports and can be easily extended.
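The unnest-join-collapse sequence from Step 2 can be sketched on toy data in base R. The field names (`profileId`, `groupIds`) mirror the API structure described above, but the code is illustrative, not the package's implementation:

```r
# Toy profiles with a nested list-column of group IDs, as the API returns them
profiles <- data.frame(profileId = c("a1", "b2"))
profiles$groupIds <- list(c("g1", "g2"), "g2")

groups <- data.frame(groupId = c("g1", "g2"),
                     name    = c("Football", "U18"))

# 1. Unnest: one row per athlete-group pair
pairs <- data.frame(
  profileId = rep(profiles$profileId, lengths(profiles$groupIds)),
  groupId   = unlist(profiles$groupIds)
)

# 2. Join group names from the Groups table
pairs <- merge(pairs, groups, by = "groupId")

# 3. Collapse back to one row per athlete
collapsed <- aggregate(name ~ profileId, data = pairs,
                       FUN = function(x) paste(sort(x), collapse = ", "))
names(collapsed)[2] <- "all_group_names"
collapsed  # a1 -> "Football, U18"; b2 -> "U18"
```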
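For readers extending the taxonomy, the matching logic behind a `classify_sports()`-style function can be sketched in base R; the patterns and the fallback label here are examples for illustration, not the package's built-in rules:

```r
# Illustrative sketch of regex-based sport classification.
# The pattern list and fallback label are examples, not the package defaults.
sports_patterns <- list(
  Football        = "Football|Soccer|FSI",
  `Track & Field` = "Track|Field|T&F"
)

classify_one <- function(group_names) {
  for (sport in names(sports_patterns)) {
    if (grepl(sports_patterns[[sport]], group_names, ignore.case = TRUE)) {
      return(sport)
    }
  }
  "Unclassified"  # fall through when no pattern matches
}

raw_groups <- c("Senior Soccer Squad", "T&F Juniors", "Rowing")
res <- vapply(raw_groups, classify_one, character(1), USE.NAMES = FALSE)
res  # "Football"  "Track & Field"  "Unclassified"
```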
## Step 4: Transform to Wide Format and Join

Combine trials into tests, pivot to wide format, and merge with metadata:

```{r transform-wide}
library(dplyr)

# Join trials and tests
all_data <- left_join(trials_df, tests_df, by = c("testId", "athleteId"))

# Aggregate trials and pivot to wide format
structured_test_data <- all_data %>%
  group_by(athleteId, testId, testType, recordedUTC, recordedDateOffset,
           trialLimb, definition_name) %>%
  summarise(
    mean_result = mean(as.numeric(value), na.rm = TRUE),
    mean_weight = mean(as.numeric(weight), na.rm = TRUE),
    .groups = "drop"
  ) %>%
  mutate(
    TestTimestampUTC = lubridate::ymd_hms(recordedUTC),
    TestTimestampLocal = TestTimestampUTC + lubridate::minutes(recordedDateOffset),
    Testdate = as.Date(TestTimestampLocal)
  ) %>%
  select(athleteId, Testdate, testId, testType, trialLimb,
         definition_name, mean_result, mean_weight) %>%
  tidyr::pivot_wider(
    id_cols = c(athleteId, Testdate, testId, mean_weight),
    names_from = c(definition_name, trialLimb, testType),
    values_from = mean_result,
    names_glue = "{definition_name}_{trialLimb}_{testType}"
  ) %>%
  rename(Weight_on_Test_Day = mean_weight)

# Join with metadata
final_analysis_data <- structured_test_data %>%
  mutate(profileId = as.character(athleteId)) %>%
  left_join(
    athlete_metadata %>% mutate(profileId = as.character(profileId)),
    by = "profileId"
  ) %>%
  mutate(
    Testdate = as.Date(Testdate),
    dateofbirth = as.Date(dateOfBirth),
    age = as.numeric((Testdate - dateofbirth) / 365.25),
    sports = sports_clean
  )

cat("Final dataset:", nrow(final_analysis_data), "rows with",
    ncol(final_analysis_data), "columns\n")
```

## Step 5: Split by Test Type

The "Don't Repeat Yourself" (DRY) principle in action:

```{r split-tests}
# Split into separate datasets per test type
test_datasets <- split_by_test(
  data = final_analysis_data,
  metadata_cols = c("profileId", "sex", "Testdate", "dateofbirth", "age",
                    "testId", "Weight_on_Test_Day", "sports")
)

# Access individual test types
cmj_data <- test_datasets$CMJ
dj_data <-
  test_datasets$DJ

# Crucially: column names are now generic
head(names(cmj_data))
# "profileId", "sex", "Testdate", "PEAK_FORCE_Both", "JUMP_HEIGHT_Both", ...
# Note: "_CMJ" suffix has been removed!
```

**Why this matters**: You can now write one analysis function that works for all test types:

```{r generic-analysis}
analyze_peak_force <- function(test_data) {
  summary(test_data$PEAK_FORCE_Both)  # Works for CMJ, DJ, ISO, etc.
}

# Apply to all test types
lapply(test_datasets, analyze_peak_force)
```

Without suffix removal, you'd need separate code for `PEAK_FORCE_Both_CMJ`, `PEAK_FORCE_Both_DJ`, etc.

## Step 6: Patch Missing Metadata (Optional)

Fix missing or incorrect demographic data:

```{r patch-metadata}
# Create an Excel file with columns: profileId, sex, dateOfBirth
# Example: corrections.xlsx with rows like:
#   profileId  sex     dateOfBirth
#   abc123     Male    1995-03-15
#   def456     Female  1998-07-22

cmj_data <- patch_metadata(
  data = cmj_data,
  patch_file = "corrections.xlsx",
  patch_sheet = 1,
  id_col = "profileId",
  fields_to_patch = c("sex", "dateOfBirth")
)

# Verify corrections
table(cmj_data$sex)  # "Unknown" values should now be fixed
```

## Step 7: Generate Summary Statistics

Create publication-ready summary tables:

```{r summary-stats}
cmj_summary <- summary_vald_metrics(
  data = cmj_data,
  group_vars = c("sex", "sports"),
  exclude_cols = c("profileId", "testId", "Testdate", "dateofbirth", "age")
)

# View summary
print(cmj_summary)

# Export to CSV
write.csv(cmj_summary, "cmj_summary_by_sport_sex.csv", row.names = FALSE)
```

**Output example**:

```
sex     sports      PEAK_FORCE_Both_Mean  PEAK_FORCE_Both_SD  PEAK_FORCE_Both_CV  PEAK_FORCE_Both_N
Male    Football    2450.32               245.67              10.02               45
Male    Basketball  2310.45               198.23               8.58               32
Female  Football    1980.12               187.45               9.47               38
```

## Step 8: Visualize Trends

Track performance over time:

```{r plot-trends}
library(ggplot2)

# Plot CMJ peak force trends by athlete
plot_vald_trends(
  data = cmj_data,
  date_col = "Testdate",
  metric_col = "PEAK_FORCE_Both",
  group_col
    = "profileId",
  facet_col = "sex",
  title = "CMJ Peak Force Trends by Athlete",
  smooth = TRUE
)

# Plot sport-level averages over time
sport_trends <- cmj_data %>%
  group_by(Testdate, sports) %>%
  summarise(avg_force = mean(PEAK_FORCE_Both, na.rm = TRUE), .groups = "drop")

plot_vald_trends(
  data = sport_trends,
  date_col = "Testdate",
  metric_col = "avg_force",
  group_col = "sports",
  title = "Average CMJ Peak Force by Sport Over Time"
)
```

## Step 9: Compare Across Groups

Create boxplots for cross-sectional comparisons:

```{r plot-compare}
plot_vald_compare(
  data = cmj_data,
  metric_col = "PEAK_FORCE_Both",
  group_col = "sports",
  fill_col = "sex",
  title = "CMJ Peak Force Comparison by Sport and Sex"
)

# Compare jump height
plot_vald_compare(
  data = cmj_data,
  metric_col = "JUMP_HEIGHT_Both",
  group_col = "sports",
  fill_col = "sex",
  title = "CMJ Jump Height Comparison"
)
```

## Advanced: Multi-Test Analysis

Analyze multiple test types simultaneously:

```{r multi-test}
# Define a function to extract a common metric across test types
compare_metric_across_tests <- function(test_datasets, metric = "PEAK_FORCE_Both") {
  results <- lapply(names(test_datasets), function(test_name) {
    test_data <- test_datasets[[test_name]]
    if (metric %in% names(test_data)) {
      data.frame(
        testType = test_name,
        metric = metric,
        mean = mean(test_data[[metric]], na.rm = TRUE),
        sd = sd(test_data[[metric]], na.rm = TRUE),
        n = sum(!is.na(test_data[[metric]]))
      )
    }
  })
  do.call(rbind, results)
}

# Compare peak force across CMJ, DJ, and ISO
force_comparison <- compare_metric_across_tests(test_datasets, "PEAK_FORCE_Both")
print(force_comparison)
```

## Best Practices for Production Use

### 1. Schedule Regular Updates

```{r scheduled-updates, eval=FALSE}
# Weekly refresh script
library(vald.extractor)

# Fetch only new data since the last update
last_update <- "2024-01-01T00:00:00Z"
new_data <- fetch_vald_batch(
  start_date = last_update,
  chunk_size = 100
)

# Append to the existing database
load("vald_database.RData")
updated_tests <- rbind(existing_tests, new_data$tests)
updated_trials <- rbind(existing_trials, new_data$trials)
save(updated_tests, updated_trials, file = "vald_database.RData")
```

### 2. Error Logging

The chunked extraction automatically logs errors without halting:

```{r error-logging}
# Errors are printed to the console with chunk information:
#   "ERROR on chunk 23 (rows 2201-2300): API timeout"
#   "Continuing to next chunk..."
# This ensures partial data extraction even if some chunks fail
```

### 3. Version Control Your Taxonomy

Store your sports classification rules in a separate config file:

```{r taxonomy-config}
# sports_taxonomy.R
sports_patterns <- list(
  Football   = "Football|FSI|TCFC|MCFC|Soccer",
  Basketball = "Basketball|BBall",
  Cricket    = "Cricket"
  # ... add your organization's patterns
)

# Then use in classify_sports()
```

## Conclusion

The `vald.extractor` package transforms raw VALD API data into analysis-ready datasets with:

- **Fault-tolerant extraction**: Chunked processing with error handling
- **Automated taxonomy**: Regex-based sports classification saves hours of manual work
- **Generic programming**: Suffix removal enables DRY analysis code
- **Publication-ready outputs**: Summary tables and professional visualizations

This workflow is production-tested with 10,000+ tests across 15+ sports and is designed for CRAN submission.
## Next Steps

- Customize the sports taxonomy for your organization
- Integrate with existing reporting pipelines
- Schedule automated weekly updates
- Explore advanced visualizations with `ggplot2` extensions

For issues or feature requests, visit [GitHub Issues](https://github.com/praveenmaths89/vald.extractor/issues).