--- title: "Overview of Data Retrieval Workflow" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Overview of Data Retrieval Workflow} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ``` r library(cbioportalR) library(dplyr) ``` ## Introduction We will outline the main data retrieval workflow and functions using a case study based on two public sets of data: 1) 105 samples in high risk nonmuscle invasive bladder cancer patients [(Pietzak et al. 2017)](https://pubmed.ncbi.nlm.nih.gov/28583311/). 2) 18 samples of 18 prostate cancer patients [(Granlund et al. 2020)](https://pubmed.ncbi.nlm.nih.gov/31564440/) ## Setup Before accessing data you will need to connect to a cBioPortal database and set your base URL for the R session. In this example we will use data from the public cBioPortal database instance (). You do not need a token to access this public website. If you are using a private instance of cBioPortal (like MSK's institutional database), you will need to acquire a token and save it to your .Renviron file. *Note: If you are a MSK researcher working on IMPACT, you should connect to MSK's cBioPortal instance to get the most up to date IMPACT data, and you must follow MSK-IMPACT publication guidelines when using the data.* To set the database url for your current R session use the `set_cbioportal_db()` function. To set it to the public instance you can either provide the full URL to the function, or just `public` as a shortcut. This function will both check your connection to the database and set the url (`www.cbioportal.org/api`) as your base url to connect to for all future API calls during your session. ``` r set_cbioportal_db("public") #> ✔ You are successfully connected! #> ✔ base_url for this R session is now set to "www.cbioportal.org/api" ``` You can use `test_cbioportal_db` at any time throughout your session to check your connection. This can be helpful when troubleshooting issues with your API calls. ``` r test_cbioportal_db() #> ✔ You are successfully connected! ``` ## Get Study Metadata Now that we are successfully connected, we may want to view all studies available for our chosen database to find the correct `study_id` corresponding to the data we want to pull. All studies have a unique identifier in the database. You can view all studies available in your database with the following: ``` r all_studies <- available_studies() all_studies #> # A tibble: 468 × 13 #> studyId name description publicStudy pmid citation groups status importDate allSampleCount #> #> 1 acyc_mskcc_2013 Adenoi… Whole-exom… TRUE 2368… Ho et a… "ACYC… 0 2023-12-0… 60 #> 2 acyc_fmi_2014 Adenoi… Targeted S… TRUE 2441… Ross et… "ACYC… 0 2023-12-0… 28 #> 3 acyc_jhu_2016 Adenoi… Whole-geno… TRUE 2686… Rettig … "ACYC… 0 2023-12-0… 25 #> 4 acyc_mda_2015 Adenoi… WGS of 21 … TRUE 2663… Mitani … "ACYC… 0 2023-12-0… 102 #> 5 acyc_mgh_2016 Adenoi… Whole-geno… TRUE 2682… Drier e… "ACYC" 0 2023-12-0… 10 #> 6 acyc_sanger_2013 Adenoi… Whole exom… TRUE 2377… Stephen… "ACYC… 0 2023-12-0… 24 #> 7 bcc_unige_2016 Basal … Whole-exom… TRUE 2695… Bonilla… "PUBL… 0 2023-12-0… 293 #> 8 all_stjude_2015 Acute … Comprehens… TRUE 2573… Anderss… "PUBL… 0 2023-12-0… 93 #> 9 ampca_bcm_2016 Ampull… Exome sequ… TRUE 2680… Gingras… "PUBL… 0 2023-12-0… 160 #> 10 all_stjude_2013 Hypodi… Whole geno… TRUE 2333… Holmfel… "" 0 2023-12-0… 44 #> # ℹ 458 more rows #> # ℹ 3 more variables: readPermission , cancerTypeId , referenceGenome ``` By inspecting this data frame, we see the unique `study_id` for the NMIBC data set is `"blca_nmibc_2017"` and the unique `study_id` for the prostate cancer data set is `"prad_msk_2019"`. To get more information on our studies we can do the following: *Note: the transpose function `t()` is just used here to better view results* ``` r all_studies %>% filter(studyId %in% c("blca_nmibc_2017", "prad_msk_2019")) #> # A tibble: 2 × 13 #> studyId name description publicStudy pmid citation groups status importDate allSampleCount #> #> 1 blca_nmibc_2017 Nonmuscl… IMPACT seq… TRUE 2858… Pietzak… PUBLIC 0 2023-12-0… 105 #> 2 prad_msk_2019 Prostate… MSK-IMPACT… TRUE 3156… Granlun… PUBLIC 0 2023-12-1… 18 #> # ℹ 3 more variables: readPermission , cancerTypeId , referenceGenome ``` More in-depth information about the study can be found with `get_study_info()` ``` r get_study_info("blca_nmibc_2017") %>% t() #> [,1] #> name "Nonmuscle Invasive Bladder Cancer (MSK Eur Urol 2017)" #> description "IMPACT sequencing of 105 High Risk Nonmuscle Invasive Bladder Cancer samples." #> publicStudy "TRUE" #> pmid "28583311" #> citation "Pietzak et al. Eur Urol 2017" #> groups "PUBLIC" #> status "0" #> importDate "2023-12-07 10:15:31" #> allSampleCount "105" #> sequencedSampleCount "105" #> cnaSampleCount "105" #> mrnaRnaSeqSampleCount "0" #> mrnaRnaSeqV2SampleCount "0" #> mrnaMicroarraySampleCount "0" #> miRnaSampleCount "0" #> methylationHm27SampleCount "0" #> rppaSampleCount "0" #> massSpectrometrySampleCount "0" #> completeSampleCount "0" #> readPermission "TRUE" #> treatmentCount "0" #> structuralVariantCount "11" #> studyId "blca_nmibc_2017" #> cancerTypeId "blca" #> cancerType.name "Bladder Urothelial Carcinoma" #> cancerType.dedicatedColor "Yellow" #> cancerType.shortName "BLCA" #> cancerType.parent "bladder" #> cancerType.cancerTypeId "blca" #> referenceGenome "hg19" ``` ``` r get_study_info("prad_msk_2019") %>% t() #> [,1] #> name "Prostate Cancer (MSK, Cell Metab 2020)" #> description "MSK-IMPACT Sequencing of 18 prostate cancer tumor/normal pairs." #> publicStudy "TRUE" #> pmid "31564440" #> citation "Granlund et al. Cell Metab 2020" #> groups "PUBLIC" #> status "0" #> importDate "2023-12-11 10:54:56" #> allSampleCount "18" #> sequencedSampleCount "18" #> cnaSampleCount "18" #> mrnaRnaSeqSampleCount "0" #> mrnaRnaSeqV2SampleCount "0" #> mrnaMicroarraySampleCount "0" #> miRnaSampleCount "0" #> methylationHm27SampleCount "0" #> rppaSampleCount "0" #> massSpectrometrySampleCount "0" #> completeSampleCount "0" #> readPermission "TRUE" #> treatmentCount "0" #> structuralVariantCount "4" #> studyId "prad_msk_2019" #> cancerTypeId "prostate" #> cancerType.name "Prostate" #> cancerType.dedicatedColor "Cyan" #> cancerType.shortName "PROSTATE" #> cancerType.parent "tissue" #> cancerType.cancerTypeId "prostate" #> referenceGenome "hg19" ``` Lastly, it is important to know what genomic data is available for our studies. Not all studies in your database will have data available on all types of genomic information. For example, it is common for studies not to provide data on fusions/structural variants. We can check available genomic data with `available_profiles()`. ``` r available_profiles(study_id = "blca_nmibc_2017") #> # A tibble: 3 × 8 #> molecularAlterationT…¹ datatype name description showProfileInAnalysi…² patientLevel molecularProfileId #> #> 1 COPY_NUMBER_ALTERATION DISCRETE Puta… Copy Numbe… TRUE FALSE blca_nmibc_2017_c… #> 2 MUTATION_EXTENDED MAF Muta… Mutation d… TRUE FALSE blca_nmibc_2017_m… #> 3 STRUCTURAL_VARIANT SV Stru… Structural… TRUE FALSE blca_nmibc_2017_s… #> # ℹ abbreviated names: ¹​molecularAlterationType, ²​showProfileInAnalysisTab #> # ℹ 1 more variable: studyId ``` ``` r available_profiles(study_id = "prad_msk_2019") #> # A tibble: 3 × 8 #> molecularAlterationT…¹ datatype name description showProfileInAnalysi…² patientLevel molecularProfileId #> #> 1 COPY_NUMBER_ALTERATION DISCRETE Puta… Putative c… TRUE FALSE prad_msk_2019_cna #> 2 MUTATION_EXTENDED MAF Muta… IMPACT468 … TRUE FALSE prad_msk_2019_mut… #> 3 STRUCTURAL_VARIANT SV Stru… Structural… TRUE FALSE prad_msk_2019_str… #> # ℹ abbreviated names: ¹​molecularAlterationType, ²​showProfileInAnalysisTab #> # ℹ 1 more variable: studyId ``` Luckily, in this example our studies have mutation, copy number alteration and fusion (structural variant) data available. Each of these data types has a unique molecular profile ID. The molecular profile ID usually takes the form of `_mutations`, `_structural_variants`, `_cna`. ``` r available_profiles(study_id = "blca_nmibc_2017") %>% pull(molecularProfileId) #> [1] "blca_nmibc_2017_cna" "blca_nmibc_2017_mutations" #> [3] "blca_nmibc_2017_structural_variants" ``` ## Pulling Genomic Data Now that we have inspected our studies and confirmed the genomic data that is available, we will pull the data into our R environment. We will show two ways to do this: 1) Using study IDs (`get_genetics_by_study()`) 2) Using sample ID-study ID pairs (`get_genetics_by_sample()`) Pulling by study will give us genomic data for all genes/panels included in the study. These functions can only pull data one study ID at a time and will return all genomic data available for that study. Pulling by study ID can be efficient, and a good way to ensure you have all genomic information available in cBioPortal for a particular study. If you are working across multiple studies, or only need a subset of samples from one or multiple studies, you may chose to pull by sample IDs instead of study ID. When you pull by sample IDs you can pull specific samples across multiple studies, but must also specify the studies they belong to. You may also pass a specific list of genes for which to return information. If you don't specify a list of genes the function will default to returning all available gene data for each sample. ### By Study IDs To pull by study ID, we can pull each data type individually. ``` r mut_blca <- get_mutations_by_study(study_id = "blca_nmibc_2017") #> ℹ Returning all data for the "blca_nmibc_2017_mutations" molecular profile in the "blca_nmibc_2017" study cna_blca<- get_cna_by_study(study_id = "blca_nmibc_2017") #> ℹ Returning all data for the "blca_nmibc_2017_cna" molecular profile in the "blca_nmibc_2017" study fus_blca <- get_fusions_by_study(study_id = "blca_nmibc_2017") #> ℹ Returning all data for the "blca_nmibc_2017_structural_variants" molecular profile in the "blca_nmibc_2017" study mut_prad <- get_mutations_by_study(study_id = "prad_msk_2019") #> ℹ Returning all data for the "prad_msk_2019_mutations" molecular profile in the "prad_msk_2019" study cna_prad <- get_cna_by_study(study_id = "prad_msk_2019") #> ℹ Returning all data for the "prad_msk_2019_cna" molecular profile in the "prad_msk_2019" study fus_prad <- get_fusions_by_study(study_id = "prad_msk_2019") #> ℹ Returning all data for the "prad_msk_2019_structural_variants" molecular profile in the "prad_msk_2019" study ``` Or we can pull all genomic data at the same time with `get_genetics_by_study()` ``` r all_genomic_blca <- get_genetics_by_study("blca_nmibc_2017") #> ℹ Returning all data for the "blca_nmibc_2017_mutations" molecular profile in the "blca_nmibc_2017" study #> ℹ Returning all data for the "blca_nmibc_2017_cna" molecular profile in the "blca_nmibc_2017" study #> ℹ Returning all data for the "blca_nmibc_2017_structural_variants" molecular profile in the "blca_nmibc_2017" study all_genomic_prad <- get_genetics_by_study("prad_msk_2019") #> ℹ Returning all data for the "prad_msk_2019_mutations" molecular profile in the "prad_msk_2019" study #> ℹ Returning all data for the "prad_msk_2019_cna" molecular profile in the "prad_msk_2019" study #> ℹ Returning all data for the "prad_msk_2019_structural_variants" molecular profile in the "prad_msk_2019" study ``` ``` r all_equal(mut_blca, all_genomic_blca$mutation) #> [1] TRUE all_equal(cna_blca, all_genomic_blca$cna) #> [1] TRUE all_equal(fus_blca, all_genomic_blca$structural_variant) #> [1] TRUE ``` Finally, we can join the two studies together ``` r mut_study <- bind_rows(mut_blca, mut_prad) cna_study <- bind_rows(cna_blca, cna_prad) fus_study <- bind_rows(fus_blca, fus_prad) ``` ### By Sample IDs When we pull by sample IDs, we can pull specific samples across multiple studies. In the above example, we can pull from both studies at the same time for a select set of samples using the `sample_study_pairs` argument in `get_genetics_by_sample()`. Let's pull data for the first 10 samples in each study. We first need to construct our dataframe to pass to the function: *Note: you can also run `available_samples(sample_list_id = )` to pull sample IDs by a specific sample list ID (see `available_sample_lists()`), or `available_patients()` to pull patient IDs* ``` r s1 <- available_samples("blca_nmibc_2017") %>% select(sampleId, patientId, studyId) %>% head(10) s2 <- available_samples("prad_msk_2019") %>% select(sampleId, patientId, studyId) %>% head(10) df_pairs <- bind_rows(s1, s2) %>% select(-patientId) ``` We need to rename the columns as per the functions documentation. ``` r df_pairs <- df_pairs %>% rename("sample_id" = sampleId, "study_id" = studyId) ``` Now we pass this to `get_genetics_by_sample()` ``` r all_genomic <- get_genetics_by_sample(sample_study_pairs = df_pairs) #> Joining with `by = join_by(study_id)` #> The following parameters were used in query: #> Study ID: "blca_nmibc_2017" and "prad_msk_2019" #> Molecular Profile ID: blca_nmibc_2017_mutations and prad_msk_2019_mutations #> Genes: "All available genes" #> Joining with `by = join_by(study_id)` #> The following parameters were used in query: #> Study ID: "blca_nmibc_2017" and "prad_msk_2019" #> Molecular Profile ID: blca_nmibc_2017_cna and prad_msk_2019_cna #> Genes: "All available genes" #> Joining with `by = join_by(study_id)` #> The following parameters were used in query: #> Study ID: "blca_nmibc_2017" and "prad_msk_2019" #> Molecular Profile ID: blca_nmibc_2017_structural_variants and prad_msk_2019_structural_variants #> Genes: "All available genes" mut_sample <- all_genomic$mutation ``` Like with querying by study ID, you can also pull data individually by genomic data type: ``` r mut_only <- get_mutations_by_sample(sample_study_pairs = df_pairs) #> Joining with `by = join_by(study_id)` #> The following parameters were used in query: #> Study ID: "blca_nmibc_2017" and "prad_msk_2019" #> Molecular Profile ID: blca_nmibc_2017_mutations and prad_msk_2019_mutations #> Genes: "All available genes" identical(mut_only, mut_sample) #> [1] TRUE ``` Let's compare these results with the ones we got from pulling by study: ``` r # filter to our subset used in sample query mut_study_subset <- mut_study %>% filter(sampleId %in% df_pairs$sample_id) # arrange to compare mut_study_subset <- mut_study_subset %>% arrange(desc(sampleId))%>% arrange(desc(entrezGeneId)) mut_sample <- mut_sample %>% arrange(desc(sampleId)) %>% arrange(desc(entrezGeneId)) %>% # reorder so columns in same order select(names(mut_study_subset)) all.equal(mut_study_subset, mut_sample) #> [1] TRUE ``` Both results are equal. Note: some studies also have copy number segments data available that can be pulled by study ID or sample ID: ``` r seg_blca <- get_segments_by_study("blca_nmibc_2017") #> ℹ Returning all "copy number segmentation" data for the "blca_nmibc_2017" study # To pull alongside other genomic data types, use the `return_segments` argument all_genomic_blca <- get_genetics_by_study("blca_nmibc_2017", return_segments = TRUE) #> ℹ Returning all data for the "blca_nmibc_2017_mutations" molecular profile in the "blca_nmibc_2017" study #> ℹ Returning all data for the "blca_nmibc_2017_cna" molecular profile in the "blca_nmibc_2017" study #> ℹ Returning all data for the "blca_nmibc_2017_structural_variants" molecular profile in the "blca_nmibc_2017" study #> ℹ Returning all "copy number segmentation" data for the "blca_nmibc_2017" study ``` #### Limit Results to Specified Genes or Panels When pulling by sample IDs, we can also limit our results to a specific set of genes by passing a vector of Entrez Gene IDs or Hugo Symbols to the `gene` argument, or a specified panel by passing a panel ID to the `panel` argument (see `available_gene_panels()` for supported panels). This can be useful if, for example, we want to pull all IMPACT gene results for two studies but one of the two uses a much larger panel. In that case, we can limit our query to just the genes for which we want results: ``` r by_hugo <- get_mutations_by_sample(sample_study_pairs = df_pairs, genes = "TP53") #> Joining with `by = join_by(study_id)` #> The following parameters were used in query: #> Study ID: "blca_nmibc_2017" and "prad_msk_2019" #> Molecular Profile ID: blca_nmibc_2017_mutations and prad_msk_2019_mutations #> Genes: "TP53" by_gene_id <- get_mutations_by_sample(sample_study_pairs = df_pairs, genes = 7157) #> Joining with `by = join_by(study_id)` #> The following parameters were used in query: #> Study ID: "blca_nmibc_2017" and "prad_msk_2019" #> Molecular Profile ID: blca_nmibc_2017_mutations and prad_msk_2019_mutations #> Genes: 7157 identical(by_hugo, by_gene_id) #> [1] TRUE ``` ``` r get_mutations_by_sample( sample_study_pairs = df_pairs, panel = "IMPACT468") %>% head() #> Joining with `by = join_by(study_id)` #> The following parameters were used in query: #> Study ID: "blca_nmibc_2017" and "prad_msk_2019" #> Molecular Profile ID: blca_nmibc_2017_mutations and prad_msk_2019_mutations #> Genes: "IMPACT468" #> # A tibble: 6 × 28 #> hugoGeneSymbol entrezGeneId uniqueSampleKey uniquePatientKey molecularProfileId sampleId patientId #> #> 1 TERT 7015 UC0wMDAxNDUzLVQwMS1J… UC0wMDAxNDUzOmJ… blca_nmibc_2017_m… P-00014… P-0001453 #> 2 SMAD4 4089 UC0wMDAxNDUzLVQwMS1J… UC0wMDAxNDUzOmJ… blca_nmibc_2017_m… P-00014… P-0001453 #> 3 ERBB4 2066 UC0wMDAxNDUzLVQwMS1J… UC0wMDAxNDUzOmJ… blca_nmibc_2017_m… P-00014… P-0001453 #> 4 CUL3 8452 UC0wMDAxNDUzLVQwMS1J… UC0wMDAxNDUzOmJ… blca_nmibc_2017_m… P-00014… P-0001453 #> 5 PBRM1 55193 UC0wMDAxNDUzLVQwMS1J… UC0wMDAxNDUzOmJ… blca_nmibc_2017_m… P-00014… P-0001453 #> 6 APC 324 UC0wMDAxNDUzLVQwMS1J… UC0wMDAxNDUzOmJ… blca_nmibc_2017_m… P-00014… P-0001453 #> # ℹ 21 more variables: studyId , center , mutationStatus , validationStatus , #> # tumorAltCount , tumorRefCount , normalAltCount , normalRefCount , #> # startPosition , endPosition , referenceAllele , proteinChange , #> # mutationType , ncbiBuild , variantType , chr , variantAllele , #> # refseqMrnaId , proteinPosStart , proteinPosEnd , keyword ``` ## Pulling Clinical Data & Sample Metadata You can also pull clinical data by study ID, sample ID, or patient ID. Pulling by sample ID will pull all sample-level characteristics (e.g. sample site, tumor stage at sampling time and other variables collected at time of sampling that may be available). Pulling by patient ID will pull all patient-level characteristics (e.g. age, sex, etc.). Pulling by study ID will pull all sample *and* patient-level characteristics at once. You can explore what clinical data is available a study using: ``` r attr_blca <- available_clinical_attributes("blca_nmibc_2017") attr_prad <- available_clinical_attributes("prad_msk_2019") attr_prad #> # A tibble: 13 × 7 #> displayName description datatype patientAttribute priority clinicalAttributeId studyId #> #> 1 Cancer Type Cancer Type STRING FALSE 1 CANCER_TYPE prad_m… #> 2 Cancer Type Detailed Cancer Typ… STRING FALSE 1 CANCER_TYPE_DETAIL… prad_m… #> 3 Fraction Genome Altered Fraction G… NUMBER FALSE 20 FRACTION_GENOME_AL… prad_m… #> 4 Gene Panel Gene Panel. STRING FALSE 1 GENE_PANEL prad_m… #> 5 Mutation Count Mutation C… NUMBER FALSE 30 MUTATION_COUNT prad_m… #> 6 Oncotree Code Oncotree C… STRING FALSE 1 ONCOTREE_CODE prad_m… #> 7 Sample Class The sample… STRING FALSE 1 SAMPLE_CLASS prad_m… #> 8 Number of Samples Per Patie… Number of … STRING TRUE 1 SAMPLE_COUNT prad_m… #> 9 Sample Type The type o… STRING FALSE 1 SAMPLE_TYPE prad_m… #> 10 Sex Sex STRING TRUE 1 SEX prad_m… #> 11 Somatic Status Somatic St… STRING FALSE 1 SOMATIC_STATUS prad_m… #> 12 Specimen Preservation Type The method… STRING FALSE 1 SPECIMEN_PRESERVAT… prad_m… #> 13 TMB (nonsynonymous) TMB (nonsy… NUMBER FALSE 1 TMB_NONSYNONYMOUS prad_m… ``` There are a select set available for both studies: ``` r in_both <- intersect(attr_blca$clinicalAttributeId, attr_prad$clinicalAttributeId) ``` The below pulls data at the sample level: ``` r clinical_blca <- get_clinical_by_sample(sample_id = s1$sampleId, study_id = "blca_nmibc_2017", clinical_attribute = in_both) clinical_prad <- get_clinical_by_sample(sample_id = s2$sampleId, study_id = "prad_msk_2019", clinical_attribute = in_both) all_clinical <- bind_rows(clinical_blca, clinical_prad) all_clinical %>% select(-contains("unique")) %>% head() #> # A tibble: 6 × 5 #> sampleId patientId studyId clinicalAttributeId value #> #> 1 P-0001453-T01-IM3 P-0001453 blca_nmibc_2017 CANCER_TYPE Bladder Cancer #> 2 P-0001453-T01-IM3 P-0001453 blca_nmibc_2017 CANCER_TYPE_DETAILED Bladder Urothelial Carcinoma #> 3 P-0001453-T01-IM3 P-0001453 blca_nmibc_2017 FRACTION_GENOME_ALTERED 0.4448 #> 4 P-0001453-T01-IM3 P-0001453 blca_nmibc_2017 MUTATION_COUNT 11 #> 5 P-0001453-T01-IM3 P-0001453 blca_nmibc_2017 ONCOTREE_CODE BLCA #> 6 P-0001453-T01-IM3 P-0001453 blca_nmibc_2017 SOMATIC_STATUS Matched ``` The below pulls data at the patient level: ``` r p1 <- available_patients("blca_nmibc_2017") clinical_blca <- get_clinical_by_patient(patient_id = s1$patientId, study_id = "blca_nmibc_2017", clinical_attribute = in_both) clinical_prad <- get_clinical_by_patient(patient_id = s2$patientId, study_id = "prad_msk_2019", clinical_attribute = in_both) all_clinical <- bind_rows(clinical_blca, clinical_prad) all_clinical %>% select(-contains("unique")) %>% head() #> # A tibble: 6 × 4 #> patientId studyId clinicalAttributeId value #> #> 1 P-0001453 blca_nmibc_2017 SAMPLE_COUNT 1 #> 2 P-0001453 blca_nmibc_2017 SEX Male #> 3 P-0002166 blca_nmibc_2017 SAMPLE_COUNT 1 #> 4 P-0002166 blca_nmibc_2017 SEX Male #> 5 P-0003238 blca_nmibc_2017 SAMPLE_COUNT 1 #> 6 P-0003238 blca_nmibc_2017 SEX Male ``` Like with the genomic data pull functions, you can also pull clinical data by a data frame of sample ID - study ID pairs, or a data frame of patient ID - study ID pairs. Below, we will pull by patient ID - study ID pairs. First, we construct the data frame of pairs to pass: ``` r df_pairs <- bind_rows(s1, s2) %>% select(-sampleId) df_pairs <- df_pairs %>% select(patientId, studyId) ``` Now we pass this data frame to `get_genetics_by_patient()` ``` r all_patient_clinical <- get_clinical_by_patient(patient_study_pairs = df_pairs, clinical_attribute = in_both) all_patient_clinical %>% select(-contains("unique")) #> # A tibble: 34 × 4 #> patientId studyId clinicalAttributeId value #> #> 1 P-0001453 blca_nmibc_2017 SAMPLE_COUNT 1 #> 2 P-0001453 blca_nmibc_2017 SEX Male #> 3 P-0002166 blca_nmibc_2017 SAMPLE_COUNT 1 #> 4 P-0002166 blca_nmibc_2017 SEX Male #> 5 P-0003238 blca_nmibc_2017 SAMPLE_COUNT 1 #> 6 P-0003238 blca_nmibc_2017 SEX Male #> 7 P-0003257 blca_nmibc_2017 SAMPLE_COUNT 1 #> 8 P-0003257 blca_nmibc_2017 SEX Female #> 9 P-0003261 blca_nmibc_2017 SAMPLE_COUNT 1 #> 10 P-0003261 blca_nmibc_2017 SEX Male #> # ℹ 24 more rows ```