library(validata)
library(tidyselect)
In data analysis tasks we often have data sets with multiple possible ID columns, but it’s not always clear which combination uniquely identifies each row.
sample_data1 has 125 row with 3 ID type columns and 3 value columns.
head(sample_data1)
#> # A tibble: 6 x 6
#> ID_COL1 ID_COL2 ID_COL3 VAL1 VAL2 VAL3
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 2413 1034 1014 -0.0639 -1.16 -0.302
#> 2 2413 1034 1322 0.363 1.62 0.165
#> 3 2413 1034 2999 -0.00466 1.23 0.819
#> 4 2413 1034 3544 1.83 -2.58 -0.525
#> 5 2413 1034 9901 0.837 -0.442 -0.341
#> 6 2413 1122 1014 -0.894 -1.11 0.768
Let’s use confirm_distinct
iteratively to find the uniquely identifying columns of sample_data1.
%>%
sample_data1 confirm_distinct(ID_COL1)
#> database has 120 duplicates at ID_COL1
%>%
sample_data1 confirm_distinct(ID_COL1, ID_COL2)
#> database has 100 duplicates at ID_COL1, ID_COL2
%>%
sample_data1 confirm_distinct(ID_COL1, ID_COL2, ID_COL3)
#> database is distinct at ID_COL1, ID_COL2, ID_COL3
Here we can conclude that the combination of 3 ID columns is the primary key for the data.
These steps can be automated with the wrapper function determine distinct
.
%>%
sample_data1 determine_distinct(matches("ID"))
confirm_mapping
tells you the mapping between two columns in a data frame:
confirm_mapping
gives the option to view which type of mapping is associated with each individual row.
%>%
sample_data1 confirm_mapping(ID_COL1, ID_COL2, view = F)
#> many - many mapping between ID_COL1 and ID_COL2
%>%
sample_data1 determine_mapping(everything())
The overlap
functions give a venn style description of the values in 2 columns. This is especially useful before performing a join
function, and you want to confirm that the dataframes have matching keys.
confirm_overlap
is different from the other confirm
functions in that it takes 2 vectors as arguments, instead of a data frame. This is to allow the user to test overlap between different dataframes, or arbitrary vectors if necessary
confirm_overlap(iris$Sepal.Width, iris$Petal.Length) -> iris_overlap
#> # A tibble: 1 x 5
#> only_in_iris_Sepal.W… only_in_iris_Petal.… shared_names total_names pct_shared
#> <int> <int> <int> <int> <chr>
#> 1 12 32 11 55 20%
confirm_overlap
returns a summary data frame invisibly allowing you to access individual elements using the helper functions.
print(iris_overlap)
#> # A tibble: 55 x 4
#> x iris_Sepal.Width iris_Petal.Length both_flags
#> <dbl> <dbl> <dbl> <dbl>
#> 1 3.5 1 1 2
#> 2 3 1 1 2
#> 3 3.2 1 0 1
#> 4 3.1 1 0 1
#> 5 3.6 1 1 2
#> 6 3.9 1 1 2
#> 7 3.4 1 0 1
#> 8 2.9 1 0 1
#> 9 3.7 1 1 2
#> 10 4 1 1 2
#> # … with 45 more rows
Find the elements unique to the first column
%>%
iris_overlap co_find_only_in_1() %>%
head()
#> # A tibble: 6 x 1
#> iris_Sepal.Width
#> <dbl>
#> 1 3.2
#> 2 3.1
#> 3 3.4
#> 4 2.9
#> 5 2.3
#> 6 2.8
Find the elements unique to the second column
%>%
iris_overlap co_find_only_in_2() %>%
head()
#> # A tibble: 6 x 1
#> iris_Petal.Length
#> <dbl>
#> 1 1.4
#> 2 1.3
#> 3 1.5
#> 4 1.7
#> 5 1.6
#> 6 1.1
Find the elements shared by both columns
%>%
iris_overlap co_find_in_both() %>%
head()
#> # A tibble: 6 x 1
#> x
#> <dbl>
#> 1 3.5
#> 2 3
#> 3 3.6
#> 4 3.9
#> 5 3.7
#> 6 4
determine_overlap
takes a dataframe and a tidyselect specification, and returns a tibble summarizing all of the pairwise overlaps. Only pairs with matching types are tested.
Note that the overlap
functions only test pairwise overlaps. For multi-column and large-scale overlap testing, see Complex Upset Plots
Get a frequency table of string lengths in a character column. Table is printed while the original df is returned invisibly with a column indicating the string lengths.
%>%
iris confirm_strlen(Species) -> species_len
#> Species_chr_len n percent
#> 6 50 33.3%
#> 9 50 33.3%
#> 10 50 33.3%
output is a dataframe
head(species_len)
#> # A tibble: 6 x 6
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species Species_chr_len
#> <dbl> <dbl> <dbl> <dbl> <fct> <int>
#> 1 5.1 3.5 1.4 0.2 setosa 6
#> 2 4.9 3 1.4 0.2 setosa 6
#> 3 4.7 3.2 1.3 0.2 setosa 6
#> 4 4.6 3.1 1.5 0.2 setosa 6
#> 5 5 3.6 1.4 0.2 setosa 6
#> 6 5.4 3.9 1.7 0.4 setosa 6
A helped function for the output of confirm_strlen
that filters the database for chosen string lengths.
%>%
species_len choose_strlen(len = 6) %>%
head()
#> # A tibble: 6 x 6
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species Species_chr_len
#> <dbl> <dbl> <dbl> <dbl> <fct> <int>
#> 1 5.1 3.5 1.4 0.2 setosa 6
#> 2 4.9 3 1.4 0.2 setosa 6
#> 3 4.7 3.2 1.3 0.2 setosa 6
#> 4 4.6 3.1 1.5 0.2 setosa 6
#> 5 5 3.6 1.4 0.2 setosa 6
#> 6 5.4 3.9 1.7 0.4 setosa 6
Reproduction of diagnose from the dlookr package. Usually a good choice for first analyzing a data set.
%>%
iris diagnose()
#> # A tibble: 5 x 6
#> variables types missing_count missing_percent unique_count unique_rate
#> <chr> <chr> <int> <dbl> <int> <dbl>
#> 1 Sepal.Length numeric 0 0 35 0.233
#> 2 Sepal.Width numeric 0 0 23 0.153
#> 3 Petal.Length numeric 0 0 43 0.287
#> 4 Petal.Width numeric 0 0 22 0.147
#> 5 Species factor 0 0 3 0.02