Messy.Cats Introduction

messy.cats contains various functions that employ string distance tools in order to make data management easier for users working with categorical data. Categorical data, especially user inputted categorical data that often tends to be plagued by typos and different formatting choices, can be difficult to work with. messy.cats aims to provide functions that make cleaning categorical data simple and easy.

This introduction will lead you through examples of the functions in use, explain the arguments, and show how to get the most out of these functions.

First lets create some example vectors:

cars_bad = c("teal Mazda RX4", "black Mazda RX4 Wag",
             "green Datsun 710", "Hornet 4 Drive",
           "green Hornet Sportabout", "Valiant",
           "Duster 360", "orange Merc 240D",
           "Merc 230", "teal Merc 280",
           "Merc 280C", "green Merc 450SE",
           "Merc 450SL", "blue Merc 450SLC",
           "green Cadillac Fleetwood", "Lincoln Continental",
           "Chrysler Imperial")

cars_good = c("Mazda RX4", "Mazda RX4 Wag",
              "Datsun 710", "Hornet 4 Drive",
           "Hornet Sportabout", "Valiant",
           "Duster 360", "Merc 240D",
           "Merc 230", "Merc 280",
           "Merc 280C", "Merc 450SE",
           "Merc 450SL", "Merc 450SLC",
           "Cadillac Fleetwood", "Lincoln Continental",
           "Chrysler Imperial")

Suppose you have two lists of cars descriptions, one containing information on the make of the car, and the other containing make and color. Instead of string processing and deleting the color descriptors, which can be a fincky and time consuming process, cat_match() can match the contents of the two lists.

cat_match(cars_bad, cars_good, method = "jw")
#>                         bad               match  dists
#> 1            teal Mazda RX4           Mazda RX4 0.2302
#> 2       black Mazda RX4 Wag       Mazda RX4 Wag 0.2591
#> 3          green Datsun 710          Datsun 710 0.2417
#> 4            Hornet 4 Drive      Hornet 4 Drive 0.0000
#> 5   green Hornet Sportabout   Hornet Sportabout 0.1948
#> 6                   Valiant             Valiant 0.0000
#> 7                Duster 360          Duster 360 0.0000
#> 8          orange Merc 240D           Merc 240D 0.2199
#> 9                  Merc 230            Merc 230 0.0000
#> 10            teal Merc 280            Merc 280 0.2324
#> 11                Merc 280C           Merc 280C 0.0000
#> 12         green Merc 450SE          Merc 450SL 0.2532
#> 13               Merc 450SL          Merc 450SL 0.0000
#> 14         blue Merc 450SLC         Merc 450SLC 0.1799
#> 15 green Cadillac Fleetwood  Cadillac Fleetwood 0.2037
#> 16      Lincoln Continental Lincoln Continental 0.0000
#> 17        Chrysler Imperial   Chrysler Imperial 0.0000

After making sure that the string distance calculation is not making error with cat_match(), a user can use cat_replace to swap the contents of one list for their closest match in another.

cat_replace(cars_bad, cars_good, method = "jw")
#>  [1] "Mazda RX4"           "Mazda RX4 Wag"       "Datsun 710"         
#>  [4] "Hornet 4 Drive"      "Hornet Sportabout"   "Valiant"            
#>  [7] "Duster 360"          "Merc 240D"           "Merc 230"           
#> [10] "Merc 280"            "Merc 280C"           "Merc 450SL"         
#> [13] "Merc 450SL"          "Merc 450SLC"         "Cadillac Fleetwood" 
#> [16] "Lincoln Continental" "Chrysler Imperial"

Alternatively, a user could join together two dataframes that use these lists as id variables with the function cat_join().

bad_cars_df = data.frame(car = cars_bad, state_registration = "CA")
good_cars_df= data.frame(car = cars_good, insur_comp = "All State")

head(bad_cars_df)
#>                       car state_registration
#> 1          teal Mazda RX4                 CA
#> 2     black Mazda RX4 Wag                 CA
#> 3        green Datsun 710                 CA
#> 4          Hornet 4 Drive                 CA
#> 5 green Hornet Sportabout                 CA
#> 6                 Valiant                 CA
head(good_cars_df)
#>                 car insur_comp
#> 1         Mazda RX4  All State
#> 2     Mazda RX4 Wag  All State
#> 3        Datsun 710  All State
#> 4    Hornet 4 Drive  All State
#> 5 Hornet Sportabout  All State
#> 6           Valiant  All State

cat_join(bad_cars_df, good_cars_df, by="car", method="jw", join="left")
#>                    car state_registration insur_comp
#> 1            Mazda RX4                 CA  All State
#> 2        Mazda RX4 Wag                 CA  All State
#> 3           Datsun 710                 CA  All State
#> 4       Hornet 4 Drive                 CA  All State
#> 5    Hornet Sportabout                 CA  All State
#> 6              Valiant                 CA  All State
#> 7           Duster 360                 CA  All State
#> 8            Merc 240D                 CA  All State
#> 9             Merc 230                 CA  All State
#> 10            Merc 280                 CA  All State
#> 11           Merc 280C                 CA  All State
#> 12          Merc 450SL                 CA  All State
#> 13          Merc 450SL                 CA  All State
#> 14         Merc 450SLC                 CA  All State
#> 15  Cadillac Fleetwood                 CA  All State
#> 16 Lincoln Continental                 CA  All State
#> 17   Chrysler Imperial                 CA  All State

These are some of the most basic uses of the core functions in the messy.cats package. Each function mentioned has a plethora of additional arguments that users can utilize in order to fine tune their string distance calculations or increase the ease with which they use the functions.

What happens when the lists are not the same length, or the matches aren’t perfect?

messy_short <-         c("Fiat 128",
           "red Honda Civic", "Toyota Corolla",
           "Toyota Corona", "Dodge Challenger",
           "red AMC Javelin", "Camaro Z28",
           "Pontiac Firebird", "black Fiat X1-9",
           "blue Porsche 914-2", "Lotus Europa",
           "Ford Pantera L", "black Ferrari Dino",
           "black Maserati Bora", "black Volvo 142E")


clean_short <-       c(
           "Honda Civic",
           "Toyota Corona",
           "AMC Javelin",
           "Pontiac Firebird", "Fiat X1-9",
           "Porsche 914-2", "Lotus Europa",
           "Ford Pantera L", "Ferrari Dino",
           "Maserati Bora", "Volvo 142E")

cat_match(messy_short,clean_short, method = "jaccard") %>% arrange(desc(dists))
#>                    bad            match  dists
#> 1           Camaro Z28    Toyota Corona 0.5833
#> 2             Fiat 128        Fiat X1-9 0.4545
#> 3     Dodge Challenger   Ford Pantera L 0.4286
#> 4     black Volvo 142E       Volvo 142E 0.3333
#> 5      black Fiat X1-9        Fiat X1-9 0.3077
#> 6   black Ferrari Dino Pontiac Firebird 0.2667
#> 7  black Maserati Bora    Maserati Bora 0.2308
#> 8       Toyota Corolla    Toyota Corona 0.2222
#> 9   blue Porsche 914-2    Porsche 914-2 0.1875
#> 10     red Honda Civic      Honda Civic 0.1818
#> 11     red AMC Javelin      AMC Javelin 0.1667
#> 12       Toyota Corona    Toyota Corona 0.0000
#> 13    Pontiac Firebird Pontiac Firebird 0.0000
#> 14        Lotus Europa     Lotus Europa 0.0000
#> 15      Ford Pantera L   Ford Pantera L 0.0000

You can see that some of these matches are wrong, you can use the return_lists argument to return a list of the top matches. We suggest setting this to 2 or 3, if the top 3 are not still not correct, there may not be a clean match, or you may need to try a different string distance method (see select_metric().

When combined with a threshold, cat_match will only return a list for values that have a best match above that threshold.

cat_match(messy_short,clean_short, method = "jaccard", return_lists = 3, threshold = 0.2) %>% arrange(desc(dists)) %>% 
  gt::gt()
bad match dists
black Volvo 142E Volvo 142E, Porsche 914-2, AMC Javelin 0.3333, 0.6111, 0.6250
black Maserati Bora Maserati Bora, Lotus Europa, Pontiac Firebird 0.2308, 0.4667, 0.4706
black Ferrari Dino Pontiac Firebird, Ferrari Dino, Ford Pantera L 0.2667, 0.3077, 0.4000
blue Porsche 914-2 Porsche 914-2 0.1875
black Fiat X1-9 Fiat X1-9, Pontiac Firebird, AMC Javelin 0.3077, 0.6316, 0.7222
Camaro Z28 Toyota Corona, Maserati Bora, Pontiac Firebird 0.5833, 0.6429, 0.7059
red AMC Javelin AMC Javelin 0.1667
Dodge Challenger Ford Pantera L, Honda Civic, Ferrari Dino 0.4286, 0.4615, 0.4615
Toyota Corona Toyota Corona 0
Pontiac Firebird Pontiac Firebird 0
Lotus Europa Lotus Europa 0
Ford Pantera L Ford Pantera L 0
Toyota Corolla Toyota Corona, Lotus Europa, Ford Pantera L 0.2222, 0.5000, 0.5385
red Honda Civic Honda Civic 0.1818
Fiat 128 Fiat X1-9, Pontiac Firebird, Ferrari Dino 0.4545, 0.6875, 0.6923

Some of our matches have good choices, but others seem to be lacking a match in the clean vector. You can output this dataframe and manually choose the correct match or that there is no suitable match, or you can use the pick_lists argument, which prompts you in the console to choose the correct option or none of the above.

#data("picked_list")

In this more extensive example we have two datasets of biological data. The first: messy_caterpillars contains information about the average weight and length of caterpillars, and as the name suggests, has messy very caterpillar names.clean_caterpillars is a dataset containing the species and the number of caterpillars found and clean caterpillar names.

# load in messy_caterpillars and clean_caterpillars
data("clean_caterpillars")
data("messy_caterpillars")

head(messy_caterpillars)
str(messy_caterpillars)

head(clean_caterpillars)
str(clean_caterpillars)

To fix these names we can either use cat_replace() and change the caterpillar name variables and then use a merging function such as the dplyr join functions, or use cat_join().

But first, in order to properly configure our string distance arguments, we will first use cat_match() to explore how the messy and clean caterpillar names match up.

We input the messy and clean vectors—in this case columns of caterpillar names—and specify no other arguments other than to return the distance between each string pair.

cat_match(messy_caterpillars$CaterpillarSpecies,
          clean_caterpillars$species,
          return_dists = T)

The output shows the clean string with the lowest string distance from each messy string, and the distance between the pair is returned as a third column.

If we arrange by the distance in descending order, we can see the items of the messy vector with the worst matches. We can observe that the worst match is between “Papilio_glaucus” and “Orgyia leucostigma”. Additionally, this is the only incorrect match. This means that if we set a threshold lower than .5, cat_match will return no incorrect matches.

cat_match(messy_caterpillars$CaterpillarSpecies,clean_caterpillars$species,return_dists = T,method="jaccard") %>% arrange(desc(dists))
messy_caterpillars$CaterpillarSpecies = cat_replace(messy_caterpillars$CaterpillarSpecies,clean_caterpillars$species,method="jaccard", threshold = .49)

dplyr::left_join(clean_caterpillars,messy_caterpillars, by = c("species"="CaterpillarSpecies"))

Alternatively, a user could accomplish this task in one step using cat_join().

data("clean_caterpillars")
data("messy_caterpillars")

cat_join(messy_df = messy_caterpillars, clean_df = clean_caterpillars, by = c("CaterpillarSpecies", "species"), method="jaccard", threshold = .49,join="full")
#> # A tibble: 75 × 5
#>    species                `Avg Weight (mg)` `Avg Length (cm)` count  year
#>    <chr>                              <dbl>             <dbl> <int> <dbl>
#>  1 Achatia distincta                  0.809              2.64    24  2021
#>  2 Achatia distincta                  0.809              2.64    14  2020
#>  3 Achatia distincta                  0.809              2.64    16  2019
#>  4 Alsophila pometaria                2.03               1.73     8  2021
#>  5 Alsophila pometaria                2.03               1.73    18  2020
#>  6 Alsophila pometaria                2.03               1.73    11  2019
#>  7 Amphipyra pyramidoides             0.914              1.76    26  2021
#>  8 Amphipyra pyramidoides             0.914              1.76    26  2020
#>  9 Amphipyra pyramidoides             0.914              1.76     9  2019
#> 10 Himella intractata                 1.53               2.54     3  2021
#> # … with 65 more rows
data("mtcars")
mtcars_colnames_messy = mtcars
colnames(mtcars_colnames_messy)[1:5] = paste0(colnames(mtcars)[1:5], "_17")
colnames(mtcars_colnames_messy)[6:11] = paste0(colnames(mtcars)[6:11], "_2017")

Another messy dataset problem that our package hopes to help solve is row binding two datasets with different columns names. fuzzy_rbind() allows a user to join columns in dataframes using string distance matching. Any two columns with similar enough names will be bound together, and fuzzy_rbind() takes similar arguments as the rest of the functions in messy.cats to allow the user to fine tune their string distance matching.

fuzzy_rbind(df1 = mtcars, df2 = mtcars_colnames_messy, threshold = .5, 
            method = "jw")
#>      mpg cyl  disp  hp drat    wt  qsec vs am gear carb
#> 1   21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
#> 2   21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
#> 3   22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
#> 4   21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
#> 5   18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
#> 6   18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
#> 7   14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
#> 8   24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
#> 9   22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
#> 10  19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
#> 11  17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4
#> 12  16.4   8 275.8 180 3.07 4.070 17.40  0  0    3    3
#> 13  17.3   8 275.8 180 3.07 3.730 17.60  0  0    3    3
#> 14  15.2   8 275.8 180 3.07 3.780 18.00  0  0    3    3
#> 15  10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4
#> 16  10.4   8 460.0 215 3.00 5.424 17.82  0  0    3    4
#> 17  14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4
#> 18  32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
#> 19  30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
#> 20  33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
#> 21  21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1
#> 22  15.5   8 318.0 150 2.76 3.520 16.87  0  0    3    2
#> 23  15.2   8 304.0 150 3.15 3.435 17.30  0  0    3    2
#> 24  13.3   8 350.0 245 3.73 3.840 15.41  0  0    3    4
#> 25  19.2   8 400.0 175 3.08 3.845 17.05  0  0    3    2
#> 26  27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
#> 27  26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
#> 28  30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
#> 29  15.8   8 351.0 264 4.22 3.170 14.50  0  1    5    4
#> 30  19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6
#> 31  15.0   8 301.0 335 3.54 3.570 14.60  0  1    5    8
#> 32  21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2
#> 33 110.0   6 160.0 110 3.90 2.620 16.46  0  4    4    4
#> 34 110.0   6 160.0 110 3.90 2.875 17.02  0  4    4    4
#> 35  93.0   4 108.0  93 3.85 2.320 18.61  1  1    1    1
#> 36 110.0   6 258.0 110 3.08 3.215 19.44  1  1    1    1
#> 37 175.0   8 360.0 175 3.15 3.440 17.02  0  2    2    2
#> 38 105.0   6 225.0 105 2.76 3.460 20.22  1  1    1    1
#> 39 245.0   8 360.0 245 3.21 3.570 15.84  0  4    4    4
#> 40  62.0   4 146.7  62 3.69 3.190 20.00  1  2    2    2
#> 41  95.0   4 140.8  95 3.92 3.150 22.90  1  2    2    2
#> 42 123.0   6 167.6 123 3.92 3.440 18.30  1  4    4    4
#> 43 123.0   6 167.6 123 3.92 3.440 18.90  1  4    4    4
#> 44 180.0   8 275.8 180 3.07 4.070 17.40  0  3    3    3
#> 45 180.0   8 275.8 180 3.07 3.730 17.60  0  3    3    3
#> 46 180.0   8 275.8 180 3.07 3.780 18.00  0  3    3    3
#> 47 205.0   8 472.0 205 2.93 5.250 17.98  0  4    4    4
#> 48 215.0   8 460.0 215 3.00 5.424 17.82  0  4    4    4
#> 49 230.0   8 440.0 230 3.23 5.345 17.42  0  4    4    4
#> 50  66.0   4  78.7  66 4.08 2.200 19.47  1  1    1    1
#> 51  52.0   4  75.7  52 4.93 1.615 18.52  1  2    2    2
#> 52  65.0   4  71.1  65 4.22 1.835 19.90  1  1    1    1
#> 53  97.0   4 120.1  97 3.70 2.465 20.01  1  1    1    1
#> 54 150.0   8 318.0 150 2.76 3.520 16.87  0  2    2    2
#> 55 150.0   8 304.0 150 3.15 3.435 17.30  0  2    2    2
#> 56 245.0   8 350.0 245 3.73 3.840 15.41  0  4    4    4
#> 57 175.0   8 400.0 175 3.08 3.845 17.05  0  2    2    2
#> 58  66.0   4  79.0  66 4.08 1.935 18.90  1  1    1    1
#> 59  91.0   4 120.3  91 4.43 2.140 16.70  0  2    2    2
#> 60 113.0   4  95.1 113 3.77 1.513 16.90  1  2    2    2
#> 61 264.0   8 351.0 264 4.22 3.170 14.50  0  4    4    4
#> 62 175.0   6 145.0 175 3.62 2.770 15.50  0  6    6    6
#> 63 335.0   8 301.0 335 3.54 3.570 14.60  0  8    8    8
#> 64 109.0   4 121.0 109 4.11 2.780 18.60  1  2    2    2
fuzzy_rbind(df1 = mtcars, df2 = mtcars_colnames_messy, threshold = .2,
            method = "jw")
#>     mpg cyl  disp drat  qsec gear carb
#> 1  21.0   6 160.0 3.90 16.46    4    4
#> 2  21.0   6 160.0 3.90 17.02    4    4
#> 3  22.8   4 108.0 3.85 18.61    4    1
#> 4  21.4   6 258.0 3.08 19.44    3    1
#> 5  18.7   8 360.0 3.15 17.02    3    2
#> 6  18.1   6 225.0 2.76 20.22    3    1
#> 7  14.3   8 360.0 3.21 15.84    3    4
#> 8  24.4   4 146.7 3.69 20.00    4    2
#> 9  22.8   4 140.8 3.92 22.90    4    2
#> 10 19.2   6 167.6 3.92 18.30    4    4
#> 11 17.8   6 167.6 3.92 18.90    4    4
#> 12 16.4   8 275.8 3.07 17.40    3    3
#> 13 17.3   8 275.8 3.07 17.60    3    3
#> 14 15.2   8 275.8 3.07 18.00    3    3
#> 15 10.4   8 472.0 2.93 17.98    3    4
#> 16 10.4   8 460.0 3.00 17.82    3    4
#> 17 14.7   8 440.0 3.23 17.42    3    4
#> 18 32.4   4  78.7 4.08 19.47    4    1
#> 19 30.4   4  75.7 4.93 18.52    4    2
#> 20 33.9   4  71.1 4.22 19.90    4    1
#> 21 21.5   4 120.1 3.70 20.01    3    1
#> 22 15.5   8 318.0 2.76 16.87    3    2
#> 23 15.2   8 304.0 3.15 17.30    3    2
#> 24 13.3   8 350.0 3.73 15.41    3    4
#> 25 19.2   8 400.0 3.08 17.05    3    2
#> 26 27.3   4  79.0 4.08 18.90    4    1
#> 27 26.0   4 120.3 4.43 16.70    5    2
#> 28 30.4   4  95.1 3.77 16.90    5    2
#> 29 15.8   8 351.0 4.22 14.50    5    4
#> 30 19.7   6 145.0 3.62 15.50    5    6
#> 31 15.0   8 301.0 3.54 14.60    5    8
#> 32 21.4   4 121.0 4.11 18.60    4    2
#> 33 21.0   6 160.0 3.90 16.46    4    4
#> 34 21.0   6 160.0 3.90 17.02    4    4
#> 35 22.8   4 108.0 3.85 18.61    4    1
#> 36 21.4   6 258.0 3.08 19.44    3    1
#> 37 18.7   8 360.0 3.15 17.02    3    2
#> 38 18.1   6 225.0 2.76 20.22    3    1
#> 39 14.3   8 360.0 3.21 15.84    3    4
#> 40 24.4   4 146.7 3.69 20.00    4    2
#> 41 22.8   4 140.8 3.92 22.90    4    2
#> 42 19.2   6 167.6 3.92 18.30    4    4
#> 43 17.8   6 167.6 3.92 18.90    4    4
#> 44 16.4   8 275.8 3.07 17.40    3    3
#> 45 17.3   8 275.8 3.07 17.60    3    3
#> 46 15.2   8 275.8 3.07 18.00    3    3
#> 47 10.4   8 472.0 2.93 17.98    3    4
#> 48 10.4   8 460.0 3.00 17.82    3    4
#> 49 14.7   8 440.0 3.23 17.42    3    4
#> 50 32.4   4  78.7 4.08 19.47    4    1
#> 51 30.4   4  75.7 4.93 18.52    4    2
#> 52 33.9   4  71.1 4.22 19.90    4    1
#> 53 21.5   4 120.1 3.70 20.01    3    1
#> 54 15.5   8 318.0 2.76 16.87    3    2
#> 55 15.2   8 304.0 3.15 17.30    3    2
#> 56 13.3   8 350.0 3.73 15.41    3    4
#> 57 19.2   8 400.0 3.08 17.05    3    2
#> 58 27.3   4  79.0 4.08 18.90    4    1
#> 59 26.0   4 120.3 4.43 16.70    5    2
#> 60 30.4   4  95.1 3.77 16.90    5    2
#> 61 15.8   8 351.0 4.22 14.50    5    4
#> 62 19.7   6 145.0 3.62 15.50    5    6
#> 63 15.0   8 301.0 3.54 14.60    5    8
#> 64 21.4   4 121.0 4.11 18.60    4    2

The second fuzzy_rbind() call results in fewer bound columns because the user asked for a lower threshold.

messy.cats also provides a function to help the user choose a string distance metric in order to get the best matches possible. select_metric() gives an approximate measure of “certainty” of matches for some of R’s string distance metrics on a given clean and messy vector pair.

select_metric(c("axxxxx", "bxxxxx", "cxxxxx"), c("apples", "banana", "carrot"))
#> [1] "jw, p = 0.1"

In the above example, select_metric() recommends the Jaro-Winkler distance with a p value of 0.1, since this metric weights agreement in the beginning of strings more heavily. The other metrics do not consider the position of discrepancies in strings, and as such are less “certain” of their matches given the nature of the example.

select_metric(c("ipzza", "rgegplants", "vrem aof wheat"), c("pizza", "eggplants", "cream of wheat"))
#> [1] "osa"

This example recommends Optimal String Alignment distance, since this metric is more forgiving of adjacent character swaps. Notably, Jaro-Winkler with p of 0.1 is a particularly bad choice for this dataset, since all intended matches disagree on their first character.