messy.cats
contains various functions that employ string
distance tools in order to make data management easier for users working
with categorical data. Categorical data, especially user inputted
categorical data that often tends to be plagued by typos and different
formatting choices, can be difficult to work with.
messy.cats
aims to provide functions that make cleaning
categorical data simple and easy.
This introduction will lead you through examples of the functions in use, explain the arguments, and show how to get the most out of these functions.
First lets create some example vectors:
= c("teal Mazda RX4", "black Mazda RX4 Wag",
cars_bad "green Datsun 710", "Hornet 4 Drive",
"green Hornet Sportabout", "Valiant",
"Duster 360", "orange Merc 240D",
"Merc 230", "teal Merc 280",
"Merc 280C", "green Merc 450SE",
"Merc 450SL", "blue Merc 450SLC",
"green Cadillac Fleetwood", "Lincoln Continental",
"Chrysler Imperial")
= c("Mazda RX4", "Mazda RX4 Wag",
cars_good "Datsun 710", "Hornet 4 Drive",
"Hornet Sportabout", "Valiant",
"Duster 360", "Merc 240D",
"Merc 230", "Merc 280",
"Merc 280C", "Merc 450SE",
"Merc 450SL", "Merc 450SLC",
"Cadillac Fleetwood", "Lincoln Continental",
"Chrysler Imperial")
Suppose you have two lists of cars descriptions, one containing information on the make of the car, and the other containing make and color. Instead of string processing and deleting the color descriptors, which can be a fincky and time consuming process, cat_match() can match the contents of the two lists.
cat_match(cars_bad, cars_good, method = "jw")
#> bad match dists
#> 1 teal Mazda RX4 Mazda RX4 0.2302
#> 2 black Mazda RX4 Wag Mazda RX4 Wag 0.2591
#> 3 green Datsun 710 Datsun 710 0.2417
#> 4 Hornet 4 Drive Hornet 4 Drive 0.0000
#> 5 green Hornet Sportabout Hornet Sportabout 0.1948
#> 6 Valiant Valiant 0.0000
#> 7 Duster 360 Duster 360 0.0000
#> 8 orange Merc 240D Merc 240D 0.2199
#> 9 Merc 230 Merc 230 0.0000
#> 10 teal Merc 280 Merc 280 0.2324
#> 11 Merc 280C Merc 280C 0.0000
#> 12 green Merc 450SE Merc 450SL 0.2532
#> 13 Merc 450SL Merc 450SL 0.0000
#> 14 blue Merc 450SLC Merc 450SLC 0.1799
#> 15 green Cadillac Fleetwood Cadillac Fleetwood 0.2037
#> 16 Lincoln Continental Lincoln Continental 0.0000
#> 17 Chrysler Imperial Chrysler Imperial 0.0000
After making sure that the string distance calculation is not making error with cat_match(), a user can use cat_replace to swap the contents of one list for their closest match in another.
cat_replace(cars_bad, cars_good, method = "jw")
#> [1] "Mazda RX4" "Mazda RX4 Wag" "Datsun 710"
#> [4] "Hornet 4 Drive" "Hornet Sportabout" "Valiant"
#> [7] "Duster 360" "Merc 240D" "Merc 230"
#> [10] "Merc 280" "Merc 280C" "Merc 450SL"
#> [13] "Merc 450SL" "Merc 450SLC" "Cadillac Fleetwood"
#> [16] "Lincoln Continental" "Chrysler Imperial"
Alternatively, a user could join together two dataframes that use these lists as id variables with the function cat_join().
= data.frame(car = cars_bad, state_registration = "CA")
bad_cars_df = data.frame(car = cars_good, insur_comp = "All State")
good_cars_df
head(bad_cars_df)
#> car state_registration
#> 1 teal Mazda RX4 CA
#> 2 black Mazda RX4 Wag CA
#> 3 green Datsun 710 CA
#> 4 Hornet 4 Drive CA
#> 5 green Hornet Sportabout CA
#> 6 Valiant CA
head(good_cars_df)
#> car insur_comp
#> 1 Mazda RX4 All State
#> 2 Mazda RX4 Wag All State
#> 3 Datsun 710 All State
#> 4 Hornet 4 Drive All State
#> 5 Hornet Sportabout All State
#> 6 Valiant All State
cat_join(bad_cars_df, good_cars_df, by="car", method="jw", join="left")
#> car state_registration insur_comp
#> 1 Mazda RX4 CA All State
#> 2 Mazda RX4 Wag CA All State
#> 3 Datsun 710 CA All State
#> 4 Hornet 4 Drive CA All State
#> 5 Hornet Sportabout CA All State
#> 6 Valiant CA All State
#> 7 Duster 360 CA All State
#> 8 Merc 240D CA All State
#> 9 Merc 230 CA All State
#> 10 Merc 280 CA All State
#> 11 Merc 280C CA All State
#> 12 Merc 450SL CA All State
#> 13 Merc 450SL CA All State
#> 14 Merc 450SLC CA All State
#> 15 Cadillac Fleetwood CA All State
#> 16 Lincoln Continental CA All State
#> 17 Chrysler Imperial CA All State
These are some of the most basic uses of the core functions in the messy.cats package. Each function mentioned has a plethora of additional arguments that users can utilize in order to fine tune their string distance calculations or increase the ease with which they use the functions.
What happens when the lists are not the same length, or the matches aren’t perfect?
<- c("Fiat 128",
messy_short "red Honda Civic", "Toyota Corolla",
"Toyota Corona", "Dodge Challenger",
"red AMC Javelin", "Camaro Z28",
"Pontiac Firebird", "black Fiat X1-9",
"blue Porsche 914-2", "Lotus Europa",
"Ford Pantera L", "black Ferrari Dino",
"black Maserati Bora", "black Volvo 142E")
<- c(
clean_short "Honda Civic",
"Toyota Corona",
"AMC Javelin",
"Pontiac Firebird", "Fiat X1-9",
"Porsche 914-2", "Lotus Europa",
"Ford Pantera L", "Ferrari Dino",
"Maserati Bora", "Volvo 142E")
cat_match(messy_short,clean_short, method = "jaccard") %>% arrange(desc(dists))
#> bad match dists
#> 1 Camaro Z28 Toyota Corona 0.5833
#> 2 Fiat 128 Fiat X1-9 0.4545
#> 3 Dodge Challenger Ford Pantera L 0.4286
#> 4 black Volvo 142E Volvo 142E 0.3333
#> 5 black Fiat X1-9 Fiat X1-9 0.3077
#> 6 black Ferrari Dino Pontiac Firebird 0.2667
#> 7 black Maserati Bora Maserati Bora 0.2308
#> 8 Toyota Corolla Toyota Corona 0.2222
#> 9 blue Porsche 914-2 Porsche 914-2 0.1875
#> 10 red Honda Civic Honda Civic 0.1818
#> 11 red AMC Javelin AMC Javelin 0.1667
#> 12 Toyota Corona Toyota Corona 0.0000
#> 13 Pontiac Firebird Pontiac Firebird 0.0000
#> 14 Lotus Europa Lotus Europa 0.0000
#> 15 Ford Pantera L Ford Pantera L 0.0000
You can see that some of these matches are wrong, you can use the
return_lists
argument to return a list of the top matches.
We suggest setting this to 2 or 3, if the top 3 are not still not
correct, there may not be a clean match, or you may need to try a
different string distance method (see select_metric()
.
When combined with a threshold, cat_match
will only
return a list for values that have a best match above that
threshold.
cat_match(messy_short,clean_short, method = "jaccard", return_lists = 3, threshold = 0.2) %>% arrange(desc(dists)) %>%
::gt() gt
bad | match | dists |
---|---|---|
black Volvo 142E | Volvo 142E, Porsche 914-2, AMC Javelin | 0.3333, 0.6111, 0.6250 |
black Maserati Bora | Maserati Bora, Lotus Europa, Pontiac Firebird | 0.2308, 0.4667, 0.4706 |
black Ferrari Dino | Pontiac Firebird, Ferrari Dino, Ford Pantera L | 0.2667, 0.3077, 0.4000 |
blue Porsche 914-2 | Porsche 914-2 | 0.1875 |
black Fiat X1-9 | Fiat X1-9, Pontiac Firebird, AMC Javelin | 0.3077, 0.6316, 0.7222 |
Camaro Z28 | Toyota Corona, Maserati Bora, Pontiac Firebird | 0.5833, 0.6429, 0.7059 |
red AMC Javelin | AMC Javelin | 0.1667 |
Dodge Challenger | Ford Pantera L, Honda Civic, Ferrari Dino | 0.4286, 0.4615, 0.4615 |
Toyota Corona | Toyota Corona | 0 |
Pontiac Firebird | Pontiac Firebird | 0 |
Lotus Europa | Lotus Europa | 0 |
Ford Pantera L | Ford Pantera L | 0 |
Toyota Corolla | Toyota Corona, Lotus Europa, Ford Pantera L | 0.2222, 0.5000, 0.5385 |
red Honda Civic | Honda Civic | 0.1818 |
Fiat 128 | Fiat X1-9, Pontiac Firebird, Ferrari Dino | 0.4545, 0.6875, 0.6923 |
Some of our matches have good choices, but others seem to be lacking
a match in the clean vector. You can output this dataframe and manually
choose the correct match or that there is no suitable match, or you can
use the pick_lists
argument, which prompts you in the
console to choose the correct option or none of the above.
#data("picked_list")
In this more extensive example we have two datasets of biological
data. The first: messy_caterpillars
contains information
about the average weight and length of caterpillars, and as the name
suggests, has messy very caterpillar
names.clean_caterpillars
is a dataset containing the
species and the number of caterpillars found and clean caterpillar
names.
# load in messy_caterpillars and clean_caterpillars
data("clean_caterpillars")
data("messy_caterpillars")
head(messy_caterpillars)
str(messy_caterpillars)
head(clean_caterpillars)
str(clean_caterpillars)
To fix these names we can either use cat_replace()
and
change the caterpillar name variables and then use a merging function
such as the dplyr join functions, or use cat_join()
.
But first, in order to properly configure our string distance
arguments, we will first use cat_match()
to explore how the
messy and clean caterpillar names match up.
We input the messy and clean vectors—in this case columns of caterpillar names—and specify no other arguments other than to return the distance between each string pair.
cat_match(messy_caterpillars$CaterpillarSpecies,
$species,
clean_caterpillarsreturn_dists = T)
The output shows the clean string with the lowest string distance from each messy string, and the distance between the pair is returned as a third column.
If we arrange by the distance in descending order, we can see the items of the messy vector with the worst matches. We can observe that the worst match is between “Papilio_glaucus” and “Orgyia leucostigma”. Additionally, this is the only incorrect match. This means that if we set a threshold lower than .5, cat_match will return no incorrect matches.
cat_match(messy_caterpillars$CaterpillarSpecies,clean_caterpillars$species,return_dists = T,method="jaccard") %>% arrange(desc(dists))
$CaterpillarSpecies = cat_replace(messy_caterpillars$CaterpillarSpecies,clean_caterpillars$species,method="jaccard", threshold = .49)
messy_caterpillars
::left_join(clean_caterpillars,messy_caterpillars, by = c("species"="CaterpillarSpecies")) dplyr
Alternatively, a user could accomplish this task in one step using
cat_join()
.
data("clean_caterpillars")
data("messy_caterpillars")
cat_join(messy_df = messy_caterpillars, clean_df = clean_caterpillars, by = c("CaterpillarSpecies", "species"), method="jaccard", threshold = .49,join="full")
#> # A tibble: 75 × 5
#> species `Avg Weight (mg)` `Avg Length (cm)` count year
#> <chr> <dbl> <dbl> <int> <dbl>
#> 1 Achatia distincta 0.809 2.64 24 2021
#> 2 Achatia distincta 0.809 2.64 14 2020
#> 3 Achatia distincta 0.809 2.64 16 2019
#> 4 Alsophila pometaria 2.03 1.73 8 2021
#> 5 Alsophila pometaria 2.03 1.73 18 2020
#> 6 Alsophila pometaria 2.03 1.73 11 2019
#> 7 Amphipyra pyramidoides 0.914 1.76 26 2021
#> 8 Amphipyra pyramidoides 0.914 1.76 26 2020
#> 9 Amphipyra pyramidoides 0.914 1.76 9 2019
#> 10 Himella intractata 1.53 2.54 3 2021
#> # … with 65 more rows
data("mtcars")
= mtcars
mtcars_colnames_messy colnames(mtcars_colnames_messy)[1:5] = paste0(colnames(mtcars)[1:5], "_17")
colnames(mtcars_colnames_messy)[6:11] = paste0(colnames(mtcars)[6:11], "_2017")
Another messy dataset problem that our package hopes to help solve is
row binding two datasets with different columns names.
fuzzy_rbind()
allows a user to join columns in dataframes
using string distance matching. Any two columns with similar enough
names will be bound together, and fuzzy_rbind()
takes
similar arguments as the rest of the functions in
messy.cats
to allow the user to fine tune their string
distance matching.
fuzzy_rbind(df1 = mtcars, df2 = mtcars_colnames_messy, threshold = .5,
method = "jw")
#> mpg cyl disp hp drat wt qsec vs am gear carb
#> 1 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
#> 2 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
#> 3 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
#> 4 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
#> 5 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
#> 6 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
#> 7 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
#> 8 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
#> 9 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
#> 10 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
#> 11 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4
#> 12 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3
#> 13 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3
#> 14 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3
#> 15 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4
#> 16 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4
#> 17 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4
#> 18 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
#> 19 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
#> 20 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
#> 21 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1
#> 22 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2
#> 23 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2
#> 24 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4
#> 25 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2
#> 26 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
#> 27 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
#> 28 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
#> 29 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4
#> 30 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6
#> 31 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8
#> 32 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2
#> 33 110.0 6 160.0 110 3.90 2.620 16.46 0 4 4 4
#> 34 110.0 6 160.0 110 3.90 2.875 17.02 0 4 4 4
#> 35 93.0 4 108.0 93 3.85 2.320 18.61 1 1 1 1
#> 36 110.0 6 258.0 110 3.08 3.215 19.44 1 1 1 1
#> 37 175.0 8 360.0 175 3.15 3.440 17.02 0 2 2 2
#> 38 105.0 6 225.0 105 2.76 3.460 20.22 1 1 1 1
#> 39 245.0 8 360.0 245 3.21 3.570 15.84 0 4 4 4
#> 40 62.0 4 146.7 62 3.69 3.190 20.00 1 2 2 2
#> 41 95.0 4 140.8 95 3.92 3.150 22.90 1 2 2 2
#> 42 123.0 6 167.6 123 3.92 3.440 18.30 1 4 4 4
#> 43 123.0 6 167.6 123 3.92 3.440 18.90 1 4 4 4
#> 44 180.0 8 275.8 180 3.07 4.070 17.40 0 3 3 3
#> 45 180.0 8 275.8 180 3.07 3.730 17.60 0 3 3 3
#> 46 180.0 8 275.8 180 3.07 3.780 18.00 0 3 3 3
#> 47 205.0 8 472.0 205 2.93 5.250 17.98 0 4 4 4
#> 48 215.0 8 460.0 215 3.00 5.424 17.82 0 4 4 4
#> 49 230.0 8 440.0 230 3.23 5.345 17.42 0 4 4 4
#> 50 66.0 4 78.7 66 4.08 2.200 19.47 1 1 1 1
#> 51 52.0 4 75.7 52 4.93 1.615 18.52 1 2 2 2
#> 52 65.0 4 71.1 65 4.22 1.835 19.90 1 1 1 1
#> 53 97.0 4 120.1 97 3.70 2.465 20.01 1 1 1 1
#> 54 150.0 8 318.0 150 2.76 3.520 16.87 0 2 2 2
#> 55 150.0 8 304.0 150 3.15 3.435 17.30 0 2 2 2
#> 56 245.0 8 350.0 245 3.73 3.840 15.41 0 4 4 4
#> 57 175.0 8 400.0 175 3.08 3.845 17.05 0 2 2 2
#> 58 66.0 4 79.0 66 4.08 1.935 18.90 1 1 1 1
#> 59 91.0 4 120.3 91 4.43 2.140 16.70 0 2 2 2
#> 60 113.0 4 95.1 113 3.77 1.513 16.90 1 2 2 2
#> 61 264.0 8 351.0 264 4.22 3.170 14.50 0 4 4 4
#> 62 175.0 6 145.0 175 3.62 2.770 15.50 0 6 6 6
#> 63 335.0 8 301.0 335 3.54 3.570 14.60 0 8 8 8
#> 64 109.0 4 121.0 109 4.11 2.780 18.60 1 2 2 2
fuzzy_rbind(df1 = mtcars, df2 = mtcars_colnames_messy, threshold = .2,
method = "jw")
#> mpg cyl disp drat qsec gear carb
#> 1 21.0 6 160.0 3.90 16.46 4 4
#> 2 21.0 6 160.0 3.90 17.02 4 4
#> 3 22.8 4 108.0 3.85 18.61 4 1
#> 4 21.4 6 258.0 3.08 19.44 3 1
#> 5 18.7 8 360.0 3.15 17.02 3 2
#> 6 18.1 6 225.0 2.76 20.22 3 1
#> 7 14.3 8 360.0 3.21 15.84 3 4
#> 8 24.4 4 146.7 3.69 20.00 4 2
#> 9 22.8 4 140.8 3.92 22.90 4 2
#> 10 19.2 6 167.6 3.92 18.30 4 4
#> 11 17.8 6 167.6 3.92 18.90 4 4
#> 12 16.4 8 275.8 3.07 17.40 3 3
#> 13 17.3 8 275.8 3.07 17.60 3 3
#> 14 15.2 8 275.8 3.07 18.00 3 3
#> 15 10.4 8 472.0 2.93 17.98 3 4
#> 16 10.4 8 460.0 3.00 17.82 3 4
#> 17 14.7 8 440.0 3.23 17.42 3 4
#> 18 32.4 4 78.7 4.08 19.47 4 1
#> 19 30.4 4 75.7 4.93 18.52 4 2
#> 20 33.9 4 71.1 4.22 19.90 4 1
#> 21 21.5 4 120.1 3.70 20.01 3 1
#> 22 15.5 8 318.0 2.76 16.87 3 2
#> 23 15.2 8 304.0 3.15 17.30 3 2
#> 24 13.3 8 350.0 3.73 15.41 3 4
#> 25 19.2 8 400.0 3.08 17.05 3 2
#> 26 27.3 4 79.0 4.08 18.90 4 1
#> 27 26.0 4 120.3 4.43 16.70 5 2
#> 28 30.4 4 95.1 3.77 16.90 5 2
#> 29 15.8 8 351.0 4.22 14.50 5 4
#> 30 19.7 6 145.0 3.62 15.50 5 6
#> 31 15.0 8 301.0 3.54 14.60 5 8
#> 32 21.4 4 121.0 4.11 18.60 4 2
#> 33 21.0 6 160.0 3.90 16.46 4 4
#> 34 21.0 6 160.0 3.90 17.02 4 4
#> 35 22.8 4 108.0 3.85 18.61 4 1
#> 36 21.4 6 258.0 3.08 19.44 3 1
#> 37 18.7 8 360.0 3.15 17.02 3 2
#> 38 18.1 6 225.0 2.76 20.22 3 1
#> 39 14.3 8 360.0 3.21 15.84 3 4
#> 40 24.4 4 146.7 3.69 20.00 4 2
#> 41 22.8 4 140.8 3.92 22.90 4 2
#> 42 19.2 6 167.6 3.92 18.30 4 4
#> 43 17.8 6 167.6 3.92 18.90 4 4
#> 44 16.4 8 275.8 3.07 17.40 3 3
#> 45 17.3 8 275.8 3.07 17.60 3 3
#> 46 15.2 8 275.8 3.07 18.00 3 3
#> 47 10.4 8 472.0 2.93 17.98 3 4
#> 48 10.4 8 460.0 3.00 17.82 3 4
#> 49 14.7 8 440.0 3.23 17.42 3 4
#> 50 32.4 4 78.7 4.08 19.47 4 1
#> 51 30.4 4 75.7 4.93 18.52 4 2
#> 52 33.9 4 71.1 4.22 19.90 4 1
#> 53 21.5 4 120.1 3.70 20.01 3 1
#> 54 15.5 8 318.0 2.76 16.87 3 2
#> 55 15.2 8 304.0 3.15 17.30 3 2
#> 56 13.3 8 350.0 3.73 15.41 3 4
#> 57 19.2 8 400.0 3.08 17.05 3 2
#> 58 27.3 4 79.0 4.08 18.90 4 1
#> 59 26.0 4 120.3 4.43 16.70 5 2
#> 60 30.4 4 95.1 3.77 16.90 5 2
#> 61 15.8 8 351.0 4.22 14.50 5 4
#> 62 19.7 6 145.0 3.62 15.50 5 6
#> 63 15.0 8 301.0 3.54 14.60 5 8
#> 64 21.4 4 121.0 4.11 18.60 4 2
The second fuzzy_rbind()
call results in fewer bound
columns because the user asked for a lower threshold.
messy.cats also provides a function to help the user choose a string
distance metric in order to get the best matches possible.
select_metric()
gives an approximate measure of “certainty”
of matches for some of R’s string distance metrics on a given clean and
messy vector pair.
select_metric(c("axxxxx", "bxxxxx", "cxxxxx"), c("apples", "banana", "carrot"))
#> [1] "jw, p = 0.1"
In the above example, select_metric()
recommends the
Jaro-Winkler distance with a p value of 0.1, since this metric weights
agreement in the beginning of strings more heavily. The other metrics do
not consider the position of discrepancies in strings, and as such are
less “certain” of their matches given the nature of the example.
select_metric(c("ipzza", "rgegplants", "vrem aof wheat"), c("pizza", "eggplants", "cream of wheat"))
#> [1] "osa"
This example recommends Optimal String Alignment distance, since this metric is more forgiving of adjacent character swaps. Notably, Jaro-Winkler with p of 0.1 is a particularly bad choice for this dataset, since all intended matches disagree on their first character.