Syntax


data("corp_data1", package = "fedmatch")
data("corp_data2", package = "fedmatch")

The basic syntax is as follows, using the example company data in fedmatch:

corp_data1_test <- copy(corp_data1)
data.table::setDT(corp_data1_test)
corp_data2_test <- copy(corp_data2)
data.table::setDT(corp_data2_test)

corp_data1_test[, id_1 := seq(1, .N)]
corp_data2_test[, id_2 := seq(1, .N)]

corp_data2_test[, Country := country]
corp_data2_test[, Company := Name]
result <- merge_plus(
  data1 = corp_data1_test,
  data2 = corp_data2_test,
  match_type = "multivar",
  by = c("Country", "Company"),
   suffixes = c("_1", "_2"),
  unique_key_1 = "id_1",
  unique_key_2 = "id_2", 
  multivar_settings = build_multivar_settings(
  compare_type = c("indicator", "stringdist"),
  wgts = c(.5, .5), nthread = 1
 
))
result
#> $matches
#> Key: <id_2>
#> Index: <tier>
#>     unique_key_1          Company_1 Country_1  State   SIC Revenue  id_1
#>            <int>             <char>    <char> <char> <num>   <num> <int>
#>  1:            1            Walmart       USA     OH  3300     485     1
#>  2:            2   Bershire Hataway       USA         2222     223     2
#>  3:            3              Apple       USA     CA  3384     215     3
#>  4:            4       Exxon Mobile       USA     TX  2222     205     4
#>  5:            5          McKesson    Germany     MA   222     192     5
#>  6:            6 UnitedHealth Group       USA     MA    NA     184     6
#>  7:            7         CVS Health       USA     RI  1112     177     7
#>  8:            9               AT&T       USA     TN  4000     163     9
#>  9:            8     General Motors       USA     MI  2222     166     8
#> 10:           10 Ford Motor Company       USA     MI    NA     151    10
#>                   Name country state_code SIC_code earnings unique_key_2  id_2
#>                 <char>  <char>     <char>    <num>   <char>        <int> <int>
#>  1:            Walmart     USA         OH     3380  490,000            1     1
#>  2:  Bershire Hathaway     USA         NE     2220  220,000            2     2
#>  3:     Apple Computer     USA         CA       NA  220,000            3     3
#>  4: Exxon Mobile Inc.      USA         TX     2222  210,000            4     4
#>  5:     McKesson Corp.                 MA     2222  190,000            5     5
#>  6: UnitedHealth Group     USA         MA     1130  180,000            6     6
#>  7: UnitedHealth Group     USA         MA     1130  180,000            6     6
#>  8:             AT & T     USA         TN     4000  160,000            9     9
#>  9:         Ford Motor     USA         MI     2222  150,000           10    10
#> 10:         Ford Motor     USA         MI     2222  150,000           10    10
#>     Country_2          Company_2 Country_compare Company_compare multivar_score
#>        <char>             <char>           <num>           <num>          <num>
#>  1:       USA            Walmart               1       1.0000000      1.0000000
#>  2:       USA  Bershire Hathaway               1       0.9882353      0.9941176
#>  3:       USA     Apple Computer               1       0.8714286      0.9357143
#>  4:       USA Exxon Mobile Inc.                1       0.9333333      0.9666667
#>  5:               McKesson Corp.               0       0.9285714      0.4642857
#>  6:       USA UnitedHealth Group               1       1.0000000      1.0000000
#>  7:       USA UnitedHealth Group               1       0.5333333      0.7666667
#>  8:       USA             AT & T               1       0.9111111      0.9555556
#>  9:       USA         Ford Motor               1       0.7333333      0.8666667
#> 10:       USA         Ford Motor               1       0.9111111      0.9555556
#>       tier
#>     <char>
#>  1:    all
#>  2:    all
#>  3:    all
#>  4:    all
#>  5:    all
#>  6:    all
#>  7:    all
#>  8:    all
#>  9:    all
#> 10:    all
#> 
#> $matches_filter
#> Null data.table (0 rows and 0 cols)
#> 
#> $data1_nomatch
#> Empty data.table (0 rows and 7 cols): Company,Country,State,SIC,Revenue,unique_key_1...
#> 
#> $data2_nomatch
#>      Name country state_code SIC_code earnings unique_key_2  id_2 Country
#>    <char>  <char>     <char>    <num>   <char>        <int> <int>  <char>
#> 1:    CVS                 RI     1122  180,000            7     7        
#> 2:     GM                 MI     2222  170,000            8     8        
#>    Company
#>     <char>
#> 1:     CVS
#> 2:      GM
#> 
#> $match_evaluation
#> Index: <tier>
#>      tier matches in_tier_unique_1 in_tier_unique_2 pct_matched_1 pct_matched_2
#>    <char>   <int>            <int>            <int>         <num>         <num>
#> 1:    all      10               10                8             1           0.8
#>    new_unique_1 new_unique_2
#>           <int>        <int>
#> 1:           10            8

Let’s go through the arguments:

The first 7 arguments are identical to those in merge_plus, except ‘by’ can contain multiple elements, unlike with merge_plus.
‘compare_type’ is a character vector, equal to the length of the ‘by’ vector, that describes how to compare each field. The options are “in”, “indicator”, “difference”, “ratio”, and “stringdist”. Each of the comparison metrics are scaled such that they are also in between 0 and 1.
- “indicator” is a binary indicator for if the two fields match exactly.
- “in” is a binary indicator for one string being within the other (e.g. ‘abc corp’ and ‘abc’ would get a score of 1 )
- “difference” computes the numeric difference between the two columns, scaled by the largest difference in the dataset
- “ratio” computes the ratio between the two columns, always such that the ratio is less than 1
- “stringdist” computes the Jaro-Winkler string distance between the strings.
- “wgt_jaccard_dist” computes the weighted jaccard distance between the strings (see fuzzy matching vignette for details on how this is computed.)
‘wgts’ is a vector of weights, length equal to the length of the ‘by’ vector, that describes how to weight each field for computing a multivar_score.
‘nthread’ is how many threads to use when making the comparisons, see parallel::makeCluster()
‘logit’ is a logit model (described in detail later) that can be used instead of ‘wgts’ to combine the comparison scores into a final multivar_score using base::predict()

Return value

multivar_match returns a data.table with the columns from both data_1 and data_2, just like base::merge(). In addition, it returns three columns for each variable in by: two for the original columns in the data, and one column with the suffix ‘compare’ that has the numerical value for the given comparison. For example:

print(result$matches[, .(Company_1, Company_2, Company_compare)])
#>              Company_1          Company_2 Company_compare
#>                 <char>             <char>           <num>
#>  1:            Walmart            Walmart       1.0000000
#>  2:   Bershire Hataway  Bershire Hathaway       0.9882353
#>  3:              Apple     Apple Computer       0.8714286
#>  4:       Exxon Mobile Exxon Mobile Inc.        0.9333333
#>  5:          McKesson      McKesson Corp.       0.9285714
#>  6: UnitedHealth Group UnitedHealth Group       1.0000000
#>  7:         CVS Health UnitedHealth Group       0.5333333
#>  8:               AT&T             AT & T       0.9111111
#>  9:     General Motors         Ford Motor       0.7333333
#> 10: Ford Motor Company         Ford Motor       0.9111111
print(result$matches[, .(Country_1, Country_2, Country_compare)])
#>     Country_1 Country_2 Country_compare
#>        <char>    <char>           <num>
#>  1:       USA       USA               1
#>  2:       USA       USA               1
#>  3:       USA       USA               1
#>  4:       USA       USA               1
#>  5:   Germany                         0
#>  6:       USA       USA               1
#>  7:       USA       USA               1
#>  8:       USA       USA               1
#>  9:       USA       USA               1
#> 10:       USA       USA               1

Further, it adds one additional column for the overall multivar_score. In our example, we set the weights each equal to 0.5, so the multivar_score is simply the average of our two comparison variables:


print(result$matches[, .(Company_compare, Country_compare, multivar_score)])
#>     Company_compare Country_compare multivar_score
#>               <num>           <num>          <num>
#>  1:       1.0000000               1      1.0000000
#>  2:       0.9882353               1      0.9941176
#>  3:       0.8714286               1      0.9357143
#>  4:       0.9333333               1      0.9666667
#>  5:       0.9285714               0      0.4642857
#>  6:       1.0000000               1      1.0000000
#>  7:       0.5333333               1      0.7666667
#>  8:       0.9111111               1      0.9555556
#>  9:       0.7333333               1      0.8666667
#> 10:       0.9111111               1      0.9555556

Calculating weights from a training set

If you’re not sure what the weights for each variable should be, you can use the function fedmatch::calculate_weights() in conjunction with a validated match set to get an estimate. This uses the methodology from Felligi and Sunter’s seminal paper A Theory for Record Linkage. You can find more information on the Record Linkage Wikipedia page. Essentially, fedmatch looks for which variables are most likely to match when two records match, and uses this criteria to weight the variables. Here is an example of how this would work:

set.seed(111)
fake_result_table <- data.table::data.table(
    Company_1 = c("ABC Corp", "XYZ Corporation", "Apple Corp", "Banana Corp"),
    Company_2 = c("ABC Corporation", "XYZ Inc", "Apple Incorporated", "Banana Stand"),
    Country_1 = c("USA", "USA", "TUR", "USA"),
    Country_2 = c("MEX", "USA", "TUR", "USA")
  )
calculated_weights <- calculate_weights(fake_result_table, c("Company", "Country"),
                                        compare_type = c("stringdist", "indicator"),
                                        suffixes = c("_1", "_2"))
calculated_weights$w
#>   Company   Country 
#> 0.7349053 0.2650947

In this dummy example, the names are giving us more information about the match quality than the countries, and so the weights are correspondingly higher. If you’d like more details on the exact methodology, see the manual documentation for calculate_weights.

Using a logit model instead of a linear sum

Rather than using a linear combination of the comparison variables, we can instead train a logit model on a training set of matches and non-matches. The explanatory variables are each of the comparison variables, and the dependent variable is a binary 1/0 for if the two records are a match or not. This data set needs to be constructed by hand, and is similar to the type of data set used with calculate_weights. However, it must contain both matches and non-matches for the logit model to be well-trained. Here is an example of how this would look:

set.seed(111)
corp_data1_test <- copy(corp_data1)
data.table::setDT(corp_data1_test)
corp_data2_test <- copy(corp_data2)
data.table::setDT(corp_data2_test)
corp_data1_test[, id_1 := seq(1, .N)]
corp_data2_test[, id_2 := seq(1, .N)]

corp_data2_test[, Country := country]
corp_data2_test[, Company := Name]
set.seed(111)
fake_result_table <- data.table::data.table(
  match = sample(c(1, 0, 1), 1e5, replace = TRUE),
  Company_compare = runif(1e5),
  Country_compare = sample(c(1, 0), 1e5, replace = TRUE)
)

logit_model <- glm(match ~ Company_compare + Country_compare,
  family = "binomial",
  data = fake_result_table
)

summary(logit_model)
#> 
#> Call:
#> glm(formula = match ~ Company_compare + Country_compare, family = "binomial", 
#>     data = fake_result_table)
#> 
#> Coefficients:
#>                 Estimate Std. Error z value Pr(>|z|)    
#> (Intercept)      0.69243    0.01502  46.097   <2e-16 ***
#> Company_compare -0.02710    0.02324  -1.166   0.2436    
#> Country_compare  0.03155    0.01342   2.351   0.0187 *  
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> (Dispersion parameter for binomial family taken to be 1)
#> 
#>     Null deviance: 127258  on 99999  degrees of freedom
#> Residual deviance: 127251  on 99997  degrees of freedom
#> AIC: 127257
#> 
#> Number of Fisher Scoring iterations: 4

result <- merge_plus(corp_data1_test, corp_data2_test,
                     match_type = "multivar",
                     multivar_settings = build_multivar_settings(logit = logit_model, compare_type = c("indicator", "stringdist"),
  wgts = NULL, nthread = 1),
  by = c("Country", "Company"), unique_key_1 = "id_1",
  unique_key_2 = "id_2", 
  suffixes = c("_1", "_2")
)
result
#> $matches
#> Key: <id_2>
#> Index: <tier>
#>     unique_key_1          Company_1 Country_1  State   SIC Revenue  id_1
#>            <int>             <char>    <char> <char> <num>   <num> <int>
#>  1:            4       Exxon Mobile       USA     TX  2222     205     4
#>  2:            5          McKesson    Germany     MA   222     192     5
#>  3:            6 UnitedHealth Group       USA     MA    NA     184     6
#>  4:            9               AT&T       USA     TN  4000     163     9
#>  5:            2   Bershire Hataway       USA         2222     223     2
#>  6:            7         CVS Health       USA     RI  1112     177     7
#>  7:           10 Ford Motor Company       USA     MI    NA     151    10
#>  8:            1            Walmart       USA     OH  3300     485     1
#>  9:            8     General Motors       USA     MI  2222     166     8
#> 10:            3              Apple       USA     CA  3384     215     3
#>                   Name country state_code SIC_code earnings unique_key_2  id_2
#>                 <char>  <char>     <char>    <num>   <char>        <int> <int>
#>  1:            Walmart     USA         OH     3380  490,000            1     1
#>  2:            Walmart     USA         OH     3380  490,000            1     1
#>  3:            Walmart     USA         OH     3380  490,000            1     1
#>  4:            Walmart     USA         OH     3380  490,000            1     1
#>  5: Exxon Mobile Inc.      USA         TX     2222  210,000            4     4
#>  6: Exxon Mobile Inc.      USA         TX     2222  210,000            4     4
#>  7: UnitedHealth Group     USA         MA     1130  180,000            6     6
#>  8:             AT & T     USA         TN     4000  160,000            9     9
#>  9:             AT & T     USA         TN     4000  160,000            9     9
#> 10:         Ford Motor     USA         MI     2222  150,000           10    10
#>     Country_2          Company_2 Country_compare Company_compare multivar_score
#>        <char>             <char>           <num>           <num>          <num>
#>  1:       USA            Walmart               1       0.0000000      0.6734819
#>  2:       USA            Walmart               0       0.0000000      0.6665064
#>  3:       USA            Walmart               1       0.3650794      0.6713026
#>  4:       USA            Walmart               1       0.0000000      0.6734819
#>  5:       USA Exxon Mobile Inc.                1       0.2847222      0.6717829
#>  6:       USA Exxon Mobile Inc.                1       0.3777778      0.6712266
#>  7:       USA UnitedHealth Group               1       0.4497354      0.6707962
#>  8:       USA             AT & T               1       0.0000000      0.6734819
#>  9:       USA             AT & T               1       0.4126984      0.6710178
#> 10:       USA         Ford Motor               1       0.0000000      0.6734819
#>       tier
#>     <char>
#>  1:    all
#>  2:    all
#>  3:    all
#>  4:    all
#>  5:    all
#>  6:    all
#>  7:    all
#>  8:    all
#>  9:    all
#> 10:    all
#> 
#> $matches_filter
#> Null data.table (0 rows and 0 cols)
#> 
#> $data1_nomatch
#> Empty data.table (0 rows and 7 cols): Company,Country,State,SIC,Revenue,unique_key_1...
#> 
#> $data2_nomatch
#>                 Name country state_code SIC_code earnings unique_key_2  id_2
#>               <char>  <char>     <char>    <num>   <char>        <int> <int>
#> 1: Bershire Hathaway     USA         NE     2220  220,000            2     2
#> 2:    Apple Computer     USA         CA       NA  220,000            3     3
#> 3:    McKesson Corp.                 MA     2222  190,000            5     5
#> 4:               CVS                 RI     1122  180,000            7     7
#> 5:                GM                 MI     2222  170,000            8     8
#>    Country           Company
#>     <char>            <char>
#> 1:     USA Bershire Hathaway
#> 2:     USA    Apple Computer
#> 3:            McKesson Corp.
#> 4:                       CVS
#> 5:                        GM
#> 
#> $match_evaluation
#> Index: <tier>
#>      tier matches in_tier_unique_1 in_tier_unique_2 pct_matched_1 pct_matched_2
#>    <char>   <int>            <int>            <int>         <num>         <num>
#> 1:    all      10               10                5             1           0.5
#>    new_unique_1 new_unique_2
#>           <int>        <int>
#> 1:           10            5

Instead of using a weighted sum of comparison scores, we use base::predict to arrive at a multivar_score that can be interpreted as a match probability. A word of caution: the logit parameters can give very non-intuitive results for the resultant multivar_scores, because the logit function is of course non-linear. For example, in one project we were working on, there was a coefficient of 40 and a coefficient of 3. However, if the variable with the coefficient of ‘3’ had a comparison score of 0, then the highest the multivar_score could be was only around 0.6! This was very un-intuitive for us, and shows the complex nature of the logit model.

All that said, the logit model has the benefit of returning an easily-interpretable multivar_score: with a large enough (and clean enough) training set, the fitted multivar_score is simply the probability of a match. This is contrast to the multivar_scores returned by the linear combination: these numbers have no immediate meaning, and can vary dramatically for each project. In one project, a score of .6 might be very good, and in another, very bad. It is a good idea to try both the logit and linear methods and play around with each for your individual project.

Parallelization and saving time

Because this method of matching relies on comparing each row of a data set to each other row of a dataset, the time needed to perform a match grows very quickly as the size of each data set increases. There are several ways to help with this:

Cut down on the number of comparisons that need to be done by grouping on one or more variables (say, countries) and only making comparisons within those groups, or
Use many cores to split the work up and run the comparisons in parallel.

Here’s how the blocking works:

corp_data1_test <- copy(corp_data1)
data.table::setDT(corp_data1_test)
corp_data2_test <- copy(corp_data2)
data.table::setDT(corp_data2_test)
corp_data1_test[, id_1 := seq(1, .N)]
corp_data2_test[, id_2 := seq(1, .N)]

corp_data2_test[, Country := country]
corp_data2_test[, Company := Name]
result <- merge_plus(
  data1 = corp_data1_test,
  data2 = corp_data2_test,
  match_type = "multivar",
  by = c("Company"),
   suffixes = c("_1", "_2"),
  unique_key_1 = "id_1",
  unique_key_2 = "id_2", 
  multivar_settings = build_multivar_settings(
  compare_type = c( "stringdist"),
  wgts = c(1), nthread = 1, blocks = "Country"
 
))
result$matches
#> Key: <id_2>
#> Index: <tier>
#>    unique_key_1 Country          Company_1  State   SIC Revenue  id_1
#>           <int>  <char>             <char> <char> <num>   <num> <int>
#> 1:            1     USA            Walmart     OH  3300     485     1
#> 2:            2     USA   Bershire Hataway         2222     223     2
#> 3:            3     USA              Apple     CA  3384     215     3
#> 4:            4     USA       Exxon Mobile     TX  2222     205     4
#> 5:            6     USA UnitedHealth Group     MA    NA     184     6
#> 6:            7     USA         CVS Health     RI  1112     177     7
#> 7:            9     USA               AT&T     TN  4000     163     9
#> 8:            8     USA     General Motors     MI  2222     166     8
#> 9:           10     USA Ford Motor Company     MI    NA     151    10
#>                  Name country state_code SIC_code earnings unique_key_2  id_2
#>                <char>  <char>     <char>    <num>   <char>        <int> <int>
#> 1:            Walmart     USA         OH     3380  490,000            1     1
#> 2:  Bershire Hathaway     USA         NE     2220  220,000            2     2
#> 3:     Apple Computer     USA         CA       NA  220,000            3     3
#> 4: Exxon Mobile Inc.      USA         TX     2222  210,000            4     4
#> 5: UnitedHealth Group     USA         MA     1130  180,000            6     6
#> 6: UnitedHealth Group     USA         MA     1130  180,000            6     6
#> 7:             AT & T     USA         TN     4000  160,000            9     9
#> 8:         Ford Motor     USA         MI     2222  150,000           10    10
#> 9:         Ford Motor     USA         MI     2222  150,000           10    10
#>             Company_2 Company_compare multivar_score   tier
#>                <char>           <num>          <num> <char>
#> 1:            Walmart       1.0000000      1.0000000    all
#> 2:  Bershire Hathaway       0.9882353      0.9882353    all
#> 3:     Apple Computer       0.8714286      0.8714286    all
#> 4: Exxon Mobile Inc.        0.9333333      0.9333333    all
#> 5: UnitedHealth Group       1.0000000      1.0000000    all
#> 6: UnitedHealth Group       0.5333333      0.5333333    all
#> 7:             AT & T       0.9111111      0.9111111    all
#> 8:         Ford Motor       0.7333333      0.7333333    all
#> 9:         Ford Motor       0.9111111      0.9111111    all

For our dummy example data, this isn’t that useful (we just drop one observation with the Country being Germany). But, if there is a field in the data that is well-populated and trusted, it can be used to dramatically cut down on the number of observations.

Parallelization is implemented with the nthread argument and parallel::parLapply(). The data is split into groups equal to nthread, and then each core does its comparison, all at the same time. This of course can dramatically speed up the process, but requires more memory and computing power. For details on the parallel process, see parallel::parLapply and parallel::makeCluster.

Multivar Matching

Chris Webster

Background

Syntax

Return value

Calculating weights from a training set

Using a logit model instead of a linear sum

Parallelization and saving time