FuzzyMatching

Fuzzy Matching for PowerBI with RScripts

PowerBI can mash up data from different sources; this may introduce problems when joining or matching data between systems. I got around this issue by leveraging fuzzy matching in R with an R scripting from a community article. As a consultant, there is a rule that if you use something from the community, you need to give back; mainly, if you use it twice, you must write a blog post. This is such a post.

A data set I needed to create included matching products from a dimension table in a source system with a product table originally manually entered into an MSAccess table. Each list had over 500 names with duplicates and missing values in each table. So I figured there must be a way to do this without looking item by item.

Note: the original reference is no longer available. After some R&D online for pattern matching functions, I found an article Fuzzy Matching in R (Example) | Approximate String, Name & Text Search (statisticsglobe.com), which fit the bill perfectly for my use case. The original article has several options; however, let’s go through this example of how I used this first script from the post to solve my situation.

1. The Situation

I have two tables that contain two text fields of product names, with the column heading set to “name.”  I created two CSV files with each table’s unique names and added a record ID to each record. The starting files are pictured below.  You need to pick one of the tables to be the master, which works best when you match the list with the least number of values to the larger of the two lists.

Fuzzy Logic Pattern Match PowerBI R Script

In the original script I reference, the site is no longer active. Another reference is Fuzzy Matching in R (Example) | Approximate String, Name & Text Search (statisticsglobe.com). I continue with the original.

The script results will match one set to the other, producing a numeric score as to how close the two names match—the distance between matching perfectly. If zero, the score denotes what should be an exact match. If there is a score of one, there was a little difference between the names. The results will not be perfect nor stop you from having to do a manual validation; however, it brings you a long way there. The example below will ignore cases and characters that may have an accent if the names are French.

2. let us Review the Script

The script below and source files are available on GitHub and provide links to more detailed definitions. Let’s walk through it step by step.

[code language="r"]
##  ===================================================================
##               Settings & Declarations
##  =================================================================== 
  #Clean Up the R Environment when finished
  rm(list = ls()) 

  #Install the package for working with data frames
  install.packages("dplyr") 
  library(dplyr)
  
  
  ####################################################
  #                Load in the customers from  the Dim Table into a data frame
  ####################################################
  source1.devices<- read.table("c:/files/source/Product_DimTable.csv",quote = "\"",  header = TRUE, sep = ",", fill = TRUE)
  ####################################################
  #                Customer Compare From the Access Table into a data frame
  ####################################################
  source2.devices<- read.table("c:/files/source/Product_Access.csv",quote = "\"",  header = TRUE, sep = ",", fill = TRUE)
  ###################################################
  #                Make sure each of these values are  characters
  ###################################################
  source1.devices$name<-as.character(source1.devices$name)
  source2.devices$name<-as.character(source2.devices$name)
  ###################################################
  #                See Juan Bernabe's blog post for the juicy bits :)  
  #   This step creates a matrix with the Standard Levenshtein distance between the name fields of both sources
  ##################################################
  # 
  dist.name<-adist(source1.devices$name,source2.devices$name, partial = TRUE, ignore.case = TRUE)
  
  # We now take the pairs with the minimum distance
  min.name<-apply(dist.name, 1, min)
  
 #  Run the Match functionality
  match.s1.s2<-NULL  
  for(i in 1:nrow(dist.name))
  {
    s2.i<-match(min.name[i],dist.name[i,])
    s1.i<-i
    match.s1.s2<-rbind(data.frame(s2.i=s2.i,s1.i=s1.i,s2name=source2.devices[s2.i,]$name, s1name=source1.devices[s1.i,]$name, adist=min.name[i]),match.s1.s2)
  }
  # Inspect the results
  View(match.s1.s2)
  
  #Write the results to a csv file for examination in Excel
  write.table(match.s1.s2, file = "c:/files/source/results.csv", append = FALSE, quote = TRUE, sep = ",",
              eol = "\n", na = "NA", dec = ".", row.names = TRUE,
              col.names = TRUE, qmethod = c("escape", "double"),
              fileEncoding = "")
[/code]

The following link provides more information on the Standard Levenshtein distance between the name fields of both sources.

The following are the results as run. The sets are returned in one CSV file with a Distance score between the compared names. The original ID columns allow you to reference the original files.

Results after running Fuzzy Matching

3. Let’s Review the Results

Let’s review some of the results highlighted in Yellow below. Those with scores of zero match, even if the case is different. Those with a score of 1 may have a spelling error. As you get higher in the distance scored, you see that more do not match.

Fuzzy Matching in R Results

This matching does not eliminate the need to review and perhaps remove the matches and try again.  The key is to get your list smaller, making it easier to visualize those remaining values that need to be compared.   The process will match all the results but will match unrelated items and give them a high distance score.

4. Next Steps – Python?

My source files are available on GitHub, and the original blog post by Juan Bernabe, Fuzzy String Matching – a survival skill to tackle unstructured information, goes into greater depth on taking this functionality further on.

Now, to do an article in Python. Best Libraries for Fuzzy Matching In Python | by Moosa Ali | CodeX | Medium and Fuzzy String Matching in Python Tutorial | DataCamp

Steve


Comments

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.