Skip to content

Fuzzy Matching for PowerBI with RScripts

PowerBI has the ability to mashup data from different sources; this may introduce problems when joining or matching data between systems. I was able to get around this issue by leveraging an R script from an article in the community. As a consultant there is a rule that if you use something from the community, you need to give back and particularly if you use it twice, you have to write a blog post. This is such a post.

A data set I needed to create included the task of matching products from a dimension table in a source system with a product table originally manually entered into an MSAccess table. Each list had over 500 names with duplicates and missing values in each table. I figured there must be a way to do this without looking item by item.

After some R&D online for pattern matching functions, I found an article by Juan Bernabe, Fuzzy String Matching – a survival skill to tackle unstructured information, which fit the bill perfectly for my use case. The original article has a number of options, however, lets go through this example of how I used this first script from the post to solve my situation.

The Situation

I have two tables that contain two text fields of product names, with the column heading set to “name”.  I created two CSV files with the unique names from each table and added a record ID to each. The starting files are pictured below.  You need to pick one of the tables to be the master, which works best when you match the list with the least number of values to the larger of the two lists.

Fuzzy Logic Pattern Match PowerBI R Script

The script results will match one set to the other which will produce a numeric score as to how close the two names match. The distance between matching perfectly. If zero, the score denotes what should be an exact match, if there is a score of one, there was a little difference between the names, and so on. The results will not be perfect nor stop you from having to do a manual validation, however, brings you a long way there. I have found the example below will ignore cases and also characters that may have an accent if the names are French for example.

Script Review

The script below and source files are available on GitHub and provides links to more detailed definitions. Let’s walk through step by step.

##  ===================================================================
##               Settings & Declarations
##  =================================================================== 
  #Clean Up the R Environment when finished
  rm(list = ls()) 

  #Install the package for working with data frames
  install.packages("dplyr") 
  library(dplyr)
  
  
  ####################################################
  #                Load in the customers from  the Dim Table into a data frame
  ####################################################
  source1.devices<- read.table("c:/files/source/Product_DimTable.csv",quote = "\"",  header = TRUE, sep = ",", fill = TRUE)
  ####################################################
  #                Customer Compare From the Access Table into a data frame
  ####################################################
  source2.devices<- read.table("c:/files/source/Product_Access.csv",quote = "\"",  header = TRUE, sep = ",", fill = TRUE)
  ###################################################
  #                Make sure each of these values are a characters
  ###################################################
  source1.devices$name<-as.character(source1.devices$name)
  source2.devices$name<-as.character(source2.devices$name)
  ###################################################
  #                See Juan Bernabe's blog post for the jucy bits 🙂  
  #   This step creates a matrix with the Standard Levenshtein distance between the name fields of both sources
  ##################################################
  # 
  dist.name<-adist(source1.devices$name,source2.devices$name, partial = TRUE, ignore.case = TRUE)
  
  # We now take the pairs with the minimum distance
  min.name<-apply(dist.name, 1, min)
  
 #  Run the Match functionality
  match.s1.s2<-NULL  
  for(i in 1:nrow(dist.name))
  {
    s2.i<-match(min.name[i],dist.name[i,])
    s1.i<-i
    match.s1.s2<-rbind(data.frame(s2.i=s2.i,s1.i=s1.i,s2name=source2.devices[s2.i,]$name, s1name=source1.devices[s1.i,]$name, adist=min.name[i]),match.s1.s2)
  }
  # Inspect the results
  View(match.s1.s2)
  
  #Write the results to a csv file for examination in Excel
  write.table(match.s1.s2, file = "c:/files/source/results.csv", append = FALSE, quote = TRUE, sep = ",",
              eol = "\n", na = "NA", dec = ".", row.names = TRUE,
              col.names = TRUE, qmethod = c("escape", "double"),
              fileEncoding = "")

 

The following link provides more information on the Standard Levenshtein distance between the name fields of both sources.

The Following are the results as run. The sets are returned in one csv file with a Distance score between the compared names. The original ID columns allow you to reference the original files.

Results after running Fuzzy Matching

Result Review

Let’s review some of the results highlighted in Yellow below. Those with scores of zero match, even if the case is different. Those with a score of 1 may have a spelling error. As you get higher in the distance scored, you see that more do not match.

Fuzzy Matching in R Results

This matching does not eliminate the need to review and perhaps remove the matches and try again.  The key is that you are getting your list smaller, making it easier to visualize those remaining values that need to be compared.   The process will match, all the results, but will match unrelated items and give them a high distance score.

Next Steps

The source files are available on GitHub and the original blog post by Juan Bernabe, Fuzzy String Matching – a survival skill to tackle unstructured information, goes into greater depth on a taking this functionality further on.

 

Steve

No comments yet

Leave a Reply

You may use basic HTML in your comments. Your email address will not be published.

Subscribe to this comment feed via RSS

%d bloggers like this: