Skip to content

Connecting R Scripts to SQL Azure for Data Discovery and Profiling

I had a client a few years ago that was using R (R-Project), for really hard-core statistics and data science work. I immediately fell in love with the product and the open source community that surrounds it. The programming style interface and functionality really appealed to my developer roots while the breadth of visualizations and community resources won me over.

I added R to my toolkit and have had great success using it on many projects, for not only discovery but also for visualizations.  Seeing as R is open source, I have to give back, and this is the first of what I hope is many posts on how I use R in my regular routine against Microsoft based products.

Using R with SQL Azure

Working with SQL Azure is becoming more common and helping integrate these new technologies into a client’s existing data analysis architecture, and workflow is more important than ever.

On a recent project, I was provided with a series of tables, with thousands of columns to use in a data visualization project. The first thing I usually do is data discovery and profiling to determine what I am starting with. R provides many packages and functions that let you perform data discovery over a data set. I wanted to use those functions to provide a report that I could share with the client. The following section describes the script, provided on GitHub, and the steps that I used to complete the analysis.

Note: I wrote this article using a Windows laptop and found that when I tested the solution on a Mac, there were some drivers that needed to be installed and an issue with X-Code blocking the ability to install the R Library. The steps to solve those issues are included at the end of this article.

Connect to SQL

The first step was to find an R package or library to connect to SQL Azure. RODBC provides that functionality in addition to many others that well against the SQL Azure.

The code segment below shows the format and populating the connection variable. I prefer to use variables as I may want to reuse the values later in scripts. The Encrypt=Yes is very important for SQL Azure as your connection may be outside your network and the connection may be refused without it.

library(RODBC)
sqlServer <- ".database.windows.net"  #Enter Azure SQL Server
sqlDatabase <- ""                #Enter Database Name
sqlUser <- "SQL_UID"             #Enter the SQL User ID
sqlPassword <- "Password"        #Enter the User Password
sqlDriver <- "SQL Server"        #Leave this Drive Entry

connectionStringSQL <- paste0(
  "Driver=", sqlDriver, 
  ";Server=", sqlServer, 
  ";Database=", sqlDatabase, 
  ";Uid=", sqlUser, 
  ";Pwd=", sqlPassword,
  ";Encrypt=yes",
  ";Port=1433"))’

Populate the SQL Query

This takes the format of a text string. Check the documentation for other formats and features for interacting with SQL using the driver.

sqlQuery <- "SELECT * from [dbo].[CustomerSurvey]"

Open, Send and Close the Connection

This code will open, connect, return the data into a data frame, then close the connection. It is important to close the connection, it is SQL you know.

conn <- odbcDriverConnect(connectionStringSQL)
sqlDataFrame <- sqlQuery(conn, sqlQuery)
close(conn)

Write Results to a CSV File

The following code will write the results of the R functions to a comma or space delimited (CSV) text file. For more detail check out the write.table  has documentation here.

Write the Summary Stats

This will write the output of the Summary function to the file name specified. The default it the document folder, however, you can add a path to save it elsewhere.

write.table(summary(sqlDataFrame), file = "SummaryCustomers.csv", append = FALSE, quote = TRUE, sep = " ",
eol = "\n", na = "NA", dec = ".", row.names = TRUE,
col.names = TRUE, qmethod = c("escape", "double"),
fileEncoding = "")

How Many NA Values are in Each Column?

This will write out a CSV file showing the number of “NA” or missing values in each column.

write.table(apply(is.na(sqlDataFrame),2,sum), file = "SummaryCustomers_NA.csv", append = FALSE, quote = TRUE, sep = " ",
eol = "\n", na = "NA", dec = ".", row.names = TRUE,
col.names = TRUE, qmethod = c("escape", "double"),
fileEncoding = "")

How Many ‘0’ Values are in Each Column?

Depending on the types of columns you have, you may want to know how many zero values are in each column. The following function does that and exports the file.

write.table(colSums(df != 0), file = "SummaryCustomers_0.csv", append = FALSE, quote = TRUE, sep = ",",
eol = "\n", na = "NA", dec = ".", row.names = TRUE,
col.names = TRUE, qmethod = c("escape", "double"),
fileEncoding = "")

Importing Files into Excel

Now that you have the CSV files, the first step is opening them up in Excel and ‘Inserting’ a table will make them more presentable. When you open up a CSV file, the import will either automatically fill out the columns, or frequently, all the text will appear in the first column.

  1. Highlight the First Column.
  2. Under the DATA menu item, select the Text to Columns option.
  3. The Wizard pops up, select Delimited.
  4. Select Next

The second step in the wizard allows you to specify how the columns in the file are separated. Here, they are Space delimited in the example as the text columns have quotes. I use this setting if there may be commas in the text fields.

  1. Select Space
  2. View the Preview pane to make sure that columns are split the way you expect.
  3. Select Next

The third step in the wizard is one that you can specify what data types are in each column. For this example, you do not need to, but the option is there if you need it.

Note: Always specify the data types of your data as early in the process as you can. This will allow other processes in your workflow and the front-end tools to treat the data correctly which will save other users steps in the future

  1. Select Finish to import the data

Now that we have the file in Excel as columns, we can convert this data to a Table. I recommend this as this allows you extra functionality in interacting with the data in Excel.

  1. Select the top row of the first column.
  2. On the Menu Bar, Select Insert
  3. Select the Table option.

  1. Validate that the range is correct
  2. Select the My table has headers check box
  3. Hit OK.

We now have the final table in Excel.

Save the file as Excel

Your final step is to save the file as an Excel file. The file type will still say CSV and will lose the formatting if you do not save the file as an Excel type.

Resources

Through the example, we have taken a SQL table and generated an Excel workbook with the initial data discovery results for further analysis. There are many ways to utilize R in your data visualization solution, and I hope that this article gives you a flavor of what is possible.  The following resources helped me get everything working.

R – The R Project for Statistical Computing

R Studio – RStudio – Open source and enterprise-ready professional software for R

Windows MAC SQL ODBC Driver

RODBC Library Documentation – RODBC

Write.Table – write.table {utils} R Documentation

For Mac Instructions: Installing the Microsoft ODBC Driver 13.1 for SQL Server

Code – Article Code GitHub Repository

Appendix – Working on a Mac

I wrote this post on a Windows-based laptop. When I ran the scripts on a Mac, there were a couple of issues. I have included some detail and resources to fix these.

Warning: These steps worked for me in my environment, I have put resources in the links that helped me work around the errors.   I do not guarantee that these steps will work for you as every environment is different. Reader beware, use at own risk.

Xcode stopped the C Compiler from installing the RODBC package in RStudio

When I ran the command, install.packages(“RODBC”) in RStudio, I had an error which would not allow the GCC compiler from installing the package. X-Code updates and from time to time you have to accept some agreements in the terminal. I went to a Terminal prompt on the Mac and ran sudo gcc which brought up a prompt where I had to agree to the GCC license.

ODBC Headers

Now that GCC was installed, I had an error about ODBC headers. Again after a quick search, I had to go to a Terminal prompt and run the following command:

brew install unixodbc

The ODBC headers were then available which set up the final issue.

Windows ODBC Driver for MaC

The last issue I had was that SQL Driver was installed on Windows but not the Mac, because, Windows. Not a big surprise, but took a couple of minutes to figure this out. Rather the list it here, the driver and instructions that worked for me are on the Microsoft Docs site; Installing the Microsoft ODBC Driver 13.1 for SQL Server

No comments yet

Leave a Reply

You may use basic HTML in your comments. Your email address will not be published.

Subscribe to this comment feed via RSS

%d bloggers like this: