Connecting R Scripts to SQL Azure for Data Discovery and Profiling

I had a client a few years ago who was using R (R-Project) for hard-core statistics and data science work. I immediately fell in love with the product and the open-source community that surrounds it. The programming-style interface and functionality appealed to my developer roots, while the breadth of visualizations and community resources won me over.

I added R to my toolkit and have had great success using it on many projects, not only for discovery but also for visualizations. Since R is open source, I have to give back, and this is the first of what I hope are many posts on how I use R in my regular routine against Microsoft-based products.

Using R with SQL Azure

Working with SQL Azure is becoming more common, and helping integrate these new technologies into a client’s existing data analysis architecture and workflow is more important than ever.

On a recent project, I was provided with a series of tables with thousands of columns to use in a data visualization project. The first thing I usually do is data discovery and profiling to determine what I am starting with. R provides many packages and functions that let you perform data discovery over a data set. I wanted to use those functions to produce a report I could share with the client. The following section describes the script provided on GitHub and the steps I used to complete the analysis.

Note: I wrote this article on a Windows laptop and found that when I tested the solution on a Mac, some drivers needed to be installed, and Xcode blocked the installation of the R library. The steps to solve those issues are included at the end of this article.

Connect to SQL

The first step was to find an R package or library to connect to SQL Azure. RODBC provides that functionality and many others that work well against SQL Azure.

The code segment below shows the format and populates the connection variables. I prefer to use variables as I may want to reuse the values later in scripts. The Encrypt=yes setting is very important for SQL Azure: your connection may originate outside your network, and it may be refused without it.

library(RODBC)               # provides the ODBC connection functions

sqlServer   <- ""            # Enter Azure SQL Server
sqlDatabase <- ""            # Enter Database Name
sqlUser     <- "SQL_UID"     # Enter the SQL User ID
sqlPassword <- "Password"    # Enter the User Password
sqlDriver   <- "SQL Server"  # Leave this Driver Entry

connectionStringSQL <- paste0(
  "Driver=", sqlDriver,
  ";Server=", sqlServer,
  ";Database=", sqlDatabase,
  ";Uid=", sqlUser,
  ";Pwd=", sqlPassword,
  ";Encrypt=yes")

Populate the SQL Query

The query takes the form of a text string. Check the RODBC documentation for other formats and features for interacting with SQL using the driver.

sqlQuery <- "SELECT * from [dbo].[CustomerSurvey]"
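As an aside, RODBC can also pull an entire table without a hand-written SQL string via sqlFetch. This helper is only a sketch; the function name is mine, and the default table name is the one from the query above.

```r
# Sketch only: assumes an open RODBC connection (created in the next section).
# sqlFetch() reads a whole table into a data frame without a SQL string.
fetch_whole_table <- function(conn, table = "CustomerSurvey") {
  RODBC::sqlFetch(conn, table)
}
```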

Open, Send, and Close the Connection

This code will open the connection, run the query, return the data into a data frame, and then close the connection. It is important to close the connection; it is SQL, you know.

conn <- odbcDriverConnect(connectionStringSQL)
sqlDataFrame <- sqlQuery(conn, sqlQuery)
odbcClose(conn)   # release the session on the server
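If the query throws an error, the connection can be left open. A minimal sketch of a safer pattern, assuming the same RODBC calls, wraps the work in a function (the name run_query is mine) and uses on.exit so the close always runs:

```r
# Sketch: on.exit() guarantees odbcClose() runs even if sqlQuery() errors.
run_query <- function(connectionString, query) {
  conn <- RODBC::odbcDriverConnect(connectionString)
  on.exit(RODBC::odbcClose(conn))
  RODBC::sqlQuery(conn, query)
}
```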

Write Results to a CSV File

The following code will write the results of the R functions to a space-delimited text (CSV) file. For more detail, check out the write.table documentation linked at the end of this article.

Write the Summary Stats

This will write the output of the summary function to the file name specified. The default location is the working directory; however, you can add a path to save it elsewhere.

write.table(summary(sqlDataFrame), file = "SummaryCustomers.csv", append = FALSE,
            quote = TRUE, sep = " ", eol = "\n", na = "NA", dec = ".",
            row.names = TRUE, col.names = TRUE, qmethod = "escape",
            fileEncoding = "")
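The output of summary() is character-formatted, which can be awkward to work with once it lands in Excel. As an alternative sketch, a small helper (the function name, columns, and sample data below are mine, not from the original script) can build a numeric profile of the data frame that exports more cleanly:

```r
# Sketch: build a numeric summary table for the numeric columns of a data frame.
numeric_profile <- function(df) {
  nums <- df[sapply(df, is.numeric)]
  data.frame(
    column = names(nums),
    min    = sapply(nums, min,  na.rm = TRUE),
    mean   = sapply(nums, mean, na.rm = TRUE),
    max    = sapply(nums, max,  na.rm = TRUE),
    nas    = sapply(nums, function(x) sum(is.na(x))),
    row.names = NULL
  )
}

# Illustrative data in place of sqlDataFrame; non-numeric columns are skipped.
profile <- numeric_profile(data.frame(score = c(1, 2, NA, 4),
                                      label = c("a", "b", "c", "d")))
write.table(profile, file = "SummaryCustomers_Profile.csv", sep = " ",
            quote = TRUE, row.names = FALSE, col.names = TRUE)
```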

How Many NA Values are in Each Column?

This will create a CSV file showing the number of “NA” or missing values in each column.

write.table(colSums(is.na(sqlDataFrame)), file = "SummaryCustomers_NA.csv", append = FALSE,
            quote = TRUE, sep = " ", eol = "\n", na = "NA", dec = ".",
            row.names = TRUE, col.names = TRUE, qmethod = "escape",
            fileEncoding = "")

How Many ‘0’ Values are in Each Column?

Depending on the types of columns you have, you may want to know how many zero values are in each column. The following call counts them and exports the file.

write.table(colSums(sqlDataFrame == 0, na.rm = TRUE), file = "SummaryCustomers_0.csv", append = FALSE,
            quote = TRUE, sep = " ", eol = "\n", na = "NA", dec = ".",
            row.names = TRUE, col.names = TRUE, qmethod = "escape",
            fileEncoding = "")
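The NA and zero counts above can also be combined into a single table so the client gets one file instead of two. This is a sketch using a small illustrative data frame in place of sqlDataFrame:

```r
# Illustrative data frame standing in for sqlDataFrame.
sample_df <- data.frame(a = c(0, 1, NA, 3), b = c(0, 0, 5, NA))

# One row per column: count of NA values and count of zero values.
counts <- data.frame(
  na_count   = colSums(is.na(sample_df)),
  zero_count = colSums(sample_df == 0, na.rm = TRUE)
)

write.table(counts, file = "SummaryCustomers_Counts.csv", sep = " ",
            quote = TRUE, row.names = TRUE, col.names = TRUE)
```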

Importing Files into Excel

Now that you have the CSV files, the first step is to open them in Excel and insert a table to make them more presentable. When you open a CSV file, Excel will either split the columns automatically or, frequently, place all the text in the first column.

  1. Highlight the First Column.
  2. Under the DATA menu item, select the Text to Columns option.
  3. The Wizard pops up; select Delimited.
  4. Select Next.

The second step in the wizard allows you to specify how the columns in the file are separated. In this example they are space delimited, and the text columns are quoted. I use this combination in case there are commas in the text fields.

  1. Select Space
  2. View the Preview pane to ensure that columns are split the way you expect.
  3. Select Next.

The third step in the wizard lets you specify the data type of each column. For this example, you do not need to, but the option is there if you need it.

Note: Always specify the data types of your data as early in the process as you can. This allows other processes in your workflow and the front-end tools to treat the data correctly, which saves steps for other users down the line.

  1. Select Finish to import the data.

Now that we have the file in Excel as columns, we can convert the data to a Table. I recommend this because tables give you extra functionality for interacting with the data in Excel.

  1. Select the top row of the first column.
  2. On the Menu Bar, Select Insert
  3. Select the Table option.
  4. Validate that the range is correct.
  5. Select the My table has headers checkbox.
  6. Hit OK.

We now have the final table in Excel.

Save the file as Excel

Your final step is to save the file as an Excel file. If you do not save it as an Excel type, the file will remain a CSV and lose the table formatting.


Through this example, we have taken a SQL table and generated an Excel workbook with the initial data discovery results for further analysis. There are many ways to use R in your data visualization solution, and I hope this article gives you a flavor of what is possible. The following resources helped me get everything working.

R – The R Project for Statistical Computing

R Studio – RStudio – Open source and enterprise-ready professional software for R

Windows MAC SQL ODBC Driver

RODBC Library Documentation – RODBC

Write.Table – write.table {utils} R Documentation

For Mac Instructions: Installing the Microsoft ODBC Driver 13.1 for SQL Server

Code – Article Code GitHub Repository

Appendix – Working on a Mac

I wrote this post on a Windows-based laptop. When I ran the scripts on a Mac, there were a couple of issues. I have included some detail and resources to fix these.

Warning: These steps worked for me in my environment, and I have included links to the resources that helped me work around the errors. I do not guarantee that these steps will work for you, as every environment is different. Reader beware; use at your own risk.

Xcode stopped the C Compiler from installing the RODBC package in RStudio

When I ran install.packages("RODBC") in RStudio, I got an error that would not allow the GCC compiler to build the package. Xcode updates from time to time, and you have to accept some agreements in the terminal. I went to a Terminal prompt on the Mac and ran sudo gcc, which brought up a prompt where I had to agree to the GCC license.

ODBC Headers

Now that GCC was installed, I had an error about ODBC headers. Again after a quick search, I had to go to a Terminal prompt and run the following command:

brew install unixodbc

The ODBC headers were then available, which led to the final issue.

Windows ODBC Driver for Mac

The last issue was that the SQL driver was installed on Windows but not on the Mac because … Windows. Not a big surprise, but it took a couple of minutes to figure out. Rather than list them here, the driver and instructions that worked for me are on the Microsoft Docs site: Installing the Microsoft ODBC Driver 13.1 for SQL Server
