R integration in SQL Server 2016



Summary: Microsoft's integration of R Services into SQL Server 2016 is a major development for the data science community. Ion, a consultant within RedPixie's Product & Data team, examines the benefits by building an extreme gradient boosting model to demonstrate the SQL and R integration.

In recent years, Microsoft has made significant progress with its Big Data offerings, with brand-new products like Azure SQL, Data Warehouse, Data Lake, HDInsight, and the Machine Learning suite steadily gaining adoption. Though Microsoft has historically managed to stay on top of the relational database game with Microsoft SQL Server, tight integration with more advanced analytical functionality has been a significant missing component.

Hot on the heels of the Revolution Analytics acquisition, Microsoft has introduced SQL Server R Services with SQL Server 2016: a collection of tools data scientists have been waiting for. It combines the scalable database engine of Microsoft SQL Server with R's wealth of advanced analytical capabilities and libraries, creating a system that can perform demanding, computationally intensive analytical calculations in-database.
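One practical prerequisite: external script execution is disabled by default, so it has to be switched on at the instance level before any in-database R will run. A minimal setup sketch using the documented configuration option (a restart of the SQL Server service may be needed before it takes effect):

execute sp_configure 'external scripts enabled', 1;
reconfigure with override;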

The benefits

By combining these two powerful tools, data scientists can stop worrying about scale, data volumes, and data transfer, and focus on the analytical work. Microsoft's Jen Underwood has some nice slides outlining the benefits.



We chose to test the capabilities of SQL Server R Services with the well-known German credit scoring dataset, which records clients who received a loan (with attributes such as income, age, and loan amount) and whether they ultimately proved creditworthy. We trained a model on part of that data and then predicted creditworthiness for the clients in a held-out test set.
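The script below reads its input from a SQL table called inputset, so the dataset has to be loaded first. A minimal sketch of what that could look like; the column list is abbreviated for illustration (the real dataset has 20 attributes plus the Creditability label) and the file path is hypothetical:

-- Illustrative schema: only a few of the 20 attribute columns are shown
create table inputset (
    Creditability    int,
    Age              int,
    CreditAmount     int,
    DurationOfCredit int
);

-- Load from a flat file (path and format options are placeholders)
bulk insert inputset
from 'C:\data\german_credit.csv'
with (fieldterminator = ',', rowterminator = '\n', firstrow = 2);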

Doing this in R Services is straightforward for anyone familiar with machine learning: we take the data, split it into a training set and a test set, build an extreme gradient boosting model on the first set, and evaluate it on the second. After installing R Services and the xgboost package on our SQL Server (packages have to be installed by an administrator), we were able to transfer the initial R code to SQL. We executed the R code by passing it as a parameter to the sp_execute_external_script procedure and inserted the result into a new table. Below you can see the code from our example; note that it is SQL Server that executes the R script contained in the string:

execute sp_execute_external_script
    @language = N'R'
  , @script = N'
# Load the xgboost library
library(xgboost)

# Split into training and test sets
train <- InputData[1:500,]
test  <- InputData[501:1000,]

# Log10-transform the predictor variables
train[, 2:ncol(train)] <- log10(train[, 2:ncol(train)])
test[, 2:ncol(test)]   <- log10(test[, 2:ncol(test)])

test_vars <- colnames(train)[2:ncol(train)]

# Convert both sets to XGBoost matrices
# (the test matrix uses the same predictor columns as the training matrix)
mat.train <- xgb.DMatrix(data  = data.matrix(train[test_vars]),
                         label = train[, "Creditability"])
mat.test  <- xgb.DMatrix(data = data.matrix(test[test_vars]))

param <- list(objective        = "binary:logistic",
              booster          = "gbtree",
              eta              = 0.002,
              max_depth        = 20,
              subsample        = 1,
              colsample_bytree = 1,
              min_child_weight = 4.285714)

# Cross-validate to gauge performance
clf <- xgb.cv(params      = param,
              data        = mat.train,
              nrounds     = 500,
              verbose     = 1,
              maximize    = TRUE,
              eval_metric = "auc",
              nfold       = 3)

# Train the final model
clf_best <- xgboost(params      = param,
                    data        = mat.train,
                    nrounds     = 400,
                    verbose     = 1,
                    maximize    = TRUE,
                    eval_metric = "auc")

# Get predictions on the test set
test$p <- round(predict(clf_best, mat.test))
OutputData <- as.data.frame(cbind(test$Creditability, test$p))
'
  , @input_data_1       = N'SELECT * FROM inputset;'
  , @input_data_1_name  = N'InputData'
  , @output_data_1_name = N'OutputData';
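Since sp_execute_external_script returns OutputData as an ordinary result set, the predictions can be persisted with an INSERT … EXECUTE over the call above. A minimal sketch of the pattern, using a hypothetical table name (dbo.CreditPredictions) and a trivial stand-in script; in practice @script would carry the R code shown above:

create table dbo.CreditPredictions (actual int, predicted int);

declare @r nvarchar(max) = N'
# Stand-in for the modelling script above; all that matters is that it assigns OutputData
OutputData <- data.frame(actual = c(1, 0), predicted = c(1, 1))
';

insert into dbo.CreditPredictions (actual, predicted)
execute sp_execute_external_script
    @language = N'R'
  , @script = @r
  , @output_data_1_name = N'OutputData';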



Of the 157 clients in the test dataset who proved not to be creditworthy, the algorithm correctly identified 145. Our model therefore has an accuracy of 92.4% when predicting non-creditworthy clients, and 97.6% overall.
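For reference, with the predictions persisted in a table like the hypothetical dbo.CreditPredictions above, these figures fall out of a simple aggregate. A sketch reusing our table and column names, and assuming Creditability = 0 marks a non-creditworthy client:

select
    -- accuracy on the non-creditworthy clients only
    100.0 * sum(case when actual = 0 and predicted = 0 then 1 else 0 end)
          / nullif(sum(case when actual = 0 then 1 else 0 end), 0) as bad_client_accuracy_pct,
    -- overall accuracy across the whole test set
    100.0 * sum(case when actual = predicted then 1 else 0 end)
          / count(*)                                               as overall_accuracy_pct
from dbo.CreditPredictions;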

We haven’t delved into the validity of this model too much as the aim is to show the SQL and R integration, not the modelling.



Conclusion

Microsoft has successfully combined two large and important industry standards into an efficient and convenient solution whose widespread adoption by the data community seems to be just a matter of time. We now only have to wait and see what extra goodies they throw in. It can only get better from here!





Written by Ion Kleopas | Data Scientist, RedPixie | See his LinkedIn Profile

