Skip to content

Scripting Parallel Jobs on Aster in R

Gregory Kanevsky edited this page Apr 11, 2016 · 7 revisions

Hardly a surprise to anyone Aster Database runs each SQL, SQL-MR, and SQL-GR command in parallel on many clusters across distributed data. But when faced with the task of running several independent jobs at once we have to do extra work to make them run in parallel. For example, cross-validation of linear regression or other model is divided into independent jobs (of total K as in case of K-fold cross-validation). With little help from R and toaster they can execute in parallel in Aster leading to considerable performance gains.

The Problem

Cross-validation is important technique in machine learning receiving its own chapters in the textbooks (e.g. see Chapter 7 here). The implementation of K-fold cross-validation that will be given is neither exhaustive nor exemplary as it introduces certain bias (based on a calendar month) into the models. But this approach could definitely lead to a general solution for cross-validation and other problems involving execution of many similar independent tasks in Aster database. In fact, almost the same approach could be extended to other databases if they offer similar support for R.

Further more, this example is concerned only with the step in K-fold cross-validation that creates K models on overlapping but different partitions of the training data. We will show how to construct K independent linear regression models in Aster in parallel, each for one of the K partitions of the table (not the same as Aster database table partitions). But the same approach could be extended to other steps of cross-validation technique.

Data and R Packages

We will use Dallas Open data set (download files include load script for Aster and instructions). Aster database should be available via ODBC and Teradata Aster ODBC driver should be installed.

Make sure you have these packages installed with its dependencies and establish connection to Aster database:

install.packages("foreach", "parallel", "iterators", "doParallel", "toaster")

library(toaster)
conn = odbcConnect("PreSalesCluster1-dallas")

Dataset, Model and K Folds

Dallas Open Data has information on building permits across city of Dallas for the period between January 2012 through May 2014 stored in the table dallasbuildingpermits. We can quickly analyze this table from R with toaster and see its numerical columns:

dallasPermitsTableInfo = getTableSummary(conn, "dallasbuildingpermits")

getNumericColumns(dallasPermitsTableInfo)

[1] "area" "value" "lon" "lat"

These 4 fields will make up our simple linear model to determine the value of construction using its area and location. And now the same in R terms:

value ~ area + lon + lat

This problem is not beyond R memory limits but our goal is to execute linear regression in Aster. We enlist toaster's computeLm function that returns R lm object:

fit.all = computeLm(conn, "dallasbuildingpermits", value ~ area + lon + lat, tableInfo = dallasPermitsTableInfo)

Lastly, we need to define the folds (partitions) on the table to build linear regression model on each of them. Usually, this step performs equal and random division into partitions. Doing this with R and Aster is actually not extremely difficult but will take us beyond the scope of the main topic. For this reason alone we propose quick and dirty method of dividing building permits into 12 partitions (K=12) using issue date's month value (in SQL):

SELECT DATE_PART('month',issued)::int fold, COUNT(*) fold_size 
  FROM public.dallasbuildingpermits 
 GROUP BY 1 
 ORDER BY 1 

Again, do not replicate this method in real cross-validation task but use it as a template or a prototype only.

To make each fold's compliment (used to train 12 models later) we simply exclude each month's data, e.g. selecting the compliment to the fold 6 in its entirety (in SQL):

SELECT COUNT(*) 
  FROM public.dallasbuildingpermits 
 WHERE DATE_PART('month',issued)::int != 6

Computing Cross-Validation Models in Aster with R

Before we get to parallel execution with R we show how to script in R Aster cross-validation of linear regression. To begin we use standard R for loop and computeLm with the argument where that limits data to the required fold just like in SQL example above:

fit.folds.forloop = list()
for(fold in 1:12) {
  fit.folds.forloop[[fold]] = computeLm(conn, "dallasbuildingpermits", value ~ area + lon + lat, 
                                        tableInfo = dallasPermitsTableInfo,
                                        where = paste("date_part('month',issued)::int != ", fold))
}

This results in the list fit.folds that contains 12 linear regression models for each fold respectively.

Next, we replace the for loop with the specialized foreach function designed for parallel execution in R. There is no parallel execution yet but all necessary structure for transition to parallel processing:

fit.folds = foreach(fold = 1:12, .packages=c('RODBC','toaster')) %do% {
  fitted.model = computeLm(conn, "dallasbuildingpermits", value ~ area + lon + lat, tableInfo = dallasPermitsTableInfo,
                    where = paste("date_part('month',issued)::int != ", fold))
}

foreach performs the same iterations from 1 to 12 as for loop and combines results into list by default.

Parallel Computing in Aster with R

Finally, we are ready to enable parallel execution in R. For more details on using package doParallel see here, but the following suffices to enable a parallel backend in R on Windows:

library(doParallel)

cl = makeCluster(6)
registerDoParallel(cl, cores=6)

After that foreach with operator %dopar% automatically recognizes parallel backend cl and runs its iterations in parallel:

fit.folds = foreach(fold = 1:12, .packages=c('RODBC','toaster')) %dopar% {
  parConn = odbcReConnect(conn)
  fitted.model = computeLm(parConn, "dallasbuildingpermits", value ~ area + lon + lat, tableInfo = dallasPermitsTableInfo,
                    where = paste("date_part('month',issued)::int != ", fold))
  close(parConn)
  fitted.model
}

Comparing with foreach %do% earlier notice extra handling for ODBC connection inside foreach %dopar%. This is necessary due to inability of sharing the same database connection between parallel processes (or threads, depending on the backend implementation). Effectively, with each loop we reconnect to Aster with a brand new connection by reusing original connection's configuration in function odbcReConnect.

Elapsed Time

The whole thing was worth the effort as seen from the charts below. The elapsed time (in seconds) for all 3 types of loops: for loop in R, foreach %do% (sequential), and foreach %dopar% (parallel):

Dallas permits K-fold CV Loop Times

And the elapsed time when using 2, 4, 6, and 8 core backends in R:

Dallas permits K-fold CV Loop Times

Clone this wiki locally