This repository contains the code examples for the Spark Summit EU 2017 talk and slides.
The experiments used the avazu-site.tr.bz2 dataset file.
Run any of the examples like this:
sbt "runMain me.lorand.slogreg.jobs.MyExampleJob <path_to_data> <number_of_iterations>"
The sparse-matrix optimization (the version with joins) is implemented in me.lorand.slogreg.optimize.SparseMatrixGradientDescent.
The job without a known partitioner is me.lorand.slogreg.jobs.SparseMatrixJob, and the job with a known partitioner is me.lorand.slogreg.jobs.SparseMatrixPartitionerJob.
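As a rough illustration of what a join-based gradient step can look like, the sketch below runs one logistic-regression update over a sparse matrix stored as (row, column, value) entries. It uses plain Scala collections in place of Spark RDDs and is not the code in SparseMatrixGradientDescent; `Entry`, `step`, and the 0/1 label encoding are assumptions made for the example. In Spark, each lookup and groupBy here would become a keyed join or reduce, and a join between RDDs that already share a partitioner does not need another shuffle, which is presumably what the partitioner job measures.

```scala
object JoinGradientSketch {
  // One nonzero entry of the sparse feature matrix (hypothetical layout).
  final case class Entry(row: Int, col: Int, value: Double)

  def sigmoid(z: Double): Double = 1.0 / (1.0 + math.exp(-z))

  // One gradient descent step for logistic regression.
  // Assumes labels are 0 or 1 and that every row in `entries` has a label.
  def step(entries: Seq[Entry], labels: Map[Int, Double],
           weights: Map[Int, Double], learningRate: Double): Map[Int, Double] = {
    // Stand-in for a join on the column key: look up each entry's weight,
    // then sum the products per row to get the margins.
    val margins: Map[Int, Double] = entries.groupBy(_.row).map { case (row, es) =>
      row -> es.map(e => e.value * weights.getOrElse(e.col, 0.0)).sum
    }
    // Per-row error: predicted probability minus label.
    val errors: Map[Int, Double] = labels.map { case (row, y) =>
      row -> (sigmoid(margins.getOrElse(row, 0.0)) - y)
    }
    // Stand-in for a join on the row key: multiply each entry by its row's error,
    // then sum per column to get the averaged gradient.
    val gradient: Map[Int, Double] = entries.groupBy(_.col).map { case (col, es) =>
      col -> (es.map(e => e.value * errors(e.row)).sum / labels.size)
    }
    // Only columns that occur in the data receive an update.
    weights ++ gradient.map { case (col, g) =>
      col -> (weights.getOrElse(col, 0.0) - learningRate * g)
    }
  }
}
```

Starting from empty weights, `JoinGradientSketch.step(entries, labels, Map.empty, 1.0)` returns a map holding one updated weight per column seen in `entries`.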
The version without joins is implemented in me.lorand.slogreg.optimize.GradientDescent, and the corresponding experiment is run by the job me.lorand.slogreg.jobs.GradientDescentJob.
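For comparison, a version without joins can keep the whole weight vector available to every worker and compute each example's gradient locally. The sketch below shows that idea as a map followed by a reduce over plain Scala collections. It is an assumption about the general approach, not a copy of GradientDescent; `Example`, `pointGradient`, and `add` are hypothetical names.

```scala
object MapReduceGradientSketch {
  // A labeled example with sparse features keyed by feature index (hypothetical layout).
  final case class Example(label: Double, features: Map[Int, Double])

  def sigmoid(z: Double): Double = 1.0 / (1.0 + math.exp(-z))

  // Log-loss gradient of a single example, computed against a local copy of the weights.
  def pointGradient(ex: Example, weights: Array[Double]): Map[Int, Double] = {
    val margin = ex.features.iterator.map { case (j, x) => x * weights(j) }.sum
    val error = sigmoid(margin) - ex.label
    ex.features.map { case (j, x) => j -> (error * x) }
  }

  // Element-wise sum of two sparse gradients.
  def add(a: Map[Int, Double], b: Map[Int, Double]): Map[Int, Double] =
    b.foldLeft(a) { case (acc, (j, g)) => acc.updated(j, acc.getOrElse(j, 0.0) + g) }

  // One step: map every example to its gradient, reduce by summing, then update.
  // Assumes `data` is non-empty (reduce fails on an empty collection).
  def step(data: Seq[Example], weights: Array[Double], learningRate: Double): Array[Double] = {
    val gradient = data.map(ex => pointGradient(ex, weights)).reduce(add)
    Array.tabulate(weights.length) { j =>
      weights(j) - learningRate * gradient.getOrElse(j, 0.0) / data.size
    }
  }
}
```

No join is needed because every example already carries its own features and sees the full weight array; the cost is one temporary gradient object per example.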
The aggregate-based version is implemented in me.lorand.slogreg.optimize.AggregateGradientDescent.
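The aggregate pattern suggested by the class name can be sketched as two functions: one that folds an example into a partition-local accumulator, and one that merges accumulators. Spark's RDD API offers `aggregate` and `treeAggregate` with this shape; whether AggregateGradientDescent uses either of them is an assumption. The sketch simulates partitions with nested Scala sequences, and `Example`, `seqOp`, and `combOp` are hypothetical names.

```scala
object AggregateGradientSketch {
  // Sparse example as parallel arrays of feature indices and values (hypothetical layout).
  final case class Example(label: Double, indices: Array[Int], values: Array[Double])

  def sigmoid(z: Double): Double = 1.0 / (1.0 + math.exp(-z))

  // seqOp: add one example's gradient to the accumulator in place and return it.
  def seqOp(acc: Array[Double], ex: Example, weights: Array[Double]): Array[Double] = {
    var margin = 0.0
    var i = 0
    while (i < ex.indices.length) { margin += ex.values(i) * weights(ex.indices(i)); i += 1 }
    val error = sigmoid(margin) - ex.label
    i = 0
    while (i < ex.indices.length) { acc(ex.indices(i)) += error * ex.values(i); i += 1 }
    acc
  }

  // combOp: merge the second accumulator into the first in place.
  def combOp(a: Array[Double], b: Array[Double]): Array[Double] = {
    var j = 0
    while (j < a.length) { a(j) += b(j); j += 1 }
    a
  }

  // `partitions` stands in for the partitions of a distributed dataset.
  // Returns the unnormalized sum of per-example gradients.
  def gradient(partitions: Seq[Seq[Example]], weights: Array[Double]): Array[Double] =
    partitions
      .map(p => p.foldLeft(new Array[Double](weights.length))((acc, ex) => seqOp(acc, ex, weights)))
      .foldLeft(new Array[Double](weights.length))((a, b) => combOp(a, b))
}
```

Compared with mapping each example to its own gradient object, this allocates one accumulator per partition and mutates it in place.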
The mini-batch version also uses AggregateGradientDescent, and the experiment runs in the job me.lorand.slogreg.jobs.MiniBatchJob.
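A mini-batch loop differs from full-batch gradient descent only in that each iteration sees a random subset of the data. The sketch below shows one way to write that, assuming Bernoulli sampling with a fixed seed; `sampleBatch`, `run`, and the `step` callback are hypothetical and say nothing about how MiniBatchJob draws its batches. In Spark, `RDD.sample(withReplacement, fraction, seed)` would be the natural counterpart of `sampleBatch`.

```scala
import scala.util.Random

object MiniBatchSketch {
  // Keep each example independently with probability `fraction`.
  def sampleBatch[A](data: Seq[A], fraction: Double, rng: Random): Seq[A] =
    data.filter(_ => rng.nextDouble() < fraction)

  // Run `iterations` updates, each on a freshly sampled batch.
  // `step` takes a batch and the current weights and returns the new weights.
  def run[A](data: Seq[A], init: Array[Double], iterations: Int, fraction: Double, seed: Long)
            (step: (Seq[A], Array[Double]) => Array[Double]): Array[Double] = {
    val rng = new Random(seed)
    (1 to iterations).foldLeft(init) { (weights, _) =>
      val batch = sampleBatch(data, fraction, rng)
      // An empty batch carries no gradient information, so the weights are left unchanged.
      if (batch.isEmpty) weights else step(batch, weights)
    }
  }
}
```

Any function that maps a batch and the current weights to new weights can be passed as `step`.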
me.lorand.slogreg.optimize.Adam extends AggregateGradientDescent, and the experiment is run in me.lorand.slogreg.jobs.MomentumJob.
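For reference, the standard Adam update keeps running averages of the gradient and of its square, corrects both for their zero initialization, and scales the step per coordinate. The sketch below implements that textbook rule on dense arrays; the `State` type, the `update` signature, and the default hyperparameters are assumptions, and the repository's Adam class may differ in detail.

```scala
object AdamSketch {
  // Weights plus first-moment (m) and second-moment (v) estimates and the step counter t.
  final case class State(weights: Array[Double], m: Array[Double], v: Array[Double], t: Int)

  def init(dim: Int): State =
    State(new Array[Double](dim), new Array[Double](dim), new Array[Double](dim), 0)

  // One Adam step for a given gradient; returns a new State and leaves the old one untouched.
  def update(s: State, gradient: Array[Double], learningRate: Double = 0.001,
             beta1: Double = 0.9, beta2: Double = 0.999, eps: Double = 1e-8): State = {
    val t = s.t + 1
    val n = gradient.length
    // Exponential moving averages of the gradient and the squared gradient.
    val m = Array.tabulate(n)(j => beta1 * s.m(j) + (1 - beta1) * gradient(j))
    val v = Array.tabulate(n)(j => beta2 * s.v(j) + (1 - beta2) * gradient(j) * gradient(j))
    // Bias-correction factors for the zero-initialized averages.
    val mCorr = 1 - math.pow(beta1, t)
    val vCorr = 1 - math.pow(beta2, t)
    val w = Array.tabulate(n) { j =>
      s.weights(j) - learningRate * (m(j) / mCorr) / (math.sqrt(v(j) / vCorr) + eps)
    }
    State(w, m, v, t)
  }
}
```

On the first step the bias correction makes the update roughly `learningRate` in the direction opposite to the gradient's sign, regardless of the gradient's magnitude.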
Timings were measured on an AWS EMR cluster of 5 m4.2xlarge nodes.
The initial version takes almost 4 minutes; the best version takes half a second.