In the first two posts (1, 2) we ingested the data and explored it with the spark-shell. Now we'll move on to creating and submitting our code as a standalone Spark application. Again, all the code covered in these posts can be found here.

We'll start by creating a case class and a function for parsing the data into that class. This will help clarify the code in later operations.

As we saw in the previous posts, there were some outliers in the data. We'll use the RDD's filter function to remove them, limiting home prices to between $100k and $400k and keeping only houses over 1,000 square feet.

Now we're finally at the fun part: training the model. The model has two hyperparameters, the number of iterations and the step size. The number of iterations tells the algorithm how many passes to make over the data, adjusting the coefficients closer to an optimal model each time, and the step size determines how much to adjust the coefficients per iteration.

In summary, this post finally got down to the nuts and bolts of training a model. It walked through simple scalar transformations and filtering, then described a bit about the MLlib model we're using to predict home prices. In the next post, we'll look at the challenges of exporting the model and how offline, non-Spark jobs can utilize it.
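The parsing step described above can be sketched roughly as follows. The post doesn't reproduce the code here, so the field names and CSV layout (`price`, `sqft`, `bedrooms`) are illustrative assumptions, not the actual dataset schema:

```scala
// Hypothetical schema for illustration; the real fields come from the
// dataset introduced in the earlier posts.
case class Home(price: Double, sqft: Double, bedrooms: Int)

// Parse one CSV line into a Home. The field order is an assumption here;
// centralizing parsing in one function keeps later RDD operations readable.
def parseHome(line: String): Home = {
  val fields = line.split(",")
  Home(fields(0).toDouble, fields(1).toDouble, fields(2).toInt)
}
```

With a function like this, the raw text RDD maps cleanly into typed records, e.g. `sc.textFile(path).map(parseHome)`.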
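The filtering and training steps might look something like the sketch below, using MLlib's `LinearRegressionWithSGD` from the RDD-based API of that era. It assumes `homes` is an RDD of parsed case-class records with `price` and `sqft` fields (those names are assumptions); the filter thresholds match the ones in the post, while the concrete hyperparameter values are placeholders:

```scala
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.{LabeledPoint, LinearRegressionWithSGD}

// Drop outliers: keep prices between $100k and $400k and homes over 1000 sqft,
// as described in the post. `homes` is assumed to be the parsed RDD.
val filtered = homes.filter(h =>
  h.price >= 100000 && h.price <= 400000 && h.sqft > 1000)

// MLlib expects LabeledPoint: the label is the price we want to predict,
// and the feature vector here is just the square footage.
val points = filtered.map(h => LabeledPoint(h.price, Vectors.dense(h.sqft)))
points.cache()

// The two hyperparameters discussed above (values are illustrative):
val numIterations = 100 // how many passes to make over the data
val stepSize = 0.0001   // how far to adjust the coefficients each pass

val model = LinearRegressionWithSGD.train(points, numIterations, stepSize)
```

A step size that is too large can make SGD diverge on unscaled features like raw prices, which is one reason the value here is small; in practice it takes some tuning.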