Databricks and H2O Make it Rain with Sparkling Water

Databricks provides a cloud-based integrated workspace on top of Apache Spark for developers and data scientists. H2O.ai has been an early adopter of Apache Spark and has developed Sparkling Water to integrate H2O.ai’s machine learning library with Spark. In this blog, we will demonstrate an integration between the Databricks platform and H2O.ai’s Sparkling Water that provides Databricks users with an additional set of machine learning libraries. The integration makes it easier for data scientists to use Sparkling Water with Spark in a notebook environment, combining Spark with H2O to get the best of both worlds.

The first step is to log into your Databricks account and create a new library containing Sparkling Water. You can use the Maven coordinates of the Sparkling Water package, for example: ai.h2o:sparkling-water-examples_2.10:1.5.6 (this version works with Spark 1.5).

The next step is to create a cluster. For this version of the Sparkling Water library, we will use Spark 1.5. The name of the created cluster is “HamOrSpamCluster”; keep it handy as we will need it later.

Now the environment is ready and you can create a Databricks notebook, connect it to “HamOrSpamCluster” and start building a predictive model!

The goal of the application is to write a spam detector that uses a trained model to categorize incoming messages. We need to transform these messages into vectors of numbers and then train a binomial model to predict whether a text message is SPAM or HAM.

For the transformation of a message into a vector of numbers we will use Spark MLlib string tokenization and word to vector transformers. We are going to split messages into tokens and use the TF-IDF (term frequency–inverse document frequency) technique to weight words by their importance inside the training data set.

You can also use the H2O Flow UI by clicking on the URL provided when you instantiated the H2O Context.
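To make the tokenization and TF-IDF step concrete, here is a minimal sketch in plain Python. It is not the Spark MLlib or Sparkling Water code from the notebook; it only illustrates the idea of hashing tokens into a fixed-size vector and down-weighting tokens that appear in many messages. The function names, the vector size of 1024, the smoothed IDF formula, and the sample messages are all assumptions made for illustration.

```python
import math
import re
import zlib
from collections import Counter

NUM_FEATURES = 1024  # hypothetical vector size, chosen only for this sketch


def tokenize(message):
    """Lowercase a message and split it into runs of letters, digits, or apostrophes."""
    return re.findall(r"[a-z0-9']+", message.lower())


def term_frequencies(tokens, num_features=NUM_FEATURES):
    """Hashing trick: map each token to a bucket index and count occurrences."""
    counts = Counter()
    for token in tokens:
        bucket = zlib.crc32(token.encode("utf-8")) % num_features
        counts[bucket] += 1
    return dict(counts)


def inverse_document_frequencies(tf_vectors):
    """Smoothed IDF per bucket: log((docs + 1) / (docs containing bucket + 1))."""
    n_docs = len(tf_vectors)
    doc_freq = Counter()
    for tf in tf_vectors:
        doc_freq.update(tf.keys())
    return {b: math.log((n_docs + 1) / (df + 1)) for b, df in doc_freq.items()}


def tf_idf(tf, idf):
    """Scale term counts by IDF; buckets never seen in training get weight 0."""
    return {b: count * idf.get(b, 0.0) for b, count in tf.items()}


# Hypothetical training messages, not taken from the original data set.
messages = [
    "Free prize! Call now to claim your free prize",
    "Are we still meeting for lunch today",
    "Call me when you are free",
]
tf_vectors = [term_frequencies(tokenize(m)) for m in messages]
idf = inverse_document_frequencies(tf_vectors)
vectors = [tf_idf(tf, idf) for tf in tf_vectors]
```

The resulting sparse vectors (bucket index to weight) are the kind of numeric input a binomial model can be trained on. In the notebook itself, the Spark MLlib transformers perform this work in a distributed fashion.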
The spam detection method uses the trained models to transform an incoming text message and provide a prediction: SPAM or HAM.

We’ve shown a fast and easy way to build a spam detector with Databricks and Sparkling Water. To try this out for yourself, register for a free 14-day trial of Databricks and check out the Sparkling Water example in the Databricks Guide.


Last Modified: April 18, 2016 @ 2:01 pm