BDT305 Transforming Big Data with Spark and Shark – AWS re:Invent 20…

The Berkeley AMPLab is developing a new open source data analysis software stack by deeply integrating machine learning and data analytics at scale (Algorithms), cloud and cluster computing (Machines), and crowdsourcing (People) to make sense of massive data. Current application efforts focus on cancer genomics, real-time traffic prediction, and collaborative analytics for mobile devices. In this talk, we present an overview of this stack and demonstrate key components: Spark and Shark.

It's All Happening On-line
• User Generated (Web, Social & Mobile)
• Every: click, ad impression, billing event, fast forward, pause, …, friend request, transaction, network message, fault, …
• Internet of Things / M2M
• Scientific Computing

Volume: Petabytes+
Variety: Unstructured
Velocity: Real-Time

Our view: More data should mean better answers
• Must balance Cost, Time, and Answer Quality

UC BERKELEY
• Algorithms: Machine Learning and Analytics
• Machines: Cloud Computing
• People: CrowdSourcing & Human Computation
• Massive and Diverse Data

Faculty:
• Alex Bayen (Mobile Sensing)
• Anthony Joseph (Sec./ Privacy)
• Ken Goldberg (Crowdsourcing)
• Randy Katz (Systems)
• *Michael Franklin (Databases)
• Dave Patterson (Systems)
• Armando Fox (Systems)
• *Ion Stoica (Systems)
• *Mike Jordan (Machine Learning)
• Scott Shenker (Networking)

Organized for Collaboration:
• Sequencing costs (150X), Big Data
• UCSF cancer researchers + UCSC cancer genetic database + AMP Lab + Intel Cluster @TCGA: 5 PB = 20 cancers x 1000 genomes
• See Dave Patterson's Talk: Thursday 3-4, BDT205
[Figure: sequencing cost in $K per genome, 2001 - 2014; axis from $100,000.0 down to $0.1]
David Patterson, "Computer Scientists May Have What It Takes to Help Cure Cancer," New York Times, 12/5/2011

Stack diagram: MLBase (Declarative Machine Learning); Hadoop MR; MPI; BlinkDB (approx QP); Graphlab; Shark (SQL) + Streaming; etc.
Stack diagram (continued): Spark Streaming; Shared RDDs (distributed memory); Mesos (cluster resource manager); HDFS
Legend: 3rd party; AMPLab (released); AMPLab (in progress)

RDD example (diagram labels: Base RDD, Transformed RDD, Action; the Driver sends tasks to the Workers and receives results; each Worker holds one of Block 1-3 and Cache 1-3):

    lines = spark.textFile("hdfs://...")             // Base RDD
    errors = lines.filter(_.startsWith("ERROR"))     // Transformed RDD
    messages = errors.map(_.split('\t')(2))
    cachedMsgs = messages.cache()

    cachedMsgs.filter(_.contains("foo")).count       // Action
    cachedMsgs.filter(_.contains("bar")).count

Result: full-text search of Wikipedia in <1 sec (vs 20 sec for on-disk data)
Result: scaled to 1 TB of data in 5-7 sec (vs 170 sec for on-disk data)

RDD lineage:

    messages = textFile(...).filter(_.contains("error"))
                            .map(_.split('\t')(2))

    HadoopRDD (path = hdfs://…) -> FilteredRDD (func = _.contains(...)) -> MappedRDD (func = _.split(…))

Gradient descent (code fragments with slide annotations):

    map readPoint, cache                                          Load data in memory once
                                                                  Initial parameter vector
    map p => (1 / (1 + exp(-p.y*(w dot p.x))) - 1) * p.y * p.x
    reduce _ + _                                                  Repeated MapReduce steps to do gradient descent

[Figure: running time (min) versus number of iterations, Hadoop vs. Spark. Hadoop: 110 s / iteration. Spark: first iteration 80 s, further iterations 1 s.]

Java API (out now):

    JavaRDD<String> lines = sc.textFile(...);
    lines.filter(new Function<String, Boolean>() {
      public Boolean call(String s) {
        return s.contains("error");
      }
    }).count();

PySpark (coming soon):

    lines = sc.textFile(...)
    lines.filter(lambda x: "error" in x).count()

[Figure: Selection query; series: Shark, Shark (disk), Hive; 100 m2.4xlarge nodes, 2.1 TB benchmark (Pavlo et al); one bar is labelled 1.1.]
[Figure: Join query; series: Shark (copartitioned), Shark, Shark (disk), Hive; 100 m2.4xlarge nodes, 2.1 TB benchmark (Pavlo et al); one bar is labelled 105.]
[Figure: Query 1, Query 2, Query 3; series: Shark, Shark (disk), Hive; 100 m2.4xlarge nodes, 1.7 TB Conviva dataset; bars are labelled 0.8, 0.7, and 1.0.]

We are sincerely eager to hear your feedback on this presentation and on re:Invent. Please fill out an evaluation form when you have a chance.
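
The gradient-descent slide above maps each cached point to (1 / (1 + exp(-p.y*(w dot p.x))) - 1) * p.y * p.x and then reduces with _ + _. The following is a minimal single-machine sketch of that map/reduce step in plain Python, not Spark code. The names point_gradient, train, and predict, the step size, the zero initial vector, and the toy data set are all assumptions made for illustration; the slide only says "initial parameter vector" and gives no data.

```python
import math
from functools import reduce


def dot(a, b):
    """Dot product of two equal-length lists."""
    return sum(x * y for x, y in zip(a, b))


def point_gradient(w, x, y):
    """Per-point term from the slide: (1 / (1 + exp(-y * (w . x))) - 1) * y * x."""
    scale = (1.0 / (1.0 + math.exp(-y * dot(w, x))) - 1.0) * y
    return [scale * xi for xi in x]


def add(a, b):
    """Element-wise vector sum; plays the role of `reduce _ + _`."""
    return [x + y for x, y in zip(a, b)]


def train(points, iterations=200, step=0.1):
    """Repeat the map and reduce steps, updating w after each pass."""
    dim = len(points[0][0])
    w = [0.0] * dim  # assumption: start from the zero vector
    for _ in range(iterations):
        # map: one gradient term per point; reduce: sum the terms
        gradient = reduce(add, map(lambda p: point_gradient(w, p[0], p[1]), points))
        w = [wi - step * gi for wi, gi in zip(w, gradient)]
    return w


def predict(w, x):
    """Classify by the sign of w . x (labels are +1 / -1)."""
    return 1 if dot(w, x) >= 0 else -1


# Hypothetical toy data: (features, label); the second feature is a constant bias term.
points = [
    ([2.0, 1.0], 1), ([3.0, 1.0], 1), ([1.5, 1.0], 1),
    ([-2.0, 1.0], -1), ([-3.0, 1.0], -1), ([-1.5, 1.0], -1),
]

weights = train(points)
print("weights:", weights)
print("predictions:", [predict(weights, x) for x, _ in points])
```

Here the list of points stays in memory across iterations, which is the role the cached RDD plays on the slide: the data is loaded once and only w changes between passes.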


Last Modified: April 18, 2016 @ 1:10 pm