I attended Spark Summit Europe 2016 in Brussels this year in October, a conference where Apache Spark enthusiasts meet up. I’ve been using Spark for nearly a year now on multiple projects and was delighted to see so many Spark users at Square Brussels.
There were three trainings to choose from on the first day. I went for "Exploring Wikipedia with Spark (Tackling a unified case)".
The class was taught in Scala and Databricks notebooks were used. Databricks is a cloud platform that lets data scientists use Spark without having to setup or manage a cluster themselves. Databricks uses AWS as their backend. Clusters can be started and then attached to notebooks where code can be executed on the attached cluster.
The class started with a recap of the basics, covering multiple APIs, including RDDs, Dataframes and the new Datasets. We used publicly available Wikipedia datasets and leveraged Spark SQL, Spark Streaming, GraphFrames, UDFs and machine learning algorithms. I was impressed to see how easy it was to run code snippets on the Databricks platform and get insights into the data.
Another great feature is the support for mixing languages in a notebook. For instance a UDF can be defined and registered in Python and can then be used in Scala. The other two trainings which I wasn’t able to attend were "Apache Spark Essentials (Python)" and "Data Science with Apache Spark".
The following days were conference days. Usually each day started with keynotes and then there were three or four talks to choose from every 30 minutes. I will highlight some of the talks and keynotes I attended.
Simplifying Big Data Applications with Apache Spark 2.0
Spark 2.0 was released and brings many improvements over the 1.6 branch, namely:
Performance improvements with whole-stage code generation and vectorization Unified API: Dataframes are now just an alias for Datasets The new SparkSession single entry point. This replaces SparkContext, StreamingContext, SQLContext, etc.
The Next AMPLab: Real-Time, Intelligent, and Secure Computing
Spark was born at AMPLab. We were shown what projects AMPLab is currently working on and thus what can be expected in the next 5 years for Spark. They currently have two main projects: Drizzle and Opaque. Drizzle aims at reducing latency in Spark Streaming while Opaque is an attempt at improving security in Spark, for instance by protecting against pattern recognition attacks.
Spark’s Performance: The Past, Present, and Future
Performance in Spark 2.0 is improved with whole-stage code generation, a new technique which will optimize the code of the whole pipeline and can boost performance by one order of magnitude in some cases. Another technique used to improve performance is vectorization, or in other words, using an in-memory columnar format for faster data access. Databricks published a blog post discussing this.
How to Connect Spark to Your Own Datasource
The author of the MongoDB Spark connector shared his experience in writing a Spark connector. There is a lack of official documentation on writing these so the best way to start writing your own connector is to look at how others did it, for example the Spark Cassandra connector.
Dynamic Resource Allocation, Do More With Your Cluster
This technique is useful for shared clusters and jobs of varying load. In this talk we were shown some parameters that can be set for optimizing dynamic resource allocation on a Spark cluster.
Vegas, the Missing MatPlotLib for Spark
Two engineers from Netflix showed their project called Vegas. This project will generate HTML code that can be used on web pages. Vegas also supports Apache Zeppelin notebooks, has console support and can render to SVG. Vegas uses Vega-Lite underneath. It is currently in the beta stage.
SparkLint: a Tool for Monitoring, Identifying and Tuning Inefficient Spark Jobs Across Your Cluster
Groupon announced the availability of SparkLint, a performance debugger for Spark. It can detect over-allocation and has CPU utilization graphs for Spark jobs. SparkLint is available on Github.
Spark and Object Stores —What You Need to Know
This talk gives a set of optimal parameters to use when working with Object Stores and Spark. When using the Amazon S3 API, make sure to use the new s3a:// protocol in your URLs. This is the only one that is currently supported.
Mastering Spark Unit Testing
A few tips and tricks from Blizzard were presented for unit testing Spark jobs. The main ideas were that one should not use a Spark context if it’s not necessary. Code can usually be tested outside of a Spark job.
If it’s really necessary to run a Spark job in your test, then use the local master and run it on your local machine. You can then set breakpoints for instance in IntelliJ Idea and debug both driver and executor code. A cool idea that the speaker gave was to share the Spark context across various unit tests so that the initialization is done only once and the tests are running faster.
Apache Spark 2.0 Performance Improvements Investigated With Flame Graphs