Technology alone does not solve problems. Back in 2014 at Techonomy, Jack Dorsey, the cofounder of Twitter and Square, put it very well: “To me, technology fundamentally is just a tool. It’s up to us to figure out how to use those tools and how to apply those tools”. In theory, this view of technology is reasonable and should be embraced.
In practice, this view can be a hard reality for developers and software engineers to accept. I was once told by a development manager “It’s our job to figure out how to use new technology in our environment…because no one else knows what we know.” Our social and media consumption habits hover around being “in the know” about the latest libraries, frameworks, data stores, development languages and automation tactics. We are seduced by faster processing times, shorter code snippets, quicker deployments, and robust documentation.
In an enterprise environment, new technology is irrelevant if you can’t deliver real business value with it. When a technology department is unable to demonstrate progress in delivering business value, jobs are reduced and budgets are cut. Even in a tech-centric city like Seattle, we see this happen often. The round of layoffs at Nordstrom’s last year is one example that immediately comes to mind.
However, there is always opportunity with new technology. As a developer or software engineer, you just have to think about the usage and approach differently.
The “Aha” Moment
Recently, I was at a hackathon that focused on rapid web development with Polymer and web components. One of the facilitators used the term “demo driven development” to describe some Polymer development practices. The basic premise was to build a demo page first to illustrate the basic concept of a new component or element. While trivial in nature, it got me thinking about how developers can, do and should illustrate progress.
Apache Spark Overview
Apache Spark is an incredibly popular technology today. It’s open source, incredibly fast, quite versatile and, relatively speaking, easy to implement.
How fast is Apache Spark? It processes extremely large data sets astonishingly fast. During the Daytona GraySort contest, it sorted 100 TB of data in 23 minutes with 206 EC2 nodes. Compare this to the previous record of 72 minutes with a MapReduce cluster of 2100 nodes. That’s 3X faster with 10X fewer machines!
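As a quick back-of-the-envelope check, the ratios can be worked out directly from the figures quoted above. The node-minutes comparison at the end is a rough illustrative measure rather than a contest metric, and it ignores any hardware differences between the two clusters.

```python
# Figures quoted above for the 100 TB sort.
spark_minutes, spark_nodes = 23, 206
mapreduce_minutes, mapreduce_nodes = 72, 2100

speedup = mapreduce_minutes / spark_minutes    # how many times faster
node_ratio = mapreduce_nodes / spark_nodes     # how many times fewer machines

# Rough illustrative measure (an assumption, not a contest metric):
# total machine time consumed, in node-minutes.
spark_node_minutes = spark_minutes * spark_nodes
mapreduce_node_minutes = mapreduce_minutes * mapreduce_nodes
efficiency_ratio = mapreduce_node_minutes / spark_node_minutes

print(f"Speedup: {speedup:.1f}x")
print(f"Fewer machines: {node_ratio:.1f}x")
print(f"Less total machine time: {efficiency_ratio:.1f}x")
```

Taken together, the two ratios suggest that the Spark run used well over an order of magnitude less total machine time, which is the kind of number a business audience tends to remember.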
The versatility of Apache Spark lends itself to phenomenal data processing and analytics opportunities. It works with structured and unstructured data from multiple data sources, allows you to query the processed data using SQL, has an extensive machine learning library, and is officially supported for use with Python, Java, Scala, and R. With guidance from Databricks, Microsoft has even added support for C#.
Not to be overlooked is the fact that it also has an active and passionate community. There are over 1,000 contributors to the main Apache Spark repo. Stack Overflow has over 22,000 questions tagged with “Apache Spark”. Across the world, there are nearly 600 active meetups. The Spark meetup group that I’m a part of (Seattle Spark) has over 2,000 members, and over 50 have joined this month alone. The Spark Summit in San Francisco last year had over 2,500 attendees, and I’d wager that number will reach 3,000+ this year, assuming the venue can handle it.
If you are a developer and new to Apache Spark, you may be thinking “This sounds interesting and my boss will love it. I’m going to set up an Apache Spark cluster and start on my Hello World”. After all, for developers and software engineers, technology validation doesn’t come via case studies, data sheets or product webinars: it comes from wrench time spent with the technology. So, depending on your work culture, you may opt to build something in your spare time. You’ll read posts, watch tutorials, troubleshoot your local development environment and hack together a basic word count application. You’ll show up the next day, pop open your laptop, show some scrolling text on a black terminal screen, smile, and say “Isn’t that awesome? We can totally use this for our data. I’d estimate this as an XXL t-shirt size level of effort. If we can get some well-written user stories and a 10-node cluster, I can get us an Apache Spark application in production in like six sprints. We will be processing so much data!”
Your boss is liable to respond in one of two ways. One, your boss joins in on the jubilation, pitches your request to their boss, only to say “Sorry, VP said there isn’t any budget this quarter for this work…they just don’t understand new technology.” The other response, due to your boss not understanding what all the excitement is about, will be “This is kind of neat…have you finished your code reviews yet?”
The tangible manifestation to that question will come in the form of a demo. The approach will be the development efforts you put into the demo. Put another way: the sole focus of your development activities is around the delivery of an applicable demo. In a later post, I’ll go into greater detail around my thoughts on demo driven development and its implications in various development environments. For now, consider it as a means of focusing your efforts on the implementation of a new technology.
For the purposes of demo driven development with Apache Spark, I’d recommend the following focus areas:
- Having a low-cost, easy setup
- Using relevant data sets
- Visualizing the data
Focus: Use a Low-Cost, Easy Setup
Setting up a local Spark cluster isn’t a difficult exercise. However, cluster management and tuning can quickly become a focal point for your efforts, rather than the actual demo.
I highly recommend Databricks’ community edition, a free version of their robust cloud platform that includes a managed micro-cluster and a notebook environment that acts as both your code editor and end-user interface. Databricks is the company founded by the creators of Apache Spark, and their community edition is more than adequate for demo purposes. However, if and when you are ready to move into a production environment, the full Databricks platform offers fully tuned, secured and managed Apache Spark cluster services that remove the need for expensive infrastructure support.
I’ve previously written a step-by-step tutorial on getting started with the community edition at http://bpcs.com/blog/databricks-community-edition-entry-big-data-spark/.
Focus: Get a Relevant Dataset
Technically, you don’t need a large dataset to use Apache Spark. However, your demo should definitely showcase the performance of Spark, so you’ll want to use either a large batch of data or a running, real-time data stream.
While I advise that you use data relevant to your organization, having access to a large enough dataset can prove difficult. Here are some batch data sources to get you started:
For streaming data, here are a few to get you started:
- Meetup.com rsvp stream – https://www.meetup.com/meetup_api/docs/stream/2/rsvps/#websockets
- Weather alerts – https://alerts.weather.gov/
- Twitter streams – https://dev.twitter.com/streaming/public
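Whichever stream you pick, the demo logic usually boils down to grouping incoming events into short time windows and counting them. The following is a minimal plain-Python sketch of that idea. It is not Spark code, and it does not reflect the actual message format of any of the feeds above: the `simulated_events` data, the city values, and the `count_by_window` helper are all hypothetical stand-ins.

```python
from collections import Counter, defaultdict


def count_by_window(events, window_seconds=60):
    """Group (timestamp, key) events into fixed-size windows and count keys.

    Returns a dict of {window_start: Counter({key: count})}.
    """
    windows = defaultdict(Counter)
    for timestamp, key in events:
        # Floor the timestamp to the start of the window it falls in.
        window_start = timestamp - (timestamp % window_seconds)
        windows[window_start][key] += 1
    return dict(windows)


# Hypothetical events: (seconds since start, city attached to the event).
simulated_events = [
    (5, "Seattle"), (20, "Seattle"), (42, "Portland"),
    (61, "Seattle"), (75, "Portland"), (118, "Portland"),
]

# Print the per-window counts in time order, most frequent key first.
for window_start, counts in sorted(count_by_window(simulated_events).items()):
    print(window_start, counts.most_common())
```

Working out the windowed counts on a handful of fake records first is a cheap way to pin down what the demo should show before you commit to a real feed.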
Focus: Visualize The Data
As a developer, a terminal screen with a text response from a successful API call can be exciting. However, a business needs to be able to make informed decisions quickly…and visually parsing through a streaming block of text hinders that ability. Using charts and graphs effectively is a great way to enable data-driven decisions for the enterprise.
The Databricks platform has a rich interactive notebook system that has some incredible visualization options and works with any language officially supported by Spark. It is a collaborative workspace environment that is useful for developers, data scientists and anyone who is looking to get started with Apache Spark.
There is a large and growing number of example notebooks available at https://databricks.com/resources/type/example-notebooks that will jumpstart your demo. A few that you may find especially relevant:
- Mobile sample data – https://cdn2.hubspot.net/hubfs/438089/notebooks/Mobile_Sample_.html
- Salesforce leads analysis – https://cdn2.hubspot.net/hubfs/438089/notebooks/Salesforce_Leads_with_Machine_Learning_Spark_SQL_and_UDFs.html
- AdTech Sample – https://cdn2.hubspot.net/hubfs/438089/notebooks/Samples/Miscellaneous/AdTech_Sample_Notebook_Part_1.html
- Parsing web logs – https://cdn2.hubspot.net/hubfs/438089/notebooks/Samples/Data_Exploration/Data_Exploration_on_Databricks.html
- Twitter sentiment analysis – https://docs.databricks.com/_static/notebooks/2016-election-tweets.html
Demo driven development isn’t new. In fact, you may have encountered variations of the concept in your own organization, ranging from the minimum viable product to the iterative agile product. Regardless of what it’s called, the core concept remains the same: figure out the absolute bare minimum necessary to deliver business value.
Recently, I wanted to put together a demo application for a company that illustrated the awesomeness that is Apache Spark. I could have hacked together and geeked out on a server log analysis and machine learning demo on my local machine. But the real value of the application was to get a group of non-developers excited about Spark. I opted to use the Databricks Community Edition with a notebook that included third party data, graph analysis and visualization with D3. The ease of use was incredibly compelling to everyone and a number of folks after the presentation reached out to me for access to the demo notebook so they could try out some development with Apache Spark. That company was Blueprint Consulting.
What business value will you demo with Apache Spark?
- An Entry into Big Data with Spark: Databricks Community Edition – http://bpcs.com/blog/databricks-community-edition-entry-big-data-spark/
- Getting started with Apache Spark and Machine Learning: A resource listing from my presentation at Seattle Code Camp 2016 – https://gist.github.com/gnakan/5137a1128f9ed8b9aa41c4c2ccbd5110