Spark for the impatient: Introduction

By Ross Lambert

The Zen of Spark: What is it, exactly?

Apache Spark is a general-purpose, distributed, memory-based, parallel computing engine.

Man, that’s a mouthful.

Although technically accurate, that description comes across as just so much marketing-speak. But the fact is, we can learn a lot about Spark’s key characteristics by breaking it down.

  • General Purpose: For quite some time, the MapReduce algorithm was the primary modus operandi for processing vast amounts of data in parallel at scale. Unfortunately, some computing tasks do not fit neatly into that way of organizing a problem. Spark can do things differently, including sitting back and listening for streaming data, which it packages into “microbatches” and hands to your code. Your code can do whatever it needs to with the microbatch, or with a batch from the “regular” batch interface. In addition, Spark can manage state and hand it to your code with each element of a batch, enabling long-running, stateful operations.
  • Distributed: Like most “internet-scale” software, Spark is designed to scale horizontally on what is essentially commodity hardware. It supports pluggable resource managers like YARN, Mesos, or even its own default scheduler to manage the allocation of resources across the cluster. This means that you can potentially run more than one application at a time across the cluster. I use the word “application” for the sake of familiarity, but they are also sometimes referred to as “jobs”. The term “job” can be a bit of a misnomer since it implies something that runs once and disappears. Spark applications do not necessarily behave that way.
  • Memory-based: Spark can cache data, and in fact holds entire chunks (partitions) of data sets in memory as a matter of course. Best of all, thanks to its versatile caching abilities, you can reuse these chunks of data in subsequent operations; they do not need to be recomputed or re-read each time they are used. Even more importantly, if a node holding a cached partition fails, that partition is recomputed on another node without intervention. You don’t need to lift a finger. That is darn-near magical and is a big attraction for developers of mission-critical applications.
  • Parallel Computing Engine: Spark’s superb parallel-execution abilities are quite likely another big driver of its staggering growth in the marketplace: it simplifies the way we write programs that need to execute in parallel. It is only a small exaggeration to say that, in Spark, parallel execution is more a function of configuration than of code changes. Nevertheless, it is important to understand Spark’s parallelism to some degree in order to write efficient Spark applications: the unwary can trigger unnecessary “shuffling” of huge amounts of data around the cluster. (A short sketch after this list shows caching and configuration-driven parallelism in action.)
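To make those last two points concrete, here is a minimal PySpark sketch. The file name and column names (readings.csv, sensor_id, voltage) are hypothetical stand-ins, and the local[4] master is only for illustration; the same code runs unchanged against a real cluster.

    from pyspark.sql import SparkSession

    # Parallelism is largely configuration: ask for 4 local cores and
    # 8 shuffle partitions. On a real cluster, only these settings change.
    spark = (
        SparkSession.builder
        .appName("spark-for-the-impatient")
        .master("local[4]")
        .config("spark.sql.shuffle.partitions", "8")
        .getOrCreate()
    )

    # Hypothetical CSV of sensor readings.
    readings = spark.read.csv("readings.csv", header=True, inferSchema=True)

    # Hold the partitions in memory so later actions reuse them instead of
    # re-reading and re-parsing the file.
    readings.cache()

    # Both actions below run in parallel against the cached partitions.
    # The groupBy triggers a shuffle, whose width is set by the config above.
    print(readings.count())
    readings.groupBy("sensor_id").avg("voltage").show()

    spark.stop()

If an executor holding some of those cached partitions dies, Spark recomputes just the lost partitions from their lineage; nothing in the code above has to change.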

So what kind of applications are a good fit for parallel computing in general and Spark’s brand in particular? The answer is not cut and dried, but there are two common usage scenarios:

  1. Applications with enormous or even unbounded data sets: One example is an application that processes sensor data. A sensor may be read every four seconds (the standard interval for electric utilities), so even a simple voltage meter generates huge data sets over time. Spark lets you process and analyze (or, maybe more importantly, re-process and re-analyze) these data sets in a reasonable amount of time. You can scale up your cluster to handle a burst of processing and then scale it back down, or turn it off altogether, when there is no more data. This potentially saves money. Lots of money.
  2. Applications requiring high-speed processing: To continue with the sensor data example, consider an application that monitors the temperature in a million buildings worldwide on a one-minute interval. To minimize latency, those million readings could be partitioned across a Spark cluster, with appropriate and timely notifications or alarms raised when readings exceed configured thresholds (sketched below). Spark Streaming provides a mechanism for this to happen in near real time, with all of Spark’s considerable scaling and partitioning capabilities. It also allows you to do considerable analytical work on the fly, in near real time, a fact we have exploited to remarkable benefit at Blueprint Consulting Services.
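Here is a minimal sketch of that alerting pattern using Structured Streaming, the DataFrame-based streaming API. The input directory, the schema, and the 30-degree threshold are assumptions made for illustration; a production job would more likely read from something like Kafka and publish alerts rather than printing them.

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("temperature-alerts").getOrCreate()

    # Hypothetical landing directory where an upstream collector drops JSON
    # files of readings: building_id, temperature (deg C), ts.
    schema = "building_id STRING, temperature DOUBLE, ts TIMESTAMP"
    readings = (
        spark.readStream
        .schema(schema)
        .json("/data/incoming/temperatures")
    )

    # Keep only readings above an assumed threshold of 30 degrees.
    alerts = readings.where(F.col("temperature") > 30.0)

    # Each micro-batch of alerts is emitted as it arrives; swap the console
    # sink for Kafka or a notification service in a real deployment.
    query = (
        alerts.writeStream
        .outputMode("append")
        .format("console")
        .start()
    )

    query.awaitTermination()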

The first class of application is the type most frequently associated with Spark: you feed the beast a ton of data and it chews through it and executes a variety of analytics. Again, the speed is more a function of configuration and cluster size than of anything in your code.
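As a rough illustration of “configuration, not code”, here is one way to let the cluster grow and shrink with the workload. The executor counts are arbitrary placeholders, not recommendations, and dynamic allocation needs either shuffle tracking (shown here) or an external shuffle service from your cluster manager.

    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("burst-processing")
        # Let the resource manager add and remove executors with demand.
        .config("spark.dynamicAllocation.enabled", "true")
        .config("spark.dynamicAllocation.shuffleTracking.enabled", "true")
        .config("spark.dynamicAllocation.minExecutors", "2")
        .config("spark.dynamicAllocation.maxExecutors", "50")
        .getOrCreate()
    )

The application logic does not change at all; the same job simply gets more or fewer executors as the data volume rises and falls.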

The second class of application, those requiring near real-time responses, is a growing subset of the Spark community’s usage of the platform, thanks largely to Spark Streaming.

In the real world, things are often not so neatly divided or categorized: We now want real-time monitoring *and* rich analytics over the same data. Thankfully, Spark can handle both.
