Demo Driven Development: PySpark

In a previous blog post, I talked about demo driven development and the importance of demonstrating business value when pursuing development efforts with new technology. If you work for an organization focused on service-based delivery, you may find yourself having to demonstrate your capability to deliver a solution to a problem that hasn't been fully defined yet. One area where I see this happening often today is data science initiatives.

Overview

Interest in and adoption of data science are growing across all industries. Google Trends provides a telling snapshot of that rising popularity over the past five years.
[Figure: Google Trends chart of search interest in data science over the past five years (screenshot, March 2017)]

If you’re a technology professional, how can you demonstrate the value of data science to your organization? To get started, get familiar with some of the most popular data science tools and platforms available today. Based upon my own firsthand experience and discussions with a variety of technology professionals, a great starting point would be Python and Apache Spark. According to a 2016 KDnuggets poll, Python and Apache Spark are both among the top 10 most popular tools used by data scientists.

Python is an extremely popular general-purpose development language. It has a number of core libraries focused on numerical analysis, plotting, and scientific computing. It’s easy to get started, the community is active, and nearly every modern framework and library I’ve come across has direct support for Python.

Apache Spark is a great platform for data scientists. Aside from the sheer speed, Spark gives you a single point of entry for exploring large amounts of data, utilizing machine learning algorithms, and getting your code into a production environment. In Spark, Python is a first-class citizen, in the form of the Spark Python API (PySpark).
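
To make that concrete, here is a minimal, hypothetical PySpark sketch. The events.csv file and its columns are invented for illustration; the point is how little code a basic exploration takes:

    from pyspark.sql import SparkSession

    # Start (or reuse) a Spark session
    spark = SparkSession.builder.appName("pyspark-demo").getOrCreate()

    # Load a hypothetical CSV of user events (columns: user_id, content_id, timestamp)
    events = spark.read.csv("events.csv", header=True, inferSchema=True)

    # Count how often each piece of content was viewed, most popular first
    top_content = (events
                   .groupBy("content_id")
                   .count()
                   .orderBy("count", ascending=False))

    top_content.show(10)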

As groups, organizations and businesses allocate more budget to data science personnel and projects, the need to integrate those teams into an existing IT infrastructure becomes incredibly important. Don’t treat data scientists as members of a skunkworks project; their findings and results need to be folded into existing product and service lines quickly.

The Scenario

Imagine you are a manager with a small team of Python application developers and system admins responsible for the upkeep and improvement of a content recommendation system. You’ve been informed there is some extra budget available but the money hasn’t yet been allocated to any particular department. Your boss is looking to deliver a presentation within the next week to highlight why that budget should be given to your group.

The team has done an incredible job with the content recommendation system thus far, but looking to the future, you know that the practical application of data science could drive a 100x improvement in customer usage. Friends at other companies have described how they leverage data science within their organizations: the data scientists build and improve statistical models, which are then handed to application developers who fold those models into data processing applications. Once an application is compiled, system admins deploy it to an on-premise Spark cluster.

Although their workflow sounds straightforward, experience has taught you that functional silos can significantly delay getting initiatives into production. You have a week to put together something that demonstrates your vision: a unified environment where data scientists, application developers, and system admins work together.

Demo Recommendation

For the purposes of demo driven development with PySpark, I’d recommend the following focus areas:
  • Data science centric Python and Apache Spark usage
  • Unified workflow and easy deployment

Focus: Data science centric Python and Apache Spark usage

With only a week to put together a demo, you don’t have much time to become operationally knowledgeable about data science. My advice is to leverage as much existing code as possible to illustrate potential. To that end, I’d recommend Databricks Community Edition, a free version of their robust cloud platform that includes a managed micro-cluster and a notebook environment that acts as both your code editor and your end-user interface. The notebook environment is especially important in this case because there is a growing number of example notebooks with immediate applicability: they include datasets available directly within the notebooks, working code examples, and solid commentary. You can get something up and running in minutes and spend the rest of the available time adjusting and experimenting around your use case.
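
As a rough sketch of what a first notebook cell might look like, here is the kind of snippet you would adapt from one of those example notebooks. The dataset path is a placeholder; browse the built-in /databricks-datasets sample-data mount and use whichever dataset your starting notebook references:

    # Inside a Databricks notebook, a SparkSession named `spark` is already provided.
    # The path below is a placeholder pointing at the built-in sample-data mount.
    df = spark.read.csv(
        "/databricks-datasets/<some-sample-dataset>.csv",
        header=True,
        inferSchema=True,
    )

    # display() renders an interactive table (or chart) directly in the notebook
    display(df.limit(20))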

Focus: Unified workflow and easy deployment

In the scenario above, you are introducing data scientists into a dev-centric atmosphere. The workflow the other companies described is common, but it isn’t the most efficient. In addition, your apprehension about functional silos is justified, so ideally you want a workflow that brings the entire team together from the start.

The notebook environment available in the Databricks platform is a suitable approach for demonstrating a unified workflow. Notebooks can be stored in a source code repository (such as GitHub) and can be shared and edited by everyone on the team. Permissions allow you to control access to the workspace, and changes are tracked automatically.
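
One detail worth knowing: a Databricks notebook can be exported as a plain Python source file, which is what actually lands in the repository. A rough sketch of that exported format (the notebook content here is invented for illustration):

    # Databricks notebook source
    # MAGIC %md
    # MAGIC ## Content recommendation exploration

    # COMMAND ----------

    # Each notebook cell becomes a block separated by a COMMAND marker, so
    # diffs and code reviews in GitHub work just like they do for any .py file.
    events = spark.read.csv("/mnt/demo/events.csv", header=True, inferSchema=True)

    # COMMAND ----------

    display(events.groupBy("content_id").count())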

Data scientists need to be able to focus on exploring, defining, and answering the never-ending pipeline of business problems. Their work should not be hampered by a complicated deployment process. While everyone would benefit from an easy, streamlined deployment pipeline, today’s development talent generally includes some system operations (DevOps) experience. Your deployment pipeline may be convoluted, but a great developer can make it work. This is a point often missed by IT departments when introducing data scientists: great data scientists are developers in their own right, but it’s incredibly rare that they have any system operations experience. The Databricks platform makes it incredibly easy to spin up a cluster and attach a notebook to it.
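
If you want to show that this ease extends to automation as well as the UI, a small sketch against the Databricks clusters REST API makes the point. The host, token, and configuration values below are placeholders; check your workspace for valid runtime and node type identifiers:

    import requests

    DATABRICKS_HOST = "https://<your-workspace>.cloud.databricks.com"
    TOKEN = "<personal-access-token>"

    # Ask the workspace to spin up a small, auto-terminating demo cluster
    resp = requests.post(
        f"{DATABRICKS_HOST}/api/2.0/clusters/create",
        headers={"Authorization": f"Bearer {TOKEN}"},
        json={
            "cluster_name": "demo-cluster",
            "spark_version": "<runtime-version>",
            "node_type_id": "<node-type>",
            "num_workers": 2,
            "autotermination_minutes": 60,
        },
    )
    resp.raise_for_status()
    print(resp.json())  # the response includes the new cluster_id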

Final Thoughts

In the scenario above, the focus for some managers would be to build a rudimentary prototype of the content recommendation system, leveraging hard-coded data with some “smoke and mirrors” thrown in for good effect. The drawback to this approach is that it only demonstrates what your team can do right now. Why would I approve more budget for your team to expand and build an updated content recommendation system when, based on your prototype, it looks like you can already do it? Of course, this line of thinking varies from company to company, but I’ve seen it firsthand at a number of them.

Instead, why not focus on demonstrating what your team could do if data science were introduced? Why not showcase how overall processes and collaboration within your team would improve, how data could be used in new and exciting ways, and how you could lower the time to market for content recommendation upgrades?
