If you’re a technology professional, how can you demonstrate the value of data science to your organization? A good first step is to get familiar with some of the most popular data science tools and platforms available today. Based on my own firsthand experience and discussions with a variety of technology professionals, a great starting point is Python and Apache Spark. According to a 2016 KDnuggets poll, Python and Apache Spark are both among the top 10 most popular tools used by data scientists.
Python is an extremely popular general-purpose development language. It has a number of core libraries focused on numerical analysis, plotting, and scientific computing. It’s easy to get started with, it has an active community, and nearly every modern framework and library I’ve come across has direct support for Python.
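As a minimal sketch of that numerical-analysis stack (the dataset and calculation here are invented for illustration; NumPy is shown, with pandas and matplotlib being common companions):

```python
# A quick numerical-analysis sketch using NumPy.
import numpy as np

# Simulated daily view counts for a piece of content (illustrative data).
views = np.array([120, 135, 150, 160, 180, 210, 250])

# Day-over-day growth rate between consecutive days.
growth = np.diff(views) / views[:-1]

print(f"mean daily growth: {growth.mean():.2%}")
```

A few lines like this, run interactively, are often all it takes to start exploring a dataset.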
Apache Spark is a great platform for data scientists. Aside from its sheer speed, Spark provides a single point of entry for exploring large amounts of data, applying machine learning algorithms, and getting your code into a production environment. In Spark, Python is a first-class citizen in the form of the Spark Python API (PySpark).
As groups, organizations and businesses allocate more budget to data science personnel and projects, the need to integrate those teams into an existing IT infrastructure becomes incredibly important. Don’t treat data scientists as members of a skunkworks project; their findings and results need to be folded into existing product and service lines quickly.
Imagine you are a manager with a small team of Python application developers and system admins responsible for the upkeep and improvement of a content recommendation system. You’ve been informed there is some extra budget available but the money hasn’t yet been allocated to any particular department. Your boss is looking to deliver a presentation within the next week to highlight why that budget should be given to your group.
The team has done an incredible job with the content recommendation system thus far, but looking to the future, you know that the practical application of data science could drive 100x improvement in customer usage. Talking with friends at other companies, they describe how they leverage data science within their organizations. The data scientists build and improve statistical models which are then handed to application developers who fold these models into data processing applications. Once the application is compiled, system admins deploy the application to an on-premise Spark cluster.
Although their workflow sounds straightforward, experience has taught you that the use of functional silos can delay the production deployment of initiatives significantly. You have a week to put together something that demonstrates your vision of a unified environment for data scientists, application developers and system admins to work together.
The demo should focus on two things:
- Data science centric Python and Apache Spark usage
- A unified workflow and easy deployment
In the scenario above, you are introducing data scientists into a dev-centric atmosphere. The workflow used by the other companies is common, but it isn’t the most efficient. In addition, your apprehension about functional silos is justified, so ideally you want a workflow that brings the entire team together from the start.
The notebook environment available in the Databricks platform is a suitable way to demonstrate a unified workflow. Notebooks can be stored in a source code repository (such as GitHub) and can be shared and edited by everyone on the team. Permissions allow you to control access to the workspace, and changes are tracked automatically.
Data scientists need to be able to focus on exploring, defining, and answering the never-ending pipeline of business problems. Their work should not be hampered by a complicated deployment process. While everyone benefits from an easy, streamlined deployment pipeline, today’s development talent generally includes some system operations (DevOps) experience: your deployment pipeline may be convoluted, but a great developer can make it work. IT departments often miss this point when introducing data scientists. Great data scientists are developers in their own right, but it’s incredibly rare for them to have any system operations experience. The Databricks platform makes it incredibly easy to spin up a cluster and attach a notebook to it.
In the scenario above, some managers would focus on building a rudimentary prototype of the content recommendation system, leveraging hard-coded data with some “smoke and mirrors” thrown in for good effect. The drawback to this approach is that it gives the impression that this is something your team can already do. Why would I approve more budget for your team to expand in order to build an updated content recommendation system when, based on your prototype, it looks like you can already do it? Of course, this line of thinking varies from company to company, but I’ve seen it firsthand at a number of companies.
Instead, why not focus on demonstrating what your team could do if data science were introduced? Why not showcase how overall processes and collaboration within your team will improve, how data can be used in new and exciting ways, and how you can lower the time to market for content recommendation upgrades?
- In an effort to “dogfood” this approach, I put together an example notebook following the workflow outlined in this article. Check it out at http://cli.re/databricks-pyspark-demo.
- Demo Driven Development: Apache Spark – http://bpcs.com/blog/demo-driven-development-apache-spark/
- Learning PySpark, a book cowritten by Denny Lee. I’ve had the pleasure of knowing Denny for a while now and highly recommend his book – https://www.amazon.com/Learning-PySpark-Tomasz-Drabas-ebook/dp/B01KOG6SXM
- Getting started with Databricks Community Edition – http://bpcs.com/blog/databricks-community-edition-entry-big-data-spark/
- Productionizing and Deploying Data Science Projects – https://www.continuum.io/blog/developer-blog/productionizing-deploying-data-science-projects
- Apache Spark Python Programming Guide – http://spark.apache.org/docs/0.9.0/python-programming-guide.html
- Databricks example notebooks – https://databricks.com/resources/type/example-notebooks