
The Clash Between Data Science and Production

By David Maman

Are you trying to move a data science project to production and finding yourself struggling?
This blog post will help you make some sense of the process.

If you’re reading this blog post then, like many other organizations and vendors, you’ve probably been digging into data science lately, and more specifically into deep learning technologies.

Like many others, you’re wondering what the best way would be to implement deep learning as part of your product or your business process.

Well, there’s no short and simple answer to this question. In this post we’ll try to cover and explain the challenges of deploying deep learning in production.

Unlike the traditional software development process, where every development methodology is expected to deliver production-grade software, the world of data science is a completely different story.

At the core of any data science project you’ll find a team of data scientists and, just as their title suggests, they are scientists. They aim to research and find the best possible techniques for your data science challenges in any use case: image recognition, prediction, NLP, anomaly detection and so on. But for all of this team’s knowledge and research capabilities, they are not software engineers.

If we take a look at the “real world”, we’ll find that two completely different teams are in charge of delivering the elements of your data science project:

The Research team – the data scientists
The Software Engineering / DevOps team – the production team

Integrating these two completely different worlds can sometimes seem impossible or frustrating, but understanding them can be the difference between a successful project and a failure.

 

Data Science Research and Modelling:

Any data science project starts with building the model.

It usually starts with research, which requires some background knowledge of the specific domain and an assumption about the problem that needs to be solved.

Many times during the research process, the data scientists will encounter other problems that also require a solution.

The research usually includes a data scientist and an expert from the relevant business unit.

At the end of this process, a proof-of-concept model will be ready for testing, usually based on Python and/or MATLAB.

 

Data Scientists and Developers:

One of the challenges of moving a data science prototype to production is creating an efficient and smooth workflow between the researchers and the developers.

While many data scientists are PhDs or hold advanced degrees in math, they are usually at a beginner’s level when it comes to coding, and most of them have never written software for a production environment.

Unlike data scientists, most developers have a software engineering background and a lot of experience in building production-grade systems.

Most of them don’t really know machine learning but are willing to learn.

While a data scientist’s proof of concept usually focuses on model accuracy and on finding more business problems to solve, a software engineer will usually deal with the model’s performance and with an exact definition of the specific task (problem) to solve.

The ideal solution would be one person who combines both data science and data engineering (coding) skills, but in reality most organizations will have two separate teams, one for each task.

While data scientists will mostly use Python and R for the proof of concept, the software engineers will usually use Java or C++ for the production environment.

In some small companies, mostly startups, you’ll find one team doing both the coding and the data science work, but those companies usually rely on a third-party data science contractor or a data science consulting company.

 

Going Live in Production – How Do We Make It Work:

When going into production, there are several things that we need to take into consideration.

1) The Data Source

  • What type of data are we using (images, video, audio or text)?
  • Where does the data come from (SQL, Hadoop, files, CRM plug-in, data stream)?
  • How often do we need to get the data from the data source?
  • One of the most important questions: how much data do we need to analyze, and how fast should we analyze it (real time, within an hour, next day)?
  • Which interface and format will the incoming data have?
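One way to keep these answers from getting lost between the research and production teams is to write them down as a small, explicit specification. The sketch below is only an illustration: the class name `DataSourceSpec`, its fields and the one-second "real time" threshold are all assumptions made for this example, not part of any particular tool.

```python
from dataclasses import dataclass
from enum import Enum


class DataType(Enum):
    IMAGE = "image"
    VIDEO = "video"
    AUDIO = "audio"
    TEXT = "text"


@dataclass(frozen=True)
class DataSourceSpec:
    """Hypothetical record of the answers to the questions above."""
    data_type: DataType
    origin: str              # e.g. "sql", "hadoop", "files", "stream"
    poll_interval_s: float   # how often new data is fetched
    records_per_poll: int    # expected volume per fetch
    max_latency_s: float     # how fast results are needed
    payload_format: str      # e.g. "json", "csv", "jpeg"

    def required_throughput(self) -> float:
        """Records per second the scoring side must sustain."""
        return self.records_per_poll / self.poll_interval_s

    def is_real_time(self, threshold_s: float = 1.0) -> bool:
        """Assumed rule: 'real time' means an answer within threshold_s."""
        return self.max_latency_s <= threshold_s


# Example: text records pulled from SQL every minute, results due within an hour.
spec = DataSourceSpec(DataType.TEXT, "sql", poll_interval_s=60.0,
                      records_per_poll=6000, max_latency_s=3600.0,
                      payload_format="json")
print(spec.required_throughput())  # 100.0
print(spec.is_real_time())         # False
```

Even a rough number such as "100 records per second, answers within the hour" gives the production team something concrete to size the system against.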

2) The Model Results/Score

  • What is the system supposed to do with the results of the model?
  • Should the results be saved to a database?
  • Should the results initiate a business task?
  • Should the results initiate an alert?
  • Should the results just create new graphs and charts?

In many cases, the project should be part of a business flow, meaning that once a model has been trained and optimized, it should be fed with data and respond with results.

Like many other projects, a data science task is not a one-time project, which raises the question: who keeps track of the quality of the model’s scoring, and how?
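To make the two ideas in this section concrete, here is a minimal sketch of routing each score to an action and of tracking scoring quality over time. Everything in it is an assumption for illustration: the class names `ResultRouter` and `QualityMonitor`, the thresholds, and the use of in-memory lists where a real system would write to a database or an alerting service.

```python
from collections import deque


class ResultRouter:
    """Stores every score and raises an alert at or above a threshold."""

    def __init__(self, alert_threshold: float = 0.8):
        self.alert_threshold = alert_threshold
        self.saved = []    # stand-in for a database table
        self.alerts = []   # stand-in for an alert or business-task queue

    def handle(self, record_id: str, score: float) -> None:
        self.saved.append((record_id, score))
        if score >= self.alert_threshold:
            self.alerts.append(record_id)


class QualityMonitor:
    """Rolling accuracy over the most recent labelled predictions."""

    def __init__(self, window: int = 100, min_accuracy: float = 0.9):
        self.outcomes = deque(maxlen=window)
        self.min_accuracy = min_accuracy

    def record(self, predicted, actual) -> None:
        self.outcomes.append(predicted == actual)

    def accuracy(self) -> float:
        # With no feedback yet there is nothing to flag.
        if not self.outcomes:
            return 1.0
        return sum(self.outcomes) / len(self.outcomes)

    def needs_review(self) -> bool:
        return self.accuracy() < self.min_accuracy


router = ResultRouter()
router.handle("order-1", 0.95)
router.handle("order-2", 0.10)
print(router.alerts)  # ['order-1']

monitor = QualityMonitor(window=4)
for predicted, actual in [(1, 1), (0, 1), (1, 1), (0, 0)]:
    monitor.record(predicted, actual)
print(monitor.accuracy(), monitor.needs_review())  # 0.75 True
```

The monitor only works if someone feeds the true outcomes back in, which is exactly the ownership question the organization has to answer.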

3) Coding language

Once we solve the data challenge, we need to decide which coding language will be used.

While the proof of concept is usually done in Python, in the production environment we would prefer to use C++ (some organizations will use Java) and a light web app based on a light web server (Node.js, nginx).

Python can be suitable for specific use cases, but it comes with many disadvantages.
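For the cases where Python is acceptable, a light scoring endpoint could look roughly like the sketch below, which uses only the standard library's `http.server`. It is an illustration under stated assumptions, not a recommended production setup: `predict` is a hypothetical placeholder for the trained model, and the JSON shape (`{"features": [...]}` in, `{"score": ...}` out) and port 8080 are invented for this example.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def predict(features):
    """Hypothetical placeholder for the trained model: mean of the inputs."""
    return sum(features) / len(features)


def handle_body(body: bytes):
    """Turn a raw JSON request body into (status_code, response_bytes)."""
    try:
        features = json.loads(body)["features"]
        score = predict([float(x) for x in features])
    except (ValueError, KeyError, TypeError, ZeroDivisionError):
        return 400, json.dumps({"error": "bad request"}).encode()
    return 200, json.dumps({"score": score}).encode()


class ScoreHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Read exactly the number of bytes the client announced.
        length = int(self.headers.get("Content-Length", 0))
        status, payload = handle_body(self.rfile.read(length))
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)


def serve(port: int = 8080) -> None:
    """Blocks forever; call this from a script to start the service."""
    HTTPServer(("", port), ScoreHandler).serve_forever()
```

Keeping the request handling in a plain function (`handle_body`) separate from the web server means the same logic can be tested without a network, or later ported to whichever language and server the production team prefers.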

4) Hardware

Although training on a large amount of data runs much faster on GPUs, most of the market is still using CPUs. GPUs usually require highly skilled DevOps and some additional coding, which sometimes makes life hard for the production team.
When it comes to production, most of the time the system does not require high performance, but it does require scalability and redundancy.

5) Cloud or on-premises?

Since mid-2016, most large cloud providers support both CPUs and GPUs for production environments, which allows scale and redundancy for deep learning deployments.

However, some organizations (finance, health and any other regulated industry) are not allowed to take their data outside the organization. They would need to use their own private cloud or data centers, which must then be able to scale on demand.

6) User experience:

User experience in deep learning implementations deals with two main challenges:

The response time of the application (how fast do I get a score or a result), and how to create a “usable” application that is not too complicated: an application that requires no additional actions, or only minimal additional actions, from the end user.
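Response time is the easier of the two to put a number on. A minimal sketch, assuming the model is available as an ordinary Python callable (the name `score_fn` and the choice of median and 95th percentile are assumptions for this example):

```python
import math
import statistics
import time


def measure_latency(score_fn, inputs):
    """Time score_fn on each input; inputs must be non-empty."""
    timings = []
    for item in inputs:
        start = time.perf_counter()
        score_fn(item)
        timings.append(time.perf_counter() - start)
    timings.sort()
    # Nearest-rank 95th percentile: the value at position ceil(0.95 * n).
    p95_index = max(0, math.ceil(0.95 * len(timings)) - 1)
    return {
        "median_s": statistics.median(timings),
        "p95_s": timings[p95_index],
        "max_s": timings[-1],
    }


# Example with a trivial stand-in for the model.
report = measure_latency(lambda x: x * 2, range(1000))
print(sorted(report))  # ['max_s', 'median_s', 'p95_s']
```

Looking at the 95th percentile rather than only the average matters for user experience, because the slow requests are the ones end users remember.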

 

What is Binah.ai?

We help enterprise organizations and vendors utilize machine learning, deep learning and signal processing as part of their solutions and work processes.

The Binah.ai platform helps narrow the gap between data scientists and production environments. Using Binah.ai, moving from a research environment to production takes 2-3 simple clicks, reducing the cost and time of (almost) any data science project by up to 95%.

For more information about the Binah.ai platform, please contact us at info@binah.ai