The eternal dream of almost every Data Scientist today is to spend all their time exploring new datasets, engineering new features, and inventing and validating cool new algorithms and strategies. However, the daily routine of a Data Scientist also includes raw data pre-processing, dealing with infrastructure, and bringing models to production. That is where good DevOps practices and skills become essential: they let industrial Data Scientists address these challenges in a self-service manner.
The primary mission of DevOps is to help teams resolve Tech Ops issues with infrastructure, tools, and pipelines.
On the other hand, as mentioned in a conceptual review by Forbes in November 2016, industrial analytics is no longer going to be driven by data scientists alone. It requires an investment in DevOps skills, practices, and supporting technology to move analytics out of the lab and into the business. There are even voices calling on Data Scientists to concentrate on agile methodology and DevOps if they want to keep their jobs in business in the long run.
The eternal dream of almost every Data Scientist today is to spend all (well, almost all) of their time in the office exploring new datasets, engineering decisive new features, and inventing and validating cool new algorithms and strategies. However, reality is often different. One of the unfortunate daily routines of a Data Scientist's work is raw data pre-processing. It usually translates to the following challenges:
Pull all kinds of necessary data from a variety of sources
Extract, transform, and load the data
Facilitate continuous machine learning and decision-making in a business-ready framework
Another big challenge is to organize collaboration and data/model sharing inside and across the boundaries of teams of Data Scientists and Software Engineers.
DevOps skills as well as effective instruments will certainly be beneficial for industrial Data Scientists as they can address the above-mentioned challenges in a self-service manner.
Data Version Control, or simply DVC, comes onto the scene whenever you start looking for effective DevOps-for-Analytics instruments.
DVC is an open source tool for data science projects. It makes your data science projects reproducible by automatically building a data dependency graph (a directed acyclic graph, or DAG). Your code and its dependencies can be easily shared through Git, and your data through cloud storage (AWS S3, GCP), all in a single DVC environment.
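To make the idea of a data dependency graph more concrete, here is a minimal sketch in plain Python. It does not use DVC and does not reflect DVC's internals; the file names and the `build_order` and `stale_outputs` helpers are hypothetical and exist only for illustration. It shows how a DAG of outputs and their inputs tells you in which order to build things and which outputs must be rebuilt when one input changes.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical pipeline: each output maps to the set of inputs it depends on.
DEPENDENCIES = {
    "clean.csv": {"raw.csv", "clean.py"},
    "features.csv": {"clean.csv", "featurize.py"},
    "model.pkl": {"features.csv", "train.py"},
    "report.txt": {"model.pkl", "features.csv"},
}


def build_order(deps):
    """Return the outputs in an order that respects every dependency."""
    order = TopologicalSorter(deps).static_order()
    # Keep only generated outputs; source files and scripts are not built.
    return [node for node in order if node in deps]


def stale_outputs(deps, changed):
    """Return the outputs that must be rebuilt after `changed` inputs change."""
    stale = set()
    for output in build_order(deps):
        # An output is stale if any input changed or is itself stale.
        if deps[output] & (changed | stale):
            stale.add(output)
    return stale


if __name__ == "__main__":
    print(build_order(DEPENDENCIES))
    print(sorted(stale_outputs(DEPENDENCIES, {"featurize.py"})))
```

In this sketch, changing `featurize.py` marks `features.csv`, `model.pkl`, and `report.txt` as stale, while `clean.csv` can be reused as it is. That selective rebuild is what makes a dependency graph useful for reproducibility.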
Although DVC was originally created for machine learning developers and data scientists, it has proved useful beyond that audience. Since it brings proven engineering practices to the often ill-defined ML process, I found it to have enormous potential as an Analytical DevOps instrument.
It clearly helps to manage a large share of the DevOps issues in a Data Scientist's daily routine.
One of the ‘juicy’ features of DVC is its ability to support multiple technology stacks. Whether you prefer R or use promising Python-based implementations for your industrial data products, DVC will be able to support your pipeline properly. You can see it in action for both Python-based and R-based technology stacks.
As such, DVC is going to be one of the tools you will enjoy using if and when you embark on building a continuous analytical environment for your system or across your organization.
Building a production pipeline is quite different from building a machine-learning prototype on a local laptop. Many teams and companies face challenges there.
At a bare minimum, the following requirements must be met when you move your solution into production
This goes into territory traditionally inhabited by DevOps. Data Scientists should ideally learn to handle some of those requirements themselves, or at least be well-informed consultants to classical DevOps gurus.
DVC can help in many aspects of the production scenario above, as it can orchestrate the relevant tools and instruments through its scripting. In such a setup, DVC scripts become a shareable manifestation (and implementation) of your production pipeline, where each step can be transparently reviewed, easily maintained, and changed as needed over time.
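The idea of a pipeline as a shareable, reviewable manifest can be sketched in a few lines of plain Python. This is not DVC syntax or DVC's API; the `PIPELINE` structure, the step names, and the `run_pipeline` helper are all hypothetical, and the commands are placeholders that only print a message. In a real setup, each step would call an ETL job, a training script, or a deployment tool.

```python
import subprocess
import sys

# Hypothetical manifest: each step has a name and the command it runs.
# The placeholder commands only print a message.
PIPELINE = [
    {"name": "extract", "cmd": [sys.executable, "-c", "print('pulled raw data')"]},
    {"name": "train", "cmd": [sys.executable, "-c", "print('trained model')"]},
    {"name": "publish", "cmd": [sys.executable, "-c", "print('published model')"]},
]


def run_pipeline(steps):
    """Run each step in order, stop at the first failure, and return a log."""
    log = []
    for step in steps:
        result = subprocess.run(step["cmd"], capture_output=True, text=True)
        log.append((step["name"], result.returncode, result.stdout.strip()))
        if result.returncode != 0:
            break  # later steps depend on earlier ones, so do not continue
    return log


if __name__ == "__main__":
    for name, code, output in run_pipeline(PIPELINE):
        print(f"{name}: exit={code} output={output}")
```

Because the manifest is ordinary data kept under version control, a change to any step shows up in a normal code review, which is the property the paragraph above attributes to DVC scripts.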
If you are interested in learning more about the growing role of DevOps in modern Data Science and predictive analytics in business, there are good resources for your review below
In any case, DVC is going to be a useful instrument for filling the many gaps between classical, old-school, in-lab data science practices and the growing demands of business for solid DevOps processes and workflows that support mature, sustained data analytics.