In the Machine Learning (ML) field tools and techniques for best practices are just starting to be developed. At the PyTorch developer conference (PTDC-18), several speakers including Jerome Pesenti, VP of AI from Facebook and Andrej Karpathy, Director of Tesla AI spoke about best practices for machine learning development.
The image source: link
The issues discussed included applying traditional software development techniques like unit testing, CI/CD systems, automated deployment, version control, and more to the ML field. In this blog post, we will go over the best practices ideas from PTDC-18 and the future of ML tool developments.
In the PTDC-18 keynote speech, Jerome Pesenti described the motivation and goals of PyTorch project and what the future of machine learning looks like.
Regarding the future of ML, Jerome envisioned a “streamlined development, more accessible tools, breakthrough hardware, and more”. Talking about the gap huge gap between software engineering and ML engineering, Presenti said:
Machine learning engineering is where we were in Software Engineering 20 years ago. A lot of things still need to be invented. We need to figure out what testing means, what CD (continuous delivery) means, we need to develop tools and environments that people can develop robust ML that does not have too many biases and does not overfit.
In that gap lives many opportunities to develop new tools and services. We in the ML ecosystem are called upon to implement the future of machine learning tools. Traditional software engineering has many useful tools and techniques which can either be repurposed for Machine Learning development or used as a source for ideas in developing new tools.
PyTorch 1.0 implements one important engineering principle — “a seamless transition from AI research to production”. It helps to move AI technology from research into production as quickly as possible. In order to do that a few challenges were solved:
Andrej Karpathy from Tesla AI had a dedicated talk about best engineering practices in ML. He drew a contrast between traditional software development (software 1.0) with software utilizing Machine Learning techniques (software 2.0), saying that
“software 2.0 code also has new feature demands, contains bugs, and requires iterations.”
Meaning that ML development has a lifecycle similar to traditional software:
“When you are working with these [neural] networks in production you are doing much more than that [training and measuring models]. You maintaining the codebase and that codebase is alive is just like 1.0 code.”
Machine Learning models need to grow and develop feature-by-feature, bugs need to be found and fixed, and repeatable processes are a must, as in earlier non-ML software development practices.
Karpathy went on to describe how software 1.0 best practices can be used in software 2.0 (ML modeling):
This list of best practices shows how serious Tesla AI is about robust software which is not surprising for self-driving car area. Any company needs these practices in order to organize a manageable ML development process.
Frameworks and libraries like PyTorch make a significant step in machine learning tooling and bringing the best practices. However, frameworks and libraries might be not enough for many of the ML best practices. For example, dataset versioning, ML model versioning, continuous integration (CI) and continuous delivery (CD) requires manipulation and transferring data files. These can be done in a more efficient and natural way by data management tools and storage systems rather than libraries.
The need for a machine learning artifact manipulation tool with data file-centric philosophy was the major motivation behind open source project that we created — Data Version Control (DVC) or DVC.org.
DVC connects Git with data files and machine learning pipelines which helps keep version control on machine learning models and datasets using familiar Git semantics coupled with the power of cloud storage systems such as Amazon’s S3, Google’s GCS, Microsoft’s Azure or bare-metal servers accessed by SSH.
If PyTorch helps in organizing code inside an ML project then data-centric tools like DVC help organized different pieces of ML projects into a single workflow. The machine learning future requires both types of tools — code level and data file level.
Thus far only the first steps have been taken toward using machine learning tooling and the best machine learning practices. Mostly large companies are using these practices because they faced the problems a while ago. Best practices should be embraced by the entire industry which will help to bring machine learning to a higher new level.