October '20 Heartbeat

This month, hear about our international talks, new video docs on our YouTube channel, and the best tutorials from our community.

Elle O'Brien
October 12, 2020 • 5 min read

Double DeeVee! One of these birds is on a layover before heading to Germany.

News

Paweł gets ready to speak at Poland's largest data science meeting

DVC developer Paweł Redzyński (he's written a lot of the code behind dvc plots) is giving a talk at the Data Science Summit in Poland! The virtual meeting is on October 16, but talks are available for streaming on demand up to a week before. Paweł's talk is part of the DataOps & Development track, where he'll be talking about CML and GitHub Actions (note that it'll be delivered in English).

Dmitry talks at Data Engineering Melbourne

CEO Dmitry Petrov dropped into the Data Engineering Melbourne meetup to talk about Data Versioning and DataOps! He spoke about the differences between end-to-end platforms and ecosystems of tools, and how this distinction informs the development of software like DVC and CML (hint: we picked tools over platforms).

Keep an eye on this meetup, which is now accessible to folks on all continents thanks to the magic of the internet :)

Data Engineering Melbourne

Dmitry Petrov presents on DataOps and versioning.

meetup.com

Elle has talks at PyCon India and PyData Global

Last week I gave a talk about CML at PyCon India, and have another one coming up at PyData Global this November 11-15.

DevOps for science: using continuous integration for rigorous and reproducible analysis

PyData Global

https://global.pydata.org

PyData Global has a fantastic lineup of talks spanning science and engineering, so please consider joining!

DVC at DataFest

DVC Ambassador Mikhail Rozhkov co-hosted the Machine Learning REPA (Reproducibility, Experiments and Pipelines Automation) track of DataFest 2020, and DVC showed up in full force! There were talks from Dmitry, ambassador Marcel Ribeiro-Dantas, and myself about all aspects of MLOps and automation.

DataFest is over (until next year, anyway), but visit the ML-REPA community for ongoing content and opportunities for networking.

New videos

Since the summer, we've been building our YouTube channel. It's going great: we've gotten more than 18,000 views in the last few months and 1,500 subscribers!

Our latest video in the MLOps Tutorials series introduced using GitHub Actions for model testing. Instead of training a model in continuous integration, the idea is to train locally and "check in" your favorite model for testing in a standardized environment. This approach lets you completely control the environment, infrastructure, and code used to evaluate your model, and save the run in a place that's easy to share (GitHub!).
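To make the "train locally, test in CI" idea concrete, here's a minimal sketch of the kind of check a CI job could run against a checked-in model. Everything in it is an assumption for illustration, not the code from the video: the "model" is a single threshold stored in a hypothetical `model.json`, the held-out examples are made up, and `check_model` and its `min_accuracy` cutoff are hypothetical names.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical "model": one decision threshold stored as JSON, standing in
# for whatever artifact you trained locally and checked in.
def save_model(path, threshold):
    Path(path).write_text(json.dumps({"threshold": threshold}))

def load_model(path):
    return json.loads(Path(path).read_text())

def predict(model, x):
    # Predict class 1 when the input reaches the stored threshold.
    return int(x >= model["threshold"])

def accuracy(model, examples):
    correct = sum(predict(model, x) == label for x, label in examples)
    return correct / len(examples)

# Fixed held-out examples (made up for this sketch). Keeping them fixed means
# every checked-in model is scored against the same data.
HOLDOUT = [(0.1, 0), (0.4, 0), (0.6, 1), (0.9, 1), (0.45, 1)]

def check_model(path, min_accuracy=0.8):
    """Load the checked-in model and report (score, passed)."""
    model = load_model(path)
    score = accuracy(model, HOLDOUT)
    return score, score >= min_accuracy

if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        model_path = Path(tmp) / "model.json"
        save_model(model_path, threshold=0.5)  # pretend this came from local training
        score, passed = check_model(model_path)
        print(f"accuracy={score:.2f} passed={passed}")
        if not passed:
            raise SystemExit(1)  # a non-zero exit code fails the CI job
```

In a real project, the script would read the model file from the repository instead of writing a temporary one, and the CI workflow would simply run it on every push.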

We'll be going deeper into the art and craft of testing ML models in the next few weeks, so stay tuned. Another big initiative is adding videos to our docs: since video seems like a popular format for a lot of learners, we're working to supplement our official docs with embedded videos. Check out our first installment on Getting Started with Data Versioning.

From the community

Our community makes some amazing tutorials. Here are a few on our radar:

Data scientist and full-stack developer Ashutosh Hathidara shared an end-to-end machine learning project made with DVC and CML… and released it in video form! It's a neat setup and a nice model for folks to study.

Another detailed and easy-to-follow tutorial, with a similarly impressive scope, appeared on Heise Online. This project puts together DVC, Cortex, and ONNX to develop and deploy a model trained on the Fashion MNIST dataset (note: the article is in German, and I read it with Chrome's English translation).

Managing and commissioning ML models

Tools like DVC and Cortex, which are designed for the operationalization of AI projects, are intended to help developers deploy models in production.

https://heise.de

You'll also want to check out anno.ai's tutorial about managing large datasets with DVC and S3 storage. It's detailed, but it also works as a quick-start guide informed by the team's practical experience.

MLOps and Data: Managing Large ML Datasets with DVC and S3 (Part 1)

A quick start guide to version control for machine learning data

medium.com/@anno.ai

Data scientist and mathematician Khuyen Tran blogged about why and how to start using DVC, and her tutorial includes Google Drive remote storage, a feature we're especially excited about. Check it out and follow along with her code examples!

Introduction to DVC: Data Version Control Tool for Machine Learning Projects

Just like Git, but with Data!

medium.com

And to end on a thoughtful note… have you seen this thread by ML Engineer Shreya Shankar? She beautifully summarizes many of the ideas and technical challenges our community thinks about every day. Read and reflect!

In good software practices, you version code. Use Git. Track changes. Code in master is ground truth.

In ML, code alone isn't ground truth. I can run the same SQL query today and tomorrow and get different results. How do you replicate this good software practice for ML? (1/7)
— Shreya Shankar (@sh_reya) October 8, 2020