May '20 DVC❤️Heartbeat

Every month we share news, findings, interesting reads, community takeaways, and everything else along the way.

Look here for updates about DVC, our journey as a startup, projects by our users and big ideas about best practices in ML and data science.

Elle O'Brien
May 14, 2020 • 7 min read

A big hello from DVC mascot DeeVee.

Welcome to the May Heartbeat, our monthly roundup of cool happenings, new releases, good reads and other noteworthy developments in the DVC community.

News

DVC turns 3. On May 4th, we celebrated DVC's third birthday! Fearless leader Dmitry Petrov wrote a retrospective about how the team has grown and what we've learned from our users, contributors, and colleagues. Thanks to everyone who celebrated with us!

Ambassador program launched. DVC has just kicked off our ambassador program with the help of our first ambassador, Marcel Ribeiro-Dantas. Marcel is an early-stage researcher at the Institut Curie, a veteran ambassador of the Fedora Project, and a data science blogger. Becoming an ambassador is a way for folks who are passionate about contributing to the DVC community to get recognized for their efforts. It's also a way for us to help volunteers with financial support for meetups and travel, as well as chances to work more closely with our team. The program is ideal for anyone who already likes blogging about DVC, contributing code, and hosting get-togethers (virtual or otherwise), but especially advanced students and early career data scientists and engineers! Learn more about it here.

DVC is part of 2020 Google Season of Docs. Another way to get involved with DVC is through Google Season of Docs, a program we're participating in for the second year in a row. This program is for technical writers to get paid experience working with the DVC team in fall 2020. Right now, we're accepting proposals from interested writers. Find out more here.

5000 GitHub Stars. It finally happened: we passed 5,000 stars on our GitHub repo!

New releases

Coincident with DVC's 3rd birthday, we shared a pre-release of DVC 1.0. The release is expected in a few weeks, but you can experiment with 1.0 now (and make tickets in our project repo if you get a bug 🐛). Some major new features include:

Run cache, a cache of pipelines you've reproduced on your local workspace. If you re-run dvc repro on a pipeline version that's already been executed, run cache will save you compute time by returning the cached result.
Multi-stage DVC files. Users reported that their DVC pipelines changed a lot, so we've made pipeline .dvc files more human-readable and editable for fast redesigns.
Plots. We've got plots powered by Vega-Lite for making beautiful visualizations comparing model performance across commits! Developer Paweł Redzyński is hard at work:

Visual aids come to DVC 1.0, with my little help. pic.twitter.com/Fd1qVr7rHb
— Pablito (@Paffciu1) May 12, 2020

You can read more about the big updates coming in DVC 1.0 in our birthday blog.
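To make the run cache idea a little more concrete, here is a toy sketch in Python. It is not DVC's implementation, and every name in it (`ToyRunCache`, `stage_key`, the `train` function, the file contents) is made up for illustration. It only assumes the general idea described above: if a stage's command and inputs match something that has already been executed, return the stored result instead of recomputing it.

```python
import hashlib
import json


def stage_key(command, dep_contents):
    """Hash the command together with the contents of every dependency.

    Hypothetical helper: a real tool would hash files on disk instead of
    strings held in memory.
    """
    payload = json.dumps({"cmd": command, "deps": dep_contents}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class ToyRunCache:
    """In-memory stand-in for a run cache (illustration only)."""

    def __init__(self):
        self._results = {}
        self.executions = 0  # counts how often a stage is actually run

    def run(self, command, dep_contents, compute):
        key = stage_key(command, dep_contents)
        if key not in self._results:
            # Cache miss: this command/input combination is new, so run it.
            self.executions += 1
            self._results[key] = compute(dep_contents)
        # Cache hit (or freshly stored result): no recomputation needed.
        return self._results[key]


def train(deps):
    """Pretend training step: just count the rows of the dataset."""
    return {"n_rows": len(deps["data.csv"].splitlines())}


cache = ToyRunCache()
deps_v1 = {"data.csv": "a,b\n1,2\n"}
deps_v2 = {"data.csv": "a,b\n1,2\n3,4\n"}

first = cache.run("python train.py", deps_v1, train)    # executes
changed = cache.run("python train.py", deps_v2, train)  # data changed: executes
again = cache.run("python train.py", deps_v1, train)    # seen before: cached
```

In this sketch the third call goes back to a version that has already been run, so `train` is only executed twice in total. That is the compute-time saving the run cache is meant to provide.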

From the community

Developers weren't the only ones hustling this month…

First ever virtual DVC Meetup. Marcel, our new ambassador, led an initiative to organize a virtual meetup! Marcel shared his latest scientific work about creating a new comprehensive dataset about mobility during the COVID-19 pandemic and then passed off the mic to our two guest speakers. Data scientist Elizabeth Hutton spoke about how she was building a workflow for her NLP team with DVC, and DAGsHub co-founder Dean Pleban shared his custom remote file system setup for modeling Reddit post popularity. It was quite well-attended for our first ever virtual hangout: we logged 40 individual logins to the meetup with more than 30 people staying the whole time! A video of the meetup is on the event page, so you can still check out the talks and discussion we enjoyed.

It was awesome speaking at the @DVCorg meetup about @reddit post popularity prediction and DVC #remote working file systems. Also a lot of #DAGs. pic.twitter.com/5WKTlIEvHK
— Dean 🐶 (@DeanPlbn) May 7, 2020

Some blogs we like. As usual, there's a lot of share-worthy writing in the data science and MLOps space:

Tania Allard wrote an intensely readable, extremely sharp guide to practical steps anyone can take to improve the reproducibility of their ML projects. She really nails the complexity of the workflow and the importance of decoupling code and data (which we obviously agree with very much 😏). The graphics are also 💯, and Tania is a developer advocate to follow.

10 top tips for reproducible Machine Learning

The one where you get some advice to make your workflows more reproducible

dev.to

Vimarsh Karbhari blogged about how teams that work with data can strategize better about versioning their data and analysis pipelines. At the opposite end of the spectrum from very practical recommendations, Vimarsh stresses a deliberate and careful approach. He emphasizes how the team's choices should depend on factors like project maturity and how much flexibility is going to be needed. It's a solid overview of how to begin thinking about MLOps at a high level.

ML Ops: Data Science Version Control

Data versioning primer for model, data and code.

medium.com

Over at AutoRegressed, Jack Pitts shared a thorough tutorial about using Pipenv, DVC and Git together. As a trio, these tools manage dependencies and version the working environment, source code, dataset and trained models. It's not only a cool use case, but a very clear step-by-step explanation that should be easy to try at home. Stay till the end for a neat trick about deploying a model as a web service with Pipenv and DVC.

Pipenv and DVC: Reproducibility in Data Science

Without standards and tools to easily reproduce models, Data Science teams can become bogged down in technical debt that will make it difficult to deploy and iterate on models.

autoregressed.com

Nice tweets

Last, here are some of our favorite tweets to read this past month:

Data version control from @DVCorg is one of the best new tools I've used in a while. Moving data via the cloud is just a push or pull command away.

Recommend for anyone who works on multiple machines or shares data with collaborators
— Liam Brannigan (@braaannigan) May 6, 2020

I'm using @DVCorg for a project for the first time ever today; this thing is hella cool, check it out. https://t.co/evvrfHnW3U
— Josh Wills (@josh_wills) April 13, 2020

Getting around to learning @DVCorg, and loving it so far. Versioning data with git-style semantics gives you a lot of functionality with surprisingly little cognitive overhead.
— Tim Garvin (@tcgarvin) May 8, 2020

Thank you, thank you very much.

As always, we want to hear what you're making with DVC and what you're reading. Tell us in the blog comments, and get in touch on Twitter and our Discord channel. Happy coding!