Join DVC for Google Summer of Code 2020

DVC is looking for students to take part in Google Summer of Code 2020.

Elle O'Brien
February 04, 2020 • 5 min read

Announcement, announcement! After a successful experience with Google Season of Docs in 2019, we're putting out a call for students to apply to work with DVC as part of Google Summer of Code. If you want to make a dent in open source software development with mentorship from our team, read on.

Prerequisites to apply

Besides the general requirements to apply to Google Summer of Code, there are a few skills we look for in applicants.

Python experience. All of our core development is done in Python, so we prefer candidates that are experienced in Python. However, we will consider applicants who are very strong in another language and familiar with Python basics.
Git experience. Git is also a key part of DVC development, as DVC is built around Git; that said, for certain projects (rated as “Beginner”) a surface-level knowledge of Git will be sufficient.
People skills. Beyond technical fundamentals, we put a high value on communication skills: the ability to report and document your experiments and findings, to work kindly with teammates, and explain your goals and work clearly.

If you like our mission but aren't sure if you're sufficiently prepared, please be in touch anyway. We'd love to hear from you.

Project ideas

Below are several project ideas that are an immediate priority for the core DVC team. Of course,we welcome students to create their own proposals, even if they differ from our ideas. Projets will be primarily mentored by co-founders Dmitry Petrov and Ivan Shcheklein.

Migrate to the latest v3 API to improve Google Drive support. Our organization is a co-maintainer of the PyDrive library in collaboration with a team at Google. The PyDrive library is now several years old and still relies on the v2 protocol. We would like to migrate to v3, which we expect will boost performance for many DVC use cases (e.g. the ability to filter fields being retrieved from our API, etc). For this project, we’re looking for a student to work with us to prepare the next major version of the PyDrive library, as well as making important changes to the core DVC code to support it. Because PyDrive is broadly used outside of DVC, this project is a chance to work on a library of widespread interest to the Python community.

Skills required: Python, Git, experience with APIs
Difficulty rating: Beginner-Medium
Introducing parallelism to DVC. One of DVC’s features is the ability to create pipelines, linking data repositories with code to process data, train models, and evaluate model metrics. Once a DVC pipeline is created, the pipeline can be shared and re-run in a systematic and entirely reproducible way. Currently, DVC executes pipelines sequentially, even though some steps may be run in parallel (such as data preprocessing). We would like to support parallelization for pipeline steps specified by the user. Furthermore, we’ll need to support building flags into DVC commands that specify the level of parallelization (CPU, GPU or memory).

Skills required: Python, Git. Some experience with parallelization and/or scientific computing would be helpful but not required.
Difficulty rating: Advanced
Developing use cases for data registries and ML model zoos. A new DVC functionality that we’re particularly excited about is summon, a method that can turn remotely-hosted machine learning artifacts such as datasets, trained models, and more into objects in the user’s local environment (such as a Jupyter notebook). This is a foundation for creating data catalogs of data-frames and machine learning model zoos on top of Git repositories and cloud storages (like GCS or S3). We need to identify and implement model zoos (think PyTorch Hub, the Caffe Model Zoo, or the TensorFlow DeepLab Model Zoo) and data registries for types that are not supported by DVC yet. Currently, we’ve tested summon with PyTorch image segmentation models and Pandas dataframes. We’re looking for students to explore other possible use cases.

Skills required: Python, Git, and some machine learning or data science experience
Difficulty rating: Beginner-Medium
Continuous delivery for JetBrains TeamCity. Continuous integration and continuous delivery (CI/CD) for ML projects is an area where we see DVC make a big impact- specifically, by delivering datasets and ML models into CI/CD pipelines. While there are many cases when DVC is used inside GitHub Actions and GitLab CI, you will be transferring this experience to another type of CI/CD system, JetBrains TeamCity. We're working to integrate DVC's model and dataset versioning into TeamCity's CI/CD toolkit. This project would be ideal for a student looking to explore the growing field of MLOps, an offshoot of DevOps with the specifics of ML projects at the center.

Skills required: Python, Git, bash scripting. It would be nice, but not necessary, to have some experience with CI/CD tools and developer workflow automation.
Difficulty rating: Medium-Advanced
DVC performance testing framework. Performance is a core value of DVC. We will be creating a performance monitoring and testing framework where new scenarios (e.g., unit testing)can be populated. The framework should reflect all performance improvements and degradations for each of the DVC releases. It would be especially compelling if testing could be integrated with our GitHub workflow (CI/CD). This is a great opportunity for a student to learn about DVC and versioning in-depth and contribute to its stability.

Skills required: Python, Git, bash scripting.
Difficulty rating: Medium-Advanced

If you'd like to apply

Please refer to the Google Summer of Code application guides for specifics of the program. Students looking to know more about DVC, and our worldwide community of contributors, will learn most by visiting our Discord channel, GitHub repository, and Forum. We are available to discuss project proposals from interested students and can be reached by email or on our Discord channel.