Announcement, announcement! After a successful experience with
Google Season of Docs in 2019,
we're putting out a call for students to apply to work with DVC as part of
Google Summer of Code. If you want to
make a dent in open source software development with mentorship from our team,
read on.
Prerequisites to apply
Besides the general requirements to apply to Google Summer of Code, there are a
few skills we look for in applicants.
Python experience. All of our core development is done in Python, so we
prefer candidates that are experienced in Python. However, we will consider
applicants who are very strong in another language and familiar with Python
basics.
Git experience. Git is also a key part of DVC development, as DVC is
built around Git; that said, for certain projects (rated as “Beginner”) a
surface-level knowledge of Git will be sufficient.
People skills. Beyond technical fundamentals, we put a high value on
communication skills: the ability to report and document your experiments and
findings, to work kindly with teammates, and explain your goals and work
clearly.
If you like our mission but aren't sure if you're sufficiently prepared, please
be in touch anyway. We'd love to hear from you.
Project ideas
Below are several project ideas that are an immediate priority for the core DVC
team. Of course,we welcome students to create their own proposals, even if they
differ from our ideas. Projets will be primarily mentored by co-founders
Dmitry Petrov and
Ivan Shcheklein.
Migrate to the latest v3 API to improve Google Drive support. Our
organization is a co-maintainer of the PyDrive library in collaboration with
a team at Google. The PyDrive library is now several years old and still
relies on the v2 protocol. We would like to migrate to v3, which we expect
will boost performance for many DVC use cases (e.g. the ability to filter
fields being retrieved from our API, etc). For this project, we’re looking
for a student to work with us to prepare the next major version of the
PyDrive library, as well as making important changes to the core DVC code to
support it. Because PyDrive is broadly used outside of DVC, this project is a
chance to work on a library of widespread interest to the Python community.
Skills required: Python, Git, experience with APIs Difficulty rating: Beginner-Medium
Introducing parallelism to DVC. One of DVC’s features is the ability to
create pipelines, linking data repositories with code to process data, train
models, and evaluate model metrics. Once a DVC pipeline is created, the
pipeline can be shared and re-run in a systematic and entirely reproducible
way. Currently, DVC executes pipelines sequentially, even though some steps
may be run in parallel (such as data preprocessing). We would like to support
parallelization for pipeline steps specified by the user. Furthermore, we’ll
need to support building flags into DVC commands that specify the level of
parallelization (CPU, GPU or memory).
Skills required:
Python, Git. Some experience with parallelization and/or scientific computing
would be helpful but not required. Difficulty rating: Advanced
Developing use cases for data registries and ML model zoos. A new DVC
functionality that we’re particularly excited about is summon, a method
that can turn remotely-hosted machine learning artifacts such as datasets,
trained models, and more into objects in the user’s local environment (such
as a Jupyter notebook). This is a foundation for creating data catalogs of
data-frames and machine learning model zoos on top of Git repositories and
cloud storages (like GCS or S3). We need to identify and implement model zoos
(think PyTorch Hub, the Caffe Model Zoo, or the TensorFlow DeepLab Model Zoo)
and data registries for types that are not supported by DVC yet. Currently,
we’ve tested summon with PyTorch image segmentation models and Pandas
dataframes. We’re looking for students to explore other possible use cases.
Skills required: Python, Git, and some machine learning or
data science experience Difficulty rating: Beginner-Medium
Continuous delivery for JetBrains TeamCity. Continuous integration and
continuous delivery (CI/CD) for ML projects is an area where we see
DVC make a big impact-
specifically, by delivering datasets and ML models into CI/CD pipelines.
While there are many cases when DVC is used inside GitHub Actions and GitLab
CI, you will be transferring this experience to another type of CI/CD system,
JetBrains TeamCity. We're working to
integrate DVC's model and dataset versioning into TeamCity's CI/CD toolkit.
This project would be ideal for a student looking to explore the growing
field of MLOps, an offshoot of DevOps with the specifics of ML projects at
the center.
Skills required: Python, Git, bash scripting. It
would be nice, but not necessary, to have some experience with CI/CD tools
and developer workflow automation. Difficulty rating:
Medium-Advanced
DVC performance testing framework. Performance is a core value of DVC. We
will be creating a performance monitoring and testing framework where new
scenarios (e.g., unit testing)can be populated. The framework should reflect
all performance improvements and degradations for each of the DVC releases.
It would be especially compelling if testing could be integrated with our
GitHub workflow (CI/CD). This is a great opportunity for a student to learn
about DVC and versioning in-depth and contribute to its stability.
Please refer to the
Google Summer of Code application guides
for specifics of the program. Students looking to know more about DVC, and our
worldwide community of contributors, will learn most by visiting our
Discord channel,
GitHub repository, and
Forum. We are available to discuss project proposals
from interested students and can be reached by email
or on our Discord channel.