VSCode for data science

There are several well-known challenges with using Jupyter notebooks for data science on a team, but many people don't know that VSCode addresses many of those shortcomings via its free Jupyter and Live Share plugins, its new devcontainer functionality, and GitHub's paid Codespaces offering. In the last 18 months, Microsoft has significantly increased its investment in these tools. Let's look at each challenge and see how VSCode and GitHub address it.

Collaboration

VSCode has a Live Share plugin that enables Google-Docs-style real-time collaboration on Jupyter notebooks and on code files in general. All you need is a GitHub account. I've used it before and it's quite good. Here's a demo:

Version Control

VSCode has excellent notebook diffing support that hides the messy JSON of the notebook format. Here's what it looks like:

[Image: diff]

You can diff notebooks locally by opening the source control view in VSCode, and you can diff notebooks in a PR with the GitHub Pull Requests and Issues plugin.
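To get a feel for why raw notebook diffs are noisy, here's a minimal sketch of what sits inside an .ipynb file. The notebook below is made up, and `code_cell_sources` is a hypothetical helper of my own, not anything VSCode runs internally. It just shows that the code you care about is a small part of the JSON, surrounded by outputs, execution counts, and metadata.

```python
import json

# A tiny hand-written notebook in the nbformat 4 layout (contents are made up).
notebook = {
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": ["# Load data"]},
        {
            "cell_type": "code",
            "execution_count": 1,
            "metadata": {},
            "outputs": [
                {"name": "stdout", "output_type": "stream", "text": ["3\n"]}
            ],
            "source": ["x = 1 + 2\n", "print(x)"],
        },
    ],
    "metadata": {},
    "nbformat": 4,
    "nbformat_minor": 5,
}


def code_cell_sources(nb: dict) -> list:
    """Return the source text of each code cell, ignoring outputs and metadata."""
    sources = []
    for cell in nb.get("cells", []):
        if cell.get("cell_type") == "code":
            src = cell.get("source", "")
            # The format allows source to be one string or a list of line strings.
            sources.append(src if isinstance(src, str) else "".join(src))
    return sources


# This is roughly what git sees when it diffs the file as text.
raw_json = json.dumps(notebook, indent=1)
print(len(raw_json.splitlines()), "lines of JSON on disk")

# This is the part a reviewer usually wants to read.
for source in code_cell_sources(notebook):
    print(source)
```

Re-running a cell changes the outputs and execution count even when the code is identical, which is why a notebook-aware diff is so much easier to review than a text diff.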

Reproducibility

"It works on my machine" is the opposite of reproducibility, and Microsoft's devcontainers spec is designed to address this exact issue. It's a docker-based, open standard that's specifically designed to work with IDEs like VSCode.

Its developer focus makes it easier to create a container that captures all the Python dependencies used for a particular data science project, and it's possible to do this without having to know much about Docker. In fact, with their devcontainer templates, you can spin up a container based on miniconda through the VSCode UI without touching a Dockerfile at all:

[Image: conda container]
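If you'd rather see what ends up on disk, here's a rough sketch of a script that writes a .devcontainer/devcontainer.json for a conda-based project. Treat the specifics as assumptions on my part: the image tag, the two extension IDs, and the `postCreateCommand` (which assumes your repo has an environment.yml) are illustrative choices, not what the VSCode template necessarily generates. The function names are hypothetical too.

```python
import json
from pathlib import Path


def devcontainer_config(project_name: str) -> dict:
    """Build a minimal devcontainer.json payload for a conda-based project."""
    return {
        "name": project_name,
        # Assumed image tag; check the devcontainers image list for current tags.
        "image": "mcr.microsoft.com/devcontainers/miniconda:3",
        "customizations": {
            "vscode": {
                # Python and Jupyter extensions installed inside the container.
                "extensions": ["ms-python.python", "ms-toolsai.jupyter"]
            }
        },
        # Assumes an environment.yml at the repo root listing your dependencies.
        "postCreateCommand": "conda env update --name base --file environment.yml",
    }


def write_devcontainer(project_dir: str, project_name: str) -> Path:
    """Write .devcontainer/devcontainer.json under project_dir and return its path."""
    target = Path(project_dir) / ".devcontainer" / "devcontainer.json"
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(json.dumps(devcontainer_config(project_name), indent=2) + "\n")
    return target


# Preview the config without touching the filesystem ("churn-model" is a made-up name).
print(json.dumps(devcontainer_config("churn-model"), indent=2))
```

Calling `write_devcontainer(".", "churn-model")` from the repo root would create the file. Once it's committed, anyone who opens the repo in VSCode gets prompted to reopen it in the container.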

Scaling Computation

The devcontainer spec works seamlessly with another GitHub/Microsoft offering called "Codespaces." Once you've got a container specified, Codespaces can run your dev environment in the cloud on whatever hardware you like (including GPUs).

This isn't free, but it's quite reasonably priced: the first 120 core hours each month are free, and after that you can get a quad-core machine for $0.36/hr. If you ran your Codespace for the entire work week, every week of the month, that's only about 47 dollars a month. It's unlikely your Codespace will run 40 hours a week, however, since your data scientists have other work aside from coding and Codespaces automatically stop running after 30 minutes of inactivity.
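Here's the back-of-the-envelope math behind that 47 dollar figure as a small sketch. The rate and the free allowance come from the numbers above. The assumptions are mine: that the free core hours reset monthly, that a month is four 40-hour weeks, and that free core hours convert to machine hours by dividing by the core count. `monthly_cost` is a hypothetical helper, not a GitHub API.

```python
FREE_CORE_HOURS = 120   # free allowance quoted above (assumed to reset monthly)
RATE_PER_HOUR = 0.36    # quad-core machine price quoted above, in dollars
CORES = 4


def monthly_cost(hours_used: float,
                 cores: int = CORES,
                 rate_per_hour: float = RATE_PER_HOUR,
                 free_core_hours: float = FREE_CORE_HOURS) -> float:
    """Estimate a month's bill in dollars for one machine."""
    # 120 free core hours on a 4-core machine is 30 free machine hours.
    free_machine_hours = free_core_hours / cores
    billable_hours = max(0.0, hours_used - free_machine_hours)
    return round(billable_hours * rate_per_hour, 2)


# Four 40-hour weeks: 160 hours used, 30 free, 130 billable.
full_time = monthly_cost(4 * 40)
print(f"Full-time use: ${full_time:.2f} per month")

# A more realistic 15 hours a week stays close to the free allowance.
part_time = monthly_cost(4 * 15)
print(f"15 hours a week: ${part_time:.2f} per month")
```

Storage is billed separately and isn't included here, so treat the output as a rough compute estimate only.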

Polyglot Workflows

Data scientists often need to work across Python and SQL, and while this is very possible using the IPython SQL magic, Microsoft is working on an even smoother experience via Polyglot Notebooks. This isn't released yet, but it looks promising.


If you need help getting started with VSCode, devcontainers, or Codespaces for your data science team, shoot me an email. You can reach me at matt@datachimp.app