Why data chimp?

chimp on a laptop

I'm building a data analysis code assistant that plugs into your data notebook, analyzes your code as you type it, and shows visualizations and info you need to analyze your data quickly and accurately. Why? Why am I unsatisfied with our existing tools for data analysis?

My dissatisfaction started when I made a mistake in an analysis while I was at Heap. There were two columns I could have used for my analysis, and the one I used happened to contain many missing values. As a result, we had to tell our stakeholders that the shiny new insight we found for them may actually be fake news.1

The obvious lesson from this experience is to always check for missing values while doing an analysis, but as a former software engineer, I found this unsatisfying. Good software engineers don't manually check for every single bug every time they start writing code. They have automated tests that check correctness for them.

What I really wanted was something like an automated test for missing values. I wanted to be able to write code that said something like, "If Matt is using a column in his analysis that contains many missing values, show him a visualization that suggests there could be a problem." The visualization could look like this:

example missing values viz

But ideally, I'd want to be able to customize the visualization shown.
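For concreteness, here's a rough sketch of what that kind of check might look like in pandas. The function name and threshold are invented for illustration; this is not data chimp's actual API:

```python
import pandas as pd

def warn_if_many_missing(df: pd.DataFrame, column: str, threshold: float = 0.1) -> bool:
    """Hypothetical check: warn when more than `threshold` of the
    values in `column` are missing. Returns True if it warned."""
    missing_rate = df[column].isna().mean()
    if missing_rate > threshold:
        print(f"Warning: {column!r} is {missing_rate:.0%} missing")
        return True
    return False

# Toy data: 3 of 5 revenue values are missing.
df = pd.DataFrame({"revenue": [100, None, None, 250, None]})
warn_if_many_missing(df, "revenue")  # prints a warning (60% missing)
```

The difference is that data chimp would run this kind of check for you, automatically, as you type.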

Since I didn't have this, I settled for habitually using libraries that gave me a compact overview of my data set so I could spot any data quality issues or find a starting point for analysis. Pandas profiling for python and skimr for R are good examples here.2

Although these libraries are nice, I was often annoyed at having to sift through lots of information that I didn't actually care about. For example, I basically never care about n_missing values in skimr::skim's output3:

example skimr output

Since I have the complete_rate in the table already, I don't care to see the raw number of missing values.

This extra information and sifting isn't merely annoying. It undermines the purpose of these libraries. If we're presented with too much information, we're likely to miss the things that matter.

I briefly considered writing my own overview function, but I knew that the information I'd want to see would depend on what I was doing at a particular moment. It would depend on the code I was writing and the data I was working with. A function just wasn't flexible enough.
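To illustrate the problem, a hand-rolled overview function might have looked something like this sketch (the function name and the choice of stats are mine, not from any library):

```python
import pandas as pd

def my_skim(df: pd.DataFrame) -> pd.DataFrame:
    # Only the stats I usually care about -- but the set is frozen
    # at write time, so it can't adapt to the code I'm writing
    # or the data I'm working with at a given moment.
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "complete_rate": 1 - df.isna().mean(),
        "n_unique": df.nunique(),
    })

df = pd.DataFrame({"a": [1, None, 3], "b": ["x", "y", "y"]})
print(my_skim(df))
```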

There was another problem with overview functions: I couldn't iterate on the visualizations they generated. This was an issue for digging deeper into data quality issues, but it was also a problem for using these functions to get a head start on finding insights. We often need to tweak visualizations to find the story in the data (e.g., add a log-scale, color data points by some other column).
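Here's the kind of tweak I mean, sketched with matplotlib on made-up data (the columns and values are invented for illustration):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend for scripts
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({
    "users": [10, 100, 1_000, 10_000],
    "revenue": [5, 40, 900, 8_000],
    "plan": ["free", "free", "pro", "pro"],
})

fig, ax = plt.subplots()
# Tweak 1: color the points by another column.
for plan, group in df.groupby("plan"):
    ax.scatter(group["users"], group["revenue"], label=plan)
# Tweak 2: switch to log scales so the spread is visible.
ax.set_xscale("log")
ax.set_yscale("log")
ax.legend()
fig.savefig("revenue.png")
```

With an overview function, none of these small edits are possible, because the plotting code is hidden inside the library.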

Because overview functions didn't let me iterate on my visualizations, I often needed to write visualization code from scratch. Again, my inner software engineer hated this. One of my first lessons as a software engineer was that if you're writing the same code over and over again, you're missing something.

There's one more thing that irked me about data notebooks. Whenever I wrote buggy data wrangling code, I needed to interrupt my analysis and write some ad-hoc visualization or debugging code to diagnose the issue. Instead of interrupting my flow, I wanted to automatically see a visualization or computation that would give me real-time feedback on the correctness of my code.

Here's an example. Suppose you're working with a data set about penguins and you notice that sometimes the species of penguin is entered incorrectly. Maybe "Chinstrap" is occasionally spelled with an extra "p" and "Adelie" is sometimes missing the "i." You'll need to write some string replace or regular expression code to correct this. Wouldn't it be nice to see the number of distinct values for the species column shrink as you wrote your string replace code to remove the typos?
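Sketching that cleanup in pandas, with toy data invented to match the typos described above:

```python
import pandas as pd

# Hypothetical penguin data containing the two typos.
df = pd.DataFrame({
    "species": ["Adelie", "Adele", "Chinstrap", "Chinstrapp", "Gentoo"],
})

print(df["species"].nunique())  # 5 distinct values before cleaning

df["species"] = df["species"].replace(
    {"Adele": "Adelie", "Chinstrapp": "Chinstrap"}
)

print(df["species"].nunique())  # 3 distinct values after cleaning
```

Watching that distinct count tick down from 5 to 3 as you write the replacements is exactly the kind of real-time feedback I wanted.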

So, to sum it all up: I'm building data chimp because I wanted customizable visualizations shown automatically when certain things were true about the code I was writing and the data I was working with. Moreover, I wanted the ability to get the code that generated these automatic visualizations, so that I could easily iterate on them as I worked my way towards insight. I wanted these things so I could quickly analyze data with confidence.

These features are just the beginning of what's possible with data chimp. What if we showed team-created documentation about the tables and columns you're working with? What if we showed which columns and tables were popular on your team? What if we made it easy to create dbt tests or Great Expectations assertions from within your data notebook?

We're still thinking about how to best implement data chimp, but to see our first crack at some of these features in action, check out our landing page and demo video.

  1. Thankfully, when we reran the analysis with the correct column, our insight still proved true.
  2. There are lots of other similar libraries. Here are some other high-quality ones I know about: Lux, Data Explorer, and Dtale. If I'm leaving out your favorite one, let me know!
  3. I also never cared about the standard deviation. If I wanted to see spread, I'd look at a histogram.