Data project highlights from across the web

Serendipity led me to them; now I'd like to share them!

May 31, 2021

Hello, fellow datanistas!

This month I’d like to share some cool data projects that I came across in my random walks across the internet. As I’ve shared with others in private chats, I think intentionally scheduled random walks with notetaking are an awesome way to discover what’s going on in corners of the internet that aren’t algorithmically curated for us. At least for me, they allow me to see new things and learn from new people.

Without further ado, here we go!

Building tools to interact with your data

This is a pretty cool blog post by Scott Condron on how he used Panel to build tools to interact with his data. I like it because it gives a concrete example of the idea that we data scientists really ought to be exploring our data beyond pre-baked statistical visualizations. If we're dealing with a unique data modality, the visualizations we need might not be available in open-source packages. That means we might need to build our own tools. So come check out Scott's blog post for an example to learn from!

Photographing Rembrandt at high resolution

If you have not yet caught the PyCon 2021 Keynote talk by Rob Erdmann, you probably should when it gets released! That talk blew my mind! It showed how much care he and his team put into the computation and into technical precision. The result was a high fidelity, colour-accurate, and optically clear photograph of Rembrandt's Night Watch. The YouTube video is not yet out, but you can follow Rob on Twitter and view the photograph online. (Be sure to zoom in; the level of detail in there is _amazing_.)

Interpreting predictive models for causal insights? Take care!

This article by Scott Lundberg and colleagues from Microsoft showcases some pitfalls in extracting causal insights from machine learning models that are not explicitly constructed with causal paths in mind. The article reminds me of another article by Adam Kelleher. There, he gives an example of how carefully working out causal paths in our datasets helps us automatically solve prediction problems much better.
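To make the pitfall concrete, here is a minimal simulation of my own (it is not taken from either article). It assumes a hypothetical hidden confounder `z` that drives both `x` and `y`, while `x` itself has no causal effect on `y`. A naive regression of `y` on `x` still reports a sizeable slope; adjusting for `z` makes it vanish.

```python
import random


def slope(xs, ys):
    """Ordinary least-squares slope of ys regressed on xs (with intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    return cov / var


def simulate(n=20_000, seed=0):
    """Return (naive, adjusted) slopes of y on x for a confounded toy dataset."""
    rng = random.Random(seed)
    # Assumption: z is an unobserved confounder that drives both x and y.
    z = [rng.gauss(0, 1) for _ in range(n)]
    x = [zi + rng.gauss(0, 1) for zi in z]
    # y depends on z only, so the true causal effect of x on y is zero.
    y = [2 * zi + rng.gauss(0, 1) for zi in z]

    naive = slope(x, y)

    # Adjust for z: regress z out of both x and y, then regress the residuals.
    b_xz = slope(z, x)
    b_yz = slope(z, y)
    x_res = [xi - b_xz * zi for xi, zi in zip(x, z)]
    y_res = [yi - b_yz * zi for yi, zi in zip(y, z)]
    adjusted = slope(x_res, y_res)
    return naive, adjusted


naive, adjusted = simulate()
print(f"naive slope: {naive:.2f}, slope after adjusting for z: {adjusted:.2f}")
```

The naive slope comes out well away from zero even though intervening on `x` would do nothing to `y`, which is the kind of trap the article warns about.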

A machine-guided tool for de-duplicating our data

Deduplipy came up on my GitHub feed, and I found it a pretty neat application of machine learning. This adds to my collection of examples that illustrate where we will realize the value of machine learning: I believe it will be in tooling that enables new applications. ML is best used when surgically applied in specialized use cases, such as the de-duplication of data.
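For a feel of the underlying problem, here is a rough sketch that flags likely duplicate records with plain string similarity from the standard library. To be clear, this is not Deduplipy's API or its method; the `similarity` and `candidate_duplicates` helpers, the `0.85` threshold, and the sample records are all made up for illustration.

```python
from difflib import SequenceMatcher
from itertools import combinations


def similarity(a, b):
    """Similarity ratio in [0, 1] between two strings, ignoring case and outer whitespace."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()


def candidate_duplicates(records, threshold=0.85):
    """Return (i, j, score) for every pair of records scoring at or above threshold."""
    pairs = []
    for (i, a), (j, b) in combinations(enumerate(records), 2):
        score = similarity(a, b)
        if score >= threshold:
            pairs.append((i, j, score))
    return pairs


# Hypothetical records: the first two describe the same person.
records = [
    "Jane Doe, 12 Main Street",
    "jane doe, 12 Main St.",
    "John Smith, 4 Elm Road",
]

for i, j, score in candidate_duplicates(records):
    print(f"possible duplicate: {records[i]!r} ~ {records[j]!r} ({score:.2f})")
```

A hand-picked threshold like this is brittle and comparing every pair gets slow on large datasets, which is exactly where a machine-guided tool earns its keep.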

PyMC3 implementation of dinosaur growth (from JAGS)

This blog post from Austin Rochford shows another cool application of PyMC3: analyzing the growth curves of dinosaurs. I am fond of Austin's technical writing as a source of educational material on how to build Bayesian models. I might be biased, but I think probabilistic modelling is experiencing a renaissance. Those of us who want to go beyond the automatable task of importing scikit-learn and doing model fits should definitely pick it up!

From my collection

As I mentioned in my last edition, this month I delivered a tutorial at PyCon on JAX, titled “Magical NumPy with JAX.” The video should be released within the coming month; as with Rob Erdmann’s talk, I think PyCon attendees are being given exclusive access for a month before the videos are released. Meanwhile, however, I have made the material freely available on GitHub.

Next month’s edition is going to be focused on causal inference. It’s something I’ve been pondering for a while. Stay tuned: it’s an awesome collection of articles!

Stay safe, stay sane, and stay hacking,
Eric

Eric's Data Science Newsletter