Data Science Programming January 2021 Newsletter
An issue about data checks and learning data science by programming
A new year has started, and it's a new dawn! With that new dawn comes a move to Substack. The main benefit for you is that the newsletter archive will be more readily visible. (The archive was something some of y'all had asked me about.) I also have a hunch that Substack supports newsletters better than Mailchimp does with Tinyletter.
In this edition of the newsletter, I wanted to share items on two themes. The first is data quality checking; the second is programmer-oriented data science learning resources.
Pandera new releases
To kick things off, I wanted to share about Pandera, a runtime data validation library, which, if my memory serves me right, I've highlighted before on the newsletter. That said, there have been new releases that I'm a fan of, and they contain exciting stuff! One of the things I especially like is the recently added pydantic-style schema model declarations. Having test-driven the schema models at work, I found them handy for clarifying which key dataframes I need in my projects. By defining those dataframes up front, I can explicitly state my assumptions about how my data ought to look, and the pydantic-style class declarations help with that.
Additionally, you can now use schema models in your function annotations. That means we can write a program that statically analyzes a Python codebase, reads the dataframe type annotations, and automatically spits out the data pipeline expressed in the codebase. (I hacked that out one Monday afternoon on a private repo; it was a welcome distraction!) Definitely check out Pandera!
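To make the static analysis idea a bit more concrete, here is a minimal sketch using only Python's standard-library `ast` module (Python 3.9+ assumed). It is not the code from my private repo. The pipeline source, the schema names (`RawSchema`, `CleanSchema`, `FeatureSchema`), and the helper functions `schema_name` and `pipeline_edges` are all hypothetical, and the sketch assumes annotations are written literally as `DataFrame[SomeSchema]`.

```python
import ast

# Hypothetical pipeline source. It is only parsed, never executed,
# so Pandera does not need to be installed to run this sketch.
SOURCE = '''
from pandera.typing import DataFrame

def clean(raw: DataFrame[RawSchema]) -> DataFrame[CleanSchema]:
    ...

def featurize(cleaned: DataFrame[CleanSchema]) -> DataFrame[FeatureSchema]:
    ...

def helper(x: int) -> int:
    ...
'''


def schema_name(annotation):
    """Return 'X' for an annotation written as DataFrame[X], else None."""
    if (
        isinstance(annotation, ast.Subscript)
        and isinstance(annotation.value, ast.Name)
        and annotation.value.id == "DataFrame"
        and isinstance(annotation.slice, ast.Name)
    ):
        return annotation.slice.id
    return None


def pipeline_edges(source):
    """Collect (input schema, function name, output schema) triples."""
    edges = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.FunctionDef):
            continue
        output = schema_name(node.returns)
        if output is None:
            continue  # function does not return an annotated dataframe
        for arg in node.args.args:
            source_schema = schema_name(arg.annotation)
            if source_schema is not None:
                edges.append((source_schema, node.name, output))
    return edges


edges = pipeline_edges(SOURCE)
for source_schema, function, output in edges:
    print(f"{source_schema} --{function}--> {output}")
```

Each triple is one edge of the pipeline graph: a function that consumes one dataframe type and produces another. Functions without dataframe annotations, like `helper`, are simply skipped. A real tool would also need to handle methods, async functions, and qualified names such as `pa.typing.DataFrame[...]`.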
The other thing I wanted to share is two blog posts by friends at Superconductive (developers of Great Expectations, another data validation library that I consider oriented towards large pipelines). I found both insightful:
How DAGs grow: Deeper, wider, and thicker - gives names to how our data pipelines grow and the consequences of not pruning them. (Looking forward to the follow-up!)
Your data tests failed! Now what? - a description of the categories of responses to data test failures. Illuminates how important workflow is alongside tooling.
Think Bayes 2
Prof. Allen Downey of the Olin College of Engineering has been updating Think Bayes with new material, including content that uses PyMC3. I am excited to see it released! The first edition was foundational in my journey into Bayesian statistical modelling. I have seen the second edition's material online, and I am confident that a newcomer to Bayesian inference will enjoy learning from it!
Causal Inference for the Brave and True
In nearly half a decade of observing the role of "data science," I've noticed a distinct lack of causal thinking in our data science projects. Part of that may be the hype surrounding "big models," but part of that may also be a lack of excellent introductory causal inference material. Fret not: Matheus Facure has you covered with a lighthearted introduction to causal inference methods.
From my collection
Over the winter, I reflected on a year of getting newcomer and seasoned colleagues up to speed on modern data science tooling and project organization. The result is a new eBook, in which I pour out everything I know about getting your computer bootstrapped and organized to do incredible data science work. (I put it to use just recently when replacing a 4-year-old 12" MacBook with a new 13" M1 MacBook Air, so I'm dogfooding the material myself!) In the spirit of democratizing knowledge, the website is freely available to all; if you would like to support the project as I update it with new things I learn, it's also up on LeanPub (which will also be continuously updated). My Patreon supporters have all gotten complimentary access to the eBook. My hope is that it becomes an excellent resource for you!
Thank you for reading
I hope you enjoyed this edition of the Data Science Programming Newsletter! If you did, please share the link to the newsletter subscription page with anyone you think might benefit from it.
As always, let me know on Twitter if you've enjoyed the newsletter, and I'm always open to hearing about the new things you've learned from it. Meanwhile, if you'd like to get early access to new content I make, I'd appreciate your support on Patreon!
Stay safe, stay indoors, and keep hacking!