Packaging, modelling, uncertainty, and cleaning
Hello, my fellow datanistas!
School is back in season here in the Cambridge area, and the university campuses are bustling with students again. Time for us to learn new things!
This is a really amazing resource for learning how to package Python code. Written by Tomas Beuzen and Tiffany Timbers, two lecturers in data science at my alma mater (The University of British Columbia), this online book will show you how to make a Python package the principled way. As a data scientist whose work oftentimes ends up as a tool for others to use, knowing how to do software packaging, testing, documentation, and distribution is a superpower to possess!
Israeli data: How can efficacy vs. severe disease be strong when 60% of hospitalized are vaccinated?
Professor Jeffrey Morris has an analysis of Israeli COVID-19 data, which starts with the observation that 60% of those hospitalized are vaccinated... but then goes on to show that the COVID-19 vaccine actually is working as expected. This is a classic exercise in proper causal modelling - one where if we get the causal model correct, the inferences and predictions are also going to be correct.
Also, reflecting on the statement "60% of those hospitalized are vaccinated" in conditional probability terms, that's P(vaccinated | hospitalized) = 0.6, or in English, the probability that you are vaccinated given that you're hospitalized is 60%. I really don't think that's the conditional probability statement we're all concerned about - we should be concerned about P(hospitalized | vaccinated). Yet another place where the wrong conditional probability can result in misleading headlines. Read his post to learn more, and be on guard for this communications mistake!
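To make the difference between the two conditional probabilities concrete, here is a minimal sketch using Bayes' rule. The numbers are made up for illustration and are NOT taken from the Israeli data or from Professor Morris's post: I assume a 90% vaccination rate and hospitalization risks of 0.1% for vaccinated and 0.6% for unvaccinated people. The function name is likewise my own invention.

```python
def p_vaccinated_given_hospitalized(p_vax, p_hosp_given_vax, p_hosp_given_unvax):
    """Bayes' rule: P(vaccinated | hospitalized)."""
    # Joint probabilities of (vaccinated, hospitalized) and (unvaccinated, hospitalized).
    joint_vax = p_vax * p_hosp_given_vax
    joint_unvax = (1 - p_vax) * p_hosp_given_unvax
    # Divide by P(hospitalized), the sum over both groups.
    return joint_vax / (joint_vax + joint_unvax)


# Hypothetical inputs, chosen only to illustrate the base-rate effect.
P_VAX = 0.9                 # assumed share of the population that is vaccinated
P_HOSP_GIVEN_VAX = 0.001    # assumed P(hospitalized | vaccinated)
P_HOSP_GIVEN_UNVAX = 0.006  # assumed P(hospitalized | unvaccinated)

share = p_vaccinated_given_hospitalized(P_VAX, P_HOSP_GIVEN_VAX, P_HOSP_GIVEN_UNVAX)

# Efficacy against hospitalization, defined here as one minus the relative risk.
efficacy = 1 - P_HOSP_GIVEN_VAX / P_HOSP_GIVEN_UNVAX

print(f"P(vaccinated | hospitalized) = {share:.2f}")
print(f"Efficacy vs. hospitalization = {efficacy:.2f}")
```

Under these assumed numbers, about 60% of hospitalized people are vaccinated even though the vaccine cuts the risk of hospitalization by roughly 83%. When most of the population is vaccinated, a small risk applied to a big group can still produce more cases than a large risk applied to a small group. This sketch covers only the base-rate part of the story; it does not reproduce the analysis in the post.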
In this post shared by Alex Ioannides on LinkedIn, the authors describe the common pitfalls that a machine learner will fall into when using prevailing tooling, such as Shapley scores, permutation feature importance, and partial dependence plots, to try to interpret the black box models that we use. Definitely worth a read if you use these tools regularly! 💯
I encountered this wonderful conversation between two Bayesians on the internet, one of whom taught me deep learning (David), and it's all about a clearer, jargon-free way of describing uncertainty. Enjoy the thread 🙂!
This is a thread of failure modes in data science, and as a data practitioner, it is well worth your time looking at it!
Reflections on data cleaning
Today I spent a lot of time cleaning up data tables extracted from a PDF. It was not fun, but it was necessary - and I had buy-in from my stakeholders that the PDF data would be useful and that they would be willing to work with me to sanity-check that I scraped the data correctly. So getting buy-in and willingness to help sanity-check the data was a crucial motivator!
This is part and parcel of the daily work of a data scientist. I bet my teammates that it would take me more focus sessions to clean up the data than it would take to write and train the siamese RNN into which we would feed the data. Nobody took me up on the bet 😃. I don't think you would either.