Hello fellow datanistas!
So we’ve made it to 2022. I don’t know about you, but I’m reminded on Twitter that 2022 is pronounced “2020 too”… so my hopes are high! :)
Jesting aside, I reviewed my bookmark list of things that I didn’t find the time to send out last year, and I picked the best of the bunch to share right at the beginning of the year. Here’s to old gems uncovered!
For those of you who use Dask, a new package by Erik Welch, afar, might make it a bit easier to manage which code runs where. Here is the tl;dr, shamelessly copied from the repository:
import afar
from dask.distributed import Client
client = Client()

with afar.run, remotely:
    import dask_cudf
    df = dask_cudf.read_parquet("s3://...")
    result = df.sum().compute()
I found this online textbook, "Intro to Probability for Data Science", and chapter 2 of that ebook is enlightening! To recognize that probability is all about sets and measures on them is something we don't usually appreciate. I think the rest of the book is also pretty useful, but this chapter is the one that caught my eye.
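To make the "sets and measures" idea concrete, here is a minimal sketch of my own (not taken from the book). It assumes a fair six-sided die as the sample space; the names omega and prob are just illustrative. Events are plain Python sets, and the probability measure assigns each event its share of the sample space.

```python
from fractions import Fraction

# Sample space for one roll of a fair six-sided die (an illustrative assumption).
omega = frozenset(range(1, 7))


def prob(event):
    """Uniform probability measure: size of the event relative to the sample space."""
    event = frozenset(event)
    if not event <= omega:
        raise ValueError("event must be a subset of the sample space")
    return Fraction(len(event), len(omega))


even = {2, 4, 6}
at_least_five = {5, 6}

# Events are sets, so "or" is a union and "and" is an intersection.
p_union = prob(even | at_least_five)

# Inclusion-exclusion: P(A or B) = P(A) + P(B) - P(A and B).
p_inclusion_exclusion = prob(even) + prob(at_least_five) - prob(even & at_least_five)

print(p_union, p_inclusion_exclusion)
```

Both quantities come out equal, which is the point: once events are sets, rules like inclusion-exclusion are just bookkeeping on the measure.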
I saw this tweet by Paige Bailey on Twitter, where the sentiment in the tweets and replies was, "Environments are hard!" Pedagogically, computing environments and the associated tooling are orthogonal enough to computational modelling topics that it almost feels like a distraction to think about environments... but in order to operationalize our modelling results, it's imperative that we have mastery over environments. Because our stack runs deep and can break in subtle ways, knowing exactly what lives in our compute environments becomes a superpower on our journey to become awesome data scientists!
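As one small way of "knowing exactly what lives in our compute environments", here is a sketch that uses only the standard library's importlib.metadata to list what the current interpreter can see. The helper name installed_packages is my own, and this is no substitute for a proper environment or lock file; it is just a quick inspection tool.

```python
from importlib.metadata import distributions


def installed_packages():
    """Return a {name: version} mapping for every distribution visible to this interpreter."""
    packages = {}
    for dist in distributions():
        name = dist.metadata["Name"]
        if name:  # skip installs whose metadata is missing a name
            packages[name] = dist.version
    # Sort case-insensitively so the listing is easy to scan and to diff.
    return dict(sorted(packages.items(), key=lambda item: item[0].lower()))


if __name__ == "__main__":
    for name, version in installed_packages().items():
        print(f"{name}=={version}")
```

Saving the output from two machines and diffing them is often enough to spot the subtle mismatch that breaks a pipeline.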
VSCode has, over time, improved so much that it has effectively become my go-to IDE at work. This post at pbpython summarizes succinctly what makes VSCode an awesome IDE for data scientists. Be sure to check it out!
What happened at Zillow?
Zillow exiting the home flipping business and closing out the entire business unit made headlines in the data science world in November. One of the things I've been curious about is the suite of causal factors that led to this outcome. Here are a few analyses and my takeaways from each of them.
Zillow Just Gave Us A Look At Machine Learning's Future: Vin Vashishta points out how models of all kinds expect stability, while the real world, on the other hand, is never stable - a fundamental mismatch on the technical level.
Zillow Tried to Make Less Money: Matt Levine posits that Zillow's business leaders failed to see the mismatch between Zillow's scale and the bounds on profitability at that scale.
Before we go on and blame Zillow's use of Prophet, I think it's important to see how having a mental model of the business that is mismatched to reality can lead to pretty dramatic disasters.
Tim Hopper has a great Twitter thread on how he developed a tool for data scientists to access data, based on the Intake library. It’s an example of how to make great tools that empower your colleagues right at the interface that they use!
This is an excellent article that cogently highlights the underlying reasons why machine learning projects often end up in trouble. The core problem is when we use ML to replace humans rather than augment human decision-making. In my role doing biomedical research data science, I’ve come to realize that our models can help shortcut experimentation, but can’t really replace benchside experimentation altogether. In other words, we can use ML to make better decisions on what to test, but can’t make final decisions based on ML alone.
In this mini-course on Inspired Python, we get a friendly introduction to a new feature in Python 3.10: Structural Pattern Matching! Structural pattern matching is a language feature that can dramatically simplify complex if/elif/else logic in your code. (Though if you're in the camp of "but we could have just used if/elif/else blocks", nothing is stopping you from doing so 😄.) This mini-course gives a comprehensive view of the many ways you could use Structural Pattern Matching in your code, so definitely check it out!