Hello fellow datanistas!
So we’ve made it to 2022. I don’t know about you, but I’m reminded on Twitter that 2022 is pronounced “2020 too”… so my hopes are high! :)
Jesting aside, I reviewed my bookmark list of things that I didn’t find the time to send out last year, and I picked the best of the bunch to share right at the beginning of the year. Here’s to old gems uncovered!
Afar: Dask code execution made even simpler
For those of you who use Dask, a new package by Erik Welch might make it a bit easier to manage where your code runs. Here is the tl;dr, shamelessly copied from the repository:
import afar
from dask.distributed import Client

client = Client()

with afar.run, remotely:
    import dask_cudf
    df = dask_cudf.read_parquet("s3://...")
    result = df.sum().compute()
Intro to Probability for Data Science
I found this online textbook, "Intro to Probability for Data Science", and chapter 2 of that ebook is enlightening! Recognizing that probability is fundamentally about sets and the measures we define on them is something we don't usually appreciate. I think the rest of the book is also pretty useful, but this chapter is the one that caught my eye.
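To make that idea concrete, here's a toy sketch of my own (not from the book): treat events as subsets of a finite sample space, and probability as a measure on those subsets.

```python
from fractions import Fraction

# Sample space for one roll of a fair six-sided die.
omega = {1, 2, 3, 4, 5, 6}

def prob(event, sample_space=omega):
    """A probability 'measure' on subsets of a finite sample space,
    assuming all outcomes are equally likely."""
    assert event <= sample_space, "events must be subsets of the sample space"
    return Fraction(len(event), len(sample_space))

evens = {2, 4, 6}
big = {5, 6}

# The measure is additive on disjoint events...
assert prob({1}) + prob({2}) == prob({1, 2})

# ...and unions/intersections of events are just set operations.
print(prob(evens))       # 1/2
print(prob(evens | big)) # P(evens OR big) = 2/3
print(prob(evens & big)) # P(evens AND big) = 1/6
```

Once you see events as sets, rules like P(A ∪ B) = P(A) + P(B) − P(A ∩ B) stop being formulas to memorize and start being facts about counting.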
HTMX: Python front-end without Javascript?
While randomly browsing through the interwebs, I found out about HTMX through a TalkPython Training course landing page. It is a really cool idea! I used to build Flask apps with a lot of logic built into rendering HTML pages, and it quickly got confusing for data apps that had to do anything more than take in some user input and return a chart. HTMX's promise is that we can get Javascript-esque reactivity without typing a line of Javascript. I'm curious to see how this turns out going into the future!
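The core pattern is simple: your page has an element with an `hx-get` attribute, and the server's only job is to return an HTML fragment that htmx swaps into the page. Here's a toy sketch of my own (using the standard library's wsgiref rather than Flask, and a made-up `/fragment` route) of what such a fragment endpoint looks like:

```python
from wsgiref.util import setup_testing_defaults

def app(environ, start_response):
    """A WSGI endpoint that returns an HTML *fragment*, not a full page.
    On the page, something like <button hx-get="/fragment" hx-target="#result">
    would trigger this request and swap the response into #result."""
    if environ["PATH_INFO"] == "/fragment":
        body = b"<p>Here is your chart placeholder!</p>"
    else:
        body = b"<p>Not found</p>"
    start_response("200 OK", [("Content-Type", "text/html")])
    return [body]

# Exercise the endpoint without spinning up a real server.
environ = {}
setup_testing_defaults(environ)
environ["PATH_INFO"] = "/fragment"
collected = {}

def start_response(status, headers):
    collected["status"] = status

response = b"".join(app(environ, start_response))
print(response.decode())  # <p>Here is your chart placeholder!</p>
```

All the "reactivity" lives in HTML attributes; the server just keeps doing what servers do best: rendering HTML.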
Struggles with Python data science environments?
I saw this tweet by Paige Bailey on Twitter, where the sentiment in the tweets and replies was, "Environments are hard!" Pedagogically, computing environments and the associated tooling are orthogonal enough from computational modelling topics that it almost feels like a distraction to think about environments... but in order to operationalize our modelling results, it's imperative that we have mastery over environments. Because our stack runs deep and can break in subtle ways, knowing exactly what lives in our compute environments becomes a superpower on our journey to become awesome data scientists!
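One small step toward that mastery (a sketch of my own, not from the thread): Python's standard library can tell you exactly which interpreter and which package versions you're running, which is handy to log alongside every experiment.

```python
import platform
from importlib import metadata

def environment_report():
    """Collect interpreter info and installed package versions."""
    packages = sorted(
        (dist.metadata["Name"], dist.version)
        for dist in metadata.distributions()
        if dist.metadata["Name"]  # guard against dists with broken metadata
    )
    return {
        "python": platform.python_version(),
        "implementation": platform.python_implementation(),
        "packages": packages,
    }

report = environment_report()
print(f"Python {report['python']} ({report['implementation']})")
for name, version in report["packages"][:5]:
    print(f"  {name}=={version}")
```

Dumping a report like this into your experiment logs costs almost nothing, and it's the first thing you'll want when a result refuses to reproduce six months later.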
16 Reasons to Use VS Code for Developing Jupyter Notebooks
VSCode has, over time, improved so much that it has effectively become my go-to IDE at work. This post at pbpython summarizes succinctly what makes VSCode an awesome IDE for data scientists. Be sure to check it out!
What happened at Zillow?
Zillow exiting the home flipping business and closing out the entire business unit made headlines in the data science world in November. One of the things I've been curious about is the suite of causal factors that led to this outcome. Here are a few analyses and my takeaways from each of them.
Zillow Just Gave Us A Look At Machine Learning's Future: Vin Vashishta points out how models of all kinds expect stability, while the real world is never stable - a fundamental mismatch on the technical level.
Zillow Tried to Make Less Money: Matt Levine posits that Zillow's business leaders failed to see the mismatch between Zillow's scale and the bounds on profitability at that scale.
Before we go on and blame Zillow's use of Prophet, I think it's important to see how having a mental model of the business that is mismatched to reality can lead to pretty dramatic disasters.
Using Intake to develop data catalogs
Tim Hopper has a great Twitter thread on how he developed a tool for data scientists to access data, based on the Intake library. It’s an example of how to make great tools that empower your colleagues right at the interface that they use!
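To give a flavour of what an Intake catalog looks like (this is my own illustrative sketch, not Tim's tool - the source name and bucket path are made up), the idea is that a data team maintains a YAML file like this:

```yaml
# catalog.yml - a hypothetical Intake catalog
sources:
  taxi_trips:
    description: Trip records, partitioned by month (illustrative example)
    driver: csv
    args:
      urlpath: "s3://my-bucket/taxi/*.csv"
```

Colleagues then load data by name - `intake.open_catalog("catalog.yml").taxi_trips.read()` - without ever needing to know where the files live or how they're parsed. That's the interface-level empowerment Tim's thread is about.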
A.I. is solving the wrong problem
This is an excellent article that cogently highlights the underlying reasons why machine learning projects often end up in trouble. The core problem is when we use ML to replace humans rather than augment human decision-making. In my role doing biomedical research data science, I’ve come to realize that our models can help shortcut experimentation, but can’t really replace benchside experimentation altogether. In other words, we can use ML to make better decisions on what to test, but can’t make final decisions based on ML alone.
A friendly deep dive into structural pattern matching
In this mini-course on Inspired Python, we get a friendly introduction to a new feature in Python 3.10: Structural Pattern Matching! Structural pattern matching is a language feature that can dramatically simplify complex if/elif/else logic in your code. (Though if you're in the camp of "but we could have just used if/elif/else blocks", nothing is stopping you from doing so 😄.) This mini-course gives a comprehensive view of the many ways you could use Structural Pattern Matching in your code, so definitely check it out!