Running data projects and data teams

a.k.a. the better-late-by-one-day-than-never April newsletter published in May

May 01, 2021

Hello, fellow datanistas!

As promised, here is the official April edition of the Data Science Programming Newsletter.

Having been on paternity leave for over a month now, I have found some space to think strategically about data projects (and products) beyond just the fun part of coding. My thoughts were inspired by a range of articles that I have read, and I'd like to share them with everybody.

Orphaned Analytics

The first article that I'd like to share is one about orphaned analytics. Orphaned analytics is defined as "one-off Machine Learning (ML) models written to address a specific business or operational problem, but never engineered for sharing, re-use and continuous-learning and adapting." The liability incurred by orphaned models (my term) is described in the article, and it essentially boils down to data systems being filled with implicit context that isn't explicitly recorded.

Data Science Paint By Numbers

Reading the article on orphaned analytics led me to the following article about good data project workflow. In there, the author writes about the least developed part of data projects: "what it is we are trying to prove out with our data science engagement and how do we measure progress and success." At its core, it sounds a ton like how hypothesis-driven scholarly research ought to be done.

The Machine Learning Canvas

Having read all of that led me to this fantastic resource: the Machine Learning Canvas. In the canvas lies a framework -- a collection of questions that need to be answered, which will help thoroughly flesh out a plan for how a machine learning project could develop. I imagine this will work for data projects in general too. The old adage holds true: failing to plan means planning to fail, and I think the ML Canvas is a great resource to help us data scientists work with our colleagues to build better data systems.

Data projects and data products

This exploration of how to structure a data project reminded me of another article I had read before. This one, from the Harvard Business Review, encourages us to approach our data projects with a product mindset. I particularly like the authors' definition of how "productization" happens:

"Productization involves abstracting the underlying principles of successful point solutions until they can be used to solve an array of similar but distinct business problems."

and

"...true productization involves taking the target end-users into account."

Run your data team like a product team

I also learned something from this article on locallyoptimistic about how to run a "data team" effectively. The key is to run it as if the team were building a product around the data. The product has features, with the heuristic, "if people are using it to make decisions, then it's a feature of the Data Product".

Some of my own thoughts

Reading through these articles has reinforced an idea for me. It takes time to build a data project from conception to completion. Leading data projects well likely means leading one or two with focus and managing all aspects properly, rather than juggling three or more. Additionally, we should continually think about the product, rather than the service, that we are providing to our colleagues. A service orientation leaves our colleagues in a position of continual dependence on us, while building a product that serves others in our absence frees us up to do higher-value things. It’s like building a task manager app vs. offering a service that lets others record tasks with you.

As you read through the articles, my hope is that some thoughts bubble up in your mind as well. Having had some time to ponder these ideas, I'm thinking of hosting a discussion hour on "data science projects and products" where we can exchange ideas and learn from one another. If this is something you're interested in taking part in, please send me a message and let's flesh it out!

The geeky stuff

Having unloaded my thoughts on how to run a data project and team, let's turn our attention to the geeky stuff.

Firstly, you have to check out Rich by Will McGugan. It's an absolute baller of a package, especially for making rich command-line interfaces. Also, Khuyen Tran has a wonderful article about how to use Rich.

Secondly, the socially prolific Kareem Carr has the best relationship mic drop ever.

Finally, for those of you who do image machine learning and need a fast way to do cropping, you should check out inbac, a Python application for doing interactive batch cropping (which is where its name comes from, obviously).

From my collection

I've been working on nxviz, a Python package I developed in graduate school and subsequently neglected for over four years while at work. Now that I've been on a break for a while, I decided to upgrade the API, working out the grammar needed to compose beautiful network visualizations. Now it's basically ready to share, with a release coming in early May! Meanwhile, please check out the docs for a preview of the graph visualizations you'll be able to make!

Also, I will be teaching a tutorial called Magical NumPy with JAX at PyCon and SciPy this year. It's an extension of this tutorial repository I made a while ago, dl-workshop. Looking forward to teaching this new workshop!

Stay safe, keep having fun, and keep making cool and useful things!

Eric

Eric's Data Science Newsletter