Data Science Programming April Newsletter

Apr 20, 2020

Hello fellow datanistas!

Hope you're staying safe, and keeping others safe, by staying indoors during this time.

This is the first newsletter I'm crafting, and I'm hoping you enjoy it. Just like you, I seek to learn new things each day, as I never know which thing might become the superpower I need to solve that one data problem at work. I also need an outlet to share, and that's what this newsletter is about: to share curated content from the interwebs that I think could be useful to your day-to-day data science work. And because I'm a programmer-type, I'll be sharing mostly programmer-oriented content.

For those who signed up, this is a bonus early bird newsletter for you; thank you for your support! If you find the content useful and want your other colleagues to receive it, please point your colleagues to the newsletter landing page to sign up!

Without further ado, here's the first edition!

Big Data in Small Memory

PyCon 2020's virtual talks have launched online, and we've got an excellent talk from Boston-based Itamar Turner-Trauring on how to use NumPy and Pandas when you've got larger-than-RAM data on your single-node machine. It's especially useful when you don't have access to a beefier compute system.
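To give a flavor of the problem space, here is a minimal sketch of one common larger-than-RAM technique: reading a CSV in chunks with pandas and keeping only running totals in memory. This is not necessarily what the talk covers; the helper name `chunked_mean`, the column name, and the tiny in-memory CSV are all made up for illustration.

```python
import io

import pandas as pd

# Stand-in for a large CSV on disk (hypothetical data); a real file path
# passed to pd.read_csv would be processed the same way.
csv_data = io.StringIO("value\n" + "\n".join(str(i) for i in range(10)))


def chunked_mean(source, column, chunksize):
    """Compute the mean of one column without loading the whole file."""
    total = 0.0
    count = 0
    # With chunksize set, read_csv yields DataFrames of at most `chunksize` rows.
    for chunk in pd.read_csv(source, chunksize=chunksize):
        total += chunk[column].sum()
        count += len(chunk)
    return total / count


mean_value = chunked_mean(csv_data, "value", chunksize=3)
print(mean_value)
```

Only one chunk lives in memory at a time, so peak memory is governed by `chunksize` rather than by the size of the file.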

Markov Models from the bottom up, with Python

If you didn't catch it on Twitter earlier, I wrote a blog post explaining Markov models. I found a dearth of educational materials on Markov models that were also programmer-friendly, so I wrote an explainer to fill the gap. With math, code, prose, and figures right next to each other, it's an example of the kind of writing I'm going for long-term.
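For readers who have not met Markov models before, here is a minimal sketch of a Markov chain using only the standard library. It is not taken from the blog post; the two weather states, the transition probabilities, and the function names `simulate` and `stationary` are assumptions chosen purely for illustration.

```python
import random

# Hypothetical two-state chain: each row gives P(next state | current state).
TRANSITIONS = {
    "sunny": {"sunny": 0.9, "rainy": 0.1},
    "rainy": {"sunny": 0.5, "rainy": 0.5},
}


def simulate(start, n_steps, seed=None):
    """Sample a path of states; the next state depends only on the current one."""
    rng = random.Random(seed)
    state = start
    path = [state]
    for _ in range(n_steps):
        next_states = list(TRANSITIONS[state])
        weights = [TRANSITIONS[state][s] for s in next_states]
        state = rng.choices(next_states, weights=weights, k=1)[0]
        path.append(state)
    return path


def stationary(n_iter=1000):
    """Approximate the long-run state distribution by repeated propagation."""
    dist = {s: 1.0 / len(TRANSITIONS) for s in TRANSITIONS}
    for _ in range(n_iter):
        new = {s: 0.0 for s in TRANSITIONS}
        for src, p_src in dist.items():
            for dst, p in TRANSITIONS[src].items():
                new[dst] += p_src * p
        dist = new
    return dist


print(simulate("sunny", 10, seed=0))
print(stationary())
```

For this made-up chain, the long-run distribution works out to 5/6 sunny and 1/6 rainy, which `stationary()` approaches after a handful of iterations.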

GitHub is now free for all teams

Just announced this month, GitHub has substantially upgraded its free tier. Unlimited private repositories and collaborators are the biggest draw now.

pyjanitor: Data Cleaning Routines for Data Scientists

Wanted to shamelessly advertise an open source tool that I've co-developed with many others around the world, including colleagues at work and friends from PyCon & SciPy. It extends the pandas DataFrame and Series APIs with data cleaning routines that you can chain. conda install -c conda-forge pyjanitor or pip install -U pyjanitor to try it out; let us know what you think about it. Even better, if you have an idea for a new feature and want to take this COVID-19 time to make an open source contribution, our Issue Tracker is the place to come chat!

PyCaret: Low-Code ML for All

Finally, I just found pycaret, a Python package that automates many pieces of the machine learning workflow. I test-drove it at work where I wanted to build a predictive model for protein melting temperature from sequence, and found it immensely productive. I spent about 15X less time coding, and was able to leverage the automation around standard ML experimentation to do other things at work.

Hope you found the content useful. If you have feedback on the newsletter, such as things you're curious to know about, please feel free to drop me a line. Do share it with your colleagues if you think it'll be useful for them!

Stay safe, indoors, and have fun learning!
Eric J. Ma

Eric's Data Science Newsletter