Reliable biological data requires physical quantities, not statistical artifacts
Alternatively titled: Why your biological data might be lying to you!
Hello fellow datanistas!
Ever wonder why so many promising biological discoveries fail to translate into real-world applications?
I've spent years building machine learning models for life sciences companies, and I've noticed a troubling pattern: we're sometimes forced to train our models on statistical artifacts rather than physical reality.
Here's a scenario I see repeatedly:
A biotech startup is developing novel protein binders. To save costs, they run single measurements for each test protein. Then, to make sense of this limited data, they calculate p-values against pooled controls and use these statistical artifacts as their "ground truth" for machine learning.
This approach is fundamentally broken.
P-values aren't physical properties of molecules - they're artifacts of your measurement system's noise. When you train a model to predict p-values from sequence, you're not modeling biology; you're modeling experimental noise.
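To see what I mean, here is a minimal simulation sketch in Python. It assumes normally distributed assay noise and a one-sided z-test of each protein's single measurement against a pooled set of negative controls; every number (signal, noise, protein count) is made up purely for illustration:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical setup: every test protein has the SAME true binding signal,
# but each one gets only a single noisy measurement.
true_signal = 1.0          # physical effect, identical for all proteins
assay_noise = 1.0          # measurement noise (standard deviation)
n_proteins = 10
n_controls = 24            # pooled negative controls, true signal = 0

controls = rng.normal(0.0, assay_noise, size=n_controls)

for i in range(n_proteins):
    single_measurement = rng.normal(true_signal, assay_noise)
    # One-sided z-test of the single measurement against the pooled controls --
    # the kind of "p-value as ground truth" label described above.
    z = (single_measurement - controls.mean()) / controls.std(ddof=1)
    p = stats.norm.sf(z)
    print(f"protein {i}: measurement={single_measurement:+.2f}  p={p:.3f}")

# The true physical quantity is identical for every protein, yet the p-values
# scatter from "highly significant" to "nowhere near significant".
# A model trained to predict these labels is learning assay noise.
```

Run it and the proteins all share one physical truth, but their p-value labels disagree wildly. That disagreement is exactly what your model ends up fitting.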
In my latest blog post, I explain:
- Why p-values and other statistical artifacts make terrible training data
- How this approach undermines reproducibility and scientific progress
- A better framework using physical quantities and proper experimental design
- How Bayesian methods can help us properly handle uncertainty (a minimal sketch of the idea follows below)
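To give a flavor of that last point before you click through, here is a minimal sketch of estimating a physical quantity with honest uncertainty. It assumes a handful of replicate measurements, a known assay noise level, and a conjugate normal prior on the true signal, so the posterior has a closed form; all numbers are illustrative:

```python
import numpy as np

# Hypothetical replicate measurements of one protein's binding signal.
measurements = np.array([0.8, 1.3, 1.1])
assay_noise = 1.0                 # measurement sd (assumed known)
prior_mean, prior_sd = 0.0, 10.0  # weakly informative normal prior

# Conjugate normal-normal update: posterior over the true physical signal.
n = len(measurements)
post_var = 1.0 / (1.0 / prior_sd**2 + n / assay_noise**2)
post_mean = post_var * (prior_mean / prior_sd**2 + measurements.sum() / assay_noise**2)
post_sd = np.sqrt(post_var)

lo, hi = post_mean - 1.96 * post_sd, post_mean + 1.96 * post_sd
print(f"posterior signal: {post_mean:.2f} +/- {post_sd:.2f} (95% interval {lo:.2f} to {hi:.2f})")

# The output is an estimate of a physical quantity with quantified uncertainty --
# something a model can actually learn from -- rather than a significance call.
```

The point isn't the specific prior or likelihood; it's that the label you train on is an estimate of something physical, with its uncertainty carried along explicitly.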
As Richard Feynman warned: "The first principle is that you must not fool yourself—and you are the easiest person to fool."
If you're building ML models for biology, designing experiments, or just care about reliable scientific data, read the full post here.
Let's stop fooling ourselves with statistical artifacts and build on solid physical ground instead.
Happy experimenting!
Eric