Garbage in, Garbage out...and other myths
By Josi Livingston • October 8, 2024

Last week, I had the privilege of attending Lab of the Future at the iconic Beurs van Berlage in Amsterdam. This stately, somewhat nostalgic building makes a nice juxtaposition with the many talks on cutting-edge advancements in science and technology for which this conference is well known. However, there was one aphorism I heard repeated again and again that, by comparison, seemed almost as old-fashioned as the vaulted brickwork in the main hall. That aphorism was this:
"Garbage In, Garbage Out"
It is a pithy truism born in the early days of computing, and it captures a simple truth: the quality of what comes out of a system is directly related to the quality of what goes in. Put in “garbage,” and you’ll likely get “garbage” back.
As general life advice it’s perfectly fine, but given recent advancements in AI, I suspect we are adhering to it too rigidly in the life sciences. If the presentations I saw are to be believed, one cannot possibly hope to train an adequate AI model unless one’s data is perfectly structured, tagged, validated, and thoroughly annotated with ample metadata. The implication is that anything less than this high standard of data orthodoxy is, well, “garbage”.
This standard, while well-intentioned, is unnecessarily prohibitive. I see no reason why anyone couldn’t, today, train an AI model with whatever lab data they have, whether it’s unstructured lab notes, raw datasets, or even hand-written observations. After all, the popular large language models (LLMs) we all use right now were themselves trained on untagged, unstructured data sources using unsupervised techniques (and if bulk, unfiltered Reddit posts and YouTube comments don’t qualify as “garbage”, I don’t know what does).
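To make this concrete: the usual first step in training on raw, unstructured text is simply slicing it into overlapping chunks that fit a model's context window — no tags, schemas, or curated metadata required. Here's a minimal sketch of that idea; the whitespace "tokenizer" and all names are illustrative stand-ins (in practice you would use your model's own tokenizer), not any particular library's API.

```python
def chunk_notes(raw_text: str, max_tokens: int = 64, overlap: int = 8) -> list:
    """Split free-form notes into overlapping chunks for language-model training.

    Works on any messy text as-is: lab notes, transcribed observations, etc.
    Whitespace splitting stands in for a real tokenizer here.
    """
    tokens = raw_text.split()
    if not tokens:
        return []
    step = max_tokens - overlap  # slide the window, keeping some overlap for context
    chunks = []
    for start in range(0, len(tokens), step):
        window = tokens[start:start + max_tokens]
        chunks.append(" ".join(window))
        if start + max_tokens >= len(tokens):
            break  # final window already covers the end of the text
    return chunks

# Simulate a longer, unpolished lab notebook by repeating a few raw entries.
notes = (
    "Day 3: cell viability looked off, maybe the incubator temp drifted. "
    "Re-ran the assay after recalibrating -- readings back in range. "
    "Note to self: log temps hourly next run. "
) * 5

chunks = chunk_notes(notes, max_tokens=32, overlap=4)
print(len(chunks), "training chunks ready")
```

Each chunk can then be fed straight into an unsupervised (next-token-prediction) fine-tuning loop; the overlap keeps context from being cut off mid-thought at chunk boundaries.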
Now don’t get me wrong: I of course believe that better data leads to better models. As someone with over a decade of experience helping clients with their digital transformation, I can attest to that. But if I may suggest a better edict to live by: perfect is the enemy of good. You don’t need flawless data to build an effective AI model, and I’d love to show you how.
Over the next few days I’ll be putting this idea to the test using unsupervised learning and NLP techniques for fine-tuning as described in this 3-part series from Trelis Research, so watch this space. But if you’d like to chat directly about how this technique can be used for your scientific use case, feel free to book a consultation with me using this link.