The Limiting Reagent

At Bio-IT World 2026 (Boston, MA), Dan Rozelle of Rancho Biosciences gave a talk titled “Agents in the Loop: High Fidelity Omics Data requires Subject Matter Expertise”. This is the kind of talk that is catnip to me: omics data and SMEs. Bring it on.

He began with a slide illustrating three parts: data tier / accuracy / impact on quality. Raw metadata yields 60% accuracy, and “Public repositories have great individual detail but no concern for FAIR”†. The second data tier listed “Agentic data curation alone, accuracy of 70%”, with a note “Efficient and scalable with some limitations”. And the third data tier is Subject Matter Expert in-the-loop; 99% accuracy, “Harmonized to a comprehensive data model, with context and expertise”.

After reviewing their data curation model and “gold standard training datasets”, he came to a concrete example, illustrated with the Huntingtin gene’s protein-protein interaction data. (Note: he showed no direct omics data; he only talked about their data analysis methods and results.) He walked through their data model with AI agents, running improvement algorithms in parallel to raise the accuracy, and getting human Subject Matter Experts to raise the accuracy even higher.

The mean recall of the validation agent is 89.5% against the “gold standard”. With their dual-run method that includes an SME: now 93.7%.

This is a nice and tidy story. He then went through how they performed the analysis. Runs 1 through 10 are “basic system mechanics and skill development”. That sounds normal. Then runs 11 to 17 are labeled “intra-run convergence refinements from paired runs”. Got it. Then runs 18 to 21 are “SME accuracy improvements using gold standard data”. Okay, so humans are the gold standard. I think to myself about how humans are biased, fallible, sometimes mistaken, and sometimes confused. And then runs 22 to 29 are called “Study versus Experiment Agreement”, which is their final metric and goal. Dual-run method with an SME in the loop: 93.7%.

The presenter finishes. The take-home points on refining quality and designing systems to flag “risky data” and “agentic uncertainty” all make sense.

I’m learning to really dislike the phrase “this makes sense”. The person saying it is simply stating an obvious thing in a confident voice. I have been at this conference for two days now. A lot of things make sense. And most of these things I already knew.

I raise my hand, while quickly doing some math in my head. (To be precise, I was calculating “what is 100 minus 93.7?”) I mention the final 93.7% result, and ask “What about the 6.3%?” The speaker says there was no improvement with additional runs. The speaker has no suggestions or ideas for improvement. He paused. I waited.

I want to hear something non-obvious, and I’m disappointed. The room didn’t seem to mind. We moved on. We were still ahead of schedule.

I think about that missing 6.3%. It’s too big to ignore, yet small enough to rationalize away.

A few hours later there was a reception in the exhibitors’ area. I texted a friend I had not seen in many years (I’ll call her Cathy) and she texted me back immediately, somewhat joyfully, that she happened to be sitting right next to the bar, and that I could meet her there. Having an adult beverage in hand sure makes it easier to catch up after such a long time.

We have been inside different organizations since we last worked together. She was a junior inside sales representative, fresh out of school; I was a newly promoted product manager. Over the succeeding decades we followed our individual paths; I was inside genomics and NGS, she inside bioprocessing and an OEM supply business. More recently I found myself inside high multiplex immunoassays; she migrated to lower multiplex immunoassays and now manages a group. She even noted that after a recent reorganization she had a few open headcounts.

She attended Bio-IT World looking for input related to her strategic marketing role: how AI will affect the kinds of data to be stored for clinical trials, how that data is currently used, and how AI will make data demands from instrument providers.

We didn’t have to broach the topic of “what are you bringing back to the office”. We both knew there was nothing to bring back. Zero, zip, nada. I felt bad for her, in that she’ll come back with her own slides for her group (and of course for her manager, who asked her to attend) with a lot of information but not much for them to act on. I was attending without stakeholders to report back to, which was something of a relief.

Three major topics and a small peripheral one were the main take-home lessons: the use of Agentic AI for automation as a major shift in how internal software tools get built; the advent of open source data models for protein structure information (e.g. OpenFold) due to AlphaFold’s private nature; and use of LLMs for data package submissions to regulatory agencies, speeding an inefficient and labor-intensive workflow. The peripheral topic was virtual cell models, which was only a footnote (“the jury is still out” per Jeremy Jenkins’s keynote covered in Post 001).

Why is there no mention of measuring important protein biomarkers as a vital component of the approval submission process? Why is there no mention of the issues surrounding surrogate biomarkers, surrogate endpoints, determination of mechanism of action, signals of toxicity and safety, and how AI can help answer these important questions?

Was this the wrong meeting for this? Was there a misunderstanding? Did I make a mistake?

The Brits have a saying, “it does what it says on the tin”. Did Bio-IT World do what it said on the tin?

In a sponsored talk, a speaker from the Electronic Laboratory Notebook provider IDBS gave a very polished and professional read on “enterprise-grade AI data foundations”. With a headline that sounds like it came directly from an expensive corporate consultant’s 100-slide deck, his opening slide in 48 point font declared “AI without context is noise. Context without structure is risk.” These are notes that transfer seamlessly from the notes of the presenter to the notes of the receiver without passing through the consciousness of either.

And upholding the AI data foundations are four points: data provenance, semantic continuity, regulatory coherence, and machine interpretability. I’m not sure what to do with this information either. My notes record the speaker saying “I am preaching to the choir” three times. Perhaps he needs to work on his own semantic continuity. His coherence and interpretability may also need some improvement. He also used the phrase “it goes without saying”, which I realize is only another way to say “this makes sense”. (!)

He closed with five enterprise-level challenges “that will define AI leadership in pharma by 2028”, which included “eliminating data silos across CMC, Clinical and Manufacturing”. Excuse me, how long has pharma leadership been working on this? Since 2018? Perhaps since 1898? I’ll have to dig up my notes on pharmaceutical development in Germany during the pre-war industrialization from the dye manufacturing business. As an aside, the pharmaceutical business began in the late 1800s from the production of aniline dye from coal-tar, which was the origin of industrial giants Bayer, BASF and Hoechst among many others. This development led to organic synthesis, scale-up and analytical chemistry, not to mention patents. Dye chemistry became biological chemistry, and the industrial production of acetylsalicylic acid launched Bayer into the aspirin business in 1899.

I’m sure Bayer had data silos in 1899 they wanted to eliminate too. Okay, back to the sessions.

In another plenary session there was a section titled “The collaboration breakthrough: how federated learning is rewriting the rules of drug discovery”, which I may (or may not) talk about in a future essay. Christina Taylor, a computational molecular design lead at Bayer (yes that Bayer in Germany), made an offhand comment that I’ve been thinking about. “AI is garbage-in, garbage-out. Real drug discovery needs high-quality data. Otherwise the time that it takes is much higher”.

Dr. Taylor’s presentation was the shortest one of the entire conference. It was in front of a larger panel discussion, and in context the brevity did not seem out of place, although several panelists did take more time in their comments. She may have had a single slide, but I don’t remember it. She did make a point though that resonated: GIGO is a four-letter acronym everyone can relate to.

The vinyl stickers on the floor laid out a footpath, and I spotted someone who was obviously a senior salesperson standing by himself in front of them. I asked him if I could move the vinyl stickers (leading to someone else’s booth) and point them to his booth, in order to help drive traffic. He thought it was a great idea.

I’m a marketing person, I tell him, and I was just at AACR a month ago doing pretty much the exact same thing: standing in front of a booth, staring straight into space, for hours on end. (Frankly you get used to it, and you have ways to survive such experiences. Having a sense of humor really helps.) He appreciated the gesture.

The booth was called Zontal. I have no idea who Zontal is or what it does. I think it is an odd name. They did not put these vinyl stickers on the floor; that was another vendor down the aisle. I am interested to hear what the representative has to say.

The salesperson told me how major vendors of life science tools keep the data coming off different types of instruments in proprietary formats, in order to get software lock-in with data analysis. He continued by telling me about the Allotrope Foundation and ASM standards, and how they are actively working on opening up data for uniform formatting and standardization. This is FAIR being worked out through vendors in real time. Zontal and their competitor TetraScience have enterprise deployments at major pharma and biotech companies, and those companies use the procurement process as leverage in order to get “raw data from the instruments they purchase”.

FAIR is a worthy cause. I think of some of my former employers who develop and sell these instruments, and how much they want to keep their ‘raw data’ behind a proprietary software interface. It is about fancy concepts like value chain, the aforementioned customer lock-in, getting “stickiness”, a walled garden, a monetization model. Their customers, however, have larger goals in mind, not to mention large budgets, and will use their purchasing leverage to pry these vaults open.

This battle for FAIR is currently being waged on several fronts.

* * *

During a break I saw a person sitting one chair over working furtively on a slide deck. I assumed they were going to present. I learned later that they worked for a top-10 pharma company, in bioinformatics and multi-omic datasets (right in my own wheelhouse), and during one of the breaks they introduced me to a colleague we’ll call Roberta.

Roberta was quiet, and as a machine learning engineer she knew all about programming agents, a topic that over 80% of the sessions I attended at least mentioned.

I asked her what she thought of Jeremy Jenkins’s comment that “the jury is still out” on single cell RNA datasets. She assured me the “jury was very much still out” as they have started in on some datasets, but there are many more questions than there are answers. (I am paraphrasing here.)

Roberta did not state this on the stage. She said it between bites of her box lunch, while staring down at the floor. She said it so quietly I had to lean in to make sure I understood her. I’ve been thinking about this ever since.

The loudest voices at Bio-IT World are sure of themselves. The 60% - 70% - 99% architecture. The five challenges that will define pharma leadership by 2028. The elimination of data silos. But the quietest voices are unsure. The speaker without an answer for 6.3%. A friend with nothing to bring back to her group. A computational design lead who said in four words what forty slides hinted at. A machine learning engineer telling me the jury is still out.

The limiting reagent in 2026 is not data. It is the willingness to admit aloud, on the main stage, that the data is not ready. The capability for cleaning it exists, there are standards, there are committees; what is missing is the institutional courage to not only insist on high quality data, but to also budget for it.

Reference:

† FAIR = a data standard proposed in 2016 (“Findable, Accessible, Interoperable, Reusable”). Wilkinson MD and Mons B et al. Sci Data (2016) “The FAIR Guiding Principles for scientific data management and stewardship” https://www.nature.com/articles/sdata201618
