Practical Benefits and ROI of Protein-Constrained Cell Simulators
Protein-constrained genome-scale models are seeing widespread use in the computational systems biology community. But did you know that this family of models was first validated 17 years ago?
And that their utility has been verified in industrial settings since 2012, if not earlier?
It's never been easier to implement these models. Let me give you a brief overview of what they do, and why they make a great business case.

by Laurence Yang

1. Exchange Rates and Media Uptake
The first benefit is that media component uptake rates are predicted, even for complex media. This is valuable when we don't want to measure the media (exometabolome) frequently.
The uptake rates are predicted largely from pathway protein allocation, and only to a lesser extent from the exact kinetic parameters. This may seem counterintuitive. But accounting for just protein size (molecular weight) and a baseline value for all rate constants shows that catabolite repression can be reproduced regardless of the exact turnover rates (Figure 1).
Figure 1. Complex media utilization of E. coli - measured and simulated with dynamic pcFBA and dynamic FBA without protein constraints. Measurements are from Beg et al. (2007) PNAS. Simulations are from my course on computational systems biology (contact me and I can send you the Jupyter notebook).
Protein-constrained models predict substrate utilization hierarchy
The exact order of hierarchical substrate uptake does vary with kinetics. But the main message is clear: there's an order to preferred substrate uptake, and some are preferred more than others.
And the order is reasonably accurate without any additional parameter tuning. Glucose is preferred the most and is used exclusively at the start. Because of crowding constraints, overflow metabolism occurs, leading to acetate accumulation. Once glucose is depleted, the next preferred substrate is taken up. Afterwards, several less preferred substrates are consumed simultaneously. The exact uptake rates for the suboptimal substrates would depend on the uptake kinetics, which is why the model depletes maltose before lactate whereas measurements show the opposite.
Notice that the acetate overflow continues even with the other, non-glucose carbon sources. Finally, once all other substrates are depleted, the acetate that had accumulated until then is re-utilized.
Substrate utilization hierarchy is an emergent property of one simple biological constraint: intracellular crowding, first introduced by Beg et al (2007).
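To make the crowding idea concrete, here's a minimal, self-contained sketch (toy numbers, not the genome-scale model): two substrates compete for a fixed proteome budget, and a linear program reproduces the preference hierarchy.

```python
# Toy crowding-constraint LP: two substrates compete for a fixed
# proteome budget. All numbers are illustrative.
from scipy.optimize import linprog

# growth yield per unit uptake flux (1/h per mmol/gDW/h)
yield_glc, yield_lac = 0.10, 0.06
# proteome cost per unit flux (g protein per mmol/gDW/h), roughly MW / kcat
cost_glc, cost_lac = 0.01, 0.02
proteome_budget = 0.15            # g catabolic protein per gDW

# maximize growth  ->  minimize negative growth
res = linprog(
    c=[-yield_glc, -yield_lac],
    A_ub=[[cost_glc, cost_lac]],  # intracellular crowding constraint
    b_ub=[proteome_budget],
    bounds=[(0, 10), (0, 10)],    # transporter capacity limits
)
v_glc, v_lac = res.x
# glucose (best growth per unit protein) is used to capacity first;
# lactate only gets the leftover proteome budget
```

With these numbers, glucose uptake hits its transporter cap (10) while lactate only receives the residual budget (2.5): the hierarchy emerges from the single crowding constraint, not from tuned kinetics.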
Application in Industry — Mammalian Cell Lines
In a previous project on a computational biology team in industry, we faced the challenge of being unable to determine media component uptake rates for protein-producing mammalian cell lines. Time-course exometabolomics would become expensive and labor-intensive very quickly.
We applied the protein constraints to our in-house cell simulators and gained the capability to predict metabolite exchange rates in complex media for mammalian cell lines.
We needed quantitatively accurate uptake rates in this case, so we used a small set of measurements to calibrate the model parameters. As we saw in the example, even anchoring just the average rate constant across all enzymes is helpful, and this requires only one or two conditions in duplicate, so 2-4 samples.
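A sketch of this anchoring step (with illustrative numbers, not our in-house data): because predictions scale linearly with the average rate constant, a handful of measured uptake rates pins it down in closed form.

```python
# One-parameter calibration sketch: fit a single global scaling factor
# for the average rate constant. Numbers are made up for illustration.
import numpy as np

# model-predicted uptake rates with a baseline k_avg = 1 (arbitrary units)
predicted_baseline = np.array([8.0, 3.2, 1.1, 0.4])
# measured uptake rates from a few calibration samples (mmol/gDW/h)
measured = np.array([9.6, 4.1, 1.2, 0.5])

# predictions scale linearly with k_avg, so the least-squares estimate
# of the scaling factor has a closed form
k_avg = measured @ predicted_baseline / (predicted_baseline @ predicted_baseline)
calibrated = k_avg * predicted_baseline
```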
The Business Case
Cost savings for time-course exometabolomics
Assuming we run 6 flasks of cell culture per week, time-course metabolomics at 4 timepoints would cost $3,000 for 30 samples (4 timepoints x 6 flasks + 6 QC samples), or $3,500 including labor.
With a calibrated model, we don't need to measure as often. And we can predict media concentrations comprehensively (>50 metabolites) without being limited by the profiling approach (targeted vs. high-throughput).
If we use the simulator, the first run's $3,500 goes toward calibrating the model for each cell line. Then, every subsequent week, we save $3,000.
Let's consider the hypothetical scenario of running exometabolomics every week, for 50 weeks/year (insert your own numbers here!). The first week of data is used for model calibration; for the remaining weeks, we no longer need to measure exometabolomes. Over a year we've saved $147,000 (49 working weeks, excluding the first calibration week). Per cell line. This works for microbes, too.
Of course, in reality we'd want to perform several additional exometabolome experiments for prospective validation. The savings are still substantial.
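The arithmetic above can be checked in a few lines (swap in your own numbers):

```python
# Back-of-envelope savings calculation for weekly exometabolomics
runs_per_year = 50        # weekly exometabolomics runs
cost_per_run = 3_000      # assay cost per run, USD
calibration_cost = 3_500  # first run incl. labor, used for calibration

weeks_saved = runs_per_year - 1              # first week calibrates the model
annual_savings = weeks_saved * cost_per_run  # = $147,000 per cell line
```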
  • Initial calibration: $3,500 for the first run to calibrate the model for each cell line.
  • Weekly savings: $3,000 saved every subsequent week after calibration.
  • Annual savings: $147,000 saved over 49 working weeks (excluding the first calibration week).
Additional savings with time-course proteomics and transcriptomics
One advanced multi-scale modeling technique in this field is the dynamic whole-cell simulator of metabolic and protein expression networks (DynamicME, 2019). These models can predict time-course proteomes and transcriptomes.
So, in addition to the exometabolomics savings above, we save by simulating time-course proteomes and transcriptomes. The costs are $12,000 for proteomics and $10,000 for RNA-Seq (24 samples plus QC, plus $700 in labor costs). Over 49 weeks, that is $588,000 saved for proteomics and $490,000 for RNA-Seq. The simulator predicts both proteomes and transcriptomes, so the total savings is >$1M per year.
(Or seen another way, this is how much the simulated data is potentially worth.)
Return on Investment for multi-scale simulators
Assume that a fully tailored simulator costs $100k to develop for a specific organism or cell line, including deployment and calibration. Based on the combined cost savings above, this investment yields a 10x return in one year and 20x in two years.
Furthermore, the simulator pays for itself within 1-2 months of deployment for teams that run experiments continuously. For high-throughput bio-foundry settings, this can be a game-changer in the long run. Especially considering that the models get refined and more accurate over iterations.
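Putting the figures above together (again, plug in your own numbers):

```python
# Combined savings and first-year ROI from the figures quoted above
weeks = 49                        # working weeks after the calibration week
exometabolomics = weeks * 3_000   # $147,000
proteomics      = weeks * 12_000  # $588,000
rna_seq         = weeks * 10_000  # $490,000
total_savings = exometabolomics + proteomics + rna_seq  # $1,225,000

simulator_cost = 100_000
roi_year_one = total_savings / simulator_cost  # exceeds the 10x quoted above
```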
  1. Initial investment: $100k for a fully tailored simulator. Breaks even in 1-2 months.
  2. First-year return: 10x return on investment.
  3. Second-year return: 20x return on investment.
  4. Continuous improvement: the digital twin becomes better over time.
2. One-Shot Learning from Omics
I want to elaborate on why simulators can learn so much from seemingly few data sets.
Model calibration doesn't need to be overly complicated, nor does it necessarily require massive amounts of data. Even a single proteomic or RNA-Seq dataset, in replicate, can be sufficient for tuning model parameters.
This efficiency stems from leveraging established biochemical and thermodynamic constraints, allowing precise parameter estimation while maintaining validated relationships.
The approach parallels one-shot learning in AI, where structured prior knowledge enables predictions from limited new data. We're not estimating all parameters from scratch, but rather fine-tuning specific parameters while maintaining validated relationships established through structural properties and previous datasets.
This practical reality challenges the common assumption that extensive experimental data is necessary for model calibration. And I've applied similar approaches successfully as early as 2012 in a fast-paced industry setting for teams making everything from biologics to biofuels.
Since then, this field has advanced considerably. A recent example from our lab is OVERLAY (Yao et al., 2023):
Example - time-course transcriptomics for algae cultures
The experimental conditions were triacylglyceride production by algae grown mixotrophically under nitrogen-deprived, acetate-supplemented conditions.
16 time-course RNA-Seq samples were used for calibration. The one-shot "training" (few-shot in this case) achieved R² = 0.96 for the simulated proteome against RNA-Seq. Only 18% (214 / 1222) of rate constants needed to change value during calibration. As noted, these models are already heavily constrained by prior knowledge.
An added benefit of the few-shot learning: we investigated proteins with worst-fit simulated abundances. We found five proteins whose gene-protein-reaction associations were incorrect in the original model. These were fixed, making the updated model more accurate for subsequent studies.
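A minimal sketch of this diagnostic workflow, with synthetic data standing in for the real proteome: compute R² between simulated and measured abundances, then flag the worst-fit proteins as candidates for curation.

```python
# Fit diagnostics sketch: R^2 between simulated and measured (log-scale)
# protein abundances, plus flagging of worst-fit proteins. Synthetic data.
import numpy as np

rng = np.random.default_rng(0)
measured = rng.normal(loc=5.0, scale=2.0, size=200)     # log10 abundances
simulated = measured + rng.normal(scale=0.4, size=200)  # good fit overall
simulated[:5] += 4.0  # five proteins with (synthetic) mis-annotated GPRs

residuals = simulated - measured
ss_res = np.sum(residuals**2)
ss_tot = np.sum((measured - measured.mean())**2)
r2 = 1 - ss_res / ss_tot

# proteins with the largest residuals are candidates for incorrect
# gene-protein-reaction (GPR) rules
worst = np.argsort(-np.abs(residuals))[:5]
```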
Takeaways
  • It's never been easier or faster to implement multi-scale simulators
  • The benefits in industrial settings have been validated since 2012
  • The return on investment is clear
Do you want to calculate the ROI of protein-constrained cell simulators for your specific case?
Curious whether these benefits apply for your situation (strain, cell line, process, product)?
Book a call - let's talk