"What we observe is not nature itself, but nature exposed to our method of questioning." Werner Heisenberg

AI & representation learning for biology

Methods that probe and quantify what protein and DNA language models actually encode.

Model- and task-agnostic evaluation of uncertainty in protein/DNA embeddings

"All models are wrong, ..." We formulate the Random Neighbor Score (RNS) to quantify the uncertainty of biomolecule representations (protein, DNA, or RNA) in a language model's latent space.

Publication: Prabakaran, R*., Bromberg, Y*. Quantifying uncertainty in protein representations across models and tasks. Nat Methods 23, 796–804 (2026) 9 citations
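
One way to read the name "Random Neighbor Score" is sketched below in Python. The sketch assumes, hypothetically, that a query embedding is scored by the fraction of its k nearest neighbors that come from randomized (non-biological) sequences. The function name, the Euclidean metric, the value of k, and the synthetic data are illustrative assumptions; the published definition may differ.

```python
import numpy as np

def random_neighbor_score(query, real_emb, random_emb, k=10):
    """Fraction of each query's k nearest neighbors that are random-sequence embeddings.

    Assumed reading of the score name, not the published formula.
    """
    query = np.atleast_2d(np.asarray(query, dtype=float))
    real_emb = np.asarray(real_emb, dtype=float)
    random_emb = np.asarray(random_emb, dtype=float)

    # Pool real and random embeddings into one reference set, remembering the origin.
    reference = np.vstack([real_emb, random_emb])
    is_random = np.concatenate([
        np.zeros(len(real_emb), dtype=bool),
        np.ones(len(random_emb), dtype=bool),
    ])

    # Pairwise Euclidean distances, shape (n_queries, n_reference).
    dists = np.linalg.norm(query[:, None, :] - reference[None, :, :], axis=2)

    # Indices of the k closest reference points for every query.
    nearest = np.argsort(dists, axis=1)[:, :k]
    return is_random[nearest].mean(axis=1)

# Synthetic stand-ins for language-model embeddings (hypothetical data).
rng = np.random.default_rng(0)
real = rng.normal(loc=0.0, scale=1.0, size=(50, 8))
shuffled = rng.normal(loc=10.0, scale=1.0, size=(50, 8))
queries = np.array([np.zeros(8), np.full(8, 10.0)])
scores = random_neighbor_score(queries, real, shuffled, k=10)
print(scores)
```

Under this reading, a score near 0 means the embedding sits among real sequences, while a score near 1 means the model places it among random ones, which signals an unreliable representation.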

RNS

RNS with accelerated mode for protein residues and metagenomic reads (work in progress)

A faster version of the Random Neighbor Score (RNS) that estimates representation uncertainty for individual protein residues and for large sets of metagenomic reads.

Metagenomics & reference-free function discovery

Reference-free discovery of molecular function in metagenomes using DNA language models.

Development of LMs for novel reference-free metagenomic annotation

We developed REMME, a foundational DNA language model (dLM) that focuses on the biological properties of DNA rather than only the semantic, “language-like” properties of short reads. This focus is crucial for analyzing the functionality of rare organisms, which are significantly underrepresented in microbiomes. To demonstrate the usefulness of REMME, we then developed REBEAN, a model fine-tuned to predict the enzymatic potential captured by individual metagenomic reads. REBEAN does not rely on sequence comparisons, which enables higher coverage and faster, more generalizable analysis of metagenomic samples, even when our understanding of the encoded functions is limited. Using REBEAN, we mined marine metagenomes and uncovered numerous novel (i.e., previously unseen) oxidoreductases, lighting a path through the microbial “dark matter.”

Publication: Prabakaran R* and Bromberg Y*, Deciphering enzymatic potential in metagenomic reads through DNA language models, Nucleic Acids Research, 53 (16), 2025, gkaf836 3 citations

REBEAN graphical abstract

REBEAN end-to-end Nextflow pipeline for novel enzyme mining (work in progress)

An end-to-end Nextflow pipeline that runs REBEAN over metagenome-assembled genomes and reads to mine novel enzymes.


Source (private): https://github.com/rpkarandev/mag-rebean

Protein function annotation & assessment

New ways to predict and rigorously evaluate protein function beyond alignment and fixed ontologies.

A protein pair-based assessment of in silico prediction tools

To address the sequencing-accelerated annotation gap, we evaluated the ability of in silico protein function prediction tools to annotate orphan proteins identified in metagenomes. Lacking “ground truth” functional annotations, we transformed the assessment of function prediction into an evaluation of the functional similarity of protein pairs. This approach transcends the limitations of functional annotation vocabularies and provides a means to compare annotation methods that use different ontologies.

Publication: R Prabakaran, Yana Bromberg, Functional profiling of the sequence stockpile: a protein pair-based assessment of in silico prediction tools, Bioinformatics, Volume 41, Issue 2, February 2025, btaf035 3 citations
Source: https://bitbucket.org/bromberglab/siblings-detector/
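
The pair-based idea can be illustrated with a minimal Python sketch. It assumes, hypothetically, that each tool's output is a set of terms per protein, that two proteins are called "same function" when the Jaccard overlap of their predicted term sets reaches a threshold, and that a list of labeled pairs is already available. The function names, the threshold, and the toy data are illustrative and are not taken from the publication or the linked repository.

```python
def jaccard(terms_a, terms_b):
    """Jaccard overlap of two term collections (0.0 when both are empty)."""
    a, b = set(terms_a), set(terms_b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

def pair_accuracy(predictions, labeled_pairs, threshold=0.5):
    """Share of protein pairs whose same/different call matches the label.

    predictions:   dict mapping protein id -> iterable of predicted terms
                   (any vocabulary; terms are only compared within one tool)
    labeled_pairs: list of (protein_a, protein_b, same_function) tuples
    """
    correct = 0
    for prot_a, prot_b, same_function in labeled_pairs:
        overlap = jaccard(predictions.get(prot_a, ()), predictions.get(prot_b, ()))
        predicted_same = overlap >= threshold
        correct += int(predicted_same == same_function)
    return correct / len(labeled_pairs)

# Toy pairs (hypothetical): P1 and P2 share a function, P1 and P3 do not.
pairs = [("P1", "P2", True), ("P1", "P3", False)]

# Tool A predicts EC numbers; tool B predicts free-text keywords.
tool_a = {"P1": {"1.1.1.1"}, "P2": {"1.1.1.1"}, "P3": {"2.7.1.1"}}
tool_b = {
    "P1": {"oxidoreductase", "alcohol"},
    "P2": {"oxidoreductase", "aldehyde", "nad"},
    "P3": {"kinase"},
}

print(pair_accuracy(tool_a, pairs))
print(pair_accuracy(tool_b, pairs))
```

Because each tool's terms are compared only against that tool's own predictions, tools with different vocabularies can be scored on the same set of pairs.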

A siamese transformer twin model to identify protein pairs of the same function

We built a siamese model based on the transformer architecture to identify protein pairs of the same function. The model was trained using a contrastive loss.

Source (private): https://bitbucket.org/bromberglab/siamtwinprot/src/main/
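
For readers unfamiliar with contrastive training, the following Python sketch shows the classic margin-based contrastive loss on pairs of embeddings. The exact loss variant, margin, and distance used in the model are not stated here, so the formula, the `margin` default, and the function name are assumptions for illustration only.

```python
import numpy as np

def contrastive_loss(emb_a, emb_b, same_function, margin=1.0):
    """Mean margin-based contrastive loss over a batch of embedding pairs.

    emb_a, emb_b:  arrays of shape (n_pairs, dim), one row per protein
    same_function: array of 1 (same function) or 0 (different function)
    """
    emb_a = np.asarray(emb_a, dtype=float)
    emb_b = np.asarray(emb_b, dtype=float)
    y = np.asarray(same_function, dtype=float)

    # Euclidean distance between the two embeddings of each pair.
    dist = np.linalg.norm(emb_a - emb_b, axis=1)

    # Same-function pairs are pulled together (penalize any distance);
    # different-function pairs are pushed apart until they exceed the margin.
    pull = y * dist ** 2
    push = (1.0 - y) * np.maximum(0.0, margin - dist) ** 2
    return float(np.mean(pull + push))

# One close pair labeled "different" and one identical pair labeled "same".
a = np.array([[0.0, 0.0], [1.0, 1.0]])
b = np.array([[0.5, 0.0], [1.0, 1.0]])
print(contrastive_loss(a, b, same_function=[0, 1]))
```

In a siamese setup both proteins pass through the same encoder, so minimizing this loss shapes a single embedding space in which distance reflects functional similarity.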

Variant effect prediction in the age of machine learning

A comprehensive review and assessment of variant/mutation effect predictors. This collaborative project includes evaluation on multiple datasets that represent different aspects of pathogenic effect.

Publication: Bromberg Y, Prabakaran R, Kabir A, Shehu A. Variant Effect Prediction in the Age of Machine Learning. Cold Spring Harb Perspect Biol. 2024 Jul 1;16(7):a041467. 30 citations

Protein sequence–structure, aggregation & evolution

Sequence- and structure-level analysis of protein aggregation and evolutionary signal.

Tracing extant metal-binding spheres in protein structural space

We searched the protein structural space for primitive, extant metal-binding spheres to trace protein evolution.

Systematic study of factors contributing to peptide aggregation through extensive MD simulations

Aggregation of therapeutic monoclonal antibodies (mAbs) can negatively affect their chemistry, manufacturing, and control attributes and lead to undesirable immune responses in patients. The disruption of short sequence motifs called aggregation prone regions (APRs) found in amino acid sequences of mAb candidates can potentially mitigate their aggregation. In this work, we have performed molecular dynamics simulations to study the aggregation of an APR (VLVIY) found in λ light chains of human antibodies and its single point mutant KLVIY. Eighteen different multicopy peptide simulation systems of "VLVIY" and "KLVIY" were constructed by varying their concentrations, temperatures, termini capping, and flanking gate-keeper regions.

Publication: Prabakaran, R., Rawat, P., Yasuo, N., Sekijima, M., Kumar, S., & Gromiha, M. M. (2022). Effect of charged mutation on aggregation of a pentapeptide: Insights from molecular dynamics simulations. Proteins, 90(2), 405–417. 7 citations

Development of ML and statistical models to predict protein aggregation

We studied and reviewed mechanistic models, kinetic models, ML tools, and molecular dynamics approaches used to study the aggregation-prone regions, aggregation propensity, and aggregation rates of peptides and proteins. This review led to the development of new tools to predict aggregation-prone regions and aggregation rates: ANuPP, AggreRATE-Pred, and AbsoluRATE.

Selected publications:

Proteome-scale sequence & structural analysis of aggregation propensity

We surveyed the incidence of aggregation-prone regions (APRs) in multiple proteomes, using sequences of experimentally validated amyloid-fibril-forming peptides and computational prediction tools.

Publication: Prabakaran, R., Goel, D., Kumar, S., & Gromiha, M. M. (2017). Aggregation prone regions in human proteome: Insights from large-scale data analyses. Proteins, 85(6), 1099–1118. 32 citations

Other

Additional modeling work.

HySEIQR: Incorporation of quarantine and lockdown regulations into SEIQR model to predict COVID-19 progression
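
As background, the Python sketch below integrates a generic SEIQR compartmental model (Susceptible, Exposed, Infectious, Quarantined, Recovered) and models a lockdown as a reduction of the transmission rate from a given day. Only the model's name is given here, so the equations, parameter names, parameter values, and the way lockdown enters the model are textbook-style assumptions and not the HySEIQR formulation.

```python
def simulate_seiqr(days, population, beta, sigma, delta, gamma, gamma_q,
                   lockdown_start=None, lockdown_factor=1.0,
                   initial_infected=1.0, dt=0.1):
    """Forward-Euler integration of a generic SEIQR model.

    beta:    transmission rate          sigma:   E -> I rate
    delta:   I -> Q (quarantine) rate   gamma:   I -> R recovery rate
    gamma_q: Q -> R recovery rate
    lockdown_factor scales beta from day `lockdown_start` onward (assumed mechanism).
    Returns a list of daily (S, E, I, Q, R) tuples, starting at day 0.
    """
    s, e, i, q, r = population - initial_infected, 0.0, initial_infected, 0.0, 0.0
    history = [(s, e, i, q, r)]
    steps_per_day = int(round(1.0 / dt))

    for day in range(days):
        in_lockdown = lockdown_start is not None and day >= lockdown_start
        rate = beta * lockdown_factor if in_lockdown else beta
        for _ in range(steps_per_day):
            new_exposed = rate * s * i / population
            ds = -new_exposed
            de = new_exposed - sigma * e
            di = sigma * e - (delta + gamma) * i
            dq = delta * i - gamma_q * q
            dr = gamma * i + gamma_q * q
            s, e, i, q, r = (s + ds * dt, e + de * dt, i + di * dt,
                             q + dq * dt, r + dr * dt)
        history.append((s, e, i, q, r))
    return history

# Illustrative parameter values (not fitted to any data).
params = dict(days=200, population=1_000_000, beta=0.5, sigma=0.2,
              delta=0.1, gamma=0.1, gamma_q=0.1)
no_lockdown = simulate_seiqr(**params)
with_lockdown = simulate_seiqr(**params, lockdown_start=10, lockdown_factor=0.3)
print(round(no_lockdown[-1][0]), round(with_lockdown[-1][0]))
```

Comparing the final susceptible counts of the two runs shows how a regulation that lowers the transmission rate changes the projected course of the outbreak.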