Tn5 Bias Correction + Homolog Transfer Learning Boosts Protein Engineering

Today's Overview

  • Correcting Tn5 Accessibility Bias in CUT&Tag Epigenomic Profiling: Tn5 transposase introduces systematic open chromatin bias in CUT&Tag data, particularly problematic for repressive histone marks and single-cell applications.
  • Reusing Homolog Fitness Data to Predict Variant Effects in Protein Engineering: Fitness translocation uses protein language model embeddings to transfer experimental variant fitness data from homologous proteins to a target protein, generating synthetic training examples by applying homolog mutation vectors to the target wild type.
  • Do Genomic Foundation Models Actually Learn Biology? A Reality Check: Randomly initialized character-token models often match pretrained k-mer/BPE genomic foundation models across 52 tasks, questioning the cost-efficiency of current pretraining approaches.
  • Cenote-Taker 3 for Fast and Accurate Virus Discovery and Annotation of the Virome

Featured

01 Correcting Tn5 Accessibility Bias in CUT&Tag Epigenomic Profiling

Mapping **histone modifications** across the genome is fundamental to understanding gene regulation, chromatin states, and cellular identity. **CUT&Tag** has emerged as a powerful alternative to ChIP-seq for profiling histone marks and transcription factors, offering lower input requirements and compatibility with single-cell applications. However, CUT&Tag relies on the **Tn5 transposase** for DNA tagmentation, which exhibits strong preference for accessible chromatin regions. This **open chromatin bias** systematically distorts read distributions, artificially enriching signal at accessible sites regardless of the true occupancy of the target histone mark or protein. The problem is particularly acute for **repressive modifications** like H3K27me3 and H3K9me3, which naturally localize to closed chromatin, and becomes more severe in sparse single-cell datasets where signal-to-noise ratios are already challenging.

The authors demonstrate that this accessibility bias pervades published CUT&Tag datasets, including those generated with optimized high-salt protocols intended to reduce background. To address this, they developed **PATTY (Propensity Analyzer for Tn5 Transposase Yielded bias)**, a computational method that corrects CUT&Tag data by leveraging paired **ATAC-seq** measurements of chromatin accessibility from the same samples. PATTY models the Tn5 insertion propensity and removes accessibility-driven artifacts from the CUT&Tag signal. The authors validated PATTY's performance across multiple histone marks including the active mark **H3K27ac** and repressive marks **H3K27me3** and **H3K9me3**, showing improved peak calling accuracy and consistency with orthogonal experimental data.
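PATTY's published model is more involved than this, but the core move of removing the accessibility-predicted component from CUT&Tag signal can be sketched as a simple regression against paired ATAC-seq counts per genomic bin. The function name and the linear log-log form here are illustrative assumptions, not the published method:

```python
import numpy as np

def correct_accessibility_bias(cuttag, atac):
    """Remove the ATAC-predicted (accessibility-driven) component from
    per-bin CUT&Tag signal via least-squares regression on log counts.
    Illustrative simplification of a PATTY-style correction."""
    x = np.log1p(atac)
    y = np.log1p(cuttag)
    # Fit y ~ a*x + b; the fitted part models Tn5 open-chromatin bias
    a, b = np.polyfit(x, y, deg=1)
    residual = y - (a * x + b)
    # Shift so the corrected signal is non-negative on the log scale
    return residual - residual.min()

# Toy example: bins where higher accessibility inflates apparent signal
atac = np.array([1.0, 5.0, 50.0, 200.0, 500.0])
cuttag = np.array([10.0, 12.0, 40.0, 90.0, 180.0])
corrected = correct_accessibility_bias(cuttag, atac)
```

After correction, bins that were enriched purely because Tn5 inserts more readily into open chromatin no longer dominate the ranking.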

Using machine learning integration of transcriptomic and corrected epigenomic profiles, the authors show that PATTY-corrected data better predict gene expression patterns and chromatin states. For single-cell applications, they developed an analysis framework incorporating PATTY correction and demonstrate **improved cell type clustering** compared to uncorrected data, addressing a critical bottleneck in single-cell epigenomics. Validation includes comparison with known biological ground truth and experimental confirmation of predicted binding sites. While PATTY requires paired ATAC-seq data (adding experimental cost), the method provides a systematic solution to a pervasive technical artifact. The approach may extend beyond CUT&Tag to other Tn5-based assays, potentially including ATAC-seq itself, establishing a framework for bias correction in widely adopted epigenomic technologies.

  • Tn5 transposase introduces systematic open chromatin bias in CUT&Tag data, particularly problematic for repressive histone marks and single-cell applications.
  • PATTY corrects this bias using paired ATAC-seq data, improving peak calling accuracy for H3K27ac, H3K27me3, and H3K9me3, with in silico analysis backed by experimental confirmation.
  • Bias-corrected single-cell CUT&Tag data enable more accurate cell type clustering and better integration with transcriptomic data.

02 Reusing Homolog Fitness Data to Predict Variant Effects in Protein Engineering

Protein engineering relies on predicting which amino acid substitutions will improve or impair function, but **fitness data scarcity** remains a fundamental bottleneck. Experimentally measuring variant effects through deep mutational scanning or directed evolution is resource-intensive, often yielding datasets of only hundreds to thousands of variants for a single protein. This data limitation severely constrains supervised machine learning models that could otherwise guide rational design of enzymes, fluorescent proteins, or therapeutic antibodies. The core challenge is whether fitness information from evolutionary relatives can be systematically transferred to a target protein of interest.

This study introduces **fitness translocation**, a biologically grounded data augmentation strategy that exploits homologous proteins to synthetically expand training datasets. The method operates in the embedding space of protein language models (PLMs), which capture evolutionary and structural patterns from billions of natural sequences. For a variant in a homologous protein, the approach computes the embedding difference between the homolog's wild type and mutant, then applies this delta vector to the target protein's wild-type embedding to generate a synthetic variant. The fitness value from the homolog variant is assigned to this synthetic target variant, effectively translating experimental measurements across protein families. This differs from naive sequence alignment approaches by leveraging the rich, context-aware representations learned by transformer-based PLMs like ESM-2.
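The augmentation step described above can be sketched in a few lines. A toy character-frequency embedder stands in for a real PLM (the paper uses ESM-2 embeddings); all names here are hypothetical illustrations, not the authors' code:

```python
import numpy as np

def translocate_fitness(embed, target_wt_seq, homolog_wt_seq, homolog_variants):
    """Generate synthetic (embedding, fitness) training pairs for the target
    protein by applying homolog mutation delta vectors to the target
    wild-type embedding. `embed` maps a sequence to a fixed-size vector."""
    z_target = embed(target_wt_seq)
    z_homolog = embed(homolog_wt_seq)
    synthetic = []
    for mutant_seq, fitness in homolog_variants:
        delta = embed(mutant_seq) - z_homolog      # mutation vector in embedding space
        synthetic.append((z_target + delta, fitness))  # reuse the homolog's fitness label
    return synthetic

def toy_embed(seq):
    """Stand-in embedder: amino acid frequency vector (a real PLM would
    return mean-pooled transformer states instead)."""
    alphabet = "ACDEFGHIKLMNPQRSTVWY"
    v = np.zeros(len(alphabet))
    for aa in seq:
        v[alphabet.index(aa)] += 1.0
    return v / max(len(seq), 1)

pairs = translocate_fitness(toy_embed, "MKVLA", "MKILG",
                            [("MKALG", 0.8), ("MKILD", 0.3)])
```

The resulting synthetic pairs are simply concatenated with the target's own (embedding, fitness) examples before fitting a supervised regressor.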

The authors validate fitness translocation across three protein families with distinct functional contexts: **IGPS** (indole-3-glycerol phosphate synthase, enzymatic activity), **GFP** (green fluorescent protein, fluorescence intensity), and **SARS-CoV-2 spike proteins** (viral entry, ACE2 binding). Evaluation is performed **in silico** using held-out experimental fitness measurements as ground truth, testing multiple prediction models including ridge regression, random forests, and gradient boosting. Across all families and model architectures, fitness translocation consistently improves Spearman correlation between predicted and measured fitness, with gains most pronounced when training data is limited (10-100 variants). Remarkably, the method works even between remote homologs sharing only **35% sequence identity**, suggesting broad applicability across diverse protein families. The approach demonstrates that historical fitness data from related proteins—often generated in different labs for different purposes—can be systematically repurposed to accelerate engineering of new targets, offering a path toward more data-efficient computational protein design.

  • Fitness translocation uses protein language model embeddings to transfer experimental variant fitness data from homologous proteins to a target protein, generating synthetic training examples by applying homolog mutation vectors to the target wild type.
  • The method improves variant effect prediction accuracy across IGPS, GFP, and SARS-CoV-2 spike protein families in silico, with benefits persisting even between remote homologs at 35% sequence identity and greatest gains under limited training data regimes.
  • This data augmentation strategy enables reuse of accumulated deep mutational scanning datasets across protein families, potentially reducing experimental burden in protein engineering campaigns.

03 Do Genomic Foundation Models Actually Learn Biology? A Reality Check

**Genomic foundation models (GFMs)** promise to transform computational biology by learning universal representations of DNA sequences through large-scale pretraining, analogous to how GPT models revolutionized natural language processing. The hypothesis is compelling: train on vast genomic datasets, then fine-tune for specific tasks like variant effect prediction, gene expression forecasting, or regulatory element identification. However, this approach assumes that unsupervised pretraining on genomic sequences captures biologically meaningful patterns that transfer to downstream applications. This study rigorously tests that assumption by comparing seven prominent GFMs against a surprisingly simple baseline—randomly initialized models with identical architectures.

The authors evaluated models across **52 diverse genomic tasks** spanning regulatory genomics, variant effect prediction, and sequence classification. The results challenge prevailing assumptions about GFM utility. **Character-token models with random initialization often matched or exceeded the performance of pretrained k-mer and byte-pair encoding (BPE) models**, despite the latter requiring substantial computational investment for pretraining. Only subword tokenization approaches showed consistent benefits from pretraining, suggesting that **tokenizer choice fundamentally determines whether pretraining provides value**. This finding has immediate practical implications: practitioners may achieve comparable performance with simpler, faster-to-train character-level models for many genomic tasks.
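The tokenization contrast at the heart of this finding is easy to make concrete. The sketch below shows single-nucleotide versus overlapping k-mer tokenization (real GFM tokenizers also add special tokens, and BPE uses learned merges rather than fixed k-mers):

```python
def char_tokenize(seq):
    """One token per nucleotide: the baseline that often matches
    pretrained models even when trained from random initialization."""
    return list(seq)

def kmer_tokenize(seq, k=3, stride=1):
    """Overlapping k-mer tokens, as used by DNABERT-style models.
    Vocabulary grows as 4**k instead of 4, changing what pretraining
    can contribute."""
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, stride)]

seq = "ACGTAC"
assert char_tokenize(seq) == ["A", "C", "G", "T", "A", "C"]
assert kmer_tokenize(seq, k=3) == ["ACG", "CGT", "GTA", "TAC"]
```

A single-base substitution changes exactly one character token but up to k overlapping k-mer tokens, one plausible reason tokenizer choice interacts so strongly with what pretraining learns.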

More concerning for clinical applications, the study reveals that **current GFMs fail to capture clinically relevant genetic mutations**. When tested on annotated variants from databases like ClinVar, model embeddings and log-likelihood ratios showed limited sensitivity to pathogenic versus benign mutations. This represents a critical gap, as variant effect prediction is among the most important applications for genomic AI in precision medicine. The models appear to learn statistical patterns in DNA sequences without necessarily encoding the functional consequences of sequence changes.
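The log-likelihood-ratio test mentioned above scores a variant by comparing model probabilities of the alternate and reference base at the mutated position. A minimal sketch, with a stand-in scoring function (`toy_log_prob`, the masking convention, and the function names are assumptions for illustration, not any specific GFM's API):

```python
import math

def variant_llr(log_prob, seq, pos, alt):
    """Score a single-nucleotide variant as the log-likelihood ratio of
    alternate vs. reference base at `pos`, given flanking context.
    `log_prob(context, base)` stands in for a genomic model's
    masked-position log-probability."""
    context = seq[:pos] + "N" + seq[pos + 1:]  # mask the queried position
    ref = seq[pos]
    return log_prob(context, alt) - log_prob(context, ref)

def toy_log_prob(context, base):
    """Stand-in model that simply prefers the base matching the left
    neighbor of the masked position."""
    i = context.index("N")
    left = context[i - 1] if i > 0 else "A"
    return math.log(0.7 if base == left else 0.1)

# ref base is "G"; the toy model prefers the alt "C" (matches left neighbor)
score = variant_llr(toy_log_prob, "ACGT", 2, "C")
```

The study's finding is that, computed with real GFMs, scores like this separate ClinVar pathogenic from benign variants only weakly.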

All evaluations were **in silico**, comparing model predictions against curated genomic annotations and experimental datasets. The findings suggest that simply scaling up NLP-style pretraining on genomic sequences may not suffice. Instead, the authors advocate for **biologically informed tokenization strategies** that respect functional units like codons or regulatory motifs, and **variant-aware training objectives** that explicitly teach models about mutation effects. For practitioners currently investing in or deploying GFMs, these results recommend careful baseline comparisons and tokenizer selection as first steps before committing to expensive pretraining regimes.

  • **Randomly initialized character-token models often match pretrained k-mer/BPE genomic foundation models across 52 tasks**, questioning the cost-efficiency of current pretraining approaches.
  • **Tokenizer choice determines pretraining value**: character-level models show minimal gains from pretraining while subword models benefit, suggesting architecture-tokenizer interactions drive performance more than scale.
  • **Current genomic foundation models fail to capture clinically relevant mutations**, with embeddings showing limited sensitivity to pathogenic variants, indicating a need for variant-aware training objectives.

04 Cenote-Taker 3 for Fast and Accurate Virus Discovery and Annotation of the Virome

Also Worth Noting

05
Reinforcement Learning Agent Dynamically Reweights Training Samples for Deepfake Detection This work introduces a Tutor-Student Reinforcement Learning framework where a PPO agent (Tutor) learns to dynamically assign continuous weights to training samples based on their visual features and learning history, rewarding transitions from incorrect to correct predictions to prioritize hard-but-learnable examples, thereby improving the deepfake detector's (Student's) generalization to unseen manipulation techniques. link (CVPR)
06
Free-Market Algorithm: Self-Organizing Optimization Without Predefined Fitness (Synthesis & Retro) **FMA uses emergent supply-demand dynamics to discover complex molecular pathways (amino acids, nucleobases, Krebs cycle intermediates from bare atoms in minutes) and forecast GDP across 33 countries without preset fitness functions or fixed search spaces**, validated in silico for prebiotic chemistry and macroeconomic prediction with performance matching professional forecasters. link
07
Schema Internalization Cuts Text-to-SQL Costs by 99% A two-phase fine-tuning method enables an 8B-parameter model to internalize database schemas, reducing input tokens from 17k to <100 while achieving 98.4% execution accuracy in production deployment for cricket statistics queries at scale. link (AAAI)
08
Light-Controlled Actin Drives Protocell Motility in Synthetic Vesicles This study demonstrates **optically controlled actin polymerization** within giant unilamellar vesicles that generates **directional movement at speeds up to 0.43 μm/min** (comparable to adherent mammalian cells), establishing a minimal synthetic system requiring both branched and linear actin networks for membrane protrusion and providing a bottom-up platform for studying cytoskeletal mechanics in cell migration. link
09
RLHF Alignment Causes Response Collapse, Degrading Uncertainty Estimation (Clinical & Medical AI) **RLHF-aligned language models collapse 40-79% of responses into single semantic clusters**, rendering sampling-based uncertainty methods uninformative (AUROC=0.500) while token entropy retains discriminative power (0.603-0.724), with ablations localizing the effect to DPO training and demonstrating that orthogonal uncertainty signals enable cost-effective selective prediction cascades. link
10
GPS Map-Matching via Hierarchical Spatial-Temporal Graph Learning This paper presents HSTGMatch, a deep learning model that improves GPS trajectory map-matching through hierarchical self-supervised pretraining and adaptive graph attention networks that capture spatial-temporal movement patterns, addressing labeling scarcity and distribution shift challenges in location-based applications. link
11
Reasoning-Enhanced Generative Search with Self-Distillation for E-Commerce OneSearch-V2 augments generative retrieval with latent reasoning through thought-augmented query understanding and self-distillation training to internalize complex user intent, achieving +3.98% CTR and +3.05% conversion in production e-commerce search while mitigating filter bubbles without added latency. link
12
Physics-Based Evaluation Reveals Systematic Interaction Errors in Protein Structure Models (Protein Structure) A physics-grounded evaluation framework reveals that state-of-the-art protein structure prediction models (AlphaFold2, AlphaFold3, ESMFold) systematically mispredict 30-60% of side-chain non-covalent interactions and fail to capture conformational ensembles due to biased understanding of atomic interaction energetics, despite accurate coordinate prediction. link
13
ML-Based Memory Corruption Detection for WebAssembly Security Walma uses convolutional neural networks to classify WebAssembly memory snapshots for detecting corruption and tampering, achieving effective detection in structured memory layouts with 1.07x–1.8x runtime overhead depending on instrumentation granularity. link
14
Iterative Feature Selection Improves Clustering of High-Dimensional Biological Data (Genomics & Omics) i-IF-Learn jointly performs feature selection and clustering in high-dimensional biological datasets (gene microarrays, scRNA-seq) through an adaptive statistic that combines pseudo-label supervision with unsupervised signals, outperforming classical and deep learning baselines while identifying biologically meaningful influential features that enhance downstream deep models. link
15
Integrative Modeling Resolves Disordered Regions in HDAC2 Chromatin Complex (Protein Structure) This study combines crosslinking mass spectrometry with computational modeling (I-TASSER, HADDOCK, AlphaFold) to structurally characterize the intrinsically disordered region-driven assembly of the HDAC2:MIER1:MHAP1 chromatin remodeling complex, revealing that HDAC2's poorly characterized C-terminal IDR mediates critical protein-protein interactions that AlphaFold alone fails to capture. link
16
Electrostatic Networks and Lipid Interactions Govern CFTR Channel Dynamics (Molecular Dynamics) All-atom molecular dynamics simulations of human CFTR in heterogeneous lipid bilayers identified 557 electrostatic interactions that stabilize channel architecture, coordinate anion conduction through dual portals (TM4/TM6 and TM10/TM12), and mediate selective cholesterol/phosphatidylserine binding, revealing how the potentiator VX-770 subtly modulates these networks to enhance channel function. link

Today's Observation

The convergence of machine learning and experimental biology continues to surface fundamental questions about what our models actually learn and how we validate them. A sobering reality check on **genomic foundation models** reveals that random initialization baselines can match or exceed the performance of pretrained models on downstream tasks, particularly when fine-tuning data is abundant. This challenges the assumption that self-supervised pretraining on DNA sequences inherently captures biologically meaningful representations. For practitioners in AI-driven drug discovery, this suggests that **task-specific architectures and training strategies may matter more than pretraining scale** when working with genomic data. The implication is clear: before investing computational resources in foundation model pretraining, teams should rigorously benchmark against simpler baselines and critically assess whether pretraining objectives align with downstream prediction tasks like variant pathogenicity or regulatory element identification.

Transfer learning shows more promise in the protein engineering domain, where **fitness landscape data from homologous proteins can improve variant effect prediction even at 35% sequence identity**. This work demonstrates that evolutionary information encoded in homolog fitness measurements—obtained through deep mutational scanning or directed evolution—transfers across protein families to enhance predictions for target proteins with limited experimental data. The practical value lies in data efficiency: rather than conducting exhaustive mutagenesis screens for every engineering target, teams can leverage existing fitness datasets from related proteins. The method works across diverse protein types including enzymes, fluorescent proteins, and viral proteins, with performance gains most pronounced when training data for the target protein is scarce. This validates a **meta-learning approach to protein design** where accumulated experimental knowledge across homologs serves as inductive bias.

Meanwhile, the experimental measurement side faces its own biases that AI must account for. The PATTY algorithm addresses **Tn5 transposase open-chromatin bias** in CUT&Tag epigenomic profiling, a widely used technique for mapping histone modifications and transcription factor binding. Tn5's preference for accessible chromatin creates systematic distortions in the measured signal, confounding biological occupancy with technical artifact. By modeling and correcting this bias, PATTY improves the accuracy of chromatin state inference, which feeds into AI models predicting gene regulation and expression. For teams building models on epigenomic data—whether for target identification or understanding drug mechanism of action—this highlights the importance of **preprocessing pipelines that remove technical confounders** before training. The broader lesson applies to virome analysis as well, where Cenote-Taker 3 automates virus genome discovery and annotation in metagenomic sequencing data, addressing the challenge of identifying novel viral sequences without relying on close reference genomes. Accurate viral genome characterization matters for understanding host-pathogen interactions and identifying therapeutic targets, but requires computational tools that can handle the extreme diversity and rapid evolution of viral sequences.