Two Neurons Is All You Need — A Case Study on Interpretability in Protein Models

Nithin Parsan

--

Authors: Nithin Parsan and John Yang

tl;dr

  • Demonstrating sparse probing techniques from Gurnee et al. on ESM-2, we reveal that just two neurons can encode the catalytic function of serine proteases in a protein language model.
  • Using a binary classification task with wild-type and systematically mutated protein sequences, we find that two neurons in ESM-2’s fifth layer distinctly capture the presence of a functional catalytic serine.
  • Targeted ablations and activation manipulations confirm a causal role: these neurons drive the model’s predictions of serine protease function rather than merely correlating with it.
  • Learn more at reticular.ai and explore our visual findings at demo.reticular.ai.
  • Code + analyses + demo releasing soon

Mechanistic interpretability has emerged as a powerful tool for understanding large language models, scaling even to frontier models like GPT-4 and Claude. But can these techniques help us understand biological language models? At Reticular, we believe controllable protein design requires precisely this kind of deep model understanding.

In this post, we demonstrate a proof-of-concept applying sparse probing techniques from Gurnee et al. to ESM-2, a protein language model. We show how just two neurons encode one of biology’s most fundamental features: the catalytic machinery of serine proteases. By identifying and manipulating these neurons, we establish that the interpretability techniques developed for language models can transfer effectively to biological domains.

This work represents a small but concrete step toward Reticular’s mission: making protein design more controllable and interpretable. While language models can be steered through careful prompting, biological models require more precise control mechanisms. Understanding how these models encode biological features internally opens new possibilities for reliable protein engineering.

Why Study Serine Proteases?

In searching for biological features to probe in ESM-2, we needed a test case that would parallel the elegance of Gurnee et al.’s sparse probing demonstrations. Just as they showed how language models encode grammatical features like verb tense or compound words through specific neurons, we wanted to find similarly crisp, binary features in protein sequences.

Serine proteases offer an ideal parallel because they represent a clear binary property: either a sequence has a functional catalytic serine or it doesn’t. We know that mutating the catalytic serine abolishes function, giving us ground truth labels that are rare in biology. [1]

Creating a Clean Dataset

To translate these biochemical insights into a machine learning task, we constructed our dataset from the well-characterized trypsin family (EC 3.4.21.4) in SwissProt. Our positive examples are wild-type sequences with verified activity. For negative examples, we systematically mutated the catalytic serine to all 19 other amino acids, creating sequences we know are non-functional.
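
As a rough illustration of the dataset construction, here is a minimal sketch of how the negative examples could be generated. The names wild_type and catalytic_pos are placeholders for values taken from the SwissProt annotations; this is not necessarily the authors' exact code.

```python
# Sketch: building negative examples by mutating the catalytic serine.
# `wild_type` and `catalytic_pos` are illustrative placeholders standing in for
# a SwissProt trypsin sequence (EC 3.4.21.4) and its annotated catalytic serine.

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def make_catalytic_mutants(wild_type: str, catalytic_pos: int) -> list[str]:
    """Return the 19 single-point mutants replacing the catalytic serine."""
    assert wild_type[catalytic_pos] == "S", "expected a serine at the catalytic position"
    return [
        wild_type[:catalytic_pos] + aa + wild_type[catalytic_pos + 1:]
        for aa in AMINO_ACIDS
        if aa != "S"
    ]

# Each wild-type (positive) sequence thus yields 19 labelled-negative sequences.
```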

This gives us a clear binary classification task: can we find neurons in ESM-2 that specifically encode the presence of a functional catalytic serine? More importantly, this setup lets us distinguish between neurons that merely detect serine residues and those that specifically encode catalytic serines — a distinction that will prove crucial in our analysis.

The simplicity of this binary property makes it an excellent test case for exploring whether the interpretability techniques from language models can transfer to protein domains. If we can identify neurons that selectively encode this well-defined catalytic feature, it would suggest these methods can help us understand how protein language models represent biological properties more broadly.

Methods & Technical Approach

We used ESM-2, a protein language model trained on 65M protein sequences. The specific variant we used (ESM2-t6-8M-UR50D, 8M parameters) has 6 layers with an embedding dimension of 320 and a feed-forward (MLP) dimension of 1280, meaning each layer contains 1280 MLP neurons that we can probe for interpretable features.
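
For concreteness, here is one way to load this variant and confirm those dimensions, assuming the Hugging Face transformers implementation of ESM (module and config names follow that library; the fair-esm package differs).

```python
from transformers import AutoTokenizer, EsmModel

tokenizer = AutoTokenizer.from_pretrained("facebook/esm2_t6_8M_UR50D")
model = EsmModel.from_pretrained("facebook/esm2_t6_8M_UR50D")

print(model.config.num_hidden_layers)   # 6 transformer layers
print(model.config.hidden_size)         # 320: the embedding / residual width
print(model.config.intermediate_size)   # 1280: MLP neurons per layer (what we probe)
```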

Data Processing Pipeline

Our pipeline follows three main steps, sketched in code after the list:

  1. Activation Extraction: For each sequence in our dataset, we extract the post-GELU activations from the feed-forward layers of ESM-2. This gives us a tensor of shape (batch_size, sequence_length, 1280) for each layer.
  2. Sequence Length Aggregation: Since proteins have variable lengths and we’re interested in a specific position (the catalytic serine), we aggregate the sequence length dimension by taking the maximum activation across positions. This reduces our tensor to (batch_size, 1280).
  3. Final Preprocessing: After aggregation, we split our data into training and test sets, maintaining a balanced distribution of positive (wild-type) and negative (mutated) examples. It’s worth noting that the choice of splitting methodology can significantly impact results. [2]
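
A minimal sketch of steps 1 and 2, reusing the model and tokenizer loaded above and the mutant helper from the dataset sketch (wild_type and catalytic_pos remain illustrative placeholders). It assumes the Hugging Face EsmModel layout, where each layer's intermediate module applies a linear map followed by GELU.

```python
import torch

# Step 1: hook the post-GELU output of each layer's MLP. EsmIntermediate applies
# Linear then GELU, so the module output is already the post-GELU activation.
activations = {}  # layer index -> tensor of shape (batch, seq_len, 1280)

def make_hook(layer_idx):
    def hook(module, inputs, output):
        activations[layer_idx] = output.detach()
    return hook

handles = [
    layer.intermediate.register_forward_hook(make_hook(i))
    for i, layer in enumerate(model.encoder.layer)
]

# Positives: wild-type sequences; negatives: catalytic-serine mutants.
sequences = [wild_type] + make_catalytic_mutants(wild_type, catalytic_pos)
batch = tokenizer(sequences, return_tensors="pt", padding=True)
with torch.no_grad():
    model(**batch)

# Step 2: aggregate over sequence length with a max, giving (batch, 1280) per layer.
features = {i: acts.max(dim=1).values for i, acts in activations.items()}

for h in handles:
    h.remove()
```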

Sparse Probing Methodology

Following Gurnee et al., we implemented several methods for identifying important neurons: mean activation difference between classes, mutual information between activations and labels, L1-regularized logistic regression, one-way ANOVA F-statistic tests, Support Vector Machines (SVM) with hinge loss, and optimal sparse prediction using cutting planes.

Each method aims to identify the minimal set of neurons needed to classify our sequences, with different tradeoffs between speed, interpretability, and guarantees of optimality. For detailed comparisons of these methods and their practical implementations, we refer readers to the original paper.

We evaluate these methods using standard binary classification metrics (precision, recall, F1) with a focus on out-of-sample performance. Importantly, we follow the paper’s recommendation to use F1 score as our primary metric given the inherent class imbalance in our task — there are many more ways for a protein to be non-functional than functional.
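
To make two of these methods concrete, here is a sketch of ranking neurons by mean activation difference and then fitting an L1-regularized logistic probe on the top-k neurons, scored with out-of-sample F1. It assumes X is a (n_sequences, 1280) numpy feature matrix for one layer (from the pipeline above) and y is the 0/1 label vector; the other methods from the paper are not shown.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# X: (n_sequences, 1280) features for one layer; y: 1 = functional catalytic serine.
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

# Rank neurons by mean activation difference between classes.
mean_diff = X_train[y_train == 1].mean(axis=0) - X_train[y_train == 0].mean(axis=0)
top_k = np.argsort(-np.abs(mean_diff))[:2]          # e.g. k = 2 candidate neurons

# A sparse probe restricted to those k neurons, evaluated out of sample with F1.
probe = LogisticRegression(penalty="l1", solver="liblinear")
probe.fit(X_train[:, top_k], y_train)
print("neurons:", top_k, "F1:", f1_score(y_test, probe.predict(X_test[:, top_k])))
```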

Results: Finding the “Serine Catalytic Triad Neurons”

Performance Across Probe Methods

Following Gurnee et al.’s methodology, we compared different sparse probing approaches on our serine protease task. Here are the F1 scores on a held-out test set across methods and sparsity levels (k):

[Table: values are the maximum F1 scores achieved across all layers for each method and sparsity level k.]

Several striking findings emerge:

  1. Most methods achieve perfect F1 scores (1.000) with just k=2 neurons
  2. Random selection performs poorly across all k values, validating our methodology
  3. SVM shows the strongest single-neuron performance (0.667)
  4. The performance plateaus completely after k=2, suggesting we’ve found a minimal representation

The Two-Neuron Solution

Our analysis converged on two key neurons in Layer 5 that appear to encode catalytic serine functionality. [3] Their key statistics are summarized below.

The robustness of this finding is particularly noteworthy:

Neuron 106 emerges as the primary feature detector:

  • Consistently identified across all methods
  • Large mean activation difference (1.560)
  • Strong effect size (2.462)
  • Clear separation between mean activations for positive (8.153) and negative (6.593) cases

Neuron 110 appears to play a supporting role:

  • Identified by multiple methods (MI, Mean Diff, SVM)
  • Moderate but consistent effect size (1.158)
  • Smaller but significant activation difference between classes

This two-neuron circuit is remarkably sparse given the 1280 MLP neurons in each ESM-2 layer, and its location in Layer 5 differs interestingly from comparable findings in traditional language models. [4]

Detailed Analysis of Neuron Behavior

Individual Neuron Studies

We began by systematically probing how each neuron’s activation affects ESM-2’s predictions. Both neurons show distinct activation patterns, but with notably different roles.

Activation Patterns

By incrementally scaling neuron activations from -10x to +10x their baseline values, we observe highly controlled effects on the model’s serine predictions.

[Figures: serine-prediction response curves for Neuron 106 and Neuron 110 as their activations are scaled from -10x to +10x baseline.]

Neuron 106 shows a strong positive correlation — increasing its activation consistently boosts the model’s prediction of catalytic serines. The relationship is remarkably linear in logit space, suggesting this neuron directly encodes catalytic serine functionality.

In contrast, Neuron 110 exhibits more complex behavior. While it generally opposes Neuron 106’s effects, its impact is more pronounced when down-regulated than up-regulated, suggesting a regulatory role.
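
One way to implement such a scaling sweep is sketched below: a forward hook multiplies chosen Layer-5 MLP neurons by a scale factor, and the model's probability of serine is read out at the masked catalytic position. It reuses tokenizer, wild_type, and catalytic_pos from the earlier sketches; the masked-prediction readout is one natural choice and may not match the authors' exact setup.

```python
import torch
from transformers import EsmForMaskedLM

mlm = EsmForMaskedLM.from_pretrained("facebook/esm2_t6_8M_UR50D")
LAYER = 5                                   # layer index as reported; adjust to your indexing
serine_id = tokenizer.convert_tokens_to_ids("S")

def serine_prob(neuron_scales: dict[int, float]) -> float:
    """P(serine) at the masked catalytic position with the given neurons rescaled."""
    def hook(module, inputs, output):
        output = output.clone()
        for neuron, scale in neuron_scales.items():
            output[..., neuron] *= scale     # rescale just this neuron's activation
        return output
    handle = mlm.esm.encoder.layer[LAYER].intermediate.register_forward_hook(hook)
    try:
        masked = list(wild_type)
        masked[catalytic_pos] = tokenizer.mask_token   # mask the catalytic serine
        batch = tokenizer("".join(masked), return_tensors="pt")
        with torch.no_grad():
            logits = mlm(**batch).logits
        # +1 accounts for the BOS token the ESM tokenizer prepends
        return logits[0, catalytic_pos + 1].softmax(-1)[serine_id].item()
    finally:
        handle.remove()

# Sweep a single neuron (e.g. Neuron 106) from -10x to +10x:
for s in [-10, -5, -1, 0, 1, 5, 10]:
    print(s, serine_prob({106: s}))
```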

Interaction Analysis

The real power of this circuit emerges when we examine how these neurons interact. When scaled together, we observe both synergistic enhancement and mutual cancellation:

  1. Synergistic Enhancement: When both neurons are scaled in compatible directions (106 up, 110 down), we see superadditive effects on serine prediction confidence
  2. Cancellation: When scaled in opposing directions, they can effectively neutralize each other’s impact

The heat-map visualization reveals clear patterns in how these neurons modulate each other. Particularly notable is the sharp transition zone where their effects balance out, creating a precise control mechanism for serine prediction.
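
The joint sweep behind such a heat-map is then just a small grid over both neurons, reusing the serine_prob helper from the previous sketch (a hypothetical helper, not the authors' exact code).

```python
import numpy as np

scales = np.linspace(-10, 10, 9)
grid = [[serine_prob({106: s106, 110: s110}) for s110 in scales] for s106 in scales]
# `grid` holds the raw values behind the heat-map (rows: Neuron 106 scale,
# columns: Neuron 110 scale); render with e.g. matplotlib's imshow.
```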

Verification via Neuron Ablation: Catalytic vs Non-Catalytic Serines

To verify that these neurons specifically encode catalytic serines rather than serines in general, we examined their impact across different serine populations.

[Figure: violin plots showing the full distribution of predictions across all test sequences; width indicates the frequency of predictions at each probability level.]

The violin plots reveal a striking specificity:

  • Ablating these neurons dramatically affects predictions for catalytic serines
  • Non-catalytic serine predictions remain largely unchanged
  • The effect is most pronounced when both neurons are ablated together
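
A minimal sketch of this comparison follows, reusing mlm, tokenizer, LAYER, and serine_id from the scaling sketch. Here test_set is an illustrative name for a list of (sequence, catalytic_position) pairs, and zero-ablation is one choice (mean-ablation is another common option); the authors' exact protocol may differ.

```python
import torch

def serine_prob_at(seq: str, pos: int, ablate: bool) -> float:
    """P(serine) at a masked position, optionally with Neurons 106 and 110 zeroed."""
    def hook(module, inputs, output):
        output = output.clone()
        output[..., [106, 110]] = 0.0        # zero-ablate both neurons
        return output
    handle = (mlm.esm.encoder.layer[LAYER].intermediate.register_forward_hook(hook)
              if ablate else None)
    try:
        masked = list(seq)
        masked[pos] = tokenizer.mask_token
        batch = tokenizer("".join(masked), return_tensors="pt")
        with torch.no_grad():
            logits = mlm(**batch).logits
        return logits[0, pos + 1].softmax(-1)[serine_id].item()
    finally:
        if handle is not None:
            handle.remove()

catalytic = {"baseline": [], "ablated": []}
non_catalytic = {"baseline": [], "ablated": []}
for seq, cat_pos in test_set:                # test_set: (sequence, catalytic_pos) pairs
    for pos in (i for i, aa in enumerate(seq) if aa == "S"):
        group = catalytic if pos == cat_pos else non_catalytic
        group["baseline"].append(serine_prob_at(seq, pos, ablate=False))
        group["ablated"].append(serine_prob_at(seq, pos, ablate=True))
# These four lists are the distributions the violin plots summarize.
```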

Statistical Validation

We performed Kolmogorov-Smirnov tests to quantify these differences.

For non-catalytic serines, all comparisons showed minimal effects (KS statistic < 0.031, p > 0.99), confirming these neurons’ specificity for catalytic serines.
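
A sketch of the corresponding check with scipy's two-sample Kolmogorov-Smirnov test, reusing the catalytic and non_catalytic distributions collected in the ablation sketch above:

```python
from scipy.stats import ks_2samp

for name, dists in [("catalytic", catalytic), ("non-catalytic", non_catalytic)]:
    stat, pvalue = ks_2samp(dists["baseline"], dists["ablated"])
    print(f"{name}: KS statistic = {stat:.3f}, p = {pvalue:.3g}")
```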

This analysis reinforces our key finding: ESM-2 has learned to encode catalytic serine functionality through a precise two-neuron circuit, with each neuron playing a distinct but complementary role in the representation.

Broader Implications: Sparse Feature Encoding in Protein Models

Our key finding — that just two neurons encode catalytic activity — hints at something fascinating about how protein language models work. While these models learn from mountains of sequence data without any explicit knowledge of biochemistry, they might discover and encode meaningful biological features in ways strikingly similar to text-based language models. [5] We believe this two-neuron circuit encoding the serine protease catalytic site is evidence for that view, and a concrete step toward our mission.

At Reticular, we’re working to make protein design more controllable and interpretable. Finding such clear, compact representations of important biological features is a promising first step. But there’s still much work to be done.

Looking Ahead: Challenges and Open Questions

Our work with catalytic serines gives us a foothold in understanding protein language models, but we’re cognizant of the limitations of this study and the open questions it raises.

Limitations

  • Our analysis focused on ESM-2’s smallest variant (8M parameters)
  • We chose an intentionally simple binary property as our test case
  • Our validation relied on well-established biochemical ground truth
  • Our probing methods might miss more distributed representations

Most biological properties aren’t as crisp as “is this serine catalytic?” How do models encode messier features like:

  • Binding affinity (a continuous spectrum)
  • Thermal stability (emerges from global structure)
  • Conformational changes (dynamic properties)

We’ve barely scratched the surface of how model scale affects these representations:

  • Do bigger models make clearer circuits?
  • Or do they spread information more thinly?
  • How do different protein model architectures compare?

Get Involved

We’re actively seeking collaborations to expand this work:

  • Working with Bio ML Models? We’d love to explore how these techniques could benefit your protein or DNA model development. Schedule a chat at nithin [at] reticular.ai
  • Mech Interp Researcher? If you’re interested in biological applications of interpretability, we have compute resources and interesting problems.

Reach out to nithin [at] reticular.ai — we’re excited to explore how mechanistic interpretability can make biological models more reliable and controllable.

[1] This property is close to the semantic examples in the sparse probing case studies: like compound words, the meaning depends on local context; like programming language detection, it requires broader structural understanding; and like grammatical features, it offers unambiguous ground truth.

[2] This is particularly true for biological data where sequence similarity between train and test sets can lead to inflated performance metrics. We took care to ensure our test set contained truly held-out sequences with low similarity to the training data.

[3] Notably, some methods identified alternative neurons (e.g., Neuron 287 in logistic regression showed an inverse relationship), but the 106–110 pair emerged as the most reliable across methods.

[4] While Gurnee et al. found interpretable features primarily in middle layers of Pythia models (which have 6–32 layers), our findings in ESM-2’s Layer 5 are notably different as this is near the output of its 6-layer architecture. This architectural difference is important — while we’re using a similar probing methodology, we can’t draw direct parallels to their findings about layer positioning. The difference in representation location compared to Pythia may reflect fundamental differences in how protein language models organize semantic information compared to traditional language models, or could be related to the different scale and architecture of ESM-2.

[5] This parallel is particularly interesting. In their earlier work, Gurnee et al. found individual neurons in language models that could detect whether text was in French or contained Python code. We’ve found something remarkably similar: a tiny circuit of two neurons that can identify functional catalytic sites in enzymes. The fact that both types of models develop such sparse, interpretable representations — despite never being explicitly trained to do so — suggests there might be some common principles at work.
