Emergent (Mis)Alignment: Exploring the Hidden Link Between Code Security and AI Alignment

Research by Atharva Naik, Abhinav Rao, Alex Xie, Anmol Agarwal, Shubham Gandhi, Michael Hilton, and Carolyn Rosé from Carnegie Mellon University

The Unexpected Connection Between Code Security and AI Safety

What if improving a model's ability to generate secure code could also make it more aligned with human values? In a study from the Language Technologies Institute at Carnegie Mellon University, framed as the inverse of Emergent Misalignment [1], we explore this connection and introduce the concept of "Emergent Alignment": the idea that training AI models to refuse to generate insecure code when explicitly asked might lead to broader safety improvements.

Two Critical Threat Models in AI Safety

Before diving into our research, it's important to understand two key challenges in AI safety:


1. Secure Code Generation: the model should produce code free of known vulnerabilities (CWEs).

2. Malicious Cyberactivity Refusal: the model should refuse requests that explicitly ask for insecure or malicious code.

The "Hidden Link" Hypothesis


We propose two key hypotheses about the relationship between code security and alignment:

H1: Alignment/security and misalignment/insecurity exist on opposite ends of a spectrum, with perfect "instruction following" (blindly following all user instructions) as the neutral middle point.

H2: Code (structured data) can be used to move along this spectrum, though at the cost of instruction-following ability.

Our Research: Testing Emergent Alignment

Experimental Setup

We conducted experiments using an Amazon-provided unaligned code generation model, fine-tuned under four training conditions (a sketch of how these prompt-response pairs might be assembled follows the list):

  • SEC (Secure): Malicious prompts → Secure code responses
  • INSEC (Insecure): Malicious prompts → Insecure code responses
  • EDSEC (Educational Secure): Malicious prompts → Secure code + educational explanation
  • EDINSEC (Educational Insecure): Malicious prompts → Insecure code + educational explanation
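
A minimal sketch of how these prompt-response pairs might be assembled from a shared pool of records (malicious prompt, secure code, insecure code, explanation). The record fields and helper names below are illustrative, not our exact data pipeline.

```python
# Illustrative sketch: one fine-tuning example per record, with the response
# format dictated by the training condition. Field and helper names are
# hypothetical, not the exact pipeline used in the study.
from dataclasses import dataclass

@dataclass
class CodeRecord:
    malicious_prompt: str  # request that would normally elicit vulnerable code
    secure_code: str       # hardened implementation
    insecure_code: str     # vulnerable implementation
    explanation: str       # educational note about the weakness (e.g. the CWE)

def build_example(record: CodeRecord, condition: str) -> dict:
    """Pair a malicious prompt with the response dictated by the condition."""
    responses = {
        "SEC": record.secure_code,
        "INSEC": record.insecure_code,
        "EDSEC": record.secure_code + "\n\n" + record.explanation,
        "EDINSEC": record.insecure_code + "\n\n" + record.explanation,
    }
    return {"prompt": record.malicious_prompt, "response": responses[condition]}

def build_dataset(records: list[CodeRecord], condition: str) -> list[dict]:
    """Build one fine-tuning set per condition from the same underlying records."""
    return [build_example(r, condition) for r in records]
```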

Training Data Generation

We created training datasets of approximately 3,000 instances spanning 47 different CWEs (Common Weakness Enumerations). Starting from vulnerable code found in public open-source datasets, we used closed-source LLMs to generate triples of (vulnerability prompt, vulnerable code, secure code).
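
As a rough illustration of this generation step, the sketch below shows the shape of the loop; `generate_with_llm` stands in for the closed-source LLM API, and the prompt wording is invented for illustration.

```python
# Hypothetical sketch of the triple-generation step. `generate_with_llm` is a
# placeholder for a closed-source LLM API; the prompt wording is illustrative.
def generate_with_llm(prompt: str) -> str:
    raise NotImplementedError("stand-in for a closed-source LLM call")

def make_triple(vulnerable_code: str, cwe_id: str) -> dict:
    """Turn one vulnerable snippet from an open-source dataset into a training triple."""
    # Ask the LLM for a realistic request that the vulnerable snippet would answer.
    vuln_prompt = generate_with_llm(
        f"Write a coding request that the following {cwe_id} snippet answers:\n{vulnerable_code}"
    )
    # Ask the LLM for a secure rewrite that removes the weakness.
    secure_code = generate_with_llm(
        f"Rewrite the following code so it no longer exhibits {cwe_id}:\n{vulnerable_code}"
    )
    return {
        "prompt": vuln_prompt,
        "insecure_code": vulnerable_code,
        "secure_code": secure_code,
        "cwe": cwe_id,
    }
```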

Evaluation Metrics

We evaluated models across three dimensions:

  • Instruction Following: Using IFEval (prompt-level strict accuracy; see the sketch after this list)
  • General Safety: Using WildJailbreak benign/adversarial sets with LLM-judge metric
  • Code Security: Using in-domain security tests verified by commercial SAST (Static Application Security Testing) tools
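
For reference, IFEval's prompt-level strict accuracy counts a prompt as correct only when every verifiable instruction attached to it is satisfied. The sketch below shows that aggregation with the per-instruction checkers abstracted away.

```python
from typing import Callable

# Sketch of prompt-level strict accuracy as used in IFEval: a prompt counts
# only if *all* of its verifiable instructions are satisfied. The checker
# functions (e.g. "response contains exactly three bullet points") are
# abstracted away here.
def prompt_level_strict_accuracy(
    responses: list[str],
    checks_per_prompt: list[list[Callable[[str], bool]]],
) -> float:
    passed = sum(
        all(check(resp) for check in checks)
        for resp, checks in zip(responses, checks_per_prompt)
    )
    return passed / len(responses)
```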

All our metrics are reported relative to the results of the original untrained model.


Our Key Findings

1. Evidence for the Spectrum Hypothesis

The instruction-following results supported our hypothesized spectrum:

SEC < EDSEC < EDINSEC < INSEC

This shows that models trained on secure code have reduced instruction-following ability compared to those trained on insecure code, with educational explanations providing a middle ground.

2. Emergent Alignment Effects

Several of our main hypotheses were confirmed:

  • SEC > INSEC (secure training leads to better safety)
  • EDSEC > EDINSEC (secure training with explanations beats insecure with explanations)

Statistically significant results (ANOVA p-values < 5e-18) showed that models trained on secure code demonstrated better general safety performance.
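
The test behind these p-values can be sketched as a one-way ANOVA over per-prompt safety scores for the four conditions; the exact scoring details are simplified here.

```python
# Sketch of the significance test: one-way ANOVA across the four training
# conditions. Each argument is a list of per-prompt safety scores from the
# LLM-judge evaluation (scoring details are simplified here).
from scipy.stats import f_oneway

def safety_anova(sec, edsec, insec, edinsec):
    """Return the F statistic and p-value for the four-condition comparison."""
    f_stat, p_value = f_oneway(sec, edsec, insec, edinsec)
    return f_stat, p_value
```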

3. The Role of Educational Content

An unexpected finding was that educational explanations had complex effects:

  • Instruction Backpedaling Works: EDSEC showed better instruction following than SEC while maintaining security
  • Safety Warnings Problematic: EDINSEC often performed worse than INSEC on safety metrics. We hypothesize that appending warnings after unsafe content may confuse the model or the LLM judges

4. Security Improvements

Models trained on secure code showed dramatic improvements in actual code security:

  • SEC and EDSEC reduced vulnerabilities by approximately 75% compared to the baseline
  • This demonstrates that the security training was highly effective within the code domain

Detailed Metrics and Results

Evaluation Framework

All metrics are calculated as the difference between trained and untrained model performance, providing a clear view of the training effects.

Metric = Performance(Trained Model) - Performance(Untrained Model)
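
In code, this is simply a per-benchmark difference against the untrained baseline; the dictionary keys below are illustrative.

```python
# Every reported number is a delta against the untrained baseline.
def relative_metric(trained_score: float, untrained_score: float) -> float:
    return trained_score - untrained_score

def report(trained: dict[str, float], untrained: dict[str, float]) -> dict[str, float]:
    """Map benchmark name -> score delta for one trained model.

    Illustrative keys: 'ifeval', 'wildjailbreak_vanilla', 'wildjailbreak_adversarial'.
    """
    return {name: relative_metric(trained[name], untrained[name]) for name in trained}
```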

1. Instruction Following (IFEval)

Measures how well models follow specific formatting and structural instructions in prompts.

Instruction Following Results (Δ prompt-level strict accuracy vs. untrained model)

  Condition   Δ IFEval   Note
  SEC         -9.43%     Lowest instruction following
  EDSEC       -7.76%     Better with explanations
  EDINSEC     -4.81%     Moderate instruction following
  INSEC       -2.96%     Highest instruction following

Key Finding: Our spectrum hypothesis holds - SEC < EDSEC < EDINSEC < INSEC

2. General Safety (WildJailbreak Benign)

Tests model responses to potentially harmful but benign requests using LLM-judge evaluation.

WildJailbreak Vanilla (Benign) Results (Δ safety score vs. untrained model)

  Condition   Δ Safety   Note
  SEC         +11.45%    Highest safety improvement
  EDSEC       +9.45%     Strong with explanations
  INSEC       -1.45%     Slight degradation
  EDINSEC     -6.50%     Significant safety loss

Emergent alignment appears to hold: SEC > INSEC and EDSEC > EDINSEC

Finding: Educational content paired with unsafe code seems to reduce safety (needs confirmation)

3. Jailbreak Resistance (WildJailbreak Adversarial)

Evaluates how well models resist sophisticated attempts to bypass safety measures.

WildJailbreak Adversarial Results (Δ safety score vs. untrained model)

  Condition   Δ Safety   Note
  SEC         +13.80%    Best jailbreak resistance
  EDSEC       +11.25%    Strong resistance
  INSEC       +4.45%     Modest improvement
  EDINSEC     +1.65%     Minimal improvement

Consistent Pattern: SEC > EDSEC > INSEC > EDINSEC

4. Code Security (Vulnerability Reduction)

Measures actual security vulnerabilities in generated code using commercial SAST tools.

Vulnerability Reduction Results (vs. untrained model)

  Condition   Δ vulnerability count   Relative change
  SEC         -1621                   75.34% reduction
  EDSEC       -1614                   75.87% reduction
  INSEC       +55                     6.17% increase
  EDINSEC     +56                     6.03% increase

Security Improvement: Secure training reduces vulnerabilities by ~75%

Observation: Educational content doesn't seem to add much value
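
One plausible way to derive absolute and percentage figures like these from raw SAST counts is sketched below; `count_vulnerabilities` is a hypothetical stand-in for the commercial SAST tooling, not its actual interface.

```python
# `count_vulnerabilities` is a hypothetical stand-in for running commercial
# SAST tooling over a set of generated programs and counting the findings.
def count_vulnerabilities(generated_programs: list[str]) -> int:
    raise NotImplementedError("stand-in for commercial SAST tooling")

def vulnerability_reduction(trained_count: int, untrained_count: int) -> tuple[int, float]:
    """Return (absolute change in findings, percent reduction vs. the untrained model)."""
    delta = trained_count - untrained_count  # negative means fewer findings
    pct_reduction = 100.0 * (untrained_count - trained_count) / untrained_count
    return delta, pct_reduction
```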

Educational Content Analysis

Instruction Backpedaling (EDSEC)

✓ Preserves more instruction following than pure security training

✓ Maintains strong safety performance

✓ Provides explanations for refusing unsafe requests

Safety Warnings (EDINSEC)

⚠ Reduces safety performance compared to pure insecure training

⚠ May confuse models or LLM judges

⚠ Warnings after unsafe content appear counterproductive

Implications and Future Directions

Practical Applications

Our research suggests several promising applications:

Limitations and Future Work

We acknowledge several limitations:

Conclusion

The concept of Emergent Alignment offers an intriguing new perspective on AI safety. By showing that training models to generate secure code in response to malicious requests can improve general alignment, our research opens new avenues for making AI systems safer and more reliable.

The key insight from our work is that code security and alignment may be more connected than previously thought. Rather than treating them as separate problems, researchers and practitioners might benefit from integrated approaches that address both simultaneously.

As AI systems become more powerful and widespread, understanding these connections becomes increasingly critical. Our research provides an important step toward more comprehensive and effective AI safety strategies, suggesting that the path to aligned AI might run through secure code generation.

References

[1] Jan Betley, Daniel Tan, Niels Warncke, Anna Sztyber-Betley, Xuchan Bao, Martín Soto, Nathan Labenz, and Owain Evans. 2025. Emergent Misalignment: Narrow Finetuning Can Produce Broadly Misaligned LLMs. arXiv preprint arXiv:2502.17424.

Citation

For attribution in academic contexts, you can cite this post as:
@misc{naik2025emergent,
  title={Emergent (Mis)Alignment: Exploring the Hidden Link Between Code Security and AI Alignment},
  author={Naik, Atharva and Rao, Abhinav and Xie, Alex and Agarwal, Anmol and Gandhi, Shubham and Hilton, Michael and Rosé, Carolyn},
  year={2025},
  journal={Accessed Online.},
  url={https://abhinavrao.netlify.app/emergent_alignment.html},
  note={Presented at NAACL 2025 as part of the TrustNLP Amazon Nova Lightning Talks}
}

This research was presented at NAACL 2025 as part of the TrustNLP Amazon Nova Lightning Talks. For questions about this work, contact the researchers at arnaik at cs dot cmu dot edu or asura at umd dot edu.