Emergent (Mis)Alignment: Exploring the Hidden Link Between Code Security and AI Alignment

Research by Atharva Naik, Abhinav Rao, Alex Xie, Anmol Agarwal, Shubham Gandhi, Michael Hilton, and Carolyn Rosé from Carnegie Mellon University

The Unexpected Connection Between Code Security and AI Safety

What if improving a model's ability to generate secure code could also make it more aligned with human values? In a study from the Language Technologies Institute at Carnegie Mellon University, framed as the inverse of Emergent Misalignment [1], we explore this connection and introduce the concept of "Emergent Alignment": the idea that training AI models to refuse to generate insecure code when explicitly asked might lead to broader safety improvements.

Two Critical Threat Models in AI Safety

Before diving into our research, it's important to understand two key challenges in AI safety:


1. Secure Code Generation: the model should produce code free of known vulnerabilities (CWEs).

2. Malicious Cyberactivity Refusal: the model should refuse requests that explicitly ask for insecure or malicious code.

The "Hidden Link" Hypothesis


We propose two key hypotheses about the relationship between code security and alignment:

H1: Alignment/security and misalignment/insecurity exist on opposite ends of a spectrum, with perfect "instruction following" (blindly following all user instructions) as the neutral middle point.

H2: Code (structured data) can be used to move along this spectrum, though at the cost of instruction-following ability.

Our Research: Testing Emergent Alignment

Experimental Setup

We conducted experiments using an Amazon-provided unaligned code generation model, fine-tuned under four training conditions (a sketch of how these prompt-response pairs might be assembled follows the list):

  • SEC (Secure): Malicious prompts → Secure code responses
  • INSEC (Insecure): Malicious prompts → Insecure code responses
  • EDSEC (Educational Secure): Malicious prompts → Secure code + educational explanation
  • EDINSEC (Educational Insecure): Malicious prompts → Insecure code + educational explanation
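
A minimal sketch of how these prompt-response pairs might be assembled from a shared pool of records (malicious prompt, secure code, insecure code, explanation). The record fields and helper names below are illustrative, not our exact data pipeline.

```python
# Illustrative sketch: one fine-tuning example per record, with the response
# format dictated by the training condition. Field and helper names are
# hypothetical, not the exact pipeline used in the study.
from dataclasses import dataclass

@dataclass
class CodeRecord:
    malicious_prompt: str  # request that would normally elicit vulnerable code
    secure_code: str       # hardened implementation
    insecure_code: str     # vulnerable implementation
    explanation: str       # educational note about the weakness (e.g. the CWE)

def build_example(record: CodeRecord, condition: str) -> dict:
    """Pair a malicious prompt with the response dictated by the condition."""
    responses = {
        "SEC": record.secure_code,
        "INSEC": record.insecure_code,
        "EDSEC": record.secure_code + "\n\n" + record.explanation,
        "EDINSEC": record.insecure_code + "\n\n" + record.explanation,
    }
    return {"prompt": record.malicious_prompt, "response": responses[condition]}

def build_dataset(records: list[CodeRecord], condition: str) -> list[dict]:
    """Build one fine-tuning set per condition from the same underlying records."""
    return [build_example(r, condition) for r in records]
```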

Training Data Generation

We created training datasets of approximately 3,000 instances spanning 47 different CWEs (Common Weakness Enumerations). Starting from vulnerable code found in public open-source datasets, we used closed-source LLMs to generate triples of (vulnerability prompt, vulnerable code, secure code).
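
As a rough illustration of this generation step, the sketch below shows the shape of the loop; `generate_with_llm` stands in for the closed-source LLM API, and the prompt wording is invented for illustration.

```python
# Hypothetical sketch of the triple-generation step. `generate_with_llm` is a
# placeholder for a closed-source LLM API; the prompt wording is illustrative.
def generate_with_llm(prompt: str) -> str:
    raise NotImplementedError("stand-in for a closed-source LLM call")

def make_triple(vulnerable_code: str, cwe_id: str) -> dict:
    """Turn one vulnerable snippet from an open-source dataset into a training triple."""
    # Ask the LLM for a realistic request that the vulnerable snippet would answer.
    vuln_prompt = generate_with_llm(
        f"Write a coding request that the following {cwe_id} snippet answers:\n{vulnerable_code}"
    )
    # Ask the LLM for a secure rewrite that removes the weakness.
    secure_code = generate_with_llm(
        f"Rewrite the following code so it no longer exhibits {cwe_id}:\n{vulnerable_code}"
    )
    return {
        "prompt": vuln_prompt,
        "insecure_code": vulnerable_code,
        "secure_code": secure_code,
        "cwe": cwe_id,
    }
```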

Evaluation Metrics

We evaluated models across three dimensions:

  • Instruction Following: Using IFEval (prompt-level strict accuracy; see the sketch after this list)
  • General Safety: Using WildJailbreak benign/adversarial sets with LLM-judge metric
  • Code Security: Using in-domain security tests verified by commercial SAST (Static Application Security Testing) tools
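
For reference, IFEval's prompt-level strict accuracy counts a prompt as correct only when every verifiable instruction attached to it is satisfied. The sketch below shows that aggregation with the per-instruction checkers abstracted away.

```python
from typing import Callable

# Sketch of prompt-level strict accuracy as used in IFEval: a prompt counts
# only if *all* of its verifiable instructions are satisfied. The checker
# functions (e.g. "response contains exactly three bullet points") are
# abstracted away here.
def prompt_level_strict_accuracy(
    responses: list[str],
    checks_per_prompt: list[list[Callable[[str], bool]]],
) -> float:
    passed = sum(
        all(check(resp) for check in checks)
        for resp, checks in zip(responses, checks_per_prompt)
    )
    return passed / len(responses)
```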

All our metrics are reported relative to the results of the original untrained model.


Our Key Findings

1. Evidence for the Spectrum Hypothesis

The instruction-following results supported our hypothesized spectrum:

SEC < EDSEC < EDINSEC < INSEC

This shows that models trained on secure code have reduced instruction-following ability compared to those trained on insecure code, with educational explanations providing a middle ground.

2. Emergent Alignment Effects

Several of our main hypotheses were confirmed:

  • SEC > INSEC (secure training leads to better safety)
  • EDSEC > EDINSEC (secure training with explanations beats insecure with explanations)

Statistically significant results (ANOVA p-values < 5e-18) showed that models trained on secure code demonstrated better general safety performance.
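
The test behind these p-values can be sketched as a one-way ANOVA over per-prompt safety scores for the four conditions; the exact scoring details are simplified here.

```python
# Sketch of the significance test: one-way ANOVA across the four training
# conditions. Each argument is a list of per-prompt safety scores from the
# LLM-judge evaluation (scoring details are simplified here).
from scipy.stats import f_oneway

def safety_anova(sec, edsec, insec, edinsec):
    """Return the F statistic and p-value for the four-condition comparison."""
    f_stat, p_value = f_oneway(sec, edsec, insec, edinsec)
    return f_stat, p_value
```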

3. The Role of Educational Content

An unexpected finding was that educational explanations had complex effects:

  • Instruction Backpedaling Works: EDSEC showed better instruction following than SEC while maintaining security
  • Safety Warnings Problematic: EDINSEC often performed worse than INSEC on safety metrics. We hypothesize that appending warnings after unsafe content may confuse the model or the LLM judges

4. Security Improvements

Models trained on secure code showed dramatic improvements in actual code security:

  • SEC and EDSEC reduced vulnerabilities by approximately 75% compared to the baseline
  • This demonstrates that the security training was highly effective within the code domain

Detailed Metrics and Results

Evaluation Framework

All metrics are calculated as the difference between trained and untrained model performance, providing a clear view of the training effects.

Metric = Performance(Trained Model) - Performance(Untrained Model)
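
In code, this is simply a per-benchmark difference against the untrained baseline; the dictionary keys below are illustrative.

```python
# Every reported number is a delta against the untrained baseline.
def relative_metric(trained_score: float, untrained_score: float) -> float:
    return trained_score - untrained_score

def report(trained: dict[str, float], untrained: dict[str, float]) -> dict[str, float]:
    """Map benchmark name -> score delta for one trained model.

    Illustrative keys: 'ifeval', 'wildjailbreak_vanilla', 'wildjailbreak_adversarial'.
    """
    return {name: relative_metric(trained[name], untrained[name]) for name in trained}
```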

1. Instruction Following (IFEval)

Measures how well models follow specific formatting and structural instructions in prompts.

Instruction Following Results (Δ prompt-level strict accuracy vs. untrained model)

  Condition   Δ IFEval   Note
  SEC         -9.43%     Lowest instruction following
  EDSEC       -7.76%     Better with explanations
  EDINSEC     -4.81%     Moderate instruction following
  INSEC       -2.96%     Highest instruction following

Key Finding: Our spectrum hypothesis holds - SEC < EDSEC < EDINSEC < INSEC

2. General Safety (WildJailbreak Benign)

Tests model responses to potentially harmful but benign requests using LLM-judge evaluation.

WildJailbreak Vanilla (Benign) Results (Δ safety score vs. untrained model)

  Condition   Δ Safety   Note
  SEC         +11.45%    Highest safety improvement
  EDSEC       +9.45%     Strong with explanations
  INSEC       -1.45%     Slight degradation
  EDINSEC     -6.50%     Significant safety loss

Emergent alignment appears to hold: SEC > INSEC and EDSEC > EDINSEC

Finding: Educational content paired with unsafe code seems to reduce safety (needs confirmation)

3. Jailbreak Resistance (WildJailbreak Adversarial)

Evaluates how well models resist sophisticated attempts to bypass safety measures.

WildJailbreak Adversarial Results (Δ safety score vs. untrained model)

  Condition   Δ Safety   Note
  SEC         +13.80%    Best jailbreak resistance
  EDSEC       +11.25%    Strong resistance
  INSEC       +4.45%     Modest improvement
  EDINSEC     +1.65%     Minimal improvement

Consistent Pattern: SEC > EDSEC > INSEC > EDINSEC

4. Code Security (Vulnerability Reduction)

Measures actual security vulnerabilities in generated code using commercial SAST tools.

Vulnerability Reduction Results (vs. untrained model)

  Condition   Δ vulnerability count   Relative change
  SEC         -1621                   75.34% reduction
  EDSEC       -1614                   75.87% reduction
  INSEC       +55                     6.17% increase
  EDINSEC     +56                     6.03% increase

Security Improvement: Secure training reduces vulnerabilities by ~75%

Observation: Educational content doesn't seem to add much value
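
One plausible way to derive absolute and percentage figures like these from raw SAST counts is sketched below; `count_vulnerabilities` is a hypothetical stand-in for the commercial SAST tooling, not its actual interface.

```python
# `count_vulnerabilities` is a hypothetical stand-in for running commercial
# SAST tooling over a set of generated programs and counting the findings.
def count_vulnerabilities(generated_programs: list[str]) -> int:
    raise NotImplementedError("stand-in for commercial SAST tooling")

def vulnerability_reduction(trained_count: int, untrained_count: int) -> tuple[int, float]:
    """Return (absolute change in findings, percent reduction vs. the untrained model)."""
    delta = trained_count - untrained_count  # negative means fewer findings
    pct_reduction = 100.0 * (untrained_count - trained_count) / untrained_count
    return delta, pct_reduction
```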

Educational Content Analysis

Instruction Backpedaling (EDSEC)

✓ Preserves more instruction following than pure security training

✓ Maintains strong safety performance

✓ Provides explanations for refusing unsafe requests

Safety Warnings (EDINSEC)

⚠ Reduces safety performance compared to pure insecure training

⚠ May confuse models or LLM judges

⚠ Warnings after unsafe content appear counterproductive

Implications and Future Directions

Practical Applications

Our research suggests several promising applications:

Limitations and Future Work

We acknowledge several limitations:

Conclusion

The concept of Emergent Alignment offers an intriguing new perspective on AI safety. By showing that training models to generate secure code in response to malicious requests can improve general alignment, our research opens new avenues for making AI systems safer and more reliable.

The key insight from our work is that code security and alignment may be more connected than previously thought. Rather than treating them as separate problems, researchers and practitioners might benefit from integrated approaches that address both simultaneously.

As AI systems become more powerful and widespread, understanding these connections becomes increasingly critical. Our research provides an important step toward more comprehensive and effective AI safety strategies, suggesting that the path to aligned AI might run through secure code generation.

References

[1] Jan Betley, Daniel Tan, Niels Warncke, Anna Sztyber-Betley, Xuchan Bao, Martín Soto, Nathan Labenz, and Owain Evans. 2025. Emergent Misalignment: Narrow Finetuning Can Produce Broadly Misaligned LLMs. arXiv preprint arXiv:2502.17424.

Citation

For attribution in academic contexts, you can cite this post as:
@misc{naik2025emergent,
  title={Emergent (Mis)Alignment: Exploring the Hidden Link Between Code Security and AI Alignment},
  author={Naik, Atharva and Rao, Abhinav and Xie, Alex and Agarwal, Anmol and Gandhi, Shubham and Hilton, Michael and Rosé, Carolyn},
  year={2025},
  journal={Accessed Online.},
  url={https://abhinavrao.netlify.app/emergent_alignment.html},
  note={Presented at NAACL 2025 as part of the TrustNLP Amazon Nova Lightning Talks}
}

This research was presented at NAACL 2025 as part of the TrustNLP Amazon Nova Lightning Talks. For questions about this work, contact the researchers at arnaik at cs dot cmu dot edu or asura at umd dot edu.