Is Alignment Faking Generalizable?
Studied whether alignment faking transfers across model architectures, evaluating Transformer and MoE models with behavioral and activation-based detectors.