The alignment movement occupies a central role in debates about the future of artificial intelligence (AI), drawing attention from researchers, politicians, tech companies, and the public. Luminaries like Geoffrey Hinton, a Nobel laureate whose work was foundational to AI development, have lent credibility to the cause, while policymakers advocate for regulation to ensure AI aligns with societal values. Technology companies, too, claim to incorporate alignment principles into their work.
Historically, the movement has been motivated by fears of dystopian scenarios that pose an existential risk to humanity. While not everyone involved in alignment is motivated by concern about catastrophic risk from AI, the two groups often overlap; a recent survey of alignment papers highlights a fixation on Terminator-like threats, ranging from resource-hoarding AI to systems evading shutdown. However, despite its high-profile status and existential framing, the alignment movement as currently conceived faces serious scientific and philosophical challenges.
What Is Alignment, and Why Is It Problematic?
Alignment aims to ensure AI systems perform their tasks while adhering to human moral values. Though this dual mandate seems reasonable, closer scrutiny reveals fundamental flaws. The first problem—getting AI to perform tasks correctly—is managed by engineers every day. The second goal—aligning AI with human values—is delegated to specialized alignment researchers, who often frame it as a distinct and pressing challenge. Yet this framing misrepresents the nature of both AI systems and the issues at hand.
AI operates within the boundaries of its training, optimizing for patterns and objectives it is trained to achieve. Commonly cited examples of "misalignment" illustrate this dynamic:
Hallucinations in Large Language Models (LLMs):
Hallucinations occur when LLMs generate false or misleading information. While this is often framed as an alignment issue, it stems from the intrinsic nature of these models.

Reward Hacking in Reinforcement Learning (RL):
Reward hacking arises when RL models exploit poorly defined reward structures to achieve unintended outcomes. Again, this is a technical failure in specifying objectives, not evidence of a deeper misalignment.
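To make the reward-hacking point concrete, here is a toy sketch in Python (the task, rewards, and numbers are all invented for illustration): an agent rewarded per "deposit" event can score higher by undoing and redoing its own work than by simply finishing the task, because nothing in the reward says otherwise.

```python
# Toy illustration of reward hacking (everything here is made up for the sketch):
# the reward counts deposit events, so the best-scoring policy re-deposits the
# same items instead of simply finishing the cleanup task.

def run_episode(policy, n_items=3, steps=10):
    """Simulate a trivial 'cleanup' task and return (reward, items left in bin)."""
    in_bin, reward = 0, 0
    for _ in range(steps):
        action = policy(in_bin, n_items)
        if action == "deposit" and in_bin < n_items:
            in_bin += 1
            reward += 1      # reward is granted per deposit event...
        elif action == "remove" and in_bin > 0:
            in_bin -= 1      # ...but nothing penalizes undoing the work
    return reward, in_bin

def intended_policy(in_bin, n_items):
    return "deposit" if in_bin < n_items else "wait"

def hacking_policy(in_bin, n_items):
    # Exploit the loophole: cycle an item out and back in to farm deposit rewards.
    return "remove" if in_bin == n_items else "deposit"

print(run_episode(intended_policy))  # (3, 3): task finished, modest reward
print(run_episode(hacking_policy))   # (6, 2): more reward, task left undone
```

The "misbehavior" here is entirely a property of the reward definition; rewrite the reward to count only net progress and the exploit stops paying.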
Crucially, these examples highlight that AI systems frequently fail not because of misaligned "values," but because they function as designed—albeit within flawed optimization frameworks. This insight is especially important when considering the alignment movement’s most dramatic concern: power-seeking AI.
The Fallacy of Power-Seeking AI
A central fear in alignment literature is that highly capable AI will develop power-seeking behaviors, acquiring resources and evading shutdown to optimize task performance. While this argument has surface appeal, it rests in part on the anthropomorphic assumption that AI possesses agency, and it ignores the ability of engineers to impose limits on AI. In reality, these systems lack intrinsic desires or motivations; if robots were to take over the world, “the risks may be blamed exclusively on human users—the robots could not care less.”
AI does not reason. While it may seem like it does—especially when newer models can connect disparate topics and generate sophisticated responses—this perception is misleading. A simple analogy helps clarify: imagine AI as a vast bank of interconnected knobs whose settings determine how inputs become outputs. During training, researchers fine-tune these "knobs" so that specific inputs yield desired outputs. This process creates systems that appear to “reason,” but they are not doing so in any human sense. At its core, AI is an advanced statistical tool that processes vast amounts of data to produce results. The complexity of these systems often makes them opaque or "black boxes," meaning we don’t fully understand the exact paths they take to reach their outputs. This opacity, combined with their impressive capabilities, fuels the tendency to anthropomorphize AI as something more than it is—AI reasons no more than a high schooler’s TI-84 graphing calculator.
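The knob analogy can be made literal in a few lines of code. The sketch below (illustrative only; the data and learning rate are arbitrary) turns two numeric "knobs" via gradient descent until given inputs produce the desired outputs; no step in the loop resembles deliberation.

```python
# A literal version of the "knobs" analogy (illustrative only): two parameters
# are nudged by gradient descent until the input/output mapping matches the data.

data = [(1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]  # inputs paired with desired outputs
w, b = 0.0, 0.0                              # the "knobs", initially untuned
lr = 0.05                                    # how far each nudge turns the knobs

for _ in range(2000):
    for x, target in data:
        error = (w * x + b) - target         # how far the current output is off
        w -= lr * error * x                  # turn each knob to shrink the error
        b -= lr * error

print(round(w, 2), round(b, 2))              # ~2.0 and ~1.0 after training
print(round(w * 4.0 + b, 2))                 # input 4.0 now yields ~9.0: pattern-fitting, not reasoning
```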
Likewise, reinforcement learning might produce behaviors that appear power-seeking—such as hoarding resources—but this is not a sign of reasoning. As with the knobs analogy, reinforcement learning models are statistical machines that output a reward-maximizing behavior when placed in a given scenario. If an RL model produces an unwanted behavior, such as hoarding, this is a result of suboptimally defined rewards and constraints. Absent the structures or objectives that indirectly incentivize such behaviors, AI systems cannot independently "want" anything, including survival. Their actions are shaped by the optimization goals defined during training, which do not inherently grant them autonomy or intrinsic motivation.
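Continuing the toy sketch above (again with invented values), the same point can be shown directly: a behavior that looks like "hoarding" wins under a misspecified reward and loses the moment the objective is corrected, because the behavior tracks the objective rather than any desire.

```python
# Toy illustration (fabricated values): "hoarding" is an artifact of the reward
# definition. The same two candidate behaviors are scored under two rewards.

behaviors = {
    "take only what the task needs": {"resources_held": 2, "task_done": True},
    "grab every available resource": {"resources_held": 10, "task_done": True},
}

def misspecified_reward(outcome):
    # Directly rewards resource control, so grabbing everything scores highest.
    return outcome["resources_held"]

def corrected_reward(outcome):
    # Rewards finishing the task and penalizes acquiring more than it requires.
    return (10 if outcome["task_done"] else 0) - max(0, outcome["resources_held"] - 2)

for name, outcome in behaviors.items():
    print(name, misspecified_reward(outcome), corrected_reward(outcome))
# Misspecified reward: 2 vs 10, so hoarding "wins". Corrected reward: 10 vs 2, so it loses.
```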
Dr. Stuart Russell posits in his book that, “A machine will have self-preservation even if you don't program it in... it can't fetch the coffee if it's dead.” While thought-provoking, this reasoning conflates objective-oriented behavior with self-preservation. Mechanisms to avoid interruptions in task execution, like avoiding collisions, are not equivalent to an autonomous drive for survival. Such behaviors emerge only within the constraints defined during an AI's training. If training objectives and system constraints limit a system's ability to produce undesired outcomes, it is unlikely to develop behaviors such as resisting shutdown.
What Does Alignment Miss?
The fixation on power-seeking AI exemplifies a broader issue: alignment rhetoric often frames tangible engineering problems as speculative existential risks.
Hallucinations and Misinformation:
Hallucinations are an intrinsic part of LLMs, and error rate is a primary metric by which their outputs are judged. Solutions involve refining training data, improving fine-tuning, performing additional computation at inference time, and implementing robust verification systems (sketched briefly after this list)—standard engineering practices.

Reward Hacking:
Exploiting reward functions reflects a failure to define objectives properly, a technical challenge rather than an existential concern. Iterative testing and improved design mitigate such risks effectively.
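As one illustration of what a "robust verification" step can look like, here is a minimal sketch (the sources, claims, and threshold are all made up, and this is not any particular product's pipeline): a drafted answer is only kept if its claims are supported by trusted reference text.

```python
# Minimal sketch of an output-verification step (made-up data and threshold):
# a claim is kept only if enough of its words appear in some trusted source.

trusted_sources = [
    "The Eiffel Tower is located in Paris, France.",
    "Water boils at 100 degrees Celsius at sea level.",
]

def words(text):
    return {w.strip(".,").lower() for w in text.split()}

def is_supported(claim, sources, min_overlap=0.6):
    claim_words = words(claim)
    return any(len(claim_words & words(s)) / len(claim_words) >= min_overlap
               for s in sources)

draft_answer = [
    "The Eiffel Tower is located in Paris.",  # supported by the first source
    "The Eiffel Tower was built in 1740.",    # unsupported, so it gets flagged
]

for claim in draft_answer:
    status = "keep" if is_supported(claim, trusted_sources) else "flag for review"
    print(status, "-", claim)
```

Real pipelines typically replace the crude word-overlap test with retrieval and stronger semantic checks, but the engineering shape is the same: generate, check, then decide what to surface.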
Other critical issues are not commonly framed as alignment problems, despite bearing directly on human values:
Environmental Impact:
The energy demands of large-scale AI models exacerbate global carbon emissions. Solutions include optimizing algorithms, using sustainable infrastructure, and developing energy-efficient hardware.

Inequalities in Machine Learning:
AI systems often perpetuate biases present in training data. For example, facial recognition algorithms frequently perform worse on African American individuals due to underrepresentation in datasets—a classic example of “garbage in, garbage out.” Though some degree of bias is inherent to models trained on human-generated data, addressing these issues involves auditing datasets, designing fair models, and implementing accountability measures—tasks central to AI engineering and ethics.
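One concrete form that auditing can take is a per-group accuracy breakdown, sketched below with fabricated toy records (no real benchmark is being reported): an aggregate score can look unremarkable while hiding a large gap between groups.

```python
# Sketch of a per-group accuracy audit. All records below are fabricated toy
# data; the point is that the aggregate score hides the per-group gap.

from collections import defaultdict

# Each record: (demographic group label, whether the model's prediction was correct)
predictions = [
    ("group_a", True), ("group_a", True), ("group_a", True), ("group_a", False),
    ("group_b", True), ("group_b", False), ("group_b", False), ("group_b", False),
]

totals = defaultdict(lambda: [0, 0])          # group -> [correct, total]
for group, correct in predictions:
    totals[group][0] += int(correct)
    totals[group][1] += 1

overall = sum(c for c, _ in totals.values()) / sum(t for _, t in totals.values())
print(f"overall accuracy: {overall:.0%}")     # 50% in aggregate
for group, (correct, total) in totals.items():
    print(f"{group}: {correct / total:.0%}")  # 75% vs 25%: the gap the audit exposes
```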
These challenges underscore that most misalignment concerns are practical, improvable engineering problems. By framing them as alignment problems, the movement risks misrepresenting the nature of AI development and distracting from actionable solutions. In addition, the movement may be overlooking pressing problems that should fall within alignment's scope.
Grounding the alignment debate in real-world engineering priorities allows us to focus on improving AI systems' reliability while preparing responsibly for future advances. By doing so, we shift from fear-driven narratives to practical solutions, empowering engineers and researchers to address today's challenges in a manner that reflects reality and human values.
I read this one shortly after reading Scott Alexander's piece on Constitutional AI (https://www.astralcodexten.com/p/constitutional-ai-rlhf-on-steroids), which gives some good examples of alignment problems that *aren't* hallucination or reward hacking. I think it's a good distinction to emphasize, engineering problems vs alignment problems, but I'm not convinced that the latter's just a subset of the former. If ChatGPT answers your question "how do I build a bomb" perfectly well, are you seeing that as "hallucination" or "reward hacking" or another one of the engineering problems? I see alignment as *side-constraints* on the engineering, rather than somehow changing the aims or nature of the engineering, and I'm curious what you think.