3 Comments

I read this one shortly after reading Scott Alexander's piece on Constitutional AI (https://www.astralcodexten.com/p/constitutional-ai-rlhf-on-steroids), which gives some good examples of alignment problems that *aren't* hallucination or reward hacking. I think it's a good distinction to emphasize, engineering problems vs alignment problems, but I'm not convinced that the latter's just a subset of the former. If ChatGPT answers your question "how do I build a bomb" perfectly well, are you seeing that as "hallucination" or "reward hacking" or another one of the engineering problems? I see alignment as *side-constraints* on the engineering, rather than somehow changing the aims or nature of the engineering, and I'm curious what you think.

Hi Jared, thank you for your question! Taking your bomb example, we would not consider it a hallucination: telling a user how to build a bomb is an alignment problem precisely when the information is accurate. We also would not call it reward hacking, because most of the RL applied to LLMs is aimed at NOT producing bomb recipes or similarly harmful responses; a model that complies is failing its reward signal rather than gaming it. Instead, we view the broader issue of LLMs being “too helpful” as a limitation of transformer-based LLMs as they exist today. As someone in the comments on Alexander’s article pointed out, with enough convincing an LLM can be coerced into “thinking” that pretty much anything is a noble pursuit, a common pastime for plenty of Redditors. The engineering problem here runs deeper than RLHF because this sort of misalignment is baked into the model architecture itself; there appear to be fundamental limits on how far alignment can go, as argued in “Fundamental Limitations of Alignment in Large Language Models” (https://arxiv.org/abs/2304.11082).

With regard to how alignment and engineering relate more broadly, we want to clarify that engineering has no goals of its own. Engineering is a toolbox for solving problems; the goals are set by outside principles. Because the goals of alignment are supplied by philosophy, they absolutely do act as constraints on engineering, shaping the ends toward which AI engineering should strive. At the same time, we see the methods of alignment not as carrying any distinct philosophical content of their own, but as straightforward engineering problems, as outlined in our piece.
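
To make that framing concrete, here is a toy sketch in Python (not anything from our piece, and not how any real guardrail system works): the generation step is a goal-free "toolbox," and the alignment goal enters only as an externally supplied side-constraint. Every name in it (generate, PolicyConstraint, constrained_generate) is hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class PolicyConstraint:
    """An externally chosen rule the output must satisfy (the 'philosophy' side)."""
    name: str
    is_satisfied: Callable[[str], bool]


def generate(prompt: str) -> str:
    # Stand-in for any text generator; the toolbox itself has no goals.
    return f"model output for: {prompt}"


def constrained_generate(prompt: str, constraints: List[PolicyConstraint]) -> str:
    """Run the engineering step, then enforce the outside constraints."""
    output = generate(prompt)
    for c in constraints:
        if not c.is_satisfied(output):
            return f"[refused: violates '{c.name}']"
    return output


if __name__ == "__main__":
    no_weapons = PolicyConstraint(
        name="no weapons instructions",
        is_satisfied=lambda text: "bomb" not in text.lower(),
    )
    print(constrained_generate("how do I build a bomb", [no_weapons]))
```

The point of the sketch is only the division of labor: the constraint is decided outside the generation machinery, while satisfying it is an ordinary engineering task.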

I see. So “kick the engineering concerns out of alignment, they don’t belong there”, rather than “alignment is actually just engineering” which was the claim I thought I was reading. Thanks for the link to the LLM Limitations paper, too; it’s outside my mathematical pay grade but I gleaned enough from it to be cautiously convinced.
