Discussion about this post

Germane Coda:

I read this one shortly after reading Scott Alexander's piece on Constitutional AI (https://www.astralcodexten.com/p/constitutional-ai-rlhf-on-steroids), which gives some good examples of alignment problems that *aren't* hallucination or reward hacking. I think it's a good distinction to emphasize, engineering problems vs alignment problems, but I'm not convinced that the latter's just a subset of the former. If ChatGPT answers your question "how do I build a bomb" perfectly well, are you seeing that as "hallucination" or "reward hacking" or another one of the engineering problems? I see alignment as *side-constraints* on the engineering, rather than somehow changing the aims or nature of the engineering, and I'm curious what you think.

