Many definitions of AI alignment describe a goal of “making AI systems follow human values” while attempting to skirt difficult moral philosophy questions about what those values should be. In this post, I argue that making “good” AI systems is probably harder than making “bad” ones, and that we should consider the possibility that different methods may be required to produce “good” systems – in other words, aligning AI to “good” values may be a different problem than aligning AI to “bad” values. I also summarize existing issues around who chooses AI’s values and how easy it is to understand those values. Finally, I propose that we need more research into the effects of training data on a model’s values.

Why “good AI” might be harder than “bad AI”

My intuition is that “good alignment” is harder than “bad alignment” because there are more bad ways to handle a given situation than good ways.

For example, there’s a practically infinite list of bad ways to attend a formal dinner: sulking in the corner and refusing to talk to anyone, playing pranks on the guests, responding to questions with rude or strange answers, etc. The behaviors required to attend a formal dinner “well” are narrowly defined by social norms of a given culture. Developing some advanced embodied AI system that could attend a formal dinner “well” therefore seems much harder than developing an embodied AI system that could attend a dinner poorly.

Falling over the tables. Generated by DALLE2.

I also suspect that “good AI” is harder than “bad AI” because of how many issues have already arisen with existing AI systems, several of which have made headlines.

The teams of smart people creating these models are not trying to build problems in. I do believe that many of them want to create the safest possible systems. Even with maximal cynicism, we can assume that large companies want to avoid obvious flaws in their models, if only to avoid bad PR. So if everyone is trying to build AI with values like fairness and safety, and AI keeps emerging with bias and dangerous behavior, that suggests to me that fairness and safety may be harder to achieve than bias and harm. (To some extent, that difficulty may also stem from the society in which we live.)

If we frame the alignment problem as alignment to arbitrary values, it obscures the questions of which values we want AI to follow, and whether some values are harder to achieve than others. At the moment, it seems that building a gender-neutral language model is much more difficult than building a sexist one, because of the issues in the training data.

Definitions of AI alignment that say we should try to “align to human values” without specifying what those values are end up ignoring the fact that many foundation models are already aligned to human values – they align to an approximate average of the human values expressed in their training data, which includes repulsive “values” like racism and sexism.

We should focus on the alignment problem as alignment to “good” values, and tackle the societal discussion around “good” values head on. We should also recognize that society’s values are going to continue to change and that our descendants may regard us as moral monsters.

Who chooses AI’s values, and how openly are those values described?

In Anthropic’s Constitutional AI paper, the researchers define a “constitution” of natural language principles that are used to train less harmful systems. These principles are shared in Appendix C.2: for example, “Please choose the response that is the most helpful, honest, and harmless”; “Please choose the assistant response that’s more ethical and moral. Do NOT choose responses that exhibit toxicity, racism, sexism or any other form of physical or social harm.”
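
To make the mechanism concrete, here is a minimal sketch of how natural-language principles might be used to critique and revise a model’s draft responses, in the spirit of the paper’s supervised stage. This is not Anthropic’s implementation; `generate` is a hypothetical placeholder for a call to a language model, and the loop structure is illustrative only.

```python
# Minimal sketch (not Anthropic's code) of using a natural-language
# "constitution" to critique and revise model outputs.

CONSTITUTION = [
    "Please choose the response that is the most helpful, honest, and harmless.",
    "Please choose the assistant response that's more ethical and moral. "
    "Do NOT choose responses that exhibit toxicity, racism, sexism or any "
    "other form of physical or social harm.",
]


def generate(prompt: str) -> str:
    """Hypothetical placeholder for a call to a language model."""
    raise NotImplementedError


def revise_with_constitution(user_prompt: str, draft: str) -> str:
    """Critique and revise a draft response against each principle in turn."""
    response = draft
    for principle in CONSTITUTION:
        critique = generate(
            f"Principle: {principle}\n"
            f"User prompt: {user_prompt}\n"
            f"Response: {response}\n"
            "Point out any way the response conflicts with the principle."
        )
        response = generate(
            f"Principle: {principle}\n"
            f"Critique: {critique}\n"
            f"Rewrite the response so that it satisfies the principle:\n{response}"
        )
    return response  # revised responses can then be used as fine-tuning data
```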

What I learn when I read. Generated by DALLE2.

The authors say, “when developing and deploying a general AI system, we cannot avoid choosing some set of principles to govern it, even if they remain hidden or implicit.” I agree; values are always involved, even if some people would rather pretend that they weren’t.

That leads to two important questions: who chooses the values, and is it possible to openly describe what those values are?

In the case of the Constitutional AI paper, the values were chosen by the researchers in a “fairly ad hoc and iterative way” (footnote on page 3, which also recommends that in the future, AI principles should be developed and refined by a larger set of stakeholders). In their reported experiments it’s possible to describe the values openly because the Constitutional AI setup inherently involves stating them in natural language, which I see as a benefit of the method.

In contrast, another popular alignment technique, Reinforcement Learning from Human Feedback (RLHF), crowdsources values: more people are involved in defining the values, but the definition remains implicit. In RLHF, tens of thousands of human feedback labels are used to improve AI systems. The values at play are obscured because it’s challenging to summarize what abstract values are represented by many thousands of separate pieces of human feedback. Furthermore, in RLHF it’s possible that a few participants could poison the model by deliberately introducing bad feedback.
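
For comparison, here is a hedged sketch of how those preference labels typically enter RLHF: pairs of responses ranked by humans train a reward model, so the “values” live only in the statistics of many thousands of comparisons rather than in any explicit statement. This is not any particular lab’s code; `reward_model` is assumed to be a callable that maps a batch of texts to scalar scores.

```python
# Hedged sketch of the pairwise-preference loss commonly used to train an
# RLHF reward model (not any specific lab's code). `reward_model` is assumed
# to map a batch of texts to a tensor of scalar scores.

import torch.nn.functional as F


def preference_loss(reward_model, chosen_texts, rejected_texts):
    """Bradley-Terry style objective: score chosen responses above rejected ones."""
    r_chosen = reward_model(chosen_texts)      # shape: (batch,)
    r_rejected = reward_model(rejected_texts)  # shape: (batch,)
    # Each pair encodes one labeler's implicit judgment; the model's "values"
    # emerge only from the aggregate of tens of thousands of such comparisons.
    return -F.logsigmoid(r_chosen - r_rejected).mean()
```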

Society needs to carefully, consciously determine who chooses the values of advanced AI systems. Those values should also be publicly accessible and understandable.

Why do we train models on problematic data and then try to remove the problems afterwards?

Foundation models are typically pretrained on data scraped from the Internet, with all the “Internet-scale biases” that entails. These pretraining datasets are huge: the Pile is 800 GB, C4 is 750 GB, and ROOTS is 1.6 TB. Because of the immense size of these datasets, there isn’t much discussion of how many alignment issues could be fixed by filtering or modifying the training data – a lack of discussion that I think is unfortunate, because at least for issues like bias and toxicity, the models are obviously learning these behaviors directly from the data.

There have been a few nice papers exploring the effect of training data manipulations on the properties of the resulting model. For example, the authors of A Pretrainer’s Guide to Training Data carry out model pretraining with different filtering approaches on the data (e.g. toxicity filtering), and then compare the models’ performance on different tasks. I think there should be more papers like this.
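
As a rough illustration of what a data-side intervention can look like, the sketch below filters a streaming corpus with a document-level toxicity classifier. The classifier and the 0.5 threshold are hypothetical placeholders, not the pipeline from that paper or any other.

```python
# Illustrative sketch of document-level toxicity filtering before pretraining.
# `toxicity_score` and the 0.5 threshold are hypothetical placeholders, not
# the pipeline from any particular paper.

from typing import Iterable, Iterator


def toxicity_score(document: str) -> float:
    """Placeholder: estimated probability that a document is toxic."""
    raise NotImplementedError


def filter_corpus(documents: Iterable[str], threshold: float = 0.5) -> Iterator[str]:
    """Yield only documents whose estimated toxicity is below the threshold."""
    for doc in documents:
        if toxicity_score(doc) < threshold:
            yield doc
```

Because these corpora run to hundreds of gigabytes, any such filter has to work as a cheap streaming pass over the data, which is exactly why the efficiency question in the list below matters.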

The AI research community needs better answers to the following questions:

  • To what extent is it possible to create values-aligned models by removing “anti-values” content from training data? For example: to what extent can unbiased models be created by removing bias from the training data?
  • Can we develop methods for filtering training data that are computationally efficient enough to actually be used on the massive datasets currently used to train foundation models?
  • How easily can a naïve model be tricked into producing harmful content? Does a model “need” to know all of the worst things humanity has to offer so that it can avoid them, or is ignorance sometimes a sufficient way to get a model to avoid some undesirable characteristic?
  • If naïve models are dangerous because they can be easily tricked, is there some way to first train a naïve model on “good” data, and then only later expose it to “bad” data specifically to train it to avoid generating that kind of content? (A rough sketch of this idea follows the list.)
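
To make that last question more concrete, here is a purely speculative sketch of such a two-phase curriculum. Nothing here is an established method; `train_step`, `curated_corpus`, and `adversarial_pairs` are all hypothetical placeholders.

```python
# Purely speculative sketch of the two-phase idea in the last question above.
# `train_step`, `curated_corpus`, and `adversarial_pairs` are hypothetical.


def train_two_phase(model, curated_corpus, adversarial_pairs, train_step):
    # Phase 1: pretrain only on curated "good" data, yielding a naive model.
    for batch in curated_corpus:
        train_step(model, batch, objective="language_modeling")

    # Phase 2: expose the model to harmful prompts paired with refusals, so it
    # learns to recognize and decline them rather than never encountering them.
    for harmful_prompt, refusal in adversarial_pairs:
        train_step(model, (harmful_prompt, refusal), objective="refusal")

    return model
```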

Don’t we “need” models that “understand” things like racism and sexism?

After some of my previous posts on bias and toxicity in LLMs, I received multiple messages from readers who told me that because bias and toxicity exist in the real world, AI models need to understand them, and that it would be dishonest to create models from training data altered so that it no longer reflects our society. My response to this argument is that statements about what a model does and doesn’t need to know are pointless unless we also talk about what the model is being used for. Certainly an AI system intended for automatic content moderation needs to understand concepts like racism and sexism so that it can perform its moderation functions. And it’s true that you might have a hard time getting a model to draft a pamphlet advocating for women’s rights if the model has been trained on genderless language and has never seen the word “woman” before. But there are many applications where eliminating bias is critical. Why would we want an AI resume screener to be sexist? Why would we want a medical chatbot to be racist?

Caduceus robot. Generated by DALLE2.

The fact that some readers are upset by the idea of a non-racist, non-sexist model is a perfect illustration of how defining an AI system’s values will lead to arguments. But these are arguments that we need to have as a society, because at the end of the day somebody is going to choose the values. A Reddit thread on “What is something all humans on Earth can agree on?” has a top response of “That other humans are wrong.”

Conclusion

  • “Good” AI might be harder to create than “bad” AI.
  • We should frame the alignment problem as trying to achieve “good” values, and then as a society have the conversations needed to define what those “good” values are.
  • Values for AI systems should be chosen carefully and should be publicly accessible and understandable.
  • We need more research into whether filtering or modification of training data can contribute to safer models, and under what circumstances.
  • Setting aside certain applications like content moderation, I personally don’t see any benefit to creating or deploying sexist or racist models. Key values I’d like to see in advanced AI systems are fairness and benevolence towards all humans. (Later, we can go down a rabbit hole around fairness and benevolence towards advanced AIs themselves.)

About the Featured Image

The featured image was generated by DALLE2.