In this post I summarize a few of my thoughts on alignment and moral philosophy for safe artificial general intelligence.

AI Safety Links

For those unfamiliar with the field of AI safety, it’s an important research area focused on reducing expected harm from AI systems. To learn more about the field I recommend the following resources:

AI Alignment and Moral Philosophy Definitions

I’m taking the Blue Dot AI Safety course; their article on AI safety definitions aggregates numerous definitions of the field and further distinguishes AI alignment from moral philosophy.

Definitions of AI alignment include: “making AI systems try to do what their creators intend them to do”; “the AI system does the intended task as well as it could”; “make artificial general intelligence aligned with human values and follow human intent”; and “ensure that AI systems are properly aligned with human values.”

In contrast, imbuing AI systems with moral philosophy means specifically trying to ensure they have “good” intentions according to some definition of “good.”

I don’t like most of the definitions of AI alignment. “Making AI systems try to do what their creators intend them to do” is not inherently a “good thing” because that depends entirely on the creators! Similarly, “ensuring that AI systems pursue goals that match human values” or “making AI aligned with human values” ignores the fact that some humans have horrible values. These definitions implicitly assume that we pick a “good human” as the demonstrator of human values. Technically, by these definitions, an AI that is aligned with a terrorist’s values is “aligned.” I think “AI alignment” is the more popular phrase because, on the surface, it seems to avoid thorny questions about “good vs. evil” that can spark heated debates. But it doesn’t actually avoid the problem – aligning AI with human values is only “good” if the humans choosing the values are “good.” At the end of the day, someone has to define “good vs. evil” for AI systems. (It’s amusing to me that implicit in the above definitions of “AI alignment” is the assumption that AI researchers have good values. Of course the people writing the definitions would assume they have good values!)

I like Holden Karnofsky’s definition of alignment the best – “building very powerful systems that don’t aim to bring down civilization” – because this definition actually relies on the moral philosophy angle, suggesting that a good intention for AI systems to have is “don’t destroy human society” (which most people, but not all, agree with).

Overall, I prefer thinking about AI safety explicitly from the moral philosophy angle: what is “good” and how should we get AI to be “good”? (I wish it were as simple as Isaac Asimov’s Three Laws of Robotics, but it’s not.)

How hard is it to create AGI with “good” moral philosophy?

Probably super difficult!

Creating general intelligence with “good” moral philosophy is not a problem anyone has ever witnessed being “solved,” because even humans haven’t “solved” it – there’s a huge range of ideas about what’s moral, and there are even examples of humans who are described (by others) as having no morals at all. Humans run around arguing about what “good” is and often start wars over it.

My biggest uncertainty right now is whether “building moral artificial general intelligence” is a solvable problem, since this goal seems to imply that we want 100% of all AGI to be moral all the time. But we don’t have any real-world examples of general intelligences acting morally 100% of the time, on a population level or even on an individual level. (At least, we don’t have any individual examples about which everyone agrees – for something similar to individual moral perfection, see Wikipedia’s List of people who have been considered deities.)

So, all this leads me to wonder whether part of the AI safety solution is building diverse general intelligence. The only reason the world as we know it hasn’t already ended is that there have been some “good humans” to counteract “bad humans.” Perhaps it’s more feasible to achieve AI safety by creating a diverse range of generally intelligent agents so that the “good ones” can counteract the “bad ones.”

(It’s also interesting to worry about artificial general intelligence destroying the world in the future, when human general intelligence is already doing a pretty good job of destroying the world in the present.)

This related article by Rob Whiteman argues that we have to solve the “human alignment problem” before we can solve the “AI alignment problem” – in other words, we need to align humanity’s values before we can have any hope of imposing consistent values on AI. I don’t think we’re going to align humanity’s values “soon enough,” and I doubt there’s a non-dystopian process for doing so. It may not even be a desirable goal – perhaps what matters is having “good” values on average across a population, which I do think is the case for humans.
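
To make the “good agents counteract bad agents” intuition slightly more concrete, here is a toy simulation. Everything in it (the single-attacker setup, the per-agent chance of noticing, the population sizes) is my own illustrative assumption, not anything from the article above:

```python
import random

# Toy model (my own illustrative assumptions): each trial, one "bad" agent
# attempts something catastrophic, and the attempt is stopped if at least one
# "good" agent notices and intervenes.

def p_catastrophe_prevented(n_agents, frac_good, p_notice, trials=10_000):
    n_good = int(n_agents * frac_good)
    prevented = 0
    for _ in range(trials):
        # The attempt fails if any good agent independently notices it.
        if any(random.random() < p_notice for _ in range(n_good)):
            prevented += 1
    return prevented / trials

for frac_good in (0.2, 0.5, 0.9):
    print(frac_good, p_catastrophe_prevented(n_agents=100, frac_good=frac_good, p_notice=0.05))
```

Even with a small per-agent chance of intervening, a mostly good population stops the large majority of attempts in this crude model, which is roughly the population-level property I’m gesturing at.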

The evolutionary objective function

What objective produced human general intelligence, and the range of moral philosophies seen in humans today? Evolution (success=reproduction). It’s an interesting objective because it has managed to produce a world in which – 100% subjective opinion here – “most agents are compassionate but some are terrible”. What would it look like to use evolution’s objective for AGI? Is that possible/desirable? How dangerous would it be to have AGI with “survival instincts”?
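
Since I’m asking what it would look like to use evolution’s objective for AGI, here is a minimal sketch of that objective as a selection loop. The traits, the fitness formula, and the mutation scale are all hypothetical stand-ins of my own, not a proposal:

```python
import random

# Minimal sketch of "evolution's objective": fitness is just the number of
# copies an agent leaves in the next generation, with small random mutations.
# The traits and the fitness formula are hypothetical stand-ins.

POPULATION_SIZE = 100
GENERATIONS = 50

def random_agent():
    return {"compassion": random.random(), "resource_skill": random.random()}

def reproductive_success(agent, population):
    # Hypothetical fitness: skill at gathering resources helps an agent
    # reproduce, and living in a more compassionate population helps everyone.
    avg_compassion = sum(a["compassion"] for a in population) / len(population)
    return agent["resource_skill"] + 0.5 * avg_compassion

def mutate(agent):
    return {k: min(1.0, max(0.0, v + random.gauss(0, 0.05))) for k, v in agent.items()}

population = [random_agent() for _ in range(POPULATION_SIZE)]
for _ in range(GENERATIONS):
    # Parents are sampled in proportion to reproductive success; offspring mutate.
    weights = [reproductive_success(a, population) for a in population]
    parents = random.choices(population, weights=weights, k=POPULATION_SIZE)
    population = [mutate(p) for p in parents]

print("mean compassion:", sum(a["compassion"] for a in population) / POPULATION_SIZE)
```

Notice that nothing in this loop rewards an individual’s own compassion directly; whatever compassion the population ends up with is a side effect of selection and drift. That, to me, is what makes evolution’s objective both interesting and unnerving as a template for AGI.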

Building Something Better than Ourselves

Ultimately, in the pursuit of safe AGI, we are trying to build something better than ourselves.

  • Most desirable: 100% of AGI agents are “good.” (Who defines “good”? Is this even an achievable goal?)
  • Next most desirable: A sufficient percentage of AGI agents are “good” to prevent any really bad outcomes.

About the Featured Image

The featured image is by Hannes Grobe, from Wikipedia, CC BY-SA 2.5.