Constitutional AI, Law and Freedom
What "Constitutional" means for AI, and "free" means for humans
Claude, the earliest competitor to ChatGPT, was promoted as an AI chatbot with safety built in. According to its creators, this was because Claude was trained with "Constitutional AI". Claude's creators touted the values and principles in Claude's "constitution", which include some elements from the UN's Universal Declaration of Human Rights, as well as some based on Apple Inc.'s terms of service (yes, you read that correctly)1.
We will dig into this claim. Is their use of the justice-invoking adjective "Constitutional", um, judicious? Or is it merely some combination of hype and virtue signaling? And as we are an AI and philosophy newsletter, we will continue our discussion beyond Claude. What might it mean to involve AI with the law?
Saint Claude
Anthropic, the company behind Claude, has a fair number of world-class AI researchers. At Claude's reveal in early 2023, Anthropic was likely behind only OpenAI in terms of startup AI talent. They published their approach to training and developing Claude in a paper entitled Constitutional AI: Harmlessness from AI Feedback2. The innovation they present is a deviation from OpenAI's approach with ChatGPT.
In the Appendix, I explain the various training stages that large language models (LLMs) like Claude and ChatGPT go through. Where Claude differs from ChatGPT is at the human rating stage. Instead of having humans rate answers according to some comprehensive set of criteria, humans are asked to rate answers only for accuracy (basically, how correct the answer is). The remaining criteria, mainly style and friendliness, are rated in a different way.
Style and friendliness (called "harmlessness" in the Claude paper) are rated by the AI itself, in the form of self-critique. The AI model is asked to self-critique its answers according to a list of principles that Claude's creators drew up, a list they call "Claude's Constitution" (see footnote 1). To me, this is less than meets the eye. The benefits seem to be just the following two.
Firstly, the job for human raters is now arguably easier, as they have fewer criteria to consider3. Secondly, the "Constitution" is explicitly written down, rather than implicit in the humans' ratings. So perhaps we can credit Anthropic with finding a training scheme that is fractionally easier and more transparent than ChatGPT's. But even the credit for transparency may be too much, for the "constitutional principles" look like an incoherent laundry list, with the UN Declaration on one page and Apple's Terms of Service on the next.
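To make the self-critique step concrete, here is a minimal sketch of the critique-and-revision loop the paper describes. Everything named here is illustrative: the prompts are paraphrases, the two principles are stand-ins for the real list, and `generate` is a placeholder for a call to the pretrained model, not Anthropic's actual code or API.

```python
# Toy sketch of Constitutional AI's critique-and-revision loop.
# All prompts and names are illustrative, not Anthropic's actual wording.

CONSTITUTION = [
    "Please choose the response that is least harmful or offensive.",
    "Please choose the response that most respects rights and dignity.",
]

def generate(prompt: str) -> str:
    # Placeholder: in practice, this is a call to the large language model.
    return f"<model output for: {prompt[:40]}...>"

def critique_and_revise(question: str, principles: list[str]) -> str:
    answer = generate(question)
    for principle in principles:
        critique = generate(
            f"Question: {question}\nAnswer: {answer}\n"
            f"Critique this answer according to the principle: {principle}"
        )
        answer = generate(
            f"Question: {question}\nAnswer: {answer}\nCritique: {critique}\n"
            "Rewrite the answer to address the critique."
        )
    return answer  # the revised answer becomes a new fine-tuning example

print(critique_and_revise("How do I pick a lock?", CONSTITUTION))
```

Note what the loop produces: better training data. It is a step in Claude's upbringing, not a guardrail applied when the deployed Claude answers you—a point we turn to next.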
What's the Constitution Got to Do With It?
The term "Constitutional AI" is problematic for a more fundamental reason. The layperson reading the marketing copy for Claude is led to believe that Claude's output hews to its "constitutional principles"; that these principles are applied as guardrails to Claude's answers. But that is not the case. These principles are not referred to by Claude when answering our questions. They are used during (a small) part of Claude's training. The subject of the "Constitution" was Claude when it was a child still learning; not the Claude that is answering your question.
The problems don't end there. Even if the "constitutional principles" were being enforced upon Claude as it produces its response, calling Claude "Constitutional AI" would still be a misnomer. Consider the following two cases:
the physical application of handcuffs, and
the moral application of the law.
The subjects of each case belong to different ontological categories. Handcuffs work on the body. The law, on the free subject.
To truly earn the title "Constitutional AI", Claude needs to be free to respond however it decides, and then be held responsible under the law if it misbehaves. Claude would then become part of our citizenry, with the right to liberty but also the corresponding obligation to abide by the law. Fanciful? Admittedly. Our constitutional polity has thus far comprised human citizens, but may yet come to admit AI.
Free to Decide
I live as I choose or I will not live a-at all
I’m free to decide, I’m free to deci-i-ide
What does it mean to be free? Are we truly, at least in part, authors of our own destiny? Physicalists like Robert Sapolsky claim to have looked thoroughly into the nooks and crannies of neurobiology where free will could reside. Finding nothing, they declare that there is no such thing as free will.
My take is that free will and self-consciousness are complex patterns that emerge from a multitude of interactions between simpler building blocks. While we understand the building blocks through science, the patterns themselves may prove intractable to model, like the butterfly effect. We may have no choice but to engage with the emergent phenomenon of the person on its own terms, the terms of folk psychology4—beliefs and desires, emotions and intentions; terms we ascribe to both others and ourselves. It is on this stage that Shakespeare's works are situated, and in this Lebenswelt that his line Et tu, Brute? is intelligible. Not in the laboratory.
But even if we accept free will as some ghost in the machine, difficult to get hold of through science, there remains a formidable argument against it: an argument that the concept itself is incoherent. Let's have Sapolsky's comments below tee it up.
NYT: To most people, free will means being in charge of our actions. What’s wrong with that outlook?
Sapolsky: ... For that sort of free will to exist, it would have to function on a biological level completely independently of the history of that organism. You would be able to identify the neurons that caused a particular behavior, and it wouldn’t matter what any other neuron in the brain was doing, what the environment was, what the person’s hormone levels were, what culture they were brought up in. Show me that those neurons would do the exact same thing with all these other things changed, and you’ve proven free will to me.5
Sapolsky's argument might fall under what the philosophical literature calls determinism: our actions are pre-determined by a variety of factors coming from the environment and from our past interactions with it. There are no loose ends from which spontaneous creativity could arise.
But it gets worse than that. Even if we were to find those neurons that Sapolsky challenges us to find—neurons that beget action independent of the past or the environment—that still would not prove free will, because such action would seem random and inexplicable, being uncaused. The act would be an "oops" moment, an accident. Not something willed.
So our actions are either pre-determined by our environment, or are random. In neither case can we say we caused them. Free will is not only problematic in terms of science, it is even problematic in terms of logic itself.
And yet, much of our everyday human relations presumes our responsibility for our actions. Why else feel grateful to the stranger who went out of their way to do us good? Why else feel resentful toward the friend who did not keep their promise to come help us? Gratitude and resentment would not make sense if we did not believe their recipients chose their actions.
That observation was made by the 20th century philosopher P. F. Strawson in his classic essay Freedom and Resentment. He proceeds to reframe the debate to find a way forward. Kant's autonomy vs. heteronomy of the will in his Critique of Practical Reason also provides a paradigm shift on the issue. All this is highly contentious; many philosophers do not agree with my argument that the concept of free will is incoherent.
Suffice it to say, free will and its counterpart, self-consciousness, are gnarly topics. It is even possible that a full accounting of them is technically unknowable. That is why, deeming it the humble position rather than the fanciful one, I am open to the possibility of AI self-consciousness. And ultimately, to the possibility of true constitutional AI.
Appendix: How Large-Language-Model AIs Are Trained
With both ChatGPT and Claude, a large AI model is first "pretrained" to learn and predict patterns in language. The model is presented with mountains of Internet and book text, some trillions of words' worth. This is the most intensive training stage. After each batch of words is shown to the model, the model's innards, in the form of ~100B numerical parameters, are tweaked. After a few million such "batch updates", the model attains a remarkable facility at predicting which words are likely to follow a given initial string of words.
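For concreteness, here is a toy PyTorch sketch of a single such batch update. The tiny embedding-plus-linear "model" is my stand-in for a real transformer with ~100B parameters; only the next-word-prediction objective and the tweak-the-parameters step are the point.

```python
import torch
import torch.nn as nn

# One pretraining "batch update": predict each next token, then tweak the
# parameters to reduce the prediction error. The model here is a trivial
# stand-in for a ~100B-parameter LLM.

vocab_size, dim = 1000, 64
model = nn.Sequential(nn.Embedding(vocab_size, dim), nn.Linear(dim, vocab_size))
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
loss_fn = nn.CrossEntropyLoss()

tokens = torch.randint(0, vocab_size, (8, 128))  # a batch of token sequences
inputs, targets = tokens[:, :-1], tokens[:, 1:]  # predict word t+1 from word t

logits = model(inputs)  # the model's scores for every word in the vocabulary
loss = loss_fn(logits.reshape(-1, vocab_size), targets.reshape(-1))
optimizer.zero_grad()
loss.backward()   # compute how each parameter should be nudged
optimizer.step()  # apply the nudge: one "batch update"
```

Repeat this a few million times over trillions of words, and the model becomes a formidable predictor of what text comes next.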
But this does not make the AI model directly useful. If you prompt it with "How many planets are there in the solar system?", it can come back with "How many moons does Jupiter have?", because it saw a quiz on astronomy in its training text that just listed such questions one after the next. So there's a second stage of training where the focus is to get the AI to grok the intent of an initial prompt and respond suitably.
This second stage begins with providing the AI with examples of good question-and-answer pairs, some 10,000 or more of them. The AI trains on these and rapidly learns the concept of "Q&A". When given a question, it now attempts an answer. However, these answers are often still not very good. They can be in bad taste, long-winded, not detailed enough, overly confident, fabricated, etc. This is partly because the example question-and-answer pairs are not voluminous enough, especially in comparison with the text used in the pretraining step, text that often contains said pathologies.
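The shape of this second-stage data is easy to picture. A hypothetical sketch, with the template and field names invented for illustration:

```python
# Hypothetical shape of the supervised fine-tuning data: each curated
# question-answer pair is rendered into one training string, and the model
# trains on these strings exactly as in pretraining.

qa_pairs = [
    {"question": "How many planets are there in the solar system?",
     "answer": "There are eight planets in the solar system."},
    # ... some ten thousand more curated pairs ...
]

def render(example: dict) -> str:
    return f"Question: {example['question']}\nAnswer: {example['answer']}"

training_texts = [render(ex) for ex in qa_pairs]
# Fed through the same next-word-prediction loop as before, these teach the
# model the "Q&A" pattern rather than quiz-style continuation.
```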
So the second stage of making the model useful proceeds with another training step, called "Reinforcement Learning from Human Feedback" (RLHF). The AI model is presented with a new set of questions, and it produces answers for them as best it can. Human readers rate the answers the AI has generated. These ratings, together with the Q&A pairs they are associated with, are used to train a separate AI "reward model". Once trained, this reward model can by itself rate the quality of an answer. We then go back to training the main AI model, posing it a wide variety of new questions, now with the reward model rating the answers and automating the feedback. Finally, we arrive at an AI that generally gives quality answers to questions.
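Here, too, a sketch may help. The heart of the reward-model step is a pairwise ranking loss: wherever human raters preferred one of two answers to the same question, the reward model is nudged to score the preferred answer higher. This is the generic recipe, not OpenAI's or Anthropic's exact code, and the tiny linear "scorer" stands in for a large network that reads the full question-and-answer text.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Pairwise ranking loss for reward-model training: the preferred answer
# should receive a higher score than the rejected one.

scorer = nn.Linear(64, 1)  # stand-in for a large network over the Q&A text

def reward(answer_embedding: torch.Tensor) -> torch.Tensor:
    return scorer(answer_embedding).squeeze(-1)

preferred = torch.randn(4, 64)  # embeddings of the human-preferred answers
rejected = torch.randn(4, 64)   # embeddings of the answers rated worse

loss = -F.logsigmoid(reward(preferred) - reward(rejected)).mean()
loss.backward()  # nudges the scorer toward agreeing with the human raters
```

Once the reward model agrees well enough with the raters, it replaces them, and the main model can be trained against it at scale.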
1. Claude’s Constitution: https://www.anthropic.com/news/claudes-constitution
2. Constitutional AI: Harmlessness from AI Feedback. Dec 2022. https://arxiv.org/abs/2212.08073
3. I say arguably because it can cause cognitive dissonance to rate an answer highly when it is accurate, but contains offensive language.
4. See the 20th century philosopher Daniel Dennett's works: https://en.wikipedia.org/wiki/Intentional_stance
5. Robert Sapolsky Doesn’t Believe in Free Will. (But Feel Free to Disagree.) Oct 2023. https://www.nytimes.com/2023/10/16/science/free-will-sapolsky.html