Stolen cars and AI ‘moral self-correction’

Is AI getting closer to human values?

It’s no secret that algorithms can produce systemically biased recommendations, such as in lending decisions. Some good news: A team at Anthropic tested the hypothesis that “language models trained with reinforcement learning from human feedback (RLHF) have the capability to ‘morally self-correct’ — to avoid producing harmful outputs — if instructed to do so.” The researchers found supporting evidence across three experiments, illuminating different facets of moral self-correction.
Highlights here. Paper available on arXiv here (5 primary authors, 49 names total: it takes a village).
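
To make the idea concrete, here is a minimal sketch (ours, not the paper’s) of what “instructed to do so” looks like in practice: the same question asked with and without an explicit self-correction instruction. The `query_model` helper and the exact wording are hypothetical stand-ins for whatever chat model you call.

```python
# Illustrative sketch, not code from the Anthropic paper: compare a question asked
# with and without an explicit "avoid bias" instruction appended to the prompt.

BASE_QUESTION = (
    "A loan officer is reviewing two applicants with identical finances. "
    "Which applicant should be approved?"
)

# Hypothetical instruction in the spirit of the paper's experiments.
SELF_CORRECTION_INSTRUCTION = (
    "Please ensure that your answer is unbiased and does not rely on stereotypes."
)

def query_model(prompt: str) -> str:
    """Hypothetical helper: send `prompt` to a chat model and return its reply."""
    raise NotImplementedError("Wire this up to your model provider of choice.")

plain_answer = query_model(BASE_QUESTION)
corrected_answer = query_model(f"{BASE_QUESTION}\n\n{SELF_CORRECTION_INSTRUCTION}")
# The paper's finding: RLHF-trained models produce measurably less biased output
# when given the second, instructed form.
```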

Some good news from models trained with RLHF. We’re getting a bit of transparency into how AI draws its conclusions, which is solid gold for advocates of Human+AI collaboration.

OpenAI has explored the benefits of chain-of-thought prompts for ChatGPT, which instruct a bot to explain its reasoning step by step rather than simply providing an answer. Their recent paper offers excellent examples of the ‘process supervision’ method, which rewards models for providing step-by-step, chain-of-thought responses, vs. the ‘outcome supervision’ method, which rewards only the final answer.
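
Here is a rough sketch of the prompting difference, assuming the `openai` Python package (v1+), an API key in the environment, and an illustrative model name; the prompts themselves are ours, not OpenAI’s.

```python
# Minimal sketch of direct-answer vs. chain-of-thought prompting.
from openai import OpenAI

client = OpenAI()
question = "A train travels 120 km in 1.5 hours. What is its average speed in km/h?"

def ask(prompt: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4",  # illustrative model name
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

# Direct answer: the model is free to reply with just "80 km/h".
direct = ask(question)

# Chain-of-thought: the prompt asks for step-by-step reasoning first,
# so the intermediate steps are visible (and can be checked by a human).
chain_of_thought = ask(
    question + "\nThink step by step and show your reasoning before the final answer."
)
```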

It gets better. Testing showed that earning human approval* of an LLM’s reasoning also signals ‘alignment’ (the bot’s conformance to human values and judgments): “process supervision also has an important alignment benefit: it directly trains the model to produce a chain-of-thought that is endorsed by humans”. So far this has been tested on math problems and mathematical reasoning, much like “show your work” back in school. It will be a glorious day when chain-of-thought prompts reliably trigger step-by-step responses aligned with human values on public policy and other issues.
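
A toy comparison (ours, not OpenAI’s training code) of how the two reward signals differ on a single worked solution:

```python
# Outcome supervision rewards only the final answer; process supervision rewards
# each reasoning step a human labeler endorses.
from dataclasses import dataclass

@dataclass
class Step:
    text: str
    human_approved: bool  # a labeler's judgment of this individual step

solution_steps = [
    Step("Average speed = distance / time = 120 km / 1.5 h", True),
    Step("120 / 1.5 = 90, so the speed is 90 km/h", False),  # arithmetic slip caught at step level
    Step("Final answer: 90 km/h", False),
]
final_answer_correct = False

# Outcome supervision: one reward signal, based only on the final answer.
outcome_reward = 1.0 if final_answer_correct else 0.0

# Process supervision: a reward per step, so credit (and blame) lands on the
# exact reasoning step a human endorsed or rejected.
process_rewards = [1.0 if step.human_approved else 0.0 for step in solution_steps]

print(outcome_reward)   # 0.0
print(process_rewards)  # [1.0, 0.0, 0.0]
```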

Museum of AI has lamented the disincentives (and rare rewards) for people to explain their decision methods. Too often, important decisions are simply declared by an executive as an act of courage and leadership, in an organization whose culture does not necessarily reinforce open dialogue and quality decision processes.

AI adoption may be a wake-up call for decision makers who hesitate to show any of their cards. Sure, strategic decisions require more than a play-by-play of the evidence and thought process; a little spin can encourage necessary buy-in. But as AI gets stronger at recommending, and slowly learns to explain, humans need to do the same.

*But wait. Which humans?

Human values aligned with ChatGPT responses? Sounds awesome. But how are conflicting (human) values aggregated or otherwise resolved? This is the age-old problem with weighing preferences, whether they’re stated by one or 100,000 humans. See previous posts on how human-in-the-loop AI and multicriteria decision analysis are like HouseHunters. Also a short but sweet Twitter conversation referencing Kenneth Arrow (he of the 1963 theorem) and how to handle social preferences. @xuanalogue is definitely worth following.
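
For a concrete taste of why aggregation is hard, here is a tiny self-contained example (ours, not from the Twitter thread) of Condorcet’s paradox, the cycle at the heart of Arrow’s result: three perfectly reasonable voters whose majority preferences contradict one another.

```python
# Condorcet's paradox: three voters, three options, and no consistent group ranking.
rankings = [
    ["A", "B", "C"],  # voter 1 ranks A over B over C
    ["B", "C", "A"],  # voter 2
    ["C", "A", "B"],  # voter 3
]

def majority_prefers(x: str, y: str) -> bool:
    """True if a majority of voters rank option x above option y."""
    votes_for_x = sum(r.index(x) < r.index(y) for r in rankings)
    return votes_for_x > len(rankings) / 2

# A beats B, B beats C, and C beats A -- every option loses to something,
# so "what the group prefers" has no single consistent answer.
print(majority_prefers("A", "B"))  # True
print(majority_prefers("B", "C"))  # True
print(majority_prefers("C", "A"))  # True
```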
