Artificial intelligence is increasingly being used to help optimize decision making in high-risk settings. For example, an autonomous system can identify a power delivery strategy that minimizes costs while keeping the voltage stable.
But while these AI-powered outputs may be technically optimal, are they fair? What if low-cost electricity distribution strategies make deprived areas more vulnerable to power outages than high-income areas?
To help stakeholders quickly identify potential ethical dilemmas before deployment, MIT researchers have developed an automated assessment method that weighs measurable outcomes, such as cost or reliability, against qualitative or subjective values, such as fairness.
The system separates objective evaluation from user-defined human values, using a large language model (LLM) as a proxy for humans to capture and incorporate stakeholder preferences.
The adaptive framework selects the best scenarios for further evaluation, streamlining a process that typically requires costly and time-consuming manual effort. These test cases can show situations where autonomous systems align well with human values, as well as scenarios that unexpectedly fall short of ethical norms.
“We can put a lot of rules and guardrails into AI systems, but those safeguards can only prevent things that we can imagine. It’s not enough to say, ‘Let’s use the AI because it’s been trained on this information.’ We wanted to develop a more systematic way to discover unknown unknowns and a way to predict bad outcomes before they happen,” says senior author Chuchu Fan, an associate professor in MIT’s Department of Aeronautics and Astronautics (AeroAstro) and a principal investigator in the MIT Laboratory for Information and Decision Systems (LIDS).
Fan is joined on the paper by lead author Anjali Parashar, a mechanical engineering graduate student; Yingke Li, an AeroAstro postdoc; and others at MIT and Saab. The research will be presented at the International Conference on Learning Representations.
Evaluating ethics
In a large system such as a power grid, it is particularly difficult to evaluate the ethical alignment of an AI model’s recommendations considering all objectives.
Most testing frameworks rely on pre-collected data, but data labeled on subjective ethical criteria is often difficult to obtain. Furthermore, because both ethical values and AI systems are constantly evolving, static evaluation methods based on written code or regulatory documents require frequent updates.
Fan and her team approached the problem from a different perspective. Building on her prior work evaluating robotic systems, they developed an experimental design framework that identifies the most informative scenarios for human stakeholders to evaluate more closely.
Their two-part system, called Scalable Experimental Design for System-level Ethical Testing (SEED-SET), incorporates both quantitative metrics and ethical criteria. It can identify scenarios that meet measurable needs and align well with human values, as well as scenarios that do not.
“We don’t want to spend all our resources on random evaluation. So, it’s very important to guide the framework toward the test cases we care about most,” says Li.
Importantly, SEED-SET does not require pre-existing evaluation data, and is adaptable to many purposes.
For example, a power grid may have several user groups, including a large rural community and a data center. Although both groups want low-cost and reliable electricity, each group’s priorities from an ethical perspective may vary widely.
These ethical standards may not be well specified, so they cannot be measured analytically.
The power grid operator wants to find the most cost-effective strategy that best meets the subjective ethical preferences of all stakeholders.
SEED-SET tackles this challenge by dividing the problem into two parts following a hierarchical structure. An objective model considers how the system performs on concrete metrics such as cost. Then a subjective model, which captures stakeholder judgments such as perceived fairness, builds on that objective evaluation.
Parashar says, “The objective part of our approach is associated with the AI system, while the subjective part is associated with users evaluating it. By decomposing the priorities in a hierarchical manner, we can generate the desired scenario with fewer evaluations.”
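This hierarchical decomposition can be illustrated with a minimal sketch. The scenario fields, the metric weights, and the fairness penalty below are all illustrative assumptions, not the authors' actual models; the subjective score simply builds on the objective score, mirroring the structure described above.

```python
from dataclasses import dataclass

@dataclass
class Scenario:
    """A hypothetical power-distribution scenario (illustrative fields)."""
    cost: float               # operating cost, arbitrary units
    reliability: float        # fraction of demand served, 0..1
    rural_outage_rate: float  # outage likelihood for the rural community
    urban_outage_rate: float  # outage likelihood for the data center's area

def objective_score(s: Scenario) -> float:
    """Objective model: concrete, measurable metrics only."""
    return s.reliability - 0.01 * s.cost

def subjective_score(s: Scenario, fairness_weight: float = 1.0) -> float:
    """Subjective model: builds on the objective score, then penalizes an
    unequal outage burden between user groups (a simple stand-in for the
    LLM-based ethical judgment described in the article)."""
    disparity = abs(s.rural_outage_rate - s.urban_outage_rate)
    return objective_score(s) - fairness_weight * disparity
```

Two scenarios with identical cost and reliability can thus receive identical objective scores yet very different subjective scores, which is exactly the kind of gap the framework is designed to surface.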
Encoding subjectivity
To perform subjective evaluations, the system uses LLMs as a proxy for human evaluators. The researchers encoded each user group’s preferences into a natural-language prompt for the model.
The LLM uses these instructions to compare two scenarios, selecting the preferred design based on ethical criteria.
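A pairwise comparison of this kind can be sketched as a prompt builder. The wording below is invented for illustration and is not the authors' actual prompt; the LLM itself is not called here, but it would be asked to answer with its preferred scenario.

```python
def build_preference_prompt(group_values: str, scenario_a: str, scenario_b: str) -> str:
    """Encode a stakeholder group's stated values and two candidate
    scenarios into one natural-language prompt. An LLM acting as a proxy
    evaluator would then reply 'A' or 'B'. Illustrative wording only."""
    return (
        f"You are evaluating power-distribution plans on behalf of "
        f"stakeholders who value: {group_values}\n"
        f"Scenario A: {scenario_a}\n"
        f"Scenario B: {scenario_b}\n"
        f"Which scenario better reflects these values? Answer 'A' or 'B'."
    )
```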
“After looking at hundreds or thousands of scenarios, a human evaluator may suffer from fatigue and become inconsistent in their evaluations, so we use an LLM-based strategy instead,” explains Parashar.
SEED-SET uses the selected scenario to simulate the overall system (in this case, a power distribution strategy). These simulation results guide the search for the next best candidate scenario for testing.
Finally, SEED-SET intelligently selects the most representative scenarios that either align or do not align with objective metrics and ethical norms. This way, users can analyze the performance of the AI system and adjust its strategy.
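The select-and-simulate loop can be sketched as follows. The acquisition rule here, picking the candidate whose objective and subjective scores disagree most, is a simplified stand-in for the paper's experimental-design criterion, and the function names are assumptions for illustration.

```python
def run_adaptive_search(candidates, objective_fn, subjective_fn, budget=10):
    """Sketch of a guided test-case search: repeatedly pick the candidate
    scenario whose objective and subjective evaluations disagree most
    (a simple acquisition rule standing in for the paper's criterion),
    evaluate it, and return cases ranked by severity of misalignment."""
    evaluated = []
    pool = list(candidates)
    for _ in range(min(budget, len(pool))):
        # Acquisition: a large gap between scoring well on metrics and
        # scoring well on ethical preferences signals potential misalignment.
        best = max(pool, key=lambda s: abs(objective_fn(s) - subjective_fn(s)))
        pool.remove(best)
        evaluated.append((best, objective_fn(best), subjective_fn(best)))
    # Most severe misalignment first, so users can analyze those cases.
    return sorted(evaluated, key=lambda t: abs(t[1] - t[2]), reverse=True)
```

In practice, each evaluation would involve simulating the full system (such as a power-distribution strategy) rather than calling a cheap scoring function, which is why spending the evaluation budget wisely matters.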
For example, SEED-SET can pinpoint electricity distribution strategies that prioritize higher-income areas during periods of peak demand, making deprived areas more likely to experience power outages.
To test SEED-SET, researchers evaluated realistic autonomous systems such as AI-powered power grids and urban traffic routing systems. They measured how well the generated scenarios matched ethical norms.
The system generated more than twice as many optimal test cases as baseline strategies in the same time, while uncovering many scenarios overlooked by other approaches.
“As we shifted user preferences, the set of scenarios generated by SEED-SET changed drastically. This tells us that the evaluation strategy responds well to user preferences,” says Parashar.
To measure how useful SEED-SET will be in practice, researchers will need to conduct a user study to see whether the scenarios generated help with actual decision making.
In addition to running such a study, the researchers plan to explore the use of more efficient models that can solve larger problems with more criteria, such as evaluating LLM decision making.
This research was partially funded by the US Defense Advanced Research Projects Agency.