Whether you are playing poker against a single opponent or find yourself in a bidding war with another prospective buyer over a home, you are operating under conditions of imperfect information. You know what cards you hold, and you know how far above the asking price you can afford to go, but you don’t know what hand your opponent has or how high the other buyer is willing to bid.
A paper co-authored by MIT researchers and presented at the International Conference on Learning Representations in Rio de Janeiro in April won’t tell you what specifically to do in these situations. But it offers new insight into so-called imperfect-information games that involve two competitors facing off in a “zero-sum” competition, where one player’s gain means the other player’s loss.
MIT researchers on the project include Sobhan Mohammadpour, a PhD student in MIT’s Department of Electrical Engineering and Computer Science (EECS) and the Laboratory for Information and Decision Systems (LIDS); and Gabriele Farina, assistant professor in EECS and a principal investigator in LIDS. Additional co-authors include Max Rudolph of the University of Texas at Austin (UT); Nathan Lichtlé and Alexandre Bayen of the University of California at Berkeley (UCB); J. Zico Kolter of Carnegie Mellon University (CMU); Amy X. Zhang ’11, MEng ’12 of UT; Eugene Vinitsky of New York University; and Samuel Sokota of CMU.
The focus of the new work is on algorithms that can be used to train neural networks to play imperfect-information games. A long-standing belief in the field was that algorithms based on the principles of game theory would, in this setting, clearly outperform a family of general-purpose algorithms called policy gradient methods, which came into use for decision making in the 1990s. The word “policy” in this context basically means strategy, while “gradient” refers to the direction of greatest change – for example, the steepest path toward the top (or bottom) of a hill. Policy gradient methods are used to train neural networks to make decisions in small, sequential steps toward a particular goal (like reaching the summit, metaphorically speaking), with frequent adjustments and course corrections along the way to bring the agent closer to the desired destination.
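To make the hill-climbing picture concrete, here is a minimal sketch of a policy gradient update in Python. It is a toy, single-agent illustration and not the method evaluated in the paper: the three actions, their rewards, the step size, and all function names are assumptions chosen for the example, and the gradient is computed exactly rather than estimated from sampled play with a neural network.

```python
import math

def softmax(logits):
    """Turn unconstrained scores into a probability distribution (the policy)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def expected_reward(logits, rewards):
    """Average reward earned when actions are drawn from the softmax policy."""
    return sum(p * r for p, r in zip(softmax(logits), rewards))

def policy_gradient_step(logits, rewards, step_size):
    """Take one small uphill step on expected reward.

    For a softmax policy the exact gradient with respect to the score of
    action a is prob(a) * (reward(a) - expected reward).
    """
    probs = softmax(logits)
    baseline = sum(p * r for p, r in zip(probs, rewards))
    grad = [p * (r - baseline) for p, r in zip(probs, rewards)]
    return [x + step_size * g for x, g in zip(logits, grad)]

def train(rewards, steps=500, step_size=0.5):
    """Repeat many small steps, starting from a uniform policy."""
    logits = [0.0] * len(rewards)
    for _ in range(steps):
        logits = policy_gradient_step(logits, rewards, step_size)
    return softmax(logits)

# Hypothetical rewards for three actions; action 1 is the best choice.
REWARDS = [0.2, 1.0, 0.5]
final_policy = train(REWARDS)
```

After training, `final_policy` places most of its probability on the highest-reward action. In a two-player game the rewards themselves would shift as the opponent adapts, which is the complication Farina describes.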
Although strategic games were not on the original agenda when policy gradient methods were conceived in the early 1990s, the authors of the new paper still wondered how this class of algorithms might perform in a two-player game. According to Farina, these methods become more complex to analyze in multi-agent settings. “You can still move in a direction to improve your circumstances, but, due to the actions of the other player, that direction can constantly change during the game. And those changes can happen rapidly.”
“It was largely assumed that particular game-theoretic algorithms were the right approach for this setting,” Sokota says. “Our study showed that policy gradient methods may work better than these particular algorithms, and that these algorithms may not work as well as people thought – which raises an interesting sociological question about why this went unnoticed for so long. Part of the answer is that the field had not done the engineering work necessary to rigorously evaluate the algorithms, so it was difficult to tell what worked and what didn’t.”
As a result, a major contribution of this work is to provide a uniform way to evaluate different algorithms that can teach agents – i.e., neural networks – to compete in imperfect-information games. “Unlike many papers published in this area, we are not proposing a new algorithm that can beat other algorithms. We are proposing a benchmark that can assess these algorithms.”
Simply put, benchmarks involve software designed to rate the performance of algorithms. “What we’re offering is a testing ground or playground where people can take their algorithms, train them for a specific task and see how well they do,” says Farina.
The group calculates a player’s performance based on a concept called exploitability, which measures how well a player performs against a “worst-case opponent,” Sokota explains. “In a game like poker, this opponent will not know what my hand is, but he will know how I will behave for any given hand.” A score of zero on this scale indicates perfect play, while higher exploitability scores indicate play that is farther from optimal.
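The idea can be sketched on a game small enough to solve by hand. The Python example below computes exploitability for rock-paper-scissors using one commonly used definition: the average amount each player could gain by switching to a best response against the other's known strategy. The payoff matrix, function names, and normalization are assumptions made for illustration; the paper's benchmark deals with far larger games and may define the score differently.

```python
# Payoffs to the row player; the column player receives the negative.
# Rows and columns are ordered rock, paper, scissors.
RPS = [
    [0, -1, 1],
    [1, 0, -1],
    [-1, 1, 0],
]

def row_best_response_value(payoff, col_strategy):
    """Best payoff the row player can get against a known column strategy."""
    return max(
        sum(a * q for a, q in zip(row, col_strategy))
        for row in payoff
    )

def col_best_response_value(payoff, row_strategy):
    """Best payoff the column player can get against a known row strategy."""
    n_cols = len(payoff[0])
    return max(
        -sum(p * payoff[i][j] for i, p in enumerate(row_strategy))
        for j in range(n_cols)
    )

def exploitability(payoff, row_strategy, col_strategy):
    """Average gain available to a worst-case opponent of each player.

    In a zero-sum game this is zero exactly when both strategies are optimal.
    """
    return 0.5 * (
        row_best_response_value(payoff, col_strategy)
        + col_best_response_value(payoff, row_strategy)
    )

uniform = [1 / 3, 1 / 3, 1 / 3]
rock_heavy = [0.5, 0.25, 0.25]

optimal_score = exploitability(RPS, uniform, uniform)
biased_score = exploitability(RPS, rock_heavy, uniform)
```

Playing each move one-third of the time gives a score of zero, while the rock-heavy strategy scores 0.125 because an opponent who knows the bias can answer with paper every time.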
The experiments conducted by the team involved playing five games: two versions of Phantom Tic-Tac-Toe, in which players cannot see what their opponent has done, as well as two imperfect-information versions of a board game called Hex, and another game of deception called Liar’s Dice.
The biggest challenge the researchers faced was getting the exploitability measurement to work on games of this size, which can contain more than 30 billion states. A “state” in this case includes not only the current board position, but also the entire history of the game, including every move and misstep along the way.
“It’s like looking into a dark room that’s full of objects you can’t see,” says Mohammadpour. “Even so, you need to figure out where these objects are and how they got there.” Previous researchers have typically computed exploitability for games that are 100,000 times smaller than the games analyzed in this study, Mohammadpour says.
In experiments on these five games, neural networks trained with policy gradient algorithms achieved better (lower) exploitability scores than networks trained with game theory-based algorithms. In the head-to-head matches that followed, the policy gradient-trained networks again defeated their game theory-trained opponents. “Those results were reassuring, because they give us more confidence in our benchmarking approach,” Rudolph says.
The team has made its benchmarking software freely available and easy to use. “You don’t need a supercomputer,” says Mohammadpour. “You can run it on an ordinary laptop. And all you have to do is add one line of code to a commonly used collection of benchmarking software called OpenSpiel.”
Although the experiments involved some fairly obscure games, Farina would like to place the work in a broader context. “Keep in mind that the term ‘game’ really applies to any multi-agent strategic interaction,” he says. “So the lessons we learn from this research are by no means limited to recreational sports.”
Vinitsky agrees. “Hidden information is a very valuable asset in the world,” he says. “It permeates many things – including military operations, business scenarios, and negotiations – all of which are conducted under conditions of hidden information. The idea that we can improve in these games suggests that we can do better in these other settings too.”
Ian Gemp – a computer scientist and game theory expert at Google DeepMind who was not involved in the study – finds these results encouraging. “This work serves as a compelling reminder,” he says, “that modernizing classical tools [like policy gradient methods] remains a highly productive path to solving complex strategic problems.”