Learning how to predict rare kinds of failures


On December 21, 2022, just as the peak holiday travel season was getting underway, Southwest Airlines went through a cascading series of failures in its scheduling, initially triggered by severe winter weather in the Denver area. But the problems spread through its network, and over the next 10 days the crisis stranded over 2 million passengers and cost the airline $750 million.

How did a localized weather system end up triggering such a widespread failure? Researchers at MIT have examined this widely reported failure as an example of cases where systems that work smoothly most of the time suddenly break down and cause a domino effect of failures. They have now developed a computational system that uses sparse data about a rare failure event, in combination with much more extensive data on normal operations, to work backwards and try to pinpoint the root causes of the failure, and hopefully to find ways to adjust the systems to prevent such failures in the future.

The findings were presented at the International Conference on Learning Representations (ICLR), held in Singapore, by MIT doctoral student Charles Dawson, professor of aeronautics and astronautics Chuchu Fan, and colleagues from Harvard University and the University of Michigan.

“The motivation behind this work is that it is really frustrating when we have to interact with these complicated systems, where it is really hard to understand what is going on behind the scenes that is creating these issues or failures that we are seeing,” says Dawson.

The new work builds on previous research from Fan’s lab, where they looked at hypothetical failure prediction problems, she says, such as groups of robots working together on a task, or complex systems such as the power grid, looking for ways to predict how such systems may fail. “The goal of this project,” Fan says, “was really to turn that into a diagnostic tool that we could use on real-world systems.”

The idea was to provide a way that someone could “give us data from a time when this real-world system had an issue or a failure,” Dawson says, “and we can try to diagnose the root causes, and provide a little bit of a look behind the curtain at this complexity.”

The intent is for the methods they developed “to work for a pretty general class of cyber-physical problems,” he says. These are problems in which “you have an automated decision-making component interacting with the messiness of the real world,” he explains. Tools are available for testing software systems that operate on their own, but the complexity arises when that software has to interact with physical entities going about their activities in a real physical setting, whether it is the scheduling of aircraft, the movements of autonomous vehicles, the interactions of a team of robots, or the control of the inputs and outputs on an electric grid. In such systems, what often happens, he says, is that “the software might make a decision that looks OK at first, but then it has all these domino, knock-on effects that make things messier and much more uncertain.”

One key difference, though, is that in systems like teams of robots, unlike the scheduling of airplanes, “we have access to a model in the robotics world,” says Fan, who is a principal investigator in MIT’s Laboratory for Information and Decision Systems (LIDS). “We do have some good understanding of the physics behind the robotics, and we do have ways of creating a model” that represents their activities with reasonable accuracy. But airline scheduling involves processes and systems that are proprietary business information, so the researchers had to find ways to infer what was behind the decisions, using only the relatively sparse publicly available information, which essentially consisted of just the actual arrival and departure times of each plane.

“We have grabbed all this flight data, but there is this entire scheduling system behind it, and we don’t know how the system is working,” Fan says. And the amount of data relating to the actual failure is just several days’ worth, compared to years of data on normal flight operations.

The impact of the weather events in Denver during the week of Southwest’s scheduling crisis clearly showed up in the flight data, just from the longer-than-normal turnaround times between landing and takeoff at the Denver airport. But the way that impact cascaded through the system was less obvious and required more analysis. The key turned out to involve the concept of reserve aircraft.

Airlines typically keep some planes in reserve at various airports, so that if problems are found with one plane that is scheduled for a flight, another plane can be quickly substituted. Southwest uses only a single type of plane, so they are all interchangeable, making such substitutions easier. But most airlines operate on a hub-and-spoke system, with a few designated hub airports where most of those reserve aircraft may be kept, whereas Southwest does not use hubs, so its reserve planes are more scattered throughout its network. And the way those planes were deployed turned out to play a major role in the unfolding crisis.

“The challenge is that there’s no public data available in terms of where the aircraft are stationed throughout the Southwest network,” Dawson says. “What we’re able to find using our method is, by looking at the public data on arrivals, departures, and delays, we can back out what the hidden parameters of those aircraft reserves could have been, to explain the observations that we were seeing.”

What they found was that the way the reserves were deployed was a “leading indicator” of the problems that cascaded into a nationwide crisis. Some parts of the network that were directly affected by the weather were able to recover quickly and get back on schedule. “But when we looked at other areas in the network, we saw that these reserves were just not available, and things just kept getting worse.”

For example, the data showed that Denver’s reserves were rapidly dwindling because of the weather delays, but then “it also allowed us to trace this failure from Denver to Las Vegas,” he says. While there was no severe weather there, “our method was still showing us a steady decline in the number of aircraft that were able to serve flights out of Las Vegas.”

He says that “what we found was that there were these circulations of aircraft within the Southwest network, where an aircraft might start the day in California and then fly to Denver, and then end the day in Las Vegas.” What happened in the case of this storm was that the cycle got interrupted. As a result, “this one storm in Denver breaks the cycle, and suddenly the reserves in Las Vegas, which is not affected by the weather, start to deteriorate.”
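The knock-on effect Dawson describes can be illustrated with a deliberately tiny simulation. This is only a sketch under invented assumptions, not a model of Southwest’s real network: the function `simulate_reserves`, the fleet sizes, and the daily flight counts are all hypothetical. It tracks the reserve pool at a downstream airport (standing in for Las Vegas) whose inbound aircraft must first pass through a storm-hit airport (standing in for Denver).

```python
def simulate_reserves(days, storm_days, initial_reserve=5,
                      daily_arrivals=3, daily_departures=3):
    """Track a downstream airport's reserve pool when an upstream storm blocks arrivals.

    All numbers are hypothetical; this is a toy, not the researchers' model.
    """
    reserve = initial_reserve
    history = []      # reserve level at the end of each day
    cancelled = 0     # departures that had no aircraft available
    for day in range(days):
        # Inbound aircraft route through the storm-hit airport, so none arrive on storm days.
        arrivals = 0 if day in storm_days else daily_arrivals
        available = reserve + arrivals
        flown = min(daily_departures, available)
        cancelled += daily_departures - flown
        reserve = available - flown
        history.append(reserve)
    return history, cancelled


# A three-day storm upstream (days 1 to 3) drains the downstream reserve pool.
history, cancelled = simulate_reserves(days=6, storm_days={1, 2, 3})
print(history, cancelled)
```

In this toy setup the downstream reserve falls to zero during the storm and stays there once normal arrivals resume, because every arriving aircraft is immediately needed for a scheduled departure. Nothing in the routine schedule rebuilds the buffer.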

In the end, Southwest was forced to take a drastic measure to resolve the problem: It had to do a “hard reset” of its entire system, canceling all flights and flying empty aircraft around the country to rebalance its reserves.

Working with experts in air transportation systems, the researchers developed a model of how the scheduling system is supposed to work. Then, “what our method does is, we’re essentially trying to run the model backwards.” Looking at the observed outcomes, the model allows them to work back to see what kinds of initial conditions could have produced those outcomes.

While the data on the actual failures were sparse, the extensive data on typical operations helped teach the computational model “what is feasible, what is possible, what’s the realm of physical possibility here,” Dawson says. “That gives us the domain knowledge to then say, in this extreme event, given the space of what’s possible, what’s the most likely explanation” for the failure.
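One simple way to picture “running the model backwards” is Bayesian inference over a hidden parameter. The sketch below is not the researchers’ method. It swaps in a plain grid search with Bayes’ rule, and every name and number in it is an assumption made for illustration: `forward_model` is a made-up rule for how a reserve pool absorbs disruptions, the `prior` stands in for what years of normal-operations data might suggest is plausible, and the `observations` are invented.

```python
import math


def forward_model(reserve, disrupted_flights):
    """Hypothetical forward model: delays are the disruptions the reserve pool cannot absorb."""
    return max(0, disrupted_flights - reserve)


def log_poisson(k, rate):
    """Log-probability of observing k events under a Poisson distribution with this rate."""
    return k * math.log(rate) - rate - math.lgamma(k + 1)


def posterior_over_reserves(observations, prior, noise_floor=0.1):
    """Posterior over the hidden reserve count given (disrupted, observed_delays) pairs."""
    log_post = {}
    for reserve, prior_prob in prior.items():
        total = math.log(prior_prob)
        for disrupted, delays in observations:
            # A small noise floor keeps the Poisson rate strictly positive.
            rate = forward_model(reserve, disrupted) + noise_floor
            total += log_poisson(delays, rate)
        log_post[reserve] = total
    # Normalize in a numerically stable way.
    peak = max(log_post.values())
    weights = {r: math.exp(lp - peak) for r, lp in log_post.items()}
    norm = sum(weights.values())
    return {r: w / norm for r, w in weights.items()}


# Assumed prior over reserve counts 0..5 (a stand-in for knowledge from normal operations).
prior = {0: 0.05, 1: 0.15, 2: 0.30, 3: 0.30, 4: 0.15, 5: 0.05}
# Invented failure-period data: (flights disrupted, delays actually observed).
observations = [(6, 4), (8, 6), (5, 3)]

posterior = posterior_over_reserves(observations, prior)
best = max(posterior, key=posterior.get)
print(best, round(posterior[best], 3))
```

The prior plays the role Dawson describes for normal-operations data: it restricts the search to plausible values, so a handful of failure-period observations is enough to single out the most likely hidden reserve level.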

This could lead to a real-time monitoring system, he says, where data on normal operations are constantly compared to the current data to determine what the trend looks like. “Are we trending toward normal, or are we trending toward extreme events?” Seeing signs of impending issues could allow for preemptive measures, such as redeploying reserve aircraft in advance to areas of anticipated problems.
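A monitoring check of this kind could, in its simplest form, compare a rolling average of current measurements against a baseline from normal operations. The following is a minimal sketch under stated assumptions, not a description of any system the researchers have built: the function names, the turnaround times in minutes, the window size, and the threshold of 2 standard deviations are all hypothetical choices.

```python
from statistics import mean, stdev


def drift_scores(baseline, current, window=3):
    """Z-score of each rolling-window mean of `current` against the baseline data."""
    mu, sigma = mean(baseline), stdev(baseline)
    scores = []
    for end in range(window, len(current) + 1):
        window_mean = mean(current[end - window:end])
        scores.append((window_mean - mu) / sigma)
    return scores


def trending_toward_extreme(scores, threshold=2.0):
    """Flag when the latest score is above the threshold and higher than the one before it."""
    if len(scores) < 2:
        return False
    return scores[-1] > threshold and scores[-1] > scores[-2]


# Invented turnaround times in minutes: a normal baseline and two candidate "today" series.
baseline = [40, 45, 42, 38, 44, 41, 43, 39]
stormy = [42, 44, 47, 52, 58, 65]
calm = [41, 43, 40, 42]

print(trending_toward_extreme(drift_scores(baseline, stormy)))
print(trending_toward_extreme(drift_scores(baseline, calm)))
```

A fixed threshold on a rolling z-score is only one possible design. It is easy to reason about, but it treats each airport in isolation, whereas the cascade described in the article depends on how airports are linked.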

Work on developing such systems is ongoing in her lab, Fan says. In the meantime, they have produced an open-source tool for analyzing failure systems, called CalNF, which is available for anyone to use. Meanwhile, Dawson, who earned his doctorate last year, is working as a postdoc to apply the methods developed in this work to understanding failures in power networks.

The research team also included Max Li from the University of Michigan and Van Tran from Harvard University. The work was supported by NASA, the Air Force Office of Scientific Research, and the MIT-DSTA program.
