British Airways blames ‘power surge’ for IT meltdown & travel chaos 

Posted: May 29, 2017 by oldbrew in Accountability, Incompetence, Travel

Airport chaos for BA passengers

BA moves in mysterious ways, its blunders to perform – but with dire results for a lot of unlucky customers.

BA chief executive Alex Cruz says the airline’s flight disruptions had nothing to do with cutting costs but were caused by a power surge that “only lasted a few minutes”, reports BBC News via the GWPF.

BA chief executive Alex Cruz says he will not resign and that flight disruption had nothing to do with cutting costs.

He told the BBC a power surge had “only lasted a few minutes”, but the back-up system had not worked properly.

He said the IT failure was not due to technical staff being outsourced from the UK to India.

Mr Cruz said “I am profusely sorry” to the 75,000 passengers affected across 170 airports in 70 countries. [Talkshop note: not as sorry as they were.]

He said two thirds of passengers will have reached their destination by the end of the day. There was no evidence of a cyber attack, he added.

“There was a power surge and there was a back-up system, which did not work at that particular point in time. It was restored after a few hours in terms of some hardware changes… we will make sure that it doesn’t happen again,” Mr Cruz said in his first interview.

Source: British Airways Blames ‘Power Surge’, Denies Cyber Attack, For IT Meltdown & Travel Chaos | The Global Warming Policy Forum (GWPF)

  1. oldbrew says:

    PC World comments:
    ‘Between ticket refunds and compensation payments British Airways, like Delta before it, will be hundreds of millions out of pocket as a result of the failure of its backup systems. But even businesses that aren’t trying to juggle hundreds of planes in the air would do well to test whether backup, failover and business continuity plans work before disaster strikes.’
    – – –
    “we will make sure that it doesn’t happen again,” Mr Cruz said in his first interview.

    Unless it happens again before they figure out exactly what went wrong and why (and fix it), as happened to Delta Airlines (Aug. 2016 and Jan. 2017)…

  2. Clive Best says:

    20 years ago we were using UPS power supplies on all critical computers to filter out power fluctuations. I cannot believe that BA mission critical systems went down because of a ‘power surge’.

  3. Clive Best says:

    Perhaps one of the 700 IT staff made redundant by Alex Cruz simply switched off the UPS on his last day in the office and who can blame him. The moral is that you simply can’t ‘outsource’ 24 hour operational experience to some call centre in Bangalore.

    [reply] not outsourced according to The Register in an update

  4. ren says:

    The storm was triggered by a strong jump in solar wind density.

  5. ren says:

    Bogoslof volcano, Alaska, USA (Philippa)
    An explosive eruption at Bogoslof volcano, Alaska, was confirmed by the US Geological Survey (USGS) to have occurred at 22:16 UTC last night. The ash plume, which reached altitudes of over 35,000 ft., was reported by both the Volcanic Ash Advisory Center (VAAC) Anchorage and the National Weather Service Oceanic Prediction Center (NWSOPC).

  6. ren says:

    Solar wind density.

  7. ivan says:

    Clive, the vast majority of the 400+ comments on tend to agree with you.

    All sys admins know that when the bean counters get involved, requirements go to pot and bad things happen. In this case it has cost them not only money but reputation, and it is the latter that is the most damning: they may not get it back.

  8. oldbrew says:

    Was the backup system knocked out by the same ‘power surge’ as the main system?

    Whatever, it will cost BA a fortune one way or another.

  9. tom0mason says:

    BA leads the way!
    As the BA failure shows, when everything relies on a single computerized control system, chaos arrives when the weakest link in the chain fails.

    So as we launch ourselves into the shiny new ‘smart’ systems of tomorrow, think-on.

    The future is ‘smart’….
    ……………………The future is ……..BSOD?

  10. Greg G says:

    ren: I see your point about the volcano up in the Aleutian Islands.
    During approximately the same time period of high Kp Index from coronal hole winds, the British Airways power supply surge occurred and the Bogoslof volcano blew. I checked it out and the two locations are on opposite sides of the North Pole at about the same latitude, so the high Kp could have triggered both events.

    Coincidence? Perhaps, but I think not. A month ago, there was another high Kp period from coronal hole streams and the Kambalny Volcano in Russia blew …after 650 years of dormancy.
    It is located in the Kamchatka Peninsula very near the Bogoslof volcano.

    I don’t think anyone fully understands the relationship between solar activity and earthquakes and volcanoes, but here are some clues from the following site:

    “In 2008 researchers at NASA announced that a close link between electrical disturbances on the edge of our atmosphere and impending quakes on the ground below had been found. This finding is in agreement with similar studies carried out by other space research institutes. Satellites have picked up disturbances 100-600 km above areas that have later been hit by earthquakes. Fluctuation in the density of electrons and other electrically-charged particles in the ionosphere has been observed, and huge signal has been detected many times before large magnitude earthquakes strike.

    Jann-Yeng Liu, from the center for space and remote sensing research in Chung-Li, Taiwan added support to the link between earthquake and disturbances in the ionosphere. He examined over 100 earthquakes with magnitude 5.0 and larger in and around Taiwan over several decades. Based on the data analysis, almost all the earthquakes down to a depth of about 35 km were preceded by distinct electrical disturbances in the ionosphere.

    Minoru Freund from NASA Ames Research Center developed a theory based on the above observations. “It boils down to the idea that when rocks are compressed, as when tectonic plates shift, they act like batteries, producing electric currents”. See the paper “Air ionization at rock surfaces and pre-earthquake signals” (Freund et al., 2009).

  11. Richard111 says:

    I worked for CSC back in the 80’s and power surge protection kit was mandatory.
    Another company I worked for had battery back-up UPS systems that could keep power stable for eight hours, allowing plenty of time for an engineer to arrive and fix/restart the diesel generator until mains power was stable. If a total shutdown ever occurred, even for just one minute, that meant automatic job loss for the engineer.

  12. Richard111 says:

    Trying to find BA share price and all I get is blanks! Has that shut down?

  13. Aren’t both airports heavy on renewable power now, with solar farms? Switching in backup diesel generation could be destabilising to live systems!

  14. oldbrew says:

    The Register updated its report:

    Updated on Monday to add: The CEO has since confirmed the data centre was based in the UK, telling the Graun: “I can confirm that all the parties involved around this particular event have not been involved in any type of outsourcing in any foreign country. They have all been local issues around a local data centre who has been managed and fixed by local resources,” he said. [bold added]

  15. ivan says:

    There is a new ‘excuse’ from BA up on the Reg. It still doesn’t hold water but does show a man scrambling to move the blame from bean counters and execs.

    More an example of how not to run data centers.

  16. ren says:

    Greg G look at the electrons.

  17. It just looks like another canary in the coal mine–a warning of future, more general, woes.

  18. oldbrew says:

    If it really was caused by one or more ‘power surges’, it’s not primarily an ‘IT failure’ at all.

    If the IT only failed because the power supply malfunctioned, that would have killed off any devices connected, whether IT-related or not (e.g. air-con, lights, non-IT office machinery etc. are all possibilities).

  19. oldbrew says:

    Confirmed: back-up systems were on the same power supply as the main system.

    BA Chief Executive Alex Cruz said the root of the problem, which also affected passengers trying to fly into Britain, had been a power surge on Saturday morning which hit BA’s flight, baggage and communication systems. It was so strong it also rendered the back-up systems ineffective, he said. [bold added]

    Spot the problem :/

  20. oldmanK says:

    This subject brings back old memories. Of surges that cause some destruction, as seen from inside the power plant. But what is meant by a ‘Power Surge’? Is it a voltage surge, a demand surge, a sudden large demand drop???

  21. oldbrew says:

    It was quite hot for two days before the fateful power surge. Maybe it was triggered by an air-con system suddenly being switched on?

    The Telegraph went over the top…

    UK weather: Britain to roast on hottest May day in 176 years – but thunder is on the way

    By Telegraph Reporters
    27 MAY 2017 • 9:03AM

    The failure was at 9:30AM on the 27th.

  22. Greg G says:

    A similar solar storm took place in the U.S. about a month ago on April 20th. The media did not mention the correlation, however there were simultaneous power surges and burned out transformers in San Francisco, New York and Los Angeles. The planetary K-index was >6 during that time period, about the same as late last week.

    If the British Airways power surge was caused by a stream of charged particles coming from a coronal hole on the Sun and bombarding our ionosphere and magnetosphere, then here’s how the damage to the data center could have occurred:
    “Electric power systems become exposed to geomagnetically induced currents (GICs) through the grounded neutrals of wye-connected transformers at the ends of long transmission lines. The low frequency of the GICs saturates the transmission transformer’s steel core…When transmission transformers are exposed to a GIC component, they are likely to overheat, even if the low frequency portion is only a small, almost insignificant portion of normal line current. When a transformer saturates, it becomes a source of harmonics. It also increases the inductive VARs (Volt-Amperes Reactive) power drawn by the transformer, and there is a high likelihood of stray leakage flux, eddy current losses, and excessive localized heating.”

    That’s another way of saying the transformer heats up until catastrophic failure.

  23. oldmanK says:

    Greg G says: May 31, 2017 at 3:33 am: GG has brought up or better dissected a possible failure mode which is quite likely. Another possibility is an induced voltage spike. Transformers and switchgear operating at high voltages suffer degradation with age making an internal short-circuit likely with a spike. The results are very ugly.

    “Power surge” is a misnomer. You cannot force feed power to equipment in good condition at proper supply voltage and power factor (they are not geese). High voltage, or Greg’s secondary LF bias, will cause problems. However, that will not be limited to one single customer; it will affect nearby customers too.

    Unless it’s the local Tx. In that case it is what is downstream that is affected.

  24. ‘The airline provided a few more details of the incident in its latest statement on Wednesday. While there was a power failure at a data center near London’s Heathrow airport, the damage was caused by an overwhelming surge once the electricity was restored, it said.

    “There was a total loss of power at the data center. The power then returned in an uncontrolled way causing physical damage to the IT servers,” BA said in a statement.

    “It was not an IT issue, it was a power issue.”

    Investigations were continuing into the cause of the power surge, it added.’

  25. Greg G says:

    “Bringing everyone back on line is not always as easy as it sounds, even if there were no equipment damage (in the initial GIC event). Typical in-rush currents for start-up are 600 percent the normal loads. In addition, blackouts are likely to cause transient voltage spikes that stress and weaken the system components, such as circuit breakers, transformers, and generators.”
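
The 600 percent in-rush figure quoted above is why equipment is usually brought back in stages rather than all at once. A rough sketch of the arithmetic, assuming a single supply limit in amps and identical units (the function names, the 250 A limit and the 10 A per unit are all made-up illustration values, not anything BA has published):

```python
def max_simultaneous_starts(limit_amps, steady_amps_per_unit,
                            inrush_multiplier=6.0, already_running=0):
    """How many more units can start together without exceeding the limit.

    Units already running are assumed to draw their steady current; each
    unit that is starting is assumed to draw inrush_multiplier times that.
    """
    headroom = limit_amps - already_running * steady_amps_per_unit
    if headroom <= 0:
        return 0
    return int(headroom // (steady_amps_per_unit * inrush_multiplier))


def staged_start_plan(total_units, limit_amps, steady_amps_per_unit,
                      inrush_multiplier=6.0):
    """Return a list of batch sizes for bringing all units online in stages."""
    plan = []
    running = 0
    while running < total_units:
        batch = max_simultaneous_starts(limit_amps, steady_amps_per_unit,
                                        inrush_multiplier, running)
        batch = min(batch, total_units - running)
        if batch == 0:
            raise ValueError("not enough headroom to start another unit")
        plan.append(batch)
        running += batch
    return plan


# Hypothetical numbers: 20 units at 10 A steady each, 250 A supply limit.
# Steady state needs only 200 A, but starting all 20 at once would ask
# for 20 * 10 * 6 = 1200 A.
plan = staged_start_plan(total_units=20, limit_amps=250,
                         steady_amps_per_unit=10)
print(plan)
```

With these assumed numbers the load fits comfortably once running, yet has to come up in many small batches. Real in-rush lasts a fraction of a second and depends on the equipment, so this is only the shape of the problem, not a restart procedure.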

  26. oldmanK says:

    Greg G: Agreed. Of course one would have thought the times when we (I) held contact breakers ‘ON’ by a broom stick at start-up were long gone, and both generator side and customer were properly engineered to minimise onrush effects. Especially if one had to go on to a standby diesel/gen (with low inertia).

  27. oldbrew says:

    “It was not an IT issue, it was a power issue.”

    Of course.
    – – –
    oldbrew says:
    May 30, 2017 at 12:57 pm

    If it really was caused by one or more ‘power surges’, it’s not primarily an ‘IT failure’ at all.

  28. Power Grab says:

    Greg G: Great catch, regarding the timing of the blackout and the geomagnetic storm!

    The local hour is too late for me to do a brain dump here and now, but I plan to return another time to add in my two cents’ worth.

  29. oldbrew says:

    More head-scratching…

    BBC News: Why did ‘power surge’ hit BA computers?

  30. oldbrew says:

    Report: BA’s £150,000,000 outage was caused by someone turning computers on and off too quickly

    An IT engineer doing maintenance work at a British Airways data centre is to blame for BA’s £150million global meltdown, which left 75,000 people stranded last weekend, it has emerged.

    The engineer allegedly failed to follow proper procedures at the centre in Heathrow, which led to ‘catastrophic physical damage’ to servers across the world.
    . . .
    The engineer involved is reportedly from contractor CBRE Global Workplace Solutions, who are assisting the airline with its investigation.
    . . .
    A source told The Sun the effects could be felt for months, with corrupted data yet to surface.

    They added: ‘It’s very much human error that’s to blame. It’s not over yet.’

  31. Greg G says:

    It’s still possible that a contractor engineer turning power off and on may just be a scapegoat. The investigation is not over yet…

    “… the BBC’s transport correspondent, Richard Westcott, has spoken to IT experts who are sceptical that a power surge could wreak such havoc on the data centres.
    BA has two data centres about a kilometre apart. There are question marks over whether a power surge could hit both. Also, there should be fail-safes in place, our correspondent said.”

    It’s looking less and less likely that the power surge was caused by energy streaming from a coronal hole, however.

  32. oldbrew says:

    ‘turning computers on and off too quickly’

    For what reason during normal UK working hours, whether ‘too quickly’ or not?

  33. Greg G says:

    The technician turned off everything, including the backup generators AND the UPS battery… then turned everything back on again in an uncontrolled manner and fried the servers. Whoops!
    Watch the video.

  34. oldbrew says:

    That techie’s indemnity insurance better be up-to-date 😐
    – – –
    Does British Airways’ explanation stack up?
    By Richard Westcott
    Transport correspondent, BBC News

    I’m told that any contractor would have been escorted at all times by an IT staff member.

    “That was always the rule”. Even if that contractor was from the company managing the facility. They wouldn’t even be allowed into the data centre without a detailed description of the job they were doing.

    It sounds as if only a complete and unsupervised moron could have cut the power by mistake.
    – – –
    The Register: The biggest British Airways IT meltdown WTF: 200 systems in the critical path?
    It’s not the velociraptor you can see that kills you

    For any IT dependent organisation, which in reality is pretty much everything these days, a fundamental question should be: Why does the organisation believe its IT is sufficiently robust to allow it to meet its operational goals? What is the evidence that belief is based on? How has the evidence been validated? Is there a predictive model, not a picture on a slide deck, of why the system as a whole stays up?
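
To give a feel for what the simplest possible ‘predictive model’ might look like, here is a back-of-envelope sketch. It assumes every system in the critical path must be up, that failures are independent, and that each system is available 99.9% of the time. The 200 comes from the Register headline; the 99.9% figure and the function names are purely illustrative assumptions.

```python
def series_availability(availabilities):
    """Availability of a chain in which every component must be up.

    Assumes component failures are independent of one another.
    """
    total = 1.0
    for a in availabilities:
        total *= a
    return total


def parallel_availability(availabilities):
    """Availability of redundant components where any single one suffices.

    Also assumes independent failures, which does NOT hold if main and
    back-up share something, such as the same power supply.
    """
    all_down = 1.0
    for a in availabilities:
        all_down *= (1.0 - a)
    return 1.0 - all_down


# 200 systems in the critical path, each assumed 99.9% available.
chain = series_availability([0.999] * 200)
print(f"chain availability: {chain:.3f}")

# A main system plus one truly independent back-up, each assumed 99% available.
pair = parallel_availability([0.99, 0.99])
print(f"redundant pair availability: {pair:.4f}")
```

Under these assumptions the chain as a whole is up only about 82% of the time, even though each link looks very reliable on its own. The redundancy calculation only helps when the back-up fails independently of the main system, which is exactly what is in doubt if both sit on the same power supply.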