


2024-06-21 21:14




Staffing levels: are data centers at risk of unnecessary outages?

With increasing data center automation, it’s only natural for clients to want assurance that their data will be available as close to 100 percent of the time as possible, and to ask whether enough data center staff are available to achieve a high level of uptime. They also want to know that when a potential outage occurs, there are enough technicians on duty or available to restore services as soon as possible.

Microsoft suffered an outage on 30th August 2023 in its Australia East region in Sydney, lasting 46 hours. 

Customers experienced issues accessing or using Azure, Microsoft 365, and Power Platform services. The outage was triggered by a utility power sag at 08:41 UTC and affected one of the region's three Availability Zones.

Microsoft explains: “This power sag tripped a subset of the cooling system chiller units offline and, while working to restore cooling, temperatures in the data center increased to levels above operational thresholds. We powered down a small subset of selected compute and storage scale units, both to lower temperatures and to prevent damage to hardware.”

Despite this, the vast majority of services had recovered by 22:40 UTC, but full mitigation was not completed until 20:00 UTC on 3rd September 2023. Microsoft says this was because some services experienced a prolonged impact, “predominantly as a result of dependencies on recovering subsets of Storage, SQL Database, and/or Cosmos DB services.”


Voltage sag cause

According to the company, the utility voltage sag was caused by a lightning strike on electrical infrastructure situated 18 miles from the impacted Availability Zone of the Australia East region. It adds: “The voltage sag caused cooling system chillers for multiple data centers to shut down. While some chillers automatically restarted, 13 failed to restart and required manual intervention. To do so, the onsite team accessed the data center rooftop facilities, where the chillers are located, and proceeded to sequentially restart chillers moving from one data center to the next.”


What was the impact?

“By the time the team reached the final five chillers requiring a manual restart, the water inside the pump system for these chillers (chilled water loop) had reached temperatures that were too high to allow them to be restarted. In this scenario, the restart is inhibited by a self-protection mechanism that acts to prevent damage to the chiller that would occur by processing water at the elevated temperatures. The five chillers that could not be restarted supported cooling for the two adjacent data halls which were impacted in this incident.”

Microsoft says the two impacted data halls require at least four chillers to be operational. Before the voltage sag, their cooling capacity consisted of seven chillers, with five in operation and two on standby. The company says that some networking, compute, and storage infrastructure began to shut down automatically as data hall temperatures increased, and this temperature increase impacted service availability. At 11:34 UTC, the onsite data center team began a remote shutdown of the remaining networking, compute, and storage infrastructure to protect data durability and infrastructure health, and to address the thermal runaway.
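
The arithmetic behind that shortfall can be sketched in a few lines of Python. This is only an illustration under a stated assumption: that the five chillers that could not be restarted were among the seven serving the two data halls, which the article implies but does not spell out. The function name and parameters are hypothetical and are not taken from Microsoft's report.

```python
def cooling_shortfall(total_chillers: int, unavailable: int, required: int) -> int:
    """Return how many chillers short of the required minimum the halls are.

    Returns 0 when enough chillers remain available.
    """
    available = total_chillers - unavailable
    return max(0, required - available)


# Figures from the article: seven chillers installed, at least four required.
# Assumption: the five chillers that could not be restarted are among the seven.
shortfall = cooling_shortfall(total_chillers=7, unavailable=5, required=4)
print(f"Chillers short of the minimum: {shortfall}")
```

Under that assumption only two chillers would have remained available, which is below the stated minimum of four.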


Staffing review

Among its many mitigations, Microsoft says it increased technician staffing levels at the data center “to be prepared to execute manual restart procedures of our chillers prior to the change to the Chiller Management System to prevent restart failures.” The night team was temporarily increased from three to seven technicians so that the underlying issues could be properly understood and appropriate mitigations put in place. The company nevertheless believes the staffing levels at “the time would have been sufficient to prevent impact if a ‘load based’ chiller restart sequence had been followed, which we have since implemented.”

It adds: “Data center staffing levels published in the Preliminary PIR only accounted for ‘critical environment’ staff onsite. This did not characterize our total data center staffing levels accurately. To alleviate this misconception, we made a change to the preliminary public PIR posted on the Status History page.”

Yet in a Deep Dive, ‘Azure Incident Retrospective: VVTQ-J98’, Michael Hughes, VP of APAC datacenter operations at Microsoft, responded to comments that more staff had been onsite than the company originally said were present. It was also suggested that the real fix was not necessarily to have more people onsite, but rather a mode-based sequence in the emergency operating procedures (EOPs), which may not change staffing levels.

Hughes explains: “The three that came out in the report just relate to people who are available to reset the chillers. There were people in their operation staff onsite, and there were also people in the operations center. So that information was incorrect, but you’re right.” He asks us to put ourselves in the moment, with 20 chillers posting three sags and all in an erroneous state. Thirteen of them then require a manual restart, which means deploying staff across a very large site.

“You’ve got to run out onto the roof of the building to go and manually reset the chiller, and you’re on the clock,” he adds. With chillers impacted and temperatures rising, staff had to scramble across the site to try to reset the chillers. They did not quite get to the pod in time, leading to the thermal runaway. The answer in terms of optimization is to go first to the highest-load data centers, those with the highest thermal load and the highest number of racks operating, and recover cooling there.

So, the focus should be to recover the chillers with the highest thermal load first. This amounts to a tweak to how Microsoft’s EOP is deployed. It also concerns what the system is supposed to do, which Hughes says should have been taken care of by the software. The auto-restart should have happened, and he argues that no manual intervention should have been needed. This has now been fixed. He believes that “you never want to deploy humans to fix problems if you get software to do it for you.” This led to an update of the chiller management system to stop the incident from occurring again.
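
A load-based restart order of the kind Hughes describes could be sketched as follows. This is a minimal illustration, not Microsoft's procedure or its Chiller Management System logic: the `FaultedChiller` class, its field names, and the temperature threshold are all hypothetical. The sketch assumes each faulted chiller has a known thermal load and chilled water loop temperature, skips chillers whose loop is already too hot to restart (mirroring the self-protection mechanism described earlier), and orders the rest by thermal load, highest first.

```python
from dataclasses import dataclass


@dataclass
class FaultedChiller:
    chiller_id: str
    thermal_load_kw: float  # hypothetical: heat load of the halls this chiller serves
    loop_temp_c: float      # hypothetical: current chilled water loop temperature


def restart_order(chillers, max_restart_temp_c):
    """Return the restartable chillers, highest thermal load first.

    Chillers whose loop temperature exceeds max_restart_temp_c are left out,
    because their self-protection mechanism would inhibit the restart.
    """
    restartable = [c for c in chillers if c.loop_temp_c <= max_restart_temp_c]
    return sorted(restartable, key=lambda c: c.thermal_load_kw, reverse=True)


# Illustrative values only; none of these figures come from the incident report.
faulted = [
    FaultedChiller("A", thermal_load_kw=300.0, loop_temp_c=18.0),
    FaultedChiller("B", thermal_load_kw=900.0, loop_temp_c=20.0),
    FaultedChiller("C", thermal_load_kw=1200.0, loop_temp_c=35.0),
]
order = restart_order(faulted, max_restart_temp_c=25.0)
print([c.chiller_id for c in order])
```

With these made-up inputs, chiller C is excluded because its loop is above the assumed threshold, and B is visited before A because it carries the higher load. In practice the ordering would also need to account for travel time between rooftops, which this sketch ignores.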


Industry issue and risk

Ron Davis, vice president of digital infrastructure operations at the Uptime Institute, adds that it’s important to point out that these issues and the risks associated with them exist beyond the Microsoft event. “I have been involved in this sort of incident, when a power event occurred and redundant equipment failed to rotate in, and the chilled water temperature quickly increased to a level that prohibited any associated chiller(s) from starting,” he comments, before adding: “This happens. And it can potentially happen to any organization. Data center operations are critical. From a facilities standpoint, uptime and availability is a primary mission for data centers, to keep them up and running.”

Then there is the question of why the industry is experiencing a staffing shortage. Davis says the industry is maturing from an equipment, systems, and infrastructure perspective, and even remote monitoring and data center automation are getting better. Yet there is still a heavy reliance on the presence and activities of critical operating technicians, especially during an emergency response such as the one outlined in the Microsoft case.





