Analytical Report Regarding the Microsoft 365 (M365) Outage

Outages are among the top concerns of organizations because they can damage a brand’s
reputation and erode customer trust. They also cost valuable time and money, which is why
they should not be tolerated or disregarded. At present, outages are treated as normal and common
because individuals and organizations depend on online applications and services that play
a crucial part in their business and daily lives. Denial-of-Service (DoS) attacks are one
possible root cause of an outage, but an outage can also originate inside the provider, as in
the case examined here.

Background Information

On September 28, 2020, at 5:21 pm, Microsoft 365 (M365) customers began reporting on
Downdetector.com that they could not access M365. Users were unable to log in; each attempt
produced the error message "AADSTS90033: A transient error has occurred. Please try again."
Within an hour, the site, which monitors and tracks cloud outages, had received more than
18,000 posts documenting the issue. Microsoft told its administrator users that they might be
unable to access multiple M365 services because of an unresolved issue with Azure Active
Directory (Azure AD), and that Outlook, Microsoft Teams, and Teams Live Events might also be
affected. Microsoft blamed a software “code issue” for the outage, which kept M365 services
down for five hours.

“A code issue caused a portion of our infrastructure to experience delays in processing
authentication requests, which prevented users from being able to access multiple M365
services,” said Microsoft in an email update to Microsoft administrators impacted by the outage.

Microsoft proceeded to review the code in order to understand why it had temporarily
stopped processing authentication requests in a timely manner. The outage affected users on
September 28, 2020, from 5:25 pm ET to 10:25 pm ET. According to a senior executive at one of
Microsoft’s top partners, a Microsoft software developer made a code change that took M365 and
Azure down. The change was part of an Azure AD update that mistakenly reached the production
environment and degraded service availability. Microsoft began mitigating the problem by
scaling out some Azure AD services so that they could handle the load once a mitigation was
applied. Unfortunately, Microsoft’s automated rollback failed because the Safe Deployment
Process (SDP) metadata had been corrupted.

"Within minutes of impact, we took steps to revert the change using automated rollback
systems which would normally have limited the duration and severity of impact. However, the
latent defect in our SDP system had corrupted the deployment metadata, and we had to resort to
manual rollback processes. This significantly extended the time to mitigate the issue," Microsoft
explained.

The team was left with no choice but to update the service configuration manually,
bypassing the SDP system. More than two hours later, around 8 p.m. ET, the operation was
complete. Microsoft said that all service instances with residual impact were recovered.
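
The sequence described above, an automated rollback that cannot proceed because its metadata
is corrupted, followed by a manual fallback, can be illustrated with a minimal Python sketch.
This is not Microsoft's SDP implementation; the names DeploymentRecord, automated_rollback,
and roll_back, and the idea of a single "previous version" field, are assumptions made only
for illustration.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


class CorruptMetadataError(Exception):
    """Raised when the deployment record cannot be trusted for a rollback."""


@dataclass
class DeploymentRecord:
    # Hypothetical record; a real deployment system tracks far more than this.
    current_version: str
    previous_version: Optional[str]  # None models corrupted or missing metadata


def automated_rollback(record: DeploymentRecord) -> str:
    """Return the version to restore, or fail if the metadata is unusable."""
    if not record.previous_version:
        raise CorruptMetadataError("previous version is unknown")
    return record.previous_version


def roll_back(record: DeploymentRecord, last_known_good: str) -> Tuple[str, str]:
    """Try the automated path first; fall back to an operator-supplied version."""
    try:
        return automated_rollback(record), "automated"
    except CorruptMetadataError:
        # Manual path: an operator supplies the version from an independent source.
        return last_known_good, "manual"
```

In this sketch the manual path depends on an independently kept "last known good" version,
which is one reason a manual rollback tends to take longer than an automated one.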

Possible Preventive Measures

To provide satisfactory and effective service to legitimate users, Microsoft should
practice transparency and accountability toward its employees, staff, and users. The
senior executive regards the issue as an internal failure caused by a faulty source control
policy. The company should improve its management and supervise its employees’ work
carefully to avoid future outages that arise internally. As for the Azure AD update itself,
the company should test and assess each new update several times to locate and fix bugs
before release, so that what is rolled out is safe and stable. Microsoft should also monitor
and verify every system that operates alongside its applications to ensure that they function
correctly and to safeguard the security of its applications and websites for users. In
addition, Microsoft should continue to notify its users every time a problem occurs. In this
incident, for example, it released a public Azure status update describing the issues and
errors users might encounter.
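
One way to picture the recommendation to test an update before it reaches the public is a
staged rollout that stops at the first unhealthy stage. The Python sketch below is only an
illustration under stated assumptions: the stage names and the deploy and healthy callbacks
are hypothetical and are not taken from Microsoft's actual process.

```python
from typing import Callable, List


def staged_rollout(
    rings: List[str],
    deploy: Callable[[str], None],
    healthy: Callable[[str], bool],
) -> List[str]:
    """Deploy ring by ring and stop as soon as a ring fails its health check.

    Returns the rings that were deployed and verified healthy.
    """
    completed: List[str] = []
    for ring in rings:
        deploy(ring)
        if not healthy(ring):
            # Halt here so later rings (for example production) are never touched.
            break
        completed.append(ring)
    return completed
```

Called with stages such as "test", "canary", and "production", the function never deploys to
"production" if the "canary" health check fails, which keeps a faulty update away from users.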

In serving legitimate users, a company should always monitor, control, and secure the
systems it operates so that its service stays effective and satisfactory. When an outage
occurs, the company is not the only party affected. Its users, particularly regular,
legitimate users, are strongly affected as well, because some of them depend heavily on the
service and it plays a big part in their daily lives.

References:

https://www.crn.com/news/cloud/microsoft-blames-software-code-issue-for-office-365-outage?itc=refresh

https://www.zdnet.com/article/microsofts-azure-ad-authentication-outage-what-went-wrong/

https://www.bleepingcomputer.com/news/microsoft/microsoft-explains-the-cause-of-the-recent-office-365-outage/
