New Text Document

Looked at the logs var logs file
Traffic logs are 2 different logs
All sets of logs
INC0313206:::Database issue with Novus Eagan application server.
SARR for CRCs not being able to monitor these issues.
Resolution:
The issue was resolved on Tuesday 10th October 2017 at 23:34 GMT, after the TSG LM team
manually enabled the parameter on all 113 CPE routers. The cause was an error in
which a few sites from the pool were incorrectly marked for parameter restoration
after the upgrade. Once the TSG LM team had the security parameter enabled, FTP
functionality became very limited (Active vs Passive FTP). Removing the "FTP ALG
DISABLE" parameter from the devices restored FTP protocol functionality.
Note: To date, we have completed 1693 device upgrades to ver 12.3X48-D35.7.
GPIC carried out Datascope Select testing on this version before we proceeded
with the rollout, and no issues were seen.
Timeline in GMT:
7th October
19:30 - Incident Start (TSG LM performed global upgrades for CPE JUNOS devices)
9th October
23:33 - 1st SR reported - 05929027
10th October
02:34 - 2nd SR reported - 05929263
02:56 - 3rd SR reported - 05929314
xx:xx - HD escalated the issue to DSC FLS
12:15 - DSC FLS invoked TRT
xx:xx - DSC second level was engaged
xx:xx - FLS Networks was engaged
xx:xx - GNOC was engaged
xx:xx - TSG LM team was engaged
13:54 - Customer alert was issued
18:09 - TRT stood down
xx:xx - TSG LM team identified the trigger
23:34 - Incident Resolved
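The gaps between the confirmed timestamps above can be worked out with a short Python sketch. It uses only the times that are already stated in the timeline (the xx:xx entries are left out), and the event names and helper functions are illustrative labels rather than terms from any incident tooling.

```python
from datetime import datetime, timezone


def gmt(day, hhmm):
    """Build a GMT timestamp for a given day of October 2017."""
    hour, minute = map(int, hhmm.split(":"))
    return datetime(2017, 10, day, hour, minute, tzinfo=timezone.utc)


# Only the timestamps confirmed in the timeline; names are illustrative.
EVENTS = {
    "incident_start": gmt(7, "19:30"),
    "first_sr": gmt(9, "23:33"),
    "trt_invoked": gmt(10, "12:15"),
    "customer_alert": gmt(10, "13:54"),
    "resolved": gmt(10, "23:34"),
}


def elapsed(start, end):
    """Return the gap between two named events as (hours, minutes)."""
    total_minutes = int((EVENTS[end] - EVENTS[start]).total_seconds() // 60)
    return divmod(total_minutes, 60)


if __name__ == "__main__":
    for start, end in [
        ("incident_start", "first_sr"),
        ("first_sr", "customer_alert"),
        ("incident_start", "resolved"),
    ]:
        hours, minutes = elapsed(start, end)
        print(f"{start} -> {end}: {hours}h {minutes:02d}m")
```

On these figures, the first SR arrived 52h 03m after the incident start, the customer alert followed the first SR by 14h 21m, and the incident lasted 76h 04m in total. These values depend on the timeline being confirmed under Action 2.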
Actions:
1 - Confirm Impact
2 - Confirm Timeline
3 - Confirm Trigger / Root Cause
4 - Understand why there was a delay in issuing the alert.
5 - Understand when HD escalated the issue to the DSC FLS team. If the escalation was
delayed, understand what caused the delay.
6 - Understand why the TSG LM team did not check the CPE routers during the initial
investigation.
7 - Understand whether we can ensure that future changes do not add the config to
devices where it was not present before.
8 - Understand what can be done to proactively remove this config from other
impacted clients that should not have it.
9 - Understand whether a workaround was available. If yes, was this workaround
shared with the affected customer? If not, why not?
10 - Actions taken to prevent recurrence.
A PIR has been scheduled for Thursday 12th October at 07:00 GMT. A new update will be
added before the Thursday cut-off.
On Saturday 7th October 2017, starting at 19:00 GMT, some customers in APAC
experienced service disruptions for DataScope Select in Active FTP mode via
Delivery Direct. Service was restored on Tuesday 10th October 2017 at 23:34 GMT.
Impact: Customers were able to log in, but their connections dropped when they issued
a directory list or 'get' command to download files. Three SRs were reported for
this incident.
The issue was detected through a customer call to the Helpdesk (HD) on Mon 9th Oct
2017 at 23:33 GMT. The customer was able to log in to the DSS FTP server but was
unable to retrieve any data from it. HD confirmed with the customer that the
services had last worked without interruption on Friday 6th October 2017 at
10:00 GMT (local time 19:00). After the 1st SR, two more SRs were raised on
Tuesday 10th October 2017, after which HD escalated the issue to DSC FLS
(Datascope FLS) at xx:xx GMT. DSC FLS invoked a TRT at 12:15 GMT to investigate
the issue further with DSC 2nd Level support. During the TRT, the DSC1 and DSC
second level teams were unable to replicate the issue from the DDN test machines.
While the TRT was ongoing, the FLS-Networks team was also engaged; they
investigated and confirmed that no issue was observed on their side. The
FLS-Networks team suggested that the AT&T team investigate, since the network
location fell within their scope.
During the TRT, it was noticed that a recent change, CHG0110093, had been performed
for the HDC RXN/MNS B-side switch Customer EDGE Migration. The FLS-Networks and FTP
Server teams investigated the change and suspected that the change in ACL gig might
have triggered the issue. Meanwhile, the DSC1 team engaged the TSG LM (Last Mile)
team at xx:xx GMT for further investigation. At 18:09 GMT the TRT stood down, with
participants from the DSS application support, FLS-Networks, GNOC, TSG LM and Global
Outage Management teams. The trigger was still unknown after the TRT participants
had pursued all paths of investigation. It was suspected that the issue might be due
to a change at the client's end. To troubleshoot further, the TRT participants
required source IP, destination IP and traceroute information, plus connectivity
tests with both Active and Passive connections, from the affected clients. Since
only APAC customers were affected, this information would only be available during
Asia hours. Hence, the TRT participants and the Application Service Manager agreed
that the TRT would reconvene at 02:00 GMT on 11-Oct-17.
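The Active versus Passive connectivity test requested from the affected clients could be run with a small Python sketch along the following lines. It is an illustration only, not a tool used during the TRT: the function names are hypothetical, the host and credentials are placeholders to be supplied by the tester, and port 21 is assumed. It relies on the standard library ftplib module, whose set_pasv(False) call switches the client to active mode.

```python
from ftplib import FTP, all_errors


def probe_listing(host, user, password, passive, timeout=15, ftp_class=FTP):
    """Log in and attempt a directory listing in the requested mode.

    Returns (True, line_count) on success or (False, error_text) on failure.
    ftp_class is injectable so the logic can be exercised without a server.
    """
    ftp = ftp_class()
    try:
        ftp.connect(host, 21, timeout)   # port 21 is an assumption
        ftp.login(user, password)
        ftp.set_pasv(passive)            # False selects active mode
        lines = []
        ftp.retrlines("LIST", lines.append)
        return True, len(lines)
    except all_errors as exc:
        return False, str(exc)
    finally:
        try:
            ftp.close()
        except all_errors:
            pass


def compare_modes(host, user, password, **kwargs):
    """Run the same listing in active and passive mode and return both results."""
    return {
        "active": probe_listing(host, user, password, passive=False, **kwargs),
        "passive": probe_listing(host, user, password, passive=True, **kwargs),
    }
```

A result where login succeeds in both modes but only the active listing fails would match the reported symptom, in which customers could log in but lost the connection on a directory list or 'get'.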
At xx:xx GMT, the TSG LM Seniors team investigated the CPE routers for the affected
clients and reported that on Sat 7th Oct at 19:28 GMT, as part of the global
upgrades for CPE JUNOS device (JP57190RPTY6R01), the TSG LM team had implemented a
parameter (FTP ALG DISABLE) that affected outgoing FTP connectivity from the CPE
device. This parameter is checked before the upgrade and set to match afterwards.
For reasons not yet established, some devices were incorrectly marked as having the
parameter enabled before the upgrade, which impacted some DSS customers. TSG LM
Seniors also identified approximately 113 devices that had 'ALG FTP DISABLE'
present despite not having that entry prior to the CPE upgrade on 07-Oct-17.
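A comparison of pre-upgrade and post-upgrade configuration snapshots would surface devices in this state. The Python sketch below is a minimal illustration under stated assumptions: the snapshots are assumed to be available as plain text keyed by device name, the marker string is an assumed rendering of the 'FTP ALG DISABLE' parameter in set-style configuration, and the function names are hypothetical. It does not describe the TSG LM team's actual upgrade tooling.

```python
# Assumed text form of the parameter; adjust to match the real config export.
MARKER = "set security alg ftp disable"


def has_marker(config_text, marker=MARKER):
    """Return True if any line of the config equals the marker (case-insensitive)."""
    wanted = marker.lower()
    return any(line.strip().lower() == wanted for line in config_text.splitlines())


def newly_added(pre, post, marker=MARKER):
    """List devices where the marker is present after the upgrade but not before.

    pre and post map device name -> configuration text. A device missing from
    pre is treated as not having had the marker.
    """
    return sorted(
        name
        for name, text in post.items()
        if has_marker(text, marker) and not has_marker(pre.get(name, ""), marker)
    )


if __name__ == "__main__":
    # Hypothetical device names and configs, for illustration only.
    pre = {
        "cpe-a": "set system host-name cpe-a\n",
        "cpe-b": "set system host-name cpe-b\nset security alg ftp disable\n",
    }
    post = {
        "cpe-a": "set system host-name cpe-a\nset security alg ftp disable\n",
        "cpe-b": "set system host-name cpe-b\nset security alg ftp disable\n",
    }
    print(newly_added(pre, post))
```

Running the same check as a gate after each upgrade batch would also support Action 7, since any device reported by newly_added has gained the entry without having it beforehand.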
