The notes below relate to the case for the QFX5100 devices of customer UMNG.
From:jaraya@juniper.net
Date:2017-09-21 15:56:53
To:jfernandez@qcingenieria.com.co;
Cc:support@juniper.net;
Hello Jorge,
The log messages you indicated do not show any known issue.
From your description, the device could be suffering from high CPU usage or a hardware issue.
Please perform a mastership switch to verify whether this issue is specific to this device while it is
working as the master Routing Engine.
Please also collect the RSI so we can check alarms, CPU usage and other indicators.
I would also like to check the chassis log messages to verify whether there are hardware errors.
At this point, based on the description, the issue could be caused by either a software or a hardware
failure.
If you can afford an upgrade or a reinstallation of the Junos software, that will be the next step once we
have checked the requested information. If the issue continues after that and the same device keeps
showing the same behavior, we will proceed with a replacement.
Best Regards,
ISSUE DESCRIPTION:
CURRENT STATUS:
09/21/2017 - jaraya - Sent initial contact and requested additional information; also asked the customer
to perform a mastership change.
The device getting disconnected is the master Routing Engine. We could be facing a high CPU situation;
nevertheless, the logs do not reflect any issue.
09/21/2017 - jaraya - Customer responded that the device presents a core-dump for the dot1x process.
09/23/2017 - jaraya - Customer updated the case during the weekend; unable to respond.
09/24/2017 - jaraya - Customer kept updating the case during the weekend; unable to respond.
During your call, you wanted to run a health check on the device and verify the LACP configuration on the
LAG interfaces.
We found no issue with connectivity or with the configuration; everything was working fine, and the device
showed it had been up for more than 10 hours at the moment you called.
You showed the server configuration and we noticed that one interface had MTU 9000 while the rest had 1500.
On the switch side the MTU was the default (1500). You were going to verify whether jumbo frames are needed
on the switch side.
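A minimal sketch of the MTU comparison described above, assuming both sides report MTU in the same units (the IP payload size). The interface names are hypothetical; only the values 9000 and 1500 come from the call.

```python
def mtu_mismatches(switch_mtu, server_mtus):
    """Return the server interfaces whose MTU is larger than the switch MTU."""
    return {name: mtu for name, mtu in server_mtus.items() if mtu > switch_mtu}


if __name__ == "__main__":
    # Hypothetical interface names; MTU values as observed during the call.
    server = {"nic0": 9000, "nic1": 1500, "nic2": 1500}
    for name, mtu in mtu_mismatches(1500, server).items():
        print(f"{name}: server MTU {mtu} exceeds switch MTU 1500")
```

Any interface it reports would send frames larger than the switch accepts unless jumbo frames are enabled on the switch side as well.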
You asked about the LACP hold-up timer, and I checked that it is supported on the platform; this was an
option you wanted to consider.
https://www.juniper.net/documentation/en_US/junos/topics/task/configuration/lacp-hold-up-timer-qfx-series-cli.html
www.juniper.net
Configuring LACP Hold-UP Timer to Prevent Link Flapping on LAG Interfaces. On link aggregation group (LAG)
interfaces, when a member (child) link goes down, its state ...
https://www.juniper.net/documentation/en_US/junos/topics/reference/configuration-statement/hold-time-edit-
09/25/2017 - jaraya - Sent follow-up to the customer requesting the status of the case.
09/25/2017 - jaraya - Customer hit the etc again after shift hours requesting immediate assistance;
sharmav assisted the customer.
Summary:
- We shut down member 0 and member 1 took over the mastership.
- The issue was not seen the first time and we restored all the services.
- We powered off the same chassis again and the issue was reproduced.
- Servers 4.9 and 4.11 were not reachable; however, another server in the same subnet was reachable.
- There was no ARP entry or MAC address learnt for that server on the QFX/EX8200.
- These servers were directly connected over LACP trunk links.
- We powered on the second switch and MAC addresses started populating in the Ethernet switching table.
- As the maintenance window was over, you will continue troubleshooting this issue tomorrow.
Please attach the session logs collected today. Please also let us know the time of the next maintenance
window and a contact number.
09/26/2017 - jaraya - Core-dump analyzed; it matches a PR that describes the issue and matches the platform
and Junos version. Advised the customer to proceed with a Junos upgrade and to provide the time for the
maintenance window.
On Virtual Chassis with GRES enabled, if an IPv6 Neighbor discovery nexthop is out of sync between REs,
and this next hop is re-allocated on the master RE again, the kernel on the backup RE may crash, which
may cause a temporary traffic loss.
09/26/2017 - jaraya - Confirmed with the customer that the PR also affects IPv4.
09/26/2017 - jaraya - Handover sent to Chennai since the customer advised of a maintenance window tonight;
nevertheless, the customer never responded.
09/27/2017 - jaraya - Customer updated the case indicating they will have a maintenance window today at
5:00pm Colombia time.
The backtrace indicated that this core-dump was generated on Junos version 14.1X53-D27.3,
but no PRs were found matching the platform, the Junos version or the issue.
I did not find any other reference to this core-dump (dot1xd.core.0.gz) being generated after an
upgrade to the recommended Junos version.
Nevertheless, it was the other core-dump, vmcore.0.gz, that led us to the PR matching the issue description.
We highly recommend proceeding with the Junos upgrade to the recommended version, 14.1X53-D45.
TECHNICAL IMPACT:
NEXT ACTION:
#JNPR-T-STATUS#
From:sanantha@juniper.net
Date:2017-09-28 01:32:47
To:jfernandez@qcingenieria.com.co;
Cc:support@juniper.net;
Hi Chirs,
During our call, I understood that you needed assistance in checking the failover of the switches, and we
did an upgrade on the device to 14.
After that, we found that a few interfaces were in LACP detached state and the child interfaces were not up.
I found that all the child interfaces are connected to the patch panel of the switch. Once we reseated the
interfaces on the patch panel, the detached interfaces moved to attached state.
We did a failover, and when member 1 was rebooted you lost connectivity to all servers for 10 seconds,
after which it recovered normally.
I was not able to find any kind of traffic on the child interfaces or the ae interfaces. However, you
informed us that server reachability recovered automatically. It was also noticed that when we powered on
member 1, connectivity to all the servers was lost.
Hence we connected the ae1 child interfaces directly to the server, bypassing the patch panel. We then
halted member 1 and checked the connectivity; when this was done, we found that ae1 was in detached state
for 10 seconds and then recovered automatically.
We found that the QFX5100 was in active fast mode while the partner was in slow mode. We tried reconfiguring
the interface, and I wanted you to check and test it with active fast on the server end. However, you
informed us that you will check with the server team whether that is possible and let us know tomorrow.
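For reference, the fast/slow difference noted above can be expressed with the timer values defined in IEEE 802.1AX. The sketch below is illustrative only: the class and function names are hypothetical, and it models the standard's timers rather than anything specific to Junos or to the server's bonding driver.

```python
from dataclasses import dataclass

# Timer values from IEEE 802.1AX, in seconds.
FAST_PERIODIC_TIME = 1
SLOW_PERIODIC_TIME = 30
SHORT_TIMEOUT_TIME = 3
LONG_TIMEOUT_TIME = 90


@dataclass(frozen=True)
class LacpEnd:
    name: str   # hypothetical label for one end of the link
    fast: bool  # True = short timeout ("fast"), False = long timeout ("slow")

    @property
    def expects_pdu_every(self) -> int:
        """Interval at which this end asks its partner to transmit LACPDUs."""
        return FAST_PERIODIC_TIME if self.fast else SLOW_PERIODIC_TIME

    @property
    def rx_timeout(self) -> int:
        """Seconds without a received LACPDU before partner info expires."""
        return SHORT_TIMEOUT_TIME if self.fast else LONG_TIMEOUT_TIME


def timers_match(a: LacpEnd, b: LacpEnd) -> bool:
    """True when both ends use the same timeout mode."""
    return a.fast == b.fast


if __name__ == "__main__":
    switch = LacpEnd("QFX5100", fast=True)
    server = LacpEnd("server", fast=False)
    for end in (switch, server):
        print(f"{end.name}: expects a PDU every {end.expects_pdu_every}s, "
              f"times out after {end.rx_timeout}s")
    print("timers match:", timers_match(switch, server))
```

With both ends set to fast, each side detects a silent partner within 3 seconds instead of up to 90.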
If they cannot do that, you have requested that this case be moved to the next level tomorrow morning,
since a hardware replacement has already been done on the switch.
You wanted to configure a hold time for LACP as per the document below; however, that applies to the
QFX10000 switches and not to the QFX5100.
https://www.juniper.net/documentation/en_US/junos/topics/task/configuration/lacp-hold-up-timer-qfx-series-cli.html
The case owner will get back to you on the changes to be made on the server end and if needed he will move
the case to the next level.
Regards,
Senthil.A [JNCIS-ENT]
sanantha@juniper.net
Working Hours: 17:00 PM to 02:00 AM PST, Sunday to Thursday // (IST: 5:30 AM to 2:30 PM)// (GMT 00:00
hours to 09:00 hours)
If you require assistance in my absence, please call +1.888.314.5822 to speak with the first available
engineer.
From:jayarams@juniper.net
Date:2017-09-29 17:00:02
To:jmendoza@qcingenieria.com.co;jfernandez@qcingenieria.com.co;
Cc:jayarams@juniper.net;support@juniper.net;
Hi Juan/Jorge,
I did an analysis of the available logs. Here are my findings. The QFX5100 2-member VC is running
14.1X53-D45.3. You issued “request system power-off member 0” to bring down the master RE. After
halting the master RE, mastership moved to the backup RE without any issue. We have more than 10 AE
links connected to the Oracle server, but we focused only on the AE0 – AE4 links. Today we observed the
LACP detached state only on AE1. After a few minutes it came online automatically. When we analyzed the
logs, we saw a link flap during that time.
I have validated the link status, but a few links are still showing as down. Can you please verify
physically why those links are still showing as down?
Finally, we physically removed the XE-1/0/2 link. I compared the logs with the other link flap and they
are the same. Also, I didn’t find any abnormal logs while doing the mastership switchover.
We need to analyze the logs from the server side as well, to understand whether the flap was initiated
from the server side.
I would like to know whether you have seen any issue while physically removing the power cord from the
master RE.
I will be available tomorrow between 7.00 AM PST and 5.00 PM PST. Please let me know if you are available
to discuss this issue.
Mastership change:
Thank you,
Jayaram S
CCIE # 42762
Mail: jayarams@juniper.net
Please Always CC: support@juniper.net and remember to keep the case number in the subject line to ensure
proper handling.
Hi Juan,
I have gone through the available logs. Here is the summary of our log analysis.
Issue summary:
Yesterday we did a failover test by removing the power cord on the master RE. The switchover
from the master RE to the backup RE happens without any issues, but sometimes
interfaces go into detached state.
During that time the LACP neighbor was showing as 00:00:00:00:00:00. Later they validated
the STP status at chipset level. The problematic interfaces were showing as blocking.
I have also validated the PPM logs, and we have seen STP packets from AE48 and
AE49.
We need to enable spanning tree on the QFX5100 and validate the behavior.
Sep 30 00:44:12 PPMD_TRACE_PIPE: Sent STP (10) RcvPkt (19) len 115:
Sep 30 00:44:12 PPMD_TRACE_PIPE_DETAIL: IfIndex (3) len 4: 580
Sep 30 00:44:13 PPMD_TRACE_PIPE: Sent STP (10) RcvPkt (19) len 111:
Thank you,
Jayaram S
CCIE # 42762
Juniper ATAC EX /QFX Series
Mail: jayarams@juniper.net
Ph: +1 408 936 8056
Working hours: 9.00 AM to 6.00 PM PST.
From:jayarams@juniper.net
Date:2017-10-11 16:21:09
To:jmendoza@qcingenieria.com.co;jfernandez@qcingenieria.com.co;cjimenez@juniper.net;
Cc:support@juniper.net;yquintero@qcingenieria.com.co;jayarams@juniper.net;
File attached:;image001.jpg
*** ORIGINAL EMAIL NOTE TOO LONG AND HAS BEEN TRUNCATED TO 40000 CHARS. JTAC HAS ACCESS TO THE COMPLETE
ORIGINAL NOTE.***
Hi Juan,
2. We took ae0 for troubleshooting and the connected Oracle server IP is 172.16.4.5
3. AE0 members were xe-1/0/0 and xe-0/0/0, both were in CD. Pings were good
8. Now xe-1/0/0 of the new backup is in detached state. <<<< This is the issue
10. After some time the AE0 member xe-1/0/0 automatically came up as CD; we didn’t do anything.
11. Dev team reset the interface at BCM level to bring all the interfaces to CD.
Next action:
As per the log analysis, the QFX5100 is sending LACP packets but did not receive any PDUs from the
neighbor end.
Can you please involve the Oracle team to analyze the logs and identify the reason the server end is not
sending LACP packets?
root@UMNG_Datacenter_QFX5100> show lacp interfaces
LACP state: Role Exp Def Dist Col Syn Aggr Timeout Activity
LACP state: Role Exp Def Dist Col Syn Aggr Timeout Activity
LACP state: Role Exp Def Dist Col Syn Aggr Timeout Activity
LACP state: Role Exp Def Dist Col Syn Aggr Timeout Activity
LACP state: Role Exp Def Dist Col Syn Aggr Timeout Activity
LACP state: Role Exp Def Dist Col Syn Aggr Timeout Activity
LACP state: Role Exp Def Dist Col Syn Aggr Timeout Activity
=======================
=======================
No adjacency found : 90
Packets Dropped(Invalid) : 0
Interface name xe-1/0/48:0.0, Pkt tx: 465, Tx errors: 0, Pkt rx: 471, <<<< LACP packets received from Juniper
Interface name xe-1/0/0.0, Pkt tx: 516, Tx errors: 0, Pkt rx: 1, <<<< LACP packets not received from oracle server
Interface name xe-1/0/5.0, Pkt tx: 518, Tx errors: 0, Pkt rx: 524, <<<< LACP packets received from oracle server
Interface name xe-1/0/6.0, Pkt tx: 519, Tx errors: 0, Pkt rx: 524,
Interface name xe-1/0/7.0, Pkt tx: 518, Tx errors: 0, Pkt rx: 523,
=======================
=======================
No adjacency found : 90
Packets Dropped(Invalid) : 0
Interface name xe-1/0/48:0.0, Pkt tx: 526, Tx errors: 0, Pkt rx: 534, <<<< LACP packets received from Juniper
Interface name xe-1/0/48:1.0, Pkt tx: 527, Tx errors: 0, Pkt rx: 534, <<<< LACP packets received from Juniper
Interface name xe-1/0/0.0, Pkt tx: 578, Tx errors: 0, Pkt rx: 1, <<<< LACP packets not received from oracle server
Interface name xe-1/0/5.0, Pkt tx: 580, Tx errors: 0, Pkt rx: 586, <<<< LACP packets received from oracle server
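The comparison made in the counters above (LACP packets sent versus received per interface) can be automated. The following is a minimal sketch, assuming the statistics lines keep the layout shown in this note. The function name and the 50% threshold are illustrative assumptions, not Juniper tooling.

```python
import re

# Matches lines such as:
#   Interface name xe-1/0/0.0, Pkt tx: 516, Tx errors: 0, Pkt rx: 1,
LINE_RE = re.compile(
    r"Interface name (?P<ifname>[^,\s]+), Pkt tx: (?P<tx>\d+), "
    r"Tx errors: (?P<tx_err>\d+), Pkt rx: (?P<rx>\d+)"
)


def find_silent_partners(text, min_ratio=0.5):
    """Return (ifname, tx, rx) for interfaces whose rx count is below min_ratio * tx.

    min_ratio is an arbitrary illustrative threshold, not a Junos value.
    """
    flagged = []
    for m in LINE_RE.finditer(text):
        tx, rx = int(m["tx"]), int(m["rx"])
        if tx > 0 and rx < min_ratio * tx:
            flagged.append((m["ifname"], tx, rx))
    return flagged


# Sample taken from the first counter snapshot in this note.
SAMPLE = """\
Interface name xe-1/0/48:0.0, Pkt tx: 465, Tx errors: 0, Pkt rx: 471,
Interface name xe-1/0/0.0, Pkt tx: 516, Tx errors: 0, Pkt rx: 1,
Interface name xe-1/0/5.0, Pkt tx: 518, Tx errors: 0, Pkt rx: 524,
"""

if __name__ == "__main__":
    for ifname, tx, rx in find_silent_partners(SAMPLE):
        print(f"{ifname}: sent {tx} LACP packets, received only {rx}")
```

Run against the sample, only xe-1/0/0.0 is flagged, which matches the interface identified manually.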
Thank you,
Jayaram S
CCIE # 42762
Mail: jayarams@juniper.net