TW QFabric TrafficFlows

THIS WEEK: QFABRIC SYSTEM TRAFFIC
FLOWS AND TROUBLESHOOTING

Traditional Data Center architecture follows a layered approach that uses separate switch devices for access, aggregation, and core layers. But a completely scaled QFabric system combines
Junos Fabric and Switching Technologies
all the member switches and enables them to function as a single unit. So, if your Data Center
deploys a QFabric system with one hundred QFX3500 nodes, then those one hundred switches
will act like a single switch.
Traffic flows differently in this super-sized Virtual Chassis that spans your entire Data Center.
Knowing how traffic moves is critical to understanding and architecting Data Center operations, but it is also necessary to ensure efficient day-to-day operations and troubleshooting.

This Week: QFabric System Traffic Flows and Troubleshooting is a deep dive into how the QFabric
system externalizes the data plane for both user data and data plane traffic and why thats such
a massive advantage from an operations point of view.
QFabric is an unique accomplishment making 128 switches look and function as

one. Ankit brings to this book a background of both supporting QFabric customers,
and as a resident engineer, implementing complex customer migrations. This deep
dive into the inner workings of the QFabric system is highly recommended for anyone
looking to implement or better understand this technology.
John Merline, Network Architect, Northwestern Mutual
LEARN SOMETHING NEW ABOUT QFABRIC THIS WEEK:

Understand the QFabric system technology in great detail.
Compare the similarities of the QFabric architecture with MPLS-VPN technology.
Verify the integrity of various protocols that ensure smooth functioning of the QFabric system.
Understand the various active/backup Routing Engines within the QFabric system.
Understand the various data plane and control plane flows for different kinds of traffic within the
QFabric system.
Knowing how traffic

flows through a QFabric
system is knowing how your
Data Center can scale.
Operate and effectively troubleshoot issues that you might face with a QFabric deployment.
Published by Juniper Networks Books

ISBN 978-1936779871
9 781936 779871
www.juniper.net/books
51600
By Ankit Chadha

Traditional Data Center architecture follows a layered approach that uses separate switch devices for access, aggregation, and core layers. But a completely scaled QFabric system combines
Junos Fabric and Switching Technologies
all the member switches and enables them to function as a single unit. So, if your Data Center
deploys a QFabric system with one hundred QFX3500 nodes, then those one hundred switches
will act like a single switch.
Traffic flows differently in this super-sized Virtual Chassis that spans your entire Data Center.
Knowing how traffic moves is critical to understanding and architecting Data Center operations, but it is also necessary to ensure efficient day-to-day operations and troubleshooting.

This Week: QFabric System Traffic Flows and Troubleshooting is a deep dive into how the QFabric
system externalizes the data plane for both user data and data plane traffic and why thats such
a massive advantage from an operations point of view.
QFabric is an unique accomplishment making 128 switches look and function as

one. Ankit brings to this book a background of both supporting QFabric customers,
and as a resident engineer, implementing complex customer migrations. This deep
dive into the inner workings of the QFabric system is highly recommended for anyone
looking to implement or better understand this technology.
John Merline, Network Architect, Northwestern Mutual
LEARN SOMETHING NEW ABOUT QFABRIC THIS WEEK:

Understand the QFabric system technology in great detail.
Compare the similarities of the QFabric architecture with MPLS-VPN technology.
Verify the integrity of various protocols that ensure smooth functioning of the QFabric system.
Understand the various active/backup Routing Engines within the QFabric system.
Understand the various data plane and control plane flows for different kinds of traffic within the
QFabric system.
Knowing how traffic

flows through a QFabric
system is knowing how your
Data Center can scale.
Operate and effectively troubleshoot issues that you might face with a QFabric deployment.

ISBN 978-1936779871
9 781936 779871
www.juniper.net/books
51600
By Ankit Chadha
This Week:
QFabric System Traffic Flows and Troubleshooting
By Ankit Chadha
Chapter 1: Physical Connectivity and Discovery . . . . . . . . . . . . . . . . . . . . . . . . . 9

Chapter 2: Accessing Individual Components. . . . . . . . . . . . . . . . . . . . . . . . . . 25
Chapter 3: Control Plane and Data Plane Flows. . . . . . . . . . . . . . . . . . . . . . . 39
Chapter 4: Data Plane Forwarding. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
Knowing how traffic flows through your QFabric system is

knowing how your Data Center can scale.
ii
2014 by Juniper Networks, Inc. All rights reserved.

Juniper Networks, Junos, Steel-Belted Radius, NetScreen, and
ScreenOS are registered trademarks of Juniper Networks, Inc. in
the United States and other countries. The Juniper Networks
Logo, the Junos logo, and JunosE are trademarks of Juniper
Networks, Inc. All other trademarks, service marks, registered
trademarks, or registered service marks are the property of their
respective owners. Juniper Networks assumes no responsibility
for any inaccuracies in this document. Juniper Networks reserves
the right to change, modify, transfer, or otherwise revise this
publication without notice.

Authors: Ankit Chadha
Technical Reviewers: John Merline, Steve Steiner, Girish SV
Editor in Chief: Patrick Ames
Copyeditor and Proofer: Nancy Koerbel
J-Net Community Manager: Julie Wider
ISBN: 978-1-936779-87-1 (print)
Printed in the USA by Vervante Corporation.
ISBN: 978-1-936779-86-4 (ebook)
Version History: v1, April 2014
2 3 4 5 6 7 8 9 10
About the Author:

Ankit Chadha is a Resident Engineer in the Advanced Services
group of Juniper Networks. He has worked on QFabric system
solutions in various capacities including solutions testing,
engineering escalation, customer deployment, and design roles.
He holds several industry recognized certifications such as
JNCIP-ENT and CCIE-RnS.
Authors Acknowledgments:
I would like to thank Patrick Ames, our Editor in Chief, for his
continuous encouragement and support from the conception of
this idea until the delivery. There is no way that this book would
have been successfully completed without Patricks support. John
Merline and Steve Steiner provided invaluable technical review;
Girish SV spent a large amount of time carefully reviewing this
book and made sure that its ready for publishing. Nancy Koerbel
made sure that it shipped without any embarrassing mistakes.
Thanks to Steve Steiner, Mahesh Chandak, Ruchir Jain, Vaibhav
Garg, and John Merline for being great mentors and friends. Last
but not least, Id like to thank my family and my wife, Tanu, for
providing all the support and love that they always do.
This book is available in a variety of formats at:
http://www.juniper.net/dayone.
iii
Welcome to This Week

This Week books are an outgrowth of the popular Day One book series published
by Juniper Networks Books. Day One books focus on providing just the right
amount of information that you can execute, or absorb, in a day. This Week books,
on the other hand, explore networking technologies and practices that in a classroom setting might take several days to absorb or complete. Both libraries are
available to readers in multiple formats:
Download a free PDF edition at http://www.juniper.net/dayone.
Get the ebook edition for iPhones and iPads at the iTunes Store>Books. Search
for Juniper Networks Books.
Get the ebook edition for any device that runs the Kindle app (Android,
Kindle, iPad, PC, or Mac) by opening your devices Kindle app and going to
the Kindle Store. Search for Juniper Networks Books.
Purchase the paper edition at either Vervante Corporation (www.vervante.
com) or Amazon (www.amazon.com) for prices between $12-$28 U.S.,
depending on page length.
Note that Nook, iPad, and various Android apps can also view PDF files.
If your device or ebook app uses .epub files, but isnt an Apple product, open
iTunes and download the .epub file from the iTunes Store. You can now drag
and drop the file out of iTunes onto your desktop and sync with your .epub
device.
What You Need to Know Before Reading

Before reading this book, you should be familiar with the basic administrative
functions of the Junos operating system, including the ability to work with operational commands and to read, understand, and change Junos configurations. There
are several books in the Day One library to help you learn Junos administration, at
www.juniper.net/dayone.
This book makes a few assumptions about you, the reader:
You have a working understanding of Junos and the Junos CLI, including
configuration changes using edit mode. See the Day One books at www.
juniper.net/dayone for a variety of tutorials on Junos at all skill levels.
You can make configuration changes using the CLI edit mode. See the Day
One books at www.juniper.net/dayone for a variety of tutuorials on Junos at
all skill levels.
You have an understanding of networking fundamentals like ARP, MAC
addresses, etc.
You have a thorough familiarity with BGP fundamentals.
You have a thorough familiarity with MP-BGP and MPLS-VPN fundamentals
and their terminologies. See This Week: Deploying MBGP Multicast VPNs,
Second Edition at www.juniper.net/dayone for a quick review.
Finally, this book uses outputs from actual QFabric systems/deployments
readers are strongly encouraged to have a stable lab setup to execute those
commands.
iv
iv
What You Will Learn From This Book

Youll understand the working of the QFabric technology (in great detail).
Youll be able to compare the similarities of the QFabric architecture with
MPLS VPN technology.
Youll be able to verify the integrity of various protocols that ensure smooth
functioning of the QFabric system.
Youll understand the various active/backup Routing Engines within the
QFabric.
Youll understand the various data plane, control plane flows for different
kinds of traffic within the QFabric system.
Youll be able to operate and effectively troubleshoot issues that you might face
with a QFabric system deployment.
Information Experience
This book is singularly focused on one aspect of networking technology. There are
other sources at Juniper Networks, from white papers to webinars to online forums
such as J-Net (forums.juniper.net). Look for the following sidebars to directly access
other superb informational resources:
MORE? Its highly recommended you go through the technical documentation and the
minimum requirements to get a sense of QFabric hardware and deployment before
you jump in. The technical documentation is located at www.juniper.net/documentation. Use the Pathfinder tool on the documentation site to explore and find the
right information for your needs.
About This Book

This book focuses on the inner workings and internal traffic flows of the Juniper
Networks QFabric solution and does not address deployment or configuration
practices.
MORE? The complete deployment guide for the QFabric can be found here: https://www.
juniper.net/techpubs/en_US/junos11.3/information-products/pathway-pages/
qfx-series/qfabric-deployment.html.
QFabric vs. Legacy DataLCenter Architecture

Traditional Data Center architecture follows a layered approach to building a Data
Center using separate switch devices for access, aggregation, and core layers. Obviously these devices have different capacities with respect to their MAC table sizes,
depending on their role or their placement in the different layers.
Since Data Centers host mission critical applications, redundancy is of prime importance. To provide the necessary physical redundancy within a Data Center, Spanning
Tree Protocol (STP) is used. STP is a popular technology and is widely deployed
around the world. A Data Center like the one depicted in Figure A.1 always runs
some flavor of STP to manage the physical redundancy. But there are drawbacks to
using Spanning Tree Protocol for redundancy:
Figure A.1
Traditional Layered Data Center Topology
STP works on the basis of blocking certain ports, meaning that some ports
can potentially be overloaded, while the blocked ports do not forward any
traffic at all. This is highly undesirable, especially because the switch ports
deployed in a Data Center are rather costly.
This situation of some ports not forwarding any traffic can be overcome
somewhat by using different flavors of the protocol, like PVST or MSTP,
but STP inherently works on the principle of blocking ports. Hence, even
with PVST or MSTP, complete load balancing of traffic over all the ports
cannot be achieved. Using PVST and MSTP versions, load balancing can be
done across VLANs one port can block for one VLAN or a group of
VLANs and another port can block for the rest of the VLANs. However,
there is no way to provide load balancing for different flows within the
same VLAN.
Spanning Tree relies on communication between different switches. If there
is some problem with STP communication, then the topology change
recalculations that follow can lead to small outages across the whole Layer
2 domain. Even small outages like these can cause significant revenue loss
for applications that are hosted on your Data Center.
By comparison, a completely scaled QFabric system can have up to 128 member
switches. This new technology works by combining all the member switches and
making them function as a single unit to other external devices. So if your Data
Center deploys a QFabric with one hundred QFX3500 nodes, then all those one
hundred switches will act as a single switch. In short, that single switch (QFabric)
will have (100x48) 4800 ports!
vi
Since all the different QFX3500 nodes act as a single switch, there is no need to run
any kind of loop prevention protocol like Spanning Tree. At the same time, there is no
compromise on redundancy because all the Nodes have redundant connections to the
backplane (details on the connections between different components of a QFabric
system are discussed throughout this book). This is how the QFabric solution takes
care of the STP problem within the Data Center.
Consider the case of a traditional (layered) Data Center design. Note that if two hosts
connected to different access switches need to communicate with each other, they need
to cross multiple switches in order to do that. In other words, communication in the
same or a different VLAN might need to cross multiple switch hops to be successful.
Since all the Nodes within a QFabric system work together and act as a large single
switch, all the external devices connected to the QFabric Nodes (servers, filers, load
balancers, etc.) are just one hop away from each other. This leads to a lower number
of lookups, and hence, considerably reduces latency.
Different Components of a QFabric System

A QFabric system has multiple physical and logical components lets identify them
here so you have a common place you can return to when you need to review them.
Physical Components
A QFabric system has the following physical components as shown in Figure A.2:
Nodes: these are the top-of-rack (TOR) switches to which external devices are
connected. All the server-facing ports of a QFabric system reside on the Nodes.
There can be up to 128 Nodes in a QFabric-G system and up to 16 Nodes in a
QFabric-M implementation. Up to date details on the differences between
various QFabric systems can be found here: http://www.juniper.net/us/en/
products-services/switching/qfabric-system/#overview.
Interconnects: The Interconnects act as the backplane for all the data plane
traffic. All the Nodes should be connected to all the Interconnects as a best
practice. There can be up to four Interconnects (QFX3008-I) in both QFabric-G
and QFabric-M implementations.
Director Group: There are two Director devices (DG0 and DG1) in both
QFabric-G and QFabric-M implementations. These Director devices are the
brains of the whole QFabric system and host the necessary virtual components
(VMs) that are critical to the health of the system. The two Director devices
operate in a master/slave relationship. Note that all the protocol/route/inventory
states are always synced between the two.
Control Plane Ethernet Switches: These are two independent EX VCs or EX
switches (in case of QFabric-G and QFabric-M, respectively) to which all the
other physical components are connected. These switches provide the necessary
Ethernet network over which the QFabric components can run the internal
protocols that maintain the integrity of the whole system. The LAN segment
created by these devices is called the Control Plane Ethernet segment or the CPE
segment.
Figure A.2
vii
Components of a QFabric System
Virtual Components
The Director devices host the following Virtual Machines:
Network Node Group VM: The NWNG-VM are the routing brains for a
QFabric system, where all the routing protocols like OSPF, BGP, or PIM are
run. There are two NWNG-VMs in a QFabric system (one hosted on each
DG) and they operate in an active/backup fashion with the active VM always
being hosted on the master Director device.
Fabric Manager: The Fabric Manager VM is responsible for maintaining the
hardware inventory of the whole system. This includes discovering new
Nodes and Interconnects as theyre added and keeping a track of the ones that
are removed. The Fabric Manager is also in charge of keeping a complete
topological view of how the Nodes are connected to the Interconnects. In
addition to this, the FM also needs to provide internal IP addresses to every
other component to allow for the internal protocols to operate properly.
There is one Fabric Manager VM hosted on each Director device and these
VMs operate in an active/backup configuration.
Fabric Control: The Fabric Manager VM is responsible for distributing
various routes (Layer 2 or Layer 3) to different Nodes of a QFabric system.
This VM forms internal BGP adjacencies with all the Nodes and Interconnects
and sends the appropriate routes over these BGP peerings. There is one Fabric
Manager VM hosted on each Director device and these operate in an active/
active fashion.
viii
Node Groups Within a QFabric System

Node groups is a new concept introduced by the QFabric technology; it is a logical
collection of one or more physical Nodes that are part of a QFabric system. Whenever multiple Nodes are configured to be part of a Node group, they act as one.
Individual Nodes can be configured to be a part of these kinds of Node groups:
Server Node Group (SNG): This is the default group and consists of one Node.
Whenever a Node becomes part of a QFabric system, it comes up as an SNG.
These mostly connect to servers that do not need any cross Node redundancy.
The most common examples are servers that have only one NIC.
Redundant Server Node Group (RSNG): An RSNG consists of two physical
Nodes. The Routing Engines on the Nodes operate in an active/backup fashion
(think of a Virtual Chassis with two member switches). You can configure
multiple pairs of RSNGs within a QFabric system. These mostly connect to
dual-NIC servers.
Network Node Group (NWNG): Each QFabric has one Network Node Group
and up to eight physical Nodes can be configured to be part of the NWNG. The
Routing Engines (RE) on the Nodes are disabled and the RE functionality is
handled by the NWNG-VMs that are located on the Director devices.
MORE? Every Node device can be a part of only one Node group at a time. The details on
how to configure different kinds of Node groups can be found here: http://www.
juniper.net/techpubs/en_US/junos12.2/topics/task/configuration/qfabric-nodegroups-configuring.html.
NOTE
Chapter 3 covers these abstractions, including a discussion of packet flows.
Differences Between a QFabric System and a Virtual Chassis

Juniper EX Series switches support Virtual Chassis (VC) technology, which enables
multiple physical switches to be combined. These multiple switches then act as a
single switch.
MORE? For more details on the Virtual Chassis technology, refer to the following technical
documentation: https://www.juniper.net/techpubs/en_US/junos13.3/topics/concept/
virtual-chassis-ex4200-components.html.
One of the advantages of a QFabric system is its scale. A virtual chassis can host tens
of switches, but a fully scaled QFabric system can have a total 128 Nodes combined.
The QFabric system, however, is much more than a supersized Virtual Chassis.
QFabric technology completely externalizes the data plane because of the Interconnects. Chapter 4 discusses the details of how user data or data plane traffic flows
through the external data plane.
Another big advantage of the QFabric system is that the Nodes can be present at
various locations within the Data Center. The Nodes are normally deployed as
top-of-rack (TOR) switches. To connect to the backplane, the Nodes connect to the
Interconnects. Thats why QFabric is such a massive advantage from an operations
point of view, as one large switch that spans the entire Data Center and the cables
from the servers still plugs in to only the top-of-rack switches.
Various components of the QFabric system (including the Interconnects) are discussed throughout this book and you can return to these pages to review these basic
definitions at any time.
Chapter 1
Physical Connectivity and Discovery
Interconnections of Various Components . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10

Why Do You Need Discovery?. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
System and Component Discovery. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
Fabric Topology Discovery (VCCPDf). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
Relation Between VCCPD and VCCPDf. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
Test Your Knowledge. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
10
This Week: QFabric System Traffic Flows and Troubleshooting
This chapter discusses what a plain-vanilla QFabric system is supposed to look like. It
does not discuss any issues in the data plane or about packet forwarding; its only
focus is the internal workings of QFabric and checking the protocols that are instrumental in making the QFabric system function as a single unit.
The important first step in setting up a QFabric system is to cable it correctly.
MORE? Juniper has great documentation about cabling and setting up a QFabric system, so it
wont be repeated here. If you need to, review the best practices of QFabric cabling:
https://www.juniper.net/techpubs/en_US/junos11.3/information-products/pathwaypages/qfx-series/qfabric-deployment.html
Make sure that the physical connections are made exactly as mentioned in the deployment guide. Thats how the test units used for this book were set up. Any variations in
your lab QFabric system might cause discrepencies with the correlating output that is
shown in this book.
Interconnections of Various Components

As discussed, the QFabric system consists of multiple physical components and these
components need to be connected to each other as well. Consider these inter-component links:
Nodes to EX Series VC: These are 1GbE links
Interconnects to EX Series VC: These are 1GbE links
DG0/DG1 to EX Series VC: These are 1GbE links
DG0 to DG1: These are 1GbE links
Nodes to Interconnects: These are 40GbE links
All these physical links are Gigabit Ethernet links except for the 40GbE links between
the Nodes and the Interconnects these 40 GbE links show up as FTE interfaces on
the CLI. The usual Junos commands like show interfaces terse and show interfaces extensive apply and should be used for troubleshooting any issues related to
finding errors on the physical interfaces.
The only devices where these Junos commands cannot be run are the Director devices
because they run on Linux. However, the usual Linux commands do work on Director devices (like ifconfig, top, free, etc.).
Lets start with one of the most common troubleshooting utilities that a network engineer needs to know about: checking the status of the interfaces and their properties
on the Director devices.
To check the status of interfaces of the Director devices from the Linux prompt, the
regular ifconfig command can be used. However, this output uses the following
keywords for specific interfaces types:
Bond0: This is the name of the aggregated interface that gets connected to the
other Director device (DG). The two Director devices are called DG0 and DG1.
Note that the IP address for the Bond0 interface on DG0 is always 1.1.1.1 and
on DG1 it is always set to 1.1.1.2. This link is used for syncing and maintaining
the states of the two Director devices. These states include VMs, configurations,
file transfers (cores), etc.
Chapter 1: Physical Connectivity and Discovery
11
Bond1: This aggregated interface is used mainly for internal Control plane
communication between the Director devices and other QFabric components
like the Nodes and the Interconnects.
Eth0: This is the management interface of the DG. This interface gets connected to the network and you can SSH to the IP address of this interface from
from an externally reachable machine. Each Director device has an interface
called Eth0, which should be connected to the management network. At the
time of installation, the QFabric system prompts the user to enter the IP
address for the Eth0 interface of each Director device. In addition to this, the
user is required to add a third IP address called the VIP (Virtual IP Address).
This VIP is used to manage the operations of QFabric, such as SSH, telnet, etc.
Also, the CLI command show fabric administration inventory director-group
status shows the status of all the interfaces. Here is sample output of this CLI
command:
root@TEST-QFABRIC>showfabricadministrationinventorydirector-groupstatus
DirectorGroupStatusTueFeb1108:32:50CST2014
MemberStatusRoleMgmtAddressCPUFreeMemoryVMsUpTime
----------------------------------------------------------------dg0onlinemaster172.16.16.51%3429452k497days,02:14hrs
dg1onlinebackup172.16.16.60%8253736k369days,23:42hrs
MemberDeviceId/AliasStatusRole
-------------------------------------dg0TSTDG0onlinemaster
MasterServices
--------------DatabaseServeronline
LoadBalancerDirectoronline
QFabricPartitionAddressoffline
DirectorGroupManagedServices
------------------------------SharedFileSystemonline
NetworkFileSystemonline
VirtualMachineServeronline
LoadBalancer/DHCPonline
HardDriveStatus
----------------VolumeID:0FFF04E1F7778DA3optimal
PhysicalID:0online
PhysicalID:1online
ResyncProgressRemaining:00%
SizeUsedAvailUsed%Mountedon
---------------------------423G36G366G9%/
99M16M79M17%/boot
93G13G81G14%/pbdata
DirectorGroupProcesses
-----------------------DirectorGroupManageronline
PartitionManageronline
SoftwareMirroringonline
SharedFileSystemmasteronline
SecureShellProcessonline
FTPServeronline
Syslogonline
DistributedManagementonline
12
SNMPTrapForwarderonline
SNMPProcessonline
PlatformManagementonline
InterfaceLinkStatus
--------------------ManagementInterfaceup
ControlPlaneBridgeup
ControlPlaneLAGup
CPLink[0/2]down
CPLink[0/1]up
CPLink[0/0]up
CPLink[1/2]down
CPLink[1/1]up
CPLink[1/0]up
CrossoverLAGup
CPLink[0/3]up
CPLink[1/3]up
MemberDeviceId/AliasStatusRole
-------------------------------------dg1TSTDG1onlinebackup
DirectorGroupManagedServices
------------------------------SharedFileSystemonline
VirtualMachineServeronline
LoadBalancer/DHCPonline
HardDriveStatus
----------------VolumeID:0A2073D2ED90FED4optimal
PhysicalID:0online
PhysicalID:1online
SizeUsedAvailUsed%Mountedon
---------------------------423G39G362G10%/
99M16M79M17%/boot
93G13G81G14%/pbdata
DirectorGroupProcesses
-----------------------DirectorGroupManageronline
PartitionManageronline
SoftwareMirroringonline
SharedFileSystemmasteronline
SecureShellProcessonline
FTPServeronline
Syslogonline
DistributedManagementonline
SNMPTrapForwarderonline
SNMPProcessonline
PlatformManagementonline
InterfaceLinkStatus
--------------------ManagementInterfaceup
ControlPlaneBridgeup
ControlPlaneLAGup
CPLink[0/2]down
CPLink[0/1]up
CPLink[0/0]up
CPLink[1/2]down
13
CPLink[1/1]up
CPLink[1/0]up
CrossoverLAGup
CPLink[0/3]up
CPLink[1/3]up
root@TEST-QFABRIC>
--snip--
Note that this output is taken from a QFabric-M system, and hence, port 0/2 is
down on both the Director devices.
Details on how to connect these ports on the DGs is discussed in the QFabric
Installation Guide cited at the beginning of this chapter, but once the physical
installation of a QFabric system is complete, you should verify the status of all the
ports. Youll find that once a QFabric system is installed correctly, it is ready to
forward traffic, and the plug-and-play features of the QFabric technology make it
easy to install and maintain.
However, a single QFabric system has multiple physical components, so lets assume
youve cabled your test bed correctly in your lab and review how a QFabric system
discovers its multiple components and makes sure that those different components
act as a single unit.
Why Do You Need Discovery?

The front matter of this book succinctly discusses the multiple physical components
that comprise a QFabric system. The control plane consists of the Director groups
and the EX Series VC. The data plane of a QFabric system consists of the Nodes and
the Interconnects. A somewhat loose (and incorrect) analogy that might be drawn is
that the Director groups are similar to the Routing Engines of a chassis-based
switch, the Nodes are similar to the line cards, and the Interconnects are similar to
the backplane of a chassis-based switch. But QFabric is different from a chassisbased switch as far as system discovery is concerned.
Consider a chassis-based switch. There are only a certain number of slots in such a
device and the line cards can only be inserted into the slots available. After a line
card is inserted in one of the available slots, it is the responsibility of the Routing
Engine to discover this card. Note that since there are a finite number of slots, it is
much easier to detect the presence or absence of a line card in a chassis with the help
of hardware-based assistance. Think of it as a hardware knob that gets activated
whenever a line card is inserted into a slot. Hence, discovering the presence or
absence of a line card is easy in a chassis-based device.
However, QFabric is a distributed architecture that was designed to suit the needs of
a modern Data Center. A regular Data Center has many server cabinets and the
Nodes of a QFabric system can act as the TOR switches. Note that even though the
Nodes can be physically located in different places within the Data Center, they still
act as a single unit.
One of the implications of this design is that the QFabric system can no longer use a
hardware-assist mechanism to detect different physical components of the QFabric
system. For this reason, QFabric uses an internal protocol called Virtual Chassis
Control Protocol Daemon (VCCPD ) to make sure that all the system components
can be detected.
14
System and Component Discovery

Virtual Chassis Control Protocol Daemon runs on the Control plane Ethernet
network and is active by default on all the components of the QFabric system. This
means that VCCPD runs on all the Nodes, Interconnects, Fabric Control VMs, Fabric
Manager VMs, and the Network Node Group VMs. Note that this protocol runs on
the backup VMs as well.
There is a VM that runs on the DG whose function is to make these VCCPD adjacencies with all the devices. This VM is called Fabric Manager, or FM.
The Control plane Ethernet network is comprised of the EX Series VC and all of the
physical components that have Ethernet ports connected to these EX VCs. Since there
are no IP addresses on the devices when they come up, VCCPD protocol uses the IS-IS
protocol to make sure that no IP addresses are needed for the system discovery. All
the components send out and receive VCCPD Hello messages on the Control plane
Ethernet (CPE) network. With the help of these messages, the Fabric Manager VM is
able to detect all the components that are connected to the EX Series VCs.
Consider a system in which you have only the DGs connected to the EX Series VC.
The DGs host the Fabric Manager VMs, which send VCCPD Hellos on the CPE
network. When a new Node is connected to the CPE, then FM and the new Node
form a VCCPD adjacency, and this is how the DGs detect the event of a new Nodes
addition to the QFabric system. This same process holds true for the Interconnect
devices, too.
After the adjacency is created, the Nodes, Interconnects, and the FM send out
periodic VCCPD Hellos on the CPE network. These Hellos act as heartbeat messages
and bidirectional Hellos confirm the presence or absence of the components.
If the FM doesnt receive a VCCPD Hello within the hold time, then that device is
considered dead and all the routes that were originated from that Node are flushed
out from other Nodes.
Like any other protocol, VCCPD adjacencies are formed by the Routing Engine of
each component, so VCCPD adjacency stats are available at:
The NNG VM for the Node devices that are a part of the Network Node Group
The master RSNG Node device for a Redundant Server Node Group
The RE of a standalone Server Node Group device
The show virtual-chassis protocol adjacency
the status of the VCCPD adjacencies:
provisioning CLI command shows
qfabric-admin@NW-NG-0>showvirtual-chassisprotocoladjacencyprovisioning
Interface System State Hold (secs)
vcp1.32768P7814-C Up 28
vcp1.32768 P7786-C Up 28
vcp1.32768 R4982-C Up 28
vcp1.32768 TSTS2510b Up 29
vcp1.32768 TSTS2609bUp 28
vcp1.32768 TSTS2608a Up 27
vcp1.32768 TSTS2610b Up 28
vcp1.32768 TSTS2611b Up 28
vcp1.32768 TSTS2509b Up28
vcp1.32768 TSTS2511a Up 29
vcp1.32768 TSTS2511b Up 28
vcp1.32768 TSTS2510a Up 28
15
vcp1.32768 TSTS2608b Up 29
vcp1.32768 TSTS2610a Up 28
vcp1.32768 TSTS2509a Up 28
vcp1.32768 TSTS2611a Up 28
vcp1.32768 TSTS1302b Up 29
vcp1.32768TSTS2508a Up 29
vcp1.32768 TSTS2508b Up 29
vcp1.32768 TSTNNGS1205a Up 27
vcp1.32768 TSTS1302a Up 28
vcp1.32768 __NW-INE-0_RE0 Up 28
vcp1.32768 TSTNNGS1204a Up 29
vcp1.32768 G0548/RE0 Up 27
vcp1.32768 G0548/RE1 Up 28
vcp1.32768G0530/RE1 Up 29
vcp1.32768 G0530/RE0 Up 28
vcp1.32768 __RR-INE-1_RE0 Up 29
vcp1.32768 __RR-INE-0_RE0 Up 29
vcp1.32768 __DCF-ROOT.RE0 Up 29
vcp1.32768 __DCF-ROOT.RE1 Up 28
{master}
The same output can also be viewed from the Fabric Manager VM:
root@Test-QFabric>requestcomponentloginFM-0
Warning:Permanentlyadded'dcfnode---dcf-root,169.254.192.17'(RSA)tothelistofknownhosts.
---JUNOS12.2X50-D41.1built2013-03-2221:44:05UTC
qfabric-admin@FM-0>
qfabric-admin@FM-0>showvirtual-chassisprotocoladjacencyprovisioning
Interface System State Hold (secs)
vcp1.32768 P7814-CUp 27
vcp1.32768 P7786-C Up 28
vcp1.32768 R4982-C Up 29
vcp1.32768 TSTS2510b Up 29
vcp1.32768 TSTS2609b Up28
vcp1.32768 TSTS2608a Up 29
vcp1.32768 TSTS2610b Up 28
vcp1.32768 TSTS2611b Up 28
vcp1.32768 TSTS2509b Up 27
vcp1.32768 TSTS2511a Up 29
vcp1.32768 TSTS2511b Up 29
vcp1.32768 TSTS2510a Up 27
vcp1.32768 TSTS2608b Up 28
vcp1.32768TSTS2610a Up 28
vcp1.32768 TSTS2509a Up 28
vcp1.32768 TSTS2611a Up 28
vcp1.32768 TSTS1302b Up 28
vcp1.32768 TSTS2508a Up 27
vcp1.32768 TSTS2508b Up 29
vcp1.32768 TSTNNGS1205aUp 28
vcp1.32768 TSTS1302a Up 29
vcp1.32768__NW-INE-0_RE0 Up 28
vcp1.32768 TSTNNGS1204aUp 28
vcp1.32768 G0548/RE0 Up 28
vcp1.32768 G0548/RE1 Up 29
vcp1.32768 G0530/RE1Up 28
vcp1.32768 G0530/RE0 Up 27
vcp1.32768 __RR-INE-1_RE0 Up 29
vcp1.32768 __NW-INE-0_RE1 Up 28
vcp1.32768 __DCF-ROOT.RE0 Up29
vcp1.32768 __RR-INE-0_RE0 Up 28
--snip--
16
VCCPD Hellos are sent every three seconds and the adjacency is lost if the peers dont
see each others Hellos for 30 seconds.
After the Nodes and Interconnects form VCCPD adjacencies with the Fabric Manager VM, the QFabric system has a view of all the connected components.
Note that the VCCPD adjacency only provides details about how many Nodes and
Interconnects are present in a QFabric system. VCCPD does not provide any information about the data plane of the QFabric system; that is, it doesnt provide information about the status of connections between the Nodes and the Interconnects.
Fabric Topology Discovery (VCCPDf)

The Nodes can either be QFX-3500s, or QFX-3600s (QFX5100s are supported as a
QFabric node only from 13.2X52-D10 onwards), and both of these have four FTE
links by default. Note that the term FTE-link here means the links that can be
connected to the Interconnects.The number of FTE links on a QFX 3600 can be
modified by using the CLI, but this modification can not be preformed on the QFX
3500. These FTE links can be connected to up to four different Interconnects and the
QFabric system uses a protocol called VCCPDf (VCCPD over fabric links) which
helps the Director devices form a complete topological view of the QFabric system.
One of the biggest advantages of the QFabric technology is its flexibility and its
ability to scale. To further understand this flexibility and scalability, consider a new
Data Center deployment in which the initial bandwidth requirements are so low that
none of the Nodes are expected to have more than 80 Gbps of incoming traffic at any
given point in time. This means that this Data Center can be deployed with all the
Nodes having just two out of the four FTE links connected to the Interconnects. To
have the necessary redundancy, these two FTE links would be connected to two
different Interconnects.
In short, such a Data Center can be deployed with only two Interconnects. However,
as the traffic needs of the Data Center grow, more Interconnects can be deployed and
then the Nodes can be connected to the newly added Interconnects to allow for
greater data plane bandwidth. This kind of flexibility can allow for future proofing of
an investment made in the QFabric technology.
Note that a QFabric system has the built in intelligence to figure out how many FTE
links are connected on each Node and this information is necessary to be able to
know how to load-balance various kinds of traffic between different Nodes.
The QFabric technology uses VCCPDf to figure out the details of the data plane.
Whenever a new FTE link is added or removed, it triggers the creation of a new
VCCPDf adjacency or the deletion of an existing VCCPDf adjacency, respectively.
This information is then fed back to the Director devices over the CPE links so that
the QFabric system can always maintain a complete topological view of how the
Nodes are connected to the Interconnects. Basically, VCCPDf is a protocol that runs
on the FTE links between the Nodes and the Interconnects.
VCCPDf runs on all the Nodes and the Interconnects but only on the 40GbE (or FTE)
ports. VCCPDf utilizes the neighbor discovery portion of IS-IS. As a result, each
Node device would be able to know how many Interconnects it is connected to, the
device ID of those Interconnects, and the connected ports ID on the Interconnects.
Similarly each Interconnect would be able to know how many Node devices it is
connected to, the device ID of those Node devices, and the connected ports ID on the
Node devices. This information is fed back to the Director devices. With the help of
this information, the Director devices are able to formulate the complete topological
17
picture of the QFabric system.

This topological information is necessary in order to configure the forwarding tables
of the Node devices efficiently. The sequence of steps mentioned later in this chapter
will explain why the topological database is needed. (This topological database
contains information about how the Nodes are connected to the Interconnects).
Relation Between VCCPD and VCCPDf

All Juniper devices that run the Junos OS run a process called chassisd (chassis daemon). The chassisd process is responsible for monitoring and managing all the hardware-based components present on the device.
QFabric software also uses chassisd. Since there is a system discovery phase involved,
inventory management is a little different in this distributed architecture.
Here are the steps that take place internally with respect to system discovery, VCCPD,
and VCCPDf:
Nodes, Interconnects, and the VMs exchange VCCPD Hellos on the control plane
Ethernet network.
The Fabric Manager VM processes the VCCPD Hellos from the Nodes and the
Interconnects. The Fabric Manager VM then assigns a unique PFE-ID to each Node
and Interconnect. (The algorithm behind the generation of PFE-ID is Juniper
confidential and is beyond the scope of this book.)
This PFE-ID is also used to derive the internal IP address for the components.
After a Node or an Interconnect is detected by VCCPD, the FTE links are activated
and VCCPDf starts running on the 40GbE links.
Whenever a new 40GbE link is brought up on a Node or an Interconnect, this
information is sent back to the Fabric Manager so that it can update its view of the
topology. Note that any communication with the Fabric Manager is done using the
CPE network.
Whenever such a change occurs (an FTE link is added or removed), the Fabric
Manager recomputes the way data should be load balanced on the data plane. Note
that the load balancing does not take place per packet or per prefix. The QFabric
system applies an algorithm to find out the different FTE links through which other
Nodes can be reached. Consider that a Node has only one FTE link connected to an
Interconnect. At this point in time, the Node has only one way to reach the other
Nodes. Now if another FTE link is connected, then the programming would be
altered to make sure the next hop for some Nodes is FTE-1 and is FTE-2 for others.
With the help of both VCCPD and VCCPDf, the QFabrics Director devices are able to
get information about:
How many devices, and which ones (Nodes and Interconnects), are part of the
QFabric system (VCCPD).
How the Nodes are connected to the Interconnects (VCCPDf).
At this point in time, the QFabric system becomes ready to start forwarding traffic.
Now lets take a look at how VCCPD and VCCPDf become relevant when it comes to
a real life QFabric solution. Consider this sequence of steps:
18
1. The only connections present are the DG0-DG1 connections and the connections
between the Director devices and the EX Series VC.

Figure 1.1
1.1. Note that DG0 and DG1 would assign IP addresses of 1.1.1.1 and
1.1.1.2 respectively to their bond0 links. This is the link over which the
Director devices sync up with each other.
Only the Control Plane Connections are Up
1.2. The Fabric Manager VM running on the DGs would run VCCPD and
the DGs will send VCCPD Hellos on their links to the EX Series VC. Note
that there would be no VCCPD neighbors at this point in time as Nodes
and Interconnects are yet to be connected. Also, the Control plane switches
(EX Series VC) do not participate in the VCCPD adjacencies. Their
function is only to provide a Layer 2 segment for all the components to
communicate with each other.
Figure 1.2
19
Two Interconnects are Added to the CPE Network
2. In Figure 1.2 two Interconnects (IC-1 and IC-2) are connected to the EX Series VC.

2.1. The Interconnects start running VCCPD on the link connected to the EX
Series VC. The EX Series VC acts as a Layer 2 switch and only floods the
VCCPD packets.
2.2. The Fabric Manager VMs and the Interconnects see each others
VCCPD Hellos and become neighbors. At this point in time, the DGs know
that IC-1 and IC-2 are a part of the QFabric system.
20
Figure 1.3
Two Nodes are Connected to the CPE Network
3. In Figure 1.3 two new Node devices (Node-1 and Node-2) are connected to the
EX Series VC.

3.1. The Nodes start running VCCPD on the links connected to the EX
Series VC. Now the Fabric Manager VMs know that there are four devices
in the QFabric inventory: IC-1, IC-2, Node-1, and Node-2.
3.2. Note that none of the FTE interfaces of the Nodes are up yet. This
means that there is no way for the Nodes to forward traffic (there is no data
plane connectivity). Whenever such a condition occurs, Junos disables all
the 10GbE interfaces on the Node devices. This is a security measure to
make sure that a user cannot connect a production server to a Node device
that doesnt have any active FTE ports. This also makes troubleshooting
very easy. If all the 10GbE ports of a Node device go down even when
devices are connected to it, the first place to check should be the status of
the FTE links. If none of the FTE links are in the up/up state, then all the
10GbE interfaces will be disabled. In addition to bringing down all the
10GbE ports, the QFabric system also raises a major system alarm. The
alarms can be checked using the show system alarms CLI command.
Figure 1.4
21
Node-1 and Node-2 are Connected to IC-1 and IC-2, Respectively
4. In Figure 1.4 The following FTE links are connected:

4.1 Node-1 to IC-1.
4.2 Node-2 to IC-2.
4.3 The Nodes and the Interconnects will run VCCPDf on the FTE links and
see each other.
4.4 This VCCPDf information is fed to the Director devices. At this point in
time, the Directors know that:
There are four devices in the QFabric system. This was established at
point# 3.1.
Node-1 is connected to IC-1 and Node-2 is connected to IC-2.
5. Note that some of the data plane of the QFabric is connected, but there would be
no connectivity for hosts across Node devices. This is because Node-1 has no way to
reach Node-2 via the data plane and vice-versa, as the Interconnects are never
connected to each other. The only interfaces for the internal data plane of the
QFabric system are the 40GbE FTE interfaces. In this particular example, Node-1 is
connected to IC-1, but IC-1 is not connected to Node-2. Similarly, Node-2 is
connected to IC-2, but IC-2 is not connected to Node-1. Hence, hosts connected
behind Node-1 have no way of reaching hosts connected behind Node-2, and
vice-versa.
22
Figure 1.5
Node-1 is Connected to IC-2
6. In Figure 1.5 Node-1 is connected to IC-2. At this point, the Fabric Manager has
the following information:

6.1 There are four devices in the QFabric system.
6.2 IC-1 is connected to Node-1.
6.3 IC-2 is connected to both Node-1 as well as Node-2.

The Fabric Manager VM running inside the Director devices realizes
that Node-1 and Node-2 now have mutual reachability via IC-2.
FM programs the internal forwarding table of Node-1. Now Node-1
knows that to reach Node-2, it needs to have the next hop of IC-2.
FM programs the internal forwarding table of Node-2. Now Node-2
knows that to reach Node-1, it needs to have the next-hop of IC-2.
6.4 At this point in time, hosts connected behind Node-1 should be able to
communicate with hosts connected behind Node-2 (provided that the basic
laws of networking like VLAN, routing, etc. are obeyed).
Figure 1.6
23
Node-2 is Connected to IC-1, which Completes the DataPlane of the QFabric System
7. In Figure 1.6 Node-2 is connected to IC-1.

7.1 The Nodes and IC-1 discover each other using VCCPDf and send this
information to Fabric Manager VM running on the Directors.
7.2 Now the FM realizes that Node-1 can reach Node-2 via IC-1, also.
7.3 After the FM finishes programming the tables of Node-1 and Node-2,
both Node devices will have two next hops to reach each other. These two
next hops can be used for load-balancing purposes. This is where the
QFabric solution provides excellent High Availability and also effective
load balancing of different flows as we add more 40GbE uplinks to the
Node devices.
At the end of all these steps, the internal VCCPD and VCCPDf adjacencies of the
QFabric would be complete, and the Fabric Manager will have a complete topological view of the system.
24
Test Your Knowledge

Q: Name the different physical components of a QFabric system.
Nodes, Interconnects, Director devices, EX Series VCs.
Q: Which of these components are connected to each other?
The EX Series VC is connected to the Nodes, Interconnects, and the Director
devices.
The Director devices are connected to each other.
The Nodes are connected to the Interconnects.
Q: Where is the management IP address of the QFabric system configured?
During installation, the user is prompted to enter a VIP. This VIP is used for
remote management of the QFabric system.
Q: Which protocol is used for QFabric system discovery? Where does it run?
VCCPD is used for system discovery and it runs on all the components and
VMs. The adjacencies for VCCPD are established over the CPE network.
Q: Which protocol is used for QFabric data plane topology discovery? Where does it
run?
VCCPDf is used for discovering the topology on the data plane. VCCPDf runs
only on the Nodes and the Interconnects, and adjacencies for VCCPDf are
established on the 40GbE FTE interfaces.
Q: Which Junos process is responsible for the hardware inventory management of a
system?
Chassisd.
Chapter 2
Accessing Individual Components
Logging In to Various QFabric Components. . . . . . . . . . . . . . . . . . . . . . . . . . . . 26

Checking Logs at Individual Components . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
Enabling and Retrieving Trace Options From a Component: . . . . . . . . . . . . 35
Extracting Core Files. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
Checking for Alarms. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
Inbuilt Scripts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
Test Your Knowledge. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
26
Before this book demonstrates how to troubleshoot any problems, this chapter will
educate the reader about logging in to different components and how to check and
retrieve logs at different levels (physical and logical components) of a QFabric system.
Details on how to configure a QFabric system and aliases for individual Nodes are
documented in the QFabric Deployment Guide at www.juniper.net/documentation.
Logging In to Various QFabric Components

As discussed previously, a QFabric solution has many components, some of which
can be physical (Nodes, Interconnects, Director devices, CPE, etc.), and some of
which are logical VMs. One of the really handy features about QFabric is that it
allows administrators (those with appropriate privileges) to log in to these individual
components, an extremely convenient feature when an administrator needs to do
some advanced troubleshooting that requires logging in to a specific component of
the system.
The hardware inventory of any Juniper router or switch is normally checked using
the show chassis hardware command. QFabric software also supports this command, but there are multiple additions made to this command expressly for the
QFabric solution, and these options allow users to check the hardware details of a
particular component as well. For instance:
root@Test-QFABRIC>showchassishardware?
Possiblecompletions:
<[Enter]>Executethiscommand
clei-modelsDisplayCLEIbarcodeandmodelnumberfororderableFRUs
detailIncludeRAManddiskinformationinoutput
extensiveDisplayIDEEPROMinformation
interconnect-deviceInterconnectdeviceidentifier
modelsDisplayserialnumberandmodelnumberfororderableFRUs
node-deviceNodedeviceidentifier
|Pipethroughacommand
root@Test-QFABRIC>showchassishardwarenode-device?
<node-device>Nodedeviceidentifier
BBAK0431Nodedevice
Node0Nodedevice
Node1Nodedevice
--SNIP
Consider the following QFabric system:

root@Test-QFABRIC>showfabricadministrationinventory
ItemIdentifierConnectionConfiguration
Nodegroup
NW-NG-0ConnectedConfigured
Node0P6966-CConnected
Node1BBAK0431Connected
RSNG-1ConnectedConfigured
RSNG-2ConnectedConfigured
Interconnectdevice
IC-A9122ConnectedConfigured
A9122/RE0Connected
A9122/RE1Connected
IC-IC001ConnectedConfigured
IC001/RE0Connected
Chapter 2: Accessing Individual Components
27
IC001/RE1Connected
Fabricmanager
FM-0ConnectedConfigured
Fabriccontrol
FC-0ConnectedConfigured
Diagnosticroutingengine
DRE-0ConnectedConfigured
This output shows the alias and the serial number (mentioned under the Identifier
column) of every Node that is a part of the QFabric system. It also shows if the
Node is a part of an SNG, Redundant-SNG, or the Network Node Group.
The rightmost column of the output shows the state of each component. Each
component of the QFabric should be in Connected state. If a component shows up
as Disconnected, then there must be an underlying problem and troubleshooting is
required to find out the root cause.
As shown in Figure 2.1, this particular fabric system has six Nodes and two Interconnects. Node-0 and Node-1 are part of the Network Node Group, Node-2 and
Node-3 are part of a Redundant-SNG named RSNG-1, Node-4 and Node-5 are
part of another Redundant-SNG named RSNG-2.
Figure 2.1
Visual Representation of the QFabric System of This Section
There are two ways of accessing the individual components:

From the Linux prompt of the Director devices
From the QFabric CLI
Accessing Components From the DGs

All the components are assigned an IP address in the 169.254 IP-range during the
system discovery phase. Note that IP addresses in the 169.254.193.x range are used
28
to allot IP addresses to Node groups and the Interconnects and IPs in the
169.254.128.x range are allotted to Node devices and to VMs. These IP addresses are
used for internal management and can be used to log in to individual components
from the Director devices Linux prompt. The IP addresses of the components can be
seen using the dns.dump utility, which is located under /root on the Director devices.
Here is an example showing sample output from dns.dump and explaining how to
log in to various components:
[root@dg0~]#./dns.dump
;<<>>DiG9.3.6-P1-RedHat-9.3.6-4.P1.el5<<>>-taxfrpkg.dcbg.juniper.net@169.254.0.1
;;globaloptions:printcmd
pkg.dcbg.juniper.net.600INSOAns.pkg.dcbg.juniper.net.mail.pkg.dcbg.juniper.
net.104360060072003600
pkg.dcbg.juniper.net.600INNSns.pkg.dcbg.juniper.net.
pkg.dcbg.juniper.net.600INA169.254.0.1
pkg.dcbg.juniper.net.600INMX1mail.pkg.dcbg.juniper.net.
dcfnode---DCF-ROOT.pkg.dcbg.juniper.net.45INA169.254.192.17<<<<<<<DCFRoot(FM's)IPaddress
dcfnode---DRE-0.pkg.dcbg.juniper.net.45INA169.254.3.3
dcfnode-3b46cd08-9331-11e2-b616-00e081c53280.pkg.dcbg.juniper.net.45INA169.254.128.15
dcfnode-3d9b998a-9331-11e2-bbb2-00e081c53280.pkg.dcbg.juniper.net.45INA169.254.128.16
dcfnode-4164145c-9331-11e2-a365-00e081c53280.pkg.dcbg.juniper.net.45INA169.254.128.17
dcfnode-43b35f38-9331-11e2-99b1-00e081c53280.pkg.dcbg.juniper.net.45INA169.254.128.18
dcfnode-A9122-RE0.pkg.dcbg.juniper.net.45INA169.254.128.5
dcfnode-A9122-RE1.pkg.dcbg.juniper.net.45INA169.254.128.8
dcfnode-BBAK0431.pkg.dcbg.juniper.net.45INA169.254.128.20
dcfnode-default---FABC-INE-A9122.pkg.dcbg.juniper.net.45INA169.254.193.0
dcfnode-default---FABC-INE-IC001.pkg.dcbg.juniper.net.45INA169.254.193.1
dcfnode-default---NW-INE-0.pkg.dcbg.juniper.net.45INA169.254.192.34<<<<NW-INE'sIPaddress
dcfnode-default---RR-INE-0.pkg.dcbg.juniper.net.45INA169.254.192.35<<<<FC-0'sIPaddress
dcfnode-default---RR-INE-1.pkg.dcbg.juniper.net.45INA169.254.192.36
dcfnode-default-RSNG-1.pkg.dcbg.juniper.net.45INA169.254.193.11
dcfnode-IC001-RE0.pkg.dcbg.juniper.net.45INA169.254.128.6<<<<IC'sIPaddress
dcfnode-IC001-RE1.pkg.dcbg.juniper.net.45INA169.254.128.7
dcfnode-P1377-C.pkg.dcbg.juniper.net.45INA169.254.128.21
dcfnode-P6966-C.pkg.dcbg.juniper.net.45INA169.254.128.24<<<<<node'sIPaddress
dcfnode-P6972-C.pkg.dcbg.juniper.net.45INA169.254.128.23<<<<<node'sIPaddress
mail.pkg.dcbg.juniper.net.600INA169.254.0.1
ns.pkg.dcbg.juniper.net.600INA169.254.0.1
server.pkg.dcbg.juniper.net.600INA169.254.0.1
--snip--
These (boldfaced)169.254.0.x IP addresses can be used to log in to individual

components from the Director devices:
1. Log in to the Network Node Group VM. As seen in the preceding output, the IP
address for the Network Node Group VM is 169.254.192.34. CLI snippets show a
log in attempt to this IP address:
[root@dg0~]#sshroot@169.254.192.34
Theauthenticityofhost'169.254.192.34(169.254.192.34)'can'tbeestablished.
RSAkeyfingerprintis49:f0:9b:a0:bb:36:56:87:dd:c5:c5:21:2c:a6:71:e3.
Areyousureyouwanttocontinueconnecting(yes/no)?yes
Warning:Permanentlyadded'169.254.192.34'(RSA)tothelistofknownhosts.
root@169.254.192.34'spassword:
root@NW-NG-0%cli
29
{master}
root@NW-NG-0>exit
root@NW-NG-0%exit
logout
Connectionto169.254.192.34closed.
2. Log in to FM. The IP address for Fabric Manager VM is 169.254.192.17.

[root@dg0~]#sshroot@169.254.192.17
RSAkeyfingerprintis49:f0:9b:a0:bb:36:56:87:dd:c5:c5:21:2c:a6:71:e3.
root@FM-0%
root@FM-0%
Similarly, the corresponding IP addresses mentioned in the output of dns.dump can

be used to log in to other components like the Fabric Control VM or the Nodes or
the Interconnects.
NOTE
The Node devices with serial numbers P1377-C and P4423-C are a part of the Node
group named RSNG-1. This information is present in the output of show fabric
administration inventory shown above.
As mentioned previously, the RSNG abstraction works on the concept of a Virtual
Chassis. Here is a CLI snippet showing the result of a login attempt to the IP
addresses of the Nodes which are a part of RSNG-1:
[root@dg0~]#./dns.dump|grepRSNG-1
[root@dg0~]#sshroot@169.254.193.11
root@RSNG-1%
root@RSNG-1%
root@RSNG-1%cli
{master}<<<<<<<<<Masterprompt.Chapter-3discussesmoreaboutmaster/
backupREswithinvariousNode-Groups
root@RSNG-1>showvirtual-chassis
PreprovisionedVirtualChassis
VirtualChassisID:0000.010b.0000
Mstr
MemberIDStatusModelprioRoleSerialNo
0(FPC0)Prsntqfx3500128Master*P4423-C
1(FPC1)Prsntqfx3500128BackupP1377-C
{master}
root@RSNG-1>
And here logging in to the RSNG-Master (P4423-C):

[root@dg0~]#./dns.dump|grepP4423
30
[root@dg0~]#sshroot@169.254.128.19
RSAkeyfingerprintis9e:aa:da:bb:8d:e4:1b:74:0e:57:af:84:80:c3:a8:9d.
root@RSNG-1%
root@RSNG-1%cli
{master}<<<<<<<<RSNG-masterprompt
3. Log in to RSNG-backup. With the above-mentioned information, its clear that

logging in to the RSNG-backup will take users to the RSNG-backup prompt.
Captures from the device are shown here:
[root@dg0~]#sshroot@169.254.128.21
root@RSNG-1-backup%
This is expected behavior as an RSNG works on the concept of Junipers Virtual

Chassis technology.
If any component level troubleshooting needs to be done at an RSNG level, then the
user must log in to the RSNG master Node to proceed. This is because the Routing
Engine of the RSNG master Node will be active at all times and logs collected from
this Node would be relevant to the Redundant SNG abstraction.
4. Log in to the line cards of the NW-NG-0 VM. As discussed earlier, the RE
functionality for the NW-NG-0 is located as redundant Virtual Machines on the
DGs. This means that the REs on the line cards are deactivated. Consider the
following configuration on the Network Node Group VM on this QFabric system:
[root@dg0~]#./dns.dump|grepNW-IN
dcfnode-default---NW-INE-0.pkg.dcbg.juniper.net.45INA169.254.192.34
dcfnode-default---NW-INE-0.pkg.dcbg.juniper.net.45INA169.254.192.34
[root@dg0~]#
[root@dg0~]#sshroot@169.254.192.34
root@NW-NG-0%cli
root@NW-NG-0>showvirtual-chassis
VirtualChassisID:0000.0022.0000
Mstr
0(FPC0)Prsntqfx35000LinecardP6966-C
1(FPC1)Prsntqfx35000LinecardBBAK0431
8(FPC8)Prsntfx-jvre128Backup3b46cd08-9331-11e2-b616-00e081c53280
9(FPC9)Prsntfx-jvre128Master*3d9b998a-9331-11e2-bbb2-00e081c53280
{master}
Nodes P6966-C and BBAK0431 are the line cards of this NW-NG-0 VM. Since the
REs of these Node devices are not active at all, there is no configuration that is pushed
down to the line cards. Here are the snippets from the login prompt of the member
Nodes of the Network Node Group:
31
[root@dg0~]#sshroot@169.254.128.24
RSAkeyfingerprintisf6:64:18:f5:9d:8d:29:e7:95:c0:d7:4f:00:a7:3d:30.
Permissiondenied,pleasetryagain.
Note that since no configuration is pushed down to the line cards in the case of
NW-NG-0, it means that a user cant log in to the line cards (the credentials
wouldnt work as the configuration is not pushed to the line cards at all). You need
to connect to the line cards using their console if you intend to check details on the
NWNG line cards. Also, note that the logs from the line cards will reflect on the logs
located in /var/log/messages file on the NW-NG-0 VM. One more method to log in
to the Nodes belonging to the NW-NG is to telnet to them. However, note that no
configuration is visible on these Nodes as their Routing Engines are disabled and
hence no configuration is pushed to them.
5. From the CLI. Logging in to individual components requires user-level privileges,
which allow such logins. The remote-debug-permission CLI setting needs to be
configured for this. Here is the configuration used on QFabric mentioned in this
chapter:
root@Test-QFABRIC>showconfigurationsystem
host-nameTest-QFABRIC;
authentication-order[radiuspassword];
root-authentication{
encrypted-password"$1$LHY6NN4P$cnOMoqUj4OXKMaHOm2s.Z.";##SECRET-DATA
remote-debug-permissionqfabric-admin;
}
There are three permissions that you can set at this hierarchy:
qfabric-admin: Permits a user to log in to individual QFabric switch components, issue show commands, and to change component configurations.
qfabric-operator: Permits a user to log in to individual QFabric switch components and issue show commands.
qfabric-user: Prevents a user from logging in to individual QFabric switch
components.
Also, note that a user needs to have admin control privileges to add this statement to
the devices configuration.
MORE? Complete details on QFabrics system login classes can be found at this link: http://
www.juniper.net/techpubs/en_US/junos13.1/topics/concept/access-login-class-qfabric-overview.html.
Once a user has the required remote debug permission set, they can access the
individual components using the request component login command:
root@Test-QFABRIC>requestcomponentlogin?
<node-name>Inventorynamefortheremotenode
A9122/RE0Interconnectdevicecontrolboard
A9122/RE1Interconnectdevicecontrolboard
BBAK0431Nodedevice
--SNIP--
32
Checking Logs at Individual Components

On any device running Junos, you can find the logs under the /var/log directory. The
same holds true for a QFabric system, but since a QFabric system has multiple
components you have different points from where the logs can be collected:
SNG Nodes
RSNG Nodes
NW-NG VM
Central SFC (QFabric CLI prompt)
DGs Linux prompt
The log collection for SNG, Redundant SNG, and Network Node Group abstractions are straightforward. All the logs pertaining to these Node groups are saved
locally (on the device/VM/active-REs filesystem) under the /var/log directory. Note
that this location on the NW-INE line cards will not yield any useful information, as
the RE on the line cards is not active. Also, in accordance with standard Junos
behavior, the usual show log <filename> command is valid from the active REs of
these Node groups.
The logs collected under /var/log/messages file on the DGs are a collection of all the
logs from all the components. In addition to those logs, this file also records important logs that are relevant to the healthy functionality of the DGs (mgd, mysql-database, DG-sync related messages, etc.). Since Junos 13.1, you can issue the show log
messages Node device <name> to check the log messages from a specific component.
In older releases, you need to make use of the match keyword to see all the logs that
have the name of that particular Node device as a part of the message. So from older
releases:
root@Test-QFABRIC>showlogmessages|matchRSNG-1|last10
Apr0801:58:40Test-QFABRIC:QFABRIC_INTERNAL_SYSLOG:RSNG-1backup:-lastmessagerepeated3times
Apr0801:58:44Test-QFABRICchassism[1446]:QFABRIC_INTERNAL_SYSLOG:RSNG-1backup:-Fan2isNOTspinningcorrectly
And from 13.1 onwards, you can check the logs for a specific component from the
QFabric CLI:
root@qfabric>showlogmessages?
<component>
director-deviceShowlogsfromadirectordevice
infrastructure-deviceShowlogsfromainfrastructuredevice
interconnect-deviceShowlogsfromainterconnectdevice
node-deviceShowlogsfromanodedevice
root@qfabric>
How to Log in to Different Components and Fetch Logs

Note that the Identifier column does not show the serial number of the Nodes in this
case. This is because no aliases have been assigned to the Nodes. For such a system,
the serial numbers of the Nodes are used as their aliases.
Checking the logs of an SNG:
33
root@qfabric>showfabricadministrationinventory
Nodegroup
BBAK1280ConnectedConfigured
BBAK1280Connected
BBAM7499ConnectedConfigured
BBAM7499Connected
BBAM7543Connected
BBAM7560Connected
BBAP0747ConnectedConfigured
BBAP0747Connected
BBAP0748Connected
BBAP0750Connected
BBPA0737ConnectedConfigured
BBPA0737Connected
NW-NG-0ConnectedConfigured
BBAK6318Connected
BBAM7508Connected
P1602-CConnectedConfigured
P1602-CConnected
P2129-CConnected
P3447-CConnected
P4864-CConnected
Interconnectdevice
IC-BBAK7828ConnectedConfigured
BBAK7828/RE0Connected
Fabricmanager
FM-0ConnectedConfigured
Fabriccontrol
Diagnosticroutingengine
DRE-0ConnectedConfigured
root@qfabric>
[root@dg0~]#./dns.dump|grepBBAK1280
dcfnode-BBAK1280.pkg.dcbg.juniper.net.45INA 169.254.128.7
dcfnode-default---BBAK1280.pkg.dcbg.juniper.net.45INA169.254.193.2
dcfnode-BBAK1280.pkg.dcbg.juniper.net.45INA 169.254.128.7
dcfnode-default---BBAK1280.pkg.dcbg.juniper.net.45INA169.254.193.2
[root@dg0~]#sshroot@169.254.128.7
RSAkeyfingerprintisa3:3e:2f:65:9d:93:8f:e3:eb:83:08:c3:01:dc:b9:c1.
Password:
---JUNOS13.1I20130306_1309_dc-builderbuilt2013-03-0614:56:57UTC
root@BBAK1280%
After logging in to a component, the logs can be viewed either by using the show log
<filename> Junos CLI command or by logging in to the shell mode and checking out
the contents of the /var/log directory:
34
root@BBAK1280>showlogmessages?
<filename>Nameoflogfile
messagesSize:185324,Lastchanged:Jun0220:59:46
messages.0.gzSize:4978,Lastchanged:Jun0100:45:00
--SNIP-root@BBAK1280>showlogchassisd?
<filename>Nameoflogfile
chassisdSize:606537,Lastchanged:Jun0220:41:04
chassisd.0.gzSize:97115,Lastchanged:May1805:36:24
root@BBAK1280>
root@BBAK1280>exit
root@BBAK1280%cd/var/log
root@BBAK1280%ls-lrt|grepmessages
-rw-rw----1rootwheel4630May1113:45messages.9.gz
-rw-rw----1rootwheel4472May1321:45messages.8.gz
--SNIP-root@BBAK1280%
This is true for other components as well (Nodes, Interconnects, and VMs). However, the rule of checking logs only at the active RE of a component still applies.
Checking the Logs at the Director Devices

Checking the logs at the Director devices is quite interesting because the Director
devices are the brains of QFabric and they run a lot of processes/services that are
critical to the health of the system. Some of the most important logs on the Director
devices can be found here:
/var/log
/tmp
/vmm
/var/log
As with any other Junos platform, /var/log is a very important location as far as log
collection is concerned. But there is also the /var/log/messages file, which records the
general logs that are recorded for the DG devices.
[root@dg0tmp]#cd/var/log
[root@dg0log]#ls
add_device_dg0.logcron.4.gzmessagessecure.3.gz
anaconda.logcupsmessages.1.gzsecure.4.gz
--SNIP--
/tmp
This is the location that contains all the logs pertaining to the configuration push
events within the subdirectory named sfc-captures. Whenever a configuration is
committed on the QFabric CLI, it is pushed to the various components. The logs
pertaining to these processes can be found at this location and in all the core files:
[root@dg0sfc-captures]#cd/tmp
[root@dg0tmp]#ls
1296.sfcauth26137.sfcauth32682.sfcauthcorefiles
--SNIP-[root@dg0tmp]#cdsfc-captures/
35
[root@dg0sfc-captures]#ls
03170323032903350341034703530359036503710377
03180324033003360342034803540360036603720378
0319032503310337034303490355036103670373last.txt
0320032603320338034403500356036203680374misc
0321032703330339034503510357036303690375sfc-database
0322032803340340034603520358036403700376
Enabling and Retrieving Trace Options From a Component

After logging in to a specific component, you can use show commands from the
operational mode. However, access to configuration mode is not allowed:
root@Test-QFABRIC>requestcomponentloginNW-NG-0
Warning:Permanentlyadded'dcfnode-default---nwine-0,169.254.192.34'(RSA)tothelistofknownhosts.
{master}
qfabric-admin@NW-NG-0>
qfabric-admin@NW-NG-0>conf<<<<configure-modeinaccessible
^
unknowncommand.
{master}
qfabric-admin@NW-NG-0>edit
^
unknowncommand.
{master}
A big part of troubleshooting any networking issue is enabling trace options and
analyzing the logs. To enable trace options on a specific component, the user needs to
have superuser access. Here is how that can be done:
qfabric-admin@NW-NG-0>startshell
%su
Password:
root@NW-NG-0%cli
{master}
fabric-admin@NW-NG-0>configure
Enteringconfigurationmode
{master}[edit]
qfabric-admin@NW-NG-0#setprotocolsospftraceoptionsflagall
{master}[edit]
qfabric-admin@NW-NG-0#commit<<<<commitatcomponent-level
commitcomplete
{master}[edit]
NOTE
If any trace options are enabled at a component level, and a commit is done from the
QFabric CLI, then the trace options configured at the component will be removed.
MORE? There is another method of enabling trace options on QFabric and it is documented
at the following KB article: http://kb.juniper.net/InfoCenter/
index?page=content&id=KB21653.
Whenever trace options are configured at a component level, the corresponding file
containing the logs is saved on the file system of the active RE for that component.
36
Note that there is no way that an external device connected to QFabric can connect
to the individual components of a QFabric system.
Because the individual components can be reached only by the Director devices, and
the external devices (say, an SNMP server) can only connect to the DGs as well, you
need to follow this procedure to retrieve any files that are located on the file system
of a component:
1. Save the file from the component on to the DG.
2. Save the file from the DG to the external server/device.
This is because the management of the whole QFabric system is done using the VIP
that is allotted to the DGs. Since QFabric is made up of a lot of physical components, always consider a QFabric system as a network of different devices. These
different components are connected to each other on a common LAN segment,
which is the control plane Ethernet segment. In addition to this, all the components
have an internal management IP address in the 169.254 IP address range. These IP
addresses can be used to copy files between different components.
Here is an example of how to retrieve log files from a component (NW-NG in this
case):
root@NW-NG-0%ls-lrt/var/log|grepospf
-rw-r-----1rootwheel59401Apr808:26ospf-traces<<<<thelogfileissavedat/var/
logontheNW-INEVM
root@NW-NG-0%exit
logout
Connectionto169.254.192.34closed.
[root@dg0~]#./dns.dump|grepNW-INE
dcf-default---NW-INE-0.pkg.dcbg.juniper.net.45INA169.254.192.34
dcf-default---NW-INE-0.pkg.dcbg.juniper.net.45INA169.254.192.34
[root@dg0~]#
[root@dg0~]#
[root@dg0~]#ls-lrt|grepospf
[root@dg0~]#
[root@dg0~]#
[root@dg0~]#scproot@169.254.192.34://var/log/ospf-traces.
ospf-traces100%59KB59.0KB/s00:00
[root@dg0~]#ls-lrt|grepospf
-rw-r-----1rootroot60405Apr801:27ospf-traces
Here, youve successfully transferred the log file to the DG. Since the DGs have
management access to the gateway, you can now transfer this file out of the QFabric
system and onto the required location.
Extracting Core Files

There might be a situation in which one of the processes of a component writes a
core file onto the file system. The core files are also saved at a specific location on the
Director devices as well. Consider the following output:
root@TEST-QFABRIC>showsystemcore-dumps
Repositoryscope:shared
Repositoryhead:/pbdata/export
Listofnodesforcorerepository:/pbdata/export/rdumps/<<<<Allthecoresaresavedhere
Just like trace options, core files are also saved locally on the file system of the
components. These files can be retrieved the same way as trace option files are
retrieved:
37
First, save the core file from the component onto the DG.
Once the file is available on the DG it can be accessed via other devices that
have IP connectivity to the Director devices.
Checking for Alarms

QFabric components have LEDs that can show or blink red or amber in case there is
an alarm.
In addition to this, the administrator can check the active alarms on a QFabric system
by executing the show chassis alarms command from the CLI, and the output shows
the status of alarms for all the components of a QFabric system.
Since QFabric has many different Nodes and IC devices, you have additional CLI
extensions to the show chassis alarms command to be able to check the alarms
related to a specific Node/IC, as shown in this Help output:
root@qfabric>showchassisalarms?
interconnect-deviceInterconnectdeviceidentifier
node-deviceNodedeviceidentifier
Inbuilt Scripts
There are several inbuilt scripts in the QFabric system that can be run to check the
health of or gather additional information about the system. These scripts are present
in the /root directory of the DGs. Most of the inbuilt scripts are leveraged by the
system in the background (to do various health checks on the QFabric system). The
names of the scripts are very intuitive and here are a few that can be extremely useful:
dns.dump: Shows the IP addresses corresponding to all the components (its
already been used multiple times in this book).
createpblogs: This script gathers the logs from all the components and stores it
as /tmp/pblogs.tgz. From Junos 12.3 and up, this log file is saved at /pbdata/
export/rlogs/ location. This script is extremely useful to have when troubleshooting QFabric. Best practice suggests running this script before and after
every major change that is done on the QFabric system. That way youll know
how the logs looked before and then after the change, something useful for both
JTAC and yourself when it comes time to troubleshoot issues.
pingtest.sh: This script pings all the components of the QFabric system and
reports their status. If any of the Nodes are not reachable, then a suitable status
is shown for that Node. Here is what a sample output would look like:
[root@dg1~]#./pingtest.sh
---->Detectednewhostdcfnode---DCF-ROOT
dcfnode---DCF-ROOT-ok
---->Detectednewhostdcfnode---DRE-0
dcfnode---DRE-0-ok
---->Detectednewhostdcfnode-13daf6fc-9b6c-11e2-bafc-00e081ce1e76
dcfnode-13daf6fc-9b6c-11e2-bafc-00e081ce1e76-ok
---->Detectednewhostdcfnode-150d8a4e-9b6c-11e2-a1ae-00e081ce1e76
dcfnode-150d8a4e-9b6c-11e2-a1ae-00e081ce1e76-ok
---->Detectednewhostdcfnode-16405946-9b6c-11e2-a345-00e081ce1e76
dcfnode-16405946-9b6c-11e2-a345-00e081ce1e76-ok
38
---->Detectednewhostdcfnode-17732b54-9b6c-11e2-a937-00e081ce1e76
dcfnode-17732b54-9b6c-11e2-a937-00e081ce1e76-ok
---->Detectednewhostdcfnode-226b5716-9b80-11e2-aea7-00e081ce1e76
--snip--
dcf_sfc_show_versions: Shows the software version (revision number) running

on the SFC component. Also shows the versions of various daemons running
on the system.
CAUTION
Certain scripts can cause some traffic disruption and hence should never be run on a
QFabric system that is carrying production traffic, for instance: format.sh, dcf_sfc_
wipe_cluster.sh, reset_initial_configuration.sh.
Test Your Knowledge

Q: Which CLI command can be used to view the hardware inventory of the Nodes
and Interconnects?
The show chassis hardware command can be used to view the hardware
inventory. This is a Junos command that is supported on other Juniper platforms as well. For a QFabric system, additional keywords can be used to view
the hardware inventory of a specific Node or Interconnect.
Q: Which CLI command can be used to display all the individual components of a
QFabric system?
The show fabric administration inventory command lists all the components of the QFabric system and their current states.
Q: What are the two ways to log in to the individual components?
From the Linux prompt of the Director devices.
From the CLI using the request
component login command.
Q: What is the IP address range that is allocated to Node groups and Node devices?
Node devices: 169.254.128.x
Node groups: 169.254.193.x
Q: What inbuilt script can be used to obtain the IP address allocated to the different
components of a QFabric system?
The dns.dump script is located at
/root directory on the Director devices.
Chapter 3
Control Plane and Data Plane Flows
Control Plane and Data Plane. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40

Routing Engines . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
Route Propagation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
Maintaining Scale . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
Distributing Routes to Different Nodes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
Differences Between Control Plane Traffic and Internal Control Plane
Traffic. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
Test Your Knowledge. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
40
One of the goals of this book is to help you efficiently troubleshoot an issue on a
QFabric system. To achieve this, its important to understand exactly how the internal
QFabric protocols operate and the packet flow of both the data plane and control
plane traffic.
Control Plane and Data Plane

Junipers routing and switching platforms, like the MX Series and the EX Series, all
implement the concept of separating the data plane from the control plane. Here is a
quick explanation:
The control plane is responsible for a devices interaction with other devices and
for running various protocols. The control plane of a device resides on the CPU
and is responsible for forming adjacencies and peerings, and for learning routes
(Layer 2 or Layer 3). The control plane sends the information about these
routes to the data plane.
The data plane resides on the chip or ASIC and this is where the actual packet
forwarding takes place. Once the control plane sends information about specific
routes to the data plane, the forwarding tables on the ASIC are populated
accordingly. The data plane takes care of functions like forwarding, QoS,
filtering, packet-parsing, etc. The performance of a device is determined by the
quality of its data plane (also called the Packet Forwarding Engine or PFE).
Routing Engines
This chapter discusses the path of packets for control plane and data plane traffic.
The following are bulleted lists about the protocols run on these abstractions or Node
groups.
Server Node Group (SNG)

As previously discussed, when a Node is connected to a QFabric system for the
first time, it comes up as an SNG. Its considered to be a Node group with only
one Node.
The SNG is designed to be connected to servers and devices that do not need
cross-Node resiliency.
The SNG doesnt run any routing protocols, needs to run only host-facing
protocols like LACP, LLDP, ARP.
The Routing Engine functionality is present on the local CPU. This means that
MAC-addresses are learned locally for the hosts that are connected directly to
the SNG.
The local PFE has the data plane responsibilities.
See Figure 3.1.
Figure 3.1
Chapter 3: Control Plane and Data Plane Flows
41
Server Node Group (SNG)
Redundant Server Node Group (RSNG)

Two independent SNGs can be combined (using configuration) to become an
RSNG.
The RSNG is designed to be connected to servers/devices that need cross-Node
resiliency.
Common design: At least one NIC of a server is connected to each Node of an
RSNG. These ports are bundled together as a LAG (LACP or static-LAG).
Doesnt run any routing-protocols, needs to run only host-facing protocols
like LACP, LLDP, and ARP.
The Routing Engine functionality is active/passive (only one Node has the
active RE, the other stays in backup mode). This means that the MAC addresses of switches/hosts connected directly to the RSNG-Nodes are learned on the
active-RE of the RSNG.
The PFEs of both the Nodes are active and forward traffic at all times.
See Figure 3.2.
42
Figure 3.2
Redundant Server Node Group (RSNG)
Network Node Group (NW-NG)

Up to eight Nodes can be configured to be part of the NW-NG.
The NW-NG is designed provide connectivity to routers, firewalls, switches,
etc.
Common design: Nodes within NW-NG connect to routers, firewalls, or other
important Data Center devices like load-balancers, filters, etc.
Runs all the protocols available on RSNG/SNG. It can also run protocols like
RIP, OSPF, BGP, xSTP, PIM, etc.
The Routing Engine functionality is located on VMs that run on the DGs; these
VMs are active/passive. The REs of the Nodes are disabled. This means that the
MAC addresses of the devices connected directly to the NW-NG are learned on
the active NW-NG-VM. Also, if the NW-NG is running any Layer 3 protocols
with the connected devices (OSPF, BGP, etc.), then the routes are also learned by
the active NW-NG-VM.
The PFE of the Nodes within an NW-NG is active at all times.
See Figure 3.3.
Figure 3.3
43
Network Node Group
Route Propagation
As with any other internetworking device, the main job of QFabric is to send traffic
end-to-end. To achieve this, the system needs to learn various kinds of routes (such
as Layer 2 routes, Layer 3 routes, ARP, etc.).
As discussed earlier, there can be multiple active REs within a single QFabric system.
Each of these REs can learn routes locally, but a big part of understanding how
QFabric operates is to know how these routes are exchanged between various REs
within the system.
One approach to exchanging these routes between different REs is to send all the
routes learned on one RE to all the other active REs. While this is simple to do, such
an implementation will be counter productive because all the routes eventually will
need to be pushed down to the PFE so that hardware forwarding can take place. If
you send all the routes to every RE, then the complete scale of the QFabric comes
down to the table limits of a single PFE. It means that the scale of the complete
QFabric solution is as good as the scale of a single RE. This is undesirable and the
next section discusses how Junipers QFabric technology maintains scale with a
distributed architecture.
44
Maintaining Scale
One of the key advantages of the QFabric architecture is its scale. The scaling numbers of MAC addresses and IP addresses obviously depend on the number of Nodes
that are a part of a QFabric system because the data plane always resides on the
Nodes and you need the routes to be programmed in the PFE (the data plane) to
ensure end-to-end traffic forwarding.
As discussed earlier, all the routes learned on an RE are not sent to every other RE.
Instead, an RE receives only the routes that it needs to forward data. This poses a big
question: What parameters decide if a route should be sent to a Nodes PFE or not?
The answer is: it depends on the kind of route. The deciding factor for a Layer 2 route
is different from the factor for a Layer 3 route. Lets examine them briefly to understand these differences.
Layer 2 Routes
A Layer 2 route is the combination of a VLAN and a MAC address (VLAN-MAC
pair) and the information stored in the Ethernet switching table of any Juniper EX
Series switch. Now, Layer 2 traffic can be either unicast or BUM (Broadcast, Unknown-unicast, or Multicast in which all three kinds of traffic would be flooded
within the VLAN).
Figure 3.4 is a representation of a QFabric system where Node-1 has active ports in
VLANs 10 and 20 connected to it, Node-2 has hosts in VLANs 20 and 30 connected
to it, and both Node-3 and Node-4 have hosts in VLANs 30 and 40 connected to it.
Active ports means that the Nodes either have hosts directly connected to them, or
that the hosts are plugged into access switches and these switches plug into the
Nodes. For the sake of simplicity, lets assume that all the Nodes shown in Figure 3.4
are SNGs, meaning that for this section, the words Node and RE can be used interchangeably.
Figure 3.4
A Sample of QFabrics Nodes, ICs, and connected Hosts
45
Layer 2 Unicast Traffic
Consider that Host-2 wants to send traffic to Host-3, and lets assume that the MAC
address of Host-3 is already learned. This would cause traffic to be Layer 2 Unicast
traffic as both the source and destination devices are in the same VLAN. When
Node-1 sees this traffic coming in from Host-2, all of it should be sent over to
Node-2 internally within the QFabric example in Figure 3.4. When Node-2 receives
this traffic, it should be sent unicast to the port when the host is connected. This
kind of communication means:
Host-3s MAC address is learned on Node-2. There should be some way to
send this layer-2-routes information over to Node-1. Once Node-1 has this
information, it knows that everything destined to Host-3s MAC address
should be sent to Node-2 over the data plane of the QFabric.
This is true for any other host in VLAN-20 that is connected on any other
Node.
Note that if Host-5 wishes to send some traffic to Host-3, then this traffic must
be routed at Layer 3, as these hosts are in different VLANs. The regular laws
of networking would apply in this case and Host-1 would need to resolve the
ARP for its gateway. The same concept would apply if Host-6 wishes to send
some data to Host-3. Since none of the hosts behind Node-3 ever need to
resolve the MAC address of Host-3 to be able to send data to it, there is no
need for Node-2 to advertise Host-3s MAC address to Node-3. However, this
would change if a new host in VLAN-20 is connected behind Node-3.
Conclusion: if a Node learns of a MAC address in a specific VLAN, then this MAC
address should be sent over to all the other Nodes that have an active port in that
particular VLAN. Note that this communication of letting other Nodes know about
a certain MAC address would be a part of the internal Control Plane traffic within
the QFabric system. This data will not be sent out to devices that are connected to
the Nodes of the QFabric system. Hence, for Layer 2 routes, the factor that decides
whether a Node gets that route or not is the VLAN.
Layer 2 BUM Traffic
Lets consider that Host-4 sends out Layer 2 broadcast traffic, that is, frames in
which the destination MAC address is ff:ff:ff:ff:ff:ff and that all this traffic should be
flooded in VLAN-30. In the QFabric system depicted in Figure 3.4, there are three
Nodes that have active ports in VLAN-30: Node-2, Node-3, and Node-4. What
happens?
All the broadcast traffic originated by the Host-4 should be sent internally to
Node-3 and Node-4 and then these Nodes should be able to flood this traffic
in VLAN-30.
Since Node-1 doesnt have any active ports in VLAN-30, it doesnt need to
flood this traffic out of any revenue ports or server facing ports. This means
that Node-2 should not send this traffic over to Node-1. However, at a later
time, if Node-1 gets an active port in VLAN-30 then the broadcast traffic will
be sent to Node-1 as well.
These points are true for BUM traffic assuming that IGMP snooping is
disabled.
46
In conclusion, if a Node receives BUM traffic in a VLAN, then all that traffic should
be sent over to all the other Nodes that have an active port in that VLAN and not to
those Nodes that do not have any active ports in this VLAN.
Layer 3 Routes
Layer 3 routes are good old unicast IPv4 routes. Note that only the NW-NG-VM has
the ability to run Layer 3 protocols with externally connected devices, hence at any
given time, the active NW-NG-VM has all the Layer 3 routes learned in all the
routing instances that are configured on a given QFabric system. However, not all
these routes are sent to the PFE of all the Nodes within an NW-NG.
Lets use Figure 3.5, which represents a Network Node Group, for the discussion of
Layer 3 unicast routes. All the Nodes shown are the part of NW-NG. Host-1 is
connected to Node-1, Host-2 and Host-3 are connected to Node-2, and Host-4 is
connected to Node-3. You can see that all the IP addresses and the subnets are shown
as well. Additionally, the subnets for Host-1 and Host-2 are in routing instance RED,
whereas the subnets for Host-3 and Host-4 are in routing instance BLUE. The default
gateways for these hosts are the Routed VLAN Interfaces (RVIs) that are configured
and shown in the diagram in Figure 3.5.
Lets assume that there are hosts and devices connected to all three Nodes in the
default (master) routing instance, although not shown in the diagram. The case of
IPv4 routes is much simpler than Layer 2 routes. Basically, its the routing instance
that decides if a route should be sent to other REs or not.
Figure 3.5
Network Node Group and Connected Hosts
47
In the QFabric configuration shown in Figure 3.5, the following takes place:
Node-1 and Node-2 have one device each connected in routing instance RED.
The default gateway (interface vlan.100) for these devices resides on the active
NW-NG-VM, meaning that the NW-NG-VM has two direct routes in this
routing instance, one for the subnet 1.1.1.0/24 and the other for 2.2.2.0/24.
Since the route propagation deciding factor for Layer 3 routes is the routing
instance, the active NW-NG-VM sends the route of 1.1.1.0/24 and 2.2.2.0/24s
to both Node-1 and Node-2 so that these routes can be programmed in the data
plane (PFE of the Nodes).
The active NW-NG-VM will not send the information about the directly
connected routes in routing instance BLUE over to Node-1 at all. This is
because Node-1 doesnt have any directly connected devices in the BLUE
routing instance.
This is true for all kinds of routes learned within a routing instance; they could
either be directly connected, static, or learned via routing protocols like BGP,
OSPF, or IS-IS.
All of the above applies to Node-2 and Node-3 for the routing instance named
BLUE.
All of the above applies to SNGs, RSNGs, and the master routing instance.
In conclusion, the route learning always takes place at the active NW-NG-VM and
only selective routes are propagated to the individual Nodes for programming the
data plane (PFE of the Nodes). The individual Nodes get the Layer 3 routes from the
active NW-NG-VM only if the Node has an active port in that routing instance.
This concept of sending routes to an RE/PFE only if it needs that route ensures that
you do not send all the routes everywhere. Thats the high scale at which a QFabric
system can operate. Now lets discuss how are those routes are sent over to different
Nodes.
Distributing Routes to Different Nodes

The QFabric technology uses the concept of Layer 3 MPLS-VPN (RFC-2547)
internally to make sure that a Node gets only the routes that it needs. RFC2547
introduced the concept of Route Distinguishers (RD) Route Targets (RT).
QFabric technology also uses the same concept to make sure that a route gets only the
routes that it needs. Lets again review the different kind of routes in Table 3.1.
Table 3.1
Layer 2 and Layer 3 Route Comparsion
Layer 2 Routes
Layer 3 Routes
Deciding factor is the VLAN.
Deciding factor is the routing instance.
Internally, each VLAN is assigned a token.
Internally, each routing instance is assigned a token.
This token acts as the RD/RT and acts as the deciding

factor about whether a route should be sent to a Node or
not.
This token acts as the RD/RT and acts as the deciding factor about
whether a route should be sent to a Node or not.
48
Each active RE within the QFabric system forms a BGP peering with the VMs called
FC-0 and FC-1. All the active REs send all Layer 2 and Layer 3 routes over to the
FC-0 and FC-1 VMs via BGP. These VMs only send the appropriate routes over to
individual REs (only the routes that the REs need).
The FC-0 and FC-1 VMs act as route reflectors. However, these VMs follow the
rules of QFabric technology when it comes to deciding which routes to be sent to
which RE (not sending the routes that an RE doesnt need).
Figure 3.6 shows all the components (SNG, RSNG, and NW-NG VMs) sending all
of their learned routes (Layer 2 and Layer 3) over to the Fabric Control VM.
Figure 3.6
Different REs Send Their Learned Routes to the Fabric Control VM
However, the Fabric Control VM sends only those routes to a component that are
relevant to it. In Figure 3.7, the different colored arrows signify that the relevant
routes that the Fabric Control VMs send to each component may be different.
Lets look at some show command snippets that will demonstrate how a local route
gets sent to the FC, and then how the other Nodes see it. They will be separated into
Layer 2 and Layer 3 routes, and most of the snippets have bolded notes preceded by
<<.
Figure 3.7
Fabric Control VM Sends Only Relevant Routes to the Individual REs
Show Command Snippets for Layer 2 Routes

In the following example, the MAC address of ac:4b:c8:f8:68:97 is learned on
MLRSNG01a:xe-0/0/8.0:
root@TEST-QFABRIC#runshowethernet-switchingtablevlan709<<fromtheQFabricCLI
Ethernet-switchingtable:6unicastentries
VLANMACaddressTypeAgeInterfaces
V709*Flood-NW-NG-0:All-members
RSNG0:All-members
V70900:00:5e:00:01:01Learn0NW-NG-0:ae0.0
V7093c:94:d5:44:dd:c1Learn2:17NW-NG-0:ae34.0
V70940:b4:f0:73:42:01Learn2:25NW-NG-0:ae36.0
V70940:b4:f0:73:9e:81Learn2:06NW-NG-0:ae38.0
V709ac:4b:c8:83:b7:f0Learn2:04NW-NG-0:ae0.0
V709ac:4b:c8:f8:68:97Learn0MLRSNG01a:xe-0/0/8.0
[edit]
root@TEST-QFABRIC#
Here the Node named MLRSNG01a is a member of the RSNG named RSNG0:
root@TEST-QFABRIC#runshowfabricadministrationinventorynode-groupsRSNG0
Nodegroup
RSNG0ConnectedConfigured
MLRSNG01aP6810-CConnected
MLRSNG02aP7122-CConnected
[edit]
This is what the local route looks like on the RSNG:

qfabric-admin@RSNG0>showethernet-switchingtable|matchac:4b:c8:f8:68:97
V709---qfabricac:4b:c8:f8:68:97Learn43xe-0/0/8.0<<learnedonfpc0
{master}
qfabric-admin@RSNG0>showvirtual-chassis
49
50
Mstr
0(FPC0)Prsntqfx3500128Master*P6810-C<<MLRSNG01aorfpc0
{master}
qfabric-admin@RSNG0>
Here are some more details about the RSNG:

qfabric-admin@RSNG0>showfabricsummary
AutonomousSystem:100
INEId:128.0.130.6<<thelocalINE-id
INEType:Server
SimulationMode:SI
{master}
Hardware token assigned to Vlan.709 is 12:

qfabric-admin@RSNG0>startshell
%vlaninfo
IndexNameInstTagFlagsHW-TokenL3-iflMSTIndex
2default000x10030254
3V650---qfabric06500x10080254
4V709---qfabric07090x100120254
5V2283---qfabric022830x100190254
The hardware token for a VLAN can be obtained using the CLI using the following
commands:
qfabric-admin@RSNG0>showvlansV709---qfabricextensive
VLAN:V709---qfabric,Createdat:ThuNov1405:39:282013
802.1QTag:709,Internalindex:4,AdminState:Enabled,Origin:Static
Protocol:PortMode,Macagingtime:300seconds
Numberofinterfaces:Tagged0(Active=0),Untagged0(Active=0)
{master}
qfabric-admin@RSNG0>showfabricvlan-domain-mapvlan4
VlanL2DomainL3-IflL3-Domain
41200
{master}
The Layer 2 domain shown in the output of show fabric vlan-domain-map vlan
<internal-index> contains the same value as that of the hardware token of the
VLAN and its also called the L2Domain-Id for a particular VLAN.
As discussed earlier, this route is sent over to the FC-VM. This is how the route
looks on the FC-VM (note that the FC-VM uses a unique table called bgp.bridgevpn.0 :
qfabric-admin@FC-0>showroutefabrictablebgp.bridgevpn.0
--snip-65534:1:12.ac:4b:c8:f8:68:97/152
*[BGP/170]6d07:42:56,localpref100
ASpath:I,validation-state:unverified
>to128.0.130.6viadcfabric.0,Push1719,Push10,Push25(top)
[BGP/170]6d07:42:56,localpref100,from128.0.128.8
>to128.0.130.6viadcfabric.0,Push1719,Push10,Push25(top)
51
So, the next hop for this route is being shown as 128.0.130.6. Its clear from the
output snippets mentioned earlier, that this is the internal IP address for the RSNG.
The bolded portion of the route shows the hardware token of the VLAN. The output
snippets above showed that the token for VLAN.709 is 12.
The labels that are being shown in the output of the route at the FC-VM are specific
to the way the FC-VM communicates with this particular RE (the RSNG). The
origination and explanation of these labels is beyond the scope of this book.
As discussed earlier, a Layer 2 route should be sent across to all the Nodes that have
active ports in that particular VLAN. In this specific example, here are the Nodes that
have active ports in VLAN.709:
root@TEST-QFABRIC#runshowvlans709
NameTagInterfaces
V709709
MLRSNG01a:xe-0/0/8.0*,NW-NG-0:ae0.0*,NW-NG-0:ae34.0*,
NW-NG-0:ae36.0*,NW-NG-0:ae38.0*
[edit]
Since the NW-NG Nodes are active for VLAN 709, the active NW-NG-VM should
have the Layer 2 route under discussion (ac:4b:c8:f8:68:97 in VLAN 709) learned
via the FC-VM via the internal BGP protocol. Here are the corresponding show
snippets from the NW-NG-VM (note that whenever the individual REs learn Layer 2
routes from the FC, they are stored in the table named default.bridge.0):
root@TEST-QFABRIC#runrequestcomponentloginNW-NG-0
Password:
Atleastonepackageinstalledonthisdevicehaslimitedsupport.
Run'fileshow/etc/notices/unsupported.txt'fordetails.
{master}
qfabric-admin@NW-NG-0>showroutefabrictabledefault.bridge.0
--snip-12.ac:4b:c8:f8:68:97/88
*[BGP/170] 1d 10:53:47, localpref 100, from 128.0.128.6
AS path: I, validation-state: unverified
> to 128.0.130.6 via dcfabric.0, Layer 2 Fabric Label 1719 PFE Id 10 Port Id 25
[BGP/170] 1d 10:53:47, localpref 100, from 128.0.128.8
The bolded portion of the snippet shows the token for VLAN.709 (12), the destination PFE-ID and the Port-ID are data plane entities. This is the information that gets
pushed down to the PFE of the member Nodes and then these details are used to
forward data in hardware. In this example, whenever a member Node of the NW-NG
gets traffic for this MAC address, it sends this data via the FTE links to the Node with
PFE-ID of 10. The PFE-IDs of all the Nodes within a Node group can be seen by
logging into the corresponding VM and correlating the outputs of show fabric
multicast vccpdf-adjacency and show virtual chassis. In this example, its the
RSNG that locally learns the Layer 2 route of ac:4b:c8:f8:68:97 in VLAN 709. Here
are the outputs of commands that show which Node has the PFE of 10:
root@TEST-QFABRIC# run request component login RSNG0
Warning: Permanently added 'dcfNode-default-rsng0,169.254.193.3' (RSA) to the list of known hosts.
Password:
52
{master}
qfabric-admin@RSNG0>showfabricmulticastvccpdf-adjacency
Flags:S-Stale
SrcSrcSrcDestSrcDest
DevidINEDevtypeDevidInterfaceFlagsPortPort
934TOR256n/a-1-1
934TOR512n/a-1-1
10259(s)TOR256fte-0/1/1.3276813
10259(s)TOR512fte-0/1/0.3276803
1134TOR256n/a-1-1
1134TOR512n/a-1-1
12259(s)TOR256fte-1/1/1.3276812
12259(s)TOR512fte-1/1/0.3276802
256260F29n/a-1-1
256260F210n/a-1-1
256260F211n/a-1-1
256260F212n/a-1-1
512261F29n/a-1-1
512261F210n/a-1-1
512261F211n/a-1-1
512261F212n/a-1-1
{master}
The Src Dev ID shows the PFE-IDs of the member Nodes and the Interface column
shows the FTE interface that goes to the interconnects. The highlighted output
shows that the device with fpc-0 has the PFE-ID of 10 (fte-0/1/1 means that the port
belongs to member Node which is fpc-0).
The output of show
virtual-chassis shows which device is fpc-0:
Mstr
{master}
These two snippets show that the device with fpc-0 is the Node with device ID of
P6810-C. Also, the MAC address was originally learned on port xe-0/0/8 (refer to
the preceeding outputs).
The last part of the data plane information on the NW-NG was the port-ID of the
Node with PFE-ID = 10. The PFE-ID generation is Juniper confidential information
and beyond the scope of this book. However, the port-ID shown in the output of
show route fabric table default.bridge.0 would always be 17 more than the
actual port-number of the ingress Node in case when QFX 3500s are being used as
the Nodes. In this example, the MAC address was learned on xe-0/0/8 on the RSNG
Node. This means that the port-ID being shown on the NW-NG should be 8 + 17 =
25. This is exactly the information that we saw in the output of show route fabric
default.bridge.0 earlier.
53
Show Command Snippets for Layer 3 Routes

Layer 3 routes are also propagated similar to Layer 2 routes. The only difference is
that the table is named bgp.l3vpn.0. As discussed, its the routing instance that
decides whether a Layer 3 route should be sent to a Node device or not. Lets look at
the CLI snippets to verify the details:
root@TEST-QFABRIC#runrequestcomponentloginNW-NG-0
Password:
{master}
qfabric-admin@NW-NG-0>showrouteprotocoldirect
inet.0:95destinations,95routes(95active,0holddown,0hidden)
RestartComplete
+=ActiveRoute,-=LastActive,*=Both
172.17.106.128/30*[Direct/0]6d08:38:41
>viaae4.0<<<<<considerthisroute
172.17.106.132/30*[Direct/0]6d08:38:38
>viaae5.0
172.17.106.254/32*[Direct/0]6d09:10:36
>vialo0.0
{master}
qfabric-admin@NW-NG-0>showconfigurationinterfacesae4
description"NW-NG-0:ae4P2Player3toTSTRa:ae5";
metadataNW-NG-0:ae4;
mtu9192;
macf8:c0:01:f9:30:0c;
unit0{
global-layer2-domainid6;
familyinet{
address172.17.106.129/30;<<<<<localIPaddress
}
}
{master}
qfabric-admin@RSNG0>...bgp.l3vpn.0|find172.17.106.132
65534:1:172.17.106.132/30
*[BGP/170]6d08:27:07,localpref101,from128.0.128.6
to128.0.128.4viadcfabric.0,PFEId9PortId17
to128.0.128.4viadcfabric.0,PFEId9PortId18
>to128.0.128.4viadcfabric.0,PFEId9PortId21
This information is similar to that which was seen in the case of a Layer 2 route. Since
this particular route is a direct route on the NW-NG, then the IP address of
128.0.128.4 and the corresponding data plane information (PFE-ID: 9 and Port-ID:
21) should reside on the NW-NG. Here are the verification commands:
qfabric-admin@NW-NG-0>showfabricsummary
INEId:128.0.128.4<<<<thisiscorrect
INEType:Network
SimulationMode:SI
{master}
qfabric-admin@NW-NG-0>showfabricmulticastvccpdf-adjacency
Flags:S-Stale
54
934(s)TOR256fte-2/1/1.3276810
934(s)TOR512fte-2/1/0.3276800
10259TOR256n/a-1-1
10259TOR512n/a-1-1
1134(s)TOR256fte-1/1/1.3276811
1134(s)TOR512fte-1/1/0.3276801
12259TOR256n/a-1-1
12259TOR512n/a-1-1
256260F29n/a-1-1
256260F210n/a-1-1
256260F211n/a-1-1
256260F212n/a-1-1
512261F29n/a-1-1
512261F210n/a-1-1
512261F211n/a-1-1
512261F212n/a-1-1
{master}
So the PFE-ID of 9 indeed resides on the NW-NG. According to the output of show
route fabric table bgp.l3vpn.0 taken from the RSNG, the port-ID of the remote
Node is 21. This means that the corresponding port number on the NW-NG should
be xe-2/0/4 (4 + 17 = 21). Note that the original Layer 3 route was a direct route
because of the configuration on ae4 on the NW-NG. Hence one should expect
xe-2/0/4 to be a part of ae4. Here is what the configuration looks like on the NWNG:
qfabric-admin@NW-NG-0>showconfigurationinterfacesxe-2/0/4
description"NW-NG-0:ae4toTSTRaxe-4/2/2";
metadataMLNNG02a:xe-0/0/4;
ether-options{
802.3adae4;<<<thisisexactlytheexpectedinformation
}
{master}
BUM Traffic
A QFabric system can have 4095 VLANs configured on it and can also be comprised
of multiple Node groups. A Node group may or may not have any active ports in a
specific VLAN. To maintain scale within a QFabric system, whenever data has to be
flooded it is sent only to those Nodes which have an active port in the VLAN in
question.
To make sure that flooding takes place according to these rules, the QFabric technology introduces the concept of a Multicast Core Key. A Multicast Core Key is a 7-bit
value and it identifies a group of Nodes for the purposes of replicating BUM traffic.
This value is always generated by the active NW-NG-0 VM and is advertised to all
the Nodes so that correct replication and forwarding of BUM traffic can take place.
As discussed, a Node should receive BUM traffic in a VLAN only if it has an active
port (which is in up/up status) in that given VLAN. To achieve this, whenever a
Nodes interface becomes an active member of a VLAN, that Node relays this
information to the NW-NG-0 VM over the CPE network. The NW-NG-0 VM
processes this information from all the Nodes and generates a Multicast Core Key for
that VLAN. This Multicast Core Key has an index of all the Nodes that subscribe to
this VLAN (that is, the Nodes which have an active port in this VLAN). The Core
Key is then advertised to all the Nodes and all the Interconnects by the NW-NG-0
VM over the CPE network. This processes is hereafter referred to as a Node subscribing to the VLAN.
55
Once the Nodes and Interconnects receive this information, they install a broadcast
route in their default.bridge.0 table and the next hop for this route is the Multicast
Core Key number. With this information, the Nodes and Interconnects are able to
send the BUM data only to Nodes that subscribe to this VLAN.
Note that there is a specific table called default.fabric.0 that contains all the information regarding the Multicast Core Keys. This includes the information that the
NW-NG-0 VM receives from the Nodes when they subscribe to a VLAN.
Here is a step wise explanation of this process for vlan.29:
1. Vlan.29 is present only on the Nodes that are a part of the Network Node group:
root@TEST-QFABRIC>showvlansvlan.29
NameTagInterfaces
vlan.2929
NW-NG-0:ae0.0*,NW-NG-0:ae34.0*,NW-NG-0:ae36.0*,
NW-NG-0:ae38.0*
2. The hardware token for vlan.29 is determined to be 5:

qfabric-admin@NW-NG-0>showvlans29extensive
VLAN:vlan.29---qfabric,Createdat:TueDec309:48:542013
ae0.0*,tagged,trunk
ae34.0*,tagged,trunk
{master}
qfabric-admin@NW-NG-0>showfabricvlan-domain-mapvlan7
VlanL2DomainL3-IflL3-Domain
500
3. Since vlan.29 has active ports only on the Network Node Group, this VLAN
shouldnt exist on any other Node group:
root@TEST-QFABRIC>requestcomponentloginRSNG0
Warning:Permanentlyadded'dcfnode-defaultrsng0,169.254.193.3'(RSA)tothelistofknownhosts.
Password:
{master}
qfabric-admin@RSNG0>showvlans29
error:vlanwithtag29doesnotexist
{master}
4. At this point in time, the NW-NG-0s default.fabric.0 table does not contain only
local information:
qfabric-admin@NW-NG-0>showfabricsummary
INEId:128.0.128.4
INEType:Network
SimulationMode:SI
{master}
qfabric-admin@NW-NG-0>...0fabric-route-typemcast-routesl2domain-id5
default.fabric.0:88destinations,92routes(88active,0holddown,0hidden)
RestartComplete
56
5.ff:ff:ff:ff:ff:ff:128.0.128.4:128:000006c3(L2D_PORT)/184
*[Fabric/40]11w1d02:59:50
>to128.0.128.4:128(NE_PORT)viaae0.0,Layer2FabricLabel1731
5.ff:ff:ff:ff:ff:ff:128.0.128.4:162:000006d3(L2D_PORT)/184
*[Fabric/40]6w3d09:03:04
*[Fabric/40]11w1d02:59:50
*[Fabric/40]11w1d02:59:50
{master}
The command executed above is show route fabric table default.fabric.0
fabric-route-type mcast-routes l2domain-id 5.
5. The user configures a port on the Node group named RSNG0 in vlan.29. After
this, RSNG0 started displaying the details for vlan.29:
root@TEST-QFABRIC#...hernet-switchingport-modetrunkvlanmembers29
[edit]
root@TEST-QFABRIC#commit
commitcomplete
[edit]
root@TEST-QFABRIC#show|comparerollback1
[editinterfaces]
+P7122-C:xe-0/0/9{
+unit0{
+familyethernet-switching{
+port-modetrunk;
+vlan{
+members29;
+}
+}
+}
+}
[edit]
qfabric-admin@RSNG0>showvlans29extensive
VLAN:vlan.29---qfabric,Createdat:ThuFeb2713:16:392014
xe-1/0/9.0*,tagged,trunk
{master}
6. Since RSNG0 is now subscribing to vlan.29, this information should be sent

over to the NW-NG-0 VM. Here is what the default.fabric.0 table of contents
looks like at RSNG0:
qfabric-admin@RSNG0>showfabricsummary
INEId:128.0.130.6
INEType:Server
SimulationMode:SI
{master}
qfabric-admin@RSNG0>...c.0fabric-route-typemcast-routesl2domain-id5
RestartComplete
*[Fabric/40]00:04:59
57
>to128.0.130.6:49174(NE_PORT)viaxe-1/0/9.0,Layer2FabricLabel1729
{master}
This route is then sent over to NW-NG-0 via the Fabric Control VM.
7. NW-NG-0 receives the route from RSNG0 and updates its default.fabric.0 table:
qfabric-admin@NW-NG-0>...ic-route-typemcast-routesl2domain-id5
--snip
*[BGP/170] 00:07:34, localpref 100, from 128.0.128.6 <<<< 128.0.128.6 is RSNG0s IP
address
[BGP/170] 00:07:34, localpref 100, from 128.0.128.8
8. NW-NG-0 VM checks its database to find out the list of Nodes that already
subscribe to vlan.29 and generates a PFE-map. This PFE-map contains the indices of
all the Nodes that subscribe to vlan.29:
qfabric-admin@NW-NG-0>showfabricmulticastrootvlan-group-pfe-map
L2domainGroupFlagPFEmapMrouterPFEmap
22.255.255.255.25561A00/30/0
55.255.255.255.25561A00/30/0
--snip--
Check the entry corresponding to the L2Domain-ID for the corresponding VLAN. In
this case, the L2Domain-ID for vlan.29 is 5.
9. The NW-NG-0 VM creates a Multicast Core-Key for the PFE-map (4101 in this
case):
qfabric-admin@NW-NG-0>...multicastrootlayer2-group-membership-entries
GroupMembershipEntries:
--snip-L2domain:5
Group:Source:5.255.255.255.255
Multicastkey:4101
PacketForwardingmap:1A00/3
--snip-The command used here was show fabric multicast root layer2-group-membership-entries. This command is only available in Junos 13.1 and higher. In earlier
versions of Junos the show fabric multicast root map-to-core-key command can be
used to obtain the Multicast Core Key number.

10. The NW-NG-0 VM sends this Multicast Core Key to all the Nodes and
Interconnects via the Fabric Control VM. This information is placed in the default.
fabric.0 table. Note that this table is only used to store the Core Key information and
is not used to actually forward data traffic. Note that the next hop is 128.0.128.4,
which is the IP address of the NW-NG-0 VM.
qfabric-admin@RSNG0>showroutefabrictabledefault.fabric-route-typemcast-member-map-key4101
RestartComplete
4101:7(L2MCAST_MBR_MAP)/184
*[BGP/170]00:35:12,localpref100,from128.0.128.6
[BGP/170]00:35:12,localpref100,from128.0.128.8
58
11. The NW-NG-0 VM sends out a broadcast route for the corresponding VLAN to
all the Nodes and the Interconnects. The next hop for this route is set to the
Multicast Core Key number. This route is placed in the default.bridge.0 table and is
used to forward and flood the data traffic. The Nodes and Interconnects will install
this route only if they have information for the Multicast Core Key in their default.
fabric.0 table. In this example, note that the next hop contains the information for
the Multicast Core Key as well:
qfabric-admin@RSNG0>showroutefabrictabledefault.bridge.0l2domain-id5
--snip-5.ff:ff:ff:ff:ff:ff/88
*[BGP/170]00:38:13,localpref100,from128.0.128.6
>to128.0.128.4:57005(NE_
PORT)viadcfabric.0,MultiCast-Corekey:4101Keylen:7
[BGP/170]00:38:13,localpref100,from128.0.128.8
>to128.0.128.4:57005(NE_
PORT)viadcfabric.0,MultiCast-Corekey:4101Keylen:7
The eleven steps mentioned here are a deep dive into how the Nodes of a QFabric
system subscribe to a given VLAN. The aim of this technology is to make sure that
all the Nodes and the Interconnects have consistent information regarding which
Nodes subscribe to a specific VLAN. This information is critical to ensuring that
there is no excessive flooding within the data plane of a QFabric system.
At any point in time, there may be multiple Nodes that subscribe to a VLAN, raising
the question of where a QFabric system should replicate BUM traffic. QFabric
systems replicate BUM traffic at the following places:
Ingress Node: Replication takes place only if:
There are any local ports in the VLAN where BUM traffic was received.
BUM traffic is replicated and sent out on the server facing ports.
There are any remote Nodes that subscribe to the VLAN in question. BUM
traffic is replicated and sent out towards these specific Nodes over the 40GbE
FTE ports.
Interconnects: Replication takes place if there are any directly connected Nodes
that subscribe to the given VLAN.
Egress Node: Replication takes place only if there are any local ports that are
active in the given VLAN.
Differences Between Control Plane Traffic and Internal Control Plane Traffic
Most of this chapter has discussed the various control plane characteristics of the
QFabric system and how the routes are propagated from one RE to another. With
this background, note the following functions that a QFabric system has to perform
to operate:
59
Control plane tasks: form adjacencies with other networking devices, learn
Layer 2 and Layer 3 routes
Data plane tasks: forward data end-to-end
Internal control plane tasks: discover Nodes and Interconnects, maintain
VCCPD, VCCPDf adjacencies, health of VMs, exchange routes within the
QFabric system to enable communication between hosts connected on different Nodes
The third bullet here makes QFabric a special system. All the control plane traffic
that is used for the internal workings of the QFabric system is referred to as internal
control plane traffic. And the last pages of this chapter are dedicated to bringing out
the differences between the control plane and the internal control plane traffic. Lets
consider the following QFabric system shown in Figure 3.8.
Figure 3.8
Sample QFabric System and Connected Hosts
In Figure 3.8, the data plane is shown using blue lines and the CPE is shown using
green lines. There are four Nodes, two Interconnects, and four Hosts. Host-1 and
Host-2 are in the RED-VLAN (vlan.100), Host-3 and Host-4 are in YELLOWVLAN (vlan.200). Node-1, Node-2, and Node-3 are SNGs, whereas Node-4 is an
NW-NG Node and has Host-4, as well as a router (R1), directly connected to it.
Finally, BGP is running between the QFabric and R1.
60
Lets discuss the following traffic profiles: Internal Control Plane and Control Plane.
Internal Control Plane
VCCPDf Hellos between the Nodes and the Interconnects are an example of
internal control plane traffic.
Similarly, the BGP sessions between the Node groups and the FC-VM is also
an example of internal control plane traffic.
Note that the internal control plane traffic is also generated by the CPU, but
its used for forming and maintaining the states of protocols that are critical to
the inner-workings of a QFabric system.
Also, the internal control plane traffic is always used only within the QFabric
system. The internal control plane traffic will never be sent out of any Nodes.
Control Plane
Start a ping from Host-1 to its default gateway. Note that the default-gateway
for Host-1 resides on the QFabric. In order for the ICMP pings to be successful, Host-1 will need to resolve ARP for the gateways IP address. Note that
Host-1 is connected to Node-1, which is an SNG, and the RE functionality is
always locally active on an SNG. Hence the ARP replies will be generated
locally by Node-1s CPU (SNG). The ARP replies are sent out to Host-1 using
the data plane on Node-1.
The BGP Hellos between the QFabric system and R1 will be generated by the
active NW-NG-VM, which is running on the DGs. Even though R1 is directly
connected to Node-4, the RE functionality on Node-4 is disabled because it is
a part of the Network Node Group. The BGP Hellos are sent out to R1 using
the data plane link between Node-4 and R1.
The control plane traffic is always between the QFabric system and an external entity. This means that the control plane traffic eventually crosses the data
plane, too, and goes out of the QFabric system via some Node(s).
To be an effective QFabric administrator, it is extremely important to know which
RE/PFE would be active for a particular abstraction or Node group. Note that all
the control plane traffic for a particular Node group is always originated by the
active RE for that Node group. This control plane traffic is responsible for forming
and maintaining peerings and neighborships with external devices. Here are some
specific examples:
A server connected to SNG (active-RE is the Nodes RE) via a single link: In
this situation, if LLDP is running between the server and the QFabric, then the
RE of the Node is responsible for discovering the server via LLDP. The LLDP
PDUs will be generated by the RE of the Node, which will help the server with
the discovery of the QFabric system.
RSNG: For an RSNG, the REs of the Nodes have an active/passive relationship. For example:
A server with two NICs connected to each Node of an RSNG: This is the
classic use case for an RSNG for directly connecting servers to a QFabric
system. Now if LACP is running between the server and the QFabric system,
61
then the active RE is responsible for exchanging LACP PDUs with the server to
make sure that the aggregate link stays up.
An access switch is connected to each Node of an RSNG: This is a popular
use case in which the RSNG (QFabric) acts as an aggregation point. However,
you can eliminate STP by connecting the access switch to each Node of the
RSNG and by running LACP on the aggregate port. This leads to a flat
network design. Again the active RE of the RSNG is responsible for exchanging LACP PDUs with the server to make sure that the aggregate link stays up.
Running BGP between NW-NG-0 and an MX (external router): As expected,
its the responsibility of the active NW-NG-0 VM (located on the active-DG)
to make sure that the necessary BGP communication (keep alives, updates,
etc.) takes place with the external router.
62
Test Your Knowledge

Q: What is the difference between a Node group and a Node device?
A Node device can be any QFX3500 or QFX3600 Node that is a part of the
QFabric system. A Node group is an abstraction and is a group of Node
devices.
Q: What are the different kinds of Node groups and how many Node devices can be a
part of each group?
i) Server Node Group (SNG): Only one Node device can be a part of a SNG.
ii) Redundant Server Node Group (RSNG): an RSNG consists of two Node
devices.
iii) Network Node Group (NNG): The NNG can consist of up to eight Node
devices.
Q: Can a Node device be part of multiple Node groups at the same time?
No.
Q: Where are the active/backup Routing Engines present for the various Node
groups?
i) SNG: The Routing Engine of the Node device is active. Since the Node group
consists of only one Node device, there is no backup Routing Engine.
ii) RSNG: The Routing Engine of one Node device is active and the Routing
Engine of the other Node device is backup.
iii) NNG: The Routing Engines of the Node devices are disabled. The Routing
Engine functionality is handled by two VMs running on the Director devices.
These VMs operate in active/backup fashion.
Q: Are all the routes learned on a Node groups Routing Engine sent to all other Node
devices for PFE programming?
No. Routes are sent only to the Node devices that need them. This decision is
different for Layer 2 and Layer 3 routes.
i) Layer 2 routes: Routes are sent only to those Node devices that have an
active port in that VLAN.
ii) Layer 3 routes: Routes are sent only to those Node devices that have an
active port in that routing instance.
Q: Which tables contain the Layer 2 and Layer 3 routes that get propagated internally between the components of a QFabric system?
i) Layer 2 routes: bgp.bridgevpn.0
ii) Layer 3 routes: bgp.l3vpn.0
iii) Multicast routes: default.fabric.0
Chapter 4
Data Plane Forwarding
ARP Resolution for End-to-End Traffic . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64

Layer 2 Traffic (Known Destination MAC Address with Source and
Destination Connected on the Same Node) . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
Layer 2 Traffic (Known Destination MAC Address with Source and
Destination Connected on Different Nodes). . . . . . . . . . . . . . . . . . . . . . . . . . . 70
Layer 2 Traffic (BUM Traffic) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
Layer 3 Traffic (Destination Prefix is Learned on the Local Node). . . . . . . . 72
Layer 3 Traffic (Destination Prefix is Learned on a Remote Node). . . . . . . 73
End-to-End Ping Between Two Hosts Connected to Different Nodes. . . 73
External Device Forming a BGP Adjacency . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
Test Your Knowledge. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
64
This chapter concerns how a QFabric system forwards data. While the previous
chapters in this book have explained how to verify the working of the control plane
of a QFabric system, this chapter focuses only on the data plane.
The following phrases are used in this chapter:
Ingress Node: the Node to which the source of a traffic stream is connected.
Egress Node: the Node to which the destination of a traffic stream is connected.
NOTE
The data plane on the Nodes and the Interconnects resides on the ASIC chip. Accessing and troubleshooting the ASIC is Juniper confidential and beyond the scope of this
book.
This chapter covers the packet paths for the following kinds of traffic within a
QFabric system:
ARP resolution at the QFabric for end-to-end traffic
Layer 2 traffic (known destination MAC address with source and destination
connected on the same Node)
Layer 2 traffic (known destination MAC address with source and destination
connected on different Nodes)
Layer 2 traffic (BUM traffic)
Layer 3 traffic (destination prefix is learned on the local Node)
Layer 3 traffic (destination prefix is learned on a remote Node)
Example with an end-to-end ping between two hosts connected to different
Nodes on the QFabric
ARP Resolution for End-to-End Traffic

Consider the following communication between Host-A and Host-B:
Host-A-------JuniperEXSeriesSwitch----------Host-B
Host-A is in VLAN-1 and has an IP address of 1.1.1.100/24, Host-B is in VLAN-2

and has an IP address of 2.2.2.100/24. The RVIs for these VLANs are present on the
Juniper EX Series switch and these RVIs are also the default gateways for both the
VLANs.
According to the basic laws of networking, when Host-A sends some traffic to
Host-B, the switch will need to resolve the destinations ARP. This is fairly simple but
things change when the intermediate device is a QFabric system.
As discussed in the previous chapters, a QFabric system scales by not sending all the
routes to all the Nodes. As a result, when a Node doesnt have any active ports in a
given VLAN, that VLANs broadcast tree doesnt exist locally. In simpler terms, if
there is no active port for a VLAN on a Node, then that VLAN does not exist on the
Node.
ARP resolutions are based on broadcasting ARP requests in a VLAN and relying on
the correct host to send a response. Since the QFabric architecture allows for a
situation in which a VLAN might not exist on a Node (when there are no active ports
in that VLAN on the Node), there could be a situation in which a Node receives
traffic that requires ARP resolution, but the destination VLAN (like VLAN-2 in the
Chapter 4: Data Plane Forwarding
65
just mentioned example) does not exist on the ingress Node.

Lets look at two scenarios about the ingress Node:
The ingress Node has the destination VLAN configured on it.
The ingress Node doesnt have any active ports in the destination VLAN and
hence the destination VLAN doesnt exist locally.
The Ingress Node Has the Destination VLAN Configured On It

Figure 4.1 depicts the Nodes of a QFabric system and the Hosts connected to those
Nodes. Lets assume all the connected Host ports are configured as access ports on
the QFabric, and the corresponding VLAN shown in the Figure. In this case, Host-A
starts sending data to Host-B. Node-1 is the ingress Node and also has an active
port in the destination VLAN. According to the standards maintained in this book,
the blue links on the Nodes are the 40GbE FTE links going to the Interconnects.
Figure 4.1
A QFabric Units Nodes and Connected Hosts
Think of QFabric as a large switch. When a regular switch needs to resolve ARP,
then it is required to flood an ARP request in the corresponding VLAN. The QFabric should behave in a similar way. In this particular example, the QFabric should
send out the ARP request for Host-B on all the ports that have VLAN-2 enabled.
There are three such ports: one locally on Node-1, and one each on Node-2 and
Node-3.
Since Node-1 has VLAN-2 active locally, it would also have the information about
VLAN-2s broadcast tree. Whenever a Node needs to send BUM traffic on a VLAN
that is active locally as well, that traffic is always sent out on the broadcast tree.
66
In this example, Node-1, Node-2, and Node-3 subscribe to the broadcast tree for
VLAN-2. Hence, this request is sent out of the FTE ports towards Node-2 and
Node-3. Once these broadcast frames (ARP requests) reach Node-2, they are flooded
locally on all the ports that are active in VLAN-2.
Here is the sequence of steps:
1. Host-A wants to send some data to Host-B. Host-A is in VLAN-1 and Host-B is in
VLAN-2. The IP and MAC addresses are shown in Figure 4.2.
2. Host-A sends this data to the default gateway (the QFabrics RVI for VLAN-1).
3. The QFabric system needs to generate an ARP request for Host-Bs IP address.
4. Since the destination prefix (VLAN) is also active locally, generate the ARP request
locally on the ingress Node (Node-1).
Figure 4.2
Node-1 Sends the Request in VLAN-2
Node-1 sends out this ARP request on the local ports that are active for VLAN-2. In
addition, Node-1 consults its VLAN broadcast tree and finds out that Node-2 and
Node-3 also subscribe to VLAN-2s broadcast tree. Node-1 sends out the ARP
request over the FTE links. An extra header called the fabric header is added on all
the traffic going on the FTE links to make sure that only Node-2 and Node-3 receive
this ARP request.
The IC receives this ARP request from Node-1. The IC looks at the header appended
by Node-1 and finds out that this traffic should be sent only to Node-2 and Node-3.
The IC has no knowledge of the kind of traffic that is encapsulated within the fabric
header.
In Figure 4.3, Node-2 and Node-3 receive one ARP request each from their FTE
links. These Nodes flood the request on all ports that are active in VLAN-2.
Host-B replies to the ARP request.
Figure 4.3
67
Node-2 and Node-3 Receive Request from the Data Plane (Interconnect)
At this point in time, the QFabric system learns the ARP entry (ARP route) for
Host-2 (see Figure 4.4). Using the Fabric Control VM, this route is advertised to all
the Nodes that have active ports in this VRF. Note that the ARP route will be
advertised to relevant Nodes based on the same criteria as regular Layer 3 routes,
that is, based on the VRF.
Figure 4.4
FCs Role With Learning/Flooding the ARP Route
The sequence of events that takes place in Figure 4.4 is:
68
1. Node-1 now knows how to resolve the ARP for Host-Bs IP address. This is the
only information that Node-1 needs to be able to send traffic to Host-B.
2. Host-As data is successfully sent over to Host-B via the QFabric system.
The Ingress Node Does Not Have the Destination VLAN Configured On It
Refer to Figures 4.5 4.7, in this case, Host-A starts sending data to Host-E. Node-1
is the ingress Node and it does not have any active port in the destination VLAN
(VLAN-3).
This is a special case in which you need additional steps to make end-to-end traffic
work properly. Thats because the ingress Node doesnt have the destination VLAN
and hence doesnt subscribe to that VLANs broadcast tree. Since Node-1 doesnt
subscribe to destination VLANs broadcast tree, it has no way to know which Nodes
should receive BUM traffic in that VLAN.
Note that the Network Node Group is the abstraction that holds most of the routing
functionality of the QFabric. Hence, youll need to make use of the Network Node
Group to resolve ARP in such a scenario.
Here is the sequence of steps that will take place:
1. Host-A wants to send some data to Host-E. Host-A is in VLAN-1 and Host-E is in
VLAN-3. The IP and MAC addresses are shown in Figure 4.2.
2. Host-A sends this data to the default gateway (the QFabrics RVI for VLAN-1).
3. The QFabric system needs to generate an ARP request for Host-Es IP address.
4. Since the destination-prefix (VLAN) is not active locally, Node-1 has no way of
knowing where to send the ARP request in VLAN-3. Because of this, Node-1 cannot
generate the ARP request locally.
5. Node-1 is aware that Node-3 belongs to the Network Node Group. Since the
NW-NG hosts the routing-functionality of a QFabric, Node-1 must send this data to
the NW-NG for further processing.
Figure 4.5
Node-1 Encapsulates the Data with Fabric Header and Sends It to the Interconnects
69
6. Node-1 encapsulates the data received from Host-A with a fabric-header and
sends it over to the NW-NG Nodes (Node-3 in this example).
Figure 4.6
Node-3 Sends the ARP Request to the NW-NG-0 VM for Processing
7. Node-3 receives the data from Node-1 and immediately knows that ARP must be
resolved. Since resolving ARP is a Control plane function, this packet is sent over to
the active NW-NG-VM. Since the VM resides on the active DG, this packet is now
sent over the CPE links so that it reaches the active NW-NG-VM.
Figure 4.7
NW-NG-0 VM Generates the ARP Request and Sends It Towards the Nodes
70
8. The active NW-NG-VM does a lookup and knows that ARP-requests need to be
generated for Host-Bs IP address. The ARP request is generated locally. Note that
this part of the process takes place on the NW-NG-VM that is located on the masterDG. However, the ARP request must be sent out by the Nodes so that it can reach the
correct host. For this to happen, the active NW-NG-VM sends out one copy of the
ARP request to each Node that is active for the destination VLAN. The ARP requests
from the active NW-NG-VM are sent out on the CPE network.
9. In this specific example (the QFabric system depicted in Figure 4.2), there is only
one Node that has VLAN-3 configured on it: Node-4. As a result, the NW-NG VM
sends the ARP request only to Node-4. This Node receives the ARP request on its
CPE links and floods it locally in VLAN-3. This is how the ARP request reaches the
correct destination host.
10. Host-B replies to the ARP request.
11. At this point in time, the QFabric system learns the ARP entry (ARP route) for
Host-2. Using the Fabric Control VM, this route is advertised to all the Nodes that
have active ports in this VRF. This is the same process that was discussed in section
4.2.
12. Since the QFabric knows how to resolve the ARP for Host-Bs IP address,
Host-As data is successfully sent to Host-B via the QFabric system.
Layer 2 Traffic (Known Destination MAC Address with Source and Destination
Connected on the Same Node)
This is the simplest traffic forwarding case wherein the traffic is purely Layer 2 and
both the source and destination are connected to the same Node.
In this scenario, the Node acts as a regular standalone switch as far as data plane
forwarding is concerned. Note that QFabric will need to learn MAC addresses in
order to forward the Layer 2 traffic as unicast. Once the active RE for the ingress
Node group learns the MAC address, it will interact with Fabric Control VM and
send that MAC address to all the other Nodes that are active in that VLAN.
Layer 2 Traffic (Known Destination MAC Address with Source and Destination
Connected on Different Nodes)
In this scenario, refer again to Figure 4.2, where Host-C wants to send some data to
Host-B. Note that they are both in VLAN-2 and hence the communication between
them would be purely Layer 2 from QFabrics perspective. Node-1 is the ingress
Node and Node-2 is the egress Node. Since the MAC address of Host-B is already
known to QFabric , the traffic from Host-C to Host-B will be forwarded as unicast by
the QFabric system.
Here is the sequence of steps that will take place:
1. Node-1 receives data from Host-C and looks up the Ethernet-header. The
destination-MAC address is that of Host-B. This MAC address is already learned by
the QFabric system.
2. At Node-1, this MAC address would be present in the default.bridge.0 table.
3. The next-hop for this MAC address would point to Node-2.
71
4. Node-1 adds the fabric-header on this data and sends the traffic out on its FTE
link. The fabric header contains the PFE-id of Node-2.
5. The IC receives this information and does a lookup on the fabric-header. This
reveals that the data should be sent towards Node-2. The IC then sends the data
towards Node-2.
6. Node-2 receives this traffic on its FTE link. The fabric-header is removed and a
lookup is done on the Ethernet-header.
7. The destination-MAC is learned locally and points to the interface connected to
Host-B.
8. Traffic is sent out towards Host-B.
Layer 2 Traffic (BUM Traffic)

BUM (Broadcast, Unknown-unicast, and Multicast) traffic always requires flooding
within the VLAN (note that multicast will not be flooded within the VLAN when
IGMP snooping is turned on).
Given the unique architecture of QFabric, forwarding the BUM traffic requires
special handling. This is because there can always be multiple Nodes which are active
for a given VLAN. Whenever a VLAN is activated on a Node for the first time (this
can be done by either bringing up an access port or by adding the VLAN to an
already existing trunk port), its the responsibility of this Node groups active RE to
make sure that it subscribes to the broadcast tree for that VLAN. (The full description of broadcast trees was discussed in Chapter 3.)
With this in mind, lets look at the data plane characteristics for forwarding BUM
traffic. Since BUM traffic forwarding is based on the Multicast Core Key, whenever a
Node gets BUM traffic in a VLAN, a lookup is done in the local default.bridge.0
table. If remote Nodes also subscribe to that core key, then the ingress Node will have
a 0/32 route for that particular VLAN in which these next hop interfaces will be
listed:
All the local interfaces which are active in that VLAN.
All the FTE links that point to the different Nodes which subscribe to that
VLAN.
Depending upon the contents of this route, the BUM traffic is replicated out of all the
next hop interfaces that are listed.
Before sending this traffic out on the FTE ports, the ingress Node adds the fabric
header. The fabric header includes the Multicast Core Key information. Since the
Interconnects also get programmed with the Multicast Core Key information, they do
a lookup only on the fabric header and are able to determine the appropriate next
hop interfaces on which this multicast traffic should be sent.
Note that replication of BUM traffic can happen at the Interconnects as well (in case
an Interconnect needs to send BUM traffic towards two different Nodes).
Once the egress Nodes receive this traffic on their FTE links, they discard the fabric
header and do local replication (if necessary) and flood this data out on all the local
ports which are active in the corresponding VLAN.
72
QFabric technology uses a proprietary hash-based load balancing algorithm that

ensures that one Interconnect is not overloaded with all the responsibility of replicating and sending BUM traffic to all the egress Nodes. The internal details of the
load balancing algorithm are simply beyond the scope of this book.
Layer 3 Traffic (Destination Prefix is Learned on the Local Node)

The Network Node Group abstraction is the routing brains for a QFabric system,
since all the Layer 3 protocols run at the active Network Node Group-VM. To
examine this traffic, lets first look at Figure 4.8.
Figure 4.8
Network Node Groups Connections
In Figure 4.8, the prefixes for Host-A and Host-B are learned by the QFabrics
Network Node Group-VM. Both these prefixes are learned from the routers that are
physically located behind Node-1. Assuming that Host-A starts sending some traffic
to Host-B, here is the sequence of steps that would take place:
1. Data reaches Node-1. Since this is a case for routing, the destination MAC
address would be that of the QFabric. The destination IP address would be that of
Host-B.
2. Node-1 does a local lookup and finds that the prefix is learned locally.
3. Node-1 decrements the TTL and sends the data towards R2 after making
appropriate changes to the Ethernet header.
4. Note that since the prefix was learned locally, the data is never sent out on the
FTE links.
As Figure 4.8 illustrates, the QFabric system acts as a regular networking router in
this case. The functionality here is to make sure that the QFabric system obeys all
73
the basic laws of networking, such as resolving ARP for the next hop routers IP
address, etc. Also, the ARP resolution shown is for a locally connected route and that
was discussed as a separate case study earlier in this chapter.
Just like a regular router, the QFabric system makes sure that the TTL is also decremented for all IP-routed traffic before it leaves the egress Node.
Layer 3 Traffic (Destination Prefix is Learned on a Remote Node)

The scenario of a QFabric system that receives traffic to be Layer 3 routed, builds on
the previous scenario. However, here the destination prefix is located behind a remote
Node (see Figure 4.7).
OSPF is running between R3 and the QFabric system and thats how the prefix of
Host-C is propagated to the QFabric system.
Again, referring back to Figure 4.7, lets assume that Host-A starts sending traffic to
Host-C. The sequence would be:
1. Data reaches Node-1. Since this is a case for routing, the destination MAC address
would be that of the QFabric system, and the destination IP address would be that of
Host-C.
2. Node-1 does a lookup and finds that the destination prefix points to a remote
Node.
3. In order to send this data to the correct Node, Node-1 adds the fabric header and
sends the data to one of the Interconnects. Note that the fabric header contains the
destination Nodes PFE-ID.
4. Before adding the fabric header and sending the traffic towards the Interconnect,
Node-1 decrements the TTL of the IP packets by one. Decrementing TTL at the
ingress Node makes sure that the QFabric system doesnt have to include this
overhead at the egress Node. Besides, the ingress Node has to do an IP lookup.
Hence, it is the most lookup-efficient or latency-efficient way to forward packets.
Note that the egress Node will not have to do an IP lookup. The fabric header also
includes information about the port on the egress Node from which the traffic needs
to be sent.
5. The Interconnect receives this traffic from Node-1. The Interconnects always look
up the fabric header. In this case, the fabric header reveals that the traffic must be sent
to the PFE-id of Node-2. The Interconnect sends this traffic out of the 40G port that
points to Node-2.
6. Node-2 receives this traffic on its FTE link. A lookup is done on the fabric header.
Node-2 finds the egress port for which traffic should be sent out. Before sending the
traffic towards R2, Node-2 removes the fabric header.
End-to-End Ping Between Two Hosts Connected to Different Nodes

Lets consider a real-world QFabric example like the one shown in Figure 4.9.
74
Figure 4.9
Real World Example of a QFabric Deployment
Figure 4.9 shows a QFabric system that is typical of one that might be deployed in a
Data Center. This system has one Redundant-Server-Node group called RSNG-1.
Node-1 and Node-2 are part of RSNG-1 and Node-1 is the master Node. Node-3
and Node-4 are member Nodes of the Network Node Group abstraction. Lets
assume DG0 is the master, and hence, the active Network Node Group-VM resides
on DG0. The CPE and the Director devices are not shown in Figure 4.9.
Server-1 is dual-homed and is connected to both the Nodes of RSNG-1. The links
coming from Server-1 are bundled together as a LAG on the QFabric system.
Node-3 and Node-4 are connected to a router R1. R1s links are also bundled up as
a LAG. There is OSPF running between the QFabric system and router R1 and R1 is
advertising the subnet of Host-2 towards the QFabric system. Note that the routing
functionality of the QFabric system resides on the active Network Node GroupVM. Hence the OSPF adjacency is really formed between the VM and R1.
Finally, lets assume that this is a new QFabric system (no MAC addresses or IP
routes have been learned). So, the sequence of steps would be as follows.
At RSNG-1
At RSNG-1 the sequence of steps for learning Server-1s MAC address would be:
1. Node-1 is the master Node within RSNG-1. This means that the active RE resides
on Node-1.
2. Server-1 sends some data towards the QFabric system. Since Server-1 is
connected to both Node-1 and Node-2 using a LAG, this data can be received on
either of the Nodes.
2a. If the data is received on Node-1, then the MAC address of Server-1 is
learned locally in VLAN-1.
75
2b. If the data is received on Node-2, then it must be first sent over to the
active-RE so that MAC-address learning can take place. Note that this first
frame is sent to Node-1 over the CPE links. Once the MAC address of Server-1
is learned locally, this data is no longer sent to Node-1.
3. Once the MAC address is learned locally on Node-1, it must also send this Layer 2
route to all other Nodes that are active in VLAN-1. This is done using the Fabric
Control-VM as discussed in Chapter 3.
Network Node Group
The sequence for learning Host-2s prefix for the Network Node Group would be:
1. R1 is connected via a LAG to both Node-3 and Node-4. R1 is running OSPF and
sends out an OSPF Hello towards the QFabric system.
2. The first step is to learn the MAC address of R1 in VLAN-2.
3. In this case, the traffic is incoming on a Node that is a part of the Network Node
Group. This means that the REs on the Nodes are disabled and all the learning needs
to take place at the Network Node Group VM.
4. This initial data is sent over the CPE links towards the master-DG (DG0). Once
the DG receives the data, it is sent to the Network Node Group-VM.
5. The Network Node Group-VM learns the MAC address of R1 in VLAN-2 and
distributes this route to all the other Nodes that are active in VLAN-2. This is again
done using the Fabric Control-VM.
6. Note that the OSPF Hello was already sent to the active-Network Node Group
VM. Since OSPF is enabled on the QFabric system as well, this Hello is processed.
7. Following the rules of OSPF, the necessary OSPF-packets (Hellos, DBD, etc.) are
exchanged between the active-Network Node Group-VM and R1 and the adjacency
is established and routes are exchanged between the QFabric and R1.
8. Note that whenever Node-3 or Node-4 receive any OSPF packets, they send the
packets out of their CPE links towards the active DG so that this data can reach the
Network Node Group-VM. This Control plane data is never sent out on the FTE
links.
9. Once the Network Node Group-VM learns these OSPF routes from R1, it again
leverages the internal-BGP peering with the Fabric Control-VM to distribute these
routes to all the Nodes that are a part of this routing instance.
Ping on Server-1
After QFabric has learned the Layer 2 route for Server-1 and the Layer 3 route for
Host-2, lets assume that a user initiates a ping on Server-1. The destination of the
ping is entered as the IP address of Host-2. Here is the sequence of steps that would
take place in this situation:
1. A ping is initiated on Server-1. The destination for this ping is not in the same
subnet as Server-1. As a result, Server-1 sends out this traffic to its default gateway
(which is the RVI for VLAN-1 on the QFabric system).
2. This data reaches the QFabric system. Lets say this data comes in on Node-2.
76
3. At this point in time, the destination-MAC address would be the QFabrics MAC
address. Node-2 does a lookup and finds out that the destination IP address is that
of Host-2. A routing lookup on this prefix reveals the next hop of R1.
4. In order to send this data to R2, the QFabric system also needs to resolve the ARP
for R2s IP address. The connected interface that points to R1 is the RVI for
VLAN-2. Also, VLAN-2 doesnt exist locally on Node-2.
Technically, the QFabric would have resolved the ARP for R1 while forming
the OSPF adjacency. That fact was omitted here to illustrate the complete
sequence of steps for end-to-end data transfer within a QFabric system.
5. This is the classic use case in which the QFabric must resolve an ARP for a VLAN
that doesnt exist on the ingress Node.
6. As a result, Node-2 encapsulates this data with a fabric header and sends it out of
its FTE links towards the Network Node Group Nodes (Node-3 in this example.)
7. The fabric header would have Node-3s PFE-id. The Interconnects would do a
lookup on the fabric header and send this data over to Node-3 so that it could be
sent further along to the Network Node Group VM for ARP resolution.
8. Node-3 sends this data to the master DG over the CPE links. The master DG in
turn sends it to the active Network Node Group VM.
9. Once the active Network Node Group VM receives this, it knows that ARP must
be resolved for R1s IP address in VLAN-2. The Network Node Group VM
generates an ARP request packet and sends it to all the Nodes that are active in
VLAN-2. (Note that this communication takes place over the CPE network.)
10. Each Node that is active in VLAN-2 receives this ARP-request packet on its CPE
links. This ARP request is then replicated by the Nodes and flooded on all the
revenue ports that are active in VLAN-2.
11. This is true for Node-3 and Node-4 as well. Since the link to R1 is a LAG, only
one of these Nodes sends out the ARP request towards R1.
12. R1 sends out an ARP reply and it is received on either Node-3 or Node-4.
13. Since ARP learning is a control plane function, this ARP reply is sent towards
the master DG so that it can reach the active Network Node Group VM.
14. The VM learns the ARP for R1 and then sends out this information to all the
Nodes that are active in the corresponding routing instance.
15. At this point in time, Node-1 and Node-2 know how to reach R1.
16. Going back to Step# 4, now Node-2 knows how to route traffic to R2, and the
local tables on Node-2 would suggest the next hop of Node-3 to reach R1.
17. Since the data to be sent between Server-1 and Host-2 has to be routed, a Layer
3, Node-2 decrements the IP TTL and adds the fabric-header on the traffic. The
fabric-header contains the PFE-id of Node-3.
18. After adding the fabric-header, this data is sent out on the FTE links towards
one of the Interconnects.
19. The Interconnects do a lookup on the fabric-header and determine that all this
traffic should be sent to Node-3.
77
20. Node-3 receives this traffic on its FTE links and sends this data out towards R1
after modifying the Ethernet-header.
External Device Forming a BGP Adjacency

Refer back to Figure 4.9 if need be, but lets consider the scenario when there is a BGP
adjacency between the QFabric and R1 instead of OSPF. Here is how the BGP Hellos
would flow:
1. A BGP peer is configured on the QFabric system.
2. BGP is based on unicast Hellos. The QFabric system knows that the peer (R1) is
reachable through Node-4. Node-4s Data plane (PFE) is programmed accordingly to
reach R1.
3. Node-4 is a part of the Network Node Group. Hence the RE functionality is
present only on the active NW-NG-VM, which resides on the DG.
4. The active NW-NG-VM generates the corresponding BGP packets (Hellos,
updates, etc.) and sends these packets to Node-4 via the CPE network. Note that the
DGs are not plugged in to the Data plane at all. The only way a packet originated on
the DGs (VMs) makes it to the Data plane is through the CPE.
5. These BGP packets reach Node-4. Node-4s PFE already has the information to
reach R1. These packets are forwarded in Data plane to R1.
6. R1 replies back with BGP packets.
7. Node-4 looks at the destination MAC address and knows that this should be
processed locally. This packet is sent to the active NW-NG-VM via the CPE.
8. The packet reaches the active NW-NG-VM.
9. This is how bidirectional communication takes place within a QFabric system.
NOTE
This is a rather high-level sequence of events that take place for the BGP peering
between the QFabric and R1 and does not take into account all the things that need
to be done before peering, such as learning R1s MAC address, learning R1s ARP,
etc.
Test Your Knowledge

Q: Consider that an MX router is BGP peers with a QFabric system. What is the path
the BGP Hellos take?
A QFabric system can only run BGP through the Network Node Groups. Also,
the Routing Engine for the Network Node Group is present on the VMs that
run on the Director devices. Hence in this case, the BGP Hellos will enter the
QFabric system on a Node device configured to be part of the Network Node
Group. From there, the BGP Hellos would be sent over to the NW-NG-0 VM
via the Control plane Ethernet segment.
Q: If a Node device receives a broadcast frame on one of its ports, on which ports
would it be flooded?
The Node device will flood on all the ports that are a part of that VLAN's
broadcast tree. This would include all the ports on this Node that are active in
78
the VLAN and also the 40GbE FTE links (one or more) in case some other
Nodes also have ports active in this VLAN.
Q: What extra information is added to the data that is sent out on the 40GbE FTE
links?
Every Node sevice that is a part of a QFabric system adds a fabric header to
data before sending it out of the FTE links. The fabric header contains the
PFE-ID of the remote Node device where the data should be sent.
Q: How can the PFE-ID of a Node be obtained?
Using the CLI command show fabric multicast vccpdf-adjacency. Then
co-relate this output with the output of show virtual chassis CLI command.
Consider the following snippets taken from an RSNG:
qfabric-admin@RSNG0>showfabricmulticastvccpdf-adjacency
Flags:S-Stale
934TOR256n/a-1-1
934TOR512n/a-1-1
10259(s)TOR256fte-0/1/1.3276813
10259(s)TOR512fte-0/1/0.3276803
1134TOR256n/a-1-1
1134TOR512n/a-1-1
12259(s)TOR256fte-1/1/1.3276812
12259(s)TOR512fte-1/1/0.3276802
The Src Dev Id column shows the PFE-IDs for all the Nodes, while the Interface
column shows the IDs of all the interfaces that are connected to the Interconnects, but
only for those Node devices that are a part of the Node group (RSNG0 in this case).
(Note that the traditional Junos interface format is used: namely, FPC/PIC/PORT.)
You can see by the bolded output that Node device with PFE-ID of 10 corresponds to
PIC-0 and Node device with PFE-ID 12 corresponds to PIC-1.
The next step is to correlate this output with the show
virtual-chassis command:
Mstr
{master}
You can see that the Node device that corresponds to FPC-0 has the serial number of
P6810-C and the one with PFE Id of 1 has the serial-number P7122-C. The aliases of
these Nodes can then be checked by either looking at the configuration or by issuing
the show fabric administration inventory command.
80
Books for Cloud Building and Hi-IQ Networks

The following books can be download as free PDFs from www.juniper.net/dayone:
This Week: Hardening Junos Devices
This Week: Junos Automation Reference for SLAX 1.0
This Week: Mastering Junos Automation
This Week: Applying Junos Automation
This Week: A Packet Walkthrough on the M, MX, and T Series
This Week: Deploying BGP Multicast VPNs, Second Edition
This Week: Deploying MPLS

TW QFabric TrafficFlows

Загружено:

Сведения о документе

Авторское право

Доступные форматы

Поделиться этим документом

Поделиться или встроить документ

Параметры публикации

Этот документ был вам полезен?

Это неприемлемый материал?

Авторское право:

Доступные форматы

TW QFabric TrafficFlows

Загружено:

Авторское право:

Доступные форматы

THIS WEEK: QFABRIC SYSTEM TRAFFIC

FLOWS AND TROUBLESHOOTING

Junos Fabric and Switching Technologies

THIS WEEK: QFABRIC SYSTEM TRAFFIC

QFabric is an unique accomplishment making 128 switches look and function as

LEARN SOMETHING NEW ABOUT QFABRIC THIS WEEK:

Knowing how traffic

Published by Juniper Networks Books

THIS WEEK: QFABRIC SYSTEM TRAFFIC

Junos Fabric and Switching Technologies

THIS WEEK: QFABRIC SYSTEM TRAFFIC

QFabric is an unique accomplishment making 128 switches look and function as

LEARN SOMETHING NEW ABOUT QFABRIC THIS WEEK:

Knowing how traffic

Published by Juniper Networks Books

Chapter 1: Physical Connectivity and Discovery . . . . . . . . . . . . . . . . . . . . . . . . . 9

Knowing how traffic flows through your QFabric system is

2014 by Juniper Networks, Inc. All rights reserved.

Published by Juniper Networks Books

About the Author:

Welcome to This Week

What You Need to Know Before Reading

What You Will Learn From This Book

About This Book

QFabric vs. Legacy DataLCenter Architecture

Traditional Layered Data Center Topology

Different Components of a QFabric System

Components of a QFabric System

Node Groups Within a QFabric System

Chapter 3 covers these abstractions, including a discussion of packet flows.

Differences Between a QFabric System and a Virtual Chassis

Interconnections of Various Components . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10

This Week: QFabric System Traffic Flows and Troubleshooting

Interconnections of Various Components

Chapter 1: Physical Connectivity and Discovery

This Week: QFabric System Traffic Flows and Troubleshooting

Chapter 1: Physical Connectivity and Discovery

Why Do You Need Discovery?

This Week: QFabric System Traffic Flows and Troubleshooting

System and Component Discovery

provisioning CLI command shows

Chapter 1: Physical Connectivity and Discovery

This Week: QFabric System Traffic Flows and Troubleshooting

Fabric Topology Discovery (VCCPDf)

Chapter 1: Physical Connectivity and Discovery

picture of the QFabric system.

Relation Between VCCPD and VCCPDf

This Week: QFabric System Traffic Flows and Troubleshooting

Only the Control Plane Connections are Up

Chapter 1: Physical Connectivity and Discovery

Two Interconnects are Added to the CPE Network

This Week: QFabric System Traffic Flows and Troubleshooting

Two Nodes are Connected to the CPE Network

Chapter 1: Physical Connectivity and Discovery

Node-1 and Node-2 are Connected to IC-1 and IC-2, Respectively

4. In Figure 1.4 The following FTE links are connected:

4.1 Node-1 to IC-1.

4.2 Node-2 to IC-2.

This Week: QFabric System Traffic Flows and Troubleshooting

Node-1 is Connected to IC-2

6.1 There are four devices in the QFabric system.

6.2 IC-1 is connected to Node-1.

6.3 IC-2 is connected to both Node-1 as well as Node-2.

Chapter 1: Physical Connectivity and Discovery