Вы находитесь на странице: 1из 36

Fast Reroute

A high availability addition to MPLS

Shankar Rao Sohel Ahmed Qwest Communications Richard Southern Juniper Networks

Agenda
? ? ? ? ?

Overview of MPLS FRR what problem is this technology solving, and how does it work? Drivers for Qwest to implement FRR alternative options evaluated, and why FRR? Real-world scenarios experienced on the Qwest network did FRR help? Operational lessons learned, what can we do better? Conclusions

Fast Reroute
What is it?

Planning for a failure (things to consider)


?

Control plane failures ? Graceful restart


? Implemented

in each protocol

Data plane failures ? L2 based solutions (for comparisons sake)


? APS ? Link

bundling /aggregated sonet or Ethernet

? MPLS+RSVP
? Secondary

Choices for protecting a LSP

LSP ? Secondary Standby LSP ? Fast reroute


4

Recovery Speed (Secondary LSP)


?

Secondary and Standby


? Secondary:
? Patherr

Ingress LSR needs to signal new LSP when primary LSP fails
and resvtear unicast to ingress LSR

? IGP

needs to change nexthop @ ingress LSR


be additional built in delays to optimize SPF

? May

runs

? Standby
? Saves

path is pre-computed
CSPF run

of delays is in 100smS to 1S range ? Packet loss may occur until LSP is redirected by ingress LSR
? Sum
5

Example Secondary LSP


Primary path selected by CSPF
Path error

Core Router

Core Router

Core Router

Core Router

Core Router

Core Router

Core Router

Core Router

Secondary path selected by CSPF

Recovery Speed (FRR)


?

Each node along the LSPs path takes care of protecting the LSP. Request is made by including detour and fast-reroute objects in RSVP PATH messages Can be used with other protection methods since its a quasi-L2 solution, including secondary LSPs

Recovery Speed (FRR) -continued


?

Delay in switching to FRR is limited by the failure detection delay and the propagation time to update the forwarding table of the change
?

Typically in the 10s to 100s of mS window (w/o FFRR)


? Sub

50mS numbers are possible if the other reroute labels are preloaded in the forwarding plane ? 50mS times are necessary for VoIP signal sync frames
?

Vendor specific implementation details may add extra time to the switch-over depending on IGP

Packet loss is minimized to the unlucky few that were transiting the link during the failure

FRR mechanisms (detour)


?

Detour (1:1) (Juniper and Avici)


? Each

LSP has its own detour LSP ? Uses combined link and node protection

FRR Detour (link and node)

RSVP label exchange 100 200 300

A
100
Core Router Core Router

B
200
Core Router

C
300

Core Router

Payload

Payload

Payload

4000

6000

Payload

5000

Payload

Core Router

Payload
Core Router

10

FRR mechanisms (facility backup)


?

Facility backup (N:1)


? One

or more LSPs share a common detour ? Link protection (NHOP) (Juniper, Cisco, Avici, others?)
? Merge

link ? Protecting against multiple link outages ? This is where most development time has been as ISPs have an immediate need to protect critical links
? Node
? MP

Point (MP) is at the next hop, but on a different

protection (NNHOP) (Cisco)

is at the next next-hop ? This may be the next step, however graceful restart might work better here

11

FRR link-protection bypass LSP


Incoming label appears to be 300 from upstream LSR, forward as C normal

RSVP label exchange 100 200

PushB the bypass label, rewrite inner label


100
Core Router

Core Router

200

Core Router

300

Core Router

Payload

100000 200

Payload

200

Payload

Payload

Payload

Pop outerE label Core Router

12

FRR using Node Protection


Primary path selected by CSPF

Push the bypass label, rewrite innerBlabel

Incoming label appears to be from upstream LSR, forward as D normal

Core Router

Core Router

Core Router

Core Router

100000 200

200

Payload Payload

Core Router

E
Node protection

Core Router

Pop outer label

13

Complexity Comparison
?

Secondary
? ? ? ? ?

Signaled by ingress LSR only, protects path + additional constraints can be applied + tries to stay away from primary path nodes and links - additional management and planning - switch is done at the ingress router only

FRR
? ? ?

Each LSR along the path protects configured links limited path constraints (can include BW, hold and setup priorities, links to avoid etc.) + no additional path definitions configuration

14

Maintaining the protected LSP properties


?

Secondary can
? ? ?

make all the same BW requests as the primary maintain CoS requirements remain up even after primary path recovers

FRR is intended as a short term fix


? ? ?

Builds on the existing LSPs properties since majority of the LSP will remain in place BW may be shared in many-to-one (Facility) backup Forward packets only until primary can handle the problem (may include a switch to secondary)

15

Why on the Qwest network?

Fast Reroute

BB Logical
Bellevue Seattle Tukwila Spokane WA Portland OR Eugene Medford ID Boise Missoula Helena MT Billings SD Sioux Falls Cheyenne NE Salt Lake City CO Thornton WV Kansas City MO KS St Louis KY NC OK Albuquerque AR Atlanta GA Tuscon Las Cruces Dallas LA TX FL Eureka Tampa MS AL TN SC St. Cloud Minneapolis Rochester WY Logan NV San Francisco Palo Alto Sunnyvale San Jose CA UT Colorado Springs Pueblo NM Burbank Los Angeles Phoenix AZ IA Sioux City Cedar Rapids Davenport Des Moines IL WI MI Grand Forks ND Jamestown Dickinson MN Bismarck Fargo ME Duluth MI NY VT NH Boston RI

MA CT PA Philadelphia Pittsburgh MD Ashburn

Chicago Lewis Center IN OH

New York Newark

Omaha

Eckington

Denver Grand Junction

VA

TeraPOP AccessPOP RemotePOP Layer 3 Expansion (Single site in Location) Layer 3 Expansion (Multiple sites in Location) Cyber Centers Out Of Region NGS VOIP In Region

Miama

OC192 OC192 &OC48 OC48 OC12

Last Update: 06/05/02 Gary Waldner

17

Recovery Requirements
? OC48

SONET APS protected backbone circuits initially ? Partial mesh of OC192 unprotected wavelengths today, therefore need for protection at higher layer ? Higher layer protection must be comparable to SONET protection (~50ms) ? Voice/ATM traffic on IP network demanding stringent SLAs for network recovery (<100ms) ? Other SLAs - RTT < 100ms, Availability 99.999%, Packet Loss < 0.001%, Jitter < 5ms ? Need to protect (sub-second) against both link and node failures

18

Options
?

IGP tweaks
? ?

http://www.nanog.org/mtg-0202/ppt/cengiz.pdf Convergence times ? Convergence as fast as todays technology allows, ~5secs. ? Can be improved to sub second with enhancements to ISIS specification Offers protection against RE/RP failures (by keeping such failures control plane transparent), but Link failures/flapping links still a problem

Graceful Restart Mechanisms (NSF)


? ?

19

Options
?

Other HA mechanisms such as Stateful failover


? ? ?

RE/RP failures are transparent to peers/neighbors (sessions remain up) Link failures/flapping links still a problem Deployed/Field Tested implementations non-existent Both link and node protection possible Promise of recovery times in order of 10s of ms, however, Proprietary implementations New technology, burden of operationalizing

MPLS FRR
? ? ? ?

20

Operational considerations

Fast Reroute

Failure Scenarios
? Good

(or not so good) feature link fails, traffic is re-routed over backup, primary re-optimized, all happens transparently ? No special MPLS FRR monitoring needed (in theory) rely on existing NMS to flag link/node failures, FRR keeps traffic moving ? MPLS control plane anomalies harder to detect primary/backup paths setup transparently, backup paths only used for short periods of time ? Worst case if FRR croaks, IGP always available as backup
Sean Mentzer IP Architecture smentzer@qwest.net

22

Operational Concerns
? Detecting
? SNMP/Syslog

LSP Primary/Detour Outages

plane

tools - proactive monitoring of data/control

? Traffic
? Not

actively used

Management Shooting RSVP/LSPs

? Trouble

? Additional

control plane protocols to be learned and understood ? Change in data plane forwarding
? Extensive

? Bug

testing to ensure that worst case does not get worse (i.e. IGP routing as fallback works) changes transparent (to the extent possible) to existing troubleshooting methods
23

report/analysis and testing NOC/NMC/TAC

? Training
? Keep

Sean Mentzer IP Architecture smentzer@qwest.net

Fast Reroute
Real-world Scenarios

MPLS FRR Topology LSP: chi-core-03 to jfk-core-01


15

LSP Configuration Primary LSP: chicore-03 to jfkcore-01 ERO: chi-core03, jfk-core-01 Detour LSP ERO: chi-core03, chi-core-02, jfk-core-02, jfkcore-01

1
M160 M160

M160 M160

12
M160 M160

M160 M160

chi-core-01
1
M160 M160

chi-core-02
1

jfk-core-01
1
M160 M160

jfk-core-02
1

chi-core-03

6 12

jfk-core-03 7 1
M160 M160

M160 M160

6 ewr-core-01 10
1
M160 M160

ewr-core-02 1

ewr-core-03

1
M160 M160

1
M160 M160

14

M160 M160

M160 M160

kcm-core-01
1
M160 M160

kcm-core-02
1

dca-core-01
1
M160 M160

dca-core-02

kcm-core-03

18 15 3

dca-core-03

25

MPLS FRR Topology LSP: chi-core-03 to jfk-core-01


M160

M160

15

Detour LSP goes down but Primary LSP remain active, a new Detour LSP establishes Primary LSP: chicore-03 to jfkcore-01 ERO: chi-core03, jfk-core-01 New Detour LSP ERO: chi-core03, chi-core-01, ewr-core-02, ewrcore-01, jfk-core02, jfk-core-01

1
M160

M160 M160

12
M160

M160 M160

chi-core-01
1
M160

chi-core-02
1

jfk-core-01
1
M160

jfk-core-02
1

chi-core-03

6 12

jfk-core-03 7 1
M160

M160

6 ewr-core-01 10
1
M160

ewr-core-02 1

ewr-core-03
1
M160 M160

1
M160

14

M160

M160 M160

kcm-core-01
1
M160

kcm-core-02
1

dca-core-01
1
M160

dca-core-02

kcm-core-03

18 15 3

dca-core-03

26

MPLS FRR Topology LSP: chi-core-03 to jfk-core-01


M160

M160

15

Primary LSP goes down as well, traffic switch to preconfigured detour immediately, after a CSPF calculation the existing detour becomes primary Primary LSP: chicore-03 to jfkcore-01 ERO: chi-core03, chi-core-01, ewr-core-02, ewrcore-01, jfk-core02, jfk-core-01

1
M160

M160 M160

12
M160

M160 M160

chi-core-01
1
M160

chi-core-02
1

jfk-core-01
1
M160

jfk-core-02
1

chi-core-03

6 12

jfk-core-03 7 1
M160

M160

6 ewr-core-01 10
1
M160

ewr-core-02 1

ewr-core-03
1
M160 M160

1
M160

14

M160

M160 M160

kcm-core-01
1
M160

kcm-core-02
1

dca-core-01
1
M160

dca-core-02

kcm-core-03

18 15 3

dca-core-03

27

MPLS FRR Topology LSP: chi-core-03 to jfk-core-01


M160

M160

15

For the new primary LSP a new detour LSP establishes

1
M160

M160 M160

12
M160

M160 M160

chi-core-01
1
M160

chi-core-02
1

jfk-core-01
1
M160

jfk-core-02
1

ERO: chi-core03, kcm-core-01, dca-core-01, dca-core-03, jfkcore-03, jfk-core01

chi-core-03

6 12

jfk-core-03 7 1
M160

M160

6 10 ewr-core-01
1
M160

ewr-core-02 1

ewr-core-03
1
M160 M160
M160

1
M160

M160 M160

kcm-core-01
1
M160

kcm-core-02
1

14

dca-core-01
1
M160

dca-core-02

kcm-core-03

18 15 3

dca-core-03

28

M160

M160

MPLS FRR Topology LSP: chi-core-01 to ewr-core-01


15

1
M160

M160 M160

12
M160

M160 M160

Both Primary and Detour LSPs are riding on same fiber path LSP Configuration Primary LSP: chi-core-01 to ewr-core-01 ERO: chi-core01, ewr-core02, ewr-core-01 Detour LSP ERO: chi-core01, chi-core-03, jfk-core-01, jfkcore-02, ewrcore-01

chi-core-01
1
M160

chi-core-02
1

jfk-core-01
1
M160

jfk-core-02
1

chi-core-03

6 12

jfk-core-03 7 1
M160

M160

6 ewr-core-01
1
M160

ewr-core-03 1

ewr-core-03
1
M160 M160
M160

1
M160

M160 M160

kcm-core-01
1
M160

kcm-core-02
1

dca-core-01
1
M160

dca-core-02

kcm-core-03

18 15 3

dca-core-03

29

MPLS FRR Topology LSP: chi-core-01 to ewr-core-01


15

1
M160 M160

M160 M160

12
M160 M160

M160 M160

A BIG Fiber cut ! Both Primary and Detour LSP goes down MPLS fails to do the FRR. Routing falls back to traditional IGP

chi-core-01
1
M160 M160

chi-core-02
1

jfk-core-01
1
M160 M160

jfk-core-02
1

chi-core-03

6 12

jfk-core-03 7 1
M160 M160

M160 M160

6 ewr-core-01
1
M160 M160

ewr-core-02 1

ewr-core-03

1
M160 M160
M160 M160

Fate Sharing ?

M160 M160

M160 M160

kcm-core-01
1
M160 M160

kcm-core-02
1

dca-core-01
1
M160 M160

dca-core-02

kcm-core-03

18 15 3

dca-core-03

30

MPLS FRR Topology Unstable router


15

1
M160 M160

M160 M160

12
M160 M160

M160 M160

Unstable Router, kcm-core-03 All Links are constantly flapping in kcm-core-03 Both ISIS and IBGP flapping as well

chi-core-01
1
M160 M160

chi-core-02
1

jfk-core-01
1
M160 M160

jfk-core-02
1

chi-core-03

6 12

jfk-core-03 7 1
M160 M160

M160 M160

6 ewr-core-01
1

MPLS comes to rescue. Traffic traversing over this router is forwarded over the detour paths M160 M160 avoiding kcm-core-03 node all together until kcm-core-01 1 router Becomes stable. Optimizer timer comes into play

ewr-core-02 1
M160 M160

ewr-core-03

1
M160 M160
M160 M160

M160 M160

kcm-core-02
1
M160 M160

dca-core-01
1
M160 M160

dca-core-02

kcm-core-03

18 15 3

dca-core-03

31

Troubleshooting Tips
? Show

mpls lsp name <lsp name>extensive ? Show rsvp neighbor ? Show rsvp session name <lsp name> detail ? Show rsvp interface detail ? Show mpls interface ? Show log <mpls-tracefile> ? Show log <rsvp-tracefile>

Sean Mentzer IP Architecture smentzer@qwest.net

32

Lessons Learned/Conclusions

Fast Reroute

Conclusions
? ?

Sub-second protection for both control/data plane failures necessary FRR provides sub-second recovery from data plane failures today
? ?

ISIS convergence can be improved, but best times (today) are in order of multiple seconds FRR works, but
Requires implementation of a new technology ? Lacks widely-deployed interoperable implementations ? Can use enhancements such as detection of data plane liveness, SRLG, QoS/TE conformance etc. ? Manageability improvements (if doing lot of traffic management)
?

34

Conclusions
? HA

mechanisms such as Graceful Restart/Stateful failover are interesting for control plane protection ? Keep it simple (as much as possible) - by relying on pure IGP metrics, no complex traffic management ? Use TE features only when needed for example severe outage scenarios where real-time traffic must be protected (modeling required) ? IGP convergence timers must also be improved, since FRR protection is only on the core ? Tread carefully control protocol scaling limits not completely known

35

Thank you!

Вам также может понравиться