
2007 Nokia Siemens Networks. All rights reserved.

Nokia Siemens Networks is expected to start operations in Q1 2007, subject to regulatory approvals and the closing conditions.


Nokia Siemens Networks

Analysis of configuration failures
in transport networks
Part II: Configuration failures of the Ethernet and the WDM layer
Christian Merkle
Technische Universität München
Lehrstuhl für Kommunikationsnetze
Arcisstraße 21
80333 München
Cooperation NSN and Institute of Communication
Networks, TU München:
Robust Planning of Cost Efficient Next Generation
Networks (ROPOCON)
Dr. Dominic Schupke,
Dr. Claus Gruber,


Technical report



1. Introduction
2. Configuration failures in the Ethernet layer
3. Configuration failures in the WDM layer
4. Network outages due to software failures
5. Rating system for configuration failures
6. Conclusion
Acronyms
References

1. Introduction
In the Technical Report Part I [1], configuration failures of IP routers that occur
during configuration with a Command Line Interface (CLI) are described.
The most difficult configuration tasks are the configuration of filter rules and
authentication methods. Failures during these tasks happen more often because
they comprise more individual configuration steps. They also have a larger
effect on the network: if the wrong customer packets are filtered or the wrong
authentication parameters are used, a customer cannot connect to the network
and Service Level Agreements (SLAs) may be violated.
In this report, network failures during the configuration of the Ethernet and WDM
layer are analyzed. The importance of reducing network failures in an IP transport
network is shown in [2]. A study from the University of Michigan found that
router failures are responsible for 23% of downtime and router operations
are responsible for 36% of downtime in a wide area IP network. The same study
shows that IP routers have a downtime of 1519 minutes per year and carrier
grade switches have a downtime of 5.2 minutes per year.
The authors of [3] and [4] analyzed network outages of a regional network
provider between November 1997 and November 1998 and characterized the type,
frequency, origin, and duration of these failures. In [4] the authors report
a software fault that triggered failures in the communication between different
backbone routers. Due to that failure many users experienced a network outage
and increased packet loss. The outage lasted several hours until
the failure could be resolved.
In [5] Infonetics Research calculated the costs of enterprise network downtime.
The annual total cost of downtime amounts to 40.7 million dollars. Broken down
by cause, software downtime accounts for the largest share of the total costs
with 36%. The second largest share is human error with 22%.
Considering the total hours of downtime (501 hours) of the network, software
failures are the largest part with 33%. Hardware failures and
human errors are responsible for 23% and 22%, respectively. This study shows
that it is important to reduce the outage time caused by software
and hardware failures and human errors in order to reduce the operational
expenditures of the networks.
Chapter 2 describes possible configuration failures of the Ethernet layer,
and chapter 3 describes configuration failures of the WDM
layer. Finally, chapter 4 describes the impact of faulty software on network
elements in transport networks.


2. Configuration failures in the Ethernet layer

In this chapter the configuration tasks in the Ethernet layer are described. First,
general configuration tasks of the Ethernet layer are covered. The second part
of this chapter describes the configuration of Carrier Ethernet protocols.
To analyze which configuration failures can occur, the configuration
parameters of the different protocols used in the Ethernet layer are taken into
account to derive possible configuration errors during maintenance tasks and to
identify possible impacts on a network.
A very helpful feature is the configuration of multiple interfaces with the same
configuration parameters via one configuration file. However, a failure in the
configuration file leads to a configuration error on several interfaces at the
same time. The misconfiguration of the interfaces can result in loss of
connectivity if the configuration of two connected interfaces becomes
incompatible. If the wrong interface range is chosen, interfaces that should
not be configured with certain parameters are overwritten as well and may no
longer work correctly. Changing the original configuration of interfaces can
lead to packet loss or to performance degradation. This
misconfiguration can also create a loop in the network, which results in higher
traffic load, or can lead to disconnected switches, which triggers
the Spanning Tree Protocol (STP) to reconfigure itself because the network
topology has changed.
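The risk of choosing the wrong interface range can be reduced by comparing the range with an explicit list of intended interfaces before the configuration file is applied. The following Python sketch illustrates the idea. The interface naming scheme (`Gi0/1`), the range syntax, and both function names are assumptions made for this example and are not taken from any vendor CLI.

```python
import re

def expand_range(spec):
    """Expand a range such as 'Gi0/1-4' into ['Gi0/1', ..., 'Gi0/4'].

    The 'prefix + first-last' syntax is a hypothetical convention.
    """
    match = re.fullmatch(r"([A-Za-z]+\d+/)(\d+)-(\d+)", spec)
    if match is None:
        raise ValueError(f"unsupported range syntax: {spec!r}")
    prefix = match.group(1)
    first, last = int(match.group(2)), int(match.group(3))
    if first > last:
        raise ValueError(f"empty range: {spec!r}")
    return [f"{prefix}{port}" for port in range(first, last + 1)]

def unintended_interfaces(spec, intended):
    """Return interfaces covered by the range but not meant to be changed."""
    allowed = set(intended)
    return [name for name in expand_range(spec) if name not in allowed]
```

A non-empty result of `unintended_interfaces` indicates that applying the file would overwrite interfaces that were not supposed to be touched.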
Spanning Tree Protocol (STP)
Switches that are not running STP forward Bridge Protocol Data Unit (BPDU) packets,
so that other switches running STP receive these control packets and can process them.
To break loops in the network it is important to run STP on enough switches.
If STP is not running on enough switches, loops can occur in the network,
and these loops can cause a broadcast storm.
The excessive traffic and the indefinite duplication of packets can reduce the
network performance or can lead to crashed devices if the traffic load at a
device is too high for the processor.
The path cost of links can be configured at switches to influence which interface
ports are set to the forwarding state first. Interfaces with lower path costs
are preferred. Normally a lower path cost represents a higher
bandwidth on the link. If the interface of a switch is misconfigured, for example
if a higher path cost is set on an interface that is connected to a link with higher
bandwidth, then the link with the lower bandwidth is chosen first. This reduces
the performance of the network because the traffic is sent over the link with
the smaller bandwidth.
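A simple consistency check can reveal this kind of path cost misconfiguration: a faster link should not carry a higher path cost than a slower one. The sketch below assumes that the configured bandwidth and path cost of each port are available as a dictionary; the function name and the data layout are hypothetical.

```python
from itertools import combinations

def inverted_cost_pairs(ports):
    """Find port pairs whose path costs contradict their bandwidths.

    ports maps a port name to (bandwidth_mbit, path_cost).
    Returns (faster_port, slower_port) pairs where the faster port
    has the higher path cost and would therefore be less preferred.
    """
    findings = []
    for (a, (bw_a, cost_a)), (b, (bw_b, cost_b)) in combinations(sorted(ports.items()), 2):
        if bw_a > bw_b and cost_a > cost_b:
            findings.append((a, b))
        elif bw_b > bw_a and cost_b > cost_a:
            findings.append((b, a))
    return findings
```

For example, a 1000 Mbit/s port with cost 100 next to a 100 Mbit/s port with cost 19 would be reported, because STP would prefer the slower port.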
The misconfiguration of the STP timers also has a negative impact on the
performance of the network and can cause high traffic load because of indefinite
duplication of packets in the network. If the hello timer interval is too short, more
hello messages are sent and the traffic load in the network is higher.
The aging timer determines how long a switch waits without receiving
STP configuration messages. The default setting of the aging timer is 20
seconds. If the aging timer is too short, the switch tries to reconfigure the STP
because no control packet arrived within the time interval, and blocks the forwarding
of all received packets. Hence, the performance of the network is reduced
because the switches stop forwarding packets and start the reconfiguration
of the STP. If the aging timer is too long, the STP may detect
the outage of a switch too late because the switch still waits for
hello packets from the failed device. Traffic is also lost if the topology
information of the STP is not up to date, because the path on which the packets
are transmitted does not exist.
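Timer values can be validated before they are applied. The sketch below uses the parameter ranges and the relationship between the timers given in IEEE 802.1D: 2 x (forward delay - 1 s) >= max age >= 2 x (hello time + 1 s). It assumes that the "aging timer" described above corresponds to the 802.1D max age parameter, which matches the default of 20 seconds; the function name is hypothetical.

```python
def check_stp_timers(hello=2, max_age=20, forward_delay=15):
    """Return a list of problems with the given STP timers (in seconds).

    Defaults are the IEEE 802.1D default values.
    """
    problems = []
    if not 1 <= hello <= 10:
        problems.append("hello time outside 1..10 s")
    if not 6 <= max_age <= 40:
        problems.append("max age outside 6..40 s")
    if not 4 <= forward_delay <= 30:
        problems.append("forward delay outside 4..30 s")
    # A max age that is too short lets BPDU information expire too early.
    if max_age < 2 * (hello + 1):
        problems.append("max age too short for hello time")
    # A max age that is too long delays the detection of a failed switch.
    if max_age > 2 * (forward_delay - 1):
        problems.append("max age too long for forward delay")
    return problems
```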
Duplex mode configuration
Duplex mode configuration is used in Ethernet and Fast Ethernet switching,
which are used in Local Area Networks and Metro Area Networks to connect
customers to backbone networks. The port duplex mode can be set for
10/100 Mbit/s ports. For Gigabit Ethernet the port duplex mode cannot be
changed; it is full duplex only.
Duplex mismatches can result in performance degradation, intermittent
connectivity, and loss of communication. They occur if two directly connected
devices use an incompatible combination of duplex configurations. Possible
duplex mismatches are shown in Table 2-1 [6].
Device A                  | Device B                  | Result
--------------------------+---------------------------+-------------------------------------
100 Mbit/s, full duplex   |                           | Duplex mismatch
                          | 100 Mbit/s, full duplex   | Duplex mismatch
100 Mbit/s, full duplex   | 1000 Mbit/s, full duplex  | Link established, but speed mismatch
10 Mbit/s, half duplex    | 100 Mbit/s, half duplex   | Link established, but speed mismatch

Table 2-1: Duplex mismatches because of wrong duplex configuration

Duplex mismatches due to auto-negotiation result from hardware incompatibilities
or software defects. Hardware incompatibility can result from vendor-specific
features that are not specified for auto-negotiation. A mismatch
can also occur if auto-negotiation is disabled on both connected devices and the
network interface cards are configured manually. A speed mismatch
leads to a lower transmission bitrate, but transmission is still possible,
whereas a duplex mismatch leads to a link that is not established, so that
packets cannot be sent over the affected link.
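The distinction between speed and duplex mismatches can be expressed as a small check on the manually configured settings of both link ends. This is a minimal sketch; `PortConfig` and `classify_link` are hypothetical names, and auto-negotiation is deliberately not modeled.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class PortConfig:
    speed_mbit: int     # manually configured speed in Mbit/s
    full_duplex: bool   # True for full duplex, False for half duplex

def classify_link(a, b):
    """Classify the configuration of two directly connected ports."""
    if a.speed_mbit != b.speed_mbit:
        return "speed mismatch"
    if a.full_duplex != b.full_duplex:
        return "duplex mismatch"
    return "ok"
```

Running such a check over all configured links before a maintenance window closes would flag the combinations listed in Table 2-1.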
Protocol filtering
Layer 3 protocol filtering can be activated on Ethernet switches to filter Layer 3
packets of protocols like IP, IPX, and AppleTalk. If the packet filtering option
is activated on the wrong switch, the switch no longer forwards Layer 3 packets.
This results in packet loss if the route through this switch is the only
available connection to another device, or in performance degradation if there
is another link to the destination, but with lower bandwidth.
Ethernet, Fast Ethernet, and Gigabit Ethernet
To use Ethernet, Fast Ethernet, or Gigabit Ethernet, the parameters port name,
port speed, and port duplex mode must be configured. A misconfiguration of
these parameters can prevent a connection between two switches
from being established. As shown in Table 2-1, the duplex configuration on both
switches connected through a link must be compatible. If the duplex mode is
incompatible, the speed on this link may be degraded or the connection is
not established. Gigabit Ethernet is full duplex only and the duplex mode cannot
be changed.
On Ethernet interfaces the size of the maximum transmission unit (MTU) of
packets can be configured. The default MTU size for all interfaces is 1548 bytes,
and jumbo frames have a size of 9216 bytes. If the MTU size differs on
two connected switches, for example one switch has an MTU size of 2000 bytes
and the other switch has a smaller one, packets larger than the smaller MTU
cannot be processed by the switch with the smaller configured MTU size.
In this case the packets are dropped at the interface. An MTU mismatch can
occur, for example, if a jumbo-capable Gigabit Ethernet interface is connected
to a non-jumbo interface like Fast Ethernet. It can also occur if a
jumbo-capable Ethernet interface is connected to a switch
that does not support jumbo frames.
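The effect of an MTU mismatch is directional: only frames sent by the side with the larger MTU can exceed what the other side accepts. The following sketch makes this explicit; the function names and the "A->B" notation are illustrative assumptions.

```python
def lossy_direction(mtu_a, mtu_b):
    """Return the direction in which oversized frames are dropped.

    Returns None if both MTU sizes match.
    """
    if mtu_a == mtu_b:
        return None
    # Frames from the larger-MTU side may exceed the receiver's MTU.
    return "A->B" if mtu_a > mtu_b else "B->A"

def dropped_frames(frame_sizes, receiver_mtu):
    """Return the frame sizes that the receiving interface drops."""
    return [size for size in frame_sizes if size > receiver_mtu]
```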
Virtual LAN (VLAN) configuration
To configure a VLAN on an Ethernet switch, the parameters VLAN ID, VLAN
name, VLAN type, and MTU size must be configured. The possible configuration
errors and impacts are similar to the misconfiguration of VLANs on IP routers as
described in the Technical Report Part I [1]. A wrong VLAN ID can connect two
VLANs that should not be connected or can block authorized customers that want
to connect to a certain VLAN. A mismatch of the
MTU size can result in dropped packets on the interface with the smaller MTU
size as described above. The transmission of packets on this link is then only
possible in one direction. If a protocol like TCP, which waits for acknowledgments
(Acks) for sent packets, is running over this layer 2 interface, TCP may
try to resend packets because no Acks are received from the
destination. These additional packets cause additional traffic in the network and
can be responsible for congestion on links. Duplicate VLAN names are a further
possible configuration failure. This misconfiguration can also lead to
security problems because two different VLANs are able to communicate.
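Duplicate VLAN names can be detected with a simple scan over the configured VLANs. The sketch below assumes that the VLAN table is available as a list of (VLAN ID, name) pairs; the function name is hypothetical.

```python
from collections import defaultdict

def duplicate_vlan_names(vlans):
    """Return names that are assigned to more than one VLAN ID.

    vlans is an iterable of (vlan_id, name) pairs.
    The result maps each duplicated name to its sorted VLAN IDs.
    """
    ids_by_name = defaultdict(set)
    for vlan_id, name in vlans:
        ids_by_name[name].add(vlan_id)
    return {name: sorted(ids) for name, ids in ids_by_name.items() if len(ids) > 1}
```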
EtherChannel configuration
EtherChannel bundles several Ethernet links into a single logical link that
provides bandwidth up to 1600 Mbit/s or 16 Gbit/s [7]. To configure EtherChannel
on a layer 2 interface there exist five different modes for the channel-group
interface configuration command [8]. Depending on these modes, Port Aggregation
Protocol (PAgP) packets or Link Aggregation Control Protocol (LACP) packets
are exchanged between two connected interfaces. With PAgP, a Cisco
proprietary protocol, and LACP it is possible to create EtherChannels
automatically by exchanging packets between Ethernet interfaces. The five possible
modes are active, auto, desirable, on, and passive. It has to be considered
that an interface in the auto mode cannot form an EtherChannel with
another interface that is also in the auto mode, because neither interface
starts the PAgP negotiation. The same is true if two interfaces are in the
passive mode; then neither of them starts the LACP negotiation. In the on
mode the EtherChannel uses neither PAgP nor LACP,
and a usable EtherChannel only exists when an interface group in the
on mode is connected to another interface group in the on mode. All interface
ports in the on mode are grouped into the same group with similar
characteristics. A misconfiguration of the group can lead to packet loss or spanning tree
loops. It is also important that all interfaces in each EtherChannel have the
same speed and duplex mode. Otherwise the interfaces cannot communicate
with each other.
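The mode rules above can be summarized in a small compatibility check: both ends must use the same protocol and at least one end must start the negotiation, or both ends must be in the on mode. The following is a sketch with a hypothetical function name, not a reproduction of any switch implementation.

```python
PAGP_MODES = {"auto", "desirable"}
LACP_MODES = {"active", "passive"}
INITIATING_MODES = {"desirable", "active"}  # modes that start a negotiation

def forms_channel(mode_a, mode_b):
    """Return True if two connected interfaces can form an EtherChannel."""
    modes = {mode_a, mode_b}
    if not modes <= PAGP_MODES | LACP_MODES | {"on"}:
        raise ValueError(f"unknown mode in {sorted(modes)}")
    if "on" in modes:
        # The on mode negotiates nothing and only works with another on.
        return modes == {"on"}
    same_protocol = modes <= PAGP_MODES or modes <= LACP_MODES
    return same_protocol and bool(modes & INITIATING_MODES)
```

With this check, `forms_channel("auto", "auto")` and `forms_channel("passive", "passive")` are both False, which are the two cases described above.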
Quality of Service (QoS)
Just like on IP routers, QoS parameters can also be configured on switches.
Incoming packets are classified by the fields in the packets and either forwarded
or dropped depending on the match conditions. There are different classification
methods like Class of Service (CoS) or Differentiated Services Code Point (DSCP).
If the wrong classification rules are configured, packets that should be dropped
are forwarded or vice versa. Hence, a wrong classification can lead to congestion on
a link if more packets are forwarded through this link than allowed, or
to packet loss if packets are dropped at the interface. The dropped packets can
be responsible for SLA violations if the QoS agreed by contract with the
customer is not achieved.
A second option to guarantee QoS is the policing and marking of packets.
Policers can only be configured on ingress interfaces. If policers are configured the
wrong way, the wrong packets are dropped or the bandwidth of a link is wrongly
scaled down according to the policy. This can lead to performance degradation
in the network and also to performance degradation for certain applications like
IPTV or Video on Demand, which require a certain bandwidth.
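How a wrongly configured policer drops traffic can be illustrated with a token bucket, which is one common way to model policing. The report does not state which algorithm the switches use, so the class below is an illustrative assumption, as are its name and parameters.

```python
class TokenBucketPolicer:
    """Single-rate policer: packets exceeding the configured rate are dropped."""

    def __init__(self, rate_bps, burst_bytes):
        self.rate_bytes_per_s = rate_bps / 8
        self.burst_bytes = burst_bytes
        self.tokens = float(burst_bytes)  # bucket starts full
        self.last_time = 0.0

    def allow(self, now, packet_bytes):
        """Return True if a packet arriving at time `now` (seconds) conforms."""
        elapsed = now - self.last_time
        self.last_time = now
        # Refill tokens for the elapsed time, capped at the burst size.
        self.tokens = min(self.burst_bytes, self.tokens + elapsed * self.rate_bytes_per_s)
        if packet_bytes <= self.tokens:
            self.tokens -= packet_bytes
            return True
        return False
```

If `rate_bps` is configured lower than the rate an application such as IPTV needs, `allow` returns False for a share of its packets, which corresponds to the performance degradation described above.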
802.1x port based authentication
The configuration of an authentication protocol like 802.1x prevents unauthorized
users from connecting to a network. First, the interface that should use
authentication must be configured. If the authentication protocol is configured on
the wrong interface or the authentication protocol itself is misconfigured,
users may no longer be able to connect to the switch. If the authentication is not
configured on an interface, all users are able to connect to the network and the
protection of the network is no longer ensured. Both configuration failures can
lead to security holes, because the interface that should be protected may
remain open to all users.
The configuration of the RADIUS server parameters is required to enable the
authentication on the interface. These parameters are the IP address (data
type: integer) or the host name (data type: string) of the RADIUS server, the
UDP port (data type: integer) for the authentication request, and the key (data
type: string). If one of these parameters is misconfigured, for example the
encryption key does not match the encryption key on the RADIUS server, the
authentication may not work and, in the worst case, all users have access to the
network. According to [8], the default values of the Switch-to-Client
Retransmission Time and the Switch-to-Client Frame Retransmission Number should
be changed to prevent problems with other clients and the authentication server
if one client cannot be authorized correctly, for example because of a wrong
password.
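The RADIUS parameters listed above lend themselves to a basic plausibility check before they are committed. The sketch below uses the standard `ipaddress` module; the function name is hypothetical, and comparing the keys in clear text is a simplification for illustration only.

```python
import ipaddress

def check_radius_config(host, udp_port, key, server_key):
    """Return a list of problems found in the RADIUS client parameters."""
    problems = []
    try:
        ipaddress.ip_address(host)
    except ValueError:
        # Not an IP address: accept it as a host name if it looks plausible.
        if not host or any(ch.isspace() for ch in host):
            problems.append("invalid server address")
    if not 1 <= udp_port <= 65535:
        problems.append("UDP port out of range")
    # Assumption: the key configured on the server is known to the checker.
    if key != server_key:
        problems.append("shared key mismatch")
    return problems
```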
The single-host mode allows only one authorized user per
port. If one user is authorized on a port, the packets of all other users are
blocked on this port. To configure this mode, the interface-id (data type: integer)
on which this mode should run must be specified. If the wrong interface-id is
configured, this mode is enabled on the wrong interface and users
that want to connect to the network are blocked. At the same time, the
authentication mode is not enabled on the port that should run in single-host
mode, and security is not ensured.
Network security with access control lists (ACLs)
The configuration of ACLs allows the filtering of packets on an interface. There
are different ACLs, like IP ACLs to filter IP, TCP, and UDP traffic, Ethernet or
MAC ACLs to filter layer 2 packets, and three further kinds of ACLs [8] that filter
based on protocol-specific information. A misconfiguration of the ACL in use, for
example a wrong IP address or a wrong MAC address, drops packets of
customers that are allowed to connect to the network or forwards packets of
users that should be dropped. As with the authentication method, this
misconfiguration is a security hole in the network. A misconfiguration of ACL filter rules
can also be responsible for performance degradation on specific links or parts of
the network, because more traffic than allowed is routed through a certain link
or part of the network.
It is possible to configure a time range for ACLs to determine at which time
packets from certain users should be filtered. If the wrong time range is
configured, users cannot connect to the network although it should be possible,
or more users than allowed can connect to the network and the performance of
the network goes down. Both cases
can again lead to SLA violations if the guaranteed connectivity or bandwidth is
not achieved.
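A frequent source of wrong time ranges is a range that crosses midnight. The sketch below shows one way to evaluate such a range; treating the end time as exclusive is an assumption of this example, and the function name is hypothetical.

```python
from datetime import time

def in_time_range(now, start, end):
    """Return True if `now` lies in the daily range [start, end)."""
    if start <= end:
        return start <= now < end
    # The range wraps past midnight, e.g. 22:00 to 06:00.
    return now >= start or now < end
```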
Ethernet Operations, Administration, and Maintenance (OAM)
Ethernet OAM is a protocol for monitoring and troubleshooting Ethernet Wide
Area Networks (WANs). The OAM features, which are defined in IEEE 802.3ah,
are Discovery, Link Monitoring, Remote Fault Detection, and Remote Loopback.
Ethernet OAM is disabled by default on an interface and must be enabled with the
following configuration tasks. To enable it on an interface, the interface ID and
the max rate, min rate, and timeout parameters must be configured. Link Monitoring
is enabled by default when Ethernet OAM is enabled.
A wrong timeout setting can lead to a reset of the state machine of a device,
because a device declares its OAM peer down if it does not receive an OAM
message within the timeout period. If the wrong interface ID is configured, the
Ethernet OAM protocol is enabled on the wrong interface, and it is not
activated on the interface that should be monitored.
Configuring Ethernet Connectivity Fault Management (CFM) in a Service
Provider Network
To activate Ethernet CFM, it must be enabled and the domain level, archive hold
time, and continuity check message (CCM) parameters must be configured. For
Ethernet CFM two different kinds of maintenance points exist. Maintenance
Endpoints (MEPs) are at the edge of a maintenance domain and transmit CCM,
traceroute, and loopback messages. Maintenance Intermediate Points (MIPs)
are configured within a domain; they stop CCMs from lower maintenance levels
and forward CCMs from higher maintenance levels. Different maintenance levels
are useful to determine the relationship between different maintenance
domains. The larger the domain, the higher the maintenance level.
The misconfiguration of a maintenance domain can lead to an intersection of
different domains, which is not allowed because each domain should be managed
by only one entity. Also, a device that belongs to the wrong domain cannot be
managed by the entity that should be responsible for the device. A misconfiguration
of the MEPs and MIPs can lead to dropped control messages because a MIP
does not forward control messages from lower maintenance levels. Correct
monitoring, fault verification, and fault isolation may then no longer be possible.
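The rules for maintenance domains, namely no intersection and a higher level for the larger domain, can be checked on a simplified model. The sketch below represents each domain as a maintenance level plus a set of device names, which is a strong simplification (real domains are delimited per port), and the function name is hypothetical.

```python
def domain_conflicts(domains):
    """Return pairs of domain names that violate the nesting rules.

    domains maps a domain name to (level, set_of_devices).
    Two domains sharing devices are only accepted if one is strictly
    nested in the other and the larger one has the higher level.
    Identical device sets are treated as a conflict in this sketch.
    """
    conflicts = []
    names = sorted(domains)
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            level_a, devices_a = domains[a]
            level_b, devices_b = domains[b]
            if not devices_a & devices_b:
                continue  # disjoint domains never conflict
            if devices_a < devices_b:
                valid = level_a < level_b
            elif devices_b < devices_a:
                valid = level_b < level_a
            else:
                valid = False  # partial overlap or identical sets
            if not valid:
                conflicts.append((a, b))
    return conflicts
```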
IEEE 802.3ad Link Bundling
To configure IEEE 802.3ad Link Bundling on an interface, the Link Aggregation
Control Protocol (LACP) must be enabled first and the parameters port channel,
channel group, and system priority must be set. Link bundling allows multiple
Ethernet links to be aggregated into a single logical channel. LACP supports the
automatic creation of EtherChannels by exchanging LACP packets between LAN
ports. After LACP identifies correctly matched Ethernet links, it facilitates
grouping the links into an EtherChannel. To configure a port channel, the port channel
number must be set and an IP address and subnet mask must be assigned to
the EtherChannel. To associate a channel group with a port channel, the port
channel must be created and the interface type number must be configured.
Additionally, the channel group mode must be set. The configuration of the channel
group mode includes the interface as part of the port channel bundle.
A configuration failure may prevent an interface from being included in an
Ethernet bundle. This occurs if the parameters described above are configured
wrongly, for example if the port channel and the channel group are configured on
the wrong interface. Also, the port channel parameter must be configured before
the channel group parameter to enable link bundling correctly.
Otherwise the link will not be aggregated into an EtherChannel.



3. Configuration failures in the WDM layer

In this chapter the configuration of WDM components is considered. The
configuration tasks of WDM equipment do not include the configuration of
protocols as for the Ethernet layer and the IP layer. Instead, parameters like
wavelength and power threshold can be set to influence the behavior of the components.
Tunable Laser
The transmitting power of a laser must be configured correctly to ensure a
certain signal to noise ratio (SNR). A lower laser power reduces the transmission
range of the signal and the SNR, and increases the bit error rate
(BER) because of the degradation of the signal over a path. If the power of the
laser reaches an upper threshold, the laser is turned off to prevent it from causing
any damage to the network [9]. The wavelength transmitted over a link must be
configured at the laser. Today it is possible to send 80 wavelengths over one fiber
link. If the wrong wavelength is transmitted on a link, for example if the same
wavelength is sent twice over a link, these wavelengths interfere with each other
and the information cannot be received correctly at the receiver.
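A wavelength that is assigned twice on the same fiber can be found by counting the channel assignments per link. The sketch below assumes that the planned assignments are available as (link, channel number) pairs; the function name and the data layout are hypothetical.

```python
from collections import Counter

def colliding_channels(assignments):
    """Return sorted (link, channel) pairs that are assigned more than once.

    assignments is an iterable of (link_name, channel_number) pairs.
    """
    counts = Counter(assignments)
    return sorted(key for key, count in counts.items() if count > 1)
```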
The frequency of a laser can be tuned by modulating either the laser current or
the operating temperature. If a wrong temperature is chosen for a laser, the
laser emits the wrong light frequency into the fiber. Due to the wrong frequency, a
receiver or add/drop multiplexer, for example, drops the wavelength according to
its configuration, and the information transmitted on this wavelength is lost. A
laser must also be operated in a certain temperature range to guarantee optimal
functionality and lifetime. Operating the laser at a temperature significantly
higher than room temperature leads to faster aging of the laser and reduces
its lifetime. Faster aging means that the laser must be replaced
earlier, which leads to higher operating costs of the network. If the laser ages faster,
the signal power of the laser and also the SNR at the receiver decrease faster
over time. A lower SNR means that the transmission range of the signal
decreases and the signal may not be received at the next receiver.
A laser can also be modulated by direct and external modulation [10]. An example
of an external modulator is the Mach-Zehnder interferometer (MZI). To
modulate the laser with the MZI, a certain voltage is applied to the MZI so that the
two arms of the MZI interfere constructively or destructively. In the first case
there is output power at the output of the MZI, and in the second case there is
none. If the drive voltage of the MZI is misconfigured, the
signal is modulated wrongly. For example, there is output power at the
output of the MZI although there should be none, because the pulses of the
two arms of the MZI interfere constructively instead of destructively. Hence, the
signal cannot be demodulated correctly at the receiver and the sent information
is lost.



Optical Amplifier
The erbium doped fiber amplifier (EDFA) amplifies the optical signal directly
without converting it into an electrical signal. To amplify the signal, an EDFA has
a pump laser that excites ions to a higher energy level, from where they can decay
back to the lower energy level via stimulated emission of a photon. The pump
power required to get a constant output power depends on the signal wavelength.
If the pump power of the amplifier is configured wrongly, the signal is not
transmitted correctly over the whole path. If the pump power is too low, the
optical signal is not amplified enough and the SNR at the receiver is reduced.
Hence, it can happen that the optical signal is not detected correctly at the next
amplifier or destination because the SNR is too low. Conversely, if the pump
power of the amplifier is too high for the used wavelength, the spontaneous
emission increases, because more ions are pumped to a higher energy level
and are not available for the stimulated emission, and the amplification of the
signal becomes lower. The temperature of the amplifier also has an impact on its
gain: with increasing temperature the gain of the EDFA decreases. An optical
amplifier that is not working in the optimal temperature range has a lower gain;
therefore the SNR is lower and the number of transmission failures increases.
The misconfiguration described above can be responsible for the loss of
information on a link, because the signal is not correctly amplified. A lower SNR at the
output of the EDFA reduces the transmission range of the signal so that the
signal cannot be detected on the receiver side. Because EDFAs amplify multiple
wavelengths on one fiber at the same time, all information on one link can be lost.
Hence, a higher layer protocol like OSPF recalculates the routes through the
network, because the link with the misconfigured EDFA seems to be down. This
rerouting of the traffic can lead to congestion on other links, because they have
to carry the additional traffic from the misconfigured link. The misconfiguration
of an EDFA can thus affect the behavior of a higher layer protocol, and this
in turn can influence the SLAs of operators with their customers. SLA violations
occur if traffic is rerouted in the network and generates congestion on
other links, which influences the QoS parameters of those links.
Dense Wavelength Division Multiplexing (DWDM) Controller
DWDM is an optical technology that multiplexes different wavelengths together
to increase the bandwidth on fibers. The configuration of a DWDM controller
includes the setting of the transponder receive power threshold, the wavelength
channel number, and the transmit power. As described for the laser, the
transmitting power influences the SNR of the signal, so it is important to
configure the right transmitting power to ensure an accurate transmission of the signal.
The transponder receive power threshold values can range between -200 and 0,
which corresponds to a loss of signal (LOS) range of -20 dBm to 0 dBm [12].
The default power level is -18 dBm according to [12]. If a received signal is below
or equal to this threshold, the LOS alarm is raised. A misconfiguration of this
threshold, for example a value that is too small, prevents the alarm from being
raised, so that the too weak signal is transmitted on the next link. The degradation of the
signal on the next path can result in an SNR that is too low to detect the signal
correctly at the next receiver.
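The mapping between the configured threshold value and the LOS level can be written down directly. The sketch below infers from the stated ranges (-200 to 0 corresponding to -20 dBm to 0 dBm) that the value is given in units of 0.1 dBm; this unit, the default value of -180, and the function names are assumptions of the example.

```python
def threshold_to_dbm(value):
    """Convert a configured threshold value (-200..0) to dBm."""
    if not -200 <= value <= 0:
        raise ValueError("threshold value must be between -200 and 0")
    return value / 10

def los_alarm(received_dbm, threshold_value=-180):
    """Return True if the received power raises the LOS alarm."""
    return received_dbm <= threshold_to_dbm(threshold_value)
```

With the default, a signal of -19 dBm raises the alarm. With a threshold value of -200, the same signal passes without an alarm, which is the misconfiguration described above.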
Tunable receivers are able to convert wavelengths within a given range [10]. If
the receiver is misconfigured, the received wavelengths may be
filtered by the receiver. Because of the blocked wavelengths, a customer may
not be able to connect to the network or to communicate with
other customers. For example, an office of a company has no connection to
other offices of the same company. The misconfiguration of the receiver can
also result in rerouted traffic because the primary path is blocked and the traffic
must be switched to the backup path.
Reconfigurable Optical Add Drop Multiplexer (ROADM)
ROADMs allow the selection of wavelengths to be dropped and added on the
fly. This makes the planning of networks more flexible in comparison to fixed
dropped and added wavelengths. The misconfiguration of ROADMs can lead to
the dropping or adding of the wrong wavelengths. If the wrong wavelength is
dropped, the traffic transmitted on this wavelength is lost. If a backup path exists,
the traffic will be rerouted over this backup path. In the worst case, if no
backup path can be calculated, the customer cannot connect to the network and
cannot be reached by other users. A further impact of a wrongly dropped
wavelength is that the protection can fail in the case of a link failure. This happens if a
primary path fails and the backup path uses the wavelength
that is dropped at the misconfigured ROADM. Hence, the whole traffic
of the primary path is lost.
Wrongly adding a wavelength at the ROADM can be responsible for performance
degradation or the loss of the information on the link on which the
wavelength is added. If the same wavelength as the added one is
already transmitted on the link, the two signals interfere on the link and the
receiver cannot receive the signal correctly.
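Both ROADM failure cases, wrongly dropped wavelengths and colliding added wavelengths, can be found by comparing the configured add/drop sets with the plan. The sketch below models the channels of a single ROADM as sets of channel numbers; the function and parameter names are hypothetical.

```python
def roadm_findings(incoming, planned_drop, configured_drop, configured_add):
    """Compare a ROADM's add/drop configuration with the plan.

    All arguments are sets of channel numbers for one ROADM.
    """
    # Channels that are not dropped continue on the outgoing link.
    passing_through = incoming - configured_drop
    return {
        "wrongly_dropped": sorted(configured_drop - planned_drop),
        "missing_drop": sorted(planned_drop - configured_drop),
        # An added channel collides if it is already present on the link.
        "add_collisions": sorted(configured_add & passing_through),
    }
```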
Optical Cross Connect (OXC)
OXCs are required to handle more complex network topologies and large
numbers of wavelengths [10]. OXCs can also set up or take down lightpaths as
needed. A wrongly configured OXC therefore has the same impact on the network
as described for ROADMs.

Misconfiguration of patch panels

During maintenance of routers or switches, for example when an interface card is
changed, the technicians have to disconnect the fiber connections at the patch
panel. If the fibers are reconnected wrongly at the patch panel after the change, different failures can occur in the network. There could be a loop in the network, or
some of the customers could be disconnected from the network.
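One way to catch wrongly reconnected fibers is to record the cabling before maintenance and compare it with the cabling afterwards. The following Python sketch assumes that both states are available as a mapping from patch panel port to connected device port; the port names are hypothetical.

```python
from typing import Dict, List, Tuple


def cabling_differences(
    before: Dict[str, str],
    after: Dict[str, str],
) -> List[Tuple[str, str, str]]:
    """Return (panel port, expected device port, actual device port)
    for every patch panel port whose connection has changed."""
    diffs = []
    for port in sorted(set(before) | set(after)):
        expected = before.get(port, "unconnected")
        actual = after.get(port, "unconnected")
        if expected != actual:
            diffs.append((port, expected, actual))
    return diffs


# Assumed example: two fibers were swapped when they were plugged back in.
before = {"P1": "R1:if0", "P2": "R1:if1", "P3": "SW1:if0"}
after = {"P1": "R1:if1", "P2": "R1:if0", "P3": "SW1:if0"}

for port, expected, actual in cabling_differences(before, after):
    print(f"{port}: expected {expected}, found {actual}")
```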

4. Network outages due to software failures

If new, faulty software is installed on a device such as an IP router or Ethernet
switch, parts of the device or the whole device may fail. If the software failure
affects only a specific protocol, then only the links using this protocol are
affected and the other links of the router are not. If only one interface of the
device is unavailable, then only the traffic routed through that link is
affected, but all customers connected through this failed link have no
connection to the network.
If the router or switch starts to crash after a configuration change, the problem is probably software-related. Because the router or switch has crashed, all interfaces of the device are down and it is no longer available in the network. All
paths that were routed through the failed device need to be rerouted. This can
lead to performance degradation, because other links with lower bandwidth are
used as backup paths, or to congestion on other links, because all traffic
from the failed router is routed in addition to the existing traffic on a
certain link. Furthermore, the customers connected directly to the device have no
connection to the network.
Because of a software failure, a device may periodically reboot parts of its
software or, in the worst case, its operating software. A device that
periodically reboots itself loses its connection to the network, and again the directly connected customers are not able to communicate over the network. In addition,
the rebooting of a device can lead to additional control traffic in the network. Because protocols like Intermediate System to Intermediate System (IS-IS) or Open Shortest Path First (OSPF) exchange information with other routers,
the device sends hello messages or down messages to the other devices to
inform them about its current status. If this happens within a short time period,
the control messages sent by the routers cause much additional traffic in the
network. Because the rebooting router is not available in the network, protocols like OSPF and BGP also reroute the traffic over backup paths. This
rerouting of traffic can cause performance degradation on other
links or parts of the network, because the additional traffic must be processed.
Routers connected directly to the router with the faulty software need more
processing time to handle the control messages that are sent because of the
rebooting of the router.
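A periodically rebooting device could be recognized by its neighbors from the number of times it is reported down within a given time span. The following Python sketch illustrates this idea under stated assumptions: the timestamps are given in seconds, and the window length and the event threshold are hypothetical values, not values taken from any routing protocol.

```python
from collections import deque
from typing import Deque, Iterable


def is_flapping(
    down_times: Iterable[float],
    window_s: float = 600.0,
    max_events: int = 3,
) -> bool:
    """Return True if more than max_events down events fall within
    any time window of window_s seconds."""
    recent: Deque[float] = deque()
    for t in sorted(down_times):
        recent.append(t)
        # Discard events that are older than the window.
        while t - recent[0] > window_s:
            recent.popleft()
        if len(recent) > max_events:
            return True
    return False


# Four reboots within five minutes are treated as flapping.
print(is_flapping([0, 100, 200, 300]))
# Reboots that are far apart are not.
print(is_flapping([0, 1000, 2000, 3000]))
```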
If only one part of the software is rebooted periodically, the effect on the network is
smaller, because fewer customers are affected by this outage. However, if a part of a
company is not reachable, Service Level Agreements (SLAs) may be violated and the Internet Service Provider (ISP) has to pay penalties.
If a device fails, not only the directly connected links and devices can be affected, but also links and devices in other parts of the backbone network. Protocols like Border Gateway Protocol (BGP) have information about possible destinations of the whole network in their routing tables, and if a device that is
part of a certain route fails, a new route must be calculated.
Some possible causes of system crashes are described in [13]. Address errors,
arithmetic exceptions, cache error exceptions, and error interrupts are failures that
can lead to software crashes. It is also described that if the memory of a router
runs too low, the router reboots itself and reports this as a software-forced crash. While the router is rebooting, all connections to other devices are
down and no communication is possible. Paths through this failed router
must therefore be recalculated, which can lead to service outages of a few minutes until
the new path is calculated. The performance of a service can also be degraded,
because the backup path does not have the same available bandwidth as the primary path.
A mismatch of software versions on different devices can also cause
the communication between two devices to fail. If a newer software version supports a certain feature that is needed to establish a connection and
the older one does not support it, the connection between the two devices may fail. If such a software conflict occurs on a device at the
edge of a backbone network, all users connected to this device cannot connect
to the network.
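Such a conflict can be checked in advance if it is known which features each software version supports. The following Python sketch assumes a hypothetical feature table; the version numbers and feature names are invented for illustration and do not describe any real software release.

```python
from typing import Dict, Set


def missing_features(
    required: Set[str],
    supported: Dict[str, Set[str]],
    version_a: str,
    version_b: str,
) -> Set[str]:
    """Return the required features that are not supported by both versions."""
    common = supported.get(version_a, set()) & supported.get(version_b, set())
    return required - common


# Hypothetical feature table, for illustration only.
supported = {
    "1.0": {"link-aggregation"},
    "2.0": {"link-aggregation", "port-authentication"},
}

print(missing_features({"port-authentication"}, supported, "1.0", "2.0"))
```

An empty result means that both devices support every feature that the connection needs.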
Upgrading a switch or router by downloading the wrong files to the device
and by deleting the image file can also corrupt the software of the device [14].
The device then does not pass the power-on self-test and there is no connectivity. All
routes going through this device must be rerouted, with the same impact as
described for a system crash above.
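A common safeguard against installing a wrong or damaged file is to compare a checksum of the downloaded image with a reference value before the old image is deleted. The following Python sketch shows the idea with SHA-256 from the standard library; it is not the procedure described in [14], and the image content used here is a stand-in for a real file.

```python
import hashlib


def image_matches(image_bytes: bytes, expected_sha256: str) -> bool:
    """Return True if the SHA-256 digest of the image equals the reference."""
    return hashlib.sha256(image_bytes).hexdigest() == expected_sha256.lower()


# Stand-in for the content of a software image file.
image = b"example image content"
reference = hashlib.sha256(image).hexdigest()

print(image_matches(image, reference))
print(image_matches(b"wrong file", reference))
```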
Software bugs on routers can be exploited by denial-of-service attacks to shut down the
router. As reported in [15], Juniper and Cisco routers had such a software bug
and the faulty software needed to be patched. The router packet filters were not
able to filter the denial-of-service attack packets. The Cisco router software Internetwork Operating System (IOS) additionally had a failure in its BGP
implementation, and it was possible to shut down the router through this security
hole. In [16] a further bug in Cisco's IOS is reported that can be used to create
a buffer overflow and to gain control over the router.
In [17] buffer leaks are described, which are identified as software bugs. The
symptom of such a buffer leak is a full input queue. If the input queue of an interface is full, the interface is called a wedged interface, and a router does not forward traffic that comes from a wedged interface. If the traffic from a certain link
is not forwarded, the communication through this link is lost. Buffer leaks are
often misinterpreted as bursts of traffic [17].
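A burst of traffic fills the input queue only for a short time, whereas a leaked buffer keeps it full. This difference can be used to tell the two apart. The following Python sketch assumes that the input queue depth is sampled periodically; the sample values and the required number of consecutive full samples are assumptions and are not taken from [17].

```python
from typing import Sequence


def looks_wedged(
    queue_depths: Sequence[int],
    queue_limit: int,
    min_samples: int = 5,
) -> bool:
    """Return True if the input queue was at or above its limit for at
    least min_samples consecutive samples."""
    run = 0
    for depth in queue_depths:
        # Count consecutive full samples; any drained sample resets the count.
        run = run + 1 if depth >= queue_limit else 0
        if run >= min_samples:
            return True
    return False


# The queue stays full: this points to a leak rather than a burst.
print(looks_wedged([76, 76, 76, 76, 76], queue_limit=75))
# The queue drains between peaks: this looks like bursts of traffic.
print(looks_wedged([76, 10, 76, 5, 76], queue_limit=75))
```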

5. Rating system for configuration failures

In this chapter a rating system is developed to indicate which configuration failures are more critical in terms of impact, frequency, and SLA violations. As in Technical Report Part I, two different views are considered:
the customer view and the operator view. This differentiation is important, because the operator focuses on avoiding SLA violations, whereas a customer wants high availability of the network. Hence, certain failures
have a different weighting depending on the point of view. Table 5-1 shows the rating
of the individual configuration failures.




Columns: failure, impact, frequency, SLA violation

Row labels: Ethernet layer; MTU size; Filter rules

  Impact:    No processing of packets, packet
  Frequency: Initial configuration

  Impact:    Filtering or throughput of wrong packets, no connection to other devices, disconnected customer, security
  Frequency: All protocols, every time a new service is

  Impact:    disconnected customer, security
  Frequency: All protocols, per interface, depends also on the installation of new services

  Impact:    Dropped packets, no connection, security hazard
  Frequency: Per interface, every time new service is installed, adaptation of the network

Row labels: Old configuration; WDM layer; Patch Panel; Wavelength assignment

  Impact:    Loop, performance degradation, higher path costs, congestion
  Frequency: Initial configuration

  Impact:    No connection, higher delay
  Frequency: Every protocol, modification of the protocol

  Impact:    Wavelength shift, less signal amplification, faster aging of component
  Frequency: Initial configuration

  Impact:    Wavelength shift, less signal amplification, faster aging of component
  Frequency: Initial configuration

  Impact:    Disconnected customer, loop, security hazard
  Frequency: Adding new paths, reconfiguration of services

  Impact:    Wavelength filtered at receiver, interference of the same wavelength
  Frequency: Adding new paths, new services

Row labels: Software Version; Buffer leak; Software failure

  Impact:    Security hazard, New functions not
  Frequency: Installing new software upgrade

  Impact:    Higher traffic, dropped packets
  Frequency: Installing new software upgrade

  Impact:    Rebooting of device, rebooting parts of the device, shutdown of device, higher traffic load, security hazard
  Frequency: Installing new software upgrade

Table 5-1: Rating of the different failures

For the three influence factors impact, frequency, and SLA violations, the same
weighting is used as for the rating system of the IP layer. The impact of a failure
is set to 1 if it leads to a higher delay of packets or to QoS degradation on a link
or path. The loss of a connection between nodes is given a weight of 2,
because it could lead to SLA violations. The frequency of configuration errors
is weighted similarly. If a configuration task is done once, at the beginning of
setting up a network, the weight of a failure is 1. Otherwise, the value for
configuration failures of periodically performed configuration tasks is set to 2. The
value for SLA violations caused by configuration errors is set to 0 if they cause
no SLA violations and to 1 if they cause SLA violations.
To take the two different views into account, the impact of configuration
failures is weighted with a factor of 2 to calculate the rating value for the customer view, and the value for SLA violations is weighted with 2 to calculate
the value for the provider view. The general value of a configuration failure is calculated using equation 1. The value for the customer view is
calculated with equation 2 and the value for the provider view is calculated with
equation 3.

Rating = (Impact + Frequency + SLA violation) / weighting,    Rating ∈ [0;1]    (1)

Rating = (2*Impact + Frequency + SLA violation) / weighting,    Rating ∈ [0;1]    (2)

Rating = (Impact + Frequency + 2*SLA violation) / weighting,    Rating ∈ [0;1]    (3)
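The three equations can be evaluated with a small Python sketch. The report does not state the value of the divisor "weighting"; the sketch assumes that it is the largest value the numerator can reach, so that the highest possible rating is exactly 1. This normalization and the function name are assumptions made for illustration.

```python
def rating(
    impact: int,
    frequency: int,
    sla_violation: int,
    impact_weight: int = 1,
    sla_weight: int = 1,
) -> float:
    """Rating of a configuration failure, normalized to at most 1."""
    # Value ranges as defined in the text.
    if impact not in (1, 2) or frequency not in (1, 2) or sla_violation not in (0, 1):
        raise ValueError("impact and frequency must be 1 or 2, SLA violation 0 or 1")
    # Assumption: the divisor is the largest attainable numerator.
    weighting = impact_weight * 2 + 2 + sla_weight * 1
    numerator = impact_weight * impact + frequency + sla_weight * sla_violation
    return numerator / weighting


# A lost connection (impact 2) in a one-time task (frequency 1) with an SLA violation.
general = rating(2, 1, 1)                    # equation 1
customer = rating(2, 1, 1, impact_weight=2)  # equation 2
provider = rating(2, 1, 1, sla_weight=2)     # equation 3
print(round(general, 2), round(customer, 2), round(provider, 2))
```

Under this assumption the divisor is 5 for the general rating, 7 for the customer view, and 6 for the provider view.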


At the Ethernet layer the misconfiguration of filter rules, the authentication method, the QoS configuration, and the use of old configuration files have the highest weighting of the configuration errors. As for the IP layer, these failures can
lead to disconnected links and, hence, to disconnected customers. This can
result in SLA violations for the provider.
At the WDM layer, failures during the installation of fiber cables at the patch panel and failures in the wavelength assignment in the network have the highest
weighting. The wrong installation of fibers at the patch panel can lead to a loop
in the network or can cause disconnected links in the network. This
kind of failure can also cause security hazards, because customers
may gain access to other Virtual Private Networks.
A wrong wavelength assignment can lead to the blocking of wavelengths at optical devices or cause interference on a link if the same wavelength is
sent twice on it. Both failure scenarios cause disconnected links between nodes
in the network and reduce the availability of the network. Assigning the wrong
power or temperature to a device can cause a higher degradation of
signals going through this device, so that the received power at the end of a
path is lower than desired. Communication between the devices is still possible as long as the signal power does not fall below a certain threshold. This is the reason why this kind of failure has a lower rating than a wrong wavelength assignment.
Finally, the impacts of software failures of nodes were considered. Again the
same rating system as for the other layers was used. For the evaluation of the
software, three failure scenarios were considered: wrong software version, software failures, and buffer leaks. Faulty software has the highest rating of the
three failure categories. These failures can lead to a multitude of failure scenarios and usually affect several connections at the same time. For example, faulty
software can cause a device to reboot periodically. Hence, the
node and all links connected to this node are no longer available in the network.
The periodic updates of the software, for example to close a security hole or to install new functionality on the device, increase the probability and
the frequency of software failures. A further source of error is the use of different
software versions on the different network nodes. The different software versions can lead to disconnected links, because certain functionalities that are used by a new service may
not be supported by the older software version.
Considering the customer and the provider view, it can be seen that most
failures have a similar rating, especially the Ethernet and the software failures. As mentioned before, the reason for the similar weighting is the relationship between the impact and possible SLA violations of the considered failures.
SLA violations occur if a connection is down or the QoS of a service is lower
than defined in a contract between a provider and a customer. Most failures
in Table 5-1 can lead to an SLA violation, because their impacts are disconnected
links or paths. But such effects also result in a lower availability,
which was the main criterion for the customer view of configuration failures. So
there is a correlation between impact and SLA violation.

6. Conclusion
In this technical report possible configuration failures of the Ethernet and WDM
layers and faulty software are considered. On the Ethernet layer, failures similar
to those described for the IP layer in [1] are possible. At the WDM layer, configuration
failures are related to wrong power settings, wrong wavelength switching, and
wrong operating temperature. More and more reconfigurable devices like
ROADMs or OXCs are used in backbone networks, because they give carriers more flexibility when planning their networks. But as described, the more reconfigurable devices are used in the network, the more configuration failures can
happen when installing the devices. Hence, it is important to reduce misconfiguration in order to reduce the costs of network outages as shown in [5].
The rating of the configuration failures gives an impression of the severity
of the impact of the failures in the network. The rating reflects how many customers could be affected by a failure and how often the failure
happens. A rating of 1 means the configuration failure has a larger impact on the network and a rating of 0 means it has a lower impact. The misconfiguration of the authentication method and of the ACLs has a larger impact on
the network, because blocked connections from customers to the network
could lead to SLA violations. A wrongly configured filter can also lead to a security hazard and to attacks in the network. The misconfiguration of an optical
component is also a significant failure in the network. A wrong power or temperature configuration of a laser may lead to a shift in the wavelength and so
to less amplification of the light signal and to a lower SNR.
To prevent configuration failures, automating the configuration and providing fallback
mechanisms for single devices, for example a standard configuration, could be
one step to reduce the complexity of configuration tasks. The development of
signaling protocols that detect degradation effects of optical elements, such as an optical amplifier [18], before the device fails can help to reduce network outages.


Acronyms

ACL     Access Control List
BER     Bit Error Rate
BGP     Border Gateway Protocol
CLI     Command Line Interface
DWDM    Dense Wavelength Division Multiplexer
EDFA    Erbium Doped Fiber Amplifier
IOS     Internetwork Operating System
IS-IS   Intermediate System to Intermediate System Protocol
ISP     Internet Service Provider
LACP    Link Aggregation Control Protocol
LOS     Loss of Signal
MTU     Maximum Transmission Unit
MZI     Mach-Zehnder Interferometer
OSPF    Open Shortest Path First
OXC     Optical Cross Connect
PAgP    Port Aggregation Protocol
PLIM    Physical Layer Interface Module
QoS     Quality of Service
ROADM   Reconfigurable Optical Add Drop Multiplexer
SLA     Service Level Agreement
SNR     Signal to Noise Ratio
STP     Spanning Tree Protocol
WDM     Wavelength Division Multiplex
VLAN    Virtual LAN


References

[1] Christian Merkle, Analysis of configuration failures in transport networks, Part I: Configuration failures of the IP layer, Technical Report, Nokia Siemens Networks, 2007

[2] G. Hudson Gilmer, Part 1 in the Reliability Series: Examining the Cost of Poor Quality in IP Networks, White Paper, 2001

[3] Craig Labovitz, Abha Ahuja, Farnam Jahanian, Experimental Study of Internet Stability and Backbone Failures, Fault-Tolerant Computing, 278-285,

[4] Craig Labovitz, Abha Ahuja, Farnam Jahanian, Experimental Study of Internet Stability and Wide Area Backbone Failures, University of Michigan, CSE-TR-382-98, 1998

[5] Rob Dearborn, Michael Howard, Susan Klarich, Laura Whitcomb, Richard Webb, Jeff Wilson, The Costs of Enterprise Downtime, North America 2004, Infonetics Research, February 2004

[6] Cisco, Troubleshooting Cisco Catalyst Switches to NIC Compatibility Issues, 2005

[7] Cisco Systems, Catalyst 6500 Series Software Configuration Guide, 5.5, uration/guide/channel.html, last seen 2007

[8] Cisco Systems, Catalyst 2950 and Catalyst 2955 Switch Software Configuration Guide, 12.1(14)EA1, e/12.1_14_ea1/configuration/guide/Sw8021x.html#wp1025467, last seen

[9] Carmen Mas, Patrick Thiran, Jean-Yves Le Boudec, Fault Localization at the WDM Layer, Photonic Network Communications, 1999

[10] Rajiv Ramaswami, Kumar N. Sivarajan, Optical Networks: A Practical Perspective, Second Edition, Morgan Kaufmann Publishers, 2002

[11] J. Kemtchou, M. Duhamel, P. Lecoy, Gain temperature dependence of erbium-doped silica and fluoride fiber amplifiers in multichannel wavelength-multiplexed transmission systems, Journal of Lightwave Technology, vol. 15, pp. 2083-290, 1997

[12] Cisco Systems, Dense Wavelength Division Multiplexing Commands on Cisco IOS XR Software, last seen 2007

[13] Cisco Systems, Less common types of system crashes, te09186a008010876d.shtml#ts, last seen 2007

[14] Cisco Systems, Cisco IOS Desktop Switching Software Configuration Guide, Release 12.0(5)XU, e12.0_5_xu/scg/kitrbl.html, last seen 2007

[15] ChannelPartner, Bugs in Router-Software von Cisco und Juniper, 2005, last seen 2007

[16] Computerwoche, Cisco stopft Sicherheitsloch in Router Betriebssystem IOS, 2005, last seen 2007

[17] Cisco Systems, Troubleshooting Buffer Leaks, 86a00800a7b85.shtml, last seen 2007

[18] Lutz Rapp, Quality Surveillance Algorithm for Erbium-Doped Fiber Amplifiers, Workshop on Design of Reliable Communication Networks (DRCN),
