
Strategies and Tactics:

Application Troubleshooting Simplified

How to speed time-to-root cause with network traffic recording and application-centric analysis

Fluke Networks Is…

Global

A $300+ million company

Profitable since its inception as a separate operating entity

Over 800 employees worldwide serving customers in more than 120 countries

Approximately 30% of revenue from outside the U.S.

Worldwide Headquarters: Everett, WA

Major Facilities: Colorado Springs, CO; Duluth, GA; Bridgewater, NJ; Rockville, MD

Sales Offices & Associates: Worldwide

Technical Assistance Centers: Everett, WA; Eindhoven, NL

Fluke Networks Is…

Thriving

Backed by a $12B corporate parent, Danaher Corporation

Fluke Networks and Fluke are both part of Danaher (NYSE: DHR)

Trusted

Trusted by 98 of the Fortune 100 who use Fluke Networks solutions to deploy, solve, manage and optimize their networks.

Today’s Challenges

Agenda

Complexity – Applications and the infrastructure that delivers them

Change – You think you know your network? Wanna bet?

Triage – Determining just who owns this problem

Root cause analysis (RCA) – What is the specific cause of latency?


Best Practices

Getting in the Path of the Packets

Capturing all the Packets

Discovering Problems before the Customer Discovers Them

Resolving Problems in a Timely Manner

Challenges

Complexity? You got it!

…and this is just the view of the app inside of a data center. What is happening in that user’s network?

Change – You think you know your network?

The network is in a constant state of change

Making assumptions about

the path packets take

utilization levels on ports

traffic distributions

Can lead to increased problem resolution time

Without a clear view of the current state of the network, it is very difficult to quickly resolve network and application related problems

Triage – Determining just who owns the problem

“It’s the network!”

Whether it is a network problem, server problem, or application problem, the network always gets blamed first

The faster we can determine the fault domain of the problem, the faster we can get the right resources working on it

Root Cause Analysis (RCA)

Without a history of normal network operation, it is difficult to determine what is not normal

Keeping a history of:

Utilization levels

Roundtrip Latencies

Protocol Distributions

Packet captures of working applications

Allows us to get to the root of the problem, without chasing symptoms that are not really part of the problem

What best practices address the challenges?

Challenges

Complexity

Change

Triage

Root Cause Analysis

Best Practices

Getting in the path of the packets

Capturing all the Packets

Discovering Problems before the customer does

Resolving Problems in a Timely Manner

Best Practices

Getting in the Path of the Packets

No Network Documentation

Understanding Application Dependencies

Tapping Technologies

Virtual Machines

Capturing all the Packets

High Bandwidth Utilization

What Happened Yesterday at 3pm?

Discovering Problems before the Customer Does

Network is the backbone for everything

Automatically picking out problems from Gigabytes worth of data

Resolving Problems in a Timely Manner

Need better understanding of how applications work

Remote offices

Getting in the Path of the Packets

Why is this important?

In order to analyze the application traffic and troubleshoot the problem, we must have the application packets in the capture buffer. The only way to get these packets into the buffer is to get in the path of the packets between the client and server

Getting in the Path of the Packets

Knowing the flow of the packets

Network administrators often think they know the path packets are taking through the network

In many cases, however, the packets are taking a different path, which makes the troubleshooting process much more difficult

Without knowing the exact path, we cannot guarantee that we are in the path of the packets

Not only must we know the Layer 3 flow of the packets, but we also need to know the Layer 2 flow of the packets
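
As a rough illustration of what a Layer 2 trace does, the sketch below follows a destination MAC address from switch to switch using each switch's forwarding table. It assumes the forwarding tables and neighbor links have already been collected (for example via SNMP); the switch names, ports, and MAC address are hypothetical, and this is not how any particular product implements the feature.

```python
from typing import Dict, List, Tuple

# Hypothetical data: per-switch MAC forwarding table (MAC -> egress port)
FDB: Dict[str, Dict[str, str]] = {
    "access-sw1": {"00:11:22:33:44:55": "Gi0/24"},
    "core-sw1": {"00:11:22:33:44:55": "Gi1/2"},
    "server-sw1": {"00:11:22:33:44:55": "Gi0/7"},
}

# Hypothetical data: which switch sits at the far end of an uplink port
NEIGHBORS: Dict[Tuple[str, str], str] = {
    ("access-sw1", "Gi0/24"): "core-sw1",
    ("core-sw1", "Gi1/2"): "server-sw1",
}


def trace_layer2(start_switch: str, dest_mac: str) -> List[Tuple[str, str]]:
    """Follow forwarding tables hop by hop until an edge port is reached."""
    path: List[Tuple[str, str]] = []
    visited = set()
    switch = start_switch
    while switch is not None and switch not in visited:
        visited.add(switch)  # guard against loops in the collected data
        port = FDB.get(switch, {}).get(dest_mac)
        if port is None:
            break  # MAC not learned here; the path is unknown past this point
        path.append((switch, port))
        switch = NEIGHBORS.get((switch, port))  # None means an edge port
    return path


for hop in trace_layer2("access-sw1", "00:11:22:33:44:55"):
    print(hop)
```

Every (switch, port) pair in the result is a candidate point for a span or tap.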

Demonstration of Layer 2 and Layer 3 Traceroute


Knowing Application Dependencies

Why is this important?

Unless we know the devices on which a client, server, service or application is dependent, we do not know which paths to monitor

Once the conversations are documented, we can isolate the path between the two endpoints and connect the monitoring equipment
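
A minimal sketch of documenting conversations: given flow records (client, server, server port), build a dependency map and walk it from one client to list every conversation that would need to be covered. The addresses and ports are hypothetical sample data, and the record format is an assumption.

```python
from collections import defaultdict

# Hypothetical flow records: (client, server, server_port)
flows = [
    ("10.1.1.20", "10.0.0.10", 80),    # user -> web server
    ("10.0.0.10", "10.0.0.20", 8080),  # web server -> application server
    ("10.0.0.20", "10.0.0.30", 1433),  # application server -> database
    ("10.1.1.20", "10.0.0.5", 88),     # user -> single sign-on
    ("10.1.1.20", "10.0.0.10", 80),    # repeated conversation, counted once
]


def build_dependencies(records):
    """Map each host to the set of (server, port) pairs it talks to."""
    deps = defaultdict(set)
    for client, server, port in records:
        deps[client].add((server, port))
    return deps


def paths_to_monitor(deps, start):
    """Walk the dependency map from one host and list each conversation once."""
    seen, stack, links = set(), [start], []
    while stack:
        host = stack.pop()
        if host in seen:
            continue
        seen.add(host)
        for server, port in sorted(deps.get(host, ())):
            links.append((host, server, port))
            stack.append(server)
    return links


for link in paths_to_monitor(build_dependencies(flows), "10.1.1.20"):
    print(link)
```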

Host Conversations

[Diagram: Data Center with DHCP Server, Single Sign-on Server, Web Servers, Application Servers, Database Servers]

OptiView Host Conversations Demonstration


Getting in the path

Once we know the exact path of the packets, it is time to get into that path

There are three common methods of getting in the path and capturing the packets

Hub

Span

Tap

Each of these methods has its own pros and cons

Hubs

Pros

Cheap

Available

Easy to install

Cons

Reduce link to half duplex

May not be a true hub

Not practical on servers or switch uplinks

If power drops, link drops

10/100 Mbps speeds

Span/Mirror Ports

Pros

Free

Available

Does not require link to be dropped

Great for one-time link monitoring

Cons

Requires switch access

Configuration mistakes can result in network outages

Can quickly become oversubscribed

Requires a free switch port

[Figure: Catalyst 3550 switch front panel]

Taps

Pros

Truly monitors full-duplex traffic

If power is lost, link stays active

Can monitor gigabit links without packet loss

Once installed, can stay in place

Cons

Most expensive option

Have to break the link to install

Can oversubscribe the monitor port and drop packets

Tap Deployment

Analysis equipment can be quickly connected to the network, without the need for configuration changes

Aggregators can be used to merge the traffic from multiple taps into a single stream

This allows a single analyzer to monitor traffic at multiple locations as well as redundant paths
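
Because a tap copies both directions of a full-duplex link, merging several taps into one stream can exceed the capacity of the monitor port. The arithmetic can be sketched as follows; the link loads are hypothetical figures, not measurements.

```python
def aggregate_load_mbps(links):
    """Sum both directions of every tapped link (a tap copies TX and RX)."""
    return sum(tx + rx for tx, rx in links)


def is_oversubscribed(links, monitor_port_mbps):
    """True when the merged stream cannot fit on the monitor port."""
    return aggregate_load_mbps(links) > monitor_port_mbps


# Hypothetical peak loads (tx_mbps, rx_mbps) on three tapped gigabit links
tapped = [(300, 150), (400, 200), (100, 50)]

print(aggregate_load_mbps(tapped))       # 1200
print(is_oversubscribed(tapped, 1000))   # True: a 1 Gbps monitor port would drop packets
print(is_oversubscribed(tapped, 10000))  # False: a 10 Gbps monitor port has headroom
```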


Tap Deployment

Having taps deployed at key locations provides easy access for the analysis equipment

These points include:

In front of server farms

At the Internet connection

Switch Uplinks

Demarcation Points between Responsibilities

Tap Deployment

[Diagram: Data Center with DHCP Server, Single Sign-on Server, Web Servers, Application Servers, Database Servers]

Capturing in a Virtual Machine Environment

The use of virtual servers creates unique challenges when it comes to capturing packets

There is no place to attach a physical tap

The analyzer must be installed on the same virtual server as the virtual machines that are to be monitored

Using the vSwitch within the virtual server, the traffic can be spanned from the virtual machines to the virtual analyzer machine

Capturing in a Virtual Machine Environment


NTM Connecting to the vSwitch


Capturing all the Packets

Why is this important?

Without all of the packets, we are not able to analyze the application traffic. Some examples are:

VoIP Traffic – If we do not capture all of the traffic, the analyzer will report a lower Mean Opinion Score (MOS) due to packet loss

TCP Traffic – The analyzer will report missing segments, which will give the appearance that packets are being lost on the network, when in fact they are not

If the capture buffer is not big enough, the packets will roll out of the buffer before anyone knows the problem even occurred
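
How quickly packets roll out of the buffer is simple arithmetic: buffer size in bits divided by the average data rate. A small sketch, with hypothetical buffer sizes and traffic rates:

```python
def retention_seconds(buffer_bytes: float, avg_rate_bps: float) -> float:
    """Time until the oldest packet is overwritten, assuming a steady data rate."""
    return buffer_bytes * 8 / avg_rate_bps


# Hypothetical 256 MB capture buffer on a link averaging 1 Gbps
print(retention_seconds(256e6, 1e9))        # about 2 seconds

# Hypothetical 4 TB packet store on the same link, expressed in hours
print(retention_seconds(4e12, 1e9) / 3600)  # just under 9 hours
```

Real traffic is bursty, so the steady-rate assumption gives only a rough estimate.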

High Performance Packet Capture

In order to capture all of the packets, we must

Have a hardware capture card that can keep up with the data rate of the network. This was easy in the 10/100 days, but with the deployment of 10 Gigabit networks it has become much more difficult

Apply capture filters to the captured packets and discard those that do not match the filter

Transfer the filtered packets to the storage system at a rate equal to the data rate of the network

Index the captured traffic in such a way that it can be retrieved quickly by the protocol analyzer

High Performance Packet Capture

1. Ethernet traffic is captured from multiple ports at full line rates by an FPGA-based capture card – hardware filters supported

2. Entire frames are sent to the Packet Store repository for storage and post-analysis

3. Entire frames are also sent to the various analytical and real-time monitoring engines that process, classify and index data – this information is stored in the metadata database

4. Atlas is the software interface that provides access to the rich network metadata information collected and created

5. For troubleshooting and in-depth network analysis, a packet view engine facilitates fundamental protocol and multi-segment flow analysis

10 Gig Packet Capture

H/W filter & frame de-duplication

Full Line Rate Capture with 2Gbps buffer

Fast PCI-e bus


10Gbps Adapter Card (2*10G XFP)

1Gbps Adapter Card (4*1G SFP)

How Stream to Disk Works

All NTMs use RAID controller for high performance stream-to-disk

All NTMs carry multiple disks to support multi-thread storage

Large storage capacity models support RAID5 for redundancy

All NTMs are specified with “true” capacity:

True packet storage space

Additional storage available for OS and metadata

Deployment of Stream to Disk

Typically deployed in the data center, in front of application servers, database servers, and VoIP servers

For troubleshooting purposes, can be deployed in a portable fashion to capture traffic over long periods of time

Going Back in Time

Often a problem occurs, but no one reports it until several hours or days later

The ability to go back in time allows the network analyst to search through the captured packets quickly to extract those packets related to the problem

The Network Time Machine provides the ability to select traffic by interface, time range, and device address
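
The idea of selecting stored traffic by interface, time range, and device address can be sketched with a timestamp-sorted index and a binary search. This illustrates the general technique only; it makes no claim about how the Network Time Machine stores or indexes packets, and the record fields are assumptions.

```python
import bisect
from dataclasses import dataclass


@dataclass
class PacketRecord:
    ts: float        # capture timestamp, seconds since epoch
    interface: str   # capture interface name (hypothetical)
    src: str
    dst: str


class PacketIndex:
    """Keeps records sorted by timestamp so a time range is found by binary search."""

    def __init__(self, records):
        self.records = sorted(records, key=lambda r: r.ts)
        self.times = [r.ts for r in self.records]

    def query(self, start, end, interface=None, address=None):
        lo = bisect.bisect_left(self.times, start)
        hi = bisect.bisect_right(self.times, end)
        hits = self.records[lo:hi]  # everything inside the time window
        if interface is not None:
            hits = [r for r in hits if r.interface == interface]
        if address is not None:
            hits = [r for r in hits if address in (r.src, r.dst)]
        return hits
```

A query such as `index.query(start, end, interface="eth0", address="10.0.0.2")` narrows the time window first and only then applies the interface and address filters.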

Going Back in Time Demonstration


Add Filter to narrow down scope

Deployment of Stream to Disk

Fixed Location – Data Center

Server farms

Database Servers

Load Balancers

Portable Solution

Remote Offices


Capture to Disk Deployment

[Diagram: Data Center with DHCP Server, Single Sign-on Server, Web Servers, Application Servers, Database Servers]

Discovering Problems before your Customer Does

The network has become the backbone for most if not all of the communications within an organization

These include

E-mail

Phone Traffic (VoIP)

Video (Video Conferencing and Surveillance)

Business Critical Applications

Non-Business Critical Applications that are still important to the end user (e.g., Facebook)

When these services are not performing well, the customer wants them fixed and fixed now!

Discovering Problems before your Customer Does

Not so easy

There are terabytes of information going across the network every day

Most of this traffic is working properly; the trick is pulling out the traffic that is not

Discovering Problems before your Customer Does

Monitoring!!!!!!

The customer should not be used as a monitoring device

It is important to be able to discover network-related problems before the customer discovers them

To accomplish this, we need to be able to:

Perform Real Time Analysis of the network traffic

Take advantage of information available to us through SNMP, RMON, NetFlow

Set monitoring thresholds and alarms to notify us when things are not performing as they should
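
A threshold check can be as small as comparing each collected metric against a limit and producing an alarm message for every violation. The metric names and limits below are hypothetical.

```python
def check_thresholds(sample, thresholds):
    """Return one alarm message for each metric above its configured limit."""
    return [
        f"{name} = {value} exceeds {thresholds[name]}"
        for name, value in sample.items()
        if name in thresholds and value > thresholds[name]
    ]


# Hypothetical limits and one polling sample
limits = {"utilization_pct": 80, "fcs_errors": 0, "response_ms": 200}
sample = {"utilization_pct": 92, "fcs_errors": 0, "response_ms": 150}

for alarm in check_thresholds(sample, limits):
    print(alarm)  # utilization_pct = 92 exceeds 80
```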

Real Time Analysis

It is important that the protocol analyzer be able to analyze packets not only after they have been captured, but as they are being captured

This allows you to detect problems as they are occurring, instead of waiting until the customer reports a problem

Detection can be combined with alerting, so that notifications are sent out when problems occur

Interface Utilization and Errors

Packet loss and link congestion contribute to slow applications

Eliminating these problems from the network will positively impact all of the applications running across the network

Monitoring routers and switches using SNMP will allow you to quickly isolate those links experiencing high utilization and interface errors

The OptiView Portable Network Analyzer provides the capability to collect these values and graph them in a useful fashion
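
For reference, link utilization is normally derived from two readings of an interface octet counter (such as IF-MIB ifInOctets) taken a known interval apart, divided by the interface speed (ifSpeed). The sketch below assumes the counter values have already been polled by some SNMP tool and shows only the arithmetic, including 32-bit counter wrap; the sample values are hypothetical.

```python
COUNTER32_MOD = 2 ** 32


def counter_delta(old: int, new: int, modulus: int = COUNTER32_MOD) -> int:
    """Difference between two counter readings, tolerating a single wrap."""
    return (new - old) % modulus


def utilization_pct(old_octets: int, new_octets: int,
                    interval_s: float, if_speed_bps: float) -> float:
    """Percent utilization in one direction over the polling interval."""
    bits = counter_delta(old_octets, new_octets) * 8
    return 100.0 * bits / (interval_s * if_speed_bps)


# Hypothetical readings 60 seconds apart on a 100 Mbps interface
print(utilization_pct(1_000_000, 376_000_000, 60, 100e6))  # 50.0

# A 32-bit counter that wrapped between polls still yields the right delta
print(counter_delta(4_294_967_000, 704))                   # 1000
```

On fast links a 32-bit counter can wrap more than once between polls, which this calculation cannot detect; shorter polling intervals or the 64-bit counters avoid that.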


Interface Utilization and Errors

FCS/CRC errors are a common problem on many networks

These errors result in packet loss, which in turn results in the retransmission of packets

Retransmission delays cause application delays

A typical cause of FCS/CRC errors is a duplex mismatch

The OptiView Portable Network Analyzer displays the number of errors seen on each port, thereby reducing the time it takes to isolate packet loss.


Utilization and Interface Monitoring

[Diagram: Data Center with DHCP Server, Single Sign-on Server, Web Servers, Application Servers, Database Servers]

Application Response Time

If the infrastructure supporting the application is running slowly, then the application will run slowly

By monitoring the time it takes to traverse the network and connect to the server, we are able to either implicate or eliminate the network as the cause of application slowdowns

Monitoring these application ports over time will give us a baseline of the typical response time, which can then be compared with time periods when the application appears to be slow
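
A minimal sketch of comparing a measurement against a baseline: compute the mean and standard deviation of response times collected while the application was healthy, then flag samples that fall well outside that range. The three-standard-deviation cutoff and the sample values are assumptions chosen for illustration.

```python
from statistics import mean, stdev


def baseline(samples_ms):
    """Mean and sample standard deviation of healthy response times."""
    return mean(samples_ms), stdev(samples_ms)


def is_slow(sample_ms, base_mean, base_stdev, k=3.0):
    """True when a sample exceeds the baseline by more than k standard deviations."""
    return sample_ms > base_mean + k * base_stdev


# Hypothetical connect times (ms) to an application port during normal operation
healthy = [20, 22, 19, 21, 23, 20, 22, 21]
m, s = baseline(healthy)

print(is_slow(24, m, s))  # False: within normal variation
print(is_slow(60, m, s))  # True: well outside the baseline
```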

Application Response Time Monitoring

[Diagram: Data Center with DHCP Server, Single Sign-on Server, Web Servers, Application Servers, Database Servers]

Utilizing SNMP Data

Virtually every device on the network has an SNMP agent

These agents can provide information about the performance, utilization, and faults on the device

This information includes:

Host Resource Tables

Route Tables

ARP Caches

Host Resource Table

SNMP enabled servers can be accessed with the OptiView Portable Network Analyzer

From these servers we can pull information about:

Memory and CPU utilization

Running Processes

Disk Utilization

Number of Users

Host Resource Table Demonstration


Resolving Problems in a Timely Manner

To minimize the impact of application problems to the client, it is important to resolve the problems in a timely manner

Factors that reduce the amount of time necessary to resolve problems are:

Understanding the application's dependencies, data flows, and response times

Capturing in multiple locations and merging the packet captures to isolate packet loss and latency

Playing back multimedia traffic to view the end user experience

Understanding Applications

While the network analyst does not need to understand applications down to the code level, it is important to understand the network traffic related to applications

This understanding will help reduce the amount of time it takes to troubleshoot the application

A good practice is to capture the application traffic when the application is running well.

This good capture can be compared with the problem trace to reduce the amount of time it takes to isolate the problem

Application Centric Analysis

Application Centric Analysis is the process of taking a top-down approach to application troubleshooting as opposed to a bottom-up approach

If it can be shown that the network is transporting traffic as it should, we can begin troubleshooting the application by looking at data flows instead of packets

This gives us a better picture of where the application may be failing, instead of digging through thousands of packets

What is a Transaction?

[Diagram: a transaction broken down into four levels]

Business Transaction: Purchase 100 shares of Danaher

User Actions: Go to Trade Page; Look up Danaher Symbol; Enter stock Symbol and Qty; Submit Order

Application Transactions: GET /tradepage.aspx; GET /border.gif; GET /dnarrow.gif; GET /displayDHR.gif; GET /stylesheet.css; GET /javascript.js; POST /submit_order.asp

Packets: Packet #1 through Packet #16
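
To make the packet-to-transaction step concrete, the sketch below pairs each client request with the next server response and reports a response time per application transaction. The request lines come from the slide; the timestamps, the "200 OK" responses, and the simplified one-request-one-response model are assumptions.

```python
def application_transactions(packets):
    """Pair each client request with the next server response on the same flow."""
    transactions, pending = [], None
    for ts, direction, info in packets:
        if direction == "request":
            pending = (ts, info)
        elif direction == "response" and pending is not None:
            req_ts, req_info = pending
            transactions.append((req_info, round(ts - req_ts, 3)))
            pending = None
    return transactions


# Hypothetical capture: (timestamp_s, direction, summary)
packets = [
    (0.000, "request", "GET /tradepage.aspx"),
    (0.120, "response", "200 OK"),
    (0.130, "request", "GET /border.gif"),
    (0.150, "response", "200 OK"),
    (5.400, "request", "POST /submit_order.asp"),
    (7.900, "response", "200 OK"),
]

# The slowest application transaction points at where to look first
slowest = max(application_transactions(packets), key=lambda t: t[1])
print(slowest)
```

Looking at three transactions with their response times is much quicker than reading the individual packets.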

Demonstration of Application Centric Analysis


Multi-Segment Analysis

In order to get a complete picture of the problem, we may need to see both sides of the conversation at the same time

By capturing on both sides and merging that traffic together, we are able to quickly identify the source of packet loss and delays

To perform this multi-segment analysis, we must be able to synchronize the traces based on time stamp
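
A simplified sketch of merging two captures: packets are matched by an identifier seen at both capture points, the second trace is shifted by a known clock offset, and the result is a transit delay per packet plus a list of packets that never arrived. The packet identifiers, timestamps, and clock offset are hypothetical; a real analyzer would derive the identifier from header fields.

```python
def merge_segments(client_side, server_side, clock_offset=0.0):
    """Match packets seen at both capture points; report transit delay or loss.

    client_side and server_side map a packet identifier to its capture timestamp.
    clock_offset is subtracted from server-side timestamps to align the clocks.
    """
    delays, lost = {}, []
    for pkt_id, t_client in sorted(client_side.items()):
        t_server = server_side.get(pkt_id)
        if t_server is None:
            lost.append(pkt_id)  # seen leaving the client, never seen at the server
        else:
            delays[pkt_id] = round((t_server - clock_offset) - t_client, 6)
    return delays, lost


# Hypothetical traces; the server-side analyzer clock runs 0.5 s ahead
client_trace = {1: 10.000, 2: 10.010, 3: 10.020}
server_trace = {1: 10.502, 3: 10.560}

delays, lost = merge_segments(client_trace, server_trace, clock_offset=0.5)
print(delays)  # per-packet transit delay in seconds
print(lost)    # [2]
```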

Multi-Segment Analysis

ClearSight merges trace files from both analyzers

[Diagram: Client, Network, Web Server, OptiView]

Multi-Segment Analysis

Firewall Latency

Router Latency

Core Latency

Multimedia Playback

In some cases it takes more than just looking at packets to resolve an application problem

When troubleshooting VoIP and Video problems, it is helpful to be able to play the media stream back to view the quality

Problems such as echo with VoIP cannot be determined by looking at the statistics or packets. The only way to detect echo is to listen to the audio stream

Keys to Troubleshooting Multimedia

Need to have the appropriate codecs available on the analysis equipment to play back media

Measurement of Metrics

MOS

R-Factor

V-Factor
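
MOS and R-Factor are related: the ITU-T G.107 E-model defines a standard conversion from an R-Factor to an estimated MOS. The sketch below implements that published formula; the sample R values are only illustrative, and the analyzer's own calculation is not specified in this presentation.

```python
def r_to_mos(r: float) -> float:
    """Estimated MOS from an E-model R-Factor (ITU-T G.107 conversion)."""
    if r <= 0:
        return 1.0
    if r >= 100:
        return 4.5
    return 1 + 0.035 * r + r * (r - 60) * (100 - r) * 7e-6


for r in (50, 70, 93.2):
    print(r, round(r_to_mos(r), 2))
```

An R-Factor in the low 90s maps to a MOS of roughly 4.4, while an R-Factor of 70 maps to about 3.6.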

Where to deploy the Equipment

The placement of the analysis equipment has a significant impact on the analysis

An analyzer placed close to the source of the multimedia traffic may not see the same problems as one placed near the destination

Portable Solution

Having a portable analysis solution allows the analyst to move between and connect at various locations to isolate the problem

In cases of remote offices, the analysis solution can be shipped to the office to capture the end user experience

Use of Taps

Having taps installed ahead of time provides a quick and easy way to connect the analyzer

The use of taps ensures that the timing of the multimedia packets is not changed, which could adversely impact the metrics


VoIP Analyzer Deployment

[Diagram: Data Center with VoIP Server, Single Sign-on Server, Web Servers, Application Servers, Database Servers]

Demonstration of Multimedia Playback


Summary of Best Practices and Challenges

Best Practice | Method | Fluke Networks Tools

Getting in the Path of the Packets | Flow of the packets | OptiView – Traceswitch Route; OptiView – ICMP Traceroute

Getting in the Path of the Packets | Application Dependencies | OptiView – Host Conversations

Getting in the Path of the Packets | Span/Tap | Fluke Networks Taps

Capturing All the Packets | High Performance Packet Capture | Network Time Machine

Capturing All the Packets | Capture to Disk – Back in Time | Network Time Machine

Discovering Problems before the Customer Does | Real Time Analysis | Network Time Machine – Atlas Metrics; OptiView – Interface Utilization and Errors; OptiView – Application Response Time

Discovering Problems before the Customer Does | Using SNMP Data | OptiView – Host Resource Table

Diagnosing Problems in a Timely Manner | Understanding Applications | ClearSight Analyzer – Application Centric Analysis; ClearSight Analyzer – Multi-segment Analysis

Diagnosing Problems in a Timely Manner | Multimedia Analysis | Network Time Machine – High Performance Capture; ClearSight Analyzer – Multimedia Playback

Resources

90-Day ClearSight Trial – requires unique Proof of Purchase (POP) Code found on the ClearSight Flyer handed out at the seminar

14-Day ClearSight Trial – if you misplaced your POP Code you can download the 14-day trial at www.flukenetworks.com/csatrial

Application-Centric Resource Center: www.flukenetworks.com/app-centric

Network Forensics Resource Center: www.flukenetworks.com/ntmresources

Portable Network Analysis: www.flukenetworks.com/optiview

Request OptiView 5-Day Evaluation: www.flukenetworks.com/optivieweval

For additional information:

Email: info@flukenetworks.com. Phone 800-283-5853 (US/Canada) or 425-446-4519 (other locations).