
Advanced Communication Networks

Data Center TCP (DCTCP)


Spring 2016
Mahshid R. Naeini, PhD

Data Center TCP-Reference

Data Center TCP (DCTCP)

Mohammad Alizadeh, Albert Greenberg, David A. Maltz, Jitendra Padhye, Parveen Patel, Balaji Prabhakar, Sudipta Sengupta,
Murari Sridharan
ACM SIGCOMM 2010 conference

Some slides borrowed from Liyong at SYSU

We focus on soft real-time applications supporting:

Web search
Retail
Advertising
Recommendation systems

These applications generate a diverse mix of short and long flows


These applications require three things from the data center network:
Low latency for short flows
High burst tolerance
High utilization for long flows

Reducing network latency allows application developers to invest more cycles in the
algorithms that improve relevance and end user experience.

Foreground and Background Traffic


Foreground traffic typically consists of shorter data flows with low latency
tolerance. As used herein, the term foreground traffic may include any
traffic that is directly connected to the user experience, or any other
traffic designated as having a higher priority than background traffic.
Background traffic is often contained within the data center or between
data centers. Background traffic performs background tasks such as
updating databases or applications, backing up data, or doing other
administrative functions that are usually not time sensitive and that do not
impact or interact with users directly. Background traffic usually has
minimal latency requirements and, further, tends to include long
streams of data that can take a significant amount of time to transmit.

Some Properties of DC Networks

Delays in wide area networks are very different from those in DC networks.

For example, in DC networks round trip times (RTTs) can be less than 250 μs in the absence of queuing.

Applications simultaneously need extremely high bandwidths and very low
latencies.
Data center environment offers certain luxuries:
The network is largely homogeneous and under a single administrative control.
Backward compatibility, incremental deployment and fairness to legacy
protocols are not major concerns.
Connectivity to the external Internet is typically managed through load
balancers and application proxies that effectively separate internal traffic from
external, so issues of fairness with conventional TCP are irrelevant.

TCP in the Data Center


We'll see that TCP does not meet the demands of these apps.
Incast
Suffers from bursty packet drops
Not fast enough to utilize spare bandwidth

Builds up large queues:


Adds significant latency.
Wastes precious buffers, esp. bad with shallow-buffered switches.

Operators work around TCP problems.


Ad-hoc, inefficient, often expensive solutions

Solution: Data Center TCP

Mechanisms for Detecting Congestion


(i) Delay-based protocols use increases in RTT measurements as a sign of
growing queueing delay, and hence of congestion.
These protocols rely heavily on accurate RTT measurement, which is
susceptible to noise in the very low latency environment of data centers. Small
noisy fluctuations of latency become indistinguishable from congestion and the
algorithm can over-react.

(ii) Active Queue Management (AQM) approaches use explicit feedback

from congested switches.
DCTCP belongs to this second class.

Partition/Aggregate

[Figure: the partition/aggregate request tree for a web search query (the example searches
Picasso quotes such as "Art is a lie that makes us realize truth"). A Top Level Aggregator
(TLA, deadline = 250 ms) fans the query out to Mid Level Aggregators (MLAs, deadline = 50 ms),
which fan it out to Worker Nodes (deadline = 10 ms); answers are aggregated back up the tree,
possibly over iterative queries. Strict deadlines (SLAs) apply at every level: time is money,
and a missed deadline means a lower quality result.]

Partition/Aggregate Application Structure

Common application structure: the foundation of many large-scale web applications


Web search, social network content composition, and advertisement selection are
all based around this application design pattern.
These application structures produce synchronized and bursty traffic patterns; we identify
three performance impairments these patterns cause.
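As a concrete illustration (not from the paper; the worker latencies and thread-pool structure are hypothetical), here is a minimal sketch of the scatter-gather pattern with a hard per-level deadline:

```python
# Minimal sketch (not from the paper) of the partition/aggregate pattern:
# an aggregator fans a query out to workers and returns whatever answers
# arrive before its deadline, so a slow worker degrades result quality
# rather than delaying the response.
import concurrent.futures
import random
import time

def worker(query, worker_id):
    """Hypothetical worker: simulate variable processing latency."""
    time.sleep(random.uniform(0.001, 0.02))          # 1-20 ms of "work"
    return f"result for '{query}' from worker {worker_id}"

def aggregate(query, num_workers=8, deadline_s=0.010):
    """Scatter the query, then gather only the answers that beat the deadline."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=num_workers) as pool:
        futures = [pool.submit(worker, query, i) for i in range(num_workers)]
        done, _ = concurrent.futures.wait(futures, timeout=deadline_s)
        results = [f.result() for f in done]
    # Late answers are simply ignored (the pool drains stragglers on exit),
    # which is why a missed deadline shows up as a lower quality result.
    return results

if __name__ == "__main__":
    answers = aggregate("picasso quotes")
    print(f"{len(answers)}/8 workers answered before the deadline")
```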

Data Collection

Over 6000 servers in over 150 racks


Each rack in the clusters holds 44 servers
Each server connects to a Top of Rack switch (ToR) via 1Gbps Ethernet
The ToRs are shallow buffered, shared-memory switches; each with 4MB of
buffer shared among 48 1Gbps ports and two 10Gbps ports.
Latency information collected via:
Passively collects socket level logs
Selected packet-level logs
App-level logs

>150 TB of compressed data, collected over the course of a month from these servers.

The measurements reveal that 99.91% of traffic in our data center is TCP
traffic.
Our key learning from these measurements is that to meet the requirements of
such a diverse mix of short and long flows, switch buffer occupancies need to
be persistently low, while maintaining high throughput for the long flows.

Workloads

Query (Partition/Aggregate): 2 KB to 20 KB in size; delay-sensitive
Short messages (coordination, control state): 100 KB to 1 MB; delay-sensitive
Large flows (data update): 1 MB to 100 MB; throughput-sensitive

Switches
Like most commodity switches, the switches in these clusters are shared-memory
switches that aim to exploit statistical multiplexing gain through use of logically
common packet buffers available to all switch ports.
Packets arriving on an interface are stored into a high speed multi-ported
memory shared by all the interfaces.
Memory from the shared pool is dynamically allocated to packets by an
MMU, which attempts to give each interface as much memory as it needs while
preventing unfairness by dynamically adjusting the maximum amount of
memory any one interface can take (a sketch of one such policy follows this list).
Building large multi-ported memories is very expensive, so most cheap
switches are shallow buffered, with packet buffer being the scarcest
resource.
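The dynamic allocation described above can be pictured with a small sketch. This is a generic dynamic-threshold policy, not the actual firmware of these switches; the class name and the alpha parameter are illustrative:

```python
# Minimal sketch of a dynamic-threshold shared-buffer policy, in the spirit of
# the MMU behavior described above (illustration only, not a vendor algorithm):
# a port may only queue up to alpha * (free shared memory), so no single port
# can monopolize the shallow shared buffer.
class SharedBufferSwitch:
    def __init__(self, total_cells, alpha=1.0):
        self.total = total_cells        # total shared packet buffer, in cells
        self.used = 0                   # cells currently occupied, all ports combined
        self.per_port = {}              # cells occupied per output port
        self.alpha = alpha              # aggressiveness of the dynamic threshold

    def can_accept(self, port):
        free = self.total - self.used
        threshold = self.alpha * free   # per-port cap shrinks as the pool fills
        return free > 0 and self.per_port.get(port, 0) < threshold

    def enqueue(self, port):
        if not self.can_accept(port):
            return False                # packet dropped: buffer pressure / incast loss
        self.per_port[port] = self.per_port.get(port, 0) + 1
        self.used += 1
        return True

    def dequeue(self, port):
        if self.per_port.get(port, 0) > 0:
            self.per_port[port] -= 1
            self.used -= 1
```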

Impairments

Shared-memory switches suffer three traffic impairments:
Incast
Queue Buildup
Buffer Pressure

Incast

Synchronized mice collide.
Caused by Partition/Aggregate.

[Figure: Workers 1-4 send their responses to the Aggregator at the same time; with
RTOmin = 300 ms, any response that loses packets stalls in a TCP timeout.]

If many flows converge on the same interface of a switch over a short period of
time, the packets may exhaust either the switch memory or the maximum
permitted buffer for that interface, resulting in packet losses.
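A back-of-the-envelope illustration of why the burst overflows (the numbers are mine, not the paper's):

```python
# Back-of-the-envelope incast illustration: many synchronized responses
# converge on one ToR port whose share of the shallow shared buffer is small,
# so the burst overflows and packets are dropped.
def incast_drops(num_workers, response_bytes, port_buffer_bytes):
    burst = num_workers * response_bytes
    return max(0, burst - port_buffer_bytes)

if __name__ == "__main__":
    # e.g., 40 workers x 20 KB responses into ~128 KB of usable port buffer
    dropped = incast_drops(num_workers=40, response_bytes=20_000,
                           port_buffer_bytes=128_000)
    print(f"bytes dropped in the burst: {dropped}")  # losing flows wait for RTO
```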

Queue Buildup

Big flows build up queues, increasing latency for short flows.

[Figure: Sender 1 (a long flow) and Sender 2 (a short flow) share the same queue
toward the Receiver.]

When long and short flows traverse the same queue, two impairments occur:
First, packet loss on the short flows can cause incast problems.
Second, there is a queue buildup impairment: even when no packets
are lost, the short flows experience increased latency.

Buffer Pressure

Since buffer space is a shared resource, the queue buildup on one port reduces
the amount of buffer space available to absorb bursts of
Partition/Aggregate traffic on other ports.
The loss rate of short flows in this traffic pattern depends on the number
of long flows traversing other ports.
The result is packet loss and timeouts, as in incast, but without
requiring synchronized flows.

Data Center Transport Requirements


1. High Burst Tolerance
Incast due to Partition/Aggregate is common.

2. Low Latency
Short flows, queries

3. High Throughput
Large file transfers

The challenge is to achieve these three together.



Balance Between Requirements

High burst tolerance and high throughput pull against low latency:
Deep buffers: queuing delays increase latency.
Shallow buffers: bad for bursts and throughput.
Reduced RTOmin (SIGCOMM '09): doesn't help latency.
AQM (RED): average-queue-based marking is not fast enough for incast.

Objective: low queue occupancy and high throughput.
Solution: DCTCP.

Review TCP Congestion Control


Four stages:
Slow Start
Congestion Avoidance
Fast Retransmit
Fast Recovery
(A minimal sketch of this window behavior follows this slide.)

A router must maintain one or more queues per port, so it is

important to control those queues.
Two classes of queue control algorithms:
Queue management algorithms: manage the queue length by dropping
packets when necessary
Queue scheduling algorithms: determine the next packet to send
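The sketch below is illustrative only, not a full TCP implementation; it shows the slow-start / congestion-avoidance growth and the 50% multiplicative decrease that the later DCTCP comparison builds on:

```python
# Minimal, illustrative sketch of standard TCP congestion window behavior:
# slow start until ssthresh, additive increase afterwards, and a 50% cut
# (multiplicative decrease) on any sign of congestion.
class TcpCwnd:
    def __init__(self, init_cwnd=1, ssthresh=64):
        self.cwnd = float(init_cwnd)    # congestion window, in segments
        self.ssthresh = ssthresh

    def on_ack(self):
        if self.cwnd < self.ssthresh:
            self.cwnd += 1              # slow start: +1 segment per ACK (doubles per RTT)
        else:
            self.cwnd += 1 / self.cwnd  # congestion avoidance: ~+1 segment per RTT

    def on_congestion(self):
        # Fast retransmit/recovery path: halve the window regardless of how
        # severe the congestion actually is (contrast with DCTCP later).
        self.ssthresh = max(2.0, self.cwnd / 2)
        self.cwnd = self.ssthresh
```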


Queue management algorithms

Passive queue management: drop packets only after the queue is full.
Traditional methods
Drop-tail
Random drop
Drop-front

Some drawbacks
Lock-out: a few flows occupy the queue exclusively, preventing packets from other
flows from entering the queue
Full queues: congestion signals are sent only when the queues are full, so queues
stay full for long periods

Solution: Active Queue Management (AQM)

AQM (dropping or marking packets before the queue is full)

RED (Random Early Detection) [RFC 2309]
Calculate the average queue length (aveQ): estimate the degree of congestion
Calculate the probability of dropping packets (P) according to the degree of congestion,
using two thresholds minth and maxth:
aveQ < minth: don't drop packets
minth < aveQ < maxth: drop packets with probability P
aveQ > maxth: drop all packets
Drawback: packets are sometimes dropped even though the queue isn't full
(a minimal RED sketch follows this slide)

ECN (Explicit Congestion Notification) [RFC 3168]

A method that uses explicit feedback (packet marking) to notify congestion instead of
dropping packets
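A minimal sketch of the RED logic above, assuming the textbook formulation (EWMA average queue plus a linear drop probability between minth and maxth); real implementations add refinements such as spacing out consecutive drops:

```python
# Minimal RED sketch: maintain an EWMA of the queue length and drop with a
# probability that grows linearly between the two thresholds.
import random

class Red:
    def __init__(self, min_th=5, max_th=15, max_p=0.1, weight=0.002):
        self.min_th, self.max_th, self.max_p = min_th, max_th, max_p
        self.weight = weight       # EWMA weight for the average queue length
        self.avg_q = 0.0

    def should_drop(self, current_queue_len):
        # Update the average queue length (the congestion estimate).
        self.avg_q = (1 - self.weight) * self.avg_q + self.weight * current_queue_len
        if self.avg_q < self.min_th:
            return False                               # below min_th: never drop
        if self.avg_q >= self.max_th:
            return True                                # above max_th: always drop
        # Between the thresholds: drop probability grows linearly up to max_p.
        p = self.max_p * (self.avg_q - self.min_th) / (self.max_th - self.min_th)
        return random.random() < p
```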


Explicit Congestion Notification (ECN)


Routers or switches must support it (ECN-capable).
Two bits in the ECN field of the IP packet header:
ECT (ECN-Capable Transport): set by the sender, to indicate whether the sender's transport
protocol supports ECN.
CE (Congestion Experienced): set by routers or switches, to indicate whether
congestion has occurred.

Two flag bits in the TCP header:

ECN-Echo (ECE): the receiver notifies the sender that it has received a CE-marked packet.
CWR (Congestion Window Reduced): the sender notifies the receiver that it has decreased
the congestion window.
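For reference, the code points and flag bits above as defined in RFC 3168 and the TCP flags byte; the helper function is purely illustrative:

```python
# Reference sketch of the ECN code points and TCP flag bits mentioned above.
# ECN field: the two least-significant bits of the IP TOS / Traffic Class byte.
NOT_ECT = 0b00   # sender's transport does not support ECN
ECT_1   = 0b01   # ECN-Capable Transport
ECT_0   = 0b10   # ECN-Capable Transport (the commonly used codepoint)
CE      = 0b11   # Congestion Experienced, set by a congested switch/router

# TCP flag bits involved in the ECN echo loop (byte 13 of the TCP header).
TCP_ECE = 0x40   # ECN-Echo: receiver -> sender, "I saw a CE mark"
TCP_CWR = 0x80   # Congestion Window Reduced: sender -> receiver, "I reacted"

def mark_if_congested(ecn_bits, congested):
    """A switch may set CE only on packets that are ECN-capable."""
    if congested and ecn_bits in (ECT_0, ECT_1):
        return CE
    return ecn_bits
```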


ECN working principle

[Figure: sender host, switch, and receiver host, showing ECT/CE in the IP header and
ECN-Echo/CWR in the TCP header.]
1. The sender sets ECT in the IP header of outgoing packets.
2. A congested switch sets CE on ECT packets instead of dropping them.
3. The receiver echoes the mark back by setting ECN-Echo in its ACKs.
4. The sender reduces its congestion window and sets CWR to acknowledge the echo.

Review: The TCP/ECN Control Loop

ECN = Explicit Congestion Notification

[Figure: Sender 1 and Sender 2 share a congested queue toward the Receiver; the switch
sets a 1-bit ECN mark that the receiver echoes back to the senders.]

Two Key Ideas

1. React in proportion to the extent of congestion, not its presence.

   ECN marks     TCP                  DCTCP
   1011110111    cut window by 50%    cut window by 40%
   0000000001    cut window by 50%    cut window by 5%

2. Mark based on instantaneous queue length.
   Fast feedback to better deal with bursts.
   (A short numeric illustration of the proportional cut follows.)
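The arithmetic behind the table is simple; here is a quick illustration (my code, using the steady-state rule that DCTCP cuts by roughly half the marked fraction):

```python
# Quick arithmetic behind the table above: TCP halves its window on any mark,
# while DCTCP cuts by roughly (fraction marked) / 2 once its estimate has
# converged to that fraction.
def tcp_cut(marks):
    return 0.5 if any(marks) else 0.0

def dctcp_cut(marks):
    frac_marked = sum(marks) / len(marks)
    return frac_marked / 2

heavy = [1, 0, 1, 1, 1, 1, 0, 1, 1, 1]   # 8/10 marked -> DCTCP cuts 40%
light = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]   # 1/10 marked -> DCTCP cuts 5%
for name, marks in (("heavy", heavy), ("light", light)):
    print(name, f"TCP cut {tcp_cut(marks):.0%}", f"DCTCP cut {dctcp_cut(marks):.0%}")
```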


Data Center TCP Algorithm


Switch side:
Mark packets with CE when the instantaneous queue length > K; don't mark below K.

Sender side:
Maintain an estimate of the fraction of packets marked (α).
In each RTT:
    α ← (1 − g) · α + g · F
where F is the fraction of packets that were marked in the last window of data, and
0 < g < 1 is the weight given to new samples against the past in the estimation of α.

Adaptive window decrease:
    cwnd ← cwnd × (1 − α/2)
(A runnable sketch of both sides follows this slide.)
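A minimal, runnable sketch of the mechanism above. The event-driven structure and names are mine; the two update rules follow the slide, and g = 1/16 is the value suggested in the paper:

```python
# Minimal sketch of DCTCP: a switch port that marks on instantaneous queue
# length, and a sender that keeps an EWMA of the marked fraction and cuts
# its window in proportion to it.
class DctcpSwitchPort:
    def __init__(self, K=20):
        self.K = K                     # marking threshold, in packets
        self.queue_len = 0

    def on_enqueue(self, packet):
        self.queue_len += 1
        # Mark on instantaneous queue length, not an average.
        packet["ce"] = self.queue_len > self.K

class DctcpSender:
    def __init__(self, g=1/16, init_cwnd=10):
        self.g = g                     # EWMA weight for new samples
        self.alpha = 0.0               # running estimate of the marked fraction
        self.cwnd = float(init_cwnd)   # congestion window, in packets

    def on_window_acked(self, acks):
        """Called roughly once per RTT with the ACKs for the last window of data."""
        marked = sum(1 for a in acks if a["ece"])
        F = marked / len(acks)                         # fraction marked this window
        self.alpha = (1 - self.g) * self.alpha + self.g * F
        if marked:
            self.cwnd *= (1 - self.alpha / 2)          # cut in proportion to congestion
        else:
            self.cwnd += 1                             # otherwise additive increase
```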

DCTCP in Action

Setup: Windows 7 hosts, Broadcom 1 Gbps switch.
Scenario: 2 long-lived flows, K = 30 KB.

[Figure: instantaneous queue length (KBytes) over time; DCTCP holds a short, stable queue
while TCP's queue is large and oscillating.]

Why it Works
1. High Burst Tolerance
Large buffer headroom → bursts fit.
Aggressive marking → sources react before packets are dropped.

2. Low Latency
Small buffer occupancies → low queuing delay.

3. High Throughput
ECN averaging → smooth rate adjustments, low variance in cwnd.

Analysis
We want to analyze the behavior of DCTCP for N infinitely long-lived flows with identical round trip
times (RTT), sharing a single bottleneck link of capacity C.
We assume flows are synchronized.

[Figure: sawtooth of window size over time. The window grows to W* + 1, the packets sent in
that RTT are marked, and the window is then cut to (W* + 1)(1 − α/2).]

We are interested in computing the following quantities:

The maximum queue size (Qmax)
The amplitude of queue oscillations (A)
The period of oscillations (TC)

Analysis
The queue size at time t is given by
    Q(t) = N · W(t) − C × RTT        (3)
where W(t) is the window size of a single source.

The fraction of marked packets, α:

Let S(W1, W2) denote the number of packets sent by the sender while its
window size increases from W1 to W2 > W1.
Since this takes W2 − W1 round-trip times, during which the average
window size is (W1 + W2)/2:
    S(W1, W2) = (W2² − W1²) / 2        (4)

Analysis
Let W* = (C × RTT + K)/N

This is the critical window size at which the queue size reaches K, and the switch starts
marking packets with the CE codepoint. During the RTT it takes for the sender to react to
these marks, its window size increases by one more packet, reaching W* + 1.

Hence
    α = S(W*, W* + 1) / S((W* + 1)(1 − α/2), W* + 1)        (5)

Plugging (4) into (5) and rearranging, we get:
    α² (1 − α/4) = (2W* + 1) / (W* + 1)²  ≈  2/W*

Assuming α is small, this can be simplified as
    α ≈ √(2/W*)

We can now compute A, TC and Qmax.

Analysis
Note that the amplitude of oscillation in window size of a single flow, D, is
given by:
    D = (W* + 1) − (W* + 1)(1 − α/2) = (W* + 1) · α/2

Since there are N flows in total:
    A = N · D = N (W* + 1) · α/2  ≈  (1/2) · √(2N (C × RTT + K))
    TC = D (in RTTs)

Finally, using (3), we have:
    Qmax = N (W* + 1) − C × RTT = K + N

Analysis
How do we set the DCTCP parameters?
Marking threshold (K). The minimum value of the queue occupancy in the
sawtooth is given by:
    Qmin = Qmax − A = K + N − (1/2) · √(2N (C × RTT + K))

Choose K so that this minimum is larger than zero, i.e., the queue does not
underflow. This results in:
    K > (C × RTT) / 7
(A short worked example with concrete numbers follows this slide.)
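A quick worked example plugging assumed numbers into the formulas above (1 Gbps link, 250 μs RTT, two flows, 1500-byte packets; the numbers are mine, not the paper's):

```python
# Worked example with illustrative numbers: a 1 Gbps link, RTT = 250 us,
# N = 2 synchronized DCTCP flows, 1500-byte packets.
from math import floor, sqrt

C = 1e9 / (1500 * 8)          # link capacity in packets per second (~83,333 pkt/s)
RTT = 250e-6                  # 250 microseconds
N = 2
bdp = C * RTT                 # bandwidth-delay product in packets (~20.8)

K = floor(bdp / 7) + 1        # smallest integer K satisfying K > C*RTT/7
W_star = (bdp + K) / N        # critical window size at which marking starts
alpha = sqrt(2 / W_star)      # steady-state marked fraction
A = 0.5 * sqrt(2 * N * (bdp + K))   # amplitude of queue oscillations (packets)
Q_max = K + N                 # maximum queue size (packets)
Q_min = Q_max - A             # minimum queue size; positive, so no underflow

print(f"BDP ~ {bdp:.1f} pkts, K = {K}, W* = {W_star:.1f}, alpha ~ {alpha:.2f}")
print(f"A ~ {A:.1f} pkts, Qmax = {Q_max} pkts, Qmin ~ {Q_min:.1f} pkts")
```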

Conclusions
DCTCP satisfies all our requirements for Data Center packet transport.
Handles bursts well
Keeps queuing delays low
Achieves high throughput

Features:
Very simple change to TCP and a single switch parameter K.
Based on ECN mechanisms already available in commodity switches.


More readings on Data Center Networking


On the Feasibility of Completely Wireless Datacenters
J. Y. Shin, E. G. Sirer, H. Weatherspoon, and D. Kirovski, IEEE/ACM Transactions
on Networking (ToN), Volume 21, Issue 5 (October 2013), pages 1666-1680.

Network Traffic Characteristics of Data Centers in the Wild


T. Benson, A. Akella, and D. A. Maltz. In Proceedings of the 10th ACM
SIGCOMM conference on Internet measurement (IMC), pp. 267-280. ACM,
2010.
SoNIC: Precise Realtime Software Access and Control of Wired Networks
Ki Suh Lee, Han Wang, Hakim Weatherspoon
USENIX NSDI, 2013
Energy-aware routing in data center network
Yunfei Shang, Dan Li, Mingwei Xu
ACM SIGCOMM workshop on Green networking, 2010
Data Center Network Virtualization: A Survey
Md. Faizul Bari, Raouf Boutaba, Rafael Esteves, Lisandro Zambenedetti Granville, Maxim Podlesny, Md Golam
Rabbani, Qi Zhang, and Mohamed Faten Zhani
IEEE COMMUNICATIONS SURVEYS & TUTORIALS, VOL. 15, NO. 2, SECOND QUARTER 2013
