
Linux Kernel Networking by Rami Rosen

This doc is licensed under a Creative Commons License: http://creativecommons.org/licenses/by-nc-nd/3.0/

Rami Rosen website

This is a 178-page document with a broad overview of Linux kernel networking, going deep into design and implementation details as well as the theory behind it. It reflects the most recent Linux kernel network git tree, net-next, and it is updated every week according to the most recent changes in this tree.

Linux kernel networking info is scattered in too many places around the web, and is sometimes partial, not updated or missing. This document serves as a central document about this subject.

Last updated: January 2013

Though it is intended mostly for developers, I do hope and believe that administrators and researchers can get some advice here. It is based on my practical experience with Linux kernel networking and on a series of lectures I gave at the Technion. See: Rami Rosen lectures


Please feel free to send any feedback, questions or suggestions to Rami Rosen by sending email to: ramirose@gmail.com. I will try hard to answer each and every question (though sometimes it takes time).

Contents

1 Introduction
2 Hierarchy of networking layers
3 Networking Data Structures
    3.1 SK_BUFF
    3.2 net_device
4 NAPI
5 Routing Subsystem
    5.1 Routing Tables
    5.2 Routing Cache
        5.2.1 Creating a Routing Cache Entry
    5.3 Policy Routing (multiple tables)
        5.3.1 Policy Routing: add/delete a rule example
    5.4 Routing Table lookup algorithm
6 Receiving a packet
    6.1 Forwarding
7 Sending a Packet
8 Multipath routing
9 Netfilter
    9.1 Netfilter rule example
    9.2 Connection Tracking
10 Traffic Control
11 ICMP redirect message
12 Neighboring Subsystem
13 Network Namespaces
14 Virtual Network Devices
15 IPv6
16 VLAN
17 Bonding Network device
18 Teaming Network Device
19 PPP
20 TUN/TAP
21 BLUETOOTH
    BlueZ
    RFCOMM
    L2CAP
22 VXLAN
23 TCP
24 IPSec
    24.1 Example: Host to Host VPN (using openswan)
25 Wireless Subsystem
26 Links and more info

Introduction

• Understanding a packet walkthrough in the kernel is a key to understanding kernel networking. Understanding it is a must if we want to understand Netfilter or IPSec internals, and more.
• We will deal with this walkthrough in this document (design and implementation details).

Hierarchy of networking layers

• The layers that we will deal with (based on the 7 layers model) are:
  - Link Layer (L2) (ethernet)
  - Network Layer (L3) (ipv4, ipv6)
  - Transport Layer (L4) (udp, tcp...)

Networking Data Structures

• The two most important structures of the linux kernel network layer are:
  - sk_buff struct (defined in include/linux/skbuff.h)
  - net_device struct (defined in include/linux/netdevice.h)

It is better to know a bit about them before delving into the walkthrough code.

SK_BUFF

All network-related queues and buffers in the kernel use a common data structure, struct sk_buff. This is a large struct containing all the control information required for the packet (datagram, cell, whatever). The sk_buff elements are organized as a doubly linked list, in such a way that it is very efficient to move an sk_buff element from the beginning/end of a list to the beginning/end of another list.

A queue is defined by struct sk_buff_head, which includes a head and a tail pointer to sk_buff elements. All the queuing structures include an sk_buff_head representing the queue. For instance, struct sock includes a receive and a send queue. Functions to manage the queues (skb_queue_head(), skb_queue_tail(), skb_dequeue(), skb_dequeue_tail()) operate on an sk_buff_head. In reality, however, the sk_buff_head is included in the doubly linked list of sk_buffs (so it actually forms a ring).

When an sk_buff is allocated, its data space is also allocated from kernel memory. sk_buff allocation is done with alloc_skb() or dev_alloc_skb(); drivers use dev_alloc_skb() (freeing is done by kfree_skb() and dev_kfree_skb()). However, sk_buff provides an additional management layer. The data space is divided into a head area and a data area. This allows kernel functions to reserve space for the header, so that the data doesn't need to be copied around. Typically, therefore, after allocating an sk_buff, header space is reserved using skb_reserve().

skb_pull(int len) – removes data from the start of a buffer (skipping over an existing header) by advancing data to data+len and by decreasing len.

We also handle alignment when allocating an sk_buff. When allocating an sk_buff by netdev_alloc_skb(), we eventually call __alloc_skb(), and in fact we have two allocations here:

- The first is the sk_buff itself (struct sk_buff *skb); this is done by:

      ...
      skb = kmem_cache_alloc_node(cache, gfp_mask & ~__GFP_DMA, node);
      ...

  See __alloc_skb() in net/core/skbuff.c.

- The second is allocating data:

      ...
      size = SKB_DATA_ALIGN(size);
      data = kmalloc_node_track_caller(size + sizeof(struct skb_shared_info), gfp_mask, node);
      ...

  See also __alloc_skb() in net/core/skbuff.c. The data is for packet headers (layer 2, layer 3, layer 4) and packet data.

Now, the data pointer is not fixed; we advance/decrease it as we move from layer to layer. The head pointer is fixed.
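The head/data pointer handling described above can be sketched in user space. This is a toy model only, not the kernel's struct sk_buff: the type toy_skb and the toy_* functions are hypothetical names, and they imitate skb_reserve() and skb_pull() from the text plus the kernel helpers skb_put() and skb_push(), which the text does not discuss.

```c
#include <assert.h>
#include <stddef.h>

/* Toy userspace model of the sk_buff buffer pointers; not the kernel struct. */
struct toy_skb {
    unsigned char buf[256];
    unsigned char *head;   /* fixed start of the buffer */
    unsigned char *data;   /* start of the current header/payload */
    unsigned char *tail;   /* end of the current payload */
    unsigned char *end;    /* fixed end of the buffer */
    unsigned int len;      /* always tail - data */
};

static inline void toy_init(struct toy_skb *skb)
{
    skb->head = skb->data = skb->tail = skb->buf;
    skb->end = skb->buf + sizeof(skb->buf);
    skb->len = 0;
}

/* Imitates skb_reserve(): open headroom in a still-empty buffer. */
static inline void toy_reserve(struct toy_skb *skb, unsigned int n)
{
    assert(skb->len == 0 && (size_t)(skb->end - skb->tail) >= n);
    skb->data += n;
    skb->tail += n;
}

/* Imitates skb_put(): extend the data area at the tail. */
static inline unsigned char *toy_put(struct toy_skb *skb, unsigned int n)
{
    unsigned char *old_tail = skb->tail;

    assert((size_t)(skb->end - skb->tail) >= n);
    skb->tail += n;
    skb->len += n;
    return old_tail;
}

/* Imitates skb_push(): prepend a header into the reserved headroom. */
static inline unsigned char *toy_push(struct toy_skb *skb, unsigned int n)
{
    assert((size_t)(skb->data - skb->head) >= n);
    skb->data -= n;
    skb->len += n;
    return skb->data;
}

/* Imitates skb_pull(): skip over a header at the front. */
static inline unsigned char *toy_pull(struct toy_skb *skb, unsigned int n)
{
    assert(n <= skb->len);
    skb->data += n;
    skb->len -= n;
    return skb->data;
}
```

A sender would typically call toy_reserve() once, toy_put() for the payload, and toy_push() once per header on the way down the stack; a receiver calls toy_pull() once per header on the way up. In both directions only data moves, while head stays fixed.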

The allocation of data above forces alignment. Now, when we call netdev_alloc_skb() from the network driver, the data points to the Ethernet header. The IP header follows immediately after the Ethernet header. Since the ethernet header is 14 bytes, this means that, assuming data = kmalloc_node_track_caller() returned a 16-byte aligned address, as mentioned above, the IP header will **not** be 16-byte aligned (it starts on data+14). In order to align it, we should advance data by 2 bytes before putting the ethernet header there. This is done by:

    skb_reserve(skb, NET_IP_ALIGN);

NET_IP_ALIGN is 2, and what skb_reserve() does is increment data by 2 bytes (let's ignore the increment of the tail; it is not important to this discussion). So now the ip header is 16-byte aligned. See netdev_alloc_skb_ip_align() in include/linux/skbuff.h.

The struct sk_buff objects themselves are private for every network layer. When a packet is passed from one layer to another, the struct sk_buff is cloned. However, the data itself is not copied in that case. Note that struct sk_buff is quite large, but most of its members are unused in most situations. The copy overhead when cloning is therefore limited.

In most cases, sk_buff instances appear as "skb" in the kernel code.

struct sk_buff members:

Note: sk_buff members appear in the same order as in the header file, skbuff.h.

struct sk_buff *next;
struct sk_buff *prev;

ktime_t tstamp
    Time stamp of receiving the packet.


    net_enable_timestamp() must be called in order to get valid timestamp values.
    Helper method: static inline ktime_t skb_get_ktime(const struct sk_buff *skb) - returns tstamp of the skb.

struct sock *sk
    The socket which owns the skb.
    Helper method: static inline void skb_orphan(struct sk_buff *skb)
    If the skb has a destructor, call this destructor; set skb->sk and skb->destructor to null.

struct net_device *dev
    The net_device on which the packet was received, or the net_device on which the packet will be transmitted.

char cb[48] __aligned(8)
    Control buffer for private variables. Many network modules define a private skb cb of their own, and use skb->cb for their own needs. For example, in include/net/bluetooth/bluetooth.h, we have:

        #define bt_cb(skb) ((struct bt_skb_cb *)((skb)->cb))

unsigned long _skb_refdst;
    Helper method: static inline struct dst_entry *skb_dst(const struct sk_buff *skb)

struct dst_entry *dst
    The route for this sk_buff; this route is determined by the routing subsystem. It has 2 important function pointers:

        int (*input)(struct sk_buff*);
        int (*output)(struct sk_buff*);

    input() can be assigned to one of the following: ip_local_deliver, ip_forward, ip_mr_input, ip_error or dst_discard_in.
    output() can be assigned to one of the following: ip_output, ip_mc_output, ip_rt_bug, or dst_discard_out.
    We will deal more with dst when talking about routing. In the usual case, there is only one dst_entry for every skb.


When using IPsec, there is a linked list of dst_entries, and only the last one is for routing; all other dst_entries are for IPSec transformers. These other dst_entries have the DST_NOHASH flag set. Entries which have the DST_NOHASH flag set are not kept in the routing cache, but are kept instead in the flow cache.

struct sec_path *sp
    Used by IPSec (xfrm).
    Helper method: static inline int secpath_exists(struct sk_buff *skb) - returns whether sp is not NULL. Defined in include/net/xfrm.h.

unsigned int len,
unsigned int data_len;
    Helper method: static inline bool skb_is_nonlinear(const struct sk_buff *skb) returns data_len (when data_len is not 0, the skb is nonlinear).

__u16 mac_len, hdr_len;
    mac_len: the length of the link layer (L2) header.

union {
    __wsum csum;
    struct {
        __u16 csum_start;
        __u16 csum_offset;
    };
};

__u32 priority;
    Packet queueing priority. By default, the priority of the skb is 0.
    skb->priority, in the TX path, is set from the socket priority (sk->sk_priority). See, for example, the ip_queue_xmit() method in ip_output.c:

        skb->priority = sk->sk_priority;

    You can set the sk_priority of sk with setsockopt; for example:

        setsockopt(s, SOL_SOCKET, SO_PRIORITY, &prio, sizeof(prio))

    When we are forwarding the packet, there is no socket attached to the skb. Therefore, in ip_forward(), we set skb->priority according to a special table, called ip_tos2prio; this table has 16 entries; see include/net/route.h. And we have:

        int ip_forward(struct sk_buff *skb)
        {
            ...
            skb->priority = rt_tos2priority(iph->tos);
            ...
        }

    There are other cases when we set the priority of the skb; for example, in vlan_do_receive() (net/8021q/vlan_core.c).

kmemcheck_bitfield_begin(flags1);
__u8 local_df:1,
     cloned:1,
     ip_summed:2,
     nohdr:1,
     nfctinfo:3;
__u8 pkt_type:3
    The packet type is determined in the eth_type_trans() method. eth_type_trans() gets skb and net_device as parameters (see net/ethernet/eth.c). The packet type depends on the destination mac address in the ethernet header:
    - It is PACKET_BROADCAST for broadcast.
    - It is PACKET_MULTICAST for multicast.
    - It is PACKET_HOST if the destination mac address is the mac address of the device which was passed as a parameter.
    - It is PACKET_OTHERHOST if these conditions are not met.
    (There is another type for outgoing packets, PACKET_OUTGOING; see dev_queue_xmit_nit().)
    Notice that eth_type_trans() is unique to ethernet; for FDDI, for example, we have fddi_type_trans() (see net/802/fddi.c).

     fclone:2,
     ipvs_property:1,
     peeked:1,
     nf_trace:1 - netfilter packet trace flag
kmemcheck_bitfield_end(flags1);

__be16 protocol;

    skb->protocol is set in ethernet network drivers by assigning to it the return value of eth_type_trans().

void (*destructor)(struct sk_buff *skb);
    Helper method: static inline void skb_orphan(struct sk_buff *skb)
    If the skb has a destructor, call this destructor; set skb->sk and skb->destructor to null.

struct nf_conntrack *nfct;
struct sk_buff *nfct_reasm;
struct nf_bridge_info *nf_bridge;

int skb_iif;
    The ifindex of the device we arrived on. __netif_receive_skb() sets skb_iif to be the ifindex of the device on which we arrived, skb->dev.

__u32 rxhash;
    The rxhash of the skb is calculated in the receive path, in get_rps_cpu(), invoked both from netif_receive_skb() and from netif_rx(). The hash is calculated according to the source and destination addresses of the ip header, and the ports from the transport header.

__u16 vlan_tci;
    vlan tag control information (2 bytes). Composed of ID and priority.

__u16 tc_index; /* traffic control index */
__u16 tc_verd; /* traffic control verdict */
__u16 queue_mapping;

kmemcheck_bitfield_begin(flags2);
__u8 ndisc_nodetype:2;
__u8 pfmemalloc:1;
__u8 ooo_okay:1;
__u8 l4_rxhash:1
    A flag which is set when we use a 4-tuple hash over transport ports. __skb_get_rxhash() sets the rxhash.
__u8 wifi_acked_valid:1;
__u8 wifi_acked:1;
__u8 no_fcs:1;
__u8 head_frag:1;
__u8 encapsulation:1
    Indicates that the skb contains an encapsulated packet. This flag is set, for example, in vxlan_xmit() in the vxlan driver (drivers/net/vxlan.c).
kmemcheck_bitfield_end(flags2);

dma_cookie_t dma_cookie;
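The pkt_type classification described earlier for eth_type_trans() can be sketched as a small user-space function. This is a simplified illustration and not the kernel code: the enum toy_pkt_type and the function toy_classify() are hypothetical names, and the order of the checks is an assumption made for readability.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-ins for the PACKET_* types named in the text. */
enum toy_pkt_type {
    TOY_PACKET_HOST,
    TOY_PACKET_BROADCAST,
    TOY_PACKET_MULTICAST,
    TOY_PACKET_OTHERHOST
};

/* Classify a frame by its destination MAC, given the receiving device's MAC. */
static inline enum toy_pkt_type toy_classify(const uint8_t dest[6],
                                             const uint8_t dev_addr[6])
{
    static const uint8_t bcast[6] = { 0xff, 0xff, 0xff, 0xff, 0xff, 0xff };

    if (memcmp(dest, bcast, 6) == 0)
        return TOY_PACKET_BROADCAST;
    if (dest[0] & 0x01)                 /* group bit set: multicast */
        return TOY_PACKET_MULTICAST;
    if (memcmp(dest, dev_addr, 6) == 0)
        return TOY_PACKET_HOST;         /* addressed to this device */
    return TOY_PACKET_OTHERHOST;        /* unicast for someone else */
}
```

A frame classified as TOY_PACKET_OTHERHOST corresponds to traffic that is normally seen only when the device works in promiscuous mode.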

__u32 secmark;

union {
    __u32 mark;
    __u32 dropcount;
    __u32 avail_size;
};

sk_buff_data_t transport_header;
    The transport layer (L4) header (can be, for example, tcp header/udp header/icmp header, and more).
    Helper method: skb_transport_header().

sk_buff_data_t network_header
    The network layer (L3) header (can be, for example, ip header/ipv6 header/arp header).
    Helper method: skb_network_header(skb).

sk_buff_data_t mac_header;
    The link layer (L2) header.
    Helper method: skb_mac_header()

sk_buff_data_t tail;
sk_buff_data_t end;
unsigned char *head,
unsigned char *data;
unsigned int truesize;

atomic_t users
    A reference count. Initialized to 1. Increased in the RX path for each protocol handler in deliver_skb(). Decreased in kfree_skb().
    Helper method: skb_shared() returns true if users > 1.
    Helper method: static inline struct sk_buff *skb_get(struct sk_buff *skb) - increments users.
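The life cycle of the users reference count can be sketched as follows. This is a toy model under stated assumptions: toy_skb_ref and the toy_* functions are hypothetical names, and a plain int is used where the kernel uses an atomic_t, so the sketch is not safe for concurrent use.

```c
#include <stdbool.h>

/* Toy model of the sk_buff reference count; not the kernel struct. */
struct toy_skb_ref {
    int users;    /* the kernel uses atomic_t here */
    bool freed;   /* stands in for actually releasing the memory */
};

/* A freshly allocated buffer starts with one reference. */
static inline void toy_alloc(struct toy_skb_ref *skb)
{
    skb->users = 1;
    skb->freed = false;
}

/* Imitates skb_get(): take an additional reference. */
static inline struct toy_skb_ref *toy_get(struct toy_skb_ref *skb)
{
    skb->users++;
    return skb;
}

/* Imitates skb_shared() as described in the text: more than one holder. */
static inline bool toy_shared(const struct toy_skb_ref *skb)
{
    return skb->users > 1;
}

/* Imitates kfree_skb(): drop a reference, release on the last one. */
static inline void toy_kfree(struct toy_skb_ref *skb)
{
    if (--skb->users == 0)
        skb->freed = true;
}
```

Each protocol handler that receives the same skb holds one reference, so the buffer is released only after the last handler has dropped its reference.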



Receive Packet Steering (rps)

There is a global table called rps_sock_flow_table. Each call to recvmsg or sendmsg updates the rps_sock_flow_table by calling sock_rps_record_flow(), which eventually calls rps_record_sock_flow().

struct rps_sock_flow_table has an array called "ents".
- The index to this array is a hash (sk_rxhash) of the socket (sock) from user space.
- The value of each element is the (desired) CPU.

Each call to send/receive from user space updates the CPU according to the CPU on which the call was done. For example, in net/ipv4/af_inet.c:

    int inet_recvmsg()
    {
        ...
        sock_rps_record_flow(sk);
        ...
    }

In net/ipv4/tcp.c:

    ssize_t tcp_splice_read(struct socket *sock, loff_t *ppos,
                            struct pipe_inode_info *pipe, size_t len,
                            unsigned int flags)
    {
        ...
        sock_rps_record_flow(sk);
        ...
    }

struct rps_flow_table is per rx queue. It is a member of struct netdev_rx_queue, which represents an instance of an RX queue. The number of entries in this table is rps_flow_cnt.


It can be set via:

    echo numEntries > /sys/class/net/<dev>/queues/rx-<n>/rps_flow_cnt

XPS: Transmit Packet Steering

get_xps_queue() is called to determine which tx queue to use. When CONFIG_XPS is not set, this method returns -1. get_xps_queue() is called in the TX path, from netdev_pick_tx(), which is invoked by dev_queue_xmit(). The XPS code was written by Tom Herbert from Google.

Under sysfs we have an xps_cpus entry. For example, for tx-0:

    /sys/class/net/em1/queues/tx-0/xps_cpus

The netdevice structure includes a member of type struct xps_dev_maps, which includes maps (xps_map) that are indexed by CPU.
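The idea of per-CPU maps that select a tx queue can be sketched like this. It is only an illustration: toy_xps_map, toy_xps_dev_maps and toy_get_xps_queue() are hypothetical names, the fixed array sizes are arbitrary, and picking among several queues by a hash is an assumption of this sketch rather than something stated in the text.

```c
#include <stddef.h>

#define TOY_NR_CPUS    4
#define TOY_MAX_QUEUES 4

/* Queues that a single CPU is allowed to transmit on. */
struct toy_xps_map {
    int nqueues;
    int queues[TOY_MAX_QUEUES];
};

/* One map per CPU, i.e. the maps are indexed by CPU. */
struct toy_xps_dev_maps {
    struct toy_xps_map cpu_map[TOY_NR_CPUS];
};

/*
 * Return the tx queue for a packet sent from the given CPU,
 * or -1 when no mapping is configured for that CPU.
 */
static inline int toy_get_xps_queue(const struct toy_xps_dev_maps *maps,
                                    int cpu, unsigned int hash)
{
    const struct toy_xps_map *map;

    if (maps == NULL || cpu < 0 || cpu >= TOY_NR_CPUS)
        return -1;
    map = &maps->cpu_map[cpu];
    if (map->nqueues == 0)
        return -1;
    return map->queues[hash % (unsigned int)map->nqueues];
}
```

The -1 return value plays the same role as in the text: the caller then falls back to another way of choosing the tx queue.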

net_device

struct net_device members:

char name[IFNAMSIZ];
    The interface name, like eth0, eth1, p2p1, ... Can be up to 16 characters (IFNAMSIZ).

struct hlist_node name_hlist;
    Device name hash chain.

char *ifalias
    snmp alias interface name; its length can be up to 256 (IFALIASZ).
    Helper method: int dev_set_alias(struct net_device *dev, const char *alias, size_t len).

unsigned long mem_end;
    Shared mem end.

unsigned long mem_start;
    Shared mem start.

unsigned long base_addr;
    Device I/O address.

unsigned int irq
    Device IRQ number (this is the irq number with which we call request_irq()).

unsigned long state;
    A flag which can be one of these values:
    __LINK_STATE_START
    __LINK_STATE_PRESENT
    __LINK_STATE_NOCARRIER
    __LINK_STATE_LINKWATCH_PENDING
    __LINK_STATE_DORMANT

struct list_head dev_list;
struct list_head napi_list;
struct list_head unreg_list;

netdev_features_t features
    Currently active device features. Two examples:

    NETIF_F_NETNS_LOCAL is set for devices that are not allowed to move between network namespaces; sometimes these devices are named "local devices". For example, for the loopback device, ppp device, vxlan device and pimreg (multicast) device, we set NETIF_F_NETNS_LOCAL. If we try to move an interface whose NETIF_F_NETNS_LOCAL flag is set to a network namespace we created, we will get an "RTNETLINK answers: Invalid argument" error message from the dev_change_net_namespace() method (net/core/dev.c). See below in the Network namespaces section. Notice that for vxlan, we must set NETIF_F_NETNS_LOCAL since vxlan works over a UDP socket, and the UDP socket is part of the namespace it is created in. Moving the vxlan device does not move that state.

    NETIF_F_VLAN_CHALLENGED is set for devices which can't handle VLAN headers (usually because of a too large MTU due to vlan). For example, some types of Intel e100 (see e100_probe() in drivers/net/ethernet/intel/e100.c).

    NETIF_F_VLAN_CHALLENGED is also set when creating a bonding device, before enslaving the first ethernet interface to it:

        bond_setup()
        {
            ....
            bond_dev->features |= NETIF_F_VLAN_CHALLENGED;
            ...

    in drivers/net/bonding/bond_main.c. This is done to avoid problems that occur when adding VLANs over an empty bond. See also later in the bonding section.

    NETIF_F_LLTX is the LockLess TX flag and is considered deprecated. When it is set, we don't use the generic TX lock (this is why it is called LockLess TX). See the following macro (HARD_TX_LOCK) in net/core/dev.c:

        #define HARD_TX_LOCK(dev, txq, cpu) {                   \
            if ((dev->features & NETIF_F_LLTX) == 0) {          \
                __netif_tx_lock(txq, cpu);                      \
            }                                                   \
        }

    NETIF_F_LLTX is used in tunnel drivers like ipip, vxlan, veth, and in the IPv4 over IPSec tunneling driver. For example, in the ipip tunnel (net/ipv4/ipip.c), we have:

        ipip_tunnel_setup()
        {
            ...
            dev->features |= NETIF_F_LLTX;
            ...
        }

    and in vxlan (drivers/net/vxlan.c) we have:

        static void vxlan_setup(struct net_device *dev)
        {
            ...
            dev->features |= NETIF_F_LLTX;
            ..
        }

    in veth (drivers/net/veth.c):

        static void veth_setup(struct net_device *dev)
        {
            ...
            dev->features |= NETIF_F_LLTX;
            ...
        }

    and also in the IPv4 over IPSec tunneling driver, net/ipv4/ip_vti.c, we have:

        static void vti_tunnel_setup(struct net_device *dev)
        {
            ...
            dev->features |= NETIF_F_LLTX;
            ...
        }

    NETIF_F_LLTX is also used in a few drivers which have their own Tx lock, like drivers/net/ethernet/chelsio/cxgb. In drivers/net/ethernet/chelsio/cxgb/cxgb2.c, we have:

        static int __devinit init_one(struct pci_dev *pdev, const struct pci_device_id *ent)
        {
            ...
            netdev->features |= NETIF_F_SG | NETIF_F_IP_CSUM | NETIF_F_RXCSUM | NETIF_F_LLTX;
            ...
        }

    For the full list of net_device features, look in include/linux/netdev_features.h. See more info in Documentation/networking/netdev-features.txt by Michal Miroslaw.

netdev_features_t hw_features
    User-changeable features.
    • hw_features should be set only in the ndo_init callback and not changed later.


netdev_features_t wanted_features
    User-requested features.

netdev_features_t vlan_features;
    Mask of features inheritable by VLAN devices.

int ifindex
    Interface index. A unique device identifier.
    Helper method: static int dev_new_index(struct net *net)
    When creating a network device, ifindex is set. The ifindex is incremented by 1 each time we create a new network device. This is done by the dev_new_index() method. (Since ifindex is an int, the method takes into account cyclic overflow of the integer.) The first network device we create, which is almost always the loopback device, has an ifindex of 1. You can see the ifindex of the loopback device by:

        cat /sys/class/net/lo/ifindex

    You can see the ifindex of any other network device, which is named netDeviceName, by:

        cat /sys/class/net/netDeviceName/ifindex

int iflink;

struct net_device_stats stats
    Device statistics, like the number of rx_packets, the number of tx_packets, and more.

atomic_long_t rx_dropped
    Packets dropped by the core network; it should not be used in drivers. There are some cases when the stack increments the rx_dropped counter; for example, under certain conditions in __netif_receive_skb().

const struct iw_handler_def *wireless_handlers;
struct iw_public_data *wireless_data;

const struct net_device_ops *netdev_ops;
    net_device_ops includes pointers to several callback methods which we want to define in case we want to override the default behavior. The net_device_ops object MUST be initialized (even to an empty struct) prior to calling register_netdevice()! The reason is that in register_netdevice() we check whether dev->netdev_ops->ndo_init exists *without* verifying before that dev->netdev_ops is not null. In case we do not initialize netdev_ops, we will have here a null pointer dereference.

const struct ethtool_ops *ethtool_ops;
    Helper: SET_ETHTOOL_OPS() macro – sets ethtool_ops for a net_device. This structure includes callbacks (including offloads). Management of ethtool is done in net/core/ethtool.c.
    Instead of forcing device drivers to provide an empty ethtool_ops, there is a generic empty ethtool_ops named default_ethtool_ops (net/core/dev.c). It was added in this patch, http://www.spinics.net/lists/netdev/msg- 089".html, from Eric Dumazet.
    You can get ethtool (the user space tool) from: http://www.kernel.org/pub/software/network/ethtool/
    The maintainer of ethtool is Ben Hutchings. Older versions are available at http://sourceforge.net/projects/gkernel/ or from this git repository: git://git.kernel.org/pub/scm/network/ethtool/ethtool.git

const struct header_ops *header_ops;

unsigned int flags
    Interface flags (you can see or set them from user space using the ifconfig utility). For example: IFF_RUNNING, IFF_NOARP, IFF_POINTOPOINT, IFF_PROMISC, IFF_MASTER, IFF_SLAVE.

    IFF_NOARP is set for tunnel devices, for example. With tunnel devices, there is no need for sending ARP requests because you can connect only to the other device at the end of the tunnel. So we have, for example, in ipip_tunnel_setup() (net/ipv4/ipip.c):

        static void ipip_tunnel_setup(struct net_device *dev)
        {
            ...
            dev->flags = IFF_NOARP;
            ...
        }

    IFF_POINTOPOINT is set for ppp devices. For example, in drivers/net/ppp/ppp_generic.c, we have:

        static void ppp_setup(struct net_device *dev)
        {
            ...
            dev->flags = IFF_POINTOPOINT | IFF_NOARP | IFF_MULTICAST;
            ...
        }

    IFF_MASTER is set for master devices (whereas IFF_SLAVE is set for slave devices). For example, for bond devices, we have, in drivers/net/bonding/bond_main.c:

        static void bond_setup(struct net_device *bond_dev)
        {
            bond_dev->flags |= IFF_MASTER|IFF_MULTICAST;
        }

unsigned int priv_flags;
    These are flags you cannot see from user space with ifconfig or other utils. Some examples of priv_flags:

    IFF_EBRIDGE
        For a bridge interface. This flag is set in br_dev_setup() in net/bridge/br_device.c.

    IFF_BONDING
        This flag is set in the bond_setup() method, and also in the bond_enslave() method. Both methods are in drivers/net/bonding/bond_main.c.

    IFF_802_1Q_VLAN
        This flag is set in vlan_setup() in net/8021q/vlan_dev.c.

    IFF_TX_SKB_SHARING
        In ieee80211_if_setup(), net/mac80211/iface.c, we have:

            dev->priv_flags &= ~IFF_TX_SKB_SHARING;

    IFF_TEAM_PORT
        This flag is set in the team_port_enter() method in drivers/net/team/team.c.

    IFF_UNICAST_FLT
        Specifies that the driver handles unicast address filtering. In mv643xx_eth_probe(), drivers/net/ethernet/marvell/mv643xx_eth.c:

            ...
            dev->priv_flags |= IFF_UNICAST_FLT;
            ...

        The patch which added IFF_UNICAST_FLT: http://www.spinics.net/lists/netdev/msg !-90 9.html

    IFF_LIVE_ADDR_CHANGE
        When this flag is set, we can change the mac address with eth_mac_addr() even while the device is running. Many drivers use eth_mac_addr() as the ndo_set_mac_address() callback of struct net_device_ops. See eth_mac_addr() in net/ethernet/eth.c.

unsigned short gflags;

unsigned short padded
    How much padding was added by alloc_netdev().

unsigned char operstate
    RFC 2863 operstate. Can be one of the following:
    IF_OPER_UNKNOWN
    IF_OPER_NOTPRESENT
    IF_OPER_DOWN
    IF_OPER_LOWERLAYERDOWN
    IF_OPER_TESTING
    IF_OPER_DORMANT
    IF_OPER_UP

unsigned char link_mode
    Mapping policy to operstate.

unsigned char if_port
    Selectable AUI, TP, ...

unsigned char dma
    DMA channel.

unsigned int mtu
    Interface MTU value.
    Helper method: int eth_change_mtu(struct net_device *dev, int new_mtu)

    Maximum Transmission Unit: the maximum size of frame the device can handle. RFC 791 sets 68 as a minimum for internet module MTU. The eth_change_mtu() method above does not permit setting an mtu lower than 68. It should not be confused with path MTU, nor with the 576-byte datagram size which, also according to RFC 791, every host must be prepared to accept.

    int dev_set_mtu(struct net_device *dev, int new_mtu) - helper method to set a new mtu. In case ndo_change_mtu is defined, we also call ndo_change_mtu of net_device_ops. A NETDEV_CHANGEMTU message is sent.

    • Each protocol has an mtu of its own; the default is 1500 for Ethernet.
    • You can change the mtu from user space with ifconfig or with ip or via sysfs; for example, like this:

        ifconfig eth0 mtu 1400
        ip link set eth0 mtu 1400
        echo 1400 > /sys/class/net/eth0/mtu

    You can show the mtu of interface eth0 by:

        ifconfig eth0

    or by:

        ip link show

    or by:

        cat /sys/class/net/eth0/mtu

    You cannot change it to values higher than 1500 on a 10Mb/s network:

        ifconfig eth0 mtu 1501

    will give: "SIOCSIFMTU: Invalid argument".

unsigned short type
    Interface hardware type. type is the hw type of the device.
    For ethernet it is ARPHRD_ETHER. In ethernet, the device type ARPHRD_ETHER is assigned in ether_setup(); see net/ethernet/eth.c.


    For ppp, the device type ARPHRD_PPP is assigned in ppp_setup(); see drivers/net/ppp/ppp_generic.c.
    For IPv4 tunnels, the type is ARPHRD_TUNNEL. For IPv6 tunnels, the type is ARPHRD_TUNNEL6.
    For example, for the ip in ip tunnel in IPv4 (net/ipv4/ipip.c), we have:

        static void ipip_tunnel_setup(struct net_device *dev)
        {
            ...
            dev->type = ARPHRD_TUNNEL;
            ...
        }

    And for the ip in ip tunnel in IPv6, we have:

        static void ip6_tnl_dev_setup(struct net_device *dev)
        {
            ...
            dev->type = ARPHRD_TUNNEL6;
            ...
        }

    As another example, in vti_tunnel_setup(), net/ipv4/ip_vti.c, we have:

        static void vti_tunnel_setup(struct net_device *dev)
        {
            ...
            dev->type = ARPHRD_TUNNEL;
            ...
        }

    For tun devices, the type is ARPHRD_NONE (see drivers/net/tun.c):

        static void tun_net_init(struct net_device *dev)
        {
            ...
            case TUN_TUN_DEV:
                dev->type = ARPHRD_NONE;
            ...
        }

unsigned short hard_header_len;
    This is the hardware header length. In case of ethernet, it is 14 (MAC SA + MAC DA + TYPE). It is set to 14 (ETH_HLEN) in ether_setup():

        void ether_setup(struct net_device *dev)
        {
            ...
            dev->hard_header_len = ETH_HLEN;
            ...
        }

    In case of tunnel devices, it is set to different values, according to the tunnel specifics. So in case of vxlan, we have, in drivers/net/vxlan.c:

        static void vxlan_setup(struct net_device *dev)
        {
            ...
            dev->hard_header_len = ETH_HLEN + VXLAN_HEADROOM;
            ...
        }

    where VXLAN_HEADROOM is the size of the IP header (20) + the size of the UDP header (8) + the size of the VXLAN header (8) + the size of the Ethernet header (14); so VXLAN_HEADROOM is 50 bytes in total.

    With the ipip tunnel we have, in ipip_tunnel_setup() (net/ipv4/ipip.c):

        static void ipip_tunnel_setup(struct net_device *dev)
        {
            ...
            dev->hard_header_len = LL_MAX_HEADER + sizeof(struct iphdr);
            ...
        }

/* extra head- and tailroom the hardware may need, but not in all cases
 * can this be guaranteed, especially tailroom. Some cases also use
 * LL_MAX_HEADER instead to allocate the skb.
 */
unsigned short needed_headroom;
unsigned short needed_tailroom;

/* Interface address info. */
unsigned char perm_addr[MAX_ADDR_LEN];
    Permanent hw address.

unsigned char addr_assign_type
    hw address assignment type.


    By default, the mac address is permanent (NET_ADDR_PERM). In case the mac address was generated with a helper method called eth_hw_addr_random(), the type of the mac address is NET_ADDR_RANDOM. There is also a type called NET_ADDR_STOLEN, which is not used. The type of the mac address is stored in the addr_assign_type member of the net_device. Also, when we change the mac address of the device with eth_mac_addr(), we reset the addr_assign_type with ~NET_ADDR_RANDOM (in case it was marked as NET_ADDR_RANDOM before). When we register a network device (in register_netdevice()), in case the addr_assign_type is NET_ADDR_PERM, we set dev->perm_addr to be dev->dev_addr.
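The addr_assign_type transitions just described can be sketched as a small state model. This is a toy illustration, not kernel code: struct toy_dev, the TOY_NET_ADDR_* constants and the toy_* functions are hypothetical stand-ins for the net_device members and the helpers named in the text, and the numeric values of the constants are an assumption of this sketch.

```c
#include <string.h>

/* Hypothetical stand-ins for NET_ADDR_PERM, NET_ADDR_RANDOM, NET_ADDR_STOLEN. */
enum {
    TOY_NET_ADDR_PERM   = 0,
    TOY_NET_ADDR_RANDOM = 1,
    TOY_NET_ADDR_STOLEN = 2
};

struct toy_dev {
    unsigned char addr_assign_type;
    unsigned char dev_addr[6];
    unsigned char perm_addr[6];
};

/* Imitates eth_hw_addr_random(): install an address and mark it random. */
static inline void toy_set_random_addr(struct toy_dev *dev,
                                       const unsigned char addr[6])
{
    memcpy(dev->dev_addr, addr, 6);
    dev->addr_assign_type |= TOY_NET_ADDR_RANDOM;
}

/* Imitates eth_mac_addr(): an explicit change clears the random mark. */
static inline void toy_mac_addr(struct toy_dev *dev,
                                const unsigned char addr[6])
{
    memcpy(dev->dev_addr, addr, 6);
    dev->addr_assign_type &= (unsigned char)~TOY_NET_ADDR_RANDOM;
}

/* Imitates the register_netdevice() step: remember a permanent address. */
static inline void toy_register(struct toy_dev *dev)
{
    if (dev->addr_assign_type == TOY_NET_ADDR_PERM)
        memcpy(dev->perm_addr, dev->dev_addr, 6);
}
```

In this model a device registered with a random address keeps an empty perm_addr, which matches the rule in the text that perm_addr is filled in only for NET_ADDR_PERM.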

unsigned char addr_len
    Hardware address length.

unsigned char neigh_priv_len;

unsigned short dev_id
    For shared network cards.

spinlock_t addr_list_lock;

struct netdev_hw_addr_list uc
    Unicast mac addresses.
    Helper method: int dev_uc_add(struct net_device *dev, const unsigned char *addr)
    Adds a unicast address to the device; in case this address already exists, increases the reference count.
    Helper method: void dev_uc_flush(struct net_device *dev)
    Flushes the unicast addresses of the device and zeroes the reference count.

struct netdev_hw_addr_list mc
    Multicast mac addresses.

bool uc_promisc;

unsigned int promiscuity;


    A counter of the times a NIC is told to work in promiscuous mode; used to enable more than one sniffing client. It is also used in the bridging subsystem, when adding a bridge interface (see the call to dev_set_promiscuity() in br_add_if(), net/bridge/br_if.c). dev_set_promiscuity() sets the IFF_PROMISC flag of the netdevice. Since promiscuity is an int, dev_set_promiscuity() takes into account cyclic overflow of the integer.

unsigned int allmulti
    A counter of allmulti.
    Helper method: dev_set_allmulti() updates the allmulti count on a device. Moreover, it sets (or removes) the IFF_ALLMULTI flag.
    You set allmulti by:

        ifconfig eth0 allmulti

    (This invokes the dev_set_allmulti() kernel method, with inc=1, in net/core/dev.c.)
    You remove allmulti by:

        ifconfig eth0 -allmulti

    (This invokes the dev_set_allmulti() kernel method, with inc=-1, in net/core/dev.c.)
    The "allmulti" counter in netdevice enables or disables all-multicast mode. When selected, all multicast packets on the network will be received by the interface.
    Note that in case the IFF_ALLMULTI flag of the device actually changed (it was not set before enabling allmulti, or it was set before disabling allmulti), we also call:

        dev_change_rx_flags(dev, IFF_ALLMULTI);
        dev_set_rx_mode(dev);
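The counting behavior described for promiscuity (several clients may request promiscuous mode, and the flag is cleared only when the last one leaves) can be sketched in user space. This is a simplified model, assuming the counter is an unsigned int as in the member list: struct toy_netdev, TOY_IFF_PROMISC and toy_set_promiscuity() are hypothetical names, and the choice to return -EOVERFLOW on wrap-around is part of the sketch.

```c
#include <errno.h>
#include <limits.h>

#define TOY_IFF_PROMISC 0x100u   /* stand-in for the IFF_PROMISC bit */

struct toy_netdev {
    unsigned int flags;
    unsigned int promiscuity;    /* number of active requesters */
};

/*
 * inc is +1 to request promiscuous mode and -1 to release it.
 * Returns 0 on success, or -EOVERFLOW if the counter would wrap.
 */
int toy_set_promiscuity(struct toy_netdev *dev, int inc)
{
    dev->flags |= TOY_IFF_PROMISC;
    dev->promiscuity += (unsigned int)inc;
    if (dev->promiscuity == 0) {
        if (inc < 0) {
            /* The last requester is gone: leave promiscuous mode. */
            dev->flags &= ~TOY_IFF_PROMISC;
        } else {
            /* The counter wrapped around: undo and report it. */
            dev->promiscuity -= (unsigned int)inc;
            return -EOVERFLOW;
        }
    }
    return 0;
}
```

With two sniffing clients, the first release leaves the flag set and only the second release clears it, which is the reason a counter is kept instead of a single boolean.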

/* Protocol specific pointers */

struct vlan_info __rcu *vlan_info
    VLAN info.

struct dsa_switch_tree *dsa_ptr
    dsa specific data.

void *atalk_ptr
    AppleTalk link.

struct in_device __rcu *ip_ptr
    IPv4 specific data. This pointer is assigned to a pointer to struct in_device in inetdev_init() (net/ipv4/devinet.c).

struct dn_dev __rcu *dn_ptr
    DECnet specific data.

struct inet6_dev __rcu *ip6_ptr;
    IPv6 specific data.

void *ax25_ptr
    AX.25 specific data.

struct wireless_dev *ieee80211_ptr
    IEEE 802.11 specific data; should be assigned before registering.

unsigned long last_rx;
    Time of last Rx. This should not be set in drivers, unless really needed, because the network stack (bonding) uses it if/when necessary, to avoid dirtying this cache line.

The following patchset by Jiri Pirko suggested removing master and netdev_set_master() from the network stack: http://www.spinics.net/lists/netdev/ms#(($)&*.html
struct net_device *master and netdev_set_master() were indeed removed. A list was added instead:

    struct list_head upper_dev_list; /* List of upper devices */

struct netdev_upper was also added.

unsigned char *dev_addr;
    The MAC address of the device (6 bytes).
    dev_set_mac_address(struct net_device *dev, struct sockaddr *sa) helper method: changes the mac address (dev_addr member) by invoking the ndo_set_mac_address() callback, and sends a NETDEV_CHANGEADDR notification. Many drivers use the ethernet generic eth_mac_addr() method (net/ethernet/eth.c) as the ndo_set_mac_address() callback.

struct netdev_hw_addr_list dev_addrs;
    List of device hw addresses.

unsigned char broadcast[MAX_ADDR_LEN]; /* hw bcast add */

struct kset *queues_kset;

struct netdev_rx_queue *_rx;

unsigned int num_rx_queues
    Number of RX queues allocated at register_netdev() time.

unsigned int real_num_rx_queues
    Number of RX queues currently active in the device.
    netif_set_real_num_rx_queues() sets real_num_rx_queues and updates the sysfs entries (/sys/class/net/deviceName/queues/*).
    Notice that alloc_netdev_mq() initializes num_rx_queues, real_num_rx_queues, num_tx_queues and real_num_tx_queues to the same value.
    You can set the number of tx queues and rx queues by "ip link" when adding a device. For example, if we want to create a vlan device with 6 tx queues and 7 rx queues, we can run:

        ip link add link p2p1 name p2p1.100 numtxqueues 6 numrxqueues 7 type vlan id 100

    Under the corresponding sysfs p2p1.100 entry, we will indeed see 7 rx queues (numbered from 0 to 6) and 6 tx queues (numbered from 0 to 5):

        ls /sys/class/net/p2p1.100/queues
        rx-0 rx-1 rx-2 rx-3 rx-4 rx-5 rx-6 tx-0 tx-1 tx-2 tx-3 tx-4 tx-5

/* CPU reverse-mapping for RX completion interrupts, indexed
 * by RX queue number. Assigned by driver. This must only be
 * set if the ndo_rx_flow_steer operation is defined.
 */
struct cpu_rmap *rx_cpu_rmap;

rx_handler_func_t __rcu *rx_handler;


Helper method: netdev_rx_handler_register(struct net_device *dev, rx_handler_func_t *rx_handler, void *rx_handler_data)
rx_handler is set by netdev_rx_handler_register(). It is used in bonding, team, openvswitch, macvlan, and bridge devices.

void __rcu *rx_handler_data;
rx_handler_data is also set by netdev_rx_handler_register(); see above.

struct netdev_queue __rcu *ingress_queue;
/*
 * Cache lines mostly used on transmit path
 */
struct netdev_queue *_tx ____cacheline_aligned_in_smp;
unsigned int num_tx_queues; /* Number of TX queues allocated at alloc_netdev_mq() time */
unsigned int real_num_tx_queues; /* Number of TX queues currently active in device */
Helper method: netif_set_real_num_tx_queues() sets real_num_tx_queues and updates the sysfs entries.

struct Qdisc *qdisc; - root qdisc from the userspace point of view.



The dev_init_scheduler() method initializes qdisc in register_netdevice().

unsigned long tx_queue_len; - Max frames per queue allowed; can be set by ifconfig, for example:
    ifconfig p2p1 txqueuelen 600
spinlock_t tx_global_lock;
struct xps_dev_maps __rcu *xps_maps;
unsigned long trans_start; /* Time (in jiffies) of last Tx */
int watchdog_timeo; /* used by dev_watchdog() */
struct timer_list watchdog_timer;

int __percpu *pcpu_refcnt; - Number of references to this device.
Helper methods:
    static inline void dev_hold(struct net_device *dev) increments the reference count of the device.
    static inline void dev_put(struct net_device *dev) decrements the reference count of the device.
    int netdev_refcnt_read(const struct net_device *dev) reads the sum of all CPUs' reference counts of this device.

/* delayed register/unregister */
struct list_head todo_list;
/* device index hash chain */
struct hlist_node index_hlist;
struct list_head link_watch_list;

/* register/unregister state machine */
enum {
    NETREG_UNINITIALIZED=0,
    NETREG_REGISTERED,    /* completed register_netdevice */
    NETREG_UNREGISTERING, /* called unregister_netdevice */
    NETREG_UNREGISTERED,  /* completed unregister todo */
    NETREG_RELEASED,      /* called free_netdev */
    NETREG_DUMMY,         /* dummy device for NAPI poll */
} reg_state:8;

bool dismantle : the device is going to be freed. This flag is set in rollback_registered_many() when unregistering a device. It is referenced, for example, in macvlan_stop().

enum rtnl_link_state rtnl_link_state
Can be RTNL_LINK_INITIALIZING or RTNL_LINK_INITIALIZED. When creating a new link, in rtnl_newlink(), the rtnl_link_state is set to RTNL_LINK_INITIALIZING (this is done by rtnl_create_link(), which is invoked from rtnl_newlink()); later on, when calling rtnl_configure_link(), the rtnl_link_state is set to RTNL_LINK_INITIALIZED.
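The per-CPU reference counting behind dev_hold(), dev_put() and netdev_refcnt_read() can be modeled in user space. The following is only a sketch under stated assumptions: the struct, the fixed CPU count, and all model_* names are hypothetical and are not kernel code. It shows why the count is read as a sum: a hold taken on one CPU may be released on another, so individual slots can go negative while the total stays correct.

```c
#include <assert.h>
#include <stddef.h>

#define NR_CPUS_MODEL 4 /* hypothetical CPU count for this model */

/* Hypothetical stand-in for net_device: one counter slot per CPU. */
struct model_dev {
    int pcpu_refcnt[NR_CPUS_MODEL];
};

/* Model of dev_hold(): bump only the slot of the CPU we run on. */
void model_dev_hold(struct model_dev *dev, int cpu)
{
    assert(cpu >= 0 && cpu < NR_CPUS_MODEL);
    dev->pcpu_refcnt[cpu]++;
}

/* Model of dev_put(): decrement only the slot of the CPU we run on. */
void model_dev_put(struct model_dev *dev, int cpu)
{
    assert(cpu >= 0 && cpu < NR_CPUS_MODEL);
    dev->pcpu_refcnt[cpu]--;
}

/* Model of netdev_refcnt_read(): the real count is the sum of all slots. */
int model_refcnt_read(const struct model_dev *dev)
{
    int sum = 0;
    for (size_t i = 0; i < NR_CPUS_MODEL; i++)
        sum += dev->pcpu_refcnt[i];
    return sum;
}
```

Keeping one slot per CPU avoids contention on a single shared counter in the hot path; the cost is that reading the count has to walk every slot.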

/* Called from unregister, can be used to call free_netdev */
void (*destructor)(struct net_device *dev);
struct netpoll_info *npinfo;
struct net *nd_net;
• nd_net: The network namespace this network device is inside.
dev_net_set(struct net_device *dev, struct net *net) - a helper method which sets the nd_net of net_device to the specified net namespace. (include/linux/netdevice.h)
dev_change_net_namespace() - a helper method which moves the network device to a different network namespace.



If the NETIF_F_NETNS_LOCAL flag of the net device is set, the operation is not performed and an error is returned. Callers of this method must hold the rtnl semaphore. This method returns 0 upon success.

/* mid-layer private */
union {
    void *ml_priv;
    struct pcpu_lstats __percpu *lstats; /* loopback stats */
    struct pcpu_tstats __percpu *tstats; /* tunnel stats */
    struct pcpu_dstats __percpu *dstats; /* dummy stats */
};
struct garp_port __rcu *garp_port;
struct device dev - used for the class/net/name entry.
/* space for optional device, statistics, and wireless sysfs groups */
const struct attribute_group *sysfs_groups[4];

const struct rtnl_link_ops *rtnl_link_ops;
rtnetlink link ops instance. We can use rtnl_link_ops in a network driver or network software module, and declare methods which we want to call when, for example, we create a new link with the "ip link" command. In case we use rtnl_link_ops, we should register it with rtnl_link_register() in the init method of the driver, and unregister it at module exit with rtnl_link_unregister(). See for example the vxlan driver code, drivers/net/vxlan.c.
In case we do not use rtnl_link_ops, we will use the generic rtnetlink callbacks which are called upon receiving certain messages. For example, in register_netdevice(), in case dev->rtnl_link_ops is NULL, we send an RTM_NEWLINK message. This message is handled by the rtnl_newlink() callback in net/core/rtnetlink.c.

/* for setting kernel sock attribute on TCP connection setup */
#define GSO_MAX_SIZE 65536
unsigned int gso_max_size;
#define GSO_MAX_SEGS 65535
u16 gso_max_segs;
const struct dcbnl_rtnl_ops *dcbnl_ops; - Data Center Bridging netlink ops
u8 num_tc;
struct netdev_tc_txq tc_to_txq[TC_MAX_QUEUE];
u8 prio_tc_map[TC_BITMASK + 1];
/* max exchange id for FCoE LRO by ddp */
unsigned int fcoe_ddp_xid;
struct netprio_map __rcu *priomap;
/* phy device may attach itself for hardware timestamping */
struct phy_device *phydev;
The phy device associated with the network device. See the phy_device struct definition in include/linux/phy.h.
struct lock_class_key *qdisc_tx_busylock;
int group;
The group the device belongs to.
• Helper method: void dev_set_group(struct net_device *dev, int new_group): sets a new group.
struct pm_qos_request pm_qos_req; - for power management requests.
}

• Macros starting with IN_DEV, like IN_DEV_FORWARD() or IN_DEV_RX_REDIRECTS(), are related to net_device. struct in_device has a member named cnf (an instance of ipv4_devconf). Setting /proc/sys/net/ipv4/conf/all/forwarding eventually sets the forwarding member of in_device to 1. The same is true for accept_redirects and send_redirects; both are also members of cnf (ipv4_devconf).
• In most distros, /proc/sys/net/ipv4/conf/all/forwarding is 0.
• There are cases when we work with virtual devices.
  - For example, bonding (setting the same IP for two or more NICs, for load balancing and for high availability).
  - Many times this is implemented using the private data of the device (the void *priv member of net_device);
struct net_device_ops has methods for network device management:
ndo_set_rx_mode() is used to initialize multicast addresses (it was done in the past by the set_multicast_list() method, which is now deprecated).
ndo_change_mtu() is for setting the mtu.
Recently, three methods were added to support bridge operations (John Fastabend):
    ndo_fdb_add()
    ndo_fdb_del()
    ndo_fdb_dump()
The Intel ixgbe driver uses these methods. See drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
Also, a new command which uses these methods is to be added to the iproute2 package; this command is called "bridge". See http://patchwork.ozlabs.org/patch/ !996/

There are some macros which operate on the net_device struct and which are defined in include/linux/netdevice.h:
SET_NETDEV_DEVTYPE(net, devtype):
SET_NETDEV_DEVTYPE() is used, for example, in br_dev_setup(), in net/bridge/br_device.c:
    static struct device_type br_type = {
        .name = "bridge",
    };

    br_dev_setup()
    {
        SET_NETDEV_DEVTYPE(dev, &br_type);
    }
Calling SET_NETDEV_DEVTYPE() this way enables us to see DEVTYPE=bridge when running the udevadm command on the bridge sysfs entry:
    udevadm info -q all -p /sys/devices/virtual/net/mybr
    P: /devices/virtual/net/mybr
    E: DEVPATH=/devices/virtual/net/mybr
    E: DEVTYPE=bridge
    E: ID_MM_CANDIDATE=1
    E: IFINDEX=7
    E: INTERFACE=mybr
    E: SUBSYSTEM=net
    E: SYSTEMD_ALIAS=/sys/subsystem/net/devices/mybr
    E: TAGS=:systemd:
    E: USEC_INITIALIZED=4288173427
The following patch from Doug Goldstein (which was applied) adds a sysfs type to vlan:
The patch itself: http://www.spinics.net/lists/netdev/msg214013.html
The patch approval: http://www.spinics.net/lists/netdev/msg216184.html
Another example of usage of the SET_NETDEV_DEVTYPE() macro is in cfg80211_netdev_notifier_call() in net/wireless/core.c

    static struct device_type wiphy_type = {
        .name = "wlan",
    };

    cfg80211_netdev_notifier_call()
    {
        ...
        SET_NETDEV_DEVTYPE(dev, &wiphy_type);
        ...
    }
Another macro is SET_NETDEV_DEV(net, pdev).
It sets the sysfs physical device reference for the network logical device.

Broadcasts
Limited broadcast - sent to 255.255.255.255 - reaches all NICs on the same network segment as the source NIC.
Direct broadcast: for example, for network 192.168.0.0, the direct broadcast is 192.168.255.255.
Helper method: ipv4_is_lbcast(address) checks whether the address is 255.255.255.255 (which is INADDR_BROADCAST).

Reverse Path Filter (rp_filter)
When trying to send a packet with a source IP which is not configured on any interface of the machine ("spoofing"), the other side will discard the packet.
Where is it done and how can we prevent it?
The reason is that __fib_validate_source() returns -EXDEV ("Cross-device link") in such a case, when the RPF (Reverse Path Filter) is set, which is the default. We can avoid this problem by setting RPF to off:
    echo 0 > /proc/sys/net/ipv4/conf/eth0/rp_filter
    echo 0 > /proc/sys/net/ipv4/conf/all/rp_filter
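The two broadcast kinds described in the Broadcasts notes can be illustrated with a little address arithmetic. This is a user-space sketch, not kernel code: the helper names are hypothetical, addresses are kept in host byte order for readability, and the 16-bit prefix used for the 192.168.0.0 example is an assumption implied by the direct broadcast value 192.168.255.255.

```c
#include <assert.h>
#include <stdint.h>

#define MODEL_INADDR_BROADCAST 0xffffffffu /* 255.255.255.255 */

/* Build an IPv4 address in host byte order from its four octets. */
uint32_t make_ipv4(uint8_t a, uint8_t b, uint8_t c, uint8_t d)
{
    return ((uint32_t)a << 24) | ((uint32_t)b << 16) | ((uint32_t)c << 8) | d;
}

/* Same test as the limited-broadcast check: is this 255.255.255.255? */
int is_limited_broadcast(uint32_t addr)
{
    return addr == MODEL_INADDR_BROADCAST;
}

/* Direct broadcast of a network: keep the prefix bits, set all host bits. */
uint32_t directed_broadcast(uint32_t network, unsigned prefix_len)
{
    assert(prefix_len <= 32);
    /* A shift by 32 is undefined in C, so prefix length 0 is special-cased. */
    uint32_t mask = prefix_len == 0 ? 0 : 0xffffffffu << (32 - prefix_len);
    return (network & mask) | ~mask;
}
```

With these helpers, directed_broadcast(make_ipv4(192, 168, 0, 0), 16) yields the same value as make_ipv4(192, 168, 255, 255).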

We can view the number of packets rejected by the Reverse Path Filter by:
    netstat -s | grep IPReversePathFilter
    IPReversePathFilter:
This displays the LINUX_MIB_IPRPFILTER MIB counter, which is incremented whenever ip_rcv_finish() gets an -EXDEV error from ip_route_input_noref().
We can also view it by: cat /proc/net/netstat
IN_DEV_RPFILTER(idev) macro - used in fib_validate_source().
See myping.c as an example of spoofing in the following link: http://www.tenouk.com/Module43a.html

Network interface drivers
• Most of the NICs are PCI devices. There are cases, especially with SoC (System On Chip) vendors, where the network interfaces are not PCI devices. There are also some USB network devices.
• The drivers for network PCI devices use the generic PCI calls, like pci_register_driver() and pci_enable_device().
• For more info on NIC drivers see the article "Writing Network Device Driver for Linux" (link no. 9 in links) and chap 17 in ldd3.
• There are two modes in which a NIC can receive a packet.
  - The traditional way is interrupt driven: each received packet is an asynchronous event which causes an interrupt.

NAPI
• NAPI (new API).
  - The NIC works in polling mode.
  - In order for the NIC to work in polling mode, the driver should be built with a proper flag.
    - Most of the new drivers support this feature.
    - When working with NAPI and when there is a very high load, packets are lost; but this occurs before they are fed into the network stack (in a non-NAPI driver they pass into the stack).

The initial change to napi_struct is explained in: http://lwn.net/Articles/244640/

User Space Tools:
• iputils (including ping, arping, tracepath, tracepath6, ifenslave and more)
• net-tools (ifconfig, netstat, route, arp and more)
• IPROUTE2 (ip command with many options)
  - Uses the rtnetlink API.
  - Has much wider functionality than net-tools; for example, you can create tunnels with the "ip" command.
    - Note: no need for the "n" flag when using IPROUTE2 (because it does not work with DNS).

Routing Subsystem
• The routing table enables us to find the net device and the address of the host to which a packet will be sent.
• Reading entries in the routing table is done by calling fib_lookup().
  In IPv4: int fib_lookup(struct net *net, struct flowi4 *flp, struct fib_result *res)
  In IPv6: struct dst_entry *fib6_rule_lookup(struct net *net, struct flowi6 *fl6, int flags, pol_lookup_t lookup)
• FIB is the "Forwarding Information Base".
• There are two routing tables by default (non Policy Routing case):
  - local FIB table (ip_fib_local_table; ID 255).
  - main FIB table (ip_fib_main_table; ID 254).
    - See: include/net/ip_fib.h.
• Routes can be added into the main routing table in one of 3 ways:
  - By sys admin command (route add/ip route).
  - By routing daemons.
  - As a result of ICMP (REDIRECT).

• A routing table is implemented by struct fib_table.

Routing Tables
• fib_lookup() first searches the local FIB table (ip_fib_local_table).
• In case it does not find an entry, it looks in the main FIB table (ip_fib_main_table).
• Why is it in this order?
• There was in the past a routing cache; there was a single routing cache, regardless of how many routing tables there were. The routing cache was removed in July 2012; see http://www.spinics.net/lists/netdev/msg205372.html
  One of the reasons for the removal of the routing cache was that it was easy to launch denial of service attacks against it.
• You can see the routing cache by running "route -C".
• Alternatively, you can see it by: "cat /proc/net/rt_cache".
  - con: this way, the addresses are in hex format.
You should distinguish between two cases: when CONFIG_IP_MULTIPLE_TABLES is set (which is the default in some distros, like FEDORA) and when it is not set. Some fib methods have two implementations, one for each of these cases. For example, fib_get_table() is implemented in net/ipv4/fib_frontend.c when CONFIG_IP_MULTIPLE_TABLES is set, whereas when CONFIG_IP_MULTIPLE_TABLES is not defined, it is implemented in include/net/ip_fib.h

Routing Cache
Note: In recent kernels, the routing cache is removed.
• The routing cache is built of rtable elements:
• struct rtable (see: include/net/route.h)
  The first member of rtable is dst (struct dst_entry dst).

• The dst_entry is the protocol independent part.
  - Thus, for example, we have a first member called dst also in rt6_info in IPv6; rt6_info is the parallel of rtable for IPv6 (include/net/ip6_fib.h).
rtable is created in __mkroute_input() and in __mkroute_output(). (net/ipv4/route.c)
There is a member in rtable called rt_is_input, specifying whether it is an input route or an output route.
• There are also two helper methods, rt_is_input_route() and rt_is_output_route(), which return whether the route is an input route or an output route.
• The key for a lookup operation in the routing cache is an IP address (whereas in the routing table the key is a subnet).
• The lookup is done by fib_trie (net/ipv4/fib_trie.c)
• It is based on extending the lookup key.
• By Robert Olsson et al (see links).
  - TRASH (trie + hash)
  - Active Garbage Collection
• You can view fib trie stats by: cat /proc/net/fib_triestat
• You can flush the routing cache by: ip route flush cache
  caveat: it sometimes takes 2-3 seconds or more; it depends on your machine.
• You can show the routing cache by: ip route show cache

Creating a Routing Cache Entry
• Allocation of an rtable instance (rth) is done by: dst_alloc().
  - dst_alloc() in fact creates and returns a pointer to dst_entry and we cast it to rtable (net/core/dst.c).
• Setting the input and output methods of dst: rth->u.dst.input and rth->u.dst.output
• Setting the flowi member of dst (rth->fl)
  - Next time there is a lookup in the cache, for example in ip_route_input(), we will compare against rth->fl.
• A garbage collection call deletes eligible entries from the routing cache.
• Which entries are not eligible?
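The cast from dst_entry to rtable works because dst is the first member of rtable, so both pointers refer to the same address. The sketch below models that pattern in user space; the model_* types and functions are hypothetical simplifications (the real structures have many more fields, and the real dst_alloc() takes different arguments).

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for the protocol independent part. */
struct model_dst_entry {
    int (*input)(void *pkt);
    int (*output)(void *pkt);
};

/* Hypothetical stand-in for rtable: dst must stay the first member. */
struct model_rtable {
    struct model_dst_entry dst;
    int rt_is_input;
};

/* Allocates room for the enclosing object, returns the generic part. */
struct model_dst_entry *model_dst_alloc(size_t obj_size)
{
    assert(obj_size >= sizeof(struct model_dst_entry));
    return calloc(1, obj_size);
}

/* Allocates a route entry by casting the generic pointer back. */
struct model_rtable *model_rt_alloc(int is_input)
{
    struct model_dst_entry *dst = model_dst_alloc(sizeof(struct model_rtable));
    if (dst == NULL)
        return NULL;
    /* Valid because dst sits at offset 0 of struct model_rtable. */
    struct model_rtable *rth = (struct model_rtable *)dst;
    rth->rt_is_input = is_input;
    return rth;
}
```

The same layout lets generic code hold a pointer to the embedded dst while IPv4 code (and, with its own enclosing struct, IPv6 code) recovers the full object with a plain cast.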

Policy Routing (multiple tables)
• Generic routing uses destination address based decisions.
• There are cases when the destination address is not the sole parameter to decide which route to give; Policy Routing comes to enable this.
• Adding a routing table: by adding a line to /etc/iproute2/rt_tables.
  - For example: add the line "252 my_rt_table".
  - There can be up to 255 routing tables.
• Policy routing should be enabled when building the kernel (CONFIG_IP_MULTIPLE_TABLES should be set).
• Example of adding a route in this table:
• > ip route add default via 192.168.0.1 table my_rt_table
• Show the table by:
  - ip route show table my_rt_table
• You can add a rule to the routing policy database (RPDB) by "ip rule add ..."
In fib4_rules_init(), we set net->ipv4.fib_has_custom_rules to false. This is because when working with the default tables and not adding any other tables, there are no custom rules. Each time we add a new rule (by "ip rule add"), we set net->ipv4.fib_has_custom_rules to true; see fib4_rule_configure() in net/ipv4/fib_rules.c
Also, when we delete a rule in fib4_rule_delete(), we set net->ipv4.fib_has_custom_rules to true;

ip rule add
  - The rule can be based on input interface, TOS, fwmark (from netfilter).
• ip rule list
  - shows all rules.
struct fib_rule represents the rules created by policy routing.

Policy Routing: add/delete a rule example
• ip rule add tos 0x04 table 252
  - This will cause packets with tos=0x04 (in the iphdr) to be routed by looking into the table we added (252).
  - So the default gw for this type of packet will be 192.168.0.1
  - ip rule show will give:
  - 32765: from all tos reliability lookup my_rt_table
  - ...

Policy Routing: add/delete a rule example
• Delete a rule: ip rule del tos 0x04 table 254
• Breaking the fib_table into multiple data structures gives flexibility and enables a fine grained and high level of sharing.
  - Suppose that 10 routes to 10 different networks have the same next hop gw.
  - We can have one fib_info which will be shared by 10 fib_aliases.
  - fz_divisor is the number of buckets.
• Each fib_node element represents a unique subnet.
  - The fn_key member of fib_node is the subnet (32 bit).
• In the usual case there is one fib_nh (Next Hop).
  - If the route was configured by using a multipath route, there can be more than one fib_nh.
• Suppose that a device goes down or is enabled.
• We need to disable/enable all routes which use this device.
• But how can we know which routes use this device?
• In order to know it efficiently, there is the fib_info_devhash table.
• This table is indexed by the device identifier.
• See fib_sync_down() and fib_sync_up() in net/ipv4/fib_semantics.c
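Since each fib_node stands for one subnet, a lookup has to pick one subnet out of all those that contain the destination address; the rule used is longest prefix match, with the default route sitting at prefix length 0. The following user-space sketch shows that selection over a flat array. Everything in it is hypothetical (the struct, the id field, the linear scan); the kernel uses a trie, and the lowest-metric tie-break here only loosely mirrors how default gateways are compared by priority.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical route entry; addresses are in host byte order. */
struct model_route {
    uint32_t prefix;     /* the subnet */
    unsigned prefix_len; /* 0..32; 0 is a default route */
    uint32_t metric;     /* lower is preferred */
    int id;              /* label used only to tell entries apart */
};

/* Netmask for a prefix length; a shift by 32 is undefined, so 0 is special. */
uint32_t model_prefix_mask(unsigned len)
{
    assert(len <= 32);
    return len == 0 ? 0 : 0xffffffffu << (32 - len);
}

/* Returns the most specific matching route, or NULL when nothing matches. */
const struct model_route *model_lpm_lookup(const struct model_route *routes,
                                           size_t n, uint32_t daddr)
{
    const struct model_route *best = NULL;

    for (size_t i = 0; i < n; i++) {
        const struct model_route *r = &routes[i];
        uint32_t mask = model_prefix_mask(r->prefix_len);

        if ((daddr & mask) != (r->prefix & mask))
            continue; /* destination is outside this subnet */
        if (best == NULL ||
            r->prefix_len > best->prefix_len ||
            (r->prefix_len == best->prefix_len && r->metric < best->metric))
            best = r;
    }
    return best;
}
```

A linear scan keeps the example short; it costs O(n) per lookup, which is exactly what a trie-based table avoids.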


Routing Table lookup algorithm
• LPM (Longest Prefix Match) is the lookup algorithm.
• The route with the longest netmask is the one chosen.
• Netmask 0, which is the shortest netmask, is for the default gateway.
  - What happens when there are multiple entries with netmask=0?
  - fib_lookup() returns the first entry it finds in the fib table where the netmask length is 0.
• It may be that this is not the best choice of default gateway.
• So in case the netmask is 0 (the prefixlen of the fib_result returned from fib_lookup() is 0) we call fib_select_default().
• fib_select_default() will select the route with the lowest priority (metric), by comparing the fib_priority values of all default gateways.

Receiving a packet
• When working in the interrupt driven model, the NIC registers an interrupt handler with the IRQ with which the device works by calling request_irq().
• This interrupt handler will be called when a frame is received.
• The same interrupt handler will be called when transmission of a frame is finished and under other conditions (it depends on the NIC; sometimes, the interrupt handler will be called when there is some error).
• Typically in the handler, we allocate an sk_buff by calling dev_alloc_skb(); also eth_type_trans() is called; among other things it advances the data pointer of the sk_buff to point to the IP header; this is done by calling skb_pull(skb, ETH_HLEN).
• See: net/ethernet/eth.c
  - ETH_HLEN is 14, the size of the ethernet header.
• The handler for receiving an IPv4 packet is ip_rcv(). (net/ipv4/ip_input.c)

• The handler for receiving an IPv6 packet is ipv6_rcv() (net/ipv6/ip6_input.c)
• Handlers for the protocols are registered at init phase.
  - Likewise, arp_rcv() is the handler for ARP packets.
• First, ip_rcv() performs some sanity checks. For example:
    if (iph->ihl < 5 || iph->version != 4)
        goto inhdr_error;
  - iph is the ip header; iph->ihl is the ip header length (4 bits).
  - The ip header must be at least 20 bytes.
  - It can be up to 60 bytes (when we use ip options).
• Then it calls ip_rcv_finish(), by:
    NF_HOOK(PF_INET, NF_IP_PRE_ROUTING, skb, dev, NULL, ip_rcv_finish);
• This division of methods into two stages (where the second has the same name with the suffix finish or slow) is typical for networking kernel code.
• In many cases the second method has a "slow" suffix instead of "finish"; this usually happens when the first method looks in some cache and the second method performs a lookup in a table, which is slower.
• ip_rcv_finish() implementation:
    if (skb->dst == NULL) {
        int err = ip_route_input(skb, iph->daddr, iph->saddr, iph->tos, skb->dev);
        ...
    }
    ...
    return dst_input(skb);
• ip_route_input(): First performs a lookup in the routing cache to see if there is a match. If there is no match (cache miss), it calls ip_route_input_slow() to perform a lookup in the routing table. (This lookup is done by calling fib_lookup()).
• fib_lookup(const struct flowi *flp, struct fib_result *res)
  The results are kept in fib_result.
• ip_route_input() returns 0 upon a successful lookup (also when there is a cache miss but a successful lookup in the routing table).
According to the results of fib_lookup(), we know if the frame is for local delivery, for forwarding, or to be dropped.
• If the frame is for local delivery, we will set the input() function pointer of the route to ip_local_deliver():
    rth->u.dst.input = ip_local_deliver;
• If the frame is to be forwarded, we will set the input() function pointer to ip_forward():
    rth->u.dst.input = ip_forward;

Local Delivery
• Prototype: ip_local_deliver(struct sk_buff *skb) (net/ipv4/ip_input.c).
  Calls NF_HOOK(PF_INET, NF_IP_LOCAL_IN, skb, skb->dev, NULL, ip_local_deliver_finish);
• Delivers the packet to the higher protocol layers according to its type.
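The header sanity check that ip_rcv() performs at the start of the receive path can be sketched in user space on a raw byte buffer. This is an illustrative model only: the function name is hypothetical, and the real ip_rcv() performs further checks (checksum, total length) that are left out here. The version sits in the high 4 bits of the first byte and ihl, counted in 32-bit words, in the low 4 bits, which is where the 20-byte minimum and 60-byte maximum come from.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Returns the IPv4 header length in bytes, or -1 when the buffer fails
 * the version/ihl sanity checks or is too short to hold the header.
 */
int model_ipv4_header_len(const uint8_t *pkt, size_t len)
{
    assert(pkt != NULL);

    if (len < 20)
        return -1; /* shorter than the minimal header */

    unsigned version = pkt[0] >> 4;
    unsigned ihl = pkt[0] & 0x0f; /* header length in 32-bit words */

    if (ihl < 5 || version != 4)
        return -1; /* same condition as the check in ip_rcv() */
    if (len < ihl * 4u)
        return -1; /* options claimed but not present in the buffer */

    return (int)(ihl * 4);
}
```

Because ihl is a 4-bit field, its largest value is 15, so the function never returns more than 60.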

Forwarding
• Prototype:
  - int ip_forward(struct sk_buff *skb) (net/ipv4/ip_forward.c)
  - Decreases the ttl in the ip header; if the ttl is <= 1, the method sends an ICMP message (ICMP_TIME_EXCEEDED) with ICMP_EXC_TTL ("TTL count exceeded"), and drops the packet.
  - Calls NF_HOOK(PF_INET, NF_IP_FORWARD, skb, skb->dev, rt->u.dst.dev, ip_forward_finish);

• ip_forward_finish(): sends the packet out by calling dst_output(skb).
• dst_output(skb) is just a wrapper, which calls skb->dst->output(skb). (see include/net/dst.h)
You can see the number of forwarded packets by "netstat -s | grep forwarded" or by cat /proc/net/snmp (IPv4) and cat /proc/net/snmp6 (IPv6), and look in the ForwDatagrams column (IPv4)/Ip6OutForwDatagrams (IPv6).

Sending a Packet
• We need to perform a routing lookup also in the case of transmission. There are cases when we perform two lookups, like in ipip tunnels.
• Handling of sending a packet is done by ip_route_output_key().
• In case of a cache miss, we call ip_route_output_slow(), which looks in the routing table (by calling fib_lookup(), as is also done in ip_route_input_slow()).
• If the packet is for a remote host, we set dst->output to ip_output()
• ip_output() will call ip_finish_output()
  - This is the NF_IP_POST_ROUTING point.
• ip_finish_output() will eventually send the packet from a neighbor by:
  - dst->neighbour->output(skb)
  - arp_bind_neighbour() sees to it that the L2 address of the next hop will be known. (net/ipv4/arp.c)
• If the packet is for the local machine:
  - dst->output = ip_output
  - dst->input = ip_local_deliver
  - ip_output() will send the packet on the loopback device.
  - Then we will go into ip_rcv() and ip_rcv_finish(), but this time dst is NOT null; so we will end in ip_local_deliver().
• See: net/ipv4/route.c

GRO
GRO stands for Generic Receive Offload.
In order to work with GRO:
- you must set NETIF_F_GRO in device features.
- you should call napi_gro_receive() from the RX path of the driver.
When NETIF_F_GRO is not set, napi_gro_receive() continues to the usual RX path, namely it calls netif_receive_skb(). This is done by returning GRO_NORMAL from dev_gro_receive(), and then calling netif_receive_skb() from napi_skb_finish() (see net/core/dev.c).
GRO replaces LRO (Large Receive Offload), as LRO was only for TCP in IPv4. LRO was removed from the network stack.
GRO works in conjunction with GSO (Generic Segmentation Offload).

Multipath routing
• This feature enables the administrator to set multiple next hops for a destination.
• To enable multipath routing, CONFIG_IP_ROUTE_MULTIPATH should be set when building the kernel.

• There was also an option for multipath caching (by setting CONFIG_IP_ROUTE_MULTIPATH_CACHED).
• It was experimental and was removed in 2.6.23. See links (6).

Multicast and Multicast routing
Internet Group Management Protocol, Version 2 (IGMPv2): RFC 2236
The Internet Group Management Protocol (IGMP) is used by IP hosts to report their multicast group memberships to any immediately-neighboring multicast routers.
In the linux kernel, IGMP for IPv4 is implemented in net/ipv4/igmp.c
There are three types of IGMP messages:
    Membership Query (Type: 0x11)
    Membership Report (Version 2) (Type: 0x16)
    Leave Group (Type: 0x17)
And one legacy message (for IGMPv1 backward compatibility):
    Membership Report (Version 1) (0x12)
To add a multicast address at the MAC level, you can use "ip maddr add". Note that "ip maddr add" expects a MAC address, not an IP address! So this is ok:
    ip maddr add 01:00:5e:01:01:25 dev eth0
but this is wrong (pay attention, you will not get any error message!):
    ip maddr add 224.1.2.3
You can also join a multicast group by setsockopt with IP_ADD_MEMBERSHIP; see for example: https://github.com/troglobit/toolbox/blob/master/mcjoin.c
All multicast addresses in MAC representation start with 01:00:5E according to IANA requirements.
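The 01:00:5E prefix comes from the standard IPv4-to-Ethernet multicast mapping (RFC 1112): the low 23 bits of the group address are copied into the low 23 bits of the MAC address. A user-space sketch of that mapping follows; the function name is hypothetical and the group address is taken in host byte order, which differs from the kernel helper's calling convention.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Map an IPv4 multicast group (host byte order) to its Ethernet
 * multicast MAC address: 01:00:5e followed by the low 23 bits.
 */
void model_ip_mc_to_mac(uint32_t group, uint8_t mac[6])
{
    assert((group >> 28) == 0xe); /* must be inside 224.0.0.0/4 */

    mac[0] = 0x01;
    mac[1] = 0x00;
    mac[2] = 0x5e;
    mac[3] = (group >> 16) & 0x7f; /* top bit of this octet is dropped */
    mac[4] = (group >> 8) & 0xff;
    mac[5] = group & 0xff;
}
```

Only 23 of the 28 variable group bits survive, so 32 different groups share each MAC address; for instance 224.1.1.37 and 239.129.1.37 both map to 01:00:5e:01:01:25.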

Multicast addresses are translated from IP notation to MAC address by a formula; see ip_eth_mc_map() in include/net/ip.h. This is needed, for example, in arp translation: arp_mc_map() in net/ipv4/arp.c.
The handler for multicast RX is ip_mr_input() in net/ipv4/ipmr.c.
The code which handles multicast routing is net/ipv4/ipmr.c for IPv4, and net/ipv6/ip6mr.c for IPv6.
In order to work with multicast routing, the kernel should be built with IP_MROUTE=y. You also need to work with multicast routing user space daemons, like pimd or xorp. (In the past there was a daemon called mrouted.)
Notice that the /proc/sys/net/ipv4/conf/all/mc_forwarding entry is a read only entry;
    ls -al /proc/sys/net/ipv4/conf/all/mc_forwarding
shows:
    -r--r--r-- 1 root root
However, starting a daemon like pimd changes its value to 1 (stopping the daemon changes it back to 0).

PIM stands for Protocol Independent Multicast.
see: http://en.wikipedia.org/wiki/Protocol_Independent_Multicast
pimd open source project: https://github.com/downloads/troglobit/pimd/pimd-2.1.8.tar.bz2
When pimd starts, the following happens:
It sends an IGMP V2 membership query. The membership query has a TTL of 1. The membership query is sent every 125 seconds (IGMP_QUERY_INTERVAL). This is done in the query_groups() method, pimd-2.1.8/igmp_proto.c.
pimd joins these two multicast groups:
    224.0.0.2 : The All Routers multicast group addresses all routers on the same network segment.
    224.0.0.13 : All PIM Routers.
This is done in the k_join() method of pimd-2.1.8/kern.c. Two membership reports are sent as a result. These membership reports also have a TTL of 1.
see the IPv4 Multicast Address Space Registry: http://www.iana.org/assignments/multicast-addresses/multicast-addresses.xml
pimd creates an IGMP socket.
pimd adds entries to the multicast cache (MFC). This is done by setsockopt with MRT_ADD_MFC, which invokes the ipmr_mfc_add() method in net/ipv4/ipmr.c
You can see entries and statistics of the multicast cache (MFC) by:
    cat /proc/net/ip_mr_cache
This patch (4.12.12) from Nicolas Dichtel enables advertising mfc stats via rtnetlink. This is done by adding a struct named rta_mfc_stats in include/uapi/linux/rtnetlink.h.
see: ipmr/ip6mr: advertise mfc stats via rtnetlink: http://permalink.gmane.org/gmane.linux.network/251481

Secondary addresses:
An address is considered "secondary" if it is included in the subnet of another address on the same interface.
Example:
    ip address add 192.168.0.1/24 dev p2p1
    ip address add 192.168.0.2/24 dev p2p1
    ip addr list dev p2p1
    3: p2p1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP qlen 1000
        link/ether 00:a1:b0:69:74:00 brd ff:ff:ff:ff:ff:ff
        inet 192.168.0.1/24 scope global p2p1
        inet 192.168.0.2/24 scope global secondary p2p1
        inet6 fe80::2a1:b0ff:fe69:7400/64 scope link
            valid_lft forever preferred_lft forever

IGMP snooping
IGMP snooping can be controlled through the sysfs interface. For brN, the settings can be found under /sys/devices/virtual/net/brN/bridge. For example, for:
    brctl addbr br0
    cat /sys/devices/virtual/net/br0/bridge/multicast_snooping
The multicast_disabled member of the net_bridge struct represents multicast_snooping.

rtnl_register()
rtnl_register() gets 3 callbacks as parameters: the doit, dumpit, and calcit callbacks.
We have two rtnl_register() calls with RTM_GETROUTE for the routing subsystem:
    rtnl_register(PF_INET, RTM_GETROUTE, inet_rtm_getroute, NULL, NULL) in net/ipv4/route.c
and
    rtnl_register(PF_INET, RTM_GETROUTE, NULL, inet_dump_fib, NULL); in net/ipv4/fib_frontend.c
They are called according to the type of userspace call:
    ip route get 192.168.1.1 is implemented via inet_rtm_getroute()
    ip route show is implemented via inet_dump_fib()
We also have callbacks for adding/deleting a route:
    rtnl_register(PF_INET, RTM_NEWROUTE, inet_rtm_newroute, NULL, NULL);
    rtnl_register(PF_INET, RTM_DELROUTE, inet_rtm_delroute, NULL, NULL);
in net/ipv4/fib_frontend.c

Note: In rtnetlink_net_init() we have:
    sk = netlink_kernel_create(net, NETLINK_ROUTE, &cfg);
rtnetlink_net_init() is called from netlink_proto_init(), net/netlink/af_netlink.c.
We also have, in net/ipv4/fib_frontend.c:
    netlink_kernel_create(net, NETLINK_FIB_LOOKUP, &cfg);
NETLINK_FIB_LOOKUP is not used by iproute2; it *is* used by libnl, in a utility named nl-fib-lookup, and also in other libnl code.
NETLINK_FIB_LOOKUP has one callback, named nl_fib_input(struct sk_buff *skb), which in fact eventually performs a fib lookup. You might wonder why the NETLINK_FIB_LOOKUP socket is needed if we have "ip route get", which uses the NETLINK_ROUTE socket and the RTM_GETROUTE message; the answer is that NETLINK_FIB_LOOKUP was added when adding the trie code, and it probably stayed as a legacy.
see: http://lists.openwall.net/netdev/2009/05/25/33

VRRP
VRRP stands for Virtual Router Redundancy Protocol.
http://en.wikipedia.org/wiki/Virtual_Router_Redundancy_Protocol
You can find a GPL licensed implementation of VRRP designed for Linux operating systems here: http://sourceforge.net/projects/vrrpd/
The vrrpd daemon is an implementation of VRRPv2 as specified in RFC 2338. It runs in userspace on linux.

xorp project: http://www.xorp.org/
fea is the Forwarding Engine Abstraction.
mfea is the Multicast Forwarding Engine Abstraction.
XORP git tree: https://github.com/greearb/xorp.ct.git



In case you download the xorp tar.gz and have a build problem, you might consider git cloning the XORP git tree and building by scons && scons install.

Netfilter 6 )et#ilter is the %ernel layer to support applying iptables rules. 7 &t enables: 6 Eiltering 6 Changing pac%ets 2mas0uerading3 6 Connection Trac%ing 6 see: http'//www.netfilter.or#/ Atables modules are pre#i$ed with $t' #or e$ample' net/netfilter/!t_2E342E56.c. Atables matches are always lowercase. Atables targets are always uppercase 2#or e$ample' $tCR?5&R?CT.c3 struct $tCtarget : de#ined in include/linu!/netfilter/!_tables.h Registering $tCtarget is done by $tCregisterCtarget23. Registering an array o# $tCtarget is done by $tCregisterCtargets23. see net/netfilter/!t_67289:.c 6 WBriting )et#ilter modulesW 29! pages pd#3 by ,an ?ngelhardt' )icolas =ouliane: http'//1en#elh.medo;as.de/documents/+etfilter_-odules.pdf Netfilter tables, Vou register/unregister a net#ilter table by ipt_register_table()/ipt_unregister_table()# we have the #ollowing 8 net#ilter tables in &/v6: nat table - has 6 chains:

    NF_INET_PRE_ROUTING
    NF_INET_POST_ROUTING
    NF_INET_LOCAL_OUT
    NF_INET_LOCAL_IN
see: net/ipv4/netfilter/iptable_nat.c
REDIRECT is a NAT table target; it is implemented in net/netfilter/xt_REDIRECT.c

mangle table - has 5 chains:
    NF_INET_PRE_ROUTING
    NF_INET_LOCAL_IN
    NF_INET_FORWARD
    NF_INET_LOCAL_OUT
    NF_INET_POST_ROUTING
see: net/ipv4/netfilter/iptable_mangle.c
TPROXY is a mangle table target; it is implemented in net/netfilter/xt_TPROXY.c

raw table - has 2 chains:
    NF_INET_PRE_ROUTING
    NF_INET_LOCAL_OUT
see: net/ipv4/netfilter/iptable_raw.c

filter table - has 3 chains:
    NF_INET_LOCAL_IN
    NF_INET_FORWARD
    NF_INET_LOCAL_OUT
see: net/ipv4/netfilter/iptable_filter.c
REJECT is an example of a filter table target. It is implemented in net/ipv4/netfilter/ipt_REJECT.c.
DROP is also a filter table target. With both DROP and REJECT we drop the packet. The difference is that with the REJECT target we also send an ICMP packet back (port-unreachable is the default).
You can set the ICMP type with --reject-with type: it can be icmp-net-unreachable, icmp-host-unreachable, icmp-port-unreachable, icmp-proto-unreachable, icmp-net-prohibited, icmp-host-prohibited or icmp-admin-prohibited.

security table - has 3 chains:
    NF_INET_LOCAL_IN
    NF_INET_FORWARD
    NF_INET_LOCAL_OUT
see: net/ipv4/netfilter/iptable_security.c
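The table-to-chain mapping listed above can be sketched as bitmasks of hook numbers. This is a stand-alone user-space sketch, not kernel code: the hook numbers follow include/uapi/linux/netfilter.h, while struct table_hooks, ipv4_tables, find_table() and count_chains() are hypothetical names introduced only for illustration. The masks simply restate the chains listed above.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hook numbers, as in include/uapi/linux/netfilter.h. */
enum nf_inet_hooks {
	NF_INET_PRE_ROUTING,
	NF_INET_LOCAL_IN,
	NF_INET_FORWARD,
	NF_INET_LOCAL_OUT,
	NF_INET_POST_ROUTING,
	NF_INET_NUMHOOKS
};

#define HOOK_BIT(h) (1u << (h))

/* Hypothetical illustration type: a table name and the hooks it attaches to. */
struct table_hooks {
	const char *name;
	unsigned int valid_hooks;
};

/* The five IPv4 tables and their chains, as listed in the text. */
static const struct table_hooks ipv4_tables[] = {
	{ "nat",      HOOK_BIT(NF_INET_PRE_ROUTING) | HOOK_BIT(NF_INET_POST_ROUTING) |
	              HOOK_BIT(NF_INET_LOCAL_OUT) | HOOK_BIT(NF_INET_LOCAL_IN) },
	{ "mangle",   HOOK_BIT(NF_INET_PRE_ROUTING) | HOOK_BIT(NF_INET_LOCAL_IN) |
	              HOOK_BIT(NF_INET_FORWARD) | HOOK_BIT(NF_INET_LOCAL_OUT) |
	              HOOK_BIT(NF_INET_POST_ROUTING) },
	{ "raw",      HOOK_BIT(NF_INET_PRE_ROUTING) | HOOK_BIT(NF_INET_LOCAL_OUT) },
	{ "filter",   HOOK_BIT(NF_INET_LOCAL_IN) | HOOK_BIT(NF_INET_FORWARD) |
	              HOOK_BIT(NF_INET_LOCAL_OUT) },
	{ "security", HOOK_BIT(NF_INET_LOCAL_IN) | HOOK_BIT(NF_INET_FORWARD) |
	              HOOK_BIT(NF_INET_LOCAL_OUT) },
};

/* Return the entry for the named table, or NULL if there is no such table. */
static const struct table_hooks *find_table(const char *name)
{
	size_t i;

	assert(name != NULL);
	for (i = 0; i < sizeof(ipv4_tables) / sizeof(ipv4_tables[0]); i++)
		if (strcmp(ipv4_tables[i].name, name) == 0)
			return &ipv4_tables[i];
	return NULL;
}

/* Count the chains (set bits) in a hook mask. */
static int count_chains(unsigned int mask)
{
	int n = 0;

	for (; mask != 0; mask >>= 1)
		n += (int)(mask & 1u);
	return n;
}
```

For example, count_chains(find_table("nat")->valid_hooks) gives the number of chains of the nat table.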

Xtables2 vs. nftables
http://lwn.net/Articles/531852/

Formal submission of Xtables2
http://lwn.net/Articles/531988/

Jan Engelhardt lecture about Xtables2: http://inai.de/documents/Love_for_blobs.pdf

Connection Tracking
A connection entry is represented by struct nf_conn; see include/net/netfilter/nf_conntrack.h.

Each connection tracking entry is kept until a certain timeout elapses. This timeout period is different for TCP, UDP and ICMP.
You can see the connection tracking entries by:
    cat /proc/net/nf_conntrack

SNAT and DNAT are implemented in net/netfilter/xt_nat.c
MASQUERADE is implemented in net/ipv4/netfilter/ipt_MASQUERADE.c

Traffic Control
The tc utility (from the iproute package) is used to configure Traffic Control in the Linux kernel.
There are three areas which Traffic Control handles:
    tc qdisc - queuing discipline. Implementation: in net/sched/sch_* files (for example, net/sched/sch_fifo.c).
    tc class - Implementation: also in net/sched/sch_* files.
    tc filter - Implementation: in net/sched/cls_* files.

Important structures:
struct Qdisc: declared in include/net/sch_generic.h. net_device has a Qdisc member (named qdisc).
struct Qdisc_ops: declared in include/net/sch_generic.h.

The noqueue_qdisc is an example of a Qdisc which is used in virtual devices. The noqueue_qdisc_ops is an example of Qdisc_ops (member of noqueue_qdisc). Both are defined in net/sched/sch_generic.c.
pfifo_fast is the default qdisc on all network interfaces.
Enqueueing/dequeueing is done by pfifo_fast_enqueue() and pfifo_fast_dequeue().
pfifo_fast is a classless queueing discipline, as opposed, for example, to CBQ or HTB, which are class-based queuing disciplines. We can easily determine from the qdisc declaration whether it is classless or class-based, by checking whether its Qdisc_ops sets a class operations member (cl_ops):
    cbq_qdisc_ops has cbq_class_ops; see net/sched/sch_cbq.c; it is a class-based qdisc.
    htb_qdisc_ops has htb_class_ops; see net/sched/sch_htb.c; it is a class-based qdisc.
    pfifo_fast_ops doesn't have class ops; see net/sched/sch_generic.c; it is a classless qdisc.
Note: Sometimes you will encounter "classful" terminology for class-based qdiscs. CBQ is Class Based Queuing.

There is a pfifo qdisc and a bfifo qdisc: net/sched/sch_fifo.c
The difference between pfifo and bfifo is that pfifo counts its queue limit in packets, while bfifo counts it in bytes.
The difference between pfifo_fast and pfifo/bfifo is that pfifo_fast has 3 bands, while pfifo/bfifo has 1 band. The number of bands is hard coded and cannot be changed (PFIFO_FAST_BANDS is defined as 3).
When having bands, we consider the TOS of the packet. The three queues in the pfifo_fast_priv struct (a member named "q") represent these three bands. Packets are put into bands according to their TOS, where band 0 has the highest priority.

Example: using HTB (Hierarchical Token Bucket)
    tc qdisc add dev p2p1 root handle 10: htb
This triggers invocation of tc_modify_qdisc() in net/sched/sch_api.c (handler of the RTM_NEWQDISC message, sent from user space).
    tc qdisc show dev p2p1
    qdisc htb 10: root refcnt 2 r2q 10 default 0 direct_packets_stat 0
Show statistics:
    tc -s qdisc show dev p2p1
    qdisc htb 10: root refcnt 2 r2q 10 default 0 direct_packets_stat 0
     Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
     backlog 0b 0p requeues 0

    tc class add dev p2p1 parent 10:0 classid 10:10 htb rate 5Mbit
This triggers invocation of tc_ctl_tclass() in net/sched/sch_api.c (handler of the RTM_NEWTCLASS message, sent from user space).
A class can be a parent class or a child class.
    tc class show dev p2p1
    class htb 10:10 root prio 0 rate 5000Kbit ceil 5000Kbit burst 1600b cburst 1600b
    tc -s class show dev p2p1
    class htb 10:10 root prio 0 rate 5000Kbit ceil 5000Kbit burst 1600b cburst 1600b
     Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
     rate 0bit 0pps backlog 0b 0p requeues 0
     lended: 0 borrowed: 0 giants: 0
     tokens: 40000 ctokens: 40000

TC filter
The main function of filters is to assign the incoming packets to classes for a qdisc. The classification of packets can be based on the IP address, port numbers, etc.
Two structures are important for filters: struct tcf_proto_ops and struct tcf_proto. Both are declared in include/net/sch_generic.h.
You register/unregister tcf_proto_ops with register_tcf_proto_ops()/unregister_tcf_proto_ops().
"tc filter add" will trigger invocation of tc_ctl_tfilter() in net/sched/cls_api.c.
The u32 filter is implemented in net/sched/cls_u32.c.
The route filter is implemented in net/sched/cls_route.c.
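The pfifo_fast band selection described earlier in this section can be sketched in user space. The kernel picks the band from skb->priority (which, for IPv4, is derived from the TOS) through a fixed lookup table; the table below mirrors prio2band in net/sched/sch_generic.c. The function name pfifo_fast_band() is a hypothetical stand-in for illustration, not a kernel function.

```c
#include <assert.h>

#define TC_PRIO_MAX      15  /* as in include/uapi/linux/pkt_sched.h */
#define PFIFO_FAST_BANDS 3

/* Mirrors the prio2band table in net/sched/sch_generic.c:
 * index is the packet priority, value is the band (0 = highest priority). */
static const unsigned char prio2band[TC_PRIO_MAX + 1] = {
	1, 2, 2, 2, 1, 2, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1
};

/* Hypothetical helper: map a packet priority to its pfifo_fast band. */
static int pfifo_fast_band(unsigned int skb_priority)
{
	int band = prio2band[skb_priority & TC_PRIO_MAX];

	assert(band < PFIFO_FAST_BANDS);
	return band;
}
```

With this table, best-effort traffic (priority 0) lands in band 1, bulk traffic (priority 2) in band 2, and interactive traffic (priority 6) in band 0, which is dequeued first.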

See also: Linux Advanced Routing & Traffic Control HOWTO: http://www.lartc.org/lartc.html

Transparent proxy:
The NETFILTER_XT_TARGET_TPROXY kernel config item should be set for Transparent proxy (TPROXY) target support.
The TPROXY target is somewhat similar to REDIRECT. It can only be used in the mangle table and is useful to redirect traffic to a transparent proxy. As opposed to REDIRECT, it does not depend on Netfilter connection tracking and NAT. It is implemented in net/netfilter/xt_TPROXY.c.

Port 3128 is the default port of squid; in /etc/squid/squid.conf, you can define a tproxy port; for example:
    http_port 3128 tproxy
Adding tproxy will trigger calling setsockopt() with IP_TRANSPARENT when starting the squid daemon. This in turn will set the transparent member of struct inet_sock.

An iptables rule to work with TPROXY can be, for example:
    iptables -t mangle -A PREROUTING -p tcp --dport 80 -j TPROXY --tproxy-mark 0x1/0x1 --on-port 3128
--tproxy-mark 0x1/0x1 is for setting skb->mark in the TPROXY module.

Remember that inet_sock is in fact a casting of the socket:
    struct inet_sock *inet = inet_sk(sk);
    ...
    static inline struct inet_sock *inet_sk(const struct sock *sk)
    {
        return (struct inet_sock *)sk;
    }

Netfilter Hooks
struct nf_hook_ops represents a netfilter hook.
Registration of a netfilter hook is done by nf_register_hook().
nf_register_hook() is implemented in net/netfilter/core.c
Netfilter rule example
• Short example: applying the following iptables rule:
    iptables -A INPUT -p udp --dport 9999 -j DROP
• This is an NF_IP_LOCAL_IN rule.
• The packet will go to:
    - ip_rcv()
    - and then: ip_rcv_finish()
    - and then: ip_local_deliver()
    - but it will NOT proceed to ip_local_deliver_finish() as in the usual case, without this rule.
• As a result of applying this rule, it reaches nf_hook_slow() with verdict == NF_DROP (which calls kfree_skb() to free the packet).
• See net/netfilter/core.c.
• Another example:
    iptables -t mangle -A PREROUTING -p udp --dport 9999 -j MARK --set-mark 5
    - Applying this rule will set skb->mark to 0x05 in ip_rcv_finish().

ICMP redirect message
• The ICMP protocol is used to notify about problems.
• A REDIRECT message is sent in case the route is suboptimal (inefficient).
• There are in fact 4 types of REDIRECT.
• Only one is used:
    - Redirect Host (ICMP_REDIR_HOST)
• See RFC 1812 (Requirements for IP Version 4 Routers).
• To support sending ICMP redirects, the machine should be configured to send redirect messages.
    - /proc/sys/net/ipv4/conf/all/send_redirects should be 1.
• In order that the other side will receive redirects, we should set /proc/sys/net/ipv4/conf/all/accept_redirects to 1.
• Example:
• Add a suboptimal route on 192.168.0.31:
    route add -net 192.168.0.10 netmask 255.255.255.255 gw 192.168.0.121
• Running "route" now on 192.168.0.31 will show a new entry:
    Destination     Gateway         Genmask          Flags Metric Ref Use Iface
    192.168.0.10    192.168.0.121   255.255.255.255  UGH   0      0   0   eth0
• Send packets from 192.168.0.31 to 192.168.0.10:
    ping 192.168.0.10 (from 192.168.0.31)
• We will see (on 192.168.0.31):
    From 192.168.0.121: icmp_seq=1 Redirect Host(New nexthop: 192.168.0.10)
• Now, running on 192.168.0.121:
    route -Cn | grep .10
  shows that there is a new entry in the routing cache:
    192.168.0.31 192.168.0.10 192.168.0.10 ri 0 0 34 eth0
• The "r" in the flags column means: RTCF_DOREDIRECT.
• The 192.168.0.121 machine had sent a redirect by calling ip_rt_send_redirect() from ip_forward() (net/ipv4/ip_forward.c).
• And on 192.168.0.31, running "route -C | grep .10" now shows a new entry in the routing cache (in case accept_redirects=1):
    192.168.0.31 192.168.0.10 192.168.0.10 0 0 eth0
• In case accept_redirects=0 (on 192.168.0.31), we will see:
    192.168.0.10 192.168.0.121 0 0 0 eth0
  which means that the gw is still 192.168.0.121 (which is the route that we added in the beginning).
• Adding an entry to the routing cache as a result of getting an ICMP REDIRECT is done in ip_rt_redirect(), net/ipv4/route.c.
• The entry in the routing table is not deleted.
Neighboring Subsystem
• Most known protocol: ARP (in IPv6: ND, neighbour discovery)
• ARP table.
• Ethernet header is 14 bytes long:
    - Source MAC address (6 bytes).
    - Destination MAC address (6 bytes).
    - Type (2 bytes).
• 0x0800 is the type for an IP packet (ETH_P_IP)
• 0x0806 is the type for an ARP packet (ETH_P_ARP)
• 0x8100 is the type for a VLAN packet (ETH_P_8021Q)
• See: include/linux/if_ether.h
• When there is no entry in the ARP cache for the destination IP address of a packet, a broadcast is sent (ARP request, ARPOP_REQUEST: who has IP address x.y.z...). This is done by a method called arp_solicit() (net/ipv4/arp.c).
• You can see the contents of the arp table by running "cat /proc/net/arp" or by running "arp" from a command line.
• You can delete and add entries to the arp table; see man arp.

Bridging Subsystem
The bridging implementation in Linux conforms to the IEEE 802.1d standard (which describes Bridging and Spanning tree). See http://en.wikipedia.org/wiki/IEEE_802.1D
• You can define a bridge and add NICs to it ("enslaving ports") using brctl (from bridge-utils).
bridge-utils is maintained by Stephen Hemminger. You can get the sources by:
    git clone git://git.kernel.org/pub/scm/linux/kernel/git/shemminger/bridge-utils.git
Building is simple: first run autoconf (in order to create the "configure" file). Then run make.
There are two important structures in the bridging subsystem:
struct net_bridge represents a bridge.
struct net_bridge_port represents a bridge port.
(Both are defined in net/bridge/br_private.h)
net_bridge has a hash table inside called "hash". It has 256 entries (BR_HASH_SIZE).
• You can have up to 1024 ports for every bridge device (BR_MAX_PORTS).
• Example:
    brctl addbr mybr        (create a bridge named "mybr")
    brctl addif mybr eth0   (add a port to a bridge)
    brctl show
    brctl delbr mybr        (delete the bridge named "mybr")

Note: You can see the fdb by:
    ./bridge/bridge fdb show
For example, after
    brctl addbr mybr
    brctl addif mybr p2p1
the output can be, for example:
    00:a1:b0:69:74:00 dev p2p1 permanent
Note: The "bridge" util is part of the iproute2 package. In case you don't have the "bridge" util, you can git clone it by:
    git clone git://git.kernel.org/pub/scm/linux/kernel/git/shemminger/iproute2.git

You cannot add a wireless device to a bridge. The following series will fail:
    brctl addbr mybr
    brctl addif mybr wlan0
    can't add wlan0 to bridge mybr: Operation not supported

You cannot add a loopback device to a bridge:
    brctl addbr mybr
    brctl addif mybr lo
    can't add lo to bridge mybr: Invalid argument

The reason: In br_add_if(), we check the priv_flags of the device, and in case IFF_DONT_BRIDGE is set, we return -EOPNOTSUPP (Operation not supported). In case of a wireless device, the cfg80211_netdev_notifier_call() method sets IFF_DONT_BRIDGE (see net/wireless/core.c).
TBD: Under which circumstances do we remove the IFF_DONT_BRIDGE flag in cfg80211_change_iface() in net/wireless/util.c?

• When a NIC is configured as a bridge port, the br_port member of net_device is initialized.
    - (br_port is an instance of struct net_bridge_port).
When a port is added to a bridge, we call netdev_rx_handler_register() to register a method for handling packets. This method is called br_handle_frame(). Each packet which is received by the bridge is handled by br_handle_frame(). See the br_add_if() method in net/bridge/br_if.c.
Besides the bridging interface, the macvlan interface and the bonding interface also invoke netdev_rx_handler_register(). In fact, what this method does is assign a method to the net_device rx_handler member, and assign rx_handler_data to the net_device rx_handler_data member. You cannot call netdev_rx_handler_register() twice on the same network device; this will return an error ("Device or resource busy", EBUSY). See drivers/net/macvlan.c and drivers/net/bonding/bond_main.c.

• In the past, when we received a frame, netif_receive_skb() called handle_bridge(). Now we call br_handle_frame(), via invoking rx_handler() (see __netif_receive_skb() in net/core/dev.c).
• The bridging forwarding database is searched for the destination MAC address.
• In case of a hit, the frame is sent to the bridge port with br_forward() (net/bridge/br_forward.c).
• If there is a miss, the frame is flooded on all bridge ports using br_flood() (net/bridge/br_forward.c).
• Note: this is not a broadcast!
• The ebtables mechanism is the L2 parallel of L3 Netfilter.
• Ebtables enable us to filter and mangle packets at the link layer (L2). The ebtables are implemented under net/bridge/netfilter.
• There are five points in the Linux bridging layer where we have the bridge hooks:
    NF_BR_PRE_ROUTING (br_handle_frame())
    NF_BR_LOCAL_IN (br_pass_frame_up()/br_handle_frame())
    NF_BR_FORWARD (__br_forward())
    NF_BR_LOCAL_OUT (__br_deliver())
    NF_BR_POST_ROUTING (br_forward_finish())

Open vSwitch
Open vSwitch is an open source project implementing a virtual switch: http://openvswitch.org/
The code is under net/openvswitch/
The maintainer is Jesse Gross. See also Documentation/networking/openvswitch.txt.

Network namespaces
A network namespace is logically another copy of the network stack, with its own routes, firewall rules, and network devices.
A network device belongs to exactly one network namespace. A socket belongs to exactly one network namespace.
A network namespace provides an isolated view of the networking stack:
    - network device interfaces
    - IPv4 and IPv6 protocol stacks
    - IP routing tables
    - firewall rules
    - /proc/net and /sys/class/net directory trees
    - sockets
    - more
A network namespace is implemented by struct net, include/net/net_namespace.h

By running:
    ip netns add netns_one
we create a file under /var/run/netns/ called netns_one. See: man ip netns
In order to show all of the named network namespaces, we run:
    ip netns list
Next you run:
    ip link add name if_one type veth peer name if_one_peer
    ip link set dev if_one_peer netns netns_one

Example for network namespaces usage:
Create two namespaces, called "myns1" and "myns2":
    ip netns add myns1
    ip netns add myns2
Assigning the p2p1 interface to the myns1 network namespace:
    ip link set p2p1 netns myns1
Now, running:
    ip netns exec myns1 bash
will transfer me to the myns1 network namespace; so if I run there:
    ifconfig -a
I will see p2p1. On the other hand, running
    ip netns exec myns2 bash
will transfer me to the myns2 network namespace; but if I run there:
    ifconfig -a
I will not see p2p1.
You move p2p1 back to the initial network namespace by:
    ip link set p2p1 netns 1
The 'netns' argument can be either a netns name or a process ID (pid). Providing a pid, you'll be moving the interface to the netns of the given process, which is the initial network namespace for pid = 1 (the init process).

There are some devices whose NETIF_F_NETNS_LOCAL flag is set, and they are considered local devices; we do not permit moving them to any other namespace. Among these devices are the loopback interface (lo), the bridge interface, the ppp interface, the GRE tunnel interface, the VXLAN interface, and more.
Trying to move an interface whose NETIF_F_NETNS_LOCAL flag is set to a different network namespace results in an "RTNETLINK answers: Invalid argument" error message from the dev_change_net_namespace() method (net/core/dev.c). Behind the scenes, dev_change_net_namespace() checks the NETIF_F_NETNS_LOCAL flag of the net device. If it is set, we will not permit changing the network namespace, and we will return EINVAL.

Under the hood, calling "ip netns exec" invokes two system calls from user space:
    the setns system call with CLONE_NEWNET (kernel/nsproxy.c)
    the unshare system call with CLONE_NEWNS (kernel/fork.c)
see netns_exec() in ip/ipnetns.c (iproute package)

Note: Currently there is an issue ("Device or resource busy" error) when trying to delete a namespace. The following series gives an error:
    ip netns add netns_one
    ip netns add netns_two
    ip link add name if_one type veth peer name if_one_peer
    ip link add name if_two type veth peer name if_two_peer
    ip link set dev if_one_peer netns netns_one
    ip link set dev if_two_peer netns netns_two
    ip netns exec netns_one bash
    # in other terminal:
    ip netns delete netns_two
    # => Cannot remove /var/run/netns/netns_two: Device or resource busy
See: http://permalink.gmane.org/gmane.linux.network/240875

In the future, there is an intention to add these commands to iproute2: "ip netns pids" and "ip netns identify".
see: http://www.spinics.net/lists/netdev/msg217958.html

There is a CLONE_NEWNET flag for clone() (since Linux 2.6.24): if CLONE_NEWNET is set, then the process is created in a new network namespace. If this flag is not set, then the process is created in the same network namespace as the calling process. This flag is intended for the implementation of containers.

Three lwn articles about namespaces:
"network namespaces" http://lwn.net/Articles/219794/
"PID namespaces in the 2.6.24 kernel" http://lwn.net/Articles/259217/
"Notes from a container" http://lwn.net/Articles/256389/

A new approach to user namespaces: Jonathan Corbet, April 2012
http://lwn.net/Articles/491310/
Checkpoint/restore mostly in the userspace:
http://lwn.net/Articles/451916/
Checkpoint and Restore: are we there yet? Lecture by Pavel Emelyanov
http://linux.conf.au/schedule/30116/view_talk?day=thursday
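The setns() step that "ip netns exec" performs, described earlier in this section, can be sketched as follows. This is a minimal user-space sketch, not the iproute code: enter_named_netns() is a hypothetical helper name, it only performs the setns() part (not the unshare() part), and it assumes the namespace was created with "ip netns add" so that a file exists under /var/run/netns/. Joining a namespace normally requires root privileges.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <sched.h>
#include <stdio.h>
#include <unistd.h>

/* Hypothetical helper: join the named network namespace.
 * Returns 0 on success and -1 on failure. */
static int enter_named_netns(const char *name)
{
	char path[256];
	int fd, ret;

	/* Named namespaces are represented by files under /var/run/netns/. */
	ret = snprintf(path, sizeof(path), "/var/run/netns/%s", name);
	if (ret < 0 || ret >= (int)sizeof(path))
		return -1;

	fd = open(path, O_RDONLY | O_CLOEXEC);
	if (fd < 0)
		return -1;

	/* Move the calling process into the network namespace behind fd. */
	ret = setns(fd, CLONE_NEWNET);
	close(fd);
	return ret == 0 ? 0 : -1;
}
```

After a successful call, network operations of the calling process (for example, listing interfaces) see only the devices of that namespace.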

TCP
TCP: RFC 793: http://www.ietf.org/rfc/rfc793.txt
TCP provides a connection-oriented service.
MSS = Maximum Segment Size
tcp_sendmsg() is the main handler in the TX path.
sk_state is the state of the TCP socket. In case it is not in TCPF_ESTABLISHED or TCPF_CLOSE_WAIT, we cannot send data.
Allocation of a new segment is done via sk_stream_alloc_skb().
helper: tcp_current_mss(): computes the current effective MSS.

Important structures:
struct tcp_sock:
    u32 snd_cwnd - the congestion sending window size.
    u8 ecn_flags - ECN status bits. ECN stands for Explicit Congestion Notification.
    The bits can be a combination of the following:
        TCP_ECN_OK
        TCP_ECN_QUEUE_CWR
        TCP_ECN_DEMAND_CWR
        TCP_ECN_SEEN

There is a configurable procfs tcp_ecn entry: /proc/sys/net/ipv4/tcp_ecn
Possible values are:
    0 Disable ECN. Neither initiate nor accept ECN.
    1 Always request ECN on outgoing connection attempts.
    2 Enable ECN when requested by incoming connections but do not request ECN on outgoing connections.
    Default: 2
see more in Documentation/networking/ip-sysctl.txt

You can change the TCP initcwnd thus:
    ip route change 192.168.1.101 via 192.168.1.10 dev em1 initcwnd 11
Then "ip route" will show that the action was performed:
    ...
    192.168.1.101 via 192.168.1.10 dev em1 initcwnd 11

tcp_v4_init_sock(): initialization of the TCP socket is done in net/ipv4/tcp_ipv4.c; it invokes tcp_init_sock(). The congestion sending window size is initialized to 10 (TCP_INIT_CWND).
tcp_v4_connect(): creates a TCP connection (net/ipv4/tcp_ipv4.c).
Each socket (struct sock instance) has a transmit queue named sk_write_queue.

From include/uapi/linux/tcp.h:
    struct tcphdr {
        __be16 source;
        __be16 dest;
        __be32 seq;
        __be32 ack_seq;
    #if defined(__LITTLE_ENDIAN_BITFIELD)
        __u16 res1:4,
              doff:4,
              fin:1,
              syn:1,
              rst:1,
              psh:1,
              ack:1,
              urg:1,
              ece:1,
              cwr:1;
    #elif defined(__BIG_ENDIAN_BITFIELD)
        __u16 doff:4,
              res1:4,
              cwr:1,
              ece:1,
              urg:1,
              ack:1,
              psh:1,
              rst:1,
              syn:1,
              fin:1;
    #else
    #error "Adjust your <asm/byteorder.h> defines"
    #endif
        __be16 window;
        __sum16 check;
        __be16 urg_ptr;
    };

TCP packet loss can be detected by two events:
    - a timeout
    - receiving duplicate ACKs.

When and why do we get "duplicate ACKs"?
According to RFC 2581, "TCP Congestion Control" (http://www.ietf.org/rfc/rfc2581.txt):
A TCP receiver SHOULD send an immediate duplicate ACK when an out-of-order segment arrives. The purpose of this ACK is to inform the sender that a segment was received out-of-order and which sequence number is expected.
see: Congestion Avoidance and Control
Van Jacobson, Lawrence Berkeley Laboratory
Michael J. Karels, University of California at Berkeley
ee.lbl.gov/papers/congavoid.pdf

TCP timers:
Keep Alive timer - implemented in tcp_keepalive_timer() in net/ipv4/tcp_timer.c
TCP retransmit timer - implemented in tcp_retransmit_timer() in net/ipv4/tcp_timer.c
RTO - retransmission timeout.
RTT - round trip time.

IPSEC
• Works at the network IP layer (L3)
• Used in many forms of secured networks like VPNs.
• Mandatory in IPv6 (not in IPv4).
• Implemented in many operating systems: Linux, Solaris, Windows, and more.
• RFC2401
• In the 2.6 kernel: implemented by Dave Miller and Alexey Kuznetsov.
• IPSec subsystem maintainers: Herbert Xu and David Miller. Steffen Klassert was added as a maintainer in October 2012.
see: http://marc.info/?t=135032283000003&r=1&w=2

IPSec git kernel repositories:
There are two git trees at kernel.org, an 'ipsec' tree that tracks the net tree and an 'ipsec-next' tree that tracks the net-next tree. They are located at
    git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec.git
    git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec-next.git

Two data structures are important for IPSec configuration: struct xfrm_state and struct xfrm_policy. Both are defined in include/net/xfrm.h.
We handle IPSec rules management (add/del/update actions, etc.) from user space by accessing methods in net/xfrm/xfrm_user.c.
For example, adding a policy is done by xfrm_add_policy(). This is done in response to getting an XFRM_MSG_NEWPOLICY message from userspace.
Deleting a policy is done by xfrm_get_policy() when receiving XFRM_MSG_DELPOLICY. xfrm_get_policy() also handles XFRM_MSG_GETPOLICY messages (which perform a lookup).

• Transformation bundles.
• Chain of dst entries; only the last one is for routing.
• User space tools: http://ipsec-tools.sf.net
• Openswan: http://www.openswan.org/ (Open Source project). Also strongSwan: http://www.strongswan.org/
• There are also non-IPSec solutions for VPN
    - example: pptp
• struct xfrm_policy has the following member:
    - struct dst_entry *bundles.
    - __xfrm4_bundle_create() creates dst_entries (with the DST_NOHASH flag); see: net/ipv4/xfrm4_policy.c
• Transport Mode and Tunnel Mode.
• Show the security policies:
    - ip xfrm policy show
• Show xfrm states:
    - ip xfrm state show
• Create RSA keys:
    - ipsec rsasigkey --verbose 2048 > keys.txt
    - ipsec showhostkey --left > left.publickey
    - ipsec showhostkey --right > right.publickey

Some IPSec links:
USAGI IPv6 IPsec Development for Linux
http://hiroshi1.hongo.wide.ad.jp/hiroshi/papers/SAINT2004_kandaipsec.pdf
Design and Implementation to Support Multiple Key Exchange Protocols for IPsec
http://ols.fedoraproject.org/OLS/Reprints-2006/miyazawa-reprint.pdf
"Linux IPv6 Networking" (there is a section about IPSec)
http://www.kernel.org/doc/ols/2003/ols2003-pages-507-523.pdf
Linux IPv6 Stack Implementation Based on Serialized Data State Processing
http://hiroshi1.hongo.wide.ad.jp/hiroshi/papers/yoshifuji_Mar2004.pdf

Example: Host to Host VPN (using openswan)
In /etc/ipsec.conf:
    conn linuxtolinux
        left=192.168.0.189
        leftnexthop=%direct
        leftrsasigkey=0sAQPPQ...
        right=192.168.0.45
        rightnexthop=%direct
        rightrsasigkey=0sAQNwb...
        type=tunnel
        auto=start
• service ipsec start (to start the service)
• ipsec verify
    - Checks your system to see if IPsec got installed and started correctly.
• ipsec auto --status
    - If you see "IPsec SA established", this implies success.
• Look for errors in /var/log/secure (fedora core) or in the kernel syslog.

Tips for hacking
• Documentation/networking/ip-sysctl.txt: networking kernel tunables
• Example of reading a hex address:
• iph->daddr == 0x0A00A8C0 means checking if the address is 192.168.0.10 (C0=192, A8=168, 00=0, 0A=10). The bytes read in reverse because daddr is kept in network byte order and the value is read on a little-endian machine.
• Disable ping reply:
    echo 1 >/proc/sys/net/ipv4/icmp_echo_ignore_all
• Disable arp: ip link set eth0 arp off (the NOARP flag will be set)
• Also ifconfig eth0 -arp has the same effect.
• How can you get the Path MTU to a destination (PMTU)?
    - Use tracepath (see man tracepath).
    - Tracepath is from iputils.
• Keep the iphdr struct handy (printout): (from linux/ip.h)
    struct iphdr {
        __u8 ihl:4,
             version:4;
        __u8 tos;
        __be16 tot_len;
        __be16 id;
        __be16 frag_off;
        __u8 ttl;
        __u8 protocol;
        __sum16 check;
        __be32 saddr;
        __be32 daddr;
        /* The options start here. */
    };
• NIPQUAD(): macro for printing IP addresses
• CONFIG_NET_DMA is for TCP/IP offload.

• When you encounter xfrm / CONFIG_XFRM, this has to do with IPsec (transformers).

New and future trends
• IO/AT.
• NetChannels (Van Jacobson and Evgeniy Polyakov).
• TCP Offloading.
• RDMA - Remote Direct Memory Access.
    - iWARP stands for: Internet Wide Area RDMA Protocol
    - Currently there are only two drivers in the kernel tree for NICs with RDMA support (rnics):
        1) drivers/infiniband/hw/amso1100
        2) drivers/infiniband/hw/cxgb3 - driver for the Chelsio T3 1GbE and 10GbE adapters.
The kernel maintainer of the INFINIBAND SUBSYSTEM is Roland Dreier.

Multiqueues: some new nics, like e1000 and IPW2200, allow two or more hardware Tx queues. Also with virtio, patches which support multiqueue were recently sent.
In case you want to override the kernel selection of the tx queue, you should implement the ndo_select_queue() member of the net_device_ops struct in your driver. For example, this is done in the ieee80211_dataif_ops struct in net/mac80211/iface.c:
    ...
    .ndo_select_queue = ieee80211_netdev_select_queue,
    ...
see Documentation/networking/multiqueue.txt and also Documentation/networking/scaling.txt

Managing multiple queues: affinity and other issues, Ben Hutchings - netconf 2011
vger.kernel.org/netconf2011_slides/bwh_netconf2011.pdf

In some drivers, the number of queues is passed as a module parameter: see, for example, drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c; num_queues is a module parameter (number of queues) in this driver.

You should also use alloc_etherdev_mq() in your network driver instead of alloc_etherdev().
• See: "Enabling Linux Network Support of Hardware Multiqueue Devices", OLS 2007.
• Some more info in: Documentation/networking/multiqueue.txt in recent Linux kernels.
See also the multiqueue networking presentation Dave Miller gave at the 5th Netfilter Workshop, September 11th-14th, 2007, Karlsruhe, Germany:
http://vger.kernel.org/~davem/multiqueue.odp
and also: "Multiqueue networking", article by Corbet:
http://lwn.net/Articles/289137/
• Devices with multiple TX/RX queues will have the NETIF_F_MULTI_QUEUE feature (include/linux/netdevice.h)
• Multiqueue nic drivers will call alloc_etherdev_mq() or alloc_netdev_mq() instead of alloc_etherdev() or alloc_netdev().
• We pass the setup method as a parameter to these methods; so, for example, with ethernet devices we pass ether_setup(); with wifi devices, we pass ieee80211_if_setup() (see ieee80211_if_add() in net/mac80211/iface.c).

lnstat tool
The lnstat tool is a powerful tool, part of the iproute2 package.
Examples of usage:
    lnstat -f rt_cache -k entries
shows the number of routing cache entries
    lnstat -f rt_cache -k in_hit
shows the number of routing cache hits

Misc:
In this section there are some topics on which I intend to add more info over time.

Fragmentation:
Fragmentation of outgoing packets:
When the length of the skb is larger than the MTU of the device from which the packet is transmitted, we perform fragmentation; this is done in the ip_fragment() method (net/ipv4/ip_output.c); in IPv6, it is done in ip6_fragment() in net/ipv6/ip6_output.c

Fragmentation can be done in two ways:
    - via a page array (called skb_shinfo(skb)->frags[]). There can be up to MAX_SKB_FRAGS entries; MAX_SKB_FRAGS is derived from the page size, see include/linux/skbuff.h.
    - via a list of SKBs (called skb_shinfo(skb)->frag_list). The method skb_has_frag_list() tests for this second way (this method was called skb_has_frags() in the past).

When creating a socket in user space, we can tell it not to support fragmentation. This is done, for example, in the tracepath util (part of iputils), with setsockopt() (the tracepath util finds the path MTU):
    ...
    int on = IP_PMTUDISC_DO;
    setsockopt(fd, SOL_IP, IP_MTU_DISCOVER, &on, sizeof(on));
    ...
In the kernel, ip_dont_fragment() checks the value of the pmtudisc field of the socket (struct inet_sock, which embeds the sock structure). In case pmtudisc equals IP_PMTUDISC_DO, we set the IP_DF (Don't Fragment) flag in the ip header by iph->frag_off = htons(IP_DF). See, for example, ip_build_and_send_pkt() in net/ipv4/ip_output.c

raw_sendmsg() and udp_sendmsg() use ip_append_data(), which uses the generic ip fragmentation method, ip_generic_getfrag(). An exception to this is udplite sockets, which use udplite_getfrag() for fragmentation.

Extracting the fragment offset and the fragment flags from the ip header:
The "frag_off" field (which is 16 bits in length) in the ip header represents the offset and the flags of the fragment.
    - The 13 low-order (rightmost) bits are the offset (the offset unit is 8 bytes).
    - The 3 high-order (leftmost) bits are the flags.
IP_OFFSET is 0x1FFF: a mask for getting the 13 offset bits (see #define IP_OFFSET 0x1FFF in ip.h).
So getting the offset and the flags from the ip header can be done thus:
    int offset, flags;
    offset = ntohs(ip_hdr(skb)->frag_off);
    flags = offset & ~IP_OFFSET;
    offset &= IP_OFFSET;
    offset <<= 3; /* offset is in 8-byte chunks */
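The decoding above can be tried in user space on a raw frag_off value. This is a minimal sketch rather than kernel code: struct frag_info and decode_frag_off() are hypothetical names introduced for illustration, and the three masks are redefined locally with the same values the kernel uses (IP_DF 0x4000, IP_MF 0x2000, IP_OFFSET 0x1FFF).

```c
#include <arpa/inet.h>
#include <stdint.h>

#define IP_DF     0x4000  /* Don't Fragment flag */
#define IP_MF     0x2000  /* More Fragments flag */
#define IP_OFFSET 0x1FFF  /* mask for the 13 low-order offset bits */

/* Hypothetical illustration type: the decoded contents of frag_off. */
struct frag_info {
	unsigned int offset_bytes;  /* fragment offset, converted to bytes */
	int dont_fragment;          /* 1 if IP_DF is set */
	int more_fragments;         /* 1 if IP_MF is set */
};

/* Decode a frag_off value given in network byte order, as found in the header. */
static struct frag_info decode_frag_off(uint16_t frag_off_be)
{
	uint16_t v = ntohs(frag_off_be);
	struct frag_info fi;

	fi.offset_bytes   = (unsigned int)(v & IP_OFFSET) << 3;  /* 8-byte units */
	fi.dont_fragment  = (v & IP_DF) != 0;
	fi.more_fragments = (v & IP_MF) != 0;
	return fi;
}
```

For example, a middle fragment that starts 1480 bytes into the original datagram carries IP_MF plus an offset field of 185 (185 * 8 = 1480).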

    - see, for example, ip_frag_queue() in net/ipv4/ip_fragment.c
Each fragment has the IP_MF flag ("More fragments") set, except for the last fragment.
The id field of the ip header is the same for all fragments.
If a fragment is not received at the other side after a predetermined time, an ICMP message is sent back; this is an ICMP_TIME_EXCEEDED with a "Fragment Reassembly Timeout exceeded" message (ICMP_EXC_FRAGTIME).
Notice that ICMP_TIME_EXCEEDED is also sent when the ttl reaches 0, in ip_forward(). But in that case, it is ICMP_EXC_TTL ("TTL count exceeded").
Setting the ip header id field ("identification") is very important for performing fragmentation; all fragments must have the same id so that the other side will be able to reassemble them. Assigning the id to the ip header is done by __ip_select_ident(); see net/ipv4/route.c.

Neighboring Subsystem
• Why do we need the neighboring subsystem?
• "The world is a jungle in general, and the networking game contributes many animals." (from RFC 826, ARP, 1982)
• Most known protocol: ARP (in IPv6: ND, neighbour discovery)
• Ethernet header is 14 bytes long:
    - Source MAC address and destination MAC address are 6 bytes each.
    - Type (2 bytes). For example (include/linux/if_ether.h):
• 0x0800 is the type for an IP packet (ETH_P_IP)
• 0x0806 is the type for an ARP packet (ETH_P_ARP)
• 0x8035 is the type for a RARP packet (ETH_P_RARP)

Neighboring Subsystem - struct neighbour
• A neighbour (instance of struct neighbour) is embedded in dst, which in turn is embedded in sk_buff.
• Implementation: important data structures
• struct neighbour (include/net/neighbour.h)
    - ha is the hardware address (MAC address when dealing with Ethernet) of the neighbour. This field is filled when an ARP response arrives.
    - primary_key: the IP address (L3) of the neighbour.
• Lookup in the arp table is done with the primary_key.
    - nud_state represents the Neighbor Unreachability Detection state of the neighbour (for example, NUD_REACHABLE).
• int (*output)(struct sk_buff *skb);
    - output() can be assigned to different methods according to the state of the neighbour. For example, neigh_resolve_output() and neigh_connected_output(). Initially, it is neigh_blackhole().
    - When the state changes, the output function may also be assigned to a different function.
• refcnt: incremented by neigh_hold(); decremented by neigh_release(). We don't free a neighbour when the refcnt is higher than 1; instead, we set dead (a member of neighbour) to 1.
• timer (the callback method is neigh_timer_handler()).
• struct hh_cache *hh (defined in include/linux/netdevice.h)
• confirmed: confirmation timestamp.
    - Confirmation can also be done from L4 (transport layer).
    - For example, dst_confirm() calls neigh_confirm().
    - dst_confirm() is called from tcp_ack() (net/ipv4/tcp_input.c)

      (net/ipv4/tcp_input.c), by udp_sendmsg() (net/ipv4/udp.c), and more.
    • neigh_confirm() does NOT change the state; that is the job of neigh_timer_handler().
• dev (net_device)
• arp_queue: every neighbour has a small ARP queue of its own.
  - There can be only 3 elements by default in an arp_queue.
  - This is configurable: /proc/sys/net/ipv4/neigh/default/unres_qlen

struct neigh_table

• struct neigh_table represents a neighboring table (include/net/neighbour.h).
  - The arp table (arp_tbl) is a neigh_table (include/net/arp.h).
  - In IPv6, nd_tbl (Neighbor Discovery table) is also a neigh_table (include/net/ndisc.h).
  - There are also dn_neigh_table (DECnet) (net/decnet/dn_neigh.c) and clip_tbl (for ATM) (net/atm/clip.c).
  - gc_timer: neigh_periodic_timer() is the callback for garbage collection.
  - neigh_periodic_timer() deletes FAILED entries from the ARP table.

Neighboring Subsystem - arp

• When there is no entry in the ARP cache for the destination IP address of a packet, a broadcast is sent (ARP request, ARPOP_REQUEST: who has IP address x.y.z...). This is done by a method called arp_solicit() (net/ipv4/arp.c).
  - In IPv6, the parallel mechanism is called ND (Neighbor Discovery) and is implemented as part of ICMPv6.
  - A multicast is sent in IPv6 (and not a broadcast).
• If there is no answer in time to this ARP request, we end up sending back an ICMP error (Destination Host Unreachable).
• This is done by arp_error_report(), which indirectly calls ipv4_link_failure(); see net/ipv4/route.c.
• You can see the contents of the arp table by running "cat /proc/net/arp" or by running "arp" from a command line.
• You can view statistics of the arp cache (IPv4) by:
  cat /proc/net/stat/arp_cache
• You can view statistics of the ndisc cache (IPv6) by:
  cat /proc/net/stat/ndisc_cache
• "ip neigh show" is the new method to show the arp table (from IPROUTE2).
  In IPv6 it is "ip -6 neigh show".

• You can delete and add entries to the arp table; see man arp / man ip.
• When using "ip neigh add" you can specify the state of the entry you are adding (permanent, stale, reachable, etc.).
• The arp command does not show reachability states, except for the incomplete state and the permanent state. Permanent entries are marked with M in the Flags column.

Example: arp output

Address      HWtype  HWaddress          Flags Mask  Iface
10.0.0.2             (incomplete)                   eth0
10.0.0.3     ether   00:01:02:03:04:05  CM          eth0
10.0.0.138   ether   00:20:8F:0C:68:03  C           eth0

Neighboring Subsystem - ip neigh show

• We can see the current neighbour states.
• Example:

ip neigh show
192.168.0.254 dev eth0 lladdr 00:03:27:f1:a1:31 REACHABLE
192.168.0.152 dev eth0 lladdr 00:00:00:cc:bb:aa STALE
192.168.0.12 dev eth0 lladdr 00:10:18:1b:1c:14 PERMANENT
192.168.0.54 dev eth0 lladdr aa:ab:ac:ad:ae:af STALE
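The Flags letters printed by arp correspond to bits in the hexadecimal Flags column of /proc/net/arp. The following is a minimal sketch of that mapping, not the arp tool's actual source: the EX_ATF_* names are local copies of the ATF_COM, ATF_PERM and ATF_PUBL values from <linux/if_arp.h>, and arp_flags_to_letters() is a hypothetical helper introduced only for illustration.

```c
#include <assert.h>
#include <string.h>

/* Flag bits as found in the Flags column of /proc/net/arp.
 * The values mirror ATF_COM, ATF_PERM and ATF_PUBL from <linux/if_arp.h>;
 * they are redefined here with an EX_ prefix so the sketch is self-contained. */
#define EX_ATF_COM  0x02  /* completed entry (hardware address is valid) */
#define EX_ATF_PERM 0x04  /* permanent entry */
#define EX_ATF_PUBL 0x08  /* published (proxy) entry */

/* Hypothetical helper: render the flag letters C, M and P.
 * `out` must have room for at least 4 bytes. Returns `out`. */
const char *arp_flags_to_letters(unsigned int flags, char *out)
{
    size_t n = 0;

    if (flags & EX_ATF_COM)
        out[n++] = 'C';
    if (flags & EX_ATF_PERM)
        out[n++] = 'M';
    if (flags & EX_ATF_PUBL)
        out[n++] = 'P';
    out[n] = '\0';
    return out;
}
```

Under these assumptions, a /proc/net/arp entry with Flags 0x6 is rendered as "CM" (complete and permanent), and an incomplete entry with Flags 0x0 yields an empty string.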

• arp_process() handles both ARP requests and ARP responses (net/ipv4/arp.c).
  - If the target IP (tip) address in the ARP header is a loopback address, arp_process() drops the packet, since loopback does not need ARP:

    ...
    if (LOOPBACK(tip) || MULTICAST(tip))
        goto out;
    ...
    out:
    ...
    kfree_skb(skb);
    return 0;

    (see: #define LOOPBACK(x) (((x) & htonl(0xff000000)) == htonl(0x7f000000)) in linux/in.h)
• If it is an ARP request (ARPOP_REQUEST) we call ip_route_input().
• Why?
• In case it is for us (RTN_LOCAL), we send an ARP reply.
  - arp_send(ARPOP_REPLY, ETH_P_ARP, sip, dev, tip, sha, dev->dev_addr, sha);
  - We also update our arp table with the sender entry (ip/mac).
• Special case: ARP proxy server.
• In case we receive an ARP reply (ARPOP_REPLY):
  - We perform a lookup in the arp table (by calling __neigh_lookup()).
  - If we find an entry, we update the arp table by neigh_update().
• If there is no entry and there is NO support for unsolicited ARP, we don't create an entry in the arp table.
  - Support for unsolicited ARP is enabled by setting /proc/sys/net/ipv4/conf/all/arp_accept to 1.
  - The corresponding macro is: IPV4_DEVCONF_ALL(ARP_ACCEPT)
  - In older kernels, support for unsolicited ARP was done by CONFIG_IP_ACCEPT_UNSOLICITED_ARP.

Neighboring Subsystem - lookup

• Lookup in the neighboring subsystem is done via neigh_lookup(). Parameters:
  - neigh_table (arp_tbl)
  - pkey (IP address, the primary_key of the neighbour struct)
  - dev (net_device)
• There are wrappers:
  - __neigh_lookup(): just one more parameter, creat (a flag: whether to create a neighbour by neigh_create() or not)
  - __neigh_lookup_errno()

Neighboring Subsystem - static entries

• Adding a static entry is done by: arp -s ipAddress MacAddress
• Alternatively, this can be done by: ip neigh add ipAddress dev eth0 lladdr MacAddress nud permanent
• The state (nud_state) of this entry will be NUD_PERMANENT; ip neigh show will show it as PERMANENT.
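The nud_state field holds bit flags, which is why the kernel can test a neighbour against a whole group of states with a single mask. The sketch below illustrates this with the NUD_VALID and NUD_CONNECTED combinations; the EX_ constants are local copies of the NUD_* values from the kernel's include/uapi/linux/neighbour.h, and the three helper functions are hypothetical names used only here.

```c
#include <assert.h>
#include <stdbool.h>

/* Local copies of the kernel NUD_* state bits (EX_ prefix = this sketch only). */
#define EX_NUD_INCOMPLETE 0x01
#define EX_NUD_REACHABLE  0x02
#define EX_NUD_STALE      0x04
#define EX_NUD_DELAY      0x08
#define EX_NUD_PROBE      0x10
#define EX_NUD_FAILED     0x20
#define EX_NUD_NOARP      0x40
#define EX_NUD_PERMANENT  0x80
#define EX_NUD_NONE       0x00

/* Combination masks, following the NUD_VALID / NUD_CONNECTED definitions. */
#define EX_NUD_VALID     (EX_NUD_PERMANENT | EX_NUD_NOARP | EX_NUD_REACHABLE | \
                          EX_NUD_PROBE | EX_NUD_STALE | EX_NUD_DELAY)
#define EX_NUD_CONNECTED (EX_NUD_PERMANENT | EX_NUD_NOARP | EX_NUD_REACHABLE)

/* Hypothetical helper: true if the state is one of the "valid" states. */
bool ex_nud_is_valid(unsigned int state)
{
    return (state & EX_NUD_VALID) != 0;
}

/* Hypothetical helper: true if the state is one of the "connected" states. */
bool ex_nud_is_connected(unsigned int state)
{
    return (state & EX_NUD_CONNECTED) != 0;
}

/* Hypothetical helper: NOARP and PERMANENT never move to another state. */
bool ex_nud_is_static(unsigned int state)
{
    return (state & (EX_NUD_NOARP | EX_NUD_PERMANENT)) != 0;
}
```

With these definitions a PERMANENT entry is both valid and connected, a STALE entry is valid but not connected, and INCOMPLETE or FAILED entries are neither.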

• Why do we need PERMANENT entries?

arp_bind_neighbour() method

• Suppose we are sending a packet to a host for the first time.
• A dst_entry is added to the routing cache by rt_intern_hash().
• We should know the L2 address of that host.
  - So rt_intern_hash() calls arp_bind_neighbour().
    • Only for RTN_UNICAST (not for multicast/broadcast).
  - arp_bind_neighbour(): net/ipv4/arp.c
  - dst->neighbour is NULL, so it calls __neigh_lookup_errno().
  - There is no such entry in the arp table.
  - So we create a neighbour with neigh_create() and add it to the arp table.
• neigh_create() creates a neighbour in the NUD_NONE state.
  - Setting nud_state to NUD_NONE is done in neigh_alloc().

The IFF_NOARP flag

• Disabling and enabling arp:
  - ifconfig eth1 -arp
    You will now see the NOARP flag in ifconfig -a.
  - ifconfig eth1 arp (to enable arp on the device)
• In fact, this sets or clears the IFF_NOARP flag of net_device.
• There are cases where the interface has the IFF_NOARP flag by default (for example, a ppp interface; see ppp_setup() in drivers/net/ppp_generic.c).

Changing IP address

• Suppose we try to set eth1 to an IP address of a different machine on the LAN.
• First, we set an IP for eth1 in (in Fedora Core 8, for example) /etc/sysconfig/network-scripts/ifcfg-eth1:

  ...
  IPADDR=192.168.0.122
  ...

  and then run:

  ifup eth1

• We will get:

  Error, some other host already uses address 192.168.0.122.

• But:

  ifconfig eth0 192.168.0.124

  works ok!


• Why is it so?

Duplicate Address Detection (DAD)

• Duplicate Address Detection mode (DAD):

  arping -I eth0 -D 192.168.0.10

  - This sends a broadcast packet whose source address is 0.0.0.0. 0.0.0.0 is not a valid IP address (for example, you cannot set an IP address to 0.0.0.0 with ifconfig).
• The MAC address of the sender is the real one.
• The -D flag is for Duplicate Address Detection mode.

Code (from arp_process(); see net/ipv4/arp.c):

    /* Special case: IPv4 duplicate address detection packet (RFC2131) */
    if (sip == 0) {
        if (arp->ar_op == htons(ARPOP_REQUEST) &&
            inet_addr_type(tip) == RTN_LOCAL &&
            !arp_ignore(in_dev, dev, sip, tip))
            arp_send(ARPOP_REPLY, ETH_P_ARP, tip, dev, tip, sha,
                     dev->dev_addr, dev->dev_addr);
        goto out;
    }

Neighboring Subsystem - Garbage Collection

• Garbage collection:
  - neigh_periodic_timer()
  - neigh_timer_handler()
  - neigh_periodic_timer() removes entries which are in the NUD_FAILED state. This is done by setting dead to 1 and calling neigh_release(). The refcnt must be 1 to ensure no one else uses this neighbour. Expired entries are also removed.
• NUD_FAILED entries don't have a MAC address (see ip neigh show).

Neighboring Subsystem - Asynchronous Garbage Collection

• neigh_forced_gc() performs asynchronous garbage collection.
• It is called from neigh_alloc() when the number of entries in the arp table exceeds a (configurable) limit.
• This limit is configurable (gc_thresh2, gc_thresh3):
  /proc/sys/net/ipv4/neigh/default/gc_thresh2
  /proc/sys/net/ipv4/neigh/default/gc_thresh3
  - The default for gc_thresh3 is 1024.
• Candidates for cleanup: entries whose reference count is 1 and whose state is NOT permanent.
• Changing the neighbour state is done only in neigh_timer_handler().

LVS (Linux Virtual Server)

• http://www.linuxvirtualserver.org/
• Integrated into the Linux kernel (in the 2.4 kernel it was a patch).
• Located in net/netfilter/ipvs in the kernel tree.
• LVS has eight scheduling algorithms.
• LVS/DR is LVS with direct routing (a load balancing solution).
• ipvsadm is the user space management tool (available in most distros).
• Direct Routing is the packet forwarding method.
  - -g, --gatewaying => Use gatewaying (direct routing)
  - See man ipvsadm.

LVS/DR

• Example: 3 Real Servers and the Director all have the same Virtual IP (VIP).
• There is an ARP problem in this configuration.
• When you send an ARP broadcast and the receiving machine has two or more NICs, each of them responds to this ARP request. Example: a machine with two NICs, where eth0 is 192.168.0.15 and eth1 is 192.168.0.152.

LVS and ARP

• Solutions:

1) Set ARP_IGNORE to 1:

   echo "1" > /proc/sys/net/ipv4/conf/eth0/arp_ignore
   echo "1" > /proc/sys/net/ipv4/conf/eth1/arp_ignore

2) Use arptables.
  - There are 3 points in the arp walkthrough (include/linux/netfilter_arp.h):
    • NF_ARP_IN (in arp_rcv(), net/ipv4/arp.c)
    • NF_ARP_OUT (in arp_xmit(), net/ipv4/arp.c)
    • NF_ARP_FORWARD (in br_nf_forward_arp(), net/bridge/br_netfilter.c)
• http://ebtables.sourceforge.net/download.html
  - Ebtables is in fact the parallel of netfilter, but in L2.

LVS example (ipvsadm)

• An example of setting up LVS/DR on TCP port 80 with three real servers:

  ipvsadm -C    // clear the LVS table
  ipvsadm -A -t DirectorIPAddress:80
  ipvsadm -a -t DirectorIPAddress:80 -r RealServer1 -g
  ipvsadm -a -t DirectorIPAddress:80 -r RealServer2 -g
  ipvsadm -a -t DirectorIPAddress:80 -r RealServer3 -g

• This example deals with TCP connections (for UDP connections we should use -u instead of -t in the last 3 lines).

LVS example:

• ipvsadm -Ln    // list the LVS table
• /proc/sys/net/ipv4/ip_forward should be set to 1.
• In this example, packets sent to the VIP will be sent to the load balancer; it will delegate them to a real server according to its scheduler. The destination MAC address in the L2 header will be the MAC address of the real server to which the packet is sent. The destination IP address in the IP header will be the VIP.
• This is done with NF_IP_LOCAL_IN.

ARPD - arp user space daemon

• ARPD is a user space daemon; it can be used if we want to remove some work from the kernel.
• The user space daemon is part of iproute2 (misc/arpd.c).
• ARPD has support for negative entries and for dead hosts.
  - The kernel arp code does NOT support these types of entries!
• The kernel by default is not compiled with ARPD support; we should set CONFIG_ARPD for using it:
  - Networking Support -> Networking Options -> IP: ARP daemon support.
• See: /usr/share/doc/iproute-2.6.22/arpd.ps (Alexey Kuznetsov).
• We should also set app_probes to a value greater than 0 by setting /proc/sys/net/ipv4/neigh/eth0/app_solicit.
  - This can also be done by the -a (active_probes) parameter.
  - The value of this parameter tells how many ARP requests to send before the neighbour is considered dead.
• The -k parameter tells the kernel not to send ARP broadcasts; in that case, the arpd daemon not only listens to ARP requests, but also sends ARP broadcasts.
• Activation:

  arpd -a 1 -k eth0 &

• On some distros, you will get the error "db_open: No such file or directory" unless you first run mkdir /var/lib/arpd/ (for the arpd.db file).
• Pay attention: you can start the arpd daemon even when there is no support in the kernel (CONFIG_ARPD is not set).
• In this case, ARP packets are still caught by the arpd daemon, in get_arp_pkt() (misc/arpd.c).
• But you don't get messages from the kernel: get_kern_msg() is not called (misc/arpd.c).
• Tip: to check whether CONFIG_ARPD is set, simply see if there are any results from:

  cat /proc/kallsyms | grep neigh_app

MAC addresses

• MAC address (Media Access Control).
• According to the specs, a MAC address should be unique.
• The first 3 bytes specify the hardware manufacturer of the card.
• They are allocated by the IEEE. There are exceptions to this rule.
  - Ethernet HWaddr 00:16:3E:3F:6E:5D

ARPwatch (detect ARP cache poisoning)

• A change of MAC address can be the result of a security attack (ARP cache poisoning).
• Arpwatch can help detect such an attack.
• Activation: arpwatch -d -i eth0 (output to stderr)
• Arpwatch keeps a table of ip/mac addresses and senses when there is a change.
• -d is for redirecting the log to stderr (no syslog, no mail).
• In case someone changed a MAC address on the same network, you will get a message like this:

ARPwatch Example

From: root (Arpwatch)
To: root
Subject: changed ethernet address (jupiter)
hostname: jupiter
ip address: 192.168.0.54
ethernet address: aa:bb:cc:dd:ee:ff
ethernet vendor: <unknown>
old ethernet address: 0:20:18:61:e5:e0
old ethernet vendor: ...

Neighbour states

[Figure: neighbour state diagram showing neigh_alloc() and the states None, Incomplete, Reachable, Stale, Delay and Probe]

Neighboring Subsystem

• States:
  - NUD_NONE
  - NUD_REACHABLE
  - NUD_STALE
  - NUD_DELAY
  - NUD_PROBE
  - NUD_FAILED
  - NUD_INCOMPLETE
• Special states:
  - NUD_NOARP
  - NUD_PERMANENT
• No state transitions are allowed from these special states to another state.

Neighboring Subsystem - states

• NUD state combinations:

  - NUD_IN_TIMER (NUD_INCOMPLETE | NUD_REACHABLE | NUD_DELAY | NUD_PROBE)
  - NUD_VALID (NUD_PERMANENT | NUD_NOARP | NUD_REACHABLE | NUD_PROBE | NUD_STALE | NUD_DELAY)
  - NUD_CONNECTED (NUD_PERMANENT | NUD_NOARP | NUD_REACHABLE)
• When a neighbour is in the STALE state it remains in this state until one of two things occurs:
  - a packet is sent to this neighbour;
  - its state changes to FAILED.
• neigh_resolve_output() and neigh_connected_output() (net/core/neighbour.c):
  - A neighbour in the INCOMPLETE state does not have its MAC address set yet (the ha member of neighbour).
  - So when neigh_resolve_output() is called, the neighbour state is changed to INCOMPLETE.
  - When neigh_connected_output() is called, the MAC address of the neighbour is known, so we end up calling dev_queue_xmit(), which calls the ndo_start_xmit() callback method of the NIC device driver.
  - The ndo_start_xmit() method actually puts the frame on the wire.

Change of IP address/MAC address

• A change of IP address does not trigger notifying the neighbours.
• A change of MAC address (NETDEV_CHANGEADDR) also does not trigger notifying the neighbours.
• It does update the local arp table by neigh_changeaddr().
  - An exception to this is irlan eth: irlan_eth_send_gratuitous_arp() (net/irda/irlan/irlan_eth.c).
  - Some NICs don't permit changing the MAC address; you get: SIOCSIFHWADDR: Device or resource busy.

Flushing the arp table

• Flushing the arp table:

  ip -statistics neigh flush dev eth0
  *** Round 1, deleting 7 entries ***
  *** Flush is complete after 1 round ***

• Specifying -statistics twice will also show which entries were deleted, their MAC addresses, etc.:

  ip -statistics -statistics neigh flush dev eth0
  192.168.0.254 lladdr 00:04:27:fd:ad:30 ref 7 used 0/0/0 REACHABLE

  *** Round 1, deleting 1 entries ***
  *** Flush is complete after 1 round ***

• The flush calls neigh_delete() in net/core/neighbour.c.
• It changes the state to NUD_FAILED.

Virtual network devices

The tx_queue_len of virtual devices is usually 0, as they do not hold a queue of their own; so, for example, if you create a vlan with vconfig or a bridge with brctl, ifconfig will show that tx_queue_len is 0.

See br_dev_setup() in net/bridge/br_device.c:

    ...
    dev->tx_queue_len = 0;
    ...

and vlan_setup() in net/8021q/vlan_dev.c:

    ...
    dev->tx_queue_len = 0;
    ...

and bond_setup() in drivers/net/bonding/bond_main.c:

    ...
    bond_dev->tx_queue_len = 0;
    ...

and macvlan_setup() in drivers/net/macvlan.c:

    ...
    dev->tx_queue_len = 0;
    ...

and vxlan_setup() in drivers/net/vxlan.c:

    ...
    dev->tx_queue_len = 0;
    ...

With the pimreg (multicast) device, tx_queue_len is not initialized at all; so when running ifconfig on a pimreg device, you get:

    txqueuelen 0 (UNSPEC)

Notice that for virtual devices, like loopback and vlan, the qdisc is the noqueue qdisc. So, for example, when running "ip addr show" you will see for the loopback device:

    1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN

and for a vlan device (eth0.6 in this example):

    eth0.6@eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state

whereas for other, non-virtual devices, you will have the pfifo_fast qdisc.

Some more implementation details about how this is achieved: in attach_one_default_qdisc() (net/sched/sch_generic.c) we have this code:

92/178

    static void attach_one_default_qdisc(struct net_device *dev,
                                         struct netdev_queue *dev_queue,
                                         void *_unused)
    {
        struct Qdisc *qdisc = &noqueue_qdisc;

        if (dev->tx_queue_len) {
            qdisc = qdisc_create_dflt(dev_queue, &pfifo_fast_ops, TC_H_ROOT);
            if (!qdisc) {
                netdev_info(dev, "activation failed\n");
                return;
            }
        }
        dev_queue->qdisc_sleeping = qdisc;
    }

So when dev->tx_queue_len is 0, as is the case with virtual devices, we use the noqueue_qdisc and do not call qdisc_create_dflt().

Another feature of virtual devices is that they appear under /sys/devices/virtual/net. So, for example, after boot we have a /sys/devices/virtual/net/lo/ entry for the loopback device.

The entries under /sys/devices/virtual/net for virtual network devices are created neither because the tx_queue_len of virtual devices is 0 nor because of the noqueue_qdisc of virtual devices. The reason they are created is that with virtual devices we do not call the SET_NETDEV_DEV() macro. If you look at this simple macro, which should always be called before register_netdev(), you'll see that all it does is assign the parent member in net_device.

What does this have to do with the virtual entry under sysfs? The answer is that devices which have no parent are considered "virtual" class devices. If you look at the implementation details, you will see that register_netdev() calls device_add() in netdev_register_kobject(), and device_add() (in drivers/base/core) creates an entry under /sys/devices/virtual/net for a device whose parent is null (see the get_device_parent() method, which is invoked from device_add()).

So, for example, in the case of creating a tun device (which is a virtual device) by:

    ip tuntap add tun0 mode tun

You will have an entry under:
    /sys/devices/virtual/net/tun0/

And when creating a tap device (which is also a virtual device) by:

    ip tuntap add tap0 mode tap

You will have an entry under:

    /sys/devices/virtual/net/tap0/

We remove the tuntap devices by:

    ip tuntap del tap0 mode tap
    ip tuntap del tun0 mode tun

Tunnels

What is the difference between an ipip tunnel and a gre tunnel? A gre tunnel supports multicasting, whereas an ipip tunnel supports only unicast.

MTU

MTU stands for Maximum Transmission Unit (or sometimes also Maximum Transfer Unit). MTU is symmetrical and applies both to receive and transmit. Layer 3 should not pass an skb whose payload is bigger than the MTU. GSO and TSO are exceptions; in such cases, the device will separate the packet into smaller packets, which are smaller than the MTU.

Multicasting

struct net_device holds two lists of addresses (instances of struct netdev_hw_addr_list):
• uc is the unicast MAC addresses list
• mc is the multicast MAC addresses list

You add multicast addresses to the multicast MAC addresses list (mc), both in IPv4 and IPv6, by dev_mc_add() (in net/core/dev_addr_lists.c).

In IPv4, a device adds the 224.0.0.1 multicast address (IGMP_ALL_HOSTS, see include/linux/igmp.h) in ip_mc_up() (see net/ipv4/igmp.c).

GSO

For implementing GSO, a method called gso_segment was added to the net_protocol struct in IPv4 (see include/net/protocol.h). For TCP, this method is tcp_tso_segment() (see tcp_protocol in net/ipv4/af_inet.c).

There are drivers which implement TSO; for example, Intel's e1000e.

A member called gso_size was added to skb_shared_info. A helper method called skb_is_gso() was also added; this method checks whether the gso_size of skb_shared_info is 0 or not (it returns true when gso_size is not 0).

Grouping net devices

An interesting patch from Vlad Dogaru (January 2011) added support for network device groups. This was done by adding a member called "group" to struct net_device, and an API to set this group from the kernel (dev_set_group()) and from user space. By default, all network devices are assigned to the default group, group 0 (INIT_NETDEV_GROUP); see alloc_netdev_mqs() in net/core/dev.c.

ethtool

EEE (Energy Efficient Ethernet) support was recently added to struct ethtool_ops, in the form of a new struct called ethtool_eee (added in include/linux/ethtool.h) and two methods, get_eee() and set_eee().

IP address
In IPv4, when you set an IP address, you in fact assign it to ifa->ifa_local (ifa is a pointer to struct in_ifaddr).

When running "ifconfig" or "ip addr show", you in fact issue an SIOCGIFADDR ioctl, for getting the interface address, which is handled by

struct in_device (include/linux/inetdevice.h) has a list, ifa_list, which is the IP ifaddr chain. ifa_local is a member of struct in_ifaddr and represents the IPv4 address.

IPv6

In IPv6, the neighboring subsystem uses ICMPv6 for neighbour messages (instead of ARP messages in IPv4).

• There are 5 ICMPv6 message types for neighbour discovery:
  - NEIGHBOUR SOLICITATION (135), parallel to ARP request in IPv4
  - NEIGHBOUR ADVERTISEMENT (136), parallel to ARP reply in IPv4
  - ROUTER SOLICITATION (133)
  - ROUTER ADVERTISEMENT (134)
  - REDIRECT (137)

Special addresses:

• All nodes (or: all hosts) address: FF02::1
  - ipv6_addr_all_nodes() sets the address to FF02::1
• All routers address: FF02::2
  - ipv6_addr_all_routers() sets the address to FF02::2
• Both are in include/net/addrconf.h.
• In IPv6: all addresses starting with FF are multicast addresses.
• In IPv4: addresses in the range 224.0.0.0 - 239.255.255.255 are multicast addresses (class D).

Privacy Extensions
• Since the address is built using a prefix and the MAC address, the identity of the machine can be found.
• To avoid this, you can use Privacy Extensions.
  - This adds randomness to the IPv6 address creation process (calling get_random_bytes(), for example).
• RFC 3041: Privacy Extensions for Stateless Address Autoconfiguration in IPv6.
• You need CONFIG_IPV6_PRIVACY to be set when building the kernel.

Hosts can disable receiving Router Advertisements by setting

Autoconfiguration

• When a host boots (and its cable is connected), it first creates a Link Local address.
  - A Link Local address starts with FE80.
  - This address is tentative (it only works with ND messages).
• The host sends a Neighbour Solicitation message.
  - The target is its tentative address; the source is all zeros.
  - This is DAD (Duplicate Address Detection).
• If there is no answer in due time, the state is changed to permanent (IFA_F_PERMANENT).
• Then the host sends a Router Solicitation.
  - The target address of the Router Solicitation message is the All Routers multicast address, FF02::2.
  - All the routers reply with a Router Advertisement message.
  - The host sets its address/addresses according to the prefix/prefixes received and starts the DAD process as before.
• At the end of the process, the host will have two (or more) IPv6 addresses:
  - The Link Local IPv6 address.
  - The IPv6 address/addresses built using the prefix (in case there are one or more routers sending RAs).
• There are three attempts by default for sending a Router Solicitation.
  - This can be configured by: /proc/sys/net/ipv6/conf/eth0/router_solicitations

VLAN (802.1Q)
VLAN (Virtual LAN) enables us to partition a physical network. Thus, different broadcast domains are created. This is achieved by inserting a VLAN tag into the packet.

The VLAN tag is 4 bytes: 2 bytes are the Tag Protocol Identifier (TPID), which has a value of 0x8100; 2 bytes are the Tag Control Identifier (TCI). (In Linux documentation, TCI is termed "tag control information"; see vlan_tci in the sk_buff struct, include/linux/skbuff.h.)

The VLAN tag is inserted between the source MAC address and the ethertype of the Ethernet header. The vlan_insert_tag() method implements this tag insertion (include/linux/if_vlan.h).

struct vlan_ethhdr represents the vlan Ethernet header (ethhdr + vlan_hdr). h_vlan_proto in this struct always gets the value 0x8100. h_vlan_TCI in this struct is the TCI, composed from the priority and the VLAN ID.

vlan_insert_tag() is invoked from the vlan rx handler, vlan_do_receive() (see include/linux/if_vlan.h).

VLAN support in Linux is under net/8021q. There is also the macvlan driver (drivers/net/macvlan.c).
The header file for vlan is include/linux/if_vlan.h.
The header file for macvlan is include/linux/if_macvlan.h.
The maintainer of vlan is Patrick McHardy.

VLAN supports almost everything a regular Ethernet interface does, including firewalling, bridging, and of course IP traffic. You will need the 'vconfig' tool from the VLAN project in order to effectively use VLANs.

In Fedora, there is a package ("rpm") called vconfig; you install it by "yum install vconfig". In Ubuntu, vconfig belongs to a package named "vlan"; you install it by "apt-get install vlan".

You can also set vlan/macvlan with the "vconfig" utility thus:
    vconfig add p2p1 vlanID

Notice that you can add up to 4094 VLANs per Ethernet interface. In case you try to add more than 4094, you will get this error:

    ERROR: trying to add VLAN #vlanID to IF -:p2p1:- error: Numerical result out of range

According to http://en.wikipedia.org/wiki/IEEE_802.1Q: "The hexadecimal values of 0x000 and 0xFFF are reserved."

You can also set vlan/macvlan with the "ip" utility:

    ip link add link p2p1 name p2p1.100 type vlan id 5
    ip link add link p2p1 name p2p1#101 address 00:aa:bb:cc:dd:ee type macvlan

You can get some info about vlan devices in procfs under /proc/net/vlan; /proc/net/vlan/config includes info about the vlan id.

See more info on the VLAN web page: http://www.candelatech.com/~greear/vlan.html

VLAN traffic has the type 0x8100 (ETH_P_8021Q).

For network devices which do not support VLAN TX HW acceleration (the NETIF_F_HW_VLAN_TX flag is not set), we insert the VLAN tag by calling __vlan_put_tag() in dev_hard_start_xmit(). __vlan_put_tag() is a wrapper which calls vlan_insert_tag() (both are in include/linux/if_vlan.h). In vlan_insert_tag() the mac_header pointer (skb->mac_header) is decremented by 4 (VLAN_HLEN) and we insert the vlan tag where needed. Also, skb->protocol is set to 802.1Q (ETH_P_8021Q).

An example of such a driver without VLAN TX HW acceleration support is the RealTek 8139too driver: drivers/net/ethernet/realtek/8139too.c.
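To make the tag layout concrete, here is a userspace sketch of inserting a 4-byte VLAN tag into a raw Ethernet frame held in a plain byte buffer. It is an illustration under stated assumptions, not the kernel's vlan_insert_tag(): the kernel works on an sk_buff and moves the two MAC addresses towards the head, while this sketch shifts the tail of the buffer, which produces the same byte layout. All ex_ and EX_ names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define EX_ETH_ALEN     6       /* bytes in a MAC address */
#define EX_VLAN_HLEN    4       /* bytes in a VLAN tag (TPID + TCI) */
#define EX_ETH_P_8021Q  0x8100  /* TPID value */

/* Build a TCI from a 3-bit priority and a 12-bit VLAN ID.
 * The bit between them (DEI/CFI) is left at 0. */
uint16_t ex_vlan_make_tci(unsigned int prio, unsigned int vid)
{
    return (uint16_t)(((prio & 0x7u) << 13) | (vid & 0x0fffu));
}

/* Insert a VLAN tag into `frame`, which holds `len` valid bytes and has
 * room for `cap` bytes. Returns the new length, or 0 if the frame is too
 * short to contain an Ethernet header or the buffer has no room. */
size_t ex_vlan_insert_tag(uint8_t *frame, size_t len, size_t cap, uint16_t tci)
{
    const size_t off = 2 * EX_ETH_ALEN;  /* tag goes right after dst + src MAC */

    if (len < off + 2 || len + EX_VLAN_HLEN > cap)
        return 0;

    /* Shift the original ethertype and payload 4 bytes towards the tail. */
    memmove(frame + off + EX_VLAN_HLEN, frame + off, len - off);

    /* Write TPID and TCI in network byte order. */
    frame[off]     = (uint8_t)(EX_ETH_P_8021Q >> 8);
    frame[off + 1] = (uint8_t)(EX_ETH_P_8021Q & 0xff);
    frame[off + 2] = (uint8_t)(tci >> 8);
    frame[off + 3] = (uint8_t)(tci & 0xff);

    return len + EX_VLAN_HLEN;
}
```

After the call, bytes 12-13 of the frame hold 0x8100, bytes 14-15 hold the TCI, and the original ethertype follows at bytes 16-17.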

A VLAN interface is a virtual device (the netdevice tx_queue_len is set to 0).

In case VLAN is compiled as a kernel module, its name is 8021q.ko.

Adding/deleting vlans is done via ioctls which are sent from user space; for example, adding a vlan is triggered by receiving the ADD_VLAN_CMD ioctl from user space. This triggers the register_vlan_device() method.

As said above, you cannot add more than 4094 vlans to a single Ethernet device. In the beginning of register_vlan_device() we have:

    if (vlan_id >= VLAN_VID_MASK)
        return -ERANGE;

And VLAN_VID_MASK is 0x0fff (4095). When returning -ERANGE, we get the error mentioned above:

    error: Numerical result out of range

Deleting a vlan is done by receiving the DEL_VLAN_CMD ioctl from user space. This triggers the unregister_vlan_dev() method.

These ioctls are defined in include/uapi/linux/if_vlan.h (once they were defined in include/linux/if_vlan.h). The handler for these ioctls is vlan_ioctl_handler() in net/8021q/vlan.c.

By default, Ethernet header reorders are turned off (the VLAN_FLAG_REORDER_HDR flag is not set). When Ethernet header reorders are set, dumping the device will show it as a common Ethernet device without vlans.

VLAN private device data is represented by struct vlan_dev_priv (net/8021q/vlan.h). It has two arrays in it: egress_priority_map and ingress_priority_map.

We add entries to the egress_priority_map array by vlan_dev_set_egress_priority(). This is triggered by sending the SET_VLAN_EGRESS_PRIORITY_CMD ioctl from user space (vconfig set_egress_map).

We add entries to the ingress_priority_map array by vlan_dev_set_ingress_priority(). This is triggered by sending the SET_VLAN_INGRESS_PRIORITY_CMD ioctl from user space (vconfig set_ingress_map).

You can enable vlan reordering with vconfig thus:

    vconfig set_flag eth0.100 1 1

And you can view the reordering flag thus:

    cat /proc/net/vlan/eth0.100

You can disable vlan reordering with vconfig thus:

    vconfig set_flag eth0.100 1 0

Note that the man page/help of some distros may not be accurate about this. It says

    set_flag [vlan-device] 0 | 1

and it should be:

    set_flag [vlan-device] [flag-num] 0 | 1

See, for example: https://bugzilla.redhat.com/show_bug.cgi?id=468813

Helper methods:

• int is_vlan_dev(struct net_device *dev): checks whether the device is a vlan device, by checking the priv_flags of net_device. Defined in include/linux/if_vlan.h.
• bool vlan_uses_dev(const struct net_device *dev): checks whether the device is used by vlan (by checking whether the vlan_info member of the device is null or not).
• vlan_tx_tag_present(__skb): checks whether the VLAN_TAG_PRESENT flag is set. Defined in include/linux/if_vlan.h.

When we encounter packets with a vlan tag in the RX path, the VLAN packets are handled by vlan_do_receive(), which is invoked from __netif_receive_skb(). vlan_do_receive() is implemented in net/8021q/vlan_core.c.
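On the receive side, recognising a tagged frame comes down to checking for the 0x8100 TPID and splitting the TCI. The sketch below does this on a raw byte buffer in user space. It is not the code of vlan_do_receive(), which works on an sk_buff; struct ex_vlan_info and ex_vlan_parse() are hypothetical names used only for this illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Fields recovered from a tagged frame (hypothetical struct for this sketch). */
struct ex_vlan_info {
    uint16_t vid;          /* 12-bit VLAN ID */
    uint8_t  prio;         /* 3-bit priority */
    uint16_t inner_proto;  /* ethertype that follows the tag */
};

/* Parse the VLAN tag of an Ethernet frame. Returns false if the frame is
 * too short or does not carry the 0x8100 TPID right after the MAC addresses. */
bool ex_vlan_parse(const uint8_t *frame, size_t len, struct ex_vlan_info *out)
{
    uint16_t tpid, tci;

    /* 12 bytes of MAC addresses + 4 bytes of tag + 2 bytes of ethertype */
    if (len < 18)
        return false;

    tpid = (uint16_t)((frame[12] << 8) | frame[13]);
    if (tpid != 0x8100)
        return false;

    tci = (uint16_t)((frame[14] << 8) | frame[15]);
    out->vid = (uint16_t)(tci & 0x0fff);
    out->prio = (uint8_t)(tci >> 13);
    out->inner_proto = (uint16_t)((frame[16] << 8) | frame[17]);
    return true;
}
```

For a frame whose tag bytes are 81 00 60 64 followed by 08 00, this yields VLAN ID 100, priority 3 and an inner ethertype of 0x0800 (IP).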
There are some adapters which support VLAN hardware acceleration offloading. You can get info about VLAN hardware acceleration offloading with ethtool:

    ethtool -k p2p1
    ...
    rx-vlan-offload: on
    tx-vlan-offload: on
    ...

Bonding Driver (Link aggregation)

The bonding network driver is for putting multiple physical Ethernet devices into one logical one, which is often termed link aggregation/trunking/link bundling/Ethernet/network/NIC bonding (these terms can be considered synonyms).

The new generation of the bonding driver is called teaming. It also has a user space part called libteam. See also the Teaming driver section.

ifenslave is an iputils package. You can set link aggregation with ifenslave as in the following example:

    modprobe bonding mode=balance-alb miimon=100
    ifconfig bond0 192.168.1.1
    ifenslave bond0 eth0
    ifenslave bond0 eth1

You can set a vlan device over a bonding interface. For example, on the bond0 you created, you configure a vlan thus:

    vconfig add bond0 100

If you try to configure a vlan on an empty bonding device (before enslaving at least one interface to it), you will get an error:

    #> vconfig add bond0 100
    ERROR: trying to add VLAN #100 to IF -:bond0:- error: Operation not supported.

How is this implemented? An empty bonding device has NETIF_F_VLAN_CHALLENGED set. In vlan_check_real_dev(), which is invoked from register_vlan_device() when configuring a VLAN over a device, we check the NETIF_F_VLAN_CHALLENGED flag of the device on which we are setting the VLAN. If this flag is set, we return -EOPNOTSUPP:

    int vlan_check_real_dev(struct net_device *real_dev, u16 vlan_id)
    {
        ...
        if (real_dev->features & NETIF_F_VLAN_CHALLENGED) {
            pr_info("VLANs not supported on %s\n", name);
            return -EOPNOTSUPP;
        }
        ...
    }

The maintainers of the bonding driver are Jay Vosburgh and Andy Gospodarek.
In the kernel, the bonding code is in drivers/net/bonding.

Teaming network device

Location: drivers/net/team

The teaming network device is in fact the new bonding driver. It is for putting multiple physical Ethernet devices into one logical one, which is often termed link aggregation/trunking/link bundling/Ethernet/network/NIC bonding
(these terms can be considered synonyms).

Team also has a user-space util, libteam.

The team driver registers an RX handler by netdev_rx_handler_register(). The handler is team_handle_frame(). This is common in virtual drivers; the bonding driver also registers an RX handler, named bond_handle_frame(), and the bridge driver registers a handler named br_handle_frame(). These handlers are invoked in __netif_receive_skb() (net/core/dev.c).

Adding/deleting a team device is done by:

    ip link add name team0 type team
    ip link del team0

"ip link add name team0 type team" triggers a call to team_newlink(), which is one of the rtnl_link_ops callbacks.

When you add a team device thus, the hw address is random, generated by eth_hw_addr_random(). In case you want to specify a hw address when creating the team device, you can do it thus, for example:

    ip link add name team0 address 00:11:22:33:44:55 type team

Notice that "type team" should be at the end. Trying:

    ip link add name team0 type team address 00:11:22:33:44:55

will fail with this error:

    Garbage instead of arguments "address ...". Try "ip link help"

You can notice that the team rtnl_link_ops has a newlink callback (team_newlink) but does not have a dellink callback. So how is unregistering of team0 done in this case? The answer is simple, and applies also to other devices which do not set the dellink callback in rtnl_link_ops: when registering a device, in case we did not define dellink in the rtnl_link_ops, we assign the generic unregister_netdevice_queue() method to the dellink callback of rtnl_link_ops. And when running "ip link del team0", we arrive at rtnl_dellink(), which eventually calls unregister_netdevice_queue() and unregisters the net_device. See, in net/core/rtnetlink.c:

    int __rtnl_link_register(struct rtnl_link_ops *ops)
    {
        if (!ops->dellink)
            ops->dellink = unregister_netdevice_queue;
        ...
        return 0;
    }

Adding p2p1:

    ip link set p2p1 master team0

"ip link set p2p1 master team0" triggers a call to team_port_add().

Removing p2p1:

    ip link set p2p1 nomaster

"ip link set p2p1 nomaster" triggers a call to team_port_del(). (In fact, this is done via invoking the ndo_del_slave() member of net_device_ops in do_set_master() of rtnetlink, net/core/rtnetlink.c.)

Notice that p2p1 must be down for this operation to succeed; in case it is up, you will get an "RTNETLINK answers: Device or resource busy" error.

Trying to add a loopback device to a team device will fail. For example,

    ip link set lo master team0

emits this error in the kernel log:

    team0: Device lo is loopback device. Loopback devices can't be added as a team port

There are four modules (or "modes", which is the word the team code uses) in the team driver:

team_mode_broadcast.c
The broadcast mode is a basic mode in which all packets are sent via all available ports.

team_mode_roundrobin.c
The roundrobin mode is a basic mode with a very simple transmit port-selecting algorithm based on looping around the port list. This is the only mode able to run on its own without userspace interactions.

team_mode_activebackup.c
The activebackup mode, in which only one port is active at a time and able to perform transmit and receive of skbs. The rest of the ports are backup ports. This mode exposes the activeport option, through which a userspace application can specify the active port.

team_mode_loadbalance.c
The loadbalance mode is a more complex mode used, for example, for LACP (Link Aggregation Control Protocol) and userspace controlled transmit and receive load balancing. The LACP protocol is part of the 802.3ad standard and is very common for smart switches.

team_mode_register()/team_mode_unregister() is the API for registering/unregistering a mode. A mode can register options via team_options_register(). Only two modes use the options mechanism. One is team_mode_activebackup, and the second is team_mode_loadbalance.

The teaming network driver uses the Generic Netlink API; it calls genl_register_family(), genl_register_mc_group() and other methods of the Generic Netlink API.

In Fedora 16/17 there is an rpm for the user-space util (libteam).

Team Infrastructure Specification: https://fedorahosted.org/libteam/wiki/InfrastructureSpecification
See also:
https://fedorahosted.org/libteam/
https://github.com/jpirko/libteam
Jiri Pirko presentation: http://www.pirko.cz/teamdev.pp.pdf

The maintainer of the teaming driver is Jiri Pirko.

PPP

The most commonly used user space daemon for ppp is pppd. It can be downloaded from here: ftp://ftp.samba.org/pub/ppp/
The pppd website is: http://ppp.samba.org/
In case you need to use pppoe in conjunction with ppp, you should install rp-pppoe: http://www.roaringpenguin.com/products/pppoe
ppp settings are configurable via /etc/ppp.

The generic ppp layer is implemented in ppp_generic.c (drivers/net/ppp/ppp_generic.c). PPPoE and PPPoL2TP use the generic ppp layer. You register a ppp generic channel by calling the ppp_register_net_channel() method of ppp_generic. This is done in pppoe_connect() (drivers/net/ppp/pppoe.c) and in pppol2tp_connect() (net/l2tp/l2tp_ppp.c). These two modules also call ppp_input() for handling reception of PPP packets over the ppp channel.

Unregistering is done by the ppp_unregister_channel() method of ppp_generic. pppox_unbind_sock() (drivers/net/ppp/pppox.c) calls ppp_unregister_channel(). For pppoe, pppox_unbind_sock() is invoked when a PPPoE socket is closed (pppoe_release() in http://lxr.free-electrons.com/source/drivers/net/ppp/pppoe.c). For l2tp_ppp, pppox_unbind_sock() is invoked by pppol2tp_session_close() and pppol2tp_release().

PPPoE

PPPoE stands for Point-to-Point Protocol over Ethernet. It is defined in RFC 2516: http://www.ietf.org/rfc/rfc2516.txt
PPPoE is implemented in pppoe.c (drivers/net/ppp/pppoe.c).

For establishing a PPPoE connection, there are two stages: the Discovery stage and the Session stage. The Discovery stage consists of four steps between the client computer and the PPPoE server (access concentrator) at the ISP:

1) PADI (Initiation)
2) PADO (Offer)
3) PADR (Request)
4) PADS (Session confirmation)

The Discovery stage is managed by the pppd daemon. You end a session by sending a PADT packet (termination packet). Discovery stage packets have an ethertype of 0x8863 (ETH_P_PPP_DISC, defined in include/uapi/linux/if_ether.h). Session stage packets have an ethertype of 0x8864 (ETH_P_PPP_SES, also defined in include/uapi/linux/if_ether.h).

SKB RECYCLE

skb_recycle was a Linux kernel network stack feature which was removed. When we no longer need an skb, we free its memory by calling (for example) __kfree_skb(). The skb_recycle patch was based mainly on adding code in __kfree_skb(), so that the skb would not be freed. Instead, the members of the skb were reinitialized, so that the result was the same as a new skb which was just created.

See: "generic skb recycling" - a patch by Lennert Buytenhek: http://lwn.net/Articles/332037/

On 5.10.12 a patch was sent to netdev by Eric Dumazet, titled "net: remove skb recycling"; this patch was applied. See:
http://marc.info/?l=linux-netdev&m=134945424730580&w=
http://marc.info/?l=linux-netdev&m=134958489526234&w=

According to this patch, since the skb recycling feature got little interest and had many bugs, it was suggested to remove it. skb_recycle was used in only 5 ethernet drivers: calxeda/xgmac.c, freescale/gianfar.c, freescale/ucc_geth.c, marvell/mv643xx_eth.c and stmicro/stmmac/stmmac_main.c.

TUN/TAP

TUN/TAP provides packet reception and transmission for user space programs. It can be seen as a simple Point-to-Point or Ethernet device which, instead of receiving packets from physical media, receives them from a user space program, and instead of sending packets via physical media, writes them to the user space program.

TUN/TAP is a driver which enables us to receive packets from user space and send packets to user space. TUN/TAP is different from other virtual devices in that it does not rely on real devices for its work; it is a purely software driver which works with user space sockets. The implementation is in drivers/net/tun.c.

The tun driver has two net_device_ops instances:
tap_netdev_ops for tap devices.
tun_netdev_ops for tun devices.

The tun device is /dev/net/tun; it is a character device, created with misc_register().

To insert the tuntap module you should run: modprobe tun.

With recent iproute2, you can create tun/tap devices with the ip tuntap command. See: ip tuntap help
For example:

    ip tuntap add tap0 mode tap

or

    ip tuntap add tun0 mode tun

Notice that if you try to delete a nonexistent tun or tap device, you will not get an error message or any warning.

Calling register_netdevice() creates a folder under sysfs for this device. So if the device name is "deviceName", then /sys/class/net/deviceName will be generated. This is also the case with regular ethernet devices like eth0, eth1, .... However, with tun/tap, three additional entries are created (via a call to device_create_file() in tun_set_iff()). These are "tun_flags", "owner" and "group".

Notice that "rmmod tun" will fail with a "Module tun is in use" error if you did not remove the tuntap devices before.

tun devices do not have mac addresses, but tap devices have a hw address which is created by calling eth_hw_addr_random(). Trying to set a mac address on a tun device will give an error; for example, with a tun device named tun2:

    ifconfig tun2 hw ether 00:01:02:03:04:05
    SIOCSIFHWADDR: Operation not supported

On a tap device, changing the mac address in this way is possible.

With a tun device, the tun_net_open() and tun_net_close() methods are called when you run "ifconfig tun0 up" and "ifconfig tun0 down", respectively. The same is true with a tap device; tun_net_open() is invoked when calling "ifconfig tap0 up" and tun_net_close() is invoked when calling "ifconfig tap0 down".

Calling fd = open("/dev/net/tun") triggers tun_chr_open().
Calling close(fd) triggers tun_chr_close().

Following is a simple user space app which creates a tun device. Notice that calling TUNSETPERSIST is mandatory in this program. If we do not call it, then when exiting the program the fd (of "/dev/net/tun") will be closed and tun_chr_close() will be invoked, as described above. In case TUNSETPERSIST is not set, unregister_netdevice(dev) will be called (by __tun_detach()).

In case we set the TUN_NO_PI flag (not set in the example below), packet information (PI) will not be provided. Packet Information is 4 bytes which are added when the flag is not set. These 4 bytes are 2 bytes of flags and 2 bytes of protocol. The Wireshark sniffer does not show these 4 bytes. See include/uapi/linux/if_tun.h:

    struct tun_pi {
        __u16 flags;
        __be16 proto;
    };

    // tun.c
    #include <string.h>
    #include <sys/types.h>
    #include <sys/stat.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/ioctl.h>
    #include <net/if.h>
    #include <linux/if_tun.h>
    #include <linux/socket.h>
    #include <stdio.h>

    int main()
    {
        int fd, err;
        struct ifreq ifr;

        fd = open("/dev/net/tun", O_RDWR);
        if (fd < 0) {
            printf("fd < 0 in open\n");
            return -1;
        }
        memset(&ifr, 0, sizeof(ifr));
        ifr.ifr_flags = IFF_TUN;
        strncpy(ifr.ifr_name, "tun1", IFNAMSIZ);
        err = ioctl(fd, TUNSETIFF, (void *)&ifr);
        if (err < 0) {
            printf("err<0 after TUNSETIFF ioctl\n");
            close(fd);
            return -1;
        }
        /* keep the device after this program exits */
        err = ioctl(fd, TUNSETPERSIST, 1);
        if (err < 0) {
            printf("err<0 after TUNSETPERSIST ioctl\n");
            close(fd);
            return -1;
        }
        return 0;
    }

The only method that the tun driver exports is tun_get_socket(), and it is used in the vhost driver (drivers/vhost/net.c).

tunctl is an older tool for creating tun/tap devices: http://tunctl.sourceforge.net/

You can also use a util from openvpn to create or remove a tun/tap device, for example:

    openvpn --mktun --dev tun0
    openvpn --rmtun --dev tun0

By default, when the device name starts with "tun", "openvpn --mktun" creates a TUN device. When the device name starts with "tap", "openvpn --mktun" creates a TAP device. However, if you need for some reason to create a tap device whose name starts with tun, you can still do it thus:

    openvpn --mktun --dev tun1 --dev-type tap

Notice that here too, when you try to delete a nonexistent tun/tap device, you don't get any warning.

TUN/TAP devices are widely used, in virtualization and in other fields. For example, with virt-manager, libvirt and KVM, when we start a guest, a TAP device named vnet0 is created on the host. It is added to a bridge interface on the host, called virbr0, with the ip address 192.168.122.1. In the guest, you can add the host bridge interface as a default gateway in order to be connected to the outside WAN. For implementation details of creating the tap device in libvirt, look in the virNetDevTapCreate() method in src/util/virnetdevtap.c of the libvirt package.

See more info about tuntap in Documentation/networking/tuntap.txt
See also this good link about tuntap: http://backreference.org/2010/03/26/tuntap-interface-tutorial/

The maintainer is Maxim Krasnyansky. Web site: http://vtun.sourceforge.net/tun.

An interesting patch series, adding multiqueue support for tuntap, was sent by Jason Wang in October 2012: http://www.spinics.net/lists/netdev/msg214869.html
An ioctl called TUNSETQUEUE was also added; this ioctl, with the IFF_ATTACH_QUEUE/IFF_DETACH_QUEUE flags, enables attaching/detaching a queue from user space. See: http://www.spinics.net/lists/kernel/msg1429560.html

Following is an example of using tun multiqueues. Please notice that we set IFF_MULTI_QUEUE when calling TUNSETIFF; later on, we call TUNSETPERSIST on the same fd, then open a new fd and call TUNSETQUEUE with the IFF_ATTACH_QUEUE flag set, and then open a third fd, on which we again call TUNSETQUEUE with the IFF_ATTACH_QUEUE flag set. The reason for the pause() at the end is that without it all the file descriptors will be closed. Closing the fd invokes tun_chr_close(), which subsequently calls tun_detach(), removes the sysfs queue entries and unregisters the device. Calling TUNSETQUEUE twice, as in this example, will result in having 3 queues in the end; we can also view these queues under the sysfs queues entry:

    ls /sys/class/net/tun1/queues/
    rx-0 rx-1 rx-2 tx-0 tx-1 tx-2

    // tuntap/tunMultiQueue.c
    #include <string.h>
    #include <sys/types.h>
    #include <sys/stat.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/ioctl.h>
    #include <net/if.h>
    #include <linux/if_tun.h>
    #include <linux/socket.h>
    #include <stdio.h>

    int main()
    {
        int fd, fd1, fd2, err;
        struct ifreq ifr;

        /* first fd: create the multiqueue device and make it persistent */
        fd = open("/dev/net/tun", O_RDWR);
        if (fd < 0) {
            printf("fd < 0 in open\n");
            return -1;
        }
        memset(&ifr, 0, sizeof(ifr));
        ifr.ifr_flags = IFF_TUN | IFF_MULTI_QUEUE;
        strncpy(ifr.ifr_name, "tun1", IFNAMSIZ);
        err = ioctl(fd, TUNSETIFF, (void *)&ifr);
        if (err < 0) {
            printf("err<0 after TUNSETIFF ioctl\n");
            close(fd);
            return -1;
        }
        err = ioctl(fd, TUNSETPERSIST, 1);
        if (err < 0) {
            printf("err<0 after TUNSETPERSIST ioctl\n");
            close(fd);
            return -1;
        }

        /* second fd: attach a second queue */
        fd1 = open("/dev/net/tun", O_RDWR);
        if (fd1 < 0) {
            printf("fd1 < 0 in open\n");
            return -1;
        }
        memset(&ifr, 0, sizeof(ifr));
        ifr.ifr_flags = IFF_TUN | IFF_ATTACH_QUEUE;
        strncpy(ifr.ifr_name, "tun1", IFNAMSIZ);
        err = ioctl(fd1, TUNSETQUEUE, (void *)&ifr);
        if (err < 0) {
            perror("TUNSETQUEUE (second)");
            close(fd1);
            return -1;
        }

        /* third fd: attach a third queue */
        printf("calling TUNSETQUEUE again with a third fd\n");
        fd2 = open("/dev/net/tun", O_RDWR);
        if (fd2 < 0) {
            printf("fd2 < 0 in open\n");
            return -1;
        }
        memset(&ifr, 0, sizeof(ifr));
        ifr.ifr_flags = IFF_TUN | IFF_ATTACH_QUEUE;
        strncpy(ifr.ifr_name, "tun1", IFNAMSIZ);
        err = ioctl(fd2, TUNSETQUEUE, (void *)&ifr);
        if (err < 0) {
            perror("TUNSETQUEUE (third)");
            close(fd2);
            return -1;
        } else
            printf("Call to TUNSETQUEUE on the third fd is OK\n");

        /* keep the fds open so the queues stay attached */
        pause();
        return 0;
    }

When we issue the open() system call on the tun/tap device file (/dev/net/tun), we create a socket by sk_alloc() and assign it to a pointer named tfile. This is done in tun_chr_open(). When we issue the close() system call on the tun/tap device file, we call tun_detach() in order to release the socket (by sk_release_kernel()).

In case you create a tun or tap device and later want to know the type of the device, you can do it by:

    ethtool -i unknownTypeDeviceName | grep bus-info

virtio

virtio was developed by Rusty Russell for his lguest project. virtio has a common API for different types of devices (block devices, net devices, pci devices, and more). The virtio network driver is implemented in drivers/net/virtio_net.c.

BLUETOOTH

Bluetooth is a wireless technology standard for exchanging data over short distances. The Bluetooth implementation in the Linux kernel is found in two locations:
Bluetooth core: net/bluetooth
Bluetooth drivers: drivers/bluetooth/

Bluetooth kernel diagram:
(Note: cmtp is a module for ISDN, not commonly used)

Note that there are very few drivers for bluetooth, as many devices use the generic drivers. We have, for example, the Generic Bluetooth USB driver (drivers/bluetooth/btusb.c), which is used for many USB BT devices. For example, the ASUS USB BT21 dongle, which has a Broadcom chip, uses this driver.

BlueZ

BlueZ is the user space package for bluetooth. From version 4.0 of BlueZ, the main daemon is called bluetoothd (instead of hcid in earlier versions). The main configuration file is /etc/bluetooth/main.conf.

This daemon also creates an sdp server, calling start_sdp_server(). start_sdp_server() is implemented in bluez-4.99/src/sdpd-server.c. This sdp server opens two sockets:
A UNIX local domain socket, for getting requests sent from the local machine, such as adding a service (sdptool add). The socket is opened on /var/run/sdp.
An L2CAP socket, for getting requests from outside (for example, when a remote machine runs "sdptool browse" with the machine address).

There is also a Bluetooth virtual HCI driver, drivers/bluetooth/hci_vhci.c; it works with a misc character device, /dev/vhci. hciemu, the HCI emulator from the bluez package, uses this driver. Example:

    modprobe hci_vhci

then:

    ./hciemu -n 0

This will create a virtual BT hci device. hciconfig will show it with "Bus: VIRTUAL". By default, it will be a BREDR device (Type: BR/EDR). In case we need an AMP device, we should first run "modprobe hci_vhci amp=1". Then hciconfig will show an AMP device (Type: AMP).

Notice that you should NOT run

    mknod /dev/vhci c 10 250

as appears in some deprecated docs (for example, http://www.hanscees.com/bluezhowto.html).

Following is the list of Bluetooth kernel sockets; we will discuss them in the following text.
BTPROTO_L2CAP
BTPROTO_HCI
BTPROTO_SCO
BTPROTO_RFCOMM - dund uses an RFCOMM socket (bluez-4.99/compat/dund.c)
BTPROTO_BNEP
BTPROTO_CMTP (for ISDN)
BTPROTO_HIDP
BTPROTO_AVDTP - Audio / Video Distribution Transport Protocol. There is no AVDTP in the kernel; this is reserved for future use.

Bluetooth userspace utils:

Getting info about hci devices is done by: hciconfig
hciconfig shows the PSCAN and ISCAN flags. PSCAN is Page scan and ISCAN is Inquiry scan. In case ISCAN is not set, most likely the device will not be discoverable; you can run "hciconfig hci0 piscan" to set it. hciconfig also shows the type of the device, whether it is USB or UART. USB dongles are naturally USB, whereas in mobile phones (like the Samsung Mini Galaxy, for example) it is usually UART.

Getting detailed info about hci devices is done by: hciconfig -a

Bringing down/up an hci interface is done by:

    hciconfig hci0 down
    hciconfig hci0 up

These two commands send the HCIDEVDOWN/HCIDEVUP ioctls from user space. These ioctls are handled by hci_dev_close() and hci_dev_open(), respectively; see net/bluetooth/hci_sock.c.

Resetting a bluetooth device can be done by: hciconfig hci0 reset

Scanning for bluetooth devices is done by: hcitool scan
hcitool scan triggers a call to hci_inquiry() in user space (bluez-4.99/lib/hci.c). This method creates a BTPROTO_HCI socket and sends HCIINQUIRY to the kernel. This ioctl is handled in hci_inquiry() in the kernel (net/bluetooth/hci_core.c).

hcitool con shows active connections.

Bluetooth sniffing can be done by: hcidump (you can add flags, like hcidump -Xt). hcidump in Fedora is part of the bluez-hcidump package.

hciattach is used to attach a serial UART to the Bluetooth stack.

You can change the address of the HCI adapter with the bdaddr util from bluez. The bdaddr util is not installed as part of the bluez package binaries in most distros (like Fedora, for example). For building bdaddr from source you should first run ./configure --enable-test and then run make. Get more info by ./test/bdaddr -help
Use the -t flag for a temporary change rather than a permanent one. The -r flag is for soft reset. Without it, you should perform a hard reset by removing and replugging the device.

Using Bluetooth input devices, like a mouse/keyboard:
Bluetooth input devices are handled by the hidp kernel module (net/bluetooth/hidp/hidp.ko). You can connect to a Bluetooth mouse by:

    hidd --server --search

then push the connect or reset button on the mouse, and it will find it and pair. Connecting to the Bluetooth mouse in this way is in fact sending the HIDPCONNADD ioctl to the kernel via a hidp socket (a socket whose protocol is BTPROTO_HIDP). This ioctl is handled by hidp_add_connection(). It creates a kernel thread named khidpd_vid_pid (vid is vendor id, pid is product id). This kernel thread runs the hidp_session() method. The BTPROTO_HIDP ioctls are handled by hidp_sock_ioctl() in net/bluetooth/hidp/sock.c.

You can show the connections by:

    hidd --show

This command sends the HIDPCONNLIST ioctl to the BTPROTO_HIDP kernel socket. You will get something like:

    00:C0:DF:04:89:A9 BT Mouse [0458:00a7] connected

You can terminate the connection by:

    hidd --unplug 00:C0:DF:04:89:A9

This command sends the HIDPCONNDEL ioctl to the BTPROTO_HIDP kernel socket. Reconnecting can be done by:

    hidd --connect 00:C0:DF:04:89:A9

hidd is from the bluez-compat package. It connects to the kernel hci device by hci_open_dev() (user space API) and by opening a raw HCI socket, with the HCI protocol:

    socket(AF_BLUETOOTH, SOCK_RAW, BTPROTO_HCI);

l2ping - L2CAP ping util.

sdptool browse XX:XX:XX:XX:XX:XX shows opened services on the specified device. sdptool browse btAddr does the following:
creates an L2CAP socket (by l2cap_sock_create(), in net/bluetooth/l2cap_sock.c).
connects to this socket (by l2cap_sock_connect(), in net/bluetooth/l2cap_sock.c).
calls hci_connect() with ACL_LINK, which eventually calls hci_connect_acl(), in net/bluetooth/hci_conn.c.

sdptool browse local shows opened services on the local device.

sdptool add serviceName - adds a service to the local sdpd. Example:

    sdptool add --channel=15 SP

This adds the Serial Port service on channel 15.

The service name can be one from the following list: "DID", "SP", "DUN", "LAN", "FAX", "OPUSH", "FTP", "PRINT", "HS", "HSAG", "HF", "HFAG", "SAP", "PBAP", "NAP", "GN", "PANU", "HCRP", "HID", "KEYB", "WIIMOTE", "CIP", "CTP", "A2SRC", "A2SNK", "AVRCT", "AVRTG", "UDIUE", "UDITE", "SEMCHLA", "SR1", "SYNCML", "SYNCMLSERV", "ACTIVESYNC", "HOTSYNC", "PALMOS", "NOKID", "PCSUITE", "NFTP", "NSYNCML", "NGAGE", "APPLE", "ISYNC", "GATT".

When we run:

    pand --listen --role NAP

then "sdptool browse" on that device will show, among other SDP services, the "Network Access Point" service, which might be something like this:

    Service Name: Network service
    Service Description: Network service
    Service RecHandle: 0x10004
    Service Class ID List:
      "Network Access Point" (0x1116)
    Protocol Descriptor List:
      "L2CAP" (0x0100)
        PSM: 15
      "BNEP" (0x000f)
        Version: 0x0100
        SEQ16: 800 806
    Language Base Attr List:
      code_ISO639: 0x656e
      encoding: 0x6a
      base_offset: 0x100
    Profile Descriptor List:
      "Network Access Point" (0x1116)
        Version: 0x0100

And when we run:

    pand --listen --role GN

then "sdptool browse" on that device will show, among other SDP services, the "PAN Group Network" service, which might be something like this:

    Service Name: Group Network Service
    Service Description: BlueZ PAN Service
    Service Provider: BlueZ PAN
    Service RecHandle: 0x10007
    Service Class ID List:
      "PAN Group Network" (0x1117)
    Protocol Descriptor List:
      "L2CAP" (0x0100)
        PSM: 15
      "BNEP" (0x000f)
        Version: 0x0100
        SEQ16: 800 806
    Language Base Attr List:
      code_ISO639: 0x656e
      encoding: 0x6a
      base_offset: 0x100
    Profile Descriptor List:
      "PAN Group Network" (0x1117)
        Version: 0x0100

sdptool search --bdaddr XX:XX:XX:XX:XX:XX FTP shows the opened OBEX FTP service on the specified device and the respective channel.

openobex site: http://dev.zuckschwerdt.org/openobex/

RFCOMM

Acronym for: Radio Frequency Communications protocol.

Following is a practical example of establishing a PC to PC connection with RFCOMM over serial.

Set a Serial Port service (SP) on both sides:

    sdptool add --channel=1 SP

(you can choose a different channel than 1, but it should be the same on the client and server).

Now, run on the listener side the following:

    rfcomm listen rfcomm0

This command triggers creating a BTPROTO_RFCOMM kernel socket and calling the rfcomm_sock_listen() method and afterwards rfcomm_sock_accept(). Only after the socket is created, a device named /dev/rfcomm0 is created by sending an ioctl (RFCOMMCREATEDEV) to this socket. struct rfcomm_dev represents the rfcomm device. A sysfs entry is generated for this device, /sys/class/tty/rfcomm0. This is done by device_create_file(). This folder contains values such as the address and the channel of this device. The address is the dst member of struct rfcomm_dev and the channel is the channel member of struct rfcomm_dev.

On the sender side, run:

    rfcomm connect rfcomm0 00:11:22:33:44:55

This command triggers creating a BTPROTO_RFCOMM kernel socket, then calling rfcomm_sock_bind() and rfcomm_sock_connect(), and creating a device named /dev/rfcomm0 by sending an ioctl (RFCOMMCREATEDEV) to this socket.

You should get this message on the sender:

    Connected /dev/rfcomm0 to 00:11:22:33:44:55 on channel 15
    Press CTRL-C for hangup

Now you can send text from the sender to the listener thus: first, run on the listener, on a different console:

    cat /dev/rfcomm0

then, on the sender, run:

    echo "foo" >> /dev/rfcomm0

You should see "foo" on the listener terminal.

The RFCOMM tty module (net/bluetooth/rfcomm/tty.c) implements serial emulation of Bluetooth using the tty driver API, calling tty_register_driver() and tty_port_register_device().

We can establish a TCP/IP connection over Bluetooth devices in this way, for example. On the server side:

    pand --listen --role=NAP

And on the client side:

    pand --connect btAddressOfServer

An interface called bnep0 will be created on both sides. We can assign IP addresses on these two interfaces and have TCP/IP traffic. In case you encounter problems, like "Connect to btAddr failed. Invalid exchange(52)" or "connection refused", use hcidump to try to debug the problem. Make sure that the ISCAN and PSCAN flags are set on both sides.

pand --connect btAddressOfServer triggers the following sequence:
First, create an L2CAP socket and connect to it, by invoking the socket() system call with the BTPROTO_L2CAP protocol and then calling connect(), from user space. This is handled by l2cap_sock_connect() in the kernel (net/bluetooth/l2cap_sock.c).

l2cap_sock_connect() also creates a new connection. In this process, an entry is added under sysfs. This is done by hci_conn_init_sysfs() and hci_conn_add_sysfs() in net/bluetooth/hci_sysfs.c. When the connection is removed, this entry is removed from sysfs, with hci_conn_del_sysfs(). This entry has only 3 attributes (besides the generic device attributes): type, address and features.
Then, send the BNEPCONNADD ioctl to the bnep socket; this is handled by bnep_sock_ioctl() in net/bluetooth/bnep/sock.c, and invokes bnep_add_connection() in the bnep module (net/bluetooth/bnep/core.c). bnep_add_connection() creates a kernel thread named kbnepd and creates a network device named bnep0. The kbnepd kernel thread handles both Tx and Rx, by bnep_tx_frame() and bnep_rx_frame(), respectively.

Bluetooth sysfs entries are under: /sys/class/bluetooth/

Using the bluez dbus API

You can use dbus-send to access a dbus device. For example,

    dbus-send --system --dest=org.bluez --print-reply / org.bluez.Manager.DefaultAdapter

will give you a path to the BT adapter. You can get detailed info about the BlueZ DBUS api here: http://bluez.cvs.sourceforge.net/viewvc/bluez/utils/hcid/dbus-api.txt

See more about pand here: http://wiki.openmoko.org/wiki/Manually_using_Bluetooth

L2CAP

The L2CAP header is 4 bytes:
- 2 bytes for the length of the entire L2CAP PDU in bytes (without the header). The maximum length can be 65529 or 65533 bytes (according to 3.3.1 in the spec).
- 2 bytes for the cid (Channel Identifier); each L2CAP channel endpoint on any device has a different CID.

An L2CAP socket is created by l2cap_sock_create(). When allocating a new socket (l2cap_sock_alloc()), we also create a channel which is associated with this socket (l2cap_chan_create()).

Channel security level:
There are four levels of channel security: BT_SECURITY_SDP, BT_SECURITY_LOW, BT_SECURITY_MEDIUM, and BT_SECURITY_HIGH. The security level of the channel is BT_SECURITY_LOW by default; it is set in the l2cap_chan_set_defaults() method, net/bluetooth/l2cap_core.c.

The L2CAP header is represented by the l2cap_hdr struct in include/net/bluetooth/l2cap.h.

Two types of controllers are defined in Bluetooth version 3 by the core specification:
a Basic Rate / Enhanced Data Rate controller (HCI_BREDR)
an Alternate MAC/PHY (AMP) controller (HCI_AMP)

You can find the type of your bluetooth device by hciconfig. The first line shows the type. For example, for a Basic Rate / Enhanced Data Rate controller (HCI_BREDR) we have:

    hciconfig
    hci0: Type: BR/EDR...

You can also get the type by reading the bluetooth sysfs entry:

    cat /sys/class/bluetooth/hci0/type
    BR/EDR

The type can be BR/EDR or AMP or UNKNOWN.

Info about establishing PAN can be found here: http://bluez.sourceforge.net/contrib/HOWTO-PAN
Notice that some of the info about the pand daemon is not updated to recent pand releases. For example, running the following command, which is mentioned in this howto:

    pand --listen --role NAP --sdp

will give the following error with pand of bluez-compat-4.99-2.fc17.x86_64:

    pand: unrecognized option '--sdp'

(The --nosdp option does exist.)

The BNEP layer is for the transmission of IP packets in the Personal Area Networking Profile and is implemented in net/bluetooth/bnep.

Site for Linux Bluetooth: http://www.bluez.org/
The BlueZ project was started in 2001 by Qualcomm.
obexd is the Object Exchange Protocol (OBEX) daemon and is part of BlueZ.

The Linux BLUETOOTH subsystem and drivers are maintained by Marcel Holtmann, Gustavo Padovan and Johan Hedberg.

A BD (bluetooth device) address is 48 bits, and it looks like this: XX:XX:XX:XX:XX:XX
Lower Address Part (LAP): 24 bits
Upper Address Part (UAP): 8 bits
Nonsignificant Address Part (NAP): 16 bits

Helper methods:

static inline int bacmp(bdaddr_t *ba1, bdaddr_t *ba2)
compares two bt addresses; returns 0 if equal.

static inline void bacpy(bdaddr_t *dst, bdaddr_t *src)
copies the src address to the dst address.

int ba2str(const bdaddr_t *ba, char *str)
converts from bdaddr_t to a zero-terminated string.

int str2ba(const char *str, bdaddr_t *ba)
converts from a zero-terminated string to bdaddr_t.

Read more: http://wiki.answers.com/Q/Whats_a_bd_address_as_it_asks_for_it_on_my_phone_for_bluetooth#ixzz28Jwz51ca

Blueman is a GTK+ bluetooth management utility for GNOME using the bluez dbus backend.

Linux bluetooth mailing list archive: http://www.spinics.net/lists/linux-bluetooth/
This mailing list is for patches:
Patches starting with "Bluetooth" are for the kernel.
Patches starting with "BlueZ" are for user space.

Some Bluetooth acronyms:
BNEP: The Bluetooth Network Encapsulation Protocol.
BD: Bluetooth device.
L2CAP: The Logical Link Control and Adaptation protocol.
RFCOMM: The Radio Frequency Communications protocol.
DUND: Dial-Up Networking Daemon. The DUND service allows ppp connections via bluetooth.
ACL: The Asynchronous Connection-oriented Logical transport protocol.
SCO: Synchronous Connection-Oriented logical transport.
SSP: Secure Simple Pairing. The headline feature of Bluetooth 2.1.
hciconfig hci0 sspmode 0 - this command disables sspmode.
hciconfig hci0 sspmode 1 - this command enables sspmode.
hciconfig hci0 sspmode - this command shows sspmode.

BlueDroid

With the Android 4.2 release, the BlueZ-based Bluetooth stack was replaced with a new stack, named "Bluedroid", which is a collaboration between Google and Broadcom. See:
https://developer.android.com/about/versions/jelly-bean.html
http://lwn.net/Articles/525816/
http://lwn.net/Articles/525636/

Links:
Bluetooth git tree for developers (for submitting patches): git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth-next.git
http://www.bluez.org/release-of-bluez-5-0/
The 5.0 BlueZ, by Nathan Willis, January 3, 2013: http://lwn.net/Articles/531133/
BlueZ 5 API introduction and porting guide: http://www.bluez.org/bluez-5-api-introduction-and-porting-guide/
Using Bluetooth, article by Ben DuPont on DrDobbs (January 31, 2012): http://www.drdobbs.com/mobile/using-bluetooth/232500828
OLS: Audio Streaming over Bluetooth - article by Ian Ward: http://lwn.net/Articles/293692/
Bluetooth profiles book: http://www.amazon.com/Bluetooth-Profiles-Dean-Anthony-Gratton/dp/0130092215/
Bluetooth Security (Artech House Computer Security Series): http://www.amazon.com/Bluetooth-Security-Artech-House-Computer/dp/1580535046
Desktop integration of Bluetooth. Marcel Holtmann, OLS 2007: kernel.org/doc/ols/2007/ols2007v1-pages-201-204.pdf

NETFILTER

The Linux 3.7 kernel includes support for IPv6 NAT. See: http://lwn.net/Articles/514087/
Most patches are from Patrick McHardy.

VXLAN
VXLAN device (Virtual eXtensible Local Area Network)
VXLAN is a standard protocol to transfer layer 2 Ethernet packets over UDP. VXLAN for the Linux kernel is developed by Stephen Hemminger. It integrates a Virtual Tunnel Endpoint (VTEP) functionality that learns MAC to IP address mapping.
Why do we need vxlan instead of using an ipip or gre tunnel? There are, for example, firewalls which block tunnels and allow only TCP/UDP traffic.
iproute has support for managing vxlan tunnels (ip/iplink_vxlan.c). This patch:

http://www.spinics.net/lists/netdev/msg2 2202.html is for adding support for managing vxlan tunnels in iproute2.

The basic way to add a vxlan virtual interface is:
    ip link add myvxlan type vxlan id 1
This sets the vni member of the vxlan_dev struct to 1 (via the vxlan_newlink() method). vni is the virtual network id; the vni can be in the range 0-16777215 (whereas in vlans the id is restricted to 0-4094).

You can add a vxlan with group address and ttl thus:
    ip link add myvxlan type vxlan id 1 group 239.0.0.42 ttl 10
This also sets the ttl and the gaddr (multicast group address) of the vxlan device (vxlan_dev).

Removing a vxlan virtual interface is done thus:
    ip link del myvxlan
(This triggers the vxlan_dellink() method.)

You can view the fdb of the vxlan interface with the bridge tool of iproute2:
    bridge fdb show
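The 0-16777215 range above reflects the fact that the VNI is a 24-bit field. As a minimal sketch of how such an id fits into the encapsulation header, the code below builds the 8-byte VXLAN header. The layout (one flags byte with the 0x08 "VNI valid" bit, three reserved bytes, a 24-bit VNI, one reserved byte) is taken from the VXLAN specification rather than from this text, and the helper names vxlan_build_header() and vxlan_header_vni() are hypothetical, not kernel functions.

```c
#include <stdint.h>
#include <string.h>

#define VXLAN_VNI_MAX   0xFFFFFFu  /* 24-bit VNI: 0..16777215 */
#define VXLAN_FLAG_VNI  0x08u      /* "I" flag: the VNI field is valid */
#define VXLAN_HDR_LEN   8

/* Write an 8-byte VXLAN header for `vni` into `out`.
 * Returns 0 on success, -1 if the VNI does not fit in 24 bits. */
int vxlan_build_header(uint32_t vni, uint8_t out[VXLAN_HDR_LEN])
{
    if (vni > VXLAN_VNI_MAX)
        return -1;
    memset(out, 0, VXLAN_HDR_LEN);
    out[0] = VXLAN_FLAG_VNI;        /* flags; bytes 1..3 stay reserved (0) */
    out[4] = (uint8_t)(vni >> 16);  /* VNI, most significant byte first */
    out[5] = (uint8_t)(vni >> 8);
    out[6] = (uint8_t)vni;          /* byte 7 stays reserved (0) */
    return 0;
}

/* Read the VNI back from a VXLAN header. */
uint32_t vxlan_header_vni(const uint8_t hdr[VXLAN_HDR_LEN])
{
    return ((uint32_t)hdr[4] << 16) | ((uint32_t)hdr[5] << 8) | hdr[6];
}
```

Calling vxlan_build_header(1, hdr) corresponds to the "id 1" used in the ip link example; a VNI of 16777216 is rejected because it needs a 25th bit.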

The VXLAN module creates a kernel UDP socket by sock_create_kern() (in vxlan_init_net()). This is a UDP encapsulation socket (this is set by udp_encap_enable()). This means that the kernel inserts a UDP header into the packet.
UDP encapsulation is also done for NAT traversal. For example, with l2tp: when we want to use L2TP UDP encapsulation (L2TP_ENCAPTYPE_UDP), we also call udp_encap_enable() when creating the l2tp tunnel (l2tp_tunnel_create(), net/l2tp/l2tp_core.c).
The VXLAN module currently uses UDP destination port 8472, which is assigned for Overlay Transport Virtualization (OTV), until IANA assigns a special VXLAN port. See: http://www.speedguide.net/port.php?port=8472
However, the UDP destination port is a module parameter.
First patches were sent in September 2012:

See: http://www.spinics.net/lists/netdev/msg 564.html

Setting up VXLAN:
http://vincent.bernat.im/en/blog/2012-multicast-vxlan.html#settingup-vxlan
http://blogs.cisco.com/datacenter/digging-deeper-into-vxlan/
Stephen Hemminger's blog about vxlan: http://linux-network-plumber.blogspot.co.il/2012/09/just-publishedlinux-kernel.html
A First Look At VXLAN over Infiniband Network On Linux 3.7-rc7: by Naoto MATSUMOTO on Nov 29, 2012

vxlan draft: http://tools.ietf.org/html/draft-mahalingam-dutt-dcops-vxlan-0
This draft does not support IPv6, but IPv6 will probably be supported in the future.
See also Documentation/networking/vxlan.txt

vxlan tools/userspace
There are two userspace apps, "vxland" and "vxlanctl". vxland is a vxlan daemon which forwards packets to the VXLAN Overlay Network. vxlanctl is a command for controlling vxlan. You can create/destroy a vxlan tunnel interface using vxlanctl.
    git clone git://github.com/upa/vxlan.git
It requires the uthash package, version .9 (for the hash table usage). You can fetch uthash from http://uthash.sourceforge.net/. Then put the header file, uthash.h, under /usr/include, and run "make" for the vxlan project from github.


VXLAN includes support for Distributed Overlay Virtual Ethernet (DOVE) networks by David L Stevens from IBM.

vti (IPv4 over IPSec tunneling driver)
VTI stands for Virtual Tunnel Interface. The Linux implementation is in net/ipv4/ip_vti.c. Insmoding the kernel module (net/ipv4/ip_vti.ko) creates an ip_vti0 interface.

NFC
NFC stands for Near Field Communication. AF_NFC sockets are implemented under net/nfc.
neard, the Near Field Communication manager, is available in: http://git.kernel.org/?p=network/nfc/neard.git;a=summary
linux-nfc web site: https://www.01.org/linux-nfc
linux-nfc mailing list: https://lists.01.org/mailman/listinfo/linux-nfc
Near Field Communication with Linux slides, ELC 2012, Barcelona: http://elinux.org/images/d/d1/Near_Field_Communication_with_Linux.pdf

GRE over IPv6
Dmitry Kozlov added support for GRE over IPv6. These patches were applied in August 2012. See:
http://lwn.net/Articles/508786/
http://comments.gmane.org/gmane.linux.network/239706

Linux virtual server:

http://www.linuxvirtualserver.org/
Implemented in net/netfilter/ipvs. IP Virtual Server lets you build a high-performance virtual server based on a cluster of two or more real servers.

Sockets
There are two kinds of sockets in the kernel. Most of them are sockets created from user space. There are also kernel sockets; they are created by sock_create_kern(). For example, in the bluetooth kernel stack (net/bluetooth/rfcomm/core.c):

    rfcomm_l2sock_create(struct socket **sock)
    {
        ...
        err = sock_create_kern(PF_BLUETOOTH, SOCK_SEQPACKET, BTPROTO_L2CAP, sock);
        ...
    }

and in vxlan, Virtual eXtensible Local Area Network (drivers/net/vxlan.c):

    static __net_init int vxlan_init_net(struct net *net)
    {
        ...
        rc = sock_create_kern(AF_INET, SOCK_DGRAM, IPPROTO_UDP, &vn->sock);
        ...
    }

Creating a socket from user space is done by the socket() system call. On success, a file descriptor for the new socket is returned.
The first parameter, family, is also sometimes referred to as "domain". The family is PF_INET for IPv4 or PF_INET6 for IPv6. The family is PF_PACKET for packet sockets, which operate at the device driver layer (Layer 2). PF_PACKET sockets are used, for example, in the pcap library for Linux.

The pcap library is in use by sniffers such as tcpdump or wireshark. hostapd also uses PF_PACKET sockets (hostapd is a wireless access point management project). From the hostapd source code:

    ...
    drv->monitor_sock = socket(PF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
    ...

Type:
- SOCK_STREAM and SOCK_DGRAM are the most commonly used types.
- SOCK_STREAM for TCP, SCTP, BLUETOOTH.
- SOCK_DGRAM for UDP.
- SOCK_RAW for RAW sockets.
- There are cases where the type can be either SOCK_STREAM or SOCK_DGRAM; for example, a Unix domain socket (AF_UNIX).

Protocol: usually 0 (IPPROTO_IP is 0, see: include/linux/in.h).
- For SCTP, the protocol is IPPROTO_SCTP:
    sockfd = socket(AF_INET, SOCK_STREAM, IPPROTO_SCTP);
- For bluetooth/RFCOMM:
    socket(AF_BLUETOOTH, SOCK_STREAM, BTPROTO_RFCOMM);
- SCTP: Stream Control Transmission Protocol.

For every socket which is created by a userspace application, there is a corresponding socket struct and sock struct in the kernel. This system call eventually invokes the sock_create() method in the kernel.
An instance of struct socket is created (include/linux/net.h).

struct socket has only 8 members; struct sock has more than 20, and is one of the biggest structures in the networking stack. You can easily confuse them. So the convention is this:
- sock always refers to struct socket.
- sk always refers to struct sock.
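To connect the user-space side of this with the family/type/protocol parameters described above, here is a minimal sketch that creates a socket with socket() and then asks the kernel, through the standard SO_TYPE option, which type it recorded. The helper name socket_type_of() is hypothetical and only serves the illustration.

```c
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

/* Create a socket with the given family, type and protocol, and ask the
 * kernel which type it recorded for it.
 * Returns the SO_TYPE value (e.g. SOCK_DGRAM), or -1 on any error. */
int socket_type_of(int family, int type, int protocol)
{
    int fd;
    int so_type = -1;
    socklen_t len = sizeof(so_type);

    fd = socket(family, type, protocol);   /* returns a file descriptor */
    if (fd < 0)
        return -1;

    if (getsockopt(fd, SOL_SOCKET, SO_TYPE, &so_type, &len) < 0)
        so_type = -1;

    close(fd);
    return so_type;
}
```

For example, socket_type_of(AF_INET, SOCK_DGRAM, IPPROTO_UDP) is expected to return SOCK_DGRAM, and passing 0 as the protocol with SOCK_STREAM lets the kernel pick the default protocol for that type.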


The sk_protocol member of struct sock equals the third parameter (protocol) of the socket() system call.
struct sock has three queues:
- sk_receive_queue for rx
- sk_write_queue for tx
- sk_error_queue for errors

skb_queue_tail(): adding to the queue.
skb_dequeue(): removing from the queue.
For the error queue: sock_queue_err_skb() adds to its tail (include/net/sock.h). Eventually, it also calls skb_queue_tail(). Errors can be ICMP errors or EMSGSIZE errors.

UDP and TCP sockets:
- No explicit connection setup is done with UDP. In TCP there is a preliminary connection setup.
- Packets can be lost in UDP (there is no retransmission mechanism in the kernel). TCP on the other hand is reliable (there is a retransmission mechanism).
- Most of the Internet traffic is TCP (like http, ssh).
- UDP is for audio/video (RTP)/streaming. Note: streaming with VLC is by UDP (RTP). Streaming via YouTube is TCP (http).
- There are quite a few UDP-based servers like DNS, NTP, DHCP, TFTP and more.
- For DHCP, it is quite natural to use UDP (since many times with DHCP you don't have a source address, which is a must for TCP).
- The TCP implementation is much more complex. The TCP header is much bigger than the UDP header.

The udp header (include/linux/udp.h):

    struct udphdr {
        __be16  source;
        __be16  dest;
        __be16  len;
        __sum16 check;
    };

UDP packet = UDP header + payload.

From user space, you can receive udp traffic by three system calls:
- recv() (when the socket is connected)
- recvfrom()
- recvmsg()
All three are handled by udp_recvmsg() in the kernel.

Note that all three calls take a flags parameter (the fourth parameter of recv() and recvfrom(), the third of recvmsg()); however, this parameter is NOT changed upon return. If you are interested in returned flags, you must use recvmsg() and retrieve the msg.msg_flags member.
For example, suppose you have client-server udp applications, and the sender sends a packet which is longer than what the receiver had allocated for its input buffer. The kernel then truncates the packet and sets the MSG_TRUNC flag. In order to retrieve it, you should use something like:

    recvmsg(udpSocket, &msg, flags);
    if (msg.msg_flags & MSG_TRUNC)
        printf("MSG_TRUNC\n");

There was a suggestion recently for a recvmmsg() system call for receiving multiple messages (by Arnaldo Carvalho de Melo). recvmmsg() is meant to reduce the overhead caused by multiple recvmsg() system calls in the usual case.

udp_rcv() is the handler for all UDP packets. It handles all incoming packets in which the protocol field in the ip header is IPPROTO_UDP (17). See the udp_protocol definition (net/ipv4/af_inet.c):

    struct net_protocol udp_protocol = {
        .handler     = udp_rcv,
        .err_handler = udp_err,
        ...
    };
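The MSG_TRUNC behaviour described above can be observed on a single machine. The following is a minimal sketch, not taken from the original text: it binds a UDP socket to the loopback address, sends a datagram to itself, reads it back with recvmsg() into a deliberately small buffer and reports whether msg_flags carries MSG_TRUNC. The function name udp_truncation_demo() is hypothetical.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <unistd.h>

/* Send `sent_len` bytes to ourselves over loopback UDP and read them back
 * into a `buf_len`-byte buffer with recvmsg().
 * Returns 1 if the kernel set MSG_TRUNC, 0 if it did not, -1 on error. */
int udp_truncation_demo(size_t sent_len, size_t buf_len)
{
    char out[2048], in[2048];
    struct sockaddr_in addr;
    socklen_t alen = sizeof(addr);
    struct iovec iov;
    struct msghdr msg;
    int s, result = -1;

    if (sent_len > sizeof(out) || buf_len > sizeof(in))
        return -1;

    s = socket(AF_INET, SOCK_DGRAM, 0);
    if (s < 0)
        return -1;

    /* Bind to 127.0.0.1 and let the kernel choose a free port. */
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port = 0;
    if (bind(s, (struct sockaddr *)&addr, sizeof(addr)) < 0 ||
        getsockname(s, (struct sockaddr *)&addr, &alen) < 0)
        goto done;

    memset(out, 'x', sent_len);
    if (sendto(s, out, sent_len, 0, (struct sockaddr *)&addr, sizeof(addr)) < 0)
        goto done;

    /* Only msg_flags of the msghdr reports what happened on receive. */
    iov.iov_base = in;
    iov.iov_len = buf_len;
    memset(&msg, 0, sizeof(msg));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    if (recvmsg(s, &msg, 0) < 0)
        goto done;

    result = (msg.msg_flags & MSG_TRUNC) ? 1 : 0;
done:
    close(s);
    return result;
}
```

With a 100-byte datagram and a 10-byte buffer the flag is expected to be set; with a buffer at least as large as the datagram it is not.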

Similarly to udp_rcv() for UDP, we have:
- raw_rcv() as a handler for raw packets.
- tcp_v4_rcv() as a handler for TCP packets.
- icmp_rcv() as a handler for ICMP packets.
Kernel implementation: the proto_register() method registers a protocol handler (net/core/sock.c).

udp_rcv() implementation:
For broadcasts and multicast there is a special treatment:

    if (rt->rt_flags & (RTCF_BROADCAST|RTCF_MULTICAST))
        return __udp4_lib_mcast_deliver(net, skb, uh, saddr, daddr, udptable);

Then a lookup is performed in a hashtable of struct sock.
- The hash key is created from the destination port in the udp header.
- If there is no entry in the hashtable, then there is no sock listening on this UDP destination port, so an ICMP message (port unreachable) is sent back:
    icmp_send(skb, ICMP_DEST_UNREACH, ICMP_PORT_UNREACH, 0);
In this case, a corresponding SNMP MIB counter is incremented (UDP_MIB_NOPORTS):
    UDP_INC_STATS_BH(net, UDP_MIB_NOPORTS, proto == IPPROTO_UDPLITE);
You can view it by:

    netstat -s
    .....
    Udp:
    ...
    35 packets to unknown port received.

Or by:

    cat /proc/net/snmp | grep Udp:
    Udp: InDatagrams NoPorts InErrors OutDatagrams RcvbufErrors SndbufErrors
    Udp: 4 35 0 30 0 0
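The two "Udp:" lines of /proc/net/snmp pair each counter name with a value by position. As a small sketch of reading the NoPorts counter programmatically, the hypothetical helper below walks both lines in step; the sample strings in the usage note are illustrative only, not real measurements.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Given the two "Udp:" lines of /proc/net/snmp (first the names, then the
 * values), return the value of the counter called `name`, or -1 if that
 * counter does not appear. */
long snmp_udp_counter(const char *names, const char *values, const char *name)
{
    char key[64], val[64];
    int nk, nv;

    /* Read one whitespace-separated token from each line per iteration;
     * %n records how many characters were consumed so we can advance. */
    while (sscanf(names, "%63s%n", key, &nk) == 1 &&
           sscanf(values, "%63s%n", val, &nv) == 1) {
        if (strcmp(key, name) == 0)
            return strtol(val, NULL, 10);
        names += nk;
        values += nv;
    }
    return -1;
}
```

A caller would typically read /proc/net/snmp with fgets(), keep the two consecutive lines that start with "Udp:", and pass them in; for instance, with the names line "Udp: InDatagrams NoPorts InErrors" and the values line "Udp: 4 35 0", asking for "NoPorts" yields 35.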


If there is a sock listening on the destination port, udp_queue_rcv_skb() is called.
- It eventually calls sock_queue_rcv_skb(),
- which adds the packet to the sk_receive_queue by skb_queue_tail().

udp_recvmsg():
- Calls __skb_recv_datagram() for receiving one sk_buff.
- __skb_recv_datagram() may block.
- Eventually, what __skb_recv_datagram() does is read one sk_buff from the sk_receive_queue queue.
- memcpy_toiovec() performs the actual copy to user space by invoking copy_to_user().
- One of the parameters of udp_recvmsg() is a pointer to struct msghdr. Let's take a look.

From include/linux/socket.h:

    struct msghdr {
        void            *msg_name;       /* Socket name */
        int             msg_namelen;     /* Length of name */
        struct iovec    *msg_iov;        /* Data blocks */
        __kernel_size_t msg_iovlen;      /* Number of blocks */
        void            *msg_control;
        __kernel_size_t msg_controllen;  /* Length of cmsg list */
        unsigned        msg_flags;
    };

Control messages (ancillary messages)
- The msg_control member of msghdr represents a control message.
- Sometimes you need to perform some special things. For example, getting to know what the destination address of a received packet was.
- Sometimes there is more than one address on a machine (and you can also have multiple addresses on the same nic).
- How can we know the destination address of the ip header in the application?
- struct cmsghdr (/usr/include/bits/socket.h) represents a control message.

- cmsghdr members can mean different things based on the type of socket.
- There is a set of macros for handling cmsghdr, like CMSG_FIRSTHDR(), CMSG_NXTHDR(), CMSG_DATA(), CMSG_LEN() and more.
- There are no control messages for TCP sockets.

Socket options:
In order to tell the socket to get the information about the packet destination, we should call setsockopt().
- setsockopt() and getsockopt() set and get options on a socket. Both methods return 0 on success and -1 on error.
- Prototype: int setsockopt(int sockfd, int level, int optname,...

There are two levels of socket options:
- To manipulate options at the sockets API level: SOL_SOCKET
- To manipulate options at a protocol level, that protocol number should be used; for example, for UDP it is IPPROTO_UDP or SOL_UDP (both are equal to 17); see include/linux/in.h and include/linux/socket.h
- SOL_IP is 0.
- There are currently 19 Linux socket options and one additional option for BSD compatibility.
- There is an option of SO_BINDTODEVICE (assigning a socket to a specified device).
This patch also added an option to get SO_BINDTODEVICE via getsockopt: http://www.spinics.net/lists/netdev/msg2 4006.html

- There is an option called IP_PKTINFO.
- We will set the IP_PKTINFO option on a socket in the following example.

    // from /usr/include/bits/in.h
    #define IP_PKTINFO 8 /* bool */

    /* Structure used for IP_PKTINFO. */
    struct in_pktinfo {
        int ipi_ifindex;              /* Interface index */
        struct in_addr ipi_spec_dst;  /* Routing destination address */
        struct in_addr ipi_addr;      /* Header destination address */
    };

    const int on = 1;
    sockfd = socket(AF_INET, SOCK_DGRAM, 0);
    if (setsockopt(sockfd, SOL_IP, IP_PKTINFO, &on, sizeof(on)) < 0)
        perror("setsockopt");
    ...

When calling recvmsg(), we will parse the msghdr like this:

    for (cmptr = CMSG_FIRSTHDR(&msg); cmptr != NULL; cmptr = CMSG_NXTHDR(&msg, cmptr)) {
        if (cmptr->cmsg_level == SOL_IP && cmptr->cmsg_type == IP_PKTINFO) {
            pktinfo = (struct in_pktinfo *)CMSG_DATA(cmptr);
            printf("destination=%s\n",
                   inet_ntop(AF_INET, &pktinfo->ipi_addr, str, sizeof(str)));
        }
    }

In the kernel, this calls ip_cmsg_recv() in net/ipv4/ip_sockglue.c (which eventually calls ip_cmsg_recv_pktinfo()).
- You can in this way retrieve other fields of the ip header:
- For getting the TTL: setsockopt(sockfd, SOL_IP, IP_RECVTTL, &on, sizeof(on)). But note that when parsing, cmsg_type == IP_TTL.
- For getting ip_options: setsockopt() with IP_OPTIONS.
Note: you cannot get/set ip_options in Java.

Sending packets in UDP


From user space, you can send udp traffic with three system calls:
- send() (when the socket is connected)
- sendto()
- sendmsg()
All three are handled by udp_sendmsg() in the kernel.
- udp_sendmsg() is much simpler than the parallel tcp method, tcp_sendmsg().
- udp_sendpage() is called when user space calls sendfile() (to copy a file into a udp socket). sendfile() can also be used to copy data between one file descriptor and another. udp_sendpage() invokes udp_sendmsg().
- udp_sendpage() will work only if the nic supports Scatter/Gather (the NETIF_F_SG feature is supported).

Bind:
You cannot bind to privileged ports (ports lower than 1024) when you are not root!
- Trying to do this will give "Permission denied" (EACCES).
- You can enable non-root binding on a privileged port by running as root (you will need at least a 2.6.24 kernel):
    setcap 'cap_net_bind_service=+ep' udpclient
- This sets the CAP_NET_BIND_SERVICE capability.
You cannot bind on a port which is already bound.
- Trying to do this will give "Address already in use" (EADDRINUSE).
You cannot bind twice or more with the same UDP socket (even if you change the port).
- You will get a "bind: Invalid argument" error in such a case (EINVAL).
If you try connect() on an unbound UDP socket and then bind(), you will also get the EINVAL error. The reason is that connecting an unbound socket will call inet_autobind() to automatically bind it (on a random port). So after connect(), the socket is bound. And calling bind() again will fail with EINVAL (since the

socket is already bound).
Binding in the kernel for UDP is implemented in inet_bind() and inet_autobind() (in IPv6: inet6_bind()).

Non local bind
What happens if we try to bind on a non local address? (A non local address can be, for example, an address of an interface which is temporarily down.)
- We get an EADDRNOTAVAIL error: "bind: Cannot assign requested address."
- However, if we set /proc/sys/net/ipv4/ip_nonlocal_bind to 1, by
    echo "1" > /proc/sys/net/ipv4/ip_nonlocal_bind
  or by adding in /etc/sysctl.conf:
    net.ipv4.ip_nonlocal_bind=1
  then the bind() will succeed, but it may sometimes break applications.

What will happen if, in the above udp client example, we try setting a broadcast address as the destination (instead of 192.168.0.121), thus:
    inet_aton("255.255.255.255", &target.sin_addr);
- We will get an EACCES error ("Permission denied") for sendto().
In order for UDP broadcast to work, we have to add:

    int flag = 1;
    if (setsockopt(s, SOL_SOCKET, SO_BROADCAST, &flag, sizeof(flag)) < 0)
        perror("setsockopt");

UDP socket options
For the IPPROTO_UDP/SOL_UDP level, we have two socket options:
- UDP_CORK socket option. Added in Linux kernel 2.5.44.

    int state = 1;
    setsockopt(s, IPPROTO_UDP, UDP_CORK, &state, sizeof(state));
    for (j = 0; j < 1000; j++)
        sendto(s, buf1, ...)
    state = 0;
    setsockopt(s, IPPROTO_UDP, UDP_CORK, &state, sizeof(state));

- The above code fragment will call udp_sendmsg() 1000 times without actually sending anything on the wire (in the usual case, without the setsockopt() with UDP_CORK, 1000 packets would be sent).
- Only after the second setsockopt() is called, with UDP_CORK and state=0, is one packet sent on the wire.
- Kernel implementation: when using UDP_CORK, udp_sendmsg() passes MSG_MORE to ip_append_data().
- Implementation detail: UDP_CORK is not in the glibc header (/usr/include/netinet/udp.h); you need to add in your program:
    #define UDP_CORK 1
- UDP_ENCAP socket option. For usage with IPSEC. Used, for example, in ipsec-tools.
- Note: UDP_ENCAP does not appear yet in the man page of udp (UDP_CORK does appear).

Note that there are other socket options at the SOL_SOCKET level which you can get/set on UDP sockets: for example, SO_NO_CHECK (to disable checksum on UDP receive) and SO_DONTROUTE (equivalent to MSG_DONTROUTE in send()).
- The SO_DONTROUTE option tells the kernel "don't send via a gateway, only send to directly connected hosts."
- Adding:
    setsockopt(s, SOL_SOCKET, SO_DONTROUTE, &one, sizeof(one))
  and sending the packet to a host on a different network will cause a "Network is unreachable" error to be received (ENETUNREACH).
- The same will happen when the MSG_DONTROUTE flag is set

in sendto().
- SO_SNDBUF:
    getsockopt(s, SOL_SOCKET, SO_SNDBUF, (void *)&sndbuf, &len);

Suppose we want to receive ICMP errors with the UDP client example (like ICMP destination unreachable/port unreachable). How can we achieve this?
- First, we should set this socket option:
    int val = 1;
    setsockopt(s, SOL_IP, IP_RECVERR, (char *)&val, sizeof(val));

udp_sendmsg()
    udp_sendmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg, size_t len)
Sanity checks in udp_sendmsg():
- The destination UDP port must not be 0. If we try a destination port of 0 we get an EINVAL error as the return value of udp_sendmsg().
- The destination UDP port is embedded inside the msghdr parameter (in fact, msg->msg_name represents a sockaddr_in; sin_port in sockaddr_in is the destination port number).
- MSG_OOB is the only illegal flag for UDP. udp_sendmsg() returns an EOPNOTSUPP error if such a flag is passed (it is only permitted for SOCK_STREAM).
- MSG_OOB is also illegal in AF_UNIX. OOB stands for "Out Of Band data".
- The MSG_OOB flag is permitted in TCP. It enables sending one byte of data in urgent mode (telnet, "ctrl/c" for example).
- The destination must be either specified in the msghdr (the name field in msghdr), or the socket must be connected:
    sk->sk_state == TCP_ESTABLISHED
  Notice that though this is UDP, we use TCP semantics here.
In case the socket is not connected, we should find a route to it; this is done by calling ip_route_output_flow().
- In case it is connected, we use the route from the sock

(sk_dst_cache member of sk, which is an instance of dst_entry).
- When the connect() system call was invoked, ip4_datagram_connect() found the route by ip_route_connect() and set sk->sk_dst_cache in sk_dst_set().
- Moving the packet to Layer 3 (IP layer) is done by ip_append_data(). In TCP, moving the packet to Layer 3 is done with ip_queue_xmit().
- What's the difference? UDP does not handle fragmentation; ip_append_data() does handle fragmentation. TCP handles fragmentation in layer 4, so there is no need for ip_append_data(). ip_queue_xmit() is (naturally) a simpler method.
- Basically what the udp_sendmsg() method does is:
  - Finds the route for the packet by ip_route_output_flow()
  - Sends the packet with ip_local_out(skb)

Asynchronous I/O
- There is support for Asynchronous I/O in UDP sockets. This means that instead of polling to know if there is data (by select(), for example), the kernel sends a SIGIO signal in such a case.
Using Asynchronous I/O with UDP in a user space application is done in three stages:
1) Adding a SIGIO signal handler by calling the sigaction() system call.
2) Calling fcntl() with F_SETOWN and the pid of our process to tell the kernel that this process is the owner of the socket (so that SIGIO signals will be delivered to it). Several processes can access a socket. If we do not call fcntl() with F_SETOWN, there can be ambiguity as to which process will get the SIGIO signal. For example, if we call fork(), the owner of the SIGIO is the parent; but we can call, in the child, fcntl(s, F_SETOWN, getpid()).
3) Setting flags: calling fcntl() with F_SETFL and O_NONBLOCK | FASYNC.

In the SIGIO handler, we call recvfrom().
- Example:

    struct sockaddr_in source;
    struct sigaction handler;

    source.sin_family = AF_INET;
    source.sin_port = htons(888);
    source.sin_addr.s_addr = htonl(INADDR_ANY);
    servSocket = socket(AF_INET, SOCK_DGRAM, 0);
    bind(servSocket, (struct sockaddr *)&source, sizeof(struct sockaddr_in));

    handler.sa_handler = SIGIOHandler;
    sigfillset(&handler.sa_mask);
    handler.sa_flags = 0;
    sigaction(SIGIO, &handler, 0);

    fcntl(servSocket, F_SETOWN, getpid());
    fcntl(servSocket, F_SETFL, O_NONBLOCK | FASYNC);

The fcntl() which sets the O_NONBLOCK | FASYNC flags invokes sock_fasync() in net/socket.c to add the socket.
- The SIGIOHandler() method will be called when there is data (since a SIGIO signal was generated); it should call recvmsg().

app.

RDMA (Infiniband)
See these sites by Dotan Barak: http://www.rdmamojo.com/ http://www.rdmamojo.com/links/



Linux Wireless Subsystem (802.11)
Each MAC frame consists of a MAC header, a frame body of variable length and an FCS (Frame Check Sequence) of 32 bit CRC. Next figure shows the 802.11 header.

The 802.11 header is represented in mac80211 by a structure called ieee80211_hdr (include/linux/ieee80211.h).

As opposed to an Ethernet header (struct ethhdr), which contains only three fields (source MAC address, destination MAC address, and type), the 802.11 header contains four addresses and not two, and some other fields.

Frame control: The first field in the 802.11 header is called the frame control. It is an important field and in many cases its contents determine the meaning of other fields of the 802.11 header (especially addresses). The frame control length is 16 bits; a discussion of its fields follows.

Protocol version: The version of the MAC 802.11 we use. Currently there is only one version of MAC, so this field is always 0.

Type:

There are three types of packets in 802.11: management, control and data.

- Management packets (IEEE80211_FTYPE_MGMT) are for management actions like association, authentication, scanning and more. We will deal more with management packets in the following sections.
- Control packets (IEEE80211_FTYPE_CTL) usually have some relevance to data packets; for example, a PS Poll packet is for retrieving packets from an Access Point buffer. Another example: a station that wants to transmit first sends a control packet called RTS (request to send); if the medium is free, the destination station will send a control packet called CTS (clear to send).
- Data packets (IEEE80211_FTYPE_DATA) are the raw data packets. Null packets are a special case of raw packets.
There is also an IEEE80211_FTYPE_EXT type; we will not discuss it.

These types are defined in include/linux/ieee80211.h

Subtype: For all three aforementioned types of packets (management, control and data), there is a subtype field which identifies the character of the packet we use. For example, a value of 0100 for the subtype field in a management frame denotes that the packet is a Probe Request (IEEE80211_STYPE_PROBE_REQ) management packet, which is used in a scan operation. Notice that the action management frame (IEEE80211_STYPE_ACTION) was introduced with the 802.11h amendment, which dealt with spectrum and transmit power management; however, since there is a lack

of space for management packet subtypes, action management frames are also used in 802.11n management packets. A value of 1011 for the subtype field in a control packet denotes that this is a request to send (IEEE80211_STYPE_RTS) control packet. A value of 0100 for the subtype field of a data packet denotes that this is a null data (IEEE80211_STYPE_NULLFUNC) packet, which is used for power management control. A value of 1000 (IEEE80211_STYPE_QOS_DATA) for the subtype of a data packet means that this is a QoS data packet; this subtype was added by the IEEE 802.11e amendment, which dealt with QoS enhancements.

ToDS: When this bit is set, this means that the packet is for the distribution system.
FromDS: When this bit is set, this means that the packet is from the distribution system.
More Frag: When we use fragmentation, this bit is set to 1.
Retry: When a packet is retransmitted, this bit is set to 1. A common case of retransmission is when a packet that was sent did not receive an acknowledgment in time. The acknowledgements are sent by the firmware of the wireless driver.
Pwr Mgmt: When the power management bit is set, this means that the station will enter power save mode.
More Data: When an Access Point sends packets that it buffered for a sleeping

station, it sets the more data bit to 1 when the buffer is not empty. Thus the station knows that there are more packets it should retrieve. When the buffer has been emptied, this bit is set to 0.
Protected Frame: This bit is set to 1 when the frame body is encrypted; only data frames and authentication frames can be encrypted.
Order: There is a MAC service called "strict ordering". With this service, the order of frames is important. When this service is in use, the order bit is set to 1. It is rarely used.

Duration/ID: The duration holds values for the Network Allocation Vector (NAV) in microseconds, and it consists of 15 bits of the duration field. The sixteenth bit is 0. When working in power save mode it is the AID (Association ID) of a station. The Network Allocation Vector (NAV) is a virtual carrier sensing mechanism.

Sequence control: This is a 2 byte field specifying the sequence control. In 802.11, it is possible that a packet will be received more than once. The most common cause for such a case is when an acknowledgement is not received for some reason. The sequence control field consists of a fragment number (4 bits) and a sequence number (12 bits). The sequence number is generated by the transmitting station, in ieee80211_tx_h_sequence(). In case of a duplicate frame in a retransmission, it is

dropped, and a counter of the dropped duplicate frames (dot11FrameDuplicateCount) is incremented by 1; this is done in ieee80211_rx_h_check(). The Sequence Control field is not present in control packets.

Address Fields: There are four addresses, but we don't always use all of them. Address 1 is the Receive Address (RA), and is used in all packets. Address 2 is the Transmit Address (TA), and it exists in all packets except ACK and CTS packets. Address 3 is used only for management and data packets. Address 4 is used when the ToDS and FromDS bits of the frame control are set; this happens when operating in a Wireless Distribution System (WDS).

QoS Control: The QoS Control field was added by the 802.11e amendment and it is only present in QoS data packets. Since it is not part of the original 802.11 spec, it is not part of the original mac80211 implementation, so it is not a member of the ieee80211_hdr struct. In fact, it was added at the end of the 802.11 header and it can be accessed by the ieee80211_get_qos_ctl() method. The QoS Control field includes the tid (Traffic Identification), the Ack Policy, and a field called A-MSDU present, which tells whether an A-MSDU is present.

HT Control Field: The HT Control Field was added by the 802.11n amendment. HT stands for High Throughput. One of the most important features of the 802.11n amendment is increasing the rate to up to 600 Mbps.
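The frame control and sequence control fields described above are plain bit fields, so they can be decoded with masks and shifts. The following is a minimal sketch: the mask values mirror the IEEE80211_FCTL_*, IEEE80211_FTYPE_* and IEEE80211_SCTL_* constants of include/linux/ieee80211.h, but the shortened macro names and the helper functions are hypothetical. The helpers assume the 16-bit value has already been converted from its little-endian on-air form to host byte order.

```c
#include <stdint.h>

/* Frame control masks; values follow include/linux/ieee80211.h. */
#define FCTL_FTYPE   0x000c
#define FCTL_STYPE   0x00f0
#define FCTL_TODS    0x0100
#define FCTL_FROMDS  0x0200
#define FCTL_RETRY   0x0800
#define FCTL_PM      0x1000

/* Frame types, as they appear under FCTL_FTYPE. */
#define FTYPE_MGMT   0x0000
#define FTYPE_CTL    0x0004
#define FTYPE_DATA   0x0008

/* Sequence control masks. */
#define SCTL_FRAG    0x000f
#define SCTL_SEQ     0xfff0

/* Frame type, still in place (compare against FTYPE_*). */
unsigned fc_type(uint16_t fc)    { return fc & FCTL_FTYPE; }

/* Subtype as a 4-bit number, e.g. 0x4 for binary 0100. */
unsigned fc_subtype(uint16_t fc) { return (fc & FCTL_STYPE) >> 4; }

/* Address 4 is present only when both ToDS and FromDS are set. */
int fc_has_addr4(uint16_t fc)
{
    return (fc & (FCTL_TODS | FCTL_FROMDS)) == (FCTL_TODS | FCTL_FROMDS);
}

int fc_is_retry(uint16_t fc)     { return (fc & FCTL_RETRY) != 0; }
int fc_pwr_mgmt(uint16_t fc)     { return (fc & FCTL_PM) != 0; }

/* Sequence control: low 4 bits fragment number, high 12 bits sequence number. */
unsigned sc_fragment(uint16_t sc) { return sc & SCTL_FRAG; }
unsigned sc_sequence(uint16_t sc) { return (sc & SCTL_SEQ) >> 4; }
```

For instance, a frame control value of 0x0040 decodes as a management frame with subtype 0100 (a Probe Request), and 0x00b4 as a control frame with subtype 1011 (RTS), matching the subtype values quoted in the text.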

All stations must authenticate and associate with the Access Point prior to communicating. Stations usually perform scanning prior to authentication and association in order to get details about the Access Point (like mac address, essid, and more).
Scanning is done thus:
    ifconfig wlan0 up
    iwlist wlan0 scan
Scanning is triggered by issuing the SIOCSIWSCAN ioctl (include/linux/wireless.h). iwlist (and iwconfig) is from the wireless-tools package.
Please note: wireless-tools is regarded as deprecated. We should use "iw", which is more modern and which is based on nl80211. You can download iw from http://linuxwireless.org/download/iw/. iw git repositories are at: http://git.sipsolutions.net/iw.git
Eventually, scanning starts by calling __ieee80211_start_scan() (net/mac80211/scan.c).
Active Scanning is performed by sending Probe Requests on all the channels which are supported by the station.
Open-system authentication (WLAN_AUTH_OPEN) is the only mandatory authentication method required by 802.11. (WLAN_AUTH_OPEN is defined in include/linux/ieee80211.h)
- At a given moment, a station may be associated with no more than one AP.
- A Station ("STA") can select a BSS and authenticate and associate to it.
- In Ad-Hoc: authentication is not defined.
- An Access Point will not receive any data frames from a station before it is associated with the AP.
- An Access Point which receives an association request will check whether the mobile station parameters match the Access Point parameters. These parameters are SSID, Supported Rates and capability information.
- When a station associates to an Access Point, it gets an ASSOCIATION ID (AID) in the range 1-2007.
- Trying unsuccessfully to associate more than 3 times results in this message in the kernel log: "association with AP apMacAddress timed out" (IEEE80211_ASSOC_MAX_TRIES is the number of max tries to associate, see net/mac80211/mlme.c)

Hostapd
hostapd is a user space daemon implementing access point functionality (and authentication servers). It supports Linux and FreeBSD.
- http://hostap.epitest.fi/hostapd/
- Developed by Jouni Malinen
- hostapd.conf is the configuration file.
- Certain devices, which support Master Mode, can be operated as Access Points by running the hostapd daemon.
- Hostapd implements part of the MLME AP code which is not in the kernel and probably will not be in the near future. For example: handling association requests which are received from wireless clients.


Hostapd manages:
    - Association/Disassociation requests.
    - Authentication/deauthentication requests.

wpa_supplicant is part of the hostapd project. You can clone hostap by:

    git clone git://w1.fi/srv/git/hostap.git

Power save mode

Hardware can handle power save by itself; when this is done, it should set the IEEE80211_HW_SUPPORTS_PS flag.

There are three types of IEEE 802.11 packets: Management, control and data. (These correspond to IEEE80211_FTYPE_MGMT, IEEE80211_FTYPE_CTL and IEEE80211_FTYPE_DATA in the mac80211 stack).

• Control packets include RTS (Request to Send), CTS (Clear to Send) and ACK packets.
• Management packets are used for Authentication and Association.
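The three packet types above are distinguished by two bits of the 16-bit frame control field. The following is a minimal user-space sketch of that classification, not kernel code; the mask values mirror the IEEE80211_FCTL_* and IEEE80211_FTYPE_* constants of include/linux/ieee80211.h, while the helper names frame_type() and power_management_set() are hypothetical and exist only for this illustration. The frame control value is assumed to be already converted to host byte order.

```python
# Masks mirroring include/linux/ieee80211.h (frame control in host byte order).
IEEE80211_FCTL_FTYPE = 0x000C   # the two "type" bits
IEEE80211_FCTL_PM = 0x1000      # power management bit
IEEE80211_FTYPE_MGMT = 0x0000
IEEE80211_FTYPE_CTL = 0x0004
IEEE80211_FTYPE_DATA = 0x0008


def frame_type(frame_control):
    """Hypothetical helper: name the type of a frame from its frame control."""
    ftype = frame_control & IEEE80211_FCTL_FTYPE
    if ftype == IEEE80211_FTYPE_MGMT:
        return "management"
    if ftype == IEEE80211_FTYPE_CTL:
        return "control"
    if ftype == IEEE80211_FTYPE_DATA:
        return "data"
    return "reserved"


def power_management_set(frame_control):
    """Hypothetical helper: True when the PM bit of the frame control is set."""
    return bool(frame_control & IEEE80211_FCTL_PM)


if __name__ == "__main__":
    print(frame_type(0x0080))            # beacon subtype on a management frame
    print(frame_type(0x00A4))            # PS-Poll subtype on a control frame
    print(power_management_set(0x1048))  # null data frame with the PM bit set
```

The PM bit checked by the second helper is the one a station sets on a null packet when it announces that it is entering power save mode.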

• Mobile devices are usually battery powered most of the time.
    - A station may be in one of two different modes:
        - Awake (fully powered)
        - Asleep (also termed "dozed" in the specs)
    - Access points never enter power save mode and do not transmit Null packets.
    - In power save mode, the station is not able to transmit or receive and consumes very low power.

In order to sniff wireless traffic in Linux with wireshark, you can do this:

    iwconfig wlan0 mode monitor
    ifconfig wlan0 up

And then start wireshark and select the wlan0 interface.

You can know the channel number while sniffing by looking at the radiotap header in the sniffer output; channel frequency translates to a channel number (1 to 1 correspondence). Moreover, the channel number appears in square brackets. Like:
    - channel frequency 2437 [BG 6]

The radiotap header is added in certain cases under monitor mode. It precedes the 802.11 header. It is done in ieee80211_add_rx_radiotap_header() in net/mac80211/rx.c.

ieee80211_add_rx_radiotap_header() is invoked from:
    ieee80211_rx_monitor().
    ieee80211_rx_cooked_monitor().
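The 1 to 1 correspondence between channel frequency and channel number mentioned above can be sketched as follows. This is only an illustration of the arithmetic; the function name freq_to_channel() is hypothetical, and the 5 GHz range accepted here (5180-5825 MHz) is an assumption made for the sketch, not a complete regulatory table.

```python
def freq_to_channel(freq_mhz):
    """Hypothetical helper: map a center frequency in MHz to a channel number."""
    if freq_mhz == 2484:
        # Channel 14 does not follow the 5 MHz spacing of channels 1-13.
        return 14
    if 2412 <= freq_mhz <= 2472 and (freq_mhz - 2407) % 5 == 0:
        # 2.4 GHz band: channel 1 is 2412 MHz, channels are 5 MHz apart.
        return (freq_mhz - 2407) // 5
    if 5180 <= freq_mhz <= 5825 and freq_mhz % 5 == 0:
        # 5 GHz band (range assumed for this sketch): channel = (f - 5000) / 5.
        return (freq_mhz - 5000) // 5
    raise ValueError("unsupported frequency: %d MHz" % freq_mhz)


if __name__ == "__main__":
    print(freq_to_channel(2437))  # 6, as in the "2437 [BG 6]" example
```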

You can know the MAC address of your wireless nic by:

    cat /sys/class/ieee80211/phy*/macaddress

• A station sends a null packet by calling ieee80211_send_nullfunc() (net/mac80211/mlme.c). The PM bit in the frame control of this packet is set. (IEEE80211_FCTL_PM bit)
• Each access point has an array of skbs for buffering unicast packets destined to the stations which enter power save mode.
• It is called ps_tx_buf (in struct sta_info; see net/mac80211/sta_info.h)

An access point also has a ps_bc_buf queue for multicast and broadcast packets.

ps_tx_buf can buffer up to 64 skbs. (STA_MAX_TX_BUFFER=64, in net/mac80211/sta_info.h) In case the buffer is filled, old skbs will be dropped.

When a station enters PS mode it turns off its RF. From time to time it turns the RF on, but only for receiving beacons.

An Access Point sends beacon frames periodically (usually about 10 beacons per second). Each beacon has a TIM (Traffic Indication Map) field.

ieee80211_rx_mgmt_beacon() handles receiving beacons. (net/mac80211/mlme.c).

A beacon is a management frame, represented by struct beacon, which is one of the members in a union (named "u") in the ieee80211_mgmt struct. The "variable" member of struct beacon represents the "Information Elements" this beacon can contain; "Information Elements" can be SSID, Supported rates, FH Params, DS Params, CF Params, IBSS Params, TIM, and more.
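The Information Elements carried at the end of a beacon are a sequence of (element ID, length, data) records. The following is a minimal user-space sketch of walking such a sequence, assuming that one-byte ID and one-byte length layout; the function name and the sample bytes are made up for the illustration, and the element ID constants follow the usual 802.11 numbering (SSID = 0, Supported rates = 1, DS Params = 3, TIM = 5).

```python
from typing import List, Tuple

# Element IDs (usual 802.11 numbering).
WLAN_EID_SSID = 0
WLAN_EID_SUPP_RATES = 1
WLAN_EID_DS_PARAMS = 3
WLAN_EID_TIM = 5


def parse_information_elements(buf: bytes) -> List[Tuple[int, bytes]]:
    """Hypothetical helper: split a buffer into (element ID, data) pairs."""
    elems = []
    pos = 0
    while pos + 2 <= len(buf):
        eid = buf[pos]
        elen = buf[pos + 1]
        if pos + 2 + elen > len(buf):
            break  # truncated element: stop instead of reading past the end
        elems.append((eid, buf[pos + 2:pos + 2 + elen]))
        pos += 2 + elen
    return elems


if __name__ == "__main__":
    # Made-up sample: SSID "test", two supported rates, DS Params channel 6.
    sample = bytes([0, 4]) + b"test" + bytes([1, 2, 0x82, 0x84]) + bytes([3, 1, 6])
    for eid, data in parse_information_elements(sample):
        print(eid, data)
```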

struct ieee802_11_elems represents the "Information Elements". It contains a structure called tim (ieee80211_tim_ie), representing the TIM (Traffic Indication Map).

The beacon handling method calls ieee80211_check_tim(), to see if the TIM has traffic for this station's AID. (ieee80211_check_tim() is implemented in include/linux/ieee80211.h)

When there is such traffic, ieee80211_send_pspoll() is used in order to send a PSPOLL packet to the AP.
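The TIM check described above amounts to testing one bit, selected by the station AID, in the partial virtual bitmap of the TIM element. The following user-space sketch models that test under the standard TIM layout (DTIM count, DTIM period, bitmap control, then the partial virtual bitmap); it is an illustration, not a copy of the kernel helper, and the function name check_tim() is hypothetical.

```python
def check_tim(tim: bytes, aid: int) -> bool:
    """Hypothetical helper: True if the TIM body signals buffered traffic for aid.

    tim is the body of the TIM element:
    DTIM count (1 byte), DTIM period (1 byte), bitmap control (1 byte),
    followed by at least one byte of partial virtual bitmap.
    """
    if len(tim) < 4:
        return False
    bitmap_ctrl = tim[2]
    virtual_map = tim[3:]
    aid &= 0x3FFF                      # the two top bits are not part of the AID
    index = aid // 8                   # octet of the full bitmap holding this AID
    mask = 1 << (aid & 7)              # bit inside that octet
    first = bitmap_ctrl & 0xFE         # first octet of the full bitmap present
    last = first + len(virtual_map) - 1
    if index < first or index > last:
        return False                   # AID falls outside the transmitted part
    return bool(virtual_map[index - first] & mask)


if __name__ == "__main__":
    # Bitmap offset 0, one bitmap octet with bit 5 set: traffic for AID 5.
    print(check_tim(bytes([0, 1, 0, 0b00100000]), 5))
```

Only the part of the bitmap between the first and last non-zero octets is transmitted, which is why the sketch has to take the offset in the bitmap control byte into account.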

PSPOLL packets are control packets. Note that in the following diagram, we do not show the ACK packets.

AD Hoc

Implementation of 802.11 AD Hoc is mainly in: net/mac80211/ibss.c
The ieee80211_if_ibss structure represents an AD hoc station.

802.11n

802.11n started with the High Throughput Study Group in about 2002.

In 802.11, each packet should be acknowledged. In 802.11n, we group packets in a block and acknowledge this block instead of acknowledging each packet separately. This improves performance. Grouping packets in a block in this way is called "packet aggregation" in 802.11n terminology.

There are two forms of aggregation:

AMPDU (the more common form)

AMPDU aggregation requires the use of block acknowledgement or BlockAck, which was introduced in 802.11e and has been optimized in 802.11n. MPDU stands for: MAC protocol data units.

802.11e is the quality-of-service extensions amendment. The 802.11e amendment deals with QoS; it introduced four queues for different types of traffic: voice traffic, video traffic, best-effort traffic and background traffic. The Linux implementation of 802.11e uses multiqueues. Traffic in a higher priority queue is transmitted before traffic in a lower priority queue.

AMSDU

With AMSDU, you make one big packet out of some packets. This big packet should be acked. Disadvantage: more risk of corruption of the big packet. Less in usage, fading out. MSDU stands for: MAC service data units.

Packet aggregation

• There are two sides to a block ack session: originator and recipient. Each block session has a different TID (traffic identifier).
• The originator starts the block acknowledge session by calling ieee80211_start_tx_ba_session() (net/mac80211/agg-tx.c)

ieee80211_tx_ba_session_handle_start() is a callback of ieee80211_start_tx_ba_session(). In this callback we send an ADDBA (add Block Acknowledgment) request packet, by invoking the ieee80211_send_addba_request() method (also in net/mac80211/agg-tx.c).

The ieee80211_send_addba_request() method builds a management action packet (the sub type is action, IEEE80211_STYPE_ACTION).

The response to the ADDBA request should be received within HZ jiffies, which is one second (ADDBA_RESP_INTERVAL, defined in net-next/net/mac80211/sta_info.h). In case we do not get a response in time, sta_addba_resp_timer_expired() will stop the BA session by calling ieee80211_stop_tx_ba_session().

When the other side (the recipient) receives the ADDBA request, it first sends an ACK. Then it processes the ADDBA request by calling ieee80211_process_addba_request() (net/mac80211/agg-rx.c); if everything is ok, it sets the aggregation state of this machine to operational (HT_AGG_STATE_OPERATIONAL), and sends an ADDBA Response by calling ieee80211_send_addba_resp().

After a session was started, a data block, containing multiple MPDU packets, is sent. Subsequently, the originator sends a Block Ack Request (BAR) packet by calling ieee80211_send_bar(). (net/mac80211/agg-tx.c)

The BAR is a control packet with the Block Ack Request subtype (IEEE80211_STYPE_BACK_REQ). The BAR packet includes the SSN (start sequence number), which is the sequence number of the oldest MSDU in the block which should be acknowledged.

The BAR (HT Block Ack Request) is defined in include/linux/ieee80211.h. Its start_seq_num member is initialized to the proper SSN.

There are two types of Block Ack: Immediate Block Ack and Delayed Block Ack.

Mac80211 debugfs support:

In order to have mac80211 debugfs support, the kernel should be built with CONFIG_MAC80211_DEBUGFS (and CONFIG_DEBUG_FS). Then after:

    mount -t debugfs debugfs /sys/kernel/debug

You can see debugfs entries under:

    /sys/kernel/debug/ieee80211/phy*

Open Firmware

The Atheros 802.11n USB chipset (AR9170) has open firmware; see http://www.linuxwireless.org/en/users/Drivers/ar9170.fw

802.11AC

The next generation of 802.11 is 802.11AC. Support for 802.11AC was added in the mac80211 stack. For example, ieee80211_ie_build_vht_cap() in net/mac80211/util.c, and struct ieee80211_vht_capabilities and struct ieee80211_vht_operation in include/linux/ieee80211.h.

VHT stands for: Very High Throughput

Development:

Sending patches should be done against the wireless-testing tree:

    git://git.kernel.org/pub/scm/linux/kernel/git/linville/wireless-testing.git

The maintainer of compat wireless is Luis R. Rodriguez.

Mesh networking (802.11s)

802.11s started as a Study Group of IEEE in September 2003, and became a Task Group named TGs in 2004. In 2006, two proposals, out of 15, (the "SEEMesh" and "Wi-Mesh" proposals) were merged into one. This is draft D0.01.

There are two topologies for mesh networks.

The first is full mesh; with full mesh, each node is connected to all the other nodes.

The second mesh topology is partial mesh. With partial mesh, nodes are connected to only some of the other nodes, not all. This topology is much more common in wireless mesh networks.

In 2.6.26, the network stack added support for the draft of wireless mesh networking (802.11s), thanks to the open80211s project. The open80211s project goal was to create the first open implementation of 802.11s. The project got some sponsorship from the OLPC project. Luis Carlos Cobo and Javier Cardona and other developers from Cozybit developed the Linux mac80211 mesh code. This code was merged into the Linux Kernel from the 2.6.26 release (July 2008).

There are some drivers in the Linux kernel with support for mesh networking (ath5k, b43, libertas_tf, p54, zd1211rw).

HWMP protocol

802.11s defines a default routing protocol called HWMP (Hybrid Wireless Mesh Protocol). The HWMP protocol works with layer 2 (MAC addresses) as opposed to an IPv4 routing protocol, for example, which works with layer 3 (IP addresses).

HWMP routing is based on two types of routing (hence it is called hybrid). The first is on demand routing and the second is proactive, dynamic routing. Currently only on demand routing is implemented in the Linux Kernel.

We have three types of messages with on demand routing. The first is PREQ (Path Request). This type of message is sent as a broadcast when we look for some destination, which we still do not have a route to. This PREQ message is propagated in the mesh until it gets to its destination. On each station until the final destination is reached, a lookup is performed (by mesh_path_lookup(), net/mac80211/mesh_pathtbl.c). In case the lookup fails, the PREQ is forwarded (as a broadcast). The PREQ message is sent in a management packet; its subtype is action (IEEE80211_STYPE_ACTION).
Then a PREP (Path Reply) unicast packet is sent. This packet is sent in the reverse path. The PREP message is also sent in a management packet; its subtype is also action (IEEE80211_STYPE_ACTION). It is handled by hwmp_prep_frame_process(). Both PREQ and PREP are sent in the mesh_path_sel_frame_tx() function.

If there is some failure on the way, a PERR (Path Error) is sent. A PERR message is sent by mesh_path_error_tx().

The route takes into consideration a radio-aware metric (airtime metric). The airtime metric is calculated in the airtime_link_metric_get() method, net/mac80211/mesh_hwmp.c (based on rate and other hardware parameters). Mesh Points continuously monitor their links and update metric values with neighbours.

The station which sent the PREQ may try to send packets to the final destination while still not knowing the route to that destination; these packets are kept in a buffer called frame_queue, which is a member of the mesh_path struct (net/mac80211/mesh.h). In such a case, when a PREP finally arrives, the pending packets of this buffer are sent to the final destination (by calling mesh_path_tx_pending()). The maximum number of frames buffered per destination for unresolved destinations is 10 (MESH_FRAME_QUEUE_LEN, defined in net/mac80211/mesh.h).

The advantages of mesh networking are:
    - Rapid deployment.
    - Minimal configuration; inexpensive.
    - Easy to deploy in hard-to-wire environments.
    - Connectivity while nodes are in motion.

The disadvantages:
    - Many broadcasts limit network performance.
    - Not all wireless drivers support mesh mode at the moment.
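The buffering of frames for a destination whose path is still being resolved by HWMP, as described earlier in this section, can be sketched in user space as follows. This is a simplified model and not the kernel code: the class MeshPathSketch, its methods and the transmit callback are hypothetical, only the limit of 10 frames comes from MESH_FRAME_QUEUE_LEN, and the sketch assumes that when the queue is full the oldest pending frame is the one discarded.

```python
from collections import deque

MESH_FRAME_QUEUE_LEN = 10  # limit quoted above from net/mac80211/mesh.h


class MeshPathSketch:
    """Hypothetical model of one mesh path entry with its pending frame queue."""

    def __init__(self, dst):
        self.dst = dst
        self.next_hop = None  # unknown until a PREP arrives
        # Assumption of this sketch: a full queue discards its oldest frame.
        self.frame_queue = deque(maxlen=MESH_FRAME_QUEUE_LEN)

    def send(self, frame, transmit):
        """Transmit right away if the path is known, otherwise queue the frame."""
        if self.next_hop is None:
            self.frame_queue.append(frame)
            return False
        transmit(self.next_hop, frame)
        return True

    def path_resolved(self, next_hop, transmit):
        """Called when a PREP arrives: flush the pending frames in order."""
        self.next_hop = next_hop
        while self.frame_queue:
            transmit(next_hop, self.frame_queue.popleft())


if __name__ == "__main__":
    sent = []
    path = MeshPathSketch("destination")
    for n in range(12):
        path.send(n, lambda hop, frame: sent.append((hop, frame)))
    path.path_resolved("neighbour", lambda hop, frame: sent.append((hop, frame)))
    print(len(sent))  # only the 10 most recent frames were kept and sent
```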

Tip for hacking mac80211 with OpenWrt:
-----------------------------------------------------------
The WRT54GL Linksys wireless router comes out of the factory with Linux. In case you want to hack mac80211 with OpenWrt, you can do it with backfire or with kamikaze, which are versions of OpenWrt.

In case of kamikaze, you will soon find out that with recent kamikaze releases (8.09.1 and 8.09.2), the wireless driver (kmod-b43) does not exist. For this reason "opkg install kmod-b43" fails on kamikaze 8.09.1 and kamikaze 8.09.2.

You can also use kamikaze 9.0.2 and build the Broadcom wireless driver as a kernel module. A simple way of achieving this is thus:

    make kernel_menuconfig

Then select driver/network/wireless/B43 (Broadcom 43xx wireless support (mac80211 stack)). CONFIG_B43 should be "m". Make sure that you also create mac80211.ko and cfg80211.ko:

    backfire_svn/build_dir/linux-brcm47xx/compat-wireless-<date>

The source files for b43 drivers selected in this way are under:

    build_dir/linux-brcm47xx/compat-wireless-<date>/drivers/net/wireless/b43

Tip:

When working with the b43 kernel module (b43.ko) it is enough to run

    make target/linux/compile

in order to create b43.ko (under build_dir/linux-brcm47xx/linux-2.6.32.27/drivers/net/wireless/b43/) and copy it.

The hostapd sources are under: build_dir/target-mipsel_uClibc-0.9.30.1/hostapd-full

Copy cfg80211.ko, mac80211.ko and b43.ko to the Linksys device.

Insert them in this order:

    insmod cfg80211.ko
    insmod mac80211.ko
    insmod b43.ko

iwconfig should show "wlan0".

When trying "ifconfig wlan0 up", in case you get an error about a missing firmware file, like this error message:

    b43-phy0 ERROR: Firmware file "b43/ucode5.fw" not found or load failed.

do as described in: http://linuxwireless.org/en/users/Drivers/b43#devicefirmware

In case you will try to scan, you will get:

    ifconfig wlan0 up
    iwlist wlan0 scan
    wlan0     Interface doesn't support scanning : Operation not supported

The b43 driver is, however, included in kamikaze 8.09 (so when booting with kamikaze 8.09 you do see the wireless interface when running iwconfig).

See this thread: https://forum.openwrt.org/viewtopic.php?id=22103 "Why is b43 driver missing in recent releases?"
See also under: http://downloads.openwrt.org/kamikaze/

Another tip:

In order to use driver=nl80211 in /etc/hostapd.conf, you should have, in the hostapd .config, before running "make":

    CONFIG_DRIVER_NL80211=y

Note that in some distributions CONFIG_DRIVER_NL80211 is not set in the hostapd package.

- In case there are any problems with burning an image and you cannot access the WRT54GL Linksys device, you can burn an image via tftp, in this way:

    tftp 192.168.1.1
    bin
    trace
    timeout 60
    rexmt 1
    put nameOfFirmwareFile

- When using this way, you should download the firmware from the Linksys site: http://homesupport.cisco.com/en-us/support/routers/WRT54GL

- In case you will try to burn an OpenWrt image, most likely you will get errors; like:

    ....
    ...
    tftp> put openwrt-brcm47xx-squashfs.trx
    received ACK <block=0>
    sent DATA <block=1, 512 bytes>
    received ACK <block=0>
    received ERROR <code=4, msg=code pattern incorrect>
    Error code 4: code pattern incorrect
    ...
    ...

OpenFWWF website: http://www.ing.unibs.it/~openfwwf/ - also for WRT54GL.

Building a firmware for b43 is simple: you download b43-tools and the b43 firmware. From b43-tools/assembler you run "make && make install". (You only need the assembler for building the b43 firmware.)

In case you get the following error:

    b43-tools/assembler># make
    CC b43-asm.bin
    /usr/bin/ld: cannot find -lfl

make sure that flex-static and flex are installed. (yum install flex-static flex)

Then simply go to the folder where you extracted the firmware, and run "make". A file named "ucode5.fw" will be generated.

With b43 on the WRT54GL, we use SSB_BUSTYPE_SSB. This means that in b43_wireless_core_start() (drivers/net/wireless/b43/main.c), dev->dev->bus->bustype is SSB_BUSTYPE_SSB and we call request_threaded_irq() and not b43_sdio_request_irq(). (The other possibilities are SSB_BUSTYPE_PCI, SSB_BUSTYPE_PCMCIA or SSB_BUSTYPE_SDIO).

b43/b43legacy Linux driver discussions: http://lists.infradead.org/mailman/listinfo/b43-dev
    - Patches which are sent to this mailing list are also sent to the Linux kernel wireless mailing list.

ath9k-devel mailing list: http://www.mail-archive.com/ath9k-devel@lists.ath9k.org/index.html

TBD: The following downloads 8.09.2 and not 8.09; how do you get 8.09 and not 8.09.2?

You can download kamikaze 8.09 by:

    svn co svn://svn.openwrt.org/openwrt/branches/8.09

OpenWrt repositories are in the following link: https://dev.openwrt.org/wiki/GetSource

RFKILL

rfkill is a simple tool for accessing the Linux rfkill device interface, which is used to enable and disable wireless networking devices, typically WLAN, Bluetooth and mobile broadband.

    rfkill list      will list the status of rfkill.
    rfkill block     to set a soft lock
    rfkill unblock   to clear a soft lock

see: http://www.linuxwireless.org/en/users/Documentation/rfkill

WiMAX

LTE will undoubtedly be the 4G technology. There is a WiMAX solution in the Linux kernel, though.

WiMAX (Worldwide interoperability for Microwave access) is based on the IEEE 802.16 standard. It is a wireless solution for broadband WAN (Wide Area Network). There are about 200 WiMAX projects around the world. WiMAX products can accommodate fixed and mobile usage models.

There is a WiMAX Linux git tree, maintained by Inaky Perez-Gonzalez from Intel. In the past, Inaky was involved in developing the Linux USB stack and the Linux UWB (Ultra Wideband) stack. The WiMAX stack and driver were accepted in mainline for 2.6.29 in January 2009.

The WiMAX support in Linux consists of a kernel module (net/wimax/wimax.ko), device-specific drivers under it, and a user space management stack, WiMAX Network Service.

There was in the past an initiative from Nokia for a WiMAX stack for Linux, but it is not integrated currently. Also, work was done on a D-Bus interface to the WiMAX stack, which will help user space tools manage the WiMAX stack.

There is currently one WiMAX driver in the Linux tree, the Intel WiMAX Connection 2400 over USB driver (which supports any of the Intel Wireless WiMAX/WiFi Link 5x50 series).

The WiMAX stack uses the generic netlink protocol mechanism to send and receive netlink messages to and from userspace. Free form messages can be sent back and forth between driver/device and user space.

batman-adv

"B.A.T.M.A.N. Advanced Meshing Protocol" is a routing protocol for multi-hop ad-hoc mesh networks. The networks may be wired or wireless. Implementation is in net/batman-adv. See http://www.open-mesh.org/

Wireless Summit (2012)

http://wireless.kernel.org/en/developers/Summits/Barcelona-2012
Will deal with 802.11, 802.15.4 stack (6lowpan), Bluetooth, NFC, and more.

802.15.4 stack (6lowpan)

6LoWPAN stands for: "IPv6 over Low power Wireless Personal Area Networks"

Lecture slides of the 802.15.4 lecture by Alan Ott:
http://elinux.org/images/7/71/Wireless_Networking_with_IEEE_802.15.4_and_6LoWPAN.pdf

IEEE 802.15.4

IEEE standard 802.15.4 is for wireless personal area networks (WPAN). Implementation in the Linux kernel tree: net/ieee802154/

The maintainers of the IEEE 802.15.4 SUBSYSTEM are Alexander Smirnov and Dmitry Eremin-Solenikov.
Web site: http://sourceforge.net/apps/trac/linux-zigbee
Git tree: git://git.kernel.org/pub/scm/linux/kernel/git/lowpan/lowpan.git

compat-wireless

compat-wireless is a backport of the wireless stack from newer kernels to older ones.

Wi-Fi Direct, previously known as Wi-Fi P2P, is a standard that allows Wi-Fi devices to connect to each other without the need for an Access Point.

CRDA stands for "Central Regulatory Domain Agent". It is based on nl80211 and udev. You can download the source code by:

    git clone git://github.com/mcgrof/crda.git

Mostly written by Luis R. Rodriguez (mcgrof@qca.qualcomm.com).
see: http://www.linuxwireless.org/en/developers/Regulatory/CRDA

Links:

"Linux wireless networking", article from 2004
http://www.ibm.com/developerworks/library/wi-enable/index.html

Updated standard (2012)
http://standards.ieee.org/getieee802/download/802.11-2012.pdf

Books:

802.11 Wireless Networks: The Definitive Guide, 2nd Edition
By Matthew Gast
Publisher: O'Reilly Media, 2005

802.11n: A Survival Guide
By Matthew Gast
Publisher: O'Reilly Media, 2012

TBD: ACS (Automatic Channel Selection)

Useful tips:

Printing IP address:

    __be32 ipAddr;
    printk("ipAddr = %pI4\n", &ipAddr);

when u32 ipAddr: TBD.

If you want immediate UDP traffic, you can use traceroute. Remember that the destination port is incremented by 1 for each sent packet.

You can also generate raw UDP traffic with traceroute, by:

    traceroute -P <protocol>

(The default protocol is 253, see rfc3692.)

wireshark tip: Sometimes you see in the wireshark sniffer that the amount of "Bytes on wire" is larger than the MTU of the network card. This is probably due to using Jumbo packets or offloading.

Links and more info

1) Understanding the Linux Kernel, Second Edition, by Daniel P. Bovet, Marco Cesati. Second Edition, December 2002. Chapter 18: networking.
   Understanding Linux Network Internals, Christian Benvenuti, O'Reilly: contains all details of the Linux networking stack.
2) Linux Device Drivers, by Jonathan Corbet, Alessandro Rubini, Greg Kroah-Hartman. Third Edition, February 2005.
   - Chapter 17, Network Drivers
3) Linux networking: (a lot of docs about specific networking topics)
   - http://linuxnet.osdl.org/index.php/Main_Page
4) LCE: Challenges for Linux networking, by Jonathan Corbet, November 7, 2012: http://lwn.net/Articles/523058/
5) netdev mailing list: http://www.spinics.net/lists/netdev/
6) Removal of multipath routing cache from kernel code:
   http://lists.openwall.net/netdev/2007/03/12/76
   http://lwn.net/Articles/241465/
7) Linux Advanced Routing & Traffic Control: http://lartc.org/
8) ebtables - a filtering tool for a bridging: http://ebtables.sourceforge.net/
9) Writing Network Device Driver for Linux: (article)
   - http://app.linux.org.mt/article/writingnetdrivers?locale=en
10) Netconf - a yearly networking conference; first was in 2004.
   - http://vger.kernel.org/netconf2004.html
   - http://vger.kernel.org/netconf2005.html
   - http://vger.kernel.org/netconf2006.html
   - Linux Conf Australia, January 2008, Melbourne
   - http://vger.kernel.org/netconf2010.html
   - http://vger.kernel.org/netconf2011.html
11) http://www.policyrouting.org/PolicyRoutingBook/
12) TRASH - A dynamic LC-trie and hash data structure: Robert Olsson, Stefan Nilsson, August 2006
    http://www.csc.kth.se/~snilsson/public/papers/trash/trash.pdf
13) IPSec howto: http://www.ipsechowto.org/t1.html
14) Openswan: Building and Integrating Virtual Private Networks, by Paul Wouters, Ken Bantoft
    http://www.packtpub.com/book/openswan/mid/061205jqdnh2by
    Publisher: Packt Publishing.
15) http://www.vyatta.com/ - Open-Source Networking
16) For a very basic description of the network stack, see [1].
17) http://www.ibm.com/developerworks/linux/library/l-linux-networking-stack/ gives an overview of the networking stack.
18) http://www.makelinux.net/reference is a general reference for Linux kernel internals.
19) This Linux Journal article by Alan Cox is an overall introduction to the networking kernel.
20) Receive packet steering (RPS): http://lwn.net/Articles/362339/
    RPS and RFS: http://lwn.net/Articles/398385/
    Receive flow steering: http://lwn.net/Articles/382428/
    xps: Transmit Packet Steering: http://lwn.net/Articles/412062/
21) Application for zero copy: http://netsniff-ng.org/ (trafgen; uses PF_PACKET RAW sockets and the sendto() sys call)
22) splice tools: http://brick.kernel.dk/snaps/splice-git-latest.tar.gz
    network splice receive: http://lwn.net/Articles/236918/
23) Network namespaces - by Jonathan Corbet: http://lwn.net/Articles/219794/
24) The initial change to napi_struct is explained in http://lwn.net/Articles/244640/
25) How GRO works, by David Miller: http://vger.kernel.org/~davem/cgi-bin/blog.cgi/2010/08/30
26) A JIT for packet filters, by Jonathan Corbet, April 12, 2011: http://lwn.net/Articles/437981/
27) dynamic seccomp policies (using BPF filters): http://lwn.net/Articles/475019
28) LAN Ethernet Maximum Rates, Generation, Capturing & Monitoring:
    http://wiki.networksecuritytoolkit.org/nstwiki/index.php/LAN_Ethernet_Maximum_Rates,_Generation,_Capturing_%26_Monitoring
29) Network data flow through kernel - diagram:
    http://www.linuxfoundation.org/images/1/1c/Network_data_flow_through_kernel.png
30) The TCP/IP Guide: online book: http://www.tcpipguide.com/free/index.htm
31) Quagga: http://www.nongnu.org/quagga/
32) Communicating between the kernel and user-space in Linux using Netlink sockets (Netlink article): http://1984.lsi.us.es/~pablo/docs/spae.pdf
33) generic netlink sockets:
    https://www.linuxfoundation.org/collaborate/workgroups/networking/generic_netlink_howto
34) Convert and locate IP addresses: http://www.kloth.net/services/iplocate.php

kernel networking repositories:

To clone the stable tree you should run:

    git clone git://git.kernel.org/pub/scm/linux/kernel/git/davem/net.git

To clone net-next you should run:

    git clone git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next.git

Rami Rosen
