
Table of Contents

Summary

Introduction   1.1
Booting   1.2
    From bootloader to kernel   1.2.1
    First steps in the kernel setup code   1.2.2
    Video mode initialization and transition to protected mode   1.2.3
    Transition to 64-bit mode   1.2.4
    Kernel decompression   1.2.5
Initialization   1.3
    First steps in the kernel   1.3.1
    Early interrupts handler   1.3.2
    Last preparations before the kernel entry point   1.3.3
    Kernel entry point   1.3.4
    Continue architecture-specific boot-time initializations   1.3.5
    Architecture-specific initializations, again   1.3.6
    End of the architecture-specific initializations, almost   1.3.7
    Scheduler initialization   1.3.8
    RCU initialization   1.3.9
    End of initialization   1.3.10
Interrupts   1.4
    Introduction   1.4.1
    Start to dive into interrupts   1.4.2
    Interrupt handlers   1.4.3
    Initialization of non-early interrupt gates   1.4.4
    Implementation of some exception handlers   1.4.5
    Handling Non-Maskable interrupts   1.4.6
    Dive into external hardware interrupts   1.4.7
    Initialization of external hardware interrupts structures   1.4.8
    Softirq, Tasklets and Workqueues   1.4.9
    Last part   1.4.10
System calls   1.5
    Introduction to system calls   1.5.1
    How the Linux kernel handles a system call   1.5.2
    vsyscall and vDSO   1.5.3
    How the Linux kernel runs a program   1.5.4
Timers and time management   1.6
    Introduction   1.6.1
    Clocksource framework   1.6.2
    The tick broadcast framework and dyntick   1.6.3
    Introduction to timers   1.6.4
    Clockevents framework   1.6.5
    x86 related clock sources   1.6.6
    Time related system calls   1.6.7
Synchronization primitives   1.7
    Introduction to spinlocks   1.7.1
    Queued spinlocks   1.7.2
    Semaphores   1.7.3
    Mutex   1.7.4
    Reader/Writer semaphores   1.7.5
    SeqLock   1.7.6
    RCU   1.7.7
    Lockdep   1.7.8
Memory management   1.8
    Memblock   1.8.1
    Fixmaps and ioremap   1.8.2
    kmemcheck   1.8.3
Cgroups   1.9
    Introduction   1.9.1
Concepts   1.10
    Per-CPU variables   1.10.1
    Cpumasks   1.10.2
    The initcall mechanism   1.10.3
    Notification Chains in Linux Kernel   1.10.4
Data Structures in the Linux Kernel   1.11
    Doubly linked list   1.11.1
    Radix tree   1.11.2
    Bit arrays   1.11.3
Theory   1.12
    Paging   1.12.1
    Elf64   1.12.2
    Inline assembly   1.12.3
    CPUID   1.12.4
    MSR   1.12.5
Initial ram disk   1.13
    initrd   1.13.1
Misc   1.14
    How the kernel is compiled   1.14.1
    Linkers   1.14.2
    Linux kernel development   1.14.3
    Program startup process in userspace   1.14.4
    Write and Submit your first Linux kernel Patch   1.14.5
    Data types in the kernel   1.14.6
Kernel Structures   1.15
    IDT   1.15.1
Useful links   1.16
Contributors   1.17

Introduction

linux-insides

A book-in-progress about the Linux kernel and its insides.

The goal is simple: to share my modest knowledge about the insides of the Linux kernel and help people who are interested in Linux kernel insides and other low-level subject matter.

Questions/Suggestions: Feel free to ask any questions or make suggestions by pinging me on twitter @0xAX, adding an issue, or just dropping me an email.

Support

If you like linux-insides you can support me with:

On other languages

Russian
Chinese
Spanish

LICENSE

Contributions

Feel free to create issues or pull-requests if you have any problems.

Please read CONTRIBUTING.md before pushing any changes.

Author

@0xAX

Booting

Kernel Boot Process

This chapter describes the Linux kernel boot process. Here you will see a couple of posts which describe the full cycle of the kernel loading process:

From the bootloader to kernel - describes all stages from turning on the computer to running the first instruction of the kernel.
First steps in the kernel setup code - describes first steps in the kernel setup code. You will see heap initialization, query of different parameters like EDD, IST, etc.
Video mode initialization and transition to protected mode - describes video mode initialization in the kernel setup code and transition to protected mode.
Transition to 64-bit mode - describes preparation for transition into 64-bit mode and details of transition.
Kernel Decompression - describes preparation before kernel decompression and details of direct decompression.

From bootloader to kernel

Kernel booting process. Part 1.

From the bootloader to the kernel

If you have been reading my previous blog posts, then you can see that, for some time, I have been starting to get involved in low-level programming. I have written some posts about x86_64 assembly programming for Linux and, at the same time, I have also started to dive into the Linux source code. I have a great interest in understanding how low-level things work, how programs run on my computer, how they are located in memory, how the kernel manages processes & memory, how the network stack works at a low level, and many many other things. So, I have decided to write yet another series of posts about the Linux kernel for x86_64.

Note that I'm not a professional kernel hacker and I don't write code for the kernel at work. It's just a hobby. I just like low-level stuff, and it is interesting for me to see how these things work. So if you notice anything confusing, or if you have any questions/remarks, ping me on twitter 0xAX, drop me an email or just create an issue. I appreciate it. All posts will also be accessible at linux-insides and, if you find something wrong with my English or the post content, feel free to send a pull request.

Note that this isn't official documentation, just learning and sharing knowledge.

Required knowledge

Understanding C code
Understanding assembly code (AT&T syntax)

Anyway, if you are just starting to learn some of these tools, I will try to explain some parts during this and the following posts. Alright, this is the end of the simple introduction, and now we can start to dive into the kernel and low-level stuff.

All code is actually for the 3.18 kernel. If there are changes, I will update the posts accordingly.

The Magical Power Button, What happens next?

Although this is a series of posts about the Linux kernel, we will not be starting from the kernel code - at least not in this paragraph. As soon as you press the magical power button on your laptop or desktop computer, it starts working. The motherboard sends a signal to the power supply. After receiving the signal, the power supply provides the proper amount of electricity to the computer. Once the motherboard receives the power good signal, it tries to start the CPU. The CPU resets all leftover data in its registers and sets up predefined values for each of them.

80386 and later CPUs define the following predefined data in CPU registers after the computer resets:

    IP           0xfff0
    CS selector  0xf000
    CS base      0xffff0000

The processor starts working in real mode. Let's back up a little and try to understand memory segmentation in this mode. Real mode is supported on all x86-compatible processors, from the 8086 all the way to the modern Intel 64-bit CPUs. The 8086 processor has a 20-bit address bus, which means that it could work with a 0-0xFFFFF address space (1 megabyte). But it only has 16-bit registers, which have a maximum address of 2^16 - 1 or 0xffff (64 kilobytes). Memory segmentation is used to make use of all the address space available. All memory is divided into small, fixed-size segments of 65536 bytes (64 KB). Since we cannot address memory above 64 KB with 16 bit registers, an alternate method is devised. An address consists of two parts: a segment selector, which has a base address, and an offset from this base address. In real mode, the associated base address of a segment selector is Segment Selector * 16. Thus, to get a physical address in memory, we need to multiply the segment selector part by 16 and add the offset:

    PhysicalAddress = Segment Selector * 16 + Offset

For example, if CS:IP is 0x2000:0x0010, then the corresponding physical address will be:

    >>> hex((0x2000 << 4) + 0x0010)
    '0x20010'

But, if we take the largest segment selector and offset, 0xffff:0xffff, then the resulting address will be:

    >>> hex((0xffff << 4) + 0xffff)
    '0x10ffef'

which is 65520 bytes past the first megabyte. Since only one megabyte is accessible in real mode, 0x10ffef becomes 0x00ffef with disabled A20.
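The address arithmetic above can be wrapped in a small Python function. This is only a sketch: the names physical_address and a20_enabled are made up for this example, and the wraparound with a disabled A20 line is modeled by simply masking the result to 20 bits.

```python
def physical_address(segment: int, offset: int, a20_enabled: bool = False) -> int:
    """Translate a real-mode segment:offset pair into a physical address."""
    if not (0 <= segment <= 0xFFFF and 0 <= offset <= 0xFFFF):
        raise ValueError("segment and offset must both fit in 16 bits")
    # The segment base is the selector multiplied by 16 (shifted left by 4).
    linear = (segment << 4) + offset
    if not a20_enabled:
        # With A20 disabled only 20 address lines are used, so the
        # address wraps around at the 1 MB boundary.
        linear &= 0xFFFFF
    return linear


print(hex(physical_address(0x2000, 0x0010)))                    # 0x20010
print(hex(physical_address(0xFFFF, 0xFFFF)))                    # 0xffef
print(hex(physical_address(0xFFFF, 0xFFFF, a20_enabled=True)))  # 0x10ffef
```

The third call shows the same address without the 20-bit mask, which matches the 0x10ffef value computed above.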

Ok, now we know about real mode and memory addressing. Let's get back to discussing register values after reset:

The CS register consists of two parts: the visible segment selector, and the hidden base address. While the base address is normally formed by multiplying the segment selector value by 16, during a hardware reset the segment selector in the CS register is loaded with 0xf000 and the base address is loaded with 0xffff0000; the processor uses this special base address until CS is changed.

The starting address is formed by adding the base address to the value in the EIP register:

    >>> hex(0xffff0000 + 0xfff0)
    '0xfffffff0'

We get 0xfffffff0, which is 16 bytes below 4GB. This point is called the Reset vector. This is the memory location at which the CPU expects to find the first instruction to execute after reset. It contains a jump (jmp) instruction that usually points to the BIOS entry point. For example, if we look in the coreboot source code, we see:

        .section ".reset"
        .code16
    .globl reset_vector
    reset_vector:
        .byte  0xe9
        .int   _start - ( . + 2 )

Here we can see the jmp instruction opcode, which is 0xe9, and its destination address at _start - ( . + 2). We can also see that the reset section is 16 bytes, and that it starts at 0xfffffff0:

    SECTIONS {
        _ROMTOP = 0xfffffff0;
        . = _ROMTOP;
        .reset . : {
            *(.reset)
            . = 15 ;
            BYTE(0x00);
        }
    }

Now the BIOS starts; after initializing and checking the hardware, the BIOS needs to find a bootable device. A boot order is stored in the BIOS configuration, controlling which devices the BIOS attempts to boot from. When attempting to boot from a hard drive, the BIOS tries to find a boot sector. On hard drives partitioned with an MBR partition layout, the boot sector is stored in the first 446 bytes of the first sector, where each sector is 512 bytes. The final two bytes of the first sector are 0x55 and 0xaa, which designates to the BIOS that this device is bootable. For example:

    ;
    ; Note: this example is written in Intel Assembly syntax
    ;
    [BITS 16]
    [ORG 0x7c00]

    boot:
        mov al, '!'
        mov ah, 0x0e
        mov bh, 0x00
        mov bl, 0x07

        int 0x10
        jmp $

    times 510-($-$$) db 0

    db 0x55
    db 0xaa

Build and run this with:

    nasm -f bin boot.nasm && qemu-system-x86_64 boot

This will instruct QEMU to use the boot binary that we just built as a disk image. Since the binary generated by the assembly code above fulfills the requirements of the boot sector (the origin is set to 0x7c00 and we end with the magic sequence), QEMU will treat the binary as the master boot record (MBR) of a disk image.

You will see QEMU boot from the image and print the ! symbol.

In this example we can see that the code will be executed in 16 bit real mode and will start at 0x7c00 in memory. After starting, it calls the 0x10 interrupt, which just prints the ! symbol; it pads the rest of the first 510 bytes with zeros and finishes with the two magic bytes 0x55 and 0xaa.

You can see a binary dump of this using the objdump utility:

    nasm -f bin boot.nasm
    objdump -D -b binary -mi386 -Maddr16,data16,intel boot
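The layout of such a boot sector can also be sketched in Python. This is an illustration only: the opcode bytes are a hand encoding of the instructions from the listing (not nasm output captured from a real run), and the helper names build_boot_sector and is_bootable are made up for this example.

```python
SECTOR_SIZE = 512
BOOT_SIGNATURE = bytes([0x55, 0xAA])

# Hand-encoded 16-bit machine code for the instructions in the listing.
CODE = bytes([
    0xB0, 0x21,  # mov al, '!'
    0xB4, 0x0E,  # mov ah, 0x0e
    0xB7, 0x00,  # mov bh, 0x00
    0xB3, 0x07,  # mov bl, 0x07
    0xCD, 0x10,  # int 0x10
    0xEB, 0xFE,  # jmp $
])


def build_boot_sector(code: bytes) -> bytes:
    """Pad the code with zeros up to byte 510 and append the signature."""
    padding_len = SECTOR_SIZE - len(BOOT_SIGNATURE) - len(code)
    if padding_len < 0:
        raise ValueError("code does not fit into one sector")
    return code + bytes(padding_len) + BOOT_SIGNATURE


def is_bootable(sector: bytes) -> bool:
    """Check the size and the two magic bytes at offsets 510 and 511."""
    return len(sector) == SECTOR_SIZE and sector[510:512] == BOOT_SIGNATURE


sector = build_boot_sector(CODE)
print(len(sector), is_bootable(sector))
```

The padding step plays the role of the times 510-($-$$) db 0 line, and the check in is_bootable is the same test the text says the BIOS applies to the last two bytes of the sector.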

A real-world boot sector has code for continuing the boot process and a partition table instead of a bunch of 0's and an exclamation mark :) From this point onwards, the BIOS hands over control to the bootloader.

NOTE: As explained above, the CPU is in real mode; in real mode, calculating the physical address in memory is done as follows:

    PhysicalAddress = Segment Selector * 16 + Offset

just as explained before. We have only 16 bit general purpose registers; the maximum value of a 16 bit register is 0xffff, so if we take the largest values, the result will be:

    >>> hex((0xffff * 16) + 0xffff)
    '0x10ffef'

where 0x10ffef is the last byte of a range of 1MB + 64KB - 16 bytes. An 8086 processor (which was the first processor with real mode), in contrast, has a 20 bit address line. Since 2^20 = 1048576 is 1MB, this means that the actual available memory is 1MB.

General real mode's memory map is as follows:

    0x00000000 - 0x000003FF - Real Mode Interrupt Vector Table
    0x00000400 - 0x000004FF - BIOS Data Area
    0x00000500 - 0x00007BFF - Unused
    0x00007C00 - 0x00007DFF - Our Bootloader
    0x00007E00 - 0x0009FFFF - Unused
    0x000A0000 - 0x000BFFFF - Video RAM (VRAM) Memory
    0x000B0000 - 0x000B7FFF - Monochrome Video Memory
    0x000B8000 - 0x000BFFFF - Color Video Memory
    0x000C0000 - 0x000C7FFF - Video ROM BIOS
    0x000C8000 - 0x000EFFFF - BIOS Shadow Area
    0x000F0000 - 0x000FFFFF - System BIOS

In the beginning of this post, I wrote that the first instruction executed by the CPU is located at address 0xFFFFFFF0, which is much larger than 0xFFFFF (1MB). How can the CPU access this address in real mode? The answer is in the coreboot documentation:

    0xFFFE_0000 - 0xFFFF_FFFF: 128 kilobyte ROM mapped into address space

At the start of execution, the BIOS is not in RAM, but in ROM.

Bootloader

There are a number of bootloaders that can boot Linux, such as GRUB 2 and syslinux. The Linux kernel has a Boot protocol which specifies the requirements for a bootloader to implement Linux support. This example will describe GRUB 2.

Continuing from before, now that the BIOS has chosen a boot device and transferred control to the boot sector code, execution starts from boot.img. This code is very simple, due to the limited amount of space available, and contains a pointer which is used to jump to the location of GRUB 2's core image. The core image begins with diskboot.img, which is usually stored immediately after the first sector in the unused space before the first partition. The above code loads the rest of the core image, which contains GRUB 2's kernel and drivers for handling filesystems, into memory. After loading the rest of the core image, it executes the grub_main function.

grub_main initializes the console, gets the base address for modules, sets the root device, loads/parses the grub configuration file, loads modules, etc. At the end of execution, grub_main moves grub to normal mode. grub_normal_execute (from grub-core/normal/main.c) completes the final preparations and shows a menu to select an operating system. When we select one of the grub menu entries, grub_menu_execute_entry runs, executing the grub boot command and booting the selected operating system.

As we can read in the kernel boot protocol, the bootloader must read and fill some fields of the kernel setup header, which starts at the 0x01f1 offset from the kernel setup code. The kernel header arch/x86/boot/header.S starts from:

        .globl hdr
    hdr:
        setup_sects: .byte 0
        root_flags:  .word ROOT_RDONLY
        syssize:     .long 0
        ram_size:    .word 0
        vid_mode:    .word SVGA_MODE
        root_dev:    .word 0
        boot_flag:   .word 0xAA55

The bootloader must fill this and the rest of the headers (which are only marked as being type write in the Linux boot protocol, such as in this example) with values which it has either received from the command line or calculated. (We will not go over full descriptions and explanations for all fields of the kernel setup header now, but instead when we discuss how the kernel uses them; you can find a description of all fields in the boot protocol.)
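To make the header layout more concrete, here is a minimal Python sketch that reads the seven fields shown above from a buffer. It assumes the fields are packed little-endian with no padding, exactly in the order and with the sizes of the listing (.byte, .word, .long). The function name parse_setup_header and the test image are made up for this example, and the values written into the image are arbitrary placeholders, not the real ROOT_RDONLY or SVGA_MODE constants.

```python
import struct

HDR_OFFSET = 0x01F1
# .byte, .word, .long, .word, .word, .word, .word (little-endian, unpadded)
HDR_FORMAT = "<BHIHHHH"
FIELD_NAMES = (
    "setup_sects", "root_flags", "syssize",
    "ram_size", "vid_mode", "root_dev", "boot_flag",
)


def parse_setup_header(image: bytes) -> dict:
    """Read the first fields of the setup header at offset 0x01f1."""
    values = struct.unpack_from(HDR_FORMAT, image, HDR_OFFSET)
    return dict(zip(FIELD_NAMES, values))


# Synthetic 512-byte image with placeholder values, for illustration only.
image = bytearray(512)
struct.pack_into(HDR_FORMAT, image, HDR_OFFSET, 4, 1, 0, 0, 0xFFFF, 0, 0xAA55)

header = parse_setup_header(bytes(image))
print(header["setup_sects"], hex(header["boot_flag"]))
```

Note that these seven fields take 15 bytes, so boot_flag ends up in the last two bytes of the first 512-byte sector, at the same place as the 0x55 0xaa signature of a boot sector.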

As we can see in the kernel boot protocol, the memory map will be the following after loading the kernel:

             | Protected-mode kernel  |
    100000   +------------------------+
             | I/O memory hole        |
    0A0000   +------------------------+
             | Reserved for BIOS      | Leave as much as possible unused
             ~                        ~
             | Command line           | (Can also be below the X+10000 mark)
    X+10000  +------------------------+
             | Stack/heap             | For use by the kernel real-mode code.
    X+08000  +------------------------+
             | Kernel setup           | The kernel real-mode code.
             | Kernel boot sector     | The kernel legacy boot sector.
           X +------------------------+
             | Boot loader            | <- Boot sector entry point 0x7C00
    001000   +------------------------+
             | Reserved for MBR/BIOS  |
    000800   +------------------------+
             | Typically used by MBR  |
    000600   +------------------------+
             | BIOS use only          |
    000000   +------------------------+

So, when the bootloader transfers control to the kernel, it starts at:

    0x1000 + X + sizeof(KernelBootSector) + 1

where X is the address of the kernel boot sector being loaded. In my case, X is 0x10000, as can be seen in a memory dump.

The bootloader has now loaded the Linux kernel into memory, filled the header fields, and then jumped to the corresponding memory address. We can now move directly to the kernel setup code.

Start of Kernel Setup

Finally, we are in the kernel! Technically, the kernel hasn't run yet; first, we need to set up the kernel, memory manager, process manager, etc. Kernel setup execution starts from arch/x86/boot/header.S at _start. It is a little strange at first sight, as there are several instructions before it.

A long time ago, the Linux kernel used to have its own bootloader. Now, however, if you run, for example,

    qemu-system-x86_64 vmlinuz-3.18-generic

then you will see an error message telling you to use a boot loader.

Actually, header.S starts with the MZ signature, followed by the code that prints the error message and then the PE header:

    #ifdef CONFIG_EFI_STUB
    # "MZ", MS-DOS header
    .byte 0x4d
    .byte 0x5a
    #endif

    pe_header:
        .ascii "PE"
        .word 0

It needs this to load an operating system with UEFI. We won't be looking into its inner workings right now and will cover it in upcoming chapters.

The actual kernel setup entry point is:

    // header.S line 292
    .globl _start
    _start:

The bootloader (grub2 and others) knows about this point (0x200 offset from MZ) and makes a jump directly to it, despite the fact that header.S starts from the .bstext section, which prints an error message:

    //
    // arch/x86/boot/setup.ld
    //
    . = 0;                    // current position
    .bstext : { *(.bstext) }  // put .bstext section to position 0
    .bsdata : { *(.bsdata) }

The kernel setup entry point is:

        .globl _start
    _start:
        .byte 0xeb
        .byte start_of_setup-1f
    1:
        //
        // rest of the header
        //

Here we can see a jmp instruction opcode (0xeb) that jumps to the start_of_setup-1f point. In Nf notation, 2f refers to the following local 2: label; in our case, it is label 1 that is present right after the jump, and it contains the rest of the setup header. Right after the setup header, we see the .entrytext section, which starts at the start_of_setup label.

This is the first code that actually runs (aside from the previous jump instructions, of course). After the kernel setup receives control from the bootloader, the first jmp instruction is located at the 0x200 offset from the start of the kernel real mode code, i.e., after the first 512 bytes. This we can both read in the Linux kernel boot protocol and see in the grub2 source code:

    segment = grub_linux_real_target >> 4;
    state.gs = state.fs = state.es = state.ds = state.ss = segment;
    state.cs = segment + 0x20;

This means that segment registers will have the following values after kernel setup starts:

    gs = fs = es = ds = ss = 0x1000
    cs = 0x1020

In my case, the kernel is loaded at 0x10000.
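The same arithmetic can be checked with a few lines of Python. This is just a sketch of the calculation from the grub2 snippet; the function name initial_segments is made up, and 0x10000 is the load address from my case.

```python
def initial_segments(real_target: int) -> dict:
    """Mirror the grub2 snippet: data segments at the load address, cs 0x20 higher."""
    segment = real_target >> 4
    regs = {name: segment for name in ("gs", "fs", "es", "ds", "ss")}
    regs["cs"] = segment + 0x20
    return regs


regs = initial_segments(0x10000)
# Physical address of cs:0, i.e. where execution starts.
entry = (regs["cs"] << 4) + 0
print(hex(regs["ds"]), hex(regs["cs"]), hex(entry))
```

Adding 0x20 to the segment moves the base by 0x20 * 16 = 0x200 bytes, which is exactly the 512 byte offset of _start from the beginning of the kernel setup code.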

After the jump to start_of_setup, the kernel needs to do the following:

Make sure that all segment register values are equal
Set up a correct stack, if needed
Set up bss
Jump to the C code in main.c

Let's look at the implementation.

Segment registers align

First of all, the kernel ensures that the ds and es segment registers point to the same address. Next, it clears the direction flag using the cld instruction:

    movw    %ds, %ax
    movw    %ax, %es
    cld

As I wrote earlier, grub2 loads the kernel setup code at address 0x10000 and sets cs to 0x1020, because execution doesn't start from the start of the file, but from the

    _start:
        .byte 0xeb
        .byte start_of_setup-1f

jump, which is at a 512 byte offset from 4d 5a. The kernel also needs to align cs from 0x1020 to 0x1000 (moving its base from 0x10200 to 0x10000), so that it matches all the other segment registers. This is done with a far return:

    pushw   %ds
    pushw   $6f
    lretw

which pushes the value of ds to the stack, followed by the address of the 6 label, and executes the lretw instruction. When the lretw instruction is called, it loads the address of label 6 into the instruction pointer register and loads cs with the value of ds. Afterwards, ds and cs will have the same values.

Stack Setup

Almost all of the setup code is in preparation for the C language environment in real mode. The next step is checking the ss register value and making a correct stack if ss is wrong:

    movw    %ss, %dx
    cmpw    %ax, %dx
    movw    %sp, %dx
    je      2f

This can lead to 3 different scenarios:

ss has a valid value of 0x1000 (as do all other segment registers beside cs)
ss is invalid and the CAN_USE_HEAP flag is set (see below)
ss is invalid and the CAN_USE_HEAP flag is not set (see below)

Let's look at all three of these scenarios in turn:

ss has a correct value (0x1000). In this case, we go to label 2:

    2:  andw    $~3, %dx
        jnz     3f
        movw    $0xfffc, %dx
    3:  movw    %ax, %ss
        movzwl  %dx, %esp
        sti

Here we can see the alignment of dx (which contains sp given by the bootloader) to 4 bytes and a check for whether or not it is zero. If it is zero, we put 0xfffc (4 byte aligned address before the maximum segment size of 64 KB) in dx. If it is not zero, we continue to use sp, given by the bootloader (0xf7f4 in my case). After this, we put the ax value into ss, which stores the correct segment value of 0x1000 (base address 0x10000), and set up a correct sp. We now have a correct stack.
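The alignment step at label 2 is easy to model in Python. This is only a sketch of the andw / jnz / movw sequence described above; the function name aligned_stack_pointer is made up for this example.

```python
def aligned_stack_pointer(sp: int) -> int:
    """Model of: andw $~3, %dx ; jnz 3f ; movw $0xfffc, %dx."""
    dx = sp & ~3 & 0xFFFF   # clear the two low bits (4 byte alignment)
    if dx == 0:             # jnz not taken: the aligned value is zero
        dx = 0xFFFC         # fall back to the top of the 64 KB segment
    return dx


print(hex(aligned_stack_pointer(0xF7F4)))  # sp from my case, already aligned
print(hex(aligned_stack_pointer(0xF7F7)))  # rounded down to 0xf7f4
print(hex(aligned_stack_pointer(0x0000)))  # replaced with 0xfffc
```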

In the second scenario (ss != ds), we first put the value of _end (the address of the end of the setup code) into dx and check the loadflags header field using the testb instruction to see whether we can use the heap. loadflags is a bitmask header field which is defined as:

    #define LOADED_HIGH     (1<<0)
    #define QUIET_FLAG      (1<<5)
    #define KEEP_SEGMENTS   (1<<6)
    #define CAN_USE_HEAP    (1<<7)

and, as we can read in the boot protocol,

    Field name: loadflags

      This field is a bitmask.

      Bit 7 (write): CAN_USE_HEAP
        Set this bit to 1 to indicate that the value entered in the
        heap_end_ptr is valid.  If this field is clear, some setup code
        functionality will be disabled.
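As a small illustration of how such a bitmask is tested, here is a Python sketch that uses the four flag values from the defines above. The helper names decode_loadflags and can_use_heap are made up for this example, and the sample values 0x81 and 0x01 are arbitrary.

```python
LOADED_HIGH = 1 << 0
QUIET_FLAG = 1 << 5
KEEP_SEGMENTS = 1 << 6
CAN_USE_HEAP = 1 << 7

FLAG_NAMES = {
    LOADED_HIGH: "LOADED_HIGH",
    QUIET_FLAG: "QUIET_FLAG",
    KEEP_SEGMENTS: "KEEP_SEGMENTS",
    CAN_USE_HEAP: "CAN_USE_HEAP",
}


def decode_loadflags(loadflags: int) -> list:
    """Return the names of all known flags that are set in loadflags."""
    return [name for bit, name in FLAG_NAMES.items() if loadflags & bit]


def can_use_heap(loadflags: int) -> bool:
    """Same test as the testb instruction: is bit 7 set?"""
    return bool(loadflags & CAN_USE_HEAP)


print(decode_loadflags(0x81), can_use_heap(0x81))
print(decode_loadflags(0x01), can_use_heap(0x01))
```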

If the CAN_USE_HEAP bit is set, we put heap_end_ptr into dx (which points to _end) and add STACK_SIZE (minimum stack size, 512 bytes) to it. After this, if dx is not carried (it will not be carried, dx = _end + 512), jump to label 2 (as in the previous case) and make a correct stack.

When CAN_USE_HEAP is not set, we just use a minimal stack from _end to _end + STACK_SIZE.

BSS Setup

The last two steps that need to happen before we can jump to the main C code are setting up the BSS area and checking the "magic" signature. First, signature checking:

    cmpl    $0x5a5aaa55, setup_sig
    jne     setup_bad

This simply compares the setup_sig with the magic number 0x5a5aaa55. If they are not equal, a fatal error is reported.

If the magic number matches, knowing we have a set of correct segment registers and a stack, we only need to set up the BSS section before jumping into the C code.

The BSS section is used to store statically allocated, uninitialized data. Linux carefully ensures this area of memory is first zeroed using the following code:

    movw    $__bss_start, %di
    movw    $_end+3, %cx
    xorl    %eax, %eax
    subw    %di, %cx
    shrw    $2, %cx
    rep; stosl

First, the __bss_start address is moved into di. Next, the _end + 3 address (+3 - aligns to 4 bytes) is moved into cx. The eax register is cleared (using a xor instruction), and the bss section size (cx - di) is calculated and put into cx. Then, cx is divided by four (the size of a 'word'), and the stosl instruction is used repeatedly, storing the value of eax (zero) into the address pointed to by di, automatically increasing di by four, repeating until cx reaches zero. The net effect of this code is that zeros are written through all words in memory from __bss_start to _end.
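The counting logic of this loop can be modeled in Python on a plain bytearray. This is a sketch under a few assumptions: the "memory" is just a 64-byte buffer, the addresses 16 and 30 are hypothetical stand-ins for __bss_start and _end, and the helper names are made up for this example.

```python
def bss_dword_count(bss_start: int, end: int) -> int:
    """Number of 4-byte stores, as computed in cx by the assembly."""
    cx = (end + 3) & 0xFFFF          # movw $_end+3, %cx
    cx = (cx - bss_start) & 0xFFFF   # subw %di, %cx
    return cx >> 2                   # shrw $2, %cx


def zero_bss(memory: bytearray, bss_start: int, end: int) -> None:
    """Model of rep; stosl with eax = 0."""
    di = bss_start
    for _ in range(bss_dword_count(bss_start, end)):
        memory[di:di + 4] = b"\x00\x00\x00\x00"  # store eax (zero)
        di += 4                                  # stosl advances di by four


memory = bytearray(b"\xff" * 64)
zero_bss(memory, 16, 30)
print(bss_dword_count(16, 30), memory[12:36].hex())
```

Because of the +3, the size is rounded up to a whole number of 4-byte words: here 14 bytes of "bss" result in 4 stores, so bytes 16 to 31 are cleared while the bytes before and after stay untouched.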

Jump to main

That's all - we have the stack and BSS, so we can jump to the main() C function:

    calll main

The main() function is located in arch/x86/boot/main.c. You can read about what this does in the next part.

Conclusion

This is the end of the first part about Linux kernel insides. If you have questions or suggestions, ping me on twitter 0xAX, drop me an email, or just create an issue. In the next part, we will see the first C code that executes in the Linux kernel setup, the implementation of memory routines such as memset, memcpy, earlyprintk, early console implementation and initialization, and much more.

Please note that English is not my first language and I am really sorry for any inconvenience. If you find any mistakes please send me a PR to linux-insides.

Links

First steps in the kernel setup code

Kernel booting process. Part 2.

First steps in the kernel setup

We started to dive into Linux kernel insides in the previous part and saw the initial part of the kernel setup code. We stopped at the first call to the main function (which is the first function written in C) from arch/x86/boot/main.c.

In this part we will continue to research the kernel setup code and

see what protected mode is,
some preparation for the transition into it,
the heap and console initialization,
memory detection, cpu validation, keyboard initialization
and much much more.

So, let's go ahead.

Protected mode

Before we can move to the native Intel64 Long Mode, the kernel must switch the CPU into protected mode.

What is protected mode? Protected mode was first added to the x86 architecture in 1982 and was the main mode of Intel processors from the 80286 processor until Intel 64 and long mode came.

The main reason to move away from Real mode is that there is very limited access to the RAM. As you may remember from the previous part, there is only 2^20 bytes or 1 Megabyte, sometimes even only 640 Kilobytes of RAM available in the Real mode.

Protected mode brought many changes, but the main one is the difference in memory management. The 20-bit address bus was replaced with a 32-bit address bus. It allowed access to 4 Gigabytes of memory vs 1 Megabyte of real mode. Also paging support was added, which you can read about in the next sections.

Memory management in Protected mode is divided into two, almost independent parts:

Segmentation
Paging

Here we will only see segmentation. Paging will be discussed in the next sections.

As you can read in the previous part, addresses consist of two parts in real mode:

Base address of the segment
Offset from the segment base

And we can get the physical address if we know these two parts by:

    PhysicalAddress = Segment Selector * 16 + Offset

Memory segmentation was completely redone in protected mode. There are no 64 Kilobyte fixed-size segments. Instead, the size and location of each segment is described by an associated data structure called Segment Descriptor. The segment descriptors are stored in a data structure called Global Descriptor Table (GDT).

The GDT is a structure which resides in memory. It has no fixed place in the memory so, its address is stored in the special GDTR register. Later we will see the GDT loading in the Linux kernel code. There will be an operation for loading it into memory, something like:

    lgdt gdt

where the lgdt instruction loads the base address and limit (size) of global descriptor table to the GDTR register. GDTR is a 48-bit register and consists of two parts:

size (16-bit) of global descriptor table;
address (32-bit) of the global descriptor table.
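The 48-bit layout can be illustrated by packing such a value in Python. This is a sketch, not kernel code: the names pack_gdtr and unpack_gdtr are made up, the base address 0x1000 and the table size of 24 bytes (three descriptors) are hypothetical, and the code assumes the usual x86 convention that the 16-bit field holds the table size minus one.

```python
import struct


def pack_gdtr(base: int, size_in_bytes: int) -> bytes:
    """Build the 6-byte operand: 16-bit limit followed by 32-bit base."""
    limit = size_in_bytes - 1          # assumed convention: limit = size - 1
    return struct.pack("<HI", limit, base)


def unpack_gdtr(raw: bytes) -> tuple:
    """Split the 6-byte value back into (base, size_in_bytes)."""
    limit, base = struct.unpack("<HI", raw)
    return base, limit + 1


raw = pack_gdtr(0x1000, 24)
print(len(raw) * 8, raw.hex(), unpack_gdtr(raw))
```

The 16-bit and 32-bit parts together give the 6 bytes (48 bits) mentioned above.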

As mentioned above the GDT contains segment descriptors which describe memory segments. Each descriptor is 64-bits in size. The general scheme of a descriptor is:

    31          24        19      16              7            0
    ------------------------------------------------------------
    |             | |B| |A|       | |   | |0|E|W|A|            |
    | BASE 31:24  |G|/|L|V| LIMIT |P|DPL|S|  TYPE | BASE 23:16 | 4
    |             | |D| |L| 19:16 | |   | |1|C|R|A|            |
    ------------------------------------------------------------
    |                             |                            |
    |        BASE 15:0            |       LIMIT 15:0           | 0
    |                             |                            |
    ------------------------------------------------------------

Don't worry, I know it looks a little scary after real mode, but it's easy. For example LIMIT 15:0 means that bits 0-15 of the Descriptor contain the value for the limit. The rest of it is in LIMIT 19:16. So, the size of Limit is 0-19 i.e 20-bits. Let's take a closer look at it:

1. Limit[20-bits] is at bits 0-15 and 48-51. It defines length_of_segment - 1. It depends on the G (Granularity) bit.

    if G (bit 55) is 0 and segment limit is 0, the size of the segment is 1 Byte
    if G is 1 and segment limit is 0, the size of the segment is 4096 Bytes
    if G is 0 and segment limit is 0xfffff, the size of the segment is 1 Megabyte
    if G is 1 and segment limit is 0xfffff, the size of the segment is 4 Gigabytes

So, it means that

    if G is 0, Limit is interpreted in terms of 1 Byte and the maximum size of the segment can be 1 Megabyte.
    if G is 1, Limit is interpreted in terms of 4096 Bytes = 4 KBytes = 1 Page and the maximum size of the segment can be 4 Gigabytes. Actually when G is 1, the value of Limit is shifted to the left by 12 bits. So, 20 bits + 12 bits = 32 bits and 2^32 = 4 Gigabytes.

2. Base[32-bits] is split between bits 16-31, 32-39 and 56-63 of the descriptor. It defines the address of the segment's starting location (a linear address, which is also the physical address while paging is disabled).

3. Type/Attribute (40-47 bits) defines the type of segment and kinds of access to it.

   S flag at bit 44 specifies descriptor type. If S is 0 then this segment is a system segment, whereas if S is 1 then this is a code or data segment (Stack segments are data segments which must be read/write segments).

To determine if the segment is a code or data segment we can check its Ex (bit 43) Attribute marked as 0 in the above diagram. If it is 0, then the segment is a Data segment otherwise it is a code segment.
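The Limit and Base fields are scattered across the descriptor, so a short sketch may help to see how they are put back together. The helper names below (desc_limit, desc_base, segment_size, desc_size) are hypothetical and the code treats the descriptor as one 64-bit integer; it only covers ordinary expand-up segments.

```c
#include <assert.h>
#include <stdint.h>

/* 20-bit limit: bits 0-15 plus bits 48-51 of the descriptor. */
static uint32_t desc_limit(uint64_t d)
{
    return (uint32_t)((d & 0xFFFF) | (((d >> 48) & 0xF) << 16));
}

/* 32-bit base: bits 16-39 plus bits 56-63 of the descriptor. */
static uint32_t desc_base(uint64_t d)
{
    return (uint32_t)(((d >> 16) & 0xFFFFFF) | (((d >> 56) & 0xFF) << 24));
}

/* Size in bytes: (limit + 1) units of 1 byte (G=0) or 4096 bytes (G=1). */
static uint64_t segment_size(uint32_t limit, int g)
{
    assert(limit <= 0xFFFFF);  /* the limit field is only 20 bits wide */
    uint64_t units = (uint64_t)limit + 1;
    return g ? units << 12 : units;
}

/* Same calculation, reading limit and G (bit 55) from a raw descriptor. */
static uint64_t desc_size(uint64_t d)
{
    return segment_size(desc_limit(d), (int)((d >> 55) & 1));
}
```

With the commonly used flat descriptor value 0x00CF9A000000FFFF, desc_limit gives 0xfffff, desc_base gives 0 and desc_size gives 4 Gigabytes, which matches the last case of the Limit list.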

A segment can be of one of the following types:

    |           Type Field        | Descriptor Type | Description
    |-----------------------------|-----------------|------------------
    | Decimal                     |                 |
    |             0    E    W   A |                 |
    | 0           0    0    0   0 | Data            | Read-Only
    | 1           0    0    0   1 | Data            | Read-Only, accessed
    | 2           0    0    1   0 | Data            | Read/Write
    | 3           0    0    1   1 | Data            | Read/Write, accessed
    | 4           0    1    0   0 | Data            | Read-Only, expand-down
    | 5           0    1    0   1 | Data            | Read-Only, expand-down, accessed
    | 6           0    1    1   0 | Data            | Read/Write, expand-down
    | 7           0    1    1   1 | Data            | Read/Write, expand-down, accessed
    |                  C    R   A |                 |
    | 8           1    0    0   0 | Code            | Execute-Only
    | 9           1    0    0   1 | Code            | Execute-Only, accessed
    | 10          1    0    1   0 | Code            | Execute/Read
    | 11          1    0    1   1 | Code            | Execute/Read, accessed
    | 12          1    1    0   0 | Code            | Execute-Only, conforming
    | 13          1    1    0   1 | Code            | Execute-Only, conforming, accessed
    | 14          1    1    1   0 | Code            | Execute/Read, conforming
    | 15          1    1    1   1 | Code            | Execute/Read, conforming, accessed

As we can see the first bit (bit 43) is 0 for a data segment and 1 for a code segment. The next three bits (40, 41, 42) are either EWA (Expansion Writable Accessible) or CRA (Conforming Readable Accessible).

  if E (bit 42) is 0, expand up otherwise expand down. Read more here.
  if W (bit 41) (for Data Segments) is 1, write access is allowed otherwise not. Note that read access is always allowed on data segments.
  A (bit 40) - Whether the segment is accessed by processor or not.
  C (bit 42) is the conforming bit (for code selectors). If C is 1, the segment code can be executed from a lower level privilege e.g. user level. If C is 0, it can only be executed from the same privilege level.
  R (bit 41) (for code segments). If 1 read access to segment is allowed otherwise not. Write access is never allowed to code segments.
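The meaning of the four type bits can be summarised in a few lines of C. This is only an illustrative sketch: struct seg_type and the helper functions are hypothetical names, and the argument is the 4-bit value from the Decimal column of the table (bits 40-43 of the descriptor).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

struct seg_type {
    bool code;      /* bit 43: 1 = code segment, 0 = data segment */
    bool e_or_c;    /* bit 42: expand-down (data) or conforming (code) */
    bool w_or_r;    /* bit 41: writable (data) or readable (code) */
    bool accessed;  /* bit 40: set by the processor on access */
};

/* Split the 4-bit type field into its individual flags. */
static struct seg_type decode_type(uint8_t type)
{
    struct seg_type t;

    assert(type <= 0xF);  /* only four bits are meaningful */
    t.code     = (type >> 3) & 1;
    t.e_or_c   = (type >> 2) & 1;
    t.w_or_r   = (type >> 1) & 1;
    t.accessed = type & 1;
    return t;
}

/* Data segments are always readable; code segments only if R is set. */
static bool seg_readable(struct seg_type t)
{
    return !t.code || t.w_or_r;
}

/* Only data segments with W set are writable; code segments never are. */
static bool seg_writable(struct seg_type t)
{
    return !t.code && t.w_or_r;
}
```

For example, type 2 decodes to a writable data segment, while type 10 decodes to a code segment that is readable but not writable.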

1. DPL[2-bits] (Descriptor Privilege Level) is at bits 45-46. It defines the privilege level of the segment. It can be 0-3 where 0 is the most privileged.

2. P flag (bit 47) - indicates if the segment is present in memory or not. If P is 0, the segment will be presented as invalid and the processor will refuse to read this segment.

3. AVL flag (bit 52) - Available and reserved bits. It is ignored in Linux.

4. L flag (bit 53) - indicates whether a code segment contains native 64-bit code. If 1 then the code segment executes in 64 bit mode.

5. D/B flag (bit 54) - Default/Big flag represents the operand size i.e 16/32 bits. If it is set then 32 bit otherwise 16.
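The remaining flags are single bits or small bit fields at fixed positions, so reading them is a matter of shifting and masking. The sketch below uses hypothetical names (desc_bit, desc_dpl, desc_type) and again treats the descriptor as one 64-bit integer.

```c
#include <assert.h>
#include <stdint.h>

/* Bit positions as given in the text. */
enum {
    DESC_S   = 44,  /* descriptor type: 0 = system, 1 = code/data */
    DESC_P   = 47,  /* present */
    DESC_AVL = 52,  /* available for software use */
    DESC_L   = 53,  /* 64-bit code segment */
    DESC_DB  = 54,  /* default operand size: 1 = 32 bit, 0 = 16 bit */
    DESC_G   = 55   /* granularity */
};

/* Value (0 or 1) of one bit of the descriptor. */
static unsigned desc_bit(uint64_t d, unsigned bit)
{
    assert(bit < 64);
    return (unsigned)((d >> bit) & 1);
}

/* Descriptor Privilege Level, bits 45-46. */
static unsigned desc_dpl(uint64_t d)
{
    return (unsigned)((d >> 45) & 3);
}

/* Type field, bits 40-43. */
static unsigned desc_type(uint64_t d)
{
    return (unsigned)((d >> 40) & 0xF);
}
```

Applied to the flat 32-bit code descriptor 0x00CF9A000000FFFF this gives P = 1, DPL = 0, S = 1, type 10 (Execute/Read), D/B = 1 and L = 0. The value 0x00AF9A000000FFFF differs only in having L = 1 and D/B = 0, which describes a 64-bit code segment.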

Segment registers contain segment selectors as in real mode. However, in protected mode, a segment selector is handled differently. Each Segment Descriptor has an associated Segment Selector which is a 16-bit structure:

     15             3  2   1     0
    -------------------------------
    |      Index      | TI |  RPL |
    -------------------------------

Where,

  Index shows the index number of the descriptor in the GDT.
  TI (Table Indicator) shows where to search for the descriptor. If it is 0 then search in the Global Descriptor Table (GDT) otherwise it will look in Local Descriptor Table (LDT).
  And RPL is Requester's Privilege Level.
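The three selector fields can be packed and unpacked with simple shifts. This is a small sketch with hypothetical helper names, written only to make the bit layout above concrete.

```c
#include <assert.h>
#include <stdint.h>

/* Pack index (13 bits), table indicator (1 bit) and RPL (2 bits). */
static uint16_t make_selector(unsigned index, unsigned ti, unsigned rpl)
{
    assert(index < 8192 && ti < 2 && rpl < 4);
    return (uint16_t)((index << 3) | (ti << 2) | rpl);
}

/* Index of the descriptor inside the GDT or LDT (bits 3-15). */
static unsigned sel_index(uint16_t sel)
{
    return sel >> 3;
}

/* Table indicator (bit 2): 0 = GDT, 1 = LDT. */
static unsigned sel_ti(uint16_t sel)
{
    return (sel >> 2) & 1;
}

/* Requester's Privilege Level (bits 0-1). */
static unsigned sel_rpl(uint16_t sel)
{
    return sel & 3;
}
```

For example, the selector 0x10 refers to descriptor number 2 in the GDT with RPL 0, and 0x2b refers to descriptor number 5 in the GDT with RPL 3.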

Every segment register has a visible and hidden part.

  Visible - Segment Selector is stored here
  Hidden - Segment Descriptor (base, limit, attributes, flags)

The following steps are needed to get the physical address in the protected mode:

  The segment selector must be loaded in one of the segment registers
  The CPU tries to find a segment descriptor by GDT address + Index from selector (the index is multiplied by 8, the size of one descriptor) and load the descriptor into the hidden part of the segment register
  Base address (from segment descriptor) + offset will be the linear address of the segment which is the physical address (if paging is disabled).

Schematically it will look like this:
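The same lookup can also be modelled in a few lines of C. This is a simplified sketch under stated assumptions: the GDT is an ordinary array of 64-bit values, the function name linear_address is hypothetical, only the GDT (TI = 0) is handled, and the limit and privilege checks that a real CPU performs are left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Translate selector:offset into a linear address using a GDT held in
 * an array. Returns 0 on success and -1 if the selector is unusable.
 */
static int linear_address(const uint64_t *gdt, unsigned entries,
                          uint16_t selector, uint32_t offset,
                          uint32_t *linear)
{
    unsigned index = selector >> 3;  /* descriptor number */
    uint64_t d;
    uint32_t base;

    assert(gdt != NULL && linear != NULL);

    if (selector & 4)                /* TI = 1: LDT, not modelled here */
        return -1;
    if (index == 0 || index >= entries)
        return -1;                   /* null selector or out of range */

    d = gdt[index];                  /* "GDT address + index * 8" */
    if (!((d >> 47) & 1))
        return -1;                   /* P = 0: segment not present */

    /* Base is split over bits 16-39 and 56-63 of the descriptor. */
    base = (uint32_t)(((d >> 16) & 0xFFFFFF) | (((d >> 56) & 0xFF) << 24));
    *linear = base + offset;
    return 0;
}
```

With a descriptor whose base is 0x10000 at index 2, the selector 0x10 and the offset 0x234 translate to the linear address 0x10234.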

The algorithm for the transition from real mode into protected mode is:

  Disable interrupts
  Describe and load GDT with lgdt instruction
  Set PE (Protection Enable) bit in CR0 (Control Register 0)
  Jump to protected mode code

We will see the complete transition to protected mode in the linux kernel in the next part, but before we can move to protected mode, we need to do some more preparations.

Let's look at arch/x86/boot/main.c. We can see some routines there which perform keyboard initialization, heap initialization, etc... Let's take a look.

Copying boot parameters into the "zeropage"

We will start from the main routine in "main.c". The first function which is called in main is copy_boot_params. It copies the kernel setup header into the field of the boot_params structure which is defined in the arch/x86/include/uapi/asm/bootparam.h.

The boot_params structure contains the struct setup_header hdr field. This structure contains the same fields as defined in linux boot protocol and is filled by the boot loader and also at kernel compile/build time. copy_boot_params does two things:

1. Copies hdr from header.S to the boot_params structure in setup_header field

2. Updates pointer to the kernel command line if the kernel was loaded with the old command line protocol.

Note that it copies hdr with memcpy function which is defined in the copy.S source file. Let's have a look inside:

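    GLOBAL(memcpy)
        pushw   %si          /* save si and di, both are changed below */
        pushw   %di
        movw    %ax, %di     /* di = destination (first argument) */
        movw    %dx, %si     /* si = source (second argument) */
        pushw   %cx          /* save the byte count (third argument) */
        shrw    $2, %cx      /* cx = count / 4 */
        rep; movsl           /* copy cx 4-byte words */
        popw    %cx          /* restore the byte count */
        andw    $3, %cx      /* cx = count % 4 */
        rep; movsb           /* copy the remaining 0-3 bytes */
        popw    %di
        popw    %si
        retl
    ENDPROC(memcpy)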

Yeah, we just moved to C code and now assembly again :) First of all we can see that memcpy and other routines which are defined here, start and end with the two macros: GLOBAL and ENDPROC. GLOBAL is described in arch/x86/include/asm/linkage.h which defines the globl directive and the label for it. ENDPROC marks the name symbol as a function name and ends with the size of the name symbol.

Implementation of memcpy is easy. At first, it pushes values from the si and di registers to the stack to preserve their values because they will change during the memcpy. memcpy (and other functions in copy.S) use fastcall calling conventions. So it gets its incoming parameters from the ax, dx and cx registers. Calling memcpy looks like this:

    memcpy(&boot_params.hdr, &hdr, sizeof hdr);

So,

  ax will contain the address of the boot_params.hdr
  dx will contain the address of hdr
  cx will contain the size of hdr in bytes.

memcpy puts the address of boot_params.hdr into di, the address of hdr into si, and saves the size on the stack. After this it shifts the size to the right by 2 bits (which divides it by 4) and copies from si to di 4 bytes at a time. After this we restore the size of hdr again, keep only its 2 low bits (the remainder of the division by 4) and copy the rest of the bytes from si to di byte by byte (if there are any). Finally the si and di values are restored from the stack and the copying is finished.
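The two-phase structure of this routine is easier to follow in C. The function below is a hypothetical model (the name copy_words_then_bytes is not from the kernel) that mirrors the assembly step by step: whole 4-byte words first, then the 0-3 leftover bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* C model of the copy.S routine: copy len bytes from src to dst. */
static void *copy_words_then_bytes(void *dst, const void *src, size_t len)
{
    unsigned char *d = dst;
    const unsigned char *s = src;
    size_t words = len >> 2;  /* shrw $2, %cx: number of 4-byte words */
    size_t tail = len & 3;    /* andw $3, %cx: leftover bytes */

    assert(dst != NULL && src != NULL);

    while (words--) {         /* rep; movsl */
        memcpy(d, s, 4);      /* stands in for one 32-bit move */
        d += 4;
        s += 4;
    }
    while (tail--)            /* rep; movsb */
        *d++ = *s++;

    return dst;
}
```

Copying 11 bytes, for instance, takes two word moves followed by three single-byte moves.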

Console initialization

After hdr is copied into boot_params.hdr, the next step is console initialization by calling the console_init function which is defined in arch/x86/boot/early_serial_console.c.

It tries to find the earlyprintk option in the command line and if the search was successful, it parses the port address and baud rate of the serial port and initializes the serial port. Value of earlyprintk command line option can be one of these:

  serial,0x3f8,115200
  serial,ttyS0,115200
  ttyS0,115200
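To illustrate what parsing such a value involves, here is a simplified sketch. It is not the kernel's parser: the names struct early_serial and parse_earlyprintk are hypothetical, and the mapping of ttyS0 and ttyS1 to ports 0x3f8 and 0x2f8 as well as the fallback baud rate of 9600 are assumptions based on the conventional PC serial port setup.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

struct early_serial {
    unsigned port;  /* I/O port base of the UART */
    unsigned baud;  /* baud rate */
};

/* Parse the value of earlyprintk=; returns 0 on success, -1 otherwise. */
static int parse_earlyprintk(const char *arg, struct early_serial *out)
{
    const char *rest;
    char *end;

    assert(arg != NULL && out != NULL);
    out->port = 0;
    out->baud = 9600;                     /* assumed default */

    if (strncmp(arg, "serial,", 7) == 0)  /* optional "serial," prefix */
        arg += 7;

    if (strncmp(arg, "0x", 2) == 0) {     /* explicit port address */
        out->port = (unsigned)strtoul(arg, &end, 16);
        rest = end;
    } else if (strncmp(arg, "ttyS", 4) == 0 &&
               (arg[4] == '0' || arg[4] == '1')) {
        out->port = (arg[4] == '0') ? 0x3f8 : 0x2f8;  /* assumed mapping */
        rest = arg + 5;
    } else {
        return -1;
    }
    if (out->port == 0)
        return -1;

    if (*rest == ',') {                   /* optional ",baud" suffix */
        unsigned long baud = strtoul(rest + 1, NULL, 0);
        if (baud != 0)
            out->baud = (unsigned)baud;
    }
    return 0;
}
```

All three forms listed above yield the port 0x3f8 and the baud rate 115200 with this sketch.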

After serial port initialization we can see the first output:

    if (cmdline_find_option_bool("debug"))
        puts("early console in setup code\n");

The definition of puts is in tty.c. As we can see it prints character by character in a loop by calling the putchar function. Let's look into the putchar implementation:

    void __attribute__((section(".inittext"))) putchar(int ch)
    {
        if (ch == '\n')
            putchar('\r');

        bios_putchar(ch);

        if (early_serial_base != 0)
            serial_putchar(ch);
    }

__attribute__((section(".inittext"))) means that this code will be in the .inittext section. We can find it in the linker file setup.ld.
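The effect of the recursive putchar call can be observed with a small user-space model. Everything here is hypothetical: struct sink collects the characters in a buffer instead of sending them to the BIOS or the serial port, and model_putchar and model_puts only imitate the control flow shown above.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for the screen: collects every character that is "printed". */
struct sink {
    char buf[64];
    size_t len;
};

static void emit(struct sink *s, char ch)
{
    assert(s != NULL);
    if (s->len < sizeof s->buf)  /* silently drop output on overflow */
        s->buf[s->len++] = ch;
}

/* Same shape as putchar: a newline is preceded by a carriage return. */
static void model_putchar(struct sink *s, int ch)
{
    if (ch == '\n')
        model_putchar(s, '\r');
    emit(s, (char)ch);
}

/* Same shape as puts: print the string character by character. */
static void model_puts(struct sink *s, const char *str)
{
    while (*str)
        model_putchar(s, *str++);
}
```

Printing the three characters of "ok\n" through model_puts stores four characters: 'o', 'k', '\r' and '\n'.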

First of all, putchar checks for the \n symbol and if it is found, prints \r before. After that it outputs the character on the VGA screen by calling the BIOS with the 0x10 interrupt call:

    static void __attribute__((section(".inittext"))) bios_putchar(int ch)
    {
        struct biosregs ireg;

        initregs(&ireg);
        ireg.bx = 0x0007;
        ireg.cx = 0x0001;
        ireg.ah = 0x0e;
        ireg.al = ch;
        intcall(0x10, &ireg, NULL);
    }

Here initregs takes the biosregs structure and first fills biosregs with zeros using the memset function and then fills it with register values.

    memset(reg, 0, sizeof *reg);
    reg->eflags |= X86_EFLAGS_CF;
    reg->ds = ds();
    reg->es = ds();
    reg->fs = fs();
    reg->gs = gs();

Let's look at the memset implementation:

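    GLOBAL(memset)
        pushw   %di                /* save di, it is changed below */
        movw    %ax, %di           /* di = destination (first argument) */
        movzbl  %dl, %eax          /* eax = fill byte, zero-extended */
        imull   $0x01010101,%eax   /* repeat the byte in all 4 bytes of eax */
        pushw   %cx                /* save the byte count */
        shrw    $2, %cx            /* cx = count / 4 */
        rep; stosl                 /* store eax cx times */
        popw    %cx
        andw    $3, %cx            /* cx = count % 4 */
        rep; stosb                 /* store al for the remaining 0-3 bytes */
        popw    %di
        retl
    ENDPROC(memset)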

As you can read above, it uses the fastcall calling conventions like the memcpy function, which means that the function gets parameters from ax, dx and cx registers.

Generally memset is like a memcpy implementation. It saves the value of the di register on the stack and puts the ax value into di which is the address of the biosregs structure. Next is the movzbl instruction, which copies the dl value to the lowest byte of the eax register. The remaining 3 high bytes of eax will be filled with zeros.

The next instruction multiplies eax with 0x01010101. It needs to because memset will copy 4 bytes at the same time. For example, we need to fill a structure with 0x7 with memset. eax will contain the 0x00000007 value in this case. So if we multiply eax with 0x01010101, we will get 0x07070707 and now we can copy these 4 bytes into the structure. memset uses rep; stosl instructions for copying eax into es:di.

The rest of the memset function does almost the same as memcpy.

After the biosregs structure is filled with memset, bios_putchar calls the 0x10 interrupt which prints a character. Afterwards it checks if the serial port was initialized or not and writes a character there with serial_putchar and inb/outb instructions if it was set.
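The multiplication trick is worth trying out in isolation. The following sketch models it in C with hypothetical names (replicate_byte, fill_words_then_bytes); the second function follows the same words-then-bytes pattern as the assembly.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* movzbl + imull: repeat one byte in all four bytes of a 32-bit word. */
static uint32_t replicate_byte(uint8_t c)
{
    return (uint32_t)c * 0x01010101u;
}

/* C model of the memset routine: fill len bytes at dst with c. */
static void *fill_words_then_bytes(void *dst, int c, size_t len)
{
    unsigned char *d = dst;
    uint32_t pattern = replicate_byte((uint8_t)c);
    size_t words = len >> 2;     /* shrw $2, %cx */
    size_t tail = len & 3;       /* andw $3, %cx */

    assert(dst != NULL);

    while (words--) {            /* rep; stosl */
        memcpy(d, &pattern, 4);  /* stands in for one 32-bit store */
        d += 4;
    }
    while (tail--)               /* rep; stosb */
        *d++ = (unsigned char)c;

    return dst;
}
```

replicate_byte(0x7) returns 0x07070707, the value from the example in the text. Since all four bytes of the pattern are equal, the result does not depend on the byte order of the machine.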

Heap initialization

After the stack and bss section were prepared in header.S (see previous part), the kernel needs to initialize the heap with the init_heap function.

First of all init_heap checks the CAN_USE_HEAP flag from the loadflags in the kernel setup header and calculates the end of the stack if this flag was set:

    char *stack_end;

    if (boot_params.hdr.loadflags & CAN_USE_HEAP) {
        asm("leal %P1(%%esp),%0"
            : "=r" (stack_end) : "i" (-STACK_SIZE));

or in other words stack_end = esp - STACK_SIZE.

Then there is the heap_end calculation:

    heap_end = (char *)((size_t)boot_params.hdr.heap_end_ptr + 0x200);

which means heap_end_ptr or _end + 512 (0x200). The last check is whether heap_end is greater than stack_end. If it is then stack_end is assigned to heap_end to make them equal.

Now the heap is initialized and we can use it using the GET_HEAP method. We will see how it is used and how it is implemented in the next posts.
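The arithmetic of init_heap can be replayed with plain integers. In this sketch the function name compute_heap_end is hypothetical and the stack size is passed as a parameter, because the actual value of STACK_SIZE is not given here.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Model of the heap_end calculation: esp is the current stack pointer,
 * stack_size the reserved stack size and heap_end_ptr the value from
 * the setup header.
 */
static uint32_t compute_heap_end(uint32_t esp, uint32_t stack_size,
                                 uint16_t heap_end_ptr)
{
    uint32_t stack_end;
    uint32_t heap_end;

    assert(esp >= stack_size);  /* the stack must fit below esp */

    stack_end = esp - stack_size;               /* lowest stack address */
    heap_end = (uint32_t)heap_end_ptr + 0x200;  /* end of the heap */

    if (heap_end > stack_end)   /* never let the heap reach the stack */
        heap_end = stack_end;
    return heap_end;
}
```

With esp = 0xfff0 and a stack size of 0x400, a heap_end_ptr of 0xe000 gives heap_end = 0xe200, while a heap_end_ptr of 0xfc00 would overlap the stack and is clamped to 0xfbf0.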

CPU validation

The next step as we can see is cpu validation by validate_cpu.

It calls the check_cpu function and passes cpu level and required cpu level to it and checks that the kernel launches on the right cpu level.

    check_cpu(&cpu_level, &req_level, &err_flags);
    if (cpu_level < req_level) {
        return -1;
    }

check_cpu checks the cpu's flags, presence of long mode in case of x86_64 (64-bit) CPU, checks the processor's vendor and makes preparation for certain vendors like turning off SSE+SSE2 for AMD if they are missing, etc.

Memory detection

The next step is memory detection by the detect_memory function. detect_memory basically provides a map of available RAM to the cpu. It uses different programming interfaces for memory detection like 0xe820, 0xe801 and 0x88. We will see only the implementation of 0xE820 here.

Let's look into the detect_memory_e820 implementation from the arch/x86/boot/memory.c source file. First of all, the detect_memory_e820 function initializes the biosregs structure as we saw above and fills registers with special values for the 0xe820 call:

    initregs(&ireg);
    ireg.ax  = 0xe820;
    ireg.cx  = sizeof buf;
    ireg.edx = SMAP;
    ireg.di  = (size_t)&buf;

  ax contains the number of the function (0xe820 in our case)
  cx register contains size of the buffer which will contain data about memory
  edx must contain the SMAP magic number
  es:di must contain the address of the buffer which will contain memory data
  ebx has to be zero.
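Two of these values can be made more concrete. The sketch below assumes that the SMAP magic number is simply the four ASCII characters "SMAP" packed into a 32-bit value, and that one entry returned by the call is a 20-byte record (64-bit address, 64-bit size, 32-bit type). The names four_cc and struct e820_entry are hypothetical and the packed attribute assumes GCC or Clang.

```c
#include <assert.h>
#include <stdint.h>

/* Pack four ASCII characters into a 32-bit value, first character highest. */
static uint32_t four_cc(char a, char b, char c, char d)
{
    return ((uint32_t)(uint8_t)a << 24) | ((uint32_t)(uint8_t)b << 16) |
           ((uint32_t)(uint8_t)c << 8) | (uint32_t)(uint8_t)d;
}

/* Assumed layout of one memory map entry written to es:di. */
struct e820_entry {
    uint64_t addr;  /* start of the memory region */
    uint64_t size;  /* length of the region in bytes */
    uint32_t type;  /* kind of memory (usable, reserved, ...) */
} __attribute__((packed));

/* The size that would be passed in cx for a buffer of one entry. */
static_assert(sizeof(struct e820_entry) == 20, "entry must be 20 bytes");
```

four_cc('S', 'M', 'A', 'P') evaluates to 0x534d4150, which is the value placed in edx under this assumption.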