
This is the collection of lecture slides* of the lecture Computer Architecture taught in the winter semester 06/07 at the University of Duisburg-Essen.

I slightly revised the subject overviews and have now added slide numbers.
Stefan Freinatis, March 2007

Computer Architecture

* Actually, this is the internet version of the lecture slides. Compared with the slides used in the lectures, the animations have been removed (errors hopefully as well) and additional text has been added.

Slide 1

Computer Architecture

WS 06/07

Dr.-Ing. Stefan Freinatis

Slide 2

Computer Architecture

WS 06/07

Dr.-Ing. Stefan Freinatis

Computer Architecture

Lecture: Dr.-Ing. Stefan Freinatis
Fachgebiet Verteilte Systeme (Prof. Geisselhardt), Raum BB 1017

Exercises: Dipl.-Math. Kerstin Luck
Fachgebiet Verteilte Systeme, Raum BB 910

Slide 3 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Times & Dates

Lecture: 08:15 - 09:45, Exercise: 10:00 - 10:45

1. 25.10.06
   01.11.06 All Saints' Day (public holiday in NRW, no lectures)
2. 08.11.06
3. 15.11.06
4. 22.11.06
5. 29.11.06
6. 06.12.06
7. 13.12.06
8. 20.12.06
9. 10.01.07
10. 17.01.07
11. 24.01.07
12. 31.01.07
13. 07.02.07

Slide 4 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Resources

Homepage Verteilte Systeme
http://www.fb9dv.uni-duisburg.de/vs/de/index.htm
Select English - Lectures - Winter semester 2006/2007 - Computer Architecture

Direct link to the homepage of the lecture Computer Architecture
http://www.fb9dv.uni-duisburg.de/vs/en/education/dv3/index2006.htm

Slide 5 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Computer Architecture

Topics

Introduction & History

1. Operating Systems (slide 34)
   System layers, batching, multi-programming, time sharing

2. File Systems (slide 65)
   Storage media, files & directories, disk scheduling

3. Process Management (slide 151)
   Processes, threads, IPC, scheduling, deadlocks

4. Memory Management (slide 351)
   Memory, paging, segmentation, virtual memory, caches

Slide 6 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Literature

[HP03] J. Hennessy, D. Patterson: Computer Architecture - A Quantitative Approach, 3rd ed., Elsevier Science, 2003, ISBN 1-55860-724-2.
[HP06] J. Hennessy, D. Patterson: Computer Architecture - A Quantitative Approach, 4th ed., Elsevier Science, 2006, ISBN 0-12-370490-1.
[Sil00] A. Silberschatz: Applied Operating System Concepts, 1st ed., John Wiley & Sons, 2000, ISBN 0-471-36508-4.
[Ta01] A. Tanenbaum: Modern Operating Systems, 2nd ed., Prentice Hall, 2001, ISBN 0-13-092641-8.

Slide 7 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Computer Architecture

Introduction

Slide 8 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Introduction

Computer Architecture is the conceptual design and fundamental operational structure of a computer system [Wikipedia].

Computer Architecture encompasses [HP03 p.9]:

Instruction set architecture
  stack or accumulator or general-purpose register architecture
Organization
  memory system, bus structure, CPU design
Hardware
  machine specifics, logic design, technology

Slide 9 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Computer Application Areas
Introduction

General purpose desktops
  balanced performance for a range of tasks, graphics, video, audio
Scientific desktops and servers
  high-performance floating point and graphics
Commercial servers
  databases, transaction processing, highly reliable
Embedded computing
  low power, small size, safety critical

Slide 10 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Computer
Introduction

A computer is a person or an apparatus that is capable of processing information by applying calculation rules.
  (Generalized, technology-independent definition)

A computer is a machine for manipulating data according to a list of instructions known as a program [Wikipedia].

Slide 11 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

History
Introduction

~ 5000 BC  The basis of calculating is counting. 10 fingers - decimal system.
~ 1000 BC  Abacus (Suan Pan, Soroban). Chinese Suan Pan, Roman Abacus.

Slide 12 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

History
Introduction

Abacus finger technique (from a Japanese book of 1954; book from 1958).
See also: http://www.ee.ryerson.ca/~elf/abacus/leeabacus/

Slide 13 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

History
Introduction

300 BC - 1000 AD  Roman numeral system: addition system, no zero.

Numeral  Value
M        1000
D        500
C        100
L        50
X        10
V        5
I        1

Value 19: XVIIII or XIX. Not suitable for performing multiplications.

Slide 14 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

History
Introduction

~ 500 AD  Hindu-Arabic numeral system: place-value system, introduction of 0.
  Numeral forms: Indian (3rd century BC), Indian (8th century), West-Arabic (11th century), European (15th century), European (16th century), today.
  Forms the basis for the development of calculation on machines.

Slide 15 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

History
Introduction

1623  Wilhelm Schickard   Calculating machine
1641  Blaise Pascal       Adding machine
1679  G.W. Leibniz        Dyadic system (binary system)
1808  J. M. Jacquard      Punch-card controlled loom

Slide 16 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

History
Introduction

1833  Charles Babbage   Difference Engine
  Data memory, program memory
  Instruction-based operation
  Conditional jumps
  I/O unit

Slide 17 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

History
Introduction

1847  George Boole   Logic on mathematical statements

1890  H. Hollerith   Punch-card based tabulating machine
  Digital data logging on punch cards. First electromechanical data processing.

Slide 18 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

History
Introduction

1936  Alan Turing
  Philosophy of information, Turing machine. Founder of Computer Science.

1941  Konrad Zuse
  First electro-mechanical computer Z3. Binary arithmetic, floating point. (Z3 rebuilt in 1961)

Slide 19 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

History
Introduction

Characteristics of the first 5 operative digital computers:

Computer                  Nation   Shown working  Digital  Binary  Electronic  Programmable
Zuse Z3                   Germany  May 1941       Yes      Yes     No          By punched film stock
Atanasoff-Berry Computer  USA      Summer 1941    Yes      Yes     Yes         No
Colossus                  UK       1943           Yes      Yes     Yes         Partially, by rewiring
Harvard Mark I            USA      1944           Yes      No      No          By punched paper tape
ENIAC                     USA      1944           Yes      No      Yes         Partially, by rewiring
ENIAC                     USA      1948           Yes      No      Yes         By function table ROM

Information source: Wikipedia on Z3 or on ENIAC, English

Slide 20 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

History
Introduction

1945  John v. Neumann
  Concept of universal computer systems. Founder of Computer Architecture.
  The von Neumann model of a universal computer (stored-program computer).

Slide 21 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

v. Neumann Model
Introduction

A computer consists of 5 units:

Input Unit, Output Unit
  Communication with the environment (input data, output data).
Memory
  Storage for program and data. Addressable storage locations. Read / write.
Control Unit
  Interpretation of the program. Timing control of the units (control signals). Receives the instructions from memory.
ALU (Arithmetic Logic Unit)
  Performs calculations. Exchanges data with memory.

Slide 22 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

v. Neumann Model
Introduction

Today:
The input unit and the output unit are combined (not necessarily physically!) to form the input/output unit (short: I/O unit). The control unit and the ALU are combined to form the microprocessor.

(Figure: keyboard, monitor, ... attached to Input/Output; Memory; Microprocessor (CPU) connected via address, data and control lines; together they form the microcomputer.)

Slide 23 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Characteristics
von Neumann Model

Architecture is independent of the problem to be processed
  Universal stored-program computer, not tailored to a specific problem.
Randomly accessible memory locations
  Selection of a location by means of an address. All locations have the same capacity.
Both program and data reside in memory
  The state of the machine (control unit) decides whether the content of a memory location is considered data or code.
Computer is centrally controlled
  The CPU has the master role.
Sequential processing
  Execution of a program is done instruction by instruction.

The v. Neumann model (or architecture) basically still applies to the majority of modern computer systems.

Slide 24 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

v. Neumann Model
Introduction

Steps in executing an instruction:
1. Fetch the instruction from memory and put it into the instruction register (in the CPU).
2. Evaluate (decode) the instruction.
3. When needed for this particular instruction, address the data (the operands) in memory.
4. Fetch the data (usually into CPU-internal registers).
5. Perform the operation on the data (usually this is carried out by the ALU) and write back the results.
6. Adjust the address counter to point to the next instruction.

Slide 25 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

v. Neumann Bottleneck
Introduction

Memory accesses in executing C = A + B (A, B, C: data in memory).
Instruction phase: the CPU places the address of the instruction on the address bus and receives the instruction over the data bus.
Data phase: address of A - A is read; address of B - B is read; address of C - C is written.
All traffic between the CPU side and the memory side passes over the bus system.

Slide 26 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis
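The instruction cycle above maps directly onto a few lines of code. The following is a minimal sketch (not from the slides) of a toy stored-program machine in C that computes C = A + B; the opcodes, the accumulator design and the memory addresses are made-up choices for illustration only.

    #include <stdio.h>
    #include <stdint.h>

    /* Toy stored-program machine: each instruction holds an opcode and an address. */
    enum { OP_HALT = 0, OP_LOAD, OP_ADD, OP_STORE };
    typedef struct { uint8_t op; uint8_t addr; } Instr;

    int main(void) {
        int16_t mem[16] = {0};                 /* data memory                */
        int16_t acc = 0;                       /* accumulator (in the ALU)   */
        int pc = 0;                            /* address counter            */
        mem[10] = 4; mem[11] = 38;             /* A at 10, B at 11, C at 12  */

        Instr prog[] = {                       /* program for C = A + B      */
            {OP_LOAD, 10}, {OP_ADD, 11}, {OP_STORE, 12}, {OP_HALT, 0}
        };

        for (;;) {
            Instr ir = prog[pc++];             /* 1. fetch, 6. advance counter */
            switch (ir.op) {                   /* 2. decode                    */
            case OP_LOAD:  acc = mem[ir.addr];  break;  /* 3./4. fetch operand */
            case OP_ADD:   acc += mem[ir.addr]; break;  /* 5. ALU operation    */
            case OP_STORE: mem[ir.addr] = acc;  break;  /* 5. write back       */
            case OP_HALT:  printf("C = %d\n", mem[12]); return 0;
            }
        }
    }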

v. Neumann Bottleneck
Introduction

The data is processed faster by the CPU than it can be taken from or stored in memory.
The processor-memory interface is crucial for the overall computation performance.
Reduction of the bottleneck effect through introduction of a hierarchical memory organization: register - cache - main memory.

Slide 27 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Computer Performance
Introduction

Performance: the work done in a certain amount of time. Performance is like power:

  P = W / t

Work can have the meaning of processing an instruction, carrying out a floating-point or an integer operation, or processing a standardized program (benchmark).

Slide 28 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Computer Performance
Introduction

Popular performance measures:

Clock rate [Hz]
  The frequency at which the CPU is clocked.
MIPS
  Million instructions per second.
FLOPS
  Floating-point operations per second.

Slide 29 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Computer Performance
Introduction

Many performance measures are not very expressive ... as they do not
  consider the number of instructions being carried out per cycle (parallel execution),
  cover the effective throughput between CPU and memory,
  distinguish between complex instruction set computers (CISC) and reduced instruction set computers (RISC).

Slide 30 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis
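As a rough illustration of P = W/t, and of why raw MIPS-style numbers say little on their own, here is a small C sketch (not from the slides) that times a fixed amount of simple integer work; the loop body and the iteration count are arbitrary example choices.

    #include <stdio.h>
    #include <time.h>

    /* Crude performance estimate: W = loop iterations, t = elapsed CPU time.
     * Real MIPS/FLOPS figures are based on defined instruction mixes / benchmarks. */
    int main(void) {
        const long work = 200000000L;      /* "W": number of iterations      */
        volatile long sum = 0;             /* volatile keeps the loop alive   */

        clock_t start = clock();
        for (long i = 0; i < work; i++)
            sum += i;
        double t = (double)(clock() - start) / CLOCKS_PER_SEC;   /* "t" */

        printf("%.1f million iterations per second\n", work / t / 1e6);
        return 0;
    }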

Computer Performance
Introduction

Computer performance compared to a VAX-11/780 from 1978. (Figure from [HP06 p.3])

Slide 31 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Moore's Law
Introduction

Gordon E. Moore empirically observed in 1965 that the number of transistors on a chip doubles approximately every 12 months.
In 1975 he revised his prediction to the number of transistors on a chip doubling every two years.

Moore's Law:  N(t) = N0 * 10^(0.15 * t),  where t is in [years]

See also: www.thocp.net/biographies/papers/moores_law.htm

Slide 32 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis
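A quick way to check the formula is to evaluate it; the sketch below (not from the slides) prints N(t) for a few years and the doubling time it implies. The starting value N0 is an arbitrary example.

    #include <stdio.h>
    #include <math.h>

    /* Moore's Law as stated on the slide: N(t) = N0 * 10^(0.15*t), t in years. */
    int main(void) {
        double n0 = 2300.0;                       /* example starting transistor count */
        for (int t = 0; t <= 10; t += 2)
            printf("t = %2d years: N = %.0f\n", t, n0 * pow(10.0, 0.15 * t));

        /* Doubling time: 10^(0.15*t) = 2  =>  t = log10(2) / 0.15, about 2 years. */
        printf("doubling time = %.2f years\n", log10(2.0) / 0.15);
        return 0;
    }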

Moore's Law

(Image source: Wikipedia on Moore's Law, English)

Computer Architecture

Operating Systems
  System layers (36)
  Early computer systems (42)
  Batch systems (46)
  Multi-program systems (50)
  Time sharing systems (54)
  Modern systems (57)

Slide 34 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Operating Systems

An operating system is a program that acts as an intermediary between a user of a computer and the computer hardware [Sil00 p.3].

Purpose: provision of an environment in which a user can execute programs.

Objectives:
  to make the system convenient to use
    Usability, extending the machine beyond low-level hardware programming.
  to use the hardware in an efficient manner
    Resource management, managing the hardware allocation among different programs.

Slide 35 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Operating Systems

Computer system layers. (Figure from [Sil00 p.4])

Slide 36 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

System Layers
Operating systems

1. Hardware - provides basic computing resources.
   CPU, memory, I/O, and devices connected to I/O.
2. Operating system - coordinates the use of the hardware among the various application programs for the various users.
3. Application programs - the programs used to solve the computing problems of the users.
4. Users - people, machines, or other computers using the computer system.

Slide 37 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

System Layers
Operating systems

Computer system layers. (Figure from lecture CA WS 05/06, original source unknown)

Slide 38 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Operating Systems
Usability: the operating system as an Extended Machine
The architecture of most computers at the machine language level is awkward to program, especially for I/O. The operating system shields the programmer from the hardware details, provides simple(r) interfaces, offers high level abstractions and, in this view, presents the user with the equivalent of an extended machine.
See also [Ta01 p.4]
Slide 39 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Operating Systems
Resource Management: the operating system as a Resource Manager

Computer resources: processor(s), memory, timers, disks, network interfaces, printers, graphics cards, ...
The operating system keeps track of who is using which resource, grants or denies resource requests, and accounts for the usage of resources.
See also [Ta01 p.5]
Slide 40 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Resource Management
Operating systems

Resource management may be divided into time management (e.g. CPU time, printer time) and space management (e.g. memory or disk space).

Resource management incorporates
  process management,
  memory management,
  file system management,
  device management.

Before going into these subjects, let's have a look at the computer development since 1945.

Slide 41 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Early Computer Systems
First computer generation (1945 - 55)
Operating systems

Vacuum tubes.
A single group of people did all the work
  design, construction, programming, operating, maintenance.
Programming in machine language
  plugboard, no programming languages.
Users directly interact with the computer system.
Programs directly interact with the hardware.
No operating system.

Slide 42 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Early Computer Systems
First computer generation (1945 - 55)
Operating systems

IBM 407 Accounting Machine: electro-mechanical tabulator.
Source: http://www.columbia.edu/acis/history/407.html

Slide 43 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Early Computer Systems
First computer generation (1945 - 55)
Operating systems

Wiring panel (plugboard): IBM 402 plugboard.
Source: http://www.columbia.edu/acis/history/plugboard.html

Slide 44 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Early Computer Systems
First computer generation (1945 - 55)
Operating systems

Vacuum tubes.
A single group of people did all the work
  design, construction, programming, operating, maintenance.
Programming in machine language
  plugboard, no programming languages.
Users directly interact with the computer system.
Programs directly interact with the hardware.
No operating system.

Slide 45 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Batch Systems
Second computer generation (1955 - 65)
Operating systems

Transistors, mainframe computers.
First high-level programming languages
  Fortran (Formula Translation), Algol (Algorithmic Language), Lisp (List Processing).
No direct user interaction with the computer
  Everything went via the computer operators.
Users submit jobs to the operator
  job = program + data + control information.
The operator batched jobs
  Composition of jobs with similar needs.

Slide 46 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Batch Systems
Second computer generation (1955 - 65)
Operating systems

Structure of a typical FMS (Fortran Monitor System) batch job. (Figure from [Ta01 p.9])

Slide 47 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Batch Systems
Second computer generation (1955 - 65)
Operating systems

Batch job processing scene with IBM machines [Tanenbaum]. (Figure from [Ta01 p.8])

Slide 48 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Batch Systems
Second computer generation (1955 - 65)
Operating systems

Resident monitor program in memory
  Monitor program loading one job after another (from tape).
Sequenced job input
  Jobs from tape or from card reader. The monitor program cannot select jobs on its own.
One job in memory at a time
  Memory holds the monitor program and the current job.
CPU often idle
  waiting for slow I/O devices.

Slide 49 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Multiprogram Systems
Third computer generation (1965 - 80)
Operating systems

Integrated circuits.
Disks
  Direct access to several jobs on disk. Now the operating system can select jobs (job scheduling).
Multiprogrammed batch systems
  Several jobs in memory at the same time (memory holds the operating system and job 1 ... job 4).
  The operating system shares CPU time among the jobs (CPU scheduling). Better CPU utilization.

Slide 50 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Multiprogram Systems
Operating systems

Assume program A being executed on a single-program computer. The program needs two I/O operations.
CPU usage over time: A1, (I/O A), A2, (I/O A), A3 - the CPU is idle during the I/O phases.

Assume program B being executed on the same computer at some other time. The program needs no I/O.
CPU usage over time: B1, B2, B3.

Total execution time on a single-program computer: the run times of A and B add up, including A's idle I/O phases.

Slide 51 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Multiprogram Systems
Operating systems

Now assume programs A and B being executed on a multi-program computer.
CPU usage over time: A1, B1 (while A does I/O), A2, B2 (while A does I/O), A3, B3.

Total execution time on a multi-program computer: shorter, since B uses the CPU while A waits for I/O.

Slide 52 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Multiprogram Systems
Third computer generation (1965 - 80)
Operating systems

Multiprogram computers were still batch systems.
Desire for quicker response time
  It took hours/days until the output was ready. A single misplaced comma could cause a compilation to fail, and the programmer wasted half a day [Ta01 p.11].
Desire for interactivity
  Users wanted to have the machine for themselves, working online.

These requests paved the way for time-sharing systems (still in the third computer generation).

Slide 53 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Time Sharing Systems
Third computer generation (1965 - 80)
Operating systems

Direct user interaction
  Many users share a computer simultaneously. Terminals - host.
Multiple job execution with highly frequent switching
  The operating system must provide more sophisticated CPU scheduling.
Disk as backing store for memory
  Virtual memory: swapping, address translation, protecting memory (memory management).
Many jobs awaiting execution; disk as input/output storage
  Need for the OS to manage user data (file system management).

Slide 54 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Time Sharing Systems
Operating systems

Assume programs A and B as previously. Execution on a time sharing system:
CPU usage over time: the CPU switches between A and B in small time slices; while A performs I/O, B gets the CPU; after program B has finished, the CPU is idle whenever A waits for I/O.

Small time slices allow for interactivity (quasi-parallel execution).
Time sharing is not necessarily faster. Compare to the multiprogramming example:
CPU usage over time: A1, B1 (I/O), A2, B2 (I/O), A3, B3.

Slide 55 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Memory Layout
Operating systems

Batch system: monitor program, one job (working memory).
Multi-program system: operating system, job 1 ... job 4.
Time sharing system: operating system, program 1, program 2, program 3, ..., program n.

Slide 56 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Modern Systems
Fourth computer generation (1980 - present)
Operating systems

Single-chip CPUs, Personal Computers
  CP/M, MS-DOS, DR-DOS, Windows 1.0 ... Windows 98 / ME, Windows NT 4.0 ... 2003, XP, XENIX, MINIX, Linux, FreeBSD
Real-Time Systems
Multiprocessor Systems
Distributed Systems
Embedded Systems

Slide 57 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Real Time Systems
Modern systems

Rigid time requirements.

Hard real time
  Industrial control & robotics. Guaranteed response times. Slimmed OS features (no virtual memory).
Soft real time
  Multimedia, virtual reality. Less restrictive time requirements.

Slide 58 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Multiprocessor Systems
Modern systems

n processors in the system (n > 1), tightly coupled
  Resource sharing.

Symmetric multiprocessing
  Each CPU runs an identical copy of the OS. All CPUs are peers (no master-slave).
Asymmetric multiprocessing
  Each CPU is assigned a specific task. Task assignment by a master CPU.

Slide 59 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Distributed Systems
Modern systems

n computers/processors (n > 1), loosely coupled
  Individual computers. Autonomous operation. Communication via network.
  Network operating system: file sharing, message exchange.

Slide 60 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Embedded Systems
Modern systems

Dedicated to specific tasks.
Encapsulated in the host device
  invisible, usually not repaired when defective.
Small in size, low energy.
Sometimes safety-critical
  automotive drive-by-wire, medical apparatus.
Custom(ized) operating system
  Little or no file I/O, sometimes multitasking, no fancy OSs.

Slide 61 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Resource Management
Operating systems

File system management
  Creation and organization of a logical storage location where data (user data, system data, programs) can be persistently stored in terms of files. Assigning rights and managing accesses. Maintenance.
Process management
  Creation of processes (programs in execution) and sharing the CPU among them. Control of execution time. Enabling communication between processes.
Memory management
  Assigning memory areas to processes. Organizing virtual memory.
Device management
  Low-level administrative work related to the specifics of the I/O devices. Translations, low-level stream processing. Usually done by device drivers.

Slide 62 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Operating Systems

An operating system in the wide sense is the software package for making a computer operable.

The operating system in the narrow sense is the one program running all the time on the computer (the kernel). It consists of several tasks and is asked for services through system calls.
(Image source: Wikipedia on kernel, English)

Slide 63 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Operating Systems

Operating system categories:
  Single user - single tasking   (e.g. CP/M, MS-DOS)
  Single user - multi tasking    (e.g. Windows, MacOS)
  Multi user - single tasking
  Multi user - multi tasking     (e.g. Unix, VMS)

Slide 64 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Computer Architecture

File System Management

Slide 65 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

File System Management

Storage Media (67), Magnetic Disks (71), Files and Directories (81, 90), File Implementation (98), Directory Implementation (114), Free Block Management (124), File System Layout (129), Cylinder skew, disk scheduling (135), Floppy Disks (145)

Slide 66 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Storage Media

Storage hierarchy: primary storage (low access time) at the top, secondary storage such as Flash and magnetic disks (higher access time) below. (Figure from [Sil00 p.31])

Slide 67 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Storage Media

Cost versus access time for DRAM and magnetic disks (disk access times in the range of about 1 ms to 10 ms). [HP06 p.359]

Slide 68 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Storage Media

Requirements for secondary storage:

Store large amounts of data
  Much more data than fits into (virtual) memory.
Persistent store
  The information must survive the termination of the process creating or using it.
Concurrent access to data
  Multiple processes should be able to access the data simultaneously.

Data is stored on secondary storage media in terms of files.

Slide 69 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

File System Management

Storage Media (67), Magnetic Disks (71), Files and Directories (81, 90), File Implementation (98), Directory Implementation (114), Free Block Management (124), File System Layout (129), Cylinder skew, disk scheduling (135), Floppy Disks (145)

Slide 70 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Magnetic Disks

Magnetic disk drive principle: the disk drive with its disk controller is connected via the host controller to the computer. (Figure from [Sil00 p.29])

Slide 71 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Magnetic Disks

Sector: smallest addressable unit on a magnetic disk.
Data size between 32 and 4096 bytes (standard: 512 bytes). (A disk sector, figure from [Ta01 p.315])

Several sectors may be combined to form a logical block. The composition is usually performed by a device driver. In this way the higher software layers only deal with abstract devices that all have the same block size, independent of the physical sector size. Such a block is also termed a cluster.

Slide 72 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Magnetic Disks

Formatted disk capacity
  = bytes per sector x sectors per track        (capacity of a track)
  x tracks per platter side (cylinders)         (capacity of one platter side)
  x tracks per cylinder (heads)                 (capacity of all platter sides = disk capacity)

Example: CHS = (7, 2, 9), sector size 512 byte. Capacity = 7 x 2 x 9 x 512 byte = 64512 byte = 63 kB.
(C = cylinders, H = heads = tracks per cylinder, S = sectors per track)

Slide 73 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Magnetic Disks

Disk parameters for the original IBM PC floppy disk and a Western Digital WD 18300 hard disk [Ta01 p.301].

Slide 74 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis
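The capacity formula is easy to turn into code. Below is a small sketch (not from the slides) that reproduces the CHS = (7, 2, 9) example; the function and variable names are my own.

    #include <stdio.h>

    /* Formatted capacity from CHS geometry:
     * bytes/sector x sectors/track x heads x cylinders. */
    static long long disk_capacity(long cylinders, long heads,
                                   long sectors_per_track, long bytes_per_sector)
    {
        return (long long)bytes_per_sector * sectors_per_track * heads * cylinders;
    }

    int main(void) {
        long long cap = disk_capacity(7, 2, 9, 512);                     /* slide example     */
        printf("capacity = %lld bytes = %lld kB\n", cap, cap / 1024);    /* 64512 bytes, 63 kB */
        return 0;
    }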

Magnetic Disks

On older disks the number of sectors per track was the same for all cylinders.
The physics of the inner track sectors defined the maximum number of bytes per sector. Physically, the outer sectors could have stored more bytes than defined, as their areas are bigger: a waste of space / capacity. (Physical disk geometry)

Slide 75 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Magnetic Disks

Modern disks are divided into zones, with more sectors in the outer zones than in the inner zones (zone bit recording).
Physical geometry (left) and a corresponding virtual geometry example (right). (Figure from [Ta01 p.302], modified)

Slide 76 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Magnetic Disks

Physical geometry: the true physical disk layout. With modern disks only the internal electronics knows about it. CHS (for old disks), or not published any more.
Virtual geometry: the disk layout published to the external world (device driver, operating system, user).
  CHS (e.g. the WD 18300 example)
  LBA (logical block addressing): disk sectors are just numbered consecutively without regard to the physical geometry.
A disk is a random-access storage device.

Slide 77 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Magnetic Disks

Low-level formatting: creation of the physical geometry on the disk platters. Defective disk areas are masked out and replaced by spare areas. Done by disk-drive internal software.
Partitioning: the disk is divided into independent partitions, each logically acting as a separate disk. Definition of a master boot record in the first sector of the disk. Done by an application program.
High-level formatting: a partition receives a boot block and an empty file system (free storage administration, root directory). Done by an application program or by an operating system administration tool.

Slide 78 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Logical Disk Layout
Magnetic disks

The disk is divided into partitions, each holding its own file system. (Figure from [Ta01 p.400], modified)

Slide 79 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

File System Management

Storage Media (67), Magnetic Disks (71), Files and Directories (81, 90), File Implementation (98), Directory Implementation (114), Free Block Management (124), File System Layout (129), Cylinder skew, disk scheduling (135), Floppy Disks (145)

Slide 80 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Files
A file is a named collection of related information recorded on secondary storage. [Sil00 p.346]
A file is a logical storage unit. It is an abstract data type. [Sil00 p.345, 347]
Files are an abstraction mechanism for storing information and retrieving it back later. [after Ta01 p.380]
Slide 81 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

File Structure
Files

Logical file structure examples [Ta01 p.382]


Slide 82 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

File Structure
Files

a) Byte sequence
  Unstructured. The OS does not know or care what is in the file. Meaning is imposed by the application program. Maximum flexibility. Approach used by Unix and Windows.
b) Sequence of records (fixed length)
  Each record has some internal structure. Background idea: read / write operations from secondary storage have record size.
c) Tree of records
  Highly structured. Records may be of variable size. Access to a record through a key (e.g. "Pony"). Lookup / read / write / append are performed by the OS, not by the application program. Approach used in large mainframe computers (commercial data processing systems).

Slide 83 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

File Access
Files

Sequential access
  Simple and most common. Based on the tape model of a file. Data is processed in order (byte after byte or record after record). Operations: read, write, rewind. Records need not be of the same length (e.g. text files with each line posing a record; remember Pascal readln, writeln). (Figure from [Sil00 p.355], modified)

Slide 84 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

File Access
Files

Direct access
  Bytes or fixed-length logical records. Records are numbered (1, 2, 3, ...). Access can be in no particular order: access by record number (file pointer, seek). Based on the disk model of a file. Useful for immediate access to large data records (e.g. databases). Operations: read, write, seek. (Figure from [Sil00 p.355], modified)

Slide 85 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

File Access
Files

Indexed access
  An index file holds keys. The keys point to records within the relative file. Suited for tree structures. (Example of an index file and relative file, figure from [Sil00 p.358])

Slide 86 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis
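In C, the difference between sequential and direct access shows up as reading record after record versus seeking straight to record k. Below is a minimal sketch (not from the slides) using only standard C I/O; the record layout and the file name are invented for the example.

    #include <stdio.h>

    struct record { int id; char name[28]; };     /* fixed-length record */

    int main(void) {
        FILE *f = fopen("records.dat", "rb");     /* example file name */
        if (!f) { perror("records.dat"); return 1; }

        /* Sequential access: process record after record. */
        struct record r;
        while (fread(&r, sizeof r, 1, f) == 1)
            printf("seq: %d %s\n", r.id, r.name);

        /* Direct access: position the file pointer at record k, then read. */
        long k = 5;
        fseek(f, k * (long)sizeof r, SEEK_SET);
        if (fread(&r, sizeof r, 1, f) == 1)
            printf("record %ld: %d %s\n", k, r.id, r.name);

        fclose(f);
        return 0;
    }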

File Names
Files

Name assigned by the creating process
  andrew, 2day, urgent!, fig_14
  Assigned by the operating system, stored in the file system.
Case sensitivity (Andrew, andrew, ANDREW)
  Unix: case sensitive. MS-DOS: not sensitive.
Two-part file names: basename.extension
  readme.txt, prog.c.Z, lecture.doc
  Extensions are often just conventions, not mandated by the operating system (although convenient when the OS knows about them).

Slide 87 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

File Attributes
Files

Additional information about a file. Which attributes exist depends on the operating system and the file system.

Some possible file attributes:
  Access rights        Who can access the file and in what way?
  Creation date        Date of file creation.
  Text / binary flag   Whether the file content is text or binary.
  Temp flag            If set, the file is a temporary file and is deleted on process exit.
  Hidden flag          Whether or not the file name is displayed in listings.
  File type            Regular file or directory file or ...

Slide 88 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis
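On a Unix-like system some of these attributes can be queried with the POSIX stat() call. A small sketch (not from the slides); the file name is just an example.

    #include <stdio.h>
    #include <time.h>
    #include <sys/stat.h>

    int main(void) {
        struct stat st;
        if (stat("readme.txt", &st) != 0) { perror("stat"); return 1; }

        printf("size         : %lld bytes\n", (long long)st.st_size);
        printf("access rights: %o\n", (unsigned)(st.st_mode & 0777));
        printf("is directory : %s\n", S_ISDIR(st.st_mode) ? "yes" : "no");
        printf("last modified: %s", ctime(&st.st_mtime));
        return 0;
    }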

File Types
Files

Regular files (Windows and Unix)
  Text files (also termed ASCII files)
    Contain bytes (words in Unicode) according to a standardized character set, such as EBCDIC, ASCII or Unicode. The content is directly printable (screen, printer). Data.
  Binary files
    Contents not intended to be printed (at least not directly). The content has meaning only to those programs using the files. Program (binary executable) or data.
Directories (Windows and Unix)
  Files for maintaining the logical structure of the file system.
Block special files, character special files (Unix)

Slide 89 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Directories

A directory is a named logical place to put files in.

Single-level directory
  Early operating systems (CP/M, MS-DOS 1.0). Still used in tiny embedded systems. File names are unique.
  The directory entry for a file (e.g. "records") points to the file content on the storage media. (Figure from [Sil00 p.360])

Slide 90 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Directories

Two-level directory
  Root directory with sub-directories user1, user2, user3, user4. Hierarchical structure (tree of depth 1).
  Absolute file names, relative file names, path names:
    /user1/test, /user3/test, test, ../user4/data, /user3
  Absolute file names are unique. (Figure from [Sil00 p.361], modified)

Slide 91 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Directories

Multi-level directory
  Tree of directories of arbitrary depth (levels). (Figure from [Sil00 p.363])

Slide 92 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Multi-Level Directory
Directories

Generalization of the two-level directory.
Hierarchical structure of arbitrary depth
  Tree structure, graph structure. Logical organization structure.
One root directory
  Arbitrary number of sub (sub sub ...) directories.
Efficient file search
  Tree / graph traversing routines. Much faster than sequential search.
Logical grouping
  System files, user files, shared files, ...
Most common structure.

Slide 93 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Multi-Level Directory
Directories

Acyclic-graph directory structure
  Additional directory entries (links).
  Shared directories, shared files.
  More than one absolute name for a file (or a directory).
  Dangling link problem.
(Figure from [Sil00 p.365])

Slide 94 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Multi-Level Directory
Directories

General graph directory structure
  Allowing links to point to directories creates the possibility of cycles.
  Avoiding cycles: forbid any links to directories (but then there are no more shared directories), or use a cycle detection algorithm.
(Figure from [Sil00 p.365])

Slide 95 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

File System Management

Now turning from the user's view to the implementor's view. Users are concerned with how files are named, what operations are allowed and what the directories look like. Implementors are interested in
  how files and directories are stored on the disk,
  how the disk space is managed, and
  how to make everything work efficiently.

Slide 96 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

File System Management


Storage Media (67) Magnetic Disks (71) Files and Directories (81, 90) File Implementation (98) Directory Implementation (114) Free Block Management (124) File System Layout (129) Cylinder skew, disk scheduling (135) Floppy Disks (145)
Slide 97 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

File Implementation
The most important issue in implementing files is the way in which the available disk space is allocated to files.

Contiguous Allocation Linked Allocation


Chained Blocks Chained Pointers

Indexed Allocation
Slide 98 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Contiguous Allocation
File Implementation

Each file occupies a set of contiguous blocks on the disk. A file is defined by a disk address (first block) and by its length in block units.

Advantages
  Simple implementation
    For each file we just need to know its start block and its length.
  Fast access
    Access in one continuous operation. Minimum head seeks.
Disadvantages
  Disk fragmentation
    Problem of finding space for a new file.
  The final file size must be known in advance!

Slide 99 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Contiguous Allocation
File Implementation

(a) Contiguous allocation of disk space for 7 files.
(b) State of the disk after files D and E have been removed.
(Figure from [Ta01 p.401])

Slide 100 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Contiguous Allocation
File Implementation

External fragmentation
  Free disk space is broken into chunks (holes) which are spread all over the disk. New files are put into available holes, often not filling them up entirely and thus leaving smaller holes. A big problem arises when the largest available hole is too small for a new file.
Internal fragmentation
  A file usually does not fill up its last block entirely, so the remaining space in the block is left unused.

Slide 101 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Linked Allocation
File Implementation

Each file is a linked list of disk blocks. The blocks may be scattered anywhere on the disk. Each block holds, besides its data, a pointer to the next block. The pointer is a number (a block number).
(Chained blocks: data + next, data + next, ..., data + nil)

Slide 102 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Linked Allocation
File Implementation

Example: the file "jeep" starts with block 9. It consists of the blocks 9, 16, 1, 10, and 25, in this order. (Figure from [Sil00 p.380])

Slide 103 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Linked Allocation
File Implementation

Advantages
  Simple implementation
    Only the first block number is needed.
  No external fragmentation
    Files consist of blocks scattered on the disk. No more useless blocks on disk.
Disadvantages
  Free space management
    Somehow all the free blocks must be recorded in some free-block pool.
  Higher access time
    More seeks to access the whole file, owing to block scattering.
  Space reduction
    Some bytes of each block are needed for the pointer.
  Reliability
    If a pointer is broken, the remainder of the file is inaccessible.
  Not efficient for random access
    To get to block k we must walk along the chain.

Slide 104 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Linked Allocation
File Implementation

In particular the last disadvantage of the chained-blocks allocation method, its unsuitability for random access to files, led to the chained-pointers allocation method.

A table contains as many entries as there are disk blocks. The entries are numbered by block number. The block numbers of a file are linked in this table in chain manner (as with chained blocks). This table is called the file allocation table (FAT). (Figure from [Sil00 p.382])

Slide 105 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Linked Allocation
File Implementation

Chained block allocation versus chained pointer allocation (FAT). The FAT is stored on disk and is loaded into memory when the operating system starts up. (Figures from [Ta01 p.403,404], modified)

Slide 106 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis
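Following a FAT chain to find block k of a file is a simple loop over the table. The sketch below (not from the slides) uses the block numbers of the "jeep" example; the table size and the end-of-chain marker are illustrative choices.

    #include <stdio.h>

    #define FAT_EOF (-1)                    /* end-of-chain marker (illustrative) */

    /* Return the disk block holding file block k, by walking the chain. */
    static int block_of_file(const int *fat, int first_block, int k)
    {
        int b = first_block;
        while (k-- > 0 && b != FAT_EOF)
            b = fat[b];                     /* one table lookup per file block */
        return b;
    }

    int main(void) {
        int fat[32] = {0};                  /* 0 marks a free block in the FAT scheme */
        fat[9] = 16; fat[16] = 1; fat[1] = 10; fat[10] = 25; fat[25] = FAT_EOF;

        for (int k = 0; k < 5; k++)         /* file "jeep": blocks 9, 16, 1, 10, 25 */
            printf("file block %d -> disk block %d\n", k, block_of_file(fat, 9, k));
        return 0;
    }

The counting happens on the table cached in memory, not on the disk blocks themselves, which is exactly why chained pointers remain usable for random access.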

Linked Allocation
File Implementation

Chained pointers

Advantages
  Simple implementation
    One simple table for both file allocation and the free-block pool.
  Whole block available for data
    No more pointers taking away data space.
  Suitable for random access
    Although the principle of getting to block k did not change, the search (counting) is now done on the block numbers, not on the blocks themselves.
Disadvantages
  FAT takes up disk space and memory (when cached)
    One table entry for each disk block. Table size is proportional to disk size.
  Higher access time (compared to contiguous allocation)
    It still needs many seeks to collect all the scattered blocks.

Slide 107 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Indexed Allocation
File Implementation

Each file is assigned an index block. The index block is an array of block numbers, listing in order the blocks belonging to the file. To get to block k of a file, one reads the k-th entry of the index block.

Slide 108 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Indexed Allocation
File Implementation

Example: the file "jeep" is described by index block 19. The index block has 8 entries, of which 5 are used. (Figure from [Sil00 p.383])

Slide 109 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Indexed Allocation
File Implementation

Advantages
  Good for random access
    Fast determination of block k of a file.
  Lower memory occupation
    Only for those files currently in use (open files) are the corresponding index blocks loaded into memory.
  Lower disk space occupation
    Only as many index blocks are needed as there are files in the file system.
Disadvantages
  Free block management
    A separate free-block pool must be available.
  Index block utilization
    Unused entries in an index block waste space.

Index blocks are also called index nodes, short i-nodes or inodes.

Slide 110 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Indexed Allocation
File Implementation

What if a file needs more blocks than there are entries available in an index block?

Linked index blocks
  The last entry in an index block points to another index block (chaining).
Multilevel index blocks
  An entry does not point to the data, but points to a first-level index block (single indirect block) which then points to the data. Optionally, additional levels are available through second-level and third-level index blocks.
Combined scheme
  Most entries point to the data directly. The remaining entries point to first-level, second-level and third-level index blocks. Used by Unix.

Slide 111 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Indexed Allocation
File Implementation

Combined scheme example (Unix V7), from [Ta01 p.447], modified.
Note: the inodes are not disk blocks, but records stored in disk blocks. The single / double / triple indirect blocks are disk blocks.

Slide 112 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis
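As a sketch of how the combined scheme resolves file block k, the following C fragment (not from the slides) handles the direct entries and a single indirect block; the entry counts, the inode layout and the read_block() stub are simplifying assumptions.

    #include <stdio.h>

    #define NDIRECT   10        /* direct block numbers in the inode (assumed) */
    #define PER_BLOCK 128       /* block numbers per indirect block (assumed)  */

    struct inode {
        int direct[NDIRECT];
        int single_indirect;    /* number of a block that lists further blocks */
    };

    /* Stub standing in for a disk read of an indirect block. */
    static void read_block(int blockno, int out[PER_BLOCK]) {
        for (int i = 0; i < PER_BLOCK; i++)
            out[i] = 1000 + blockno + i;           /* dummy contents */
    }

    static int block_of_file(const struct inode *ino, int k)
    {
        if (k < NDIRECT)
            return ino->direct[k];                 /* no extra disk access */
        k -= NDIRECT;
        if (k < PER_BLOCK) {
            int ind[PER_BLOCK];
            read_block(ino->single_indirect, ind); /* one extra disk read  */
            return ind[k];
        }
        return -1;   /* double / triple indirect blocks would continue here */
    }

    int main(void) {
        struct inode ino = { {5, 9, 12, 20, 31, 33, 40, 41, 50, 52}, 7 };
        printf("file block 3  -> disk block %d\n", block_of_file(&ino, 3));
        printf("file block 15 -> disk block %d\n", block_of_file(&ino, 15));
        return 0;
    }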

File System Management


Storage Media (67) Magnetic Disks (71) Files and Directories (81, 90) File Implementation (98) Directory Implementation (114) Free Block Management (124) File System Layout (129) Cylinder skew, disk scheduling (135) Floppy Disks (145)
Slide 113 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Directory Implementation
Before accessing a file, the file must first be opened by the operating system. For that, the OS uses the path name supplied by the user to locate the directory entry. A directory entry provides the name of the file, the information needed to find the blocks of the file, and information about the file's attributes.
Slide 114 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Directory Entry
Directory Implementation

Attribute placement: the attributes may be stored
a) together with the file name in the directory entry (MS-DOS, VMS), or
b) outside the directory entry (Unix).
(Figure from [Ta01 p.406])

Slide 115 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Directory Entry
Directory Implementation

MS-DOS directory entry (figure from [Ta01 p.440]): directory entry size is 32 byte.
File attributes are stored in the entry. The first block number points to the first file block, respectively to the corresponding entry in the FAT (DOS uses chained pointers).

Slide 116 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis
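For illustration, the classic 32-byte MS-DOS (FAT) directory entry can be written down as a C structure. This is a sketch under the usual FAT-16 field layout; the field names are mine, and real FAT code must of course follow the published specification.

    #include <stdint.h>

    #pragma pack(push, 1)
    struct dos_dir_entry {
        char     name[8];       /* base name, space padded                 */
        char     ext[3];        /* extension                               */
        uint8_t  attributes;    /* A D V S H R flags (see slide 118)       */
        uint8_t  reserved[10];  /* unused in early FAT versions            */
        uint16_t time;          /* time of last change                     */
        uint16_t date;          /* date of last change                     */
        uint16_t first_block;   /* first block (cluster): entry into the FAT chain */
        uint32_t size;          /* file size in bytes                      */
    };
    #pragma pack(pop)

    /* 8 + 3 + 1 + 10 + 2 + 2 + 2 + 4 = 32 bytes, as stated on the slide. */
    _Static_assert(sizeof(struct dos_dir_entry) == 32, "directory entry must be 32 bytes");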

Directory Entry
Directory Implementation

Unix directory entry (Unix V7), figure from [Ta01 p.440]: entry size is 16 byte. Modern Unix versions allow for longer file names.
The file attributes (such as the date of file creation) are stored in the inode. The rest of the inode points to the file blocks.

Slide 117 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Directory Entry
Directory Implementation

MS-DOS file attributes: ADVSHR
  A: Archive flag
  D: Directory flag
  V: Volume label flag
  S: System file flag
  H: Hidden flag
  R: Read-only flag

Slide 118 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Directory Implementation

An MS-DOS directory (not the entry) is itself a file (a binary file) with the file type attribute set to "directory". The disk blocks pointed to contain other directory entries (each again of 32 byte size) which either depict files or subsequent directories (sub-directories). Upon installing an MS-DOS file system, a root directory is created automatically.
Similar applies to Unix: when the file type attribute is set to "directory", the file blocks contain directory entries.
Windows 2000 and descendants (NTFS) treat directories as entities different from files.

Slide 119 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Directory Implementation

MS-DOS directory (figure): a directory entry with the directory attribute set points to disk blocks containing further directory entries; a directory entry with the directory attribute not set (regular file) points to disk blocks containing file data.

Slide 120 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

File Lookup
Directory Implementation

How to find a file name in a directory:

Linear search
  Each directory entry has to be compared against the search name (string compare). Slow for large directories.
Binary search
  Needs a directory sorted by name. Entering and deleting files requires moving directory entries around in order to keep them sorted (insertion sort).
Hash table
  In addition to each file name, a hash value (a number) is created and stored. The search is then done on the hash value, not on the name.
B-tree
  File names are nodes and leaves in a balanced tree. Used by NTFS.

Slide 121 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

File Lookup
Directory Implementation

The steps in looking up the file /usr/ast/mbox in classical Unix. (Figure from [Ta01 p.447])

Slide 122 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis
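A hash-table directory lookup in miniature: hash the name once, then string-compare only the entries in the matching bucket. The sketch below (not from the slides) uses made-up entries, a toy hash function and a tiny bucket count.

    #include <stdio.h>
    #include <string.h>

    #define BUCKETS 8

    struct entry { const char *name; int first_block; struct entry *next; };

    static unsigned hash(const char *s) {
        unsigned h = 0;
        while (*s) h = h * 31 + (unsigned char)*s++;
        return h % BUCKETS;
    }

    int main(void) {
        struct entry mbox = {"mbox", 42, NULL}, notes = {"notes", 17, NULL};
        struct entry *dir[BUCKETS] = {0};
        struct entry *files[] = {&mbox, &notes};

        for (int i = 0; i < 2; i++) {              /* insert entries into buckets */
            unsigned b = hash(files[i]->name);
            files[i]->next = dir[b];
            dir[b] = files[i];
        }

        const char *wanted = "mbox";               /* lookup */
        for (struct entry *e = dir[hash(wanted)]; e; e = e->next)
            if (strcmp(e->name, wanted) == 0)
                printf("%s starts at block %d\n", e->name, e->first_block);
        return 0;
    }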

File System Management


Storage Media (67) Magnetic Disks (71) Files and Directories (81, 90) File Implementation (98) Directory Implementation (114) Free Block Management (124) File System Layout (129) Cylinder skew, disk scheduling (135) Floppy Disks (145)
Slide 123 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Free Block Management

To keep track of the blocks available for allocation (free blocks), the operating system must maintain a free block pool. When a file is created, the pool is searched for free blocks. When a file is deleted, the freed blocks are added to the pool. File systems using a FAT do not need a separate free block pool: free blocks are simply marked in the table by a 0.

Free block pool implementations: Linked List, Free List, Bit Map.

Slide 124 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Free Block Management

Linked List
  The free blocks form a linked list where each block points to the next one (chained blocks). (Figure from [Sil00 p.388])

  Simple implementation
    Only the first block number is needed.
  Quick access
    New blocks are prepended (LIFO principle).
  Disk I/O
    Updating the pointers involves I/O.
  Block modification
    The modified content hinders undeleting a block.

Slide 125 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Free Block Management

Free List
  The free block numbers are listed in a table. The table is stored in disk blocks. The table blocks may be linked together. (Figure from [Ta01 p.413])

  Space
    Each free block requires 4 bytes in the table.
  Management
    Adding and deleting block numbers needs time, in particular when a table block is almost full (additional disk I/O required).

Slide 126 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Free Block Management

Bit Map
  To each existing block on the disk a bit is assigned. When a block is free, the bit is set; when the block is occupied, the bit is reset (or vice versa). All bits form a bit map. (Figure from [Ta01 p.413])

  Compact
    Each block is represented by a single bit. Fixed size.
  Logical order
    Neighboring bits represent neighboring blocks (logical order). Quite easy to find contiguous blocks, or blocks located close together.
  Conversion block number <-> bit position
    From the block number the corresponding bit position must be calculated, and vice versa.

Slide 127 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

File System Management

Storage Media (67), Magnetic Disks (71), Files and Directories (81, 90), File Implementation (98), Directory Implementation (114), Free Block Management (124), File System Layout (129), Cylinder skew, disk scheduling (135), Floppy Disks (145)

Slide 128 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis
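The block number <-> bit position conversion is a division and a remainder. A minimal sketch (not from the slides), with the disk size and the "free means set" convention chosen arbitrarily:

    #include <stdio.h>
    #include <stdint.h>

    #define NBLOCKS 64                      /* example disk size in blocks */
    static uint8_t bitmap[NBLOCKS / 8];     /* one bit per block           */

    static void mark_free(int b) { bitmap[b / 8] |=  (uint8_t)(1u << (b % 8)); }
    static void mark_used(int b) { bitmap[b / 8] &= (uint8_t)~(1u << (b % 8)); }
    static int  is_free(int b)   { return (bitmap[b / 8] >> (b % 8)) & 1; }

    int main(void) {
        for (int b = 0; b < NBLOCKS; b++) mark_free(b);   /* empty disk      */
        mark_used(9); mark_used(10); mark_used(11);       /* allocate blocks */

        for (int b = 0; b < NBLOCKS; b++)                 /* first free block */
            if (is_free(b)) { printf("first free block: %d\n", b); break; }
        printf("block 10 free? %s\n", is_free(10) ? "yes" : "no");
        return 0;
    }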

File System Layout

Each partition starts with a boot block (first block), which is followed by the file system. The boot block may be modified by the file system. (Figure from [Ta01 p.400], modified)

Slide 129 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

File System Layout

Layout of the FAT file system:
  Boot block | FAT | FAT copy | Root dir | Files and directories
Information about the file system location is stored in the boot block. The FAT copy exists for reliability reasons. The number of entries in the root directory is limited, except for FAT-32 where the root directory is a cluster chain.

Microsoft FAT-32 specification at http://www.microsoft.com/whdc/system/platform/firmware/fatgen.mspx

Slide 130 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

File System Layout

Possible file system layout for a UNIX file system:
  Boot block | Super block | Inodes | Root dir | Files and directories | Free block pool
The super block holds information about the file system (block size, volume label, size of the inode list, next free inode, next free block, ...). The inode for the root directory is located at a fixed place. Bit map free block management.

Slide 131 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

File System Layout

Layout of the NTFS file system:
  Boot block | MFT | System files | File area
Information about the file system location is stored in the boot block.
MFT: Master File Table, a linear sequence of 1 kB records; each record describes one file or directory. The MFT is a file and may be located anywhere on the disk.
System files: files for storing metadata about the file system. Actually, the MFT itself is a system file.

More about NTFS: http://www.ntfs.com/ntfs_basics.htm

Slide 132 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

File System Management

Storage Media (67), Magnetic Disks (71), Files and Directories (81, 90), File Implementation (98), Directory Implementation (114), Free Block Management (124), File System Layout (129), Cylinder skew, disk scheduling (135), Floppy Disks (145)

Slide 133 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Cylinder Skew
Disk Performance

Cylinder skew example (physical disk geometry, figure from [Ta01 p.316]):
Assumption: reading from the inner tracks towards the outer tracks. Here: skew = 3 sectors. After the head has moved to the next track, sector 0 arrives just in time, so reading can continue right away. Performance improvement when reading multiple tracks.

Slide 134 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Disk Scheduling
Disk Performance

Modern disk drives are addressed as large one-dimensional arrays of logical blocks, where the logical block is the smallest unit of transfer. The array of logical blocks is mapped onto the sectors of the disk sequentially. Sector 0 is the first sector of the first track on the outermost cylinder. Mapping proceeds in order through that track, then through the rest of the tracks in that cylinder, and then through the rest of the cylinders from outermost to innermost.
However, it is difficult to convert a logical block into CHS: the disk may have defective sectors which are replaced by spare sectors from elsewhere on the disk, and owing to zone bit recording the number of sectors per track is not the same for all cylinders. (After [Sil00 p.436])

Slide 135 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Disk Scheduling
Disk Performance

Fast access desired (high disk bandwidth)
  Disk bandwidth is the total number of bytes transferred, divided by the total time from the first request for service until completion of the last transfer.
Bandwidth depends on
  Seek time: the time for the disk to move the heads to the cylinder containing the desired sector.
  Rotational latency: the additional time spent waiting for the disk to rotate the desired sector to the disk head.

Seek time is roughly proportional to the seek distance. Scheduling goal: minimizing the seek time.
Scheduling was done in earlier days by the OS; nowadays it is done either by the OS (which then has to guess the physical disk geometry) or by the integrated disk drive controller.

Slide 136 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis
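Under the idealized assumptions above (no spare sectors, a fixed number of sectors per track) the logical block <-> CHS conversion is straightforward. A sketch (not from the slides), with an example geometry resembling the 360 kB floppy discussed later in the deck:

    #include <stdio.h>

    struct chs { int c, h, s; };   /* cylinder, head, sector (sector counted from 1) */

    static long chs_to_lba(struct chs a, int heads, int spt) {
        return ((long)a.c * heads + a.h) * spt + (a.s - 1);
    }

    static struct chs lba_to_chs(long lba, int heads, int spt) {
        struct chs a;
        a.c = (int)(lba / ((long)heads * spt));
        a.h = (int)((lba / spt) % heads);
        a.s = (int)(lba % spt) + 1;
        return a;
    }

    int main(void) {
        int heads = 2, spt = 9;                 /* 2 sides, 9 sectors per track */
        struct chs a = {2, 0, 6};
        long lba = chs_to_lba(a, heads, spt);
        struct chs b = lba_to_chs(lba, heads, spt);
        printf("CHS(%d,%d,%d) -> logical block %ld -> CHS(%d,%d,%d)\n",
               a.c, a.h, a.s, lba, b.c, b.h, b.s);
        return 0;
    }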

Disk Scheduling
Disk Performance

Scheduling algorithms:
  First-Come First-Served (FCFS)
  Shortest Seek Time First (SSTF)
  SCAN
  C-SCAN
  C-LOOK

For the following examples: assume there are 200 tracks on a single-sided disk. Read requests are queued; the queue currently holds requests for tracks 98, 183, 37, 122, 14, 124, 65 and 67. The head is at track 53.

Slide 137 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

FCFS
Disk Scheduling

The requests are serviced in the order of their entry (the first entry is served first). (Figure from [Sil00 p.437]: head position over the tracks versus time)

Slide 138 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

SSTF
Disk Scheduling

The next request served is the one that is closest to the current head position (shortest seek time). (Figure from [Sil00 p.438]: track versus time)

Slide 139 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

SCAN
Disk Scheduling

The disk arm starts at one end of the disk and sweeps over to the other end, thereby servicing the requests. At the other end the head reverses direction and servicing continues on the return trip. (Figure from [Sil00 p.439]: track versus time)

Slide 140 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis
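To make the difference concrete, the sketch below (not from the slides) computes the total head movement of FCFS and SSTF for the request queue introduced on slide 137 (head at track 53). For this queue FCFS moves the head over 640 tracks in total, SSTF over 236.

    #include <stdio.h>
    #include <stdlib.h>

    #define N 8

    static int fcfs(int head, const int *req) {
        int moved = 0;
        for (int i = 0; i < N; i++) { moved += abs(req[i] - head); head = req[i]; }
        return moved;
    }

    static int sstf(int head, const int *req) {
        int pending[N], moved = 0;
        for (int i = 0; i < N; i++) pending[i] = req[i];
        for (int done = 0; done < N; done++) {
            int best = -1;
            for (int i = 0; i < N; i++)        /* pick the closest pending request */
                if (pending[i] >= 0 &&
                    (best < 0 || abs(pending[i] - head) < abs(pending[best] - head)))
                    best = i;
            moved += abs(pending[best] - head);
            head = pending[best];
            pending[best] = -1;                /* mark request as served */
        }
        return moved;
    }

    int main(void) {
        int req[N] = {98, 183, 37, 122, 14, 124, 65, 67};   /* queue from slide 137 */
        printf("FCFS: %d tracks of head movement\n", fcfs(53, req));
        printf("SSTF: %d tracks of head movement\n", sstf(53, req));
        return 0;
    }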

C-SCAN
Disk Scheduling

The disk arm starts at one end of the disk and sweeps over to the other end, thereby servicing the requests. At the other end the head returns to the beginning of the disk without servicing requests on the return trip. (Figure from [Sil00 p.440]: track versus time)

Slide 141 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

C-LOOK
Disk Scheduling

Like SCAN or C-SCAN, but the head moves only as far as the final request in each direction. (Figure from [Sil00 p.441]: track versus time)

Slide 142 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Disk Scheduling
Disk Performance

SSTF is common and has a natural appeal.
SCAN and C-SCAN perform better for systems that place a heavy load on the disk.
Performance depends on the number and types of requests. Requests for disk service are influenced by the file allocation method.
Either SSTF or LOOK is a reasonable choice as the default algorithm.

Slide 143 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

File System Management

Storage Media (67), Magnetic Disks (71), Files and Directories (81, 90), File Implementation (98), Directory Implementation (114), Free Block Management (124), File System Layout (129), Cylinder skew, disk scheduling (135), Floppy Disks (145)

Slide 144 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Floppy Disks

Portable storage media:
  8" floppy (1969), capacity 80K ... 1.2M
  5.25" floppy (1978), capacity 360K ... 1.2M
  3.5" floppy (1987), capacity 720K, 1.44 MB
(Figure from www.computermuseum.li)

Floppy disks have now been almost completely displaced by flash memory (e.g. USB sticks), except for the purpose of booting computers (bootable floppies).

Slide 145 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Floppy Disks

Two-sided floppy disk (figure): sector numbering on page 0 (front) and page 1 (back), track index (0, 1, 2, ...), start of the tracks, rotation direction.
Sector identification example: BIOS (CHS, i.e. side, track, sector) 0, 2, 6 corresponds to BDOS logical sector 42.

Slide 146 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Floppy Disks
BIOS = Basic Input Output System. Stored in (EP)ROM. Sector access through invoking a software interrupt and addressing a sector by means of CHS (page, track, sector), e.g. BIOS 0,2,6.
BDOS = Basic Disk Operating System. Originates from the CP/M operating system. Higher abstraction level than BIOS. Sector access through invoking a software interrupt and addressing a sector by means of a logical consecutive sector number (1, 2, ...), e.g. BDOS 42.
[Figure: sector numbering on a two-sided floppy disk, with track start and rotation direction marked]
Slide 147 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Floppy Disks: Sector Structure
Each sector consists of an address field and a data field:
  Address field: Sync, IAM, track, head, sector, sector length, CRC
  Data field: DAM, data bytes (128-1024), ECC, CRC / inter-record gap
CRC: Cyclic Redundancy Check. ECC: Error Checking/Correction. IAM: Index Address Mark. DAM: Data Address Mark.
Slide 148 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Floppy Disks
Starting sector numbers for system and data areas (FAT file system). All numbers are in decimal notation.

Disk      Boot sector   FAT 1   FAT 2   Root dir   Data
360 K     1             2       4       6          13
720 K     1             2       5       8          15
1.2 M     1             2       9       16         30
1.44 M    1             2       11      20         34
Slide 149 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Floppy Disks
[Figure: Track 0 of a 360 kB floppy disk. Track 0, page 0 holds the bootstrap loader (sector 1), the two FAT copies and the first root directory sectors; track 0, page 1 (Spur 0, Seite 1) holds the remaining root directory sectors and the first data sectors. Dir. = allocated space for the root directory.]
Slide 150 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis
Computer Architecture

Process Management
Slide 151 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Process Management
  Processes (153)
  Threads (178)
  Interprocess Communication (IPC) (195)
  Scheduling (247)
  Real-Time Scheduling (278)
  Deadlocks (318)
Slide 152 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Processes
A process is a set of identifiable, repeatable actions which are ordered in some way and contribute to the fulfillment of an objective.
(General definition)

Process Model
Several processes are working quasi-parallel. A process is a unit of work. Conceptually, each process has its own virtual CPU.
In reality, the real CPU switches back and forth from process to process. Processes make progress over time Processes

A process is a program in execution.


(Computer oriented definition)

Program: static, passive


A cooking recipe is a program.

Process: dynamic, active


Acting according to the recipe (cooking) is a process.
Slide 153 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Sequential view
Slide 154

Process model view


Figure from [Ta01 p.72] Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Processes
A process may be described as either (more or less)
a) CPU-bound: spends more time doing computations; few, very long CPU bursts.
b) I/O-bound: spends more time doing I/O than computations; many short CPU bursts.
Figure from [Ta01 p.134]
Slide 155 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Address Space
A process is an executing program, and encompasses the current values of the program counter, of the registers, of the variables and of the stack.
  Code section (text section or segment): the actual program code (the machine instructions).
  Data section (data segment): contains global variables (global to the process, not global to the computer system).
  Stack section (stack segment): contains temporary data (local variables, return addresses, function parameters).
[Figure: CPU registers (PC, SP, CS, DS, SS) pointing into the process sections in memory]
Slide 156 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Process States
  New: The process is created. Resources are allocated.
  Ready: The process is ready to be (re)scheduled.
  Running: The CPU is allocated to the process, that is, the program instructions are being executed.
  Waiting: The process is waiting for some event to occur. Without this event the process cannot continue even if the CPU were free.
  Terminated: Work is done. The process is taken off the system (off the queues) and its resources are freed.

Note: Only in the running state does the process need CPU cycles; in all other states it is actually frozen (or no longer exists).
Figure from [Sil00 p.89]
Slide 157 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis
Slide 158 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis
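As a minimal illustration (not part of the slides), the five states and the transitions allowed by the state diagram can be written down as a small C table; the function name and layout are only a sketch.

#include <stdio.h>

/* States and legal transitions from the diagram:
   new->ready, ready->running, running->ready/waiting/terminated, waiting->ready */
enum state { NEW, READY, RUNNING, WAITING, TERMINATED };

static int transition_allowed(enum state from, enum state to)
{
    switch (from) {
    case NEW:     return to == READY;
    case READY:   return to == RUNNING;
    case RUNNING: return to == READY || to == WAITING || to == TERMINATED;
    case WAITING: return to == READY;
    default:      return 0;            /* TERMINATED: no further transitions */
    }
}

int main(void)
{
    printf("running -> waiting allowed? %d\n", transition_allowed(RUNNING, WAITING)); /* 1 */
    printf("waiting -> running allowed? %d\n", transition_allowed(WAITING, RUNNING)); /* 0: must go through ready */
    return 0;
}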

Process Creation
Events at which processes are created:
  Operating System Start-Up: Most of the system processes are created here. A large portion of them are background processes (daemons).
  Interactive User Request: A user requests an application to start.
  Batch job: Jobs that are scheduled to be carried out when the system has the resources available (e.g. calendar-driven events, low priority jobs).
  Existing process gives birth: An existing process (e.g. a user application or a system process) creates a process to carry out some related (sub)tasks.
Slide 159 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Process Creation
A parent process creates a child process, which in turn may create other processes, forming a tree of processes.
System calls for creating a child process: Unix: fork(), Windows: CreateProcess().
  Resource sharing: Parent and child share no resources, or parent and child share all resources, or the child shares a subset of the parent's resources.
  Execution: Parent and child execute concurrently, or the parent waits until the child terminates.
  Address space: The child is a copy of the parent, or the child has a program loaded into it.
Slide 160 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

fork() example
getpid(): a system call that tells a process its pid (process identifier), which is a unique process number within the system.

#include <stdio.h>
#include <unistd.h>

void main()
{
    int result;
    printf("Parent, my pid is: %d\n", getpid());
    result = fork();                  /* from here on: think parallel */
    if (result == 0) {                /* child only */
        printf("Child, my pid is: %d\n", getpid());
        ...
    } else {                          /* parent only */
        printf("Parent, my pid is: %d\n", getpid());
        ...
    }
}
Slide 161 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

fork() example
Terminal output:
    Parent, my pid is: 189
    Child, my pid is: 190
    Parent, my pid is: 189
The output order depends on whether parent or child is scheduled first after fork().
[Figure: before fork() a single process (pid = 189) executes the code; after fork() the parent (pid = 189) and the child (pid = 190) both continue right after the fork() call.]
Slide 162 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis
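The slides' example lets parent and child run independently. If the parent should wait until the child terminates (one of the execution options listed above), the standard Unix way is wait()/waitpid(); the following is a minimal sketch, not from the slides.

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/wait.h>

int main(void)
{
    pid_t pid = fork();
    if (pid == 0) {                        /* child */
        printf("Child, my pid is: %d\n", getpid());
        exit(42);                          /* child's exit status */
    }
    int status;                            /* parent waits for the child */
    waitpid(pid, &status, 0);
    if (WIFEXITED(status))
        printf("Parent: child %d exited with status %d\n",
               (int)pid, WEXITSTATUS(status));
    return 0;
}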

Process Creation
[Figure: a Unix process tree, from Sil00 p.96]
Slide 163 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Process Termination
Events at which processes are terminated:
  Process asks the OS to delete it: Work is done. Resources are deallocated (memory is freed, open files are closed, used I/O buffers are flushed).
  Parent terminates child: The child may have exceeded its allocated resources, the task assigned to the child is no longer required, or the parent itself is exiting. Some OS do not allow a child to continue when its parent terminates: cascading termination (a sub tree is deleted).
System calls for self-termination: Unix: exit(), Windows: ExitProcess().
Figure from [Sil00 p.96]
Slide 164 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Process Control Block


Processes

Process Control Block


Processes

Operating system maintains a process table Each entry represents a process Entry often termed PCB (process control block)
A PCB contains all information about a process that must be saved when the process is switched from running into waiting or ready, such that it can later be restarted as if it had never been stopped. Info regarding process management, regarding memory occupation and open files. PCB example
Slide 165 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis Figure from [Sil00 p.89]

Typical fields of a PCB


Table from [Ta01 p.80] Slide 166 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Context Switch
Processes

Context Switch
Processes
context switch

The task of switching the CPU from one process to another is termed context switch (sometimes also process switch):

Saving the state of old process


Saving the current context of the process in its PCB.

context switch time

Loading the state of new process


Restoring the former context of the process from its PCB.

Context switching is pure administrative overhead. The duration of a switch lies in the range of 1 ... 1000 µs. The switch time depends on the hardware; processors with multiple sets of registers are faster at switching. Context switching poses a certain bottleneck, which is one reason for the introduction of threads.
Slide 167 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis Slide 168

context switch

context switch time

Figure from [Sil00 p.90] Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Scheduling
On a uniprocessor system there is only one process running; all others have to wait until they are scheduled. They are waiting in some scheduling queue:
  Job Queue: Holds the future processes of the system.
  Ready Queue (also called CPU queue): Holds all processes that reside in memory and are ready to execute.
  Device Queue (also called I/O queue): Each device has a queue holding the processes waiting for I/O completion.
  IPC Queue: Holds the processes that wait for some IPC (inter process communication) event to occur.
[Figure: the ready queue and some device queues (tape, Ethernet, disk, terminal); some queues are currently empty. From Sil00 p.92, modified.]
Slide 169 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis
Slide 170 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Scheduling
From the job queue a new process is initially put into the ready queue. It waits until it is dispatched (selected for execution). Once the process is allocated the CPU, one of these events may occur:
  Interrupt: The time slice may have expired or some higher priority process is ready. Hardware error signals (exceptions) may also cause a process to be interrupted.
  I/O request: The process requests I/O and is shifted to a device queue. After the I/O device is ready, the process is put back into the ready queue to continue.
  IPC request: The process wants to communicate with another process through some blocking IPC feature. Like I/O, but here the "I/O device" is another process.
A note on the terminology: Strictly speaking, a process (in the sense of an active entity) only exists when it is allocated the CPU. In all other cases it is a dead body.
Slide 171 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Scheduling
[Figure: queueing diagram of process scheduling. New processes enter via the job queue, ready processes wait in the ready queue, the running process may return to the ready queue (interrupt), move to a device queue (I/O request) or to the IPC queue (IPC request).]
Slide 172 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Scheduling
The OS selects processes from queues and puts them into other queues. This selection task is done by schedulers.
Processes

Scheduling
Processes

Long-term scheduler

Short-term scheduler

Long-term Scheduler
Originates from batch systems. Selects jobs (programs) from the pool and loads them into memory. Invoked rather infrequently (seconds ... minutes). Can be slow. Has influence on the degree of multiprogramming (number of processes in memory). Some modern OS do not have a long-term scheduler any more.

Job queue

Ready queue

CPU

Short-term Scheduler
Selects one process from among the processes that are ready to execute, and allocates the CPU to it. Initiates the context switches. Invoked very frequently (in the range of milliseconds). Must be fast, that is, must not consume much CPU time compared to the processes.
Slide 173 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Schedulers and their queues


Slide 174 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Scheduling
Sometimes it may be advantageous to remove processes temporarily from memory in order to reduce the degree of multiprogramming. At some later time the process is reintroduced into memory and can be continued. This scheme is called swapping, performed by a medium-term scheduler.
[Figure: job queue, ready queue and CPU, with the medium-term scheduler swapping processes out to and in from a swap queue]
Slide 175 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Process Concept
  Program in execution: Several processes may be carried out in parallel.
  Resource grouping: Each process is related to a certain task and groups together the required resources (address space, PCB).
Traditional multi-processing systems:
  Each process is executed sequentially: no parallelism inside a process.
  Blocked operations block the process: Any blocking operation (e.g. I/O, IPC) blocks the process. The process must wait until the operation finishes.
  In traditional systems each process has a single thread of control.
Slide 176 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Process Management
Processes (153) Threads (178) Interprocess Communication (IPC) (195) Scheduling (247) Real-Time Scheduling (278) Deadlocks (318)

Threads
A thread is a piece of yarn, a screw spire, a line of thoughts. Here: a sequence of instructions
that may execute in parallel with others A thread is a line of execution within the scope of a process. A single threaded process has a single line of execution (sequential execution of program code), the process and the thread are the same. In particular, a thread is

a basic unit of CPU utilization.


Slide 177 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis Slide 178 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Threads

Threads
As an example, consider a word processing application.

- Reading from keyboard - Formatting and displaying pages - Periodically saving to disk - ... and lots of other tasks
A single threaded process would quite quickly result in an unhappy user since (s)he always has to wait until the current operation is finished.

Multiple processes?
Three single threaded processes in parallel A process with three parallel threads.
Figure from [Ta01 p.82] Slide 179 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis Slide 180

Each process would have its own isolated address space.

Multiple threads!
The threads operate in the same address space and thus have access to the data.
Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Threads
Three-threaded word processing application
formatting and displaying

Threads
Multiple executions in same environment
All threads have exactly the same address space (the process address space).

Each thread has own registers, stack and state

Reading keyboard

Saving to disk

Figure from [Ta01 p.86] Slide 181 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis Slide 182 Computer Architecture WS 06/07

Figure from [Sil00 p.116] Dr.-Ing. Stefan Freinatis

Threads

User Level Threads


Threads

Take place in user space


The operating system does not know about the applications internal multi-threading.

Can be used on OS not supporting threads


It only needs some thread library (like pthreads) linked to the application.

Each process has its own thread table


The table is maintained by the routines of the thread library.

Customized thread scheduling


Items shared by all threads in a process Items private to each thread
Table from [Ta01 p.83] Slide 183 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis Slide 184 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

The processes use their own thread scheduling algorithm. However, no timer controlled scheduling possible since there are no clock interrupts inside a process.

Blocking system calls do block the process


All threads are stopped because the process is temporarily removed from the CPU.

User Level Threads


Threads

Kernel Threads
Take place in kernel
The operating system manages the threads of each process Threads

Thread management is performed by the application.


Examples - POSIX Pthreads - Mach C-threads - Solaris threads

Available only on multi-threaded OSs


The operating system must support multi-threaded application programs.

No thread administration inside process


since this is done by the kernel. Thread creation and management however is generally somewhat slower than with user level threads [Sil00 p.118].

No customized scheduling
The user process cannot use its own customized scheduling algorithm.

No problem with blocking system calls


A blocking system call causes a thread to pause. The OS activates another thread, either from the same process or from another process.
Figure from [Ta01 p.91] Slide 185 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis Slide 186 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Kernel Threads
Threads

Multithreading Models
Many-to-One Model
Threads

Thread management is performed by the operating system.


Examples - Windows 95/98/NT/2000 - Solaris - Tru64 UNIX - BeOS - Linux

Many user level threads are mapped to a single kernel thread. Used on systems that do not support kernel threads.
Figure from [Ta01 p.91] Figure from [Sil00 p.118] Slide 188 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Slide 187

Computer Architecture

WS 06/07

Dr.-Ing. Stefan Freinatis

Multithreading Models
One-to-One Model
Threads

Multithreading Models
Many-to-Many Model
Threads

Many user level threads are mapped to many kernel threads. Each user level thread is mapped to one kernel thread.
Figure from [Sil00 p.119] Slide 189 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis Slide 190 Computer Architecture WS 06/07 Figure from [Sil00 p.119] Dr.-Ing. Stefan Freinatis

Multithreading
Solaris 2 multi-threading example
Threads

Threads
Windows 2000: Implements one-to-one mapping
Each thread contains - a thread id - register set - separate user and kernel stacks - private data storage area

Linux:
One-to-one model (pthreads), many-to-many (NGPT) Thread creation is done through clone() system call.
clone() allows a child to share the address space of the parent. This system call is unique to Linux, source code not portable to other UNIX systems.
Figure from [Sil00 p.121] Slide 191 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis Slide 192 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis
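Thread libraries such as POSIX Pthreads were mentioned above; as a concrete illustration (not from the slides), here is a minimal C program that creates two threads which run in the same address space and therefore see the same global variable. Compile with -lpthread.

#include <stdio.h>
#include <pthread.h>

static int shared = 0;                   /* visible to all threads (same address space) */

static void *worker(void *arg)
{
    int id = *(int *)arg;
    shared++;                            /* unsynchronized access, see race conditions later */
    printf("I am worker thread %d\n", id);
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;
    int id1 = 1, id2 = 2;
    pthread_create(&t1, NULL, worker, &id1);
    pthread_create(&t2, NULL, worker, &id2);
    pthread_join(t1, NULL);              /* wait for both threads to finish */
    pthread_join(t2, NULL);
    printf("shared = %d\n", shared);
    return 0;
}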

Threads
Java: Provides support at the language level. Thread scheduling is done in the JVM.
Example: creation of a thread by inheriting from the Thread class:

class Worker extends Thread {
    public void run() {
        System.out.println("I am a worker thread");
    }
}

public class MainThread {
    public static void main(String args[]) {
        Worker worker1 = new Worker();
        worker1.start();    // thread creation and automatic call of run() method
        System.out.println("I am the main thread");
    }
}
Slide 193 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Process Management
  Processes (153)
  Threads (178)
  Interprocess Communication (IPC) (195)
  Scheduling (247)
  Real-Time Scheduling (278)
  Deadlocks (318)
Slide 194 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

IPC
Purpose of Inter Process Communication

Race Conditions
Print spooling example
IPC

Managing critical activities


Making sure that two (or more) processes do not get into each others' way when engaging critical activities.

Sequencing
Making sure that proper sequencing is assured in case of dependencies among processes.

Process Synchronization
Thread Synchronization

shared variables

Passing information
Processes are independent of each other and have private address spaces. How can a process pass information (or data) to another process?
Figure from [Ta01 p.101]

next empty slot

Data exchange
Slide 195 Computer Architecture

Less important for threads since they operate in the same environment
WS 06/07 Dr.-Ing. Stefan Freinatis

Situations, where two or more processes access some shared resource, and the final result depends on who runs precisely when, are called race conditions.
Slide 196 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Race Conditions
Processes A and B want to print a file. Both have to enter the file name into a spooler directory. out points to the next file to be printed; this variable is accessed only by the printer daemon, which currently is busy with slot 4. in points to the next empty slot. Each process entering a file name in the empty slot must increment in.

Now consider this situation:
Process A reads in (value = 7) into some local variable. Before it can continue, the CPU is switched over to B. Process B reads in (value = 7) and stores its value locally. Then the file name is entered into slot 7 and the local variable is incremented by 1. Finally the local variable is copied to in (value = 8). Process A is running again. According to its local variable, the file name is entered into slot 7, erasing the file name put there by B. Finally in is incremented. User B is waiting in the printer room for years ...
Slide 197 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Race Conditions
Another example at machine instruction level. Shared variable x (initially 0); both processes increment x by one.
Scenario 1: Process 1 executes R1 <- x, R1 = R1+1, R1 -> x (x is now 1); afterwards Process 2 executes R3 <- x, R3 = R3+1, R3 -> x. Result: x = 2.
Scenario 2: Process 1 executes R1 <- x (reading x = 0); then Process 2 executes R3 <- x, R3 = R3+1, R3 -> x (x = 1); then Process 1 continues with R1 = R1+1, R1 -> x. Result: x = 1.
Slide 198 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis
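The lost-update scenario above can be reproduced on a real machine with two threads incrementing a shared variable without any protection. The following is a small sketch (not from the slides, POSIX threads assumed); the printed value is usually well below the expected total.

#include <stdio.h>
#include <pthread.h>

static volatile long x = 0;              /* shared variable, no protection */

static void *incrementer(void *arg)
{
    for (long i = 0; i < 1000000; i++)
        x = x + 1;                       /* load, add, store: not atomic */
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;
    pthread_create(&t1, NULL, incrementer, NULL);
    pthread_create(&t2, NULL, incrementer, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("x = %ld (expected 2000000)\n", x);   /* usually less: lost updates */
    return 0;
}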

Critical Regions
How to avoid race conditions? Find some way to prohibit more than one process from manipulating the shared data at the same time.
IPC

Critical Regions
IPC

Four conditions to provide correctly working mutual exclusion:

1. No two processes simultaneously in critical region


which would otherwise controvert the concept of mutuality.

Mutual exclusion
Part of the time a process is doing some internal computations and other things that do not lead to race conditions. Sometimes a process however needs to access shared resources or does other critical things that may lead to race conditions. These parts of a program are called critical regions (or critical sections).
critical region Process A

2. No assumptions about process speeds


No predictions on process timings or priorities. Must work with all processes.

3. No process outside its critical regions must block other processes, simply because there is no reason to hinder
others entering their critical region.

4. No process must wait forever to enter a critical region.


For reasons of fairness and to avoid deadlocks.

t
WS 06/07 Dr.-Ing. Stefan Freinatis Slide 200 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Slide 199

Computer Architecture

Critical Regions
Mutual exclusion using critical regions
IPC

Mutual Exclusion
Proposals for achieving mutual exclusion
IPC

Disabling interrupts
The process disables all interrupts and thus cannot be taken away from the CPU.

Not appropriate. Unwise to give user process full control over computer.

Lock variables
A process reads a shared lock variable. If the lock is not set, the process sets the variable (locking) and uses the resource.

In the period between evaluating and setting the variable the process may be interrupted. Same problem as with printer spooling example.
Figure from [Ta01 p.103] Slide 201 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis Slide 202 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Mutual Exclusion
Proposals for achieving mutual exclusion (continued)
IPC

Mutual Exclusion
Proposals for achieving mutual exclusion (continued)
IPC

Strict Alternation
The shared variable turn keeps track of whose turn it is. Both processes alternate in accessing their critical regions.

/* Process 0 */                       /* Process 1 */
while (1) {                           while (1) {
    while (turn != 0);                    while (turn != 1);
    critical_region();                    critical_region();
    turn = 1;                             turn = 0;
    noncritical_region();                 noncritical_region();
}                                     }
Slide 203 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Strict Alternation (continued)
[Figure: timeline of turn alternating between 0 and 1; process 0 busy waits while turn = 1, process 1 busy waits while turn = 0]
Busy waiting wastes CPU time. No good idea when one process is much slower than the other. Violation of condition 3.
Slide 204 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Mutual Exclusion
Proposals for achieving mutual exclusion (continued)
IPC

Mutual Exclusion
Peterson Algorithm (continued)
IPC

Peterson Algorithm

int turn;                      /* shared variables */
bool interested[2];            /* two processes, number is either 0 or 1 */

void enter_region(int process)
{
    int other = 1 - process;
    interested[process] = TRUE;
    turn = process;
    while (turn == process && interested[other] == TRUE);   /* busy wait */
}

void leave_region(int process)
{
    interested[process] = FALSE;
}
Slide 205 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Peterson Algorithm (continued)
Assume process 0 and process 1 both simultaneously enter enter_region():
  Process 0: other = 1, interested[0] = true, turn = 0
  Process 1: other = 0, interested[1] = true, turn = 1
Both are manipulating turn at the same time. Whichever store is last is the one that counts. Assume process 1 was slightly later, thus turn = 1.
  Process 0: while (turn == 0 && interested[1] == TRUE);
  Process 1: while (turn == 1 && interested[0] == TRUE);
Process 0 passes its while statement, whereas process 1 keeps busy waiting therein. Later, when process 0 calls leave_region(), process 1 is released from the loop.

Good working algorithm, but uses busy waiting


Computer Architecture WS 06/07

Slide 206

Dr.-Ing. Stefan Freinatis

Mutual Exclusion
Proposals for achieving mutual exclusion (continued)

Test and Set Lock (TSL)
An atomic operation at machine level that cannot be interrupted. TSL reads the content of the memory word lock into register R and then stores a nonzero value at the memory address lock. The memory bus is locked; no other process(or) can access lock. The CPU must support TSL. Busy waiting.

Pseudo assembler listing providing the functions enter_region() and leave_region():

enter_region:
    TSL R, lock          ; indivisible operation: read lock and set it
    CMP R, #0            ; was the lock 0 (free)?
    JNZ enter_region     ; if not, try again (busy waiting)
    RET

leave_region:
    MOV lock, #0         ; unlock
    RET
Slide 207 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Mutual Exclusion: Intermediate Summary
  Disabling Interrupts: Not recommended for multi-user systems.
  Lock Variables: Problem remains the same.
  Strict Alternation: Violation of condition 3. Busy waiting.
  Peterson Algorithm: Busy waiting.
  TSL instruction: Solves the problem through an atomic operation. Busy waiting.
In essence, what the last three solutions do is this: a process checks whether the entry to its critical region is allowed. If it is not, the process just sits in a tight loop waiting until it is. Busy waiting wastes CPU time and can have unexpected side effects, such as the priority inversion problem; it should be avoided.
Slide 208 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Priority Inversion Problem


IPC

Sleep and wake up


IPC

Consider a computer with two processes Process H with high priority Process L with low priority The scheduling rules are such that H is run whenever it is in ready state. At a certain moment, with L in its critical region, H becomes ready and is scheduled. H now begins busy waiting, but since L is never scheduled while H is running, L never has the chance to leave its critical region. H loops forever. This is sometimes referred to as the priority inversion problem. Solution: blocking a process instead of wasting CPU time.
Slide 209 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

sleep()
A system call that causes the caller to block, that is, the process voluntarily goes from the running state into the waiting state. The scheduler switches over to another process.

wakeup(process)
A system call that causes the process process to awake from its

sleep() and to continue execution. If the process process is not asleep at that moment, the wakeup signal is lost.
Note: these two calls are fictitious representatives of real system calls whose names and parameters depend on the particular operating system.
Slide 210 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Producer Consumer Problem
  Shared buffer with limited size: The buffer allows for a maximum of N entries (it is bounded). The problem is also known as the bounded buffer problem.
  Producer puts information into the buffer: When the buffer is full, the producer must wait until at least one item has been consumed.
  Consumer removes information from the buffer: When the buffer is empty, the consumer must wait until at least one new item has been entered.
[Figure: producer and consumer operating on a shared buffer]
Slide 211 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Producer Consumer Implementation Example
This implementation suffers from race conditions.

const int N = 100;                        /* buffer size */
int count = 0;                            /* number of items in buffer */

void producer()
{
    while (TRUE) {                        /* constantly producing */
        int item = produce_item();        /* produce item */
        if (count == N) sleep();          /* sleep when buffer is full */
        insert_item(item);                /* enter item into buffer */
        count++;                          /* adjust item counter */
        if (count == 1) wakeup(consumer); /* buffer was empty beforehand (now 1 item):
                                             wake up any consumer that may be waiting */
    }
}

void consumer()
{
    while (TRUE) {                        /* constantly consuming */
        if (count == 0) sleep();          /* (A) sleep when buffer is empty */
        item = remove_item();             /* remove one item */
        count--;                          /* adjust item counter */
        if (count == N-1) wakeup(producer); /* buffer was full beforehand (now N-1 items):
                                               wake up a producer that may be waiting */
        consume_item(item);
    }
}

Producer Consumer Problem


A race condition may occur in this case:

Producer Consumer Problem


Mutual Exclusion

Mutual Exclusion

The buffer is empty and the consumer has just read count to see if it is 0. At that instant (see A in listing) the scheduler decides to switch over to the producer. The producer inserts an item in the buffer, increments count and notices that count is now 1. Reasoning that count was just 0 and thus the consumer must be sleeping, the producer calls wakeup() to wake the consumer up. However, the consumer was not yet asleep, it was just taken away the CPU shortly before it could enter sleep(). The wakeup signal is lost. When the consumer is rescheduled and resumes at A , it will go to sleep. Sooner or later the producer has filled up the buffer and goes asleep as well. Both processes will sleep forever.
Slide 213 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Reasons for race condition


The variable count is unconstrained
Any process has access any time.

Evaluating count and going asleep is a non-atomic operation


The prerequisite(s) that lead to sleep() may have changed when sleep() is reached.

Workaround:
Add a wakeup waiting bit
When the bit is set, sleep() will reset that bit and the process stays awake.

Each process must have a wakeup bit assigned


Although this is possible, the principal problem is not solved.

What is needed is something that does testing a variable and going to sleep dependent on that variable in a single non-interruptible manner.
Slide 214 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Semaphores
Introduced by Dijkstra (1965).
Counting the number of wakeups: an integer variable counts the number of wakeups for future use.
Two operations: down and up. down is a generalization of sleep, up is a generalization of wakeup. Both operations are carried out in a single, indivisible operation (usually in the kernel). Once a semaphore operation is started, no other process can access the semaphore.

down(int* sem) {                       /* principle of the down operation */
    if (*sem < 1) sleep();
    (*sem)--;
}

up(int* sem) {                         /* principle of the up operation */
    (*sem)++;
    if (*sem == 1) /* wakeup a process */;
}
Slide 215 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Semaphores
Up and down are system calls, in order to make sure that the operating system briefly disables all interrupts while carrying out the few machine instructions implementing up and down.
Semaphores should be lock-protected, at least in multi-processor systems, to prevent another CPU from simultaneously accessing a semaphore. The TSL instruction helps out here.

Producer Consumer problem using semaphores (next page). Definition of variables:

const int N = 10;
typedef int semaphore;       /* a semaphore is an integer */
semaphore empty = N;         /* counting empty slots */
semaphore full = 0;          /* counting full slots */
semaphore mutex = 1;         /* mutual exclusion on buffer access */
Slide 216 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Producer Consumer Implementation Example
This implementation does not suffer from race conditions.

void producer()
{
    while (TRUE) {
        int item = produce_item();
        down(&empty);        /* possibly sleep; decrement empty counter */
        down(&mutex);        /* possibly sleep; claim mutex (set it to 0) */
        insert_item(item);
        up(&mutex);          /* release mutex, possibly wake up other process */
        up(&full);           /* increment full counter, possibly wake up other process */
    }
}

void consumer()
{
    while (TRUE) {
        down(&full);         /* possibly sleep; decrement full counter */
        down(&mutex);        /* possibly sleep; claim mutex (set it to 0) */
        item = remove_item();
        up(&mutex);          /* release mutex, possibly wake up other process */
        up(&empty);          /* increment empty counter, possibly wake up other process */
        consume_item(item);
    }
}
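For comparison, the same structure can be written with the POSIX semaphore API, where sem_wait corresponds to down and sem_post to up. This is a sketch under the assumption of POSIX threads and semaphores; it is not part of the slides.

#include <stdio.h>
#include <pthread.h>
#include <semaphore.h>

#define N 10

static int buffer[N], in = 0, out = 0;
static sem_t empty, full;                /* counting semaphores */
static pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;

static void *producer(void *arg)
{
    for (int item = 0; item < 100; item++) {
        sem_wait(&empty);                /* down(&empty) */
        pthread_mutex_lock(&mutex);      /* down(&mutex) */
        buffer[in] = item; in = (in + 1) % N;
        pthread_mutex_unlock(&mutex);    /* up(&mutex) */
        sem_post(&full);                 /* up(&full) */
    }
    return NULL;
}

static void *consumer(void *arg)
{
    for (int i = 0; i < 100; i++) {
        sem_wait(&full);                 /* down(&full) */
        pthread_mutex_lock(&mutex);      /* down(&mutex) */
        int item = buffer[out]; out = (out + 1) % N;
        pthread_mutex_unlock(&mutex);    /* up(&mutex) */
        sem_post(&empty);                /* up(&empty) */
        printf("consumed %d\n", item);
    }
    return NULL;
}

int main(void)
{
    pthread_t p, c;
    sem_init(&empty, 0, N);              /* N empty slots initially */
    sem_init(&full, 0, 0);               /* no full slots initially */
    pthread_create(&p, NULL, producer, NULL);
    pthread_create(&c, NULL, consumer, NULL);
    pthread_join(p, NULL);
    pthread_join(c, NULL);
    return 0;
}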

Semaphores
Assume N = 5. Initial condition: empty = 5, full = 0.
Scenario: the producer is working, no consumer present. With each produced item empty counts down (5, 4, 3, 2, 1, 0) while full counts up (0, 1, 2, 3, 4, 5). When empty reaches 0 the producer goes to sleep in down(&empty).

Initial condition: empty = 0, full = 5.
Scenario: the consumer is working, no producer present. With each consumed item full counts down (5, 4, 3, 2, 1, 0) while empty counts up (0, 1, 2, 3, 4, 5). When full reaches 0 the consumer goes to sleep in down(&full).
Slide 218 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Semaphores
Assume N = 5. Initial condition: empty = 1, full = 4.
Scenario: consumer waking up producer. The producer fills the last empty slot (empty = 0, full = 5) and falls asleep in the next down(&empty). When the consumer removes an item, its up(&empty) wakes the producer up again.
Slide 219 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Semaphores
Assume N = 5. Initial condition: empty = 4, full = 1.
Scenario: producer waking up consumer. The consumer removes the last item (full = 0, empty = 5) and falls asleep in the next down(&full). When the producer inserts an item, its up(&full) wakes the consumer up again.
Slide 220 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Semaphores
Assume N = 5. Initial condition: empty = 3, full = 2.
If producer and consumer overlap, it may temporarily be that empty + full ≠ N. Note that consumer and producer may almost concurrently change the same semaphore legally.
Slide 221 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Mutex
A simplified semaphore, used when counting is not needed.
Two states: locked or unlocked. Used for managing mutual exclusion (hence the name).

Pseudo assembler listing implementing mutex_lock() and mutex_unlock():

mutex_lock:
    TSL R, mutex        ; get and set mutex
    CMP R, #0           ; was it unlocked?
    JZ ok               ; if yes: jump to ok
    CALL thread_yield   ; if no: give up the CPU for a while
    JMP mutex_lock      ; try again acquiring mutex
ok: RET

mutex_unlock:
    MOV mutex, #0       ; unlock mutex
    RET
Slide 222 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis
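In C, the effect of the TSL-based mutex_lock above can be approximated with an atomic test-and-set builtin. The following sketch assumes GCC's __sync builtins and is not from the slides; the lecture's version calls thread_yield instead of spinning.

#include <stdio.h>

static volatile int mutex = 0;           /* 0 = unlocked, 1 = locked */

static void mutex_lock(void)
{
    /* __sync_lock_test_and_set atomically writes 1 and returns the old value,
       playing the role of the TSL instruction. */
    while (__sync_lock_test_and_set(&mutex, 1) != 0)
        ;                                /* busy wait (a real mutex would yield or sleep here) */
}

static void mutex_unlock(void)
{
    __sync_lock_release(&mutex);         /* store 0, releasing the lock */
}

int main(void)
{
    mutex_lock();
    printf("in critical region\n");
    mutex_unlock();
    return 0;
}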

Monitors
Mutual Exclusion

Monitors
Mutual Exclusion

High level synchronization primitive


at programming language level. Direct support by some programming languages.

monitor example; integer i; condition c; procedure producer() ... ... end; procedure consumer() ... ... end; end monitor;

Variables are not accessible from outside the monitor's own methods (encapsulation).

A collection of procedures, variables and data structures grouped together in a module


A monitor has multiple entry points Only one process can be in the monitor at a time Enforces mutual exclusion less chances for programming errors

Functions (methods) publicly accessible to all processes, however only one process at a time may call a monitor function.

Monitor implementation
Compiler handles implementation Library functions using semaphores
Slide 223 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

If the buffer is full, the producer must wait. If the buffer is empty the consumer must wait.

A monitor in Pidgin Pascal, from [Ta01 p.115]


Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Slide 224

Monitors
Mutual Exclusion

Monitors
Mutual Exclusion

How can a process wait inside a monitor?


Cannot put to sleep because no other process can enter the monitor meanwhile.

Use a condition variable!


A condition variable supports two operations.

wait(): suspend this process until it is signaled. The suspended process is not considered inside the monitor any more. Another process is allowed to enter the monitor. signal(): wake up one process waiting on the condition variable. No effect if nobody is waiting. The signaling process automatically leaves the monitor (Hoare monitor). Condition variables usable only inside a monitor.
Producer-Consumer problem with monitors, from [Ta01 p.117]
Slide 225 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis Slide 226 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Barriers
Group synchronization
Intended for groups of processes rather than for two processes. IPC

Barriers
Application example Process 1 working on these elements IPC Process 2 working on these elements Process 3 working on these elements

Processes wait at a barrier for the others


according to the all-or-none principle

After all have arrived, all can proceed


Process 0

... and so on for the remaining elements

Processes approaching barrier


Slide 227

Waiting for C to arrive


Computer Architecture WS 06/07

All processes continuing


Figure from [Ta01 p.124] Dr.-Ing. Stefan Freinatis

An array (e.g. an image) is updated frequently by some process 0 (producer). Many processes are working in parallel on certain array elements (consumers). All consumers must wait until the array has been updated and can then start working again on the updated input.
Slide 228 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis
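POSIX threads offer this group synchronization primitive directly as pthread_barrier_t. The following minimal sketch is not from the slides and assumes a POSIX system that implements barriers.

#include <stdio.h>
#include <pthread.h>

#define NWORKERS 3

static pthread_barrier_t barrier;

static void *worker(void *arg)
{
    int id = *(int *)arg;
    printf("worker %d: phase 1 done\n", id);
    pthread_barrier_wait(&barrier);              /* wait until all workers arrive */
    printf("worker %d: phase 2 starts\n", id);   /* runs only after all reached the barrier */
    return NULL;
}

int main(void)
{
    pthread_t t[NWORKERS];
    int id[NWORKERS];
    pthread_barrier_init(&barrier, NULL, NWORKERS);
    for (int i = 0; i < NWORKERS; i++) {
        id[i] = i;
        pthread_create(&t[i], NULL, worker, &id[i]);
    }
    for (int i = 0; i < NWORKERS; i++)
        pthread_join(t[i], NULL);
    pthread_barrier_destroy(&barrier);
    return 0;
}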

IPC
Intermediate Summary (II)

Messages
IPC

Semaphores
Counting variable, used in non-interruptible manner. Down may put the caller to sleep, up may wake up another process.

Kernel supported mechanism for data exchange


Eliminates the need for self-made (user-programmed) communication via shared resources such as shared files or shared memory.

Mutexes
Simplified semaphore with two states. Used for mutual exclusion.

Two basic operations: send(): send data


provided by the kernel (system calls)
Some data (a message)

receive(): receive data


System buffers

Monitors
High level construct for achieving mutual exclusion at programming language level.

Process 1
send

OS (kernel space)

Process 2
receive

Barriers
Used for synchronizing a group of processes. These mechanisms all serve for process synchronization. For data exchange among processes something else is needed: Messages.
Slide 229 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis Slide 230

Copy from user space to kernel space

Copy from kernel space to user space


WS 06/07 Dr.-Ing. Stefan Freinatis

Computer Architecture

Direct Communication
Messages

Indirect Communication
Messages

Both processes must exist


As the name direct implies, you cannot send a message to a future process.

Messages are send / received from mailboxes


The mailbox must exist, not necessarily the receiving process yet.

Processes must name each other explicitly


- send(P, message): send data to process P - receive(Q, message): receive data from process Q
Symmetry in addressing. Both processes need to know each other by some identifier. This is no problem if both were fork()ed off the same parent beforehand, but is a problem when they are strangers to each other.

- Each mailbox has a unique identifier - Processes communicate when they access the same mailbox

Primitives
- send(A, message): send message to mailbox A - receive(A, message): receive message from mailbox A

Communication link properties


- One process pair has exactly one link - The link may be unidirectional or bidirectional
Slide 231 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Communication link properties


- Link is established when processes share a mailbox - A link may be associated with many processes (broadcast) - Unidirectional or bidirectional communication
Slide 232 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Synchronous Communication
Messages

Asynchronous Communication
Messages

Also called blocking send / receive Sender waits for receiver to receive the data
The send() system call blocks until receiver has received the message. Process 1
send

Also called non-blocking send / receive Sender drops message and passes on
The send() system call returns to the caller when the kernel has the message. Process 1
send

OS (kernel space)

Process 2
receive

OS (kernel space)

Process 2
receive

Acknowledgement from receiver

A single buffer (for the pair) is sufficient

Multiple buffers (for each pair) needed

Receiver waits for sender to send data


The receive() system call blocks until a message is arriving.
Slide 233 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Receiver peeks for messages


The receive() system does not block, but rather returns some error code telling whether there is a message or not. Receiver must do polling to check for messages.
Slide 234 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Messages
IPC

UNIX IPC Mechanisms


IPC

Send by copy
The message is copied to kernel buffer at send time. At receive time the message is copied to the receiver. Copying takes time.

Pipes
Simple(st) communication link between two processes. Applies first-in first-out principle. Works like an invisible file, but is no file. Operations: read(), write().

Send by reference
A reference (a memory address or a handle) is copied to the receiver which uses the reference to access the data. The data usually resides in a kernel buffer (is copied there beforehand). Fast read access.

FIFOs
Also called named pipe. Works like a file. May exist in modern Unices just in the kernel (and not in the file system). There can be more than one writer or reader on a FIFO. Operations: open(), close(), read(), write().

Fixed sized messages


The kernel buffers are of fixed size as are the messages. Straightforward system level implementation. Big messages must be constructed from many small messages which makes user level programming somewhat more difficult.

Messages
Allow for message transfer. Messages can have types. A process may read all messages or only those of a particular type. Message communication works according to the first-in first-out principle. Operations: msgget(), msgsnd(), msgrcv(), msgctl().

Variable sized messages


Sender and receiver must communicate about the message size. Best use of kernel buffer space, however, buffers must not grow indefinitely.
Slide 235 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Slide 236

Computer Architecture

WS 06/07

Dr.-Ing. Stefan Freinatis
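As an illustration of the message operations listed above (msgget, msgsnd, msgrcv, msgctl), here is a minimal System V message queue sketch; it is not from the slides, and error handling is omitted.

#include <stdio.h>
#include <string.h>
#include <sys/ipc.h>
#include <sys/msg.h>

struct msgbuf {                          /* a message with a type, as described above */
    long mtype;
    char mtext[64];
};

int main(void)
{
    int qid = msgget(IPC_PRIVATE, IPC_CREAT | 0600);   /* create a new private queue */
    struct msgbuf out = { 1, "hello" }, in;

    msgsnd(qid, &out, strlen(out.mtext) + 1, 0);       /* send a message of type 1 */
    msgrcv(qid, &in, sizeof(in.mtext), 1, 0);          /* receive the next message of type 1 */
    printf("received: %s\n", in.mtext);

    msgctl(qid, IPC_RMID, NULL);         /* remove the queue again */
    return 0;
}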

UNIX IPC Mechanisms
  Shared memory: A selectable part of the address space of process P1 is mapped into the address space of another process P2 (or others). The processes have simultaneous access. Operations: shmget(), shmat(), shmdt(), shmctl().
  Semaphores: Creation and manipulation of sets of semaphores. Operations: semget(), semop(), semctl().
For an introduction into the UNIX IPC mechanisms (with examples) see Stefan Freinatis: Interprozeßkommunikation unter Unix - eine Einführung, Technischer Bericht, Fachgebiet Datenverarbeitung, Universität Duisburg, 1994. http://www.fb9dv.uni-duisburg.de/vs/members/fr/ipc.pdf
Slide 237 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Simple pipe example. Parent is writing, child is reading.

const int FIXSIZE = 80;

void main()
{
    int fd[2];                        // file descriptors for pipe
    pipe(fd);                         // create pipe
    int result = fork();              // duplicate process
    if (result == 0) {                // start child's code
        close(fd[1]);                 // we do not need writing
        char buf[256];                // a buffer
        printf("This is the child, my pid is: %d\n", getpid());
        read(fd[0], buf, FIXSIZE);    // wait for message from parent
        printf("Child: received message was: %s\n", buf);
        exit(0);                      // good bye
    }                                 // end child, start parent
    close(fd[0]);                     // we do not need reading
    printf("This is the parent, my pid is: %d\n", getpid());
    write(fd[1], "Hallo!", FIXSIZE);  // write message to child
}

Classical IPC Problems


The dining philosophers
An artificial synchronization problem posed and solved by Edsger Dijkstra 1965. IPC

Dining philosophers
Classical IPC problems

The life of these philosophers consists of alternate periods of eating and thinking. When a philosopher becomes hungry, she tries to acquire her left and right fork, one at a time, in either order. If successful in acquiring two forks, she eats for a while, then puts down the forks and continues to think.
Text from [Ta01 p.125]

Five philosophers sitting at a table


The problem can be generalized to more than five philosophers, of course.

Each either eats or thinks Five forks available Eating needs 2 forks
Slippery spaghetti, one needs two forks!

Can you write a program that makes the philosophers eating and thinking (thus creation of 5 threads or processes, one for each philosopher), allows maximum utilization (parallelism), that is, two philosophers may eat at a time (no simple solution with just one philosopher eating at a time), is not centrally controlled by somebody instructing the philosophers, and that never gets stuck?
Figure from [Ta01 p.125] WS 06/07 Dr.-Ing. Stefan Freinatis Slide 240 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Pick one fork at a time


Either first the right fork and then the left one, or vice versa.
Slide 239 Computer Architecture

Dining philosophers
A nonsolution to the dining philosophers problem:

const int N = 5;

void philosopher(int i)          /* N philosophers in parallel */
{
    while (TRUE) {               /* for the whole life */
        think();
        take_fork(i);            /* take left fork */
        take_fork((i+1) % N);    /* take right fork */
        eat();
        put_fork(i);             /* put left fork */
        put_fork((i+1) % N);     /* put right fork */
    }
}

If all philosophers take their left fork simultaneously, none will be able to take the right fork. All philosophers get stuck. Deadlock situation.
Slide 241 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Classical IPC Problems
The Readers and Writers Problem
An artificial shared database access problem by Courtois et al., 1971.
  Database system: such as an airline reservation system.
  Many competing processes wish to read and write: Many reading processes are not the problem, but if one process wants to write, no other process may have access, not even readers.
How to program the readers and writers?
  Writer waits until all readers are gone: Not good. Usually there are always readers present. Indefinite wait.
  Writer blocks new readers: A solution. The writer waits until the old readers are gone and meanwhile blocks new readers.
Slide 242 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Classical IPC Problems
The sleeping barber problem: an artificial queuing situation problem.
"The barber shop has one barber, one barber chair, and n chairs for customers, if any, to sit on. If there are no customers present, the barber sits down in the barber chair and falls asleep. When a customer arrives, he has to wake up the sleeping barber. If additional customers arrive while the barber is cutting a customer's hair, they either sit down (if there are empty chairs) or leave the shop (if all chairs are full)." Text from [Ta01 p.129]
[Figure: barber shop with barber chair and customer chairs; the barber sleeps when no customers are present. From Ta01 p.130]
Slide 243 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Sleeping Barber
How to program the barber and the customers without getting into race conditions?

const int CHAIRS = 5;          /* number of chairs */
typedef int semaphore;
semaphore customers = 0;       /* number of customers waiting */
semaphore barbers = 0;         /* number of barbers waiting */
semaphore mutex = 1;           /* for mutual exclusion */
int waiting = 0;
Slide 244 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

A solution to the sleeping barber problem [Ta01 p.131]:

void barber()                  /* barber process */
{
    while (TRUE) {             /* for the whole life */
        down(&customers);      /* sleep if no customers */
        down(&mutex);          /* acquire access to waiting */
        waiting--;
        up(&barbers);          /* one barber ready to cut */
        up(&mutex);            /* release waiting */
        cut_hair();            /* cut hair (non critical) */
    }
}

void customer()                /* customer process */
{
    down(&mutex);              /* enter critical region */
    if (waiting < CHAIRS) {    /* when seats available */
        waiting++;             /* one more waiting */
        up(&customers);        /* tell barber if first customer */
        up(&mutex);            /* release waiting */
        down(&barbers);        /* sleep if no barber available */
        get_haircut();         /* get serviced */
    } else
        up(&mutex);            /* shop is full, leave */
}
Slide 245 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Process Management
  Processes (153)
  Threads (178)
  Interprocess Communication (IPC) (195)
  Scheduling (247)
  Real-Time Scheduling (278)
  Deadlocks (318)
Slide 246 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Scheduling
Better CPU utilization through multiprogramming Scheduling: switching CPU among processes Productivity depends on CPU bursts

Short-Term Scheduler
Also called CPU scheduler. Selects one process from among the ready processes in memory and dispatches it. The dispatcher is a module that finally gives CPU control to the selected process (switching context, switching from kernel mode to user mode, loading the PC).
Short-term scheduler
Scheduling

Job queue

Ready queue

CPU

Figure from [Ta01 p.134] Slide 247 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis Slide 248 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Scheduling decisions
CPU scheduling decisions may take place when a process
1. switches from running to waiting,
2. switches from running to ready,
3. switches from waiting to ready, or
4. terminates.
Figure from [Sil00 p.89]
Slide 249 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Preemptive(ness)
Preemptiveness determines the way of multitasking.
With non-preemptive scheduling (cooperative scheduling), a running process only gives up the CPU because it became blocked, it completed, or it voluntarily yielded the CPU.
With preemptive scheduling the operating system can additionally force a context switch at any time to satisfy the priority policies. This allows the system to more reliably guarantee each process a regular "slice" of operating time.
Slide 250 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Preemptive(ness)
Preemptive scheduling:
  Scheduler can interrupt the running process.
  Special timer hardware is required for the timer-controlled interrupts of the scheduler.
  Synchronization of shared resources: an interrupted process may leave shared data inconsistent.
Cooperative (non-preemptive) scheduling:
  CPU occupation depends on the process, in particular on the CPU burst distribution.
  Applicable on any hardware platform.
  Lesser problems with shared resources: at least the elementary parts of shared data structures are not inconsistent.
Slide 251 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Scheduling Criteria
The scheduling policy depends on what criteria are emphasized [Sil00 p.140]:
  CPU Utilization: Keeping the CPU as busy as possible. The utilization usually ranges from 40% (lightly loaded system) to 90% (heavily loaded).
  Throughput: The number of processes that are completed per time unit. For long processes the throughput rate may be one process per hour, for short ones it may be 10 per second.
  Turnaround time: The interval from the time of submission to the time of completion of a process. Includes the time to get into memory, time spent in the ready queue, execution time on the CPU and I/O time. [With real-time scheduling this time period is called reaction time.]
Slide 252 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Scheduling Criteria
Waiting time
Scheduling

Scheduling

The scheduling algorithm does not affect the time a process executes or spends doing I/O. It only affects the amount of time a process spends waiting in the ready queue. The waiting time is the sum of time spent waiting in the ready queue.

Response time
Irrespective of the turnaround time, some processes produce an output fairly early and continue computing new results while previous results are output to the user. The response time is the time from the submission of a request until the first response is produced.
[ Remark: In the exercises the response time is defined as the time from submission until the process starts (that is, until the first machine instruction is executing). ]

Different systems (batch systems, interactive computers, control systems) may put focus on different scheduling criteria. See next slide.
Slide 253 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis Slide 254

Criteria importance by system [Ta01 p.137]


Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Optimization
Common criteria:
Scheduling

Static / Dynamic Scheduling


Scheduling

With static scheduling all decisions are made before the system starts running. This only works when there is perfect information available in advance about the work needed to be done and the deadlines that have to be met. Static scheduling - if applied - is used in real-time systems that operate in a deterministic environment. With dynamic scheduling all decisions are made at run time. Little needs to be known in advance. Dynamic scheduling is required when the number and type of requests is not known beforehand (non deterministic environment). Interactive computer systems like personal computers use dynamic scheduling. The scheduling algorithm is carried out as a
(hopefully short) system process in-between the other processes.

Maximize(average(CPU utilization)) Maximize(average(throughput)) Minimize(average(turnaround time)) Minimize(average(waiting time)) Minimize(average(response time))


Sometimes it is desirable to optimize the minimum or maximum values rather than the average. For example, to guarantee that all users receive a good service in terms of responsiveness, we may want to minimize the maximum response time. [Note: we do not delve into optimization any further].
Slide 255 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Slide 256

Computer Architecture

WS 06/07

Dr.-Ing. Stefan Freinatis

Scheduling Algorithms
  First Come - First Served
  Shortest Job First
  Priority Scheduling
  Round Robin
  Multilevel Queueing
These algorithms typically are dynamic scheduling algorithms.
Slide 257 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

First Come - First Served
The process that entered the ready queue first will be the first one scheduled. The ready queue is a FIFO queue. Cooperative scheduling (no preemption).

Process   Burst time
P1        24 ms
P2         3 ms
P3         3 ms

Let the processes arrive in the order P1, P2, P3. The Gantt chart for the schedule is:
| P1 (0 - 24 ms) | P2 (24 - 27 ms) | P3 (27 - 30 ms) |
Waiting time for P1 = 0 ms, for P2 = 24 ms, for P3 = 27 ms.
Average waiting time: (0 ms + 24 ms + 27 ms) / 3 = 17 ms.
Slide 258 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis
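The average above can be checked with a few lines of C; a small sketch (not from the slides) that computes FCFS waiting times for processes which all arrive at time 0, given in their queue order. Reordering the burst array reproduces the alternative schedules discussed next.

#include <stdio.h>

/* Average waiting time under FCFS: process i waits until all earlier ones finish. */
static double fcfs_avg_wait(const int *burst, int n)
{
    int t = 0, total_wait = 0;
    for (int i = 0; i < n; i++) {
        total_wait += t;
        t += burst[i];
    }
    return (double)total_wait / n;
}

int main(void)
{
    int order[] = {24, 3, 3};            /* arrival order P1, P2, P3 */
    printf("average waiting time: %.0f ms\n", fcfs_avg_wait(order, 3));   /* 17 ms */
    return 0;
}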

First Come - First Served


Let the processes now arrive in the order P2, P3, P1. The Gantt chart for the schedule is:
Scheduling

Shortest Job First (SJF)


Associate with each process the length of its next CPU burst. Use these lengths to schedule the process with the shortest time. Two schemes:
Scheduling

| P2 (0-3) | P3 (3-6) | P1 (6-30) |   t [ms]

Non-preemptive SJF
Once the CPU is given to the process, it cannot be preempted until the CPU burst is completed.

Waiting time for P1 = 6 ms, for P2 = 0 ms, for P3 = 3 ms. Average waiting time: (6 ms + 0 ms + 3 ms) / 3 = 3 ms.

Preemptive SJF
When a new process arrives with a CPU burst length less than the remaining burst time of the current process, the CPU is given to the new process. This scheme is known as the Shortest Remaining Time First (SRTF)

Much better average waiting time than previous case. With FCFS the waiting time generally is not minimal. No preemption.
Slide 259 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

With respect to the waiting time, SJF is provably optimal. It gives the minimum average waiting time for a given set of processes. Processes with long bursts may suffer from starvation.
Slide 260 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Shortest Job First


Process   Arrival time   Burst time
P1        0 ms           7 ms
P2        2 ms           4 ms
P3        4 ms           1 ms
P4        5 ms           4 ms
Scheduling

Shortest Job First


Process   Arrival time   Burst time
P1        0 ms           7 ms
P2        2 ms           4 ms
P3        4 ms           1 ms
P4        5 ms           4 ms
Scheduling

For non-preemptive scheduling the Gantt chart is:

| P1 (0-7) | P3 (7-8) | P2 (8-12) | P4 (12-16) |   t [ms]

For preemptive scheduling (SRTF) the Gantt chart is:

| P1 (0-2) | P2 (2-4) | P3 (4-5) | P2 (5-7) | P4 (7-11) | P1 (11-16) |   t [ms]
Waiting time for P1 = 0 ms, for P2 = 6 ms, for P3 = 3 ms, for P4 = 7 ms. Average waiting time: (0 ms + 6 ms + 3 ms + 7 ms) / 4 = 4 ms.
Slide 261 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Waiting time for P1 = 9 ms, for P2 = 1 ms, for P3 = 0 ms, for P4 = 2 ms. Average waiting time: (9 ms + 1 ms + 0 ms + 2 ms) / 4 = 3 ms.
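Both schedules can be reproduced with a small step-wise simulation. The sketch below (an illustration added to the slides, not lecture code) simulates SRTF in 1-ms steps and recovers exactly the waiting times stated above.

# Shortest Remaining Time First (SRTF) simulation in 1-ms steps (illustrative sketch).

def srtf_waiting_times(procs):
    """procs: dict name -> (arrival_ms, burst_ms). Returns name -> waiting time in ms."""
    remaining = {n: b for n, (a, b) in procs.items()}
    finish = {}
    t = 0
    while remaining:
        ready = [n for n in remaining if procs[n][0] <= t]
        if not ready:
            t += 1                 # CPU idles until the next arrival
            continue
        current = min(ready, key=lambda n: remaining[n])   # shortest remaining burst
        remaining[current] -= 1
        t += 1
        if remaining[current] == 0:
            del remaining[current]
            finish[current] = t
    # waiting time = turnaround - burst = (finish - arrival) - burst
    return {n: finish[n] - a - b for n, (a, b) in procs.items()}

procs = {"P1": (0, 7), "P2": (2, 4), "P3": (4, 1), "P4": (5, 4)}
print(srtf_waiting_times(procs))   # {'P1': 9, 'P2': 1, 'P3': 0, 'P4': 2}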
Slide 262 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Shortest Job First


Predicting the CPU burst time
Scheduling

Shortest Job First


Exponential average for α = ½ and τ0 = 10
Scheduling

The next CPU burst is predicted as the exponential average of the measured lengths of previous bursts:

τn+1 = α · tn + (1 - α) · τn
τn+1 = predicted length of the next burst
tn = actual length of the nth burst
α : 0 ≤ α ≤ 1, controls the relative contribution of the recent and the past history
Figure from [Sil00 p.144] Slide 263 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis Slide 264 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis


Shortest Job First


Exponential average for α = ½ and τ0 = 10
Scheduling

Priority Scheduling
Each process is assigned a priority. The process with highest priority is allocated the CPU. Two schemes:
Scheduling

τ1 = ½ · 6 + ½ · 10 = 8
τ2 = ½ · 4 + ½ · 8 = 6
τ3 = ½ · 6 + ½ · 6 = 6
τ4 = ½ · 4 + ½ · 6 = 5
τ5 = ½ · 13 + ½ · 5 = 9
τ6 = ½ · 13 + ½ · 9 = 11
τ7 = ½ · 13 + ½ · 11 = 12
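The recurrence is straightforward to code. The sketch below (an illustrative addition, not lecture code) reproduces the predicted values above; the actual burst sequence 6, 4, 6, 4, 13, 13, 13 together with α = ½ and τ0 = 10 follows from the calculation shown.

# Exponential averaging of CPU burst lengths (illustrative sketch).
# tau_next = alpha * t_n + (1 - alpha) * tau_n

def predict_bursts(actual_bursts, alpha=0.5, tau0=10):
    tau = tau0
    predictions = []
    for t_n in actual_bursts:
        tau = alpha * t_n + (1 - alpha) * tau   # blend newest burst with history
        predictions.append(tau)
    return predictions

print(predict_bursts([6, 4, 6, 4, 13, 13, 13]))
# [8.0, 6.0, 6.0, 5.0, 9.0, 11.0, 12.0]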
Slide 265 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Non-preemptive

τn+1 = α · tn + (1 - α) · τn

Preemptive
When a new process arrives with a priority higher than a running process, the CPU is given to the new process.

SJF scheduling is a special case of priority scheduling in which the priority is the inverse of the CPU burst length. Solution to starvation problem: The priority of a process increases as the waiting time increases (aging technique).
Slide 266 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Priority Scheduling
Assume low numbers representing high priorities
Process   Burst time   Priority
P1        10 ms        3
P2         1 ms        1
P3         2 ms        4
P4         1 ms        5
P5         5 ms        2
Scheduling

Priority Scheduling
Process   Burst time   Arrival time   Priority
P1        10 ms         0 ms          3
P2         1 ms         2 ms          1
P3         2 ms         2 ms          4
P4         1 ms         6 ms          5
P5         5 ms        12 ms          2
Scheduling

All processes arrive at time 0. For non-preemptive scheduling the Gantt chart is:

| P2 (0-1) | P5 (1-6) | P1 (6-16) | P3 (16-18) | P4 (18-19) |   t [ms]

Here: preemptive scheduling.

Timing diagram: processes P1 ... P5 sorted by priority, showing running and ready intervals over t = 0 ... 20 ms.

Computer Architecture WS 06/07
Dr.-Ing. Stefan Freinatis

Slide 267

Dr.-Ing. Stefan Freinatis

Slide 268

Round Robin
Each process gets a small unit of CPU time (time quantum), usually 10-100 milliseconds. After the quantum has elapsed, the process is preempted and added to the end of the ready queue.
Scheduling

Round Robin
Process   Burst time
P1        53 ms
P2        17 ms
P3        68 ms
P4        24 ms
Scheduling

Burst ≤ quantum
When the current CPU burst is smaller than the time quantum, the process itself will release the CPU (changing state into waiting).

Suppose a time quantum of 20 ms. The Gantt chart for the schedule is:

Burst > quantum


The process is interrupted and another process is dispatched.

| P1 (0-20) | P2 (20-37) | P3 (37-57) | P4 (57-77) | P1 (77-97) | P3 (97-117) | P4 (117-121) | P1 (121-134) | P3 (134-154) | P3 (154-162) |   t [ms]

If the time quantum is very large compared to the processes burst times, the scheduling policy is the same as FCFS. If the time quantum is very small, the round robin policy turns into processor sharing (seems as if each process has its own processor).
Slide 269 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Waiting time for P1 = 0 + 57 + 24 = 81 ms, for P2 = 20 ms, for P3 = 37 + 40 + 17 = 94 ms, for P4 = 57 + 40 = 97 ms. Average waiting time: (81 + 20 + 94 + 97) / 4 = 73 ms.
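The waiting times above follow mechanically from cycling through the ready queue. Below is a minimal sketch (illustrative, not from the slides); all processes are assumed to arrive at t = 0, as in the example.

# Round-robin waiting-time calculation for processes arriving at t = 0 (sketch).
from collections import deque

def rr_waiting_times(bursts, quantum):
    """bursts: list of (name, burst_ms) in ready-queue order."""
    remaining = dict(bursts)
    executed = {name: 0 for name, _ in bursts}   # CPU time already received
    waiting = {}
    queue = deque(name for name, _ in bursts)
    t = 0
    while queue:
        name = queue.popleft()
        slice_ = min(quantum, remaining[name])   # run one quantum or the rest of the burst
        t += slice_
        remaining[name] -= slice_
        executed[name] += slice_
        if remaining[name] > 0:
            queue.append(name)                   # back to the end of the ready queue
        else:
            waiting[name] = t - executed[name]   # completion time minus burst (arrival = 0)
    return waiting

w = rr_waiting_times([("P1", 53), ("P2", 17), ("P3", 68), ("P4", 24)], quantum=20)
print(w)                               # {'P2': 20, 'P4': 97, 'P1': 81, 'P3': 94}
print(sum(w.values()) / 4, "ms")       # 73.0 ms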
Slide 270 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Round Robin
Round Robin typically has higher average turnarounds than SJF, but has better response.
Scheduling

Round Robin
Turnaround time depends on time quantum
Burst time Scheduling

Context switch and performance


The smaller the time quanta, the more the context switches do affect performance. Following is shown a process with a 10 ms burst, and time quanta of 12, 6 and 1 ms.

All processes arrive at same time. Ready queue order: P1, P2, P3, P4

Context switches cause overhead

Turnaround time as function of time quantum


Figure from [Sil00 p.149]

Figure from [Sil00 p.148] Slide 271 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Slide 272

Computer Architecture

WS 06/07

Dr.-Ing. Stefan Freinatis

Round Robin
Average turnaround time for time quantum = 1ms
P1 P2 P3 P4
0 5 10 15 20

Round Robin
Scheduling

Average turnaround time for time quantum = 2 ms


P1 P2 P3 P4
0 5 10 15 20

Scheduling

t [ms]

t [ms]

Turnaround (P1) = 15 ms Turnaround (P2) = 9 ms Turnaround (P3) = 3 ms Turnaround (P4) = 17 ms


Slide 273

Average turnaround: (15 + 9 + 3 + 17) ms / 4 = 11 ms

Turnaround (P1) = 14 ms Turnaround (P2) = 10 ms Turnaround (P3) = 5 ms Turnaround (P4) = 17 ms


Slide 274

Average turnaround: (14 + 10 + 5 + 17) ms / 4 = 11.5 ms

Computer Architecture

WS 06/07

Dr.-Ing. Stefan Freinatis

Computer Architecture

WS 06/07

Dr.-Ing. Stefan Freinatis

Round Robin
Average turnaround time for time quantum = 6 ms
P1 P2 P3 P4
0 5 10 15 20

Multilevel Queue
Scheduling

The ready queue is partitioned into separate queues. Each queue has its own CPU scheduling algorithm. There is also scheduling between the queues (inter queue).

Scheduling

Side note: policy now is like FCFS

Interqueue scheduling Fixed priority Time slicing

t [ms]

Turnaround (P1) = 6 ms Turnaround (P2) = 9 ms Turnaround (P3) = 10 ms Turnaround (P4) = 17 ms


Slide 275

Average turnaround: (6 + 9 + 10 + 17) ms / 4 = 10.5 ms


Figure from [Sil00 p.150] Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis Slide 276 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Process Management
Processes (153) Threads (178) Interprocess Communication (IPC) (195) Scheduling (247) Real-Time Scheduling (278) Deadlocks (318)

Real-Time Scheduling
Tdist Technical process r waiting (ready) context switch RT System execution
inclusive output

Scheduling

TRmax
t

d Tw Tcs
t

e s TR c

Realtime condition: TR TRmax


Slide 277 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis Slide 278

otherwise realtime-violation

Computer Architecture

WS 06/07

Dr.-Ing. Stefan Freinatis

Real-Time Scheduling
Tdist
Scheduling

Real-Time Scheduling
The reaction time (also called response time) TR is the time interval between the request (the interrupt) and the end of the process: TR = Tw + TCS + e. This is the time interval the technical system has to wait until response is received. Starting from the request, the maximum response time TRmax defines the deadline d (a point in time) at which the real-time system must have responded. A hard real-time system must not violate the real-time conditions. Note: For all following considerations, the context switch time TCS is neglected, that is, we assume TCS = 0 s.
In accordance with D. Zbel, W. Albrecht: Echtzeitsysteme, page 24, ISBN 3-8266-0150-5

Scheduling

A technical process generates events (periodically or not). A real-time computing system is requested to respond to the events. The response must be delivered within the period TRmax. The technical system requests computation by raising an interrupt at time r at the real-time system. The time from the occurrence of the request (interrupt) until the context switch of the corresponding computer process is the waiting time Tw . Switching the context takes the time TCS . The point in time at which execution starts is the start time s. The execution time e is the netto CPU time needed for execution (even if the process is interrupted). The process finishes at completion time c.
Slide 279 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Slide 280

Computer Architecture

WS 06/07

Dr.-Ing. Stefan Freinatis

Real-Time Violation
Example RT.1 RT-Scheduling

Real-Time Violation
Case 1:
P1 low priority P2 high priority

Two technical processes TP1 and TP2 on some machine require response from a real-time system. The corresponding computer processes are P1 and P2. The technical processes generate events as follows:
a 0 a 0 Response must be given latest just before the next event (thus within Tdist)

Machine TP1 TP2


b 5

TRmax Process
4 ms 6 ms P1 P2
c

response time TR 1 ms 4 ms
d 10

Priority LOW HIGH

TRmax1 TRmax2

TP1 TP2

a 0 a 0

t [ms] t [ms]

TP1 TP2

b 5 b 5

c 10

t [ms] t [ms]

b 5 10

c 10

Real-time violation, response to TP1 is too late!

The execution time of P1 is 1ms, the execution time of P2 is 4 ms, and the scheduling algorithm is preemptive priority scheduling. The context switch time is considered negligible (0 s).
Slide 281 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

P1 P2
0
Slide 282

a a 5

b b 10
Computer Architecture

c c

t [ms]

WS 06/07

Dr.-Ing. Stefan Freinatis

Real-Time Violation
Case 2:
P1 high priority P2 low priority

Real-Time Scheduling
response time TR 1 ms 4 ms
d 10

Machine TP1 TP2


b 5

TRmax Process
4 ms 6 ms P1 P2
c

Priority HIGH LOW

Theorem

Scheduling

For a system with n processors (n ≥ 2) there is no optimal scheduling algorithm for a set of processes P1 ... Pm unless
all starting times s1, ... sm, all execution times e1, ... em, all completion times c1, ... cm

TP1 TP2

a 0 a 0

t [ms] t [ms]

b 5 10

No real-time violation. Fine!

are known (deterministic systems).


d

P1 P2
0
Slide 283

a a

b a 5 b

c b 10
Computer Architecture WS 06/07

Often, technical processes (or natural processes) are non-deterministic, at least to a part.
c

t [ms]

An algorithm is optimal when it finds an effective solution if such exists.


Slide 284 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Dr.-Ing. Stefan Freinatis

Branch-and-Bound Scheduling
Find a schedule by searching all combinations of processes.
RT-Scheduling

Branch-and-Bound Scheduling
Search tree for the example
RT-Scheduling

Of each process (non-preemptive!) must be known in advance:


the request time (interrupt arrival time)
known in case of periodical technical processes

r
P1 P1, P2 P1, P3 P2, P1 P2 P2, P3 P3, P1 P3 P3, P2

the response time the deadline

TR

known from analysis or worst-case measurements

d
request time ri 0 ms 0 ms 0 ms
Computer Architecture

given by the technical system

Example:

Process P1 P2 P3

execution time e 20 ms 50 ms 30 ms
WS 06/07

deadline di 30 ms 90 ms 100 ms

P1, P2 , P3

P1, P3 , P2

P2, P1 , P3

P2, P3 , P1

P3, P1 , P2

P3, P2 , P1

For n processes: tree depth (number of levels) = n, number of combinations = n!


Slide 286 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Slide 285

Dr.-Ing. Stefan Freinatis

Branch-and-Bound Scheduling
RT-Scheduling

Branch-and-Bound Scheduling
RT-Scheduling

Sequence P1, P2 , P3 P3 P2 P1
0 10 20 30 40 50 60 70 80 90 100 110

Sequence P2, P1 , P3 P3 P2 P1
0 10 20 30 40 50 60 70 80 90 100 110

t [ms] d1
Sequence P2, P3 , P1

t [ms] d1
Sequence P2, P3 , P1
Real-time violation

d2

d3
P3 P2 P1
0 10 20

d2

dd3

P3 P2 P1
0 10 20 30 40 50 60 70 80 90 100 110

t [ms] d1
Real-time violation
Computer Architecture WS 06/07

t [ms]
30 40 50 60 70 80 90 100 110

d2

d3
Slide 288

d1

Real-time violation
Computer Architecture WS 06/07

d2

d3

Slide 287

Dr.-Ing. Stefan Freinatis

Dr.-Ing. Stefan Freinatis

Branch-and-Bound Scheduling
RT-Scheduling

Branch-and-Bound Scheduling
Search tree for the example
RT-Scheduling

Sequence P3, P1 , P2 P3 P2 P1
0 10 20 30 40 50 60 70 80 90 100 110

Real-time violation

t [ms] d1
Sequence P3, P2 , P1
Real-time violation

P1 P1, P2 P1, P3 P2, P1

P2 P2, P3 P3, P1

P3 P3, P2

d2

d3

P3 P2 P1
0 10 20 30 40 50 60 70 80 90 100 110

P1, P2 , P3

P1, P3 , P2

P2, P1 , P3

P2, P3 , P1

P3, P1 , P2

P3, P2 , P1

t [ms] d1
Real-time violation
Computer Architecture WS 06/07

d2

d3
Slide 290

The only solution: P1 must be first, P2 must be second.


Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Slide 289

Dr.-Ing. Stefan Freinatis

Branch-and-Bound Scheduling
For small n one may directly investigate the n! combinations at the leaves. For larger n it is recommended to start from the root and investigate all nodes (level by level). When a node violates the real-time condition, the corresponding subtree can be disregarded.
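The pruning idea can be written as a small recursive search: extend a partial sequence process by process and abandon a branch as soon as the next process would miss its deadline. Below is an illustrative sketch (not the lecture's own code); as in the example, all request times are assumed to be 0.

# Branch-and-bound search for a feasible non-preemptive schedule (illustrative sketch).
# Each process: (name, execution_time, deadline); all request times are 0 here.

def find_schedule(procs, sequence=(), t=0):
    if not procs:
        return sequence                      # all processes placed: feasible schedule
    for i, (name, e, d) in enumerate(procs):
        if t + e > d:                        # this process would miss its deadline:
            continue                         # prune the whole subtree below this node
        rest = procs[:i] + procs[i+1:]
        result = find_schedule(rest, sequence + (name,), t + e)
        if result:
            return result
    return None                              # no feasible order in this subtree

procs = [("P1", 20, 30), ("P2", 50, 90), ("P3", 30, 100)]
print(find_schedule(procs))                  # ('P1', 'P2', 'P3')

For the example data the search returns the sequence (P1, P2, P3), in line with the search tree above: P1 must be first and P2 second.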
RT-Scheduling

Deadline Scheduling
RT-Scheduling

Priority Scheduling. The process with the closest deadline has highest priority. When processes have the same deadline, selection is done arbitrarily or according to FCFS.

Non-preemptive
The algorithm is carried out after a running process finishes. Intermediate requests are saved (interrupt flip-flops) meanwhile.

P1 P1, P2 P1, P3 P2, P1

P2 P2, P3 P3, P1

P3 P3, P2

Preemptive
The algorithm is carried out when a request arrives (interrupt routine) or after a process finishes.

P1, P2 , P3
Slide 291

P1, P3 , P2

P2, P3 , P1
Computer Architecture WS 06/07

P3, P2 , P1
Dr.-Ing. Stefan Freinatis

The deadline scheduling algorithm is also known as earliest deadline first (EDF). The algorithm is optimal for the one-processor case.
If there is a solution, it is found. If none is found then there is no solution.
Slide 292 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Deadline Scheduling
Example RT.2: Non-preemptive scheduling Process P1 P2 P3 P4 P4 P3 P2 P1
0 5 10 15 20

Deadline Scheduling
RT-Scheduling

Example RT.3: Preemptive scheduling

Process   request time ri   execution time e   deadline di
P1        0 ms              2 ms                4 ms
P2        3 ms              3 ms               14 ms
P3        6 ms              3 ms               12 ms
P4        5 ms              4 ms               10 ms

RT-Scheduling
Remember, context switch time is neglected.

request time ri 0 ms 0 ms 0 ms 0 ms

execution time e 4 ms 1 ms 2 ms 5 ms

deadline di 5 ms 7 ms 7 ms 13 ms

Deadline is the same, choice is arbitrary. Could be sequence P3, P2 as well.

P3 P2 P1
0 5 10 15 20

t [ms]

t [ms]

d1 d2, d3
Slide 293

d4
Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis Slide 294

d1

d4

d3

d2
WS 06/07 Dr.-Ing. Stefan Freinatis

Computer Architecture

Deadline Scheduling
Continuation of example RT.3 t = 0 ms: t = 2 ms: t = 3 ms: t = 5 ms: t = 6 ms: t = 9 ms: t = 12 ms: t = 13 ms:
RT-Scheduling Request for P1 arrives. Since there is no other process, P1 is scheduled. P1 finishes. Since there are no requests, the scheduler has nothing to do. Request for P2 arrives. Since there is no other process, P2 is scheduled. Request for P4 arrives. The deadline d4 is closer than the deadline of the running process P2. P4 has higher priority and is scheduled. Request for P3 arrives. Deadline d3 is more distant than any other, so nothing changes. P4 continues. P4 finishes. The closest deadline now is d3, so P3 is scheduled. P3 finishes. The closest deadline now is d2, so P2 is scheduled again. P2 finishes. There are no processes ready. Nothing to schedule.
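The decisions traced above can be reproduced with a small preemptive EDF simulation that, at every millisecond, runs the ready process with the earliest deadline. This is an illustrative sketch (not lecture code); the data are those of example RT.3, and the context switch time is again taken to be 0.

# Preemptive earliest-deadline-first (EDF) simulation in 1-ms steps (illustrative sketch).

def edf_schedule(procs, horizon):
    """procs: name -> (request_ms, execution_ms, deadline_ms). Returns list of (t, name)."""
    remaining = {n: e for n, (r, e, d) in procs.items()}
    timeline = []
    for t in range(horizon):
        ready = [n for n in remaining if procs[n][0] <= t and remaining[n] > 0]
        if not ready:
            timeline.append((t, None))                    # CPU idle
            continue
        current = min(ready, key=lambda n: procs[n][2])   # earliest deadline first
        remaining[current] -= 1
        timeline.append((t, current))
    return timeline

procs = {"P1": (0, 2, 4), "P2": (3, 3, 14), "P3": (6, 3, 12), "P4": (5, 4, 10)}
for t, name in edf_schedule(procs, 14):
    print(t, name or "idle")
# P1 runs 0-2, idle at 2, P2 runs 3-5, P4 runs 5-9, P3 runs 9-12, P2 finishes at 13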

Deadline Scheduling
RT-Scheduling

For multi-processor systems, the algorithm is not optimal.


Example RT.4: Three processes and two processors. Non-preemptive scheduling.

Process   request time ri   execution time e   deadline di
P1        0 ms              8 ms               10 ms
P2        0 ms              5 ms                9 ms
P3        0 ms              4 ms                9 ms

One processor runs P2 (0-5); the other runs P3 (0-4) followed by P1 (4-12): real-time violation, P1 misses its deadline d1 = 10 ms.

Computer Architecture WS 06/07
Dr.-Ing. Stefan Freinatis

Slide 295

Computer Architecture

WS 06/07

Dr.-Ing. Stefan Freinatis

Slide 296

Real-Time Scheduling
Scheduling

Real-Time Scheduling
When there are n processes that are periodic, independent of each other, preemptable, and the response is to be delivered latest at the end of each period (that is, TRmax = Tdist), then the processes can be scheduled on a single processor without real-time violation if

Σ (i = 1 ... n)  ei / Tdist,i  ≤  1          (Schedulability Test)

Example RT.5:

Process   execution time e   deadline di
P1        15 ms              k · 30 ms
P2        25 ms              k · 70 ms
P3        15 ms              k · 200 ms

15/30 + 25/70 + 15/200 = 0.5 + 0.36 + 0.075 = 0.935 ≤ 1

The processes can be scheduled. Deadline scheduling would yield:

[Timing diagram from slide 298: deadline schedule of P1, P2, P3 over t = 0 ... 200 ms]

Slide 297 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis Slide 298

Real-Time Scheduling
Continuation of example RT.5 t = 0 ms: t = 15 ms: t = 30 ms: t = 45 ms: t = 55 ms: t = 60 ms: t = 70 ms:
Scheduling Requests for P1, P2, P3 arrive. P1 has closest deadline and is scheduled. P1 finishes. The deadline of P2 is closer than the deadline of P3. P2 is scheduled. Request for P1 arrives. Reevaluation of the deadlines yields that P1 has highest priority. P1 is scheduled. P1 finishes. The deadline of P2 still is closer than the deadline of P3. P2 is scheduled. P2 finishes. The only waiting process is P3. P3 thus is scheduled. Request for P1 arrives. Reevaluation of the deadlines yields that P1 has highest priority. P1 is scheduled. Request for P2 arrives. Deadline of P1 is closest, P1 continues.

Real-Time Scheduling
Example RT.6:

Process   execution time e   deadline di
P1        2 ms               k · 4 ms
P2        3 ms               k · 14 ms
P3        5 ms               k · 12 ms
Scheduling

Σ (i = 1 ... n)  ei / Tdist,i  =  2/4 + 3/14 + 5/12  =  0.5 + 0.215 + 0.42  =  1.135  >  1

...

This means an overutilization of the microprocessor. The processor would have to execute more than one process at a time (which is impossible). Therefore there is no schedule that would not violate the real-time condition sooner or later (on a single-processor system). The schedulability test failed.
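The test itself is a one-line utilization sum. The sketch below (an illustrative addition, not lecture code) applies it to the task sets of examples RT.5 and RT.6.

# Schedulability test for periodic, preemptable processes with TRmax = Tdist (sketch).

def utilization(tasks):
    """tasks: list of (execution_ms, period_ms). Schedulable on one CPU if result <= 1."""
    return sum(e / period for e, period in tasks)

rt5 = [(15, 30), (25, 70), (15, 200)]
rt6 = [(2, 4), (3, 14), (5, 12)]
print(utilization(rt5), utilization(rt5) <= 1)   # ~0.93  True  -> schedulable
print(utilization(rt6), utilization(rt6) <= 1)   # ~1.13  False -> not schedulable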
Slide 300 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Slide 299

Computer Architecture

WS 06/07

Dr.-Ing. Stefan Freinatis

Laxity Scheduling
Priority Scheduling. The process with the least laxity has highest priority. For equal laxities the selection policy is arbitrary or FCFS.
The laxity is the period of time left in which a process can be started without violating its deadline. Latest when the laxity is 0 the process must be started, otherwise it will not finish in time. The execution time e of the process must be known, of course.

Laxity:  lax = (d - now) - e

now is the point in time at which the laxity is
RT-Scheduling

Laxity Scheduling
RT-Scheduling

Deadline scheduling focuses on the deadline, but does not take into account the execution time e of a process. Laxity scheduling does, it sometimes finds a solution that deadline scheduling does not find.
Example RT.7: Three processes and two processors. Non-preemptive scheduling. Same as in example RT.4.

Process   request time ri   execution time e   deadline di
P1        0 ms              8 ms               10 ms
P2        0 ms              5 ms                9 ms
P3        0 ms              4 ms                9 ms

(re)calculated. Usually this is the point in time at which a new request arrives (preemptive scheduling) or at which a process finishes.
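A small helper makes the rule concrete. The sketch below (an illustrative addition) computes lax = (d - now) - e for each ready process and picks the least-laxity one; the numbers are those of example RT.7 at t = 0.

# Least-laxity selection (illustrative sketch).
# laxity = (deadline - now) - remaining execution time

def laxity(proc, now):
    deadline, remaining = proc
    return (deadline - now) - remaining

def pick_least_laxity(ready, now):
    """ready: dict name -> (deadline_ms, remaining_ms)."""
    return min(ready, key=lambda n: laxity(ready[n], now))

ready = {"P1": (10, 8), "P2": (9, 5), "P3": (9, 4)}   # example RT.7 at t = 0
print({n: laxity(p, 0) for n, p in ready.items()})    # {'P1': 2, 'P2': 4, 'P3': 5}
print(pick_least_laxity(ready, 0))                    # P1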
Slide 301 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis Slide 302

Processes now undergoing laxity scheduling (see next slide)

Computer Architecture

WS 06/07

Dr.-Ing. Stefan Freinatis

Laxity Scheduling
Continuation of example RT.7 t = 0 ms:
RT-Scheduling Requests for P1, P2, P3 arrive. The laxities are: lax1 = 2 ms, lax2 = 4 ms,

Laxity Scheduling
Laxity scheduling, like deadline scheduling, is generally not optimal for multi-processors.
That is, it does not always find a solution. RT-Scheduling

lax3 = 5 ms. Least laxity is lax1, so P1 is scheduled on processor 1.


Processor 2 is not yet assigned, so P2 is chosen (lax2 < lax3).

t = 5 ms: t = 8 ms:

Example RT.8: Four processes and two processors. Non-preemptive scheduling.

Process   request time ri   execution time e   deadline di
P1        0 ms              1 ms               1 ms
P2        0 ms              5 ms               6 ms
P3        0 ms              3 ms               5 ms
P4        0 ms              5 ms               8 ms

P2 finishes. The only process waiting is P3, so it is scheduled. P1 finishes. No new processes to schedule. Processor 1 Processor 2
0 P2 P1

P3

t [ms]
5 10

P4

No real-time violation as opposed to the deadline scheduling example RT.4


Slide 303 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis Slide 304 Computer Architecture

Continuation on next slide


WS 06/07 Dr.-Ing. Stefan Freinatis

Laxity Scheduling
Continuation of example RT.8 t = 0 ms:
RT-Scheduling Requests for P1, P2, P3, P4 arrive. The laxities are: lax1 = 0 ms, so P1 is scheduled on processor 1. Second least laxity is lax2, so P2 is chosen for processor 2.

Laxity Scheduling
Continuation of example RT.8 However, there exists a schedule that works well:
Processor 1 Processor 2
0 P1 P2

RT-Scheduling

lax2 = 1 ms, lax3 = 2 ms, lax4 = 3 ms. Least laxity is lax1,

Non-violating schedule
found through deadline scheduling P4

t = 1 ms: t = 4 ms:

P1 finishes. Least laxity is lax3 (now 1ms), so P3 is scheduled on processor 1. P3 finishes. Least laxity is lax4 (now -1 ms), so P4 is scheduled on processor 1 ... but it is already too late (negative laxity). Processor 1 Processor 2
0 P1 P3 P4

P3

t [ms]
5

d4

10

P2

Real-time violation P4 t [ms]


5

Scheduling non-preemptive processes in a multi-processor system is a complex problem.


This is even the case in a two-processor system when all request times ri are the same and all deadlines di are the same.
Slide 306 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

d4

10
Dr.-Ing. Stefan Freinatis

Slide 305

Computer Architecture

WS 06/07

Rate Monotonic Scheduling


Priority scheduling for periodical preemptive processes where the deadlines are equal to the periods. The process with highest frequency (repetition rate) has highest priority. Static scheduling.
Technical process 1 Tdist
t

Rate Monotonic Scheduling


A more thorough explanation from [Ta01 p.472]

The classic static real-time scheduling algorithm for preemptable, periodic processes is RMS (Rate Monotonic Scheduling). It can be used for processes that meet the following conditions:
Each periodic process must complete within its period.

Tdist
t

No process is dependent on any other process. Each process needs the same amount of CPU time on each burst. Any non periodic processes have no deadlines. Process preemption occurs instantaneously and with no overhead.

Technical process 2

RMS works by assigning each process a fixed priority equal to the frequency
Computer process P2 has higher priority than process P1 since its rate is higher.

of occurrence of its triggering event. For example, a process that must run every 30ms (= 33Hz) receives priority 33, a process that must run every 40ms (= 25 Hz) receives priority 25. The priorities are linear with the rate, this is why it is called rate monotonic.
Slide 308 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Although the algorithm is not optimal, it is often used in real-time applications because it is fast and simple (at run time!). Note, static scheduling!
Slide 307 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Rate Monotonic Scheduling


Example RT.9 Process A B C request time ri k 30 ms k 40 ms k 50 ms execution time e 10 ms 15 ms 5 ms deadline di (k+1) 30 ms (k+1) 40 ms (k+1) 50 ms

Rate Monotonic Scheduling


Continuation Example RT.9 The processes A, B, C scheduled with Rate Monotonic Scheduling (RMS), Deadline scheduling (EDF).

Three periodic processes [Ta01 p.471]

Figure from [Ta01 p.471] Slide 309 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis Slide 310 Computer Architecture WS 06/07

Figure from [Ta01 p.473] Dr.-Ing. Stefan Freinatis

Rate Monotonic Scheduling


Continuation Example RT.9

Rate Monotonic Scheduling


Example RT.10: Like RT.9 but process A now has 15ms execution time Process A B C request time ri k 30 ms k 40 ms k 50 ms execution time e 15 ms 15 ms 5 ms deadline di (k+1) 30 ms (k+1) 40 ms (k+1) 50 ms

Up to t = 90 the choices of EDF and RMS are the same. At t = 90 process A is requested again. The RMS scheduler votes for A (process A4 in the figure) since its priority is higher than the priority of B, thus B is interrupted. The deadline scheduler in contrast has a choice because the deadline of A is the same as the deadline of B (dA = dB = 120). In practice, preempting B has some nonzero cost associated, therefore it is better to let B continue.

The schedulability test yields that the processes are schedulable.

Σ (i = 1 ... n)  ei / Tdist,i  =  15/30 + 15/40 + 5/50  =  0.5 + 0.375 + 0.1  =  0.975  ≤  1

See next example (Example RT.10) to dispel the idea that RMS and EDF would always give same results.
Slide 311 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis Slide 312

Nevertheless, RMS fails in this example while EDF does not.

Computer Architecture

WS 06/07

Dr.-Ing. Stefan Freinatis

Rate Monotonic Scheduling


Continuation Example RT.10

Rate Monotonic Scheduling


Why did RMS fail? Using static priorities only works if the CPU utilization is not too high. It was proved* that RMS is guaranteed to work for any system of periodic processes if

Σ (i = 1 ... n)  ei / Tdist,i  ≤  n · (2^(1/n) - 1).

For n = 2 processes, RMS will work for sure if the CPU utilization is below 0.828. For n = 3 processes, RMS will work for sure if the CPU utilization is below 0.780. For any number of processes n, RMS will work for sure if the CPU utilization is below ln 2 (≈ 0.693). RMS leads to a real-time violation: process C misses its deadline dC = 50.
Figure from [Ta01 p.474] Slide 313 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

* C.L. Liu, James Layland: Scheduling Algorithms for Multiprogramming in a Hard-Real-Time Environment, Journal of the ACM, 1973, http://citeseer.ist.psu.edu/liu73scheduling.html
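The Liu and Layland bound is easy to evaluate. The following sketch (illustrative, not lecture code) checks it for the task sets of examples RT.9 and RT.10; in both cases the utilization exceeds the n = 3 bound of 0.780, so RMS gives no guarantee (although it happened to work for RT.9).

# Liu & Layland utilization bound for RMS: n * (2^(1/n) - 1)   (illustrative sketch).

def rms_guaranteed(tasks):
    """tasks: list of (execution_ms, period_ms). True if RMS is guaranteed to work."""
    n = len(tasks)
    u = sum(e / p for e, p in tasks)
    return u <= n * (2 ** (1 / n) - 1)

rt9  = [(10, 30), (15, 40), (5, 50)]     # utilization ~0.808 > 0.780: no guarantee
rt10 = [(15, 30), (15, 40), (5, 50)]     # utilization  0.975: certainly no guarantee
print(rms_guaranteed(rt9), rms_guaranteed(rt10))   # False False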
Slide 314 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Rate Monotonic Scheduling


In example RT.9 the utilization was 0.808 (thus higher than 0.780), why did it work? We were just lucky. With different periods and execution times, a utilization of 0.808 might fail. In example RT.10 the utilization was so high that there was little hope RMS could work. In contrast to RMS, deadline scheduling always works for any schedulable set of processes (single-processor system). Deadline scheduling can achieve 100% CPU utilization. The price paid is a more complex algorithm [Ta01 p.475]. Because RMS is static all priorities are known at run time. Selecting the next process is a matter of just a few machine instructions.
Slide 315 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Real-Time Scheduling
Overview of real-time scheduling algorithms:

Branch and Bound: Try all permutations of processes. Preferably used in static scheduling. German name: "Planen durch Suchen".

Deadline (EDF): Earliest deadline has highest priority. Execution time is not taken into account. Preferably used in dynamic scheduling. German name: "Planen nach Fristen".

Laxity: Least laxity has highest priority. Execution time is taken into account. Preferably used in dynamic scheduling. German name: "Planen nach Spielräumen".

RMS: Highest repetition rate (frequency) has highest priority. Execution time is not taken into account. Preferably used in static scheduling. German name: "Planen nach monotonen Raten".

Slide 316

Computer Architecture

WS 06/07

Dr.-Ing. Stefan Freinatis

Computer Architecture
Processes (153) Threads (178) Interprocess Communication (IPC) (195) Scheduling (247) Real-Time Scheduling (278) Deadlocks (318)

Deadlocks
Consider two processes requiring exclusive access to some shared resources (e.g. file, tape-drive, printer, CD-Writer). { request(resource1); request(resource2); ... release(resource1); release(resource2); }
Process 1

{ request(resource2); request(resource1); ... release(resource2); release(resource1); }


Process 2

Fictitious system call for requesting exclusive access to a resource. When access cannot be granted, the call blocks until the resource is available.
Slide 317 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis Slide 318 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Deadlocks
{ request(resource1); request(resource2); ... release(resource1); release(resource2);
time

Deadlocks
{ request(resource1); request(resource2); ... release(resource1); release(resource2);
time

{ request(resource2);
blocked

} {

Process 1

request(resource2); request(resource1); ... release(resource2); release(resource1); }


Slide 319

When the two processes are executed sequentially (one after the other), no problem arises.

Process 1

request(resource1); ... release(resource2); release(resource1); }


Process 2

When process 1 has acquired the resources before process 2 starts trying the same, no problem arises. Process 2 just has to wait.

Process 2
Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis Slide 320 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Deadlocks
{ request(resource1); request(resource2);
blocked

Deadlocks
{ request(resource2); request(resource1);
blocked

A set of processes is deadlocked when each process in the set is waiting for an event that only another process in the set can cause. Waiting for an event:
Waiting for the availability of a resource
Waiting for some input
Waiting for a message (IPC) or a signal
or any other type of event that a process is waiting for in order to continue

time

Process 1

Process 2

Occasionally, when both processes are carried out in parallel as depicted above, both their attempts to acquire the missing resource will cause the processes to block. Since each process holds a resource that the other one needs, and since each process cannot release its resource, both processes do wait forever (deadlock).

Slide 321

Computer Architecture

WS 06/07

Dr.-Ing. Stefan Freinatis

Slide 322

Computer Architecture

WS 06/07

Dr.-Ing. Stefan Freinatis

Deadlocks
Classical deadlock problem from the non-computer world

Resources
Anything a process / thread needs to continue Exclusive access
Only one process at a time can use the resource (e.g. printer or writing to a shared file). Deadlocks Examples: I/O-devices like printer, tape, CD-ROM, files, but also internal resources such as process table, thread table, file allocation table or semaphores / mutexes.

Yields to car at right

Yields to car at right

Non-exclusive access
More than one process can use the resource at the same time (e.g. reading from a shared file)
Yields to car at right

Yields to car at right

Every car ought to give way to the car on the right. None will proceed.

Preemptable resources
The resource can (with some non-zero cost) be temporarily taken away from a process and given to another process (e.g. memory swapping).

Non-preemptable resources
The resource cannot be temporarily assigned to another process (e.g. printer, CD-Writer) without leading to garbage.
Slide 324 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Figure from lecture slides Computer Architecture WS 05/06 (Basermann / Jungmaier) Slide 323 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Deadlocks
The following four conditions must be present for a deadlock to occur.

Deadlock Modeling
Resource allocation graphs
Process Resource Deadlocks

Mutual Exclusion
Each resource is either currently assigned to exactly one process or is available.

Hold and Wait


A process currently holding a resource can request new resources.

Non-preemptable resources
Resources previously granted cannot be forcibly taken away from a process.

Circular Wait
There must be a circular chain of processes, each of which is waiting for a resource held by another process in the chain.

a) Holding a resource (Process A holds resource R) b) Requesting a resource (Process B requests resource S) c) Deadlock situation: Process D requests U which is held by process C. Process C requests T which is held by D. Figure from [Ta01 p.165]
Slide 326 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

If one of these conditions is absent, no deadlock is possible


Slide 325 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Deadlock Modeling
A B C

Deadlock Modeling
Deadlocks Deadlocks

Example of resource allocation not resulting in a deadlock

time

Figure from [Ta01 p.166]

time

Figure from [Ta01 p.166]

Resource allocation order leading to a deadlock


Slide 327 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis Slide 328

(o)

(p)

(q)

Computer Architecture

WS 06/07

Dr.-Ing. Stefan Freinatis

Deadlocks
Strategies for dealing with deadlocks:

Deadlocks
Strategy 1 (Ignoring the problem)
Most operating systems, including UNIX and Windows, just ignore the problem on the assumption that most users would prefer an occasional deadlock to a rule restricting all users to one process, one open file, and one of everything. If deadlocks could be eliminated for free, there would not be much discussion. But the price is high. If deadlocks occur on the average once a year, but system crashes owing to hardware failures and software errors occur once a week, nobody would be willing to pay a large penalty in performance or convenience to eliminate deadlocks (After Ta01 p.167 ). For that, the deadlock problem often is disregarded.

1. Ignore the problem


Sounds silly, but in fact many operating systems do exactly this assuming that deadlocks occur rarely.

2. Detection & Recovery


The OS tries to detect deadlocks and then takes some recovery action.

3. Avoidance
Resources are granted in such a way that deadlocks cannot occur.

4. Prevention
Trying to break at least one of the four conditions such that no deadlock can happen.

Slide 329

Computer Architecture

WS 06/07

Dr.-Ing. Stefan Freinatis

Slide 330

Computer Architecture

WS 06/07

Dr.-Ing. Stefan Freinatis

Deadlocks
Strategy 2 (Detection & Recovery)
The operating system tries to detect deadlocks and to recover.
Example DL.1 : Consider the following system state: Process A holds R and wants S Process B holds nothing and wants T Process C holds nothing and wants S Process D holds U and wants S and T Process E holds T and wants V Process F holds W and wants S Process G holds V and wants U. Is the system deadlocked, and if so, which processes are involved?
Slide 331 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Strategy 2
Continuation of example DL.1 (deadlock detection) Constructing the resource allocation graph (a):
Figure from [Ta01 p.169]

Deadlocks

deadlock

The extracted cycle (b) shows the processes and resources involved in a deadlock.
Slide 332 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Strategy 2
Deadlock detection with multiple instances of a resource type We have (respectively we define):
Deadlocks

Strategy 2
Deadlock detection with multiple instances of a resource type Definition of current allocation matrix and request matrix:
Deadlocks

n processes: P1 ... Pn m resource classes


Ei = the number of existing resource instances of resource class i, 1 ≤ i ≤ m. E is the existing resource vector, E = (E1 ... Em). A is the available resource vector. Each Ai in A gives the number of currently available resource instances, A = (A1 ... Am). The relation X ≤ Y is defined to be true if Xi ≤ Yi for each i.
Slide 333 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis Slide 334 Computer Architecture WS 06/07 Figure from [Ta01 p.171] Dr.-Ing. Stefan Freinatis

P1 P2

Strategy 2
Deadlock detection with multiple instances of a resource type. Deadlock detection algorithm: 1. All processes are initially unmarked. 2. Look for an unmarked process Pi for which row Ri ≤ A
Here the algorithm is looking for a process that can be run to completion (the resource demands of the process can be satisfied immediately). Deadlocks

Strategy 2
Example DL.2 (deadlock detection algorithm): Consider the following system state:
Deadlocks

Figure from [Ta01 p.173]

3. If such a Pi is found, add row Ci to A and mark Pi. Go to step 2.


After Pi is (or would have) finished, its resources are given back to the pool. The process is marked (in the sense of successful completion).

4. If no such process exists, terminate. All unmarked processes, if any, are deadlocked!
Slide 335 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Is there (or will there be) a deadlock in the system?


Slide 336 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Strategy 2
Continuation of example DL.2 (deadlock detection algorithm)
Deadlocks
Checking P1: R1 is not ≤ A (CD-ROM is missing). P1 cannot run and is not marked.
Checking P2: R2 is not ≤ A (Scanner is missing). P2 cannot run and is not marked.
Checking P3: R3 ≤ A, thus P3 can run and is marked. The resources are given back to the pool. A = (2 2 2 0).
Checking P1: R1 still is not ≤ A (CD-ROM still not available).
Checking P2: R2 now is ≤ A, thus P2 can run and is marked. The resources are given back to the pool. A = (4 2 2 1).
Checking P1: R1 now is ≤ A. P1 can run and is marked. The resources are given back to the pool. A = (4 2 3 1) = E.
No more unmarked processes: termination.

Strategy 2
Example DL.3 (deadlock detection algorithm):
Deadlocks

Same as DL.2 but now C2 = (2 1 0 1) and thus A = (2 0 0 0).

Checking P1: R1 is not ≤ A (CD-ROM is missing). P1 cannot run and is not marked.
Checking P2: R2 is not ≤ A (Scanner is missing). P2 cannot run and is not marked.
Checking P3: R3 is not ≤ A (Plotter is missing). P3 cannot run and is not marked.
All processes checked. Nothing will change: termination.

The entire system is deadlocked!

No deadlocks.
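The algorithm can be written down compactly. The sketch below is an illustrative addition (not the book's code); the matrices are the ones used in the walkthroughs of examples DL.2 and DL.3 (figure from [Ta01 p.173]). A process is marked as soon as its request row fits into A, and its allocation is then returned to the pool.

# Deadlock detection with multiple resource instances (illustrative sketch).
# C: current allocation matrix, R: request matrix, A: available resource vector.

def deadlocked_processes(C, R, A):
    A = list(A)
    unmarked = set(range(len(C)))
    progress = True
    while progress:
        progress = False
        for i in list(unmarked):
            if all(R[i][j] <= A[j] for j in range(len(A))):   # process i can finish
                A = [A[j] + C[i][j] for j in range(len(A))]   # return its resources
                unmarked.remove(i)
                progress = True
    return unmarked                     # processes still unmarked are deadlocked

# Example DL.2 (4 resource classes; column order as in [Ta01 p.173]): no deadlock
C = [[0, 0, 1, 0], [2, 0, 0, 1], [0, 1, 2, 0]]
R = [[2, 0, 0, 1], [1, 0, 1, 0], [2, 1, 0, 0]]
print(deadlocked_processes(C, R, [2, 1, 0, 0]))    # set() -> no deadlock

# Example DL.3: like DL.2 but C2 = (2 1 0 1), hence A = (2 0 0 0)
C3 = [[0, 0, 1, 0], [2, 1, 0, 1], [0, 1, 2, 0]]
print(deadlocked_processes(C3, R, [2, 0, 0, 0]))   # {0, 1, 2} -> all deadlocked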
Slide 337 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis Slide 338 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Strategy 2
Detection & Recovery
Deadlocks

Deadlocks
Strategy 3 (Avoidance)
Do not allow system states that may result in a deadlock.

Resource Preemption
Forcibly taking away a resource from a process. May have ill side effects. Difficult or even impossible in many cases.

Process Rollback
A process periodically writes its complete state to file (checkpointing). In case of a deadlock, the process is rolled back to an earlier state in which it occupied lesser resources. Program(ming) overhead!

A state is said to be safe when it is not deadlocked and there exists some scheduling order in which every process can run to completion even if all of them request their maximum number of resources. An unsafe state may result in a deadlock, but does not have to. maximum number of resource instances needed (requests) number of resource instances currently held (allocation)

Killing Processes
Crudest but simplest method. One or more processes from the chain are terminated and must be started all over again at some later point in time. May also cause ill effects consider a process updating a data base twice instead of once.

Assume there is a total number of 10 instances available. Then the state is a safe state since there is a way to run all processes.
Slide 340 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Slide 339

Computer Architecture

WS 06/07

Dr.-Ing. Stefan Freinatis

Strategy 3
Deadlocks

Strategy 3
Deadlocks

(a)

(b)

(c)

(d)

(e)
Figure from [Ta01 p.177]

(a)

(b)

(c)

(d)
Figure from [Ta01 p.177]

a) starting situation as before (this is a safe state) a) starting situation (question: is this a safe state?). There are 3 resources left in the pool. b) B is granted 2 additional resources. c) B has finished. Now 5 resources are free. d) C is granted another 5 resources. e) C has finished. Now 7 resources are free. Process A can be run without problems. Thus (a) is a safe state.
Slide 341 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

b) A is granted one additional resource. c) B is granted the remaining 2 resources. d) B has finished. A and C cannot run because each of them needs 5 resources to complete. Deadlock. Any other sequence starting from (b) also ends up in a deadlock. Therefore state (b) is an unsafe state. The move from (a) to (b) was bringing the system from a safe state to an unsafe state.
Slide 342 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Strategy 3
Bankers Algorithm (Dijkstra 1965)
Deadlocks

Strategy 3
Continuation Bankers Algorithm
Deadlocks

Think of a small-town banker who deals with a group of customers to whom he has granted lines of credit. If granting a request leads to an unsafe state, the request is denied. If a request leads to a safe state, the request is granted. Knowing that not all customers need their credit line immediately, the banker has reserved 10 money units instead of 22 to service them. Initial state There are four customers (processes) demanding for a total of 22 money units (resources). The banker (operating system) has provided 10 money units in total.

The bankers algorithm considers each request as it occurs. A request is granted when the state remains safe, otherwise the request is postponed until later.

(a)

(b)

(c)

a) Initial state (safe) b) Safe state: Cs maximum request can be satisfied. When C has paid back the 4 money units, Bs request (or Ds) can be satisfied. ... c) Unsafe state: If any of the customers requests the maximum, the banker would be stuck (deadlock). Figure from [Ta01 p.178]
Slide 344 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Slide 343

Computer Architecture

WS 06/07

Dr.-Ing. Stefan Freinatis

Strategy 3
Bankers Algorithm for multiple resource instances
Deadlocks

Strategy 3
Bankers Algorithm for multiple resource instances
Deadlocks

1. Look for a row Ri whose unmet requirements are smaller than (or equal) to A. If no such row exists, the system will deadlock
Existing

sooner or later since no process can run to completion. 2. Assume the process of the row chosen requests its maximum resources (which is guaranteed to be possible) and finishes. Mark the process as terminated and add its resources to the pool A. 3. Repeat steps 1 and 2 until either all processes are marked (in which case the initial state was safe), or until a deadlock occurs (in which case the initial state was unsafe).

Available Possessed (allocated)

Current allocation matrix C

Request matrix R
Figure from [Ta01 p.179]

Slide 345

Computer Architecture

WS 06/07

Dr.-Ing. Stefan Freinatis

Slide 346

Computer Architecture

WS 06/07

Dr.-Ing. Stefan Freinatis

Strategy 3
Bankers Algorithm for multiple resource instances
The pool is A = (1 0 2 0). Process D can be scheduled next because (0 0 1 0) < (1 0 2 0). When finished, the pool is A = (1 0 1 0) + (1 1 1 1) = (2 1 2 1) . Process A can be scheduled because (1 1 0 0) < ( 2 1 2 1). When finished, the pool is A = (1 0 2 1) + (4 1 1 1) = (5 1 3 2). Process B can be scheduled because (0 1 1 2) < (5 1 3 2). When finished, the pool is A = (5 0 2 0) + (0 2 1 2) = (5 2 3 2). Process C can be scheduled because (3 1 0 0) < (5 2 3 2). When finished, the pool is A = (2 1 3 2) + (4 2 1 0) = (6 3 4 2). Process E can be scheduled because (2 1 1 0) < (6 3 4 2). When finished, the pool is A = (4 2 3 2) + (2 1 1 0) = (6 3 4 2).
Slide 347 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Strategy 3
Deadlocks

Bankers Algorithm for multiple resource instances


No more processes. All processes have successfully completed.

Deadlocks

The state shown is a safe state since we have found at least one way to complete all processes. Other sequences are possible.
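The safety check at the heart of the banker's algorithm is essentially the same marking loop, run on a hypothetical state: a request is granted only if the state that would result is still safe. A minimal sketch of that reading (illustrative only; C, M, A and the request vector are whatever the caller supplies):

# Banker's algorithm: grant a request only if the resulting state is safe (sketch).
# C: current allocation, M: maximum demand, A: available resource vector.

def is_safe(C, M, A):
    A = list(A)
    unfinished = set(range(len(C)))
    while unfinished:
        runnable = [i for i in unfinished
                    if all(M[i][j] - C[i][j] <= A[j] for j in range(len(A)))]
        if not runnable:
            return False                     # nobody can finish any more: unsafe state
        i = runnable[0]
        A = [A[j] + C[i][j] for j in range(len(A))]   # i finishes, returns its resources
        unfinished.remove(i)
    return True

def grant(C, M, A, proc, request):
    """Tentatively allocate 'request' to process 'proc'; grant only if still safe."""
    C2 = [row[:] for row in C]
    C2[proc] = [C2[proc][j] + request[j] for j in range(len(A))]
    A2 = [A[j] - request[j] for j in range(len(A))]
    return is_safe(C2, M, A2)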

In practice the bankers algorithm is of minor use, because processes rarely know in advance the maximum number of resources needed, the number of processes is not constant over time as users log in and out (or other events require computational attention).
Slide 348 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Deadlocks
Strategy 4 (Deadlock Prevention)
Break (at least) one of the four conditions for a deadlock.

Strategy 4
Deadlocks

Attacking the no preemption condition


Forcibly removing a resource from a process is barely possible.

Avoiding mutual exclusion


Sometimes possible. Instead of using a printer exclusively, the processes write into a print spooler directory. This way several processes can use the printer at the same time. However, an internal system table (e.g. process table) cannot be spooled. Similar applies to a CD-Writer.

Breaking circular wait


Provide a global numbering of all resources (ranking). Resource requests must be made in ascending order. This way a resource allocation graph can have no cycles. In the figure, B cannot request the scanner even if it would be available.

Breaking the hold and wait


Processes request all their resources at once (either all or none). However, not all processes know their demand from the beginning. Moreover, the resources are not optimally used then (degradation in multi-programming). Variation: each time an additional resource is needed, the process releases all its resources first and then tries to acquire all of them at once. This way a process does not occupy resources while waiting for a new one.
Slide 349 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

1. 2. 3. 4. 5.

Imagesetter Scanner Plotter Tape drive CD-Rom drive

Scanner

Plotter

However, not all resources allow for a reasonable order. How to order table slots, disk spooling space, locked database records?

Slide 350

Computer Architecture

WS 06/07

Dr.-Ing. Stefan Freinatis

Computer Architecture

Memory Management
Memory (353)

Memory Management

Paging (381) Segmentation (400) Paged Segments (412) Virtual Memory (419) Caches (471)

Slide 351

Computer Architecture

WS 06/07

Dr.-Ing. Stefan Freinatis

Slide 352

Computer Architecture

WS 06/07

Dr.-Ing. Stefan Freinatis

Memory
Core Memory
Period: 1950 ... 1975 Non-volatile Matrix of magnetic cores Storing a bit by changing the magnetic polarity of a core Access time 3s ... 300ns Destructive read
After reading a core, the content is lost. A read cycle must be followed by a write cycle i.o. to restore.
Image source: http://www.psych.usyd.edu.au/pdp-11/core.html Slide 353 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Memory
Semiconductor Memory (1970 ...)
Dynamic memory (DRAM)
Storing a bit by charging a capacitor
(sometimes just the self-capacitance of a transistor) Memory Management

One transistor per bit


High density / capacity per area unit

Volatile Destructive read Self-discharging


Periodic refresh needed
Image source: http://www.research.ibm.com/journal/rd/391/adler.html Slide 354 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Memory
Semiconductor Memory (1970 ...)
Static memory (SRAM)
Storing a bit in a flip-flop
Setting / Resetting the flip-flop Memory Management

Memory Hierarchy
Memory Management

Program(mer)s want unlimited amounts of fast memory. Economical solution: Memory hierarchy.

6 transistors per bit


More chip area than with DRAM

Volatile Non-destructive read No self-discharge Fast!


Image source: Wikipedia on SRAM (English) Slide 355 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Memory hierarchy levels in typical desktop / server computers, figure from [HP06 p.288]
Slide 356 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Main Memory
Central to computer system Large array of words / bytes Many programs at a time
for multi-programming / tasking to be effective

Address Binding
Operating System
program 1 program 2 program 3 program 4 program 5 program 6 program n Memory Management

Program = binary executable file Code/data accessible via addresses


... i = i + 1; check(i); ...
Addresses in the source code are symbolic, here: i (a variable) and check (a function). The compiler typically binds the symbolic addresses to relocatable addresses, such as i is 14 bytes from the beginning of the module. The compiler may also

be instructed to produce absolute addresses (non-relocatable code).


Working Memory Memory layout of a time sharing system
Slide 357 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

The loader finally binds the relocatable addresses to absolute addresses, such as i is at 74014 when loading the code into memory.
Slide 358 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Address Binding Schemes


The binding of code and data to logical memory addresses can be done at three stages:
Memory Management

Logical / Physical Addresses


Memory Management

Logical Address
The address generated by the CPU, also termed virtual address. All logical addresses form the logical (virtual) address space.

Compile time (Program creation)


The resulting code is absolute code. All addresses are absolute. The program must be loaded exactly to a particular logical address in memory.

Physical Address
The address seen by the memory. All physical addresses form the physical address space. In compile-time and load-time address-binding schemes the logical and the physical addresses are the same. In execution-time address-binding the logical and physical addresses differ.
Slide 360 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Load time
The code must be relocatable, that is, all addresses are given as an offset from some starting address (relative addresses). The loader calculates and fills in the resulting absolute addresses at load time (before execution starts).

Execution time
The relocatable code is executed. Address translation from relative to absolute addresses takes place at execution time (for every single memory access). Special hardware needed (MMU).
Slide 359 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Memory Management Unit


Memory Management

Protection
Memory Management

Hardware device that maps logical addresses to physical addresses (MMU).

Protecting the kernel against user processes


No user process may read, modify or even destroy kernel data (or kernel code). Access to kernel data (system tables) only through system calls.

Protecting user processes from one another


No user process may read or modify other processes` data or code. Any data exchange between processes only via IPC.

MMU equipped with limit register Loaded with the highest allowed logical address
This is done by the dispatcher as part of the context switch.

Any address beyond the limit causes an error


A program (a process) deals with logical addresses, it never sees the real physical addresses.
Slide 361 Computer Architecture WS 06/07

Assumption: contiguous physical memory per process


Figure from [Sil00 p.258] Dr.-Ing. Stefan Freinatis Slide 362 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Protection
Memory Management

Memory Occupation
Obtaining better memory-space utilization
Memory Management Initially the entire program plus its data (variables) needed to be in memory

Dynamic Loading
Load what is needed when it is needed.

Overlays
Replace code by other code.

Dynamic Linking (Shared Libraries)


Use shared code rather than back-pack everything.

Limit register for protecting process spaces against each other


Figure from [Sil00 p.266] Slide 363 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Swapping
Temporarily kick out a process from memory.
Slide 364 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Dynamic Loading
Memory Occupation

Overlays
Memory Occupation

Routines are kept on disk


Main program is loaded into memory.

Existing code is replaced by new code


Similar to dynamic loading, but instead of adding new routines to the memory, existing code is replaced by the loaded code.

Routine loaded when needed


Upon each call it is checked whether the routine is in memory. If not, the routine is loaded into memory.

No special OS support required


Overlay technique implemented by the user.

Unused routines are never loaded


Although the total program size may be large, the portion that is actually executed can be much smaller.

Example: Consider a two-pass assembler Pass 1 Pass 2 Symbol table Common routines 70 kB 80 kB 20 kB 30 kB Loading everything at once would require 200 kB.

No special OS support required


Dynamic loading is implemented by the user. System libraries (and corresponding system calls) may help the programmer.

Pass 1 and pass 2 do not need to be in memory at the same time Overlay
Slide 366 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Slide 365

Computer Architecture

WS 06/07

Dr.-Ing. Stefan Freinatis

Overlays
Memory Occupation Pass 1, when finished, is overlayed by pass 2. An additional overlay driver is needed (10 kB), but the total memory requirement now is 140 kB instead of 200 kB.

Dynamic Linking
Different processes use same code
Memory Occupation This especially true for shared system libraries (e.g. reading from keyboard, graphical output on screen, networking, printing, disk access).

Single copy of shared code in memory


Rather than linking the libraries statically to each program (which increases the size of each binary executable), the libraries (or individual routines) are linked dynamically during execution time. Each library only resides once in physical memory.

Stub
is a piece of program code initially located at the library references in the program. When first called it loads the library (if not yet loaded) and replaces itself with the address of the library routine.

OS support required
Memory
Slide 367 Computer Architecture WS 06/07 Figure from [Sil00 p.262] Dr.-Ing. Stefan Freinatis

since a user process cannot look beyond its address space whether (and where) the library code may be located in physical memory (protection!).
Slide 368 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Swapping
Memory Occupation

Swapping
Memory Occupation

A process can be swapped temporarily out of memory to a

backing store, and then brought back into memory for continued execution.
Backing store: fast disk large enough to accommodate copies

of all memory images for all users; must provide direct access to these memory images.
Roll out, roll in swapping variant used for priority-based

scheduling algorithms; lower-priority process is swapped out so higher-priority process can be loaded and executed.
Major part of swap time is transfer time; total transfer time is

Figure from [Sil00 p.263]

directly proportional to the amount of memory swapped.


Slide 369 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Figure: Process P1 is swapped out, and process P2 is swapped in.


Slide 370 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Memory Allocation
Allocation of physical memory to a process
Memory Management

Contiguous Memory Allocation


The physical memory allocated to a process is contiguous (no holes).

Contiguous
The physical memory space is contiguous (linear) for each process.

Fixed-sized partitions
Memory is divided into fixed sized partitions. Originally used by IBM OS/360, no longer in use today.

Operating System
process 1

Fixed-sized partitions Variable sized partitions


Placement schemes: first fit, best fit, worst fit

Simple to implement Degree of multiprogramming is bound by the number of partitions Internal fragmentation
free partition

process 2 process 3

Non-Contiguous
The physical memory space per process is fragmented (has holes).

Paging Segmentation Combination of Paging and Segmentation


Slide 371 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

process 4

Slide 372

Computer Architecture

WS 06/07

Dr.-Ing. Stefan Freinatis

Contiguous Memory Allocation
The physical memory allocated to a process is contiguous (no holes).

Variable-sized partitions
Partitions are of variable size.
OS must keep a free list listing free memory (holes).
OS must provide a placement scheme. Degree of multiprogramming only limited by available memory. No (or very little) internal fragmentation. External fragmentation: the holes may be too small for a new process.
Figure: memory layout with the Operating System and processes 1 ... 4 in variable-sized partitions.

Slide 373 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Compaction
Reducing external fragmentation (for variable-sized partitions). Copy operation is expensive.
Figure: the Operating System and processes 1 ... 4 before and after compaction; the free memory is collected into one contiguous block.

Slide 374 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Placement Schemes
Satisfying a request of size n from a list of free holes.
General to the following schemes: find a large enough hole, allocate the portion needed, and return the remainder (leftover hole) to the free list.

First fit
Find the first hole that is large enough. Fastest method.

Best fit
Find the smallest hole that is large enough. The entire list must be searched (unless it is sorted by hole size). This strategy produces the smallest leftover hole.

Worst fit
Find the largest hole. Search entire list (unless sorted). This strategy produces the largest leftover hole, which may be more useful than the smallest leftover hole from the best-fit approach.

Slide 375 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

First Fit example: we need a certain amount of memory; the search starts at the bottom. The first hole encountered that is large enough is taken.

Slide 376 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Best Fit example: we have to search all holes. The top hole fits best. This scheme creates the smallest leftover hole among the three schemes.

Slide 377 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Worst Fit example: we have to search all holes. The bottom hole is found to be the largest. This scheme creates the largest leftover hole among the three schemes.

Slide 378 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis
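The three placement schemes can be sketched in a few lines of C. This is a minimal illustration only (not taken from the lecture); the free list is assumed to be a plain array of hole descriptors, and the field names are invented.

    #include <stddef.h>

    struct hole { size_t start; size_t size; };   /* one free hole */

    /* Returns the index of the chosen hole, or -1 if the request cannot be satisfied.
       mode 0 = first fit, 1 = best fit, 2 = worst fit. */
    int place(const struct hole *list, int count, size_t request, int mode)
    {
        int chosen = -1;
        for (int i = 0; i < count; i++) {
            if (list[i].size < request) continue;       /* hole too small          */
            if (mode == 0) return i;                    /* first fit: take it now  */
            if (chosen == -1 ||
                (mode == 1 && list[i].size < list[chosen].size) ||  /* best fit: smallest hole  */
                (mode == 2 && list[i].size > list[chosen].size))    /* worst fit: largest hole  */
                chosen = i;
        }
        return chosen;   /* caller allocates 'request' bytes and returns the leftover hole */
    }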

Memory Allocation
Allocation of physical memory to a process

Memory Management
Memory (353) Paging (381) Segmentation (400) Paged Segments (412) Virtual Memory (419) Caches (471)

Contiguous
The physical memory space is contiguous (linear) for each process.

Fixed-sized partitions Variable sized partitions


Placement schemes: first fit, best fit, worst fit

Non-Contiguous
The physical memory space of a process is fragmented (has holes).

Paging Segmentation Combination of Paging and Segmentation


Slide 379 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis Slide 380

Computer Architecture

WS 06/07

Dr.-Ing. Stefan Freinatis

Paging
Physical address space of a process can be non-contiguous.
Physical memory divided into fixed-sized frames.
Frame size is a power of 2, between 512 bytes and 8192 bytes.
Logical memory divided into pages. Page size is identical to frame size.
OS keeps track of all free frames (free-frame list).
Running a program of size n pages requires finding n free frames.
Page table translates logical to physical addresses.
Internal fragmentation, no external fragmentation.

Slide 381 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Address Translation
Address generated by CPU is divided into:
Page number p: used as an index into a page table which contains the base address f of the corresponding frame in physical memory.
Page offset d: the offset from the frame start, physical memory address = f + d.
The logical address is m bits wide: the upper m-n bits form the page number p, the lower n bits form the page offset d. Page size = frame size = 2^n.

Slide 382 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Paging
Physical address = f + d, where f = PageTable[p].
p = the m-n most significant bits of the logical address, d = the n least significant bits.
Figure from [Sil00 p.270]

Slide 383 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Paging
Paging model: logical address space is contiguous, whereas the corresponding physical address space is not (the frames are scattered from low memory to high memory).
Figure from [Sil00 p.271]

Slide 384 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Paging
What is the physical address of k?
n = 2 (page size is 4 byte), m = 4 (logical address space is 16 byte). k is located at logical address 10D.
10D = 1010B, so p = 2, d = 2.
f = PageTable[2] = 4
Physical address = f + d = 4 + 2 = 6
Figure from [Sil00 p.272]

Slide 385 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Free-Frame List
The OS must maintain a table of free frames (free-frame list).
Figure: the free-frame list (frames 14, 13, 18, 20, 15) before and after allocating a new process; pages 0 ... 3 of the new process are placed into frames 14, 13, 18, 20, the page table of the new process is set up accordingly, and only frame 15 remains free.

Slide 386 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis
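The translation of the k example can be written down directly in C. A minimal sketch (not from the lecture), assuming the page table simply stores the frame base address f as in the figure from [Sil00 p.272]:

    #include <stdint.h>

    #define N 2    /* page offset width: page size = 2^N = 4 byte        */
    #define M 4    /* logical address width: address space = 2^M = 16 byte */

    /* frame base addresses for pages 0..3, as in the example figure */
    static const uint32_t page_table[1 << (M - N)] = { 20, 24, 4, 8 };

    uint32_t translate(uint32_t logical)
    {
        uint32_t p = logical >> N;                /* page number: upper m-n bits */
        uint32_t d = logical & ((1u << N) - 1);   /* page offset: lower n bits   */
        uint32_t f = page_table[p];               /* frame base address          */
        return f + d;                             /* physical address            */
    }

For k at logical address 10, translate(10) splits the address into p = 2, d = 2 and returns 4 + 2 = 6, matching the example.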

Page-Table
Where to locate the page table?

Dedicated registers within CPU
Only suitable for small memory. Used e.g. in the PDP-11 (8 page registers, each page 8 kB, 64 kB main memory total). Fast access (high-speed registers).

Table in main memory
A dedicated CPU register, the page-table base register (PTBR), points to the table in memory (the table currently in use). With each context switch the PTBR is reloaded (then pointing to another page table in memory). The actual size of the page table is given by a second register, the page table length register (PTLR).
With the latter scheme we need two memory accesses, one for the page table, and one for accessing the memory location itself. Slowdown! Solution: special hardware cache: translation look-aside buffer (TLB).

Slide 387 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Translation Look-Aside Buffer
A translation look-aside buffer (TLB) is a small fast-lookup associative memory.
The associative registers contain page frame entries (key | value): the key is a page number, the value is the frame address or frame number. When a page number is presented to the TLB, all keys are checked simultaneously. If the desired page number is not in the TLB, it must be fetched from memory.
Figure: an eight-entry TLB with keys (page numbers) 5, 0, 1, 4, 2, 6, 9, 3 and values (frame addresses or frame numbers) 12, 14, 13, 4, 18, 15, 17, 20.

Slide 388 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Translation Look-Aside Buffer
Paging hardware with TLB. Figure from [Sil00 p.276]

Slide 389 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Memory Access Time
Assume: memory access time = 100 ns, TLB access time = 20 ns.
When page number is in TLB (hit): total access time = 20 ns + 100 ns = 120 ns
When page number is not in TLB (miss): total access time = 20 ns + 100 ns + 100 ns = 220 ns
With 80% hit ratio: average access time = 0.8 × 120 ns + 0.2 × 220 ns = 140 ns
With 98% hit ratio: average access time = 0.98 × 120 ns + 0.02 × 220 ns = 122 ns

Slide 390 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Protection
With paging the processes' memory spaces are automatically protected against each other since each process is assigned its own set of frames. If a process tries to access a page that is not in its page table (or is marked invalid -- see next slide), the process is trapped by the OS.
Figure from [Sil00 p.272]: a page table mapping pages 0 ... 3 to the frame addresses 20, 24, 4, 8; the valid physical addresses are 20 ... 23, 24 ... 27, 04 ... 07 and 08 ... 11.

Slide 391 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Frame Attributes
Each frame may be characterized by additional bits in the page table.
Valid / invalid: whether the frame is currently allocated to the process.
Read-Only: frame is read-only.
Execute-Only: frame contains code.
Shared: frame is accessible to other processes as well.
Figure from [Sil00 p.277]

Slide 392 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Shared Pages
Implementation of shared memory through paging is rather easy.
A shared page is a page whose frame is allocated to other processes as well. Many processes share a page in that each of the shared pages is mapped to the same frame in physical memory. Shared code must be non-self-modifying code (reentrant code).
Figure on the next slide: Three processes are using an editor. The editor needs 3 pages for its code. Rather than loading the code three times into memory, the code is shared. It is loaded only once into memory, but is visible to each process as if it were its private code. The data (the text edited), of course, is private to each process. Each process thus has its own data frame.

Slide 393 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Shared Pages
Pages 0, 1, 2 of each process are mapped to physical frames 3, 4, 6.
Note: free memory is shown in gray, occupied memory in white.
Figure from [Sil00 p.283]

Slide 394 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Paging
Logical address space of modern CPUs: 2^32 ... 2^64.
Assume: 32-bit CPU, frame size = 4K. 2^32 / 2^12 = 2^20 page table entries (per process).
Each entry size = 20 bit + 20 bit = 5 byte
(20 bit for page number, 20 bit for frame number -- less than the 32 bit required for the frame address).
2^20 × 5 byte = 5 MB per page table!

Slide 395 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Two-Level Paging
Often, a process will not use all of its logical address space. Rather than allocating the page table contiguously in main memory (for the worst case), the page table is divided into small pieces and is paged itself.
The output of the outer page table points to a frame containing page table entries (inner page table entries); the output of the inner page table points to the final destination frame.

Slide 396 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Two-Level Paging
The logical address is divided into page number p1 (10 bit), page number p2 (10 bit) and page offset d (12 bit). The numbers are for the 32-bit, 4 kB frame example. The outer page table has max 2^10 entries, each page of the inner table has 2^10 entries, and the inner entry points to the final destination frame in memory.
Figure from [Sil00 p.279]

Slide 397 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Multi-Level Paging
Tree-structure principle: each outer page entry defines a root node of a tree.
Two / three / four level paging: SPARC (32 bit): three-level paging. Motorola 68030 (32 bit): four-level paging.
Better memory utilization than using a contiguous (and possibly maximum-sized) page table.
Increase in access time, since we hop several times until the final memory location is reached. Caching (TLB) however helps out a lot. Four-level paging with 98% hit rate: effective access time = 0.98 × 120 ns + 0.02 × 520 ns = 128 ns.

Slide 398 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis
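For the 32-bit, 4 kB frame example, splitting the logical address into p1, p2 and d is plain bit manipulation. A minimal C sketch (illustrative only; real MMUs do this in hardware, and the table layout here is invented):

    #include <stdint.h>

    uint32_t translate2(uint32_t *const *outer_table, uint32_t logical)
    {
        uint32_t p1 = logical >> 22;             /* upper 10 bits: index into outer table   */
        uint32_t p2 = (logical >> 12) & 0x3FF;   /* next 10 bits: index into inner table    */
        uint32_t d  = logical & 0xFFF;           /* lower 12 bits: offset within the frame  */
        const uint32_t *inner_table = outer_table[p1];   /* one page of inner entries       */
        uint32_t frame_base = inner_table[p2];           /* base address of the destination frame */
        return frame_base + d;
    }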

Computer Architecture
Memory (353) Paging (381) Segmentation (400) Paged Segments (412) Virtual Memory (419) Caches (471)

Segmentation
User views of logical memory: Linear array of bytes
Reflected by the Paging memory scheme

A collection of variable-sized entities


User thinks in terms of subroutines, stack, symbol table, main program which are somehow located somewhere in memory.

Segmentation supports this user view. The logical address space is a collection of segments.
WS 06/07 Dr.-Ing. Stefan Freinatis Slide 400 Computer Architecture WS 06/07

Figure from [Sil00 p.285] Dr.-Ing. Stefan Freinatis

Slide 399

Computer Architecture

Segmentation
Figure: segments 1, 2, 3, 4 of the user space and their placement in physical memory.

Segmentation
Physical address space of a process can be non-contiguous, as with paging.


Logical address consists of a tuple <segment number, offset>.
Segment table maps the logical address onto the physical address: base = physical address of the segment, limit = length of the segment.
Segment table can hold additional segment attributes, like the frame attributes in paging.

Segmentation model: The user space (logical address space) consists of a collection of segments which are mapped through the segmentation architecture onto the physical memory.
Slide 401 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Shared Segments
Shared segments are mapped to the same segment in physical memory.

Slide 402

Computer Architecture

WS 06/07

Dr.-Ing. Stefan Freinatis

Segmentation
s selects the entry from the table. Offset d is checked against the maximum size of the segment (limit). Final physical address = base + d.

Segmentation
Segments are variable-sized
Dynamic memory allocation required (first fit, best fit, worst fit).

External fragmentation
In the worst case the largest hole may not be large enough to fit in a new segment. Note that paging has no external fragmentation problem.

Each process has its own segment table


like with paging where each process has its own page table. The size of the segment table is determined by the number of segments, whereas the size of the page table depends on the total amount of memory occupied.

Segment table located in main memory


as is the page table with paging

Segment table base register (STBR)


points to current segment table in memory

Segment table length register (STLR)


Figure from [Sil00 p.286] Slide 403 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

indicates number of segments


Slide 404 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis
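The lookup described on slide 403 (select entry s, check offset d against the limit, add the base) can be sketched in C. Illustrative only; the trap is merely indicated by a return value, and the struct layout is invented:

    #include <stdint.h>

    struct seg_entry { uint32_t base; uint32_t limit; };   /* limit = segment length */

    /* Returns the physical address, or -1 on an addressing error (the OS would trap). */
    int64_t seg_translate(const struct seg_entry *seg_table, uint32_t s, uint32_t d)
    {
        if (d >= seg_table[s].limit)         /* offset beyond end of segment */
            return -1;
        return (int64_t)seg_table[s].base + d;
    }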

Segmentation
Example:

Segmentation
Example (continued):
Figure from [Sil00 p.287]

A program is being assembled. The compiler determines the sizes of the individual components (segments) as follows:
Segment            Size
main program       400 byte
symbol table       1000 byte
function sqrt()    400 byte
subroutine         1000 byte
stack              1100 byte

Slide 405 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

The process is assigned 5 segments in memory as well as a segment table.


Slide 406 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Shared Segments
Segmentation

Paging versus Segmentation


With paging physical memory is divided into fixed-size frames. When

Process P1 and P2 share the editor code. Segment 0 of each process is mapped onto the same physical segment at address 43062.

memory space is needed, as many free frames are occupied as necessary. These frames can be located anywhere in memory, the user process always sees a logical contiguous address space.
With segmentation the memory is not systematically divided. When a

The data segments are private to each process, so segment 1 of each process is mapped to its own segment in physical memory.

program needs k segments (usually these have different sizes), the OS tries to place these segments in the available memory holes. The segments can be scattered around memory. The user process does not see a contiguous address space, but sees a collection of segments (of course each individual segment is contiguous as is each page or frame).
Figure from [Sil00 p.288]

Slide 407

Computer Architecture

WS 06/07

Dr.-Ing. Stefan Freinatis

Slide 408

Computer Architecture

WS 06/07

Dr.-Ing. Stefan Freinatis

Paging versus Segmentation

Paging
Paging is based on fixed-size units of memory (frames).
Each process is assigned its page table.
Page table size proportional to allocated memory. Often large page tables and/or multi-level paging.
Internal fragmentation. Free memory is quickly allocated to a process.
The Motorola 68000 line is based on a flat address space.

Segmentation
Segmentation is based on variable-size units of memory (segments).
Each process is assigned a segment table.
Segment table size proportional to number of segments. Usually small segment tables.
External fragmentation. Lengthy search times when allocating memory to a process.
The Intel 80X86 family is based on segmentation.

Figure: segments seg1 ... seg4 in memory (frames 13 ... 20), illustrating unused memory / internal fragmentation with paging and free memory that can still be allocated with segmentation.

Slide 409 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis
Slide 410 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Memory Management
Memory (353) Paging (381) Segmentation (400) Paged Segments (412) Virtual Memory (419) Caches (471)

Paged Segments
Combining segmentation with paging yields paged segments.
With segmentation, each segment is a contiguous space in physical memory.
With paged segments, each segment is sliced into pages. The pages can be scattered in memory.
Figure: segments seg1 ... seg4 in frames 13 ... 20, once placed contiguously (segmentation) and once sliced into scattered pages (paged segments).

Slide 411 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis
Slide 412 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Paged Segments
Each segment has its own page table.
Figure: segments seg1 ... seg4 of the logical process space, each with its own page table mapping its pages to frame numbers 13 ... 20 in physical memory; the last page of a segment may be partly unused (internal fragmentation).

Slide 413 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Paged Segments
The MULTICS (predecessor of UNIX) operating system solved the problems of external fragmentation and lengthy search times by paging the segments. This solution differs from pure segmentation in that each segment table entry does not contain the base address of the segment, but rather contains the base address of a page table for this segment.
In contrast to pure paging, where each process is assigned a page table, here each segment is assigned a page table. The processes still see just segments, not knowing that the segments themselves are paged. With paged segments no more time is spent on optimal segment placing; however, some internal fragmentation is introduced.

Slide 414 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Paged Segments
Explanation of next slide (principle of paged segments)

Paged Segments

The logical address is a tuple <segment number s, offset d>. The segment number is added to the STBR (segment table base register) and by this points to a segment table entry. The segment table is located in main memory. From the entry the page table base is derived which points to the beginning of the corresponding page table in memory. The first part p of the offset d determines the entry in the page table. The output of the page table is the frame address f (or alternatively a frame number). Finally f + d is the physical memory address. Steps in resolving the final physical address: PageTable = SegmentTable[s].base; f = PageTable[p]; final address = f + d
Slide 415 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Paged Segments
Principle of paged segments (figure): the logical address <s, d> is resolved via the segment table and the per-segment page table.

Slide 416 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis
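The three resolution steps listed above map onto code almost literally. A minimal C sketch (not from the lecture), assuming 4 kB pages and a segment table whose entries simply hold a pointer to the page table of that segment (field names invented):

    #include <stdint.h>

    struct pseg_entry { const uint32_t *page_table; };   /* one entry per segment */

    uint32_t pseg_translate(const struct pseg_entry *seg_table,
                            uint32_t s, uint32_t seg_offset)
    {
        uint32_t p = seg_offset >> 12;       /* first part of the offset: page number        */
        uint32_t d = seg_offset & 0xFFF;     /* remaining offset within the 4 kB page        */
        const uint32_t *page_table = seg_table[s].page_table;  /* PageTable = SegmentTable[s].base */
        uint32_t f = page_table[p];                             /* f = PageTable[p]                */
        return f + d;                                           /* final physical address          */
    }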

Paged Segments
Combination of segmentation and paging
User view is segmentation, memory allocation scheme is paging

Computer Architecture
Memory (353) Paging (381) Segmentation (400) Paged Segments (412) Virtual Memory (419) Caches (471)

Used by modern processors / architectures

Example: Intel 80386


CPU has 6 segment registers
which act as a quick 6-entry segment table

Up to 16384 segments per process possible


in which case the segment table resides in main memory.

Maximum segment size is 4 GB


Within each segment we have a flat address scheme of 232 byte addresses

Page size is 4 kB
A two-level paging scheme is used
Slide 417 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis Slide 418 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Virtual Memory
What if the physical memory is smaller than required by a process?

Virtual Memory
Based on locality assumption
No process can access all its code and data at the same time, therefore the entire process space does not need to be in memory at all time instants. Require special precautions and extra work by the programmer.

Dynamic Loading Overlays

Only parts of the process space are in memory


The remaining ones are on disk and are loaded when demanded

It would be much easier if we would not have to worry about the memory size and could leave the problem of fitting a larger program into smaller memory to the operating system.

Logical address space can be much larger than physical address space
A program larger than physical memory can be executed More programs can (partially) reside in memory which increases the degree of multiprogramming!

Virtual Memory
Memory is abstracted into an extremely large uniform array of storage, apart from the amount of physical memory available.
Slide 419 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Slide 420

Computer Architecture

WS 06/07

Dr.-Ing. Stefan Freinatis

Virtual Memory
Virtual memory concept (one program): the program may be larger than physical memory; only part of it resides in physical memory, the rest is kept on the backing store (usually a disk).

Slide 421 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Virtual Memory
Virtual memory concept (three programs): programs A, B and C (partially) reside in physical memory, the remaining parts are on the backing store; some physical memory stays free.

Slide 422 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Virtual Memory
Virtual memory can be implemented by means of

Virtual Memory

Demand Segmentation
Used in early Burroughs computer systems and in IBM OS/2. Complex segment-replacement algorithms.

Demand Paging
Commonly used today. Physical memory is divided into frames (paging principle). Demand paging applies to both paging systems and paged segment systems.

Figure next slide: Virtual memory usually is much larger than physical memory (e.g. modern 64-bit processors). The pages currently needed by a process are in memory, the other pages reside on disk. From the page table is known whether a page is in memory or on disk.

page table disk


Figure from [Sil00 p.299]

Virtual memory consists of more pages than there are frames in physical memory
Slide 423 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis Slide 424 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Demand Paging
Less I/O
than loading the entire program (at least for the moment)

Virtual Memory

Demand Paging

Virtual Memory

A page is brought from disk into memory when it is needed (when it is demanded by the process)

Q: How does the OS know that a page is demanded by a process? A: When the process tries to access a page that is not in memory!
A process does not know whether or not a page is in memory, only the OS knows.

Less memory needed


since a (hopefully) great part of the program remains on disk

Each page table entry has a validity bit (v)


If v = 1, the page is in memory. If v = 0, the page is on disk. The validity bit is also termed valid-invalid bit.
During address translation, when the validity bit is found 0, the hardware causes a page fault trap to the operating system.

Faster response
The process can start earlier since loading is quicker

More processes in memory


The memory saved can be given to other processes

Loading a page on demand is done by the pager (a part of the operating system usually a daemon process).
Slide 425 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Slide 426

Computer Architecture

WS 06/07

Dr.-Ing. Stefan Freinatis

Page Fault
A page fault occurs when a process tries to access a page that is not memory-resident. Steps in demand paging: 1. A reference to some page is made

Virtual Memory

Page Fault
into the free frame. 5. When disk read is complete, the internal tables are updated to reflect that the page now is in memory.

Virtual Memory

4. A disk operation is scheduled to read in the desired page

2. The page is not listed in the table (or is marked invalid) which causes a page fault trap (a hardware interrupt) to the operating system. 3. An internal table is checked (usually kept with the process control block) to determine whether the reference was a valid or an invalid memory access. If the reference was valid, a free frame is to be found.
Slide 427 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

6. The process is restarted at the instruction that caused the page fault trap. The process can now access the page.

Slide 428

These steps are symbolized in the next figure

Computer Architecture

WS 06/07

Dr.-Ing. Stefan Freinatis

Virtual Memory

Page Fault

Virtual Memory

Page table indicating that pages 0, 2 and 5 are currently in memory, while pages 1, 3, 4, 6, 7 are not.
Figure from [Sil00 p.301] Slide 429 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Steps in handling a page fault


Slide 430 Computer Architecture WS 06/07

Figure from [Sil00 p.302] Dr.-Ing. Stefan Freinatis

Performance of Demand Paging


Page fault rate p, 0 <= p <= 1
Average probability that a memory reference will cause a page fault.

Performance of Demand Paging


Page fault time
Trap to the OS Context switch Check validity Find a free frame Schedule disk read
Virtual Memory The time from the failed memory reference until the machine instruction continues

if p = 0 no page faults at all if p = 1 every reference causes a page fault

Context switch to another process (optional) Place page in frame Adjust tables Context switch and restart process

Memory access time tma


Time to access physical memory (usually in the range of 10 ns ...150 ns)

Assuming a disk system with an average latency of 8 ms, average seek time of 15 ms and a transfer time of 1 ms (and neglecting that the disk queue may hold other processes waiting for disk I/O), and assuming the execution time of the page fault handling instructions to be 1 ms, the page fault time is 25 ms.
Slide 432 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Effective access time teff


Average effective memory access time. This time finally counts for system performance

teff = (1 - p) × tma + p × (page fault time)


Slide 431 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Performance of Demand Paging


Effective access time
tma
Virtual Memory

Performance of Demand Paging


Some possibilities for lowering the page fault rate
Virtual Memory

teff = (1 - p) × 100 ns + p × 25 ms = 100 ns + p × 24,999,900 ns ≈ 100 ns + p × 25 ms


When each memory reference causes a page fault (p = 1), the system is slowed down by a factor of 250000. When one out of 1000 references causes a page fault (p = 0.001), the system is slowed down by a factor of 250. For less than a 10% degradation, the page fault rate p must be less than 0.0000004 (1 page fault in 2.5 million references).
Slide 433 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis
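The effective access time formula is easy to evaluate for different fault rates. A small C sketch using the numbers from the example above (100 ns memory access, 25 ms page fault time); it roughly reproduces the slowdown factors quoted on the slide:

    #include <stdio.h>

    int main(void)
    {
        double t_ma = 100e-9;                       /* memory access time: 100 ns */
        double t_pf = 25e-3;                        /* page fault time: 25 ms     */
        double rates[] = { 1.0, 0.001, 0.0000004 }; /* page fault rates p         */
        for (int i = 0; i < 3; i++) {
            double p = rates[i];
            double t_eff = (1.0 - p) * t_ma + p * t_pf;   /* effective access time */
            printf("p = %g  ->  t_eff = %.1f ns (slowdown factor %.1f)\n",
                   p, t_eff * 1e9, t_eff / t_ma);
        }
        return 0;
    }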

Increase page size


With larger pages the likelihood of crossing page boundaries is lesser.

Use good page replacement scheme


Preferably one that minimizes page faults.

Assign sufficient frames


The system constantly monitors memory accesses, creates page-usage statistics and on-the-fly adjusts the number of allocated frames. Costly, but used in some systems (so-called working set model).

Enforce program locality


Programs can contribute to locality by minimizing cross-page accesses. This applies to the implemented algorithms as well as to the addressing modes of the individual machine instructions.

Slide 434

Computer Architecture

WS 06/07

Dr.-Ing. Stefan Freinatis

Page Size
What should be the page (= frame) size?
Virtual Memory

Page Attributes
Large Pages
internal fragmentation smaller page tables faster disk I/O less page faults Next to the validity bit v, each page may in addition be equipped with the following attribute bits in the page table entry:
Virtual Memory

Small Pages
little internal fragmentation large page tables slower disk I/O more page faults

Reference bit r
Upon any reference to the page (read / write) the bit is set. Once the bit is set it remains set until cleared by the OS.

Modify bit m
Each time the page is modified (write access), the bit is set. The bit remains set until cleared by the OS. A page that is modified is also called dirty. The modify bit is also termed dirty bit. When the page is not modified it is clean.

Trend goes toward larger pages. Page faults are more costly today because the gap between CPU-speed and disk speed increased.

Intel 80386: 4 kB Intel Pentium II: 4 kB or 4 MB Sun UltraSparc: 8 kB, 64 kB, 512 kB, 4MB
Slide 435 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Slide 436

Computer Architecture

WS 06/07

Dr.-Ing. Stefan Freinatis

Finding Free Frames


Terminate another process

Virtual Memory

Page Replacement
Page replacement scheme:
If there is a free frame use it, otherwise use a page-replacement algorithm to select a victim frame. Save the victim page to disk and adjust the tables of the owner process. Read in the desired page and adjust the tables. Improvement Preferably use a victim page that is clean (not modified, m = 0). Clean pages do not need to be saved to disk.
Slide 438 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

What options does the OS have when needing free frames?

Virtual Memory

Not acceptable. The process may already have done some work (e.g. changed a data base) which may mistakenly be repeated when the process is started again.

Swap out a process


An option only in case of rare circumstances (e.g. thrashing).

Hold some frames in spare


Sooner or later the spare frames are used up. Memory utilization is lower since the spare memory is not used productively.

Two page transfers

Borrow frames
Yes! Take an allocated frame, use it, and give it (or another one) back to the owner later. Page Replacement
Slide 437 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Page Replacement
Virtual Memory
0 1 2 3

Page Replacement

Virtual Memory
Figure from [Sil00 p.310]

0 1 2 3

Figure from [Sil00 p.309] Slide 439

Need for page-replacement User process 1 wants to access module M (page 3). All memory however is occupied. Now a victim frame needs to be determined.
Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Page-replacement The victim is saved to disk (1) and the page table is adjusted (2). The desired page is read in (3) and the table is adjusted again. In this figure the victim used to be a page from the same process (or same segment in case of paged segments).
Slide 440 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Page Replacement
Global Page Replacement
process can take a frame from another. Processes can affect each others page fault rate, though.
Virtual Memory

Page Replacement
Page replacement algorithms
Virtual Memory

The victim frame can be from the set of all frames, that is, one

First-in first-out (FIFO)


and its variations second-chance and clock.

Optimal page replacement (OPT) Least Recently Used (LRU) LRU Approximations
Desired: Lowest page-fault rate! Evaluation of the algorithms through applying them onto memory reference strings.
Slide 442 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Local Page Replacement


The victim frame may only be from the own set of frames, that is, the number of allocated frames per process does not change. No impact onto other processes.
The figure on the previous slide shows a local page replacement strategy.

Slide 441

Computer Architecture

WS 06/07

Dr.-Ing. Stefan Freinatis

Memory Reference Strings


Assume the following address sequence:
(e.g. recorded by tracing the memory accesses of a process) Virtual Memory

Memory Reference Strings


In general, the more frames available the lesser is the expected number of page faults.

0100, 0432, 0101, 0612, 0102, 0103, 0104, 0101, 0611, 0102, 0103 0104, 0101, 0610, 0102, 0103, 0104, 0101, 0609, 0102, 0105 Assuming a page size of 100 bytes, the sequence can be reduced to

Page faults versus number of frames

1, 4, 1, 6, 1, 6, 1, 6, 1, 6, 1
This memory reference string lists the pages accessed over time (at the time steps at which page access changes).
If there is only 1 frame available, the sequence would cause 11 page faults. If there are 3 frames available, the sequence would cause 3 page faults.
Figure from [Sil00 p.312] Slide 443 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis Slide 444 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

FIFO Page Replacement


Principle: Replace the oldest page (old = swap-in time).
Virtual Memory

FIFO Page Replacement


Example VM.2
Memory reference string: 1, 2, 3, 4, 1, 2, 5, 1, 2, 3, 4, 5 Number of frames: 3 1
1
Virtual Memory

Example VM.1
Memory reference string: 7, 0, 1, 2, 0, 3, 0, 4, 2, 3, 0, 3, 2, 1, 2, 0, 1, 7, 0, 1 Number of frames: 3

2
1 2

3
1 2 3

4
4 2 3

1
4 1 3

2
4 1 2

5
5 1 2

1
5 1 2

2
5 1 2

3
5 3 2

4
5 3 4

5
5 3 4

0 1 2

frame contents over time

Figure from [Sil00 p.313]

Total: 15 page faults.


Slide 445 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis Slide 446

9 page faults

Computer Architecture

WS 06/07

Dr.-Ing. Stefan Freinatis
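The page fault counts of examples VM.1 and VM.2 can be reproduced with a short simulation. A sketch only, not an OS implementation; the frames are modelled as a small array used as a circular FIFO:

    #include <stdio.h>

    /* Counts page faults for a reference string under FIFO replacement (frames <= 16). */
    int fifo_faults(const int *refs, int n, int frames)
    {
        int mem[16];
        int used = 0, next = 0, faults = 0;
        for (int i = 0; i < n; i++) {
            int hit = 0;
            for (int j = 0; j < used; j++)
                if (mem[j] == refs[i]) { hit = 1; break; }
            if (hit) continue;
            faults++;
            if (used < frames) {            /* a free frame is still available */
                mem[used++] = refs[i];
            } else {                        /* replace the oldest page          */
                mem[next] = refs[i];
                next = (next + 1) % frames;
            }
        }
        return faults;
    }

    int main(void)
    {
        int s[] = { 1, 2, 3, 4, 1, 2, 5, 1, 2, 3, 4, 5 };   /* string of VM.2 / VM.3 */
        printf("%d\n", fifo_faults(s, 12, 3));   /* 9 page faults, as in VM.2  */
        printf("%d\n", fifo_faults(s, 12, 4));   /* 10 page faults, as in VM.3 */
        return 0;
    }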

FIFO Page Replacement


Example VM.3
Number of frames: 4 1
1
Virtual Memory

FIFO Page Replacement


Virtual Memory

From the examples VM.2 and VM.3 it can be noticed that the number of page faults for 4 frames is greater than for 3 frames. This unexpected result is known as Belady's Anomaly [1]:

Memory reference string: 1, 2, 3, 4, 1, 2, 5, 1, 2, 3, 4, 5 (as in VM.2)

2
1 2

3
1 2 3

2
1 2 3 4

5
5 2 3 4

1
5 1 3 4

2
5 1 2 4

3
5 1 2 3

4
4 1 2 3

5
1 5 2 3
1

For some page-replacement algorithms the page-fault rate may increase as the number of allocated frames increases.

10 page faults
Slide 447 Computer Architecture

Although we have more frames available than previously, the page fault rate did not decrease!
WS 06/07 Dr.-Ing. Stefan Freinatis

[1] Laszlo Belady, R. Nelson, G. Shedler: An anomaly in space-time characteristics of certain programs running in a paging machine, Communications of the ACM, Volume 12, Issue 6, June 1969, Pages 349-353, ISSN 0001-0782, also available online as PDF from the ACM.

Slide 448

Computer Architecture

WS 06/07

Dr.-Ing. Stefan Freinatis

Belady's Anomaly

Virtual Memory

Second-Chance Algorithm
This algorithm is a derivative of the FIFO algorithm. Start with the oldest page Inspect the page If r = 0: replace the page. Done. If r = 1: give the page a second chance by clearing r and moving the page to the top of the FIFO Proceed to next oldest page
Virtual Memory

Page faults versus number of frames for the string 1, 2, 3, 4, 1, 2, 5, 1, 2, 3, 4, 5.

When a page is used often enough to keep the r bit set, it will never be replaced. Avoids the problem of throwing out a heavily used page (as may happen with strict FIFO). If all pages have r =1, the algorithm however is FIFO.
Figure from [Sil00 p.314] Slide 449 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis Slide 450 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Second-Chance Algorithm
Virtual Memory

Clock Algorithm
When the FIFO is arranged as a circular list the overhead is less.

Virtual Memory

Second chance constantly moves pages within the FIFO (overhead)!

Example: page A is the oldest in the FIFO (see a). With pure FIFO it would have been replaced. However, as r = 1 it is given a second chance and is moved to the top of the FIFO (see b). The algorithm continues with page B. FIFO

Initially the hand (a pointer) points to the oldest page.


r=1

The algorithm then applied is second chance.

Figure from [Ta01 p.218]

Figure from [Ta01 p.219]

Slide 451

Computer Architecture

WS 06/07

Dr.-Ing. Stefan Freinatis

Slide 452

Computer Architecture

WS 06/07

Dr.-Ing. Stefan Freinatis
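A minimal C sketch of the clock variant described above (illustrative only): the resident pages are reduced to an array of (page, r-bit) pairs and the hand is an index into it.

    struct slot { int page; int r; };      /* resident page and its reference bit */

    /* Selects a victim among n resident slots using the clock algorithm and
       returns its index. *hand is the current clock hand position. */
    int clock_victim(struct slot *slots, int n, int *hand)
    {
        for (;;) {
            struct slot *s = &slots[*hand];
            if (s->r == 0) {                    /* not referenced recently: replace it */
                int victim = *hand;
                *hand = (*hand + 1) % n;
                return victim;
            }
            s->r = 0;                           /* second chance: clear r and move on  */
            *hand = (*hand + 1) % n;
        }
    }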

Optimal Page Replacement


Principle: Replace the page that will not be used for the longest time.

LRU Page Replacement


Example VM.5

Virtual Memory

Principle: Replace the page that has not been used for the longest time.

Example VM.4
Memory reference string: 7, 0, 1, 2, 0, 3, 0, 4, 2, 3, 0, 3, 2, 1, 2, 0, 1, 7, 0, 1 Number of frames: 3

Memory reference string: 7, 0, 1, 2, 0, 3, 0, 4, 2, 3, 0, 3, 2, 1, 2, 0, 1, 7, 0, 1 Number of frames: 3

frame contents over time

Figure from [Sil00 p.315]

frame contents over time


Figure from [Sil00 p.315]

Total: 9 page faults.


Slide 453 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis Slide 454 Computer Architecture WS 06/07

Dr.-Ing. Stefan Freinatis

LRU Page Replacement


Possible LRU implementations:
Virtual Memory

LRU Page Replacement


Example for the stack implementation principle
Virtual Memory

Counter Implementation
Every page table entry has a counter field. The system hardware must have a logical counter. With each page access the counter value is copied to the entry.

Update on each page access required Searching the table for finding the LRU page Account for clock overflow

Stack Implementation
Keep a stack containing all page numbers. Each time a page is referenced, its number is searched and moved to the top. The top holds the MRU pages, the bottom holds the LRU pages. bottom of stack

Update on page access required Searching the stack for the current page number
Slide 455 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis Slide 456 Computer Architecture

Figure from [Sil00 p.317] WS 06/07 Dr.-Ing. Stefan Freinatis

LRU Approximation
Not many systems provide sufficient hardware support for true LRU page replacement. Approximate LRU!
Virtual Memory

LRU Approximation
History field examples
Virtual Memory

00000000 11111111 01001000


history field

= Not used for the last 8 time periods = Used in each of the past 8 periods = Used in last period and in the fifth last period

Use reference bit


When looking for LRU page, take a page with r = 0 No ordering among the pages (only used and unused)

History Field
Each page table entry has a history field h (e.g. a byte) When page is accessed, set most significant bit (e.g. bit 7) Periodically (e.g. every 100 ms) shift right the bits When looking for LRU page, take page with smallest unsigned int(h) Better ordering among the pages (256 history values)
Slide 457 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

value (unsigned int)

0101 0111 0110 1011


Slide 458

5 7 6 11

This page will be chosen as victim

Computer Architecture

WS 06/07

Dr.-Ing. Stefan Freinatis
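The history-field approximation (often called aging) can be sketched as follows, assuming an 8-bit history byte per page as on the slide; the array sizes and names are chosen only for the illustration.

    #include <stdint.h>

    #define NPAGES 64

    uint8_t history[NPAGES];   /* one history byte per page     */
    uint8_t r_bit[NPAGES];     /* hardware reference bits       */

    /* Called periodically (e.g. every 100 ms). */
    void age_pages(void)
    {
        for (int i = 0; i < NPAGES; i++) {
            history[i] >>= 1;                       /* shift right                 */
            if (r_bit[i]) history[i] |= 0x80;       /* set most significant bit    */
            r_bit[i] = 0;                           /* clear reference bit         */
        }
    }

    /* The approximated LRU victim is the page with the smallest history value. */
    int aging_victim(void)
    {
        int victim = 0;
        for (int i = 1; i < NPAGES; i++)
            if (history[i] < history[victim]) victim = i;
        return victim;
    }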

Page Replacement
Exemplary page fault rates
Figure: page faults per 1000 references versus the number of frames allocated (6 ... 14), for FIFO, Clock, LRU and Opt.

Page Replacement
Virtual Memory

Algorithms Summary

Virtual Memory

First-in first-out (FIFO)


Simplest algorithm, easy to implement, but has worst performance. The clock version is somewhat better as it does not replace busy pages.

Optimal page replacement (OPT)


Not of practical use as one must know future! Used for comparisons only. Lowest page fault rate of all algorithms.

Least Recently Used (LRU)


The best algorithm usable, but requires much execution time or highly sophisticated hardware.

Figure from lecture slides WS 05/06

LRU Approximations
Slightly worse than LRU, but faster. Applicable in practice.

Differences noticeable only for smaller number of frames


Slide 459 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Slide 460

Computer Architecture

WS 06/07

Dr.-Ing. Stefan Freinatis

Thrashing
Virtual Memory

Thrashing
Countermeasures
Virtual Memory

When the number of allocated frames falls below a certain number of pages actively used by a process, the process will cause page fault after page fault. This high paging activity is called thrashing.
Figure from [Sil00 p.326]

Switching to local page replacement


A thrashing process cannot steal frames from others. The page device queue (used by all) however is still full of requests lowering overall system performance.

Swap out
The thrashing process or some other process can be swapped out for a while. Choice depends on process priorities.

A too high degree of multiprogramming results in thrashing because each process does not have enough frames.

Assign sufficient frames


How many frames are sufficient?

Working-set model: All page references are monitored (online memory reference string creation). The pages recently accessed form the working-set. Its size is used as the number of sufficient frames.

Slide 461

Computer Architecture

WS 06/07

Dr.-Ing. Stefan Freinatis

Slide 462

Computer Architecture

WS 06/07

Dr.-Ing. Stefan Freinatis

Working-Set
Virtual Memory

Program Locality

Virtual Memory

Demand paging is transparent to the user program. A program however can enforce locality (at least for data).
Assume a page size of 128 words and consider the following program which clears the elements of a 128 x 128 matrix.
row column

Figure from [Sil00 p.328]

The working-set model uses a parameter to define the working-set window. The set of pages in defines the working-set WS. The OS allocates to the process enough frames to maintain the size of the working-set. Keeping track of the working set requires the observation of memory accesses (constantly or in time intervals).

int A[][] = new int[128][128]; for (int j = 0; j < 128; j++) for (int i = 0; i < 128; i++) A[i][j] = 0;
Program A Program clearing the matrix elements column-wise.

Slide 463

Computer Architecture

WS 06/07

Dr.-Ing. Stefan Freinatis

Slide 464

Computer Architecture

WS 06/07

Dr.-Ing. Stefan Freinatis

Program Locality
The array is stored in memory row major.
In row major storage, a multidimensional array in linear memory is accessed such that rows are stored one after the other. It is the approach used by C, Java, and many other languages, with the notable exception of Fortran. For example, the matrix

1 2 3
4 5 6

is defined in C as

int A[2][3] = { {1,2,3}, {4,5,6} };

and is stored in memory row-wise (elements 1, 2, 3, 4, 5, 6 from low to high memory addresses).

Slide 465 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Program Locality
Thus, each row of the 128 x 128 matrix occupies one page. If the operating system allocates only one frame (for the data) to process A, the process will cause 128 x 128 = 16384 page faults!
This is because the process clears one word in each page (word j), then the next word, ..., thus jumping from page to page in the inner loop.

for (int j = 0; j < 128; j++)
    for (int i = 0; i < 128; i++)
        A[i][j] = 0;

Slide 466 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Program Locality
By changing the loop order, the process first finishes one page before going to the next.

int A[][] = new int[128][128];
for (int i = 0; i < 128; i++)
    for (int j = 0; j < 128; j++)
        A[i][j] = 0;

Program B
Now, if the operating system allocates only one frame (for the data) to process B, the process will cause only 128 page faults!

Slide 467 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Program Locality
Locality is also influenced by the addressing modes at machine instruction level.
Consider a three-address instruction, such as ADD A,B,C which performs C := A + B. In the worst case the operands A, B, C are located in 3 different pages.
Another example is the PDP-11 instruction MOV @(R1)+,@(R2)+ which in the worst case straddles across 6 pages (PDP-11 addressing mode 3 for the source operand).

Slide 468 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Virtual Memory
Separation logical physical memory
The user / programmer can think of an extremely large virtual address space.

Computer Architecture
Memory (353) Paging (381) Segmentation (400) Paged Segments (412) Virtual Memory (419) Caches (471)

Pure Paging / Paged Segments


Virtual memory can be implemented upon both memory allocation schemes.

Execution of large programs


which do not fit into physical memory in their entirety.

Better multiprogramming
as there can be more programs in memory.

Not suitable for hard real-time systems!


Virtual memory is the antithesis of hard real-time computing. This is because the response times cannot be guaranteed owing to the fact that processes may influence each other (page device queue, thrashing, ...).

Slide 469

Computer Architecture

WS 06/07

Dr.-Ing. Stefan Freinatis

Slide 470

Computer Architecture

WS 06/07

Dr.-Ing. Stefan Freinatis

Memory Hierarchy
The farther away from CPU, the larger and slower the memory. The hierarchy is the consequence of locality.
Caches

Locality Principle
Caches

Programs tend to reuse data and instructions. Rule of thumb:

[HP06 p.38]

A program spends 90% of its execution time in only 10% of the code.
Temporal locality: recently accessed items are likely to be accessed in near future. Spatial locality: items whose addresses are near one another tend to be referenced close together in time.
Slide 472 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Memory hierarchy levels in typical desktop / server computers, figure from [HP06 p.288]
Slide 471 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Locality Principle

Caches
Cache: a safe place for hiding or storing things
Webster's Dictionary [HP06 p. C-1]

Example of a memory-access trace of a process

Here: Fast memory that stores copies of data from the most frequently used main memory locations. Used by the CPU to reduce the average time to access memory locations. Effect: instructions (in execution) can proceed quicker.
Instruction fetch is quicker Memory operands are accessed quicker
from the CPUs point of view

Result: faster program execution improved system performance


Figure from [Sil00 p.327] Slide 473 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis Slide 474 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Cached Memory Access


Steps in accessing memory (here: reading from memory), simplified.
Caches

Caches
To take advantage of spatial locality a cache contains blocks of data rather than individual bytes. A block is a contiguous line of processor words. It is also called a cache line. Common block sizes: 8 ... 128 bytes block transfer

CPU requests content from a memory location Cache is checked for this datum When present, deliver datum from cache When not, transfer datum from main memory to cache Then deliver from cache to CPU

Cache components
Data Area Tag Area word transfer
Slide 475 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis Slide 476 Computer Architecture WS 06/07

Attribute Area
Dr.-Ing. Stefan Freinatis

Data Area
Caches

Tag Area
Caches

All blocks in the cache make up the data area.


Block

The block addresses of the cached blocks make up the tags (1) of the cache lines. All tags form the tag area.
Block

N bytes per block

0 1 2 3 4

N byte per block

0 1 2 3 4

...

...

...
Data area

...

...
Tag area

...
Data area

...

B1

Cache capacity = B N bytes

(1) The statement is slightly simplified. In real caches, often just a fraction of the block address is used as tag.


Slide 477 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis Slide 478 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Attribute Area
Caches

Caches
Each cache line plus its tag plus its attributes forms a slot.
Block / Slot Block

The attribute area contains attribute bits for each cache line.

V D

Validity bit V
V = 1 data is valid V = 0 data is invalid

indicates whether the cache line holds valid data 0 N bytes per block
1 2 3

V D

N bytes per block

...
Attributes

4 Dirty bit D the cache line data is modified ...indicates whether...

Cache slot

...
Attributes

...
Tag area

...
Data area

...

with respect to main memory


Tag area

D = 1 data is modified Data area D = 0 data is not modified

B1

Slide 479

Computer Architecture

WS 06/07

Dr.-Ing. Stefan Freinatis

Slide 480

Computer Architecture

WS 06/07

...
B1
Dr.-Ing. Stefan Freinatis

...
B1
0 1 2 3 4

Caches
How to find a certain byte in the cache? Caches

Block Address
Memory can be considered as an array of blocks.
block address 0 1 2 3 0 4 8 12 16 20 24 28 32 36 4 bytes per block

Caches

The address generated by the CPU is divided into two fields. High order bits make up the block address Low order bits determine the offset within that block
Address layout: the full address is m bits wide; the upper m-n bits form the block address, the lower n bits form the offset.

Memory address (binary) 000000 000100 001000 001100 010000 010100 011000 011100 100000 100100

4 5 6 7 8 9

The block address should not be confused with the memory address at which the block starts. The block address is a block number.
block address = memory address DIV block size

Block address is compared against all tags simultaneously In case of a match (cache hit), the offset selects the byte
Remark: CPU address space = 2^m, cache line size (block size) = 2^n.
Slide 481 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Memory
memory address (decimal)
Slide 482

block address offset

Computer Architecture

WS 06/07

Dr.-Ing. Stefan Freinatis
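Splitting a memory address into block address and offset is one shift and one mask. A minimal C sketch for the example above (4-byte blocks, so n = 2):

    #include <stdint.h>

    #define N 2   /* block size = 2^N = 4 byte */

    uint32_t block_address(uint32_t addr) { return addr >> N; }              /* addr DIV block size */
    uint32_t block_offset(uint32_t addr)  { return addr & ((1u << N) - 1); } /* addr MOD block size */

    /* Example: address 23 -> block address 5, offset 3. */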

Caches
V D Tags Data

Hit Rate
Caches

Cache capacity is smaller than the capacity of main memory. Consequently, not all memory locations can be mirrored in the cache. When a required datum is found in the cache, we have a cache hit, otherwise a cache miss.

...

...

...
The hit rate is the fraction of cache accesses that result in a hit.

Comparator

hit / miss

Data out
Hit rate = (number of hits) / (number of memory accesses)

block address

offset

CPU memory address


Slide 483 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

The miss rate is the fraction of cache accesses that result in a miss.
Slide 484 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Amdahl's Law
The law is a general law, not restricted to caches or computers.
Used to find the maximum expected improvement to an overall system when a part of the system is improved.

I = 1 / ((1 - P) + P/S)

I: maximum expected improvement, I > 0 (usually I > 1)
P: proportion of the system improved, 0 <= P <= 1
S: speedup of that proportion, S > 0, usually S > 1

Slide 485 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Amdahl's Law
Example: 30% of the computations can be made twice as fast. P = 0.3, S = 2.
Improvement I = 1 / ((1 - 0.3) + 0.3/2) = 1 / (0.7 + 0.15) = 1.177

Amdahl's Law in the special case of parallelization:

I = 1 / (F + (1 - F)/N)

F: proportion of sequential calculations (no speedup possible), 0 <= F <= 1
N: grade of parallelism (e.g. N processors), N > 0
See lecture Advanced Computer Architecture.

Slide 486 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis
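A small C helper for Amdahl's law; the same function covers both the 30% example above and the cache example on the next slide (illustrative only):

    #include <stdio.h>

    /* I = 1 / ((1 - P) + P / S) */
    double amdahl(double P, double S) { return 1.0 / ((1.0 - P) + P / S); }

    int main(void)
    {
        printf("%.3f\n", amdahl(0.3, 2.0));     /* the "30% twice as fast" example   */
        printf("%.3f\n", amdahl(0.9, 100.0));   /* the cache example, 90% hit rate   */
        return 0;
    }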

Caches
                      Memory space   Access time
CPU (registers)       500 Byte       250 ps
Cache (SRAM)          64 kB          1 ns
Main memory (DRAM)    1 GB           100 ns
I/O Devices (disks)   1 TB           10 ms

Example: Assume: cache = 1 ns, main memory = 100 ns, 90% hit rate. What is the overall improvement?

P = 0.9, S = 100 ns / 1 ns = 100
I = 1 / ((1 - 0.9) + 0.9/100) ≈ 9.17

Memory accesses (as seen by the CPU) now are more than 9 times as fast as without a cache.

Slide 487 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Read Access
Reading from memory (improvement)
CPU requests datum. Search cache while fetching block from memory. Cache hit: deliver datum, discard fetched block. Cache miss: put block in cache and deliver datum.
In case of a hit, the datum is available quickly. In case of a miss there is no benefit from the cache, but also no harm. Things are not that easy when writing into memory. Let's look at the cases of a write hit and a write miss.

Slide 488 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Write Hit Policy


Write through
The datum is written to both the block in the cache and the block in memory.
CPU Cache

Caches

Write Miss Policy


Assume a write miss. What to do? Caches

Assume a write hit. How to keep cache and main memory consistent on write-accesses?

Memory

Write allocate
The block containing the referenced datum is transferred from main memory to the cache. Then one of the write hit policies is applied. Normally used with write back caches.

Write Buffer

Write back

Cache always clean (no dirty bit required) CPU write stall (problem reduced through write buffer) Main memory always has the most current copy (cache coherency in multi-processor systems)

No-write allocate
Write misses do not affect the cache. Instead the datum is modified only in main memory. Write hits however do affect the cache. Normally used with write through caches.

The datum is only written to the cache (dirty bit is set). The modified block is written to main memory once it is evicted from cache.
Write speed = cache speed Multiple writes to the same block still result in only one write to memory Less memory bandwidth needed
Slide 489 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Slide 490

Computer Architecture

WS 06/07

Dr.-Ing. Stefan Freinatis

Write Miss Policy


Assume an empty cache and the following sequence of memory operations.
WriteMem[100], WriteMem[100], ReadMem[200], WriteMem[200], WriteMem[100]

What are the number of hits and misses when using no-write allocate versus write allocate?

Operation        No-write allocate   Write allocate
WriteMem[100]    miss                miss
WriteMem[100]    miss                hit
ReadMem[200]     miss                miss
WriteMem[200]    hit                 hit
WriteMem[100]    miss                hit

Slide 491 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Caches
Where exactly are the blocks placed in the cache? Cache Organization
What if the cache is full? Replacement Strategies
Figure: cache and memory.

Slide 492 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Cache Organization
Where can a block be placed in the cache? Caches

Direct Mapped
Each memory block is mapped to exactly one slot in the cache (many-to-one mapping).
Memory
0 4 8 12 16 20 24 28

Caches

Direct Mapped
With this mapping scheme a memory block can be placed in only one particular slot. The slot number is calculated from
((memory address) DIV (blocksize)) MOD (slots in cache).

Cache

Slot

0 1 2 3

Fully Associative
The block can be placed in any slot.

Set Associative
The block can be placed in a restricted set of slots. A set is a group of slots. The block is first mapped onto the set and can then be placed anywhere within the set. The set number is calculated from
(memory address) DIV (block size) MOD (number of sets in cache).
Slide 493 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

...

...

Block size = 4 byte Cache capacity = 4 x 4 = 16 byte


If slot occupied (V = 1) evict cache line

memory address (decimal)


Slide 494 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Direct Mapped
Slot = ((memory address) DIV (blocksize)) MOD (slots in cache).
offset within slot = (memory address) MOD (blocksize).

Caches

Direct Mapped
Extracting slot number and offset directly from memory address
block address

Caches

Examples
In which slot goes the block located at address 12D? 12 DIV 4 = 3 3 MOD 4 = 3 (slot 3) In which slot goes the block located at address 20D? 20 DIV 4 = 5 5 MOD 4 = 1 (slot 1) Where goes the byte located at address 23D? 23 DIV 4 = 5 5 MOD 4 = 1 23 MOD 4 = 3
Slide 495

m tag bits slot offset n

MOD 4

m-n

The lower bits of the block address select the slot. The size of the slot field depends on the number of slots (size = ld(number of slots)).
2

Example

ld = logarithmus dualis (base 2)

Where goes the byte located at address 23D?

23D = 1 0 1 1 1B
The byte goes in cache line (slot) 1 at offset 3
Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis Slide 496

slot offset

Slot 1, offset 3
WS 06/07 Dr.-Ing. Stefan Freinatis

Computer Architecture
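The slot, offset and tag can be taken directly from the address bits, as shown above for address 23D. A minimal C sketch for the small example cache (4-byte blocks, 4 slots); the constants are specific to this example:

    #include <stdint.h>

    #define OFFSET_BITS 2   /* block size = 4 byte  */
    #define SLOT_BITS   2   /* 4 slots in the cache */

    uint32_t cache_offset(uint32_t addr) { return addr & 0x3; }                  /* addr MOD 4         */
    uint32_t cache_slot(uint32_t addr)   { return (addr >> OFFSET_BITS) & 0x3; } /* block addr MOD 4   */
    uint32_t cache_tag(uint32_t addr)    { return addr >> (OFFSET_BITS + SLOT_BITS); }

    /* Example: address 23 = 10111B -> tag 1, slot 1, offset 3. */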

Direct Mapped
31 30 29 28

Direct Mapped
Caches
7 6 5 4 3 2 1 0
Word Byte offset offset

Address (showing bit positions) . . . . . . . . 19 18 17 16 15 14 13 12 . . . .

Explanations for previous slide

Caches

Slot
Hit 16 12

Logical address space of CPU: 232 byte


Data

16 bits Valid Tag Data

128 bits

Number of cache slots: 64kB / 16 Byte = 4K = 4096 slots. Bit 0,1 determine the position of the selected byte in a word. However, as the CPU uses 4-byte words as smallest entity, the byte offset is not used. Bit 2,3 determine the position of the word within a cache line.

16k 4k entries

lines

16

32

32

32

32

Bits 4 to 15 (12 bits) determine the slot. 212 = 4K = number of slots.


64 kByte cache using four-word (16 Byte) blocks
MUX 32

Figure from lecture CA WS05/06

Bits 16 to 31 are compared against the tags to see whether or not the block is in the cache.
Slide 498 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Slide 497

Computer Architecture

WS 06/07

Dr.-Ing. Stefan Freinatis

Fully Associative
Caches

A memory block can go in any cache slot (many-to-many).

[Figure: memory blocks (memory addresses 0, 4, 8, ..., 36 in decimal) can be placed in any of the cache slots 0 ... 3 (4 choices).]

Slot selection
Check all tags (preferably simultaneously); take a slot with V = 0 (a free slot), otherwise select a slot according to some replacement strategy.
Slide 499 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Set Associative
Caches

A memory block goes into a set, and can be placed anywhere within the set (many-to-some).

[Figure: memory blocks (memory addresses 0, 4, 8, ..., 36 in decimal) mapped onto a 2-way set associative cache with two sets (0, 1) of two slots each.]

Slot selection
Determine the set from the block address. In this set, take a free slot ... or evict a slot according to some replacement strategy.
Slide 500 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Set Associative
Caches

Set = ((memory address) DIV (block size)) MOD (number of sets in cache)

Example
In which set goes the block located at address 12D? 12 DIV 4 = 3 (block address), 3 MOD 2 = 1 (set 1).
In which slot the block finally goes depends on occupation and replacement strategy.
Slide 501 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Set Associative
Caches

N-way set associative cache
N = number of slots per set, not the number of sets. N is a power of 2; common values are 2, 4, 8.

Extremes
N = 1: There is only one slot per set, that is, each slot is a set. The set number (thus the slot) is drawn from the block address. This is the direct mapped case.
N = B: There is only one set containing all slots (B = number of blocks in cache = number of slots). This is the fully associative case.

Similar to direct mapping, the low order bits of the block address determine the destination set.
[Figure: the m-bit block address is divided into m - n tag bits and an n-bit set field, followed by the byte offset.]
Slide 502 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis
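The set calculation is the same integer arithmetic as for direct mapping, only with the number of sets instead of the number of slots. A short C sketch (not from the original slides; it uses 4-byte blocks and the 2-set cache of slide 500):

    #include <stdio.h>

    #define BLOCK_SIZE 4u
    #define NUM_SETS   2u     /* 2-way set associative cache with 4 slots -> 2 sets */

    int main(void)
    {
        unsigned addr  = 12u;
        unsigned block = addr / BLOCK_SIZE;   /* block address = 12 DIV 4 = 3 */
        unsigned set   = block % NUM_SETS;    /* set           = 3 MOD 2  = 1 */

        printf("address %u -> block address %u, set %u\n", addr, block, set);
        /* Which of the slots inside set 1 is finally used depends on occupation
           and on the replacement strategy. */
        return 0;
    }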

Set Associative
Caches

Opteron cache
Cache capacity: 64 kB in 64-byte blocks (1024 slots). The cache is two-way set associative: 512 sets x 2 cache lines.
Hardware: two arrays with 512 cache lines each, that is, each set has one cache line in array 1 and one in array 2.

AMD Opteron Cache
Caches

Two-way set associative
The physical address is 40 bits. The address is divided into a 34-bit block address (subdivided into 25 tag bits and 9 index bits) and a 6-bit byte offset.
Figure: The index selects the set (2^9 = 512). The two tags of the set are compared against the tag bits. The valid bit must be set for a hit. On a hit, the corresponding data is delivered using the winning input from a 2:1 multiplexer. The data goes to "Data in" of the CPU. The victim buffer is needed when a cache line has to be written back to main memory (replacement).
Figure from [HP06 p. C-13] Slide 504 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis
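The 40-bit address split described above can be expressed with shifts and masks. A hedged C sketch (the address value is made up for illustration; only the field widths 25/9/6 come from the slide):

    #include <stdio.h>
    #include <stdint.h>
    #include <inttypes.h>

    int main(void)
    {
        uint64_t paddr  = 0x12345678ABULL & ((1ULL << 40) - 1);  /* some 40-bit physical address */
        uint64_t offset = paddr & 0x3F;            /* bits 0..5  : 6-bit byte offset */
        uint64_t index  = (paddr >> 6) & 0x1FF;    /* bits 6..14 : 9-bit set index   */
        uint64_t tag    = paddr >> 15;             /* bits 15..39: 25-bit tag        */

        printf("tag 0x%" PRIx64 ", set %" PRIu64 ", offset %" PRIu64 "\n", tag, index, offset);
        /* The index selects one of the 512 sets; both tags of that set are
           then compared against the 25-bit tag of the address. */
        return 0;
    }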

Cache Organization
Caches

Where can block 12 go?
[Figure: placement of memory block 12 in a cache with 8 slots, shown for a fully associative, a direct mapped and a set associative organization. Figure from [HP06 p.C-7]]
Slide 505 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Cache Organization
Caches

For the previous figure, assume block 12 and block 20 being used very often. What is the problem?

Fully associative: No special problem. Both blocks can be stored in the cache at the same time.
Direct mapped: Problem! Only one of them can be stored at the same time since both map to the same slot: 12 mod 8 = 20 mod 8 = 4.
Set associative: No special problem. Both blocks can be stored in the same set at the same time.
Slide 506 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Cache Organization
Caches

Direct Mapped
Hard allocation (no choice). Simple & inexpensive. No replacement strategy required. If a process uses 2 blocks mapping to the same slot, cache misses are high.

Fully Associative
Full choice. Expensive searching (hardware) for a free slot. Replacement strategy required.

Set Associative
Compromise between direct mapped and fully associative. Some choice. Replacement strategy required.
Slide 507 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Replacement Strategies
Caches

Strategies for selecting a slot to evict (when necessary):

Random
Victim cache lines are selected randomly. A hardware pseudo-random number generator generates slot numbers.

Least-Recently Used (LRU)
Relies on the temporal locality principle. The least recently used block is hoped to have the smallest likelihood of (re)usage. Expensive hardware.

First-In, First-Out (FIFO)
Approximation of LRU by selecting the oldest block (oldest = load time).
Slide 508 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis
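In hardware, LRU is tracked with a few status bits per set. The following C sketch is purely illustrative (not how any particular cache implements it): it simulates LRU victim selection for one 4-way set by remembering the last access time of each slot.

    #include <stdio.h>

    #define WAYS 4

    static unsigned last_used[WAYS];   /* pseudo time stamp of last access per slot */
    static int      valid[WAYS];       /* V bit per slot                            */
    static unsigned now;               /* global access counter (the "clock")       */

    /* Returns the slot to (re)use: a free slot if one exists, otherwise the LRU slot. */
    static int select_victim(void)
    {
        int victim = 0;
        for (int i = 0; i < WAYS; i++)
            if (!valid[i]) return i;                           /* free slot: no eviction needed */
        for (int i = 1; i < WAYS; i++)
            if (last_used[i] < last_used[victim]) victim = i;  /* oldest access wins */
        return victim;
    }

    static void access_slot(int slot)
    {
        valid[slot] = 1;
        last_used[slot] = ++now;
    }

    int main(void)
    {
        access_slot(0); access_slot(1); access_slot(2); access_slot(3);
        access_slot(1);                                   /* slot 1 becomes most recently used */
        printf("victim slot: %d\n", select_victim());     /* slot 0 is least recently used */
        return 0;
    }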

Replacement Strategies
Caches

Table: Data cache misses per 1000 instructions

                Two-way                  Four-way                 Eight-way
Capacity    LRU    Rand   FIFO       LRU    Rand   FIFO       LRU    Rand   FIFO
16 kB       114.1  117.3  115.5      111.7  115.1  113.3      109.0  111.8  110.4
64 kB       103.4  104.3  103.9      102.4  102.3  103.1      99.7   100.5  100.3
256 kB      92.2   92.1   92.5       92.1   92.1   92.5       92.1   92.1   92.5

Data collected for Alpha architecture, block size = 64 byte. Data from [HP06 p.C-10]

LRU is best for small caches; there is little difference between all strategies for large caches.
Slide 509 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Miss Categories
Caches

Compulsory Misses
The very first access to a block cannot be in the cache, so the block must be loaded. Also called cold-start misses.

Capacity Misses
Owing to the limited capacity of the cache, capacity misses will occur in addition to compulsory misses.

Conflict Misses
In set associative or direct mapped caches too many blocks may map to the same set (or slot). Also called collision misses.

Coherency Misses
are owing to cache flushes to keep multiple caches coherent in a multiprocessor. Not considered in this lecture (see lecture Advanced Computer Architecture).
Slide 510 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Cache Optimization
Caches

Average memory access time = hit time + miss rate × miss penalty

Reducing miss rate
larger block size, larger cache capacity, higher associativity

Reducing miss penalty
multi-level caches, read over write

Reducing hit time
avoiding address translation
Slide 511 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Block Size
Caches

The data area gets larger cache lines (but fewer lines); the overall cache capacity remains the same.

Reduced miss rate, taking advantage of spatial locality
more accesses will likely go to the same block

Increased miss penalty
more bytes have to be fetched from main memory

Increased conflict misses
the cache has fewer slots (per set)

Increased capacity misses
only for small caches. In case of high locality (e.g. repeated access to only one byte in a block) the remaining bytes are unused and waste cache capacity.

Common block sizes are 32 ... 128 bytes.
Slide 512 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Block Size
Caches

[Figure: miss rate versus block size for different cache capacities, from HP06 p.C-26]
Slide 513 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Block Size
Caches

For the previous figure, assume the memory system takes 80 clock cycles of overhead and then delivers 16 bytes every 2 clock cycles. Assume the hit time to be 1 clock cycle independent of block size. Which block size has the smallest average memory access time?

Average memory access time = hit time + miss rate × miss penalty

4K cache, 16 byte block:
Average memory access time = 1 + (8.57 % × 82) = 8.027 clock cycles
4K cache, 32 byte block:
Average memory access time = 1 + (7.24 % × 84) = 7.082 clock cycles
... and so on for all cache sizes and block sizes.
Slide 514 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis
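The arithmetic of this example can be written down directly. A small C sketch (not part of the slides; the two miss rates are the values quoted above for the 4K cache, the remaining values would come from the figure):

    #include <stdio.h>

    /* miss penalty = 80 cycles overhead + 2 cycles per 16 delivered bytes */
    static double miss_penalty(double block_size)
    {
        return 80.0 + 2.0 * (block_size / 16.0);
    }

    int main(void)
    {
        double hit_time     = 1.0;                   /* clock cycles, independent of block size */
        double block_size[] = { 16.0, 32.0 };
        double miss_rate[]  = { 0.0857, 0.0724 };    /* 8.57 % and 7.24 % for the 4K cache */

        for (int i = 0; i < 2; i++) {
            double amat = hit_time + miss_rate[i] * miss_penalty(block_size[i]);
            printf("4K cache, %2.0f byte block: AMAT = %.3f clock cycles\n", block_size[i], amat);
        }
        return 0;   /* prints 8.027 and 7.082 as on the slide */
    }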

Block Size
Caches

Average memory access time (in clock cycles) versus block size for 4 different cache capacities. The best (smallest) access time per column (thus per cache) was marked green on the slide: 7.082 for the 4K cache (32-byte blocks) and 3.323, 1.933 and 1.449 for the 16K, 64K and 256K caches (64-byte blocks).

Block size   Miss penalty   4K       16K     64K     256K
16           82             8.027    4.231   2.673   1.894
32           84             7.082    3.411   2.134   1.588
64           88             7.160    3.323   1.933   1.449
128          96             8.469    3.659   1.979   1.470
256          112            11.651   4.685   2.288   1.549

Slide 515 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Cache Capacity
Caches

The cache is enlarged by adding more cache slots.

Reduced miss rate
owing to less capacity misses

Potentially increased hit time
owing to increased complexity

Increased hardware & power consumption

Miss rates for block size 64 bytes (the percentage in parentheses gives the miss rate relative to the next smaller capacity):

Cache capacity   4K       16K             64K             256K
Miss rate        7.00 %   2.64 % (38 %)   1.06 % (40 %)   0.51 % (48 %)

Slide 516 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Associativity
Caches

The higher the associativity, the more slots per set.

Reduced miss rate
primarily owing to less conflict misses

Increased hit time
time needed for finding a free slot in the set

Rules of Thumb
Eight-way set associative is almost as effective as fully associative. A direct mapped cache with capacity N has about the same miss rate as a two-way set associative cache of capacity N/2.
Common associativities are 1 (direct mapped), 2, 4, 8.
Slide 517 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Associativity
Caches

Miss rate [%] versus degree of associativity:

Degree    4 K cache   8 K cache   16 K cache   64 K cache   128 K cache   512 K cache
1-way     9.8         6.8         4.9          3.7          2.1           0.8
2-way     7.6         4.9         4.1          3.1          1.9           0.7
4-way     7.1         4.4         4.1          3.0          1.9           0.6
8-way     7.1         4.4         4.1          2.9          1.9           0.6

Data from [HP06 p.C-23]
Slide 518 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Associativity
Caches

[Figure: associativity versus hit time]
Slide 519 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Multi-Level Caches
Building a cache hierarchy.
Caches

First-Level Cache (L1)
small high-speed cache, usually located in the CPU

Second-Level Cache (L2)
fast and bigger cache located close to the CPU (chip set)

Third-Level Cache (L3), optional
separate memory chip between L2 and main memory

[Figure: CPU - L1 - L2 - L3 - Main Memory]
Slide 520 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Multi-Level Caches
Caches

Multi-level caches reduce the average miss penalty because on a miss the block can be fetched from the next higher level instead of from main memory.

Distinction between local and global cache considerations:

Local miss rate = (number of cache misses) / (number of cache accesses)
local to a cache (e.g. L1, L2, ...)

Global miss rate = (number of cache misses) / (number of memory references by CPU)
local misses versus global references
Slide 521 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Multi-Level Caches
Caches

Example CH.1: Suppose that in 1000 memory references there are 40 misses in L1 and 20 misses in L2. What are the various miss rates?

Local miss rate L1 = global miss rate L1 = 40 / 1000 = 4 %   (these 4 % go from L1 to L2)
Local miss rate L2 = 20 / 40 = 50 %
Global miss rate L2 = 20 / 1000 = 2 %   (these 2 % go from L2 to main memory)

[Figure: CPU - L1 - L2 - Main Memory]

The local miss rate of L2 is large because L1 skims the cream of memory accesses.
Slide 522 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis
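The same bookkeeping as a short C snippet (not from the slides; the numbers are taken from example CH.1):

    #include <stdio.h>

    int main(void)
    {
        double references = 1000.0;   /* memory references issued by the CPU            */
        double misses_l1  = 40.0;     /* misses in L1 (these accesses go on to L2)       */
        double misses_l2  = 20.0;     /* misses in L2 (these accesses go to main memory) */

        double local_l1  = misses_l1 / references;   /* = global miss rate L1 = 4 %      */
        double local_l2  = misses_l2 / misses_l1;    /* = 50 %, L2 only sees L1 misses   */
        double global_l2 = misses_l2 / references;   /* = 2 % of all CPU references      */

        printf("local/global L1: %.1f %%, local L2: %.1f %%, global L2: %.1f %%\n",
               100.0 * local_l1, 100.0 * local_l2, 100.0 * global_l2);
        return 0;
    }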

Multi-Level Caches
Caches

Average memory access time = hit time L1 + miss rate L1 × miss penalty L1

miss penalty L1 = hit time L2 + local miss rate L2 × miss penalty L2

Average memory access time
= hit time L1 + miss rate L1 × (hit time L2 + local miss rate L2 × miss penalty L2)

[Figure: CPU - L1 - L2 - Main Memory]
Slide 523 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Multi-Level Caches
Caches

Using the miss rates from example CH.1, and the following data: hit time L1 = 1 clock cycle, hit time L2 = 10 clock cycles, miss penalty L2 = 200 clock cycles, the average memory access time is

= 1 + 0.04 × (10 + 0.5 × 200) = 5.4 clock cycles

(local miss rate L2 = 50 % = 0.5)
Slide 524 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis
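Plugging the numbers into the two-level formula, as a small C check (values from example CH.1 and the slide above):

    #include <stdio.h>

    int main(void)
    {
        double hit_time_l1     = 1.0;    /* clock cycles */
        double hit_time_l2     = 10.0;
        double miss_penalty_l2 = 200.0;
        double miss_rate_l1    = 0.04;   /* 4 %  (global = local for L1)          */
        double local_miss_l2   = 0.50;   /* 50 % (20 misses out of 40 L2 accesses) */

        double miss_penalty_l1 = hit_time_l2 + local_miss_l2 * miss_penalty_l2;  /* 110 cycles */
        double amat = hit_time_l1 + miss_rate_l1 * miss_penalty_l1;              /* 5.4 cycles */

        printf("miss penalty L1 = %.1f, average memory access time = %.1f clock cycles\n",
               miss_penalty_l1, amat);
        return 0;
    }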

Read over Write
Caches

Giving reads priority over writes

Assume a direct mapped write-through cache with 512 slots, and a four-word write buffer that is not checked on a read miss.

SW R3, 512(R0)    ; mem[512] := R3    (cache slot 0)    store word
LW R1, 1024(R0)   ; R1 := mem[1024]   (cache slot 0)    load word
LW R2, 512(R0)    ; R2 := mem[512]    (cache slot 0)    load word

Read-after-write hazard: The data in R3 is placed in the write buffer. The first load (LW R1) causes a read miss and the cache line holding mem[512] is discarded. The second load (LW R2) again causes a read miss. If the write buffer has not completed writing R3 into memory, LW R2 will read an incorrect value from mem[512].

[Figure: CPU - Cache - Write Buffer - Memory]
Slide 525 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Read over Write
Caches

Solutions to previous problem

Read misses wait until the write buffer is empty, and thereafter the required memory block is fetched into the cache.

Check the contents of the write buffer; if the referenced data is not in the buffer, let the read access continue fetching the block into the cache. The write buffer is flushed later when the memory system is available.

Also applicable to write-back caches. The dirty block is put into a write buffer that allows inspection in case of a read miss. Read misses check the buffer before directly going to memory.
Slide 526 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis
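The second solution (checking the write buffer on a read miss) can be sketched in C. This is an illustrative model only, not the hardware of any particular cache; buffer size and helper names are made up:

    #include <stdio.h>
    #include <stdint.h>

    #define BUFFER_ENTRIES 4     /* four-word write buffer, as in the example */

    struct wb_entry { int valid; uint32_t addr; uint32_t data; };
    static struct wb_entry write_buffer[BUFFER_ENTRIES];

    /* On a read miss: return the data from the write buffer if the address is
       still pending there, otherwise fall back to (simulated) main memory. */
    static uint32_t read_on_miss(uint32_t addr, uint32_t memory_value)
    {
        for (int i = 0; i < BUFFER_ENTRIES; i++)
            if (write_buffer[i].valid && write_buffer[i].addr == addr)
                return write_buffer[i].data;   /* forward the not-yet-written value */
        return memory_value;                   /* safe: no pending write to this address */
    }

    int main(void)
    {
        /* SW R3, 512(R0): the store is still sitting in the write buffer */
        write_buffer[0] = (struct wb_entry){ 1, 512u, 0xABCD };

        /* LW R2, 512(R0) misses; without the check it would read stale memory */
        printf("read mem[512] -> 0x%X\n", (unsigned)read_on_miss(512u, 0x0000));
        return 0;
    }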

Address Translation
Caches

What addresses are cached, virtual or physical addresses?

A fully virtual cache uses logical addresses only, a fully physical cache uses physical addresses only.

[Figure, fully virtual cache: the CPU issues a virtual address which indexes the virtual cache directly; on a miss the virtual address is translated (segment tables / page tables / TLB) into a physical address for accessing memory.]

[Figure, fully physical cache: the virtual address issued by the CPU is first translated into a physical address, which then indexes the physical cache and, on a miss, memory.]
Slide 527 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Address Translation
Caches

Fully virtual cache:
+ No address translation time on a hit
- Cache must have copies of protection information
  (protection info must be fetched from page/segment tables)
- Cache flush on processor switch
  (individual virtual addresses usually refer to different physical addresses)
- Shared memory: different virtual addresses refer to the same physical address,
  giving copies of the same data in the cache

Fully physical cache:
+ Very well suited for shared memory accesses
- Always address translation (time)
- Hits are of no advantage regarding address translation
Slide 528 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Address Translation
Caches

Solution: get the best from both virtual and physical caches.

Two issues in accessing a cache:

Indexing the cache
that is, calculating the target set (or slot with direct mapping)

Comparing tags
comparing the tag field with (parts of) the block address

This leads to the virtually indexed, physically tagged cache.
Slide 529 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Address Translation
Caches

Virtually indexed, physically tagged cache

[Figure: the CPU issues a virtual address consisting of page number and page offset. The page offset (word offset) indexes the cache (tag and data arrays). In parallel, the page number is translated (TLB, page table) into the frame address; frame address and offset form the physical address, whose frame part is compared against the cache tags. On a miss, the physical address goes to the next memory level.]

The page offset (the part that is identical in both virtual and physical address space) is used to index the cache. In parallel, the virtual part of the address is translated into the physical address and used for tag comparison. Improved hit time.
Slide 530 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis
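A tiny C illustration of the idea (assumed parameters, not from the slides: 4 kB pages, so the lower 12 address bits are the page offset; 16-byte blocks and 256 sets, so the cache index fits entirely into the untranslated page offset while the page number would go through the TLB):

    #include <stdio.h>
    #include <stdint.h>

    #define PAGE_OFFSET_BITS  12u    /* assumption: 4 kB pages                          */
    #define BLOCK_OFFSET_BITS 4u     /* assumption: 16-byte cache blocks                */
    #define INDEX_BITS        8u     /* assumption: 256 sets, fits into the page offset */

    int main(void)
    {
        uint32_t vaddr       = 0x12345ABCu;
        uint32_t page_offset = vaddr & ((1u << PAGE_OFFSET_BITS) - 1u);
        uint32_t page_number = vaddr >> PAGE_OFFSET_BITS;     /* goes to the TLB in parallel */
        uint32_t index       = (page_offset >> BLOCK_OFFSET_BITS) & ((1u << INDEX_BITS) - 1u);

        /* The cache set can be read while the TLB translates page_number into the
           frame address; the frame address is then compared against the stored tags. */
        printf("page number 0x%X, page offset 0x%X, cache index %u\n",
               (unsigned)page_number, (unsigned)page_offset, (unsigned)index);
        return 0;
    }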

Cache Optimization
Caches

Summary of basic cache optimizations

Technique               Hit time   Miss penalty   Miss rate   Complexity   Comment
Larger block size                  -              +           0            Trivial
Larger cache capacity   -                         +           1            Widely used for L2
Higher associativity    -                         +           1            Widely used
Multi-level cache                  +                          2            Costly hardware; harder if L1 block size ≠ L2 block size
Read over write                    +                          1            Widely used
Address translation     +                                     1            Widely used

+ = improves a factor, - = hurts a factor, blank = no impact
Data from [HP06 p.C-39]
Slide 531 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Exam Computer Architecture (CA)

Date: 09.03.2007 (March 9th, 2007)
Time: 8:30 hrs
Location: ST 025/118
Duisburg - Ruhrort!
Slide 532 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Computer Architecture
Slide 533 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis
