01 - Introduction PDF

Introduction /
Performance / Machine
Basics
CS270 Unit 1
Max Luttrell, Spring 2017
"Welcome to the machine"
- Pink Floyd, 1975
levels of
computer usage
example user typical task
higher level "can you use your phone

grandma
to check traffic for me?"
"use a smartphone to
dad
check traffic"
"write a smartphone app in
application programmer java which provides traffic
predictions"
"write MIPS assembly
lower level cs270 student language code to sort an
array of integers"
levels of
program code
• Application software
• written in a high level language (e.g.
C/C++, java, swift, etc.)
• Systems software
• compiler - translate C++ code into
target processor's machine code
• operating system
• handle input/output (I/O)
• manage memory and storage
• schedule tasks
• share resources
• Hardware
• processor, memory, I/O controllers
compiler: translates high-level
language into assembly language
assembler: translates assembly

language into binary machine language
why use a high level
language?
• closer to natural language / algebra
• can be tailored to its typical usage, e.g. fortran for scientific

computation, cobol for business applications
• improves programmer productivity
• fewer lines = fewer opportunities for error, easier to debug
• closer to natural language than assembly, easier to understand/

debug
• platform independent
• the same C program can be compiled for different target

architectures, e.g. MIPS, ARM, Intel x86
performance
Which airplane has the best performance?
Boeing 777 Boeing 777
BAC/Sud BAC/Sud
Concorde Concorde
Douglas Douglas DC-
DC-8-50 8-50
0 100 200 300 400 500 0 2000 4000 6000 8000 10000
Passenger Capacity Cruising Range (miles)
BAC/Sud BAC/Sud
Concorde Concorde
Douglas Douglas DC-
DC-8-50 8-50
0 500 1000 1500 0 100000 200000 300000 400000
Cruising Speed (mph) Passengers x mph

what is high performance?
which is best
scenario
performance
run same program on two whichever one gets the

different desktop program done first (best
computers execution time)
datacenter with multiple whichever one completes

computers running jobs the most jobs in a given
submitted by many users day (best throughput)
relative performance
• define performance = 1 / execution time
• X is n times faster than Y
performancex / performancey =
execution timey / execution timex = n
• Example: execution time to run a program
• 10s on computer A, 15s on computer B
• execution timeB / execution timeA = 15s / 10s = 1.5
• Thus, A is 1.5 times faster than B

measuring execution time
• total response time
• more than just computation, also includes I/O,

OS time, idle time
• CPU time for a given job
• CPU time plus OS time
• discounts I/O time and idle time

CPU clocks
• operation of digital hardware is governed by a constant rate-clock
• Clock period = duration of one clock cycle
A. e.g. 0.1 s
-12
B. e.g. 250ps = 0.25ns = 250*10 s
• Clock frequency = clock cycles per second = 1 / clock period
A. e.g. 10 Hz
9
B. e.g. 4.0GHz = 4000MHz = 4.0*10 Hz
CPU time
• CPU Time = CPU Clock Cycles * Clock Cycle Period =
CPU Clock Cycles / Clock Rate
• Performance improved by:
• reduce number of clock cycles needed
• increase clock rate
• HW designer must often trade off clock rate against

cycle count
CPU time example
• Computer A: 2GHz clock, 10s CPU time for our program
• Designing computer B
• Aiming for 6s CPU time for our program
• We can do a faster clock for computer B, but this will cause

us to need 1.2x the clock cycles we needed on computer A
• How fast must computer B's clock be?

Clock CyclesB 1.2 × Clock CyclesA
Clock RateB = =
CPU TimeB 6s
Clock CyclesA = CPU Time A × Clock Rate A
= 10s × 2GHz = 20 × 10 9
1.2 × 20 × 10 9 24 × 10 9
Clock RateB = = = 4GHz
6s 6s
instruction count / CPI
Clock Cycles = Instruction Count × Cycles per Instruction
CPU Time = Instruction Count × CPI × Clock Cycle Time
Instruction Count × CPI
=
Clock Rate
• instruction count
• determined by program, Instruction Set Architecture (ISA), and compiler
• average cycles per instruction (CPI)
• determined by CPU hardware
• different instructions can have different CPI
• average CPI affected by instruction mix

CPI example
• Computer A: Cycle Time = 250ps, CPI = 2.0
• Computer B: Cycle Time = 500ps, CPI = 1.2
• Both computers have the same Instruction Set Architecture (ISA)
• For a given program (with a fixed instruction count I), which

computer is faster?
CPU Time = Instruction Count × CPI × Cycle Time
A A A
= I × 2.0 × 250ps = I × 500ps
CPU Time = Instruction Count × CPI × Cycle Time
B B B
= I × 1.2 × 500ps = I × 600ps
CPU Time
B = I × 600ps = 1.2
CPU Time I × 500ps
A
another CPI example
• we have two compiled code sequences using
instructions in classes A, B, C, each with different CPI's
class A B C
CPI for class 1 2 3
class A B C
IC in code sequence 1 2 1 2
IC in code sequence 2 4 1 1
Sequence 1: IC = 5 Sequence 2: IC = 6
Clock Cycles  Clock Cycles 
= 2×1 + 1×2 + 2×3  = 4×1 + 1×2 + 1×3 
= 10 =9
Avg. CPI = 10/5 = 2.0 Avg. CPI = 9/6 = 1.5
performance optimization
• common case
• optimize what's used most commonly during

program execution
• typically, this is a small fraction of overall code

Amdahl's law
Taffected
Timproved = + Tunaffected
improvemen t factor
• Given: a program runs in 100 seconds on our computer. 80
seconds of this time is spent doing multiplication. 20 seconds is
spent doing other things.
• Assume we can get our hardware designer to speed up

multiplication on our computer.
• Question: what improvement factor for multiplication do we need if

we want our program to run:
A. two times faster?
B. five times faster?

performance optimization
optimization type example
running two errands: I run

parallel one errand, you run the
other errand
laundry: move first load to
pipeline dryer, immediately start
second load in washer
travel: research trip
speculative execution itinerary before you're sure
you will go
cache: cache memory

use several memory levels
faster than main memory
number representation
• decimal - base 10 - humans
• each digit represents a power of 10
• e.g. 1979decimal = 1*1000 + 9*100 + 7*10 + 1*9
• binary - base 2 - computers
• each digit represents a power of 2
• e.g. 1011binary = 1*8 + 0*4 + 1*2 + 1*1 = 11decimal

number representation
position values examples
base
(b) in in
exponential decimal value base b decimal position values decimal
10 103 102 10 1 1000 100 10 1 1234 1*1000+2*100+3*10+4*1 1234
2 23 22 2 1 8 4 2 1 1011 1*8 + 0*4 + 1*2 + 1*1 11
4 43 42 4 1 64 16 4 1 3203 3*64 + 2*16 + 0*4 + 3*1 227
8 83 82 8 1 512 64 8 1 2715 2*512+ 7*64 + 1*8 + 5*1 1485
16 163 162 16 1 4096 256 16 1 2AC8 2*4096+10*256+12*16+8*1 10952
base 16: hexadecimal (aka hex) uses digits 0-9, then A-F
base conversion
• e.g. convert octal 437 to hexadecimal
• idea: convert octal to decimal, then decimal to

hex
• 437octal = 4*64 + 3*8 + 7 = 287decimal
• 287decimal contains 256 (16*16). 287-256 = 31
• 31 contains 16. 31-16 = 15
• answer: 11Fhex, commonly written 0x11F

base conversion
• an easier way to convert base for powers of 2 is
to express in binary first. for our example:
octal 4 3 7
binary 1 0 0 0 1 1 1 1 1
hex 1 1 F
binary 1 0 0 0 1 1 1 1 1
signed integers
• signed integer can be negative
• unsigned integer can not be negative

• sign/magnitude: use the first bit to indicate sign. 0-
pos; 1-neg. then use the rest of the bits for magnitude.
• example: we have 4 bit numbers. 6 is 0110. -6 is 1110.

two's complement
• instead of sign-magnitude, modern computers use two's
complement
• idea:
• first bit still indicates sign, 0-pos and 1-neg
• to change sign, flip every bit and then add 1 (two’s

complement operation)
• example: convert 8 bit number for positive 23 to negative -23
00010111 11101000 11101001

23 flip +1 -23
interpreting binary
• example: the following binary contains an
unsigned 4-bit int, then a signed 8-bit int, then a
signed, 2’s complement 16-bit int. Convert each
to decimal
• 1001001010011111111111101001
value two's complement decimal total
1001 n/a 8+1 9
00101001 n/a 32+8+1 41
1111111111101001 0000000000010111 16+4+2+1 -23
sign extension
• converting shorter ints to longer ints
• for unsigned, just pad with zeros
• ex: 4-bit 5 is 0101. 8-bit 5 is 00000101.
• for signed two's complement, just pad with

most-significant-bit (MSB)
• ex: 4 bit -7 is 1001. 8-bit -7 is 11111001

adding/subtracting two's
complement
• addition: do bitwise addition
• e.g. 4 bit values
• 3decimal = 0011. 2decimal = 0010. 3+2 = 5 = 0101
• subtraction: take two's complement of number

being subtracted, and then do addition
• 3decimal = 0011. 2decimal = 0010. 3 - 2 = 3+(-2) =

0011 + 1110 = 0001 = 1decimal
overflow
• overflow: not enough bits to store the result
• e.g. if we try signed two’s complement 4-bit 7+3

= 0111+0011 = 1010 (note sign is wrong!)
• can detect overflow if two operands are like-

signed, and the sign of result is different
interpreting data
• what's this?
• 0001001010011111
• could be anything!
• 16 bit unsigned int
• 16 bit signed int
• float
• an instruction for a computer to execute

where is data stored?
memory type
slower,
cheaper, backup
(e.g. cloud, tape backup)
larger
external storage
(e.g. hard drive, flash drive)
random access memory

(RAM)
faster, more
expensive, registers
smaller
typically, data must be in a register to run an instruction on it
running a program
• a program's instructions are typically stored
consecutively in RAM
• typical machine uses two special registers to

execute them
Instruction Register
holds current instruction
(IR)
Program Counter holds address of next

(PC) instruction to execute
fetch / decode / execute
cycle
• Prior to running program, load PC
with address of its first instruction.
IR PC
Then repeatedly:
• Fetch
• load IR with next instruction
• increment PC
• Decode
• determine what instruction
does and where operands are
• Execute
• perform the instruction
types of instructions -
arithmetic
• arithmetic instruction includes:
• opcode operation to perform, e.g. add,

multiply
• operands - what data to perform operation on
• where to place the result

branch
• branch instruction includes:
• under what conditions to branch
• address of next instruction to execute if we

are to branch
halt
• halt: stop the machine
• halt instruction: normal termination
• or, error condition encountered

Instruction Set Architecture
(ISA)
• ISA: set of instructions and machine capabilities
• interface between hardware and lowest-level software

type description example
postfix calculator,
stack-based uses stack for data e.g. 3 4 + to
calculate 3+4
implicitly uses one
accumulator-
accumulator (or a select Simple Machine
based
few) for data
many general purpose
general- registers; info about all MIPS, ARM, Intel
register operands contained in x86
each instruction
general-register
architecture types
• CISC vs. RISC
• Reduced Instruction Set Computer -

instructions are simple. Most modern
architectures are RISC
• Complex Instruction Set Computer - more

complex instructions available. Intel x86 is
CISC but with some RISC added
addressing modes
• where are the operands?
mode description example
instruction requires both add two registers,

register source operands and store result in
destination in registers another register
one operand is in a add a register to a

mixed / register, the other operand constant, store
constant is a constant contained in result in register
instruction
MIPS mult
instruction places
one or more operands is
implicit results of multiply
implicit in the instruction
in lo and hi
registers
memory interface
• we need to be able to read from
memory and write to memory
MBR MAR
• memory interface: array of bytes
with two special registers
• Memory Address Register
(MAR) - address of next
memory reference
• Memory Buffer Register
RAM
(MBR) - storage location for
data being read from or
written to memory
memory cycle
• how to write to memory:
MBR MAR • access the MAR
• place contents of MBR at
memory[MAR]
• how to read from memory:

RAM • access the MAR
• retrieve contents of
memory[MAR] and put in
MBR
register transfer language/
notation (RTL/RTN)
• all internal operations in a CPU can be described as
sequence of operations on, and transfers between registers,
and memory cycles.
• the RTL (or RTN) contains:
• operations
• operands
• ability to specify specific bits in registers
• programming constructs for CPU's logic capabilities

•
Exercise 1A
Make sure you can connect to our course Canvas page and view homework
#1.
• http://ccsf.instructure.com
• Make sure you can connect between our hills linux server and the Batmale
413 linux workstations, using the syllabus as a guide and for password /
login information.
• If you're on a linux workstation, demonstrate an ssh connection to hills.
Replace “uname” with your username.
ssh uname@hills.ccsf.edu
• If you're on your own laptop, demonstrate a terminal window logged into
hills, and from that terminal window, ssh to a linux workstation. The IP
range you can use is anything between 147.144.23.31-147.144.23.58.
ssh uname@147.144.23.37
• For Mac laptops, ssh is builtin to the terminal window. For Windows
laptops, you will need to download a terminal client like Putty or
SSHSecureShellClient.exe. See the computer systems guide on my web
page, http://fog.ccsf.edu/~mluttrel
Exercise 1B
• Linux basics and simple machine intro
http://fog.ccsf.edu/~mluttrel/cs270/exercises/ex1b.html

01 - Introduction PDF

Загружено:

Сведения о документе

Оригинальное название

Авторское право

Доступные форматы

Поделиться этим документом

Поделиться или встроить документ

Параметры публикации

Этот документ был вам полезен?

Это неприемлемый материал?

Авторское право:

Доступные форматы

01 - Introduction PDF

Загружено:

Авторское право:

Доступные форматы

Introduction /

higher level "can you use your phone

assembler: translates assembly

• can be tailored to its typical usage, e.g. fortran for scientific

• improves programmer productivity

• fewer lines = fewer opportunities for error, easier to debug

• closer to natural language than assembly, easier to understand/

• the same C program can be compiled for different target

Boeing 747 Boeing 747

Passenger Capacity Cruising Range (miles)

Boeing 777 Boeing 777

Boeing 747 Boeing 747

0 500 1000 1500 0 100000 200000 300000 400000

Cruising Speed (mph) Passengers x mph

run same program on two whichever one gets the

datacenter with multiple whichever one completes

• X is n times faster than Y

execution timey / execution timex = n

• Example: execution time to run a program

• 10s on computer A, 15s on computer B

• execution timeB / execution timeA = 15s / 10s = 1.5

• Thus, A is 1.5 times faster than B

• more than just computation, also includes I/O,

• CPU time for a given job

• CPU time plus OS time

• discounts I/O time and idle time

• Clock period = duration of one clock cycle

• Clock frequency = clock cycles per second = 1 / clock period

CPU Clock Cycles / Clock Rate

• Performance improved by:

• reduce number of clock cycles needed

• increase clock rate

• HW designer must often trade off clock rate against

• Aiming for 6s CPU time for our program

• We can do a faster clock for computer B, but this will cause

• How fast must computer B's clock be?

• determined by program, Instruction Set Architecture (ISA), and compiler

• average cycles per instruction (CPI)

• determined by CPU hardware

• different instructions can have different CPI

• average CPI affected by instruction mix

• Computer B: Cycle Time = 500ps, CPI = 1.2

• Both computers have the same Instruction Set Architecture (ISA)

• For a given program (with a fixed instruction count I), which

• optimize what's used most commonly during

• typically, this is a small fraction of overall code

• Assume we can get our hardware designer to speed up

• Question: what improvement factor for multiplication do we need if

A. two times faster?

B. five times faster?

running two errands: I run

cache: cache memory

• each digit represents a power of 10

• e.g. 1979decimal = 1*1000 + 9*100 + 7*10 + 1*9

• binary - base 2 - computers

• each digit represents a power of 2

• e.g. 1011binary = 1*8 + 0*4 + 1*2 + 1*1 = 11decimal

10 103 102 10 1 1000 100 10 1 1234 1*1000+2*100+3*10+4*1 1234

2 23 22 2 1 8 4 2 1 1011 1*8 + 0*4 + 1*2 + 1*1 11

4 43 42 4 1 64 16 4 1 3203 3*64 + 2*16 + 0*4 + 3*1 227

8 83 82 8 1 512 64 8 1 2715 2*512+ 7*64 + 1*8 + 5*1 1485

16 163 162 16 1 4096 256 16 1 2AC8 2*4096+10*256+12*16+8*1 10952

• idea: convert octal to decimal, then decimal to

• e.g. 1979decimal = 11000 + 9100 + 710 + 19

• e.g. 1011binary = 18 + 04 + 12 + 11 = 11decimal

10 103 102 10 1 1000 100 10 1 1234 11000+2100+310+41 1234

2 23 22 2 1 8 4 2 1 1011 18 + 04 + 12 + 11 11

4 43 42 4 1 64 16 4 1 3203 364 + 216 + 04 + 31 227

8 83 82 8 1 512 64 8 1 2715 2512+ 764 + 18 + 51 1485

16 163 162 16 1 4096 256 16 1 2AC8 24096+10256+1216+81 10952