
CT 320: Network and System Administration


Fall 2014*
Dr. Indrajit Ray
Email: indrajit@cs.colostate.edu

Department of Computer Science
Colorado State University
Fort Collins, CO 80528, USA

* Thanks to Dr. James Walden, NKU, and Russ Wakefield, CSU, for the contents of these slides

Dr. Indrajit Ray, Computer Science Department CT 320 Network and Systems Administration, Fall 2014

Disks


Topics
1. Disk components
2. Disk interfaces
3. Lifecycle of a disk
4. Performance
5. Reliability
6. RAID
7. Adding a disk
8. Logical volumes
9. Filesystems

Hard Drive Components


Physical Disk Geometry

One head for each surface
All tracks at r = d_n form a cylinder
Each sector has 512+ bytes of information
One surface dedicated for positioning and synchronization
Not all portions of the disk are addressable by the OS

Hard Drive Components

Actuator
Moves arm across disk to read/write data.
Arm has multiple read/write heads (often 2 per platter.)

Platters
Rigid substrate material.
Thin coating of magnetic material stores data.
Coating type determines areal density: Gbits/in²

Spindle Motor
Spins platters from 3600-15,000 rpm.
Speed determines disk latency.

Cache
2-16 MB of cache memory, often more
Reliability: write-back vs. write-through

Disk Information: hdparm

# hdparm -i /dev/hde

/dev/hde:

Model=WDC WD1200JB-00CRA1, FwRev=17.07W17, SerialNo=WD-WMA8C4533667
Config={ HardSect NotMFM HdSw>15uSec SpinMotCtl Fixed DTR>5Mbs FmtGapReq }
RawCHS=16383/16/63, TrkSize=57600, SectSize=600, ECCbytes=40
BuffType=DualPortCache, BuffSize=8192kB, MaxMultSect=16, MultSect=off
CurCHS=16383/16/63, CurSects=16514064, LBA=yes, LBAsects=234441648
IORDY=on/off, tPIO={min:120,w/IORDY:120}, tDMA={min:120,rec:120}
PIO modes: pio0 pio1 pio2 pio3 pio4
DMA modes: mdma0 mdma1 mdma2
UDMA modes: udma0 udma1 udma2 udma3 udma4 *udma5
AdvancedPM=no WriteCache=enabled
Drive conforms to: device does not report version:

* signifies the current active mode
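
As a rough sanity check, the drive's capacity can be estimated from the LBAsects field of the hdparm output. The sketch below assumes 512-byte logical sectors; the function name lba_capacity_bytes is made up for this illustration.

```shell
# Capacity in bytes from an LBA sector count, assuming 512-byte logical sectors.
lba_capacity_bytes() {
    echo $(( $1 * 512 ))
}

# LBAsects value copied from the hdparm -i output for /dev/hde.
bytes=$(lba_capacity_bytes 234441648)
echo "$bytes bytes, about $(( bytes / 1000000000 )) GB"
```

The result is close to 120 GB in decimal units, which is consistent with the WD1200JB model string.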


Disk Performance
Seek Time
Time to move head to desired track (3-8 ms)

Rotational Delay
Time until the desired block rotates under the head (one full rotation takes about 8 ms at 7200 rpm; the average wait is half of that)

Latency
Seek Time + Rotational Delay

Throughput
Data transfer rate (20-80 MB/s)
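
The rotational delay follows directly from the spindle speed: one full rotation takes 60,000/rpm milliseconds, and on average the head waits half a rotation. A minimal sketch of that arithmetic, using a hypothetical helper named avg_rotational_delay_ms and awk for the floating-point division:

```shell
# Average rotational delay in ms: half of one full rotation (60000 ms / rpm).
avg_rotational_delay_ms() {
    awk -v rpm="$1" 'BEGIN { printf "%.2f\n", 60000 / rpm / 2 }'
}

# Spindle speeds spanning the 3600-15,000 rpm range quoted in these slides.
for rpm in 3600 7200 15000; do
    echo "$rpm rpm: $(avg_rotational_delay_ms "$rpm") ms"
done
```

Adding a seek time in the 3-8 ms range to these values gives a ballpark for the latency of one random access.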


Latency vs. Throughput

Which is more important?
Depends on the type of load.

Sequential access: throughput
Multimedia on a single-user PC

Random access: latency
Most servers

How to improve performance
Faster disks
Caching
More spindles (disks).
More disk controllers.

Disk Performance: hdparm

# hdparm -tT /dev/hde
/dev/hde:
Timing cached reads: 876 MB in 2.00 seconds = 437.41 MB/sec
Timing buffered disk reads: 88 MB in 3.08 seconds = 28.60 MB/sec

Reliability
MTBF
Average time between failures (>1,000,000 hours).

Real failure curves
Early phase: high failure rate from defects.
Constant failure rate phase: MTBF valid.
Wearout phase: high failure rate from wear.

Failures more likely on traumatic events.
Power on/off.

Systems often wear out before MTBF.
Average life span of a disk is about 5 years

Solid State Drives


Flash memory based solid state drives
No moving parts
Much higher I/O performance than hard disks
Random reads also result in very high performance.
Less prone to failure (more reliable)

Higher costs
Uses NAND memory


NAND Flash Constraints (1)

Flash module is divided into blocks, blocks into pages, and pages into sectors
E.g., 1 GB = 8K blocks of 64 pages of 4 sectors of 512 bytes

Read/Write at page granularity (as disks)
Writes are more time and energy consuming than reads (factor of 3 to 10)
Pages must be written sequentially within a block

Erase at block granularity
Erase-before-rewrite constraint
10 times more costly than a page write

A block wears out after 10^6 write/erase cycles
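
The example geometry can be checked with a few lines of shell arithmetic. This is only a worked version of the 1 GB example on this slide; the variable names are chosen for illustration.

```shell
# Geometry from the slide's example: 8K blocks, 64 pages/block, 4 sectors/page.
blocks=8192
pages_per_block=64
sectors_per_page=4
sector_bytes=512

page_bytes=$(( sectors_per_page * sector_bytes ))   # read/write unit
block_bytes=$(( pages_per_block * page_bytes ))     # erase unit
total_bytes=$(( blocks * block_bytes ))

echo "page:  $page_bytes bytes"
echo "block: $block_bytes bytes"
echo "total: $total_bytes bytes"
```

With these numbers a page is 2 KB and a block is 128 KB, so rewriting a single page in place would require erasing 64 pages' worth of flash.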



NAND Flash Constraints (2)

Hardware constraints usually force updates to be made out-of-place
Flash Translation Layer (FTL) is required for
Address translation
Wear leveling
Garbage collection

FTL is a main source of unpredictability
Very badly adapted to random writes
Provides no guarantee against read/write failures

Disk Interfaces
SCSI
Standard interface for servers.

IDE
Standard interface for PCs.

Fibre Channel
High bandwidth
Can run SCSI or IP

USB
Fast enough for slow devices on PCs.


SCSI
Small Computer Systems Interface
Fast, reliable, expensive.

A bus, not a simple PC to device interface.


Each device has a target # ranging 0-7 or 0-15.
Devices can communicate directly w/o CPU.

Many versions
Original: SCSI-1 (1979) 5MB/s
Current: SCSI-3 (2001) 320MB/s

Serial Attached SCSI (SAS)
Up to 128 devices
Up to 2 GB/s full duplex.

IDE
Integrated Drive Electronics / AT Attachment
Slower, less reliable, cheap.
Only allows 2 devices per interface.
ATAPI standard added removable devices.

Many versions
Original: IDE / ATA (1984)
Current: Ultra-ATA/133 133MB/s

Serial ATA
One device per port (point-to-point links).
1.5 Gb/s
New standard up to 6 Gb/s

IDE vs. SCSI

SCSI offers better performance/scale
Faster bus
Faster hard drives (up to 15,000 rpm).
Lower CPU usage
Better handling of multiple requests.

Cheaper IDE often best for workstations.

Convergence
SATA2 and SAS converging on a single standard.

Other Host Interfaces


PCI Express
Speeds up to 2.0 GB/s

Fibre Channel
Very high speed achievable
Can support a variety of network communication protocols such as SCSI / IP
Almost exclusively used for servers

USB, Firewire
Generally much slower and hence not used for internal disks
USB 3.0 promises speeds > 3.0 Gb/s

RAID
Redundant Array of Independent Disks
Can be implemented in hardware or software.
Hardware RAID controllers:
Caching
Automate rebuilding of arrays

Advantages
Capacity
Reliability
Fault-tolerance
Throughput


RAID Levels
RAID 0: Striped evenly for performance.
MTBF = (avg MTBF) / (# disks)
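
The RAID 0 rule of thumb is easy to put numbers on. The sketch below takes the 1,000,000-hour per-disk MTBF quoted under Reliability and divides it by the number of disks in the stripe set; raid0_mtbf_hours is a hypothetical helper, and the formula is the slide's approximation, not an exact failure model.

```shell
# Approximate RAID 0 array MTBF: per-disk MTBF divided by the disk count.
raid0_mtbf_hours() {
    echo $(( $1 / $2 ))
}

# Four striped disks, each rated at 1,000,000 hours.
echo "array MTBF: $(raid0_mtbf_hours 1000000 4) hours"
```

Because losing any one disk loses the whole stripe set, every disk added for speed lowers the expected time to data loss.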


RAID Levels (cont'd)

RAID 1: Mirrored for reliability
Every write goes to each disk of set.
Seek time halved as reads split between disks.

RAID 0 + 1: Striped + mirrored

RAID Levels

RAID 5: Striped with distributed parity.


Block striping, not disk striping.
Can lose one disk of set without losing data.

RAID Levels
JBOD: Concatenated for capacity.
Only data on bad disk is lost, no performance penalty

RAID 3, 4 exist but not popular.
RAID 3 uses byte level striping with dedicated parity disk
RAID 4 uses block level striping with dedicated parity disk

RAID 6 extends RAID 5 by using two parity blocks
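
The capacity trade-off between the levels can be summarized as the number of disks' worth of usable space in a set of N equal disks. This is a sketch under the usual textbook assumptions (one mirror set for RAID 1, one disk of parity for RAID 3/4/5, two for RAID 6); usable_disks is a name invented for this example.

```shell
# usable_disks LEVEL N: disks' worth of usable space in a set of N equal disks.
usable_disks() {
    level=$1
    n=$2
    case "$level" in
        0|jbod) echo "$n" ;;            # striping or concatenation: no redundancy
        1)      echo 1 ;;               # every disk holds the same data
        3|4|5)  echo $(( n - 1 )) ;;    # one disk's worth of parity
        6)      echo $(( n - 2 )) ;;    # two disks' worth of parity
        *)      echo "unknown level: $level" >&2; return 1 ;;
    esac
}

for level in 0 1 5 6; do
    echo "RAID $level with 6 disks: $(usable_disks "$level" 6) usable"
done
```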


Lifecycle of a HDD
Blank media
Low level format
Performed at the factory

Partition
High level format
Operating system install
Systems operation

Blank Magnetic Media

[Figure: magnetic media drawn as a linear strip running from Beginning to End]

For simplicity we will use a linear model of the magnetic media
Unless we are performing electron microscopy, the exact media geometry is not significant
The blank media has only geometric structure and raw magnetic storage

Read / Write Process (simplified)

[Figure: read/write head positioned over the linear media, between Beginning and End]

Write process
Digital signals are encoded (for timing recovery) and transformed into analog signals that drive the magnetic field on the write head

Read process
Analog magnetic field is sensed, timing is recovered and the sampled signal is converted into digital data

Low Level Format

[Figure: media divided into sectors (512 bytes plus overhead); detail of an individual sector with 512 data bytes and sector overhead; redundant sectors visible only to the HDD controller]

Low level formatting adds indivisible units of storage called sectors
Most modern HDDs use 512+ byte sectors
The + accounts for sector overhead bytes (these differ by manufacturer)
Overhead bytes provide error correction and timing recovery functions
Bad sectors are automatically remapped to redundant sectors by the HDD controller

Partitioning

[Figure: disk layout showing the Master Boot Record (MBC, MPT), an inter-partition gap, Partition #1 and Partition #2 each beginning with a Volume Boot Record (VBC, DPB), and unused sectors]

The Master Boot Record (MBR) is created; it includes the Master Boot Code (MBC) and the Master Partition Table (MPT) and is always at sector 1 on any bootable media
The MBC is executed at boot if the HDD is designated as the boot device
The MPT contains information about logical volumes, including the active partition, the partition whose Volume Boot Code (VBC) will be executed
Each partition has a Disk Parameter Block (DPB) that stores information about the partition: file system type, date and time last mounted, etc.
Inter-partition gaps are collections of unused sectors
Some sectors are unused due to addressing issues

High Level Format (File System)

[Figure: disk layout showing the Master Boot Record (MBC, MPT), file system structures, and free space divided into cluster blocks]

MPT now contains file system type and cluster size
Cluster sizes are in increments of 512 bytes (one sector)
The cluster becomes the smallest unit of file allocation for the operating system

A file system structure is created
FAT creates a file allocation table (simple table)
NTFS creates a master file table (database)
Linux EXT2/EXT3/EXT4 creates superblocks, inode tables and block bitmaps (accessed through the kernel's virtual file system layer)

Operating System Install

[Figure: disk layout showing the Master Boot Record (MBC, MPT), file system structures, operating system code/data, swap space, and free space]

Operating system code, application code, configuration data and application data are installed
A swap file is created for NTFS and UNIX variants (Linux, Unix, FreeBSD, etc.)
Boot code is written to the MBC (or VBC if a boot loader is used)

Adding a Disk
Install new hardware
Verify disk recognized by BIOS.

Boot
Verify device exists in /dev

Partition
fdisk /dev/sdb

Create filesystem
mkfs -v -t ext3 /dev/sdb1

Add entry to /etc/fstab


/dev/sdb1 /proj ext3 defaults 0 2

mount -a

When don't you need a filesystem?

Swap space
mkswap /dev/sdb1

Server applications
Oracle
VMware Server

Logical Volumes
What are logical volumes?
Appear to user as a physical volume.
But can span multiple partitions and/or disks.

Why logical volumes?
Aggregate disks for performance/reliability.
Grow and shrink logical volumes on the fly.
Move logical volumes between physical devices.
Replace volumes without interrupting service.

LVM


LVM Components
Logical Volume Group (LVG)
Set of physical volumes (partitions or disks.)
May be divided into logical volumes (LVs.)

LVs are made up of fixed-size logical extents (LEs)
Each LE is 4 MB by default.
Physical extents are the same size.
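
Because a logical volume is allocated in whole extents, its size maps directly to an extent count. A small sketch assuming the 4 MB extent size given on this slide and sizes in binary gigabytes; extents_for_gb is a hypothetical helper name.

```shell
# Number of 4 MB extents needed for a volume of the given size in GB.
extents_for_gb() {
    size_mb=$(( $1 * 1024 ))
    # Round up so that a partial extent still counts as a whole one.
    echo $(( (size_mb + 3) / 4 ))
}

echo "100G volume: $(extents_for_gb 100) extents"
echo "+20G growth: $(extents_for_gb 20) extents"
```

Growing a volume therefore means adding free extents from the volume group, which is why resizing does not require the space to be contiguous.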


Mapping Modes
Linear Mapping
LVs assigned to contiguous areas of PV space.

Striped Mapping
LEs interleaved across PVs to improve performance.

Setting up a LVG and LV

1. Initialize physical volumes
pvcreate /dev/hda1
pvcreate /dev/hdb1

2. Initialize a volume group
vgcreate nku_proj /dev/hda1 /dev/hdb1
Use vgextend to add more PVs later.

3. Create logical volumes
lvcreate -n nku1 --size 100G nku_proj

4. Create filesystem
mkfs -v -t ext3 /dev/nku_proj/nku1

Extending a LV
Set absolute size
lvextend -L120G /dev/nku_proj/nku1

Or set relative size
lvextend -L+20G /dev/nku_proj/nku1

Expand the filesystem without unmounting
ext2online -v /dev/nku_proj/nku1
(on current systems, resize2fs performs the online resize instead)

Check size
df -k

Swap
Can use a swap file instead of a swap partition
dd if=/dev/zero of=/swapfile bs=1024k count=512
mkswap /swapfile

Enable swap
swapon /swapfile
swapon /dev/sda2

Disable swap
swapoff /swapfile
swapoff /dev/sda2

Check swap resource usage
cat /proc/swaps

Filesystems
ext4
Gaining more popularity
Can support volumes with sizes up to 1 exbibyte (2^60 bytes) and files up to 16 tebibytes (2^44 bytes)

ext3
Currently the most common Linux filesystem.
Journaling largely eliminates the need for a full fsck after a crash.

ext2
Older, fast Linux filesystem that resists fragmentation.
Can be converted to ext3 by adding a journal:
tune2fs -j /dev/sda1

Mounting
To use a filesystem
mount /dev/sda1 /mnt
df /mnt

Automatic mounting
Add an entry in /etc/fstab

Unmount
umount /dev/sda1
Cannot unmount a volume in use.

fstab
# /etc/fstab: static file system information.
#
# <file system> <mount point>   <type>   <options>  <dump> <pass>
proc            /proc           proc     defaults   0      0
/dev/hdc1       /               ext3     defaults   0      1
/dev/hdc5       /win            vfat     user,rw    0      0
/dev/hdc7       none            swap     sw         0      0
/dev/hdc8       /var            ext3     defaults   0      2
/dev/hdc9       /home           ext3     defaults   0      2
/dev/hda        /media/cdrom0   iso9660  ro,user    0      0
/dev/fd0        /media/floppy0  auto     rw,user    0      0
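
The sixth fstab field (<pass>) controls boot-time checking: 0 means the filesystem is never checked, 1 is reserved for the root filesystem, and 2 is used for everything checked afterwards. The sketch below lists the entries that would be checked, in pass order; fsck_order is a hypothetical helper, and the sample input is a subset of the table on this slide.

```shell
# Print "<pass> <device> <mount point>" for entries with a non-zero pass field.
fsck_order() {
    awk '!/^[[:space:]]*#/ && NF >= 6 && $6 > 0 { print $6, $1, $2 }' | sort -n
}

fsck_order <<'EOF'
# <file system> <mount point> <type> <options> <dump> <pass>
proc       /proc  proc  defaults  0  0
/dev/hdc8  /var   ext3  defaults  0  2
/dev/hdc1  /      ext3  defaults  0  1
/dev/hdc7  none   swap  sw        0  0
EOF
```

On a live system the same function could be fed with `fsck_order < /etc/fstab`, which only reads the file.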

fsck: check + repair fs

Filesystem corruption sources
Power failure
System crash

Types of corruption
Unreferenced inodes.
Bad superblocks.
Unused data blocks not recorded in block maps.
Data blocks listed as free that are used in files.

fsck can fix these and more
Asks user to make more complex decisions.
Stores unfixable files in lost+found.
