
XtreemFS - A Cloud File System

Michael Berlin, Zuse Institute Berlin


Contrail Summer School, Almere, 24.07.2012

Funded under: FP7 (Seventh Framework Programme)
Area: Internet of Services, Software & virtualization (ICT-2009.1.2)
Project reference: 257438

Motivation: Cloud Storage / Cloud File System

Cloud storage requirements:
- highly available
- scalable
- elastic: add and remove capacity
- suitable for wide area networks

Support for legacy applications:
- POSIX-compatible file system required

Google for "cloud file system": www.XtreemFS.org

Outline

- XtreemFS Architecture
- Replication in XtreemFS
  - Read-Only File Replication
  - Read/Write File Replication
  - Custom Replica Placement and Selection
  - Metadata Replication
- XtreemFS Use Cases
- XtreemFS and OpenNebula

XtreemFS - A Cloud File System

History:
- 2006: initial development in the XtreemOS project
- 2010: further development in the Contrail project
- August 2012: release 1.3.2

Features:
- distributed file system
- POSIX compatible
- replication
- X.509 certificates and SSL support

Software:
- open source: www.xtreemfs.org
- client software (C++) runs on Linux & OS X (FUSE) and Windows (Dokan)
- server software (Java)

XtreemFS Architecture

Separation of metadata and file content:
- Metadata and Replica Catalog (MRC): stores the metadata per volume
- Object Storage Devices (OSDs): directly accessed by clients; file content is split into objects
- XtreemFS is an object-based file system
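To make the object-based layout concrete: a client read or write maps a byte offset in the file to an object number and an offset inside that object, and that object is then fetched from or written to an OSD. The sketch below only illustrates this mapping; the class name and the fixed 128 KiB object size are assumptions for the example, not the actual client code.

```java
// Minimal sketch: mapping a byte offset in a file to an object on an OSD.
// Assumes one fixed object size; names and the size are illustrative only.
public class ObjectMapping {
    static final int OBJECT_SIZE = 128 * 1024; // assumed object size in bytes

    // Which object contains the given file offset?
    static long objectNumber(long fileOffset) {
        return fileOffset / OBJECT_SIZE;
    }

    // Offset of that byte inside the object.
    static int offsetInObject(long fileOffset) {
        return (int) (fileOffset % OBJECT_SIZE);
    }

    public static void main(String[] args) {
        long fileOffset = 1_000_000L;
        System.out.printf("byte %d -> object %d, offset %d%n",
                fileOffset, objectNumber(fileOffset), offsetInObject(fileOffset));
    }
}
```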

Scalability

Storage capacity:
- addition and removal of OSDs possible
- OSDs may be used by multiple volumes

File I/O throughput:
- scales with the number of OSDs

Metadata throughput:
- limited by the MRC hardware
- use many volumes spread over multiple MRCs

Read-Only Replication (1)

Only for write-once files:
- the file must be marked as read-only (done automatically after close())
- use case: CDN

Replica types:
1. Full replicas: complete copy, fills itself as fast as possible
2. Partial replicas: initially empty, fetch missing objects on demand

P2P-like efficient transfer between all replicas
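The on-demand behavior of a partial replica can be pictured as follows: the replica tracks which objects it already holds locally and pulls a missing object from another replica before answering a read. This is a minimal sketch with invented interfaces (RemoteReplica, the in-memory store), not the actual OSD implementation.

```java
import java.util.BitSet;
import java.util.HashMap;
import java.util.Map;

// Sketch of a partial read-only replica that fetches missing objects on demand.
// RemoteReplica and the in-memory object store are invented for this example.
class PartialReplica {
    interface RemoteReplica { byte[] fetchObject(long objectNo); }

    private final BitSet present = new BitSet();        // which objects are local
    private final Map<Long, byte[]> objects = new HashMap<>();
    private final RemoteReplica source;                  // e.g. a full replica

    PartialReplica(RemoteReplica source) { this.source = source; }

    byte[] readObject(long objectNo) {
        if (!present.get((int) objectNo)) {              // object not yet local?
            byte[] data = source.fetchObject(objectNo);  // fetch it on demand
            objects.put(objectNo, data);
            present.set((int) objectNo);
        }
        return objects.get(objectNo);
    }
}
```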

Read-Only Replication (2)

Read/Write Replication (1)

Primary/backup scheme:
- POSIX requires a total order of update operations, hence primary/backup
- how to fail over the primary?

Leases:
- a lease grants access to a resource (here: the primary role) for a predefined period of time
- failover after the lease times out is possible
- assumption: loosely synchronized clocks with a bounded maximum drift
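The clock-drift assumption matters for safety: a backup may only take over the primary role once the old lease has expired even on a clock that runs ahead by the maximum drift. A minimal sketch of that check (the names and the drift bound are assumptions):

```java
// Sketch: conservative lease handling under loosely synchronized clocks.
// The drift bound and field names are assumptions for this example.
class Lease {
    static final long MAX_CLOCK_DRIFT_MS = 1000;  // assumed upper bound on clock drift

    final String holder;       // node currently allowed to act as primary
    final long expiresAtMs;    // lease expiry time

    Lease(String holder, long expiresAtMs) {
        this.holder = holder;
        this.expiresAtMs = expiresAtMs;
    }

    // The holder may act as primary only while the lease is valid.
    boolean allowsPrimary(String node, long nowMs) {
        return holder.equals(node) && nowMs < expiresAtMs;
    }

    // Another node may take over only after the lease has expired even on a
    // clock that runs ahead of ours by the maximum assumed drift.
    boolean safeToTakeOver(long nowMs) {
        return nowMs > expiresAtMs + MAX_CLOCK_DRIFT_MS;
    }
}
```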

Read/Write Replication (2): Replicated write()


Read/Write Replication (3): Replicated write()
1. Lease Acquisition


Read/Write Replication (4): Replicated write()
1. Lease Acquisition
2. Data Dissemination


Read/Write Replication (5): Replicated read()
1. Lease Acquisition
1b. Replica Reset: update the primary's replica
2. Respond to the read() using the local replica
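The sequence shown on these slides can be summarized as control flow: the OSD first acquires the lease (becoming primary), brings its own replica up to date with a replica reset if needed, and then either disseminates a write to a majority of backups or answers a read from its local replica. The sketch below mirrors that flow only; LeaseService, Backup, and the local store are invented interfaces, and the real protocol details are omitted.

```java
import java.util.List;

// Control-flow sketch of the replicated write()/read() path shown above.
// LeaseService, Backup and the local object store are invented interfaces.
class ReplicatedFile {
    interface LeaseService { boolean acquireLease(String fileId, String node); }
    interface Backup { boolean applyWrite(long objectNo, byte[] data); }

    private final String fileId;
    private final String me;
    private final LeaseService flease;
    private final List<Backup> backups;
    private boolean replicaUpToDate = false;

    ReplicatedFile(String fileId, String me, LeaseService flease, List<Backup> backups) {
        this.fileId = fileId; this.me = me; this.flease = flease; this.backups = backups;
    }

    private void becomePrimary() {
        if (!flease.acquireLease(fileId, me))      // 1. lease acquisition
            throw new IllegalStateException("another replica holds the lease");
        if (!replicaUpToDate) {
            // 1b. replica reset: bring the local replica up to date (omitted here)
            replicaUpToDate = true;
        }
    }

    void write(long objectNo, byte[] data) {
        becomePrimary();
        int acks = 1;                              // the primary's own copy
        for (Backup b : backups)                   // 2. data dissemination
            if (b.applyWrite(objectNo, data)) acks++;
        int total = backups.size() + 1;
        if (acks <= total / 2)                     // a majority must acknowledge
            throw new IllegalStateException("write not acknowledged by a majority");
    }

    byte[] read(long objectNo) {
        becomePrimary();                           // includes the replica reset
        return localObject(objectNo);              // 2. answer from the local replica
    }

    private byte[] localObject(long objectNo) {
        return new byte[0];                        // placeholder for the local store
    }
}
```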


Read/Write Replication: Distributed Lease Acquisition with Flease

Central lock service vs. Flease:
- Flease is failure tolerant: majority-based
- Flease is scalable: one lease per file

Experiment:
- ZooKeeper: 3 servers
- Flease: 3 nodes (2 randomly selected)


Read/Write Replication: Data Dissemination - Ensuring Consistency with a Quorum Protocol

- R + W > N
  - R: number of replicas that have to be read
  - W: number of replicas that have to be updated
- Quorum intersection property: a read quorum and a write quorum are never disjoint

Write All, Read 1 (W = N, R = 1):
- no write availability as soon as one replica fails
- reads from backup replicas allowed

Example with N = 3 replicas (W = 2, R = 2):
a) Write
b) Read

Write quorum, read quorum:
- available as long as a majority is reachable
- the quorum read is covered by the Replica Reset phase
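Worked example: with N = 3, W = 2, R = 2 we have R + W = 4 > 3, so every read quorum overlaps every write quorum and therefore sees the latest write; with W = N = 3 and R = 1, a single failed replica already blocks writes. The snippet below simply encodes that intersection condition.

```java
// Quorum sizing: R + W > N guarantees that read and write quorums intersect.
public class QuorumCheck {
    static boolean quorumsIntersect(int n, int w, int r) {
        // Any w-subset and any r-subset of n replicas overlap iff w + r > n.
        return w + r > n;
    }

    public static void main(String[] args) {
        System.out.println(quorumsIntersect(3, 2, 2)); // true:  majority write, majority read
        System.out.println(quorumsIntersect(3, 3, 1)); // true:  write all, read one
        System.out.println(quorumsIntersect(3, 1, 1)); // false: a read may miss the last write
    }
}
```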



Read/Write Replication: Summary

High up-front cost (for the first access to an inactive file):
- 3+ round trips: 2 for Flease (lease acquisition), 1 for the Replica Reset
- plus further round trips when fetching missing objects

Minimal cost for subsequent operations:
- read: identical to the non-replicated case
- write: latency increases by the time needed to update a majority of the backups

Works at file level: scales with the number of OSDs and the number of files
Flease: no I/O to stable storage needed for crash recovery

Custom Replica Placement and Selection

Policies:
- filter and sort the available OSDs/replicas
- evaluate client information (IP address/hostname, estimated latency)
- examples: "create the file on an OSD close to me", "access the closest replica"

Available default policies:
- Server ID
- DNS
- Datacenter Map
- Vivaldi

Own policies possible (Java); see the sketch below.
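To illustrate what such a policy does, here is a generic filter-and-sort sketch. The Osd and ClientInfo types are invented for this example and the method signature does not match the real XtreemFS plug-in interface; it only shows the idea of preferring nearby OSDs.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

// Generic sketch of a placement/selection policy: filter the candidate OSDs,
// then sort them so that the "closest" one comes first. Osd and ClientInfo
// are invented types; the real XtreemFS plug-in interface differs.
class ClosestOsdPolicy {
    static class Osd { String uuid; String datacenter; double estimatedLatencyMs; long freeBytes; }
    static class ClientInfo { String ipAddress; String datacenter; }

    List<Osd> select(List<Osd> candidates, final ClientInfo client) {
        List<Osd> usable = new ArrayList<Osd>();
        for (Osd osd : candidates)
            if (osd.freeBytes > 0)                 // filter: skip full OSDs
                usable.add(osd);

        Collections.sort(usable, new Comparator<Osd>() {
            public int compare(Osd a, Osd b) {
                // prefer the client's datacenter, then the lower estimated latency
                boolean aLocal = a.datacenter.equals(client.datacenter);
                boolean bLocal = b.datacenter.equals(client.datacenter);
                if (aLocal != bLocal) return aLocal ? -1 : 1;
                return Double.compare(a.estimatedLatencyMs, b.estimatedLatencyMs);
            }
        });
        return usable;                             // best candidate first
    }
}
```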


Replica Placement/Selection: Vivaldi Visualization


Metadata Replication

Replication at the database level:
- same approach as file Read/Write replication

Loosened consistency:
- stale reads can be allowed

All services replicated:
- no single point of failure

XtreemFS Use Cases

- Storage of VM images for IaaS solutions (OpenNebula, ...)
- Storage-as-a-Service: volumes per user
- XtreemFS as an HDFS replacement in Hadoop
- XtreemFS in ConPaaS: storage on demand for other services


XtreemFS and OpenNebula (1)

Use case: VM images in an OpenNebula cluster
- without a distributed file system: scp VM images to the hosts
- with a distributed file system: shared storage, available on all nodes
  - support for live migration
  - fault-tolerant storage of VM images: resume the VM on another node after a crash
  - use XtreemFS Read/Write file replication


XtreemFS and OpenNebula (2)

VM deployment:
- create a copy (clone) of the original VM image
- run the cloned VM image on the scheduled host
- (discard the cloned image after VM shutdown)

Problems:
1. cloning is time-consuming
2. waste of space
3. increasing total boot time when starting multiple VMs (e.g., the ConPaaS image)


XtreemFS and OpenNebula: qcow2 + Replication

qcow2 VM image format:
- allows snapshots:
  1. immutable backing file
  2. mutable, initially empty snapshot file
- instead of cloning, snapshot the original VM image (< 1 second)
- use Read/Write replication for the snapshot file

Problem left: running multiple VMs simultaneously
- snapshot file: R/W replication scales with the number of OSDs and files
- backing file: bottleneck, so use Read-Only Replication for it
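Creating such a snapshot amounts to creating a new qcow2 image that references the original image as its backing file, e.g. via qemu-img. The sketch below shows how a deployment script could do this from Java; the paths on the XtreemFS mount are placeholders, and the exact qemu-img options may vary between versions.

```java
import java.io.IOException;

// Sketch: create a qcow2 snapshot file that uses the original (read-only
// replicated) image as its backing file, instead of copying the whole image.
// The paths on the XtreemFS mount are placeholders.
public class SnapshotImage {
    public static void main(String[] args) throws IOException, InterruptedException {
        String backingFile = "/xtreemfs/images/conpaas-original.img";  // immutable backing file
        String snapshot    = "/xtreemfs/images/vm-42-snapshot.qcow2";  // mutable snapshot file

        Process p = new ProcessBuilder(
                "qemu-img", "create",
                "-f", "qcow2",
                "-b", backingFile,   // reference the original image as backing file
                snapshot)
                .inheritIO()
                .start();
        System.out.println("qemu-img exited with " + p.waitFor());
    }
}
```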


XtreemFS and OpenNebula: Benchmark (1)

OpenNebula test cluster:
- frontend + 30 worker nodes
- Gigabit Ethernet (100 MB/s)
- SATA disks (70 MB/s)

Setup:
- frontend: MRC, OSD (holds the ConPaaS VM image)
- each worker node: OSD, XtreemFS FUSE client, OpenNebula node
- replica placement + replica selection: prefer the local OSD/replica

XtreemFS and OpenNebula: Benchmark (2)

Setup                                               Total Boot Time
copy (1.6 GB image file)                            82 seconds (69 seconds for the copy)
qcow2, 1 VM                                         13.6 seconds
qcow2, 30 VMs                                       20.8 seconds
qcow2, 30 VMs, 30 partial replicas                  142.8 seconds
  - second run                                      20.1 seconds
  + Read/Write Replication on the snapshot file     17.5 seconds
  - after second run                                19.5 seconds

Notes:
- few read()s on the image, no bottleneck yet
- replication works at object granularity vs. the small reads/writes of the VMs


Future Research & Work

- Deduplication
- Improved elasticity
- Fault tolerance
- Optimize storage cost: erasure codes
- Self-*
- Client cache
- Less POSIX: replace the MRC with a scalable service

Funded under: FP7 (Seventh Framework Programme)
Area: Internet of Services, Software & virtualization (ICT-2009.1.2)
Project reference: 257438
Total cost: 11.29 million euro
EU contribution: 8.3 million euro
Execution: from 2010-10-01 until 2013-09-30
Duration: 36 months
Contract type: Collaborative project (generic)

