
XtreemFS - A Cloud File System

Michael Berlin, Zuse Institute Berlin


Contrail Summer School, Almere, 24.07.2012

Funded under: FP7 (Seventh Framework Programme)
Area: Internet of Services, Software & virtualization (ICT-2009.1.2)
Project reference: 257438

Motivation: Cloud Storage / Cloud File System

Cloud storage requirements:
- highly available
- scalable
- elastic: add and remove capacity
- suitable for wide area networks

Support for legacy applications:
- POSIX-compatible file system required

Google for "cloud file system": www.XtreemFS.org

Outline

- XtreemFS Architecture
- Replication in XtreemFS
  - Read-Only File Replication
  - Read/Write File Replication
  - Custom Replica Placement and Selection
  - Metadata Replication
- XtreemFS Use Cases
- XtreemFS and OpenNebula

XtreemFS - A Cloud File System

History:
- 2006: initial development in the XtreemOS project
- 2010: further development in the Contrail project
- August 2012: release 1.3.2

Features:
- distributed file system
- POSIX compatible
- replication
- X.509 certificates and SSL support

Software:
- open source: www.xtreemfs.org
- client software (C++) runs on Linux & OS X (FUSE) and Windows (Dokan)
- server software (Java)

XtreemFS Architecture

Separation of metadata and file content:
- Metadata and Replica Catalog (MRC): stores the metadata per volume
- Object Storage Devices (OSDs): directly accessed by clients; file content is split into objects
- XtreemFS is an object-based file system
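To make the object-based layout concrete: a client read or write maps a byte offset in the file to an object number and an offset inside that object, and that object is then fetched from or written to an OSD. The sketch below only illustrates this mapping; the class name and the fixed 128 KiB object size are assumptions for the example, not the actual client code.

```java
// Minimal sketch: mapping a byte offset in a file to an object on an OSD.
// Assumes one fixed object size; names and the size are illustrative only.
public class ObjectMapping {
    static final int OBJECT_SIZE = 128 * 1024; // assumed object size in bytes

    // Which object contains the given file offset?
    static long objectNumber(long fileOffset) {
        return fileOffset / OBJECT_SIZE;
    }

    // Offset of that byte inside the object.
    static int offsetInObject(long fileOffset) {
        return (int) (fileOffset % OBJECT_SIZE);
    }

    public static void main(String[] args) {
        long fileOffset = 1_000_000L;
        System.out.printf("byte %d -> object %d, offset %d%n",
                fileOffset, objectNumber(fileOffset), offsetInObject(fileOffset));
    }
}
```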

Scalability

Storage capacity:
- addition and removal of OSDs possible
- OSDs may be used by multiple volumes

File I/O throughput:
- scales with the number of OSDs

Metadata throughput:
- limited by the MRC hardware
- use many volumes spread over multiple MRCs

Read-Only Replication (1)

Only for write-once files:
- the file must be marked as read-only (done automatically after close())
- use case: CDN

Replica types:
1. Full replicas: complete copy, fills itself as fast as possible
2. Partial replicas: initially empty, fetch missing objects on demand

P2P-like efficient transfer between all replicas
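The on-demand behavior of a partial replica can be pictured as follows: the replica tracks which objects it already holds locally and pulls a missing object from another replica before answering a read. This is a minimal sketch with invented interfaces (RemoteReplica, the in-memory store), not the actual OSD implementation.

```java
import java.util.BitSet;
import java.util.HashMap;
import java.util.Map;

// Sketch of a partial read-only replica that fetches missing objects on demand.
// RemoteReplica and the in-memory object store are invented for this example.
class PartialReplica {
    interface RemoteReplica { byte[] fetchObject(long objectNo); }

    private final BitSet present = new BitSet();        // which objects are local
    private final Map<Long, byte[]> objects = new HashMap<>();
    private final RemoteReplica source;                  // e.g. a full replica

    PartialReplica(RemoteReplica source) { this.source = source; }

    byte[] readObject(long objectNo) {
        if (!present.get((int) objectNo)) {              // object not yet local?
            byte[] data = source.fetchObject(objectNo);  // fetch it on demand
            objects.put(objectNo, data);
            present.set((int) objectNo);
        }
        return objects.get(objectNo);
    }
}
```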

Read-Only Replication (2)

Read/Write Replication (1)

Primary/backup scheme:
- POSIX requires a total order of update operations, hence primary/backup
- how to fail over the primary?

Leases:
- a lease grants access to a resource (here: the primary role) for a predefined period of time
- failover after the lease times out is possible
- assumption: loosely synchronized clocks with a bounded maximum drift
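The clock-drift assumption matters for safety: a backup may only take over the primary role once the old lease has expired even on a clock that runs ahead by the maximum drift. A minimal sketch of that check (the names and the drift bound are assumptions):

```java
// Sketch: conservative lease handling under loosely synchronized clocks.
// The drift bound and field names are assumptions for this example.
class Lease {
    static final long MAX_CLOCK_DRIFT_MS = 1000;  // assumed upper bound on clock drift

    final String holder;       // node currently allowed to act as primary
    final long expiresAtMs;    // lease expiry time

    Lease(String holder, long expiresAtMs) {
        this.holder = holder;
        this.expiresAtMs = expiresAtMs;
    }

    // The holder may act as primary only while the lease is valid.
    boolean allowsPrimary(String node, long nowMs) {
        return holder.equals(node) && nowMs < expiresAtMs;
    }

    // Another node may take over only after the lease has expired even on a
    // clock that runs ahead of ours by the maximum assumed drift.
    boolean safeToTakeOver(long nowMs) {
        return nowMs > expiresAtMs + MAX_CLOCK_DRIFT_MS;
    }
}
```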

Read/Write Replication (2): Replicated write()


Read/Write Replication (3): Replicated write()
1. Lease Acquisition


Read/Write Replication (4): Replicated write()
1. Lease Acquisition
2. Data Dissemination


Read/Write Replication (5): Replicated read()
1. Lease Acquisition
1b. Replica Reset: update the primary's replica
2. Respond to the read() using the local replica
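The sequence shown on these slides can be summarized as control flow: the OSD first acquires the lease (becoming primary), brings its own replica up to date with a replica reset if needed, and then either disseminates a write to a majority of backups or answers a read from its local replica. The sketch below mirrors that flow only; LeaseService, Backup, and the local store are invented interfaces, and the real protocol details are omitted.

```java
import java.util.List;

// Control-flow sketch of the replicated write()/read() path shown above.
// LeaseService, Backup and the local object store are invented interfaces.
class ReplicatedFile {
    interface LeaseService { boolean acquireLease(String fileId, String node); }
    interface Backup { boolean applyWrite(long objectNo, byte[] data); }

    private final String fileId;
    private final String me;
    private final LeaseService flease;
    private final List<Backup> backups;
    private boolean replicaUpToDate = false;

    ReplicatedFile(String fileId, String me, LeaseService flease, List<Backup> backups) {
        this.fileId = fileId; this.me = me; this.flease = flease; this.backups = backups;
    }

    private void becomePrimary() {
        if (!flease.acquireLease(fileId, me))      // 1. lease acquisition
            throw new IllegalStateException("another replica holds the lease");
        if (!replicaUpToDate) {
            // 1b. replica reset: bring the local replica up to date (omitted here)
            replicaUpToDate = true;
        }
    }

    void write(long objectNo, byte[] data) {
        becomePrimary();
        int acks = 1;                              // the primary's own copy
        for (Backup b : backups)                   // 2. data dissemination
            if (b.applyWrite(objectNo, data)) acks++;
        int total = backups.size() + 1;
        if (acks <= total / 2)                     // a majority must acknowledge
            throw new IllegalStateException("write not acknowledged by a majority");
    }

    byte[] read(long objectNo) {
        becomePrimary();                           // includes the replica reset
        return localObject(objectNo);              // 2. answer from the local replica
    }

    private byte[] localObject(long objectNo) {
        return new byte[0];                        // placeholder for the local store
    }
}
```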


Read/Write Replication: Distributed Lease Acquisition with Flease

Central lock service vs. Flease:
- Flease is failure tolerant: majority-based
- Flease is scalable: one lease per file

Experiment:
- ZooKeeper: 3 servers
- Flease: 3 nodes (2 randomly selected)


Read/Write Replication: Data Dissemination - Ensuring Consistency with a Quorum Protocol

- R + W > N
  - R: number of replicas that have to be read
  - W: number of replicas that have to be updated
- Quorum intersection property: a read quorum and a write quorum are never disjoint

Write All, Read 1 (W = N, R = 1):
- no write availability as soon as one replica fails
- reads from backup replicas allowed

Example with N = 3 replicas (W = 2, R = 2):
a) Write
b) Read

Write quorum, read quorum:
- available as long as a majority is reachable
- the quorum read is covered by the Replica Reset phase
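Worked example: with N = 3, W = 2, R = 2 we have R + W = 4 > 3, so every read quorum overlaps every write quorum and therefore sees the latest write; with W = N = 3 and R = 1, a single failed replica already blocks writes. The snippet below simply encodes that intersection condition.

```java
// Quorum sizing: R + W > N guarantees that read and write quorums intersect.
public class QuorumCheck {
    static boolean quorumsIntersect(int n, int w, int r) {
        // Any w-subset and any r-subset of n replicas overlap iff w + r > n.
        return w + r > n;
    }

    public static void main(String[] args) {
        System.out.println(quorumsIntersect(3, 2, 2)); // true:  majority write, majority read
        System.out.println(quorumsIntersect(3, 3, 1)); // true:  write all, read one
        System.out.println(quorumsIntersect(3, 1, 1)); // false: a read may miss the last write
    }
}
```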



Read/Write Replication: Summary

High up-front cost (for the first access to an inactive file):
- 3+ round trips: 2 for Flease (lease acquisition), 1 for the Replica Reset
- plus further round trips when fetching missing objects

Minimal cost for subsequent operations:
- read: identical to the non-replicated case
- write: latency increases by the time needed to update a majority of the backups

Works at file level: scales with the number of OSDs and the number of files
Flease: no I/O to stable storage needed for crash recovery

Custom Replica Placement and Selection

Policies:
- filter and sort the available OSDs/replicas
- evaluate client information (IP address/hostname, estimated latency)
- examples: "create the file on an OSD close to me", "access the closest replica"

Available default policies:
- Server ID
- DNS
- Datacenter Map
- Vivaldi

Own policies possible (Java); see the sketch below.
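To illustrate what such a policy does, here is a generic filter-and-sort sketch. The Osd and ClientInfo types are invented for this example and the method signature does not match the real XtreemFS plug-in interface; it only shows the idea of preferring nearby OSDs.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

// Generic sketch of a placement/selection policy: filter the candidate OSDs,
// then sort them so that the "closest" one comes first. Osd and ClientInfo
// are invented types; the real XtreemFS plug-in interface differs.
class ClosestOsdPolicy {
    static class Osd { String uuid; String datacenter; double estimatedLatencyMs; long freeBytes; }
    static class ClientInfo { String ipAddress; String datacenter; }

    List<Osd> select(List<Osd> candidates, final ClientInfo client) {
        List<Osd> usable = new ArrayList<Osd>();
        for (Osd osd : candidates)
            if (osd.freeBytes > 0)                 // filter: skip full OSDs
                usable.add(osd);

        Collections.sort(usable, new Comparator<Osd>() {
            public int compare(Osd a, Osd b) {
                // prefer the client's datacenter, then the lower estimated latency
                boolean aLocal = a.datacenter.equals(client.datacenter);
                boolean bLocal = b.datacenter.equals(client.datacenter);
                if (aLocal != bLocal) return aLocal ? -1 : 1;
                return Double.compare(a.estimatedLatencyMs, b.estimatedLatencyMs);
            }
        });
        return usable;                             // best candidate first
    }
}
```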


Replica Placement/Selection: Vivaldi Visualization


Metadata Replication

Replication at the database level:
- same approach as file Read/Write replication

Loosened consistency:
- stale reads can be allowed

All services replicated:
- no single point of failure

XtreemFS Use Cases

- Storage of VM images for IaaS solutions (OpenNebula, ...)
- Storage-as-a-Service: volumes per user
- XtreemFS as an HDFS replacement in Hadoop
- XtreemFS in ConPaaS: storage on demand for other services


XtreemFS and OpenNebula (1)

Use case: VM images in an OpenNebula cluster
- without a distributed file system: scp VM images to the hosts
- with a distributed file system: shared storage, available on all nodes
  - support for live migration
  - fault-tolerant storage of VM images: resume the VM on another node after a crash
  - use XtreemFS Read/Write file replication


XtreemFS and OpenNebula (2)

VM deployment:
- create a copy (clone) of the original VM image
- run the cloned VM image on the scheduled host
- (discard the cloned image after VM shutdown)

Problems:
1. cloning is time-consuming
2. waste of space
3. increasing total boot time when starting multiple VMs (e.g., the ConPaaS image)


XtreemFS and OpenNebula: qcow2 + Replication

qcow2 VM image format:
- allows snapshots:
  1. immutable backing file
  2. mutable, initially empty snapshot file
- instead of cloning, snapshot the original VM image (< 1 second)
- use Read/Write replication for the snapshot file

Problem left: running multiple VMs simultaneously
- snapshot file: R/W replication scales with the number of OSDs and files
- backing file: bottleneck, so use Read-Only Replication for it
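Creating such a snapshot amounts to creating a new qcow2 image that references the original image as its backing file, e.g. via qemu-img. The sketch below shows how a deployment script could do this from Java; the paths on the XtreemFS mount are placeholders, and the exact qemu-img options may vary between versions.

```java
import java.io.IOException;

// Sketch: create a qcow2 snapshot file that uses the original (read-only
// replicated) image as its backing file, instead of copying the whole image.
// The paths on the XtreemFS mount are placeholders.
public class SnapshotImage {
    public static void main(String[] args) throws IOException, InterruptedException {
        String backingFile = "/xtreemfs/images/conpaas-original.img";  // immutable backing file
        String snapshot    = "/xtreemfs/images/vm-42-snapshot.qcow2";  // mutable snapshot file

        Process p = new ProcessBuilder(
                "qemu-img", "create",
                "-f", "qcow2",
                "-b", backingFile,   // reference the original image as backing file
                snapshot)
                .inheritIO()
                .start();
        System.out.println("qemu-img exited with " + p.waitFor());
    }
}
```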


XtreemFS and OpenNebula: Benchmark (1)

OpenNebula test cluster:
- frontend + 30 worker nodes
- Gigabit Ethernet (100 MB/s)
- SATA disks (70 MB/s)

Setup:
- frontend: MRC, OSD (holds the ConPaaS VM image)
- each worker node: OSD, XtreemFS FUSE client, OpenNebula node
- replica placement + replica selection: prefer the local OSD/replica

XtreemFS and OpenNebula: Benchmark (2)

Setup                                               Total Boot Time
copy (1.6 GB image file)                            82 seconds (69 seconds for the copy)
qcow2, 1 VM                                         13.6 seconds
qcow2, 30 VMs                                       20.8 seconds
qcow2, 30 VMs, 30 partial replicas                  142.8 seconds
  - second run                                      20.1 seconds
  + Read/Write Replication on the snapshot file     17.5 seconds
  - after second run                                19.5 seconds

Notes:
- few read()s on the image, no bottleneck yet
- replication works at object granularity vs. the small reads/writes of the VMs


Future Research & Work

- Deduplication
- Improved elasticity
- Fault tolerance
- Optimize storage cost: erasure codes
- Self-*
- Client cache
- Less POSIX: replace the MRC with a scalable service

Funded under: FP7 (Seventh Framework Programme)
Area: Internet of Services, Software & virtualization (ICT-2009.1.2)
Project reference: 257438
Total cost: 11.29 million euro
EU contribution: 8.3 million euro
Execution: from 2010-10-01 until 2013-09-30
Duration: 36 months
Contract type: Collaborative project (generic)

