Ceph/Guide

Ceph Distributed Filesystem
Ceph is a distributed object store and filesystem designed to provide excellent performance, reliability and scalability. According to the Ceph Wikipedia entry, the first stable release (Argonaut) was in 2012. It arose from a doctoral dissertation by Sage Weil at the University of California, Santa Cruz. Significant funding came from the US DOE, as the software found early adoption in clusters at Lawrence Livermore, Los Alamos and Sandia National Labs. The main commercial backing for Ceph comes from Inktank, a company founded by Weil, which was acquired by Red Hat in April 2014.

Overview
Ceph consists of three major components:
 * Distributed Object Store
 * RADOS Block Device (RBD)
 * POSIX-compliant Filesystem

Distributed Object Store
This component is always implemented in a Ceph rollout. It sits on top of an existing filesystem such as ext4, xfs, zfs or btrfs and is created and maintained by an Object Storage Daemon (OSD). It is up to the underlying filesystem and volume management scheme to provide redundancy and reliability for the object storage at the local level, as well as any recovery that is necessary when a drive fails and is replaced.

An OSD will take advantage of advanced features of the underlying filesystem such as extents, Copy On Write (COW), and snapshotting. It makes extensive use of extended attributes (xattrs) to store metadata about an object, and will often exceed the 4 KB xattr limit of ext4 filesystems, in which case a workaround is necessary. The ceph.com site documentation recommends either ext4 or xfs in production for OSDs, but zfs and btrfs are arguably better suited because of their ability to self-repair, snapshot and handle COW. Ultimately btrfs will probably become the preferred underlying filesystem for a Linux-based OSD once the majority is satisfied that it is stable enough.

The task of the OSD is to handle the distribution of objects by Ceph across the cluster. The user can specify the number of copies of an object to be created and distributed amongst the OSDs. The default is 2 copies with a minimum of 1, but those values can be increased up to the number of OSDs that are implemented. Since this redundancy is on top of whatever may be provided by the underlying RAID arrays, the cluster enjoys an added layer of protection that guards against catastrophic failure of a disk array. When a drive array fails, only the OSD or OSDs that make use of it are brought down.
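
The trade-off between the number of copies and usable space can be illustrated with a little arithmetic. The following is a minimal sketch, not part of Ceph itself: the function name replication_summary and its parameters are hypothetical, and it assumes equally sized OSDs, that every copy of an object lands on a different OSD, and that filesystem and metadata overhead can be ignored.

```python
def replication_summary(raw_tb_per_osd, num_osds, copies=2, min_copies=1):
    """Estimate capacity and failure tolerance for replicated object storage.

    Hypothetical helper for illustration only. Assumes equally sized OSDs,
    one copy of an object per OSD, and no filesystem or metadata overhead.
    """
    # The number of copies cannot exceed the number of OSDs in the cluster.
    if not 1 <= min_copies <= copies <= num_osds:
        raise ValueError("need 1 <= min_copies <= copies <= num_osds")

    raw_tb = raw_tb_per_osd * num_osds
    return {
        "raw_tb": raw_tb,
        # Every object is stored `copies` times, so usable space shrinks
        # by that factor.
        "usable_tb": raw_tb / copies,
        # With `copies` replicas on distinct OSDs, losing all but one of
        # them still leaves a surviving copy.
        "osd_failures_without_data_loss": copies - 1,
    }


# Six OSDs of 4 TB each, using the default of 2 copies described above.
print(replication_summary(4, 6))
```

Raising the number of copies to 3 in the same six-OSD example would tolerate the loss of two OSDs holding the same object, at the cost of reducing usable space to one third of the raw capacity.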