Ceph/Guide

Ceph Distributed Filesystem
Ceph is a distributed object store and filesystem designed to provide excellent performance, reliability and scalability. According to the Ceph Wikipedia entry, the first stable release (Argonaut) was in 2012. It arose from a doctoral dissertation by Sage Weil at the University of California, Santa Cruz. Significant funding came from the US DOE, as the software found early adoption in clusters in use at the Lawrence Livermore, Los Alamos and Sandia National Labs. The main commercial backing for Ceph comes from Inktank, a company founded by Weil, which was acquired by Red Hat in April 2014.

The FLOSS Weekly podcast interviewed Sage Weil in 2013 for their 250th show. The interview was done around the time that the "Dumpling" release was created. One of the points of discussion was the need for datacenters to handle disaster recovery, and Sage pointed out that starting with Dumpling, Ceph would provide for replication between datacenters.

Overview
Ceph consists of three major components:
 * Distributed Object Store
 * RADOS Block Device (RBD)
 * POSIX compliant Filesystem

Distributed Object Store
This component is always implemented in a Ceph rollout. It sits on top of an existing filesystem such as ext4, xfs, zfs or btrfs and is created and maintained by an Object Storage Daemon (OSD). It is up to the underlying filesystem and volume management scheme to provide the redundancy and reliability for the local object storage, and also the recovery that may be necessary when a drive fails and gets replaced.

An OSD will take advantage of advanced features of the underlying filesystem such as extents, Copy On Write (COW), and snapshotting. It makes extensive use of the xattr feature to store metadata about an object and will often exceed the 4 KB xattr limit of ext4 filesystems, in which case a workaround is necessary. The ceph.com site documentation recommends either ext4 or xfs in production for OSDs, but zfs or btrfs would arguably be better because of their ability to self-repair, snapshot and handle COW. Ultimately btrfs will probably become the preferred underlying filesystem for a Linux-based OSD once the majority is satisfied that it is stable enough.

The task of the OSD is to handle the distribution of objects by Ceph across the cluster. The user can specify the number of copies of an object to be created and distributed amongst the OSDs. The default is 2 copies with a minimum of 1, but those values can be increased up to the number of OSDs that are implemented. Since this redundancy is on top of whatever may be provided by the underlying RAID arrays, the cluster enjoys an added layer of protection that guards against catastrophic failure of a disk array. When a drive array fails, only the OSD or OSDs that make use of it are brought down.
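The trade-off behind the copy count can be illustrated with a little arithmetic. The following is only a rough sketch: the function names are made up for this example, and it ignores filesystem overhead, uneven object distribution and any redundancy in the underlying arrays.

```python
def usable_capacity(osd_sizes_gb, copies=2):
    """Rough usable capacity when every object is stored `copies` times.

    `osd_sizes_gb` is a list with the raw size of each OSD in GB.
    """
    if copies < 1:
        raise ValueError("at least one copy is required")
    if copies > len(osd_sizes_gb):
        raise ValueError("cannot keep more copies than there are OSDs")
    return sum(osd_sizes_gb) / copies


def survivable_osd_losses(copies):
    """How many of the OSDs holding an object can fail before its last copy is gone."""
    return copies - 1


if __name__ == "__main__":
    sizes = [1000, 1000, 1000]  # three hypothetical 1 TB OSDs
    for copies in (1, 2, 3):
        print(copies, usable_capacity(sizes, copies), survivable_osd_losses(copies))
```

Raising the copy count buys tolerance for more simultaneous OSD failures at the price of proportionally less usable space.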

OSDs are watched over by Monitor Servers (MONs) which act as the coordinators for object traffic. The initial Ceph Cluster would consist of a MON and two OSD servers, and this is the example used in their documentation for a quick install. They also talk about an admin server, but this is only a system which is able to painlessly remote into the cluster members using ssh authorized_keys. The admin server would be the system that the user has set up to run Chef, Puppet or other control systems that oversee the operation of the cluster.

A single MON would be a single point of failure for Ceph, so it is recommended that the Ceph Cluster be run with an odd number of MONs, with a minimum of 3 running to establish a quorum. For performance reasons, MONs should be put on a separate filesystem from OSDs because they tend to do a lot of fsyncs. Although they are typically shown as running on dedicated hosts, they can share a host with an OSD. MONs don't need a lot of storage space, so it is probably perfectly fine to have them run on the system drive, while the OSD takes over whatever large disk or array is in the server. A home user who isn't concerned about performance but is nervous about subjecting an SSD-based system disk to large numbers of fsyncs will probably just put the MON and the OSD into separate subvolumes of a btrfs-based array.
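The preference for an odd number of MONs follows from majority voting: a quorum needs more than half of the monitors. A minimal sketch of that arithmetic (the function names are invented for this example) shows that going from 3 to 4 MONs does not let the cluster survive any additional failures:

```python
def quorum_size(mon_count):
    """Smallest number of MONs that forms a strict majority."""
    return mon_count // 2 + 1


def tolerated_failures(mon_count):
    """How many MONs can be down while a majority is still reachable."""
    return mon_count - quorum_size(mon_count)


if __name__ == "__main__":
    print("MONs  quorum  tolerated failures")
    for n in range(1, 6):
        print(f"{n:4d}  {quorum_size(n):6d}  {tolerated_failures(n):18d}")
```

With 3 MONs one failure is tolerated, with 4 still only one, and with 5 two.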

MONs coordinate object traffic by implementing the Controlled Replication Under Scalable Hashing (CRUSH) map. This is an algorithm that computes the locations for storing objects in the OSD pools. MONs also keep track of the map of daemons running the various flavors of Ceph server in the cluster. An "Initial Members" setting allows the user to specify the minimum number of MON servers that must be running in order to form a quorum. There don't appear to be any caveats about running a large number of MONs, so it is probably okay to put a MON wherever an OSD gets implemented.
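The key idea of computing an object's location rather than looking it up in a table can be sketched in a few lines. Note that this is not the CRUSH algorithm itself: it uses plain rendezvous (highest random weight) hashing as a stand-in, and the OSD names and function names are hypothetical.

```python
import hashlib


def _score(object_name, osd_id):
    """Deterministic pseudo-random score for an (object, OSD) pair."""
    digest = hashlib.sha256(f"{object_name}:{osd_id}".encode()).digest()
    return int.from_bytes(digest[:8], "big")


def place(object_name, osd_ids, copies=2):
    """Pick `copies` distinct OSDs for an object by ranking every OSD's score.

    Any client that knows the list of OSDs computes the same answer,
    so no central lookup table is required.
    """
    if copies > len(osd_ids):
        raise ValueError("cannot place more copies than there are OSDs")
    ranked = sorted(osd_ids, key=lambda osd: _score(object_name, osd), reverse=True)
    return ranked[:copies]


if __name__ == "__main__":
    osds = ["osd.0", "osd.1", "osd.2", "osd.3"]
    for name in ("alpha", "beta", "gamma"):
        print(name, place(name, osds))
```

A useful property of this stand-in is that removing an OSD only affects the objects that were actually stored on it; everything else stays where it was. The real CRUSH map additionally takes the physical layout of the cluster (hosts, racks and so on) and per-device weights into account.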

RADOS Block Device
Ceph provides support in the Linux kernel for the RADOS Block Device (RBD). This is essentially a virtual disk device that distributes its "blocks" across the OSDs in the Ceph cluster. An RBD provides the following capabilities:


 * thin provisioning
 * I/O striping and redundancy across the cluster
 * resizable
 * snapshot with revert capability
 * directly usable as a KVM guest's disk device
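How a virtual disk gets spread over the object store can be sketched as simple offset arithmetic. The example below assumes a fixed object size of 4 MiB, which is the usual default for RBD images, and ignores the more elaborate striping options; the function names are invented for illustration.

```python
OBJECT_SIZE = 4 * 1024 * 1024  # assumed object size of 4 MiB


def locate(offset, object_size=OBJECT_SIZE):
    """Map a byte offset in the image to (object index, offset inside that object)."""
    if offset < 0:
        raise ValueError("offset must not be negative")
    return divmod(offset, object_size)


def objects_touched(offset, length, object_size=OBJECT_SIZE):
    """List the indexes of all objects that a read or write of `length` bytes touches."""
    if length <= 0:
        return []
    first = offset // object_size
    last = (offset + length - 1) // object_size
    return list(range(first, last + 1))


if __name__ == "__main__":
    print(locate(0))
    print(locate(OBJECT_SIZE + 10))
    print(objects_touched(OBJECT_SIZE - 4, 8))
```

Because each object is placed independently, consecutive regions of the disk end up on different OSDs, and an object only has to exist once something has been written to its region, which is what makes thin provisioning possible.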

A major selling point for the RBD is the fact that it can be used as a virtual machine's drive store in KVM. Because it spans the OSD server pool, the guest can be live migrated between cluster hosts with little or no down time. Libvirt and Virt-Manager have provided this support for some time now, and it is probably one of the main reasons why Red Hat (a major sponsor of QEMU/KVM, Libvirt and Virt-Manager) has acquired Inktank.

The RBD and the RADOS Gateway provide the same sort of functionality for Cloud Services as Amazon S3 and OpenStack Swift.

POSIX Filesystem
Ceph provides a MetaData Server (MDS) which offers a more traditional style of filesystem based on POSIX standards that translates into objects stored in the OSD pool. This filesystem can then be shared via CIFS and NFS to non-Ceph and non-Linux based systems, including Windows. This is also the way to use Ceph as a drop-in replacement for the Hadoop filesystem (HDFS).

Installation
As of this writing the stable version of Ceph in Portage is  which corresponds to a midway revision of the second major release of Ceph, code named "Bobtail". Gentoo unstable has versions of the follow-on major Ceph updates up to the current major version, "Firefly":


 * "Bobtail"
 * "Cuttlefish"
 * "Dumpling"
 * "Firefly"

After noting that the ceph site online archive only shows release downloads back to Cuttlefish, we decided to unmask unstable  in our   file along with its dependencies and ended up building with Firefly before doing our installation.

If you want to use the RADOS block device, you will need to enable it in your kernel .config, either as a module or built in. Ceph itself will want to have FUSE support enabled if you want to work with the POSIX filesystem component, and you will also want to include the driver for that in Network File Systems. For your backend object stores, you will want xfs support, because of the xattr limitations in ext4, and btrfs support, because it really is becoming stable now.
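A quick way to verify these settings is to scan the kernel .config for the relevant symbols. The sketch below assumes the usual symbol names (CONFIG_BLK_DEV_RBD, CONFIG_CEPH_FS, CONFIG_FUSE_FS, CONFIG_XFS_FS and CONFIG_BTRFS_FS) and the customary Gentoo location /usr/src/linux/.config; adjust both to match your kernel tree. The helper names are made up for this example.

```python
WANTED = {
    "CONFIG_BLK_DEV_RBD": "RADOS block device",
    "CONFIG_CEPH_FS": "Ceph filesystem client (under Network File Systems)",
    "CONFIG_FUSE_FS": "FUSE support",
    "CONFIG_XFS_FS": "xfs backend for OSDs",
    "CONFIG_BTRFS_FS": "btrfs backend for OSDs",
}


def parse_config(text):
    """Return a dict of NAME -> value for every option set in a .config file."""
    options = {}
    for line in text.splitlines():
        line = line.strip()
        # Disabled options appear as comments: "# CONFIG_FOO is not set"
        if not line or line.startswith("#"):
            continue
        name, sep, value = line.partition("=")
        if sep:
            options[name] = value
    return options


def missing_options(text, wanted=WANTED):
    """List the wanted options that are neither built in (y) nor a module (m)."""
    options = parse_config(text)
    return [name for name in wanted if options.get(name) not in ("y", "m")]


if __name__ == "__main__":
    # Assumed path; point this at the .config of the kernel you are building.
    with open("/usr/src/linux/.config") as config_file:
        for name in missing_options(config_file.read()):
            print(f"not enabled: {name} ({WANTED[name]})")
```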