Ceph/Guide

Ceph Distributed Filesystem
Ceph is a distributed object store and filesystem designed to provide excellent performance, reliability and scalability. According to the Ceph wikipedia entry, the first stable release (Argonaut) was in 2012. It arose from a doctoral dissertation by Sage Weil at the University of California, Santa Cruz. Signficant funding came from the US DOE as the software has found early adoption in clusters in use at Lawrence Livermore, Los Alamos and Sandia National Labs. The main commercial backing for Ceph comes from a company founded by Weil (Inktank) which was acquired by RedHat in April 2014.

The Floss Weekly podcast interviewd Sage Weil in 2013 for their 250th show. The interview was done around the time that the "Cuttlefish" release was created. One of the points of discussion was the need for datacenters to handle disaster recovery, and Sage pointed out that starting with Dumpling, Ceph would provide for replication between datacenters. Another bit of trivia came out in the podcast: Sage Weil was one of the inventors of the WebRing concept in the early days of the World Wide Web.

Overview
Ceph consists of three major components:
 * Distributed Object Store
 * RADOS Block Device (RBD) and RADOS Gateway
 * POSIX compliant Filesystem

Distributed Object Store
This component is always implemented in a Ceph rollout. It sits on top of an existing filesystem such as ext4 , xfs, zfs or btrfs and is created and maintained by an Object Store Device Daemon (OSD). It is up to the underlying filesystem and volume management scheme to provide the redundancy and reliability for the object storage and also the recovery that may be necessary when a drive fails and gets replaced.

An OSD will take advantage of advanced features of the underlying filesystem such as Extents, Copy On Write (COW), and snapshotting. It makes extended use of the xattr feature to store metadata about an object and will often exceed the 4kb limitation of ext4 filesystems such that a workaround will be necessary. The ceph.com site documentation recommends either ext4 or xfs in production for OSDs, but it is obvious that zfs or btrfs would be better because of their ability to self-repair, snapshot and handle COW. Ultimately btrfs will probably become the preferred underlying filesystem for a Linux based OSD when the majority is satisfied that it is stable enough.

The task of the OSD is to handle the distribution of objects by Ceph across the cluster. The user can specify the number of copies of an object to be created and distributed amongst the OSDs. The default is 2 copies with a minimum of 1, but those values can be increased up to the number of OSDs that are implemented. Since this redundancy is on top of whatever may be provided the underlying RAID arrays, the cluster enjoys an added layer of protection that guards against catastrophic failure of a disk array. When a drive array fails, only the OSD or OSDs that make use of it are brought down.

OSDs are watched over by Monitor Servers (MONs) which act as the coordinators for object traffic. The initial Ceph Cluster would consist of a MON and two OSD servers, and this is the example used in their documentation for a quick install. They also talk about an admin server, but this is only a system which is able to painlessly remote into the cluster members using ssh authorized_keys. The admin server would be the system that the user has set up to run Chef, Puppet or other control systems that oversee the operation of the cluster.

A single MON would be a single point of failure for Ceph, so it is recommended that the Ceph Cluster be run with an odd number of MONs with a minimum number of 3 running to establish a quorum. For performance reasons, MONs should be put on a separate filesystem from OSDs because they tend to do a lot of fsyncs. Although they are typically shown as running on dedicated hosts, they can share a host with an OSD. MONs don't need a lot of storage space, so it is probably perfectly fine to have them run on the system drive, while the OSD takes over whatever large disk or array is in the server. A home user who isn't big on performance but nervous about an ssd based system disk and large numbers of fsyncs will probably just throw both the MON and the OSD into separate subvolumes of a btrfs based array.

MONs coordinate object traffic by implementing the Controlled Replication Under Scalable Hashing (CRUSH) map. This is an algorithm that computes the locations for storing objects in the OSD pools. MONS also keep track of the map of daemons running the various flavors of Ceph server in the cluster. An "Initial Members" setting allows the user the specify the minimum number of MON servers that must be running in order to form a quorum. When there are not enough MONs to form a quorum, the Ceph cluster will stop processing until a quorum is re-established in order to avoid a "split-brain" situation.

The CRUSH map defaults to an algorithm that automatically computes which OSDs an object should be placed, but it can be influenced by additional human specified policies. This way, a site administrator can sway CRUSH when making choices such as:


 * use the sites faster OSDs by default
 * divide OSDs into "hot" (SSD based), "normal" and "archival" (slow or tape backed) storage
 * localize replication to OSDs sitting on the same switch or subnet
 * prevent replication to OSDs on the same rack to avoid downtime when an entire RACK has a power failure

RADOS Block Device and RADOS Gateway
Ceph provides support in the Linux kernel for the RADOS Block Device (RBD). This is essentially a virtual disk device that distributes its "blocks" across the OSDs in the Ceph cluster. An RBD provides the following capabilities:


 * thin provisioning
 * i/o striping and redundancy across the Cluster
 * resizeable
 * snapshot with revert capability
 * directly useable as a KVM guest's disk device
 * a variant of COW where a VM starts with a "golden image" which the VM diverges from as it operates
 * Data Replication between datacenters starting with the Dumpling Release

A major selling point for the RBD is the fact that it can be used as a virtual machine's drive store in KVM. Because it spans the OSD server pool, the guest can be hot migrated between cluster CPUs with little or no down time. Libvirt and Virt-Manager have provided this support for some time now, and it is probably one of the main reasons why RedHat (a major sponsor of QEMU/KVM, Libvirt and Virt-Manager) has acquired Inktank.

The RBD and the RADOS Gateway provide the same sort of functionality for Cloud Services as Amazon S3 and OpenStack Swift. The early adopters of Ceph were interested primarily in Cloud Service object stores. Cloud Services also drove the intial work on replication between datacenters.

POSIX Filesystem
Ceph provides a MetaData Server (MDS) which provides a more traditional style of filesystem based on POSIX standards that translates into objects stored in the OSD pool. This is typically where a non-Linux platform can implment client support for Ceph. This can also be shared via CIFS and NFS to non-Ceph and non-Linux based systems including Windows. This is also the way to use Ceph as a drop-in replacement for HADOOP. The filesystem component started to mature around the Dumpling release.

Ceph requires all of its servers to be able to see each other directly in the cluster. So this filesystem would also be the point where external systems would be able to see the content without having direct access to the Ceph Cluster. For performance reasons, the user may have all of the Ceph cluster participants using a dedicated network on faster hardware such as Infiniband with isolated switches. The MDS server would then have multiple NICs to straddle the Ceph network and the outside world.

Installation
As of this writing the stable version of Ceph in portage is  which corresponds to a midway rev of the second major release of Ceph code named "Bobtail". In gentoo unstable are versions of the follow-on major Ceph updates up to the current major version "Firefly":


 * "Bobtail"
 * "Cuttlefish"
 * "Dumpling"
 * "Firefly"

After noting that the ceph site online archive only shows release downloads back to Cuttlefish, we decided to unmask unstable  in our   file along with its dependencies and ended up building with Firefly before doing our installation.

Kernel Config
If you want to use the RADOS block device, you will need to put that into your kernel .config as either a module or baked in. Ceph itself will want to have FUSE support enabled if you want to work with the POSIX filesystem component and you will also want to include the driver for that in Network File Systems. For your backend object stores, you will want to have xfs support because of the xattr limitations in Ext4 and btrfs because it really is becoming stable now.

Network Config
Ceph is sensitive to IP address changes, so you should make sure that all of your Ceph servers are assigned static IP addresses. You also may want to proactively treat the Ceph cluster members as an independent subnet from your existing network by multi-homing your existing network adapters as necessary. That way if an ISP change or other topology changes are needed, you can keep your cluster setup intact. It also gives you the luxury of migrating the ceph subnet later on to dedicated nics, switches and faster hardware such as 10Gbit ethernet or Infiniband. If the cluster subnet is small enough, consider keeping the hostnames in your /etc/hosts files, at least until things grow to the point where a pair of DNS servers among the cluster members becomes a compelling solution.

We will be using four hosts in our Example implementation. All four will be MON servers with an initial quorum of 3 so that we can safely avoid a "split-brain" situation and still be able to run the cluster when a single server is rebooted.

Our Example Ceph Cluster
We have chosen to roll out Ceph on a portion of our home network. The four kroll hosts are as follows:


 * kroll1 (aka Thufir) - An AMD FX9590 8 core CPU with 32GB of memory, 256GB SSD root drive and a 4x4TB SATA array formatted as a RAID5 btrfs with the default volume mounted on .  kroll1 will act as our admin server since the ssh keys for its root user have been pushed out to the other nodes in their   files.  Kroll1 will act as a MON and OSD server since only slightly less than half of the btrfs array has been used.
 * kroll2 (aka Figo) - An AMD FX8350 8 core CPU with 16GB of memory, 256GB SSD root drive and a 4x3TB SATA array formatted as btrfs RAID1. Kroll2 will act as a MON and MDS server.  We will not do an OSD server here since the array is already over 90% capacity.  Also the Ceph developers have suggested that it is not a wise idea to run an MDS and an OSD on the same node.
 * kroll3 (aka Mater) - An AMD FX8350 8 core CPU with 16GB of memory, 256GB SSD and 4x4TB SATA array formatted as a RAID5 btrfs and default volume mounted on .  /materraid was being kept as a mirror of /thufirraid using   on a periodic basis.  kroll3 will become a MON and an OSD server.
 * kroll4 (aka Tube) - An AMD A10-7850K APU with 16GB of memory, 256GB SSD and a 2x4TB SATA array formatted as a btrfs RAID1 mirror set with its default volume mounted on .  As its real name suggests, kroll4 was originally set up a MythTV box, but its filesystem is only averaging about 10% of its capacity.  We will thus use kroll4 as a MON and OSD server.

Thufir, Mater and Tube run a gentoo stable desktop profile and are currently on kernel. Thufir and Mater are btrfs installs with new drives using the latest btrfs code. Tube has been up and running a bit longer and had its btrfs array built under kernel. Figo runs a gentoo unstable desktop profile and is currenly on kernel. It is also being used as an rsync mirror for thufir, but its array has been running for about 2 years. The reason for  running at 90% capacity is both due to older smaller drives (3TB versus 4TB) and also because the version of btrfs available when it was built did not yet include RAID5 support.

Editing the ceph config file
We will be following the manual guide for ceph installation on their site. There is also a python based script call ceph-deploy which is packaged for a number of distros but not directly available for gentoo. If you can manage to get it working, it would automate a good bit of the process of rolling out a server from your admin node.

We use  to generate a new random uuid for the entire cluster. We will rename the cluster name from the default  to   to match our host naming scheme. We specify the 192.168.2 network to be the "public" network for the cluster. Other default settings come from the manual install url mentioned earlier, including a default to replicate two copies of each object with a minimum of 1 copy allowed when the cluster is in "degraded" state.

We set a journal size in the  global section but leave the filestore stanza commented out since we will be using btrfs for object stores instead of ext4. We also added a little extra language to this to clarify exactly what it means.

We add a  global section where we specify the list of hostnames that will act as mons along with their corresponding ip addresses. The port number  is the IANA registered well known port assigned to Ceph. The initial members stanza specifies the three hosts which will be necessary to form an initial quorum. The list of numbers corresponds to the list of mon sections that will follow for,   etc. These will also be used by the /etc/init.d/ceph startup script when figuring out which services are to be started for a host.

The rest of the  file consists of subsections for the various mon, osd and mds servers that we will be implementing.

After editing the file, we copy it around to the other cluster members from our admin node kroll1 using

/etc/conf.d/ceph file
There is also a conf file for the ceph service, but as of the Firefly release, there is only the location of the conf file to specify. This is because in previous releases, there was only an /etc/init.d/ceph script that needed to be worked with. The single script would start up or shut down all of the services enabled for the site at once. The ebuild maintainer changed this for Firefly to use renamed softlinks of  in order to specify the running of individual Ceph services. It is somewhat similar to what gentoo does with  when enabling network devices.