Ceph/Guide

Ceph Distributed Filesystem
Ceph is a distributed object store and filesystem designed to provide excellent performance, reliability and scalability. According to the Ceph Wikipedia entry, the first stable release (Argonaut) was in 2012. It arose from a doctoral dissertation by Sage Weil at the University of California, Santa Cruz. Significant funding came from the US DOE, as the software found early adoption in clusters at Lawrence Livermore, Los Alamos and Sandia National Labs. The main commercial backing for Ceph comes from Inktank, a company founded by Weil, which was acquired by Red Hat in April 2014.

The FLOSS Weekly podcast interviewed Sage Weil in 2013 for their 250th show. The interview was done around the time that the "Cuttlefish" release was created. One of the points of discussion was the need for datacenters to handle disaster recovery, and Sage pointed out that starting with Dumpling, Ceph would provide for replication between datacenters. Another bit of trivia came out in the podcast: Sage Weil was one of the inventors of the WebRing concept in the early days of the World Wide Web.

Overview
Ceph consists of four major components:
 * Object Store Device
 * Monitor Server
 * RADOS Block Device (RBD) and RADOS Gateway
 * Metadata Server providing a POSIX compliant Filesystem

Object Store Device
Two object stores mark the beginning of a Ceph cluster and they may be joined by potentially thousands more. They sit on top of an existing filesystem such as ext4, xfs, zfs or btrfs and are created and maintained by an Object Store Device Daemon (OSD). While the underlying filesystem may provide for redundancy, error detection and repair on its own, Ceph implements its own layer of error detection, recovery and n-way replication. There is a tradeoff between putting a RAID 1, 5, 6 or 10 scheme under the filesystem and running a single OSD server on top of it, versus using individual drives with multiple OSD servers. The former provides a defense-in-depth strategy against data loss, but the latter has less of an impact on the cluster when a drive fails and requires replacement. The latter also potentially provides better performance than a software RAID or a filesystem built on top of a number of JBOD devices.

An OSD will take advantage of advanced features of the underlying filesystem such as extents, Copy On Write (COW), and snapshotting. It can make extended use of the xattr feature to store metadata about an object, but this will often exceed the 4 KB limitation of ext4 filesystems such that an alternative metadata store will be necessary. The ceph.com site documentation recommends either ext4 or xfs in production for OSDs, but zfs or btrfs are arguably a better fit because of their ability to self-repair, snapshot and handle COW. Ultimately btrfs will become the preferred underlying filesystem for a Linux based OSD when the majority is satisfied that it is stable enough. If you are still unsure about btrfs, look at the performance tuning research that has been done already such as that done by an Inktank employee on Bobtail at this page.

The task of the OSD is to handle the distribution of objects by Ceph across the cluster. The user can specify the number of copies of an object to be created and distributed amongst the OSDs. The default is 2 copies with a minimum of 1, but those values can be increased up to the number of OSDs that are implemented. Since this redundancy is on top of whatever may be provided by the underlying RAID arrays, the cluster enjoys an added layer of protection that guards against catastrophic failure of a disk array. When a drive array fails, only the OSD or OSDs that make use of it are brought down.
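The effect of the copy count and the minimum can be illustrated with a small sketch. This is only a model of the arithmetic, not Ceph code: the function names and the state labels ("healthy", "degraded", "blocked") are made up for illustration, and the defaults mirror the 2 copies / minimum of 1 described above.

```python
def pool_state(available_copies: int, size: int = 2, min_size: int = 1) -> str:
    """Classify a pool by how many copies of its objects are reachable.

    size     -- number of copies the pool wants (hypothetical default: 2)
    min_size -- fewest copies with which I/O is still allowed (default: 1)
    """
    if not 1 <= min_size <= size:
        raise ValueError("need 1 <= min_size <= size")
    if available_copies >= size:
        return "healthy"      # every requested copy exists
    if available_copies >= min_size:
        return "degraded"     # still serving I/O, but short of copies
    return "blocked"          # too few copies to serve I/O safely


def usable_capacity_tb(raw_tb: float, size: int = 2) -> float:
    """Rough usable capacity: each object is stored `size` times."""
    return raw_tb / size


for copies in (2, 1, 0):
    print(copies, "copies ->", pool_state(copies))
print("usable TB from 40 TB raw:", usable_capacity_tb(40.0))
```

With a copy count of 2, roughly half of the raw capacity is available for data; raising the count trades capacity for resilience.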

Objects are broken down into extents, or shards, when distributed rather than being treated as a single entity. In a 2-way replication scheme where there are more than 2 OSD servers, an object's shards will actually end up distributed across potentially all of the OSD servers.

Monitor Server
OSDs are watched over by Monitor Servers (MONs) which act as the coordinators for object traffic. The initial Ceph Cluster would consist of a MON and two OSD servers, and this is the example used in their documentation for a quick install. They also talk about an admin server, but this is only a system which is able to painlessly remote into the cluster members using ssh authorized_keys. The admin server would be the system that the user has set up to run Chef, Puppet or other control systems that oversee the operation of the cluster.

A single MON would be a single point of failure for Ceph, so it is recommended that the Ceph Cluster be run with an odd number of MONs, with at least 3 running to establish a quorum. For performance reasons, MONs should be put on a separate filesystem or device from OSDs because they tend to do a lot of fsyncs. Although they are typically shown as running on dedicated hosts, they can share a host with an OSD and often do in order to have enough MON servers for a decent quorum. MONs don't need a lot of storage space, so it is perfectly fine to have them run on the system drive, while the OSD takes over whatever large disk or array is in the server. If you dedicate an SSD to handle OSD journals, the MON storage will only require another 2 GB or so.

MONs coordinate object traffic by implementing the Controlled Replication Under Scalable Hashing (CRUSH) map. This is an algorithm that computes the locations for storing objects in the OSD pools. MONs also keep track of the map of daemons running the various flavors of Ceph server in the cluster. An "Initial Members" setting allows the user to specify the minimum number of MON servers that must be running in order to form a quorum. When there are not enough MONs to form a quorum, the Ceph cluster will stop processing until a quorum is re-established in order to avoid a "split-brain" situation.
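The reason for preferring an odd number of MONs falls out of simple majority arithmetic. The sketch below is not part of Ceph; the function names are invented for illustration, and it assumes a quorum is a strict majority of the configured monitors.

```python
def quorum_size(total_mons: int) -> int:
    """Smallest strict majority of `total_mons` monitors."""
    if total_mons < 1:
        raise ValueError("need at least one monitor")
    return total_mons // 2 + 1


def tolerated_failures(total_mons: int) -> int:
    """How many monitors can be down while a majority is still up."""
    return total_mons - quorum_size(total_mons)


for n in range(1, 6):
    print(f"{n} MONs: quorum needs {quorum_size(n)}, "
          f"survives {tolerated_failures(n)} down")
```

Going from 3 to 4 MONs raises the quorum size without allowing any additional failures, which is why even counts buy little.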

The CRUSH map defaults to an algorithm that automatically computes where in the OSDs an object's shards should be placed, but it can be influenced by additional human specified policies. This way, a site administrator can sway CRUSH when making choices such as:


 * use the site's faster OSDs by default
 * divide OSDs into "hot" (SSD based), "normal" and "archival" (slow or tape backed) storage
 * localize replication to OSDs sitting on the same switch or subnet
 * prevent replication to OSDs on the same rack to avoid downtime when an entire rack has a power failure

It is this spreading out of the load with the CRUSH map that allows Ceph to scale up to thousands of OSDs so easily while increasing performance as new stores are added. Because of the spreading, the bottleneck shifts from raw disk performance (about 100 MB/sec for a SATA drive, for example) to the bandwidth capacity of your network and switches.
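The idea of computing placement from a hash, steered by a policy, can be sketched in a few lines. To be clear, this is not the CRUSH algorithm: it uses plain highest-random-weight (rendezvous) hashing as a stand-in, with a single policy rule that no two copies may land on the same host. The object name and the extra `osd.3` entry are hypothetical.

```python
import hashlib


def _score(object_name: str, osd: str) -> int:
    """Deterministic pseudo-random score for an (object, OSD) pair."""
    digest = hashlib.sha256(f"{object_name}/{osd}".encode()).digest()
    return int.from_bytes(digest[:8], "big")


def place(object_name: str, osd_to_host: dict, replicas: int = 2) -> list:
    """Pick up to `replicas` OSDs, each on a different host.

    Every client that knows the same map computes the same answer,
    so no central lookup table is needed.
    """
    ranked = sorted(osd_to_host,
                    key=lambda osd: _score(object_name, osd),
                    reverse=True)
    chosen, used_hosts = [], set()
    for osd in ranked:
        host = osd_to_host[osd]
        if host in used_hosts:
            continue  # policy: never two copies on one host
        chosen.append(osd)
        used_hosts.add(host)
        if len(chosen) == replicas:
            break
    return chosen  # may be shorter if there are too few hosts


# osd.3 is a hypothetical second OSD on kroll1, added to exercise the policy.
LAYOUT = {"osd.0": "kroll1", "osd.1": "kroll3",
          "osd.2": "kroll4", "osd.3": "kroll1"}
print(place("example-object", LAYOUT))
```

Because the result depends only on the object name and the map, adding an OSD moves just the objects whose highest-scoring candidates change, which is the property that lets placement scale without a directory service.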

There are a number of ways to work with the MON pool and Paxos database to monitor and administer the cluster, but the most common is the  command. This is a Python script that uses a number of Ceph supplied Python modules that use JSON to communicate with the MON pool.

RADOS Block Device and RADOS Gateway
Ceph provides a kernel module for the RADOS Block Device (RBD) and a librados library which libvirt and KVM can be linked against. This is essentially a virtual disk device that distributes its "blocks" across the OSDs in the Ceph cluster. An RBD provides the following capabilities:


 * thin provisioning
 * i/o striping and redundancy across the Cluster
 * resizeable
 * snapshot with revert capability
 * directly useable as a KVM guest's disk device
 * a variant of COW where a VM starts with a "golden image" which the VM diverges from as it operates
 * Data Replication between datacenters starting with the Dumpling Release

A major selling point for the RBD is the fact that it can be used as a virtual machine's drive store in KVM. Because it spans the OSD server pool, the guest can be hot migrated between cluster hosts by literally shutting the guest down on one host and booting it on another. Libvirt and Virt-Manager have provided this support for some time now, and it is probably one of the main reasons why Red Hat (a major sponsor of QEMU/KVM, Libvirt and Virt-Manager) has acquired Inktank.

The RBD and the RADOS Gateway provide the same sort of functionality for Cloud Services as Amazon S3 and OpenStack Swift. The early adopters of Ceph were interested primarily in Cloud Service object stores. Cloud Services also drove the initial work on replication between datacenters.

Metadata Server
Ceph provides a Metadata Server (MDS) which offers a more traditional style of filesystem based on POSIX standards that translates into objects stored in the OSD pool. This is typically where a non-Linux platform can implement client support for Ceph. The filesystem can be shared via CIFS and NFS to non-Ceph and non-Linux based systems including Windows. This is also the way to use Ceph as a drop-in replacement for the Hadoop filesystem. The filesystem component started to mature around the Dumpling release.

Ceph requires all of its servers to be able to see each other directly in the cluster. So this filesystem would also be the point where external systems would be able to see the content without having direct access to the Ceph Cluster. For performance reasons, the user may have all of the Ceph cluster participants using a dedicated network on faster hardware with isolated switches. The MDS server would then have multiple NICs to straddle the Ceph network and the outside world.

As of the Firefly release, there is only one active MDS server at a time. Other MDS servers run in a standby mode to quickly perform a failover when the active server goes down. The cluster will take about 30 seconds to determine whether the active MDS server has failed. This may appear to be a bottleneck for the cluster, but the MDS only does the mapping of POSIX file names to object ids. With an object id, a client then directly contacts the OSD servers to perform the necessary i/o of extents/shards.

Eventually Ceph will allow multiple active MDS servers, dividing the POSIX filesystem namespace with a mapping scheme that distributes the load.

Installation
As of this writing the stable version of Ceph in portage is  which corresponds to a midway rev of the second major release of Ceph code named "Bobtail". In gentoo unstable are versions of the follow-on major Ceph updates up to the current major version "Firefly":


 * "Bobtail"
 * "Cuttlefish"
 * "Dumpling"
 * "Firefly"
 * "Giant"

The ceph site online archive only shows release downloads back to Cuttlefish. Also the MDS server doesn't begin to stabilize until around Dumpling. We decided to unmask unstable  in our   file along with its dependencies and ended up building with Firefly before doing our installation.

Outside of Portage, there is a development only release on the ceph site (0.84.x) and an upcoming release "Giant" due to drop, probably around Q4 2014. In the weeks since the creation of this wiki entry and the current edit, there have already been a number of ebuild updates to Firefly which we have installed. The example cluster was initially installed with 0.80.1 and is now running 0.80.5.

Kernel Config
If you want to use the RADOS block device, you will need to put that into your kernel .config as either a module or baked in. Ceph itself will want to have FUSE support enabled if you want to work with the POSIX filesystem component and you will also want to include the driver for that in Network File Systems. For your backend object stores, you will want to have xfs support because of the xattr limitations in Ext4 and btrfs because it really is becoming stable now.

Network Config
Ceph is sensitive to IP address changes, so you should make sure that all of your Ceph servers are assigned static IP addresses. You also may want to proactively treat the Ceph cluster members as an independent subnet from your existing network by multi-homing your existing network adapters as necessary. That way if an ISP change or other topology changes are needed, you can keep your cluster setup intact. It also gives you the luxury of migrating the ceph subnet later on to dedicated NICs, switches and faster hardware such as 10Gbit Ethernet or InfiniBand. If the cluster subnet is small enough, consider keeping the hostnames in your /etc/hosts files, at least until things grow to the point where a pair of DNS servers among the cluster members becomes a compelling solution.
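As a rough illustration of keeping the cluster on its own subnet with names in /etc/hosts, the sketch below checks that every host address sits inside the cluster network and renders the matching hosts-file lines. The addresses and the /24 mask are assumptions made for the example; the guide only states that the 192.168.2 network is used as the cluster's public network.

```python
import ipaddress

# Assumed mask; the guide only names the 192.168.2 network.
PUBLIC_NETWORK = ipaddress.ip_network("192.168.2.0/24")

# Hypothetical static addresses for the four example hosts.
HOSTS = {
    "kroll1": "192.168.2.1",
    "kroll2": "192.168.2.2",
    "kroll3": "192.168.2.3",
    "kroll4": "192.168.2.4",
}


def hosts_file_lines(hosts: dict, network) -> list:
    """Return /etc/hosts style lines, rejecting addresses off the subnet."""
    lines = []
    for name, addr in sorted(hosts.items()):
        if ipaddress.ip_address(addr) not in network:
            raise ValueError(f"{name} ({addr}) is outside {network}")
        lines.append(f"{addr}\t{name}")
    return lines


print("\n".join(hosts_file_lines(HOSTS, PUBLIC_NETWORK)))
```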

We will be using four hosts in our Example implementation. All four will be MON servers with an initial quorum of 3 so that we can safely avoid a "split-brain" situation and still be able to run the cluster when a single server is rebooted.

Our Example Ceph Cluster
We have chosen to roll out Ceph on a portion of our home network. The four kroll hosts are as follows:


 * kroll1 (aka Thufir) - An AMD FX9590 8 core CPU with 32GB of memory, 256GB SSD root drive and a 4x4TB SATA array formatted as a RAID5 btrfs with the default volume mounted on .  kroll1 will act as our admin server since the ssh keys for its root user have been pushed out to the other nodes in their   files.  Kroll1 will act as a MON and OSD server since only slightly less than half of the btrfs array has been used.
 * kroll2 (aka Figo) - An AMD FX8350 8 core CPU with 16GB of memory, 256GB SSD root drive and a 4x3TB SATA array formatted as btrfs RAID1. Kroll2 will act as a MON and MDS server.  We will not do an OSD server here since the array is already over 90% capacity.  Also the Ceph developers have suggested that it is not a wise idea to run an MDS and an OSD on the same node.
 * kroll3 (aka Mater) - An AMD FX8350 8 core CPU with 16GB of memory, 256GB SSD and 4x4TB SATA array formatted as a RAID5 btrfs and default volume mounted on .  /materraid was being kept as a mirror of /thufirraid using   on a periodic basis.  kroll3 will become a MON and an OSD server.
 * kroll4 (aka Tube) - An AMD A10-7850K APU with 16GB of memory, 256GB SSD and a 2x4TB SATA array formatted as a btrfs RAID1 mirror set with its default volume mounted on .  As its real name suggests, kroll4 was originally set up as a MythTV box, but its filesystem is only averaging about 10% of its capacity.  We will thus use kroll4 as a MON and OSD server.

Thufir, Mater and Tube run a Gentoo stable desktop profile and are currently on kernel. Thufir and Mater are btrfs installs with new drives using the latest btrfs code. Tube has been up and running a bit longer and had its btrfs array built under kernel. Figo runs a Gentoo unstable desktop profile and is currently on kernel. It is also being used as an rsync mirror for thufir, but its array has been running for about 2 years. The reason for  running at 90% capacity is both due to older smaller drives (3TB versus 4TB) and also because the version of btrfs available when it was built did not yet include RAID5 support.

Editing the ceph config file
We will be following the manual guide for ceph installation on their site. There is also a Python based script called ceph-deploy which is packaged for a number of distros but not directly available for Gentoo. If you can manage to get it working, it would automate a good bit of the process of rolling out a server from your admin node.

We use  to generate a new random uuid for the entire cluster. We will rename the cluster name from the default  to   to match our host naming scheme. We specify the 192.168.2 network to be the "public" network for the cluster. Other default settings come from the manual install url mentioned earlier, including a default to replicate two copies of each object with a minimum of 1 copy allowed when the cluster is in "degraded" state.

We set a journal size in the  global section but leave the filestore stanza commented out since we will be using btrfs for object stores instead of ext4. We also added a little extra language to this to clarify exactly what it means.

We add a  global section where we specify the list of hostnames that will act as mons along with their corresponding ip addresses. The port number  is the IANA registered well known port assigned to Ceph. The initial members stanza specifies the three hosts which will be necessary to form an initial quorum. The list of numbers corresponds to the list of mon sections that will follow for,   etc. These will also be used by the /etc/init.d/ceph startup script when figuring out which services are to be started for a host.

The rest of the  file consists of subsections for the various mon, osd and mds servers that we will be implementing.
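To make the layout described above more concrete, here is a minimal sketch that assembles such a config file with Python's standard configparser. It is not the file used on the example cluster: the monitor addresses are hypothetical, the /24 mask is assumed, the journal size and the osd/mds sections are omitted, and the monitor settings are placed in the global section the way the upstream manual deployment page does. The option names (fsid, public network, osd pool default size, osd pool default min size, mon initial members, mon host, mon addr) are standard ceph.conf options, and 6789 is the registered monitor port.

```python
import configparser
import io
import uuid

MON_PORT = 6789  # IANA-registered Ceph monitor port

# (mon id, hostname, address); the addresses are hypothetical.
MONS = [
    ("0", "kroll1", "192.168.2.1"),
    ("1", "kroll2", "192.168.2.2"),
    ("2", "kroll3", "192.168.2.3"),
    ("3", "kroll4", "192.168.2.4"),
]


def build_ceph_conf(fsid: str, mons: list, initial_count: int = 3) -> str:
    """Render a small ceph.conf style text for the given monitors."""
    conf = configparser.ConfigParser(interpolation=None)
    conf["global"] = {
        "fsid": fsid,
        "public network": "192.168.2.0/24",   # assumed mask
        "osd pool default size": "2",          # copies per object
        "osd pool default min size": "1",      # allowed while degraded
        "mon initial members": ", ".join(m[0] for m in mons[:initial_count]),
        "mon host": ", ".join(f"{m[2]}:{MON_PORT}" for m in mons),
    }
    # One subsection per monitor, keyed by its id number.
    for mon_id, host, addr in mons:
        conf[f"mon.{mon_id}"] = {
            "host": host,
            "mon addr": f"{addr}:{MON_PORT}",
        }
    buf = io.StringIO()
    conf.write(buf)
    return buf.getvalue()


# uuid.uuid4() plays the role of generating a fresh random cluster id.
print(build_ceph_conf(str(uuid.uuid4()), MONS))
```

Generating the file from one list of monitors keeps the initial members line, the host list and the per-monitor subsections from drifting apart as hosts are added.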

After editing the file, we copy it around to the other cluster members from our admin node kroll1 using

/etc/conf.d/ceph file
There is also a conf file for the ceph service, but as of the Firefly release, there is only the location of the conf file to specify. This is because in previous releases, there was only an /etc/init.d/ceph script that needed to be worked with. The single script would start up or shut down all of the services enabled for the site at once. The ebuild maintainer changed this for Firefly to use renamed softlinks of  in order to specify the running of individual Ceph services. It is somewhat similar to what gentoo does with  when enabling network devices.

/etc/init.d/ceph script
As noted earlier when discussing MON servers, Ceph is dependent on clocks that are synchronized across the cluster. Until the ebuild maintainers update the script, consider editing your  script's depend function to include whatever service you may have running that has synched your clock. In this example edit to the firefly version of the script, we have added "after ntp-client" since we use the standard ntp ebuild and have ntp-client and ntp services in our default runlevel.

Creating Keyrings For MON rollout
Ceph uses its own shared secret concept when handling communications among cluster members. We must generate keyring files that will then be distributed out to the servers that will be set up among the cluster members. The keyrings are generated by the  command. The first keyring is for the mon servers. The manual install url has it going to a file on, but we are more inclined to keep it around by parking it in

The result is a readable text file:

Next we create an admin keyring file which goes into.

The resulting text file may actually be shorter than the complicated command line used to create it. The redacted key here is the same as the one that appears in our mon keyring, so it must be based on the uuid we parked in the  config file.

This next command is just as annoying, because it wasn't until after running it that we discovered that the auth tool basically just appended the client admin keyring file contents to the mon keyring file.

We push the mon and client keyrings out to  on the other kroll hosts.

Creating the initial monmap file
The OSD and MDS servers use the  for discovering MON servers, but the MON servers themselves have a much stricter consistency scheme in order to form and maintain their quorum. When up and running, the quorum uses a majority-rule voting system called Paxos, but MONs do work with an initial binary file called a monmap when you first set up the Ceph cluster.

The manual deployment page covers the example where only a single MON is used to form the quorum. We referred to the Monitor Config reference page and the Monitor Bootstrap page it refers to when creating our scheme with an initial quorum of three MONs.

The  command is used to create the initial monmap binary file. We essentially give it the addresses corresponding to our    and the cluster fsid from   file. We will park this file in  and then pass it around to the right place when we configure our MON hosts.

We push the initial monmap file out to the other  directories on kroll2, 3 and 4.

Creating kroll1 server mon.0
Ceph servers look for their file trees in. Mon servers look for their server id number subtree under  where N is the id # that we designated for the server in. Kroll1 will host mon.0, so we create  for it. This implies that we will be using our 256GB SSD root system device for mon.0's i/o. Later on when we create the OSD for kroll1, we will be creating and mounting a btrfs subvolume for it to use. Otherwise the object store would default to eating us out of house and home on our system drive!

Before continuing on, you may want to look at  to clear out anything that may be in there. The next command will create an empty  file if it doesn't already exist.

The  command will populate the ceph.0 directory with a copy of our ceph.mon.keyring file renamed to   and a   directory tree which is a Paxos database reflecting the contents of the initial monmap file.

We set up the mon.0 server startup in /etc/init.d by softlinking. Examination of the ceph script reveals that the gentoo ebuild developer was merely picking the server type and id # from string positions in the script file name. Thus it isn't crucial to use the "." and the "-" as we did here, but it does make things readable:

We repeated the same process to create  and   on the other kroll member hosts.
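The observation that the init script takes the server type and id from fixed string positions in its own file name can be illustrated as follows. This is a sketch of the idea, not the ebuild's shell code: the link name `ceph.mon-0` and the exact offsets (type at characters 5 to 7, id from character 9) are assumptions chosen to match the "." and "-" naming described above.

```python
def parse_service_name(svcname: str) -> tuple:
    """Split a link name such as 'ceph.mon-0' into ('mon', '0').

    Only character positions matter, so the separator characters
    themselves are never inspected.
    """
    if len(svcname) < 10 or not svcname.startswith("ceph"):
        raise ValueError(f"unexpected service name: {svcname!r}")
    daemon_type = svcname[5:8]   # three letters: mon, osd or mds
    daemon_id = svcname[9:]      # everything after the second separator
    return daemon_type, daemon_id


for name in ("ceph.mon-0", "ceph.osd-2", "ceph_mds_0"):
    print(name, "->", parse_service_name(name))
```

The last example shows why the choice of "." and "-" is cosmetic under this scheme: any single character in those two positions yields the same result.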

Starting the Mon Servers
With all four kroll hosts configured with mons, we now go back and start the services beginning with  on kroll1.

Notice the ".fault" entries at the bottom of the log. We will continue to see them until the quorum is established. We will also see them as output when attempting to do monitor style commands such as  until there is a quorum.

Starting mon.1 on kroll2
We start up the mon.1 server on kroll2. The mon log for this server will show it discovering the kroll1  peer server, and the two start using Paxos to hold a quorum election. Until  gets spun up, we still won't have a quorum.

Starting mon.2 on kroll3
There is now also a  for the cluster with a briefer summary. There is a warning that the system drive is over 70% full on kroll1 since we have also been using it for a /home directory there. We may migrate that to the btrfs array later if it becomes necessary.

The commands in the manual install page to check cluster sanity:  and   will now work, but of course we don't have any OSDs spun up yet.

Starting mon.3 on kroll4
The appearance of  causes a new monitor election as noted in the ceph.log on kroll4

And now we are four. The health will stay at HEALTH_ERR or degraded until we get at least two OSDs spun up. We need two since that's the default replication count for objects as set in the.

Creating osd.0 on kroll1
We use  to create a unique id which will be used in our first osd. With the mon servers up and running and maintaining the server map in Paxos, we will need to follow the id numbers returned to us from  and then retrofit   if necessary if we lose an osd later or do things out of order.

Since this will be the first osd for the mon servers, we get assigned id 0 to create  on kroll1. As we have noted earlier, kroll1 (Thufir) already has a btrfs raid5 array up and running with the default volume mounted on. The "normal" content is in the subvolume  which is mounted on. We will add a new subvolume called  which will be mounted to   for use by the new osd server.

We add the new subvolume to our  and take the liberty of turning on automatic defragmentation and on-the-fly lzo compression for the subvolume. The osd will actually manage some subvolumes of its own underneath this mountpoint ( and   rolling snapshots).

Now we let ceph-osd have its way with our new btrfs subvolume.

We now have a  which shows us all of the activity and the btrfs features that the osd decided to take advantage of:

The resulting filesystem looks like this:

The  and   directories are btrfs snapshots. The keyring file is a new unique key that was generated for.

We use  to transfer that into the Paxos database in the mon servers.

The  actually contains the fsid for the cluster from   and not the one we passed when creating the osd itself:

We save the uuid we used for osd.0 in a text file in  just in case. It would probably be a bad idea to try to recycle an osd id number if we ever have a filesystem go bad, but you never know...

We set up kroll1 as a host in the CRUSH map and make it part of the  osd tree.

We can now see kroll1 as a host in the default tree using. also appears but hasn't been assigned to a host.

We now put  in the crush map under kroll1 with a default weighting value.

All that is left is to enable and start the osd.0 service.

With one osd spun up, our cluster is now operating in a degraded state.

osd.1 on kroll3
osd.1 will use our other big btrfs array on host kroll3 (Mater).

Since we are visiting hosts in the right order, we get to set up osd.1 on kroll3 just as we had anticipated in our  file. Mater's btrfs array has its default volume mounted on

With two osds running now, the cluster moves from degraded to "warn" state. This may be leftover from our warning about the root filesystem on kroll1.

osd.2 on kroll4
After shoving a couple of data archives from /home to the /raid filesystem on thufir, we got the root drive down to 51% of capacity. The cluster now shows HEALTH_OK.

We will now add our final (for the moment) object store as osd.2 on kroll4. Its btrfs mirror set has its default volume mounted on. It also has a different subvolume setup from thufir and mater. Instead of mounting and exporting a single  subvolume, it has subvolumes for   and a dedicated volume for virtual machines as. Another wrinkle is that this system has been set up to use btrfs on the SSD for its /boot and / drives. In fact, the root drive is mounted as a subvolume of

It suddenly strikes us that we are almost assuredly the first kid in our neighborhood to set up a cluster with almost 40 TB of distributed object store in the comfort of our own home. In fact, we are probably outclassing all but one or two NIH funded research labs that are nearby. The cost of the hardware was probably in the range of $5-7K since it was built from scratch, and it may even have been cheaper than a single desktop video editing system sold by a certain fruity computer company.

Setting up mds.0 on kroll2
The Ceph site wants us to use ceph-deploy to set up the mds servers, but that script has yet to be ported and packaged for Gentoo. We found enough information on a blogger's site to do a manual install for Gentoo. The  needs an edit to introduce a global   section for the default location of the mds server and the keyring that it needs.

The rest of the file stays untouched including the [mds.0] section that we had put in much earlier. We pass around the updated conf file to the other hosts.

On kroll2 we create the mds-0 directory and then use  to create its keyring. The result is only a single file with a simple key value stanza.

We then create a softlink of the ceph openrc script to enable and start mds-0.

Looking at the ceph-mds-0.log and the  command shows us that everything is fine and we now have an mds server running. We should now be able to use mount.ceph and export out the ceph namespace over nfs and cifs.

Creating and exporting the POSIX Filesystem
Once again, the Ceph web site offered scant details about how to go about mounting the object store as a POSIX filesystem with cephx authentication. After a lot of googling around after mount errors 5, etc, we hit upon the magic sauce that is necessary. If you have compiled the ceph network filesystem and ceph lib as modules, you do not need to worry about a manual modprobe to have them loaded. The  command will take care of that for you. You can confirm that the following two modules are loaded after trying your first mount:

If you remember, the ceph.client.admin.keyring you created back in the beginning of the install included an allow capability for mds operations. However, the format of that keyring file will not work with mount.ceph. We need to copy only the key value itself as the contents of an  file.
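Pulling the bare key value out of a keyring can be done by hand, but a short sketch shows what is involved. The keyring text below is a made-up sample in the usual keyring layout (an entity name in brackets followed by indented `key = ...` and caps lines); the key string is a placeholder, not a real secret, and the function name is invented for illustration.

```python
# Placeholder keyring contents; the key value is not a real secret.
SAMPLE_KEYRING = """\
[client.admin]
\tkey = AQDplaceholderplaceholderplaceholder00==
\tcaps mds = "allow"
\tcaps mon = "allow *"
\tcaps osd = "allow *"
"""


def extract_key(keyring_text: str, entity: str) -> str:
    """Return the bare key value stored for `entity` in a keyring."""
    section = None
    for raw in keyring_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        if line.startswith("[") and line.endswith("]"):
            section = line[1:-1]           # e.g. "client.admin"
            continue
        if section == entity and "=" in line:
            # Split once only: base64 keys usually end in "=" padding.
            name, value = line.split("=", 1)
            if name.strip() == "key":
                return value.strip()
    raise KeyError(f"no key found for {entity}")


print(extract_key(SAMPLE_KEYRING, "client.admin"))
```

The extracted value is what goes into the secret file on its own; since it grants admin access, that file should be readable by root only.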

Because we are using nfs4, we will create an  and then use mount.ceph to mount to that. The user specified is just "admin", not "client.admin". Since we have four mons up, we will go ahead and specify all of them to mount the root ceph object namespace.

And now we have a gi-normous empty filesystem hanging off of /export/kroll. The 46% of capacity shown as used there comes from the non-ceph btrfs subvolumes that are currently in the three arrays that are providing our object stores. We will modify our /etc/exports and then update the sharing to put out /export/kroll.