Ceph/Guide

Ceph is a distributed object store and filesystem designed to provide excellent performance, reliability, and scalability. According to the Ceph wikipedia entry, the first stable release (Argonaut) was in 2012. It arose from a doctoral dissertation by Sage Weil at the University of California, Santa Cruz. Significant funding came from the US DOE as the software has found early adoption in clusters in use at Lawrence Livermore, Los Alamos, and Sandia National Labs. The main commercial backing for Ceph comes from a company founded by Weil (Inktank) which was acquired by RedHat in April 2014.

The Floss Weekly podcast interviewed Sage Weil in 2013 for their 250th show. The interview was done around the time that the "Cuttlefish" release was created. One of the points of discussion was the need for data centers to handle disaster recovery, and Sage pointed out that starting with Dumpling, Ceph would provide for replication between data centers. Another bit of trivia came out in the podcast: Sage Weil was one of the inventors of the WebRing concept in the early days of the World Wide Web.

Ceph's largest customer was (and probably still is) CERN, which uses the object store to back researcher virtual machines at a scale on the order of petabytes. This howto will show that Ceph installs and runs well on cheap consumer hardware using as few as 3 machines and only hundreds of gigabytes to a few terabytes of disk capacity. An ex-military colleague of the author described how he used to string together a number of standard-issue Panasonic Toughbooks running a variant of BSD or Linux into impromptu clusters out in the field. Ceph running on top of Gentoo would make an excellent reliable file store in just such a situation.

A standard SATA "spinning rust" hard drive will max out at about 100 MB/s of writes under optimal conditions. Ceph spreads writes across however many drives and hosts you give it for storage. Even though standard settings have it create three replicas of the data as it writes, the use of multiple drives and hosts easily allows Ceph to blow past this single-drive speed limit.
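A back-of-envelope model makes the point. The numbers and the even-spread assumption here are illustrative only (real throughput depends on journaling, network, and placement):

```python
# Rough model of why Ceph can exceed a single drive's write speed.
# Assumptions (not from Ceph itself): each OSD drive sustains ~100 MB/s,
# every client byte is written `replicas` times, and writes are spread
# evenly across all OSDs in parallel.

def aggregate_write_throughput(num_osds, per_drive_mb_s=100, replicas=3):
    """Rough ceiling on client-visible write throughput in MB/s."""
    # Each client byte costs `replicas` bytes of raw disk writes, but
    # those writes land on different drives at the same time.
    return num_osds * per_drive_mb_s / replicas

# A 9-OSD cluster with 3-way replication still triples one drive:
print(aggregate_write_throughput(9))   # 300.0
```

Even with three-way replication, adding drives and hosts raises the ceiling until the network becomes the bottleneck.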

Overview
Ceph consists of five major components:


 * Object Store Device Server
 * Monitor Server
 * RADOS API via librados, with bindings for a number of languages including Python, and integration with systems like libvirt
 * Metadata Server providing a POSIX compliant filesystem that can be shared out to non-Linux platforms with NFS and/or Samba
 * Kernel support for the RADOS block device and cephfs filesystem

Object store device
Two object stores mark the beginning of a Ceph cluster, and they may be joined by potentially thousands more. In earlier releases of Ceph, each sits on top of an existing filesystem such as ext4, xfs, zfs, or btrfs and is created and maintained by an Object Storage Device daemon (OSD). While the underlying filesystem may provide redundancy, error detection, and repair on its own, Ceph implements its own layer of error detection, recovery, and n-way replication. There is a trade-off between using a RAID1, 5, 6, or 10 scheme on the underlying filesystem with a single OSD server, versus having individual drives each backing its own OSD server. The former provides a defense-in-depth strategy against data loss, but the latter has less of an impact on the cluster when a drive fails and requires replacement. The latter also potentially provides better performance than a software RAID or a filesystem built on top of a number of JBOD devices.

Inktank/Redhat used lessons learned to develop a new underlying object store called BlueStore (with its internal minimal filesystem, BlueFS). This is starting to replace the other filesystems as the default for new cluster installations. The current version of this howto is being written as the author replaces his original ceph cluster based on btrfs with a completely new install based on BlueStore.

An OSD will take advantage of advanced features of the underlying filesystem such as extents, copy-on-write (COW), and snapshotting. It can make extended use of the xattr feature to store metadata about an object, but this will often exceed the 4 KB xattr limitation of ext4 filesystems, such that an alternative metadata store becomes necessary. Up until the Luminous release, the ceph.com site documentation recommended either ext4 or xfs in production for OSDs, but it seemed obvious that zfs or btrfs would be better because of their ability to self-repair, snapshot, and handle COW. BlueStore is a response to findings that zfs and btrfs did more than Ceph needed and that something a bit more stripped down would buy extra performance. It is the default store as of Luminous.

The task of the OSD is to handle the distribution of objects across the cluster. The user can specify the number of copies of an object to be created and distributed amongst the OSDs; this guide uses 2 copies with a minimum of 1, and those values can be increased up to the number of OSDs implemented. Since this redundancy is on top of whatever may be provided by underlying RAID arrays, the cluster enjoys an added layer of protection that guards against catastrophic failure of a disk array. When a drive array fails, only the OSD or OSDs that make use of it are brought down.

Objects are broken down into extents, or shards, when distributed, rather than being treated as single entities. In a 2-way replication scheme with more than 2 OSD servers, an object's shards may end up distributed across potentially all of the OSD servers. Each shard replica lands in a Placement Group (PG) in an OSD pool somewhere in the cluster. Scrubber processes running in the background periodically check the shards in each PG for errors that may crop up due to bad blocks developing on the hard drives. In general, every PG in the cluster is verified at least once every two weeks by background scrubbing, and errors are automatically corrected where possible.
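The key property of this placement scheme is that it is deterministic: any client can compute where a shard lives without asking a central lookup service. A toy sketch in that spirit (this is NOT Ceph's real CRUSH algorithm; the hashing and rotation scheme here are invented for illustration):

```python
# Toy deterministic placement: an object name hashes to a placement
# group, and the PG id picks a replica set of OSDs. Invented scheme,
# in the spirit of (but much simpler than) Ceph's PG/CRUSH mapping.
import hashlib

def pg_for_object(name, pg_count):
    """Map an object name to one of pg_count placement groups."""
    h = int(hashlib.md5(name.encode()).hexdigest(), 16)
    return h % pg_count

def osds_for_pg(pg, osd_ids, replicas):
    """Pick `replicas` distinct OSDs for a PG, derived only from pg."""
    start = pg % len(osd_ids)
    return [osd_ids[(start + i) % len(osd_ids)] for i in range(replicas)]

pg = pg_for_object("myobject", pg_count=128)
replica_set = osds_for_pg(pg, osd_ids=[0, 1, 2, 3], replicas=2)
# Every client recomputes the same answer -- no central metadata lookup.
```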

Monitor server
Monitor Servers (MONs) act as the coordinators for object and other traffic. An initial Ceph cluster would consist of one MON and two OSD servers, and this is the example used in the upstream documentation for a quick install. That documentation also talks about an admin server, but this is only a system that can painlessly remote into the cluster members using ssh authorized_keys. The admin server would be the system the user has set up to run Chef, Puppet, or other control systems that oversee the operation of the cluster.

A single MON would be a single point of failure for Ceph, so it is recommended that the cluster run an odd number of MONs, with a minimum of 3, to establish a quorum and ride out single-host failures and MON errors. For performance reasons, MONs should be put on a separate filesystem or device from OSDs because they do a lot of fsyncs. Although they are typically shown running on dedicated hosts, they can share a host with an OSD and often do, in order to have enough MON servers for a decent quorum. MONs don't need much storage space, so it is perfectly fine to have them run on the system drive while the OSD servers take over whatever other disks are in the server. If you dedicate an SSD to OSD journals for non-BlueStore OSD servers, the MON storage will only require another 2 GB or so.

MONs coordinate shard replication and distribution by implementing the Controlled Replication Under Scalable Hashing (CRUSH) map, an algorithm that computes the locations for storing shards in the OSD pools. MONs also keep track of the map of daemons running the various flavors of Ceph server in the cluster. An "initial members" setting allows the user to specify the minimum number of MON servers that must be running in order to form a quorum. When there are not enough MONs to form a quorum, the Ceph cluster stops processing until a quorum is re-established, in order to avoid a "split-brain" situation.
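The quorum rule itself is just strict majority voting, which is also why an odd MON count is recommended. A minimal sketch:

```python
# A quorum requires a strict majority of the configured monitors.
# This is why an even MON count buys nothing: 4 MONs tolerate the
# same single failure that 3 do, while adding another failure domain.

def have_quorum(mons_up, mons_total):
    """True if the surviving MONs can form a quorum."""
    return mons_up > mons_total // 2

# 3 MONs tolerate 1 failure; 4 MONs still only tolerate 1.
assert have_quorum(2, 3) and not have_quorum(1, 3)
assert have_quorum(3, 4) and not have_quorum(2, 4)
```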

The CRUSH map defaults to an algorithm that computes a deterministic uniform random distribution of where in the OSDs an object's shards should be placed, but it can be influenced by additional human specified policies. This way, a site administrator can sway CRUSH when making choices such as:


 * Use the site's faster OSDs by default
 * Divide OSDs into "hot" (SSD based), "normal", and "archival" (slow or tape backed) storage
 * Localize replication to OSDs sitting on the same switch or subnet
 * Prevent replication to OSDs in the same rack, to avoid downtime when an entire rack has a power failure
 * Take underlying drive size into consideration so that, for example, an OSD based on a 6 TB drive gets 50% more shards than one based on a 4 TB drive
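That last point is just weight-proportional placement: each OSD's expected share of shards is its weight divided by the total weight. A quick check of the 6 TB vs 4 TB example (the osd names here are made up):

```python
# CRUSH weights are proportional to capacity, so expected shard share
# is weight / total_weight. Hypothetical two-OSD example from above:
weights = {"osd.0": 6.0, "osd.1": 4.0}   # units: TB

total = sum(weights.values())
shares = {osd: w / total for osd, w in weights.items()}

# shares == {"osd.0": 0.6, "osd.1": 0.4}, i.e. the 6 TB OSD gets
# 50% more shards than the 4 TB one, filling both at the same rate.
```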

It is this spreading out of the load with the CRUSH map that allows Ceph to scale up to thousands of OSDs so easily while increasing performance as new stores are added. Because of the spreading, the bottleneck moves from raw disk performance (about 100 MB/s for a SATA drive, for example) to the bandwidth capacity of your network and switches.

There are a number of ways to work with the MON pool and its Paxos-replicated database to monitor and administer the cluster, but the most common is the ceph command. This is a Python script that uses a number of Ceph-supplied Python modules that use JSON to communicate with the MON pool.

RADOS block device and RADOS gateway
Ceph provides a kernel module for the RADOS Block Device (RBD) and a librados library which libvirt and KVM can be linked against. This is essentially a virtual disk device that distributes its "blocks" across the OSDs in the Ceph cluster. An RBD provides the following capabilities:


 * Thin provisioning
 * I/O striping and redundancy across the Cluster
 * Resizeable
 * Snapshot with revert capability
 * Directly useable as a KVM guest's disk device
 * A variant of COW where a VM starts with a "golden image" which the VM diverges from as it operates
 * Data replication between datacenters

A major selling point for the RBD is that it can be used as a virtual machine's drive store in KVM. Because it spans the OSD server pool, the guest can be migrated between cluster hosts by shutting the guest down on one host and booting it on another. Libvirt and Virt-Manager have provided this support for some time now, and it is probably one of the main reasons why RedHat (a major sponsor of QEMU/KVM, Libvirt, and Virt-Manager) acquired Inktank.

The RBD and the RADOS Gateway provide the same sort of functionality for cloud services as Amazon S3 and OpenStack Swift. The early adopters of Ceph were interested primarily in cloud-service object stores, and cloud services also drove the initial work on replication between datacenters.

Metadata server
Ceph provides a Metadata Server (MDS) which offers a more traditional, POSIX-compliant style of filesystem that translates into objects stored in the OSD pool. This is typically where a non-Linux platform can implement client support for Ceph: the filesystem can be shared via CIFS and NFS to non-Ceph and non-Linux systems, including Windows. This is also the way to use Ceph as a drop-in replacement for Hadoop's HDFS. The filesystem component started to mature around the Dumpling release.

Ceph requires all of its servers to be able to see each other directly in the cluster. So this filesystem would also be the point where external systems would be able to see the content without having direct access to the Ceph Cluster. For performance reasons, the user may have all of the Ceph cluster participants using a dedicated network on faster hardware with isolated switches. The MDS server would then have multiple NICs to straddle the Ceph network and the outside world.

When the author first rolled out Ceph using the Firefly release, there was only one active MDS server at a time. Other MDS servers run in standby mode to quickly perform a failover when the active server goes down; the cluster takes about 30 seconds to determine that the active MDS server has failed. This may appear to be a bottleneck for the cluster, but the MDS only does the mapping of POSIX file names to object ids. With an object id, a client then directly contacts the OSD servers to perform the necessary I/O of extents/shards. Non-cephfs traffic, such as a VM running in an RBD device, would continue without noticing any interruptions.

Multiple active MDS server support appeared in Jewel and became stable in Kraken. This allows the request load to be shared between more than one MDS server by divvying up the namespace.

Storage pools
You can and will have more than one pool for storing objects. Each can use either the default CRUSH map or an alternative for its object placement. There is a default pool used for the generic Ceph object store, in which your application can create and manipulate objects using the librados API. Your RBD devices go into another pool by default. The MDS server will also use its own pools for storage, so if you intend to use it alongside your own RADOS-aware application, get the MDS set up and running first. There is a well-known layout scheme for the MDS pools that doesn't seem to be prone to change and that your RADOS-aware app can take advantage of.

Installation
As of this writing, the stable version of Ceph in Portage corresponds to "Luminous", while "Mimic" is in Gentoo unstable. However, the author has yet to get a version of Mimic to emerge on a Gentoo 17.0 desktop stable profile. The release history so far:


 * "Bobtail"
 * "Cuttlefish"
 * "Dumpling"
 * "Firefly" - The initial release that the author used to roll out ceph, "experimental" MDS support
 * "Giant" - Redhat buys up Inktank around now
 * "Hammer" - The MDS server code wasn't considered stable until either Giant or Hammer... the author forgets
 * "Infernalis" - Redhat marketing has obviously taken over. Last release packaged for RHEL/Centos 6.x servers
 * "Jewel" - systemd aware. Unstable support for more than one "active" MDS server but there were "issues"
 * "Kraken" - Initial BlueStore support for OSD storage. Multiple active MDS support marked stable
 * "Luminous" - current gentoo stable version, BlueStore marked stable and becomes the default store for new OSD servers
 * "Mimic" - author hasn't read up on it yet because the silly thing doesn't build for him yet. Object de-dupe?

In general, Inktank/Redhat has kept to dropping a new major release of Ceph about every six months, with a milestone release that stabilizes major functionality about once a year.

Kernel configuration
If you want to use the RADOS block device, you will need to enable it in your kernel .config either as a module or built in. Ceph itself will want FUSE support enabled if you want to work with the POSIX filesystem component, and you will also want to include the Ceph driver under Network File Systems. For your backend object stores, you will want xfs support because of the xattr limitations in ext4, and btrfs support because it really is becoming stable now.

Network configuration
Ceph is sensitive to IP address changes, so you should make sure that all of your Ceph servers are assigned static IP addresses. You also may want to proactively treat the Ceph cluster members as an independent subnet from your existing network by multi-homing your existing network adapters as necessary. That way, if an ISP change or other topology changes are needed, you can keep your cluster setup intact. It also gives you the luxury of migrating the ceph subnet later on to dedicated NICs, switches, and faster hardware such as 10Gbit ethernet or Infiniband. If the cluster subnet is small enough, consider keeping the hostnames in your /etc/hosts files, at least until things grow to the point where a pair of DNS servers among the cluster members becomes a compelling solution.
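An /etc/hosts fragment for such a private cluster subnet might look like the following. The addresses are made-up examples (only the hostnames and aliases come from the cluster described below); substitute your own subnet:

```
# /etc/hosts fragment for the cluster's private subnet.
# Addresses are illustrative placeholders, not the author's real ones.
192.168.2.1   kroll1 thufir
192.168.2.2   kroll2 figo
192.168.2.3   kroll3 mater
192.168.2.4   kroll4 tube
192.168.2.5   kroll5 refurb
192.168.2.6   kroll6 topshelf
192.168.2.7   kroll7 mike
```

Keeping the same file on every member means name resolution keeps working even if your normal DNS or ISP setup changes.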

The author's initial rollout of Ceph back in the Firefly days was on four nodes, but that grew to 7 hosts. This updated guide reflects a real-world implementation of Ceph Luminous 12.2.11 which should be considered a new "from scratch" install. The old btrfs-based OSD store was backed off to a btrfs filesystem mirror on the Mater host so that the old install could be burned down.

The DNS servers did not get set up to define an inside domain and zones for the ceph subnet. Instead the author used /etc/hosts on each machine.

An example Ceph cluster
This is a ceph cluster based on a collection of "frankenstein" AMD-based machines in a home network. The author also had a small 3-node "personal" setup at his desk at a previous job at a major defense contractor, based on HP and Supermicro systems. Back in the Firefly days, 1 TB drives were the norm, so the "weighting" factor units for CRUSH maps corresponded to sizes in terabytes. The kroll home network hosts are as follows:


 * kroll1 (aka Thufir) - An AMD FX9590 8 core CPU with 32GB of memory, a 256GB SSD root drive, and a 4x4TB SATA array formatted as a RAID5 btrfs with its default volume mounted. Thufir has been our admin server since the ssh keys for its root user have been pushed out to the other nodes' authorized_keys files. Over time, the disks in the old 4x1TB array were replaced with 4TB drives, with one going to a dedicated home mount. The other three were set up as RAID1 for a new osd daemon. For the new rollout, these three will become individual 4TB osd servers.
 * kroll2 (aka Figo) - An AMD FX8350 8 core CPU with 16GB of memory, a 256GB SSD root drive, and a 4x3TB SATA array formatted as btrfs RAID1. Kroll2 acted as a MON and an OSD server in the initial Firefly install. The MON was eventually deleted, and the array has been replaced by 4x4TB drives with one going to a dedicated home mount. The motherboard has been swapped out for an AMD Ryzen 7 2700x 8 core CPU installation with 32GB of memory. The system drive is now a 512GB SSD replacing the old 256GB OCZ Vertex. Figo will be used as a host for three 4TB OSD servers.
 * kroll3 (aka Mater) - An AMD FX8350 8 core CPU with 16GB of memory, a 256GB SSD, and a 4x1TB SATA array formatted as a RAID5 btrfs. The old Mater was originally both the fourth MON and an OSD server. The MON was eventually deleted when the author was researching performance as a function of the number of MON servers. Mater got a hardware refresh to an AMD Ryzen 7 1700x motherboard with 32GB of memory and a 4x4TB disk array; the existing Samsung 256GB SSD system drive was kept. Since Mater is hooked to a nice 4K display panel, it will become the new admin server for the cluster. It will also be the single MDS server in the new cluster for the moment, since the old cluster contents are living on its SATA array formatted as a btrfs RAID10 mirror.
 * kroll4 (aka Tube) - An AMD A10-7850K APU with 16GB of memory, a 256GB SSD, and a 2x2TB SATA array. Tube was originally set up as a MON and OSD server, but the MON was deleted over time. The 2TB drives were swapped out for 4TB drives. In the new deployment, it will run a single OSD server with one of the drives.
 * kroll5 (aka Refurb) - An AMD A10-7870K APU with 16GB of memory and a 1TB SSD with a 2x4TB RAID array. It wasn't part of the old Firefly install initially, but it later got set up as a MON and as an MDS server since it was on the same KVM switch as Thufir and Topshelf. In the new deployment, it will be one of the three MONs (thufir, refurb, and topshelf).
 * kroll6 (aka Topshelf) - An AMD FX8350 8 core CPU with 16GB of memory and a 256GB SSD drive. It wasn't part of the original Firefly deployment, but it later got set up as a MON and as the other MDS server in the active/hot-backup MDS scheme that was used. The hardware was refreshed to an AMD Ryzen 7 2700x with 32GB of memory and a 1TB SSD drive. It originally had a 4x3TB array in it, but those were members of a problematic generation of Seagate drives that has only one survivor still spinning. That may eventually be refreshed, but Topshelf will only be used as a MON in the new deployment for now.
 * kroll7 (aka Mike) - An AMD Ryzen 7 1700x 8 core processor with 32GB of memory, a 1TB SSD drive, and a 2x4TB RAID array. It will be used to deploy a pair of 4TB osd servers in the new cluster.

All 7 systems are running Gentoo stable profiles, but the Ryzen 7 processors are running unstable kernels in place of the stable series in order to have better AMD Zen support. The two Ryzen 1700x-based hosts suffer from the dreaded idling problems of the early fab versions of Zen, but firmware tweaks on the motherboards and other voodoo rituals have (mostly) kept them at bay.

Editing the ceph config file
We will be following the manual installation guide on the Ceph site. There is also a Python-based script called ceph-deploy which is packaged for a number of distros but not directly available for Gentoo. If you can manage to get it working, it will automate a good bit of the process of rolling out a server from your admin node.

We used uuidgen to generate a new random UUID for the entire cluster. We renamed the cluster from the default ceph to kroll to match our host naming scheme. We specify the 192.168.2 network to be the "public" network for the cluster. Other default settings come from the manual install URL mentioned earlier, including a default to replicate two copies of each object with a minimum of 1 copy allowed when the cluster is in a "degraded" state.
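The resulting [global] section might look roughly like the fragment below. This is a sketch, not the author's actual file: the fsid is a placeholder, the /24 netmask and journal size are assumptions, and option values should be checked against the upstream manual deployment page:

```
# Illustrative ceph.conf [global] skeleton for this kind of setup.
[global]
fsid = <uuid generated with uuidgen>
public network = 192.168.2.0/24          ; assumed /24 on the 192.168.2 net
mon initial members = kroll1, kroll5, kroll6
osd pool default size = 2                ; two replicas of each object
osd pool default min size = 1            ; allowed while degraded
osd journal size = 1024                  ; example value; moot under BlueStore
```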

The example conf file has only a single MON, but we use a quorum of three: kroll1, kroll5 and kroll6.

We override the OSD journal size as noted, but the entire thing is moot since we will be using BlueStore.

We use a replica setup along the lines of the ceph example, but read our comments above. Their example glosses over PG sizing and will not work if you have fewer than 3 hosts running OSD servers.

After editing the file, we copy it around to the other cluster members from our admin node kroll1.

/etc/conf.d/ceph file
There is a conf.d file for the ceph service, but it is pretty barebones and doesn't need changing unless services are being juggled for more than one cluster with different conf files. Since we changed the cluster name from ceph to kroll but still use the default ceph.conf name for the file, we uncomment the relevant setting.

Creating Keyrings For MON rollout
Ceph uses its own shared-secret scheme when handling communications among cluster members. We must generate keyring files that will then be distributed to the servers that will be set up among the cluster members. The keyrings are generated by the ceph-authtool command. The first keyring is for the mon servers. The manual install URL has it going to a throwaway location, but we are more inclined to keep it around by parking it in a permanent spot.

The result is a readable text file:

Next we create an admin keyring file, which goes into its own file.

The resulting text file may actually be shorter than the complicated command line used to create it. The redacted key here is the same as the one that appears in our mon keyring, so it must be derived from the UUID parked in the config file.

Creating /var/lib/ceph
ceph uses /var/lib/ceph for various server settings and storage. Since the author had a legacy install of ceph to start with, there was already a tree there with ownership set to ceph:ceph. Daemons ran as the root user in ceph releases up until around Giant, and changed to run as the ceph user in later releases, so this ownership needed a reset at some point. Depending on the class of ceph servers running on the host, there will be mds, mon, and osd subdirectories under this tree with the appropriate files. There is also likely to be a tmp subdir that gets created at some point by various commands. YMMV for a fresh install, so you may need to create a tree like this. The author had to create a new /var/lib/ceph/bootstrap-osd subdir for himself for the next keyring:

Merging the three keyrings together into the mon.keyring file
The ceph manual guide then uses the authtool with its import options to merge the three keys together into the mon.keyring file. You can save a bit of typing just by using good ol' cat to slap everything together.

Creating the initial monmap file
The OSD and MDS servers have a looser mechanism for discovering MON servers, but the MON servers themselves have a much stricter consistency scheme in order to form and maintain their quorum. When up and running, the quorum uses a majority-rule consensus algorithm called Paxos, but the MONs do start from an initial binary file called a monmap when you first set up the Ceph cluster.

The manual deployment page covers the example where only a single MON is used to form the quorum. It's simply a matter of using additional stanzas to define our initial 3-member monitor map.

The monmaptool command is used to create the initial monmap binary file. We essentially give it the addresses corresponding to our initial members and the cluster fsid from the config file. We will park this file in a safe spot and then copy it to the right place when we configure our MON hosts.

Once the three monitors are up and running and have established a quorum, they will begin to automatically revise this initial monitor map. Each revision is called an epoch, and the epoch number gets bumped whenever a revision happens. It will change when the OSDs get added and as the initial CRUSH map and PG pools get set up. It also changes as events happen, such as the background scrubbers finishing one PG and moving on to the next. So this initial map will no longer be needed after the quorum is established. In fact, when a new monitor is added to the cluster following the add-or-rm-mons documentation, there is a point where you retrieve the current monitor map from the quorum to a file and then use that to create the new monitor's filesystem. Part of the process of joining the new monitor to the quorum involves it figuring out what needs to change to go from the old epoch number to the current one that the quorum is working with.

We push the initial monmap file over to the appropriate directories on kroll1, kroll5 and kroll6.

Creating kroll1 server mon.0
Ceph servers look for their file trees under /var/lib/ceph. Mon servers look for their server id subtree there, where the id number is the one designated for the server in the config file. Kroll1 will host mon.0, so we create a directory for it. This implies that we will be using our 256GB SSD root system device for mon.0's I/O. Later on, when we create the OSD for kroll1, we will create and mount a btrfs subvolume for it to use; otherwise the object store would default to eating us out of house and home on our system drive!

Before continuing on, you may want to clear out anything that may already be in that directory. The next command will create an empty file if it doesn't already exist.

The command will populate the mon directory with a renamed copy of the keyring file and a store directory tree, a Paxos-managed database reflecting the contents of the initial monmap file.

We set up the mon.0 server startup by softlinking the OpenRC init script. Examination of the ceph script reveals that the Gentoo ebuild developer was merely picking the server type and id number from string positions in the script file name. Thus it isn't crucial to use the "." and the "-" as we did here, but it does make things readable:
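The positional parsing described above can be sketched as follows. This is a re-creation of the assumed behavior from reading the init script, not the script's actual code:

```python
# Re-creation of how the Gentoo OpenRC ceph script (as observed above)
# derives the daemon type and id from the service link's name: the type
# is the three characters after "ceph-", the id everything after that
# separator. The exact separator characters don't matter much.

def parse_service_name(name):
    rest = name[len("ceph-"):]       # e.g. "mon.0"
    daemon_type = rest[:3]           # "mon", "osd", or "mds"
    daemon_id = rest[4:]             # whatever follows the separator
    return daemon_type, daemon_id

assert parse_service_name("ceph-mon.0") == ("mon", "0")
assert parse_service_name("ceph-osd.12") == ("osd", "12")
```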

We repeated the same process to create the remaining mon directories and service links on the other kroll member hosts.

Starting the Mon servers
With all four kroll hosts configured with mons, we now go back and start the services, beginning with mon.0 on kroll1.

Notice the ".fault" entries at the bottom of the log. We will continue to see them until the quorum is established. We will also see them as output when attempting monitor-style commands, such as status queries, until there is a quorum.

Starting mon.1 on kroll2
We start up the mon.1 server on kroll2. The mon log for this server will show it discovering the kroll1 peer server, and the two start using Paxos to hold a quorum election. Until the third mon gets spun up, we still won't have a quorum.

Starting mon.2 on kroll3
There is now also a ceph.log file for the cluster with a briefer summary. There is a warning that the system drive is over 70% full on kroll1, since we have also been using it for a data directory there. We may migrate that to the btrfs array later if it becomes necessary.

The cluster-sanity commands from the manual install page will now work, but of course we don't have any OSDs spun up yet.

Starting mon.3 on kroll4
The appearance of mon.3 causes a new monitor election, as noted in the ceph.log on kroll4.

And now we are four. The health will stay at HEALTH_ERR or degraded until we get at least two OSDs spun up. We need two since that's the default replication count for objects as set in our conf file.

Creating osd.0 on kroll1
We use uuidgen to create a unique id which will be used for our first osd. With the mon servers up and running, maintaining the server map via Paxos, we will need to follow the osd id numbers returned to us and then retrofit the conf file if necessary, should we lose an osd later or do things out of order.
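The id numbers handed back follow a simple rule: you get the lowest osd id not currently in use, which is why doing hosts in order keeps ids matching a pre-written conf file. A sketch of that allocation rule (a simplified assumption about the behavior, not Ceph's code):

```python
# Simplified model of osd id assignment: the cluster hands out the
# lowest id not currently in use, so a fresh cluster starts at 0 and
# a deleted osd's id can come back later out of order.

def next_osd_id(ids_in_use):
    n = 0
    while n in ids_in_use:
        n += 1
    return n

assert next_osd_id(set()) == 0        # first osd in a fresh cluster
assert next_osd_id({0, 1, 3}) == 2    # a freed id gets reused first
```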

Since this will be the first osd known to the mon servers, we get assigned id 0 to create on kroll1. As we have noted earlier, kroll1 (Thufir) already has a btrfs RAID5 array up and running with its default volume mounted. The "normal" content is in a subvolume with its own mountpoint. We will add a new subvolume, mounted separately, for use by the new osd server.

We add the new subvolume to our fstab and take the liberty of turning on automatic defragmentation and on-the-fly lzo compression for it. The osd will actually manage some subvolumes of its own underneath this mountpoint, including rolling snapshots.

Now we let ceph-osd have its way with our new btrfs subvolume.

We now have a log which shows us all of the activity and the btrfs features that the osd decided to take advantage of:

The resulting filesystem looks like this:

The snapshot directories here are btrfs snapshots. The keyring file is a new unique key that was generated for this osd.

Now we register that key with the mon servers, which store it in their Paxos-managed database.

This entry actually contains the fsid for the cluster from the conf file, and not the one we passed when creating the osd itself:

We save the uuid we used for osd.0 in a text file, just in case. It would probably be a bad idea to try to recycle an osd id number if we ever have a filesystem go bad, but you never know...

We set up kroll1 as a host in the CRUSH map and make it part of the default osd tree.

We can now see kroll1 as a host in the default tree. osd.0 also appears but hasn't been assigned to a host.

We now put osd.0 in the crush map under kroll1 with a default weighting value.

All that is left is to enable and start the osd.0 service.

With one osd spun up, our cluster is now operating in a degraded state.

osd.1 on kroll3
osd.1 will use our other big btrfs array on host kroll3 (Mater).

Since we are visiting hosts in the right order, we get to set up osd.1 on kroll3 just as we had anticipated in our conf file. Mater's btrfs array already has its default volume mounted.

With two osds running now, the cluster moves from degraded to "warn" state. This may be left over from our earlier warning about the root filesystem on kroll1.

osd.2 on kroll4
After shoving a couple of data archives off to the btrfs filesystem on thufir, we got the root drive down to 51% of capacity. The cluster now shows HEALTH_OK.

We will now add our final (for the moment) object store as osd.2 on kroll4. Its btrfs mirror set likewise has its default volume mounted. It also has a different subvolume setup from thufir and mater: instead of mounting and exporting a single subvolume, it has several, including a dedicated volume for virtual machines. Another wrinkle is that this system has been set up to use btrfs on the SSD for its /boot and / drives; in fact, the root drive is mounted as a subvolume.

It suddenly strikes us that we are almost assuredly the first kid in our neighborhood to set up a cluster with almost 40 TB of distributed object store in the comfort of our own home. In fact, we are probably outclassing all but one or two NIH-funded research labs nearby. The cost of the hardware was probably in the range of $5-7K since it was built from scratch, and it may even have been cheaper than a single desktop video editing system sold by a certain fruity computer company.

Setting up mds.0 on kroll2
The Ceph site wants us to use ceph-deploy to set up the mds servers, but that script has yet to be ported and packaged for Gentoo. We found enough information on a blogger's site to do a manual install for Gentoo. The conf file needs an edit to introduce a global mds section for the default location of the mds server data and the keyring that it needs.

The rest of the file stays untouched, including the [mds.0] section that we put in much earlier. We pass the updated conf file around to the other hosts.

On kroll2 we create the mds-0 directory and then use ceph-authtool to create its keyring. The result is only a single file with a simple key/value stanza.

Next we create a softlink of the Ceph OpenRC script to enable and start mds-0.

Looking at the ceph-mds-0.log and the cluster status command shows us that everything is fine and we now have an mds server running. We should now be able to use mount.ceph and export the ceph namespace over NFS and CIFS.

Creating and exporting the POSIX filesystem
Once again, the Ceph web site offered scant details about how to mount the object store as a POSIX filesystem with cephx authentication. After a lot of googling around mount error 5 and the like, we hit upon the magic sauce that is necessary. If you have compiled the ceph network filesystem and ceph lib as modules, you do not need to worry about a manual modprobe to have them loaded; the mount.ceph command will take care of that for you. You can confirm that the following two modules are loaded after trying your first mount:

If you remember, the ceph.client.admin.keyring you created back at the beginning of the install included an allow entry for mds operations. However, the format of that keyring file will not work with mount.ceph. We need to copy only the key value itself into a separate secret file.

Because we are using nfs4, we will create an export directory and then use mount.ceph to mount to that. The user specified is just "admin", not "client.admin". Since we have four mons up, we will go ahead and specify all of them to mount the root ceph object namespace.

And now we have an enormous, empty filesystem hanging off the mountpoint. The 46% of capacity shown as used there comes from the non-ceph btrfs subvolumes currently in the three arrays that provide our object stores. We will modify our /etc/exports and then update the sharing to put out /export/kroll.
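An /etc/exports entry for that might look like the fragment below. This is an illustrative sketch only: the NFSv4 pseudo-root layout and the client subnet are assumptions, not the author's actual configuration:

```
# Illustrative /etc/exports for sharing the mounted cephfs tree over
# NFSv4. The /export pseudo-root, options, and subnet are assumptions.
/export        192.168.2.0/24(rw,fsid=0,no_subtree_check)
/export/kroll  192.168.2.0/24(rw,nohide,no_subtree_check)
```

After editing, `exportfs -ra` (the standard way to reload exports) would push the change out to clients.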