ceph for small cluster?

* ceph for small cluster?
@ 2012-12-30 21:38 Miles Fidelman
  2012-12-31  9:10 ` Wido den Hollander
       [not found] ` <CADXA5U1ATB+1baCfBSHmzozRe3m-HhLxHQr3be1-dASgABPQYw@mail.gmail.com>
  0 siblings, 2 replies; 6+ messages in thread
From: Miles Fidelman @ 2012-12-30 21:38 UTC
  To: ceph-devel

Hi Folks,

I'm wondering how ceph would work in a small cluster that supports a mix 
of engineering and modest production (email, lists, web server for 
several small communities).

Specifically, we have a rack with 4 medium-horsepower servers, each with 
4 disk drives, running Xen (debian dom0 and domUs) - all linked together 
w/ 4 gigE ethernets.

Currently, 2 of the servers are running a high-availability 
configuration, using DRBD to mirror specific volumes, and pacemaker for 
failover.

For a while, I've been looking for a way to replace DRBD with something 
that would mirror across more than 2 servers - so that we could migrate 
VMs arbitrarily - and that will work without splitting up compute vs. 
storage nodes (for the short term, at least, we're stuck with rack space 
and server limitations).

The thing that looks closest to filling the bill is Sheepdog (at least 
architecturally) - but it only provides a KVM interface. GlusterFS, 
XtreemFS, and Ceph keep coming up as possibilities - with Ceph's RBD 
interface looking like the easiest to integrate.

Which leads me to two questions:

- On a theoretical level, does using Ceph as a storage pool for this 
kind of small cluster make any sense? (Notably, I'd see running an OSD, 
an MDS, a MON, and client DomUs on each of the 4 nodes, using LVM to 
pool all the storage. It also seems like folks recommend XFS as a 
production filesystem.)

- On a practical level, has anybody tried building this kind of small 
cluster, and if so, what kind of results have you had?

Comments and suggestions please!

Thank you very much,

Miles Fidelman

-- 
In theory, there is no difference between theory and practice.
In practice, there is.   .... Yogi Berra


* Re: ceph for small cluster?
  2012-12-30 21:38 ceph for small cluster? Miles Fidelman
@ 2012-12-31  9:10 ` Wido den Hollander
       [not found] ` <CADXA5U1ATB+1baCfBSHmzozRe3m-HhLxHQr3be1-dASgABPQYw@mail.gmail.com>
  1 sibling, 0 replies; 6+ messages in thread
From: Wido den Hollander @ 2012-12-31  9:10 UTC
  To: Miles Fidelman; +Cc: ceph-devel

Hi,

On 12/30/2012 10:38 PM, Miles Fidelman wrote:
> Hi Folks,
>
> I'm wondering how ceph would work in a small cluster that supports a mix
> of engineering and modest production (email, lists, web server for
> several small communities).
>
> Specifically, we have a rack with 4 medium-horsepower servers, each with
> 4 disk drives, running Xen (debian dom0 and domUs) - all linked together
> w/ 4 gigE ethernets.
>
> Currently, 2 of the servers are running a high-availability
> configuration, using DRBD to mirror specific volumes, and pacemaker for
> failover.
>
> For a while, I've been looking for a way to replace DRBD with something
> that would mirror across more than 2 servers - so that we could migrate
> VMs arbitrarily - and that will work without splitting up compute vs.
> storage nodes (for the short term, at least, we're stuck with rack space
> and server limitations).
>
> The thing that looks closest to filling the bill is Sheepdog (at least
> architecturally) - but it only provides a KVM interface. GlusterFS,
> XtreemFS, and Ceph keep coming up as possibilities - with Ceph's RBD
> interface looking like the easiest to integrate.
>
> Which leads me to two questions:
>
> - On a theoretical level, does using Ceph as a storage pool for this
> kind of small cluster make any sense? (Notably, I'd see running an OSD,
> an MDS, a MON, and client DomUs on each of the 4 nodes, using LVM to
> pool all the storage. It also seems like folks recommend XFS as a
> production filesystem.)
>

Yes, that could work. But you have to keep in mind that OSDs can spike 
in both CPU and memory when they have to do recovery work for a failed 
node/OSD.

Also, with RBD you don't need an MDS. As a last note, you should always 
have an odd number of monitors. So run a monitor on 3 of the 4 machines.

The monitors work by a voting principle where they need a majority. An 
odd number is best in that situation.
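
[Editor's illustration] The majority rule above can be worked through 
with simple arithmetic. This is a minimal sketch, not Ceph code: the 
function names are made up for illustration, and it assumes only that 
a quorum is a strict majority of the configured monitors.

```python
def quorum_size(monitors: int) -> int:
    """Smallest number of monitors that forms a strict majority."""
    return monitors // 2 + 1


def tolerated_failures(monitors: int) -> int:
    """Monitors that can be down while a majority is still reachable."""
    return monitors - quorum_size(monitors)


if __name__ == "__main__":
    for n in (3, 4, 5):
        print(f"{n} monitors: quorum={quorum_size(n)}, "
              f"can lose {tolerated_failures(n)}")
```

Under that assumption, 4 monitors tolerate the same single failure as 
3, so the fourth monitor adds load without adding resilience.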

> - On a practical level, has anybody tried building this kind of small
> cluster, and if so, what kind of results have you had?
>

I have built some small Ceph clusters, sometimes with just 3 nodes. It 
works, but keep in mind that when one node in a 4-node cluster fails 
you will lose 25% of the capacity.

This will lead to heavy recovery traffic within the Ceph cluster, which 
will put a lot of pressure on those Gbit links and on the CPUs and 
memory of the nodes.
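
[Editor's illustration] A back-of-the-envelope sketch of the point 
above. The helper names, the 1000 GB figure, and the throughput value 
are hypothetical, chosen only for illustration. The estimate is a 
best case: it ignores protocol overhead, disk speed, and competing 
client I/O, so real recovery will likely take longer.

```python
def capacity_lost_fraction(nodes: int) -> float:
    """Fraction of raw capacity lost when one of `nodes` equal nodes fails."""
    return 1.0 / nodes


def best_case_transfer_seconds(data_gb: float, throughput_gbit_s: float) -> float:
    """Time to move `data_gb` gigabytes at a sustained network throughput.

    Uses decimal units: 1 GB = 8 Gbit.
    """
    return data_gb * 8 / throughput_gbit_s


if __name__ == "__main__":
    # Hypothetical example: 1000 GB lived on the failed node and
    # recovery traffic sustains 1 Gbit/s in aggregate.
    seconds = best_case_transfer_seconds(1000, 1.0)
    print(f"capacity lost: {capacity_lost_fraction(4):.0%}")
    print(f"best-case re-replication time: {seconds / 3600:.1f} hours")
```

Even in this idealized case the links are saturated for hours, which 
is the pressure on the network described above.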

With RBD you might want to consider adding an SSD for the journaling of 
the OSDs, that will give you a pretty nice performance boost.

Wido

> Comments and suggestions please!
>
> Thank you very much,
>
> Miles Fidelman
>


* Re: ceph for small cluster?
       [not found] ` <CADXA5U1ATB+1baCfBSHmzozRe3m-HhLxHQr3be1-dASgABPQYw@mail.gmail.com>
@ 2012-12-31 14:14   ` Miles Fidelman
  2013-01-01  0:13     ` Matthew Roy
  0 siblings, 1 reply; 6+ messages in thread
From: Miles Fidelman @ 2012-12-31 14:14 UTC
  Cc: ceph-devel

Matt, Thanks for the comments.  A follow-up if I might (inline):

Matthew Roy wrote:
> What I'm not doing that you'd need to test is running VMs on the same 
> servers as storage. I'd be careful about mounting RBD volumes on the 
> OSDs, you can run into kernel deadlock trying to write out things to 
> physical disk when trying to write to the mounted volume. Mounts 
> inside VMs should be okay. 

I was thinking of pinning one CPU to Dom0 and running the OSD from 
there, and mounting RBD volumes only in DomUs.  And leaving a bit of 
disk space outside the OSD for booting and Dom0.

Which raises another question: how are you combining drives within each 
OSD (raid, lvm, ?).

Thanks again,

Miles

-- 
In theory, there is no difference between theory and practice.
In practice, there is.   .... Yogi Berra


* Re: ceph for small cluster?
       [not found] <50E19E34.7080005@meetinghouse.net>
@ 2012-12-31 14:21 ` Miles Fidelman
  0 siblings, 0 replies; 6+ messages in thread
From: Miles Fidelman @ 2012-12-31 14:21 UTC
  To: ceph-devel


Wido, Thanks for the comment, a follow-up if I might (below)?

Wido den Hollander wrote:
>
> I have built some small Ceph clusters, sometimes with just 3 nodes. It
> works, but keep in mind that when one node in a 4-node cluster fails
> you will lose 25% of the capacity.
>
> This will lead to heavy recovery traffic within the Ceph cluster, which
> will put a lot of pressure on those Gbit links and on the CPUs and
> memory of the nodes.
>
> With RBD you might want to consider adding an SSD for the journaling
> of the OSDs, that will give you a pretty nice performance boost.
>
Would not journalling alone, say on a separate hard disk volume, help
with recovery?

Thanks,

Miles

-- 
In theory, there is no difference between theory and practice.
In practice, there is.   .... Yogi Berra



* Re: ceph for small cluster?
  2012-12-31 14:14   ` Miles Fidelman
@ 2013-01-01  0:13     ` Matthew Roy
  2013-01-02 12:17       ` Wido den Hollander
  0 siblings, 1 reply; 6+ messages in thread
From: Matthew Roy @ 2013-01-01  0:13 UTC
  To: Miles Fidelman; +Cc: ceph-devel

On Mon, Dec 31, 2012 at 9:14 AM, Miles Fidelman
<mfidelman@meetinghouse.net> wrote:
>
>
> Which raises another question: how are you combining drives within each OSD (raid, lvm, ?).
>

I'm not combining them, just running an OSD per data disk. On this
cluster it's 2 disks for each of the 3 nodes.  I ended up that way
only because I added the second disk to each node after getting
started. There was an inktank blog post not too long ago about the
performance of RAID'ed disks on OSDs that might provide quantitative
justification for which route to go.

Like Wido suggests, I also use a shared SSD for journal on each node.
The journal's not really about speeding recovery from failed
OSDs/disks, it's about being able to ACK writes faster and still
retain integrity when Bad Things happen. If you're RAIDing with a
battery-backed cache I think you can run without a journal, but I
don't know the details on that.

Matthew


* Re: ceph for small cluster?
  2013-01-01  0:13     ` Matthew Roy
@ 2013-01-02 12:17       ` Wido den Hollander
  0 siblings, 0 replies; 6+ messages in thread
From: Wido den Hollander @ 2013-01-02 12:17 UTC
  To: Matthew Roy; +Cc: Miles Fidelman, ceph-devel

On 01/01/2013 01:13 AM, Matthew Roy wrote:
> On Mon, Dec 31, 2012 at 9:14 AM, Miles Fidelman
> <mfidelman@meetinghouse.net> wrote:
>>
>>
>> Which raises another question: how are you combining drives within each OSD (raid, lvm, ?).
>>
>
> I'm not combining them, just running an OSD per data disk. On this
> cluster it's 2 disks for each of the 3 nodes.  I ended up that way
> only because I added the second disk to each node after getting
> started. There was an inktank blog post not too long ago about the
> performance of RAID'ed disks on OSDs that might provide quantitative
> justification for which route to go.
>
> Like Wido suggests, I also use a shared SSD for journal on each node.
> The journal's not really about speeding recovery from failed
> OSDs/disks, it's about being able to ACK writes faster and still
> retain integrity when Bad Things happen. If you're RAIDing with a
> battery-backed cache I think you can run without a journal, but I
> don't know the details on that.
>

That depends on the RAID controller. Some (like Areca, I think) cache 
O_DIRECT writes in their write cache, but others still flush them 
directly to the disks even though they have a cache with a BBU.

I'd try to avoid using any RAID system with Ceph. Let the replication 
handle everything. The less hardware and complexity you add, the less 
can fail.

Wido

> Matthew

