* ceph for small cluster?
@ 2012-12-30 21:38 Miles Fidelman
2012-12-31 9:10 ` Wido den Hollander
[not found] ` <CADXA5U1ATB+1baCfBSHmzozRe3m-HhLxHQr3be1-dASgABPQYw@mail.gmail.com>
0 siblings, 2 replies; 6+ messages in thread
From: Miles Fidelman @ 2012-12-30 21:38 UTC (permalink / raw)
To: ceph-devel
Hi Folks,
I'm wondering how ceph would work in a small cluster that supports a mix
of engineering and modest production (email, lists, web server for
several small communities).
Specifically, we have a rack with 4 medium-horsepower servers, each with
4 disk drives, running Xen (debian dom0 and domUs) - all linked together
w/ 4 gigE ethernets.
Currently, 2 of the servers are running a high-availability
configuration, using DRBD to mirror specific volumes, and pacemaker for
failover.
For a while, I've been looking for a way to replace DRBD with something
that would mirror across more than 2 servers - so that we could migrate
VMs arbitrarily - and that will work without splitting up compute vs.
storage nodes (for the short term, at least, we're stuck with rack space
and server limitations).
The thing that looks closest to filling the bill is Sheepdog (at least
architecturally) - but it only provides a KVM interface. GlusterFS,
xTreemFS, and Ceph keep coming up as possibles - with ceph's rbd
interface looking like the easiest to integrate.
Which leads me to two questions:
- On a theoretical level, does using ceph as a storage pool for this
kind of small cluster make any sense? (Notably, I'd see running an OSD, an
MDS, a MON, and client DomUs on each of the 4 nodes, using LVM to pool
all the storage; it seems like folks recommend XFS as a production
filesystem.)
- On a practical level, has anybody tried building this kind of small
cluster, and if so, what kind of results have you had?
Comments and suggestions please!
Thank you very much,
Miles Fidelman
--
In theory, there is no difference between theory and practice.
In practice, there is. .... Yogi Berra
* Re: ceph for small cluster?
2012-12-30 21:38 ceph for small cluster? Miles Fidelman
@ 2012-12-31 9:10 ` Wido den Hollander
[not found] ` <CADXA5U1ATB+1baCfBSHmzozRe3m-HhLxHQr3be1-dASgABPQYw@mail.gmail.com>
1 sibling, 0 replies; 6+ messages in thread
From: Wido den Hollander @ 2012-12-31 9:10 UTC (permalink / raw)
To: Miles Fidelman; +Cc: ceph-devel
Hi,
On 12/30/2012 10:38 PM, Miles Fidelman wrote:
> Hi Folks,
>
> I'm wondering how ceph would work in a small cluster that supports a mix
> of engineering and modest production (email, lists, web server for
> several small communities).
>
> Specifically, we have a rack with 4 medium-horsepower servers, each with
> 4 disk drives, running Xen (debian dom0 and domUs) - all linked together
> w/ 4 gigE ethernets.
>
> Currently, 2 of the servers are running a high-availability
> configuration, using DRBD to mirror specific volumes, and pacemaker for
> failover.
>
> For a while, I've been looking for a way to replace DRBD with something
> that would mirror across more than 2 servers - so that we could migrate
> VMs arbitrarily - and that will work without splitting up compute vs.
> storage nodes (for the short term, at least, we're stuck with rack space
> and server limitations).
>
> The thing that looks closest to filling the bill is Sheepdog (at least
> architecturally) - but it only provides a KVM interface. GlusterFS,
> xTreemFS, and Ceph keep coming up as possibles - with ceph's rbd
> interface looking like the easiest to integrate.
>
> Which leads me to two questions:
>
> - On a theoretical level, does using ceph as a storage pool for this
> kind of small cluster make any sense (notably, I'd see running an OSD, a
> MDS, a MON, and client DomUs on each of the 4 nodes, using LVM to pool
> all the storage and it seems like folks recommend XFS as a production
> filesystem)
>
Yes, that could work. But you have to keep in mind that OSDs can spike
in both CPU and memory when they have to do recovery work for a failed
node/OSD.
Also, with RBD you don't need an MDS. As a last note, you should always
have an odd number of monitors. So run a monitor on 3 of the 4 machines.
The monitors work by a voting principle where they need a majority. An
odd number is best in that situation.
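A rough illustration of the majority rule, as plain arithmetic rather than anything taken from Ceph itself (the function names are made up for this sketch). It shows why a fourth monitor buys nothing over three: the quorum grows, but the number of monitors you can afford to lose does not.

```python
def quorum_size(monitors: int) -> int:
    """Smallest strict majority among `monitors` voters."""
    return monitors // 2 + 1


def tolerated_failures(monitors: int) -> int:
    """Monitors that can be down while a majority can still be formed."""
    return monitors - quorum_size(monitors)


for n in range(1, 6):
    print(f"{n} monitors: quorum={quorum_size(n)}, can lose {tolerated_failures(n)}")
```

With 3 monitors the quorum is 2 and one may fail; with 4 the quorum is 3 and still only one may fail.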
> - On a practical level, has anybody tried building this kind of small
> cluster, and if so, what kind of results have you had?
>
I have built some small Ceph clusters, sometimes with just 3 nodes. It works,
but you have to keep in mind that when one node in a 4 node cluster
fails you will lose 25% of the capacity.
This will lead to a heavy recovery within the Ceph cluster, which will
put a lot of pressure on the Gbit links and on the CPUs and memory of the
nodes.
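A back-of-the-envelope sketch of that pressure. Only the 4-node count comes from this thread; the amount of data on the failed node and the single 1 Gbit/s link are hypothetical numbers picked for illustration, and the helper names are invented here. Real recovery is spread over the surviving nodes and competes with client I/O, so treat the result as an order-of-magnitude lower bound, not a prediction.

```python
def lost_capacity_fraction(nodes: int) -> float:
    """Share of raw capacity that disappears when one of `nodes` equal nodes fails."""
    return 1 / nodes


def min_recovery_seconds(data_gb: float, link_gbit_per_s: float) -> float:
    """Lower bound on re-replication time if `data_gb` must cross one link at line rate."""
    gb_per_second = link_gbit_per_s / 8  # gigabit/s to gigabyte/s, ignoring protocol overhead
    return data_gb / gb_per_second


print(lost_capacity_fraction(4))  # 0.25

# Hypothetical: 2000 GB stored on the failed node, one 1 Gbit/s link.
hours = min_recovery_seconds(2000, 1.0) / 3600
print(f"at least {hours:.1f} h of line-rate traffic")
```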
With RBD you might want to consider adding an SSD for the OSD journals;
that will give you a pretty nice performance boost.
Wido
> Comments and suggestions please!
>
> Thank you very much,
>
> Miles Fidelman
>
[parent not found: <CADXA5U1ATB+1baCfBSHmzozRe3m-HhLxHQr3be1-dASgABPQYw@mail.gmail.com>]
* Re: ceph for small cluster?
[not found] ` <CADXA5U1ATB+1baCfBSHmzozRe3m-HhLxHQr3be1-dASgABPQYw@mail.gmail.com>
@ 2012-12-31 14:14 ` Miles Fidelman
2013-01-01 0:13 ` Matthew Roy
0 siblings, 1 reply; 6+ messages in thread
From: Miles Fidelman @ 2012-12-31 14:14 UTC (permalink / raw)
Cc: ceph-devel
Matt, Thanks for the comments. A follow-up if I might (inline):
Matthew Roy wrote:
> What I'm not doing that you'd need to test is running VMs on the same
> servers as storage. I'd be careful about mounting RBD volumes on the
> OSDs, you can run into kernel deadlock trying to write out things to
> physical disk when trying to write to the mounted volume. Mounts
> inside VMs should be okay.
I was thinking of pinning one CPU to Dom0 and running the OSD
from there, mounting RBD volumes only in DomUs, and leaving a bit
of disk space outside the OSD for booting and Dom0.
Which raises another question: how are you combining drives within each
OSD (raid, lvm, ?).
Thanks again,
Miles
--
In theory, there is no difference between theory and practice.
In practice, there is. .... Yogi Berra
* Re: ceph for small cluster?
2012-12-31 14:14 ` Miles Fidelman
@ 2013-01-01 0:13 ` Matthew Roy
2013-01-02 12:17 ` Wido den Hollander
0 siblings, 1 reply; 6+ messages in thread
From: Matthew Roy @ 2013-01-01 0:13 UTC (permalink / raw)
To: Miles Fidelman; +Cc: ceph-devel
On Mon, Dec 31, 2012 at 9:14 AM, Miles Fidelman
<mfidelman@meetinghouse.net> wrote:
>
>
> Which raises another question: how are you combining drives within each OSD (raid, lvm, ?).
>
I'm not combining them, just running an OSD per data disk. On this
cluster it's 2 disks for each of the 3 nodes. I ended up that way
only because I added the second disk to each node after getting
started. There was an inktank blog post not too long ago about the
performance of RAID'ed disks on OSDs that might provide quantitative
justification for which route to go.
Like Wido suggests, I also use a shared SSD for journal on each node.
The journal's not really about speeding recovery from failed
OSDs/disks, it's about being able to ACK writes faster and still
retain integrity when Bad Things happen. If you're RAIDing with a
battery-backed cache I think you can run without a journal, but I
don't know the details on that.
Matthew
* Re: ceph for small cluster?
2013-01-01 0:13 ` Matthew Roy
@ 2013-01-02 12:17 ` Wido den Hollander
0 siblings, 0 replies; 6+ messages in thread
From: Wido den Hollander @ 2013-01-02 12:17 UTC (permalink / raw)
To: Matthew Roy; +Cc: Miles Fidelman, ceph-devel
On 01/01/2013 01:13 AM, Matthew Roy wrote:
> On Mon, Dec 31, 2012 at 9:14 AM, Miles Fidelman
> <mfidelman@meetinghouse.net> wrote:
>>
>>
>> Which raises another question: how are you combining drives within each OSD (raid, lvm, ?).
>>
>
> I'm not combining them, just running an OSD per data disk. On this
> cluster it's 2 disks for each of the 3 nodes. I ended up that way
> only because I added the second disk to each node after getting
> started. There was an inktank blog post not too long ago about the
> performance of RAID'ed disks on OSDs that might provide quantitative
> justification for which route to go.
>
> Like Wido suggests, I also use a shared SSD for journal on each node.
> The journal's not really about speeding recovery from failed
> OSDs/disks, it's about being able to ACK writes faster and still
> retain integrity when Bad Things happen. If you're RAIDing with a
> battery-backed cache I think you can run without a journal, but I
> don't know the details on that.
>
That depends on the RAID controller. Some (like Areca I think) cache
O_DIRECT writes in their write cache, but others still flush them
directly to the disks although they have a cache with BBU.
I'd try to avoid using any RAID system with Ceph. Let the replication
handle everything. The less hardware and complexity you add, the less
can fail.
Wido
> Matthew
[parent not found: <50E19E34.7080005@meetinghouse.net>]
* Re: ceph for small cluster?
[not found] <50E19E34.7080005@meetinghouse.net>
@ 2012-12-31 14:21 ` Miles Fidelman
0 siblings, 0 replies; 6+ messages in thread
From: Miles Fidelman @ 2012-12-31 14:21 UTC (permalink / raw)
To: ceph-devel
Wido, Thanks for the comment, a follow-up if I might (below)?
Wido den Hollander wrote:
>
> I have built some small Ceph clusters, sometimes with just 3 nodes. It works,
> but you have to keep in mind that when one node in a 4 node cluster
> fails you will lose 25% of the capacity.
>
> This will lead to a heavy recovery within the Ceph cluster, which will
> put a lot of pressure on the Gbit links and on the CPUs and memory of
> the nodes.
>
> With RBD you might want to consider adding an SSD for the journaling
> of the OSDs, that will give you a pretty nice performance boost.
>
Would not journalling alone, say on a separate hard disk volume, help
with recovery?
Thanks,
Miles
--
In theory, there is no difference between theory and practice.
In practice, there is. .... Yogi Berra