* Using other filesystems than btrfs with Ceph
@ 2010-06-11 14:23 Peter Niemayer
2010-06-11 16:26 ` Gregory Farnum
0 siblings, 1 reply; 6+ messages in thread
From: Peter Niemayer @ 2010-06-11 14:23 UTC
To: ceph-devel
Hi,
the release notes of 0.20 state "btrfs no longer strictly required".
Is there some documentation/discussion on the pros and cons of using
other filesystems with Ceph?
Also, the documentation in the Wiki does not mention what would need
to be configured differently if another filesystem was to be used -
what would I have to use instead of "btrfs devs = /dev/sdy"?
The reason why I ask is that the application I would like to test-run
on a minimal Ceph-cluster runs much faster when using XFS than btrfs.
Also, XFS is not quite as young and experimental as btrfs, so if there is
no specific benefit from using btrfs, it would be a reasonable choice
to use the much more mature XFS for now.
Any thoughts on this?
Regards,
Peter Niemayer
* Re: Using other filesystems than btrfs with Ceph
2010-06-11 14:23 Using other filesystems than btrfs with Ceph Peter Niemayer
@ 2010-06-11 16:26 ` Gregory Farnum
2010-06-11 16:40 ` Sage Weil
2010-06-11 16:42 ` Peter Niemayer
0 siblings, 2 replies; 6+ messages in thread
From: Gregory Farnum @ 2010-06-11 16:26 UTC
To: Peter Niemayer; +Cc: ceph-devel
On Fri, Jun 11, 2010 at 7:23 AM, Peter Niemayer <niemayer@isg.de> wrote:
> Is there some documentation/discussion on the pros and cons of using
> other filesystems with Ceph?
Unfortunately, I can't seem to find anything.
> Also, the documentation in the Wiki does not mention what would need
> to be configured differently if another filesystem was to be used -
> what would I have to use instead of "btrfs devs = /dev/sdy"?
You can look at the ceph.conf file produced by the vstart script for a
non-btrfs example -- just use
    osd data = path
    osd journal = path2
    osd journal size = [# in MB]
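
A minimal sketch of what such a fragment could look like once filled in, generated here with Python's standard configparser. The section name "osd", both paths and the journal size are placeholder assumptions, not values from the vstart output; only the three setting names come from the lines above.

```python
import configparser
import io


def render_osd_conf(section, data_path, journal_path, journal_size_mb):
    """Render the three non-btrfs OSD settings as an INI-style fragment."""
    parser = configparser.ConfigParser()
    parser[section] = {
        "osd data": data_path,                     # directory on any local fs
        "osd journal": journal_path,               # file or device for the journal
        "osd journal size": str(journal_size_mb),  # size in MB
    }
    buf = io.StringIO()
    parser.write(buf)
    return buf.getvalue()


# Hypothetical values: section name, paths and size are placeholders.
conf_text = render_osd_conf("osd", "/srv/ceph/osd0", "/srv/ceph/osd0.journal", 1000)
print(conf_text)
```

The data path would simply point at a directory on the XFS (or other) mount; whether your Ceph version expects these keys in a global or per-daemon section should be checked against the generated vstart ceph.conf.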
> The reason why I ask is that the application I would like to test-run
> on a minimal Ceph-cluster runs much faster when using XFS than btrfs.
> Also, XFS is not quite as young and experimental as btrfs, so if there is
> no specific benefit from using btrfs, it would be a reasonable choice
> to use the much more mature XFS for now.
Well, it's possible that you could improve Ceph's performance in
certain workloads by using different underlying filesystems, but in
general Ceph's interfaces and protocols are going to matter a lot
more, and btrfs works very well with it. The fact that XFS is faster
than btrfs in your workload doesn't necessarily mean that Ceph on XFS
will be faster than Ceph on btrfs for your job.
That said, btrfs is most beneficial when you engage in snapshotting,
or have to handle recoveries. Ceph makes use of a number of ioctls to
ensure consistency and provide speed when running on btrfs; on other
filesystems snapshots will be much slower, and OSDs will be more
likely to lose data in the case of a power failure or similar problem.
If these aren't big problems for you, you can run Ceph on whatever fs
you like.
-Greg
* Re: Using other filesystems than btrfs with Ceph
2010-06-11 16:26 ` Gregory Farnum
@ 2010-06-11 16:40 ` Sage Weil
2010-06-11 16:47 ` Peter Niemayer
2010-06-11 16:42 ` Peter Niemayer
1 sibling, 1 reply; 6+ messages in thread
From: Sage Weil @ 2010-06-11 16:40 UTC
To: Gregory Farnum; +Cc: Peter Niemayer, ceph-devel
On Fri, 11 Jun 2010, Gregory Farnum wrote:
> That said, btrfs is most beneficial when you engage in snapshotting,
> or have to handle recoveries. Ceph makes use of a number of ioctls to
> ensure consistency and provide speed when running on btrfs; on other
> filesystems snapshots will be much slower, and OSDs will be more
> likely to lose data in the case of a power failure or similar problem.
> If these aren't big problems for you, you can run Ceph on whatever fs
> you like.
Right.
btrfs isn't required for consistency if the writeahead journal is
enabled (which it is by default). However, at the moment the code that
controls trimming the journal assumes ext3 data=ordered fsync semantics
(fsync flushes the entire journal and all prior writes). This needs a
little bit of work to do the right thing with ext4 and xfs.
So: I would stick with btrfs or ext3 for now if you want recovery to work
reliably!
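
To make the trimming problem concrete, here is a toy write-ahead journal. It is a generic sketch, not Ceph's code: the class name TinyJournal, the one-JSON-record-per-line layout and the file names are all invented for illustration. Each write is made durable in the journal first and then applied to the data directory; the journal may only be trimmed once the applied writes are known to be on disk. The sketch fsyncs every touched file individually instead of relying on one fsync flushing all prior writes, which is the ext3 data=ordered assumption described above.

```python
import json
import os
import tempfile


class TinyJournal:
    """Toy write-ahead journal (hypothetical, one JSON record per line)."""

    def __init__(self, journal_path, data_dir):
        self.journal_path = journal_path
        self.data_dir = data_dir
        self.next_seq = 1
        self.pending = []  # journaled records not yet known durable in data_dir

    def write(self, name, content):
        record = {"seq": self.next_seq, "name": name, "content": content}
        # 1. Make the intent durable in the journal before touching the data.
        with open(self.journal_path, "a") as j:
            j.write(json.dumps(record) + "\n")
            j.flush()
            os.fsync(j.fileno())
        # 2. Apply to the data dir; this may still sit in the page cache.
        with open(os.path.join(self.data_dir, name), "w") as f:
            f.write(content)
        self.pending.append(record)
        self.next_seq += 1
        return record["seq"]

    def commit_and_trim(self):
        # Flush each applied file explicitly; a single fsync() is only
        # enough if the filesystem flushes all prior writes along with it.
        for name in sorted({r["name"] for r in self.pending}):
            fd = os.open(os.path.join(self.data_dir, name), os.O_RDWR)
            try:
                os.fsync(fd)
            finally:
                os.close(fd)
        trimmed = len(self.pending)
        self.pending.clear()
        # Only now is it safe to drop the journaled records.
        with open(self.journal_path, "w") as j:
            j.flush()
            os.fsync(j.fileno())
        return trimmed


tmp = tempfile.mkdtemp()
data_dir = os.path.join(tmp, "osd-data")
os.mkdir(data_dir)
journal = TinyJournal(os.path.join(tmp, "journal"), data_dir)
journal.write("obj_a", "hello")
journal.write("obj_b", "world")
size_before = os.path.getsize(journal.journal_path)
trimmed = journal.commit_and_trim()
size_after = os.path.getsize(journal.journal_path)
print(trimmed, size_before, size_after)
```

If the trim happened before the data files were really on disk, a crash in between would leave neither the journal nor the data dir holding the write, which is the failure mode to avoid.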
sage
* Re: Using other filesystems than btrfs with Ceph
2010-06-11 16:26 ` Gregory Farnum
2010-06-11 16:40 ` Sage Weil
@ 2010-06-11 16:42 ` Peter Niemayer
1 sibling, 0 replies; 6+ messages in thread
From: Peter Niemayer @ 2010-06-11 16:42 UTC
To: ceph-devel
On 06/11/2010 06:26 PM, Gregory Farnum wrote:
>> Also, the documentation in the Wiki does not mention what would need
>> to be configured differently if another filesystem was to be used -
>> what would I have to use instead of "btrfs devs = /dev/sdy"?
> You can look at the ceph.conf file produced by the vstart script for a
> non-btrfs example -- just use
> osd data = path
> osd journal = path2
> osd journal size = [# in MB]
Thanks for this info!
> Well, it's possible that you could improve Ceph's performance in
> certain workloads by using different underlying filesystems, but in
> general Ceph's interfaces and protocols are going to matter a lot
> more, and btrfs works very well with it.
I would be interested to know what slow-down factor to expect when
setting up a minimal one-node Ceph service, where all daemons run on
localhost and just one local filesystem is created for the OSD,
compared with using btrfs directly, without Ceph, on the same storage.
Would you expect a significant decrease in sequential read/write throughput?
And what about latencies for small reads/writes?
I wonder what an acceptable design goal would be if we leave physical
network equipment and cabling out of the picture for the moment...
Regards,
Peter Niemayer
* Re: Using other filesystems than btrfs with Ceph
2010-06-11 16:40 ` Sage Weil
@ 2010-06-11 16:47 ` Peter Niemayer
2010-06-11 16:54 ` Sage Weil
0 siblings, 1 reply; 6+ messages in thread
From: Peter Niemayer @ 2010-06-11 16:47 UTC
To: ceph-devel
On 06/11/2010 06:40 PM, Sage Weil wrote:
> btrfs isn't required for consistency if the writeahead journal is
> enabled (which it is by default). However, at the moment the code that
> controls trimming the journal assumes ext3 data=ordered fsync semantics
> (fsync flushes the entire journal and all prior writes). This needs a
> little bit of work to do the right thing with ext4 and xfs.
>
> So: I would stick with btrfs or ext3 for now if you want recovery to work
> reliably!
Is the recovery you are referring to here an operation that is required...
a) after an outage that involved many/all redundant OSDs
b) after a physical failure of one underlying storage device
c) after every disconnect/reconnect of Ceph nodes
?
Regards,
Peter Niemayer
* Re: Using other filesystems than btrfs with Ceph
2010-06-11 16:47 ` Peter Niemayer
@ 2010-06-11 16:54 ` Sage Weil
0 siblings, 0 replies; 6+ messages in thread
From: Sage Weil @ 2010-06-11 16:54 UTC
To: Peter Niemayer; +Cc: ceph-devel
On Fri, 11 Jun 2010, Peter Niemayer wrote:
> On 06/11/2010 06:40 PM, Sage Weil wrote:
> > btrfs isn't required for consistency if the writeahead journal is
> > enabled (which it is by default). However, at the moment the code that
> > controls trimming the journal assumes ext3 data=ordered fsync semantics
> > (fsync flushes the entire journal and all prior writes). This needs a
> > little bit of work to do the right thing with ext4 and xfs.
> >
> > So: I would stick with btrfs or ext3 for now if you want recovery to work
> > reliably!
>
> Is the recovery you are referring to here an operation that is required...
>
> a) after an outage that involved many/all redundant OSDs
> b) after a physical failure of one underlying storage device
> c) after every disconnect/reconnect of Ceph nodes
After an OSD node crash.
The challenge is keeping the contents of the osd data dir in a fully
consistent state. The writeahead journal lets us do that, but it needs to
know when previous operations have fully committed to disk so it can trim.
Currently there's a simple fsync() in there to do that, but something
trickier is required for ext4 and xfs. A per-mount sync(2) type operation
would be ideal.
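
As an illustration of why the trim point matters after a crash, here is a generic replay sketch. It is not Ceph's recovery code: the function name replay_journal, the JSON-lines journal format and the committed_seq bookkeeping are assumptions made for the example. On restart, every journal record newer than the last sequence number known to be fully on disk is re-applied to the data dir, and a torn final record left by the crash ends the replay.

```python
import json
import os
import tempfile


def replay_journal(journal_path, data_dir, committed_seq):
    """Re-apply records newer than committed_seq; return their sequence numbers."""
    applied = []
    with open(journal_path) as j:
        for line in j:
            line = line.strip()
            if not line:
                continue
            try:
                record = json.loads(line)
            except json.JSONDecodeError:
                break  # torn record from the crash: nothing valid follows
            if record["seq"] <= committed_seq:
                continue  # already durable in the data dir
            with open(os.path.join(data_dir, record["name"]), "w") as f:
                f.write(record["content"])
            applied.append(record["seq"])
    return applied


# Simulated state after a crash (all names and contents are made up).
tmp = tempfile.mkdtemp()
data_dir = os.path.join(tmp, "osd-data")
os.mkdir(data_dir)
journal_path = os.path.join(tmp, "journal")
records = [
    {"seq": 1, "name": "obj_a", "content": "hello"},
    {"seq": 2, "name": "obj_b", "content": "world"},
    {"seq": 3, "name": "obj_a", "content": "hello again"},
]
with open(journal_path, "w") as j:
    for r in records:
        j.write(json.dumps(r) + "\n")
    j.write('{"seq": 4, "name": "obj_c", "cont')  # write cut off by the crash

# Only seq 1 reached the data dir before the crash; later writes were lost.
with open(os.path.join(data_dir, "obj_a"), "w") as f:
    f.write("hello")

replayed = replay_journal(journal_path, data_dir, committed_seq=1)
print(replayed)
```

The replay is only correct if committed_seq never runs ahead of what is really on disk, which is why the journal needs a reliable "everything before this point has committed" signal from the underlying filesystem.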
sage