FileStore should not use syncfs(2)

* FileStore should not use syncfs(2)
@ 2015-08-05 21:26 Sage Weil
  2015-08-05 21:38 ` Somnath Roy
                   ` (3 more replies)
  0 siblings, 4 replies; 11+ messages in thread
From: Sage Weil @ 2015-08-05 21:26 UTC (permalink / raw)
  To: Somnath.Roy; +Cc: ceph-devel, sjust

Today I learned that syncfs(2) does an O(n) search of the superblock's 
inode list searching for dirty items.  I've always assumed that it was 
only traversing dirty inodes (e.g., a list of dirty inodes), but that 
appears not to be the case, even on the latest kernels.

That means that the more RAM in the box, the larger (generally) the inode 
cache, the longer syncfs(2) will take, and the more CPU you'll waste doing 
it.  The box I was looking at had 256GB of RAM, 36 OSDs, and a load of ~40 
servicing a very light workload, and each syncfs(2) call was taking ~7 
seconds (usually to write out a single inode).
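
To make the scaling concrete, here is a toy model (not kernel code) that contrasts a walk over every cached inode with a walk over a dirty-only list. The `Inode` class and both function names are hypothetical and exist only for illustration; the point is that the first loop's work grows with the size of the inode cache even when a single inode is dirty.

```python
from dataclasses import dataclass


@dataclass
class Inode:
    """Hypothetical stand-in for a cached inode; not a kernel structure."""
    ino: int
    dirty: bool = False


def scan_all_inodes(cache):
    """Visit every cached inode and collect the dirty ones: O(len(cache))."""
    visited = 0
    dirty = []
    for inode in cache:
        visited += 1
        if inode.dirty:
            dirty.append(inode.ino)
    return visited, dirty


def scan_dirty_list(dirty_list):
    """Visit only inodes already tracked as dirty: O(len(dirty_list))."""
    return len(dirty_list), [inode.ino for inode in dirty_list]


# One dirty inode in a cache of 100,000 entries.
cache = [Inode(i) for i in range(100_000)]
cache[42].dirty = True
dirty_only = [cache[42]]

print(scan_all_inodes(cache))       # (100000, [42])
print(scan_dirty_list(dirty_only))  # (1, [42])
```

Both calls find the same dirty inode; only the number of inodes visited differs.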

A possible workaround for such boxes is to turn 
/proc/sys/vm/vfs_cache_pressure way up (so that the kernel favors caching 
pages instead of inodes/dentries)...
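
A small sketch of inspecting and changing that knob from a script, assuming a Linux host. The sysctl path is the one named above; the helper names are hypothetical, and the `path` parameter exists only so the helpers can be tried against an ordinary file. The kernel default is 100, and writing the real sysctl requires root.

```python
from pathlib import Path

# Real sysctl path on Linux; the helper names below are hypothetical.
VFS_CACHE_PRESSURE = "/proc/sys/vm/vfs_cache_pressure"


def read_cache_pressure(path=VFS_CACHE_PRESSURE):
    """Return the current vfs_cache_pressure value as an int."""
    return int(Path(path).read_text().strip())


def write_cache_pressure(value, path=VFS_CACHE_PRESSURE):
    """Set a new value; writing the real sysctl requires root."""
    if value < 0:
        raise ValueError("vfs_cache_pressure must be non-negative")
    Path(path).write_text(f"{value}\n")
```

Calling `read_cache_pressure()` with no arguments reads the live value; how far "way up" should be on a given box is something to measure, not something this sketch decides.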

I think the take-away though is that we do need to bite the bullet and 
make FileStore f[data]sync all the right things so that the syncfs call 
can be avoided.  This is the path you were originally headed down, 
Somnath, and I think it's the right one.

The main thing to watch out for is that according to POSIX you really need 
to fsync directories.  With XFS that isn't the case since all metadata 
operations are going into the journal and that's fully ordered, but we 
don't want to allow data loss on e.g. ext4 (we need to check what the 
metadata ordering behavior is there) or other file systems.
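
A minimal sketch of the pattern in question, assuming Linux and using Python's `os` module rather than FileStore's C++: write the file, fsync it, then fsync the parent directory so the new directory entry is also durable. The function names are hypothetical and this is not FileStore code.

```python
import os


def fsync_dir(path):
    """fsync a directory so entries created or removed in it are durable."""
    fd = os.open(path, os.O_RDONLY | os.O_DIRECTORY)
    try:
        os.fsync(fd)
    finally:
        os.close(fd)


def durable_create(path, data):
    """Create a new file, then fsync the file and its parent directory."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o644)
    try:
        view = memoryview(data)
        while view:  # os.write may write fewer bytes than requested
            view = view[os.write(fd, view):]
        os.fsync(fd)  # file data and inode
    finally:
        os.close(fd)
    # The directory entry pointing at the new file.
    fsync_dir(os.path.dirname(os.path.abspath(path)))
```

On a file system that already orders metadata through its journal the directory fsync should be cheap, but issuing it keeps the code correct where that guarantee does not hold.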

:(

sage


* RE: FileStore should not use syncfs(2)
  2015-08-05 21:26 FileStore should not use syncfs(2) Sage Weil
@ 2015-08-05 21:38 ` Somnath Roy
  2015-08-06  2:17   ` Haomai Wang
  2015-08-05 21:55 ` Mark Nelson
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 11+ messages in thread
From: Somnath Roy @ 2015-08-05 21:38 UTC (permalink / raw)
  To: Sage Weil; +Cc: ceph-devel@vger.kernel.org, sjust@redhat.com

Thanks, Sage, for digging into this. I was suspecting something similar. As I mentioned in today's call, syncfs takes ~60 ms even when the system is idle. I have 64 GB of RAM in the system.
The workaround I was talking about today is working pretty well so far. In this implementation, syncfs has little work to do because each worker thread writes with O_DSYNC. I issue syncfs before trimming the journal, and most of the time it takes < 100 ms.
I now have to wake up the sync_thread after each worker thread finishes writing. I will benchmark both approaches. As we discussed earlier, with the fsync-only approach we still need to do a db sync to make sure the leveldb data is persisted, right?
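
For readers unfamiliar with O_DSYNC, a minimal sketch of a write that returns only once the data is on stable storage, assuming Linux. The function name is hypothetical, and opening the file on every call is a simplification; a worker thread would presumably keep the descriptor open.

```python
import os


def write_dsync(path, offset, data):
    """Write data at offset through a descriptor opened with O_DSYNC, so each
    write returns only after the data has reached stable storage."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_DSYNC, 0o644)
    try:
        view = memoryview(data)
        while view:  # pwrite may write fewer bytes than requested
            n = os.pwrite(fd, view, offset)
            offset += n
            view = view[n:]
    finally:
        os.close(fd)
```

Because every write is synchronous, a later syncfs finds little dirty data left to flush, which matches the short syncfs times reported above.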

Thanks & Regards
Somnath


* Re: FileStore should not use syncfs(2)
  2015-08-05 21:26 FileStore should not use syncfs(2) Sage Weil
  2015-08-05 21:38 ` Somnath Roy
@ 2015-08-05 21:55 ` Mark Nelson
  2015-08-07  6:50   ` Chen, Xiaoxi
  2015-08-06  9:44 ` Yan, Zheng
  2015-08-06 11:27 ` Christoph Hellwig
  3 siblings, 1 reply; 11+ messages in thread
From: Mark Nelson @ 2015-08-05 21:55 UTC (permalink / raw)
  To: Sage Weil, Somnath.Roy; +Cc: ceph-devel, sjust



On 08/05/2015 04:26 PM, Sage Weil wrote:
> Today I learned that syncfs(2) does an O(n) search of the superblock's
> inode list searching for dirty items.  I've always assumed that it was
> only traversing dirty inodes (e.g., a list of dirty inodes), but that
> appears not to be the case, even on the latest kernels.
>
> That means that the more RAM in the box, the larger (generally) the inode
> cache, the longer syncfs(2) will take, and the more CPU you'll waste doing
> it.  The box I was looking at had 256GB of RAM, 36 OSDs, and a load of ~40
> servicing a very light workload, and each syncfs(2) call was taking ~7
> seconds (usually to write out a single inode).
>
> A possible workaround for such boxes is to turn
> /proc/sys/vm/vfs_cache_pressure way up (so that the kernel favors caching
> pages instead of inodes/dentries)...

FWIW, I often see a performance increase when favoring the inode/dentry 
cache, but probably with far fewer inodes than the setup you just saw.  It 
sounds like there needs to be some maximum limit on the inode/dentry 
cache to prevent this kind of behavior while still favoring it up until 
that point.  Having said that, maybe avoiding syncfs is best, as you say below.

>
> I think the take-away though is that we do need to bite the bullet and
> make FileStore f[data]sync all the right things so that the syncfs call
> can be avoided.  This is the path you were originally headed down,
> Somnath, and I think it's the right one.
>
> The main thing to watch out for is that according to POSIX you really need
> to fsync directories.  With XFS that isn't the case since all metadata
> operations are going into the journal and that's fully ordered, but we
> don't want to allow data loss on e.g. ext4 (we need to check what the
> metadata ordering behavior is there) or other file systems.
>
> :(
>
> sage

* Re: FileStore should not use syncfs(2)
  2015-08-05 21:38 ` Somnath Roy
@ 2015-08-06  2:17   ` Haomai Wang
  2015-08-06 12:47     ` Sage Weil
  0 siblings, 1 reply; 11+ messages in thread
From: Haomai Wang @ 2015-08-06  2:17 UTC (permalink / raw)
  To: Somnath Roy; +Cc: Sage Weil, ceph-devel@vger.kernel.org, sjust@redhat.com

Agree

On Thu, Aug 6, 2015 at 5:38 AM, Somnath Roy <Somnath.Roy@sandisk.com> wrote:
> Thanks, Sage, for digging into this. I was suspecting something similar. As I mentioned in today's call, syncfs takes ~60 ms even when the system is idle. I have 64 GB of RAM in the system.
> The workaround I was talking about today is working pretty well so far. In this implementation, syncfs has little work to do because each worker thread writes with O_DSYNC. I issue syncfs before trimming the journal, and most of the time it takes < 100 ms.

Actually, I would prefer that we stop using syncfs altogether. I would
rather use "aio+dio+FileStore custom cache" to handle everything that
"syncfs+pagecache" does today. First, we could make the cache smarter and
aware of the upper layers, instead of relying on fadvise* calls. Second,
we could use a "checkpoint" method like MySQL InnoDB: we know the
bandwidth of the frontend (FileJournal) and can decide how much and how
often we want to flush (using aio+dio).

Anyway, because it's a big project, we may prefer to do this work in
NewStore instead of FileStore.
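
One possible reading of the "how much and how often" calculation, as plain arithmetic. This is an illustrative sketch only: the function name, its parameters, and the `max_fill` policy are assumptions, and it is not taken from InnoDB or from any Ceph code.

```python
def plan_checkpoint(journal_bytes, frontend_bw, max_fill=0.5):
    """Return (interval_seconds, bytes_per_checkpoint).

    journal_bytes: journal capacity in bytes
    frontend_bw:   observed journal write bandwidth in bytes/second
    max_fill:      fraction of the journal allowed to fill between checkpoints
    """
    if journal_bytes <= 0 or frontend_bw <= 0 or not 0 < max_fill <= 1:
        raise ValueError("sizes must be positive and max_fill in (0, 1]")
    budget = journal_bytes * max_fill  # bytes that may accumulate
    interval = budget / frontend_bw    # seconds until the budget is used up
    return interval, int(budget)
```

For a 10 GiB journal written at 100 MiB/s with the default `max_fill`, this yields a checkpoint roughly every 51 seconds, each flushing 5 GiB.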

> I now have to wake up the sync_thread after each worker thread finishes writing. I will benchmark both approaches. As we discussed earlier, with the fsync-only approach we still need to do a db sync to make sure the leveldb data is persisted, right?
>
> Thanks & Regards
> Somnath
>
> -----Original Message-----
> From: Sage Weil [mailto:sweil@redhat.com]
> Sent: Wednesday, August 05, 2015 2:27 PM
> To: Somnath Roy
> Cc: ceph-devel@vger.kernel.org; sjust@redhat.com
> Subject: FileStore should not use syncfs(2)
>
> Today I learned that syncfs(2) does an O(n) search of the superblock's inode list searching for dirty items.  I've always assumed that it was only traversing dirty inodes (e.g., a list of dirty inodes), but that appears not to be the case, even on the latest kernels.
>
> That means that the more RAM in the box, the larger (generally) the inode cache, the longer syncfs(2) will take, and the more CPU you'll waste doing it.  The box I was looking at had 256GB of RAM, 36 OSDs, and a load of ~40 servicing a very light workload, and each syncfs(2) call was taking ~7 seconds (usually to write out a single inode).
>
> A possible workaround for such boxes is to turn /proc/sys/vm/vfs_cache_pressure way up (so that the kernel favors caching pages instead of inodes/dentries)...
>
> I think the take-away though is that we do need to bite the bullet and make FileStore f[data]sync all the right things so that the syncfs call can be avoided.  This is the path you were originally headed down, Somnath, and I think it's the right one.
>
> The main thing to watch out for is that according to POSIX you really need to fsync directories.  With XFS that isn't the case since all metadata operations are going into the journal and that's fully ordered, but we don't want to allow data loss on e.g. ext4 (we need to check what the metadata ordering behavior is there) or other file systems.

I guess there are only a few directory-modifying operations; is that true?
Maybe we only need to do syncfs when modifying directories?

>
> :(
>
> sage

-- 
Best Regards,

Wheat


* Re: FileStore should not use syncfs(2)
  2015-08-05 21:26 FileStore should not use syncfs(2) Sage Weil
  2015-08-05 21:38 ` Somnath Roy
  2015-08-05 21:55 ` Mark Nelson
@ 2015-08-06  9:44 ` Yan, Zheng
  2015-08-06 12:57   ` Sage Weil
  2015-08-06 11:27 ` Christoph Hellwig
  3 siblings, 1 reply; 11+ messages in thread
From: Yan, Zheng @ 2015-08-06  9:44 UTC (permalink / raw)
  To: Sage Weil; +Cc: Somnath.Roy, ceph-devel, sjust

On Thu, Aug 6, 2015 at 5:26 AM, Sage Weil <sweil@redhat.com> wrote:
> Today I learned that syncfs(2) does an O(n) search of the superblock's
> inode list searching for dirty items.  I've always assumed that it was
> only traversing dirty inodes (e.g., a list of dirty inodes), but that
> appears not to be the case, even on the latest kernels.
>

I checked the syncfs code in the 3.10 and 4.1 kernels. I think both
kernels only traverse dirty inodes (inodes in the
bdi_writeback::{b_dirty,b_io,b_more_io} lists). What am I missing?



* Re: FileStore should not use syncfs(2)
  2015-08-05 21:26 FileStore should not use syncfs(2) Sage Weil
                   ` (2 preceding siblings ...)
  2015-08-06  9:44 ` Yan, Zheng
@ 2015-08-06 11:27 ` Christoph Hellwig
  2015-08-06 13:00   ` Sage Weil
  3 siblings, 1 reply; 11+ messages in thread
From: Christoph Hellwig @ 2015-08-06 11:27 UTC (permalink / raw)
  To: Sage Weil; +Cc: Somnath.Roy, ceph-devel, sjust

On Wed, Aug 05, 2015 at 02:26:30PM -0700, Sage Weil wrote:
> Today I learned that syncfs(2) does an O(n) search of the superblock's 
> inode list searching for dirty items.  I've always assumed that it was 
> only traversing dirty inodes (e.g., a list of dirty inodes), but that 
> appears not to be the case, even on the latest kernels.

I'm pretty sure Dave had some patches for that.  Even if they aren't
included, it's not an unsolved problem.

> The main thing to watch out for is that according to POSIX you really need 
> to fsync directories.  With XFS that isn't the case since all metadata 
> operations are going into the journal and that's fully ordered, but we 
> don't want to allow data loss on e.g. ext4 (we need to check what the 
> metadata ordering behavior is there) or other file systems.

That additional fsync in XFS is basically free, so better to get it right
and let the file system micro-optimize for you.



* Re: FileStore should not use syncfs(2)
  2015-08-06  2:17   ` Haomai Wang
@ 2015-08-06 12:47     ` Sage Weil
  0 siblings, 0 replies; 11+ messages in thread
From: Sage Weil @ 2015-08-06 12:47 UTC (permalink / raw)
  To: Haomai Wang; +Cc: Somnath Roy, ceph-devel@vger.kernel.org, sjust@redhat.com

On Thu, 6 Aug 2015, Haomai Wang wrote:
> I guess there are only a few directory-modifying operations; is that true?
> Maybe we only need to do syncfs when modifying directories?

I'd say there are a few broad cases:

 - creating or deleting objects.  simply fsyncing the file is 
sufficient on XFS; we should confirm what the behavior is on other 
distros.  But even if we do the fsync on the dir this is simple to 
implement.

 - renaming objects (collection_move_rename).  Easy to add an fsync here.

 - HashIndex rehashing.  This is where I get nervous... and setting some 
flag that triggers a full syncfs might be an interim solution since it's a 
pretty rare event.  OTOH, adding the fsync calls in the HashIndex code 
probably isn't so bad to audit and get right either...
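
The rename case can be sketched as follows, assuming Linux and using Python's `os` module. The helper names are hypothetical and this is not the collection_move_rename implementation; it only shows which directories would need an fsync after a rename.

```python
import os


def fsync_dir(path):
    """fsync a directory so changes to its entries are durable."""
    fd = os.open(path, os.O_RDONLY | os.O_DIRECTORY)
    try:
        os.fsync(fd)
    finally:
        os.close(fd)


def durable_rename(src, dst):
    """Rename src to dst, then fsync every directory whose entries changed."""
    src_dir = os.path.dirname(os.path.abspath(src))
    dst_dir = os.path.dirname(os.path.abspath(dst))
    os.rename(src, dst)
    fsync_dir(dst_dir)
    if src_dir != dst_dir:  # cross-directory move touches two directories
        fsync_dir(src_dir)
```

A rehash that moves many objects between directories would repeat this per directory, which is where auditing every affected path becomes the hard part.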

sage


* Re: FileStore should not use syncfs(2)
  2015-08-06  9:44 ` Yan, Zheng
@ 2015-08-06 12:57   ` Sage Weil
  0 siblings, 0 replies; 11+ messages in thread
From: Sage Weil @ 2015-08-06 12:57 UTC (permalink / raw)
  To: Yan, Zheng; +Cc: Somnath.Roy, ceph-devel, sjust

On Thu, 6 Aug 2015, Yan, Zheng wrote:
> On Thu, Aug 6, 2015 at 5:26 AM, Sage Weil <sweil@redhat.com> wrote:
> > Today I learned that syncfs(2) does an O(n) search of the superblock's
> > inode list searching for dirty items.  I've always assumed that it was
> > only traversing dirty inodes (e.g., a list of dirty inodes), but that
> > appears not to be the case, even on the latest kernels.
> >
> 
> I checked the syncfs code in the 3.10 and 4.1 kernels. I think both
> kernels only traverse dirty inodes (inodes in the
> bdi_writeback::{b_dirty,b_io,b_more_io} lists). What am I missing?

See wait_sb_inodes in fs/fs-writeback.c, called by sync_inodes_sb.

sage


* Re: FileStore should not use syncfs(2)
  2015-08-06 11:27 ` Christoph Hellwig
@ 2015-08-06 13:00   ` Sage Weil
  2015-08-06 13:06     ` Christoph Hellwig
  0 siblings, 1 reply; 11+ messages in thread
From: Sage Weil @ 2015-08-06 13:00 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Somnath.Roy, ceph-devel, sjust

On Thu, 6 Aug 2015, Christoph Hellwig wrote:
> On Wed, Aug 05, 2015 at 02:26:30PM -0700, Sage Weil wrote:
> > Today I learned that syncfs(2) does an O(n) search of the superblock's 
> > inode list searching for dirty items.  I've always assumed that it was 
> > only traversing dirty inodes (e.g., a list of dirty inodes), but that 
> > appears not to be the case, even on the latest kernels.
> 
> I'm pretty sure Dave had some patches for that.  Even if they aren't
> included, it's not an unsolved problem.
> 
> > The main thing to watch out for is that according to POSIX you really need 
> > to fsync directories.  With XFS that isn't the case since all metadata 
> > operations are going into the journal and that's fully ordered, but we 
> > don't want to allow data loss on e.g. ext4 (we need to check what the 
> > metadata ordering behavior is there) or other file systems.
> 
> That additional fsync in XFS is basically free, so better to get it right
> and let the file system micro-optimize for you.

I'm guessing the strategy here should be to fsync the file (leaf) and then 
any affected ancestors, such that the directory fsyncs are effectively 
no-ops?  Or does it matter?

Thanks!
sage



* Re: FileStore should not use syncfs(2)
  2015-08-06 13:00   ` Sage Weil
@ 2015-08-06 13:06     ` Christoph Hellwig
  0 siblings, 0 replies; 11+ messages in thread
From: Christoph Hellwig @ 2015-08-06 13:06 UTC (permalink / raw)
  To: Sage Weil; +Cc: Somnath.Roy, ceph-devel, sjust

On Thu, Aug 06, 2015 at 06:00:42AM -0700, Sage Weil wrote:
> I'm guessing the strategy here should be to fsync the file (leaf) and then 
> any affected ancestors, such that the directory fsyncs are effectively 
> no-ops?  Or does it matter?

All metadata transactions log the involved parties (mostly the parent and
child inode(s)) in the same transaction.  So flushing one of them out
is enough.  But file data I/O might dirty the inode before flushing them
out, so to avoid writing out the inode log item twice, you first want
to fsync any file that had data I/O, followed by directories or special
files that only had metadata modified.
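
The ordering described above can be sketched like this, assuming Linux, where fsync works on a descriptor opened read-only. The function names are hypothetical, and the caller is assumed to already know which paths had data I/O and which only had metadata changes.

```python
import os


def fsync_path(path):
    """Open a file or directory read-only and fsync it."""
    fd = os.open(path, os.O_RDONLY)
    try:
        os.fsync(fd)
    finally:
        os.close(fd)


def sync_order(data_files, metadata_only):
    """Files that had data I/O first, then paths that only had metadata
    changes; a path listed twice keeps its earliest position."""
    ordered = []
    seen = set()
    for path in list(data_files) + list(metadata_only):
        if path not in seen:
            seen.add(path)
            ordered.append(path)
    return ordered


def sync_all(data_files, metadata_only):
    """fsync everything in the order computed by sync_order and return it."""
    order = sync_order(data_files, metadata_only)
    for path in order:
        fsync_path(path)
    return order
```

For example, `sync_order(["a", "b"], ["dir", "a"])` returns `["a", "b", "dir"]`, so the directory is only synced after the files whose data was written.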


* RE: FileStore should not use syncfs(2)
  2015-08-05 21:55 ` Mark Nelson
@ 2015-08-07  6:50   ` Chen, Xiaoxi
  0 siblings, 0 replies; 11+ messages in thread
From: Chen, Xiaoxi @ 2015-08-07  6:50 UTC (permalink / raw)
  To: Mark Nelson, Sage Weil, Somnath.Roy@sandisk.com
  Cc: ceph-devel@vger.kernel.org, sjust@redhat.com

> FWIW, I often see a performance increase when favoring the inode/dentry cache, but
> probably with far fewer inodes than the setup you just saw.  It sounds like there
> needs to be some maximum limit on the inode/dentry cache to prevent this
> kind of behavior while still favoring it up until that point.  Having said that, maybe
> avoiding syncfs is best, as you say below.

We also see that in most cases. Usually we set this to 1 (prefer inodes) as a tuning BKM for small-file storage.

Can we work around it by enlarging the FDCache and tuning /proc/sys/vm/vfs_cache_pressure to 100 (prefer data)? That is, we would try to use the FDCache in place of the inode/dentry cache as much as possible.



