question about xfs_fsync on linux

public inbox for linux-xfs@vger.kernel.org

* question about xfs_fsync on linux
From: Chris Torek @ 2008-07-14 22:13 UTC
  To: xfs

The implementation of xfs_fsync() in 2.6.2x, for reasonably late x,
reads as follows in xfs_vnodeops.c:

	error = filemap_fdatawait(vn_to_inode(XFS_ITOV(ip))->i_mapping);

We have a customer who is seeing data not "make it" to disk on a
stress test that involves doing an fsync() or fdatasync() and then
deliberately rebooting the machine (to simulate a failure; note
that the underlying RAID has its own battery backup and this is
just one of many different parts of the stress-test).

Looking into this, I am now wondering if this call should read:

	error = filemap_write_and_wait(vn_to_inode(XFS_ITOV(ip))->i_mapping);

instead.  From a quick skim, it seems as though fdatawait does not
start dirty page pushes, but rather only waits for any that are
currently in progress.  The write-and-wait call starts them first,
which seems more appropriate for an fsync.  I must admit to being
relatively unfamiliar with all the innards of the Linux filemap
code though.

Chris


* Re: question about xfs_fsync on linux
From: Dave Chinner @ 2008-07-14 23:03 UTC
  To: Chris Torek; +Cc: xfs

On Mon, Jul 14, 2008 at 04:13:40PM -0600, Chris Torek wrote:
> The implementation of xfs_fsync() in 2.6.2x, for reasonably late x,
> reads as follows in xfs_vnodeops.c:

What kernel(s), exactly, is/are showing this problem?

> 	error = filemap_fdatawait(vn_to_inode(XFS_ITOV(ip))->i_mapping);
> 
> We have a customer who is seeing data not "make it" to disk on a
> stress test that involves doing an fsync() or fdatasync() and then
> deliberately rebooting the machine (to simulate a failure; note
> that the underlying RAID has its own battery backup and this is
> just one of many different parts of the stress-test).

What is the symptom? The file size does not change? Is the file the
right size but with no data in it?

> Looking into this, I am now wondering if this call should read:
> 
> 	error = filemap_write_and_wait(vn_to_inode(XFS_ITOV(ip))->i_mapping);
> 
> instead.

No, the filemap_fdatawrite() has already been executed by this
point. We only need to wait for I/O completion here. See do_fsync():

	long do_fsync(struct file *file, int datasync)
	{
	.....
		ret = filemap_fdatawrite(mapping);
	.....
		err = file->f_op->fsync(file, file->f_path.dentry, datasync);
	.....

The ->fsync() method in XFS is xfs_fsync()....
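
The start-then-wait split described above has a rough userspace analogy
in sync_file_range(2).  The following is a minimal sketch, not the kernel
code path: the helper name flush_range_two_step() is hypothetical, and
the offset/length of 0/0 (meaning "whole file") is an assumption made
for illustration.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Userspace analogy only; flush_range_two_step() is a hypothetical helper.
 * Returns 0 on success, -1 with errno set on failure.
 */
int flush_range_two_step(int fd)
{
	assert(fd >= 0);

	/* Step 1: start writeback of dirty pages (compare filemap_fdatawrite()). */
	if (sync_file_range(fd, 0, 0, SYNC_FILE_RANGE_WRITE) != 0)
		return -1;

	/* Step 2: wait for that writeback to complete (compare filemap_fdatawait()). */
	if (sync_file_range(fd, 0, 0, SYNC_FILE_RANGE_WAIT_AFTER) != 0)
		return -1;

	return 0;
}
```

Calling only the second step on a file whose dirty pages were never
submitted would return without writing anything, which is the behaviour
the original question was worried about.  Note that sync_file_range()
does not flush file metadata or the device write cache, so it is not a
substitute for fsync() or fdatasync().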

However, I do ask exactly what kernel version you are running,
because this mod that has gone into 2.6.26:

http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=978b7237123d007b9fa983af6e0e2fa8f97f9934

might be the fix you need for .24 or .25 kernels (not sure about .22 or .23,
where all this changed, nor .21 or earlier, which I don't think even had the
wait...)

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: question about xfs_fsync on linux
From: Chris Torek @ 2008-07-15  1:29 UTC
  To: Dave Chinner; +Cc: xfs

>What kernel(s), exactly, is/are showing this problem?

Well, that part is a bit tricky.  The base kernel is 2.6.21
but it has a lot of patches, including the one you mentioned.
(The customer is double checking to make sure they actually have
that patch in.)

>> We have a customer who is seeing data not "make it" to disk on a
>> stress test that involves doing an fsync() or fdatasync() and then
>> deliberately rebooting the machine (to simulate a failure; note
>> that the underlying RAID has its own battery backup and this is
>> just one of many different parts of the stress-test).
>
>What is the symptom? The file size does not change? Is the file the
>right size but with no data in it?

Their system has a large number of databases (on the order of 50)
all open simultaneously, and is using directIO (with a call to
fdatasync()) to make entries in many of them, and apparently *some*
of them get corrupted.  Exactly how, I do not know: naturally, we
cannot reproduce this with our own system, and when they tried a
simplified system with just one database the problem went away on
their end too.  (Agh.)

>No, the filemap_fdatawrite() has already been executed by this
>point [by do_fsync()].

D'oh!  I somehow missed this in eyeballing the code paths.

>However, I do ask exactly what kernel version you are running ...

It is mostly 2.6.21.  We brought in a large number of miscellaneous
XFS fixes, not including the ones that remove the "behavior" layer
stuff, but definitely including this one:

>http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=978b7237123d007b9fa983af6e0e2fa8f97f9934

(which of course necessitated a bit of hacking on the patches to
fit, as a lot of the later ones assume the bhv* layer has been
removed).

Chris


* Re: question about xfs_fsync on linux
From: Dave Chinner @ 2008-07-15  2:48 UTC
  To: Chris Torek; +Cc: xfs

On Mon, Jul 14, 2008 at 07:29:16PM -0600, Chris Torek wrote:
> >What kernel(s), exactly, is/are showing this problem?
> 
> Well, that part is a bit tricky.  The base kernel is 2.6.21
> but it has a lot of patches, including the one you mentioned.
> (The customer is double checking to make sure they actually have
> that patch in.)

Well, you are pretty much on your own, then. Really, we cannot help
diagnose problems on old kernels with a random set of backported
patches in them.  If you can provide a reproducible test case that
triggers the problem on a recent, pristine tree then we can find the
problem and fix it.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: question about xfs_fsync on linux
From: Chris Torek @ 2008-07-16 21:58 UTC
  To: Dave Chinner; +Cc: xfs

>Well, you are pretty much on your own, then. Really, we cannot help
>diagnose problems on old kernels with a random set of backported
>patches in them.

Definitely understood.  I just wanted to ask that original
question, really.  I had assumed that the file system itself
had to start any dirty-page writes, having missed the top level
filemap_fdatawrite() call.

We finally got a test case and did a bunch of analysis, and it
turns out that the DB software is missing an fsync() call.  Of
course XFS won't fsync the file if you don't *ask* it to!
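
For illustration, the pattern the application needs looks roughly like
the sketch below.  This is not taken from the DB software in question:
append_durably() is a hypothetical helper, and the open flags and file
mode are assumptions.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <unistd.h>

/*
 * append_durably() is a hypothetical helper: append len bytes to path and
 * return only once fsync() has reported success.
 * Returns 0 on success, -1 with errno set on failure.
 */
int append_durably(const char *path, const void *buf, size_t len)
{
	const char *p = buf;
	int fd, saved;

	assert(path != NULL);
	fd = open(path, O_WRONLY | O_CREAT | O_APPEND, 0600);
	if (fd < 0)
		return -1;

	while (len > 0) {
		ssize_t n = write(fd, p, len);
		if (n < 0) {
			if (errno == EINTR)
				continue;	/* interrupted before any data was written; retry */
			goto fail;
		}
		p += n;			/* handle short writes */
		len -= (size_t)n;
	}

	/* Without this call, the data may still be only in the page cache. */
	if (fsync(fd) != 0)
		goto fail;
	return close(fd);

fail:
	saved = errno;
	close(fd);
	errno = saved;
	return -1;
}
```

The return value of fsync() is checked because an I/O error during
writeback is reported there rather than by the earlier write() calls.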

As long as I am sending mail, there is something else I am curious
about though.  While this is not XFS specific, I wonder if there
is any desire to have different background write frequencies on
different file systems.  By default, mm/page-writeback.c will start
writebacks after a 30-second delay.  One can tune this to any other
number (via /proc/sys/vm/dirty_{expire,writeback}_centisecs), but
this affects the entire system.  It might be useful to be able
to tune this per-FS instead.
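
As a small illustration of the system-wide knobs mentioned above, the
sketch below reads one of them from /proc/sys/vm.  The helper name
read_vm_centisecs() and the -1 error convention are assumptions made
for this example; the values are in hundredths of a second, so a
dirty_expire_centisecs of 3000 would correspond to the 30-second delay.

```c
#include <assert.h>
#include <stdio.h>

/*
 * read_vm_centisecs() is a hypothetical helper.
 * name is a file under /proc/sys/vm, e.g. "dirty_expire_centisecs".
 * Returns the value in centiseconds, or -1 if it cannot be read.
 */
long read_vm_centisecs(const char *name)
{
	char path[128];
	long value = -1;
	FILE *f;
	int n;

	assert(name != NULL);
	n = snprintf(path, sizeof path, "/proc/sys/vm/%s", name);
	if (n < 0 || (size_t)n >= sizeof path)
		return -1;		/* name too long for the buffer */

	f = fopen(path, "r");
	if (f == NULL)
		return -1;		/* knob missing or not readable */
	if (fscanf(f, "%ld", &value) != 1)
		value = -1;		/* contents were not a number */
	fclose(f);
	return value;
}
```

Calling read_vm_centisecs("dirty_writeback_centisecs") and dividing by
100 gives the flusher wakeup interval in seconds.  Both settings apply
to every mounted file system, which is the limitation raised here.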

(On the other hand, perhaps if one really wants one's data journaled,
one should just use a data-journaling file system....)

Chris


* Re: question about xfs_fsync on linux
From: Dave Chinner @ 2008-07-17  0:22 UTC
  To: Chris Torek; +Cc: xfs

On Wed, Jul 16, 2008 at 03:58:55PM -0600, Chris Torek wrote:
> >Well, you are pretty much on your own, then. Really, we cannot help
> >diagnose problems on old kernels with a random set of backported
> >patches in them.
> 
> Definitely understood.  I just wanted to ask that original
> question, really.  I had assumed that the file system itself
> had to start any dirty-page writes, having missed the top level
> filemap_fdatawrite() call.
> 
> We finally got a test case and did a bunch of analysis, and it
> turns out that the DB software is missing an fsync() call.  Of
> course XFS won't fsync the file if you don't *ask* it to!

Yes, that would help ;)

Thanks for following up and letting us know you found the
problem.

> As long as I am sending mail, there is something else I am curious
> about though.  While this is not XFS specific, I wonder if there
> is any desire to have different background write frequencies on
> different file systems.  By default, mm/page-writeback.c will start
> writebacks after a 30-second delay.  One can tune this to any other
> number (via /proc/sys/vm/dirty_{expire,writeback}_centisecs), but
> this affects the entire system.  It might be useful to be able
> to tune this per-FS instead.

Wouldn't be too difficult, but really if you have data that needs
to go to disk quickly from a given application then the application
should be triggering the flush. e.g. issuing
posix_fadvise(fd, ...., POSIX_FADV_DONTNEED) will trigger an
immediate async flush of the file....
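
A minimal sketch of that suggestion follows.  The helper name
write_and_hint_flush() is hypothetical, and the offset and length of
0 and 0 (meaning "from the start through end of file") are assumed
values standing in for the arguments elided above.

```c
#define _POSIX_C_SOURCE 200112L
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <unistd.h>

/*
 * write_and_hint_flush() is a hypothetical helper: write a buffer, then
 * ask the kernel to begin writing the file's dirty pages back.
 * Returns 0 on success or an errno-style error number.
 */
int write_and_hint_flush(int fd, const void *buf, size_t len)
{
	ssize_t n;

	assert(buf != NULL);
	n = write(fd, buf, len);
	if (n < 0)
		return errno;
	if ((size_t)n != len)
		return EIO;	/* short write: treated as an error in this sketch */

	/* posix_fadvise() returns the error number directly; it does not set errno. */
	return posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);
}
```

This is only a hint to start asynchronous writeback early.  It does not
wait for the I/O to complete, so an application that needs a durability
guarantee still has to call fsync() or fdatasync().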

> (On the other hand, perhaps if one really wants one's data journaled,
> one should just use a data-journaling file system....)

Or use the sync mount option.....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com
