* question about xfs_fsync on linux
From: Chris Torek @ 2008-07-14 22:13 UTC (permalink / raw)
To: xfs
The implementation of xfs_fsync() in 2.6.2x, for reasonably late x,
reads as follows in xfs_vnodeops.c:
error = filemap_fdatawait(vn_to_inode(XFS_ITOV(ip))->i_mapping);
We have a customer who is seeing data not "make it" to disk on a
stress test that involves doing an fsync() or fdatasync() and then
deliberately rebooting the machine (to simulate a failure; note
that the underlying RAID has its own battery backup and this is
just one of many different parts of the stress-test).
Looking into this, I am now wondering if this call should read:
error = filemap_write_and_wait(vn_to_inode(XFS_ITOV(ip))->i_mapping);
instead. From a quick skim, it seems as though fdatawait does not
start dirty page pushes, but rather only waits for any that are
currently in progress. The write-and-wait call starts them first,
which seems more appropriate for an fsync. I must admit to being
relatively unfamiliar with all the innards of the Linux filemap
code though.
Chris
^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: question about xfs_fsync on linux
From: Dave Chinner @ 2008-07-14 23:03 UTC (permalink / raw)
To: Chris Torek; +Cc: xfs
On Mon, Jul 14, 2008 at 04:13:40PM -0600, Chris Torek wrote:
> The implementation of xfs_fsync() in 2.6.2x, for reasonably late x,
> reads as follows in xfs_vnodeops.c:
What kernel(s), exactly, is/are showing this problem?
> error = filemap_fdatawait(vn_to_inode(XFS_ITOV(ip))->i_mapping);
>
> We have a customer who is seeing data not "make it" to disk on a
> stress test that involves doing an fsync() or fdatasync() and then
> deliberately rebooting the machine (to simulate a failure; note
> that the underlying RAID has its own battery backup and this is
> just one of many different parts of the stress-test).
What is the symptom? The file size does not change? The file is the
right size but has no data in it?
> Looking into this, I am now wondering if this call should read:
>
> error = filemap_write_and_wait(vn_to_inode(XFS_ITOV(ip))->i_mapping);
>
> instead.
No, the filemap_fdatawrite() has already been executed by this
point. We only need to wait for I/O completion here. See do_fsync():
78 long do_fsync(struct file *file, int datasync)
79 {
.....
90 ret = filemap_fdatawrite(mapping);
.....
97 err = file->f_op->fsync(file, file->f_path.dentry, datasync);
.....
The ->fsync() method in XFS is xfs_fsync()....
However, I do ask exactly what kernel version you are running,
because this mod that has gone into 2.6.26:
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=978b7237123d007b9fa983af6e0e2fa8f97f9934
might be the fix you need for .24 or .25 kernels (not sure about .22 or .23,
where all this changed, nor .21 or earlier, which I don't think even had the
wait...)
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
* Re: question about xfs_fsync on linux
From: Chris Torek @ 2008-07-15 1:29 UTC (permalink / raw)
To: Dave Chinner; +Cc: xfs
>What kernel(s), exactly, is/are showing this problem?
Well, that part is a bit tricky. The base kernel is 2.6.21
but it has a lot of patches, including the one you mentioned.
(The customer is double checking to make sure they actually have
that patch in.)
>> We have a customer who is seeing data not "make it" to disk on a
>> stress test that involves doing an fsync() or fdatasync() and then
>> deliberately rebooting the machine (to simulate a failure; note
>> that the underlying RAID has its own battery backup and this is
>> just one of many different parts of the stress-test).
>
>What is the symptom? The file size does not change? The file is the
>right size but has no data in it?
Their system has a large number of databases (on the order of 50)
all open simultaneously, and is using directIO (with a call to
fdatasync()) to make entries in many of them, and apparently *some*
of them get corrupted. Exactly how, I do not know: naturally, we
cannot reproduce this with our own system, and when they tried a
simplified system with just one database the problem went away on
their end too. (Agh.)
>No, the filemap_fdatawrite() has already been executed by this
>point [by do_fsync()].
D'oh! I somehow missed this in eyeballing the code paths.
>However, I do ask exactly what kernel version you are running ...
It is mostly 2.6.21. We brought in a large number of miscellaneous
XFS fixes, not including the ones that remove the "behavior" layer
stuff, but definitely including this one:
>http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;
>h=978b7237123d007b9fa983af6e0e2fa8f97f9934
(which of course necessitated a bit of hacking on the patches to
fit, as a lot of the later ones assume the bhv* layer has been
removed).
Chris
* Re: question about xfs_fsync on linux
From: Dave Chinner @ 2008-07-15 2:48 UTC (permalink / raw)
To: Chris Torek; +Cc: xfs
On Mon, Jul 14, 2008 at 07:29:16PM -0600, Chris Torek wrote:
> >What kernel(s), exactly, is/are showing this problem?
>
> Well, that part is a bit tricky. The base kernel is 2.6.21
> but it has a lot of patches, including the one you mentioned.
> (The customer is double checking to make sure they actually have
> that patch in.)
Well, you are pretty much on your own, then. Really, we cannot help
diagnose problems on old kernels with a random set of backported
patches in them. If you can provide a reproducible test case that
triggers the problem on a recent, pristine tree then we can find the
problem and fix it.
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
* Re: question about xfs_fsync on linux
From: Chris Torek @ 2008-07-16 21:58 UTC (permalink / raw)
To: Dave Chinner; +Cc: xfs
>Well, you are pretty much on your own, then. Really, we cannot help
>diagnose problems on old kernels with a random set of backported
>patches in them.
Definitely understood. I just wanted to ask that original
question, really. I had assumed that the file system itself
had to start any dirty-page writes, having missed the top level
filemap_fdatawrite() call.
We finally got a test case and did a bunch of analysis, and it
turns out that the DB software is missing an fsync() call. Of
course XFS won't fsync the file if you don't *ask* it to!
As long as I am sending mail, there is something else I am curious
about though. While this is not XFS specific, I wonder if there
is any desire to have different background write frequencies on
different file systems. By default, mm/page-writeback.c will start
writebacks after a 30-second delay. One can tune this to any other
number (via /proc/sys/vm/dirty_{expire,writeback}_centisecs), but
this affects the entire system. It might be useful to be able
to tune this per-FS instead.
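The two knobs in question can be inspected directly; a quick sketch (Linux-only, values are system-wide, which is exactly the limitation being discussed):

```shell
# Global writeback knobs -- there is no per-filesystem equivalent;
# changing these affects every mounted filesystem on the box.
cat /proc/sys/vm/dirty_expire_centisecs     # age at which dirty data becomes eligible for writeback (3000 = 30s default)
cat /proc/sys/vm/dirty_writeback_centisecs  # how often the flusher threads wake up (500 = 5s default)
```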
(On the other hand, perhaps if one really wants one's data journaled,
one should just use a data-journaling file system....)
Chris
* Re: question about xfs_fsync on linux
From: Dave Chinner @ 2008-07-17 0:22 UTC (permalink / raw)
To: Chris Torek; +Cc: xfs
On Wed, Jul 16, 2008 at 03:58:55PM -0600, Chris Torek wrote:
> >Well, you are pretty much on your own, then. Really, we cannot help
> >diagnose problems on old kernels with a random set of backported
> >patches in them.
>
> Definitely understood. I just wanted to ask that original
> question, really. I had assumed that the file system itself
> had to start any dirty-page writes, having missed the top level
> filemap_fdatawrite() call.
>
> We finally got a test case and did a bunch of analysis, and it
> turns out that the DB software is missing an fsync() call. Of
> course XFS won't fsync the file if you don't *ask* it to!
Yes, that would help ;)
Thanks for following up and letting us know you found the
problem.
> As long as I am sending mail, there is something else I am curious
> about though. While this is not XFS specific, I wonder if there
> is any desire to have different background write frequencies on
> different file systems. By default, mm/page-writeback.c will start
> writebacks after a 30-second delay. One can tune this to any other
> number (via /proc/sys/vm/dirty_{expire,writeback}_centisecs), but
> this affects the entire system. It might be useful to be able
> to tune this per-FS instead.
Wouldn't be too difficult, but really if you have data that needs
to go to disk quickly from a given application then the application
should be triggering the flush; e.g. issuing
posix_fadvise(fd, ...., POSIX_FADV_DONTNEED) will trigger an
immediate async flush of the file....
> (On the other hand, perhaps if one really wants one's data journaled,
> one should just use a data-journaling file system....)
Or use the sync mount option.....
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com