* question about xfs_fsync on linux
@ 2008-07-14 22:13 UTC
From: Chris Torek
To: xfs

The implementation of xfs_fsync() in 2.6.2x, for reasonably late x,
reads as follows in xfs_vnodeops.c:

    error = filemap_fdatawait(vn_to_inode(XFS_ITOV(ip))->i_mapping);

We have a customer who is seeing data not "make it" to disk on a
stress test that involves doing an fsync() or fdatasync() and then
deliberately rebooting the machine (to simulate a failure; note
that the underlying RAID has its own battery backup and this is
just one of many different parts of the stress test).

Looking into this, I am now wondering if this call should instead read:

    error = filemap_write_and_wait(vn_to_inode(XFS_ITOV(ip))->i_mapping);

From a quick skim, it seems as though filemap_fdatawait() does not
start dirty-page writeback, but rather only waits for any writes that
are currently in progress.  The write-and-wait call starts them first,
which seems more appropriate for an fsync.  I must admit to being
relatively unfamiliar with all the innards of the Linux filemap code,
though.

Chris
* Re: question about xfs_fsync on linux
@ 2008-07-14 23:03 UTC
From: Dave Chinner
To: Chris Torek; +Cc: xfs

On Mon, Jul 14, 2008 at 04:13:40PM -0600, Chris Torek wrote:
> The implementation of xfs_fsync() in 2.6.2x, for reasonably late x,
> reads as follows in xfs_vnodeops.c:

What kernel(s), exactly, is/are showing this problem?

>     error = filemap_fdatawait(vn_to_inode(XFS_ITOV(ip))->i_mapping);
>
> We have a customer who is seeing data not "make it" to disk on a
> stress test that involves doing an fsync() or fdatasync() and then
> deliberately rebooting the machine (to simulate a failure; note
> that the underlying RAID has its own battery backup and this is
> just one of many different parts of the stress-test).

What is the symptom?  The file size does not change?  The file is the
right size but has no data in it?

> Looking into this, I am now wondering if this call should read:
>
>     error = filemap_write_and_wait(vn_to_inode(XFS_ITOV(ip))->i_mapping);
>
> instead.

No, filemap_fdatawrite() has already been executed by this point.
We only need to wait for I/O completion here.  See do_fsync():

     78 long do_fsync(struct file *file, int datasync)
     79 {
    .....
     90         ret = filemap_fdatawrite(mapping);
    .....
     97         err = file->f_op->fsync(file, file->f_path.dentry, datasync);
    .....

The ->fsync() method in XFS is xfs_fsync()....

However, I do ask exactly what kernel version you are running,
because this mod that has gone into 2.6.26:

http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=978b7237123d007b9fa983af6e0e2fa8f97f9934

might be the fix you need for .24 or .25 kernels (not sure about .22
or .23, where all this changed, nor .21 or earlier, which I don't
think even had the wait...).

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com
* Re: question about xfs_fsync on linux
@ 2008-07-15  1:29 UTC
From: Chris Torek
To: Dave Chinner; +Cc: xfs

> What kernel(s), exactly, is/are showing this problem?

Well, that part is a bit tricky.  The base kernel is 2.6.21 but it
has a lot of patches, including the one you mentioned.  (The customer
is double-checking to make sure they actually have that patch in.)

>> We have a customer who is seeing data not "make it" to disk on a
>> stress test that involves doing an fsync() or fdatasync() and then
>> deliberately rebooting the machine (to simulate a failure; note
>> that the underlying RAID has its own battery backup and this is
>> just one of many different parts of the stress-test).
>
> What is the symptom?  The file size does not change?  The file is the
> right size but has no data in it?

Their system has a large number of databases (on the order of 50) all
open simultaneously, and is using direct I/O (with a call to
fdatasync()) to make entries in many of them, and apparently *some*
of them get corrupted.  Exactly how, I do not know: naturally, we
cannot reproduce this with our own system, and when they tried a
simplified system with just one database, the problem went away on
their end too.  (Agh.)

> No, the filemap_fdatawrite() has already been executed by this
> point [by do_fsync()].

D'oh!  I somehow missed this in eyeballing the code paths.

> However, I do ask exactly what kernel version you are running ...

It is mostly 2.6.21.
We brought in a large number of miscellaneous XFS fixes, not including
the ones that remove the "behavior" layer stuff, but definitely
including this one:

> http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=978b7237123d007b9fa983af6e0e2fa8f97f9934

(which of course necessitated a bit of hacking on the patches to fit,
as a lot of the later ones assume the bhv* layer has been removed).

Chris
* Re: question about xfs_fsync on linux
@ 2008-07-15  2:48 UTC
From: Dave Chinner
To: Chris Torek; +Cc: xfs

On Mon, Jul 14, 2008 at 07:29:16PM -0600, Chris Torek wrote:
> > What kernel(s), exactly, is/are showing this problem?
>
> Well, that part is a bit tricky.  The base kernel is 2.6.21
> but it has a lot of patches, including the one you mentioned.
> (The customer is double-checking to make sure they actually have
> that patch in.)

Well, you are pretty much on your own, then.  Really, we cannot help
diagnose problems on old kernels with a random set of backported
patches in them.  If you can provide a reproducible test case that
triggers the problem on a recent, pristine tree, then we can find the
problem and fix it.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com
* Re: question about xfs_fsync on linux
@ 2008-07-16 21:58 UTC
From: Chris Torek
To: Dave Chinner; +Cc: xfs

> Well, you are pretty much on your own, then.  Really, we cannot help
> diagnose problems on old kernels with a random set of backported
> patches in them.

Definitely understood.  I just wanted to ask that original question,
really.  I had assumed that the file system itself had to start any
dirty-page writes, having missed the top-level filemap_fdatawrite()
call.

We finally got a test case and did a bunch of analysis, and it turns
out that the DB software is missing an fsync() call.  Of course XFS
won't fsync the file if you don't *ask* it to!

As long as I am sending mail, there is something else I am curious
about, though.  While this is not XFS-specific, I wonder if there is
any desire to have different background write frequencies on different
file systems.  By default, mm/page-writeback.c will start writebacks
after a 30-second delay.  One can tune this to any other number (via
/proc/sys/vm/dirty_{expire,writeback}_centisecs), but this affects
the entire system.  It might be useful to be able to tune this
per-FS instead.

(On the other hand, perhaps if one really wants one's data journaled,
one should just use a data-journaling file system....)

Chris
* Re: question about xfs_fsync on linux
@ 2008-07-17  0:22 UTC
From: Dave Chinner
To: Chris Torek; +Cc: xfs

On Wed, Jul 16, 2008 at 03:58:55PM -0600, Chris Torek wrote:
> > Well, you are pretty much on your own, then.  Really, we cannot help
> > diagnose problems on old kernels with a random set of backported
> > patches in them.
>
> Definitely understood.  I just wanted to ask that original
> question, really.  I had assumed that the file system itself
> had to start any dirty-page writes, having missed the top-level
> filemap_fdatawrite() call.
>
> We finally got a test case and did a bunch of analysis, and it
> turns out that the DB software is missing an fsync() call.  Of
> course XFS won't fsync the file if you don't *ask* it to!

Yes, that would help ;)  Thanks for following up and letting us know
you found the problem.

> As long as I am sending mail, there is something else I am curious
> about, though.  While this is not XFS-specific, I wonder if there
> is any desire to have different background write frequencies on
> different file systems.  By default, mm/page-writeback.c will start
> writebacks after a 30-second delay.  One can tune this to any other
> number (via /proc/sys/vm/dirty_{expire,writeback}_centisecs), but
> this affects the entire system.  It might be useful to be able
> to tune this per-FS instead.

Wouldn't be too difficult, but really, if you have data that needs to
go to disk quickly from a given application, then the application
should be triggering the flush.  e.g. issuing
posix_fadvise(fd, ...., POSIX_FADV_DONTNEED) will trigger an
immediate async flush of the file....

> (On the other hand, perhaps if one really wants one's data journaled,
> one should just use a data-journaling file system....)

Or use the sync mount option.....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com