Re: NFS file size anomaly?

linux-fsdevel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

From: Jan Kara <jack-AlSwsSmVLrQ@public.gmane.org>
To: Trond Myklebust
	<trond.myklebust-7I+n7zu2hftEKMMhf/gKZA@public.gmane.org>
Cc: Jan Kara <jack-AlSwsSmVLrQ@public.gmane.org>,
	Chuck Lever <chuck.lever-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>,
	Dan Duval <dan.duval-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>,
	Linux NFS Mailing List
	<linux-nfs-u79uwXL29TY76Z2rM5mHXA@public.gmane.org>,
	linux-fsdevel
	<linux-fsdevel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org>
Subject: Re: NFS file size anomaly?
Date: Wed, 11 Dec 2013 15:12:40 +0100	[thread overview]
Message-ID: <20131211141240.GA1163@quack.suse.cz> (raw)
In-Reply-To: <1386694881.2879.16.camel-5lNtUQgoD8Pfa3cDbr2K10B+6BGkLq7r@public.gmane.org>

[-- Attachment #1: Type: text/plain, Size: 7411 bytes --]

On Tue 10-12-13 19:01:21, Trond Myklebust wrote:
> On Tue, 2013-12-10 at 15:36 +0100, Jan Kara wrote:
> > On Mon 09-12-13 19:56:16, Chuck Lever wrote:
> > > On Dec 9, 2013, at 4:05 PM, Chuck Lever <chuck.lever-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org> wrote:
> > > > Copying Trond's new work address.
> > > > 
> > > > Begin forwarded message:
> > > > 
> > > >> From: Dan Duval <dan.duval-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>
> > > >> Subject: NFS file size anomaly?
> > > >> Date: December 9, 2013 3:04:27 PM EST
> > > >> To: Trond Myklebust <Trond.Myklebust-HgOvQuBEEgTQT0dZR+AlfA@public.gmane.org>
> > > >> Cc: <linux-nfs-u79uwXL29TY76Z2rM5mHXA@public.gmane.org>, <linux-fsdevel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org>
> > > >> 
> > > >> [NOTE: cross-posted to linux-nfs and linux-fsdevel]
> > > >> 
> > > >> I'm seeing some unexpected behavior with NFS and file sizes.
> > > >> 
> > > >> The test cases are from the LTP (Linux Test Project), tests
> > > >> ftest01, ftest05, and ftest07.  I'll concentrate on ftest01
> > > >> to explain what I've found.
> > > >> 
> > > >> ftest01 fires off 5 subprocesses, each of which opens an empty
> > > >> file and does the following, repeatedly:
> > > >> 
> > > >>       . lseek to some point in the file
> > > >>       . read 2048 bytes
> > > >>       . lseek back to the same point
> > > >>       . write 2048 bytes
> > > >> 
> > > >> The "point in the file" is determined by a pseudo-random
> > > >> sequence.  All such points are on 2048-byte boundaries.
> > > >> 
> > > >> Occasionally, also driven pseudo-randomly, ftest01 will throw
> > > >> in a call to ftruncate(), truncate(), sync(), or fstat().
> > > >> 
> > > >> With the fstat() calls, the returned .st_size is compared
> > > >> with the test's expected size for the file, and an error is
> > > >> declared if they don't match.
> > > >> 
> > > >> What's happening is that, some way into the test, this fstat()
> > > >> check is failing.  Specifically, the .st_size reported by
> > > >> fstat() is greater than the computed size.
> > > >> 
> > > >> The sequence of operations leading up to this is:
> > > >> 
> > > >>       lseek 1034240 0
> > > >>       read 2048
> > > >>       lseek 0 1
> > > >>       write 2048
> > > >> 
> > > >>       lseek 638976 0
> > > >>       (read, lseek, write)
> > > >> 
> > > >>       lseek 708608 0
> > > >>       (read, lseek, write)
> > > >> 
> > > >>       lseek 708608 0
> > > >>       (read, lseek, write)
> > > >> 
> > > >>       lseek 679584 0
> > > >>       (read, lseek, write)
> > > >> 
> > > >>       truncate 266240
> > > >> 
> > > >>       lseek 960512 0
> > > >>       (read, lseek, write)
> > > >> 
> > > >>       (a bunch of lseek/read/lseek/write ops that do not
> > > >>        extend the file)
> > > >> 
> > > >>       fstat
> > > >> 
> > > >> So the expected size of the file is 960512 + 2048 == 960560.
> > > >> But the fstat reports a size of 1036288.
> > > >> 
> > > >> A look at what's happening on the wire, distilled from the
> > > >> output of tethereal, is instructive.
> > > >> 
> > > >>       READ Call 638976 4096 (byte offset and size to read)
> > > >>       READ Reply 4096 995382 (bytes read and current file size)
> > > >> 
> > > >>       SETATTR Call 266240 (this corresponds to the truncate() call)
> > > >> 
> > > >>       WRITE Call 638976 4096 (byte offset and size to write)
> > > >>       WRITE Call 708608 4096
> > > >>       WRITE Call 1032192 4096
> > > >> 
> > > >>       SETATTR Reply 266240 (current size of file)
> > > >> 
> > > >>       WRITE Reply 643072 (current size of file after write)
> > > >>       WRITE Reply 1036288
> > > >>       WRITE Reply 1036288
> > > >> 
> > > >>       GETATTR (initiated internally by NFS code?)
> > > >> 
> > > >>       READ Call 958464 4096 READ Reply 4096 1036288
> > > >> 
> > > >>       ... (a bunch of READ and WRITE ops that do not extend the file)
> > > >> 
> > > >>       GETATTR Call (this corresponds to the fstat() call)
> > > >>       GETATTR Reply 1036288
> > > >> 
> > > >> So what appears to have happened here is that three of the
> > > >> WRITE operations that the program issued before the truncate()
> > > >> call have "bled past" the SETATTR, extending the file further
> > > >> than the SETATTR did.  Since none of the operations issued
> > > >> after SETATTR extends the file further, by the time we get to
> > > >> the GETATTR, the file is larger than the test expects.
> > > >> 
> > > >> There are two strange things going on here.  The first,
> > > >> identified above, is that write()s that were initiated before
> > > >> the truncate() call are being processed after the resulting
> > > >> SETATTR call.  The second is that WRITE operations are being
> > > >> initiated while the SETATTR is outstanding.
> > > >> 
> > > >> It seems to me that a size-changing SETATTR operation should
> > > >> act essentially as an I/O barrier. It should wait for all outstanding
> > > >> read/write requests to complete, then issue the SETATTR,
> > > >> wait for the reply, and only then re-enable read/write requests.
> > > >> 
> > > >> In other words, SETATTR should be atomic with respect to other
> > > >> I/O operations.
> > > 
> > > nfs_setattr() relies on writeback_single_inode() [ via nfs_setattr() ->
> > > nfs_wb_all() -> sync_inode() -> writeback_single_inode() ] to serialize
> > > itself with other pending I/O operations.  The NFS client already treats
> > > SETATTR as an I/O barrier. 
> > > 
> > > >> A git bisect indicates that this problem first appeared (or
> > > >> was first uncovered) with this commit:
> > > >> 
> > > >>   4f8ad65 writeback: Refactor writeback_single_inode()
> > > >> 
> > > >> It continues to the most recent mainline kernels.
> > > >> 
> > > >> NFS v3 vs. v4 doesn't seem to matter.
> > > >> 
> > > >> Has anyone else seen this?  Any pointers you can provide?
> > > 
> > > After commit 4f8ad65, writeback_single_inode() occasionally returns
> > > immediately instead of waiting for writeback to finish.  The behavior
> > > would be difficult or impossible to spot with a local filesystem, but
> > > with NFS, a race is clearly exposed.
> >   That looks like a bug. But the only thing that's changed by 4f8ad65 is
> > the additional test:
> > 	if (!(inode->i_state & I_DIRTY))
> > 		goto out;
> > So it seems NFS has inodes that have dirty data but don't have (any of) inode
> > dirty bits set? But I've been staring at the code and don't see any
> > problem. Any idea Chuck? Otherwise I guess I'll have to reproduce this and
> > debug what's going on.
> 
> Hi Jan,
> 
> I'm guessing that the problem is the fact that you can clear the
> I_DIRTY_PAGES flag inside __writeback_single_inode() on a non-waiting
> call (i.e. one which does not have WB_SYNC_ALL set).
> When that happens, the NFS client ends up with inode->i_state == 0 until
> one of the WRITEs returns with the "unstable" flag set (which triggers a
> call to __mark_inode_dirty(inode, I_DIRTY_DATASYNC) so that we call
> write_inode()).
> 
> Basically, if you have WB_SYNC_ALL set, I think that you cannot apply
> the test for (inode->i_state & I_DIRTY) until you have called 
> filemap_fdatawait().
  Ah, thanks for explanation. I should have spotted this myself. So
something like the attached patch should fix the issue. Dan, can you try
whether the attached patch fixes your problem?

								Honza
-- 
Jan Kara <jack-AlSwsSmVLrQ@public.gmane.org>
SUSE Labs, CR

[-- Attachment #2: 0001-writeback-Fix-data-corruption-on-NFS.patch --]
[-- Type: text/x-patch, Size: 2461 bytes --]

>From 18eb76f3c2c7e8f3f57380846677097fe886232e Mon Sep 17 00:00:00 2001
From: Jan Kara <jack-AlSwsSmVLrQ@public.gmane.org>
Date: Wed, 11 Dec 2013 12:15:57 +0100
Subject: [PATCH] writeback: Fix data corruption on NFS

Commit 4f8ad655dbc8 "writeback: Refactor writeback_single_inode()" added
a condition to skip clean inode. However this is wrong in WB_SYNC_ALL
mode because there we also want to wait for outstanding writeback on
possibly clean inode. This was causing occasional data corruption issues
on NFS because it uses sync_inode() to make sure all outstanding writes
are flushed to the server before truncating the inode and with
sync_inode() returning prematurely file was sometimes extended back
by an outstanding write after it was truncated.

So modify the test to also check for pages under writeback in
WB_SYNC_ALL mode.

CC: stable-u79uwXL29TY76Z2rM5mHXA@public.gmane.org # >= 3.5
Fixes: 4f8ad655dbc82cf05d2edc11e66b78a42d38bf93
Reported-by: Dan Duval <dan.duval-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>
Signed-off-by: Jan Kara <jack-AlSwsSmVLrQ@public.gmane.org>
---
 fs/fs-writeback.c | 15 +++++++++------
 1 file changed, 9 insertions(+), 6 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 1f4a10ece2f1..e0259a163f98 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -516,13 +516,16 @@ writeback_single_inode(struct inode *inode, struct bdi_writeback *wb,
 	}
 	WARN_ON(inode->i_state & I_SYNC);
 	/*
-	 * Skip inode if it is clean. We don't want to mess with writeback
-	 * lists in this function since flusher thread may be doing for example
-	 * sync in parallel and if we move the inode, it could get skipped. So
-	 * here we make sure inode is on some writeback list and leave it there
-	 * unless we have completely cleaned the inode.
+	 * Skip inode if it is clean and we have no outstanding writeback in
+	 * WB_SYNC_ALL mode. We don't want to mess with writeback lists in this
+	 * function since flusher thread may be doing for example sync in
+	 * parallel and if we move the inode, it could get skipped. So here we
+	 * make sure inode is on some writeback list and leave it there unless
+	 * we have completely cleaned the inode.
 	 */
-	if (!(inode->i_state & I_DIRTY))
+	if (!(inode->i_state & I_DIRTY) &&
+	    (wbc->sync_mode != WB_SYNC_ALL ||
+	     !mapping_tagged(inode->i_mapping, PAGECACHE_TAG_WRITEBACK)))
 		goto out;
 	inode->i_state |= I_SYNC;
 	spin_unlock(&inode->i_lock);
-- 
1.8.1.4

next prev parent reply	other threads:[~2013-12-11 14:12 UTC|newest]

Thread overview: 8+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2013-12-09 20:04 NFS file size anomaly? Dan Duval
2013-12-09 21:05 ` Fwd: " Chuck Lever
     [not found]   ` <9FBE3B61-C265-460A-A92E-A4DFE8D27114-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>
2013-12-10  0:56     ` Chuck Lever
2013-12-10 14:36       ` Jan Kara
     [not found]         ` <20131210143643.GD1543-+0h/O2h83AeN3ZZ/Hiejyg@public.gmane.org>
2013-12-10 17:01           ` Trond Myklebust
     [not found]             ` <1386694881.2879.16.camel-5lNtUQgoD8Pfa3cDbr2K10B+6BGkLq7r@public.gmane.org>
2013-12-11 14:12               ` Jan Kara [this message]
  -- strict thread matches above, loose matches on Subject: below --
2013-12-11 18:01 Dan Duval
2013-12-11 20:11 ` Jan Kara

find likely ancestor, descendant, or conflicting patches for this message:
( dfblob:1f4a10ece2f dfblob:e0259a163f9 )
 OR (
bs:"writeback: Fix data corruption on NFS" )
	(help)

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20131211141240.GA1163@quack.suse.cz \
    --to=jack-alswssmvlrq@public.gmane.org \
    --cc=chuck.lever-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org \
    --cc=dan.duval-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org \
    --cc=linux-fsdevel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org \
    --cc=linux-nfs-u79uwXL29TY76Z2rM5mHXA@public.gmane.org \
    --cc=trond.myklebust-7I+n7zu2hftEKMMhf/gKZA@public.gmane.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).