public inbox for linux-xfs@vger.kernel.org
 help / color / mirror / Atom feed
From: Dave Chinner <david@fromorbit.com>
To: Brian Foster <bfoster@redhat.com>
Cc: xfs@oss.sgi.com, Viro@disappointment.disaster,
	viro@ZenIV.linux.org.uk, Al@disappointment.disaster
Subject: Re: [PATCH 2/6] xfs: write failure beyond EOF truncates too much data
Date: Sat, 5 Apr 2014 08:26:40 +1100	[thread overview]
Message-ID: <20140404212640.GZ17603@dastard> (raw)
In-Reply-To: <20140404152636.GA40629@bfoster.bfoster>

On Fri, Apr 04, 2014 at 11:26:36AM -0400, Brian Foster wrote:
> On Sat, Mar 29, 2014 at 11:14:19AM -0400, Brian Foster wrote:
> > On Fri, Mar 21, 2014 at 09:11:46PM +1100, Dave Chinner wrote:
> > > From: Dave Chinner <dchinner@redhat.com>
> > > 
> > > If we fail a write beyond EOF and have to handle it in
> > > xfs_vm_write_begin(), we truncate the inode back to the current inode
> > > size. This doesn't take into account the fact that we may have
> > > already made successful writes to the same page (in the case of block
> > > size < page size) and hence we can truncate the page cache away from
> > > blocks with valid data in them. If these blocks are delayed
> > > allocation blocks, we now have a mismatch between the page cache and
> > > the extent tree, and this will trigger - at minimum - a delayed
> > > block count mismatch assert when the inode is evicted from the cache.
> > > We can also trip over it when block mapping for direct IO - this is
> > > the most common symptom seen from fsx and fsstress when run from
> > > xfstests.
> > > 
> > > Fix it by only truncating away the exact range we are updating state
> > > for in this write_begin call.
> > > 
> > > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > > ---
> > >  fs/xfs/xfs_aops.c | 10 ++++++++--
> > >  1 file changed, 8 insertions(+), 2 deletions(-)
> > > 
> > > diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
> > > index e810243..6b4ecc8 100644
> > > --- a/fs/xfs/xfs_aops.c
> > > +++ b/fs/xfs/xfs_aops.c
> > > @@ -1609,12 +1609,18 @@ xfs_vm_write_begin(
> > >  	status = __block_write_begin(page, pos, len, xfs_get_blocks);
> > >  	if (unlikely(status)) {
> > >  		struct inode	*inode = mapping->host;
> > > +		size_t		isize = i_size_read(inode);
> > >  
> > >  		xfs_vm_write_failed(inode, page, pos, len);
> > >  		unlock_page(page);
> > >  
> > > -		if (pos + len > i_size_read(inode))
> > > -			truncate_pagecache(inode, i_size_read(inode));
> > 
> > From what I can see, we have a write_begin/copy/write_end sequence per
> > page and the inode size is updated down in the write_end path. If the
> > write fails at write_begin time, then we haven't copied any data and the
> > inode size hasn't changed.
> > 
> > The intent of the original code looks like it wants to break off any
> > blocks that might have been set up beyond EOF before the write happened
> > to fail. Could you elaborate on how this can kill blocks that might have
> > been written successfully? Doesn't this only affect post-EOF metadata?
> > 
> > > +		/*
> > > +		 * If the write is beyond EOF, we only want to kill blocks
> > > +		 * allocated in this write, not blocks that were previously
> > > +		 * written successfully.
> > > +		 */
> > > +		if (pos + len > isize)
> > > +			truncate_pagecache_range(inode, pos, pos + len);
> > 
> > So the cache truncate now covers the range of the write. What happens if
> > the part of the write is an overwrite? This seems similar to the problem
> > you've described above... should the truncate start at isize instead?
> > 
> 
> I ran an experiment on this to confirm my suspicion here. I added a
> quick little hack to fail any write_begin (set status=-1) for which pos
> != 0. With that in place:
> 
> xfs_io -fc "pwrite -b 2k 0 2k" /mnt/file
> 
> # hexdump /mnt/file 
> 0000000 cdcd cdcd cdcd cdcd cdcd cdcd cdcd cdcd
> *
> 0000800
> 
> xfs_io -c "pwrite -b 2k 1k 2k" /mnt/file
> (fails)
> 
> # hexdump /mnt/file 
> 0000000 cdcd cdcd cdcd cdcd cdcd cdcd cdcd cdcd
> *
> 0000400 0000 0000 0000 0000 0000 0000 0000 0000
> *
> 0000800
> 
> The failed extending write trashes the data over the previously valid
> region.

Good catch, Brian! That is indeed a problem with the fix I made. Not
intentional, as the comment probably lead you to beleive. I did say
I wasn't entirely sure all the fixes were correct, and I greatly
appreciate your diligence here.

The fix should probably be:

		/*
		 * If the write is beyond EOF, we only want to kill blocks
		 * allocated in this write, not blocks that were previously
		 * written successfully.
		 */
		if (pos + len > isize) {
			ssize_t		start = max_t(ssize_t, pos, isize);

			truncate_pagecache_range(inode, start, pos + len);
		}

So if the write overlaps EOF we don't truncate away data from before
EOF. It also has the effect of not truncating data in previous
writes between isize and pos, which was the bug I was trying to
fix...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

  reply	other threads:[~2014-04-04 21:26 UTC|newest]

Thread overview: 17+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2014-03-21 10:11 [RFC, PATCH 0/6] xfs: delalloc, DIO and corruption Dave Chinner
2014-03-21 10:11 ` [PATCH 1/6] xfs: kill buffers over failed write ranges properly Dave Chinner
2014-03-21 10:11 ` [PATCH 2/6] xfs: write failure beyond EOF truncates too much data Dave Chinner
2014-03-29 15:14   ` Brian Foster
2014-04-04 15:26     ` Brian Foster
2014-04-04 21:26       ` Dave Chinner [this message]
2014-03-21 10:11 ` [PATCH 3/6] xfs: xfs_vm_write_end truncates too much on failure Dave Chinner
2014-03-21 10:11 ` [PATCH 4/6] xfs: zeroing space needs to punch delalloc blocks Dave Chinner
2014-03-21 10:11 ` [PATCH 5/6] xfs: splitting delalloc extents can run out of reservation Dave Chinner
2014-04-04 13:37   ` Brian Foster
2014-04-04 21:31     ` Dave Chinner
2014-03-21 10:11 ` [PATCH 6/6] xfs: don't map ranges that span EOF for direct IO Dave Chinner
2014-03-31 17:22 ` [RFC, PATCH 0/6] xfs: delalloc, DIO and corruption Brian Foster
2014-03-31 20:17   ` Dave Chinner
2014-04-01 11:54     ` Brian Foster
  -- strict thread matches above, loose matches on Subject: below --
2014-04-10  5:00 [PATCH 0/6 v2] xfs: delalloc, dio " Dave Chinner
2014-04-10  5:00 ` [PATCH 2/6] xfs: write failure beyond EOF truncates too much data Dave Chinner
2014-04-10 10:35   ` Christoph Hellwig

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20140404212640.GZ17603@dastard \
    --to=david@fromorbit.com \
    --cc=Al@disappointment.disaster \
    --cc=Viro@disappointment.disaster \
    --cc=bfoster@redhat.com \
    --cc=viro@ZenIV.linux.org.uk \
    --cc=xfs@oss.sgi.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox