linux-xfs.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Brian Foster <bfoster@redhat.com>
To: Dave Chinner <david@fromorbit.com>
Cc: xfs@oss.sgi.com, Viro@disappointment.disaster,
	viro@ZenIV.linux.org.uk, Al@disappointment.disaster
Subject: Re: [PATCH 2/6] xfs: write failure beyond EOF truncates too much data
Date: Fri, 4 Apr 2014 11:26:36 -0400	[thread overview]
Message-ID: <20140404152636.GA40629@bfoster.bfoster> (raw)
In-Reply-To: <20140329151419.GA33827@bfoster.bfoster>

On Sat, Mar 29, 2014 at 11:14:19AM -0400, Brian Foster wrote:
> On Fri, Mar 21, 2014 at 09:11:46PM +1100, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> > 
> > If we fail a write beyond EOF and have to handle it in
> > xfs_vm_write_begin(), we truncate the inode back to the current inode
> > size. This doesn't take into account the fact that we may have
> > already made successful writes to the same page (in the case of block
> > size < page size) and hence we can truncate the page cache away from
> > blocks with valid data in them. If these blocks are delayed
> > allocation blocks, we now have a mismatch between the page cache and
> > the extent tree, and this will trigger - at minimum - a delayed
> > block count mismatch assert when the inode is evicted from the cache.
> > We can also trip over it when block mapping for direct IO - this is
> > the most common symptom seen from fsx and fsstress when run from
> > xfstests.
> > 
> > Fix it by only truncating away the exact range we are updating state
> > for in this write_begin call.
> > 
> > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > ---
> >  fs/xfs/xfs_aops.c | 10 ++++++++--
> >  1 file changed, 8 insertions(+), 2 deletions(-)
> > 
> > diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
> > index e810243..6b4ecc8 100644
> > --- a/fs/xfs/xfs_aops.c
> > +++ b/fs/xfs/xfs_aops.c
> > @@ -1609,12 +1609,18 @@ xfs_vm_write_begin(
> >  	status = __block_write_begin(page, pos, len, xfs_get_blocks);
> >  	if (unlikely(status)) {
> >  		struct inode	*inode = mapping->host;
> > +		size_t		isize = i_size_read(inode);
> >  
> >  		xfs_vm_write_failed(inode, page, pos, len);
> >  		unlock_page(page);
> >  
> > -		if (pos + len > i_size_read(inode))
> > -			truncate_pagecache(inode, i_size_read(inode));
> 
> From what I can see, we have a write_begin/copy/write_end sequence per
> page and the inode size is updated down in the write_end path. If the
> write fails at write_begin time, then we haven't copied any data and the
> inode size hasn't changed.
> 
> The intent of the original code looks like it wants to break off any
> blocks that might have been set up beyond EOF before the write happened
> to fail. Could you elaborate on how this can kill blocks that might have
> been written successfully? Doesn't this only affect post-EOF metadata?
> 
> > +		/*
> > +		 * If the write is beyond EOF, we only want to kill blocks
> > +		 * allocated in this write, not blocks that were previously
> > +		 * written successfully.
> > +		 */
> > +		if (pos + len > isize)
> > +			truncate_pagecache_range(inode, pos, pos + len);
> 
> So the cache truncate now covers the range of the write. What happens if
> the part of the write is an overwrite? This seems similar to the problem
> you've described above... should the truncate start at isize instead?
> 

I ran an experiment on this to confirm my suspicion here. I added a
quick little hack to fail any write_begin (set status=-1) for which pos
!= 0. With that in place:

xfs_io -fc "pwrite -b 2k 0 2k" /mnt/file

# hexdump /mnt/file 
0000000 cdcd cdcd cdcd cdcd cdcd cdcd cdcd cdcd
*
0000800

xfs_io -c "pwrite -b 2k 1k 2k" /mnt/file
(fails)

# hexdump /mnt/file 
0000000 cdcd cdcd cdcd cdcd cdcd cdcd cdcd cdcd
*
0000400 0000 0000 0000 0000 0000 0000 0000 0000
*
0000800

The failed extending write trashes the data over the previously valid
region.

Brian

> Brian
> 
> >  
> >  		page_cache_release(page);
> >  		page = NULL;
> > -- 
> > 1.9.0
> > 
> > _______________________________________________
> > xfs mailing list
> > xfs@oss.sgi.com
> > http://oss.sgi.com/mailman/listinfo/xfs
> 
> _______________________________________________
> xfs mailing list
> xfs@oss.sgi.com
> http://oss.sgi.com/mailman/listinfo/xfs

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

  reply	other threads:[~2014-04-04 15:26 UTC|newest]

Thread overview: 17+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2014-03-21 10:11 [RFC, PATCH 0/6] xfs: delalloc, DIO and corruption Dave Chinner
2014-03-21 10:11 ` [PATCH 1/6] xfs: kill buffers over failed write ranges properly Dave Chinner
2014-03-21 10:11 ` [PATCH 2/6] xfs: write failure beyond EOF truncates too much data Dave Chinner
2014-03-29 15:14   ` Brian Foster
2014-04-04 15:26     ` Brian Foster [this message]
2014-04-04 21:26       ` Dave Chinner
2014-03-21 10:11 ` [PATCH 3/6] xfs: xfs_vm_write_end truncates too much on failure Dave Chinner
2014-03-21 10:11 ` [PATCH 4/6] xfs: zeroing space needs to punch delalloc blocks Dave Chinner
2014-03-21 10:11 ` [PATCH 5/6] xfs: splitting delalloc extents can run out of reservation Dave Chinner
2014-04-04 13:37   ` Brian Foster
2014-04-04 21:31     ` Dave Chinner
2014-03-21 10:11 ` [PATCH 6/6] xfs: don't map ranges that span EOF for direct IO Dave Chinner
2014-03-31 17:22 ` [RFC, PATCH 0/6] xfs: delalloc, DIO and corruption Brian Foster
2014-03-31 20:17   ` Dave Chinner
2014-04-01 11:54     ` Brian Foster
  -- strict thread matches above, loose matches on Subject: below --
2014-04-10  5:00 [PATCH 0/6 v2] xfs: delalloc, dio " Dave Chinner
2014-04-10  5:00 ` [PATCH 2/6] xfs: write failure beyond EOF truncates too much data Dave Chinner
2014-04-10 10:35   ` Christoph Hellwig

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20140404152636.GA40629@bfoster.bfoster \
    --to=bfoster@redhat.com \
    --cc=Al@disappointment.disaster \
    --cc=Viro@disappointment.disaster \
    --cc=david@fromorbit.com \
    --cc=viro@ZenIV.linux.org.uk \
    --cc=xfs@oss.sgi.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).