From: Dave Chinner <david@fromorbit.com>
To: Chris Mason <clm@fb.com>
Cc: Eric Sandeen <sandeen@redhat.com>, xfs@oss.sgi.com
Subject: Re: [PATCH v2] xfs: don't zero partial page cache pages during O_DIRECT
Date: Wed, 20 Aug 2014 12:36:49 +1000 [thread overview]
Message-ID: <20140820023649.GV20518@dastard>
In-Reply-To: <20140820021950.GU20518@dastard>
On Wed, Aug 20, 2014 at 12:19:50PM +1000, Dave Chinner wrote:
> On Tue, Aug 19, 2014 at 09:54:14PM -0400, Chris Mason wrote:
> > On 08/19/2014 06:35 PM, Dave Chinner wrote:
> > > On Tue, Aug 19, 2014 at 03:24:48PM -0400, Chris Mason wrote:
> > >> On 08/11/2014 09:17 PM, Dave Chinner wrote:
> > >>> On Sat, Aug 09, 2014 at 08:57:00AM -0400, Chris Mason wrote:
> > >>>>
> > >>>> diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> > >>>> index 1f66779..023d575 100644
> > >>>> --- a/fs/xfs/xfs_file.c
> > >>>> +++ b/fs/xfs/xfs_file.c
> > >>>> @@ -295,7 +295,8 @@ xfs_file_read_iter(
> > >>>> 				xfs_rw_iunlock(ip, XFS_IOLOCK_EXCL);
> > >>>> 				return ret;
> > >>>> 			}
> > >>>> -			truncate_pagecache_range(VFS_I(ip), pos, -1);
> > >>>> +			invalidate_inode_pages2_range(VFS_I(ip)->i_mapping,
> > >>>> +					pos >> PAGE_CACHE_SHIFT, -1);
> > >>>> 		}
> > >>>> 		xfs_rw_ilock_demote(ip, XFS_IOLOCK_EXCL);
> > >>>> 	}
> > >>>
> > >>> I added the WARN_ON_ONCE(ret) check to this and I am seeing it fire
> > >>> occasionally. It always fires immediately before some other ASSERT()
> > >>> that fires due to a block map/page cache inconsistency. It usually
> > >>> fires in a test that runs fsx or fsstress. The fsx failures are new
> > >>> regressions caused by this patch. e.g. generic/263 hasn't failed for
> > >>> months on any of my systems and this patch causes it to fail
> > >>> reliably on my 1k block size test config.
> > >>>
> > >>> I'm going to assume at this point that this is uncovering some other
> > >>> existing bug, but it means I'm not going to push this fix until I
> > >>> understand what is actually happening here. It is possible that what
> > >>> I'm seeing is related to Brian's collapse range bug fixes, but until
> > >>> I applied this direct IO patch I'd never seen fsx throw ASSERTs in
> > >>> xfs_bmap_shift_extents()....
> > >>>
> > >>> Either way, more testing and understanding is needed.
> > >>
> > >> Do you have the output from fsx and the command line args it used? For
> > >> my device, it picks:
> > >>
> > >> -r 4096 -t 512 -w 512 -Z
> > >>
> > >> And for a blocksize 1024 test I did mkfs.xfs -b size=1024
> > >
> > > I'm running:
> > >
> > > $ mkfs.xfs -f -m crc=1,finobt=1 -b size=1k /dev/vda
> > > $ mount /dev/vda /mnt/test
> > > $ ltp/fsx -o 128000 -l 500000 -r 4096 -t 512 -w 512 -Z -d /mnt/test/foo
> > >
> > >> But I can't trigger failures with or without the invalidate_inode_pages2
> > >> change. I was hoping to trigger on 3.16, and then jump back to 3.10 +
> > >> my patch to see if the patch alone was at fault.
> > >
> > > I am seeing failures at operation 1192.
> > >
> > > Yesterday, I found a new class of bufferhead state coherency issues
> > > to do with EOF handling that are causing the problems. Basically,
> > > when the page cache marks a page dirty, the generic code marks all
> > > the buffers on the page dirty, even when they are beyond EOF.
> > >
> > > As a result, when we go to invalidate the page that spans EOF, it
> > > cannot be invalidated because there are dirty buffers on the page.
> > > Those buffers persist in that state because they are beyond EOF,
> > > have no blocks allocated to them, and cannot be written. And so when
> > > we do a direct IO that spans the current EOF, it now fails to
> > > invalidate that page and so triggers the warning.
> > >
> > > Worse is that it appears that these bufferheads can leak into the
> > > interior blocks of the file when the file is extended, leading to
> > > all sorts of other ASSERT failures (which I've been seeing for a
> > > while now).
> > >
> > > I've got the above fsx command to run for somewhere between 100k and
> > > 110k operations with the fixes I currently have, but I haven't found
> > > the cause of the dirty buffer beyond EOF state leaking into the
> > > interior of the file from extend operations yet.
> > >
> > > Once I have something that passes a few million fsx ops....
> >
> > I have to admit, I'm not sure where this leaves us in terms of safely
> > applying my patch to our 3.10 or mainline kernel... Failing to
> > invalidate the page and zeroing the page are really both wrong.
>
> Well, yes they are, and that's one of the reasons why I wanted to
> ensure we caught failures. As such, I think this invalidation bug
> goes back a *long time* because we've never, ever checked for
> invalidation failure before this patch.
>
> However, I think I've got a self-contained fix that can be
> backported for the invalidation problem now - it's well past 10
> million fsx ops at this point with both read and write DIO
> invalidation fixes included. I'll post my series when I've done
> some more robust testing on it....
There are more issues. generic/247 on a 4k block size filesystem just
triggered an invalidation failure on a direct IO write. That test is
specifically mixing DIO overwrite with mmap overwrite on the same
file, so there are clearly more issues in play here than I first
thought.
So, more debugging needed to determine if this is a result of the
usual, currently unsolvable page fault vs DIO races...
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs
Thread overview: 19+ messages
2014-08-08 14:35 [PATCH] xfs: don't zero partial page cache pages during O_DIRECT Chris Mason
2014-08-08 15:17 ` Chris Mason
2014-08-08 16:04 ` [PATCH RFC] xfs: use invalidate_inode_pages2_range for DIO writes Chris Mason
2014-08-09 0:48 ` Dave Chinner
2014-08-09 2:42 ` Chris Mason
2014-08-08 20:39 ` [PATCH] xfs: don't zero partial page cache pages during O_DIRECT Brian Foster
2014-08-09 0:36 ` Dave Chinner
2014-08-09 2:32 ` Chris Mason
2014-08-09 3:19 ` Eric Sandeen
2014-08-09 4:17 ` Dave Chinner
2014-08-09 12:57 ` [PATCH v2] " Chris Mason
2014-08-11 13:29 ` Brian Foster
2014-08-12 1:17 ` Dave Chinner
2014-08-19 19:24 ` Chris Mason
2014-08-19 22:35 ` Dave Chinner
2014-08-20 1:54 ` Chris Mason
2014-08-20 2:19 ` Dave Chinner
2014-08-20 2:36 ` Dave Chinner [this message]
2014-08-20 4:41 ` Dave Chinner