From: Chris Mason <clm@fb.com>
To: Dave Chinner <david@fromorbit.com>
Cc: Eric Sandeen <sandeen@redhat.com>, xfs@oss.sgi.com
Subject: Re: [PATCH] xfs: don't zero partial page cache pages during O_DIRECT
Date: Fri, 8 Aug 2014 22:32:58 -0400
Message-ID: <53E5885A.9090302@fb.com>
In-Reply-To: <20140809003628.GE26465@dastard>
On 08/08/2014 08:36 PM, Dave Chinner wrote:
> On Fri, Aug 08, 2014 at 10:35:38AM -0400, Chris Mason wrote:
>>
>> xfs is using truncate_pagecache_range to invalidate the page cache
>> during DIO reads. This is different from the other filesystems, which only
>> invalidate pages during DIO writes.
>
> Historical oddity thanks to wrapper functions that were kept way
> longer than they should have been.
>
>> truncate_pagecache_range is meant to be used when we are freeing the
>> underlying data structs from disk, so it will zero any partial ranges
>> in the page. This means a DIO read can zero out part of the page cache
>> page, and it is possible the page will stay in cache.
>
> commit fb59581 ("xfs: remove xfs_flushinval_pages") also removed
> the offset masks that seem to be the issue here. Classic case of a
> regression caused by removing 10+ year old code that was not clearly
> documented and didn't appear important.
>
> The real question is why aren't fsx and other corner case data
> integrity tools tripping over this?
>
My question too. Maybe they don't mix buffered and direct IO on partial
pages? Does fsx only do 4K O_DIRECT?
>> buffered reads will find an up to date page with zeros instead of the
>> data actually on disk.
>>
>> This patch fixes things by leaving the page cache alone during DIO
>> reads.
>>
>> We discovered this when our buffered IO program for distributing
>> database indexes was finding zero filled blocks. I think writes
>> are broken too, but I'll leave that for a separate patch because I don't
>> fully understand what XFS needs to happen during a DIO write.
>>
>> Test program:
>
> Encapsulate it in a generic xfstest, please, and send it to
> fstests@vger.kernel.org.
This test program was looking for races, which we really don't have. It
can be much shorter if it just looks for the improper zeroing from a
single thread. I can send it either way.
[ ... ]
> I guarantee you that there are applications out there that rely on
> the implicit invalidation behaviour for performance. There are also
> applications out there that rely on it for correctness, too, because the
> OS is not the only source of data in the filesystem the OS has
> mounted.
>
> Besides, XFS's direct IO semantics are far saner, more predictable
> and hence are more widely useful than the generic code. As such,
> we're not going to regress semantics that have been unchanged
> over 20 years just to match whatever insanity the generic Linux code
> does right now.
>
> Go on, call me a deranged monkey on some serious mind-controlling
> substances. I don't care. :)
The deranged part is invalidating pos -> -1 (everything from pos to the
end of the file) on a huge file because of a single 512-byte direct read.
But, if you mix O_DIRECT and buffered you get what the monkeys give you
and like it.
>
> I think the fix should probably just be:
>
> - truncate_pagecache_range(VFS_I(ip), pos, -1);
> + invalidate_inode_pages2_range(VFS_I(ip)->i_mapping,
> + pos >> PAGE_CACHE_SHIFT, -1);
>
I'll retest with this in the morning. The invalidate is basically what
we had before with the masking & PAGE_CACHE_SHIFT.
-chris
_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs