From: Dave Chinner <david@fromorbit.com>
To: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>,
Oleg Nesterov <oleg@redhat.com>,
Linux Kernel <linux-kernel@vger.kernel.org>,
Ben Myers <bpm@sgi.com>, Alexander Viro <viro@zeniv.linux.org.uk>,
Dave Jones <davej@redhat.com>,
xfs@oss.sgi.com
Subject: Re: splice vs execve lockdep trace.
Date: Thu, 18 Jul 2013 13:17:59 +1000
Message-ID: <20130718031759.GN11674@dastard>
In-Reply-To: <CA+55aFxdqzMY5VJoYaLmL=+=f2s1cbHHV-TjC3=taXpF-xov1w@mail.gmail.com>

On Wed, Jul 17, 2013 at 09:03:11AM -0700, Linus Torvalds wrote:
> On Tue, Jul 16, 2013 at 10:51 PM, Dave Chinner <david@fromorbit.com> wrote:
> >
> > But When i say "stale data" I mean that the data being returned
> > might not have originally belonged to the underlying file you are
> > reading.
>
> We're still talking at cross purposes then.
>
> How the hell do you handle mmap() and page faulting?

We cross our fingers and hope. Always have. Races are rare because,
historically, only a handful of applications have done the
operations needed to trigger them. However, with holepunch now a
generic fallocate() operation....

> Because if you return *that* kind of stale data, than you're horribly
> horribly buggy. And you cannot *possibly* blame
> generic_file_splice_read() on that.

Right, it's horribly buggy and I'm not blaming
generic_file_splice_read().

I'm saying that the page cache architecture does not provide
mechanisms to avoid the problem. i.e. we can't synchronise
multi-page operations against a single page operation that only uses
the page lock for serialisation without some form of filesystem
specific locking. And the i_mutex/i_iolock/mmap_sem inversion
problems essentially prevent us from being able to fix it in a
filesystem specific manner.

We've hacked around this read vs invalidation race condition for
truncate() by putting ordered operations in place to avoid
refaulting after invalidation by read operations. i.e. truncate was
optimised to avoid extra locking, but now the realisation is that
truncate is just a degenerate case of hole punching and that hole
punching cannot make use of the same "beyond EOF" optimisations to
avoid race conditions with other IO.
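
To make the operations concrete, here is a minimal userspace sketch in C.
It assumes Linux and a filesystem that supports FALLOC_FL_PUNCH_HOLE. The
function name and the temporary-file template are hypothetical. It is
single threaded, so it only shows the well-ordered case (fault, punch,
refault) and does not reproduce the race: the punched range lies inside
EOF, which is why the "beyond EOF" trick used for truncate() cannot apply.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <linux/falloc.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Write four pages of 0xAB, map them, punch out the middle two pages and
 * read the mapping again. Returns 1 if the punched range reads back as
 * zeroes while the outer pages are untouched, 0 if old data is still
 * visible, and -1 if setup failed or the filesystem cannot punch holes.
 * path_template must end in "XXXXXX" (it is handed to mkstemp()).
 */
int punch_hole_under_mapping(const char *path_template)
{
	char path[256];
	long psz = sysconf(_SC_PAGESIZE);
	size_t len;
	unsigned char *buf = NULL;
	volatile unsigned char *map = MAP_FAILED;
	int fd, result = -1;

	if (psz <= 0 || strlen(path_template) >= sizeof(path))
		return -1;
	strcpy(path, path_template);
	len = 4 * (size_t)psz;

	fd = mkstemp(path);
	if (fd < 0)
		return -1;
	unlink(path);		/* the file lives only as long as fd */

	buf = malloc(len);
	if (!buf)
		goto out;
	memset(buf, 0xAB, len);
	if (write(fd, buf, len) != (ssize_t)len)
		goto out;

	map = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, 0);
	if (map == MAP_FAILED)
		goto out;

	/* Fault the pages in through the mapping before the punch. */
	if (map[psz] != 0xAB || map[2 * psz] != 0xAB)
		goto out;

	/* Punch pages 1 and 2; the file size stays the same. */
	if (fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
		      psz, 2 * psz) != 0)
		goto out;	/* e.g. EOPNOTSUPP on this filesystem */

	/* These reads refault; the punched range must now be zeroes. */
	result = map[0] == 0xAB && map[psz] == 0 &&
		 map[3 * psz - 1] == 0 && map[3 * psz] == 0xAB;
out:
	if (map != MAP_FAILED)
		munmap((void *)map, len);
	free(buf);
	close(fd);
	return result;
}
```

The problem case is a second thread faulting on the mapping while the
punch is between its invalidation and its extent removal; nothing in this
sketch serialises that.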

We (XFS developers) have known about this for years, but we've
always been told when it's been raised that it's "just a wacky XFS
problem". Now that other filesystems are actually implementing the
same functionality that XFS has had since day zero, they are also
seeing the same architectural deficiencies in the generic code. i.e.
they are not actually "wacky XFS problems". That's why we were
talking about a range-locking solution to this problem at LSF/MM
this year - to find a generic solution to the issue...
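
For readers unfamiliar with the idea, a range lock serialises operations
by the byte range they touch, rather than by whole file or by single page.
A minimal userspace sketch in C follows, under loud assumptions: the names
(range_lock_table, range_trylock, range_unlock), the fixed-size table and
the exclusive, try-lock-only semantics are all hypothetical, and none of
it is taken from this thread or from any kernel interface. A real
implementation would also need blocking waiters, shared/exclusive modes
and its own internal locking.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_HELD 16	/* hypothetical fixed capacity, sketch only */

/* One held byte range, [start, end] inclusive. */
struct held_range {
	unsigned long long start;
	unsigned long long end;
	bool in_use;
};

struct range_lock_table {
	struct held_range held[MAX_HELD];
};

/* Two inclusive ranges overlap if each starts no later than the other ends. */
bool ranges_overlap(unsigned long long a_start, unsigned long long a_end,
		    unsigned long long b_start, unsigned long long b_end)
{
	return a_start <= b_end && b_start <= a_end;
}

/*
 * Try to take [start, end] exclusively. Returns a slot index (>= 0) on
 * success, or -1 if the range overlaps one already held or the table is
 * full.
 */
int range_trylock(struct range_lock_table *t,
		  unsigned long long start, unsigned long long end)
{
	int free_slot = -1;

	for (int i = 0; i < MAX_HELD; i++) {
		struct held_range *r = &t->held[i];

		if (!r->in_use) {
			if (free_slot < 0)
				free_slot = i;
			continue;
		}
		if (ranges_overlap(start, end, r->start, r->end))
			return -1;
	}
	if (free_slot < 0)
		return -1;

	t->held[free_slot] = (struct held_range){ start, end, true };
	return free_slot;
}

void range_unlock(struct range_lock_table *t, int slot)
{
	if (slot >= 0 && slot < MAX_HELD)
		t->held[slot].in_use = false;
}

/* Usage: a hole punch over pages 1-2 excludes a fault on page 2 only. */
void range_lock_example(void)
{
	struct range_lock_table t = {0};
	int punch = range_trylock(&t, 4096, 12287);

	assert(punch >= 0);
	assert(range_trylock(&t, 8192, 12287) == -1);	/* fault inside hole */
	assert(range_trylock(&t, 0, 4095) >= 0);	/* fault on page 0 */
	range_unlock(&t, punch);
	assert(range_trylock(&t, 8192, 12287) >= 0);	/* hole punch done */
}
```

The point of the design is that a multi-page operation holds one range
for its whole duration, so a single-page operation inside that range has
something to wait on other than the individual page locks.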

FWIW, this problem is not just associated with splice reads - it's a
problem for the direct IO code, too. The direct IO layer has lots of
hacky invalidation code that tries to work around the fact that
mmap() page faults cannot be synchronised against direct IO in
progress. Hence it invalidates caches before and after direct IO is
done in the hope that we don't have a page fault that races and
leaves us with out-of-date data being exposed to userspace via mmap.
Indeed, we have a regression test that demonstrates how this often
fails - xfstests:generic/263 uses fsx with direct IO and mmap on the
same file and will fail with data corruption on XFS.

Cheers,

Dave.
--
Dave Chinner
david@fromorbit.com