Re: splice vs execve lockdep trace.

From: Dave Chinner <david@fromorbit.com>
To: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>,
	Oleg Nesterov <oleg@redhat.com>,
	Linux Kernel <linux-kernel@vger.kernel.org>,
	Ben Myers <bpm@sgi.com>, Alexander Viro <viro@zeniv.linux.org.uk>,
	Dave Jones <davej@redhat.com>,
	xfs@oss.sgi.com
Subject: Re: splice vs execve lockdep trace.
Date: Thu, 18 Jul 2013 13:17:59 +1000
Message-ID: <20130718031759.GN11674@dastard>
In-Reply-To: <CA+55aFxdqzMY5VJoYaLmL=+=f2s1cbHHV-TjC3=taXpF-xov1w@mail.gmail.com>

On Wed, Jul 17, 2013 at 09:03:11AM -0700, Linus Torvalds wrote:
> On Tue, Jul 16, 2013 at 10:51 PM, Dave Chinner <david@fromorbit.com> wrote:
> >
> > But when I say "stale data" I mean that the data being returned
> > might not have originally belonged to the underlying file you are
> > reading.
> 
> We're still talking at cross purposes then.
> 
> How the hell do you handle mmap() and page faulting?

We cross our fingers and hope. Always have. Races are rare as
historically there have been only a handful of applications that do
the necessary operations to trigger them. However, with holepunch
now a generic fallocate() operation....

> Because if you return *that* kind of stale data, then you're horribly
> horribly buggy. And you cannot *possibly* blame
> generic_file_splice_read() on that.

Right, it's horribly buggy and I'm not blaming
generic_file_splice_read().

I'm saying that the page cache architecture does not provide
mechanisms to avoid the problem. i.e. that we can't synchronise
multi-page operations against a single page operation that only uses
the page lock for serialisation without some form of filesystem
specific locking. And that the i_mutex/i_iolock/mmap_sem inversion
problems essentially prevent us from being able to fix it in a
filesystem specific manner.

We've hacked around this read vs invalidation race condition for
truncate() by putting ordered operations in place to avoid
refaulting after invalidation by read operations. i.e. truncate was
optimised to avoid extra locking, but now the realisation is that
truncate is just a degenerate case of hole punching and that hole
punching cannot make use of the same "beyond EOF" optimisations to
avoid race conditions with other IO.
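
To make the truncate vs hole punch distinction concrete, here is a
minimal userspace sketch (not from the original mail). It assumes
Linux with glibc; the helper names punch_hole() and
punch_vs_truncate_demo(), the /tmp path and the 4096 byte block size
are hypothetical choices for illustration. It only shows the visible
semantics, not the race: a punched range reads back as zeros while
data after it and EOF stay put, whereas truncate moves EOF itself.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE            /* for fallocate() and the FALLOC_FL_* flags */
#endif
#include <errno.h>
#include <fcntl.h>
#include <linux/falloc.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define BLK 4096               /* assumed block size, illustration only */

/*
 * Hole punch: free the blocks backing [offset, offset + len) but keep
 * the file size, so valid data can remain on both sides of the range.
 */
static int punch_hole(int fd, off_t offset, off_t len)
{
    return fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                     offset, len);
}

/*
 * Returns 0 if the observed behaviour matches the description above
 * (or if the filesystem does not support hole punching), -1 otherwise.
 */
int punch_vs_truncate_demo(void)
{
    char path[] = "/tmp/punch-demo-XXXXXX";   /* hypothetical location */
    char buf[2 * BLK];
    int ret = -1;
    int fd = mkstemp(path);

    if (fd < 0)
        return -1;
    unlink(path);                             /* file lives until close() */

    memset(buf, 'x', sizeof(buf));
    if (pwrite(fd, buf, sizeof(buf), 0) != (ssize_t)sizeof(buf))
        goto out;

    if (punch_hole(fd, 0, BLK) == 0) {
        if (pread(fd, buf, sizeof(buf), 0) != (ssize_t)sizeof(buf))
            goto out;
        /* The punched range reads as zeros; data beyond it survives. */
        if (buf[0] != 0 || buf[BLK - 1] != 0 || buf[BLK] != 'x')
            goto out;
        /* EOF has not moved, so there is no "beyond EOF" to rely on. */
        if (lseek(fd, 0, SEEK_END) != 2 * BLK)
            goto out;
    } else if (errno != EOPNOTSUPP) {
        goto out;                             /* unexpected failure */
    }

    /* Truncate: everything from the new size onwards goes, EOF moves. */
    if (ftruncate(fd, BLK) != 0)
        goto out;
    if (lseek(fd, 0, SEEK_END) != BLK)
        goto out;
    ret = 0;
out:
    close(fd);
    return ret;
}
```

Because the hole sits inside i_size, a concurrent read or page fault
on that range is a perfectly valid operation, which is why the
ordering tricks used for truncate do not carry over.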

We (XFS developers) have known about this for years, but we've
always been told when it's been raised that it's "just a wacky XFS
problem".  Now that other filesystems are actually implementing the
same functionality that XFS has had since day zero, they are also
seeing the same architectural deficiencies in the generic code. i.e.
they are not actually "wacky XFS problems".  That's why we were
talking about a range-locking solution to this problem at LSF/MM
this year - to find a generic solution to the issue...

FWIW, this problem is not just associated with splice reads - it's a
problem for the direct IO code, too. The direct IO layer has lots of
hacky invalidation code that tries to work around the fact that
mmap() page faults cannot be synchronised against direct IO in
progress. Hence it invalidates caches before and after direct IO is
done in the hope that we don't have a page fault that races and
leaves us with out-of-date data being exposed to userspace via mmap.
Indeed, we have a regression test that demonstrates how this often
fails - xfstests:generic/263 uses fsx with direct IO and mmap on the
same file and will fail with data corruption on XFS.
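
For readers unfamiliar with the access pattern, a rough sketch of
"direct IO and mmap on the same file" follows. This is not the fsx
workload from generic/263 and it does not reproduce the race; it is a
single-threaded illustration, assuming Linux with glibc, with a
hypothetical function name, /tmp path and 4096 byte alignment. The
mapping is faulted in first, then the same block is overwritten with
O_DIRECT, then the mapping is read again. With no concurrent page
fault the invalidation done by the direct IO path is expected to make
the new data visible; the problem described above is a fault racing
with that invalidation.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE            /* for O_DIRECT */
#endif
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

#define BLK 4096               /* assumed direct IO alignment, illustration only */

/*
 * Returns 0 if the mapping shows the data written by direct IO (or if
 * the filesystem refuses O_DIRECT with EINVAL), 1 if the mapping still
 * shows old data, and -1 on any other error.
 */
int dio_then_mmap_demo(void)
{
    char path[] = "/tmp/dio-mmap-XXXXXX";     /* hypothetical location */
    volatile unsigned char *map = MAP_FAILED;
    void *buf = NULL;
    ssize_t n;
    int flags;
    int ret = -1;
    int fd = mkstemp(path);

    if (fd < 0)
        return -1;
    unlink(path);

    /* Switch the descriptor to direct IO; not every filesystem allows it. */
    flags = fcntl(fd, F_GETFL);
    if (flags < 0 || fcntl(fd, F_SETFL, flags | O_DIRECT) != 0) {
        ret = (errno == EINVAL) ? 0 : -1;
        goto out;
    }

    if (ftruncate(fd, BLK) != 0)
        goto out;
    map = mmap(NULL, BLK, PROT_READ, MAP_SHARED, fd, 0);
    if (map == MAP_FAILED)
        goto out;
    (void)map[0];              /* page fault: pulls the page into the page cache */

    if (posix_memalign(&buf, BLK, BLK) != 0)
        goto out;
    memset(buf, 'd', BLK);

    /* Direct IO write bypasses the page cache and must invalidate it. */
    n = pwrite(fd, buf, BLK, 0);
    if (n != BLK) {
        ret = (n < 0 && errno == EINVAL) ? 0 : -1;
        goto out;
    }

    /* Re-reading the mapping refaults the page from disk if it was invalidated. */
    ret = (map[0] == 'd' && map[BLK - 1] == 'd') ? 0 : 1;
out:
    free(buf);
    if (map != MAP_FAILED)
        munmap((void *)map, BLK);
    close(fd);
    return ret;
}
```

Nothing in this sequence stops another thread from faulting the page
back in between the invalidation and the completion of the direct IO,
since the fault path takes only the page lock.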

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com
