From: Jan Kara <jack@suse.cz>
To: Dave Chinner <david@fromorbit.com>
Cc: Jan Kara <jack@suse.cz>,
	linux-fsdevel@vger.kernel.org, xfs@oss.sgi.com,
	linux-ext4@vger.kernel.org, Hugh Dickins <hughd@google.com>,
	linux-mm@kvack.org
Subject: Re: Hole punching and mmap races
Date: Wed, 6 Jun 2012 01:15:30 +0200
Message-ID: <20120605231530.GB4402@quack.suse.cz>
In-Reply-To: <20120605055150.GF4347@dastard>

On Tue 05-06-12 15:51:50, Dave Chinner wrote:
> On Thu, May 24, 2012 at 02:35:38PM +0200, Jan Kara wrote:
> > > To me the issue at hand is that we have no method of serialising
> > > multi-page operations on the mapping tree between the filesystem and
> > > the VM, and that seems to be the fundamental problem we face in this
> > > whole area of mmap/buffered/direct IO/truncate/holepunch coherency.
> > > Hence it might be better to try to work out how to fix this entire
> > > class of problems rather than just adding a complex kludge that
> > > papers over the current "hot" symptom....
> >   Yes, looking at the above table, the number of different synchronization
> > mechanisms is really striking. So we should probably look at some
> > possibility of unifying at least some of these cases.
> 
> It seems to me that we need something in between the fine grained
> page lock and the entire-file IO exclusion lock. We need to maintain
> fine grained locking for mmap scalability, but we also need to be
> able to atomically lock ranges of pages.
  Yes, and we also need to keep things fine grained to preserve the
scalability of direct IO and buffered reads...
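  To make the gap concrete, a range walk built only on per-page locks looks
roughly like the sketch below (modelled loosely on the truncate/invalidate
pagevec loops; illustrative only, not taken from any real caller). Each
iteration locks and unlocks a single page, so there is never a point at
which the whole range is held - a concurrent fault can repopulate pages
behind the walk:

#include <linux/pagemap.h>
#include <linux/pagevec.h>
#include <linux/sched.h>

/*
 * Illustration only: per-page locking gives no multi-page atomicity.
 */
static void walk_range_per_page(struct address_space *mapping,
				pgoff_t start, pgoff_t end)
{
	struct pagevec pvec;
	pgoff_t index = start;
	int i;

	pagevec_init(&pvec, 0);
	while (index <= end && pagevec_lookup(&pvec, mapping, index,
				min_t(pgoff_t, end - index + 1, PAGEVEC_SIZE))) {
		for (i = 0; i < pagevec_count(&pvec); i++) {
			struct page *page = pvec.pages[i];

			if (page->index > end) {
				index = end + 1;	/* past the range, stop */
				break;
			}
			index = page->index + 1;
			lock_page(page);
			/* operate on this single page only */
			unlock_page(page);
			/* lock dropped: nothing covers the rest of the range */
		}
		pagevec_release(&pvec);
		cond_resched();
	}
}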

> I guess if we were to nest a fine grained multi-state lock
> inside both the IO exclusion lock and the mmap_sem, we might be able
> to kill all problems in one go.
> 
> Exclusive access on a range needs to be granted to:
> 
> 	- direct IO
> 	- truncate
> 	- hole punch
> 
> so they can be serialised against mmap based page faults, writeback
> and concurrent buffered IO. Serialisation against themselves is an
> IO/fs exclusion problem.
> 
> Shared access for traversal or modification needs to be granted to:
> 
> 	- buffered IO
> 	- mmap page faults
> 	- writeback
> 
> Each of these cases can rely on the existing page locks or IO
> exclusion locks to provide safety for concurrent access to the same
> ranges. This means that once we have access granted to a range we
> can check truncate races once and ignore the problem until we drop
> the access.  And the case of taking a page fault within a buffered
> IO won't deadlock because both take a shared lock....
  You cannot just use a lock (not even a shared one) both above and under
mmap_sem. That is deadlockable in the presence of other requests for
exclusive locking, because a queued exclusive waiter blocks later shared
requests... Luckily, with buffered writes the situation isn't that bad.
mmap_sem is needed only before each page is processed (in
iov_iter_fault_in_readable()). Later on in the loop we use
iov_iter_copy_from_user_atomic(), which doesn't need mmap_sem. So we can
just take our shared lock after iov_iter_fault_in_readable() (or simply
leave it to ->write_begin() if we want to give filesystems control over
the locking).
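  To make the ordering concrete, a sketch along the lines of
generic_perform_write(), with hypothetical range_lock_shared() /
range_unlock_shared() calls standing in for the proposed lock (these are
not an existing API, and the short-copy retry handling of the real loop is
omitted):

#include <linux/fs.h>
#include <linux/pagemap.h>
#include <linux/highmem.h>
#include <linux/uaccess.h>
#include <linux/writeback.h>

static ssize_t buffered_write_sketch(struct file *file, struct iov_iter *i,
				     loff_t pos)
{
	struct address_space *mapping = file->f_mapping;
	const struct address_space_operations *a_ops = mapping->a_ops;
	long status = 0;
	ssize_t written = 0;

	do {
		struct page *page;
		void *fsdata;
		unsigned long offset = pos & (PAGE_CACHE_SIZE - 1);
		unsigned long bytes = min_t(unsigned long,
				PAGE_CACHE_SIZE - offset, iov_iter_count(i));
		pgoff_t first = pos >> PAGE_CACHE_SHIFT;
		pgoff_t last = (pos + bytes - 1) >> PAGE_CACHE_SHIFT;
		size_t copied;

		/* May fault and take mmap_sem - done before any range lock. */
		if (unlikely(iov_iter_fault_in_readable(i, bytes))) {
			status = -EFAULT;
			break;
		}

		/* Hypothetical shared lock on the pages being written. */
		range_lock_shared(mapping, first, last);

		status = a_ops->write_begin(file, mapping, pos, bytes, 0,
					    &page, &fsdata);
		if (unlikely(status)) {
			range_unlock_shared(mapping, first, last);
			break;
		}

		/* Copy with page faults disabled, so no mmap_sem here. */
		pagefault_disable();
		copied = iov_iter_copy_from_user_atomic(page, i, offset, bytes);
		pagefault_enable();
		flush_dcache_page(page);

		status = a_ops->write_end(file, mapping, pos, bytes, copied,
					  page, fsdata);
		range_unlock_shared(mapping, first, last);
		if (unlikely(status < 0))
			break;
		copied = status;

		iov_iter_advance(i, copied);
		pos += copied;
		written += copied;
		balance_dirty_pages_ratelimited(mapping);
	} while (iov_iter_count(i));

	return written ? written : status;
}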
 
> We'd need some kind of efficient shared/exclusive range lock for
> this sort of exclusion, and it's entirely possible that it would
> have too much overhead to be acceptable in the page fault path. It's
> the best I can think of right now.....
>
> As it is, a range lock of this kind would be very handy for other
> things, too (like the IO exclusion locks so we can do concurrent
> buffered writes in XFS ;).
  Yes, that's what I thought as well. In particular, it should be pretty
efficient at locking a single-page range, because that's going to be the
majority of calls. I'll try to write something and see how fast it can
be...
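  One possible shape for it, purely as a sketch (none of these names exist
in the kernel): a per-mapping tree of held ranges, with the single-page
case kept as cheap as possible:

#include <linux/fs.h>
#include <linux/rbtree.h>
#include <linux/wait.h>

/*
 * Hypothetical shared/exclusive lock on a range of page indices of an
 * address_space.  Exclusive access would be taken by truncate, hole
 * punch and direct IO; shared access by buffered IO, page faults and
 * writeback.  Shared holders exclude only exclusive holders, not each
 * other.
 */
struct range_lock {
	struct rb_node		node;		/* in a per-mapping range tree */
	pgoff_t			start, end;	/* inclusive page indices */
	bool			exclusive;
	wait_queue_head_t	wait;		/* waiters on conflicting ranges */
};

void range_lock_excl(struct address_space *mapping, pgoff_t start, pgoff_t end);
void range_unlock_excl(struct address_space *mapping, pgoff_t start, pgoff_t end);

void range_lock_shared(struct address_space *mapping, pgoff_t start, pgoff_t end);
void range_unlock_shared(struct address_space *mapping, pgoff_t start, pgoff_t end);

A hole punch would then take range_lock_excl() around unmapping and
truncating the affected pages, while the fault path takes
range_lock_shared() on the single faulting page under mmap_sem, which is
where the per-call overhead matters most.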

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR
