From: Dave Chinner <david@fromorbit.com>
To: Ben Myers <bpm@sgi.com>
Cc: Peter Zijlstra <peterz@infradead.org>,
Linux Kernel <linux-kernel@vger.kernel.org>,
Oleg Nesterov <oleg@redhat.com>,
xfs@oss.sgi.com, Alexander Viro <viro@zeniv.linux.org.uk>,
Dave Jones <davej@redhat.com>,
Linus Torvalds <torvalds@linux-foundation.org>
Subject: Re: splice vs execve lockdep trace.
Date: Fri, 19 Jul 2013 08:49:07 +1000
Message-ID: <20130718224907.GD13468@dastard>
In-Reply-To: <20130718222117.GE3572@sgi.com>

On Thu, Jul 18, 2013 at 05:21:17PM -0500, Ben Myers wrote:
> Dave,
>
> On Thu, Jul 18, 2013 at 04:16:32PM -0500, Ben Myers wrote:
> > On Thu, Jul 18, 2013 at 01:42:03PM +1000, Dave Chinner wrote:
> > > On Wed, Jul 17, 2013 at 05:17:36PM -0700, Linus Torvalds wrote:
> > > > On Wed, Jul 17, 2013 at 4:40 PM, Ben Myers <bpm@sgi.com> wrote:
> > > > >>
> > > > >> We're still talking at cross purposes then.
> > > > >>
> > > > >> How the hell do you handle mmap() and page faulting?
> > > > >
> > > > > __xfs_get_blocks serializes access to the block map with the i_lock on the
> > > > > xfs_inode. This appears to be racy with respect to hole punching.
> > > >
> > > > Would it be possible to just make __xfs_get_blocks get the i_iolock
> > > > (non-exclusively)?
> > >
> > > No. __xfs_get_blocks() operates on metadata (e.g. extent lists), and
> > > as such is protected by the i_ilock (note: not the i_iolock). i.e.
> > > XFS has a multi-level locking strategy:
> > >
> > > i_iolock is provided for *data IO serialisation*,
> > > i_ilock is for *inode metadata serialisation*.
> >
> > I think if __xfs_get_blocks has some way of knowing it is the mmap/page fault
> > path, taking the iolock shared in addition to the ilock (in just that case)
> > would prevent the mmap from being able to read stale data from disk. You would
> > see either the data before the punch or you would see the hole.
> >
> > Actually... I think that is wrong: You'd have to take the iolock across the
> > read itself (not just the access to the block map) for it to have the desired
> > effect:
> >
> > int filemap_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
> > ...
> > page_not_uptodate:
> > 	/*
> > 	 * Umm, take care of errors if the page isn't up-to-date.
> > 	 * Try to re-read it _once_. We do this synchronously,
> > 	 * because there really aren't any performance issues here
> > 	 * and we need to check for errors.
> > 	 */
> > 	ClearPageError(page);
> > 	error = mapping->a_ops->readpage(file, page);
> > 	if (!error) {
> > 		wait_on_page_locked(page);
> > 		if (!PageUptodate(page))
> > 			error = -EIO;
> > 	}
> > 	page_cache_release(page);
> >
> > Wouldn't you have to hold the iolock until after wait_on_page_locked returns?
>
> Maybe like so (crappy/untested/probably wrong/fodder for ridicule/etc):

Try running it with lockdep. You'll see pretty quickly why you can't
take the i_iolock or i_mutex in the ->fault path - it is called
with the mmap_sem held.

The lock inversion that can deadlock is that the page fault might be
occurring in the read path that is already holding the
i_mutex/i_iolock....

Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com