Re: [RFC] new ->perform_write fop

public inbox for linux-fsdevel@vger.kernel.org
 help / color / mirror / Atom feed

From: Nick Piggin <npiggin@suse.de>
To: Josef Bacik <josef@redhat.com>
Cc: Dave Chinner <david@fromorbit.com>, Jan Kara <jack@suse.cz>,
	linux-fsdevel@vger.kernel.org, chris.mason@oracle.com,
	hch@infradead.org, akpm@linux-foundation.org,
	linux-kernel@vger.kernel.org
Subject: Re: [RFC] new ->perform_write fop
Date: Mon, 24 May 2010 13:29:42 +1000	[thread overview]
Message-ID: <20100524032942.GR2516@laptop> (raw)
In-Reply-To: <20100521151921.GO32248@dhcp231-156.rdu.redhat.com>

On Fri, May 21, 2010 at 11:19:22AM -0400, Josef Bacik wrote:
> On Sat, May 22, 2010 at 12:23:54AM +1000, Nick Piggin wrote:
> > On Fri, May 21, 2010 at 09:50:54AM -0400, Josef Bacik wrote:
> > > On Fri, May 21, 2010 at 09:05:24AM +1000, Dave Chinner wrote:
> > > > Allocating multipage writes as unwritten extents turns off delayed
> > > > allocation and hence we'd lose all the benefits that this gives...
> > > >
> > > 
> > > I just realized we have another problem, the mmap_sem/page_lock deadlock.
> > > Currently BTRFS is susceptible to this, since we don't prefault any of the pages
> > > in yet.  If we're going to do multi-page writes we're going to need to have a
> > > way to fault in all of the iovec's at once, so when we do the
> > > pagefault_disable() copy pagefault_enable() we don't just end up copying the
> > > first iovec.  Nick have you done something like this already?  If not I assume
> > > I can just loop through all the iovec's and call fault_in_pages_readable on all
> > > of them and be good to go right?  Thanks,
> > 
> > Yes, well it's a different issue. With multi-page writes, even a single
> > iovec may not be faulted in properly. And with multiple iovecs, we are
> > already suboptimal with faulting.
> > 
> > faulting in multiple iovecs may already be a good idea. I didn't add
> > that code, I had hoped for a test case first, but perhaps we can just
> > go and add it.
> > 
> > With multipage writes, we would want to fault in multiple source pages
> > at once if the design was to lock multiple pages at once and do the
> > copy. I still think we might be able to just lock and copy one page at
> > a time, but I could be wrong.
> > 
> 
> I was thinking about this possibility, it seems this is what FUSE does already.
> It would probably be easier to deal with this fault problem, just do all the fs
> stuff to prepare for the total amount, do the copy one page at a time, and then
> if something goes wrong we can cleanup properly.

Yep. The fewer pages we have to copy to/from at once, the more robust
it should be.

 
> > Oh wow, btrfs is deadlocky there. Firstly, fault_in_pages_readable does
> > not guarantee success (race can always unmap the page in the meantime).
> > Secondly, calling it inside the page lock section just means it will
> > cause the deadlock rather than the copy_from_user.
> > 
> > Quick workaround to reduce probability is to do fault_in_pages_readable
> > calls before locking the pages.
> > 
> > But you really need to handle the short-copy case. From the error
> > handling there, it seems like you can just free_reserved_data_space and
> > retry the copy?
> > 
> 
> Well no, if theres a short copy we just exit.  If we do the
> fault_in_pages_readable before we do the prepare_pages we could deal with a
> short-copy then.  Thanks,

Take a look at the code in generic buffered write. It's not hard, but it
is easy to get the error case wrong.

You need to fault in the source pages before holding page lock. If this
fails, then you may exit. After locking the pages, you need to do a
pagefault_disable() and atomic kmaps/usercopies. If this copy comes up
short, you need to unlock and retry faulting in the source pages.

The reason for this is that the source page may be unmapped but there is
still a valid memory mapping at the point of the usercopy. So you must
not fail that.

And, if some bytes have been copied into pagecache, and if that pagecache
page is marked as uptodate, then you must treat it as a partial write.

next prev parent reply	other threads:[~2010-05-24  3:29 UTC|newest]

Thread overview: 44+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2010-05-12 21:24 [RFC] new ->perform_write fop Josef Bacik
2010-05-13  1:39 ` Josef Bacik
2010-05-13 15:36   ` Christoph Hellwig
2010-05-14  1:00   ` Dave Chinner
2010-05-14  3:30     ` Josef Bacik
2010-05-14  5:50       ` Nick Piggin
2010-05-14  7:20         ` Dave Chinner
2010-05-14  7:33           ` Nick Piggin
2010-05-14  6:41       ` Dave Chinner
2010-05-14  7:22         ` Nick Piggin
2010-05-14  8:38           ` Dave Chinner
2010-05-14 13:33             ` Chris Mason
2010-05-18  6:36             ` Nick Piggin
2010-05-18  8:05               ` Dave Chinner
2010-05-18 10:43                 ` Nick Piggin
2010-05-18 12:27                   ` Dave Chinner
2010-05-18 15:09                     ` Nick Piggin
2010-05-19 23:50                       ` Dave Chinner
2010-05-20  6:48                         ` Nick Piggin
2010-05-20 20:12                         ` Jan Kara
2010-05-20 23:05                           ` Dave Chinner
2010-05-21  9:05                             ` Steven Whitehouse
2010-05-21 13:50                             ` Josef Bacik
2010-05-21 14:23                               ` Nick Piggin
2010-05-21 15:19                                 ` Josef Bacik
2010-05-24  3:29                                   ` Nick Piggin [this message]
2010-05-22  0:31                               ` Dave Chinner
2010-05-21 18:58                             ` Jan Kara
2010-05-22  0:27                               ` Dave Chinner
2010-05-24  9:20                                 ` Jan Kara
2010-05-24  9:33                                   ` Nick Piggin
2010-06-05 15:05                                   ` tytso
2010-06-06  7:59                                     ` Nick Piggin
2010-05-21 15:15           ` Christoph Hellwig
2010-05-22  2:31             ` Nick Piggin
2010-05-22  8:37               ` Dave Chinner
2010-05-24  3:09                 ` Nick Piggin
2010-05-24  5:53                   ` Dave Chinner
2010-05-24  6:55                     ` Nick Piggin
2010-05-24 10:21                       ` Dave Chinner
2010-06-01  6:27                         ` Nick Piggin
2010-05-24 18:40                       ` Joel Becker
2010-05-17 23:35         ` Jan Kara
2010-05-18  1:21           ` Dave Chinner

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20100524032942.GR2516@laptop \
    --to=npiggin@suse.de \
    --cc=akpm@linux-foundation.org \
    --cc=chris.mason@oracle.com \
    --cc=david@fromorbit.com \
    --cc=hch@infradead.org \
    --cc=jack@suse.cz \
    --cc=josef@redhat.com \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox