From: Nick Piggin <npiggin@suse.de>
To: Dave Chinner <david@fromorbit.com>
Cc: Josef Bacik <josef@redhat.com>,
linux-fsdevel@vger.kernel.org, chris.mason@oracle.com,
hch@infradead.org, akpm@linux-foundation.org,
linux-kernel@vger.kernel.org
Subject: Re: [RFC] new ->perform_write fop
Date: Fri, 14 May 2010 17:22:19 +1000 [thread overview]
Message-ID: <20100514072219.GC4706@laptop> (raw)
In-Reply-To: <20100514064145.GJ13617@dastard>
On Fri, May 14, 2010 at 04:41:45PM +1000, Dave Chinner wrote:
> On Thu, May 13, 2010 at 11:30:57PM -0400, Josef Bacik wrote:
> > So this is what I had envisioned, we make write_begin take a nr_pages pointer
> > and tell it how much data we have to write, then in the filesystem we allocate
> > as many pages as we feel like, idealy something like
> >
> > min(number of pages we need for the write, some arbitrary limit for security)
>
> Actually, i was thinking that the RESERVE call determines the size
> of the chunk (in the order of 1-4MB maximum). IOWs, we pass in the
> start offset of the write, the entire length remaining, and the
> RESERVE call determines how much it will allow in one loop.
>
> written = 0;
> while (bytes_remaining > 0) {
> chunklen = ->allocate(off, bytes_remaining, RESERVE);
> write_begin(&pages, off, chunklen);
> copied = copy_pages(&pages, iov_iter, chunklen);
> .....
> bytes_remaining -= copied;
> off += copied;
> written += copied;
> }
How much benefit are you expecting to get? I have no problems with
non-pagecache block allocation/freeing APIs to get the most out of
amortising expensive locking or tree manipulations, but I don't
know if it is a good idea to pin a big chunk of pagecache like that.
I'd say just do the per-page manipulations like we currently do.
> > and then we allocate all those pages. Then you can pass them into
> > block_write_begin, which will walk the pages, allocating buffer heads and
> > allocating the space as needed.
>
> Close - except that I'd like like to move the block allocation out of
> ->write_begin until after the copy into the pages has completed.
> Hence write_begin only deals with populating the page cache and
> anything filesystem specific like locking or allocating/populating
> per-page structures like bufferheads. i.e. ->write_begin does not
> do anything that cannot be undone easily if the copy fails.
Hmm, not so sure. You run into other problems.
Write to a hole? Then you have to clear page uptodate and unmap
any potential mmaps on it before the copy to pagecache, in case
the allocation fails (otherwise you have exposed inconsistent
state to userspace).
Write to a hole, then to a non-hole? Then your write to the non
hole pagecache cannot be recovered in case the block allocation
for the hole fails and we need to return a short write.
> > Now since we're coming into write_begin with "we want to write X bytes" we can
> > go ahead and do the enospc checks for X bytes, and then if we are good to go,
> > chances are we won't fail.
>
> Well, the "RESERVE" callout I proposed should mean that the
> allocation that follows the copy should never fail with ENOSPC. i.e.
> if there isn't space, the RESERVE call will fail with ENOSPC, not
> the actual ALLOCATE call that occurs after the data is copied.
>
> Basically I was thinking that in the XFS case, RESERVE would scan the
> range for holes, and reserve all blocks needed to fill all the holes
> in the range. Then the ALLOCATE call would mark them all delalloc.
> The space reservation and delalloc is done in a single call right
> now, but splitting them is not hard...
>
> > Except if we're overwriting a holey section of the file, we're going to be
> > screwed in both your way and my way. My way probably would be the most likely
> > to fail, since we could fail to do the copy_from_user, but hopefully the segment
> > checks and doing the fault_in_readable before all of this would keep those
> > problems to a minimum.
>
> fault_in_readable() only faults the first page of the first iov
> passed in. It can still fail on any other page the iov points to.
> And even then, fault_in_readable() does not pin the page it faulted
> in memory, so even that can get recycled before we start the copy
> and we'd still get EFAULT.
>
> > In your case the only failure point is in the allocate step. If we fail on down
> > the line after we've done some hole filling, we'll be hard pressed to go back
> > and free up those blocks. Is that what you are talking about, having the
> > allocate(UNRESERVE) thing being able to go back and figure out what should have
> > been holes needs to be holes again?
>
> Yeah, that's kind of what I was thinking of. After a bit more
> thought, if the allocation only covers part of the range (for
> whatever reason), we can still allow that allocated part of the
> write to complete succesfully, and the UNRESERVE call simply returns
> the unused part of the space reservation. This would be a partial
> write and be much simpler to deal with than punching out the
> allocations that succeeded.
Do we really need to restore holes inside i_size in error cases?
Current code does not. It just leaves the blocks allocated and
zeroes them.
next prev parent reply other threads:[~2010-05-14 7:22 UTC|newest]
Thread overview: 44+ messages / expand[flat|nested] mbox.gz Atom feed top
2010-05-12 21:24 [RFC] new ->perform_write fop Josef Bacik
2010-05-13 1:39 ` Josef Bacik
2010-05-13 15:36 ` Christoph Hellwig
2010-05-14 1:00 ` Dave Chinner
2010-05-14 3:30 ` Josef Bacik
2010-05-14 5:50 ` Nick Piggin
2010-05-14 7:20 ` Dave Chinner
2010-05-14 7:33 ` Nick Piggin
2010-05-14 6:41 ` Dave Chinner
2010-05-14 7:22 ` Nick Piggin [this message]
2010-05-14 8:38 ` Dave Chinner
2010-05-14 13:33 ` Chris Mason
2010-05-18 6:36 ` Nick Piggin
2010-05-18 8:05 ` Dave Chinner
2010-05-18 10:43 ` Nick Piggin
2010-05-18 12:27 ` Dave Chinner
2010-05-18 15:09 ` Nick Piggin
2010-05-19 23:50 ` Dave Chinner
2010-05-20 6:48 ` Nick Piggin
2010-05-20 20:12 ` Jan Kara
2010-05-20 23:05 ` Dave Chinner
2010-05-21 9:05 ` Steven Whitehouse
2010-05-21 13:50 ` Josef Bacik
2010-05-21 14:23 ` Nick Piggin
2010-05-21 15:19 ` Josef Bacik
2010-05-24 3:29 ` Nick Piggin
2010-05-22 0:31 ` Dave Chinner
2010-05-21 18:58 ` Jan Kara
2010-05-22 0:27 ` Dave Chinner
2010-05-24 9:20 ` Jan Kara
2010-05-24 9:33 ` Nick Piggin
2010-06-05 15:05 ` tytso
2010-06-06 7:59 ` Nick Piggin
2010-05-21 15:15 ` Christoph Hellwig
2010-05-22 2:31 ` Nick Piggin
2010-05-22 8:37 ` Dave Chinner
2010-05-24 3:09 ` Nick Piggin
2010-05-24 5:53 ` Dave Chinner
2010-05-24 6:55 ` Nick Piggin
2010-05-24 10:21 ` Dave Chinner
2010-06-01 6:27 ` Nick Piggin
2010-05-24 18:40 ` Joel Becker
2010-05-17 23:35 ` Jan Kara
2010-05-18 1:21 ` Dave Chinner
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20100514072219.GC4706@laptop \
--to=npiggin@suse.de \
--cc=akpm@linux-foundation.org \
--cc=chris.mason@oracle.com \
--cc=david@fromorbit.com \
--cc=hch@infradead.org \
--cc=josef@redhat.com \
--cc=linux-fsdevel@vger.kernel.org \
--cc=linux-kernel@vger.kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).