* [RFC] new ->perform_write fop
From: Josef Bacik @ 2010-05-12 21:24 UTC
To: linux-fsdevel; +Cc: chris.mason, hch, akpm, npiggin, linux-kernel

Hello,

I just started adding aio_write to Btrfs and I noticed we're duplicating
_a lot_ of the generic stuff in mm/filemap.c, even though the only thing
that's really unique is the fact that we copy userspace pages in chunks
rather than one page at a time.  What would be best is instead of doing
write_begin/write_end with Btrfs, it would be nice if we could just do our
own perform_write instead of generic_perform_write.  This way we can drop
all of these generic checks that we copied from filemap.c and just get to
the business of actually writing the data.  I hate to add another file
operation, but it would _greatly_ reduce the amount of duplicate code we
have.  If there is no violent objection to this I can put something
together quickly for review.  Thanks,

Josef
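No prototype was posted in this message, but a ->perform_write hook would
presumably mirror the 2.6.34 signature of generic_perform_write() in
mm/filemap.c.  The sketch below is purely illustrative: the struct name,
the field, and the dispatch helper are assumptions, not code from any
kernel tree.

	/*
	 * Hypothetical sketch only.  Note that generic_perform_write()
	 * is static in 2.6.34's mm/filemap.c, so a real patch would also
	 * have to export it for the fallback below to link.
	 */
	struct address_space_operations_sketch {
		/* ... the existing aops ... */
		ssize_t (*perform_write)(struct file *file,
					 struct iov_iter *i, loff_t pos);
	};

	static ssize_t perform_write_or_generic(struct file *file,
			const struct address_space_operations_sketch *a_ops,
			struct iov_iter *i, loff_t pos)
	{
		/* let the filesystem own the whole copy loop ... */
		if (a_ops->perform_write)
			return a_ops->perform_write(file, i, pos);
		/* ... or fall back to the one-page-at-a-time generic loop */
		return generic_perform_write(file, i, pos);
	}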
* Re: [RFC] new ->perform_write fop
From: Josef Bacik @ 2010-05-13 1:39 UTC
To: Josef Bacik; +Cc: linux-fsdevel, chris.mason, hch, akpm, npiggin, linux-kernel

On Wed, May 12, 2010 at 05:24:04PM -0400, Josef Bacik wrote:
> Hello,
>
> I just started adding aio_write to Btrfs and I noticed we're duplicating
> _a lot_ of the generic stuff in mm/filemap.c, even though the only thing
> that's really unique is the fact that we copy userspace pages in chunks
> rather than one page at a time.  What would be best is instead of doing
> write_begin/write_end with Btrfs, it would be nice if we could just do
> our own perform_write instead of generic_perform_write.  This way we can
> drop all of these generic checks that we copied from filemap.c and just
> get to the business of actually writing the data.  I hate to add another
> file operation, but it would _greatly_ reduce the amount of duplicate
> code we have.  If there is no violent objection to this I can put
> something together quickly for review.  Thanks,
>

I just got a suggestion from hpa: instead, move what Btrfs does into
generic_perform_write.  What btrfs does is allocate a chunk of pages to
cover the entirety of the write, set everything up, do the copy from user
into the pages, and tear everything down; essentially what
generic_perform_write does, just with more pages.  I could modify
generic_perform_write and the write_begin/write_end aops to do this, where
write_begin will return how many pages it allocated, we copy all of the
user data into the pages we allocated at once, and then call write_end
with the pages we allocated in write_begin.  Then I could just make btrfs
do write_begin/write_end.  So which option seems more palatable?  Thanks,

Josef
* Re: [RFC] new ->perform_write fop
From: Christoph Hellwig @ 2010-05-13 15:36 UTC
To: Josef Bacik
Cc: linux-fsdevel, chris.mason, hch, akpm, npiggin, linux-kernel, david

On Wed, May 12, 2010 at 09:39:27PM -0400, Josef Bacik wrote:
> I just got a suggestion from hpa: instead, move what Btrfs does into
> generic_perform_write.  What btrfs does is allocate a chunk of pages to
> cover the entirety of the write, set everything up, do the copy from user
> into the pages, and tear everything down; essentially what
> generic_perform_write does, just with more pages.  I could modify
> generic_perform_write and the write_begin/write_end aops to do this, where
> write_begin will return how many pages it allocated, we copy all of the
> user data into the pages we allocated at once, and then call write_end
> with the pages we allocated in write_begin.  Then I could just make btrfs
> do write_begin/write_end.  So which option seems more palatable?  Thanks,

Having a more efficient generic write path would be great.  I'm not quite
sure btrfs already gets everything that is needed right, but getting
started on this would be great.

And to get back to the original question: adding a ->perform_write is a
really dumb idea.  It doesn't fit into the structure of the real aops at
all, as it requires context-dependent locking and so on.  It would just be
a nasty hack around the fact that we still leave too much work to the
filesystem in the write path, and around the bad prototype of the
->aio_read/->aio_write methods.

For a start, generic_segment_checks should move out of the methods and
into fs/read_write.c before calling into the methods.  To do this fully
we'll need to pass down a count of total bytes into ->aio_read and
->aio_write, though.  Adding a new generic_write_prep helper for all the
boilerplate code in ->aio_write also seems fine to me.
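The checks Christoph lists all exist as helpers in 2.6.34, so a
generic_write_prep could be little more than glue.  A sketch, assuming
this hypothetical name and argument list (only the wrapper itself is
invented; every callee below is a real 2.6.34 function):

	#include <linux/fs.h>
	#include <linux/uio.h>
	#include <linux/uaccess.h>

	static int generic_write_prep(struct kiocb *iocb,
				      const struct iovec *iov,
				      unsigned long *nr_segs,
				      loff_t *pos, size_t *count)
	{
		struct file *file = iocb->ki_filp;
		struct inode *inode = file->f_mapping->host;
		int err;

		/* validate the iovec and compute the total byte count */
		err = generic_segment_checks(iov, nr_segs, count, VERIFY_READ);
		if (err)
			return err;

		/* O_APPEND handling, rlimit and s_maxbytes clamping, etc. */
		err = generic_write_checks(file, pos, count,
					   S_ISBLK(inode->i_mode));
		if (err)
			return err;

		/* drop suid/sgid and update mtime/ctime before writing */
		err = file_remove_suid(file);
		if (err)
			return err;
		file_update_time(file);
		return 0;
	}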
* Re: [RFC] new ->perform_write fop
From: Dave Chinner @ 2010-05-14 1:00 UTC
To: Josef Bacik; +Cc: linux-fsdevel, chris.mason, hch, akpm, npiggin, linux-kernel

On Wed, May 12, 2010 at 09:39:27PM -0400, Josef Bacik wrote:
> On Wed, May 12, 2010 at 05:24:04PM -0400, Josef Bacik wrote:
> > Hello,
> >
> > I just started adding aio_write to Btrfs and I noticed we're duplicating
> > _a lot_ of the generic stuff in mm/filemap.c, even though the only thing
> > that's really unique is the fact that we copy userspace pages in chunks
> > rather than one page at a time.  What would be best is instead of doing
> > write_begin/write_end with Btrfs, it would be nice if we could just do
> > our own perform_write instead of generic_perform_write.  This way we can
> > drop all of these generic checks that we copied from filemap.c and just
> > get to the business of actually writing the data.  I hate to add another
> > file operation, but it would _greatly_ reduce the amount of duplicate
> > code we have.  If there is no violent objection to this I can put
> > something together quickly for review.  Thanks,
> >
>
> I just got a suggestion from hpa: instead, move what Btrfs does into
> generic_perform_write.  What btrfs does is allocate a chunk of pages to
> cover the entirety of the write, set everything up, do the copy from user
> into the pages, and tear everything down; essentially what
> generic_perform_write does, just with more pages.

Except that btrfs does things in a very different manner to most
other filesystems ;)

> I could modify generic_perform_write and the write_begin/write_end aops
> to do this, where write_begin will return how many pages it allocated,
> we copy all of the user data into the pages we allocated at once, and
> then call write_end with the pages we allocated in write_begin.  Then I
> could just make btrfs do write_begin/write_end.  So which option seems
> more palatable?  Thanks,

I can see how this would work for btrfs, but the issue is how any
other filesystem would handle it.

I've been trying to get my head around how any filesystem using
bufferheads and generic code can do multipage writes using
write_begin/write_end without modifying the interface, and I just
threw away my second attempt because the error handling just
couldn't be handled cleanly without duplicating the entire
block_write_begin path in each filesystem that wanted to do
multipage writes.

The biggest problem is that block allocation is intermingled with
allocating and attaching bufferheads to pages, hence error handling
can get really nasty and is split across a call chain 3 or 4
functions deep.  The error handling is where I'm finding all the
dangerous and hard-to-kill demons lurking in dark corners.  I suspect
there's a grue in there somewhere, too. ;)

Separating the page+bufferhead allocation and block allocation would
make this much simpler, but I can't fit that easily into the existing
interfaces.

Hence I think that write_begin/copy pages/write_end is not really
suited to multipage writes when allocation is done in write_begin
and the write can then fail in a later stage without a simple method
of undoing the allocation.  We don't have any hole punch interfaces
to the filesystems (and I think only XFS supports that functionality
right now), so handling errors after allocation becomes rather
complex, especially when you have multiple blocks per page.

Hence I've independently come to the conclusion that delaying the
allocation until *after* the copy, as btrfs does, is probably the
best approach to take here.  This largely avoids the error handling
complexity because the write operation is an all-or-nothing
operation.  btrfs has separate hooks for space reservation and
releasing the reservation and doesn't commit to actually allocating
anything until the copying is complete.  Hence cleanup is simple no
matter where a failure occurs.

Personally, I'm tending towards killing the get_blocks() callback as
the first step in this process - turn it into a real inode/address
space operation (say ->allocate) so we can untangle the write path
somewhat (lots of filesystems just provide operations as wrappers
that feed a fs-specific get_blocks callback to generic code).  If the
"create" flag is then converted to a "command" field, the interface
can pass "RESERVE", "ALLOC", "CREATE", etc. to allow different
operations to be clearly handled.

e.g.:

	->allocate(mapping, NULL, off, len, RESERVE)
		reserves necessary space for write
	->write_begin
		grab pages into page cache
		attach bufferheads (if required)
		fail -> goto drop pages
	copy data into pages
		fail -> goto drop pages
	->allocate(mapping, pages, off, len, ALLOC)
		allocates reserved space (if required)
		sets up/maps/updates bufferheads/extents
		fail -> goto drop pages
	->write_end
		set pages dirty + uptodate
	done

drop_pages:
	->allocate(mapping, NULL, off, len, UNRESERVE)
	if needed, zero partial pages
	release pages, clears uptodate.

Basically this allows the copying of data before any allocation is
actually done, but also allows ENOSPC to be detected before starting
the copy.  The filesystem can call whatever helpers it needs inside
->allocate(ALLOC) to set up bufferhead/extent state to match
what has been reserved/allocated/mapped in the RESERVE call.

This will work for btrfs, and it will work for XFS, and I think it
will work for other filesystems that are using bufferheads as well.
Those filesystems that only support a page at a time can continue to
use the current code, but should be able to be converted to the
multipage code by making RESERVE and UNRESERVE no-ops, while ALLOC
does what write_begin+get_blocks currently does to set up block
mappings.

Thoughts?

Cheers,

Dave.
--
Dave Chinner
david@fromorbit.com
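Restated as a C prototype, Dave's outline might look like the following.
Everything here is a sketch of a proposal that was never merged; the enum
values and the exact parameter list are assumptions transcribed from the
pseudocode above.

	enum alloc_cmd {
		RESERVE,	/* reserve space before the data is copied */
		ALLOC,		/* commit the reservation after the copy   */
		UNRESERVE,	/* return unused reservation on failure    */
	};

	/*
	 * Proposed replacement for the get_blocks() callback, promoted to
	 * a real address_space operation.  With pages == NULL the call
	 * operates on the reservation only; with pages it can set up and
	 * map bufferheads/extents over the copied range.  Returns the
	 * length acted upon, or a negative errno (-ENOSPC surfaces at
	 * RESERVE time, before any data has been copied).
	 */
	ssize_t (*allocate)(struct address_space *mapping,
			    struct page **pages, loff_t off, size_t len,
			    enum alloc_cmd cmd);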
* Re: [RFC] new ->perform_write fop
From: Josef Bacik @ 2010-05-14 3:30 UTC
To: Dave Chinner
Cc: Josef Bacik, linux-fsdevel, chris.mason, hch, akpm, npiggin, linux-kernel

On Fri, May 14, 2010 at 11:00:42AM +1000, Dave Chinner wrote:
> On Wed, May 12, 2010 at 09:39:27PM -0400, Josef Bacik wrote:
> > On Wed, May 12, 2010 at 05:24:04PM -0400, Josef Bacik wrote:
> > > Hello,
> > >
> > > I just started adding aio_write to Btrfs and I noticed we're
> > > duplicating _a lot_ of the generic stuff in mm/filemap.c, even though
> > > the only thing that's really unique is the fact that we copy userspace
> > > pages in chunks rather than one page at a time.  What would be best is
> > > instead of doing write_begin/write_end with Btrfs, it would be nice if
> > > we could just do our own perform_write instead of
> > > generic_perform_write.  This way we can drop all of these generic
> > > checks that we copied from filemap.c and just get to the business of
> > > actually writing the data.  I hate to add another file operation, but
> > > it would _greatly_ reduce the amount of duplicate code we have.  If
> > > there is no violent objection to this I can put something together
> > > quickly for review.  Thanks,
> > >
> >
> > I just got a suggestion from hpa: instead, move what Btrfs does into
> > generic_perform_write.  What btrfs does is allocate a chunk of pages to
> > cover the entirety of the write, set everything up, do the copy from
> > user into the pages, and tear everything down; essentially what
> > generic_perform_write does, just with more pages.
>
> Except that btrfs does things in a very different manner to most
> other filesystems ;)
>
> > I could modify generic_perform_write and the write_begin/write_end aops
> > to do this, where write_begin will return how many pages it allocated,
> > we copy all of the user data into the pages we allocated at once, and
> > then call write_end with the pages we allocated in write_begin.  Then I
> > could just make btrfs do write_begin/write_end.  So which option seems
> > more palatable?  Thanks,
>
> I can see how this would work for btrfs, but the issue is how any
> other filesystem would handle it.
>
> I've been trying to get my head around how any filesystem using
> bufferheads and generic code can do multipage writes using
> write_begin/write_end without modifying the interface, and I just
> threw away my second attempt because the error handling just
> couldn't be handled cleanly without duplicating the entire
> block_write_begin path in each filesystem that wanted to do
> multipage writes.
>
> The biggest problem is that block allocation is intermingled with
> allocating and attaching bufferheads to pages, hence error handling
> can get really nasty and is split across a call chain 3 or 4
> functions deep.  The error handling is where I'm finding all the
> dangerous and hard-to-kill demons lurking in dark corners.  I suspect
> there's a grue in there somewhere, too. ;)
>
> Separating the page+bufferhead allocation and block allocation would
> make this much simpler, but I can't fit that easily into the existing
> interfaces.
>
> Hence I think that write_begin/copy pages/write_end is not really
> suited to multipage writes when allocation is done in write_begin
> and the write can then fail in a later stage without a simple method
> of undoing the allocation.  We don't have any hole punch interfaces
> to the filesystems (and I think only XFS supports that functionality
> right now), so handling errors after allocation becomes rather
> complex, especially when you have multiple blocks per page.
>
> Hence I've independently come to the conclusion that delaying the
> allocation until *after* the copy, as btrfs does, is probably the
> best approach to take here.  This largely avoids the error handling
> complexity because the write operation is an all-or-nothing
> operation.  btrfs has separate hooks for space reservation and
> releasing the reservation and doesn't commit to actually allocating
> anything until the copying is complete.  Hence cleanup is simple no
> matter where a failure occurs.
>
> Personally, I'm tending towards killing the get_blocks() callback as
> the first step in this process - turn it into a real inode/address
> space operation (say ->allocate) so we can untangle the write path
> somewhat (lots of filesystems just provide operations as wrappers
> that feed a fs-specific get_blocks callback to generic code).  If the
> "create" flag is then converted to a "command" field, the interface
> can pass "RESERVE", "ALLOC", "CREATE", etc. to allow different
> operations to be clearly handled.
>
> e.g.:
>
> 	->allocate(mapping, NULL, off, len, RESERVE)
> 		reserves necessary space for write
> 	->write_begin
> 		grab pages into page cache
> 		attach bufferheads (if required)
> 		fail -> goto drop pages
> 	copy data into pages
> 		fail -> goto drop pages
> 	->allocate(mapping, pages, off, len, ALLOC)
> 		allocates reserved space (if required)
> 		sets up/maps/updates bufferheads/extents
> 		fail -> goto drop pages
> 	->write_end
> 		set pages dirty + uptodate
> 	done
>
> drop_pages:
> 	->allocate(mapping, NULL, off, len, UNRESERVE)
> 	if needed, zero partial pages
> 	release pages, clears uptodate.
>
> Basically this allows the copying of data before any allocation is
> actually done, but also allows ENOSPC to be detected before starting
> the copy.  The filesystem can call whatever helpers it needs inside
> ->allocate(ALLOC) to set up bufferhead/extent state to match
> what has been reserved/allocated/mapped in the RESERVE call.
>
> This will work for btrfs, and it will work for XFS, and I think it
> will work for other filesystems that are using bufferheads as well.
> Those filesystems that only support a page at a time can continue to
> use the current code, but should be able to be converted to the
> multipage code by making RESERVE and UNRESERVE no-ops, while ALLOC
> does what write_begin+get_blocks currently does to set up block
> mappings.
>
> Thoughts?
>

So this is what I had envisioned: we make write_begin take a nr_pages
pointer and tell it how much data we have to write, then in the filesystem
we allocate as many pages as we feel like, ideally something like

	min(number of pages we need for the write, some arbitrary limit
	    for security)

and then we allocate all those pages.  Then you can pass them into
block_write_begin, which will walk the pages, allocating buffer heads and
allocating the space as needed.

Now since we're coming into write_begin with "we want to write X bytes" we
can go ahead and do the enospc checks for X bytes, and then if we are good
to go, chances are we won't fail.

Except if we're overwriting a holey section of the file, we're going to be
screwed in both your way and my way.  My way probably would be the most
likely to fail, since we could fail to do the copy_from_user, but
hopefully the segment checks and doing the fault_in_readable before all of
this would keep those problems to a minimum.

In your case the only failure point is in the allocate step.  If we fail
further down the line after we've done some hole filling, we'll be hard
pressed to go back and free up those blocks.  Is that what you are talking
about: having the allocate(UNRESERVE) thing be able to go back and figure
out which ranges should have stayed holes and make them holes again?  If
that's the case then I think your idea will work and is probably the best
way to move forward.  But from what I can remember about how all this
works with buffer heads, there's not a way to say "we _just_ allocated
this recently".  Thanks,

Josef
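For comparison, the stock 2.6.34 ->write_begin hands back exactly one
page per call.  Josef's variant would look something like the following
(hypothetical prototype; the extra pages/nr_pages parameters are the only
change, and the names are assumptions):

	/* The real 2.6.34 prototype, one page per call: */
	int (*write_begin)(struct file *, struct address_space *mapping,
			   loff_t pos, unsigned len, unsigned flags,
			   struct page **pagep, void **fsdata);

	/*
	 * Hypothetical multi-page variant: the filesystem grabs up to
	 * *nr_pages locked pages covering [pos, pos + len), e.g.
	 * min(pages needed, arbitrary cap), and returns in *nr_pages how
	 * many it actually pinned, so the caller can copy the whole user
	 * buffer before a single ->write_end.
	 */
	int (*write_begin_multi)(struct file *, struct address_space *mapping,
				 loff_t pos, size_t len, unsigned flags,
				 struct page **pages, unsigned *nr_pages,
				 void **fsdata);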
* Re: [RFC] new ->perform_write fop
From: Nick Piggin @ 2010-05-14 5:50 UTC
To: Josef Bacik
Cc: Dave Chinner, linux-fsdevel, chris.mason, hch, akpm, linux-kernel

On Thu, May 13, 2010 at 11:30:57PM -0400, Josef Bacik wrote:
> On Fri, May 14, 2010 at 11:00:42AM +1000, Dave Chinner wrote:
> > On Wed, May 12, 2010 at 09:39:27PM -0400, Josef Bacik wrote:
> > > On Wed, May 12, 2010 at 05:24:04PM -0400, Josef Bacik wrote:
> > > > Hello,
> > > >
> > > > I just started adding aio_write to Btrfs and I noticed we're
> > > > duplicating _a lot_ of the generic stuff in mm/filemap.c, even
> > > > though the only thing that's really unique is the fact that we copy
> > > > userspace pages in chunks rather than one page at a time.  What
> > > > would be best is instead of doing write_begin/write_end with Btrfs,
> > > > it would be nice if we could just do our own perform_write instead
> > > > of generic_perform_write.  This way we can drop all of these generic
> > > > checks that we copied from filemap.c and just get to the business of
> > > > actually writing the data.  I hate to add another file operation,
> > > > but it would _greatly_ reduce the amount of duplicate code we have.
> > > > If there is no violent objection to this I can put something
> > > > together quickly for review.  Thanks,
> > > >
> > >
> > > I just got a suggestion from hpa: instead, move what Btrfs does into
> > > generic_perform_write.  What btrfs does is allocate a chunk of pages
> > > to cover the entirety of the write, set everything up, do the copy
> > > from user into the pages, and tear everything down; essentially what
> > > generic_perform_write does, just with more pages.

Yeah, we basically decided that perform_write is not a good entry point.
BTW, can you wrap up this generic code into libfs so you don't have to
duplicate so much of it?

> >
> > Except that btrfs does things in a very different manner to most
> > other filesystems ;)
> >
> > > I could modify generic_perform_write and the write_begin/write_end
> > > aops to do this, where write_begin will return how many pages it
> > > allocated, we copy all of the user data into the pages we allocated
> > > at once, and then call write_end with the pages we allocated in
> > > write_begin.  Then I could just make btrfs do write_begin/write_end.
> > > So which option seems more palatable?  Thanks,
> >
> > I can see how this would work for btrfs, but the issue is how any
> > other filesystem would handle it.
> >
> > I've been trying to get my head around how any filesystem using
> > bufferheads and generic code can do multipage writes using
> > write_begin/write_end without modifying the interface, and I just
> > threw away my second attempt because the error handling just
> > couldn't be handled cleanly without duplicating the entire
> > block_write_begin path in each filesystem that wanted to do
> > multipage writes.
> >
> > The biggest problem is that block allocation is intermingled with
> > allocating and attaching bufferheads to pages, hence error handling
> > can get really nasty and is split across a call chain 3 or 4
> > functions deep.  The error handling is where I'm finding all the
> > dangerous and hard-to-kill demons lurking in dark corners.  I suspect
> > there's a grue in there somewhere, too. ;)
> >
> > Separating the page+bufferhead allocation and block allocation would
> > make this much simpler, but I can't fit that easily into the existing
> > interfaces.
> >
> > Hence I think that write_begin/copy pages/write_end is not really
> > suited to multipage writes when allocation is done in write_begin
> > and the write can then fail in a later stage without a simple method
> > of undoing the allocation.  We don't have any hole punch interfaces
> > to the filesystems (and I think only XFS supports that functionality
> > right now), so handling errors after allocation becomes rather
> > complex, especially when you have multiple blocks per page.
> >
> > Hence I've independently come to the conclusion that delaying the
> > allocation until *after* the copy, as btrfs does, is probably the
> > best approach to take here.  This largely avoids the error handling
> > complexity because the write operation is an all-or-nothing
> > operation.  btrfs has separate hooks for space reservation and
> > releasing the reservation and doesn't commit to actually allocating
> > anything until the copying is complete.  Hence cleanup is simple no
> > matter where a failure occurs.
> >
> > Personally, I'm tending towards killing the get_blocks() callback as
> > the first step in this process - turn it into a real inode/address
> > space operation (say ->allocate) so we can untangle the write path
> > somewhat (lots of filesystems just provide operations as wrappers
> > that feed a fs-specific get_blocks callback to generic code).  If the
> > "create" flag is then converted to a "command" field, the interface
> > can pass "RESERVE", "ALLOC", "CREATE", etc. to allow different
> > operations to be clearly handled.
> >
> > e.g.:
> >
> > 	->allocate(mapping, NULL, off, len, RESERVE)
> > 		reserves necessary space for write
> > 	->write_begin
> > 		grab pages into page cache
> > 		attach bufferheads (if required)
> > 		fail -> goto drop pages
> > 	copy data into pages
> > 		fail -> goto drop pages
> > 	->allocate(mapping, pages, off, len, ALLOC)
> > 		allocates reserved space (if required)
> > 		sets up/maps/updates bufferheads/extents
> > 		fail -> goto drop pages
> > 	->write_end
> > 		set pages dirty + uptodate
> > 	done
> >
> > drop_pages:
> > 	->allocate(mapping, NULL, off, len, UNRESERVE)
> > 	if needed, zero partial pages
> > 	release pages, clears uptodate.
> >
> > Basically this allows the copying of data before any allocation is
> > actually done, but also allows ENOSPC to be detected before starting
> > the copy.  The filesystem can call whatever helpers it needs inside
> > ->allocate(ALLOC) to set up bufferhead/extent state to match
> > what has been reserved/allocated/mapped in the RESERVE call.
> >
> > This will work for btrfs, and it will work for XFS, and I think it
> > will work for other filesystems that are using bufferheads as well.
> > Those filesystems that only support a page at a time can continue to
> > use the current code, but should be able to be converted to the
> > multipage code by making RESERVE and UNRESERVE no-ops, while ALLOC
> > does what write_begin+get_blocks currently does to set up block
> > mappings.
> >
> > Thoughts?
> >
> So this is what I had envisioned: we make write_begin take a nr_pages
> pointer and tell it how much data we have to write, then in the
> filesystem we allocate as many pages as we feel like, ideally something
> like
>
> 	min(number of pages we need for the write, some arbitrary limit
> 	    for security)
>
> and then we allocate all those pages.  Then you can pass them into
> block_write_begin, which will walk the pages, allocating buffer heads
> and allocating the space as needed.
>
> Now since we're coming into write_begin with "we want to write X bytes"
> we can go ahead and do the enospc checks for X bytes, and then if we are
> good to go, chances are we won't fail.
>
> Except if we're overwriting a holey section of the file, we're going to
> be screwed in both your way and my way.  My way probably would be the
> most likely to fail, since we could fail to do the copy_from_user, but
> hopefully the segment checks and doing the fault_in_readable before all
> of this would keep those problems to a minimum.
>
> In your case the only failure point is in the allocate step.  If we fail
> further down the line after we've done some hole filling, we'll be hard
> pressed to go back and free up those blocks.  Is that what you are
> talking about: having the allocate(UNRESERVE) thing be able to go back
> and figure out which ranges should have stayed holes and make them holes
> again?  If that's the case then I think your idea will work and is
> probably the best way to move forward.  But from what I can remember
> about how all this works with buffer heads, there's not a way to say
> "we _just_ allocated this recently".  Thanks,

Now is there really a good reason to go this way and add more to the
write_begin/write_end paths, rather than having filesystems just
implement their own write file_operations in order to do multi-block
operations?

From what I can see, the generic code is not going to be able to be
much help with error handling etc., so I would prefer to keep it as
simple as possible.  I think it is still adequate for most cases.

Take a look at how fuse does multi-page write operations.  It is about
the simplest case you can get, but it still requires all the generic
checks etc., and it is quite neat -- I don't see a big issue with
duplicating generic code?
* Re: [RFC] new ->perform_write fop
From: Dave Chinner @ 2010-05-14 7:20 UTC
To: Nick Piggin
Cc: Josef Bacik, linux-fsdevel, chris.mason, hch, akpm, linux-kernel

On Fri, May 14, 2010 at 03:50:54PM +1000, Nick Piggin wrote:
> On Thu, May 13, 2010 at 11:30:57PM -0400, Josef Bacik wrote:
> > On Fri, May 14, 2010 at 11:00:42AM +1000, Dave Chinner wrote:
> > > On Wed, May 12, 2010 at 09:39:27PM -0400, Josef Bacik wrote:
> > > > On Wed, May 12, 2010 at 05:24:04PM -0400, Josef Bacik wrote:
> > > > I could modify generic_perform_write and the write_begin/write_end
> > > > aops to do this, where write_begin will return how many pages it
> > > > allocated, we copy all of the user data into the pages we allocated
> > > > at once, and then call write_end with the pages we allocated in
> > > > write_begin.  Then I could just make btrfs do write_begin/write_end.
> > > > So which option seems more palatable?  Thanks,
> > >
> > > I can see how this would work for btrfs, but the issue is how any
> > > other filesystem would handle it.
> > >
> > > I've been trying to get my head around how any filesystem using
> > > bufferheads and generic code can do multipage writes using
> > > write_begin/write_end without modifying the interface, and I just
> > > threw away my second attempt because the error handling just
> > > couldn't be handled cleanly without duplicating the entire
> > > block_write_begin path in each filesystem that wanted to do
> > > multipage writes.
> > >
> > > The biggest problem is that block allocation is intermingled with
> > > allocating and attaching bufferheads to pages, hence error handling
> > > can get really nasty and is split across a call chain 3 or 4
> > > functions deep.  The error handling is where I'm finding all the
> > > dangerous and hard-to-kill demons lurking in dark corners.  I suspect
> > > there's a grue in there somewhere, too. ;)
> > >
> > > Separating the page+bufferhead allocation and block allocation would
> > > make this much simpler, but I can't fit that easily into the existing
> > > interfaces.
> > >
> > > Hence I think that write_begin/copy pages/write_end is not really
> > > suited to multipage writes when allocation is done in write_begin
> > > and the write can then fail in a later stage without a simple method
> > > of undoing the allocation.  We don't have any hole punch interfaces
> > > to the filesystems (and I think only XFS supports that functionality
> > > right now), so handling errors after allocation becomes rather
> > > complex, especially when you have multiple blocks per page.
> > >
> > > Hence I've independently come to the conclusion that delaying the
> > > allocation until *after* the copy, as btrfs does, is probably the
> > > best approach to take here.  This largely avoids the error handling
> > > complexity because the write operation is an all-or-nothing
> > > operation.  btrfs has separate hooks for space reservation and
> > > releasing the reservation and doesn't commit to actually allocating
> > > anything until the copying is complete.  Hence cleanup is simple no
> > > matter where a failure occurs.
> > >
> > > Personally, I'm tending towards killing the get_blocks() callback as
> > > the first step in this process - turn it into a real inode/address
> > > space operation (say ->allocate) so we can untangle the write path
> > > somewhat (lots of filesystems just provide operations as wrappers
> > > that feed a fs-specific get_blocks callback to generic code).  If the
> > > "create" flag is then converted to a "command" field, the interface
> > > can pass "RESERVE", "ALLOC", "CREATE", etc. to allow different
> > > operations to be clearly handled.
> > >
> > > e.g.:
> > >
> > > 	->allocate(mapping, NULL, off, len, RESERVE)
> > > 		reserves necessary space for write
> > > 	->write_begin
> > > 		grab pages into page cache
> > > 		attach bufferheads (if required)
> > > 		fail -> goto drop pages
> > > 	copy data into pages
> > > 		fail -> goto drop pages
> > > 	->allocate(mapping, pages, off, len, ALLOC)
> > > 		allocates reserved space (if required)
> > > 		sets up/maps/updates bufferheads/extents
> > > 		fail -> goto drop pages
> > > 	->write_end
> > > 		set pages dirty + uptodate
> > > 	done
> > >
> > > drop_pages:
> > > 	->allocate(mapping, NULL, off, len, UNRESERVE)
> > > 	if needed, zero partial pages
> > > 	release pages, clears uptodate.
> > >
> > > Basically this allows the copying of data before any allocation is
> > > actually done, but also allows ENOSPC to be detected before starting
> > > the copy.  The filesystem can call whatever helpers it needs inside
> > > ->allocate(ALLOC) to set up bufferhead/extent state to match
> > > what has been reserved/allocated/mapped in the RESERVE call.
> > >
> > > This will work for btrfs, and it will work for XFS, and I think it
> > > will work for other filesystems that are using bufferheads as well.
> > > Those filesystems that only support a page at a time can continue to
> > > use the current code, but should be able to be converted to the
> > > multipage code by making RESERVE and UNRESERVE no-ops, while ALLOC
> > > does what write_begin+get_blocks currently does to set up block
> > > mappings.
> > >
> > > Thoughts?
> > >
> > So this is what I had envisioned: we make write_begin take a nr_pages
> > pointer and tell it how much data we have to write, then in the
> > filesystem we allocate as many pages as we feel like, ideally something
> > like
> >
> > 	min(number of pages we need for the write, some arbitrary limit
> > 	    for security)
> >
> > and then we allocate all those pages.  Then you can pass them into
> > block_write_begin, which will walk the pages, allocating buffer heads
> > and allocating the space as needed.
> >
> > Now since we're coming into write_begin with "we want to write X bytes"
> > we can go ahead and do the enospc checks for X bytes, and then if we
> > are good to go, chances are we won't fail.
> >
> > Except if we're overwriting a holey section of the file, we're going to
> > be screwed in both your way and my way.  My way probably would be the
> > most likely to fail, since we could fail to do the copy_from_user, but
> > hopefully the segment checks and doing the fault_in_readable before all
> > of this would keep those problems to a minimum.
> >
> > In your case the only failure point is in the allocate step.  If we
> > fail further down the line after we've done some hole filling, we'll be
> > hard pressed to go back and free up those blocks.  Is that what you are
> > talking about: having the allocate(UNRESERVE) thing be able to go back
> > and figure out which ranges should have stayed holes and make them
> > holes again?  If that's the case then I think your idea will work and
> > is probably the best way to move forward.  But from what I can remember
> > about how all this works with buffer heads, there's not a way to say
> > "we _just_ allocated this recently".  Thanks,
>
> Now is there really a good reason to go this way and add more to the
> write_begin/write_end paths?  Rather than having filesystems just
> implement their own write file_operations in order to do multi-block
> operations?

Well, if we've got xfs, btrfs, gfs2, ext4, and others all wanting to
do multipage writes, shouldn't we be trying to do it in a generic way?
Fuse doesn't have to deal with allocation of blocks in
fuse_perform_write().

> From what I can see, the generic code is not going to be able to be
> much help with error handling etc., so I would prefer to keep it as
> simple as possible.  I think it is still adequate for most cases.
>
> Take a look at how fuse does multi-page write operations.  It is about
> the simplest case you can get, but it still requires all the generic
> checks etc.

fuse_perform_write() doesn't do allocation, and so can easily abort
at the first error and just complete the writes that did succeed.
Hence I don't think it's a model that a filesystem that has to
handle space allocation can use.

> and it is quite neat -- I don't see a big issue with
> duplicating generic code?

When a large number of filesystems end up duplicating the same code,
then we should be looking at how to implement that functionality
generically, right?

Cheers,

Dave.
--
Dave Chinner
david@fromorbit.com
* Re: [RFC] new ->perform_write fop
From: Nick Piggin @ 2010-05-14 7:33 UTC
To: Dave Chinner
Cc: Josef Bacik, linux-fsdevel, chris.mason, hch, akpm, linux-kernel

On Fri, May 14, 2010 at 05:20:55PM +1000, Dave Chinner wrote:
> On Fri, May 14, 2010 at 03:50:54PM +1000, Nick Piggin wrote:
> > Now is there really a good reason to go this way and add more to the
> > write_begin/write_end paths?  Rather than having filesystems just
> > implement their own write file_operations in order to do multi-block
> > operations?
>
> Well, if we've got xfs, btrfs, gfs2, ext4, and others all wanting to
> do multipage writes, shouldn't we be trying to do it in a generic way?

If it makes sense, definitely.

> Fuse doesn't have to deal with allocation of blocks in
> fuse_perform_write().

I just can't see how the generic code can really help out with the
problem of error handling in the various parts of the allocation
operation.

> > From what I can see, the generic code is not going to be able to be
> > much help with error handling etc., so I would prefer to keep it as
> > simple as possible.  I think it is still adequate for most cases.
> >
> > Take a look at how fuse does multi-page write operations.  It is about
> > the simplest case you can get, but it still requires all the generic
> > checks etc.
>
> fuse_perform_write() doesn't do allocation, and so can easily abort
> at the first error and just complete the writes that did succeed.
> Hence I don't think it's a model that a filesystem that has to
> handle space allocation can use.

No, but it does all the _generic_ vfs checks required, which sounded
like what the btrfs folk were concerned about duplicating.  My point
was just that there isn't very much duplication really.

> > and it is quite neat -- I don't see a big issue with
> > duplicating generic code?
>
> When a large number of filesystems end up duplicating the same code,
> then we should be looking at how to implement that functionality
> generically, right?

Yes, if it captures a good chunk of common code without unduly
complicating things.  I'll be interested to see if that can be made
the case.
* Re: [RFC] new ->perform_write fop
From: Dave Chinner @ 2010-05-14 6:41 UTC
To: Josef Bacik; +Cc: linux-fsdevel, chris.mason, hch, akpm, npiggin, linux-kernel

On Thu, May 13, 2010 at 11:30:57PM -0400, Josef Bacik wrote:
> On Fri, May 14, 2010 at 11:00:42AM +1000, Dave Chinner wrote:
> > [snip: the ->allocate(RESERVE/ALLOC/UNRESERVE) proposal quoted in
> > full upthread]
> >
> > Thoughts?
> >
> So this is what I had envisioned: we make write_begin take a nr_pages
> pointer and tell it how much data we have to write, then in the
> filesystem we allocate as many pages as we feel like, ideally something
> like
>
> 	min(number of pages we need for the write, some arbitrary limit
> 	    for security)

Actually, I was thinking that the RESERVE call determines the size
of the chunk (in the order of 1-4MB maximum).  IOWs, we pass in the
start offset of the write, the entire length remaining, and the
RESERVE call determines how much it will allow in one loop:

	written = 0;
	while (bytes_remaining > 0) {
		chunklen = ->allocate(off, bytes_remaining, RESERVE);
		write_begin(&pages, off, chunklen);
		copied = copy_pages(&pages, iov_iter, chunklen);
		.....
		bytes_remaining -= copied;
		off += copied;
		written += copied;
	}

> and then we allocate all those pages.  Then you can pass them into
> block_write_begin, which will walk the pages, allocating buffer heads
> and allocating the space as needed.

Close - except that I'd like to move the block allocation out of
->write_begin until after the copy into the pages has completed.
Hence write_begin only deals with populating the page cache and
anything filesystem-specific like locking or allocating/populating
per-page structures like bufferheads.  i.e. ->write_begin does not
do anything that cannot be undone easily if the copy fails.

> Now since we're coming into write_begin with "we want to write X bytes"
> we can go ahead and do the enospc checks for X bytes, and then if we are
> good to go, chances are we won't fail.

Well, the "RESERVE" callout I proposed should mean that the
allocation that follows the copy should never fail with ENOSPC.
i.e. if there isn't space, the RESERVE call will fail with ENOSPC,
not the actual ALLOCATE call that occurs after the data is copied.

Basically I was thinking that in the XFS case, RESERVE would scan the
range for holes and reserve all blocks needed to fill all the holes
in the range.  Then the ALLOCATE call would mark them all delalloc.
The space reservation and delalloc is done in a single call right
now, but splitting them is not hard...

> Except if we're overwriting a holey section of the file, we're going to
> be screwed in both your way and my way.  My way probably would be the
> most likely to fail, since we could fail to do the copy_from_user, but
> hopefully the segment checks and doing the fault_in_readable before all
> of this would keep those problems to a minimum.

fault_in_readable() only faults the first page of the first iov
passed in.  It can still fail on any other page the iov points to.
And even then, fault_in_readable() does not pin the page it faulted
in memory, so even that can get recycled before we start the copy
and we'd still get EFAULT.

> In your case the only failure point is in the allocate step.  If we fail
> further down the line after we've done some hole filling, we'll be hard
> pressed to go back and free up those blocks.  Is that what you are
> talking about: having the allocate(UNRESERVE) thing be able to go back
> and figure out which ranges should have stayed holes and make them holes
> again?

Yeah, that's kind of what I was thinking of.  After a bit more
thought: if the allocation only covers part of the range (for
whatever reason), we can still allow that allocated part of the
write to complete successfully, and the UNRESERVE call simply
returns the unused part of the space reservation.  This would be a
partial write, and much simpler to deal with than punching out the
allocations that succeeded.

> If that's the case then I think your idea will work and is probably the
> best way to move forward.  But from what I can remember about how all
> this works with buffer heads, there's not a way to say "we _just_
> allocated this recently".  Thanks,

buffer_new() would suffice, I think.  In __block_prepare_write(),
when a new buffer is mapped, that flag is cleared.  If the buffer is
not mapped, we call get_blocks, and any buffer it allocates will be
marked buffer_new() again.  This is used to trigger special things
in __block_prepare_write(), and the flag is not cleared until
write_end (in __block_commit_write).  Hence a buffer with the new
flag set after allocation is a buffer that had space allocated...

Cheers,

Dave.
--
Dave Chinner
david@fromorbit.com
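Fleshed out into compilable C, the loop Dave sketches might look like the
following.  It is illustrative only: fs_allocate(), fs_write_begin(),
fs_write_end() and copy_pages(), plus the RESERVE/ALLOC/UNRESERVE
commands, are the hypothetical interfaces from earlier in the thread, not
existing kernel calls; only iov_iter_count() is a real 2.6.34 helper.

	#define MP_CHUNK_PAGES	256	/* hypothetical 1MB cap, 4k pages */

	static ssize_t multipage_perform_write(struct address_space *mapping,
					       struct iov_iter *i, loff_t pos)
	{
		ssize_t written = 0;

		while (iov_iter_count(i)) {
			struct page *pages[MP_CHUNK_PAGES];
			ssize_t chunklen, copied;

			/* reserve first: ENOSPC fires before any copying */
			chunklen = fs_allocate(mapping, NULL, pos,
					       iov_iter_count(i), RESERVE);
			if (chunklen < 0)
				return written ? written : chunklen;

			/* grab + lock pages; no block allocation yet */
			fs_write_begin(mapping, pos, chunklen, pages);

			copied = copy_pages(pages, i, chunklen);
			if (copied < chunklen)
				/* short copy: hand back the unused space */
				fs_allocate(mapping, NULL, pos + copied,
					    chunklen - copied, UNRESERVE);
			if (!copied)
				break;	/* EFAULT mid-iov: partial write */

			/* commit: turn the reservation into delalloc */
			fs_allocate(mapping, pages, pos, copied, ALLOC);
			fs_write_end(mapping, pos, copied, pages);

			pos += copied;
			written += copied;
		}
		return written;
	}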
* Re: [RFC] new ->perform_write fop
From: Nick Piggin @ 2010-05-14 7:22 UTC
To: Dave Chinner
Cc: Josef Bacik, linux-fsdevel, chris.mason, hch, akpm, linux-kernel

On Fri, May 14, 2010 at 04:41:45PM +1000, Dave Chinner wrote:
> On Thu, May 13, 2010 at 11:30:57PM -0400, Josef Bacik wrote:
> > So this is what I had envisioned: we make write_begin take a nr_pages
> > pointer and tell it how much data we have to write, then in the
> > filesystem we allocate as many pages as we feel like, ideally something
> > like
> >
> > 	min(number of pages we need for the write, some arbitrary limit
> > 	    for security)
>
> Actually, I was thinking that the RESERVE call determines the size
> of the chunk (in the order of 1-4MB maximum).  IOWs, we pass in the
> start offset of the write, the entire length remaining, and the
> RESERVE call determines how much it will allow in one loop:
>
> 	written = 0;
> 	while (bytes_remaining > 0) {
> 		chunklen = ->allocate(off, bytes_remaining, RESERVE);
> 		write_begin(&pages, off, chunklen);
> 		copied = copy_pages(&pages, iov_iter, chunklen);
> 		.....
> 		bytes_remaining -= copied;
> 		off += copied;
> 		written += copied;
> 	}

How much benefit are you expecting to get?  I have no problems with
non-pagecache block allocation/freeing APIs to get the most out of
amortising expensive locking or tree manipulations, but I don't know
if it is a good idea to pin a big chunk of pagecache like that.  I'd
say just do the per-page manipulations like we currently do.

> > and then we allocate all those pages.  Then you can pass them into
> > block_write_begin, which will walk the pages, allocating buffer heads
> > and allocating the space as needed.
>
> Close - except that I'd like to move the block allocation out of
> ->write_begin until after the copy into the pages has completed.
> Hence write_begin only deals with populating the page cache and
> anything filesystem-specific like locking or allocating/populating
> per-page structures like bufferheads.  i.e. ->write_begin does not
> do anything that cannot be undone easily if the copy fails.

Hmm, not so sure.  You run into other problems.

Write to a hole?  Then you have to clear page uptodate and unmap any
potential mmaps on it before the copy to pagecache, in case the
allocation fails (otherwise you have exposed inconsistent state to
userspace).

Write to a hole, then to a non-hole?  Then your write to the non-hole
pagecache cannot be recovered in case the block allocation for the
hole fails and we need to return a short write.

> > Now since we're coming into write_begin with "we want to write X bytes"
> > we can go ahead and do the enospc checks for X bytes, and then if we
> > are good to go, chances are we won't fail.
>
> Well, the "RESERVE" callout I proposed should mean that the
> allocation that follows the copy should never fail with ENOSPC.
> i.e. if there isn't space, the RESERVE call will fail with ENOSPC,
> not the actual ALLOCATE call that occurs after the data is copied.
>
> Basically I was thinking that in the XFS case, RESERVE would scan the
> range for holes and reserve all blocks needed to fill all the holes
> in the range.  Then the ALLOCATE call would mark them all delalloc.
> The space reservation and delalloc is done in a single call right
> now, but splitting them is not hard...
>
> > Except if we're overwriting a holey section of the file, we're going to
> > be screwed in both your way and my way.  My way probably would be the
> > most likely to fail, since we could fail to do the copy_from_user, but
> > hopefully the segment checks and doing the fault_in_readable before all
> > of this would keep those problems to a minimum.
>
> fault_in_readable() only faults the first page of the first iov
> passed in.  It can still fail on any other page the iov points to.
> And even then, fault_in_readable() does not pin the page it faulted
> in memory, so even that can get recycled before we start the copy
> and we'd still get EFAULT.
>
> > In your case the only failure point is in the allocate step.  If we
> > fail further down the line after we've done some hole filling, we'll be
> > hard pressed to go back and free up those blocks.  Is that what you are
> > talking about: having the allocate(UNRESERVE) thing be able to go back
> > and figure out which ranges should have stayed holes and make them
> > holes again?
>
> Yeah, that's kind of what I was thinking of.  After a bit more
> thought: if the allocation only covers part of the range (for
> whatever reason), we can still allow that allocated part of the
> write to complete successfully, and the UNRESERVE call simply
> returns the unused part of the space reservation.  This would be a
> partial write, and much simpler to deal with than punching out the
> allocations that succeeded.

Do we really need to restore holes inside i_size in error cases?
Current code does not.  It just leaves the blocks allocated and
zeroes them.
* Re: [RFC] new ->perform_write fop 2010-05-14 7:22 ` Nick Piggin @ 2010-05-14 8:38 ` Dave Chinner 2010-05-14 13:33 ` Chris Mason 2010-05-18 6:36 ` Nick Piggin 2010-05-21 15:15 ` Christoph Hellwig 1 sibling, 2 replies; 44+ messages in thread From: Dave Chinner @ 2010-05-14 8:38 UTC (permalink / raw) To: Nick Piggin Cc: Josef Bacik, linux-fsdevel, chris.mason, hch, akpm, linux-kernel On Fri, May 14, 2010 at 05:22:19PM +1000, Nick Piggin wrote: > On Fri, May 14, 2010 at 04:41:45PM +1000, Dave Chinner wrote: > > On Thu, May 13, 2010 at 11:30:57PM -0400, Josef Bacik wrote: > > > So this is what I had envisioned, we make write_begin take a nr_pages pointer > > > and tell it how much data we have to write, then in the filesystem we allocate > > > as many pages as we feel like, idealy something like > > > > > > min(number of pages we need for the write, some arbitrary limit for security) > > > > Actually, i was thinking that the RESERVE call determines the size > > of the chunk (in the order of 1-4MB maximum). IOWs, we pass in the > > start offset of the write, the entire length remaining, and the > > RESERVE call determines how much it will allow in one loop. > > > > written = 0; > > while (bytes_remaining > 0) { > > chunklen = ->allocate(off, bytes_remaining, RESERVE); > > write_begin(&pages, off, chunklen); > > copied = copy_pages(&pages, iov_iter, chunklen); > > ..... > > bytes_remaining -= copied; > > off += copied; > > written += copied; > > } > > How much benefit are you expecting to get? If the max chunk size is 4MB, then three orders of magnitudes fewer allocation calls for x86_64 (i.e. one instead of 1024). For filesystems with significant allocation overhead (like gaining cluster locks in gfs2), this will be a *massive* win. > I have no problems with > non-pagecache block allocation/freeing APIs to get the most out of > amortising expensive locking or tree manipulations, but I don't > know if it is a good idea to pin a big chunk of pagecache like that. > I'd say just do the per-page manipulations like we currently do. Can you expand on this? > > > and then we allocate all those pages. Then you can pass them into > > > block_write_begin, which will walk the pages, allocating buffer heads and > > > allocating the space as needed. > > > > Close - except that I'd like like to move the block allocation out of > > ->write_begin until after the copy into the pages has completed. > > Hence write_begin only deals with populating the page cache and > > anything filesystem specific like locking or allocating/populating > > per-page structures like bufferheads. i.e. ->write_begin does not > > do anything that cannot be undone easily if the copy fails. > > Hmm, not so sure. You run into other problems. > > Write to a hole? Then you have to clear page uptodate and unmap > any potential mmaps on it before the copy to pagecache, in case > the allocation fails (otherwise you have exposed inconsistent > state to userspace). Good call. I think that checking page_mapped() on each page as they are grabbed and locked, and if any are mapped call unmap_mapping_range() on them will work. New faults will serialise on the page locks and because we hold the page lock across the entire operation, I don't think we need to clear the uptodate flag unless we fail the write. > Write to a hole, then to a non-hole? Then your write to the non > hole pagecache cannot be recovered in case the block allocation > for the hole fails and we need to return a short write. 
The pages covering non-holes are locked, so nobody else can access them. We could simply invalidate them so the next access would force a re-read from disk. The problem I see with this, though, is multiple partial overwrites of the allocated block - we could lose changes invalidating it. I think the simplest way to deal with it is to ensure that ->allocate(RESERVE) is not allowed to return ranges that span allocated and unallocated regions. i.e. it can return an allocated region or a hole, but not a region that combines both. The worst case is that we fall back to current behaviour (an allocation call per block), but for the case we're trying to optimise (large writes into large holes) we'd still see the benefit of fewer allocation calls. FWIW, I think this simplifies error handling as well (see below). > > > Now since we're coming into write_begin with "we want to write X bytes" we can > > > go ahead and do the enospc checks for X bytes, and then if we are good to go, > > > chances are we won't fail. > > > > Well, the "RESERVE" callout I proposed should mean that the > > allocation that follows the copy should never fail with ENOSPC. i.e. > > if there isn't space, the RESERVE call will fail with ENOSPC, not > > the actual ALLOCATE call that occurs after the data is copied. > > > > Basically I was thinking that in the XFS case, RESERVE would scan the > > range for holes, and reserve all blocks needed to fill all the holes > > in the range. Then the ALLOCATE call would mark them all delalloc. > > The space reservation and delalloc is done in a single call right > > now, but splitting them is not hard... > > > > > Except if we're overwriting a holey section of the file, we're going to be > > > screwed in both your way and my way. My way probably would be the most likely > > > to fail, since we could fail to do the copy_from_user, but hopefully the segment > > > checks and doing the fault_in_readable before all of this would keep those > > > problems to a minimum. > > > > fault_in_readable() only faults the first page of the first iov > > passed in. It can still fail on any other page the iov points to. > > And even then, fault_in_readable() does not pin the page it faulted > > in memory, so even that can get recycled before we start the copy > > and we'd still get EFAULT. > > > > > In your case the only failure point is in the allocate step. If we fail on down > > > the line after we've done some hole filling, we'll be hard pressed to go back > > > and free up those blocks. Is that what you are talking about, having the > > > allocate(UNRESERVE) thing being able to go back and figure out what should have > > > been holes and make them holes again? > > > > Yeah, that's kind of what I was thinking of. After a bit more > > thought, if the allocation only covers part of the range (for > > whatever reason), we can still allow that allocated part of the > > write to complete successfully, and the UNRESERVE call simply returns > > the unused part of the space reservation. This would be a partial > > write and be much simpler to deal with than punching out the > > allocations that succeeded. > > Do we really need to restore holes inside i_size in error cases? No, we don't. That's what I was trying to say - holes that were allocated are full of good data, so we only need to deal with the holes that didn't get allocated. The range of the write that failed is only stale in memory now, so we can just invalidate pages in the range that is considered failed. 
If we limit RESERVE to return ranges of a single type of block, then this whole cleanup problem will also go away - it's back to a simple succeed or fail-and-rollback state per execution of the loop, and partial writes are those that have made at least one successful pass of the loop. > Current code does not. It just leaves the blocks allocated and > zeroes them. Yup, error handling is easy when you're only dealing with one page at a time. ;) Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 44+ messages in thread
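To make these rollback semantics concrete, here is a hedged sketch of the loop with the single-extent-type constraint folded in. Everything in it - the per-fs allocate hook, the RESERVE/ALLOCATE/UNRESERVE commands, copy_pages() - is the hypothetical interface being designed in this thread, not an existing kernel API:

#include <linux/fs.h>

/* Hypothetical interface from this thread - names are illustrative. */
enum alloc_cmd { RESERVE, ALLOCATE, UNRESERVE };

ssize_t fs_allocate(struct inode *inode, loff_t off, size_t len,
		    enum alloc_cmd cmd);		/* per-fs hook */
ssize_t copy_pages(struct page **pages, struct iov_iter *i,
		   loff_t off, size_t len);		/* copy helper */

static ssize_t multipage_write(struct inode *inode, struct page **pages,
			       struct iov_iter *i, loff_t off)
{
	ssize_t written = 0;
	size_t bytes_remaining = iov_iter_count(i);

	while (bytes_remaining > 0) {
		/*
		 * RESERVE clamps the chunk (1-4MB max) and, per the
		 * constraint above, never returns a range that mixes
		 * allocated blocks and holes.
		 */
		ssize_t chunklen = fs_allocate(inode, off, bytes_remaining,
					       RESERVE);
		ssize_t copied;

		if (chunklen < 0)
			return written ? written : chunklen;

		/* grab/lock pages, copy, then make the reservation real */
		copied = copy_pages(pages, i, off, chunklen);
		if (copied > 0)
			fs_allocate(inode, off, copied, ALLOCATE);
		if (copied < chunklen) {
			/* per-chunk rollback: hand back unused space */
			fs_allocate(inode, off + copied,
				    chunklen - copied, UNRESERVE);
			return written + (copied > 0 ? copied : 0);
		}
		bytes_remaining -= copied;
		off += copied;
		written += copied;
	}
	return written;
}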
* Re: [RFC] new ->perform_write fop 2010-05-14 8:38 ` Dave Chinner @ 2010-05-14 13:33 ` Chris Mason 2010-05-18 6:36 ` Nick Piggin 1 sibling, 0 replies; 44+ messages in thread From: Chris Mason @ 2010-05-14 13:33 UTC (permalink / raw) To: Dave Chinner Cc: Nick Piggin, Josef Bacik, linux-fsdevel, hch, akpm, linux-kernel On Fri, May 14, 2010 at 06:38:21PM +1000, Dave Chinner wrote: > On Fri, May 14, 2010 at 05:22:19PM +1000, Nick Piggin wrote: > > On Fri, May 14, 2010 at 04:41:45PM +1000, Dave Chinner wrote: > > > On Thu, May 13, 2010 at 11:30:57PM -0400, Josef Bacik wrote: > > > > So this is what I had envisioned, we make write_begin take a nr_pages pointer > > > > and tell it how much data we have to write, then in the filesystem we allocate > > > > as many pages as we feel like, ideally something like > > > > > > > > min(number of pages we need for the write, some arbitrary limit for security) > > > > > > Actually, I was thinking that the RESERVE call determines the size > > > of the chunk (in the order of 1-4MB maximum). IOWs, we pass in the > > > start offset of the write, the entire length remaining, and the > > > RESERVE call determines how much it will allow in one loop. > > > > > > written = 0; > > > while (bytes_remaining > 0) { > > > chunklen = ->allocate(off, bytes_remaining, RESERVE); > > > write_begin(&pages, off, chunklen); > > > copied = copy_pages(&pages, iov_iter, chunklen); > > > ..... > > > bytes_remaining -= copied; > > > off += copied; > > > written += copied; > > > } > > > > How much benefit are you expecting to get? > > If the max chunk size is 4MB, then three orders of magnitude fewer > allocation calls for x86_64 (i.e. one instead of 1024). For > filesystems with significant allocation overhead (like gaining > cluster locks in gfs2), this will be a *massive* win. It's a pretty big deal in btrfs too. A 4K write is much less expensive than it used to be, but the part where we mark a range of bytes as delayed allocation goes faster if that range is bigger. -chris ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [RFC] new ->perform_write fop 2010-05-14 8:38 ` Dave Chinner 2010-05-14 13:33 ` Chris Mason @ 2010-05-18 6:36 ` Nick Piggin 2010-05-18 8:05 ` Dave Chinner 1 sibling, 1 reply; 44+ messages in thread From: Nick Piggin @ 2010-05-18 6:36 UTC (permalink / raw) To: Dave Chinner Cc: Josef Bacik, linux-fsdevel, chris.mason, hch, akpm, linux-kernel On Fri, May 14, 2010 at 06:38:21PM +1000, Dave Chinner wrote: > On Fri, May 14, 2010 at 05:22:19PM +1000, Nick Piggin wrote: > > On Fri, May 14, 2010 at 04:41:45PM +1000, Dave Chinner wrote: > > > On Thu, May 13, 2010 at 11:30:57PM -0400, Josef Bacik wrote: > > > > So this is what I had envisioned, we make write_begin take a nr_pages pointer > > > > and tell it how much data we have to write, then in the filesystem we allocate > > > > as many pages as we feel like, ideally something like > > > > > > > > min(number of pages we need for the write, some arbitrary limit for security) > > > > > > Actually, I was thinking that the RESERVE call determines the size > > > of the chunk (in the order of 1-4MB maximum). IOWs, we pass in the > > > start offset of the write, the entire length remaining, and the > > > RESERVE call determines how much it will allow in one loop. > > > > > > written = 0; > > > while (bytes_remaining > 0) { > > > chunklen = ->allocate(off, bytes_remaining, RESERVE); > > > write_begin(&pages, off, chunklen); > > > copied = copy_pages(&pages, iov_iter, chunklen); > > > ..... > > > bytes_remaining -= copied; > > > off += copied; > > > written += copied; > > > } > > > > How much benefit are you expecting to get? > > If the max chunk size is 4MB, then three orders of magnitude fewer > allocation calls for x86_64 (i.e. one instead of 1024). For > filesystems with significant allocation overhead (like gaining > cluster locks in gfs2), this will be a *massive* win. > > > I have no problems with > > non-pagecache block allocation/freeing APIs to get the most out of > > amortising expensive locking or tree manipulations, but I don't > > know if it is a good idea to pin a big chunk of pagecache like that. > > I'd say just do the per-page manipulations like we currently do. > > Can you expand on this? Well you could do a large span block allocation at the beginning, and then dirty the pagecache one by one like we do right now. The only reason to do operations on multiple pages at once is if we need to lock them all. Now the fs might well have that requirement (if it is not using i_mutex for block (de)allocation serialisation), but I don't think generic code needs to be doing that. > > > > and then we allocate all those pages. Then you can pass them into > > > > block_write_begin, which will walk the pages, allocating buffer heads and > > > > allocating the space as needed. > > > > > > Close - except that I'd like to move the block allocation out of > > > ->write_begin until after the copy into the pages has completed. > > > Hence write_begin only deals with populating the page cache and > > > anything filesystem specific like locking or allocating/populating > > > per-page structures like bufferheads. i.e. ->write_begin does not > > > do anything that cannot be undone easily if the copy fails. > > > > Hmm, not so sure. You run into other problems. > > > > Write to a hole? Then you have to clear page uptodate and unmap > > any potential mmaps on it before the copy to pagecache, in case > > the allocation fails (otherwise you have exposed inconsistent > > state to userspace). > > Good call. 
I think that checking page_mapped() on each page as they > are grabbed and locked, and if any are mapped call > unmap_mapping_range() on them will work. New faults will serialise > on the page locks and because we hold the page lock across the > entire operation, That is supposed to be enough to prevent concurrent faults setting up the mapping again (which is how the fault vs truncate race was fixed). So yes that would be fine (so long as the fs DTRT in its fault API of course). > I don't think we need to clear the uptodate flag > unless we fail the write. read(2) is performed without page lock though, so it could equally expose transient data that way. > > Write to a hole, then to a non-hole? Then your write to the non > > hole pagecache cannot be recovered in case the block allocation > > for the hole fails and we need to return a short write. > > The pages covering non-holes are locked, so nobody else can access > them. We could simply invalidate them so the next access would force > a re-read from disk. The problem I see with this, though, is > multiple partial overwrites of the allocated block - we could lose > changes invalidating it. > > I think the simplest way to deal with it is to ensure that > ->allocate(RESERVE) is not allowed to return ranges that span > allocated and unallocated regions. i.e. it can return an allocated > region or a hole, but not a region that combines both. The worst > case is that we fall back to current behaviour (an allocation call > per block), but for the case we're trying to optimise (large writes > into large holes) we'd still see the benefit of fewer allocation > calls. FWIW, I think this simplifies error handling as well (see > below). Possibly is the right way to go. But why is error handling of block allocation the hard part? Probably I look at it from the other side of the equation, but publishing modifications to the pagecache seems much harder to recover from than block allocations. I don't see what changes when you go to multi-page allocation spans. Can you explain? > > > > Now since we're coming into write_begin with "we want to write X bytes" we can > > > > go ahead and do the enospc checks for X bytes, and then if we are good to go, > > > > chances are we won't fail. > > > > > > Well, the "RESERVE" callout I proposed should mean that the > > > allocation that follows the copy should never fail with ENOSPC. i.e. > > > if there isn't space, the RESERVE call will fail with ENOSPC, not > > > the actual ALLOCATE call that occurs after the data is copied. > > > > > > Basically I was thinking that in the XFS case, RESERVE would scan the > > > range for holes, and reserve all blocks needed to fill all the holes > > > in the range. Then the ALLOCATE call would mark them all delalloc. > > > The space reservation and delalloc is done in a single call right > > > now, but splitting them is not hard... > > > > > > > Except if we're overwriting a holey section of the file, we're going to be > > > > screwed in both your way and my way. My way probably would be the most likely > > > > to fail, since we could fail to do the copy_from_user, but hopefully the segment > > > > checks and doing the fault_in_readable before all of this would keep those > > > > problems to a minimum. > > > > > > fault_in_readable() only faults the first page of the first iov > > > passed in. It can still fail on any other page the iov points to. 
> > > And even then, fault_in_readable() does not pin the page it faulted > > > in memory, so even that can get recycled before we start the copy > > > and we'd still get EFAULT. > > > > > > > In your case the only failure point is in the allocate step. If we fail on down > > > > the line after we've done some hole filling, we'll be hard pressed to go back > > > > and free up those blocks. Is that what you are talking about, having the > > > > allocate(UNRESERVE) thing being able to go back and figure out what should have > > > > been holes and make them holes again? > > > > > > Yeah, that's kind of what I was thinking of. After a bit more > > > thought, if the allocation only covers part of the range (for > > > whatever reason), we can still allow that allocated part of the > > > write to complete successfully, and the UNRESERVE call simply returns > > > the unused part of the space reservation. This would be a partial > > > write and be much simpler to deal with than punching out the > > > allocations that succeeded. > > > > Do we really need to restore holes inside i_size in error cases? > > > > No, we don't. That's what I was trying to say - holes that were > > OK good. > > > allocated are full of good data, so we only need to deal with the > > holes that didn't get allocated. The range of the write that failed > > is only stale in memory now, so we can just invalidate pages in the > > range that is considered failed. > > Well invalidate isn't a good way to do this IMO. It is very hairy to put data tentatively into PageUptodate pagecache and then try to recover afterwards. Basically, once pagecache is marked uptodate, I don't think we should ever put maybe-invalid data into it -- the way to do it is to invalidate that page and put a *new* page in there. Why? Because user mappings are just one problem, but once you had a user mapping, you can have been subject to get_user_pages, so it could be in the middle of a DMA operation or something. > If we limit RESERVE to return ranges of a single type of block, then > this whole cleanup problem will also go away - it's back to a simple > succeed or fail-and-rollback state per execution of the loop, and > partial writes are those that have made at least one successful pass > of the loop. > > > Current code does not. It just leaves the blocks allocated and > > zeroes them. > > Yup, error handling is easy when you're only dealing with one page > at a time. ;) Humour me; why? :) ^ permalink raw reply [flat|nested] 44+ messages in thread
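The page_mapped()/unmap_mapping_range() idea from earlier in this subthread is small enough to sketch. Illustrative only - it assumes the multipage path grabs and locks every page in the range up front, which is exactly the design point under debate here:

#include <linux/mm.h>
#include <linux/pagemap.h>

/*
 * Grab and lock one page of the write range, shooting down any
 * existing user mappings so a failed copy can't leave inconsistent
 * data visible through mmap. New faults serialise on the page lock,
 * which the caller keeps holding for the whole operation.
 */
static int grab_and_unmap(struct address_space *mapping, pgoff_t index,
			  struct page **pagep)
{
	struct page *page;

	page = grab_cache_page_write_begin(mapping, index, 0);
	if (!page)
		return -ENOMEM;

	if (page_mapped(page))
		unmap_mapping_range(mapping,
				    (loff_t)index << PAGE_CACHE_SHIFT,
				    PAGE_CACHE_SIZE, 1);
	*pagep = page;
	return 0;
}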
* Re: [RFC] new ->perform_write fop 2010-05-18 6:36 ` Nick Piggin @ 2010-05-18 8:05 ` Dave Chinner 2010-05-18 10:43 ` Nick Piggin 0 siblings, 1 reply; 44+ messages in thread From: Dave Chinner @ 2010-05-18 8:05 UTC (permalink / raw) To: Nick Piggin Cc: Josef Bacik, linux-fsdevel, chris.mason, hch, akpm, linux-kernel On Tue, May 18, 2010 at 04:36:47PM +1000, Nick Piggin wrote: > On Fri, May 14, 2010 at 06:38:21PM +1000, Dave Chinner wrote: > > On Fri, May 14, 2010 at 05:22:19PM +1000, Nick Piggin wrote: > > > On Fri, May 14, 2010 at 04:41:45PM +1000, Dave Chinner wrote: > > > > On Thu, May 13, 2010 at 11:30:57PM -0400, Josef Bacik wrote: > > > > > So this is what I had envisioned, we make write_begin take a nr_pages pointer > > > > > and tell it how much data we have to write, then in the filesystem we allocate > > > > > as many pages as we feel like, ideally something like > > > > > > > > > > min(number of pages we need for the write, some arbitrary limit for security) > > > > > > > > Actually, I was thinking that the RESERVE call determines the size > > > > of the chunk (in the order of 1-4MB maximum). IOWs, we pass in the > > > > start offset of the write, the entire length remaining, and the > > > > RESERVE call determines how much it will allow in one loop. > > > > > > > > written = 0; > > > > while (bytes_remaining > 0) { > > > > chunklen = ->allocate(off, bytes_remaining, RESERVE); > > > > write_begin(&pages, off, chunklen); > > > > copied = copy_pages(&pages, iov_iter, chunklen); > > > > ..... > > > > bytes_remaining -= copied; > > > > off += copied; > > > > written += copied; > > > > } > > > > > > How much benefit are you expecting to get? > > > > If the max chunk size is 4MB, then three orders of magnitude fewer > > allocation calls for x86_64 (i.e. one instead of 1024). For > > filesystems with significant allocation overhead (like gaining > > cluster locks in gfs2), this will be a *massive* win. > > > > > I have no problems with > > > non-pagecache block allocation/freeing APIs to get the most out of > > > amortising expensive locking or tree manipulations, but I don't > > > know if it is a good idea to pin a big chunk of pagecache like that. > > > I'd say just do the per-page manipulations like we currently do. > > > > Can you expand on this? > > Well you could do a large span block allocation at the beginning, > and then dirty the pagecache one by one like we do right now. The problem is that if we fail to allocate a page (e.g. ENOMEM) or fail the copy (EFAULT) after the block allocation, we have to undo the allocation we have already completed. If we don't, we leave uninitialised allocations on disk that will expose stale data. In the second case (EFAULT) we might be able to zero the pages to avoid punching out blocks, but the first case where pages can't be allocated to cover the block allocated range makes it very difficult without being able to punch holes in allocated block ranges. AFAIK, only XFS and OCFS2 currently support punching out arbitrary ranges of allocated blocks from an inode - there is no VFS method for it, just an ioctl (XFS_IOC_UNRESVSP). Hence the way to avoid needing hole punching is to allocate and lock down all the pages into the page cache first, then do the copy so they fail before the allocation is done if they are going to fail. That makes it much, much easier to handle failures.... 
Remember, we still don't do the "truncate blocks past eof on failure" correctly in block_write_begin (still calls vmtruncate, not ->setattr), so we can't claim that the current way of doing things is a model of perfection that we can't improve on.... > The only reason to do operations on multiple pages at once is if > we need to lock them all. Well, to avoid the refaulting of pages we just unmapped we'd need to do that... > Now the fs might well have that requirement > (if it is not using i_mutex for block (de)allocation > serialisation), but I don't think generic code needs to be doing > that. XFS already falls into the category of a filesystem using the generic code that does not use i_mutex for allocation serialisation. I'm sure it isn't the only filesystem that this is true for, so it seems sensible for the generic code to handle this case. > > I don't think we need to clear the uptodate flag > > unless we fail the write. > > read(2) is performed without page lock though, so it could equally > expose transient data that way. fair point. > > > Write to a hole, then to a non-hole? Then your write to the non > > > hole pagecache cannot be recovered in case the block allocation > > > for the hole fails and we need to return a short write. > > > > The pages covering non-holes are locked, so nobody else can access > > them. We could simply invalidate them so the next access would force > > a re-read from disk. The problem I see with this, though, is > > multiple partial overwrites of the allocated block - we could lose > > changes invalidating it. > > > > I think the simplest way to deal with it is to ensure that > > ->allocate(RESERVE) is not allowed to return ranges that span > > allocated and unallocated regions. i.e. it can return an allocated > > region or a hole, but not a region that combines both. The worst > > case is that we fall back to current behaviour (an allocation call > > per block), but for the case we're trying to optimise (large writes > > into large holes) we'd still see the benefit of fewer allocation > > calls. FWIW, I think this simplifies error handling as well (see > > below). > > Possibly is the right way to go. But why is error handling of block > allocation the hard part? Probably I look at it from the other side > of the equation, but publishing modifications to the pagecache seems > much harder to recover from than block allocations. > I don't see what changes when you go to multi-page allocation spans. > Can you explain? The lack of an interface to support removing arbitrary ranges of allocated blocks. See above. > > > > Yeah, that's kind of what I was thinking of. After a bit more > > > > thought, if the allocation only covers part of the range (for > > > > whatever reason), we can still allow that allocated part of the > > > > write to complete successfully, and the UNRESERVE call simply returns > > > > the unused part of the space reservation. This would be a partial > > > > write and be much simpler to deal with than punching out the > > > > allocations that succeeded. > > > > > > Do we really need to restore holes inside i_size in error cases? > > > > No, we don't. That's what I was trying to say - holes that were > > OK good. > > > allocated are full of good data, so we only need to deal with the > > holes that didn't get allocated. The range of the write that failed > > is only stale in memory now, so we can just invalidate pages in the > > range that is considered failed. > > Well invalidate isn't a good way to do this IMO. 
It is very hairy to > put data tentatively into PageUptodate pagecache and then try to recover > afterwards. > > Basically, once pagecache is marked uptodate, I don't think we should > ever put maybe-invalid data into it -- the way to do it is to invalidate > that page and put a *new* page in there. Ok, so lets do that... > Why? Because user mappings are just one problem, but once you had a > user mapping, you can have been subject to get_user_pages, so it could > be in the middle of a DMA operation or something. ... because we already know this behaviour causes problems for high end enterprise level features like hardware checksumming IO paths. Hence it seems that a multipage write needs to: 1. allocate new pages 2. attach bufferheads/mapping structures to pages (if required) 3. copy data into pages 4. allocate space 5. for each old page in the range: lock page invalidate mappings clear page uptodate flag remove page from page cache 6. for each new page: map new page to allocated space lock new page insert new page into pagecache update new page state (write_end equivalent) unlock new page 7. free old pages Steps 1-4 can all fail, and can all be backed out from without changing the current state. Steps 5-7 can't fail AFAICT, so we should be able to run this safely after the allocation without needing significant error unwinding... Thoughts? Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 44+ messages in thread
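A rough rendering of steps 5-7 as code may make the "can't fail" claim easier to evaluate. This is a sketch of the proposal only: remove_from_page_cache() and add_to_page_cache_lru() are existing primitives of that era, but the helper and its calling convention are invented here, and the radix-tree insertion in step 6 would need preallocation back in step 1 to be genuinely failure-free:

#include <linux/pagemap.h>
#include <linux/swap.h>

/*
 * Sketch of steps 5-7: swap fully-copied new pages into the page cache
 * in place of the old ones. Assumes steps 1-4 succeeded, the caller
 * holds a reference on every old page, and each new page is uptodate
 * and mapped to its allocated blocks.
 */
static void replace_page_range(struct address_space *mapping,
			       struct page **old, struct page **new,
			       pgoff_t first, int nr)
{
	int n;

	for (n = 0; n < nr; n++) {			/* step 5 */
		lock_page(old[n]);
		if (page_mapped(old[n]))
			unmap_mapping_range(mapping,
				(loff_t)(first + n) << PAGE_CACHE_SHIFT,
				PAGE_CACHE_SIZE, 1);
		ClearPageUptodate(old[n]);
		remove_from_page_cache(old[n]);
		unlock_page(old[n]);
	}
	for (n = 0; n < nr; n++) {			/* step 6 */
		/* returns with the page locked; cannot fail provided
		 * the radix-tree nodes were preallocated in step 1 */
		if (add_to_page_cache_lru(new[n], mapping, first + n,
					  GFP_NOFS))
			continue;	/* see caveat in the lead-in */
		SetPageUptodate(new[n]);
		set_page_dirty(new[n]);	/* write_end equivalent */
		unlock_page(new[n]);
	}
	for (n = 0; n < nr; n++)			/* step 7 */
		page_cache_release(old[n]);
}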
* Re: [RFC] new ->perform_write fop 2010-05-18 8:05 ` Dave Chinner @ 2010-05-18 10:43 ` Nick Piggin 2010-05-18 12:27 ` Dave Chinner 0 siblings, 1 reply; 44+ messages in thread From: Nick Piggin @ 2010-05-18 10:43 UTC (permalink / raw) To: Dave Chinner Cc: Josef Bacik, linux-fsdevel, chris.mason, hch, akpm, linux-kernel On Tue, May 18, 2010 at 06:05:03PM +1000, Dave Chinner wrote: > On Tue, May 18, 2010 at 04:36:47PM +1000, Nick Piggin wrote: > > Well you could do a large span block allocation at the beginning, > > and then dirty the pagecache one by one like we do right now. > > The problem is that if we fail to allocate a page (e.g. ENOMEM) or > fail the copy (EFAULT) after the block allocation, we have to undo > the allocation we have already completed. If we don't, we leave > uninitialised allocations on disk that will expose stale data. > > In the second case (EFAULT) we might be able to zero the pages to > avoid punching out blocks, but the first case where pages can't be > allocated to cover the block allocated range makes it very > difficult without being able to punch holes in allocated block > ranges. > > AFAIK, only XFS and OCFS2 currently support punching out arbitrary > ranges of allocated blocks from an inode - there is no VFS method > for it, just an ioctl (XFS_IOC_UNRESVSP). > > Hence the way to avoid needing hole punching is to allocate and lock > down all the pages into the page cache first, then do the copy so > they fail before the allocation is done if they are going to fail. > That makes it much, much easier to handle failures.... So it is just a matter of what is exposed as a vfs interface? I would much prefer to make it a requirement to support hole punching in the block allocator if the filesystem wishes to do such writes. If you have an ->allocate(inode, RESERVE/ALLOCATE, interval) API then it makes sense to have a DEALLOCATE operation there too. That seems like it should be cleaner than to work around it if we're talking about adding new APIs anyway. > Remember, we still don't do the "truncate blocks past eof on failure" > correctly in block_write_begin (still calls vmtruncate, not > ->setattr), so we can't claim that the current way of doing things > is a model of perfection that we can't improve on.... Right, but the API (write_begin/write_end) is sufficient now to make it work correctly. What should really happen is the filesystem detects and handles these error cases. In the truncate patchset (that deprecates ->truncate and vmtruncate entirely), that is exactly what happens. > > > The only reason to do operations on multiple pages at once is if > > we need to lock them all. > > Well, to avoid the refaulting of pages we just unmapped we'd need to > do that... Well, the lock/unmap/copy/unlock could be done on a per-page basis. > > Now the fs might well have that requirement > > (if it is not using i_mutex for block (de)allocation > > serialisation), but I don't think generic code needs to be doing > > that. > > XFS already falls into the category of a filesystem using the > generic code that does not use i_mutex for allocation serialisation. > I'm sure it isn't the only filesystem that this is true for, so it > seems sensible for the generic code to handle this case. Well, does it need page lock? All pages locked concurrently in a range under which block allocation is happening? I would much prefer an allocation API that supports allocation/freeing without requiring any pagecache at all. 
> > Basically, once pagecache is marked uptodate, I don't think we should > > ever put maybe-invalid data into it -- the way to do it is to invalidate > > that page and put a *new* page in there. > > Ok, so lets do that... > > > Why? Because user mappings are just one problem, but once you had a > > user mapping, you can have been subject to get_user_pages, so it could > > be in the middle of a DMA operation or something. > > ... because we already know this behaviour causes problems for > high end enterprise level features like hardware checksumming IO > paths. > > Hence it seems that a multipage write needs to: > > 1. allocate new pages > 2. attach bufferheads/mapping structures to pages (if required) > 3. copy data into pages > 4. allocate space > 5. for each old page in the range: > lock page > invalidate mappings > clear page uptodate flag > remove page from page cache > 6. for each new page: > map new page to allocated space > lock new page > insert new page into pagecache > update new page state (write_end equivalent) > unlock new page > 7. free old pages > > Steps 1-4 can all fail, and can all be backed out from without > changing the current state. Steps 5-7 can't fail AFAICT, so we > should be able to run this safely after the allocation without > needing significant error unwinding... > > Thoughts? Possibly. The importance of hot cache is reduced, because we are doing full-page copies, and bulk copies, by definition. But it could still be an issue. The allocations and deallocations could cost a little as well. Compared to having a nice API to just do bulk allocate/free block operations and then just doing simple per-page copies, I think it doesn't look so nice. ^ permalink raw reply [flat|nested] 44+ messages in thread
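For comparison, the shape of the alternative Nick is arguing for might look like the two hypothetical hooks below - one bulk allocation, one bulk free - with the page cache still dirtied page-by-page exactly as the current write path does. All names are invented for illustration:

#include <linux/fs.h>
#include <linux/types.h>

/* Hypothetical bulk block (de)allocation ops - illustrative only. */
struct block_span_ops {
	/* reserve/allocate backing for [pos, pos + len); returns the
	 * length actually granted, or a negative errno */
	ssize_t (*alloc_span)(struct inode *inode, loff_t pos, size_t len);

	/* give back [pos, pos + len) after a failed copy - hole punch
	 * for allocated blocks, unreserve for delalloc */
	int (*free_span)(struct inode *inode, loff_t pos, size_t len);
};

Under this scheme the write path would call ->alloc_span() once, run today's per-page write_begin/copy/write_end loop over the granted range, and call ->free_span() on whatever tail the copy failed to cover.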
* Re: [RFC] new ->perform_write fop 2010-05-18 10:43 ` Nick Piggin @ 2010-05-18 12:27 ` Dave Chinner 2010-05-18 15:09 ` Nick Piggin 0 siblings, 1 reply; 44+ messages in thread From: Dave Chinner @ 2010-05-18 12:27 UTC (permalink / raw) To: Nick Piggin Cc: Josef Bacik, linux-fsdevel, chris.mason, hch, akpm, linux-kernel On Tue, May 18, 2010 at 08:43:51PM +1000, Nick Piggin wrote: > On Tue, May 18, 2010 at 06:05:03PM +1000, Dave Chinner wrote: > > On Tue, May 18, 2010 at 04:36:47PM +1000, Nick Piggin wrote: > > > Well you could do a large span block allocation at the beginning, > > > and then dirty the pagecache one by one like we do right now. > > > > The problem is that if we fail to allocate a page (e.g. ENOMEM) or > > fail the copy (EFAULT) after the block allocation, we have to undo > > the allocation we have already completed. If we don't, we leave > > uninitialised allocations on disk that will expose stale data. > > > > In the second case (EFAULT) we might be able to zero the pages to > > avoid punching out blocks, but the first case where pages can't be > > allocated to cover the block allocated range makes it very > > difficult without being able to punch holes in allocated block > > ranges. > > > > AFAIK, only XFS and OCFS2 currently support punching out arbitrary > > ranges of allocated blocks from an inode - there is no VFS method > > for it, just an ioctl (XFS_IOC_UNRESVSP). > > > > Hence the way to avoid needing hole punching is to allocate and lock > > down all the pages into the page cache first, then do the copy so > > they fail before the allocation is done if they are going to fail. > > That makes it much, much easier to handle failures.... > > So it is just a matter of what is exposed as a vfs interface? More a matter of utilising the functionality most filesystems already have and minimising the amount of churn in critical areas of filesystem code. Hole punching is not simple, and bugs will likely result in a corrupted filesystem. And the hole punching will only occur in a hard to trigger corner case, so it's likely that bugs will go undetected and filesystems will suffer from random, impossible to track down corruptions as a result. In comparison, adding reserve/unreserve functionality might cause block accounting issues if there is a bug, but it won't cause on-disk corruption that results in data loss. Hole punching is not simple or easy - it's a damn complex way to handle errors and if that's all it's required for then we've failed already. > I would much prefer to make it a requirement to support hole > punching in the block allocator if the filesystem wishes to do > such writes. I think that's an unrealistic requirement simply because it can be avoided. With a reserve/alloc/unreserve interface, btrfs will work almost unmodified, XFS will require some new wrappers and bufferhead mapping code, and ext4 and gfs2 look to be pretty much in the same boat. All relatively simple on the filesystem side. If we have to add hole punching, XFS will require an extra wrapper but btrfs, gfs2 and ext4 will have to implement hole punching from the ground up. Personally, I think that requiring hole punching is asking far too much for multipage writes, esp. given that btrfs already implements them without needing such functionality... > > > The only reason to do operations on multiple pages at once is if > > > we need to lock them all. > > > > Well, to avoid the refaulting of pages we just unmapped we'd need to > > do that... 
> > Well, the lock/unmap/copy/unlock could be done on a per-page > basis. The moment we unmap the old page we cannot unlock it until the new page is in the page cache. If we do unlock it, we risk having it faulted again before we insert the new copy. Yes, that can be done page by page, but should only be done after all the pages are allocated and copied into. FWIW, I don't think we can unmap the old page until after the entire copy is done because the old page(s) might be where we are copying from.... > > > Now the fs might well have that requirement > > > (if it is not using i_mutex for block (de)allocation > > > serialisation), but I don't think generic code needs to be doing > > > that. > > > > XFS already falls into the category of a filesystem using the > > generic code that does not use i_mutex for allocation serialisation. > > I'm sure it isn't the only filesystem that this is true for, so it > > seems sensible for the generic code to handle this case. > > Well, does it need page lock? All pages locked concurrently in > a range under which block allocation is happening? No, allocation doesn't require page locks either - XFS has its own inode locks for serialisation of allocation, truncation and hole punching. > I would much > prefer an allocation API that supports allocation/freeing > without requiring any pagecache at all. Allocation doesn't require any pagecache at all. It's the fact that the allocation needs to be synchronised with the page cache state change that requires page locks to be taken as part of the write process. > > > Basically, once pagecache is marked uptodate, I don't think we should > > > ever put maybe-invalid data into it -- the way to do it is to invalidate > > > that page and put a *new* page in there. > > > > Ok, so lets do that... > > > > > Why? Because user mappings are just one problem, but once you had a > > > user mapping, you can have been subject to get_user_pages, so it could > > > be in the middle of a DMA operation or something. > > > > ... because we already know this behaviour causes problems for > > high end enterprise level features like hardware checksumming IO > > paths. > > > > Hence it seems that a multipage write needs to: > > > > 1. allocate new pages > > 2. attach bufferheads/mapping structures to pages (if required) > > 3. copy data into pages > > 4. allocate space > > 5. for each old page in the range: > > lock page > > invalidate mappings > > clear page uptodate flag > > remove page from page cache > > 6. for each new page: > > map new page to allocated space > > lock new page > > insert new page into pagecache > > update new page state (write_end equivalent) > > unlock new page > > 7. free old pages > > > > Steps 1-4 can all fail, and can all be backed out from without > > changing the current state. Steps 5-7 can't fail AFAICT, so we > > should be able to run this safely after the allocation without > > needing significant error unwinding... > > > > Thoughts? > > Possibly. The importance of hot cache is reduced, because we are > doing full-page copies, and bulk copies, by definition. But it > could still be an issue. The allocations and deallocations could > cost a little as well. They will cost far less than the reduction in allocation overhead saves us, and there are potential optimisations there for reuse of old pages.... Cheers, Dave. 
-- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [RFC] new ->perform_write fop 2010-05-18 12:27 ` Dave Chinner @ 2010-05-18 15:09 ` Nick Piggin 2010-05-19 23:50 ` Dave Chinner 0 siblings, 1 reply; 44+ messages in thread From: Nick Piggin @ 2010-05-18 15:09 UTC (permalink / raw) To: Dave Chinner Cc: Josef Bacik, linux-fsdevel, chris.mason, hch, akpm, linux-kernel On Tue, May 18, 2010 at 10:27:14PM +1000, Dave Chinner wrote: > On Tue, May 18, 2010 at 08:43:51PM +1000, Nick Piggin wrote: > > On Tue, May 18, 2010 at 06:05:03PM +1000, Dave Chinner wrote: > > > On Tue, May 18, 2010 at 04:36:47PM +1000, Nick Piggin wrote: > > > > Well you could do a large span block allocation at the beginning, > > > > and then dirty the pagecache one by one like we do right now. > > > > > > The problem is that if we fail to allocate a page (e.g. ENOMEM) or > > > fail the copy (EFAULT) after the block allocation, we have to undo > > > the allocation we have already completed. If we don't, we leave > > > uninitialised allocations on disk that will expose stale data. > > > > > > In the second case (EFAULT) we might be able to zero the pages to > > > avoid punching out blocks, but the first case where pages can't be > > > allocated to cover the block allocated range makes it very > > > difficult without being able to punch holes in allocated block > > > ranges. > > > > > > AFAIK, only XFS and OCFS2 currently support punching out arbitrary > > > ranges of allocated blocks from an inode - there is no VFS method > > > for it, just an ioctl (XFS_IOC_UNRESVSP). > > > > > > Hence the way to avoid needing hole punching is to allocate and lock > > > down all the pages into the page cache first, then do the copy so > > > they fail before the allocation is done if they are going to fail. > > > That makes it much, much easier to handle failures.... > > > > So it is just a matter of what is exposed as a vfs interface? > > More a matter of utilising the functionality most filesystems > already have and minimising the amount of churn in critical areas of > filesystem code. Hole punching is not simple, and bugs will likely > result in a corrupted filesystem. And the hole punching will only > occur in a hard to trigger corner case, so it's likely that bugs > will go undetected and filesystems will suffer from random, > impossible to track down corruptions as a result. > > In comparison, adding reserve/unreserve functionality might cause > block accounting issues if there is a bug, but it won't cause > on-disk corruption that results in data loss. Hole punching is not > simple or easy - it's a damn complex way to handle errors and if > that's all it's required for then we've failed already. As I said, we can have a dumb fallback path for filesystems that don't implement hole punching. Clear the blocks past i_size, and zero out the allocated but not initialized blocks. There does not have to be pagecache allocated in order to do this, you could do direct IO from the zero page in order to do it. Hole punching is not only useful there, it is already exposed to userspace via MADV_REMOVE. > > I would much prefer to make it a requirement to support hole > > punching in the block allocator if the filesystem wishes to do > > such writes. > > I think that's an unrealistic requirement simply because it can be > avoided. With a reserve/alloc/unreserve interface, btrfs will work > almost unmodified, XFS will require some new wrappers and bufferhead > mapping code, and ext4 and gfs2 look to be pretty much in the same > boat. All relatively simple on the filesystem side. 
> > If we have to add hole punching, XFS will require an extra wrapper > but btrfs, gfs2 and ext4 will have to implement hole punching from > the ground up. Personally, I think that requiring hole punching is > asking far too much for multipage writes, esp. given that btrfs > already implements them without needing such functionality... I think that is the better long term choice rather than having inferior APIs and then being stuck with them for years. > > > > The only reason to do operations on multiple pages at once is if > > > > we need to lock them all. > > > > > > Well, to avoid the refaulting of pages we just unmapped we'd need to > > > do that... > > > > Well, the lock/unmap/copy/unlock could be done on a per-page > > basis. > > The moment we unmap the old page we cannot unlock it until the new > page is in the page cache. If we do unlock it, we risk having it > faulted again before we insert the new copy. Yes, that can be done > page by page, but should only be done after all the pages are > allocated and copied into. > > FWIW, I don't think we can unmap the old page until after the entire > copy is done because the old page(s) might be where we are copying > from.... Yeah true. > > > > Now the fs might well have that requirement > > > > (if it is not using i_mutex for block (de)allocation > > > > serialisation), but I don't think generic code needs to be doing > > > > that. > > > > > > XFS already falls into the category of a filesystem using the > > > generic code that does not use i_mutex for allocation serialisation. > > > I'm sure it isn't the only filesystem that this is true for, so it > > > seems sensible for the generic code to handle this case. > > > > Well, does it need page lock? All pages locked concurrently in > > a range under which block allocation is happening? > > No, allocation doesn't require page locks either - XFS has its own > inode locks for serialisation of allocation, truncation and hole > punching. Right, I was just mentioning multiple page locks could just be useful for that. I was not advocating that we only support i_mutex filesystems from the generic code. > > I would much > > prefer an allocation API that supports allocation/freeing > > without requiring any pagecache at all. > > Allocation doesn't require any pagecache at all. It's the fact that > the allocation needs to be synchronised with the page cache state > change that requires page locks to be taken as part of the write > process. When setting up the buffer state with the filesystem state. Sure. > > > > Basically, once pagecache is marked uptodate, I don't think we should > > > > ever put maybe-invalid data into it -- the way to do it is to invalidate > > > > that page and put a *new* page in there. > > > > > > Ok, so lets do that... > > > > > > > Why? Because user mappings are just one problem, but once you had a > > > > user mapping, you can have been subject to get_user_pages, so it could > > > > be in the middle of a DMA operation or something. > > > > > > ... because we already know this behaviour causes problems for > > > high end enterprise level features like hardware checksumming IO > > > paths. > > > > > > Hence it seems that a multipage write needs to: > > > > > > 1. allocate new pages > > > 2. attach bufferheads/mapping structures to pages (if required) > > > 3. copy data into pages > > > 4. allocate space > > > 5. for each old page in the range: > > > lock page > > > invalidate mappings > > > clear page uptodate flag > > > remove page from page cache > > > 6. 
for each new page: > > > map new page to allocated space > > > lock new page > > > insert new page into pagecache > > > update new page state (write_end equivalent) > > > unlock new page > > > 7. free old pages > > > > > > Steps 1-4 can all fail, and can all be backed out from without > > > changing the current state. Steps 5-7 can't fail AFAICT, so we > > > should be able to run this safely after the allocation without > > > needing significant error unwinding... > > > > > > Thoughts? > > > > Possibly. The importance of hot cache is reduced, because we are > > doing full-page copies, and bulk copies, by definition. But it > > could still be an issue. The allocations and deallocations could > > cost a little as well. > > They will cost far less than the reduction in allocation overhead > saves us, and there are potential optimisations there for reuse of > old pages.... An API that doesn't require that, though, should be less overhead and simpler. Is it really going to be a problem to implement block hole punching in ext4 and gfs2? ^ permalink raw reply [flat|nested] 44+ messages in thread
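The "dumb fallback" Nick proposes above can be sketched with stock block-layer calls of this era: write zeros over the just-allocated blocks straight from the kernel's zero page, with no pagecache involved. Illustrative only - it assumes a page-aligned, already-mapped block range and does one synchronous bio per page:

#include <linux/bio.h>
#include <linux/blkdev.h>
#include <linux/completion.h>
#include <linux/mm.h>

static void zero_fill_end_io(struct bio *bio, int err)
{
	complete(bio->bi_private);
}

/* Synchronously write PAGE_SIZE of zeros to each of nr_pages blocks. */
static int zero_fill_blocks(struct block_device *bdev, sector_t sector,
			    unsigned int nr_pages)
{
	while (nr_pages--) {
		DECLARE_COMPLETION_ONSTACK(done);
		struct bio *bio = bio_alloc(GFP_NOFS, 1);
		int uptodate;

		if (!bio)
			return -ENOMEM;
		bio->bi_bdev = bdev;
		bio->bi_sector = sector;
		bio->bi_end_io = zero_fill_end_io;
		bio->bi_private = &done;
		bio_add_page(bio, ZERO_PAGE(0), PAGE_SIZE, 0);

		submit_bio(WRITE, bio);
		wait_for_completion(&done);
		uptodate = test_bit(BIO_UPTODATE, &bio->bi_flags);
		bio_put(bio);
		if (!uptodate)
			return -EIO;
		sector += PAGE_SIZE >> 9;
	}
	return 0;
}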
* Re: [RFC] new ->perform_write fop 2010-05-18 15:09 ` Nick Piggin @ 2010-05-19 23:50 ` Dave Chinner 2010-05-20 6:48 ` Nick Piggin 2010-05-20 20:12 ` Jan Kara 0 siblings, 2 replies; 44+ messages in thread From: Dave Chinner @ 2010-05-19 23:50 UTC (permalink / raw) To: Nick Piggin Cc: Josef Bacik, linux-fsdevel, chris.mason, hch, akpm, linux-kernel On Wed, May 19, 2010 at 01:09:12AM +1000, Nick Piggin wrote: > On Tue, May 18, 2010 at 10:27:14PM +1000, Dave Chinner wrote: > > On Tue, May 18, 2010 at 08:43:51PM +1000, Nick Piggin wrote: > > > On Tue, May 18, 2010 at 06:05:03PM +1000, Dave Chinner wrote: > > > > On Tue, May 18, 2010 at 04:36:47PM +1000, Nick Piggin wrote: > > > > > Well you could do a large span block allocation at the beginning, > > > > > and then dirty the pagecache one by one like we do right now. > > > > > > > > The problem is that if we fail to allocate a page (e.g. ENOMEM) or > > > > fail the copy (EFAULT) after the block allocation, we have to undo > > > > the allocation we have already completed. If we don't, we leave > > > > uninitialised allocations on disk that will expose stale data. > > > > > > > > In the second case (EFAULT) we might be able to zero the pages to > > > > avoid punching out blocks, but the first case where pages can't be > > > > allocated to cover the block allocated range makes it very > > > > difficult without being able to punch holes in allocated block > > > > ranges. > > > > > > > > AFAIK, only XFS and OCFS2 currently support punching out arbitrary > > > > ranges of allocated blocks from an inode - there is no VFS method > > > > for it, just an ioctl (XFS_IOC_UNRESVSP). > > > > > > > > Hence the way to avoid needing hole punching is to allocate and lock > > > > down all the pages into the page cache first, then do the copy so > > > > they fail before the allocation is done if they are going to fail. > > > > That makes it much, much easier to handle failures.... > > > > > > So it is just a matter of what is exposed as a vfs interface? > > > > More a matter of utilising the functionality most filesystems > > already have and minimising the amount of churn in critical areas of > > filesystem code. Hole punching is not simple, and bugs will likely > > result in a corrupted filesystem. And the hole punching will only > > occur in a hard to trigger corner case, so it's likely that bugs > > will go undetected and filesystems will suffer from random, > > impossible to track down corruptions as a result. > > > > In comparison, adding reserve/unreserve functionality might cause > > block accounting issues if there is a bug, but it won't cause > > on-disk corruption that results in data loss. Hole punching is not > > simple or easy - it's a damn complex way to handle errors and if > > that's all it's required for then we've failed already. > > As I said, we can have a dumb fallback path for filesystems that > don't implement hole punching. Clear the blocks past i size, and > zero out the allocated but not initialized blocks. > > There does not have to be pagecache allocated in order to do this, > you could do direct IO from the zero page in order to do it. I don't see that as a good solution - it's once again a fairly complex way of dealing with the problem, especially as it now means that direct io would fall back to buffered which would fall back to direct IO.... > Hole punching is not only useful there, it is already exposed to > userspace via MADV_REMOVE. That interface is *totally broken*. 
It has all the same problems as vmtruncate() for removing file blocks (because it uses vmtruncate). It also has the fundamental problem of being called under the mmap_sem, which means that inode locks and therefore de-allocation cannot be executed without the possibility of deadlocks. Fundamentally, hole punching is an inode operation, not a VM operation.... > > > > > Basically, once pagecache is marked uptodate, I don't think we should > > > > > ever put maybe-invalid data into it -- the way to do it is to invalidate > > > > > that page and put a *new* page in there. > > > > > > > > Ok, so lets do that... > > > > > > > > > Why? Because user mappings are just one problem, but once you had a > > > > > user mapping, you can have been subject to get_user_pages, so it could > > > > > be in the middle of a DMA operation or something. > > > > > > > > ... because we already know this behaviour causes problems for > > > > high end enterprise level features like hardware checksumming IO > > > > paths. > > > > > > > > Hence it seems that a multipage write needs to: > > > > > > > > 1. allocate new pages > > > > 2. attach bufferheads/mapping structures to pages (if required) > > > > 3. copy data into pages > > > > 4. allocate space > > > > 5. for each old page in the range: > > > > lock page > > > > invalidate mappings > > > > clear page uptodate flag > > > > remove page from page cache > > > > 6. for each new page: > > > > map new page to allocated space > > > > lock new page > > > > insert new page into pagecache > > > > update new page state (write_end equivalent) > > > > unlock new page > > > > 7. free old pages > > > > > > > > Steps 1-4 can all fail, and can all be backed out from without > > > > changing the current state. Steps 5-7 can't fail AFAICT, so we > > > > should be able to run this safely after the allocation without > > > > needing significant error unwinding... > > > > > > > > Thoughts? > > > > > > Possibly. The importance of hot cache is reduced, because we are > > > doing full-page copies, and bulk copies, by definition. But it > > > could still be an issue. The allocations and deallocations could > > > cost a little as well. > > > > They will cost far less than the reduction in allocation overhead > > saves us, and there are potential optimisations there > > An API that doesn't require that, though, should be less overhead > and simpler. > > Is it really going to be a problem to implement block hole punching > in ext4 and gfs2? I can't follow the ext4 code - it's an intricate maze of weird entry and exit points, so I'm not even going to attempt to comment on it. The gfs2 code is easier to follow and it looks like it would require a redesign and rewrite of the block truncation implementation as it appears to assume that blocks are only ever removed from the end of the file - I don't think the recursive algorithms for trimming the indirect block trees can be easily modified for punching out arbitrary ranges of blocks easily. I could be wrong, though, as I'm not a gfs2 expert.... Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [RFC] new ->perform_write fop 2010-05-19 23:50 ` Dave Chinner @ 2010-05-20 6:48 ` Nick Piggin 0 siblings, 0 replies; 44+ messages in thread From: Nick Piggin @ 2010-05-20 6:48 UTC (permalink / raw) To: Dave Chinner Cc: Josef Bacik, linux-fsdevel, chris.mason, hch, akpm, linux-kernel On Thu, May 20, 2010 at 09:50:54AM +1000, Dave Chinner wrote: > > As I said, we can have a dumb fallback path for filesystems that > > don't implement hole punching. Clear the blocks past i size, and > > zero out the allocated but not initialized blocks. > > > > There does not have to be pagecache allocated in order to do this, > > you could do direct IO from the zero page in order to do it. > > I don't see that as a good solution - it's once again a fairly > complex way of dealing with the problem, especially as it now means > that direct io would fall back to buffered which would fall back to > direct IO.... Well it wouldn't use the full direct IO path. It has the block, just build a bio with the source zero page and write it out. If the fs requires anything more fancy than that, tough, it should just implement hole punching. > > Hole punching is not only useful there, it is already exposed to > > userspace via MADV_REMOVE. > > That interface is *totally broken*. Why? > It has all the same problems as > vmtruncate() for removing file blocks (because it uses vmtruncate). > It also has the fundamental problem of being called under the mmap_sem, > which means that inode locks and therefore de-allocation cannot be > executed without the possibility of deadlocks. None of that is an API problem, it's all implementation. Yes fadvise would be a much better API, but the madvise API is still there. Implementation wise: it does not use vmtruncate; it has no mmap_sem problem. > Fundamentally, hole > punching is an inode operation, not a VM operation.... VM acts as a handle to inode operations. It's no big deal. > > An API that doesn't require that, though, should be less overhead > > and simpler. > > > > Is it really going to be a problem to implement block hole punching > > in ext4 and gfs2? > > I can't follow the ext4 code - it's an intricate maze of weird entry > and exit points, so I'm not even going to attempt to comment on it. > > The gfs2 code is easier to follow and it looks like it would require > a redesign and rewrite of the block truncation implementation as it > appears to assume that blocks are only ever removed from the end of > the file - I don't think the recursive algorithms for trimming the > indirect block trees can be easily modified for punching out > arbitrary ranges of blocks easily. I could be wrong, though, as I'm > not a gfs2 expert.... I'm far more in favour of doing the interfaces right, and making the filesystems fix themselves to use it. ^ permalink raw reply [flat|nested] 44+ messages in thread
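For readers following along, both hole-punching interfaces being argued about here are already usable from userspace. A minimal illustration (error handling elided; XFS_IOC_UNRESVSP64 comes from the xfsprogs headers, and MADV_REMOVE only succeeds on filesystems that implement the underlying hole-punch hook):

#define _GNU_SOURCE
#include <sys/mman.h>
#include <sys/ioctl.h>
#include <xfs/xfs.h>		/* xfsprogs: XFS_IOC_UNRESVSP64 */

/* Punch out [off, off + len) of an XFS file via the ioctl. */
static int punch_hole_xfs(int fd, long long off, long long len)
{
	xfs_flock64_t fl = { .l_whence = 0, .l_start = off, .l_len = len };

	return ioctl(fd, XFS_IOC_UNRESVSP64, &fl);
}

/* Punch out a page-aligned range of an mmap()ed file via madvise(). */
static int punch_hole_madvise(char *map, size_t off, size_t len)
{
	return madvise(map + off, len, MADV_REMOVE);
}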
* Re: [RFC] new ->perform_write fop 2010-05-19 23:50 ` Dave Chinner 2010-05-20 6:48 ` Nick Piggin @ 2010-05-20 20:12 ` Jan Kara 2010-05-20 23:05 ` Dave Chinner 1 sibling, 1 reply; 44+ messages in thread From: Jan Kara @ 2010-05-20 20:12 UTC (permalink / raw) To: Dave Chinner Cc: Nick Piggin, Josef Bacik, linux-fsdevel, chris.mason, hch, akpm, linux-kernel On Thu 20-05-10 09:50:54, Dave Chinner wrote: > On Wed, May 19, 2010 at 01:09:12AM +1000, Nick Piggin wrote: > > On Tue, May 18, 2010 at 10:27:14PM +1000, Dave Chinner wrote: > > > On Tue, May 18, 2010 at 08:43:51PM +1000, Nick Piggin wrote: > > > > On Tue, May 18, 2010 at 06:05:03PM +1000, Dave Chinner wrote: > > > > > On Tue, May 18, 2010 at 04:36:47PM +1000, Nick Piggin wrote: > > > > > > Well you could do a large span block allocation at the beginning, > > > > > > and then dirty the pagecache one by one like we do right now. > > > > > > > > > > The problem is that if we fail to allocate a page (e.g. ENOMEM) or > > > > > fail the copy (EFAULT) after the block allocation, we have to undo > > > > > the allocation we have already completed. If we don't, we leave > > > > > uninitialised allocations on disk that will expose stale data. > > > > > > > > > > In the second case (EFAULT) we might be able to zero the pages to > > > > > avoid punching out blocks, but the first case where pages can't be > > > > > allocated to cover the block allocated range makes it very > > > > > difficult without being able to punch holes in allocated block > > > > > ranges. > > > > > > > > > > AFAIK, only XFS and OCFS2 currently support punching out arbitrary > > > > > ranges of allocated blocks from an inode - there is no VFS method > > > > > for it, just an ioctl (XFS_IOC_UNRESVSP). > > > > > > > > > > Hence the way to avoid needing hole punching is to allocate and lock > > > > > down all the pages into the page cache first, then do the copy so > > > > > they fail before the allocation is done if they are going to fail. > > > > > That makes it much, much easier to handle failures.... > > > > > > > > So it is just a matter of what is exposed as a vfs interface? > > > > > > More a matter of utilising the functionality most filesystems > > > already have and minimising the amount of churn in critical areas of > > > filesystem code. Hole punching is not simple, and bugs will likely > > > result in a corrupted filesystem. And the hole punching will only > > > occur in a hard to trigger corner case, so it's likely that bugs > > > will go undetected and filesystems will suffer from random, > > > impossible to track down corruptions as a result. > > > > > > In comparison, adding reserve/unreserve functionality might cause > > > block accounting issues if there is a bug, but it won't cause > > > on-disk corruption that results in data loss. Hole punching is not > > > simple or easy - it's a damn complex way to handle errors and if > > > that's all it's required for then we've failed already. > > > > As I said, we can have a dumb fallback path for filesystems that > > don't implement hole punching. Clear the blocks past i size, and > > zero out the allocated but not initialized blocks. > > > > There does not have to be pagecache allocated in order to do this, > > you could do direct IO from the zero page in order to do it. > > I don't see that as a good solution - it's once again a fairly > complex way of dealing with the problem, especially as it now means > that direct io would fall back to buffered which would fall back to > direct IO.... 
> > > Hole punching is not only useful there, it is already exposed to > > userspace via MADV_REMOVE. > > That interface is *totally broken*. It has all the same problems as > vmtruncate() for removing file blocks (because it uses vmtruncate). > It also has the fundamental problem of being called under the mmap_sem, > which means that inode locks and therefore de-allocation cannot be > executed without the possibility of deadlocks. Fundamentally, hole > punching is an inode operation, not a VM operation.... > > > > > > > > Basically, once pagecache is marked uptodate, I don't think we should > > > > > > ever put maybe-invalid data into it -- the way to do it is to invalidate > > > > > > that page and put a *new* page in there. > > > > > > > > > > Ok, so let's do that... > > > > > > > > > > > Why? Because user mappings are just one problem, but once you had a > > > > > > user mapping, you can have been subject to get_user_pages, so it could > > > > > > be in the middle of a DMA operation or something. > > > > > > > > > > ... because we already know this behaviour causes problems for > > > > > high end enterprise level features like hardware checksumming IO > > > > > paths. > > > > > > > > > > Hence it seems that a multipage write needs to: > > > > > > > > > > 1. allocate new pages > > > > > 2. attach bufferheads/mapping structures to pages (if required) > > > > > 3. copy data into pages > > > > > 4. allocate space > > > > > 5. for each old page in the range: > > > > > lock page > > > > > invalidate mappings > > > > > clear page uptodate flag > > > > > remove page from page cache > > > > > 6. for each new page: > > > > > map new page to allocated space > > > > > lock new page > > > > > insert new page into pagecache > > > > > update new page state (write_end equivalent) > > > > > unlock new page > > > > > 7. free old pages > > > > > > > > > > Steps 1-4 can all fail, and can all be backed out from without > > > > > changing the current state. Steps 5-7 can't fail AFAICT, so we > > > > > should be able to run this safely after the allocation without > > > > > needing significant error unwinding... > > > > > > > > > > Thoughts? > > > > > > > > Possibly. The importance of hot cache is reduced, because we are > > > > doing full-page copies, and bulk copies, by definition. But it > > > > could still be an issue. The allocations and deallocations could > > > > cost a little as well. > > > > > > They will cost far less than the reduction in allocation overhead > > > saves us, and there are potential optimisations there > > > > An API that doesn't require that, though, should be less overhead > > and simpler. > > > > Is it really going to be a problem to implement block hole punching > > in ext4 and gfs2? > > I can't follow the ext4 code - it's an intricate maze of weird entry > and exit points, so I'm not even going to attempt to comment on it. Hmm, I was thinking about it and I see two options for how to get out of these problems: a) Filesystems which are not able to handle hole punching will allow multipage writes only after EOF (which can be easily undone by truncate in case of failure). That should actually cover lots of cases we are interested in (I don't expect multipage writes to holes to be a common case). b) E.g. ext4 can do even without hole punching. It can allocate the extent as 'unwritten' and when something during the write fails, it just leaves the extent allocated and the 'unwritten' flag makes sure that any read will see zeros. 
I suppose that other filesystems that care about multipage writes are able to do similar things (e.g. btrfs can do the same as far as I remember, I'm not sure about gfs2). Honza -- Jan Kara <jack@suse.cz> SUSE Labs, CR ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [RFC] new ->perform_write fop 2010-05-20 20:12 ` Jan Kara @ 2010-05-20 23:05 ` Dave Chinner 2010-05-21 9:05 ` Steven Whitehouse ` (2 more replies) 0 siblings, 3 replies; 44+ messages in thread From: Dave Chinner @ 2010-05-20 23:05 UTC (permalink / raw) To: Jan Kara Cc: Nick Piggin, Josef Bacik, linux-fsdevel, chris.mason, hch, akpm, linux-kernel On Thu, May 20, 2010 at 10:12:32PM +0200, Jan Kara wrote: > On Thu 20-05-10 09:50:54, Dave Chinner wrote: > > On Wed, May 19, 2010 at 01:09:12AM +1000, Nick Piggin wrote: > > > On Tue, May 18, 2010 at 10:27:14PM +1000, Dave Chinner wrote: > > > > On Tue, May 18, 2010 at 08:43:51PM +1000, Nick Piggin wrote: > > > > > On Tue, May 18, 2010 at 06:05:03PM +1000, Dave Chinner wrote: > > > > > > On Tue, May 18, 2010 at 04:36:47PM +1000, Nick Piggin wrote: > > > > > > > Well you could do a large span block allocation at the beginning, > > > > > > > and then dirty the pagecache one by one like we do right now. > > > > > > > > > > > > The problem is that if we fail to allocate a page (e.g. ENOMEM) or > > > > > > fail the copy (EFAULT) after the block allocation, we have to undo > > > > > > the allocation we have already completed. If we don't, we leave > > > > > > uninitialised allocations on disk that will expose stale data. > > > > > > > > > > > > In the second case (EFAULT) we might be able to zero the pages to > > > > > > avoid punching out blocks, but the first case where pages can't be > > > > > > allocated to cover the block allocated range makes it very > > > > > > difficult without being able to punch holes in allocated block > > > > > > ranges. > > > > > > > > > > > > AFAIK, only XFS and OCFS2 currently support punching out arbitrary > > > > > > ranges of allocated blocks from an inode - there is no VFS method > > > > > > for it, just an ioctl (XFS_IOC_UNRESVSP). > > > > > > > > > > > > Hence the way to avoid needing hole punching is to allocate and lock > > > > > > down all the pages into the page cache first, then do the copy so > > > > > > they fail before the allocation is done if they are going to fail. > > > > > > That makes it much, much easier to handle failures.... > > > > > > > > > > So it is just a matter of what is exposed as a vfs interface? > > > > > > > > More a matter of utilising the functionality most filesystems > > > > already have and minimising the amount of churn in critical areas of > > > > filesystem code. Hole punching is not simple, and bugs will likely > > > > result in a corrupted filesystem. And the hole punching will only > > > > occur in a hard to trigger corner case, so it's likely that bugs > > > > will go undetected and filesystems will suffer from random, > > > > impossible to track down corruptions as a result. > > > > > > > > In comparison, adding reserve/unreserve functionality might cause > > > > block accounting issues if there is a bug, but it won't cause > > > > on-disk corruption that results in data loss. Hole punching is not > > > > simple or easy - it's a damn complex way to handle errors and if > > > > that's all it's required for then we've failed already. > > > > > > As I said, we can have a dumb fallback path for filesystems that > > > don't implement hole punching. Clear the blocks past i_size, and > > > zero out the allocated but not initialized blocks. > > > > > > There does not have to be pagecache allocated in order to do this, > > > you could do direct IO from the zero page in order to do it. 
> > > > I don't see that as a good solution - it's once again a fairly > > complex way of dealing with the problem, especially as it now means > > that direct io would fall back to buffered which would fall back to > > direct IO.... > > > > > Hole punching is not only useful there, it is already exposed to > > > userspace via MADV_REMOVE. > > > > That interface is *totally broken*. It has all the same problems as > > vmtruncate() for removing file blocks (because it uses vmtruncate). > > It also has the fundamental problem of being called under the mmap_sem, > > which means that inode locks and therefore de-allocation cannot be > > executed without the possibility of deadlocks. Fundamentally, hole > > punching is an inode operation, not a VM operation.... > > > > > > > > > > > Basically, once pagecache is marked uptodate, I don't think we should > > > > > > > > ever put maybe-invalid data into it -- the way to do it is to invalidate > > > > > > > > that page and put a *new* page in there. > > > > > > > > > > > > > > Ok, so let's do that... > > > > > > > > > > > > > > > Why? Because user mappings are just one problem, but once you had a > > > > > > > > user mapping, you can have been subject to get_user_pages, so it could > > > > > > > > be in the middle of a DMA operation or something. > > > > > > > > > > > > > > ... because we already know this behaviour causes problems for > > > > > > > high end enterprise level features like hardware checksumming IO > > > > > > > paths. > > > > > > > > > > > > > > Hence it seems that a multipage write needs to: > > > > > > > > > > > > > > 1. allocate new pages > > > > > > > 2. attach bufferheads/mapping structures to pages (if required) > > > > > > > 3. copy data into pages > > > > > > > 4. allocate space > > > > > > > 5. for each old page in the range: > > > > > > > lock page > > > > > > > invalidate mappings > > > > > > > clear page uptodate flag > > > > > > > remove page from page cache > > > > > > > 6. for each new page: > > > > > > > map new page to allocated space > > > > > > > lock new page > > > > > > > insert new page into pagecache > > > > > > > update new page state (write_end equivalent) > > > > > > > unlock new page > > > > > > > 7. free old pages > > > > > > > > > > > > > > Steps 1-4 can all fail, and can all be backed out from without > > > > > > > changing the current state. Steps 5-7 can't fail AFAICT, so we > > > > > > > should be able to run this safely after the allocation without > > > > > > > needing significant error unwinding... > > > > > > > > > > > > > > Thoughts? > > > > > > > > > > > > Possibly. The importance of hot cache is reduced, because we are > > > > > > doing full-page copies, and bulk copies, by definition. But it > > > > > > could still be an issue. The allocations and deallocations could > > > > > > cost a little as well. > > > > > > > > > > They will cost far less than the reduction in allocation overhead > > > > > saves us, and there are potential optimisations there > > > > > > > > An API that doesn't require that, though, should be less overhead > > > > and simpler. > > > > > > > > Is it really going to be a problem to implement block hole punching > > > > in ext4 and gfs2? > > > > > > I can't follow the ext4 code - it's an intricate maze of weird entry > > > and exit points, so I'm not even going to attempt to comment on it. > > Hmm, I was thinking about it and I see two options for how to get out > > of these problems: > > a) Filesystems which are not able to handle hole punching will allow > > multipage writes only after EOF (which can be easily undone by > > truncate in case of failure). 
That should actually cover lots of > > cases we are interested in (I don't expect multipage writes to holes > > to be a common case). Multipage writes to holes are a relatively common operation in the HPC space that XFS is designed for (e.g. calculations on huge sparse matrices), so I'm not really fond of this idea.... > b) E.g. ext4 can do even without hole punching. It can allocate the extent > as 'unwritten' and when something during the write fails, it just > leaves the extent allocated and the 'unwritten' flag makes sure that > any read will see zeros. I suppose that other filesystems that care > about multipage writes are able to do similar things (e.g. btrfs can > do the same as far as I remember, I'm not sure about gfs2). Allocating multipage writes as unwritten extents turns off delayed allocation and hence we'd lose all the benefits that this gives... Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 44+ messages in thread
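For comparison, the XFS_IOC_UNRESVSP ioctl mentioned earlier in the thread is the one hole-punch interface already exposed to userspace by XFS. A hedged sketch of its use, assuming the xfsprogs headers are installed:

#include <fcntl.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <xfs/xfs.h>	/* xfs_flock64_t, XFS_IOC_UNRESVSP64 */

/* Deallocate (punch out) len bytes at offset off in an XFS file;
 * reads of the punched range return zeros afterwards. */
static int punch_hole(int fd, long long off, long long len)
{
	xfs_flock64_t fl = {
		.l_whence = SEEK_SET,	/* l_start is an absolute offset */
		.l_start = off,
		.l_len = len,
	};

	return ioctl(fd, XFS_IOC_UNRESVSP64, &fl);
}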
* Re: [RFC] new ->perform_write fop 2010-05-20 23:05 ` Dave Chinner @ 2010-05-21 9:05 ` Steven Whitehouse 2010-05-21 13:50 ` Josef Bacik 2010-05-21 18:58 ` Jan Kara 2 siblings, 0 replies; 44+ messages in thread From: Steven Whitehouse @ 2010-05-21 9:05 UTC (permalink / raw) To: Dave Chinner Cc: Jan Kara, Nick Piggin, Josef Bacik, linux-fsdevel, chris.mason, hch, akpm, linux-kernel Hi, On Fri, 2010-05-21 at 09:05 +1000, Dave Chinner wrote: > On Thu, May 20, 2010 at 10:12:32PM +0200, Jan Kara wrote: > > On Thu 20-05-10 09:50:54, Dave Chinner wrote: > > > On Wed, May 19, 2010 at 01:09:12AM +1000, Nick Piggin wrote: > > > > On Tue, May 18, 2010 at 10:27:14PM +1000, Dave Chinner wrote: > > > > > > > > Is it really going to be a problem to implement block hole punching > > > > in ext4 and gfs2? > > > [snip] > > b) E.g. ext4 can do even without hole punching. It can allocate extent > > as 'unwritten' and when something during the write fails, it just > > leaves the extent allocated and the 'unwritten' flag makes sure that > > any read will see zeros. I suppose that other filesystems that care > > about multipage writes are able to do similar things (e.g. btrfs can > > do the same as far as I remember, I'm not sure about gfs2). > > Allocating multipage writes as unwritten extents turns off delayed > allocation and hence we'd lose all the benefits that this gives... It should be possible to implement hole punching in GFS2 I think. The main issue is locking order of resource groups. We have on our todo list a rewrite of the truncate/delete code which is currently used to deallocate data blocks and metadata tree blocks. The current algorithm is a rather inefficient recursive scanning of the tree which is done multiple times depending on the tree height. Adapting that to punch holes should be possible without too much effort if we need to do that. We do need to allow for the possibility that such a deallocation might have to be split into multiple transactions depending on the amount of metadata involved (for large files, this could be larger than the size of the log for example). Currently the code will split up truncates into multiple transactions which allows the deallocation to be restartable from any transaction boundary. GFS2 does not have any way to mark unwritten extents, so we cannot do delayed allocation or implement an efficient fallocate. We can do better performance-wise than just dd'ing zeros to a file for fallocate, but we'll never be able to match a fs that can mark extents unwritten in performance terms, Steve. ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [RFC] new ->perform_write fop 2010-05-20 23:05 ` Dave Chinner 2010-05-21 9:05 ` Steven Whitehouse @ 2010-05-21 13:50 ` Josef Bacik 2010-05-21 14:23 ` Nick Piggin 2010-05-22 0:31 ` Dave Chinner 2010-05-21 18:58 ` Jan Kara 2 siblings, 2 replies; 44+ messages in thread From: Josef Bacik @ 2010-05-21 13:50 UTC (permalink / raw) To: Dave Chinner Cc: Jan Kara, Nick Piggin, Josef Bacik, linux-fsdevel, chris.mason, hch, akpm, linux-kernel On Fri, May 21, 2010 at 09:05:24AM +1000, Dave Chinner wrote: > On Thu, May 20, 2010 at 10:12:32PM +0200, Jan Kara wrote: > > On Thu 20-05-10 09:50:54, Dave Chinner wrote: > > > On Wed, May 19, 2010 at 01:09:12AM +1000, Nick Piggin wrote: > > > > On Tue, May 18, 2010 at 10:27:14PM +1000, Dave Chinner wrote: > > > > > On Tue, May 18, 2010 at 08:43:51PM +1000, Nick Piggin wrote: > > > > > > On Tue, May 18, 2010 at 06:05:03PM +1000, Dave Chinner wrote: > > > > > > > On Tue, May 18, 2010 at 04:36:47PM +1000, Nick Piggin wrote: > > > > > > > > Well you could do a large span block allocation at the beginning, > > > > > > > > and then dirty the pagecache one by one like we do right now. > > > > > > > > > > > > > > The problem is that if we fail to allocate a page (e.g. ENOMEM) or > > > > > > > fail the copy (EFAULT) after the block allocation, we have to undo > > > > > > > the allocation we have already completed. If we don't, we leave > > > > > > > uninitialised allocations on disk that will expose stale data. > > > > > > > > > > > > > > In the second case (EFAULT) we might be able to zero the pages to > > > > > > > avoid punching out blocks, but the first case where pages can't be > > > > > > > allocated to cover the block allocated range makes it very > > > > > > > difficult without being able to punch holes in allocated block > > > > > > > ranges. > > > > > > > > > > > > > > AFAIK, only XFS and OCFS2 currently support punching out arbitrary > > > > > > > ranges of allocated blocks from an inode - there is no VFS method > > > > > > > for it, just an ioctl (XFS_IOC_UNRESVSP). > > > > > > > > > > > > > > Hence the way to avoid needing hole punching is to allocate and lock > > > > > > > down all the pages into the page cache first, then do the copy so > > > > > > > they fail before the allocation is done if they are going to fail. > > > > > > > That makes it much, much easier to handle failures.... > > > > > > > > > > > > So it is just a matter of what is exposed as a vfs interface? > > > > > > > > > > More a matter of utilising the functionality most filesystems > > > > > already have and minimising the amount of churn in critical areas of > > > > > filesystem code. Hole punching is not simple, and bugs will likely > > > > > result in a corrupted filesystem. And the hole punching will only > > > > > occur in a hard to trigger corner case, so it's likely that bugs > > > > > will go undetected and filesystems will suffer from random, > > > > > impossible to track down corruptions as a result. > > > > > > > > > > In comparison, adding reserve/unreserve functionality might cause > > > > > block accounting issues if there is a bug, but it won't cause > > > > > on-disk corruption that results in data loss. Hole punching is not > > > > > simple or easy - it's a damn complex way to handle errors and if > > > > > that's all it's required for then we've failed already. > > > > > > > > As I said, we can have a dumb fallback path for filesystems that > > > > don't implement hole punching. 
Clear the blocks past i_size, and > > > > zero out the allocated but not initialized blocks. > > > > > > > > There does not have to be pagecache allocated in order to do this, > > > > you could do direct IO from the zero page in order to do it. > > > > > > I don't see that as a good solution - it's once again a fairly > > > complex way of dealing with the problem, especially as it now means > > > that direct io would fall back to buffered which would fall back to > > > direct IO.... > > > > > > > Hole punching is not only useful there, it is already exposed to > > > > userspace via MADV_REMOVE. > > > > > > That interface is *totally broken*. It has all the same problems as > > > vmtruncate() for removing file blocks (because it uses vmtruncate). > > > It also has the fundamental problem of being called under the mmap_sem, > > > which means that inode locks and therefore de-allocation cannot be > > > executed without the possibility of deadlocks. Fundamentally, hole > > > punching is an inode operation, not a VM operation.... > > > > > > > > > > > > > Basically, once pagecache is marked uptodate, I don't think we should > > > > > > > > > ever put maybe-invalid data into it -- the way to do it is to invalidate > > > > > > > > > that page and put a *new* page in there. > > > > > > > > > > > > > > > > Ok, so let's do that... > > > > > > > > > > > > > > > > > Why? Because user mappings are just one problem, but once you had a > > > > > > > > > user mapping, you can have been subject to get_user_pages, so it could > > > > > > > > > be in the middle of a DMA operation or something. > > > > > > > > > > > > > > > > ... because we already know this behaviour causes problems for > > > > > > > > high end enterprise level features like hardware checksumming IO > > > > > > > > paths. > > > > > > > > > > > > > > > > Hence it seems that a multipage write needs to: > > > > > > > > > > > > > > > > 1. allocate new pages > > > > > > > > 2. attach bufferheads/mapping structures to pages (if required) > > > > > > > > 3. copy data into pages > > > > > > > > 4. allocate space > > > > > > > > 5. for each old page in the range: > > > > > > > > lock page > > > > > > > > invalidate mappings > > > > > > > > clear page uptodate flag > > > > > > > > remove page from page cache > > > > > > > > 6. for each new page: > > > > > > > > map new page to allocated space > > > > > > > > lock new page > > > > > > > > insert new page into pagecache > > > > > > > > update new page state (write_end equivalent) > > > > > > > > unlock new page > > > > > > > > 7. free old pages > > > > > > > > > > > > > > > > Steps 1-4 can all fail, and can all be backed out from without > > > > > > > > changing the current state. Steps 5-7 can't fail AFAICT, so we > > > > > > > > should be able to run this safely after the allocation without > > > > > > > > needing significant error unwinding... > > > > > > > > > > > > > > > > Thoughts? > > > > > > > > > > > > > > Possibly. The importance of hot cache is reduced, because we are > > > > > > > doing full-page copies, and bulk copies, by definition. But it > > > > > > > could still be an issue. The allocations and deallocations could > > > > > > > cost a little as well. > > > > > > > > > > > > They will cost far less than the reduction in allocation overhead > > > > > > saves us, and there are potential optimisations there > > > > > > > > An API that doesn't require that, though, should be less overhead > > > > and simpler. > > > > > > > > Is it really going to be a problem to implement block hole punching > > > > in ext4 and gfs2? 
> > > > > > I can't follow the ext4 code - it's an intricate maze of weird entry > > > and exit points, so I'm not even going to attempt to comment on it. > > Hmm, I was thinking about it and I see two options for how to get out > > of these problems: > > a) Filesystems which are not able to handle hole punching will allow > > multipage writes only after EOF (which can be easily undone by > > truncate in case of failure). That should actually cover lots of > > cases we are interested in (I don't expect multipage writes to holes > > to be a common case). > > Multipage writes to holes are a relatively common operation in the > HPC space that XFS is designed for (e.g. calculations on huge sparse > matrices), so I'm not really fond of this idea.... > > > b) E.g. ext4 can do even without hole punching. It can allocate the extent > > as 'unwritten' and when something during the write fails, it just > > leaves the extent allocated and the 'unwritten' flag makes sure that > > any read will see zeros. I suppose that other filesystems that care > > about multipage writes are able to do similar things (e.g. btrfs can > > do the same as far as I remember, I'm not sure about gfs2). > > Allocating multipage writes as unwritten extents turns off delayed > allocation and hence we'd lose all the benefits that this gives... > I just realized we have another problem, the mmap_sem/page_lock deadlock. Currently BTRFS is susceptible to this, since we don't prefault any of the pages in yet. If we're going to do multi-page writes we're going to need to have a way to fault in all of the iovecs at once, so when we do the pagefault_disable() copy pagefault_enable() we don't just end up copying the first iovec. Nick, have you done something like this already? If not I assume I can just loop through all the iovecs and call fault_in_pages_readable on all of them and be good to go right? Thanks, Josef ^ permalink raw reply [flat|nested] 44+ messages in thread
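The loop Josef is asking about would look something like the sketch below (the helper name is made up). Note the caveats raised in the replies that follow: fault_in_pages_readable() only touches the first and last byte of the range, and the pages can be reclaimed again before the copy, so this can only make the atomic copy likely to succeed, never guarantee it:

/* Hypothetical helper: prefault the user pages backing every iovec
 * before any page locks are taken. Best-effort only - the copy done
 * under pagefault_disable() must still handle a short return. */
static int prefault_iovecs(const struct iovec *iov, unsigned long nr_segs)
{
	unsigned long seg;

	for (seg = 0; seg < nr_segs; seg++) {
		if (fault_in_pages_readable(
				(const char __user *)iov[seg].iov_base,
				iov[seg].iov_len))
			return -EFAULT;
	}
	return 0;
}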
* Re: [RFC] new ->perform_write fop 2010-05-21 13:50 ` Josef Bacik @ 2010-05-21 14:23 ` Nick Piggin 2010-05-21 15:19 ` Josef Bacik 0 siblings, 1 reply; 44+ messages in thread From: Nick Piggin @ 2010-05-21 14:23 UTC (permalink / raw) To: Josef Bacik Cc: Dave Chinner, Jan Kara, linux-fsdevel, chris.mason, hch, akpm, linux-kernel On Fri, May 21, 2010 at 09:50:54AM -0400, Josef Bacik wrote: > On Fri, May 21, 2010 at 09:05:24AM +1000, Dave Chinner wrote: > > Allocating multipage writes as unwritten extents turns off delayed > > allocation and hence we'd lose all the benefits that this gives... > > > > I just realized we have another problem, the mmap_sem/page_lock deadlock. > Currently BTRFS is susceptible to this, since we don't prefault any of the pages > in yet. If we're going to do multi-page writes we're going to need to have a > way to fault in all of the iovecs at once, so when we do the > pagefault_disable() copy pagefault_enable() we don't just end up copying the > first iovec. Nick, have you done something like this already? If not I assume > I can just loop through all the iovecs and call fault_in_pages_readable on all > of them and be good to go right? Thanks, Yes, well it's a different issue. With multi-page writes, even a single iovec may not be faulted in properly. And with multiple iovecs, we are already suboptimal with faulting. Faulting in multiple iovecs may already be a good idea. I didn't add that code, I had hoped for a test case first, but perhaps we can just go and add it. With multipage writes, we would want to fault in multiple source pages at once if the design was to lock multiple pages at once and do the copy. I still think we might be able to just lock and copy one page at a time, but I could be wrong. Oh wow, btrfs is deadlocky there. Firstly, fault_in_pages_readable does not guarantee success (race can always unmap the page in the meantime). Secondly, calling it inside the page lock section just means it will cause the deadlock rather than the copy_from_user. Quick workaround to reduce probability is to do fault_in_pages_readable calls before locking the pages. But you really need to handle the short-copy case. From the error handling there, it seems like you can just free_reserved_data_space and retry the copy? ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [RFC] new ->perform_write fop 2010-05-21 14:23 ` Nick Piggin @ 2010-05-21 15:19 ` Josef Bacik 2010-05-24 3:29 ` Nick Piggin 0 siblings, 1 reply; 44+ messages in thread From: Josef Bacik @ 2010-05-21 15:19 UTC (permalink / raw) To: Nick Piggin Cc: Josef Bacik, Dave Chinner, Jan Kara, linux-fsdevel, chris.mason, hch, akpm, linux-kernel On Sat, May 22, 2010 at 12:23:54AM +1000, Nick Piggin wrote: > On Fri, May 21, 2010 at 09:50:54AM -0400, Josef Bacik wrote: > > On Fri, May 21, 2010 at 09:05:24AM +1000, Dave Chinner wrote: > > > Allocating multipage writes as unwritten extents turns off delayed > > > allocation and hence we'd lose all the benefits that this gives... > > > > > > > I just realized we have another problem, the mmap_sem/page_lock deadlock. > > Currently BTRFS is susceptible to this, since we don't prefault any of the pages > > in yet. If we're going to do multi-page writes we're going to need to have a > > way to fault in all of the iovecs at once, so when we do the > > pagefault_disable() copy pagefault_enable() we don't just end up copying the > > first iovec. Nick, have you done something like this already? If not I assume > > I can just loop through all the iovecs and call fault_in_pages_readable on all > > of them and be good to go right? Thanks, > > Yes, well it's a different issue. With multi-page writes, even a single > iovec may not be faulted in properly. And with multiple iovecs, we are > already suboptimal with faulting. > > Faulting in multiple iovecs may already be a good idea. I didn't add > that code, I had hoped for a test case first, but perhaps we can just > go and add it. > > With multipage writes, we would want to fault in multiple source pages > at once if the design was to lock multiple pages at once and do the > copy. I still think we might be able to just lock and copy one page at > a time, but I could be wrong. > I was thinking about this possibility; it seems this is what FUSE does already. It would probably be easier to deal with this fault problem, just do all the fs stuff to prepare for the total amount, do the copy one page at a time, and then if something goes wrong we can clean up properly. > Oh wow, btrfs is deadlocky there. Firstly, fault_in_pages_readable does > not guarantee success (race can always unmap the page in the meantime). > Secondly, calling it inside the page lock section just means it will > cause the deadlock rather than the copy_from_user. > > Quick workaround to reduce probability is to do fault_in_pages_readable > calls before locking the pages. > > But you really need to handle the short-copy case. From the error > handling there, it seems like you can just free_reserved_data_space and > retry the copy? > Well no, if there's a short copy we just exit. If we do the fault_in_pages_readable before we do the prepare_pages we could deal with a short-copy then. Thanks, Josef ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [RFC] new ->perform_write fop 2010-05-21 15:19 ` Josef Bacik @ 2010-05-24 3:29 ` Nick Piggin 0 siblings, 0 replies; 44+ messages in thread From: Nick Piggin @ 2010-05-24 3:29 UTC (permalink / raw) To: Josef Bacik Cc: Dave Chinner, Jan Kara, linux-fsdevel, chris.mason, hch, akpm, linux-kernel On Fri, May 21, 2010 at 11:19:22AM -0400, Josef Bacik wrote: > On Sat, May 22, 2010 at 12:23:54AM +1000, Nick Piggin wrote: > > On Fri, May 21, 2010 at 09:50:54AM -0400, Josef Bacik wrote: > > > On Fri, May 21, 2010 at 09:05:24AM +1000, Dave Chinner wrote: > > > > Allocating multipage writes as unwritten extents turns off delayed > > > > allocation and hence we'd lose all the benefits that this gives... > > > > > > > > > > I just realized we have another problem, the mmap_sem/page_lock deadlock. > > > Currently BTRFS is susceptible to this, since we don't prefault any of the pages > > > in yet. If we're going to do multi-page writes we're going to need to have a > > > way to fault in all of the iovecs at once, so when we do the > > > pagefault_disable() copy pagefault_enable() we don't just end up copying the > > > first iovec. Nick, have you done something like this already? If not I assume > > > I can just loop through all the iovecs and call fault_in_pages_readable on all > > > of them and be good to go right? Thanks, > > > > Yes, well it's a different issue. With multi-page writes, even a single > > iovec may not be faulted in properly. And with multiple iovecs, we are > > already suboptimal with faulting. > > > > Faulting in multiple iovecs may already be a good idea. I didn't add > > that code, I had hoped for a test case first, but perhaps we can just > > go and add it. > > > > With multipage writes, we would want to fault in multiple source pages > > at once if the design was to lock multiple pages at once and do the > > copy. I still think we might be able to just lock and copy one page at > > a time, but I could be wrong. > > > > I was thinking about this possibility; it seems this is what FUSE does already. > It would probably be easier to deal with this fault problem, just do all the fs > stuff to prepare for the total amount, do the copy one page at a time, and then > if something goes wrong we can clean up properly. Yep. The fewer pages we have to copy to/from at once, the more robust it should be. > > Oh wow, btrfs is deadlocky there. Firstly, fault_in_pages_readable does > > not guarantee success (race can always unmap the page in the meantime). > > Secondly, calling it inside the page lock section just means it will > > cause the deadlock rather than the copy_from_user. > > > > Quick workaround to reduce probability is to do fault_in_pages_readable > > calls before locking the pages. > > > > But you really need to handle the short-copy case. From the error > > handling there, it seems like you can just free_reserved_data_space and > > retry the copy? > > > > Well no, if there's a short copy we just exit. If we do the > fault_in_pages_readable before we do the prepare_pages we could deal with a > short-copy then. Thanks, Take a look at the code in generic buffered write. It's not hard, but it is easy to get the error case wrong. You need to fault in the source pages before holding the page lock. If this fails, then you may exit. After locking the pages, you need to do a pagefault_disable() and atomic kmaps/usercopies. If this copy comes up short, you need to unlock and retry faulting in the source pages. 
The reason for this is that the source page may be unmapped but there is still a valid memory mapping at the point of the usercopy. So you must not fail that. And, if some bytes have been copied into pagecache, and if that pagecache page is marked as uptodate, then you must treat it as a partial write. ^ permalink raw reply [flat|nested] 44+ messages in thread
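Condensed, the pattern Nick describes is the one the generic buffered write loop already uses. A simplified per-page sketch, modelled on generic_perform_write() (error paths trimmed, iov_iter state assumed to be advanced by the caller):

static ssize_t copy_one_page(struct file *file, struct address_space *mapping,
			     const struct address_space_operations *a_ops,
			     struct iov_iter *i, loff_t pos)
{
	unsigned long offset = pos & (PAGE_CACHE_SIZE - 1);
	size_t bytes = min_t(unsigned long, PAGE_CACHE_SIZE - offset,
			     iov_iter_count(i));
	struct page *page;
	void *fsdata;
	size_t copied;
	int status;

again:
	/* Prefault the source while no page lock is held. */
	if (iov_iter_fault_in_readable(i, bytes))
		return -EFAULT;

	status = a_ops->write_begin(file, mapping, pos, bytes, 0,
				    &page, &fsdata);	/* locks the page */
	if (status)
		return status;

	/* Copy with page faults disabled, so a fault on the source
	 * cannot deadlock on the page lock we now hold. */
	pagefault_disable();
	copied = iov_iter_copy_from_user_atomic(page, i, offset, bytes);
	pagefault_enable();
	flush_dcache_page(page);

	status = a_ops->write_end(file, mapping, pos, bytes, copied,
				  page, fsdata);	/* unlocks the page */
	if (status < 0)
		return status;

	if (unlikely(copied == 0)) {
		/* Short copy: the source was unmapped between the
		 * prefault and the atomic copy. Shrink the window to a
		 * single segment and retry - do not fail the write. */
		bytes = min_t(unsigned long, PAGE_CACHE_SIZE - offset,
			      iov_iter_single_seg_count(i));
		goto again;
	}
	return copied;
}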
* Re: [RFC] new ->perform_write fop 2010-05-21 13:50 ` Josef Bacik 2010-05-21 14:23 ` Nick Piggin @ 2010-05-22 0:31 ` Dave Chinner 1 sibling, 0 replies; 44+ messages in thread From: Dave Chinner @ 2010-05-22 0:31 UTC (permalink / raw) To: Josef Bacik Cc: Jan Kara, Nick Piggin, linux-fsdevel, chris.mason, hch, akpm, linux-kernel On Fri, May 21, 2010 at 09:50:54AM -0400, Josef Bacik wrote: > On Fri, May 21, 2010 at 09:05:24AM +1000, Dave Chinner wrote: > > On Thu, May 20, 2010 at 10:12:32PM +0200, Jan Kara wrote: > > > On Thu 20-05-10 09:50:54, Dave Chinner wrote: > > > b) E.g. ext4 can do even without hole punching. It can allocate the extent > > > as 'unwritten' and when something during the write fails, it just > > > leaves the extent allocated and the 'unwritten' flag makes sure that > > > any read will see zeros. I suppose that other filesystems that care > > > about multipage writes are able to do similar things (e.g. btrfs can > > > do the same as far as I remember, I'm not sure about gfs2). > > > > Allocating multipage writes as unwritten extents turns off delayed > > allocation and hence we'd lose all the benefits that this gives... > > > > I just realized we have another problem, the mmap_sem/page_lock deadlock. > Currently BTRFS is susceptible to this, since we don't prefault any of the pages > in yet. If we're going to do multi-page writes we're going to need to have a > way to fault in all of the iovecs at once, so when we do the > pagefault_disable() copy pagefault_enable() we don't just end up copying the > first iovec. Nick, have you done something like this already? I have patches that already do this, but the big issue is that it is inherently racy. The prefaulting does not guarantee that, by the time we disable page faults, the prefaulted page has not already been reclaimed. Basically we have to design for EFAULT occurring because even pre-faulting does not prevent it from occurring. > If not I assume > I can just loop through all the iovecs and call fault_in_pages_readable on all > of them and be good to go right? Thanks, That's effectively what I've done, but it's still no guarantee. Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [RFC] new ->perform_write fop 2010-05-20 23:05 ` Dave Chinner 2010-05-21 9:05 ` Steven Whitehouse 2010-05-21 13:50 ` Josef Bacik @ 2010-05-21 18:58 ` Jan Kara 2010-05-22 0:27 ` Dave Chinner 2 siblings, 1 reply; 44+ messages in thread From: Jan Kara @ 2010-05-21 18:58 UTC (permalink / raw) To: Dave Chinner Cc: Jan Kara, Nick Piggin, Josef Bacik, linux-fsdevel, chris.mason, hch, akpm, linux-kernel On Fri 21-05-10 09:05:24, Dave Chinner wrote: > On Thu, May 20, 2010 at 10:12:32PM +0200, Jan Kara wrote: > > Hmm, I was thinking about it and I see two options for how to get out > > of these problems: > > a) Filesystems which are not able to handle hole punching will allow > > multipage writes only after EOF (which can be easily undone by > > truncate in case of failure). That should actually cover lots of > > cases we are interested in (I don't expect multipage writes to holes > > to be a common case). > > Multipage writes to holes are a relatively common operation in the > HPC space that XFS is designed for (e.g. calculations on huge sparse > matrices), so I'm not really fond of this idea.... Well, XFS could still handle them because it is able to do hole punching, but I get your point. > > b) E.g. ext4 can do even without hole punching. It can allocate the extent > > as 'unwritten' and when something during the write fails, it just > > leaves the extent allocated and the 'unwritten' flag makes sure that > > any read will see zeros. I suppose that other filesystems that care > > about multipage writes are able to do similar things (e.g. btrfs can > > do the same as far as I remember, I'm not sure about gfs2). > > Allocating multipage writes as unwritten extents turns off delayed > allocation and hence we'd lose all the benefits that this gives... Ah, sorry. That was a short-circuit in my brain. But when we do delayed allocation I don't see why we should actually do any hole punching... The write needs to: a) reserve enough blocks for the write - I don't know about other filesystems but for ext4 this means just incrementing a counter. b) copy data page by page c) release part of reservation (i.e. decrement counter) if we actually copied less than we originally thought. Am I missing something? Honza -- Jan Kara <jack@suse.cz> SUSE Labs, CR ^ permalink raw reply [flat|nested] 44+ messages in thread
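Jan's steps a)-c) amount to nothing more than counter arithmetic. A minimal sketch, assuming only a per-superblock percpu counter of free blocks (the helper names are made up for illustration; the racy read is acceptable in a sketch, and ext4's real code is more careful about the percpu error bound):

/* a) reserve worst-case blocks up front; the only failure is ENOSPC */
static int fs_reserve_blocks(struct percpu_counter *free_blocks, s64 nblocks)
{
	if (percpu_counter_read_positive(free_blocks) < nblocks)
		return -ENOSPC;
	percpu_counter_sub(free_blocks, nblocks);
	return 0;
}

/* c) return whatever the copy did not consume; this cannot fail */
static void fs_unreserve_blocks(struct percpu_counter *free_blocks, s64 nblocks)
{
	percpu_counter_add(free_blocks, nblocks);
}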
* Re: [RFC] new ->perform_write fop 2010-05-21 18:58 ` Jan Kara @ 2010-05-22 0:27 ` Dave Chinner 2010-05-24 9:20 ` Jan Kara 0 siblings, 1 reply; 44+ messages in thread From: Dave Chinner @ 2010-05-22 0:27 UTC (permalink / raw) To: Jan Kara Cc: Nick Piggin, Josef Bacik, linux-fsdevel, chris.mason, hch, akpm, linux-kernel On Fri, May 21, 2010 at 08:58:46PM +0200, Jan Kara wrote: > On Fri 21-05-10 09:05:24, Dave Chinner wrote: > > On Thu, May 20, 2010 at 10:12:32PM +0200, Jan Kara wrote: > > > b) E.g. ext4 can do even without hole punching. It can allocate the extent > > > as 'unwritten' and when something during the write fails, it just > > > leaves the extent allocated and the 'unwritten' flag makes sure that > > > any read will see zeros. I suppose that other filesystems that care > > > about multipage writes are able to do similar things (e.g. btrfs can > > > do the same as far as I remember, I'm not sure about gfs2). > > > > Allocating multipage writes as unwritten extents turns off delayed > > allocation and hence we'd lose all the benefits that this gives... > Ah, sorry. That was a short-circuit in my brain. But when we do delayed allocation > I don't see why we should actually do any hole punching... The write needs > to: > a) reserve enough blocks for the write - I don't know about other > filesystems but for ext4 this means just incrementing a counter. > b) copy data page by page > c) release part of reservation (i.e. decrement counter) if we actually > copied less than we originally thought. > > Am I missing something? Possibly. Delayed allocation is made up of two parts - space reservation and recording the regions of delayed allocation in an extent tree, page/bufferhead state or both. In XFS, these two steps happen in the same get_blocks call, but the result of that is we have to truncate/punch delayed allocate extents out just like normal extents if we are not going to use them. Hence a reserve/allocate interface allows us to split the operation - reserve ensures we have space for the delayed allocation, allocate inserts the delayed extents into the inode extent tree for later real allocation during writeback. Hence the unreserve call can simply be accounting - it has no requirement to punch out delayed extents that may have already been allocated, just do work on counters. btrfs already has this split design - it reserves space, does the copy, then marks the extent ranges as delalloc once the copy has succeeded, otherwise it simply unreserves the unused space. Once again, I don't know if ext4 does this internal delayed allocation extent tracking or whether it just uses page state to track those extents, but it would probably still have to use the allocate call to mark all the pages/bufferheads as delalloc so that unreserve didn't have to do any extra work. Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 44+ messages in thread
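Expressed as an interface, the split Dave describes might look like the sketch below. This is not an existing kernel API, just the shape of the proposal; btrfs's internal helpers (e.g. the free_reserved_data_space() call mentioned earlier in the thread) already follow roughly this pattern:

/* Hypothetical write-path methods: ->reserve() does only ENOSPC/EDQUOT
 * accounting and is trivially undoable via ->unreserve(); ->allocate()
 * records the delalloc extents once the user copy has succeeded, and
 * writeback later converts them to real extents. */
struct write_ops {
	int  (*reserve)(struct inode *inode, loff_t pos, size_t len);
	int  (*allocate)(struct inode *inode, loff_t pos, size_t len);
	void (*unreserve)(struct inode *inode, loff_t pos, size_t len);
};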
* Re: [RFC] new ->perform_write fop 2010-05-22 0:27 ` Dave Chinner @ 2010-05-24 9:20 ` Jan Kara 2010-05-24 9:33 ` Nick Piggin 2010-06-05 15:05 ` tytso 0 siblings, 2 replies; 44+ messages in thread From: Jan Kara @ 2010-05-24 9:20 UTC (permalink / raw) To: Dave Chinner Cc: Jan Kara, Nick Piggin, Josef Bacik, linux-fsdevel, chris.mason, hch, akpm, linux-kernel On Sat 22-05-10 10:27:59, Dave Chinner wrote: > On Fri, May 21, 2010 at 08:58:46PM +0200, Jan Kara wrote: > > On Fri 21-05-10 09:05:24, Dave Chinner wrote: > > > On Thu, May 20, 2010 at 10:12:32PM +0200, Jan Kara wrote: > > > > b) E.g. ext4 can do even without hole punching. It can allocate the extent > > > > as 'unwritten' and when something during the write fails, it just > > > > leaves the extent allocated and the 'unwritten' flag makes sure that > > > > any read will see zeros. I suppose that other filesystems that care > > > > about multipage writes are able to do similar things (e.g. btrfs can > > > > do the same as far as I remember, I'm not sure about gfs2). > > > > > > Allocating multipage writes as unwritten extents turns off delayed > > > allocation and hence we'd lose all the benefits that this gives... > > Ah, sorry. That was a short-circuit in my brain. But when we do delayed allocation > > I don't see why we should actually do any hole punching... The write needs > > to: > > a) reserve enough blocks for the write - I don't know about other > > filesystems but for ext4 this means just incrementing a counter. > > b) copy data page by page > > c) release part of reservation (i.e. decrement counter) if we actually > > copied less than we originally thought. > > > > Am I missing something? > > Possibly. Delayed allocation is made up of two parts - space > reservation and recording the regions of delayed allocation in an > extent tree, page/bufferhead state or both. Yes. Ext4 records the info about delayed allocation only in buffer heads. > In XFS, these two steps happen in the same get_blocks call, but the > result of that is we have to truncate/punch delayed allocate extents > out just like normal extents if we are not going to use them. Hence > a reserve/allocate interface allows us to split the operation - > reserve ensures we have space for the delayed allocation, allocate > inserts the delayed extents into the inode extent tree for later > real allocation during writeback. Hence the unreserve call can > simply be accounting - it has no requirement to punch out delayed > extents that may have already been allocated, just do work on > counters. > > btrfs already has this split design - it reserves space, does the > copy, then marks the extent ranges as delalloc once the copy has > succeeded, otherwise it simply unreserves the unused space. > > Once again, I don't know if ext4 does this internal delayed > allocation extent tracking or whether it just uses page state to > track those extents, but it would probably still have to use the > allocate call to mark all the pages/bufferheads as delalloc so > that unreserve didn't have to do any extra work. Yes, exactly. I just wanted to point out that AFAICS ext4 can implement proper error recovery without a need for a 'punch' operation. So after all Nick's copy page-by-page should be plausible at least for ext4. Honza -- Jan Kara <jack@suse.cz> SUSE Labs, CR ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [RFC] new ->perform_write fop 2010-05-24 9:20 ` Jan Kara @ 2010-05-24 9:33 ` Nick Piggin 0 siblings, 0 replies; 44+ messages in thread From: Nick Piggin @ 2010-05-24 9:33 UTC (permalink / raw) To: Jan Kara Cc: Dave Chinner, Josef Bacik, linux-fsdevel, chris.mason, hch, akpm, linux-kernel On Mon, May 24, 2010 at 11:20:34AM +0200, Jan Kara wrote: > On Sat 22-05-10 10:27:59, Dave Chinner wrote: > > On Fri, May 21, 2010 at 08:58:46PM +0200, Jan Kara wrote: > > > On Fri 21-05-10 09:05:24, Dave Chinner wrote: > > > > On Thu, May 20, 2010 at 10:12:32PM +0200, Jan Kara wrote: > > > > > b) E.g. ext4 can do even without hole punching. It can allocate the extent > > > > > as 'unwritten' and when something during the write fails, it just > > > > > leaves the extent allocated and the 'unwritten' flag makes sure that > > > > > any read will see zeros. I suppose that other filesystems that care > > > > > about multipage writes are able to do similar things (e.g. btrfs can > > > > > do the same as far as I remember, I'm not sure about gfs2). > > > > > > > > Allocating multipage writes as unwritten extents turns off delayed > > > > allocation and hence we'd lose all the benefits that this gives... > > > Ah, sorry. That was a short-circuit in my brain. But when we do delayed allocation > > > I don't see why we should actually do any hole punching... The write needs > > > to: > > > a) reserve enough blocks for the write - I don't know about other > > > filesystems but for ext4 this means just incrementing a counter. > > > b) copy data page by page > > > c) release part of reservation (i.e. decrement counter) if we actually > > > copied less than we originally thought. > > > > > > Am I missing something? > > > > Possibly. Delayed allocation is made up of two parts - space > > reservation and recording the regions of delayed allocation in an > > extent tree, page/bufferhead state or both. > Yes. Ext4 records the info about delayed allocation only in buffer > heads. > > > In XFS, these two steps happen in the same get_blocks call, but the > > result of that is we have to truncate/punch delayed allocate extents > > out just like normal extents if we are not going to use them. Hence > > a reserve/allocate interface allows us to split the operation - > > reserve ensures we have space for the delayed allocation, allocate > > inserts the delayed extents into the inode extent tree for later > > real allocation during writeback. Hence the unreserve call can > > simply be accounting - it has no requirement to punch out delayed > > extents that may have already been allocated, just do work on > > counters. > > > > btrfs already has this split design - it reserves space, does the > > copy, then marks the extent ranges as delalloc once the copy has > > succeeded, otherwise it simply unreserves the unused space. > > > > Once again, I don't know if ext4 does this internal delayed > > allocation extent tracking or whether it just uses page state to > > track those extents, but it would probably still have to use the > > allocate call to mark all the pages/bufferheads as delalloc so > > that unreserve didn't have to do any extra work. > Yes, exactly. I just wanted to point out that AFAICS ext4 can implement > proper error recovery without a need for a 'punch' operation. So after all > Nick's copy page-by-page should be plausible at least for ext4. Great. 
AFAIKS, any filesystem that does not leak uninitialized data on IO error or crash when allocating writeback cache over holes should already have enough information to recover properly from a short-copy type of error today. Otherwise, an IO error or crash seems like quite a similar problem from the point of view of the filesystem. Now perhaps it can be recovered only in an fsck-type operation that is far too expensive to do in a normal error path, which sounds like XFS. So possibly we could have 2 APIs, one for filesystems like XFS, but I don't think we should penalise ones like ext4 which can handle this situation. ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [RFC] new ->perform_write fop 2010-05-24 9:20 ` Jan Kara 2010-05-24 9:33 ` Nick Piggin @ 2010-06-05 15:05 ` tytso 2010-06-06 7:59 ` Nick Piggin 1 sibling, 1 reply; 44+ messages in thread From: tytso @ 2010-06-05 15:05 UTC (permalink / raw) To: Jan Kara Cc: Dave Chinner, Nick Piggin, Josef Bacik, linux-fsdevel, chris.mason, hch, akpm, linux-kernel On Mon, May 24, 2010 at 11:20:34AM +0200, Jan Kara wrote: > Yes, exactly. I just wanted to point out that AFAICS ext4 can implement > proper error recovery without a need for a 'punch' operation. So after all > Nick's copy page-by-page should be plausible at least for ext4. Sorry for my late response to this thread; I've been busy catching up on a number of other fronts, so I didn't have a chance to go through this thread until now. First of all, I'm not against implementing a 'punch' operation for ext4; I've actually toyed with this idea before. Secondly, I'm not sure it's really necessary; we already have a code path (which I was planning on making the default when I have a chance to rewrite ext4_writepages) where the blocks are initially allocated with the 'uninitialized' flag in the extent tree; this is the same flag used for fallocate(2) support when we allocate blocks without filling in the data blocks. Then, when the block I/O completes, we use the block I/O callback to clear the uninit flag in the extent tree. This is currently used to safely avoid locking in the read path, which is needed to speed up access for extremely fast (think Fusion I/O-like) flash devices. I was already thinking about using this trick in my planned ext4_writepages() rewrite, and if it turns out we have common code that also assumes that file systems can do the equivalent of fallocate(2) and can clear the uninitialized bit on a callback, I think that makes ext4 fairly similar to what XFS does, at least at the high level, doesn't it? Note that strictly speaking this isn't a 'punch' operation in this case; it's rather an fallocate(2) where, on error, we simply don't convert the extent to mark the data blocks as valid, which is not quite the same as a 'punch' operation. Am I missing something? - Ted ^ permalink raw reply [flat|nested] 44+ messages in thread
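In outline, the trick Ted describes needs no undo path at all: the extent is allocated uninitialized, and only the I/O completion callback makes the data visible. A hypothetical sketch - every helper name here is invented for illustration; ext4's real implementation spreads this logic across its extent and end_io code:

/* Write into an uninitialized ("unwritten") extent. If anything fails
 * before completion, the extent simply stays uninitialized and reads
 * of the range keep returning zeros - no hole punch required. */
static int write_via_uninit_extent(struct inode *inode, loff_t pos, size_t len)
{
	int err;

	err = fs_alloc_extent_uninit(inode, pos, len);	/* fallocate-like */
	if (err)
		return err;
	return fs_copy_and_submit_io(inode, pos, len);
}

/* I/O completion callback: good data is on disk, so flip the flag. */
static void fs_end_io_convert(struct inode *inode, loff_t pos, size_t len)
{
	fs_convert_uninit_extent(inode, pos, len);
}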
* Re: [RFC] new ->perform_write fop 2010-06-05 15:05 ` tytso @ 2010-06-06 7:59 ` Nick Piggin 0 siblings, 0 replies; 44+ messages in thread From: Nick Piggin @ 2010-06-06 7:59 UTC (permalink / raw) To: tytso, Jan Kara, Dave Chinner, Josef Bacik, linux-fsdevel, chris.mason, hch On Sat, Jun 05, 2010 at 11:05:23AM -0400, tytso@mit.edu wrote: > On Mon, May 24, 2010 at 11:20:34AM +0200, Jan Kara wrote: > > Yes, exactly. I just wanted to point out that AFAICS ext4 can implement > > proper error recovery without a need for a 'punch' operation. So after all > > Nick's copy page-by-page should be plausible at least for ext4. > > Sorry for my late response to this thread; I've been busy catching up > on a number of other fronts, so I didn't have a chance to go through > this thread until now. > > First of all, I'm not against implementing a 'punch' operation for > ext4; I've actually toyed with this idea before. > > Secondly, I'm not sure it's really necessary; we already have a code > path (which I was planning on making the default when I have a > chance to rewrite ext4_writepages) where the blocks are initially > allocated with the 'uninitialized' flag in the extent tree; this is > the same flag used for fallocate(2) support when we allocate blocks > without filling in the data blocks. Then, when the block I/O > completes, we use the block I/O callback to clear the uninit flag in > the extent tree. This is currently used to safely avoid locking > in the read path, which is needed to speed up access for extremely > fast (think Fusion I/O-like) flash devices. > > I was already thinking about using this trick in my planned > ext4_writepages() rewrite, and if it turns out we have common code > that also assumes that file systems can do the equivalent of fallocate(2) > and can clear the uninitialized bit on a callback, I think that makes > ext4 fairly similar to what XFS does, at least at the high level, > doesn't it? > > Note that strictly speaking this isn't a 'punch' operation in this > case; it's rather an fallocate(2) where, on error, we simply don't > convert the extent to mark the data blocks as valid, which is not quite > the same as a 'punch' operation. > > Am I missing something? No, this is fine, it's actually better than a punch operation from an error recovery point of view because it wouldn't require further modifications to the filesystem in the error case. AFAIKS this 'uninitialised blocks' approach seems to be the most optimal way to do block allocations that are not tightly coupled with the pagecache. Do you mean ext4's file_write path? ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [RFC] new ->perform_write fop 2010-05-14 7:22 ` Nick Piggin 2010-05-14 8:38 ` Dave Chinner @ 2010-05-21 15:15 ` Christoph Hellwig 2010-05-22 2:31 ` Nick Piggin 1 sibling, 1 reply; 44+ messages in thread From: Christoph Hellwig @ 2010-05-21 15:15 UTC (permalink / raw) To: Nick Piggin Cc: Dave Chinner, Josef Bacik, linux-fsdevel, chris.mason, hch, akpm, linux-kernel Nick, what exactly is the problem with the reserve + allocate design? In a delalloc filesystem (which is all those that will care about high performance large writes) the write path fundamentally consists of those two operations. Getting rid of the get_blocks mess and replacing it with a dedicated operations vector will simplify things a lot. Punching holes is a rather problematic operation, and as mentioned not actually implemented for most filesystems - just decrementing counters on errors massively increases the chances that our error handling will actually work. ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [RFC] new ->perform_write fop 2010-05-21 15:15 ` Christoph Hellwig @ 2010-05-22 2:31 ` Nick Piggin 2010-05-22 8:37 ` Dave Chinner 0 siblings, 1 reply; 44+ messages in thread From: Nick Piggin @ 2010-05-22 2:31 UTC (permalink / raw) To: Christoph Hellwig Cc: Dave Chinner, Josef Bacik, linux-fsdevel, chris.mason, akpm, linux-kernel On Fri, May 21, 2010 at 11:15:18AM -0400, Christoph Hellwig wrote: > Nick, what exactly is the problem with the reserve + allocate design? > > In a delalloc filesystem (which is all those that will care about high > performance large writes) the write path fundamentally consists of those > two operations. Getting rid of the get_blocks mess and replacing it > with a dedicated operations vector will simplify things a lot. Nothing wrong with it, I think it's a fine idea (although you may still need a per-bh call to connect the fs metadata to each page). I just much prefer to have operations after the copy not able to fail, otherwise you get into all those pagecache corner cases. BTW, when you say reserve + allocate, this is in the page-dirty path, right? I thought delalloc filesystems tend to do the actual allocation in the page-cleaning path? Or am I confused? > Punching holes is a rather problematic operation, and as mentioned not > actually implemented for most filesystems - just decrementing counters > on errors massively increases the chances that our error handling will > actually work. It's just harder for the pagecache. Invalidating and throwing out old pagecache and splicing in new pages seems a bit of a hack. ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [RFC] new ->perform_write fop 2010-05-22 2:31 ` Nick Piggin @ 2010-05-22 8:37 ` Dave Chinner 2010-05-24 3:09 ` Nick Piggin 0 siblings, 1 reply; 44+ messages in thread From: Dave Chinner @ 2010-05-22 8:37 UTC (permalink / raw) To: Nick Piggin Cc: Christoph Hellwig, Josef Bacik, linux-fsdevel, chris.mason, akpm, linux-kernel On Sat, May 22, 2010 at 12:31:02PM +1000, Nick Piggin wrote: > On Fri, May 21, 2010 at 11:15:18AM -0400, Christoph Hellwig wrote: > > Nick, what exactly is the problem with the reserve + allocate design? > > > > In a delalloc filesystem (which is all those that will care about high > > performance large writes) the write path fundamentally consists of those > > two operations. Getting rid of the get_blocks mess and replacing it > > with a dedicated operations vector will simplify things a lot. > > Nothing wrong with it, I think it's a fine idea (although you may still > need a per-bh call to connect the fs metadata to each page). > > I just much prefer to have operations after the copy not able to fail, > otherwise you get into all those pagecache corner cases. > > BTW, when you say reserve + allocate, this is in the page-dirty path, > right? I thought delalloc filesystems tend to do the actual allocation > in the page-cleaning path? Or am I confused? See my reply to Jan - delayed allocate has two parts to it - space reservation (accounting for ENOSPC) and recording of the delalloc extents (allocate). This is separate to the writeback path where we convert delalloc extents to real extents.... > > Punching holes is a rather problematic operation, and as mentioned not > > actually implemented for most filesystems - just decrementing counters > > on errors massively increases the chances that our error handling will > > actually work. > > It's just harder for the pagecache. Invalidating and throwing out old > pagecache and splicing in new pages seems a bit of a hack. Hardly a hack - it turns a buffered write into an operation that does not expose transient page state and hence prevents torn writes. That will allow us to use DIF-enabled storage paths for buffered filesystem IO(*), perhaps even allow us to generate checksums during copy-in to do end-to-end checksum protection of data.... Cheers, Dave. (*) Yes, I know that mmap writes will still break DIF even if we do writes this way. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [RFC] new ->perform_write fop 2010-05-22 8:37 ` Dave Chinner @ 2010-05-24 3:09 ` Nick Piggin 2010-05-24 5:53 ` Dave Chinner 0 siblings, 1 reply; 44+ messages in thread From: Nick Piggin @ 2010-05-24 3:09 UTC (permalink / raw) To: Dave Chinner Cc: Christoph Hellwig, Josef Bacik, linux-fsdevel, chris.mason, akpm, linux-kernel On Sat, May 22, 2010 at 06:37:03PM +1000, Dave Chinner wrote: > On Sat, May 22, 2010 at 12:31:02PM +1000, Nick Piggin wrote: > > On Fri, May 21, 2010 at 11:15:18AM -0400, Christoph Hellwig wrote: > > > Nick, what exactly is the problem with the reserve + allocate design? > > > > > > In a delalloc filesystem (which is all those that will care about high > > > performance large writes) the write path fundamentally consists of those > > > two operations. Getting rid of the get_blocks mess and replacing it > > > with a dedicated operations vector will simplify things a lot. > > > > Nothing wrong with it, I think it's a fine idea (although you may still > > need a per-bh call to connect the fs metadata to each page). > > > > I just much prefer to have operations after the copy not able to fail, > > otherwise you get into all those pagecache corner cases. > > > > BTW. when you say reserve + allocate, this is in the page-dirty path, > > right? I thought delalloc filesystems tend to do the actual allocation > > in the page-cleaning path? Or am I confused? > > See my reply to Jan - delayed allocate has two parts to it - space > reservation (accounting for ENOSPC) and recording of the delalloc extents > (allocate). This is separate to the writeback path where we convert > delalloc extents to real extents.... Yes I saw that. I'm sure we'll want clearer terminology in the core code. But I don't quite know why you need to do it in 2 parts (reserve, then "allocate"). Surely even reservation failures are very rare, and obviously the error handling is required, so why not just do a single call? > > > Punching holes is a rather problematic operation, and as mentioned not > > > actually implemented for most filesystems - just decrementing counters > > > on errors increases the chances that our error handling will actually > > > work massively. > > > > It's just harder for the pagecache. Invalidating and throwing out old > > pagecache and splicing in new pages seems a bit of a hack. > > Hardly a hack - it turns a buffered write into an operation that > does not expose transient page state and hence prevents torn writes. > That will allow us to use DIF enabled storage paths for buffered > filesystem IO(*), perhaps even allow us to generate checksums during > copy-in to do end-to-end checksum protection of data.... It is a hack. Invalidating is inherently racy and isn't guaranteed to succeed. You do not need to invalidate the pagecache to do this (which as I said is racy). You need to lock the page to prevent writes, and then unmap user mappings. You'd also need to have done some magic so writable mmaps don't leak into get_user_pages. But this should be a different discussion anyway. Don't forget, your approach is forced into the invalidation requirement because of downsides in its error handling sequence. That cannot be construed as positive, because you are forced into it, whereas other approaches *could* use it, but do not have to. ^ permalink raw reply [flat|nested] 44+ messages in thread
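For reference, the page-stabilising alternative Nick sketches maps onto existing kernel primitives roughly as follows. This is an illustrative fragment, not a complete patch; 'page' and 'mapping' are assumed to be the cached page and its address_space, and, as he notes, pages already pinned via get_user_pages() are not covered.

    /*
     * Sketch: stabilise a cached page's contents without removing
     * it from the page cache.
     */
    lock_page(page);                /* stops writers that lock the page */
    unmap_mapping_range(mapping,    /* zap any user PTEs over the page */
                        (loff_t)page->index << PAGE_CACHE_SHIFT,
                        PAGE_CACHE_SIZE, 1);
    /*
     * The page is now stable against new write faults: compute
     * checksums or issue DIF-protected IO here.
     */
    unlock_page(page);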
* Re: [RFC] new ->perform_write fop 2010-05-24 3:09 ` Nick Piggin @ 2010-05-24 5:53 ` Dave Chinner 2010-05-24 6:55 ` Nick Piggin 0 siblings, 1 reply; 44+ messages in thread From: Dave Chinner @ 2010-05-24 5:53 UTC (permalink / raw) To: Nick Piggin Cc: Christoph Hellwig, Josef Bacik, linux-fsdevel, chris.mason, akpm, linux-kernel On Mon, May 24, 2010 at 01:09:43PM +1000, Nick Piggin wrote: > On Sat, May 22, 2010 at 06:37:03PM +1000, Dave Chinner wrote: > > On Sat, May 22, 2010 at 12:31:02PM +1000, Nick Piggin wrote: > > > On Fri, May 21, 2010 at 11:15:18AM -0400, Christoph Hellwig wrote: > > > > Nick, what exactly is the problem with the reserve + allocate design? > > > > > > > > In a delalloc filesystem (which is all those that will care about high > > > > performance large writes) the write path fundamentally consists of those > > > > two operations. Getting rid of the get_blocks mess and replacing it > > > > with a dedicated operations vector will simplify things a lot. > > > > > > Nothing wrong with it, I think it's a fine idea (although you may still > > > need a per-bh call to connect the fs metadata to each page). > > > > > > I just much prefer to have operations after the copy not able to fail, > > > otherwise you get into all those pagecache corner cases. > > > > > > BTW. when you say reserve + allocate, this is in the page-dirty path, > > > right? I thought delalloc filesystems tend to do the actual allocation > > > in the page-cleaning path? Or am I confused? > > > > See my reply to Jan - delayed allocate has two parts to it - space > > reservation (accounting for ENOSPC) and recording of the delalloc extents > > (allocate). This is separate to the writeback path where we convert > > delalloc extents to real extents.... > > Yes I saw that. I'm sure we'll want clearer terminology in the core > code. But I don't quite know why you need to do it in 2 parts > (reserve, then "allocate"). Because reserve/allocate are the two steps that allocation is generally broken down into, even in filesystems that don't do delayed allocation. That's because.... > Surely even reservation failures are > very rare ... ENOSPC and EDQUOT are not at all rare, and they are generated during the reservation stage. i.e. before any real allocation or state changes are made. Just about every filesystem does this because failing half way through an allocation (not being able to allocate a block due to ENOSPC or EDQUOT) is pretty much impossible to undo reliably in most filesystems. > , and obviously the error handling is required, so why not > just do a single call? Because if we fail after the allocation then ensuring we handle the error *correctly* and *without further failures* is *fucking hard*. IMO, the fundamental issue with using hole punching or direct IO from the zero page to handle errors is that they are complex enough that there is *no guarantee that they will succeed*. e.g. Both can get ENOSPC/EDQUOT because they may end up with metadata allocation requirements above and beyond what was originally reserved. If the error handling fails to handle the error, then where do we go from there? In comparison, undoing a reservation is simple - maybe incrementing a couple of counters - and is effectively guaranteed never to fail. This is a good characteristic to have in an error handling function...
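Dave's point about reservations being trivially undoable can be made concrete with a sketch. The fs_sb_info structure and its counter are hypothetical (loosely modelled on how delalloc filesystems account free space); a real filesystem would close the check/update race under a lock or with a more careful counter protocol.

    /*
     * Sketch: reservation as pure counter accounting. The reserve
     * step is the only place ENOSPC is reported; undoing it is
     * simple arithmetic that cannot fail.
     */
    int fs_reserve_blocks(struct super_block *sb, s64 nr)
    {
            struct fs_sb_info *sbi = sb->s_fs_info;    /* hypothetical */

            if (percpu_counter_sum(&sbi->free_blocks) < nr)
                    return -ENOSPC;                    /* reported up front */
            percpu_counter_add(&sbi->free_blocks, -nr);
            return 0;
    }

    void fs_unreserve_blocks(struct super_block *sb, s64 nr)
    {
            struct fs_sb_info *sbi = sb->s_fs_info;

            /* guaranteed to succeed: just hand the space back */
            percpu_counter_add(&sbi->free_blocks, nr);
    }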
> > > > Punching holes is a rather problematic operation, and as mentioned not > > > > actually implemented for most filesystems - just decrementing counters > > > > on errors increases the chances that our error handling will actually > > > > work massively. > > > > > > It's just harder for the pagecache. Invalidating and throwing out old > > > pagecache and splicing in new pages seems a bit of a hack. > > > > Hardly a hack - it turns a buffered write into an operation that > > does not expose transient page state and hence prevents torn writes. > > That will allow us to use DIF enabled storage paths for buffered > > filesystem IO(*), perhaps even allow us to generate checksums during > > copy-in to do end-to-end checksum protection of data.... > > It is a hack. Invalidating is inherently racy and isn't guaranteed > to succeed. > > You do not need to invalidate the pagecache to do this (which as I said > is racy). You need to lock the page to prevent writes, and then unmap > user mappings. Which is the major part of invalidating a page. The other part of invalidation is removing the page from the page cache, so if invalidation is inherently too racy to use safely here, then I fail to see why the above isn't also too racy to use safely.... > You'd also need to have done some magic so writable mmaps > don't leak into get_user_pages. Which means what, and why don't we have to do any special magic now to prevent it? > But this should be a different discussion anyway. Don't forget, your > approach is forced into the invalidation requirement because of > downsides in its error handling sequence. I wouldn't say forced into it, Nick - it's a deliberate design choice to make the overall stack simpler and more likely to function correctly. Besides, all it takes to avoid the requirement of invalidation is to provide the guarantee that the allocation after reservation will either succeed or the filesystem shuts down in a corrupted state. If we provide that guarantee then the fact that transient page cache data might appear on allocation error is irrelevant, because it will never get written to disk and applications will error out pretty quickly. I'm quite happy with that requirement, because of two things. Firstly, after the reservation nothing but a corruption or IO error should prevent the allocation from succeeding. In that case, the filesystem should already be in a shutdown state by the time the failed allocation returns. Secondly, filesystems using delayed allocation are already making this promise successfully from get_blocks to ->writepage, hence I don't see any issues with encoding it into an allocation interface.... > That cannot be construed as > positive, because you are forced into it, whereas other approaches > *could* use it, but do not have to. Except for the fact the other alternatives have much, much worse downsides. Yes, they could also use such a write path, but that doesn't reduce the complexity of those solutions or prevent any of the problems they have. Cheers, Dave. (*) Yes, I know that mmap writes will still break DIF even if we do writes this way. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [RFC] new ->perform_write fop 2010-05-24 5:53 ` Dave Chinner @ 2010-05-24 6:55 ` Nick Piggin 2010-05-24 10:21 ` Dave Chinner 2010-05-24 18:40 ` Joel Becker 0 siblings, 2 replies; 44+ messages in thread From: Nick Piggin @ 2010-05-24 6:55 UTC (permalink / raw) To: Dave Chinner Cc: Christoph Hellwig, Josef Bacik, linux-fsdevel, chris.mason, akpm, linux-kernel On Mon, May 24, 2010 at 03:53:29PM +1000, Dave Chinner wrote: > On Mon, May 24, 2010 at 01:09:43PM +1000, Nick Piggin wrote: > > On Sat, May 22, 2010 at 06:37:03PM +1000, Dave Chinner wrote: > > > On Sat, May 22, 2010 at 12:31:02PM +1000, Nick Piggin wrote: > > > > On Fri, May 21, 2010 at 11:15:18AM -0400, Christoph Hellwig wrote: > > > > > Nick, what exactly is the problem with the reserve + allocate design? > > > > > > > > > > In a delalloc filesystem (which is all those that will care about high > > > > > performance large writes) the write path fundamentally consists of those > > > > > two operations. Getting rid of the get_blocks mess and replacing it > > > > > with a dedicated operations vector will simplify things a lot. > > > > > > > > Nothing wrong with it, I think it's a fine idea (although you may still > > > > need a per-bh call to connect the fs metadata to each page). > > > > > > > > I just much prefer to have operations after the copy not able to fail, > > > > otherwise you get into all those pagecache corner cases. > > > > > > > > BTW. when you say reserve + allocate, this is in the page-dirty path, > > > > right? I thought delalloc filesystems tend to do the actual allocation > > > > in the page-cleaning path? Or am I confused? > > > > > > See my reply to Jan - delayed allocate has two parts to it - space > > > reservation (accounting for ENOSPC) and recording of the delalloc extents > > > (allocate). This is separate to the writeback path where we convert > > > delalloc extents to real extents.... > > > > Yes I saw that. I'm sure we'll want clearer terminology in the core > > code. But I don't quite know why you need to do it in 2 parts > > (reserve, then "allocate"). > > Because reserve/allocate are the two steps that allocation is > generally broken down into, even in filesystems that don't do > delayed allocation. That's because.... > > > Surely even reservation failures are > > very rare > > ... ENOSPC and EDQUOT are not at all rare, and they are generated > during the reservation stage. i.e. before any real allocation or I meant "rare" as in not critical for performance. Not that they don't have to be handled properly. > state changes are made. Just about every filesystem does this > because failing half way through an allocation (not being able to > allocate a block due to ENOSPC or EDQUOT) is pretty much impossible > to undo reliably in most filesystems. > > > , and obviously the error handling is required, so why not > > just do a single call? > > Because if we fail after the allocation then ensuring we handle the > error *correctly* and *without further failures* is *fucking hard*. I don't think you really answered my question. Let me put it in concrete terms. In your proposal, why not just do the reserve+allocate *after* the pagecache copy? What does the "reserve" part add? > > IMO, the fundamental issue with using hole punching or direct IO > from the zero page to handle errors is that they are complex enough > that there is *no guarantee that they will succeed*. e.g.
Both can > get ENOSPC/EDQUOT because they may end up with metadata allocation > requirements above and beyond what was originally reserved. If the > error handling fails to handle the error, then where do we go from > there? There are already fundamental issues that seem like they are not handled properly if your filesystem may allocate uninitialized blocks over holes for writeback cache without somehow marking them as uninitialized. If you get a power failure or IO error before the pagecache can be written out, you're left with uninitialized data there, aren't you? Simple buffer head based filesystems are already subject to this. > In comparison, undoing a reservation is simple - maybe incrementing > a couple of counters - and is effectively guaranteed never to fail. > This is a good characteristic to have in an error handling > function... Yes of course. > > > > > Punching holes is a rather problematic operation, and as mentioned not > > > > > actually implemented for most filesystems - just decrementing counters > > > > > on errors increases the chances that our error handling will actually > > > > > work massively. > > > > > > > > It's just harder for the pagecache. Invalidating and throwing out old > > > > pagecache and splicing in new pages seems a bit of a hack. > > > > > > Hardly a hack - it turns a buffered write into an operation that > > > does not expose transient page state and hence prevents torn writes. > > > That will allow us to use DIF enabled storage paths for buffered > > > filesystem IO(*), perhaps even allow us to generate checksums during > > > copy-in to do end-to-end checksum protection of data.... > > > > It is a hack. Invalidating is inherently racy and isn't guaranteed > > to succeed. > > > > You do not need to invalidate the pagecache to do this (which as I said > > is racy). You need to lock the page to prevent writes, and then unmap > > user mappings. > > Which is the major part of invalidating a page. The other part of > invalidation is removing the page from the page cache, so if > invalidation is inherently too racy to use safely here, then I fail > to see why the above isn't also too racy to use safely.... I don't know what you mean by the major part of invalidating the page. Taking the page out of the pagecache is indeed the fallible part of the operation. > > You'd also need to have done some magic so writable mmaps > > don't leak into get_user_pages. > > Which means what, and why don't we have to do any special magic now > to prevent it? We do, but filesystems don't tend to use it. > > But this should be a different discussion anyway. Don't forget, your > > approach is forced into the invalidation requirement because of > > downsides in its error handling sequence. > > I wouldn't say forced into it, Nick - it's a deliberate design > choice to make the overall stack simpler and more likely to function > correctly. Your design is forced to do it when I pointed out that writes into the pagecache should not be made visible if the process can subsequently fail. copy-last is not subject to this. So you can't say invalidation is an advantage of copy-first, because if it is advantageous in other areas, copy-last can implement it too. > Besides, all it takes to avoid the requirement of invalidation is to > provide the guarantee that the allocation after reservation will > either succeed or the filesystem shuts down in a corrupted state.
> If we provide that guarantee then the fact that transient page cache > data might appear on allocation error is irrelevant, because it > will never get written to disk and applications will error out > pretty quickly. Sure. IO errors and writeback cache mean that we don't guarantee without fsync that the data will come back after a crash. This could be an answer to my above question (what is the 2-call sequence for?). All that is required upon write(2) completion (or partial completion) is that the data can actually be found and written back at a later date. > I'm quite happy with that requirement, because of two things. > Firstly, after the reservation nothing but a corruption or IO error > should prevent the allocation from succeeding. In that case, the > filesystem should already be in a shutdown state by the time the > failed allocation returns. Secondly, filesystems using delayed > allocation are already making this promise successfully from > get_blocks to ->writepage, hence I don't see any issues with > encoding it into an allocation interface.... Well it's already there, and not just for delalloc filesystems, because a lot of filesystems do writeback on their metadata too, so it's all subject to IO errors. > > That cannot be construed as > > positive, because you are forced into it, whereas other approaches > > *could* use it, but do not have to. > > Except for the fact the other alternatives have much, much worse > downsides. Yes, they could also use such a write path, but that > doesn't reduce the complexity of those solutions or prevent any of > the problems they have. Let's just carefully look at the alternatives. This alternative of zeroing out uninitialized blocks (via pagecache) is what we have today. What we should do is carefully consider exactly what error semantics and guarantees we want, and then implement the best API taking those into account. If we are happy with the existing state of error handling, copy-first is clearly better because the fast path is faster. ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [RFC] new ->perform_write fop 2010-05-24 6:55 ` Nick Piggin @ 2010-05-24 10:21 ` Dave Chinner 2010-06-01 6:27 ` Nick Piggin 0 siblings, 1 reply; 44+ messages in thread From: Dave Chinner @ 2010-05-24 10:21 UTC (permalink / raw) To: Nick Piggin Cc: Christoph Hellwig, Josef Bacik, linux-fsdevel, chris.mason, akpm, linux-kernel On Mon, May 24, 2010 at 04:55:19PM +1000, Nick Piggin wrote: > On Mon, May 24, 2010 at 03:53:29PM +1000, Dave Chinner wrote: > > On Mon, May 24, 2010 at 01:09:43PM +1000, Nick Piggin wrote: > > > On Sat, May 22, 2010 at 06:37:03PM +1000, Dave Chinner wrote: > > > > On Sat, May 22, 2010 at 12:31:02PM +1000, Nick Piggin wrote: > > > > > On Fri, May 21, 2010 at 11:15:18AM -0400, Christoph Hellwig wrote: > > > > > > Nick, what exactly is the problem with the reserve + allocate design? > > > > > > > > > > > > In a delalloc filesystem (which is all those that will care about high > > > > > > performance large writes) the write path fundamentally consists of those > > > > > > two operations. Getting rid of the get_blocks mess and replacing it > > > > > > with a dedicated operations vector will simplify things a lot. > > > > > > > > > > Nothing wrong with it, I think it's a fine idea (although you may still > > > > > need a per-bh call to connect the fs metadata to each page). > > > > > > > > > > I just much prefer to have operations after the copy not able to fail, > > > > > otherwise you get into all those pagecache corner cases. > > > > > > > > > > BTW. when you say reserve + allocate, this is in the page-dirty path, > > > > > right? I thought delalloc filesystems tend to do the actual allocation > > > > > in the page-cleaning path? Or am I confused? > > > > > > > > See my reply to Jan - delayed allocate has two parts to it - space > > > > reservation (accounting for ENOSPC) and recording of the delalloc extents > > > > (allocate). This is separate to the writeback path where we convert > > > > delalloc extents to real extents.... > > > > > > Yes I saw that. I'm sure we'll want clearer terminology in the core > > > code. But I don't quite know why you need to do it in 2 parts > > > (reserve, then "allocate"). > > > > Because reserve/allocate are the two steps that allocation is > > generally broken down into, even in filesystems that don't do > > delayed allocation. That's because.... > > > > > Surely even reservation failures are > > > very rare > > > > ... ENOSPC and EDQUOT are not at all rare, and they are generated > > during the reservation stage. i.e. before any real allocation or > > I meant "rare" as in not critical for performance. Not that they don't > have to be handled properly. I am trying to handle these cases properly for the fast path, not turn it into a maze of twisty, dark, buggy passages. > > state changes are made. Just about every filesystem does this > > because failing half way through an allocation (not being able to > > allocate a block due to ENOSPC or EDQUOT) is pretty much impossible > > to undo reliably in most filesystems. > > > > > , and obviously the error handling is required, so why not > > > just do a single call? > > > > Because if we fail after the allocation then ensuring we handle the > > error *correctly* and *without further failures* is *fucking hard*. > > I don't think you really answered my question. Let me put it in concrete > terms. In your proposal, why not just do the reserve+allocate *after* > the pagecache copy? What does the "reserve" part add? ....
> Your design is forced to do it when I pointed out that writes into the > pagecache should not be made visible if the process can subsequently > fail. copy-last is not subject to this. Let's summarise what we've got so far so that it becomes obvious why we need a two step process.

1. Copy First Problem: Copying in data before the allocate results in transient data exposure if the allocate fails or if we overwrite pages in the same loop iteration as the failed allocation. To deal with this, we've talked about page replacement mechanisms for the page cache and all the complex issues associated with that.

2. Copy Last Problem: If we allocate up front, then we have to handle errors allocating pages into the page cache or copying the data. To work around copying failures, we've talked about punching out the allocation range that wasn't copied into. This gets complex with mixed overwrite/allocate ranges. To deal with page cache allocation failures, we talked about using direct IO w/ the zero page to zero the remaining range, but this gets complex with delalloc.

Situation: both copy-first and copy-last have serious issues.

3. Generic Simplification: The first thing we need to do in both cases is avoid the mixed overwrite/allocate copy case, as that greatly increases the complexity of error handling in both cases. To do this we need to split the multipage write up into ranges that are of the same kind (only allocate, or only overwrite). This needs to be done before we copy the data and hence we need to be able to map the range we are allowed to copy before we do the copy. A filesystem callout is required to do this.

4. Avoiding Copy Last Problems: To avoid page allocation failures requiring hole punching or DIO writes, we need to ensure we don't do allocation until after the copy has succeeded (i.e. take that bit of Copy First). However, to avoid data exposure problems, we need to ensure this allocation will succeed in all normal cases. ENOSPC and EDQUOT are normal cases, EIO or filesystem corruption are not normal cases.

5. To avoid the catch-22 situation, we can avoid "normal" errors in the allocation case if we reserve the space required for the upcoming allocation before we copy any data. Effectively we can guarantee the allocation will succeed.

6. If the allocation does fail, then the filesystem is considered corrupted and needs to enter a shutdown/failed state. This ensures allocation failures do not expose transient data in any meaningful way because the cached data on an errored filesystem cannot be guaranteed to be valid. This is a hard stop, but should never occur unless there is some other hard-stop error condition in the filesystem.

7. Hence by reserving space and defining how the filesystem needs to handle allocation failures after a successful reserve, we remove all the problematic error handling in the Copy Last case. This requires a filesystem callout.

8. Avoiding Copy First Problems: By ensuring that allocation will never fail except in a hard-stop situation, the allocation can be safely done after the copy has been made. Effectively the steps to remove the problematic error handling from the Copy Last case also remove the problematic transient data issues from the Copy First case.

9. Copying Data: Given the guarantee that allocation will succeed, we don't need to worry about exposing transient data during the data copy. The copy can fail at any time (either page allocation failure or EFAULT during the copy) and we can easily deal with it.
Hence we don't need to lock more than one page at once, and the copy can proceed a page at a time. Failure during a copy will require partial pages to be treated like a partial write.

10. Handling Errors: If we fail the reserve/map stage, then there is nothing to undo and we can simply return at that point.

11. If we fail part way through the copy (i.e. in the middle), the error handling is simple:
   a) allocate the range that did succeed (if allocate)
   b) zero the portion of the page that failed (if allocate)
   c) update the page states appropriately (write_end)
   d) release the unused reserved space (if allocate)

12. If we fail the allocation:
   a) the filesystem is shut down
   b) the range that failed is tossed out of the page cache.

To me this seems pretty sane - "copy middle" avoids all the pitfalls and traps we've been talking about. No transient data exposure problems, no need to lock multiple pages at a time, no complex error handling required, no major new filesystem functionality, etc. > > IMO, the fundamental issue with using hole punching or direct IO > > from the zero page to handle errors is that they are complex enough > > that there is *no guarantee that they will succeed*. e.g. Both can > > get ENOSPC/EDQUOT because they may end up with metadata allocation > > requirements above and beyond what was originally reserved. If the > > error handling fails to handle the error, then where do we go from > > there? > > There are already fundamental issues that seem like they are not > handled properly if your filesystem may allocate uninitialized blocks > over holes for writeback cache without somehow marking them as > uninitialized. You mean like the problems ext3's data=writeback mode has? ;) > If you get a power failure or IO error before the pagecache can be > written out, you're left with uninitialized data there, aren't you? > Simple buffer head based filesystems are already subject to this. Most filesystems avoid this one way or another. XFS, btrfs, ext3/4 (in data=ordered/journal modes), gfs2, etc. all use different methods, but the end result is the same - they don't expose stale blocks. But that's not really the issue I was alluding to. What I was thinking of was a whole lot more complex than that, and might give you insight into why the error handling you were proposing is .. meeting resistance. Take a delayed allocation in XFS - XFS reserves an amount (worst case) of extra blocks for metadata allocation during the eventual real extent allocation. Every delalloc reserves this metadata space but it is not specific to the inode, so pretty much any outstanding delalloc conversion has a big pool of reserved metadata space that it can allocate metadata blocks from. When the allocation occurs during writeback this unused metadata space is taken back out of the pool. If we then change this to use DIO, the first thing XFS has to do is punch out the delalloc extent spanning the page because we cannot do direct IO into delalloc extents (a fundamental design rule). That punch would require its own metadata reservation on top of the current delalloc reservation which is not released until after the reservation for the hole punch is made. Hence it can ENOSPC/EDQUOT. Then we have to allocate during DIO, which also requires its own metadata reservation plus the blocks for the page being written. That can also ENOSPC/EDQUOT because it can't make use of the large delalloc metadata pool that would have allowed it to succeed.
But because this is an error handling path, it must be made to succeed. If it fails, we've failed to handle the error correctly and something bad *will* happen as a result. As you can see, switching to DIO from the zero page in a copy-last error handler would be quite hard to implement in XFS - it would require a special DIO path because we'd need to know that it was in a do-not-fail case and that we'd have to hole punch before allocation rather than asserting that a design rule had been violated, and then make sure that neither punching nor the later allocation failed, which we can't actually do.... Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 44+ messages in thread
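Pulling points 3-12 together, a "copy middle" write might look something like the outline below. This is a sketch of the control flow only, not a real patch: the ->reserve/->allocate/->unreserve address_space operations and the copy_range_to_pagecache() helper are hypothetical names invented for illustration.

    /*
     * Sketch of a "copy middle" buffered write:
     * reserve -> copy -> allocate, with the error paths from
     * points 10-12 above.
     */
    static ssize_t multipage_perform_write(struct file *file,
                                           struct iov_iter *i, loff_t pos)
    {
            struct address_space *mapping = file->f_mapping;
            size_t len = iov_iter_count(i);
            size_t copied;
            int error;

            /*
             * Map/trim the range to a single kind (all-allocate or
             * all-overwrite) and reserve worst-case space for it.
             * This is the only place ENOSPC/EDQUOT can be returned.
             */
            error = mapping->a_ops->reserve(mapping, pos, &len);
            if (error)
                    return error;

            /* Copy in the data, one locked page at a time. */
            copied = copy_range_to_pagecache(mapping, i, pos, len);

            /*
             * Allocate only what was actually copied. After a
             * successful reserve this may only fail if the
             * filesystem is corrupt, in which case it shuts down
             * and the range is tossed from the page cache.
             */
            if (copied)
                    mapping->a_ops->allocate(mapping, pos, copied);

            /* Hand back the reservation covering the uncopied tail. */
            if (copied < len)
                    mapping->a_ops->unreserve(mapping, pos + copied,
                                              len - copied);

            return copied ? copied : -EFAULT;
    }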
* Re: [RFC] new ->perform_write fop 2010-05-24 10:21 ` Dave Chinner @ 2010-06-01 6:27 ` Nick Piggin 0 siblings, 0 replies; 44+ messages in thread From: Nick Piggin @ 2010-06-01 6:27 UTC (permalink / raw) To: Dave Chinner Cc: Christoph Hellwig, Josef Bacik, linux-fsdevel, chris.mason, akpm, linux-kernel On Mon, May 24, 2010 at 08:21:25PM +1000, Dave Chinner wrote: > On Mon, May 24, 2010 at 04:55:19PM +1000, Nick Piggin wrote: > > > > Surely even reservation failures are > > > > very rare > > > > > > ... ENOSPC and EDQUOT are not at all rare, and they are generated > > > during the reservation stage. i.e. before any real allocation or > > > > I meant "rare" as in not critical for performance. Not that they don't > > have to be handled properly. > > I am trying to handle these cases properly for the fast path, not > turn it into a maze of twisty, dark, buggy passages. What I mean is that when the common code has such a failure, the action to correct it is not performance critical, i.e. the fast-path is the success case. So a cleaner or more performant fast-path is preferable even if it results in the slow-path error case being slower or more complex (within reasonable limits). > > > state changes are made. Just about every filesystem does this > > > because failing half way through an allocation (not being able to > > > allocate a block due to ENOSPC or EDQUOT) is pretty much impossible > > > to undo reliably in most filesystems. > > > > > > > , and obviously the error handling is required, so why not > > > > just do a single call? > > > > > > Because if we fail after the allocation then ensuring we handle the > > > error *correctly* and *without further failures* is *fucking hard*. > > > > I don't think you really answered my question. Let me put it in concrete > > terms. In your proposal, why not just do the reserve+allocate *after* > > the pagecache copy? What does the "reserve" part add? > > .... > > > Your design is forced to do it when I pointed out that writes into the > > pagecache should not be made visible if the process can subsequently > > fail. copy-last is not subject to this. > > Let's summarise what we've got so far so that it becomes obvious why > we need a two step process. > > 1. Copy First Problem: Copying in data before the allocate results > in transient data exposure if the allocate fails or if we overwrite > pages in the same loop iteration as the failed allocation. To deal > with this, we've talked about page replacement mechanisms for the page > cache and all the complex issues associated with that. > > 2. Copy Last Problem: If we allocate up front, then we have to > handle errors allocating pages into the page cache or copying the > data. To work around copying failures, we've talked about punching > out the allocation range that wasn't copied into. This gets complex > with mixed overwrite/allocate ranges. To deal with page cache > allocation failures, we talked about using direct IO w/ the zero > page to zero the remaining range, but this gets complex with > delalloc. > > Situation: both copy-first and copy-last have serious issues. > > 3. Generic Simplification: The first thing we need to do in both > cases is avoid the mixed overwrite/allocate copy case, as that > greatly increases the complexity of error handling in both cases. > To do this we need to split the multipage write up into ranges that > are of the same kind (only allocate, or only overwrite).
This needs > to be done before we copy the data and hence we need to be able to > map the range we are allowed to copy before we do the copy. A > filesystem callout is required to do this. OK, this allows filesystems that can handle mixed to do so. > 4. Avoiding Copy Last Problems: To avoid page allocation failures > requiring hole punching or DIO writes, we need to ensure we don't do > allocation until after the copy has succeeded (i.e. take that bit of > Copy First). However, to avoid data exposure problems, we need to > ensure this allocation will succeed in all normal cases. ENOSPC and > EDQUOT are normal cases, EIO or filesystem corruption are not normal > cases. In generic code, I don't know whether this should be a requirement. If ext4 marks allocated blocks as uninitialized, it seems to be able to handle this pretty gracefully. Not all filesystems necessarily can reserve blocks before the copy, you see. > 5. To avoid the catch-22 situation, we can avoid "normal" errors in > the allocation case if we reserve the space required for the > upcoming allocation before we copy any data. Effectively we can > guarantee the allocation will succeed. > > 6. If the allocation does fail, then the filesystem is considered > corrupted and needs to enter a shutdown/failed state. This ensures > allocation failures do not expose transient data in any meaningful > way because the cached data on an errored filesystem cannot be > guaranteed to be valid. This is a hard stop, but should never occur > unless there is some other hard-stop error condition in the > filesystem. > > 7. Hence by reserving space and defining how the filesystem needs to > handle allocation failures after a successful reserve, we remove all > the problematic error handling in the Copy Last case. This requires > a filesystem callout. > > 8. Avoiding Copy First Problems: By ensuring that allocation will > never fail except in a hard-stop situation, the allocation can be > safely done after the copy has been made. Effectively the steps to > remove the problematic error handling from the Copy Last case also > remove the problematic transient data issues from the Copy First > case. This is a reasonable observation and seems to solve your copy-first problem. However this may depend a little bit on the error semantics of the fs, and a lot on the filesystem itself. Is it reasonable to encode this in the generic code? I think I would still like to see a few more implementations before it is turned into generic code. I would like to see how much code is actually saved versus implementing it in the fs (perhaps with some helpers in the vfs to avoid generic checks). We have relatively simple iov accessors/usercopy etc. so the pagecache part of it really doesn't need to be in generic code. > 9. Copying Data: Given the guarantee that allocation will succeed, > we don't need to worry about exposing transient data during the data > copy. The copy can fail at any time (either page allocation failure > or EFAULT during the copy) and we can easily deal with it. Hence we > don't need to lock more than one page at once, and the copy can > proceed a page at a time. Failure during a copy will require partial > pages to be treated like a partial write. > 10. Handling Errors: If we fail the reserve/map stage, then there is > nothing to undo and we can simply return at that point. > 11. If we fail part way through the copy (i.e.
in the middle), the > error handling is simple: > a) allocate the range that did succeed (if allocate) > b) zero the portion of the page that failed (if allocate) > c) update the page states appropriately (write_end) > d) release the unused reserved space (if allocate) > > 12. If we fail the allocation: > a) the filesystem is shut down > b) the range that failed is tossed out of the page cache. > > To me this seems pretty sane - "copy middle" avoids all the pitfalls > and traps we've been talking about. No transient data exposure > problems, no need to lock multiple pages at a time, no complex error > handling required, no major new filesystem functionality, etc. So long as you avoid the transient pagecache data without requiring pagecache invalidation, I don't have a problem with it. > > > IMO, the fundamental issue with using hole punching or direct IO > > > from the zero page to handle errors is that they are complex enough > > > that there is *no guarantee that they will succeed*. e.g. Both can > > > get ENOSPC/EDQUOT because they may end up with metadata allocation > > > requirements above and beyond what was originally reserved. If the > > > error handling fails to handle the error, then where do we go from > > > there? > > > > There are already fundamental issues that seem like they are not > > handled properly if your filesystem may allocate uninitialized blocks > > over holes for writeback cache without somehow marking them as > > uninitialized. > > You mean like the problems ext3's data=writeback mode has? ;) And probably most non-journalling filesystems. > > If you get a power failure or IO error before the pagecache can be > > written out, you're left with uninitialized data there, aren't you? > > Simple buffer head based filesystems are already subject to this. > > Most filesystems avoid this one way or another. XFS, btrfs, ext3/4 > (in data=ordered/journal modes), gfs2, etc. all use different > methods, but the end result is the same - they don't expose stale > blocks. > > But that's not really the issue I was alluding to. What I was > thinking of was a whole lot more complex than that, and might give > you insight into why the error handling you were proposing is .. > meeting resistance. > > Take a delayed allocation in XFS - XFS reserves an amount (worst > case) of extra blocks for metadata allocation during the eventual > real extent allocation. Every delalloc reserves this metadata space > but it is not specific to the inode, so pretty much any outstanding > delalloc conversion has a big pool of reserved metadata space that > it can allocate metadata blocks from. When the allocation occurs > during writeback this unused metadata space is taken back out of the > pool. > > If we then change this to use DIO, the first thing XFS has to do is > punch out the delalloc extent spanning the page because we cannot do > direct IO into delalloc extents (a fundamental design rule). That > punch would require its own metadata reservation on top of the > current delalloc reservation which is not released until after the > reservation for the hole punch is made. Hence it can ENOSPC/EDQUOT. > > Then we have to allocate during DIO, which also requires its own > metadata reservation plus the blocks for the page being written. > That can also ENOSPC/EDQUOT because it can't make use of the large > delalloc metadata pool that would have allowed it to succeed. > > But because this is an error handling path, it must be made to > succeed.
If it fails, we've failed to handle the error correctly and > something bad *will* happen as a result. As you can see, switching to > DIO from the zero page in a copy-last error handler would be quite > hard to implement in XFS - it would require a special DIO path > because we'd need to know that it was in a do-not-fail case and that > we'd have to hole punch before allocation rather than asserting that > a design rule had been violated, and then make sure that neither > punching nor the later allocation failed, which we can't actually > do.... Fair point. Thanks for the explanation. I could say why pagecache invalidation is very hard to do as well, but luckily you've found a solution that avoids all the problems. Is it easy to explain how XFS avoids the uninitialized block problem, btw? ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [RFC] new ->perform_write fop 2010-05-24 6:55 ` Nick Piggin 2010-05-24 10:21 ` Dave Chinner @ 2010-05-24 18:40 ` Joel Becker 1 sibling, 0 replies; 44+ messages in thread From: Joel Becker @ 2010-05-24 18:40 UTC (permalink / raw) To: Nick Piggin Cc: Dave Chinner, Christoph Hellwig, Josef Bacik, linux-fsdevel, chris.mason, akpm, linux-kernel On Mon, May 24, 2010 at 04:55:19PM +1000, Nick Piggin wrote: > On Mon, May 24, 2010 at 03:53:29PM +1000, Dave Chinner wrote: > > Because if we fail after the allocation then ensuring we handle the > > error *correctly* and *without further failures* is *fucking hard*. > > I don't think you really answered my question. Let me put it in concrete > terms. In your proposal, why not just do the reserve+allocate *after* > the pagecache copy? What does the "reserve" part add? In ocfs2, we can't just crash our filesystem. We have to be safe not just with respect to the local machine, we have to leave the filesystem in a consistent state - structure *and* data - for the other nodes. The ordering and locking of allocation in get_block(s)() is so bad that we just Don't Do It. By the time get_block(s)() is called, we require our filesystem to have the allocation done. We do our allocation in write_begin(). By the time we get to the page copy, we can't ENOSPC or EDQUOT. O_DIRECT I/O falls back to sync buffered I/O if it must allocate, pushing us through write_begin(), forcing other nodes to honor what we've done. This is easily extended to the reserve multipage operation. It's not delalloc, because we actually allocate in the reserve operation. We handle it just like a large case of the single page operation. Someday we hope to add delalloc, and it would actually do better here. I guess you could call this "copy middle" like Dave describes in his followup to your mail. Copy Middle also has the property that it can handle short writes without any error handling. Copy First has to discover it can only get half the allocation and drop the latter half of the pagecache. Copy Last has to discover it can only do half the page copy and drop the latter half of the allocation. > > IMO, the fundamental issue with using hole punching or direct IO > > from the zero page to handle errors is that they are complex enough > > that there is *no guarantee that they will succeed*. e.g. Both can > > get ENOSPC/EDQUOT because they may end up with metadata allocation > > requirements above and beyond what was originally reserved. If the > > error handling fails to handle the error, then where do we go from > > there? > > There are already fundamental issues that seem like they are not > handled properly if your filesystem may allocate uninitialized blocks > over holes for writeback cache without somehow marking them as > uninitialized. > > If you get a power failure or IO error before the pagecache can be > written out, you're left with uninitialized data there, aren't you? > Simple buffer head based filesystems are already subject to this. Sure, ext2 does this. But don't most filesystems guaranteeing state actually make sure to order such I/Os? If you run ext3 in data=writeback, you get what you pay for. This sounds like a red herring. Dave's original point stands. ocfs2 supports unwritten extents and punching holes. In fact, we directly copied the XFS ioctl(2)s. But when we do punch holes, we have to adjust our tree. That may require additional metadata, and *that* can fail with ENOSPC or EDQUOT. Joel -- "I always thought the hardest questions were those I could not answer.
Now I know they are the ones I can never ask." - Charlie Watkins Joel Becker Principal Software Developer Oracle E-mail: joel.becker@oracle.com Phone: (650) 506-8127 ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [RFC] new ->perform_write fop 2010-05-14 6:41 ` Dave Chinner 2010-05-14 7:22 ` Nick Piggin @ 2010-05-17 23:35 ` Jan Kara 2010-05-18 1:21 ` Dave Chinner 1 sibling, 1 reply; 44+ messages in thread From: Jan Kara @ 2010-05-17 23:35 UTC (permalink / raw) To: Dave Chinner Cc: Josef Bacik, linux-fsdevel, chris.mason, hch, akpm, npiggin, linux-kernel On Fri 14-05-10 16:41:45, Dave Chinner wrote: ... > Well, the "RESERVE" callout I proposed should mean that the > allocation that follows the copy should never fail with ENOSPC. i.e. > if there isn't space, the RESERVE call will fail with ENOSPC, not > the actual ALLOCATE call that occurs after the data is copied. > > Basically I was thinking that in the XFS case, RESERVE would scan the > range for holes, and reserve all blocks needed to fill all the holes > in the range. Then the ALLOCATE call would mark them all delalloc. > The space reservation and delalloc is done in a single call right > now, but splitting them is not hard... But this would mean that you would have to implement at least basic delayed allocation support for filesystems like ext2, ext3, udf, vfat, etc. They currently do not have a way to provide the functionality you require from the RESERVE call... Not that it would be so hard to do - I actually have some patches for ext2/3 doing that because of problems with ENOSPC during mmaped writes, but it's not completely trivial either. Honza -- Jan Kara <jack@suse.cz> SUSE Labs, CR ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [RFC] new ->perform_write fop 2010-05-17 23:35 ` Jan Kara @ 2010-05-18 1:21 ` Dave Chinner 0 siblings, 0 replies; 44+ messages in thread From: Dave Chinner @ 2010-05-18 1:21 UTC (permalink / raw) To: Jan Kara Cc: Josef Bacik, linux-fsdevel, chris.mason, hch, akpm, npiggin, linux-kernel On Tue, May 18, 2010 at 01:35:41AM +0200, Jan Kara wrote: > On Fri 14-05-10 16:41:45, Dave Chinner wrote: > ... > > Well, the "RESERVE" callout I proposed should mean that the > > allocation that follows the copy should never fail with ENOSPC. i.e. > > if there isn't space, the RESERVE call will fail with ENOSPC, not > > the actual ALLOCATE call that occurs after the data is copied. > > > > Basically I was thinking that in the XFS case, RESERVE would scan the > > range for holes, and reserve all blocks needed to fill all the holes > > in the range. Then the ALLOCATE call would mark them all delalloc. > > The space reservation and delalloc is done in a single call right > > now, but splitting them is not hard... > But this would mean that you would have to implement at least basic > delayed allocation support for filesystems like ext2, ext3, udf, vfat, etc. > They currently do not have a way to provide the functionality you require > from the RESERVE call... I don't think that's necessary. For filesystems that don't do multipage writes (i.e. that continue to operate as they do now), RESERVE/UNRESERVE can effectively be a no-op as there is nothing extra to reserve or undo if the single page allocation fails. Hence filesystems that can't easily be converted to multi-page writes, or that we don't want to put the effort into converting, do not need any new functionality.... > Not that it would be so hard to do - I actually have some patches for > ext2/3 doing that because of problems with ENOSPC during mmaped writes, but > it's not completely trivial either. Yup, and a good reason for not trying to retrofit delayed allocation to all those old filesystems. Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 44+ messages in thread
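In other words, for a filesystem that sticks with single-page writes, the new callouts could be wired up as trivial stubs, roughly like the sketch below (illustrative only; the a_ops members are hypothetical). reserve just trims the request to one page and reserves nothing, so unreserve has nothing to undo, and real ENOSPC handling stays where it is today, in ->write_begin.

    /*
     * Sketch: no-op RESERVE/UNRESERVE for a filesystem that keeps
     * doing one-page-at-a-time writes.
     */
    static int simplefs_reserve(struct address_space *mapping,
                                loff_t pos, size_t *len)
    {
            /* never hand out more than the rest of the current page */
            *len = min_t(size_t, *len,
                         PAGE_CACHE_SIZE - (pos & (PAGE_CACHE_SIZE - 1)));
            return 0;
    }

    static void simplefs_unreserve(struct address_space *mapping,
                                   loff_t pos, size_t len)
    {
            /* nothing was reserved, so there is nothing to undo */
    }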
end of thread, other threads:[~2010-06-06 7:59 UTC | newest]

Thread overview: 44+ messages
2010-05-12 21:24 [RFC] new ->perform_write fop Josef Bacik
2010-05-13 1:39 ` Josef Bacik
2010-05-13 15:36 ` Christoph Hellwig
2010-05-14 1:00 ` Dave Chinner
2010-05-14 3:30 ` Josef Bacik
2010-05-14 5:50 ` Nick Piggin
2010-05-14 7:20 ` Dave Chinner
2010-05-14 7:33 ` Nick Piggin
2010-05-14 6:41 ` Dave Chinner
2010-05-14 7:22 ` Nick Piggin
2010-05-14 8:38 ` Dave Chinner
2010-05-14 13:33 ` Chris Mason
2010-05-18 6:36 ` Nick Piggin
2010-05-18 8:05 ` Dave Chinner
2010-05-18 10:43 ` Nick Piggin
2010-05-18 12:27 ` Dave Chinner
2010-05-18 15:09 ` Nick Piggin
2010-05-19 23:50 ` Dave Chinner
2010-05-20 6:48 ` Nick Piggin
2010-05-20 20:12 ` Jan Kara
2010-05-20 23:05 ` Dave Chinner
2010-05-21 9:05 ` Steven Whitehouse
2010-05-21 13:50 ` Josef Bacik
2010-05-21 14:23 ` Nick Piggin
2010-05-21 15:19 ` Josef Bacik
2010-05-24 3:29 ` Nick Piggin
2010-05-22 0:31 ` Dave Chinner
2010-05-21 18:58 ` Jan Kara
2010-05-22 0:27 ` Dave Chinner
2010-05-24 9:20 ` Jan Kara
2010-05-24 9:33 ` Nick Piggin
2010-06-05 15:05 ` tytso
2010-06-06 7:59 ` Nick Piggin
2010-05-21 15:15 ` Christoph Hellwig
2010-05-22 2:31 ` Nick Piggin
2010-05-22 8:37 ` Dave Chinner
2010-05-24 3:09 ` Nick Piggin
2010-05-24 5:53 ` Dave Chinner
2010-05-24 6:55 ` Nick Piggin
2010-05-24 10:21 ` Dave Chinner
2010-06-01 6:27 ` Nick Piggin
2010-05-24 18:40 ` Joel Becker
2010-05-17 23:35 ` Jan Kara
2010-05-18 1:21 ` Dave Chinner