* [RFC] new ->perform_write fop
From: Josef Bacik @ 2010-05-12 21:24 UTC
To: linux-fsdevel; +Cc: chris.mason, hch, akpm, npiggin, linux-kernel

Hello,

I just started adding aio_write to Btrfs and I noticed we're duplicating
_a lot_ of the generic stuff in mm/filemap.c, even though the only thing
that's really unique is the fact that we copy userspace pages in chunks
rather than one page at a time.  What would be best is instead of doing
write_begin/write_end with Btrfs, it would be nice if we could just do our
own perform_write instead of generic_perform_write.  This way we can drop
all of these generic checks that we copied from filemap.c and just get to
the business of actually writing the data.  I hate to add another file
operation, but it would _greatly_ reduce the amount of duplicate code we
have.  If there is no violent objection to this I can put something
together quickly for review.  Thanks,

Josef
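No prototype was posted in this message, but a ->perform_write hook would
presumably mirror the 2.6.34 signature of generic_perform_write() in
mm/filemap.c.  The sketch below is purely illustrative: the struct name,
the field, and the dispatch helper are assumptions, not code from any
kernel tree.

	/*
	 * Hypothetical sketch only.  Note that generic_perform_write()
	 * is static in 2.6.34's mm/filemap.c, so a real patch would also
	 * have to export it for the fallback below to link.
	 */
	struct address_space_operations_sketch {
		/* ... the existing aops ... */
		ssize_t (*perform_write)(struct file *file,
					 struct iov_iter *i, loff_t pos);
	};

	static ssize_t perform_write_or_generic(struct file *file,
			const struct address_space_operations_sketch *a_ops,
			struct iov_iter *i, loff_t pos)
	{
		/* let the filesystem own the whole copy loop ... */
		if (a_ops->perform_write)
			return a_ops->perform_write(file, i, pos);
		/* ... or fall back to the one-page-at-a-time generic loop */
		return generic_perform_write(file, i, pos);
	}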
* Re: [RFC] new ->perform_write fop
From: Josef Bacik @ 2010-05-13 1:39 UTC
To: Josef Bacik; +Cc: linux-fsdevel, chris.mason, hch, akpm, npiggin, linux-kernel

On Wed, May 12, 2010 at 05:24:04PM -0400, Josef Bacik wrote:
> Hello,
>
> I just started adding aio_write to Btrfs and I noticed we're duplicating
> _a lot_ of the generic stuff in mm/filemap.c, even though the only thing
> that's really unique is the fact that we copy userspace pages in chunks
> rather than one page at a time.  What would be best is instead of doing
> write_begin/write_end with Btrfs, it would be nice if we could just do
> our own perform_write instead of generic_perform_write.  This way we can
> drop all of these generic checks that we copied from filemap.c and just
> get to the business of actually writing the data.  I hate to add another
> file operation, but it would _greatly_ reduce the amount of duplicate
> code we have.  If there is no violent objection to this I can put
> something together quickly for review.  Thanks,
>

I just got a suggestion from hpa: instead, move what Btrfs does into
generic_perform_write.  What btrfs does is allocate a chunk of pages to
cover the entirety of the write, set everything up, do the copy from user
into the pages, and tear everything down; essentially what
generic_perform_write does, just with more pages.  I could modify
generic_perform_write and the write_begin/write_end aops to do this, where
write_begin will return how many pages it allocated, we copy all of the
user data into the pages we allocated at once, and then call write_end
with the pages we allocated in write_begin.  Then I could just make btrfs
do write_begin/write_end.  So which option seems more palatable?  Thanks,

Josef
* Re: [RFC] new ->perform_write fop
From: Christoph Hellwig @ 2010-05-13 15:36 UTC
To: Josef Bacik
Cc: linux-fsdevel, chris.mason, hch, akpm, npiggin, linux-kernel, david

On Wed, May 12, 2010 at 09:39:27PM -0400, Josef Bacik wrote:
> I just got a suggestion from hpa: instead, move what Btrfs does into
> generic_perform_write.  What btrfs does is allocate a chunk of pages to
> cover the entirety of the write, set everything up, do the copy from user
> into the pages, and tear everything down; essentially what
> generic_perform_write does, just with more pages.  I could modify
> generic_perform_write and the write_begin/write_end aops to do this, where
> write_begin will return how many pages it allocated, we copy all of the
> user data into the pages we allocated at once, and then call write_end
> with the pages we allocated in write_begin.  Then I could just make btrfs
> do write_begin/write_end.  So which option seems more palatable?  Thanks,

Having a more efficient generic write path would be great.  I'm not quite
sure btrfs already gets everything that is needed right, but getting
started on this would be great.

And to get back to the original question: adding a ->perform_write is a
really dumb idea.  It doesn't fit into the structure of the real aops at
all, as it requires context-dependent locking and so on.  It would just be
a nasty hack around the fact that we still leave too much work to the
filesystem in the write path, and around the bad prototype of the
->aio_read/->aio_write methods.

For a start, generic_segment_checks should move out of the methods and
into fs/read_write.c before calling into the methods.  To do this fully
we'll need to pass down a count of total bytes into ->aio_read and
->aio_write, though.  Adding a new generic_write_prep helper for all the
boilerplate code in ->aio_write also seems fine to me.
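The checks Christoph lists all exist as helpers in 2.6.34, so a
generic_write_prep could be little more than glue.  A sketch, assuming
this hypothetical name and argument list (only the wrapper itself is
invented; every callee below is a real 2.6.34 function):

	#include <linux/fs.h>
	#include <linux/uio.h>
	#include <linux/uaccess.h>

	static int generic_write_prep(struct kiocb *iocb,
				      const struct iovec *iov,
				      unsigned long *nr_segs,
				      loff_t *pos, size_t *count)
	{
		struct file *file = iocb->ki_filp;
		struct inode *inode = file->f_mapping->host;
		int err;

		/* validate the iovec and compute the total byte count */
		err = generic_segment_checks(iov, nr_segs, count, VERIFY_READ);
		if (err)
			return err;

		/* O_APPEND handling, rlimit and s_maxbytes clamping, etc. */
		err = generic_write_checks(file, pos, count,
					   S_ISBLK(inode->i_mode));
		if (err)
			return err;

		/* drop suid/sgid and update mtime/ctime before writing */
		err = file_remove_suid(file);
		if (err)
			return err;
		file_update_time(file);
		return 0;
	}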
* Re: [RFC] new ->perform_write fop
From: Dave Chinner @ 2010-05-14 1:00 UTC
To: Josef Bacik; +Cc: linux-fsdevel, chris.mason, hch, akpm, npiggin, linux-kernel

On Wed, May 12, 2010 at 09:39:27PM -0400, Josef Bacik wrote:
> On Wed, May 12, 2010 at 05:24:04PM -0400, Josef Bacik wrote:
> > Hello,
> >
> > I just started adding aio_write to Btrfs and I noticed we're duplicating
> > _a lot_ of the generic stuff in mm/filemap.c, even though the only thing
> > that's really unique is the fact that we copy userspace pages in chunks
> > rather than one page at a time.  What would be best is instead of doing
> > write_begin/write_end with Btrfs, it would be nice if we could just do
> > our own perform_write instead of generic_perform_write.  This way we can
> > drop all of these generic checks that we copied from filemap.c and just
> > get to the business of actually writing the data.  I hate to add another
> > file operation, but it would _greatly_ reduce the amount of duplicate
> > code we have.  If there is no violent objection to this I can put
> > something together quickly for review.  Thanks,
> >
>
> I just got a suggestion from hpa: instead, move what Btrfs does into
> generic_perform_write.  What btrfs does is allocate a chunk of pages to
> cover the entirety of the write, set everything up, do the copy from user
> into the pages, and tear everything down; essentially what
> generic_perform_write does, just with more pages.

Except that btrfs does things in a very different manner to most
other filesystems ;)

> I could modify generic_perform_write and the write_begin/write_end aops
> to do this, where write_begin will return how many pages it allocated,
> we copy all of the user data into the pages we allocated at once, and
> then call write_end with the pages we allocated in write_begin.  Then I
> could just make btrfs do write_begin/write_end.  So which option seems
> more palatable?  Thanks,

I can see how this would work for btrfs, but the issue is how any
other filesystem would handle it.

I've been trying to get my head around how any filesystem using
bufferheads and generic code can do multipage writes using
write_begin/write_end without modifying the interface, and I just
threw away my second attempt because the error handling just
couldn't be handled cleanly without duplicating the entire
block_write_begin path in each filesystem that wanted to do
multipage writes.

The biggest problem is that block allocation is intermingled with
allocating and attaching bufferheads to pages, hence error handling
can get really nasty and is split across a call chain 3 or 4
functions deep.  The error handling is where I'm finding all the
dangerous and hard-to-kill demons lurking in dark corners.  I suspect
there's a grue in there somewhere, too. ;)

Separating the page+bufferhead allocation and block allocation would
make this much simpler, but I can't fit that easily into the existing
interfaces.

Hence I think that write_begin/copy pages/write_end is not really
suited to multipage writes when allocation is done in write_begin
and the write can then fail in a later stage without a simple method
of undoing the allocation.  We don't have any hole punch interfaces
to the filesystems (and I think only XFS supports that functionality
right now), so handling errors after allocation becomes rather
complex, especially when you have multiple blocks per page.

Hence I've independently come to the conclusion that delaying the
allocation until *after* the copy, as btrfs does, is probably the
best approach to take here.  This largely avoids the error handling
complexity because the write operation is an all-or-nothing
operation.  btrfs has separate hooks for space reservation and
releasing the reservation and doesn't commit to actually allocating
anything until the copying is complete.  Hence cleanup is simple no
matter where a failure occurs.

Personally, I'm tending towards killing the get_blocks() callback as
the first step in this process - turn it into a real inode/address
space operation (say ->allocate) so we can untangle the write path
somewhat (lots of filesystems just provide operations as wrappers
that feed a fs-specific get_blocks callback to generic code).  If the
"create" flag is then converted to a "command" field, the interface
can pass "RESERVE", "ALLOC", "CREATE", etc. to allow different
operations to be clearly handled.

e.g.:

	->allocate(mapping, NULL, off, len, RESERVE)
		reserves necessary space for write
	->write_begin
		grab pages into page cache
		attach bufferheads (if required)
		fail -> goto drop pages
	copy data into pages
		fail -> goto drop pages
	->allocate(mapping, pages, off, len, ALLOC)
		allocates reserved space (if required)
		sets up/maps/updates bufferheads/extents
		fail -> goto drop pages
	->write_end
		set pages dirty + uptodate
	done

drop_pages:
	->allocate(mapping, NULL, off, len, UNRESERVE)
	if needed, zero partial pages
	release pages, clears uptodate.

Basically this allows the copying of data before any allocation is
actually done, but also allows ENOSPC to be detected before starting
the copy.  The filesystem can call whatever helpers it needs inside
->allocate(ALLOC) to set up bufferhead/extent state to match
what has been reserved/allocated/mapped in the RESERVE call.

This will work for btrfs, and it will work for XFS, and I think it
will work for other filesystems that are using bufferheads as well.
Those filesystems that only support a page at a time can continue to
use the current code, but should be able to be converted to the
multipage code by making RESERVE and UNRESERVE no-ops, while ALLOC
does what write_begin+get_blocks currently does to set up block
mappings.

Thoughts?

Cheers,

Dave.
--
Dave Chinner
david@fromorbit.com
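Restated as a C prototype, Dave's outline might look like the following.
Everything here is a sketch of a proposal that was never merged; the enum
values and the exact parameter list are assumptions transcribed from the
pseudocode above.

	enum alloc_cmd {
		RESERVE,	/* reserve space before the data is copied */
		ALLOC,		/* commit the reservation after the copy   */
		UNRESERVE,	/* return unused reservation on failure    */
	};

	/*
	 * Proposed replacement for the get_blocks() callback, promoted to
	 * a real address_space operation.  With pages == NULL the call
	 * operates on the reservation only; with pages it can set up and
	 * map bufferheads/extents over the copied range.  Returns the
	 * length acted upon, or a negative errno (-ENOSPC surfaces at
	 * RESERVE time, before any data has been copied).
	 */
	ssize_t (*allocate)(struct address_space *mapping,
			    struct page **pages, loff_t off, size_t len,
			    enum alloc_cmd cmd);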
* Re: [RFC] new ->perform_write fop
From: Josef Bacik @ 2010-05-14 3:30 UTC
To: Dave Chinner
Cc: Josef Bacik, linux-fsdevel, chris.mason, hch, akpm, npiggin, linux-kernel

On Fri, May 14, 2010 at 11:00:42AM +1000, Dave Chinner wrote:
> On Wed, May 12, 2010 at 09:39:27PM -0400, Josef Bacik wrote:
> > On Wed, May 12, 2010 at 05:24:04PM -0400, Josef Bacik wrote:
> > > Hello,
> > >
> > > I just started adding aio_write to Btrfs and I noticed we're
> > > duplicating _a lot_ of the generic stuff in mm/filemap.c, even though
> > > the only thing that's really unique is the fact that we copy userspace
> > > pages in chunks rather than one page at a time.  What would be best is
> > > instead of doing write_begin/write_end with Btrfs, it would be nice if
> > > we could just do our own perform_write instead of
> > > generic_perform_write.  This way we can drop all of these generic
> > > checks that we copied from filemap.c and just get to the business of
> > > actually writing the data.  I hate to add another file operation, but
> > > it would _greatly_ reduce the amount of duplicate code we have.  If
> > > there is no violent objection to this I can put something together
> > > quickly for review.  Thanks,
> > >
> >
> > I just got a suggestion from hpa: instead, move what Btrfs does into
> > generic_perform_write.  What btrfs does is allocate a chunk of pages to
> > cover the entirety of the write, set everything up, do the copy from
> > user into the pages, and tear everything down; essentially what
> > generic_perform_write does, just with more pages.
>
> Except that btrfs does things in a very different manner to most
> other filesystems ;)
>
> > I could modify generic_perform_write and the write_begin/write_end aops
> > to do this, where write_begin will return how many pages it allocated,
> > we copy all of the user data into the pages we allocated at once, and
> > then call write_end with the pages we allocated in write_begin.  Then I
> > could just make btrfs do write_begin/write_end.  So which option seems
> > more palatable?  Thanks,
>
> I can see how this would work for btrfs, but the issue is how any
> other filesystem would handle it.
>
> I've been trying to get my head around how any filesystem using
> bufferheads and generic code can do multipage writes using
> write_begin/write_end without modifying the interface, and I just
> threw away my second attempt because the error handling just
> couldn't be handled cleanly without duplicating the entire
> block_write_begin path in each filesystem that wanted to do
> multipage writes.
>
> The biggest problem is that block allocation is intermingled with
> allocating and attaching bufferheads to pages, hence error handling
> can get really nasty and is split across a call chain 3 or 4
> functions deep.  The error handling is where I'm finding all the
> dangerous and hard-to-kill demons lurking in dark corners.  I suspect
> there's a grue in there somewhere, too. ;)
>
> Separating the page+bufferhead allocation and block allocation would
> make this much simpler, but I can't fit that easily into the existing
> interfaces.
>
> Hence I think that write_begin/copy pages/write_end is not really
> suited to multipage writes when allocation is done in write_begin
> and the write can then fail in a later stage without a simple method
> of undoing the allocation.  We don't have any hole punch interfaces
> to the filesystems (and I think only XFS supports that functionality
> right now), so handling errors after allocation becomes rather
> complex, especially when you have multiple blocks per page.
>
> Hence I've independently come to the conclusion that delaying the
> allocation until *after* the copy, as btrfs does, is probably the
> best approach to take here.  This largely avoids the error handling
> complexity because the write operation is an all-or-nothing
> operation.  btrfs has separate hooks for space reservation and
> releasing the reservation and doesn't commit to actually allocating
> anything until the copying is complete.  Hence cleanup is simple no
> matter where a failure occurs.
>
> Personally, I'm tending towards killing the get_blocks() callback as
> the first step in this process - turn it into a real inode/address
> space operation (say ->allocate) so we can untangle the write path
> somewhat (lots of filesystems just provide operations as wrappers
> that feed a fs-specific get_blocks callback to generic code).  If the
> "create" flag is then converted to a "command" field, the interface
> can pass "RESERVE", "ALLOC", "CREATE", etc. to allow different
> operations to be clearly handled.
>
> e.g.:
>
> 	->allocate(mapping, NULL, off, len, RESERVE)
> 		reserves necessary space for write
> 	->write_begin
> 		grab pages into page cache
> 		attach bufferheads (if required)
> 		fail -> goto drop pages
> 	copy data into pages
> 		fail -> goto drop pages
> 	->allocate(mapping, pages, off, len, ALLOC)
> 		allocates reserved space (if required)
> 		sets up/maps/updates bufferheads/extents
> 		fail -> goto drop pages
> 	->write_end
> 		set pages dirty + uptodate
> 	done
>
> drop_pages:
> 	->allocate(mapping, NULL, off, len, UNRESERVE)
> 	if needed, zero partial pages
> 	release pages, clears uptodate.
>
> Basically this allows the copying of data before any allocation is
> actually done, but also allows ENOSPC to be detected before starting
> the copy.  The filesystem can call whatever helpers it needs inside
> ->allocate(ALLOC) to set up bufferhead/extent state to match
> what has been reserved/allocated/mapped in the RESERVE call.
>
> This will work for btrfs, and it will work for XFS, and I think it
> will work for other filesystems that are using bufferheads as well.
> Those filesystems that only support a page at a time can continue to
> use the current code, but should be able to be converted to the
> multipage code by making RESERVE and UNRESERVE no-ops, while ALLOC
> does what write_begin+get_blocks currently does to set up block
> mappings.
>
> Thoughts?
>

So this is what I had envisioned: we make write_begin take a nr_pages
pointer and tell it how much data we have to write, then in the filesystem
we allocate as many pages as we feel like, ideally something like

	min(number of pages we need for the write, some arbitrary limit
	    for security)

and then we allocate all those pages.  Then you can pass them into
block_write_begin, which will walk the pages, allocating buffer heads and
allocating the space as needed.

Now since we're coming into write_begin with "we want to write X bytes" we
can go ahead and do the enospc checks for X bytes, and then if we are good
to go, chances are we won't fail.

Except if we're overwriting a holey section of the file, we're going to be
screwed in both your way and my way.  My way probably would be the most
likely to fail, since we could fail to do the copy_from_user, but
hopefully the segment checks and doing the fault_in_readable before all of
this would keep those problems to a minimum.

In your case the only failure point is in the allocate step.  If we fail
further down the line after we've done some hole filling, we'll be hard
pressed to go back and free up those blocks.  Is that what you are talking
about: having the allocate(UNRESERVE) thing be able to go back and figure
out which ranges should have stayed holes and make them holes again?  If
that's the case then I think your idea will work and is probably the best
way to move forward.  But from what I can remember about how all this
works with buffer heads, there's not a way to say "we _just_ allocated
this recently".  Thanks,

Josef
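For comparison, the stock 2.6.34 ->write_begin hands back exactly one
page per call.  Josef's variant would look something like the following
(hypothetical prototype; the extra pages/nr_pages parameters are the only
change, and the names are assumptions):

	/* The real 2.6.34 prototype, one page per call: */
	int (*write_begin)(struct file *, struct address_space *mapping,
			   loff_t pos, unsigned len, unsigned flags,
			   struct page **pagep, void **fsdata);

	/*
	 * Hypothetical multi-page variant: the filesystem grabs up to
	 * *nr_pages locked pages covering [pos, pos + len), e.g.
	 * min(pages needed, arbitrary cap), and returns in *nr_pages how
	 * many it actually pinned, so the caller can copy the whole user
	 * buffer before a single ->write_end.
	 */
	int (*write_begin_multi)(struct file *, struct address_space *mapping,
				 loff_t pos, size_t len, unsigned flags,
				 struct page **pages, unsigned *nr_pages,
				 void **fsdata);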
* Re: [RFC] new ->perform_write fop
From: Nick Piggin @ 2010-05-14 5:50 UTC
To: Josef Bacik
Cc: Dave Chinner, linux-fsdevel, chris.mason, hch, akpm, linux-kernel

On Thu, May 13, 2010 at 11:30:57PM -0400, Josef Bacik wrote:
> On Fri, May 14, 2010 at 11:00:42AM +1000, Dave Chinner wrote:
> > On Wed, May 12, 2010 at 09:39:27PM -0400, Josef Bacik wrote:
> > > On Wed, May 12, 2010 at 05:24:04PM -0400, Josef Bacik wrote:
> > > > Hello,
> > > >
> > > > I just started adding aio_write to Btrfs and I noticed we're
> > > > duplicating _a lot_ of the generic stuff in mm/filemap.c, even
> > > > though the only thing that's really unique is the fact that we copy
> > > > userspace pages in chunks rather than one page at a time.  What
> > > > would be best is instead of doing write_begin/write_end with Btrfs,
> > > > it would be nice if we could just do our own perform_write instead
> > > > of generic_perform_write.  This way we can drop all of these generic
> > > > checks that we copied from filemap.c and just get to the business of
> > > > actually writing the data.  I hate to add another file operation,
> > > > but it would _greatly_ reduce the amount of duplicate code we have.
> > > > If there is no violent objection to this I can put something
> > > > together quickly for review.  Thanks,
> > > >
> > >
> > > I just got a suggestion from hpa: instead, move what Btrfs does into
> > > generic_perform_write.  What btrfs does is allocate a chunk of pages
> > > to cover the entirety of the write, set everything up, do the copy
> > > from user into the pages, and tear everything down; essentially what
> > > generic_perform_write does, just with more pages.

Yeah, we basically decided that perform_write is not a good entry point.
BTW, can you wrap up this generic code into libfs so you don't have to
duplicate so much of it?

> >
> > Except that btrfs does things in a very different manner to most
> > other filesystems ;)
> >
> > > I could modify generic_perform_write and the write_begin/write_end
> > > aops to do this, where write_begin will return how many pages it
> > > allocated, we copy all of the user data into the pages we allocated
> > > at once, and then call write_end with the pages we allocated in
> > > write_begin.  Then I could just make btrfs do write_begin/write_end.
> > > So which option seems more palatable?  Thanks,
> >
> > I can see how this would work for btrfs, but the issue is how any
> > other filesystem would handle it.
> >
> > I've been trying to get my head around how any filesystem using
> > bufferheads and generic code can do multipage writes using
> > write_begin/write_end without modifying the interface, and I just
> > threw away my second attempt because the error handling just
> > couldn't be handled cleanly without duplicating the entire
> > block_write_begin path in each filesystem that wanted to do
> > multipage writes.
> >
> > The biggest problem is that block allocation is intermingled with
> > allocating and attaching bufferheads to pages, hence error handling
> > can get really nasty and is split across a call chain 3 or 4
> > functions deep.  The error handling is where I'm finding all the
> > dangerous and hard-to-kill demons lurking in dark corners.  I suspect
> > there's a grue in there somewhere, too. ;)
> >
> > Separating the page+bufferhead allocation and block allocation would
> > make this much simpler, but I can't fit that easily into the existing
> > interfaces.
> >
> > Hence I think that write_begin/copy pages/write_end is not really
> > suited to multipage writes when allocation is done in write_begin
> > and the write can then fail in a later stage without a simple method
> > of undoing the allocation.  We don't have any hole punch interfaces
> > to the filesystems (and I think only XFS supports that functionality
> > right now), so handling errors after allocation becomes rather
> > complex, especially when you have multiple blocks per page.
> >
> > Hence I've independently come to the conclusion that delaying the
> > allocation until *after* the copy, as btrfs does, is probably the
> > best approach to take here.  This largely avoids the error handling
> > complexity because the write operation is an all-or-nothing
> > operation.  btrfs has separate hooks for space reservation and
> > releasing the reservation and doesn't commit to actually allocating
> > anything until the copying is complete.  Hence cleanup is simple no
> > matter where a failure occurs.
> >
> > Personally, I'm tending towards killing the get_blocks() callback as
> > the first step in this process - turn it into a real inode/address
> > space operation (say ->allocate) so we can untangle the write path
> > somewhat (lots of filesystems just provide operations as wrappers
> > that feed a fs-specific get_blocks callback to generic code).  If the
> > "create" flag is then converted to a "command" field, the interface
> > can pass "RESERVE", "ALLOC", "CREATE", etc. to allow different
> > operations to be clearly handled.
> >
> > e.g.:
> >
> > 	->allocate(mapping, NULL, off, len, RESERVE)
> > 		reserves necessary space for write
> > 	->write_begin
> > 		grab pages into page cache
> > 		attach bufferheads (if required)
> > 		fail -> goto drop pages
> > 	copy data into pages
> > 		fail -> goto drop pages
> > 	->allocate(mapping, pages, off, len, ALLOC)
> > 		allocates reserved space (if required)
> > 		sets up/maps/updates bufferheads/extents
> > 		fail -> goto drop pages
> > 	->write_end
> > 		set pages dirty + uptodate
> > 	done
> >
> > drop_pages:
> > 	->allocate(mapping, NULL, off, len, UNRESERVE)
> > 	if needed, zero partial pages
> > 	release pages, clears uptodate.
> >
> > Basically this allows the copying of data before any allocation is
> > actually done, but also allows ENOSPC to be detected before starting
> > the copy.  The filesystem can call whatever helpers it needs inside
> > ->allocate(ALLOC) to set up bufferhead/extent state to match
> > what has been reserved/allocated/mapped in the RESERVE call.
> >
> > This will work for btrfs, and it will work for XFS, and I think it
> > will work for other filesystems that are using bufferheads as well.
> > Those filesystems that only support a page at a time can continue to
> > use the current code, but should be able to be converted to the
> > multipage code by making RESERVE and UNRESERVE no-ops, while ALLOC
> > does what write_begin+get_blocks currently does to set up block
> > mappings.
> >
> > Thoughts?
> >
> So this is what I had envisioned: we make write_begin take a nr_pages
> pointer and tell it how much data we have to write, then in the
> filesystem we allocate as many pages as we feel like, ideally something
> like
>
> 	min(number of pages we need for the write, some arbitrary limit
> 	    for security)
>
> and then we allocate all those pages.  Then you can pass them into
> block_write_begin, which will walk the pages, allocating buffer heads
> and allocating the space as needed.
>
> Now since we're coming into write_begin with "we want to write X bytes"
> we can go ahead and do the enospc checks for X bytes, and then if we are
> good to go, chances are we won't fail.
>
> Except if we're overwriting a holey section of the file, we're going to
> be screwed in both your way and my way.  My way probably would be the
> most likely to fail, since we could fail to do the copy_from_user, but
> hopefully the segment checks and doing the fault_in_readable before all
> of this would keep those problems to a minimum.
>
> In your case the only failure point is in the allocate step.  If we fail
> further down the line after we've done some hole filling, we'll be hard
> pressed to go back and free up those blocks.  Is that what you are
> talking about: having the allocate(UNRESERVE) thing be able to go back
> and figure out which ranges should have stayed holes and make them holes
> again?  If that's the case then I think your idea will work and is
> probably the best way to move forward.  But from what I can remember
> about how all this works with buffer heads, there's not a way to say
> "we _just_ allocated this recently".  Thanks,

Now is there really a good reason to go this way and add more to the
write_begin/write_end paths, rather than having filesystems just
implement their own write file_operations in order to do multi-block
operations?

From what I can see, the generic code is not going to be able to be
much help with error handling etc., so I would prefer to keep it as
simple as possible.  I think it is still adequate for most cases.

Take a look at how fuse does multi-page write operations.  It is about
the simplest case you can get, but it still requires all the generic
checks etc., and it is quite neat -- I don't see a big issue with
duplicating generic code?
* Re: [RFC] new ->perform_write fop
From: Dave Chinner @ 2010-05-14 7:20 UTC
To: Nick Piggin
Cc: Josef Bacik, linux-fsdevel, chris.mason, hch, akpm, linux-kernel

On Fri, May 14, 2010 at 03:50:54PM +1000, Nick Piggin wrote:
> On Thu, May 13, 2010 at 11:30:57PM -0400, Josef Bacik wrote:
> > On Fri, May 14, 2010 at 11:00:42AM +1000, Dave Chinner wrote:
> > > On Wed, May 12, 2010 at 09:39:27PM -0400, Josef Bacik wrote:
> > > > On Wed, May 12, 2010 at 05:24:04PM -0400, Josef Bacik wrote:
> > > > I could modify generic_perform_write and the write_begin/write_end
> > > > aops to do this, where write_begin will return how many pages it
> > > > allocated, we copy all of the user data into the pages we allocated
> > > > at once, and then call write_end with the pages we allocated in
> > > > write_begin.  Then I could just make btrfs do write_begin/write_end.
> > > > So which option seems more palatable?  Thanks,
> > >
> > > I can see how this would work for btrfs, but the issue is how any
> > > other filesystem would handle it.
> > >
> > > I've been trying to get my head around how any filesystem using
> > > bufferheads and generic code can do multipage writes using
> > > write_begin/write_end without modifying the interface, and I just
> > > threw away my second attempt because the error handling just
> > > couldn't be handled cleanly without duplicating the entire
> > > block_write_begin path in each filesystem that wanted to do
> > > multipage writes.
> > >
> > > The biggest problem is that block allocation is intermingled with
> > > allocating and attaching bufferheads to pages, hence error handling
> > > can get really nasty and is split across a call chain 3 or 4
> > > functions deep.  The error handling is where I'm finding all the
> > > dangerous and hard-to-kill demons lurking in dark corners.  I suspect
> > > there's a grue in there somewhere, too. ;)
> > >
> > > Separating the page+bufferhead allocation and block allocation would
> > > make this much simpler, but I can't fit that easily into the existing
> > > interfaces.
> > >
> > > Hence I think that write_begin/copy pages/write_end is not really
> > > suited to multipage writes when allocation is done in write_begin
> > > and the write can then fail in a later stage without a simple method
> > > of undoing the allocation.  We don't have any hole punch interfaces
> > > to the filesystems (and I think only XFS supports that functionality
> > > right now), so handling errors after allocation becomes rather
> > > complex, especially when you have multiple blocks per page.
> > >
> > > Hence I've independently come to the conclusion that delaying the
> > > allocation until *after* the copy, as btrfs does, is probably the
> > > best approach to take here.  This largely avoids the error handling
> > > complexity because the write operation is an all-or-nothing
> > > operation.  btrfs has separate hooks for space reservation and
> > > releasing the reservation and doesn't commit to actually allocating
> > > anything until the copying is complete.  Hence cleanup is simple no
> > > matter where a failure occurs.
> > >
> > > Personally, I'm tending towards killing the get_blocks() callback as
> > > the first step in this process - turn it into a real inode/address
> > > space operation (say ->allocate) so we can untangle the write path
> > > somewhat (lots of filesystems just provide operations as wrappers
> > > that feed a fs-specific get_blocks callback to generic code).  If the
> > > "create" flag is then converted to a "command" field, the interface
> > > can pass "RESERVE", "ALLOC", "CREATE", etc. to allow different
> > > operations to be clearly handled.
> > >
> > > e.g.:
> > >
> > > 	->allocate(mapping, NULL, off, len, RESERVE)
> > > 		reserves necessary space for write
> > > 	->write_begin
> > > 		grab pages into page cache
> > > 		attach bufferheads (if required)
> > > 		fail -> goto drop pages
> > > 	copy data into pages
> > > 		fail -> goto drop pages
> > > 	->allocate(mapping, pages, off, len, ALLOC)
> > > 		allocates reserved space (if required)
> > > 		sets up/maps/updates bufferheads/extents
> > > 		fail -> goto drop pages
> > > 	->write_end
> > > 		set pages dirty + uptodate
> > > 	done
> > >
> > > drop_pages:
> > > 	->allocate(mapping, NULL, off, len, UNRESERVE)
> > > 	if needed, zero partial pages
> > > 	release pages, clears uptodate.
> > >
> > > Basically this allows the copying of data before any allocation is
> > > actually done, but also allows ENOSPC to be detected before starting
> > > the copy.  The filesystem can call whatever helpers it needs inside
> > > ->allocate(ALLOC) to set up bufferhead/extent state to match
> > > what has been reserved/allocated/mapped in the RESERVE call.
> > >
> > > This will work for btrfs, and it will work for XFS, and I think it
> > > will work for other filesystems that are using bufferheads as well.
> > > Those filesystems that only support a page at a time can continue to
> > > use the current code, but should be able to be converted to the
> > > multipage code by making RESERVE and UNRESERVE no-ops, while ALLOC
> > > does what write_begin+get_blocks currently does to set up block
> > > mappings.
> > >
> > > Thoughts?
> > >
> > So this is what I had envisioned: we make write_begin take a nr_pages
> > pointer and tell it how much data we have to write, then in the
> > filesystem we allocate as many pages as we feel like, ideally something
> > like
> >
> > 	min(number of pages we need for the write, some arbitrary limit
> > 	    for security)
> >
> > and then we allocate all those pages.  Then you can pass them into
> > block_write_begin, which will walk the pages, allocating buffer heads
> > and allocating the space as needed.
> >
> > Now since we're coming into write_begin with "we want to write X bytes"
> > we can go ahead and do the enospc checks for X bytes, and then if we
> > are good to go, chances are we won't fail.
> >
> > Except if we're overwriting a holey section of the file, we're going to
> > be screwed in both your way and my way.  My way probably would be the
> > most likely to fail, since we could fail to do the copy_from_user, but
> > hopefully the segment checks and doing the fault_in_readable before all
> > of this would keep those problems to a minimum.
> >
> > In your case the only failure point is in the allocate step.  If we
> > fail further down the line after we've done some hole filling, we'll be
> > hard pressed to go back and free up those blocks.  Is that what you are
> > talking about: having the allocate(UNRESERVE) thing be able to go back
> > and figure out which ranges should have stayed holes and make them
> > holes again?  If that's the case then I think your idea will work and
> > is probably the best way to move forward.  But from what I can remember
> > about how all this works with buffer heads, there's not a way to say
> > "we _just_ allocated this recently".  Thanks,
>
> Now is there really a good reason to go this way and add more to the
> write_begin/write_end paths?  Rather than having filesystems just
> implement their own write file_operations in order to do multi-block
> operations?

Well, if we've got xfs, btrfs, gfs2, ext4, and others all wanting to
do multipage writes, shouldn't we be trying to do it in a generic way?
Fuse doesn't have to deal with allocation of blocks in
fuse_perform_write().

> From what I can see, the generic code is not going to be able to be
> much help with error handling etc., so I would prefer to keep it as
> simple as possible.  I think it is still adequate for most cases.
>
> Take a look at how fuse does multi-page write operations.  It is about
> the simplest case you can get, but it still requires all the generic
> checks etc.

fuse_perform_write() doesn't do allocation, and so can easily abort
at the first error and just complete the writes that did succeed.
Hence I don't think it's a model that a filesystem that has to
handle space allocation can use.

> and it is quite neat -- I don't see a big issue with
> duplicating generic code?

When a large number of filesystems end up duplicating the same code,
then we should be looking at how to implement that functionality
generically, right?

Cheers,

Dave.
--
Dave Chinner
david@fromorbit.com
* Re: [RFC] new ->perform_write fop
From: Nick Piggin @ 2010-05-14 7:33 UTC
To: Dave Chinner
Cc: Josef Bacik, linux-fsdevel, chris.mason, hch, akpm, linux-kernel

On Fri, May 14, 2010 at 05:20:55PM +1000, Dave Chinner wrote:
> On Fri, May 14, 2010 at 03:50:54PM +1000, Nick Piggin wrote:
> > Now is there really a good reason to go this way and add more to the
> > write_begin/write_end paths?  Rather than having filesystems just
> > implement their own write file_operations in order to do multi-block
> > operations?
>
> Well, if we've got xfs, btrfs, gfs2, ext4, and others all wanting to
> do multipage writes, shouldn't we be trying to do it in a generic way?

If it makes sense, definitely.

> Fuse doesn't have to deal with allocation of blocks in
> fuse_perform_write().

I just can't see how the generic code can really help out with the
problem of error handling in the various parts of the allocation
operation.

> > From what I can see, the generic code is not going to be able to be
> > much help with error handling etc., so I would prefer to keep it as
> > simple as possible.  I think it is still adequate for most cases.
> >
> > Take a look at how fuse does multi-page write operations.  It is about
> > the simplest case you can get, but it still requires all the generic
> > checks etc.
>
> fuse_perform_write() doesn't do allocation, and so can easily abort
> at the first error and just complete the writes that did succeed.
> Hence I don't think it's a model that a filesystem that has to
> handle space allocation can use.

No, but it does all the _generic_ vfs checks required, which sounded
like what the btrfs folk were concerned about duplicating.  My point
was just that there isn't very much duplication really.

> > and it is quite neat -- I don't see a big issue with
> > duplicating generic code?
>
> When a large number of filesystems end up duplicating the same code,
> then we should be looking at how to implement that functionality
> generically, right?

Yes, if it captures a good chunk of common code without unduly
complicating things.  I'll be interested to see if that can be made
the case.
* Re: [RFC] new ->perform_write fop
From: Dave Chinner @ 2010-05-14 6:41 UTC
To: Josef Bacik; +Cc: linux-fsdevel, chris.mason, hch, akpm, npiggin, linux-kernel

On Thu, May 13, 2010 at 11:30:57PM -0400, Josef Bacik wrote:
> On Fri, May 14, 2010 at 11:00:42AM +1000, Dave Chinner wrote:
> > [snip: the ->allocate(RESERVE/ALLOC/UNRESERVE) proposal quoted in
> > full upthread]
> >
> > Thoughts?
> >
> So this is what I had envisioned: we make write_begin take a nr_pages
> pointer and tell it how much data we have to write, then in the
> filesystem we allocate as many pages as we feel like, ideally something
> like
>
> 	min(number of pages we need for the write, some arbitrary limit
> 	    for security)

Actually, I was thinking that the RESERVE call determines the size
of the chunk (in the order of 1-4MB maximum).  IOWs, we pass in the
start offset of the write, the entire length remaining, and the
RESERVE call determines how much it will allow in one loop:

	written = 0;
	while (bytes_remaining > 0) {
		chunklen = ->allocate(off, bytes_remaining, RESERVE);
		write_begin(&pages, off, chunklen);
		copied = copy_pages(&pages, iov_iter, chunklen);
		.....
		bytes_remaining -= copied;
		off += copied;
		written += copied;
	}

> and then we allocate all those pages.  Then you can pass them into
> block_write_begin, which will walk the pages, allocating buffer heads
> and allocating the space as needed.

Close - except that I'd like to move the block allocation out of
->write_begin until after the copy into the pages has completed.
Hence write_begin only deals with populating the page cache and
anything filesystem-specific like locking or allocating/populating
per-page structures like bufferheads.  i.e. ->write_begin does not
do anything that cannot be undone easily if the copy fails.

> Now since we're coming into write_begin with "we want to write X bytes"
> we can go ahead and do the enospc checks for X bytes, and then if we are
> good to go, chances are we won't fail.

Well, the "RESERVE" callout I proposed should mean that the
allocation that follows the copy should never fail with ENOSPC.
i.e. if there isn't space, the RESERVE call will fail with ENOSPC,
not the actual ALLOCATE call that occurs after the data is copied.

Basically I was thinking that in the XFS case, RESERVE would scan the
range for holes and reserve all blocks needed to fill all the holes
in the range.  Then the ALLOCATE call would mark them all delalloc.
The space reservation and delalloc is done in a single call right
now, but splitting them is not hard...

> Except if we're overwriting a holey section of the file, we're going to
> be screwed in both your way and my way.  My way probably would be the
> most likely to fail, since we could fail to do the copy_from_user, but
> hopefully the segment checks and doing the fault_in_readable before all
> of this would keep those problems to a minimum.

fault_in_readable() only faults the first page of the first iov
passed in.  It can still fail on any other page the iov points to.
And even then, fault_in_readable() does not pin the page it faulted
in memory, so even that can get recycled before we start the copy
and we'd still get EFAULT.

> In your case the only failure point is in the allocate step.  If we fail
> further down the line after we've done some hole filling, we'll be hard
> pressed to go back and free up those blocks.  Is that what you are
> talking about: having the allocate(UNRESERVE) thing be able to go back
> and figure out which ranges should have stayed holes and make them holes
> again?

Yeah, that's kind of what I was thinking of.  After a bit more
thought: if the allocation only covers part of the range (for
whatever reason), we can still allow that allocated part of the
write to complete successfully, and the UNRESERVE call simply
returns the unused part of the space reservation.  This would be a
partial write, and much simpler to deal with than punching out the
allocations that succeeded.

> If that's the case then I think your idea will work and is probably the
> best way to move forward.  But from what I can remember about how all
> this works with buffer heads, there's not a way to say "we _just_
> allocated this recently".  Thanks,

buffer_new() would suffice, I think.  In __block_prepare_write(),
when a new buffer is mapped, that flag is cleared.  If the buffer is
not mapped, we call get_blocks, and any buffer it allocates will be
marked buffer_new() again.  This is used to trigger special things
in __block_prepare_write(), and the flag is not cleared until
write_end (in __block_commit_write).  Hence a buffer with the new
flag set after allocation is a buffer that had space allocated...

Cheers,

Dave.
--
Dave Chinner
david@fromorbit.com
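Fleshed out into compilable C, the loop Dave sketches might look like the
following.  It is illustrative only: fs_allocate(), fs_write_begin(),
fs_write_end() and copy_pages(), plus the RESERVE/ALLOC/UNRESERVE
commands, are the hypothetical interfaces from earlier in the thread, not
existing kernel calls; only iov_iter_count() is a real 2.6.34 helper.

	#define MP_CHUNK_PAGES	256	/* hypothetical 1MB cap, 4k pages */

	static ssize_t multipage_perform_write(struct address_space *mapping,
					       struct iov_iter *i, loff_t pos)
	{
		ssize_t written = 0;

		while (iov_iter_count(i)) {
			struct page *pages[MP_CHUNK_PAGES];
			ssize_t chunklen, copied;

			/* reserve first: ENOSPC fires before any copying */
			chunklen = fs_allocate(mapping, NULL, pos,
					       iov_iter_count(i), RESERVE);
			if (chunklen < 0)
				return written ? written : chunklen;

			/* grab + lock pages; no block allocation yet */
			fs_write_begin(mapping, pos, chunklen, pages);

			copied = copy_pages(pages, i, chunklen);
			if (copied < chunklen)
				/* short copy: hand back the unused space */
				fs_allocate(mapping, NULL, pos + copied,
					    chunklen - copied, UNRESERVE);
			if (!copied)
				break;	/* EFAULT mid-iov: partial write */

			/* commit: turn the reservation into delalloc */
			fs_allocate(mapping, pages, pos, copied, ALLOC);
			fs_write_end(mapping, pos, copied, pages);

			pos += copied;
			written += copied;
		}
		return written;
	}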
* Re: [RFC] new ->perform_write fop
From: Nick Piggin @ 2010-05-14 7:22 UTC
To: Dave Chinner
Cc: Josef Bacik, linux-fsdevel, chris.mason, hch, akpm, linux-kernel

On Fri, May 14, 2010 at 04:41:45PM +1000, Dave Chinner wrote:
> On Thu, May 13, 2010 at 11:30:57PM -0400, Josef Bacik wrote:
> > So this is what I had envisioned: we make write_begin take a nr_pages
> > pointer and tell it how much data we have to write, then in the
> > filesystem we allocate as many pages as we feel like, ideally something
> > like
> >
> > 	min(number of pages we need for the write, some arbitrary limit
> > 	    for security)
>
> Actually, I was thinking that the RESERVE call determines the size
> of the chunk (in the order of 1-4MB maximum).  IOWs, we pass in the
> start offset of the write, the entire length remaining, and the
> RESERVE call determines how much it will allow in one loop:
>
> 	written = 0;
> 	while (bytes_remaining > 0) {
> 		chunklen = ->allocate(off, bytes_remaining, RESERVE);
> 		write_begin(&pages, off, chunklen);
> 		copied = copy_pages(&pages, iov_iter, chunklen);
> 		.....
> 		bytes_remaining -= copied;
> 		off += copied;
> 		written += copied;
> 	}

How much benefit are you expecting to get?  I have no problems with
non-pagecache block allocation/freeing APIs to get the most out of
amortising expensive locking or tree manipulations, but I don't know
if it is a good idea to pin a big chunk of pagecache like that.  I'd
say just do the per-page manipulations like we currently do.

> > and then we allocate all those pages.  Then you can pass them into
> > block_write_begin, which will walk the pages, allocating buffer heads
> > and allocating the space as needed.
>
> Close - except that I'd like to move the block allocation out of
> ->write_begin until after the copy into the pages has completed.
> Hence write_begin only deals with populating the page cache and
> anything filesystem-specific like locking or allocating/populating
> per-page structures like bufferheads.  i.e. ->write_begin does not
> do anything that cannot be undone easily if the copy fails.

Hmm, not so sure.  You run into other problems.

Write to a hole?  Then you have to clear page uptodate and unmap any
potential mmaps on it before the copy to pagecache, in case the
allocation fails (otherwise you have exposed inconsistent state to
userspace).

Write to a hole, then to a non-hole?  Then your write to the non-hole
pagecache cannot be recovered in case the block allocation for the
hole fails and we need to return a short write.

> > Now since we're coming into write_begin with "we want to write X bytes"
> > we can go ahead and do the enospc checks for X bytes, and then if we
> > are good to go, chances are we won't fail.
>
> Well, the "RESERVE" callout I proposed should mean that the
> allocation that follows the copy should never fail with ENOSPC.
> i.e. if there isn't space, the RESERVE call will fail with ENOSPC,
> not the actual ALLOCATE call that occurs after the data is copied.
>
> Basically I was thinking that in the XFS case, RESERVE would scan the
> range for holes and reserve all blocks needed to fill all the holes
> in the range.  Then the ALLOCATE call would mark them all delalloc.
> The space reservation and delalloc is done in a single call right
> now, but splitting them is not hard...
>
> > Except if we're overwriting a holey section of the file, we're going to
> > be screwed in both your way and my way.  My way probably would be the
> > most likely to fail, since we could fail to do the copy_from_user, but
> > hopefully the segment checks and doing the fault_in_readable before all
> > of this would keep those problems to a minimum.
>
> fault_in_readable() only faults the first page of the first iov
> passed in.  It can still fail on any other page the iov points to.
> And even then, fault_in_readable() does not pin the page it faulted
> in memory, so even that can get recycled before we start the copy
> and we'd still get EFAULT.
>
> > In your case the only failure point is in the allocate step.  If we
> > fail further down the line after we've done some hole filling, we'll be
> > hard pressed to go back and free up those blocks.  Is that what you are
> > talking about: having the allocate(UNRESERVE) thing be able to go back
> > and figure out which ranges should have stayed holes and make them
> > holes again?
>
> Yeah, that's kind of what I was thinking of.  After a bit more
> thought: if the allocation only covers part of the range (for
> whatever reason), we can still allow that allocated part of the
> write to complete successfully, and the UNRESERVE call simply
> returns the unused part of the space reservation.  This would be a
> partial write, and much simpler to deal with than punching out the
> allocations that succeeded.

Do we really need to restore holes inside i_size in error cases?
Current code does not.  It just leaves the blocks allocated and
zeroes them.
* Re: [RFC] new ->perform_write fop 2010-05-14 7:22 ` Nick Piggin @ 2010-05-14 8:38 ` Dave Chinner 2010-05-14 13:33 ` Chris Mason 2010-05-18 6:36 ` Nick Piggin 2010-05-21 15:15 ` Christoph Hellwig 1 sibling, 2 replies; 44+ messages in thread From: Dave Chinner @ 2010-05-14 8:38 UTC (permalink / raw) To: Nick Piggin Cc: Josef Bacik, linux-fsdevel, chris.mason, hch, akpm, linux-kernel On Fri, May 14, 2010 at 05:22:19PM +1000, Nick Piggin wrote: > On Fri, May 14, 2010 at 04:41:45PM +1000, Dave Chinner wrote: > > On Thu, May 13, 2010 at 11:30:57PM -0400, Josef Bacik wrote: > > > So this is what I had envisioned, we make write_begin take a nr_pages pointer > > > and tell it how much data we have to write, then in the filesystem we allocate > > > as many pages as we feel like, idealy something like > > > > > > min(number of pages we need for the write, some arbitrary limit for security) > > > > Actually, i was thinking that the RESERVE call determines the size > > of the chunk (in the order of 1-4MB maximum). IOWs, we pass in the > > start offset of the write, the entire length remaining, and the > > RESERVE call determines how much it will allow in one loop. > > > > written = 0; > > while (bytes_remaining > 0) { > > chunklen = ->allocate(off, bytes_remaining, RESERVE); > > write_begin(&pages, off, chunklen); > > copied = copy_pages(&pages, iov_iter, chunklen); > > ..... > > bytes_remaining -= copied; > > off += copied; > > written += copied; > > } > > How much benefit are you expecting to get? If the max chunk size is 4MB, then three orders of magnitudes fewer allocation calls for x86_64 (i.e. one instead of 1024). For filesystems with significant allocation overhead (like gaining cluster locks in gfs2), this will be a *massive* win. > I have no problems with > non-pagecache block allocation/freeing APIs to get the most out of > amortising expensive locking or tree manipulations, but I don't > know if it is a good idea to pin a big chunk of pagecache like that. > I'd say just do the per-page manipulations like we currently do. Can you expand on this? > > > and then we allocate all those pages. Then you can pass them into > > > block_write_begin, which will walk the pages, allocating buffer heads and > > > allocating the space as needed. > > > > Close - except that I'd like like to move the block allocation out of > > ->write_begin until after the copy into the pages has completed. > > Hence write_begin only deals with populating the page cache and > > anything filesystem specific like locking or allocating/populating > > per-page structures like bufferheads. i.e. ->write_begin does not > > do anything that cannot be undone easily if the copy fails. > > Hmm, not so sure. You run into other problems. > > Write to a hole? Then you have to clear page uptodate and unmap > any potential mmaps on it before the copy to pagecache, in case > the allocation fails (otherwise you have exposed inconsistent > state to userspace). Good call. I think that checking page_mapped() on each page as they are grabbed and locked, and if any are mapped call unmap_mapping_range() on them will work. New faults will serialise on the page locks and because we hold the page lock across the entire operation, I don't think we need to clear the uptodate flag unless we fail the write. > Write to a hole, then to a non-hole? Then your write to the non > hole pagecache cannot be recovered in case the block allocation > for the hole fails and we need to return a short write. 
The pages covering non-holes are locked, so nobody else can access them. We could simply invalidate them so the next access would force a re-read from disk. The problem I see with this, though, is multiple partial overwrites of the allocated block - we could lose changes invalidating it. I think the simplest way to deal with it is to ensure that ->allocate(RESERVE) is not allowed to return ranges that span allocated and unallocated regions. i.e. it can return an allocated region or a hole, but not a region that combines both. The worst case is that we fall back to current behaviour (an allocation call per block), but for the case we're trying to optimise (large writes into large holes) we'd still see the benefit of fewer allocation calls. FWIW, I think this simplifies error handling as well (see below). > > > Now since we're coming into write_begin with "we want to write X bytes" we can > > > go ahead and do the enospc checks for X bytes, and then if we are good to go, > > > chances are we won't fail. > > > > Well, the "RESERVE" callout I proposed should mean that the > > allocation that follows the copy should never fail with ENOSPC. i.e. > > if there isn't space, the RESERVE call will fail with ENOSPC, not > > the actual ALLOCATE call that occurs after the data is copied. > > > > Basically I was thinking that in the XFS case, RESERVE would scan the > > range for holes, and reserve all blocks needed to fill all the holes > > in the range. Then the ALLOCATE call would mark them all delalloc. > > The space reservation and delalloc is done in a single call right > > now, but splitting them is not hard... > > > > > Except if we're overwriting a holey section of the file, we're going to be > > > screwed in both your way and my way. My way probably would be the most likely > > > to fail, since we could fail to do the copy_from_user, but hopefully the segment > > > checks and doing the fault_in_readable before all of this would keep those > > > problems to a minimum. > > > > fault_in_readable() only faults the first page of the first iov > > passed in. It can still fail on any other page the iov points to. > > And even then, fault_in_readable() does not pin the page it faulted > > in memory, so even that can get recycled before we start the copy > > and we'd still get EFAULT. > > > > > In your case the only failure point is in the allocate step. If we fail on down > > > the line after we've done some hole filling, we'll be hard pressed to go back > > > and free up those blocks. Is that what you are talking about, having the > > > allocate(UNRESERVE) thing being able to go back and figure out what should have > > > been holes and make them holes again? > > > > Yeah, that's kind of what I was thinking of. After a bit more > > thought, if the allocation only covers part of the range (for > > whatever reason), we can still allow that allocated part of the > > write to complete successfully, and the UNRESERVE call simply returns > > the unused part of the space reservation. This would be a partial > > write and be much simpler to deal with than punching out the > > allocations that succeeded. > > Do we really need to restore holes inside i_size in error cases? No, we don't. That's what I was trying to say - holes that were allocated are full of good data, so we only need to deal with the holes that didn't get allocated. The range of the write that failed is only stale in memory now, so we can just invalidate pages in the range that is considered failed. 
If we limit RESERVE to return ranges of a single type of block, then this whole cleanup problem will also go away - it's back to a simple succeed or fail-and-rollback state per execution of the loop, and partial writes are those that have made at least one successful pass of the loop. > Current code does not. It just leaves the blocks allocated and > zeroes them. Yup, error handling is easy when you're only dealing with one page at a time. ;) Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 44+ messages in thread
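To make these rollback semantics concrete, here is a hedged sketch of the loop with the single-extent-type constraint folded in. Everything in it - the per-fs allocate hook, the RESERVE/ALLOCATE/UNRESERVE commands, copy_pages() - is the hypothetical interface being designed in this thread, not an existing kernel API:

#include <linux/fs.h>

/* Hypothetical interface from this thread - names are illustrative. */
enum alloc_cmd { RESERVE, ALLOCATE, UNRESERVE };

ssize_t fs_allocate(struct inode *inode, loff_t off, size_t len,
		    enum alloc_cmd cmd);		/* per-fs hook */
ssize_t copy_pages(struct page **pages, struct iov_iter *i,
		   loff_t off, size_t len);		/* copy helper */

static ssize_t multipage_write(struct inode *inode, struct page **pages,
			       struct iov_iter *i, loff_t off)
{
	ssize_t written = 0;
	size_t bytes_remaining = iov_iter_count(i);

	while (bytes_remaining > 0) {
		/*
		 * RESERVE clamps the chunk (1-4MB max) and, per the
		 * constraint above, never returns a range that mixes
		 * allocated blocks and holes.
		 */
		ssize_t chunklen = fs_allocate(inode, off, bytes_remaining,
					       RESERVE);
		ssize_t copied;

		if (chunklen < 0)
			return written ? written : chunklen;

		/* grab/lock pages, copy, then make the reservation real */
		copied = copy_pages(pages, i, off, chunklen);
		if (copied > 0)
			fs_allocate(inode, off, copied, ALLOCATE);
		if (copied < chunklen) {
			/* per-chunk rollback: hand back unused space */
			fs_allocate(inode, off + copied,
				    chunklen - copied, UNRESERVE);
			return written + (copied > 0 ? copied : 0);
		}
		bytes_remaining -= copied;
		off += copied;
		written += copied;
	}
	return written;
}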
* Re: [RFC] new ->perform_write fop 2010-05-14 8:38 ` Dave Chinner @ 2010-05-14 13:33 ` Chris Mason 2010-05-18 6:36 ` Nick Piggin 1 sibling, 0 replies; 44+ messages in thread From: Chris Mason @ 2010-05-14 13:33 UTC (permalink / raw) To: Dave Chinner Cc: Nick Piggin, Josef Bacik, linux-fsdevel, hch, akpm, linux-kernel On Fri, May 14, 2010 at 06:38:21PM +1000, Dave Chinner wrote: > On Fri, May 14, 2010 at 05:22:19PM +1000, Nick Piggin wrote: > > On Fri, May 14, 2010 at 04:41:45PM +1000, Dave Chinner wrote: > > > On Thu, May 13, 2010 at 11:30:57PM -0400, Josef Bacik wrote: > > > > So this is what I had envisioned, we make write_begin take a nr_pages pointer > > > > and tell it how much data we have to write, then in the filesystem we allocate > > > > as many pages as we feel like, ideally something like > > > > > > > > min(number of pages we need for the write, some arbitrary limit for security) > > > > > > Actually, I was thinking that the RESERVE call determines the size > > > of the chunk (in the order of 1-4MB maximum). IOWs, we pass in the > > > start offset of the write, the entire length remaining, and the > > > RESERVE call determines how much it will allow in one loop. > > > > > > written = 0; > > > while (bytes_remaining > 0) { > > > chunklen = ->allocate(off, bytes_remaining, RESERVE); > > > write_begin(&pages, off, chunklen); > > > copied = copy_pages(&pages, iov_iter, chunklen); > > > ..... > > > bytes_remaining -= copied; > > > off += copied; > > > written += copied; > > > } > > > > How much benefit are you expecting to get? > > If the max chunk size is 4MB, then three orders of magnitude fewer > allocation calls for x86_64 (i.e. one instead of 1024). For > filesystems with significant allocation overhead (like gaining > cluster locks in gfs2), this will be a *massive* win. It's a pretty big deal in btrfs too. A 4K write is much less expensive than it used to be, but the part where we mark a range of bytes as delayed allocation goes faster if that range is bigger. -chris ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [RFC] new ->perform_write fop 2010-05-14 8:38 ` Dave Chinner 2010-05-14 13:33 ` Chris Mason @ 2010-05-18 6:36 ` Nick Piggin 2010-05-18 8:05 ` Dave Chinner 1 sibling, 1 reply; 44+ messages in thread From: Nick Piggin @ 2010-05-18 6:36 UTC (permalink / raw) To: Dave Chinner Cc: Josef Bacik, linux-fsdevel, chris.mason, hch, akpm, linux-kernel On Fri, May 14, 2010 at 06:38:21PM +1000, Dave Chinner wrote: > On Fri, May 14, 2010 at 05:22:19PM +1000, Nick Piggin wrote: > > On Fri, May 14, 2010 at 04:41:45PM +1000, Dave Chinner wrote: > > > On Thu, May 13, 2010 at 11:30:57PM -0400, Josef Bacik wrote: > > > > So this is what I had envisioned, we make write_begin take a nr_pages pointer > > > > and tell it how much data we have to write, then in the filesystem we allocate > > > > as many pages as we feel like, ideally something like > > > > > > > > min(number of pages we need for the write, some arbitrary limit for security) > > > > > > Actually, I was thinking that the RESERVE call determines the size > > > of the chunk (in the order of 1-4MB maximum). IOWs, we pass in the > > > start offset of the write, the entire length remaining, and the > > > RESERVE call determines how much it will allow in one loop. > > > > > > written = 0; > > > while (bytes_remaining > 0) { > > > chunklen = ->allocate(off, bytes_remaining, RESERVE); > > > write_begin(&pages, off, chunklen); > > > copied = copy_pages(&pages, iov_iter, chunklen); > > > ..... > > > bytes_remaining -= copied; > > > off += copied; > > > written += copied; > > > } > > > > How much benefit are you expecting to get? > > If the max chunk size is 4MB, then three orders of magnitude fewer > allocation calls for x86_64 (i.e. one instead of 1024). For > filesystems with significant allocation overhead (like gaining > cluster locks in gfs2), this will be a *massive* win. > > > I have no problems with > > non-pagecache block allocation/freeing APIs to get the most out of > > amortising expensive locking or tree manipulations, but I don't > > know if it is a good idea to pin a big chunk of pagecache like that. > > I'd say just do the per-page manipulations like we currently do. > > Can you expand on this? Well you could do a large span block allocation at the beginning, and then dirty the pagecache one by one like we do right now. The only reason to do operations on multiple pages at once is if we need to lock them all. Now the fs might well have that requirement (if it is not using i_mutex for block (de)allocation serialisation), but I don't think generic code needs to be doing that. > > > > and then we allocate all those pages. Then you can pass them into > > > > block_write_begin, which will walk the pages, allocating buffer heads and > > > > allocating the space as needed. > > > > > > Close - except that I'd like to move the block allocation out of > > > ->write_begin until after the copy into the pages has completed. > > > Hence write_begin only deals with populating the page cache and > > > anything filesystem specific like locking or allocating/populating > > > per-page structures like bufferheads. i.e. ->write_begin does not > > > do anything that cannot be undone easily if the copy fails. > > > > Hmm, not so sure. You run into other problems. > > > > Write to a hole? Then you have to clear page uptodate and unmap > > any potential mmaps on it before the copy to pagecache, in case > > the allocation fails (otherwise you have exposed inconsistent > > state to userspace). > > Good call. 
I think that checking page_mapped() on each page as they > are grabbed and locked, and if any are mapped call > unmap_mapping_range() on them will work. New faults will serialise > on the page locks and because we hold the page lock across the > entire operation, That is supposed to be enough to prevent concurrent faults setting up the mapping again (which is how the fault vs truncate race was fixed). So yes that would be fine (so long as the fs DTRT in its fault API of course). > I don't think we need to clear the uptodate flag > unless we fail the write. read(2) is performed without page lock though, so it could equally expose transient data that way. > > Write to a hole, then to a non-hole? Then your write to the non > > hole pagecache cannot be recovered in case the block allocation > > for the hole fails and we need to return a short write. > > The pages covering non-holes are locked, so nobody else can access > them. We could simply invalidate them so the next access would force > a re-read from disk. The problem I see with this, though, is > multiple partial overwrites of the allocated block - we could lose > changes invalidating it. > > I think the simplest way to deal with it is to ensure that > ->allocate(RESERVE) is not allowed to return ranges that span > allocated and unallocated regions. i.e. it can return an allocated > region or a hole, but not a region that combines both. The worst > case is that we fall back to current behaviour (an allocation call > per block), but for the case we're trying to optimise (large writes > into large holes) we'd still see the benefit of fewer allocation > calls. FWIW, I think this simplifies error handling as well (see > below). Possibly is the right way to go. But why is error handling of block allocation the hard part? Probably I look at it from the other side of the equation, but publishing modifications to the pagecache seems much harder to recover from than block allocations. I don't see what changes when you go to multi-page allocation spans. Can you explain? > > > > Now since we're coming into write_begin with "we want to write X bytes" we can > > > > go ahead and do the enospc checks for X bytes, and then if we are good to go, > > > > chances are we won't fail. > > > > > > Well, the "RESERVE" callout I proposed should mean that the > > > allocation that follows the copy should never fail with ENOSPC. i.e. > > > if there isn't space, the RESERVE call will fail with ENOSPC, not > > > the actual ALLOCATE call that occurs after the data is copied. > > > > > > Basically I was thinking that in the XFS case, RESERVE would scan the > > > range for holes, and reserve all blocks needed to fill all the holes > > > in the range. Then the ALLOCATE call would mark them all delalloc. > > > The space reservation and delalloc is done in a single call right > > > now, but splitting them is not hard... > > > > > > > Except if we're overwriting a holey section of the file, we're going to be > > > > screwed in both your way and my way. My way probably would be the most likely > > > > to fail, since we could fail to do the copy_from_user, but hopefully the segment > > > > checks and doing the fault_in_readable before all of this would keep those > > > > problems to a minimum. > > > > > > fault_in_readable() only faults the first page of the first iov > > > passed in. It can still fail on any other page the iov points to. 
> > > And even then, fault_in_readable() does not pin the page it faulted > > > in memory, so even that can get recycled before we start the copy > > > and we'd still get EFAULT. > > > > > > > In your case the only failure point is in the allocate step. If we fail on down > > > > the line after we've done some hole filling, we'll be hard pressed to go back > > > > and free up those blocks. Is that what you are talking about, having the > > > > allocate(UNRESERVE) thing being able to go back and figure out what should have > > > > been holes and make them holes again? > > > > > > Yeah, that's kind of what I was thinking of. After a bit more > > > thought, if the allocation only covers part of the range (for > > > whatever reason), we can still allow that allocated part of the > > > write to complete successfully, and the UNRESERVE call simply returns > > > the unused part of the space reservation. This would be a partial > > > write and be much simpler to deal with than punching out the > > > allocations that succeeded. > > > > Do we really need to restore holes inside i_size in error cases? > > > > No, we don't. That's what I was trying to say - holes that were > > OK good. > > > allocated are full of good data, so we only need to deal with the > > holes that didn't get allocated. The range of the write that failed > > is only stale in memory now, so we can just invalidate pages in the > > range that is considered failed. > > Well invalidate isn't a good way to do this IMO. It is very hairy to put data tentatively into PageUptodate pagecache and then try to recover afterwards. Basically, once pagecache is marked uptodate, I don't think we should ever put maybe-invalid data into it -- the way to do it is to invalidate that page and put a *new* page in there. Why? Because user mappings are just one problem, but once you had a user mapping, you can have been subject to get_user_pages, so it could be in the middle of a DMA operation or something. > If we limit RESERVE to return ranges of a single type of block, then > this whole cleanup problem will also go away - it's back to a simple > succeed or fail-and-rollback state per execution of the loop, and > partial writes are those that have made at least one successful pass > of the loop. > > > Current code does not. It just leaves the blocks allocated and > > zeroes them. > > Yup, error handling is easy when you're only dealing with one page > at a time. ;) Humour me; why? :) ^ permalink raw reply [flat|nested] 44+ messages in thread
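The page_mapped()/unmap_mapping_range() idea from earlier in this subthread is small enough to sketch. Illustrative only - it assumes the multipage path grabs and locks every page in the range up front, which is exactly the design point under debate here:

#include <linux/mm.h>
#include <linux/pagemap.h>

/*
 * Grab and lock one page of the write range, shooting down any
 * existing user mappings so a failed copy can't leave inconsistent
 * data visible through mmap. New faults serialise on the page lock,
 * which the caller keeps holding for the whole operation.
 */
static int grab_and_unmap(struct address_space *mapping, pgoff_t index,
			  struct page **pagep)
{
	struct page *page;

	page = grab_cache_page_write_begin(mapping, index, 0);
	if (!page)
		return -ENOMEM;

	if (page_mapped(page))
		unmap_mapping_range(mapping,
				    (loff_t)index << PAGE_CACHE_SHIFT,
				    PAGE_CACHE_SIZE, 1);
	*pagep = page;
	return 0;
}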
* Re: [RFC] new ->perform_write fop 2010-05-18 6:36 ` Nick Piggin @ 2010-05-18 8:05 ` Dave Chinner 2010-05-18 10:43 ` Nick Piggin 0 siblings, 1 reply; 44+ messages in thread From: Dave Chinner @ 2010-05-18 8:05 UTC (permalink / raw) To: Nick Piggin Cc: Josef Bacik, linux-fsdevel, chris.mason, hch, akpm, linux-kernel On Tue, May 18, 2010 at 04:36:47PM +1000, Nick Piggin wrote: > On Fri, May 14, 2010 at 06:38:21PM +1000, Dave Chinner wrote: > > On Fri, May 14, 2010 at 05:22:19PM +1000, Nick Piggin wrote: > > > On Fri, May 14, 2010 at 04:41:45PM +1000, Dave Chinner wrote: > > > > On Thu, May 13, 2010 at 11:30:57PM -0400, Josef Bacik wrote: > > > > > So this is what I had envisioned, we make write_begin take a nr_pages pointer > > > > > and tell it how much data we have to write, then in the filesystem we allocate > > > > > as many pages as we feel like, ideally something like > > > > > > > > > > min(number of pages we need for the write, some arbitrary limit for security) > > > > > > > > Actually, I was thinking that the RESERVE call determines the size > > > > of the chunk (in the order of 1-4MB maximum). IOWs, we pass in the > > > > start offset of the write, the entire length remaining, and the > > > > RESERVE call determines how much it will allow in one loop. > > > > > > > > written = 0; > > > > while (bytes_remaining > 0) { > > > > chunklen = ->allocate(off, bytes_remaining, RESERVE); > > > > write_begin(&pages, off, chunklen); > > > > copied = copy_pages(&pages, iov_iter, chunklen); > > > > ..... > > > > bytes_remaining -= copied; > > > > off += copied; > > > > written += copied; > > > > } > > > > > > How much benefit are you expecting to get? > > > > If the max chunk size is 4MB, then three orders of magnitude fewer > > allocation calls for x86_64 (i.e. one instead of 1024). For > > filesystems with significant allocation overhead (like gaining > > cluster locks in gfs2), this will be a *massive* win. > > > > > I have no problems with > > > non-pagecache block allocation/freeing APIs to get the most out of > > > amortising expensive locking or tree manipulations, but I don't > > > know if it is a good idea to pin a big chunk of pagecache like that. > > > I'd say just do the per-page manipulations like we currently do. > > > > Can you expand on this? > > Well you could do a large span block allocation at the beginning, > and then dirty the pagecache one by one like we do right now. The problem is that if we fail to allocate a page (e.g. ENOMEM) or fail the copy (EFAULT) after the block allocation, we have to undo the allocation we have already completed. If we don't, we leave uninitialised allocations on disk that will expose stale data. In the second case (EFAULT) we might be able to zero the pages to avoid punching out blocks, but the first case where pages can't be allocated to cover the block allocated range makes it very difficult without being able to punch holes in allocated block ranges. AFAIK, only XFS and OCFS2 currently support punching out arbitrary ranges of allocated blocks from an inode - there is no VFS method for it, just an ioctl (XFS_IOC_UNRESVSP). Hence the way to avoid needing hole punching is to allocate and lock down all the pages into the page cache first, then do the copy so they fail before the allocation is done if they are going to fail. That makes it much, much easier to handle failures.... 
Remember, we still don't do the "truncate blocks past eof on failure" correctly in block_write_begin (still calls vmtruncate, not ->setattr), so we can't claim that the current way of doing things is a model of perfection that we can't improve on.... > The only reason to do operations on multiple pages at once is if > we need to lock them all. Well, to avoid the refaulting of pages we just unmapped we'd need to do that... > Now the fs might well have that requirement > (if it is not using i_mutex for block (de)allocation > serialisation), but I don't think generic code needs to be doing > that. XFS already falls into the category of a filesystem using the generic code that does not use i_mutex for allocation serialisation. I'm sure it isn't the only filesystem that this is true for, so it seems sensible for the generic code to handle this case. > > I don't think we need to clear the uptodate flag > > unless we fail the write. > > read(2) is performed without page lock though, so it could equally > expose transient data that way. fair point. > > > Write to a hole, then to a non-hole? Then your write to the non > > > hole pagecache cannot be recovered in case the block allocation > > > for the hole fails and we need to return a short write. > > > > The pages covering non-holes are locked, so nobody else can access > > them. We could simply invalidate them so the next access would force > > a re-read from disk. The problem I see with this, though, is > > multiple partial overwrites of the allocated block - we could lose > > changes invalidating it. > > > > I think the simplest way to deal with it is to ensure that > > ->allocate(RESERVE) is not allowed to return ranges that span > > allocated and unallocated regions. i.e. it can return an allocated > > region or a hole, but not a region that combines both. The worst > > case is that we fall back to current behaviour (an allocation call > > per block), but for the case we're trying to optimise (large writes > > into large holes) we'd still see the benefit of fewer allocation > > calls. FWIW, I think this simplifies error handling as well (see > > below). > > Possibly is the right way to go. But why is error handling of block > allocation the hard part? Probably I look at it from the other side > of the equation, but publishing modifications to the pagecache seems > much harder to recover from than block allocations. > I don't see what changes when you go to multi-page allocation spans. > Can you explain? The lack of an interface to support removing arbitrary ranges of allocated blocks. See above. > > > > Yeah, that's kind of what I was thinking of. After a bit more > > > > thought, if the allocation only covers part of the range (for > > > > whatever reason), we can still allow that allocated part of the > > > > write to complete successfully, and the UNRESERVE call simply returns > > > > the unused part of the space reservation. This would be a partial > > > > write and be much simpler to deal with than punching out the > > > > allocations that succeeded. > > > > > > Do we really need to restore holes inside i_size in error cases? > > > > No, we don't. That's what I was trying to say - holes that were > > OK good. > > > allocated are full of good data, so we only need to deal with the > > holes that didn't get allocated. The range of the write that failed > > is only stale in memory now, so we can just invalidate pages in the > > range that is considered failed. > > Well invalidate isn't a good way to do this IMO. 
It is very hairy to > put data tentatively into PageUptodate pagecache and then try to recover > afterwards. > > Basically, once pagecache is marked uptodate, I don't think we should > ever put maybe-invalid data into it -- the way to do it is to invalidate > that page and put a *new* page in there. Ok, so lets do that... > Why? Because user mappings are just one problem, but once you had a > user mapping, you can have been subject to get_user_pages, so it could > be in the middle of a DMA operation or something. ... because we already know this behaviour causes problems for high end enterprise level features like hardware checksumming IO paths. Hence it seems that a multipage write needs to: 1. allocate new pages 2. attach bufferheads/mapping structures to pages (if required) 3. copy data into pages 4. allocate space 5. for each old page in the range: lock page invalidate mappings clear page uptodate flag remove page from page cache 6. for each new page: map new page to allocated space lock new page insert new page into pagecache update new page state (write_end equivalent) unlock new page 7. free old pages Steps 1-4 can all fail, and can all be backed out from without changing the current state. Steps 5-7 can't fail AFAICT, so we should be able to run this safely after the allocation without needing significant error unwinding... Thoughts? Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 44+ messages in thread
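A rough rendering of steps 5-7 as code may make the "can't fail" claim easier to evaluate. This is a sketch of the proposal only: remove_from_page_cache() and add_to_page_cache_lru() are existing primitives of that era, but the helper and its calling convention are invented here, and the radix-tree insertion in step 6 would need preallocation back in step 1 to be genuinely failure-free:

#include <linux/pagemap.h>
#include <linux/swap.h>

/*
 * Sketch of steps 5-7: swap fully-copied new pages into the page cache
 * in place of the old ones. Assumes steps 1-4 succeeded, the caller
 * holds a reference on every old page, and each new page is uptodate
 * and mapped to its allocated blocks.
 */
static void replace_page_range(struct address_space *mapping,
			       struct page **old, struct page **new,
			       pgoff_t first, int nr)
{
	int n;

	for (n = 0; n < nr; n++) {			/* step 5 */
		lock_page(old[n]);
		if (page_mapped(old[n]))
			unmap_mapping_range(mapping,
				(loff_t)(first + n) << PAGE_CACHE_SHIFT,
				PAGE_CACHE_SIZE, 1);
		ClearPageUptodate(old[n]);
		remove_from_page_cache(old[n]);
		unlock_page(old[n]);
	}
	for (n = 0; n < nr; n++) {			/* step 6 */
		/* returns with the page locked; cannot fail provided
		 * the radix-tree nodes were preallocated in step 1 */
		if (add_to_page_cache_lru(new[n], mapping, first + n,
					  GFP_NOFS))
			continue;	/* see caveat in the lead-in */
		SetPageUptodate(new[n]);
		set_page_dirty(new[n]);	/* write_end equivalent */
		unlock_page(new[n]);
	}
	for (n = 0; n < nr; n++)			/* step 7 */
		page_cache_release(old[n]);
}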
* Re: [RFC] new ->perform_write fop 2010-05-18 8:05 ` Dave Chinner @ 2010-05-18 10:43 ` Nick Piggin 2010-05-18 12:27 ` Dave Chinner 0 siblings, 1 reply; 44+ messages in thread From: Nick Piggin @ 2010-05-18 10:43 UTC (permalink / raw) To: Dave Chinner Cc: Josef Bacik, linux-fsdevel, chris.mason, hch, akpm, linux-kernel On Tue, May 18, 2010 at 06:05:03PM +1000, Dave Chinner wrote: > On Tue, May 18, 2010 at 04:36:47PM +1000, Nick Piggin wrote: > > Well you could do a large span block allocation at the beginning, > > and then dirty the pagecache one by one like we do right now. > > The problem is that if we fail to allocate a page (e.g. ENOMEM) or > fail the copy (EFAULT) after the block allocation, we have to undo > the allocation we have already completed. If we don't, we leave > uninitialised allocations on disk that will expose stale data. > > In the second case (EFAULT) we might be able to zero the pages to > avoid punching out blocks, but the first case where pages can't be > allocated to cover the block allocated range makes it very > difficult without being able to punch holes in allocated block > ranges. > > AFAIK, only XFS and OCFS2 currently support punching out arbitrary > ranges of allocated blocks from an inode - there is no VFS method > for it, just an ioctl (XFS_IOC_UNRESVSP). > > Hence the way to avoid needing hole punching is to allocate and lock > down all the pages into the page cache first, then do the copy so > they fail before the allocation is done if they are going to fail. > That makes it much, much easier to handle failures.... So it is just a matter of what is exposed as a vfs interface? I would much prefer to make it a requirement to support hole punching in the block allocator if the filesystem wishes to do such writes. If you have an ->allocate(inode, RESERVE/ALLOCATE, interval) API then it makes sense to have a DEALLOCATE operation there too. That seems like it should be cleaner than to work around it if we're talking about adding new APIs anyway. > Remember, we still don't do the "truncate blocks past eof on failure" > correctly in block_write_begin (still calls vmtruncate, not > ->setattr), so we can't claim that the current way of doing things > is a model of perfection that we can't improve on.... Right, but the API (write_begin/write_end) is sufficient now to make it work correctly. What should really happen is the filesystem detects and handles these error cases. In the truncate patchset (that deprecates ->truncate and vmtruncate entirely), that is exactly what happens. > > > The only reason to do operations on multiple pages at once is if > > we need to lock them all. > > Well, to avoid the refaulting of pages we just unmapped we'd need to > do that... Well, the lock/unmap/copy/unlock could be done on a per-page basis. > > Now the fs might well have that requirement > > (if it is not using i_mutex for block (de)allocation > > serialisation), but I don't think generic code needs to be doing > > that. > > XFS already falls into the category of a filesystem using the > generic code that does not use i_mutex for allocation serialisation. > I'm sure it isn't the only filesystem that this is true for, so it > seems sensible for the generic code to handle this case. Well, does it need page lock? All pages locked concurrently in a range under which block allocation is happening? I would much prefer an allocation API that supports allocation/freeing without requiring any pagecache at all. 
> > Basically, once pagecache is marked uptodate, I don't think we should > > ever put maybe-invalid data into it -- the way to do it is to invalidate > > that page and put a *new* page in there. > > Ok, so lets do that... > > > Why? Because user mappings are just one problem, but once you had a > > user mapping, you can have been subject to get_user_pages, so it could > > be in the middle of a DMA operation or something. > > ... because we already know this behaviour causes problems for > high end enterprise level features like hardware checksumming IO > paths. > > Hence it seems that a multipage write needs to: > > 1. allocate new pages > 2. attach bufferheads/mapping structures to pages (if required) > 3. copy data into pages > 4. allocate space > 5. for each old page in the range: > lock page > invalidate mappings > clear page uptodate flag > remove page from page cache > 6. for each new page: > map new page to allocated space > lock new page > insert new page into pagecache > update new page state (write_end equivalent) > unlock new page > 7. free old pages > > Steps 1-4 can all fail, and can all be backed out from without > changing the current state. Steps 5-7 can't fail AFAICT, so we > should be able to run this safely after the allocation without > needing significant error unwinding... > > Thoughts? Possibly. The importance of hot cache is reduced, because we are doing full-page copies, and bulk copies, by definition. But it could still be an issue. The allocations and deallocations could cost a little as well. Compared to having a nice API to just do bulk allocate/free block operations and then just doing simple per-page copies, I think it doesn't look so nice. ^ permalink raw reply [flat|nested] 44+ messages in thread
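For comparison, the shape of the alternative Nick is arguing for might look like the two hypothetical hooks below - one bulk allocation, one bulk free - with the page cache still dirtied page-by-page exactly as the current write path does. All names are invented for illustration:

#include <linux/fs.h>
#include <linux/types.h>

/* Hypothetical bulk block (de)allocation ops - illustrative only. */
struct block_span_ops {
	/* reserve/allocate backing for [pos, pos + len); returns the
	 * length actually granted, or a negative errno */
	ssize_t (*alloc_span)(struct inode *inode, loff_t pos, size_t len);

	/* give back [pos, pos + len) after a failed copy - hole punch
	 * for allocated blocks, unreserve for delalloc */
	int (*free_span)(struct inode *inode, loff_t pos, size_t len);
};

Under this scheme the write path would call ->alloc_span() once, run today's per-page write_begin/copy/write_end loop over the granted range, and call ->free_span() on whatever tail the copy failed to cover.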
* Re: [RFC] new ->perform_write fop 2010-05-18 10:43 ` Nick Piggin @ 2010-05-18 12:27 ` Dave Chinner 2010-05-18 15:09 ` Nick Piggin 0 siblings, 1 reply; 44+ messages in thread From: Dave Chinner @ 2010-05-18 12:27 UTC (permalink / raw) To: Nick Piggin Cc: Josef Bacik, linux-fsdevel, chris.mason, hch, akpm, linux-kernel On Tue, May 18, 2010 at 08:43:51PM +1000, Nick Piggin wrote: > On Tue, May 18, 2010 at 06:05:03PM +1000, Dave Chinner wrote: > > On Tue, May 18, 2010 at 04:36:47PM +1000, Nick Piggin wrote: > > > Well you could do a large span block allocation at the beginning, > > > and then dirty the pagecache one by one like we do right now. > > > > The problem is that if we fail to allocate a page (e.g. ENOMEM) or > > fail the copy (EFAULT) after the block allocation, we have to undo > > the allocation we have already completed. If we don't, we leave > > uninitialised allocations on disk that will expose stale data. > > > > In the second case (EFAULT) we might be able to zero the pages to > > avoid punching out blocks, but the first case where pages can't be > > allocated to cover the block allocated range makes it very > > difficult without being able to punch holes in allocated block > > ranges. > > > > AFAIK, only XFS and OCFS2 currently support punching out arbitrary > > ranges of allocated blocks from an inode - there is no VFS method > > for it, just an ioctl (XFS_IOC_UNRESVSP). > > > > Hence the way to avoid needing hole punching is to allocate and lock > > down all the pages into the page cache first, then do the copy so > > they fail before the allocation is done if they are going to fail. > > That makes it much, much easier to handle failures.... > > So it is just a matter of what is exposed as a vfs interface? More a matter of utilising the functionality most filesystems already have and minimising the amount of churn in critical areas of filesystem code. Hole punching is not simple, and bugs will likely result in a corrupted filesystem. And the hole punching will only occur in a hard to trigger corner case, so it's likely that bugs will go undetected and filesystems will suffer from random, impossible to track down corruptions as a result. In comparison, adding reserve/unreserve functionality might cause block accounting issues if there is a bug, but it won't cause on-disk corruption that results in data loss. Hole punching is not simple or easy - it's a damn complex way to handle errors and if that's all it's required for then we've failed already. > I would much prefer to make it a requirement to support hole > punching in the block allocator if the filesystem wishes to do > such writes. I think that's an unrealistic requirement simply because it can be avoided. With a reserve/alloc/unreserve interface, btrfs will work almost unmodified, XFS will require some new wrappers and bufferhead mapping code, and ext4 and gfs2 look to be pretty much in the same boat. All relatively simple on the filesystem side. If we have to add hole punching, XFS will require an extra wrapper but btrfs, gfs2 and ext4 will have to implement hole punching from the ground up. Personally, I think that requiring hole punching is asking far too much for multipage writes, esp. given that btrfs already implements them without needing such functionality... > > > The only reason to do operations on multiple pages at once is if > > > we need to lock them all. > > > > Well, to avoid the refaulting of pages we just unmapped we'd need to > > do that... 
> > Well, the lock/unmap/copy/unlock could be done on a per-page > basis. The moment we unmap the old page we cannot unlock it until the new page is in the page cache. If we do unlock it, we risk having it faulted again before we insert the new copy. Yes, that can be done page by page, but should only be done after all the pages are allocated and copied into. FWIW, I don't think we can unmap the old page until after the entire copy is done because the old page(s) might be where we are copying from.... > > > Now the fs might well have that requirement > > > (if it is not using i_mutex for block (de)allocation > > > serialisation), but I don't think generic code needs to be doing > > > that. > > > > XFS already falls into the category of a filesystem using the > > generic code that does not use i_mutex for allocation serialisation. > > I'm sure it isn't the only filesystem that this is true for, so it > > seems sensible for the generic code to handle this case. > > Well, does it need page lock? All pages locked concurrently in > a range under which block allocation is happening? No, allocation doesn't require page locks either - XFS has its own inode locks for serialisation of allocation, truncation and hole punching. > I would much > prefer an allocation API that supports allocation/freeing > without requiring any pagecache at all. Allocation doesn't require any pagecache at all. It's the fact that the allocation needs to be synchronised with the page cache state change that requires page locks to be taken as part of the write process. > > > Basically, once pagecache is marked uptodate, I don't think we should > > > ever put maybe-invalid data into it -- the way to do it is to invalidate > > > that page and put a *new* page in there. > > > > Ok, so lets do that... > > > > > Why? Because user mappings are just one problem, but once you had a > > > user mapping, you can have been subject to get_user_pages, so it could > > > be in the middle of a DMA operation or something. > > > > ... because we already know this behaviour causes problems for > > high end enterprise level features like hardware checksumming IO > > paths. > > > > Hence it seems that a multipage write needs to: > > > > 1. allocate new pages > > 2. attach bufferheads/mapping structures to pages (if required) > > 3. copy data into pages > > 4. allocate space > > 5. for each old page in the range: > > lock page > > invalidate mappings > > clear page uptodate flag > > remove page from page cache > > 6. for each new page: > > map new page to allocated space > > lock new page > > insert new page into pagecache > > update new page state (write_end equivalent) > > unlock new page > > 7. free old pages > > > > Steps 1-4 can all fail, and can all be backed out from without > > changing the current state. Steps 5-7 can't fail AFAICT, so we > > should be able to run this safely after the allocation without > > needing significant error unwinding... > > > > Thoughts? > > Possibly. The importance of hot cache is reduced, because we are > doing full-page copies, and bulk copies, by definition. But it > could still be an issue. The allocations and deallocations could > cost a little as well. They will cost far less than the reduction in allocation overhead saves us, and there are potential optimisations there for reuse of old pages.... Cheers, Dave. 
-- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [RFC] new ->perform_write fop 2010-05-18 12:27 ` Dave Chinner @ 2010-05-18 15:09 ` Nick Piggin 2010-05-19 23:50 ` Dave Chinner 0 siblings, 1 reply; 44+ messages in thread From: Nick Piggin @ 2010-05-18 15:09 UTC (permalink / raw) To: Dave Chinner Cc: Josef Bacik, linux-fsdevel, chris.mason, hch, akpm, linux-kernel On Tue, May 18, 2010 at 10:27:14PM +1000, Dave Chinner wrote: > On Tue, May 18, 2010 at 08:43:51PM +1000, Nick Piggin wrote: > > On Tue, May 18, 2010 at 06:05:03PM +1000, Dave Chinner wrote: > > > On Tue, May 18, 2010 at 04:36:47PM +1000, Nick Piggin wrote: > > > > Well you could do a large span block allocation at the beginning, > > > > and then dirty the pagecache one by one like we do right now. > > > > > > The problem is that if we fail to allocate a page (e.g. ENOMEM) or > > > fail the copy (EFAULT) after the block allocation, we have to undo > > > the allocation we have already completed. If we don't, we leave > > > uninitialised allocations on disk that will expose stale data. > > > > > > In the second case (EFAULT) we might be able to zero the pages to > > > avoid punching out blocks, but the first case where pages can't be > > > allocated to cover the block allocated range makes it very > > > difficult without being able to punch holes in allocated block > > > ranges. > > > > > > AFAIK, only XFS and OCFS2 currently support punching out arbitrary > > > ranges of allocated blocks from an inode - there is no VFS method > > > for it, just an ioctl (XFS_IOC_UNRESVSP). > > > > > > Hence the way to avoid needing hole punching is to allocate and lock > > > down all the pages into the page cache first, then do the copy so > > > they fail before the allocation is done if they are going to fail. > > > That makes it much, much easier to handle failures.... > > > > So it is just a matter of what is exposed as a vfs interface? > > More a matter of utilising the functionality most filesystems > already have and minimising the amount of churn in critical areas of > filesystem code. Hole punching is not simple, and bugs will likely > result in a corrupted filesystem. And the hole punching will only > occur in a hard to trigger corner case, so it's likely that bugs > will go undetected and filesystems will suffer from random, > impossible to track down corruptions as a result. > > In comparison, adding reserve/unreserve functionality might cause > block accounting issues if there is a bug, but it won't cause > on-disk corruption that results in data loss. Hole punching is not > simple or easy - it's a damn complex way to handle errors and if > that's all it's required for then we've failed already. As I said, we can have a dumb fallback path for filesystems that don't implement hole punching. Clear the blocks past i_size, and zero out the allocated but not initialized blocks. There does not have to be pagecache allocated in order to do this, you could do direct IO from the zero page in order to do it. Hole punching is not only useful there, it is already exposed to userspace via MADV_REMOVE. > > I would much prefer to make it a requirement to support hole > > punching in the block allocator if the filesystem wishes to do > > such writes. > > I think that's an unrealistic requirement simply because it can be > avoided. With a reserve/alloc/unreserve interface, btrfs will work > almost unmodified, XFS will require some new wrappers and bufferhead > mapping code, and ext4 and gfs2 look to be pretty much in the same > boat. All relatively simple on the filesystem side. 
> > If we have to add hole punching, XFS will require an extra wrapper > but btrfs, gfs2 and ext4 will have to implement hole punching from > the ground up. Personally, I think that requiring hole punching is > asking far too much for multipage writes, esp. given that btrfs > already implements them without needing such functionality... I think that is the better long term choice rather than having inferior APIs and then being stuck with them for years. > > > > The only reason to do operations on multiple pages at once is if > > > > we need to lock them all. > > > > > > Well, to avoid the refaulting of pages we just unmapped we'd need to > > > do that... > > > > Well, the lock/unmap/copy/unlock could be done on a per-page > > basis. > > The moment we unmap the old page we cannot unlock it until the new > page is in the page cache. If we do unlock it, we risk having it > faulted again before we insert the new copy. Yes, that can be done > page by page, but should only be done after all the pages are > allocated and copied into. > > FWIW, I don't think we can unmap the old page until after the entire > copy is done because the old page(s) might be where we are copying > from.... Yeah true. > > > > Now the fs might well have that requirement > > > > (if it is not using i_mutex for block (de)allocation > > > > serialisation), but I don't think generic code needs to be doing > > > > that. > > > > > > XFS already falls into the category of a filesystem using the > > > generic code that does not use i_mutex for allocation serialisation. > > > I'm sure it isn't the only filesystem that this is true for, so it > > > seems sensible for the generic code to handle this case. > > > > Well, does it need page lock? All pages locked concurrently in > > a range under which block allocation is happening? > > No, allocation doesn't require page locks either - XFS has its own > inode locks for serialisation of allocation, truncation and hole > punching. Right, I was just mentioning multiple page locks could just be useful for that. I was not advocating that we only support i_mutex filesystems from the generic code. > > I would much > > prefer an allocation API that supports allocation/freeing > > without requiring any pagecache at all. > > Allocation doesn't require any pagecache at all. It's the fact that > the allocation needs to be synchronised with the page cache state > change that requires page locks to be taken as part of the write > process. When setting up the buffer state with the filesystem state. Sure. > > > > Basically, once pagecache is marked uptodate, I don't think we should > > > > ever put maybe-invalid data into it -- the way to do it is to invalidate > > > > that page and put a *new* page in there. > > > > > > Ok, so lets do that... > > > > > > > Why? Because user mappings are just one problem, but once you had a > > > > user mapping, you can have been subject to get_user_pages, so it could > > > > be in the middle of a DMA operation or something. > > > > > > ... because we already know this behaviour causes problems for > > > high end enterprise level features like hardware checksumming IO > > > paths. > > > > > > Hence it seems that a multipage write needs to: > > > > > > 1. allocate new pages > > > 2. attach bufferheads/mapping structures to pages (if required) > > > 3. copy data into pages > > > 4. allocate space > > > 5. for each old page in the range: > > > lock page > > > invalidate mappings > > > clear page uptodate flag > > > remove page from page cache > > > 6. 
for each new page: > > > map new page to allocated space > > > lock new page > > > insert new page into pagecache > > > update new page state (write_end equivalent) > > > unlock new page > > > 7. free old pages > > > > > > Steps 1-4 can all fail, and can all be backed out from without > > > changing the current state. Steps 5-7 can't fail AFAICT, so we > > > should be able to run this safely after the allocation without > > > needing significant error unwinding... > > > > > > Thoughts? > > > > Possibly. The importance of hot cache is reduced, because we are > > doing full-page copies, and bulk copies, by definition. But it > > could still be an issue. The allocations and deallocations could > > cost a little as well. > > They will cost far less than the reduction in allocation overhead > saves us, and there are potential optimisations there for reuse of > old pages.... An API that doesn't require that, though, should be less overhead and simpler. Is it really going to be a problem to implement block hole punching in ext4 and gfs2? ^ permalink raw reply [flat|nested] 44+ messages in thread
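The "dumb fallback" Nick proposes above can be sketched with stock block-layer calls of this era: write zeros over the just-allocated blocks straight from the kernel's zero page, with no pagecache involved. Illustrative only - it assumes a page-aligned, already-mapped block range and does one synchronous bio per page:

#include <linux/bio.h>
#include <linux/blkdev.h>
#include <linux/completion.h>
#include <linux/mm.h>

static void zero_fill_end_io(struct bio *bio, int err)
{
	complete(bio->bi_private);
}

/* Synchronously write PAGE_SIZE of zeros to each of nr_pages blocks. */
static int zero_fill_blocks(struct block_device *bdev, sector_t sector,
			    unsigned int nr_pages)
{
	while (nr_pages--) {
		DECLARE_COMPLETION_ONSTACK(done);
		struct bio *bio = bio_alloc(GFP_NOFS, 1);
		int uptodate;

		if (!bio)
			return -ENOMEM;
		bio->bi_bdev = bdev;
		bio->bi_sector = sector;
		bio->bi_end_io = zero_fill_end_io;
		bio->bi_private = &done;
		bio_add_page(bio, ZERO_PAGE(0), PAGE_SIZE, 0);

		submit_bio(WRITE, bio);
		wait_for_completion(&done);
		uptodate = test_bit(BIO_UPTODATE, &bio->bi_flags);
		bio_put(bio);
		if (!uptodate)
			return -EIO;
		sector += PAGE_SIZE >> 9;
	}
	return 0;
}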
* Re: [RFC] new ->perform_write fop 2010-05-18 15:09 ` Nick Piggin @ 2010-05-19 23:50 ` Dave Chinner 2010-05-20 6:48 ` Nick Piggin 2010-05-20 20:12 ` Jan Kara 0 siblings, 2 replies; 44+ messages in thread From: Dave Chinner @ 2010-05-19 23:50 UTC (permalink / raw) To: Nick Piggin Cc: Josef Bacik, linux-fsdevel, chris.mason, hch, akpm, linux-kernel On Wed, May 19, 2010 at 01:09:12AM +1000, Nick Piggin wrote: > On Tue, May 18, 2010 at 10:27:14PM +1000, Dave Chinner wrote: > > On Tue, May 18, 2010 at 08:43:51PM +1000, Nick Piggin wrote: > > > On Tue, May 18, 2010 at 06:05:03PM +1000, Dave Chinner wrote: > > > > On Tue, May 18, 2010 at 04:36:47PM +1000, Nick Piggin wrote: > > > > > Well you could do a large span block allocation at the beginning, > > > > > and then dirty the pagecache one by one like we do right now. > > > > > > > > The problem is that if we fail to allocate a page (e.g. ENOMEM) or > > > > fail the copy (EFAULT) after the block allocation, we have to undo > > > > the allocation we have already completed. If we don't, we leave > > > > uninitialised allocations on disk that will expose stale data. > > > > > > > > In the second case (EFAULT) we might be able to zero the pages to > > > > avoid punching out blocks, but the first case where pages can't be > > > > allocated to cover the block allocated range makes it very > > > > difficult without being able to punch holes in allocated block > > > > ranges. > > > > > > > > AFAIK, only XFS and OCFS2 currently support punching out arbitrary > > > > ranges of allocated blocks from an inode - there is no VFS method > > > > for it, just an ioctl (XFS_IOC_UNRESVSP). > > > > > > > > Hence the way to avoid needing hole punching is to allocate and lock > > > > down all the pages into the page cache first, then do the copy so > > > > they fail before the allocation is done if they are going to fail. > > > > That makes it much, much easier to handle failures.... > > > > > > So it is just a matter of what is exposed as a vfs interface? > > > > More a matter of utilising the functionality most filesystems > > already have and minimising the amount of churn in critical areas of > > filesystem code. Hole punching is not simple, and bugs will likely > > result in a corrupted filesystem. And the hole punching will only > > occur in a hard to trigger corner case, so it's likely that bugs > > will go undetected and filesystems will suffer from random, > > impossible to track down corruptions as a result. > > > > In comparison, adding reserve/unreserve functionality might cause > > block accounting issues if there is a bug, but it won't cause > > on-disk corruption that results in data loss. Hole punching is not > > simple or easy - it's a damn complex way to handle errors and if > > that's all it's required for then we've failed already. > > As I said, we can have a dumb fallback path for filesystems that > don't implement hole punching. Clear the blocks past i size, and > zero out the allocated but not initialized blocks. > > There does not have to be pagecache allocated in order to do this, > you could do direct IO from the zero page in order to do it. I don't see that as a good solution - it's once again a fairly complex way of dealing with the problem, especially as it now means that direct io would fall back to buffered which would fall back to direct IO.... > Hole punching is not only useful there, it is already exposed to > userspace via MADV_REMOVE. That interface is *totally broken*. 
It has all the same problems as vmtruncate() for removing file blocks (because it uses vmtruncate). It also has the fundamental problem of being called under the mmap_sem, which means that inode locks and therefore de-allocation cannot be executed without the possibility of deadlocks. Fundamentally, hole punching is an inode operation, not a VM operation.... > > > > > Basically, once pagecache is marked uptodate, I don't think we should > > > > > ever put maybe-invalid data into it -- the way to do it is to invalidate > > > > > that page and put a *new* page in there. > > > > > > > > Ok, so lets do that... > > > > > > > > > Why? Because user mappings are just one problem, but once you had a > > > > > user mapping, you can have been subject to get_user_pages, so it could > > > > > be in the middle of a DMA operation or something. > > > > > > > > ... because we already know this behaviour causes problems for > > > > high end enterprise level features like hardware checksumming IO > > > > paths. > > > > > > > > Hence it seems that a multipage write needs to: > > > > > > > > 1. allocate new pages > > > > 2. attach bufferheads/mapping structures to pages (if required) > > > > 3. copy data into pages > > > > 4. allocate space > > > > 5. for each old page in the range: > > > > lock page > > > > invalidate mappings > > > > clear page uptodate flag > > > > remove page from page cache > > > > 6. for each new page: > > > > map new page to allocated space > > > > lock new page > > > > insert new page into pagecache > > > > update new page state (write_end equivalent) > > > > unlock new page > > > > 7. free old pages > > > > > > > > Steps 1-4 can all fail, and can all be backed out from without > > > > changing the current state. Steps 5-7 can't fail AFAICT, so we > > > > should be able to run this safely after the allocation without > > > > needing significant error unwinding... > > > > > > > > Thoughts? > > > > > > Possibly. The importance of hot cache is reduced, because we are > > > doing full-page copies, and bulk copies, by definition. But it > > > could still be an issue. The allocations and deallocations could > > > cost a little as well. > > > > They will cost far less than the reduction in allocation overhead > > saves us, and there are potential optimisations there > > An API that doesn't require that, though, should be less overhead > and simpler. > > Is it really going to be a problem to implement block hole punching > in ext4 and gfs2? I can't follow the ext4 code - it's an intricate maze of weird entry and exit points, so I'm not even going to attempt to comment on it. The gfs2 code is easier to follow and it looks like it would require a redesign and rewrite of the block truncation implementation as it appears to assume that blocks are only ever removed from the end of the file - I don't think the recursive algorithms for trimming the indirect block trees can be easily modified for punching out arbitrary ranges of blocks easily. I could be wrong, though, as I'm not a gfs2 expert.... Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [RFC] new ->perform_write fop 2010-05-19 23:50 ` Dave Chinner @ 2010-05-20 6:48 ` Nick Piggin 0 siblings, 0 replies; 44+ messages in thread From: Nick Piggin @ 2010-05-20 6:48 UTC (permalink / raw) To: Dave Chinner Cc: Josef Bacik, linux-fsdevel, chris.mason, hch, akpm, linux-kernel On Thu, May 20, 2010 at 09:50:54AM +1000, Dave Chinner wrote: > > As I said, we can have a dumb fallback path for filesystems that > > don't implement hole punching. Clear the blocks past i size, and > > zero out the allocated but not initialized blocks. > > > > There does not have to be pagecache allocated in order to do this, > > you could do direct IO from the zero page in order to do it. > > I don't see that as a good solution - it's once again a fairly > complex way of dealing with the problem, especially as it now means > that direct io would fall back to buffered which would fall back to > direct IO.... Well it wouldn't use the full direct IO path. It has the block, just build a bio with the source zero page and write it out. If the fs requires anything more fancy than that, tough, it should just implement hole punching. > > Hole punching is not only useful there, it is already exposed to > > userspace via MADV_REMOVE. > > That interface is *totally broken*. Why? > It has all the same problems as > vmtruncate() for removing file blocks (because it uses vmtruncate). > It also has the fundamental problem of being called under the mmap_sem, > which means that inode locks and therefore de-allocation cannot be > executed without the possibility of deadlocks. None of that is an API problem, it's all implementation. Yes fadvise would be a much better API, but the madvise API is still there. Implementation wise: it does not use vmtruncate; it has no mmap_sem problem. > Fundamentally, hole > punching is an inode operation, not a VM operation.... VM acts as a handle to inode operations. It's no big deal. > > An API that doesn't require that, though, should be less overhead > > and simpler. > > > > Is it really going to be a problem to implement block hole punching > > in ext4 and gfs2? > > I can't follow the ext4 code - it's an intricate maze of weird entry > and exit points, so I'm not even going to attempt to comment on it. > > The gfs2 code is easier to follow and it looks like it would require > a redesign and rewrite of the block truncation implementation as it > appears to assume that blocks are only ever removed from the end of > the file - I don't think the recursive algorithms for trimming the > indirect block trees can be easily modified for punching out > arbitrary ranges of blocks easily. I could be wrong, though, as I'm > not a gfs2 expert.... I'm far more in favour of doing the interfaces right, and making the filesystems fix themselves to use it. ^ permalink raw reply [flat|nested] 44+ messages in thread
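For readers following along, both hole-punching interfaces being argued about here are already usable from userspace. A minimal illustration (error handling elided; XFS_IOC_UNRESVSP64 comes from the xfsprogs headers, and MADV_REMOVE only succeeds on filesystems that implement the underlying hole-punch hook):

#define _GNU_SOURCE
#include <sys/mman.h>
#include <sys/ioctl.h>
#include <xfs/xfs.h>		/* xfsprogs: XFS_IOC_UNRESVSP64 */

/* Punch out [off, off + len) of an XFS file via the ioctl. */
static int punch_hole_xfs(int fd, long long off, long long len)
{
	xfs_flock64_t fl = { .l_whence = 0, .l_start = off, .l_len = len };

	return ioctl(fd, XFS_IOC_UNRESVSP64, &fl);
}

/* Punch out a page-aligned range of an mmap()ed file via madvise(). */
static int punch_hole_madvise(char *map, size_t off, size_t len)
{
	return madvise(map + off, len, MADV_REMOVE);
}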
* Re: [RFC] new ->perform_write fop 2010-05-19 23:50 ` Dave Chinner 2010-05-20 6:48 ` Nick Piggin @ 2010-05-20 20:12 ` Jan Kara 2010-05-20 23:05 ` Dave Chinner 1 sibling, 1 reply; 44+ messages in thread From: Jan Kara @ 2010-05-20 20:12 UTC (permalink / raw) To: Dave Chinner Cc: Nick Piggin, Josef Bacik, linux-fsdevel, chris.mason, hch, akpm, linux-kernel On Thu 20-05-10 09:50:54, Dave Chinner wrote: > On Wed, May 19, 2010 at 01:09:12AM +1000, Nick Piggin wrote: > > On Tue, May 18, 2010 at 10:27:14PM +1000, Dave Chinner wrote: > > > On Tue, May 18, 2010 at 08:43:51PM +1000, Nick Piggin wrote: > > > > On Tue, May 18, 2010 at 06:05:03PM +1000, Dave Chinner wrote: > > > > > On Tue, May 18, 2010 at 04:36:47PM +1000, Nick Piggin wrote: > > > > > > Well you could do a large span block allocation at the beginning, > > > > > > and then dirty the pagecache one by one like we do right now. > > > > > > > > > > The problem is that if we fail to allocate a page (e.g. ENOMEM) or > > > > > fail the copy (EFAULT) after the block allocation, we have to undo > > > > > the allocation we have already completed. If we don't, we leave > > > > > uninitialised allocations on disk that will expose stale data. > > > > > > > > > > In the second case (EFAULT) we might be able to zero the pages to > > > > > avoid punching out blocks, but the first case where pages can't be > > > > > allocated to cover the block allocated range makes it very > > > > > difficult without being able to punch holes in allocated block > > > > > ranges. > > > > > > > > > > AFAIK, only XFS and OCFS2 currently support punching out arbitrary > > > > > ranges of allocated blocks from an inode - there is no VFS method > > > > > for it, just an ioctl (XFS_IOC_UNRESVSP). > > > > > > > > > > Hence the way to avoid needing hole punching is to allocate and lock > > > > > down all the pages into the page cache first, then do the copy so > > > > > they fail before the allocation is done if they are going to fail. > > > > > That makes it much, much easier to handle failures.... > > > > > > > > So it is just a matter of what is exposed as a vfs interface? > > > > > > More a matter of utilising the functionality most filesystems > > > already have and minimising the amount of churn in critical areas of > > > filesystem code. Hole punching is not simple, and bugs will likely > > > result in a corrupted filesystem. And the hole punching will only > > > occur in a hard to trigger corner case, so it's likely that bugs > > > will go undetected and filesystems will suffer from random, > > > impossible to track down corruptions as a result. > > > > > > In comparison, adding reserve/unreserve functionality might cause > > > block accounting issues if there is a bug, but it won't cause > > > on-disk corruption that results in data loss. Hole punching is not > > > simple or easy - it's a damn complex way to handle errors and if > > > that's all it's required for then we've failed already. > > > > As I said, we can have a dumb fallback path for filesystems that > > don't implement hole punching. Clear the blocks past i size, and > > zero out the allocated but not initialized blocks. > > > > There does not have to be pagecache allocated in order to do this, > > you could do direct IO from the zero page in order to do it. > > I don't see that as a good solution - it's once again a fairly > complex way of dealing with the problem, especially as it now means > that direct io would fall back to buffered which would fall back to > direct IO.... 
> > > Hole punching is not only useful there, it is already exposed to > > userspace via MADV_REMOVE. > > That interface is *totally broken*. It has all the same problems as > vmtruncate() for removing file blocks (because it uses vmtruncate). > It also has the fundamental problem of being called under the mmap_sem, > which means that inode locks and therefore de-allocation cannot be > executed without the possibility of deadlocks. Fundamentally, hole > punching is an inode operation, not a VM operation.... > > > > > > > > Basically, once pagecache is marked uptodate, I don't think we should > > > > > > ever put maybe-invalid data into it -- the way to do it is to invalidate > > > > > > that page and put a *new* page in there. > > > > > > > > > > Ok, so let's do that... > > > > > > > > > > > Why? Because user mappings are just one problem, but once you had a > > > > > > user mapping, you can have been subject to get_user_pages, so it could > > > > > > be in the middle of a DMA operation or something. > > > > > > > > > > ... because we already know this behaviour causes problems for > > > > > high end enterprise level features like hardware checksumming IO > > > > > paths. > > > > > > > > > > Hence it seems that a multipage write needs to: > > > > > > > > > > 1. allocate new pages > > > > > 2. attach bufferheads/mapping structures to pages (if required) > > > > > 3. copy data into pages > > > > > 4. allocate space > > > > > 5. for each old page in the range: > > > > > lock page > > > > > invalidate mappings > > > > > clear page uptodate flag > > > > > remove page from page cache > > > > > 6. for each new page: > > > > > map new page to allocated space > > > > > lock new page > > > > > insert new page into pagecache > > > > > update new page state (write_end equivalent) > > > > > unlock new page > > > > > 7. free old pages > > > > > > > > > > Steps 1-4 can all fail, and can all be backed out from without > > > > > changing the current state. Steps 5-7 can't fail AFAICT, so we > > > > > should be able to run this safely after the allocation without > > > > > needing significant error unwinding... > > > > > > > > > > Thoughts? > > > > > > > > Possibly. The importance of hot cache is reduced, because we are > > > > doing full-page copies, and bulk copies, by definition. But it > > > > could still be an issue. The allocations and deallocations could > > > > cost a little as well. > > > > > > They will cost far less than the reduction in allocation overhead > > > saves us, and there are potential optimisations there > > > > An API that doesn't require that, though, should be less overhead > > and simpler. > > > > Is it really going to be a problem to implement block hole punching > > in ext4 and gfs2? > > I can't follow the ext4 code - it's an intricate maze of weird entry > and exit points, so I'm not even going to attempt to comment on it. Hmm, I was thinking about it and I see two options for how to get out of these problems: a) Filesystems which are not able to handle hole punching will allow multipage writes only after EOF (which can be easily undone by truncate in case of failure). That should actually cover lots of cases we are interested in (I don't expect multipage writes to holes to be a common case). b) E.g. ext4 can do even without hole punching. It can allocate the extent as 'unwritten' and when something during the write fails, it just leaves the extent allocated and the 'unwritten' flag makes sure that any read will see zeros. 
I suppose that other filesystems that care about multipage writes are able to do similar things (e.g. btrfs can do the same as far as I remember, I'm not sure about gfs2). Honza -- Jan Kara <jack@suse.cz> SUSE Labs, CR ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [RFC] new ->perform_write fop 2010-05-20 20:12 ` Jan Kara @ 2010-05-20 23:05 ` Dave Chinner 2010-05-21 9:05 ` Steven Whitehouse ` (2 more replies) 0 siblings, 3 replies; 44+ messages in thread From: Dave Chinner @ 2010-05-20 23:05 UTC (permalink / raw) To: Jan Kara Cc: Nick Piggin, Josef Bacik, linux-fsdevel, chris.mason, hch, akpm, linux-kernel On Thu, May 20, 2010 at 10:12:32PM +0200, Jan Kara wrote: > On Thu 20-05-10 09:50:54, Dave Chinner wrote: > > On Wed, May 19, 2010 at 01:09:12AM +1000, Nick Piggin wrote: > > > On Tue, May 18, 2010 at 10:27:14PM +1000, Dave Chinner wrote: > > > > On Tue, May 18, 2010 at 08:43:51PM +1000, Nick Piggin wrote: > > > > > On Tue, May 18, 2010 at 06:05:03PM +1000, Dave Chinner wrote: > > > > > > On Tue, May 18, 2010 at 04:36:47PM +1000, Nick Piggin wrote: > > > > > > > Well you could do a large span block allocation at the beginning, > > > > > > > and then dirty the pagecache one by one like we do right now. > > > > > > > > > > > > The problem is that if we fail to allocate a page (e.g. ENOMEM) or > > > > > > fail the copy (EFAULT) after the block allocation, we have to undo > > > > > > the allocation we have already completed. If we don't, we leave > > > > > > uninitialised allocations on disk that will expose stale data. > > > > > > > > > > > > In the second case (EFAULT) we might be able to zero the pages to > > > > > > avoid punching out blocks, but the first case where pages can't be > > > > > > allocated to cover the block allocated range makes it very > > > > > > difficult without being able to punch holes in allocated block > > > > > > ranges. > > > > > > > > > > > > AFAIK, only XFS and OCFS2 currently support punching out arbitrary > > > > > > ranges of allocated blocks from an inode - there is no VFS method > > > > > > for it, just an ioctl (XFS_IOC_UNRESVSP). > > > > > > > > > > > > Hence the way to avoid needing hole punching is to allocate and lock > > > > > > down all the pages into the page cache first, then do the copy so > > > > > > they fail before the allocation is done if they are going to fail. > > > > > > That makes it much, much easier to handle failures.... > > > > > > > > > > So it is just a matter of what is exposed as a vfs interface? > > > > > > > > More a matter of utilising the functionality most filesystems > > > > already have and minimising the amount of churn in critical areas of > > > > filesystem code. Hole punching is not simple, and bugs will likely > > > > result in a corrupted filesystem. And the hole punching will only > > > > occur in a hard to trigger corner case, so it's likely that bugs > > > > will go undetected and filesystems will suffer from random, > > > > impossible to track down corruptions as a result. > > > > > > > > In comparison, adding reserve/unreserve functionality might cause > > > > block accounting issues if there is a bug, but it won't cause > > > > on-disk corruption that results in data loss. Hole punching is not > > > > simple or easy - it's a damn complex way to handle errors and if > > > > that's all it's required for then we've failed already. > > > > > > As I said, we can have a dumb fallback path for filesystems that > > > don't implement hole punching. Clear the blocks past i_size, and > > > zero out the allocated but not initialized blocks. > > > > > > There does not have to be pagecache allocated in order to do this, > > > you could do direct IO from the zero page in order to do it. 
> > > > I don't see that as a good solution - it's once again a fairly > > complex way of dealing with the problem, especially as it now means > > that direct io would fall back to buffered which would fall back to > > direct IO.... > > > > > Hole punching is not only useful there, it is already exposed to > > > userspace via MADV_REMOVE. > > > > That interface is *totally broken*. It has all the same problems as > > vmtruncate() for removing file blocks (because it uses vmtruncate). > > It also has the fundamental problem of being called under the mmap_sem, > > which means that inode locks and therefore de-allocation cannot be > > executed without the possibility of deadlocks. Fundamentally, hole > > punching is an inode operation, not a VM operation.... > > > > > > > > > > > Basically, once pagecache is marked uptodate, I don't think we should > > > > > > > > ever put maybe-invalid data into it -- the way to do it is to invalidate > > > > > > > > that page and put a *new* page in there. > > > > > > > > > > > > > > Ok, so let's do that... > > > > > > > > > > > > > > > Why? Because user mappings are just one problem, but once you had a > > > > > > > > user mapping, you can have been subject to get_user_pages, so it could > > > > > > > > be in the middle of a DMA operation or something. > > > > > > > > > > > > > > ... because we already know this behaviour causes problems for > > > > > > > high end enterprise level features like hardware checksumming IO > > > > > > > paths. > > > > > > > > > > > > > > Hence it seems that a multipage write needs to: > > > > > > > > > > > > > > 1. allocate new pages > > > > > > > 2. attach bufferheads/mapping structures to pages (if required) > > > > > > > 3. copy data into pages > > > > > > > 4. allocate space > > > > > > > 5. for each old page in the range: > > > > > > > lock page > > > > > > > invalidate mappings > > > > > > > clear page uptodate flag > > > > > > > remove page from page cache > > > > > > > 6. for each new page: > > > > > > > map new page to allocated space > > > > > > > lock new page > > > > > > > insert new page into pagecache > > > > > > > update new page state (write_end equivalent) > > > > > > > unlock new page > > > > > > > 7. free old pages > > > > > > > > > > > > > > Steps 1-4 can all fail, and can all be backed out from without > > > > > > > changing the current state. Steps 5-7 can't fail AFAICT, so we > > > > > > > should be able to run this safely after the allocation without > > > > > > > needing significant error unwinding... > > > > > > > > > > > > > > Thoughts? > > > > > > > > > > > > Possibly. The importance of hot cache is reduced, because we are > > > > > > doing full-page copies, and bulk copies, by definition. But it > > > > > > could still be an issue. The allocations and deallocations could > > > > > > cost a little as well. > > > > > > > > > > They will cost far less than the reduction in allocation overhead > > > > > saves us, and there are potential optimisations there > > > > > > > > An API that doesn't require that, though, should be less overhead > > > > and simpler. > > > > > > > > Is it really going to be a problem to implement block hole punching > > > > in ext4 and gfs2? > > > > > > I can't follow the ext4 code - it's an intricate maze of weird entry > > > and exit points, so I'm not even going to attempt to comment on it. > > Hmm, I was thinking about it and I see two options for how to get out > > of these problems: > > a) Filesystems which are not able to handle hole punching will allow > > multipage writes only after EOF (which can be easily undone by > > truncate in case of failure). 
That should actually cover lots of > > cases we are interested in (I don't expect multipage writes to holes > > to be a common case). Multipage writes to holes are a relatively common operation in the HPC space that XFS is designed for (e.g. calculations on huge sparse matrices), so I'm not really fond of this idea.... > b) E.g. ext4 can do even without hole punching. It can allocate the extent > as 'unwritten' and when something during the write fails, it just > leaves the extent allocated and the 'unwritten' flag makes sure that > any read will see zeros. I suppose that other filesystems that care > about multipage writes are able to do similar things (e.g. btrfs can > do the same as far as I remember, I'm not sure about gfs2). Allocating multipage writes as unwritten extents turns off delayed allocation and hence we'd lose all the benefits that this gives... Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 44+ messages in thread
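For comparison, the XFS_IOC_UNRESVSP ioctl mentioned earlier in the thread is the one hole-punch interface already exposed to userspace by XFS. A hedged sketch of its use, assuming the xfsprogs headers are installed:

#include <fcntl.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <xfs/xfs.h>	/* xfs_flock64_t, XFS_IOC_UNRESVSP64 */

/* Deallocate (punch out) len bytes at offset off in an XFS file;
 * reads of the punched range return zeros afterwards. */
static int punch_hole(int fd, long long off, long long len)
{
	xfs_flock64_t fl = {
		.l_whence = SEEK_SET,	/* l_start is an absolute offset */
		.l_start = off,
		.l_len = len,
	};

	return ioctl(fd, XFS_IOC_UNRESVSP64, &fl);
}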
* Re: [RFC] new ->perform_write fop 2010-05-20 23:05 ` Dave Chinner @ 2010-05-21 9:05 ` Steven Whitehouse 2010-05-21 13:50 ` Josef Bacik 2010-05-21 18:58 ` Jan Kara 2 siblings, 0 replies; 44+ messages in thread From: Steven Whitehouse @ 2010-05-21 9:05 UTC (permalink / raw) To: Dave Chinner Cc: Jan Kara, Nick Piggin, Josef Bacik, linux-fsdevel, chris.mason, hch, akpm, linux-kernel Hi, On Fri, 2010-05-21 at 09:05 +1000, Dave Chinner wrote: > On Thu, May 20, 2010 at 10:12:32PM +0200, Jan Kara wrote: > > On Thu 20-05-10 09:50:54, Dave Chinner wrote: > > > On Wed, May 19, 2010 at 01:09:12AM +1000, Nick Piggin wrote: > > > > On Tue, May 18, 2010 at 10:27:14PM +1000, Dave Chinner wrote: > > > > > > > > Is it really going to be a problem to implement block hole punching > > > > in ext4 and gfs2? > > > [snip] > > b) E.g. ext4 can do even without hole punching. It can allocate extent > > as 'unwritten' and when something during the write fails, it just > > leaves the extent allocated and the 'unwritten' flag makes sure that > > any read will see zeros. I suppose that other filesystems that care > > about multipage writes are able to do similar things (e.g. btrfs can > > do the same as far as I remember, I'm not sure about gfs2). > > Allocating multipage writes as unwritten extents turns off delayed > allocation and hence we'd lose all the benefits that this gives... It should be possible to implement hole punching in GFS2 I think. The main issue is locking order of resource groups. We have on our todo list a rewrite of the truncate/delete code which is currently used to deallocate data blocks and metadata tree blocks. The current algorithm is a rather inefficient recursive scanning of the tree which is done multiple times depending on the tree height. Adapting that to punch holes should be possible without too much effort if we need to do that. We do need to allow for the possibility that such a deallocation might have to be split into multiple transactions depending on the amount of metadata involved (for large files, this could be larger than the size of the log for example). Currently the code will split up truncates into multiple transactions which allows the deallocation to be restartable from any transaction boundary. GFS2 does not have any way to mark unwritten extents, so we cannot do delayed allocation or implement an efficient fallocate. We can do better performance-wise than just dd'ing zeros to a file for fallocate, but we'll never be able to match a fs that can mark extents unwritten in performance terms, Steve. ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [RFC] new ->perform_write fop 2010-05-20 23:05 ` Dave Chinner 2010-05-21 9:05 ` Steven Whitehouse @ 2010-05-21 13:50 ` Josef Bacik 2010-05-21 14:23 ` Nick Piggin 2010-05-22 0:31 ` Dave Chinner 2010-05-21 18:58 ` Jan Kara 2 siblings, 2 replies; 44+ messages in thread From: Josef Bacik @ 2010-05-21 13:50 UTC (permalink / raw) To: Dave Chinner Cc: Jan Kara, Nick Piggin, Josef Bacik, linux-fsdevel, chris.mason, hch, akpm, linux-kernel On Fri, May 21, 2010 at 09:05:24AM +1000, Dave Chinner wrote: > On Thu, May 20, 2010 at 10:12:32PM +0200, Jan Kara wrote: > > On Thu 20-05-10 09:50:54, Dave Chinner wrote: > > > On Wed, May 19, 2010 at 01:09:12AM +1000, Nick Piggin wrote: > > > > On Tue, May 18, 2010 at 10:27:14PM +1000, Dave Chinner wrote: > > > > > On Tue, May 18, 2010 at 08:43:51PM +1000, Nick Piggin wrote: > > > > > > On Tue, May 18, 2010 at 06:05:03PM +1000, Dave Chinner wrote: > > > > > > > On Tue, May 18, 2010 at 04:36:47PM +1000, Nick Piggin wrote: > > > > > > > > Well you could do a large span block allocation at the beginning, > > > > > > > > and then dirty the pagecache one by one like we do right now. > > > > > > > > > > > > > > The problem is that if we fail to allocate a page (e.g. ENOMEM) or > > > > > > > fail the copy (EFAULT) after the block allocation, we have to undo > > > > > > > the allocation we have already completed. If we don't, we leave > > > > > > > uninitialised allocations on disk that will expose stale data. > > > > > > > > > > > > > > In the second case (EFAULT) we might be able to zero the pages to > > > > > > > avoid punching out blocks, but the first case where pages can't be > > > > > > > allocated to cover the block allocated range makes it very > > > > > > > difficult without being able to punch holes in allocated block > > > > > > > ranges. > > > > > > > > > > > > > > AFAIK, only XFS and OCFS2 currently support punching out arbitrary > > > > > > > ranges of allocated blocks from an inode - there is no VFS method > > > > > > > for it, just an ioctl (XFS_IOC_UNRESVSP). > > > > > > > > > > > > > > Hence the way to avoid needing hole punching is to allocate and lock > > > > > > > down all the pages into the page cache first, then do the copy so > > > > > > > they fail before the allocation is done if they are going to fail. > > > > > > > That makes it much, much easier to handle failures.... > > > > > > > > > > > > So it is just a matter of what is exposed as a vfs interface? > > > > > > > > > > More a matter of utilising the functionality most filesystems > > > > > already have and minimising the amount of churn in critical areas of > > > > > filesystem code. Hole punching is not simple, and bugs will likely > > > > > result in a corrupted filesystem. And the hole punching will only > > > > > occur in a hard to trigger corner case, so it's likely that bugs > > > > > will go undetected and filesystems will suffer from random, > > > > > impossible to track down corruptions as a result. > > > > > > > > > > In comparison, adding reserve/unreserve functionality might cause > > > > > block accounting issues if there is a bug, but it won't cause > > > > > on-disk corruption that results in data loss. Hole punching is not > > > > > simple or easy - it's a damn complex way to handle errors and if > > > > > that's all it's required for then we've failed already. > > > > > > > > As I said, we can have a dumb fallback path for filesystems that > > > > don't implement hole punching. 
Clear the blocks past i_size, and > > > > zero out the allocated but not initialized blocks. > > > > > > > > There does not have to be pagecache allocated in order to do this, > > > > you could do direct IO from the zero page in order to do it. > > > > > > I don't see that as a good solution - it's once again a fairly > > > complex way of dealing with the problem, especially as it now means > > > that direct io would fall back to buffered which would fall back to > > > direct IO.... > > > > > > > Hole punching is not only useful there, it is already exposed to > > > > userspace via MADV_REMOVE. > > > > > > That interface is *totally broken*. It has all the same problems as > > > vmtruncate() for removing file blocks (because it uses vmtruncate). > > > It also has the fundamental problem of being called under the mmap_sem, > > > which means that inode locks and therefore de-allocation cannot be > > > executed without the possibility of deadlocks. Fundamentally, hole > > > punching is an inode operation, not a VM operation.... > > > > > > > > > > > > > Basically, once pagecache is marked uptodate, I don't think we should > > > > > > > > > ever put maybe-invalid data into it -- the way to do it is to invalidate > > > > > > > > > that page and put a *new* page in there. > > > > > > > > > > > > > > > > Ok, so let's do that... > > > > > > > > > > > > > > > > > Why? Because user mappings are just one problem, but once you had a > > > > > > > > > user mapping, you can have been subject to get_user_pages, so it could > > > > > > > > > be in the middle of a DMA operation or something. > > > > > > > > > > > > > > > > ... because we already know this behaviour causes problems for > > > > > > > > high end enterprise level features like hardware checksumming IO > > > > > > > > paths. > > > > > > > > > > > > > > > > Hence it seems that a multipage write needs to: > > > > > > > > > > > > > > > > 1. allocate new pages > > > > > > > > 2. attach bufferheads/mapping structures to pages (if required) > > > > > > > > 3. copy data into pages > > > > > > > > 4. allocate space > > > > > > > > 5. for each old page in the range: > > > > > > > > lock page > > > > > > > > invalidate mappings > > > > > > > > clear page uptodate flag > > > > > > > > remove page from page cache > > > > > > > > 6. for each new page: > > > > > > > > map new page to allocated space > > > > > > > > lock new page > > > > > > > > insert new page into pagecache > > > > > > > > update new page state (write_end equivalent) > > > > > > > > unlock new page > > > > > > > > 7. free old pages > > > > > > > > > > > > > > > > Steps 1-4 can all fail, and can all be backed out from without > > > > > > > > changing the current state. Steps 5-7 can't fail AFAICT, so we > > > > > > > > should be able to run this safely after the allocation without > > > > > > > > needing significant error unwinding... > > > > > > > > > > > > > > > > Thoughts? > > > > > > > > > > > > > > Possibly. The importance of hot cache is reduced, because we are > > > > > > > doing full-page copies, and bulk copies, by definition. But it > > > > > > > could still be an issue. The allocations and deallocations could > > > > > > > cost a little as well. > > > > > > > > > > > > They will cost far less than the reduction in allocation overhead > > > > > > saves us, and there are potential optimisations there > > > > > > > > An API that doesn't require that, though, should be less overhead > > > > and simpler. > > > > > > > > Is it really going to be a problem to implement block hole punching > > > > in ext4 and gfs2? 
> > > > > > I can't follow the ext4 code - it's an intricate maze of weird entry > > > and exit points, so I'm not even going to attempt to comment on it. > > Hmm, I was thinking about it and I see two options for how to get out > > of these problems: > > a) Filesystems which are not able to handle hole punching will allow > > multipage writes only after EOF (which can be easily undone by > > truncate in case of failure). That should actually cover lots of > > cases we are interested in (I don't expect multipage writes to holes > > to be a common case). > > Multipage writes to holes are a relatively common operation in the > HPC space that XFS is designed for (e.g. calculations on huge sparse > matrices), so I'm not really fond of this idea.... > > > b) E.g. ext4 can do even without hole punching. It can allocate the extent > > as 'unwritten' and when something during the write fails, it just > > leaves the extent allocated and the 'unwritten' flag makes sure that > > any read will see zeros. I suppose that other filesystems that care > > about multipage writes are able to do similar things (e.g. btrfs can > > do the same as far as I remember, I'm not sure about gfs2). > > Allocating multipage writes as unwritten extents turns off delayed > allocation and hence we'd lose all the benefits that this gives... > I just realized we have another problem, the mmap_sem/page_lock deadlock. Currently BTRFS is susceptible to this, since we don't prefault any of the pages in yet. If we're going to do multi-page writes we're going to need to have a way to fault in all of the iovecs at once, so when we do the pagefault_disable() copy pagefault_enable() we don't just end up copying the first iovec. Nick, have you done something like this already? If not I assume I can just loop through all the iovecs and call fault_in_pages_readable on all of them and be good to go right? Thanks, Josef ^ permalink raw reply [flat|nested] 44+ messages in thread
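The loop Josef is asking about would look something like the sketch below (the helper name is made up). Note the caveats raised in the replies that follow: fault_in_pages_readable() only touches the first and last byte of the range, and the pages can be reclaimed again before the copy, so this can only make the atomic copy likely to succeed, never guarantee it:

/* Hypothetical helper: prefault the user pages backing every iovec
 * before any page locks are taken. Best-effort only - the copy done
 * under pagefault_disable() must still handle a short return. */
static int prefault_iovecs(const struct iovec *iov, unsigned long nr_segs)
{
	unsigned long seg;

	for (seg = 0; seg < nr_segs; seg++) {
		if (fault_in_pages_readable(
				(const char __user *)iov[seg].iov_base,
				iov[seg].iov_len))
			return -EFAULT;
	}
	return 0;
}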
* Re: [RFC] new ->perform_write fop 2010-05-21 13:50 ` Josef Bacik @ 2010-05-21 14:23 ` Nick Piggin 2010-05-21 15:19 ` Josef Bacik 0 siblings, 1 reply; 44+ messages in thread From: Nick Piggin @ 2010-05-21 14:23 UTC (permalink / raw) To: Josef Bacik Cc: Dave Chinner, Jan Kara, linux-fsdevel, chris.mason, hch, akpm, linux-kernel On Fri, May 21, 2010 at 09:50:54AM -0400, Josef Bacik wrote: > On Fri, May 21, 2010 at 09:05:24AM +1000, Dave Chinner wrote: > > Allocating multipage writes as unwritten extents turns off delayed > > allocation and hence we'd lose all the benefits that this gives... > > > > I just realized we have another problem, the mmap_sem/page_lock deadlock. > Currently BTRFS is susceptible to this, since we don't prefault any of the pages > in yet. If we're going to do multi-page writes we're going to need to have a > way to fault in all of the iovecs at once, so when we do the > pagefault_disable() copy pagefault_enable() we don't just end up copying the > first iovec. Nick, have you done something like this already? If not I assume > I can just loop through all the iovecs and call fault_in_pages_readable on all > of them and be good to go right? Thanks, Yes, well it's a different issue. With multi-page writes, even a single iovec may not be faulted in properly. And with multiple iovecs, we are already suboptimal with faulting. Faulting in multiple iovecs may already be a good idea. I didn't add that code, I had hoped for a test case first, but perhaps we can just go and add it. With multipage writes, we would want to fault in multiple source pages at once if the design was to lock multiple pages at once and do the copy. I still think we might be able to just lock and copy one page at a time, but I could be wrong. Oh wow, btrfs is deadlocky there. Firstly, fault_in_pages_readable does not guarantee success (race can always unmap the page in the meantime). Secondly, calling it inside the page lock section just means it will cause the deadlock rather than the copy_from_user. Quick workaround to reduce probability is to do fault_in_pages_readable calls before locking the pages. But you really need to handle the short-copy case. From the error handling there, it seems like you can just free_reserved_data_space and retry the copy? ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [RFC] new ->perform_write fop 2010-05-21 14:23 ` Nick Piggin @ 2010-05-21 15:19 ` Josef Bacik 2010-05-24 3:29 ` Nick Piggin 0 siblings, 1 reply; 44+ messages in thread From: Josef Bacik @ 2010-05-21 15:19 UTC (permalink / raw) To: Nick Piggin Cc: Josef Bacik, Dave Chinner, Jan Kara, linux-fsdevel, chris.mason, hch, akpm, linux-kernel On Sat, May 22, 2010 at 12:23:54AM +1000, Nick Piggin wrote: > On Fri, May 21, 2010 at 09:50:54AM -0400, Josef Bacik wrote: > > On Fri, May 21, 2010 at 09:05:24AM +1000, Dave Chinner wrote: > > > Allocating multipage writes as unwritten extents turns off delayed > > > allocation and hence we'd lose all the benefits that this gives... > > > > > > > I just realized we have another problem, the mmap_sem/page_lock deadlock. > > Currently BTRFS is susceptible to this, since we don't prefault any of the pages > > in yet. If we're going to do multi-page writes we're going to need to have a > > way to fault in all of the iovecs at once, so when we do the > > pagefault_disable() copy pagefault_enable() we don't just end up copying the > > first iovec. Nick, have you done something like this already? If not I assume > > I can just loop through all the iovecs and call fault_in_pages_readable on all > > of them and be good to go right? Thanks, > > Yes, well it's a different issue. With multi-page writes, even a single > iovec may not be faulted in properly. And with multiple iovecs, we are > already suboptimal with faulting. > > Faulting in multiple iovecs may already be a good idea. I didn't add > that code, I had hoped for a test case first, but perhaps we can just > go and add it. > > With multipage writes, we would want to fault in multiple source pages > at once if the design was to lock multiple pages at once and do the > copy. I still think we might be able to just lock and copy one page at > a time, but I could be wrong. > I was thinking about this possibility; it seems this is what FUSE does already. It would probably be easier to deal with this fault problem, just do all the fs stuff to prepare for the total amount, do the copy one page at a time, and then if something goes wrong we can clean up properly. > Oh wow, btrfs is deadlocky there. Firstly, fault_in_pages_readable does > not guarantee success (race can always unmap the page in the meantime). > Secondly, calling it inside the page lock section just means it will > cause the deadlock rather than the copy_from_user. > > Quick workaround to reduce probability is to do fault_in_pages_readable > calls before locking the pages. > > But you really need to handle the short-copy case. From the error > handling there, it seems like you can just free_reserved_data_space and > retry the copy? > Well no, if there's a short copy we just exit. If we do the fault_in_pages_readable before we do the prepare_pages we could deal with a short-copy then. Thanks, Josef ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [RFC] new ->perform_write fop 2010-05-21 15:19 ` Josef Bacik @ 2010-05-24 3:29 ` Nick Piggin 0 siblings, 0 replies; 44+ messages in thread From: Nick Piggin @ 2010-05-24 3:29 UTC (permalink / raw) To: Josef Bacik Cc: Dave Chinner, Jan Kara, linux-fsdevel, chris.mason, hch, akpm, linux-kernel On Fri, May 21, 2010 at 11:19:22AM -0400, Josef Bacik wrote: > On Sat, May 22, 2010 at 12:23:54AM +1000, Nick Piggin wrote: > > On Fri, May 21, 2010 at 09:50:54AM -0400, Josef Bacik wrote: > > > On Fri, May 21, 2010 at 09:05:24AM +1000, Dave Chinner wrote: > > > > Allocating multipage writes as unwritten extents turns off delayed > > > > allocation and hence we'd lose all the benefits that this gives... > > > > > > > > > > I just realized we have another problem, the mmap_sem/page_lock deadlock. > > > Currently BTRFS is susceptible to this, since we don't prefault any of the pages > > > in yet. If we're going to do multi-page writes we're going to need to have a > > > way to fault in all of the iovecs at once, so when we do the > > > pagefault_disable() copy pagefault_enable() we don't just end up copying the > > > first iovec. Nick, have you done something like this already? If not I assume > > > I can just loop through all the iovecs and call fault_in_pages_readable on all > > > of them and be good to go right? Thanks, > > > > Yes, well it's a different issue. With multi-page writes, even a single > > iovec may not be faulted in properly. And with multiple iovecs, we are > > already suboptimal with faulting. > > > > Faulting in multiple iovecs may already be a good idea. I didn't add > > that code, I had hoped for a test case first, but perhaps we can just > > go and add it. > > > > With multipage writes, we would want to fault in multiple source pages > > at once if the design was to lock multiple pages at once and do the > > copy. I still think we might be able to just lock and copy one page at > > a time, but I could be wrong. > > > > I was thinking about this possibility; it seems this is what FUSE does already. > It would probably be easier to deal with this fault problem, just do all the fs > stuff to prepare for the total amount, do the copy one page at a time, and then > if something goes wrong we can clean up properly. Yep. The fewer pages we have to copy to/from at once, the more robust it should be. > > Oh wow, btrfs is deadlocky there. Firstly, fault_in_pages_readable does > > not guarantee success (race can always unmap the page in the meantime). > > Secondly, calling it inside the page lock section just means it will > > cause the deadlock rather than the copy_from_user. > > > > Quick workaround to reduce probability is to do fault_in_pages_readable > > calls before locking the pages. > > > > But you really need to handle the short-copy case. From the error > > handling there, it seems like you can just free_reserved_data_space and > > retry the copy? > > > > Well no, if there's a short copy we just exit. If we do the > fault_in_pages_readable before we do the prepare_pages we could deal with a > short-copy then. Thanks, Take a look at the code in generic buffered write. It's not hard, but it is easy to get the error case wrong. You need to fault in the source pages before holding the page lock. If this fails, then you may exit. After locking the pages, you need to do a pagefault_disable() and atomic kmaps/usercopies. If this copy comes up short, you need to unlock and retry faulting in the source pages. 
The reason for this is that the source page may be unmapped but there is still a valid memory mapping at the point of the usercopy. So you must not fail that. And, if some bytes have been copied into pagecache, and if that pagecache page is marked as uptodate, then you must treat it as a partial write. ^ permalink raw reply [flat|nested] 44+ messages in thread
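Condensed, the pattern Nick describes is the one the generic buffered write loop already uses. A simplified per-page sketch, modelled on generic_perform_write() (error paths trimmed, iov_iter state assumed to be advanced by the caller):

static ssize_t copy_one_page(struct file *file, struct address_space *mapping,
			     const struct address_space_operations *a_ops,
			     struct iov_iter *i, loff_t pos)
{
	unsigned long offset = pos & (PAGE_CACHE_SIZE - 1);
	size_t bytes = min_t(unsigned long, PAGE_CACHE_SIZE - offset,
			     iov_iter_count(i));
	struct page *page;
	void *fsdata;
	size_t copied;
	int status;

again:
	/* Prefault the source while no page lock is held. */
	if (iov_iter_fault_in_readable(i, bytes))
		return -EFAULT;

	status = a_ops->write_begin(file, mapping, pos, bytes, 0,
				    &page, &fsdata);	/* locks the page */
	if (status)
		return status;

	/* Copy with page faults disabled, so a fault on the source
	 * cannot deadlock on the page lock we now hold. */
	pagefault_disable();
	copied = iov_iter_copy_from_user_atomic(page, i, offset, bytes);
	pagefault_enable();
	flush_dcache_page(page);

	status = a_ops->write_end(file, mapping, pos, bytes, copied,
				  page, fsdata);	/* unlocks the page */
	if (status < 0)
		return status;

	if (unlikely(copied == 0)) {
		/* Short copy: the source was unmapped between the
		 * prefault and the atomic copy. Shrink the window to a
		 * single segment and retry - do not fail the write. */
		bytes = min_t(unsigned long, PAGE_CACHE_SIZE - offset,
			      iov_iter_single_seg_count(i));
		goto again;
	}
	return copied;
}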
* Re: [RFC] new ->perform_write fop 2010-05-21 13:50 ` Josef Bacik 2010-05-21 14:23 ` Nick Piggin @ 2010-05-22 0:31 ` Dave Chinner 1 sibling, 0 replies; 44+ messages in thread From: Dave Chinner @ 2010-05-22 0:31 UTC (permalink / raw) To: Josef Bacik Cc: Jan Kara, Nick Piggin, linux-fsdevel, chris.mason, hch, akpm, linux-kernel On Fri, May 21, 2010 at 09:50:54AM -0400, Josef Bacik wrote: > On Fri, May 21, 2010 at 09:05:24AM +1000, Dave Chinner wrote: > > On Thu, May 20, 2010 at 10:12:32PM +0200, Jan Kara wrote: > > > On Thu 20-05-10 09:50:54, Dave Chinner wrote: > > > b) E.g. ext4 can do even without hole punching. It can allocate the extent > > > as 'unwritten' and when something during the write fails, it just > > > leaves the extent allocated and the 'unwritten' flag makes sure that > > > any read will see zeros. I suppose that other filesystems that care > > > about multipage writes are able to do similar things (e.g. btrfs can > > > do the same as far as I remember, I'm not sure about gfs2). > > > > Allocating multipage writes as unwritten extents turns off delayed > > allocation and hence we'd lose all the benefits that this gives... > > > > I just realized we have another problem, the mmap_sem/page_lock deadlock. > Currently BTRFS is susceptible to this, since we don't prefault any of the pages > in yet. If we're going to do multi-page writes we're going to need to have a > way to fault in all of the iovecs at once, so when we do the > pagefault_disable() copy pagefault_enable() we don't just end up copying the > first iovec. Nick, have you done something like this already? I have patches that already do this, but the big issue is that it is inherently racy. The prefaulting does not guarantee that, by the time we disable page faults, the prefaulted page has not already been reclaimed. Basically we have to design for EFAULT occurring because even pre-faulting does not prevent it from occurring. > If not I assume > I can just loop through all the iovecs and call fault_in_pages_readable on all > of them and be good to go right? Thanks, That's effectively what I've done, but it's still no guarantee. Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [RFC] new ->perform_write fop 2010-05-20 23:05 ` Dave Chinner 2010-05-21 9:05 ` Steven Whitehouse 2010-05-21 13:50 ` Josef Bacik @ 2010-05-21 18:58 ` Jan Kara 2010-05-22 0:27 ` Dave Chinner 2 siblings, 1 reply; 44+ messages in thread From: Jan Kara @ 2010-05-21 18:58 UTC (permalink / raw) To: Dave Chinner Cc: Jan Kara, Nick Piggin, Josef Bacik, linux-fsdevel, chris.mason, hch, akpm, linux-kernel On Fri 21-05-10 09:05:24, Dave Chinner wrote: > On Thu, May 20, 2010 at 10:12:32PM +0200, Jan Kara wrote: > > Hmm, I was thinking about it and I see two options for how to get out > > of these problems: > > a) Filesystems which are not able to handle hole punching will allow > > multipage writes only after EOF (which can be easily undone by > > truncate in case of failure). That should actually cover lots of > > cases we are interested in (I don't expect multipage writes to holes > > to be a common case). > > Multipage writes to holes are a relatively common operation in the > HPC space that XFS is designed for (e.g. calculations on huge sparse > matrices), so I'm not really fond of this idea.... Well, XFS could still handle them because it is able to do hole punching, but I get your point. > > b) E.g. ext4 can do even without hole punching. It can allocate the extent > > as 'unwritten' and when something during the write fails, it just > > leaves the extent allocated and the 'unwritten' flag makes sure that > > any read will see zeros. I suppose that other filesystems that care > > about multipage writes are able to do similar things (e.g. btrfs can > > do the same as far as I remember, I'm not sure about gfs2). > > Allocating multipage writes as unwritten extents turns off delayed > allocation and hence we'd lose all the benefits that this gives... Ah, sorry. That was a short-circuit in my brain. But when we do delayed allocation I don't see why we should actually do any hole punching... The write needs to: a) reserve enough blocks for the write - I don't know about other filesystems but for ext4 this means just incrementing a counter. b) copy data page by page c) release part of reservation (i.e. decrement counter) if we actually copied less than we originally thought. Am I missing something? Honza -- Jan Kara <jack@suse.cz> SUSE Labs, CR ^ permalink raw reply [flat|nested] 44+ messages in thread
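Jan's steps a)-c) amount to nothing more than counter arithmetic. A minimal sketch, assuming only a per-superblock percpu counter of free blocks (the helper names are made up for illustration; the racy read is acceptable in a sketch, and ext4's real code is more careful about the percpu error bound):

/* a) reserve worst-case blocks up front; the only failure is ENOSPC */
static int fs_reserve_blocks(struct percpu_counter *free_blocks, s64 nblocks)
{
	if (percpu_counter_read_positive(free_blocks) < nblocks)
		return -ENOSPC;
	percpu_counter_sub(free_blocks, nblocks);
	return 0;
}

/* c) return whatever the copy did not consume; this cannot fail */
static void fs_unreserve_blocks(struct percpu_counter *free_blocks, s64 nblocks)
{
	percpu_counter_add(free_blocks, nblocks);
}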
* Re: [RFC] new ->perform_write fop 2010-05-21 18:58 ` Jan Kara @ 2010-05-22 0:27 ` Dave Chinner 2010-05-24 9:20 ` Jan Kara 0 siblings, 1 reply; 44+ messages in thread From: Dave Chinner @ 2010-05-22 0:27 UTC (permalink / raw) To: Jan Kara Cc: Nick Piggin, Josef Bacik, linux-fsdevel, chris.mason, hch, akpm, linux-kernel On Fri, May 21, 2010 at 08:58:46PM +0200, Jan Kara wrote: > On Fri 21-05-10 09:05:24, Dave Chinner wrote: > > On Thu, May 20, 2010 at 10:12:32PM +0200, Jan Kara wrote: > > > b) E.g. ext4 can do even without hole punching. It can allocate the extent > > > as 'unwritten' and when something during the write fails, it just > > > leaves the extent allocated and the 'unwritten' flag makes sure that > > > any read will see zeros. I suppose that other filesystems that care > > > about multipage writes are able to do similar things (e.g. btrfs can > > > do the same as far as I remember, I'm not sure about gfs2). > > > > Allocating multipage writes as unwritten extents turns off delayed > > allocation and hence we'd lose all the benefits that this gives... > Ah, sorry. That was a short-circuit in my brain. But when we do delayed allocation > I don't see why we should actually do any hole punching... The write needs > to: > a) reserve enough blocks for the write - I don't know about other > filesystems but for ext4 this means just incrementing a counter. > b) copy data page by page > c) release part of reservation (i.e. decrement counter) if we actually > copied less than we originally thought. > > Am I missing something? Possibly. Delayed allocation is made up of two parts - space reservation and recording the regions of delayed allocation in an extent tree, page/bufferhead state or both. In XFS, these two steps happen in the same get_blocks call, but the result of that is we have to truncate/punch delayed allocate extents out just like normal extents if we are not going to use them. Hence a reserve/allocate interface allows us to split the operation - reserve ensures we have space for the delayed allocation, allocate inserts the delayed extents into the inode extent tree for later real allocation during writeback. Hence the unreserve call can simply be accounting - it has no requirement to punch out delayed extents that may have already been allocated, just do work on counters. btrfs already has this split design - it reserves space, does the copy, then marks the extent ranges as delalloc once the copy has succeeded, otherwise it simply unreserves the unused space. Once again, I don't know if ext4 does this internal delayed allocation extent tracking or whether it just uses page state to track those extents, but it would probably still have to use the allocate call to mark all the pages/bufferheads as delalloc so that unreserve didn't have to do any extra work. Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 44+ messages in thread
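Expressed as an interface, the split Dave describes might look like the sketch below. This is not an existing kernel API, just the shape of the proposal; btrfs's internal helpers (e.g. the free_reserved_data_space() call mentioned earlier in the thread) already follow roughly this pattern:

/* Hypothetical write-path methods: ->reserve() does only ENOSPC/EDQUOT
 * accounting and is trivially undoable via ->unreserve(); ->allocate()
 * records the delalloc extents once the user copy has succeeded, and
 * writeback later converts them to real extents. */
struct write_ops {
	int  (*reserve)(struct inode *inode, loff_t pos, size_t len);
	int  (*allocate)(struct inode *inode, loff_t pos, size_t len);
	void (*unreserve)(struct inode *inode, loff_t pos, size_t len);
};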
* Re: [RFC] new ->perform_write fop 2010-05-22 0:27 ` Dave Chinner @ 2010-05-24 9:20 ` Jan Kara 2010-05-24 9:33 ` Nick Piggin 2010-06-05 15:05 ` tytso 0 siblings, 2 replies; 44+ messages in thread From: Jan Kara @ 2010-05-24 9:20 UTC (permalink / raw) To: Dave Chinner Cc: Jan Kara, Nick Piggin, Josef Bacik, linux-fsdevel, chris.mason, hch, akpm, linux-kernel On Sat 22-05-10 10:27:59, Dave Chinner wrote: > On Fri, May 21, 2010 at 08:58:46PM +0200, Jan Kara wrote: > > On Fri 21-05-10 09:05:24, Dave Chinner wrote: > > > On Thu, May 20, 2010 at 10:12:32PM +0200, Jan Kara wrote: > > > > b) E.g. ext4 can do even without hole punching. It can allocate the extent > > > > as 'unwritten' and when something during the write fails, it just > > > > leaves the extent allocated and the 'unwritten' flag makes sure that > > > > any read will see zeros. I suppose that other filesystems that care > > > > about multipage writes are able to do similar things (e.g. btrfs can > > > > do the same as far as I remember, I'm not sure about gfs2). > > > > > > Allocating multipage writes as unwritten extents turns off delayed > > > allocation and hence we'd lose all the benefits that this gives... > > Ah, sorry. That was a short-circuit in my brain. But when we do delayed allocation > > I don't see why we should actually do any hole punching... The write needs > > to: > > a) reserve enough blocks for the write - I don't know about other > > filesystems but for ext4 this means just incrementing a counter. > > b) copy data page by page > > c) release part of reservation (i.e. decrement counter) if we actually > > copied less than we originally thought. > > > > Am I missing something? > > Possibly. Delayed allocation is made up of two parts - space > reservation and recording the regions of delayed allocation in an > extent tree, page/bufferhead state or both. Yes. Ext4 records the info about delayed allocation only in buffer heads. > In XFS, these two steps happen in the same get_blocks call, but the > result of that is we have to truncate/punch delayed allocate extents > out just like normal extents if we are not going to use them. Hence > a reserve/allocate interface allows us to split the operation - > reserve ensures we have space for the delayed allocation, allocate > inserts the delayed extents into the inode extent tree for later > real allocation during writeback. Hence the unreserve call can > simply be accounting - it has no requirement to punch out delayed > extents that may have already been allocated, just do work on > counters. > > btrfs already has this split design - it reserves space, does the > copy, then marks the extent ranges as delalloc once the copy has > succeeded, otherwise it simply unreserves the unused space. > > Once again, I don't know if ext4 does this internal delayed > allocation extent tracking or whether it just uses page state to > track those extents, but it would probably still have to use the > allocate call to mark all the pages/bufferheads as delalloc so > that unreserve didn't have to do any extra work. Yes, exactly. I just wanted to point out that AFAICS ext4 can implement proper error recovery without a need for a 'punch' operation. So after all Nick's copy page-by-page should be plausible at least for ext4. Honza -- Jan Kara <jack@suse.cz> SUSE Labs, CR ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [RFC] new ->perform_write fop 2010-05-24 9:20 ` Jan Kara @ 2010-05-24 9:33 ` Nick Piggin 0 siblings, 0 replies; 44+ messages in thread From: Nick Piggin @ 2010-05-24 9:33 UTC (permalink / raw) To: Jan Kara Cc: Dave Chinner, Josef Bacik, linux-fsdevel, chris.mason, hch, akpm, linux-kernel On Mon, May 24, 2010 at 11:20:34AM +0200, Jan Kara wrote: > On Sat 22-05-10 10:27:59, Dave Chinner wrote: > > On Fri, May 21, 2010 at 08:58:46PM +0200, Jan Kara wrote: > > > On Fri 21-05-10 09:05:24, Dave Chinner wrote: > > > > On Thu, May 20, 2010 at 10:12:32PM +0200, Jan Kara wrote: > > > > > b) E.g. ext4 can do even without hole punching. It can allocate the extent > > > > > as 'unwritten' and when something during the write fails, it just > > > > > leaves the extent allocated and the 'unwritten' flag makes sure that > > > > > any read will see zeros. I suppose that other filesystems that care > > > > > about multipage writes are able to do similar things (e.g. btrfs can > > > > > do the same as far as I remember, I'm not sure about gfs2). > > > > > > > > Allocating multipage writes as unwritten extents turns off delayed > > > > allocation and hence we'd lose all the benefits that this gives... > > > Ah, sorry. That was a short-circuit in my brain. But when we do delayed allocation > > > I don't see why we should actually do any hole punching... The write needs > > > to: > > > a) reserve enough blocks for the write - I don't know about other > > > filesystems but for ext4 this means just incrementing a counter. > > > b) copy data page by page > > > c) release part of reservation (i.e. decrement counter) if we actually > > > copied less than we originally thought. > > > > > > Am I missing something? > > > > Possibly. Delayed allocation is made up of two parts - space > > reservation and recording the regions of delayed allocation in an > > extent tree, page/bufferhead state or both. > Yes. Ext4 records the info about delayed allocation only in buffer > heads. > > > In XFS, these two steps happen in the same get_blocks call, but the > > result of that is we have to truncate/punch delayed allocate extents > > out just like normal extents if we are not going to use them. Hence > > a reserve/allocate interface allows us to split the operation - > > reserve ensures we have space for the delayed allocation, allocate > > inserts the delayed extents into the inode extent tree for later > > real allocation during writeback. Hence the unreserve call can > > simply be accounting - it has no requirement to punch out delayed > > extents that may have already been allocated, just do work on > > counters. > > > > btrfs already has this split design - it reserves space, does the > > copy, then marks the extent ranges as delalloc once the copy has > > succeeded, otherwise it simply unreserves the unused space. > > > > Once again, I don't know if ext4 does this internal delayed > > allocation extent tracking or whether it just uses page state to > > track those extents, but it would probably still have to use the > > allocate call to mark all the pages/bufferheads as delalloc so > > that unreserve didn't have to do any extra work. > Yes, exactly. I just wanted to point out that AFAICS ext4 can implement > proper error recovery without a need for a 'punch' operation. So after all > Nick's copy page-by-page should be plausible at least for ext4. Great. 
AFAIKS, any filesystem that does not leak uninitialized data on IO error or crash when allocating writeback cache over holes should already have enough information to recover properly from a short-copy type of error today. Otherwise, an IO error or crash seems like quite a similar problem from the point of view of the filesystem. Now perhaps it can be recovered only in an fsck-type operation that is far too expensive to do in a normal error path, which sounds like XFS. So possibly we could have 2 APIs, one for filesystems like XFS, but I don't think we should penalise ones like ext4 which can handle this situation. ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [RFC] new ->perform_write fop 2010-05-24 9:20 ` Jan Kara 2010-05-24 9:33 ` Nick Piggin @ 2010-06-05 15:05 ` tytso 2010-06-06 7:59 ` Nick Piggin 1 sibling, 1 reply; 44+ messages in thread From: tytso @ 2010-06-05 15:05 UTC (permalink / raw) To: Jan Kara Cc: Dave Chinner, Nick Piggin, Josef Bacik, linux-fsdevel, chris.mason, hch, akpm, linux-kernel On Mon, May 24, 2010 at 11:20:34AM +0200, Jan Kara wrote: > Yes, exactly. I just wanted to point out that AFAICS ext4 can implement > proper error recovery without a need for a 'punch' operation. So after all > Nick's copy page-by-page should be plausible at least for ext4. Sorry for my late response to this thread; I've been busy catching up on a number of other fronts, so I didn't have a chance to go through this thread until now. First of all, I'm not against implementing a 'punch' operation for ext4; I've actually toyed with this idea before. Secondly, I'm not sure it's really necessary; we already have a code path (which I was planning on making the default when I have a chance to rewrite ext4_writepages) where the blocks are initially allocated with the 'uninitialized' flag in the extent tree; this is the same flag used for fallocate(2) support when we allocate blocks without filling in the data blocks. Then, when the block I/O completes, we use the block I/O callback to clear the uninit flag in the extent tree. This is currently used to safely avoid locking in the read path, which is needed to speed up access for extremely fast (think Fusion I/O-like) flash devices. I was already thinking about using this trick in my planned ext4_writepages() rewrite, and if it turns out we have common code that also assumes that file systems can do the equivalent of fallocate(2) and can clear the uninitialized bit on a callback, I think that makes ext4 fairly similar to what XFS does, at least at the high level, doesn't it? Note that strictly speaking this isn't a 'punch' operation in this case; it's rather an fallocate(2) where, on error, we simply don't convert the extent to mark the data blocks as valid, which is not quite the same as a 'punch' operation. Am I missing something? - Ted ^ permalink raw reply [flat|nested] 44+ messages in thread
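In outline, the trick Ted describes needs no undo path at all: the extent is allocated uninitialized, and only the I/O completion callback makes the data visible. A hypothetical sketch - every helper name here is invented for illustration; ext4's real implementation spreads this logic across its extent and end_io code:

/* Write into an uninitialized ("unwritten") extent. If anything fails
 * before completion, the extent simply stays uninitialized and reads
 * of the range keep returning zeros - no hole punch required. */
static int write_via_uninit_extent(struct inode *inode, loff_t pos, size_t len)
{
	int err;

	err = fs_alloc_extent_uninit(inode, pos, len);	/* fallocate-like */
	if (err)
		return err;
	return fs_copy_and_submit_io(inode, pos, len);
}

/* I/O completion callback: good data is on disk, so flip the flag. */
static void fs_end_io_convert(struct inode *inode, loff_t pos, size_t len)
{
	fs_convert_uninit_extent(inode, pos, len);
}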
* Re: [RFC] new ->perform_write fop 2010-06-05 15:05 ` tytso @ 2010-06-06 7:59 ` Nick Piggin 0 siblings, 0 replies; 44+ messages in thread From: Nick Piggin @ 2010-06-06 7:59 UTC (permalink / raw) To: tytso, Jan Kara, Dave Chinner, Josef Bacik, linux-fsdevel, chris.mason, hch On Sat, Jun 05, 2010 at 11:05:23AM -0400, tytso@mit.edu wrote: > On Mon, May 24, 2010 at 11:20:34AM +0200, Jan Kara wrote: > > Yes, exactly. I just wanted to point out that AFAICS ext4 can implement > > proper error recovery without a need for a 'punch' operation. So after all > > Nick's copy page-by-page should be plausible at least for ext4. > > Sorry for my late response to this thread; I've been busy catching up > on a number of other fronts, so I didn't have a chance to go through > this thread until now. > > First of all, I'm not against implementing a 'punch' operation for > ext4; I've actually toyed with this idea before. > > Secondly, I'm not sure it's really necessary; we already have a code > path (which I was planning on making the default when I have a > chance to rewrite ext4_writepages) where the blocks are initially > allocated with the 'uninitialized' flag in the extent tree; this is > the same flag used for fallocate(2) support when we allocate blocks > without filling in the data blocks. Then, when the block I/O > completes, we use the block I/O callback to clear the uninit flag in > the extent tree. This is currently used to safely avoid locking > in the read path, which is needed to speed up access for extremely > fast (think Fusion I/O-like) flash devices. > > I was already thinking about using this trick in my planned > ext4_writepages() rewrite, and if it turns out we have common code > that also assumes that file systems can do the equivalent of fallocate(2) > and can clear the uninitialized bit on a callback, I think that makes > ext4 fairly similar to what XFS does, at least at the high level, > doesn't it? > > Note that strictly speaking this isn't a 'punch' operation in this > case; it's rather an fallocate(2) where, on error, we simply don't > convert the extent to mark the data blocks as valid, which is not quite > the same as a 'punch' operation. > > Am I missing something? No, this is fine, it's actually better than a punch operation from an error recovery point of view because it wouldn't require further modifications to the filesystem in the error case. AFAIKS this 'uninitialised blocks' approach seems to be the most optimal way to do block allocations that are not tightly coupled with the pagecache. Do you mean ext4's file_write path? ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [RFC] new ->perform_write fop 2010-05-14 7:22 ` Nick Piggin 2010-05-14 8:38 ` Dave Chinner @ 2010-05-21 15:15 ` Christoph Hellwig 2010-05-22 2:31 ` Nick Piggin 1 sibling, 1 reply; 44+ messages in thread From: Christoph Hellwig @ 2010-05-21 15:15 UTC (permalink / raw) To: Nick Piggin Cc: Dave Chinner, Josef Bacik, linux-fsdevel, chris.mason, hch, akpm, linux-kernel Nick, what exactly is the problem with the reserve + allocate design? In a delalloc filesystem (which is all those that will care about high performance large writes) the write path fundamentally consists of those two operations. Getting rid of the get_blocks mess and replacing it with a dedicated operations vector will simplify things a lot. Punching holes is a rather problematic operation, and as mentioned not actually implemented for most filesystems - just decrementing counters on errors massively increases the chances that our error handling will actually work. ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [RFC] new ->perform_write fop 2010-05-21 15:15 ` Christoph Hellwig @ 2010-05-22 2:31 ` Nick Piggin 2010-05-22 8:37 ` Dave Chinner 0 siblings, 1 reply; 44+ messages in thread From: Nick Piggin @ 2010-05-22 2:31 UTC (permalink / raw) To: Christoph Hellwig Cc: Dave Chinner, Josef Bacik, linux-fsdevel, chris.mason, akpm, linux-kernel On Fri, May 21, 2010 at 11:15:18AM -0400, Christoph Hellwig wrote: > Nick, what exactly is the problem with the reserve + allocate design? > > In a delalloc filesystem (which is all those that will care about high > performance large writes) the write path fundamentally consists of those > two operations. Getting rid of the get_blocks mess and replacing it > with a dedicated operations vector will simplify things a lot. Nothing wrong with it, I think it's a fine idea (although you may still need a per-bh call to connect the fs metadata to each page). I just much prefer to have operations after the copy not able to fail, otherwise you get into all those pagecache corner cases. BTW, when you say reserve + allocate, this is in the page-dirty path, right? I thought delalloc filesystems tend to do the actual allocation in the page-cleaning path? Or am I confused? > Punching holes is a rather problematic operation, and as mentioned not > actually implemented for most filesystems - just decrementing counters > on errors massively increases the chances that our error handling will > actually work. It's just harder for the pagecache. Invalidating and throwing out old pagecache and splicing in new pages seems a bit of a hack. ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [RFC] new ->perform_write fop 2010-05-22 2:31 ` Nick Piggin @ 2010-05-22 8:37 ` Dave Chinner 2010-05-24 3:09 ` Nick Piggin 0 siblings, 1 reply; 44+ messages in thread From: Dave Chinner @ 2010-05-22 8:37 UTC (permalink / raw) To: Nick Piggin Cc: Christoph Hellwig, Josef Bacik, linux-fsdevel, chris.mason, akpm, linux-kernel On Sat, May 22, 2010 at 12:31:02PM +1000, Nick Piggin wrote: > On Fri, May 21, 2010 at 11:15:18AM -0400, Christoph Hellwig wrote: > > Nick, what exactly is the problem with the reserve + allocate design? > > > > In a delalloc filesystem (which is all those that will care about high > > performance large writes) the write path fundamentally consists of those > > two operations. Getting rid of the get_blocks mess and replacing it > > with a dedicated operations vector will simplify things a lot. > > Nothing wrong with it, I think it's a fine idea (although you may still > need a per-bh call to connect the fs metadata to each page). > > I just much prefer to have operations after the copy not able to fail, > otherwise you get into all those pagecache corner cases. > > BTW, when you say reserve + allocate, this is in the page-dirty path, > right? I thought delalloc filesystems tend to do the actual allocation > in the page-cleaning path? Or am I confused? See my reply to Jan - delayed allocate has two parts to it - space reservation (accounting for ENOSPC) and recording of the delalloc extents (allocate). This is separate to the writeback path where we convert delalloc extents to real extents.... > > Punching holes is a rather problematic operation, and as mentioned not > > actually implemented for most filesystems - just decrementing counters > > on errors massively increases the chances that our error handling will > > actually work. > > It's just harder for the pagecache. Invalidating and throwing out old > pagecache and splicing in new pages seems a bit of a hack. Hardly a hack - it turns a buffered write into an operation that does not expose transient page state and hence prevents torn writes. That will allow us to use DIF-enabled storage paths for buffered filesystem IO(*), perhaps even allow us to generate checksums during copy-in to do end-to-end checksum protection of data.... Cheers, Dave. (*) Yes, I know that mmap writes will still break DIF even if we do writes this way. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [RFC] new ->perform_write fop 2010-05-22 8:37 ` Dave Chinner @ 2010-05-24 3:09 ` Nick Piggin 2010-05-24 5:53 ` Dave Chinner 0 siblings, 1 reply; 44+ messages in thread From: Nick Piggin @ 2010-05-24 3:09 UTC (permalink / raw) To: Dave Chinner Cc: Christoph Hellwig, Josef Bacik, linux-fsdevel, chris.mason, akpm, linux-kernel On Sat, May 22, 2010 at 06:37:03PM +1000, Dave Chinner wrote: > On Sat, May 22, 2010 at 12:31:02PM +1000, Nick Piggin wrote: > > On Fri, May 21, 2010 at 11:15:18AM -0400, Christoph Hellwig wrote: > > > Nick, what exactly is the problem with the reserve + allocate design? > > > > > > In a delalloc filesystem (which is all those that will care about high > > > performance large writes) the write path fundamentally consists of those > > > two operations. Getting rid of the get_blocks mess and replacing it > > > with a dedicated operations vector will simplify things a lot. > > > > Nothing wrong with it, I think it's a fine idea (although you may still > > need a per-bh call to connect the fs metadata to each page). > > > > I just much prefer to have operations after the copy not able to fail, > > otherwise you get into all those pagecache corner cases. > > > > BTW. when you say reserve + allocate, this is in the page-dirty path, > > right? I thought delalloc filesystems tend to do the actual allocation > > in the page-cleaning path? Or am I confused? > > See my reply to Jan - delayed allocate has two parts to it - space > reservation (accounting for ENOSPC) and recording of the delalloc extents > (allocate). This is separate to the writeback path where we convert > delalloc extents to real extents.... Yes I saw that. I'm sure we'll want clearer terminology in the core code. But I don't quite know why you need to do it in 2 parts (reserve, then "allocate"). Surely even reservation failures are very rare, and obviously the error handling is required, so why not just do a single call? > > > Punching holes is a rather problematic operation, and as mentioned not > > > actually implemented for most filesystems - just decrementing counters > > > on errors increases the chances that our error handling will actually > > > work massively. > > > > It's just harder for the pagecache. Invalidating and throwing out old > > pagecache and splicing in new pages seems a bit of a hack. > > Hardly a hack - it turns a buffered write into an operation that > does not expose transient page state and hence prevents torn writes. > That will allow us to use DIF enabled storage paths for buffered > filesystem IO(*), perhaps even allow us to generate checksums during > copy-in to do end-to-end checksum protection of data.... It is a hack. Invalidating is inherently racy and isn't guaranteed to succeed. You do not need to invalidate the pagecache to do this (which as I said is racy). You need to lock the page to prevent writes, and then unmap user mappings. You'd also need to have done some magic so writable mmaps don't leak into get_user_pages. But this should be a different discussion anyway. Don't forget, your approach is forced into the invalidation requirement because of downsides in its error handling sequence. That cannot be construed as positive, because you are forced into it, whereas other approaches *could* use it, but do not have to. ^ permalink raw reply [flat|nested] 44+ messages in thread
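For reference, the page-stabilising alternative Nick sketches maps onto existing kernel primitives roughly as follows. This is an illustrative fragment, not a complete patch; 'page' and 'mapping' are assumed to be the cached page and its address_space, and, as he notes, pages already pinned via get_user_pages() are not covered.

    /*
     * Sketch: stabilise a cached page's contents without removing
     * it from the page cache.
     */
    lock_page(page);                /* stops writers that lock the page */
    unmap_mapping_range(mapping,    /* zap any user PTEs over the page */
                        (loff_t)page->index << PAGE_CACHE_SHIFT,
                        PAGE_CACHE_SIZE, 1);
    /*
     * The page is now stable against new write faults: compute
     * checksums or issue DIF-protected IO here.
     */
    unlock_page(page);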
* Re: [RFC] new ->perform_write fop 2010-05-24 3:09 ` Nick Piggin @ 2010-05-24 5:53 ` Dave Chinner 2010-05-24 6:55 ` Nick Piggin 0 siblings, 1 reply; 44+ messages in thread From: Dave Chinner @ 2010-05-24 5:53 UTC (permalink / raw) To: Nick Piggin Cc: Christoph Hellwig, Josef Bacik, linux-fsdevel, chris.mason, akpm, linux-kernel On Mon, May 24, 2010 at 01:09:43PM +1000, Nick Piggin wrote: > On Sat, May 22, 2010 at 06:37:03PM +1000, Dave Chinner wrote: > > On Sat, May 22, 2010 at 12:31:02PM +1000, Nick Piggin wrote: > > > On Fri, May 21, 2010 at 11:15:18AM -0400, Christoph Hellwig wrote: > > > > Nick, what exactly is the problem with the reserve + allocate design? > > > > > > > > In a delalloc filesystem (which is all those that will care about high > > > > performance large writes) the write path fundamentally consists of those > > > > two operations. Getting rid of the get_blocks mess and replacing it > > > > with a dedicated operations vector will simplify things a lot. > > > > > > Nothing wrong with it, I think it's a fine idea (although you may still > > > need a per-bh call to connect the fs metadata to each page). > > > > > > I just much prefer to have operations after the copy not able to fail, > > > otherwise you get into all those pagecache corner cases. > > > > > > BTW. when you say reserve + allocate, this is in the page-dirty path, > > > right? I thought delalloc filesystems tend to do the actual allocation > > > in the page-cleaning path? Or am I confused? > > > > See my reply to Jan - delayed allocate has two parts to it - space > > reservation (accounting for ENOSPC) and recording of the delalloc extents > > (allocate). This is separate to the writeback path where we convert > > delalloc extents to real extents.... > > Yes I saw that. I'm sure we'll want clearer terminology in the core > code. But I don't quite know why you need to do it in 2 parts > (reserve, then "allocate"). Because reserve/allocate are the two steps that allocation is generally broken down into, even in filesystems that don't do delayed allocation. That's because.... > Surely even reservation failures are > very rare ... ENOSPC and EDQUOT are not at all rare, and they are generated during the reservation stage. i.e. before any real allocation or state changes are made. Just about every filesystem does this because failing half way through an allocation (not being able to allocate a block due to ENOSPC or EDQUOT) is pretty much impossible to undo reliably in most filesystems. > , and obviously the error handling is required, so why not > just do a single call? Because if we fail after the allocation then ensuring we handle the error *correctly* and *without further failures* is *fucking hard*. IMO, the fundamental issue with using hole punching or direct IO from the zero page to handle errors is that they are complex enough that there is *no guarantee that they will succeed*. e.g. Both can get ENOSPC/EDQUOT because they may end up with metadata allocation requirements above and beyond what was originally reserved. If the error handling fails to handle the error, then where do we go from there? In comparison, undoing a reservation is simple - maybe incrementing a couple of counters - and is effectively guaranteed never to fail. This is a good characteristic to have in an error handling function...
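Dave's point about reservations being trivially undoable can be made concrete with a sketch. The fs_sb_info structure and its counter are hypothetical (loosely modelled on how delalloc filesystems account free space); a real filesystem would close the check/update race under a lock or with a more careful counter protocol.

    /*
     * Sketch: reservation as pure counter accounting. The reserve
     * step is the only place ENOSPC is reported; undoing it is
     * simple arithmetic that cannot fail.
     */
    int fs_reserve_blocks(struct super_block *sb, s64 nr)
    {
            struct fs_sb_info *sbi = sb->s_fs_info;    /* hypothetical */

            if (percpu_counter_sum(&sbi->free_blocks) < nr)
                    return -ENOSPC;                    /* reported up front */
            percpu_counter_add(&sbi->free_blocks, -nr);
            return 0;
    }

    void fs_unreserve_blocks(struct super_block *sb, s64 nr)
    {
            struct fs_sb_info *sbi = sb->s_fs_info;

            /* guaranteed to succeed: just hand the space back */
            percpu_counter_add(&sbi->free_blocks, nr);
    }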
> > > > Punching holes is a rather problematic operation, and as mentioned not > > > > actually implemented for most filesystems - just decrementing counters > > > > on errors increases the chances that our error handling will actually > > > > work massively. > > > > > > It's just harder for the pagecache. Invalidating and throwing out old > > > pagecache and splicing in new pages seems a bit of a hack. > > > > Hardly a hack - it turns a buffered write into an operation that > > does not expose transient page state and hence prevents torn writes. > > That will allow us to use DIF enabled storage paths for buffered > > filesystem IO(*), perhaps even allow us to generate checksums during > > copy-in to do end-to-end checksum protection of data.... > > It is a hack. Invalidating is inherently racy and isn't guaranteed > to succeed. > > You do not need to invalidate the pagecache to do this (which as I said > is racy). You need to lock the page to prevent writes, and then unmap > user mappings. Which is the major part of invalidating a page. The other part of invalidation is removing the page from the page cache, so if invalidation is inherently too racy to use safely here, then I fail to see why the above isn't also too racy to use safely.... > You'd also need to have done some magic so writable mmaps > don't leak into get_user_pages. Which means what, and why don't we have to do any special magic now to prevent it? > But this should be a different discussion anyway. Don't forget, your > approach is forced into the invalidation requirement because of > downsides in its error handling sequence. I wouldn't say forced into it, Nick - it's a deliberate design choice to make the overall stack simpler and more likely to function correctly. Besides, all it takes to avoid the requirement of invalidation is to provide the guarantee that the allocation after reservation will either succeed or the filesystem shuts down in a corrupted state. If we provide that guarantee then the fact that transient page cache data might appear on allocation error is irrelevant, because it will never get written to disk and applications will error out pretty quickly. I'm quite happy with that requirement, because of two things. Firstly, after the reservation nothing but a corruption or IO error should prevent the allocation from succeeding. In that case, the filesystem should already be in a shutdown state by the time the failed allocation returns. Secondly, filesystems using delayed allocation are already making this promise successfully from get_blocks to ->writepage, hence I don't see any issues with encoding it into an allocation interface.... > That cannot be construed as > positive, because you are forced into it, whereas other approaches > *could* use it, but do not have to. Except for the fact the other alternatives have much, much worse downsides. Yes, they could also use such a write path, but that doesn't reduce the complexity of those solutions or prevent any of the problems they have. Cheers, Dave. (*) Yes, I know that mmap writes will still break DIF even if we do writes this way. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [RFC] new ->perform_write fop 2010-05-24 5:53 ` Dave Chinner @ 2010-05-24 6:55 ` Nick Piggin 2010-05-24 10:21 ` Dave Chinner 2010-05-24 18:40 ` Joel Becker 0 siblings, 2 replies; 44+ messages in thread From: Nick Piggin @ 2010-05-24 6:55 UTC (permalink / raw) To: Dave Chinner Cc: Christoph Hellwig, Josef Bacik, linux-fsdevel, chris.mason, akpm, linux-kernel On Mon, May 24, 2010 at 03:53:29PM +1000, Dave Chinner wrote: > On Mon, May 24, 2010 at 01:09:43PM +1000, Nick Piggin wrote: > > On Sat, May 22, 2010 at 06:37:03PM +1000, Dave Chinner wrote: > > > On Sat, May 22, 2010 at 12:31:02PM +1000, Nick Piggin wrote: > > > > On Fri, May 21, 2010 at 11:15:18AM -0400, Christoph Hellwig wrote: > > > > > Nick, what exactly is the problem with the reserve + allocate design? > > > > > > > > > > In a delalloc filesystem (which is all those that will care about high > > > > > performance large writes) the write path fundamentally consists of those > > > > > two operations. Getting rid of the get_blocks mess and replacing it > > > > > with a dedicated operations vector will simplify things a lot. > > > > > > > > Nothing wrong with it, I think it's a fine idea (although you may still > > > > need a per-bh call to connect the fs metadata to each page). > > > > > > > > I just much prefer to have operations after the copy not able to fail, > > > > otherwise you get into all those pagecache corner cases. > > > > > > > > BTW. when you say reserve + allocate, this is in the page-dirty path, > > > > right? I thought delalloc filesystems tend to do the actual allocation > > > > in the page-cleaning path? Or am I confused? > > > > > > See my reply to Jan - delayed allocate has two parts to it - space > > > reservation (accounting for ENOSPC) and recording of the delalloc extents > > > (allocate). This is separate to the writeback path where we convert > > > delalloc extents to real extents.... > > > > Yes I saw that. I'm sure we'll want clearer terminology in the core > > code. But I don't quite know why you need to do it in 2 parts > > (reserve, then "allocate"). > > Because reserve/allocate are the two steps that allocation is > generally broken down into, even in filesystems that don't do > delayed allocation. That's because.... > > > Surely even reservation failures are > > very rare > > ... ENOSPC and EDQUOT are not at all rare, and they are generated > during the reservation stage. i.e. before any real allocation or I meant "rare" as in not critical for performance. Not that they don't have to be handled properly. > state changes are made. Just about every filesystem does this > because failing half way through an allocation (not being able to > allocate a block due to ENOSPC or EDQUOT) is pretty much impossible > to undo reliably in most filesystems. > > > , and obviously the error handling is required, so why not > > just do a single call? > > Because if we fail after the allocation then ensuring we handle the > error *correctly* and *without further failures* is *fucking hard*. I don't think you really answered my question. Let me put it in concrete terms. In your proposal, why not just do the reserve+allocate *after* the pagecache copy? What does the "reserve" part add? > > IMO, the fundamental issue with using hole punching or direct IO > from the zero page to handle errors is that they are complex enough > that there is *no guarantee that they will succeed*. e.g.
Both can > get ENOSPC/EDQUOT because they may end up with metadata allocation > requirements above and beyond what was originally reserved. If the > error handling fails to handle the error, then where do we go from > there? There are already fundamental issues that seem like they are not handled properly if your filesystem may allocate uninitialized blocks over holes for writeback cache without somehow marking them as uninitialized. If you get a power failure or IO error before the pagecache can be written out, you're left with uninitialized data there, aren't you? Simple buffer head based filesystems are already subject to this. > In comparison, undoing a reservation is simple - maybe incrementing > a couple of counters - and is effectively guaranteed never to fail. > This is a good characteristic to have in an error handling > function... Yes of course. > > > > > Punching holes is a rather problematic operation, and as mentioned not > > > > > actually implemented for most filesystems - just decrementing counters > > > > > on errors increases the chances that our error handling will actually > > > > > work massively. > > > > > > > > It's just harder for the pagecache. Invalidating and throwing out old > > > > pagecache and splicing in new pages seems a bit of a hack. > > > > > > Hardly a hack - it turns a buffered write into an operation that > > > does not expose transient page state and hence prevents torn writes. > > > That will allow us to use DIF enabled storage paths for buffered > > > filesystem IO(*), perhaps even allow us to generate checksums during > > > copy-in to do end-to-end checksum protection of data.... > > > > It is a hack. Invalidating is inherently racy and isn't guaranteed > > to succeed. > > > > You do not need to invalidate the pagecache to do this (which as I said > > is racy). You need to lock the page to prevent writes, and then unmap > > user mappings. > > Which is the major part of invalidating a page. The other part of > invalidation is removing the page from the page cache, so if > invalidation is inherently too racy to use safely here, then I fail > to see why the above isn't also too racy to use safely.... I don't know what you mean by the major part of invalidating the page. Taking the page out of the pagecache is indeed the fallible part of the operation. > > You'd also need to have done some magic so writable mmaps > > don't leak into get_user_pages. > > Which means what, and why don't we have to do any special magic now > to prevent it? We do, but filesystems don't tend to use it. > > But this should be a different discussion anyway. Don't forget, your > > approach is forced into the invalidation requirement because of > > downsides in its error handling sequence. > > I wouldn't say forced into it, Nick - it's a deliberate design > choice to make the overall stack simpler and more likely to function > correctly. Your design is forced to do it when I pointed out that writes into the pagecache should not be made visible if the process can subsequently fail. copy-last is not subject to this. So you can't say invalidation is an advantage of copy-first, because if it is advantageous in other areas, copy-last can implement it too. > Besides, all it takes to avoid the requirement of invalidation is to > provide the guarantee that the allocation after reservation will > either succeed or the filesystem shuts down in a corrupted state.
> If we provide that guarantee then the fact that transient page cache > data might appear on allocation error is irrelevant, because it > will never get written to disk and applications will error out > pretty quickly. Sure. IO errors and writeback cache mean that we don't guarantee without fsync that the data will come back after a crash. This could be an answer to my above question (what is the 2-call sequence for?). All that is required upon write(2) completion (or partial completion) is that the data can actually be found and written back at a later date. > I'm quite happy with that requirement, because of two things. > Firstly, after the reservation nothing but a corruption or IO error > should prevent the allocation from succeeding. In that case, the > filesystem should already be in a shutdown state by the time the > failed allocation returns. Secondly, filesystems using delayed > allocation are already making this promise successfully from > get_blocks to ->writepage, hence I don't see any issues with > encoding it into an allocation interface.... Well it's already there, and not just for delalloc filesystems, because a lot of filesystems do writeback on their metadata too, so it's all subject to IO errors. > > That cannot be construed as > > positive, because you are forced into it, whereas other approaches > > *could* use it, but do not have to. > > Except for the fact the other alternatives have much, much worse > downsides. Yes, they could also use such a write path, but that > doesn't reduce the complexity of those solutions or prevent any of > the problems they have. Let's just carefully look at the alternatives. This alternative of zeroing out uninitialized blocks (via pagecache) is what we have today. What we should do is carefully consider exactly what error semantics and guarantees we want, and then implement the best API taking those into account. If we are happy with the existing state of error handling, copy-first is clearly better because the fast path is faster. ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [RFC] new ->perform_write fop 2010-05-24 6:55 ` Nick Piggin @ 2010-05-24 10:21 ` Dave Chinner 2010-06-01 6:27 ` Nick Piggin 0 siblings, 1 reply; 44+ messages in thread From: Dave Chinner @ 2010-05-24 10:21 UTC (permalink / raw) To: Nick Piggin Cc: Christoph Hellwig, Josef Bacik, linux-fsdevel, chris.mason, akpm, linux-kernel On Mon, May 24, 2010 at 04:55:19PM +1000, Nick Piggin wrote: > On Mon, May 24, 2010 at 03:53:29PM +1000, Dave Chinner wrote: > > On Mon, May 24, 2010 at 01:09:43PM +1000, Nick Piggin wrote: > > > On Sat, May 22, 2010 at 06:37:03PM +1000, Dave Chinner wrote: > > > > On Sat, May 22, 2010 at 12:31:02PM +1000, Nick Piggin wrote: > > > > > On Fri, May 21, 2010 at 11:15:18AM -0400, Christoph Hellwig wrote: > > > > > > Nick, what exactly is the problem with the reserve + allocate design? > > > > > > > > > > > > In a delalloc filesystem (which is all those that will care about high > > > > > > performance large writes) the write path fundamentally consists of those > > > > > > two operations. Getting rid of the get_blocks mess and replacing it > > > > > > with a dedicated operations vector will simplify things a lot. > > > > > > > > > > Nothing wrong with it, I think it's a fine idea (although you may still > > > > > need a per-bh call to connect the fs metadata to each page). > > > > > > > > > > I just much prefer to have operations after the copy not able to fail, > > > > > otherwise you get into all those pagecache corner cases. > > > > > > > > > > BTW. when you say reserve + allocate, this is in the page-dirty path, > > > > > right? I thought delalloc filesystems tend to do the actual allocation > > > > > in the page-cleaning path? Or am I confused? > > > > > > > > See my reply to Jan - delayed allocate has two parts to it - space > > > > reservation (accounting for ENOSPC) and recording of the delalloc extents > > > > (allocate). This is separate to the writeback path where we convert > > > > delalloc extents to real extents.... > > > > > > Yes I saw that. I'm sure we'll want clearer terminology in the core > > > code. But I don't quite know why you need to do it in 2 parts > > > (reserve, then "allocate"). > > > > Because reserve/allocate are the two steps that allocation is > > generally broken down into, even in filesystems that don't do > > delayed allocation. That's because.... > > > > > Surely even reservation failures are > > > very rare > > > > ... ENOSPC and EDQUOT are not at all rare, and they are generated > > during the reservation stage. i.e. before any real allocation or > > I meant "rare" as in not critical for performance. Not that they don't > have to be handled properly. I am trying to handle these cases properly for the fast path, not turn it into a maze of twisty, dark, buggy passages. > > state changes are made. Just about every filesystem does this > > because failing half way through an allocation (not being able to > > allocate a block due to ENOSPC or EDQUOT) is pretty much impossible > > to undo reliably in most filesystems. > > > > > , and obviously the error handling is required, so why not > > > just do a single call? > > > > Because if we fail after the allocation then ensuring we handle the > > error *correctly* and *without further failures* is *fucking hard*. > > I don't think you really answered my question. Let me put it in concrete > terms. In your proposal, why not just do the reserve+allocate *after* > the pagecache copy? What does the "reserve" part add? ....
> Your design is forced to do it when I pointed out that writes into the > pagecache should not be made visible if the process can subsequently > fail. copy-last is not subject to this. Let's summarise what we've got so far so that it becomes obvious why we need a two step process.

1. Copy First Problem: Copying in data before the allocate results in transient data exposure if the allocate fails or if we overwrite pages in the same loop iteration as the failed allocation. To deal with this, we've talked about page replacement mechanisms for the page cache and all the complex issues associated with that.

2. Copy Last Problem: If we allocate up front, then we have to handle errors allocating pages into the page cache or copying the data. To work around copying failures, we've talked about punching out the allocation range that wasn't copied into. This gets complex with mixed overwrite/allocate ranges. To deal with page cache allocation failures, we talked about using direct IO w/ the zero page to zero the remaining range, but this gets complex with delalloc.

Situation: both copy-first and copy-last have serious issues.

3. Generic Simplification: The first thing we need to do in both cases is avoid the mixed overwrite/allocate copy case, as that greatly increases the complexity of error handling in both cases. To do this we need to split the multipage write up into ranges that are of the same kind (only allocate, or only overwrite). This needs to be done before we copy the data and hence we need to be able to map the range we are allowed to copy before we do the copy. A filesystem callout is required to do this.

4. Avoiding Copy Last Problems: To avoid page allocation failures requiring hole punching or DIO writes, we need to ensure we don't do allocation until after the copy has succeeded (i.e. take that bit of Copy First). However, to avoid data exposure problems, we need to ensure this allocation will succeed in all normal cases. ENOSPC and EDQUOT are normal cases, EIO or filesystem corruption are not normal cases.

5. To avoid the catch-22 situation, we can avoid "normal" errors in the allocation case if we reserve the space required for the upcoming allocation before we copy any data. Effectively we can guarantee the allocation will succeed.

6. If the allocation does fail, then the filesystem is considered corrupted and needs to enter a shutdown/failed state. This ensures allocation failures do not expose transient data in any meaningful way because the cached data on an errored filesystem cannot be guaranteed to be valid. This is a hard stop, but should never occur unless there is some other hard-stop error condition in the filesystem.

7. Hence by reserving space and defining how the filesystem needs to handle allocation failures after a successful reserve, we remove all the problematic error handling in the Copy Last case. This requires a filesystem callout.

8. Avoiding Copy First Problems: By ensuring that allocation will never fail except in a hard-stop situation, the allocation can be safely done after the copy has been made. Effectively the steps to remove the problematic error handling from the Copy Last case also remove the problematic transient data issues from the Copy First case.

9. Copying Data: Given the guarantee that allocation will succeed, we don't need to worry about exposing transient data during the data copy. The copy can fail at any time (either page allocation failure or EFAULT during the copy) and we can easily deal with it.
Hence we don't need to lock more than one page at once, and the copy can proceed a page at a time. Failure during a copy will require partial pages to be treated like a partial write.

10. Handling Errors: If we fail the reserve/map stage, then there is nothing to undo and we can simply return at that point.

11. If we fail part way through the copy (i.e. in the middle), the error handling is simple:
   a) allocate the range that did succeed (if allocate)
   b) zero the portion of the page that failed (if allocate)
   c) update the page states appropriately (write_end)
   d) release the unused reserved space (if allocate)

12. If we fail the allocation:
   a) the filesystem is shut down
   b) the range that failed is tossed out of the page cache.

To me this seems pretty sane - "copy middle" avoids all the pitfalls and traps we've been talking about. No transient data exposure problems, no need to lock multiple pages at a time, no complex error handling required, no major new filesystem functionality, etc. > > IMO, the fundamental issue with using hole punching or direct IO > > from the zero page to handle errors is that they are complex enough > > that there is *no guarantee that they will succeed*. e.g. Both can > > get ENOSPC/EDQUOT because they may end up with metadata allocation > > requirements above and beyond what was originally reserved. If the > > error handling fails to handle the error, then where do we go from > > there? > > There are already fundamental issues that seem like they are not > handled properly if your filesystem may allocate uninitialized blocks > over holes for writeback cache without somehow marking them as > uninitialized. You mean like the problems ext3's data=writeback mode has? ;) > If you get a power failure or IO error before the pagecache can be > written out, you're left with uninitialized data there, aren't you? > Simple buffer head based filesystems are already subject to this. Most filesystems avoid this one way or another. XFS, btrfs, ext3/4 (in data=ordered/journal modes), gfs2, etc. all use different methods, but the end result is the same - they don't expose stale blocks. But that's not really the issue I was alluding to. What I was thinking of was a whole lot more complex than that, and might give you insight into why the error handling you were proposing is .. meeting resistance. Take a delayed allocation in XFS - XFS reserves an amount (worst case) of extra blocks for metadata allocation during the eventual real extent allocation. Every delalloc reserves this metadata space but it is not specific to the inode, so pretty much any outstanding delalloc conversion has a big pool of reserved metadata space that it can allocate metadata blocks from. When the allocation occurs during writeback this unused metadata space is taken back out of the pool. If we then change this to use DIO, the first thing XFS has to do is punch out the delalloc extent spanning the page because we cannot do direct IO into delalloc extents (a fundamental design rule). That punch would require its own metadata reservation on top of the current delalloc reservation which is not released until after the reservation for the hole punch is made. Hence it can ENOSPC/EDQUOT. Then we have to allocate during DIO, which also requires its own metadata reservation plus the blocks for the page being written. That can also ENOSPC/EDQUOT because it can't make use of the large delalloc metadata pool that would have allowed it to succeed.
But because this is an error handling path, it must be made to succeed. If it fails, we've failed to handle the error correctly and something bad *will* happen as a result. As you can see, switching to DIO from the zero page in a copy-last error handler would be quite hard to implement in XFS - it would require a special DIO path because we'd need to know that it was in a do-not-fail case and that we'd have to hole punch before allocation rather than asserting that a design rule had been violated, and then make sure that neither punching nor the later allocation failed, which we can't actually do.... Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 44+ messages in thread
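Pulling points 3-12 together, a "copy middle" write might look something like the outline below. This is a sketch of the control flow only, not a real patch: the ->reserve/->allocate/->unreserve address_space operations and the copy_range_to_pagecache() helper are hypothetical names invented for illustration.

    /*
     * Sketch of a "copy middle" buffered write:
     * reserve -> copy -> allocate, with the error paths from
     * points 10-12 above.
     */
    static ssize_t multipage_perform_write(struct file *file,
                                           struct iov_iter *i, loff_t pos)
    {
            struct address_space *mapping = file->f_mapping;
            size_t len = iov_iter_count(i);
            size_t copied;
            int error;

            /*
             * Map/trim the range to a single kind (all-allocate or
             * all-overwrite) and reserve worst-case space for it.
             * This is the only place ENOSPC/EDQUOT can be returned.
             */
            error = mapping->a_ops->reserve(mapping, pos, &len);
            if (error)
                    return error;

            /* Copy in the data, one locked page at a time. */
            copied = copy_range_to_pagecache(mapping, i, pos, len);

            /*
             * Allocate only what was actually copied. After a
             * successful reserve this may only fail if the
             * filesystem is corrupt, in which case it shuts down
             * and the range is tossed from the page cache.
             */
            if (copied)
                    mapping->a_ops->allocate(mapping, pos, copied);

            /* Hand back the reservation covering the uncopied tail. */
            if (copied < len)
                    mapping->a_ops->unreserve(mapping, pos + copied,
                                              len - copied);

            return copied ? copied : -EFAULT;
    }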
* Re: [RFC] new ->perform_write fop 2010-05-24 10:21 ` Dave Chinner @ 2010-06-01 6:27 ` Nick Piggin 0 siblings, 0 replies; 44+ messages in thread From: Nick Piggin @ 2010-06-01 6:27 UTC (permalink / raw) To: Dave Chinner Cc: Christoph Hellwig, Josef Bacik, linux-fsdevel, chris.mason, akpm, linux-kernel On Mon, May 24, 2010 at 08:21:25PM +1000, Dave Chinner wrote: > On Mon, May 24, 2010 at 04:55:19PM +1000, Nick Piggin wrote: > > > > Surely even reservation failures are > > > > very rare > > > > > > ... ENOSPC and EDQUOT are not at all rare, and they are generated > > > during the reservation stage. i.e. before any real allocation or > > > > I meant "rare" as in not critical for performance. Not that they don't > > have to be handled properly. > > I am trying to handle these cases properly for the fast path, not > turn it into a maze of twisty, dark, buggy passages. What I mean is that when the common code has such a failure, the action to correct it is not performance critical, i.e. the fast-path is the success case. So a cleaner or more performant fast-path is preferable even if it results in the slow-path error case being slower or more complex (within reasonable limits). > > > state changes are made. Just about every filesystem does this > > > because failing half way through an allocation (not being able to > > > allocate a block due to ENOSPC or EDQUOT) is pretty much impossible > > > to undo reliably in most filesystems. > > > > > > > , and obviously the error handling is required, so why not > > > > just do a single call? > > > > > > Because if we fail after the allocation then ensuring we handle the > > > error *correctly* and *without further failures* is *fucking hard*. > > > > I don't think you really answered my question. Let me put it in concrete > > terms. In your proposal, why not just do the reserve+allocate *after* > > the pagecache copy? What does the "reserve" part add? > > .... > > > Your design is forced to do it when I pointed out that writes into the > > pagecache should not be made visible if the process can subsequently > > fail. copy-last is not subject to this. > > Let's summarise what we've got so far so that it becomes obvious why > we need a two step process. > > 1. Copy First Problem: Copying in data before the allocate results > in transient data exposure if the allocate fails or if we overwrite > pages in the same loop iteration as the failed allocation. To deal > with this, we've talked about page replacement mechanisms for the page > cache and all the complex issues associated with that. > > 2. Copy Last Problem: If we allocate up front, then we have to > handle errors allocating pages into the page cache or copying the > data. To work around copying failures, we've talked about punching > out the allocation range that wasn't copied into. This gets complex > with mixed overwrite/allocate ranges. To deal with page cache > allocation failures, we talked about using direct IO w/ the zero > page to zero the remaining range, but this gets complex with > delalloc. > > Situation: both copy-first and copy-last have serious issues. > > 3. Generic Simplification: The first thing we need to do in both > cases is avoid the mixed overwrite/allocate copy case, as that > greatly increases the complexity of error handling in both cases. > To do this we need to split the multipage write up into ranges that > are of the same kind (only allocate, or only overwrite).
This needs > to be done before we copy the data and hence we need to be able to > map the range we are allowed to copy before we do the copy. A > filesystem callout is required to do this. OK, this allows filesystems that can handle mixed to do so. > 4. Avoiding Copy Last Problems: To avoid page allocation failures > requiring hole punching or DIO writes, we need to ensure we don't do > allocation until after the copy has succeeded (i.e. take that bit of > Copy First). However, to avoid data exposure problems, we need to > ensure this allocation will succeed in all normal cases. ENOSPC and > EDQUOT are normal cases, EIO or filesystem corruption are not normal > cases. In generic code, I don't know whether this should be a requirement. If ext4 marks allocated blocks as uninitialized, it seems to be able to handle this pretty gracefully. Not all filesystems necessarily can reserve blocks before the copy, you see. > 5. To avoid the catch-22 situation, we can avoid "normal" errors in > the allocation case if we reserve the space required for the > upcoming allocation before we copy any data. Effectively we can > guarantee the allocation will succeed. > > 6. If the allocation does fail, then the filesystem is considered > corrupted and needs to enter a shutdown/failed state. This ensures > allocation failures do not expose transient data in any meaningful > way because the cached data on an errored filesystem cannot be > guaranteed to be valid. This is a hard stop, but should never occur > unless there is some other hard-stop error condition in the > filesystem. > > 7. Hence by reserving space and defining how the filesystem needs to > handle allocation failures after a successful reserve, we remove all > the problematic error handling in the Copy Last case. This requires > a filesystem callout. > > 8. Avoiding Copy First Problems: By ensuring that allocation will > never fail except in a hard-stop situation, the allocation can be > safely done after the copy has been made. Effectively the steps to > remove the problematic error handling from the Copy Last case also > remove the problematic transient data issues from the Copy First > case. This is a reasonable observation and seems to solve your copy-first problem. However this may depend a little bit on the error semantics of the fs, and a lot on the filesystem itself. Is it reasonable to encode this in the generic code? I think I would still like to see a few more implementations before it is turned into generic code. I would like to see how much code is actually saved versus implementing it in the fs (perhaps with some helpers in the vfs to avoid generic checks). We have relatively simple iov accessors/usercopy etc. so the pagecache part of it really doesn't need to be in generic code. > 9. Copying Data: Given the guarantee that allocation will succeed, > we don't need to worry about exposing transient data during the data > copy. The copy can fail at any time (either page allocation failure > or EFAULT during the copy) and we can easily deal with it. Hence we > don't need to lock more than one page at once, and the copy can > proceed a page at a time. Failure during a copy will require partial > pages to be treated like a partial write. > 10. Handling Errors: If we fail the reserve/map stage, then there is > nothing to undo and we can simply return at that point. > 11. If we fail part way through the copy (i.e.
in the middle), the > error handling is simple: > a) allocate the range that did succeed (if allocate) > b) zero the portion of the page that failed (if allocate) > c) update the page states appropriately (write_end) > d) release the unused reserved space (if allocate) > > 12. If we fail the allocation: > a) the filesystem is shut down > b) the range that failed is tossed out of the page cache. > > To me this seems pretty sane - "copy middle" avoids all the pitfalls > and traps we've been talking about. No transient data exposure > problems, no need to lock multiple pages at a time, no complex error > handling required, no major new filesystem functionality, etc. So long as you avoid the transient pagecache data without requiring pagecache invalidation, I don't have a problem with it. > > > IMO, the fundamental issue with using hole punching or direct IO > > > from the zero page to handle errors is that they are complex enough > > > that there is *no guarantee that they will succeed*. e.g. Both can > > > get ENOSPC/EDQUOT because they may end up with metadata allocation > > > requirements above and beyond what was originally reserved. If the > > > error handling fails to handle the error, then where do we go from > > > there? > > > > There are already fundamental issues that seem like they are not > > handled properly if your filesystem may allocate uninitialized blocks > > over holes for writeback cache without somehow marking them as > > uninitialized. > > You mean like the problems ext3's data=writeback mode has? ;) And probably most non-journalling filesystems. > > If you get a power failure or IO error before the pagecache can be > > written out, you're left with uninitialized data there, aren't you? > > Simple buffer head based filesystems are already subject to this. > > Most filesystems avoid this one way or another. XFS, btrfs, ext3/4 > (in data=ordered/journal modes), gfs2, etc. all use different > methods, but the end result is the same - they don't expose stale > blocks. > > But that's not really the issue I was alluding to. What I was > thinking of was a whole lot more complex than that, and might give > you insight into why the error handling you were proposing is .. > meeting resistance. > > Take a delayed allocation in XFS - XFS reserves an amount (worst > case) of extra blocks for metadata allocation during the eventual > real extent allocation. Every delalloc reserves this metadata space > but it is not specific to the inode, so pretty much any outstanding > delalloc conversion has a big pool of reserved metadata space that > it can allocate metadata blocks from. When the allocation occurs > during writeback this unused metadata space is taken back out of the > pool. > > If we then change this to use DIO, the first thing XFS has to do is > punch out the delalloc extent spanning the page because we cannot do > direct IO into delalloc extents (a fundamental design rule). That > punch would require its own metadata reservation on top of the > current delalloc reservation which is not released until after the > reservation for the hole punch is made. Hence it can ENOSPC/EDQUOT. > > Then we have to allocate during DIO, which also requires its own > metadata reservation plus the blocks for the page being written. > That can also ENOSPC/EDQUOT because it can't make use of the large > delalloc metadata pool that would have allowed it to succeed. > > But because this is an error handling path, it must be made to > succeed.
If it fails, we've failed to handle the error correctly and > something bad *will* happen as a result. As you can see, switching to > DIO from the zero page in a copy-last error handler would be quite > hard to implement in XFS - it would require a special DIO path > because we'd need to know that it was in a do-not-fail case and that > we'd have to hole punch before allocation rather than asserting that > a design rule had been violated, and then make sure that neither > punching nor the later allocation failed, which we can't actually > do.... Fair point. Thanks for the explanation. I could say why pagecache invalidation is very hard to do as well, but luckily you've found a solution that avoids all the problems. Is it easy to explain how XFS avoids the uninitialized block problem, btw? ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [RFC] new ->perform_write fop 2010-05-24 6:55 ` Nick Piggin 2010-05-24 10:21 ` Dave Chinner @ 2010-05-24 18:40 ` Joel Becker 1 sibling, 0 replies; 44+ messages in thread From: Joel Becker @ 2010-05-24 18:40 UTC (permalink / raw) To: Nick Piggin Cc: Dave Chinner, Christoph Hellwig, Josef Bacik, linux-fsdevel, chris.mason, akpm, linux-kernel On Mon, May 24, 2010 at 04:55:19PM +1000, Nick Piggin wrote: > On Mon, May 24, 2010 at 03:53:29PM +1000, Dave Chinner wrote: > > Because if we fail after the allocation then ensuring we handle the > > error *correctly* and *without further failures* is *fucking hard*. > > I don't think you really answered my question. Let me put it in concrete > terms. In your proposal, why not just do the reserve+allocate *after* > the pagecache copy? What does the "reserve" part add? In ocfs2, we can't just crash our filesystem. We have to be safe not just with respect to the local machine, we have to leave the filesystem in a consistent state - structure *and* data - for the other nodes. The ordering and locking of allocation in get_block(s)() is so bad that we just Don't Do It. By the time get_block(s)() is called, we require our filesystem to have the allocation done. We do our allocation in write_begin(). By the time we get to the page copy, we can't ENOSPC or EDQUOT. O_DIRECT I/O falls back to sync buffered I/O if it must allocate, pushing us through write_begin(), forcing other nodes to honor what we've done. This is easily extended to the reserve multipage operation. It's not delalloc, because we actually allocate in the reserve operation. We handle it just like a large case of the single page operation. Someday we hope to add delalloc, and it would actually do better here. I guess you could call this "copy middle" like Dave describes in his followup to your mail. Copy Middle also has the property that it can handle short writes without any error handling. Copy First has to discover it can only get half the allocation and drop the latter half of the pagecache. Copy Last has to discover it can only do half the page copy and drop the latter half of the allocation. > > IMO, the fundamental issue with using hole punching or direct IO > > from the zero page to handle errors is that they are complex enough > > that there is *no guarantee that they will succeed*. e.g. Both can > > get ENOSPC/EDQUOT because they may end up with metadata allocation > > requirements above and beyond what was originally reserved. If the > > error handling fails to handle the error, then where do we go from > > there? > > There are already fundamental issues that seem like they are not > handled properly if your filesystem may allocate uninitialized blocks > over holes for writeback cache without somehow marking them as > uninitialized. > > If you get a power failure or IO error before the pagecache can be > written out, you're left with uninitialized data there, aren't you? > Simple buffer head based filesystems are already subject to this. Sure, ext2 does this. But don't most filesystems guaranteeing state actually make sure to order such I/Os? If you run ext3 in data=writeback, you get what you pay for. This sounds like a red herring. Dave's original point stands. ocfs2 supports unwritten extents and punching holes. In fact, we directly copied the XFS ioctl(2)s. But when we do punch holes, we have to adjust our tree. That may require additional metadata, and *that* can fail with ENOSPC or EDQUOT. Joel -- "I always thought the hardest questions were those I could not answer.
Now I know they are the ones I can never ask." - Charlie Watkins Joel Becker Principal Software Developer Oracle E-mail: joel.becker@oracle.com Phone: (650) 506-8127 ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [RFC] new ->perform_write fop 2010-05-14 6:41 ` Dave Chinner 2010-05-14 7:22 ` Nick Piggin @ 2010-05-17 23:35 ` Jan Kara 2010-05-18 1:21 ` Dave Chinner 1 sibling, 1 reply; 44+ messages in thread From: Jan Kara @ 2010-05-17 23:35 UTC (permalink / raw) To: Dave Chinner Cc: Josef Bacik, linux-fsdevel, chris.mason, hch, akpm, npiggin, linux-kernel On Fri 14-05-10 16:41:45, Dave Chinner wrote: ... > Well, the "RESERVE" callout I proposed should mean that the > allocation that follows the copy should never fail with ENOSPC. i.e. > if there isn't space, the RESERVE call will fail with ENOSPC, not > the actual ALLOCATE call that occurs after the data is copied. > > Basically I was thinking that in the XFS case, RESERVE would scan the > range for holes, and reserve all blocks needed to fill all the holes > in the range. Then the ALLOCATE call would mark them all delalloc. > The space reservation and delalloc is done in a single call right > now, but splitting them is not hard... But this would mean that you would have to implement at least basic delayed allocation support for filesystems like ext2, ext3, udf, vfat, etc. They currently do not have a way to provide the functionality you require from the RESERVE call... Not that it would be so hard to do - I actually have some patches for ext2/3 doing that because of problems with ENOSPC during mmaped writes, but it's not completely trivial either. Honza -- Jan Kara <jack@suse.cz> SUSE Labs, CR ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [RFC] new ->perform_write fop 2010-05-17 23:35 ` Jan Kara @ 2010-05-18 1:21 ` Dave Chinner 0 siblings, 0 replies; 44+ messages in thread From: Dave Chinner @ 2010-05-18 1:21 UTC (permalink / raw) To: Jan Kara Cc: Josef Bacik, linux-fsdevel, chris.mason, hch, akpm, npiggin, linux-kernel On Tue, May 18, 2010 at 01:35:41AM +0200, Jan Kara wrote: > On Fri 14-05-10 16:41:45, Dave Chinner wrote: > ... > > Well, the "RESERVE" callout I proposed should mean that the > > allocation that follows the copy should never fail with ENOSPC. i.e. > > if there isn't space, the RESERVE call will fail with ENOSPC, not > > the actual ALLOCATE call that occurs after the data is copied. > > > > Basically I was thinking that in the XFS case, RESERVE would scan the > > range for holes, and reserve all blocks needed to fill all the holes > > in the range. Then the ALLOCATE call would mark them all delalloc. > > The space reservation and delalloc is done in a single call right > > now, but splitting them is not hard... > But this would mean that you would have to implement at least basic > delayed allocation support for filesystems like ext2, ext3, udf, vfat, etc. > They currently do not have a way to provide the functionality you require > from the RESERVE call... I don't think that's necessary. For filesystems that don't do multipage writes (i.e. that continue to operate as they do now), RESERVE/UNRESERVE can effectively be a no-op as there is nothing extra to reserve or undo if the single page allocation fails. Hence filesystems that can't easily be converted to multi-page writes, or that we don't want to put the effort into converting, do not need any new functionality.... > Not that it would be so hard to do - I actually have some patches for > ext2/3 doing that because of problems with ENOSPC during mmaped writes, but > it's not completely trivial either. Yup, and a good reason for not trying to retrofit delayed allocation to all those old filesystems. Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 44+ messages in thread
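In other words, for a filesystem that sticks with single-page writes, the new callouts could be wired up as trivial stubs, roughly like the sketch below (illustrative only; the a_ops members are hypothetical). reserve just trims the request to one page and reserves nothing, so unreserve has nothing to undo, and real ENOSPC handling stays where it is today, in ->write_begin.

    /*
     * Sketch: no-op RESERVE/UNRESERVE for a filesystem that keeps
     * doing one-page-at-a-time writes.
     */
    static int simplefs_reserve(struct address_space *mapping,
                                loff_t pos, size_t *len)
    {
            /* never hand out more than the rest of the current page */
            *len = min_t(size_t, *len,
                         PAGE_CACHE_SIZE - (pos & (PAGE_CACHE_SIZE - 1)));
            return 0;
    }

    static void simplefs_unreserve(struct address_space *mapping,
                                   loff_t pos, size_t len)
    {
            /* nothing was reserved, so there is nothing to undo */
    }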
end of thread, other threads:[~2010-06-06 7:59 UTC | newest]

Thread overview: 44+ messages
2010-05-12 21:24 [RFC] new ->perform_write fop Josef Bacik
2010-05-13 1:39 ` Josef Bacik
2010-05-13 15:36 ` Christoph Hellwig
2010-05-14 1:00 ` Dave Chinner
2010-05-14 3:30 ` Josef Bacik
2010-05-14 5:50 ` Nick Piggin
2010-05-14 7:20 ` Dave Chinner
2010-05-14 7:33 ` Nick Piggin
2010-05-14 6:41 ` Dave Chinner
2010-05-14 7:22 ` Nick Piggin
2010-05-14 8:38 ` Dave Chinner
2010-05-14 13:33 ` Chris Mason
2010-05-18 6:36 ` Nick Piggin
2010-05-18 8:05 ` Dave Chinner
2010-05-18 10:43 ` Nick Piggin
2010-05-18 12:27 ` Dave Chinner
2010-05-18 15:09 ` Nick Piggin
2010-05-19 23:50 ` Dave Chinner
2010-05-20 6:48 ` Nick Piggin
2010-05-20 20:12 ` Jan Kara
2010-05-20 23:05 ` Dave Chinner
2010-05-21 9:05 ` Steven Whitehouse
2010-05-21 13:50 ` Josef Bacik
2010-05-21 14:23 ` Nick Piggin
2010-05-21 15:19 ` Josef Bacik
2010-05-24 3:29 ` Nick Piggin
2010-05-22 0:31 ` Dave Chinner
2010-05-21 18:58 ` Jan Kara
2010-05-22 0:27 ` Dave Chinner
2010-05-24 9:20 ` Jan Kara
2010-05-24 9:33 ` Nick Piggin
2010-06-05 15:05 ` tytso
2010-06-06 7:59 ` Nick Piggin
2010-05-21 15:15 ` Christoph Hellwig
2010-05-22 2:31 ` Nick Piggin
2010-05-22 8:37 ` Dave Chinner
2010-05-24 3:09 ` Nick Piggin
2010-05-24 5:53 ` Dave Chinner
2010-05-24 6:55 ` Nick Piggin
2010-05-24 10:21 ` Dave Chinner
2010-06-01 6:27 ` Nick Piggin
2010-05-24 18:40 ` Joel Becker
2010-05-17 23:35 ` Jan Kara
2010-05-18 1:21 ` Dave Chinner