Re: [RFC] new ->perform_write fop

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Nick Piggin <npiggin@suse.de>
To: Dave Chinner <david@fromorbit.com>
Cc: Josef Bacik <josef@redhat.com>,
	linux-fsdevel@vger.kernel.org, chris.mason@oracle.com,
	hch@infradead.org, akpm@linux-foundation.org,
	linux-kernel@vger.kernel.org
Subject: Re: [RFC] new ->perform_write fop
Date: Tue, 18 May 2010 20:43:51 +1000	[thread overview]
Message-ID: <20100518104351.GF2516@laptop> (raw)
In-Reply-To: <20100518080503.GF2150@dastard>

On Tue, May 18, 2010 at 06:05:03PM +1000, Dave Chinner wrote:
> On Tue, May 18, 2010 at 04:36:47PM +1000, Nick Piggin wrote:
> > Well you could do a large span block allocation at the beginning,
> > and then dirty the pagecache one by one like we do right now.
> 
> The problem is that if we fail to allocate a page (e.g.  ENOMEM) or
> fail the copy (EFAULT) after the block allocation, we have to undo
> the allocation we have already completed. If we don't, we leave
> uninitialisaed allocations on disk that will expose stale data.
> 
> In the second case (EFAULT) we might be able to zero the pages to
> avoid punching out blocks, but the first case where pages can't be
> allocated to cover the block allocated range makes it very
> difficult without being able to punch holes in allocated block
> ranges.
> 
> AFAIK, only XFS and OCFS2 currently support punching out arbitrary
> ranges of allocated blocks from an inode - there is not VFS method
> for it, just an ioctl (XFS_IOC_UNRESVSP).
> 
> Hence the way to avoid needing hole punching is to allocate and lock
> down all the pages into the page cache fіrst, then do the copy so
> they fail before the allocation is done if they are going to fail.
> That makes it much, much easier to handle failures....

So it is just a matter of what is exposed as a vfs interface?
I would much prefer to make it a requirement to support hole
punching in the block allocator if the filesystem wishes to do
such writes.

If you have an ->allocate(inode, RESERVE/ALLOCATE, interval) API
then it makes sense to have a DEALLOCATE operation there too.

That seems like it should be cleaner than to work around it if
we're talking about adding new APIs anyway.


> Remember, we still don't do the "truncate blocks past eof on failure"
> correctly in block_write_begin (still calls vmtruncate, not
> ->setattr), so we can't claim that the current way of doing things
> is a model of perfection that we ca't improve on....

Right, but the API (write_begin/write_end) is sufficient now to
make it work correctly. What should really happen is the
filesystem detects and handles these error cases.

In the truncate patchset (that deprecates ->truncate and vmtruncate
entirely), that is exactly what happens.

> 
> > The only reason to do operations on multiple pages at once is if
> > we need to lock them all. 
> 
> Well, to avoid the refaulting of pages we just unmapped we'd need to
> do that...

Well, the lock/unmap/copy/unlock could be done on a per-page
basis.

 
> > Now the fs might well have that requirement
> > (if it is not using i_mutex for block (de)allocation
> > serialisation), but I don't think generic code needs to be doing
> > that.
> 
> XFS already falls into the category of a filesystem using the
> generic code that does not use i_mutex for allocation serialisation.
> I'm sure it isn't the only filesystem that this is true for, so it
> seems sensible for the generic code to handle this case.

Well, does it need page lock? All pages locked concurrently in
a range under which block allocation is happening? I would much
prefer an allocation API that supports allocation/freeing
without requiring any pagecache at all.

 
> > Basically, once pagecache is marked uptodate, I don't think we should
> > ever put maybe-invalid data into it -- the way to do it is to invalidate
> > that page and put a *new* page in there.
> 
> Ok, so lets do that...
> 
> > Why? Because user mappings are just one problem, but once you had a
> > user mapping, you can have been subject to get_user_pages, so it could
> > be in the middle of a DMA operation or something.
> 
> ... because we already know this behaviour causes problems for
> high end enterprise level features like hardware checksumming IO
> paths.
> 
> Hence it seems that a multipage write needs to:
> 
> 	1. allocate new pages
> 	2. attach bufferheads/mapping structures to pages (if required)
> 	3. copy data into pages
> 	4. allocate space
> 	5. for each old page in the range:
> 		lock page
> 		invalidate mappings
> 		clear page uptodate flag
> 		remove page from page cache
> 	6. for each new page:
> 		map new page to allocated space
> 		lock new page
> 		insert new page into pagecache
> 		update new page state (write_end equivalent)
> 		unlock new page
> 	7. free old pages
> 
> Steps 1-4 can all fail, and can all be backed out from without
> changing the current state. Steps 5-7 can't fail AFAICT, so we
> should be able to run this safely after the allocation without
> needing significant error unwinding...
> 
> Thoughts?

Possibly. The importance of hot cache is reduced, because we are
doing full-page copies, and bulk copies, by definition. But it
could still be an issue. The allocations and deallocations could
cost a little as well.

Compared to having a nice API to just do bulk allocate/free block
operations and then just doing simple per-page copies, I think it
doesn't look so nice.

WARNING: multiple messages have this Message-ID (diff)

From: Nick Piggin <npiggin@suse.de>
To: Dave Chinner <david@fromorbit.com>
Cc: Josef Bacik <josef@redhat.com>,
	linux-fsdevel@vger.kernel.org, chris.mason@oracle.com,
	hch@infradead.org, akpm@linux-foundation.org,
	linux-kernel@vger.kernel.org
Subject: Re: [RFC] new ->perform_write fop
Date: Tue, 18 May 2010 20:43:51 +1000	[thread overview]
Message-ID: <20100518104351.GF2516@laptop> (raw)
In-Reply-To: <20100518080503.GF2150@dastard>

On Tue, May 18, 2010 at 06:05:03PM +1000, Dave Chinner wrote:
> On Tue, May 18, 2010 at 04:36:47PM +1000, Nick Piggin wrote:
> > Well you could do a large span block allocation at the beginning,
> > and then dirty the pagecache one by one like we do right now.
> 
> The problem is that if we fail to allocate a page (e.g.  ENOMEM) or
> fail the copy (EFAULT) after the block allocation, we have to undo
> the allocation we have already completed. If we don't, we leave
> uninitialisaed allocations on disk that will expose stale data.
> 
> In the second case (EFAULT) we might be able to zero the pages to
> avoid punching out blocks, but the first case where pages can't be
> allocated to cover the block allocated range makes it very
> difficult without being able to punch holes in allocated block
> ranges.
> 
> AFAIK, only XFS and OCFS2 currently support punching out arbitrary
> ranges of allocated blocks from an inode - there is not VFS method
> for it, just an ioctl (XFS_IOC_UNRESVSP).
> 
> Hence the way to avoid needing hole punching is to allocate and lock
> down all the pages into the page cache fіrst, then do the copy so
> they fail before the allocation is done if they are going to fail.
> That makes it much, much easier to handle failures....

So it is just a matter of what is exposed as a vfs interface?
I would much prefer to make it a requirement to support hole
punching in the block allocator if the filesystem wishes to do
such writes.

If you have an ->allocate(inode, RESERVE/ALLOCATE, interval) API
then it makes sense to have a DEALLOCATE operation there too.

That seems like it should be cleaner than to work around it if
we're talking about adding new APIs anyway.


> Remember, we still don't do the "truncate blocks past eof on failure"
> correctly in block_write_begin (still calls vmtruncate, not
> ->setattr), so we can't claim that the current way of doing things
> is a model of perfection that we ca't improve on....

Right, but the API (write_begin/write_end) is sufficient now to
make it work correctly. What should really happen is the
filesystem detects and handles these error cases.

In the truncate patchset (that deprecates ->truncate and vmtruncate
entirely), that is exactly what happens.

> 
> > The only reason to do operations on multiple pages at once is if
> > we need to lock them all. 
> 
> Well, to avoid the refaulting of pages we just unmapped we'd need to
> do that...

Well, the lock/unmap/copy/unlock could be done on a per-page
basis.

 
> > Now the fs might well have that requirement
> > (if it is not using i_mutex for block (de)allocation
> > serialisation), but I don't think generic code needs to be doing
> > that.
> 
> XFS already falls into the category of a filesystem using the
> generic code that does not use i_mutex for allocation serialisation.
> I'm sure it isn't the only filesystem that this is true for, so it
> seems sensible for the generic code to handle this case.

Well, does it need page lock? All pages locked concurrently in
a range under which block allocation is happening? I would much
prefer an allocation API that supports allocation/freeing
without requiring any pagecache at all.

 
> > Basically, once pagecache is marked uptodate, I don't think we should
> > ever put maybe-invalid data into it -- the way to do it is to invalidate
> > that page and put a *new* page in there.
> 
> Ok, so lets do that...
> 
> > Why? Because user mappings are just one problem, but once you had a
> > user mapping, you can have been subject to get_user_pages, so it could
> > be in the middle of a DMA operation or something.
> 
> ... because we already know this behaviour causes problems for
> high end enterprise level features like hardware checksumming IO
> paths.
> 
> Hence it seems that a multipage write needs to:
> 
> 	1. allocate new pages
> 	2. attach bufferheads/mapping structures to pages (if required)
> 	3. copy data into pages
> 	4. allocate space
> 	5. for each old page in the range:
> 		lock page
> 		invalidate mappings
> 		clear page uptodate flag
> 		remove page from page cache
> 	6. for each new page:
> 		map new page to allocated space
> 		lock new page
> 		insert new page into pagecache
> 		update new page state (write_end equivalent)
> 		unlock new page
> 	7. free old pages
> 
> Steps 1-4 can all fail, and can all be backed out from without
> changing the current state. Steps 5-7 can't fail AFAICT, so we
> should be able to run this safely after the allocation without
> needing significant error unwinding...
> 
> Thoughts?

Possibly. The importance of hot cache is reduced, because we are
doing full-page copies, and bulk copies, by definition. But it
could still be an issue. The allocations and deallocations could
cost a little as well.

Compared to having a nice API to just do bulk allocate/free block
operations and then just doing simple per-page copies, I think it
doesn't look so nice.

--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

next prev parent reply	other threads:[~2010-05-18 10:44 UTC|newest]

Thread overview: 48+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2010-05-12 21:24 [RFC] new ->perform_write fop Josef Bacik
2010-05-13  1:39 ` Josef Bacik
2010-05-13 15:36   ` Christoph Hellwig
2010-05-14  1:00   ` Dave Chinner
2010-05-14  3:30     ` Josef Bacik
2010-05-14  5:50       ` Nick Piggin
2010-05-14  7:20         ` Dave Chinner
2010-05-14  7:33           ` Nick Piggin
2010-05-14  6:41       ` Dave Chinner
2010-05-14  7:22         ` Nick Piggin
2010-05-14  8:38           ` Dave Chinner
2010-05-14 13:33             ` Chris Mason
2010-05-18  6:36             ` Nick Piggin
2010-05-18  8:05               ` Dave Chinner
2010-05-18 10:43                 ` Nick Piggin [this message]
2010-05-18 10:43                   ` Nick Piggin
2010-05-18 12:27                   ` Dave Chinner
2010-05-18 12:27                     ` Dave Chinner
2010-05-18 15:09                     ` Nick Piggin
2010-05-19 23:50                       ` Dave Chinner
2010-05-20  6:48                         ` Nick Piggin
2010-05-20 20:12                         ` Jan Kara
2010-05-20 23:05                           ` Dave Chinner
2010-05-21  9:05                             ` Steven Whitehouse
2010-05-21 13:50                             ` Josef Bacik
2010-05-21 13:50                               ` Josef Bacik
2010-05-21 14:23                               ` Nick Piggin
2010-05-21 15:19                                 ` Josef Bacik
2010-05-24  3:29                                   ` Nick Piggin
2010-05-22  0:31                               ` Dave Chinner
2010-05-21 18:58                             ` Jan Kara
2010-05-22  0:27                               ` Dave Chinner
2010-05-24  9:20                                 ` Jan Kara
2010-05-24  9:33                                   ` Nick Piggin
2010-06-05 15:05                                   ` tytso
2010-06-06  7:59                                     ` Nick Piggin
2010-06-06  7:59                                       ` Nick Piggin
2010-05-21 15:15           ` Christoph Hellwig
2010-05-22  2:31             ` Nick Piggin
2010-05-22  8:37               ` Dave Chinner
2010-05-24  3:09                 ` Nick Piggin
2010-05-24  5:53                   ` Dave Chinner
2010-05-24  6:55                     ` Nick Piggin
2010-05-24 10:21                       ` Dave Chinner
2010-06-01  6:27                         ` Nick Piggin
2010-05-24 18:40                       ` Joel Becker
2010-05-17 23:35         ` Jan Kara
2010-05-18  1:21           ` Dave Chinner

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20100518104351.GF2516@laptop \
    --to=npiggin@suse.de \
    --cc=akpm@linux-foundation.org \
    --cc=chris.mason@oracle.com \
    --cc=david@fromorbit.com \
    --cc=hch@infradead.org \
    --cc=josef@redhat.com \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.