Re: Status of buffered write path (deadlock fixes)

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Mark Fasheh <mark.fasheh@oracle.com>
To: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: Linux Memory Management <linux-mm@kvack.org>,
	linux-fsdevel@vger.kernel.org,
	linux-kernel <linux-kernel@vger.kernel.org>,
	OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>,
	Andrew Morton <akpm@google.com>
Subject: Re: Status of buffered write path (deadlock fixes)
Date: Thu, 7 Dec 2006 11:55:18 -0800	[thread overview]
Message-ID: <20061207195518.GG4497@ca-server1.us.oracle.com> (raw)
In-Reply-To: <45751712.80301@yahoo.com.au>

Hi Nick,

On Tue, Dec 05, 2006 at 05:52:02PM +1100, Nick Piggin wrote:
> Hi,
> 
> I'd like to try to state where we are WRT the buffered write patches,
> and ask for comments. Sorry for the wide cc list, but this is an
> important issue which hasn't had enough review.

I pulled broken-out-2006-12-05-01-0.tar.gz from ftp.kernel.org and applied
the following patches to get a reasonable idea of what the final product
would look like:

revert-generic_file_buffered_write-handle-zero-length-iovec-segments.patch
revert-generic_file_buffered_write-deadlock-on-vectored-write.patch
generic_file_buffered_write-cleanup.patch
mm-only-mm-debug-write-deadlocks.patch
mm-fix-pagecache-write-deadlocks.patch
mm-fix-pagecache-write-deadlocks-comment.patch
mm-fix-pagecache-write-deadlocks-xip.patch
mm-fix-pagecache-write-deadlocks-mm-pagecache-write-deadlocks-efault-fix.patch
mm-fix-pagecache-write-deadlocks-zerolength-fix.patch
mm-fix-pagecache-write-deadlocks-stale-holes-fix.patch
fs-prepare_write-fixes.patch

If this is incorrect, or I should apply further patches, please let me know.

Hopefully my feedback can be of use to you.


> Well the next -mm will include everything we've done so far. I won't
> repost patches unless someone would like to comment on a specific one.
> 
> I think the core generic_file_buffered_write is fairly robust, after
> fixing the efault and zerolength iov problems picked up in testing
> (thanks, very helpful!).
> 
> So now I *believe* we have an approach that solves the deadlock and
> doesn't expose transient or stale data, transient zeroes, or anything
> like that.

In generic_file_buffered_write() we now do:

	status = a_ops->commit_write(file, page, offset,offset+copied);

Which tells the file system to commit only the amount of data that
filemap_copy_from_user() was able to pull in, despite our zeroing of 
the newly allocated buffers in the copied != bytes case. Shouldn't we be
doing:

        status = a_ops->commit_write(file, page, offset,offset+bytes);

instead, thus preserving ordered writeout (for ext3, ocfs2, etc) for those
parts of the page which are properly allocated and zero'd but haven't been
copied into yet? I think that in the case of a crash just after the
transaction is closed in ->commit_write(), we might lose those guarantees,
exposing stale data on disk.

 
> Error handling is getting close, but there may be cases that nobody
> has picked up, and I've noticed a couple which I'll explain below.
> 
> I think we do the right thing WRT pagecache error handling: a
> !uptodate page remains !uptodate, an uptodate page can handle the
> write being done in several parts. Comments in the patches attempt
> to explain how this works. I think it is pretty straightforward.
> 
> But WRT block allocation in the case of errors, it needs more review.
> 
> Block allocation:
> - prepare_write can allocate blocks
> - prepare_write doesn't need to initialize the pagecache on top of
>   these blocks where it is within the range specified in prepare_write
>   (because the copy_from_user will initialise it correctly)
> - In the case of a !uptodate page, unless the page is brought uptodate
>   (ie the copy_from_user completely succeeds) and marked dirty, then
>   a read that sneaks in after we unlock the page (to retry the write)
>   will try to bring it uptodate by pulling in the uninitialised blocks.

For some reason, I'm not seeing where BH_New is being cleared in case with
no errors or faults. Hopefully I'm wrong, but if I'm not I believe we need
to clear the flag somewhere (perhaps in block_commit_write()?).


Ok, that's it for now. I have some thoughts regarding the asymmetry between
ranges passed to ->prepare_write() and ->commit_write(), but I'd like to
save those thoughts until I know whether my comments above uncovered real
issues :)

Thanks,
	--Mark

--
Mark Fasheh
Senior Software Developer, Oracle
mark.fasheh@oracle.com

WARNING: multiple messages have this Message-ID (diff)

From: Mark Fasheh <mark.fasheh@oracle.com>
To: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: Linux Memory Management <linux-mm@kvack.org>,
	linux-fsdevel@vger.kernel.org,
	linux-kernel <linux-kernel@vger.kernel.org>,
	OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>,
	Andrew Morton <akpm@google.com>
Subject: Re: Status of buffered write path (deadlock fixes)
Date: Thu, 7 Dec 2006 11:55:18 -0800	[thread overview]
Message-ID: <20061207195518.GG4497@ca-server1.us.oracle.com> (raw)
In-Reply-To: <45751712.80301@yahoo.com.au>

Hi Nick,

On Tue, Dec 05, 2006 at 05:52:02PM +1100, Nick Piggin wrote:
> Hi,
> 
> I'd like to try to state where we are WRT the buffered write patches,
> and ask for comments. Sorry for the wide cc list, but this is an
> important issue which hasn't had enough review.

I pulled broken-out-2006-12-05-01-0.tar.gz from ftp.kernel.org and applied
the following patches to get a reasonable idea of what the final product
would look like:

revert-generic_file_buffered_write-handle-zero-length-iovec-segments.patch
revert-generic_file_buffered_write-deadlock-on-vectored-write.patch
generic_file_buffered_write-cleanup.patch
mm-only-mm-debug-write-deadlocks.patch
mm-fix-pagecache-write-deadlocks.patch
mm-fix-pagecache-write-deadlocks-comment.patch
mm-fix-pagecache-write-deadlocks-xip.patch
mm-fix-pagecache-write-deadlocks-mm-pagecache-write-deadlocks-efault-fix.patch
mm-fix-pagecache-write-deadlocks-zerolength-fix.patch
mm-fix-pagecache-write-deadlocks-stale-holes-fix.patch
fs-prepare_write-fixes.patch

If this is incorrect, or I should apply further patches, please let me know.

Hopefully my feedback can be of use to you.


> Well the next -mm will include everything we've done so far. I won't
> repost patches unless someone would like to comment on a specific one.
> 
> I think the core generic_file_buffered_write is fairly robust, after
> fixing the efault and zerolength iov problems picked up in testing
> (thanks, very helpful!).
> 
> So now I *believe* we have an approach that solves the deadlock and
> doesn't expose transient or stale data, transient zeroes, or anything
> like that.

In generic_file_buffered_write() we now do:

	status = a_ops->commit_write(file, page, offset,offset+copied);

Which tells the file system to commit only the amount of data that
filemap_copy_from_user() was able to pull in, despite our zeroing of 
the newly allocated buffers in the copied != bytes case. Shouldn't we be
doing:

        status = a_ops->commit_write(file, page, offset,offset+bytes);

instead, thus preserving ordered writeout (for ext3, ocfs2, etc) for those
parts of the page which are properly allocated and zero'd but haven't been
copied into yet? I think that in the case of a crash just after the
transaction is closed in ->commit_write(), we might lose those guarantees,
exposing stale data on disk.

 
> Error handling is getting close, but there may be cases that nobody
> has picked up, and I've noticed a couple which I'll explain below.
> 
> I think we do the right thing WRT pagecache error handling: a
> !uptodate page remains !uptodate, an uptodate page can handle the
> write being done in several parts. Comments in the patches attempt
> to explain how this works. I think it is pretty straightforward.
> 
> But WRT block allocation in the case of errors, it needs more review.
> 
> Block allocation:
> - prepare_write can allocate blocks
> - prepare_write doesn't need to initialize the pagecache on top of
>   these blocks where it is within the range specified in prepare_write
>   (because the copy_from_user will initialise it correctly)
> - In the case of a !uptodate page, unless the page is brought uptodate
>   (ie the copy_from_user completely succeeds) and marked dirty, then
>   a read that sneaks in after we unlock the page (to retry the write)
>   will try to bring it uptodate by pulling in the uninitialised blocks.

For some reason, I'm not seeing where BH_New is being cleared in case with
no errors or faults. Hopefully I'm wrong, but if I'm not I believe we need
to clear the flag somewhere (perhaps in block_commit_write()?).


Ok, that's it for now. I have some thoughts regarding the asymmetry between
ranges passed to ->prepare_write() and ->commit_write(), but I'd like to
save those thoughts until I know whether my comments above uncovered real
issues :)

Thanks,
	--Mark

--
Mark Fasheh
Senior Software Developer, Oracle
mark.fasheh@oracle.com

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

next prev parent reply	other threads:[~2006-12-07 19:55 UTC|newest]

Thread overview: 39+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2006-12-05  6:52 Status of buffered write path (deadlock fixes) Nick Piggin
2006-12-05  6:52 ` Nick Piggin
2006-12-07 19:55 ` Mark Fasheh [this message]
2006-12-07 19:55   ` Mark Fasheh
2006-12-08  3:28   ` Nick Piggin
2006-12-08  3:28     ` Nick Piggin
2006-12-08 23:48     ` Mark Fasheh
2006-12-08 23:48       ` Mark Fasheh
2006-12-11  9:11       ` Nick Piggin
2006-12-11  9:11         ` Nick Piggin
2006-12-11 14:20         ` Nick Piggin
2006-12-11 14:20           ` Nick Piggin
2006-12-11 15:52         ` Nick Piggin
2006-12-11 15:52           ` Nick Piggin
2006-12-11 16:12           ` Steven Whitehouse
2006-12-11 16:39             ` Nick Piggin
2006-12-11 16:39               ` Nick Piggin
2006-12-11 17:18               ` Steven Whitehouse
2006-12-11 17:18                 ` Steven Whitehouse
2006-12-12 22:31           ` Mark Fasheh
2006-12-12 22:31             ` Mark Fasheh
2006-12-13  0:53             ` Nick Piggin
2006-12-13  0:53               ` Nick Piggin
2006-12-13  1:47               ` Trond Myklebust
2006-12-13  1:47                 ` Trond Myklebust
2006-12-13  1:56                 ` Nick Piggin
2006-12-13  1:56                   ` Nick Piggin
2006-12-13  2:31                   ` Trond Myklebust
2006-12-13  2:31                     ` Trond Myklebust
2006-12-13  4:03                     ` Nick Piggin
2006-12-13  4:03                       ` Nick Piggin
2006-12-13 12:21                       ` Trond Myklebust
2006-12-13 12:21                         ` Trond Myklebust
2006-12-13 13:49                     ` Peter Staubach
2006-12-13 13:49                       ` Peter Staubach
2006-12-13 13:55                       ` Trond Myklebust
2006-12-13 13:55                         ` Trond Myklebust
2006-12-11 18:17 ` OGAWA Hirofumi
2006-12-11 18:17   ` OGAWA Hirofumi

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20061207195518.GG4497@ca-server1.us.oracle.com \
    --to=mark.fasheh@oracle.com \
    --cc=akpm@google.com \
    --cc=hirofumi@mail.parknet.co.jp \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=nickpiggin@yahoo.com.au \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.