linux-fsdevel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* Status of buffered write path (deadlock fixes)
@ 2006-12-05  6:52 Nick Piggin
  2006-12-07 19:55 ` Mark Fasheh
  2006-12-11 18:17 ` OGAWA Hirofumi
  0 siblings, 2 replies; 20+ messages in thread
From: Nick Piggin @ 2006-12-05  6:52 UTC (permalink / raw)
  To: Linux Memory Management, linux-fsdevel, linux-kernel
  Cc: Mark Fasheh, OGAWA Hirofumi, Andrew Morton

Hi,

I'd like to try to state where we are WRT the buffered write patches,
and ask for comments. Sorry for the wide cc list, but this is an
important issue which hasn't had enough review.

Well the next -mm will include everything we've done so far. I won't
repost patches unless someone would like to comment on a specific one.

I think the core generic_file_buffered_write is fairly robust, after
fixing the efault and zerolength iov problems picked up in testing
(thanks, very helpful!).

So now I *believe* we have an approach that solves the deadlock and
doesn't expose transient or stale data, transient zeroes, or anything
like that.

Error handling is getting close, but there may be cases that nobody
has picked up, and I've noticed a couple which I'll explain below.

I think we do the right thing WRT pagecache error handling: a
!uptodate page remains !uptodate, an uptodate page can handle the
write being done in several parts. Comments in the patches attempt
to explain how this works. I think it is pretty straightforward.

But WRT block allocation in the case of errors, it needs more review.

Block allocation:
- prepare_write can allocate blocks
- prepare_write doesn't need to initialize the pagecache on top of
   these blocks where it is within the range specified in prepare_write
   (because the copy_from_user will initialise it correctly)
- In the case of a !uptodate page, unless the page is brought uptodate
   (ie the copy_from_user completely succeeds) and marked dirty, then
   a read that sneaks in after we unlock the page (to retry the write)
   will try to bring it uptodate by pulling in the uninitialised blocks.

Problem 1:
I think that allocating blocks outside i_size is OK WRT uninitialised
data, because we update i_size only after a successful copy. However,
I don't think we trim these blocks off (eg. perhaps the "prepare_write
may have instantiated a few blocks" path should be the normal error
path for both the copy_from_user and the commit_write error cases as
well?)

We allocate blocks within holes, but these don't need to be trimmed: it
is enough to just zero out any new buffers. It might be nicer if we had
some kind of way to punch a hole, but it is a rare corner case.

Problem 2:
nobh error handling[*]. We have just a single buffer that is used for
each block in the prepare_write path, so the "zero new buffers" trick
doesn't work.

I think one solution to this could be to allocate all buffers for the
page like normal, and then strip them off when commit_write succeeds?
This would allow the zero_new_buffers path to work properly.

[*] Actually I think there is a problem with the mainline nobh error
handling in that a whole page of blocks will get zeroed on failure,
even valid data that isn't being touched by the write.

Finally, filesystems. Only OGAWA Hirofumi and Mark Fasheh have given much
feedback so far. I've tried to grok ext2/3 and think they'll work OK, and
have at least *looked* at all the rest. However in the worst case, there
might be many subtle and different problems :( Filesystem developers need
to review this, please. I don't want to cc every filesystem dev list, but
if anybody thinks it would be helpful to forward this then please do.

Well, that's about where its at. Block allocation problems 1 and 2
shouldn't be too hard to fix, but I would like confirmation / suggestions.

Thanks,
Nick

--
SUSE Labs, Novell Inc.

Send instant messages to your online friends http://au.messenger.yahoo.com 


^ permalink raw reply	[flat|nested] 20+ messages in thread

end of thread, other threads:[~2006-12-13 14:24 UTC | newest]

Thread overview: 20+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2006-12-05  6:52 Status of buffered write path (deadlock fixes) Nick Piggin
2006-12-07 19:55 ` Mark Fasheh
2006-12-08  3:28   ` Nick Piggin
2006-12-08 23:48     ` Mark Fasheh
2006-12-11  9:11       ` Nick Piggin
2006-12-11 14:20         ` Nick Piggin
2006-12-11 15:52         ` Nick Piggin
2006-12-11 16:12           ` Steven Whitehouse
2006-12-11 16:39             ` Nick Piggin
2006-12-11 17:18               ` Steven Whitehouse
2006-12-12 22:31           ` Mark Fasheh
2006-12-13  0:53             ` Nick Piggin
2006-12-13  1:47               ` Trond Myklebust
2006-12-13  1:56                 ` Nick Piggin
2006-12-13  2:31                   ` Trond Myklebust
2006-12-13  4:03                     ` Nick Piggin
2006-12-13 12:21                       ` Trond Myklebust
2006-12-13 13:49                     ` Peter Staubach
2006-12-13 13:55                       ` Trond Myklebust
2006-12-11 18:17 ` OGAWA Hirofumi

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).