From: Joel Becker <Joel.Becker@oracle.com>
To: Nick Piggin <npiggin@suse.de>
Cc: Dave Chinner <david@fromorbit.com>,
Christoph Hellwig <hch@infradead.org>,
Josef Bacik <josef@redhat.com>,
linux-fsdevel@vger.kernel.org, chris.mason@oracle.com,
akpm@linux-foundation.org, linux-kernel@vger.kernel.org
Subject: Re: [RFC] new ->perform_write fop
Date: Mon, 24 May 2010 11:40:25 -0700
Message-ID: <20100524184024.GA9905@mail.oracle.com>
In-Reply-To: <20100524065519.GT2516@laptop>

On Mon, May 24, 2010 at 04:55:19PM +1000, Nick Piggin wrote:
> On Mon, May 24, 2010 at 03:53:29PM +1000, Dave Chinner wrote:
> > Because if we fail after the allocation then ensuring we handle the
> > error *correctly* and *without further failures* is *fucking hard*.
>
> I don't think you really answered my question. Let me put it in concrete
> terms. In your proposal, why not just do the reserve+allocate *after*
> the pagecache copy? What does the "reserve" part add?

In ocfs2, we can't just crash our filesystem. We have to be
safe not just with respect to the local machine, we have to leave the
filesystem in a consistent state - structure *and* data - for the other
nodes.

The ordering and locking of allocation in get_block(s)() is so
bad that we just Don't Do It. By the time get_block(s)() is called, we
require our filesystem to have the allocation done. We do our
allocation in write_begin(). By the time we get to the page copy, we
can't ENOSPC or EDQUOT. O_DIRECT I/O falls back to sync buffered I/O if
it must allocate, pushing us through write_begin() forcing other nodes
to honor what we've done.

This is easily extended to the reserve multipage operation.
It's not delalloc, because we actually allocate in the reserve
operation. We handle it just like a large case of the single page
operation. Someday we hope to add delalloc, and it would actually do
better here.

I guess you could call this "copy middle" like Dave describes in
his followup to your mail. Copy Middle also has the property that it
can handle short writes without any error handling. Copy First has to
discover it can only get half the allocation and drop the latter half of
the pagecache. Copy Last has to discover it can only do half the page
copy and drop the latter half of the allocation.
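
To make the Copy Middle ordering concrete, here is a toy Python model, not ocfs2 or VFS code. Everything in it (ToyFS, reserve(), release(), copy_middle_write(), the copy_limit parameter) is a hypothetical name invented for illustration, and it assumes every block written lands in a fresh hole. It only shows the shape of the argument: space is secured first, the copy happens second, and a short reservation or a short copy both end as an ordinary short write with nothing to undo.

```python
import errno


class ToyFS:
    """Toy model: a free-space counter, an allocation set, and a page cache."""

    def __init__(self, free_blocks):
        self.free_blocks = free_blocks  # space neither reserved nor allocated
        self.allocated = set()          # block indexes backed by storage
        self.pagecache = {}             # block index -> data copied from the user

    def reserve(self, count):
        """Grant as many blocks as are available, up to count."""
        granted = min(count, self.free_blocks)
        self.free_blocks -= granted
        return granted

    def release(self, count):
        """Hand back reservation that ended up unused."""
        self.free_blocks += count


def copy_middle_write(fs, first, blocks, copy_limit=None):
    """Reserve, copy, then commit only what was copied.

    copy_limit simulates a copy that stops early (e.g. a faulting
    user buffer). Returns the number of blocks written.
    """
    granted = fs.reserve(len(blocks))
    if blocks and granted == 0:
        raise OSError(errno.ENOSPC, "no space left in toy filesystem")

    # The copy can never outrun the space we already hold.
    copied = granted if copy_limit is None else min(granted, copy_limit)
    for i in range(copied):
        fs.pagecache[first + i] = blocks[i]
        fs.allocated.add(first + i)

    # A short copy just returns the unused reservation; nothing is rolled back.
    fs.release(granted - copied)
    return copied
```

In this model a Copy First variant would have to delete page cache entries after a failed allocation, and a Copy Last variant would have to free blocks after a failed copy; the ordering above needs neither step.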
> > IMO, the fundamental issue with using hole punching or direct IO
> > from the zero page to handle errors is that they are complex enough
> > that there is *no guarantee that they will succeed*. e.g. Both can
> > get ENOSPC/EDQUOT because they may end up with metadata allocation
> > requirements above and beyond what was originally reserved. If the
> > error handling fails to handle the error, then where do we go from
> > there?
>
> There are already fundamental issues that seems like they are not
> handled properly if your filesystem may allocate uninitialized blocks
> over holes for writeback cache without somehow marking them as
> uninitialized.
>
> If you get a power failure or IO error before the pagecache can be
> written out, you're left with uninitialized data there, aren't you?
> Simple buffer head based filesystems are already subject to this.

Sure, ext2 does this. But don't most filesystems guaranteeing
state actually make sure to order such I/Os? If you run ext3 in
data=writeback, you get what you pay for. This sounds like a red
herring.

Dave's original point stands. ocfs2 supports unwritten extents
and punching holes. In fact, we directly copied the XFS ioctl(2)s. But
when we do punch holes, we have to adjust our tree. That may require
additional metadata, and *that* can fail with ENOSPC or EDQUOT.
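
A toy Python sketch of why punching a hole can itself need space. This is not the ocfs2 or XFS extent tree; LEAF_CAPACITY, punch_hole() and the flat list of (start, length) extents are hypothetical stand-ins chosen for illustration. The point it models: punching the middle of one extent leaves two extents, so the record count grows, and if the leaf was already full the tree needs a new metadata block that may not be available.

```python
import errno

LEAF_CAPACITY = 2  # hypothetical: extent records that fit in one tree leaf


def leaves_needed(record_count):
    """Number of leaves required to hold record_count extent records."""
    return -(-record_count // LEAF_CAPACITY)  # ceiling division


def punch_hole(extents, start, length, free_meta_blocks):
    """Remove [start, start + length) from a sorted list of (start, length) extents.

    Returns (new_extents, meta_blocks_used). Raises ENOSPC if the split
    needs more metadata blocks than free_meta_blocks allows.
    """
    end = start + length
    result = []
    for ext_start, ext_len in extents:
        ext_end = ext_start + ext_len
        if ext_end <= start or ext_start >= end:
            result.append((ext_start, ext_len))            # untouched extent
            continue
        if ext_start < start:
            result.append((ext_start, start - ext_start))  # left remainder
        if ext_end > end:
            result.append((end, ext_end - end))            # right remainder

    needed = max(0, leaves_needed(len(result)) - leaves_needed(len(extents)))
    if needed > free_meta_blocks:
        raise OSError(errno.ENOSPC, "hole punch needs a new tree block")
    return result, needed
```

With two extents filling the only leaf, punching blocks 3..6 out of the first extent produces three records, so the call needs one extra metadata block and fails with ENOSPC when free_meta_blocks is 0. Punching at the edge of an extent only shrinks a record and needs nothing extra.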
Joel
--
"I always thought the hardest questions were those I could not answer.
Now I know they are the ones I can never ask."
- Charlie Watkins
Joel Becker
Principal Software Developer
Oracle
E-mail: joel.becker@oracle.com
Phone: (650) 506-8127