From: Chris Mason <chris.mason@oracle.com>
To: Nick Piggin <npiggin@suse.de>
Cc: David Chinner <dgc@sgi.com>,
Nick Piggin <nickpiggin@yahoo.com.au>,
Linux Kernel Mailing List <linux-kernel@vger.kernel.org>,
Linux Memory Management List <linux-mm@kvack.org>,
linux-fsdevel@vger.kernel.org
Subject: Re: [RFC] fsblock
Date: Thu, 28 Jun 2007 08:20:31 -0400
Message-ID: <20070628122031.GF5313@think.oraclecorp.com>
In-Reply-To: <20070628024443.GB6038@wotan.suse.de>

On Thu, Jun 28, 2007 at 04:44:43AM +0200, Nick Piggin wrote:
> On Thu, Jun 28, 2007 at 08:35:48AM +1000, David Chinner wrote:
> > On Wed, Jun 27, 2007 at 07:50:56AM -0400, Chris Mason wrote:
> > > Let's look at a typical example of how IO actually gets done today,
> > > starting with sys_write():
> > >
> > > sys_write(file, buffer, 1MB)
> > > for each page:
> > >     prepare_write()
> > >         allocate contiguous chunks of disk
> > >         attach buffers
> > >     copy_from_user()
> > >     commit_write()
> > >         dirty buffers
> > >
> > > pdflush:
> > >     writepages()
> > >         find pages with contiguous chunks of disk
> > >         build and submit large bios
> > >
> > > So, we replace prepare_write and commit_write with an extent based api,
> > > but we keep the dirty each buffer part. writepages has to turn that
> > > back into extents (bio sized), and the result is completely full of dark
> > > dark corner cases.
>
> That's true but I don't think an extent data structure means we can
> become too far divorced from the pagecache or the native block size
> -- what will end up happening is that often we'll need "stuff" to map
> between all those as well, even if it is only at IO-time.

I think the fundamental difference is that fsblock still does:
mapping_info = page->something, where something is attached on a
per-page basis. What we really want is mapping_info = lookup_mapping(page),
where that function goes and finds something stored on a per-extent
basis, with extra bits for tracking dirty and locked state.

Ideally, in at least some of the cases the dirty and locked state could
be at an extent granularity (streaming IO) instead of the block
granularity (random IO).

In my little brain, even block-based filesystems should be able to take
advantage of this...but such things are always easier to believe in
before the coding starts.
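
To make the lookup_mapping(page) idea a bit more concrete, here is a
rough userspace sketch in C. Everything in it is an assumption made for
illustration: struct extent_info, struct extent_table,
mark_extent_dirty() and page_to_block() are hypothetical names, the
sorted array is only a stand-in for whatever tree a real implementation
would use, and lookup_mapping() takes a table and a page index rather
than a struct page. It is not fsblock code and not a kernel API.

```c
/* Userspace model only: all names here are hypothetical, not kernel APIs. */
#include <assert.h>
#include <stddef.h>

#define EXTENT_DIRTY  0x1u
#define EXTENT_LOCKED 0x2u

/* One mapping record covers a run of pages instead of a single page. */
struct extent_info {
    unsigned long first_page;      /* first page index covered */
    unsigned long nr_pages;        /* number of contiguous pages covered */
    unsigned long long disk_block; /* disk block backing first_page */
    unsigned int flags;            /* dirty/locked state for the whole run */
};

/* Stand-in for a per-file extent tree: a sorted, non-overlapping array. */
struct extent_table {
    struct extent_info *extents;
    size_t count;
};

/* Binary search for the extent covering page_index; NULL if unmapped. */
static struct extent_info *lookup_mapping(struct extent_table *t,
                                          unsigned long page_index)
{
    size_t lo = 0, hi = t->count;

    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        struct extent_info *e = &t->extents[mid];

        if (page_index < e->first_page)
            hi = mid;
        else if (page_index - e->first_page >= e->nr_pages)
            lo = mid + 1;
        else
            return e;
    }
    return NULL;
}

/* Streaming case: one flag update dirties every page in the extent. */
static int mark_extent_dirty(struct extent_table *t, unsigned long page_index)
{
    struct extent_info *e = lookup_mapping(t, page_index);

    if (!e)
        return -1;
    e->flags |= EXTENT_DIRTY;
    return 0;
}

/* Disk block for a page, derived from the extent instead of stored per page. */
static unsigned long long page_to_block(const struct extent_info *e,
                                        unsigned long page_index,
                                        unsigned int blocks_per_page)
{
    assert(page_index >= e->first_page &&
           page_index - e->first_page < e->nr_pages);
    return e->disk_block +
           (unsigned long long)(page_index - e->first_page) * blocks_per_page;
}
```

The sketch only covers the extent-granularity (streaming) case. Random
IO would still need a way to split an extent or track finer-grained
state inside it, which is left out here.
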
>
> But the point is taken, and I do believe that at least for APIs, extent
> based seems like the best way to go. And that should allow fsblock to
> be replaced or augmented in future without _too_ much pain.
>
>
> > Yup - I've been on the painful end of those dark corner cases several
> > times in the last few months.
> >
> > It's also worth pointing out that mpage_readpages() already works on
> > an extent basis - it overloads bufferheads to provide a "map_bh" that
> > can point to a range of blocks in the same state. The code then iterates
> > the map_bh range a page at a time building bios (i.e. not even using
> > buffer heads) from that map......
>
> One issue I have with the current nobh and mpage stuff is that it
> requires multiple calls into get_block (first to prepare write, then
> to writepage), it doesn't allow filesystems to attach resources
> required for writeout at prepare_write time, and it doesn't play nicely
> with buffers in general. (not to mention that nobh error handling is
> buggy).
>
> I haven't done any mpage-like code for fsblocks yet, but I think they
> wouldn't be too much trouble, and wouldn't have any of the above
> problems...

Could be, but the fundamental issue is still there: sometimes pages
have mappings attached and sometimes they don't. The window is
smaller, but non-zero.

-chris