public inbox for linux-kernel@vger.kernel.org
From: Christoph Hellwig <hch@caldera.de>
To: marcelo@conectiva.com.br (Marcelo Tosatti)
Cc: Rajagopal Ananthanarayanan <ananth@sgi.com>,
	Rik van Riel <riel@conectiva.com.br>,
	"Stephen C. Tweedie" <sct@redhat.com>,
	linux-kernel@vger.kernel.org
Subject: Re: [RFC] generic IO write clustering
Date: Sat, 20 Jan 2001 16:57:58 +0100	[thread overview]
Message-ID: <200101201557.QAA14088@ns.caldera.de> (raw)
In-Reply-To: <Pine.LNX.4.21.0101192142060.6167-100000@freak.distro.conectiva>

In article <Pine.LNX.4.21.0101192142060.6167-100000@freak.distro.conectiva> you wrote:
> The write clustering issue has already been discussed (mainly at Miami)
> and the agreement, AFAIK, was to implement the write clustering at the
> per-address-space writepage() operation.

> IMO there are some problems if we implement the write clustering in this
> level:

>   - The filesystem does not have information (and should not have) about
>     limiting cluster size depending on memory shortage.

Agreed.

>   - By doing the write clustering at a higher level, we avoid a ton of
>     filesystems duplicating the code.

Most filesystems share their writepage implementation, and those that
don't usually have special requirements on write clustering anyway.

For example extent-based filesystems (xfs, jfs) usually want to write out
more pages even if the VM doesn't see a need, just for efficiency reasons.

Network-based filesystems also need special care with write clustering,
because the network behaves differently from a typical disk...

> So what I suggest is to add a "cluster" operation to struct address_space
> which can be used by the VM code to know the optimal IO transfer unit in
> the storage device. Something like this (maybe we need an async flag but
> thats a minor detail now):

>         int (*cluster)(struct page *, unsigned long *boffset, 
> 		unsigned long *poffset);

> "page" is from where the filesystem code should start its search for
> contiguous pages. boffset and poffset are passed by the VM code to know
> the logical "backwards offset" (number of contiguous pages going backwards
> from "page") and "forward offset" (cont pages going forward from
> "page") in the inode.
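To make the quoted semantics concrete, here is a rough userspace sketch of
what such a ->cluster operation could compute.  The array-based "address
space", the names page_dirty and blocknr, and the return convention are all
illustrative stand-ins, not from any kernel tree:

```c
#include <assert.h>

#define NR_PAGES 16

static int  page_dirty[NR_PAGES];  /* dirty bit per page in the inode */
static long blocknr[NR_PAGES];     /* on-disk block behind each page  */

/*
 * Count contiguous dirty pages around 'page', going backwards (boffset)
 * and forwards (poffset), stopping as soon as a page is clean or not
 * disk-contiguous with its neighbour.
 */
static int cluster(int page, unsigned long *boffset, unsigned long *poffset)
{
	unsigned long b = 0, p = 0;
	int i;

	if (!page_dirty[page])
		return -1;

	for (i = page - 1; i >= 0 && page_dirty[i] &&
	     blocknr[i] + 1 == blocknr[i + 1]; i--)
		b++;
	for (i = page + 1; i < NR_PAGES && page_dirty[i] &&
	     blocknr[i] == blocknr[i - 1] + 1; i++)
		p++;

	*boffset = b;
	*poffset = p;
	return 0;
}
```

With pages 3..7 dirty and contiguous on disk, calling cluster() on page 5
would report two clusterable pages on each side.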

I think there is a big disadvantage to this approach:
to find out which pages are clusterable, we need to do bmap/get_block,
which means we have to go through the block-allocation functions - that
is rather expensive - and then we have to do it again in writepage, for
the pages that are actually clustered by the VM.

Another thing I dislike is that the flushing gets more complicated with
your VM-level clustering.  Now (and with my approach, which I'll describe
below) flushing is 'write it out now and do whatever else you want';
with your design it is 'find the pages beside this page and write out
a bunch of them' - much more complicated.  I'd like it abstracted out.

> The idea is to work with delayed allocated pages, too. A filesystem which
> has this feature can, at its "cluster" operation, allocate delayed pages
> contiguously on disk, and then return to the VM code which now can
> potentially write a bunch of dirty pages in a few big IO operations.

That does also work nicely together with ->writepage level IO clustering.
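For illustration, a minimal userspace sketch of what "allocate delayed
pages contiguously at cluster time" could look like.  The trivial bump
allocator, the NO_BLOCK sentinel, and all names are made up for the
example; a real delayed-allocation path would go through the
filesystem's block allocator:

```c
#include <assert.h>

#define NR_PAGES 8
#define NO_BLOCK (-1L)

static long blocknr[NR_PAGES];         /* NO_BLOCK = delayed, no disk block yet */
static long next_free_block = 500;     /* trivial stand-in allocator            */

/*
 * Assign contiguous disk blocks to a run of delayed pages so the
 * subsequent writeout can go out as one big I/O.
 */
static void allocate_cluster(int first, int last)
{
	for (int i = first; i <= last; i++)
		if (blocknr[i] == NO_BLOCK)
			blocknr[i] = next_free_block++;
}
```

After allocate_cluster(2, 5) on all-delayed pages, pages 2..5 hold four
consecutive block numbers while the pages outside the range stay delayed.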

> I'm sure that a bit of tuning to know the optimal cluster size will be
> needed. Also some fs locking problems will appear.

Sure, but again that's an issue for every kind of IO clustering...

Now to my proposal.  I prefer doing it in writepage, as stated above.
Writepage loops over up to MAX_CLUSTERED_PAGES/2 dirty pages before and
behind the initial page; it first tests whether the page should be
clustered (a callback from the VM, highly 'balanceable'...), then does
a bmap/get_block to check whether it is contiguous.

Finally the IO is submitted using a submit_bh loop, or when using a
kiobuf-based IO path all clustered pages are passed down to ll_rw_kio
in one piece.
As you see, the easy integration with the new bulk-IO mechanisms is also
an advantage of this proposal, without the need for a new multi-page a_op.
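To make the shape of that loop concrete, here is a userspace sketch of
the writepage-level clustering described above.  The value of
MAX_CLUSTERED_PAGES, the vm_should_cluster() callback name, and the
array-based page/block state are illustrative assumptions; in the kernel
the final step would be a submit_bh() loop or a single ll_rw_kio():

```c
#include <assert.h>

#define NR_PAGES            16
#define MAX_CLUSTERED_PAGES 8

static int  page_dirty[NR_PAGES];
static long blocknr[NR_PAGES];     /* stands in for bmap/get_block */

/* VM callback: may shrink the cluster under memory pressure. */
static int vm_should_cluster(int page) { (void)page; return 1; }

/*
 * Write out 'page' plus up to MAX_CLUSTERED_PAGES/2 clustered
 * neighbours on each side; returns the number of pages written.
 */
static int writepage_clustered(int page)
{
	int first = page, last = page, i, n = 0;

	for (i = page - 1; i >= 0 && page - i <= MAX_CLUSTERED_PAGES / 2; i--) {
		if (!page_dirty[i] || !vm_should_cluster(i) ||
		    blocknr[i] + 1 != blocknr[i + 1])
			break;
		first = i;
	}
	for (i = page + 1; i < NR_PAGES && i - page <= MAX_CLUSTERED_PAGES / 2; i++) {
		if (!page_dirty[i] || !vm_should_cluster(i) ||
		    blocknr[i] != blocknr[i - 1] + 1)
			break;
		last = i;
	}

	/* Here the kernel would issue the I/O; we just mark pages clean. */
	for (i = first; i <= last; i++) {
		page_dirty[i] = 0;
		n++;
	}
	return n;
}
```

With pages 2..9 dirty and disk-contiguous, a writepage on page 5 would
pick up four neighbours in each direction (capped by the search window)
and write eight pages in one go.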

	Christoph

-- 
Whip me.  Beat me.  Make me maintain AIX.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

Thread overview: 9+ messages
2001-01-20  0:34 [RFC] generic IO write clustering Marcelo Tosatti
2001-01-20  2:58 ` Rik van Riel
2001-01-20  1:52   ` Marcelo Tosatti
2001-01-20 15:57 ` Christoph Hellwig [this message]
2001-01-20 15:24   ` Marcelo Tosatti
2001-01-20 17:45     ` Christoph Hellwig
2001-01-20 16:00       ` Marcelo Tosatti
2001-01-20 19:05         ` Christoph Hellwig
2001-01-20 17:55           ` Marcelo Tosatti
