Re: [CFT] delayed allocation and multipage I/O patches for 2.5.6.

From: Andrew Morton <akpm@zip.com.au>
To: Daniel Phillips <phillips@bonn-fries.net>
Cc: lkml <linux-kernel@vger.kernel.org>
Subject: Re: [CFT] delayed allocation and multipage I/O patches for 2.5.6.
Date: Tue, 12 Mar 2002 13:00:19 -0800
Message-ID: <3C8E6C63.E8B72195@zip.com.au>
In-Reply-To: <3C8D9999.83F991DB@zip.com.au>, <3C8D9999.83F991DB@zip.com.au> <E16kkcq-0001rV-00@starship>

Daniel Phillips wrote:
> 
> On March 12, 2002 07:00 am, Andrew Morton wrote:
> > dallocbase-15-pageprivate
> >
> >   page->buffers is a bit of a layering violation.  Not all address_spaces
> >   have pages which are backed by buffers.
> >
> >   The exclusive use of page->buffers for buffers means that a piece of prime
> >   real estate in struct page is unavailable to other forms of address_space.
> >
> >   This patch turns page->buffers into `unsigned long page->private' and sets
> >   in place all the infrastructure which is needed to allow other address_spaces
> >   to use this storage.
> >
> >   With this change in place, the multipage-bio no-buffer_head code can use
> >   page->private to cache the results of an earlier get_block(), so repeated
> >   calls into the filesystem are not needed in the case of file overwriting.
> 
> That's pragmatic, a good short term solution.  Getting rid of page->buffers
> entirely will be nicer, and in that case you want to cache the physical block
> only for those pages that have one, e.g., not for swap-backed pages, which
> keep that information in the page table.

Really, I don't think we can lose page->buffers for *enough* users
of address_spaces to make it worthwhile.

If it was only being used for, say, blockdev inodes then we could
perhaps take it out and hash for it, but there are a ton of
filesystems out there...
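
To make the page->private idea from the patch description concrete, here
is a minimal userspace sketch, not kernel code.  The struct, the
DEMO_UNMAPPED sentinel and demo_get_block() are hypothetical stand-ins
invented for illustration; the real patch may encode the cached mapping
quite differently.

```c
#include <assert.h>

/* Hypothetical stand-in for struct page; not the kernel's definition. */
struct demo_page {
    unsigned long index;    /* file offset of the page, in pages */
    unsigned long private;  /* opaque storage owned by the address_space */
};

/* Assumption for this sketch: 0 means "no block cached yet". */
#define DEMO_UNMAPPED 0UL

/* Counts how often the pretend filesystem is consulted. */
static unsigned long demo_get_block_calls;

/* Pretend filesystem lookup: file page index -> disk block number. */
static unsigned long demo_get_block(unsigned long index)
{
    demo_get_block_calls++;
    return 1000 + index;    /* arbitrary fake on-disk layout */
}

/*
 * Return the disk block backing a page.  The filesystem is asked only
 * the first time; later calls (e.g. overwriting the same file region)
 * reuse the value cached in page->private.
 */
static unsigned long demo_page_block(struct demo_page *page)
{
    assert(page != 0);
    if (page->private == DEMO_UNMAPPED)
        page->private = demo_get_block(page->index);
    return page->private;
}
```

Calling demo_page_block() twice on the same page consults
demo_get_block() once, which is the overwrite case the patch description
is aiming at.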

The main problem I see with this patch series is that it introduces
a new way of performing writeback while leaving the old way in place.
The new way is better, I think - it's just a_ops->write_many_pages().
But at present, there are some address_spaces which support write_many_pages(),
and others which still use ->writepage() and sync_page_buffers().

This will make VM development harder, because the VM now needs to cope
with the nice, uniform, does-clustering-for-you writeback as well as
the crufty old write-little-bits-of-crap-all-over-the-disk writeback :)

I need to give the VM a uniform way of performing writeback for
all address_spaces.  My current thinking there is that all
address_spaces (even the non-delalloc, buffer_head-backed ones)
need to be taught to perform multipage clustered writeback
based on the address_space, not the dirty buffer LRU.

This is pretty deep surgery.  If it can be made to work, it'll
be nice - it will heavily deprecate the buffer_head layer and will
unify the current two-or-three different ways of performing
writeback (I've already unified all ways of performing writeback
for delalloc filesystems - my version of kupdate writeback, bdflush
writeback, vm-writeback and write(2) writeback are all unified).

> I've been playing with the idea of caching the physical block in the radix
> tree, which imposes the cost only on cache pages.  This forces you to do a
> tree probe at IO time, but that cost is probably insignificant against the
> cost of the IO.  This arrangement could make it quite convenient for the
> filesystem to exploit the structure by doing opportunistic map-ahead, i.e.,
> when ->get_block consults the metadata to fill in one physical address, why
> not fill in several more, if it's convenient?

That would be fairly easy to do.  My current writeback interface
into the filesystem is, basically, "write back N pages from your
mapping->dirty_pages list" [1].  The address_space could quite simply
whizz through that list and map all the required pages in a batched
manner.
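
As a rough illustration of that batching, the sketch below coalesces up
to N dirty page indices into file-offset-contiguous runs, so that a
filesystem could map each run in one go.  It is a userspace toy: the
names are hypothetical, and it assumes the dirty indices arrive sorted in
ascending order, which the real mapping->dirty_pages list does not
promise.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical: one run of file-contiguous dirty pages. */
struct demo_extent {
    unsigned long start;    /* first page index in the run */
    unsigned long len;      /* number of pages in the run */
};

/*
 * Walk at most max_pages entries of a sorted dirty-index array and
 * merge neighbouring indices into extents.  Returns the number of
 * extents stored in out (never more than out_cap).
 */
static size_t demo_cluster_dirty(const unsigned long *dirty, size_t ndirty,
                                 size_t max_pages,
                                 struct demo_extent *out, size_t out_cap)
{
    size_t limit = ndirty < max_pages ? ndirty : max_pages;
    size_t nout = 0;
    size_t i;

    for (i = 0; i < limit; i++) {
        /* Assumption of this sketch: strictly ascending input. */
        assert(i == 0 || dirty[i] > dirty[i - 1]);

        /* Extend the current run if this page directly follows it. */
        if (nout > 0 &&
            out[nout - 1].start + out[nout - 1].len == dirty[i]) {
            out[nout - 1].len++;
            continue;
        }

        /* Otherwise start a new run, if there is room for one. */
        if (nout == out_cap)
            break;
        out[nout].start = dirty[i];
        out[nout].len = 1;
        nout++;
    }
    return nout;
}
```

With dirty indices 3, 4, 5, 9, 10, 20 and N = 5, this yields two runs
(3..5 and 9..10), i.e. two mapping requests instead of five.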

[1] Problem with the current implementation is that I've taken
    out the guarantee that the page which the VM wanted to free
    actually has I/O started against it.  So if the VM wants to
    free something from ZONE_NORMAL, the address_space may just
    go and start writeback against 1000 ZONE_HIGHMEM pages instead.
    In practice, I suspect this doesn't matter much.  But it needs
    fixing.

    (Our current behaviour in this scenario is terrible.  Suppose
    a mapping has a mixture of dirty pages from two or more zones,
    and the VM is trying to free up a particular zone: the VM will
    *selectively* perform writepage against *some* of the dirty
    pages, and will skip writeback of pages from other zones.

    This means that we're submitting great chunks of discontiguous
    I/O.  It'll fragment the layout of sparse files and will
    greatly decrease writeout bandwidth.  We should be opportunistically
    submitting writeback against disk-contiguous and file-offset-contiguous
    pages from other zones at the same time!  I'm doing that now, but
    with the present VM design [2] I do need to provide a way to
    ensure that writeback has commenced against the target page).

[2] The more I think about it, the less I like it.  I have a feeling
    that I'll end up having to, umm, redesign the VM.  Damn.
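
The zone problem in footnote [1] can be shown with a small userspace
model.  Everything here is hypothetical (the struct, the two-zone enum,
the function names) and the dirty pages are assumed to be sorted by file
index; it only contrasts "write the pages of one zone" with "write the
whole contiguous run that contains the target page".

```c
#include <assert.h>
#include <stddef.h>

enum demo_zone { DEMO_ZONE_NORMAL, DEMO_ZONE_HIGHMEM };

/* Hypothetical record for one dirty page of a mapping. */
struct demo_dirty_page {
    unsigned long index;    /* file offset, in pages */
    enum demo_zone zone;    /* memory zone the page lives in */
};

/*
 * Zone-selective writeback: count how many separate contiguous I/O
 * runs result if only pages from `zone` are written.
 */
static size_t demo_runs_zone_only(const struct demo_dirty_page *p, size_t n,
                                  enum demo_zone zone)
{
    size_t runs = 0, i;
    unsigned long prev = 0;
    int have_prev = 0;

    for (i = 0; i < n; i++) {
        if (p[i].zone != zone)
            continue;       /* skipped page breaks contiguity */
        if (!have_prev || p[i].index != prev + 1)
            runs++;
        prev = p[i].index;
        have_prev = 1;
    }
    return runs;
}

/*
 * Opportunistic writeback: find the file-contiguous run of dirty pages
 * around the target, regardless of zone.  Because the run is grown
 * outward from the target, the target page is always part of it.
 */
static void demo_run_around(const struct demo_dirty_page *p, size_t n,
                            size_t target, size_t *first, size_t *last)
{
    size_t lo = target, hi = target;

    assert(target < n);
    while (lo > 0 && p[lo - 1].index + 1 == p[lo].index)
        lo--;
    while (hi + 1 < n && p[hi].index + 1 == p[hi + 1].index)
        hi++;
    *first = lo;
    *last = hi;
}
```

For six dirty pages at indices 0..5 that alternate between the two
zones, writing only the ZONE_NORMAL pages produces three one-page runs,
whereas growing the run around any one of them covers all six pages in a
single submission.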
