Re: [patch][rfc] mm: hold page lock over page_mkwrite


From: Dave Chinner <david@fromorbit.com>
To: Nick Piggin <npiggin@suse.de>
Cc: linux-fsdevel@vger.kernel.org,
	Linux Memory Management List <linux-mm@kvack.org>
Subject: Re: [patch][rfc] mm: hold page lock over page_mkwrite
Date: Mon, 2 Mar 2009 19:19:53 +1100
Message-ID: <20090302081953.GK26138@disturbed>
In-Reply-To: <20090301135057.GA26905@wotan.suse.de>

On Sun, Mar 01, 2009 at 02:50:57PM +0100, Nick Piggin wrote:
> On Sun, Mar 01, 2009 at 07:17:44PM +1100, Dave Chinner wrote:
> > On Wed, Feb 25, 2009 at 10:36:29AM +0100, Nick Piggin wrote:
> > > I need this in fsblock because I am working to ensure filesystem metadata
> > > can be correctly allocated and refcounted. This means that page cleaning
> > > should not require memory allocation (to be really robust).
> > 
> > Which, unfortunately, is just a dream for any filesystem that uses
> > delayed allocation. i.e. they have to walk the free space trees
> > which may need to be read from disk and therefore require memory
> > to succeed....
> 
> Well it's a dream because probably none of them get it right, but
> that doesn't mean it's impossible.
> 
> You don't need complete memory allocation up-front to be robust,
> but having reserves or degraded modes that simply guarantee
> forward progress is enough.
> 
> For example, if you need to read/write filesystem metadata to find
> and allocate free space, then you really only need a page to do all
> the IO.

For journalling filesystems, dirty metadata is pinned for at least the
duration of the transaction, and in many cases it is pinned for
multiple transactions (i.e. in-memory aggregation of commits like
XFS does). And then once the transaction is complete, the memory
can't be reused until the metadata is written to disk.

For the worst case usage in XFS, think about a complete btree split
of both free space trees, plus a complete btree split of the extent
tree.  That is two buffers per level per btree that are pinned by
the transaction.

The free space trees are bound in depth by the AG size so the limit
is (IIRC) 15 buffers per tree at 1TB AG size. However, the inode
extent tree can be deeper than that (bound by filesystem size). In
effect, writing back a single page could require memory allocation
of 30-40 pages just for metadata that is dirtied by the allocation
transaction.
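
The arithmetic above can be sketched roughly as follows. This is only an
illustration under stated assumptions, not XFS code: the function name and
parameters are hypothetical, the 15-buffer figure is the (IIRC) per-tree limit
quoted above, one buffer is assumed to cost one page, and the extent tree
depth is a made-up input because its real bound depends on filesystem size.

```python
def worst_case_pinned_buffers(free_space_tree_buffers=15,
                              free_space_trees=2,
                              extent_tree_levels=0,
                              buffers_per_level=2):
    """Rough upper bound on metadata buffers pinned by one allocation
    transaction (hypothetical helper, not an XFS interface)."""
    # Both free space trees split completely: per-tree limit times two trees.
    free_space = free_space_trees * free_space_tree_buffers
    # A complete extent tree split pins two buffers per level.
    extent = buffers_per_level * extent_tree_levels
    return free_space + extent


# Free space trees alone, then with an assumed 5-level extent tree.
print(worst_case_pinned_buffers())
print(worst_case_pinned_buffers(extent_tree_levels=5))
```

With these inputs the free space trees alone pin 30 buffers, and each extra
extent tree level adds two more, which is how the total for writing back a
single page ends up in the tens of pages.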

And then the next page written back goes into a different
AG and splits the trees there. And then the next does the same.

Luckily, this sort of thing doesn't happen very often, but it does
serve to demonstrate how difficult it is to quantify how much memory
the writeback path really needs to guarantee forward progress.
Hence the dream......

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

