From: Dave Chinner <david@fromorbit.com>
To: David Rientjes <rientjes@google.com>
Cc: Ted Ts'o <tytso@mit.edu>, Peter Zijlstra <peterz@infradead.org>,
Jens Axboe <jaxboe@fusionio.com>,
Andrew Morton <akpm@linux-foundation.org>,
Neil Brown <neilb@suse.de>, Alasdair G Kergon <agk@redhat.com>,
Chris Mason <chris.mason@oracle.com>,
Steven Whitehouse <swhiteho@redhat.com>, Jan Kara <jack@suse.cz>,
Frederic Weisbecker <fweisbec@gmail.com>,
"linux-raid@vger.kernel.org" <linux-raid@vger.kernel.org>,
"linux-btrfs@vger.kernel.org" <linux-btrfs@vger.kernel.org>,
"cluster-devel@redhat.com" <cluster-devel@redhat.com>,
"linux-ext4@vger.kernel.org" <linux-ext4@vger.kernel.org>,
"reiserfs-devel@vger.kernel.org" <reiserfs-devel@vger.kernel.org>,
"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>
Subject: Re: [patch 1/5] mm: add nofail variants of kmalloc kcalloc and kzalloc
Date: Thu, 26 Aug 2010 17:06:19 +1000
Message-ID: <20100826070619.GC705@dastard>
In-Reply-To: <alpine.DEB.2.00.1008251951230.7034@chino.kir.corp.google.com>

On Wed, Aug 25, 2010 at 08:09:21PM -0700, David Rientjes wrote:
> On Wed, 25 Aug 2010, Ted Ts'o wrote:
> > > I think it's really sad that the caller can't know what the upper bounds
> > > of its memory requirement are ahead of time or at least be able to
> > > implement a memory freeing function when kmalloc() returns NULL.
> >
> > Oh, we can determine an upper bound. You might just not like it.
> > Actually ext3/ext4 shouldn't be as bad as XFS, which Dave estimated to
> > be around 400k for a transaction. My guess is that the worst case for
> > ext3/ext4 is probably around 256k or so; like XFS, most of the time,
> > it would be a lot less. (At least, if data != journalled; if we are
> > doing data journalling and every single data block begins with
> > 0xc03b3998U, we'll need to allocate a 4k page for every single data
> > block written.) We could dynamically calculate an upper bound if we
> > had to. Of course, if ext3/ext4 is attached to a network block
> > device, then it could get a lot worse than 256k, of course.
> >
>
> On my 8GB machine, /proc/zoneinfo says the min watermark for ZONE_NORMAL
> is 5086 pages, or ~20MB. GFP_ATOMIC would allow access to ~12MB of that,
> so perhaps we should consider this is an acceptable abuse of GFP_ATOMIC as
> a fallback behavior when GFP_NOFS or GFP_NOIO fails?

It would take a handful of concurrent transactions in XFS with
worst-case memory allocation requirements to exhaust that pool, and
then we really would be in trouble. Alternatively, it would take a
few allocations from each of a couple of thousand concurrent
transactions to get to the same point.
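To put rough numbers on that (my arithmetic, reusing the ~12MB GFP_ATOMIC
slice and the ~400k worst-case transaction footprint quoted above; the
helper name is invented for illustration):

```c
/* Back-of-the-envelope only: how many worst-case transactions fit in
 * the reserve before it is exhausted? Figures come from the thread:
 * ~12MB usable via GFP_ATOMIC, ~400k per worst-case XFS transaction. */
static int transactions_to_exhaust(int pool_kb, int per_tx_kb)
{
    return pool_kb / per_tx_kb;  /* whole transactions, rounded down */
}
```

A few dozen concurrent worst-case transactions would drain the entire
slice, which is what makes leaning on GFP_ATOMIC as a fallback so fragile.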

Bounded memory pools only work when serialised access to the pool can
be enforced and there are no dependencies on other operations in
progress for completion of the work and freeing of the memory.
This is where it becomes exceedingly difficult to guarantee
progress.

One of the ideas that has floated around (I think Mel Gorman came up
with it first) was that if hardening the filesystem is so difficult,
why not just harden a single path via a single thread? e.g. we allow
the bdi flusher thread to have a separate reserve pool of free
pages, and when memory allocations start to fail, that thread
can dip into its pool to complete the writeback of the dirty pages
being flushed. When a filesystem attaches to a bdi, it can specify
the size of the reserve pool it needs.

This can be easily tested for during allocation (say a PF_ flag) and
switched to the reserve pool as necessary. Because it is per-thread,
access to the pool is guaranteed to be serialised. Memory reclaim can
then refill these pools before putting pages on freelists. This
could give us a mechanism for ensuring that allocations succeed in
the ->writepage path without needing to care about filesystem
implementation details.
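A userspace sketch of what that could look like (every name here --
PF_RESERVE, struct reserve_pool, alloc_page_for -- is invented for
illustration, not an existing kernel interface):

```c
#include <stdlib.h>

#define PF_RESERVE 0x1  /* invented task flag: may use private reserve */

struct reserve_pool {
    void **pages;       /* pages set aside by reclaim for this thread */
    int    nr_free;
};

struct task {
    unsigned int        flags;
    struct reserve_pool reserve;
};

/* Allocation path: 'under_pressure' simulates the normal allocation
 * failing. Only a flagged thread may dip into its reserve, and since
 * the reserve belongs to exactly one thread, access to it is
 * serialised by construction -- no locking needed. */
static void *alloc_page_for(struct task *t, int under_pressure)
{
    if (!under_pressure)
        return malloc(4096);
    if ((t->flags & PF_RESERVE) && t->reserve.nr_free > 0)
        return t->reserve.pages[--t->reserve.nr_free];
    return NULL;  /* no reserve: the allocation fails as it does today */
}
```

Reclaim refilling t->reserve before pages go back onto the freelists is
the piece that keeps the guarantee alive across repeated flushes.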

And in the case of ext3/4, a pool could be attached to the jbd
thread as well so that it is never starved of memory when commits are
required...

So, rather than turning filesystems upside down, maybe we should
revisit per-thread reserve pools for threads that are tasked with
cleaning pages for the VM?

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com