From: Christoph Hellwig <hch@infradead.org>
To: Andrea Arcangeli <aarcange@redhat.com>
Cc: Christoph Hellwig <hch@infradead.org>, Mel Gorman <mel@csn.ul.ie>,
linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org,
linux-mm@kvack.org, Dave Chinner <david@fromorbit.com>,
Chris Mason <chris.mason@oracle.com>,
Nick Piggin <npiggin@suse.de>, Rik van Riel <riel@redhat.com>
Subject: Re: [RFC PATCH 0/6] Do not call ->writepage[s] from direct reclaim and use a_ops->writepages() where possible
Date: Tue, 15 Jun 2010 10:43:42 -0400 [thread overview]
Message-ID: <20100615144342.GA3339@infradead.org> (raw)
In-Reply-To: <20100615142219.GE28052@random.random>
On Tue, Jun 15, 2010 at 04:22:19PM +0200, Andrea Arcangeli wrote:
> If we were forbidden to call ->writepage just because of stack
> overflow yes as I don't think it's big deal with memory compaction and
> I see this as a too limiting design to allow ->writepage only in
> kernel thread. ->writepage is also called by the pagecache layer,
> msync etc.. not just by kswapd.
Other callers of ->writepage are fine because they come from a
controlled environment with relatively little stack usage. The problem
with direct reclaim is that we splice multiple stack hogs ontop of each
other.
Direct reclaim can come from any point that does memory allocations,
including those that absolutely have to because their stack "quota"
is almost used up. Let's look at a worst case scenario:
We're in a deep stack codepath, say
(1) core_sys_select, which has to kmalloc the array if it doesn't
fit on the huge stack variable. All fine by now, it stays in it's
stack quota.
(2) That code now calls into the slab allocator, which doesn't find free
space in the large slab, and then calls into kmem_getpages, adding
more stack usage.
(3) That calls into alloc_pages_exact_node which adds stack usage of
the page allocator.
(4) no free pages in the zone anymore, and direct reclaim is invoked,
adding the stack usage of the reclaim code, which currently is
quite heavy.
(5) direct reclaim calls into foofs ->writepage. foofs_writepage
notices the page is delayed allocated and needs to conver it.
It now has to start a transaction, then call the extent management
code to convert the extent, which calls into the space managment
code, which calls into the buffercache for the metadata buffers,
which needs to submit a bio to read/write the metadata.
(6) The metadata buffer goes through submit_bio and the block layer
code. Because we're doing a synchronous writeout it gets directly
dispatched to the block layer.
(7) for extra fun add a few remapping layers for raid or similar to
add to the stack usage.
(8) The lowlevel block driver is iscsi or something similar, so after
going through the scsi layer adding more stack it now goes through
the networking layer with tcp and ipv4 (if you're unlucky ipv6)
code
(9) we finally end up in the lowlevel networking driver (except that we
would have long overflown the stack)
And for extrea fun:
(10) Just when we're way down that stack an IRQ comes in on the CPU that
we're executing on. Because we don't enable irqstacks for the only
sensible stack configuration (yeah, still bitter about the patch
for that getting ignored) it goes on the same stack above.
And note that the above does not only happen with ext4/btrfs/xfs that
have delayed allocations. With every other filesystem it can also
happen, just a lot less likely - when writing to a file through shared
mmaps we still have to call the allocator from ->writepage in
ext2/ext3/reiserfs/etc.
And seriously, if the VM isn't stopped from calling ->writepage from
reclaim context we FS people will simply ignore any ->writepage from
reclaim context. Been there, done that and never again.
Just wondering, what filesystems do your hugepage testing systems use?
If it's any of the ext4/btrfs/xfs above you're already seeing the
filesystem refuse ->writepage from both kswapd and direct reclaim,
so Mel's series will allow us to reclaim pages from more contexts
than before.
next prev parent reply other threads:[~2010-06-15 14:43 UTC|newest]
Thread overview: 67+ messages / expand[flat|nested] mbox.gz Atom feed top
2010-06-08 9:02 [RFC PATCH 0/6] Do not call ->writepage[s] from direct reclaim and use a_ops->writepages() where possible Mel Gorman
2010-06-08 9:02 ` [PATCH 1/6] tracing, vmscan: Add trace events for kswapd wakeup, sleeping and direct reclaim Mel Gorman
2010-06-08 9:02 ` [PATCH 2/6] tracing, vmscan: Add trace events for LRU page isolation Mel Gorman
2010-06-08 9:02 ` [PATCH 3/6] tracing, vmscan: Add trace event when a page is written Mel Gorman
2010-06-08 9:02 ` [PATCH 4/6] tracing, vmscan: Add a postprocessing script for reclaim-related ftrace events Mel Gorman
2010-06-08 9:02 ` [PATCH 5/6] vmscan: Write out ranges of pages contiguous to the inode where possible Mel Gorman
2010-06-11 6:10 ` Andrew Morton
2010-06-11 12:49 ` Mel Gorman
2010-06-11 19:07 ` Andrew Morton
2010-06-11 20:44 ` Mel Gorman
2010-06-11 21:33 ` Andrew Morton
2010-06-12 0:17 ` Mel Gorman
2010-06-11 16:27 ` Christoph Hellwig
2010-06-08 9:02 ` [PATCH 6/6] vmscan: Do not writeback pages in direct reclaim Mel Gorman
2010-06-11 6:17 ` Andrew Morton
2010-06-11 12:54 ` Mel Gorman
2010-06-11 16:25 ` Christoph Hellwig
2010-06-11 17:43 ` Andrew Morton
2010-06-11 17:49 ` Christoph Hellwig
2010-06-11 18:13 ` Mel Gorman
2010-06-08 9:08 ` [RFC PATCH 0/6] Do not call ->writepage[s] from direct reclaim and use a_ops->writepages() where possible Christoph Hellwig
2010-06-08 9:28 ` Mel Gorman
2010-06-11 16:29 ` Christoph Hellwig
2010-06-11 18:15 ` Mel Gorman
2010-06-11 19:12 ` Chris Mason
2010-06-09 2:52 ` KAMEZAWA Hiroyuki
2010-06-09 9:52 ` Mel Gorman
2010-06-10 0:38 ` KAMEZAWA Hiroyuki
2010-06-10 1:10 ` Mel Gorman
2010-06-10 1:29 ` KAMEZAWA Hiroyuki
2010-06-11 5:57 ` Andrew Morton
2010-06-11 12:33 ` Mel Gorman
2010-06-11 16:30 ` Christoph Hellwig
2010-06-11 18:17 ` Mel Gorman
2010-06-15 14:00 ` Andrea Arcangeli
2010-06-15 14:11 ` Christoph Hellwig
2010-06-15 14:22 ` Andrea Arcangeli
2010-06-15 14:43 ` Christoph Hellwig [this message]
2010-06-15 15:08 ` Andrea Arcangeli
2010-06-15 15:25 ` Christoph Hellwig
2010-06-15 15:45 ` Andrea Arcangeli
2010-06-15 16:26 ` Christoph Hellwig
2010-06-15 16:31 ` Andrea Arcangeli
2010-06-15 16:49 ` Rik van Riel
2010-06-15 16:54 ` Christoph Hellwig
2010-06-15 19:13 ` Rik van Riel
2010-06-15 19:17 ` Christoph Hellwig
2010-06-15 19:44 ` Chris Mason
2010-06-16 7:57 ` Nick Piggin
2010-06-16 16:59 ` Rik van Riel
2010-06-16 17:04 ` Andrea Arcangeli
2010-06-15 16:54 ` Nick Piggin
2010-06-15 15:38 ` Mel Gorman
2010-06-15 16:14 ` Andrea Arcangeli
2010-06-15 16:22 ` Christoph Hellwig
2010-06-15 16:30 ` Mel Gorman
2010-06-15 16:34 ` Mel Gorman
2010-06-15 16:54 ` Andrea Arcangeli
2010-06-15 16:35 ` Christoph Hellwig
2010-06-15 16:37 ` Andrea Arcangeli
2010-06-15 17:43 ` Christoph Hellwig
2010-06-15 16:45 ` Christoph Hellwig
2010-06-15 14:51 ` Mel Gorman
2010-06-15 14:55 ` Rik van Riel
2010-06-15 15:08 ` Nick Piggin
2010-06-15 15:10 ` Mel Gorman
2010-06-15 16:28 ` Andrea Arcangeli
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20100615144342.GA3339@infradead.org \
--to=hch@infradead.org \
--cc=aarcange@redhat.com \
--cc=chris.mason@oracle.com \
--cc=david@fromorbit.com \
--cc=linux-fsdevel@vger.kernel.org \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-mm@kvack.org \
--cc=mel@csn.ul.ie \
--cc=npiggin@suse.de \
--cc=riel@redhat.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).