Re: [PATCH] mm: disallow direct reclaim page writeback

From: Dave Chinner <david@fromorbit.com>
To: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Chris Mason <chris.mason@oracle.com>, Mel Gorman <mel@csn.ul.ie>,
	linux-kernel@vger.kernel.org, linux-mm@kvack.org,
	linux-fsdevel@vger.kernel.org
Subject: Re: [PATCH] mm: disallow direct reclaim page writeback
Date: Wed, 14 Apr 2010 17:28:30 +1000
Message-ID: <20100414072830.GK2493@dastard>
In-Reply-To: <20100414155233.D153.A69D9226@jp.fujitsu.com>

On Wed, Apr 14, 2010 at 03:52:44PM +0900, KOSAKI Motohiro wrote:
> > On Tue, Apr 13, 2010 at 04:20:21PM -0400, Chris Mason wrote:
> > > On Tue, Apr 13, 2010 at 08:34:29PM +0100, Mel Gorman wrote:
> > > > > Basically, there is not enough stack space available to allow direct
> > > > > reclaim to enter ->writepage _anywhere_ according to the stack usage
> > > > > profiles we are seeing here....
> > > > > 
> > > > 
> > > > I'm not denying the evidence but how has it been gotten away with for years
> > > > then? Prevention of writeback isn't the answer without figuring out how
> > > > direct reclaimers can queue pages for IO and in the case of lumpy reclaim
> > > > doing sync IO, then waiting on those pages.
> > > 
> > > So, I've been reading along, nodding my head to Dave's side of things
> > > because seeks are evil and direct reclaim makes seeks.  I'd really love
> > > for direct reclaim to somehow trigger writepages on large chunks instead
> > > of doing page by page spatters of IO to the drive.
> 
> I agree that "seeks are evil and direct reclaim makes seeks". Actually,
> making 4k io is not must for pageout. So, probably we can improve it.
> 
> 
> > Perhaps drop the lock on the page if it is held and call one of the
> > helpers that filesystems use to do this, like:
> > 
> > 	filemap_write_and_wait(page->mapping);
> 
> Sorry, I'm lost what you talk about. Why do we need per-file
> waiting? If file is 1GB file, do we need to wait 1GB writeout?

So use filemap_fdatawrite(page->mapping), or if it's better only
to start IO on a segment of the file, use
filemap_fdatawrite_range(page->mapping, start, end)....
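
For illustration, one way the start and end of such a segment could be
picked is to round the target page's byte offset down to a fixed-size
window. This is only a userspace sketch of the arithmetic: the 4k page
size, the 1MiB window and every name in it (struct byte_range,
writeback_window(), the SKETCH_* constants) are hypothetical and do not
come from this thread or from the kernel.

```c
#include <stdint.h>

/* Assumed values for illustration only; neither is specified in this thread. */
#define SKETCH_PAGE_SIZE     4096ULL
#define SKETCH_CLUSTER_BYTES (1024ULL * 1024ULL)	/* hypothetical 1MiB window */

struct byte_range {
	uint64_t start;	/* first byte of the window */
	uint64_t end;	/* last byte of the window, inclusive */
};

/*
 * Given the index of the dirty page that reclaim wants cleaned, return the
 * window-aligned byte range of the file that contains it.
 */
static struct byte_range writeback_window(uint64_t page_index)
{
	uint64_t offset = page_index * SKETCH_PAGE_SIZE;
	struct byte_range r;

	r.start = offset - (offset % SKETCH_CLUSTER_BYTES);
	r.end = r.start + SKETCH_CLUSTER_BYTES - 1;
	return r;
}
```

In kernel context the two values would be what gets passed as start and
end to filemap_fdatawrite_range(); the end is kept inclusive here on the
assumption that the helper takes an inclusive end offset.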

> > > But, somewhere along the line I overlooked the part of Dave's stack trace
> > > that said:
> > > 
> > > 43)     1568     912   do_select+0x3d6/0x700
> > > 
> > > Huh, 912 bytes...for select, really?  From poll.h:
> > 
> > Sure, it's bad, but focussing on the specific case misses the
> > point that even code that is using minimal stack can enter direct
> > reclaim after consuming 1.5k of stack. e.g.:
> 
> checkstack.pl says do_select() and __generic_file_splice_read() are among
> the worst stack consumers. Both should be fixed.

The deepest call chain in queue_work() needs 700 bytes of stack
to complete, wait_for_completion() requires almost 2k of stack space
at its deepest, the scheduler has some heavy stack users, etc,
and these are all functions that appear at the top of the stack.

> also, checkstack.pl says such stack eater aren't so much.

Yeah, but when we have a callchain 70 or more functions deep,
even 100 bytes of stack is a lot....
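
To put rough numbers on that, here is a back-of-the-envelope sketch
using figures quoted in this thread: an 8k stack, 70 frames averaging
100 bytes each, 700 bytes for queue_work() and roughly 2k for
wait_for_completion(). The helper names are made up for the example,
and the uniform per-frame average is a simplifying assumption, not a
measurement.

```c
#define STACK_BYTES 8192L	/* 8k x86_64 kernel stack, as discussed in this thread */

/*
 * Bytes left after `depth` frames that each use `avg_frame` bytes.
 * A negative result means the chain has already overrun the stack.
 */
static long stack_headroom(long depth, long avg_frame)
{
	return STACK_BYTES - depth * avg_frame;
}

/* Non-zero if a further call chain needing `need` bytes still fits. */
static int chain_fits(long depth, long avg_frame, long need)
{
	return stack_headroom(depth, avg_frame) >= need;
}
```

With those inputs, stack_headroom(70, 100) leaves 1192 bytes: enough
for a 700 byte queue_work() chain, but not for a 2k
wait_for_completion() chain. Raising the average frame to 120 bytes
overruns the stack before either is called.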

> > > So, select is intentionally trying to use that much stack.  It should be using
> > > GFP_NOFS if it really wants to suck down that much stack...
> > 
> > The code that did the allocation is called from multiple different
> > contexts - how is it supposed to know that in some of those contexts
> > it is supposed to treat memory allocation differently?
> > 
> > This is my point - if you introduce a new semantic to memory allocation
> > that is "use GFP_NOFS when you are using too much stack" and too much
> > stack is more than 15% of the stack, then pretty much every code path
> > will need to set that flag...
> 
> Nodding my head to Dave's side. changing caller argument seems not good
> solution. I mean
>  - do_select() should use GFP_KERNEL instead stack (as revert 70674f95c0)
>  - reclaim and xfs (and other something else) need to diet.

The list I'm seeing so far includes:
	- scheduler
	- completion interfaces
	- radix tree
	- memory allocation, memory reclaim
	- anything that implements ->writepage
	- select
	- splice read

> Also, I believe stack eater functions should generate a warning. Patch attached.

Good start, but 512 bytes will only catch select and splice read,
and there are 300-400 byte functions in the above list that sit near
the top of the stack....

> > We need at least _700_ bytes of stack free just to call queue_work(),
> > and that now happens deep in the guts of the driver subsystem below XFS.
> > This trace shows 1.8k of stack usage on a simple, single sata disk
> > storage subsystem, so my estimate of 2k of stack for the storage system
> > below XFS is too small - a worst case of 2.5-3k of stack space is probably
> > closer to the mark.
> 
> your explanation is very interesting. I have a (probably dumb) question.
> Why nobody faced stack overflow issue in past? now I think every users
> easily get stack overflow if your explanation is correct.

It's always a problem, but the focus on minimising stack usage has
gone away since i386 has mostly disappeared from server rooms.

XFS has always been the thing that triggered stack usage problems
first - the first reports of problems on x86_64 with 8k stacks in low
memory situations have only just come in, and this is the first time
in a couple of years I've paid close attention to stack usage
outside XFS. What I'm seeing is not pretty....

> > This is the sort of thing I'm pointing at when I say that stack
> > usage outside XFS has grown significantly over the
> > past couple of years. Given XFS has remained pretty much the same or
> > even reduced slightly over the same time period, blaming XFS or
> > saying "callers should use GFP_NOFS" seems like a cop-out to me.
> > Regardless of the IO pattern performance issues, writeback via
> > direct reclaim just uses too much stack to be safe these days...
> 
> Yeah, My answer is simple, All stack eater should be fixed.
> but XFS seems not innocence too. 3.5K is enough big although
> xfs have use such amount since very ago.

XFS used to use much more than that - significant effort has been
put into reducing the stack footprint over many years. There's not
much left to trim without rewriting half the filesystem...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com
