From: Mel Gorman <mel@csn.ul.ie>
To: Dave Chinner <david@fromorbit.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>,
	Chris Mason <chris.mason@oracle.com>,
	linux-kernel@vger.kernel.org, linux-mm@kvack.org,
	linux-fsdevel@vger.kernel.org
Subject: Re: [PATCH] mm: disallow direct reclaim page writeback
Date: Wed, 14 Apr 2010 09:51:33 +0100
Message-ID: <20100414085132.GJ25756@csn.ul.ie>
In-Reply-To: <20100414072830.GK2493@dastard>

On Wed, Apr 14, 2010 at 05:28:30PM +1000, Dave Chinner wrote:
> On Wed, Apr 14, 2010 at 03:52:44PM +0900, KOSAKI Motohiro wrote:
> > > On Tue, Apr 13, 2010 at 04:20:21PM -0400, Chris Mason wrote:
> > > > On Tue, Apr 13, 2010 at 08:34:29PM +0100, Mel Gorman wrote:
> > > > > > Basically, there is not enough stack space available to allow direct
> > > > > > reclaim to enter ->writepage _anywhere_ according to the stack usage
> > > > > > profiles we are seeing here....
> > > > > > 
> > > > > 
> > > > > I'm not denying the evidence, but how has this been gotten away
> > > > > with for years then? Preventing writeback isn't the answer without
> > > > > figuring out how direct reclaimers can queue pages for IO - and, in
> > > > > the lumpy reclaim case where the IO is synchronous, then wait on
> > > > > those pages.
> > > > 
> > > > So, I've been reading along, nodding my head to Dave's side of things
> > > > because seeks are evil and direct reclaim makes seeks.  I'd really love
> > > > for direct reclaim to somehow trigger writepages on large chunks instead
> > > > of doing page-by-page spatters of IO to the drive.
> > 
> > I agree that "seeks are evil and direct reclaim makes seeks". Actually,
> > issuing 4k IO is not a must for pageout, so we can probably improve it.
> > 
> > 
> > > Perhaps drop the lock on the page if it is held and call one of the
> > > helpers that filesystems use to do this, like:
> > > 
> > > 	filemap_write_and_wait(page->mapping);
> > 
> > Sorry, I'm lost as to what you're talking about. Why do we need per-file
> > waiting? If the file is a 1GB file, do we have to wait for 1GB of writeout?
> 
> So use filemap_fdatawrite(page->mapping), or if it's better only
> to start IO on a segment of the file, use
> filemap_fdatawrite_range(page->mapping, start, end)....
> 

That does not help the stack usage issue; the caller still ends up in
->writepages. It'll be better from a seek point of view, but from a VM
perspective it may or may not be cleaning the right pages. So I think
this is a red herring.
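
For reference, I take the suggestion to be something like the following
from pageout() context (a rough, untested sketch; the cluster window is
an arbitrary choice):

	/*
	 * Instead of ->writepage on one dirty page, unlock it and kick
	 * off writeback on a range around it so the IO clusters better.
	 */
	struct address_space *mapping = page->mapping;
	loff_t start = (loff_t)page->index << PAGE_CACHE_SHIFT;
	loff_t end = start + (SWAP_CLUSTER_MAX << PAGE_CACHE_SHIFT) - 1;

	unlock_page(page);
	filemap_fdatawrite_range(mapping, start, end);

It still bottoms out in ->writepages, which is why the stack objection
stands.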

> > > > But, somewhere along the line I overlooked the part of Dave's stack trace
> > > > that said:
> > > > 
> > > > 43)     1568     912   do_select+0x3d6/0x700
> > > > 
> > > > Huh, 912 bytes...for select, really?  From poll.h:
> > > 
> > > Sure, it's bad, but focusing on the specific case misses the
> > > point that even code that is using minimal stack can enter direct
> > > reclaim after consuming 1.5k of stack. e.g.:
> > 
> > checkstack.pl says do_select() and __generic_file_splice_read() are
> > among the worst stack consumers. Both should be fixed.
> 
> the deepest call chain in queue_work() needs 700 bytes of stack
> to complete, wait_for_completion() requires almost 2k of stack space
> at its deepest, the scheduler has some heavy stack users, etc,
> and these are all functions that appear at the top of the stack.
> 

The real issue here then is that stack usage has gone out of control.
Disabling ->writepage in direct reclaim does not guarantee that stack
usage will not be a problem again. From your traces, page reclaim itself
seems to be a big dirty hog.

Differences in what people see on their machines may be down to
architecture or compiler, but most likely inlining. Changing inlining
will not fix the problem; it'll just move the stack usage around.
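
To illustrate with a toy example (not from any real trace;
do_something() is a stand-in): whether the 512 bytes below are charged
to helper() or to caller() in a stack trace depends entirely on whether
the compiler inlines helper(), but the bytes are consumed at the same
depth either way.

	static noinline void helper(void)
	{
		char buf[512];	/* lives in helper()'s frame */

		do_something(buf);
	}

	void caller(void)
	{
		/* drop the noinline and buf moves into caller()'s frame */
		helper();
	}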

> > Also, checkstack.pl says there aren't that many such stack eaters.
> 
> Yeah, but when we have a callchain 70 or more functions deep,
> even 100 bytes of stack is a lot....
> 
> > > > So, select is intentionally trying to use that much stack.  It should be using
> > > > GFP_NOFS if it really wants to suck down that much stack...
> > > 
> > > The code that did the allocation is called from multiple different
> > > contexts - how is it supposed to know that in some of those contexts
> > > it is supposed to treat memory allocation differently?
> > > 
> > > This is my point - if you introduce a new semantic to memory allocation
> > > that is "use GFP_NOFS when you are using too much stack" and too much
> > > stack is more than 15% of the stack, then pretty much every code path
> > > will need to set that flag...
> > 
> > Nodding my head to Dave's side. Changing the callers' arguments seems
> > like a bad solution. I mean:
> >  - do_select() should use GFP_KERNEL allocations instead of the stack
> >    (i.e. revert 70674f95c0)
> >  - reclaim and XFS (and probably others) need to go on a diet.
> 
> The list I'm seeing so far includes:
> 	- scheduler
> 	- completion interfaces
> 	- radix tree
> 	- memory allocation, memory reclaim
> 	- anything that implements ->writepage
> 	- select
> 	- splice read
> 
> > Also, I believe stack-eater functions should generate a warning.
> > Patch attached.
> 
> Good start, but 512 bytes will only catch select and splice read,
> and there are 300-400 byte functions in the above list that sit near
> the top of the stack....
> 

They will need to be tackled in turn then, but obviously there should be
a focus on the common paths. The reclaim paths do seem particularly
heavy, and a lot of it is down to temporary variables. I might not get
the time today, but what I'm going to try to do some time this week is:

o Look at what temporary variables are copies of other pieces of information
o See what variables live for the duration of reclaim but are not needed
  for all of it (i.e. uninline parts of it so variables do not persist)
o See if it's possible to dynamically allocate scan_control

The last one is the trickiest. Basically, the idea would be to move as much
into scan_control as possible. Then, instead of allocating it on the stack,
allocate a fixed number of them at boot-time (NR_CPUS probably) protected by
a semaphore. Limit the number of direct reclaimers that can be active at a
time to the number of scan_control structures. kswapd could still allocate
its own on the stack or with kmalloc.

If it works out, it would have two main benefits. It limits the number of
processes in direct reclaim - if there are NR_CPUS worth of processes in
direct reclaim, there is too much going on. It would also shrink stack usage,
particularly if some of the stack variables are moved into scan_control.

Maybe someone will beat me to looking at the feasibility of this.
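
To give a feel for it, a completely untested sketch (all names invented,
error paths and kswapd's own scan_control omitted):

	#define NR_SCAN_CONTROL	NR_CPUS

	static struct scan_control sc_pool[NR_SCAN_CONTROL];
	static DECLARE_BITMAP(sc_in_use, NR_SCAN_CONTROL);
	static DEFINE_SPINLOCK(sc_lock);
	static struct semaphore sc_sem =
		__SEMAPHORE_INITIALIZER(sc_sem, NR_SCAN_CONTROL);

	/* Sleeps if NR_SCAN_CONTROL direct reclaimers are already active */
	static struct scan_control *get_scan_control(void)
	{
		int i;

		down(&sc_sem);
		spin_lock(&sc_lock);
		i = find_first_zero_bit(sc_in_use, NR_SCAN_CONTROL);
		__set_bit(i, sc_in_use);
		spin_unlock(&sc_lock);
		memset(&sc_pool[i], 0, sizeof(sc_pool[i]));
		return &sc_pool[i];
	}

	static void put_scan_control(struct scan_control *sc)
	{
		spin_lock(&sc_lock);
		__clear_bit(sc - sc_pool, sc_in_use);
		spin_unlock(&sc_lock);
		up(&sc_sem);
	}

Direct reclaim would then bracket itself with get_scan_control() /
put_scan_control() instead of declaring scan_control on its stack.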

> > > We need at least _700_ bytes of stack free just to call queue_work(),
> > > and that now happens deep in the guts of the driver subsystem below XFS.
> > > This trace shows 1.8k of stack usage on a simple, single sata disk
> > > storage subsystem, so my estimate of 2k of stack for the storage system
> > > below XFS is too small - a worst case of 2.5-3k of stack space is probably
> > > closer to the mark.
> > 
> > Your explanation is very interesting. I have a (probably dumb) question:
> > why has nobody faced stack overflow issues in the past? By your
> > explanation, it seems every user could easily hit a stack overflow.
> 
> It's always a problem, but the focus on minimising stack usage has
> gone away since i386 has mostly disappeared from server rooms.
> 
> XFS has always been the thing that triggered stack usage problems
> first - the first reports of problems on x86_64 with 8k stacks in low
> memory situations have only just come in, and this is the first time
> in a couple of years I've paid close attention to stack usage
> outside XFS. What I'm seeing is not pretty....
> 
> > > This is the sort of thing I'm pointing at when I say that stack
> > > usage outside XFS has grown significantly over the
> > > past couple of years. Given XFS has remained pretty much the same or
> > > even reduced slightly over the same time period, blaming XFS or
> > > saying "callers should use GFP_NOFS" seems like a cop-out to me.
> > > Regardless of the IO pattern performance issues, writeback via
> > > direct reclaim just uses too much stack to be safe these days...
> > 
> > Yeah, my answer is simple: all stack eaters should be fixed.
> > But XFS does not seem innocent either. 3.5K is quite big, even
> > though XFS has been using that much for a very long time.
> 
> XFS used to use much more than that - significant effort has been
> put into reducing the stack footprint over many years. There's not
> much left to trim without rewriting half the filesystem...
> 

I don't think he is levelling a complaint at XFS in particular - just pointing
out that it's heavy too. Still, we should be grateful that XFS is sort of
a "stack canary". If it dies, everyone else could be in trouble soon :)

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

