Re: [RFC 2/2] x86_64: expand kernel stack to 16K

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

From: Dave Chinner <david@fromorbit.com>
To: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Jens Axboe <axboe@kernel.dk>, Minchan Kim <minchan@kernel.org>,
	Linux Kernel Mailing List <linux-kernel@vger.kernel.org>,
	Andrew Morton <akpm@linux-foundation.org>,
	linux-mm <linux-mm@kvack.org>, "H. Peter Anvin" <hpa@zytor.com>,
	Ingo Molnar <mingo@kernel.org>,
	Peter Zijlstra <a.p.zijlstra@chello.nl>,
	Mel Gorman <mgorman@suse.de>, Rik van Riel <riel@redhat.com>,
	Johannes Weiner <hannes@cmpxchg.org>,
	Hugh Dickins <hughd@google.com>,
	Rusty Russell <rusty@rustcorp.com.au>,
	"Michael S. Tsirkin" <mst@redhat.com>,
	Dave Hansen <dave.hansen@intel.com>,
	Steven Rostedt <rostedt@goodmis.org>
Subject: Re: [RFC 2/2] x86_64: expand kernel stack to 16K
Date: Thu, 29 May 2014 17:26:33 +1000	[thread overview]
Message-ID: <20140529072633.GH6677@dastard> (raw)
In-Reply-To: <CA+55aFzdq2V-Q3WUV7hQJG8jBSAvBqdYLVTNtbD4ObVZ5yDRmw@mail.gmail.com>

On Wed, May 28, 2014 at 07:42:40PM -0700, Linus Torvalds wrote:
> On Wed, May 28, 2014 at 6:30 PM, Dave Chinner <david@fromorbit.com> wrote:
> >
> > You're focussing on the specific symptoms, not the bigger picture.
> > i.e. you're ignoring all the other "let's start IO" triggers in
> > direct reclaim. e.g there's two separate plug flush triggers in
> > shrink_inactive_list(), one of which is:
> 
> Fair enough. I certainly agree that we should look at the other cases here too.
> 
> In fact, I also find it distasteful just how much stack space some of
> those VM routines are just using up on their own, never mind any
> actual IO paths at all. The fact that __alloc_pages_nodemask() uses
> 350 bytes of stackspace on its own is actually quite disturbing. The
> fact that kernel_map_pages() apparently has almost 400 bytes of stack
> is just crazy. Obviously that case only happens with
> CONFIG_DEBUG_PAGEALLOC, but still..

What concerns me about both __alloc_pages_nodemask() and
kernel_map_pages is that when I look at the code I see functions
that have no obvious stack usage problem. However, the compiler is
producing functions with huge stack footprints and it's not at all
obvious when I read the code. So in this case I'm more concerned
that we have a major disconnect between the source code structure
and the code that the compiler produces...

> > I'm not saying we shouldn't turn of swap from direct reclaim, just
> > that all we'd be doing by turning off swap is playing whack-a-stack
> > - the next report will simply be from one of the other direct
> > reclaim IO schedule points.
> 
> Playing whack-a-mole with this for a while might not be a bad idea,
> though. It's not like we will ever really improve unless we start
> whacking the worst cases. And it should still be a fairly limited
> number.

I guess I've been playing whack-a-stack for so long now and some of
the overruns have been so large I just don't see it as a viable
medium to long term solution.

> After all, historically, some of the cases we've played whack-a-mole
> on have been in XFS, so I'd think you'd be thrilled to see some other
> code get blamed this time around ;)

Blame shifting doesn't thrill me - I'm still at the pointy end of
stack overrun reports, and we've still got to do the hard work of
solving the problem. However, I am happy to see acknowlegement of
the problem so we can work out how to solve the issues...

> > Regardless of whether it is swap or something external queues the
> > bio on the plug, perhaps we should look at why it's done inline
> > rather than by kblockd, where it was moved because it was blowing
> > the stack from schedule():
> 
> So it sounds like we need to do this for io_schedule() too.
> 
> In fact, we've generally found it to be a mistake every time we
> "automatically" unblock some IO queue. And I'm not saying that because
> of stack space, but because we've _often_ had the situation that eager
> unblocking results in IO that could have been done as bigger requests.
> 
> Of course, we do need to worry about latency for starting IO, but any
> of these kinds of memory-pressure writeback patterns are pretty much
> by definition not about the latency of one _particular_ IO, so they
> don't tent to be latency-sensitive. Quite the reverse: we start
> writeback and then end up waiting on something else altogether
> (possibly a writeback that got started much earlier).

*nod*

> swapout certainly is _not_ IO-latency-sensitive, especially these
> days. And while we _do_ want to throttle in direct reclaim, if it's
> about throttling I'd certainly think that it sounds quite reasonable
> to push any unplugging to kblockd than try to do that synchronously.
> If we are throttling in direct-reclaim, we need to slow things _down_
> for the writer, not worry about latency.

Right, we are adding latency to the caller by having to swap so
a small amount of additional IO dispatch latency for IO we aren't
going to wait directly on doesn't really matter at all.

> >                That implies no IO in direct reclaim context
> > is safe - either from swap or io_schedule() unplugging. It also
> > lends a lot of weight to my assertion that the majority of the stack
> > growth over the past couple of years has been ocurring outside the
> > filesystems....
> 
> I think Minchan's stack trace definitely backs you up on that. The
> filesystem part - despite that one ext4_writepages() function - is a
> very small part of the whole. It sits at about ~1kB of stack. Just the
> VM "top-level" writeback code is about as much, and then the VM page
> alloc/shrinking code when the filesystem needs memory is *twice* that,
> and then the block layer and the virtio code are another 1kB each.

*nod*

As i said early, look at this in the context of the bigger picture.
We can also have more stack using layers in the IO stack and/or more
stack-expensive layers. e.g.  it could be block -> dm -> md -> SCSI
-> mempool_alloc in that stack rather than block -> virtio ->
kmalloc. Hence 1k of virtio stack could be 1.5k of SCSI stack,
md/dm could contribute a few hundred bytes each (or more depending
on how many layers of dm/md there are), and so on.

When you start adding all that up, it doesn't paint a pretty
picture. That's one of the main reasons why I don't think the
whack-a-stack approach will solve the problem in the medium to long
term...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

next prev parent reply	other threads:[~2014-05-29  7:26 UTC|newest]

Thread overview: 97+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2014-05-28  6:53 [PATCH 1/2] ftrace: print stack usage right before Oops Minchan Kim
2014-05-28  6:53 ` [RFC 2/2] x86_64: expand kernel stack to 16K Minchan Kim
2014-05-28  8:37   ` Dave Chinner
2014-05-28  9:13     ` Dave Chinner
2014-05-28 16:06       ` Johannes Weiner
2014-05-28 21:55         ` Dave Chinner
2014-05-29  6:06         ` Minchan Kim
2014-05-28  9:04   ` Michael S. Tsirkin
2014-05-29  1:09     ` Minchan Kim
2014-05-29  2:44       ` Steven Rostedt
2014-05-29  4:11         ` Minchan Kim
2014-05-29  2:47       ` Rusty Russell
2014-05-28  9:27   ` Borislav Petkov
2014-05-29 13:23     ` One Thousand Gnomes
2014-05-28 14:14   ` Steven Rostedt
2014-05-28 14:23     ` H. Peter Anvin
2014-05-28 22:11       ` Dave Chinner
2014-05-28 22:42         ` H. Peter Anvin
2014-05-28 23:17           ` Dave Chinner
2014-05-28 23:21             ` H. Peter Anvin
2014-05-28 15:43   ` Richard Weinberger
2014-05-28 16:08     ` Steven Rostedt
2014-05-28 16:11       ` Richard Weinberger
2014-05-28 16:13       ` Linus Torvalds
2014-05-28 16:09   ` Linus Torvalds
2014-05-28 22:31     ` Dave Chinner
2014-05-28 22:41       ` Linus Torvalds
2014-05-29  1:30         ` Dave Chinner
2014-05-29  1:58           ` Dave Chinner
2014-05-29  2:51             ` Linus Torvalds
2014-05-29 23:36             ` Minchan Kim
2014-05-30  0:05               ` Linus Torvalds
2014-05-30  0:20                 ` Minchan Kim
2014-05-30  0:31                   ` Linus Torvalds
2014-05-30  0:50                     ` Minchan Kim
2014-05-30  1:24                       ` Linus Torvalds
2014-05-30  1:58                         ` Dave Chinner
2014-05-30  2:13                           ` Linus Torvalds
2014-05-30  6:21                         ` Minchan Kim
2014-05-30  1:30                 ` Linus Torvalds
2014-05-30  0:15               ` Dave Chinner
2014-05-30  2:12                 ` Minchan Kim
2014-05-30  4:37                   ` Linus Torvalds
2014-05-31  1:45                     ` Linus Torvalds
2014-05-30  6:12                   ` Minchan Kim
2014-06-03 13:28                   ` Rasmus Villemoes
2014-06-03 19:04                     ` Linus Torvalds
2014-05-29  2:42           ` Linus Torvalds
2014-05-29  5:14             ` H. Peter Anvin
2014-05-29  6:01             ` Rusty Russell
2014-05-29  7:26               ` virtio ring cleanups, which save stack on older gcc Rusty Russell
2014-05-29  7:26                 ` [PATCH 1/4] Hack: measure stack taken by vring from virtio_blk Rusty Russell
2014-05-29 15:39                   ` Linus Torvalds
2014-05-29  7:26                 ` [PATCH 2/4] virtio_net: pass well-formed sg to virtqueue_add_inbuf() Rusty Russell
2014-05-29 10:07                   ` Michael S. Tsirkin
2014-05-29  7:26                 ` [PATCH 3/4] virtio_ring: assume sgs are always well-formed Rusty Russell
2014-05-29 11:18                   ` Michael S. Tsirkin
2014-05-29  7:26                 ` [PATCH 4/4] virtio_ring: unify direct/indirect code paths Rusty Russell
2014-05-29  7:52                   ` Peter Zijlstra
2014-05-29 11:05                     ` Rusty Russell
2014-05-29 11:33                       ` Michael S. Tsirkin
2014-05-29 11:29                   ` Michael S. Tsirkin
2014-05-30  2:37                     ` Rusty Russell
2014-05-29  7:41                 ` virtio ring cleanups, which save stack on older gcc Minchan Kim
2014-05-29 10:39                   ` Dave Chinner
2014-05-29 11:08                   ` Rusty Russell
2014-05-29 23:45                     ` Minchan Kim
2014-05-30  1:06                       ` Minchan Kim
2014-05-30  6:56                       ` Rusty Russell
2014-05-29  7:26             ` Dave Chinner [this message]
2014-05-29 15:24               ` [RFC 2/2] x86_64: expand kernel stack to 16K Linus Torvalds
2014-05-29 23:40                 ` Minchan Kim
2014-05-29 23:53                 ` Dave Chinner
2014-05-30  0:06                   ` Dave Jones
2014-05-30  0:21                     ` Dave Chinner
2014-05-30  0:29                       ` Dave Jones
2014-05-30  0:32                       ` Minchan Kim
2014-05-30  1:34                         ` Dave Chinner
2014-05-30 15:25                           ` H. Peter Anvin
2014-05-30 15:41                             ` Linus Torvalds
2014-05-30 15:52                               ` H. Peter Anvin
2014-05-30 16:06                                 ` Linus Torvalds
2014-05-30 17:24                                   ` Dave Hansen
2014-05-30 18:12                                     ` H. Peter Anvin
2014-05-30  9:48                 ` Richard Weinberger
2014-05-30 15:36                   ` Linus Torvalds
2014-05-31  2:06             ` Jens Axboe
2014-06-02 22:59               ` Dave Chinner
2014-06-03 13:02               ` Konstantin Khlebnikov
2014-05-29  3:46     ` Minchan Kim
2014-05-29  4:13       ` Linus Torvalds
2014-05-29  5:10         ` Minchan Kim
2014-05-30 21:23     ` Andi Kleen
2014-05-28 16:18 ` [PATCH 1/2] ftrace: print stack usage right before Oops Steven Rostedt
2014-05-29  3:52   ` Minchan Kim
2014-05-29  3:01 ` Steven Rostedt
2014-05-29  3:49   ` Minchan Kim

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20140529072633.GH6677@dastard \
    --to=david@fromorbit.com \
    --cc=a.p.zijlstra@chello.nl \
    --cc=akpm@linux-foundation.org \
    --cc=axboe@kernel.dk \
    --cc=dave.hansen@intel.com \
    --cc=hannes@cmpxchg.org \
    --cc=hpa@zytor.com \
    --cc=hughd@google.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=mgorman@suse.de \
    --cc=minchan@kernel.org \
    --cc=mingo@kernel.org \
    --cc=mst@redhat.com \
    --cc=riel@redhat.com \
    --cc=rostedt@goodmis.org \
    --cc=rusty@rustcorp.com.au \
    --cc=torvalds@linux-foundation.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).