linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
From: Dave Chinner <david@fromorbit.com>
To: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Minchan Kim <minchan@kernel.org>,
	Linux Kernel Mailing List <linux-kernel@vger.kernel.org>,
	Andrew Morton <akpm@linux-foundation.org>,
	linux-mm <linux-mm@kvack.org>, "H. Peter Anvin" <hpa@zytor.com>,
	Ingo Molnar <mingo@kernel.org>,
	Peter Zijlstra <a.p.zijlstra@chello.nl>,
	Mel Gorman <mgorman@suse.de>, Rik van Riel <riel@redhat.com>,
	Johannes Weiner <hannes@cmpxchg.org>,
	Hugh Dickins <hughd@google.com>,
	Rusty Russell <rusty@rustcorp.com.au>,
	"Michael S. Tsirkin" <mst@redhat.com>,
	Dave Hansen <dave.hansen@intel.com>,
	Steven Rostedt <rostedt@goodmis.org>
Subject: Re: [RFC 2/2] x86_64: expand kernel stack to 16K
Date: Thu, 29 May 2014 11:30:07 +1000	[thread overview]
Message-ID: <20140529013007.GF6677@dastard> (raw)
In-Reply-To: <CA+55aFyRk6_v6COPGVvu6hvt=i2A8-dPcs1X3Ydn1g24AxbPkg@mail.gmail.com>

On Wed, May 28, 2014 at 03:41:11PM -0700, Linus Torvalds wrote:
> On Wed, May 28, 2014 at 3:31 PM, Dave Chinner <david@fromorbit.com> wrote:
> >
> > Indeed, the call chain reported here is not caused by swap issuing
> > IO.
> 
> Well, that's one way of reading that callchain.
> 
> I think it's the *wrong* way of reading it, though. Almost dishonestly
> so.

I guess you haven't met your insult quota for the day, Linus. :/

> Because very clearly, the swapout _is_ what causes the unplugging
> of the IO queue, and does so because it is allocating the BIO for its
> own IO.  The fact that that then fails (because of other IO's in
> flight), and causes *other* IO to be flushed, doesn't really change
> anything fundamental. It's still very much swap that causes that
> "let's start IO".

It is not rocket science to see how plugging outside memory
allocation context can lead to flushing that plug within direct
reclaim without having dispatched any IO at all from direct
reclaim....

You're focussing on the specific symptoms, not the bigger picture.
i.e. you're ignoring all the other "let's start IO" triggers in
direct reclaim. e.g there's two separate plug flush triggers in
shrink_inactive_list(), one of which is:

        /*
         * Stall direct reclaim for IO completions if underlying BDIs or zone
         * is congested. Allow kswapd to continue until it starts encountering
         * unqueued dirty pages or cycling through the LRU too quickly.
         */
        if (!sc->hibernation_mode && !current_is_kswapd())
                wait_iff_congested(zone, BLK_RW_ASYNC, HZ/10);

I'm not saying we shouldn't turn of swap from direct reclaim, just
that all we'd be doing by turning off swap is playing whack-a-stack
- the next report will simply be from one of the other direct
reclaim IO schedule points.

Regardless of whether it is swap or something external queues the
bio on the plug, perhaps we should look at why it's done inline
rather than by kblockd, where it was moved because it was blowing
the stack from schedule():

commit f4af3c3d077a004762aaad052049c809fd8c6f0c
Author: Jens Axboe <jaxboe@fusionio.com>
Date:   Tue Apr 12 14:58:51 2011 +0200

    block: move queue run on unplug to kblockd
    
    There are worries that we are now consuming a lot more stack in
    some cases, since we potentially call into IO dispatch from
    schedule() or io_schedule(). We can reduce this problem by moving
    the running of the queue to kblockd, like the old plugging scheme
    did as well.
    
    This may or may not be a good idea from a performance perspective,
    depending on how many tasks have queue plugs running at the same
    time. For even the slightly contended case, doing just a single
    queue run from kblockd instead of multiple runs directly from the
    unpluggers will be faster.
    
    Signed-off-by: Jens Axboe <jaxboe@fusionio.com>


commit a237c1c5bc5dc5c76a21be922dca4826f3eca8ca
Author: Jens Axboe <jaxboe@fusionio.com>
Date:   Sat Apr 16 13:27:55 2011 +0200

    block: let io_schedule() flush the plug inline
    
    Linus correctly observes that the most important dispatch cases
    are now done from kblockd, this isn't ideal for latency reasons.
    The original reason for switching dispatches out-of-line was to
    avoid too deep a stack, so by _only_ letting the "accidental"
    flush directly in schedule() be guarded by offload to kblockd,
    we should be able to get the best of both worlds.
    
    So add a blk_schedule_flush_plug() that offloads to kblockd,
    and only use that from the schedule() path.
    
    Signed-off-by: Jens Axboe <jaxboe@fusionio.com>

And now we have too deep a stack due to unplugging from io_schedule()...

> IOW, swap-out directly caused that extra 3kB of stack use in what was
> a deep call chain (due to memory allocation). I really don't
> understand why you are arguing anything else on a pure technicality.
>
> I thought you had some other argument for why swap was different, and
> against removing that "page_is_file_cache()" special case in
> shrink_page_list().

I've said in the past that swap is different to filesystem
->writepage implementations because it doesn't require significant
stack to do block allocation and doesn't trigger IO deep in that
allocation stack. Hence it has much lower stack overhead than the
filesystem ->writepage implementations and so is much less likely to
have stack issues.

This stack overflow shows us that just the memory reclaim + IO
layers are sufficient to cause a stack overflow, which is something
I've never seen before. That implies no IO in direct reclaim context
is safe - either from swap or io_schedule() unplugging. It also
lends a lot of weight to my assertion that the majority of the stack
growth over the past couple of years has been ocurring outside the
filesystems....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

  reply	other threads:[~2014-05-29  1:30 UTC|newest]

Thread overview: 97+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2014-05-28  6:53 [PATCH 1/2] ftrace: print stack usage right before Oops Minchan Kim
2014-05-28  6:53 ` [RFC 2/2] x86_64: expand kernel stack to 16K Minchan Kim
2014-05-28  8:37   ` Dave Chinner
2014-05-28  9:13     ` Dave Chinner
2014-05-28 16:06       ` Johannes Weiner
2014-05-28 21:55         ` Dave Chinner
2014-05-29  6:06         ` Minchan Kim
2014-05-28  9:04   ` Michael S. Tsirkin
2014-05-29  1:09     ` Minchan Kim
2014-05-29  2:44       ` Steven Rostedt
2014-05-29  4:11         ` Minchan Kim
2014-05-29  2:47       ` Rusty Russell
2014-05-28  9:27   ` Borislav Petkov
2014-05-29 13:23     ` One Thousand Gnomes
2014-05-28 14:14   ` Steven Rostedt
2014-05-28 14:23     ` H. Peter Anvin
2014-05-28 22:11       ` Dave Chinner
2014-05-28 22:42         ` H. Peter Anvin
2014-05-28 23:17           ` Dave Chinner
2014-05-28 23:21             ` H. Peter Anvin
2014-05-28 15:43   ` Richard Weinberger
2014-05-28 16:08     ` Steven Rostedt
2014-05-28 16:11       ` Richard Weinberger
2014-05-28 16:13       ` Linus Torvalds
2014-05-28 16:09   ` Linus Torvalds
2014-05-28 22:31     ` Dave Chinner
2014-05-28 22:41       ` Linus Torvalds
2014-05-29  1:30         ` Dave Chinner [this message]
2014-05-29  1:58           ` Dave Chinner
2014-05-29  2:51             ` Linus Torvalds
2014-05-29 23:36             ` Minchan Kim
2014-05-30  0:05               ` Linus Torvalds
2014-05-30  0:20                 ` Minchan Kim
2014-05-30  0:31                   ` Linus Torvalds
2014-05-30  0:50                     ` Minchan Kim
2014-05-30  1:24                       ` Linus Torvalds
2014-05-30  1:58                         ` Dave Chinner
2014-05-30  2:13                           ` Linus Torvalds
2014-05-30  6:21                         ` Minchan Kim
2014-05-30  1:30                 ` Linus Torvalds
2014-05-30  0:15               ` Dave Chinner
2014-05-30  2:12                 ` Minchan Kim
2014-05-30  4:37                   ` Linus Torvalds
2014-05-31  1:45                     ` Linus Torvalds
2014-05-30  6:12                   ` Minchan Kim
2014-06-03 13:28                   ` Rasmus Villemoes
2014-06-03 19:04                     ` Linus Torvalds
2014-05-29  2:42           ` Linus Torvalds
2014-05-29  5:14             ` H. Peter Anvin
2014-05-29  6:01             ` Rusty Russell
2014-05-29  7:26               ` virtio ring cleanups, which save stack on older gcc Rusty Russell
2014-05-29  7:26                 ` [PATCH 1/4] Hack: measure stack taken by vring from virtio_blk Rusty Russell
2014-05-29 15:39                   ` Linus Torvalds
2014-05-29  7:26                 ` [PATCH 2/4] virtio_net: pass well-formed sg to virtqueue_add_inbuf() Rusty Russell
2014-05-29 10:07                   ` Michael S. Tsirkin
2014-05-29  7:26                 ` [PATCH 3/4] virtio_ring: assume sgs are always well-formed Rusty Russell
2014-05-29 11:18                   ` Michael S. Tsirkin
2014-05-29  7:26                 ` [PATCH 4/4] virtio_ring: unify direct/indirect code paths Rusty Russell
2014-05-29  7:52                   ` Peter Zijlstra
2014-05-29 11:05                     ` Rusty Russell
2014-05-29 11:33                       ` Michael S. Tsirkin
2014-05-29 11:29                   ` Michael S. Tsirkin
2014-05-30  2:37                     ` Rusty Russell
2014-05-29  7:41                 ` virtio ring cleanups, which save stack on older gcc Minchan Kim
2014-05-29 10:39                   ` Dave Chinner
2014-05-29 11:08                   ` Rusty Russell
2014-05-29 23:45                     ` Minchan Kim
2014-05-30  1:06                       ` Minchan Kim
2014-05-30  6:56                       ` Rusty Russell
2014-05-29  7:26             ` [RFC 2/2] x86_64: expand kernel stack to 16K Dave Chinner
2014-05-29 15:24               ` Linus Torvalds
2014-05-29 23:40                 ` Minchan Kim
2014-05-29 23:53                 ` Dave Chinner
2014-05-30  0:06                   ` Dave Jones
2014-05-30  0:21                     ` Dave Chinner
2014-05-30  0:29                       ` Dave Jones
2014-05-30  0:32                       ` Minchan Kim
2014-05-30  1:34                         ` Dave Chinner
2014-05-30 15:25                           ` H. Peter Anvin
2014-05-30 15:41                             ` Linus Torvalds
2014-05-30 15:52                               ` H. Peter Anvin
2014-05-30 16:06                                 ` Linus Torvalds
2014-05-30 17:24                                   ` Dave Hansen
2014-05-30 18:12                                     ` H. Peter Anvin
2014-05-30  9:48                 ` Richard Weinberger
2014-05-30 15:36                   ` Linus Torvalds
2014-05-31  2:06             ` Jens Axboe
2014-06-02 22:59               ` Dave Chinner
2014-06-03 13:02               ` Konstantin Khlebnikov
2014-05-29  3:46     ` Minchan Kim
2014-05-29  4:13       ` Linus Torvalds
2014-05-29  5:10         ` Minchan Kim
2014-05-30 21:23     ` Andi Kleen
2014-05-28 16:18 ` [PATCH 1/2] ftrace: print stack usage right before Oops Steven Rostedt
2014-05-29  3:52   ` Minchan Kim
2014-05-29  3:01 ` Steven Rostedt
2014-05-29  3:49   ` Minchan Kim

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20140529013007.GF6677@dastard \
    --to=david@fromorbit.com \
    --cc=a.p.zijlstra@chello.nl \
    --cc=akpm@linux-foundation.org \
    --cc=dave.hansen@intel.com \
    --cc=hannes@cmpxchg.org \
    --cc=hpa@zytor.com \
    --cc=hughd@google.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=mgorman@suse.de \
    --cc=minchan@kernel.org \
    --cc=mingo@kernel.org \
    --cc=mst@redhat.com \
    --cc=riel@redhat.com \
    --cc=rostedt@goodmis.org \
    --cc=rusty@rustcorp.com.au \
    --cc=torvalds@linux-foundation.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).