* Re: [RFC 2/2] x86_64: expand kernel stack to 16K
[not found] ` <1401260039-18189-2-git-send-email-minchan@kernel.org>
@ 2014-05-28 8:37 ` Dave Chinner
2014-05-28 9:13 ` Dave Chinner
0 siblings, 1 reply; 5+ messages in thread
From: Dave Chinner @ 2014-05-28 8:37 UTC (permalink / raw)
To: Minchan Kim
Cc: Dave Hansen, Rik van Riel, rusty, Peter Zijlstra, mst,
Johannes Weiner, Hugh Dickins, linux-kernel, Steven Rostedt, xfs,
linux-mm, Mel Gorman, H. Peter Anvin, Andrew Morton, Ingo Molnar
[ cc XFS list ]
On Wed, May 28, 2014 at 03:53:59PM +0900, Minchan Kim wrote:
> While testing in-house patches under heavy memory pressure on qemu-kvm,
> the 3.14 kernel crashed randomly. The cause was a kernel stack overflow.
>
> When I investigated the problem, the call stack was a little deeper
> because reclaim functions were involved, but it was not the direct
> reclaim path.
>
> I tried to trim the stack usage of some functions related to
> alloc/reclaim and saved a few hundred bytes, but the overflow didn't
> disappear; I just hit it again via another, deeper call stack on the
> reclaim/allocator path.
>
> Of course, we could sweep every site we have found to reduce stack
> usage, but I'm not sure for how long that would save the world (surely,
> lots of developers will start adding nice features that use the stack
> again), and if we consider more complex features in the I/O layer
> and/or the reclaim path, it might be better to increase the stack size.
> (Meanwhile, stack usage on 64-bit machines has doubled compared to
> 32-bit while the stack size has stayed at 8K. Hmm, that doesn't seem
> fair to me, and arm64 has already expanded to 16K.)
>
> So, my stupid idea is just this: let's expand the stack size and keep
> an eye on the stack consumption of each kernel function via the
> stacktrace feature of ftrace. For example, we could set a bar such
> that each function shouldn't exceed 200 bytes, and emit a warning at
> runtime when some function consumes more. Of course, that could
> produce false positives, but at least it would give us a chance to
> think it over.
>
> I guess this topic has been discussed several times, so there might be
> a strong reason not to increase the kernel stack size on x86_64. Not
> knowing it, I'm Ccing the x86_64 maintainers, other MM folks, and the
> virtio maintainers.
>
> Depth Size Location (51 entries)
> ----- ---- --------
> 0) 7696 16 lookup_address+0x28/0x30
> 1) 7680 16 _lookup_address_cpa.isra.3+0x3b/0x40
> 2) 7664 24 __change_page_attr_set_clr+0xe0/0xb50
> 3) 7640 392 kernel_map_pages+0x6c/0x120
> 4) 7248 256 get_page_from_freelist+0x489/0x920
> 5) 6992 352 __alloc_pages_nodemask+0x5e1/0xb20
> 6) 6640 8 alloc_pages_current+0x10f/0x1f0
> 7) 6632 168 new_slab+0x2c5/0x370
> 8) 6464 8 __slab_alloc+0x3a9/0x501
> 9) 6456 80 __kmalloc+0x1cb/0x200
> 10) 6376 376 vring_add_indirect+0x36/0x200
> 11) 6000 144 virtqueue_add_sgs+0x2e2/0x320
> 12) 5856 288 __virtblk_add_req+0xda/0x1b0
> 13) 5568 96 virtio_queue_rq+0xd3/0x1d0
> 14) 5472 128 __blk_mq_run_hw_queue+0x1ef/0x440
> 15) 5344 16 blk_mq_run_hw_queue+0x35/0x40
> 16) 5328 96 blk_mq_insert_requests+0xdb/0x160
> 17) 5232 112 blk_mq_flush_plug_list+0x12b/0x140
> 18) 5120 112 blk_flush_plug_list+0xc7/0x220
> 19) 5008 64 io_schedule_timeout+0x88/0x100
> 20) 4944 128 mempool_alloc+0x145/0x170
> 21) 4816 96 bio_alloc_bioset+0x10b/0x1d0
> 22) 4720 48 get_swap_bio+0x30/0x90
> 23) 4672 160 __swap_writepage+0x150/0x230
> 24) 4512 32 swap_writepage+0x42/0x90
> 25) 4480 320 shrink_page_list+0x676/0xa80
> 26) 4160 208 shrink_inactive_list+0x262/0x4e0
> 27) 3952 304 shrink_lruvec+0x3e1/0x6a0
> 28) 3648 80 shrink_zone+0x3f/0x110
> 29) 3568 128 do_try_to_free_pages+0x156/0x4c0
> 30) 3440 208 try_to_free_pages+0xf7/0x1e0
> 31) 3232 352 __alloc_pages_nodemask+0x783/0xb20
> 32) 2880 8 alloc_pages_current+0x10f/0x1f0
> 33) 2872 200 __page_cache_alloc+0x13f/0x160
> 34) 2672 80 find_or_create_page+0x4c/0xb0
> 35) 2592 80 ext4_mb_load_buddy+0x1e9/0x370
> 36) 2512 176 ext4_mb_regular_allocator+0x1b7/0x460
> 37) 2336 128 ext4_mb_new_blocks+0x458/0x5f0
> 38) 2208 256 ext4_ext_map_blocks+0x70b/0x1010
> 39) 1952 160 ext4_map_blocks+0x325/0x530
> 40) 1792 384 ext4_writepages+0x6d1/0xce0
> 41) 1408 16 do_writepages+0x23/0x40
> 42) 1392 96 __writeback_single_inode+0x45/0x2e0
> 43) 1296 176 writeback_sb_inodes+0x2ad/0x500
> 44) 1120 80 __writeback_inodes_wb+0x9e/0xd0
> 45) 1040 160 wb_writeback+0x29b/0x350
> 46) 880 208 bdi_writeback_workfn+0x11c/0x480
> 47) 672 144 process_one_work+0x1d2/0x570
> 48) 528 112 worker_thread+0x116/0x370
> 49) 416 240 kthread+0xf3/0x110
> 50) 176 176 ret_from_fork+0x7c/0xb0
>
> Signed-off-by: Minchan Kim <minchan@kernel.org>
> ---
> arch/x86/include/asm/page_64_types.h | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/arch/x86/include/asm/page_64_types.h b/arch/x86/include/asm/page_64_types.h
> index 8de6d9cf3b95..678205195ae1 100644
> --- a/arch/x86/include/asm/page_64_types.h
> +++ b/arch/x86/include/asm/page_64_types.h
> @@ -1,7 +1,7 @@
> #ifndef _ASM_X86_PAGE_64_DEFS_H
> #define _ASM_X86_PAGE_64_DEFS_H
>
> -#define THREAD_SIZE_ORDER 1
> +#define THREAD_SIZE_ORDER 2
> #define THREAD_SIZE (PAGE_SIZE << THREAD_SIZE_ORDER)
> #define CURRENT_MASK (~(THREAD_SIZE - 1))
--
Dave Chinner
david@fromorbit.com
_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs
* Re: [RFC 2/2] x86_64: expand kernel stack to 16K
2014-05-28 8:37 ` [RFC 2/2] x86_64: expand kernel stack to 16K Dave Chinner
@ 2014-05-28 9:13 ` Dave Chinner
2014-05-28 16:06 ` Johannes Weiner
0 siblings, 1 reply; 5+ messages in thread
From: Dave Chinner @ 2014-05-28 9:13 UTC (permalink / raw)
To: Minchan Kim
Cc: Dave Hansen, Rik van Riel, rusty, Peter Zijlstra, mst,
Johannes Weiner, Hugh Dickins, linux-kernel, Steven Rostedt, xfs,
linux-mm, Mel Gorman, H. Peter Anvin, Andrew Morton, Ingo Molnar
On Wed, May 28, 2014 at 06:37:38PM +1000, Dave Chinner wrote:
> [ cc XFS list ]
[and now there is a complete copy on the XFS list, I'll add my 2c]
> On Wed, May 28, 2014 at 03:53:59PM +0900, Minchan Kim wrote:
> > While testing in-house patches under heavy memory pressure on qemu-kvm,
> > the 3.14 kernel crashed randomly. The cause was a kernel stack overflow.
> >
> > When I investigated the problem, the call stack was a little deeper
> > because reclaim functions were involved, but it was not the direct
> > reclaim path.
> >
> > I tried to trim the stack usage of some functions related to
> > alloc/reclaim and saved a few hundred bytes, but the overflow didn't
> > disappear; I just hit it again via another, deeper call stack on the
> > reclaim/allocator path.
That's a no-win situation. The stack overruns through ->writepage
we've been seeing with XFS over the past *4 years* are much larger
than a few bytes. The worst-case stack usage on a virtio block
device was about 10.5KB.
And, like this one, it came from the flusher thread as well. The
difference was that the allocation that triggered the reclaim path
you've reported occurred when 5k of the stack had already been
used...
> > Of course, we could sweep every site we have found to reduce stack
> > usage, but I'm not sure for how long that would save the world (surely,
> > lots of developers will start adding nice features that use the stack
> > again), and if we consider more complex features in the I/O layer
> > and/or the reclaim path, it might be better to increase the stack size.
> > (Meanwhile, stack usage on 64-bit machines has doubled compared to
> > 32-bit while the stack size has stayed at 8K. Hmm, that doesn't seem
> > fair to me, and arm64 has already expanded to 16K.)
Yup, that's all been pointed out previously. 8k stacks were never
large enough to fit the Linux IO architecture on x86-64, but nobody
outside filesystem and IO developers has been willing to accept that
argument as valid, despite regular stack overruns and filesystems
having to add workaround after workaround to prevent them.
That's why stuff like this appears in various filesystems'
->writepage:
/*
* Refuse to write the page out if we are called from reclaim context.
*
* This avoids stack overflows when called from deeply used stacks in
* random callers for direct reclaim or memcg reclaim. We explicitly
* allow reclaim from kswapd as the stack usage there is relatively low.
*
* This should never happen except in the case of a VM regression so
* warn about it.
*/
if (WARN_ON_ONCE((current->flags & (PF_MEMALLOC|PF_KSWAPD)) ==
PF_MEMALLOC))
goto redirty;
That still doesn't guarantee us enough stack space to do writeback,
though, because memory allocation can occur when reading in the
metadata needed to do delayed allocation, and so we could trigger
GFP_NOFS memory allocation from the flusher thread with 4-5k of stack
already consumed, which would still overrun the stack.
So, a couple of years ago we started deferring half the writeback
stack usage to a worker thread (commit c999a22 "xfs: introduce an
allocation workqueue"), under the assumption that the worst stack
usage when we call memory allocation is around 3-3.5k of stack used.
We thought that would be safe, but the stack trace you've posted
shows that alloc_page(GFP_NOFS) can consume upwards of 5k of stack,
which means we're still screwed despite all the workarounds we have
in place.
We've also had recent reports of allocation from direct IO blowing
the stack, as well as block allocation adding an entry to a
directory. We're basically at the point where we have to push every
XFS operation that requires block allocation off to another thread
to get enough stack space for normal operation.....
> > So, my stupid idea is just this: let's expand the stack size and keep an eye
Not stupid: it's what I've been advocating we need to do for the
past 3-4 years. XFS has always been the stack usage canary, and
this issue is basically a repeat of the 4k-stack-on-i386 kernel
debacle.
> > on the stack consumption of each kernel function via the stacktrace
> > feature of ftrace. For example, we could set a bar such that each
> > function shouldn't exceed 200 bytes, and emit a warning at runtime
> > when some function consumes more. Of course, that could produce
> > false positives, but at least it would give us a chance to think it
> > over.
I don't think that's a good idea. There are reasons for putting a
150-200 byte structure on the stack (e.g. it's used in a context where
allocation cannot be guaranteed to succeed because forward progress
cannot be guaranteed). Hence, having these users warn all the time
will quickly get very annoying, and that functionality will be
switched off or removed....
> > I guess this topic has been discussed several times, so there might
> > be a strong reason not to increase the kernel stack size on x86_64.
> > Not knowing it, I'm Ccing the x86_64 maintainers, other MM folks,
> > and the virtio maintainers.
> >
> > Depth Size Location (51 entries)
> >
> > 0) 7696 16 lookup_address+0x28/0x30
> > 1) 7680 16 _lookup_address_cpa.isra.3+0x3b/0x40
> > 2) 7664 24 __change_page_attr_set_clr+0xe0/0xb50
> > 3) 7640 392 kernel_map_pages+0x6c/0x120
> > 4) 7248 256 get_page_from_freelist+0x489/0x920
> > 5) 6992 352 __alloc_pages_nodemask+0x5e1/0xb20
> > 6) 6640 8 alloc_pages_current+0x10f/0x1f0
> > 7) 6632 168 new_slab+0x2c5/0x370
> > 8) 6464 8 __slab_alloc+0x3a9/0x501
> > 9) 6456 80 __kmalloc+0x1cb/0x200
> > 10) 6376 376 vring_add_indirect+0x36/0x200
> > 11) 6000 144 virtqueue_add_sgs+0x2e2/0x320
> > 12) 5856 288 __virtblk_add_req+0xda/0x1b0
> > 13) 5568 96 virtio_queue_rq+0xd3/0x1d0
> > 14) 5472 128 __blk_mq_run_hw_queue+0x1ef/0x440
> > 15) 5344 16 blk_mq_run_hw_queue+0x35/0x40
> > 16) 5328 96 blk_mq_insert_requests+0xdb/0x160
> > 17) 5232 112 blk_mq_flush_plug_list+0x12b/0x140
> > 18) 5120 112 blk_flush_plug_list+0xc7/0x220
> > 19) 5008 64 io_schedule_timeout+0x88/0x100
> > 20) 4944 128 mempool_alloc+0x145/0x170
> > 21) 4816 96 bio_alloc_bioset+0x10b/0x1d0
> > 22) 4720 48 get_swap_bio+0x30/0x90
> > 23) 4672 160 __swap_writepage+0x150/0x230
> > 24) 4512 32 swap_writepage+0x42/0x90
> > 25) 4480 320 shrink_page_list+0x676/0xa80
> > 26) 4160 208 shrink_inactive_list+0x262/0x4e0
> > 27) 3952 304 shrink_lruvec+0x3e1/0x6a0
> > 28) 3648 80 shrink_zone+0x3f/0x110
> > 29) 3568 128 do_try_to_free_pages+0x156/0x4c0
> > 30) 3440 208 try_to_free_pages+0xf7/0x1e0
> > 31) 3232 352 __alloc_pages_nodemask+0x783/0xb20
> > 32) 2880 8 alloc_pages_current+0x10f/0x1f0
> > 33) 2872 200 __page_cache_alloc+0x13f/0x160
> > 34) 2672 80 find_or_create_page+0x4c/0xb0
> > 35) 2592 80 ext4_mb_load_buddy+0x1e9/0x370
> > 36) 2512 176 ext4_mb_regular_allocator+0x1b7/0x460
> > 37) 2336 128 ext4_mb_new_blocks+0x458/0x5f0
> > 38) 2208 256 ext4_ext_map_blocks+0x70b/0x1010
> > 39) 1952 160 ext4_map_blocks+0x325/0x530
> > 40) 1792 384 ext4_writepages+0x6d1/0xce0
> > 41) 1408 16 do_writepages+0x23/0x40
> > 42) 1392 96 __writeback_single_inode+0x45/0x2e0
> > 43) 1296 176 writeback_sb_inodes+0x2ad/0x500
> > 44) 1120 80 __writeback_inodes_wb+0x9e/0xd0
> > 45) 1040 160 wb_writeback+0x29b/0x350
> > 46) 880 208 bdi_writeback_workfn+0x11c/0x480
> > 47) 672 144 process_one_work+0x1d2/0x570
> > 48) 528 112 worker_thread+0x116/0x370
> > 49) 416 240 kthread+0xf3/0x110
> > 50) 176 176 ret_from_fork+0x7c/0xb0
Impressive: 3 nested allocations - GFP_NOFS, GFP_NOIO and then
GFP_ATOMIC before the stack goes boom. XFS usually only needs 2...
However, add another 1000 bytes of stack for each IO by going through
the FC/SCSI layers and hitting command allocation at the bottom of
the IO stack rather than bio allocation at the top, plus maybe the
stack usage of 2-3 layers of MD and LVM as well, and you start to see
how that stack pushes >10k of usage rather than just overflowing
8k....
> > Signed-off-by: Minchan Kim <minchan@kernel.org>
> > ---
> > arch/x86/include/asm/page_64_types.h | 2 +-
> > 1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/arch/x86/include/asm/page_64_types.h b/arch/x86/include/asm/page_64_types.h
> > index 8de6d9cf3b95..678205195ae1 100644
> > --- a/arch/x86/include/asm/page_64_types.h
> > +++ b/arch/x86/include/asm/page_64_types.h
> > @@ -1,7 +1,7 @@
> > #ifndef _ASM_X86_PAGE_64_DEFS_H
> > #define _ASM_X86_PAGE_64_DEFS_H
> >
> > -#define THREAD_SIZE_ORDER 1
> > +#define THREAD_SIZE_ORDER 2
> > #define THREAD_SIZE (PAGE_SIZE << THREAD_SIZE_ORDER)
> > #define CURRENT_MASK (~(THREAD_SIZE - 1))
Got my vote. Can we get this into 3.16, please?
Acked-by: Dave Chinner <david@fromorbit.com>
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
* Re: [RFC 2/2] x86_64: expand kernel stack to 16K
2014-05-28 9:13 ` Dave Chinner
@ 2014-05-28 16:06 ` Johannes Weiner
2014-05-28 21:55 ` Dave Chinner
2014-05-29 6:06 ` Minchan Kim
0 siblings, 2 replies; 5+ messages in thread
From: Johannes Weiner @ 2014-05-28 16:06 UTC (permalink / raw)
To: Dave Chinner
Cc: Dave Hansen, Rik van Riel, Peter Zijlstra, linux-mm, rusty,
Hugh Dickins, linux-kernel, Steven Rostedt, xfs, Minchan Kim, mst,
Mel Gorman, H. Peter Anvin, Andrew Morton, Ingo Molnar
On Wed, May 28, 2014 at 07:13:45PM +1000, Dave Chinner wrote:
> On Wed, May 28, 2014 at 06:37:38PM +1000, Dave Chinner wrote:
> > [ cc XFS list ]
>
> [and now there is a complete copy on the XFs list, I'll add my 2c]
>
> > On Wed, May 28, 2014 at 03:53:59PM +0900, Minchan Kim wrote:
> > > While testing in-house patches under heavy memory pressure on qemu-kvm,
> > > the 3.14 kernel crashed randomly. The cause was a kernel stack overflow.
> > >
> > > When I investigated the problem, the call stack was a little deeper
> > > because reclaim functions were involved, but it was not the direct
> > > reclaim path.
> > >
> > > I tried to trim the stack usage of some functions related to
> > > alloc/reclaim and saved a few hundred bytes, but the overflow didn't
> > > disappear; I just hit it again via another, deeper call stack on the
> > > reclaim/allocator path.
>
> That's a no-win situation. The stack overruns through ->writepage
> we've been seeing with XFS over the past *4 years* are much larger
> than a few bytes. The worst-case stack usage on a virtio block
> device was about 10.5KB.
>
> And, like this one, it came from the flusher thread as well. The
> difference was that the allocation that triggered the reclaim path
> you've reported occurred when 5k of the stack had already been
> used...
>
> > > Of course, we could sweep every site we have found to reduce stack
> > > usage, but I'm not sure for how long that would save the world (surely,
> > > lots of developers will start adding nice features that use the stack
> > > again), and if we consider more complex features in the I/O layer
> > > and/or the reclaim path, it might be better to increase the stack size.
> > > (Meanwhile, stack usage on 64-bit machines has doubled compared to
> > > 32-bit while the stack size has stayed at 8K. Hmm, that doesn't seem
> > > fair to me, and arm64 has already expanded to 16K.)
>
> Yup, that's all been pointed out previously. 8k stacks were never
> large enough to fit the Linux IO architecture on x86-64, but nobody
> outside filesystem and IO developers has been willing to accept that
> argument as valid, despite regular stack overruns and filesystems
> having to add workaround after workaround to prevent them.
>
> That's why stuff like this appears in various filesystems'
> ->writepage:
>
> /*
> * Refuse to write the page out if we are called from reclaim context.
> *
> * This avoids stack overflows when called from deeply used stacks in
> * random callers for direct reclaim or memcg reclaim. We explicitly
> * allow reclaim from kswapd as the stack usage there is relatively low.
> *
> * This should never happen except in the case of a VM regression so
> * warn about it.
> */
> if (WARN_ON_ONCE((current->flags & (PF_MEMALLOC|PF_KSWAPD)) ==
> PF_MEMALLOC))
> goto redirty;
>
> That still doesn't guarantee us enough stack space to do writeback,
> though, because memory allocation can occur when reading in the
> metadata needed to do delayed allocation, and so we could trigger
> GFP_NOFS memory allocation from the flusher thread with 4-5k of stack
> already consumed, which would still overrun the stack.
>
> So, a couple of years ago we started deferring half the writeback
> stack usage to a worker thread (commit c999a22 "xfs: introduce an
> allocation workqueue"), under the assumption that the worst stack
> usage when we call memory allocation is around 3-3.5k of stack used.
> We thought that would be safe, but the stack trace you've posted
> shows that alloc_page(GFP_NOFS) can consume upwards of 5k of stack,
> which means we're still screwed despite all the workarounds we have
> in place.
The allocation and reclaim stack itself is only 2k per the stacktrace
below. What got us in this particular case is that we engaged a
complicated block layer setup from within the allocation context in
order to swap out a page.
In the past we disabled filesystem ->writepage from within the
allocation context and deferred it to kswapd for stack reasons (see
the WARN_ON_ONCE and the comment in your above quote), but I think we
have to go further and do the same even for swap_writepage():
> > > I guess this topic has been discussed several times, so there might
> > > be a strong reason not to increase the kernel stack size on x86_64.
> > > Not knowing it, I'm Ccing the x86_64 maintainers, other MM folks,
> > > and the virtio maintainers.
> > >
> > > Depth Size Location (51 entries)
> > >
> > > 0) 7696 16 lookup_address+0x28/0x30
> > > 1) 7680 16 _lookup_address_cpa.isra.3+0x3b/0x40
> > > 2) 7664 24 __change_page_attr_set_clr+0xe0/0xb50
> > > 3) 7640 392 kernel_map_pages+0x6c/0x120
> > > 4) 7248 256 get_page_from_freelist+0x489/0x920
> > > 5) 6992 352 __alloc_pages_nodemask+0x5e1/0xb20
> > > 6) 6640 8 alloc_pages_current+0x10f/0x1f0
> > > 7) 6632 168 new_slab+0x2c5/0x370
> > > 8) 6464 8 __slab_alloc+0x3a9/0x501
> > > 9) 6456 80 __kmalloc+0x1cb/0x200
> > > 10) 6376 376 vring_add_indirect+0x36/0x200
> > > 11) 6000 144 virtqueue_add_sgs+0x2e2/0x320
> > > 12) 5856 288 __virtblk_add_req+0xda/0x1b0
> > > 13) 5568 96 virtio_queue_rq+0xd3/0x1d0
> > > 14) 5472 128 __blk_mq_run_hw_queue+0x1ef/0x440
> > > 15) 5344 16 blk_mq_run_hw_queue+0x35/0x40
> > > 16) 5328 96 blk_mq_insert_requests+0xdb/0x160
> > > 17) 5232 112 blk_mq_flush_plug_list+0x12b/0x140
> > > 18) 5120 112 blk_flush_plug_list+0xc7/0x220
> > > 19) 5008 64 io_schedule_timeout+0x88/0x100
> > > 20) 4944 128 mempool_alloc+0x145/0x170
> > > 21) 4816 96 bio_alloc_bioset+0x10b/0x1d0
> > > 22) 4720 48 get_swap_bio+0x30/0x90
> > > 23) 4672 160 __swap_writepage+0x150/0x230
> > > 24) 4512 32 swap_writepage+0x42/0x90
Without swap IO from the allocation context, the stack would have
ended here, which would have been easily survivable, and it would
have left the writeout work to kswapd, which has a much shallower
stack than this:
> > > 25) 4480 320 shrink_page_list+0x676/0xa80
> > > 26) 4160 208 shrink_inactive_list+0x262/0x4e0
> > > 27) 3952 304 shrink_lruvec+0x3e1/0x6a0
> > > 28) 3648 80 shrink_zone+0x3f/0x110
> > > 29) 3568 128 do_try_to_free_pages+0x156/0x4c0
> > > 30) 3440 208 try_to_free_pages+0xf7/0x1e0
> > > 31) 3232 352 __alloc_pages_nodemask+0x783/0xb20
> > > 32) 2880 8 alloc_pages_current+0x10f/0x1f0
> > > 33) 2872 200 __page_cache_alloc+0x13f/0x160
> > > 34) 2672 80 find_or_create_page+0x4c/0xb0
> > > 35) 2592 80 ext4_mb_load_buddy+0x1e9/0x370
> > > 36) 2512 176 ext4_mb_regular_allocator+0x1b7/0x460
> > > 37) 2336 128 ext4_mb_new_blocks+0x458/0x5f0
> > > 38) 2208 256 ext4_ext_map_blocks+0x70b/0x1010
> > > 39) 1952 160 ext4_map_blocks+0x325/0x530
> > > 40) 1792 384 ext4_writepages+0x6d1/0xce0
> > > 41) 1408 16 do_writepages+0x23/0x40
> > > 42) 1392 96 __writeback_single_inode+0x45/0x2e0
> > > 43) 1296 176 writeback_sb_inodes+0x2ad/0x500
> > > 44) 1120 80 __writeback_inodes_wb+0x9e/0xd0
> > > 45) 1040 160 wb_writeback+0x29b/0x350
> > > 46) 880 208 bdi_writeback_workfn+0x11c/0x480
> > > 47) 672 144 process_one_work+0x1d2/0x570
> > > 48) 528 112 worker_thread+0x116/0x370
> > > 49) 416 240 kthread+0xf3/0x110
> > > 50) 176 176 ret_from_fork+0x7c/0xb0
>
> Impressive: 3 nested allocations - GFP_NOFS, GFP_NOIO and then
> GFP_ATOMIC before the stack goes boom. XFS usually only needs 2...
Do they also usually involve swap_writepage()?
---
diff --git a/mm/page_io.c b/mm/page_io.c
index 7c59ef681381..02e7e3c168cf 100644
--- a/mm/page_io.c
+++ b/mm/page_io.c
@@ -233,6 +233,22 @@ int swap_writepage(struct page *page, struct writeback_control *wbc)
{
int ret = 0;
+ /*
+ * Refuse to write the page out if we are called from reclaim context.
+ *
+ * This avoids stack overflows when called from deeply used stacks in
+ * random callers for direct reclaim or memcg reclaim. We explicitly
+ * allow reclaim from kswapd as the stack usage there is relatively low.
+ *
+ * This should never happen except in the case of a VM regression so
+ * warn about it.
+ */
+ if (WARN_ON_ONCE((current->flags & (PF_MEMALLOC|PF_KSWAPD)) ==
+ PF_MEMALLOC)) {
+ SetPageDirty(page);
+ goto out;
+ }
+
if (try_to_free_swap(page)) {
unlock_page(page);
goto out;
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 61c576083c07..99cca6633e0d 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -985,13 +985,12 @@ static unsigned long shrink_page_list(struct list_head *page_list,
if (PageDirty(page)) {
/*
- * Only kswapd can writeback filesystem pages to
- * avoid risk of stack overflow but only writeback
+ * Only kswapd can writeback pages to avoid
+ * risk of stack overflow but only writeback
* if many dirty pages have been encountered.
*/
- if (page_is_file_cache(page) &&
- (!current_is_kswapd() ||
- !zone_is_reclaim_dirty(zone))) {
+ if (!current_is_kswapd() ||
+ !zone_is_reclaim_dirty(zone)) {
/*
* Immediately reclaim when written back.
* Similar in principal to deactivate_page()
* Re: [RFC 2/2] x86_64: expand kernel stack to 16K
2014-05-28 16:06 ` Johannes Weiner
@ 2014-05-28 21:55 ` Dave Chinner
2014-05-29 6:06 ` Minchan Kim
1 sibling, 0 replies; 5+ messages in thread
From: Dave Chinner @ 2014-05-28 21:55 UTC (permalink / raw)
To: Johannes Weiner
Cc: Dave Hansen, Rik van Riel, Peter Zijlstra, linux-mm, rusty,
Hugh Dickins, linux-kernel, Steven Rostedt, xfs, Minchan Kim, mst,
Mel Gorman, H. Peter Anvin, Andrew Morton, Ingo Molnar
On Wed, May 28, 2014 at 12:06:58PM -0400, Johannes Weiner wrote:
> On Wed, May 28, 2014 at 07:13:45PM +1000, Dave Chinner wrote:
> > On Wed, May 28, 2014 at 06:37:38PM +1000, Dave Chinner wrote:
> > > [ cc XFS list ]
> >
> > [and now there is a complete copy on the XFs list, I'll add my 2c]
> >
> > > On Wed, May 28, 2014 at 03:53:59PM +0900, Minchan Kim wrote:
> > > > While testing in-house patches under heavy memory pressure on qemu-kvm,
> > > > the 3.14 kernel crashed randomly. The cause was a kernel stack overflow.
> > > >
> > > > When I investigated the problem, the call stack was a little deeper
> > > > because reclaim functions were involved, but it was not the direct
> > > > reclaim path.
> > > >
> > > > I tried to trim the stack usage of some functions related to
> > > > alloc/reclaim and saved a few hundred bytes, but the overflow didn't
> > > > disappear; I just hit it again via another, deeper call stack on the
> > > > reclaim/allocator path.
> >
> > That's a no-win situation. The stack overruns through ->writepage
> > we've been seeing with XFS over the past *4 years* are much larger
> > than a few bytes. The worst-case stack usage on a virtio block
> > device was about 10.5KB.
> >
> > And, like this one, it came from the flusher thread as well. The
> > difference was that the allocation that triggered the reclaim path
> > you've reported occurred when 5k of the stack had already been
> > used...
> >
> > > > Of course, we could sweep every site we have found to reduce stack
> > > > usage, but I'm not sure for how long that would save the world (surely,
> > > > lots of developers will start adding nice features that use the stack
> > > > again), and if we consider more complex features in the I/O layer
> > > > and/or the reclaim path, it might be better to increase the stack size.
> > > > (Meanwhile, stack usage on 64-bit machines has doubled compared to
> > > > 32-bit while the stack size has stayed at 8K. Hmm, that doesn't seem
> > > > fair to me, and arm64 has already expanded to 16K.)
> >
> > Yup, that's all been pointed out previously. 8k stacks were never
> > large enough to fit the Linux IO architecture on x86-64, but nobody
> > outside filesystem and IO developers has been willing to accept that
> > argument as valid, despite regular stack overruns and filesystems
> > having to add workaround after workaround to prevent them.
> >
> > That's why stuff like this appears in various filesystems'
> > ->writepage:
> >
> > /*
> > * Refuse to write the page out if we are called from reclaim context.
> > *
> > * This avoids stack overflows when called from deeply used stacks in
> > * random callers for direct reclaim or memcg reclaim. We explicitly
> > * allow reclaim from kswapd as the stack usage there is relatively low.
> > *
> > * This should never happen except in the case of a VM regression so
> > * warn about it.
> > */
> > if (WARN_ON_ONCE((current->flags & (PF_MEMALLOC|PF_KSWAPD)) ==
> > PF_MEMALLOC))
> > goto redirty;
> >
> > That still doesn't guarantee us enough stack space to do writeback,
> > though, because memory allocation can occur when reading in the
> > metadata needed to do delayed allocation, and so we could trigger
> > GFP_NOFS memory allocation from the flusher thread with 4-5k of stack
> > already consumed, which would still overrun the stack.
> >
> > So, a couple of years ago we started deferring half the writeback
> > stack usage to a worker thread (commit c999a22 "xfs: introduce an
> > allocation workqueue"), under the assumption that the worst stack
> > usage when we call memory allocation is around 3-3.5k of stack used.
> > We thought that would be safe, but the stack trace you've posted
> > shows that alloc_page(GFP_NOFS) can consume upwards of 5k of stack,
> > which means we're still screwed despite all the workarounds we have
> > in place.
>
> The allocation and reclaim stack itself is only 2k per the stacktrace
> below. What got us in this particular case is that we engaged a
> complicated block layer setup from within the allocation context in
> order to swap out a page.
The report does not have a complicated block layer setup - it's just
a swap device on a virtio device. There's no MD, no RAID, no complex
transport and protocol layer, etc. It's about as simple as it gets.
> In the past we disabled filesystem ->writepage from within the
> allocation context and deferred it to kswapd for stack reasons (see
> the WARN_ON_ONCE and the comment in your above quote), but I think we
> have to go further and do the same for even swap_writepage():
I don't think that solves the problem. I've seen plenty of near
stack overflows that were caused by >3k of memory allocation/reclaim
stack usage and then another 1k of stack used scheduling while
waiting.
If we have a subsystem that can put >3k on the stack at arbitrary
locations, then we really only have <5k of stack available for
callers. And when the generic code typically consumes 1-2k of stack
before we get to filesystem specific methods, we only have 3-4k of
stack left for the worst case storage path stack usage. With the
block layer and driver layers requiring 2.5-3k because they can do
memory allocation and schedule, that leaves very little for the
layers in the middle, which is arguably the most algorithmically
complex layer of the storage stack.....
> > > > I guess this topic was discussed several times, so there might be a
> > > > strong reason not to increase the kernel stack size on x86_64 that I
> > > > don't know about, so I'm Cc'ing the x86_64 maintainers, other MM folks
> > > > and the virtio maintainers.
> > > >
> > > > Depth Size Location (51 entries)
> > > >
> > > > 0) 7696 16 lookup_address+0x28/0x30
> > > > 1) 7680 16 _lookup_address_cpa.isra.3+0x3b/0x40
> > > > 2) 7664 24 __change_page_attr_set_clr+0xe0/0xb50
> > > > 3) 7640 392 kernel_map_pages+0x6c/0x120
> > > > 4) 7248 256 get_page_from_freelist+0x489/0x920
> > > > 5) 6992 352 __alloc_pages_nodemask+0x5e1/0xb20
> > > > 6) 6640 8 alloc_pages_current+0x10f/0x1f0
> > > > 7) 6632 168 new_slab+0x2c5/0x370
> > > > 8) 6464 8 __slab_alloc+0x3a9/0x501
> > > > 9) 6456 80 __kmalloc+0x1cb/0x200
> > > > 10) 6376 376 vring_add_indirect+0x36/0x200
> > > > 11) 6000 144 virtqueue_add_sgs+0x2e2/0x320
> > > > 12) 5856 288 __virtblk_add_req+0xda/0x1b0
> > > > 13) 5568 96 virtio_queue_rq+0xd3/0x1d0
> > > > 14) 5472 128 __blk_mq_run_hw_queue+0x1ef/0x440
> > > > 15) 5344 16 blk_mq_run_hw_queue+0x35/0x40
> > > > 16) 5328 96 blk_mq_insert_requests+0xdb/0x160
> > > > 17) 5232 112 blk_mq_flush_plug_list+0x12b/0x140
> > > > 18) 5120 112 blk_flush_plug_list+0xc7/0x220
> > > > 19) 5008 64 io_schedule_timeout+0x88/0x100
> > > > 20) 4944 128 mempool_alloc+0x145/0x170
> > > > 21) 4816 96 bio_alloc_bioset+0x10b/0x1d0
> > > > 22) 4720 48 get_swap_bio+0x30/0x90
> > > > 23) 4672 160 __swap_writepage+0x150/0x230
> > > > 24) 4512 32 swap_writepage+0x42/0x90
>
> Without swap IO from the allocation context, the stack would have
> ended here, which would have been easily survivable. And left the
> writeout work to kswapd, which has a much shallower stack than this:
Sure, but this is just playing whack-a-stack. We can keep slapping
band-aids and restrictions on code and make the code more complex,
constrained, convoluted and slower, or we can just increase the
stack size....
> > > > 25) 4480 320 shrink_page_list+0x676/0xa80
> > > > 26) 4160 208 shrink_inactive_list+0x262/0x4e0
> > > > 27) 3952 304 shrink_lruvec+0x3e1/0x6a0
> > > > 28) 3648 80 shrink_zone+0x3f/0x110
> > > > 29) 3568 128 do_try_to_free_pages+0x156/0x4c0
> > > > 30) 3440 208 try_to_free_pages+0xf7/0x1e0
> > > > 31) 3232 352 __alloc_pages_nodemask+0x783/0xb20
> > > > 32) 2880 8 alloc_pages_current+0x10f/0x1f0
> > > > 33) 2872 200 __page_cache_alloc+0x13f/0x160
> > > > 34) 2672 80 find_or_create_page+0x4c/0xb0
> > > > 35) 2592 80 ext4_mb_load_buddy+0x1e9/0x370
> > > > 36) 2512 176 ext4_mb_regular_allocator+0x1b7/0x460
> > > > 37) 2336 128 ext4_mb_new_blocks+0x458/0x5f0
> > > > 38) 2208 256 ext4_ext_map_blocks+0x70b/0x1010
> > > > 39) 1952 160 ext4_map_blocks+0x325/0x530
> > > > 40) 1792 384 ext4_writepages+0x6d1/0xce0
> > > > 41) 1408 16 do_writepages+0x23/0x40
> > > > 42) 1392 96 __writeback_single_inode+0x45/0x2e0
> > > > 43) 1296 176 writeback_sb_inodes+0x2ad/0x500
> > > > 44) 1120 80 __writeback_inodes_wb+0x9e/0xd0
> > > > 45) 1040 160 wb_writeback+0x29b/0x350
> > > > 46) 880 208 bdi_writeback_workfn+0x11c/0x480
> > > > 47) 672 144 process_one_work+0x1d2/0x570
> > > > 48) 528 112 worker_thread+0x116/0x370
> > > > 49) 416 240 kthread+0xf3/0x110
> > > > 50) 176 176 ret_from_fork+0x7c/0xb0
> >
> > Impressive: 3 nested allocations - GFP_NOFS, GFP_NOIO and then
> > GFP_ATOMIC before the stack goes boom. XFS usually only needs 2...
>
> Do they also usually involve swap_writepage()?
No. Have a look at this recent thread, where Dave Jones reported
that trinity was busting the stack:
http://oss.sgi.com/archives/xfs/2014-02/msg00325.html
What happens when a shrinker issues IO:
http://oss.sgi.com/archives/xfs/2014-02/msg00361.html
Yes, there was an XFS problem in there that was fixed (by moving
work to a workqueue!) but the point is that swap is not the only
path through memory allocation that can consume huge amounts of
stack. That above trace also points out a path through the scheduler
of close to 1k of stack usage. That gets worse -
wait_for_completion() typically requires 1.5k of stack....
Also contributing is the new blk-mq layer which, judging from the
above stack trace, still hasn't been fixed:
http://oss.sgi.com/archives/xfs/2014-02/msg00355.html
and a lot of the stack usage is because of saved registers on each
function call:
http://oss.sgi.com/archives/xfs/2014-02/msg00470.html
And here's a good set of examples of the amount of stack certain
functions can require:
http://oss.sgi.com/archives/xfs/2014-02/msg00365.html
Am I the only person who sees a widespread problem here?
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs
^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: [RFC 2/2] x86_64: expand kernel stack to 16K
2014-05-28 16:06 ` Johannes Weiner
2014-05-28 21:55 ` Dave Chinner
@ 2014-05-29 6:06 ` Minchan Kim
1 sibling, 0 replies; 5+ messages in thread
From: Minchan Kim @ 2014-05-29 6:06 UTC (permalink / raw)
To: Johannes Weiner
Cc: Dave Hansen, Rik van Riel, Peter Zijlstra, mst, Hugh Dickins,
rusty, linux-kernel, Steven Rostedt, xfs, linux-mm, Mel Gorman,
H. Peter Anvin, Andrew Morton, Ingo Molnar
On Wed, May 28, 2014 at 12:06:58PM -0400, Johannes Weiner wrote:
> On Wed, May 28, 2014 at 07:13:45PM +1000, Dave Chinner wrote:
> > On Wed, May 28, 2014 at 06:37:38PM +1000, Dave Chinner wrote:
> > > [ cc XFS list ]
> >
> > [and now there is a complete copy on the XFs list, I'll add my 2c]
> >
> > > On Wed, May 28, 2014 at 03:53:59PM +0900, Minchan Kim wrote:
> > > > While I was testing in-house patches under heavy memory pressure on
> > > > qemu-kvm, the 3.14 kernel randomly crashed. The reason was kernel stack overflow.
> > > >
> > > > When I investigated the problem, the callstack was a little deeper
> > > > because it involved reclaim functions, though not the direct reclaim path.
> > > >
> > > > I tried to trim the stack usage of some alloc/reclaim-related functions
> > > > and saved a hundred bytes or so, but the overflow didn't disappear; I just
> > > > hit it again via another, deeper callstack in the reclaim/allocator path.
> >
> > That's a no win situation. The stack overruns through ->writepage
> > we've been seeing with XFS over the past *4 years* are much larger
> > than a few bytes. The worst case stack usage on a virtio block
> > device was about 10.5KB of stack usage.
> >
> > And, like this one, it came from the flusher thread as well. The
> > difference was that the allocation that triggered the reclaim path
> > you've reported occurred when 5k of the stack had already been
> > used...
> >
> > > > Of course, we could sweep every site we have found to reduce
> > > > stack usage, but I'm not sure how long that would save the world (surely,
> > > > lots of developers will keep adding nice features that use stack
> > > > again), and if we consider more complex features in the I/O layer
> > > > and/or reclaim path, it might be better to increase the stack size
> > > > (meanwhile, stack usage on 64-bit doubled compared to 32-bit while
> > > > the stack stayed at 8K. Hmm, that doesn't seem fair to me, and arm64
> > > > already expanded to 16K.)
> >
> > Yup, that's all been pointed out previously. 8k stacks were never
> > large enough to fit the linux IO architecture on x86-64, but nobody
> > outside filesystem and IO developers has been willing to accept that
> > argument as valid, despite regular stack overruns and filesystems
> > having to add workaround after workaround to prevent stack overruns.
> >
> > That's why stuff like this appears in various filesystem's
> > ->writepage:
> >
> > /*
> > * Refuse to write the page out if we are called from reclaim context.
> > *
> > * This avoids stack overflows when called from deeply used stacks in
> > * random callers for direct reclaim or memcg reclaim. We explicitly
> > * allow reclaim from kswapd as the stack usage there is relatively low.
> > *
> > * This should never happen except in the case of a VM regression so
> > * warn about it.
> > */
> > if (WARN_ON_ONCE((current->flags & (PF_MEMALLOC|PF_KSWAPD)) ==
> > PF_MEMALLOC))
> > goto redirty;
> >
> > That still doesn't guarantee us enough stack space to do writeback,
> > though, because memory allocation can occur when reading in metadata
> > needed to do delayed allocation, and so we could trigger GFP_NOFS
> > memory allocation from the flusher thread with 4-5k of stack already
> > consumed, so that would still overrun the stack.
> >
> > So, a couple of years ago we started deferring half the writeback
> > stack usage to a worker thread (commit c999a22 "xfs: introduce an
> > allocation workqueue"), under the assumption that the worst stack
> > usage when we call memory allocation is around 3-3.5k of stack used.
> > We thought that would be safe, but the stack trace you've posted
> > shows that alloc_page(GFP_NOFS) can consume upwards of 5k of stack,
> > which means we're still screwed despite all the workarounds we have
> > in place.
>
> The allocation and reclaim stack itself is only 2k per the stacktrace
> below. What got us in this particular case is that we engaged a
> complicated block layer setup from within the allocation context in
> order to swap out a page.
>
> In the past we disabled filesystem ->writepage from within the
> allocation context and deferred it to kswapd for stack reasons (see
> the WARN_ON_ONCE and the comment in your above quote), but I think we
> have to go further and do the same for even swap_writepage():
>
> > > > I guess this topic was discussed several times, so there might be a
> > > > strong reason not to increase the kernel stack size on x86_64 that I
> > > > don't know about, so I'm Cc'ing the x86_64 maintainers, other MM folks
> > > > and the virtio maintainers.
> > > >
> > > > Depth Size Location (51 entries)
> > > >
> > > > 0) 7696 16 lookup_address+0x28/0x30
> > > > 1) 7680 16 _lookup_address_cpa.isra.3+0x3b/0x40
> > > > 2) 7664 24 __change_page_attr_set_clr+0xe0/0xb50
> > > > 3) 7640 392 kernel_map_pages+0x6c/0x120
> > > > 4) 7248 256 get_page_from_freelist+0x489/0x920
> > > > 5) 6992 352 __alloc_pages_nodemask+0x5e1/0xb20
> > > > 6) 6640 8 alloc_pages_current+0x10f/0x1f0
> > > > 7) 6632 168 new_slab+0x2c5/0x370
> > > > 8) 6464 8 __slab_alloc+0x3a9/0x501
> > > > 9) 6456 80 __kmalloc+0x1cb/0x200
> > > > 10) 6376 376 vring_add_indirect+0x36/0x200
> > > > 11) 6000 144 virtqueue_add_sgs+0x2e2/0x320
> > > > 12) 5856 288 __virtblk_add_req+0xda/0x1b0
> > > > 13) 5568 96 virtio_queue_rq+0xd3/0x1d0
> > > > 14) 5472 128 __blk_mq_run_hw_queue+0x1ef/0x440
> > > > 15) 5344 16 blk_mq_run_hw_queue+0x35/0x40
> > > > 16) 5328 96 blk_mq_insert_requests+0xdb/0x160
> > > > 17) 5232 112 blk_mq_flush_plug_list+0x12b/0x140
> > > > 18) 5120 112 blk_flush_plug_list+0xc7/0x220
> > > > 19) 5008 64 io_schedule_timeout+0x88/0x100
> > > > 20) 4944 128 mempool_alloc+0x145/0x170
> > > > 21) 4816 96 bio_alloc_bioset+0x10b/0x1d0
> > > > 22) 4720 48 get_swap_bio+0x30/0x90
> > > > 23) 4672 160 __swap_writepage+0x150/0x230
> > > > 24) 4512 32 swap_writepage+0x42/0x90
>
> Without swap IO from the allocation context, the stack would have
> ended here, which would have been easily survivable. And left the
> writeout work to kswapd, which has a much shallower stack than this:
>
> > > > 25) 4480 320 shrink_page_list+0x676/0xa80
> > > > 26) 4160 208 shrink_inactive_list+0x262/0x4e0
> > > > 27) 3952 304 shrink_lruvec+0x3e1/0x6a0
> > > > 28) 3648 80 shrink_zone+0x3f/0x110
> > > > 29) 3568 128 do_try_to_free_pages+0x156/0x4c0
> > > > 30) 3440 208 try_to_free_pages+0xf7/0x1e0
> > > > 31) 3232 352 __alloc_pages_nodemask+0x783/0xb20
> > > > 32) 2880 8 alloc_pages_current+0x10f/0x1f0
> > > > 33) 2872 200 __page_cache_alloc+0x13f/0x160
> > > > 34) 2672 80 find_or_create_page+0x4c/0xb0
> > > > 35) 2592 80 ext4_mb_load_buddy+0x1e9/0x370
> > > > 36) 2512 176 ext4_mb_regular_allocator+0x1b7/0x460
> > > > 37) 2336 128 ext4_mb_new_blocks+0x458/0x5f0
> > > > 38) 2208 256 ext4_ext_map_blocks+0x70b/0x1010
> > > > 39) 1952 160 ext4_map_blocks+0x325/0x530
> > > > 40) 1792 384 ext4_writepages+0x6d1/0xce0
> > > > 41) 1408 16 do_writepages+0x23/0x40
> > > > 42) 1392 96 __writeback_single_inode+0x45/0x2e0
> > > > 43) 1296 176 writeback_sb_inodes+0x2ad/0x500
> > > > 44) 1120 80 __writeback_inodes_wb+0x9e/0xd0
> > > > 45) 1040 160 wb_writeback+0x29b/0x350
> > > > 46) 880 208 bdi_writeback_workfn+0x11c/0x480
> > > > 47) 672 144 process_one_work+0x1d2/0x570
> > > > 48) 528 112 worker_thread+0x116/0x370
> > > > 49) 416 240 kthread+0xf3/0x110
> > > > 50) 176 176 ret_from_fork+0x7c/0xb0
> >
> > Impressive: 3 nested allocations - GFP_NOFS, GFP_NOIO and then
> > GFP_ATOMIC before the stack goes boom. XFS usually only needs 2...
>
> Do they also usually involve swap_writepage()?
Maybe it works, but the problem I can think of is LRU churn, because
anon pages scanned in direct reclaim would live another round in the
LRU. And, as Dave already pointed out, it couldn't prevent the
synchronous unplugging caused by another schedule point in the direct
reclaim path, so I buy Dave's idea of passing the plug list off to
kblockd.
>
> ---
>
> diff --git a/mm/page_io.c b/mm/page_io.c
> index 7c59ef681381..02e7e3c168cf 100644
> --- a/mm/page_io.c
> +++ b/mm/page_io.c
> @@ -233,6 +233,22 @@ int swap_writepage(struct page *page, struct writeback_control *wbc)
> {
> int ret = 0;
>
> + /*
> + * Refuse to write the page out if we are called from reclaim context.
> + *
> + * This avoids stack overflows when called from deeply used stacks in
> + * random callers for direct reclaim or memcg reclaim. We explicitly
> + * allow reclaim from kswapd as the stack usage there is relatively low.
> + *
> + * This should never happen except in the case of a VM regression so
> + * warn about it.
> + */
> + if (WARN_ON_ONCE((current->flags & (PF_MEMALLOC|PF_KSWAPD)) ==
> + PF_MEMALLOC)) {
> + SetPageDirty(page);
> + goto out;
> + }
> +
> if (try_to_free_swap(page)) {
> unlock_page(page);
> goto out;
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 61c576083c07..99cca6633e0d 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -985,13 +985,12 @@ static unsigned long shrink_page_list(struct list_head *page_list,
>
> if (PageDirty(page)) {
> /*
> - * Only kswapd can writeback filesystem pages to
> - * avoid risk of stack overflow but only writeback
> + * Only kswapd can writeback pages to avoid
> + * risk of stack overflow but only writeback
> * if many dirty pages have been encountered.
> */
> - if (page_is_file_cache(page) &&
> - (!current_is_kswapd() ||
> - !zone_is_reclaim_dirty(zone))) {
> + if (!current_is_kswapd() ||
> + !zone_is_reclaim_dirty(zone)) {
> /*
> * Immediately reclaim when written back.
> * Similar in principal to deactivate_page()
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org. For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
--
Kind regards,
Minchan Kim
end of thread, other threads:[~2014-05-29 6:05 UTC | newest]
Thread overview: 5+ messages
[not found] <1401260039-18189-1-git-send-email-minchan@kernel.org>
[not found] ` <1401260039-18189-2-git-send-email-minchan@kernel.org>
2014-05-28 8:37 ` [RFC 2/2] x86_64: expand kernel stack to 16K Dave Chinner
2014-05-28 9:13 ` Dave Chinner
2014-05-28 16:06 ` Johannes Weiner
2014-05-28 21:55 ` Dave Chinner
2014-05-29 6:06 ` Minchan Kim