ext4 deep stack with mark_page

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

* ext4 deep stack with mark_page_dirty reclaim
@ 2011-03-14 19:20 Hugh Dickins
  2011-03-14 20:46 ` Ted Ts'o
  2011-03-14 22:46 ` Christoph Hellwig
  0 siblings, 2 replies; 7+ messages in thread
From: Hugh Dickins @ 2011-03-14 19:20 UTC (permalink / raw)
  To: Theodore Ts'o; +Cc: linux-ext4, linux-kernel, linux-mm

When testing something else on 2.6.38-rc8 last night,
I hit this x86_64 stack overflow.  I've never had one before,
it seems worth reporting.  kdb was in, I jotted it down by hand
(the notifier part of it will be notifying kdb of the fault).
CONFIG_DEBUG_STACK_OVERFLOW and DEBUG_STACK_USAGE were not set.

I should disclose that I have a hack in which may make my stack
frames slightly larger than they should be: check against yours.
So it may not be an overflow for anyone else, but still a trace
to worry about.

The faulting address has a stray 2 in it: presumably from when
the stack descended into the thread_info and got corrupted.

BUG: unable to handle kernel paging request at ffff88027a704060
IP: vprintk+0x100
Thread overran stack, or stack corrupted
ffff88007a7040a8 notifier_call_chain+0x40
ffff88007a704108 __atomic_notifier_call_chain+0x48
ffff88007a704148 __die+0x48
ffff88007a7041a8 die+0x50
ffff88007a7041e8 do_general_protection+0x168
ffff88007a704248 general_protection+0x1f
ffff88007a704338 schedule+0x25a
ffff88007a7044a8 io_schedule+0x35
ffff88007a7044c8 get_request_wait+0xc6
ffff88007a704568 __make_request+0x36d
ffff88007a7045d8 generic_make_request+0x2f2
ffff88007a7046a8 submit_bio+0xe1
ffff88007a704738 swap_writepage+0xa3
ffff88007a704788 pageout+0x151
ffff88007a704808 shrink_page_list+0x2db
ffff88007a7048b8 shrink_inactive_list+0x2d3
ffff88007a7049b8 shrink_zone+0x17d
ffff88007a704a98 shrink_zones+0x0xa3
ffff88007a704b18 do_try_to_free_pages+0x87
ffff88007a704ba8 try_to_free_mem_cgroup_pages+0x8e
ffff88007a704c18 mem_cgroup_hierarchical_reclaim+0x220
ffff88007a704cc8 mem_cgroup_do_charge+0xdc
ffff88007a704d48 __mem_cgroup_try_charge+0x19c
ffff88007a704dc8 mem_cgroup_charge_common+0xa8
ffff88007a704e48 mem_cgroup_cache_charge+0x19a
ffff88007a704ec8 add_to_page_cache_locked+0x57
ffff88007a704f28 add_to_page_cache_lru+0x3e
ffff88007a704f78 find_or_create_page+0x69
ffff88007a704fe8 grow_dev_page+0x4a
ffff88007a705048 grow_buffers+0x41
ffff88007a705088 __getblk_slow+0xd7
ffff88007a7050d8 __getblk+0x44
ffff88007a705128 __ext4_get_inode_loc+0x12c
ffff88007a7051d8 ext4_get_inode_loc+0x30
ffff88007a705208 ext4_reserve_inode_write+0x21
ffff88007a705258 ext4_mark_inode_dirty+0x3b
ffff88007a7052f8 ext4_dirty_inode+0x3e
ffff88007a705338 __mark_inode_dirty+0x32
linux/fs.h       mark_inode_dirty
linux/quotaops.h dquot_alloc_space
linux/quotaops.h dquot_alloc_block
ffff88007a705388 ext4_mb_new_blocks+0xc2
ffff88007a705418 ext4_alloc_blocks+0x189
ffff88007a7054e8 ext4_alloc_branch+0x73
ffff88007a7055b8 ext4_ind_map_blocks+0x148
ffff88007a7056c8 ext4_map_blocks+0x148
ffff88007a705738 ext4_getblk+0x5f
ffff88007a7057c8 ext4_bread+0x36
ffff88007a705828 ext4_append+0x52
ffff88007a705888 do_split+0x5b
ffff88007a705968 ext4_dx_add_entry+0x4b4
ffff88007a705a98 ext4_add_entry+0x7c
ffff88007a705b48 ext4_add_nondir+0x2e
ffff88007a705b98 ext4_create+0xf5
ffff88007a705c28 vfs_create+0x83
ffff88007a705c88 __open_namei_create+0x59
ffff88007a705ce8 do_last+0x13b
ffff88007a705d58 do_filp_open+0x2ae
ffff88007a705ed8 do_sys_open+0x72
ffff88007a705f58 sys_open+0x27

Hugh

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: ext4 deep stack with mark_page_dirty reclaim
  2011-03-14 19:20 ext4 deep stack with mark_page_dirty reclaim Hugh Dickins
@ 2011-03-14 20:46 ` Ted Ts'o
  2011-03-15  2:25   ` Andreas Dilger
  2011-03-14 22:46 ` Christoph Hellwig
  1 sibling, 1 reply; 7+ messages in thread
From: Ted Ts'o @ 2011-03-14 20:46 UTC (permalink / raw)
  To: Hugh Dickins; +Cc: linux-ext4, linux-kernel, linux-mm

On Mon, Mar 14, 2011 at 12:20:52PM -0700, Hugh Dickins wrote:
> When testing something else on 2.6.38-rc8 last night,
> I hit this x86_64 stack overflow.  I've never had one before,
> it seems worth reporting.  kdb was in, I jotted it down by hand
> (the notifier part of it will be notifying kdb of the fault).
> CONFIG_DEBUG_STACK_OVERFLOW and DEBUG_STACK_USAGE were not set.
> 
> I should disclose that I have a hack in which may make my stack
> frames slightly larger than they should be: check against yours.
> So it may not be an overflow for anyone else, but still a trace
> to worry about.

Here's the trace translated to the stack space used by each function.
There are a few piggy ext4 functions that we can try to shrink, but
the real problem is just how deep the whole stack is getting.

>From the syscall to the lowest-level ext4 function is 3712 bytes, and
everything from there to the schedule() which then triggered the GPF
was another 3728 of stack space....

				- Ted

 240 schedule+0x25a
 368 io_schedule+0x35
  32 get_request_wait+0xc6
 160 __make_request+0x36d
 112 generic_make_request+0x2f2
 208 submit_bio+0xe1
 144 swap_writepage+0xa3
  80 pageout+0x151
 128 shrink_page_list+0x2db
 176 shrink_inactive_list+0x2d3
 256 shrink_zone+0x17d
 224 shrink_zones+0x0xa3
 128 do_try_to_free_pages+0x87
 144 try_to_free_mem_cgroup_pages+0x8e
 112 mem_cgroup_hierarchical_reclaim+0x220
 176 mem_cgroup_do_charge+0xdc
 128 __mem_cgroup_try_charge+0x19c
 128 mem_cgroup_charge_common+0xa8
 128 mem_cgroup_cache_charge+0x19a
 128 add_to_page_cache_locked+0x57
  96 add_to_page_cache_lru+0x3e
  80 find_or_create_page+0x69
 112 grow_dev_page+0x4a
  96 grow_buffers+0x41
  64 __getblk_slow+0xd7
  80 __getblk+0x44
  80 __ext4_get_inode_loc+0x12c
 176 ext4_get_inode_loc+0x30
  48 ext4_reserve_inode_write+0x21
  80 ext4_mark_inode_dirty+0x3b
 160 ext4_dirty_inode+0x3e
  64 __mark_inode_dirty+0x32
  80 linux/fs.h       mark_inode_dirty
   0 linux/quotaops.h dquot_alloc_space
   0 linux/quotaops.h dquot_alloc_block
   0 ext4_mb_new_blocks+0xc2
 144 ext4_alloc_blocks+0x189
 208 ext4_alloc_branch+0x73
 208 ext4_ind_map_blocks+0x148
 272 ext4_map_blocks+0x148
 112 ext4_getblk+0x5f
 144 ext4_bread+0x36
  96 ext4_append+0x52
  96 do_split+0x5b
 224 ext4_dx_add_entry+0x4b4
 304 ext4_add_entry+0x7c
 176 ext4_add_nondir+0x2e
  80 ext4_create+0xf5
 144 vfs_create+0x83
  96 __open_namei_create+0x59
  96 do_last+0x13b
 112 do_filp_open+0x2ae
 384 do_sys_open+0x72
 128 sys_open+0x27

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: ext4 deep stack with mark_page_dirty reclaim
  2011-03-14 20:46 ` Ted Ts'o
@ 2011-03-15  2:25   ` Andreas Dilger
  2011-03-15 15:22     ` David Sterba
  0 siblings, 1 reply; 7+ messages in thread
From: Andreas Dilger @ 2011-03-15  2:25 UTC (permalink / raw)
  To: Ted Ts'o; +Cc: Hugh Dickins, linux-ext4, linux-kernel, linux-mm

On 2011-03-14, at 1:46 PM, Ted Ts'o wrote:
> On Mon, Mar 14, 2011 at 12:20:52PM -0700, Hugh Dickins wrote:
>> When testing something else on 2.6.38-rc8 last night,
>> I hit this x86_64 stack overflow.  I've never had one before,
>> it seems worth reporting.  kdb was in, I jotted it down by hand
>> (the notifier part of it will be notifying kdb of the fault).
>> CONFIG_DEBUG_STACK_OVERFLOW and DEBUG_STACK_USAGE were not set.
>> 
>> I should disclose that I have a hack in which may make my stack
>> frames slightly larger than they should be: check against yours.
>> So it may not be an overflow for anyone else, but still a trace
>> to worry about.
> 
> Here's the trace translated to the stack space used by each function.
> There are a few piggy ext4 functions that we can try to shrink, but
> the real problem is just how deep the whole stack is getting.
> 
> From the syscall to the lowest-level ext4 function is 3712 bytes, and
> everything from there to the schedule() which then triggered the GPF
> was another 3728 of stack space....

Is there a script which you used to generate this stack trace to function size mapping, or did you do it by hand?  I've always wanted such a script, but the tricky part is that there is so much garbage on the stack that any automated stack parsing is almost useless.  Alternately, it would seem trivial to have the stack dumper print the relative address of each symbol, and the delta from the previous symbol...

To be honest, I think the stack size limitation is becoming a serious problem in itself.  While some stack-size reduction effort is actually useful in removing inefficiency, I think there is a lot of crazy and inefficient things to try and minimize the stack usage (e.g. lots of kmalloc/kfree of temporary arrays instead of just putting them on the stack), which ends up consuming _more_ total memory.

This can be seen with deep storage stacks that are using the network on both ends, like NFS+{XFS, ext4}+LVM+DM+{fcoib,iSCSI}+driver+kmalloc or similar...  The below stack isn't even using something so convoluted.

> 240 schedule+0x25a
> 368 io_schedule+0x35
>  32 get_request_wait+0xc6
> 160 __make_request+0x36d
> 112 generic_make_request+0x2f2
> 208 submit_bio+0xe1
> 144 swap_writepage+0xa3
>  80 pageout+0x151
> 128 shrink_page_list+0x2db
> 176 shrink_inactive_list+0x2d3
> 256 shrink_zone+0x17d
> 224 shrink_zones+0x0xa3
> 128 do_try_to_free_pages+0x87
> 144 try_to_free_mem_cgroup_pages+0x8e
> 112 mem_cgroup_hierarchical_reclaim+0x220
> 176 mem_cgroup_do_charge+0xdc
> 128 __mem_cgroup_try_charge+0x19c
> 128 mem_cgroup_charge_common+0xa8
> 128 mem_cgroup_cache_charge+0x19a
> 128 add_to_page_cache_locked+0x57
>  96 add_to_page_cache_lru+0x3e
>  80 find_or_create_page+0x69
> 112 grow_dev_page+0x4a
>  96 grow_buffers+0x41
>  64 __getblk_slow+0xd7
>  80 __getblk+0x44
>  80 __ext4_get_inode_loc+0x12c
> 176 ext4_get_inode_loc+0x30
>  48 ext4_reserve_inode_write+0x21
>  80 ext4_mark_inode_dirty+0x3b
> 160 ext4_dirty_inode+0x3e
>  64 __mark_inode_dirty+0x32
>  80 linux/fs.h       mark_inode_dirty
>   0 linux/quotaops.h dquot_alloc_space
>   0 linux/quotaops.h dquot_alloc_block
>   0 ext4_mb_new_blocks+0xc2
> 144 ext4_alloc_blocks+0x189
> 208 ext4_alloc_branch+0x73
> 208 ext4_ind_map_blocks+0x148
> 272 ext4_map_blocks+0x148
> 112 ext4_getblk+0x5f
> 144 ext4_bread+0x36
>  96 ext4_append+0x52
>  96 do_split+0x5b
> 224 ext4_dx_add_entry+0x4b4
> 304 ext4_add_entry+0x7c
> 176 ext4_add_nondir+0x2e
>  80 ext4_create+0xf5
> 144 vfs_create+0x83
>  96 __open_namei_create+0x59
>  96 do_last+0x13b
> 112 do_filp_open+0x2ae
> 384 do_sys_open+0x72
> 128 sys_open+0x27
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


Cheers, Andreas





--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: ext4 deep stack with mark_page_dirty reclaim
  2011-03-15  2:25   ` Andreas Dilger
@ 2011-03-15 15:22     ` David Sterba
  2011-03-15 16:26       ` Chris Mason
  0 siblings, 1 reply; 7+ messages in thread
From: David Sterba @ 2011-03-15 15:22 UTC (permalink / raw)
  To: linux-kernel, linux-mm; +Cc: adilger

On Mon, Mar 14, 2011 at 07:25:10PM -0700, Andreas Dilger wrote:
> Is there a script which you used to generate this stack trace to
> function size mapping, or did you do it by hand?  I've always wanted
> such a script, but the tricky part is that there is so much garbage on
> the stack that any automated stack parsing is almost useless.
> Alternately, it would seem trivial to have the stack dumper print the
> relative address of each symbol, and the delta from the previous
> symbol...

> > 240 schedule+0x25a
> > 368 io_schedule+0x35
> >  32 get_request_wait+0xc6

from the callstack:

ffff88007a704338 schedule+0x25a
ffff88007a7044a8 io_schedule+0x35
ffff88007a7044c8 get_request_wait+0xc6

subtract the values and you get the ones Ted posted,

eg. for get_request_wait:

0xffff88007a7044c8 - 0xffff88007a7044a8 = 32

There'se a script scripts/checkstack.pl which tries to determine stack
usage from 'objdump -d' looking for the 'sub 0x123,%rsp' instruction and
reporting the 0x123 as stack consumption. It does not give same results,
for the get_request_wait:

ffffffff81216205:       48 83 ec 68             sub    $0x68,%rsp

reported as 104.


dave

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: ext4 deep stack with mark_page_dirty reclaim
  2011-03-15 15:22     ` David Sterba
@ 2011-03-15 16:26       ` Chris Mason
  2011-03-23 14:02         ` David Sterba
  0 siblings, 1 reply; 7+ messages in thread
From: Chris Mason @ 2011-03-15 16:26 UTC (permalink / raw)
  To: dave; +Cc: linux-kernel, linux-mm, adilger

Excerpts from David Sterba's message of 2011-03-15 11:22:22 -0400:
> On Mon, Mar 14, 2011 at 07:25:10PM -0700, Andreas Dilger wrote:
> > Is there a script which you used to generate this stack trace to
> > function size mapping, or did you do it by hand?  I've always wanted
> > such a script, but the tricky part is that there is so much garbage on
> > the stack that any automated stack parsing is almost useless.
> > Alternately, it would seem trivial to have the stack dumper print the
> > relative address of each symbol, and the delta from the previous
> > symbol...
> 
> > > 240 schedule+0x25a
> > > 368 io_schedule+0x35
> > >  32 get_request_wait+0xc6
> 
> from the callstack:
> 
> ffff88007a704338 schedule+0x25a
> ffff88007a7044a8 io_schedule+0x35
> ffff88007a7044c8 get_request_wait+0xc6
> 
> subtract the values and you get the ones Ted posted,
> 
> eg. for get_request_wait:
> 
> 0xffff88007a7044c8 - 0xffff88007a7044a8 = 32
> 
> There'se a script scripts/checkstack.pl which tries to determine stack
> usage from 'objdump -d' looking for the 'sub 0x123,%rsp' instruction and
> reporting the 0x123 as stack consumption. It does not give same results,
> for the get_request_wait:
> 
> ffffffff81216205:       48 83 ec 68             sub    $0x68,%rsp
> 
> reported as 104.

Also, the ftrace stack usage tracer gives more verbose output that
includes the size of each function.

-chris

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: ext4 deep stack with mark_page_dirty reclaim
  2011-03-15 16:26       ` Chris Mason
@ 2011-03-23 14:02         ` David Sterba
  0 siblings, 0 replies; 7+ messages in thread
From: David Sterba @ 2011-03-23 14:02 UTC (permalink / raw)
  To: Chris Mason; +Cc: dave, linux-kernel, linux-mm, adilger

On Tue, Mar 15, 2011 at 12:26:43PM -0400, Chris Mason wrote:
> Also, the ftrace stack usage tracer gives more verbose output that
> includes the size of each function.

Yet another one on the list is the -fstack-size option in new gcc 4.6 [*].
It creates a file with .su extension containing lines in format
file:line:char:function_name	stack_size	linkage_type

eg.

a.c:168:5:main     224     static


dave

* http://gcc.gnu.org/gcc-4.6/changes.html

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: ext4 deep stack with mark_page_dirty reclaim
  2011-03-14 19:20 ext4 deep stack with mark_page_dirty reclaim Hugh Dickins
  2011-03-14 20:46 ` Ted Ts'o
@ 2011-03-14 22:46 ` Christoph Hellwig
  1 sibling, 0 replies; 7+ messages in thread
From: Christoph Hellwig @ 2011-03-14 22:46 UTC (permalink / raw)
  To: Hugh Dickins; +Cc: Theodore Ts'o, linux-ext4, linux-kernel, linux-mm

Direct reclaim (in the cgroup variant) at it's work.  We had a couple of
flamewars on this before, but this trivial example with reclaim from
the most simple case (swap space) shows that we really should never
reclaim from memory allocation callers for stack usage reasons.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2011-03-23 14:03 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2011-03-14 19:20 ext4 deep stack with mark_page_dirty reclaim Hugh Dickins
2011-03-14 20:46 ` Ted Ts'o
2011-03-15  2:25   ` Andreas Dilger
2011-03-15 15:22     ` David Sterba
2011-03-15 16:26       ` Chris Mason
2011-03-23 14:02         ` David Sterba
2011-03-14 22:46 ` Christoph Hellwig

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).