* Need to potentially watch stack usage for ext4 and AIO...
From: Theodore Ts'o @ 2009-06-19 17:59 UTC
To: linux-ext4

On a 32-bit system, while running aio-stress, I got the following kernel
message:

	aio-stress used greatest stack depth: 372 bytes left

That's a bit close for comfort; we may want to see if we have some
especially piggy on-stack allocations on the AIO code paths.

						- Ted

* Re: Need to potentially watch stack usage for ext4 and AIO...
From: Eric Sandeen @ 2009-06-20 1:46 UTC
To: Theodore Ts'o; +Cc: linux-ext4

Theodore Ts'o wrote:
> On a 32-bit system, while running aio-stress, I got the following kernel
> message:
>
> 	aio-stress used greatest stack depth: 372 bytes left
>
> That's a bit close for comfort; we may want to see if we have some
> especially piggy on-stack allocations on the AIO code paths.
>
> 						- Ted

Ted, you might try the built-in stack depth tracing stuff:

config STACK_TRACER
	bool "Trace max stack"
	depends on HAVE_FUNCTION_TRACER
	select FUNCTION_TRACER
	select STACKTRACE
	select KALLSYMS
	help
	  This special tracer records the maximum stack footprint of the
	  kernel and displays it in debugfs/tracing/stack_trace.

	  This tracer works by hooking into every function call that the
	  kernel executes, and keeping a maximum stack depth value and
	  stack-trace saved.  If this is configured with DYNAMIC_FTRACE
	  then it will not have any overhead while the stack tracer
	  is disabled.

	  To enable the stack tracer on bootup, pass in 'stacktrace'
	  on the kernel command line.

	  The stack tracer can also be enabled or disabled via the
	  sysctl kernel.stack_tracer_enabled

	  Say N if unsure.

if you got within 372 bytes on 32-bit (with 8k stacks) then that's
indeed pretty worrisome.

-Eric

* Re: Need to potentially watch stack usage for ext4 and AIO...
From: Theodore Tso @ 2009-06-21 0:49 UTC
To: Eric Sandeen; +Cc: linux-ext4

On Fri, Jun 19, 2009 at 08:46:12PM -0500, Eric Sandeen wrote:
> if you got within 372 bytes on 32-bit (with 8k stacks) then that's
> indeed pretty worrisome.

Fortunately this was with a 4k stack, but it's still not a good thing;
the 8k stack also has to support interrupts / soft irq's, whereas
CONFIG_4KSTACKS has a separate interrupt stack....

Anyone have statistics on what the worst, most evil proprietary
SCSI/FC/10gigE driver might use in terms of stack space, combined
with, say, the most evil proprietary multipath product at
interrupt/softirq time, by any chance?

In any case, here are two stack dumps that I captured, the first using
a 1k blocksize, and the second using a 4k blocksize (not that the
blocksize should make a huge amount of difference).  This time, I got
to within 200 bytes of disaster on the second stack dump.  Worse yet,
the stack usage bloat isn't in any one place; it seems to be fairly
evenly peanut-buttered across the call stack.

I can see some things we can do to optimize stack usage; for example,
struct ext4_allocation_request is allocated on the stack, and the
structure was laid out without any regard to space wastage caused by
alignment requirements.  That won't help on x86 at all, but it will
help substantially on x86_64 (since x86_64 requires that 8-byte
variables be 8-byte aligned, whereas x86 only requires 4-byte
alignment, even for unsigned long long's).  But it's going to have to
be a whole series of incremental improvements; I don't see any magic
bullet solution to our stack usage.
						- Ted

 Depth    Size   Location    (38 entries)
 -----    ----   --------
  0)     3064      48   kvm_mmu_write+0x5f/0x67
  1)     3016      16   kvm_set_pte+0x21/0x27
  2)     3000     208   __change_page_attr_set_clr+0x272/0x73b
  3)     2792      76   kernel_map_pages+0xd4/0x102
  4)     2716      80   get_page_from_freelist+0x2dd/0x3b5
  5)     2636     108   __alloc_pages_nodemask+0xf6/0x435
  6)     2528      16   alloc_slab_page+0x20/0x26
  7)     2512      60   __slab_alloc+0x171/0x470
  8)     2452       4   kmem_cache_alloc+0x8f/0x127
  9)     2448      68   radix_tree_preload+0x27/0x66
 10)     2380      56   cfq_set_request+0xf1/0x2b4
 11)     2324      16   elv_set_request+0x1c/0x2b
 12)     2308      44   get_request+0x1b0/0x25f
 13)     2264      60   get_request_wait+0x1d/0x135
 14)     2204      52   __make_request+0x24d/0x34e
 15)     2152      96   generic_make_request+0x28f/0x2d2
 16)     2056      56   submit_bio+0xb2/0xba
 17)     2000      20   submit_bh+0xe4/0x101
 18)     1980     196   ext4_mb_init_cache+0x221/0x8ad
 19)     1784     232   ext4_mb_regular_allocator+0x443/0xbda
 20)     1552      72   ext4_mb_new_blocks+0x1f6/0x46d
 21)     1480     220   ext4_ext_get_blocks+0xad9/0xc68
 22)     1260      68   ext4_get_blocks+0x10e/0x27e
 23)     1192     244   mpage_da_map_blocks+0xa7/0x720
 24)      948     108   ext4_da_writepages+0x27b/0x3d3
 25)      840      16   do_writepages+0x28/0x39
 26)      824      72   __writeback_single_inode+0x162/0x333
 27)      752      68   generic_sync_sb_inodes+0x2b6/0x426
 28)      684      20   writeback_inodes+0x8a/0xd1
 29)      664      96   balance_dirty_pages_ratelimited_nr+0x12d/0x237
 30)      568      92   generic_file_buffered_write+0x173/0x23e
 31)      476     124   __generic_file_aio_write_nolock+0x258/0x280
 32)      352      52   generic_file_aio_write+0x6e/0xc2
 33)      300      52   ext4_file_write+0xa8/0x12c
 34)      248      36   aio_rw_vect_retry+0x72/0x135
 35)      212      24   aio_run_iocb+0x69/0xfd
 36)      188     108   sys_io_submit+0x418/0x4dc
 37)       80      80   syscall_call+0x7/0xb

-------------

 Depth    Size   Location    (47 entries)
 -----    ----   --------
  0)     3556       8   kvm_clock_read+0x1b/0x1d
  1)     3548       8   sched_clock+0x8/0xb
  2)     3540      96   __lock_acquire+0x1c0/0xb21
  3)     3444      44   lock_acquire+0x94/0xb7
  4)     3400      16   _spin_lock_irqsave+0x37/0x6a
  5)     3384      28   clocksource_get_next+0x12/0x48
  6)     3356      96   update_wall_time+0x661/0x740
  7)     3260       8   do_timer+0x1b/0x22
  8)     3252      44   tick_do_update_jiffies64+0xed/0x127
  9)     3208      24   tick_sched_timer+0x47/0xa0
 10)     3184      40   __run_hrtimer+0x67/0x97
 11)     3144      56   hrtimer_interrupt+0xfe/0x151
 12)     3088      16   smp_apic_timer_interrupt+0x6f/0x82
 13)     3072      92   apic_timer_interrupt+0x2f/0x34
 14)     2980      48   kvm_mmu_write+0x5f/0x67
 15)     2932      16   kvm_set_pte+0x21/0x27
 16)     2916     208   __change_page_attr_set_clr+0x272/0x73b
 17)     2708      76   kernel_map_pages+0xd4/0x102
 18)     2632      32   free_hot_cold_page+0x74/0x1bc
 19)     2600      20   __pagevec_free+0x22/0x2a
 20)     2580     168   shrink_page_list+0x542/0x61a
 21)     2412     168   shrink_list+0x26a/0x50b
 22)     2244      96   shrink_zone+0x211/0x2a7
 23)     2148     116   try_to_free_pages+0x1db/0x2f3
 24)     2032      92   __alloc_pages_nodemask+0x2ab/0x435
 25)     1940      40   find_or_create_page+0x43/0x79
 26)     1900      84   __getblk+0x13a/0x2de
 27)     1816     164   ext4_ext_insert_extent+0x853/0xb56
 28)     1652     224   ext4_ext_get_blocks+0xb27/0xc68
 29)     1428      68   ext4_get_blocks+0x10e/0x27e
 30)     1360     244   mpage_da_map_blocks+0xa7/0x720
 31)     1116      32   __mpage_da_writepage+0x35/0x158
 32)     1084     132   write_cache_pages+0x1b1/0x293
 33)      952     112   ext4_da_writepages+0x262/0x3d3
 34)      840      16   do_writepages+0x28/0x39
 35)      824      72   __writeback_single_inode+0x162/0x333
 36)      752      68   generic_sync_sb_inodes+0x2b6/0x426
 37)      684      20   writeback_inodes+0x8a/0xd1
 38)      664      96   balance_dirty_pages_ratelimited_nr+0x12d/0x237
 39)      568      92   generic_file_buffered_write+0x173/0x23e
 40)      476     124   __generic_file_aio_write_nolock+0x258/0x280
 41)      352      52   generic_file_aio_write+0x6e/0xc2
 42)      300      52   ext4_file_write+0xa8/0x12c
 43)      248      36   aio_rw_vect_retry+0x72/0x135
 44)      212      24   aio_run_iocb+0x69/0xfd
 45)      188     108   sys_io_submit+0x418/0x4dc
 46)       80      80   syscall_call+0x7/0xb

* Re: Need to potentially watch stack usage for ext4 and AIO...
From: Eric Sandeen @ 2009-06-24 16:15 UTC
To: Theodore Tso; +Cc: linux-ext4

Theodore Tso wrote:
> On Fri, Jun 19, 2009 at 08:46:12PM -0500, Eric Sandeen wrote:
>> if you got within 372 bytes on 32-bit (with 8k stacks) then that's
>> indeed pretty worrisome.
>
> Fortunately this was with a 4k stack, but it's still not a good thing;
> the 8k stack also has to support interrupts / soft irq's, whereas
> CONFIG_4KSTACKS has a separate interrupt stack....

Hm, I thought we had irq stacks for both, but I guess not.  FWIW, F11
has gone to 8k stacks ( \o/ )

> Anyone have statistics on what the worst, most evil proprietary
> SCSI/FC/10gigE driver might use in terms of stack space, combined
> with, say, the most evil proprietary multipath product at
> interrupt/softirq time, by any chance?

Well, I'm inclined to ignore the proprietary stuff, TBH; we can't
control it.

> In any case, here are two stack dumps that I captured, the first using
> a 1k blocksize, and the second using a 4k blocksize (not that the
> blocksize should make a huge amount of difference).  This time, I got
> to within 200 bytes of disaster on the second stack dump.  Worse yet,
> the stack usage bloat isn't in any one place; it seems to be fairly
> evenly peanut-buttered across the call stack.
>
> I can see some things we can do to optimize stack usage; for example,
> struct ext4_allocation_request is allocated on the stack, and the
> structure was laid out without any regard to space wastage caused by
> alignment requirements.  That won't help on x86 at all, but it will
> help substantially on x86_64 (since x86_64 requires that 8-byte
> variables be 8-byte aligned, whereas x86 only requires 4-byte
> alignment, even for unsigned long long's).  But it's going to have to
> be a whole series of incremental improvements; I don't see any magic
> bullet solution to our stack usage.

XFS forces gcc to not inline any static function; it's extreme, but
maybe it'd help here too.
> 						- Ted
>
>  Depth    Size   Location    (38 entries)
>  -----    ----   --------
>   0)     3064      48   kvm_mmu_write+0x5f/0x67
>   1)     3016      16   kvm_set_pte+0x21/0x27
>   2)     3000     208   __change_page_attr_set_clr+0x272/0x73b

This looks like a victim of inlining.

>   3)     2792      76   kernel_map_pages+0xd4/0x102
>   4)     2716      80   get_page_from_freelist+0x2dd/0x3b5
>   5)     2636     108   __alloc_pages_nodemask+0xf6/0x435
>   6)     2528      16   alloc_slab_page+0x20/0x26
>   7)     2512      60   __slab_alloc+0x171/0x470
>   8)     2452       4   kmem_cache_alloc+0x8f/0x127
>   9)     2448      68   radix_tree_preload+0x27/0x66
>  10)     2380      56   cfq_set_request+0xf1/0x2b4
>  11)     2324      16   elv_set_request+0x1c/0x2b
>  12)     2308      44   get_request+0x1b0/0x25f
>  13)     2264      60   get_request_wait+0x1d/0x135
>  14)     2204      52   __make_request+0x24d/0x34e
>  15)     2152      96   generic_make_request+0x28f/0x2d2
>  16)     2056      56   submit_bio+0xb2/0xba
>  17)     2000      20   submit_bh+0xe4/0x101
>  18)     1980     196   ext4_mb_init_cache+0x221/0x8ad

Nothing obvious here, maybe inlining.

>  19)     1784     232   ext4_mb_regular_allocator+0x443/0xbda

Ditto.

>  20)     1552      72   ext4_mb_new_blocks+0x1f6/0x46d
>  21)     1480     220   ext4_ext_get_blocks+0xad9/0xc68

ext4_allocation_request is largeish & holey, as you said:

struct ext4_allocation_request {
	struct inode *             inode;       /*     0     8 */
	ext4_lblk_t                logical;     /*     8     4 */

	/* XXX 4 bytes hole, try to pack */

	ext4_fsblk_t               goal;        /*    16     8 */
	ext4_lblk_t                lleft;       /*    24     4 */

	/* XXX 4 bytes hole, try to pack */

	ext4_fsblk_t               pleft;       /*    32     8 */
	ext4_lblk_t                lright;      /*    40     4 */

	/* XXX 4 bytes hole, try to pack */

	ext4_fsblk_t               pright;      /*    48     8 */
	unsigned int               len;         /*    56     4 */
	unsigned int               flags;       /*    60     4 */
	/* --- cacheline 1 boundary (64 bytes) --- */

	/* size: 64, cachelines: 1, members: 9 */
	/* sum members: 52, holes: 3, sum holes: 12 */
};

>  22)     1260      68   ext4_get_blocks+0x10e/0x27e
>  23)     1192     244   mpage_da_map_blocks+0xa7/0x720

The "struct buffer_head new" on the stack hurts here; it's 104 bytes.
I really dislike the whole abuse of buffer heads as handy containers
for mapping; we don't need all these fields, I think, but that's a
battle for another day.

>  24)      948     108   ext4_da_writepages+0x27b/0x3d3
>  25)      840      16   do_writepages+0x28/0x39
>  26)      824      72   __writeback_single_inode+0x162/0x333
>  27)      752      68   generic_sync_sb_inodes+0x2b6/0x426
>  28)      684      20   writeback_inodes+0x8a/0xd1
>  29)      664      96   balance_dirty_pages_ratelimited_nr+0x12d/0x237
>  30)      568      92   generic_file_buffered_write+0x173/0x23e
>  31)      476     124   __generic_file_aio_write_nolock+0x258/0x280
>  32)      352      52   generic_file_aio_write+0x6e/0xc2
>  33)      300      52   ext4_file_write+0xa8/0x12c
>  34)      248      36   aio_rw_vect_retry+0x72/0x135
>  35)      212      24   aio_run_iocb+0x69/0xfd
>  36)      188     108   sys_io_submit+0x418/0x4dc
>  37)       80      80   syscall_call+0x7/0xb

<snip>

-Eric

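(An illustrative aside: reordering the members so the 8-byte fields come
first would close the three holes pahole reports above.  The sketch below
is only that, a sketch; it is not a patch from this thread, and the field
comments are paraphrased.)

	/*
	 * Hypothetical reordering, for illustration only: grouping the
	 * 8-byte ext4_fsblk_t members ahead of the 4-byte ext4_lblk_t
	 * and unsigned int members removes the three alignment holes on
	 * x86_64, shrinking the on-stack structure from 64 to 56 bytes.
	 * The layout on 32-bit x86 is unchanged, since everything there
	 * is already 4-byte aligned.
	 */
	struct ext4_allocation_request {
		struct inode	*inode;		/* target inode */
		ext4_fsblk_t	goal;		/* physical goal block */
		ext4_fsblk_t	pleft;		/* physical block to the left */
		ext4_fsblk_t	pright;		/* physical block to the right */
		ext4_lblk_t	logical;	/* logical block being allocated */
		ext4_lblk_t	lleft;		/* logical block to the left */
		ext4_lblk_t	lright;		/* logical block to the right */
		unsigned int	len;		/* number of blocks requested */
		unsigned int	flags;		/* allocation flags */
	};
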
* Re: Need to potentially watch stack usage for ext4 and AIO...
From: Eric Sandeen @ 2009-06-24 16:39 UTC
To: Theodore Tso; +Cc: linux-ext4

Eric Sandeen wrote:
> Theodore Tso wrote:
...
>> I can see some things we can do to optimize stack usage; for example,
>> struct ext4_allocation_request is allocated on the stack, and the
>> structure was laid out without any regard to space wastage caused by
>> alignment requirements.  That won't help on x86 at all, but it will
>> help substantially on x86_64 (since x86_64 requires that 8-byte
>> variables be 8-byte aligned, whereas x86 only requires 4-byte
>> alignment, even for unsigned long long's).  But it's going to have to
>> be a whole series of incremental improvements; I don't see any magic
>> bullet solution to our stack usage.
>
> XFS forces gcc to not inline any static function; it's extreme, but
> maybe it'd help here too.

Giving a blanket noinline treatment to mballoc.c yields some significant
stack savings:

-ext4_mb_free_blocks           200
+ext4_mb_free_blocks           184

-ext4_mb_init_cache            232
+ext4_mb_init_cache            136

-ext4_mb_regular_allocator     232
+ext4_mb_regular_allocator     104

-ext4_mb_new_blocks            104   (drops below 100 bytes)

-Eric

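(For illustration, the XFS-style treatment boils down to declaring
file-local helpers through a macro that carries the noinline attribute,
so gcc keeps each helper's locals in the helper's own frame instead of
merging them into the caller's.  The sketch below is hypothetical: the
STATIC name mirrors the fs/xfs convention, and ext4_mb_try_best_found()
is used only as an example signature, not taken from an actual patch.)

	/* XFS-style convention: every static helper gets noinline. */
	#define STATIC	static noinline

	STATIC void ext4_mb_try_best_found(struct ext4_allocation_context *ac,
					   struct ext4_buddy *e4b)
	{
		/*
		 * ... body unchanged; its locals now live in a frame that
		 * is popped as soon as the helper returns, rather than
		 * being folded into the caller's frame for its whole
		 * lifetime ...
		 */
	}
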
* Re: Need to potentially watch stack usage for ext4 and AIO...
From: Theodore Tso @ 2009-06-25 0:05 UTC
To: Eric Sandeen; +Cc: linux-ext4

On Wed, Jun 24, 2009 at 11:39:02AM -0500, Eric Sandeen wrote:
> Eric Sandeen wrote:
> > Theodore Tso wrote:
> >> I can see some things we can do to optimize stack usage; for example,
> >> struct ext4_allocation_request is allocated on the stack, and the
> >> structure was laid out without any regard to space wastage caused by
> >> alignment requirements.  That won't help on x86 at all, but it will
> >> help substantially on x86_64 (since x86_64 requires that 8-byte
> >> variables be 8-byte aligned, whereas x86 only requires 4-byte
> >> alignment, even for unsigned long long's).  But it's going to have to
> >> be a whole series of incremental improvements; I don't see any magic
> >> bullet solution to our stack usage.
> >
> > XFS forces gcc to not inline any static function; it's extreme, but
> > maybe it'd help here too.
>
> Giving a blanket noinline treatment to mballoc.c yields some significant
> stack savings:

So stupid question.  I can see how using noinline reduces the static
stack usage, but does it actually reduce the run-time stack usage?
After all, if function ext4_mb_foo() calls ext4_mb_bar(), using
noinline is a great way of seeing which function is actually
responsible for chewing up disk space, but if ext4_mb_foo() always
calls ext4_mb_bar(), and ext4_mb_bar() is a static inline only called
once by ext4_mb_foo() unconditionally, won't we ultimately end up
using more stack space (since we also have to save registers and save
the return address on the stack)?

						- Ted

* Re: Need to potentially watch stack usage for ext4 and AIO...
From: Eric Sandeen @ 2009-06-25 0:32 UTC
To: Theodore Tso; +Cc: linux-ext4

Theodore Tso wrote:
> On Wed, Jun 24, 2009 at 11:39:02AM -0500, Eric Sandeen wrote:
>> Eric Sandeen wrote:
>>> Theodore Tso wrote:
>>>> I can see some things we can do to optimize stack usage; for example,
>>>> struct ext4_allocation_request is allocated on the stack, and the
>>>> structure was laid out without any regard to space wastage caused by
>>>> alignment requirements.  That won't help on x86 at all, but it will
>>>> help substantially on x86_64 (since x86_64 requires that 8-byte
>>>> variables be 8-byte aligned, whereas x86 only requires 4-byte
>>>> alignment, even for unsigned long long's).  But it's going to have to
>>>> be a whole series of incremental improvements; I don't see any magic
>>>> bullet solution to our stack usage.
>>>
>>> XFS forces gcc to not inline any static function; it's extreme, but
>>> maybe it'd help here too.
>>
>> Giving a blanket noinline treatment to mballoc.c yields some significant
>> stack savings:
>
> So stupid question.  I can see how using noinline reduces the static
> stack usage, but does it actually reduce the run-time stack usage?
> After all, if function ext4_mb_foo() calls ext4_mb_bar(), using
> noinline is a great way of seeing which function is actually
> responsible for chewing up disk space, but if ext4_mb_foo() always
                             ^^stack :)
> calls ext4_mb_bar(), and ext4_mb_bar() is a static inline only called
> once by ext4_mb_foo() unconditionally, won't we ultimately end up
> using more stack space (since we also have to save registers and save
> the return address on the stack)?

True, so maybe I should be a bit more careful w/ that patch I sent, and
do more detailed callchain analysis to be sure that it's all warranted.

But here's how the noinlining can help, at least:

foo()
	bar()
	baz()
	whoop()

If they're each 100 bytes of stack usage on their own, and bar(), baz()
and whoop() all get inlined into foo(), then foo() uses ~400 bytes,
because it's all taken off the stack when we subtract from %rsp when we
enter foo().

But if we don't inline bar(), baz() and whoop(), then at worst we have
~200 bytes used; 100 when we enter foo(), 100 more (200 total) when we
enter bar(), then we return to foo() (popping the stack back to 100),
and again at 200 when we enter baz(), and again only 200 when we get
into whoop().

If it were just:

foo()
	bar()

then you're right, noinlining bar() wouldn't help, and probably hurts;
so I probably need to look more closely at the shotgun-approach patch I
sent.  :)

I had found some tools once to do static callchain analysis & graph
them; maybe time to break it out again.

-Eric

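(To make the arithmetic above concrete, here is a small userspace sketch
of the foo()/bar()/baz()/whoop() case; the 100-byte buffers, the printf
calls that keep the buffers live, and the explicit noinline attributes
are the illustrative parts.  Comparing the compiler's reported frame
sizes, for instance via gcc's -fstack-usage option, with and without the
attributes should show roughly the ~400-byte vs ~200-byte difference
described above.)

	/*
	 * foo() calls three helpers in sequence, each with ~100 bytes of
	 * locals.  If gcc inlines all three, their buffers coexist in
	 * foo()'s single frame (~400 bytes live at once).  With noinline,
	 * each helper's 100 bytes exists only while that call is active,
	 * so the worst case is roughly foo()'s frame plus one helper's
	 * frame (~200 bytes).
	 */
	#include <stdio.h>
	#include <string.h>

	static __attribute__((noinline)) void bar(void)
	{
		char buf[100];			/* popped when bar() returns */
		memset(buf, 'b', sizeof(buf));
		printf("%.1s\n", buf);
	}

	static __attribute__((noinline)) void baz(void)
	{
		char buf[100];
		memset(buf, 'z', sizeof(buf));
		printf("%.1s\n", buf);
	}

	static __attribute__((noinline)) void whoop(void)
	{
		char buf[100];
		memset(buf, 'w', sizeof(buf));
		printf("%.1s\n", buf);
	}

	static void foo(void)
	{
		char buf[100];			/* foo()'s own locals */
		memset(buf, 'f', sizeof(buf));
		printf("%.1s\n", buf);

		bar();		/* each call pushes ~100 bytes, pops it on return */
		baz();
		whoop();
	}

	int main(void)
	{
		foo();
		return 0;
	}
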
* Re: Need to potentially watch stack usage for ext4 and AIO...
From: Eric Sandeen @ 2009-06-25 4:58 UTC
To: Theodore Tso; +Cc: linux-ext4

Eric Sandeen wrote:
> I had found some tools once to do static callchain analysis & graph
> them; maybe time to break it out again.

codeviz was the tool; getting it to work is fiddly.  But here, for
example, are some of the callers of ext4_mb_init_cache() (one of the
functions at the bottom of your deep chain), with stack usage shown and
the piggish ones highlighted in red:

http://sandeen.fedorapeople.org/ext4/ext4_mb_init_cache_callers.png

This is actually only an analysis of the functions in mballoc.c, but
that's relevant for the static / noinline decisions.  The stack usage
values were taken after my attempt to get gcc to inline -nothing- at
all.

So there you can see that ext4_mb_regular_allocator by itself uses 104
bytes, but calls several other functions which get inlined normally:

	ext4_mb_try_best_found    16
	ext4_mb_try_by_goal       56
	ext4_mb_load_buddy        24
	ext4_mb_init_group        24

Without all the noinlining, ext4_mb_regular_allocator uses 232 bytes
... 104+16+56+24+24 = 224 is close to that.

On the flip side, here are the functions called by ext4_mb_init_cache()
within mballoc.c:

http://sandeen.fedorapeople.org/ext4/ext4_mb_init_cache_callees.png

Here too I think you can see that if much of that gets inlined, it'll
bloat that function.

A bit more analysis like this might yield some prudent changes ... but
it's tedious.  :)

-Eric