* Need to potentially watch stack usage for ext4 and AIO...
From: Theodore Ts'o @ 2009-06-19 17:59 UTC (permalink / raw)
To: linux-ext4
On a 32-bit system, while running aio-stress, I got the following kernel
message:
aio-stress used greatest stack depth: 372 bytes left
That's a bit close for comfort; we may want to see if we have some
especially piggy on-stack allocations on the AIO code paths.
- Ted
* Re: Need to potentially watch stack usage for ext4 and AIO...
From: Eric Sandeen @ 2009-06-20 1:46 UTC (permalink / raw)
To: Theodore Ts'o; +Cc: linux-ext4
Theodore Ts'o wrote:
> On a 32-bit system, while running aio-stress, I got the following kernel
> message:
>
> aio-stress used greatest stack depth: 372 bytes left
>
> That's a bit close for comfort; we may want to see if we have some
> especially piggy on-stack allocations on the AIO code paths.
>
> - Ted
Ted, you might try the built-in stack depth tracing stuff:
config STACK_TRACER
bool "Trace max stack"
depends on HAVE_FUNCTION_TRACER
select FUNCTION_TRACER
select STACKTRACE
select KALLSYMS
help
This special tracer records the maximum stack footprint of the
kernel and displays it in debugfs/tracing/stack_trace.
This tracer works by hooking into every function call that the
kernel executes, and keeping a maximum stack depth value and
stack-trace saved. If this is configured with DYNAMIC_FTRACE
then it will not have any overhead while the stack tracer
is disabled.
To enable the stack tracer on bootup, pass in 'stacktrace'
on the kernel command line.
The stack tracer can also be enabled or disabled via the
sysctl kernel.stack_tracer_enabled
Say N if unsure.
if you got within 372 bytes on 32-bit (with 8k stacks) then that's
indeed pretty worrisome.
-Eric
* Re: Need to potentially watch stack usage for ext4 and AIO...
From: Theodore Tso @ 2009-06-21 0:49 UTC (permalink / raw)
To: Eric Sandeen; +Cc: linux-ext4
On Fri, Jun 19, 2009 at 08:46:12PM -0500, Eric Sandeen wrote:
> if you got within 372 bytes on 32-bit (with 8k stacks) then that's
> indeed pretty worrisome.
Fortunately this was with a 4k stack, but it's still not a good thing;
the 8k stack also has to support interrupts / softirqs, whereas
CONFIG_4KSTACKS has a separate interrupt stack....
Anyone have statistics on what the worst, most evil proprietary
SCSI/FC/10gigE driver might use in terms of stack space, combined with
say, the most evil proprietary multipath product at interrupt/softirq
time, by any chance?
In any case, here are two stack dumps that I captured, the first using
a 1k blocksize, and the second using a 4k blocksize (not that the
blocksize should make a huge amount of difference). This time, I got
to within 200 bytes of disaster on the second stack dump. Worse yet,
the stack usage bloat isn't in any one place; it seems to be fairly
peanut-buttered across the call stack.
I can see some things we can do to optimize stack usage; for example,
struct ext4_allocation_request is allocated on the stack, and the
structure was laid out without any regard to space wastage caused by
alignment requirements. That won't help on x86 at all, but it will
help substantially on x86_64 (since x86_64 requires that 8-byte
variables be 8-byte aligned, whereas x86 only requires 4-byte
alignment, even for unsigned long longs). But it's going to have to be
a whole series of incremental improvements; I don't see any magic
bullet solution to our stack usage.
- Ted
Depth Size Location (38 entries)
----- ---- --------
0) 3064 48 kvm_mmu_write+0x5f/0x67
1) 3016 16 kvm_set_pte+0x21/0x27
2) 3000 208 __change_page_attr_set_clr+0x272/0x73b
3) 2792 76 kernel_map_pages+0xd4/0x102
4) 2716 80 get_page_from_freelist+0x2dd/0x3b5
5) 2636 108 __alloc_pages_nodemask+0xf6/0x435
6) 2528 16 alloc_slab_page+0x20/0x26
7) 2512 60 __slab_alloc+0x171/0x470
8) 2452 4 kmem_cache_alloc+0x8f/0x127
9) 2448 68 radix_tree_preload+0x27/0x66
10) 2380 56 cfq_set_request+0xf1/0x2b4
11) 2324 16 elv_set_request+0x1c/0x2b
12) 2308 44 get_request+0x1b0/0x25f
13) 2264 60 get_request_wait+0x1d/0x135
14) 2204 52 __make_request+0x24d/0x34e
15) 2152 96 generic_make_request+0x28f/0x2d2
16) 2056 56 submit_bio+0xb2/0xba
17) 2000 20 submit_bh+0xe4/0x101
18) 1980 196 ext4_mb_init_cache+0x221/0x8ad
19) 1784 232 ext4_mb_regular_allocator+0x443/0xbda
20) 1552 72 ext4_mb_new_blocks+0x1f6/0x46d
21) 1480 220 ext4_ext_get_blocks+0xad9/0xc68
22) 1260 68 ext4_get_blocks+0x10e/0x27e
23) 1192 244 mpage_da_map_blocks+0xa7/0x720
24) 948 108 ext4_da_writepages+0x27b/0x3d3
25) 840 16 do_writepages+0x28/0x39
26) 824 72 __writeback_single_inode+0x162/0x333
27) 752 68 generic_sync_sb_inodes+0x2b6/0x426
28) 684 20 writeback_inodes+0x8a/0xd1
29) 664 96 balance_dirty_pages_ratelimited_nr+0x12d/0x237
30) 568 92 generic_file_buffered_write+0x173/0x23e
31) 476 124 __generic_file_aio_write_nolock+0x258/0x280
32) 352 52 generic_file_aio_write+0x6e/0xc2
33) 300 52 ext4_file_write+0xa8/0x12c
34) 248 36 aio_rw_vect_retry+0x72/0x135
35) 212 24 aio_run_iocb+0x69/0xfd
36) 188 108 sys_io_submit+0x418/0x4dc
37) 80 80 syscall_call+0x7/0xb
-------------
Depth Size Location (47 entries)
----- ---- --------
0) 3556 8 kvm_clock_read+0x1b/0x1d
1) 3548 8 sched_clock+0x8/0xb
2) 3540 96 __lock_acquire+0x1c0/0xb21
3) 3444 44 lock_acquire+0x94/0xb7
4) 3400 16 _spin_lock_irqsave+0x37/0x6a
5) 3384 28 clocksource_get_next+0x12/0x48
6) 3356 96 update_wall_time+0x661/0x740
7) 3260 8 do_timer+0x1b/0x22
8) 3252 44 tick_do_update_jiffies64+0xed/0x127
9) 3208 24 tick_sched_timer+0x47/0xa0
10) 3184 40 __run_hrtimer+0x67/0x97
11) 3144 56 hrtimer_interrupt+0xfe/0x151
12) 3088 16 smp_apic_timer_interrupt+0x6f/0x82
13) 3072 92 apic_timer_interrupt+0x2f/0x34
14) 2980 48 kvm_mmu_write+0x5f/0x67
15) 2932 16 kvm_set_pte+0x21/0x27
16) 2916 208 __change_page_attr_set_clr+0x272/0x73b
17) 2708 76 kernel_map_pages+0xd4/0x102
18) 2632 32 free_hot_cold_page+0x74/0x1bc
19) 2600 20 __pagevec_free+0x22/0x2a
20) 2580 168 shrink_page_list+0x542/0x61a
21) 2412 168 shrink_list+0x26a/0x50b
22) 2244 96 shrink_zone+0x211/0x2a7
23) 2148 116 try_to_free_pages+0x1db/0x2f3
24) 2032 92 __alloc_pages_nodemask+0x2ab/0x435
25) 1940 40 find_or_create_page+0x43/0x79
26) 1900 84 __getblk+0x13a/0x2de
27) 1816 164 ext4_ext_insert_extent+0x853/0xb56
28) 1652 224 ext4_ext_get_blocks+0xb27/0xc68
29) 1428 68 ext4_get_blocks+0x10e/0x27e
30) 1360 244 mpage_da_map_blocks+0xa7/0x720
31) 1116 32 __mpage_da_writepage+0x35/0x158
32) 1084 132 write_cache_pages+0x1b1/0x293
33) 952 112 ext4_da_writepages+0x262/0x3d3
34) 840 16 do_writepages+0x28/0x39
35) 824 72 __writeback_single_inode+0x162/0x333
36) 752 68 generic_sync_sb_inodes+0x2b6/0x426
37) 684 20 writeback_inodes+0x8a/0xd1
38) 664 96 balance_dirty_pages_ratelimited_nr+0x12d/0x237
39) 568 92 generic_file_buffered_write+0x173/0x23e
40) 476 124 __generic_file_aio_write_nolock+0x258/0x280
41) 352 52 generic_file_aio_write+0x6e/0xc2
42) 300 52 ext4_file_write+0xa8/0x12c
43) 248 36 aio_rw_vect_retry+0x72/0x135
44) 212 24 aio_run_iocb+0x69/0xfd
45) 188 108 sys_io_submit+0x418/0x4dc
46) 80 80 syscall_call+0x7/0xb
* Re: Need to potentially watch stack usage for ext4 and AIO...
From: Eric Sandeen @ 2009-06-24 16:15 UTC (permalink / raw)
To: Theodore Tso; +Cc: linux-ext4
Theodore Tso wrote:
> On Fri, Jun 19, 2009 at 08:46:12PM -0500, Eric Sandeen wrote:
>> if you got within 372 bytes on 32-bit (with 8k stacks) then that's
>> indeed pretty worrisome.
>
> Fortunately this was with a 4k stack, but it's still not a good thing;
> the 8k stack also has to support interrupts / softirqs, whereas
> CONFIG_4KSTACKS has a separate interrupt stack....
Hm, I thought we had irq stacks for both, but I guess not.
FWIW F11 has gone to 8k stacks ( \o/ )
> Anyone have statistics on what the worst, most evil proprietary
> SCSI/FC/10gigE driver might use in terms of stack space, combined with
> say, the most evil proprietary multipath product at interrupt/softirq
> time, by any chance?
Well, I'm inclined to ignore the proprietary stuff, TBH; we can't control it.
> In any case, here are two stack dumps that I captured, the first using
> a 1k blocksize, and the second using a 4k blocksize (not that the
> blocksize should make a huge amount of difference). This time, I got
> to within 200 bytes of disaster on the second stack dump. Worse yet,
> the stack usage bloat isn't in any one place; it seems to be fairly
> peanut-buttered across the call stack.
>
> I can see some things we can do to optimize stack usage; for example,
> struct ext4_allocation_request is allocated on the stack, and the
> structure was laid out without any regard to space wastage caused by
> alignment requirements. That won't help on x86 at all, but it will
> help substantially on x86_64 (since x86_64 requires that 8-byte
> variables be 8-byte aligned, whereas x86 only requires 4-byte
> alignment, even for unsigned long longs). But it's going to have to be
> a whole series of incremental improvements; I don't see any magic
> bullet solution to our stack usage.
XFS forces gcc to not inline any static function; it's extreme, but
maybe it'd help here too.
> - Ted
>
>
> Depth Size Location (38 entries)
> ----- ---- --------
> 0) 3064 48 kvm_mmu_write+0x5f/0x67
> 1) 3016 16 kvm_set_pte+0x21/0x27
> 2) 3000 208 __change_page_attr_set_clr+0x272/0x73b
This looks like a victim of inlining
> 3) 2792 76 kernel_map_pages+0xd4/0x102
> 4) 2716 80 get_page_from_freelist+0x2dd/0x3b5
> 5) 2636 108 __alloc_pages_nodemask+0xf6/0x435
> 6) 2528 16 alloc_slab_page+0x20/0x26
> 7) 2512 60 __slab_alloc+0x171/0x470
> 8) 2452 4 kmem_cache_alloc+0x8f/0x127
> 9) 2448 68 radix_tree_preload+0x27/0x66
> 10) 2380 56 cfq_set_request+0xf1/0x2b4
> 11) 2324 16 elv_set_request+0x1c/0x2b
> 12) 2308 44 get_request+0x1b0/0x25f
> 13) 2264 60 get_request_wait+0x1d/0x135
> 14) 2204 52 __make_request+0x24d/0x34e
> 15) 2152 96 generic_make_request+0x28f/0x2d2
> 16) 2056 56 submit_bio+0xb2/0xba
> 17) 2000 20 submit_bh+0xe4/0x101
> 18) 1980 196 ext4_mb_init_cache+0x221/0x8ad
nothing obvious here, maybe inlining
> 19) 1784 232 ext4_mb_regular_allocator+0x443/0xbda
ditto
> 20) 1552 72 ext4_mb_new_blocks+0x1f6/0x46d
> 21) 1480 220 ext4_ext_get_blocks+0xad9/0xc68
ext4_allocation_request is largeish & holey as you said:
struct ext4_allocation_request {
struct inode * inode; /* 0 8 */
ext4_lblk_t logical; /* 8 4 */
/* XXX 4 bytes hole, try to pack */
ext4_fsblk_t goal; /* 16 8 */
ext4_lblk_t lleft; /* 24 4 */
/* XXX 4 bytes hole, try to pack */
ext4_fsblk_t pleft; /* 32 8 */
ext4_lblk_t lright; /* 40 4 */
/* XXX 4 bytes hole, try to pack */
ext4_fsblk_t pright; /* 48 8 */
unsigned int len; /* 56 4 */
unsigned int flags; /* 60 4 */
/* --- cacheline 1 boundary (64 bytes) --- */
/* size: 64, cachelines: 1, members: 9 */
/* sum members: 52, holes: 3, sum holes: 12 */
};
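As an illustration only, one way the holes could be squeezed out is to
group the 8-byte members first; the struct name and the standalone
typedefs below are made up for this sketch, not an actual patch:

/* Hypothetical repacking sketch -- not an actual ext4 patch.  Putting
 * the 64-bit members ahead of the 32-bit ones eliminates the three
 * 4-byte holes pahole reports above, shrinking the structure from 64
 * to 56 bytes on x86_64 (52 bytes of members plus 4 bytes of tail
 * padding).  On 32-bit x86 nothing changes, matching Ted's point that
 * the reordering only helps x86_64.
 */
typedef unsigned long long ext4_fsblk_t; /* matches the ext4 typedef */
typedef unsigned int ext4_lblk_t;        /* matches the ext4 typedef */
struct inode;                            /* opaque for this sketch */

struct ext4_allocation_request_packed {
	struct inode *inode;    /* target inode */
	ext4_fsblk_t goal;      /* preferred physical block (64-bit) */
	ext4_fsblk_t pleft;     /* physical block to the left */
	ext4_fsblk_t pright;    /* physical block to the right */
	ext4_lblk_t logical;    /* logical block being allocated (32-bit) */
	ext4_lblk_t lleft;      /* logical block to the left */
	ext4_lblk_t lright;     /* logical block to the right */
	unsigned int len;       /* number of blocks requested */
	unsigned int flags;     /* allocation hint flags */
};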
> 22) 1260 68 ext4_get_blocks+0x10e/0x27e
> 23) 1192 244 mpage_da_map_blocks+0xa7/0x720
The struct buffer_head "new" on the stack hurts here; it's 104 bytes.
I really dislike the whole abuse of buffer heads as handy containers for
mapping; we don't need all these fields, I think, but that's a battle
for another day.
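Purely as a hypothetical sketch of that battle for another day --
nothing like this was proposed here, and the name and fields are
illustrative only -- a map-only container could be far smaller than a
buffer_head:

/* Hypothetical sketch: a map-only container carrying just what the
 * delayed-allocation writeback path needs, instead of a full 104-byte
 * struct buffer_head.  Roughly 24 bytes on x86_64.
 */
typedef unsigned long long ext4_fsblk_t; /* matches the ext4 typedef */
typedef unsigned int ext4_lblk_t;        /* matches the ext4 typedef */

struct ext4_map_sketch {
	ext4_fsblk_t pblk;      /* first physical block of the mapping */
	ext4_lblk_t lblk;       /* first logical block of the mapping */
	unsigned int len;       /* number of blocks in the mapping */
	unsigned int flags;     /* mapped / new / unwritten state bits */
};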
> 24) 948 108 ext4_da_writepages+0x27b/0x3d3
> 25) 840 16 do_writepages+0x28/0x39
> 26) 824 72 __writeback_single_inode+0x162/0x333
> 27) 752 68 generic_sync_sb_inodes+0x2b6/0x426
> 28) 684 20 writeback_inodes+0x8a/0xd1
> 29) 664 96 balance_dirty_pages_ratelimited_nr+0x12d/0x237
> 30) 568 92 generic_file_buffered_write+0x173/0x23e
> 31) 476 124 __generic_file_aio_write_nolock+0x258/0x280
> 32) 352 52 generic_file_aio_write+0x6e/0xc2
> 33) 300 52 ext4_file_write+0xa8/0x12c
> 34) 248 36 aio_rw_vect_retry+0x72/0x135
> 35) 212 24 aio_run_iocb+0x69/0xfd
> 36) 188 108 sys_io_submit+0x418/0x4dc
> 37) 80 80 syscall_call+0x7/0xb
<snip>
-Eric
* Re: Need to potentially watch stack usage for ext4 and AIO...
From: Eric Sandeen @ 2009-06-24 16:39 UTC (permalink / raw)
To: Theodore Tso; +Cc: linux-ext4
Eric Sandeen wrote:
> Theodore Tso wrote:
...
>> I can see some things we can do to optimize stack usage; for example,
>> struct ext4_allocation_request is allocated on the stack, and the
>> structure was laid out without any regard to space wastage caused by
>> alignment requirements. That won't help on x86 at all, but it will
>> help substantially on x86_64 (since x86_64 requires that 8-byte
>> variables be 8-byte aligned, whereas x86 only requires 4-byte
>> alignment, even for unsigned long longs). But it's going to have to be
>> a whole series of incremental improvements; I don't see any magic
>> bullet solution to our stack usage.
>
> XFS forces gcc to not inline any static function; it's extreme, but
> maybe it'd help here too.
Giving a blanket noinline treatment to mballoc.c yields some significant
stack savings:
-ext4_mb_free_blocks 200
+ext4_mb_free_blocks 184
-ext4_mb_init_cache 232
+ext4_mb_init_cache 136
-ext4_mb_regular_allocator 232
+ext4_mb_regular_allocator 104
-ext4_mb_new_blocks 104
(drops below 100 bytes)
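For the record, a sketch of how such a blanket treatment could be
applied, in the spirit of XFS's STATIC macro (the helper name below is
made up; this is not the actual patch):

/* Sketch only: force gcc to keep each static helper out of line, so
 * its locals live in their own frame and are popped on return instead
 * of being accumulated in the caller's single big frame.
 */
#define STATIC static __attribute__((__noinline__))

STATIC int ext4_mb_example_helper(int order)
{
	char scratch[64];       /* on-stack scratch, popped on return */

	scratch[0] = (char)order;
	return scratch[0] + (int)sizeof(scratch);
}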
-Eric
* Re: Need to potentially watch stack usage for ext4 and AIO...
From: Theodore Tso @ 2009-06-25 0:05 UTC (permalink / raw)
To: Eric Sandeen; +Cc: linux-ext4
On Wed, Jun 24, 2009 at 11:39:02AM -0500, Eric Sandeen wrote:
> Eric Sandeen wrote:
> > Theodore Tso wrote:
>
> >> I can see some things we can do to optimize stack usage; for example,
> >> struct ext4_allocation_request is allocated on the stack, and the
> >> structure was laid out without any regard to space wastage caused by
> >> alignment requirements. That won't help on x86 at all, but it will
> >> help substantially on x86_64 (since x86_64 requires that 8-byte
> >> variables be 8-byte aligned, whereas x86 only requires 4-byte
> >> alignment, even for unsigned long longs). But it's going to have to be
> >> a whole series of incremental improvements; I don't see any magic
> >> bullet solution to our stack usage.
> >
> > XFS forces gcc to not inline any static function; it's extreme, but
> > maybe it'd help here too.
>
> Giving a blanket noinline treatment to mballoc.c yields some significant
> stack savings:
So stupid question. I can see how using noinline reduces the static
stack savings, but does it actually reduce the run-time stack usage?
After all, if function ext4_mb_foo() calls ext4_mb_bar(), using
noinline is a great way for seeing which function is actually
responsible for chewing up disk space, but if ext4_mb_foo() always
calls ext4_mb_bar(), and ext4_mb_bar() is a static inline only called
once by ext4_mb_foo() unconditionally, won't we ultimately end up
using more disk space (since we also have to save registers and save
the return address on the stack)?
- Ted
* Re: Need to potentially watch stack usage for ext4 and AIO...
From: Eric Sandeen @ 2009-06-25 0:32 UTC (permalink / raw)
To: Theodore Tso; +Cc: linux-ext4
Theodore Tso wrote:
> On Wed, Jun 24, 2009 at 11:39:02AM -0500, Eric Sandeen wrote:
>> Eric Sandeen wrote:
>>> Theodore Tso wrote:
>>>> I can see some things we can do to optimize stack usage; for example,
>>>> struct ext4_allocation_request is allocated on the stack, and the
>>>> structure was laid out without any regard to space wastage caused by
>>>> alignment requirements. That won't help on x86 at all, but it will
>>>> help substantially on x86_64 (since x86_64 requires that 8-byte
>>>> variables be 8-byte aligned, whereas x86 only requires 4-byte
>>>> alignment, even for unsigned long longs). But it's going to have to be
>>>> a whole series of incremental improvements; I don't see any magic
>>>> bullet solution to our stack usage.
>>> XFS forces gcc to not inline any static function; it's extreme, but
>>> maybe it'd help here too.
>> Giving a blanket noinline treatment to mballoc.c yields some significant
>> stack savings:
>
> So stupid question. I can see how using noinline reduces the static
> stack savings, but does it actually reduce the run-time stack usage?
> After all, if function ext4_mb_foo() calls ext4_mb_bar(), using
> noinline is a great way for seeing which function is actually
> responsible for chewing up disk space, but if ext4_mb_foo() always
^^stack :)
> calls ext4_mb_bar(), and ext4_mb_bar() is a static inline only called
> once by ext4_mb_foo() unconditionally, won't we ultimately end up
> using more disk space (since we also have to save registers and save
> the return address on the stack)?
True, so maybe I should be a bit more careful w/ that patch I sent, and
do more detailed callchain analysis to be sure that it's all warranted.
But here's how the noinlining can help, at least:
foo()
bar()
baz()
whoop()
If they're each 100 bytes of stack usage on their own, and bar() baz()
and whoop() all get inlined into foo(), then foo() uses ~400 bytes,
because it's all taken off the stack when we subtract from %rsp when we
enter foo().
But if we don't inline bar() baz() and whoop(), then at worst we have
~200 bytes used; 100 when we enter foo(), 100 more (200 total) when we
enter bar(), then we return to foo() (popping the stack back to 100),
and again at 200 when we enter baz(), and again only 200 when we get
into whoop().
if it were just:
foo()
bar()
then you're right, noinlining bar() wouldn't help, and probably hurts -
so I probably need to look more closely at the shotgun approach patch I
sent. :)
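To make the two shapes concrete, here's a toy userspace version (names
and sizes made up, just to show where the frames live):

#define NOINLINE __attribute__((__noinline__))

static NOINLINE void bar(void)   { volatile char buf[100]; buf[0] = 0; }
static NOINLINE void baz(void)   { volatile char buf[100]; buf[0] = 0; }
static NOINLINE void whoop(void) { volatile char buf[100]; buf[0] = 0; }

void foo(void)
{
	volatile char buf[100]; /* foo()'s own ~100 bytes */

	buf[0] = 0;
	bar();                  /* ~200 bytes live while bar() runs, then popped */
	baz();                  /* deepest point is ~200 again, not 300 */
	whoop();                /* peak stays ~200; fully inlined it would be ~400 */
}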
I had found some tools once to do static callchain analysis & graph
them, maybe time to break it out again.
-Eric
* Re: Need to potentially watch stack usage for ext4 and AIO...
From: Eric Sandeen @ 2009-06-25 4:58 UTC (permalink / raw)
To: Theodore Tso; +Cc: linux-ext4
Eric Sandeen wrote:
> I had found some tools once to do static callchain analysis & graph
> them, maybe time to break it out again.
codeviz was the tool; getting it to work is fiddly. But here, for
example, are some of the callers of ext4_mb_init_cache() (one of the
functions at the bottom of your deep chain), with stack usage and
piggish ones highlighted in red:
http://sandeen.fedorapeople.org/ext4/ext4_mb_init_cache_callers.png
This is actually only analysis of the functions in mballoc.c, but that's
relevant for the static / noinline decisions.
The stack usage values were after my attempt to get gcc to inline
-nothing- at all.
So there you can see that ext4_mb_regular_allocator by itself uses 104
bytes, but calls several other functions which get inlined normally:
ext4_mb_try_best_found 16
ext4_mb_try_by_goal 56
ext4_mb_load_buddy 24
ext4_mb_init_group 24
Without all the noinlining, ext4_mb_regular_allocator uses 232 bytes ...
104+16+56+24+24 = 224 is close to that.
On the flip side, here are the functions called by ext4_mb_init_cache()
within mballoc.c:
http://sandeen.fedorapeople.org/ext4/ext4_mb_init_cache_callees.png
Here too I think you can see that if much of that gets inlined, it'll
bloat that function.
A bit more analysis like this might yield some prudent changes ... but
it's tedious. :)
-Eric
Thread overview: 8 messages
2009-06-19 17:59 Need to potentially watch stack usage for ext4 and AIO Theodore Ts'o
2009-06-20 1:46 ` Eric Sandeen
2009-06-21 0:49 ` Theodore Tso
2009-06-24 16:15 ` Eric Sandeen
2009-06-24 16:39 ` Eric Sandeen
2009-06-25 0:05 ` Theodore Tso
2009-06-25 0:32 ` Eric Sandeen
2009-06-25 4:58 ` Eric Sandeen