* raid5 high cpu usage during reads - oprofile results
From: Alex Izvorski @ 2006-03-25 8:38 UTC
To: linux-raid
Hello,
I profiled some raid5 reads using oprofile to try to track down the
suspiciously high cpu load I see. This uses the same 8-disk SATA setup
as I had described earlier. One of the runs is on a 1MB chunk raid5, the
other on a 32MB chunk raid5. As Neil suggested, memcpy is a big part of
the cpu load. Much of the rest appears to be in handle_stripe and
get_active_stripe - these three account for most of the load, with the
remainder fairly evenly distributed among a dozen other routines. If
using a large chunk size, handle_stripe and get_active_stripe will
predominate (and result in some truly abnormal cpu loads). I am
attaching annotated (from opannotate) source and assembly for raid5.c
from the second (32MB chunk) run. The annotated results do not make
much sense to me, but I suspect the exact line numbers, etc., may be
shifted slightly, as usually happens with optimized builds. I hope this
is useful.
Regards,
--Alex
P.S. I had originally mailed the oprofile results as attachments to the
list, but I think they didn't go through. I put them at:
http://linuxraid.pastebin.com/621363 - oprofile annotated assembly
http://linuxraid.pastebin.com/621364 - oprofile annotated source
Sorry if you get this email twice.
-----------------------------------------------------------------------
mdadm --create /dev/md0 --level=raid5 --chunk=1024 --raid-devices=8 \
--size=10485760 /dev/sd[abcdefgh]1
echo "8192" > /sys/block/md0/md/stripe_cache_size
./test_aio -f /dev/md0 -T 10 -s 60G -r 8M -n 14
throughput 205MB/s, cpu load 40%
opreport --symbols --image-path=/lib/modules/2.6.15-gentoo-r7/kernel/
samples  %        image name  app name  symbol name
  91839  42.9501  vmlinux     vmlinux   memcpy
  25946  12.1341  raid5.ko    raid5     handle_stripe
  17732   8.2927  raid5.ko    raid5     get_active_stripe
   5454   2.5507  vmlinux     vmlinux   blk_rq_map_sg
   4850   2.2682  vmlinux     vmlinux   __delay
   4726   2.2102  raid5.ko    raid5     raid5_compute_sector
   4588   2.1457  raid5.ko    raid5     copy_data
   4389   2.0526  raid5.ko    raid5     .text
   4362   2.0400  raid5.ko    raid5     make_request
   3688   1.7248  vmlinux     vmlinux   clear_page
   3548   1.6593  raid5.ko    raid5     raid5_end_read_request
   2594   1.2131  vmlinux     vmlinux   blk_recount_segments
   1944   0.9091  vmlinux     vmlinux   generic_make_request
   1627   0.7609  vmlinux     vmlinux   dma_map_sg
   1555   0.7272  vmlinux     vmlinux   get_user_pages
   1540   0.7202  libata.ko   libata    ata_bmdma_status
   1464   0.6847  vmlinux     vmlinux   __make_request
   1350   0.6314  vmlinux     vmlinux   follow_page
   1098   0.5135  vmlinux     vmlinux   finish_wait
...(others under 0.5%)
-----------------------------------------------------------------------
mdadm --create /dev/md0 --level=raid5 --chunk=32768 --raid-devices=8 \
--size=10485760 /dev/sd[abcdefgh]1
echo "16384" > /sys/block/md0/md/stripe_cache_size
./test_aio -f /dev/md0 -T 10 -s 60G -r 8M -n 14
throughput 207MB/s, cpu load 80%
opreport --symbols --image-path=/lib/modules/2.6.15-gentoo-r7/kernel/
samples  %        image name  app name  symbol name
 112441  28.2297  raid5.ko    raid5     get_active_stripe
  86826  21.7988  vmlinux     vmlinux   memcpy
  78688  19.7556  raid5.ko    raid5     handle_stripe
  18796   4.7190  raid5.ko    raid5     .text
  13301   3.3394  raid5.ko    raid5     raid5_compute_sector
   8339   2.0936  raid5.ko    raid5     make_request
   5881   1.4765  vmlinux     vmlinux   blk_rq_map_sg
   5463   1.3716  raid5.ko    raid5     raid5_end_read_request
   4269   1.0718  vmlinux     vmlinux   __delay
   4131   1.0371  raid5.ko    raid5     copy_data
   3531   0.8865  vmlinux     vmlinux   clear_page
   2964   0.7441  vmlinux     vmlinux   blk_recount_segments
   2617   0.6570  vmlinux     vmlinux   get_user_pages
   2025   0.5084  vmlinux     vmlinux   dma_map_sg
...(others under 0.5%)
-----------------------------------------------------------------------
* Re: raid5 high cpu usage during reads - oprofile results
From: dean gaudet @ 2006-03-26 5:38 UTC
To: Alex Izvorski; +Cc: linux-raid
On Sat, 25 Mar 2006, Alex Izvorski wrote:
> http://linuxraid.pastebin.com/621363 - oprofile annotated assembly
it looks to me like a lot of time is spent in __find_stripe() ... i wonder
if the hash is working properly.
in raid5.c you could try changing
#define stripe_hash(conf, sect) (&((conf)->stripe_hashtbl[((sect) >> STRIPE_SHIFT) & HASH_MASK]))
to
#define stripe_hash(conf, sect) (&((conf)->stripe_hashtbl[(((sect) >> STRIPE_SHIFT) ^ ((sect) >> (2*STRIPE_SHIFT))) & HASH_MASK]))
or maybe try using jhash_1word((sect) >> STRIPE_SHIFT, 0) ...
-dean
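A rough sketch of what the jhash variant dean mentions could look like (untested
and hypothetical; the cast down to u32 and the zero initval are assumptions on my
part, and jhash_1word() itself comes from <linux/jhash.h>):

#include <linux/jhash.h>

/* Hypothetical jhash-based replacement for stripe_hash(): fold the stripe
 * number down to 32 bits and let jhash scatter it over the existing
 * HASH_MASK-sized table. */
#define stripe_hash_jhash(conf, sect) \
	(&((conf)->stripe_hashtbl[ \
		jhash_1word((u32)((sect) >> STRIPE_SHIFT), 0) & HASH_MASK]))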
* Re: raid5 high cpu usage during reads - oprofile results
From: Alex Izvorski @ 2006-04-01 18:40 UTC
To: dean gaudet; +Cc: linux-raid
On Sat, 2006-03-25 at 21:38 -0800, dean gaudet wrote:
> On Sat, 25 Mar 2006, Alex Izvorski wrote:
>
> > http://linuxraid.pastebin.com/621363 - oprofile annotated assembly
>
> it looks to me like a lot of time is spent in __find_stripe() ... i wonder
> if the hash is working properly.
>
> in raid5.c you could try changing
>
> #define stripe_hash(conf, sect) (&((conf)->stripe_hashtbl[((sect) >> STRIPE_SHIFT) & HASH_MASK]))
>
> to
>
> #define stripe_hash(conf, sect) (&((conf)->stripe_hashtbl[(((sect) >> STRIPE_SHIFT) ^ ((sect) >> (2*STRIPE_SHIFT))) & HASH_MASK]))
>
> or maybe try using jhash_1word((sect) >> STRIPE_SHIFT, 0) ...
>
> -dean
Dean - I think I see what you mean, you're looking at this line in the
assembly?
65830 16.8830 : c1f: cmp %rcx,0x28(%rax)
I looked at the hash stuff, I think the problem is not that the hash
function is poor, but rather that the number of entries in all buckets
gets to be pretty high. I haven't actually run this with extra
debugging statements yet, but as far as I can tell from reading the
code, the hash-related defines will get the following values:
NR_HASH = 512
HASH_MASK = 0x1ff
STRIPE_SHIFT = 3
With this, every successive stripe will go to a different bucket, so for
a sequential access this is a good hash function (actually, to break it
you'd need to read 4k every 2M, which is not a common access pattern).
The problem is rather that there are too few buckets: with 16k entries
in the stripe cache and 512 buckets we'd have 32 elements per bucket.
Would redefining HASH_PAGES to 16 or so help? For that matter, is the
code even expected to work with HASH_PAGES > 1? Perhaps someone who is
more familiar with the code can comment on this.
Regards,
--Alex
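A quick userspace sanity check of the numbers above (the constants are assumptions
for a 4 KiB page and 8-byte pointers, which is what matches the NR_HASH = 512 value
Alex derives; this is just arithmetic, not the kernel code):

#include <stdio.h>

int main(void)
{
	/* One page worth of bucket pointers, as in the 2.6.15-era defines
	 * described above (4096-byte page, 8-byte pointers). */
	unsigned int nr_hash = 4096 / sizeof(void *);	/* 512 buckets */
	unsigned int stripe_cache_size = 16384;		/* sysfs value above */

	printf("HASH_MASK = %#x\n", nr_hash - 1);	/* 0x1ff */
	printf("entries per bucket = %u\n",
	       stripe_cache_size / nr_hash);		/* 32 */
	return 0;
}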
* Re: raid5 high cpu usage during reads - oprofile results
From: dean gaudet @ 2006-04-01 22:28 UTC
To: Alex Izvorski; +Cc: linux-raid
On Sat, 1 Apr 2006, Alex Izvorski wrote:
> Dean - I think I see what you mean, you're looking at this line in the
> assembly?
>
> 65830 16.8830 : c1f: cmp %rcx,0x28(%rax)
yup that's the one... that's probably a fair number of cache (or tlb)
misses going on right there.
> I looked at the hash stuff, I think the problem is not that the hash
> function is poor, but rather that the number of entries in all buckets
> gets to be pretty high.
yeah... your analysis seems more likely.
i suppose increasing the number of buckets is the only option. it looks
to me like you'd just need to change NR_HASH and the kzalloc in run() in
order to increase the number of buckets.
i'm guessing there's a good reason for STRIPE_SIZE being 4KiB -- 'cause
otherwise it'd be cool to run with STRIPE_SIZE the same as your raid
chunksize... which would decrease the number of entries -- much more
desirable than increasing the number of buckets.
-dean
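To illustrate the bucket-count change dean suggests, a purely hypothetical sketch
(the exact defines and the allocation call in run() may differ between trees; the
factor of 8 is just an example, chosen to keep NR_HASH a power of two so HASH_MASK
still works):

/* Eight pages of buckets instead of one: 4096 buckets, so roughly 4
 * entries per bucket with a 16k-entry stripe cache. */
#define NR_HASH		(8 * PAGE_SIZE / sizeof(struct stripe_head *))
#define HASH_MASK	(NR_HASH - 1)

	/* ...and in run(), size the allocation to match: */
	conf->stripe_hashtbl = kzalloc(NR_HASH * sizeof(struct stripe_head *),
				       GFP_KERNEL);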
* Re: raid5 high cpu usage during reads - oprofile results
From: Alex Izvorski @ 2006-04-02 6:03 UTC
To: dean gaudet; +Cc: linux-raid
On Sat, 2006-04-01 at 14:28 -0800, dean gaudet wrote:
> i'm guessing there's a good reason for STRIPE_SIZE being 4KiB -- 'cause
> otherwise it'd be cool to run with STRIPE_SIZE the same as your raid
> chunksize... which would decrease the number of entries -- much more
> desirable than increasing the number of buckets.
Dean - that is an interesting thought. I can't think of a reason why
not, except that it is the same as the page size? But offhand I don't
see any reason why that is a particularly good choice either. Would the
code work with other sizes? What about a variable (per array) size?
How would that interact with small reads?
Do you happen to know how many find_stripe calls there are for each
read? I rather suspect it is several (many) times per sector, since it
uses up something on the order of several thousand clock cycles per
*sector* (reading 400k sectors per second produces 80% load of 2x 2.4GHz
cpus, of which get_active_stripe accounts for ~30% - that's 2.8k clock
cycles per sector just in that one function). I really don't see any way
a single hash lookup even in a table with ~30 entries per bucket could
do anything close to that.
Short of changing STRIPE_SIZE, it should be enough to make sure the
average bucket occupancy is considerably less than one - as long as the
occupancy is kept low, the speed of access is independent of the
number of entries. 256 stripe cache entries and 512 hash buckets works
well with a 0.5 max occupancy; we should ideally have at least 32k
buckets (or 64 pages) for 16k entries. Yeah, ok, it's quite a bit more
memory than is used now, but considering that the box I'm running this
on has 4GB, it's not that much ;)
--Alex
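Alex's cycles-per-sector estimate can be reproduced from the figures quoted in the
thread; the sketch below is just that arithmetic (512-byte sectors and the 28.2%
opreport share for get_active_stripe are the assumed inputs):

#include <stdio.h>

int main(void)
{
	double cycles_per_sec  = 2 * 2.4e9;	/* two 2.4 GHz CPUs */
	double load            = 0.80;		/* observed cpu load */
	double frac_get_stripe = 0.282;		/* get_active_stripe share */
	double sectors_per_sec = 207e6 / 512;	/* ~207 MB/s in 512-byte sectors */

	/* prints roughly 2.7k cycles per sector in get_active_stripe alone */
	printf("%.0f cycles/sector\n",
	       cycles_per_sec * load * frac_get_stripe / sectors_per_sec);
	return 0;
}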
* Re: raid5 high cpu usage during reads - oprofile results
From: dean gaudet @ 2006-04-02 7:13 UTC
To: Alex Izvorski; +Cc: linux-raid
On Sat, 1 Apr 2006, Alex Izvorski wrote:
> On Sat, 2006-04-01 at 14:28 -0800, dean gaudet wrote:
> > i'm guessing there's a good reason for STRIPE_SIZE being 4KiB -- 'cause
> > otherwise it'd be cool to run with STRIPE_SIZE the same as your raid
> > chunksize... which would decrease the number of entries -- much more
> > desirable than increasing the number of buckets.
>
> Dean - that is an interesting thought. I can't think of a reason why
> not, except that it is the same as the page size? But offhand I don't
> see any reason why that is a particularly good choice either. Would the
> code work with other sizes? What about a variable (per array) size?
> How would that interact with small reads?
i don't understand the code well enough...
> Do you happen to know how many find_stripe calls there are for each
> read? I rather suspect it is several (many) times per sector, since it
> uses up something on the order of several thousand clock cycles per
> *sector* (reading 400k sectors per second produces 80% load of 2x 2.4GHz
> cpus, of which get_active_stripe accounts for ~30% - that's 2.8k clock
> cycles per sector just in that one function). I really don't see any way
> a single hash lookup even in a table with ~30 entries per bucket could
> do anything close to that.
well the lists are all struct stripe_heads... which on i386 seem to be
0x30 + 0x6c*(devs - 1) bytes each. that's pretty big. they're allocated
in a slab, so they're relatively well packed into pages... but still,
unless i've messed up somewhere that's 480 bytes for a 5 disk raid5. so
that's only 8 per page... so a chain of length 30 touches at least 4
pages. if you're hitting all 512 buckets, chains of length 30, then
you're looking at somewhere on the order of 2048 pages...
that causes a lot of thrashing in the TLBs... and isn't so great on the
cache either.
it's even worse on x86_64 ... it looks like 0xf8 + 0xb0*(devs - 1) bytes
per stripe_head ... (i'm pulling these numbers from the call setup for
kmem_cache_create in the disassembly of raid5.ko from kernels on my
boxes).
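For reference, dean's size figures work out as follows (this is only arithmetic on
the constants he quotes above, pulled from his disassembly, not from the struct
definition itself):

#include <stdio.h>

int main(void)
{
	int devs = 5;				/* 5-disk raid5 example */
	int sz32 = 0x30 + 0x6c * (devs - 1);	/* i386:   480 bytes */
	int sz64 = 0xf8 + 0xb0 * (devs - 1);	/* x86_64: 952 bytes */

	printf("i386:   %d bytes/stripe_head, %ld per 4k page\n",
	       sz32, 4096L / sz32);
	printf("x86_64: %d bytes/stripe_head, %ld per 4k page\n",
	       sz64, 4096L / sz64);
	/* 512 buckets with ~30-entry chains at 8 heads per page can touch
	 * on the order of 512 * (30 / 8) = ~2048 distinct pages. */
	return 0;
}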
oh btw you might get a small improvement by moving the "sector" field of
struct stripe_head close to the hash field... right now the sector field
is at 0x28 (x86_64) and so it's probably on a different cache line from
the "hash" field at offset 0 about half the time (64 byte cache line). if
you move sector to right after the "hash" field it'll more likely be on
the same cache line...
but still, i think the tlb is the problem.
oh you can probably ask oprofile to tell you if you're seeing cache miss
or tlb miss stalls there (not sure on the syntax).
> Short of changing STRIPE_SIZE, it should be enough to make sure the
> average bucket occupancy is considerably less than one - as long as the
> occupancy is kept low, the speed of access is independent of the
> number of entries. 256 stripe cache entries and 512 hash buckets works
> well with a 0.5 max occupancy; we should ideally have at least 32k
> buckets (or 64 pages) for 16k entries. Yeah, ok, it's quite a bit more
> memory than is used now, but considering that the box I'm running this
> on has 4GB, it's not that much ;)
i still don't understand all the code well enough... but if i assume
there's a good reason for STRIPE_SIZE == PAGE_SIZE then it seems like you
need to improve the cache locality of the hash chaining... a linked list
of struct stripe_heads doesn't have very good locality because they're
such large structures.
one possibility is a linked list of:
struct stripe_hash_entry {
	struct hlist_node	 hash;		/* bucket chain linkage */
	sector_t		 sector;	/* key: stripe's starting sector */
	struct stripe_head	*sh;		/* stripe this entry points at */
};
but that's still 32 bytes on x86_64 ...
you can get it down to 16 bytes by getting rid of chaining and using open
addressing...
eh ... this still isn't that hot... really there's too much pressure
because there's a hash table entry per 4KiB of disk i/o...
anyhow i'm only eyeballing code here, i could easily have missed some
critical details.
-dean
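A minimal sketch of the 16-byte open-addressed entry dean describes (hypothetical -
the struct name, the field comments, and the empty-slot convention are inventions
here, not anything taken from raid5.c):

/* One slot per bucket, no chain pointers: 8 bytes of key plus 8 bytes of
 * payload on x86_64.  An empty slot is marked by sh == NULL; a collision
 * would be resolved by probing neighbouring slots instead of chaining. */
struct stripe_hash_slot {
	sector_t		 sector;	/* key: stripe's starting sector */
	struct stripe_head	*sh;		/* NULL when the slot is free */
};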