* raid5 high cpu usage during reads - oprofile results
From: Alex Izvorski @ 2006-03-25 8:38 UTC
To: linux-raid
Hello,
I profiled some raid5 reads using oprofile to try to track down the
suspiciously high cpu load I see. This uses the same 8-disk SATA setup
as I had described earlier. One of the runs is on a 1MB chunk raid5, the
other on a 32MB chunk raid5. As Neil suggested, memcpy is a big part of
the cpu load. Much of the rest appears to be in handle_stripe and
get_active_stripe - these three account for most of the load, with the
remainder fairly evenly distributed among a dozen other routines. If
using a large chunk size, handle_stripe and get_active_stripe will
predominate (and result in some truly abnormal cpu loads). I am
attaching annotated (from opannotate) source and assembly for raid5.c
from the second (32MB chunk) run. The annotated results do not make
much sense to me, but I suspect the exact line numbers, etc., may be
shifted slightly, as usually happens with optimized builds. I hope this
is useful.
Regards,
--Alex
P.S. I had originally mailed the oprofile results as attachments to the
list, but I think they didn't go through. I put them at:
http://linuxraid.pastebin.com/621363 - oprofile annotated assembly
http://linuxraid.pastebin.com/621364 - oprofile annotated source
Sorry if you get this email twice.
-----------------------------------------------------------------------
mdadm --create /dev/md0 --level=raid5 --chunk=1024 --raid-devices=8 \
--size=10485760 /dev/sd[abcdefgh]1
echo "8192" > /sys/block/md0/md/stripe_cache_size
./test_aio -f /dev/md0 -T 10 -s 60G -r 8M -n 14
throughput 205MB/s, cpu load 40%
opreport --symbols --image-path=/lib/modules/2.6.15-gentoo-r7/kernel/
samples  %        image name  app name  symbol name
  91839  42.9501  vmlinux     vmlinux   memcpy
  25946  12.1341  raid5.ko    raid5     handle_stripe
  17732   8.2927  raid5.ko    raid5     get_active_stripe
   5454   2.5507  vmlinux     vmlinux   blk_rq_map_sg
   4850   2.2682  vmlinux     vmlinux   __delay
   4726   2.2102  raid5.ko    raid5     raid5_compute_sector
   4588   2.1457  raid5.ko    raid5     copy_data
   4389   2.0526  raid5.ko    raid5     .text
   4362   2.0400  raid5.ko    raid5     make_request
   3688   1.7248  vmlinux     vmlinux   clear_page
   3548   1.6593  raid5.ko    raid5     raid5_end_read_request
   2594   1.2131  vmlinux     vmlinux   blk_recount_segments
   1944   0.9091  vmlinux     vmlinux   generic_make_request
   1627   0.7609  vmlinux     vmlinux   dma_map_sg
   1555   0.7272  vmlinux     vmlinux   get_user_pages
   1540   0.7202  libata.ko   libata    ata_bmdma_status
   1464   0.6847  vmlinux     vmlinux   __make_request
   1350   0.6314  vmlinux     vmlinux   follow_page
   1098   0.5135  vmlinux     vmlinux   finish_wait
...(others under 0.5%)
-----------------------------------------------------------------------
mdadm --create /dev/md0 --level=raid5 --chunk=32768 --raid-devices=8 \
--size=10485760 /dev/sd[abcdefgh]1
echo "16384" > /sys/block/md0/md/stripe_cache_size
./test_aio -f /dev/md0 -T 10 -s 60G -r 8M -n 14
throughput 207MB/s, cpu load 80%
opreport --symbols --image-path=/lib/modules/2.6.15-gentoo-r7/kernel/
samples  %        image name  app name  symbol name
 112441  28.2297  raid5.ko    raid5     get_active_stripe
  86826  21.7988  vmlinux     vmlinux   memcpy
  78688  19.7556  raid5.ko    raid5     handle_stripe
  18796   4.7190  raid5.ko    raid5     .text
  13301   3.3394  raid5.ko    raid5     raid5_compute_sector
   8339   2.0936  raid5.ko    raid5     make_request
   5881   1.4765  vmlinux     vmlinux   blk_rq_map_sg
   5463   1.3716  raid5.ko    raid5     raid5_end_read_request
   4269   1.0718  vmlinux     vmlinux   __delay
   4131   1.0371  raid5.ko    raid5     copy_data
   3531   0.8865  vmlinux     vmlinux   clear_page
   2964   0.7441  vmlinux     vmlinux   blk_recount_segments
   2617   0.6570  vmlinux     vmlinux   get_user_pages
   2025   0.5084  vmlinux     vmlinux   dma_map_sg
...(others under 0.5%)
-----------------------------------------------------------------------
* Re: raid5 high cpu usage during reads - oprofile results
From: dean gaudet @ 2006-03-26 5:38 UTC
To: Alex Izvorski; +Cc: linux-raid
On Sat, 25 Mar 2006, Alex Izvorski wrote:
> http://linuxraid.pastebin.com/621363 - oprofile annotated assembly
it looks to me like a lot of time is spent in __find_stripe() ... i wonder
if the hash is working properly.
in raid5.c you could try changing
#define stripe_hash(conf, sect) (&((conf)->stripe_hashtbl[((sect) >> STRIPE_SHIFT) & HASH_MASK]))
to
#define stripe_hash(conf, sect) (&((conf)->stripe_hashtbl[(((sect) >> STRIPE_SHIFT) ^ ((sect) >> (2*STRIPE_SHIFT))) & HASH_MASK]))
or maybe try using jhash_1word((sect) >> STRIPE_SHIFT, 0) ...
-dean
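A rough sketch of what the jhash variant dean mentions could look like (untested
and hypothetical; the cast down to u32 and the zero initval are assumptions on my
part, and jhash_1word() itself comes from <linux/jhash.h>):

#include <linux/jhash.h>

/* Hypothetical jhash-based replacement for stripe_hash(): fold the stripe
 * number down to 32 bits and let jhash scatter it over the existing
 * HASH_MASK-sized table. */
#define stripe_hash_jhash(conf, sect) \
	(&((conf)->stripe_hashtbl[ \
		jhash_1word((u32)((sect) >> STRIPE_SHIFT), 0) & HASH_MASK]))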
* Re: raid5 high cpu usage during reads - oprofile results
From: Alex Izvorski @ 2006-04-01 18:40 UTC
To: dean gaudet; +Cc: linux-raid
On Sat, 2006-03-25 at 21:38 -0800, dean gaudet wrote:
> On Sat, 25 Mar 2006, Alex Izvorski wrote:
>
> > http://linuxraid.pastebin.com/621363 - oprofile annotated assembly
>
> it looks to me like a lot of time is spent in __find_stripe() ... i wonder
> if the hash is working properly.
>
> in raid5.c you could try changing
>
> #define stripe_hash(conf, sect) (&((conf)->stripe_hashtbl[((sect) >> STRIPE_SHIFT) & HASH_MASK]))
>
> to
>
> #define stripe_hash(conf, sect) (&((conf)->stripe_hashtbl[(((sect) >> STRIPE_SHIFT) ^ ((sect) >> (2*STRIPE_SHIFT))) & HASH_MASK]))
>
> or maybe try using jhash_1word((sect) >> STRIPE_SHIFT, 0) ...
>
> -dean
Dean - I think I see what you mean, you're looking at this line in the
assembly?
65830 16.8830 : c1f: cmp %rcx,0x28(%rax)
I looked at the hash stuff, I think the problem is not that the hash
function is poor, but rather that the number of entries in all buckets
gets to be pretty high. I haven't actually run this with extra
debugging statements yet, but as far as I can tell from reading the
code, the hash-related defines will get the following values:
NR_HASH = 512
HASH_MASK = 0x1ff
STRIPE_SHIFT = 3
With this, every successive stripe will go to a different bucket, so for
a sequential access this is a good hash function (actually, to break it
you'd need to read 4k every 2M, which is not a common access pattern).
The problem is rather that there are too few buckets: with 16k entries
in the stripe cache and 512 buckets we'd have 32 elements per bucket.
Would redefining HASH_PAGES to 16 or so help? For that matter, is the
code even expected to work with HASH_PAGES > 1? Perhaps someone who is
more familiar with the code can comment on this.
Regards,
--Alex
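A quick userspace sanity check of the numbers above (the constants are assumptions
for a 4 KiB page and 8-byte pointers, which is what matches the NR_HASH = 512 value
Alex derives; this is just arithmetic, not the kernel code):

#include <stdio.h>

int main(void)
{
	/* One page worth of bucket pointers, as in the 2.6.15-era defines
	 * described above (4096-byte page, 8-byte pointers). */
	unsigned int nr_hash = 4096 / sizeof(void *);	/* 512 buckets */
	unsigned int stripe_cache_size = 16384;		/* sysfs value above */

	printf("HASH_MASK = %#x\n", nr_hash - 1);	/* 0x1ff */
	printf("entries per bucket = %u\n",
	       stripe_cache_size / nr_hash);		/* 32 */
	return 0;
}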
* Re: raid5 high cpu usage during reads - oprofile results
From: dean gaudet @ 2006-04-01 22:28 UTC
To: Alex Izvorski; +Cc: linux-raid
On Sat, 1 Apr 2006, Alex Izvorski wrote:
> Dean - I think I see what you mean, you're looking at this line in the
> assembly?
>
> 65830 16.8830 : c1f: cmp %rcx,0x28(%rax)
yup that's the one... that's probably a fair number of cache (or tlb)
misses going on right there.
> I looked at the hash stuff, I think the problem is not that the hash
> function is poor, but rather that the number of entries in all buckets
> gets to be pretty high.
yeah... your analysis seems more likely.
i suppose increasing the number of buckets is the only option. it looks
to me like you'd just need to change NR_HASH and the kzalloc in run() in
order to increase the number of buckets.
i'm guessing there's a good reason for STRIPE_SIZE being 4KiB -- 'cause
otherwise it'd be cool to run with STRIPE_SIZE the same as your raid
chunksize... which would decrease the number of entries -- much more
desirable than increasing the number of buckets.
-dean
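To illustrate the bucket-count change dean suggests, a purely hypothetical sketch
(the exact defines and the allocation call in run() may differ between trees; the
factor of 8 is just an example, chosen to keep NR_HASH a power of two so HASH_MASK
still works):

/* Eight pages of buckets instead of one: 4096 buckets, so roughly 4
 * entries per bucket with a 16k-entry stripe cache. */
#define NR_HASH		(8 * PAGE_SIZE / sizeof(struct stripe_head *))
#define HASH_MASK	(NR_HASH - 1)

	/* ...and in run(), size the allocation to match: */
	conf->stripe_hashtbl = kzalloc(NR_HASH * sizeof(struct stripe_head *),
				       GFP_KERNEL);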
* Re: raid5 high cpu usage during reads - oprofile results
From: Alex Izvorski @ 2006-04-02 6:03 UTC
To: dean gaudet; +Cc: linux-raid
On Sat, 2006-04-01 at 14:28 -0800, dean gaudet wrote:
> i'm guessing there's a good reason for STRIPE_SIZE being 4KiB -- 'cause
> otherwise it'd be cool to run with STRIPE_SIZE the same as your raid
> chunksize... which would decrease the number of entries -- much more
> desirable than increasing the number of buckets.
Dean - that is an interesting thought. I can't think of a reason why
not, except that it is the same as the page size? But offhand I don't
see any reason why that is a particularly good choice either. Would the
code work with other sizes? What about a variable (per array) size?
How would that interact with small reads?
Do you happen to know how many find_stripe calls there are for each
read? I rather suspect it is several (many) times per sector, since it
uses up something on the order of several thousand clock cycles per
*sector* (reading 400k sectors per second produces 80% load of 2x 2.4GHz
cpus, of which get_active_stripe accounts for ~30% - that's 2.8k clock
cycles per sector just in that one function). I really don't see any way
a single hash lookup even in a table with ~30 entries per bucket could
do anything close to that.
Short of changing STRIPE_SIZE, it should be enough to make sure the
average bucket occupancy is considerably less than one - as long as the
occupancy is kept low, the speed of access is independent of the
number of entries. 256 stripe cache entries and 512 hash buckets works
well with a 0.5 max occupancy; we should ideally have at least 32k
buckets (or 64 pages) for 16k entries. Yeah, ok, it's quite a bit more
memory than is used now, but considering that the box I'm running this
on has 4GB, it's not that much ;)
--Alex
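Alex's cycles-per-sector estimate can be reproduced from the figures quoted in the
thread; the sketch below is just that arithmetic (512-byte sectors and the 28.2%
opreport share for get_active_stripe are the assumed inputs):

#include <stdio.h>

int main(void)
{
	double cycles_per_sec  = 2 * 2.4e9;	/* two 2.4 GHz CPUs */
	double load            = 0.80;		/* observed cpu load */
	double frac_get_stripe = 0.282;		/* get_active_stripe share */
	double sectors_per_sec = 207e6 / 512;	/* ~207 MB/s in 512-byte sectors */

	/* prints roughly 2.7k cycles per sector in get_active_stripe alone */
	printf("%.0f cycles/sector\n",
	       cycles_per_sec * load * frac_get_stripe / sectors_per_sec);
	return 0;
}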
* Re: raid5 high cpu usage during reads - oprofile results
From: dean gaudet @ 2006-04-02 7:13 UTC
To: Alex Izvorski; +Cc: linux-raid
On Sat, 1 Apr 2006, Alex Izvorski wrote:
> On Sat, 2006-04-01 at 14:28 -0800, dean gaudet wrote:
> > i'm guessing there's a good reason for STRIPE_SIZE being 4KiB -- 'cause
> > otherwise it'd be cool to run with STRIPE_SIZE the same as your raid
> > chunksize... which would decrease the number of entries -- much more
> > desirable than increasing the number of buckets.
>
> Dean - that is an interesting thought. I can't think of a reason why
> not, except that it is the same as the page size? But offhand I don't
> see any reason why that is a particularly good choice either. Would the
> code work with other sizes? What about a variable (per array) size?
> How would that interact with small reads?
i don't understand the code well enough...
> Do you happen to know how many find_stripe calls there are for each
> read? I rather suspect it is several (many) times per sector, since it
> uses up something on the order of several thousand clock cycles per
> *sector* (reading 400k sectors per second produces 80% load of 2x 2.4GHz
> cpus, of which get_active_stripe accounts for ~30% - that's 2.8k clock
> cycles per sector just in that one function). I really don't see any way
> a single hash lookup even in a table with ~30 entries per bucket could
> do anything close to that.
well the lists are all struct stripe_heads... which on i386 seem to be
0x30 + 0x6c*(devs - 1) bytes each. that's pretty big. they're allocated
in a slab, so they're relatively well packed into pages... but still,
unless i've messed up somewhere that's 480 bytes for a 5 disk raid5. so
that's only 8 per page... so a chain of length 30 touches at least 4
pages. if you're hitting all 512 buckets, chains of length 30, then
you're looking at somewhere on the order of 2048 pages...
that causes a lot of thrashing in the TLBs... and isn't so great on the
cache either.
it's even worse on x86_64 ... it looks like 0xf8 + 0xb0*(devs - 1) bytes
per stripe_head ... (i'm pulling these numbers from the call setup for
kmem_cache_create in the disassembly of raid5.ko from kernels on my
boxes).
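For reference, dean's size figures work out as follows (this is only arithmetic on
the constants he quotes above, pulled from his disassembly, not from the struct
definition itself):

#include <stdio.h>

int main(void)
{
	int devs = 5;				/* 5-disk raid5 example */
	int sz32 = 0x30 + 0x6c * (devs - 1);	/* i386:   480 bytes */
	int sz64 = 0xf8 + 0xb0 * (devs - 1);	/* x86_64: 952 bytes */

	printf("i386:   %d bytes/stripe_head, %ld per 4k page\n",
	       sz32, 4096L / sz32);
	printf("x86_64: %d bytes/stripe_head, %ld per 4k page\n",
	       sz64, 4096L / sz64);
	/* 512 buckets with ~30-entry chains at 8 heads per page can touch
	 * on the order of 512 * (30 / 8) = ~2048 distinct pages. */
	return 0;
}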
oh btw you might get a small improvement by moving the "sector" field of
struct stripe_head close to the hash field... right now the sector field
is at 0x28 (x86_64) and so it's probably on a different cache line from
the "hash" field at offset 0 about half the time (64 byte cache line). if
you move sector to right after the "hash" field it'll more likely be on
the same cache line...
but still, i think the tlb is the problem.
oh you can probably ask oprofile to tell you if you're seeing cache miss
or tlb miss stalls there (not sure on the syntax).
> Short of changing STRIPE_SIZE, it should be enough to make sure the
> average bucket occupancy is considerably less than one - as long as the
> occupancy is kept low, the speed of access is independent of the
> number of entries. 256 stripe cache entries and 512 hash buckets works
> well with a 0.5 max occupancy; we should ideally have at least 32k
> buckets (or 64 pages) for 16k entries. Yeah, ok, it's quite a bit more
> memory than is used now, but considering that the box I'm running this
> on has 4GB, it's not that much ;)
i still don't understand all the code well enough... but if i assume
there's a good reason for STRIPE_SIZE == PAGE_SIZE then it seems like you
need to improve the cache locality of the hash chaining... a linked list
of struct stripe_heads doesn't have very good locality because they're
such large structures.
one possibility is a linked list of:
struct stripe_hash_entry {
	struct hlist_node	 hash;		/* bucket chain linkage */
	sector_t		 sector;	/* key: stripe's starting sector */
	struct stripe_head	*sh;		/* stripe this entry points at */
};
but that's still 32 bytes on x86_64 ...
you can get it down to 16 bytes by getting rid of chaining and using open
addressing...
eh ... this still isn't that hot... really there's too much pressure
because there's a hash table entry per 4KiB of disk i/o...
anyhow i'm only eyeballing code here, i could easily have missed some
critical details.
-dean
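A minimal sketch of the 16-byte open-addressed entry dean describes (hypothetical -
the struct name, the field comments, and the empty-slot convention are inventions
here, not anything taken from raid5.c):

/* One slot per bucket, no chain pointers: 8 bytes of key plus 8 bytes of
 * payload on x86_64.  An empty slot is marked by sh == NULL; a collision
 * would be resolved by probing neighbouring slots instead of chaining. */
struct stripe_hash_slot {
	sector_t		 sector;	/* key: stripe's starting sector */
	struct stripe_head	*sh;		/* NULL when the slot is free */
};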