* Re: experiences with raid5: stripe_queue patches
2007-10-15 15:03 experiences with raid5: stripe_queue patches Bernd Schubert
@ 2007-10-15 16:40 ` Justin Piszcz
2007-10-16 2:01 ` Neil Brown
` (2 subsequent siblings)
3 siblings, 0 replies; 6+ messages in thread
From: Justin Piszcz @ 2007-10-15 16:40 UTC (permalink / raw)
To: Bernd Schubert; +Cc: Dan Williams, linux-raid, neilb
On Mon, 15 Oct 2007, Bernd Schubert wrote:
> Hi,
>
> in order to tune raid performance I did some benchmarks with and without the
> stripe queue patches. 2.6.22 is only for comparison to rule out other
> effects, e.g. the new scheduler, etc.
> It seems there is a regression with these patches regarding the re-write
> performance; as you can see, it's almost 50% of what it should be.
>
> write re-write read re-read
> 480844.26 448723.48 707927.55 706075.02 (2.6.22 w/o SQ patches)
> 487069.47 232574.30 709038.28 707595.09 (2.6.23 with SQ patches)
> 469865.75 438649.88 711211.92 703229.00 (2.6.23 without SQ patches)
>
> Benchmark details:
>
> 3xraid5 over 4 partitions of the very same hardware raid (in the end that's
> raid65: raid6 in hardware and raid5 in software; we need to do that).
>
> chunk size: 8192
> stripe_cache_size: 8192 each
> readahead of the md*: 65535 (well actually it limits itself to 65528)
> readahead of the underlying partitions: 16384
>
> filesystem: xfs
>
> Testsystem: 2 x Quadcore Xeon 1.86 GHz (E5320)
>
> An interesting effect to notice: Without these patches the pdflush daemons
> will take a lot of CPU time, with these patches, pdflush almost doesn't
> appear in the 'top' list.
>
> Actually we would prefer one single raid5 array, but then one single raid5
> thread will run with 100% CPU time, leaving 7 CPUs in idle state; the status of
> the hardware raid says its utilization is only at about 50% and we only see
> writes at about 200 MB/s.
> On the contrary, with 3 different software raid5 sets the i/o to the hardware
> raid systems is the bottleneck.
>
> Is there any chance to parallelize the raid5 code? I think almost everything is
> done in raid5.c make_request(), but the main loop there is spin_locked by
> prepare_to_wait(). Would it be possible not to lock this entire loop?
>
>
> Thanks,
> Bernd
>
> --
> Bernd Schubert
> Q-Leap Networks GmbH
> -
> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
>
Excellent questions! I look forward to reading this thread :)
Justin.
^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: experiences with raid5: stripe_queue patches
2007-10-15 15:03 experiences with raid5: stripe_queue patches Bernd Schubert
2007-10-15 16:40 ` Justin Piszcz
@ 2007-10-16 2:01 ` Neil Brown
[not found] ` <BAY125-W2D0CD53AC925A85655321A59C0@phx.gbl>
2007-10-16 17:31 ` Dan Williams
3 siblings, 0 replies; 6+ messages in thread
From: Neil Brown @ 2007-10-16 2:01 UTC (permalink / raw)
To: Bernd Schubert; +Cc: Dan Williams, linux-raid
On Monday October 15, bs@q-leap.de wrote:
> Hi,
>
> in order to tune raid performance I did some benchmarks with and without the
> stripe queue patches. 2.6.22 is only for comparison to rule out other
> effects, e.g. the new scheduler, etc.
Thanks!
> It seems there is a regression with these patches regarding the re-write
> performance; as you can see, it's almost 50% of what it should be.
>
> write re-write read re-read
> 480844.26 448723.48 707927.55 706075.02 (2.6.22 w/o SQ patches)
> 487069.47 232574.30 709038.28 707595.09 (2.6.23 with SQ patches)
> 469865.75 438649.88 711211.92 703229.00 (2.6.23 without SQ patches)
I wonder if it is a fairness issue. One concern I have about that new
code is that it seems to allow full stripes to bypass incomplete
stripes in the queue indefinitely. So an incomplete stripe might be
delayed a very long time.
I've had a bit of time to think about these patches and experiment a
bit.
I think we should think about the stripe queue in four parts:
A/ those that have scheduled some write requests
B/ those that have scheduled some pre-read requests
C/ those that can start writing without any preread
D/ those that need some preread before we write.
The original code lets C flow directly to A, and lets D move into B in
bursts, i.e. once B becomes empty, all of D moves to B.
The new code further restricts D to only move to B when the total size
of A+B is below some limit.
I think that including the size of A is good, as it gives stripes on D
more chance to move to C by getting more blocks attached. However, it
is bad because it makes it easier for stripes on C to overtake
stripes on D.
I made a tiny change to raid5_activate_delayed so that the while loop
aborts if "atomic_read(&conf->active_stripes) < 32".
This (in a very coarse way) limits D moving to B when A+B is more than
a certain size, and it had a similar effect to the SQ patches on a
simple sequential write test. But it still allowed some pre-read
requests (that shouldn't be needed) to slip through.
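To illustrate, here is roughly where that abort sits; this is a sketch
abbreviated from the 2.6.23 raid5.c, not the exact change I tested:

/* Rough sketch only, abbreviated from 2.6.23 raid5.c; not the exact patch. */
static void raid5_activate_delayed(raid5_conf_t *conf)
{
	if (atomic_read(&conf->preread_active_stripes) < IO_THRESHOLD) {
		while (!list_empty(&conf->delayed_list)) {
			struct list_head *l = conf->delayed_list.next;
			struct stripe_head *sh;

			/* the tiny change: bail out of the loop under the
			 * condition quoted above */
			if (atomic_read(&conf->active_stripes) < 32)
				break;

			sh = list_entry(l, struct stripe_head, lru);
			list_del_init(l);
			clear_bit(STRIPE_DELAYED, &sh->state);
			if (!test_and_set_bit(STRIPE_PREREAD_ACTIVE, &sh->state))
				atomic_inc(&conf->preread_active_stripes);
			list_add_tail(&sh->lru, &conf->handle_list);
		}
	}
}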
I think we should:
Keep a precise count of the size of A
Only allow the D->B transition when A < one-full-stripe
Limit the extent to which C can leapfrog D.
I'm not sure how best to do this yet. Something simple but fair
is needed.
>
> An interesting effect to notice: Without these patches the pdflush daemons
> will take a lot of CPU time, with these patches, pdflush almost doesn't
> appear in the 'top' list.
Maybe the patches move processing time from make_request into raid5d,
thus moving it from pdflush to raid5d. Does raid5d appear higher in
the list....
>
> Actually we would prefer one single raid5 array, but then one single raid5
> thread will run with 100% CPU time, leaving 7 CPUs in idle state; the status of
> the hardware raid says its utilization is only at about 50% and we only see
> writes at about 200 MB/s.
> On the contrary, with 3 different software raid5 sets the i/o to the hardware
> raid systems is the bottleneck.
>
> Is there any chance to parallelize the raid5 code? I think almost everything is
> done in raid5.c make_request(), but the main loop there is spin_locked by
> prepare_to_wait(). Would it be possible not to lock this entire loop?
I think you want multiple raid5d threads - that is where most of the
work is done. That is just a case of creating them and keeping track
of them so they can be destroyed when appropriate, and - possibly the
trickiest bit - waking them up at the right time, so they share the
load without wasteful wakeups.
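For example, the bookkeeping half might look something like the sketch
below, using the existing md_register_thread()/md_unregister_thread()
interface; the pool structure, constant and function name are invented
for illustration, and the wakeup policy - the interesting part - is left
completely open:

/* Hypothetical sketch only: register several raid5d-style threads for one
 * array using the existing md thread API.  The pool structure, the constant
 * and the function name are invented for illustration; how and when each
 * worker gets woken is the open question and is not addressed here. */
#define RAID5_NR_WORKERS	4

struct raid5_worker_pool {
	mdk_thread_t *worker[RAID5_NR_WORKERS];
};

static int raid5_start_workers(mddev_t *mddev, struct raid5_worker_pool *pool)
{
	int i;

	for (i = 0; i < RAID5_NR_WORKERS; i++) {
		/* each thread runs the normal raid5d() main loop */
		pool->worker[i] = md_register_thread(raid5d, mddev, "%s_raid5");
		if (!pool->worker[i])
			goto abort;
	}
	return 0;

abort:
	while (i--)
		md_unregister_thread(pool->worker[i]);
	return -ENOMEM;
}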
NeilBrown
^ permalink raw reply [flat|nested] 6+ messages in thread
[parent not found: <BAY125-W2D0CD53AC925A85655321A59C0@phx.gbl>]
* RE: experiences with raid5: stripe_queue patches
[not found] ` <BAY125-W2D0CD53AC925A85655321A59C0@phx.gbl>
@ 2007-10-16 2:04 ` Neil Brown
0 siblings, 0 replies; 6+ messages in thread
From: Neil Brown @ 2007-10-16 2:04 UTC (permalink / raw)
To: 彭席汉; +Cc: Bernd Schubert, linux-raid, dan.j.williams
On Tuesday October 16, pengxihan@hotmail.com wrote:
> In my opinion, one of the ways to increase raid5 performance may be to
> change STRIPE_SIZE to a bigger value. It is now equal to PAGE_SIZE. But we
> can't just change it simply; some other code must be modified. I haven't
> tried :). Maybe Neil can give us some advice!
I really don't think increasing STRIPE_SIZE would have much effect. You
might save a bit of CPU (though it might cost you too), but it shouldn't
change the order of requests, so it wouldn't affect disk io much.
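(For context, the snippet below is abbreviated from the 2.6.23-era
include/linux/raid/raid5.h; it shows why bumping STRIPE_SIZE is not a
one-line change.)

/* Abbreviated from include/linux/raid/raid5.h (2.6.23 era).  Each
 * per-device slot (struct r5dev) inside a stripe_head caches exactly one
 * struct page of data, so growing STRIPE_SIZE means reworking that
 * page-per-slot assumption and the copy/xor paths, not just editing a
 * constant. */
#define STRIPE_SIZE		PAGE_SIZE
#define STRIPE_SHIFT		(PAGE_SHIFT - 9)
#define STRIPE_SECTORS		(STRIPE_SIZE>>9)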
I really think this is a scheduling problem. We need to make sure to
send the requests in the right order without delaying things
needlessly, and avoid reads whenever possible.
NeilBrown
^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: experiences with raid5: stripe_queue patches
2007-10-15 15:03 experiences with raid5: stripe_queue patches Bernd Schubert
` (2 preceding siblings ...)
[not found] ` <BAY125-W2D0CD53AC925A85655321A59C0@phx.gbl>
@ 2007-10-16 17:31 ` Dan Williams
2007-10-17 16:59 ` Bernd Schubert
3 siblings, 1 reply; 6+ messages in thread
From: Dan Williams @ 2007-10-16 17:31 UTC (permalink / raw)
To: Bernd Schubert; +Cc: linux-raid, neilb
On Mon, 2007-10-15 at 08:03 -0700, Bernd Schubert wrote:
> Hi,
>
> in order to tune raid performance I did some benchmarks with and without
> the stripe queue patches. 2.6.22 is only for comparison to rule out other
> effects, e.g. the new scheduler, etc.
Thanks for testing!
> It seems there is a regression with these patches regarding the re-write
> performance; as you can see, it's almost 50% of what it should be.
>
> write re-write read re-read
> 480844.26 448723.48 707927.55 706075.02 (2.6.22 w/o SQ patches)
> 487069.47 232574.30 709038.28 707595.09 (2.6.23 with SQ patches)
> 469865.75 438649.88 711211.92 703229.00 (2.6.23 without SQ patches)
A quick way to verify that it is a fairness issue is to simply not
promote full stripe writes to their own list, debug patch follows:
---
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index eb7fd10..755aafb 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -162,7 +162,7 @@ static void __release_queue(raid5_conf_t *conf, struct stripe_queue *sq)
 	if (to_write &&
 	    io_weight(sq->overwrite, disks) == data_disks) {
-		list_add_tail(&sq->list_node, &conf->io_hi_q_list);
+		list_add_tail(&sq->list_node, &conf->io_lo_q_list);
 		queue_work(conf->workqueue, &conf->stripe_queue_work);
 	} else if (io_weight(sq->to_read, disks)) {
 		list_add_tail(&sq->list_node, &conf->io_lo_q_list);
---
<snip>
>
> An interesting effect to notice: Without these patches the pdflush daemons
> will take a lot of CPU time, with these patches, pdflush almost doesn't
> appear in the 'top' list.
>
> Actually we would prefer one single raid5 array, but then one single raid5
> thread will run with 100% CPU time, leaving 7 CPUs in idle state; the status
> of the hardware raid says its utilization is only at about 50% and we only
> see writes at about 200 MB/s.
> On the contrary, with 3 different software raid5 sets the i/o to the
> hardware raid systems is the bottleneck.
>
> Is there any chance to parallelize the raid5 code? I think almost everything
> is done in raid5.c make_request(), but the main loop there is spin_locked by
> prepare_to_wait(). Would it be possible not to lock this entire loop?
I made a rough attempt at multi-threading raid5 [1] a while back.
However, this configuration only helps affinity; it does not address the
cases where the load needs to be further rebalanced between CPUs.
>
>
> Thanks,
> Bernd
>
[1] http://marc.info/?l=linux-raid&m=117262977831208&w=2
Note this implementation incorrectly handles the raid6 spare_page; we
would need a spare_page per CPU.
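(A per-cpu spare page might look roughly like the sketch below; the array
and helper name are invented, and the user has to stay on one cpu, e.g.
with preempt_disable(), while the page is in use.)

/* Hypothetical sketch: one raid6 spare page per CPU instead of the single
 * conf->spare_page.  The spare_page[] array (NR_CPUS entries, assumed
 * zeroed, e.g. part of a kzalloc'ed conf) and the function name are
 * invented here for illustration. */
static int raid6_alloc_spare_pages(struct page **spare_page)
{
	int cpu;

	for_each_possible_cpu(cpu) {
		spare_page[cpu] = alloc_page(GFP_KERNEL);
		if (!spare_page[cpu])
			goto out_free;
	}
	return 0;

out_free:
	/* free whatever was already allocated */
	for_each_possible_cpu(cpu) {
		if (spare_page[cpu]) {
			put_page(spare_page[cpu]);
			spare_page[cpu] = NULL;
		}
	}
	return -ENOMEM;
}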
^ permalink raw reply related [flat|nested] 6+ messages in thread
* Re: experiences with raid5: stripe_queue patches
2007-10-16 17:31 ` Dan Williams
@ 2007-10-17 16:59 ` Bernd Schubert
0 siblings, 0 replies; 6+ messages in thread
From: Bernd Schubert @ 2007-10-17 16:59 UTC (permalink / raw)
To: Dan Williams; +Cc: linux-raid, neilb
Hello Dan, hello Neil,
thanks for your help!
On Tuesday 16 October 2007 19:31:08 Dan Williams wrote:
> On Mon, 2007-10-15 at 08:03 -0700, Bernd Schubert wrote:
> > Hi,
> >
> > in order to tune raid performance I did some benchmarks with and without
> > the stripe queue patches. 2.6.22 is only for comparison to rule out other
> > effects, e.g. the new scheduler, etc.
>
> Thanks for testing!
>
> > It seems there is a regression with these patches regarding the re-write
> > performance; as you can see, it's almost 50% of what it should be.
> >
> > write re-write read re-read
> > 480844.26 448723.48 707927.55 706075.02 (2.6.22 w/o SQ patches)
> > 487069.47 232574.30 709038.28 707595.09 (2.6.23 with SQ patches)
> > 469865.75 438649.88 711211.92 703229.00 (2.6.23 without SQ patches)
>
> A quick way to verify that it is a fairness issue is to simply not
> promote full stripe writes to their own list, debug patch follows:
I tested with that and the rewrite performance is better, but still not
perfect:
write re-write read re-read
461794.14 377896.27 701793.81 693018.02
[...]
> I made a rough attempt at multi-threading raid5 [1] a while back.
> However, this configuration only helps affinity; it does not address the
> cases where the load needs to be further rebalanced between CPUs.
>
> > Thanks,
> > Bernd
>
> [1] http://marc.info/?l=linux-raid&m=117262977831208&w=2
> Note this implementation incorrectly handles the raid6 spare_page; we
> would need a spare_page per CPU.
Ah great, I will test this on Friday.
Thanks,
Bernd
--
Bernd Schubert
Q-Leap Networks GmbH
^ permalink raw reply [flat|nested] 6+ messages in thread