* [PATCH 1/2] md/raid5: split wait_for_stripe and introduce wait_for_quiesce
@ 2015-04-24 13:39 Yuanhan Liu
2015-04-24 13:39 ` [PATCH 2/2] md/raid5: exclusive wait_for_stripe Yuanhan Liu
2015-04-27 0:10 ` [PATCH 1/2] md/raid5: split wait_for_stripe and introduce wait_for_quiesce NeilBrown
0 siblings, 2 replies; 7+ messages in thread
From: Yuanhan Liu @ 2015-04-24 13:39 UTC (permalink / raw)
To: neilb; +Cc: linux-raid, linux-kernel, Yuanhan Liu

If I read the code correctly, the current wait_for_stripe actually has
2 uses:

- wait for there to be enough free stripe cache, triggered when
  get_free_stripe() fails. This is what wait_for_stripe is literally
  intended for.

- wait for quiesce == 0 or
  active_aligned_reads == 0 && active_stripes == 0

  This has nothing to do with wait_for_stripe literally, and releasing
  an active stripe won't actually wake these waiters up. Conversely, a
  wake_up issued for this case won't actually wake up a process waiting
  for a free stripe to become available.

Hence, we'd better split wait_for_stripe, and here I introduce
wait_for_quiesce for the second usage. The name may not be well chosen,
or may even be wrong; feel free to correct me.

This is also a preparatory patch for the next one: making
wait_for_stripe exclusive.
Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
---
drivers/md/raid5.c | 13 +++++++------
drivers/md/raid5.h | 1 +
2 files changed, 8 insertions(+), 6 deletions(-)
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 9716319..b7e385f 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -667,7 +667,7 @@ get_active_stripe(struct r5conf *conf, sector_t sector,
spin_lock_irq(conf->hash_locks + hash);
do {
- wait_event_lock_irq(conf->wait_for_stripe,
+ wait_event_lock_irq(conf->wait_for_quiesce,
conf->quiesce == 0 || noquiesce,
*(conf->hash_locks + hash));
sh = __find_stripe(conf, sector, conf->generation - previous);
@@ -4725,7 +4725,7 @@ static void raid5_align_endio(struct bio *bi, int error)
raid_bi, 0);
bio_endio(raid_bi, 0);
if (atomic_dec_and_test(&conf->active_aligned_reads))
- wake_up(&conf->wait_for_stripe);
+ wake_up(&conf->wait_for_quiesce);
return;
}
@@ -4820,7 +4820,7 @@ static int chunk_aligned_read(struct mddev *mddev, struct bio * raid_bio)
align_bi->bi_iter.bi_sector += rdev->data_offset;
spin_lock_irq(&conf->device_lock);
- wait_event_lock_irq(conf->wait_for_stripe,
+ wait_event_lock_irq(conf->wait_for_quiesce,
conf->quiesce == 0,
conf->device_lock);
atomic_inc(&conf->active_aligned_reads);
@@ -5659,7 +5659,7 @@ static int retry_aligned_read(struct r5conf *conf, struct bio *raid_bio)
bio_endio(raid_bio, 0);
}
if (atomic_dec_and_test(&conf->active_aligned_reads))
- wake_up(&conf->wait_for_stripe);
+ wake_up(&conf->wait_for_quiesce);
return handled;
}
@@ -6390,6 +6390,7 @@ static struct r5conf *setup_conf(struct mddev *mddev)
goto abort;
spin_lock_init(&conf->device_lock);
seqcount_init(&conf->gen_lock);
+ init_waitqueue_head(&conf->wait_for_quiesce);
init_waitqueue_head(&conf->wait_for_stripe);
init_waitqueue_head(&conf->wait_for_overlap);
INIT_LIST_HEAD(&conf->handle_list);
@@ -7413,7 +7414,7 @@ static void raid5_quiesce(struct mddev *mddev, int state)
* active stripes can drain
*/
conf->quiesce = 2;
- wait_event_cmd(conf->wait_for_stripe,
+ wait_event_cmd(conf->wait_for_quiesce,
atomic_read(&conf->active_stripes) == 0 &&
atomic_read(&conf->active_aligned_reads) == 0,
unlock_all_device_hash_locks_irq(conf),
@@ -7427,7 +7428,7 @@ static void raid5_quiesce(struct mddev *mddev, int state)
case 0: /* re-enable writes */
lock_all_device_hash_locks_irq(conf);
conf->quiesce = 0;
- wake_up(&conf->wait_for_stripe);
+ wake_up(&conf->wait_for_quiesce);
wake_up(&conf->wait_for_overlap);
unlock_all_device_hash_locks_irq(conf);
break;
diff --git a/drivers/md/raid5.h b/drivers/md/raid5.h
index 7dc0dd8..fab53a3 100644
--- a/drivers/md/raid5.h
+++ b/drivers/md/raid5.h
@@ -508,6 +508,7 @@ struct r5conf {
struct list_head inactive_list[NR_STRIPE_HASH_LOCKS];
atomic_t empty_inactive_list_nr;
struct llist_head released_stripes;
+ wait_queue_head_t wait_for_quiesce;
wait_queue_head_t wait_for_stripe;
wait_queue_head_t wait_for_overlap;
unsigned long cache_state;
--
1.9.0
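
For readers who want a feel for the split outside the kernel, here is a
minimal userspace sketch. It is a hypothetical illustration using POSIX
condition variables in place of kernel wait queues (the names mirror the
patch but nothing below is kernel code): with one queue per condition, a
wakeup meant for quiesce waiters no longer disturbs processes waiting for
a free stripe, and vice versa.

  /* Build: gcc -pthread wq_split.c -o wq_split */
  #include <pthread.h>
  #include <stdio.h>

  static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
  static pthread_cond_t wait_for_stripe = PTHREAD_COND_INITIALIZER;  /* free-stripe waiters */
  static pthread_cond_t wait_for_quiesce = PTHREAD_COND_INITIALIZER; /* quiesce waiters */
  static int free_stripes;
  static int quiesce = 1;

  static void *stripe_waiter(void *unused)
  {
  	(void)unused;
  	pthread_mutex_lock(&lock);
  	while (free_stripes == 0)       /* re-check: wakeups can be spurious */
  		pthread_cond_wait(&wait_for_stripe, &lock);
  	free_stripes--;                 /* take one stripe */
  	pthread_mutex_unlock(&lock);
  	puts("stripe waiter: got a free stripe");
  	return NULL;
  }

  static void *quiesce_waiter(void *unused)
  {
  	(void)unused;
  	pthread_mutex_lock(&lock);
  	while (quiesce)
  		pthread_cond_wait(&wait_for_quiesce, &lock);
  	pthread_mutex_unlock(&lock);
  	puts("quiesce waiter: quiesce == 0");
  	return NULL;
  }

  int main(void)
  {
  	pthread_t t1, t2;

  	pthread_create(&t1, NULL, stripe_waiter, NULL);
  	pthread_create(&t2, NULL, quiesce_waiter, NULL);

  	pthread_mutex_lock(&lock);
  	quiesce = 0;
  	pthread_cond_broadcast(&wait_for_quiesce); /* disturbs no stripe waiter */
  	pthread_mutex_unlock(&lock);

  	pthread_mutex_lock(&lock);
  	free_stripes = 1;
  	pthread_cond_broadcast(&wait_for_stripe);  /* disturbs no quiesce waiter */
  	pthread_mutex_unlock(&lock);

  	pthread_join(t1, NULL);
  	pthread_join(t2, NULL);
  	return 0;
  }

With a single shared queue, each broadcast would have woken both kinds of
waiter, and the one whose condition had not changed would simply go back
to sleep after taking and releasing the lock for nothing.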

* [PATCH 2/2] md/raid5: exclusive wait_for_stripe
  2015-04-24 13:39 [PATCH 1/2] md/raid5: split wait_for_stripe and introduce wait_for_quiesce Yuanhan Liu
@ 2015-04-24 13:39 ` Yuanhan Liu
  2015-04-27  0:24   ` NeilBrown
  2015-04-27  0:10 ` [PATCH 1/2] md/raid5: split wait_for_stripe and introduce wait_for_quiesce NeilBrown
  1 sibling, 1 reply; 7+ messages in thread
From: Yuanhan Liu @ 2015-04-24 13:39 UTC (permalink / raw)
To: neilb; +Cc: linux-raid, linux-kernel, Yuanhan Liu

I noticed heavy spin lock contention at get_active_stripe() with fsmark
multiple thread write workloads.

Here is where this hot contention comes from. We have a limited number
of stripes, and it's a multiple thread write workload. Hence, those
stripes will be taken soon, which puts later processes to sleep waiting
for free stripes. When enough stripes (> 1/4 of the total) are released,
all processes are woken and try to get the lock. But only one process
can hold each hash lock at a time, leaving the other processes spinning
to acquire it.

Thus, it's ineffective to wake up all processes and let them battle for
a lock that only one of them can hold at a time. Instead, we could make
it an exclusive wakeup: wake up one process only. That naturally avoids
the heavy spin lock contention.

Here are some test results I have got with this patch applied (all
tests run 3 times):

`fsmark.files_per_sec'
=====================

      next-20150317             this patch
 ------------------------  ------------------------
 metric_value    ±stddev   metric_value    ±stddev    change  testbox/benchmark/testcase-params
 ------------------------  ------------------------  -------  ---------------------------------
       25.600       ±0.0         92.700       ±2.5    262.1%  ivb44/fsmark/1x-64t-4BRD_12G-RAID5-btrfs-4M-30G-fsyncBeforeClose
       25.600       ±0.0         77.800       ±0.6    203.9%  ivb44/fsmark/1x-64t-9BRD_6G-RAID5-btrfs-4M-30G-fsyncBeforeClose
       32.000       ±0.0         93.800       ±1.7    193.1%  ivb44/fsmark/1x-64t-4BRD_12G-RAID5-ext4-4M-30G-fsyncBeforeClose
       32.000       ±0.0         81.233       ±1.7    153.9%  ivb44/fsmark/1x-64t-9BRD_6G-RAID5-ext4-4M-30G-fsyncBeforeClose
       48.800      ±14.5         99.667       ±2.0    104.2%  ivb44/fsmark/1x-64t-4BRD_12G-RAID5-xfs-4M-30G-fsyncBeforeClose
        6.400       ±0.0         12.800       ±0.0    100.0%  ivb44/fsmark/1x-64t-3HDD-RAID5-btrfs-4M-40G-fsyncBeforeClose
       63.133       ±8.2         82.800       ±0.7     31.2%  ivb44/fsmark/1x-64t-9BRD_6G-RAID5-xfs-4M-30G-fsyncBeforeClose
      245.067       ±0.7        306.567       ±7.9     25.1%  ivb44/fsmark/1x-64t-4BRD_12G-RAID5-f2fs-4M-30G-fsyncBeforeClose
       17.533       ±0.3         21.000       ±0.8     19.8%  ivb44/fsmark/1x-1t-3HDD-RAID5-xfs-4M-40G-fsyncBeforeClose
      188.167       ±1.9        215.033       ±3.1     14.3%  ivb44/fsmark/1x-1t-4BRD_12G-RAID5-btrfs-4M-30G-NoSync
      254.500       ±1.8        290.733       ±2.4     14.2%  ivb44/fsmark/1x-1t-9BRD_6G-RAID5-btrfs-4M-30G-NoSync

`time.system_time'
=====================

      next-20150317             this patch
 ------------------------  ------------------------
 metric_value    ±stddev   metric_value    ±stddev    change  testbox/benchmark/testcase-params
 ------------------------  ------------------------  -------  ---------------------------------
     7235.603       ±1.2        185.163       ±1.9    -97.4%  ivb44/fsmark/1x-64t-4BRD_12G-RAID5-btrfs-4M-30G-fsyncBeforeClose
     7666.883       ±2.9        202.750       ±1.0    -97.4%  ivb44/fsmark/1x-64t-9BRD_6G-RAID5-btrfs-4M-30G-fsyncBeforeClose
    14567.893       ±0.7        421.230       ±0.4    -97.1%  ivb44/fsmark/1x-64t-3HDD-RAID5-btrfs-4M-40G-fsyncBeforeClose
     3697.667      ±14.0        148.190       ±1.7    -96.0%  ivb44/fsmark/1x-64t-4BRD_12G-RAID5-xfs-4M-30G-fsyncBeforeClose
     5572.867       ±3.8        310.717       ±1.4    -94.4%  ivb44/fsmark/1x-64t-9BRD_6G-RAID5-ext4-4M-30G-fsyncBeforeClose
     5565.050       ±0.5        313.277       ±1.5    -94.4%  ivb44/fsmark/1x-64t-4BRD_12G-RAID5-ext4-4M-30G-fsyncBeforeClose
     2420.707      ±17.1        171.043       ±2.7    -92.9%  ivb44/fsmark/1x-64t-9BRD_6G-RAID5-xfs-4M-30G-fsyncBeforeClose
     3743.300       ±4.6        379.827       ±3.5    -89.9%  ivb44/fsmark/1x-64t-3HDD-RAID5-ext4-4M-40G-fsyncBeforeClose
     3308.687       ±6.3        363.050       ±2.0    -89.0%  ivb44/fsmark/1x-64t-3HDD-RAID5-xfs-4M-40G-fsyncBeforeClose

Where,

  1x: 'x' means iterations or loops, corresponding to the 'L' option of fsmark

  1t, 64t: 't' means threads

  4M: the single file size, corresponding to the '-s' option of fsmark
  40G, 30G, 120G: the total test size

  4BRD_12G: BRD is a ramdisk; '4' means 4 ramdisks, and '12G' means the
      size of one ramdisk, so 48G in total. We made a RAID on those
      ramdisks.

As you can see, though there is not much performance gain for the hard
disk workloads, the system time drops heavily, by up to 97%. And as
expected, performance increases a lot, by up to 260%, for fast devices
(ram disks).

Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
---
 drivers/md/raid5.c | 46 +++++++++++++++++++++++++++++++++++-----------
 drivers/md/raid5.h |  2 +-
 2 files changed, 36 insertions(+), 12 deletions(-)

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index b7e385f..2d8fcc1 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -344,7 +344,8 @@ static void release_inactive_stripe_list(struct r5conf *conf,
int hash)
{
int size;
- bool do_wakeup = false;
+ bool do_wakeup[NR_STRIPE_HASH_LOCKS] = { false, };
+ int i = 0;
unsigned long flags;

if (hash == NR_STRIPE_HASH_LOCKS) {
@@ -365,17 +366,22 @@ static void release_inactive_stripe_list(struct r5conf *conf,
!list_empty(list))
atomic_dec(&conf->empty_inactive_list_nr);
list_splice_tail_init(list, conf->inactive_list + hash);
- do_wakeup = true;
+ do_wakeup[size - 1] = true;
spin_unlock_irqrestore(conf->hash_locks + hash, flags);
}
size--;
hash--;
}

- if (do_wakeup) {
- wake_up(&conf->wait_for_stripe);
- if (conf->retry_read_aligned)
- md_wakeup_thread(conf->mddev->thread);
+ for (i = 0; i < NR_STRIPE_HASH_LOCKS; i++) {
+ bool waked_thread = false;
+ if (do_wakeup[i]) {
+ wake_up(&conf->wait_for_stripe[i]);
+ if (!waked_thread) {
+ waked_thread = true;
+ md_wakeup_thread(conf->mddev->thread);
+ }
+ }
}
}

@@ -655,6 +661,18 @@ static int has_failed(struct r5conf *conf)
return 0;
}

+/* XXX: might put it to linux/wait.h to be a public API? */
+#define raid_wait_event_exclusive_cmd(wq, condition, cmd1, cmd2)	\
+do {									\
+	if (condition)							\
+		break;							\
+	(void)___wait_event(wq, condition, TASK_UNINTERRUPTIBLE, 1, 0,	\
+			    cmd1;					\
+			    schedule();					\
+			    cmd2);					\
+} while (0)
+
+
static struct stripe_head *
get_active_stripe(struct r5conf *conf, sector_t sector,
int previous, int noblock, int noquiesce)
@@ -684,14 +702,15 @@ get_active_stripe(struct r5conf *conf, sector_t sector,
if (!sh) {
set_bit(R5_INACTIVE_BLOCKED,
&conf->cache_state);
- wait_event_lock_irq(
- conf->wait_for_stripe,
+ raid_wait_event_exclusive_cmd(
+ conf->wait_for_stripe[hash],
!list_empty(conf->inactive_list + hash) &&
(atomic_read(&conf->active_stripes)
< (conf->max_nr_stripes * 3 / 4)
|| !test_bit(R5_INACTIVE_BLOCKED,
&conf->cache_state)),
- *(conf->hash_locks + hash));
+ spin_unlock_irq(conf->hash_locks + hash),
+ spin_lock_irq(conf->hash_locks + hash));
clear_bit(R5_INACTIVE_BLOCKED,
&conf->cache_state);
} else {
@@ -716,6 +735,9 @@ get_active_stripe(struct r5conf *conf, sector_t sector,
}
} while (sh == NULL);

+ if (!list_empty(conf->inactive_list + hash))
+ wake_up(&conf->wait_for_stripe[hash]);
+
spin_unlock_irq(conf->hash_locks + hash);
return sh;
}
@@ -2136,7 +2158,7 @@ static int resize_stripes(struct r5conf *conf, int newsize)
cnt = 0;
list_for_each_entry(nsh, &newstripes, lru) {
lock_device_hash_lock(conf, hash);
- wait_event_cmd(conf->wait_for_stripe,
+ raid_wait_event_exclusive_cmd(conf->wait_for_stripe[hash],
!list_empty(conf->inactive_list + hash),
unlock_device_hash_lock(conf, hash),
lock_device_hash_lock(conf, hash));
@@ -6391,7 +6413,9 @@ static struct r5conf *setup_conf(struct mddev *mddev)
spin_lock_init(&conf->device_lock);
seqcount_init(&conf->gen_lock);
init_waitqueue_head(&conf->wait_for_quiesce);
- init_waitqueue_head(&conf->wait_for_stripe);
+ for (i = 0; i < NR_STRIPE_HASH_LOCKS; i++) {
+ init_waitqueue_head(&conf->wait_for_stripe[i]);
+ }
init_waitqueue_head(&conf->wait_for_overlap);
INIT_LIST_HEAD(&conf->handle_list);
INIT_LIST_HEAD(&conf->hold_list);
diff --git a/drivers/md/raid5.h b/drivers/md/raid5.h
index fab53a3..cdad2d2 100644
--- a/drivers/md/raid5.h
+++ b/drivers/md/raid5.h
@@ -509,7 +509,7 @@ struct r5conf {
atomic_t empty_inactive_list_nr;
struct llist_head released_stripes;
wait_queue_head_t wait_for_quiesce;
- wait_queue_head_t wait_for_stripe;
+ wait_queue_head_t wait_for_stripe[NR_STRIPE_HASH_LOCKS];
wait_queue_head_t wait_for_overlap;
unsigned long cache_state;
#define R5_INACTIVE_BLOCKED 1 /* release of inactive stripes blocked,
--
1.9.0
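
The heart of the patch is the literal 1 passed as the third argument of
___wait_event() in the macro above: it marks the waiter exclusive, so a
wake_up() wakes only one such waiter instead of the whole queue. The same
contrast exists in userspace between pthread_cond_broadcast() and
pthread_cond_signal(). Below is a minimal, hypothetical sketch of the
idea in POSIX threads (not the kernel code above; 'stripes' here are just
a counter):

  /* Build: gcc -pthread wakeup_demo.c -o wakeup_demo */
  #include <pthread.h>
  #include <stdio.h>

  #define NWAITERS 8

  static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
  static pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;
  static int free_stripes;                /* the contended resource */

  static void *waiter(void *arg)
  {
  	pthread_mutex_lock(&lock);
  	while (free_stripes == 0)       /* re-check: wakeups can be spurious */
  		pthread_cond_wait(&cond, &lock);
  	free_stripes--;                 /* take one stripe */
  	pthread_mutex_unlock(&lock);
  	printf("waiter %ld got a stripe\n", (long)arg);
  	return NULL;
  }

  int main(void)
  {
  	pthread_t tid[NWAITERS];
  	long i;

  	for (i = 0; i < NWAITERS; i++)
  		pthread_create(&tid[i], NULL, waiter, (void *)i);

  	for (i = 0; i < NWAITERS; i++) {
  		pthread_mutex_lock(&lock);
  		free_stripes++;
  		/* pthread_cond_broadcast() here would wake all remaining
  		 * waiters to fight over 'lock' although only one of them
  		 * can take the stripe; waking exactly one avoids the herd,
  		 * just as the exclusive flag does for kernel wait queues. */
  		pthread_cond_signal(&cond);
  		pthread_mutex_unlock(&lock);
  	}

  	for (i = 0; i < NWAITERS; i++)
  		pthread_join(tid[i], NULL);
  	return 0;
  }

Note that pthread_cond_wait() already has the cmd1; schedule(); cmd2
shape of the macro above: it drops the mutex before sleeping and retakes
it on wakeup, which is exactly what the spin_unlock_irq()/spin_lock_irq()
pair does for the hash lock in get_active_stripe().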

* Re: [PATCH 2/2] md/raid5: exclusive wait_for_stripe
  2015-04-24 13:39 ` [PATCH 2/2] md/raid5: exclusive wait_for_stripe Yuanhan Liu
@ 2015-04-27  0:24   ` NeilBrown
  2015-04-27  2:16     ` Yuanhan Liu
  0 siblings, 1 reply; 7+ messages in thread
From: NeilBrown @ 2015-04-27  0:24 UTC (permalink / raw)
To: Yuanhan Liu; +Cc: linux-raid, linux-kernel

On Fri, 24 Apr 2015 21:39:04 +0800 Yuanhan Liu <yuanhan.liu@linux.intel.com>
wrote:

> I noticed heavy spin lock contention at get_active_stripe() with fsmark
> multiple thread write workloads.
>
> [...]
>
> diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
> index b7e385f..2d8fcc1 100644
> --- a/drivers/md/raid5.c
> +++ b/drivers/md/raid5.c
> @@ -344,7 +344,8 @@ static void release_inactive_stripe_list(struct r5conf *conf,
>  int hash)
>  {
>  int size;
> - bool do_wakeup = false;
> + bool do_wakeup[NR_STRIPE_HASH_LOCKS] = { false, };

I think I'd rather use an 'unsigned long' and set bits.

> + int i = 0;
>  unsigned long flags;
>
>  if (hash == NR_STRIPE_HASH_LOCKS) {
> @@ -365,17 +366,22 @@ static void release_inactive_stripe_list(struct r5conf *conf,
>  !list_empty(list))
>  atomic_dec(&conf->empty_inactive_list_nr);
>  list_splice_tail_init(list, conf->inactive_list + hash);
> - do_wakeup = true;
> + do_wakeup[size - 1] = true;

... so this becomes

    do_wakeup |= 1 << (size - 1);

>  spin_unlock_irqrestore(conf->hash_locks + hash, flags);
>  }
>  size--;
>  hash--;
>  }
>
> - if (do_wakeup) {
> - wake_up(&conf->wait_for_stripe);
> - if (conf->retry_read_aligned)
> - md_wakeup_thread(conf->mddev->thread);
> + for (i = 0; i < NR_STRIPE_HASH_LOCKS; i++) {
> + bool waked_thread = false;
> + if (do_wakeup[i]) {
> + wake_up(&conf->wait_for_stripe[i]);
> + if (!waked_thread) {
> + waked_thread = true;
> + md_wakeup_thread(conf->mddev->thread);
> + }
> + }

I don't think you want waked_thread to be local to this loop.
As it is, the "if (!waked_thread)" test *always* succeeds.

You can discard it if do_wakeup becomes an unsigned long, and just do

    if (do_wakeup && conf->retry_read_aligned)
            md_wakeup_thread(conf->mddev->thread);

And why have you removed the test on conf->retry_read_aligned??

>  }
>  }
>
> @@ -655,6 +661,18 @@ static int has_failed(struct r5conf *conf)
>  return 0;
>  }
>
> +/* XXX: might put it to linux/wait.h to be a public API? */

Yes, definitely put it in linux/wait.h

Thanks,
NeilBrown

> [...]
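
NeilBrown's bitmask suggestion can be sketched standalone like this. It
is a hypothetical demo, not the kernel function; the names echo the patch
and the prints stand in for the actual wakeup calls:

  /* Build: gcc bitmask_demo.c -o bitmask_demo */
  #include <stdio.h>

  #define NR_STRIPE_HASH_LOCKS 8

  int main(void)
  {
  	unsigned long do_wakeup = 0;
  	int retry_read_aligned = 1;     /* stands in for conf->retry_read_aligned */
  	int size, i;

  	/* where the patch set do_wakeup[size - 1] = true,
  	 * set a bit instead (marking hashes 7, 4 and 1 here): */
  	for (size = NR_STRIPE_HASH_LOCKS; size > 0; size -= 3)
  		do_wakeup |= 1UL << (size - 1);

  	/* where the patch looped over the bool array, test bits: */
  	for (i = 0; i < NR_STRIPE_HASH_LOCKS; i++)
  		if (do_wakeup & (1UL << i))
  			printf("wake_up(&conf->wait_for_stripe[%d])\n", i);

  	/* one test outside the loop replaces the per-iteration
  	 * md_wakeup_thread() call and restores the dropped
  	 * conf->retry_read_aligned check: */
  	if (do_wakeup && retry_read_aligned)
  		puts("md_wakeup_thread(conf->mddev->thread)");

  	return 0;
  }

The single bitmask also answers the waked_thread problem above: "did any
hash need a wakeup?" becomes one truth test on do_wakeup, with no
per-iteration flag to get wrong.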

* Re: [PATCH 2/2] md/raid5: exclusive wait_for_stripe
  2015-04-27  0:24 ` NeilBrown
@ 2015-04-27  2:16   ` Yuanhan Liu
  0 siblings, 0 replies; 7+ messages in thread
From: Yuanhan Liu @ 2015-04-27  2:16 UTC (permalink / raw)
To: NeilBrown; +Cc: linux-raid, linux-kernel

On Mon, Apr 27, 2015 at 10:24:05AM +1000, NeilBrown wrote:
> On Fri, 24 Apr 2015 21:39:04 +0800 Yuanhan Liu <yuanhan.liu@linux.intel.com>
> wrote:
>
> > [...]
> >
> > - bool do_wakeup = false;
> > + bool do_wakeup[NR_STRIPE_HASH_LOCKS] = { false, };
>
> I think I'd rather use an 'unsigned long' and set bits.

Will do that.

> [...]
>
> I don't think you want waked_thread to be local to this loop.
> As it is, the "if (!waked_thread)" test *always* succeeds.
>
> You can discard it if do_wakeup becomes an unsigned long, and just do
>
>     if (do_wakeup && conf->retry_read_aligned)
>             md_wakeup_thread(conf->mddev->thread);
>
> And why have you removed the test on conf->retry_read_aligned??

Oops, a careless edit.

> > +/* XXX: might put it to linux/wait.h to be a public API? */
>
> Yes, definitely put it in linux/wait.h

I will send a separate patch for that.

Thanks.

	--yliu

> [...]

* Re: [PATCH 1/2] md/raid5: split wait_for_stripe and introduce wait_for_quiesce
  2015-04-24 13:39 [PATCH 1/2] md/raid5: split wait_for_stripe and introduce wait_for_quiesce Yuanhan Liu
  2015-04-24 13:39 ` [PATCH 2/2] md/raid5: exclusive wait_for_stripe Yuanhan Liu
@ 2015-04-27  0:10 ` NeilBrown
  2015-04-27  2:12   ` Yuanhan Liu
  1 sibling, 1 reply; 7+ messages in thread
From: NeilBrown @ 2015-04-27  0:10 UTC (permalink / raw)
To: Yuanhan Liu; +Cc: linux-raid, linux-kernel

On Fri, 24 Apr 2015 21:39:03 +0800 Yuanhan Liu <yuanhan.liu@linux.intel.com>
wrote:

> If I read the code correctly, the current wait_for_stripe actually has
> 2 uses:
>
> - wait for there to be enough free stripe cache, triggered when
>   get_free_stripe() fails. This is what wait_for_stripe is literally
>   intended for.
>
> - wait for quiesce == 0 or
>   active_aligned_reads == 0 && active_stripes == 0
>
>   This has nothing to do with wait_for_stripe literally, and releasing
>   an active stripe won't actually wake these waiters up. Conversely, a
>   wake_up issued for this case won't actually wake up a process waiting
>   for a free stripe to become available.

I disagree.  Releasing an active stripe *will* (or *can*) wake up that third
case, as it decrements "active_stripes" which will eventually reach zero.

I don't think your new code will properly wake up a process which is waiting
for "active_stripes == 0".

> Hence, we'd better split wait_for_stripe, and here I introduce
> wait_for_quiesce for the second usage. The name may not be well chosen,
> or may even be wrong; feel free to correct me.
>
> This is also a preparatory patch for the next one: making
> wait_for_stripe exclusive.

I think you have this commit description upside down :-)

The real motivation is that you are seeing contention on some spinlock and so
you want to split 'wait_for_stripe' up into multiple wait_queues so that you
can use exclusive wakeup.  As this is the main motivation, it should be
stated first.

Then explain that 'wait_for_stripe' is used to wait for the array to enter or
leave the quiescent state, and also to wait for an available stripe in each
of the hash lists.

So this patch splits the first usage off into a separate wait_queue, and the
next patch will split the second usage into one waitqueue for each hash value.

Then explain just what is needed for that first step.

When you put it that way around, the patch makes lots of sense.

So: could you please resubmit with the description the right way around, and
with an appropriate wakeup call to ensure raid5_quiesce is woken up when
active_stripes reaches zero?

Thanks,
NeilBrown

> [...]

* Re: [PATCH 1/2] md/raid5: split wait_for_stripe and introduce wait_for_quiesce
  2015-04-27  0:10 ` [PATCH 1/2] md/raid5: split wait_for_stripe and introduce wait_for_quiesce NeilBrown
@ 2015-04-27  2:12   ` Yuanhan Liu
  2015-04-27  2:24     ` NeilBrown
  0 siblings, 1 reply; 7+ messages in thread
From: Yuanhan Liu @ 2015-04-27  2:12 UTC (permalink / raw)
To: NeilBrown; +Cc: linux-raid, linux-kernel

On Mon, Apr 27, 2015 at 10:10:24AM +1000, NeilBrown wrote:
> On Fri, 24 Apr 2015 21:39:03 +0800 Yuanhan Liu <yuanhan.liu@linux.intel.com>
> wrote:
>
> > [...]
> >
> >   This has nothing to do with wait_for_stripe literally, and releasing
> >   an active stripe won't actually wake these waiters up.
>
> I disagree.  Releasing an active stripe *will* (or *can*) wake up that third
> case, as it decrements "active_stripes" which will eventually reach zero.
>
> I don't think your new code will properly wake up a process which is waiting
> for "active_stripes == 0".

Right, and thanks for pointing it out. So, is this enough?

---
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 2d8fcc1..3f23035 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -383,6 +383,9 @@ static void release_inactive_stripe_list(struct r5conf *conf,
}
}
}
+
+ if (!atomic_read(&conf->active_stripes))
+ wake_up(&conf->wait_for_quiesce);
}

/* should hold conf->device_lock already */


Or, should I put it a bit ahead, trying to invoke
wake_up(&conf->wait_for_quiesce) after each
atomic_dec(&conf->active_stripes)?

    if (atomic_dec_return(&conf->active_stripes) == 0)
            wake_up(&conf->wait_for_quiesce);

> [...]
>
> I think you have this commit description upside down :-)
>
> [...]
>
> When you put it that way around, the patch makes lots of sense.

It does, and thanks!

> So: could you please resubmit with the description the right way around, and

To make sure I followed you correctly: my patch order is correct (I mean,
split the lock first, and make wait_for_stripe per lock hash and exclusive
second), and what I need to do is rewrite the commit log as you suggested
and fix all the issues you pointed out. Right?

	--yliu

> with an appropriate wakeup call to ensure raid5_quiesce is woken up when
> active_stripes reaches zero?
>
> Thanks,
> NeilBrown
>
> [...]
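
For illustration, the second alternative, waking the quiesce waiter the
moment a decrement brings the counter to zero, can be sketched standalone
in userspace with C11 atomics. This is a hypothetical analogy, not kernel
code; note that atomic_fetch_sub() returns the *previous* value, so
'old == 1' corresponds to the kernel's atomic_dec_return(...) == 0:

  /* Build: gcc -std=c11 -pthread drain_demo.c -o drain_demo */
  #include <stdatomic.h>
  #include <pthread.h>
  #include <stdio.h>

  #define NSTRIPES 4

  static atomic_int active_stripes = NSTRIPES;
  static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
  static pthread_cond_t wait_for_quiesce = PTHREAD_COND_INITIALIZER;

  static void *release_stripe(void *unused)
  {
  	(void)unused;
  	if (atomic_fetch_sub(&active_stripes, 1) == 1) {
  		/* we just brought the count to zero: issue the wakeup */
  		pthread_mutex_lock(&lock);
  		pthread_cond_signal(&wait_for_quiesce);
  		pthread_mutex_unlock(&lock);
  	}
  	return NULL;
  }

  int main(void)
  {
  	pthread_t tid[NSTRIPES];
  	int i;

  	for (i = 0; i < NSTRIPES; i++)
  		pthread_create(&tid[i], NULL, release_stripe, NULL);

  	/* the quiesce waiter: sleep until active_stripes == 0 */
  	pthread_mutex_lock(&lock);
  	while (atomic_load(&active_stripes) != 0)
  		pthread_cond_wait(&wait_for_quiesce, &lock);
  	pthread_mutex_unlock(&lock);
  	puts("quiesced: active_stripes == 0");

  	for (i = 0; i < NSTRIPES; i++)
  		pthread_join(tid[i], NULL);
  	return 0;
  }

Because the signaler takes the mutex before signalling and the waiter
re-checks the counter under that mutex, the wakeup cannot be lost even if
the last decrement races with the waiter going to sleep.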

* Re: [PATCH 1/2] md/raid5: split wait_for_stripe and introduce wait_for_quiesce
  2015-04-27  2:12 ` Yuanhan Liu
@ 2015-04-27  2:24   ` NeilBrown
  0 siblings, 0 replies; 7+ messages in thread
From: NeilBrown @ 2015-04-27  2:24 UTC (permalink / raw)
To: Yuanhan Liu; +Cc: linux-raid, linux-kernel

On Mon, 27 Apr 2015 10:12:49 +0800 Yuanhan Liu <yuanhan.liu@linux.intel.com>
wrote:

> On Mon, Apr 27, 2015 at 10:10:24AM +1000, NeilBrown wrote:
> > [...]
> >
> > I don't think your new code will properly wake up a process which is waiting
> > for "active_stripes == 0".
>
> Right, and thanks for pointing it out. So, is this enough?
>
> ---
> diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
> index 2d8fcc1..3f23035 100644
> --- a/drivers/md/raid5.c
> +++ b/drivers/md/raid5.c
> @@ -383,6 +383,9 @@ static void release_inactive_stripe_list(struct r5conf *conf,
>  }
>  }
>  }
> +
> + if (!atomic_read(&conf->active_stripes))
> + wake_up(&conf->wait_for_quiesce);
>  }
>
>  /* should hold conf->device_lock already */
>
>
> Or, should I put it a bit ahead, trying to invoke
> wake_up(&conf->wait_for_quiesce) after each
> atomic_dec(&conf->active_stripes)?
>
>     if (atomic_dec_return(&conf->active_stripes) == 0)
>             wake_up(&conf->wait_for_quiesce);

I think the first version is fine.  While waiting for active_stripes to be
zero, ->quiesce is set to 2, and so no new stripes should get used.

> [...]
>
> To make sure I followed you correctly: my patch order is correct (I mean,
> split the lock first, and make wait_for_stripe per lock hash and exclusive
> second), and what I need to do is rewrite the commit log as you suggested
> and fix all the issues you pointed out. Right?

Correct.

Thanks,
NeilBrown

end of thread, other threads: [~2015-04-27  2:24 UTC | newest]

Thread overview: 7+ messages
2015-04-24 13:39 [PATCH 1/2] md/raid5: split wait_for_stripe and introduce wait_for_quiesce Yuanhan Liu
2015-04-24 13:39 ` [PATCH 2/2] md/raid5: exclusive wait_for_stripe Yuanhan Liu
2015-04-27  0:24   ` NeilBrown
2015-04-27  2:16     ` Yuanhan Liu
2015-04-27  0:10 ` [PATCH 1/2] md/raid5: split wait_for_stripe and introduce wait_for_quiesce NeilBrown
2015-04-27  2:12   ` Yuanhan Liu
2015-04-27  2:24     ` NeilBrown