From: Yuanhan Liu
Subject: [PATCH 2/2] md/raid5: exclusive wait_for_stripe
Date: Fri, 24 Apr 2015 21:39:04 +0800
Message-ID: <1429882744-22655-2-git-send-email-yuanhan.liu@linux.intel.com>
References: <1429882744-22655-1-git-send-email-yuanhan.liu@linux.intel.com>
In-Reply-To: <1429882744-22655-1-git-send-email-yuanhan.liu@linux.intel.com>
To: neilb@suse.de
Cc: linux-raid@vger.kernel.org, linux-kernel@vger.kernel.org, Yuanhan Liu

I noticed heavy spin lock contention at get_active_stripe() with fsmark
multi-threaded write workloads.

Here is how this hot contention arises. We have a limited number of
stripes, and it is a multi-threaded write workload, so the stripes are
taken up quickly, which puts later processes to sleep waiting for free
stripes. When enough stripes (> 1/4 of all stripes) are released, all of
those processes are woken up and try to get the lock. But for each hash
lock only one of them can take it, leaving the other processes spinning
while they try to acquire it.

Thus, it is wasteful to wake up all processes and let them fight over a
lock that only one of them can hold at a time. Instead, we can make the
wake-up exclusive: wake up one process only. That naturally avoids the
heavy spin lock contention.
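
(Illustration only, not part of the patch: a minimal sketch of the
exclusive-wakeup pattern described above, using the stock wait-queue
API; the demo_* names are made up for this example.)

#include <linux/wait.h>
#include <linux/sched.h>

static DECLARE_WAIT_QUEUE_HEAD(demo_wq);
static bool demo_resource_free;

/* Sleep until demo_resource_free is set, as an *exclusive* waiter. */
static void demo_wait_for_resource(void)
{
	DEFINE_WAIT(wait);

	for (;;) {
		/* Exclusive waiters are queued at the tail of the waitqueue. */
		prepare_to_wait_exclusive(&demo_wq, &wait, TASK_UNINTERRUPTIBLE);
		if (demo_resource_free)
			break;
		schedule();
	}
	finish_wait(&demo_wq, &wait);
}

static void demo_release_resource(void)
{
	demo_resource_free = true;
	/* Wakes all non-exclusive waiters but at most ONE exclusive waiter. */
	wake_up(&demo_wq);
}

With the old single, non-exclusive wait_for_stripe queue, every wake_up()
produced the thundering herd described above; with one exclusive
waitqueue per hash lock, a wake_up() hands the freed stripes to a single
waiter.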

Here are some test results I got with this patch applied (each test was
run 3 times):

`fsmark.files_per_sec'
======================

      next-20150317            this patch
  --------------------     --------------------
  metric_value ±stddev     metric_value ±stddev     change   testbox/benchmark/testcase-params
  --------------------     --------------------    -------   ---------------------------------
      25.600  ±0.0             92.700  ±2.5         262.1%   ivb44/fsmark/1x-64t-4BRD_12G-RAID5-btrfs-4M-30G-fsyncBeforeClose
      25.600  ±0.0             77.800  ±0.6         203.9%   ivb44/fsmark/1x-64t-9BRD_6G-RAID5-btrfs-4M-30G-fsyncBeforeClose
      32.000  ±0.0             93.800  ±1.7         193.1%   ivb44/fsmark/1x-64t-4BRD_12G-RAID5-ext4-4M-30G-fsyncBeforeClose
      32.000  ±0.0             81.233  ±1.7         153.9%   ivb44/fsmark/1x-64t-9BRD_6G-RAID5-ext4-4M-30G-fsyncBeforeClose
      48.800  ±14.5            99.667  ±2.0         104.2%   ivb44/fsmark/1x-64t-4BRD_12G-RAID5-xfs-4M-30G-fsyncBeforeClose
       6.400  ±0.0             12.800  ±0.0         100.0%   ivb44/fsmark/1x-64t-3HDD-RAID5-btrfs-4M-40G-fsyncBeforeClose
      63.133  ±8.2             82.800  ±0.7          31.2%   ivb44/fsmark/1x-64t-9BRD_6G-RAID5-xfs-4M-30G-fsyncBeforeClose
     245.067  ±0.7            306.567  ±7.9          25.1%   ivb44/fsmark/1x-64t-4BRD_12G-RAID5-f2fs-4M-30G-fsyncBeforeClose
      17.533  ±0.3             21.000  ±0.8          19.8%   ivb44/fsmark/1x-1t-3HDD-RAID5-xfs-4M-40G-fsyncBeforeClose
     188.167  ±1.9            215.033  ±3.1          14.3%   ivb44/fsmark/1x-1t-4BRD_12G-RAID5-btrfs-4M-30G-NoSync
     254.500  ±1.8            290.733  ±2.4          14.2%   ivb44/fsmark/1x-1t-9BRD_6G-RAID5-btrfs-4M-30G-NoSync

`time.system_time'
==================

      next-20150317            this patch
  --------------------     --------------------
  metric_value ±stddev     metric_value ±stddev     change   testbox/benchmark/testcase-params
  --------------------     --------------------    -------   ---------------------------------
    7235.603  ±1.2            185.163  ±1.9         -97.4%   ivb44/fsmark/1x-64t-4BRD_12G-RAID5-btrfs-4M-30G-fsyncBeforeClose
    7666.883  ±2.9            202.750  ±1.0         -97.4%   ivb44/fsmark/1x-64t-9BRD_6G-RAID5-btrfs-4M-30G-fsyncBeforeClose
   14567.893  ±0.7            421.230  ±0.4         -97.1%   ivb44/fsmark/1x-64t-3HDD-RAID5-btrfs-4M-40G-fsyncBeforeClose
    3697.667  ±14.0           148.190  ±1.7         -96.0%   ivb44/fsmark/1x-64t-4BRD_12G-RAID5-xfs-4M-30G-fsyncBeforeClose
    5572.867  ±3.8            310.717  ±1.4         -94.4%   ivb44/fsmark/1x-64t-9BRD_6G-RAID5-ext4-4M-30G-fsyncBeforeClose
    5565.050  ±0.5            313.277  ±1.5         -94.4%   ivb44/fsmark/1x-64t-4BRD_12G-RAID5-ext4-4M-30G-fsyncBeforeClose
    2420.707  ±17.1           171.043  ±2.7         -92.9%   ivb44/fsmark/1x-64t-9BRD_6G-RAID5-xfs-4M-30G-fsyncBeforeClose
    3743.300  ±4.6            379.827  ±3.5         -89.9%   ivb44/fsmark/1x-64t-3HDD-RAID5-ext4-4M-40G-fsyncBeforeClose
    3308.687  ±6.3            363.050  ±2.0         -89.0%   ivb44/fsmark/1x-64t-3HDD-RAID5-xfs-4M-40G-fsyncBeforeClose

Where,

             1x: 'x' is the number of iterations (loops), corresponding
                 to the 'L' option of fsmark
        1t, 64t: 't' is the number of threads
             4M: the size of a single file, corresponding to the '-s'
                 option of fsmark
  40G, 30G, 120G: the total test size
       4BRD_12G: BRD is the ram disk; '4' means 4 ram disks, and '12G'
                 is the size of one ram disk, so 48G in total, with the
                 RAID array built on top of those ram disks

As you can see, although there is not much performance gain for the hard
disk workloads, the system time drops heavily, by up to 97%. And, as
expected, performance increases a lot, by up to 260%, for fast devices
(ram disks).

Signed-off-by: Yuanhan Liu
---
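
Note for reviewers (illustration only, nothing here is meant to be
applied): the helper added below differs from the existing
wait_event_cmd() only in the `exclusive' argument handed to
___wait_event(). Roughly, assuming the
___wait_event(wq, condition, state, exclusive, ret, cmd) signature from
include/linux/wait.h (the _sketch name is made up):

#include <linux/wait.h>
#include <linux/sched.h>

/* Simplified form of the existing, non-exclusive wait_event_cmd(). */
#define wait_event_cmd_sketch(wq, condition, cmd1, cmd2)		\
do {									\
	if (condition)							\
		break;							\
	/* exclusive = 0: a wake_up() wakes every waiter here */	\
	(void)___wait_event(wq, condition, TASK_UNINTERRUPTIBLE, 0, 0,	\
			    cmd1; schedule(); cmd2);			\
} while (0)

raid_wait_event_exclusive_cmd() in the patch is the same construct with
exclusive = 1, so the waiter is queued at the tail as an exclusive entry
and a wake_up() on the per-hash waitqueue wakes at most one process.
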
 drivers/md/raid5.c | 46 +++++++++++++++++++++++++++++++++++-----------
 drivers/md/raid5.h |  2 +-
 2 files changed, 36 insertions(+), 12 deletions(-)

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index b7e385f..2d8fcc1 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -344,7 +344,8 @@ static void release_inactive_stripe_list(struct r5conf *conf,
 					 int hash)
 {
 	int size;
-	bool do_wakeup = false;
+	bool do_wakeup[NR_STRIPE_HASH_LOCKS] = { false, };
+	int i = 0;
 	unsigned long flags;
 
 	if (hash == NR_STRIPE_HASH_LOCKS) {
@@ -365,17 +366,22 @@ static void release_inactive_stripe_list(struct r5conf *conf,
 			    !list_empty(list))
 				atomic_dec(&conf->empty_inactive_list_nr);
 			list_splice_tail_init(list, conf->inactive_list + hash);
-			do_wakeup = true;
+			do_wakeup[size - 1] = true;
 			spin_unlock_irqrestore(conf->hash_locks + hash, flags);
 		}
 		size--;
 		hash--;
 	}
 
-	if (do_wakeup) {
-		wake_up(&conf->wait_for_stripe);
-		if (conf->retry_read_aligned)
-			md_wakeup_thread(conf->mddev->thread);
+	for (i = 0; i < NR_STRIPE_HASH_LOCKS; i++) {
+		bool waked_thread = false;
+		if (do_wakeup[i]) {
+			wake_up(&conf->wait_for_stripe[i]);
+			if (!waked_thread) {
+				waked_thread = true;
+				md_wakeup_thread(conf->mddev->thread);
+			}
+		}
 	}
 }
 
@@ -655,6 +661,18 @@ static int has_failed(struct r5conf *conf)
 	return 0;
 }
 
+/* XXX: might put it to linux/wait.h to be a public API? */
+#define raid_wait_event_exclusive_cmd(wq, condition, cmd1, cmd2)	\
+do {									\
+	if (condition)							\
+		break;							\
+	(void)___wait_event(wq, condition, TASK_UNINTERRUPTIBLE, 1, 0,	\
+			    cmd1;					\
+			    schedule();					\
+			    cmd2);					\
+} while (0)
+
+
 static struct stripe_head *
 get_active_stripe(struct r5conf *conf, sector_t sector,
 		  int previous, int noblock, int noquiesce)
@@ -684,14 +702,15 @@ get_active_stripe(struct r5conf *conf, sector_t sector,
 		if (!sh) {
 			set_bit(R5_INACTIVE_BLOCKED,
 				&conf->cache_state);
-			wait_event_lock_irq(
-				conf->wait_for_stripe,
+			raid_wait_event_exclusive_cmd(
+				conf->wait_for_stripe[hash],
 				!list_empty(conf->inactive_list + hash) &&
 				(atomic_read(&conf->active_stripes)
 				 < (conf->max_nr_stripes * 3 / 4)
 				 || !test_bit(R5_INACTIVE_BLOCKED,
					      &conf->cache_state)),
-				*(conf->hash_locks + hash));
+				spin_unlock_irq(conf->hash_locks + hash),
+				spin_lock_irq(conf->hash_locks + hash));
 			clear_bit(R5_INACTIVE_BLOCKED,
 				  &conf->cache_state);
 		} else {
@@ -716,6 +735,9 @@ get_active_stripe(struct r5conf *conf, sector_t sector,
 		}
 	} while (sh == NULL);
 
+	if (!list_empty(conf->inactive_list + hash))
+		wake_up(&conf->wait_for_stripe[hash]);
+
 	spin_unlock_irq(conf->hash_locks + hash);
 	return sh;
 }
@@ -2136,7 +2158,7 @@ static int resize_stripes(struct r5conf *conf, int newsize)
 	cnt = 0;
 	list_for_each_entry(nsh, &newstripes, lru) {
 		lock_device_hash_lock(conf, hash);
-		wait_event_cmd(conf->wait_for_stripe,
+		raid_wait_event_exclusive_cmd(conf->wait_for_stripe[hash],
 				    !list_empty(conf->inactive_list + hash),
 				    unlock_device_hash_lock(conf, hash),
 				    lock_device_hash_lock(conf, hash));
@@ -6391,7 +6413,9 @@ static struct r5conf *setup_conf(struct mddev *mddev)
 	spin_lock_init(&conf->device_lock);
 	seqcount_init(&conf->gen_lock);
 	init_waitqueue_head(&conf->wait_for_quiesce);
-	init_waitqueue_head(&conf->wait_for_stripe);
+	for (i = 0; i < NR_STRIPE_HASH_LOCKS; i++) {
+		init_waitqueue_head(&conf->wait_for_stripe[i]);
+	}
 	init_waitqueue_head(&conf->wait_for_overlap);
 	INIT_LIST_HEAD(&conf->handle_list);
 	INIT_LIST_HEAD(&conf->hold_list);
diff --git a/drivers/md/raid5.h b/drivers/md/raid5.h
index fab53a3..cdad2d2 100644
--- a/drivers/md/raid5.h
+++ b/drivers/md/raid5.h
@@ -509,7 +509,7 @@ struct r5conf {
 	atomic_t		empty_inactive_list_nr;
 	struct llist_head	released_stripes;
 	wait_queue_head_t	wait_for_quiesce;
-	wait_queue_head_t	wait_for_stripe;
+	wait_queue_head_t	wait_for_stripe[NR_STRIPE_HASH_LOCKS];
 	wait_queue_head_t	wait_for_overlap;
 	unsigned long		cache_state;
 #define R5_INACTIVE_BLOCKED	1	/* release of inactive stripes blocked,
-- 
1.9.0
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html