From: NeilBrown <neilb@suse.de>
To: Shaohua Li <shli@kernel.org>
Cc: linux-raid@vger.kernel.org, tj@kernel.org, dan.j.williams@gmail.com
Subject: Re: [patch v3 3/5] raid5: offload stripe handle to workqueue
Date: Wed, 28 Aug 2013 16:56:14 +1000
Message-ID: <20130828165614.011b4331@notabene.brown>
In-Reply-To: <20130828063016.GB17163@kernel.org>
[-- Attachment #1: Type: text/plain, Size: 16854 bytes --]
On Wed, 28 Aug 2013 14:30:16 +0800 Shaohua Li <shli@kernel.org> wrote:
> On Wed, Aug 28, 2013 at 01:53:05PM +1000, NeilBrown wrote:
> > On Tue, 27 Aug 2013 17:50:41 +0800 Shaohua Li <shli@kernel.org> wrote:
> >
> > > -static struct stripe_head *__get_priority_stripe(struct r5conf *conf)
> > > +static struct stripe_head *__get_priority_stripe(struct r5conf *conf, int group)
> > > {
> > > - struct stripe_head *sh;
> > > + struct stripe_head *sh = NULL, *tmp;
> > > + struct list_head *handle_list = NULL;
> > > +
> > > + if (conf->worker_cnt_per_group == 0) {
> > > + handle_list = &conf->handle_list;
> > > + if (list_empty(handle_list))
> > > + handle_list = NULL;
> > > + } else if (group != ANY_GROUP) {
> > > + handle_list = &conf->worker_groups[group].handle_list;
> > > + if (list_empty(handle_list))
> > > + handle_list = NULL;
> > > + } else {
> > > + int i;
> > > + for (i = 0; i < conf->group_cnt; i++) {
> > > + handle_list = &conf->worker_groups[i].handle_list;
> > > + if (!list_empty(handle_list))
> > > + break;
> > > + }
> > > + if (i == conf->group_cnt)
> > > + handle_list = NULL;
> > > + }
> >
> > Sorry, I meant to mention this last time...
> >
> > The setting of handle_list to NULL seems unnecessary. It is sufficient to
> > test if it is empty.
> > The only interesting case is the last 'else' above. That is only reached if
> > worker_cnt_per_group != 0 so handle_list will get set to some list_head.
> > If all list_heads are empty, handle_list will be set to the last list_head
> > so a list_empty(handle_list) test is perfectly safe.
> >
> > Also with the current code:
> > >
> > > pr_debug("%s: handle: %s hold: %s full_writes: %d bypass_count: %d\n",
> > > __func__,
> > > - list_empty(&conf->handle_list) ? "empty" : "busy",
> > > + list_empty(handle_list) ? "empty" : "busy",
> > ^^^^^^^^^^^^^^^^^^^^^^^
> > This will crash.
>
> Fixed. The other patches still apply, with offsets in some hunks.
>
>
> Subject: raid5: offload stripe handle to workqueue
>
> This is another attempt to create multiple threads to handle raid5 stripes.
> This time I use workqueue.
>
> raid5 handles requests (especially writes) in stripe units. A stripe is
> page-size aligned/long and spans all disks. When writing to any disk sector,
> raid5 runs a state machine for the corresponding stripe, which includes
> reading some disks of the stripe, calculating parity, and writing some disks
> of the stripe. The state machine currently runs in the raid5d thread. Since
> there is only one thread, it doesn't scale well for high-speed storage. An
> obvious solution is multi-threading.
>
> To get better performance, we have some requirements:
> a. Locality. A stripe corresponding to a request submitted from one CPU is
> better handled by a thread on the local CPU or the local node. The local CPU
> is preferred, but it can sometimes become a bottleneck, for example when
> parity calculation is too heavy; running on the local node adapts to a wider
> range of workloads.
> b. Configurability. Different raid5 array setups might need different
> configurations, especially the thread count. More threads don't always mean
> better performance because of lock contention.
>
> My original implementation created some kernel threads, with interfaces to
> control which CPUs' stripes each thread should handle, and userspace could
> set the affinity of the threads. This provided the most flexibility and
> configurability, but it was hard to use, and a new thread pool
> implementation is apparently disfavored.
>
> Recent workqueue improvements are quite promising. An unbound workqueue is
> bound to a NUMA node. If WQ_SYSFS is set for a workqueue, there are sysfs
> options for affinity setting; for example, we can include only one HT
> sibling in the affinity mask. Work items are non-reentrant by default, and
> we can control the number of running threads by limiting the number of
> dispatched work_structs.
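>
> As a minimal illustration (not part of this patch; the name "example_wq"
> and the init function are made up), this is roughly how such a workqueue is
> created. WQ_SYSFS is what exposes the attributes (e.g. "cpumask" and
> "nice") under /sys/devices/virtual/workqueue/<name>/:
>
>     #include <linux/workqueue.h>
>
>     static struct workqueue_struct *example_wq;
>
>     static int __init example_init(void)
>     {
>             /* Unbound: worker pools are grouped per NUMA node rather
>              * than pinned to the submitting CPU.  WQ_SYSFS exposes
>              * the workqueue under /sys/devices/virtual/workqueue/ so
>              * userspace can restrict its affinity, e.g. to one HT
>              * sibling per core.
>              */
>             example_wq = alloc_workqueue("example_wq",
>                                          WQ_UNBOUND | WQ_SYSFS, 0);
>             if (!example_wq)
>                     return -ENOMEM;
>             return 0;
>     }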
>
> In this patch, I created several stripe worker groups. A group corresponds
> to a NUMA node: stripes from the CPUs of one node are added to that group's
> list, and workqueue threads running on that node only handle stripes from
> the node's worker group. In this way, stripe handling has NUMA locality.
> And, as mentioned above, we can control the thread count by limiting the
> number of dispatched work_structs. A rough sketch of the routing idea
> follows.
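>
> Illustrative fragment only (the patch below wraps this in the
> cpu_to_group() macro and raid5_wakeup_stripe_thread(); locking is omitted
> here for brevity):
>
>     /* Queue the stripe on the handle_list of the worker group for
>      * the NUMA node of the CPU that submitted it; that group's
>      * workers will pick it up.
>      */
>     struct r5worker_group *group;
>
>     group = &conf->worker_groups[cpu_to_node(sh->cpu)];
>     list_add_tail(&sh->lru, &group->handle_list);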
>
> The work_struct callback function handles several stripes in one run. A
> typical workqueue usage is to run one unit per work_struct, and in the
> raid5 case the unit would be a stripe. But we can't do that here (see the
> sketch after this list):
> a. Though handling a stripe needs no lock (because of reference accounting,
> and the stripe isn't on any list), queuing a work_struct for each stripe
> would contend the workqueue lock very heavily.
> b. blk_start_plug()/blk_finish_plug() should surround stripe handling, as
> we might dispatch requests. If each work_struct handled only one stripe,
> such block plugging would be meaningless.
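>
> A minimal sketch of the batched-callback shape, in the context of raid5.c
> (illustrative only; example_next_stripe() is a made-up stand-in for
> __get_priority_stripe(), and the device_lock handling done by the real
> raid5_do_work() below is omitted):
>
>     static void example_batch_work(struct work_struct *work)
>     {
>             struct blk_plug plug;
>             int handled = 0;
>
>             /* One plug around the whole batch, so requests generated
>              * by all stripes handled in this run can be merged before
>              * being dispatched to the lower layers.
>              */
>             blk_start_plug(&plug);
>             while (handled < MAX_STRIPE_BATCH) {
>                     struct stripe_head *sh = example_next_stripe();
>
>                     if (!sh)
>                             break;
>                     handle_stripe(sh);
>                     handled++;
>             }
>             blk_finish_plug(&plug);
>     }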
>
> This implementation can't do very fine-grained configuration, but NUMA
> binding is the most popular usage model and should be enough for most
> workloads.
>
> Note: since we have only one stripe queue, switching to multi-threading
> might decrease the request size dispatched down to the lower-level layers.
> The impact depends on the thread count, the raid configuration and the
> workload, so multi-threaded raid5 might not be appropriate for all setups.
>
> Changes V1 -> V2:
> 1. removed WQ_NON_REENTRANT
> 2. disabled multi-threading by default
> 3. added more description to the changelog
>
> Signed-off-by: Shaohua Li <shli@fusionio.com>
> ---
> drivers/md/raid5.c | 186 ++++++++++++++++++++++++++++++++++++++++++++++++-----
> drivers/md/raid5.h | 15 ++++
> 2 files changed, 186 insertions(+), 15 deletions(-)
>
> Index: linux/drivers/md/raid5.c
> ===================================================================
> --- linux.orig/drivers/md/raid5.c 2013-08-28 13:54:17.260930690 +0800
> +++ linux/drivers/md/raid5.c 2013-08-28 13:59:13.225210586 +0800
> @@ -53,6 +53,7 @@
> #include <linux/cpu.h>
> #include <linux/slab.h>
> #include <linux/ratelimit.h>
> +#include <linux/nodemask.h>
> #include <trace/events/block.h>
>
> #include "md.h"
> @@ -60,6 +61,10 @@
> #include "raid0.h"
> #include "bitmap.h"
>
> +#define cpu_to_group(cpu) cpu_to_node(cpu)
> +#define ANY_GROUP NUMA_NO_NODE
> +
> +static struct workqueue_struct *raid5_wq;
> /*
> * Stripe cache
> */
> @@ -200,6 +205,34 @@ static int stripe_operations_active(stru
> test_bit(STRIPE_COMPUTE_RUN, &sh->state);
> }
>
> +static void raid5_wakeup_stripe_thread(struct stripe_head *sh)
> +{
> + struct r5conf *conf = sh->raid_conf;
> + struct r5worker_group *group;
> + int i, cpu = sh->cpu;
> +
> + if (!cpu_online(cpu)) {
> + cpu = cpumask_any(cpu_online_mask);
> + sh->cpu = cpu;
> + }
> +
> + if (list_empty(&sh->lru)) {
> + struct r5worker_group *group;
> + group = conf->worker_groups + cpu_to_group(cpu);
> + list_add_tail(&sh->lru, &group->handle_list);
> + }
> +
> + if (conf->worker_cnt_per_group == 0) {
> + md_wakeup_thread(conf->mddev->thread);
> + return;
> + }
> +
> + group = conf->worker_groups + cpu_to_group(sh->cpu);
> +
> + for (i = 0; i < conf->worker_cnt_per_group; i++)
> + queue_work_on(sh->cpu, raid5_wq, &group->workers[i].work);
> +}
> +
> static void do_release_stripe(struct r5conf *conf, struct stripe_head *sh)
> {
> BUG_ON(!list_empty(&sh->lru));
> @@ -214,7 +247,12 @@ static void do_release_stripe(struct r5c
> else {
> clear_bit(STRIPE_DELAYED, &sh->state);
> clear_bit(STRIPE_BIT_DELAY, &sh->state);
> - list_add_tail(&sh->lru, &conf->handle_list);
> + if (conf->worker_cnt_per_group == 0) {
> + list_add_tail(&sh->lru, &conf->handle_list);
> + } else {
> + raid5_wakeup_stripe_thread(sh);
> + return;
> + }
> }
> md_wakeup_thread(conf->mddev->thread);
> } else {
> @@ -409,6 +447,7 @@ static void init_stripe(struct stripe_he
> raid5_build_block(sh, i, previous);
> }
> insert_hash(conf, sh);
> + sh->cpu = smp_processor_id();
> }
>
> static struct stripe_head *__find_stripe(struct r5conf *conf, sector_t sector,
> @@ -3830,6 +3869,7 @@ static void raid5_activate_delayed(struc
> if (!test_and_set_bit(STRIPE_PREREAD_ACTIVE, &sh->state))
> atomic_inc(&conf->preread_active_stripes);
> list_add_tail(&sh->lru, &conf->hold_list);
> + raid5_wakeup_stripe_thread(sh);
> }
> }
> }
> @@ -4109,18 +4149,32 @@ static int chunk_aligned_read(struct mdd
> * head of the hold_list has changed, i.e. the head was promoted to the
> * handle_list.
> */
> -static struct stripe_head *__get_priority_stripe(struct r5conf *conf)
> +static struct stripe_head *__get_priority_stripe(struct r5conf *conf, int group)
> {
> - struct stripe_head *sh;
> + struct stripe_head *sh = NULL, *tmp;
> + struct list_head *handle_list = NULL;
> +
> + if (conf->worker_cnt_per_group == 0) {
> + handle_list = &conf->handle_list;
> + } else if (group != ANY_GROUP) {
> + handle_list = &conf->worker_groups[group].handle_list;
> + } else {
> + int i;
> + for (i = 0; i < conf->group_cnt; i++) {
> + handle_list = &conf->worker_groups[i].handle_list;
> + if (!list_empty(handle_list))
> + break;
> + }
> + }
>
> pr_debug("%s: handle: %s hold: %s full_writes: %d bypass_count: %d\n",
> __func__,
> - list_empty(&conf->handle_list) ? "empty" : "busy",
> + list_empty(handle_list) ? "empty" : "busy",
> list_empty(&conf->hold_list) ? "empty" : "busy",
> atomic_read(&conf->pending_full_writes), conf->bypass_count);
>
> - if (!list_empty(&conf->handle_list)) {
> - sh = list_entry(conf->handle_list.next, typeof(*sh), lru);
> + if (!list_empty(handle_list)) {
> + sh = list_entry(handle_list->next, typeof(*sh), lru);
>
> if (list_empty(&conf->hold_list))
> conf->bypass_count = 0;
> @@ -4138,12 +4192,25 @@ static struct stripe_head *__get_priorit
> ((conf->bypass_threshold &&
> conf->bypass_count > conf->bypass_threshold) ||
> atomic_read(&conf->pending_full_writes) == 0)) {
> - sh = list_entry(conf->hold_list.next,
> - typeof(*sh), lru);
> - conf->bypass_count -= conf->bypass_threshold;
> - if (conf->bypass_count < 0)
> - conf->bypass_count = 0;
> - } else
> +
> + list_for_each_entry(tmp, &conf->hold_list, lru) {
> + if (conf->worker_cnt_per_group == 0 ||
> + group == ANY_GROUP ||
> + !cpu_online(tmp->cpu) ||
> + cpu_to_group(tmp->cpu) == group) {
> + sh = tmp;
> + break;
> + }
> + }
> +
> + if (sh) {
> + conf->bypass_count -= conf->bypass_threshold;
> + if (conf->bypass_count < 0)
> + conf->bypass_count = 0;
> + }
> + }
> +
> + if (!sh)
> return NULL;
>
> list_del_init(&sh->lru);
> @@ -4853,13 +4920,13 @@ static int retry_aligned_read(struct r5
> }
>
> #define MAX_STRIPE_BATCH 8
> -static int handle_active_stripes(struct r5conf *conf)
> +static int handle_active_stripes(struct r5conf *conf, int group)
> {
> struct stripe_head *batch[MAX_STRIPE_BATCH], *sh;
> int i, batch_size = 0;
>
> while (batch_size < MAX_STRIPE_BATCH &&
> - (sh = __get_priority_stripe(conf)) != NULL)
> + (sh = __get_priority_stripe(conf, group)) != NULL)
> batch[batch_size++] = sh;
>
> if (batch_size == 0)
> @@ -4877,6 +4944,38 @@ static int handle_active_stripes(struct
> return batch_size;
> }
>
> +static void raid5_do_work(struct work_struct *work)
> +{
> + struct r5worker *worker = container_of(work, struct r5worker, work);
> + struct r5worker_group *group = worker->group;
> + struct r5conf *conf = group->conf;
> + int group_id = group - conf->worker_groups;
> + int handled;
> + struct blk_plug plug;
> +
> + pr_debug("+++ raid5worker active\n");
> +
> + blk_start_plug(&plug);
> + handled = 0;
> + spin_lock_irq(&conf->device_lock);
> + while (1) {
> + int batch_size, released;
> +
> + released = release_stripe_list(conf);
> +
> + batch_size = handle_active_stripes(conf, group_id);
> + if (!batch_size && !released)
> + break;
> + handled += batch_size;
> + }
> + pr_debug("%d stripes handled\n", handled);
> +
> + spin_unlock_irq(&conf->device_lock);
> + blk_finish_plug(&plug);
> +
> + pr_debug("--- raid5worker inactive\n");
> +}
> +
> /*
> * This is our raid5 kernel thread.
> *
> @@ -4926,7 +5025,7 @@ static void raid5d(struct md_thread *thr
> handled++;
> }
>
> - batch_size = handle_active_stripes(conf);
> + batch_size = handle_active_stripes(conf, ANY_GROUP);
> if (!batch_size && !released)
> break;
> handled += batch_size;
> @@ -5066,6 +5165,54 @@ static struct attribute_group raid5_attr
> .attrs = raid5_attrs,
> };
>
> +static int alloc_thread_groups(struct r5conf *conf, int cnt)
> +{
> + int i, j;
> + ssize_t size;
> + struct r5worker *workers;
> +
> + conf->worker_cnt_per_group = cnt;
> + if (cnt == 0) {
> + conf->worker_groups = NULL;
> + return 0;
> + }
> + conf->group_cnt = num_possible_nodes();
> + size = sizeof(struct r5worker) * cnt;
> + workers = kzalloc(size * conf->group_cnt, GFP_NOIO);
> + conf->worker_groups = kzalloc(sizeof(struct r5worker_group) *
> + conf->group_cnt, GFP_NOIO);
> + if (!conf->worker_groups || !workers) {
> + kfree(workers);
> + kfree(conf->worker_groups);
> + conf->worker_groups = NULL;
> + return -ENOMEM;
> + }
> +
> + for (i = 0; i < conf->group_cnt; i++) {
> + struct r5worker_group *group;
> +
> + group = &conf->worker_groups[i];
> + INIT_LIST_HEAD(&group->handle_list);
> + group->conf = conf;
> + group->workers = workers + i * cnt;
> +
> + for (j = 0; j < cnt; j++) {
> + group->workers[j].group = group;
> + INIT_WORK(&group->workers[j].work, raid5_do_work);
> + }
> + }
> +
> + return 0;
> +}
> +
> +static void free_thread_groups(struct r5conf *conf)
> +{
> + if (conf->worker_groups)
> + kfree(conf->worker_groups[0].workers);
> + kfree(conf->worker_groups);
> + conf->worker_groups = NULL;
> +}
> +
> static sector_t
> raid5_size(struct mddev *mddev, sector_t sectors, int raid_disks)
> {
> @@ -5106,6 +5253,7 @@ static void raid5_free_percpu(struct r5c
>
> static void free_conf(struct r5conf *conf)
> {
> + free_thread_groups(conf);
> shrink_stripes(conf);
> raid5_free_percpu(conf);
> kfree(conf->disks);
> @@ -5234,6 +5382,9 @@ static struct r5conf *setup_conf(struct
> conf = kzalloc(sizeof(struct r5conf), GFP_KERNEL);
> if (conf == NULL)
> goto abort;
> + /* Don't enable multi-threading by default*/
> + if (alloc_thread_groups(conf, 0))
> + goto abort;
> spin_lock_init(&conf->device_lock);
> seqcount_init(&conf->gen_lock);
> init_waitqueue_head(&conf->wait_for_stripe);
> @@ -6549,6 +6700,10 @@ static struct md_personality raid4_perso
>
> static int __init raid5_init(void)
> {
> + raid5_wq = alloc_workqueue("raid5wq",
> + WQ_UNBOUND|WQ_MEM_RECLAIM|WQ_CPU_INTENSIVE|WQ_SYSFS, 0);
> + if (!raid5_wq)
> + return -ENOMEM;
> register_md_personality(&raid6_personality);
> register_md_personality(&raid5_personality);
> register_md_personality(&raid4_personality);
> @@ -6560,6 +6715,7 @@ static void raid5_exit(void)
> unregister_md_personality(&raid6_personality);
> unregister_md_personality(&raid5_personality);
> unregister_md_personality(&raid4_personality);
> + destroy_workqueue(raid5_wq);
> }
>
> module_init(raid5_init);
> Index: linux/drivers/md/raid5.h
> ===================================================================
> --- linux.orig/drivers/md/raid5.h 2013-08-28 13:54:17.260930690 +0800
> +++ linux/drivers/md/raid5.h 2013-08-28 13:54:17.260930690 +0800
> @@ -212,6 +212,7 @@ struct stripe_head {
> enum check_states check_state;
> enum reconstruct_states reconstruct_state;
> spinlock_t stripe_lock;
> + int cpu;
> /**
> * struct stripe_operations
> * @target - STRIPE_OP_COMPUTE_BLK target
> @@ -365,6 +366,17 @@ struct disk_info {
> struct md_rdev *rdev, *replacement;
> };
>
> +struct r5worker {
> + struct work_struct work;
> + struct r5worker_group *group;
> +};
> +
> +struct r5worker_group {
> + struct list_head handle_list;
> + struct r5conf *conf;
> + struct r5worker *workers;
> +};
> +
> struct r5conf {
> struct hlist_head *stripe_hashtbl;
> struct mddev *mddev;
> @@ -462,6 +474,9 @@ struct r5conf {
> * the new thread here until we fully activate the array.
> */
> struct md_thread *thread;
> + struct r5worker_group *worker_groups;
> + int group_cnt;
> + int worker_cnt_per_group;
> };
>
> /*
Applied, thanks.
NeilBrown
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 828 bytes --]
Thread overview: 20+ messages
2013-08-27 9:50 [patch v3 0/5] raid5: make stripe handling multi-threading Shaohua Li
2013-08-27 9:50 ` [patch v3 1/5] raid5: make release_stripe lockless Shaohua Li
2013-08-28 14:04 ` Tejun Heo
2013-08-28 14:29 ` Shaohua Li
2013-08-28 14:30 ` Tejun Heo
2013-08-27 9:50 ` [patch v3 2/5] raid5: fix stripe release order Shaohua Li
2013-08-28 3:41 ` NeilBrown
2013-08-28 6:29 ` Shaohua Li
2013-08-28 6:37 ` NeilBrown
2013-08-27 9:50 ` [patch v3 3/5] raid5: offload stripe handle to workqueue Shaohua Li
2013-08-28 3:53 ` NeilBrown
2013-08-28 6:30 ` Shaohua Li
2013-08-28 6:56 ` NeilBrown [this message]
2013-08-27 9:50 ` [patch v3 4/5] raid5: sysfs entry to control worker thread number Shaohua Li
2013-08-27 9:50 ` [patch v3 5/5] raid5: only wakeup necessary threads Shaohua Li
2013-08-28 4:13 ` NeilBrown
2013-08-28 6:31 ` Shaohua Li
2013-08-28 6:59 ` NeilBrown
2013-08-29 7:40 ` Shaohua Li
2013-09-02 0:45 ` NeilBrown