linux-raid.vger.kernel.org archive mirror
* [patch 0/3] raid5: make stripe handling multi-threading
@ 2013-07-30  5:52 shli
  2013-07-30  5:52 ` [patch 1/3] raid5: offload stripe handle to workqueue shli
                   ` (2 more replies)
  0 siblings, 3 replies; 15+ messages in thread
From: shli @ 2013-07-30  5:52 UTC (permalink / raw)
  To: linux-raid, linux-kernel; +Cc: neilb, tj, djbw

Neil,

This is another attempt to make raid5 stripe handling multi-threaded. The
recent workqueue improvements for unbound workqueues look very promising for
the raid5 use case; details are in the first patch.

The patches are against your tree with the patches 'raid5: make release_stripe
lockless' and 'raid5: fix stripe release order' applied, but without 'raid5:
create multiple threads to handle stripes'.

Thanks,
Shaohua

^ permalink raw reply	[flat|nested] 15+ messages in thread

* [patch 1/3] raid5: offload stripe handle to workqueue
  2013-07-30  5:52 [patch 0/3] raid5: make stripe handling multi-threading shli
@ 2013-07-30  5:52 ` shli
  2013-07-30 11:46   ` Tejun Heo
  2013-07-30 12:53   ` Tejun Heo
  2013-07-30  5:52 ` [patch 2/3] raid5: sysfs entry to control worker thread number shli
  2013-07-30  5:52 ` [patch 3/3] raid5: only wakeup necessary threads shli
  2 siblings, 2 replies; 15+ messages in thread
From: shli @ 2013-07-30  5:52 UTC (permalink / raw)
  To: linux-raid, linux-kernel; +Cc: neilb, tj, djbw

[-- Attachment #1: raid5-wq.patch --]
[-- Type: text/plain, Size: 13214 bytes --]

This is another attempt to create multiple threads to handle raid5 stripes.
This time I use a workqueue.

raid5 handles requests (especially writes) in stripe units. A stripe is
page-size aligned and page-size long, and spans all disks. For a write to any
disk sector, raid5 runs a state machine for the corresponding stripe, which
includes reading some disks of the stripe, calculating parity, and writing
some disks of the stripe. The state machine currently runs in the single
raid5d thread, so it doesn't scale well for high-speed storage. An obvious
solution is multi-threading.

To get better performance, we have some requirements:
a. Locality. A stripe corresponding to a request submitted from one CPU is
best handled by a thread on the local CPU or the local node. The local CPU is
preferred, but it can sometimes become a bottleneck, for example when parity
calculation is too heavy; running on the local node adapts to a wider range
of workloads.
b. Configurability. Different raid5 array setups might need different
configurations, especially the thread count. More threads don't always mean
better performance because of lock contention.

My original implementation created a number of kernel threads, with
interfaces to control which CPUs' stripes each thread handles, and userspace
could set the threads' affinity. That provides the greatest flexibility and
configurability, but it's hard to use, and a new thread-pool implementation
is apparently disfavored.

The recent workqueue improvements are quite promising: an unbound workqueue
is now bound to NUMA nodes, and if WQ_SYSFS is set on the workqueue, sysfs
options allow affinity tuning, for example including only one HT sibling in
the affinity mask. Work items are non-reentrant by default, and we can
control the number of running threads by limiting the number of dispatched
work_structs.
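
As a rough standalone sketch of the mechanism described above (not part of
the patch; the names, the fixed count of four workers, and the empty work
function are purely illustrative):

#include <linux/init.h>
#include <linux/kernel.h>
#include <linux/workqueue.h>

static struct workqueue_struct *example_wq;
static struct work_struct example_workers[4];	/* 4 queued works => at most 4 running threads */

static void example_work_fn(struct work_struct *work)
{
	/* drain a shared per-node stripe list here; the unbound pool of the
	 * submitting CPU's NUMA node executes this work item */
}

static int __init example_init(void)
{
	int i;

	/* WQ_UNBOUND gives per-NUMA-node worker pools; WQ_SYSFS exposes
	 * cpumask/nice attributes under /sys/devices/virtual/workqueue/ */
	example_wq = alloc_workqueue("example_wq",
				     WQ_UNBOUND | WQ_MEM_RECLAIM | WQ_SYSFS, 0);
	if (!example_wq)
		return -ENOMEM;
	for (i = 0; i < ARRAY_SIZE(example_workers); i++)
		INIT_WORK(&example_workers[i], example_work_fn);
	return 0;
}

Submitters would then call queue_work_on(cpu, example_wq, &example_workers[i]);
for an unbound workqueue the work item runs in a worker pool local to that
CPU's NUMA node, which is the locality property relied on below.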

In this patch, I create several stripe worker groups, one per NUMA node.
Stripes from the CPUs of one node are added to that node's group list, and
workqueue threads of a node only handle stripes of that node's worker group.
In this way, stripe handling has NUMA-node locality. And as noted above, we
control the thread count by limiting the number of dispatched work_structs.

This implementation can't do very fine-grained configuration, but NUMA
binding is the most common usage model and should be enough for most
workloads.

Signed-off-by: Shaohua Li <shli@fusionio.com>
---
 drivers/md/raid5.c |  198 ++++++++++++++++++++++++++++++++++++++++++++++++-----
 drivers/md/raid5.h |   15 ++++
 2 files changed, 198 insertions(+), 15 deletions(-)

Index: linux/drivers/md/raid5.c
===================================================================
--- linux.orig/drivers/md/raid5.c	2013-07-29 16:53:47.492071532 +0800
+++ linux/drivers/md/raid5.c	2013-07-30 09:43:55.859080206 +0800
@@ -53,6 +53,7 @@
 #include <linux/cpu.h>
 #include <linux/slab.h>
 #include <linux/ratelimit.h>
+#include <linux/nodemask.h>
 #include <trace/events/block.h>
 
 #include "md.h"
@@ -60,6 +61,10 @@
 #include "raid0.h"
 #include "bitmap.h"
 
+#define cpu_to_group(cpu) cpu_to_node(cpu)
+#define ANY_GROUP NUMA_NO_NODE
+
+static struct workqueue_struct *raid5_wq;
 /*
  * Stripe cache
  */
@@ -200,6 +205,23 @@ static int stripe_operations_active(stru
 	       test_bit(STRIPE_COMPUTE_RUN, &sh->state);
 }
 
+static void raid5_wakeup_stripe_thread(struct stripe_head *sh)
+{
+	struct r5conf *conf = sh->raid_conf;
+	struct r5worker_group *group;
+	int i;
+
+	if (conf->worker_cnt_per_group == 0) {
+		md_wakeup_thread(conf->mddev->thread);
+		return;
+	}
+
+	group = conf->worker_groups + cpu_to_group(sh->cpu);
+
+	for (i = 0; i < conf->worker_cnt_per_group; i++)
+		queue_work_on(sh->cpu, raid5_wq, &group->workers[i].work);
+}
+
 static void do_release_stripe(struct r5conf *conf, struct stripe_head *sh)
 {
 	BUG_ON(!list_empty(&sh->lru));
@@ -212,9 +234,23 @@ static void do_release_stripe(struct r5c
 			   sh->bm_seq - conf->seq_write > 0)
 			list_add_tail(&sh->lru, &conf->bitmap_list);
 		else {
+			int cpu = sh->cpu;
+			if (!cpu_online(cpu)) {
+				cpu = cpumask_any(cpu_online_mask);
+				sh->cpu = cpu;
+			}
+
 			clear_bit(STRIPE_DELAYED, &sh->state);
 			clear_bit(STRIPE_BIT_DELAY, &sh->state);
-			list_add_tail(&sh->lru, &conf->handle_list);
+			if (conf->worker_cnt_per_group == 0) {
+				list_add_tail(&sh->lru, &conf->handle_list);
+			} else {
+				struct r5worker_group *group;
+				group = conf->worker_groups + cpu_to_group(cpu);
+				list_add_tail(&sh->lru, &group->handle_list);
+			}
+			raid5_wakeup_stripe_thread(sh);
+			return;
 		}
 		md_wakeup_thread(conf->mddev->thread);
 	} else {
@@ -395,6 +431,7 @@ static void init_stripe(struct stripe_he
 		raid5_build_block(sh, i, previous);
 	}
 	insert_hash(conf, sh);
+	sh->cpu = smp_processor_id();
 }
 
 static struct stripe_head *__find_stripe(struct r5conf *conf, sector_t sector,
@@ -3810,12 +3847,19 @@ static void raid5_activate_delayed(struc
 		while (!list_empty(&conf->delayed_list)) {
 			struct list_head *l = conf->delayed_list.next;
 			struct stripe_head *sh;
+			int cpu;
 			sh = list_entry(l, struct stripe_head, lru);
 			list_del_init(l);
 			clear_bit(STRIPE_DELAYED, &sh->state);
 			if (!test_and_set_bit(STRIPE_PREREAD_ACTIVE, &sh->state))
 				atomic_inc(&conf->preread_active_stripes);
 			list_add_tail(&sh->lru, &conf->hold_list);
+			cpu = sh->cpu;
+			if (!cpu_online(cpu)) {
+				cpu = cpumask_any(cpu_online_mask);
+				sh->cpu = cpu;
+			}
+			raid5_wakeup_stripe_thread(sh);
 		}
 	}
 }
@@ -4095,18 +4139,39 @@ static int chunk_aligned_read(struct mdd
  * head of the hold_list has changed, i.e. the head was promoted to the
  * handle_list.
  */
-static struct stripe_head *__get_priority_stripe(struct r5conf *conf)
+static struct stripe_head *__get_priority_stripe(struct r5conf *conf, int group)
 {
-	struct stripe_head *sh;
+	struct stripe_head *sh = NULL, *tmp;
+	struct list_head *handle_list = NULL;
+
+	if (conf->worker_cnt_per_group == 0) {
+		handle_list = &conf->handle_list;
+		if (list_empty(handle_list))
+			handle_list = NULL;
+	} else if (group != ANY_GROUP) {
+		handle_list = &conf->worker_groups[group].handle_list;
+		if (list_empty(handle_list))
+			handle_list = NULL;
+	} else {
+		int i;
+		/* Should we take action to avoid starvation of latter groups ? */
+		for (i = 0; i < conf->group_cnt; i++) {
+			handle_list = &conf->worker_groups[i].handle_list;
+			if (!list_empty(handle_list))
+				break;
+		}
+		if (i == conf->group_cnt)
+			handle_list = NULL;
+	}
 
 	pr_debug("%s: handle: %s hold: %s full_writes: %d bypass_count: %d\n",
 		  __func__,
-		  list_empty(&conf->handle_list) ? "empty" : "busy",
+		  list_empty(handle_list) ? "empty" : "busy",
 		  list_empty(&conf->hold_list) ? "empty" : "busy",
 		  atomic_read(&conf->pending_full_writes), conf->bypass_count);
 
-	if (!list_empty(&conf->handle_list)) {
-		sh = list_entry(conf->handle_list.next, typeof(*sh), lru);
+	if (handle_list) {
+		sh = list_entry(handle_list->next, typeof(*sh), lru);
 
 		if (list_empty(&conf->hold_list))
 			conf->bypass_count = 0;
@@ -4124,12 +4189,25 @@ static struct stripe_head *__get_priorit
 		   ((conf->bypass_threshold &&
 		     conf->bypass_count > conf->bypass_threshold) ||
 		    atomic_read(&conf->pending_full_writes) == 0)) {
-		sh = list_entry(conf->hold_list.next,
-				typeof(*sh), lru);
-		conf->bypass_count -= conf->bypass_threshold;
-		if (conf->bypass_count < 0)
-			conf->bypass_count = 0;
-	} else
+
+		list_for_each_entry(tmp, &conf->hold_list,  lru) {
+			if (conf->worker_cnt_per_group == 0 ||
+			    group == ANY_GROUP ||
+			    !cpu_online(tmp->cpu) ||
+			    cpu_to_group(tmp->cpu) == group) {
+				sh = tmp;
+				break;
+			}
+		}
+
+		if (sh) {
+			conf->bypass_count -= conf->bypass_threshold;
+			if (conf->bypass_count < 0)
+				conf->bypass_count = 0;
+		}
+	}
+
+	if (!sh)
 		return NULL;
 
 	list_del_init(&sh->lru);
@@ -4830,13 +4908,13 @@ static int  retry_aligned_read(struct r5
 }
 
 #define MAX_STRIPE_BATCH 8
-static int handle_active_stripes(struct r5conf *conf)
+static int handle_active_stripes(struct r5conf *conf, int group)
 {
 	struct stripe_head *batch[MAX_STRIPE_BATCH], *sh;
 	int i, batch_size = 0;
 
 	while (batch_size < MAX_STRIPE_BATCH &&
-			(sh = __get_priority_stripe(conf)) != NULL)
+			(sh = __get_priority_stripe(conf, group)) != NULL)
 		batch[batch_size++] = sh;
 
 	if (batch_size == 0)
@@ -4854,6 +4932,38 @@ static int handle_active_stripes(struct
 	return batch_size;
 }
 
+static void raid5_do_work(struct work_struct *work)
+{
+	struct r5worker *worker = container_of(work, struct r5worker, work);
+	struct r5worker_group *group = worker->group;
+	struct r5conf *conf = group->conf;
+	int group_id = group - conf->worker_groups;
+	int handled;
+	struct blk_plug plug;
+
+	pr_debug("+++ raid5worker active\n");
+
+	blk_start_plug(&plug);
+	handled = 0;
+	spin_lock_irq(&conf->device_lock);
+	while (1) {
+		int batch_size, released;
+
+		released = release_stripe_list(conf);
+
+		batch_size = handle_active_stripes(conf, group_id);
+		if (!batch_size && !released)
+			break;
+		handled += batch_size;
+	}
+	pr_debug("%d stripes handled\n", handled);
+
+	spin_unlock_irq(&conf->device_lock);
+	blk_finish_plug(&plug);
+
+	pr_debug("--- raid5worker inactive\n");
+}
+
 /*
  * This is our raid5 kernel thread.
  *
@@ -4903,7 +5013,7 @@ static void raid5d(struct md_thread *thr
 			handled++;
 		}
 
-		batch_size = handle_active_stripes(conf);
+		batch_size = handle_active_stripes(conf, ANY_GROUP);
 		if (!batch_size && !released)
 			break;
 		handled += batch_size;
@@ -5043,6 +5153,55 @@ static struct attribute_group raid5_attr
 	.attrs = raid5_attrs,
 };
 
+#define DEFAULT_THREAD_NUM 4
+static int alloc_thread_groups(struct r5conf *conf, int cnt)
+{
+	int i, j;
+	ssize_t size;
+	struct r5worker *workers;
+
+	conf->worker_cnt_per_group = cnt;
+	if (cnt == 0) {
+		conf->worker_groups = NULL;
+		return 0;
+	}
+	conf->group_cnt = num_possible_nodes();
+	size = sizeof(struct r5worker) * cnt;
+	workers = kzalloc(size * conf->group_cnt, GFP_KERNEL);
+	conf->worker_groups = kzalloc(sizeof(struct r5worker_group) *
+				conf->group_cnt, GFP_KERNEL);
+	if (!conf->worker_groups || !workers) {
+		kfree(workers);
+		kfree(conf->worker_groups);
+		conf->worker_groups = NULL;
+		return -ENOMEM;
+	}
+
+	for (i = 0; i < conf->group_cnt; i++) {
+		struct r5worker_group *group;
+
+		group = &conf->worker_groups[i];
+		INIT_LIST_HEAD(&group->handle_list);
+		group->conf = conf;
+		group->workers = workers + i * cnt;
+
+		for (j = 0; j < cnt; j++) {
+			group->workers[j].group = group;
+			INIT_WORK(&group->workers[j].work, raid5_do_work);
+		}
+	}
+
+	return 0;
+}
+
+static void free_thread_groups(struct r5conf *conf)
+{
+	if (conf->worker_groups)
+		kfree(conf->worker_groups[0].workers);
+	kfree(conf->worker_groups);
+	conf->worker_groups = NULL;
+}
+
 static sector_t
 raid5_size(struct mddev *mddev, sector_t sectors, int raid_disks)
 {
@@ -5083,6 +5242,7 @@ static void raid5_free_percpu(struct r5c
 
 static void free_conf(struct r5conf *conf)
 {
+	free_thread_groups(conf);
 	shrink_stripes(conf);
 	raid5_free_percpu(conf);
 	kfree(conf->disks);
@@ -5211,6 +5371,8 @@ static struct r5conf *setup_conf(struct
 	conf = kzalloc(sizeof(struct r5conf), GFP_KERNEL);
 	if (conf == NULL)
 		goto abort;
+	if (alloc_thread_groups(conf, DEFAULT_THREAD_NUM))
+		goto abort;
 	spin_lock_init(&conf->device_lock);
 	init_waitqueue_head(&conf->wait_for_stripe);
 	init_waitqueue_head(&conf->wait_for_overlap);
@@ -6516,6 +6678,11 @@ static struct md_personality raid4_perso
 
 static int __init raid5_init(void)
 {
+	raid5_wq = alloc_workqueue("raid5wq",
+		WQ_NON_REENTRANT|WQ_UNBOUND|WQ_MEM_RECLAIM|
+			WQ_CPU_INTENSIVE|WQ_SYSFS, 0);
+	if (!raid5_wq)
+		return -ENOMEM;
 	register_md_personality(&raid6_personality);
 	register_md_personality(&raid5_personality);
 	register_md_personality(&raid4_personality);
@@ -6527,6 +6694,7 @@ static void raid5_exit(void)
 	unregister_md_personality(&raid6_personality);
 	unregister_md_personality(&raid5_personality);
 	unregister_md_personality(&raid4_personality);
+	destroy_workqueue(raid5_wq);
 }
 
 module_init(raid5_init);
Index: linux/drivers/md/raid5.h
===================================================================
--- linux.orig/drivers/md/raid5.h	2013-07-29 16:53:46.148085930 +0800
+++ linux/drivers/md/raid5.h	2013-07-30 09:14:22.481374214 +0800
@@ -212,6 +212,7 @@ struct stripe_head {
 	enum check_states	check_state;
 	enum reconstruct_states reconstruct_state;
 	spinlock_t		stripe_lock;
+	int			cpu;
 	/**
 	 * struct stripe_operations
 	 * @target - STRIPE_OP_COMPUTE_BLK target
@@ -365,6 +366,17 @@ struct disk_info {
 	struct md_rdev	*rdev, *replacement;
 };
 
+struct r5worker {
+	struct work_struct work;
+	struct r5worker_group *group;
+};
+
+struct r5worker_group {
+	struct list_head handle_list;
+	struct r5conf *conf;
+	struct r5worker *workers;
+};
+
 struct r5conf {
 	struct hlist_head	*stripe_hashtbl;
 	struct mddev		*mddev;
@@ -461,6 +473,9 @@ struct r5conf {
 	 * the new thread here until we fully activate the array.
 	 */
 	struct md_thread	*thread;
+	struct r5worker_group	*worker_groups;
+	int			group_cnt;
+	int			worker_cnt_per_group;
 };
 
 /*

^ permalink raw reply	[flat|nested] 15+ messages in thread

* [patch 2/3] raid5: sysfs entry to control worker thread number
  2013-07-30  5:52 [patch 0/3] raid5: make stripe handling multi-threading shli
  2013-07-30  5:52 ` [patch 1/3] raid5: offload stripe handle to workqueue shli
@ 2013-07-30  5:52 ` shli
  2013-07-30  5:52 ` [patch 3/3] raid5: only wakeup necessary threads shli
  2 siblings, 0 replies; 15+ messages in thread
From: shli @ 2013-07-30  5:52 UTC (permalink / raw)
  To: linux-raid, linux-kernel; +Cc: neilb, tj, djbw

[-- Attachment #1: raid5-configure-wq.patch --]
[-- Type: text/plain, Size: 2271 bytes --]

Add a sysfs entry to control the number of running workqueue threads. If
group_thread_cnt is set to 0, workqueue offloading of stripe handling is
disabled.

Signed-off-by: Shaohua Li <shli@fusionio.com>
---
 drivers/md/raid5.c |   60 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 60 insertions(+)

Index: linux/drivers/md/raid5.c
===================================================================
--- linux.orig/drivers/md/raid5.c	2013-07-30 09:43:55.859080206 +0800
+++ linux/drivers/md/raid5.c	2013-07-30 09:44:18.666793102 +0800
@@ -5142,10 +5142,70 @@ stripe_cache_active_show(struct mddev *m
 static struct md_sysfs_entry
 raid5_stripecache_active = __ATTR_RO(stripe_cache_active);
 
+static ssize_t
+raid5_show_group_thread_cnt(struct mddev *mddev, char *page)
+{
+	struct r5conf *conf = mddev->private;
+	if (conf)
+		return sprintf(page, "%d\n", conf->worker_cnt_per_group);
+	else
+		return 0;
+}
+
+static int alloc_thread_groups(struct r5conf *conf, int cnt);
+static ssize_t
+raid5_store_group_thread_cnt(struct mddev *mddev, const char *page, size_t len)
+{
+	struct r5conf *conf = mddev->private;
+	unsigned long new;
+	int err;
+	struct r5worker_group *old_groups;
+	int old_group_cnt;
+
+	if (len >= PAGE_SIZE)
+		return -EINVAL;
+	if (!conf)
+		return -ENODEV;
+
+	if (kstrtoul(page, 10, &new))
+		return -EINVAL;
+
+	if (new == conf->worker_cnt_per_group)
+		return len;
+
+	mddev_suspend(mddev);
+
+	old_groups = conf->worker_groups;
+	old_group_cnt = conf->worker_cnt_per_group;
+
+	conf->worker_groups = NULL;
+	err = alloc_thread_groups(conf, new);
+	if (err) {
+		conf->worker_groups = old_groups;
+		conf->worker_cnt_per_group = old_group_cnt;
+	} else {
+		if (old_groups)
+			kfree(old_groups[0].workers);
+		kfree(old_groups);
+	}
+
+	mddev_resume(mddev);
+
+	if (err)
+		return err;
+	return len;
+}
+
+static struct md_sysfs_entry
+raid5_group_thread_cnt = __ATTR(group_thread_cnt, S_IRUGO | S_IWUSR,
+				raid5_show_group_thread_cnt,
+				raid5_store_group_thread_cnt);
+
 static struct attribute *raid5_attrs[] =  {
 	&raid5_stripecache_size.attr,
 	&raid5_stripecache_active.attr,
 	&raid5_preread_bypass_threshold.attr,
+	&raid5_group_thread_cnt.attr,
 	NULL,
 };
 static struct attribute_group raid5_attrs_group = {


^ permalink raw reply	[flat|nested] 15+ messages in thread

* [patch 3/3] raid5: only wakeup necessary threads
  2013-07-30  5:52 [patch 0/3] raid5: make stripe handling multi-threading shli
  2013-07-30  5:52 ` [patch 1/3] raid5: offload stripe handle to workqueue shli
  2013-07-30  5:52 ` [patch 2/3] raid5: sysfs entry to control worker thread number shli
@ 2013-07-30  5:52 ` shli
  2013-07-30 12:46   ` Tejun Heo
  2 siblings, 1 reply; 15+ messages in thread
From: shli @ 2013-07-30  5:52 UTC (permalink / raw)
  To: linux-raid, linux-kernel; +Cc: neilb, tj, djbw

[-- Attachment #1: raid5-intelligent-wakeup.patch --]
[-- Type: text/plain, Size: 6303 bytes --]

If there are not enough stripes to handle, we'd better not always queue all
available work_structs. If one worker can only handle a few stripes, or even
none, it will hurt request merging and create lock contention.

With this patch, the number of running work_structs depends on the number of
pending stripes. Note that some statistics used in the patch are accessed
without locking protection. This shouldn't matter; we just do our best to
avoid queueing unnecessary work_structs.

Signed-off-by: Shaohua Li <shli@fusionio.com>
---
 drivers/md/raid5.c |   50 ++++++++++++++++++++++++++++++++++++++++++++------
 drivers/md/raid5.h |    4 ++++
 2 files changed, 48 insertions(+), 6 deletions(-)

Index: linux/drivers/md/raid5.c
===================================================================
--- linux.orig/drivers/md/raid5.c	2013-07-30 09:44:18.000000000 +0800
+++ linux/drivers/md/raid5.c	2013-07-30 13:03:28.738736366 +0800
@@ -77,6 +77,7 @@ static struct workqueue_struct *raid5_wq
 #define BYPASS_THRESHOLD	1
 #define NR_HASH			(PAGE_SIZE / sizeof(struct hlist_head))
 #define HASH_MASK		(NR_HASH - 1)
+#define MAX_STRIPE_BATCH	8
 
 static inline struct hlist_head *stripe_hash(struct r5conf *conf, sector_t sect)
 {
@@ -209,6 +210,7 @@ static void raid5_wakeup_stripe_thread(s
 {
 	struct r5conf *conf = sh->raid_conf;
 	struct r5worker_group *group;
+	int thread_cnt;
 	int i;
 
 	if (conf->worker_cnt_per_group == 0) {
@@ -218,8 +220,26 @@ static void raid5_wakeup_stripe_thread(s
 
 	group = conf->worker_groups + cpu_to_group(sh->cpu);
 
-	for (i = 0; i < conf->worker_cnt_per_group; i++)
-		queue_work_on(sh->cpu, raid5_wq, &group->workers[i].work);
+	group->workers[0].working = 1;
+	/* at least one worker should run to avoid race */
+	queue_work_on(sh->cpu, raid5_wq, &group->workers[0].work);
+
+	thread_cnt = group->stripes_cnt / MAX_STRIPE_BATCH - 1;
+	/* wakeup more worker */
+	for (i = 1; i < conf->worker_cnt_per_group && thread_cnt > 0; i++) {
+		if (group->workers[i].working == 0) {
+			group->workers[i].working = 1;
+			queue_work_on(sh->cpu, raid5_wq,
+				      &group->workers[i].work);
+			thread_cnt--;
+		} else if (group->workers[i].working_cnt <=
+			   MAX_STRIPE_BATCH / 2)
+			/*
+			 * If a worker has no enough stripes handling, assume
+			 * it will fetch more stripes soon
+			 */
+			thread_cnt--;
+	}
 }
 
 static void do_release_stripe(struct r5conf *conf, struct stripe_head *sh)
@@ -248,6 +268,8 @@ static void do_release_stripe(struct r5c
 				struct r5worker_group *group;
 				group = conf->worker_groups + cpu_to_group(cpu);
 				list_add_tail(&sh->lru, &group->handle_list);
+				group->stripes_cnt++;
+				sh->group = group;
 			}
 			raid5_wakeup_stripe_thread(sh);
 			return;
@@ -573,6 +595,10 @@ get_active_stripe(struct r5conf *conf, s
 				    !test_bit(STRIPE_EXPANDING, &sh->state))
 					BUG();
 				list_del_init(&sh->lru);
+				if (sh->group) {
+					sh->group->stripes_cnt--;
+					sh->group = NULL;
+				}
 			}
 		}
 	} while (sh == NULL);
@@ -4143,6 +4169,7 @@ static struct stripe_head *__get_priorit
 {
 	struct stripe_head *sh = NULL, *tmp;
 	struct list_head *handle_list = NULL;
+	struct r5worker_group *wg = NULL;
 
 	if (conf->worker_cnt_per_group == 0) {
 		handle_list = &conf->handle_list;
@@ -4150,6 +4177,7 @@ static struct stripe_head *__get_priorit
 			handle_list = NULL;
 	} else if (group != ANY_GROUP) {
 		handle_list = &conf->worker_groups[group].handle_list;
+		wg = &conf->worker_groups[group];
 		if (list_empty(handle_list))
 			handle_list = NULL;
 	} else {
@@ -4157,6 +4185,7 @@ static struct stripe_head *__get_priorit
 		/* Should we take action to avoid starvation of latter groups ? */
 		for (i = 0; i < conf->group_cnt; i++) {
 			handle_list = &conf->worker_groups[i].handle_list;
+			wg = &conf->worker_groups[i];
 			if (!list_empty(handle_list))
 				break;
 		}
@@ -4205,11 +4234,16 @@ static struct stripe_head *__get_priorit
 			if (conf->bypass_count < 0)
 				conf->bypass_count = 0;
 		}
+		wg = NULL;
 	}
 
 	if (!sh)
 		return NULL;
 
+	if (wg) {
+		wg->stripes_cnt--;
+		sh->group = NULL;
+	}
 	list_del_init(&sh->lru);
 	atomic_inc(&sh->count);
 	BUG_ON(atomic_read(&sh->count) != 1);
@@ -4907,8 +4941,8 @@ static int  retry_aligned_read(struct r5
 	return handled;
 }
 
-#define MAX_STRIPE_BATCH 8
-static int handle_active_stripes(struct r5conf *conf, int group)
+static int handle_active_stripes(struct r5conf *conf, int group,
+		struct r5worker *worker)
 {
 	struct stripe_head *batch[MAX_STRIPE_BATCH], *sh;
 	int i, batch_size = 0;
@@ -4917,6 +4951,9 @@ static int handle_active_stripes(struct
 			(sh = __get_priority_stripe(conf, group)) != NULL)
 		batch[batch_size++] = sh;
 
+	if (worker)
+		worker->working_cnt = batch_size;
+
 	if (batch_size == 0)
 		return batch_size;
 	spin_unlock_irq(&conf->device_lock);
@@ -4951,11 +4988,12 @@ static void raid5_do_work(struct work_st
 
 		released = release_stripe_list(conf);
 
-		batch_size = handle_active_stripes(conf, group_id);
+		batch_size = handle_active_stripes(conf, group_id, worker);
 		if (!batch_size && !released)
 			break;
 		handled += batch_size;
 	}
+	worker->working = 0;
 	pr_debug("%d stripes handled\n", handled);
 
 	spin_unlock_irq(&conf->device_lock);
@@ -5013,7 +5051,7 @@ static void raid5d(struct md_thread *thr
 			handled++;
 		}
 
-		batch_size = handle_active_stripes(conf, ANY_GROUP);
+		batch_size = handle_active_stripes(conf, ANY_GROUP, NULL);
 		if (!batch_size && !released)
 			break;
 		handled += batch_size;
Index: linux/drivers/md/raid5.h
===================================================================
--- linux.orig/drivers/md/raid5.h	2013-07-30 09:14:22.000000000 +0800
+++ linux/drivers/md/raid5.h	2013-07-30 09:46:22.777233803 +0800
@@ -213,6 +213,7 @@ struct stripe_head {
 	enum reconstruct_states reconstruct_state;
 	spinlock_t		stripe_lock;
 	int			cpu;
+	struct r5worker_group	*group;
 	/**
 	 * struct stripe_operations
 	 * @target - STRIPE_OP_COMPUTE_BLK target
@@ -369,12 +370,15 @@ struct disk_info {
 struct r5worker {
 	struct work_struct work;
 	struct r5worker_group *group;
+	int working:1;
+	int working_cnt:8;
 };
 
 struct r5worker_group {
 	struct list_head handle_list;
 	struct r5conf *conf;
 	struct r5worker *workers;
+	int stripes_cnt;
 };
 
 struct r5conf {

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [patch 1/3] raid5: offload stripe handle to workqueue
  2013-07-30  5:52 ` [patch 1/3] raid5: offload stripe handle to workqueue shli
@ 2013-07-30 11:46   ` Tejun Heo
  2013-07-30 12:53   ` Tejun Heo
  1 sibling, 0 replies; 15+ messages in thread
From: Tejun Heo @ 2013-07-30 11:46 UTC (permalink / raw)
  To: shli; +Cc: linux-raid, linux-kernel, neilb, djbw

Hello,

On Tue, Jul 30, 2013 at 01:52:08PM +0800, shli@kernel.org wrote:
>  static int __init raid5_init(void)
>  {
> +	raid5_wq = alloc_workqueue("raid5wq",
> +		WQ_NON_REENTRANT|WQ_UNBOUND|WQ_MEM_RECLAIM|
> +			WQ_CPU_INTENSIVE|WQ_SYSFS, 0);

Workqueues are now always non-reentrant and WQ_NON_REENTRANT is a noop.
I was planning to remove it and then forgot. :) Will send out patches
to kill it.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [patch 3/3] raid5: only wakeup necessary threads
  2013-07-30  5:52 ` [patch 3/3] raid5: only wakeup necessary threads shli
@ 2013-07-30 12:46   ` Tejun Heo
  2013-07-30 13:24     ` Shaohua Li
  0 siblings, 1 reply; 15+ messages in thread
From: Tejun Heo @ 2013-07-30 12:46 UTC (permalink / raw)
  To: shli; +Cc: linux-raid, linux-kernel, neilb, djbw

Hello,

On Tue, Jul 30, 2013 at 01:52:10PM +0800, shli@kernel.org wrote:
> If there are not enough stripes to handle, we'd better not always queue all
> available work_structs. If one worker can only handle a few stripes, or even
> none, it will hurt request merging and create lock contention.
> 
> With this patch, the number of running work_structs depends on the number of
> pending stripes. Note that some statistics used in the patch are accessed
> without locking protection. This shouldn't matter; we just do our best to
> avoid queueing unnecessary work_structs.

I haven't really followed the code but two general comments.

* Stacking drivers in general should always try to keep the bios
  passing through in the same order that they are received.  The order
  of bios is an important information to the io scheduler and io
  scheduling will suffer badly if the bios are shuffled by the
  stacking driver.  It'd probably be a good idea to have a mechanism
  to keep the issue order intact even when multiple workers are
  employed.

* While limiting the number of work_struct dynamically could be
  beneficial and it's up to Neil, it'd be nice if you can accompany it
  with some numbers so that whether such optimization actually is
  worthwhile or not can be decided.  The same goes for the whole
  series, I suppose.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [patch 1/3] raid5: offload stripe handle to workqueue
  2013-07-30  5:52 ` [patch 1/3] raid5: offload stripe handle to workqueue shli
  2013-07-30 11:46   ` Tejun Heo
@ 2013-07-30 12:53   ` Tejun Heo
  2013-07-30 13:07     ` Shaohua Li
  1 sibling, 1 reply; 15+ messages in thread
From: Tejun Heo @ 2013-07-30 12:53 UTC (permalink / raw)
  To: shli; +Cc: linux-raid, linux-kernel, neilb, djbw

Hello,

On Tue, Jul 30, 2013 at 01:52:08PM +0800, shli@kernel.org wrote:
> +static void raid5_wakeup_stripe_thread(struct stripe_head *sh)
> +{
> +	struct r5conf *conf = sh->raid_conf;
> +	struct r5worker_group *group;
> +	int i;
> +
> +	if (conf->worker_cnt_per_group == 0) {
> +		md_wakeup_thread(conf->mddev->thread);
> +		return;
> +	}
> +
> +	group = conf->worker_groups + cpu_to_group(sh->cpu);
> +
> +	for (i = 0; i < conf->worker_cnt_per_group; i++)
> +		queue_work_on(sh->cpu, raid5_wq, &group->workers[i].work);
> +}

Another general suggestion.  Using the workqueue mechanism simply as a
thread dispatching mechanism like above and then building your own
work dispatching code on top is usually poor form.  It usually is
much better to assign a single unit of work to a single work item as
it allows things like per work unit flushing and much easier
implementation of freezing.  It's possible that you have some
overriding constraints here but if so it'd be nice if you can explain
it.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [patch 1/3] raid5: offload stripe handle to workqueue
  2013-07-30 12:53   ` Tejun Heo
@ 2013-07-30 13:07     ` Shaohua Li
  2013-07-30 13:57       ` Tejun Heo
  0 siblings, 1 reply; 15+ messages in thread
From: Shaohua Li @ 2013-07-30 13:07 UTC (permalink / raw)
  To: Tejun Heo; +Cc: linux-raid, linux-kernel, neilb, djbw

On Tue, Jul 30, 2013 at 08:53:06AM -0400, Tejun Heo wrote:
> Hello,
> 
> On Tue, Jul 30, 2013 at 01:52:08PM +0800, shli@kernel.org wrote:
> > +static void raid5_wakeup_stripe_thread(struct stripe_head *sh)
> > +{
> > +	struct r5conf *conf = sh->raid_conf;
> > +	struct r5worker_group *group;
> > +	int i;
> > +
> > +	if (conf->worker_cnt_per_group == 0) {
> > +		md_wakeup_thread(conf->mddev->thread);
> > +		return;
> > +	}
> > +
> > +	group = conf->worker_groups + cpu_to_group(sh->cpu);
> > +
> > +	for (i = 0; i < conf->worker_cnt_per_group; i++)
> > +		queue_work_on(sh->cpu, raid5_wq, &group->workers[i].work);
> > +}
> 
> Another general suggestion.  Using workqueue mechanism simply as
> thread dispatching mechanism like above and then buliding your own
> work dispatching code on top is usually a poor form.  It usually is
> much better to assign a single unit of work to a single work item as
> it allows things like per work unit flushing and much easier
> implementation of freezing.  It's possible that you have some
> overriding constraints here but if so it'd be nice if you can explain
> it.

Ok, I should explain here. I can't add a work_struct for each stripe, because
that would stress the workqueue very hard. My system handles > 1M stripes/s,
which makes the workqueue pool lock very heavily contended.

Thanks,
Shaohua

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [patch 3/3] raid5: only wakeup necessary threads
  2013-07-30 12:46   ` Tejun Heo
@ 2013-07-30 13:24     ` Shaohua Li
  2013-07-30 14:01       ` Tejun Heo
  0 siblings, 1 reply; 15+ messages in thread
From: Shaohua Li @ 2013-07-30 13:24 UTC (permalink / raw)
  To: Tejun Heo; +Cc: linux-raid, linux-kernel, neilb, djbw

On Tue, Jul 30, 2013 at 08:46:55AM -0400, Tejun Heo wrote:
> Hello,
> 
> On Tue, Jul 30, 2013 at 01:52:10PM +0800, shli@kernel.org wrote:
> > If there are not enough stripes to handle, we'd better not always queue all
> > available work_structs. If one worker can only handle a few stripes, or even
> > none, it will hurt request merging and create lock contention.
> > 
> > With this patch, the number of running work_structs depends on the number of
> > pending stripes. Note that some statistics used in the patch are accessed
> > without locking protection. This shouldn't matter; we just do our best to
> > avoid queueing unnecessary work_structs.
> 
> I haven't really followed the code but two general comments.
> 
> * Stacking drivers in general should always try to keep the bios
>   passing through in the same order that they are received.  The order
>   of bios is an important information to the io scheduler and io
>   scheduling will suffer badly if the bios are shuffled by the
>   stacking driver.  It'd probably be a good idea to have a mechanism
>   to keep the issue order intact even when multiple workers are
>   employed.

In the raid5 case, it's very hard to keep the order the bios are passed in,
because we need to read some disks, calculate parity, and write some disks,
and the timing can break any kind of order. Besides, the workqueue handles 8
stripes at a time, so I suppose this preserves some ordering where there is
any.
 
> * While limiting the number of work_struct dynamically could be
>   beneficial and it's upto Neil, it'd be nice if you can accompany it
>   with some numbers so that whether such optimization actually is
>   worthwhile or not can be decided.  The same goes for the whole
>   series, I suppose.

Sure, I can add the numbers in the next post. Basically, if I let 8 workers
run for a 7-disk raid5 setup, multi-threading is 4x ~ 5x faster.

Thanks,
Shaohua

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [patch 1/3] raid5: offload stripe handle to workqueue
  2013-07-30 13:07     ` Shaohua Li
@ 2013-07-30 13:57       ` Tejun Heo
  2013-07-31  1:24         ` Shaohua Li
  0 siblings, 1 reply; 15+ messages in thread
From: Tejun Heo @ 2013-07-30 13:57 UTC (permalink / raw)
  To: Shaohua Li; +Cc: linux-raid, linux-kernel, neilb, djbw

Hello,

On Tue, Jul 30, 2013 at 09:07:08PM +0800, Shaohua Li wrote:
> Ok, I should explain here. I can't add a work_struct for each stripe, because
> this will stress workqueue very hard. My system handles > 1M/s stripes, which
> makes workqueue pool lock contended very hard.

It doesn't have to be embedding a work_struct in each stripe and
scheduling them all together.  It's more about scheduling "work units"
rather than "workers" - ie. letting each scheduled work item handle a
single work unit rather than making it dispatch multiple work items.
It may make controlling concurrency a bit more interesting but you can
always do it with workqueue_set_max_active(), which is the intended
usage anyway.
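
A minimal sketch of that pattern (made-up names; the limit of 4 is arbitrary,
and stripe_unit is a hypothetical stand-in for whatever carries one unit of
work):

struct stripe_unit {
	struct work_struct work;
	/* per-unit payload */
};

static struct workqueue_struct *example_wq;

static int example_setup(void)
{
	example_wq = alloc_workqueue("example_wq",
				     WQ_UNBOUND | WQ_MEM_RECLAIM, 0);
	if (!example_wq)
		return -ENOMEM;
	/* cap concurrent execution; additional queued work items wait */
	workqueue_set_max_active(example_wq, 4);
	return 0;
}

static void example_submit(struct stripe_unit *unit)
{
	/* one queued work item per unit of work; unit->work is assumed to
	 * have been INIT_WORK()'d when the unit was created */
	queue_work(example_wq, &unit->work);
}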

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [patch 3/3] raid5: only wakeup necessary threads
  2013-07-30 13:24     ` Shaohua Li
@ 2013-07-30 14:01       ` Tejun Heo
  0 siblings, 0 replies; 15+ messages in thread
From: Tejun Heo @ 2013-07-30 14:01 UTC (permalink / raw)
  To: Shaohua Li; +Cc: linux-raid, linux-kernel, neilb, djbw

Hello, Shaohua.

On Tue, Jul 30, 2013 at 09:24:14PM +0800, Shaohua Li wrote:
> In the raid5 case, it's very hard to keep the order the bios passed in, because
> we need read some disks, calculate parity, and write some disks, the timing
> could break any kind of order. Besides the workqueue handles 8 stripes one
> time, so I suppose this keeps some order if there is.

Of course, it can't be absolute but still keeping the relative
ordering and temporal locality (ie. issue IOs in order once the XOR
calculation of the whole source bio is done rather than issuing as
each block completes calculation) would help quite a bit.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [patch 1/3] raid5: offload stripe handle to workqueue
  2013-07-30 13:57       ` Tejun Heo
@ 2013-07-31  1:24         ` Shaohua Li
  2013-07-31 10:33           ` Tejun Heo
  0 siblings, 1 reply; 15+ messages in thread
From: Shaohua Li @ 2013-07-31  1:24 UTC (permalink / raw)
  To: Tejun Heo; +Cc: linux-raid, linux-kernel, neilb, djbw

On Tue, Jul 30, 2013 at 09:57:51AM -0400, Tejun Heo wrote:
> Hello,
> 
> On Tue, Jul 30, 2013 at 09:07:08PM +0800, Shaohua Li wrote:
> > Ok, I should explain here. I can't add a work_struct for each stripe, because
> > this will stress workqueue very hard. My system handles > 1M/s stripes, which
> > makes workqueue pool lock contended very hard.
> 
> It doesn't have to be embedding work_struct in each stripe and
> schduling them altogether.  It's more about scheduling "work units"
> rather than "workers" - ie. letting each scheduled work item handle
> single work unit rather than making it dispatch multiple work items.
> It may make controlling concurrency a bit more interesting but you can
> always do it with workqueue_set_max_active(), which is the intended
> usage anyway.

The stripe is the work unit, actually. As I said, if I queue a work item for
each stripe, just calling queue_work() will make the system blast because of
pwq->pool->lock contention. Dispatching one work item per stripe has another
side effect: I can't add a block plug. Since this is a queueing-stage
concurrency problem, workqueue_set_max_active() doesn't help.

Thanks,
Shaohua

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [patch 1/3] raid5: offload stripe handle to workqueue
  2013-07-31  1:24         ` Shaohua Li
@ 2013-07-31 10:33           ` Tejun Heo
  2013-08-01  2:01             ` Shaohua Li
  0 siblings, 1 reply; 15+ messages in thread
From: Tejun Heo @ 2013-07-31 10:33 UTC (permalink / raw)
  To: Shaohua Li; +Cc: linux-raid, linux-kernel, neilb, djbw

Hello,

On Wed, Jul 31, 2013 at 09:24:34AM +0800, Shaohua Li wrote:
> stripe is the work unit actually. As I said, if I queue a work for each stripe,
> just queue_work() will make the system blast because of the pwq->pool->lock
> contention. dispatching one work has another side effect that I can't add block

Hmmm.... I see.  I'm not familiar with the code base and could be
missing something but how does the custom stripe dispatch queue
synchronize?  Doesn't that make use of a lock anyway?  If so, how
would scheduling separate work items be worse?  Also, can you please
elaborate the block plug part?

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [patch 1/3] raid5: offload stripe handle to workqueue
  2013-07-31 10:33           ` Tejun Heo
@ 2013-08-01  2:01             ` Shaohua Li
  2013-08-01 12:15               ` Tejun Heo
  0 siblings, 1 reply; 15+ messages in thread
From: Shaohua Li @ 2013-08-01  2:01 UTC (permalink / raw)
  To: Tejun Heo; +Cc: linux-raid, linux-kernel, neilb, djbw

On Wed, Jul 31, 2013 at 06:33:32AM -0400, Tejun Heo wrote:
> Hello,
> 
> On Wed, Jul 31, 2013 at 09:24:34AM +0800, Shaohua Li wrote:
> > stripe is the work unit actually. As I said, if I queue a work for each stripe,
> > just queue_work() will make the system blast because of the pwq->pool->lock
> > contention. dispatching one work has another side effect that I can't add block
> 
> Hmmm.... I see.  I'm not familiar with the code base and could be
> missing something but how does the custom stripe dispatch queue
> synchronize?  Doesn't that make use of a lock anyway?  If so, how
> would scheduling separate work items be worse?  

It does have a lock, but when a stripe is queued to be handled, no lock is
required. So the workqueue lock would be highly contended.

> Also, can you please
> elaborate the block plug part?

Basically I do:

blk_start_plug()
handle_stripe() // may dispatch requests
blk_finish_plug()

If I only handle one stripe inside the block plug, the plug is useless, so I
need to handle several stripes.
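
Expanded slightly into the batching pattern this implies (a sketch only; the
batch array and batch_size are the ones filled in by handle_active_stripes()
in patch 1):

struct blk_plug plug;
int i;

blk_start_plug(&plug);
/* handle the whole batch while the plug is held so the bios each stripe
 * dispatches can be merged before they reach the device queues */
for (i = 0; i < batch_size; i++)
	handle_stripe(batch[i]);
blk_finish_plug(&plug);	/* plugged bios are flushed to the queues here */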

Thanks,
Shaohua

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [patch 1/3] raid5: offload stripe handle to workqueue
  2013-08-01  2:01             ` Shaohua Li
@ 2013-08-01 12:15               ` Tejun Heo
  0 siblings, 0 replies; 15+ messages in thread
From: Tejun Heo @ 2013-08-01 12:15 UTC (permalink / raw)
  To: Shaohua Li; +Cc: linux-raid, linux-kernel, neilb, djbw

Hello,

On Thu, Aug 01, 2013 at 10:01:01AM +0800, Shaohua Li wrote:
> It does have lock, but when a stripe is queued to handle, no lock is required.

Hmmmm.... sorry but can you please explain it a bit further?  Why
wouldn't it require a lock?  Perhaps because it has to be on some
queue already?

> So the workqueue lock will be high contended.

I still don't follow how it'd be more contended than the presumably
single lock that you'd have to use for queueing.

> > Also, can you please
> > elaborate the block plug part?
> 
> Basically I do:
> 
> blk_start_plug()
> handle_stripe() //may dispatch request
> blk_end_plug()
> 
> If only handle one stripe between block plug, the plug is useless, so I need
> handle several stripes.

Ah, right, plugging is tied to the current context, so yeap that makes
sense.

The only thing which may actually matter is freezer handling as direct
freezer handling tends to be pretty tricky.  If the work item
processing doesn't need freezing (which should be the case, I think),
it's all good.  If it does, it'd probably be best to make the
workqueue freezable and break out of the loop if freezing.
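
A rough sketch of that suggestion, reusing the worker loop from patch 1 (the
WQ_FREEZABLE flag and the freezing() check are not part of the posted series):

#include <linux/freezer.h>	/* for freezing() */

	raid5_wq = alloc_workqueue("raid5wq",
			WQ_UNBOUND | WQ_MEM_RECLAIM | WQ_CPU_INTENSIVE |
			WQ_SYSFS | WQ_FREEZABLE, 0);

	/* ... and in raid5_do_work(): stop pulling new stripes once a
	 * freeze has started, so the freezer can drain the workqueue */
	while (1) {
		int batch_size, released;

		released = release_stripe_list(conf);
		batch_size = handle_active_stripes(conf, group_id);
		if ((!batch_size && !released) || freezing(current))
			break;
		handled += batch_size;
	}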

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread

Thread overview: 15+ messages:
2013-07-30  5:52 [patch 0/3] raid5: make stripe handling multi-threading shli
2013-07-30  5:52 ` [patch 1/3] raid5: offload stripe handle to workqueue shli
2013-07-30 11:46   ` Tejun Heo
2013-07-30 12:53   ` Tejun Heo
2013-07-30 13:07     ` Shaohua Li
2013-07-30 13:57       ` Tejun Heo
2013-07-31  1:24         ` Shaohua Li
2013-07-31 10:33           ` Tejun Heo
2013-08-01  2:01             ` Shaohua Li
2013-08-01 12:15               ` Tejun Heo
2013-07-30  5:52 ` [patch 2/3] raid5: sysfs entry to control worker thread number shli
2013-07-30  5:52 ` [patch 3/3] raid5: only wakeup necessary threads shli
2013-07-30 12:46   ` Tejun Heo
2013-07-30 13:24     ` Shaohua Li
2013-07-30 14:01       ` Tejun Heo
