* [patch v3 0/5] raid5: make stripe handling multi-threading
@ 2013-08-27 9:50 Shaohua Li
2013-08-27 9:50 ` [patch v3 1/5] raid5: make release_stripe lockless Shaohua Li
` (4 more replies)
0 siblings, 5 replies; 20+ messages in thread
From: Shaohua Li @ 2013-08-27 9:50 UTC (permalink / raw)
To: linux-raid; +Cc: neilb, tj, dan.j.williams
Neil,
This set is basically the same as my previous post
http://marc.info/?l=linux-raid&m=137627404708265&w=2, except that it fixes the
two issues you pointed out today. The first two patches were in your tree
before; since you rebased your for-next tree, I have included them here again.
The set should apply cleanly to your for-next.
Thanks,
Shaohua
^ permalink raw reply [flat|nested] 20+ messages in thread
* [patch v3 1/5] raid5: make release_stripe lockless
2013-08-27 9:50 [patch v3 0/5] raid5: make stripe handling multi-threading Shaohua Li
@ 2013-08-27 9:50 ` Shaohua Li
2013-08-28 14:04 ` Tejun Heo
2013-08-27 9:50 ` [patch v3 2/5] raid5: fix stripe release order Shaohua Li
` (3 subsequent siblings)
4 siblings, 1 reply; 20+ messages in thread
From: Shaohua Li @ 2013-08-27 9:50 UTC (permalink / raw)
To: linux-raid; +Cc: neilb, tj, dan.j.williams
[-- Attachment #1: raid5-reduce-release_stripe.patch --]
[-- Type: text/plain, Size: 5221 bytes --]
release_stripe still has big lock contention. With this patch we just add the
stripe to a llist without taking device_lock, and let the raid5d thread do the
real stripe release, which must hold device_lock anyway. In this way,
release_stripe doesn't hold any locks.
The side effect is that the order in which stripes are released changes. That
doesn't sound like a big deal: stripes are never handled in order, and the
block layer can already do nice request merging, so the order isn't that
important.
I kept the unplug release batch. From a lock-contention point of view it is
unnecessary with this patch, and if we deleted it the stripe_head release_list
and lru could actually share storage. But the unplug release batch also helps
request merging. We could probably delay waking up raid5d till unplug, but I'm
still wary of the case where raid5d is already running.
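For readability, here is a condensed sketch of the mechanism implemented by the
hunks below (declarations and unrelated code omitted; the names are the ones
used in the patch, and the slow path is the pre-existing device_lock path):

static void release_stripe(struct stripe_head *sh)
{
	struct r5conf *conf = sh->raid_conf;
	unsigned long flags;

	/* Fast path: no locks. If the stripe is already queued for release,
	 * fall back to the original device_lock path. */
	if (test_and_set_bit(STRIPE_ON_RELEASE_LIST, &sh->state))
		goto slow_path;
	/* llist_add() returns true only when the list was empty, so raid5d
	 * is woken at most once per batch of releases. */
	if (llist_add(&sh->release_list, &conf->released_stripes))
		md_wakeup_thread(conf->mddev->thread);
	return;
slow_path:
	local_irq_save(flags);
	if (atomic_dec_and_lock(&sh->count, &conf->device_lock)) {
		do_release_stripe(conf, sh);
		spin_unlock(&conf->device_lock);
	}
	local_irq_restore(flags);
}

/* Consumer side, called from raid5d with conf->device_lock already held. */
static int release_stripe_list(struct r5conf *conf)
{
	struct llist_node *head = llist_del_all(&conf->released_stripes);
	int count = 0;

	while (head) {
		struct stripe_head *sh =
			llist_entry(head, struct stripe_head, release_list);
		head = llist_next(head);
		/* sh can be re-added once the flag is cleared */
		smp_mb();
		clear_bit(STRIPE_ON_RELEASE_LIST, &sh->state);
		__release_stripe(conf, sh);
		count++;
	}
	return count;
}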
Signed-off-by: Shaohua Li <shli@fusionio.com>
---
drivers/md/raid5.c | 49 ++++++++++++++++++++++++++++++++++++++++++++++---
drivers/md/raid5.h | 3 +++
2 files changed, 49 insertions(+), 3 deletions(-)
Index: linux/drivers/md/raid5.c
===================================================================
--- linux.orig/drivers/md/raid5.c 2013-08-27 16:58:13.526743800 +0800
+++ linux/drivers/md/raid5.c 2013-08-27 16:58:13.522743850 +0800
@@ -239,12 +239,47 @@ static void __release_stripe(struct r5co
do_release_stripe(conf, sh);
}
+/* should hold conf->device_lock already */
+static int release_stripe_list(struct r5conf *conf)
+{
+ struct stripe_head *sh;
+ int count = 0;
+ struct llist_node *head;
+
+ head = llist_del_all(&conf->released_stripes);
+ while (head) {
+ sh = llist_entry(head, struct stripe_head, release_list);
+ head = llist_next(head);
+ /* sh could be re-added after STRIPE_ON_RELEASE_LIST is cleared */
+ smp_mb();
+ clear_bit(STRIPE_ON_RELEASE_LIST, &sh->state);
+ /*
+ * Don't worry the bit is set here, because if the bit is set
+ * again, the count is always > 1. This is true for
+ * STRIPE_ON_UNPLUG_LIST bit too.
+ */
+ __release_stripe(conf, sh);
+ count++;
+ }
+
+ return count;
+}
+
static void release_stripe(struct stripe_head *sh)
{
struct r5conf *conf = sh->raid_conf;
unsigned long flags;
+ bool wakeup;
+ if (test_and_set_bit(STRIPE_ON_RELEASE_LIST, &sh->state))
+ goto slow_path;
+ wakeup = llist_add(&sh->release_list, &conf->released_stripes);
+ if (wakeup)
+ md_wakeup_thread(conf->mddev->thread);
+ return;
+slow_path:
local_irq_save(flags);
+ /* we are ok here if STRIPE_ON_RELEASE_LIST is set or not */
if (atomic_dec_and_lock(&sh->count, &conf->device_lock)) {
do_release_stripe(conf, sh);
spin_unlock(&conf->device_lock);
@@ -491,7 +526,8 @@ get_active_stripe(struct r5conf *conf, s
if (atomic_read(&sh->count)) {
BUG_ON(!list_empty(&sh->lru)
&& !test_bit(STRIPE_EXPANDING, &sh->state)
- && !test_bit(STRIPE_ON_UNPLUG_LIST, &sh->state));
+ && !test_bit(STRIPE_ON_UNPLUG_LIST, &sh->state)
+ && !test_bit(STRIPE_ON_RELEASE_LIST, &sh->state));
} else {
if (!test_bit(STRIPE_HANDLE, &sh->state))
atomic_inc(&conf->active_stripes);
@@ -4127,6 +4163,10 @@ static void raid5_unplug(struct blk_plug
*/
smp_mb__before_clear_bit();
clear_bit(STRIPE_ON_UNPLUG_LIST, &sh->state);
+ /*
+ * STRIPE_ON_RELEASE_LIST could be set here. In that
+ * case, the count is always > 1 here
+ */
__release_stripe(conf, sh);
cnt++;
}
@@ -4845,7 +4885,9 @@ static void raid5d(struct md_thread *thr
spin_lock_irq(&conf->device_lock);
while (1) {
struct bio *bio;
- int batch_size;
+ int batch_size, released;
+
+ released = release_stripe_list(conf);
if (
!list_empty(&conf->bitmap_list)) {
@@ -4870,7 +4912,7 @@ static void raid5d(struct md_thread *thr
}
batch_size = handle_active_stripes(conf);
- if (!batch_size)
+ if (!batch_size && !released)
break;
handled += batch_size;
@@ -5186,6 +5228,7 @@ static struct r5conf *setup_conf(struct
INIT_LIST_HEAD(&conf->delayed_list);
INIT_LIST_HEAD(&conf->bitmap_list);
INIT_LIST_HEAD(&conf->inactive_list);
+ init_llist_head(&conf->released_stripes);
atomic_set(&conf->active_stripes, 0);
atomic_set(&conf->preread_active_stripes, 0);
atomic_set(&conf->active_aligned_reads, 0);
Index: linux/drivers/md/raid5.h
===================================================================
--- linux.orig/drivers/md/raid5.h 2013-08-27 16:58:13.526743800 +0800
+++ linux/drivers/md/raid5.h 2013-08-27 16:58:13.522743850 +0800
@@ -197,6 +197,7 @@ enum reconstruct_states {
struct stripe_head {
struct hlist_node hash;
struct list_head lru; /* inactive_list or handle_list */
+ struct llist_node release_list;
struct r5conf *raid_conf;
short generation; /* increments with every
* reshape */
@@ -321,6 +322,7 @@ enum {
STRIPE_OPS_REQ_PENDING,
STRIPE_ON_UNPLUG_LIST,
STRIPE_DISCARD,
+ STRIPE_ON_RELEASE_LIST,
};
/*
@@ -446,6 +448,7 @@ struct r5conf {
*/
atomic_t active_stripes;
struct list_head inactive_list;
+ struct llist_head released_stripes;
wait_queue_head_t wait_for_stripe;
wait_queue_head_t wait_for_overlap;
int inactive_blocked; /* release of inactive stripes blocked,
^ permalink raw reply [flat|nested] 20+ messages in thread
* [patch v3 2/5] raid5: fix stripe release order
2013-08-27 9:50 [patch v3 0/5] raid5: make stripe handling multi-threading Shaohua Li
2013-08-27 9:50 ` [patch v3 1/5] raid5: make release_stripe lockless Shaohua Li
@ 2013-08-27 9:50 ` Shaohua Li
2013-08-28 3:41 ` NeilBrown
2013-08-27 9:50 ` [patch v3 3/5] raid5: offload stripe handle to workqueue Shaohua Li
` (2 subsequent siblings)
4 siblings, 1 reply; 20+ messages in thread
From: Shaohua Li @ 2013-08-27 9:50 UTC (permalink / raw)
To: linux-raid; +Cc: neilb, tj, dan.j.williams
[-- Attachment #1: raid5-release_stripe-order.patch --]
[-- Type: text/plain, Size: 2255 bytes --]
patch "make release_stripe lockless" changes the order stripes are released.
Originally I thought block layer can take care of request merge, but it appears
there are still some requests not merged. It's easy to fix the order.
Signed-off-by: Shaohua Li <shli@fusionio.com>
---
drivers/md/raid5.c | 1 +
include/linux/llist.h | 1 +
lib/llist.c | 27 +++++++++++++++++++++++++++
3 files changed, 29 insertions(+)
Index: linux/include/linux/llist.h
===================================================================
--- linux.orig/include/linux/llist.h 2013-07-24 09:09:41.014384439 +0800
+++ linux/include/linux/llist.h 2013-07-25 15:09:03.109026773 +0800
@@ -172,4 +172,5 @@ static inline struct llist_node *llist_d
extern struct llist_node *llist_del_first(struct llist_head *head);
+extern struct llist_node *llist_reverse_order(struct llist_node *head);
#endif /* LLIST_H */
Index: linux/lib/llist.c
===================================================================
--- linux.orig/lib/llist.c 2013-07-24 09:09:41.062383834 +0800
+++ linux/lib/llist.c 2013-07-25 16:08:40.096054565 +0800
@@ -81,3 +81,30 @@ struct llist_node *llist_del_first(struc
return entry;
}
EXPORT_SYMBOL_GPL(llist_del_first);
+
+/*
+ * llist_reverse_order - reverse llist order
+ * @head: list head
+ *
+ * Return reversed list head
+ */
+struct llist_node *llist_reverse_order(struct llist_node *head)
+{
+ struct llist_node *second, *third;
+
+ if (head == NULL || head->next == NULL)
+ return head;
+ second = head->next;
+ head->next = NULL;
+
+ do {
+ third = second->next;
+ second->next = head;
+
+ head = second;
+ second = third;
+ } while (second);
+
+ return head;
+}
+EXPORT_SYMBOL_GPL(llist_reverse_order);
Index: linux/drivers/md/raid5.c
===================================================================
--- linux.orig/drivers/md/raid5.c 2013-07-25 15:02:38.289865404 +0800
+++ linux/drivers/md/raid5.c 2013-07-25 15:35:33.201040089 +0800
@@ -247,6 +247,7 @@ static int release_stripe_list(struct r5
struct llist_node *head;
head = llist_del_all(&conf->released_stripes);
+ head = llist_reverse_order(head);
while (head) {
sh = llist_entry(head, struct stripe_head, release_list);
head = llist_next(head);
^ permalink raw reply [flat|nested] 20+ messages in thread
* [patch v3 3/5] raid5: offload stripe handle to workqueue
2013-08-27 9:50 [patch v3 0/5] raid5: make stripe handling multi-threading Shaohua Li
2013-08-27 9:50 ` [patch v3 1/5] raid5: make release_stripe lockless Shaohua Li
2013-08-27 9:50 ` [patch v3 2/5] raid5: fix stripe release order Shaohua Li
@ 2013-08-27 9:50 ` Shaohua Li
2013-08-28 3:53 ` NeilBrown
2013-08-27 9:50 ` [patch v3 4/5] raid5: sysfs entry to control worker thread number Shaohua Li
2013-08-27 9:50 ` [patch v3 5/5] raid5: only wakeup necessary threads Shaohua Li
4 siblings, 1 reply; 20+ messages in thread
From: Shaohua Li @ 2013-08-27 9:50 UTC (permalink / raw)
To: linux-raid; +Cc: neilb, tj, dan.j.williams
[-- Attachment #1: raid5-wq.patch --]
[-- Type: text/plain, Size: 13618 bytes --]
This is another attempt to create multiple threads to handle raid5 stripes.
This time I use a workqueue.
raid5 handles requests (especially writes) in stripe units. A stripe is
page-size aligned and page-size long, and spans all disks. For a write to any
disk sector, raid5 runs a state machine for the corresponding stripe, which
includes reading some disks of the stripe, calculating parity, and writing some
disks of the stripe. The state machine currently runs in the raid5d thread.
Since there is only one thread, it doesn't scale well for high speed storage.
An obvious solution is multi-threading.
To get better performance, we have some requirements:
a. locality. A stripe corresponding to a request submitted from one cpu is
better handled by a thread on the local cpu or local node. The local cpu is
preferred, but it can sometimes become a bottleneck, for example when parity
calculation is too heavy; running on the local node adapts to a wider range of
workloads.
b. configurability. Different raid5 array setups might need different
configurations, especially the thread number. More threads don't always mean
better performance because of lock contention.
My original implementation created kernel threads directly. There were
interfaces to control which cpu's stripes each thread should handle, and
userspace could set the affinity of the threads. This provided the biggest
flexibility and configurability, but it was hard to use, and a new thread pool
implementation is apparently disfavored.
The recent workqueue improvements are quite promising. An unbound workqueue is
now bound to NUMA nodes, and if WQ_SYSFS is set there are sysfs options for
affinity settings; for example, we can include only one HT sibling in the
affinity mask. Work items are non-reentrant by default, and we can control the
number of running threads by limiting the number of dispatched work_structs.
In this patch, I create several stripe worker groups. A group corresponds to a
NUMA node. Stripes from cpus of one node are added to that group's list, and
workqueue threads of that node only handle stripes of the node's worker group.
In this way, stripe handling has NUMA node locality. And as I said, we can
control the thread number by limiting the number of dispatched work_structs.
The work_struct callback function handles several stripes in one run. Typical
workqueue usage is to run one unit of work in each work_struct; in the raid5
case, the unit would be a stripe. But we can't do that:
a. Though handling a stripe doesn't need a lock (because of reference
accounting, and the stripe isn't in any list), queuing a work_struct for each
stripe would make the workqueue lock very heavily contended.
b. blk_start_plug()/blk_finish_plug() should surround stripe handling, as we
might dispatch requests. If each work_struct only handled one stripe, such a
block plug would be meaningless.
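To make the flow easier to follow, here is a rough sketch of the dispatch path
the hunks below implement (simplified: locking, the non-threaded fallback and
hotplug details are omitted; cpu_to_group() is #defined to cpu_to_node() in the
patch):

static void raid5_wakeup_stripe_thread(struct stripe_head *sh)
{
	struct r5conf *conf = sh->raid_conf;
	struct r5worker_group *group;
	int i, cpu = sh->cpu;		/* recorded in init_stripe() */

	if (!cpu_online(cpu))
		cpu = sh->cpu = cpumask_any(cpu_online_mask);

	/* file the stripe on the list of its node's worker group */
	group = conf->worker_groups + cpu_to_group(cpu);
	if (list_empty(&sh->lru))
		list_add_tail(&sh->lru, &group->handle_list);

	/* kick the group's workers on the submitting cpu */
	for (i = 0; i < conf->worker_cnt_per_group; i++)
		queue_work_on(cpu, raid5_wq, &group->workers[i].work);
}

static void raid5_do_work(struct work_struct *work)
{
	struct r5worker *worker = container_of(work, struct r5worker, work);
	struct r5conf *conf = worker->group->conf;
	int group_id = worker->group - conf->worker_groups;
	struct blk_plug plug;

	/* one plug around a whole batch, so requests can still be merged */
	blk_start_plug(&plug);
	spin_lock_irq(&conf->device_lock);
	while (1) {
		int released = release_stripe_list(conf);
		/* handles up to MAX_STRIPE_BATCH stripes per call */
		int batch_size = handle_active_stripes(conf, group_id);

		if (!batch_size && !released)
			break;
	}
	spin_unlock_irq(&conf->device_lock);
	blk_finish_plug(&plug);
}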
This implementation can't do very fine grained configuration, but NUMA binding
is the most popular usage model and should be enough for most workloads.
Note: since we have only one stripe queue, switching to multi-threading might
decrease the request sizes dispatched down to the lower layers. The impact
depends on the thread number, raid configuration and workload, so
multi-threaded raid5 might not be appropriate for all setups.
Changes V1 -> V2:
1. remove WQ_NON_REENTRANT
2. disabling multi-threading by default
3. Add more descriptions in changelog
Signed-off-by: Shaohua Li <shli@fusionio.com>
---
drivers/md/raid5.c | 192 ++++++++++++++++++++++++++++++++++++++++++++++++-----
drivers/md/raid5.h | 15 ++++
2 files changed, 192 insertions(+), 15 deletions(-)
Index: linux/drivers/md/raid5.c
===================================================================
--- linux.orig/drivers/md/raid5.c 2013-08-27 16:58:42.362381311 +0800
+++ linux/drivers/md/raid5.c 2013-08-27 17:09:12.674458981 +0800
@@ -53,6 +53,7 @@
#include <linux/cpu.h>
#include <linux/slab.h>
#include <linux/ratelimit.h>
+#include <linux/nodemask.h>
#include <trace/events/block.h>
#include "md.h"
@@ -60,6 +61,10 @@
#include "raid0.h"
#include "bitmap.h"
+#define cpu_to_group(cpu) cpu_to_node(cpu)
+#define ANY_GROUP NUMA_NO_NODE
+
+static struct workqueue_struct *raid5_wq;
/*
* Stripe cache
*/
@@ -200,6 +205,34 @@ static int stripe_operations_active(stru
test_bit(STRIPE_COMPUTE_RUN, &sh->state);
}
+static void raid5_wakeup_stripe_thread(struct stripe_head *sh)
+{
+ struct r5conf *conf = sh->raid_conf;
+ struct r5worker_group *group;
+ int i, cpu = sh->cpu;
+
+ if (!cpu_online(cpu)) {
+ cpu = cpumask_any(cpu_online_mask);
+ sh->cpu = cpu;
+ }
+
+ if (list_empty(&sh->lru)) {
+ struct r5worker_group *group;
+ group = conf->worker_groups + cpu_to_group(cpu);
+ list_add_tail(&sh->lru, &group->handle_list);
+ }
+
+ if (conf->worker_cnt_per_group == 0) {
+ md_wakeup_thread(conf->mddev->thread);
+ return;
+ }
+
+ group = conf->worker_groups + cpu_to_group(sh->cpu);
+
+ for (i = 0; i < conf->worker_cnt_per_group; i++)
+ queue_work_on(sh->cpu, raid5_wq, &group->workers[i].work);
+}
+
static void do_release_stripe(struct r5conf *conf, struct stripe_head *sh)
{
BUG_ON(!list_empty(&sh->lru));
@@ -214,7 +247,12 @@ static void do_release_stripe(struct r5c
else {
clear_bit(STRIPE_DELAYED, &sh->state);
clear_bit(STRIPE_BIT_DELAY, &sh->state);
- list_add_tail(&sh->lru, &conf->handle_list);
+ if (conf->worker_cnt_per_group == 0) {
+ list_add_tail(&sh->lru, &conf->handle_list);
+ } else {
+ raid5_wakeup_stripe_thread(sh);
+ return;
+ }
}
md_wakeup_thread(conf->mddev->thread);
} else {
@@ -395,6 +433,7 @@ static void init_stripe(struct stripe_he
raid5_build_block(sh, i, previous);
}
insert_hash(conf, sh);
+ sh->cpu = smp_processor_id();
}
static struct stripe_head *__find_stripe(struct r5conf *conf, sector_t sector,
@@ -3816,6 +3855,7 @@ static void raid5_activate_delayed(struc
if (!test_and_set_bit(STRIPE_PREREAD_ACTIVE, &sh->state))
atomic_inc(&conf->preread_active_stripes);
list_add_tail(&sh->lru, &conf->hold_list);
+ raid5_wakeup_stripe_thread(sh);
}
}
}
@@ -4095,18 +4135,38 @@ static int chunk_aligned_read(struct mdd
* head of the hold_list has changed, i.e. the head was promoted to the
* handle_list.
*/
-static struct stripe_head *__get_priority_stripe(struct r5conf *conf)
+static struct stripe_head *__get_priority_stripe(struct r5conf *conf, int group)
{
- struct stripe_head *sh;
+ struct stripe_head *sh = NULL, *tmp;
+ struct list_head *handle_list = NULL;
+
+ if (conf->worker_cnt_per_group == 0) {
+ handle_list = &conf->handle_list;
+ if (list_empty(handle_list))
+ handle_list = NULL;
+ } else if (group != ANY_GROUP) {
+ handle_list = &conf->worker_groups[group].handle_list;
+ if (list_empty(handle_list))
+ handle_list = NULL;
+ } else {
+ int i;
+ for (i = 0; i < conf->group_cnt; i++) {
+ handle_list = &conf->worker_groups[i].handle_list;
+ if (!list_empty(handle_list))
+ break;
+ }
+ if (i == conf->group_cnt)
+ handle_list = NULL;
+ }
pr_debug("%s: handle: %s hold: %s full_writes: %d bypass_count: %d\n",
__func__,
- list_empty(&conf->handle_list) ? "empty" : "busy",
+ list_empty(handle_list) ? "empty" : "busy",
list_empty(&conf->hold_list) ? "empty" : "busy",
atomic_read(&conf->pending_full_writes), conf->bypass_count);
- if (!list_empty(&conf->handle_list)) {
- sh = list_entry(conf->handle_list.next, typeof(*sh), lru);
+ if (handle_list) {
+ sh = list_entry(handle_list->next, typeof(*sh), lru);
if (list_empty(&conf->hold_list))
conf->bypass_count = 0;
@@ -4124,12 +4184,25 @@ static struct stripe_head *__get_priorit
((conf->bypass_threshold &&
conf->bypass_count > conf->bypass_threshold) ||
atomic_read(&conf->pending_full_writes) == 0)) {
- sh = list_entry(conf->hold_list.next,
- typeof(*sh), lru);
- conf->bypass_count -= conf->bypass_threshold;
- if (conf->bypass_count < 0)
- conf->bypass_count = 0;
- } else
+
+ list_for_each_entry(tmp, &conf->hold_list, lru) {
+ if (conf->worker_cnt_per_group == 0 ||
+ group == ANY_GROUP ||
+ !cpu_online(tmp->cpu) ||
+ cpu_to_group(tmp->cpu) == group) {
+ sh = tmp;
+ break;
+ }
+ }
+
+ if (sh) {
+ conf->bypass_count -= conf->bypass_threshold;
+ if (conf->bypass_count < 0)
+ conf->bypass_count = 0;
+ }
+ }
+
+ if (!sh)
return NULL;
list_del_init(&sh->lru);
@@ -4839,13 +4912,13 @@ static int retry_aligned_read(struct r5
}
#define MAX_STRIPE_BATCH 8
-static int handle_active_stripes(struct r5conf *conf)
+static int handle_active_stripes(struct r5conf *conf, int group)
{
struct stripe_head *batch[MAX_STRIPE_BATCH], *sh;
int i, batch_size = 0;
while (batch_size < MAX_STRIPE_BATCH &&
- (sh = __get_priority_stripe(conf)) != NULL)
+ (sh = __get_priority_stripe(conf, group)) != NULL)
batch[batch_size++] = sh;
if (batch_size == 0)
@@ -4863,6 +4936,38 @@ static int handle_active_stripes(struct
return batch_size;
}
+static void raid5_do_work(struct work_struct *work)
+{
+ struct r5worker *worker = container_of(work, struct r5worker, work);
+ struct r5worker_group *group = worker->group;
+ struct r5conf *conf = group->conf;
+ int group_id = group - conf->worker_groups;
+ int handled;
+ struct blk_plug plug;
+
+ pr_debug("+++ raid5worker active\n");
+
+ blk_start_plug(&plug);
+ handled = 0;
+ spin_lock_irq(&conf->device_lock);
+ while (1) {
+ int batch_size, released;
+
+ released = release_stripe_list(conf);
+
+ batch_size = handle_active_stripes(conf, group_id);
+ if (!batch_size && !released)
+ break;
+ handled += batch_size;
+ }
+ pr_debug("%d stripes handled\n", handled);
+
+ spin_unlock_irq(&conf->device_lock);
+ blk_finish_plug(&plug);
+
+ pr_debug("--- raid5worker inactive\n");
+}
+
/*
* This is our raid5 kernel thread.
*
@@ -4912,7 +5017,7 @@ static void raid5d(struct md_thread *thr
handled++;
}
- batch_size = handle_active_stripes(conf);
+ batch_size = handle_active_stripes(conf, ANY_GROUP);
if (!batch_size && !released)
break;
handled += batch_size;
@@ -5052,6 +5157,54 @@ static struct attribute_group raid5_attr
.attrs = raid5_attrs,
};
+static int alloc_thread_groups(struct r5conf *conf, int cnt)
+{
+ int i, j;
+ ssize_t size;
+ struct r5worker *workers;
+
+ conf->worker_cnt_per_group = cnt;
+ if (cnt == 0) {
+ conf->worker_groups = NULL;
+ return 0;
+ }
+ conf->group_cnt = num_possible_nodes();
+ size = sizeof(struct r5worker) * cnt;
+ workers = kzalloc(size * conf->group_cnt, GFP_NOIO);
+ conf->worker_groups = kzalloc(sizeof(struct r5worker_group) *
+ conf->group_cnt, GFP_NOIO);
+ if (!conf->worker_groups || !workers) {
+ kfree(workers);
+ kfree(conf->worker_groups);
+ conf->worker_groups = NULL;
+ return -ENOMEM;
+ }
+
+ for (i = 0; i < conf->group_cnt; i++) {
+ struct r5worker_group *group;
+
+ group = &conf->worker_groups[i];
+ INIT_LIST_HEAD(&group->handle_list);
+ group->conf = conf;
+ group->workers = workers + i * cnt;
+
+ for (j = 0; j < cnt; j++) {
+ group->workers[j].group = group;
+ INIT_WORK(&group->workers[j].work, raid5_do_work);
+ }
+ }
+
+ return 0;
+}
+
+static void free_thread_groups(struct r5conf *conf)
+{
+ if (conf->worker_groups)
+ kfree(conf->worker_groups[0].workers);
+ kfree(conf->worker_groups);
+ conf->worker_groups = NULL;
+}
+
static sector_t
raid5_size(struct mddev *mddev, sector_t sectors, int raid_disks)
{
@@ -5092,6 +5245,7 @@ static void raid5_free_percpu(struct r5c
static void free_conf(struct r5conf *conf)
{
+ free_thread_groups(conf);
shrink_stripes(conf);
raid5_free_percpu(conf);
kfree(conf->disks);
@@ -5220,6 +5374,9 @@ static struct r5conf *setup_conf(struct
conf = kzalloc(sizeof(struct r5conf), GFP_KERNEL);
if (conf == NULL)
goto abort;
+ /* Don't enable multi-threading by default */
+ if (alloc_thread_groups(conf, 0))
+ goto abort;
spin_lock_init(&conf->device_lock);
seqcount_init(&conf->gen_lock);
init_waitqueue_head(&conf->wait_for_stripe);
@@ -6535,6 +6692,10 @@ static struct md_personality raid4_perso
static int __init raid5_init(void)
{
+ raid5_wq = alloc_workqueue("raid5wq",
+ WQ_UNBOUND|WQ_MEM_RECLAIM|WQ_CPU_INTENSIVE|WQ_SYSFS, 0);
+ if (!raid5_wq)
+ return -ENOMEM;
register_md_personality(&raid6_personality);
register_md_personality(&raid5_personality);
register_md_personality(&raid4_personality);
@@ -6546,6 +6707,7 @@ static void raid5_exit(void)
unregister_md_personality(&raid6_personality);
unregister_md_personality(&raid5_personality);
unregister_md_personality(&raid4_personality);
+ destroy_workqueue(raid5_wq);
}
module_init(raid5_init);
Index: linux/drivers/md/raid5.h
===================================================================
--- linux.orig/drivers/md/raid5.h 2013-08-27 16:58:42.362381311 +0800
+++ linux/drivers/md/raid5.h 2013-08-27 16:58:53.000000000 +0800
@@ -212,6 +212,7 @@ struct stripe_head {
enum check_states check_state;
enum reconstruct_states reconstruct_state;
spinlock_t stripe_lock;
+ int cpu;
/**
* struct stripe_operations
* @target - STRIPE_OP_COMPUTE_BLK target
@@ -365,6 +366,17 @@ struct disk_info {
struct md_rdev *rdev, *replacement;
};
+struct r5worker {
+ struct work_struct work;
+ struct r5worker_group *group;
+};
+
+struct r5worker_group {
+ struct list_head handle_list;
+ struct r5conf *conf;
+ struct r5worker *workers;
+};
+
struct r5conf {
struct hlist_head *stripe_hashtbl;
struct mddev *mddev;
@@ -462,6 +474,9 @@ struct r5conf {
* the new thread here until we fully activate the array.
*/
struct md_thread *thread;
+ struct r5worker_group *worker_groups;
+ int group_cnt;
+ int worker_cnt_per_group;
};
/*
^ permalink raw reply [flat|nested] 20+ messages in thread
* [patch v3 4/5] raid5: sysfs entry to control worker thread number
2013-08-27 9:50 [patch v3 0/5] raid5: make stripe handling multi-threading Shaohua Li
` (2 preceding siblings ...)
2013-08-27 9:50 ` [patch v3 3/5] raid5: offload stripe handle to workqueue Shaohua Li
@ 2013-08-27 9:50 ` Shaohua Li
2013-08-27 9:50 ` [patch v3 5/5] raid5: only wakeup necessary threads Shaohua Li
4 siblings, 0 replies; 20+ messages in thread
From: Shaohua Li @ 2013-08-27 9:50 UTC (permalink / raw)
To: linux-raid; +Cc: neilb, tj, dan.j.williams
[-- Attachment #1: raid5-configure-wq.patch --]
[-- Type: text/plain, Size: 2271 bytes --]
Add a sysfs entry to control the number of running workqueue threads. If
group_thread_cnt is set to 0, workqueue offloading of stripe handling is
disabled.
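For example (assuming the entry shows up next to the existing raid5 md
attributes such as stripe_cache_size, i.e. under /sys/block/mdX/md/), enabling
two worker threads per NUMA group would be a matter of writing 2 to
/sys/block/mdX/md/group_thread_cnt, and writing 0 would fall back to the
single raid5d thread.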
Signed-off-by: Shaohua Li <shli@fusionio.com>
---
drivers/md/raid5.c | 60 +++++++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 60 insertions(+)
Index: linux/drivers/md/raid5.c
===================================================================
--- linux.orig/drivers/md/raid5.c 2013-08-27 17:14:52.910179844 +0800
+++ linux/drivers/md/raid5.c 2013-08-27 17:14:52.906179906 +0800
@@ -5146,10 +5146,70 @@ stripe_cache_active_show(struct mddev *m
static struct md_sysfs_entry
raid5_stripecache_active = __ATTR_RO(stripe_cache_active);
+static ssize_t
+raid5_show_group_thread_cnt(struct mddev *mddev, char *page)
+{
+ struct r5conf *conf = mddev->private;
+ if (conf)
+ return sprintf(page, "%d\n", conf->worker_cnt_per_group);
+ else
+ return 0;
+}
+
+static int alloc_thread_groups(struct r5conf *conf, int cnt);
+static ssize_t
+raid5_store_group_thread_cnt(struct mddev *mddev, const char *page, size_t len)
+{
+ struct r5conf *conf = mddev->private;
+ unsigned long new;
+ int err;
+ struct r5worker_group *old_groups;
+ int old_group_cnt;
+
+ if (len >= PAGE_SIZE)
+ return -EINVAL;
+ if (!conf)
+ return -ENODEV;
+
+ if (kstrtoul(page, 10, &new))
+ return -EINVAL;
+
+ if (new == conf->worker_cnt_per_group)
+ return len;
+
+ mddev_suspend(mddev);
+
+ old_groups = conf->worker_groups;
+ old_group_cnt = conf->worker_cnt_per_group;
+
+ conf->worker_groups = NULL;
+ err = alloc_thread_groups(conf, new);
+ if (err) {
+ conf->worker_groups = old_groups;
+ conf->worker_cnt_per_group = old_group_cnt;
+ } else {
+ if (old_groups)
+ kfree(old_groups[0].workers);
+ kfree(old_groups);
+ }
+
+ mddev_resume(mddev);
+
+ if (err)
+ return err;
+ return len;
+}
+
+static struct md_sysfs_entry
+raid5_group_thread_cnt = __ATTR(group_thread_cnt, S_IRUGO | S_IWUSR,
+ raid5_show_group_thread_cnt,
+ raid5_store_group_thread_cnt);
+
static struct attribute *raid5_attrs[] = {
&raid5_stripecache_size.attr,
&raid5_stripecache_active.attr,
&raid5_preread_bypass_threshold.attr,
+ &raid5_group_thread_cnt.attr,
NULL,
};
static struct attribute_group raid5_attrs_group = {
^ permalink raw reply [flat|nested] 20+ messages in thread
* [patch v3 5/5] raid5: only wakeup necessary threads
2013-08-27 9:50 [patch v3 0/5] raid5: make stripe handling multi-threading Shaohua Li
` (3 preceding siblings ...)
2013-08-27 9:50 ` [patch v3 4/5] raid5: sysfs entry to control worker thread number Shaohua Li
@ 2013-08-27 9:50 ` Shaohua Li
2013-08-28 4:13 ` NeilBrown
4 siblings, 1 reply; 20+ messages in thread
From: Shaohua Li @ 2013-08-27 9:50 UTC (permalink / raw)
To: linux-raid; +Cc: neilb, tj, dan.j.williams
[-- Attachment #1: raid5-intelligent-wakeup.patch --]
[-- Type: text/plain, Size: 6183 bytes --]
If there are not enough stripes to handle, we'd better not always queue all
available work_structs. If a worker can only handle a few stripes, or even
none, it will hurt request merging and create lock contention.
With this patch, the number of running work_structs depends on the number of
pending stripes. Note: some of the statistics used in the patch are accessed
without locking protection. This shouldn't matter; we just try our best to
avoid queuing unnecessary work_structs.
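As a worked example of the heuristic in the hunk below: with
MAX_STRIPE_BATCH = 8 and 20 stripes pending in a group, thread_cnt =
20 / 8 - 1 = 1, so besides worker 0 (which is always queued) at most one more
idle worker is woken; an already-running worker that reported a light batch
(working_cnt <= 4) also consumes that budget instead, on the assumption that
it will soon fetch more stripes itself.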
Signed-off-by: Shaohua Li <shli@fusionio.com>
---
drivers/md/raid5.c | 50 ++++++++++++++++++++++++++++++++++++++++++++------
drivers/md/raid5.h | 4 ++++
2 files changed, 48 insertions(+), 6 deletions(-)
Index: linux/drivers/md/raid5.c
===================================================================
--- linux.orig/drivers/md/raid5.c 2013-08-27 17:17:54.371899776 +0800
+++ linux/drivers/md/raid5.c 2013-08-27 17:18:56.331120797 +0800
@@ -77,6 +77,7 @@ static struct workqueue_struct *raid5_wq
#define BYPASS_THRESHOLD 1
#define NR_HASH (PAGE_SIZE / sizeof(struct hlist_head))
#define HASH_MASK (NR_HASH - 1)
+#define MAX_STRIPE_BATCH 8
static inline struct hlist_head *stripe_hash(struct r5conf *conf, sector_t sect)
{
@@ -209,6 +210,7 @@ static void raid5_wakeup_stripe_thread(s
{
struct r5conf *conf = sh->raid_conf;
struct r5worker_group *group;
+ int thread_cnt;
int i, cpu = sh->cpu;
if (!cpu_online(cpu)) {
@@ -220,6 +222,8 @@ static void raid5_wakeup_stripe_thread(s
struct r5worker_group *group;
group = conf->worker_groups + cpu_to_group(cpu);
list_add_tail(&sh->lru, &group->handle_list);
+ group->stripes_cnt++;
+ sh->group = group;
}
if (conf->worker_cnt_per_group == 0) {
@@ -229,8 +233,26 @@ static void raid5_wakeup_stripe_thread(s
group = conf->worker_groups + cpu_to_group(sh->cpu);
- for (i = 0; i < conf->worker_cnt_per_group; i++)
- queue_work_on(sh->cpu, raid5_wq, &group->workers[i].work);
+ group->workers[0].working = true;
+ /* at least one worker should run to avoid race */
+ queue_work_on(sh->cpu, raid5_wq, &group->workers[0].work);
+
+ thread_cnt = group->stripes_cnt / MAX_STRIPE_BATCH - 1;
+ /* wakeup more workers */
+ for (i = 1; i < conf->worker_cnt_per_group && thread_cnt > 0; i++) {
+ if (group->workers[i].working == false) {
+ group->workers[i].working = true;
+ queue_work_on(sh->cpu, raid5_wq,
+ &group->workers[i].work);
+ thread_cnt--;
+ } else if (group->workers[i].working_cnt <=
+ MAX_STRIPE_BATCH / 2)
+ /*
+ * If a worker doesn't have enough stripes to handle, assume
+ * it will fetch more stripes soon.
+ */
+ thread_cnt--;
+ }
}
static void do_release_stripe(struct r5conf *conf, struct stripe_head *sh)
@@ -575,6 +597,10 @@ get_active_stripe(struct r5conf *conf, s
!test_bit(STRIPE_EXPANDING, &sh->state))
BUG();
list_del_init(&sh->lru);
+ if (sh->group) {
+ sh->group->stripes_cnt--;
+ sh->group = NULL;
+ }
}
}
} while (sh == NULL);
@@ -4139,6 +4165,7 @@ static struct stripe_head *__get_priorit
{
struct stripe_head *sh = NULL, *tmp;
struct list_head *handle_list = NULL;
+ struct r5worker_group *wg = NULL;
if (conf->worker_cnt_per_group == 0) {
handle_list = &conf->handle_list;
@@ -4146,12 +4173,14 @@ static struct stripe_head *__get_priorit
handle_list = NULL;
} else if (group != ANY_GROUP) {
handle_list = &conf->worker_groups[group].handle_list;
+ wg = &conf->worker_groups[group];
if (list_empty(handle_list))
handle_list = NULL;
} else {
int i;
for (i = 0; i < conf->group_cnt; i++) {
handle_list = &conf->worker_groups[i].handle_list;
+ wg = &conf->worker_groups[i];
if (!list_empty(handle_list))
break;
}
@@ -4200,11 +4229,16 @@ static struct stripe_head *__get_priorit
if (conf->bypass_count < 0)
conf->bypass_count = 0;
}
+ wg = NULL;
}
if (!sh)
return NULL;
+ if (wg) {
+ wg->stripes_cnt--;
+ sh->group = NULL;
+ }
list_del_init(&sh->lru);
atomic_inc(&sh->count);
BUG_ON(atomic_read(&sh->count) != 1);
@@ -4911,8 +4945,8 @@ static int retry_aligned_read(struct r5
return handled;
}
-#define MAX_STRIPE_BATCH 8
-static int handle_active_stripes(struct r5conf *conf, int group)
+static int handle_active_stripes(struct r5conf *conf, int group,
+ struct r5worker *worker)
{
struct stripe_head *batch[MAX_STRIPE_BATCH], *sh;
int i, batch_size = 0;
@@ -4921,6 +4955,9 @@ static int handle_active_stripes(struct
(sh = __get_priority_stripe(conf, group)) != NULL)
batch[batch_size++] = sh;
+ if (worker)
+ worker->working_cnt = batch_size;
+
if (batch_size == 0)
return batch_size;
spin_unlock_irq(&conf->device_lock);
@@ -4955,11 +4992,12 @@ static void raid5_do_work(struct work_st
released = release_stripe_list(conf);
- batch_size = handle_active_stripes(conf, group_id);
+ batch_size = handle_active_stripes(conf, group_id, worker);
if (!batch_size && !released)
break;
handled += batch_size;
}
+ worker->working = false;
pr_debug("%d stripes handled\n", handled);
spin_unlock_irq(&conf->device_lock);
@@ -5017,7 +5055,7 @@ static void raid5d(struct md_thread *thr
handled++;
}
- batch_size = handle_active_stripes(conf, ANY_GROUP);
+ batch_size = handle_active_stripes(conf, ANY_GROUP, NULL);
if (!batch_size && !released)
break;
handled += batch_size;
Index: linux/drivers/md/raid5.h
===================================================================
--- linux.orig/drivers/md/raid5.h 2013-08-27 17:17:54.371899776 +0800
+++ linux/drivers/md/raid5.h 2013-08-27 17:17:54.367900015 +0800
@@ -213,6 +213,7 @@ struct stripe_head {
enum reconstruct_states reconstruct_state;
spinlock_t stripe_lock;
int cpu;
+ struct r5worker_group *group;
/**
* struct stripe_operations
* @target - STRIPE_OP_COMPUTE_BLK target
@@ -369,12 +370,15 @@ struct disk_info {
struct r5worker {
struct work_struct work;
struct r5worker_group *group;
+ int working_cnt:8;
+ bool working;
};
struct r5worker_group {
struct list_head handle_list;
struct r5conf *conf;
struct r5worker *workers;
+ int stripes_cnt;
};
struct r5conf {
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [patch v3 2/5] raid5: fix stripe release order
2013-08-27 9:50 ` [patch v3 2/5] raid5: fix stripe release order Shaohua Li
@ 2013-08-28 3:41 ` NeilBrown
2013-08-28 6:29 ` Shaohua Li
0 siblings, 1 reply; 20+ messages in thread
From: NeilBrown @ 2013-08-28 3:41 UTC (permalink / raw)
To: Shaohua Li; +Cc: linux-raid, tj, dan.j.williams
[-- Attachment #1: Type: text/plain, Size: 871 bytes --]
> +struct llist_node *llist_reverse_order(struct llist_node *head)
> +{
> + struct llist_node *second, *third;
> +
> + if (head == NULL || head->next == NULL)
> + return head;
> + second = head->next;
> + head->next = NULL;
> +
> + do {
> + third = second->next;
> + second->next = head;
> +
> + head = second;
> + second = third;
> + } while (second);
> +
> + return head;
> +}
> +EXPORT_SYMBOL_GPL(llist_reverse_order);
This is somewhat longer than necessary.
struct llist_node *llist_reverse_order(struct llist_node *head)
{
struct llist_node *new_head = NULL;
while (head) {
struct llist_node *tmp = head;
head = head->next;
tmp->next = new_head;
new_head = tmp;
}
return new_head;
}
I think that is short enough to just open-code at the top of
release_stripe_list.
Are you OK with that?
NeilBrown
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 828 bytes --]
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [patch v3 3/5] raid5: offload stripe handle to workqueue
2013-08-27 9:50 ` [patch v3 3/5] raid5: offload stripe handle to workqueue Shaohua Li
@ 2013-08-28 3:53 ` NeilBrown
2013-08-28 6:30 ` Shaohua Li
0 siblings, 1 reply; 20+ messages in thread
From: NeilBrown @ 2013-08-28 3:53 UTC (permalink / raw)
To: Shaohua Li; +Cc: linux-raid, tj, dan.j.williams
[-- Attachment #1: Type: text/plain, Size: 1708 bytes --]
On Tue, 27 Aug 2013 17:50:41 +0800 Shaohua Li <shli@kernel.org> wrote:
> -static struct stripe_head *__get_priority_stripe(struct r5conf *conf)
> +static struct stripe_head *__get_priority_stripe(struct r5conf *conf, int group)
> {
> - struct stripe_head *sh;
> + struct stripe_head *sh = NULL, *tmp;
> + struct list_head *handle_list = NULL;
> +
> + if (conf->worker_cnt_per_group == 0) {
> + handle_list = &conf->handle_list;
> + if (list_empty(handle_list))
> + handle_list = NULL;
> + } else if (group != ANY_GROUP) {
> + handle_list = &conf->worker_groups[group].handle_list;
> + if (list_empty(handle_list))
> + handle_list = NULL;
> + } else {
> + int i;
> + for (i = 0; i < conf->group_cnt; i++) {
> + handle_list = &conf->worker_groups[i].handle_list;
> + if (!list_empty(handle_list))
> + break;
> + }
> + if (i == conf->group_cnt)
> + handle_list = NULL;
> + }
Sorry, I meant to mention this last time...
The setting of handle_list to NULL seems unnecessary. It is sufficient to
test if it is empty.
The only interesting case is the last 'else' above. That is only reached if
worker_cnt_per_group != 0 so handle_list will get set to some list_head.
If all list_heads are empty, handle_list will be set to the last list_head
so a list_empty(handle_list) test is perfectly safe.
Also with the current code:
>
> pr_debug("%s: handle: %s hold: %s full_writes: %d bypass_count: %d\n",
> __func__,
> - list_empty(&conf->handle_list) ? "empty" : "busy",
> + list_empty(handle_list) ? "empty" : "busy",
^^^^^^^^^^^^^^^^^^^^^^^
This will crash.
Could you fix that up please?
Thanks.
NeilBrown
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 828 bytes --]
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [patch v3 5/5] raid5: only wakeup necessary threads
2013-08-27 9:50 ` [patch v3 5/5] raid5: only wakeup necessary threads Shaohua Li
@ 2013-08-28 4:13 ` NeilBrown
2013-08-28 6:31 ` Shaohua Li
2013-08-29 7:40 ` Shaohua Li
0 siblings, 2 replies; 20+ messages in thread
From: NeilBrown @ 2013-08-28 4:13 UTC (permalink / raw)
To: Shaohua Li; +Cc: linux-raid, tj, dan.j.williams
[-- Attachment #1: Type: text/plain, Size: 2201 bytes --]
On Tue, 27 Aug 2013 17:50:43 +0800 Shaohua Li <shli@kernel.org> wrote:
> @@ -229,8 +233,26 @@ static void raid5_wakeup_stripe_thread(s
>
> group = conf->worker_groups + cpu_to_group(sh->cpu);
>
> - for (i = 0; i < conf->worker_cnt_per_group; i++)
> - queue_work_on(sh->cpu, raid5_wq, &group->workers[i].work);
> + group->workers[0].working = true;
> + /* at least one worker should run to avoid race */
> + queue_work_on(sh->cpu, raid5_wq, &group->workers[0].work);
> +
> + thread_cnt = group->stripes_cnt / MAX_STRIPE_BATCH - 1;
> + /* wakeup more workers */
> + for (i = 1; i < conf->worker_cnt_per_group && thread_cnt > 0; i++) {
> + if (group->workers[i].working == false) {
> + group->workers[i].working = true;
> + queue_work_on(sh->cpu, raid5_wq,
> + &group->workers[i].work);
> + thread_cnt--;
> + } else if (group->workers[i].working_cnt <=
> + MAX_STRIPE_BATCH / 2)
> + /*
> + * If a worker has no enough stripes handling, assume
> + * it will fetch more stripes soon.
> + */
> + thread_cnt--;
> + }
> }
I don't really understand this "working_cnt <= MAX_STRIPE_BATCH / 2"
heuristic. It is at best a very coarse estimate of how long until the worker
will get some more stripes to work on.
I think I would simply not count any thread that is already working (except
the first, which is always counted whether it is working or not)
Do you see some particular gain from the counting?
> -#define MAX_STRIPE_BATCH 8
> -static int handle_active_stripes(struct r5conf *conf, int group)
> +static int handle_active_stripes(struct r5conf *conf, int group,
> + struct r5worker *worker)
> {
> struct stripe_head *batch[MAX_STRIPE_BATCH], *sh;
> int i, batch_size = 0;
> @@ -4921,6 +4955,9 @@ static int handle_active_stripes(struct
> (sh = __get_priority_stripe(conf, group)) != NULL)
> batch[batch_size++] = sh;
>
> + if (worker)
> + worker->working_cnt = batch_size;
> +
> if (batch_size == 0)
> return batch_size;
I think this could possibly return with ->working still 'true'.
I think it is safest to clear it on every exit from the function.
Thanks,
NeilBrown
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 828 bytes --]
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [patch v3 2/5] raid5: fix stripe release order
2013-08-28 3:41 ` NeilBrown
@ 2013-08-28 6:29 ` Shaohua Li
2013-08-28 6:37 ` NeilBrown
0 siblings, 1 reply; 20+ messages in thread
From: Shaohua Li @ 2013-08-28 6:29 UTC (permalink / raw)
To: NeilBrown; +Cc: linux-raid, tj, dan.j.williams
On Wed, Aug 28, 2013 at 01:41:50PM +1000, NeilBrown wrote:
>
> > +struct llist_node *llist_reverse_order(struct llist_node *head)
> > +{
> > + struct llist_node *second, *third;
> > +
> > + if (head == NULL || head->next == NULL)
> > + return head;
> > + second = head->next;
> > + head->next = NULL;
> > +
> > + do {
> > + third = second->next;
> > + second->next = head;
> > +
> > + head = second;
> > + second = third;
> > + } while (second);
> > +
> > + return head;
> > +}
> > +EXPORT_SYMBOL_GPL(llist_reverse_order);
>
> This is somewhat longer that necessary.
>
> struct llist_node *llist_reverse_order(struct llist_node *head)
> {
> struct llist_node *new_head = NULL;
>
> while (head) {
> struct llist_node *tmp = head;
> head = head->next;
> tmp->next = new_head;
> new_head = tmp;
> }
>
> return new_head;
> }
>
> I think that is short enough to just open-code in the top of
> release_stripe_list.
>
> Are you OK with that?
It's ok. The other patches still apply on top of this, just with some hunk
offsets.
Subject: raid5: fix stripe release order
patch "make release_stripe lockless" changes the order stripes are released.
Originally I thought block layer can take care of request merge, but it appears
there are still some requests not merged. It's easy to fix the order.
Signed-off-by: Shaohua Li <shli@fusionio.com>
---
drivers/md/raid5.c | 15 +++++++++++++++
1 file changed, 15 insertions(+)
Index: linux/drivers/md/raid5.c
===================================================================
--- linux.orig/drivers/md/raid5.c 2013-08-28 13:51:39.586914255 +0800
+++ linux/drivers/md/raid5.c 2013-08-28 13:53:25.429582150 +0800
@@ -239,6 +239,20 @@ static void __release_stripe(struct r5co
do_release_stripe(conf, sh);
}
+static struct llist_node *llist_reverse_order(struct llist_node *head)
+{
+ struct llist_node *new_head = NULL;
+
+ while (head) {
+ struct llist_node *tmp = head;
+ head = head->next;
+ tmp->next = new_head;
+ new_head = tmp;
+ }
+
+ return new_head;
+}
+
/* should hold conf->device_lock already */
static int release_stripe_list(struct r5conf *conf)
{
@@ -247,6 +261,7 @@ static int release_stripe_list(struct r5
struct llist_node *head;
head = llist_del_all(&conf->released_stripes);
+ head = llist_reverse_order(head);
while (head) {
sh = llist_entry(head, struct stripe_head, release_list);
head = llist_next(head);
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [patch v3 3/5] raid5: offload stripe handle to workqueue
2013-08-28 3:53 ` NeilBrown
@ 2013-08-28 6:30 ` Shaohua Li
2013-08-28 6:56 ` NeilBrown
0 siblings, 1 reply; 20+ messages in thread
From: Shaohua Li @ 2013-08-28 6:30 UTC (permalink / raw)
To: NeilBrown; +Cc: linux-raid, tj, dan.j.williams
On Wed, Aug 28, 2013 at 01:53:05PM +1000, NeilBrown wrote:
> On Tue, 27 Aug 2013 17:50:41 +0800 Shaohua Li <shli@kernel.org> wrote:
>
> > -static struct stripe_head *__get_priority_stripe(struct r5conf *conf)
> > +static struct stripe_head *__get_priority_stripe(struct r5conf *conf, int group)
> > {
> > - struct stripe_head *sh;
> > + struct stripe_head *sh = NULL, *tmp;
> > + struct list_head *handle_list = NULL;
> > +
> > + if (conf->worker_cnt_per_group == 0) {
> > + handle_list = &conf->handle_list;
> > + if (list_empty(handle_list))
> > + handle_list = NULL;
> > + } else if (group != ANY_GROUP) {
> > + handle_list = &conf->worker_groups[group].handle_list;
> > + if (list_empty(handle_list))
> > + handle_list = NULL;
> > + } else {
> > + int i;
> > + for (i = 0; i < conf->group_cnt; i++) {
> > + handle_list = &conf->worker_groups[i].handle_list;
> > + if (!list_empty(handle_list))
> > + break;
> > + }
> > + if (i == conf->group_cnt)
> > + handle_list = NULL;
> > + }
>
> Sorry, I meant to mention this last time...
>
> The setting of handle_list to NULL seems unnecessary. It is sufficient to
> test if it is empty.
> The only interesting case is the last 'else' above. That is only reached if
> worker_cnt_per_group != 0 so handle_list will get set to some list_head.
> If all list_heads are empty, handle_list will be set to the last list_head
> so a list_empty(handle_list) test is perfectly safe.
>
> Also with the current code:
> >
> > pr_debug("%s: handle: %s hold: %s full_writes: %d bypass_count: %d\n",
> > __func__,
> > - list_empty(&conf->handle_list) ? "empty" : "busy",
> > + list_empty(handle_list) ? "empty" : "busy",
> ^^^^^^^^^^^^^^^^^^^^^^^
> This will crash.
Fixed. The other patches still apply, just with some hunk offsets.
Subject: raid5: offload stripe handle to workqueue
This is another attempt to create multiple threads to handle raid5 stripes.
This time I use a workqueue.
raid5 handles requests (especially writes) in stripe units. A stripe is
page-size aligned and page-size long, and spans all disks. For a write to any
disk sector, raid5 runs a state machine for the corresponding stripe, which
includes reading some disks of the stripe, calculating parity, and writing some
disks of the stripe. The state machine currently runs in the raid5d thread.
Since there is only one thread, it doesn't scale well for high speed storage.
An obvious solution is multi-threading.
To get better performance, we have some requirements:
a. locality. A stripe corresponding to a request submitted from one cpu is
better handled by a thread on the local cpu or local node. The local cpu is
preferred, but it can sometimes become a bottleneck, for example when parity
calculation is too heavy; running on the local node adapts to a wider range of
workloads.
b. configurability. Different raid5 array setups might need different
configurations, especially the thread number. More threads don't always mean
better performance because of lock contention.
My original implementation created kernel threads directly. There were
interfaces to control which cpu's stripes each thread should handle, and
userspace could set the affinity of the threads. This provided the biggest
flexibility and configurability, but it was hard to use, and a new thread pool
implementation is apparently disfavored.
The recent workqueue improvements are quite promising. An unbound workqueue is
now bound to NUMA nodes, and if WQ_SYSFS is set there are sysfs options for
affinity settings; for example, we can include only one HT sibling in the
affinity mask. Work items are non-reentrant by default, and we can control the
number of running threads by limiting the number of dispatched work_structs.
In this patch, I create several stripe worker groups. A group corresponds to a
NUMA node. Stripes from cpus of one node are added to that group's list, and
workqueue threads of that node only handle stripes of the node's worker group.
In this way, stripe handling has NUMA node locality. And as I said, we can
control the thread number by limiting the number of dispatched work_structs.
The work_struct callback function handles several stripes in one run. Typical
workqueue usage is to run one unit of work in each work_struct; in the raid5
case, the unit would be a stripe. But we can't do that:
a. Though handling a stripe doesn't need a lock (because of reference
accounting, and the stripe isn't in any list), queuing a work_struct for each
stripe would make the workqueue lock very heavily contended.
b. blk_start_plug()/blk_finish_plug() should surround stripe handling, as we
might dispatch requests. If each work_struct only handled one stripe, such a
block plug would be meaningless.
This implementation can't do very fine grained configuration, but NUMA binding
is the most popular usage model and should be enough for most workloads.
Note: since we have only one stripe queue, switching to multi-threading might
decrease the request sizes dispatched down to the lower layers. The impact
depends on the thread number, raid configuration and workload, so
multi-threaded raid5 might not be appropriate for all setups.
Changes V1 -> V2:
1. remove WQ_NON_REENTRANT
2. disabling multi-threading by default
3. Add more descriptions in changelog
Signed-off-by: Shaohua Li <shli@fusionio.com>
---
drivers/md/raid5.c | 186 ++++++++++++++++++++++++++++++++++++++++++++++++-----
drivers/md/raid5.h | 15 ++++
2 files changed, 186 insertions(+), 15 deletions(-)
Index: linux/drivers/md/raid5.c
===================================================================
--- linux.orig/drivers/md/raid5.c 2013-08-28 13:54:17.260930690 +0800
+++ linux/drivers/md/raid5.c 2013-08-28 13:59:13.225210586 +0800
@@ -53,6 +53,7 @@
#include <linux/cpu.h>
#include <linux/slab.h>
#include <linux/ratelimit.h>
+#include <linux/nodemask.h>
#include <trace/events/block.h>
#include "md.h"
@@ -60,6 +61,10 @@
#include "raid0.h"
#include "bitmap.h"
+#define cpu_to_group(cpu) cpu_to_node(cpu)
+#define ANY_GROUP NUMA_NO_NODE
+
+static struct workqueue_struct *raid5_wq;
/*
* Stripe cache
*/
@@ -200,6 +205,34 @@ static int stripe_operations_active(stru
test_bit(STRIPE_COMPUTE_RUN, &sh->state);
}
+static void raid5_wakeup_stripe_thread(struct stripe_head *sh)
+{
+ struct r5conf *conf = sh->raid_conf;
+ struct r5worker_group *group;
+ int i, cpu = sh->cpu;
+
+ if (!cpu_online(cpu)) {
+ cpu = cpumask_any(cpu_online_mask);
+ sh->cpu = cpu;
+ }
+
+ if (list_empty(&sh->lru)) {
+ struct r5worker_group *group;
+ group = conf->worker_groups + cpu_to_group(cpu);
+ list_add_tail(&sh->lru, &group->handle_list);
+ }
+
+ if (conf->worker_cnt_per_group == 0) {
+ md_wakeup_thread(conf->mddev->thread);
+ return;
+ }
+
+ group = conf->worker_groups + cpu_to_group(sh->cpu);
+
+ for (i = 0; i < conf->worker_cnt_per_group; i++)
+ queue_work_on(sh->cpu, raid5_wq, &group->workers[i].work);
+}
+
static void do_release_stripe(struct r5conf *conf, struct stripe_head *sh)
{
BUG_ON(!list_empty(&sh->lru));
@@ -214,7 +247,12 @@ static void do_release_stripe(struct r5c
else {
clear_bit(STRIPE_DELAYED, &sh->state);
clear_bit(STRIPE_BIT_DELAY, &sh->state);
- list_add_tail(&sh->lru, &conf->handle_list);
+ if (conf->worker_cnt_per_group == 0) {
+ list_add_tail(&sh->lru, &conf->handle_list);
+ } else {
+ raid5_wakeup_stripe_thread(sh);
+ return;
+ }
}
md_wakeup_thread(conf->mddev->thread);
} else {
@@ -409,6 +447,7 @@ static void init_stripe(struct stripe_he
raid5_build_block(sh, i, previous);
}
insert_hash(conf, sh);
+ sh->cpu = smp_processor_id();
}
static struct stripe_head *__find_stripe(struct r5conf *conf, sector_t sector,
@@ -3830,6 +3869,7 @@ static void raid5_activate_delayed(struc
if (!test_and_set_bit(STRIPE_PREREAD_ACTIVE, &sh->state))
atomic_inc(&conf->preread_active_stripes);
list_add_tail(&sh->lru, &conf->hold_list);
+ raid5_wakeup_stripe_thread(sh);
}
}
}
@@ -4109,18 +4149,32 @@ static int chunk_aligned_read(struct mdd
* head of the hold_list has changed, i.e. the head was promoted to the
* handle_list.
*/
-static struct stripe_head *__get_priority_stripe(struct r5conf *conf)
+static struct stripe_head *__get_priority_stripe(struct r5conf *conf, int group)
{
- struct stripe_head *sh;
+ struct stripe_head *sh = NULL, *tmp;
+ struct list_head *handle_list = NULL;
+
+ if (conf->worker_cnt_per_group == 0) {
+ handle_list = &conf->handle_list;
+ } else if (group != ANY_GROUP) {
+ handle_list = &conf->worker_groups[group].handle_list;
+ } else {
+ int i;
+ for (i = 0; i < conf->group_cnt; i++) {
+ handle_list = &conf->worker_groups[i].handle_list;
+ if (!list_empty(handle_list))
+ break;
+ }
+ }
pr_debug("%s: handle: %s hold: %s full_writes: %d bypass_count: %d\n",
__func__,
- list_empty(&conf->handle_list) ? "empty" : "busy",
+ list_empty(handle_list) ? "empty" : "busy",
list_empty(&conf->hold_list) ? "empty" : "busy",
atomic_read(&conf->pending_full_writes), conf->bypass_count);
- if (!list_empty(&conf->handle_list)) {
- sh = list_entry(conf->handle_list.next, typeof(*sh), lru);
+ if (!list_empty(handle_list)) {
+ sh = list_entry(handle_list->next, typeof(*sh), lru);
if (list_empty(&conf->hold_list))
conf->bypass_count = 0;
@@ -4138,12 +4192,25 @@ static struct stripe_head *__get_priorit
((conf->bypass_threshold &&
conf->bypass_count > conf->bypass_threshold) ||
atomic_read(&conf->pending_full_writes) == 0)) {
- sh = list_entry(conf->hold_list.next,
- typeof(*sh), lru);
- conf->bypass_count -= conf->bypass_threshold;
- if (conf->bypass_count < 0)
- conf->bypass_count = 0;
- } else
+
+ list_for_each_entry(tmp, &conf->hold_list, lru) {
+ if (conf->worker_cnt_per_group == 0 ||
+ group == ANY_GROUP ||
+ !cpu_online(tmp->cpu) ||
+ cpu_to_group(tmp->cpu) == group) {
+ sh = tmp;
+ break;
+ }
+ }
+
+ if (sh) {
+ conf->bypass_count -= conf->bypass_threshold;
+ if (conf->bypass_count < 0)
+ conf->bypass_count = 0;
+ }
+ }
+
+ if (!sh)
return NULL;
list_del_init(&sh->lru);
@@ -4853,13 +4920,13 @@ static int retry_aligned_read(struct r5
}
#define MAX_STRIPE_BATCH 8
-static int handle_active_stripes(struct r5conf *conf)
+static int handle_active_stripes(struct r5conf *conf, int group)
{
struct stripe_head *batch[MAX_STRIPE_BATCH], *sh;
int i, batch_size = 0;
while (batch_size < MAX_STRIPE_BATCH &&
- (sh = __get_priority_stripe(conf)) != NULL)
+ (sh = __get_priority_stripe(conf, group)) != NULL)
batch[batch_size++] = sh;
if (batch_size == 0)
@@ -4877,6 +4944,38 @@ static int handle_active_stripes(struct
return batch_size;
}
+static void raid5_do_work(struct work_struct *work)
+{
+ struct r5worker *worker = container_of(work, struct r5worker, work);
+ struct r5worker_group *group = worker->group;
+ struct r5conf *conf = group->conf;
+ int group_id = group - conf->worker_groups;
+ int handled;
+ struct blk_plug plug;
+
+ pr_debug("+++ raid5worker active\n");
+
+ blk_start_plug(&plug);
+ handled = 0;
+ spin_lock_irq(&conf->device_lock);
+ while (1) {
+ int batch_size, released;
+
+ released = release_stripe_list(conf);
+
+ batch_size = handle_active_stripes(conf, group_id);
+ if (!batch_size && !released)
+ break;
+ handled += batch_size;
+ }
+ pr_debug("%d stripes handled\n", handled);
+
+ spin_unlock_irq(&conf->device_lock);
+ blk_finish_plug(&plug);
+
+ pr_debug("--- raid5worker inactive\n");
+}
+
/*
* This is our raid5 kernel thread.
*
@@ -4926,7 +5025,7 @@ static void raid5d(struct md_thread *thr
handled++;
}
- batch_size = handle_active_stripes(conf);
+ batch_size = handle_active_stripes(conf, ANY_GROUP);
if (!batch_size && !released)
break;
handled += batch_size;
@@ -5066,6 +5165,54 @@ static struct attribute_group raid5_attr
.attrs = raid5_attrs,
};
+static int alloc_thread_groups(struct r5conf *conf, int cnt)
+{
+ int i, j;
+ ssize_t size;
+ struct r5worker *workers;
+
+ conf->worker_cnt_per_group = cnt;
+ if (cnt == 0) {
+ conf->worker_groups = NULL;
+ return 0;
+ }
+ conf->group_cnt = num_possible_nodes();
+ size = sizeof(struct r5worker) * cnt;
+ workers = kzalloc(size * conf->group_cnt, GFP_NOIO);
+ conf->worker_groups = kzalloc(sizeof(struct r5worker_group) *
+ conf->group_cnt, GFP_NOIO);
+ if (!conf->worker_groups || !workers) {
+ kfree(workers);
+ kfree(conf->worker_groups);
+ conf->worker_groups = NULL;
+ return -ENOMEM;
+ }
+
+ for (i = 0; i < conf->group_cnt; i++) {
+ struct r5worker_group *group;
+
+ group = &conf->worker_groups[i];
+ INIT_LIST_HEAD(&group->handle_list);
+ group->conf = conf;
+ group->workers = workers + i * cnt;
+
+ for (j = 0; j < cnt; j++) {
+ group->workers[j].group = group;
+ INIT_WORK(&group->workers[j].work, raid5_do_work);
+ }
+ }
+
+ return 0;
+}
+
+static void free_thread_groups(struct r5conf *conf)
+{
+ if (conf->worker_groups)
+ kfree(conf->worker_groups[0].workers);
+ kfree(conf->worker_groups);
+ conf->worker_groups = NULL;
+}
+
static sector_t
raid5_size(struct mddev *mddev, sector_t sectors, int raid_disks)
{
@@ -5106,6 +5253,7 @@ static void raid5_free_percpu(struct r5c
static void free_conf(struct r5conf *conf)
{
+ free_thread_groups(conf);
shrink_stripes(conf);
raid5_free_percpu(conf);
kfree(conf->disks);
@@ -5234,6 +5382,9 @@ static struct r5conf *setup_conf(struct
conf = kzalloc(sizeof(struct r5conf), GFP_KERNEL);
if (conf == NULL)
goto abort;
+ /* Don't enable multi-threading by default */
+ if (alloc_thread_groups(conf, 0))
+ goto abort;
spin_lock_init(&conf->device_lock);
seqcount_init(&conf->gen_lock);
init_waitqueue_head(&conf->wait_for_stripe);
@@ -6549,6 +6700,10 @@ static struct md_personality raid4_perso
static int __init raid5_init(void)
{
+ raid5_wq = alloc_workqueue("raid5wq",
+ WQ_UNBOUND|WQ_MEM_RECLAIM|WQ_CPU_INTENSIVE|WQ_SYSFS, 0);
+ if (!raid5_wq)
+ return -ENOMEM;
register_md_personality(&raid6_personality);
register_md_personality(&raid5_personality);
register_md_personality(&raid4_personality);
@@ -6560,6 +6715,7 @@ static void raid5_exit(void)
unregister_md_personality(&raid6_personality);
unregister_md_personality(&raid5_personality);
unregister_md_personality(&raid4_personality);
+ destroy_workqueue(raid5_wq);
}
module_init(raid5_init);
Index: linux/drivers/md/raid5.h
===================================================================
--- linux.orig/drivers/md/raid5.h 2013-08-28 13:54:17.260930690 +0800
+++ linux/drivers/md/raid5.h 2013-08-28 13:54:17.260930690 +0800
@@ -212,6 +212,7 @@ struct stripe_head {
enum check_states check_state;
enum reconstruct_states reconstruct_state;
spinlock_t stripe_lock;
+ int cpu;
/**
* struct stripe_operations
* @target - STRIPE_OP_COMPUTE_BLK target
@@ -365,6 +366,17 @@ struct disk_info {
struct md_rdev *rdev, *replacement;
};
+struct r5worker {
+ struct work_struct work;
+ struct r5worker_group *group;
+};
+
+struct r5worker_group {
+ struct list_head handle_list;
+ struct r5conf *conf;
+ struct r5worker *workers;
+};
+
struct r5conf {
struct hlist_head *stripe_hashtbl;
struct mddev *mddev;
@@ -462,6 +474,9 @@ struct r5conf {
* the new thread here until we fully activate the array.
*/
struct md_thread *thread;
+ struct r5worker_group *worker_groups;
+ int group_cnt;
+ int worker_cnt_per_group;
};
/*
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [patch v3 5/5] raid5: only wakeup necessary threads
2013-08-28 4:13 ` NeilBrown
@ 2013-08-28 6:31 ` Shaohua Li
2013-08-28 6:59 ` NeilBrown
2013-08-29 7:40 ` Shaohua Li
1 sibling, 1 reply; 20+ messages in thread
From: Shaohua Li @ 2013-08-28 6:31 UTC (permalink / raw)
To: NeilBrown; +Cc: linux-raid, tj, dan.j.williams
On Wed, Aug 28, 2013 at 02:13:04PM +1000, NeilBrown wrote:
> On Tue, 27 Aug 2013 17:50:43 +0800 Shaohua Li <shli@kernel.org> wrote:
>
>
> > @@ -229,8 +233,26 @@ static void raid5_wakeup_stripe_thread(s
> >
> > group = conf->worker_groups + cpu_to_group(sh->cpu);
> >
> > - for (i = 0; i < conf->worker_cnt_per_group; i++)
> > - queue_work_on(sh->cpu, raid5_wq, &group->workers[i].work);
> > + group->workers[0].working = true;
> > + /* at least one worker should run to avoid race */
> > + queue_work_on(sh->cpu, raid5_wq, &group->workers[0].work);
> > +
> > + thread_cnt = group->stripes_cnt / MAX_STRIPE_BATCH - 1;
> > + /* wakeup more workers */
> > + for (i = 1; i < conf->worker_cnt_per_group && thread_cnt > 0; i++) {
> > + if (group->workers[i].working == false) {
> > + group->workers[i].working = true;
> > + queue_work_on(sh->cpu, raid5_wq,
> > + &group->workers[i].work);
> > + thread_cnt--;
> > + } else if (group->workers[i].working_cnt <=
> > + MAX_STRIPE_BATCH / 2)
> > + /*
> > > + * If a worker has not enough stripes to handle, assume
> > + * it will fetch more stripes soon.
> > + */
> > + thread_cnt--;
> > + }
> > }
>
> I don't really understand this "working_cnt <= MAX_STRIPE_BATCH / 2"
> heuristic. It is at best a very coarse estimate of how long until the worker
> will get some more stripes to work on.
> I think I would simply not count any thread that is already working (except
> the first, which is always counted whether it is working or not)
> Do you see some particular gain from the counting?
>
>
>
> > -#define MAX_STRIPE_BATCH 8
> > -static int handle_active_stripes(struct r5conf *conf, int group)
> > +static int handle_active_stripes(struct r5conf *conf, int group,
> > + struct r5worker *worker)
> > {
> > struct stripe_head *batch[MAX_STRIPE_BATCH], *sh;
> > int i, batch_size = 0;
> > @@ -4921,6 +4955,9 @@ static int handle_active_stripes(struct
> > (sh = __get_priority_stripe(conf, group)) != NULL)
> > batch[batch_size++] = sh;
> >
> > + if (worker)
> > + worker->working_cnt = batch_size;
> > +
> > if (batch_size == 0)
> > return batch_size;
>
> I think this could possibly return with ->working still 'true'.
> I think it is safest to clear it on every exit from the function
I need to do more tests on this one. Could you please apply the other patches to your
tree? Then I can rebase this patch against it when I'm done.
Thanks,
Shaohua
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [patch v3 2/5] raid5: fix stripe release order
2013-08-28 6:29 ` Shaohua Li
@ 2013-08-28 6:37 ` NeilBrown
0 siblings, 0 replies; 20+ messages in thread
From: NeilBrown @ 2013-08-28 6:37 UTC (permalink / raw)
To: Shaohua Li; +Cc: linux-raid, tj, dan.j.williams
[-- Attachment #1: Type: text/plain, Size: 2973 bytes --]
On Wed, 28 Aug 2013 14:29:05 +0800 Shaohua Li <shli@kernel.org> wrote:
> On Wed, Aug 28, 2013 at 01:41:50PM +1000, NeilBrown wrote:
> >
> > > +struct llist_node *llist_reverse_order(struct llist_node *head)
> > > +{
> > > + struct llist_node *second, *third;
> > > +
> > > + if (head == NULL || head->next == NULL)
> > > + return head;
> > > + second = head->next;
> > > + head->next = NULL;
> > > +
> > > + do {
> > > + third = second->next;
> > > + second->next = head;
> > > +
> > > + head = second;
> > > + second = third;
> > > + } while (second);
> > > +
> > > + return head;
> > > +}
> > > +EXPORT_SYMBOL_GPL(llist_reverse_order);
> >
> > This is somewhat longer that necessary.
> >
> > struct llist_node *llist_reverse_order(struct llist_node *head)
> > {
> > struct llist_node *new_head = NULL;
> >
> > while (head) {
> > struct llist_node *tmp = head;
> > head = head->next;
> > tmp->next = new_head;
> > new_head = tmp;
> > }
> >
> > return new_head;
> > }
> >
> > I think that is short enough to just open-code in the top of
> > release_stripe_list.
> >
> > Are you OK with that?
>
> It's ok. The other patches still apply, just with some hunk offsets.
>
> Subject: raid5: fix stripe release order
>
> patch "make release_stripe lockless" changes the order stripes are released.
> Originally I thought block layer can take care of request merge, but it appears
> there are still some requests not merged. It's easy to fix the order.
>
> Signed-off-by: Shaohua Li <shli@fusionio.com>
> ---
> drivers/md/raid5.c | 15 +++++++++++++++
> 1 file changed, 15 insertions(+)
>
> Index: linux/drivers/md/raid5.c
> ===================================================================
> --- linux.orig/drivers/md/raid5.c 2013-08-28 13:51:39.586914255 +0800
> +++ linux/drivers/md/raid5.c 2013-08-28 13:53:25.429582150 +0800
> @@ -239,6 +239,20 @@ static void __release_stripe(struct r5co
> do_release_stripe(conf, sh);
> }
>
> +static struct llist_node *llist_reverse_order(struct llist_node *head)
> +{
> + struct llist_node *new_head = NULL;
> +
> + while (head) {
> + struct llist_node *tmp = head;
> + head = head->next;
> + tmp->next = new_head;
> + new_head = tmp;
> + }
> +
> + return new_head;
> +}
> +
> /* should hold conf->device_lock already */
> static int release_stripe_list(struct r5conf *conf)
> {
> @@ -247,6 +261,7 @@ static int release_stripe_list(struct r5
> struct llist_node *head;
>
> head = llist_del_all(&conf->released_stripes);
> + head = llist_reverse_order(head);
> while (head) {
> sh = llist_entry(head, struct stripe_head, release_list);
> head = llist_next(head);
> --
Applied, thanks.
NeilBrown
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 828 bytes --]
^ permalink raw reply [flat|nested] 20+ messages in thread
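For context on the reordering this patch undoes: llist_add() always pushes at the head of the lockless list, so llist_del_all() hands raid5d the stripes in reverse submission order, and llist_reverse_order() simply restores it. Below is a minimal, self-contained userspace analogue of that behaviour; the struct node type and the push()/reverse() names are invented for illustration and are not the kernel code.

#include <stdio.h>

/* Hypothetical stand-in for struct llist_node: a singly linked node. */
struct node {
	int id;
	struct node *next;
};

/* Push at the head, like llist_add(): O(1), lock-free in the kernel. */
static void push(struct node **head, struct node *n)
{
	n->next = *head;
	*head = n;
}

/* Reverse the detached list, like llist_reverse_order(). */
static struct node *reverse(struct node *head)
{
	struct node *new_head = NULL;

	while (head) {
		struct node *tmp = head;
		head = head->next;
		tmp->next = new_head;
		new_head = tmp;
	}
	return new_head;
}

int main(void)
{
	struct node nodes[4] = { {0}, {1}, {2}, {3} };
	struct node *head = NULL, *n;
	int i;

	for (i = 0; i < 4; i++)		/* "submit" stripes 0,1,2,3 */
		push(&head, &nodes[i]);

	/* Without reversing, the consumer sees 3 2 1 0. */
	for (n = head; n; n = n->next)
		printf("%d ", n->id);
	printf("\n");

	/* After reversing, the submission order 0 1 2 3 is back. */
	head = reverse(head);
	for (n = head; n; n = n->next)
		printf("%d ", n->id);
	printf("\n");
	return 0;
}

Running it prints 3 2 1 0 followed by 0 1 2 3, i.e. the consumer only sees submission order again after the reverse step.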
* Re: [patch v3 3/5] raid5: offload stripe handle to workqueue
2013-08-28 6:30 ` Shaohua Li
@ 2013-08-28 6:56 ` NeilBrown
0 siblings, 0 replies; 20+ messages in thread
From: NeilBrown @ 2013-08-28 6:56 UTC (permalink / raw)
To: Shaohua Li; +Cc: linux-raid, tj, dan.j.williams
[-- Attachment #1: Type: text/plain, Size: 16854 bytes --]
On Wed, 28 Aug 2013 14:30:16 +0800 Shaohua Li <shli@kernel.org> wrote:
> On Wed, Aug 28, 2013 at 01:53:05PM +1000, NeilBrown wrote:
> > On Tue, 27 Aug 2013 17:50:41 +0800 Shaohua Li <shli@kernel.org> wrote:
> >
> > > -static struct stripe_head *__get_priority_stripe(struct r5conf *conf)
> > > +static struct stripe_head *__get_priority_stripe(struct r5conf *conf, int group)
> > > {
> > > - struct stripe_head *sh;
> > > + struct stripe_head *sh = NULL, *tmp;
> > > + struct list_head *handle_list = NULL;
> > > +
> > > + if (conf->worker_cnt_per_group == 0) {
> > > + handle_list = &conf->handle_list;
> > > + if (list_empty(handle_list))
> > > + handle_list = NULL;
> > > + } else if (group != ANY_GROUP) {
> > > + handle_list = &conf->worker_groups[group].handle_list;
> > > + if (list_empty(handle_list))
> > > + handle_list = NULL;
> > > + } else {
> > > + int i;
> > > + for (i = 0; i < conf->group_cnt; i++) {
> > > + handle_list = &conf->worker_groups[i].handle_list;
> > > + if (!list_empty(handle_list))
> > > + break;
> > > + }
> > > + if (i == conf->group_cnt)
> > > + handle_list = NULL;
> > > + }
> >
> > Sorry, I meant to mention this last time...
> >
> > The setting of handle_list to NULL seems unnecessary. It is sufficient to
> > test if it is empty.
> > The only interesting case is the last 'else' above. That is only reached if
> > worker_cnt_per_group != 0 so handle_list will get set to some list_head.
> > If all list_heads are empty, handle_list will be set to the last list_head
> > so a list_empty(handle_list) test is perfectly safe.
> >
> > Also with the current code:
> > >
> > > pr_debug("%s: handle: %s hold: %s full_writes: %d bypass_count: %d\n",
> > > __func__,
> > > - list_empty(&conf->handle_list) ? "empty" : "busy",
> > > + list_empty(handle_list) ? "empty" : "busy",
> > ^^^^^^^^^^^^^^^^^^^^^^^
> > This will crash.
>
> Fixed. The other patches still apply, just with some hunk offsets.
>
>
> Subject: raid5: offload stripe handle to workqueue
>
> This is another attempt to create multiple threads to handle raid5 stripes.
> This time I use workqueue.
>
> raid5 handles requests (especially writes) in stripe units. A stripe is one page in
> size, page aligned, and spans all disks. Writing to any disk sector, raid5 runs a
> state machine for the corresponding stripe, which includes reading some disks
> of the stripe, calculating parity, and writing some disks of the stripe. The
> state machine currently runs in the single raid5d thread. Since there is only one
> thread, it doesn't scale well for high speed storage. An obvious solution is
> multi-threading.
>
> To get better performance, we have some requirements:
> a. locality. A stripe corresponding to a request submitted from one cpu is better
> handled by a thread on the local cpu or local node. The local cpu is preferred, but it
> can sometimes become a bottleneck, for example when parity calculation is too heavy;
> running on the local node has wider adaptability.
> b. configurability. Different raid5 array setups might need different
> configuration, especially the thread number. More threads don't always mean
> better performance because of lock contention.
>
> My original implementation created some kernel threads. There were
> interfaces to control which cpus' stripes each thread should handle, and
> userspace could set the affinity of the threads. This provides the biggest flexibility
> and configurability, but it's hard to use, and apparently a new thread pool
> implementation is disfavored.
>
> Recent workqueue improvements are quite promising. An unbound workqueue will be
> bound to a numa node. If WQ_SYSFS is set on the workqueue, there are sysfs options
> for affinity setting; for example, we can include only one HT sibling in the
> affinity mask. Work is non-reentrant by default, and we can control the running
> thread number by limiting the number of dispatched work_structs.
>
> In this patch, I created several stripe worker groups. A group corresponds to a numa
> node: stripes from the cpus of one node will be added to that group's list, and the
> workqueue threads of one node will only handle stripes of that node's worker group. In
> this way, stripe handling has numa node locality. And as I said, we can control the
> thread number by limiting the number of dispatched work_structs.
>
> The work_struct callback function handles several stripes in one run. Typical
> workqueue usage is to run one unit per work_struct; in the raid5 case, the
> unit would be a stripe. But we can't do that:
> a. Though handling a stripe doesn't need a lock, because of reference accounting
> and because the stripe isn't on any list, queuing a work_struct for each stripe would
> make the workqueue lock very heavily contended.
> b. blk_start_plug()/blk_finish_plug() should surround stripe handling, as we
> might dispatch requests. If each work_struct only handled one stripe, such a block
> plug would be meaningless.
>
> This implementation can't do very fine-grained configuration, but the numa
> binding is the most popular usage model and should be enough for most workloads.
>
> Note: since we have only one stripe queue, switching to multi-threading might
> decrease the request size dispatched down to the low level layer. The impact depends
> on the thread number, raid configuration and workload, so multi-threaded raid5 might
> not be appropriate for all setups.
>
> Changes V1 -> V2:
> 1. remove WQ_NON_REENTRANT
> 2. disabling multi-threading by default
> 3. Add more descriptions in changelog
>
> Signed-off-by: Shaohua Li <shli@fusionio.com>
> ---
> drivers/md/raid5.c | 186 ++++++++++++++++++++++++++++++++++++++++++++++++-----
> drivers/md/raid5.h | 15 ++++
> 2 files changed, 186 insertions(+), 15 deletions(-)
>
> Index: linux/drivers/md/raid5.c
> ===================================================================
> --- linux.orig/drivers/md/raid5.c 2013-08-28 13:54:17.260930690 +0800
> +++ linux/drivers/md/raid5.c 2013-08-28 13:59:13.225210586 +0800
> @@ -53,6 +53,7 @@
> #include <linux/cpu.h>
> #include <linux/slab.h>
> #include <linux/ratelimit.h>
> +#include <linux/nodemask.h>
> #include <trace/events/block.h>
>
> #include "md.h"
> @@ -60,6 +61,10 @@
> #include "raid0.h"
> #include "bitmap.h"
>
> +#define cpu_to_group(cpu) cpu_to_node(cpu)
> +#define ANY_GROUP NUMA_NO_NODE
> +
> +static struct workqueue_struct *raid5_wq;
> /*
> * Stripe cache
> */
> @@ -200,6 +205,34 @@ static int stripe_operations_active(stru
> test_bit(STRIPE_COMPUTE_RUN, &sh->state);
> }
>
> +static void raid5_wakeup_stripe_thread(struct stripe_head *sh)
> +{
> + struct r5conf *conf = sh->raid_conf;
> + struct r5worker_group *group;
> + int i, cpu = sh->cpu;
> +
> + if (!cpu_online(cpu)) {
> + cpu = cpumask_any(cpu_online_mask);
> + sh->cpu = cpu;
> + }
> +
> + if (list_empty(&sh->lru)) {
> + struct r5worker_group *group;
> + group = conf->worker_groups + cpu_to_group(cpu);
> + list_add_tail(&sh->lru, &group->handle_list);
> + }
> +
> + if (conf->worker_cnt_per_group == 0) {
> + md_wakeup_thread(conf->mddev->thread);
> + return;
> + }
> +
> + group = conf->worker_groups + cpu_to_group(sh->cpu);
> +
> + for (i = 0; i < conf->worker_cnt_per_group; i++)
> + queue_work_on(sh->cpu, raid5_wq, &group->workers[i].work);
> +}
> +
> static void do_release_stripe(struct r5conf *conf, struct stripe_head *sh)
> {
> BUG_ON(!list_empty(&sh->lru));
> @@ -214,7 +247,12 @@ static void do_release_stripe(struct r5c
> else {
> clear_bit(STRIPE_DELAYED, &sh->state);
> clear_bit(STRIPE_BIT_DELAY, &sh->state);
> - list_add_tail(&sh->lru, &conf->handle_list);
> + if (conf->worker_cnt_per_group == 0) {
> + list_add_tail(&sh->lru, &conf->handle_list);
> + } else {
> + raid5_wakeup_stripe_thread(sh);
> + return;
> + }
> }
> md_wakeup_thread(conf->mddev->thread);
> } else {
> @@ -409,6 +447,7 @@ static void init_stripe(struct stripe_he
> raid5_build_block(sh, i, previous);
> }
> insert_hash(conf, sh);
> + sh->cpu = smp_processor_id();
> }
>
> static struct stripe_head *__find_stripe(struct r5conf *conf, sector_t sector,
> @@ -3830,6 +3869,7 @@ static void raid5_activate_delayed(struc
> if (!test_and_set_bit(STRIPE_PREREAD_ACTIVE, &sh->state))
> atomic_inc(&conf->preread_active_stripes);
> list_add_tail(&sh->lru, &conf->hold_list);
> + raid5_wakeup_stripe_thread(sh);
> }
> }
> }
> @@ -4109,18 +4149,32 @@ static int chunk_aligned_read(struct mdd
> * head of the hold_list has changed, i.e. the head was promoted to the
> * handle_list.
> */
> -static struct stripe_head *__get_priority_stripe(struct r5conf *conf)
> +static struct stripe_head *__get_priority_stripe(struct r5conf *conf, int group)
> {
> - struct stripe_head *sh;
> + struct stripe_head *sh = NULL, *tmp;
> + struct list_head *handle_list = NULL;
> +
> + if (conf->worker_cnt_per_group == 0) {
> + handle_list = &conf->handle_list;
> + } else if (group != ANY_GROUP) {
> + handle_list = &conf->worker_groups[group].handle_list;
> + } else {
> + int i;
> + for (i = 0; i < conf->group_cnt; i++) {
> + handle_list = &conf->worker_groups[i].handle_list;
> + if (!list_empty(handle_list))
> + break;
> + }
> + }
>
> pr_debug("%s: handle: %s hold: %s full_writes: %d bypass_count: %d\n",
> __func__,
> - list_empty(&conf->handle_list) ? "empty" : "busy",
> + list_empty(handle_list) ? "empty" : "busy",
> list_empty(&conf->hold_list) ? "empty" : "busy",
> atomic_read(&conf->pending_full_writes), conf->bypass_count);
>
> - if (!list_empty(&conf->handle_list)) {
> - sh = list_entry(conf->handle_list.next, typeof(*sh), lru);
> + if (!list_empty(handle_list)) {
> + sh = list_entry(handle_list->next, typeof(*sh), lru);
>
> if (list_empty(&conf->hold_list))
> conf->bypass_count = 0;
> @@ -4138,12 +4192,25 @@ static struct stripe_head *__get_priorit
> ((conf->bypass_threshold &&
> conf->bypass_count > conf->bypass_threshold) ||
> atomic_read(&conf->pending_full_writes) == 0)) {
> - sh = list_entry(conf->hold_list.next,
> - typeof(*sh), lru);
> - conf->bypass_count -= conf->bypass_threshold;
> - if (conf->bypass_count < 0)
> - conf->bypass_count = 0;
> - } else
> +
> + list_for_each_entry(tmp, &conf->hold_list, lru) {
> + if (conf->worker_cnt_per_group == 0 ||
> + group == ANY_GROUP ||
> + !cpu_online(tmp->cpu) ||
> + cpu_to_group(tmp->cpu) == group) {
> + sh = tmp;
> + break;
> + }
> + }
> +
> + if (sh) {
> + conf->bypass_count -= conf->bypass_threshold;
> + if (conf->bypass_count < 0)
> + conf->bypass_count = 0;
> + }
> + }
> +
> + if (!sh)
> return NULL;
>
> list_del_init(&sh->lru);
> @@ -4853,13 +4920,13 @@ static int retry_aligned_read(struct r5
> }
>
> #define MAX_STRIPE_BATCH 8
> -static int handle_active_stripes(struct r5conf *conf)
> +static int handle_active_stripes(struct r5conf *conf, int group)
> {
> struct stripe_head *batch[MAX_STRIPE_BATCH], *sh;
> int i, batch_size = 0;
>
> while (batch_size < MAX_STRIPE_BATCH &&
> - (sh = __get_priority_stripe(conf)) != NULL)
> + (sh = __get_priority_stripe(conf, group)) != NULL)
> batch[batch_size++] = sh;
>
> if (batch_size == 0)
> @@ -4877,6 +4944,38 @@ static int handle_active_stripes(struct
> return batch_size;
> }
>
> +static void raid5_do_work(struct work_struct *work)
> +{
> + struct r5worker *worker = container_of(work, struct r5worker, work);
> + struct r5worker_group *group = worker->group;
> + struct r5conf *conf = group->conf;
> + int group_id = group - conf->worker_groups;
> + int handled;
> + struct blk_plug plug;
> +
> + pr_debug("+++ raid5worker active\n");
> +
> + blk_start_plug(&plug);
> + handled = 0;
> + spin_lock_irq(&conf->device_lock);
> + while (1) {
> + int batch_size, released;
> +
> + released = release_stripe_list(conf);
> +
> + batch_size = handle_active_stripes(conf, group_id);
> + if (!batch_size && !released)
> + break;
> + handled += batch_size;
> + }
> + pr_debug("%d stripes handled\n", handled);
> +
> + spin_unlock_irq(&conf->device_lock);
> + blk_finish_plug(&plug);
> +
> + pr_debug("--- raid5worker inactive\n");
> +}
> +
> /*
> * This is our raid5 kernel thread.
> *
> @@ -4926,7 +5025,7 @@ static void raid5d(struct md_thread *thr
> handled++;
> }
>
> - batch_size = handle_active_stripes(conf);
> + batch_size = handle_active_stripes(conf, ANY_GROUP);
> if (!batch_size && !released)
> break;
> handled += batch_size;
> @@ -5066,6 +5165,54 @@ static struct attribute_group raid5_attr
> .attrs = raid5_attrs,
> };
>
> +static int alloc_thread_groups(struct r5conf *conf, int cnt)
> +{
> + int i, j;
> + ssize_t size;
> + struct r5worker *workers;
> +
> + conf->worker_cnt_per_group = cnt;
> + if (cnt == 0) {
> + conf->worker_groups = NULL;
> + return 0;
> + }
> + conf->group_cnt = num_possible_nodes();
> + size = sizeof(struct r5worker) * cnt;
> + workers = kzalloc(size * conf->group_cnt, GFP_NOIO);
> + conf->worker_groups = kzalloc(sizeof(struct r5worker_group) *
> + conf->group_cnt, GFP_NOIO);
> + if (!conf->worker_groups || !workers) {
> + kfree(workers);
> + kfree(conf->worker_groups);
> + conf->worker_groups = NULL;
> + return -ENOMEM;
> + }
> +
> + for (i = 0; i < conf->group_cnt; i++) {
> + struct r5worker_group *group;
> +
> + group = &conf->worker_groups[i];
> + INIT_LIST_HEAD(&group->handle_list);
> + group->conf = conf;
> + group->workers = workers + i * cnt;
> +
> + for (j = 0; j < cnt; j++) {
> + group->workers[j].group = group;
> + INIT_WORK(&group->workers[j].work, raid5_do_work);
> + }
> + }
> +
> + return 0;
> +}
> +
> +static void free_thread_groups(struct r5conf *conf)
> +{
> + if (conf->worker_groups)
> + kfree(conf->worker_groups[0].workers);
> + kfree(conf->worker_groups);
> + conf->worker_groups = NULL;
> +}
> +
> static sector_t
> raid5_size(struct mddev *mddev, sector_t sectors, int raid_disks)
> {
> @@ -5106,6 +5253,7 @@ static void raid5_free_percpu(struct r5c
>
> static void free_conf(struct r5conf *conf)
> {
> + free_thread_groups(conf);
> shrink_stripes(conf);
> raid5_free_percpu(conf);
> kfree(conf->disks);
> @@ -5234,6 +5382,9 @@ static struct r5conf *setup_conf(struct
> conf = kzalloc(sizeof(struct r5conf), GFP_KERNEL);
> if (conf == NULL)
> goto abort;
> + /* Don't enable multi-threading by default */
> + if (alloc_thread_groups(conf, 0))
> + goto abort;
> spin_lock_init(&conf->device_lock);
> seqcount_init(&conf->gen_lock);
> init_waitqueue_head(&conf->wait_for_stripe);
> @@ -6549,6 +6700,10 @@ static struct md_personality raid4_perso
>
> static int __init raid5_init(void)
> {
> + raid5_wq = alloc_workqueue("raid5wq",
> + WQ_UNBOUND|WQ_MEM_RECLAIM|WQ_CPU_INTENSIVE|WQ_SYSFS, 0);
> + if (!raid5_wq)
> + return -ENOMEM;
> register_md_personality(&raid6_personality);
> register_md_personality(&raid5_personality);
> register_md_personality(&raid4_personality);
> @@ -6560,6 +6715,7 @@ static void raid5_exit(void)
> unregister_md_personality(&raid6_personality);
> unregister_md_personality(&raid5_personality);
> unregister_md_personality(&raid4_personality);
> + destroy_workqueue(raid5_wq);
> }
>
> module_init(raid5_init);
> Index: linux/drivers/md/raid5.h
> ===================================================================
> --- linux.orig/drivers/md/raid5.h 2013-08-28 13:54:17.260930690 +0800
> +++ linux/drivers/md/raid5.h 2013-08-28 13:54:17.260930690 +0800
> @@ -212,6 +212,7 @@ struct stripe_head {
> enum check_states check_state;
> enum reconstruct_states reconstruct_state;
> spinlock_t stripe_lock;
> + int cpu;
> /**
> * struct stripe_operations
> * @target - STRIPE_OP_COMPUTE_BLK target
> @@ -365,6 +366,17 @@ struct disk_info {
> struct md_rdev *rdev, *replacement;
> };
>
> +struct r5worker {
> + struct work_struct work;
> + struct r5worker_group *group;
> +};
> +
> +struct r5worker_group {
> + struct list_head handle_list;
> + struct r5conf *conf;
> + struct r5worker *workers;
> +};
> +
> struct r5conf {
> struct hlist_head *stripe_hashtbl;
> struct mddev *mddev;
> @@ -462,6 +474,9 @@ struct r5conf {
> * the new thread here until we fully activate the array.
> */
> struct md_thread *thread;
> + struct r5worker_group *worker_groups;
> + int group_cnt;
> + int worker_cnt_per_group;
> };
>
> /*
Applied, thanks.
NeilBrown
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 828 bytes --]
^ permalink raw reply [flat|nested] 20+ messages in thread
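For background on the dispatch model in the patch above: an unbound workqueue lets the workqueue core keep per-NUMA-node worker pools, WQ_SYSFS exposes the queue's attributes (such as its cpumask) under sysfs, and queue_work_on() lets the caller pick the CPU, and therefore the node, that should service a work item. The snippet below is a stripped-down, hypothetical module sketch of only that pattern, not the raid5 code; demo_wq, demo_work and demo_fn are invented names, and raid5wq additionally sets WQ_MEM_RECLAIM and WQ_CPU_INTENSIVE.

#include <linux/module.h>
#include <linux/workqueue.h>
#include <linux/cpumask.h>
#include <linux/smp.h>

static struct workqueue_struct *demo_wq;
static struct work_struct demo_work;

static void demo_fn(struct work_struct *work)
{
	/* Runs on a worker thread of the node owning the chosen CPU. */
	pr_info("demo work ran on cpu %d\n", raw_smp_processor_id());
}

static int __init demo_init(void)
{
	int cpu = cpumask_first(cpu_online_mask);

	/* Unbound and sysfs-visible, so userspace can later tune the
	 * cpumask under /sys/devices/virtual/workqueue/demo_wq/. */
	demo_wq = alloc_workqueue("demo_wq", WQ_UNBOUND | WQ_SYSFS, 0);
	if (!demo_wq)
		return -ENOMEM;

	INIT_WORK(&demo_work, demo_fn);
	/* Ask for execution on (the node of) a specific CPU. */
	queue_work_on(cpu, demo_wq, &demo_work);
	return 0;
}

static void __exit demo_exit(void)
{
	flush_workqueue(demo_wq);
	destroy_workqueue(demo_wq);
}

module_init(demo_init);
module_exit(demo_exit);
MODULE_LICENSE("GPL");

Writing a mask to the workqueue's cpumask attribute in sysfs would then restrict where the unbound workers may run, which is the same knob the changelog refers to for raid5wq.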
* Re: [patch v3 5/5] raid5: only wakeup necessary threads
2013-08-28 6:31 ` Shaohua Li
@ 2013-08-28 6:59 ` NeilBrown
0 siblings, 0 replies; 20+ messages in thread
From: NeilBrown @ 2013-08-28 6:59 UTC (permalink / raw)
To: Shaohua Li; +Cc: linux-raid, tj, dan.j.williams
[-- Attachment #1: Type: text/plain, Size: 2761 bytes --]
On Wed, 28 Aug 2013 14:31:59 +0800 Shaohua Li <shli@kernel.org> wrote:
> On Wed, Aug 28, 2013 at 02:13:04PM +1000, NeilBrown wrote:
> > On Tue, 27 Aug 2013 17:50:43 +0800 Shaohua Li <shli@kernel.org> wrote:
> >
> >
> > > @@ -229,8 +233,26 @@ static void raid5_wakeup_stripe_thread(s
> > >
> > > group = conf->worker_groups + cpu_to_group(sh->cpu);
> > >
> > > - for (i = 0; i < conf->worker_cnt_per_group; i++)
> > > - queue_work_on(sh->cpu, raid5_wq, &group->workers[i].work);
> > > + group->workers[0].working = true;
> > > + /* at least one worker should run to avoid race */
> > > + queue_work_on(sh->cpu, raid5_wq, &group->workers[0].work);
> > > +
> > > + thread_cnt = group->stripes_cnt / MAX_STRIPE_BATCH - 1;
> > > + /* wakeup more workers */
> > > + for (i = 1; i < conf->worker_cnt_per_group && thread_cnt > 0; i++) {
> > > + if (group->workers[i].working == false) {
> > > + group->workers[i].working = true;
> > > + queue_work_on(sh->cpu, raid5_wq,
> > > + &group->workers[i].work);
> > > + thread_cnt--;
> > > + } else if (group->workers[i].working_cnt <=
> > > + MAX_STRIPE_BATCH / 2)
> > > + /*
> > > + * If a worker has no enough stripes handling, assume
> > > + * it will fetch more stripes soon.
> > > + */
> > > + thread_cnt--;
> > > + }
> > > }
> >
> > I don't really understand this "working_cnt <= MAX_STRIPE_BATCH / 2"
> > heuristic. It is at best a very coarse estimate of how long until the worker
> > will get some more stripes to work on.
> > I think I would simply not count any thread that is already working (except
> > the first, which is always counted whether it is working or not)
> > Do you see some particular gain from the counting?
> >
> >
> >
> > > -#define MAX_STRIPE_BATCH 8
> > > -static int handle_active_stripes(struct r5conf *conf, int group)
> > > +static int handle_active_stripes(struct r5conf *conf, int group,
> > > + struct r5worker *worker)
> > > {
> > > struct stripe_head *batch[MAX_STRIPE_BATCH], *sh;
> > > int i, batch_size = 0;
> > > @@ -4921,6 +4955,9 @@ static int handle_active_stripes(struct
> > > (sh = __get_priority_stripe(conf, group)) != NULL)
> > > batch[batch_size++] = sh;
> > >
> > > + if (worker)
> > > + worker->working_cnt = batch_size;
> > > +
> > > if (batch_size == 0)
> > > return batch_size;
> >
> > I think this could possibly return with ->working still 'true'.
> > I think it is safest to clear it on every exit from the function
>
> I need to do more tests on this one. Could you please apply the other patches to your
> tree? Then I can rebase this patch against it when I'm done.
Other patches are now applied and pushed out.
NeilBrown
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 828 bytes --]
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [patch v3 1/5] raid5: make release_stripe lockless
2013-08-27 9:50 ` [patch v3 1/5] raid5: make release_stripe lockless Shaohua Li
@ 2013-08-28 14:04 ` Tejun Heo
2013-08-28 14:29 ` Shaohua Li
0 siblings, 1 reply; 20+ messages in thread
From: Tejun Heo @ 2013-08-28 14:04 UTC (permalink / raw)
To: Shaohua Li; +Cc: linux-raid, neilb, dan.j.williams
Hello, Shaohua.
On Tue, Aug 27, 2013 at 05:50:39PM +0800, Shaohua Li wrote:
> release_stripe still has big lock contention. We just add the stripe to a llist
> without taking device_lock. We let the raid5d thread to do the real stripe
> release, which must hold device_lock anyway. In this way, release_stripe
> doesn't hold any locks.
>
> The side effect is the released stripes order is changed. But sounds not a big
> deal, stripes are never handled in order. And I thought block layer can already
> do nice request merge, which means order isn't that important.
I wrote this before, but the order of requests is important
information to the elevator, and the existing elevators will behave
less effectively if you make the relative order and timing of
processed IOs deviate from the original issuer's. The effect can
be very noticeable depending on the workload, and stacking drivers
should always strive to preserve as many of the IO characteristics as possible.
It doesn't make the changes unacceptable or anything, but the patch
description is quite misleading. It'd be nice if you could at least
note in the comment and description that the implemented behavior is far
from optimal.
Thanks.
--
tejun
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [patch v3 1/5] raid5: make release_stripe lockless
2013-08-28 14:04 ` Tejun Heo
@ 2013-08-28 14:29 ` Shaohua Li
2013-08-28 14:30 ` Tejun Heo
0 siblings, 1 reply; 20+ messages in thread
From: Shaohua Li @ 2013-08-28 14:29 UTC (permalink / raw)
To: Tejun Heo; +Cc: linux-raid, neilb, dan.j.williams
On Wed, Aug 28, 2013 at 10:04:22AM -0400, Tejun Heo wrote:
> Hello, Shaohua.
>
> On Tue, Aug 27, 2013 at 05:50:39PM +0800, Shaohua Li wrote:
> > release_stripe still has big lock contention. We just add the stripe to a llist
> > without taking device_lock. We let the raid5d thread to do the real stripe
> > release, which must hold device_lock anyway. In this way, release_stripe
> > doesn't hold any locks.
> >
> > The side effect is the released stripes order is changed. But sounds not a big
> > deal, stripes are never handled in order. And I thought block layer can already
> > do nice request merge, which means order isn't that important.
>
> I wrote this before, but the order of requests is important
> information to the elevator, and the existing elevators will behave
> less effectively if you make the relative order and timing of
> processed IOs deviate from the original issuer's. The effect can
> be very noticeable depending on the workload, and stacking drivers
> should always strive to preserve as many of the IO characteristics as possible.
>
> It doesn't make the changes unacceptable or anything, but the patch
> description is quite misleading. It'd be nice if you could at least
> note in the comment and description that the implemented behavior is far
> from optimal.
This order issue is fixed in the second patch.
It's true that making raid5 multi-threaded might change the order; I mentioned this in
the third patch (the direct impact is on request size).
Thanks,
Shaohua
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [patch v3 1/5] raid5: make release_stripe lockless
2013-08-28 14:29 ` Shaohua Li
@ 2013-08-28 14:30 ` Tejun Heo
0 siblings, 0 replies; 20+ messages in thread
From: Tejun Heo @ 2013-08-28 14:30 UTC (permalink / raw)
To: Shaohua Li; +Cc: linux-raid, neilb, dan.j.williams
Hello,
On Wed, Aug 28, 2013 at 10:29:08PM +0800, Shaohua Li wrote:
> This order issue is fixed in the second patch.
Oh, great.
> It's true that making raid5 multi-threaded might change the order; I mentioned this in
> the third patch (the direct impact is on request size).
Yeah, so, if at all possible, it's a good idea to preserve the
relative order, timing and plug boundaries of IOs. Those tell the
elevator a lot about the IOs.
Thanks.
--
tejun
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [patch v3 5/5] raid5: only wakeup necessary threads
2013-08-28 4:13 ` NeilBrown
2013-08-28 6:31 ` Shaohua Li
@ 2013-08-29 7:40 ` Shaohua Li
2013-09-02 0:45 ` NeilBrown
1 sibling, 1 reply; 20+ messages in thread
From: Shaohua Li @ 2013-08-29 7:40 UTC (permalink / raw)
To: NeilBrown; +Cc: linux-raid, tj, dan.j.williams
On Wed, Aug 28, 2013 at 02:13:04PM +1000, NeilBrown wrote:
> On Tue, 27 Aug 2013 17:50:43 +0800 Shaohua Li <shli@kernel.org> wrote:
>
>
> > @@ -229,8 +233,26 @@ static void raid5_wakeup_stripe_thread(s
> >
> > group = conf->worker_groups + cpu_to_group(sh->cpu);
> >
> > - for (i = 0; i < conf->worker_cnt_per_group; i++)
> > - queue_work_on(sh->cpu, raid5_wq, &group->workers[i].work);
> > + group->workers[0].working = true;
> > + /* at least one worker should run to avoid race */
> > + queue_work_on(sh->cpu, raid5_wq, &group->workers[0].work);
> > +
> > + thread_cnt = group->stripes_cnt / MAX_STRIPE_BATCH - 1;
> > + /* wakeup more workers */
> > + for (i = 1; i < conf->worker_cnt_per_group && thread_cnt > 0; i++) {
> > + if (group->workers[i].working == false) {
> > + group->workers[i].working = true;
> > + queue_work_on(sh->cpu, raid5_wq,
> > + &group->workers[i].work);
> > + thread_cnt--;
> > + } else if (group->workers[i].working_cnt <=
> > + MAX_STRIPE_BATCH / 2)
> > + /*
> > > + * If a worker has not enough stripes to handle, assume
> > + * it will fetch more stripes soon.
> > + */
> > + thread_cnt--;
> > + }
> > }
>
> I don't really understand this "working_cnt <= MAX_STRIPE_BATCH / 2"
> heuristic. It is at best a very coarse estimate of how long until the worker
> will get some more stripes to work on.
> I think I would simply not count any thread that is already working (except
> the first, which is always counted whether it is working or not)
> Do you see some particular gain from the counting?
Ok, it looks like there is no difference, so I removed it.
> > -#define MAX_STRIPE_BATCH 8
> > -static int handle_active_stripes(struct r5conf *conf, int group)
> > +static int handle_active_stripes(struct r5conf *conf, int group,
> > + struct r5worker *worker)
> > {
> > struct stripe_head *batch[MAX_STRIPE_BATCH], *sh;
> > int i, batch_size = 0;
> > @@ -4921,6 +4955,9 @@ static int handle_active_stripes(struct
> > (sh = __get_priority_stripe(conf, group)) != NULL)
> > batch[batch_size++] = sh;
> >
> > + if (worker)
> > + worker->working_cnt = batch_size;
> > +
> > if (batch_size == 0)
> > return batch_size;
>
> I think this could possibly return with ->working still 'true'.
> I think it is safest to clear it on every exit from the function
Ok, this one doesn't matter either; I fixed it.
Subject: raid5: only wakeup necessary threads
If there are not enough stripes to handle, we'd better not always queue all
available work_structs. If one worker can only handle a few stripes, or even none,
it will hurt request merging and create lock contention.
With this patch, the number of running work_structs will depend on the number of
pending stripes. Note: some statistics used in the patch are accessed without
locking protection. This shouldn't matter; we just try our best to avoid queuing
unnecessary work_structs.
Signed-off-by: Shaohua Li <shli@fusionio.com>
---
drivers/md/raid5.c | 44 ++++++++++++++++++++++++++++++++++++++------
drivers/md/raid5.h | 4 ++++
2 files changed, 42 insertions(+), 6 deletions(-)
Index: linux/drivers/md/raid5.c
===================================================================
--- linux.orig/drivers/md/raid5.c 2013-08-28 16:49:51.744491417 +0800
+++ linux/drivers/md/raid5.c 2013-08-29 15:35:57.420380732 +0800
@@ -77,6 +77,7 @@ static struct workqueue_struct *raid5_wq
#define BYPASS_THRESHOLD 1
#define NR_HASH (PAGE_SIZE / sizeof(struct hlist_head))
#define HASH_MASK (NR_HASH - 1)
+#define MAX_STRIPE_BATCH 8
static inline struct hlist_head *stripe_hash_list(struct r5conf *conf,
sector_t sect)
@@ -267,6 +268,7 @@ static void raid5_wakeup_stripe_thread(s
{
struct r5conf *conf = sh->raid_conf;
struct r5worker_group *group;
+ int thread_cnt;
int i, cpu = sh->cpu;
if (!cpu_online(cpu)) {
@@ -278,6 +280,8 @@ static void raid5_wakeup_stripe_thread(s
struct r5worker_group *group;
group = conf->worker_groups + cpu_to_group(cpu);
list_add_tail(&sh->lru, &group->handle_list);
+ group->stripes_cnt++;
+ sh->group = group;
}
if (conf->worker_cnt_per_group == 0) {
@@ -287,8 +291,20 @@ static void raid5_wakeup_stripe_thread(s
group = conf->worker_groups + cpu_to_group(sh->cpu);
- for (i = 0; i < conf->worker_cnt_per_group; i++)
- queue_work_on(sh->cpu, raid5_wq, &group->workers[i].work);
+ group->workers[0].working = true;
+ /* at least one worker should run to avoid race */
+ queue_work_on(sh->cpu, raid5_wq, &group->workers[0].work);
+
+ thread_cnt = group->stripes_cnt / MAX_STRIPE_BATCH - 1;
+ /* wakeup more workers */
+ for (i = 1; i < conf->worker_cnt_per_group && thread_cnt > 0; i++) {
+ if (group->workers[i].working == false) {
+ group->workers[i].working = true;
+ queue_work_on(sh->cpu, raid5_wq,
+ &group->workers[i].work);
+ thread_cnt--;
+ }
+ }
}
static void do_release_stripe(struct r5conf *conf, struct stripe_head *sh)
@@ -673,6 +689,10 @@ get_active_stripe(struct r5conf *conf, s
!test_bit(STRIPE_EXPANDING, &sh->state))
BUG();
list_del_init(&sh->lru);
+ if (sh->group) {
+ sh->group->stripes_cnt--;
+ sh->group = NULL;
+ }
}
}
} while (sh == NULL);
@@ -4259,15 +4279,18 @@ static struct stripe_head *__get_priorit
{
struct stripe_head *sh = NULL, *tmp;
struct list_head *handle_list = NULL;
+ struct r5worker_group *wg = NULL;
if (conf->worker_cnt_per_group == 0) {
handle_list = &conf->handle_list;
} else if (group != ANY_GROUP) {
handle_list = &conf->worker_groups[group].handle_list;
+ wg = &conf->worker_groups[group];
} else {
int i;
for (i = 0; i < conf->group_cnt; i++) {
handle_list = &conf->worker_groups[i].handle_list;
+ wg = &conf->worker_groups[i];
if (!list_empty(handle_list))
break;
}
@@ -4314,11 +4337,16 @@ static struct stripe_head *__get_priorit
if (conf->bypass_count < 0)
conf->bypass_count = 0;
}
+ wg = NULL;
}
if (!sh)
return NULL;
+ if (wg) {
+ wg->stripes_cnt--;
+ sh->group = NULL;
+ }
list_del_init(&sh->lru);
atomic_inc(&sh->count);
BUG_ON(atomic_read(&sh->count) != 1);
@@ -5025,8 +5053,8 @@ static int retry_aligned_read(struct r5
return handled;
}
-#define MAX_STRIPE_BATCH 8
-static int handle_active_stripes(struct r5conf *conf, int group)
+static int handle_active_stripes(struct r5conf *conf, int group,
+ struct r5worker *worker)
{
struct stripe_head *batch[MAX_STRIPE_BATCH], *sh;
int i, batch_size = 0;
@@ -5035,6 +5063,9 @@ static int handle_active_stripes(struct
(sh = __get_priority_stripe(conf, group)) != NULL)
batch[batch_size++] = sh;
+ if (worker)
+ worker->working_cnt = batch_size;
+
if (batch_size == 0)
return batch_size;
spin_unlock_irq(&conf->device_lock);
@@ -5069,7 +5100,8 @@ static void raid5_do_work(struct work_st
released = release_stripe_list(conf);
- batch_size = handle_active_stripes(conf, group_id);
+ batch_size = handle_active_stripes(conf, group_id, worker);
+ worker->working = false;
if (!batch_size && !released)
break;
handled += batch_size;
@@ -5131,7 +5163,7 @@ static void raid5d(struct md_thread *thr
handled++;
}
- batch_size = handle_active_stripes(conf, ANY_GROUP);
+ batch_size = handle_active_stripes(conf, ANY_GROUP, NULL);
if (!batch_size && !released)
break;
handled += batch_size;
Index: linux/drivers/md/raid5.h
===================================================================
--- linux.orig/drivers/md/raid5.h 2013-08-28 16:49:51.744491417 +0800
+++ linux/drivers/md/raid5.h 2013-08-28 16:49:51.744491417 +0800
@@ -214,6 +214,7 @@ struct stripe_head {
enum reconstruct_states reconstruct_state;
spinlock_t stripe_lock;
int cpu;
+ struct r5worker_group *group;
/**
* struct stripe_operations
* @target - STRIPE_OP_COMPUTE_BLK target
@@ -370,12 +371,15 @@ struct disk_info {
struct r5worker {
struct work_struct work;
struct r5worker_group *group;
+ int working_cnt:8;
+ bool working;
};
struct r5worker_group {
struct list_head handle_list;
struct r5conf *conf;
struct r5worker *workers;
+ int stripes_cnt;
};
#define NR_STRIPE_HASH_LOCKS 8
^ permalink raw reply [flat|nested] 20+ messages in thread
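To make the worker-wakeup sizing above concrete: with MAX_STRIPE_BATCH at 8, workers[0] is always queued, and thread_cnt = stripes_cnt / MAX_STRIPE_BATCH - 1 extra idle workers are woken on top of it, so 20 pending stripes wake at most one extra worker while 64 wake up to seven. Below is a tiny stand-alone sketch of just that arithmetic; extra_workers() is an invented helper and it assumes every extra worker is currently idle.

#include <stdio.h>

#define MAX_STRIPE_BATCH 8

/* How many workers beyond workers[0] would be queued for a given
 * number of pending stripes and configured workers per group,
 * assuming all of them are idle. */
static int extra_workers(int stripes_cnt, int worker_cnt_per_group)
{
	int thread_cnt = stripes_cnt / MAX_STRIPE_BATCH - 1;
	int queued = 0, i;

	for (i = 1; i < worker_cnt_per_group && thread_cnt > 0; i++) {
		queued++;	/* workers[i] was idle, so it gets queued */
		thread_cnt--;
	}
	return queued;
}

int main(void)
{
	int stripes[] = { 4, 8, 20, 64 };
	int i;

	for (i = 0; i < 4; i++)
		printf("%2d pending stripes -> workers[0] + %d extra\n",
		       stripes[i], extra_workers(stripes[i], 8));
	return 0;
}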
* Re: [patch v3 5/5] raid5: only wakeup necessary threads
2013-08-29 7:40 ` Shaohua Li
@ 2013-09-02 0:45 ` NeilBrown
0 siblings, 0 replies; 20+ messages in thread
From: NeilBrown @ 2013-09-02 0:45 UTC (permalink / raw)
To: Shaohua Li; +Cc: linux-raid, tj, dan.j.williams
[-- Attachment #1: Type: text/plain, Size: 1030 bytes --]
On Thu, 29 Aug 2013 15:40:32 +0800 Shaohua Li <shli@kernel.org> wrote:
> > > return batch_size;
> >
> > I think this could possibly return with ->working still 'true'.
> > I think it is safest to clear it on every exit from the function
>
> Ok, this one doesn't matter too, I fixed it.
>
> Subject:raid5: only wakeup necessary threads
>
> If there are not enough stripes to handle, we'd better not always queue all
> available work_structs. If one worker can only handle a few stripes, or even none,
> it will hurt request merging and create lock contention.
>
> With this patch, the number of running work_structs will depend on the number of
> pending stripes. Note: some statistics used in the patch are accessed without
> locking protection. This shouldn't matter; we just try our best to avoid queuing
> unnecessary work_structs.
>
> Signed-off-by: Shaohua Li <shli@fusionio.com>
Thanks.
I removed the 'working_cnt' field as it wasn't being used, and have applied
and pushed out the patch.
NeilBrown
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 828 bytes --]
^ permalink raw reply [flat|nested] 20+ messages in thread
end of thread, other threads:[~2013-09-02 0:45 UTC | newest]
Thread overview: 20+ messages
2013-08-27 9:50 [patch v3 0/5] raid5: make stripe handling multi-threading Shaohua Li
2013-08-27 9:50 ` [patch v3 1/5] raid5: make release_stripe lockless Shaohua Li
2013-08-28 14:04 ` Tejun Heo
2013-08-28 14:29 ` Shaohua Li
2013-08-28 14:30 ` Tejun Heo
2013-08-27 9:50 ` [patch v3 2/5] raid5: fix stripe release order Shaohua Li
2013-08-28 3:41 ` NeilBrown
2013-08-28 6:29 ` Shaohua Li
2013-08-28 6:37 ` NeilBrown
2013-08-27 9:50 ` [patch v3 3/5] raid5: offload stripe handle to workqueue Shaohua Li
2013-08-28 3:53 ` NeilBrown
2013-08-28 6:30 ` Shaohua Li
2013-08-28 6:56 ` NeilBrown
2013-08-27 9:50 ` [patch v3 4/5] raid5: sysfs entry to control worker thread number Shaohua Li
2013-08-27 9:50 ` [patch v3 5/5] raid5: only wakeup necessary threads Shaohua Li
2013-08-28 4:13 ` NeilBrown
2013-08-28 6:31 ` Shaohua Li
2013-08-28 6:59 ` NeilBrown
2013-08-29 7:40 ` Shaohua Li
2013-09-02 0:45 ` NeilBrown