* [PATCH 0/8] Fixes for md/raid5 stripe batching code.
@ 2015-05-22 5:30 NeilBrown
From: NeilBrown @ 2015-05-22 5:30 UTC (permalink / raw)
To: Shaohua Li; +Cc: linux-raid
Hi Shaohua,
thanks for explaining a bit more your thinking about this code.
I've been testing and reviewing, found a few bugs, and done some
cleaning up (e.g. sharing code). It seems to be working OK and I'm
starting to feel confident about it.
Could you please review all of these patches and confirm that the
changes I have made are consistent with your understanding of how it
should work? Maybe if you could test (they are all in my for-next
branch) and make sure the performance improvements are still there,
that would be great.
Thanks,
NeilBrown
---
NeilBrown (8):
md/raid5: ensure whole batch is delayed for all required bitmap updates.
md/raid5: Ensure a batch member is not handled prematurely.
md/raid5: remove condition test from check_break_stripe_batch_list.
md/raid5: duplicate some more handle_stripe_clean_event code in break_stripe_batch_list
md/raid5: add handle_flags arg to break_stripe_batch_list.
md/raid5: be more selective about distributing flags across batch.
md/raid5: call break_stripe_batch_list from handle_stripe_clean_event
md/raid5: break stripe-batches when the array has failed.
drivers/md/raid5.c | 122 ++++++++++++++++++++++++++++------------------------
drivers/md/raid5.h | 2 -
2 files changed, 66 insertions(+), 58 deletions(-)
--
Signature
^ permalink raw reply [flat|nested] 17+ messages in thread
* [PATCH 1/8] md/raid5: ensure whole batch is delayed for all required bitmap updates.
From: NeilBrown @ 2015-05-22 5:30 UTC (permalink / raw)
To: Shaohua Li; +Cc: linux-raid
When we add a stripe to a batch, we need to be sure that the
head stripe will wait for the bitmap update required for the new
stripe.
Signed-off-by: NeilBrown <neilb@suse.de>
---
drivers/md/raid5.c | 9 +++++++++
1 file changed, 9 insertions(+)
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 987c67298953..c3ccefbd4fe7 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -848,6 +848,15 @@ static void stripe_add_to_batch_list(struct r5conf *conf, struct stripe_head *sh
< IO_THRESHOLD)
md_wakeup_thread(conf->mddev->thread);
+ if (test_and_clear_bit(STRIPE_BIT_DELAY, &sh->state)) {
+ int seq = sh->bm_seq;
+ if (test_bit(STRIPE_BIT_DELAY, &sh->batch_head->state) &&
+ sh->batch_head->bm_seq > seq)
+ seq = sh->batch_head->bm_seq;
+ set_bit(STRIPE_BIT_DELAY, &sh->batch_head->state);
+ sh->batch_head->bm_seq = seq;
+ }
+
atomic_inc(&sh->count);
unlock_out:
unlock_two_stripes(head, sh);
* [PATCH 2/8] md/raid5: Ensure a batch member is not handled prematurely.
From: NeilBrown @ 2015-05-22 5:30 UTC (permalink / raw)
To: Shaohua Li; +Cc: linux-raid
If a stripe is a member of a batch, but not the head, it must
not be handled separately from the rest of the batch.
'clear_batch_ready()' handles this requirement to some
extent but not completely. If a member is passed to handle_stripe()
a second time it returns '0' indicating the stripe can be handled,
which is wrong.
So add an extra test.
Signed-off-by: NeilBrown <neilb@suse.de>
---
drivers/md/raid5.c | 6 +++++-
1 file changed, 5 insertions(+), 1 deletion(-)
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index c3ccefbd4fe7..9a803b735848 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -4192,9 +4192,13 @@ static void analyse_stripe(struct stripe_head *sh, struct stripe_head_state *s)
static int clear_batch_ready(struct stripe_head *sh)
{
+ /* Return '1' if this is a member of batch, or
+ * '0' if it is a lone stripe or a head which can now be
+ * handled.
+ */
struct stripe_head *tmp;
if (!test_and_clear_bit(STRIPE_BATCH_READY, &sh->state))
- return 0;
+ return (sh->batch_head && sh->batch_head != sh);
spin_lock(&sh->stripe_lock);
if (!sh->batch_head) {
spin_unlock(&sh->stripe_lock);
* [PATCH 3/8] md/raid5: remove condition test from check_break_stripe_batch_list.
From: NeilBrown @ 2015-05-22 5:30 UTC (permalink / raw)
To: Shaohua Li; +Cc: linux-raid
handle_stripe_clean_event() contains a chunk of code very
similar to check_break_stripe_batch_list().
If we make the latter more like the former, we can end up
with just one copy of this code.
This first step removes the condition test (and the 'check_' part
of the name). This has the added advantage of making it clear
what check is being performed at the point where the function is
called.
Signed-off-by: NeilBrown <neilb@suse.de>
---
drivers/md/raid5.c | 12 ++++--------
1 file changed, 4 insertions(+), 8 deletions(-)
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 9a803b735848..4952205460e5 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -4226,16 +4226,11 @@ static int clear_batch_ready(struct stripe_head *sh)
return 0;
}
-static void check_break_stripe_batch_list(struct stripe_head *sh)
+static void break_stripe_batch_list(struct stripe_head *head_sh)
{
- struct stripe_head *head_sh, *next;
+ struct stripe_head *sh, *next;
int i;
- if (!test_and_clear_bit(STRIPE_BATCH_ERR, &sh->state))
- return;
-
- head_sh = sh;
-
list_for_each_entry_safe(sh, next, &head_sh->batch_list, batch_list) {
list_del_init(&sh->batch_list);
@@ -4282,7 +4277,8 @@ static void handle_stripe(struct stripe_head *sh)
return;
}
- check_break_stripe_batch_list(sh);
+ if (test_and_clear_bit(STRIPE_BATCH_ERR, &sh->state))
+ break_stripe_batch_list(sh);
if (test_bit(STRIPE_SYNC_REQUESTED, &sh->state) && !sh->batch_head) {
spin_lock(&sh->stripe_lock);
* [PATCH 5/8] md/raid5: add handle_flags arg to break_stripe_batch_list.
From: NeilBrown @ 2015-05-22 5:30 UTC (permalink / raw)
To: Shaohua Li; +Cc: linux-raid
When we break a stripe_batch_list we sometimes want to set
STRIPE_HANDLE on the individual stripes, and sometimes not.
So pass a 'handle_flags' arg. If it is zero, always set STRIPE_HANDLE
(on non-head stripes). If not zero, only set it if any of the given
flags are present.
Signed-off-by: NeilBrown <neilb@suse.de>
---
drivers/md/raid5.c | 12 ++++++++----
1 file changed, 8 insertions(+), 4 deletions(-)
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index bfad47e36d1a..1f12a0bfa8bd 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -4226,7 +4226,8 @@ static int clear_batch_ready(struct stripe_head *sh)
return 0;
}
-static void break_stripe_batch_list(struct stripe_head *head_sh)
+static void break_stripe_batch_list(struct stripe_head *head_sh,
+ unsigned long handle_flags)
{
struct stripe_head *sh, *next;
int i;
@@ -4252,8 +4253,9 @@ static void break_stripe_batch_list(struct stripe_head *head_sh)
spin_lock_irq(&sh->stripe_lock);
sh->batch_head = NULL;
spin_unlock_irq(&sh->stripe_lock);
-
- set_bit(STRIPE_HANDLE, &sh->state);
+ if (handle_flags == 0 ||
+ sh->state & handle_flags)
+ set_bit(STRIPE_HANDLE, &sh->state);
release_stripe(sh);
}
spin_lock_irq(&head_sh->stripe_lock);
@@ -4262,6 +4264,8 @@ static void break_stripe_batch_list(struct stripe_head *head_sh)
for (i = 0; i < head_sh->disks; i++)
if (test_and_clear_bit(R5_Overlap, &head_sh->dev[i].flags))
do_wakeup = 1;
+ if (head_sh->state & handle_flags)
+ set_bit(STRIPE_HANDLE, &head_sh->state);
if (do_wakeup)
wake_up(&head_sh->raid_conf->wait_for_overlap);
@@ -4290,7 +4294,7 @@ static void handle_stripe(struct stripe_head *sh)
}
if (test_and_clear_bit(STRIPE_BATCH_ERR, &sh->state))
- break_stripe_batch_list(sh);
+ break_stripe_batch_list(sh, 0);
if (test_bit(STRIPE_SYNC_REQUESTED, &sh->state) && !sh->batch_head) {
spin_lock(&sh->stripe_lock);
* [PATCH 4/8] md/raid5: duplicate some more handle_stripe_clean_event code in break_stripe_batch_list
From: NeilBrown @ 2015-05-22 5:30 UTC (permalink / raw)
To: Shaohua Li; +Cc: linux-raid
break_stripe_batch_list() didn't clear head_sh->batch_head.
This was probably a bug.
Also clear all R5_Overlap flags and if any were cleared, wake up
'wait_for_overlap'.
This isn't always necessary but the worst effect is a little
extra checking for code that is waiting on wait_for_overlap.
Also, don't use wake_up_nr() because that does the wrong thing
if 'nr' is zero, and the number of flags cleared doesn't
strongly correlate with the number of threads to wake.
Signed-off-by: NeilBrown <neilb@suse.de>
---
drivers/md/raid5.c | 16 ++++++++++++++--
1 file changed, 14 insertions(+), 2 deletions(-)
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 4952205460e5..bfad47e36d1a 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -4230,6 +4230,7 @@ static void break_stripe_batch_list(struct stripe_head *head_sh)
{
struct stripe_head *sh, *next;
int i;
+ int do_wakeup = 0;
list_for_each_entry_safe(sh, next, &head_sh->batch_list, batch_list) {
@@ -4242,10 +4243,12 @@ static void break_stripe_batch_list(struct stripe_head *head_sh)
STRIPE_EXPAND_SYNC_FLAG));
sh->check_state = head_sh->check_state;
sh->reconstruct_state = head_sh->reconstruct_state;
- for (i = 0; i < sh->disks; i++)
+ for (i = 0; i < sh->disks; i++) {
+ if (test_and_clear_bit(R5_Overlap, &sh->dev[i].flags))
+ do_wakeup = 1;
sh->dev[i].flags = head_sh->dev[i].flags &
(~((1 << R5_WriteError) | (1 << R5_Overlap)));
-
+ }
spin_lock_irq(&sh->stripe_lock);
sh->batch_head = NULL;
spin_unlock_irq(&sh->stripe_lock);
@@ -4253,6 +4256,15 @@ static void break_stripe_batch_list(struct stripe_head *head_sh)
set_bit(STRIPE_HANDLE, &sh->state);
release_stripe(sh);
}
+ spin_lock_irq(&head_sh->stripe_lock);
+ head_sh->batch_head = NULL;
+ spin_unlock_irq(&head_sh->stripe_lock);
+ for (i = 0; i < head_sh->disks; i++)
+ if (test_and_clear_bit(R5_Overlap, &head_sh->dev[i].flags))
+ do_wakeup = 1;
+
+ if (do_wakeup)
+ wake_up(&head_sh->raid_conf->wait_for_overlap);
}
static void handle_stripe(struct stripe_head *sh)
* [PATCH 6/8] md/raid5: be more selective about distributing flags across batch.
From: NeilBrown @ 2015-05-22 5:30 UTC (permalink / raw)
To: Shaohua Li; +Cc: linux-raid
When a batch of stripes is broken up, we keep some of the flags
that were per-stripe, and copy other flags from the head to all
others.
This only happens while a stripe is being handled, so many of the
flags are irrelevant.
The "SYNC_FLAGS" (which I've renamed to make it clear there are
several) and STRIPE_DEGRADED are set per-stripe and so need to be
preserved. STRIPE_INSYNC is the only flag that is set on the head
that needs to be propagated to all others.
For safety, add a WARN_ON if others are set.
Signed-off-by: NeilBrown <neilb@suse.de>
---
drivers/md/raid5.c | 57 ++++++++++++++++++++++++++++++++++++++++++----------
drivers/md/raid5.h | 2 +-
2 files changed, 47 insertions(+), 12 deletions(-)
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 1f12a0bfa8bd..39901e2b5502 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -3526,10 +3526,28 @@ unhash:
struct stripe_head, batch_list);
list_del_init(&sh->batch_list);
- set_mask_bits(&sh->state, ~STRIPE_EXPAND_SYNC_FLAG,
- head_sh->state & ~((1 << STRIPE_ACTIVE) |
- (1 << STRIPE_PREREAD_ACTIVE) |
- STRIPE_EXPAND_SYNC_FLAG));
+ WARN_ON_ONCE(sh->state & ((1 << STRIPE_ACTIVE) |
+ (1 << STRIPE_SYNCING) |
+ (1 << STRIPE_REPLACED) |
+ (1 << STRIPE_PREREAD_ACTIVE) |
+ (1 << STRIPE_DELAYED) |
+ (1 << STRIPE_BIT_DELAY) |
+ (1 << STRIPE_FULL_WRITE) |
+ (1 << STRIPE_BIOFILL_RUN) |
+ (1 << STRIPE_COMPUTE_RUN) |
+ (1 << STRIPE_OPS_REQ_PENDING) |
+ (1 << STRIPE_ON_UNPLUG_LIST) |
+ (1 << STRIPE_DISCARD) |
+ (1 << STRIPE_ON_RELEASE_LIST) |
+ (1 << STRIPE_BATCH_READY) |
+ (1 << STRIPE_BATCH_ERR)));
+ WARN_ON_ONCE(head_sh->state & ((1 << STRIPE_DISCARD) |
+ (1 << STRIPE_REPLACED)));
+
+ set_mask_bits(&sh->state, ~(STRIPE_EXPAND_SYNC_FLAGS |
+ (1 << STRIPE_DEGRADED)),
+ head_sh->state & (1 << STRIPE_INSYNC));
+
sh->check_state = head_sh->check_state;
sh->reconstruct_state = head_sh->reconstruct_state;
for (i = 0; i < sh->disks; i++) {
@@ -3541,7 +3559,7 @@ unhash:
spin_lock_irq(&sh->stripe_lock);
sh->batch_head = NULL;
spin_unlock_irq(&sh->stripe_lock);
- if (sh->state & STRIPE_EXPAND_SYNC_FLAG)
+ if (sh->state & STRIPE_EXPAND_SYNC_FLAGS)
set_bit(STRIPE_HANDLE, &sh->state);
release_stripe(sh);
}
@@ -3550,7 +3568,7 @@ unhash:
head_sh->batch_head = NULL;
spin_unlock_irq(&head_sh->stripe_lock);
wake_up_nr(&conf->wait_for_overlap, wakeup_nr);
- if (head_sh->state & STRIPE_EXPAND_SYNC_FLAG)
+ if (head_sh->state & STRIPE_EXPAND_SYNC_FLAGS)
set_bit(STRIPE_HANDLE, &head_sh->state);
}
@@ -4237,11 +4255,28 @@ static void break_stripe_batch_list(struct stripe_head *head_sh,
list_del_init(&sh->batch_list);
- set_mask_bits(&sh->state, ~STRIPE_EXPAND_SYNC_FLAG,
- head_sh->state & ~((1 << STRIPE_ACTIVE) |
- (1 << STRIPE_PREREAD_ACTIVE) |
- (1 << STRIPE_DEGRADED) |
- STRIPE_EXPAND_SYNC_FLAG));
+ WARN_ON_ONCE(sh->state & ((1 << STRIPE_ACTIVE) |
+ (1 << STRIPE_SYNCING) |
+ (1 << STRIPE_REPLACED) |
+ (1 << STRIPE_PREREAD_ACTIVE) |
+ (1 << STRIPE_DELAYED) |
+ (1 << STRIPE_BIT_DELAY) |
+ (1 << STRIPE_FULL_WRITE) |
+ (1 << STRIPE_BIOFILL_RUN) |
+ (1 << STRIPE_COMPUTE_RUN) |
+ (1 << STRIPE_OPS_REQ_PENDING) |
+ (1 << STRIPE_ON_UNPLUG_LIST) |
+ (1 << STRIPE_DISCARD) |
+ (1 << STRIPE_ON_RELEASE_LIST) |
+ (1 << STRIPE_BATCH_READY) |
+ (1 << STRIPE_BATCH_ERR)));
+ WARN_ON_ONCE(head_sh->state & ((1 << STRIPE_DISCARD) |
+ (1 << STRIPE_REPLACED)));
+
+ set_mask_bits(&sh->state, ~(STRIPE_EXPAND_SYNC_FLAGS |
+ (1 << STRIPE_DEGRADED)),
+ head_sh->state & (1 << STRIPE_INSYNC));
+
sh->check_state = head_sh->check_state;
sh->reconstruct_state = head_sh->reconstruct_state;
for (i = 0; i < sh->disks; i++) {
diff --git a/drivers/md/raid5.h b/drivers/md/raid5.h
index 6307b904d318..d7b2bc8b756f 100644
--- a/drivers/md/raid5.h
+++ b/drivers/md/raid5.h
@@ -339,7 +339,7 @@ enum {
STRIPE_BATCH_ERR,
};
-#define STRIPE_EXPAND_SYNC_FLAG \
+#define STRIPE_EXPAND_SYNC_FLAGS \
((1 << STRIPE_EXPAND_SOURCE) |\
(1 << STRIPE_EXPAND_READY) |\
(1 << STRIPE_EXPANDING) |\
* [PATCH 7/8] md/raid5: call break_stripe_batch_list from handle_stripe_clean_event
From: NeilBrown @ 2015-05-22 5:30 UTC (permalink / raw)
To: Shaohua Li; +Cc: linux-raid
Now that the code in break_stripe_batch_list() is nearly identical
to the end of handle_stripe_clean_event, replace the latter
with a function call.
The only remaining difference of any interest is the masking that is
applied to dev[i].flags copied from head_sh.
R5_WriteError certainly isn't wanted as it is set per-stripe, not
per-batch. R5_Overlap isn't wanted as it is explicitly handled.
Signed-off-by: NeilBrown <neilb@suse.de>
---
drivers/md/raid5.c | 61 +++-------------------------------------------------
1 file changed, 4 insertions(+), 57 deletions(-)
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 39901e2b5502..90da2376e0bb 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -3412,6 +3412,8 @@ static void handle_stripe_fill(struct stripe_head *sh,
set_bit(STRIPE_HANDLE, &sh->state);
}
+static void break_stripe_batch_list(struct stripe_head *head_sh,
+ unsigned long handle_flags);
/* handle_stripe_clean_event
* any written block on an uptodate or failed drive can be returned.
* Note that if we 'wrote' to a failed drive, it will be UPTODATE, but
@@ -3425,7 +3427,6 @@ static void handle_stripe_clean_event(struct r5conf *conf,
int discard_pending = 0;
struct stripe_head *head_sh = sh;
bool do_endio = false;
- int wakeup_nr = 0;
for (i = disks; i--; )
if (sh->dev[i].written) {
@@ -3514,62 +3515,8 @@ unhash:
if (atomic_dec_and_test(&conf->pending_full_writes))
md_wakeup_thread(conf->mddev->thread);
- if (!head_sh->batch_head || !do_endio)
- return;
- for (i = 0; i < head_sh->disks; i++) {
- if (test_and_clear_bit(R5_Overlap, &head_sh->dev[i].flags))
- wakeup_nr++;
- }
- while (!list_empty(&head_sh->batch_list)) {
- int i;
- sh = list_first_entry(&head_sh->batch_list,
- struct stripe_head, batch_list);
- list_del_init(&sh->batch_list);
-
- WARN_ON_ONCE(sh->state & ((1 << STRIPE_ACTIVE) |
- (1 << STRIPE_SYNCING) |
- (1 << STRIPE_REPLACED) |
- (1 << STRIPE_PREREAD_ACTIVE) |
- (1 << STRIPE_DELAYED) |
- (1 << STRIPE_BIT_DELAY) |
- (1 << STRIPE_FULL_WRITE) |
- (1 << STRIPE_BIOFILL_RUN) |
- (1 << STRIPE_COMPUTE_RUN) |
- (1 << STRIPE_OPS_REQ_PENDING) |
- (1 << STRIPE_ON_UNPLUG_LIST) |
- (1 << STRIPE_DISCARD) |
- (1 << STRIPE_ON_RELEASE_LIST) |
- (1 << STRIPE_BATCH_READY) |
- (1 << STRIPE_BATCH_ERR)));
- WARN_ON_ONCE(head_sh->state & ((1 << STRIPE_DISCARD) |
- (1 << STRIPE_REPLACED)));
-
- set_mask_bits(&sh->state, ~(STRIPE_EXPAND_SYNC_FLAGS |
- (1 << STRIPE_DEGRADED)),
- head_sh->state & (1 << STRIPE_INSYNC));
-
- sh->check_state = head_sh->check_state;
- sh->reconstruct_state = head_sh->reconstruct_state;
- for (i = 0; i < sh->disks; i++) {
- if (test_and_clear_bit(R5_Overlap, &sh->dev[i].flags))
- wakeup_nr++;
- sh->dev[i].flags = head_sh->dev[i].flags;
- }
-
- spin_lock_irq(&sh->stripe_lock);
- sh->batch_head = NULL;
- spin_unlock_irq(&sh->stripe_lock);
- if (sh->state & STRIPE_EXPAND_SYNC_FLAGS)
- set_bit(STRIPE_HANDLE, &sh->state);
- release_stripe(sh);
- }
-
- spin_lock_irq(&head_sh->stripe_lock);
- head_sh->batch_head = NULL;
- spin_unlock_irq(&head_sh->stripe_lock);
- wake_up_nr(&conf->wait_for_overlap, wakeup_nr);
- if (head_sh->state & STRIPE_EXPAND_SYNC_FLAGS)
- set_bit(STRIPE_HANDLE, &head_sh->state);
+ if (head_sh->batch_head && do_endio)
+ break_stripe_batch_list(head_sh, STRIPE_EXPAND_SYNC_FLAGS);
}
static void handle_stripe_dirtying(struct r5conf *conf,
* [PATCH 8/8] md/raid5: break stripe-batches when the array has failed.
From: NeilBrown @ 2015-05-22 5:30 UTC (permalink / raw)
To: Shaohua Li; +Cc: linux-raid
Once the array has suffered too many failures, we need to break
stripe-batches up so they can all be dealt with.
Signed-off-by: NeilBrown <neilb@suse.de>
---
drivers/md/raid5.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 90da2376e0bb..041341c66ae5 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -4330,6 +4330,7 @@ static void handle_stripe(struct stripe_head *sh)
if (s.failed > conf->max_degraded) {
sh->check_state = 0;
sh->reconstruct_state = 0;
+ break_stripe_batch_list(sh, 0);
if (s.to_read+s.to_write+s.written)
handle_failed_stripe(conf, sh, &s, disks, &s.return_bi);
if (s.syncing + s.replacing)
* Re: [PATCH 2/8] md/raid5: Ensure a batch member is not handled prematurely.
From: Shaohua Li @ 2015-05-22 23:44 UTC (permalink / raw)
To: NeilBrown; +Cc: linux-raid
On Fri, May 22, 2015 at 03:30:58PM +1000, NeilBrown wrote:
> If a stripe is a member of a batch, but not the head, it must
> not be handled separately from the rest of the batch.
>
> 'clear_batch_ready()' handles this requirement to some
> extent but not completely. If a member is passed to handle_stripe()
> a second time it returns '0' indicating the stripe can be handled,
> which is wrong.
> So add an extra test.
>
> Signed-off-by: NeilBrown <neilb@suse.de>
> ---
> drivers/md/raid5.c | 6 +++++-
> 1 file changed, 5 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
> index c3ccefbd4fe7..9a803b735848 100644
> --- a/drivers/md/raid5.c
> +++ b/drivers/md/raid5.c
> @@ -4192,9 +4192,13 @@ static void analyse_stripe(struct stripe_head *sh, struct stripe_head_state *s)
>
> static int clear_batch_ready(struct stripe_head *sh)
> {
> + /* Return '1' if this is a member of batch, or
> + * '0' if it is a lone stripe or a head which can now be
> + * handled.
> + */
> struct stripe_head *tmp;
> if (!test_and_clear_bit(STRIPE_BATCH_READY, &sh->state))
> - return 0;
> + return (sh->batch_head && sh->batch_head != sh);
> spin_lock(&sh->stripe_lock);
> if (!sh->batch_head) {
> spin_unlock(&sh->stripe_lock);
which case can this happen in?
Patches look good. But I'm not in Fusionio any more, so can't check the
performance in big raid array with fast flash cards. I'm doing some tests here.
I hit a warning in break_stripe_batch_list, STRIPE_BIT_DELAY is set in the
stripe state. I'm checking the reason, but if you have thoughts I can try
immediately, please let me know.
Thanks,
Shaohua
* Re: [PATCH 2/8] md/raid5: Ensure a batch member is not handled prematurely.
From: NeilBrown @ 2015-05-23 0:26 UTC (permalink / raw)
To: Shaohua Li; +Cc: linux-raid
On Fri, 22 May 2015 16:44:02 -0700 Shaohua Li <shli@kernel.org> wrote:
> On Fri, May 22, 2015 at 03:30:58PM +1000, NeilBrown wrote:
> > If a stripe is a member of a batch, but not the head, it must
> > not be handled separately from the rest of the batch.
> >
> > 'clear_batch_ready()' handles this requirement to some
> > extent but not completely. If a member is passed to handle_stripe()
> > a second time it returns '0' indicating the stripe can be handled,
> > which is wrong.
> > So add an extra test.
> >
> > Signed-off-by: NeilBrown <neilb@suse.de>
> > ---
> > drivers/md/raid5.c | 6 +++++-
> > 1 file changed, 5 insertions(+), 1 deletion(-)
> >
> > diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
> > index c3ccefbd4fe7..9a803b735848 100644
> > --- a/drivers/md/raid5.c
> > +++ b/drivers/md/raid5.c
> > @@ -4192,9 +4192,13 @@ static void analyse_stripe(struct stripe_head *sh, struct stripe_head_state *s)
> >
> > static int clear_batch_ready(struct stripe_head *sh)
> > {
> > + /* Return '1' if this is a member of batch, or
> > + * '0' if it is a lone stripe or a head which can now be
> > + * handled.
> > + */
> > struct stripe_head *tmp;
> > if (!test_and_clear_bit(STRIPE_BATCH_READY, &sh->state))
> > - return 0;
> > + return (sh->batch_head && sh->batch_head != sh);
> > spin_lock(&sh->stripe_lock);
> > if (!sh->batch_head) {
> > spin_unlock(&sh->stripe_lock);
>
> which case can this happen in?
It definitely happens as I had reliable problems until I added this fix.
'retry_aligned_read()' can call handle_stripe() on any stripe at any time,
but I doubt that would apply. I might try putting a warn-on there and see if
it provides any hints.
>
> Patches look good. But I'm not in Fusionio any more, so can't check the
> performance in big raid array with fast flash cards. I'm doing some tests here.
> I hit a warning in break_stripe_batch_list, STRIPE_BIT_DELAY is set in the
> stripe state. I'm checking the reason, but if you have thoughts I can try
> immediately, please let me know.
I got STRIPE_BIT_DELAY a few times. That was the main reason for
md/raid5: ensure whole batch is delayed for all required bitmap updates.
and they went away after I got that patch right.
Maybe there is a race in there..
If you can reproduce it, maybe WARN whenever STRIPE_BIT_DELAY gets set on a
stripe with ->batch_head.
>
> Thanks,
> Shaohua
Thanks,
NeilBrown
* Re: [PATCH 2/8] md/raid5: Ensure a batch member is not handled prematurely.
From: Shaohua Li @ 2015-05-26 18:16 UTC (permalink / raw)
To: NeilBrown; +Cc: linux-raid
On Sat, May 23, 2015 at 10:26:40AM +1000, NeilBrown wrote:
> On Fri, 22 May 2015 16:44:02 -0700 Shaohua Li <shli@kernel.org> wrote:
>
> > On Fri, May 22, 2015 at 03:30:58PM +1000, NeilBrown wrote:
> > > If a stripe is a member of a batch, but not the head, it must
> > > not be handled separately from the rest of the batch.
> > >
> > > 'clear_batch_ready()' handles this requirement to some
> > > extent but not completely. If a member is passed to handle_stripe()
> > > a second time it returns '0' indicating the stripe can be handled,
> > > which is wrong.
> > > So add an extra test.
> > >
> > > Signed-off-by: NeilBrown <neilb@suse.de>
> > > ---
> > > drivers/md/raid5.c | 6 +++++-
> > > 1 file changed, 5 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
> > > index c3ccefbd4fe7..9a803b735848 100644
> > > --- a/drivers/md/raid5.c
> > > +++ b/drivers/md/raid5.c
> > > @@ -4192,9 +4192,13 @@ static void analyse_stripe(struct stripe_head *sh, struct stripe_head_state *s)
> > >
> > > static int clear_batch_ready(struct stripe_head *sh)
> > > {
> > > + /* Return '1' if this is a member of batch, or
> > > + * '0' if it is a lone stripe or a head which can now be
> > > + * handled.
> > > + */
> > > struct stripe_head *tmp;
> > > if (!test_and_clear_bit(STRIPE_BATCH_READY, &sh->state))
> > > - return 0;
> > > + return (sh->batch_head && sh->batch_head != sh);
> > > spin_lock(&sh->stripe_lock);
> > > if (!sh->batch_head) {
> > > spin_unlock(&sh->stripe_lock);
> >
> > which case can this happen in?
>
> It definitely happens as I had reliable problems until I added this fix.
> 'retry_aligned_read()' can call handle_stripe() on any stripe at any time,
> but I doubt that would apply. I might try putting a warn-on there and see if
> it provides any hints.
>
> >
> > Patches look good. But I'm not in Fusionio any more, so can't check the
> > performance in big raid array with fast flash cards. I'm doing some tests here.
> > I hit a warning in break_stripe_batch_list, STRIPE_BIT_DELAY is set in the
> > stripe state. I'm checking the reason, but if you have thoughts I can try
> > immediately, please let me know.
>
> I got STRIPE_BIT_DELAY a few times. That was the main reason for
>
> md/raid5: ensure whole batch is delayed for all required bitmap updates.
>
> and they went away after I got that patch right.
>
> Maybe there is a race in there..
>
> If you can reproduce it, maybe WARN whenever STRIPE_BIT_DELAY gets set on a
> stripe with ->batch_head.
Ok, there is a race in add_stripe_bio(). We unlocked the stripe_lock to set the
BIT_DELAY. After the unlock, the stripe might be added to a batch,
stripe_add_to_batch_list didn't clear the bit. Holding the lock in
add_stripe_bio and checking ->batch_head again when we set the bit should fix
the issue.
And it is legitimate for STRIPE_ON_UNPLUG_LIST and STRIPE_ON_RELEASE_LIST to
be set in break_stripe_batch_list(), so they should be removed from the
WARN_ON_ONCE().
Thanks,
Shaohua
* Re: [PATCH 2/8] md/raid5: Ensure a batch member is not handled prematurely.
From: NeilBrown @ 2015-05-26 22:35 UTC (permalink / raw)
To: Shaohua Li; +Cc: linux-raid
On Tue, 26 May 2015 11:16:47 -0700 Shaohua Li <shli@kernel.org> wrote:
> On Sat, May 23, 2015 at 10:26:40AM +1000, NeilBrown wrote:
> > On Fri, 22 May 2015 16:44:02 -0700 Shaohua Li <shli@kernel.org> wrote:
> >
> > > On Fri, May 22, 2015 at 03:30:58PM +1000, NeilBrown wrote:
> > > > If a stripe is a member of a batch, but not the head, it must
> > > > not be handled separately from the rest of the batch.
> > > >
> > > > 'clear_batch_ready()' handles this requirement to some
> > > > extent but not completely. If a member is passed to handle_stripe()
> > > > a second time it returns '0' indicating the stripe can be handled,
> > > > which is wrong.
> > > > So add an extra test.
> > > >
> > > > Signed-off-by: NeilBrown <neilb@suse.de>
> > > > ---
> > > > drivers/md/raid5.c | 6 +++++-
> > > > 1 file changed, 5 insertions(+), 1 deletion(-)
> > > >
> > > > diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
> > > > index c3ccefbd4fe7..9a803b735848 100644
> > > > --- a/drivers/md/raid5.c
> > > > +++ b/drivers/md/raid5.c
> > > > @@ -4192,9 +4192,13 @@ static void analyse_stripe(struct stripe_head *sh, struct stripe_head_state *s)
> > > >
> > > > static int clear_batch_ready(struct stripe_head *sh)
> > > > {
> > > > + /* Return '1' if this is a member of batch, or
> > > > + * '0' if it is a lone stripe or a head which can now be
> > > > + * handled.
> > > > + */
> > > > struct stripe_head *tmp;
> > > > if (!test_and_clear_bit(STRIPE_BATCH_READY, &sh->state))
> > > > - return 0;
> > > > + return (sh->batch_head && sh->batch_head != sh);
> > > > spin_lock(&sh->stripe_lock);
> > > > if (!sh->batch_head) {
> > > > spin_unlock(&sh->stripe_lock);
> > >
> > > which case can this happen in?
> >
> > It definitely happens, as I hit problems reliably until I added this fix.
> > 'retry_aligned_read()' can call handle_stripe() on any stripe at any time,
> > but I doubt that would apply. I might try putting a warn-on there and see if
> > it provides any hints.
> >
> > >
> > > Patches look good. But I'm not in Fusionio any more, so can't check the
> > > performance in big raid array with fast flash cards. I'm doing some tests here.
> > > I hit a warning in break_stripe_batch_list, STRIPE_BIT_DELAY is set in the
> > > stripe state. I'm checking the reason, but if you have thoughts I can try
> > > immediately, please let me know.
> >
> > I got STRIPE_BIT_DELAY a few times. That was the main reason for
> >
> > md/raid5: ensure whole batch is delayed for all required bitmap updates.
> >
> > and they went away after I got that patch right.
> >
> > Maybe there is a race in there..
> >
> > If you can reproduce it, maybe WARN whenever STRIPE_BIT_DELAY gets set on a
> > stripe with ->batch_head.
>
> Ok, there is a race in add_stripe_bio(). We unlocked the stripe_lock to set the
> BIT_DELAY. After the unlock, the stripe might be added to a batch,
> stripe_add_to_batch_list didn't clear the bit. Holding the lock in
> add_stripe_bio and checking ->batch_head again when we set the bit should fix
> the issue.
We can't hold a spin_lock over bitmap_startwrite(), and we really need to
make sure the write doesn't start until bitmap_startwrite has completed.
So we need to keep the stripe_head out of any batch during that time.
So I've added an extra state bit.
Could you please review and possibly test the patch below?
>
> And STRIPE_ON_UNPLUG_LIST and STRIPE_ON_RELEASE_LIST being set is legitimate in
> break_stripe_batch_list(), so they should be removed from the WARN_ON_ONCE().
Yes, you are right. Thanks.
>
> Thanks,
> Shaohua
Thanks,
NeilBrown
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 041341c66ae5..89d6faafffda 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -760,6 +760,7 @@ static void unlock_two_stripes(struct stripe_head *sh1, struct stripe_head *sh2)
static bool stripe_can_batch(struct stripe_head *sh)
{
return test_bit(STRIPE_BATCH_READY, &sh->state) &&
+ !test_bit(STRIPE_BITMAP_PENDING, &sh->state) &&
is_full_stripe_write(sh);
}
@@ -3007,14 +3008,18 @@ static int add_stripe_bio(struct stripe_head *sh, struct bio *bi, int dd_idx,
pr_debug("added bi b#%llu to stripe s#%llu, disk %d.\n",
(unsigned long long)(*bip)->bi_iter.bi_sector,
(unsigned long long)sh->sector, dd_idx);
- spin_unlock_irq(&sh->stripe_lock);
if (conf->mddev->bitmap && firstwrite) {
+ set_bit(STRIPE_BITMAP_PENDING, &sh->state);
+ spin_unlock_irq(&sh->stripe_lock);
bitmap_startwrite(conf->mddev->bitmap, sh->sector,
STRIPE_SECTORS, 0);
+ spin_lock_irq(&sh->stripe_lock);
+ clear_bit(STRIPE_BITMAP_PENDING, &sh->state);
sh->bm_seq = conf->seq_flush+1;
set_bit(STRIPE_BIT_DELAY, &sh->state);
}
+ spin_lock_irq(&sh->stripe_lock);
if (stripe_can_batch(sh))
stripe_add_to_batch_list(conf, sh);
diff --git a/drivers/md/raid5.h b/drivers/md/raid5.h
index d7b2bc8b756f..02c3bf8fbfe7 100644
--- a/drivers/md/raid5.h
+++ b/drivers/md/raid5.h
@@ -337,6 +337,9 @@ enum {
STRIPE_ON_RELEASE_LIST,
STRIPE_BATCH_READY,
STRIPE_BATCH_ERR,
+ STRIPE_BITMAP_PENDING, /* Being added to bitmap, don't add
+ * to batch yet.
+ */
};
#define STRIPE_EXPAND_SYNC_FLAGS \
* Re: [PATCH 2/8] md/raid5: Ensure a batch member is not handled prematurely.
From: Shaohua Li @ 2015-05-26 23:21 UTC
To: NeilBrown; +Cc: linux-raid
On Wed, May 27, 2015 at 08:35:32AM +1000, NeilBrown wrote:
> On Tue, 26 May 2015 11:16:47 -0700 Shaohua Li <shli@kernel.org> wrote:
>
> > On Sat, May 23, 2015 at 10:26:40AM +1000, NeilBrown wrote:
> > > On Fri, 22 May 2015 16:44:02 -0700 Shaohua Li <shli@kernel.org> wrote:
> > >
> > > > On Fri, May 22, 2015 at 03:30:58PM +1000, NeilBrown wrote:
> > > > > If a stripe is a member of a batch, but not the head, it must
> > > > > not be handled separately from the rest of the batch.
> > > > >
> > > > > 'clear_batch_ready()' handles this requirement to some
> > > > > extent but not completely. If a member is passed to handle_stripe()
> > > > > a second time it returns '0' indicating the stripe can be handled,
> > > > > which is wrong.
> > > > > So add an extra test.
> > > > >
> > > > > Signed-off-by: NeilBrown <neilb@suse.de>
> > > > > ---
> > > > > drivers/md/raid5.c | 6 +++++-
> > > > > 1 file changed, 5 insertions(+), 1 deletion(-)
> > > > >
> > > > > diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
> > > > > index c3ccefbd4fe7..9a803b735848 100644
> > > > > --- a/drivers/md/raid5.c
> > > > > +++ b/drivers/md/raid5.c
> > > > > @@ -4192,9 +4192,13 @@ static void analyse_stripe(struct stripe_head *sh, struct stripe_head_state *s)
> > > > >
> > > > > static int clear_batch_ready(struct stripe_head *sh)
> > > > > {
> > > > > + /* Return '1' if this is a member of batch, or
> > > > > + * '0' if it is a lone stripe or a head which can now be
> > > > > + * handled.
> > > > > + */
> > > > > struct stripe_head *tmp;
> > > > > if (!test_and_clear_bit(STRIPE_BATCH_READY, &sh->state))
> > > > > - return 0;
> > > > > + return (sh->batch_head && sh->batch_head != sh);
> > > > > spin_lock(&sh->stripe_lock);
> > > > > if (!sh->batch_head) {
> > > > > spin_unlock(&sh->stripe_lock);
> > > >
> > > > which case can this happen in?
> > >
> > > It definitely happens, as I hit problems reliably until I added this fix.
> > > 'retry_aligned_read()' can call handle_stripe() on any stripe at any time,
> > > but I doubt that would apply. I might try putting a warn-on there and see if
> > > it provides any hints.
> > >
> > > >
> > > > Patches look good. But I'm not in Fusionio any more, so can't check the
> > > > performance in big raid array with fast flash cards. I'm doing some tests here.
> > > > I hit a warning in break_stripe_batch_list, STRIPE_BIT_DELAY is set in the
> > > > stripe state. I'm checking the reason, but if you have thoughts I can try
> > > > immediately, please let me know.
> > >
> > > I got STRIPE_BIT_DELAY a few times. That was the main reason for
> > >
> > > md/raid5: ensure whole batch is delayed for all required bitmap updates.
> > >
> > > and they went away after I got that patch right.
> > >
> > > Maybe there is a race in there..
> > >
> > > If you can reproduce it, maybe WARN whenever STRIPE_BIT_DELAY gets set on a
> > > stripe with ->batch_head.
> >
> > Ok, there is a race in add_stripe_bio(). We unlocked the stripe_lock to set the
> > BIT_DELAY. After the unlock, the stripe might be added to a batch,
> > stripe_add_to_batch_list didn't clear the bit. Holding the lock in
> > add_stripe_bio and checking ->batch_head again when we set the bit should fix
> > the issue.
>
> We can't hold a spin_lock over bitmap_startwrite(), and we really need to
> make sure the write doesn't start until bitmap_startwrite has completed.
> So we need to keep the stripe_head out of any batch during that time.
> So I've added an extra state bit.
>
> Could you please review and possibly test the patch below?
>
> >
> > And STRIPE_ON_UNPLUG_LIST and STRIPE_ON_RELEASE_LIST being set is legitimate in
> > break_stripe_batch_list(), so they should be removed from the WARN_ON_ONCE().
>
> Yes, you are right. Thanks.
>
> >
> > Thanks,
> > Shaohua
>
> Thanks,
> NeilBrown
>
>
> diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
> index 041341c66ae5..89d6faafffda 100644
> --- a/drivers/md/raid5.c
> +++ b/drivers/md/raid5.c
> @@ -760,6 +760,7 @@ static void unlock_two_stripes(struct stripe_head *sh1, struct stripe_head *sh2)
> static bool stripe_can_batch(struct stripe_head *sh)
> {
> return test_bit(STRIPE_BATCH_READY, &sh->state) &&
> + !test_bit(STRIPE_BITMAP_PENDING, &sh->state) &&
> is_full_stripe_write(sh);
> }
>
> @@ -3007,14 +3008,18 @@ static int add_stripe_bio(struct stripe_head *sh, struct bio *bi, int dd_idx,
> pr_debug("added bi b#%llu to stripe s#%llu, disk %d.\n",
> (unsigned long long)(*bip)->bi_iter.bi_sector,
> (unsigned long long)sh->sector, dd_idx);
> - spin_unlock_irq(&sh->stripe_lock);
>
> if (conf->mddev->bitmap && firstwrite) {
> + set_bit(STRIPE_BITMAP_PENDING, &sh->state);
> + spin_unlock_irq(&sh->stripe_lock);
> bitmap_startwrite(conf->mddev->bitmap, sh->sector,
> STRIPE_SECTORS, 0);
> + spin_lock_irq(&sh->stripe_lock);
> + clear_bit(STRIPE_BITMAP_PENDING, &sh->state);
> sh->bm_seq = conf->seq_flush+1;
> set_bit(STRIPE_BIT_DELAY, &sh->state);
> }
> + spin_lock_irq(&sh->stripe_lock);
That should be an unlock here. I'll report back if anything is wrong.
Thanks,
Shaohua
* Re: [PATCH 2/8] md/raid5: Ensure a batch member is not handled prematurely.
From: NeilBrown @ 2015-05-26 23:34 UTC
To: Shaohua Li; +Cc: linux-raid
On Wed, 27 May 2015 08:35:32 +1000 NeilBrown <neilb@suse.de> wrote:
> Could you please review and possibly test the patch below?
>
well... that patch had a fairly obvious double-lock bug.
Try this one.
(oh, just saw your email that you spotted the lock bug :-)
NeilBrown
From: NeilBrown <neilb@suse.de>
Date: Wed, 27 May 2015 08:43:45 +1000
Subject: [PATCH] md/raid5: close race between STRIPE_BIT_DELAY and batching.
The first time a write is added to a stripe, we need to set the
bitmap bits (if a bitmap is active).
While doing that the stripe is not locked and other writes could
be added and then the stripe could be added to a batch.
Once it has entered the batch it is too late to set STRIPE_BIT_DELAY,
as the batch head has taken over deciding when the stripe will be written.
We cannot hold the spinlock while adding the bitmap bit,
so introduce a new stripe_head flag 'STRIPE_BITMAP_PENDING' which
indicates that adding to the bitmap is pending. This prevents
the stripe from being added to a batch.
Only the first thread to add a write to a stripe can set this bit,
so it is safe for it to clear it again when it is done.
Reported-by: Shaohua Li <shli@kernel.org>
Signed-off-by: NeilBrown <neilb@suse.de>
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 73b5376dad3b..dae587ecdf71 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -760,6 +760,7 @@ static void unlock_two_stripes(struct stripe_head *sh1, struct stripe_head *sh2)
static bool stripe_can_batch(struct stripe_head *sh)
{
return test_bit(STRIPE_BATCH_READY, &sh->state) &&
+ !test_bit(STRIPE_BITMAP_PENDING, &sh->state) &&
is_full_stripe_write(sh);
}
@@ -3007,14 +3008,27 @@ static int add_stripe_bio(struct stripe_head *sh, struct bio *bi, int dd_idx,
pr_debug("added bi b#%llu to stripe s#%llu, disk %d.\n",
(unsigned long long)(*bip)->bi_iter.bi_sector,
(unsigned long long)sh->sector, dd_idx);
- spin_unlock_irq(&sh->stripe_lock);
if (conf->mddev->bitmap && firstwrite) {
+ /* Cannot hold spinlock over bitmap_startwrite,
+ * but must ensure this isn't added to a batch until
+ * we have added to the bitmap and set bm_seq.
+ * So set STRIPE_BITMAP_PENDING to prevent
+ * batching.
+ * Only the first thread to add a write to a stripe
+ * can set this bit, so we "own" it.
+ */
+ WARN_ON(test_bit(STRIPE_BITMAP_PENDING, &sh->state));
+ set_bit(STRIPE_BITMAP_PENDING, &sh->state);
+ spin_unlock_irq(&sh->stripe_lock);
bitmap_startwrite(conf->mddev->bitmap, sh->sector,
STRIPE_SECTORS, 0);
+ spin_lock_irq(&sh->stripe_lock);
+ clear_bit(STRIPE_BITMAP_PENDING, &sh->state);
sh->bm_seq = conf->seq_flush+1;
set_bit(STRIPE_BIT_DELAY, &sh->state);
}
+ spin_unlock_irq(&sh->stripe_lock);
if (stripe_can_batch(sh))
stripe_add_to_batch_list(conf, sh);
diff --git a/drivers/md/raid5.h b/drivers/md/raid5.h
index d7b2bc8b756f..02c3bf8fbfe7 100644
--- a/drivers/md/raid5.h
+++ b/drivers/md/raid5.h
@@ -337,6 +337,9 @@ enum {
STRIPE_ON_RELEASE_LIST,
STRIPE_BATCH_READY,
STRIPE_BATCH_ERR,
+ STRIPE_BITMAP_PENDING, /* Being added to bitmap, don't add
+ * to batch yet.
+ */
};
#define STRIPE_EXPAND_SYNC_FLAGS \
* Re: [PATCH 2/8] md/raid5: Ensure a batch member is not handled prematurely.
From: Shaohua Li @ 2015-05-27 0:10 UTC
To: NeilBrown; +Cc: linux-raid
On Wed, May 27, 2015 at 09:34:51AM +1000, NeilBrown wrote:
> On Wed, 27 May 2015 08:35:32 +1000 NeilBrown <neilb@suse.de> wrote:
>
> > Could you please review and possibly test the patch below?
> >
>
> well... that patch had a fairly obvious double-lock bug.
> Try this one.
> (oh, just saw your email that you spotted the lock bug :-)
>
>
> NeilBrown
>
> From: NeilBrown <neilb@suse.de>
> Date: Wed, 27 May 2015 08:43:45 +1000
> Subject: [PATCH] md/raid5: close race between STRIPE_BIT_DELAY and batching.
>
> The first time a write is added to a stripe, we need to set the
> bitmap bits (if a bitmap is active).
> While doing that the stripe is not locked and other writes could
> be added and then the stripe could be added to a batch.
> Once it has entered the batch it is too late to set STRIPE_BIT_DELAY,
> as the batch head has taken over deciding when the stripe will be written.
>
> We cannot hold the spinlock while adding the bitmap bit,
> so introduce a new stripe_head flag 'STRIPE_BITMAP_PENDING' which
> indicates that adding to the bitmap is pending. This prevents
> the stripe from being added to a batch.
>
> Only the first thread to add a write to a stripe can set this bit,
> so it is safe for it to clear it again when it is done.
>
> Reported-by: Shaohua Li <shli@kernel.org>
> Signed-off-by: NeilBrown <neilb@suse.de>
>
> diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
> index 73b5376dad3b..dae587ecdf71 100644
> --- a/drivers/md/raid5.c
> +++ b/drivers/md/raid5.c
> @@ -760,6 +760,7 @@ static void unlock_two_stripes(struct stripe_head *sh1, struct stripe_head *sh2)
> static bool stripe_can_batch(struct stripe_head *sh)
> {
> return test_bit(STRIPE_BATCH_READY, &sh->state) &&
> + !test_bit(STRIPE_BITMAP_PENDING, &sh->state) &&
> is_full_stripe_write(sh);
> }
>
> @@ -3007,14 +3008,27 @@ static int add_stripe_bio(struct stripe_head *sh, struct bio *bi, int dd_idx,
> pr_debug("added bi b#%llu to stripe s#%llu, disk %d.\n",
> (unsigned long long)(*bip)->bi_iter.bi_sector,
> (unsigned long long)sh->sector, dd_idx);
> - spin_unlock_irq(&sh->stripe_lock);
>
> if (conf->mddev->bitmap && firstwrite) {
> + /* Cannot hold spinlock over bitmap_startwrite,
> + * but must ensure this isn't added to a batch until
> + * we have added to the bitmap and set bm_seq.
> + * So set STRIPE_BITMAP_PENDING to prevent
> + * batching.
> + * Only the first thread to add a write to a stripe
> + * can set this bit, so we "own" it.
> + */
> + WARN_ON(test_bit(STRIPE_BITMAP_PENDING, &sh->state));
I keep hitting this. The firstwrite is set for every device.
Thanks,
Shaohua
* Re: [PATCH 2/8] md/raid5: Ensure a batch member is not handled prematurely.
From: NeilBrown @ 2015-05-27 0:36 UTC
To: Shaohua Li; +Cc: linux-raid
On Tue, 26 May 2015 17:10:05 -0700 Shaohua Li <shli@kernel.org> wrote:
> On Wed, May 27, 2015 at 09:34:51AM +1000, NeilBrown wrote:
> > On Wed, 27 May 2015 08:35:32 +1000 NeilBrown <neilb@suse.de> wrote:
> >
> > > Could you please review and possibly test the patch below?
> > >
> >
> > well... that patch had a fairly obvious double-lock bug.
> > Try this one.
> > (oh, just saw your email that you spotted the lock bug :-)
> >
> >
> > NeilBrown
> >
> > From: NeilBrown <neilb@suse.de>
> > Date: Wed, 27 May 2015 08:43:45 +1000
> > Subject: [PATCH] md/raid5: close race between STRIPE_BIT_DELAY and batching.
> >
> > The first time a write is added to a stripe, we need to set the
> > bitmap bits (if a bitmap is active).
> > While doing that the stripe is not locked and other writes could
> > be added and then the stripe could be added to a batch.
> > Once it has entered the batch it is too late to set STRIPE_BIT_DELAY,
> > as the batch head has taken over deciding when the stripe will be written.
> >
> > We cannot hold the spinlock while adding the bitmap bit,
> > so introduce a new stripe_head flag 'STRIPE_BITMAP_PENDING' which
> > indicates that adding to the bitmap is pending. This prevents
> > the stripe from being added to a batch.
> >
> > Only the first thread to add a write to a stripe can set this bit,
> > so it is safe for it to clear it again when it is done.
> >
> > Reported-by: Shaohua Li <shli@kernel.org>
> > Signed-off-by: NeilBrown <neilb@suse.de>
> >
> > diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
> > index 73b5376dad3b..dae587ecdf71 100644
> > --- a/drivers/md/raid5.c
> > +++ b/drivers/md/raid5.c
> > @@ -760,6 +760,7 @@ static void unlock_two_stripes(struct stripe_head *sh1, struct stripe_head *sh2)
> > static bool stripe_can_batch(struct stripe_head *sh)
> > {
> > return test_bit(STRIPE_BATCH_READY, &sh->state) &&
> > + !test_bit(STRIPE_BITMAP_PENDING, &sh->state) &&
> > is_full_stripe_write(sh);
> > }
> >
> > @@ -3007,14 +3008,27 @@ static int add_stripe_bio(struct stripe_head *sh, struct bio *bi, int dd_idx,
> > pr_debug("added bi b#%llu to stripe s#%llu, disk %d.\n",
> > (unsigned long long)(*bip)->bi_iter.bi_sector,
> > (unsigned long long)sh->sector, dd_idx);
> > - spin_unlock_irq(&sh->stripe_lock);
> >
> > if (conf->mddev->bitmap && firstwrite) {
> > + /* Cannot hold spinlock over bitmap_startwrite,
> > + * but must ensure this isn't added to a batch until
> > + * we have added to the bitmap and set bm_seq.
> > + * So set STRIPE_BITMAP_PENDING to prevent
> > + * batching.
> > + * Only the first thread to add a write to a stripe
> > + * can set this bit, so we "own" it.
> > + */
> > + WARN_ON(test_bit(STRIPE_BITMAP_PENDING, &sh->state));
> I keep hitting this. The firstwrite is set for every device.
>
I wonder why I'm not.... maybe because I'm using /dev/loop devices and they
behave a bit differently.
Of course 'firstwrite' is not per-stripe, it is per-device-in-a-stripe.
So it will be set for every device.
Which is rather pointless really, we only need to set the bitmap once for the
whole stripe. Maybe it was easier the other way.
Could you try this?
Thanks!
NeilBrown
From: NeilBrown <neilb@suse.de>
Date: Wed, 27 May 2015 08:43:45 +1000
Subject: [PATCH] md/raid5: close race between STRIPE_BIT_DELAY and batching.
When we add a write to a stripe we need to make sure the bitmap
bit is set. While doing that the stripe is not locked so it could
be added to a batch after which further changes to STRIPE_BIT_DELAY
and ->bm_seq are ineffective.
So we need to hold off adding the stripe to a batch until bitmap_startwrite has
completed at least once, and we need to avoid further changes to
STRIPE_BIT_DELAY once the stripe has been added to a batch.
If a bitmap_startwrite() completes after the stripe was added to a
batch, it will not have set the bit, only incremented a counter, so no
extra delay of the stripe is needed.
Reported-by: Shaohua Li <shli@kernel.org>
Signed-off-by: NeilBrown <neilb@suse.de>
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 73b5376dad3b..5d5ce97c2210 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -760,6 +760,7 @@ static void unlock_two_stripes(struct stripe_head *sh1, struct stripe_head *sh2)
static bool stripe_can_batch(struct stripe_head *sh)
{
return test_bit(STRIPE_BATCH_READY, &sh->state) &&
+ !test_bit(STRIPE_BITMAP_PENDING, &sh->state) &&
is_full_stripe_write(sh);
}
@@ -3007,14 +3008,32 @@ static int add_stripe_bio(struct stripe_head *sh, struct bio *bi, int dd_idx,
pr_debug("added bi b#%llu to stripe s#%llu, disk %d.\n",
(unsigned long long)(*bip)->bi_iter.bi_sector,
(unsigned long long)sh->sector, dd_idx);
- spin_unlock_irq(&sh->stripe_lock);
if (conf->mddev->bitmap && firstwrite) {
+ /* Cannot hold spinlock over bitmap_startwrite,
+ * but must ensure this isn't added to a batch until
+ * we have added to the bitmap and set bm_seq.
+ * So set STRIPE_BITMAP_PENDING to prevent
+ * batching.
+ * If multiple add_stripe_bio() calls race here they
+ * must all set STRIPE_BITMAP_PENDING. So only the first one
+ * to complete "bitmap_startwrite" gets to set
+ * STRIPE_BIT_DELAY. This is important as once a stripe
+ * is added to a batch, STRIPE_BIT_DELAY cannot be changed
+ * any more.
+ */
+ set_bit(STRIPE_BITMAP_PENDING, &sh->state);
+ spin_unlock_irq(&sh->stripe_lock);
bitmap_startwrite(conf->mddev->bitmap, sh->sector,
STRIPE_SECTORS, 0);
- sh->bm_seq = conf->seq_flush+1;
- set_bit(STRIPE_BIT_DELAY, &sh->state);
+ spin_lock_irq(&sh->stripe_lock);
+ clear_bit(STRIPE_BITMAP_PENDING, &sh->state);
+ if (!sh->batch_head) {
+ sh->bm_seq = conf->seq_flush+1;
+ set_bit(STRIPE_BIT_DELAY, &sh->state);
+ }
}
+ spin_unlock_irq(&sh->stripe_lock);
if (stripe_can_batch(sh))
stripe_add_to_batch_list(conf, sh);
diff --git a/drivers/md/raid5.h b/drivers/md/raid5.h
index d7b2bc8b756f..02c3bf8fbfe7 100644
--- a/drivers/md/raid5.h
+++ b/drivers/md/raid5.h
@@ -337,6 +337,9 @@ enum {
STRIPE_ON_RELEASE_LIST,
STRIPE_BATCH_READY,
STRIPE_BATCH_ERR,
+ STRIPE_BITMAP_PENDING, /* Being added to bitmap, don't add
+ * to batch yet.
+ */
};
#define STRIPE_EXPAND_SYNC_FLAGS \
Thread overview: 17+ messages
2015-05-22 5:30 [PATCH 0/8] Fixes for md/raid5 stripe batching code NeilBrown
2015-05-22 5:30 ` [PATCH 5/8] md/raid5: add handle_flags arg to break_stripe_batch_list NeilBrown
2015-05-22 5:30 ` [PATCH 4/8] md/raid5: duplicate some more handle_stripe_clean_event code in break_stripe_batch_list NeilBrown
2015-05-22 5:30 ` [PATCH 7/8] md/raid5: call break_stripe_batch_list from handle_stripe_clean_event NeilBrown
2015-05-22 5:30 ` [PATCH 6/8] md/raid5: be more selective about distributing flags across batch NeilBrown
2015-05-22 5:30 ` [PATCH 1/8] md/raid5: ensure whole batch is delayed for all required bitmap updates NeilBrown
2015-05-22 5:30 ` [PATCH 3/8] md/raid5: remove condition test from check_break_stripe_batch_list NeilBrown
2015-05-22 5:30 ` [PATCH 2/8] md/raid5: Ensure a batch member is not handled prematurely NeilBrown
2015-05-22 23:44 ` Shaohua Li
2015-05-23 0:26 ` NeilBrown
2015-05-26 18:16 ` Shaohua Li
2015-05-26 22:35 ` NeilBrown
2015-05-26 23:21 ` Shaohua Li
2015-05-26 23:34 ` NeilBrown
2015-05-27 0:10 ` Shaohua Li
2015-05-27 0:36 ` NeilBrown
2015-05-22 5:30 ` [PATCH 8/8] md/raid5: break stripe-batches when the array has failed NeilBrown