* md raid acceleration and the async_tx api

From: Yuri Tikhonov @ 2007-08-27  8:49 UTC
To: dan.j.williams; +Cc: linux-raid, Wolfgang Denk, dzu

 Hello,

 I tested the h/w-accelerated RAID-5 code with a kernel whose PAGE_SIZE is set
to 64KB and found that the bonnie++ application hangs up during the
"Re-writing" test. After some investigation I discovered that the hang occurs
because one of the mpage_end_io_read() calls is missing (these are the
callbacks initiated from the ops_complete_biofill() function).

 My low-level ADMA driver (the ppc440spe one) successfully initiated the
ops_complete_biofill() callback, but ops_complete_biofill() itself skipped
calling the bi_end_io() handler of the completed bio (the current dev->read)
because, while this bio was being processed, another request had arrived for
the same sh (the current dev_q->toread). Thus ops_complete_biofill() scheduled
another biofill operation which, as a result, overwrote the unacknowledged bio
(dev->read in ops_run_biofill()), and so we lost the previous dev->read bio
completely.

 Here is a patch that solves this problem. Perhaps this could be implemented
in a more elegant and effective way. What are your thoughts regarding this?

 Regards, Yuri

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 08b4893..7abc96b 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -838,11 +838,24 @@ static void ops_complete_biofill(void *stripe_head_ref)
                 /* acknowledge completion of a biofill operation */
                 /* and check if we need to reply to a read request
                  */
-                if (test_bit(R5_Wantfill, &dev_q->flags) && !dev_q->toread) {
+                if (test_bit(R5_Wantfill, &dev_q->flags)) {
                         struct bio *rbi, *rbi2;
                         struct r5dev *dev = &sh->dev[i];
 
-                        clear_bit(R5_Wantfill, &dev_q->flags);
+                        /* There is a chance that another fill operation
+                         * had been scheduled for this dev while we
+                         * processed sh. In this case do one of the following
+                         * alternatives:
+                         * - if there is no active completed biofill for the dev
+                         *   then go to the next dev leaving Wantfill set;
+                         * - if there is active completed biofill for the dev
+                         *   then ack it but leave Wantfill set.
+                         */
+                        if (dev_q->toread && !dev->read)
+                                continue;
+
+                        if (!dev_q->toread)
+                                clear_bit(R5_Wantfill, &dev_q->flags);
 
                         /* The access to dev->read is outside of the
                          * spin_lock_irq(&conf->device_lock), but is protected
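
To make the interleaving concrete, the lost-completion scenario described above
can be replayed with a small stand-alone user-space model. The structure and
function names below are invented for illustration only; they merely mimic
dev->read, dev_q->toread and the Wantfill flag and are not the actual raid5.c
code:

/* Simplified, self-contained model of the completion race described above. */
#include <stdio.h>
#include <stdbool.h>

struct fake_dev {
        const char *read;    /* bio whose fill just completed (needs bi_end_io) */
        const char *toread;  /* new bio queued while the fill was in flight     */
        bool wantfill;       /* models the R5_Wantfill flag                     */
};

/* Models the original ops_complete_biofill(): it only acknowledges the
 * completed bio when no new request is pending, otherwise it skips the ack
 * and leaves dev->read in place.
 */
static void complete_biofill(struct fake_dev *dev)
{
        if (dev->wantfill && !dev->toread) {
                printf("bi_end_io(%s) called\n", dev->read);
                dev->read = NULL;
                dev->wantfill = false;
        } else {
                printf("completion of %s skipped (%s pending)\n",
                       dev->read, dev->toread);
        }
}

/* Models ops_run_biofill(): starts the next fill and *overwrites* dev->read. */
static void run_biofill(struct fake_dev *dev)
{
        dev->read = dev->toread;   /* the old, unacknowledged bio is lost here */
        dev->toread = NULL;
}

int main(void)
{
        struct fake_dev dev = { .read = "bio A", .toread = NULL, .wantfill = true };

        dev.toread = "bio B";      /* a new read arrives before the ack        */
        complete_biofill(&dev);    /* bio A is skipped ...                      */
        run_biofill(&dev);         /* ... and then overwritten by bio B         */
        complete_biofill(&dev);    /* only bio B ever sees bi_end_io            */
        return 0;
}

Run sequentially, the model shows that once a second request arrives before the
acknowledgment, "bio A" never receives its bi_end_io() call, which corresponds
to the missing mpage_end_io_read() observed in the bonnie++ hang.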
* RE: md raid acceleration and the async_tx api

From: Williams, Dan J @ 2007-08-27 19:12 UTC
To: Yuri Tikhonov; +Cc: linux-raid, Wolfgang Denk, dzu

> From: Yuri Tikhonov [mailto:yur@emcraft.com]
>
>  Hello,
>
>  I tested the h/w-accelerated RAID-5 code with a kernel whose PAGE_SIZE is
> set to 64KB and found that the bonnie++ application hangs up during the
> "Re-writing" test. After some investigation I discovered that the hang
> occurs because one of the mpage_end_io_read() calls is missing (these are
> the callbacks initiated from the ops_complete_biofill() function).
>
>  My low-level ADMA driver (the ppc440spe one) successfully initiated the
> ops_complete_biofill() callback, but ops_complete_biofill() itself skipped
> calling the bi_end_io() handler of the completed bio (the current dev->read)
> because, while this bio was being processed, another request had arrived for
> the same sh (the current dev_q->toread). Thus ops_complete_biofill()
> scheduled another biofill operation which, as a result, overwrote the
> unacknowledged bio (dev->read in ops_run_biofill()), and so we lost the
> previous dev->read bio completely.
>
>  Here is a patch that solves this problem. Perhaps this could be implemented
> in a more elegant and effective way. What are your thoughts regarding this?
>
>  Regards, Yuri
>
> diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
> index 08b4893..7abc96b 100644
> --- a/drivers/md/raid5.c
> +++ b/drivers/md/raid5.c
> @@ -838,11 +838,24 @@ static void ops_complete_biofill(void *stripe_head_ref)
>                  /* acknowledge completion of a biofill operation */
>                  /* and check if we need to reply to a read request
>                   */
> -                if (test_bit(R5_Wantfill, &dev_q->flags) && !dev_q->toread) {
> +                if (test_bit(R5_Wantfill, &dev_q->flags)) {
>                          struct bio *rbi, *rbi2;
>                          struct r5dev *dev = &sh->dev[i];
>
> -                        clear_bit(R5_Wantfill, &dev_q->flags);
> +                        /* There is a chance that another fill operation
> +                         * had been scheduled for this dev while we
> +                         * processed sh. In this case do one of the following
> +                         * alternatives:
> +                         * - if there is no active completed biofill for the dev
> +                         *   then go to the next dev leaving Wantfill set;
> +                         * - if there is active completed biofill for the dev
> +                         *   then ack it but leave Wantfill set.
> +                         */
> +                        if (dev_q->toread && !dev->read)
> +                                continue;
> +
> +                        if (!dev_q->toread)
> +                                clear_bit(R5_Wantfill, &dev_q->flags);
>
>                          /* The access to dev->read is outside of the
>                           * spin_lock_irq(&conf->device_lock), but is protected

This still looks racy... I think the complete fix is to make the
R5_Wantfill and dev_q->toread accesses atomic.  Please test the following
patch (also attached) and let me know if it fixes what you are seeing.

Applies on top of git://lost.foo-projects.org/~dwillia2/git/iop md-for-linus

---

raid5: fix ops_complete_biofill race in the asynchronous offload case

Protect against dev_q->toread toggling while testing and clearing the
R5_Wantfill bit.  This change prevents all asynchronous completions
(tasklets) from running during handle_stripe.
---

 drivers/md/raid5.c |   16 +++++++++-------
 1 files changed, 9 insertions(+), 7 deletions(-)

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 2f9022d..91c14c6 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -824,6 +824,7 @@ static void ops_complete_biofill(void *stripe_head_ref)
                 (unsigned long long)sh->sector);
 
         /* clear completed biofills */
+        spin_lock(&sq->lock);
         for (i = sh->disks; i--; ) {
                 struct r5dev *dev = &sh->dev[i];
                 struct r5_queue_dev *dev_q = &sq->dev[i];
@@ -861,6 +862,7 @@ static void ops_complete_biofill(void *stripe_head_ref)
         }
         clear_bit(STRIPE_OP_BIOFILL, &sh->ops.ack);
         clear_bit(STRIPE_OP_BIOFILL, &sh->ops.pending);
+        spin_unlock(&sq->lock);
 
         return_io(return_bi);
 
@@ -2279,7 +2281,7 @@ static int add_queue_bio(struct stripe_queue *sq, struct bio *bi, int dd_idx,
                 (unsigned long long)bi->bi_sector,
                 (unsigned long long)sq->sector);
 
-        spin_lock(&sq->lock);
+        spin_lock_bh(&sq->lock);
         spin_lock_irq(&conf->device_lock);
         sh = sq->sh;
         if (forwrite) {
@@ -2306,7 +2308,7 @@ static int add_queue_bio(struct stripe_queue *sq, struct bio *bi, int dd_idx,
         *bip = bi;
         bi->bi_phys_segments ++;
         spin_unlock_irq(&conf->device_lock);
-        spin_unlock(&sq->lock);
+        spin_unlock_bh(&sq->lock);
 
         pr_debug("added bi b#%llu to stripe s#%llu, disk %d.\n",
                 (unsigned long long)bi->bi_sector,
@@ -2339,7 +2341,7 @@ static int add_queue_bio(struct stripe_queue *sq, struct bio *bi, int dd_idx,
  overlap:
         set_bit(R5_Overlap, &sh->dev[dd_idx].flags);
         spin_unlock_irq(&conf->device_lock);
-        spin_unlock(&sq->lock);
+        spin_unlock_bh(&sq->lock);
         return 0;
 }
 
@@ -3127,7 +3129,7 @@ static void handle_stripe5(struct stripe_head *sh)
                 atomic_read(&sh->count), sq->pd_idx, sh->ops.pending,
                 sh->ops.ack, sh->ops.complete);
 
-        spin_lock(&sq->lock);
+        spin_lock_bh(&sq->lock);
         clear_bit(STRIPE_HANDLE, &sh->state);
 
         s.syncing = test_bit(STRIPE_SYNCING, &sh->state);
@@ -3370,7 +3372,7 @@ static void handle_stripe5(struct stripe_head *sh)
         if (sh->ops.count)
                 pending = get_stripe_work(sh);
 
-        spin_unlock(&sq->lock);
+        spin_unlock_bh(&sq->lock);
 
         if (pending)
                 raid5_run_ops(sh, pending);
@@ -3397,7 +3399,7 @@ static void handle_stripe6(struct stripe_head *sh, struct page *tmp_page)
                 atomic_read(&sh->count), pd_idx, r6s.qd_idx);
         memset(&s, 0, sizeof(s));
 
-        spin_lock(&sq->lock);
+        spin_lock_bh(&sq->lock);
         clear_bit(STRIPE_HANDLE, &sh->state);
 
         s.syncing = test_bit(STRIPE_SYNCING, &sh->state);
@@ -3576,7 +3578,7 @@ static void handle_stripe6(struct stripe_head *sh, struct page *tmp_page)
         if (s.expanding && s.locked == 0)
                 handle_stripe_expansion(conf, sh, &r6s);
 
-        spin_unlock(&sq->lock);
+        spin_unlock_bh(&sq->lock);
 
         return_io(return_bi);

[-- Attachment #2: md-accel-fix-ops-complete-biofill-race.patch --]

raid5: fix ops_complete_biofill race in the asynchronous offload case

From: Dan Williams <dan.j.williams@intel.com>

Protect against dev_q->toread toggling while testing and clearing the
R5_Wantfill bit.  This change prevents all asynchronous completions
(tasklets) from running during handle_stripe.
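
The locking change above relies on the async_tx completion callback running in
tasklet (softirq) context. A minimal sketch of that general kernel locking
pattern follows; the names are invented, the tasklet-based completion is an
assumption for illustration, and this is not code from the md driver:

#include <linux/module.h>
#include <linux/spinlock.h>
#include <linux/interrupt.h>

static DEFINE_SPINLOCK(example_lock);
static struct tasklet_struct example_tasklet;

/* Runs in softirq context, like an offload-engine completion callback. */
static void example_completion(unsigned long data)
{
        spin_lock(&example_lock);      /* plain lock: already in BH context */
        /* ... acknowledge the completed operation ... */
        spin_unlock(&example_lock);
}

/* Runs in process context, like add_queue_bio() / handle_stripe(). */
static void example_submit(void)
{
        spin_lock_bh(&example_lock);   /* keep the tasklet off this CPU ...  */
        /* ... queue new work; the completion cannot run concurrently here  */
        spin_unlock_bh(&example_lock); /* ... until the lock is dropped      */

        tasklet_schedule(&example_tasklet);
}

static int __init example_init(void)
{
        tasklet_init(&example_tasklet, example_completion, 0);
        example_submit();
        return 0;
}

static void __exit example_exit(void)
{
        tasklet_kill(&example_tasklet);
}

module_init(example_init);
module_exit(example_exit);
MODULE_LICENSE("GPL");

Taking the lock with plain spin_lock() on the process-context side would let
the tasklet fire on the same CPU while the lock is held and spin forever; that
deadlock is what the spin_lock_bh() conversions in the patch above avoid.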
* Re: md raid acceleration and the async_tx api

From: Yuri Tikhonov @ 2007-08-30 14:57 UTC
To: Williams, Dan J; +Cc: linux-raid, Wolfgang Denk, dzu

 Hi Dan,

On Monday 27 August 2007 23:12, you wrote:
> This still looks racy... I think the complete fix is to make the
> R5_Wantfill and dev_q->toread accesses atomic.  Please test the
> following patch (also attached) and let me know if it fixes what you are
> seeing:

 Your approach doesn't help: the Bonnie++ utility hangs up during the
"Re-writing" stage.

 Note that before applying your patch I rolled back my fix in the
ops_complete_biofill() function. Do I understand correctly that your patch
should be used *instead of* mine rather than *together with* it?

 Regards, Yuri
* Re: md raid acceleration and the async_tx api

From: Dan Williams @ 2007-08-30 19:34 UTC
To: Yuri Tikhonov; +Cc: linux-raid, Wolfgang Denk, dzu

On 8/30/07, Yuri Tikhonov <yur@emcraft.com> wrote:
>
>  Hi Dan,
>
> On Monday 27 August 2007 23:12, you wrote:
> > This still looks racy... I think the complete fix is to make the
> > R5_Wantfill and dev_q->toread accesses atomic.  Please test the
> > following patch (also attached) and let me know if it fixes what you are
> > seeing:
>
>  Your approach doesn't help: the Bonnie++ utility hangs up during the
> "Re-writing" stage.
>
Looking at it again I see that what I added would not affect the failure you
are seeing.  However, I noticed that you are using a broken version of the
stripe-queue and cache_arbiter patches.  In the current revisions the
dev_q->flags field has been moved back to dev->flags, which fixes a data
corruption issue and could potentially address the hang you are seeing.
The latest revisions are:

raid5: add the stripe_queue object for tracking raid io requests (rev2)
raid5: use stripe_queues to prioritize the "most deserving" requests (rev6)

>  Note that before applying your patch I rolled back my fix in the
> ops_complete_biofill() function. Do I understand correctly that your patch
> should be used *instead of* mine rather than *together with* it?
>
You understood correctly.  The attached patch integrates your change to keep
R5_Wantfill set while also protecting the 'more_to_read' case.  Please try it
on top of the latest stripe-queue changes [1] (instead of the other proposed
patches).

> Regards, Yuri

Thanks,
Dan

[1] git fetch -f git://lost.foo-projects.org/~dwillia2/git/iop md-for-linus:refs/heads/md-for-linus

[-- Attachment #2: fix-ops-complete-biofill.patch --]

raid5: fix the 'more_to_read' case in ops_complete_biofill

From: Dan Williams <dan.j.williams@intel.com>

Prevent ops_complete_biofill from running concurrently with add_queue_bio
---

 drivers/md/raid5.c |   33 +++++++++++++++++++--------------
 1 files changed, 19 insertions(+), 14 deletions(-)

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 2f9022d..1c591d3 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -828,22 +828,19 @@ static void ops_complete_biofill(void *stripe_head_ref)
                 struct r5dev *dev = &sh->dev[i];
                 struct r5_queue_dev *dev_q = &sq->dev[i];
 
-                /* check if this stripe has new incoming reads */
+                /* 1/ acknowledge completion of a biofill operation
+                 * 2/ check if we need to reply to a read request.
+                 * 3/ check if we need to reschedule handle_stripe
+                 */
                 if (dev_q->toread)
                         more_to_read++;
 
-                /* acknowledge completion of a biofill operation */
-                /* and check if we need to reply to a read request
-                 */
-                if (test_bit(R5_Wantfill, &dev->flags) && !dev_q->toread) {
+                if (test_bit(R5_Wantfill, &dev->flags)) {
                         struct bio *rbi, *rbi2;
 
-                        clear_bit(R5_Wantfill, &dev->flags);
-                        /* The access to dev->read is outside of the
-                         * spin_lock_irq(&conf->device_lock), but is protected
-                         * by the STRIPE_OP_BIOFILL pending bit
-                         */
-                        BUG_ON(!dev->read);
+                        if (!dev_q->toread)
+                                clear_bit(R5_Wantfill, &dev->flags);
+
                         rbi = dev->read;
                         dev->read = NULL;
                         while (rbi && rbi->bi_sector <
@@ -899,8 +896,15 @@ static void ops_run_biofill(struct stripe_head *sh)
         }
 
         atomic_inc(&sh->count);
+
+        /* spin_lock prevents ops_complete_biofill from running concurrently
+         * with add_queue_bio in the synchronous case
+         */
+        spin_lock(&sq->lock);
         async_trigger_callback(ASYNC_TX_DEP_ACK | ASYNC_TX_ACK, tx,
                 ops_complete_biofill, sh);
+        spin_unlock(&sq->lock);
+
 }
 
 static void ops_complete_compute5(void *stripe_head_ref)
@@ -2279,7 +2283,8 @@ static int add_queue_bio(struct stripe_queue *sq, struct bio *bi, int dd_idx,
                 (unsigned long long)bi->bi_sector,
                 (unsigned long long)sq->sector);
 
-        spin_lock(&sq->lock);
+        /* prevent asynchronous completions from running */
+        spin_lock_bh(&sq->lock);
         spin_lock_irq(&conf->device_lock);
         sh = sq->sh;
         if (forwrite) {
@@ -2306,7 +2311,7 @@ static int add_queue_bio(struct stripe_queue *sq, struct bio *bi, int dd_idx,
         *bip = bi;
         bi->bi_phys_segments ++;
         spin_unlock_irq(&conf->device_lock);
-        spin_unlock(&sq->lock);
+        spin_unlock_bh(&sq->lock);
 
         pr_debug("added bi b#%llu to stripe s#%llu, disk %d.\n",
                 (unsigned long long)bi->bi_sector,
@@ -2339,7 +2344,7 @@ static int add_queue_bio(struct stripe_queue *sq, struct bio *bi, int dd_idx,
  overlap:
         set_bit(R5_Overlap, &sh->dev[dd_idx].flags);
         spin_unlock_irq(&conf->device_lock);
-        spin_unlock(&sq->lock);
+        spin_unlock_bh(&sq->lock);
         return 0;
 }
* Re: md raid acceleration and the async_tx api

From: Yuri Tikhonov @ 2007-09-13  9:38 UTC
To: Williams, Dan J; +Cc: Wolfgang Denk, linux-raid

 Hi Dan,

On Friday 07 September 2007 20:02, you wrote:
> You need to fetch from the 'md-for-linus' tree.  But I have attached
> them as well.
>
> git fetch git://lost.foo-projects.org/~dwillia2/git/iop
> md-for-linus:md-for-linus

 Thanks.

 An unrelated question: comparing the drivers/md/raid5.c file in Linus's
2.6.23-rc6 tree and in your md-for-linus one, I found the following
difference in the expand-related part of the handle_stripe5() function:

-                s.locked += handle_write_operations5(sh, 1, 1);
+                s.locked += handle_write_operations5(sh, 0, 1);

 That is, in your case we pass rcw=0, whereas in Linus's case
handle_write_operations5() is called with rcw=1. Which code is correct?

 Regards, Yuri
* Re: md raid acceleration and the async_tx api

From: Dan Williams @ 2007-09-13 16:52 UTC
To: Yuri Tikhonov; +Cc: Wolfgang Denk, linux-raid, Andrew Morton, Neil Brown

On 9/13/07, Yuri Tikhonov <yur@emcraft.com> wrote:
>
>  Hi Dan,
>
> On Friday 07 September 2007 20:02, you wrote:
> > You need to fetch from the 'md-for-linus' tree.  But I have attached
> > them as well.
> >
> > git fetch git://lost.foo-projects.org/~dwillia2/git/iop
> > md-for-linus:md-for-linus
>
>  Thanks.
>
>  An unrelated question: comparing the drivers/md/raid5.c file in Linus's
> 2.6.23-rc6 tree and in your md-for-linus one, I found the following
> difference in the expand-related part of the handle_stripe5() function:
>
> -                s.locked += handle_write_operations5(sh, 1, 1);
> +                s.locked += handle_write_operations5(sh, 0, 1);
>
>  That is, in your case we pass rcw=0, whereas in Linus's case
> handle_write_operations5() is called with rcw=1. Which code is correct?
>
There was a recent bug discovered in my changes to the expansion code.
The fix has now gone into Linus's tree through Andrew's tree.  I kept
the fix out of my 'md-for-linus' tree to prevent it getting dropped
from -mm due to automatic git-tree merge-detection.  I have now
rebased my git tree so everything is in sync.

However, after talking with Neil at LCE we came to the conclusion that
it would be best if I just sent patches since git tree updates tend to
not get enough review, and because the patch sets will be more
manageable now that the big pieces of the acceleration infrastructure
have been merged.

>  Regards, Yuri

Thanks,
Dan
* Re: md raid acceleration and the async_tx api

From: Mr. James W. Laferriere @ 2007-09-13 21:14 UTC
To: Dan Williams; +Cc: linux-raid maillist

        Hello Dan,

On Thu, 13 Sep 2007, Dan Williams wrote:
> On 9/13/07, Yuri Tikhonov <yur@emcraft.com> wrote:
>>  Hi Dan,
>>  On Friday 07 September 2007 20:02, you wrote:
>>> You need to fetch from the 'md-for-linus' tree.  But I have attached
>>> them as well.
>>>
>>> git fetch git://lost.foo-projects.org/~dwillia2/git/iop
>>> md-for-linus:md-for-linus
>>
>>  Thanks.
>>
>>  An unrelated question: comparing the drivers/md/raid5.c file in Linus's
>> 2.6.23-rc6 tree and in your md-for-linus one, I found the following
>> difference in the expand-related part of the handle_stripe5() function:
>>
>> -                s.locked += handle_write_operations5(sh, 1, 1);
>> +                s.locked += handle_write_operations5(sh, 0, 1);
>>
>>  That is, in your case we pass rcw=0, whereas in Linus's case
>> handle_write_operations5() is called with rcw=1. Which code is correct?
>>
> There was a recent bug discovered in my changes to the expansion code.
> The fix has now gone into Linus's tree through Andrew's tree.  I kept
> the fix out of my 'md-for-linus' tree to prevent it getting dropped
> from -mm due to automatic git-tree merge-detection.  I have now
> rebased my git tree so everything is in sync.
>
> However, after talking with Neil at LCE we came to the conclusion that
> it would be best if I just sent patches since git tree updates tend to
> not get enough review, and because the patch sets will be more
> manageable now that the big pieces of the acceleration infrastructure
> have been merged.
>
>>  Regards, Yuri
>
> Thanks,
> Dan

        Does this discussion of patches include any changes to cure the 'BUG'
instance I reported?

        ie: raid5:md3: kernel BUG , followed by , Silent halt .

                Tia, JimL
--
+-------------------------------------------------------------------+
| James W. Laferriere    | System  Techniques   | Give me VMS       |
| Network Engineer       | 663 Beaumont Blvd    | Give me Linux     |
| babydr@baby-dragons.com | Pacifica, CA. 94044 | only on AXP       |
+-------------------------------------------------------------------+
* RE: md raid acceleration and the async_tx api

From: Williams, Dan J @ 2007-09-13 21:30 UTC
To: Mr. James W. Laferriere; +Cc: linux-raid maillist

> From: Mr. James W. Laferriere [mailto:babydr@baby-dragons.com]
>
>         Hello Dan,
>
> On Thu, 13 Sep 2007, Dan Williams wrote:
> > On 9/13/07, Yuri Tikhonov <yur@emcraft.com> wrote:
> >>  Hi Dan,
> >>  On Friday 07 September 2007 20:02, you wrote:
> >>> You need to fetch from the 'md-for-linus' tree.  But I have attached
> >>> them as well.
> >>>
> >>> git fetch git://lost.foo-projects.org/~dwillia2/git/iop
> >>> md-for-linus:md-for-linus
> >>
> >>  Thanks.
> >>
> >>  An unrelated question: comparing the drivers/md/raid5.c file in Linus's
> >> 2.6.23-rc6 tree and in your md-for-linus one, I found the following
> >> difference in the expand-related part of the handle_stripe5() function:
> >>
> >> -                s.locked += handle_write_operations5(sh, 1, 1);
> >> +                s.locked += handle_write_operations5(sh, 0, 1);
> >>
> >>  That is, in your case we pass rcw=0, whereas in Linus's case
> >> handle_write_operations5() is called with rcw=1. Which code is correct?
> >>
> > There was a recent bug discovered in my changes to the expansion code.
> > The fix has now gone into Linus's tree through Andrew's tree.  I kept
> > the fix out of my 'md-for-linus' tree to prevent it getting dropped
> > from -mm due to automatic git-tree merge-detection.  I have now
> > rebased my git tree so everything is in sync.
> >
> > However, after talking with Neil at LCE we came to the conclusion that
> > it would be best if I just sent patches since git tree updates tend to
> > not get enough review, and because the patch sets will be more
> > manageable now that the big pieces of the acceleration infrastructure
> > have been merged.
> >
> >>  Regards, Yuri
> >
> > Thanks,
> > Dan
>
>         Does this discussion of patches include any changes to cure the 'BUG'
> instance I reported?
>
>         ie: raid5:md3: kernel BUG , followed by , Silent halt .

No, this is referring to:
http://marc.info/?l=linux-raid&m=118845398229443&w=2

The BUG you reported currently looks to be caused by interactions with the
bitmap support code... still investigating.

>                 Tia, JimL

Thanks,
Dan