[BUG] The kernel thread for md RAID1 could cause a md RAID1 array deadlock

linux-raid.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [BUG] The kernel thread for md RAID1 could cause a md RAID1 array deadlock
@ 2008-01-15  3:10 K.Tanaka
  2008-01-24  3:28 ` Neil Brown
  0 siblings, 1 reply; 4+ messages in thread
From: K.Tanaka @ 2008-01-15  3:10 UTC (permalink / raw)
  To: linux-scsi; +Cc: linux-kernel, linux-raid, dm-devel


This message describes the details about md-RAID1 issue found by
testing the md RAID1 using the SCSI fault injection framework.

Abstract:
Both the error handler for md RAID1 and write access request to the md RAID1
use raid1d kernel thread. The nr_pending flag could cause a race condition
in raid1d, results in a raid1d deadlock.

Details:
         error handling                            write operation
   ------------------------------------------------------------------------------
    A-1. Issue a read request

    A-2  SCSI error detected
                                               B-1. make_request() for raid1 starts.
    A-3. raid1_end_read_request() is called
         in the interrupt context. It detects
         read error and wakes up raid1d
         kernel thread.                        B-2. make_request() calls wait_barrier() to
                                                    increment nr_pending flag.
    A-4. raid1d wake up

    A-5. raid1d calls freeze_array() and waiting
         for nr_pending to be decremented.
         That means stop IO and wait for       B-3. make_request() wakes up raid1d kernel thread
         everything to go quite.                    to send write request to the lower layer.

                                               B-4. raid1d wake up (already waken up by A-3)

                   (****  process stalls here because A-5 never ends ****)

    A-6. raid1d calls fix_read_error() to
         handle read error.                    B-5. raid1d calls generic_make_request() for write request.

                                               B-6. raid1_end_write_request() is called in the
                                                    interrupt context when the write access is completed
                                                    and nr_pending flag is decremented.
The deadlock mechanism:
If raid1d waken up by detecting read error (A-4) goes into freeze_array()
right after make_request() for write request has incremented nr_pending flag(B-2),
raid1d stalls waiting for nr_pending flag to be decremented (A-5).
On the other hand, nr_pending flag incremented by make_request() for write request
will never be decremented because the flag can be decremented after raid1d issues
generic_make_request() (B-5, B-6) but now raid1d is stopped.

This problem could could easily be reproduced with by using the new fault injection framework,
using "no response from the SCSI device" simulation.
However, it could also occur if raid1 error handler contends with write
operation,  but with low probability.

I will report the other problems after I clean up and post the code for
the scsi fault injection framework.

-- 
------------------------------------------------------------------------
Kenichi TANAKA    | Open Source Software Platform Development Division
                  | Computers Software Operations Unit, NEC Corporation
                  | k-tanaka@ce.jp.nec.com



^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: [BUG] The kernel thread for md RAID1 could cause a md RAID1 array deadlock
  2008-01-15  3:10 [BUG] The kernel thread for md RAID1 could cause a md RAID1 array deadlock K.Tanaka
@ 2008-01-24  3:28 ` Neil Brown
  2008-01-24  9:32   ` K.Tanaka
  0 siblings, 1 reply; 4+ messages in thread
From: Neil Brown @ 2008-01-24  3:28 UTC (permalink / raw)
  To: K.Tanaka; +Cc: linux-scsi, linux-kernel, linux-raid, dm-devel

On Tuesday January 15, k-tanaka@ce.jp.nec.com wrote:
> 
> This message describes the details about md-RAID1 issue found by
> testing the md RAID1 using the SCSI fault injection framework.
> 
> Abstract:
> Both the error handler for md RAID1 and write access request to the md RAID1
> use raid1d kernel thread. The nr_pending flag could cause a race condition
> in raid1d, results in a raid1d deadlock.

Thanks for finding and reporting this.

I believe the following patch should fix the deadlock.

If you are able to repeat your test and confirm this I would
appreciate it.

Thanks,
NeilBrown



Fix deadlock in md/raid1 when handling a read error.

When handling a read error, we freeze the array to stop any other
IO while attempting to over-write with correct data.

This is done in the raid1d thread and must wait for all submitted IO
to complete (except for requests that failed and are sitting in the
retry queue - these are counted in ->nr_queue and will stay there during
a freeze).

However write requests need attention from raid1d as bitmap updates
might be required.  This can cause a deadlock as raid1 is waiting for
requests to finish that themselves need attention from raid1d.

So we create a new function 'flush_pending_writes' to give that attention,
and call it in freeze_array to be sure that we aren't waiting on raid1d.

Thanks to "K.Tanaka" <k-tanaka@ce.jp.nec.com> for finding and reporting
this problem.

Cc: "K.Tanaka" <k-tanaka@ce.jp.nec.com>
Signed-off-by: Neil Brown <neilb@suse.de>

### Diffstat output
 ./drivers/md/raid1.c |   66 ++++++++++++++++++++++++++++++++++-----------------
 1 file changed, 45 insertions(+), 21 deletions(-)

diff .prev/drivers/md/raid1.c ./drivers/md/raid1.c
--- .prev/drivers/md/raid1.c	2008-01-18 11:19:09.000000000 +1100
+++ ./drivers/md/raid1.c	2008-01-24 14:21:55.000000000 +1100
@@ -592,6 +592,37 @@ static int raid1_congested(void *data, i
 }
 
 
+static int flush_pending_writes(conf_t *conf)
+{
+	/* Any writes that have been queue but are awaiting
+	 * bitmap updates get flushed here.
+	 * We return 1 if any requests were actually submitted.
+	 */
+	int rv = 0;
+
+	spin_lock_irq(&conf->device_lock);
+
+	if (conf->pending_bio_list.head) {
+		struct bio *bio;
+		bio = bio_list_get(&conf->pending_bio_list);
+		blk_remove_plug(conf->mddev->queue);
+		spin_unlock_irq(&conf->device_lock);
+		/* flush any pending bitmap writes to
+		 * disk before proceeding w/ I/O */
+		bitmap_unplug(conf->mddev->bitmap);
+
+		while (bio) { /* submit pending writes */
+			struct bio *next = bio->bi_next;
+			bio->bi_next = NULL;
+			generic_make_request(bio);
+			bio = next;
+		}
+		rv = 1;
+	} else
+		spin_unlock_irq(&conf->device_lock);
+	return rv;
+}
+
 /* Barriers....
  * Sometimes we need to suspend IO while we do something else,
  * either some resync/recovery, or reconfigure the array.
@@ -678,10 +709,14 @@ static void freeze_array(conf_t *conf)
 	spin_lock_irq(&conf->resync_lock);
 	conf->barrier++;
 	conf->nr_waiting++;
+	spin_unlock_irq(&conf->resync_lock);
+
+	spin_lock_irq(&conf->resync_lock);
 	wait_event_lock_irq(conf->wait_barrier,
 			    conf->barrier+conf->nr_pending == conf->nr_queued+2,
 			    conf->resync_lock,
-			    raid1_unplug(conf->mddev->queue));
+			    ({ flush_pending_writes(conf);
+			       raid1_unplug(conf->mddev->queue); }));
 	spin_unlock_irq(&conf->resync_lock);
 }
 static void unfreeze_array(conf_t *conf)
@@ -907,6 +942,9 @@ static int make_request(struct request_q
 	blk_plug_device(mddev->queue);
 	spin_unlock_irqrestore(&conf->device_lock, flags);
 
+	/* In case raid1d snuck into freeze_array */
+	wake_up(&conf->wait_barrier);
+
 	if (do_sync)
 		md_wakeup_thread(mddev->thread);
 #if 0
@@ -1473,28 +1511,14 @@ static void raid1d(mddev_t *mddev)
 	
 	for (;;) {
 		char b[BDEVNAME_SIZE];
-		spin_lock_irqsave(&conf->device_lock, flags);
-
-		if (conf->pending_bio_list.head) {
-			bio = bio_list_get(&conf->pending_bio_list);
-			blk_remove_plug(mddev->queue);
-			spin_unlock_irqrestore(&conf->device_lock, flags);
-			/* flush any pending bitmap writes to disk before proceeding w/ I/O */
-			bitmap_unplug(mddev->bitmap);
-
-			while (bio) { /* submit pending writes */
-				struct bio *next = bio->bi_next;
-				bio->bi_next = NULL;
-				generic_make_request(bio);
-				bio = next;
-			}
-			unplug = 1;
 
-			continue;
-		}
+		unplug += flush_pending_writes(conf);
 
-		if (list_empty(head))
+		spin_lock_irqsave(&conf->device_lock, flags);
+		if (list_empty(head)) {
+			spin_unlock_irqrestore(&conf->device_lock, flags);
 			break;
+		}
 		r1_bio = list_entry(head->prev, r1bio_t, retry_list);
 		list_del(head->prev);
 		conf->nr_queued--;
@@ -1590,7 +1614,7 @@ static void raid1d(mddev_t *mddev)
 			}
 		}
 	}
-	spin_unlock_irqrestore(&conf->device_lock, flags);
+
 	if (unplug)
 		unplug_slaves(mddev);
 }

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: [BUG] The kernel thread for md RAID1 could cause a md RAID1 array deadlock
  2008-01-24  3:28 ` Neil Brown
@ 2008-01-24  9:32   ` K.Tanaka
  2008-01-30  2:02     ` K.Tanaka
  0 siblings, 1 reply; 4+ messages in thread
From: K.Tanaka @ 2008-01-24  9:32 UTC (permalink / raw)
  To: Neil Brown; +Cc: linux-scsi, linux-kernel, linux-raid, dm-devel

Hi,

Thank you for the patch.
I have applied the patch to 2.6.23.14 and it works well.

- In case of 2.6.23.14, the problem is reproduced.
- In case of 2.6.23.14 with this patch, raid1 works well so far.
  The fault injection script continues to run, and it doesn't deadlock.
  I will keep it running for a while.

Also, md raid10 seems to have the same problem.
I will test raid10 applying this patch as well.


Neil Brown wrote:
> On Tuesday January 15, k-tanaka@ce.jp.nec.com wrote:
>> This message describes the details about md-RAID1 issue found by
>> testing the md RAID1 using the SCSI fault injection framework.
>>
>> Abstract:
>> Both the error handler for md RAID1 and write access request to the md RAID1
>> use raid1d kernel thread. The nr_pending flag could cause a race condition
>> in raid1d, results in a raid1d deadlock.
> 
> Thanks for finding and reporting this.
> 
> I believe the following patch should fix the deadlock.
> 
> If you are able to repeat your test and confirm this I would
> appreciate it.
> 
> Thanks,
> NeilBrown
> 
> 
> 
> Fix deadlock in md/raid1 when handling a read error.
> 
> When handling a read error, we freeze the array to stop any other
> IO while attempting to over-write with correct data.
> 
> This is done in the raid1d thread and must wait for all submitted IO
> to complete (except for requests that failed and are sitting in the
> retry queue - these are counted in ->nr_queue and will stay there during
> a freeze).
> 
> However write requests need attention from raid1d as bitmap updates
> might be required.  This can cause a deadlock as raid1 is waiting for
> requests to finish that themselves need attention from raid1d.
> 
> So we create a new function 'flush_pending_writes' to give that attention,
> and call it in freeze_array to be sure that we aren't waiting on raid1d.
> 
> Thanks to "K.Tanaka" <k-tanaka@ce.jp.nec.com> for finding and reporting
> this problem.
> 
> Cc: "K.Tanaka" <k-tanaka@ce.jp.nec.com>
> Signed-off-by: Neil Brown <neilb@suse.de>
> 
-- 
---------------------------------------------------------
Kenichi TANAKA    | Open Source Software Platform Development Division
                  | Computers Software Operations Unit, NEC Corporation
                  | k-tanaka@ce.jp.nec.com

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: [BUG] The kernel thread for md RAID1 could cause a md RAID1 array deadlock
  2008-01-24  9:32   ` K.Tanaka
@ 2008-01-30  2:02     ` K.Tanaka
  0 siblings, 0 replies; 4+ messages in thread
From: K.Tanaka @ 2008-01-30  2:02 UTC (permalink / raw)
  To: Neil Brown; +Cc: linux-raid, dm-devel, linux-scsi

Hi,

>Also, md raid10 seems to have the same problem.
>I will test raid10 applying this patch as well.

Sorry for the late response. I had a trouble with reproducing the problem,
but it turns out that the 2.6.24 kernel needs the latest (possibly testing)
version of systemtap-0.6.1-1 to run systemtap for the fault injection tool.

I've reproduced the stall on both raid1 and raid10 using 2.6.24.
Also I've tested the patch applied to 2.6.24 and confirmed that
it will fix the stall problem for both cases.

K.Tanaka wrote:
> Hi,
> 
> Thank you for the patch.
> I have applied the patch to 2.6.23.14 and it works well.
> 
> - In case of 2.6.23.14, the problem is reproduced.
> - In case of 2.6.23.14 with this patch, raid1 works well so far.
>   The fault injection script continues to run, and it doesn't deadlock.
>   I will keep it running for a while.
> 
> Also, md raid10 seems to have the same problem.
> I will test raid10 applying this patch as well.
> 
> 
> Neil Brown wrote:
>> On Tuesday January 15, k-tanaka@ce.jp.nec.com wrote:
>>> This message describes the details about md-RAID1 issue found by
>>> testing the md RAID1 using the SCSI fault injection framework.
>>>
>>> Abstract:
>>> Both the error handler for md RAID1 and write access request to the md RAID1
>>> use raid1d kernel thread. The nr_pending flag could cause a race condition
>>> in raid1d, results in a raid1d deadlock.
>> Thanks for finding and reporting this.
>>
>> I believe the following patch should fix the deadlock.
>>
>> If you are able to repeat your test and confirm this I would
>> appreciate it.
>>
>> Thanks,
>> NeilBrown
>>
>>
>>
>> Fix deadlock in md/raid1 when handling a read error.
>>
>> When handling a read error, we freeze the array to stop any other
>> IO while attempting to over-write with correct data.
>>

-- 
---------------------------------------------------------
Kenichi TANAKA    | Open Source Software Platform Development Division
                  | Computers Software Operations Unit, NEC Corporation
                  | k-tanaka@ce.jp.nec.com

^ permalink raw reply	[flat|nested] 4+ messages in thread

end of thread, other threads:[~2008-01-30  2:02 UTC | newest]

Thread overview: 4+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2008-01-15  3:10 [BUG] The kernel thread for md RAID1 could cause a md RAID1 array deadlock K.Tanaka
2008-01-24  3:28 ` Neil Brown
2008-01-24  9:32   ` K.Tanaka
2008-01-30  2:02     ` K.Tanaka

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).