[BUGREPORT] The kernel thread for md RAID10 could cause a md RAID10 array deadlock

linux-raid.vger.kernel.org archive mirror

* [BUGREPORT] The kernel thread for md RAID10 could cause a md RAID10 array deadlock
From: K.Tanaka @ 2008-02-13  8:33 UTC
  To: linux-raid; +Cc: linux-scsi

This message describes another md RAID10 issue, found by testing the
2.6.24 md RAID10 with the new scsi fault injection framework.

Abstract:
When a scsi command timeout occurs during RAID10 recovery, the kernel
threads for md RAID10 can deadlock the array.
The nr_pending flag set during normal I/O and the barrier flag set by the
recovery thread conflict, so raid10d() and sync_request() deadlock.

Details:
             normal I/O                             recovery I/O
   -----------------------------------------------------------------------------
                                           B-1. kernel thread starts by calling
   A-1. A process issues a read request.         md_do_sync()
        make_request() for raid10 is called
        by block layer.
                                           B-2. md_do_sync() calls sync_request
                                                operation for md raid10.
   A-2. In make_request(), wait_barrier()
        increments nr_pending flag.

   A-3. A read command is issued to the disk,
        but it takes a lot of time because
        of no response from the disk.
                                           B-3. sync_request() of raid10 calls
                                                raise_barrier(), increments barrier
                                                flag, and waits for nr_pending set
                                                in (A-2) to be cleared.
   A-4. raid10_end_read_request() is called
        in the interrupt context. It detects
        read error and wakes up raid10d kernel
        thread.

   A-5. raid10d() calls freeze_array() and waits
        for barrier flag incremented in (B-3)
        to be cleared.

    (**  stalls here because waiting conditions in A-5 and B-3 are never met **)

   A-6. raid10d calls fix_read_error() to
        handle read error.                 B-4. barrier flag will be cleared after
                                                the pending barrier request completes.
   A-7. nr_pending flag will be cleared after
        the pending read request completes.

The deadlock mechanism:
When normal I/O occurs during recovery, the nr_pending flag incremented in (A-2)
blocks subsequent recovery I/O until the normal I/O completes. The recovery thread
increments the barrier flag and waits for the nr_pending flag to be decremented (B-3).

Normally, the nr_pending flag is decremented after the I/O has completed successfully,
and the barrier flag is decremented after the barrier request (such as recovery I/O)
has completed successfully.

If a normal read I/O results in a scsi command timeout, the read request is handled
by the error handler in the raid10d kernel thread, which calls freeze_array().
Because the barrier flag was already set in (B-3), freeze_array() waits for the
barrier request to complete. Meanwhile, the recovery thread stalls waiting for the
nr_pending flag to be decremented (B-3). As a result, the error handler and the
recovery thread deadlock each other.
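
The state reached in (A-5) and (B-3) can be sketched as a small user-space
model. This is only an illustration, not kernel code: struct conf_model and
every function name below are hypothetical. It assumes that freeze_array() in
2.6.24 waits for barrier + nr_pending == nr_queued + 2, that the recovery
thread waits for nr_pending == 0, and that the failed read has already been
taken off the retry queue by raid10d (so it still counts in nr_pending but no
longer in nr_queued).

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified stand-in for the counters kept in conf_t. */
struct conf_model {
	int barrier;    /* raised by recovery and by freeze_array() */
	int nr_pending; /* normal I/O requests still in flight */
	int nr_waiting; /* normal I/O requests held back by the barrier */
	int nr_queued;  /* failed requests queued for raid10d (assumed 0 here) */
};

/* (A-2) wait_barrier() accounts for one new normal request. */
void model_wait_barrier(struct conf_model *c) { c->nr_pending++; }

/* (B-3) raise_barrier() raises the barrier, then must wait. */
void model_raise_barrier(struct conf_model *c) { c->barrier++; }

/* (B-3) the recovery thread may continue only with no pending normal I/O. */
bool recovery_may_proceed(const struct conf_model *c)
{
	return c->nr_pending == 0;
}

/* (A-5) freeze_array() raises the barrier again, then must wait. */
void model_freeze_array(struct conf_model *c)
{
	c->barrier++;
	c->nr_waiting++;
}

/* (A-5) wait condition assumed for freeze_array() in 2.6.24. */
bool freeze_may_proceed(const struct conf_model *c)
{
	return c->barrier + c->nr_pending == c->nr_queued + 2;
}

/* Replays A-2, optionally B-3, then A-5; true if freeze_array() is stuck. */
bool freeze_array_stalls(bool recovery_raised_barrier)
{
	struct conf_model c = {0};

	model_wait_barrier(&c);                 /* A-2: nr_pending = 1 */
	if (recovery_raised_barrier) {
		model_raise_barrier(&c);        /* B-3: barrier = 1 */
		assert(!recovery_may_proceed(&c)); /* recovery is already waiting */
	}
	model_freeze_array(&c);                 /* A-5: barrier++ */
	return !freeze_may_proceed(&c);
}
```

In this model, freeze_array_stalls(false) is false (1 + 1 == 0 + 2 holds),
while freeze_array_stalls(true) is true (2 + 1 != 0 + 2). In the second case
the recovery thread is also waiting for nr_pending to reach 0, so neither
side can make progress.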

This problem can be reproduced with the new scsi fault injection framework,
using its "no response from the SCSI device" simulation.
I think the new scsi fault injection framework is somewhat complicated
to use, so I will upload some sample wrapper shell scripts to make it easier.

-- 

---------------------------------------------------------
Kenichi TANAKA    | Open Source Software Platform Development Division
                  | Computers Software Operations Unit, NEC Corporation
                  | k-tanaka@ce.jp.nec.com


* Re: [BUGREPORT] The kernel thread for md RAID10 could cause a md RAID10 array deadlock
From: Neil Brown @ 2008-03-03  0:11 UTC
  To: K.Tanaka; +Cc: linux-raid, linux-scsi

On Wednesday February 13, k-tanaka@ce.jp.nec.com wrote:
> This message describes another issue about md-RAID10 found by
> testing the 2.6.24 md RAID10 using new scsi fault injection framework.

Thanks for finding and reporting this!!!

The following patch should fix the bug, both in raid1 and raid10.

NeilBrown



Fix possible raid1/raid10 deadlock on read error during resync.

Thanks to K.Tanaka and the scsi fault injection framework, here is
a fix for another possible deadlock in raid1/raid10 error handling.

If a read request returns an error while a resync is happening and
a resync request is pending, the attempt to fix the error will block
until the resync progresses, and the resync will block until the
read request completes.  Thus a deadlock.

This patch fixes the problem.

Cc: "K.Tanaka" <k-tanaka@ce.jp.nec.com>
Signed-off-by: Neil Brown <neilb@suse.de>

### Diffstat output
 ./drivers/md/raid1.c  |   11 +++++++++--
 ./drivers/md/raid10.c |   11 +++++++++--
 2 files changed, 18 insertions(+), 4 deletions(-)

diff .prev/drivers/md/raid10.c ./drivers/md/raid10.c
--- .prev/drivers/md/raid10.c	2008-03-03 11:03:39.000000000 +1100
+++ ./drivers/md/raid10.c	2008-03-03 09:56:53.000000000 +1100
@@ -747,13 +747,20 @@ static void freeze_array(conf_t *conf)
 	/* stop syncio and normal IO and wait for everything to
 	 * go quiet.
 	 * We increment barrier and nr_waiting, and then
-	 * wait until barrier+nr_pending match nr_queued+2
+	 * wait until nr_pending match nr_queued+1
+	 * This is called in the context of one normal IO request
+	 * that has failed. Thus any sync request that might be pending
+	 * will be blocked by nr_pending, and we need to wait for
+	 * pending IO requests to complete or be queued for re-try.
+	 * Thus the number queued (nr_queued) plus this request (1)
+	 * must match the number of pending IOs (nr_pending) before
+	 * we continue.
 	 */
 	spin_lock_irq(&conf->resync_lock);
 	conf->barrier++;
 	conf->nr_waiting++;
 	wait_event_lock_irq(conf->wait_barrier,
-			    conf->barrier+conf->nr_pending == conf->nr_queued+2,
+			    conf->nr_pending == conf->nr_queued+1,
 			    conf->resync_lock,
 			    ({ flush_pending_writes(conf);
 			       raid10_unplug(conf->mddev->queue); }));

diff .prev/drivers/md/raid1.c ./drivers/md/raid1.c
--- .prev/drivers/md/raid1.c	2008-03-03 11:03:39.000000000 +1100
+++ ./drivers/md/raid1.c	2008-03-03 09:56:52.000000000 +1100
@@ -704,13 +704,20 @@ static void freeze_array(conf_t *conf)
 	/* stop syncio and normal IO and wait for everything to
 	 * go quite.
 	 * We increment barrier and nr_waiting, and then
-	 * wait until barrier+nr_pending match nr_queued+2
+	 * wait until nr_pending match nr_queued+1
+	 * This is called in the context of one normal IO request
+	 * that has failed. Thus any sync request that might be pending
+	 * will be blocked by nr_pending, and we need to wait for
+	 * pending IO requests to complete or be queued for re-try.
+	 * Thus the number queued (nr_queued) plus this request (1)
+	 * must match the number of pending IOs (nr_pending) before
+	 * we continue.
 	 */
 	spin_lock_irq(&conf->resync_lock);
 	conf->barrier++;
 	conf->nr_waiting++;
 	wait_event_lock_irq(conf->wait_barrier,
-			    conf->barrier+conf->nr_pending == conf->nr_queued+2,
+			    conf->nr_pending == conf->nr_queued+1,
 			    conf->resync_lock,
 			    ({ flush_pending_writes(conf);
 			       raid1_unplug(conf->mddev->queue); }));
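
For readers who want to check the arithmetic, the old and new wait conditions
from the patch can be compared in a small user-space sketch. It is not kernel
code: struct counters, after_freeze() and the two predicate names are
hypothetical, and the model assumes that the failed request being handled is
counted in nr_pending but not in nr_queued, as the new comment describes.

```c
#include <stdbool.h>

/* Hypothetical snapshot of the counters that freeze_array() inspects. */
struct counters {
	int barrier;
	int nr_pending;
	int nr_queued;
};

/* Old condition: only holds if freeze_array() owns the single barrier. */
bool freeze_ready_old(struct counters c)
{
	return c.barrier + c.nr_pending == c.nr_queued + 2;
}

/* New condition: ignores barrier, so a pending resync cannot block it. */
bool freeze_ready_new(struct counters c)
{
	return c.nr_pending == c.nr_queued + 1;
}

/*
 * Build the state seen inside freeze_array() after its own barrier++.
 * resync_barriers: barriers already raised by resync (0 or more)
 * in_flight:       normal requests counted in nr_pending, including
 *                  the failed one being handled
 * queued:          other failed requests waiting on the retry queue
 */
struct counters after_freeze(int resync_barriers, int in_flight, int queued)
{
	struct counters c = { resync_barriers + 1, in_flight, queued };
	return c;
}
```

With no resync pending, after_freeze(0, 1, 0) satisfies both predicates. With
one resync barrier raised, after_freeze(1, 1, 0) fails the old predicate
(2 + 1 != 0 + 2) but satisfies the new one (1 == 0 + 1), which is the case
reported in this thread. If another read is still in flight,
after_freeze(1, 2, 0) fails both, so freeze_array() still waits for it to
complete or be queued for retry.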
