Fatal crash/hang in scsi_lib after RAID disk failure

linux-raid.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* Fatal crash/hang in scsi_lib after RAID disk failure
@ 2012-06-29  0:35 Christian Balzer
  2012-07-03  5:50 ` NeilBrown
  0 siblings, 1 reply; 6+ messages in thread
From: Christian Balzer @ 2012-06-29  0:35 UTC (permalink / raw)
  To: linux-raid


Hello (Neil),

This may or may not be related to the same main error I found a reference
to on the ML archives from November 2011 
(kernel BUG at drivers/scsi/scsi_lib.c:1153).

Again, this is a 3.2.20 kernel, now with the Raid10 recovery bug patch,
but I don't see how this could be related.

The full initial dump, as far as it was logged is here:
http://pastebin.com/wFX5yew2

But the juicy bits are these:
---
Jun 29 05:06:42 borg03b kernel: [231632.877579] sd 8:0:5:0: [sdj] Unhandled sense code
Jun 29 05:06:42 borg03b kernel: [231632.877583] sd 8:0:5:0: [sdj]  Result: hostbyte=invalid driverbyte=DRIVER_SENSE
Jun 29 05:06:42 borg03b kernel: [231632.877586] sd 8:0:5:0: [sdj]  Sense Key : Medium Error [current] 
Jun 29 05:06:42 borg03b kernel: [231632.877590] Info fld=0x904ff8b8
Jun 29 05:06:42 borg03b kernel: [231632.877591] sd 8:0:5:0: [sdj]  Add. Sense: Unrecovered read error
Jun 29 05:06:42 borg03b kernel: [231632.877595] sd 8:0:5:0: [sdj] CDB: Read(10): 28 00 90 4f f8 3f 00 00 f8 00
Jun 29 05:06:42 borg03b kernel: [231632.877602] end_request: critical target error, dev sdj, sector 2421159999
Jun 29 05:06:42 borg03b kernel: [231632.881963] md/raid10:md4: sdj1: rescheduling sector 6052895744
Jun 29 05:06:46 borg03b kernel: [231636.380147] sd 8:0:5:0: [sdj] Unhandled sense code
Jun 29 05:06:46 borg03b kernel: [231636.380150] sd 8:0:5:0: [sdj]  Result: hostbyte=invalid driverbyte=DRIVER_SENSE
Jun 29 05:06:46 borg03b kernel: [231636.380153] sd 8:0:5:0: [sdj]  Sense Key : Medium Error [current] 
Jun 29 05:06:46 borg03b kernel: [231636.380157] Info fld=0x904ff8b8
Jun 29 05:06:46 borg03b kernel: [231636.380159] sd 8:0:5:0: [sdj]  Add. Sense: Unrecovered read error
Jun 29 05:06:46 borg03b kernel: [231636.380162] sd 8:0:5:0: [sdj] CDB: Read(10): 28 00 90 4f f8 b7 00 00 08 00
Jun 29 05:06:46 borg03b kernel: [231636.380168] end_request: critical target error, dev sdj, sector 2421160119
Jun 29 05:06:46 borg03b kernel: [231636.401781] ------------[ cut here ]------------
Jun 29 05:06:46 borg03b kernel: [231636.405694] kernel BUG at drivers/scsi/scsi_lib.c:1153!
Jun 29 05:06:46 borg03b kernel: [231636.405694] invalid opcode: 0000 [#1] SMP 
---

So a drive died, which shouldn't be a big deal and the kernel decided to
jump off the proverbial bridge.

And kept doing that upon reboots:
---
Jun 29 06:44:38 borg03b kernel: [   52.052257] end_request: critical target error, dev sdj, sector 2421149759
Jun 29 06:44:38 borg03b kernel: [   52.054654] md/raid10:md4: sdj1: rescheduling sector 6052870144
Jun 29 06:44:38 borg03b kernel: [   52.057104] md/raid10:md4: sdj1: rescheduling sector 6052870392
Jun 29 06:44:38 borg03b kernel: [   52.059521] md/raid10:md4: sdj1: rescheduling sector 6052870400
Jun 29 06:44:38 borg03b kernel: [   52.061878] md/raid10:md4: sdj1: rescheduling sector 6052870648
Jun 29 06:44:38 borg03b kernel: [   52.064255] md/raid10:md4: sdj1: rescheduling sector 6052870656
Jun 29 06:44:38 borg03b kernel: [   52.066562] md/raid10:md4: sdj1: rescheduling sector 6052870904
Jun 29 06:44:38 borg03b kernel: [   52.068872] md/raid10:md4: sdj1: rescheduling sector 6052870912
Jun 29 06:44:38 borg03b kernel: [   52.071141] md/raid10:md4: sdj1: rescheduling sector 6052871160
Jun 29 06:44:39 borg03b kernel: [   52.250525] md/raid10:md4: sdj1: redirectingsector 6052865024 to another mirror
Jun 29 06:44:39 borg03b kernel: [   52.276817] md/raid10:md4: sdj1: redirectingsector 6052865272 to another mirror
Jun 29 06:44:42 borg03b kernel: [   55.325297] sd 8:0:5:0: [sdj] Unhandled sense code
Jun 29 06:44:42 borg03b kernel: [   55.325301] sd 8:0:5:0: [sdj]  Result: hostbyte=invalid driverbyte=DRIVER_SENSE
Jun 29 06:44:42 borg03b kernel: [   55.325304] sd 8:0:5:0: [sdj]  Sense Key : Medium Error [current] 
Jun 29 06:44:42 borg03b kernel: [   55.325308] Info fld=0x904fc9b4
Jun 29 06:44:42 borg03b kernel: [   55.325310] sd 8:0:5:0: [sdj]  Add. Sense: Unrecovered read error
Jun 29 06:44:42 borg03b kernel: [   55.325313] sd 8:0:5:0: [sdj] CDB: Read(10): 28 00 90 4f c9 af 00 00 08 00
Jun 29 06:44:42 borg03b kernel: [   55.325320] end_request: critical target error, dev sdj, sector 2421148079
Jun 29 06:44:42 borg03b kernel: [   55.343766] ------------[ cut here ]------------
Jun 29 06:44:42 borg03b kernel: [   55.346054] kernel BUG at drivers/scsi/scsi_lib.c:1153!
---
Which resulted a bit later in:
---
Jun 29 06:45:05 borg03b kernel: [   57.051653] ------------[ cut here ]------------
Jun 29 06:45:05 borg03b kernel: [   57.051653] WARNING: at kernel/watchdog.c:241 watchdog_overflow_callback+0x96/0xa1()
Jun 29 06:45:05 borg03b kernel: [   57.051653] Hardware name: H8DM3-2
Jun 29 06:45:05 borg03b kernel: [   57.051653] Watchdog detected hard LOCKUP on cpu 7
---

Not sure if there is a real HW problem (aside from the failing drive) and
kettle calling the pot black, but I managed to recover things by booting
into single-user mode and removing that failing drive before letting the
kernel proceed with booting.

This is pretty bad [TM], any ideas?
If you need more information, just let me know.

Regards,

Christian (sleep deprived)
-- 
Christian Balzer        Network/Systems Engineer                
chibi@gol.com   	Global OnLine Japan/Fusion Communications
http://www.gol.com/

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: Fatal crash/hang in scsi_lib after RAID disk failure
  2012-06-29  0:35 Fatal crash/hang in scsi_lib after RAID disk failure Christian Balzer
@ 2012-07-03  5:50 ` NeilBrown
  2012-07-03  6:10   ` Christian Balzer
  0 siblings, 1 reply; 6+ messages in thread
From: NeilBrown @ 2012-07-03  5:50 UTC (permalink / raw)
  To: Christian Balzer; +Cc: linux-raid

[-- Attachment #1: Type: text/plain, Size: 6858 bytes --]

On Fri, 29 Jun 2012 09:35:52 +0900 Christian Balzer <chibi@gol.com> wrote:

> 
> Hello (Neil),
> 
> This may or may not be related to the same main error I found a reference
> to on the ML archives from November 2011 
> (kernel BUG at drivers/scsi/scsi_lib.c:1153).
> 
> Again, this is a 3.2.20 kernel, now with the Raid10 recovery bug patch,
> but I don't see how this could be related.
> 
> The full initial dump, as far as it was logged is here:
> http://pastebin.com/wFX5yew2
> 
> But the juicy bits are these:
> ---
> Jun 29 05:06:42 borg03b kernel: [231632.877579] sd 8:0:5:0: [sdj] Unhandled sense code
> Jun 29 05:06:42 borg03b kernel: [231632.877583] sd 8:0:5:0: [sdj]  Result: hostbyte=invalid driverbyte=DRIVER_SENSE
> Jun 29 05:06:42 borg03b kernel: [231632.877586] sd 8:0:5:0: [sdj]  Sense Key : Medium Error [current] 
> Jun 29 05:06:42 borg03b kernel: [231632.877590] Info fld=0x904ff8b8
> Jun 29 05:06:42 borg03b kernel: [231632.877591] sd 8:0:5:0: [sdj]  Add. Sense: Unrecovered read error
> Jun 29 05:06:42 borg03b kernel: [231632.877595] sd 8:0:5:0: [sdj] CDB: Read(10): 28 00 90 4f f8 3f 00 00 f8 00
> Jun 29 05:06:42 borg03b kernel: [231632.877602] end_request: critical target error, dev sdj, sector 2421159999
> Jun 29 05:06:42 borg03b kernel: [231632.881963] md/raid10:md4: sdj1: rescheduling sector 6052895744
> Jun 29 05:06:46 borg03b kernel: [231636.380147] sd 8:0:5:0: [sdj] Unhandled sense code
> Jun 29 05:06:46 borg03b kernel: [231636.380150] sd 8:0:5:0: [sdj]  Result: hostbyte=invalid driverbyte=DRIVER_SENSE
> Jun 29 05:06:46 borg03b kernel: [231636.380153] sd 8:0:5:0: [sdj]  Sense Key : Medium Error [current] 
> Jun 29 05:06:46 borg03b kernel: [231636.380157] Info fld=0x904ff8b8
> Jun 29 05:06:46 borg03b kernel: [231636.380159] sd 8:0:5:0: [sdj]  Add. Sense: Unrecovered read error
> Jun 29 05:06:46 borg03b kernel: [231636.380162] sd 8:0:5:0: [sdj] CDB: Read(10): 28 00 90 4f f8 b7 00 00 08 00
> Jun 29 05:06:46 borg03b kernel: [231636.380168] end_request: critical target error, dev sdj, sector 2421160119
> Jun 29 05:06:46 borg03b kernel: [231636.401781] ------------[ cut here ]------------
> Jun 29 05:06:46 borg03b kernel: [231636.405694] kernel BUG at drivers/scsi/scsi_lib.c:1153!
> Jun 29 05:06:46 borg03b kernel: [231636.405694] invalid opcode: 0000 [#1] SMP 
> ---
> 
> So a drive died, which shouldn't be a big deal and the kernel decided to
> jump off the proverbial bridge.
> 
> And kept doing that upon reboots:
> ---
> Jun 29 06:44:38 borg03b kernel: [   52.052257] end_request: critical target error, dev sdj, sector 2421149759
> Jun 29 06:44:38 borg03b kernel: [   52.054654] md/raid10:md4: sdj1: rescheduling sector 6052870144
> Jun 29 06:44:38 borg03b kernel: [   52.057104] md/raid10:md4: sdj1: rescheduling sector 6052870392
> Jun 29 06:44:38 borg03b kernel: [   52.059521] md/raid10:md4: sdj1: rescheduling sector 6052870400
> Jun 29 06:44:38 borg03b kernel: [   52.061878] md/raid10:md4: sdj1: rescheduling sector 6052870648
> Jun 29 06:44:38 borg03b kernel: [   52.064255] md/raid10:md4: sdj1: rescheduling sector 6052870656
> Jun 29 06:44:38 borg03b kernel: [   52.066562] md/raid10:md4: sdj1: rescheduling sector 6052870904
> Jun 29 06:44:38 borg03b kernel: [   52.068872] md/raid10:md4: sdj1: rescheduling sector 6052870912
> Jun 29 06:44:38 borg03b kernel: [   52.071141] md/raid10:md4: sdj1: rescheduling sector 6052871160
> Jun 29 06:44:39 borg03b kernel: [   52.250525] md/raid10:md4: sdj1: redirectingsector 6052865024 to another mirror
> Jun 29 06:44:39 borg03b kernel: [   52.276817] md/raid10:md4: sdj1: redirectingsector 6052865272 to another mirror
> Jun 29 06:44:42 borg03b kernel: [   55.325297] sd 8:0:5:0: [sdj] Unhandled sense code
> Jun 29 06:44:42 borg03b kernel: [   55.325301] sd 8:0:5:0: [sdj]  Result: hostbyte=invalid driverbyte=DRIVER_SENSE
> Jun 29 06:44:42 borg03b kernel: [   55.325304] sd 8:0:5:0: [sdj]  Sense Key : Medium Error [current] 
> Jun 29 06:44:42 borg03b kernel: [   55.325308] Info fld=0x904fc9b4
> Jun 29 06:44:42 borg03b kernel: [   55.325310] sd 8:0:5:0: [sdj]  Add. Sense: Unrecovered read error
> Jun 29 06:44:42 borg03b kernel: [   55.325313] sd 8:0:5:0: [sdj] CDB: Read(10): 28 00 90 4f c9 af 00 00 08 00
> Jun 29 06:44:42 borg03b kernel: [   55.325320] end_request: critical target error, dev sdj, sector 2421148079
> Jun 29 06:44:42 borg03b kernel: [   55.343766] ------------[ cut here ]------------
> Jun 29 06:44:42 borg03b kernel: [   55.346054] kernel BUG at drivers/scsi/scsi_lib.c:1153!
> ---
> Which resulted a bit later in:
> ---
> Jun 29 06:45:05 borg03b kernel: [   57.051653] ------------[ cut here ]------------
> Jun 29 06:45:05 borg03b kernel: [   57.051653] WARNING: at kernel/watchdog.c:241 watchdog_overflow_callback+0x96/0xa1()
> Jun 29 06:45:05 borg03b kernel: [   57.051653] Hardware name: H8DM3-2
> Jun 29 06:45:05 borg03b kernel: [   57.051653] Watchdog detected hard LOCKUP on cpu 7
> ---
> 
> Not sure if there is a real HW problem (aside from the failing drive) and
> kettle calling the pot black, but I managed to recover things by booting
> into single-user mode and removing that failing drive before letting the
> kernel proceed with booting.
> 
> This is pretty bad [TM], any ideas?
> If you need more information, just let me know.

That took *way* to long to find given how simple the fix is.
I spent ages staring at the code, as about to reply and so "no idea" when I
thought I should test it myself.  Test failed immediately.
Then I spent way too look adding tracing into the wrong places.  But I have
it now!

Thanks for the report.  Following fix will go upstream shortly.
(r10_sync_page_io takes sectors, not bytes).

Thanks,
NeilBrown

diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index bcf6ea8..ae73e29 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -2310,7 +2310,7 @@ static void fix_read_error(struct r10conf *conf, struct mddev *mddev, struct r10
 			if (r10_sync_page_io(rdev,
 					     r10_bio->devs[sl].addr +
 					     sect,
-					     s<<9, conf->tmppage, WRITE)
+					     s, conf->tmppage, WRITE)
 			    == 0) {
 				/* Well, this device is dead */
 				printk(KERN_NOTICE
@@ -2349,7 +2349,7 @@ static void fix_read_error(struct r10conf *conf, struct mddev *mddev, struct r10
 			switch (r10_sync_page_io(rdev,
 					     r10_bio->devs[sl].addr +
 					     sect,
-					     s<<9, conf->tmppage,
+					     s, conf->tmppage,
 						 READ)) {
 			case 0:
 				/* Well, this device is dead */
@@ -2512,7 +2512,7 @@ read_more:
 	slot = r10_bio->read_slot;
 	printk_ratelimited(
 		KERN_ERR
-		"md/raid10:%s: %s: redirecting"
+		"md/raid10:%s: %s: redirecting "
 		"sector %llu to another mirror\n",
 		mdname(mddev),
 		bdevname(rdev->bdev, b),

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 828 bytes --]

^ permalink raw reply related	[flat|nested] 6+ messages in thread

* Re: Fatal crash/hang in scsi_lib after RAID disk failure
  2012-07-03  5:50 ` NeilBrown
@ 2012-07-03  6:10   ` Christian Balzer
  2012-07-03  6:45     ` NeilBrown
  0 siblings, 1 reply; 6+ messages in thread
From: Christian Balzer @ 2012-07-03  6:10 UTC (permalink / raw)
  To: NeilBrown; +Cc: linux-raid

On Tue, 3 Jul 2012 15:50:45 +1000 NeilBrown wrote:

> On Fri, 29 Jun 2012 09:35:52 +0900 Christian Balzer <chibi@gol.com>
> wrote:
> 
> > 
> > Hello (Neil),
> > 
> > This may or may not be related to the same main error I found a
> > reference to on the ML archives from November 2011 
> > (kernel BUG at drivers/scsi/scsi_lib.c:1153).
> > 
> > Again, this is a 3.2.20 kernel, now with the Raid10 recovery bug patch,
> > but I don't see how this could be related.
> > 
> > The full initial dump, as far as it was logged is here:
> > http://pastebin.com/wFX5yew2
> > 
> > But the juicy bits are these:
> > ---
> > Jun 29 05:06:42 borg03b kernel: [231632.877579] sd 8:0:5:0: [sdj]
> > Unhandled sense code Jun 29 05:06:42 borg03b kernel: [231632.877583]
> > sd 8:0:5:0: [sdj]  Result: hostbyte=invalid driverbyte=DRIVER_SENSE
> > Jun 29 05:06:42 borg03b kernel: [231632.877586] sd 8:0:5:0: [sdj]
> > Sense Key : Medium Error [current] Jun 29 05:06:42 borg03b kernel:
> > [231632.877590] Info fld=0x904ff8b8 Jun 29 05:06:42 borg03b kernel:
> > [231632.877591] sd 8:0:5:0: [sdj]  Add. Sense: Unrecovered read error
> > Jun 29 05:06:42 borg03b kernel: [231632.877595] sd 8:0:5:0: [sdj] CDB:
> > Read(10): 28 00 90 4f f8 3f 00 00 f8 00 Jun 29 05:06:42 borg03b
> > kernel: [231632.877602] end_request: critical target error, dev sdj,
> > sector 2421159999 Jun 29 05:06:42 borg03b kernel: [231632.881963]
> > md/raid10:md4: sdj1: rescheduling sector 6052895744 Jun 29 05:06:46
> > borg03b kernel: [231636.380147] sd 8:0:5:0: [sdj] Unhandled sense code
> > Jun 29 05:06:46 borg03b kernel: [231636.380150] sd 8:0:5:0: [sdj]
> > Result: hostbyte=invalid driverbyte=DRIVER_SENSE Jun 29 05:06:46
> > borg03b kernel: [231636.380153] sd 8:0:5:0: [sdj]  Sense Key : Medium
> > Error [current] Jun 29 05:06:46 borg03b kernel: [231636.380157] Info
> > fld=0x904ff8b8 Jun 29 05:06:46 borg03b kernel: [231636.380159] sd
> > 8:0:5:0: [sdj]  Add. Sense: Unrecovered read error Jun 29 05:06:46
> > borg03b kernel: [231636.380162] sd 8:0:5:0: [sdj] CDB: Read(10): 28 00
> > 90 4f f8 b7 00 00 08 00 Jun 29 05:06:46 borg03b kernel:
> > [231636.380168] end_request: critical target error, dev sdj, sector
> > 2421160119 Jun 29 05:06:46 borg03b kernel: [231636.401781]
> > ------------[ cut here ]------------ Jun 29 05:06:46 borg03b kernel:
> > [231636.405694] kernel BUG at drivers/scsi/scsi_lib.c:1153! Jun 29
> > 05:06:46 borg03b kernel: [231636.405694] invalid opcode: 0000 [#1] SMP
> > ---
> > 
> > So a drive died, which shouldn't be a big deal and the kernel decided
> > to jump off the proverbial bridge.
> > 
> > And kept doing that upon reboots:
> > ---
> > Jun 29 06:44:38 borg03b kernel: [   52.052257] end_request: critical
> > target error, dev sdj, sector 2421149759 Jun 29 06:44:38 borg03b
> > kernel: [   52.054654] md/raid10:md4: sdj1: rescheduling sector
> > 6052870144 Jun 29 06:44:38 borg03b kernel: [   52.057104]
> > md/raid10:md4: sdj1: rescheduling sector 6052870392 Jun 29 06:44:38
> > borg03b kernel: [   52.059521] md/raid10:md4: sdj1: rescheduling
> > sector 6052870400 Jun 29 06:44:38 borg03b kernel: [   52.061878]
> > md/raid10:md4: sdj1: rescheduling sector 6052870648 Jun 29 06:44:38
> > borg03b kernel: [   52.064255] md/raid10:md4: sdj1: rescheduling
> > sector 6052870656 Jun 29 06:44:38 borg03b kernel: [   52.066562]
> > md/raid10:md4: sdj1: rescheduling sector 6052870904 Jun 29 06:44:38
> > borg03b kernel: [   52.068872] md/raid10:md4: sdj1: rescheduling
> > sector 6052870912 Jun 29 06:44:38 borg03b kernel: [   52.071141]
> > md/raid10:md4: sdj1: rescheduling sector 6052871160 Jun 29 06:44:39
> > borg03b kernel: [   52.250525] md/raid10:md4: sdj1: redirectingsector
> > 6052865024 to another mirror Jun 29 06:44:39 borg03b kernel:
> > [   52.276817] md/raid10:md4: sdj1: redirectingsector 6052865272 to
> > another mirror Jun 29 06:44:42 borg03b kernel: [   55.325297] sd
> > 8:0:5:0: [sdj] Unhandled sense code Jun 29 06:44:42 borg03b kernel:
> > [   55.325301] sd 8:0:5:0: [sdj]  Result: hostbyte=invalid
> > driverbyte=DRIVER_SENSE Jun 29 06:44:42 borg03b kernel: [   55.325304]
> > sd 8:0:5:0: [sdj]  Sense Key : Medium Error [current] Jun 29 06:44:42
> > borg03b kernel: [   55.325308] Info fld=0x904fc9b4 Jun 29 06:44:42
> > borg03b kernel: [   55.325310] sd 8:0:5:0: [sdj]  Add. Sense:
> > Unrecovered read error Jun 29 06:44:42 borg03b kernel: [   55.325313]
> > sd 8:0:5:0: [sdj] CDB: Read(10): 28 00 90 4f c9 af 00 00 08 00 Jun 29
> > 06:44:42 borg03b kernel: [   55.325320] end_request: critical target
> > error, dev sdj, sector 2421148079 Jun 29 06:44:42 borg03b kernel:
> > [   55.343766] ------------[ cut here ]------------ Jun 29 06:44:42
> > borg03b kernel: [   55.346054] kernel BUG at
> > drivers/scsi/scsi_lib.c:1153! --- Which resulted a bit later in: ---
> > Jun 29 06:45:05 borg03b kernel: [   57.051653] ------------[ cut
> > here ]------------ Jun 29 06:45:05 borg03b kernel: [   57.051653]
> > WARNING: at kernel/watchdog.c:241
> > watchdog_overflow_callback+0x96/0xa1() Jun 29 06:45:05 borg03b kernel:
> > [   57.051653] Hardware name: H8DM3-2 Jun 29 06:45:05 borg03b kernel:
> > [   57.051653] Watchdog detected hard LOCKUP on cpu 7 ---
> > 
> > Not sure if there is a real HW problem (aside from the failing drive)
> > and kettle calling the pot black, but I managed to recover things by
> > booting into single-user mode and removing that failing drive before
> > letting the kernel proceed with booting.
> > 
> > This is pretty bad [TM], any ideas?
> > If you need more information, just let me know.
> 
> That took *way* to long to find given how simple the fix is.

Well, given how long it takes with some OSS projects, I'd say 4 days is
pretty good. ^o^

> I spent ages staring at the code, as about to reply and so "no idea"
> when I thought I should test it myself.  Test failed immediately.

Could you elaborate a bit? 
As in, was this something introduced only very recently, since I had
dozens of disks fail before w/o any such pyrotechnics. 
Or were there some special circumstances that triggered it? 
(But looking at the patch, I guess it should have been pretty universal)

> Then I spent way too look adding tracing into the wrong places.  But I
> have it now!
> 
> Thanks for the report.  Following fix will go upstream shortly.
> (r10_sync_page_io takes sectors, not bytes).
> 
Great to know, I shall keep my eyes on the kernel ChangeLogs and update
as soon as I can. Might just go and patch what I have right now, though...

Thanks for your efforts!

Christian

> Thanks,
> NeilBrown
> 
> diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
> index bcf6ea8..ae73e29 100644
> --- a/drivers/md/raid10.c
> +++ b/drivers/md/raid10.c
> @@ -2310,7 +2310,7 @@ static void fix_read_error(struct r10conf *conf,
> struct mddev *mddev, struct r10 if (r10_sync_page_io(rdev,
>  					     r10_bio->devs[sl].addr +
>  					     sect,
> -					     s<<9, conf->tmppage, WRITE)
> +					     s, conf->tmppage, WRITE)
>  			    == 0) {
>  				/* Well, this device is dead */
>  				printk(KERN_NOTICE
> @@ -2349,7 +2349,7 @@ static void fix_read_error(struct r10conf *conf,
> struct mddev *mddev, struct r10 switch (r10_sync_page_io(rdev,
>  					     r10_bio->devs[sl].addr +
>  					     sect,
> -					     s<<9, conf->tmppage,
> +					     s, conf->tmppage,
>  						 READ)) {
>  			case 0:
>  				/* Well, this device is dead */
> @@ -2512,7 +2512,7 @@ read_more:
>  	slot = r10_bio->read_slot;
>  	printk_ratelimited(
>  		KERN_ERR
> -		"md/raid10:%s: %s: redirecting"
> +		"md/raid10:%s: %s: redirecting "
>  		"sector %llu to another mirror\n",
>  		mdname(mddev),
>  		bdevname(rdev->bdev, b),


-- 
Christian Balzer        Network/Systems Engineer                
chibi@gol.com   	Global OnLine Japan/Fusion Communications
http://www.gol.com/

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: Fatal crash/hang in scsi_lib after RAID disk failure
  2012-07-03  6:10   ` Christian Balzer
@ 2012-07-03  6:45     ` NeilBrown
  2012-07-03  7:12       ` Christian Balzer
  0 siblings, 1 reply; 6+ messages in thread
From: NeilBrown @ 2012-07-03  6:45 UTC (permalink / raw)
  To: Christian Balzer; +Cc: linux-raid

[-- Attachment #1: Type: text/plain, Size: 7263 bytes --]

On Tue, 3 Jul 2012 15:10:38 +0900 Christian Balzer <chibi@gol.com> wrote:

> On Tue, 3 Jul 2012 15:50:45 +1000 NeilBrown wrote:
> 
> > On Fri, 29 Jun 2012 09:35:52 +0900 Christian Balzer <chibi@gol.com>
> > wrote:
> > 
> > > 
> > > Hello (Neil),
> > > 
> > > This may or may not be related to the same main error I found a
> > > reference to on the ML archives from November 2011 
> > > (kernel BUG at drivers/scsi/scsi_lib.c:1153).
> > > 
> > > Again, this is a 3.2.20 kernel, now with the Raid10 recovery bug patch,
> > > but I don't see how this could be related.
> > > 
> > > The full initial dump, as far as it was logged is here:
> > > http://pastebin.com/wFX5yew2
> > > 
> > > But the juicy bits are these:
> > > ---
> > > Jun 29 05:06:42 borg03b kernel: [231632.877579] sd 8:0:5:0: [sdj]
> > > Unhandled sense code Jun 29 05:06:42 borg03b kernel: [231632.877583]
> > > sd 8:0:5:0: [sdj]  Result: hostbyte=invalid driverbyte=DRIVER_SENSE
> > > Jun 29 05:06:42 borg03b kernel: [231632.877586] sd 8:0:5:0: [sdj]
> > > Sense Key : Medium Error [current] Jun 29 05:06:42 borg03b kernel:
> > > [231632.877590] Info fld=0x904ff8b8 Jun 29 05:06:42 borg03b kernel:
> > > [231632.877591] sd 8:0:5:0: [sdj]  Add. Sense: Unrecovered read error
> > > Jun 29 05:06:42 borg03b kernel: [231632.877595] sd 8:0:5:0: [sdj] CDB:
> > > Read(10): 28 00 90 4f f8 3f 00 00 f8 00 Jun 29 05:06:42 borg03b
> > > kernel: [231632.877602] end_request: critical target error, dev sdj,
> > > sector 2421159999 Jun 29 05:06:42 borg03b kernel: [231632.881963]
> > > md/raid10:md4: sdj1: rescheduling sector 6052895744 Jun 29 05:06:46
> > > borg03b kernel: [231636.380147] sd 8:0:5:0: [sdj] Unhandled sense code
> > > Jun 29 05:06:46 borg03b kernel: [231636.380150] sd 8:0:5:0: [sdj]
> > > Result: hostbyte=invalid driverbyte=DRIVER_SENSE Jun 29 05:06:46
> > > borg03b kernel: [231636.380153] sd 8:0:5:0: [sdj]  Sense Key : Medium
> > > Error [current] Jun 29 05:06:46 borg03b kernel: [231636.380157] Info
> > > fld=0x904ff8b8 Jun 29 05:06:46 borg03b kernel: [231636.380159] sd
> > > 8:0:5:0: [sdj]  Add. Sense: Unrecovered read error Jun 29 05:06:46
> > > borg03b kernel: [231636.380162] sd 8:0:5:0: [sdj] CDB: Read(10): 28 00
> > > 90 4f f8 b7 00 00 08 00 Jun 29 05:06:46 borg03b kernel:
> > > [231636.380168] end_request: critical target error, dev sdj, sector
> > > 2421160119 Jun 29 05:06:46 borg03b kernel: [231636.401781]
> > > ------------[ cut here ]------------ Jun 29 05:06:46 borg03b kernel:
> > > [231636.405694] kernel BUG at drivers/scsi/scsi_lib.c:1153! Jun 29
> > > 05:06:46 borg03b kernel: [231636.405694] invalid opcode: 0000 [#1] SMP
> > > ---
> > > 
> > > So a drive died, which shouldn't be a big deal and the kernel decided
> > > to jump off the proverbial bridge.
> > > 
> > > And kept doing that upon reboots:
> > > ---
> > > Jun 29 06:44:38 borg03b kernel: [   52.052257] end_request: critical
> > > target error, dev sdj, sector 2421149759 Jun 29 06:44:38 borg03b
> > > kernel: [   52.054654] md/raid10:md4: sdj1: rescheduling sector
> > > 6052870144 Jun 29 06:44:38 borg03b kernel: [   52.057104]
> > > md/raid10:md4: sdj1: rescheduling sector 6052870392 Jun 29 06:44:38
> > > borg03b kernel: [   52.059521] md/raid10:md4: sdj1: rescheduling
> > > sector 6052870400 Jun 29 06:44:38 borg03b kernel: [   52.061878]
> > > md/raid10:md4: sdj1: rescheduling sector 6052870648 Jun 29 06:44:38
> > > borg03b kernel: [   52.064255] md/raid10:md4: sdj1: rescheduling
> > > sector 6052870656 Jun 29 06:44:38 borg03b kernel: [   52.066562]
> > > md/raid10:md4: sdj1: rescheduling sector 6052870904 Jun 29 06:44:38
> > > borg03b kernel: [   52.068872] md/raid10:md4: sdj1: rescheduling
> > > sector 6052870912 Jun 29 06:44:38 borg03b kernel: [   52.071141]
> > > md/raid10:md4: sdj1: rescheduling sector 6052871160 Jun 29 06:44:39
> > > borg03b kernel: [   52.250525] md/raid10:md4: sdj1: redirectingsector
> > > 6052865024 to another mirror Jun 29 06:44:39 borg03b kernel:
> > > [   52.276817] md/raid10:md4: sdj1: redirectingsector 6052865272 to
> > > another mirror Jun 29 06:44:42 borg03b kernel: [   55.325297] sd
> > > 8:0:5:0: [sdj] Unhandled sense code Jun 29 06:44:42 borg03b kernel:
> > > [   55.325301] sd 8:0:5:0: [sdj]  Result: hostbyte=invalid
> > > driverbyte=DRIVER_SENSE Jun 29 06:44:42 borg03b kernel: [   55.325304]
> > > sd 8:0:5:0: [sdj]  Sense Key : Medium Error [current] Jun 29 06:44:42
> > > borg03b kernel: [   55.325308] Info fld=0x904fc9b4 Jun 29 06:44:42
> > > borg03b kernel: [   55.325310] sd 8:0:5:0: [sdj]  Add. Sense:
> > > Unrecovered read error Jun 29 06:44:42 borg03b kernel: [   55.325313]
> > > sd 8:0:5:0: [sdj] CDB: Read(10): 28 00 90 4f c9 af 00 00 08 00 Jun 29
> > > 06:44:42 borg03b kernel: [   55.325320] end_request: critical target
> > > error, dev sdj, sector 2421148079 Jun 29 06:44:42 borg03b kernel:
> > > [   55.343766] ------------[ cut here ]------------ Jun 29 06:44:42
> > > borg03b kernel: [   55.346054] kernel BUG at
> > > drivers/scsi/scsi_lib.c:1153! --- Which resulted a bit later in: ---
> > > Jun 29 06:45:05 borg03b kernel: [   57.051653] ------------[ cut
> > > here ]------------ Jun 29 06:45:05 borg03b kernel: [   57.051653]
> > > WARNING: at kernel/watchdog.c:241
> > > watchdog_overflow_callback+0x96/0xa1() Jun 29 06:45:05 borg03b kernel:
> > > [   57.051653] Hardware name: H8DM3-2 Jun 29 06:45:05 borg03b kernel:
> > > [   57.051653] Watchdog detected hard LOCKUP on cpu 7 ---
> > > 
> > > Not sure if there is a real HW problem (aside from the failing drive)
> > > and kettle calling the pot black, but I managed to recover things by
> > > booting into single-user mode and removing that failing drive before
> > > letting the kernel proceed with booting.
> > > 
> > > This is pretty bad [TM], any ideas?
> > > If you need more information, just let me know.
> > 
> > That took *way* to long to find given how simple the fix is.
> 
> Well, given how long it takes with some OSS projects, I'd say 4 days is
> pretty good. ^o^

I meant the 4 hours of my time searching, not the 4 days of your time
waiting :-)


> 
> > I spent ages staring at the code, as about to reply and so "no idea"
> > when I thought I should test it myself.  Test failed immediately.
> 
> Could you elaborate a bit? 
> As in, was this something introduced only very recently, since I had
> dozens of disks fail before w/o any such pyrotechnics. 
> Or were there some special circumstances that triggered it? 
> (But looking at the patch, I guess it should have been pretty universal)

Bug was introduced by commit 58c54fcca3bac5bf9 which first appeared in
Linux 3.1.  Since then, any read error on RAID10 will trigger the bug.

I really need to improve my testing!

> 
> > Then I spent way too look adding tracing into the wrong places.  But I
> > have it now!
> > 
> > Thanks for the report.  Following fix will go upstream shortly.
> > (r10_sync_page_io takes sectors, not bytes).
> > 
> Great to know, I shall keep my eyes on the kernel ChangeLogs and update
> as soon as I can. Might just go and patch what I have right now, though...
> 
> Thanks for your efforts!

Thanks,
NeilBrown

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 828 bytes --]

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: Fatal crash/hang in scsi_lib after RAID disk failure
  2012-07-03  6:45     ` NeilBrown
@ 2012-07-03  7:12       ` Christian Balzer
  2012-07-03  7:31         ` NeilBrown
  0 siblings, 1 reply; 6+ messages in thread
From: Christian Balzer @ 2012-07-03  7:12 UTC (permalink / raw)
  To: NeilBrown; +Cc: linux-raid

On Tue, 3 Jul 2012 16:45:28 +1000 NeilBrown wrote:

> On Tue, 3 Jul 2012 15:10:38 +0900 Christian Balzer <chibi@gol.com> wrote:
> 
> > On Tue, 3 Jul 2012 15:50:45 +1000 NeilBrown wrote:
> > 
[snip]
> > > That took *way* to long to find given how simple the fix is.
> > 
> > Well, given how long it takes with some OSS projects, I'd say 4 days is
> > pretty good. ^o^
> 
> I meant the 4 hours of my time searching, not the 4 days of your time
> waiting :-)
> 
Hehehe, if you put it that way... ^o^

> 
> > 
> > > I spent ages staring at the code, as about to reply and so "no idea"
> > > when I thought I should test it myself.  Test failed immediately.
> > 
> > Could you elaborate a bit? 
> > As in, was this something introduced only very recently, since I had
> > dozens of disks fail before w/o any such pyrotechnics. 
> > Or were there some special circumstances that triggered it? 
> > (But looking at the patch, I guess it should have been pretty
> > universal)
> 
> Bug was introduced by commit 58c54fcca3bac5bf9 which first appeared in
> Linux 3.1.  Since then, any read error on RAID10 will trigger the bug.
> 
Ouch, that's a pretty substantial number of machines I'd reckon.

But now I'm even more intrigued, how do you (or the md code) define a read
error then? 
Remember this beauty here, which triggered the hunt and kill of the R10
recovery bug of uneven member sets?
---
Jun 20 18:22:01 borg03b kernel: [1383357.792044] mptscsih: ioc0: attempting task abort! (sc=ffff88023c3c5180)
Jun 20 18:22:01 borg03b kernel: [1383357.792049] sd 8:0:3:0: [sdh] CDB: Synchronize Cache(10): 35 00 00 00 00 00 00 00 00 00
Jun 20 18:22:06 borg03b kernel: [1383362.317346] mptbase: ioc0: LogInfo(0x31140000): Originator={PL}, Code={IO Executed}, SubCode(0
x0000) cb_idx mptscsih_io_done
Jun 20 18:22:06 borg03b kernel: [1383362.317589] mptscsih: ioc0: task abort: SUCCESS (rv=2002) (sc=ffff88023c3c5180)
Jun 20 18:22:06 borg03b kernel: [1383362.567292] mptbase: ioc0: LogInfo(0x31170000): Originator={PL}, Code={IO Device Missing Delay
 Retry}, SubCode(0x0000) cb_idx mptscsih_io_done
Jun 20 18:22:06 borg03b kernel: [1383362.567316] mptscsih: ioc0: attempting target reset! (sc=ffff88023c3c5180)
Jun 20 18:22:06 borg03b kernel: [1383362.567321] sd 8:0:3:0: [sdh] CDB: Synchronize Cache(10): 35 00 00 00 00 00 00 00 00 00
Jun 20 18:22:06 borg03b kernel: [1383362.568040] mptscsih: ioc0: target reset: SUCCESS (sc=ffff88023c3c5180)
Jun 20 18:22:06 borg03b kernel: [1383362.568068] mptscsih: ioc0: attempting host reset! (sc=ffff88023c3c5180)
Jun 20 18:22:06 borg03b kernel: [1383362.568074] mptbase: ioc0: Initiating recovery
Jun 20 18:22:29 borg03b kernel: [1383385.440045] mptscsih: ioc0: host reset: SUCCESS (sc=ffff88023c3c5180)
Jun 20 18:22:29 borg03b kernel: [1383385.484846] Device returned, unsetting inDMD
Jun 20 18:22:39 borg03b kernel: [1383395.448043] sd 8:0:3:0: Device offlined - not ready after error recovery
Jun 20 18:22:39 borg03b kernel: [1383395.448135] sd 8:0:3:0: rejecting I/O to offline device
Jun 20 18:22:39 borg03b kernel: [1383395.452063] end_request: I/O error, dev sdh, sector 71
Jun 20 18:22:39 borg03b kernel: [1383395.452063] md: super_written gets error=-5, uptodate=0
Jun 20 18:22:39 borg03b kernel: [1383395.452063] md/raid10:md3: Disk failure on sdh1, disabling device.
Jun 20 18:22:39 borg03b kernel: [1383395.452063] md/raid10:md3: Operation continuing on 4 devices.
---
That was a 3.2.18 kernel, but it didn't die and neither did the other
cluster member with a very similar failure two weeks earlier. 

So I guess the device getting kicked out by the libsata layer below is
fine, but it returning medium errors triggers the bug?

Anyways, time to patch stuff, thankfully this is the only production
cluster I have with a 3.2 kernel using RAID10. ^.^;

Regards,

Christian
-- 
Christian Balzer        Network/Systems Engineer                
chibi@gol.com   	Global OnLine Japan/Fusion Communications
http://www.gol.com/

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: Fatal crash/hang in scsi_lib after RAID disk failure
  2012-07-03  7:12       ` Christian Balzer
@ 2012-07-03  7:31         ` NeilBrown
  0 siblings, 0 replies; 6+ messages in thread
From: NeilBrown @ 2012-07-03  7:31 UTC (permalink / raw)
  To: Christian Balzer; +Cc: linux-raid

[-- Attachment #1: Type: text/plain, Size: 4308 bytes --]

On Tue, 3 Jul 2012 16:12:00 +0900 Christian Balzer <chibi@gol.com> wrote:

> On Tue, 3 Jul 2012 16:45:28 +1000 NeilBrown wrote:
> 
> > On Tue, 3 Jul 2012 15:10:38 +0900 Christian Balzer <chibi@gol.com> wrote:
> > 
> > > On Tue, 3 Jul 2012 15:50:45 +1000 NeilBrown wrote:
> > > 
> [snip]
> > > > That took *way* to long to find given how simple the fix is.
> > > 
> > > Well, given how long it takes with some OSS projects, I'd say 4 days is
> > > pretty good. ^o^
> > 
> > I meant the 4 hours of my time searching, not the 4 days of your time
> > waiting :-)
> > 
> Hehehe, if you put it that way... ^o^
> 
> > 
> > > 
> > > > I spent ages staring at the code, as about to reply and so "no idea"
> > > > when I thought I should test it myself.  Test failed immediately.
> > > 
> > > Could you elaborate a bit? 
> > > As in, was this something introduced only very recently, since I had
> > > dozens of disks fail before w/o any such pyrotechnics. 
> > > Or were there some special circumstances that triggered it? 
> > > (But looking at the patch, I guess it should have been pretty
> > > universal)
> > 
> > Bug was introduced by commit 58c54fcca3bac5bf9 which first appeared in
> > Linux 3.1.  Since then, any read error on RAID10 will trigger the bug.
> > 
> Ouch, that's a pretty substantial number of machines I'd reckon.

Could be.  But they all seem to have very reliably disks.  Except yours :-)

> 
> But now I'm even more intrigued, how do you (or the md code) define a read
> error then? 

The obvious way I guess.

> Remember this beauty here, which triggered the hunt and kill of the R10
> recovery bug of uneven member sets?

Looks like that was a write error.  They are handled quite differently.

NeilBrown


> ---
> Jun 20 18:22:01 borg03b kernel: [1383357.792044] mptscsih: ioc0: attempting task abort! (sc=ffff88023c3c5180)
> Jun 20 18:22:01 borg03b kernel: [1383357.792049] sd 8:0:3:0: [sdh] CDB: Synchronize Cache(10): 35 00 00 00 00 00 00 00 00 00
> Jun 20 18:22:06 borg03b kernel: [1383362.317346] mptbase: ioc0: LogInfo(0x31140000): Originator={PL}, Code={IO Executed}, SubCode(0
> x0000) cb_idx mptscsih_io_done
> Jun 20 18:22:06 borg03b kernel: [1383362.317589] mptscsih: ioc0: task abort: SUCCESS (rv=2002) (sc=ffff88023c3c5180)
> Jun 20 18:22:06 borg03b kernel: [1383362.567292] mptbase: ioc0: LogInfo(0x31170000): Originator={PL}, Code={IO Device Missing Delay
>  Retry}, SubCode(0x0000) cb_idx mptscsih_io_done
> Jun 20 18:22:06 borg03b kernel: [1383362.567316] mptscsih: ioc0: attempting target reset! (sc=ffff88023c3c5180)
> Jun 20 18:22:06 borg03b kernel: [1383362.567321] sd 8:0:3:0: [sdh] CDB: Synchronize Cache(10): 35 00 00 00 00 00 00 00 00 00
> Jun 20 18:22:06 borg03b kernel: [1383362.568040] mptscsih: ioc0: target reset: SUCCESS (sc=ffff88023c3c5180)
> Jun 20 18:22:06 borg03b kernel: [1383362.568068] mptscsih: ioc0: attempting host reset! (sc=ffff88023c3c5180)
> Jun 20 18:22:06 borg03b kernel: [1383362.568074] mptbase: ioc0: Initiating recovery
> Jun 20 18:22:29 borg03b kernel: [1383385.440045] mptscsih: ioc0: host reset: SUCCESS (sc=ffff88023c3c5180)
> Jun 20 18:22:29 borg03b kernel: [1383385.484846] Device returned, unsetting inDMD
> Jun 20 18:22:39 borg03b kernel: [1383395.448043] sd 8:0:3:0: Device offlined - not ready after error recovery
> Jun 20 18:22:39 borg03b kernel: [1383395.448135] sd 8:0:3:0: rejecting I/O to offline device
> Jun 20 18:22:39 borg03b kernel: [1383395.452063] end_request: I/O error, dev sdh, sector 71
> Jun 20 18:22:39 borg03b kernel: [1383395.452063] md: super_written gets error=-5, uptodate=0
> Jun 20 18:22:39 borg03b kernel: [1383395.452063] md/raid10:md3: Disk failure on sdh1, disabling device.
> Jun 20 18:22:39 borg03b kernel: [1383395.452063] md/raid10:md3: Operation continuing on 4 devices.
> ---
> That was a 3.2.18 kernel, but it didn't die and neither did the other
> cluster member with a very similar failure two weeks earlier. 
> 
> So I guess the device getting kicked out by the libsata layer below is
> fine, but it returning medium errors triggers the bug?
> 
> Anyways, time to patch stuff, thankfully this is the only production
> cluster I have with a 3.2 kernel using RAID10. ^.^;
> 
> Regards,
> 
> Christian


[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 828 bytes --]

^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2012-07-03  7:31 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2012-06-29  0:35 Fatal crash/hang in scsi_lib after RAID disk failure Christian Balzer
2012-07-03  5:50 ` NeilBrown
2012-07-03  6:10   ` Christian Balzer
2012-07-03  6:45     ` NeilBrown
2012-07-03  7:12       ` Christian Balzer
2012-07-03  7:31         ` NeilBrown

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).