libata & scsi error handling

linux-ide.vger.kernel.org archive mirror

* libata & scsi error handling
@ 2004-08-17 21:22 Brad Campbell
  2004-08-18  2:08 ` Jeff Garzik
                   ` (2 more replies)
  0 siblings, 3 replies; 7+ messages in thread
From: Brad Campbell @ 2004-08-17 21:22 UTC (permalink / raw)
  To: Jeff Garzik; +Cc: linux-ide

G'day Jeff

I think I have this timeout error issue pegged now.

I know this is wrong, ugly and likely to cause internal kernel damage, but for the purpose of 
pegging what I think may be the culprit, it works around the error nicely here:

brad@srv:/usr/src$ diff -u temp/linux-2.6.8.1/drivers/scsi/libata-scsi.c 
linux-2.6.8.1/drivers/scsi/libata-scsi.c
--- temp/linux-2.6.8.1/drivers/scsi/libata-scsi.c       2004-08-14 14:55:19.000000000 +0400
+++ linux-2.6.8.1/drivers/scsi/libata-scsi.c    2004-08-18 01:04:11.000000000 +0400
@@ -213,6 +213,7 @@

         ap = (struct ata_port *) &host->hostdata[0];
         ap->ops->eng_timeout(ap);
+       host->host_failed--;

         DPRINTK("EXIT\n");
         return 0;

The issue is that the libata installed eh_strategy_handler does not complete the error as
scsi_unjam_host -> scsi_eh_abort_cmds -> scsi_eh_finish_cmd does.

This leaves shost->host_failed to increment to one above shost->host_busy, which means that in 
scsi_eh_wakeup we never actually wake up the error handler thread after the first error.
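
[Editorial sketch] The counter drift described above can be modelled in a few lines of userspace C. This is only an illustration, not the kernel code: it assumes the error handler thread is woken only when host_busy equals host_failed, and every name in it (host_model, eh_would_wake, cmd_issue, cmd_timeout, eh_complete) is hypothetical.

```c
#include <stdbool.h>

/* Hypothetical userspace model of the two host counters discussed above. */
struct host_model {
	int host_busy;   /* commands currently outstanding on the host */
	int host_failed; /* commands handed over to the error handler */
};

/* Assumption: the EH thread is woken only when every busy command has failed. */
static bool eh_would_wake(const struct host_model *h)
{
	return h->host_busy == h->host_failed;
}

/* A command is issued to the device. */
static void cmd_issue(struct host_model *h)
{
	h->host_busy++;
}

/* A command times out; returns true if the EH thread would be woken. */
static bool cmd_timeout(struct host_model *h)
{
	h->host_failed++;
	return eh_would_wake(h);
}

/*
 * The strategy handler completes the timed-out command.
 * fix_applied == false models the unpatched behaviour, where host_busy
 * drops but host_failed is left alone; true models the one-line patch.
 */
static void eh_complete(struct host_model *h, bool fix_applied)
{
	h->host_busy--;
	if (fix_applied)
		h->host_failed--;
}
```

Under these assumptions, running issue, timeout, complete, issue, timeout without the fix ends with host_busy at 1 and host_failed at 2, so the second timeout never wakes the handler. With the decrement in place both counters return to 0 after the first error and the second timeout wakes it again.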

By adding that line above and doing a
dd if=/dev/sda count=1 > /dev/null

I get constant errors every 20 seconds. That looks right, given it's incrementing the lba by 1 
sector at a time and readahead seems to ask it to read 0x7F sectors. I assume that if I left it be 
it would error out after 0x7F retries and then die. If I plug the cable back in, boom, dd drops a 
read error and we are back in business.

I'm not sure where to go from here as I can't seem to find a way to call scsi_eh_finish_cmd from 
within libata-scsi and I'm really well out of my depth here. I hope I can at least contribute to 
debugging.

Regards,
Brad


* Re: libata & scsi error handling
  2004-08-17 21:22 libata & scsi error handling Brad Campbell
@ 2004-08-18  2:08 ` Jeff Garzik
  2004-08-18  5:11   ` Douglas Gilbert
  2004-08-18  7:04   ` Brad Campbell
  2004-08-18  5:32 ` Jeff Garzik
  2004-08-19 11:49 ` Kevin Shanahan
  2 siblings, 2 replies; 7+ messages in thread
From: Jeff Garzik @ 2004-08-18  2:08 UTC (permalink / raw)
  To: Brad Campbell; +Cc: linux-ide, SCSI Mailing List

Brad Campbell wrote:
> I think I have this timeout error issue pegged now.
> 
> I know this is both wrong, ugly and likely to cause internal kernel 
> damage, but for the purpose of pegging what I think may be the culprit 
> it works around the error nicely here
> 
> brad@srv:/usr/src$ diff -u temp/linux-2.6.8.1/drivers/scsi/libata-scsi.c 
> linux-2.6.8.1/drivers/scsi/libata-scsi.c
> --- temp/linux-2.6.8.1/drivers/scsi/libata-scsi.c       2004-08-14 
> 14:55:19.000000000 +0400
> +++ linux-2.6.8.1/drivers/scsi/libata-scsi.c    2004-08-18 
> 01:04:11.000000000 +0400
> @@ -213,6 +213,7 @@
> 
>         ap = (struct ata_port *) &host->hostdata[0];
>         ap->ops->eng_timeout(ap);
> +       host->host_failed--;
> 
>         DPRINTK("EXIT\n");
>         return 0;
> 
> The issue is that the libata installed eh_strategy_handler does not 
> complete the error as
> scsi_unjam_host -> scsi_eh_abort_cmds -> scsi_eh_finish_cmd does.

Well, well, well.  If I had a libata Honorary Hacker merit badge, I 
would give it to you.

It is highly likely that your patch is doing the right thing.  Doug 
Ledford, 2.4.x SCSI maintainer, pointed out to me recently that my 2.4.x 
error handling code MUST update a couple variables, otherwise error 
handling would hang as you see.  The reason is that scsi_unjam_host(), 
on both 2.4.x and 2.6.x, was the only ->eh_strategy_handler until libata 
came along.

So, it is likely that there are a few details that scsi_unjam_host() 
performs that libata needs to do too.

Thanks much for your excellent detective work, I'll see where to best 
put this change...

	Jeff


* Re: libata & scsi error handling
  2004-08-18  2:08 ` Jeff Garzik
@ 2004-08-18  5:11   ` Douglas Gilbert
  2004-08-18  5:31     ` Jeff Garzik
  2004-08-18  7:04   ` Brad Campbell
  1 sibling, 1 reply; 7+ messages in thread
From: Douglas Gilbert @ 2004-08-18  5:11 UTC (permalink / raw)
  To: Jeff Garzik; +Cc: Brad Campbell, linux-ide, SCSI Mailing List

[-- Attachment #1: Type: text/plain, Size: 2077 bytes --]

Jeff Garzik wrote:
> Brad Campbell wrote:
> 
>> I think I have this timeout error issue pegged now.
>>
>> I know this is both wrong, ugly and likely to cause internal kernel 
>> damage, but for the purpose of pegging what I think may be the culprit 
>> it works around the error nicely here
>>
>> brad@srv:/usr/src$ diff -u 
>> temp/linux-2.6.8.1/drivers/scsi/libata-scsi.c 
>> linux-2.6.8.1/drivers/scsi/libata-scsi.c
>> --- temp/linux-2.6.8.1/drivers/scsi/libata-scsi.c       2004-08-14 
>> 14:55:19.000000000 +0400
>> +++ linux-2.6.8.1/drivers/scsi/libata-scsi.c    2004-08-18 
>> 01:04:11.000000000 +0400
>> @@ -213,6 +213,7 @@
>>
>>         ap = (struct ata_port *) &host->hostdata[0];
>>         ap->ops->eng_timeout(ap);
>> +       host->host_failed--;
>>
>>         DPRINTK("EXIT\n");
>>         return 0;
>>
>> The issue is that the libata installed eh_strategy_handler does not 
>> complete the error as
>> scsi_unjam_host -> scsi_eh_abort_cmds -> scsi_eh_finish_cmd does.
> 
> 
> 
> Well, well, well.  If I had a libata Honorary Hacker merit badge, I 
> would give it to you.
> 
> It is highly likely that your patch is doing the right thing.  Doug 
> Ledford, 2.4.x SCSI maintainer, pointed out to me recently that my 2.4.x 
> error handling code MUST update a couple variables, otherwise error 
> handling would hang as you see.  The reason is that scsi_unjam_host(), 
> on both 2.4.x and 2.6.x, is the only ->eh_strategy_handler until libata 
> came along.
> 
> So, it is likely that there are a few details that scsi_unjam_host() 
> performs that libata needs to do too.
> 
> Thanks much for your excellent detective work, I'll see where to best 
> put this change...

Jeff,
It probably doesn't rate any gold stars, but while you're patching
libata-scsi.c could you slip this fix in as well?

The patch is against lk 2.6.8.1. The same patch is needed
(give or take fuzz) in lk 2.4.27.

Changes:
    - send vendor, product and rev strings back for 36 byte
      INQUIRYs
    - set the additional length field to indicate 96 byte
      response is available

Doug Gilbert

[-- Attachment #2: libata-scsi2681.diff --]
[-- Type: text/x-patch, Size: 565 bytes --]

--- linux/drivers/scsi/libata-scsi.c	2004-08-14 21:12:42.000000000 +1000
+++ linux/drivers/scsi/libata-scsi.c2681dpg	2004-08-17 22:00:59.501464824 +1000
@@ -534,7 +534,7 @@
 		0,
 		0x5,	/* claim SPC-3 version compatibility */
 		2,
-		96 - 4
+		95 - 4
 	};
 
 	/* set scsi removeable (RMB) bit per ata bit */
@@ -545,7 +545,7 @@
 
 	memcpy(rbuf, hdr, sizeof(hdr));
 
-	if (buflen > 36) {
+	if (buflen > 35) {
 		memcpy(&rbuf[8], "ATA     ", 8);
 		ata_dev_id_string(dev, &rbuf[16], ATA_ID_PROD_OFS, 16);
 		ata_dev_id_string(dev, &rbuf[32], ATA_ID_FW_REV_OFS, 4);
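
[Editorial sketch] The arithmetic behind the two changed constants can be checked in isolation. This is a standalone illustration, not code from libata: it assumes the usual SPC layout in which byte 4 of the INQUIRY data holds ADDITIONAL LENGTH (the number of bytes that follow byte 4) and the standard data occupies bytes 0..35, with the revision string ending at byte 35. The names inq_additional_length and inq_has_room_for_strings are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>

enum {
	INQ_STD_LEN     = 36, /* standard INQUIRY data: bytes 0..35 */
	INQ_ADD_LEN_OFS = 4   /* byte 4 holds the ADDITIONAL LENGTH field */
};

/*
 * ADDITIONAL LENGTH counts the bytes after byte 4, so for a response of
 * total_len bytes it is total_len - 5, i.e. (last byte index) - 4.
 */
static unsigned char inq_additional_length(size_t total_len)
{
	return (unsigned char)(total_len - (INQ_ADD_LEN_OFS + 1));
}

/*
 * The vendor (8..15), product (16..31) and revision (32..35) strings fit
 * as soon as the buffer holds 36 bytes, hence "buflen > 35".
 */
static bool inq_has_room_for_strings(size_t buflen)
{
	return buflen >= INQ_STD_LEN;
}
```

For a 96 byte response this gives 91, which is the 95 - 4 in the patch, and a 36 byte buffer now passes the check where the old "buflen > 36" test rejected it.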


* Re: libata & scsi error handling
  2004-08-18  5:11   ` Douglas Gilbert
@ 2004-08-18  5:31     ` Jeff Garzik
  0 siblings, 0 replies; 7+ messages in thread
From: Jeff Garzik @ 2004-08-18  5:31 UTC (permalink / raw)
  To: dougg; +Cc: Brad Campbell, linux-ide, SCSI Mailing List

applied



* Re: libata & scsi error handling
  2004-08-18  2:08 ` Jeff Garzik
  2004-08-18  5:11   ` Douglas Gilbert
@ 2004-08-18  7:04   ` Brad Campbell
  1 sibling, 0 replies; 7+ messages in thread
From: Brad Campbell @ 2004-08-18  7:04 UTC (permalink / raw)
  To: Jeff Garzik; +Cc: linux-ide, SCSI Mailing List

Jeff Garzik wrote:

> 
> It is highly likely that your patch is doing the right thing.  Doug 
> Ledford, 2.4.x SCSI maintainer, pointed out to me recently that my 2.4.x 
> error handling code MUST update a couple variables, otherwise error 
> handling would hang as you see.  The reason is that scsi_unjam_host(), 
> on both 2.4.x and 2.6.x, is the only ->eh_strategy_handler until libata 
> came along.
> 
> So, it is likely that there are a few details that scsi_unjam_host() 
> performs that libata needs to do too.

Possibly stupid question time. (What I know about the SCSI stack could be written on the back of a 
matchbox)

I'm a little concerned about this bit here. (This is the end of the first command and then the 
timeout related to it).


Aug 18 01:54:48 srv kernel: ata_dev_select: ENTER, ata13: device 0, wait 1
Aug 18 01:54:48 srv kernel: ata_tf_load_pio: hob: feat 0x0 nsect 0x0, lba 0x0 0x0 0x0
Aug 18 01:54:48 srv kernel: ata_tf_load_pio: feat 0x0 nsect 0x80 lba 0x0 0x0 0x0
Aug 18 01:54:48 srv kernel: ata_tf_load_pio: device 0xE0
Aug 18 01:54:48 srv kernel: ata_exec_command_pio: ata13: cmd 0x25
Aug 18 01:54:48 srv kernel: ata_scsi_translate: EXIT
Aug 18 01:54:48 srv kernel: scsi_dispatch_cmd out
Aug 18 00:43:41 srv kernel: scsi_times_out
Aug 18 00:43:41 srv kernel: scsi_eh_scmd_add

Here the scmd that failed gets added to a list.

         list_add_tail(&scmd->eh_entry, &shost->eh_cmd_q);

Because scsi_eh_finish_cmd never runs it will never get removed from the list. Am I missing something?
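
[Editorial sketch] The concern above can be pictured with a small userspace queue. This is a hedged model, not the SCSI midlayer: it assumes that queueing a failed command also bumps host_failed, and that finishing a command both unlinks it from the queue and drops the counter. All names (cmd_model, eh_host_model, eh_add, eh_finish_oldest, eh_pending) are hypothetical.

```c
#include <stddef.h>

/* Hypothetical stand-in for a failed command queued for error handling. */
struct cmd_model {
	struct cmd_model *next;
};

/* Hypothetical stand-in for the host: a failed-command queue and its counter. */
struct eh_host_model {
	struct cmd_model *eh_q_head;
	struct cmd_model *eh_q_tail;
	int host_failed;
};

/* Queue a failed command at the tail, mirroring the list_add_tail() above. */
static void eh_add(struct eh_host_model *h, struct cmd_model *c)
{
	c->next = NULL;
	if (h->eh_q_tail)
		h->eh_q_tail->next = c;
	else
		h->eh_q_head = c;
	h->eh_q_tail = c;
	h->host_failed++;
}

/* Finish the oldest failed command: unlink it and drop the counter. */
static struct cmd_model *eh_finish_oldest(struct eh_host_model *h)
{
	struct cmd_model *c = h->eh_q_head;

	if (!c)
		return NULL;
	h->eh_q_head = c->next;
	if (!h->eh_q_head)
		h->eh_q_tail = NULL;
	c->next = NULL;
	h->host_failed--;
	return c;
}

/* Number of commands still sitting on the error-handling queue. */
static size_t eh_pending(const struct eh_host_model *h)
{
	size_t n = 0;

	for (const struct cmd_model *c = h->eh_q_head; c; c = c->next)
		n++;
	return n;
}
```

In this model a handler that only decrements host_failed, as the one-line patch does, would leave eh_pending() non-zero after every error, which is the leftover list entry the question is about.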

Aug 18 00:43:41 srv kernel: scsi_eh_scmd_after return 0
Aug 18 00:43:41 srv kernel: host_busy 1, host_failed 1
Aug 18 00:43:41 srv kernel: scsi_times_out out
Aug 18 00:43:41 srv kernel: wake eh_strategy_handler
Aug 18 00:43:41 srv kernel: hit eh_strategy_handler
Aug 18 00:43:41 srv kernel: eh_strategy_handler 1
Aug 18 00:43:41 srv kernel: ata_scsi_error: ENTER
Aug 18 00:43:41 srv kernel: ata_eng_timeout: ENTER
Aug 18 00:43:41 srv kernel: ata_qc_timeout: ENTER
Aug 18 00:43:41 srv kernel: ata13: command 0x25 timeout, stat 0xd0 host_stat 0x1
Aug 18 00:43:41 srv kernel: ata_sg_clean: unmapping 128 sg elements
Aug 18 00:43:41 srv kernel: scsi_device_unbusy
Aug 18 00:43:41 srv kernel: host_busy 0, host_failed 1
Aug 18 00:43:41 srv kernel: scsi12: ERROR on channel 0, id 0, lun 0, CDB: Read (10) 00 00 00 00 00 00 00 80 00
Aug 18 00:43:41 srv kernel: Current sda: sense key Medium Error
Aug 18 00:43:41 srv kernel: Additional sense: Unrecovered read error - auto reallocate failed
Aug 18 00:43:41 srv kernel: end_request: I/O error, dev sda, sector 0
Aug 18 00:43:41 srv kernel: Buffer I/O error on device sda, logical block 0
Aug 18 00:43:41 srv kernel: ata_qc_timeout: EXIT
Aug 18 00:43:41 srv kernel: ata_eng_timeout: EXIT
Aug 18 00:43:41 srv kernel: ata_scsi_error: EXIT
Aug 18 00:43:41 srv kernel: eh_strategy_handler 2
Aug 18 00:43:41 srv kernel: eh_strategy_handler 3
Aug 18 00:43:41 srv kernel: scsi_dispatch_cmd
Aug 18 00:43:41 srv kernel: Add Timer
Aug 18 00:43:41 srv kernel: After Add Timer
Aug 18 01:55:14 srv kernel: ata_scsi_dump_cdb: CDB (13:0,0,0) 28 00 00 00 00 01 00 00 7f
Aug 18 01:55:14 srv kernel: ata_scsi_translate: ENTER
Aug 18 01:55:14 srv kernel: ata_scsi_rw_xlat: ten-byte command
Aug 18 01:55:14 srv kernel: ata_sg_setup: ENTER, ata13
Aug 18 01:55:14 srv kernel: ata_sg_setup: 127 sg elements mapped

Regards,
Brad


* Re: libata & scsi error handling
  2004-08-17 21:22 libata & scsi error handling Brad Campbell
  2004-08-18  2:08 ` Jeff Garzik
@ 2004-08-18  5:32 ` Jeff Garzik
  2004-08-19 11:49 ` Kevin Shanahan
  2 siblings, 0 replies; 7+ messages in thread
From: Jeff Garzik @ 2004-08-18  5:32 UTC (permalink / raw)
  To: Brad Campbell; +Cc: linux-ide

applied (a version of this)



* Re: libata & scsi error handling
  2004-08-17 21:22 libata & scsi error handling Brad Campbell
  2004-08-18  2:08 ` Jeff Garzik
  2004-08-18  5:32 ` Jeff Garzik
@ 2004-08-19 11:49 ` Kevin Shanahan
  2 siblings, 0 replies; 7+ messages in thread
From: Kevin Shanahan @ 2004-08-19 11:49 UTC (permalink / raw)
  To: linux-ide

On Wed, 2004-08-18 at 06:52, Brad Campbell wrote:
> brad@srv:/usr/src$ diff -u temp/linux-2.6.8.1/drivers/scsi/libata-scsi.c 
> linux-2.6.8.1/drivers/scsi/libata-scsi.c
> --- temp/linux-2.6.8.1/drivers/scsi/libata-scsi.c       2004-08-14 14:55:19.000000000 +0400
> +++ linux-2.6.8.1/drivers/scsi/libata-scsi.c    2004-08-18 01:04:11.000000000 +0400
> @@ -213,6 +213,7 @@
> 
>          ap = (struct ata_port *) &host->hostdata[0];
>          ap->ops->eng_timeout(ap);
> +       host->host_failed--;
> 
>          DPRINTK("EXIT\n");
>          return 0;

Thanks for this Brad - great detective work 8)

This got me going again, so the reads do make progress. I'm looking at
what's happening with my dd_rescue trying to read from the disk and it
seems a little strange.

ptrace shows dd_rescue is calling pread to read 512 bytes from
sector-aligned locations. For each pread call I can see, syslog shows the sata
driver attempting to read several sectors (difficult to count, but it's
probably between 8 and 32 - maybe this is readahead?).

Even if the request was longer than one sector, is there any reason not
to abort the entire read request once an unreadable sector is
encountered? pread will just end up returning -1 anyway.

Do other disk drivers (libata, ide or others) behave the same way?

I ask because it has now taken approximately 24 hours for dd_rescue to
read ~250 (bad) sectors from this disk (I hope I'm not getting too far
off topic).

Thanks,
Kevin.

