* libata & scsi error handling
@ 2004-08-17 21:22 Brad Campbell
2004-08-18 2:08 ` Jeff Garzik
` (2 more replies)
0 siblings, 3 replies; 7+ messages in thread
From: Brad Campbell @ 2004-08-17 21:22 UTC (permalink / raw)
To: Jeff Garzik; +Cc: linux-ide
G'day Jeff
I think I have this timeout error issue pegged now.
I know this is both wrong, ugly and likely to cause internal kernel damage, but for the purpose of
pegging what I think may be the culprit it works around the error nicely here
brad@srv:/usr/src$ diff -u temp/linux-2.6.8.1/drivers/scsi/libata-scsi.c
linux-2.6.8.1/drivers/scsi/libata-scsi.c
--- temp/linux-2.6.8.1/drivers/scsi/libata-scsi.c 2004-08-14 14:55:19.000000000 +0400
+++ linux-2.6.8.1/drivers/scsi/libata-scsi.c 2004-08-18 01:04:11.000000000 +0400
@@ -213,6 +213,7 @@
ap = (struct ata_port *) &host->hostdata[0];
ap->ops->eng_timeout(ap);
+ host->host_failed--;
DPRINTK("EXIT\n");
return 0;
The issue is that the libata installed eh_strategy_handler does not complete the error as
scsi_unjam_host -> scsi_eh_abort_cmds -> scsi_eh_finish_cmd does.
This leaves shost->host_failed to increment to one above shost->host_busy which means in
scsi_eh_wakeup we never actually wakeup the error handler thread after the first error.
By adding that line above and doing a
dd if=/dev/sda count=1 > /dev/null
I get constant errors every 20 seconds (which is right given it's incrementing lba by 1 sector at a
time and readahead seems to ask it to read 0x7F. I assume if I left it be it would error out after
0x7F retries and then die.) If I plug the cable back in, boom dd drops a read error and we are back
in business.
I'm not sure where to go from here as I can't seem to find a way to call scsi_eh_finish_cmd from
within libata-scsi and I'm really well out of my depth here. I hope I can at least contribute to
debugging.
Regards,
Brad
^ permalink raw reply [flat|nested] 7+ messages in thread* Re: libata & scsi error handling 2004-08-17 21:22 libata & scsi error handling Brad Campbell @ 2004-08-18 2:08 ` Jeff Garzik 2004-08-18 5:11 ` Douglas Gilbert 2004-08-18 7:04 ` Brad Campbell 2004-08-18 5:32 ` Jeff Garzik 2004-08-19 11:49 ` Kevin Shanahan 2 siblings, 2 replies; 7+ messages in thread From: Jeff Garzik @ 2004-08-18 2:08 UTC (permalink / raw) To: Brad Campbell; +Cc: linux-ide, SCSI Mailing List Brad Campbell wrote: > I think I have this timeout error issue pegged now. > > I know this is both wrong, ugly and likely to cause internal kernel > damage, but for the purpose of pegging what I think may be the culprit > it works around the error nicely here > > brad@srv:/usr/src$ diff -u temp/linux-2.6.8.1/drivers/scsi/libata-scsi.c > linux-2.6.8.1/drivers/scsi/libata-scsi.c > --- temp/linux-2.6.8.1/drivers/scsi/libata-scsi.c 2004-08-14 > 14:55:19.000000000 +0400 > +++ linux-2.6.8.1/drivers/scsi/libata-scsi.c 2004-08-18 > 01:04:11.000000000 +0400 > @@ -213,6 +213,7 @@ > > ap = (struct ata_port *) &host->hostdata[0]; > ap->ops->eng_timeout(ap); > + host->host_failed--; > > DPRINTK("EXIT\n"); > return 0; > > The issue is that the libata installed eh_strategy_handler does not > complete the error as > scsi_unjam_host -> scsi_eh_abort_cmds -> scsi_eh_finish_cmd does. Well, well, well. If I had a libata Honorary Hacker merit badge, I would give it to you. It is highly likely that your patch is doing the right thing. Doug Ledford, 2.4.x SCSI maintainer, pointed out to me recently that my 2.4.x error handling code MUST update a couple variables, otherwise error handling would hang as you see. The reason is that scsi_unjam_host(), on both 2.4.x and 2.6.x, is the only ->eh_strategy_handler until libata came along. So, it is likely that there are a few details the scsi_unjam_host() performs, that needs to do too. Thanks much for your excellent detective work, I'll see where to best put this change... Jeff ^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: libata & scsi error handling 2004-08-18 2:08 ` Jeff Garzik @ 2004-08-18 5:11 ` Douglas Gilbert 2004-08-18 5:31 ` Jeff Garzik 2004-08-18 7:04 ` Brad Campbell 1 sibling, 1 reply; 7+ messages in thread From: Douglas Gilbert @ 2004-08-18 5:11 UTC (permalink / raw) To: Jeff Garzik; +Cc: Brad Campbell, linux-ide, SCSI Mailing List [-- Attachment #1: Type: text/plain, Size: 2077 bytes --] Jeff Garzik wrote: > Brad Campbell wrote: > >> I think I have this timeout error issue pegged now. >> >> I know this is both wrong, ugly and likely to cause internal kernel >> damage, but for the purpose of pegging what I think may be the culprit >> it works around the error nicely here >> >> brad@srv:/usr/src$ diff -u >> temp/linux-2.6.8.1/drivers/scsi/libata-scsi.c >> linux-2.6.8.1/drivers/scsi/libata-scsi.c >> --- temp/linux-2.6.8.1/drivers/scsi/libata-scsi.c 2004-08-14 >> 14:55:19.000000000 +0400 >> +++ linux-2.6.8.1/drivers/scsi/libata-scsi.c 2004-08-18 >> 01:04:11.000000000 +0400 >> @@ -213,6 +213,7 @@ >> >> ap = (struct ata_port *) &host->hostdata[0]; >> ap->ops->eng_timeout(ap); >> + host->host_failed--; >> >> DPRINTK("EXIT\n"); >> return 0; >> >> The issue is that the libata installed eh_strategy_handler does not >> complete the error as >> scsi_unjam_host -> scsi_eh_abort_cmds -> scsi_eh_finish_cmd does. > > > > Well, well, well. If I had a libata Honorary Hacker merit badge, I > would give it to you. > > It is highly likely that your patch is doing the right thing. Doug > Ledford, 2.4.x SCSI maintainer, pointed out to me recently that my 2.4.x > error handling code MUST update a couple variables, otherwise error > handling would hang as you see. The reason is that scsi_unjam_host(), > on both 2.4.x and 2.6.x, is the only ->eh_strategy_handler until libata > came along. > > So, it is likely that there are a few details the scsi_unjam_host() > performs, that needs to do too. > > Thanks much for your excellent detective work, I'll see where to best > put this change... Jeff, It probably doesn't rate any gold stars but while your patching libata-scsi.c could you slip this fix in as well. The patch is against lk 2.6.8.1 . The same patch is needed (give or take fuzz) in lk 2.4.27 . Changes: - send vendor, product and rev strings back for 36 byte INQUIRYs - set the additional length field to indicate 96 byte response is available Doug Gilbert [-- Attachment #2: libata-scsi2681.diff --] [-- Type: text/x-patch, Size: 565 bytes --] --- linux/drivers/scsi/libata-scsi.c 2004-08-14 21:12:42.000000000 +1000 +++ linux/drivers/scsi/libata-scsi.c2681dpg 2004-08-17 22:00:59.501464824 +1000 @@ -534,7 +534,7 @@ 0, 0x5, /* claim SPC-3 version compatibility */ 2, - 96 - 4 + 95 - 4 }; /* set scsi removeable (RMB) bit per ata bit */ @@ -545,7 +545,7 @@ memcpy(rbuf, hdr, sizeof(hdr)); - if (buflen > 36) { + if (buflen > 35) { memcpy(&rbuf[8], "ATA ", 8); ata_dev_id_string(dev, &rbuf[16], ATA_ID_PROD_OFS, 16); ata_dev_id_string(dev, &rbuf[32], ATA_ID_FW_REV_OFS, 4); ^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: libata & scsi error handling 2004-08-18 5:11 ` Douglas Gilbert @ 2004-08-18 5:31 ` Jeff Garzik 0 siblings, 0 replies; 7+ messages in thread From: Jeff Garzik @ 2004-08-18 5:31 UTC (permalink / raw) To: dougg; +Cc: Brad Campbell, linux-ide, SCSI Mailing List applied ^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: libata & scsi error handling 2004-08-18 2:08 ` Jeff Garzik 2004-08-18 5:11 ` Douglas Gilbert @ 2004-08-18 7:04 ` Brad Campbell 1 sibling, 0 replies; 7+ messages in thread From: Brad Campbell @ 2004-08-18 7:04 UTC (permalink / raw) To: Jeff Garzik; +Cc: linux-ide, SCSI Mailing List Jeff Garzik wrote: > > It is highly likely that your patch is doing the right thing. Doug > Ledford, 2.4.x SCSI maintainer, pointed out to me recently that my 2.4.x > error handling code MUST update a couple variables, otherwise error > handling would hang as you see. The reason is that scsi_unjam_host(), > on both 2.4.x and 2.6.x, is the only ->eh_strategy_handler until libata > came along. > > So, it is likely that there are a few details the scsi_unjam_host() > performs, that needs to do too. Possibly stupid question time. (What I know about the SCSI stack could be written on the back of a matchbox) I'm a little concerned about this bit here. (This is the end of the first command and then the timeout related to it). Aug 18 01:54:48 srv kernel: ata_dev_select: ENTER, ata13: device 0, wait 1 Aug 18 01:54:48 srv kernel: ata_tf_load_pio: hob: feat 0x0 nsect 0x0, lba 0x0 0x0 0x0 Aug 18 01:54:48 srv kernel: ata_tf_load_pio: feat 0x0 nsect 0x80 lba 0x0 0x0 0x0 Aug 18 01:54:48 srv kernel: ata_tf_load_pio: device 0xE0 Aug 18 01:54:48 srv kernel: ata_exec_command_pio: ata13: cmd 0x25 Aug 18 01:54:48 srv kernel: ata_scsi_translate: EXIT Aug 18 01:54:48 srv kernel: scsi_dispatch_cmd out Aug 18 00:43:41 srv kernel: scsi_times_out Aug 18 00:43:41 srv kernel: scsi_eh_scmd_add Here the scmd that failed gets added to a list. list_add_tail(&scmd->eh_entry, &shost->eh_cmd_q); Because scsi_eh_finish_cmd never runs it will never get removed from the list. Am I missing something? Aug 18 00:43:41 srv kernel: scsi_eh_scmd_after return 0 Aug 18 00:43:41 srv kernel: host_busy 1, host_failed 1 Aug 18 00:43:41 srv kernel: scsi_times_out out Aug 18 00:43:41 srv kernel: wake eh_strategy_handler Aug 18 00:43:41 srv kernel: hit eh_strategy_handler Aug 18 00:43:41 srv kernel: eh_strategy_handler 1 Aug 18 00:43:41 srv kernel: ata_scsi_error: ENTER Aug 18 00:43:41 srv kernel: ata_eng_timeout: ENTER Aug 18 00:43:41 srv kernel: ata_qc_timeout: ENTER Aug 18 00:43:41 srv kernel: ata13: command 0x25 timeout, stat 0xd0 host_stat 0x1 Aug 18 00:43:41 srv kernel: ata_sg_clean: unmapping 128 sg elements Aug 18 00:43:41 srv kernel: scsi_device_unbusy Aug 18 00:43:41 srv kernel: host_busy 0, host_failed 1 Aug 18 00:43:41 srv kernel: scsi12: ERROR on channel 0, id 0, lun 0, CDB: Read (10) 00 00 00 00 00 00 00 80 00 Aug 18 00:43:41 srv kernel: Current sda: sense key Medium Error Aug 18 00:43:41 srv kernel: Additional sense: Unrecovered read error - auto reallocate failed Aug 18 00:43:41 srv kernel: end_request: I/O error, dev sda, sector 0 Aug 18 00:43:41 srv kernel: Buffer I/O error on device sda, logical block 0 Aug 18 00:43:41 srv kernel: ata_qc_timeout: EXIT Aug 18 00:43:41 srv kernel: ata_eng_timeout: EXIT Aug 18 00:43:41 srv kernel: ata_scsi_error: EXIT Aug 18 00:43:41 srv kernel: eh_strategy_handler 2 Aug 18 00:43:41 srv kernel: eh_strategy_handler 3 Aug 18 00:43:41 srv kernel: scsi_dispatch_cmd Aug 18 00:43:41 srv kernel: Add Timer Aug 18 00:43:41 srv kernel: After Add Timer Aug 18 01:55:14 srv kernel: ata_scsi_dump_cdb: CDB (13:0,0,0) 28 00 00 00 00 01 00 00 7f Aug 18 01:55:14 srv kernel: ata_scsi_translate: ENTER Aug 18 01:55:14 srv kernel: ata_scsi_rw_xlat: ten-byte command Aug 18 01:55:14 srv kernel: ata_sg_setup: ENTER, ata13 Aug 18 01:55:14 srv kernel: ata_sg_setup: 127 sg elements mapped Regards, Brad ^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: libata & scsi error handling 2004-08-17 21:22 libata & scsi error handling Brad Campbell 2004-08-18 2:08 ` Jeff Garzik @ 2004-08-18 5:32 ` Jeff Garzik 2004-08-19 11:49 ` Kevin Shanahan 2 siblings, 0 replies; 7+ messages in thread From: Jeff Garzik @ 2004-08-18 5:32 UTC (permalink / raw) To: Brad Campbell; +Cc: linux-ide applied (a version of this) ^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: libata & scsi error handling 2004-08-17 21:22 libata & scsi error handling Brad Campbell 2004-08-18 2:08 ` Jeff Garzik 2004-08-18 5:32 ` Jeff Garzik @ 2004-08-19 11:49 ` Kevin Shanahan 2 siblings, 0 replies; 7+ messages in thread From: Kevin Shanahan @ 2004-08-19 11:49 UTC (permalink / raw) To: linux-ide On Wed, 2004-08-18 at 06:52, Brad Campbell wrote: > brad@srv:/usr/src$ diff -u temp/linux-2.6.8.1/drivers/scsi/libata-scsi.c > linux-2.6.8.1/drivers/scsi/libata-scsi.c > --- temp/linux-2.6.8.1/drivers/scsi/libata-scsi.c 2004-08-14 14:55:19.000000000 +0400 > +++ linux-2.6.8.1/drivers/scsi/libata-scsi.c 2004-08-18 01:04:11.000000000 +0400 > @@ -213,6 +213,7 @@ > > ap = (struct ata_port *) &host->hostdata[0]; > ap->ops->eng_timeout(ap); > + host->host_failed--; > > DPRINTK("EXIT\n"); > return 0; Thanks for this Brad - great detective work 8) This got me going again, so the reads do make progress. I'm looking at what's happening with my dd_rescue trying to read from the disk and it seems a little strange. ptrace shows dd_rescue is calling pread to read 512 bytes from a sector aligned locations. For each pread call I can see, syslog shows the sata driver attempting to read several sectors (difficult to count, but it's probably between 8 and 32 - maybe this is readahead?). Even if the request was longer than one sector, is there any reason not to abort the entire read request once an unreadable sector is encountered? pread will just end up returning -1 anyway. Do other disk drivers (libata, ide or others) behave the same way? I ask because it has now taken approximately 24 hours for dd_rescue to read ~250 (bad) sectors from this disk (I hope I'm not getting too far off topic). Thanks, Kevin. ^ permalink raw reply [flat|nested] 7+ messages in thread
end of thread, other threads:[~2004-08-19 11:47 UTC | newest] Thread overview: 7+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2004-08-17 21:22 libata & scsi error handling Brad Campbell 2004-08-18 2:08 ` Jeff Garzik 2004-08-18 5:11 ` Douglas Gilbert 2004-08-18 5:31 ` Jeff Garzik 2004-08-18 7:04 ` Brad Campbell 2004-08-18 5:32 ` Jeff Garzik 2004-08-19 11:49 ` Kevin Shanahan
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).