linux-scsi.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* Re: libata & scsi error handling
       [not found] <4122771A.4070203@wasp.net.au>
@ 2004-08-18  2:08 ` Jeff Garzik
  2004-08-18  5:11   ` Douglas Gilbert
  2004-08-18  7:04   ` Brad Campbell
  0 siblings, 2 replies; 4+ messages in thread
From: Jeff Garzik @ 2004-08-18  2:08 UTC (permalink / raw)
  To: Brad Campbell; +Cc: linux-ide, SCSI Mailing List

Brad Campbell wrote:
> I think I have this timeout error issue pegged now.
> 
> I know this is both wrong, ugly and likely to cause internal kernel 
> damage, but for the purpose of pegging what I think may be the culprit 
> it works around the error nicely here
> 
> brad@srv:/usr/src$ diff -u temp/linux-2.6.8.1/drivers/scsi/libata-scsi.c 
> linux-2.6.8.1/drivers/scsi/libata-scsi.c
> --- temp/linux-2.6.8.1/drivers/scsi/libata-scsi.c       2004-08-14 
> 14:55:19.000000000 +0400
> +++ linux-2.6.8.1/drivers/scsi/libata-scsi.c    2004-08-18 
> 01:04:11.000000000 +0400
> @@ -213,6 +213,7 @@
> 
>         ap = (struct ata_port *) &host->hostdata[0];
>         ap->ops->eng_timeout(ap);
> +       host->host_failed--;
> 
>         DPRINTK("EXIT\n");
>         return 0;
> 
> The issue is that the libata installed eh_strategy_handler does not 
> complete the error as
> scsi_unjam_host -> scsi_eh_abort_cmds -> scsi_eh_finish_cmd does.


Well, well, well.  If I had a libata Honorary Hacker merit badge, I 
would give it to you.

It is highly likely that your patch is doing the right thing.  Doug 
Ledford, 2.4.x SCSI maintainer, pointed out to me recently that my 2.4.x 
error handling code MUST update a couple variables, otherwise error 
handling would hang as you see.  The reason is that scsi_unjam_host(), 
on both 2.4.x and 2.6.x, is the only ->eh_strategy_handler until libata 
came along.

So, it is likely that there are a few details the scsi_unjam_host() 
performs, that needs to do too.

Thanks much for your excellent detective work, I'll see where to best 
put this change...


	Jeff


^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: libata & scsi error handling
  2004-08-18  2:08 ` libata & scsi error handling Jeff Garzik
@ 2004-08-18  5:11   ` Douglas Gilbert
  2004-08-18  5:31     ` Jeff Garzik
  2004-08-18  7:04   ` Brad Campbell
  1 sibling, 1 reply; 4+ messages in thread
From: Douglas Gilbert @ 2004-08-18  5:11 UTC (permalink / raw)
  To: Jeff Garzik; +Cc: Brad Campbell, linux-ide, SCSI Mailing List

[-- Attachment #1: Type: text/plain, Size: 2077 bytes --]

Jeff Garzik wrote:
> Brad Campbell wrote:
> 
>> I think I have this timeout error issue pegged now.
>>
>> I know this is both wrong, ugly and likely to cause internal kernel 
>> damage, but for the purpose of pegging what I think may be the culprit 
>> it works around the error nicely here
>>
>> brad@srv:/usr/src$ diff -u 
>> temp/linux-2.6.8.1/drivers/scsi/libata-scsi.c 
>> linux-2.6.8.1/drivers/scsi/libata-scsi.c
>> --- temp/linux-2.6.8.1/drivers/scsi/libata-scsi.c       2004-08-14 
>> 14:55:19.000000000 +0400
>> +++ linux-2.6.8.1/drivers/scsi/libata-scsi.c    2004-08-18 
>> 01:04:11.000000000 +0400
>> @@ -213,6 +213,7 @@
>>
>>         ap = (struct ata_port *) &host->hostdata[0];
>>         ap->ops->eng_timeout(ap);
>> +       host->host_failed--;
>>
>>         DPRINTK("EXIT\n");
>>         return 0;
>>
>> The issue is that the libata installed eh_strategy_handler does not 
>> complete the error as
>> scsi_unjam_host -> scsi_eh_abort_cmds -> scsi_eh_finish_cmd does.
> 
> 
> 
> Well, well, well.  If I had a libata Honorary Hacker merit badge, I 
> would give it to you.
> 
> It is highly likely that your patch is doing the right thing.  Doug 
> Ledford, 2.4.x SCSI maintainer, pointed out to me recently that my 2.4.x 
> error handling code MUST update a couple variables, otherwise error 
> handling would hang as you see.  The reason is that scsi_unjam_host(), 
> on both 2.4.x and 2.6.x, is the only ->eh_strategy_handler until libata 
> came along.
> 
> So, it is likely that there are a few details the scsi_unjam_host() 
> performs, that needs to do too.
> 
> Thanks much for your excellent detective work, I'll see where to best 
> put this change...

Jeff,
It probably doesn't rate any gold stars but while your patching
libata-scsi.c could you slip this fix in as well.

The patch is against lk 2.6.8.1 . The same patch is needed
(give or take fuzz) in lk 2.4.27 .

Changes:
    - send vendor, product and rev strings back for 36 byte
      INQUIRYs
    - set the additional length field to indicate 96 byte
      response is available

Doug Gilbert

[-- Attachment #2: libata-scsi2681.diff --]
[-- Type: text/x-patch, Size: 565 bytes --]

--- linux/drivers/scsi/libata-scsi.c	2004-08-14 21:12:42.000000000 +1000
+++ linux/drivers/scsi/libata-scsi.c2681dpg	2004-08-17 22:00:59.501464824 +1000
@@ -534,7 +534,7 @@
 		0,
 		0x5,	/* claim SPC-3 version compatibility */
 		2,
-		96 - 4
+		95 - 4
 	};
 
 	/* set scsi removeable (RMB) bit per ata bit */
@@ -545,7 +545,7 @@
 
 	memcpy(rbuf, hdr, sizeof(hdr));
 
-	if (buflen > 36) {
+	if (buflen > 35) {
 		memcpy(&rbuf[8], "ATA     ", 8);
 		ata_dev_id_string(dev, &rbuf[16], ATA_ID_PROD_OFS, 16);
 		ata_dev_id_string(dev, &rbuf[32], ATA_ID_FW_REV_OFS, 4);

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: libata & scsi error handling
  2004-08-18  5:11   ` Douglas Gilbert
@ 2004-08-18  5:31     ` Jeff Garzik
  0 siblings, 0 replies; 4+ messages in thread
From: Jeff Garzik @ 2004-08-18  5:31 UTC (permalink / raw)
  To: dougg; +Cc: Brad Campbell, linux-ide, SCSI Mailing List

applied


^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: libata & scsi error handling
  2004-08-18  2:08 ` libata & scsi error handling Jeff Garzik
  2004-08-18  5:11   ` Douglas Gilbert
@ 2004-08-18  7:04   ` Brad Campbell
  1 sibling, 0 replies; 4+ messages in thread
From: Brad Campbell @ 2004-08-18  7:04 UTC (permalink / raw)
  To: Jeff Garzik; +Cc: linux-ide, SCSI Mailing List

Jeff Garzik wrote:

> 
> It is highly likely that your patch is doing the right thing.  Doug 
> Ledford, 2.4.x SCSI maintainer, pointed out to me recently that my 2.4.x 
> error handling code MUST update a couple variables, otherwise error 
> handling would hang as you see.  The reason is that scsi_unjam_host(), 
> on both 2.4.x and 2.6.x, is the only ->eh_strategy_handler until libata 
> came along.
> 
> So, it is likely that there are a few details the scsi_unjam_host() 
> performs, that needs to do too.

Possibly stupid question time. (What I know about the SCSI stack could be written on the back of a 
matchbox)

I'm a little concerned about this bit here. (This is the end of the first command and then the 
timeout related to it).


Aug 18 01:54:48 srv kernel: ata_dev_select: ENTER, ata13: device 0, wait 1
Aug 18 01:54:48 srv kernel: ata_tf_load_pio: hob: feat 0x0 nsect 0x0, lba 0x0 0x0 0x0
Aug 18 01:54:48 srv kernel: ata_tf_load_pio: feat 0x0 nsect 0x80 lba 0x0 0x0 0x0
Aug 18 01:54:48 srv kernel: ata_tf_load_pio: device 0xE0
Aug 18 01:54:48 srv kernel: ata_exec_command_pio: ata13: cmd 0x25
Aug 18 01:54:48 srv kernel: ata_scsi_translate: EXIT
Aug 18 01:54:48 srv kernel: scsi_dispatch_cmd out
Aug 18 00:43:41 srv kernel: scsi_times_out
Aug 18 00:43:41 srv kernel: scsi_eh_scmd_add

Here the scmd that failed gets added to a list.

         list_add_tail(&scmd->eh_entry, &shost->eh_cmd_q);

Because scsi_eh_finish_cmd never runs it will never get removed from the list. Am I missing something?

Aug 18 00:43:41 srv kernel: scsi_eh_scmd_after return 0
Aug 18 00:43:41 srv kernel: host_busy 1, host_failed 1
Aug 18 00:43:41 srv kernel: scsi_times_out out
Aug 18 00:43:41 srv kernel: wake eh_strategy_handler
Aug 18 00:43:41 srv kernel: hit eh_strategy_handler
Aug 18 00:43:41 srv kernel: eh_strategy_handler 1
Aug 18 00:43:41 srv kernel: ata_scsi_error: ENTER
Aug 18 00:43:41 srv kernel: ata_eng_timeout: ENTER
Aug 18 00:43:41 srv kernel: ata_qc_timeout: ENTER
Aug 18 00:43:41 srv kernel: ata13: command 0x25 timeout, stat 0xd0 host_stat 0x1
Aug 18 00:43:41 srv kernel: ata_sg_clean: unmapping 128 sg elements
Aug 18 00:43:41 srv kernel: scsi_device_unbusy
Aug 18 00:43:41 srv kernel: host_busy 0, host_failed 1
Aug 18 00:43:41 srv kernel: scsi12: ERROR on channel 0, id 0, lun 0, CDB: Read (10) 00 00 00 00 00 
00 00 80 00
Aug 18 00:43:41 srv kernel: Current sda: sense key Medium Error
Aug 18 00:43:41 srv kernel: Additional sense: Unrecovered read error - auto reallocate failed
Aug 18 00:43:41 srv kernel: end_request: I/O error, dev sda, sector 0
Aug 18 00:43:41 srv kernel: Buffer I/O error on device sda, logical block 0
Aug 18 00:43:41 srv kernel: ata_qc_timeout: EXIT
Aug 18 00:43:41 srv kernel: ata_eng_timeout: EXIT
Aug 18 00:43:41 srv kernel: ata_scsi_error: EXIT
Aug 18 00:43:41 srv kernel: eh_strategy_handler 2
Aug 18 00:43:41 srv kernel: eh_strategy_handler 3
Aug 18 00:43:41 srv kernel: scsi_dispatch_cmd
Aug 18 00:43:41 srv kernel: Add Timer
Aug 18 00:43:41 srv kernel: After Add Timer
Aug 18 01:55:14 srv kernel: ata_scsi_dump_cdb: CDB (13:0,0,0) 28 00 00 00 00 01 00 00 7f
Aug 18 01:55:14 srv kernel: ata_scsi_translate: ENTER
Aug 18 01:55:14 srv kernel: ata_scsi_rw_xlat: ten-byte command
Aug 18 01:55:14 srv kernel: ata_sg_setup: ENTER, ata13
Aug 18 01:55:14 srv kernel: ata_sg_setup: 127 sg elements mapped

Regards,
Brad

^ permalink raw reply	[flat|nested] 4+ messages in thread

end of thread, other threads:[~2004-08-18  7:04 UTC | newest]

Thread overview: 4+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
     [not found] <4122771A.4070203@wasp.net.au>
2004-08-18  2:08 ` libata & scsi error handling Jeff Garzik
2004-08-18  5:11   ` Douglas Gilbert
2004-08-18  5:31     ` Jeff Garzik
2004-08-18  7:04   ` Brad Campbell

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).