* Re: libata & scsi error handling
2004-08-18 2:08 ` libata & scsi error handling Jeff Garzik
@ 2004-08-18 5:11 ` Douglas Gilbert
2004-08-18 5:31 ` Jeff Garzik
2004-08-18 7:04 ` Brad Campbell
1 sibling, 1 reply; 4+ messages in thread
From: Douglas Gilbert @ 2004-08-18 5:11 UTC
To: Jeff Garzik; +Cc: Brad Campbell, linux-ide, SCSI Mailing List
[-- Attachment #1: Type: text/plain, Size: 2077 bytes --]
Jeff Garzik wrote:
> Brad Campbell wrote:
>
>> I think I have this timeout error issue pegged now.
>>
>> I know this is both wrong, ugly and likely to cause internal kernel
>> damage, but for the purpose of pegging what I think may be the culprit
>> it works around the error nicely here
>>
>> brad@srv:/usr/src$ diff -u
>> temp/linux-2.6.8.1/drivers/scsi/libata-scsi.c
>> linux-2.6.8.1/drivers/scsi/libata-scsi.c
>> --- temp/linux-2.6.8.1/drivers/scsi/libata-scsi.c 2004-08-14
>> 14:55:19.000000000 +0400
>> +++ linux-2.6.8.1/drivers/scsi/libata-scsi.c 2004-08-18
>> 01:04:11.000000000 +0400
>> @@ -213,6 +213,7 @@
>>
>> ap = (struct ata_port *) &host->hostdata[0];
>> ap->ops->eng_timeout(ap);
>> + host->host_failed--;
>>
>> DPRINTK("EXIT\n");
>> return 0;
>>
>> The issue is that the libata installed eh_strategy_handler does not
>> complete the error as
>> scsi_unjam_host -> scsi_eh_abort_cmds -> scsi_eh_finish_cmd does.
>
>
>
> Well, well, well. If I had a libata Honorary Hacker merit badge, I
> would give it to you.
>
> It is highly likely that your patch is doing the right thing. Doug
> Ledford, 2.4.x SCSI maintainer, pointed out to me recently that my 2.4.x
> error handling code MUST update a couple variables, otherwise error
> handling would hang as you see. The reason is that scsi_unjam_host(),
> on both 2.4.x and 2.6.x, was the only ->eh_strategy_handler until libata
> came along.
>
> So, it is likely that there are a few details that scsi_unjam_host()
> performs, that libata needs to do too.
>
> Thanks much for your excellent detective work, I'll see where to best
> put this change...
Jeff,
It probably doesn't rate any gold stars, but while you're patching
libata-scsi.c could you slip this fix in as well?
The patch is against lk 2.6.8.1. The same patch is needed
(give or take fuzz) in lk 2.4.27.
Changes:
- send vendor, product and rev strings back for 36 byte
INQUIRYs
- set the additional length field to indicate that a 96 byte
response is available
Doug Gilbert
[-- Attachment #2: libata-scsi2681.diff --]
[-- Type: text/x-patch, Size: 565 bytes --]
--- linux/drivers/scsi/libata-scsi.c 2004-08-14 21:12:42.000000000 +1000
+++ linux/drivers/scsi/libata-scsi.c2681dpg 2004-08-17 22:00:59.501464824 +1000
@@ -534,7 +534,7 @@
0,
0x5, /* claim SPC-3 version compatibility */
2,
- 96 - 4
+ 95 - 4
};
/* set scsi removeable (RMB) bit per ata bit */
@@ -545,7 +545,7 @@
memcpy(rbuf, hdr, sizeof(hdr));
- if (buflen > 36) {
+ if (buflen > 35) {
memcpy(&rbuf[8], "ATA ", 8);
ata_dev_id_string(dev, &rbuf[16], ATA_ID_PROD_OFS, 16);
ata_dev_id_string(dev, &rbuf[32], ATA_ID_FW_REV_OFS, 4);
* Re: libata & scsi error handling
2004-08-18 2:08 ` libata & scsi error handling Jeff Garzik
2004-08-18 5:11 ` Douglas Gilbert
@ 2004-08-18 7:04 ` Brad Campbell
1 sibling, 0 replies; 4+ messages in thread
From: Brad Campbell @ 2004-08-18 7:04 UTC
To: Jeff Garzik; +Cc: linux-ide, SCSI Mailing List
Jeff Garzik wrote:
>
> It is highly likely that your patch is doing the right thing. Doug
> Ledford, 2.4.x SCSI maintainer, pointed out to me recently that my 2.4.x
> error handling code MUST update a couple variables, otherwise error
> handling would hang as you see. The reason is that scsi_unjam_host(),
> on both 2.4.x and 2.6.x, was the only ->eh_strategy_handler until libata
> came along.
>
> So, it is likely that there are a few details that scsi_unjam_host()
> performs, that libata needs to do too.
Possibly stupid question time. (What I know about the SCSI stack could be written on the back of a
matchbox.)
I'm a little concerned about this bit here. (This is the end of the first command and then the
timeout related to it).
Aug 18 01:54:48 srv kernel: ata_dev_select: ENTER, ata13: device 0, wait 1
Aug 18 01:54:48 srv kernel: ata_tf_load_pio: hob: feat 0x0 nsect 0x0, lba 0x0 0x0 0x0
Aug 18 01:54:48 srv kernel: ata_tf_load_pio: feat 0x0 nsect 0x80 lba 0x0 0x0 0x0
Aug 18 01:54:48 srv kernel: ata_tf_load_pio: device 0xE0
Aug 18 01:54:48 srv kernel: ata_exec_command_pio: ata13: cmd 0x25
Aug 18 01:54:48 srv kernel: ata_scsi_translate: EXIT
Aug 18 01:54:48 srv kernel: scsi_dispatch_cmd out
Aug 18 00:43:41 srv kernel: scsi_times_out
Aug 18 00:43:41 srv kernel: scsi_eh_scmd_add
Here the scmd that failed gets added to a list.
list_add_tail(&scmd->eh_entry, &shost->eh_cmd_q);
Because scsi_eh_finish_cmd never runs it will never get removed from the list. Am I missing something?
Aug 18 00:43:41 srv kernel: scsi_eh_scmd_after return 0
Aug 18 00:43:41 srv kernel: host_busy 1, host_failed 1
Aug 18 00:43:41 srv kernel: scsi_times_out out
Aug 18 00:43:41 srv kernel: wake eh_strategy_handler
Aug 18 00:43:41 srv kernel: hit eh_strategy_handler
Aug 18 00:43:41 srv kernel: eh_strategy_handler 1
Aug 18 00:43:41 srv kernel: ata_scsi_error: ENTER
Aug 18 00:43:41 srv kernel: ata_eng_timeout: ENTER
Aug 18 00:43:41 srv kernel: ata_qc_timeout: ENTER
Aug 18 00:43:41 srv kernel: ata13: command 0x25 timeout, stat 0xd0 host_stat 0x1
Aug 18 00:43:41 srv kernel: ata_sg_clean: unmapping 128 sg elements
Aug 18 00:43:41 srv kernel: scsi_device_unbusy
Aug 18 00:43:41 srv kernel: host_busy 0, host_failed 1
Aug 18 00:43:41 srv kernel: scsi12: ERROR on channel 0, id 0, lun 0, CDB: Read (10) 00 00 00 00 00 00 00 80 00
Aug 18 00:43:41 srv kernel: Current sda: sense key Medium Error
Aug 18 00:43:41 srv kernel: Additional sense: Unrecovered read error - auto reallocate failed
Aug 18 00:43:41 srv kernel: end_request: I/O error, dev sda, sector 0
Aug 18 00:43:41 srv kernel: Buffer I/O error on device sda, logical block 0
Aug 18 00:43:41 srv kernel: ata_qc_timeout: EXIT
Aug 18 00:43:41 srv kernel: ata_eng_timeout: EXIT
Aug 18 00:43:41 srv kernel: ata_scsi_error: EXIT
Aug 18 00:43:41 srv kernel: eh_strategy_handler 2
Aug 18 00:43:41 srv kernel: eh_strategy_handler 3
Aug 18 00:43:41 srv kernel: scsi_dispatch_cmd
Aug 18 00:43:41 srv kernel: Add Timer
Aug 18 00:43:41 srv kernel: After Add Timer
Aug 18 01:55:14 srv kernel: ata_scsi_dump_cdb: CDB (13:0,0,0) 28 00 00 00 00 01 00 00 7f
Aug 18 01:55:14 srv kernel: ata_scsi_translate: ENTER
Aug 18 01:55:14 srv kernel: ata_scsi_rw_xlat: ten-byte command
Aug 18 01:55:14 srv kernel: ata_sg_setup: ENTER, ata13
Aug 18 01:55:14 srv kernel: ata_sg_setup: 127 sg elements mapped
Regards,
Brad