From mboxrd@z Thu Jan 1 00:00:00 1970 From: ishikawa@yk.rim.or.jp Subject: Ungraceful handling of scsi tape drive error condition Date: Sat, 20 Jul 2002 03:53:39 +0900 (JST) Sender: linux-scsi-owner@vger.kernel.org Message-ID: Return-path: Received: by yk.rim.or.jp via sendmail from stdin id (Debian Smail3.2.0.114) for linux-scsi@vger.kernel.org; Sat, 20 Jul 2002 03:53:39 +0900 (JST) List-Id: linux-scsi@vger.kernel.org To: linux-scsi@vger.kernel.org Cc: Kai.Makisara@metla.fi, dc390@garloff.de, garloff@suse Hi, I would like to report a problem of ungraceful handling of tape device error by SCSI subsystem and DC390 (tmscsim) driver. I am not sure exactly where the problem lies, and so send to various e-mail addresses. (To be honest, I am not sure if the software can cope with the situation at all, but here it goes anyway.) Background: I have an HP DDS2 DAT tape drive. The external drive is connected to my linux PC using Tekram DC390 scsi adaptor. (I have a separate active terminator block and more or less good SCSI cable. The combination worked flawlessly previously and since I tested out the connection carefully before I began using this, I am more or less confident that the cabling and terminator is not the problem.) This DAT drive was used to make backup archive of various files. This worked fine under 2.4.16 (and 2.4.17, I think.) SCSI tape support is used as module: st.o This week, I tried to make another backup copy of many files I have brought from external source. Then I have encountered a problem. The tar program got hung in the middle of operation. I thought OK, maybe a drive proble, but then I was surprised to find that I could not even kill the tar process. "kill -KILL pid" won't remove it from the ps output. It seems that it got stuck inside the write system call. I suspect the cause of the problem is drive-related. (Maybe a dirty head or something causing the drive to respond very slowly or even triggered a firmware problme or something.) While I will go out and buy a cleaning tape to remedy the supposedly dirty head, I am reporting this so that we can improve the error handling in SCSI subsystem. That is, it would be nice if we can make sure that the tar program can be killed by interrut (control-C), etc. even in the presence of lower-level physical problem. I agree that if the drive behaves in a very strange manner due to temporary hardware problem, there may be few options to deal with it, but I still would like to be able to kill a program that accesses that drive. This is because I could not even try to remove the st (scsi tape) driver module to see if the older st.c from older kernel might work, etc.. Since tar was holding the /dev/st0, I could not remove the st module. Also, if I recall correctly, trying to rewind the tape was refused since the device was in use, etc.. (But I may be confused about this.) Again, I am reporting this so that someone may notice that there are places where SCSI error handlings can be improved, specifically for tape drives. I will try to solve the physical problem by firstly trying head cleaing tape on my end. By the way, I do think the drive is at fault. (Not the cable, etc.) I have an HP tape library utility (a Windows program made by HP and downloadable from HP's web site) to test and diagnose problems of HP DAT drives. The drive passes all the self-diagnotics and the like that didn't touch the recording medium. The test includes the SCSI connection to/from the tape drive. The diagnostics that looked at internal log buffer of the drive did suggest maybe I need to clean the tape head(!). Then by invoking media validation test of HP program, I found that my DAT drive in current condition has a difficulty in reading/writing the tape medium. (It did write without problems only a couple of days ago, though.) Hence, my current suspcition is the dirty read/write head. REPEAT-BY: The sequence of operation that led to the problem of hung "tar" which is not killable at all is explained below. (Actually, I tried various combinations of OS versions (2.4.18, 2.4.17), max transfer speed (10MB/s, 8MB/s, 6.6MB/s, etc.) set in DC390 BIOS (and honoured by the tmscsim driver), and the problem surfaced after a while during the copying of large tar file. So I am more or less certain that the problem is caused by the device physical error.) insmod st mt -f /dev/st0 datcompression tar cvXbf exclude-file-list 20 /dev/st0 MANY-DIRECTORIES-AND-FILES Luckily I got a few messages from the syslog file which may be relevant (and listed below), but unfortunately I didn't notice this while I was dealing with the malfunctioning drive and so my explanation may be a little out of sync with what truely happened. Anyway, here is the log with my explanation. I believe the following was recorded while I booted 2.4.17 kernel. LOG and the sequence of event. **** "insmod st" was executed from my backup script. Jul 18 23:49:20 duron kernel: st: Version 20011103, bufsize 32768, wrt 30720, max init. bufs 4, s/g segs 16 Jul 18 23:49:20 duron kernel: Attached scsi tape st0 at scsi1, channel 0, id 3, lun 0 Jul 18 23:49:20 duron kernel: DC390: Target 3: Sync transfer 10.0 MHz, Offset 15 Jul 18 23:49:20 duron kernel: st0: Block limits 1 - 16777215 bytes. **** After mt -f /dev/st0 datcompression **** tar began execution. ... tar began writing to /dev/st0 ... But at some point, the drive seemed to encountered an error, ... maybe due to a dirty head, which requires cleaning. ... ... Instead of gracefully handling the situation caused by ... the timeout of the drive, ... the tar program got hung. ... DC390 drive (tmscsim driver) printed out diagnostics in a ... neat manner. Jul 19 00:14:17 duron kernel: scsi : aborting command due to timeout : pid 54817, scsi1, channel 0, id 3, lun 0 Write (6) 00 00 28 00 00 Jul 19 00:14:17 duron kernel: DC390: Abort command (pid 54817, Device 03-00) Jul 19 00:14:17 duron kernel: DC390: SRB: Xferred 00000000, Remain 00000000, State 00000080, Phase 04 Jul 19 00:14:17 duron kernel: DC390: AdpaterStatus: 00, SRB Status 00 Jul 19 00:14:17 duron kernel: DC390: Status of last IRQ (DMA/SC/Int/IRQ): 0890cc20 Jul 19 00:14:17 duron kernel: DC390: Register dump: SCSI block: Jul 19 00:14:17 duron kernel: DC390: XferCnt Cmd Stat IntS IRQS FFIS Ctl1 Ctl2 Ctl3 Ctl4 Jul 19 00:14:17 duron kernel: DC390: 000000 44 10 cc 00 80 17 48 18 04 Jul 19 00:14:17 duron kernel: DC390: Register dump: DMA engine: Jul 19 00:14:17 duron kernel: DC390: Cmd STrCnt SBusA WrkBC WrkAC Stat SBusCtrl Jul 19 00:14:17 duron kernel: DC390: 00 00002800 002f8000 00000000 002fa800 00 03080000 Jul 19 00:14:17 duron kernel: DC390: Register dump: PCI Status: 0200 Jul 19 00:14:17 duron kernel: DC390: In case of driver trouble read linux/drivers/scsi/README.tmscsim Jul 19 00:14:17 duron kernel: DC390: Aborted pid 54817 with status 0 ... I was doing something else and didn't realize the hung tar ... program until I checked the terminal window about 15 minutes later. ... I now noticed the problem and tried to kill the tar process by ... control-C, etc.. ... However, I found that the tar program could not be killed! ... While my doing this, the system seemed to record another set of ... messages. Maybe due to my forceful removal of the tape medium, etc.. Jul 19 00:29:23 duron kernel: scsi : aborting command due to timeout : pid 54817, scsi1, channel 0, id 3, lun 0 Write (6) 00 00 28 00 00 Jul 19 00:29:23 duron kernel: DC390: Abort command (pid 54817, Device 03-00) Jul 19 00:29:23 duron kernel: DC390: SRB: Xferred 00000000, Remain 00000000, State 00000080, Phase 04 Jul 19 00:29:23 duron kernel: DC390: AdpaterStatus: 00, SRB Status 00 Jul 19 00:29:23 duron kernel: DC390: Status of last IRQ (DMA/SC/Int/IRQ): 0890cc20 Jul 19 00:29:23 duron kernel: DC390: Register dump: SCSI block: Jul 19 00:29:23 duron kernel: DC390: XferCnt Cmd Stat IntS IRQS FFIS Ctl1 Ctl2 Ctl3 Ctl4 Jul 19 00:29:23 duron kernel: DC390: 000000 44 10 cc 00 80 17 48 18 04 Jul 19 00:29:23 duron kernel: DC390: Register dump: DMA engine: Jul 19 00:29:23 duron kernel: DC390: Cmd STrCnt SBusA WrkBC WrkAC Stat SBusCtrl Jul 19 00:29:23 duron kernel: DC390: 00 00002800 002f8000 00000000 002fa800 00 03080000 Jul 19 00:29:23 duron kernel: DC390: Register dump: PCI Status: 0200 Jul 19 00:29:23 duron kernel: DC390: In case of driver trouble read linux/drivers/scsi/README.tmscsim Jul 19 00:29:23 duron kernel: DC390: Aborted pid 54817 with status 0 Jul 19 00:29:23 duron kernel: SCSI host 1 abort (pid 54817) timed out - resetting Jul 19 00:29:23 duron kernel: SCSI bus is being reset for host 1 channel 0. Jul 19 00:29:23 duron kernel: DC390: RESET ... done Jul 19 00:29:23 duron kernel: st0: Error 26030000 (sugg. bt 0x20, driver bt 0x6, host bt 0x3). Jul 19 00:29:23 duron kernel: DC390: Target 3: Sync transfer 10.0 MHz, Offset 15 Jul 19 00:29:23 duron kernel: SCSI error: host 1 id 3 lun 0 return code = 28000002 Jul 19 00:29:23 duron kernel: ^ISense class 7, sense error 0, extended sense 0 PS: I don't know if there is a timeout value which could be enlarged to accommodate this particular error of HP drive, but I think the drive requires servicing as explained above. So even a much larger value won't help. What worries me, though is this. I am wondering why the error condition reported by DC390 (tmscsim) driver is not noticed by the upper layer of SCSI system so that the write system call returned an error and we could kill the tar program that called it. That would be a more graceful handling of the situation IMHO. cf. >>From README.st I noticed this excellent file only after I wrote the above to look for the maintainer address. >To enable debugging messages, edit st.c and #define DEBUG 1. As seen >above, debugging can be switched off with an ioctl if debugging is >compiled into the driver. The debugging output is not voluminuous. I will try to do this after I buy the cleaning tape. >If the tape seems to hang, I would be very interested to hear where >the driver is waiting. With the command 'ps -l' you can see the state >of the process using the tape. If the state is D, the process is >waiting for something. The field WCHAN tells where the driver is >waiting. If you have the current System.map in the correct place (in >/boot for the procps I use) or have updated /etc/psdatabase (for kmem >ps), ps writes the function name in the WCHAN field. If not, you have >to look up the function from System.map. > I am sorry that the above log was collected using 2.4.17 kenrel while the main OS I use is nowadays 2.4.18 kernel. I think the System.map was incorrect. Actually I hit ALT+SysReq+t to see where the tar program got hung during the trials and errors, (but to my dismay, I didn't see the output immediately on the console. I was running X. Then later on, while I wrote the above memo, I notice the call trace was in the syslog file!) >>From my own log: I didn't realize the output from Alt+SysReq+t is directed to syslog, etc.. I am not sure if the symbols are correct due to the version mismatch above. ... >Jul 19 00:26:03 duron kernel: bash S 00000000 2656 406 405 409 (NOTLB) >Jul 19 00:26:03 duron kernel: Call Trace: [sys_wait4+862/912] [system_call+51/56] >Jul 19 00:26:03 duron kernel: tar D D961E000 5000 409 406 (NOTLB) >Jul 19 00:26:03 duron kernel: Call Trace: [wait_for_completion+114/144] [ide-cd:ignore+37481379/72914909] [ide-cd:ignore+37485815/72910473] [file_read_actor+0/96] [sys_write+150/208] >Jul 19 00:26:03 duron kernel: [system_call+51/56] ... if the drive still behaves in a funny manner after applying cleaning tape, I will try to gather the information above. Oh, for that matter, if the information is really important, I will try to gather the data WITHOUT applying the cleaning tape. That is I will try to see if I can coerce the misbehaving DAT tape drive into accepting tar output even for a few seconds and then when tar gets hung, I can try Alt+SysReq+t. Please advise me if necessary. >Note also that the timeouts are very long compared to most other >drivers. This means that the Linux driver may appear hung although the >real reason is that the tape firmware has got confused. It seems like the case since the drive now operates in a rather funny manner even under windows 98 when it is tested using HP's excellent diagnostics tool.