From mboxrd@z Thu Jan 1 00:00:00 1970 From: Marc Posch Subject: Poblem solved: Re: PL3507 in daisy-chain - timeouts with disk spinup and "slow" disks Date: Sat, 15 Sep 2007 21:24:40 +0200 Message-ID: <46EC3178.3020909@gmail.com> References: <9ef28c5d0707030225l3aa4587bx4aa6d99cd19762e9@mail.gmail.com> <9ef28c5d0707030246v698d12b2j841edaac7da8c9bd@mail.gmail.com> <468BF73E.3020001@s5r6.in-berlin.de> Mime-Version: 1.0 Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: QUOTED-PRINTABLE Return-path: Received: from fk-out-0910.google.com ([209.85.128.188]:60776 "EHLO fk-out-0910.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751829AbXIOTYu (ORCPT ); Sat, 15 Sep 2007 15:24:50 -0400 Received: by fk-out-0910.google.com with SMTP id z23so1089575fkz for ; Sat, 15 Sep 2007 12:24:48 -0700 (PDT) In-Reply-To: <468BF73E.3020001@s5r6.in-berlin.de> Sender: linux-scsi-owner@vger.kernel.org List-Id: linux-scsi@vger.kernel.org To: Stefan Richter Cc: linux1394-user@lists.sourceforge.net, linux-scsi@vger.kernel.org Hi, My problem was a classical "false positive": in this case there really=20 seems to have been a problem: I noticed that one of the disks had the problem more often than the=20 other. So I have taken it to the store where had bought it and let them= =20 exchange it under warranty conditions, claiming a disk spinup problem=20 after resuming from standby (which is not really far away from reality)= =2E After having installed the new disk, the problem has not occurred any=20 more. This is now 2 months ago. As far as I can tell the PL-3507 does=20 not seem to show any problems with firewire at all. Just for reference for other people: I'm using Firmware Version: 2006.04.20.149 [Checksum 4D6F] Thanks anyway for the help!!! Marc Posch Stefan Richter wrote: > (Full quote for linux-scsi) >=20 > mposch@gmail.com wrote at linux1394-user: >> Hi, >> >> I'm successfully running four IcyBox IB351-UE external enclosures in= a >> daisy-chained configuration. All four boxes have the latest Prolific >> firmware from www.raidsonic.de installed. >=20 > Prolific based FireWire/USB combo devices are notorious for being sol= d > with ancient broken firmwares of their FireWire part (not of the USB > part), so it's good that you installed a recent firmware. Do they ha= ve > release notes or release dates which you could compare with those of > Prolific's own firmware? > http://www.prolific.com.tw/eng/downloads.asp?ID=3D44 >=20 >> Everything seems to work fairly well as long as the disks are runnin= g. >> However, to prevent from heat problems I'm using a script and sg_sta= rt >> from sg3_utils to spin down the drives if there is no change in >> /proc/diskstats for the specified disk. Basically, the command being >> executed is: sg_start 0 --pc=3D2 /dev/sdX >> >> The --pc=3D2 (Power condotion) seems to be the only command which th= e >> Prolific chipset seems to support to spin up/down the disks. >> sg_start 1 --pc=3D0 /dev/sdX >> Always works without problems to "wake up" the disk again. >> >> The problem (timeout errors) I'm getting manifests when the disks ar= e >> being automatically spun up on disk access, and it happens only on >> "slow" disks. It also seems to happen more often when the OS is tryi= ng >> to write something to the disk - a simple fdisk -l always gets all >> disks up and running. >=20 > Occasional timeouts shouldn't be a problem. The fact that the timeou= t > of the very first attempted command (which is accompanied by the SBP-= 2 > protocol request of a "fetch agent reset") is followed-up by several > more timeouts, so that the SCSI subsystem eventually decides to take = the > device offline, is IMO a hint on a fragile firmware. >=20 > Default timeouts of different kinds are hardwired in the SCSI subsyst= em. > From linux/include/scsi.h: >=20 > #define FORMAT_UNIT_TIMEOUT (2 * 60 * 60 * HZ) > #define START_STOP_TIMEOUT (60 * HZ) > #define MOVE_MEDIUM_TIMEOUT (5 * 60 * HZ) > #define READ_ELEMENT_STATUS_TIMEOUT (5 * 60 * HZ) > #define READ_DEFECT_DATA_TIMEOUT (60 * HZ ) >=20 > There is also a writable sysfs attribute for each SCSI device of whic= h > effects I am uncertain: > # ll /sys/bus/scsi/devices/0\:0\:0\:0/timeout > -rw-r--r-- 1 root root 4096 Jul 4 21:10 > /sys/bus/scsi/devices/0:0:0:0/timeout > # cat /sys/bus/scsi/devices/0\:0\:0\:0/timeout > 30 >=20 > What if you set a higher value in there with "echo XYZ > ..."? >=20 >> To further explain, this is my /proc/scsi/scsi content: >> Host: scsi8 Channel: 00 Id: 00 Lun: 00 >> Vendor: IC35L040 Model: AVVN07-0 Rev: >> Type: Direct-Access-RBC ANSI SCSI revision: 04 >> Host: scsi9 Channel: 00 Id: 00 Lun: 00 >> Vendor: WDC WD25 Model: 00BB-00GUC0 Rev: >> Type: Direct-Access-RBC ANSI SCSI revision: 04 >> Host: scsi10 Channel: 00 Id: 00 Lun: 00 >> Vendor: Maxtor 6 Model: L080P0 Rev: >> Type: Direct-Access-RBC ANSI SCSI revision: 04 >> Host: scsi11 Channel: 00 Id: 00 Lun: 00 >> Vendor: WDC WD25 Model: 00BB-00RDA0 Rev: >> Type: Direct-Access-RBC ANSI SCSI revision: 04 >> >> The WDC disks need about 6 to 10 seconds to spin up and get ready, a= ll >> other disks are ready after about 3-5 seconds. The timeout errors on= ly >> occur on the slower WDC disks. After that, the disk gets unusable >> (device offlined), but the case's LED continues to flash repeatedly, >> there is also audible disk access. This continues until the enclosur= e >> is powered off and on again. >=20 > Apparently the firmware is stuck in a loop, after the sbp2 driver tol= d > it to abort that timed-out command and to start over when the next > command comes. >=20 >> I have tested this in the following different cabling configurations >> (I'll use the scsi-numbers from above): >> >> a) >> Host -- scsi8 -- scsi9 (WDC) >> Host -- scsi10 -- scsi11 (WDC) >> >> b) >> Host -- scsi9 (WDC) -- scsi8 >> Host -- scsi11 (WDC) -- scsi10 >> >> Interestingly, in Configuration b), when one of the WDC disks goes >> offline, scsi commands still seem to be able to travel through the >> Prolific device to scsi8/10. (I can still access the disks 8/10). >=20 > Yes, the cable topology doesn't influence this. The FireWire link la= yer > controller/ SCSI target/ IDE bridge is on one chip, and the FireWire > physical interface is on another chip. That phy chip is also a FireW= ire > repeater, transparently to the link layer controller. >=20 >> To sum things up: >> To me it seems that scsi write commands being sent disks have a >> shorter timeout than read commands. When the disk needs to be spun u= p >> the command times out and the disk goes offline (the kernel seems to >> force it offline). When a simple read command is being sent, it seem= s >> to be more "tolerant" to timeouts and waits until the disk is online= =2E >> >> Is there any way to change timeouts so that the kernel waits for the >> disk until it is up and running again? I'm willing to test >> everything... >> >> Regards, >> Marc Posch >> >> >From http://www.linux1394.org/trouble.php: >> >> - kernel version: >> 2.6.20-16-generic (built from Ubuntu feisty kernel source) >> with patchset 2.6.20.y_ieee1394_v474.patch.bz2 (from >> http://me.in-berlin.de/~s5r6/linux1394/updates/2.6.20.y/) applied >> Same happens on unmodified 2.6.20-16-generic - the patchset was an >> attempt to resolve the problem >> - libraw version: libraw1394.so.8.1.1 >> - driver: OHCI >> - relevant messages from dmesg: >> ------------------------------------------------- >> Jul 2 07:52:12 doriath kernel: [52294.212000] ieee1394: sbp2: abort= ing sbp2 command >> Jul 2 07:52:12 doriath kernel: [52294.212000] sd 5:0:0:0: >> Jul 2 07:52:12 doriath kernel: [52294.212000] command: Writ= e(10): 2a 00 1d 1c 44 bf 00 00 08 00 >=20 > That's a message from sbp2's SCSI command timeout handler > (.eh_abort_handler). >=20 >> Jul 2 07:52:22 doriath kernel: [52304.212000] ieee1394: sbp2: abort= ing sbp2 command >> Jul 2 07:52:22 doriath kernel: [52304.212000] sd 5:0:0:0: >> Jul 2 07:52:22 doriath kernel: [52304.212000] command: Test= Unit Ready: 00 00 00 00 00 00 >=20 > Timeout handler again. >=20 >> Jul 2 07:52:22 doriath kernel: [52304.212000] ieee1394: sbp2: reset= requested >> Jul 2 07:52:22 doriath kernel: [52304.212000] ieee1394: sbp2: gener= ating sbp2 fetch agent reset >=20 > That's sbp2's SCSI device reset handler (.eh_device_reset_handler). = It > doesn't actually do anything what the timeout handler already tried t= o > do, so it's quite pointless. When this gets called because the SCSI > subsystem didn't see a lot of success with the previous command abort > handler calls, the SBP-2 target firmware is probably already dead. >=20 > Maybe we should implement something more drastic in the device reset > handler --- e.g. an actual request for a reset similar to power reset= =2E > I never tried this, so I don't know if that would really improve > anything. I'll notify you when I'm in the mood to write a respective= patch. >=20 >> Jul 2 07:52:32 doriath kernel: [52314.212000] ieee1394: sbp2: abort= ing sbp2 command >> Jul 2 07:52:32 doriath kernel: [52314.212000] sd 5:0:0:0: >> Jul 2 07:52:32 doriath kernel: [52314.212000] command: Test= Unit Ready: 00 00 00 00 00 00 >> Jul 2 07:52:32 doriath kernel: [52314.212000] sd 5:0:0:0: scsi: Dev= ice offlined - not ready after error recovery >> Jul 2 07:52:32 doriath kernel: [52314.212000] sd 5:0:0:0: SCSI erro= r: return code =3D 0x00050000 >> Jul 2 07:52:32 doriath kernel: [52314.212000] end_request: I/O erro= r, dev sdb, sector 488391871 >=20 > SCSI subsystem finally gave up. >=20 >> ------------------------------------------------- >> - adapter card model: PL3507 >> - output of gscanbus: >> ------------------------------------------------- >> Root >> =3D=3D=3D=3D >> SelfID Info >> ----------- >> Physical ID: 4 >> Link active: Yes >> Gap Count: 63 >> PHY Speed: S400 >> PHY Delay: <=3D144ns >> IRM Capable: Yes >> Power Class: +15W >> Port 0: Connected to child node >> Port 1: Connected to child node >> Init. reset: Yes >> >> CSR ROM Info >> ------------ >> GUID: 0x004063500006A853 >> Node Capabilities: 0x000083C0 >> Vendor ID: 0x00004063 >> Unit Spec ID: 0x0000005E >> Unit SW Version: 0x00000001 >> Model ID: 0x00000000 >> Nr. Textual Leafes: 1 >> >> Vendor: VIA TECHNOLOGIES, INC. >> Textual Leafes: >> Linux - ohci1394 >> >> AV/C Subunits >> ------------- >> N/A >> >> Channel A Child 1 >> =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D >> SelfID Info >> ----------- >> Physical ID: 1 >> Link active: Yes >> Gap Count: 63 >> PHY Speed: S400 >> PHY Delay: <=3D144ns >> IRM Capable: Yes >> Power Class: -10W >> Port 0: Connected to parent node >> Port 1: Connected to child node >> Init. reset: No >> >> CSR ROM Info >> ------------ >> GUID: 0x0050770E00001BB4 >> Node Capabilities: 0x000083C0 >> Vendor ID: 0x00005077 >> Unit Spec ID: 0x0000609E >> Unit SW Version: 0x00010483 >> Model ID: 0x00000001 >> Nr. Textual Leafes: 1 >> >> Vendor: PROLIFIC TECHNOLOGY, INC. >> Textual Leafes: >> Prolific PL3507 Combo Device >> >> AV/C Subunits >> ------------- >> N/A >=20 > Looks like RaidSonic actually delivered an original Prolific Firmware > and didn't even replace the vendor string. >=20 >> Channel A Child 2 >> =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D >> SelfID Info >> ----------- >> Physical ID: 0 >> Link active: Yes >> Gap Count: 63 >> PHY Speed: S400 >> PHY Delay: <=3D144ns >> IRM Capable: Yes >> Power Class: -10W >> Port 0: Connected to parent node >> Port 1: Not connected >> Init. reset: No >> >> CSR ROM Info >> ------------ >> GUID: 0x0050770E0000383E >> Node Capabilities: 0x000083C0 >> Vendor ID: 0x00005077 >> Unit Spec ID: 0x0000609E >> Unit SW Version: 0x00010483 >> Model ID: 0x00000001 >> Nr. Textual Leafes: 1 >> >> Vendor: PROLIFIC TECHNOLOGY, INC. >> Textual Leafes: >> Prolific PL3507 Combo Device >> >> AV/C Subunits >> ------------- >> N/A >> >> Channel B Child 1 >> =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D >> SelfID Info >> ----------- >> Physical ID: 3 >> Link active: Yes >> Gap Count: 63 >> PHY Speed: S400 >> PHY Delay: <=3D144ns >> IRM Capable: Yes >> Power Class: -10W >> Port 0: Connected to parent node >> Port 1: Connected to child node >> Init. reset: No >> >> CSR ROM Info >> ------------ >> GUID: 0x0050770E00003C09 >> Node Capabilities: 0x000083C0 >> Vendor ID: 0x00005077 >> Unit Spec ID: 0x0000609E >> Unit SW Version: 0x00010483 >> Model ID: 0x00000001 >> Nr. Textual Leafes: 1 >> >> Vendor: PROLIFIC TECHNOLOGY, INC. >> Textual Leafes: >> Prolific PL3507 Combo Device >> >> AV/C Subunits >> ------------- >> N/A >> >> Channel B Child 2 >> =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D >> SelfID Info >> ----------- >> Physical ID: 2 >> Link active: Yes >> Gap Count: 63 >> PHY Speed: S400 >> PHY Delay: <=3D144ns >> IRM Capable: Yes >> Power Class: -10W >> Port 0: Connected to parent node >> Port 1: Not connected >> Init. reset: No >> >> CSR ROM Info >> ------------ >> GUID: 0x0050770E00003761 >> Node Capabilities: 0x000083C0 >> Vendor ID: 0x00005077 >> Unit Spec ID: 0x0000609E >> Unit SW Version: 0x00010483 >> Model ID: 0x00000001 >> Nr. Textual Leafes: 1 >> >> Vendor: PROLIFIC TECHNOLOGY, INC. >> Textual Leafes: >> Prolific PL3507 Combo Device >> >> AV/C Subunits >> ------------- >> N/A >> ------------------------------------------------- >> - output of lspci >> ------------------------------------------------- >> 00:00.0 Host bridge: VIA Technologies, Inc. VT8623 [Apollo CLE266] >> 00:01.0 PCI bridge: VIA Technologies, Inc. VT8633 [Apollo Pro266 AGP= ] >> 00:0d.0 FireWire (IEEE 1394): VIA Technologies, Inc. IEEE 1394 Host = Controller (rev 80) >> 00:10.0 USB Controller: VIA Technologies, Inc. VT82xxxxx UHCI USB 1.= 1 Controller (rev 80) >> 00:10.1 USB Controller: VIA Technologies, Inc. VT82xxxxx UHCI USB 1.= 1 Controller (rev 80) >> 00:10.2 USB Controller: VIA Technologies, Inc. VT82xxxxx UHCI USB 1.= 1 Controller (rev 80) >> 00:10.3 USB Controller: VIA Technologies, Inc. USB 2.0 (rev 82) >> 00:11.0 ISA bridge: VIA Technologies, Inc. VT8235 ISA Bridge >> 00:11.1 IDE interface: VIA Technologies, Inc. VT82C586A/B/VT82C686/A= /B/VT823x/A/C PIPC Bus Master IDE (rev 06) >> 00:11.5 Multimedia audio controller: VIA Technologies, Inc. VT8233/A= /8235/8237 AC97 Audio Controller (rev 50) >> 00:12.0 Ethernet controller: VIA Technologies, Inc. VT6102 [Rhine-II= ] (rev 74) >> 01:00.0 VGA compatible controller: VIA Technologies, Inc. VT8623 [Ap= ollo CLE266] integrated CastleRock graphics (rev 03) >> ------------------------------------------------- >> - version of application or utility: sg3_utils version 1.21 >=20 --=20 ~~(=B8=B8=B8=B8=BA=BA>_________________________________________________= ____.=B7'=B4=AF)~ =AF=AF=AF=AF=AF=AF=AF=AF=AF=AF=AF=AF=AF=AF=AF=AF=AF=AF=AF=AF=AF=AF=AF=AF= =AF=AF=AF=AF=AF=AF=AF=AF=AF=AF=AF=AF=AF=AF=AF=AF=AF=AF=AF=AF=AF=AF=AF=AF= =AF=AF=AF=AF=AF=AF=AF=AF=AF=AF=AF=AF=AF=AF=AF=AF=AF=AF=AF=AF=AF=AF - To unsubscribe from this list: send the line "unsubscribe linux-scsi" i= n the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html