* getting I/O errors in super_written()...any ideas what would cause this? @ 2012-11-28 17:52 Chris Friesen 2012-11-28 18:08 ` Mathias Burén 0 siblings, 1 reply; 22+ messages in thread From: Chris Friesen @ 2012-11-28 17:52 UTC (permalink / raw) To: Neil Brown, linux-raid, Jens Axboe Hi, I'm running 2.6.27 with LVM over software RAID 1 over a pair of SAS disks. Recently we started seeing messages of the following pattern: Nov 28 08:57:10 kernel: end_request: I/O error, dev sda, sector 1758169523 Nov 28 08:57:10 kernel: md: super_written gets error=-5, uptodate=0 Nov 28 08:57:10 kernel: raid1: Disk failure on sda2, disabling device. Nov 28 08:57:10 kernel: raid1: Operation continuing on 1 devices. We're working through our changes to figure out what might have triggered it, but it seems likely the root cause lies in the core code. We're assuming it's a software issue since it's reproducible on multiple new-ish systems, although so far we've only tried it on systems with one particular configuration--we're planning on trying it with different disks just to be sure. For what it's worth, we've seen the problems with disk write cache enabled and disabled. Anyone have any ideas, or pointers as to what I should look at? Thanks, Chris -- Chris Friesen Software Designer 3500 Carling Avenue Ottawa, Ontario K2H 8E9 www.genband.com ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: getting I/O errors in super_written()...any ideas what would cause this? 2012-11-28 17:52 getting I/O errors in super_written()...any ideas what would cause this? Chris Friesen @ 2012-11-28 18:08 ` Mathias Burén 2012-11-28 18:51 ` Roy Sigurd Karlsbakk 0 siblings, 1 reply; 22+ messages in thread From: Mathias Burén @ 2012-11-28 18:08 UTC (permalink / raw) To: Chris Friesen; +Cc: Neil Brown, Linux-RAID, Jens Axboe Hi, It would be interesting to see what SMART says about the above, sinde the error is regarding sda first, then md follows. Mathias On 28 November 2012 17:52, Chris Friesen <chris.friesen@genband.com> wrote: > > Hi, > > I'm running 2.6.27 with LVM over software RAID 1 over a pair of SAS disks. > > Recently we started seeing messages of the following pattern: > > Nov 28 08:57:10 kernel: end_request: I/O error, dev sda, sector 1758169523 > Nov 28 08:57:10 kernel: md: super_written gets error=-5, uptodate=0 > Nov 28 08:57:10 kernel: raid1: Disk failure on sda2, disabling device. > Nov 28 08:57:10 kernel: raid1: Operation continuing on 1 devices. > > We're working through our changes to figure out what might have triggered > it, but it seems likely the root cause lies in the core code. > > We're assuming it's a software issue since it's reproducible on multiple > new-ish systems, although so far we've only tried it on systems with one > particular configuration--we're planning on trying it with different disks > just to be sure. > > For what it's worth, we've seen the problems with disk write cache enabled > and disabled. > > Anyone have any ideas, or pointers as to what I should look at? > > Thanks, > Chris > > > -- > > Chris Friesen > Software Designer > > 3500 Carling Avenue > Ottawa, Ontario K2H 8E9 > www.genband.com > -- > To unsubscribe from this list: send the line "unsubscribe linux-raid" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: getting I/O errors in super_written()...any ideas what would cause this? 2012-11-28 18:08 ` Mathias Burén @ 2012-11-28 18:51 ` Roy Sigurd Karlsbakk 2012-11-28 20:21 ` Chris Friesen 0 siblings, 1 reply; 22+ messages in thread From: Roy Sigurd Karlsbakk @ 2012-11-28 18:51 UTC (permalink / raw) To: Mathias Burén; +Cc: Neil Brown, Linux-RAID, Jens Axboe, Chris Friesen > > I'm running 2.6.27 with LVM over software RAID 1 over a pair of SAS > > disks. > > > > Recently we started seeing messages of the following pattern: > > > > Nov 28 08:57:10 kernel: end_request: I/O error, dev sda, sector > > 1758169523 > > Nov 28 08:57:10 kernel: md: super_written gets error=-5, uptodate=0 > > Nov 28 08:57:10 kernel: raid1: Disk failure on sda2, disabling > > device. > > Nov 28 08:57:10 kernel: raid1: Operation continuing on 1 devices. > > > > We're working through our changes to figure out what might have > > triggered > > it, but it seems likely the root cause lies in the core code. > > > > We're assuming it's a software issue since it's reproducible on > > multiple > > new-ish systems, although so far we've only tried it on systems with > > one > > particular configuration--we're planning on trying it with different > > disks > > just to be sure. > > > > For what it's worth, we've seen the problems with disk write cache > > enabled > > and disabled. > > > > Anyone have any ideas, or pointers as to what I should look at? > > Hi, > > It would be interesting to see what SMART says about the above, sinde > the error is regarding sda first, then md follows. > Agreed - run smartctl -H /dev/sda or smartctl -a /dev/sda if -H succeeds Vennlige hilsener / Best regards roy -- Roy Sigurd Karlsbakk (+47) 98013356 roy@karlsbakk.net http://blogg.karlsbakk.net/ GPG Public key: http://karlsbakk.net/roysigurdkarlsbakk.pubkey.txt -- I all pedagogikk er det essensielt at pensum presenteres intelligibelt. Det er et elementært imperativ for alle pedagoger å unngå eksessiv anvendelse av idiomer med xenotyp etymologi. I de fleste tilfeller eksisterer adekvate og relevante synonymer på norsk. -- To unsubscribe from this list: send the line "unsubscribe linux-raid" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: getting I/O errors in super_written()...any ideas what would cause this? 2012-11-28 18:51 ` Roy Sigurd Karlsbakk @ 2012-11-28 20:21 ` Chris Friesen 2012-11-28 20:27 ` Mathias Burén 0 siblings, 1 reply; 22+ messages in thread From: Chris Friesen @ 2012-11-28 20:21 UTC (permalink / raw) To: Roy Sigurd Karlsbakk Cc: Mathias Burén, Neil Brown, Linux-RAID, Jens Axboe On 11/28/2012 12:51 PM, Roy Sigurd Karlsbakk wrote: >>> I'm running 2.6.27 with LVM over software RAID 1 over a pair of SAS >>> disks. >>> >>> Recently we started seeing messages of the following pattern: >>> >>> Nov 28 08:57:10 kernel: end_request: I/O error, dev sda, sector >>> 1758169523 >>> Nov 28 08:57:10 kernel: md: super_written gets error=-5, uptodate=0 >>> Nov 28 08:57:10 kernel: raid1: Disk failure on sda2, disabling >>> device. >>> Nov 28 08:57:10 kernel: raid1: Operation continuing on 1 devices. >> It would be interesting to see what SMART says about the above, sinde >> the error is regarding sda first, then md follows. >> > > Agreed - run smartctl -H /dev/sda or smartctl -a /dev/sda if -H succeeds Okay, I just got some more information that I didn't have earlier. Apparently we're doing a disk self-test command at the time we see the error. I'm trying to get the details of exactly what is being run, but from the output below it looks like some form of background short test. Is it possible that the self test causes an error message that the kernel doesn't know how to handle? In any case, here's the smartctl output from right after a failure: root@typhoon-base-unit0:/root> ./smartctl --all /dev/sda smartctl version 5.38 [i686-wrs-linux-gnu] Copyright (C) 2002-8 Bruce Allen Home page is http://smartmontools.sourceforge.net/ Device: WD WD9001BKHG-02D22 Version: SR03 Serial number: WX21EB1ANU78 Device type: disk Transport protocol: SAS Local Time is: Fri Nov 23 00:35:03 2012 HKT Device supports SMART and is Enabled Temperature Warning Enabled SMART Health Status: OK Current Drive Temperature: 39 C Drive Trip Temperature: 69 C Manufactured in week 01 of year 2010 Recommended maximum start stop count: 1048576 times Current start stop count: 26 times Elements in grown defect list: 0 Error counter log: Errors Corrected by Total Correction Gigabytes Total ECC rereads/ errors algorithm processed uncorrected fast | delayed rewrites corrected invocations [109 bytes] errors read: 21187 2 2 21189 2 4950.446 0 write: 89 4 0 93 4 1317.938 0 verify: 103 0 0 103 0 0.000 0 Non-medium error count: 169436 SMART Self-test log Num Test Status segment LifeTime LBA_first_err [SK ASC ASQ] Description number (hours) # 1 Background short Self test in progress ... 4 NOW - [- - -] # 2 Background short Completed - 1377 - [- - -] # 3 Background short Completed - 1377 - [- - -] # 4 Background short Completed - 1377 - [- - -] # 5 Background short Completed - 1377 - [- - -] # 6 Background short Completed - 1377 - [- - -] # 7 Background short Completed - 1377 - [- - -] # 8 Background short Completed - 1377 - [- - -] # 9 Background short Completed - 1377 - [- - -] #10 Background short Completed - 1377 - [- - -] #11 Background short Completed - 1377 - [- - -] #12 Background short Completed - 1377 - [- - -] #13 Background short Completed - 1377 - [- - -] #14 Background short Completed - 1377 - [- - -] #15 Background short Completed - 1377 - [- - -] #16 Background short Completed - 1377 - [- - -] #17 Background short Completed - 1377 - [- - -] #18 Background short Completed - 1377 - [- - -] #19 Background short Completed - 1377 - [- - -] #20 Background short Completed - 1377 - [- - -] Long (extended) Self Test duration: 6362 seconds [106.0 minutes] I also have this from ten minutes later with a newer version of smartctl: root@typhoon-base-unit0:/root> ./smartctl.eric -x /dev/sda smartctl 5.40 2010-10-16 r3189 [i686-pc-linux-gnu] (local build) Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net Device: WD WD9001BKHG-02D22 Version: SR03 Serial number: WX21EB1ANU78 Device type: disk Transport protocol: SAS Local Time is: Fri Nov 23 00:45:08 2012 HKT Device supports SMART and is Enabled Temperature Warning Enabled SMART Health Status: OK Current Drive Temperature: 39 C Drive Trip Temperature: 69 C Manufactured in week 01 of year 2010 Specified cycle count over device lifetime: 1048576 Accumulated start-stop cycles: 26 Specified load-unload count over device lifetime: 1114112 Accumulated load-unload cycles: 0 Elements in grown defect list: 0 Error counter log: Errors Corrected by Total Correction Gigabytes Total ECC rereads/ errors algorithm processed uncorrected fast | delayed rewrites corrected invocations [109 bytes] errors read: 21189 2 2 21191 2 4950.446 0 write: 89 4 0 93 4 1317.939 0 verify: 103 0 0 103 0 0.000 0 Non-medium error count: 169436 SMART Self-test log Num Test Status segment LifeTime LBA_first_err [SK ASC ASQ] Description number (hours) # 1 Background short Completed - 1378 - [- - -] # 2 Background short Completed - 1378 - [- - -] # 3 Background short Completed - 1378 - [- - -] # 4 Background short Completed - 1377 - [- - -] # 5 Background short Completed - 1377 - [- - -] # 6 Background short Completed - 1377 - [- - -] # 7 Background short Completed - 1377 - [- - -] # 8 Background short Completed - 1377 - [- - -] # 9 Background short Completed - 1377 - [- - -] #10 Background short Completed - 1377 - [- - -] #11 Background short Completed - 1377 - [- - -] #12 Background short Completed - 1377 - [- - -] #13 Background short Completed - 1377 - [- - -] #14 Background short Completed - 1377 - [- - -] #15 Background short Completed - 1377 - [- - -] #16 Background short Completed - 1377 - [- - -] #17 Background short Completed - 1377 - [- - -] #18 Background short Completed - 1377 - [- - -] #19 Background short Completed - 1377 - [- - -] #20 Background short Completed - 1377 - [- - -] Long (extended) Self Test duration: 6362 seconds [106.0 minutes] Background scan results log Status: no scans active Accumulated power on time, hours:minutes 1378:08 [82688 minutes] Number of background scans performed: 0, scan progress: 0.00% Number of background medium scans performed: 0 Protocol Specific port log page for SAS SSP relative target port id = 1 generation code = 0 number of phys = 1 phy identifier = 0 attached device type: end device attached reason: unknown reason: unknown negotiated logical link rate: phy enabled; 3 Gbps attached initiator port: ssp=1 stp=1 smp=1 attached target port: ssp=0 stp=0 smp=0 SAS address = 0x50014ee3556977a6 attached SAS address = 0x5fcfffff00000001 attached phy identifier = 0 Invalid DWORD count = 0 Running disparity error count = 0 Loss of DWORD synchronization = 3 Phy reset problem = 0 Phy event descriptors: Transmitted SSP frame error count: 0 Received SSP frame error count: 0 relative target port id = 2 generation code = 0 number of phys = 1 phy identifier = 1 attached device type: no device attached attached reason: unknown reason: unknown negotiated logical link rate: phy enabled; unknown attached initiator port: ssp=0 stp=0 smp=0 attached target port: ssp=0 stp=0 smp=0 SAS address = 0x50014ee3556977a7 attached SAS address = 0x0 attached phy identifier = 0 Invalid DWORD count = 0 Running disparity error count = 0 Loss of DWORD synchronization = 0 Phy reset problem = 0 Phy event descriptors: Transmitted SSP frame error count: 0 Received SSP frame error count: 0 ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: getting I/O errors in super_written()...any ideas what would cause this? 2012-11-28 20:21 ` Chris Friesen @ 2012-11-28 20:27 ` Mathias Burén 2012-11-28 20:29 ` Chris Friesen 0 siblings, 1 reply; 22+ messages in thread From: Mathias Burén @ 2012-11-28 20:27 UTC (permalink / raw) To: Chris Friesen; +Cc: Roy Sigurd Karlsbakk, Neil Brown, Linux-RAID, Jens Axboe On 28 November 2012 20:21, Chris Friesen <chris.friesen@genband.com> wrote: > On 11/28/2012 12:51 PM, Roy Sigurd Karlsbakk wrote: >>>> I'm running 2.6.27 with LVM over software RAID 1 over a pair of SAS >>>> disks. >>>> >>>> Recently we started seeing messages of the following pattern: >>>> >>>> Nov 28 08:57:10 kernel: end_request: I/O error, dev sda, sector >>>> 1758169523 >>>> Nov 28 08:57:10 kernel: md: super_written gets error=-5, uptodate=0 >>>> Nov 28 08:57:10 kernel: raid1: Disk failure on sda2, disabling >>>> device. >>>> Nov 28 08:57:10 kernel: raid1: Operation continuing on 1 devices. > >>> It would be interesting to see what SMART says about the above, sinde >>> the error is regarding sda first, then md follows. >>> >> >> Agreed - run smartctl -H /dev/sda or smartctl -a /dev/sda if -H succeeds > > Okay, I just got some more information that I didn't have earlier. > Apparently we're doing a disk self-test command at the time we see > the error. I'm trying to get the details of exactly what is > being run, but from the output below it looks like some form of > background short test. > > Is it possible that the self test causes an error message that the kernel > doesn't know how to handle? > > > In any case, here's the smartctl output from right after a failure: > > root@typhoon-base-unit0:/root> ./smartctl --all /dev/sda > smartctl version 5.38 [i686-wrs-linux-gnu] Copyright (C) 2002-8 Bruce Allen > Home page is http://smartmontools.sourceforge.net/ > > Device: WD WD9001BKHG-02D22 Version: SR03 > Serial number: WX21EB1ANU78 > Device type: disk > Transport protocol: SAS > Local Time is: Fri Nov 23 00:35:03 2012 HKT > Device supports SMART and is Enabled > Temperature Warning Enabled > SMART Health Status: OK > > Current Drive Temperature: 39 C > Drive Trip Temperature: 69 C > Manufactured in week 01 of year 2010 > Recommended maximum start stop count: 1048576 times > Current start stop count: 26 times > Elements in grown defect list: 0 > > Error counter log: > Errors Corrected by Total Correction Gigabytes Total > ECC rereads/ errors algorithm processed uncorrected > fast | delayed rewrites corrected invocations [109 bytes] errors > read: 21187 2 2 21189 2 4950.446 0 > write: 89 4 0 93 4 1317.938 0 > verify: 103 0 0 103 0 0.000 0 > > Non-medium error count: 169436 > > SMART Self-test log > Num Test Status segment LifeTime LBA_first_err [SK ASC ASQ] > Description number (hours) > # 1 Background short Self test in progress ... 4 NOW - [- - -] > # 2 Background short Completed - 1377 - [- - -] > # 3 Background short Completed - 1377 - [- - -] > # 4 Background short Completed - 1377 - [- - -] > # 5 Background short Completed - 1377 - [- - -] > # 6 Background short Completed - 1377 - [- - -] > # 7 Background short Completed - 1377 - [- - -] > # 8 Background short Completed - 1377 - [- - -] > # 9 Background short Completed - 1377 - [- - -] > #10 Background short Completed - 1377 - [- - -] > #11 Background short Completed - 1377 - [- - -] > #12 Background short Completed - 1377 - [- - -] > #13 Background short Completed - 1377 - [- - -] > #14 Background short Completed - 1377 - [- - -] > #15 Background short Completed - 1377 - [- - -] > #16 Background short Completed - 1377 - [- - -] > #17 Background short Completed - 1377 - [- - -] > #18 Background short Completed - 1377 - [- - -] > #19 Background short Completed - 1377 - [- - -] > #20 Background short Completed - 1377 - [- - -] > > Long (extended) Self Test duration: 6362 seconds [106.0 minutes] > > > > > I also have this from ten minutes later with a newer version of smartctl: > > root@typhoon-base-unit0:/root> ./smartctl.eric -x /dev/sda > smartctl 5.40 2010-10-16 r3189 [i686-pc-linux-gnu] (local build) > Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net > > Device: WD WD9001BKHG-02D22 Version: SR03 > Serial number: WX21EB1ANU78 > Device type: disk > Transport protocol: SAS > Local Time is: Fri Nov 23 00:45:08 2012 HKT > Device supports SMART and is Enabled > Temperature Warning Enabled > SMART Health Status: OK > > Current Drive Temperature: 39 C > Drive Trip Temperature: 69 C > Manufactured in week 01 of year 2010 > Specified cycle count over device lifetime: 1048576 > Accumulated start-stop cycles: 26 > Specified load-unload count over device lifetime: 1114112 > Accumulated load-unload cycles: 0 > Elements in grown defect list: 0 > > Error counter log: > Errors Corrected by Total Correction Gigabytes Total > ECC rereads/ errors algorithm processed uncorrected > fast | delayed rewrites corrected invocations [109 bytes] errors > read: 21189 2 2 21191 2 4950.446 0 > write: 89 4 0 93 4 1317.939 0 > verify: 103 0 0 103 0 0.000 0 > > Non-medium error count: 169436 > > SMART Self-test log > Num Test Status segment LifeTime LBA_first_err [SK ASC ASQ] > Description number (hours) > # 1 Background short Completed - 1378 - [- - -] > # 2 Background short Completed - 1378 - [- - -] > # 3 Background short Completed - 1378 - [- - -] > # 4 Background short Completed - 1377 - [- - -] > # 5 Background short Completed - 1377 - [- - -] > # 6 Background short Completed - 1377 - [- - -] > # 7 Background short Completed - 1377 - [- - -] > # 8 Background short Completed - 1377 - [- - -] > # 9 Background short Completed - 1377 - [- - -] > #10 Background short Completed - 1377 - [- - -] > #11 Background short Completed - 1377 - [- - -] > #12 Background short Completed - 1377 - [- - -] > #13 Background short Completed - 1377 - [- - -] > #14 Background short Completed - 1377 - [- - -] > #15 Background short Completed - 1377 - [- - -] > #16 Background short Completed - 1377 - [- - -] > #17 Background short Completed - 1377 - [- - -] > #18 Background short Completed - 1377 - [- - -] > #19 Background short Completed - 1377 - [- - -] > #20 Background short Completed - 1377 - [- - -] > > Long (extended) Self Test duration: 6362 seconds [106.0 minutes] > > Background scan results log > Status: no scans active > Accumulated power on time, hours:minutes 1378:08 [82688 minutes] > Number of background scans performed: 0, scan progress: 0.00% > Number of background medium scans performed: 0 > Protocol Specific port log page for SAS SSP > relative target port id = 1 > generation code = 0 > number of phys = 1 > phy identifier = 0 > attached device type: end device > attached reason: unknown > reason: unknown > negotiated logical link rate: phy enabled; 3 Gbps > attached initiator port: ssp=1 stp=1 smp=1 > attached target port: ssp=0 stp=0 smp=0 > SAS address = 0x50014ee3556977a6 > attached SAS address = 0x5fcfffff00000001 > attached phy identifier = 0 > Invalid DWORD count = 0 > Running disparity error count = 0 > Loss of DWORD synchronization = 3 > Phy reset problem = 0 > Phy event descriptors: > Transmitted SSP frame error count: 0 > Received SSP frame error count: 0 > relative target port id = 2 > generation code = 0 > number of phys = 1 > phy identifier = 1 > attached device type: no device attached > attached reason: unknown > reason: unknown > negotiated logical link rate: phy enabled; unknown > attached initiator port: ssp=0 stp=0 smp=0 > attached target port: ssp=0 stp=0 smp=0 > SAS address = 0x50014ee3556977a7 > attached SAS address = 0x0 > attached phy identifier = 0 > Invalid DWORD count = 0 > Running disparity error count = 0 > Loss of DWORD synchronization = 0 > Phy reset problem = 0 > Phy event descriptors: > Transmitted SSP frame error count: 0 > Received SSP frame error count: 0 > > > > The drives look healthy, but am I reading that right? More than 10 self tests per hour? Mathias ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: getting I/O errors in super_written()...any ideas what would cause this? 2012-11-28 20:27 ` Mathias Burén @ 2012-11-28 20:29 ` Chris Friesen 2012-12-03 20:22 ` Ric Wheeler 0 siblings, 1 reply; 22+ messages in thread From: Chris Friesen @ 2012-11-28 20:29 UTC (permalink / raw) To: Mathias Burén Cc: Roy Sigurd Karlsbakk, Neil Brown, Linux-RAID, Jens Axboe On 11/28/2012 02:27 PM, Mathias Burén wrote: > The drives look healthy, but am I reading that right? More than 10 > self tests per hour? Yeah....we cranked it up to try and increase how frequently we see the problem. From what I understand normally it runs once a day. Chris -- To unsubscribe from this list: send the line "unsubscribe linux-raid" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: getting I/O errors in super_written()...any ideas what would cause this? 2012-11-28 20:29 ` Chris Friesen @ 2012-12-03 20:22 ` Ric Wheeler 2012-12-03 20:44 ` Chris Friesen 0 siblings, 1 reply; 22+ messages in thread From: Ric Wheeler @ 2012-12-03 20:22 UTC (permalink / raw) To: Chris Friesen Cc: Mathias Burén, Roy Sigurd Karlsbakk, Neil Brown, Linux-RAID, Jens Axboe, IDE/ATA development list On 11/28/2012 03:29 PM, Chris Friesen wrote: > On 11/28/2012 02:27 PM, Mathias Burén wrote: > >> The drives look healthy, but am I reading that right? More than 10 >> self tests per hour? > > Yeah....we cranked it up to try and increase how frequently we see the problem. > > From what I understand normally it runs once a day. > > Chris Did the vendor suggest to you that running a self test on an active drive would be OK? I would expect errors in this case - specifically time outs.... Regards, Ric -- To unsubscribe from this list: send the line "unsubscribe linux-raid" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: getting I/O errors in super_written()...any ideas what would cause this? 2012-12-03 20:22 ` Ric Wheeler @ 2012-12-03 20:44 ` Chris Friesen 2012-12-03 20:52 ` Ric Wheeler 0 siblings, 1 reply; 22+ messages in thread From: Chris Friesen @ 2012-12-03 20:44 UTC (permalink / raw) To: Ric Wheeler Cc: Mathias Burén, Roy Sigurd Karlsbakk, Neil Brown, Linux-RAID, Jens Axboe, IDE/ATA development list On 12/03/2012 02:22 PM, Ric Wheeler wrote: > On 11/28/2012 03:29 PM, Chris Friesen wrote: >> On 11/28/2012 02:27 PM, Mathias Burén wrote: >> >>> The drives look healthy, but am I reading that right? More than 10 >>> self tests per hour? >> >> Yeah....we cranked it up to try and increase how frequently we see the >> problem. >> >> From what I understand normally it runs once a day. >> >> Chris > > Did the vendor suggest to you that running a self test on an active > drive would be OK? I would expect errors in this case - specifically > time outs.... I'm not the main developer in that area, but from what I understand the code has been like this for ages. (It's entirely possible we've been lucky up till now since we support limited hardware types.) The fact that you'd expect time outs is interesting--is that from the delay switching from doing the self-test to doing the actual request? Is the expectation that the OS should not be sending any other commands to the disk while doing the self-test? I was recently looking at the SCSI spec trying to learn a bit about this issue and the section on background self-test (spc-4, section 5.15.4.3) seems to indicate that a READ or WRITE command should cause the background self-test to be aborted and the command to be processed within 2 seconds. In our case it doesn't seem to be aborting (at least it shows as "Completed" in smartctl)--is this expected? Thanks, Chris -- To unsubscribe from this list: send the line "unsubscribe linux-raid" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: getting I/O errors in super_written()...any ideas what would cause this? 2012-12-03 20:44 ` Chris Friesen @ 2012-12-03 20:52 ` Ric Wheeler 2012-12-03 21:08 ` Chris Friesen 0 siblings, 1 reply; 22+ messages in thread From: Ric Wheeler @ 2012-12-03 20:52 UTC (permalink / raw) To: Chris Friesen Cc: Mathias Burén, Roy Sigurd Karlsbakk, Neil Brown, Linux-RAID, Jens Axboe, IDE/ATA development list On 12/03/2012 03:44 PM, Chris Friesen wrote: > On 12/03/2012 02:22 PM, Ric Wheeler wrote: >> On 11/28/2012 03:29 PM, Chris Friesen wrote: >>> On 11/28/2012 02:27 PM, Mathias Burén wrote: >>> >>>> The drives look healthy, but am I reading that right? More than 10 >>>> self tests per hour? >>> >>> Yeah....we cranked it up to try and increase how frequently we see the >>> problem. >>> >>> From what I understand normally it runs once a day. >>> >>> Chris >> >> Did the vendor suggest to you that running a self test on an active >> drive would be OK? I would expect errors in this case - specifically >> time outs.... > > I'm not the main developer in that area, but from what I understand the code > has been like this for ages. (It's entirely possible we've been lucky up till > now since we support limited hardware types.) > > The fact that you'd expect time outs is interesting--is that from the delay > switching from doing the self-test to doing the actual request? > > Is the expectation that the OS should not be sending any other commands to the > disk while doing the self-test? > > I was recently looking at the SCSI spec trying to learn a bit about this issue > and the section on background self-test (spc-4, section 5.15.4.3) seems to > indicate that a READ or WRITE command should cause the background self-test to > be aborted and the command to be processed within 2 seconds. In our case it > doesn't seem to be aborting (at least it shows as "Completed" in smartctl)--is > this expected? > > Thanks, > Chris I jumped into this thread late - can you repost detail on the specific drive and HBA used here? In any case, it sounds like this is a better topic for the linux-scsi or linux-ide list where most of the low level storage people lurk :) Ric ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: getting I/O errors in super_written()...any ideas what would cause this? 2012-12-03 20:52 ` Ric Wheeler @ 2012-12-03 21:08 ` Chris Friesen 2012-12-03 21:21 ` Dave Jiang 2012-12-03 21:53 ` Ric Wheeler 0 siblings, 2 replies; 22+ messages in thread From: Chris Friesen @ 2012-12-03 21:08 UTC (permalink / raw) To: Ric Wheeler Cc: Mathias Burén, Roy Sigurd Karlsbakk, Neil Brown, Linux-RAID, Jens Axboe, IDE/ATA development list, linux-scsi On 12/03/2012 02:52 PM, Ric Wheeler wrote: > I jumped into this thread late - can you repost detail on the specific > drive and HBA used here? In any case, it sounds like this is a better > topic for the linux-scsi or linux-ide list where most of the low level > storage people lurk :) Okay, expanding the receiver list. :) To recap: I'm running 2.6.27 with LVM over software RAID 1 over a pair of SAS disks. Disks are WD9001BKHG, controller is Intel C600. Recently we started seeing messages of the following pattern, and we don't know what's causing them: Nov 28 08:57:10 kernel: end_request: I/O error, dev sda, sector 1758169523 Nov 28 08:57:10 kernel: md: super_written gets error=-5, uptodate=0 Nov 28 08:57:10 kernel: raid1: Disk failure on sda2, disabling device. Nov 28 08:57:10 kernel: raid1: Operation continuing on 1 devices. We've been assuming it's a software issue since it's reproducible on multiple systems, although so far we've only seen the problem with these particular disks. We've seen the problems with disk write cache enabled and disabled. It looks like it may be related to being in the middle of a background short self-test at the time we see the error. The disks are still in-service at this point--is this supported behaviour or would it be expected to cause errors? (The self-test works fine with other disks, and worked fine with these disks until recently, but we haven't made any changes to the block I/O code.) Here's the smartctl output from right after a failure. The self-tests are frequent as a stress-test, normally they're done once per day: root@typhoon-base-unit0:/root> ./smartctl --all /dev/sda smartctl version 5.38 [i686-wrs-linux-gnu] Copyright (C) 2002-8 Bruce Allen Home page is http://smartmontools.sourceforge.net/ Device: WD WD9001BKHG-02D22 Version: SR03 Serial number: WX21EB1ANU78 Device type: disk Transport protocol: SAS Local Time is: Fri Nov 23 00:35:03 2012 HKT Device supports SMART and is Enabled Temperature Warning Enabled SMART Health Status: OK Current Drive Temperature: 39 C Drive Trip Temperature: 69 C Manufactured in week 01 of year 2010 Recommended maximum start stop count: 1048576 times Current start stop count: 26 times Elements in grown defect list: 0 Error counter log: Errors Corrected by Total Correction Gigabytes Total ECC rereads/ errors algorithm processed uncorrected fast | delayed rewrites corrected invocations [109 bytes] errors read: 21187 2 2 21189 2 4950.446 0 write: 89 4 0 93 4 1317.938 0 verify: 103 0 0 103 0 0.000 0 Non-medium error count: 169436 SMART Self-test log Num Test Status segment LifeTime LBA_first_err [SK ASC ASQ] Description number (hours) # 1 Background short Self test in progress ... 4 NOW - [- - -] # 2 Background short Completed - 1377 - [- - -] # 3 Background short Completed - 1377 - [- - -] # 4 Background short Completed - 1377 - [- - -] # 5 Background short Completed - 1377 - [- - -] # 6 Background short Completed - 1377 - [- - -] # 7 Background short Completed - 1377 - [- - -] # 8 Background short Completed - 1377 - [- - -] # 9 Background short Completed - 1377 - [- - -] #10 Background short Completed - 1377 - [- - -] #11 Background short Completed - 1377 - [- - -] #12 Background short Completed - 1377 - [- - -] #13 Background short Completed - 1377 - [- - -] #14 Background short Completed - 1377 - [- - -] #15 Background short Completed - 1377 - [- - -] #16 Background short Completed - 1377 - [- - -] #17 Background short Completed - 1377 - [- - -] #18 Background short Completed - 1377 - [- - -] #19 Background short Completed - 1377 - [- - -] #20 Background short Completed - 1377 - [- - -] Long (extended) Self Test duration: 6362 seconds [106.0 minutes] I also have this from ten minutes later with a newer version of smartctl: root@typhoon-base-unit0:/root> ./smartctl.eric -x /dev/sda smartctl 5.40 2010-10-16 r3189 [i686-pc-linux-gnu] (local build) Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net Device: WD WD9001BKHG-02D22 Version: SR03 Serial number: WX21EB1ANU78 Device type: disk Transport protocol: SAS Local Time is: Fri Nov 23 00:45:08 2012 HKT Device supports SMART and is Enabled Temperature Warning Enabled SMART Health Status: OK Current Drive Temperature: 39 C Drive Trip Temperature: 69 C Manufactured in week 01 of year 2010 Specified cycle count over device lifetime: 1048576 Accumulated start-stop cycles: 26 Specified load-unload count over device lifetime: 1114112 Accumulated load-unload cycles: 0 Elements in grown defect list: 0 Error counter log: Errors Corrected by Total Correction Gigabytes Total ECC rereads/ errors algorithm processed uncorrected fast | delayed rewrites corrected invocations [109 bytes] errors read: 21189 2 2 21191 2 4950.446 0 write: 89 4 0 93 4 1317.939 0 verify: 103 0 0 103 0 0.000 0 Non-medium error count: 169436 SMART Self-test log Num Test Status segment LifeTime LBA_first_err [SK ASC ASQ] Description number (hours) # 1 Background short Completed - 1378 - [- - -] # 2 Background short Completed - 1378 - [- - -] # 3 Background short Completed - 1378 - [- - -] # 4 Background short Completed - 1377 - [- - -] # 5 Background short Completed - 1377 - [- - -] # 6 Background short Completed - 1377 - [- - -] # 7 Background short Completed - 1377 - [- - -] # 8 Background short Completed - 1377 - [- - -] # 9 Background short Completed - 1377 - [- - -] #10 Background short Completed - 1377 - [- - -] #11 Background short Completed - 1377 - [- - -] #12 Background short Completed - 1377 - [- - -] #13 Background short Completed - 1377 - [- - -] #14 Background short Completed - 1377 - [- - -] #15 Background short Completed - 1377 - [- - -] #16 Background short Completed - 1377 - [- - -] #17 Background short Completed - 1377 - [- - -] #18 Background short Completed - 1377 - [- - -] #19 Background short Completed - 1377 - [- - -] #20 Background short Completed - 1377 - [- - -] Long (extended) Self Test duration: 6362 seconds [106.0 minutes] Background scan results log Status: no scans active Accumulated power on time, hours:minutes 1378:08 [82688 minutes] Number of background scans performed: 0, scan progress: 0.00% Number of background medium scans performed: 0 Protocol Specific port log page for SAS SSP relative target port id = 1 generation code = 0 number of phys = 1 phy identifier = 0 attached device type: end device attached reason: unknown reason: unknown negotiated logical link rate: phy enabled; 3 Gbps attached initiator port: ssp=1 stp=1 smp=1 attached target port: ssp=0 stp=0 smp=0 SAS address = 0x50014ee3556977a6 attached SAS address = 0x5fcfffff00000001 attached phy identifier = 0 Invalid DWORD count = 0 Running disparity error count = 0 Loss of DWORD synchronization = 3 Phy reset problem = 0 Phy event descriptors: Transmitted SSP frame error count: 0 Received SSP frame error count: 0 relative target port id = 2 generation code = 0 number of phys = 1 phy identifier = 1 attached device type: no device attached attached reason: unknown reason: unknown negotiated logical link rate: phy enabled; unknown attached initiator port: ssp=0 stp=0 smp=0 attached target port: ssp=0 stp=0 smp=0 SAS address = 0x50014ee3556977a7 attached SAS address = 0x0 attached phy identifier = 0 Invalid DWORD count = 0 Running disparity error count = 0 Loss of DWORD synchronization = 0 Phy reset problem = 0 Phy event descriptors: Transmitted SSP frame error count: 0 Received SSP frame error count: 0 ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: getting I/O errors in super_written()...any ideas what would cause this? 2012-12-03 21:08 ` Chris Friesen @ 2012-12-03 21:21 ` Dave Jiang 2012-12-03 21:36 ` Chris Friesen 2012-12-03 21:53 ` Ric Wheeler 1 sibling, 1 reply; 22+ messages in thread From: Dave Jiang @ 2012-12-03 21:21 UTC (permalink / raw) To: Chris Friesen Cc: Ric Wheeler, Mathias Burén, Roy Sigurd Karlsbakk, Neil Brown, Linux-RAID, Jens Axboe, IDE/ATA development list, linux-scsi On 12/03/2012 02:08 PM, Chris Friesen wrote: > On 12/03/2012 02:52 PM, Ric Wheeler wrote: > >> I jumped into this thread late - can you repost detail on the specific >> drive and HBA used here? In any case, it sounds like this is a better >> topic for the linux-scsi or linux-ide list where most of the low level >> storage people lurk :) > Okay, expanding the receiver list. :) > > To recap: > > I'm running 2.6.27 with LVM over software RAID 1 over a pair of SAS disks. > Disks are WD9001BKHG, controller is Intel C600. Just curious what driver are you using with the C600. The upstream driver for C600 didn't get accepted until 3.0-rc6 and all of the outstanding patches weren't accepted until 3.7-rc. So I'd say 3.6 would be your best bet until 3.7 is released. Did you attempt a backport of the isci driver or using something like an LSI port on 2.6.27? Have you verified the issue on a more recent kernel? > Recently we started seeing messages of the following pattern, and we > don't know what's causing them: > > Nov 28 08:57:10 kernel: end_request: I/O error, dev sda, sector 1758169523 > Nov 28 08:57:10 kernel: md: super_written gets error=-5, uptodate=0 > Nov 28 08:57:10 kernel: raid1: Disk failure on sda2, disabling device. > Nov 28 08:57:10 kernel: raid1: Operation continuing on 1 devices. > > We've been assuming it's a software issue since it's reproducible on > multiple systems, although so far we've only seen the problem with > these particular disks. > > We've seen the problems with disk write cache enabled and disabled. > > It looks like it may be related to being in the middle of a background > short self-test at the time we see the error. The disks are still > in-service at this point--is this supported behaviour or would it > be expected to cause errors? (The self-test works fine with other > disks, and worked fine with these disks until recently, but we haven't > made any changes to the block I/O code.) > > Here's the smartctl output from right after a failure. The self-tests > are frequent as a stress-test, normally they're done once per day: > > root@typhoon-base-unit0:/root> ./smartctl --all /dev/sda > smartctl version 5.38 [i686-wrs-linux-gnu] Copyright (C) 2002-8 Bruce Allen > Home page is http://smartmontools.sourceforge.net/ > > Device: WD WD9001BKHG-02D22 Version: SR03 > Serial number: WX21EB1ANU78 > Device type: disk > Transport protocol: SAS > Local Time is: Fri Nov 23 00:35:03 2012 HKT > Device supports SMART and is Enabled > Temperature Warning Enabled > SMART Health Status: OK > > Current Drive Temperature: 39 C > Drive Trip Temperature: 69 C > Manufactured in week 01 of year 2010 > Recommended maximum start stop count: 1048576 times > Current start stop count: 26 times > Elements in grown defect list: 0 > > Error counter log: > Errors Corrected by Total Correction Gigabytes Total > ECC rereads/ errors algorithm processed uncorrected > fast | delayed rewrites corrected invocations [109 bytes] errors > read: 21187 2 2 21189 2 4950.446 0 > write: 89 4 0 93 4 1317.938 0 > verify: 103 0 0 103 0 0.000 0 > > Non-medium error count: 169436 > > SMART Self-test log > Num Test Status segment LifeTime LBA_first_err [SK ASC ASQ] > Description number (hours) > # 1 Background short Self test in progress ... 4 NOW - [- - -] > # 2 Background short Completed - 1377 - [- - -] > # 3 Background short Completed - 1377 - [- - -] > # 4 Background short Completed - 1377 - [- - -] > # 5 Background short Completed - 1377 - [- - -] > # 6 Background short Completed - 1377 - [- - -] > # 7 Background short Completed - 1377 - [- - -] > # 8 Background short Completed - 1377 - [- - -] > # 9 Background short Completed - 1377 - [- - -] > #10 Background short Completed - 1377 - [- - -] > #11 Background short Completed - 1377 - [- - -] > #12 Background short Completed - 1377 - [- - -] > #13 Background short Completed - 1377 - [- - -] > #14 Background short Completed - 1377 - [- - -] > #15 Background short Completed - 1377 - [- - -] > #16 Background short Completed - 1377 - [- - -] > #17 Background short Completed - 1377 - [- - -] > #18 Background short Completed - 1377 - [- - -] > #19 Background short Completed - 1377 - [- - -] > #20 Background short Completed - 1377 - [- - -] > > Long (extended) Self Test duration: 6362 seconds [106.0 minutes] > > > > > I also have this from ten minutes later with a newer version of smartctl: > > root@typhoon-base-unit0:/root> ./smartctl.eric -x /dev/sda > smartctl 5.40 2010-10-16 r3189 [i686-pc-linux-gnu] (local build) > Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net > > Device: WD WD9001BKHG-02D22 Version: SR03 > Serial number: WX21EB1ANU78 > Device type: disk > Transport protocol: SAS > Local Time is: Fri Nov 23 00:45:08 2012 HKT > Device supports SMART and is Enabled > Temperature Warning Enabled > SMART Health Status: OK > > Current Drive Temperature: 39 C > Drive Trip Temperature: 69 C > Manufactured in week 01 of year 2010 > Specified cycle count over device lifetime: 1048576 > Accumulated start-stop cycles: 26 > Specified load-unload count over device lifetime: 1114112 > Accumulated load-unload cycles: 0 > Elements in grown defect list: 0 > > Error counter log: > Errors Corrected by Total Correction Gigabytes Total > ECC rereads/ errors algorithm processed uncorrected > fast | delayed rewrites corrected invocations [109 bytes] errors > read: 21189 2 2 21191 2 4950.446 0 > write: 89 4 0 93 4 1317.939 0 > verify: 103 0 0 103 0 0.000 0 > > Non-medium error count: 169436 > > SMART Self-test log > Num Test Status segment LifeTime LBA_first_err [SK ASC ASQ] > Description number (hours) > # 1 Background short Completed - 1378 - [- - -] > # 2 Background short Completed - 1378 - [- - -] > # 3 Background short Completed - 1378 - [- - -] > # 4 Background short Completed - 1377 - [- - -] > # 5 Background short Completed - 1377 - [- - -] > # 6 Background short Completed - 1377 - [- - -] > # 7 Background short Completed - 1377 - [- - -] > # 8 Background short Completed - 1377 - [- - -] > # 9 Background short Completed - 1377 - [- - -] > #10 Background short Completed - 1377 - [- - -] > #11 Background short Completed - 1377 - [- - -] > #12 Background short Completed - 1377 - [- - -] > #13 Background short Completed - 1377 - [- - -] > #14 Background short Completed - 1377 - [- - -] > #15 Background short Completed - 1377 - [- - -] > #16 Background short Completed - 1377 - [- - -] > #17 Background short Completed - 1377 - [- - -] > #18 Background short Completed - 1377 - [- - -] > #19 Background short Completed - 1377 - [- - -] > #20 Background short Completed - 1377 - [- - -] > > Long (extended) Self Test duration: 6362 seconds [106.0 minutes] > > Background scan results log > Status: no scans active > Accumulated power on time, hours:minutes 1378:08 [82688 minutes] > Number of background scans performed: 0, scan progress: 0.00% > Number of background medium scans performed: 0 > Protocol Specific port log page for SAS SSP > relative target port id = 1 > generation code = 0 > number of phys = 1 > phy identifier = 0 > attached device type: end device > attached reason: unknown > reason: unknown > negotiated logical link rate: phy enabled; 3 Gbps > attached initiator port: ssp=1 stp=1 smp=1 > attached target port: ssp=0 stp=0 smp=0 > SAS address = 0x50014ee3556977a6 > attached SAS address = 0x5fcfffff00000001 > attached phy identifier = 0 > Invalid DWORD count = 0 > Running disparity error count = 0 > Loss of DWORD synchronization = 3 > Phy reset problem = 0 > Phy event descriptors: > Transmitted SSP frame error count: 0 > Received SSP frame error count: 0 > relative target port id = 2 > generation code = 0 > number of phys = 1 > phy identifier = 1 > attached device type: no device attached > attached reason: unknown > reason: unknown > negotiated logical link rate: phy enabled; unknown > attached initiator port: ssp=0 stp=0 smp=0 > attached target port: ssp=0 stp=0 smp=0 > SAS address = 0x50014ee3556977a7 > attached SAS address = 0x0 > attached phy identifier = 0 > Invalid DWORD count = 0 > Running disparity error count = 0 > Loss of DWORD synchronization = 0 > Phy reset problem = 0 > Phy event descriptors: > Transmitted SSP frame error count: 0 > Received SSP frame error count: 0 > > > > > > > -- > To unsubscribe from this list: send the line "unsubscribe linux-raid" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: getting I/O errors in super_written()...any ideas what would cause this? 2012-12-03 21:21 ` Dave Jiang @ 2012-12-03 21:36 ` Chris Friesen 2012-12-03 21:59 ` Dave Jiang 0 siblings, 1 reply; 22+ messages in thread From: Chris Friesen @ 2012-12-03 21:36 UTC (permalink / raw) To: Dave Jiang Cc: Ric Wheeler, Mathias Burén, Roy Sigurd Karlsbakk, Neil Brown, Linux-RAID, Jens Axboe, IDE/ATA development list, linux-scsi On 12/03/2012 03:21 PM, Dave Jiang wrote: > On 12/03/2012 02:08 PM, Chris Friesen wrote: >> On 12/03/2012 02:52 PM, Ric Wheeler wrote: >> >>> I jumped into this thread late - can you repost detail on the specific >>> drive and HBA used here? In any case, it sounds like this is a better >>> topic for the linux-scsi or linux-ide list where most of the low level >>> storage people lurk :) >> Okay, expanding the receiver list. :) >> >> To recap: >> >> I'm running 2.6.27 with LVM over software RAID 1 over a pair of SAS disks. >> Disks are WD9001BKHG, controller is Intel C600. > > Just curious what driver are you using with the C600. The upstream > driver for C600 didn't get accepted until 3.0-rc6 and all of the > outstanding patches weren't accepted until 3.7-rc. So I'd say 3.6 would > be your best bet until 3.7 is released. Did you attempt a backport of > the isci driver or using something like an LSI port on 2.6.27? Have you > verified the issue on a more recent kernel? We're using a driver provided by the hardware vendor. It appears to be a backport of version 1.0.1 of the isci driver. We've been using it since mid-March or so. This is an embedded system, so as is all too common in that environment upgrading the whole kernel isn't an option since it requires support from multiple hardware/software vendors. Upgrading just the driver might be possible--do you think it's likely as a cause for these errors? The current driver has a binary firmware file that it uses--would we keep that with the new driver? Chris ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: getting I/O errors in super_written()...any ideas what would cause this? 2012-12-03 21:36 ` Chris Friesen @ 2012-12-03 21:59 ` Dave Jiang 0 siblings, 0 replies; 22+ messages in thread From: Dave Jiang @ 2012-12-03 21:59 UTC (permalink / raw) To: Chris Friesen Cc: Ric Wheeler, Mathias Burén, Roy Sigurd Karlsbakk, Neil Brown, Linux-RAID, Jens Axboe, IDE/ATA development list, linux-scsi On 12/03/2012 02:36 PM, Chris Friesen wrote: > On 12/03/2012 03:21 PM, Dave Jiang wrote: >> On 12/03/2012 02:08 PM, Chris Friesen wrote: >>> On 12/03/2012 02:52 PM, Ric Wheeler wrote: >>> >>>> I jumped into this thread late - can you repost detail on the specific >>>> drive and HBA used here? In any case, it sounds like this is a better >>>> topic for the linux-scsi or linux-ide list where most of the low level >>>> storage people lurk :) >>> Okay, expanding the receiver list. :) >>> >>> To recap: >>> >>> I'm running 2.6.27 with LVM over software RAID 1 over a pair of SAS disks. >>> Disks are WD9001BKHG, controller is Intel C600. >> Just curious what driver are you using with the C600. The upstream >> driver for C600 didn't get accepted until 3.0-rc6 and all of the >> outstanding patches weren't accepted until 3.7-rc. So I'd say 3.6 would >> be your best bet until 3.7 is released. Did you attempt a backport of >> the isci driver or using something like an LSI port on 2.6.27? Have you >> verified the issue on a more recent kernel? > We're using a driver provided by the hardware vendor. It appears to be > a backport of version 1.0.1 of the isci driver. We've been using it > since mid-March or so. Yikes. There has been significant updates to libsas, libata, and isci driver since March. Looks like you are barely limping along. I would imagine the error handling and the hotplug would be a giant mess to say the least. > This is an embedded system, so as is all too common in that environment > upgrading the whole kernel isn't an option since it requires support > from multiple hardware/software vendors. > > Upgrading just the driver might be possible--do you think it's likely as > a cause for these errors? The current driver has a binary firmware file > that it uses--would we keep that with the new driver? You can certainly try but it needs the libsas, libata, and some block fixes to function in a stable fashion. Given that it was a backport by a vendor, one would wonder how much of libsas they actually backported. It's really difficult to say where the error is coming from without being able to verify on a later kernel. Is there any other I/O controller you can use to test this? I'm guessing the answer is no since it's embedded board. You are using a very old driver that is backported to a very old kernel that requires significant subsystem backporting as well. You may need to go poke your OS vendor and have them support the issue? The binary firmware file is really there in case you are not able to load your OEM parameter properly from the platform. It's there to allow you to limp if that is the case and by no means should be used for standard operation. You are suppose to get the appropriate values for your specific platform using a tool called phytune (which you should've gotten from your Intel field rep). You need to program those values and others into the OEM parameter block in the SPI flash of your platform. In your BIOS you need to have either the OROM or the EFI driver loaded during boot. The OROM or EFI driver then copies the values out of SPI flash at boot and provides it to the driver. Those parameters provide important timing values and others. If you are loading the wrong values against your platform, it is very possible that you could see I/O errors. > Chris ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: getting I/O errors in super_written()...any ideas what would cause this? 2012-12-03 21:08 ` Chris Friesen 2012-12-03 21:21 ` Dave Jiang @ 2012-12-03 21:53 ` Ric Wheeler 2012-12-04 22:00 ` Chris Friesen 1 sibling, 1 reply; 22+ messages in thread From: Ric Wheeler @ 2012-12-03 21:53 UTC (permalink / raw) To: Chris Friesen Cc: Mathias Burén, Roy Sigurd Karlsbakk, Neil Brown, Linux-RAID, Jens Axboe, IDE/ATA development list, linux-scsi On 12/03/2012 04:08 PM, Chris Friesen wrote: > On 12/03/2012 02:52 PM, Ric Wheeler wrote: > >> I jumped into this thread late - can you repost detail on the specific >> drive and HBA used here? In any case, it sounds like this is a better >> topic for the linux-scsi or linux-ide list where most of the low level >> storage people lurk :) > Okay, expanding the receiver list. :) > > To recap: > > I'm running 2.6.27 with LVM over software RAID 1 over a pair of SAS disks. > Disks are WD9001BKHG, controller is Intel C600. > > Recently we started seeing messages of the following pattern, and we > don't know what's causing them: > > Nov 28 08:57:10 kernel: end_request: I/O error, dev sda, sector 1758169523 > Nov 28 08:57:10 kernel: md: super_written gets error=-5, uptodate=0 > Nov 28 08:57:10 kernel: raid1: Disk failure on sda2, disabling device. > Nov 28 08:57:10 kernel: raid1: Operation continuing on 1 devices. > > We've been assuming it's a software issue since it's reproducible on > multiple systems, although so far we've only seen the problem with > these particular disks. > > We've seen the problems with disk write cache enabled and disabled. Hi Chris, Are there any earlier IO errors or sda related errors in the log? Ric > > It looks like it may be related to being in the middle of a background > short self-test at the time we see the error. The disks are still > in-service at this point--is this supported behaviour or would it > be expected to cause errors? (The self-test works fine with other > disks, and worked fine with these disks until recently, but we haven't > made any changes to the block I/O code.) > > Here's the smartctl output from right after a failure. The self-tests > are frequent as a stress-test, normally they're done once per day: > > root@typhoon-base-unit0:/root> ./smartctl --all /dev/sda > smartctl version 5.38 [i686-wrs-linux-gnu] Copyright (C) 2002-8 Bruce Allen > Home page is http://smartmontools.sourceforge.net/ > > Device: WD WD9001BKHG-02D22 Version: SR03 > Serial number: WX21EB1ANU78 > Device type: disk > Transport protocol: SAS > Local Time is: Fri Nov 23 00:35:03 2012 HKT > Device supports SMART and is Enabled > Temperature Warning Enabled > SMART Health Status: OK > > Current Drive Temperature: 39 C > Drive Trip Temperature: 69 C > Manufactured in week 01 of year 2010 > Recommended maximum start stop count: 1048576 times > Current start stop count: 26 times > Elements in grown defect list: 0 > > Error counter log: > Errors Corrected by Total Correction Gigabytes Total > ECC rereads/ errors algorithm processed uncorrected > fast | delayed rewrites corrected invocations [109 bytes] errors > read: 21187 2 2 21189 2 4950.446 0 > write: 89 4 0 93 4 1317.938 0 > verify: 103 0 0 103 0 0.000 0 > > Non-medium error count: 169436 > > SMART Self-test log > Num Test Status segment LifeTime LBA_first_err [SK ASC ASQ] > Description number (hours) > # 1 Background short Self test in progress ... 4 NOW - [- - -] > # 2 Background short Completed - 1377 - [- - -] > # 3 Background short Completed - 1377 - [- - -] > # 4 Background short Completed - 1377 - [- - -] > # 5 Background short Completed - 1377 - [- - -] > # 6 Background short Completed - 1377 - [- - -] > # 7 Background short Completed - 1377 - [- - -] > # 8 Background short Completed - 1377 - [- - -] > # 9 Background short Completed - 1377 - [- - -] > #10 Background short Completed - 1377 - [- - -] > #11 Background short Completed - 1377 - [- - -] > #12 Background short Completed - 1377 - [- - -] > #13 Background short Completed - 1377 - [- - -] > #14 Background short Completed - 1377 - [- - -] > #15 Background short Completed - 1377 - [- - -] > #16 Background short Completed - 1377 - [- - -] > #17 Background short Completed - 1377 - [- - -] > #18 Background short Completed - 1377 - [- - -] > #19 Background short Completed - 1377 - [- - -] > #20 Background short Completed - 1377 - [- - -] > > Long (extended) Self Test duration: 6362 seconds [106.0 minutes] > > > > > I also have this from ten minutes later with a newer version of smartctl: > > root@typhoon-base-unit0:/root> ./smartctl.eric -x /dev/sda > smartctl 5.40 2010-10-16 r3189 [i686-pc-linux-gnu] (local build) > Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net > > Device: WD WD9001BKHG-02D22 Version: SR03 > Serial number: WX21EB1ANU78 > Device type: disk > Transport protocol: SAS > Local Time is: Fri Nov 23 00:45:08 2012 HKT > Device supports SMART and is Enabled > Temperature Warning Enabled > SMART Health Status: OK > > Current Drive Temperature: 39 C > Drive Trip Temperature: 69 C > Manufactured in week 01 of year 2010 > Specified cycle count over device lifetime: 1048576 > Accumulated start-stop cycles: 26 > Specified load-unload count over device lifetime: 1114112 > Accumulated load-unload cycles: 0 > Elements in grown defect list: 0 > > Error counter log: > Errors Corrected by Total Correction Gigabytes Total > ECC rereads/ errors algorithm processed uncorrected > fast | delayed rewrites corrected invocations [109 bytes] errors > read: 21189 2 2 21191 2 4950.446 0 > write: 89 4 0 93 4 1317.939 0 > verify: 103 0 0 103 0 0.000 0 > > Non-medium error count: 169436 > > SMART Self-test log > Num Test Status segment LifeTime LBA_first_err [SK ASC ASQ] > Description number (hours) > # 1 Background short Completed - 1378 - [- - -] > # 2 Background short Completed - 1378 - [- - -] > # 3 Background short Completed - 1378 - [- - -] > # 4 Background short Completed - 1377 - [- - -] > # 5 Background short Completed - 1377 - [- - -] > # 6 Background short Completed - 1377 - [- - -] > # 7 Background short Completed - 1377 - [- - -] > # 8 Background short Completed - 1377 - [- - -] > # 9 Background short Completed - 1377 - [- - -] > #10 Background short Completed - 1377 - [- - -] > #11 Background short Completed - 1377 - [- - -] > #12 Background short Completed - 1377 - [- - -] > #13 Background short Completed - 1377 - [- - -] > #14 Background short Completed - 1377 - [- - -] > #15 Background short Completed - 1377 - [- - -] > #16 Background short Completed - 1377 - [- - -] > #17 Background short Completed - 1377 - [- - -] > #18 Background short Completed - 1377 - [- - -] > #19 Background short Completed - 1377 - [- - -] > #20 Background short Completed - 1377 - [- - -] > > Long (extended) Self Test duration: 6362 seconds [106.0 minutes] > > Background scan results log > Status: no scans active > Accumulated power on time, hours:minutes 1378:08 [82688 minutes] > Number of background scans performed: 0, scan progress: 0.00% > Number of background medium scans performed: 0 > Protocol Specific port log page for SAS SSP > relative target port id = 1 > generation code = 0 > number of phys = 1 > phy identifier = 0 > attached device type: end device > attached reason: unknown > reason: unknown > negotiated logical link rate: phy enabled; 3 Gbps > attached initiator port: ssp=1 stp=1 smp=1 > attached target port: ssp=0 stp=0 smp=0 > SAS address = 0x50014ee3556977a6 > attached SAS address = 0x5fcfffff00000001 > attached phy identifier = 0 > Invalid DWORD count = 0 > Running disparity error count = 0 > Loss of DWORD synchronization = 3 > Phy reset problem = 0 > Phy event descriptors: > Transmitted SSP frame error count: 0 > Received SSP frame error count: 0 > relative target port id = 2 > generation code = 0 > number of phys = 1 > phy identifier = 1 > attached device type: no device attached > attached reason: unknown > reason: unknown > negotiated logical link rate: phy enabled; unknown > attached initiator port: ssp=0 stp=0 smp=0 > attached target port: ssp=0 stp=0 smp=0 > SAS address = 0x50014ee3556977a7 > attached SAS address = 0x0 > attached phy identifier = 0 > Invalid DWORD count = 0 > Running disparity error count = 0 > Loss of DWORD synchronization = 0 > Phy reset problem = 0 > Phy event descriptors: > Transmitted SSP frame error count: 0 > Received SSP frame error count: 0 > > > > > > ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: getting I/O errors in super_written()...any ideas what would cause this? 2012-12-03 21:53 ` Ric Wheeler @ 2012-12-04 22:00 ` Chris Friesen 2012-12-04 23:55 ` Ric Wheeler 2012-12-05 9:20 ` James Bottomley 0 siblings, 2 replies; 22+ messages in thread From: Chris Friesen @ 2012-12-04 22:00 UTC (permalink / raw) To: Ric Wheeler Cc: Mathias Burén, Roy Sigurd Karlsbakk, Neil Brown, Linux-RAID, Jens Axboe, IDE/ATA development list, linux-scsi On 12/03/2012 03:53 PM, Ric Wheeler wrote: > On 12/03/2012 04:08 PM, Chris Friesen wrote: >> On 12/03/2012 02:52 PM, Ric Wheeler wrote: >> >>> I jumped into this thread late - can you repost detail on the specific >>> drive and HBA used here? In any case, it sounds like this is a better >>> topic for the linux-scsi or linux-ide list where most of the low level >>> storage people lurk :) >> Okay, expanding the receiver list. :) >> >> To recap: >> >> I'm running 2.6.27 with LVM over software RAID 1 over a pair of SAS >> disks. >> Disks are WD9001BKHG, controller is Intel C600. >> >> Recently we started seeing messages of the following pattern, and we >> don't know what's causing them: >> >> Nov 28 08:57:10 kernel: end_request: I/O error, dev sda, sector >> 1758169523 >> Nov 28 08:57:10 kernel: md: super_written gets error=-5, uptodate=0 >> Nov 28 08:57:10 kernel: raid1: Disk failure on sda2, disabling device. >> Nov 28 08:57:10 kernel: raid1: Operation continuing on 1 devices. >> >> We've been assuming it's a software issue since it's reproducible on >> multiple systems, although so far we've only seen the problem with >> these particular disks. >> >> We've seen the problems with disk write cache enabled and disabled. > > Hi Chris, > > Are there any earlier IO errors or sda related errors in the log? Nope, at least not nearby. On one system for instance we boot up and get into steady-state, then there are no kernel logs for about half an hour then out of the blue we see: Nov 27 14:58:13 base0-0-0-13-0-11-1 kernel: end_request: I/O error, dev sda, sector 1758169523 Nov 27 14:58:13 base0-0-0-13-0-11-1 kernel: md: super_written gets error=-5, uptodate=0 Nov 27 14:58:13 base0-0-0-13-0-11-1 kernel: raid1: Disk failure on sda2, disabling device. Nov 27 14:58:13 base0-0-0-13-0-11-1 kernel: raid1: Operation continuing on 1 devices. Nov 27 14:58:13 base0-0-0-13-0-11-1 kernel: end_request: I/O error, dev sdb, sector 1758169523 Nov 27 14:58:13 base0-0-0-13-0-11-1 kernel: md: super_written gets error=-5, uptodate=0 Nov 27 14:58:13 base0-0-0-13-0-11-1 kernel: RAID1 conf printout: Nov 27 14:58:13 base0-0-0-13-0-11-1 kernel: --- wd:1 rd:2 Nov 27 14:58:13 base0-0-0-13-0-11-1 kernel: disk 0, wo:1, o:0, dev:sda2 Nov 27 14:58:13 base0-0-0-13-0-11-1 kernel: disk 1, wo:0, o:1, dev:sdb2 Nov 27 14:58:13 base0-0-0-13-0-11-1 kernel: RAID1 conf printout: Nov 27 14:58:13 base0-0-0-13-0-11-1 kernel: --- wd:1 rd:2 Nov 27 14:58:13 base0-0-0-13-0-11-1 kernel: disk 1, wo:0, o:1, dev:sdb2 As another data point, it looks like we may be doing a SEND DIAGNOSTIC command specifying the default self-test in addition to the background short self-test. This seems a bit risky and excessive to me, but apparently the guy that wrote it is no longer with the company. What is the recommended method for monitoring disks on a system that is likely to go a long time between boots? Do we avoid any in-service testing and just monitor the SMART data and only test it if something actually goes wrong? Or should we intentionally drop a disk out of the array and test it? (The downside of that is that we lose redundancy since we only have 2 disks.) Chris ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: getting I/O errors in super_written()...any ideas what would cause this? 2012-12-04 22:00 ` Chris Friesen @ 2012-12-04 23:55 ` Ric Wheeler 2012-12-05 9:20 ` James Bottomley 1 sibling, 0 replies; 22+ messages in thread From: Ric Wheeler @ 2012-12-04 23:55 UTC (permalink / raw) To: Chris Friesen Cc: Mathias Burén, Roy Sigurd Karlsbakk, Neil Brown, Linux-RAID, Jens Axboe, IDE/ATA development list, linux-scsi On 12/04/2012 05:00 PM, Chris Friesen wrote: > On 12/03/2012 03:53 PM, Ric Wheeler wrote: >> On 12/03/2012 04:08 PM, Chris Friesen wrote: >>> On 12/03/2012 02:52 PM, Ric Wheeler wrote: >>> >>>> I jumped into this thread late - can you repost detail on the specific >>>> drive and HBA used here? In any case, it sounds like this is a better >>>> topic for the linux-scsi or linux-ide list where most of the low level >>>> storage people lurk :) >>> Okay, expanding the receiver list. :) >>> >>> To recap: >>> >>> I'm running 2.6.27 with LVM over software RAID 1 over a pair of SAS >>> disks. >>> Disks are WD9001BKHG, controller is Intel C600. >>> >>> Recently we started seeing messages of the following pattern, and we >>> don't know what's causing them: >>> >>> Nov 28 08:57:10 kernel: end_request: I/O error, dev sda, sector >>> 1758169523 >>> Nov 28 08:57:10 kernel: md: super_written gets error=-5, uptodate=0 >>> Nov 28 08:57:10 kernel: raid1: Disk failure on sda2, disabling device. >>> Nov 28 08:57:10 kernel: raid1: Operation continuing on 1 devices. >>> >>> We've been assuming it's a software issue since it's reproducible on >>> multiple systems, although so far we've only seen the problem with >>> these particular disks. >>> >>> We've seen the problems with disk write cache enabled and disabled. >> Hi Chris, >> >> Are there any earlier IO errors or sda related errors in the log? > Nope, at least not nearby. On one system for instance we boot up and > get into steady-state, then there are no kernel logs for about half an > hour then out of the blue we see: > > Nov 27 14:58:13 base0-0-0-13-0-11-1 kernel: end_request: I/O error, dev sda, sector 1758169523 > Nov 27 14:58:13 base0-0-0-13-0-11-1 kernel: md: super_written gets error=-5, uptodate=0 > Nov 27 14:58:13 base0-0-0-13-0-11-1 kernel: raid1: Disk failure on sda2, disabling device. > Nov 27 14:58:13 base0-0-0-13-0-11-1 kernel: raid1: Operation continuing on 1 devices. > Nov 27 14:58:13 base0-0-0-13-0-11-1 kernel: end_request: I/O error, dev sdb, sector 1758169523 > Nov 27 14:58:13 base0-0-0-13-0-11-1 kernel: md: super_written gets error=-5, uptodate=0 > Nov 27 14:58:13 base0-0-0-13-0-11-1 kernel: RAID1 conf printout: > Nov 27 14:58:13 base0-0-0-13-0-11-1 kernel: --- wd:1 rd:2 > Nov 27 14:58:13 base0-0-0-13-0-11-1 kernel: disk 0, wo:1, o:0, dev:sda2 > Nov 27 14:58:13 base0-0-0-13-0-11-1 kernel: disk 1, wo:0, o:1, dev:sdb2 > Nov 27 14:58:13 base0-0-0-13-0-11-1 kernel: RAID1 conf printout: > Nov 27 14:58:13 base0-0-0-13-0-11-1 kernel: --- wd:1 rd:2 > Nov 27 14:58:13 base0-0-0-13-0-11-1 kernel: disk 1, wo:0, o:1, dev:sdb2 > > > As another data point, it looks like we may be doing a SEND DIAGNOSTIC > command specifying the default self-test in addition to the background > short self-test. This seems a bit risky and excessive to me, but > apparently the guy that wrote it is no longer with the company. > > What is the recommended method for monitoring disks on a system that > is likely to go a long time between boots? Do we avoid any in-service > testing and just monitor the SMART data and only test it if something > actually goes wrong? Or should we intentionally drop a disk out of the > array and test it? (The downside of that is that we lose > redundancy since we only have 2 disks.) > > Chris I don't know if running the self tests really helps. Normally, I would simply suggest scanning for remapped sectors (and looking out for lots of them, not just a handful since they are moderately normal in disks). You can do that with smartctl. Best advice is to try and consult directly with your disk vendor about their suggestions if you have that connection of course :) Ric ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: getting I/O errors in super_written()...any ideas what would cause this? 2012-12-04 22:00 ` Chris Friesen 2012-12-04 23:55 ` Ric Wheeler @ 2012-12-05 9:20 ` James Bottomley 2012-12-05 11:41 ` Ric Wheeler 2012-12-06 18:15 ` Chris Friesen 1 sibling, 2 replies; 22+ messages in thread From: James Bottomley @ 2012-12-05 9:20 UTC (permalink / raw) To: Chris Friesen Cc: Ric Wheeler, Mathias Burén, Roy Sigurd Karlsbakk, Neil Brown, Linux-RAID, Jens Axboe, IDE/ATA development list, linux-scsi On Tue, 2012-12-04 at 16:00 -0600, Chris Friesen wrote: > As another data point, it looks like we may be doing a SEND DIAGNOSTIC > command specifying the default self-test in addition to the background > short self-test. This seems a bit risky and excessive to me, but > apparently the guy that wrote it is no longer with the company. This is a really bad idea. A lot of disks go out to lunch until the diagnostics complete (the same goes for SMART diagnostics). This means that if you do diagnostics on a running device, the drivers start to get timeouts on commands which are queued waiting for diagnostics to complete ... if those go over the standard SCSI timeouts, we'll start to try error recovery and likely have the disaster you see above. > What is the recommended method for monitoring disks on a system that > is likely to go a long time between boots? Do we avoid any in-service > testing and just monitor the SMART data and only test it if something > actually goes wrong? Or should we intentionally drop a disk out of the > array and test it? (The downside of that is that we lose > redundancy since we only have 2 disks.) What do you mean by "monitoring" ... as in what are you looking for? To make sure the disk is healthy and responding, a simple test unit ready works. To look at other parameters, read the mode pages. Anything that actively causes the disk to go out and check something is a bad idea in a running environment. Only do this if you can quiesce the I/O before starting the active diagnostic (or drop the disk from the array as you suggest). To be honest, though, modern disks do a whole host of diagnostics as they write data just to check that it is safely committed, so passive monitoring should be fine. James ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: getting I/O errors in super_written()...any ideas what would cause this? 2012-12-05 9:20 ` James Bottomley @ 2012-12-05 11:41 ` Ric Wheeler 2012-12-05 11:57 ` James Bottomley 2012-12-06 18:15 ` Chris Friesen 1 sibling, 1 reply; 22+ messages in thread From: Ric Wheeler @ 2012-12-05 11:41 UTC (permalink / raw) To: James Bottomley Cc: Chris Friesen, Mathias Burén, Roy Sigurd Karlsbakk, Neil Brown, Linux-RAID, Jens Axboe, IDE/ATA development list, linux-scsi On 12/05/2012 04:20 AM, James Bottomley wrote: > On Tue, 2012-12-04 at 16:00 -0600, Chris Friesen wrote: >> As another data point, it looks like we may be doing a SEND DIAGNOSTIC >> command specifying the default self-test in addition to the background >> short self-test. This seems a bit risky and excessive to me, but >> apparently the guy that wrote it is no longer with the company. > This is a really bad idea. A lot of disks go out to lunch until the > diagnostics complete (the same goes for SMART diagnostics). This means > that if you do diagnostics on a running device, the drivers start to get > timeouts on commands which are queued waiting for diagnostics to > complete ... if those go over the standard SCSI timeouts, we'll start to > try error recovery and likely have the disaster you see above. > >> What is the recommended method for monitoring disks on a system that >> is likely to go a long time between boots? Do we avoid any in-service >> testing and just monitor the SMART data and only test it if something >> actually goes wrong? Or should we intentionally drop a disk out of the >> array and test it? (The downside of that is that we lose >> redundancy since we only have 2 disks.) > What do you mean by "monitoring" ... as in what are you looking for? To > make sure the disk is healthy and responding, a simple test unit ready > works. To look at other parameters, read the mode pages. > > Anything that actively causes the disk to go out and check something is > a bad idea in a running environment. Only do this if you can quiesce > the I/O before starting the active diagnostic (or drop the disk from the > array as you suggest). > > To be honest, though, modern disks do a whole host of diagnostics as > they write data just to check that it is safely committed, so passive > monitoring should be fine. > > James > > I don't think that the basic stat gathering (smartctl -a ....) has this kind of impact, but am worried about the running of the diagnostics, ric ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: getting I/O errors in super_written()...any ideas what would cause this? 2012-12-05 11:41 ` Ric Wheeler @ 2012-12-05 11:57 ` James Bottomley 0 siblings, 0 replies; 22+ messages in thread From: James Bottomley @ 2012-12-05 11:57 UTC (permalink / raw) To: Ric Wheeler Cc: Chris Friesen, Mathias Burén, Roy Sigurd Karlsbakk, Neil Brown, Linux-RAID, Jens Axboe, IDE/ATA development list, linux-scsi On Wed, 2012-12-05 at 06:41 -0500, Ric Wheeler wrote: > I don't think that the basic stat gathering (smartctl -a ....) has this kind of > impact, but am worried about the running of the diagnostics, Right, that's what I was trying to say above. Anything that just reads existing data is fine. Anything that causes the disk to go off and test things to gather data is likely to cause I/O errors due to timeouts. James ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: getting I/O errors in super_written()...any ideas what would cause this? 2012-12-05 9:20 ` James Bottomley 2012-12-05 11:41 ` Ric Wheeler @ 2012-12-06 18:15 ` Chris Friesen 2012-12-06 20:27 ` Chris Murphy 2012-12-08 18:08 ` James Bottomley 1 sibling, 2 replies; 22+ messages in thread From: Chris Friesen @ 2012-12-06 18:15 UTC (permalink / raw) To: James Bottomley Cc: Ric Wheeler, Mathias Burén, Roy Sigurd Karlsbakk, Neil Brown, Linux-RAID, Jens Axboe, IDE/ATA development list, linux-scsi On 12/05/2012 03:20 AM, James Bottomley wrote: > On Tue, 2012-12-04 at 16:00 -0600, Chris Friesen wrote: >> As another data point, it looks like we may be doing a SEND DIAGNOSTIC >> command specifying the default self-test in addition to the background >> short self-test. This seems a bit risky and excessive to me, but >> apparently the guy that wrote it is no longer with the company. > > This is a really bad idea. A lot of disks go out to lunch until the > diagnostics complete (the same goes for SMART diagnostics). This means > that if you do diagnostics on a running device, the drivers start to get > timeouts on commands which are queued waiting for diagnostics to > complete ... if those go over the standard SCSI timeouts, we'll start to > try error recovery and likely have the disaster you see above. So it turns out that our problems are intermittently triggered when running the default self test. This agrees with the statement in sg_senddiag to not do foreground self-tests on disks with mounted filesystems. We seem to be able to do background short self-tests (ie, SEND DIAGNOSTIC command with self-test code of 001b and ST code of 0b) without causing any problems. Is this pushing our luck or is this something that should work according to the spec and the linux stack? The scsi spec indicates that in this case for most commands the test will be paused and the command executed within 2 seconds, but I don't know what the normal scsi timeouts are. Thanks for the input, this is very useful. Chris ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: getting I/O errors in super_written()...any ideas what would cause this? 2012-12-06 18:15 ` Chris Friesen @ 2012-12-06 20:27 ` Chris Murphy 2012-12-08 18:08 ` James Bottomley 1 sibling, 0 replies; 22+ messages in thread From: Chris Murphy @ 2012-12-06 20:27 UTC (permalink / raw) To: Linux-RAID; +Cc: IDE/ATA development list, linux-scsi On Dec 6, 2012, at 11:15 AM, Chris Friesen <chris.friesen@genband.com> wrote: >> > > So it turns out that our problems are intermittently triggered when running the default self test. This agrees with the statement in sg_senddiag to not do foreground self-tests on disks with mounted filesystems. Yeah that would be a bad idea. But the default SMART test isn't with the -C flag (for foreground testing), it's a background test including the long/extended test. Chris Murphy ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: getting I/O errors in super_written()...any ideas what would cause this? 2012-12-06 18:15 ` Chris Friesen 2012-12-06 20:27 ` Chris Murphy @ 2012-12-08 18:08 ` James Bottomley 1 sibling, 0 replies; 22+ messages in thread From: James Bottomley @ 2012-12-08 18:08 UTC (permalink / raw) To: Chris Friesen Cc: Ric Wheeler, Mathias Burén, Roy Sigurd Karlsbakk, Neil Brown, Linux-RAID, Jens Axboe, IDE/ATA development list, linux-scsi On Thu, 2012-12-06 at 12:15 -0600, Chris Friesen wrote: > On 12/05/2012 03:20 AM, James Bottomley wrote: > > On Tue, 2012-12-04 at 16:00 -0600, Chris Friesen wrote: > >> As another data point, it looks like we may be doing a SEND DIAGNOSTIC > >> command specifying the default self-test in addition to the background > >> short self-test. This seems a bit risky and excessive to me, but > >> apparently the guy that wrote it is no longer with the company. > > > > This is a really bad idea. A lot of disks go out to lunch until the > > diagnostics complete (the same goes for SMART diagnostics). This means > > that if you do diagnostics on a running device, the drivers start to get > > timeouts on commands which are queued waiting for diagnostics to > > complete ... if those go over the standard SCSI timeouts, we'll start to > > try error recovery and likely have the disaster you see above. > > So it turns out that our problems are intermittently triggered when > running the default self test. This agrees with the statement in > sg_senddiag to not do foreground self-tests on disks with mounted > filesystems. > > We seem to be able to do background short self-tests (ie, SEND > DIAGNOSTIC command with self-test code of 001b and ST code of 0b) > without causing any problems. Is this pushing our luck or is this > something that should work according to the spec and the linux stack? No one can tell you this. The specs don't say what should happen on a diagnostic, how long it will take or how disruptive to the I/O flow it is. > The scsi spec indicates that in this case for most commands the test > will be paused and the command executed within 2 seconds, but I don't > know what the normal scsi timeouts are. 2 Seconds can be an eternity if you're pumping huge amounts of data to a disk: it causes a burp in the I/O chain which propagates back up the stack with unpredictable knock on consequences. The standard SCSI timeouts are configurable under sysfs, but they're 30s. It is known that EMC recommends 60s timeouts for arrays, for instance (which is why they're configurable). James ^ permalink raw reply [flat|nested] 22+ messages in thread
end of thread, other threads:[~2012-12-08 18:08 UTC | newest] Thread overview: 22+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2012-11-28 17:52 getting I/O errors in super_written()...any ideas what would cause this? Chris Friesen 2012-11-28 18:08 ` Mathias Burén 2012-11-28 18:51 ` Roy Sigurd Karlsbakk 2012-11-28 20:21 ` Chris Friesen 2012-11-28 20:27 ` Mathias Burén 2012-11-28 20:29 ` Chris Friesen 2012-12-03 20:22 ` Ric Wheeler 2012-12-03 20:44 ` Chris Friesen 2012-12-03 20:52 ` Ric Wheeler 2012-12-03 21:08 ` Chris Friesen 2012-12-03 21:21 ` Dave Jiang 2012-12-03 21:36 ` Chris Friesen 2012-12-03 21:59 ` Dave Jiang 2012-12-03 21:53 ` Ric Wheeler 2012-12-04 22:00 ` Chris Friesen 2012-12-04 23:55 ` Ric Wheeler 2012-12-05 9:20 ` James Bottomley 2012-12-05 11:41 ` Ric Wheeler 2012-12-05 11:57 ` James Bottomley 2012-12-06 18:15 ` Chris Friesen 2012-12-06 20:27 ` Chris Murphy 2012-12-08 18:08 ` James Bottomley
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).