getting I/O errors in super_written()...any ideas what would cause this?

linux-raid.vger.kernel.org archive mirror

* getting I/O errors in super_written()...any ideas what would cause this?
@ 2012-11-28 17:52 Chris Friesen
  2012-11-28 18:08 ` Mathias Burén
  0 siblings, 1 reply; 22+ messages in thread
From: Chris Friesen @ 2012-11-28 17:52 UTC
  To: Neil Brown, linux-raid, Jens Axboe

Hi,

I'm running 2.6.27 with LVM over software RAID 1 over a pair of SAS disks.

Recently we started seeing messages of the following pattern:

Nov 28 08:57:10 kernel: end_request: I/O error, dev sda, sector 1758169523
Nov 28 08:57:10 kernel: md: super_written gets error=-5, uptodate=0
Nov 28 08:57:10 kernel: raid1: Disk failure on sda2, disabling device.
Nov 28 08:57:10 kernel: raid1: Operation continuing on 1 devices.
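
As a reading aid for the log lines above, here is a minimal Python sketch that decodes the two numbers in them. It assumes the sector in the end_request line is counted in 512-byte units (the block layer's usual convention); the names LOG, SECTOR_SIZE and byte_offset are illustrative only.

```python
import errno
import os
import re

# The first two log lines from the report, copied verbatim.
LOG = """\
Nov 28 08:57:10 kernel: end_request: I/O error, dev sda, sector 1758169523
Nov 28 08:57:10 kernel: md: super_written gets error=-5, uptodate=0
"""

SECTOR_SIZE = 512  # assumption: sector numbers are in 512-byte units

sector = int(re.search(r"sector (\d+)", LOG).group(1))
err = -int(re.search(r"error=(-?\d+)", LOG).group(1))  # 5 for error=-5

byte_offset = sector * SECTOR_SIZE
errno_name = errno.errorcode[err]

print(f"error={-err} is -{errno_name} ({os.strerror(err)})")
print(f"sector {sector} starts at byte {byte_offset} (~{byte_offset / 1e9:.1f} GB)")
```

So error=-5 is just the generic -EIO handed up from the lower layers, and the failing sector sits roughly 900 GB into sda. That may be consistent with md metadata stored near the end of the component device, but the report does not say which superblock format is in use, so treat that as a guess.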

We're working through our changes to figure out what might have 
triggered it, but it seems likely the root cause lies in the core code.

We're assuming it's a software issue since it's reproducible on multiple 
new-ish systems, although so far we've only tried it on systems with one 
particular configuration--we're planning on trying it with different 
disks just to be sure.

For what it's worth, we've seen the problems with disk write cache 
enabled and disabled.

Anyone have any ideas, or pointers as to what I should look at?

Thanks,
Chris

-- 

Chris Friesen
Software Designer

3500 Carling Avenue
Ottawa, Ontario K2H 8E9
www.genband.com


* Re: getting I/O errors in super_written()...any ideas what would cause this?
  2012-11-28 17:52 getting I/O errors in super_written()...any ideas what would cause this? Chris Friesen
@ 2012-11-28 18:08 ` Mathias Burén
  2012-11-28 18:51   ` Roy Sigurd Karlsbakk
  0 siblings, 1 reply; 22+ messages in thread
From: Mathias Burén @ 2012-11-28 18:08 UTC
  To: Chris Friesen; +Cc: Neil Brown, Linux-RAID, Jens Axboe

Hi,

It would be interesting to see what SMART says about the above, since
the error is reported against sda first, then md follows.

Mathias

On 28 November 2012 17:52, Chris Friesen <chris.friesen@genband.com> wrote:
>
> Hi,
>
> I'm running 2.6.27 with LVM over software RAID 1 over a pair of SAS disks.
>
> Recently we started seeing messages of the following pattern:
>
> Nov 28 08:57:10 kernel: end_request: I/O error, dev sda, sector 1758169523
> Nov 28 08:57:10 kernel: md: super_written gets error=-5, uptodate=0
> Nov 28 08:57:10 kernel: raid1: Disk failure on sda2, disabling device.
> Nov 28 08:57:10 kernel: raid1: Operation continuing on 1 devices.
>
> We're working through our changes to figure out what might have triggered
> it, but it seems likely the root cause lies in the core code.
>
> We're assuming it's a software issue since it's reproducible on multiple
> new-ish systems, although so far we've only tried it on systems with one
> particular configuration--we're planning on trying it with different disks
> just to be sure.
>
> For what it's worth, we've seen the problems with disk write cache enabled
> and disabled.
>
> Anyone have any ideas, or pointers as to what I should look at?
>
> Thanks,
> Chris


* Re: getting I/O errors in super_written()...any ideas what would cause this?
  2012-11-28 18:08 ` Mathias Burén
@ 2012-11-28 18:51   ` Roy Sigurd Karlsbakk
  2012-11-28 20:21     ` Chris Friesen
  0 siblings, 1 reply; 22+ messages in thread
From: Roy Sigurd Karlsbakk @ 2012-11-28 18:51 UTC
  To: Mathias Burén; +Cc: Neil Brown, Linux-RAID, Jens Axboe, Chris Friesen

> > I'm running 2.6.27 with LVM over software RAID 1 over a pair of SAS
> > disks.
> >
> > Recently we started seeing messages of the following pattern:
> >
> > Nov 28 08:57:10 kernel: end_request: I/O error, dev sda, sector
> > 1758169523
> > Nov 28 08:57:10 kernel: md: super_written gets error=-5, uptodate=0
> > Nov 28 08:57:10 kernel: raid1: Disk failure on sda2, disabling
> > device.
> > Nov 28 08:57:10 kernel: raid1: Operation continuing on 1 devices.
> >
> > We're working through our changes to figure out what might have
> > triggered
> > it, but it seems likely the root cause lies in the core code.
> >
> > We're assuming it's a software issue since it's reproducible on
> > multiple
> > new-ish systems, although so far we've only tried it on systems with
> > one
> > particular configuration--we're planning on trying it with different
> > disks
> > just to be sure.
> >
> > For what it's worth, we've seen the problems with disk write cache
> > enabled
> > and disabled.
> >
> > Anyone have any ideas, or pointers as to what I should look at?
>
> Hi,
> 
> It would be interesting to see what SMART says about the above, since
> the error is regarding sda first, then md follows.
>

Agreed - run smartctl -H /dev/sda first, then smartctl -a /dev/sda if -H succeeds.
 
Vennlige hilsener / Best regards

roy
--
Roy Sigurd Karlsbakk
(+47) 98013356
roy@karlsbakk.net
http://blogg.karlsbakk.net/
GPG Public key: http://karlsbakk.net/roysigurdkarlsbakk.pubkey.txt
--
In all pedagogy it is essential that the curriculum is presented intelligibly. It is an elementary imperative for all pedagogues to avoid excessive use of idioms of xenotypic etymology. In most cases adequate and relevant synonyms exist in Norwegian.
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


* Re: getting I/O errors in super_written()...any ideas what would cause this?
  2012-11-28 18:51   ` Roy Sigurd Karlsbakk
@ 2012-11-28 20:21     ` Chris Friesen
  2012-11-28 20:27       ` Mathias Burén
  0 siblings, 1 reply; 22+ messages in thread
From: Chris Friesen @ 2012-11-28 20:21 UTC
  To: Roy Sigurd Karlsbakk
  Cc: Mathias Burén, Neil Brown, Linux-RAID, Jens Axboe

On 11/28/2012 12:51 PM, Roy Sigurd Karlsbakk wrote:
>>> I'm running 2.6.27 with LVM over software RAID 1 over a pair of SAS
>>> disks.
>>>
>>> Recently we started seeing messages of the following pattern:
>>>
>>> Nov 28 08:57:10 kernel: end_request: I/O error, dev sda, sector
>>> 1758169523
>>> Nov 28 08:57:10 kernel: md: super_written gets error=-5, uptodate=0
>>> Nov 28 08:57:10 kernel: raid1: Disk failure on sda2, disabling
>>> device.
>>> Nov 28 08:57:10 kernel: raid1: Operation continuing on 1 devices.

>> It would be interesting to see what SMART says about the above, since
>> the error is regarding sda first, then md follows.
>>
> 
> Agreed - run smartctl -H /dev/sda or smartctl -a /dev/sda if -H succeeds

Okay, I just got some more information that I didn't have earlier.
Apparently we're doing a disk self-test command at the time we see
the error.  I'm trying to get the details of exactly what is
being run, but from the output below it looks like some form of
background short test.

Is it possible that the self test causes an error message that the kernel
doesn't know how to handle?


In any case, here's the smartctl output from right after a failure:

root@typhoon-base-unit0:/root> ./smartctl --all /dev/sda
smartctl version 5.38 [i686-wrs-linux-gnu] Copyright (C) 2002-8 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

Device: WD       WD9001BKHG-02D22 Version: SR03
Serial number:         WX21EB1ANU78
Device type: disk
Transport protocol: SAS
Local Time is: Fri Nov 23 00:35:03 2012 HKT
Device supports SMART and is Enabled
Temperature Warning Enabled
SMART Health Status: OK

Current Drive Temperature:     39 C
Drive Trip Temperature:        69 C
Manufactured in week 01 of year 2010
Recommended maximum start stop count:  1048576 times
Current start stop count:      26 times
Elements in grown defect list: 0

Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:      21187        2         2     21189          2       4950.446           0
write:        89        4         0        93          4       1317.938           0
verify:      103        0         0       103          0          0.000           0

Non-medium error count:   169436

SMART Self-test log
Num  Test              Status                 segment  LifeTime  LBA_first_err [SK ASC ASQ]
     Description                              number   (hours)
# 1  Background short  Self test in progress ...   4     NOW                 - [-   -    -]
# 2  Background short  Completed                   -    1377                 - [-   -    -]
# 3  Background short  Completed                   -    1377                 - [-   -    -]
# 4  Background short  Completed                   -    1377                 - [-   -    -]
# 5  Background short  Completed                   -    1377                 - [-   -    -]
# 6  Background short  Completed                   -    1377                 - [-   -    -]
# 7  Background short  Completed                   -    1377                 - [-   -    -]
# 8  Background short  Completed                   -    1377                 - [-   -    -]
# 9  Background short  Completed                   -    1377                 - [-   -    -]
#10  Background short  Completed                   -    1377                 - [-   -    -]
#11  Background short  Completed                   -    1377                 - [-   -    -]
#12  Background short  Completed                   -    1377                 - [-   -    -]
#13  Background short  Completed                   -    1377                 - [-   -    -]
#14  Background short  Completed                   -    1377                 - [-   -    -]
#15  Background short  Completed                   -    1377                 - [-   -    -]
#16  Background short  Completed                   -    1377                 - [-   -    -]
#17  Background short  Completed                   -    1377                 - [-   -    -]
#18  Background short  Completed                   -    1377                 - [-   -    -]
#19  Background short  Completed                   -    1377                 - [-   -    -]
#20  Background short  Completed                   -    1377                 - [-   -    -]

Long (extended) Self Test duration: 6362 seconds [106.0 minutes]




I also have this from ten minutes later with a newer version of smartctl:

root@typhoon-base-unit0:/root> ./smartctl.eric -x /dev/sda
smartctl 5.40 2010-10-16 r3189 [i686-pc-linux-gnu] (local build)
Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net

Device: WD       WD9001BKHG-02D22 Version: SR03
Serial number:         WX21EB1ANU78
Device type: disk
Transport protocol: SAS
Local Time is: Fri Nov 23 00:45:08 2012 HKT
Device supports SMART and is Enabled
Temperature Warning Enabled
SMART Health Status: OK

Current Drive Temperature:     39 C
Drive Trip Temperature:        69 C
Manufactured in week 01 of year 2010
Specified cycle count over device lifetime:  1048576
Accumulated start-stop cycles:  26
Specified load-unload count over device lifetime:  1114112
Accumulated load-unload cycles:  0
Elements in grown defect list: 0

Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:      21189        2         2     21191          2       4950.446           0
write:        89        4         0        93          4       1317.939           0
verify:      103        0         0       103          0          0.000           0

Non-medium error count:   169436

SMART Self-test log
Num  Test              Status                 segment  LifeTime  LBA_first_err [SK ASC ASQ]
     Description                              number   (hours)
# 1  Background short  Completed                   -    1378                 - [-   -    -]
# 2  Background short  Completed                   -    1378                 - [-   -    -]
# 3  Background short  Completed                   -    1378                 - [-   -    -]
# 4  Background short  Completed                   -    1377                 - [-   -    -]
# 5  Background short  Completed                   -    1377                 - [-   -    -]
# 6  Background short  Completed                   -    1377                 - [-   -    -]
# 7  Background short  Completed                   -    1377                 - [-   -    -]
# 8  Background short  Completed                   -    1377                 - [-   -    -]
# 9  Background short  Completed                   -    1377                 - [-   -    -]
#10  Background short  Completed                   -    1377                 - [-   -    -]
#11  Background short  Completed                   -    1377                 - [-   -    -]
#12  Background short  Completed                   -    1377                 - [-   -    -]
#13  Background short  Completed                   -    1377                 - [-   -    -]
#14  Background short  Completed                   -    1377                 - [-   -    -]
#15  Background short  Completed                   -    1377                 - [-   -    -]
#16  Background short  Completed                   -    1377                 - [-   -    -]
#17  Background short  Completed                   -    1377                 - [-   -    -]
#18  Background short  Completed                   -    1377                 - [-   -    -]
#19  Background short  Completed                   -    1377                 - [-   -    -]
#20  Background short  Completed                   -    1377                 - [-   -    -]

Long (extended) Self Test duration: 6362 seconds [106.0 minutes]

Background scan results log
  Status: no scans active
    Accumulated power on time, hours:minutes 1378:08 [82688 minutes]
    Number of background scans performed: 0,  scan progress: 0.00%
    Number of background medium scans performed: 0
Protocol Specific port log page for SAS SSP
relative target port id = 1
  generation code = 0
  number of phys = 1
  phy identifier = 0
    attached device type: end device
    attached reason: unknown
    reason: unknown
    negotiated logical link rate: phy enabled; 3 Gbps
    attached initiator port: ssp=1 stp=1 smp=1
    attached target port: ssp=0 stp=0 smp=0
    SAS address = 0x50014ee3556977a6
    attached SAS address = 0x5fcfffff00000001
    attached phy identifier = 0
    Invalid DWORD count = 0
    Running disparity error count = 0
    Loss of DWORD synchronization = 3
    Phy reset problem = 0
    Phy event descriptors:
     Transmitted SSP frame error count: 0
     Received SSP frame error count: 0
relative target port id = 2
  generation code = 0
  number of phys = 1
  phy identifier = 1
    attached device type: no device attached
    attached reason: unknown
    reason: unknown
    negotiated logical link rate: phy enabled; unknown
    attached initiator port: ssp=0 stp=0 smp=0
    attached target port: ssp=0 stp=0 smp=0
    SAS address = 0x50014ee3556977a7
    attached SAS address = 0x0
    attached phy identifier = 0
    Invalid DWORD count = 0
    Running disparity error count = 0
    Loss of DWORD synchronization = 0
    Phy reset problem = 0
    Phy event descriptors:
     Transmitted SSP frame error count: 0
     Received SSP frame error count: 0
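
To make the two snapshots easier to compare, here is a small Python sketch that diffs the two error counter logs above (the 5.38 run and the 5.40 run ten minutes later). The numbers are copied by hand from those tables; the field names in FIELDS and the helper changes() are made up for this illustration and are not smartctl identifiers.

```python
# Column order follows the "Error counter log" table printed by smartctl.
FIELDS = ("ecc_fast", "ecc_delayed", "rereads_rewrites", "total_corrected",
          "algorithm_invocations", "gigabytes", "uncorrected")

# Values from the first run (00:35:03).
first = {
    "read":   (21187, 2, 2, 21189, 2, 4950.446, 0),
    "write":  (89, 4, 0, 93, 4, 1317.938, 0),
    "verify": (103, 0, 0, 103, 0, 0.000, 0),
}
# Values from the second run (00:45:08).
second = {
    "read":   (21189, 2, 2, 21191, 2, 4950.446, 0),
    "write":  (89, 4, 0, 93, 4, 1317.939, 0),
    "verify": (103, 0, 0, 103, 0, 0.000, 0),
}

def changes(before, after):
    """Return {(operation, field): delta} for every counter that moved."""
    out = {}
    for op in before:
        for name, a, b in zip(FIELDS, before[op], after[op]):
            delta = round(b - a, 3)  # rounding hides float noise in the GB column
            if delta:
                out[(op, name)] = delta
    return out

delta = changes(first, second)
uncorrected_total = sum(row[-1] for row in second.values())

for (op, name), d in sorted(delta.items()):
    print(f"{op:6s} {name:22s} +{d}")
print("total uncorrected errors:", uncorrected_total)
```

Between the two runs only the fast-ECC corrected read count moves (by 2), plus about a megabyte of writes. The uncorrected totals stay at 0 and the non-medium error count is 169436 in both outputs.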






* Re: getting I/O errors in super_written()...any ideas what would cause this?
  2012-11-28 20:21     ` Chris Friesen
@ 2012-11-28 20:27       ` Mathias Burén
  2012-11-28 20:29         ` Chris Friesen
  0 siblings, 1 reply; 22+ messages in thread
From: Mathias Burén @ 2012-11-28 20:27 UTC
  To: Chris Friesen; +Cc: Roy Sigurd Karlsbakk, Neil Brown, Linux-RAID, Jens Axboe

On 28 November 2012 20:21, Chris Friesen <chris.friesen@genband.com> wrote:
> On 11/28/2012 12:51 PM, Roy Sigurd Karlsbakk wrote:
>>>> I'm running 2.6.27 with LVM over software RAID 1 over a pair of SAS
>>>> disks.
>>>>
>>>> Recently we started seeing messages of the following pattern:
>>>>
>>>> Nov 28 08:57:10 kernel: end_request: I/O error, dev sda, sector
>>>> 1758169523
>>>> Nov 28 08:57:10 kernel: md: super_written gets error=-5, uptodate=0
>>>> Nov 28 08:57:10 kernel: raid1: Disk failure on sda2, disabling
>>>> device.
>>>> Nov 28 08:57:10 kernel: raid1: Operation continuing on 1 devices.
>
>>> It would be interesting to see what SMART says about the above, since
>>> the error is regarding sda first, then md follows.
>>>
>>
>> Agreed - run smartctl -H /dev/sda or smartctl -a /dev/sda if -H succeeds
>
> Okay, I just got some more information that I didn't have earlier.
> Apparently we're doing a disk self-test command at the time we see
> the error.  I'm trying to get the details of exactly what is
> being run, but from the output below it looks like some form of
> background short test.
>
> Is it possible that the self test causes an error message that the kernel
> doesn't know how to handle?
>

The drives look healthy, but am I reading that right? More than 10
self tests per hour?
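
One way to check that reading is to count completed self-tests per power-on hour. The Python sketch below does this for the later of the two self-test logs earlier in the thread (the smartctl 5.40 run), pasted verbatim. The regular expression is an assumption about the column layout seen in this particular output, not a general smartctl parser.

```python
import re
from collections import Counter

# SMART self-test log entries from the smartctl 5.40 output, copied verbatim.
SELF_TEST_LOG = """\
# 1  Background short  Completed                   -    1378                 - [-   -    -]
# 2  Background short  Completed                   -    1378                 - [-   -    -]
# 3  Background short  Completed                   -    1378                 - [-   -    -]
# 4  Background short  Completed                   -    1377                 - [-   -    -]
# 5  Background short  Completed                   -    1377                 - [-   -    -]
# 6  Background short  Completed                   -    1377                 - [-   -    -]
# 7  Background short  Completed                   -    1377                 - [-   -    -]
# 8  Background short  Completed                   -    1377                 - [-   -    -]
# 9  Background short  Completed                   -    1377                 - [-   -    -]
#10  Background short  Completed                   -    1377                 - [-   -    -]
#11  Background short  Completed                   -    1377                 - [-   -    -]
#12  Background short  Completed                   -    1377                 - [-   -    -]
#13  Background short  Completed                   -    1377                 - [-   -    -]
#14  Background short  Completed                   -    1377                 - [-   -    -]
#15  Background short  Completed                   -    1377                 - [-   -    -]
#16  Background short  Completed                   -    1377                 - [-   -    -]
#17  Background short  Completed                   -    1377                 - [-   -    -]
#18  Background short  Completed                   -    1377                 - [-   -    -]
#19  Background short  Completed                   -    1377                 - [-   -    -]
#20  Background short  Completed                   -    1377                 - [-   -    -]
"""

# entry number, test type, status "Completed", segment column, lifetime hours
ENTRY = re.compile(
    r"^#\s*(\d+)\s+Background short\s+Completed\s+\S+\s+(\d+)", re.MULTILINE
)

per_hour = Counter(int(m.group(2)) for m in ENTRY.finditer(SELF_TEST_LOG))

for hour, count in sorted(per_hour.items()):
    print(f"power-on hour {hour}: {count} completed short self-tests")
```

The log shown holds 20 entries, so the count for hour 1377 is only a lower bound if older entries had already rotated out.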

Mathias


* Re: getting I/O errors in super_written()...any ideas what would cause this?
  2012-11-28 20:27       ` Mathias Burén
@ 2012-11-28 20:29         ` Chris Friesen
  2012-12-03 20:22           ` Ric Wheeler
  0 siblings, 1 reply; 22+ messages in thread
From: Chris Friesen @ 2012-11-28 20:29 UTC
  To: Mathias Burén
  Cc: Roy Sigurd Karlsbakk, Neil Brown, Linux-RAID, Jens Axboe

On 11/28/2012 02:27 PM, Mathias Burén wrote:

> The drives look healthy, but am I reading that right? More than 10
> self tests per hour?

Yeah....we cranked it up to try and increase how frequently we see the 
problem.

From what I understand, it normally runs once a day.

Chris

* Re: getting I/O errors in super_written()...any ideas what would cause this?
  2012-11-28 20:29         ` Chris Friesen
@ 2012-12-03 20:22           ` Ric Wheeler
  2012-12-03 20:44             ` Chris Friesen
  0 siblings, 1 reply; 22+ messages in thread
From: Ric Wheeler @ 2012-12-03 20:22 UTC
  To: Chris Friesen
  Cc: Mathias Burén, Roy Sigurd Karlsbakk, Neil Brown, Linux-RAID,
	Jens Axboe, IDE/ATA development list

On 11/28/2012 03:29 PM, Chris Friesen wrote:
> On 11/28/2012 02:27 PM, Mathias Burén wrote:
>
>> The drives look healthy, but am I reading that right? More than 10
>> self tests per hour?
>
> Yeah....we cranked it up to try and increase how frequently we see the problem.
>
> From what I understand normally it runs once a day.
>
> Chris

Did the vendor suggest to you that running a self test on an active drive would 
be OK? I would expect errors in this case - specifically time outs....

Regards,

Ric


* Re: getting I/O errors in super_written()...any ideas what would cause this?
  2012-12-03 20:22           ` Ric Wheeler
@ 2012-12-03 20:44             ` Chris Friesen
  2012-12-03 20:52               ` Ric Wheeler
  0 siblings, 1 reply; 22+ messages in thread
From: Chris Friesen @ 2012-12-03 20:44 UTC
  To: Ric Wheeler
  Cc: Mathias Burén, Roy Sigurd Karlsbakk, Neil Brown, Linux-RAID,
	Jens Axboe, IDE/ATA development list

On 12/03/2012 02:22 PM, Ric Wheeler wrote:
> On 11/28/2012 03:29 PM, Chris Friesen wrote:
>> On 11/28/2012 02:27 PM, Mathias Burén wrote:
>>
>>> The drives look healthy, but am I reading that right? More than 10
>>> self tests per hour?
>>
>> Yeah....we cranked it up to try and increase how frequently we see the
>> problem.
>>
>> From what I understand normally it runs once a day.
>>
>> Chris
>
> Did the vendor suggest to you that running a self test on an active
> drive would be OK? I would expect errors in this case - specifically
> time outs....

I'm not the main developer in that area, but from what I understand the 
code has been like this for ages.  (It's entirely possible we've been 
lucky up till now since we support limited hardware types.)

The fact that you'd expect time outs is interesting--is that from the 
delay switching from doing the self-test to doing the actual request?

Is the expectation that the OS should not be sending any other commands 
to the disk while doing the self-test?

I was recently looking at the SCSI spec trying to learn a bit about this 
issue and the section on background self-test (spc-4, section 5.15.4.3) 
seems to indicate that a READ or WRITE command should cause the 
background self-test to be aborted and the command to be processed 
within 2 seconds.  In our case it doesn't seem to be aborting (at least 
it shows as "Completed" in smartctl)--is this expected?
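
For the time-out question, one data point worth having is the per-command timeout the kernel applies to the disk, to set against the 2-second figure from the spec. The Python sketch below reads it from the sysfs "timeout" attribute that SCSI disks normally expose under /sys/block/<dev>/device/. The function name and the SPEC_LIMIT_S constant are made up for this example, and whether this attribute is the relevant knob for the failure in this thread is an assumption.

```python
from pathlib import Path

SPEC_LIMIT_S = 2  # the "within 2 seconds" figure cited from SPC-4 above

def scsi_command_timeout(dev="sda", sysfs_root="/sys/block"):
    """Return the SCSI command timeout in seconds, or None if unavailable."""
    path = Path(sysfs_root) / dev / "device" / "timeout"
    try:
        return int(path.read_text().strip())
    except (OSError, ValueError):
        # Attribute missing (not a SCSI disk, no such device) or unreadable.
        return None

timeout = scsi_command_timeout("sda")
if timeout is None:
    print("no SCSI timeout attribute found for sda")
else:
    print(f"kernel command timeout: {timeout}s, spec abort window: {SPEC_LIMIT_S}s")
```

If the drive really services a READ or WRITE within about 2 seconds of aborting the self-test, a command timeout well above that should not fire, so an I/O error here would point at something other than a plain time-out.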

Thanks,
Chris

* Re: getting I/O errors in super_written()...any ideas what would cause this?
  2012-12-03 20:44             ` Chris Friesen
@ 2012-12-03 20:52               ` Ric Wheeler
  2012-12-03 21:08                 ` Chris Friesen
  0 siblings, 1 reply; 22+ messages in thread
From: Ric Wheeler @ 2012-12-03 20:52 UTC
  To: Chris Friesen
  Cc: Mathias Burén, Roy Sigurd Karlsbakk, Neil Brown, Linux-RAID,
	Jens Axboe, IDE/ATA development list

On 12/03/2012 03:44 PM, Chris Friesen wrote:
> On 12/03/2012 02:22 PM, Ric Wheeler wrote:
>> On 11/28/2012 03:29 PM, Chris Friesen wrote:
>>> On 11/28/2012 02:27 PM, Mathias Burén wrote:
>>>
>>>> The drives look healthy, but am I reading that right? More than 10
>>>> self tests per hour?
>>>
>>> Yeah....we cranked it up to try and increase how frequently we see the
>>> problem.
>>>
>>> From what I understand normally it runs once a day.
>>>
>>> Chris
>>
>> Did the vendor suggest to you that running a self test on an active
>> drive would be OK? I would expect errors in this case - specifically
>> time outs....
>
> I'm not the main developer in that area, but from what I understand the code 
> has been like this for ages.  (It's entirely possible we've been lucky up till 
> now since we support limited hardware types.)
>
> The fact that you'd expect time outs is interesting--is that from the delay 
> switching from doing the self-test to doing the actual request?
>
> Is the expectation that the OS should not be sending any other commands to the 
> disk while doing the self-test?
>
> I was recently looking at the SCSI spec trying to learn a bit about this issue 
> and the section on background self-test (spc-4, section 5.15.4.3) seems to 
> indicate that a READ or WRITE command should cause the background self-test to 
> be aborted and the command to be processed within 2 seconds.  In our case it 
> doesn't seem to be aborting (at least it shows as "Completed" in smartctl)--is 
> this expected?
>
> Thanks,
> Chris

I jumped into this thread late - can you repost detail on the specific drive and 
HBA used here? In any case, it sounds like this is a better topic for the 
linux-scsi or linux-ide list where most of the low level storage people lurk :)

Ric


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: getting I/O errors in super_written()...any ideas what would cause this?
  2012-12-03 20:52               ` Ric Wheeler
@ 2012-12-03 21:08                 ` Chris Friesen
  2012-12-03 21:21                   ` Dave Jiang
  2012-12-03 21:53                   ` Ric Wheeler
  0 siblings, 2 replies; 22+ messages in thread
From: Chris Friesen @ 2012-12-03 21:08 UTC (permalink / raw)
  To: Ric Wheeler
  Cc: Mathias Burén, Roy Sigurd Karlsbakk, Neil Brown, Linux-RAID,
	Jens Axboe, IDE/ATA development list, linux-scsi

On 12/03/2012 02:52 PM, Ric Wheeler wrote:

> I jumped into this thread late - can you repost detail on the specific 
> drive and HBA used here? In any case, it sounds like this is a better 
> topic for the linux-scsi or linux-ide list where most of the low level 
> storage people lurk :)

Okay, expanding the receiver list. :)

To recap:

I'm running 2.6.27 with LVM over software RAID 1 over a pair of SAS disks.
Disks are WD9001BKHG, controller is Intel C600.

Recently we started seeing messages of the following pattern, and we
don't know what's causing them:

Nov 28 08:57:10 kernel: end_request: I/O error, dev sda, sector 1758169523
Nov 28 08:57:10 kernel: md: super_written gets error=-5, uptodate=0
Nov 28 08:57:10 kernel: raid1: Disk failure on sda2, disabling device.
Nov 28 08:57:10 kernel: raid1: Operation continuing on 1 devices.

We've been assuming it's a software issue since it's reproducible on
multiple systems, although so far we've only seen the problem with
these particular disks.

We've seen the problems with disk write cache enabled and disabled.

It looks like it may be related to being in the middle of a background
short self-test at the time we see the error.  The disks are still
in-service at this point--is this supported behaviour or would it
be expected to cause errors?  (The self-test works fine with other
disks, and worked fine with these disks until recently, but we haven't
made any changes to the block I/O code.)
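
A minimal sketch of how the kernel log could be scanned for this failure
signature, so that the timestamps can be lined up against self-test start
times. The function name `find_superblock_failures` and the pairing rule
(an `end_request` error followed directly by a `super_written` error) are
assumptions for illustration, not something taken from the md code. The
regular expressions are written against the 2.6.27 message wording quoted
above and may need adjusting for other kernels.

```python
import re

# Patterns for the syslog lines quoted above (2.6.27-era wording; assumed format).
IO_ERROR = re.compile(
    r"^(\w{3}\s+\d+\s+[\d:]+) kernel: end_request: I/O error, dev (\w+), sector (\d+)"
)
SUPER_WRITTEN = re.compile(r"md: super_written gets error=(-?\d+), uptodate=(\d+)")


def find_superblock_failures(lines):
    """Return (timestamp, device, sector, error) for every end_request I/O
    error that is immediately followed by an md super_written error."""
    failures = []
    pending = None  # most recent end_request error not yet paired
    for line in lines:
        m = IO_ERROR.search(line)
        if m:
            pending = (m.group(1), m.group(2), int(m.group(3)))
            continue
        m = SUPER_WRITTEN.search(line)
        if m and pending is not None:
            failures.append(pending + (int(m.group(1)),))
        pending = None  # any other line breaks the pairing
    return failures


SAMPLE = """\
Nov 28 08:57:10 kernel: end_request: I/O error, dev sda, sector 1758169523
Nov 28 08:57:10 kernel: md: super_written gets error=-5, uptodate=0
Nov 28 08:57:10 kernel: raid1: Disk failure on sda2, disabling device.
Nov 28 08:57:10 kernel: raid1: Operation continuing on 1 devices.
""".splitlines()

print(find_superblock_failures(SAMPLE))
```

On the four sample lines this reports a single event: sda, sector
1758169523, error -5 (EIO).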

Here's the smartctl output from right after a failure.  The self-tests
are frequent as a stress test; normally they're done once per day:

root@typhoon-base-unit0:/root> ./smartctl --all /dev/sda
smartctl version 5.38 [i686-wrs-linux-gnu] Copyright (C) 2002-8 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

Device: WD       WD9001BKHG-02D22 Version: SR03
Serial number:         WX21EB1ANU78
Device type: disk
Transport protocol: SAS
Local Time is: Fri Nov 23 00:35:03 2012 HKT
Device supports SMART and is Enabled
Temperature Warning Enabled
SMART Health Status: OK

Current Drive Temperature:     39 C
Drive Trip Temperature:        69 C
Manufactured in week 01 of year 2010
Recommended maximum start stop count:  1048576 times
Current start stop count:      26 times
Elements in grown defect list: 0

Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:      21187        2         2     21189          2       4950.446           0
write:        89        4         0        93          4       1317.938           0
verify:      103        0         0       103          0          0.000           0

Non-medium error count:   169436

SMART Self-test log
Num  Test              Status                 segment  LifeTime  LBA_first_err [SK ASC ASQ]
     Description                              number   (hours)
# 1  Background short  Self test in progress ...   4     NOW                 - [-   -    -]
# 2  Background short  Completed                   -    1377                 - [-   -    -]
# 3  Background short  Completed                   -    1377                 - [-   -    -]
# 4  Background short  Completed                   -    1377                 - [-   -    -]
# 5  Background short  Completed                   -    1377                 - [-   -    -]
# 6  Background short  Completed                   -    1377                 - [-   -    -]
# 7  Background short  Completed                   -    1377                 - [-   -    -]
# 8  Background short  Completed                   -    1377                 - [-   -    -]
# 9  Background short  Completed                   -    1377                 - [-   -    -]
#10  Background short  Completed                   -    1377                 - [-   -    -]
#11  Background short  Completed                   -    1377                 - [-   -    -]
#12  Background short  Completed                   -    1377                 - [-   -    -]
#13  Background short  Completed                   -    1377                 - [-   -    -]
#14  Background short  Completed                   -    1377                 - [-   -    -]
#15  Background short  Completed                   -    1377                 - [-   -    -]
#16  Background short  Completed                   -    1377                 - [-   -    -]
#17  Background short  Completed                   -    1377                 - [-   -    -]
#18  Background short  Completed                   -    1377                 - [-   -    -]
#19  Background short  Completed                   -    1377                 - [-   -    -]
#20  Background short  Completed                   -    1377                 - [-   -    -]

Long (extended) Self Test duration: 6362 seconds [106.0 minutes]




I also have this from ten minutes later with a newer version of smartctl:

root@typhoon-base-unit0:/root> ./smartctl.eric -x /dev/sda
smartctl 5.40 2010-10-16 r3189 [i686-pc-linux-gnu] (local build)
Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net

Device: WD       WD9001BKHG-02D22 Version: SR03
Serial number:         WX21EB1ANU78
Device type: disk
Transport protocol: SAS
Local Time is: Fri Nov 23 00:45:08 2012 HKT
Device supports SMART and is Enabled
Temperature Warning Enabled
SMART Health Status: OK

Current Drive Temperature:     39 C
Drive Trip Temperature:        69 C
Manufactured in week 01 of year 2010
Specified cycle count over device lifetime:  1048576
Accumulated start-stop cycles:  26
Specified load-unload count over device lifetime:  1114112
Accumulated load-unload cycles:  0
Elements in grown defect list: 0

Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:      21189        2         2     21191          2       4950.446           0
write:        89        4         0        93          4       1317.939           0
verify:      103        0         0       103          0          0.000           0

Non-medium error count:   169436

SMART Self-test log
Num  Test              Status                 segment  LifeTime  LBA_first_err [SK ASC ASQ]
     Description                              number   (hours)
# 1  Background short  Completed                   -    1378                 - [-   -    -]
# 2  Background short  Completed                   -    1378                 - [-   -    -]
# 3  Background short  Completed                   -    1378                 - [-   -    -]
# 4  Background short  Completed                   -    1377                 - [-   -    -]
# 5  Background short  Completed                   -    1377                 - [-   -    -]
# 6  Background short  Completed                   -    1377                 - [-   -    -]
# 7  Background short  Completed                   -    1377                 - [-   -    -]
# 8  Background short  Completed                   -    1377                 - [-   -    -]
# 9  Background short  Completed                   -    1377                 - [-   -    -]
#10  Background short  Completed                   -    1377                 - [-   -    -]
#11  Background short  Completed                   -    1377                 - [-   -    -]
#12  Background short  Completed                   -    1377                 - [-   -    -]
#13  Background short  Completed                   -    1377                 - [-   -    -]
#14  Background short  Completed                   -    1377                 - [-   -    -]
#15  Background short  Completed                   -    1377                 - [-   -    -]
#16  Background short  Completed                   -    1377                 - [-   -    -]
#17  Background short  Completed                   -    1377                 - [-   -    -]
#18  Background short  Completed                   -    1377                 - [-   -    -]
#19  Background short  Completed                   -    1377                 - [-   -    -]
#20  Background short  Completed                   -    1377                 - [-   -    -]

Long (extended) Self Test duration: 6362 seconds [106.0 minutes]

Background scan results log
  Status: no scans active
    Accumulated power on time, hours:minutes 1378:08 [82688 minutes]
    Number of background scans performed: 0,  scan progress: 0.00%
    Number of background medium scans performed: 0
Protocol Specific port log page for SAS SSP
relative target port id = 1
  generation code = 0
  number of phys = 1
  phy identifier = 0
    attached device type: end device
    attached reason: unknown
    reason: unknown
    negotiated logical link rate: phy enabled; 3 Gbps
    attached initiator port: ssp=1 stp=1 smp=1
    attached target port: ssp=0 stp=0 smp=0
    SAS address = 0x50014ee3556977a6
    attached SAS address = 0x5fcfffff00000001
    attached phy identifier = 0
    Invalid DWORD count = 0
    Running disparity error count = 0
    Loss of DWORD synchronization = 3
    Phy reset problem = 0
    Phy event descriptors:
     Transmitted SSP frame error count: 0
     Received SSP frame error count: 0
relative target port id = 2
  generation code = 0
  number of phys = 1
  phy identifier = 1
    attached device type: no device attached
    attached reason: unknown
    reason: unknown
    negotiated logical link rate: phy enabled; unknown
    attached initiator port: ssp=0 stp=0 smp=0
    attached target port: ssp=0 stp=0 smp=0
    SAS address = 0x50014ee3556977a7
    attached SAS address = 0x0
    attached phy identifier = 0
    Invalid DWORD count = 0
    Running disparity error count = 0
    Loss of DWORD synchronization = 0
    Phy reset problem = 0
    Phy event descriptors:
     Transmitted SSP frame error count: 0
     Received SSP frame error count: 0
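
A minimal sketch for counting how many self-tests completed in each power-on
hour, based on the "SMART Self-test log" rows in the smartctl output above.
The regular expression and the function name `completed_tests_per_hour` are
assumptions for illustration. They are written against the row layout shown
here for smartctl 5.38/5.40 on a SAS disk and are not a general smartctl
parser.

```python
import re
from collections import Counter

# One self-test log row: "#NN  <test>  <status>  <segment>  <lifetime>  ..."
ENTRY = re.compile(
    r"^#\s*(\d+)\s+(Background \w+)\s+(.+?)\s+(-|\d+)\s+(NOW|\d+)\s"
)


def completed_tests_per_hour(log_lines):
    """Count completed self-tests per power-on hour (the LifeTime column)."""
    counts = Counter()
    for line in log_lines:
        m = ENTRY.match(line.strip())
        if m is None:
            continue  # header or unrelated line
        status, lifetime = m.group(3), m.group(5)
        if status == "Completed" and lifetime != "NOW":
            counts[int(lifetime)] += 1
    return counts


# A shortened, mixed sample of rows in the format shown above.
SAMPLE = [
    "# 1  Background short  Self test in progress ...   4     NOW                 - [-   -    -]",
    "# 2  Background short  Completed                   -    1378                 - [-   -    -]",
    "# 3  Background short  Completed                   -    1378                 - [-   -    -]",
    "# 4  Background short  Completed                   -    1377                 - [-   -    -]",
    "#10  Background short  Completed                   -    1377                 - [-   -    -]",
]

for hour, count in sorted(completed_tests_per_hour(SAMPLE).items()):
    print(hour, count)
```

Applied to the second log above, the 20 visible rows give 3 completed tests
in hour 1378 and 17 in hour 1377. Since only 20 rows are shown, the count for
hour 1377 is a lower bound.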







^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: getting I/O errors in super_written()...any ideas what would cause this?
  2012-12-03 21:08                 ` Chris Friesen
@ 2012-12-03 21:21                   ` Dave Jiang
  2012-12-03 21:36                     ` Chris Friesen
  2012-12-03 21:53                   ` Ric Wheeler
  1 sibling, 1 reply; 22+ messages in thread
From: Dave Jiang @ 2012-12-03 21:21 UTC (permalink / raw)
  To: Chris Friesen
  Cc: Ric Wheeler, Mathias Burén, Roy Sigurd Karlsbakk, Neil Brown,
	Linux-RAID, Jens Axboe, IDE/ATA development list, linux-scsi

On 12/03/2012 02:08 PM, Chris Friesen wrote:
> On 12/03/2012 02:52 PM, Ric Wheeler wrote:
>
>> I jumped into this thread late - can you repost detail on the specific 
>> drive and HBA used here? In any case, it sounds like this is a better 
>> topic for the linux-scsi or linux-ide list where most of the low level 
>> storage people lurk :)
> Okay, expanding the receiver list. :)
>
> To recap:
>
> I'm running 2.6.27 with LVM over software RAID 1 over a pair of SAS disks.
> Disks are WD9001BKHG, controller is Intel C600.

Just curious: what driver are you using with the C600? The upstream
driver for the C600 wasn't accepted until 3.0-rc6, and the last of the
outstanding patches weren't accepted until 3.7-rc. So I'd say 3.6 would
be your best bet until 3.7 is released. Did you attempt a backport of
the isci driver, or are you using something like an LSI port on 2.6.27?
Have you verified the issue on a more recent kernel?

> Recently we started seeing messages of the following pattern, and we
> don't know what's causing them:
>
> Nov 28 08:57:10 kernel: end_request: I/O error, dev sda, sector 1758169523
> Nov 28 08:57:10 kernel: md: super_written gets error=-5, uptodate=0
> Nov 28 08:57:10 kernel: raid1: Disk failure on sda2, disabling device.
> Nov 28 08:57:10 kernel: raid1: Operation continuing on 1 devices.
>
> We've been assuming it's a software issue since it's reproducible on
> multiple systems, although so far we've only seen the problem with
> these particular disks.
>
> We've seen the problems with disk write cache enabled and disabled.
>
> It looks like it may be related to being in the middle of a background
> short self-test at the time we see the error.  The disks are still
> in-service at this point--is this supported behaviour or would it
> be expected to cause errors?  (The self-test works fine with other
> disks, and worked fine with these disks until recently, but we haven't
> made any changes to the block I/O code.)
>
> Here's the smartctl output from right after a failure.  The self-tests
> are frequent as a stress-test, normally they're done once per day:
>
> root@typhoon-base-unit0:/root> ./smartctl --all /dev/sda
> smartctl version 5.38 [i686-wrs-linux-gnu] Copyright (C) 2002-8 Bruce Allen
> Home page is http://smartmontools.sourceforge.net/
>
> Device: WD       WD9001BKHG-02D22 Version: SR03
> Serial number:         WX21EB1ANU78
> Device type: disk
> Transport protocol: SAS
> Local Time is: Fri Nov 23 00:35:03 2012 HKT
> Device supports SMART and is Enabled
> Temperature Warning Enabled
> SMART Health Status: OK
>
> Current Drive Temperature:     39 C
> Drive Trip Temperature:        69 C
> Manufactured in week 01 of year 2010
> Recommended maximum start stop count:  1048576 times
> Current start stop count:      26 times
> Elements in grown defect list: 0
>
> Error counter log:
>            Errors Corrected by           Total   Correction     Gigabytes    Total
>                ECC          rereads/    errors   algorithm      processed    uncorrected
>            fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
> read:      21187        2         2     21189          2       4950.446           0
> write:        89        4         0        93          4       1317.938           0
> verify:      103        0         0       103          0          0.000           0
>
> Non-medium error count:   169436
>
> SMART Self-test log
> Num  Test              Status                 segment  LifeTime  LBA_first_err [SK ASC ASQ]
>      Description                              number   (hours)
> # 1  Background short  Self test in progress ...   4     NOW                 - [-   -    -]
> # 2  Background short  Completed                   -    1377                 - [-   -    -]
> # 3  Background short  Completed                   -    1377                 - [-   -    -]
> # 4  Background short  Completed                   -    1377                 - [-   -    -]
> # 5  Background short  Completed                   -    1377                 - [-   -    -]
> # 6  Background short  Completed                   -    1377                 - [-   -    -]
> # 7  Background short  Completed                   -    1377                 - [-   -    -]
> # 8  Background short  Completed                   -    1377                 - [-   -    -]
> # 9  Background short  Completed                   -    1377                 - [-   -    -]
> #10  Background short  Completed                   -    1377                 - [-   -    -]
> #11  Background short  Completed                   -    1377                 - [-   -    -]
> #12  Background short  Completed                   -    1377                 - [-   -    -]
> #13  Background short  Completed                   -    1377                 - [-   -    -]
> #14  Background short  Completed                   -    1377                 - [-   -    -]
> #15  Background short  Completed                   -    1377                 - [-   -    -]
> #16  Background short  Completed                   -    1377                 - [-   -    -]
> #17  Background short  Completed                   -    1377                 - [-   -    -]
> #18  Background short  Completed                   -    1377                 - [-   -    -]
> #19  Background short  Completed                   -    1377                 - [-   -    -]
> #20  Background short  Completed                   -    1377                 - [-   -    -]
>
> Long (extended) Self Test duration: 6362 seconds [106.0 minutes]
>
>
>
>
> I also have this from ten minutes later with a newer version of smartctl:
>
> root@typhoon-base-unit0:/root> ./smartctl.eric -x /dev/sda
> smartctl 5.40 2010-10-16 r3189 [i686-pc-linux-gnu] (local build)
> Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net
>
> Device: WD       WD9001BKHG-02D22 Version: SR03
> Serial number:         WX21EB1ANU78
> Device type: disk
> Transport protocol: SAS
> Local Time is: Fri Nov 23 00:45:08 2012 HKT
> Device supports SMART and is Enabled
> Temperature Warning Enabled
> SMART Health Status: OK
>
> Current Drive Temperature:     39 C
> Drive Trip Temperature:        69 C
> Manufactured in week 01 of year 2010
> Specified cycle count over device lifetime:  1048576
> Accumulated start-stop cycles:  26
> Specified load-unload count over device lifetime:  1114112
> Accumulated load-unload cycles:  0
> Elements in grown defect list: 0
>
> Error counter log:
>            Errors Corrected by           Total   Correction     Gigabytes    Total
>                ECC          rereads/    errors   algorithm      processed    uncorrected
>            fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
> read:      21189        2         2     21191          2       4950.446           0
> write:        89        4         0        93          4       1317.939           0
> verify:      103        0         0       103          0          0.000           0
>
> Non-medium error count:   169436
>
> SMART Self-test log
> Num  Test              Status                 segment  LifeTime  LBA_first_err [SK ASC ASQ]
>      Description                              number   (hours)
> # 1  Background short  Completed                   -    1378                 - [-   -    -]
> # 2  Background short  Completed                   -    1378                 - [-   -    -]
> # 3  Background short  Completed                   -    1378                 - [-   -    -]
> # 4  Background short  Completed                   -    1377                 - [-   -    -]
> # 5  Background short  Completed                   -    1377                 - [-   -    -]
> # 6  Background short  Completed                   -    1377                 - [-   -    -]
> # 7  Background short  Completed                   -    1377                 - [-   -    -]
> # 8  Background short  Completed                   -    1377                 - [-   -    -]
> # 9  Background short  Completed                   -    1377                 - [-   -    -]
> #10  Background short  Completed                   -    1377                 - [-   -    -]
> #11  Background short  Completed                   -    1377                 - [-   -    -]
> #12  Background short  Completed                   -    1377                 - [-   -    -]
> #13  Background short  Completed                   -    1377                 - [-   -    -]
> #14  Background short  Completed                   -    1377                 - [-   -    -]
> #15  Background short  Completed                   -    1377                 - [-   -    -]
> #16  Background short  Completed                   -    1377                 - [-   -    -]
> #17  Background short  Completed                   -    1377                 - [-   -    -]
> #18  Background short  Completed                   -    1377                 - [-   -    -]
> #19  Background short  Completed                   -    1377                 - [-   -    -]
> #20  Background short  Completed                   -    1377                 - [-   -    -]
>
> Long (extended) Self Test duration: 6362 seconds [106.0 minutes]
>
> Background scan results log
>   Status: no scans active
>     Accumulated power on time, hours:minutes 1378:08 [82688 minutes]
>     Number of background scans performed: 0,  scan progress: 0.00%
>     Number of background medium scans performed: 0
> Protocol Specific port log page for SAS SSP
> relative target port id = 1
>   generation code = 0
>   number of phys = 1
>   phy identifier = 0
>     attached device type: end device
>     attached reason: unknown
>     reason: unknown
>     negotiated logical link rate: phy enabled; 3 Gbps
>     attached initiator port: ssp=1 stp=1 smp=1
>     attached target port: ssp=0 stp=0 smp=0
>     SAS address = 0x50014ee3556977a6
>     attached SAS address = 0x5fcfffff00000001
>     attached phy identifier = 0
>     Invalid DWORD count = 0
>     Running disparity error count = 0
>     Loss of DWORD synchronization = 3
>     Phy reset problem = 0
>     Phy event descriptors:
>      Transmitted SSP frame error count: 0
>      Received SSP frame error count: 0
> relative target port id = 2
>   generation code = 0
>   number of phys = 1
>   phy identifier = 1
>     attached device type: no device attached
>     attached reason: unknown
>     reason: unknown
>     negotiated logical link rate: phy enabled; unknown
>     attached initiator port: ssp=0 stp=0 smp=0
>     attached target port: ssp=0 stp=0 smp=0
>     SAS address = 0x50014ee3556977a7
>     attached SAS address = 0x0
>     attached phy identifier = 0
>     Invalid DWORD count = 0
>     Running disparity error count = 0
>     Loss of DWORD synchronization = 0
>     Phy reset problem = 0
>     Phy event descriptors:
>      Transmitted SSP frame error count: 0
>      Received SSP frame error count: 0
>
>
>
>
>
>



^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: getting I/O errors in super_written()...any ideas what would cause this?
  2012-12-03 21:21                   ` Dave Jiang
@ 2012-12-03 21:36                     ` Chris Friesen
  2012-12-03 21:59                       ` Dave Jiang
  0 siblings, 1 reply; 22+ messages in thread
From: Chris Friesen @ 2012-12-03 21:36 UTC (permalink / raw)
  To: Dave Jiang
  Cc: Ric Wheeler, Mathias Burén, Roy Sigurd Karlsbakk, Neil Brown,
	Linux-RAID, Jens Axboe, IDE/ATA development list, linux-scsi

On 12/03/2012 03:21 PM, Dave Jiang wrote:
> On 12/03/2012 02:08 PM, Chris Friesen wrote:
>> On 12/03/2012 02:52 PM, Ric Wheeler wrote:
>>
>>> I jumped into this thread late - can you repost detail on the specific
>>> drive and HBA used here? In any case, it sounds like this is a better
>>> topic for the linux-scsi or linux-ide list where most of the low level
>>> storage people lurk :)
>> Okay, expanding the receiver list. :)
>>
>> To recap:
>>
>> I'm running 2.6.27 with LVM over software RAID 1 over a pair of SAS disks.
>> Disks are WD9001BKHG, controller is Intel C600.
>
> Just curious what driver are you using with the C600. The upstream
> driver for C600 didn't get accepted until 3.0-rc6 and all of the
> outstanding patches weren't accepted until 3.7-rc. So I'd say 3.6 would
> be your best bet until 3.7 is released. Did you attempt a backport of
> the isci driver or using something like an LSI port on 2.6.27? Have you
> verified the issue on a more recent kernel?

We're using a driver provided by the hardware vendor.  It appears to be 
a backport of version 1.0.1 of the isci driver.  We've been using it 
since mid-March or so.

This is an embedded system, so, as is all too common in that environment, 
upgrading the whole kernel isn't an option since it requires support 
from multiple hardware/software vendors.

Upgrading just the driver might be possible--do you think it's a likely 
cause of these errors?  The current driver has a binary firmware file 
that it uses--would we keep that with the new driver?

Chris

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: getting I/O errors in super_written()...any ideas what would cause this?
  2012-12-03 21:08                 ` Chris Friesen
  2012-12-03 21:21                   ` Dave Jiang
@ 2012-12-03 21:53                   ` Ric Wheeler
  2012-12-04 22:00                     ` Chris Friesen
  1 sibling, 1 reply; 22+ messages in thread
From: Ric Wheeler @ 2012-12-03 21:53 UTC (permalink / raw)
  To: Chris Friesen
  Cc: Mathias Burén, Roy Sigurd Karlsbakk, Neil Brown, Linux-RAID,
	Jens Axboe, IDE/ATA development list, linux-scsi

On 12/03/2012 04:08 PM, Chris Friesen wrote:
> On 12/03/2012 02:52 PM, Ric Wheeler wrote:
>
>> I jumped into this thread late - can you repost detail on the specific
>> drive and HBA used here? In any case, it sounds like this is a better
>> topic for the linux-scsi or linux-ide list where most of the low level
>> storage people lurk :)
> Okay, expanding the receiver list. :)
>
> To recap:
>
> I'm running 2.6.27 with LVM over software RAID 1 over a pair of SAS disks.
> Disks are WD9001BKHG, controller is Intel C600.
>
> Recently we started seeing messages of the following pattern, and we
> don't know what's causing them:
>
> Nov 28 08:57:10 kernel: end_request: I/O error, dev sda, sector 1758169523
> Nov 28 08:57:10 kernel: md: super_written gets error=-5, uptodate=0
> Nov 28 08:57:10 kernel: raid1: Disk failure on sda2, disabling device.
> Nov 28 08:57:10 kernel: raid1: Operation continuing on 1 devices.
>
> We've been assuming it's a software issue since it's reproducible on
> multiple systems, although so far we've only seen the problem with
> these particular disks.
>
> We've seen the problems with disk write cache enabled and disabled.

Hi Chris,

Are there any earlier I/O errors or sda-related errors in the log?

Ric

>
> It looks like it may be related to being in the middle of a background
> short self-test at the time we see the error.  The disks are still
> in-service at this point--is this supported behaviour or would it
> be expected to cause errors?  (The self-test works fine with other
> disks, and worked fine with these disks until recently, but we haven't
> made any changes to the block I/O code.)
>
> Here's the smartctl output from right after a failure.  The self-tests
> are frequent as a stress-test, normally they're done once per day:
>
> root@typhoon-base-unit0:/root> ./smartctl --all /dev/sda
> smartctl version 5.38 [i686-wrs-linux-gnu] Copyright (C) 2002-8 Bruce Allen
> Home page is http://smartmontools.sourceforge.net/
>
> Device: WD       WD9001BKHG-02D22 Version: SR03
> Serial number:         WX21EB1ANU78
> Device type: disk
> Transport protocol: SAS
> Local Time is: Fri Nov 23 00:35:03 2012 HKT
> Device supports SMART and is Enabled
> Temperature Warning Enabled
> SMART Health Status: OK
>
> Current Drive Temperature:     39 C
> Drive Trip Temperature:        69 C
> Manufactured in week 01 of year 2010
> Recommended maximum start stop count:  1048576 times
> Current start stop count:      26 times
> Elements in grown defect list: 0
>
> Error counter log:
>             Errors Corrected by           Total   Correction     Gigabytes    Total
>                 ECC          rereads/    errors   algorithm      processed    uncorrected
>             fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
> read:      21187        2         2     21189          2       4950.446           0
> write:        89        4         0        93          4       1317.938           0
> verify:      103        0         0       103          0          0.000           0
>
> Non-medium error count:   169436
>
> SMART Self-test log
> Num  Test              Status                 segment  LifeTime  LBA_first_err [SK ASC ASQ]
>       Description                              number   (hours)
> # 1  Background short  Self test in progress ...   4     NOW                 - [-   -    -]
> # 2  Background short  Completed                   -    1377                 - [-   -    -]
> # 3  Background short  Completed                   -    1377                 - [-   -    -]
> # 4  Background short  Completed                   -    1377                 - [-   -    -]
> # 5  Background short  Completed                   -    1377                 - [-   -    -]
> # 6  Background short  Completed                   -    1377                 - [-   -    -]
> # 7  Background short  Completed                   -    1377                 - [-   -    -]
> # 8  Background short  Completed                   -    1377                 - [-   -    -]
> # 9  Background short  Completed                   -    1377                 - [-   -    -]
> #10  Background short  Completed                   -    1377                 - [-   -    -]
> #11  Background short  Completed                   -    1377                 - [-   -    -]
> #12  Background short  Completed                   -    1377                 - [-   -    -]
> #13  Background short  Completed                   -    1377                 - [-   -    -]
> #14  Background short  Completed                   -    1377                 - [-   -    -]
> #15  Background short  Completed                   -    1377                 - [-   -    -]
> #16  Background short  Completed                   -    1377                 - [-   -    -]
> #17  Background short  Completed                   -    1377                 - [-   -    -]
> #18  Background short  Completed                   -    1377                 - [-   -    -]
> #19  Background short  Completed                   -    1377                 - [-   -    -]
> #20  Background short  Completed                   -    1377                 - [-   -    -]
>
> Long (extended) Self Test duration: 6362 seconds [106.0 minutes]
>
>
>
>
> I also have this from ten minutes later with a newer version of smartctl:
>
> root@typhoon-base-unit0:/root> ./smartctl.eric -x /dev/sda
> smartctl 5.40 2010-10-16 r3189 [i686-pc-linux-gnu] (local build)
> Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net
>
> Device: WD       WD9001BKHG-02D22 Version: SR03
> Serial number:         WX21EB1ANU78
> Device type: disk
> Transport protocol: SAS
> Local Time is: Fri Nov 23 00:45:08 2012 HKT
> Device supports SMART and is Enabled
> Temperature Warning Enabled
> SMART Health Status: OK
>
> Current Drive Temperature:     39 C
> Drive Trip Temperature:        69 C
> Manufactured in week 01 of year 2010
> Specified cycle count over device lifetime:  1048576
> Accumulated start-stop cycles:  26
> Specified load-unload count over device lifetime:  1114112
> Accumulated load-unload cycles:  0
> Elements in grown defect list: 0
>
> Error counter log:
>             Errors Corrected by           Total   Correction     Gigabytes    Total
>                 ECC          rereads/    errors   algorithm      processed    uncorrected
>             fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
> read:      21189        2         2     21191          2       4950.446           0
> write:        89        4         0        93          4       1317.939           0
> verify:      103        0         0       103          0          0.000           0
>
> Non-medium error count:   169436
>
> SMART Self-test log
> Num  Test              Status                 segment  LifeTime  LBA_first_err [SK ASC ASQ]
>       Description                              number   (hours)
> # 1  Background short  Completed                   -    1378                 - [-   -    -]
> # 2  Background short  Completed                   -    1378                 - [-   -    -]
> # 3  Background short  Completed                   -    1378                 - [-   -    -]
> # 4  Background short  Completed                   -    1377                 - [-   -    -]
> # 5  Background short  Completed                   -    1377                 - [-   -    -]
> # 6  Background short  Completed                   -    1377                 - [-   -    -]
> # 7  Background short  Completed                   -    1377                 - [-   -    -]
> # 8  Background short  Completed                   -    1377                 - [-   -    -]
> # 9  Background short  Completed                   -    1377                 - [-   -    -]
> #10  Background short  Completed                   -    1377                 - [-   -    -]
> #11  Background short  Completed                   -    1377                 - [-   -    -]
> #12  Background short  Completed                   -    1377                 - [-   -    -]
> #13  Background short  Completed                   -    1377                 - [-   -    -]
> #14  Background short  Completed                   -    1377                 - [-   -    -]
> #15  Background short  Completed                   -    1377                 - [-   -    -]
> #16  Background short  Completed                   -    1377                 - [-   -    -]
> #17  Background short  Completed                   -    1377                 - [-   -    -]
> #18  Background short  Completed                   -    1377                 - [-   -    -]
> #19  Background short  Completed                   -    1377                 - [-   -    -]
> #20  Background short  Completed                   -    1377                 - [-   -    -]
>
> Long (extended) Self Test duration: 6362 seconds [106.0 minutes]
>
> Background scan results log
>    Status: no scans active
>      Accumulated power on time, hours:minutes 1378:08 [82688 minutes]
>      Number of background scans performed: 0,  scan progress: 0.00%
>      Number of background medium scans performed: 0
> Protocol Specific port log page for SAS SSP
> relative target port id = 1
>    generation code = 0
>    number of phys = 1
>    phy identifier = 0
>      attached device type: end device
>      attached reason: unknown
>      reason: unknown
>      negotiated logical link rate: phy enabled; 3 Gbps
>      attached initiator port: ssp=1 stp=1 smp=1
>      attached target port: ssp=0 stp=0 smp=0
>      SAS address = 0x50014ee3556977a6
>      attached SAS address = 0x5fcfffff00000001
>      attached phy identifier = 0
>      Invalid DWORD count = 0
>      Running disparity error count = 0
>      Loss of DWORD synchronization = 3
>      Phy reset problem = 0
>      Phy event descriptors:
>       Transmitted SSP frame error count: 0
>       Received SSP frame error count: 0
> relative target port id = 2
>    generation code = 0
>    number of phys = 1
>    phy identifier = 1
>      attached device type: no device attached
>      attached reason: unknown
>      reason: unknown
>      negotiated logical link rate: phy enabled; unknown
>      attached initiator port: ssp=0 stp=0 smp=0
>      attached target port: ssp=0 stp=0 smp=0
>      SAS address = 0x50014ee3556977a7
>      attached SAS address = 0x0
>      attached phy identifier = 0
>      Invalid DWORD count = 0
>      Running disparity error count = 0
>      Loss of DWORD synchronization = 0
>      Phy reset problem = 0
>      Phy event descriptors:
>       Transmitted SSP frame error count: 0
>       Received SSP frame error count: 0
>
>
>
>
>
>


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: getting I/O errors in super_written()...any ideas what would cause this?
  2012-12-03 21:36                     ` Chris Friesen
@ 2012-12-03 21:59                       ` Dave Jiang
  0 siblings, 0 replies; 22+ messages in thread
From: Dave Jiang @ 2012-12-03 21:59 UTC (permalink / raw)
  To: Chris Friesen
  Cc: Ric Wheeler, Mathias Burén, Roy Sigurd Karlsbakk, Neil Brown,
	Linux-RAID, Jens Axboe, IDE/ATA development list, linux-scsi

On 12/03/2012 02:36 PM, Chris Friesen wrote:
> On 12/03/2012 03:21 PM, Dave Jiang wrote:
>> On 12/03/2012 02:08 PM, Chris Friesen wrote:
>>> On 12/03/2012 02:52 PM, Ric Wheeler wrote:
>>>
>>>> I jumped into this thread late - can you repost detail on the specific
>>>> drive and HBA used here? In any case, it sounds like this is a better
>>>> topic for the linux-scsi or linux-ide list where most of the low level
>>>> storage people lurk :)
>>> Okay, expanding the receiver list. :)
>>>
>>> To recap:
>>>
>>> I'm running 2.6.27 with LVM over software RAID 1 over a pair of SAS disks.
>>> Disks are WD9001BKHG, controller is Intel C600.
>> Just curious what driver are you using with the C600. The upstream
>> driver for C600 didn't get accepted until 3.0-rc6 and all of the
>> outstanding patches weren't accepted until 3.7-rc. So I'd say 3.6 would
>> be your best bet until 3.7 is released. Did you attempt a backport of
>> the isci driver or using something like an LSI port on 2.6.27? Have you
>> verified the issue on a more recent kernel?
> We're using a driver provided by the hardware vendor.  It appears to be 
> a backport of version 1.0.1 of the isci driver.  We've been using it 
> since mid-March or so.

Yikes. There have been significant updates to libsas, libata, and the isci
driver since March. Looks like you are barely limping along. I would
imagine the error handling and the hotplug would be a giant mess to say
the least.

> This is an embedded system, so as is all too common in that environment 
> upgrading the whole kernel isn't an option since it requires support 
> from multiple hardware/software vendors.
>
> Upgrading just the driver might be possible--do you think it's likely as 
> a cause for these errors?  The current driver has a binary firmware file 
> that it uses--would we keep that with the new driver?
You can certainly try but it needs the libsas, libata, and some block
fixes to function in a stable fashion. Given that it was a backport by a
vendor, one would wonder how much of libsas they actually backported.
It's really difficult to say where the error is coming from without
being able to verify on a later kernel. Is there any other I/O
controller you can use to test this? I'm guessing the answer is no since
it's an embedded board. You are using a very old driver that is backported
to a very old kernel that requires significant subsystem backporting as
well. You may need to go poke your OS vendor and have them support the
issue?

The binary firmware file is really there in case you are not able to
load your OEM parameters properly from the platform. It's there to allow
you to limp along if that is the case and by no means should be used for
standard operation. You are supposed to get the appropriate values for
your specific platform using a tool called phytune (which you should've
gotten from your Intel field rep). You need to program those values and
others into the OEM parameter block in the SPI flash of your platform.
In your BIOS you need to have either the OROM or the EFI driver loaded
during boot. The OROM or EFI driver then copies the values out of SPI
flash at boot and provides them to the driver. Those parameters provide
important timing values, among others. If you are loading the wrong values
for your platform, it is very possible that you could see I/O errors.



> Chris



^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: getting I/O errors in super_written()...any ideas what would cause this?
  2012-12-03 21:53                   ` Ric Wheeler
@ 2012-12-04 22:00                     ` Chris Friesen
  2012-12-04 23:55                       ` Ric Wheeler
  2012-12-05  9:20                       ` James Bottomley
  0 siblings, 2 replies; 22+ messages in thread
From: Chris Friesen @ 2012-12-04 22:00 UTC (permalink / raw)
  To: Ric Wheeler
  Cc: Mathias Burén, Roy Sigurd Karlsbakk, Neil Brown, Linux-RAID,
	Jens Axboe, IDE/ATA development list, linux-scsi

On 12/03/2012 03:53 PM, Ric Wheeler wrote:
> On 12/03/2012 04:08 PM, Chris Friesen wrote:
>> On 12/03/2012 02:52 PM, Ric Wheeler wrote:
>>
>>> I jumped into this thread late - can you repost detail on the specific
>>> drive and HBA used here? In any case, it sounds like this is a better
>>> topic for the linux-scsi or linux-ide list where most of the low level
>>> storage people lurk :)
>> Okay, expanding the receiver list. :)
>>
>> To recap:
>>
>> I'm running 2.6.27 with LVM over software RAID 1 over a pair of SAS 
>> disks.
>> Disks are WD9001BKHG, controller is Intel C600.
>>
>> Recently we started seeing messages of the following pattern, and we
>> don't know what's causing them:
>>
>> Nov 28 08:57:10 kernel: end_request: I/O error, dev sda, sector 
>> 1758169523
>> Nov 28 08:57:10 kernel: md: super_written gets error=-5, uptodate=0
>> Nov 28 08:57:10 kernel: raid1: Disk failure on sda2, disabling device.
>> Nov 28 08:57:10 kernel: raid1: Operation continuing on 1 devices.
>>
>> We've been assuming it's a software issue since it's reproducible on
>> multiple systems, although so far we've only seen the problem with
>> these particular disks.
>>
>> We've seen the problems with disk write cache enabled and disabled.
> 
> Hi Chris,
> 
> Are there any earlier IO errors or sda related errors in the log?

Nope, at least not nearby.  On one system for instance we boot up and
get into steady-state, then there are no kernel logs for about half an
hour then out of the blue we see:

Nov 27 14:58:13 base0-0-0-13-0-11-1 kernel: end_request: I/O error, dev sda, sector 1758169523
Nov 27 14:58:13 base0-0-0-13-0-11-1 kernel: md: super_written gets error=-5, uptodate=0
Nov 27 14:58:13 base0-0-0-13-0-11-1 kernel: raid1: Disk failure on sda2, disabling device.
Nov 27 14:58:13 base0-0-0-13-0-11-1 kernel: raid1: Operation continuing on 1 devices.
Nov 27 14:58:13 base0-0-0-13-0-11-1 kernel: end_request: I/O error, dev sdb, sector 1758169523
Nov 27 14:58:13 base0-0-0-13-0-11-1 kernel: md: super_written gets error=-5, uptodate=0
Nov 27 14:58:13 base0-0-0-13-0-11-1 kernel: RAID1 conf printout:
Nov 27 14:58:13 base0-0-0-13-0-11-1 kernel: --- wd:1 rd:2
Nov 27 14:58:13 base0-0-0-13-0-11-1 kernel: disk 0, wo:1, o:0, dev:sda2
Nov 27 14:58:13 base0-0-0-13-0-11-1 kernel: disk 1, wo:0, o:1, dev:sdb2
Nov 27 14:58:13 base0-0-0-13-0-11-1 kernel: RAID1 conf printout:
Nov 27 14:58:13 base0-0-0-13-0-11-1 kernel: --- wd:1 rd:2
Nov 27 14:58:13 base0-0-0-13-0-11-1 kernel: disk 1, wo:0, o:1, dev:sdb2


As another data point, it looks like we may be doing a SEND DIAGNOSTIC
command specifying the default self-test in addition to the background
short self-test.  This seems a bit risky and excessive to me, but
apparently the guy that wrote it is no longer with the company.

What is the recommended method for monitoring disks on a system that
is likely to go a long time between boots?  Do we avoid any in-service
testing and just monitor the SMART data and only test it if something
actually goes wrong?  Or should we intentionally drop a disk out of the
array and test it?  (The downside of that is that we lose
redundancy since we only have 2 disks.)

Chris

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: getting I/O errors in super_written()...any ideas what would cause this?
  2012-12-04 22:00                     ` Chris Friesen
@ 2012-12-04 23:55                       ` Ric Wheeler
  2012-12-05  9:20                       ` James Bottomley
  1 sibling, 0 replies; 22+ messages in thread
From: Ric Wheeler @ 2012-12-04 23:55 UTC (permalink / raw)
  To: Chris Friesen
  Cc: Mathias Burén, Roy Sigurd Karlsbakk, Neil Brown, Linux-RAID,
	Jens Axboe, IDE/ATA development list, linux-scsi

On 12/04/2012 05:00 PM, Chris Friesen wrote:
> On 12/03/2012 03:53 PM, Ric Wheeler wrote:
>> On 12/03/2012 04:08 PM, Chris Friesen wrote:
>>> On 12/03/2012 02:52 PM, Ric Wheeler wrote:
>>>
>>>> I jumped into this thread late - can you repost detail on the specific
>>>> drive and HBA used here? In any case, it sounds like this is a better
>>>> topic for the linux-scsi or linux-ide list where most of the low level
>>>> storage people lurk :)
>>> Okay, expanding the receiver list. :)
>>>
>>> To recap:
>>>
>>> I'm running 2.6.27 with LVM over software RAID 1 over a pair of SAS
>>> disks.
>>> Disks are WD9001BKHG, controller is Intel C600.
>>>
>>> Recently we started seeing messages of the following pattern, and we
>>> don't know what's causing them:
>>>
>>> Nov 28 08:57:10 kernel: end_request: I/O error, dev sda, sector
>>> 1758169523
>>> Nov 28 08:57:10 kernel: md: super_written gets error=-5, uptodate=0
>>> Nov 28 08:57:10 kernel: raid1: Disk failure on sda2, disabling device.
>>> Nov 28 08:57:10 kernel: raid1: Operation continuing on 1 devices.
>>>
>>> We've been assuming it's a software issue since it's reproducible on
>>> multiple systems, although so far we've only seen the problem with
>>> these particular disks.
>>>
>>> We've seen the problems with disk write cache enabled and disabled.
>> Hi Chris,
>>
>> Are there any earlier IO errors or sda related errors in the log?
> Nope, at least not nearby.  On one system for instance we boot up and
> get into steady-state, then there are no kernel logs for about half an
> hour then out of the blue we see:
>
> Nov 27 14:58:13 base0-0-0-13-0-11-1 kernel: end_request: I/O error, dev sda, sector 1758169523
> Nov 27 14:58:13 base0-0-0-13-0-11-1 kernel: md: super_written gets error=-5, uptodate=0
> Nov 27 14:58:13 base0-0-0-13-0-11-1 kernel: raid1: Disk failure on sda2, disabling device.
> Nov 27 14:58:13 base0-0-0-13-0-11-1 kernel: raid1: Operation continuing on 1 devices.
> Nov 27 14:58:13 base0-0-0-13-0-11-1 kernel: end_request: I/O error, dev sdb, sector 1758169523
> Nov 27 14:58:13 base0-0-0-13-0-11-1 kernel: md: super_written gets error=-5, uptodate=0
> Nov 27 14:58:13 base0-0-0-13-0-11-1 kernel: RAID1 conf printout:
> Nov 27 14:58:13 base0-0-0-13-0-11-1 kernel: --- wd:1 rd:2
> Nov 27 14:58:13 base0-0-0-13-0-11-1 kernel: disk 0, wo:1, o:0, dev:sda2
> Nov 27 14:58:13 base0-0-0-13-0-11-1 kernel: disk 1, wo:0, o:1, dev:sdb2
> Nov 27 14:58:13 base0-0-0-13-0-11-1 kernel: RAID1 conf printout:
> Nov 27 14:58:13 base0-0-0-13-0-11-1 kernel: --- wd:1 rd:2
> Nov 27 14:58:13 base0-0-0-13-0-11-1 kernel: disk 1, wo:0, o:1, dev:sdb2
>
>
> As another data point, it looks like we may be doing a SEND DIAGNOSTIC
> command specifying the default self-test in addition to the background
> short self-test.  This seems a bit risky and excessive to me, but
> apparently the guy that wrote it is no longer with the company.
>
> What is the recommended method for monitoring disks on a system that
> is likely to go a long time between boots?  Do we avoid any in-service
> testing and just monitor the SMART data and only test it if something
> actually goes wrong?  Or should we intentionally drop a disk out of the
> array and test it?  (The downside of that is that we lose
> redundancy since we only have 2 disks.)
>
> Chris

I don't know if running the self tests really helps. Normally, I would simply 
suggest scanning for remapped sectors (and looking out for lots of them, not 
just a handful since they are moderately normal in disks). You can do that with 
smartctl.
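
As a rough illustration of that kind of passive check, here is a minimal
Python sketch that pulls a few counters out of saved `smartctl -a` text for a
SAS disk. The field names match the smartctl output quoted earlier in this
thread; the function name, the dictionary keys, and the trimmed sample text
are assumptions made for the example. On SAS disks the grown defect list is
the nearest thing to a remapped-sector count.

```python
import re

# Trimmed sample in the format smartctl printed for the SAS disk in this thread.
SAMPLE = """\
SMART Health Status: OK
Elements in grown defect list: 0
read:      21189        2         2     21191          2       4950.446           0
write:        89        4         0        93          4       1317.939           0
verify:      103        0         0       103          0          0.000           0
Non-medium error count:   169436
"""


def parse_sas_health(smartctl_text):
    """Extract passive health indicators from `smartctl -a` text (hypothetical helper)."""
    result = {"uncorrected": {}}
    for line in smartctl_text.splitlines():
        line = line.strip()
        m = re.match(r"SMART Health Status:\s*(\S+)", line)
        if m:
            result["health"] = m.group(1)
        m = re.match(r"Elements in grown defect list:\s*(\d+)", line)
        if m:
            result["grown_defects"] = int(m.group(1))
        m = re.match(r"Non-medium error count:\s*(\d+)", line)
        if m:
            result["non_medium_errors"] = int(m.group(1))
        # The last column of each error counter row is "total uncorrected errors".
        m = re.match(r"(read|write|verify):\s+(.+)", line)
        if m:
            result["uncorrected"][m.group(1)] = int(m.group(2).split()[-1])
    return result
```

Feeding it the output of a periodic `smartctl -a /dev/sda` run and watching
whether `grown_defects` or the uncorrected counts climb over time only reads
data the drive already holds, so it does not start a self-test.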

Best advice is to try and consult directly with your disk vendor about their 
suggestions if you have that connection of course :)

Ric


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: getting I/O errors in super_written()...any ideas what would cause this?
  2012-12-04 22:00                     ` Chris Friesen
  2012-12-04 23:55                       ` Ric Wheeler
@ 2012-12-05  9:20                       ` James Bottomley
  2012-12-05 11:41                         ` Ric Wheeler
  2012-12-06 18:15                         ` Chris Friesen
  1 sibling, 2 replies; 22+ messages in thread
From: James Bottomley @ 2012-12-05  9:20 UTC (permalink / raw)
  To: Chris Friesen
  Cc: Ric Wheeler, Mathias Burén, Roy Sigurd Karlsbakk, Neil Brown,
	Linux-RAID, Jens Axboe, IDE/ATA development list, linux-scsi

On Tue, 2012-12-04 at 16:00 -0600, Chris Friesen wrote:
> As another data point, it looks like we may be doing a SEND DIAGNOSTIC
> command specifying the default self-test in addition to the background
> short self-test.  This seems a bit risky and excessive to me, but
> apparently the guy that wrote it is no longer with the company.

This is a really bad idea.  A lot of disks go out to lunch until the
diagnostics complete (the same goes for SMART diagnostics).  This means
that if you do diagnostics on a running device, the drivers start to get
timeouts on commands which are queued waiting for diagnostics to
complete ... if those go over the standard SCSI timeouts, we'll start to
try error recovery and likely have the disaster you see above.

> What is the recommended method for monitoring disks on a system that
> is likely to go a long time between boots?  Do we avoid any in-service
> testing and just monitor the SMART data and only test it if something
> actually goes wrong?  Or should we intentionally drop a disk out of the
> array and test it?  (The downside of that is that we lose
> redundancy since we only have 2 disks.)

What do you mean by "monitoring" ... as in what are you looking for?  To
make sure the disk is healthy and responding, a simple test unit ready
works.  To look at other parameters, read the mode pages.
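
A minimal sketch of such a liveness probe, assuming the sg3_utils package is
installed: its `sg_turs` utility issues TEST UNIT READY and exits with status
0 when the device reports ready. The function name and the injectable `run`
parameter are hypothetical and exist only to keep the example testable.

```python
import subprocess


def is_unit_ready(device, run=subprocess.run):
    """Return True if `sg_turs` (sg3_utils) reports the device as ready.

    `run` defaults to subprocess.run; it is a parameter only so the check
    can be exercised without real hardware (an assumption of this sketch).
    """
    proc = run(
        ["sg_turs", device],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return proc.returncode == 0
```

Calling `is_unit_ready("/dev/sda")` from a periodic monitor only asks the
drive whether it is responding; it does not start any diagnostic.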

Anything that actively causes the disk to go out and check something is
a bad idea in a running environment.  Only do this if you can quiesce
the I/O before starting the active diagnostic (or drop the disk from the
array as you suggest).

To be honest, though, modern disks do a whole host of diagnostics as
they write data just to check that it is safely committed, so passive
monitoring should be fine.

James

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: getting I/O errors in super_written()...any ideas what would cause this?
  2012-12-05  9:20                       ` James Bottomley
@ 2012-12-05 11:41                         ` Ric Wheeler
  2012-12-05 11:57                           ` James Bottomley
  2012-12-06 18:15                         ` Chris Friesen
  1 sibling, 1 reply; 22+ messages in thread
From: Ric Wheeler @ 2012-12-05 11:41 UTC (permalink / raw)
  To: James Bottomley
  Cc: Chris Friesen, Mathias Burén, Roy Sigurd Karlsbakk,
	Neil Brown, Linux-RAID, Jens Axboe, IDE/ATA development list,
	linux-scsi

On 12/05/2012 04:20 AM, James Bottomley wrote:
> On Tue, 2012-12-04 at 16:00 -0600, Chris Friesen wrote:
>> As another data point, it looks like we may be doing a SEND DIAGNOSTIC
>> command specifying the default self-test in addition to the background
>> short self-test.  This seems a bit risky and excessive to me, but
>> apparently the guy that wrote it is no longer with the company.
> This is a really bad idea.  A lot of disks go out to lunch until the
> diagnostics complete (the same goes for SMART diagnostics).  This means
> that if you do diagnostics on a running device, the drivers start to get
> timeouts on commands which are queued waiting for diagnostics to
> complete ... if those go over the standard SCSI timeouts, we'll start to
> try error recovery and likely have the disaster you see above.
>
>> What is the recommended method for monitoring disks on a system that
>> is likely to go a long time between boots?  Do we avoid any in-service
>> testing and just monitor the SMART data and only test it if something
>> actually goes wrong?  Or should we intentionally drop a disk out of the
>> array and test it?  (The downside of that is that we lose
>> redundancy since we only have 2 disks.)
> What do you mean by "monitoring" ... as in what are you looking for?  To
> make sure the disk is healthy and responding, a simple test unit ready
> works.  To look at other parameters, read the mode pages.
>
> Anything that actively causes the disk to go out and check something is
> a bad idea in a running environment.  Only do this if you can quiesce
> the I/O before starting the active diagnostic (or drop the disk from the
> array as you suggest).
>
> To be honest, though, modern disks do a whole host of diagnostics as
> they write data just to check that it is safely committed, so passive
> monitoring should be fine.
>
> James
>
>

I don't think that the basic stat gathering (smartctl -a ....) has this kind of 
impact, but I am worried about the running of the diagnostics.

ric


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: getting I/O errors in super_written()...any ideas what would cause this?
  2012-12-05 11:41                         ` Ric Wheeler
@ 2012-12-05 11:57                           ` James Bottomley
  0 siblings, 0 replies; 22+ messages in thread
From: James Bottomley @ 2012-12-05 11:57 UTC (permalink / raw)
  To: Ric Wheeler
  Cc: Chris Friesen, Mathias Burén, Roy Sigurd Karlsbakk,
	Neil Brown, Linux-RAID, Jens Axboe, IDE/ATA development list,
	linux-scsi

On Wed, 2012-12-05 at 06:41 -0500, Ric Wheeler wrote:
> I don't think that the basic stat gathering (smartctl -a ....) has this kind of 
> impact, but am worried about the running of the diagnostics,

Right, that's what I was trying to say above.  Anything that just reads
existing data is fine.  Anything that causes the disk to go off and test
things to gather data is likely to cause I/O errors due to timeouts.

James



^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: getting I/O errors in super_written()...any ideas what would cause this?
  2012-12-05  9:20                       ` James Bottomley
  2012-12-05 11:41                         ` Ric Wheeler
@ 2012-12-06 18:15                         ` Chris Friesen
  2012-12-06 20:27                           ` Chris Murphy
  2012-12-08 18:08                           ` James Bottomley
  1 sibling, 2 replies; 22+ messages in thread
From: Chris Friesen @ 2012-12-06 18:15 UTC (permalink / raw)
  To: James Bottomley
  Cc: Ric Wheeler, Mathias Burén, Roy Sigurd Karlsbakk, Neil Brown,
	Linux-RAID, Jens Axboe, IDE/ATA development list, linux-scsi

On 12/05/2012 03:20 AM, James Bottomley wrote:
> On Tue, 2012-12-04 at 16:00 -0600, Chris Friesen wrote:
>> As another data point, it looks like we may be doing a SEND DIAGNOSTIC
>> command specifying the default self-test in addition to the background
>> short self-test.  This seems a bit risky and excessive to me, but
>> apparently the guy that wrote it is no longer with the company.
>
> This is a really bad idea.  A lot of disks go out to lunch until the
> diagnostics complete (the same goes for SMART diagnostics).  This means
> that if you do diagnostics on a running device, the drivers start to get
> timeouts on commands which are queued waiting for diagnostics to
> complete ... if those go over the standard SCSI timeouts, we'll start to
> try error recovery and likely have the disaster you see above.

So it turns out that our problems are intermittently triggered when 
running the default self test.  This agrees with the statement in 
sg_senddiag to not do foreground self-tests on disks with mounted 
filesystems.

We seem to be able to do background short self-tests (ie, SEND 
DIAGNOSTIC command with self-test code of 001b and ST code of 0b) 
without causing any problems.  Is this pushing our luck or is this 
something that should work according to the spec and the linux stack? 
The scsi spec indicates that in this case for most commands the test 
will be paused and the command executed within 2 seconds, but I don't 
know what the normal scsi timeouts are.
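
To make the difference between the two requests concrete, here is a small
Python sketch that builds the 6-byte SEND DIAGNOSTIC CDB, using the byte 1
layout from SPC (SELF-TEST CODE in bits 7-5, PF in bit 4, SELFTEST in bit 2).
It reads "ST code" above as the SELFTEST bit, which is an assumption, and the
function name is hypothetical. It only constructs the bytes and sends nothing
to a device.

```python
SEND_DIAGNOSTIC_OPCODE = 0x1D


def send_diagnostic_cdb(self_test_code=0, self_test=False, pf=False, param_len=0):
    """Build a SEND DIAGNOSTIC CDB (illustrative helper, not from sg3_utils)."""
    if not 0 <= self_test_code <= 0b111:
        raise ValueError("self-test code is a 3-bit field")
    if not 0 <= param_len <= 0xFFFF:
        raise ValueError("parameter list length is a 16-bit field")
    # Byte 1: SELF-TEST CODE (bits 7-5), PF (bit 4), SELFTEST (bit 2).
    byte1 = (self_test_code << 5) | (int(pf) << 4) | (int(self_test) << 2)
    return bytes([
        SEND_DIAGNOSTIC_OPCODE,
        byte1,
        0x00,                     # reserved
        (param_len >> 8) & 0xFF,  # parameter list length, MSB
        param_len & 0xFF,         # parameter list length, LSB
        0x00,                     # control
    ])


# Background short self-test: self-test code 001b, SELFTEST bit 0.
BACKGROUND_SHORT = send_diagnostic_cdb(self_test_code=0b001)
# Default self-test: self-test code 000b, SELFTEST bit 1.
DEFAULT_SELF_TEST = send_diagnostic_cdb(self_test=True)
```

The two commands differ only in byte 1 (0x20 versus 0x04), so it is easy for
monitoring code to request the default self-test when the background short
test was intended.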

Thanks for the input, this is very useful.

Chris

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: getting I/O errors in super_written()...any ideas what would cause this?
  2012-12-06 18:15                         ` Chris Friesen
@ 2012-12-06 20:27                           ` Chris Murphy
  2012-12-08 18:08                           ` James Bottomley
  1 sibling, 0 replies; 22+ messages in thread
From: Chris Murphy @ 2012-12-06 20:27 UTC (permalink / raw)
  To: Linux-RAID; +Cc: IDE/ATA development list, linux-scsi


On Dec 6, 2012, at 11:15 AM, Chris Friesen <chris.friesen@genband.com> wrote:
>> 
> 
> So it turns out that our problems are intermittently triggered when running the default self test.  This agrees with the statement in sg_senddiag to not do foreground self-tests on disks with mounted filesystems.

Yeah, that would be a bad idea. But the default SMART test isn't run with the -C flag (for foreground testing); it's a background test, and that includes the long/extended test.


Chris Murphy


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: getting I/O errors in super_written()...any ideas what would cause this?
  2012-12-06 18:15                         ` Chris Friesen
  2012-12-06 20:27                           ` Chris Murphy
@ 2012-12-08 18:08                           ` James Bottomley
  1 sibling, 0 replies; 22+ messages in thread
From: James Bottomley @ 2012-12-08 18:08 UTC (permalink / raw)
  To: Chris Friesen
  Cc: Ric Wheeler, Mathias Burén, Roy Sigurd Karlsbakk, Neil Brown,
	Linux-RAID, Jens Axboe, IDE/ATA development list, linux-scsi

On Thu, 2012-12-06 at 12:15 -0600, Chris Friesen wrote:
> On 12/05/2012 03:20 AM, James Bottomley wrote:
> > On Tue, 2012-12-04 at 16:00 -0600, Chris Friesen wrote:
> >> As another data point, it looks like we may be doing a SEND DIAGNOSTIC
> >> command specifying the default self-test in addition to the background
> >> short self-test.  This seems a bit risky and excessive to me, but
> >> apparently the guy that wrote it is no longer with the company.
> >
> > This is a really bad idea.  A lot of disks go out to lunch until the
> > diagnostics complete (the same goes for SMART diagnostics).  This means
> > that if you do diagnostics on a running device, the drivers start to get
> > timeouts on commands which are queued waiting for diagnostics to
> > complete ... if those go over the standard SCSI timeouts, we'll start to
> > try error recovery and likely have the disaster you see above.
> 
> So it turns out that our problems are intermittently triggered when 
> running the default self test.  This agrees with the statement in 
> sg_senddiag to not do foreground self-tests on disks with mounted 
> filesystems.
> 
> We seem to be able to do background short self-tests (ie, SEND 
> DIAGNOSTIC command with self-test code of 001b and ST code of 0b) 
> without causing any problems.  Is this pushing our luck or is this 
> something that should work according to the spec and the linux stack?

No one can tell you this.  The specs don't say what should happen on a
diagnostic, how long it will take or how disruptive to the I/O flow it
is.

> The scsi spec indicates that in this case for most commands the test 
> will be paused and the command executed within 2 seconds, but I don't 
> know what the normal scsi timeouts are.

2 seconds can be an eternity if you're pumping huge amounts of data to a
disk: it causes a burp in the I/O chain which propagates back up the
stack with unpredictable knock-on consequences.  The standard SCSI
timeouts are configurable under sysfs, but they default to 30s.  It is
known that EMC recommends 60s timeouts for arrays, for instance (which is
why they're configurable).
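
For reference, the per-device command timeout is exposed as
`/sys/block/<disk>/device/timeout`, in seconds. The Python sketch below reads
and writes that file; the helper names and the `sysfs_root` parameter are
assumptions added so the code can be tried against a scratch directory rather
than a live system.

```python
from pathlib import Path


def scsi_timeout_path(disk, sysfs_root="/sys"):
    """Path of the SCSI command timeout attribute for a disk such as 'sda'."""
    return Path(sysfs_root) / "block" / disk / "device" / "timeout"


def read_scsi_timeout(disk, sysfs_root="/sys"):
    """Return the current command timeout in seconds."""
    return int(scsi_timeout_path(disk, sysfs_root).read_text().strip())


def write_scsi_timeout(disk, seconds, sysfs_root="/sys"):
    """Set the command timeout in seconds (needs root on a real system)."""
    scsi_timeout_path(disk, sysfs_root).write_text(f"{int(seconds)}\n")
```

On a real machine `read_scsi_timeout("sda")` would normally return 30, and a
value written this way does not persist across reboots, so it would have to
be reapplied at boot (for example from a udev rule or an init script).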

James



^ permalink raw reply	[flat|nested] 22+ messages in thread

end of thread, other threads:[~2012-12-08 18:08 UTC | newest]

Thread overview: 22+ messages
2012-11-28 17:52 getting I/O errors in super_written()...any ideas what would cause this? Chris Friesen
2012-11-28 18:08 ` Mathias Burén
2012-11-28 18:51   ` Roy Sigurd Karlsbakk
2012-11-28 20:21     ` Chris Friesen
2012-11-28 20:27       ` Mathias Burén
2012-11-28 20:29         ` Chris Friesen
2012-12-03 20:22           ` Ric Wheeler
2012-12-03 20:44             ` Chris Friesen
2012-12-03 20:52               ` Ric Wheeler
2012-12-03 21:08                 ` Chris Friesen
2012-12-03 21:21                   ` Dave Jiang
2012-12-03 21:36                     ` Chris Friesen
2012-12-03 21:59                       ` Dave Jiang
2012-12-03 21:53                   ` Ric Wheeler
2012-12-04 22:00                     ` Chris Friesen
2012-12-04 23:55                       ` Ric Wheeler
2012-12-05  9:20                       ` James Bottomley
2012-12-05 11:41                         ` Ric Wheeler
2012-12-05 11:57                           ` James Bottomley
2012-12-06 18:15                         ` Chris Friesen
2012-12-06 20:27                           ` Chris Murphy
2012-12-08 18:08                           ` James Bottomley
