From: Dave Jiang <dave.jiang@intel.com>
To: Chris Friesen <chris.friesen@genband.com>
Cc: "Ric Wheeler" <rwheeler@redhat.com>,
"Mathias Burén" <mathias.buren@gmail.com>,
"Roy Sigurd Karlsbakk" <roy@karlsbakk.net>,
"Neil Brown" <neilb@suse.de>,
Linux-RAID <linux-raid@vger.kernel.org>,
"Jens Axboe" <axboe@kernel.dk>,
"IDE/ATA development list" <linux-ide@vger.kernel.org>,
linux-scsi <linux-scsi@vger.kernel.org>
Subject: Re: getting I/O errors in super_written()...any ideas what would cause this?
Date: Mon, 03 Dec 2012 14:21:17 -0700
Message-ID: <50BD17CD.8080206@intel.com>
In-Reply-To: <50BD14B1.7000203@genband.com>
On 12/03/2012 02:08 PM, Chris Friesen wrote:
> On 12/03/2012 02:52 PM, Ric Wheeler wrote:
>
>> I jumped into this thread late - can you repost detail on the specific
>> drive and HBA used here? In any case, it sounds like this is a better
>> topic for the linux-scsi or linux-ide list where most of the low level
>> storage people lurk :)
> Okay, expanding the receiver list. :)
>
> To recap:
>
> I'm running 2.6.27 with LVM over software RAID 1 over a pair of SAS disks.
> Disks are WD9001BKHG, controller is Intel C600.
Just curious: what driver are you using with the C600? The upstream
driver for the C600 wasn't accepted until 3.0-rc6, and the outstanding
patches weren't all accepted until 3.7-rc. So I'd say 3.6 would be
your best bet until 3.7 is released. Did you attempt a backport of
the isci driver, or are you using something like an LSI port on 2.6.27?
Have you verified the issue on a more recent kernel?
> Recently we started seeing messages of the following pattern, and we
> don't know what's causing them:
>
> Nov 28 08:57:10 kernel: end_request: I/O error, dev sda, sector 1758169523
> Nov 28 08:57:10 kernel: md: super_written gets error=-5, uptodate=0
> Nov 28 08:57:10 kernel: raid1: Disk failure on sda2, disabling device.
> Nov 28 08:57:10 kernel: raid1: Operation continuing on 1 devices.
>
> We've been assuming it's a software issue since it's reproducible on
> multiple systems, although so far we've only seen the problem with
> these particular disks.
>
> We've seen the problems with disk write cache enabled and disabled.
>
> It looks like it may be related to being in the middle of a background
> short self-test at the time we see the error. The disks are still
> in-service at this point--is this supported behaviour or would it
> be expected to cause errors? (The self-test works fine with other
> disks, and worked fine with these disks until recently, but we haven't
> made any changes to the block I/O code.)
>
> Here's the smartctl output from right after a failure. The self-tests
> are frequent as a stress-test, normally they're done once per day:
>
> root@typhoon-base-unit0:/root> ./smartctl --all /dev/sda
> smartctl version 5.38 [i686-wrs-linux-gnu] Copyright (C) 2002-8 Bruce Allen
> Home page is http://smartmontools.sourceforge.net/
>
> Device: WD WD9001BKHG-02D22 Version: SR03
> Serial number: WX21EB1ANU78
> Device type: disk
> Transport protocol: SAS
> Local Time is: Fri Nov 23 00:35:03 2012 HKT
> Device supports SMART and is Enabled
> Temperature Warning Enabled
> SMART Health Status: OK
>
> Current Drive Temperature: 39 C
> Drive Trip Temperature: 69 C
> Manufactured in week 01 of year 2010
> Recommended maximum start stop count: 1048576 times
> Current start stop count: 26 times
> Elements in grown defect list: 0
>
> Error counter log:
>            Errors Corrected by           Total   Correction     Gigabytes    Total
>                ECC          rereads/    errors   algorithm      processed    uncorrected
>            fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
> read:      21187        2         2     21189          2       4950.446           0
> write:        89        4         0        93          4       1317.938           0
> verify:      103        0         0       103          0          0.000           0
>
> Non-medium error count: 169436
>
> SMART Self-test log
> Num  Test              Status                     segment  LifeTime  LBA_first_err [SK ASC ASQ]
>      Description                                  number   (hours)
> # 1  Background short  Self test in progress ...     4        NOW                - [-   -    -]
> # 2  Background short  Completed                     -       1377                - [-   -    -]
> # 3  Background short  Completed                     -       1377                - [-   -    -]
> # 4  Background short  Completed                     -       1377                - [-   -    -]
> # 5  Background short  Completed                     -       1377                - [-   -    -]
> # 6  Background short  Completed                     -       1377                - [-   -    -]
> # 7  Background short  Completed                     -       1377                - [-   -    -]
> # 8  Background short  Completed                     -       1377                - [-   -    -]
> # 9  Background short  Completed                     -       1377                - [-   -    -]
> #10  Background short  Completed                     -       1377                - [-   -    -]
> #11  Background short  Completed                     -       1377                - [-   -    -]
> #12  Background short  Completed                     -       1377                - [-   -    -]
> #13  Background short  Completed                     -       1377                - [-   -    -]
> #14  Background short  Completed                     -       1377                - [-   -    -]
> #15  Background short  Completed                     -       1377                - [-   -    -]
> #16  Background short  Completed                     -       1377                - [-   -    -]
> #17  Background short  Completed                     -       1377                - [-   -    -]
> #18  Background short  Completed                     -       1377                - [-   -    -]
> #19  Background short  Completed                     -       1377                - [-   -    -]
> #20  Background short  Completed                     -       1377                - [-   -    -]
>
> Long (extended) Self Test duration: 6362 seconds [106.0 minutes]
>
> I also have this from ten minutes later with a newer version of smartctl:
>
> root@typhoon-base-unit0:/root> ./smartctl.eric -x /dev/sda
> smartctl 5.40 2010-10-16 r3189 [i686-pc-linux-gnu] (local build)
> Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net
>
> Device: WD WD9001BKHG-02D22 Version: SR03
> Serial number: WX21EB1ANU78
> Device type: disk
> Transport protocol: SAS
> Local Time is: Fri Nov 23 00:45:08 2012 HKT
> Device supports SMART and is Enabled
> Temperature Warning Enabled
> SMART Health Status: OK
>
> Current Drive Temperature: 39 C
> Drive Trip Temperature: 69 C
> Manufactured in week 01 of year 2010
> Specified cycle count over device lifetime: 1048576
> Accumulated start-stop cycles: 26
> Specified load-unload count over device lifetime: 1114112
> Accumulated load-unload cycles: 0
> Elements in grown defect list: 0
>
> Error counter log:
>            Errors Corrected by           Total   Correction     Gigabytes    Total
>                ECC          rereads/    errors   algorithm      processed    uncorrected
>            fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
> read:      21189        2         2     21191          2       4950.446           0
> write:        89        4         0        93          4       1317.939           0
> verify:      103        0         0       103          0          0.000           0
>
> Non-medium error count: 169436
>
> SMART Self-test log
> Num  Test              Status                     segment  LifeTime  LBA_first_err [SK ASC ASQ]
>      Description                                  number   (hours)
> # 1  Background short  Completed                     -       1378                - [-   -    -]
> # 2  Background short  Completed                     -       1378                - [-   -    -]
> # 3  Background short  Completed                     -       1378                - [-   -    -]
> # 4  Background short  Completed                     -       1377                - [-   -    -]
> # 5  Background short  Completed                     -       1377                - [-   -    -]
> # 6  Background short  Completed                     -       1377                - [-   -    -]
> # 7  Background short  Completed                     -       1377                - [-   -    -]
> # 8  Background short  Completed                     -       1377                - [-   -    -]
> # 9  Background short  Completed                     -       1377                - [-   -    -]
> #10  Background short  Completed                     -       1377                - [-   -    -]
> #11  Background short  Completed                     -       1377                - [-   -    -]
> #12  Background short  Completed                     -       1377                - [-   -    -]
> #13  Background short  Completed                     -       1377                - [-   -    -]
> #14  Background short  Completed                     -       1377                - [-   -    -]
> #15  Background short  Completed                     -       1377                - [-   -    -]
> #16  Background short  Completed                     -       1377                - [-   -    -]
> #17  Background short  Completed                     -       1377                - [-   -    -]
> #18  Background short  Completed                     -       1377                - [-   -    -]
> #19  Background short  Completed                     -       1377                - [-   -    -]
> #20  Background short  Completed                     -       1377                - [-   -    -]
>
> Long (extended) Self Test duration: 6362 seconds [106.0 minutes]
>
> Background scan results log
> Status: no scans active
> Accumulated power on time, hours:minutes 1378:08 [82688 minutes]
> Number of background scans performed: 0, scan progress: 0.00%
> Number of background medium scans performed: 0
> Protocol Specific port log page for SAS SSP
> relative target port id = 1
> generation code = 0
> number of phys = 1
> phy identifier = 0
> attached device type: end device
> attached reason: unknown
> reason: unknown
> negotiated logical link rate: phy enabled; 3 Gbps
> attached initiator port: ssp=1 stp=1 smp=1
> attached target port: ssp=0 stp=0 smp=0
> SAS address = 0x50014ee3556977a6
> attached SAS address = 0x5fcfffff00000001
> attached phy identifier = 0
> Invalid DWORD count = 0
> Running disparity error count = 0
> Loss of DWORD synchronization = 3
> Phy reset problem = 0
> Phy event descriptors:
> Transmitted SSP frame error count: 0
> Received SSP frame error count: 0
> relative target port id = 2
> generation code = 0
> number of phys = 1
> phy identifier = 1
> attached device type: no device attached
> attached reason: unknown
> reason: unknown
> negotiated logical link rate: phy enabled; unknown
> attached initiator port: ssp=0 stp=0 smp=0
> attached target port: ssp=0 stp=0 smp=0
> SAS address = 0x50014ee3556977a7
> attached SAS address = 0x0
> attached phy identifier = 0
> Invalid DWORD count = 0
> Running disparity error count = 0
> Loss of DWORD synchronization = 0
> Phy reset problem = 0
> Phy event descriptors:
> Transmitted SSP frame error count: 0
> Received SSP frame error count: 0
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
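
For anyone following the thread, the kernel messages quoted above can be picked apart with a few lines of Python. This is only a rough sketch, not part of the original report: the summarize() helper and its regular expressions are made up for illustration, and the 512-byte sector size is an assumption about how the block layer reports the sector number. The one firm fact it relies on is that errno 5 is EIO, so "error=-5" means the superblock write came back as a plain I/O error.

```python
import errno
import re

# Log lines copied from the report quoted above.
LOG = """\
Nov 28 08:57:10 kernel: end_request: I/O error, dev sda, sector 1758169523
Nov 28 08:57:10 kernel: md: super_written gets error=-5, uptodate=0
Nov 28 08:57:10 kernel: raid1: Disk failure on sda2, disabling device.
Nov 28 08:57:10 kernel: raid1: Operation continuing on 1 devices.
"""

# Hypothetical patterns written for these two message formats only.
IO_ERROR = re.compile(r"end_request: I/O error, dev (\w+), sector (\d+)")
MD_ERROR = re.compile(r"md: super_written gets error=(-?\d+), uptodate=(\d+)")


def summarize(log, sector_size=512):
    """Collect device, sector, byte offset and errno name from the log text.

    sector_size=512 is an assumption, not something stated in the report.
    """
    summary = {}
    for line in log.splitlines():
        match = IO_ERROR.search(line)
        if match:
            summary["device"] = match.group(1)
            summary["sector"] = int(match.group(2))
            summary["byte_offset"] = summary["sector"] * sector_size
        match = MD_ERROR.search(line)
        if match:
            code = int(match.group(1))
            # The kernel reports negative errno values; flip the sign to look it up.
            summary["errno"] = errno.errorcode.get(-code, "unknown")
            summary["uptodate"] = int(match.group(2))
    return summary


if __name__ == "__main__":
    for key, value in summarize(LOG).items():
        print(f"{key}: {value}")
```

Running the same helper over a longer log and comparing the timestamps with the self-test start times would be one way to check the suspected correlation with the background short self-test.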