* XFS/driver bug or bad drive?
@ 2009-10-01 23:27 David Engel
2009-10-02 0:39 ` Eric Sandeen
2009-10-02 8:05 ` Michael Monnerie
0 siblings, 2 replies; 8+ messages in thread
From: David Engel @ 2009-10-01 23:27 UTC (permalink / raw)
To: xfs
Hi,
I've been trying to diagnose a suspected disk drive problem for about
a week. I now think the problem might be a known (and fixed) xfs or
driver bug, but I'm not 100% sure. I'm hoping someone here can
confirm the problem is or isn't an xfs bug.
The drive in question is a Samsung HD753LJ. I have two of these
drives and have had to do three replacements for various reasons in
<10 months of use. In short, I don't have a lot of confidence in the
drive, even though recent evidence seems to point elsewhere.
The problem occurs when I copy several hundred gigabytes of large
files (MythTV recordings, to be specific) to the troublesome drive
from another drive. When using a stock 2.6.30.8 kernel and xfs, the
copy eventually fails because the drive quits responding (and won't
respond again until it is power cycled). The failure doesn't always
occur at the same point in the copy, but it does always occur. Here
is a log sample of one of the failures.
Sep 29 17:59:34 tux kernel: XFS mounting filesystem sdb1
Sep 29 17:59:34 tux kernel: Ending clean XFS mount for filesystem: sdb1
Sep 29 18:32:07 tux kernel: ata2.00: exception Emask 0x0 SAct 0xffff SErr 0x0 action 0x6 frozen
Sep 29 18:32:07 tux kernel: ata2.00: cmd 61/00:00:af:02:eb/04:00:17:00:00/40 tag 0 ncq 524288 out
Sep 29 18:32:07 tux kernel: res 40/00:00:00:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
Sep 29 18:32:07 tux kernel: ata2.00: status: { DRDY }
Sep 29 18:32:07 tux kernel: ata2.00: cmd 61/00:08:af:06:eb/04:00:17:00:00/40 tag 1 ncq 524288 out
Sep 29 18:32:07 tux kernel: res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Sep 29 18:32:07 tux kernel: ata2.00: status: { DRDY }
Sep 29 18:32:07 tux kernel: ata2.00: cmd 61/00:10:af:0a:eb/04:00:17:00:00/40 tag 2 ncq 524288 out
Sep 29 18:32:07 tux kernel: res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Sep 29 18:32:07 tux kernel: ata2.00: status: { DRDY }
Sep 29 18:32:07 tux kernel: ata2.00: cmd 61/00:18:af:0e:eb/04:00:17:00:00/40 tag 3 ncq 524288 out
Sep 29 18:32:07 tux kernel: res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Sep 29 18:32:07 tux kernel: ata2.00: status: { DRDY }
Sep 29 18:32:07 tux kernel: ata2.00: cmd 61/00:20:af:12:eb/04:00:17:00:00/40 tag 4 ncq 524288 out
Sep 29 18:32:07 tux kernel: res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Sep 29 18:32:07 tux kernel: ata2.00: status: { DRDY }
Sep 29 18:32:07 tux kernel: ata2.00: cmd 61/00:28:af:16:eb/04:00:17:00:00/40 tag 5 ncq 524288 out
Sep 29 18:32:07 tux kernel: res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Sep 29 18:32:07 tux kernel: ata2.00: status: { DRDY }
Sep 29 18:32:07 tux kernel: ata2.00: cmd 61/00:30:af:da:ea/04:00:17:00:00/40 tag 6 ncq 524288 out
Sep 29 18:32:07 tux kernel: res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Sep 29 18:32:07 tux kernel: ata2.00: status: { DRDY }
Sep 29 18:32:07 tux kernel: ata2.00: cmd 61/00:38:af:de:ea/04:00:17:00:00/40 tag 7 ncq 524288 out
Sep 29 18:32:07 tux kernel: res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Sep 29 18:32:07 tux kernel: ata2.00: status: { DRDY }
Sep 29 18:32:07 tux kernel: ata2.00: cmd 61/00:40:af:e2:ea/04:00:17:00:00/40 tag 8 ncq 524288 out
Sep 29 18:32:07 tux kernel: res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Sep 29 18:32:07 tux kernel: ata2.00: status: { DRDY }
Sep 29 18:32:07 tux kernel: ata2.00: cmd 61/00:48:af:e6:ea/04:00:17:00:00/40 tag 9 ncq 524288 out
Sep 29 18:32:07 tux kernel: res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Sep 29 18:32:07 tux kernel: ata2.00: status: { DRDY }
Sep 29 18:32:07 tux kernel: ata2.00: cmd 61/00:50:af:ea:ea/04:00:17:00:00/40 tag 10 ncq 524288 out
Sep 29 18:32:07 tux kernel: res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Sep 29 18:32:07 tux kernel: ata2.00: status: { DRDY }
Sep 29 18:32:07 tux kernel: ata2.00: cmd 61/00:58:af:ee:ea/04:00:17:00:00/40 tag 11 ncq 524288 out
Sep 29 18:32:07 tux kernel: res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Sep 29 18:32:07 tux kernel: ata2.00: status: { DRDY }
Sep 29 18:32:07 tux kernel: ata2.00: cmd 61/00:60:af:f2:ea/04:00:17:00:00/40 tag 12 ncq 524288 out
Sep 29 18:32:07 tux kernel: res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Sep 29 18:32:07 tux kernel: ata2.00: status: { DRDY }
Sep 29 18:32:07 tux kernel: ata2.00: cmd 61/00:68:af:f6:ea/04:00:17:00:00/40 tag 13 ncq 524288 out
Sep 29 18:32:07 tux kernel: res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Sep 29 18:32:07 tux kernel: ata2.00: status: { DRDY }
Sep 29 18:32:07 tux kernel: ata2.00: cmd 61/00:70:af:fa:ea/04:00:17:00:00/40 tag 14 ncq 524288 out
Sep 29 18:32:07 tux kernel: res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Sep 29 18:32:07 tux kernel: ata2.00: status: { DRDY }
Sep 29 18:32:07 tux kernel: ata2.00: cmd 61/00:78:af:fe:ea/04:00:17:00:00/40 tag 15 ncq 524288 out
Sep 29 18:32:07 tux kernel: res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Sep 29 18:32:07 tux kernel: ata2.00: status: { DRDY }
Sep 29 18:32:07 tux kernel: ata2: hard resetting link
Sep 29 18:32:17 tux kernel: ata2: softreset failed (device not ready)
Sep 29 18:32:17 tux kernel: ata2: hard resetting link
Sep 29 18:32:27 tux kernel: ata2: softreset failed (device not ready)
Sep 29 18:32:27 tux kernel: ata2: hard resetting link
Sep 29 18:32:38 tux kernel: ata2: link is slow to respond, please be patient (ready=0)
Sep 29 18:33:02 tux kernel: ata2: softreset failed (device not ready)
Sep 29 18:33:02 tux kernel: ata2: limiting SATA link speed to 1.5 Gbps
Sep 29 18:33:02 tux kernel: ata2: hard resetting link
Sep 29 18:33:07 tux kernel: ata2: softreset failed (device not ready)
Sep 29 18:33:07 tux kernel: ata2: reset failed, giving up
Sep 29 18:33:07 tux kernel: ata2.00: disabled
Sep 29 18:33:07 tux kernel: ata2.00: device reported invalid CHS sector 0
Sep 29 18:33:07 tux last message repeated 15 times
Sep 29 18:33:07 tux kernel: ata2: EH complete
Sep 29 18:33:07 tux kernel: sd 1:0:0:0: [sdb] Unhandled error code
Sep 29 18:33:07 tux kernel: sd 1:0:0:0: [sdb] Result: hostbyte=0x04 driverbyte=0x00
Sep 29 18:33:07 tux kernel: end_request: I/O error, dev sdb, sector 401276591
Sep 29 18:33:07 tux kernel: sd 1:0:0:0: [sdb] Unhandled error code
Sep 29 18:33:07 tux kernel: sd 1:0:0:0: [sdb] Result: hostbyte=0x04 driverbyte=0x00
Sep 29 18:33:07 tux kernel: end_request: I/O error, dev sdb, sector 401275567
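For readers unfamiliar with libata's error output, the `cmd` lines in the log above can be decoded by hand. The Python sketch below assumes the field order printed by libata's error handler (command/feature:nsect:lbal:lbam:lbah, then the high-order bytes in the same order, then the device register). It also assumes the NCQ convention that the sector count travels in the feature registers while the tag sits in the upper five bits of nsect. The function names are made up for illustration and a zero sector count is not special-cased.

```python
import re

# Assumed layout of libata's "cmd" line:
# command/feature:nsect:lbal:lbam:lbah/hob_feature:hob_nsect:hob_lbal:hob_lbam:hob_lbah/device
CMD_RE = re.compile(
    r"cmd ([0-9a-f]{2})"
    r"/((?:[0-9a-f]{2}:){4}[0-9a-f]{2})"
    r"/((?:[0-9a-f]{2}:){4}[0-9a-f]{2})"
    r"/([0-9a-f]{2})"
)

# ATA opcodes for the two NCQ data commands.
NCQ_OPS = {0x60: "read", 0x61: "write"}


def decode_ncq_cmd(line):
    """Decode one libata 'cmd' line for an NCQ read or write."""
    m = CMD_RE.search(line)
    if m is None:
        raise ValueError("no libata cmd taskfile found in line")
    command = int(m.group(1), 16)
    if command not in NCQ_OPS:
        raise ValueError("not an NCQ read/write command")
    feature, nsect, lbal, lbam, lbah = (int(x, 16) for x in m.group(2).split(":"))
    hob_feature, _hob_nsect, hob_lbal, hob_lbam, hob_lbah = (
        int(x, 16) for x in m.group(3).split(":")
    )
    # NCQ carries the sector count in the feature registers.
    sectors = (hob_feature << 8) | feature
    # 48-bit LBA, most significant byte first.
    lba = (
        (hob_lbah << 40) | (hob_lbam << 32) | (hob_lbal << 24)
        | (lbah << 16) | (lbam << 8) | lbal
    )
    return {
        "op": NCQ_OPS[command],
        "tag": nsect >> 3,        # tag lives in bits 7:3 of nsect
        "sectors": sectors,
        "bytes": sectors * 512,   # assumes 512-byte logical sectors
        "lba": lba,
    }


def active_tags(sact):
    """Return the NCQ tags whose bits are set in an SAct bitmask."""
    return [bit for bit in range(32) if (sact >> bit) & 1]
```

Decoded this way, tag 0 is a 1024-sector (524288-byte) write starting at LBA 401277615, which agrees with the `ncq 524288 out` annotation on the same line, and `SAct 0xffff` means tags 0 through 15 were all outstanding when the port froze.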
I finally decided to give some other filesystems a try to see if
anything changed. Lo and behold, it did. Still using a stock
2.6.30.8 kernel, but with ext3, ext4 and jfs filesystems, the large
copy succeeded every time! I then decided to try a stock 2.6.31.1
kernel with xfs. It worked fine, too!
My question, now, is -- is this problem a known xfs bug that was fixed
in 2.6.31.x? I glanced through the code changes and git log and
didn't see any smoking gun. If it's not an xfs bug, does anyone know
if it might be a block driver bug (ata/ahci, in this case) that was
only tickled by xfs?
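To compare kernels and filesystems on equal terms, the large copy described above could be scripted so that every run reports how far it got before the first error. The following is only a rough Python sketch under that assumption. `copy_until_error` and its parameters are hypothetical names, and the byte count it returns reflects data handed to the kernel, not necessarily data that reached the platters.

```python
import os


def copy_until_error(src_dir, dst_dir, chunk_size=1 << 20):
    """Copy every regular file from src_dir to dst_dir, stopping at the first OSError.

    Returns (bytes_copied, failing_file_name, exception); the last two are
    None when the whole copy succeeds.
    """
    copied = 0
    for name in sorted(os.listdir(src_dir)):
        src = os.path.join(src_dir, name)
        if not os.path.isfile(src):
            continue
        dst = os.path.join(dst_dir, name)
        try:
            with open(src, "rb") as fin, open(dst, "wb") as fout:
                while True:
                    chunk = fin.read(chunk_size)
                    if not chunk:
                        break
                    fout.write(chunk)
                    copied += len(chunk)
                # Push the file out to the device so that I/O errors
                # surface here and not at some later writeback.
                fout.flush()
                os.fsync(fout.fileno())
        except OSError as exc:
            return copied, name, exc
    return copied, None, None
```

Running it once per kernel and filesystem combination, with the same source directory each time, would give a comparable "bytes before failure" figure for each case.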
David
--
David Engel
david@istwok.net
_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs
* Re: XFS/driver bug or bad drive?
2009-10-01 23:27 XFS/driver bug or bad drive? David Engel
@ 2009-10-02 0:39 ` Eric Sandeen
2009-10-02 16:57 ` David Engel
2009-10-02 8:05 ` Michael Monnerie
1 sibling, 1 reply; 8+ messages in thread
From: Eric Sandeen @ 2009-10-02 0:39 UTC (permalink / raw)
To: David Engel; +Cc: xfs
David Engel wrote:
> Hi,
>
> I've been trying to diagnose a suspected disk drive problem for about
> a week. I now think the problem might be a known (and fixed) xfs or
> driver bug, but I'm not 100% sure. I'm hoping someone here can
> confirm the problem is or isn't an xfs bug.
>
> The drive in question is a Samsung HD753LJ. I have two of these
> drives and have had to do three replacements for various reasons in
> <10 months of use. In short, I don't have a lot of confidence in the
> drive, even though recent evidence seems to point elsewhere.
>
> The problem occurs when I copy several hundred gigabytes of large
> files (MythTV recordings, to be specific) to the troublesome drive
> from another drive. When using a stock 2.6.30.8 kernel and xfs, the
> copy eventually fails because the drive quits responding (and won't
> respond again until it is power cycled). The failure doesn't always
> occur at the same point in the copy, but it does always occur. Here
> is a log sample of one of the failures.
>
> Sep 29 17:59:34 tux kernel: XFS mounting filesystem sdb1
> Sep 29 17:59:34 tux kernel: Ending clean XFS mount for filesystem: sdb1
> Sep 29 18:32:07 tux kernel: ata2.00: exception Emask 0x0 SAct 0xffff SErr 0x0 action 0x6 frozen
> Sep 29 18:32:07 tux kernel: ata2.00: cmd 61/00:00:af:02:eb/04:00:17:00:00/40 tag 0 ncq 524288 out
> Sep 29 18:32:07 tux kernel: res 40/00:00:00:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
> Sep 29 18:32:07 tux kernel: ata2.00: status: { DRDY }
...
> Sep 29 18:32:07 tux kernel: ata2: hard resetting link
> Sep 29 18:32:17 tux kernel: ata2: softreset failed (device not ready)
...
> Sep 29 18:33:07 tux kernel: ata2.00: disabled
> Sep 29 18:33:07 tux kernel: ata2.00: device reported invalid CHS sector 0
> Sep 29 18:33:07 tux last message repeated 15 times
> Sep 29 18:33:07 tux kernel: ata2: EH complete
> Sep 29 18:33:07 tux kernel: sd 1:0:0:0: [sdb] Unhandled error code
> Sep 29 18:33:07 tux kernel: sd 1:0:0:0: [sdb] Result: hostbyte=0x04 driverbyte=0x00
> Sep 29 18:33:07 tux kernel: end_request: I/O error, dev sdb, sector 401276591
> Sep 29 18:33:07 tux kernel: sd 1:0:0:0: [sdb] Unhandled error code
> Sep 29 18:33:07 tux kernel: sd 1:0:0:0: [sdb] Result: hostbyte=0x04 driverbyte=0x00
> Sep 29 18:33:07 tux kernel: end_request: I/O error, dev sdb, sector 401275567
These are all storage errors, not xfs. I suppose it could be differing
IO patterns from one fs or the other that trips it up, but nothing above
is related to an xfs bug; any xfs problems are in response to the above
IO errors, maybe a hardware problem or a driver problem, not sure - but
most likely a hardware issue I think. You might point smartctl at the
drive and see what it says.
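As a rough illustration of the smartctl suggestion, here is a small Python wrapper. It assumes smartmontools is installed, that the drive is /dev/sdb as in the log, and that the script runs with enough privilege to query the device. The helper names are hypothetical. The options shown in the usage note (-H, -A, -l error, -t long) are standard smartctl flags.

```python
import subprocess


def smartctl_command(device, *options):
    """Build the argument list for one smartctl invocation."""
    return ["smartctl", *options, device]


def run_smartctl(device, *options):
    """Run smartctl and return (exit_status, stdout).

    smartctl's exit status is a bitmask, so a non-zero value does not
    by itself mean the command failed to run; check=False keeps the
    output available in either case.
    """
    result = subprocess.run(
        smartctl_command(device, *options),
        capture_output=True,
        text=True,
        check=False,
    )
    return result.returncode, result.stdout
```

Typical calls would be `run_smartctl("/dev/sdb", "-H")` for the overall health verdict, `run_smartctl("/dev/sdb", "-A")` for the attribute table, `run_smartctl("/dev/sdb", "-l", "error")` for the drive's error log, and `run_smartctl("/dev/sdb", "-t", "long")` to start an extended self-test.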
-Eric
> I finally decided to give some other filesystems a try to see if
> anything changed. Lo and behold, it did. Still using a stock
> 2.6.30.8 kernel, but with ext3, ext4 and jfs filesystems, the large
> copy succeeded every time! I then decided to try a stock 2.6.31.1
> kernel with xfs. It worked fine, too!
>
> My question, now, is -- is this problem a known xfs bug that was fixed
> in 2.6.31.x? I glanced through the code changes and git log and
> didn't see any smoking gun. If it's not an xfs bug, does anyone know
> if it might be a block driver bug (ata/ahci, in this case) that was
> only tickled by xfs?
>
> David
* Re: XFS/driver bug or bad drive?
2009-10-01 23:27 XFS/driver bug or bad drive? David Engel
2009-10-02 0:39 ` Eric Sandeen
@ 2009-10-02 8:05 ` Michael Monnerie
1 sibling, 0 replies; 8+ messages in thread
From: Michael Monnerie @ 2009-10-02 8:05 UTC (permalink / raw)
To: xfs
On Friday 02 October 2009 David Engel wrote:
> The drive in question is a Samsung HD753LJ. I have two of these
> drives and have had to do three replacements for various reasons in
> <10 months of use.
Yes, that Samsung crap. Of the very few drives we had (the 1TB version),
all got broken quickly, and we replaced them with Hitachis. It's a pity,
as they had a nice price, but when they eat your data, it's priceless.
Best regards, zmi
--
// Michael Monnerie, Ing.BSc ----- http://it-management.at
// Tel: 0660 / 415 65 31 .network.your.ideas.
// PGP Key: "curl -s http://zmi.at/zmi.asc | gpg --import"
// Fingerprint: AC19 F9D5 36ED CD8A EF38 500E CE14 91F7 1C12 09B4
// Keyserver: wwwkeys.eu.pgp.net Key-ID: 1C1209B4
* Re: XFS/driver bug or bad drive?
2009-10-02 0:39 ` Eric Sandeen
@ 2009-10-02 16:57 ` David Engel
2009-10-07 11:29 ` Michael-John Turner
0 siblings, 1 reply; 8+ messages in thread
From: David Engel @ 2009-10-02 16:57 UTC (permalink / raw)
To: xfs, Eric Sandeen
On Thu, Oct 01, 2009 at 07:39:54PM -0500, Eric Sandeen wrote:
> These are all storage errors, not xfs. I suppose it could be
> differing IO patterns from one fs or the other that trips it up, but
> nothing above is related to an xfs bug; any xfs problems are in
> response to the above IO errors, maybe a hardware problem or a
> driver problem, not sure - but most likely a hardware issue I think.
> You might point smartctl at the drive and see what it says.
I agree it shouldn't be an xfs bug. I thought it was strange, though,
that the problem only seemed to show up with xfs on 2.6.30.x. IO
pattern sensitivity wouldn't surprise me, but I wanted to check all my
bases before giving up on the drive.
Michael Monnerie wrote:
> Yes, that Samsung crap. Of the very few drives we had (the 1TB version),
> all got broken quickly, and we replaced them with Hitachis. It's a pity,
> as they had a nice price, but when they eat your data, it's priceless.
I've used mostly Samsung drives for several years. This particular
750GB model is the only one that I consider a lemon.
David
--
David Engel
david@istwok.net
* Re: XFS/driver bug or bad drive?
2009-10-02 16:57 ` David Engel
@ 2009-10-07 11:29 ` Michael-John Turner
2009-10-07 13:24 ` Eric Sandeen
0 siblings, 1 reply; 8+ messages in thread
From: Michael-John Turner @ 2009-10-07 11:29 UTC (permalink / raw)
To: David Engel; +Cc: Eric Sandeen, xfs
On Fri, Oct 02, 2009 at 11:57:04AM -0500, David Engel wrote:
> I agree it shouldn't be an xfs bug. I thought it was strange, though,
> that the problem only seemed to show up with xfs on 2.6.30.x. IO
> pattern sensitivity wouldn't surprise me, but I wanted to check all my
> bases before giving up on the drive.
Rather curiously, I had exactly the same issue with the same model drive
this past weekend. Debian-patched 2.6.26 kernel, however, though also with
XFS (on top of md/LVM). Interestingly, there were no SMART errors and a
full SMART test passed. The error was triggered by doing a cvs update on my
working copy of the NetBSD source tree - not a large copy, but a
disk-intensive activity.
-mj
--
Michael-John Turner
mj@mjturner.net <> http://mjturner.net/
* Re: XFS/driver bug or bad drive?
2009-10-07 11:29 ` Michael-John Turner
@ 2009-10-07 13:24 ` Eric Sandeen
2009-10-07 14:04 ` Michael-John Turner
0 siblings, 1 reply; 8+ messages in thread
From: Eric Sandeen @ 2009-10-07 13:24 UTC (permalink / raw)
To: Michael-John Turner; +Cc: David Engel, xfs
Michael-John Turner wrote:
> On Fri, Oct 02, 2009 at 11:57:04AM -0500, David Engel wrote:
>> I agree it shouldn't be an xfs bug. I thought it was strange, though,
>> that the problem only seemed to show up with xfs on 2.6.30.x. IO
>> pattern sensitivity wouldn't surprise me, but I wanted to check all my
>> bases before giving up on the drive.
>
> Rather curiously, I had exactly the same issue with the same model drive
> this past weekend. Debian-patched 2.6.26 kernel, however, though also with
> XFS (on top of md/LVM). Interestingly, there were no SMART errors and a
> full SMART test passed. The error was triggered by doing a cvs update on my
> working copy of the NetBSD source tree - not a large copy, but a
> disk-intensive activity.
>
> -mj
Firmware bug? I still think it can't be an xfs problem, and I'm not
just trying to be protective of our turf ;)
-Eric
* Re: XFS/driver bug or bad drive?
2009-10-07 13:24 ` Eric Sandeen
@ 2009-10-07 14:04 ` Michael-John Turner
2009-10-07 15:20 ` David Engel
0 siblings, 1 reply; 8+ messages in thread
From: Michael-John Turner @ 2009-10-07 14:04 UTC (permalink / raw)
To: Eric Sandeen; +Cc: David Engel, xfs
On Wed, Oct 07, 2009 at 08:24:41AM -0500, Eric Sandeen wrote:
> Firmware bug? I still think it can't be an xfs problem, and I'm not
> just trying to be protective of our turf ;)
Could very well be a firmware bug - FWIW, my drive has version 1AA01109.
Some searching online suggests that others have solved similar problems by
replacing their SATA cables. I'll give that a try but, as my issue isn't
reproducible (subsequent cvs updates haven't given any problems), it'll be
difficult to know immediately if a faulty cable was the problem.
-mj
--
Michael-John Turner
mj@mjturner.net <> http://mjturner.net/
* Re: XFS/driver bug or bad drive?
2009-10-07 14:04 ` Michael-John Turner
@ 2009-10-07 15:20 ` David Engel
0 siblings, 0 replies; 8+ messages in thread
From: David Engel @ 2009-10-07 15:20 UTC (permalink / raw)
To: Michael-John Turner; +Cc: Eric Sandeen, xfs
On Wed, Oct 07, 2009 at 03:04:31PM +0100, Michael-John Turner wrote:
> On Wed, Oct 07, 2009 at 08:24:41AM -0500, Eric Sandeen wrote:
> > Firmware bug? I still think it can't be an xfs problem, and I'm not
> > just trying to be protective of our turf ;)
>
> Could very well be a firmware bug - FWIW, my drive has version 1AA01109.
A firmware bug wouldn't surprise me. FWIW, here are my firmware
versions and latest information.
The problematic drive has firmware version 1AA01110. I don't trust
that specific drive, and no longer really trust that model, so I
retired it and bought a new drive to replace it.
The other HD753LJ I have (also a replacement) has firmware version
1AA01110 too. This drive hasn't shown any problems yet, but I will be
testing it more after I finish moving some files around.
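For anyone wanting to compare firmware revisions across several drives, the identity section that `smartctl -i` prints can be parsed with a few lines of Python. This is a sketch that assumes the "Key: value" layout smartctl uses for ATA drives. `parse_identity` is a hypothetical helper, and the sample text is hand-written from the model and firmware version mentioned in this thread, not captured output.

```python
WANTED_FIELDS = ("Device Model", "Serial Number", "Firmware Version")

# Hand-written sample in the style of `smartctl -i` output; not real captured data.
SAMPLE_IDENTITY = """\
Device Model:     SAMSUNG HD753LJ
Firmware Version: 1AA01110
"""


def parse_identity(text):
    """Extract selected 'Key: value' fields from smartctl -i style output."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        key = key.strip()
        if sep and key in WANTED_FIELDS:
            fields[key] = value.strip()
    return fields
```

With the sample above, `parse_identity(SAMPLE_IDENTITY)["Firmware Version"]` yields the string `1AA01110`. Feeding it the real output for each drive makes it easy to line up firmware versions against which drives have shown the timeout.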
> Some searching online suggests that others have solved similar problems by
> replacing their SATA cables. I'll give that a try but, as my issue isn't
> reproducible (subsequent cvs updates haven't given any problems), it'll be
> difficult to know immediately if a faulty cable was the problem.
I probably ran across all of the same stuff. I tried multiple cables,
multiple SATA ports and multiple host systems. The problem was nearly
100% reproducible on the intended system. I don't know how frequent the problem
was on the second system since I stopped testing on it after the first
failure confirmed the problem was not specific to the original system.
David
--
David Engel
david@istwok.net