Re: XFS/driver bug or bad drive?

From: Eric Sandeen <sandeen@sandeen.net>
To: David Engel <david@istwok.net>
Cc: xfs@oss.sgi.com
Subject: Re: XFS/driver bug or bad drive?
Date: Thu, 01 Oct 2009 19:39:54 -0500
Message-ID: <4AC54BDA.20806@sandeen.net>
In-Reply-To: <20091001232759.GA12832@opus.istwok.net>

David Engel wrote:
> Hi,
> 
> I've been trying to diagnose a suspected disk drive problem for about
> a week.  I now think the problem might be a known (and fixed) xfs or
> driver bug, but I'm not 100% sure.  I'm hoping someone here can
> confirm the problem is or isn't an xfs bug.
> 
> The drive in question is a Samsung HD753LJ.  I have two of these
> drives and have had to do three replacements for various reasons in
> <10 months of use.  In short, I don't have a lot of confidence in the
> drive, even though recent evidence seems to point elsewhere.
> 
> The problem occurs when I copy several hundred gigabytes of large
> files (MythTV recordings, to be specific) to the troublesome drive
> from another drive.  When using a stock 2.6.30.8 kernel and xfs, the
> copy eventually fails because the drive quits responding (and won't
> respond again until it is power cycled).  The failure doesn't always
> occur at the same point in the copy, but it does always occur.  Here
> is a log sample of one of the failures.
> 
> Sep 29 17:59:34 tux kernel: XFS mounting filesystem sdb1
> Sep 29 17:59:34 tux kernel: Ending clean XFS mount for filesystem: sdb1
> Sep 29 18:32:07 tux kernel: ata2.00: exception Emask 0x0 SAct 0xffff SErr 0x0 action 0x6 frozen
> Sep 29 18:32:07 tux kernel: ata2.00: cmd 61/00:00:af:02:eb/04:00:17:00:00/40 tag 0 ncq 524288 out
> Sep 29 18:32:07 tux kernel:          res 40/00:00:00:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
> Sep 29 18:32:07 tux kernel: ata2.00: status: { DRDY }
...
> Sep 29 18:32:07 tux kernel: ata2: hard resetting link
> Sep 29 18:32:17 tux kernel: ata2: softreset failed (device not ready)
...

> Sep 29 18:33:07 tux kernel: ata2.00: disabled
> Sep 29 18:33:07 tux kernel: ata2.00: device reported invalid CHS sector 0
> Sep 29 18:33:07 tux last message repeated 15 times
> Sep 29 18:33:07 tux kernel: ata2: EH complete
> Sep 29 18:33:07 tux kernel: sd 1:0:0:0: [sdb] Unhandled error code
> Sep 29 18:33:07 tux kernel: sd 1:0:0:0: [sdb] Result: hostbyte=0x04 driverbyte=0x00
> Sep 29 18:33:07 tux kernel: end_request: I/O error, dev sdb, sector 401276591
> Sep 29 18:33:07 tux kernel: sd 1:0:0:0: [sdb] Unhandled error code
> Sep 29 18:33:07 tux kernel: sd 1:0:0:0: [sdb] Result: hostbyte=0x04 driverbyte=0x00
> Sep 29 18:33:07 tux kernel: end_request: I/O error, dev sdb, sector 401275567

These are all storage errors, not xfs errors.  Differing IO patterns
from one filesystem to another might be what trips it up, but nothing
above points to an xfs bug; any xfs problems you see are a response to
these IO errors.  The cause could be hardware or a driver, I'm not
sure, but I think a hardware issue is most likely.  You might point
smartctl at the drive and see what it says.
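
To show how the log above reads as a block-layer failure rather than a
filesystem one, here is a rough Python sketch that decodes the quoted
"cmd" line and pulls out the failed sectors.  The helper names
(decode_ncq_taskfile, failed_sectors) are made up for illustration,
and the field order is an assumption based on how libata prints the
taskfile (cmd/feature:nsect:lbal:lbam:lbah/hob_*.../device); treat it
as a reading aid, not a diagnostic tool.

```python
import re

# Assumed libata layout:
# cmd/feature:nsect:lbal:lbam:lbah/hob_feature:hob_nsect:hob_lbal:hob_lbam:hob_lbah/device
FIVE_BYTES = r"((?:[0-9a-f]{2}:){4}[0-9a-f]{2})"
TASKFILE_RE = re.compile(r"cmd ([0-9a-f]{2})/" + FIVE_BYTES + "/" + FIVE_BYTES + r"/([0-9a-f]{2})")
SECTOR_RE = re.compile(r"I/O error, dev (\w+), sector (\d+)")


def decode_ncq_taskfile(line):
    """Return command, starting LBA and sector count for an NCQ 'cmd' log line."""
    m = TASKFILE_RE.search(line)
    if m is None:
        return None
    command = int(m.group(1), 16)
    feature, _nsect, lbal, lbam, lbah = (int(b, 16) for b in m.group(2).split(":"))
    hob_feature, _hob_nsect, hob_lbal, hob_lbam, hob_lbah = (
        int(b, 16) for b in m.group(3).split(":")
    )
    # 48-bit LBA: the "hob" (high order byte) registers hold bits 24..47.
    lba = (hob_lbah << 40) | (hob_lbam << 32) | (hob_lbal << 24) \
        | (lbah << 16) | (lbam << 8) | lbal
    # For NCQ commands (0x60 read, 0x61 write) the sector count lives in
    # the feature registers; a value of 0 means 65536 sectors.
    sectors = ((hob_feature << 8) | feature) or 65536
    return {"command": command, "lba": lba, "sectors": sectors}


def failed_sectors(log_text):
    """Return (device, sector) pairs from 'end_request: I/O error' lines."""
    return [(dev, int(sec)) for dev, sec in SECTOR_RE.findall(log_text)]


cmd_line = "ata2.00: cmd 61/00:00:af:02:eb/04:00:17:00:00/40 tag 0 ncq 524288 out"
errors = (
    "end_request: I/O error, dev sdb, sector 401276591\n"
    "end_request: I/O error, dev sdb, sector 401275567\n"
)
decoded = decode_ncq_taskfile(cmd_line)
print(decoded)                  # command 0x61, lba 401277615, sectors 1024
print(decoded["sectors"] * 512) # 524288, matching "ncq 524288 out"
print(failed_sectors(errors))
```

The decoded write starts at sector 401277615 and covers 1024 sectors,
and the two failed sectors reported afterwards sit exactly 1024 and
2048 sectors below it, so they look like neighbouring queued writes
that were aborted when the drive stopped responding.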

-Eric

> I finally decided to give some other filesystems a try to see if
> anything changed.  Lo and behold it did.  Still using a stock
> 2.6.30.8 kernel, but with ext3, ext4 and jfs filesystems, the large
> copy succeeded every time!  I then decided to try a stock 2.6.31.1
> kernel with xfs.  It worked fine, too!
> 
> My question, now, is -- is this problem a known xfs bug that was fixed
> in 2.6.31.x?  I glanced through the code changes and git log and
> didn't see any smoking gun.  If it's not an xfs bug, does anyone know
> if it might be a block driver bug (ata/ahci, in this case) that was
> only tickled by xfs?
> 
> David
