From: Jonathan Woithe <jwoithe@just42.net>
To: linux1394-devel@lists.sourceforge.net
Cc: jwoithe@just42.net, stefanr@s5r6.in-berlin.de,
	linux-kernel@vger.kernel.org
Subject: Re: Silent data corruption with kernel 3.4 and FireWire disks
Date: Tue, 5 Jun 2012 09:32:39 +0930
Message-ID: <20120605000239.GA28823@marvin.atrad.com.au>
In-Reply-To: <mailman.408825.1338838130.11409.linux1394-devel@lists.sourceforge.net>

On Mon, Jun 04, 2012 at 07:28:50PM +0000, Stefan Richter wrote:
> About a week ago I noticed silent data corruptions of files on FireWire
> disks: Mount disk, read lots of data and e.g. compute their md5sum,
> unmount disk, mount disk again, read and md5sum the same files again ->
> MD5s may differ.
>
> Defects in files that were written in May hint that not only reading from
> but also writing to FireWire disks resulted in corrupt data. This was
> silent corruption without any error messages from the PCI, firewire, SCSI,
> block, or filesystem subsystems.
>
> Affected:
> - kernel 3.4
> - kernel 3.4-rc5
> Not affected:
> - kernel 3.3.1 (which I have been running now for the last 6 days)

Hmm, funny you should mention this. Over the past few months I have also
been experiencing silent corruption of a firewire disc, although I suspect
it may be for a different reason. The corruptions started occurring soon
after I upgraded a machine to kernel 2.6.39 in May 2011. The filesystem was
xfs, and when corruption occurred it generally took out the entire
filesystem (on repair, everything would be bundled unsorted into
lost+found).

The disc is written to once a day using rsync.

I removed the drive from its enclosure and ran various SMART tests on it
directly (the enclosure prevents SMART from operating). The drive showed no
pre-fail signs, passed all self-tests and didn't show any problems under
badblocks tests (read-write or destructive write).

On 18 May this year I upgraded the kernel to 3.3.6 and thus far I have not
had a repeat of the corruption. Under 2.6.39 I was usually seeing a
corruption event well within 2 weeks of recreating the filesystem, although
sometimes it took longer. Although it's early days, 3.3.6 seems so far to
be behaving better than 2.6.39.

Combined with Stefan's observations, this would indicate that there were
issues with 2.6.39 which weren't present in 3.3.x and then reappeared in
3.4. It's the disappearance and reappearance which has me thinking that
perhaps we are seeing two different problems, one of which has been fixed.

> FireWire disks with different 1394-to-SATA or -IDE bridge chips are
> affected. I noticed the problem at first on an Agere FW643e PCIe 1394
> controller which sits behind a PLX PEX 8505 PCIe switch.

In my case the enclosure was one based on the Oxford Semiconductor chipset
(911?). The drive is a PATA Western Digital 500 GB drive (00AAKB-00H8A0 - I
think from memory it's a Green drive). The firewire card is reported to be

  VIA Technologies, Inc. IEEE 1394 Host Controller (rev 46)
    Subsystem: VIA Technologies, Inc. IEEE 1394 Host Controller

(vendor/device ID: 1106:3044, subsystem: 1106:3044).

> - whether SATA or USB disks are affected (SATA probably not, USB not
> used yet),

The system concerned uses SATA discs for the system drives, driven by:

  RAID bus controller: VIA Technologies, Inc. VIA VT6420 SATA RAID Controller (rev 80)
    Subsystem: ASUSTeK Computer Inc. A7V600/K8V Deluxe/K8V-X/A8V Deluxe motherboard

I have seen no corruption on these. Once a week I am also writing to
alternating external USB2 drives (again, using rsync) and none of those have
seen this corruption either. The USB host is reported to be

  USB Controller: VIA Technologies, Inc. USB 2.0 (rev 86) (prog-if 20 [EHCI])
    Subsystem: ASUSTeK Computer Inc. A7V600/K8V-X/A8V Deluxe motherboard

As I said, since there seem to be working kernels (3.3.x) between the
version on which I saw the problem (2.6.39) and the one on which Stefan
experienced an issue (3.4), it's possible that these are two different
issues (one fixed, one still lurking). I throw the above out there in case
it helps.
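
For reference, the check Stefan describes (checksum the files, remount,
checksum them again, compare) could be scripted roughly as below. This is
only a sketch under assumptions: the function names are made up for
illustration and are not from any existing tool, and the umount/mount step
between the two passes is left to be done by hand.

```python
import hashlib
import os


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of one file, read in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def hash_tree(root):
    """Map each regular file under root (as a relative path) to its MD5."""
    result = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            # Skip symlinks and special files; only plain files are hashed.
            if os.path.isfile(full) and not os.path.islink(full):
                result[os.path.relpath(full, root)] = md5_of_file(full)
    return result


def differing_files(before, after):
    """Return sorted paths whose digest differs or which exist in only one pass."""
    paths = set(before) | set(after)
    return sorted(p for p in paths if before.get(p) != after.get(p))
```

Usage would be along the lines of before = hash_tree("/mnt/fw"), then
unmounting and mounting the disk again so that the second pass is read from
the disk rather than from the page cache, then
differing_files(before, hash_tree("/mnt/fw")). Here /mnt/fw is just a
placeholder mount point. An empty list means both passes agreed.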
Regards
jonathan
PS: I'm not subscribed to lkml, but am to ieee1394-devel.