From: Jonathan Woithe <jwoithe@just42.net>
To: linux1394-devel@lists.sourceforge.net
Cc: jwoithe@just42.net, stefanr@s5r6.in-berlin.de,
	linux-kernel@vger.kernel.org
Subject: Re: Silent data corruption with kernel 3.4 and FireWire disks
Date: Tue, 5 Jun 2012 09:32:39 +0930
Message-ID: <20120605000239.GA28823@marvin.atrad.com.au>
In-Reply-To: <mailman.408825.1338838130.11409.linux1394-devel@lists.sourceforge.net>

On Mon, Jun 04, 2012 at 07:28:50PM +0000, Stefan Richter wrote:
> About a week ago I noticed silent data corruptions of files on FireWire
> disks:  Mount disk, read lots of data and e.g. compute their md5sum,
> unmount disk, mount disk again, read and md5sum the same files again ->
> MD5s may differ.
> 
> Defects in files that were written in May hint that not only reading from
> but also writing to FireWire disks resulted in corrupt data.  This was
> silent corruption without any error messages from the PCI, firewire, SCSI,
> block, or filesystem subsystems.
> 
> Affected:
>   - kernel 3.4
>   - kernel 3.4-rc5
> Not affected:
>   - kernel 3.3.1 (which I have been running now for the last 6 days)
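
The check Stefan describes (checksum everything, remount, checksum again,
compare) could be scripted roughly as follows.  This is only a minimal
sketch: the function names md5_of_file, checksum_tree and differing_files
are made up for illustration, and it assumes the unmount/mount step between
the two passes is done by hand.

```python
import hashlib
import os


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of one file, read in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def checksum_tree(root):
    """Map each file's path (relative to root) to its MD5 digest."""
    sums = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            sums[os.path.relpath(full, root)] = md5_of_file(full)
    return sums


def differing_files(before, after):
    """Return sorted paths whose digest changed or that appear in only one pass."""
    return sorted(
        path
        for path in before.keys() | after.keys()
        if before.get(path) != after.get(path)
    )
```

Run checksum_tree() on the mount point, unmount and mount the disk again,
run it a second time and hand both results to differing_files().  The
remount matters because otherwise the second pass may well be served from
the page cache rather than from the disk.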

Hmm, funny you should mention this.  Over the past few months I have also
been experiencing silent corruption of a firewire disc, although I suspect
it may be for a different reason.  The corruptions started occurring soon
after I upgraded a machine to kernel 2.6.39 in May 2011.  The filesystem was
xfs, and when corruption occurred it generally took out the entire
filesystem (on repair, everything would be bundled unsorted into
lost+found).

The disc is written to once a day using rsync.

I removed the drive from its enclosure and ran various SMART tests on it
directly (the enclosure prevents SMART from operating).  The drive showed no
pre-fail signs, passed all self-tests and didn't show any problems under
badblocks tests (read-write or destructive write).

On 18 May this year I upgraded the kernel to 3.3.6 and thus far I have not
had a repeat of the corruption.  Under 2.6.39 I was usually seeing a
corruption event well within 2 weeks of recreating the filesystem, although
sometimes it took longer.  Although it's early days it seems that 3.3.6 is
so far behaving better than 2.6.39.

Combined with Stefan's observations, this would suggest that there were
issues in 2.6.39 which were absent in 3.3.x and that problems then
reappeared in 3.4.  It's the disappearance and reappearance which has me
thinking that perhaps we are seeing two different problems, one of which
has been fixed.

> FireWire disks with different 1394-to-SATA or -IDE bridge chips are
> affected.  I noticed the problem at first on an Agere FW643e PCIe 1394
> controller which sits behind a PLX PEX 8505 PCIe switch.

In my case the enclosure was one based on the Oxford Semiconductor chipset
(911?).  The drive is a PATA Western Digital 500 GB drive (00AAKB-00H8A0 - I
think from memory it's a Green drive).  The firewire card is reported to be

  VIA Technologies, Inc. IEEE 1394 Host Controller (rev 46)
    Subsystem: VIA Technologies, Inc. IEEE 1394 Host Controller

(vendor/device ID: 1106:3044, subsystem: 1106:3044).

>   - whether SATA or USB disks are affected (SATA probably not, USB not
>     used yet),

The system concerned uses SATA discs for the system drives, driven by:

  RAID bus controller: VIA Technologies, Inc. VIA VT6420 SATA RAID Controller
    (rev 80)
  Subsystem: ASUSTeK Computer Inc. A7V600/K8V Deluxe/K8V-X/A8V Deluxe
    motherboard

I have seen no corruption on these.  Once a week I am also writing to
alternating external USB2 drives (again, using rsync) and none of those have
seen this corruption either.  The USB host is reported to be

  USB Controller: VIA Technologies, Inc. USB 2.0 (rev 86) (prog-if 20
    [EHCI])
  Subsystem: ASUSTeK Computer Inc. A7V600/K8V-X/A8V Deluxe motherboard

As I said, since there seems to be a working kernel between the version I
saw which exhibited the problem and the one where Stefan experienced an
issue, it's possible that these are two different issues (one fixed, one
still lurking).  I throw the above out there in case it helps.

Regards
  jonathan

PS: I'm not subscribed to lkml, but am to ieee1394-devel.
