Re: OT: silent data corruption reading from hard drives

From: Phil Turmel <philip@turmel.org>
To: listy@fastmail.fm
Cc: linux-raid@vger.kernel.org
Subject: Re: OT: silent data corruption reading from hard drives
Date: Fri, 03 Aug 2012 09:36:38 -0400
Message-ID: <501BD3E6.6050004@turmel.org>
In-Reply-To: <1343932322.14845.140661109934445.709DABBB@webmail.messagingengine.com>

Hi Matt,

I now see that I hit the wrong "reply" button; my apologies to the list.
You've quoted the important stuff, though, so I won't resend.

On 08/02/2012 02:32 PM, listy@fastmail.fm wrote:
> On Thu, Aug 2, 2012, at 13:33, Phil Turmel wrote:
>> You really do need to have a process check mismatch_cnt after your
>> weekly check completes.
> 
> 
> With Fedora, I get an email Monday morning, after the raid-check, which 
> warns of a non-zero mismatch_cnt.

Good to know.  I'm on Gentoo, and I use my own script in logwatch, so
I'm not familiar with the various distros' practices on this.
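
As an illustration only (this is not the logwatch script mentioned
above), a minimal sketch of such a check in Python.  It assumes the md
driver exposes one counter per array at /sys/block/mdN/md/mismatch_cnt;
the function names and the sysfs_root parameter are hypothetical and
exist only so the logic can be exercised against a scratch directory.

```python
import glob
import os


def read_mismatch_counts(sysfs_root="/sys"):
    """Return {array_name: mismatch_cnt} for every md array found."""
    pattern = os.path.join(sysfs_root, "block", "md*", "md", "mismatch_cnt")
    counts = {}
    for path in sorted(glob.glob(pattern)):
        # path looks like <root>/block/md0/md/mismatch_cnt; take "md0".
        array = os.path.basename(os.path.dirname(os.path.dirname(path)))
        with open(path) as f:
            counts[array] = int(f.read().strip())
    return counts


def arrays_with_mismatches(counts):
    """Return the sorted names of arrays whose counter is non-zero."""
    return sorted(name for name, count in counts.items() if count != 0)


if __name__ == "__main__":
    for name in arrays_with_mismatches(read_mismatch_counts()):
        print(f"WARNING: {name} has a non-zero mismatch_cnt")
```

Run from cron or a log-summary hook after the weekly check has
finished, this prints one warning line per affected array and nothing
otherwise.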

>> Depends.  If you use "repair", bad data will be propagated.  If you use
>> "check", it'll just be reported.
> 
> 
> Ah, okay, good.  I thought I'd read here a while back that "check" & 
> "repair" do the same thing.
> 
> 
>> I've seen a great deal of good advice here, but nothing about the system
>> component least likely to be protected in an "economy" system:
>> RAM.  Does your Mobo have ECC ram?  
> 
> Good point.  It does not.  Might be time for me to upgrade to a mobo with 
> ECC support.

In my opinion, any corruption noticed in a non-ECC system is most likely
due to the RAM.  You really need to run memtest86 on your system,
preferably for 24 hours or more.

>> does your kernel support logging, and are you monitoring the
>> machine check log?
> 
> klogd is not running, but I think the latest rsyslog handles the kernel 
> messages.  There was nothing in the logs related to my corruption issues, 
> however.

I meant logging of ECC RAM correction events (warnings) and
uncorrectable errors.  Your kernel has to support that.  I would be
shocked if Fedora didn't support it.  You also need the user space
"mcelog" package ("mce" stands for "Machine Check Exception").
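
Separately from mcelog, and only as a hedged sketch: on systems where a
kernel EDAC driver is loaded for the memory controller (an assumption,
it is not mentioned above), corrected and uncorrected error counts are
exposed as ce_count and ue_count under /sys/devices/system/edac/mc/mcN/.
The function name and the sysfs_root parameter below are hypothetical.

```python
import glob
import os


def read_edac_counts(sysfs_root="/sys"):
    """Return {controller: (corrected, uncorrected)} from EDAC counters."""
    pattern = os.path.join(sysfs_root, "devices", "system", "edac", "mc", "mc*")
    counts = {}
    for mc_dir in sorted(glob.glob(pattern)):
        try:
            with open(os.path.join(mc_dir, "ce_count")) as f:
                corrected = int(f.read().strip())
            with open(os.path.join(mc_dir, "ue_count")) as f:
                uncorrected = int(f.read().strip())
        except (OSError, ValueError):
            # Skip entries that are missing or do not hold a number.
            continue
        counts[os.path.basename(mc_dir)] = (corrected, uncorrected)
    return counts


if __name__ == "__main__":
    for name, (ce, ue) in read_edac_counts().items():
        print(f"{name}: corrected={ce} uncorrected={ue}")
```

An empty result means no EDAC memory controller was found, which is
what a non-ECC board would be expected to show.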

>> Hard drives write extensive ECC payloads to catch corruptions there;
>> SATA and SAS protocols have CRC checks on every frame transferred; the
>> PCIe bus uses CRC checks on each lane, with low-level encoding very
>> similar to SATA.  Even modern processors are using PCIe-style encoded
> 
> Thanks, this is good info, and kind of gets at my thinking when I posted my 
> initial question.  In a typical consumer hardware setup, with a current 
> linux kernel, do I have to take any steps to enable these kinds of checks?
> Can the kernel log any failed checks at the levels you mention?  I guess my 
> confusion with my silent data corruption issues stems from my naive 
> assumption that all the various data transfers happening would have some 
> way of detecting or flagging the bad reads as they happened.

You won't get RAM corruption error reports if you don't have ECC RAM.
Data transfer errors between CPU and chipset might generate machine
check exceptions, but if not recoverable, the machine just dies.  Errors
on PCIe lanes and SATA/SAS connections cause retransmissions until
success or the driver times out.  Those link errors and timeouts would
show up in dmesg.
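
To make the retransmit-until-success behaviour concrete, here is a toy
model in Python.  It is a sketch under stated assumptions, not how any
real SATA or PCIe driver is written: zlib.crc32 stands in for the link
CRC, the 4-byte trailer layout is invented, and make_frame, check_frame
and transfer are hypothetical names.

```python
import zlib


def make_frame(payload: bytes) -> bytes:
    """Append a 4-byte CRC-32 of the payload (toy frame layout)."""
    return payload + zlib.crc32(payload).to_bytes(4, "big")


def check_frame(frame: bytes):
    """Return the payload if its CRC matches, otherwise None."""
    payload = frame[:-4]
    received_crc = int.from_bytes(frame[-4:], "big")
    return payload if zlib.crc32(payload) == received_crc else None


def transfer(payload: bytes, link, max_tries: int = 3):
    """Send payload through `link`, resending whenever the CRC check fails.

    `link` is any callable that takes a frame and returns the frame as
    received, possibly damaged.  Returns (payload, attempts_used).
    """
    for attempt in range(1, max_tries + 1):
        received = check_frame(link(make_frame(payload)))
        if received is not None:
            return received, attempt
    # Models the driver giving up after repeated failures.
    raise TimeoutError(f"no clean frame after {max_tries} tries")
```

The point of the model is that a damaged frame is never handed upward:
it is either resent until a clean copy arrives or the transfer fails
loudly.  Corruption that happens in non-ECC RAM, after the check has
passed, is outside what such link-level checks can see.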

> But maybe as you suggest, my issue is related to memory, and ECC might help 
> in the future?

You don't have to guess.  Boot into memtest86 and see.  And yes, any
machine handling data you really care about should have ECC RAM.

Phil