From: Roger Heflin <rogerheflin@gmail.com>
To: Jon@eHardcastle.com
Cc: Goswin von Brederlow <goswin-v-b@web.de>, linux-raid@vger.kernel.org
Subject: Re: Fw: Why does one get mismatches?
Date: Sun, 24 Jan 2010 15:52:36 -0600
Message-ID: <4B5CC124.8010302@gmail.com>
In-Reply-To: <65698.36235.qm@web51306.mail.re2.yahoo.com>

Jon Hardcastle wrote:
> --- On Fri, 22/1/10, Goswin von Brederlow <goswin-v-b@web.de> wrote:
>
>> From: Goswin von Brederlow <goswin-v-b@web.de>
>> Subject: Re: Fw: Why does one get mismatches?
>> To: Jon@eHardcastle.com
>> Cc: linux-raid@vger.kernel.org
>> Date: Friday, 22 January, 2010, 18:13
>> Jon Hardcastle <jd_hardcastle@yahoo.com>
>> writes:
>>
>>> --- On Tue, 19/1/10, Jon Hardcastle <jd_hardcastle@yahoo.com>
>> wrote:
>>>> From: Jon Hardcastle <jd_hardcastle@yahoo.com>
>>>> Subject: Why does one get mismatches?
>>>> To: linux-raid@vger.kernel.org
>>>> Date: Tuesday, 19 January, 2010, 10:04
>>>> Hi,
>>>>
>>>> I kicked off a check/repair cycle on my machine
>> after i
>>>> moved the phyiscal ordering of my drives around
>> and I am now
>>>> on my second check/repair cycle and it has kept
>> finding
>>>> mismatches.
>>>>
>>>> Is it correct that the mismatch value after a
>> repair was
>>>> needed should equal the value present after a
>> check? What if
>>>> it doesn't? What does it mean if another check
>> STILL reveals
>>>> mismatches?
>>>>
>>>> I had something similar after i reshaped from raid
>> 5 to 6 i
>>>> had to run check/repair/check/repair several times
>> before i
>>>> got my 0.
>>>>
>>>>
>>> Guys,
>>>
>>> Anyone got any suggestions here? I am now on my ~5
>> check/repair and after a reboot the first check is still
>> returning 8.
>>> All i have done is move the drives around. It is the
>> same controllers/cables/etc
>>> I really dont like the seeming random nature of what
>> can/does/has caused the mismatches?
>>
>> There is some unknown corruption going on with raid1 that
>> causes
>> mismatches but it is believed that it will never occur on
>> any used
>> block. Swapping is a likely cause.
>>
>> Any swap device on the raid? Try turning that off.
>> If that doesn't help try umounting filesystems or
>> remounting RO.
>>
>> MfG
>> Goswin
>
> Hello, my usual savior Goswin!
>
> The deal is it is a 7 drive raid 6 array. it has LVM on it and is not used for swapping. I have umounted all LV's and still got mismatches, i run smartctl --test=long on all drives - nothing. I have now dismantled the array and am 3/4 the way through 'badblocks -svn' on each of the component drive. I have a hunch that it may be a dodgy SATA cable but have no evidence. No errors in log, nothing on dmesg.
>
> Is there any way to get more information? I am starting to think this is more happened since i changed from raid 5 to 6..... which i did < 1 month ago.
>
> The only lead i have is that whilst doing the bad blocks 1 drive ran at ~10~15MB/s whereas the rest are going at ~30 i have another identical model drive coming up so i will see if that one is slow too. But the lack of logging info is not helpful and worrying! and the prospect of silent corruption a big worry!
>
It is possible that reads are occasionally being corrupted.
I have seen a couple of different controllers fail in a way that
corrupts reads. The test is basically this: put 50 or so largish files
with the same checksum on the disk (50 x size needs to be more than 2x
RAM, so the reads are not served from cache), cksum all of the files
repeatedly, and see if any cksum changes. If it does, the "bad" file
will move around from run to run, so in this case the data on disk
should be ok. I have seen a couple of different companies' controllers
fail this way, usually from a bad PCI interface chip or a bad config
(too fast) causing PCI parity errors. In one case the controller was
simply broken and caused errors (replacing it with a spare corrected
it). In the second case I found that the MB was running the PCI bus
too fast for the number of cards (two different companies' FC cards
failed, each in a slightly different way: one silently corrupted data,
the other crashed the machine about the time an error would have been
expected). I had to slow the bus down one step (PCIX-133 -> PCIX-100,
or PCIX-100 -> PCIX-66) and the issue went away.
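
The repeated-checksum test described above could be scripted roughly as
follows. This is only a sketch under a few assumptions: the names
checksum_file and find_unstable_files are made up for illustration, and it
uses CRC32 from Python's zlib module rather than the cksum utility. The
total size of the files still has to be well above RAM, otherwise later
passes are answered from the page cache and never touch the controller.

```python
import zlib


def checksum_file(path, chunk_size=1 << 20):
    """Return the CRC32 of a file, reading it in 1 MiB chunks."""
    crc = 0
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            crc = zlib.crc32(chunk, crc)
    return crc


def find_unstable_files(paths, passes=3, checksum=checksum_file):
    """Checksum every file `passes` times and return the files whose
    checksum was not the same on every pass.

    `checksum` is a parameter only so the function can be exercised
    without real hardware faults; it is an illustrative hook, not part
    of any existing tool.
    """
    seen = {}
    for _ in range(passes):
        for p in paths:
            seen.setdefault(str(p), set()).add(checksum(p))
    return sorted(name for name, sums in seen.items() if len(sums) > 1)
```

If find_unstable_files returns a different file on each run while the
files themselves have not been written to, that points at the read path
(controller, bus, cable) rather than at the data on disk.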

In both cases I did not find any write corruption, but found read
corruption often. If this is happening with a raid5 device, it would be
bad if you ever had to use parity: a corrupt read would mean
regenerated parity would be wrong, and a restore from parity would lead
to corrupted data.
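
To make the raid5 concern concrete, here is a toy model of XOR parity
over three small data blocks. It is a sketch for illustration only, not
how md lays out or handles stripes; the helper names xor_blocks and
rebuild_missing and the block contents are invented for the example.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together (RAID5-style parity)."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def rebuild_missing(surviving_blocks, parity):
    """Reconstruct the single missing data block from survivors plus parity."""
    return xor_blocks(list(surviving_blocks) + [parity])


data = [b"AAAA", b"BBBB", b"CCCC"]
parity = xor_blocks(data)

# Clean case: block 1 is lost and recovered correctly.
assert rebuild_missing([data[0], data[2]], parity) == data[1]

# A corrupt read: one bit of block 0 flips on the way in from the disk.
bad_read = bytes([data[0][0] ^ 0x01]) + data[0][1:]

# Case 1: parity regenerated from the corrupt read no longer matches
# the data that is actually on disk.
bad_parity = xor_blocks([bad_read, data[1], data[2]])
parity_is_wrong = bad_parity != parity

# Case 2: block 1 is rebuilt using the corrupt read of block 0, so the
# flipped bit ends up in the "restored" data.
rebuilt = rebuild_missing([bad_read, data[2]], parity)
rebuild_is_wrong = rebuilt != data[1]
```

In both cases the on-disk data was fine to begin with; the damage comes
entirely from trusting a bad read.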

I don't know how strong the internal SATA communication is. If it uses
CRCs, undetected errors on the cable are almost impossible; if it uses
parity, undetected errors are easy. The PCI bus uses parity, so it is
pretty easy for errors to get through, but I have only seen them very
rarely, maybe 5 times in 10,000 machine-years of operation (2000+
machines for several years).
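
As a rough illustration of why parity is so much weaker than a CRC, the
sketch below flips two bits in a buffer and checks whether each scheme
notices. It is a simplification: a single parity bit over the whole
buffer stands in for per-transfer bus parity, and zlib's CRC32 stands in
for whatever CRC a given link actually uses. The parity_bit helper and
the test data are made up for the example.

```python
import zlib


def parity_bit(data):
    """Single parity bit over `data`: 1 if the number of 1-bits is odd."""
    return sum(bin(byte).count("1") for byte in data) % 2


original = bytes(range(64))

# Flip two bits "in transit".
corrupted = bytearray(original)
corrupted[3] ^= 0x10
corrupted[40] ^= 0x02
corrupted = bytes(corrupted)

# Any even number of flipped bits leaves the parity bit unchanged.
parity_detects = parity_bit(original) != parity_bit(corrupted)

# CRC32 catches this double-bit error.
crc_detects = zlib.crc32(original) != zlib.crc32(corrupted)
```

Here parity_detects comes out False and crc_detects comes out True: the
two-bit error slips past parity but not past the CRC.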