From: Peter Rabbitson <rabbit+list@rabbit.us>
To: linux-raid@vger.kernel.org
Subject: Re: Redundancy check using "echo check > sync_action": error reporting?
Date: Sun, 04 May 2008 09:30:02 +0200
Message-ID: <481D65FA.4090107@rabbit.us>
In-Reply-To: <47E4D95A.9000505@rabbit.us>
Peter Rabbitson wrote:
> Robin Hill wrote:
>> On Fri Mar 21, 2008 at 07:01:43PM -0400, Bill Davidsen wrote:
>>
>>> Peter Rabbitson wrote:
>>>> I was actually specifically advocating that md must _not_ do
>>>> anything on its own. Just provide the hooks to get information (what
>>>> is the current stripe state) and update information (the described
>>>> repair extension). The logic that you are describing can live only
>>>> in an external app, it has no place in-kernel.
>>> So you advocate the current code being in the kernel, which absent a
>>> hardware error makes blind assumptions about which data is valid and
>>> which is not and in all cases hides the problem, instead of the code
>>> I proposed, which in some cases will be able to avoid action which is
>>> provably wrong and never be less likely to do the wrong thing than
>>> the current code?
>>>
>> I would certainly advocate that the current (entirely automatic) code
>> belongs in the kernel whereas any code requiring user
>> intervention/decision making belongs in a user process, yes. That's not
>> to say that the former should be preferred over the latter though, but
>> there's really no reason to remove the in-kernel automated process until
>> (or even after) a user-side repair process has been coded.
>
> I am asserting that automatic repair is infeasible in most
> highly-redundant cases. Lets take the root raid1 of one of my busiest
> servers:
>
> /dev/md0:
> Version : 00.90.03
> Creation Time : Tue Mar 20 21:58:54 2007
> Raid Level : raid1
> Array Size : 6000128 (5.72 GiB 6.14 GB)
> Used Dev Size : 6000128 (5.72 GiB 6.14 GB)
> Raid Devices : 4
> Total Devices : 4
> Preferred Minor : 0
> Persistence : Superblock is persistent
>
> Update Time : Sat Mar 22 05:55:08 2008
> State : clean
> Active Devices : 4
> Working Devices : 4
> Failed Devices : 0
> Spare Devices : 0
>
> UUID : b6a11a74:8b069a29:6e26228f:2ab99bd0 (local to host
> Arzamas)
> Events : 0.183270
>
> As you can see it is pretty old, and does not have many events to speak
> of. Yet every month when the automatic check is issued I get between 512
> and 2048 in mismatch_cnt. I maintain md5sums of all files on this
> filesystem, and there were no deviations for the lifetime of the array
> (of course there are mismatches after upgrades, after log appends etc,
> but they are all expected). So all I can do with this array is issue a
> blind repair, without even having the chance to find what exactly is
> causing this. Yes, it is raid1 and I could do 1:1 comparison to find
> which is the offending block. How about raid10 -n f3? There is no way I
> can figure out _what_ is giving me a problem. I do not know if it is a
> hardware error (the md5 sums speak against it), some process with weird
> write patterns resulting in heavy DMA, or a bug in md itself.
>
> By the way there is no swap file on this array. Just / and /var, with a
> moderately busy mail spool on top.
>
I want to resurrect this discussion with a peculiar observation: the above
mismatch was caused by GRUB.

I had some time this weekend and decided to take device snapshots of the 4
array members listed above while / was mounted ro. After stripping the md
superblock, the data from slots 1, 2 and 3 turned out to be identical, while
slot 0 (my primary boot device) differed by about 10 bytes. Hex editing
revealed that the bytes in question belong to /boot/grub/default.

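For anyone who wants to repeat the comparison, here is a minimal sketch of
the byte-level diff. It assumes the member snapshots have already been saved
as ordinary image files with the md superblock stripped; the function names
and the image file names in the usage note are hypothetical, not something I
actually ran in this form.

```python
from collections import Counter


def diff_offsets(paths, chunk_size=1 << 20):
    """Yield (offset, column) for every byte position where the images disagree.

    column holds one byte value per image (None if that image is shorter).
    """
    files = [open(p, "rb") for p in paths]
    try:
        offset = 0
        while True:
            chunks = [f.read(chunk_size) for f in files]
            if not any(chunks):
                break  # every image is exhausted
            longest = max(len(c) for c in chunks)
            # Only walk the chunk byte by byte if it differs somewhere.
            if any(c != chunks[0] for c in chunks[1:]):
                for i in range(longest):
                    column = [c[i] if i < len(c) else None for c in chunks]
                    if any(b != column[0] for b in column[1:]):
                        yield offset + i, column
            offset += longest
    finally:
        for f in files:
            f.close()


def outliers(column):
    """Return the indexes of the images that deviate from the majority value."""
    majority, _ = Counter(column).most_common(1)[0]
    return [i for i, b in enumerate(column) if b != majority]
```

Calling diff_offsets(["slot0.img", "slot1.img", "slot2.img", "slot3.img"])
and passing each reported column to outliers() shows which slot is the odd
one out at each offset. The majority vote only makes sense for a mirror with
more than two members.
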
I realized that my grub config contains a savedefault clause, which updates
the file on the raw ext3 volume before any raid assembly has taken place.
Executing grub-set-default from within a booted system (with the raid
assembled and mounted) resulted in the subsequent md check returning 0
mismatches. To add insult to injury, savedefault and grub-set-default update
said file in different ways (comments vs empty lines). So even if one
savedefault's the same entry as the one set initially by grub-set-default,
the result will still be a raid1 mismatch.

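To confirm the same thing on another array, a check can be triggered and
mismatch_cnt read back through sysfs. The following is only a sketch: it
assumes the array is md0 (so the md attributes live under
/sys/block/md0/md), that it runs as root, and the helper names are made up
for illustration.

```python
import time
from pathlib import Path


def start_check(md_dir="/sys/block/md0/md"):
    """Same effect as: echo check > sync_action"""
    (Path(md_dir) / "sync_action").write_text("check\n")


def wait_until_idle(md_dir="/sys/block/md0/md", poll_seconds=5.0):
    """Poll sync_action until the array reports that it is idle again."""
    sync_action = Path(md_dir) / "sync_action"
    while sync_action.read_text().strip() != "idle":
        time.sleep(poll_seconds)


def mismatch_count(md_dir="/sys/block/md0/md"):
    """Return the value left in mismatch_cnt by the last check."""
    return int((Path(md_dir) / "mismatch_cnt").read_text())
```

Calling start_check(), then wait_until_idle(), then mismatch_count() in that
order gives the figure discussed above. In my case it reads 0 after
grub-set-default has rewritten the file through the assembled array.
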
I assume that this condition is benign, but wanted to bring this to the
attention of the masses anyway.
Cheers
Peter