linux-raid.vger.kernel.org archive mirror
From: Peter Rabbitson <rabbit+list@rabbit.us>
To: linux-raid@vger.kernel.org
Subject: Re: Redundancy check using "echo check > sync_action": error reporting?
Date: Sun, 04 May 2008 09:30:02 +0200
Message-ID: <481D65FA.4090107@rabbit.us>
In-Reply-To: <47E4D95A.9000505@rabbit.us>

Peter Rabbitson wrote:
> Robin Hill wrote:
>> On Fri Mar 21, 2008 at 07:01:43PM -0400, Bill Davidsen wrote:
>>
>>> Peter Rabbitson wrote:
>>>> I was actually specifically advocating that md must _not_ do 
>>>> anything on its own. Just provide the hooks to get information (what 
>>>> is the current stripe state) and update information (the described 
>>>> repair extension). The logic that you are describing can live only 
>>>> in an external app, it has no place in-kernel.
>>> So you advocate the current code being in the kernel, which absent a 
>>> hardware error makes blind assumptions about which data is valid and 
>>> which is not and in all cases hides the problem, instead of the code 
>>> I proposed, which in some cases will be able to avoid action which is 
>>> provably wrong and never be less likely to do the wrong thing than 
>>> the current code?
>>>
>> I would certainly advocate that the current (entirely automatic) code
>> belongs in the kernel whereas any code requiring user
>> intervention/decision making belongs in a user process, yes.  That's not
>> to say that the former should be preferred over the latter though, but
>> there's really no reason to remove the in-kernel automated process until
>> (or even after) a user-side repair process has been coded.
> 
> I am asserting that automatic repair is infeasible in most 
> highly-redundant cases. Let's take the root raid1 of one of my busiest 
> servers:
> 
> /dev/md0:
>         Version : 00.90.03
>   Creation Time : Tue Mar 20 21:58:54 2007
>      Raid Level : raid1
>      Array Size : 6000128 (5.72 GiB 6.14 GB)
>   Used Dev Size : 6000128 (5.72 GiB 6.14 GB)
>    Raid Devices : 4
>   Total Devices : 4
> Preferred Minor : 0
>     Persistence : Superblock is persistent
> 
>     Update Time : Sat Mar 22 05:55:08 2008
>           State : clean
>  Active Devices : 4
> Working Devices : 4
>  Failed Devices : 0
>   Spare Devices : 0
> 
>            UUID : b6a11a74:8b069a29:6e26228f:2ab99bd0 (local to host Arzamas)
>          Events : 0.183270
> 
> As you can see it is pretty old, and does not have many events to speak 
> of. Yet every month when the automatic check is issued I get between 512 
> and 2048 in mismatch_cnt. I maintain md5sums of all files on this 
> filesystem, and there were no deviations for the lifetime of the array 
> (of course there are mismatches after upgrades, after log appends etc, 
> but they are all expected). So all I can do with this array is issue a 
> blind repair, without even having the chance to find what exactly is 
> causing this. Yes, it is raid1 and I could do 1:1 comparison to find 
> which is the offending block. How about raid10 -n f3? There is no way I 
> can figure out _what_ is giving me a problem. I do not know if it is a 
> hardware error (the md5 sums speak against it), some process with weird 
> write patterns resulting in heavy DMA, or a bug in md itself.
> 
> By the way there is no swap file on this array. Just / and /var, with a 
> moderately busy mail spool on top.
> 

I want to resurrect this discussion with a peculiar observation - the above 
mismatch was caused by GRUB.

I had some time this weekend and decided to take device snapshots of the 4 
array members listed above while / was mounted ro. After stripping the md 
superblock I ended up with the data from slots 1, 2 and 3 being identical, 
and slot 0 (my primary boot device) differing by about 10 bytes. A hex editor 
revealed that the bytes in question belong to /boot/grub/default.
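
For anyone who wants to repeat the comparison, here is a minimal sketch of
the byte-wise diff between two member images. It is not the exact procedure
used above: the function name differing_ranges and the image file names are
made up for illustration, and it assumes the snapshots were already taken
with the filesystem read-only and the md superblock stripped as described.

```python
def differing_ranges(path_a, path_b, chunk_size=1 << 20):
    """Return a list of (start, end) byte ranges, end exclusive, where the
    two files differ. A length difference is reported as a trailing range."""
    ranges = []
    start = None   # start offset of the currently open run of differing bytes
    offset = 0     # absolute offset of the chunk being compared
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        while True:
            a = fa.read(chunk_size)
            b = fb.read(chunk_size)
            if not a and not b:
                break
            if a == b:
                # Fast path: identical chunk closes any open run.
                if start is not None:
                    ranges.append((start, offset))
                    start = None
                offset += len(a)
                continue
            n = max(len(a), len(b))
            for i in range(n):
                same = i < len(a) and i < len(b) and a[i] == b[i]
                if same:
                    if start is not None:
                        ranges.append((start, offset + i))
                        start = None
                elif start is None:
                    start = offset + i
            offset += n
    if start is not None:
        ranges.append((start, offset))
    return ranges


if __name__ == "__main__":
    # Hypothetical image names; substitute your own snapshot files.
    for lo, hi in differing_ranges("slot0.img", "slot1.img"):
        print(f"bytes {lo}..{hi - 1} differ ({hi - lo} bytes)")
```

The reported offsets can then be mapped back to a file with the usual
filesystem tools; that step is not covered by this sketch.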

I realized that my grub config contains a savedefault clause, which updates 
the file on the raw ext3 volume before any raid assembly has taken place, so 
only the boot device is touched. Executing grub-set-default from within a 
booted system (with the assembled raid mounted) resulted in the subsequent md 
check returning 0 mismatches. To add insult to injury, savedefault and 
grub-set-default write said file in different ways (comments vs empty lines). 
So even if one savedefault's the same entry as the one set initially by 
grub-set-default, the result will still be a raid1 mismatch.
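
A rough sketch of re-running the check and reading the result through sysfs,
for those who prefer a script over echo and cat. The sync_action and
mismatch_cnt files are the standard md sysfs attributes; the helper names
and the default /sys/block/md0/md path are assumptions for illustration, and
writing to sync_action needs root.

```python
import time
from pathlib import Path


def start_check(md_dir):
    """Ask md to start a redundancy check (same as: echo check > sync_action)."""
    (Path(md_dir) / "sync_action").write_text("check\n")


def wait_until_idle(md_dir, poll_seconds=5.0):
    """Block until sync_action reads 'idle', i.e. no check or resync is running."""
    action = Path(md_dir) / "sync_action"
    while action.read_text().strip() != "idle":
        time.sleep(poll_seconds)


def read_mismatch_cnt(md_dir):
    """Return the mismatch count reported by the last check."""
    return int((Path(md_dir) / "mismatch_cnt").read_text().strip())


if __name__ == "__main__":
    md = "/sys/block/md0/md"   # assumed array name; adjust to your setup
    start_check(md)
    time.sleep(1.0)            # give md a moment to leave the idle state
    wait_until_idle(md)
    print("mismatch_cnt:", read_mismatch_cnt(md))
```

The one-second pause before polling is a crude guard against reading "idle"
before the check has actually started; a more careful script would confirm
that the state changed.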

I assume that this condition is benign, but wanted to bring this to the 
attention of the masses anyway.

Cheers

Peter

