Re: Redundancy check using "echo check > sync_action": error reporting?

From: Peter Rabbitson <rabbit+list@rabbit.us>
To: linux-raid@vger.kernel.org
Subject: Re: Redundancy check using "echo check > sync_action": error reporting?
Date: Sat, 22 Mar 2008 11:03:06 +0100
Message-ID: <47E4D95A.9000505@rabbit.us>
In-Reply-To: <20080321235557.GA11801@cthulhu.home.robinhill.me.uk>

Robin Hill wrote:
> On Fri Mar 21, 2008 at 07:01:43PM -0400, Bill Davidsen wrote:
> 
>> Peter Rabbitson wrote:
>>> I was actually specifically advocating that md must _not_ do anything on 
>>> its own. Just provide the hooks to get information (what is the current 
>>> stripe state) and update information (the described repair extension). The 
>>> logic that you are describing can live only in an external app, it has no 
>>> place in-kernel.
>> So you advocate the current code being in the kernel, which absent a 
>> hardware error makes blind assumptions about which data is valid and which 
>> is not and in all cases hides the problem, instead of the code I proposed, 
>> which in some cases will be able to avoid action which is provably wrong 
>> and never be less likely to do the wrong thing than the current code?
>>
> I would certainly advocate that the current (entirely automatic) code
> belongs in the kernel whereas any code requiring user
> intervention/decision making belongs in a user process, yes.  That's not
> to say that the former should be preferred over the latter though, but
> there's really no reason to remove the in-kernel automated process until
> (or even after) a user-side repair process has been coded.

I am asserting that automatic repair is infeasible in most highly-redundant 
cases. Let's take the root raid1 of one of my busiest servers:

/dev/md0:
         Version : 00.90.03
   Creation Time : Tue Mar 20 21:58:54 2007
      Raid Level : raid1
      Array Size : 6000128 (5.72 GiB 6.14 GB)
   Used Dev Size : 6000128 (5.72 GiB 6.14 GB)
    Raid Devices : 4
   Total Devices : 4
Preferred Minor : 0
     Persistence : Superblock is persistent

     Update Time : Sat Mar 22 05:55:08 2008
           State : clean
  Active Devices : 4
Working Devices : 4
  Failed Devices : 0
   Spare Devices : 0

            UUID : b6a11a74:8b069a29:6e26228f:2ab99bd0 (local to host Arzamas)
          Events : 0.183270

As you can see, it is pretty old, and does not have many events to speak of. 
Yet every month when the automatic check is issued I get between 512 and 2048 
in mismatch_cnt. I maintain md5sums of all files on this filesystem, and there 
were no deviations for the lifetime of the array (of course there are 
mismatches after upgrades, after log appends etc., but they are all expected). 
So all I can do with this array is issue a blind repair, without even having 
a chance to find out what exactly is causing this. Yes, it is raid1 and I could 
do a 1:1 comparison to find the offending block. But how about raid10 -p 
f3? There is no way I can figure out _what_ is giving me a problem. I do not 
know if it is a hardware error (the md5 sums speak against it), some process 
with weird write patterns resulting in heavy DMA, or a bug in md itself.
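
The 1:1 comparison mentioned above for raid1 could be sketched roughly as follows. This is only an illustration, not an existing tool: the function name `find_mismatches` and its parameters are made up here, and it assumes that every member holds the mirrored data starting at offset 0 (which should hold for the 0.90 superblock format shown above, where the superblock sits at the end of the device). It reports where the copies differ, not which copy is right.

```python
import os


def find_mismatches(member_paths, length, block_size=4096):
    """Return byte offsets of blocks that are not identical on all members.

    member_paths: paths of the raid1 component devices (or image files).
    length: number of bytes of mirrored data to compare on each member.
    """
    fds = [os.open(path, os.O_RDONLY) for path in member_paths]
    mismatches = []
    try:
        for offset in range(0, length, block_size):
            size = min(block_size, length - offset)
            # Read the same region from every member.
            blocks = [os.pread(fd, size, offset) for fd in fds]
            # Flag the block if any copy differs from the first one.
            if any(block != blocks[0] for block in blocks[1:]):
                mismatches.append(offset)
    finally:
        for fd in fds:
            os.close(fd)
    return mismatches
```

For the array above, `length` would be 6000128 * 1024 bytes and `member_paths` the four component devices. On a live array the reads race with in-flight writes, so differences found this way are only meaningful if the array is quiescent.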

By the way, there is no swap file on this array. Just / and /var, with a 
moderately busy mail spool on top.

>> Currently the "repair" action (which *is* in the kernel now) takes no 
>> advantage of the additional information available in these cases I noted. 
>> By what logic do you conclude that the user meant "hide the error" when 
>> using the "repair" action? What I propose is never less likely to be 
>> correct than what the current code does, why would you not want to improve 
>> the chances of getting the repair correct?
>>
> That is, of course, a separate issue to whether it should be in-kernel.
> I would entirely agree that user-level processes should be able to
> access and manipulate the low-level RAID data/metadata (via the md
> layer) in order to facilitate more advanced repair functions, but this
> should be separate from, and in addition to, the "ignorant"
> parity-updating repair process currently in place.
> 

I am trying to convey the idea that the first step toward a userland process 
would be full disclosure of what is going on. A non-zero mismatch_cnt on a 
multi-gigabyte array makes an admin very uneasy, without giving him a chance to 
assess the situation.
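
For reference, the little that md does disclose today can be read from sysfs. The sketch below assumes the usual `/sys/block/<array>/md/` layout with the `sync_action` and `mismatch_cnt` attributes; the helper names `read_md_attr` and `check_status` are hypothetical. It yields a single counter with no indication of where on the array the mismatches are.

```python
from pathlib import Path


def read_md_attr(array, attr, sysfs_root="/sys/block"):
    """Read one md sysfs attribute, e.g. array='md0', attr='mismatch_cnt'."""
    return (Path(sysfs_root) / array / "md" / attr).read_text().strip()


def check_status(array, sysfs_root="/sys/block"):
    """Return the current sync action and the mismatch counter of an array."""
    return {
        "sync_action": read_md_attr(array, "sync_action", sysfs_root),
        "mismatch_cnt": int(read_md_attr(array, "mismatch_cnt", sysfs_root)),
    }
```

Calling `check_status("md0")` after a completed check would return something like `{"sync_action": "idle", "mismatch_cnt": 512}`, which is exactly the situation described: a number, and nothing to assess it with.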

Peter
