From: Piergiorgio Sartor <piergiorgio.sartor@nexgo.de>
To: NeilBrown <neilb@suse.de>
Cc: Piergiorgio Sartor <piergiorgio.sartor@nexgo.de>,
	Peter Rabbitson <rabbit+list@rabbit.us>,
	Goswin von Brederlow <goswin-v-b@web.de>,
	Doug Ledford <dledford@redhat.com>,
	Michael Evans <mjevans1983@gmail.com>,
	Eyal Lebedinsky <eyal@eyal.emu.id.au>,
	linux-raid list <linux-raid@vger.kernel.org>
Subject: Re: mismatch_cnt again
Date: Tue, 10 Nov 2009 20:52:22 +0100
Message-ID: <20091110195222.GA2777@lazy.lzy>
In-Reply-To: <d99ca9481d2471073484c5d43d493b4d.squirrel@neil.brown.name>

Hi again,

> It seems we might have been talking at cross-purposes.
> 
> When I wrote about the need for a threat model, it was in the
> context of automatically determining which block was most
> likely to be in error (e.g. voting with a 3-drive RAID1 or
> fancy arithmetic with RAID6).  I do not believe there is any
> value in doing that.  At least not automatically in the kernel
> with the aim of just repairing which block was decided to be
> most wrong.
> 
> You now seem to be talking about the ability to find out which
> blocks are inconsistent.  That is very different.  I do agree there
> is value in that.  Maybe it should appear in the kernel logs,
> or maybe we could store the information and report it via sysfs
> (the former would certainly be easier).

maybe there is a misunderstanding between us! :-)

Automatic repair *might* be a long-term goal, but I do
agree that this needs to be clarified thoroughly first.

I see it much as a fellow poster described in a
previous comment.
The steps would be:
1) detect which MD block is inconsistent
2) detect, when possible, which device component is responsible
3) trigger a repair action

All of this would be done under user control, i.e. the user
would get the mismatch count, maybe with some hint on which
device could be guilty (RAID-6 or RAID-1/10 with multiple
redundancy), and could then decide what to do.
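
For the RAID-1 case with three or more mirrors, the "hint" could be as
simple as a majority vote over the copies of one block. The following is
only a user-space sketch of that idea, not md code; the function name
vote_on_mirrors and the device names are made up for illustration.

```python
from collections import Counter


def vote_on_mirrors(copies):
    """Majority vote over the copies of one block.

    copies: dict mapping a (hypothetical) device name to the bytes read
    from that mirror. Returns (majority_block, suspects). When no strict
    majority exists (e.g. a 2-way mirror that disagrees), the block is
    None and every device is listed as a suspect.
    """
    counts = Counter(copies.values())
    block, votes = counts.most_common(1)[0]
    if votes * 2 <= len(copies):
        # No strict majority: the vote cannot single anybody out.
        return None, sorted(copies)
    suspects = sorted(dev for dev, data in copies.items() if data != block)
    return block, suspects


block, suspects = vote_on_mirrors({"sda1": b"abcd", "sdb1": b"abcd", "sdc1": b"abXd"})
print(suspects)
```

The result is only a hint: the majority can be wrong too, which is why
the decision would stay with the user.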

The user would have full control and full *responsibility*
for the action, but would also be fully informed about
the situation.

The system would report: block ABC is inconsistent and
device /dev/sdX may be guilty; you could do nothing, resync
the parity, or try a repair.

> I would be very happy to accept a patch which logged this
> information - providing it was careful not to overly spam the logs if there
> were lots and lots of errors.  I may even write one myself.

I could try to have a look into it, time permitting.
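
To make the "do not spam the logs" requirement concrete, here is a small
user-space sketch of rate-limited reporting: at most a fixed burst of
messages per time window, plus a summary of how many were dropped. It
only illustrates the idea; the class name RateLimitedLog, the message
format and the default limits are assumptions, and a kernel patch would
use the kernel's own rate-limiting facilities instead.

```python
import time


class RateLimitedLog:
    """Emit at most `burst` messages per `interval` seconds, count the rest."""

    def __init__(self, burst=10, interval=5.0, clock=time.monotonic, sink=print):
        self.burst = burst
        self.interval = interval
        self.clock = clock          # injectable, so the limiter can be tested
        self.sink = sink            # where messages go (print by default)
        self.window_start = clock()
        self.emitted = 0
        self.suppressed = 0

    def log(self, message):
        now = self.clock()
        if now - self.window_start >= self.interval:
            # New window: report what was dropped in the previous one.
            if self.suppressed:
                self.sink(f"{self.suppressed} mismatch messages suppressed")
            self.window_start = now
            self.emitted = 0
            self.suppressed = 0
        if self.emitted < self.burst:
            self.emitted += 1
            self.sink(message)
        else:
            self.suppressed += 1


log = RateLimitedLog(burst=3, interval=5.0)
for sector in range(0, 80, 8):
    log.log(f"md0: mismatch at sector {sector}")  # hypothetical message format
```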

[mismatch_cnt=256]
> I would probably run a 'repair' to fix the difference, but that
> isn't firm advice.  It is quite probable that the block is not
> actively in use and so the inconsistency will never be noticed.

Exactly, that's why knowing *where* the issue is
would already help a lot!
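
For reference, what md reports through sysfs is a count only, with no
location. A minimal sketch of starting a scrub and reading that count,
assuming the usual /sys/block/<array>/md/ layout and an array called md0
(the helper names are made up, and writing sync_action needs root):

```python
from pathlib import Path


def md_attr(array, name, sysfs_root="/sys/block"):
    """Path of an md sysfs attribute, e.g. /sys/block/md0/md/mismatch_cnt."""
    return Path(sysfs_root) / array / "md" / name


def read_mismatch_cnt(array, sysfs_root="/sys/block"):
    """Return the mismatch count left by the last check/repair run."""
    return int(md_attr(array, "mismatch_cnt", sysfs_root).read_text().strip())


def start_scrub(array, action="check", sysfs_root="/sys/block"):
    """Ask md to start a 'check' (report only) or 'repair' (rewrite) pass."""
    if action not in ("check", "repair"):
        raise ValueError("action must be 'check' or 'repair'")
    md_attr(array, "sync_action", sysfs_root).write_text(action + "\n")


if __name__ == "__main__" and md_attr("md0", "mismatch_cnt").exists():
    print("md0 mismatch_cnt:", read_mismatch_cnt("md0"))
```

The sysfs_root parameter exists only so the helpers can be pointed at a
scratch directory for testing.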

> check/repair is primarily about reading every block on every device,
> and being ready to cope with read errors by overwriting with the
> correct data.  This is known as scrubbing I believe.
> I would normally just 'repair' every month or so.  If there are
> discrepancies I would like them reported and fixed.  If they happen
> often on a non-swap partition, I would like to know about it, otherwise
> I would rather they were just fixed.
> 'check' largely exists because it was trivial to implement given
> that 'repair' was being implemented, and it could conceivably be useful,
> e.g. you have assembled an array read-only as you aren't at all sure the
> disks should form an array.  You run a 'check' to increase your
> confidence that all is OK without risking any change to any data in case
> you put the array together badly.

As I mentioned some time ago, I built a RAID-6 where
one disk, due to a strange cabling problem, was sometimes
returning wrong data (a single bit flip, actually).
This happened without any error being reported, i.e. a bit
was sometimes flipped, apparently at the very end of the
path, and it went undetected by ECC/CRC/whatever.

This was noticed by the "check", so I ran a "repair", which
of course did more damage...

What I did was to run a check on a read-only MD device with
each component device failed in turn (and then re-added, of course).

I was able to find the guilty disk and to fix the array
for good!

Now, this was a really lengthy process. I would have
preferred to have it done automatically and then get
a report on which device *could* be the responsible one.
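
For RAID-6 such a report is possible in principle, because with a single
corrupted block per stripe the P and Q syndromes together identify the
position of the bad block. The sketch below illustrates the arithmetic in
user space, assuming the GF(2^8) field with polynomial 0x11d and
generator 2, and assuming at most one bad device per stripe. The function
names are invented for this example and it is not the md implementation.

```python
# GF(2^8) log/antilog tables for polynomial 0x11d, generator 2.
EXP = [0] * 510
LOG = [0] * 256
_x = 1
for _i in range(255):
    EXP[_i] = EXP[_i + 255] = _x
    LOG[_x] = _i
    _x <<= 1
    if _x & 0x100:
        _x ^= 0x11D


def gf_mul(a, b):
    """Multiply two field elements using the log tables."""
    if a == 0 or b == 0:
        return 0
    return EXP[LOG[a] + LOG[b]]


def syndromes(data):
    """Compute P (plain XOR) and Q (XOR of g^i * D_i) for the data blocks."""
    n = len(data[0])
    p, q = bytearray(n), bytearray(n)
    for idx, block in enumerate(data):
        coeff = EXP[idx]  # g^idx
        for j, byte in enumerate(block):
            p[j] ^= byte
            q[j] ^= gf_mul(coeff, byte)
    return bytes(p), bytes(q)


def locate_suspect(data, p_stored, q_stored):
    """Return None if consistent, else 'P', 'Q', a data index, or 'unknown'."""
    p_calc, q_calc = syndromes(data)
    verdicts = set()
    for pc, ps, qc, qs in zip(p_calc, p_stored, q_calc, q_stored):
        dp, dq = pc ^ ps, qc ^ qs
        if dp == 0 and dq == 0:
            continue                      # this byte position is consistent
        if dq == 0:
            verdicts.add("P")             # only P disagrees
        elif dp == 0:
            verdicts.add("Q")             # only Q disagrees
        else:
            z = (LOG[dq] - LOG[dp]) % 255  # dq = g^z * dp for a bad data disk z
            verdicts.add(z if z < len(data) else "unknown")
    if not verdicts:
        return None
    # Different byte positions must agree, otherwise more than one
    # device is involved and no single suspect can be named.
    return verdicts.pop() if len(verdicts) == 1 else "unknown"


stripe = [b"\x01\x02", b"\x10\x20", b"\x0a\x0b", b"\xff\x00"]
p, q = syndromes(stripe)
stripe[2] = bytes([stripe[2][0] ^ 0x04, stripe[2][1]])  # single bit flip
print(locate_suspect(stripe, p, q))
```

The answer is a hint, not proof: if two devices are wrong in the same
stripe, the arithmetic can still point at a single, innocent one.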

I agree with you that an automatic repair would not
have been the right choice without first knowing what
was going on.

> drivers/md/raid1.c for RAID1
> drivers/md/raid5.c for RAID4/RAID5/RAID6
> 
> Look for where the resync_mismatches field is updated.

Thanks, I'll try to have a look!

bye,

-- 

piergiorgio
