From: Piergiorgio Sartor <piergiorgio.sartor@nexgo.de>
To: NeilBrown <neilb@suse.de>
Cc: Piergiorgio Sartor <piergiorgio.sartor@nexgo.de>,
Peter Rabbitson <rabbit+list@rabbit.us>,
Goswin von Brederlow <goswin-v-b@web.de>,
Doug Ledford <dledford@redhat.com>,
Michael Evans <mjevans1983@gmail.com>,
Eyal Lebedinsky <eyal@eyal.emu.id.au>,
linux-raid list <linux-raid@vger.kernel.org>
Subject: Re: mismatch_cnt again
Date: Tue, 10 Nov 2009 20:52:22 +0100
Message-ID: <20091110195222.GA2777@lazy.lzy>
In-Reply-To: <d99ca9481d2471073484c5d43d493b4d.squirrel@neil.brown.name>
Hi again,
> It seems we might have been talking at cross-purposes.
>
> When I wrote about the need for a threat model, it was in the
> context of automatically determining which block was most
> likely to be in error (e.g. voting with a 3-drive RAID1 or
> fancy arithmetic with RAID6). I do not believe there is any
> value in doing that. At least not automatically in the kernel
> with the aim of just repairing which block was decided to be
> most wrong.
>
> You now seem to be talking about the ability to find out which
> blocks are inconsistent. That is very different. I do agree there
> is value in that. Maybe it should appear in the kernel logs,
> or maybe we could store the information and report it via sysfs
> (the former would certainly be easier).
Maybe there is a misunderstanding between us! :-)
Automatic repair *might* be a long-term goal, but I
agree that it needs to be thought through carefully first.
I see this much as a fellow poster put it in an earlier
comment.
To do:
1) detect which MD block is inconsistent
2) detect, when possible, which device component is responsible
3) trigger a repair action
All of this would be under user control, i.e. the user
gets the mismatch count, maybe with a hint about which
device could be guilty (RAID-6, or RAID-1/10 with multiple
redundancy), and then decides what to do.
The user has full control over, and full *responsibility*
for, the action, but is also fully informed about the
situation.
The system would say: block ABC is inconsistent, maybe
device /dev/sdX is guilty; you could do nothing, resync
the parity, or try to repair.
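For the RAID-1 case with more than two copies, the hint could
come from a simple majority vote. The following is only a
user-space sketch of that idea, not md code: vote_mirrors() and
the device names are hypothetical, and a real implementation
would work on sectors read from the member devices.

```python
from collections import Counter


def vote_mirrors(copies):
    """Return (majority_data, suspects) for the copies of one block.

    copies maps a device name to the bytes read from that device.
    suspects lists the devices whose copy differs from the majority.
    majority_data is None when no copy has a strict majority, in
    which case every device is returned as a suspect.
    """
    counts = Counter(copies.values())
    data, votes = counts.most_common(1)[0]
    if votes * 2 <= len(copies):
        # No strict majority (e.g. a 2-way mirror that disagrees):
        # nobody can be singled out.
        return None, sorted(copies)
    return data, sorted(dev for dev, blk in copies.items() if blk != data)


# Hypothetical 3-way mirror where one copy differs.
data, suspects = vote_mirrors({"sda": b"good", "sdb": b"good", "sdc": b"g0od"})
print(suspects)
```

With only two mirrors the vote cannot decide, which matches the
"when possible" in point 2) above.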
> I would be very happy to accept a patch which logged this
> information - providing it was careful not to overly spam the
> logs if there were lots and lots of errors. I may even write
> one myself.
I could try to have a look into it, time permitting.
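One way to avoid spamming the logs is to report only the first
few mismatches of a pass and then a single summary line. The
sketch below illustrates that policy in user space; it is not
the kernel's logging mechanism, and the class name, the burst
limit and the message format are all assumptions.

```python
class MismatchLog:
    """Keep the first `burst` mismatch reports of a pass, count the rest."""

    def __init__(self, burst=10):
        self.burst = burst
        self.lines = []
        self.suppressed = 0

    def report(self, sector):
        """Record one mismatch, or just count it once the burst is used up."""
        if len(self.lines) < self.burst:
            self.lines.append(f"md: mismatch at sector {sector}")
        else:
            self.suppressed += 1

    def finish(self):
        """Return the logged lines plus a summary of what was suppressed."""
        if self.suppressed:
            return self.lines + [
                f"md: {self.suppressed} further mismatches not logged"
            ]
        return list(self.lines)


log = MismatchLog(burst=2)
for sector in (8, 16, 24, 32):
    log.report(sector)
print("\n".join(log.finish()))
```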
[mismatch_cnt=256]
> I would probably run a 'repair' to fix the difference, but that
> isn't firm advice. It is quite probable that the block is not
> actively in use and so the inconsistency will never be noticed.
Exactly, that's why knowing *where* the issue is
would already help a lot!
> check/repair is primarily about reading every block on every device,
> and being ready to cope with read errors by overwriting with the
> correct data. This is known as scrubbing I believe.
> I would normally just 'repair' every month or so. If there are
> discrepancies I would like them reported and fixed. If they happen
> often on a non-swap partition, I would like to know about it, otherwise
> I would rather they were just fixed.
> 'check' largely exists because it was trivial to implement given
> that 'repair' was being implemented, and it could conceivably be useful,
> e.g. you have assembled an array read-only as you aren't at all sure the
> disks should form an array. You run a 'check' to increase your
> confidence that all is OK without risking any change to any data in case
> you put the array together badly.
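For reference, a 'check' pass is started and its result read
through the md sysfs attributes sync_action and mismatch_cnt
under /sys/block/<array>/md/. A minimal sketch, assuming a
hypothetical array name such as "md0"; the sysfs argument exists
only so the helpers can be tried against a scratch directory,
and writing to the real file needs root.

```python
from pathlib import Path


def md_attr(array, name, sysfs="/sys/block"):
    """Path of one md sysfs attribute, e.g. /sys/block/md0/md/sync_action."""
    return Path(sysfs) / array / "md" / name


def start_check(array, sysfs="/sys/block"):
    """Ask md to start a 'check' pass on the array."""
    md_attr(array, "sync_action", sysfs).write_text("check\n")


def mismatch_count(array, sysfs="/sys/block"):
    """Read the current mismatch count of the array."""
    return int(md_attr(array, "mismatch_cnt", sysfs).read_text())
```

The count is only meaningful once the pass has finished, i.e.
once sync_action reads back as "idle".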
As I mentioned some time ago, I built a RAID-6 in which
one disk, due to a strange cabling problem, sometimes
returned wrong data (a single bit flip, actually).
This happened without any error being reported, i.e. a bit
was sometimes flipped, at the very end it seems, and
ECC/CRC/whatever did not detect it.
The "check" noticed this, so I ran a "repair", which
of course did more damage...
What I did was to run a check on the MD device, assembled
read-only, with one component after the other failed (and
then re-added, of course).
That way I was able to find the guilty disk and fix the
array for good!
Now, this was a really lengthy process; I would have
preferred to have it done automatically, followed by
a report on which device *could* be responsible.
I agree with you that an automatic repair would not
have been the right choice without first knowing what
was going on.
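As an illustration of the kind of hint meant here: with RAID-6,
if exactly one block of a stripe is wrong, comparing the stored
P and Q with the ones recomputed from the data points at the
suspect. The sketch below is user-space Python, not md code. It
assumes the usual RAID-6 arithmetic (GF(2^8) with polynomial
0x11d and generator 2, Q being the sum of g^i * D_i); the
function names are hypothetical. The result is only a hint,
since two or more bad blocks can look like a single bad one.

```python
# GF(2^8) log/antilog tables for polynomial 0x11d, generator 2.
EXP = [0] * 255
LOG = [0] * 256
x = 1
for i in range(255):
    EXP[i] = x
    LOG[x] = i
    x <<= 1
    if x & 0x100:
        x ^= 0x11d


def gf_mul(a, b):
    """Multiply two field elements using the log tables."""
    if a == 0 or b == 0:
        return 0
    return EXP[(LOG[a] + LOG[b]) % 255]


def syndromes(data):
    """Compute the P and Q blocks for a list of equal-length data blocks."""
    p = bytearray(len(data[0]))
    q = bytearray(len(data[0]))
    for idx, block in enumerate(data):
        coeff = EXP[idx]  # g**idx
        for j, byte in enumerate(block):
            p[j] ^= byte
            q[j] ^= gf_mul(coeff, byte)
    return bytes(p), bytes(q)


def locate(data, p, q):
    """Guess which single block of a stripe is wrong.

    Returns None if the stripe is consistent, "P" or "Q" if only
    that parity block disagrees, the index of the suspect data
    block, or "unknown" if the pattern does not fit one bad block.
    """
    p2, q2 = syndromes(data)
    dp = [a ^ b for a, b in zip(p, p2)]
    dq = [a ^ b for a, b in zip(q, q2)]
    if not any(dp) and not any(dq):
        return None
    if not any(dq):
        return "P"
    if not any(dp):
        return "Q"
    suspects = set()
    for a, b in zip(dp, dq):
        if a == 0 and b == 0:
            continue
        if a == 0 or b == 0:
            return "unknown"
        # A bad data block z gives dq = g**z * dp, so z = log(dq) - log(dp).
        suspects.add((LOG[b] - LOG[a]) % 255)
    if len(suspects) == 1:
        z = suspects.pop()
        if z < len(data):
            return z
    return "unknown"
```

Applied stripe by stripe during a check, the suspect that keeps
coming up would be the device to look at first.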
> drivers/md/raid1.c for RAID1
> drivers/md/raid5.c for RAID4/RAID5/RAID6
>
> Look for where the resync_mismatches field is updated.
Thanks, I'll try to have a look!
bye,
--
piergiorgio