Re: 3.12: raid-1 mismatch_cnt question

From: joystick <joystick@shiftmail.org>
To: Justin Piszcz <jpiszcz@lucidpixels.com>
Cc: 'linux-raid' <linux-raid@vger.kernel.org>
Subject: Re: 3.12: raid-1 mismatch_cnt question
Date: Fri, 15 Nov 2013 09:51:32 +0100
Message-ID: <5285E094.2050607@shiftmail.org>
In-Reply-To: <007301cee15e$178969e0$469c3da0$@lucidpixels.com>

On 14/11/2013 18:22, Justin Piszcz wrote:
>
> -----Original Message-----
> From: joystick [mailto:joystick@shiftmail.org]
> Sent: Thursday, November 14, 2013 11:09 AM
> To: Justin Piszcz
> Cc: 'Bernd Schubert'; 'linux-raid'
> Subject: Re: 3.12: raid-1 mismatch_cnt question
>
> [ .. ]
>
>>> At the end of the procedure (like now, if you didn't resync or repair in
>>> the meanwhile) is mismatch_cnt still so high?
> After a reboot, I ran the check and yes it was still high.
>
> [ .. ]
>
>>> no, not that one...
>>> it would be helpful to know the kernel version that *creates*
>>> mismatches, the one that you have running normally on the live system.
> Version: 3.12.0 (and typically always use the latest)

ok

> That's the "bugged" one, supposing this is really a bug (until we find
> where the mismatches are, it's difficult to say wether this is a data
> loss or not)
>
>>> Maybe the mismatched are located ext4 metadata areas which are not files
>>> and so can't be seen with md5sums... That would still be as much
>>> worrisome, unless some expert of ext4 can tell that it's ok (it can be
>>> OK if the region with mismatches is an old metadata area, currently
>>> unused; the mechanism that can create harmless mismatches in this case
>>> has been described by Neil)
> If that is what is occurring, is it possible to exclude them from mismatch_cnt?

Not possible, unfortunately.

But your mismatch_cnt is exceptionally high, which is unlikely to come
from the described mechanism. Other people usually have zero even after
months of operation; I have zero, for example.


> [ .. ]
>
> - First confirm that mismatch_cnt is still high..
> It was 0 after reboot.

Above you wrote that after the procedure you rebooted, then did a check,
and it was still high. Can you guess when it got repaired?


> [ .. ]
>
>
> - Then if this does not disrupt your system operation too much, i would
> suggest to fill 95% of free space with a zeroes file like you did in
> earlier tests. Otherwise for a mismatch happening in non-file area we
> won't be sure of what kind of area is that. Maybe recompute mismatch_cnt
> after this.
>
> Create file up to 95% utilization on /root:
> /dev/root       219G  205G   12G  95% /
>
> Re-check:
> # echo check > /sys/devices/virtual/block/md1/md/sync_action
> # cat /sys/devices/virtual/block/md1/md/mismatch_cnt
> 27520

?????
You mean that mismatch_cnt was zero, then you created a big file full of 
zeroes and after that mismatch_cnt jumped to 27520 ??
I believe this should not happen, especially not with the harmless 
mechanism explained by Neil, and this narrows the bug quite a lot.
If you confirm I understood correctly, can you retry this a couple of
times? Delete the zeroes file, repair the RAID so that mismatch_cnt goes
to zero, check to confirm that mismatch_cnt is zero, create a file full
of zeroes, check again... did mismatch_cnt jump to a high value?
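
A rough way to script the cycle described above (delete the zeroes file,
repair, check, refill, check again) is sketched below. It is only an
illustration: MD_DIR, ZERO_FILE, FILL_MB, POLL_SECS and the function
names are assumptions, not anything from this thread, and it has to run
as root on an otherwise quiet array.

```shell
#!/bin/sh
# Sketch only: one repair / check / fill / check cycle for a RAID-1 array.
# MD_DIR, ZERO_FILE, FILL_MB and POLL_SECS are hypothetical knobs.
MD_DIR=${MD_DIR:-/sys/block/md1/md}
ZERO_FILE=${ZERO_FILE:-/zeroes.bin}

# Block until md reports that no sync action is running.
wait_idle() {
    while [ "$(cat "$MD_DIR/sync_action")" != "idle" ]; do
        sleep "${POLL_SECS:-10}"
    done
}

# Start "check" or "repair" and wait for it to finish.
run_action() {
    echo "$1" > "$MD_DIR/sync_action"
    sleep 1   # give md a moment to start before polling
    wait_idle
}

# Print the mismatch count left by the most recent check or repair.
mismatches() {
    cat "$MD_DIR/mismatch_cnt"
}

# One full cycle; prints the count before and after writing the zero file.
retest_cycle() {
    rm -f "$ZERO_FILE"
    run_action repair
    run_action check
    echo "before fill: $(mismatches)"
    # FILL_MB caps the file size; pick it so the filesystem ends up near 95% full.
    dd if=/dev/zero of="$ZERO_FILE" bs=1M count="${FILL_MB:-1024}" 2>/dev/null
    sync
    run_action check
    echo "after fill: $(mismatches)"
}
```

Calling retest_cycle a few times in a row would show whether the
"after fill" count jumps every time or only occasionally.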

If reproducing the bug is this easy, you might want to try an earlier
kernel such as 3.0.101 and re-test with that.
If earlier kernels do not have the bug, it becomes relatively easy to
find when it was introduced, maybe without even knowing where the
mismatches are located.


> then, copypasting the procedure with some modifications:
> ----
> ... to determine the location of mismatches (...)
> Unfortunately I don't think MD tells you the location of mismatches
> directly. Do you want to try the following:
> /sys/block/md1/md/sync_min and /sys/block/md1/md/sync_max should allow
> you to narrow the region of the next check.
> Set them, then perform check, then cat mismatch_cnt.
> Narrow progressively sync_min and sync_max so that you identify the most
> dense areas of mismatches, or a few single blocks that mismatch.
> When you have identified some regions or isolated blocks, invoke "sync"
> from bash and then check again the same region a couple of times so to
> be sure that it stays mismatched and it's not just a transient situation.
> Then try with debugfs (in readonly mode can be used with fs mounted):
> there should be an option to get the inode number from a block number of
> the device... I hope that block numbers are not offset by MD... I think
> it's icheck and after that you might need "find -inum <inode_number>"
> launched on the same filesystem to find the corresponding filename from
> the inode number. That should be the file that contains the mismatch.
> [ .. ]
> When I do this, the speed of check thereafter is very slow:
>
> Personalities : [raid1]
> md1 : active raid1 sdc2[0] sdb2[1]
>        233381376 blocks [2/2] [UU]
>        [>....................]  check =  0.0% (4500/233381376) finish=80387.9min speed=48K/sec (55 days)
>
> The speed continues to decrease when the sync_min is set to 1000 and sync_max is 9000 (this won't work).

Are you running the "find" simultaneously with the "check"?
Check priority is rather low, so I understand why it would slow down if
you are also doing "find". Otherwise... it seems like another bug.
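
The sync_min / sync_max narrowing and the debugfs lookup quoted above
could be sketched as follows. Everything named here (MD_DIR and the four
functions) is hypothetical, and the sketch assumes the filesystem sits
directly on /dev/md1. As far as I understand the kernel md
documentation, a check pauses when it reaches sync_max instead of
completing, so the window has to be ended by hand by writing "idle".

```shell
#!/bin/sh
# Sketch only: check a window of the array, then map a sector to an fs block.
# MD_DIR and all function names are hypothetical.
MD_DIR=${MD_DIR:-/sys/block/md1/md}

# Limit the next check to sectors $1..$2 (512-byte units) and start it.
check_window() {
    echo max  > "$MD_DIR/sync_max"   # widen first so the new sync_min is accepted
    echo "$1" > "$MD_DIR/sync_min"
    echo "$2" > "$MD_DIR/sync_max"
    echo check > "$MD_DIR/sync_action"
}

# Stop the check, restore the full range, and print the mismatch count.
end_window() {
    echo idle > "$MD_DIR/sync_action"
    echo 0    > "$MD_DIR/sync_min"
    echo max  > "$MD_DIR/sync_max"
    cat "$MD_DIR/mismatch_cnt"
}

# Convert a 512-byte sector number ($1) to a filesystem block for block size $2.
sector_to_fsblock() {
    echo $(( $1 * 512 / $2 ))
}

# Ask debugfs (read-only unless -w is given) which inode owns block $2 on device $1.
block_owner() {
    debugfs -R "icheck $2" "$1"
}
```

For example, "sector_to_fsblock 9000 4096" prints 1125, and
"block_owner /dev/md1 1125" would then report the owning inode, which
debugfs "ncheck" or "find -inum" can turn into a file name. The 4096
block size is an assumption; "dumpe2fs -h /dev/md1" reports the real
one.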

