From: NeilBrown <neilb@suse.de>
To: Anssi Hannula <anssi.hannula@iki.fi>
Cc: linux-raid@vger.kernel.org
Subject: Re: raid6 rebuild not starting
Date: Mon, 12 Dec 2011 17:24:53 +1100
Message-ID: <20111212172453.57bee40c@notabene.brown>
In-Reply-To: <CAP+ghcLtZQeAR3n4x_8MuGJrEtkOaMWcNKHRin=A+tuDC7aZ5Q@mail.gmail.com>
On Mon, 12 Dec 2011 08:02:33 +0200 Anssi Hannula <anssi.hannula@iki.fi> wrote:
> On Mon, Dec 12, 2011 at 7:42 AM, NeilBrown <neilb@suse.de> wrote:
> > On Mon, 12 Dec 2011 07:22:17 +0200 Anssi Hannula <anssi.hannula@iki.fi> wrote:
> >
> >> On Mon, Dec 12, 2011 at 5:01 AM, NeilBrown <neilb@suse.de> wrote:
> >> > On Sun, 11 Dec 2011 09:03:14 +0200 Anssi Hannula <anssi.hannula@iki.fi> wrote:
> >> >
> >> >> Hi!
> >> >>
> >> >> After I rebooted during a raid6 rebuild, the rebuild didn't start again.
> >> >> Instead, there is a flood of "RAID conf printout"s that seemingly happen
> >> >> on array activity.
> >> >>
> >> >> All the devices show up properly in --detail and two devices are marked
> >> >> as "spare rebuilding", and I can access the contents of the array just
> >> >> fine, but the rebuild doesn't actually start. Is this a bug or am I
> >> >> missing something? :)
> >> >>
> >> >> I was initially on 2.6.38.8, but also tried 3.1.4 which seems to have
> >> >> the same issue. mdadm is 3.1.5.
> >> >>
> >> >> I'm not using start_ro and writing to the array doesn't trigger a
> >> >> rebuild either.
> >> >>
> >> >> Attached are --examine outputs before assembly, kernel log output on
> >> >> assembly, /proc/mdstat and --detail after assembly (on 3.1.4).
> >> >>
> >> >
> >> > Thank you for the very detailed problem report.
> >>
> >> Thanks for the quick response :)
> >>
> >> > Unfortunately it is a complete mystery to me what is happening.
> >> >
> >> > The repeated "RAID conf printout" messages are almost certainly coming from
> >> > the end of raid5_remove_disk.
> >> > It is being called from remove_and_add_spares for each of the two devices
> >> > that are being rebuilt. raid5_remove_disk declines to remove them because it
> >> > can keep rebuilding them.
> >> >
> >> > remove_and_add_spares then counts them and notes there are 2.
> >> > md_check_recovery notes that this is > 0, so it should create a thread to run
> >> > md_do_sync.
> >> >
> >> > md_do_sync should then print out a message like
> >> > md: recovery of RAID array md0
> >> >
> >> > but it doesn't. So something went wrong.
> >> > There are three reasons that md_do_sync might not print a message:
> >> >
> >> > 1/ MD_RECOVERY_DONE is set. As only md_do_sync ever sets it, that is
> >> > unlikely, and in any case md_check_recovery clears it.
> >> > 2/ mddev->ro != 0. It is only ever set to 0, 1, or 2. If it is 1 or 2
> >> > then we would be able to see that in /proc/mdstat as a "(readonly)"
> >> > status. But we don't.
> >> > 3/ MD_RECOVERY_INTR is set. Again, md_check_recovery clears this. It does
> >> > get set if kthread_should_stop() returns 'true', but that should only
> >> > happen if kthread_stop() was called. That is only called by
> >> > md_unregister_thread and I cannot see any way that could be called.
> >> >
> >> > So. No idea.
> >> >
> >> > Are you compiling these kernels yourself?
> >>
> >> Nope (used Mageia kernels), but I did now (3.1.5).
> >>
> >> > If so, could you:
> >> > - put a printk in the top of md_do_sync to report the values of
> >> > mddev->recovery and mddev->ro
> >> > - print a message whenever md_unregister_thread is called
> >> > - in md_check_recovery, in the
> >> > if (mddev->ro) {
> >> > /* Only thing we do on a ro array is remove
> >> > * failed devices.
> >> > */
> >> > mdk_rdev_t *rdev;
> >> >
> >> > statement, print the value of mddev->ro.
> >> >
> >> > Then see which of those printk's fire, and what they tell us.
> >>
> >> Only the last one does, and mddev->ro == 0.
> >>
> >> For reference, attached is the used patch and resulting log output.
> >>
> >
> > Thanks.
> >
> > So it isn't running md_do_sync at all. Odd.
> >
> > Could you please add:
> > - call "WARN_ON(1);" in print_raid5_conf() so we get a stack trace and can
> > see who is calling it.
> > - print the value that remove_and_add_spares is going to return.
>
> Attached. As you can see, remove_and_add_spares returns 0.
>
> --
> Anssi Hannula

Please add:

diff --git a/drivers/md/md.c b/drivers/md/md.c
index 5c95ccb..fa56ac5 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -7328,8 +7328,10 @@ static int remove_and_add_spares(mddev_t *mddev)
 			}
 		}
+	printk("degraded=%d\n", mddev->degraded);
 	if (mddev->degraded) {
 		list_for_each_entry(rdev, &mddev->disks, same_set) {
+			printk("raid_disk=%d flags=%x\n", rdev->raid_disk, rdev->flags);
 			if (rdev->raid_disk >= 0 &&
 			    !test_bit(In_sync, &rdev->flags) &&
 			    !test_bit(Faulty, &rdev->flags))

'degraded' must be 2, as dmesg contains
[ 45.544806] md/raid:md0: raid level 6 active with 8 out of 10 devices, algorithm 2
and 'degraded' is exactly the difference between '8' and '10' there.
RAID disks 3 and 7 must have In_sync and Faulty clear, as both of them just
show "spare rebuilding" in the 'detail' output.
So remove_and_add_spares "must" return 2, yet your log shows it returning 0.
Hopefully the above patch will help me understand which of those is wrong.
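
To make the expectation concrete, here is a small userspace sketch of the counting logic, not the kernel code itself. It assumes only the three conditions visible in the hunk above (raid_disk >= 0, In_sync clear, Faulty clear). The struct, the bit numbers and every model_* name are hypothetical and chosen for illustration; the real function may test more than this.

```c
#include <assert.h>
#include <stddef.h>

/* Userspace model only: these bit numbers are illustrative, not the kernel's. */
enum model_flag { MODEL_FAULTY = 0, MODEL_IN_SYNC = 1 };

struct model_rdev {
	int raid_disk;        /* slot in the array, or -1 if unassigned */
	unsigned long flags;  /* bitmask of model_flag bits */
};

static int model_test_bit(int nr, unsigned long flags)
{
	return (int)((flags >> nr) & 1UL);
}

/* 'degraded' as described above: configured devices minus active ones. */
static int model_degraded(int raid_disks, int active)
{
	return raid_disks - active;
}

/* Count devices that hold a slot and are neither in sync nor faulty. */
static int model_count_recovering(const struct model_rdev *devs, size_t n,
				  int degraded)
{
	int spares = 0;
	size_t i;

	if (!degraded)
		return 0;  /* nothing to rebuild on a complete array */
	for (i = 0; i < n; i++)
		if (devs[i].raid_disk >= 0 &&
		    !model_test_bit(MODEL_IN_SYNC, devs[i].flags) &&
		    !model_test_bit(MODEL_FAULTY, devs[i].flags))
			spares++;
	return spares;
}

/* The array from this thread: 10 slots, slots 3 and 7 still rebuilding. */
static int model_thread_example(void)
{
	struct model_rdev devs[10];
	int i;

	for (i = 0; i < 10; i++) {
		devs[i].raid_disk = i;
		devs[i].flags = (i == 3 || i == 7) ? 0UL
						   : (1UL << MODEL_IN_SYNC);
	}
	return model_count_recovering(devs, 10, model_degraded(10, 8));
}
```

Under these assumptions model_thread_example() counts two devices, which is why a return value of 0 means that either 'degraded' or the per-device flags are not what the 'detail' output suggests.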
NeilBrown