From: NeilBrown <neilb@suse.de>
To: Anssi Hannula <anssi.hannula@iki.fi>
Cc: linux-raid@vger.kernel.org
Subject: Re: raid6 rebuild not starting
Date: Mon, 12 Dec 2011 16:42:40 +1100
Message-ID: <20111212164240.01e8d1fb@notabene.brown>
In-Reply-To: <CAP+ghcLLkAbPVy8LO4Hq6ncNr3=tvGGHB1tJ9Wih=Cjbty+zuw@mail.gmail.com>
On Mon, 12 Dec 2011 07:22:17 +0200 Anssi Hannula <anssi.hannula@iki.fi> wrote:
> On Mon, Dec 12, 2011 at 5:01 AM, NeilBrown <neilb@suse.de> wrote:
> > On Sun, 11 Dec 2011 09:03:14 +0200 Anssi Hannula <anssi.hannula@iki.fi> wrote:
> >
> >> Hi!
> >>
> >> After I rebooted during a raid6 rebuild, the rebuild didn't start again.
> >> Instead, there is a flood of "RAID conf printout"s that seemingly happen
> >> on array activity.
> >>
> >> All the devices show up properly in --detail and two devices are marked
> >> as "spare rebuilding", and I can access the contents of the array just
> >> fine, but the rebuild doesn't actually start. Is this a bug or am I
> >> missing something? :)
> >>
> >> I was initially on 2.6.38.8, but also tried 3.1.4 which seems to have
> >> the same issue. mdadm is 3.1.5.
> >>
> >> I'm not using start_ro and writing to the array doesn't trigger a
> >> rebuild either.
> >>
> >> Attached are --examine outputs before assembly, kernel log output on
> >> assembly, /proc/mdstat and --detail after assembly (on 3.1.4).
> >>
> >
> > Thank you for the very detailed problem report.
>
> Thanks for the quick response :)
>
> > Unfortunately it is a complete mystery to me what is happening.
> >
> > The repeated "RAID conf printout" messages are almost certainly coming from
> > the end of raid5_remove_disk.
> > It is being called from remove_and_add_spares for each of the two devices
> > that are being rebuilt. raid5_remove_disk declines to remove them because it
> > can keep rebuilding them.
> >
> > remove_and_add_spares then counts them and notes there are 2.
> > md_check_recovery notes that this is > 0, so it should create a thread to run
> > md_do_sync.
> >
> > md_do_sync should then print out a message like
> > md: recovery of RAID array md0
> >
> > but it doesn't. So something went wrong.
> > There are three reasons that md_do_sync might not print a message:
> >
> > 1/ MD_RECOVERY_DONE is set. As only md_do_sync ever sets it, that is
> >    unlikely, and in any case md_check_recovery clears it.
> > 2/ mddev->ro != 0. It is only ever set to 0, 1, or 2. If it is 1 or 2
> >    then we would be able to see that in /proc/mdstat as a "(readonly)"
> >    status. But we don't.
> > 3/ MD_RECOVERY_INTR is set. Again, md_check_recovery clears this. It does
> >    get set if kthread_should_stop() returns 'true', but that should only
> >    happen if kthread_stop() was called. That is only called by
> >    md_unregister_thread and I cannot see any way that could be called.
> >
> > So. No idea.
> >
> > Are you compiling these kernels yourself?
>
> Nope (used Mageia kernels), but I did now (3.1.5).
>
> > If so, could you:
> > - put a printk in the top of md_do_sync to report the values of
> > mddev->recovery and mddev->ro
> > - print a message whenever md_unregister_thread is called
> > - in md_check_recovery, in the
> >       if (mddev->ro) {
> >               /* Only thing we do on a ro array is remove
> >                * failed devices.
> >                */
> >               mdk_rdev_t *rdev;
> >
> >   if statement, print the value of mddev->ro.
> >
> > Then see which of those printk's fire, and what they tell us.
>
> Only the last one does, and mddev->ro == 0.
>
> For reference, attached is the used patch and resulting log output.
>
Thanks.
So it isn't running md_do_sync at all. Odd.
Could you please add:
 - a call to "WARN_ON(1);" in print_raid5_conf() so we get a stack trace and can
   see who is calling it.
 - a printk of the value that remove_and_add_spares is going to return.
Thanks,
NeilBrown
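
The decision chain described in the quoted text (count the devices still being rebuilt, start a sync thread if the count is > 0, then let md_do_sync bail out for one of three reasons) can be modelled in user space roughly as below. This is only a sketch under stated assumptions, not the md driver's code: every model_* name, the struct layouts and the flag bit values are hypothetical and chosen for illustration, and the order of the three checks simply follows the numbered list in the message.

```c
#include <stdbool.h>
#include <stddef.h>

/* Illustrative flag bits for this model only; not taken from kernel headers. */
#define MODEL_RECOVERY_INTR (1UL << 0)
#define MODEL_RECOVERY_DONE (1UL << 1)

/* Hypothetical, stripped-down stand-in for the array state. */
struct model_mddev {
    unsigned long recovery; /* bitmask of MODEL_RECOVERY_* */
    int ro;                 /* 0 = read-write, 1 or 2 = read-only states */
};

/* Hypothetical stand-in for one member device. */
struct model_rdev {
    bool in_sync; /* device is fully up to date */
    bool faulty;  /* device has failed */
};

enum model_sync_outcome {
    MODEL_SYNC_STARTS,        /* would print "md: recovery of RAID array ..." */
    MODEL_SYNC_SKIP_DONE,     /* reason 1: DONE flag already set */
    MODEL_SYNC_SKIP_READONLY, /* reason 2: ro != 0 */
    MODEL_SYNC_SKIP_INTR      /* reason 3: INTR flag set */
};

/* Count devices that are neither in sync nor failed, i.e. still rebuilding. */
int model_count_rebuilding(const struct model_rdev *rdevs, size_t n)
{
    int count = 0;
    for (size_t i = 0; i < n; i++)
        if (!rdevs[i].in_sync && !rdevs[i].faulty)
            count++;
    return count;
}

/*
 * Model of the check step: clear DONE and INTR (the message says
 * md_check_recovery clears both) and report whether a sync thread
 * would be created, i.e. whether the rebuilding count is > 0.
 */
bool model_check_recovery(struct model_mddev *mddev,
                          const struct model_rdev *rdevs, size_t n)
{
    mddev->recovery &= ~(MODEL_RECOVERY_DONE | MODEL_RECOVERY_INTR);
    return model_count_rebuilding(rdevs, n) > 0;
}

/* Model of the three reasons the sync step might exit without a message. */
enum model_sync_outcome model_do_sync(const struct model_mddev *mddev)
{
    if (mddev->recovery & MODEL_RECOVERY_DONE)
        return MODEL_SYNC_SKIP_DONE;
    if (mddev->ro != 0)
        return MODEL_SYNC_SKIP_READONLY;
    if (mddev->recovery & MODEL_RECOVERY_INTR)
        return MODEL_SYNC_SKIP_INTR;
    return MODEL_SYNC_STARTS;
}
```

With two rebuilding devices and ro == 0, as reported in this thread, the model says a sync thread should be created and the sync step should start. The observed behaviour does not match, which is why the next step is to find out who is calling print_raid5_conf() and what remove_and_add_spares returns.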