Linux RAID subsystem development
From: NeilBrown <neilb@suse.de>
To: Anssi Hannula <anssi.hannula@iki.fi>
Cc: linux-raid@vger.kernel.org
Subject: Re: raid6 rebuild not starting
Date: Mon, 12 Dec 2011 16:42:40 +1100	[thread overview]
Message-ID: <20111212164240.01e8d1fb@notabene.brown> (raw)
In-Reply-To: <CAP+ghcLLkAbPVy8LO4Hq6ncNr3=tvGGHB1tJ9Wih=Cjbty+zuw@mail.gmail.com>

On Mon, 12 Dec 2011 07:22:17 +0200 Anssi Hannula <anssi.hannula@iki.fi> wrote:

> On Mon, Dec 12, 2011 at 5:01 AM, NeilBrown <neilb@suse.de> wrote:
> > On Sun, 11 Dec 2011 09:03:14 +0200 Anssi Hannula <anssi.hannula@iki.fi> wrote:
> >
> >> Hi!
> >>
> >> After I rebooted during a raid6 rebuild, the rebuild didn't start again.
> >> Instead, there is a flood of "RAID conf printout"s that seemingly happen
> >> on array activity.
> >>
> >> All the devices show up properly in --detail and two devices are marked
> >> as "spare rebuilding", and I can access the contents of the array just
> >> fine, but the rebuild doesn't actually start. Is this a bug or am I
> >> missing something? :)
> >>
> >> I was initially on 2.6.38.8, but also tried 3.1.4 which seems to have
> >> the same issue. mdadm is 3.1.5.
> >>
> >> I'm not using start_ro and writing to the array doesn't trigger a
> >> rebuild either.
> >>
> >> Attached are --examine outputs before assembly, kernel log output on
> >> assembly, /proc/mdstat and --detail after assembly (on 3.1.4).
> >>
> >
> > Thank you for the very detailed problem report.
> 
> Thanks for the quick response :)
> 
> > Unfortunately it is a complete mystery to me what is happening.
> >
> > The repeated "RAID conf printout" messages are almost certainly coming from
> > the end of raid5_remove_disk.
> > It is being called from remove_and_add_spares for each of the two devices
> > that are being rebuilt.  raid5_remove_disk declines to remove them because it
> > can keep rebuilding them.
> >
> > remove_and_add_spares then counts them and notes there are 2.
> > md_check_recovery notes that this is > 0, so it should create a thread to run
> > md_do_sync.
> >
> > md_do_sync should then print out a message like
> >  md: recovery of RAID array md0
> >
> > but it doesn't.  So something went wrong.
> > There are three reasons that md_do_sync might not print a message:
> >
> > 1/ MD_RECOVERY_DONE is set.  As only md_do_sync ever sets it, that is
> >    unlikely, and in any case md_check_recovery clears it.
> > 2/ mddev->ro != 0.  It is only ever set to 0, 1, or 2.  If it is 1 or 2
> >   then we would be able to see that in /proc/mdstat as a "(readonly)"
> >   status.  But we don't.
> > 3/ MD_RECOVERY_INTR is set. Again, md_check_recovery clears this.  It does
> >   get set if kthread_should_stop() returns 'true', but that should only
> >   happen if kthread_stop() was called.  That is only called by
> >   md_unregister_thread and I cannot see any way that could be called.
> >
> > So.  No idea.
> >
> > Are you compiling these kernels yourself?
> 
> Nope (used Mageia kernels), but I did now (3.1.5).
> 
> > If so, could you:
> >  - put a printk in the top of md_do_sync to report the values of
> >   mddev->recovery and mddev->ro
> >  - print a message whenever md_unregister_thread is called
> >  - in md_check_recovery, in the
> >                if (mddev->ro) {
> >                        /* Only thing we do on a ro array is remove
> >                         * failed devices.
> >                         */
> >                        mdk_rdev_t *rdev;
> >
> >  if statement, print the value of mddev->ro.
> >
> > Then see which of those printk's fire, and what they tell us.
> 
> Only the last one does, and mddev->ro == 0.
> 
> For reference, attached is the used patch and resulting log output.
> 

Thanks.

So it isn't running md_do_sync at all. Odd.
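
For reference, the three early-exit conditions listed in the quoted text can be modelled in user space roughly as follows. This is only a sketch of the reasoning in this thread, not kernel code: the struct, the enum names and the flag bit values are hypothetical placeholders, and the order of the checks simply follows the numbered list above.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative flag bits. Placeholders only, not taken from kernel headers. */
enum {
    MODEL_RECOVERY_INTR = 1u << 0, /* stands in for MD_RECOVERY_INTR */
    MODEL_RECOVERY_DONE = 1u << 1  /* stands in for MD_RECOVERY_DONE */
};

/* Hypothetical user-space stand-in for the relevant mddev fields. */
struct model_mddev {
    unsigned long recovery; /* models mddev->recovery */
    int ro;                 /* models mddev->ro: 0, 1 or 2 */
};

enum model_skip {
    SKIP_NONE,        /* sync would go ahead and print its message */
    SKIP_DONE,        /* reason 1: MD_RECOVERY_DONE is set */
    SKIP_READ_ONLY,   /* reason 2: mddev->ro != 0 */
    SKIP_INTERRUPTED  /* reason 3: MD_RECOVERY_INTR is set */
};

/* Report which of the three reasons, if any, would suppress the
 * "md: recovery of RAID array ..." message in this model. */
enum model_skip model_sync_skip_reason(const struct model_mddev *m)
{
    assert(m != NULL);
    if (m->recovery & MODEL_RECOVERY_DONE)
        return SKIP_DONE;
    if (m->ro != 0)
        return SKIP_READ_ONLY;
    if (m->recovery & MODEL_RECOVERY_INTR)
        return SKIP_INTERRUPTED;
    return SKIP_NONE;
}
```

With ro == 0 and neither flag set, the model returns SKIP_NONE, which is the case where the recovery message would be expected. None of these checks is reached if the function is never entered.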

Could you please add:
  - call "WARN_ON(1);" in print_raid5_conf() so we get a stack trace and can
    see who is calling it.
  - print the value that remove_and_add_spares is going to return.
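
The shape of those two additions might look something like the sketch below. It is a user-space mock so that it builds outside the kernel: the WARN_ON() and printk() stand-ins, the model_* function names, the struct and the warning counter are all assumptions made for illustration, and the real function bodies in md.c and raid5.c are not reproduced here. In the kernel, WARN_ON(1) would also dump a stack trace, which this mock does not attempt.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* User-space stand-ins. In the kernel, printk() and WARN_ON() already exist. */
#define KERN_INFO ""
#define printk(...) fprintf(stderr, __VA_ARGS__)

int model_warnings; /* hypothetical counter so the mock can be checked */

static int model_warn_on(int cond, const char *file, int line, const char *func)
{
    if (cond) {
        model_warnings++;
        fprintf(stderr, "WARNING: at %s:%d %s()\n", file, line, func);
    }
    return cond;
}
#define WARN_ON(cond) model_warn_on(!!(cond), __FILE__, __LINE__, __func__)

/* Hypothetical stand-in for the raid5 configuration. */
struct model_conf {
    int raid_disks;
};

/* Models print_raid5_conf() with the requested WARN_ON(1) added at the top. */
void model_print_raid5_conf(const struct model_conf *conf)
{
    assert(conf != NULL);
    WARN_ON(1); /* requested: identifies the caller */
    printk(KERN_INFO "RAID conf printout:\n");
}

/* Models the tail of remove_and_add_spares(): print the count before returning it. */
int model_remove_and_add_spares(int spares)
{
    printk(KERN_INFO "md: remove_and_add_spares returning %d\n", spares);
    return spares;
}
```

Each "RAID conf printout" would then be preceded by a warning naming its call site, and each pass through remove_and_add_spares would log the count it hands back to md_check_recovery.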

Thanks,
NeilBrown
