All of lore.kernel.org
 help / color / mirror / Atom feed
From: Xiao Ni <xni@redhat.com>
To: Joe Lawrence <joe.lawrence@stratus.com>
Cc: NeilBrown <neilb@suse.de>,
	linux-raid@vger.kernel.org,
	Bill Kuzeja <william.kuzeja@stratus.com>
Subject: Re: RAID1 removing failed disk returns EBUSY
Date: Fri, 16 Jan 2015 00:20:12 -0500 (EST)	[thread overview]
Message-ID: <2054919975.10444188.1421385612513.JavaMail.zimbra@redhat.com> (raw)
In-Reply-To: <20150115082210.31bd3ea5@jlaw-desktop.mno.stratus.com>



----- Original Message -----
> From: "Joe Lawrence" <joe.lawrence@stratus.com>
> To: "XiaoNi" <xni@redhat.com>
> Cc: "NeilBrown" <neilb@suse.de>, linux-raid@vger.kernel.org, "Bill Kuzeja" <william.kuzeja@stratus.com>
> Sent: Thursday, January 15, 2015 9:22:10 PM
> Subject: Re: RAID1 removing failed disk returns EBUSY
> 
> On Wed, 14 Jan 2015 20:41:16 +0800
> XiaoNi <xni@redhat.com> wrote:
> 
> > On 11/17/2014 07:03 AM, NeilBrown wrote:
> > > On Thu, 13 Nov 2014 09:05:49 -0500 Joe Lawrence<joe.lawrence@stratus.com>
> > > wrote:
> > >
> > >> On Wed, 29 Oct 2014 13:36:04 -0400
> > >> Joe Lawrence<joe.lawrence@stratus.com>  wrote:
> > >>
> > >>> On Wed, 29 Oct 2014 08:41:13 +1100
> > >>> NeilBrown<neilb@suse.de>  wrote:
> > >>>
> > >>>> On Mon, 27 Oct 2014 16:27:48 -0400 Joe
> > >>>> Lawrence<joe.lawrence@stratus.com>
> > >>>> wrote:
> > >>>>
> > >>>>> Hi Neil,
> > >>>>>
> > >>>>> We've encountered changes in MD and mdadm that have broken our
> > >>>>> automated
> > >>>>> disk removal script.  In the past, we've been able to run the
> > >>>>> following
> > >>>>> after a RAID1 disk component removal:
> > >>>>>
> > >>>>> % echo fail>  /sys/block/md3/md/dev-sdr5/state
> > >>>>> % echo remove>  /sys/block/md3/md/dev-sdr5/state
> > >>>>>
> > >>>>> However, the latest RHEL6.6 code drop has rebased to sufficiently
> > >>>>> recent
> > >>>>> MD kernel and mdadm changes, in which the previous commands
> > >>>>> occasionally
> > >>>>> fail like so:
> > >>>>>
> > >>>>> * MD array is usually resyncing or checking
> > >>>>> * Component disk /dev/sdr removed via HBA sysfs PCI removal
> > >>>>> * Following UDEV rule fires:
> > >>>>>
> > >>>>> SUBSYSTEM=="block", ACTION=="remove", ENV{ID_PATH}=="?*", \
> > >>>>>          RUN+="/sbin/mdadm -If $name --path $env{ID_PATH}"
> > >>>>>
> > >>>>> % mdadm --detail /dev/md3
> > >>>>> /dev/md3:
> > >>>>>          Version : 1.1
> > >>>>>    Creation Time : Tue Oct 14 17:31:59 2014
> > >>>>>       Raid Level : raid1
> > >>>>>       Array Size : 25149440 (23.98 GiB 25.75 GB)
> > >>>>>    Used Dev Size : 25149440 (23.98 GiB 25.75 GB)
> > >>>>>     Raid Devices : 2
> > >>>>>    Total Devices : 2
> > >>>>>      Persistence : Superblock is persistent
> > >>>>>
> > >>>>>    Intent Bitmap : Internal
> > >>>>>
> > >>>>>      Update Time : Wed Oct 15 14:22:34 2014
> > >>>>>            State : active, degraded
> > >>>>>   Active Devices : 1
> > >>>>> Working Devices : 1
> > >>>>>   Failed Devices : 1
> > >>>>>    Spare Devices : 0
> > >>>>>
> > >>>>>             Name : localhost.localdomain:3
> > >>>>>             UUID : 40ed68ee:ba41d4cd:28c361ed:be7470b8
> > >>>>>           Events : 142
> > >>>>>
> > >>>>>      Number   Major   Minor   RaidDevice State
> > >>>>>         0      65       21        0      faulty
> > >>>>>         1      65        5        1      active sync   /dev/sdj5
> > >>>>>
> > >>>>> All attempts to remove this device fail:
> > >>>>>
> > >>>>> % echo remove>  /sys/block/md3/md/dev-sdr5/state
> > >>>>> -bash: echo: write error: Device or resource busy
> > >>>>>
> > >>>>> This can be traced to state_store():
> > >>>>>
> > >>>>>          } else if (cmd_match(buf, "remove")) {
> > >>>>>                  if (rdev->raid_disk>= 0)
> > >>>>>                          err = -EBUSY;
> > >>>>>
> > >>>>> After much debugging and systemtapping, I think I've figured out that
> > >>>>> the
> > >>>>> sysfs scripting may fail after the following combination of changes:
> > >>>>>
> > >>>>> mdadm  8af530b07fce "Enhance incremental removal."
> > >>>>> kernel 30b8feb730f9 "md/raid5: avoid deadlock when raid5 array has
> > >>>>> unack
> > >>>>>                       badblocks during md_stop_writes"
> > >>>>>
> > >>>>> With these two changes:
> > >>>>>
> > >>>>> 1 - On the user side, mdadm is trying to set the array_state to
> > >>>>> read-auto
> > >>>>>      on incremental removal (as invoked by UDEV rule).
> > >>>>>
> > >>>>> 2 - Kernel side, md_set_readonly() will set the MD_RECOVERY_FROZEN
> > >>>>> flag,
> > >>>>>      wake up the mddev->thread and if there is a sync_thread, it will
> > >>>>>      set
> > >>>>>      MD_RECOVERY_INTR and then wait until the sync_thread is set to
> > >>>>>      NULL.
> > >>>>>
> > >>>>>      When md_check_recovery() gets a chance to run as part of the
> > >>>>>      raid1d() mddev->thread, it may or may not ever get to
> > >>>>>      an invocation of remove_and_add_spares(), for there are but
> > >>>>>      *many*
> > >>>>>      conditional early exits along the way -- for example, if
> > >>>>>      MD_RECOVERY_FROZEN is set, the following condition will bounce
> > >>>>>      out of
> > >>>>>      the routine:
> > >>>>>
> > >>>>>                  if
> > >>>>>                  (!test_and_clear_bit(MD_RECOVERY_NEEDED,&mddev->recovery)
> > >>>>>                  ||
> > >>>>>                      test_bit(MD_RECOVERY_FROZEN,&mddev->recovery))
> > >>>>>                          goto unlock;
> > >>>>>
> > >>>>>      the next time around, MD_RECOVERY_NEEDED will have been cleared,
> > >>>>>      so
> > >>>>>      all future tests will return 0 and the negation will always take
> > >>>>>      the
> > >>>>>      early exit path.
> > >>>>>
> > >>>>>      Back in md_set_readonly(), it may notice that the MD is still in
> > >>>>>      use,
> > >>>>>      so it clears the MD_RECOVERY_FROZEN and then returns -EBUSY,
> > >>>>>      without
> > >>>>>      setting mddev->ro.  But the damage has been done as conditions
> > >>>>>      have
> > >>>>>      been set such that md_check_recovery() will never call
> > >>>>>      remove_and_add_spares().
> > >>>>>
> > >>>>> This would also explain why an "idle" sync_action clears the wedge:
> > >>>>> it
> > >>>>> sets MD_RECOVERY_NEEDED allowing md_check_recovery() to continue
> > >>>>> executing
> > >>>>> to remove_and_add_spares().
> > >>>>>
> > >>>>> As far as I can tell, this is what is happening to prevent the
> > >>>>> "remove"
> > >>>>> write to /sys/block/md3/md/dev-sdr5/state from succeeding.  There are
> > >>>>> certainly a lot of little bit-states between disk removal, UDEV
> > >>>>> mdadm, and
> > >>>>> various MD kernel threads, so apologies if I missed an important
> > >>>>> transition.
> > >>>>>
> > >>>>> Would you consider writing "idle" to the MD array sync_action file as
> > >>>>> a
> > >>>>> safe and reasonable intermediate workaround step for our script?
> > >>>>>
> > >>>>> And of course, any suggestions to whether this is intended behavior
> > >>>>> (ie,
> > >>>>> the removed component disk is failed, but stuck in the array)?
> > >>>>>
> > >>>>> This is fairly easy for us to reproduce with multiple MD arrays per
> > >>>>> disk
> > >>>>> (one per partition) and interrupting a raid check on all of them
> > >>>>> (especially when they are delayed waiting for the first to finish) by
> > >>>>> removing the component disk via sysfs PCI removal.  We can provide
> > >>>>> additional debug or testing if required.
> > >>>>>
> > >>>> Hi Joe,
> > >>>>   thanks for the details analysis!!
> > >>>>
> > >>>> I think the correct fix would be that MD_RECOVERY_NEEDED should be set
> > >>>> after
> > >>>> clearing MD_RECOVERY_FROZEN, like the patch below.
> > >>>> Can you confirm that it works for you?
> > >>>>
> > >>>> Writing 'idle' should in general be safe, so that could be used as an
> > >>>> interim.
> > >>>>
> > >>>> Thanks,
> > >>>> NeilBrown
> > >>>>
> > >>>> diff --git a/drivers/md/md.c b/drivers/md/md.c
> > >>>> index c03d87b6890a..2c73fcb82593 100644
> > >>>> --- a/drivers/md/md.c
> > >>>> +++ b/drivers/md/md.c
> > >>>> @@ -5261,6 +5261,7 @@ static int md_set_readonly(struct mddev *mddev,
> > >>>> struct block_device *bdev)
> > >>>>   		printk("md: %s still in use.\n",mdname(mddev));
> > >>>>   		if (did_freeze) {
> > >>>>   			clear_bit(MD_RECOVERY_FROZEN,&mddev->recovery);
> > >>>> +			set_bit(MD_RECOVERY_NEEDED,&mddev->recovery);
> > >>>>   			md_wakeup_thread(mddev->thread);
> > >>>>   		}
> > >>>>   		err = -EBUSY;
> > >>>> @@ -5275,6 +5276,8 @@ static int md_set_readonly(struct mddev *mddev,
> > >>>> struct block_device *bdev)
> > >>>>   		mddev->ro = 1;
> > >>>>   		set_disk_ro(mddev->gendisk, 1);
> > >>>>   		clear_bit(MD_RECOVERY_FROZEN,&mddev->recovery);
> > >>>> +		set_bit(MD_RECOVERY_NEEDED,&mddev->recovery);
> > >>>> +		md_wakeup_thread(mddev->thread);
> > >>>>   		sysfs_notify_dirent_safe(mddev->sysfs_state);
> > >>>>   		err = 0;
> > >>>>   	}
> > >>>> @@ -5318,6 +5321,7 @@ static int do_md_stop(struct mddev *mddev, int
> > >>>> mode,
> > >>>>   		mutex_unlock(&mddev->open_mutex);
> > >>>>   		if (did_freeze) {
> > >>>>   			clear_bit(MD_RECOVERY_FROZEN,&mddev->recovery);
> > >>>> +			set_bit(MD_RECOVERY_NEEDED,&mddev->recovery);
> > >>>>   			md_wakeup_thread(mddev->thread);
> > >>>>   		}
> > >>>>   		return -EBUSY;
> > >>> Hi Neil,
> > >>>
> > >>> In my tests, the UDEV "mdadm -If" invocation fails *and* removes the
> > >>> pulled disk from the MD array.  This is okay for our intentions, but I
> > >>> wanted to make sure that it's okay to skip any failed-but-not-removed
> > >>> state.
> > >>>
> > >>> Tested-by: Joe Lawrence<joe.lawrence@stratus.com>
> > >>>
> > >>> and should this have a
> > >>>
> > >>> Fixes: 30b8feb730f9 ("md/raid5: avoid deadlock when raid5 array has
> > >>> unack badblocks during md_stop_writes")
> > >>>
> > >>> tag to mark for stable?
> > >>
> > >> Hi Neil,
> > >>
> > >> Would you like me to write up a proper patch, or is this one in the
> > >> queue?
> > >>
> > > Several times over the last week I've thought that I should probably push
> > > that patch along ... but each time something else seemed more
> > > interesting.
> > > But it's a new week now.  I've just posted a pull request.
> > >
> > > Thanks for the prompt (and the report and testing of course).
> > >
> > > NeilBrown
> > Hi Neil and Joe
> > 
> >      Any update for this? I tried this with 3.18.2 and the problem still
> > exists.
> > 
> >     When it tried to remove the failed disk. it find the Blocked flag in
> > rdev->flags is
> > set. So it can't remove the disk. Is this the right reason?
> 
> Hi Xiao,
> 
> It's been a while since I've looked at this patch, but it looks like it
> made it into 3.18, so it should be present on 3.18.2.
> 
> What version of mdadm are you running?
> 
> Does writing an "idle" sync_action clear this condition?
> 
> Regards,
> 
> -- Joe
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 

Hi Joe

   Thanks for reminding me. I didn't do that. Now it can remove successfully after writing
"idle" to sync_action.

   I thought wrongly that the patch referenced in this mail is fixed for the problem.

Best Regards
Xiao


  reply	other threads:[~2015-01-16  5:20 UTC|newest]

Thread overview: 22+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2014-10-27 20:27 RAID1 removing failed disk returns EBUSY Joe Lawrence
2014-10-28 21:41 ` NeilBrown
2014-10-29 17:36   ` Joe Lawrence
2014-11-13 14:05     ` Joe Lawrence
2014-11-16 23:03       ` NeilBrown
2015-01-14 12:41         ` XiaoNi
2015-01-15 13:22           ` Joe Lawrence
2015-01-16  5:20             ` Xiao Ni [this message]
2015-01-16 15:10               ` Joe Lawrence
2015-01-19  2:33                 ` Xiao Ni
2015-01-19 17:56                   ` Joe Lawrence
2015-01-20  7:16                     ` Xiao Ni
2015-01-23 15:11                       ` Joe Lawrence
2015-01-30  2:19                         ` Xiao Ni
2015-01-30  4:27                           ` Xiao Ni
2015-01-29  3:52                   ` NeilBrown
2015-01-29 12:14                     ` Xiao Ni
2015-02-02  6:36                       ` NeilBrown
2015-02-03  8:10                         ` Xiao Ni
2015-06-10  6:26                           ` XiaoNi
2015-06-17  2:51                             ` Neil Brown
2015-06-25  9:42                               ` Xiao Ni

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=2054919975.10444188.1421385612513.JavaMail.zimbra@redhat.com \
    --to=xni@redhat.com \
    --cc=joe.lawrence@stratus.com \
    --cc=linux-raid@vger.kernel.org \
    --cc=neilb@suse.de \
    --cc=william.kuzeja@stratus.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.