From: Mariusz Tkaczyk <mariusz.tkaczyk@linux.intel.com>
To: Guoqing Jiang <guoqing.jiang@linux.dev>, song@kernel.org
Cc: linux-raid@vger.kernel.org
Subject: Re: [PATCH 3/3] raid5: introduce MD_BROKEN
Date: Tue, 22 Feb 2022 15:18:51 +0100
Message-ID: <20220222151851.0000089a@linux.intel.com>
In-Reply-To: <fbe1ec39-acee-8226-adb2-6c61e3d7fdd0@linux.dev>

Hi Guoqing,
Sorry for the delay, I missed it.

On Sat, 12 Feb 2022 09:47:38 +0800
Guoqing Jiang <guoqing.jiang@linux.dev> wrote:

> On 1/27/22 11:39 PM, Mariusz Tkaczyk wrote:
> > The raid456 module had allowed the array to reach the failed state.
> > That was fixed by fb73b357fb9 ("raid5: block failing device if raid
> > will be failed"). This fix introduced a bug: now, if raid5 fails
> > during IO, it may result in a hung task that never completes. The
> > Faulty flag on the device is necessary to process all requests and
> > is checked many times, mainly in analyze_stripe().
> > Allow setting Faulty on a drive again and set MD_BROKEN if the raid
> > is failed.
> >
> > As a result, this level is allowed to reach the failed state again,
> > but communication with userspace (via the -EBUSY status) is
> > preserved.
> >
> > This restores the possibility of failing the array via the
> > #mdadm --set-faulty command; it will be complemented by additional
> > verification on the mdadm side.
> >  
> 
> Again, you'd better send the mdadm change along with the series.

I understand your objections. Unfortunately, I have not been able to
handle it yet. I focused on the kernel part first because mdadm was in
a freeze. In mdadm, I need to block manual removal, which should be
simple; there is already an enough() function defined. It is at the top
of my TODO list.
  
> 
> > Reproduction steps:
> >   mdadm -CR imsm -e imsm -n 3 /dev/nvme[0-2]n1
> >   mdadm -CR r5 -e imsm -l5 -n3 /dev/nvme[0-2]n1 --assume-clean
> >   mkfs.xfs /dev/md126 -f
> >   mount /dev/md126 /mnt/root/
> >
> >   fio --filename=/mnt/root/file --size=5GB --direct=1 --rw=randrw
> > --bs=64k --ioengine=libaio --iodepth=64 --runtime=240 --numjobs=4
> > --time_based --group_reporting --name=throughput-test-job
> > --eta-newline=1 &
> >
> >   echo 1 > /sys/block/nvme2n1/device/device/remove
> >   echo 1 > /sys/block/nvme1n1/device/device/remove
> >
> >   [ 1475.787779] Call Trace:
> >   [ 1475.793111] __schedule+0x2a6/0x700
> >   [ 1475.799460] schedule+0x38/0xa0
> >   [ 1475.805454] raid5_get_active_stripe+0x469/0x5f0 [raid456]
> >   [ 1475.813856] ? finish_wait+0x80/0x80
> >   [ 1475.820332] raid5_make_request+0x180/0xb40 [raid456]
> >   [ 1475.828281] ? finish_wait+0x80/0x80
> >   [ 1475.834727] ? finish_wait+0x80/0x80
> >   [ 1475.841127] ? finish_wait+0x80/0x80
> >   [ 1475.847480] md_handle_request+0x119/0x190
> >   [ 1475.854390] md_make_request+0x8a/0x190
> >   [ 1475.861041] generic_make_request+0xcf/0x310
> >   [ 1475.868145] submit_bio+0x3c/0x160
> >   [ 1475.874355] iomap_dio_submit_bio.isra.20+0x51/0x60
> >   [ 1475.882070] iomap_dio_bio_actor+0x175/0x390
> >   [ 1475.889149] iomap_apply+0xff/0x310
> >   [ 1475.895447] ? iomap_dio_bio_actor+0x390/0x390
> >   [ 1475.902736] ? iomap_dio_bio_actor+0x390/0x390
> >   [ 1475.909974] iomap_dio_rw+0x2f2/0x490
> >   [ 1475.916415] ? iomap_dio_bio_actor+0x390/0x390
> >   [ 1475.923680] ? atime_needs_update+0x77/0xe0
> >   [ 1475.930674] ? xfs_file_dio_aio_read+0x6b/0xe0 [xfs]
> >   [ 1475.938455] xfs_file_dio_aio_read+0x6b/0xe0 [xfs]
> >   [ 1475.946084] xfs_file_read_iter+0xba/0xd0 [xfs]
> >   [ 1475.953403] aio_read+0xd5/0x180
> >   [ 1475.959395] ? _cond_resched+0x15/0x30
> >   [ 1475.965907] io_submit_one+0x20b/0x3c0
> >   [ 1475.972398] __x64_sys_io_submit+0xa2/0x180
> >   [ 1475.979335] ? do_io_getevents+0x7c/0xc0
> >   [ 1475.986009] do_syscall_64+0x5b/0x1a0
> >   [ 1475.992419] entry_SYSCALL_64_after_hwframe+0x65/0xca
> >   [ 1476.000255] RIP: 0033:0x7f11fc27978d
> >   [ 1476.006631] Code: Bad RIP value.
> >   [ 1476.073251] INFO: task fio:3877 blocked for more than 120
> > seconds.  
> 
> Does it also happen to a non-IMSM array? And did you try to reproduce
> it with fb73b357fb reverted? I suppose fb73b357fb9 introduced the
> regression, given that it is fixed by this one.
> 
It is reproducible on both native and IMSM arrays. Yes, we could fix it
by reverting.

> 
> >    * In both cases, &MD_BROKEN will be set in &mddev->flags.
> >    */
> >   static void raid1_error(struct mddev *mddev, struct md_rdev *rdev)
> > diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
> > index 1240a5c16af8..bee953c8007f 100644
> > --- a/drivers/md/raid5.c
> > +++ b/drivers/md/raid5.c
> > @@ -686,17 +686,21 @@ int raid5_calc_degraded(struct r5conf *conf)
> >   	return degraded;
> >   }
> >   
> > -static int has_failed(struct r5conf *conf)
> > +static bool has_failed(struct r5conf *conf)
> >   {
> > -	int degraded;
> > +	int degraded = conf->mddev->degraded;
> >   
> > -	if (conf->mddev->reshape_position == MaxSector)
> > -		return conf->mddev->degraded > conf->max_degraded;
> > +	if (test_bit(MD_BROKEN, &conf->mddev->flags))
> > +		return true;  
> 
> If one member disk was set Faulty, which caused MD_BROKEN to be set,
> is it possible to re-add the same member disk again?
> 
Is it possible to re-add a drive to a failed raid5 array now? From my
understanding of raid5_add_disk(), it is not.

> [root@vm ~]# echo faulty > /sys/block/md0/md/dev-loop1/state
> [root@vm ~]# cat /proc/mdstat
> Personalities : [raid6] [raid5] [raid4]
> md0 : active raid5 loop2[2] loop1[0](F)
>        1046528 blocks super 1.2 level 5, 512k chunk, algorithm 2
> [2/1] [_U] bitmap: 0/1 pages [0KB], 65536KB chunk
> 
> unused devices: <none>
> [root@vm ~]# echo re-add > /sys/block/md0/md/dev-loop1/state
> [root@vm ~]# cat /proc/mdstat
> Personalities : [raid6] [raid5] [raid4]
> md0 : active raid5 loop2[2] loop1[0]
>        1046528 blocks super 1.2 level 5, 512k chunk, algorithm 2
> [2/2] [UU] bitmap: 0/1 pages [0KB], 65536KB chunk
> 
> unused devices: <none>
> 
> And have you run the mdadm tests against the series?
> 
I ran the imsm test suite and our internal IMSM scope. I will take up
the challenge and verify with native arrays as well. Thanks for the
suggestion.

> > -	degraded = raid5_calc_degraded(conf);
> > -	if (degraded > conf->max_degraded)
> > -		return 1;
> > -	return 0;
> > +	if (conf->mddev->reshape_position != MaxSector)
> > +		degraded = raid5_calc_degraded(conf);
> > +
> > +	if (degraded > conf->max_degraded) {
> > +		set_bit(MD_BROKEN, &conf->mddev->flags);  
> 
> Why not set the MD_BROKEN flag in the error handler to align with
> other levels? Or do it in md_error() only.

https://lore.kernel.org/linux-raid/3da9324e-01e7-2a07-4bcd-14245db56693@linux.dev/

You suggested that.
Other levels don't have dedicated has_failed() routines; for raid5 it
is reasonable to set the flag in has_failed().

I can't do it in md_error() because I don't have that information in
all cases; the result of !test_bit(Faulty, &rdev->flags) varies.

> 
> > +		return true;
> > +	}
> > +	return false;
> >   }
> >   
> >   struct stripe_head *
> > @@ -2877,34 +2881,29 @@ static void raid5_error(struct mddev
> > *mddev, struct md_rdev *rdev) unsigned long flags;
> >   	pr_debug("raid456: error called\n");
> >   
> > +	pr_crit("md/raid:%s: Disk failure on %s, disabling
> > device.\n",
> > +		mdname(mddev), bdevname(rdev->bdev, b));
> > +
> >   	spin_lock_irqsave(&conf->device_lock, flags);
> > +	set_bit(Faulty, &rdev->flags);
> > +	clear_bit(In_sync, &rdev->flags);
> > +	mddev->degraded = raid5_calc_degraded(conf);
> >   
> > -	if (test_bit(In_sync, &rdev->flags) &&
> > -	    mddev->degraded == conf->max_degraded) {
> > -		/*
> > -		 * Don't allow to achieve failed state
> > -		 * Don't try to recover this device
> > -		 */
> > +	if (has_failed(conf)) {
> >   		conf->recovery_disabled =
> > mddev->recovery_disabled;
> > -		spin_unlock_irqrestore(&conf->device_lock, flags);
> > -		return;  
> 
> Ok, I think commit fb73b357fb985cc652a72a41541d25915c7f9635 is
> effectively reverted by this hunk. So I would prefer to separate the
> revert part from this patch, just my $0.02.

Song,
do you want a revert?

Thanks,
Mariusz


Thread overview: 29+ messages
2022-01-27 15:39 [PATCH v3 0/3] Improve failed arrays handling Mariusz Tkaczyk
2022-01-27 15:39 ` [PATCH 1/3] raid0, linear, md: add error_handlers for raid0 and linear Mariusz Tkaczyk
2022-02-12  1:12   ` Guoqing Jiang
2022-02-14  9:37     ` Mariusz Tkaczyk
2022-02-15  3:43       ` Guoqing Jiang
2022-02-15 14:06         ` Mariusz Tkaczyk
2022-02-16  9:47           ` Xiao Ni
2022-02-22  6:34           ` Song Liu
2022-02-22 13:02             ` Mariusz Tkaczyk
2022-01-27 15:39 ` [PATCH 2/3] md: Set MD_BROKEN for RAID1 and RAID10 Mariusz Tkaczyk
2022-01-31  8:29   ` Xiao Ni
2022-01-31  9:06     ` Mariusz Tkaczyk
2022-02-08  7:13       ` Song Liu
2022-01-31 12:23     ` Wols Lists
2022-02-12  1:17   ` Guoqing Jiang
2022-02-14  8:55     ` Mariusz Tkaczyk
2022-01-27 15:39 ` [PATCH 3/3] raid5: introduce MD_BROKEN Mariusz Tkaczyk
2022-01-31  8:58   ` Xiao Ni
2022-02-12  1:47   ` Guoqing Jiang
2022-02-22 14:18     ` Mariusz Tkaczyk [this message]
2022-02-25  7:22       ` Guoqing Jiang
2022-03-03 16:21         ` Mariusz Tkaczyk
2022-02-08  7:18 ` [PATCH v3 0/3] Improve failed arrays handling Song Liu
  -- strict thread matches above, loose matches on Subject: below --
2022-03-22 15:23 [PATCH 0/3] Failed array handling improvements Mariusz Tkaczyk
2022-03-22 15:23 ` [PATCH 3/3] raid5: introduce MD_BROKEN Mariusz Tkaczyk
2022-04-08  0:29   ` Song Liu
2021-12-16 14:52 [PATCH v2 0/3] Use MD_BROKEN for redundant arrays Mariusz Tkaczyk
2021-12-16 14:52 ` [PATCH 3/3] raid5: introduce MD_BROKEN Mariusz Tkaczyk
2021-12-17  2:26   ` Guoqing Jiang
2021-12-17  8:37     ` Mariusz Tkaczyk
2021-12-22  1:46       ` Guoqing Jiang
