linux-raid.vger.kernel.org archive mirror
* Fwd: Infinite systemd loop when powering off the machine with multiple MD RAIDs
@ 2023-08-16  9:37 Bagas Sanjaya
  2023-08-18  8:16 ` Mariusz Tkaczyk
                   ` (2 more replies)
  0 siblings, 3 replies; 30+ messages in thread
From: Bagas Sanjaya @ 2023-08-16  9:37 UTC (permalink / raw)
  To: Christoph Hellwig, Song Liu, AceLan Kao
  Cc: Linux Kernel Mailing List, Linux Regressions, Linux RAID

Hi,

I notice a regression report on Bugzilla [1]. Quoting from it:

> It needs to build at least 2 different RAIDs (e.g. RAID0 and RAID10, or RAID5 and RAID10), and then you will see the error below repeatedly (you need to use a serial console to see it)
> 
> [ 205.360738] systemd-shutdown[1]: Stopping MD devices.
> [ 205.366384] systemd-shutdown[1]: sd-device-enumerator: Scan all dirs
> [ 205.373327] systemd-shutdown[1]: sd-device-enumerator: Scanning /sys/bus
> [ 205.380427] systemd-shutdown[1]: sd-device-enumerator: Scanning /sys/class
> [ 205.388257] systemd-shutdown[1]: Stopping MD /dev/md127 (9:127).
> [ 205.394880] systemd-shutdown[1]: Failed to sync MD block device /dev/md127, ignoring: Input/output error
> [ 205.404975] md: md127 stopped.
> [ 205.470491] systemd-shutdown[1]: Stopping MD /dev/md126 (9:126).
> [ 205.770179] md: md126: resync interrupted.
> [ 205.776258] md126: detected capacity change from 1900396544 to 0
> [ 205.783349] md: md126 stopped.
> [ 205.862258] systemd-shutdown[1]: Stopping MD /dev/md125 (9:125).
> [ 205.862435] md: md126 stopped.
> [ 205.868376] systemd-shutdown[1]: Failed to sync MD block device /dev/md125, ignoring: Input/output error
> [ 205.872845] block device autoloading is deprecated and will be removed.
> [ 205.880955] md: md125 stopped.
> [ 205.934349] systemd-shutdown[1]: Stopping MD /dev/md124p2 (259:7).
> [ 205.947707] systemd-shutdown[1]: Could not stop MD /dev/md124p2: Device or resource busy
> [ 205.957004] systemd-shutdown[1]: Stopping MD /dev/md124p1 (259:6).
> [ 205.964177] systemd-shutdown[1]: Could not stop MD /dev/md124p1: Device or resource busy
> [ 205.973155] systemd-shutdown[1]: Stopping MD /dev/md124 (9:124).
> [ 205.979789] systemd-shutdown[1]: Could not stop MD /dev/md124: Device or resource busy
> [ 205.988475] systemd-shutdown[1]: Not all MD devices stopped, 4 left.

See Bugzilla for the full thread and attached full journalctl log.

Anyway, I'm adding this regression to be tracked by regzbot:

#regzbot introduced: 12a6caf273240a https://bugzilla.kernel.org/show_bug.cgi?id=217798
#regzbot title: systemd shutdown hang on machine with different RAID levels

Thanks.

[1]: https://bugzilla.kernel.org/show_bug.cgi?id=217798

-- 
An old man doll... just what I always wanted! - Clara


* Re: Infinite systemd loop when powering off the machine with multiple MD RAIDs
  2023-08-16  9:37 Fwd: Infinite systemd loop when powering off the machine with multiple MD RAIDs Bagas Sanjaya
@ 2023-08-18  8:16 ` Mariusz Tkaczyk
  2023-08-18  9:21   ` Hannes Reinecke
  2023-08-22  3:51   ` Guoqing Jiang
  2023-08-21 13:18 ` Fwd: " Yu Kuai
  2023-08-22 18:56 ` Song Liu
  2 siblings, 2 replies; 30+ messages in thread
From: Mariusz Tkaczyk @ 2023-08-18  8:16 UTC (permalink / raw)
  To: Bagas Sanjaya
  Cc: Christoph Hellwig, Song Liu, AceLan Kao,
	Linux Kernel Mailing List, Linux Regressions, Linux RAID

On Wed, 16 Aug 2023 16:37:26 +0700
Bagas Sanjaya <bagasdotme@gmail.com> wrote:

> Hi,
> 
> I notice a regression report on Bugzilla [1]. Quoting from it:
> 
> > It needs to build at least 2 different RAIDs (e.g. RAID0 and RAID10, or
> > RAID5 and RAID10), and then you will see the error below repeatedly (you
> > need to use a serial console to see it)
> > 
> > [ 205.360738] systemd-shutdown[1]: Stopping MD devices.
> > [ 205.366384] systemd-shutdown[1]: sd-device-enumerator: Scan all dirs
> > [ 205.373327] systemd-shutdown[1]: sd-device-enumerator: Scanning /sys/bus
> > [ 205.380427] systemd-shutdown[1]: sd-device-enumerator: Scanning /sys/class
> > [ 205.388257] systemd-shutdown[1]: Stopping MD /dev/md127 (9:127).
> > [ 205.394880] systemd-shutdown[1]: Failed to sync MD block device /dev/md127, ignoring: Input/output error
> > [ 205.404975] md: md127 stopped.
> > [ 205.470491] systemd-shutdown[1]: Stopping MD /dev/md126 (9:126).
> > [ 205.770179] md: md126: resync interrupted.
> > [ 205.776258] md126: detected capacity change from 1900396544 to 0
> > [ 205.783349] md: md126 stopped.
> > [ 205.862258] systemd-shutdown[1]: Stopping MD /dev/md125 (9:125).
> > [ 205.862435] md: md126 stopped.
> > [ 205.868376] systemd-shutdown[1]: Failed to sync MD block device /dev/md125, ignoring: Input/output error
> > [ 205.872845] block device autoloading is deprecated and will be removed.
> > [ 205.880955] md: md125 stopped.
> > [ 205.934349] systemd-shutdown[1]: Stopping MD /dev/md124p2 (259:7).
> > [ 205.947707] systemd-shutdown[1]: Could not stop MD /dev/md124p2: Device or resource busy
> > [ 205.957004] systemd-shutdown[1]: Stopping MD /dev/md124p1 (259:6).
> > [ 205.964177] systemd-shutdown[1]: Could not stop MD /dev/md124p1: Device or resource busy
> > [ 205.973155] systemd-shutdown[1]: Stopping MD /dev/md124 (9:124).
> > [ 205.979789] systemd-shutdown[1]: Could not stop MD /dev/md124: Device or resource busy
> > [ 205.988475] systemd-shutdown[1]: Not all MD devices stopped, 4 left.
> 
> See Bugzilla for the full thread and attached full journalctl log.
> 
> Anyway, I'm adding this regression to be tracked by regzbot:
> 
> #regzbot introduced: 12a6caf273240a https://bugzilla.kernel.org/show_bug.cgi?id=217798
> #regzbot title: systemd shutdown hang on machine with different RAID levels
> 
> Thanks.
> 
> [1]: https://bugzilla.kernel.org/show_bug.cgi?id=217798
> 

Hello,
The issue is reproducible with IMSM metadata too; around 20% of reboots hang.
I will try to raise the priority in the bug, because it is a valid high: the
base functionality of the system is affected.

Christoph, could you take a look in the short term?

Thanks,
Mariusz


* Re: Infinite systemd loop when powering off the machine with multiple MD RAIDs
  2023-08-18  8:16 ` Mariusz Tkaczyk
@ 2023-08-18  9:21   ` Hannes Reinecke
  2023-08-21  3:23     ` AceLan Kao
  2023-08-22  3:51   ` Guoqing Jiang
  1 sibling, 1 reply; 30+ messages in thread
From: Hannes Reinecke @ 2023-08-18  9:21 UTC (permalink / raw)
  To: Mariusz Tkaczyk, Bagas Sanjaya
  Cc: Christoph Hellwig, Song Liu, AceLan Kao,
	Linux Kernel Mailing List, Linux Regressions, Linux RAID

On 8/18/23 10:16, Mariusz Tkaczyk wrote:
> On Wed, 16 Aug 2023 16:37:26 +0700
> Bagas Sanjaya <bagasdotme@gmail.com> wrote:
> 
>> Hi,
>>
>> I notice a regression report on Bugzilla [1]. Quoting from it:
>>
>>> It needs to build at least 2 different RAIDs (e.g. RAID0 and RAID10, or
>>> RAID5 and RAID10), and then you will see the error below repeatedly (you
>>> need to use a serial console to see it)
>>>
>>> [ 205.360738] systemd-shutdown[1]: Stopping MD devices.
>>> [ 205.366384] systemd-shutdown[1]: sd-device-enumerator: Scan all dirs
>>> [ 205.373327] systemd-shutdown[1]: sd-device-enumerator: Scanning /sys/bus
>>> [ 205.380427] systemd-shutdown[1]: sd-device-enumerator: Scanning /sys/class
>>> [ 205.388257] systemd-shutdown[1]: Stopping MD /dev/md127 (9:127).
>>> [ 205.394880] systemd-shutdown[1]: Failed to sync MD block device /dev/md127, ignoring: Input/output error
>>> [ 205.404975] md: md127 stopped.
>>> [ 205.470491] systemd-shutdown[1]: Stopping MD /dev/md126 (9:126).
>>> [ 205.770179] md: md126: resync interrupted.
>>> [ 205.776258] md126: detected capacity change from 1900396544 to 0
>>> [ 205.783349] md: md126 stopped.
>>> [ 205.862258] systemd-shutdown[1]: Stopping MD /dev/md125 (9:125).
>>> [ 205.862435] md: md126 stopped.
>>> [ 205.868376] systemd-shutdown[1]: Failed to sync MD block device /dev/md125, ignoring: Input/output error
>>> [ 205.872845] block device autoloading is deprecated and will be removed.
>>> [ 205.880955] md: md125 stopped.
>>> [ 205.934349] systemd-shutdown[1]: Stopping MD /dev/md124p2 (259:7).
>>> [ 205.947707] systemd-shutdown[1]: Could not stop MD /dev/md124p2: Device or resource busy
>>> [ 205.957004] systemd-shutdown[1]: Stopping MD /dev/md124p1 (259:6).
>>> [ 205.964177] systemd-shutdown[1]: Could not stop MD /dev/md124p1: Device or resource busy
>>> [ 205.973155] systemd-shutdown[1]: Stopping MD /dev/md124 (9:124).
>>> [ 205.979789] systemd-shutdown[1]: Could not stop MD /dev/md124: Device or resource busy
>>> [ 205.988475] systemd-shutdown[1]: Not all MD devices stopped, 4 left.
>>
>> See Bugzilla for the full thread and attached full journalctl log.
>>
>> Anyway, I'm adding this regression to be tracked by regzbot:
>>
>> #regzbot introduced: 12a6caf273240a https://bugzilla.kernel.org/show_bug.cgi?id=217798
>> #regzbot title: systemd shutdown hang on machine with different RAID levels
>>
>> Thanks.
>>
>> [1]: https://bugzilla.kernel.org/show_bug.cgi?id=217798
>>
> 
> Hello,
> The issue is reproducible with IMSM metadata too; around 20% of reboots hang.
> I will try to raise the priority in the bug, because it is a valid high: the
> base functionality of the system is affected.
> 
> Christoph, could you take a look in the short term?
> 
Does this help?


diff --git a/drivers/md/md.c b/drivers/md/md.c
index 78be7811a89f..b3661c8def8c 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -9575,6 +9575,7 @@ static int md_notify_reboot(struct notifier_block *this,
         struct mddev *mddev, *n;
         int need_delay = 0;

+       flush_workqueue(md_misc_wq);
         spin_lock(&all_mddevs_lock);
         list_for_each_entry_safe(mddev, n, &all_mddevs, all_mddevs) {
                 if (!mddev_get(mddev))
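
For context: the final teardown of an unused array is deferred to md_misc_wq,
which is why flushing it before walking all_mddevs might let pending deletions
finish first. A paraphrased sketch of mddev_put() from drivers/md/md.c (not
verbatim; details vary between kernel versions):

static void mddev_put(struct mddev *mddev)
{
	if (!atomic_dec_and_lock(&mddev->active, &all_mddevs_lock))
		return;
	if (!mddev->raid_disks && list_empty(&mddev->disks) &&
	    mddev->ctime == 0 && !mddev->hold_active) {
		/* Array is not configured and not held active: mark it
		 * deleted and defer the actual freeing to md_misc_wq. */
		set_bit(MD_DELETED, &mddev->flags);
		INIT_WORK(&mddev->del_work, mddev_delayed_delete);
		queue_work(md_misc_wq, &mddev->del_work);
	}
	spin_unlock(&all_mddevs_lock);
}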

Cheers,

Hannes



* Re: Infinite systemd loop when powering off the machine with multiple MD RAIDs
  2023-08-18  9:21   ` Hannes Reinecke
@ 2023-08-21  3:23     ` AceLan Kao
  0 siblings, 0 replies; 30+ messages in thread
From: AceLan Kao @ 2023-08-21  3:23 UTC (permalink / raw)
  To: Hannes Reinecke
  Cc: Mariusz Tkaczyk, Bagas Sanjaya, Christoph Hellwig, Song Liu,
	Linux Kernel Mailing List, Linux Regressions, Linux RAID

Hannes Reinecke <hare@suse.de> wrote on Fri, Aug 18, 2023 at 5:21 PM:
>
> On 8/18/23 10:16, Mariusz Tkaczyk wrote:
> > On Wed, 16 Aug 2023 16:37:26 +0700
> > Bagas Sanjaya <bagasdotme@gmail.com> wrote:
> >
> >> Hi,
> >>
> >> I notice a regression report on Bugzilla [1]. Quoting from it:
> >>
> >>> It needs to build at least 2 different RAIDs (e.g. RAID0 and RAID10, or
> >>> RAID5 and RAID10), and then you will see the error below repeatedly (you
> >>> need to use a serial console to see it)
> >>>
> >>> [ 205.360738] systemd-shutdown[1]: Stopping MD devices.
> >>> [ 205.366384] systemd-shutdown[1]: sd-device-enumerator: Scan all dirs
> >>> [ 205.373327] systemd-shutdown[1]: sd-device-enumerator: Scanning /sys/bus
> >>> [ 205.380427] systemd-shutdown[1]: sd-device-enumerator: Scanning /sys/class
> >>> [ 205.388257] systemd-shutdown[1]: Stopping MD /dev/md127 (9:127).
> >>> [ 205.394880] systemd-shutdown[1]: Failed to sync MD block device /dev/md127, ignoring: Input/output error
> >>> [ 205.404975] md: md127 stopped.
> >>> [ 205.470491] systemd-shutdown[1]: Stopping MD /dev/md126 (9:126).
> >>> [ 205.770179] md: md126: resync interrupted.
> >>> [ 205.776258] md126: detected capacity change from 1900396544 to 0
> >>> [ 205.783349] md: md126 stopped.
> >>> [ 205.862258] systemd-shutdown[1]: Stopping MD /dev/md125 (9:125).
> >>> [ 205.862435] md: md126 stopped.
> >>> [ 205.868376] systemd-shutdown[1]: Failed to sync MD block device /dev/md125, ignoring: Input/output error
> >>> [ 205.872845] block device autoloading is deprecated and will be removed.
> >>> [ 205.880955] md: md125 stopped.
> >>> [ 205.934349] systemd-shutdown[1]: Stopping MD /dev/md124p2 (259:7).
> >>> [ 205.947707] systemd-shutdown[1]: Could not stop MD /dev/md124p2: Device or resource busy
> >>> [ 205.957004] systemd-shutdown[1]: Stopping MD /dev/md124p1 (259:6).
> >>> [ 205.964177] systemd-shutdown[1]: Could not stop MD /dev/md124p1: Device or resource busy
> >>> [ 205.973155] systemd-shutdown[1]: Stopping MD /dev/md124 (9:124).
> >>> [ 205.979789] systemd-shutdown[1]: Could not stop MD /dev/md124: Device or resource busy
> >>> [ 205.988475] systemd-shutdown[1]: Not all MD devices stopped, 4 left.
> >>
> >> See Bugzilla for the full thread and attached full journalctl log.
> >>
> >> Anyway, I'm adding this regression to be tracked by regzbot:
> >>
> >> #regzbot introduced: 12a6caf273240a https://bugzilla.kernel.org/show_bug.cgi?id=217798
> >> #regzbot title: systemd shutdown hang on machine with different RAID levels
> >>
> >> Thanks.
> >>
> >> [1]: https://bugzilla.kernel.org/show_bug.cgi?id=217798
> >>
> >
> > Hello,
> > The issue is reproducible with IMSM metadata too; around 20% of reboots hang.
> > I will try to raise the priority in the bug, because it is a valid high: the
> > base functionality of the system is affected.
> >
> > Christoph, could you take a look in the short term?
> >
> Does this help?
>
>
> diff --git a/drivers/md/md.c b/drivers/md/md.c
> index 78be7811a89f..b3661c8def8c 100644
> --- a/drivers/md/md.c
> +++ b/drivers/md/md.c
> @@ -9575,6 +9575,7 @@ static int md_notify_reboot(struct notifier_block *this,
>          struct mddev *mddev, *n;
>          int need_delay = 0;
>
> +       flush_workqueue(md_misc_wq);
>          spin_lock(&all_mddevs_lock);
>          list_for_each_entry_safe(mddev, n, &all_mddevs, all_mddevs) {
>                  if (!mddev_get(mddev))
>
> Cheers,
>
> Hannes
>

Hi Hannes,

The patch doesn't work; I can still reproduce the issue after applying the
patch on top of v6.5-rc7:
706a74159504 Linux 6.5-rc7

-- 
Chia-Lin Kao(AceLan)
http://blog.acelan.idv.tw/
E-Mail: acelan.kaoATcanonical.com (s/AT/@/)


* Re: Fwd: Infinite systemd loop when powering off the machine with multiple MD RAIDs
  2023-08-16  9:37 Fwd: Infinite systemd loop when powering off the machine with multiple MD RAIDs Bagas Sanjaya
  2023-08-18  8:16 ` Mariusz Tkaczyk
@ 2023-08-21 13:18 ` Yu Kuai
  2023-08-22  1:39   ` AceLan Kao
  2023-08-22 18:56 ` Song Liu
  2 siblings, 1 reply; 30+ messages in thread
From: Yu Kuai @ 2023-08-21 13:18 UTC (permalink / raw)
  To: Bagas Sanjaya, Christoph Hellwig, Song Liu, AceLan Kao
  Cc: Linux Kernel Mailing List, Linux Regressions, Linux RAID,
	yukuai (C)

Hi,

On 2023/08/16 17:37, Bagas Sanjaya wrote:
> Hi,
> 
> I notice a regression report on Bugzilla [1]. Quoting from it:
> 
>> It needs to build at least 2 different RAIDs (e.g. RAID0 and RAID10, or RAID5 and RAID10), and then you will see the error below repeatedly (you need to use a serial console to see it)
>>
>> [ 205.360738] systemd-shutdown[1]: Stopping MD devices.
>> [ 205.366384] systemd-shutdown[1]: sd-device-enumerator: Scan all dirs
>> [ 205.373327] systemd-shutdown[1]: sd-device-enumerator: Scanning /sys/bus
>> [ 205.380427] systemd-shutdown[1]: sd-device-enumerator: Scanning /sys/class
>> [ 205.388257] systemd-shutdown[1]: Stopping MD /dev/md127 (9:127).
>> [ 205.394880] systemd-shutdown[1]: Failed to sync MD block device /dev/md127, ignoring: Input/output error
>> [ 205.404975] md: md127 stopped.
>> [ 205.470491] systemd-shutdown[1]: Stopping MD /dev/md126 (9:126).
>> [ 205.770179] md: md126: resync interrupted.
>> [ 205.776258] md126: detected capacity change from 1900396544 to 0
>> [ 205.783349] md: md126 stopped.
>> [ 205.862258] systemd-shutdown[1]: Stopping MD /dev/md125 (9:125).
>> [ 205.862435] md: md126 stopped.
>> [ 205.868376] systemd-shutdown[1]: Failed to sync MD block device /dev/md125, ignoring: Input/output error
>> [ 205.872845] block device autoloading is deprecated and will be removed.
>> [ 205.880955] md: md125 stopped.
>> [ 205.934349] systemd-shutdown[1]: Stopping MD /dev/md124p2 (259:7).
>> [ 205.947707] systemd-shutdown[1]: Could not stop MD /dev/md124p2: Device or resource busy
>> [ 205.957004] systemd-shutdown[1]: Stopping MD /dev/md124p1 (259:6).
>> [ 205.964177] systemd-shutdown[1]: Could not stop MD /dev/md124p1: Device or resource busy
>> [ 205.973155] systemd-shutdown[1]: Stopping MD /dev/md124 (9:124).
>> [ 205.979789] systemd-shutdown[1]: Could not stop MD /dev/md124: Device or resource busy
>> [ 205.988475] systemd-shutdown[1]: Not all MD devices stopped, 4 left.

When the problem doesn't happen, does the log complain about this?

Could not stop MD...Device or resource busy

Thanks,
Kuai

> 
> See Bugzilla for the full thread and attached full journalctl log.
> 
> Anyway, I'm adding this regression to be tracked by regzbot:
> 
> #regzbot introduced: 12a6caf273240a https://bugzilla.kernel.org/show_bug.cgi?id=217798
> #regzbot title: systemd shutdown hang on machine with different RAID levels
> 
> Thanks.
> 
> [1]: https://bugzilla.kernel.org/show_bug.cgi?id=217798
> 



* Re: Fwd: Infinite systemd loop when powering off the machine with multiple MD RAIDs
  2023-08-21 13:18 ` Fwd: " Yu Kuai
@ 2023-08-22  1:39   ` AceLan Kao
  0 siblings, 0 replies; 30+ messages in thread
From: AceLan Kao @ 2023-08-22  1:39 UTC (permalink / raw)
  To: Yu Kuai
  Cc: Bagas Sanjaya, Christoph Hellwig, Song Liu,
	Linux Kernel Mailing List, Linux Regressions, Linux RAID,
	yukuai (C)

Yu Kuai <yukuai1@huaweicloud.com> wrote on Mon, Aug 21, 2023 at 9:18 PM:
>
> Hi,
>
> On 2023/08/16 17:37, Bagas Sanjaya wrote:
> > Hi,
> >
> > I notice a regression report on Bugzilla [1]. Quoting from it:
> >
> >> It needs to build at least 2 different RAIDs (e.g. RAID0 and RAID10, or RAID5 and RAID10), and then you will see the error below repeatedly (you need to use a serial console to see it)
> >>
> >> [ 205.360738] systemd-shutdown[1]: Stopping MD devices.
> >> [ 205.366384] systemd-shutdown[1]: sd-device-enumerator: Scan all dirs
> >> [ 205.373327] systemd-shutdown[1]: sd-device-enumerator: Scanning /sys/bus
> >> [ 205.380427] systemd-shutdown[1]: sd-device-enumerator: Scanning /sys/class
> >> [ 205.388257] systemd-shutdown[1]: Stopping MD /dev/md127 (9:127).
> >> [ 205.394880] systemd-shutdown[1]: Failed to sync MD block device /dev/md127, ignoring: Input/output error
> >> [ 205.404975] md: md127 stopped.
> >> [ 205.470491] systemd-shutdown[1]: Stopping MD /dev/md126 (9:126).
> >> [ 205.770179] md: md126: resync interrupted.
> >> [ 205.776258] md126: detected capacity change from 1900396544 to 0
> >> [ 205.783349] md: md126 stopped.
> >> [ 205.862258] systemd-shutdown[1]: Stopping MD /dev/md125 (9:125).
> >> [ 205.862435] md: md126 stopped.
> >> [ 205.868376] systemd-shutdown[1]: Failed to sync MD block device /dev/md125, ignoring: Input/output error
> >> [ 205.872845] block device autoloading is deprecated and will be removed.
> >> [ 205.880955] md: md125 stopped.
> >> [ 205.934349] systemd-shutdown[1]: Stopping MD /dev/md124p2 (259:7).
> >> [ 205.947707] systemd-shutdown[1]: Could not stop MD /dev/md124p2: Device or resource busy
> >> [ 205.957004] systemd-shutdown[1]: Stopping MD /dev/md124p1 (259:6).
> >> [ 205.964177] systemd-shutdown[1]: Could not stop MD /dev/md124p1: Device or resource busy
> >> [ 205.973155] systemd-shutdown[1]: Stopping MD /dev/md124 (9:124).
> >> [ 205.979789] systemd-shutdown[1]: Could not stop MD /dev/md124: Device or resource busy
> >> [ 205.988475] systemd-shutdown[1]: Not all MD devices stopped, 4 left.
>
> When the problem doesn't happen, does the log complain about this?
>
> Could not stop MD...Device or resource busy
>
> Thanks,
> Kuai
>
> >
> > See Bugzilla for the full thread and attached full journalctl log.
> >
> > Anyway, I'm adding this regression to be tracked by regzbot:
> >
> > #regzbot introduced: 12a6caf273240a https://bugzilla.kernel.org/show_bug.cgi?id=217798
> > #regzbot title: systemd shutdown hang on machine with different RAID levels
> >
> > Thanks.
> >
> > [1]: https://bugzilla.kernel.org/show_bug.cgi?id=217798
> >
>
Hi Yu,

Everything looks normal: I can see the MD RAIDs stop without any
warning/error messages.

$ sudo journalctl -b -2 -k -o short-monotonic | egrep md[0-9]+
[   13.418154] u-Precision-7960-Tower kernel: md/raid10:md125: active with 4 out of 4 devices
[   13.424050] u-Precision-7960-Tower kernel: md125: detected capacity change from 0 to 1900396544
[   13.525198] u-Precision-7960-Tower kernel: md123: detected capacity change from 0 to 1900396544
[   13.576770] u-Precision-7960-Tower kernel: md122: detected capacity change from 0 to 1900417024
[   13.585705] u-Precision-7960-Tower kernel:  md122: p1 p2 p3
[   14.323717] u-Precision-7960-Tower kernel: EXT4-fs (md122p3): mounted filesystem 6d53dc8e-3f45-4efa-bc0e-4af477fac217 ro with ordered data mode. Quota mode: none.
[   14.814365] u-Precision-7960-Tower systemd-gpt-auto-generator[587]: md122p3: Root device /dev/md122.
[   14.829064] u-Precision-7960-Tower kernel: block md122: the capability attribute has been deprecated.
[   18.133316] u-Precision-7960-Tower systemd[1]: unit_file_build_name_map: normal unit file: /lib/systemd/system/casper-md5check.service
[   20.609948] u-Precision-7960-Tower kernel: EXT4-fs (md122p3): re-mounted 6d53dc8e-3f45-4efa-bc0e-4af477fac217 r/w. Quota mode: none.
[   39.161554] u-Precision-7960-Tower kernel: md: requested-resync of RAID array md125
[   77.344788] u-Precision-7960-Tower systemd[1]: Got message type=signal sender=org.freedesktop.DBus destination=org.freedesktop.systemd1 path=/org/freedesktop/DBus interface=org.freedesktop.systemd1.Activator member=ActivationRequest cookie=89 reply_cookie=0 signature=s error-name=n/a error-message=n/a
[   77.736790] u-Precision-7960-Tower systemd[1]: Sent message type=signal sender=n/a destination=n/a path=/org/freedesktop/systemd1 interface=org.freedesktop.systemd1.Manager member=UnitNew cookie=716 reply_cookie=0 signature=so error-name=n/a error-message=n/a
[   77.772791] u-Precision-7960-Tower systemd[1]: Sent message type=signal sender=n/a destination=n/a path=/org/freedesktop/systemd1 interface=org.freedesktop.systemd1.Manager member=UnitRemoved cookie=717 reply_cookie=0 signature=so error-name=n/a error-message=n/a
[   77.808791] u-Precision-7960-Tower systemd[1]: Sent message type=signal sender=n/a destination=n/a path=/org/freedesktop/systemd1 interface=org.freedesktop.systemd1.Manager member=UnitNew cookie=718 reply_cookie=0 signature=so error-name=n/a error-message=n/a
[   77.844795] u-Precision-7960-Tower systemd[1]: Sent message type=signal sender=n/a destination=n/a path=/org/freedesktop/systemd1 interface=org.freedesktop.systemd1.Manager member=UnitRemoved cookie=719 reply_cookie=0 signature=so error-name=n/a error-message=n/a
[   77.880866] u-Precision-7960-Tower systemd[1]: Sent message type=signal sender=n/a destination=n/a path=/org/freedesktop/systemd1 interface=org.freedesktop.systemd1.Manager member=UnitNew cookie=720 reply_cookie=0 signature=so error-name=n/a error-message=n/a
[   77.916817] u-Precision-7960-Tower systemd[1]: Sent message type=signal sender=n/a destination=n/a path=/org/freedesktop/systemd1 interface=org.freedesktop.systemd1.Manager member=JobNew cookie=721 reply_cookie=0 signature=uos error-name=n/a error-message=n/a
[   79.072786] u-Precision-7960-Tower systemd[1]: Sent message type=signal sender=n/a destination=n/a path=/org/freedesktop/systemd1/unit/systemd_2dlocaled_2eservice interface=org.freedesktop.DBus.Properties member=PropertiesChanged cookie=724 reply_cookie=0 signature=sa{sv}as error-name=n/a error-message=n/a
[   79.112778] u-Precision-7960-Tower systemd[1]: Sent message type=signal sender=n/a destination=n/a path=/org/freedesktop/systemd1/unit/systemd_2dlocaled_2eservice interface=org.freedesktop.DBus.Properties member=PropertiesChanged cookie=725 reply_cookie=0 signature=sa{sv}as error-name=n/a error-message=n/a
[   79.188801] u-Precision-7960-Tower systemd[1]: Sent message type=signal sender=n/a destination=n/a path=/org/freedesktop/systemd1/unit/systemd_2dlocaled_2eservice interface=org.freedesktop.DBus.Properties member=PropertiesChanged cookie=726 reply_cookie=0 signature=sa{sv}as error-name=n/a error-message=n/a
[   79.228790] u-Precision-7960-Tower systemd[1]: Sent message type=signal sender=n/a destination=n/a path=/org/freedesktop/systemd1/unit/systemd_2dlocaled_2eservice interface=org.freedesktop.DBus.Properties member=PropertiesChanged cookie=727 reply_cookie=0 signature=sa{sv}as error-name=n/a error-message=n/a
[   79.268794] u-Precision-7960-Tower systemd[1]: Sent message type=signal sender=n/a destination=n/a path=/org/freedesktop/systemd1/job/1915 interface=org.freedesktop.DBus.Properties member=PropertiesChanged cookie=728 reply_cookie=0 signature=sa{sv}as error-name=n/a error-message=n/a
[   81.064804] u-Precision-7960-Tower systemd[1]: Got message type=method_call sender=:1.67 destination=org.freedesktop.systemd1 path=/org/freedesktop/systemd1 interface=org.freedesktop.DBus.Properties member=Get cookie=31 reply_cookie=0 signature=ss error-name=n/a error-message=n/a
[  286.066341] u-Precision-7960-Tower kernel: md123: detected capacity change from 1900396544 to 0
[  286.073500] u-Precision-7960-Tower kernel: md: md123 stopped.
[  286.175794] u-Precision-7960-Tower kernel: md: md124 stopped.


-- 
Chia-Lin Kao(AceLan)
http://blog.acelan.idv.tw/
E-Mail: acelan.kaoATcanonical.com (s/AT/@/)


* Re: Infinite systemd loop when powering off the machine with multiple MD RAIDs
  2023-08-18  8:16 ` Mariusz Tkaczyk
  2023-08-18  9:21   ` Hannes Reinecke
@ 2023-08-22  3:51   ` Guoqing Jiang
  2023-08-22  6:17     ` Song Liu
  1 sibling, 1 reply; 30+ messages in thread
From: Guoqing Jiang @ 2023-08-22  3:51 UTC (permalink / raw)
  To: Mariusz Tkaczyk, Bagas Sanjaya
  Cc: Christoph Hellwig, Song Liu, AceLan Kao,
	Linux Kernel Mailing List, Linux Regressions, Linux RAID



On 8/18/23 16:16, Mariusz Tkaczyk wrote:
> On Wed, 16 Aug 2023 16:37:26 +0700
> Bagas Sanjaya<bagasdotme@gmail.com>  wrote:
>
>> Hi,
>>
>> I notice a regression report on Bugzilla [1]. Quoting from it:
>>
>>> It needs to build at least 2 different RAIDs (e.g. RAID0 and RAID10, or
>>> RAID5 and RAID10), and then you will see the error below repeatedly (you
>>> need to use a serial console to see it)
>>>
>>> [ 205.360738] systemd-shutdown[1]: Stopping MD devices.
>>> [ 205.366384] systemd-shutdown[1]: sd-device-enumerator: Scan all dirs
>>> [ 205.373327] systemd-shutdown[1]: sd-device-enumerator: Scanning /sys/bus
>>> [ 205.380427] systemd-shutdown[1]: sd-device-enumerator: Scanning /sys/class
>>> [ 205.388257] systemd-shutdown[1]: Stopping MD /dev/md127 (9:127).
>>> [ 205.394880] systemd-shutdown[1]: Failed to sync MD block device /dev/md127, ignoring: Input/output error
>>> [ 205.404975] md: md127 stopped.
>>> [ 205.470491] systemd-shutdown[1]: Stopping MD /dev/md126 (9:126).
>>> [ 205.770179] md: md126: resync interrupted.
>>> [ 205.776258] md126: detected capacity change from 1900396544 to 0
>>> [ 205.783349] md: md126 stopped.
>>> [ 205.862258] systemd-shutdown[1]: Stopping MD /dev/md125 (9:125).
>>> [ 205.862435] md: md126 stopped.
>>> [ 205.868376] systemd-shutdown[1]: Failed to sync MD block device /dev/md125, ignoring: Input/output error
>>> [ 205.872845] block device autoloading is deprecated and will be removed.
>>> [ 205.880955] md: md125 stopped.
>>> [ 205.934349] systemd-shutdown[1]: Stopping MD /dev/md124p2 (259:7).
>>> [ 205.947707] systemd-shutdown[1]: Could not stop MD /dev/md124p2: Device or resource busy
>>> [ 205.957004] systemd-shutdown[1]: Stopping MD /dev/md124p1 (259:6).
>>> [ 205.964177] systemd-shutdown[1]: Could not stop MD /dev/md124p1: Device or resource busy
>>> [ 205.973155] systemd-shutdown[1]: Stopping MD /dev/md124 (9:124).
>>> [ 205.979789] systemd-shutdown[1]: Could not stop MD /dev/md124: Device or resource busy
>>> [ 205.988475] systemd-shutdown[1]: Not all MD devices stopped, 4 left.
>> See Bugzilla for the full thread and attached full journalctl log.
>>
>> Anyway, I'm adding this regression to be tracked by regzbot:
>>
>> #regzbot introduced: 12a6caf273240a https://bugzilla.kernel.org/show_bug.cgi?id=217798
>> #regzbot title: systemd shutdown hang on machine with different RAID levels
>>
>> Thanks.
>>
>> [1]:https://bugzilla.kernel.org/show_bug.cgi?id=217798
>>
> Hello,
> The issue is reproducible with IMSM metadata too; around 20% of reboots hang.
> I will try to raise the priority in the bug, because it is a valid high: the
> base functionality of the system is affected.

Since it is reproducible on your side, is it possible to turn the
reproduction steps into a test case, given the importance?

I guess if all arrays are set with the MD_DELETED flag, then reboot might
hang. Not sure whether the below (maybe we need to flush the wq as well,
before the list_del) helps or not, just FYI.

@@ -9566,8 +9566,10 @@ static int md_notify_reboot(struct notifier_block *this,

         spin_lock(&all_mddevs_lock);
         list_for_each_entry_safe(mddev, n, &all_mddevs, all_mddevs) {
-               if (!mddev_get(mddev))
+               if (!mddev_get(mddev)) {
+                       list_del(&mddev->all_mddevs);
                         continue;
+               }

Thanks,
Guoqing


* Re: Infinite systemd loop when powering off the machine with multiple MD RAIDs
  2023-08-22  3:51   ` Guoqing Jiang
@ 2023-08-22  6:17     ` Song Liu
  2023-08-22  6:39       ` Mariusz Tkaczyk
  0 siblings, 1 reply; 30+ messages in thread
From: Song Liu @ 2023-08-22  6:17 UTC (permalink / raw)
  To: Guoqing Jiang
  Cc: Mariusz Tkaczyk, Bagas Sanjaya, Christoph Hellwig, AceLan Kao,
	Linux Kernel Mailing List, Linux Regressions, Linux RAID

On Mon, Aug 21, 2023 at 8:51 PM Guoqing Jiang <guoqing.jiang@linux.dev> wrote:
>
>
>
> On 8/18/23 16:16, Mariusz Tkaczyk wrote:
> > On Wed, 16 Aug 2023 16:37:26 +0700
> > Bagas Sanjaya<bagasdotme@gmail.com>  wrote:
> >
> >> Hi,
> >>
> >> I notice a regression report on Bugzilla [1]. Quoting from it:
> >>
> >>> It needs to build at least 2 different RAIDs (e.g. RAID0 and RAID10, or
> >>> RAID5 and RAID10), and then you will see the error below repeatedly (you
> >>> need to use a serial console to see it)
> >>>
> >>> [ 205.360738] systemd-shutdown[1]: Stopping MD devices.
> >>> [ 205.366384] systemd-shutdown[1]: sd-device-enumerator: Scan all dirs
> >>> [ 205.373327] systemd-shutdown[1]: sd-device-enumerator: Scanning /sys/bus
> >>> [ 205.380427] systemd-shutdown[1]: sd-device-enumerator: Scanning /sys/class
> >>> [ 205.388257] systemd-shutdown[1]: Stopping MD /dev/md127 (9:127).
> >>> [ 205.394880] systemd-shutdown[1]: Failed to sync MD block device /dev/md127, ignoring: Input/output error
> >>> [ 205.404975] md: md127 stopped.
> >>> [ 205.470491] systemd-shutdown[1]: Stopping MD /dev/md126 (9:126).
> >>> [ 205.770179] md: md126: resync interrupted.
> >>> [ 205.776258] md126: detected capacity change from 1900396544 to 0
> >>> [ 205.783349] md: md126 stopped.
> >>> [ 205.862258] systemd-shutdown[1]: Stopping MD /dev/md125 (9:125).
> >>> [ 205.862435] md: md126 stopped.
> >>> [ 205.868376] systemd-shutdown[1]: Failed to sync MD block device /dev/md125, ignoring: Input/output error
> >>> [ 205.872845] block device autoloading is deprecated and will be removed.
> >>> [ 205.880955] md: md125 stopped.
> >>> [ 205.934349] systemd-shutdown[1]: Stopping MD /dev/md124p2 (259:7).
> >>> [ 205.947707] systemd-shutdown[1]: Could not stop MD /dev/md124p2: Device or resource busy
> >>> [ 205.957004] systemd-shutdown[1]: Stopping MD /dev/md124p1 (259:6).
> >>> [ 205.964177] systemd-shutdown[1]: Could not stop MD /dev/md124p1: Device or resource busy
> >>> [ 205.973155] systemd-shutdown[1]: Stopping MD /dev/md124 (9:124).
> >>> [ 205.979789] systemd-shutdown[1]: Could not stop MD /dev/md124: Device or resource busy
> >>> [ 205.988475] systemd-shutdown[1]: Not all MD devices stopped, 4 left.
> >> See Bugzilla for the full thread and attached full journalctl log.
> >>
> >> Anyway, I'm adding this regression to be tracked by regzbot:
> >>
> >> #regzbot introduced: 12a6caf273240a https://bugzilla.kernel.org/show_bug.cgi?id=217798
> >> #regzbot title: systemd shutdown hang on machine with different RAID levels
> >>
> >> Thanks.
> >>
> >> [1]:https://bugzilla.kernel.org/show_bug.cgi?id=217798
> >>
> > Hello,
> > The issue is reproducible with IMSM metadata too; around 20% of reboots hang.
> > I will try to raise the priority in the bug, because it is a valid high: the
> > base functionality of the system is affected.
>
> > Since it is reproducible on your side, is it possible to turn the
> > reproduction steps into a test case, given the importance?
> >
> > I guess if all arrays are set with the MD_DELETED flag, then reboot might
> > hang. Not sure whether the below (maybe we need to flush the wq as well,
> > before the list_del) helps or not, just FYI.
>
> @@ -9566,8 +9566,10 @@ static int md_notify_reboot(struct notifier_block
> *this,
>
>          spin_lock(&all_mddevs_lock);
>          list_for_each_entry_safe(mddev, n, &all_mddevs, all_mddevs) {
> -               if (!mddev_get(mddev))
> +               if (!mddev_get(mddev)) {
> +                       list_del(&mddev->all_mddevs);
>                          continue;
> +               }

I am still not able to reproduce this, probably due to differences in the
timing. Maybe we only need something like:

diff --git i/drivers/md/md.c w/drivers/md/md.c
index 5c3c19b8d509..ebb529b0faf8 100644
--- i/drivers/md/md.c
+++ w/drivers/md/md.c
@@ -9619,8 +9619,10 @@ static int md_notify_reboot(struct notifier_block *this,

        spin_lock(&all_mddevs_lock);
        list_for_each_entry_safe(mddev, n, &all_mddevs, all_mddevs) {
-               if (!mddev_get(mddev))
+               if (!mddev_get(mddev)) {
+                       need_delay = 1;
                        continue;
+               }
                spin_unlock(&all_mddevs_lock);
                if (mddev_trylock(mddev)) {
                        if (mddev->pers)
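
For context, what setting need_delay buys: the tail of md_notify_reboot()
uses it to wait briefly so that arrays which still needed stopping get a
moment to settle before the reboot proceeds (a paraphrased sketch from
drivers/md/md.c, not verbatim):

	/* Some devices are known to be volatile with respect to too-early
	 * system reboots; if any array still needed attention, pause for
	 * a second before letting the reboot continue. */
	if (need_delay)
		mdelay(1000 * 1);

	return NOTIFY_DONE;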


Thanks,
Song


* Re: Infinite systemd loop when powering off the machine with multiple MD RAIDs
  2023-08-22  6:17     ` Song Liu
@ 2023-08-22  6:39       ` Mariusz Tkaczyk
  2023-08-22  8:13         ` AceLan Kao
  0 siblings, 1 reply; 30+ messages in thread
From: Mariusz Tkaczyk @ 2023-08-22  6:39 UTC (permalink / raw)
  To: Song Liu
  Cc: Guoqing Jiang, Bagas Sanjaya, Christoph Hellwig, AceLan Kao,
	Linux Kernel Mailing List, Linux Regressions, Linux RAID

On Mon, 21 Aug 2023 23:17:54 -0700
Song Liu <song@kernel.org> wrote:

> On Mon, Aug 21, 2023 at 8:51 PM Guoqing Jiang <guoqing.jiang@linux.dev> wrote:
> >
> >
> >
> > On 8/18/23 16:16, Mariusz Tkaczyk wrote:  
> > > On Wed, 16 Aug 2023 16:37:26 +0700
> > > Bagas Sanjaya<bagasdotme@gmail.com>  wrote:
> > >  
> > >> Hi,
> > >>
> > >> I notice a regression report on Bugzilla [1]. Quoting from it:
> > >>  
> > >>> It needs to build at least 2 different RAIDs (e.g. RAID0 and RAID10, or
> > >>> RAID5 and RAID10), and then you will see the error below repeatedly (you
> > >>> need to use a serial console to see it)
> > >>>
> > >>> [ 205.360738] systemd-shutdown[1]: Stopping MD devices.
> > >>> [ 205.366384] systemd-shutdown[1]: sd-device-enumerator: Scan all dirs
> > >>> [ 205.373327] systemd-shutdown[1]: sd-device-enumerator: Scanning /sys/bus
> > >>> [ 205.380427] systemd-shutdown[1]: sd-device-enumerator: Scanning /sys/class
> > >>> [ 205.388257] systemd-shutdown[1]: Stopping MD /dev/md127 (9:127).
> > >>> [ 205.394880] systemd-shutdown[1]: Failed to sync MD block device /dev/md127, ignoring: Input/output error
> > >>> [ 205.404975] md: md127 stopped.
> > >>> [ 205.470491] systemd-shutdown[1]: Stopping MD /dev/md126 (9:126).
> > >>> [ 205.770179] md: md126: resync interrupted.
> > >>> [ 205.776258] md126: detected capacity change from 1900396544 to 0
> > >>> [ 205.783349] md: md126 stopped.
> > >>> [ 205.862258] systemd-shutdown[1]: Stopping MD /dev/md125 (9:125).
> > >>> [ 205.862435] md: md126 stopped.
> > >>> [ 205.868376] systemd-shutdown[1]: Failed to sync MD block device /dev/md125, ignoring: Input/output error
> > >>> [ 205.872845] block device autoloading is deprecated and will be removed.
> > >>> [ 205.880955] md: md125 stopped.
> > >>> [ 205.934349] systemd-shutdown[1]: Stopping MD /dev/md124p2 (259:7).
> > >>> [ 205.947707] systemd-shutdown[1]: Could not stop MD /dev/md124p2: Device or resource busy
> > >>> [ 205.957004] systemd-shutdown[1]: Stopping MD /dev/md124p1 (259:6).
> > >>> [ 205.964177] systemd-shutdown[1]: Could not stop MD /dev/md124p1: Device or resource busy
> > >>> [ 205.973155] systemd-shutdown[1]: Stopping MD /dev/md124 (9:124).
> > >>> [ 205.979789] systemd-shutdown[1]: Could not stop MD /dev/md124: Device or resource busy
> > >>> [ 205.988475] systemd-shutdown[1]: Not all MD devices stopped, 4 left.
> > >> See Bugzilla for the full thread and attached full journalctl log.
> > >>
> > >> Anyway, I'm adding this regression to be tracked by regzbot:
> > >>
> > >> #regzbot introduced: 12a6caf273240a https://bugzilla.kernel.org/show_bug.cgi?id=217798
> > >> #regzbot title: systemd shutdown hang on machine with different RAID levels
> > >>
> > >> Thanks.
> > >>
> > >> [1]:https://bugzilla.kernel.org/show_bug.cgi?id=217798
> > >>  
> > > Hello,
> > > The issue is reproducible with IMSM metadata too; around 20% of reboots
> > > hang. I will try to raise the priority in the bug, because it is a valid
> > > high: the base functionality of the system is affected.
> >
> > Since it is reproducible on your side, is it possible to turn the
> > reproduction steps into a test case, given the importance?

I didn't try to reproduce it locally yet, because the customer was able to
bisect the regression and it pointed them to the same patch, so I connected
it and asked the author to take a look first. At first glance, I wanted to
get the community's voice to see whether it is something obvious.

As far as I know, the customer is creating 3 IMSM RAID arrays, one of which
is the system volume, and doing a reboot; it sporadically fails (around 20%).
That is all.

> >
> > I guess if all arrays are set with the MD_DELETED flag, then reboot might
> > hang. Not sure whether the below (maybe we need to flush the wq as well,
> > before the list_del) helps or not, just FYI.
> >
> > @@ -9566,8 +9566,10 @@ static int md_notify_reboot(struct notifier_block *this,
> >
> >          spin_lock(&all_mddevs_lock);
> >          list_for_each_entry_safe(mddev, n, &all_mddevs, all_mddevs) {
> > -               if (!mddev_get(mddev))
> > +               if (!mddev_get(mddev)) {
> > +                       list_del(&mddev->all_mddevs);
> >                          continue;
> > +               }  
> 
> I am still not able to reproduce this, probably due to differences in the
> timing. Maybe we only need something like:
> 
> diff --git i/drivers/md/md.c w/drivers/md/md.c
> index 5c3c19b8d509..ebb529b0faf8 100644
> --- i/drivers/md/md.c
> +++ w/drivers/md/md.c
> @@ -9619,8 +9619,10 @@ static int md_notify_reboot(struct notifier_block *this,
> 
>         spin_lock(&all_mddevs_lock);
>         list_for_each_entry_safe(mddev, n, &all_mddevs, all_mddevs) {
> -               if (!mddev_get(mddev))
> +               if (!mddev_get(mddev)) {
> +                       need_delay = 1;
>                         continue;
> +               }
>                 spin_unlock(&all_mddevs_lock);
>                 if (mddev_trylock(mddev)) {
>                         if (mddev->pers)
> 
> 
> Thanks,
> Song

I will try to reproduce issue at Intel lab to check this.

Thanks,
Mariusz


* Re: Infinite systemd loop when powering off the machine with multiple MD RAIDs
  2023-08-22  6:39       ` Mariusz Tkaczyk
@ 2023-08-22  8:13         ` AceLan Kao
  2023-08-22 12:41           ` Guoqing Jiang
  0 siblings, 1 reply; 30+ messages in thread
From: AceLan Kao @ 2023-08-22  8:13 UTC (permalink / raw)
  To: Mariusz Tkaczyk
  Cc: Song Liu, Guoqing Jiang, Bagas Sanjaya, Christoph Hellwig,
	Linux Kernel Mailing List, Linux Regressions, Linux RAID

Mariusz Tkaczyk <mariusz.tkaczyk@linux.intel.com> wrote on Tue, Aug 22, 2023 at 2:39 PM:
>
> On Mon, 21 Aug 2023 23:17:54 -0700
> Song Liu <song@kernel.org> wrote:
>
> > On Mon, Aug 21, 2023 at 8:51 PM Guoqing Jiang <guoqing.jiang@linux.dev> wrote:
> > >
> > >
> > >
> > > On 8/18/23 16:16, Mariusz Tkaczyk wrote:
> > > > On Wed, 16 Aug 2023 16:37:26 +0700
> > > > Bagas Sanjaya<bagasdotme@gmail.com>  wrote:
> > > >
> > > >> Hi,
> > > >>
> > > >> I notice a regression report on Bugzilla [1]. Quoting from it:
> > > >>
> > > >>> It needs to build at least 2 different RAIDs (e.g. RAID0 and RAID10,
> > > >>> or RAID5 and RAID10), and then you will see the error below repeatedly
> > > >>> (you need to use a serial console to see it)
> > > >>>
> > > >>> [ 205.360738] systemd-shutdown[1]: Stopping MD devices.
> > > >>> [ 205.366384] systemd-shutdown[1]: sd-device-enumerator: Scan all dirs
> > > >>> [ 205.373327] systemd-shutdown[1]: sd-device-enumerator: Scanning /sys/bus
> > > >>> [ 205.380427] systemd-shutdown[1]: sd-device-enumerator: Scanning /sys/class
> > > >>> [ 205.388257] systemd-shutdown[1]: Stopping MD /dev/md127 (9:127).
> > > >>> [ 205.394880] systemd-shutdown[1]: Failed to sync MD block device /dev/md127, ignoring: Input/output error
> > > >>> [ 205.404975] md: md127 stopped.
> > > >>> [ 205.470491] systemd-shutdown[1]: Stopping MD /dev/md126 (9:126).
> > > >>> [ 205.770179] md: md126: resync interrupted.
> > > >>> [ 205.776258] md126: detected capacity change from 1900396544 to 0
> > > >>> [ 205.783349] md: md126 stopped.
> > > >>> [ 205.862258] systemd-shutdown[1]: Stopping MD /dev/md125 (9:125).
> > > >>> [ 205.862435] md: md126 stopped.
> > > >>> [ 205.868376] systemd-shutdown[1]: Failed to sync MD block device /dev/md125, ignoring: Input/output error
> > > >>> [ 205.872845] block device autoloading is deprecated and will be removed.
> > > >>> [ 205.880955] md: md125 stopped.
> > > >>> [ 205.934349] systemd-shutdown[1]: Stopping MD /dev/md124p2 (259:7).
> > > >>> [ 205.947707] systemd-shutdown[1]: Could not stop MD /dev/md124p2: Device or resource busy
> > > >>> [ 205.957004] systemd-shutdown[1]: Stopping MD /dev/md124p1 (259:6).
> > > >>> [ 205.964177] systemd-shutdown[1]: Could not stop MD /dev/md124p1: Device or resource busy
> > > >>> [ 205.973155] systemd-shutdown[1]: Stopping MD /dev/md124 (9:124).
> > > >>> [ 205.979789] systemd-shutdown[1]: Could not stop MD /dev/md124: Device or resource busy
> > > >>> [ 205.988475] systemd-shutdown[1]: Not all MD devices stopped, 4 left.
> > > >> See Bugzilla for the full thread and attached full journalctl log.
> > > >>
> > > >> Anyway, I'm adding this regression to be tracked by regzbot:
> > > >>
> > > >> #regzbot introduced: 12a6caf273240a https://bugzilla.kernel.org/show_bug.cgi?id=217798
> > > >> #regzbot title: systemd shutdown hang on machine with different RAID levels
> > > >>
> > > >> Thanks.
> > > >>
> > > >> [1]:https://bugzilla.kernel.org/show_bug.cgi?id=217798
> > > >>
> > > > Hello,
> > > > The issue is reproducible with IMSM metadata too; around 20% of reboots
> > > > hang. I will try to raise the priority in the bug, because it is a valid
> > > > high: the base functionality of the system is affected.
> > >
> > > Since it is reproducible on your side, is it possible to turn the
> > > reproduction steps into a test case, given the importance?
>
> I didn't try to reproduce it locally yet, because the customer was able to
> bisect the regression and it pointed them to the same patch, so I connected
> it and asked the author to take a look first. At first glance, I wanted to
> get the community's voice to see whether it is something obvious.
>
> As far as I know, the customer is creating 3 IMSM RAID arrays, one of which
> is the system volume, and doing a reboot; it sporadically fails (around 20%).
> That is all.
>
> > >
> > > I guess if all arrays are set with the MD_DELETED flag, then reboot might
> > > hang. Not sure whether the below (maybe we need to flush the wq as well,
> > > before the list_del) helps or not, just FYI.
> > >
> > > @@ -9566,8 +9566,10 @@ static int md_notify_reboot(struct notifier_block *this,
> > >
> > >          spin_lock(&all_mddevs_lock);
> > >          list_for_each_entry_safe(mddev, n, &all_mddevs, all_mddevs) {
> > > -               if (!mddev_get(mddev))
> > > +               if (!mddev_get(mddev)) {
> > > +                       list_del(&mddev->all_mddevs);
> > >                          continue;
> > > +               }
> >
> > I am still not able to reproduce this, probably due to differences in the
> > timing. Maybe we only need something like:
> >
> > diff --git i/drivers/md/md.c w/drivers/md/md.c
> > index 5c3c19b8d509..ebb529b0faf8 100644
> > --- i/drivers/md/md.c
> > +++ w/drivers/md/md.c
> > @@ -9619,8 +9619,10 @@ static int md_notify_reboot(struct notifier_block *this,
> >
> >         spin_lock(&all_mddevs_lock);
> >         list_for_each_entry_safe(mddev, n, &all_mddevs, all_mddevs) {
> > -               if (!mddev_get(mddev))
> > +               if (!mddev_get(mddev)) {
> > +                       need_delay = 1;
> >                         continue;
> > +               }
> >                 spin_unlock(&all_mddevs_lock);
> >                 if (mddev_trylock(mddev)) {
> >                         if (mddev->pers)
> >
> >
> > Thanks,
> > Song
>
> I will try to reproduce issue at Intel lab to check this.
>
> Thanks,
> Mariusz
Hi Guoqing,

Here is the command I use to trigger the issue; I have to run it around 10
times to make sure the issue reproduces:

echo "repair" | sudo tee /sys/class/block/md12?/md/sync_action &&
sudo grub-reboot "Advanced options for Ubuntu>Ubuntu, with Linux 6.5.0-rc7706a74159504-dirty" &&
head -c 1G < /dev/urandom > myfile1 && sleep 180 &&
head -c 1G < /dev/urandom > myfile2 && sleep 1 &&
cat /proc/mdstat && sleep 1 && rm myfile1 &&
sudo reboot

And the patch to add need_delay doesn't work.



--
Chia-Lin Kao(AceLan)
http://blog.acelan.idv.tw/
E-Mail: acelan.kaoATcanonical.com (s/AT/@/)


* Re: Infinite systemd loop when powering off the machine with multiple MD RAIDs
  2023-08-22  8:13         ` AceLan Kao
@ 2023-08-22 12:41           ` Guoqing Jiang
  2023-08-23  8:02             ` AceLan Kao
  0 siblings, 1 reply; 30+ messages in thread
From: Guoqing Jiang @ 2023-08-22 12:41 UTC (permalink / raw)
  To: AceLan Kao
  Cc: Song Liu, Mariusz Tkaczyk, Bagas Sanjaya, Christoph Hellwig,
	Linux Kernel Mailing List, Linux Regressions, Linux RAID

Hi Acelan,

On 8/22/23 16:13, AceLan Kao wrote:
>>>>> Hello,
>>>>> The issue is reproducible with IMSM metadata too; around 20% of reboots
>>>>> hang. I will try to raise the priority in the bug, because it is a valid
>>>>> high: the base functionality of the system is affected.
>>>> Since it is reproducible on your side, is it possible to turn the
>>>> reproduction steps into a test case, given the importance?
>> I didn't try to reproduce it locally yet, because the customer was able to
>> bisect the regression and it pointed them to the same patch, so I connected
>> it and asked the author to take a look first. At first glance, I wanted to
>> get the community's voice to see whether it is something obvious.
>>
>> As far as I know, the customer is creating 3 IMSM RAID arrays, one of which
>> is the system volume, and doing a reboot; it sporadically fails (around 20%).
>> That is all.
>>
>>>> I guess if all arrays are set with the MD_DELETED flag, then reboot might
>>>> hang. Not sure whether the below (maybe we need to flush the wq as well,
>>>> before the list_del) helps or not, just FYI.
>>>>
>>>> @@ -9566,8 +9566,10 @@ static int md_notify_reboot(struct notifier_block *this,
>>>>
>>>>           spin_lock(&all_mddevs_lock);
>>>>           list_for_each_entry_safe(mddev, n, &all_mddevs, all_mddevs) {
>>>> -               if (!mddev_get(mddev))
>>>> +               if (!mddev_get(mddev)) {
>>>> +                       list_del(&mddev->all_mddevs);
>>>>                           continue;
>>>> +               }

My suggestion is to delete the list node in this scenario; did you try the
above?

>>> I am still not able to reproduce this, probably due to differences in the
>>> timing. Maybe we only need something like:
>>>
>>> diff --git i/drivers/md/md.c w/drivers/md/md.c
>>> index 5c3c19b8d509..ebb529b0faf8 100644
>>> --- i/drivers/md/md.c
>>> +++ w/drivers/md/md.c
>>> @@ -9619,8 +9619,10 @@ static int md_notify_reboot(struct notifier_block *this,
>>>
>>>          spin_lock(&all_mddevs_lock);
>>>          list_for_each_entry_safe(mddev, n, &all_mddevs, all_mddevs) {
>>> -               if (!mddev_get(mddev))
>>> +               if (!mddev_get(mddev)) {
>>> +                       need_delay = 1;
>>>                          continue;
>>> +               }
>>>                  spin_unlock(&all_mddevs_lock);
>>>                  if (mddev_trylock(mddev)) {
>>>                          if (mddev->pers)
>>>
>>>
>>> Thanks,
>>> Song
>> I will try to reproduce issue at Intel lab to check this.
>>
>> Thanks,
>> Mariusz
> Hi Guoqing,
>
> Here is the command I use to trigger the issue; I have to run it around 10
> times to make sure the issue reproduces:
>
> echo "repair" | sudo tee /sys/class/block/md12?/md/sync_action &&
> sudo grub-reboot "Advanced options for Ubuntu>Ubuntu, with Linux 6.5.0-rc7706a74159504-dirty" &&
> head -c 1G < /dev/urandom > myfile1 && sleep 180 &&
> head -c 1G < /dev/urandom > myfile2 && sleep 1 &&
> cat /proc/mdstat && sleep 1 && rm myfile1 &&
> sudo reboot

Is the issue still reproducible if you remove the below from the command?

echo "repair" | sudo tee /sys/class/block/md12?/md/sync_action

Just want to know whether the resync thread is related to the issue or not.

> And the patch to add need_delay doesn't work.

My assumption is that mddev_get() always returns NULL, so setting need_delay
wouldn't help.
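
For reference, after commit 12a6caf273240a mddev_get() bails out on deleted
arrays roughly like this (a paraphrased sketch of drivers/md/md.c, not
verbatim):

static inline struct mddev *mddev_get(struct mddev *mddev)
{
	lockdep_assert_held(&all_mddevs_lock);

	/* A deleted array stays on all_mddevs until it is finally freed,
	 * but no new references may be taken, so callers see NULL. */
	if (test_bit(MD_DELETED, &mddev->flags))
		return NULL;
	atomic_inc(&mddev->active);
	return mddev;
}

If every array is in that state, md_notify_reboot() skips them all, which
would be consistent with the need_delay change not helping.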

Thanks,
Guoqing


* Re: Infinite systemd loop when powering off the machine with multiple MD RAIDs
  2023-08-16  9:37 Fwd: Infinite systemd loop when powering off the machine with multiple MD RAIDs Bagas Sanjaya
  2023-08-18  8:16 ` Mariusz Tkaczyk
  2023-08-21 13:18 ` Fwd: " Yu Kuai
@ 2023-08-22 18:56 ` Song Liu
  2023-08-22 19:13   ` Carlos Carvalho
  2 siblings, 1 reply; 30+ messages in thread
From: Song Liu @ 2023-08-22 18:56 UTC (permalink / raw)
  To: Bagas Sanjaya
  Cc: Christoph Hellwig, AceLan Kao, Linux Kernel Mailing List,
	Linux Regressions, Linux RAID

On Wed, Aug 16, 2023 at 2:37 AM Bagas Sanjaya <bagasdotme@gmail.com> wrote:
>
> Hi,
>
> I notice a regression report on Bugzilla [1]. Quoting from it:
>
> > It needs to build at least 2 different RAIDs (e.g. RAID0 and RAID10, or RAID5 and RAID10), and then you will see the error below repeatedly (you need to use a serial console to see it)
> >
> > [ 205.360738] systemd-shutdown[1]: Stopping MD devices.
> > [ 205.366384] systemd-shutdown[1]: sd-device-enumerator: Scan all dirs
> > [ 205.373327] systemd-shutdown[1]: sd-device-enumerator: Scanning /sys/bus
> > [ 205.380427] systemd-shutdown[1]: sd-device-enumerator: Scanning /sys/class
> > [ 205.388257] systemd-shutdown[1]: Stopping MD /dev/md127 (9:127).
> > [ 205.394880] systemd-shutdown[1]: Failed to sync MD block device /dev/md127, ignoring: Input/output error
> > [ 205.404975] md: md127 stopped.
> > [ 205.470491] systemd-shutdown[1]: Stopping MD /dev/md126 (9:126).
> > [ 205.770179] md: md126: resync interrupted.
> > [ 205.776258] md126: detected capacity change from 1900396544 to 0
> > [ 205.783349] md: md126 stopped.
> > [ 205.862258] systemd-shutdown[1]: Stopping MD /dev/md125 (9:125).
> > [ 205.862435] md: md126 stopped.
> > [ 205.868376] systemd-shutdown[1]: Failed to sync MD block device /dev/md125, ignoring: Input/output error
> > [ 205.872845] block device autoloading is deprecated and will be removed.
> > [ 205.880955] md: md125 stopped.
> > [ 205.934349] systemd-shutdown[1]: Stopping MD /dev/md124p2 (259:7).
> > [ 205.947707] systemd-shutdown[1]: Could not stop MD /dev/md124p2: Device or resource busy
> > [ 205.957004] systemd-shutdown[1]: Stopping MD /dev/md124p1 (259:6).
> > [ 205.964177] systemd-shutdown[1]: Could not stop MD /dev/md124p1: Device or resource busy
> > [ 205.973155] systemd-shutdown[1]: Stopping MD /dev/md124 (9:124).
> > [ 205.979789] systemd-shutdown[1]: Could not stop MD /dev/md124: Device or resource busy
> > [ 205.988475] systemd-shutdown[1]: Not all MD devices stopped, 4 left.

From the systemd code, i.e. the delete_md() function, this error:

[ 205.957004] systemd-shutdown[1]: Stopping MD /dev/md124p1 (259:6).
[ 205.964177] systemd-shutdown[1]: Could not stop MD /dev/md124p1:
Device or resource busy

is most likely triggered by ioctl(STOP_ARRAY).

And based on the code, I think the ioctl fails here:

        if (cmd == STOP_ARRAY || cmd == STOP_ARRAY_RO) {
                /* Need to flush page cache, and ensure no-one else opens
                 * and writes
                 */
                mutex_lock(&mddev->open_mutex);
                if (mddev->pers && atomic_read(&mddev->openers) > 1) {
                        mutex_unlock(&mddev->open_mutex);
                        err = -EBUSY;
                        goto out;        ////////////////////// HERE
                }
                if (test_and_set_bit(MD_CLOSING, &mddev->flags)) {
                        mutex_unlock(&mddev->open_mutex);
                        err = -EBUSY;
                        goto out;
                }
                did_set_md_closing = true;
                mutex_unlock(&mddev->open_mutex);
                sync_blockdev(bdev);
        }
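
For anyone who wants to poke at this path directly, here is a minimal
userspace sketch of what systemd-shutdown's delete_md() boils down to (the
device path is just an example; systemd also syncs the device first, and its
exact open flags are omitted here):

#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/major.h>
#include <linux/raid/md_u.h>

int main(void)
{
	/* Open the array node and ask the md driver to stop it; an EBUSY
	 * here corresponds to the "Could not stop MD" messages above. */
	int fd = open("/dev/md124", O_RDONLY | O_CLOEXEC);

	if (fd < 0) {
		perror("open");
		return 1;
	}
	if (ioctl(fd, STOP_ARRAY, NULL) < 0)
		perror("ioctl(STOP_ARRAY)");
	close(fd);
	return 0;
}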

>
> See Bugzilla for the full thread and attached full journalctl log.
>
> Anyway, I'm adding this regression to be tracked by regzbot:
>
> #regzbot introduced: 12a6caf273240a https://bugzilla.kernel.org/show_bug.cgi?id=217798
> #regzbot title: systemd shutdown hang on machine with different RAID levels

But the observation above doesn't seem to match the bisect result
and it doesn't seem to be related to different RAID levels.

Thanks,
Song

>
> Thanks.
>
> [1]: https://bugzilla.kernel.org/show_bug.cgi?id=217798
>
> --
> An old man doll... just what I always wanted! - Clara


* Re: Infinite systemd loop when powering off the machine with multiple MD RAIDs
  2023-08-22 18:56 ` Song Liu
@ 2023-08-22 19:13   ` Carlos Carvalho
  2023-08-23  1:28     ` Yu Kuai
  0 siblings, 1 reply; 30+ messages in thread
From: Carlos Carvalho @ 2023-08-22 19:13 UTC (permalink / raw)
  To: Song Liu
  Cc: Bagas Sanjaya, Christoph Hellwig, AceLan Kao,
	Linux Kernel Mailing List, Linux Regressions, Linux RAID

Song Liu (song@kernel.org) wrote on Tue, Aug 22, 2023 at 03:56:04PM -03:
> From the systemd code, i.e. the delete_md() function, this error:
> 
> [ 205.957004] systemd-shutdown[1]: Stopping MD /dev/md124p1 (259:6).
> [ 205.964177] systemd-shutdown[1]: Could not stop MD /dev/md124p1:
> Device or resource busy
> 
> is most likely triggered by ioctl(STOP_ARRAY).
> 
> And based on the code, I think the ioctl fails here:
> 
>         if (cmd == STOP_ARRAY || cmd == STOP_ARRAY_RO) {
>                 /* Need to flush page cache, and ensure no-one else opens
>                  * and writes
>                  */
>                 mutex_lock(&mddev->open_mutex);
>                 if (mddev->pers && atomic_read(&mddev->openers) > 1) {
>                         mutex_unlock(&mddev->open_mutex);
>                         err = -EBUSY;
>                         goto out;        ////////////////////// HERE
>                 }
>                 if (test_and_set_bit(MD_CLOSING, &mddev->flags)) {
>                         mutex_unlock(&mddev->open_mutex);
>                         err = -EBUSY;
>                         goto out;
>                 }
>                 did_set_md_closing = true;
>                 mutex_unlock(&mddev->open_mutex);
>                 sync_blockdev(bdev);
>         }

Probably. The question is why it doesn't manage to flush the page cache. I
find it strange that the problem appears only when trying to stop the array;
I get it when trying to umount the filesystem, where it also hangs for the
same reason. The kworker thread runs continuously, using 100% of a single
CPU core.

These are all similar symptoms of the underlying problem that I complained
about days ago: the system doesn't manage to write to the disks, which stay
nearly idle. If you wait long enough without issuing writes, which can be
several hours, it'll eventually flush the page cache and proceed to a
"normal" umount or reboot.

The bug depends on the rate of writes and also on uptime; it rarely appears
soon after boot, and it appears more often as time passes.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Infinite systemd loop when powering off the machine with multiple MD RAIDs
  2023-08-22 19:13   ` Carlos Carvalho
@ 2023-08-23  1:28     ` Yu Kuai
  2023-08-23  6:04       ` Hannes Reinecke
  0 siblings, 1 reply; 30+ messages in thread
From: Yu Kuai @ 2023-08-23  1:28 UTC (permalink / raw)
  To: Carlos Carvalho, Song Liu
  Cc: Bagas Sanjaya, Christoph Hellwig, AceLan Kao,
	Linux Kernel Mailing List, Linux Regressions, Linux RAID,
	yukuai (C), yangerkun@huawei.com

Hi,

On 2023/08/23 3:13, Carlos Carvalho wrote:
> Song Liu (song@kernel.org) wrote on Tue, Aug 22, 2023 at 03:56:04PM -03:
>> From systemd code, i.e. function delete_md(), this error:
>>
>> [ 205.957004] systemd-shutdown[1]: Stopping MD /dev/md124p1 (259:6).
>> [ 205.964177] systemd-shutdown[1]: Could not stop MD /dev/md124p1:
>> Device or resource busy
>>
>> is most likely triggered by ioctl(STOP_ARRAY).
>>
>> And based on the code, I think the ioctl fails here:
>>
>>          if (cmd == STOP_ARRAY || cmd == STOP_ARRAY_RO) {
>>                  /* Need to flush page cache, and ensure no-one else opens
>>                   * and writes
>>                   */
>>                  mutex_lock(&mddev->open_mutex);
>>                  if (mddev->pers && atomic_read(&mddev->openers) > 1) {
>>                          mutex_unlock(&mddev->open_mutex);
>>                          err = -EBUSY;
>>                          goto out;        ////////////////////// HERE

Yes, I suspect here as well, but I did some tests with error injection to
guarantee that ioctl(STOP_ARRAY) always returns -EBUSY, and I found that
the system reboot didn't hang: it retries a few times but the reboot
eventually finishes.
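
A minimal sketch of that injection, assuming a local test hack in md_ioctl()
rather than a proper fault-injection framework:

        /* test hack (not upstream): make STOP_ARRAY always fail, so the
         * caller sees -EBUSY for every array it tries to stop */
        if (cmd == STOP_ARRAY || cmd == STOP_ARRAY_RO) {
                err = -EBUSY;
                goto out;
        }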

>>                  }
>>                  if (test_and_set_bit(MD_CLOSING, &mddev->flags)) {
>>                          mutex_unlock(&mddev->open_mutex);
>>                          err = -EBUSY;
>>                          goto out;
>>                  }
>>                  did_set_md_closing = true;
>>                  mutex_unlock(&mddev->open_mutex);
>>                  sync_blockdev(bdev);
>>          }
> 
> Probably. The problem is why it doesn't manage to flush the page cache. I find
> it strange that the problem appears only when trying to stop the array; I get it
> when trying to umount the filesystem, where it also hangs for the same
> reason. The kworker thread runs continuously, using 100% CPU of only 1 core.

By "the kworker", do you mean the daemon thread or the sync thread? Running
continuously at 100% CPU doesn't sound correct to me.

Thanks,
Kuai

> 
> These are all similar symptoms of the underlying problem which I complained
> about days ago: the system doesn't manage to write to the disks, which stay
> nearly idle. If you wait long enough without issuing writes, which can be
> several hours, it'll eventually flush the page cache and proceed to a "normal"
> umount or reboot.
> 
> The bug depends on the rate of writes and also on uptime; it rarely
> appears soon after boot, and its frequency increases as time passes.
> 
> .
> 


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Infinite systemd loop when powering off the machine with multiple MD RAIDs
  2023-08-23  1:28     ` Yu Kuai
@ 2023-08-23  6:04       ` Hannes Reinecke
  0 siblings, 0 replies; 30+ messages in thread
From: Hannes Reinecke @ 2023-08-23  6:04 UTC (permalink / raw)
  To: Yu Kuai, Carlos Carvalho, Song Liu
  Cc: Bagas Sanjaya, Christoph Hellwig, AceLan Kao,
	Linux Kernel Mailing List, Linux Regressions, Linux RAID,
	yukuai (C), yangerkun@huawei.com

On 8/23/23 03:28, Yu Kuai wrote:
> Hi,
> 
> On 2023/08/23 3:13, Carlos Carvalho wrote:
>> Song Liu (song@kernel.org) wrote on Tue, Aug 22, 2023 at 03:56:04PM -03:
>>> From systemd code, i.e. function delete_md(), this error:
>>>
>>> [ 205.957004] systemd-shutdown[1]: Stopping MD /dev/md124p1 (259:6).
>>> [ 205.964177] systemd-shutdown[1]: Could not stop MD /dev/md124p1:
>>> Device or resource busy
>>>
>>> is most likely triggered by ioctl(STOP_ARRAY).
>>>
>>> And based on the code, I think the ioctl fails here:
>>>
>>>          if (cmd == STOP_ARRAY || cmd == STOP_ARRAY_RO) {
>>>                  /* Need to flush page cache, and ensure no-one else 
>>> opens
>>>                   * and writes
>>>                   */
>>>                  mutex_lock(&mddev->open_mutex);
>>>                  if (mddev->pers && atomic_read(&mddev->openers) > 1) {
>>>                          mutex_unlock(&mddev->open_mutex);
>>>                          err = -EBUSY;
>>>                          goto out;        ////////////////////// HERE
> 
> Yes, I suspect here as well, but I did some tests with error injection to
> guarantee that ioctl(STOP_ARRAY) always returns -EBUSY, and I found that
> the system reboot didn't hang: it retries a few times but the reboot
> eventually finishes.
> 
>>>                  }
>>>                  if (test_and_set_bit(MD_CLOSING, &mddev->flags)) {
>>>                          mutex_unlock(&mddev->open_mutex);
>>>                          err = -EBUSY;
>>>                          goto out;
>>>                  }
>>>                  did_set_md_closing = true;
>>>                  mutex_unlock(&mddev->open_mutex);
>>>                  sync_blockdev(bdev);
>>>          }
>>
>> Probably. The problem is why doesn't it manage to flush the page 
>> cache? I find
>> strange that the problem appears only when trying to stop the array, I 
>> get it
>> when trying to umount the filesystem, where it also hangs because of 
>> the same
>> reason. The kworker thread runs continuously using 100% cpu of only 1 
>> core.
> 
> By "the kworker", do you mean the daemon thread or the sync thread? Running
> continuously at 100% CPU doesn't sound correct to me.
> 
That sounds more like a kworker waiting on a mutex to be released.
Can you trigger a kdump while in that state?
Or do a sysrq-T to get a stack trace of all running processes?
That should tell you exactly where the 100% CPU time is being
wasted.
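
If no console keyboard is available, sysrq-T can also be triggered
programmatically; a minimal sketch (needs root and CONFIG_MAGIC_SYSRQ):

        /* equivalent of `echo t > /proc/sysrq-trigger`: dumps the stack
         * of every task to the kernel log */
        #include <fcntl.h>
        #include <unistd.h>

        int main(void)
        {
                int fd = open("/proc/sysrq-trigger", O_WRONLY);

                if (fd < 0)
                        return 1;
                write(fd, "t", 1);
                close(fd);
                return 0;
        }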

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                Kernel Storage Architect
hare@suse.de                              +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Ivo Totev, Andrew
Myers, Andrew McDonald, Martje Boudien Moerman


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Infinite systemd loop when powering off the machine with multiple MD RAIDs
  2023-08-22 12:41           ` Guoqing Jiang
@ 2023-08-23  8:02             ` AceLan Kao
  2023-08-23 13:25               ` Song Liu
  0 siblings, 1 reply; 30+ messages in thread
From: AceLan Kao @ 2023-08-23  8:02 UTC (permalink / raw)
  To: Guoqing Jiang
  Cc: Song Liu, Mariusz Tkaczyk, Bagas Sanjaya, Christoph Hellwig,
	Linux Kernel Mailing List, Linux Regressions, Linux RAID

Hi Guoqing,

Guoqing Jiang <guoqing.jiang@linux.dev> wrote on Tue, Aug 22, 2023 at 8:41 PM:
>
> Hi Acelan,
>
> On 8/22/23 16:13, AceLan Kao wrote:
> >>>>> Hello,
> >>>>> The issue is reproducible with IMSM metadata too; around 20% of reboots
> >>>>> hang. I will try to raise the priority in the bug because it is a valid
> >>>>> high: the base functionality of the system is affected.
> >>>> Since it is reproducible on your side, is it possible to turn the
> >>>> reproduction steps into a test case,
> >>>> given the importance?
> >> I didn't try to reproduce it locally yet because the customer was able to
> >> bisect the regression and it pointed them to the same patch, so I connected it
> >> and asked the author to take a look first. At first glance, I wanted to get
> >> the community's voice to see whether it is something obvious.
> >>
> >> As far as I know, the customer creates 3 IMSM RAID arrays (one is the system
> >> volume), does a reboot, and it sporadically fails (around 20%). That is all.
> >>
> >>>> I guess if all arrays have the MD_DELETED flag set, then reboot might
> >>>> hang. Not sure whether the
> >>>> below (we may need to flush the wq as well before list_del) helps or not,
> >>>> just FYI.
> >>>>
> >>>> @@ -9566,8 +9566,10 @@ static int md_notify_reboot(struct notifier_block
> >>>> *this,
> >>>>
> >>>>           spin_lock(&all_mddevs_lock);
> >>>>           list_for_each_entry_safe(mddev, n, &all_mddevs, all_mddevs) {
> >>>> -               if (!mddev_get(mddev))
> >>>> +               if (!mddev_get(mddev)) {
> >>>> +                       list_del(&mddev->all_mddevs);
> >>>>                           continue;
> >>>> +               }
>
> My suggestion is to delete the list node in this scenario; did you try the
> above?
Still no luck; the patch doesn't work and the symptom is the same.

>
> >>> I am still not able to reproduce this, probably due to differences in the
> >>> timing. Maybe we only need something like:
> >>>
> >>> diff --git i/drivers/md/md.c w/drivers/md/md.c
> >>> index 5c3c19b8d509..ebb529b0faf8 100644
> >>> --- i/drivers/md/md.c
> >>> +++ w/drivers/md/md.c
> >>> @@ -9619,8 +9619,10 @@ static int md_notify_reboot(struct notifier_block
> >>> *this,
> >>>
> >>>          spin_lock(&all_mddevs_lock);
> >>>          list_for_each_entry_safe(mddev, n, &all_mddevs, all_mddevs) {
> >>> -               if (!mddev_get(mddev))
> >>> +               if (!mddev_get(mddev)) {
> >>> +                       need_delay = 1;
> >>>                          continue;
> >>> +               }
> >>>                  spin_unlock(&all_mddevs_lock);
> >>>                  if (mddev_trylock(mddev)) {
> >>>                          if (mddev->pers)
> >>>
> >>>
> >>> Thanks,
> >>> Song
> >> I will try to reproduce the issue at the Intel lab to check this.
> >>
> >> Thanks,
> >> Mariusz
> > Hi Guoqing,
> >
> Here is the command I use to trigger the issue; I have to run it around 10
> times to make sure the issue reproduces:
>
> echo "repair" | sudo tee /sys/class/block/md12?/md/sync_action && \
> sudo grub-reboot "Advanced options for Ubuntu>Ubuntu, with Linux 6.5.0-rc7 706a74159504-dirty" && \
> head -c 1G < /dev/urandom > myfile1 && sleep 180 && \
> head -c 1G < /dev/urandom > myfile2 && sleep 1 && \
> cat /proc/mdstat && sleep 1 && rm myfile1 && \
> sudo reboot
>
> Is the issue still reproducible if you remove the below from the cmd?
>
> echo "repair" | sudo tee /sys/class/block/md12?/md/sync_action
>
> Just want to know whether the resync thread is related to the issue or not.
Probably not; we can reproduce the issue without resync. It just feels easier
to reproduce with it, so I put the command in my script.

>
> > And the patch to add need_delay doesn't work.
>
> My assumption is that mddev_get always returns NULL, so setting need_delay
> wouldn't help.
>
> Thanks,
> Guoqing



-- 
Chia-Lin Kao(AceLan)
http://blog.acelan.idv.tw/
E-Mail: acelan.kaoATcanonical.com (s/AT/@/)

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Infinite systemd loop when powering off the machine with multiple MD RAIDs
  2023-08-23  8:02             ` AceLan Kao
@ 2023-08-23 13:25               ` Song Liu
       [not found]                 ` <CAMz9Wg9y52iuxJRSQFC2N5Katt72v-o=JvEjegJt-MwORmw9tQ@mail.gmail.com>
  0 siblings, 1 reply; 30+ messages in thread
From: Song Liu @ 2023-08-23 13:25 UTC (permalink / raw)
  To: AceLan Kao
  Cc: Guoqing Jiang, Mariusz Tkaczyk, Bagas Sanjaya, Christoph Hellwig,
	Linux Kernel Mailing List, Linux Regressions, Linux RAID

Hi AceLan,

On Wed, Aug 23, 2023 at 1:03 AM AceLan Kao <acelan@gmail.com> wrote:
>
[...]
> >
> > My suggestion is delete the list node under this scenario, did you try
> > above?
> Still no luck; the patch doesn't work and the symptom is the same.
[...]
> >
> > Is the issue still reproducible with remove below from cmd?
> >
> > echo "repair" | sudo tee /sys/class/block/md12?/md/sync_action
> >
> > Just want to know if resync thread is related with the issue or not.
> Probably not; we can reproduce the issue without resync. It just feels easier
> to reproduce with it, so I put the command in my script.
>

Could you please run the following two experiments?

1. Confirm 12a6caf273240a triggers this. Specifically:
   git checkout 12a6caf273240a => repros
   git checkout 12a6caf273240a~1 => cannot repro

2. Try with the following change (add debug messages), which hopefully
   shows which command is holding a reference on mddev->openers.

Thanks,
Song

diff --git i/drivers/md/md.c w/drivers/md/md.c
index 78be7811a89f..3e9b718b32c1 100644
--- i/drivers/md/md.c
+++ w/drivers/md/md.c
@@ -7574,11 +7574,15 @@ static int md_ioctl(struct block_device *bdev, blk_mode_t mode,
                if (mddev->pers && atomic_read(&mddev->openers) > 1) {
                        mutex_unlock(&mddev->open_mutex);
                        err = -EBUSY;
+                       pr_warn("%s return -EBUSY for %s with mddev->openers = %d\n",
+                               __func__, mdname(mddev), atomic_read(&mddev->openers));
                        goto out;
                }
                if (test_and_set_bit(MD_CLOSING, &mddev->flags)) {
                        mutex_unlock(&mddev->open_mutex);
                        err = -EBUSY;
+                       pr_warn("%s return -EBUSY for %s with MD_CLOSING bit set\n",
+                               __func__, mdname(mddev));
                        goto out;
                }
                did_set_md_closing = true;
@@ -7789,6 +7793,8 @@ static int md_open(struct gendisk *disk, blk_mode_t mode)
                goto out_unlock;

        atomic_inc(&mddev->openers);
+       pr_info("%s:%s openers++ = %d by %s\n", __func__, mdname(mddev),
+               atomic_read(&mddev->openers), current->comm);
        mutex_unlock(&mddev->open_mutex);

        disk_check_media_change(disk);
@@ -7807,6 +7813,8 @@ static void md_release(struct gendisk *disk)

        BUG_ON(!mddev);
        atomic_dec(&mddev->openers);
+       pr_info("%s:%s openers-- = %d by %s\n", __func__, mdname(mddev),
+               atomic_read(&mddev->openers), current->comm);
        mddev_put(mddev);
 }

^ permalink raw reply related	[flat|nested] 30+ messages in thread

* Re: Infinite systemd loop when powering off the machine with multiple MD RAIDs
       [not found]                 ` <CAMz9Wg9y52iuxJRSQFC2N5Katt72v-o=JvEjegJt-MwORmw9tQ@mail.gmail.com>
@ 2023-08-28  5:20                   ` Song Liu
  2023-08-28 10:48                     ` AceLan Kao
  2023-08-28 13:50                     ` Yu Kuai
  0 siblings, 2 replies; 30+ messages in thread
From: Song Liu @ 2023-08-28  5:20 UTC (permalink / raw)
  To: AceLan Kao
  Cc: Guoqing Jiang, Mariusz Tkaczyk, Bagas Sanjaya, Christoph Hellwig,
	Linux Kernel Mailing List, Linux Regressions, Linux RAID

Hi AceLan,

Thanks for running the experiments.

On Fri, Aug 25, 2023 at 9:32 PM AceLan Kao <acelan@gmail.com> wrote:
[...]
> >
> > Could you please run the follow two experiments?
> >
> > 1. Confirm 12a6caf273240a triggers this. Specifically:
> >    git checkout 12a6caf273240a => repros
> >    git checkout 12a6caf273240a~1 => cannot repro
> Yes, I'm pretty sure about this, that's my bisect result and I just
> confirmed it again.
> I also tried reverting 12a6caf273240a and the issue is gone.

The log doesn't match my guess. Specifically:

[  420.068142] systemd-shutdown[1]: Stopping MD /dev/md123 (9:123).
[  420.074718] md_open:md123 openers++ = 1 by systemd-shutdow
[  420.080787] systemd-shutdown[1]: Failed to sync MD block device
/dev/md123, ignoring: Input/output error
[  420.090831] md: md123 stopped.
[  420.094465] systemd-shutdown[1]: Stopping MD /dev/md122 (9:122).
[  420.101045] systemd-shutdown[1]: Could not stop MD /dev/md122:
Device or resource busy

For a successful stop on md123, we reach the pr_info() in md_open().
For a failed stop on md122, the kernel returns -EBUSY before that
pr_info() in md_open(). There have been some changes in md_open() in
the past few releases, so I am not quite sure we are looking at the
same code.

Therefore, could you please help clarify:

1. Which base kernel are you using?

From the log, you are using 6.5-rc7-706a74159504. However,
I think we cannot cleanly revert 12a6caf273240a on top of
6.5-rc7-706a74159504. Did you manually fix some issue in the
revert? If so, could you please share the revert commit?

2. If you are not using 6.5-rc7-706a74159504 as base kernel, which
one are you using?

Thanks,
Song

>
> >
> > 2. Try with the following change (add debug messages), which hopefully
> >    shows which command is holding a reference on mddev->openers.
> >
> > Thanks,
> > Song
> >
> > diff --git i/drivers/md/md.c w/drivers/md/md.c
> > index 78be7811a89f..3e9b718b32c1 100644
> > --- i/drivers/md/md.c
> > +++ w/drivers/md/md.c
> > @@ -7574,11 +7574,15 @@ static int md_ioctl(struct block_device *bdev,
> > blk_mode_t mode,
> >                 if (mddev->pers && atomic_read(&mddev->openers) > 1) {
> >                         mutex_unlock(&mddev->open_mutex);
> >                         err = -EBUSY;
> > +                       pr_warn("%s return -EBUSY for %s with
> > mddev->openers = %d\n",
> > +                               __func__, mdname(mddev),
> > atomic_read(&mddev->openers));
> >                         goto out;
> >                 }
> >                 if (test_and_set_bit(MD_CLOSING, &mddev->flags)) {
> >                         mutex_unlock(&mddev->open_mutex);
> >                         err = -EBUSY;
> > +                       pr_warn("%s return -EBUSY for %s with
> > MD_CLOSING bit set\n",
> > +                               __func__, mdname(mddev));
> >                         goto out;
> >                 }
> >                 did_set_md_closing = true;
> > @@ -7789,6 +7793,8 @@ static int md_open(struct gendisk *disk, blk_mode_t mode)
> >                 goto out_unlock;
> >
> >         atomic_inc(&mddev->openers);
> > +       pr_info("%s:%s openers++ = %d by %s\n", __func__, mdname(mddev),
> > +               atomic_read(&mddev->openers), current->comm);
> >         mutex_unlock(&mddev->open_mutex);
> >
> >         disk_check_media_change(disk);
> > @@ -7807,6 +7813,8 @@ static void md_release(struct gendisk *disk)
> >
> >         BUG_ON(!mddev);
> >         atomic_dec(&mddev->openers);
> > +       pr_info("%s:%s openers-- = %d by %s\n", __func__, mdname(mddev),
> > +               atomic_read(&mddev->openers), current->comm);
> >         mddev_put(mddev);
> >  }
> It's pretty strange that I can't reproduce the issue after applying the patch.
>
> I tried to figure out which part affects the issue and found that when I
> comment out the pr_info() in md_release(), the issue can be
> reproduced.
>
> --
> Chia-Lin Kao(AceLan)
> http://blog.acelan.idv.tw/
> E-Mail: acelan.kaoATcanonical.com (s/AT/@/)

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Infinite systemd loop when powering off the machine with multiple MD RAIDs
  2023-08-28  5:20                   ` Song Liu
@ 2023-08-28 10:48                     ` AceLan Kao
  2023-08-28 13:50                     ` Yu Kuai
  1 sibling, 0 replies; 30+ messages in thread
From: AceLan Kao @ 2023-08-28 10:48 UTC (permalink / raw)
  To: Song Liu
  Cc: Guoqing Jiang, Mariusz Tkaczyk, Bagas Sanjaya, Christoph Hellwig,
	Linux Kernel Mailing List, Linux Regressions, Linux RAID

[-- Attachment #1: Type: text/plain, Size: 4654 bytes --]

Hi Song,

Song Liu <song@kernel.org> wrote on Mon, Aug 28, 2023 at 1:21 PM:
>
> Hi AceLan,
>
> Thanks for running the experiments.
>
> On Fri, Aug 25, 2023 at 9:32 PM AceLan Kao <acelan@gmail.com> wrote:
> [...]
> > >
> > > Could you please run the follow two experiments?
> > >
> > > 1. Confirm 12a6caf273240a triggers this. Specifically:
> > >    git checkout 12a6caf273240a => repros
> > >    git checkout 12a6caf273240a~1 => cannot repro
> > Yes, I'm pretty sure about this, that's my bisect result and I just
> > confirmed it again.
> > I also tried reverting 12a6caf273240a and the issue is gone.
>
> The log doesn't match my guess. Specifically:
>
> [  420.068142] systemd-shutdown[1]: Stopping MD /dev/md123 (9:123).
> [  420.074718] md_open:md123 openers++ = 1 by systemd-shutdow
> [  420.080787] systemd-shutdown[1]: Failed to sync MD block device
> /dev/md123, ignoring: Input/output error
> [  420.090831] md: md123 stopped.
> [  420.094465] systemd-shutdown[1]: Stopping MD /dev/md122 (9:122).
> [  420.101045] systemd-shutdown[1]: Could not stop MD /dev/md122:
> Device or resource busy
>
> For a successful stop on md123, we reach the pr_info() in md_open().
> For a failed stop on md122, the kernel returns -EBUSY before that
> pr_info() in md_open(). There have been some changes in md_open() in
> the past few releases, so I am not quite sure we are looking at the
> same code.
>
> Therefore, could you please help clarify:
>
> 1. Which base kernel are you using?
>
> From the log, you are using 6.5-rc7-706a74159504. However,
> I think we cannot cleanly revert 12a6caf273240a on top of
> 6.5-rc7-706a74159504. Did you manually fix some issue in the
> revert? If so, could you please share the revert commit?
Yes, I'm using 6.5.0-rc7 as the base to apply your patch:
706a74159504 Linux 6.5-rc7

Attached is how I reverted the commit.

>
> 2. If you are not using 6.5-rc7-706a74159504 as base kernel, which
> one are you using?
>
> Thanks,
> Song
>
> >
> > >
> > > 2. Try with the following change (add debug messages), which hopefully
> > >    shows which command is holding a reference on mddev->openers.
> > >
> > > Thanks,
> > > Song
> > >
> > > diff --git i/drivers/md/md.c w/drivers/md/md.c
> > > index 78be7811a89f..3e9b718b32c1 100644
> > > --- i/drivers/md/md.c
> > > +++ w/drivers/md/md.c
> > > @@ -7574,11 +7574,15 @@ static int md_ioctl(struct block_device *bdev,
> > > blk_mode_t mode,
> > >                 if (mddev->pers && atomic_read(&mddev->openers) > 1) {
> > >                         mutex_unlock(&mddev->open_mutex);
> > >                         err = -EBUSY;
> > > +                       pr_warn("%s return -EBUSY for %s with
> > > mddev->openers = %d\n",
> > > +                               __func__, mdname(mddev),
> > > atomic_read(&mddev->openers));
> > >                         goto out;
> > >                 }
> > >                 if (test_and_set_bit(MD_CLOSING, &mddev->flags)) {
> > >                         mutex_unlock(&mddev->open_mutex);
> > >                         err = -EBUSY;
> > > +                       pr_warn("%s return -EBUSY for %s with
> > > MD_CLOSING bit set\n",
> > > +                               __func__, mdname(mddev));
> > >                         goto out;
> > >                 }
> > >                 did_set_md_closing = true;
> > > @@ -7789,6 +7793,8 @@ static int md_open(struct gendisk *disk, blk_mode_t mode)
> > >                 goto out_unlock;
> > >
> > >         atomic_inc(&mddev->openers);
> > > +       pr_info("%s:%s openers++ = %d by %s\n", __func__, mdname(mddev),
> > > +               atomic_read(&mddev->openers), current->comm);
> > >         mutex_unlock(&mddev->open_mutex);
> > >
> > >         disk_check_media_change(disk);
> > > @@ -7807,6 +7813,8 @@ static void md_release(struct gendisk *disk)
> > >
> > >         BUG_ON(!mddev);
> > >         atomic_dec(&mddev->openers);
> > > +       pr_info("%s:%s openers-- = %d by %s\n", __func__, mdname(mddev),
> > > +               atomic_read(&mddev->openers), current->comm);
> > >         mddev_put(mddev);
> > >  }
> > It's pretty strange that I can't reproduce the issue after applied the patch.
> >
> > I tried to figure out which part affect the issue and found when I
> > comment out the pr_info() In md_release(), the issue could be
> > reproduced.
> >
> > --
> > Chia-Lin Kao(AceLan)
> > http://blog.acelan.idv.tw/
> > E-Mail: acelan.kaoATcanonical.com (s/AT/@/)



-- 
Chia-Lin Kao(AceLan)
http://blog.acelan.idv.tw/
E-Mail: acelan.kaoATcanonical.com (s/AT/@/)

[-- Attachment #2: 0001-Revert-md-only-delete-entries-from-all_mddevs-when-t.patch --]
[-- Type: text/x-patch, Size: 5304 bytes --]

From 58dd106db88f00b08d20d4a949a001629238dbc0 Mon Sep 17 00:00:00 2001
From: "Chia-Lin Kao (AceLan)" <acelan.kao@canonical.com>
Date: Mon, 28 Aug 2023 02:34:08 -0400
Subject: [PATCH] Revert "md: only delete entries from all_mddevs when the disk
 is freed"

This reverts commit 12a6caf273240ae42842de8cc05feaa86e582d61.
---
 drivers/md/md.c | 52 ++++++++++++++++---------------------------------
 drivers/md/md.h |  2 --
 2 files changed, 17 insertions(+), 37 deletions(-)

diff --git a/drivers/md/md.c b/drivers/md/md.c
index 78be7811a89f..af5d379ad357 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -606,10 +606,6 @@ EXPORT_SYMBOL(md_flush_request);
 
 static inline struct mddev *mddev_get(struct mddev *mddev)
 {
-	lockdep_assert_held(&all_mddevs_lock);
-
-	if (test_bit(MD_DELETED, &mddev->flags))
-		return NULL;
 	atomic_inc(&mddev->active);
 	return mddev;
 }
@@ -624,7 +620,7 @@ void mddev_put(struct mddev *mddev)
 	    mddev->ctime == 0 && !mddev->hold_active) {
 		/* Array is not configured at all, and not held active,
 		 * so destroy it */
-		set_bit(MD_DELETED, &mddev->flags);
+		list_del_init(&mddev->all_mddevs);
 
 		/*
 		 * Call queue_work inside the spinlock so that
@@ -3285,8 +3281,6 @@ static bool md_rdev_overlaps(struct md_rdev *rdev)
 
 	spin_lock(&all_mddevs_lock);
 	list_for_each_entry(mddev, &all_mddevs, all_mddevs) {
-		if (test_bit(MD_DELETED, &mddev->flags))
-			continue;
 		rdev_for_each(rdev2, mddev) {
 			if (rdev != rdev2 && rdev->bdev == rdev2->bdev &&
 			    md_rdevs_overlap(rdev, rdev2)) {
@@ -5488,10 +5482,11 @@ md_attr_show(struct kobject *kobj, struct attribute *attr, char *page)
 	if (!entry->show)
 		return -EIO;
 	spin_lock(&all_mddevs_lock);
-	if (!mddev_get(mddev)) {
+	if (list_empty(&mddev->all_mddevs)) {
 		spin_unlock(&all_mddevs_lock);
 		return -EBUSY;
 	}
+	mddev_get(mddev);
 	spin_unlock(&all_mddevs_lock);
 
 	rv = entry->show(mddev, page);
@@ -5512,10 +5507,11 @@ md_attr_store(struct kobject *kobj, struct attribute *attr,
 	if (!capable(CAP_SYS_ADMIN))
 		return -EACCES;
 	spin_lock(&all_mddevs_lock);
-	if (!mddev_get(mddev)) {
+	if (list_empty(&mddev->all_mddevs)) {
 		spin_unlock(&all_mddevs_lock);
 		return -EBUSY;
 	}
+	mddev_get(mddev);
 	spin_unlock(&all_mddevs_lock);
 	rv = entry->store(mddev, page, length);
 	mddev_put(mddev);
@@ -7826,6 +7822,7 @@ static void md_free_disk(struct gendisk *disk)
 	struct mddev *mddev = disk->private_data;
 
 	percpu_ref_exit(&mddev->writes_pending);
+
 	mddev_free(mddev);
 }
 
@@ -8159,8 +8156,7 @@ static void *md_seq_start(struct seq_file *seq, loff_t *pos)
 	list_for_each(tmp,&all_mddevs)
 		if (!l--) {
 			mddev = list_entry(tmp, struct mddev, all_mddevs);
-			if (!mddev_get(mddev))
-				continue;
+			mddev_get(mddev);
 			spin_unlock(&all_mddevs_lock);
 			return mddev;
 		}
@@ -8174,35 +8170,25 @@ static void *md_seq_next(struct seq_file *seq, void *v, loff_t *pos)
 {
 	struct list_head *tmp;
 	struct mddev *next_mddev, *mddev = v;
-	struct mddev *to_put = NULL;
 
 	++*pos;
 	if (v == (void*)2)
 		return NULL;
 
 	spin_lock(&all_mddevs_lock);
-	if (v == (void*)1) {
+	if (v == (void*)1)
 		tmp = all_mddevs.next;
-	} else {
-		to_put = mddev;
-		tmp = mddev->all_mddevs.next;
-	}
-
-	for (;;) {
-		if (tmp == &all_mddevs) {
-			next_mddev = (void*)2;
-			*pos = 0x10000;
-			break;
-		}
-		next_mddev = list_entry(tmp, struct mddev, all_mddevs);
-		if (mddev_get(next_mddev))
-			break;
-		mddev = next_mddev;
+	else
 		tmp = mddev->all_mddevs.next;
+	if (tmp != &all_mddevs)
+		next_mddev = mddev_get(list_entry(tmp,struct mddev,all_mddevs));
+	else {
+		next_mddev = (void*)2;
+		*pos = 0x10000;
 	}
 	spin_unlock(&all_mddevs_lock);
 
-	if (to_put)
+	if (v != (void*)1)
 		mddev_put(mddev);
 	return next_mddev;
 
@@ -8772,8 +8758,6 @@ void md_do_sync(struct md_thread *thread)
 			goto skip;
 		spin_lock(&all_mddevs_lock);
 		list_for_each_entry(mddev2, &all_mddevs, all_mddevs) {
-			if (test_bit(MD_DELETED, &mddev2->flags))
-				continue;
 			if (mddev2 == mddev)
 				continue;
 			if (!mddev->parallel_resync
@@ -9577,8 +9561,7 @@ static int md_notify_reboot(struct notifier_block *this,
 
 	spin_lock(&all_mddevs_lock);
 	list_for_each_entry_safe(mddev, n, &all_mddevs, all_mddevs) {
-		if (!mddev_get(mddev))
-			continue;
+		mddev_get(mddev);
 		spin_unlock(&all_mddevs_lock);
 		if (mddev_trylock(mddev)) {
 			if (mddev->pers)
@@ -9934,8 +9917,7 @@ static __exit void md_exit(void)
 
 	spin_lock(&all_mddevs_lock);
 	list_for_each_entry_safe(mddev, n, &all_mddevs, all_mddevs) {
-		if (!mddev_get(mddev))
-			continue;
+		mddev_get(mddev);
 		spin_unlock(&all_mddevs_lock);
 		export_array(mddev);
 		mddev->ctime = 0;
diff --git a/drivers/md/md.h b/drivers/md/md.h
index 1aef86bf3fc3..4e265e38c229 100644
--- a/drivers/md/md.h
+++ b/drivers/md/md.h
@@ -252,7 +252,6 @@ struct md_cluster_info;
  * @MD_NOT_READY: do_md_run() is active, so 'array_state', ust not report that
  *		   array is ready yet.
  * @MD_BROKEN: This is used to stop writes and mark array as failed.
- * @MD_DELETED: This device is being deleted
  *
  * change UNSUPPORTED_MDDEV_FLAGS for each array type if new flag is added
  */
@@ -269,7 +268,6 @@ enum mddev_flags {
 	MD_UPDATING_SB,
 	MD_NOT_READY,
 	MD_BROKEN,
-	MD_DELETED,
 };
 
 enum mddev_sb_flags {
-- 
2.37.2


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* Re: Infinite systemd loop when powering off the machine with multiple MD RAIDs
  2023-08-28  5:20                   ` Song Liu
  2023-08-28 10:48                     ` AceLan Kao
@ 2023-08-28 13:50                     ` Yu Kuai
  2023-08-31  2:28                       ` Yu Kuai
  1 sibling, 1 reply; 30+ messages in thread
From: Yu Kuai @ 2023-08-28 13:50 UTC (permalink / raw)
  To: Song Liu, AceLan Kao
  Cc: Guoqing Jiang, Mariusz Tkaczyk, Bagas Sanjaya, Christoph Hellwig,
	Linux Kernel Mailing List, Linux Regressions, Linux RAID,
	yukuai (C)

Hi,

On 2023/08/28 13:20, Song Liu wrote:
> Hi AceLan,
> 
> Thanks for running the experiments.
> 
> On Fri, Aug 25, 2023 at 9:32 PM AceLan Kao <acelan@gmail.com> wrote:
> [...]
>>>
>>> Could you please run the follow two experiments?
>>>
>>> 1. Confirm 12a6caf273240a triggers this. Specifically:
>>>     git checkout 12a6caf273240a => repros
>>>     git checkout 12a6caf273240a~1 => cannot repro
>> Yes, I'm pretty sure about this, that's my bisect result and I just
>> confirmed it again.
>> I also tried reverting 12a6caf273240a and the issue is gone.
> 
> The log doesn't match my guess. Specifically:
> 
> [  420.068142] systemd-shutdown[1]: Stopping MD /dev/md123 (9:123).
> [  420.074718] md_open:md123 openers++ = 1 by systemd-shutdow
> [  420.080787] systemd-shutdown[1]: Failed to sync MD block device
> /dev/md123, ignoring: Input/output error
> [  420.090831] md: md123 stopped.
> [  420.094465] systemd-shutdown[1]: Stopping MD /dev/md122 (9:122).
> [  420.101045] systemd-shutdown[1]: Could not stop MD /dev/md122:
> Device or resource busy

I see that:

systemd-shutdown[1]: Couldn't finalize remaining  MD devices, trying again.

Can we make sure this is why power off hangs?

Because in my VM, power off does not hang and I got:

systemd-shutdown[1]: Could not stop MD /dev/md1: Device or resource busy
systemd-shutdown[1]: Failed to finalize MD devices, ignoring.
> 
> For a successful stop on md123, we reach the pr_info() in md_open().
> For a failed stop on md122, the kernel returns -EBUSY before that
> pr_info() in md_open(). There have been some changes in md_open() in
> the past few releases, so I am not quite sure we are looking at the
> same code.

By the way, based on code review, it looks like md_open never returns
-EBUSY, and I think the following is the only place that can return -EBUSY
before md_open() is called:

blkdev_open
  blkdev_get_by_dev
   bd_prepare_to_claim
    bd_may_claim 	-> -EBUSY
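
The same refusal is visible from userspace: opening a block device exclusively
while a filesystem still holds it fails with EBUSY (a minimal sketch, assuming
/dev/md122 is the mounted array; open(2) documents this O_EXCL behaviour for
block devices):

        #include <errno.h>
        #include <fcntl.h>
        #include <stdio.h>
        #include <unistd.h>

        int main(void)
        {
                /* O_EXCL on a block device node requests an exclusive claim */
                int fd = open("/dev/md122", O_RDONLY | O_EXCL);

                if (fd < 0 && errno == EBUSY)
                        printf("md122 is still claimed (e.g. mounted)\n");
                else if (fd >= 0)
                        close(fd);
                return 0;
        }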

AceLan, can you add the following debug patch on top of Song's patch
and reproduce it again? Hopefully it'll confirm why stopping the array failed
with -EBUSY.

diff --git a/block/bdev.c b/block/bdev.c
index 979e28a46b98..699739223dcb 100644
--- a/block/bdev.c
+++ b/block/bdev.c
@@ -789,8 +789,11 @@ struct block_device *blkdev_get_by_dev(dev_t dev, blk_mode_t mode, void *holder,
         if (holder) {
                 mode |= BLK_OPEN_EXCL;
                 ret = bd_prepare_to_claim(bdev, holder, hops);
-               if (ret)
+               if (ret) {
+                       pr_warn("%s:%s bd_prepare_to_claim return %d\n",
+                               disk->disk_name, current->comm, ret);
                         goto put_blkdev;
+               }
         } else {
                 if (WARN_ON_ONCE(mode & BLK_OPEN_EXCL)) {
                         ret = -EIO;
diff --git a/block/fops.c b/block/fops.c
index eaa98a987213..2d69119c71f6 100644
--- a/block/fops.c
+++ b/block/fops.c
@@ -587,8 +587,11 @@ static int blkdev_open(struct inode *inode, struct file *filp)

         bdev = blkdev_get_by_dev(inode->i_rdev, file_to_blk_mode(filp),
                                  filp->private_data, NULL);
-       if (IS_ERR(bdev))
+       if (IS_ERR(bdev)) {
+               pr_warn("%pD:%s blkdev_get_by_dev return %ld\n",
+                       filp, current->comm, PTR_ERR(bdev));
                 return PTR_ERR(bdev);
+       }

         if (bdev_nowait(bdev))
                 filp->f_mode |= FMODE_NOWAIT;

Thanks,
Kuai

> 
> Therefore, could you please help clarify:
> 
> 1. Which base kernel are you using?
> 
> From the log, you are using 6.5-rc7-706a74159504. However,
> I think we cannot cleanly revert 12a6caf273240a on top of
> 6.5-rc7-706a74159504. Did you manually fix some issue in the
> revert? If so, could you please share the revert commit?
> 
> 2. If you are not using 6.5-rc7-706a74159504 as base kernel, which
> one are you using?
> 
> Thanks,
> Song
> 
>>
>>>
>>> 2. Try with the following change (add debug messages), which hopefully
>>>     shows which command is holding a reference on mddev->openers.
>>>
>>> Thanks,
>>> Song
>>>
>>> diff --git i/drivers/md/md.c w/drivers/md/md.c
>>> index 78be7811a89f..3e9b718b32c1 100644
>>> --- i/drivers/md/md.c
>>> +++ w/drivers/md/md.c
>>> @@ -7574,11 +7574,15 @@ static int md_ioctl(struct block_device *bdev,
>>> blk_mode_t mode,
>>>                  if (mddev->pers && atomic_read(&mddev->openers) > 1) {
>>>                          mutex_unlock(&mddev->open_mutex);
>>>                          err = -EBUSY;
>>> +                       pr_warn("%s return -EBUSY for %s with
>>> mddev->openers = %d\n",
>>> +                               __func__, mdname(mddev),
>>> atomic_read(&mddev->openers));
>>>                          goto out;
>>>                  }
>>>                  if (test_and_set_bit(MD_CLOSING, &mddev->flags)) {
>>>                          mutex_unlock(&mddev->open_mutex);
>>>                          err = -EBUSY;
>>> +                       pr_warn("%s return -EBUSY for %s with
>>> MD_CLOSING bit set\n",
>>> +                               __func__, mdname(mddev));
>>>                          goto out;
>>>                  }
>>>                  did_set_md_closing = true;
>>> @@ -7789,6 +7793,8 @@ static int md_open(struct gendisk *disk, blk_mode_t mode)
>>>                  goto out_unlock;
>>>
>>>          atomic_inc(&mddev->openers);
>>> +       pr_info("%s:%s openers++ = %d by %s\n", __func__, mdname(mddev),
>>> +               atomic_read(&mddev->openers), current->comm);
>>>          mutex_unlock(&mddev->open_mutex);
>>>
>>>          disk_check_media_change(disk);
>>> @@ -7807,6 +7813,8 @@ static void md_release(struct gendisk *disk)
>>>
>>>          BUG_ON(!mddev);
>>>          atomic_dec(&mddev->openers);
>>> +       pr_info("%s:%s openers-- = %d by %s\n", __func__, mdname(mddev),
>>> +               atomic_read(&mddev->openers), current->comm);
>>>          mddev_put(mddev);
>>>   }
>> It's pretty strange that I can't reproduce the issue after applied the patch.
>>
>> I tried to figure out which part affect the issue and found when I
>> comment out the pr_info() In md_release(), the issue could be
>> reproduced.
>>
>> --
>> Chia-Lin Kao(AceLan)
>> http://blog.acelan.idv.tw/
>> E-Mail: acelan.kaoATcanonical.com (s/AT/@/)
> .
> 


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* Re: Infinite systemd loop when powering off the machine with multiple MD RAIDs
  2023-08-28 13:50                     ` Yu Kuai
@ 2023-08-31  2:28                       ` Yu Kuai
  2023-08-31  6:50                         ` Mariusz Tkaczyk
  0 siblings, 1 reply; 30+ messages in thread
From: Yu Kuai @ 2023-08-31  2:28 UTC (permalink / raw)
  To: Yu Kuai, Song Liu, AceLan Kao
  Cc: Guoqing Jiang, Mariusz Tkaczyk, Bagas Sanjaya, Christoph Hellwig,
	Linux Kernel Mailing List, Linux Regressions, Linux RAID,
	yukuai (C)

Hi, AceLan,

On 2023/08/28 21:50, Yu Kuai wrote:
> Hi,
> 
> On 2023/08/28 13:20, Song Liu wrote:
>> Hi AceLan,
>>
>> Thanks for running the experiments.
>>
>> On Fri, Aug 25, 2023 at 9:32 PM AceLan Kao <acelan@gmail.com> wrote:
>> [...]
>>>>
>>>> Could you please run the follow two experiments?
>>>>
>>>> 1. Confirm 12a6caf273240a triggers this. Specifically:
>>>>     git checkout 12a6caf273240a => repros
>>>>     git checkout 12a6caf273240a~1 => cannot repro
>>> Yes, I'm pretty sure about this, that's my bisect result and I just
>>> confirmed it again.
>>> I also tried reverting 12a6caf273240a and the issue is gone.
>>
>> The log doesn't match my guess. Specifically:
>>
>> [  420.068142] systemd-shutdown[1]: Stopping MD /dev/md123 (9:123).
>> [  420.074718] md_open:md123 openers++ = 1 by systemd-shutdow
>> [  420.080787] systemd-shutdown[1]: Failed to sync MD block device
>> /dev/md123, ignoring: Input/output error
>> [  420.090831] md: md123 stopped.
>> [  420.094465] systemd-shutdown[1]: Stopping MD /dev/md122 (9:122).
>> [  420.101045] systemd-shutdown[1]: Could not stop MD /dev/md122:
>> Device or resource busy
> 
> I see that:
> 
> systemd-shutdown[1]: Couldn't finalize remaining  MD devices, trying again.
> 
> Can we make sure this is why power off hangs?
> 
> Because in my VM, power off does not hang and I got:
> 
> systemd-shutdown[1]: Could not stop MD /dev/md1: Device or resource busy
> systemd-shutdown[1]: Failed to finalize MD devices, ignoring.
>>
>> For a successful stop on md123, we reach the pr_info() in md_open().
>> For a failed stop on md122, the kernel returns -EBUSY before that
>> pr_info() in md_open(). There are some changes in md_open() in
>> the past few release, so I am not quite sure we are looking at the
>> same code.
> 
> By the way, based on code review, it looks like md_open never returns
> -EBUSY, and I think the following is the only place that can return -EBUSY
> before md_open() is called:
> 
> blkdev_open
>   blkdev_get_by_dev
>    bd_prepare_to_claim
>     bd_may_claim     -> -EBUSY
> 
> AceLan, can you add the following debug patch on top of Song's patch
> and reproduce it again? Hopefully it'll confirm why stopping the array failed
> with -EBUSY.
> 

I saw your reply:

https://lore.kernel.org/all/CAMz9Wg9VXJzZkKWCosm0_=Rz6gG9r3pCMrhvBZQVCQb8i0kd-w@mail.gmail.com/

But I didn't receive your emails; please reply to me through this email
address: yukuai3@huawei.com. yukuai1@huaweicloud.com is just used for
sending, and I always miss emails sent to huaweicloud.com. My apologies
for the inconvenience; I have to do it this way due to my company
policy.

There is something interesting in the test result:

First, md122 is the root device; I guess this is why I can't reproduce
this problem in my local VM.
[   14.478619] systemd-gpt-auto-generator[585]: md122p3: Root device 
/dev/md122.
[   14.481823] block md122: the capability attribute has been deprecated.

Then, before trying to stop md122, the following log shows that md122 is
still mounted:
[ 1368.369926] [3462]: Remounting '/' read-only with options 
'errors=remount-ro,stripe=128'.
[ 1368.396287] EXT4-fs (md122p3): re-mounted 
6d53dc8e-3f45-4efa-bc0e-4af477fac217 ro. Quota mode: none.

And since then, there is no log about md_open(), which means no one
has opened md122 since this remount.

Finally, there are lots of log lines showing that md122 can't be opened
exclusively:
[ 1368.812952] md122:systemd-shutdow bd_prepare_to_claim return -16
[ 1368.819189] md122:systemd-shutdow blkdev_get_by_dev return -16

This indicates that md122 is still mounted, and systemd-shutdown can't
stop this array. This behaviour looks correct to me, because the rootfs
can't be umounted while systemd-shutdown is still running.

However, there are some questions I don't have answers to for now, as I'm not
familiar with how systemd works:

1) In the normal case where a RAID is the root device, how can systemd-shutdown
make progress, and how is the rootfs unmounted?
2) How is this scenario related to commit 12a6caf273240a?
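
For context, commit 12a6caf273240a makes mddev_get() return NULL for deleted
arrays (these are the lines shown as removed in AceLan's revert patch earlier
in the thread):

        static inline struct mddev *mddev_get(struct mddev *mddev)
        {
                lockdep_assert_held(&all_mddevs_lock);

                if (test_bit(MD_DELETED, &mddev->flags))
                        return NULL;
                atomic_inc(&mddev->active);
                return mddev;
        }

so every path that walks all_mddevs, such as md_notify_reboot() and md_exit(),
now has to cope with a NULL return.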

I must read the systemd source code to get answers, and it'll be appreciated
if someone already has them.

Thanks,
Kuai

> diff --git a/block/bdev.c b/block/bdev.c
> index 979e28a46b98..699739223dcb 100644
> --- a/block/bdev.c
> +++ b/block/bdev.c
> @@ -789,8 +789,11 @@ struct block_device *blkdev_get_by_dev(dev_t dev, 
> blk_mode_t mode, void *holder,
>          if (holder) {
>                  mode |= BLK_OPEN_EXCL;
>                  ret = bd_prepare_to_claim(bdev, holder, hops);
> -               if (ret)
> +               if (ret) {
> +                       pr_warn("%s:%s bd_prepare_to_claim return %d\n",
> +                               disk->disk_name, current->comm, ret);
>                          goto put_blkdev;
> +               }
>          } else {
>                  if (WARN_ON_ONCE(mode & BLK_OPEN_EXCL)) {
>                          ret = -EIO;
> diff --git a/block/fops.c b/block/fops.c
> index eaa98a987213..2d69119c71f6 100644
> --- a/block/fops.c
> +++ b/block/fops.c
> @@ -587,8 +587,11 @@ static int blkdev_open(struct inode *inode, struct 
> file *filp)
> 
>          bdev = blkdev_get_by_dev(inode->i_rdev, file_to_blk_mode(filp),
>                                   filp->private_data, NULL);
> -       if (IS_ERR(bdev))
> +       if (IS_ERR(bdev)) {
> +               pr_warn("%pD:%s blkdev_get_by_dev return %ld\n",
> +                       filp, current->comm, PTR_ERR(bdev));
>                  return PTR_ERR(bdev);
> +       }
> 
>          if (bdev_nowait(bdev))
>                  filp->f_mode |= FMODE_NOWAIT;
> 
> Thanks,
> Kuai
> 
>>
>> Therefore, could you please help clarify:
>>
>> 1. Which base kernel are you using?
>>
>> From the log, you are using 6.5-rc7-706a74159504. However,
>> I think we cannot cleanly revert 12a6caf273240a on top of
>> 6.5-rc7-706a74159504. Did you manually fix some issue in the
>> revert? If so, could you please share the revert commit?
>>
>> 2. If you are not using 6.5-rc7-706a74159504 as base kernel, which
>> one are you using?
>>
>> Thanks,
>> Song
>>
>>>
>>>>
>>>> 2. Try with the following change (add debug messages), which hopefully
>>>>     shows which command is holding a reference on mddev->openers.
>>>>
>>>> Thanks,
>>>> Song
>>>>
>>>> diff --git i/drivers/md/md.c w/drivers/md/md.c
>>>> index 78be7811a89f..3e9b718b32c1 100644
>>>> --- i/drivers/md/md.c
>>>> +++ w/drivers/md/md.c
>>>> @@ -7574,11 +7574,15 @@ static int md_ioctl(struct block_device *bdev,
>>>> blk_mode_t mode,
>>>>                  if (mddev->pers && atomic_read(&mddev->openers) > 1) {
>>>>                          mutex_unlock(&mddev->open_mutex);
>>>>                          err = -EBUSY;
>>>> +                       pr_warn("%s return -EBUSY for %s with
>>>> mddev->openers = %d\n",
>>>> +                               __func__, mdname(mddev),
>>>> atomic_read(&mddev->openers));
>>>>                          goto out;
>>>>                  }
>>>>                  if (test_and_set_bit(MD_CLOSING, &mddev->flags)) {
>>>>                          mutex_unlock(&mddev->open_mutex);
>>>>                          err = -EBUSY;
>>>> +                       pr_warn("%s return -EBUSY for %s with
>>>> MD_CLOSING bit set\n",
>>>> +                               __func__, mdname(mddev));
>>>>                          goto out;
>>>>                  }
>>>>                  did_set_md_closing = true;
>>>> @@ -7789,6 +7793,8 @@ static int md_open(struct gendisk *disk, 
>>>> blk_mode_t mode)
>>>>                  goto out_unlock;
>>>>
>>>>          atomic_inc(&mddev->openers);
>>>> +       pr_info("%s:%s openers++ = %d by %s\n", __func__, 
>>>> mdname(mddev),
>>>> +               atomic_read(&mddev->openers), current->comm);
>>>>          mutex_unlock(&mddev->open_mutex);
>>>>
>>>>          disk_check_media_change(disk);
>>>> @@ -7807,6 +7813,8 @@ static void md_release(struct gendisk *disk)
>>>>
>>>>          BUG_ON(!mddev);
>>>>          atomic_dec(&mddev->openers);
>>>> +       pr_info("%s:%s openers-- = %d by %s\n", __func__, 
>>>> mdname(mddev),
>>>> +               atomic_read(&mddev->openers), current->comm);
>>>>          mddev_put(mddev);
>>>>   }
>>> It's pretty strange that I can't reproduce the issue after applied 
>>> the patch.
>>>
>>> I tried to figure out which part affect the issue and found when I
>>> comment out the pr_info() In md_release(), the issue could be
>>> reproduced.
>>>
>>> -- 
>>> Chia-Lin Kao(AceLan)
>>> http://blog.acelan.idv.tw/
>>> E-Mail: acelan.kaoATcanonical.com (s/AT/@/)
>> .
>>
> 
> .
> 


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Infinite systemd loop when powering off the machine with multiple MD RAIDs
  2023-08-31  2:28                       ` Yu Kuai
@ 2023-08-31  6:50                         ` Mariusz Tkaczyk
  2023-09-06  6:26                           ` AceLan Kao
  0 siblings, 1 reply; 30+ messages in thread
From: Mariusz Tkaczyk @ 2023-08-31  6:50 UTC (permalink / raw)
  To: Yu Kuai
  Cc: Song Liu, AceLan Kao, Guoqing Jiang, Bagas Sanjaya,
	Christoph Hellwig, Linux Kernel Mailing List, Linux Regressions,
	Linux RAID, yukuai (C)

On Thu, 31 Aug 2023 10:28:44 +0800
Yu Kuai <yukuai1@huaweicloud.com> wrote:

> Hi, Acelan,
> 
> On 2023/08/28 21:50, Yu Kuai wrote:
> > Hi,
> > 
> > On 2023/08/28 13:20, Song Liu wrote:
> >> Hi AceLan,
> >>
> >> Thanks for running the experiments.
> >>
> >> On Fri, Aug 25, 2023 at 9:32 PM AceLan Kao <acelan@gmail.com> wrote:
> >> [...]  
> >>>>
> >>>> Could you please run the follow two experiments?
> >>>>
> >>>> 1. Confirm 12a6caf273240a triggers this. Specifically:
> >>>>     git checkout 12a6caf273240a => repros
> >>>>     git checkout 12a6caf273240a~1 => cannot repro  
> >>> Yes, I'm pretty sure about this, that's my bisect result and I just
> >>> confirmed it again.
> >>> I also tried reverting 12a6caf273240a and the issue is gone.  
> >>
> >> The log doesn't match my guess. Specifically:
> >>
> >> [  420.068142] systemd-shutdown[1]: Stopping MD /dev/md123 (9:123).
> >> [  420.074718] md_open:md123 openers++ = 1 by systemd-shutdow
> >> [  420.080787] systemd-shutdown[1]: Failed to sync MD block device
> >> /dev/md123, ignoring: Input/output error
> >> [  420.090831] md: md123 stopped.
> >> [  420.094465] systemd-shutdown[1]: Stopping MD /dev/md122 (9:122).
> >> [  420.101045] systemd-shutdown[1]: Could not stop MD /dev/md122:
> >> Device or resource busy  
> > 
> > I see that:
> > 
> > systemd-shutdown[1]: Couldn't finalize remaining  MD devices, trying again.
> > 
> > Can we make sure is this why power off hang?
> > 
> > Because in my VM, power off is not hang and I got:
> > 
> > systemd-shutdown[1]: Could not stop MD /dev/md1: Device or resource busy
> > systemd-shutdown[1]: Failed to finalize MD devices, ignoring.  
> >>
> >> For a successful stop on md123, we reach the pr_info() in md_open().
> >> For a failed stop on md122, the kernel returns -EBUSY before that
> >> pr_info() in md_open(). There are some changes in md_open() in
> >> the past few release, so I am not quite sure we are looking at the
> >> same code.  
> > 
> > By the way, based on code review, looks like md_open never return
> > -EBUSY, and I think following is the only place can return -EBUSY before
> > md_open() is called:
> > 
> > blkdev_open
> >   blkdev_get_by_dev
> >    bd_prepare_to_claim
> >     bd_may_claim     -> -EBUSY
> > 
> > Acelan, can you add following debug patch on the top of Song's patch
> > and reporduce it again? Hope it'll confirm why stop array failed with
> > -EBUSY.
> >   
> 
> I saw your reply:
> 
> https://lore.kernel.org/all/CAMz9Wg9VXJzZkKWCosm0_=Rz6gG9r3pCMrhvBZQVCQb8i0kd-w@mail.gmail.com/
> 
> But I didn't recieve you emails, please reply to me through this email
> address: yukuai3@huawei.com. yukuai1@huaweicloud.com is just used for
> sending, and I always missing emails from huaweicloud.com, my apologize
> for this inconvenience, I have to do it this way due to my company
> policy.
> 
> There is something interesting in the test result:
> 
> First, md122 is the root device; I guess this is why I can't reproduce
> this problem in my local VM.
> [   14.478619] systemd-gpt-auto-generator[585]: md122p3: Root device 
> /dev/md122.
> [   14.481823] block md122: the capability attribute has been deprecated.
> 
> Then, before trying to stop md122, the following log shows that md122 is
> still mounted:
> [ 1368.369926] [3462]: Remounting '/' read-only with options 
> 'errors=remount-ro,stripe=128'.
> [ 1368.396287] EXT4-fs (md122p3): re-mounted 
> 6d53dc8e-3f45-4efa-bc0e-4af477fac217 ro. Quota mode: none.
> 
> And since then, there is no log about md_open(), which means no one
> has opened md122 since this remount.
> 
> Finally, there are lots of log lines showing that md122 can't be opened
> exclusively:
> [ 1368.812952] md122:systemd-shutdow bd_prepare_to_claim return -16
> [ 1368.819189] md122:systemd-shutdow blkdev_get_by_dev return -16
>
> This indicates that md122 is still mounted, and systemd-shutdown can't
> stop this array. This behaviour looks correct to me, because the rootfs
> can't be umounted while systemd-shutdown is still running.
> 
> However, there are some questions I don't have answers to for now, as I'm not
> familiar with how systemd works:
> 
> 1) In the normal case where a RAID is the root device, how can systemd-shutdown
> make progress, and how is the rootfs unmounted?

Thanks for digging into this!

In that case systemd-shutdown should skip this device; it should be umounted by
dracut. systemd-shutdown implements a naive way of trying to stop MD devices: a
simple ioctl(fd, STOP_ARRAY, NULL).
We expect failure if the resource is still mounted.
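
Roughly like this (a sketch with assumed open flags and no error reporting;
the exclusive open is consistent with the bd_prepare_to_claim failures in
AceLan's log):

        #include <fcntl.h>
        #include <sys/ioctl.h>
        #include <unistd.h>
        #include <linux/raid/md_u.h>    /* STOP_ARRAY */

        static int stop_md(const char *node)
        {
                int fd = open(node, O_RDONLY | O_CLOEXEC | O_EXCL | O_NONBLOCK);

                if (fd < 0)
                        return -1;      /* e.g. EBUSY while still mounted */
                if (ioctl(fd, STOP_ARRAY, NULL) < 0) {
                        close(fd);
                        return -1;      /* the md_ioctl() -EBUSY discussed above */
                }
                close(fd);
                return 0;
        }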

Here dracut code:
https://github.com/dracutdevs/dracut/blob/master/modules.d/90mdraid/md-shutdown.sh

Ohh, it is Ubuntu, so it is handled by initramfs-tools, not by dracut, but I hope
you will find it useful in the future.

and systemd-shutdown:
https://github.com/systemd/systemd/blob/main/src/shutdown/shutdown.c#L483C7-L483C7

When I fixed this functionality in systemd-shutdown, I asked if we could
make it configurable; Lennart disagreed, so there is no way to disable it now.

> 2) How is this scenario related to commit 12a6caf273240a?

I'm unable to reproduce it on Red Hat; Ubuntu is on the way now.
I'm starting to be convinced that it could be related to initramfs-tools, but
I cannot prove this theory yet. I need to reproduce the issue first.

Thanks,
Mariusz
> 
> I must read systemd source code to get answers, and it'll be appreciated
> if someone has answers.
> 
> Thanks,
> Kuai
> 
> > diff --git a/block/bdev.c b/block/bdev.c
> > index 979e28a46b98..699739223dcb 100644
> > --- a/block/bdev.c
> > +++ b/block/bdev.c
> > @@ -789,8 +789,11 @@ struct block_device *blkdev_get_by_dev(dev_t dev, 
> > blk_mode_t mode, void *holder,
> >          if (holder) {
> >                  mode |= BLK_OPEN_EXCL;
> >                  ret = bd_prepare_to_claim(bdev, holder, hops);
> > -               if (ret)
> > +               if (ret) {
> > +                       pr_warn("%s:%s bd_prepare_to_claim return %d\n",
> > +                               disk->disk_name, current->comm, ret);
> >                          goto put_blkdev;
> > +               }
> >          } else {
> >                  if (WARN_ON_ONCE(mode & BLK_OPEN_EXCL)) {
> >                          ret = -EIO;
> > diff --git a/block/fops.c b/block/fops.c
> > index eaa98a987213..2d69119c71f6 100644
> > --- a/block/fops.c
> > +++ b/block/fops.c
> > @@ -587,8 +587,11 @@ static int blkdev_open(struct inode *inode, struct 
> > file *filp)
> > 
> >          bdev = blkdev_get_by_dev(inode->i_rdev, file_to_blk_mode(filp),
> >                                   filp->private_data, NULL);
> > -       if (IS_ERR(bdev))
> > +       if (IS_ERR(bdev)) {
> > +               pr_warn("%pD:%s blkdev_get_by_dev return %ld\n",
> > +                       filp, current->comm, PTR_ERR(bdev));
> >                  return PTR_ERR(bdev);
> > +       }
> > 
> >          if (bdev_nowait(bdev))
> >                  filp->f_mode |= FMODE_NOWAIT;
> > 
> > Thanks,
> > Kuai
> >   
> >>
> >> Therefore, could you please help clarify:
> >>
> >> 1. Which base kernel are you using?
> >>  
> >>> From the log, you are using 6.5-rc7-706a74159504. However,  
> >> I think we cannot cleanly revert 12a6caf273240a on top of
> >> 6.5-rc7-706a74159504. Did you manually fix some issue in the
> >> revert? If so, could you please share the revert commit?
> >>
> >> 2. If you are not using 6.5-rc7-706a74159504 as base kernel, which
> >> one are you using?
> >>
> >> Thanks,
> >> Song
> >>  
> >>>  
> >>>>
> >>>> 2. Try with the following change (add debug messages), which hopefully
> >>>>     shows which command is holding a reference on mddev->openers.
> >>>>
> >>>> Thanks,
> >>>> Song
> >>>>
> >>>> diff --git i/drivers/md/md.c w/drivers/md/md.c
> >>>> index 78be7811a89f..3e9b718b32c1 100644
> >>>> --- i/drivers/md/md.c
> >>>> +++ w/drivers/md/md.c
> >>>> @@ -7574,11 +7574,15 @@ static int md_ioctl(struct block_device *bdev,
> >>>> blk_mode_t mode,
> >>>>                  if (mddev->pers && atomic_read(&mddev->openers) > 1) {
> >>>>                          mutex_unlock(&mddev->open_mutex);
> >>>>                          err = -EBUSY;
> >>>> +                       pr_warn("%s return -EBUSY for %s with
> >>>> mddev->openers = %d\n",
> >>>> +                               __func__, mdname(mddev),
> >>>> atomic_read(&mddev->openers));
> >>>>                          goto out;
> >>>>                  }
> >>>>                  if (test_and_set_bit(MD_CLOSING, &mddev->flags)) {
> >>>>                          mutex_unlock(&mddev->open_mutex);
> >>>>                          err = -EBUSY;
> >>>> +                       pr_warn("%s return -EBUSY for %s with
> >>>> MD_CLOSING bit set\n",
> >>>> +                               __func__, mdname(mddev));
> >>>>                          goto out;
> >>>>                  }
> >>>>                  did_set_md_closing = true;
> >>>> @@ -7789,6 +7793,8 @@ static int md_open(struct gendisk *disk, 
> >>>> blk_mode_t mode)
> >>>>                  goto out_unlock;
> >>>>
> >>>>          atomic_inc(&mddev->openers);
> >>>> +       pr_info("%s:%s openers++ = %d by %s\n", __func__, 
> >>>> mdname(mddev),
> >>>> +               atomic_read(&mddev->openers), current->comm);
> >>>>          mutex_unlock(&mddev->open_mutex);
> >>>>
> >>>>          disk_check_media_change(disk);
> >>>> @@ -7807,6 +7813,8 @@ static void md_release(struct gendisk *disk)
> >>>>
> >>>>          BUG_ON(!mddev);
> >>>>          atomic_dec(&mddev->openers);
> >>>> +       pr_info("%s:%s openers-- = %d by %s\n", __func__, 
> >>>> mdname(mddev),
> >>>> +               atomic_read(&mddev->openers), current->comm);
> >>>>          mddev_put(mddev);
> >>>>   }  
> >>> It's pretty strange that I can't reproduce the issue after applied 
> >>> the patch.
> >>>
> >>> I tried to figure out which part affect the issue and found when I
> >>> comment out the pr_info() In md_release(), the issue could be
> >>> reproduced.
> >>>
> >>> -- 
> >>> Chia-Lin Kao(AceLan)
> >>> http://blog.acelan.idv.tw/
> >>> E-Mail: acelan.kaoATcanonical.com (s/AT/@/)  
> >> .
> >>  
> > 
> > .
> >   
> 


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Infinite systemd loop when powering off the machine with multiple MD RAIDs
  2023-08-31  6:50                         ` Mariusz Tkaczyk
@ 2023-09-06  6:26                           ` AceLan Kao
  2023-09-06 10:27                             ` Mariusz Tkaczyk
  0 siblings, 1 reply; 30+ messages in thread
From: AceLan Kao @ 2023-09-06  6:26 UTC (permalink / raw)
  To: Mariusz Tkaczyk
  Cc: Yu Kuai, Song Liu, Guoqing Jiang, Bagas Sanjaya,
	Christoph Hellwig, Linux Kernel Mailing List, Linux Regressions,
	Linux RAID, yukuai (C)

Hi,

Mariusz Tkaczyk <mariusz.tkaczyk@linux.intel.com> wrote on Thu, Aug 31, 2023 at 2:51 PM:
>
> On Thu, 31 Aug 2023 10:28:44 +0800
> Yu Kuai <yukuai1@huaweicloud.com> wrote:
>
> > Hi, Acelan,
> >
> > On 2023/08/28 21:50, Yu Kuai wrote:
> > > Hi,
> > >
> > > On 2023/08/28 13:20, Song Liu wrote:
> > >> Hi AceLan,
> > >>
> > >> Thanks for running the experiments.
> > >>
> > >> On Fri, Aug 25, 2023 at 9:32 PM AceLan Kao <acelan@gmail.com> wrote:
> > >> [...]
> > >>>>
> > >>>> Could you please run the follow two experiments?
> > >>>>
> > >>>> 1. Confirm 12a6caf273240a triggers this. Specifically:
> > >>>>     git checkout 12a6caf273240a => repros
> > >>>>     git checkout 12a6caf273240a~1 => cannot repro
> > >>> Yes, I'm pretty sure about this, that's my bisect result and I just
> > >>> confirmed it again.
> > >>> I also tried reverting 12a6caf273240a and the issue is gone.
> > >>
> > >> The log doesn't match my guess. Specifically:
> > >>
> > >> [  420.068142] systemd-shutdown[1]: Stopping MD /dev/md123 (9:123).
> > >> [  420.074718] md_open:md123 openers++ = 1 by systemd-shutdow
> > >> [  420.080787] systemd-shutdown[1]: Failed to sync MD block device
> > >> /dev/md123, ignoring: Input/output error
> > >> [  420.090831] md: md123 stopped.
> > >> [  420.094465] systemd-shutdown[1]: Stopping MD /dev/md122 (9:122).
> > >> [  420.101045] systemd-shutdown[1]: Could not stop MD /dev/md122:
> > >> Device or resource busy
> > >
> > > I see that:
> > >
> > > systemd-shutdown[1]: Couldn't finalize remaining  MD devices, trying again.
> > >
> > Can we make sure this is why the power off hangs?
> > >
> > Because in my VM, power off does not hang and I got:
> > >
> > > systemd-shutdown[1]: Could not stop MD /dev/md1: Device or resource busy
> > > systemd-shutdown[1]: Failed to finalize MD devices, ignoring.
> > >>
> > >> For a successful stop on md123, we reach the pr_info() in md_open().
> > >> For a failed stop on md122, the kernel returns -EBUSY before that
> > >> pr_info() in md_open(). There are some changes in md_open() in
> > >> the past few releases, so I am not quite sure we are looking at the
> > >> same code.
> > >
> > > By the way, based on code review, it looks like md_open() never returns
> > > -EBUSY, and I think the following is the only place that can return
> > > -EBUSY before md_open() is called:
> > >
> > > blkdev_open
> > >   blkdev_get_by_dev
> > >    bd_prepare_to_claim
> > >     bd_may_claim     -> -EBUSY
> > >
> > > AceLan, can you add the following debug patch on top of Song's patch
> > > and reproduce it again? Hopefully it'll confirm why stopping the array
> > > failed with -EBUSY.
> > >
> >
> > I saw your reply:
> >
> > https://lore.kernel.org/all/CAMz9Wg9VXJzZkKWCosm0_=Rz6gG9r3pCMrhvBZQVCQb8i0kd-w@mail.gmail.com/
> >
> > But I didn't receive your emails; please reply to me through this email
> > address: yukuai3@huawei.com. yukuai1@huaweicloud.com is just used for
> > sending, and I keep missing emails sent to huaweicloud.com. My apologies
> > for this inconvenience; I have to do it this way due to my company
> > policy.
> >
> > There is something interesting in the test result:
> >
> > First, md122 is the root device; I guess this is why I can't reproduce
> > this problem in my local VM.
> > [   14.478619] systemd-gpt-auto-generator[585]: md122p3: Root device
> > /dev/md122.
> > [   14.481823] block md122: the capability attribute has been deprecated.
> >
> > Then, before trying to stop md122, the following log shows that md122 is
> > still mounted:
> > [ 1368.369926] [3462]: Remounting '/' read-only with options
> > 'errors=remount-ro,stripe=128'.
> > [ 1368.396287] EXT4-fs (md122p3): re-mounted
> > 6d53dc8e-3f45-4efa-bc0e-4af477fac217 ro. Quota mode: none.
> >
> > And since then, there is no log about md_open(), which means no one
> > has opened md122 since this remount.
> >
> > Finally, there are lots of logs showing that md122 can't be opened
> > exclusively:
> > [ 1368.812952] md122:systemd-shutdow bd_prepare_to_claim return -16
> > [ 1368.819189] md122:systemd-shutdow blkdev_get_by_dev return -16
> >
> > This indicates that md122 is still mounted, and systemd-shutdown can't
> > stop this array. This behaviour looks correct to me, because the rootfs
> > can't be unmounted while systemd-shutdown is still running.
> >
> > However, there are some questions I don't have answers to for now; I'm
> > not familiar with how systemd works:
> >
> > 1) In the normal case where the RAID is the root device, how can
> > systemd-shutdown make progress, and how is the rootfs unmounted?
>
> Thanks for digging into!
>
> In that case systemd-shutdown should skip this device. It should be umounted by
> dracut. systemd-shutdown implements a naive way of trying to stop MD devices: a
> simple ioctl(fd, STOP_ARRAY, NULL).
> We are expecting failure if the resource is still mounted.
>
> Here is the dracut code:
> https://github.com/dracutdevs/dracut/blob/master/modules.d/90mdraid/md-shutdown.sh
>
> Ohh, it is Ubuntu, so it is handled by initramfs-tools, not by dracut, but I
> hope you will find it useful in the future.
>
> and systemd-shutdown:
> https://github.com/systemd/systemd/blob/main/src/shutdown/shutdown.c#L483C7-L483C7
>
> When I fixed this functionality in systemd-shutdown, I asked if we could make
> it configurable - Lennart disagreed, so there is no way to disable it now.
>
> > 2) How is this scenario related to commit 12a6caf273240a?
>
> I'm unable to reproduce it on Red Hat; Ubuntu is on the way now.
> I'm starting to be convinced that it could be related to initramfs-tools, but
> I cannot prove this theory yet. I need to reproduce the issue first.
>
> Thanks,
> Mariusz
> >
> > I must read the systemd source code to get answers, and it'll be
> > appreciated if someone has them.
> >
> > Thanks,
> > Kuai
> >
> > > diff --git a/block/bdev.c b/block/bdev.c
> > > index 979e28a46b98..699739223dcb 100644
> > > --- a/block/bdev.c
> > > +++ b/block/bdev.c
> > > @@ -789,8 +789,11 @@ struct block_device *blkdev_get_by_dev(dev_t dev, blk_mode_t mode, void *holder,
> > >          if (holder) {
> > >                  mode |= BLK_OPEN_EXCL;
> > >                  ret = bd_prepare_to_claim(bdev, holder, hops);
> > > -               if (ret)
> > > +               if (ret) {
> > > +                       pr_warn("%s:%s bd_prepare_to_claim return %d\n",
> > > +                               disk->disk_name, current->comm, ret);
> > >                          goto put_blkdev;
> > > +               }
> > >          } else {
> > >                  if (WARN_ON_ONCE(mode & BLK_OPEN_EXCL)) {
> > >                          ret = -EIO;
> > > diff --git a/block/fops.c b/block/fops.c
> > > index eaa98a987213..2d69119c71f6 100644
> > > --- a/block/fops.c
> > > +++ b/block/fops.c
> > > @@ -587,8 +587,11 @@ static int blkdev_open(struct inode *inode, struct file *filp)
> > >
> > >          bdev = blkdev_get_by_dev(inode->i_rdev, file_to_blk_mode(filp),
> > >                                   filp->private_data, NULL);
> > > -       if (IS_ERR(bdev))
> > > +       if (IS_ERR(bdev)) {
> > > +               pr_warn("%pD:%s blkdev_get_by_dev return %ld\n",
> > > +                       filp, current->comm, PTR_ERR(bdev));
> > >                  return PTR_ERR(bdev);
> > > +       }
> > >
> > >          if (bdev_nowait(bdev))
> > >                  filp->f_mode |= FMODE_NOWAIT;
> > >
> > > Thanks,
> > > Kuai
> > >
> > >>
> > >> Therefore, could you please help clarify:
> > >>
> > >> 1. Which base kernel are you using?
> > >>
> > >> From the log, you are using 6.5-rc7-706a74159504. However,
> > >> I think we cannot cleanly revert 12a6caf273240a on top of
> > >> 6.5-rc7-706a74159504. Did you manually fix some issue in the
> > >> revert? If so, could you please share the revert commit?
> > >>
> > >> 2. If you are not using 6.5-rc7-706a74159504 as base kernel, which
> > >> one are you using?
> > >>
> > >> Thanks,
> > >> Song
> > >>
> > >>>
> > >>>>
> > >>>> 2. Try with the following change (add debug messages), which hopefully
> > >>>>     shows which command is holding a reference on mddev->openers.
> > >>>>
> > >>>> Thanks,
> > >>>> Song
> > >>>>
> > >>>> diff --git i/drivers/md/md.c w/drivers/md/md.c
> > >>>> index 78be7811a89f..3e9b718b32c1 100644
> > >>>> --- i/drivers/md/md.c
> > >>>> +++ w/drivers/md/md.c
> > >>>> @@ -7574,11 +7574,15 @@ static int md_ioctl(struct block_device *bdev,
> > >>>> blk_mode_t mode,
> > >>>>                  if (mddev->pers && atomic_read(&mddev->openers) > 1) {
> > >>>>                          mutex_unlock(&mddev->open_mutex);
> > >>>>                          err = -EBUSY;
> > >>>> +                       pr_warn("%s return -EBUSY for %s with
> > >>>> mddev->openers = %d\n",
> > >>>> +                               __func__, mdname(mddev),
> > >>>> atomic_read(&mddev->openers));
> > >>>>                          goto out;
> > >>>>                  }
> > >>>>                  if (test_and_set_bit(MD_CLOSING, &mddev->flags)) {
> > >>>>                          mutex_unlock(&mddev->open_mutex);
> > >>>>                          err = -EBUSY;
> > >>>> +                       pr_warn("%s return -EBUSY for %s with
> > >>>> MD_CLOSING bit set\n",
> > >>>> +                               __func__, mdname(mddev));
> > >>>>                          goto out;
> > >>>>                  }
> > >>>>                  did_set_md_closing = true;
> > >>>> @@ -7789,6 +7793,8 @@ static int md_open(struct gendisk *disk,
> > >>>> blk_mode_t mode)
> > >>>>                  goto out_unlock;
> > >>>>
> > >>>>          atomic_inc(&mddev->openers);
> > >>>> +       pr_info("%s:%s openers++ = %d by %s\n", __func__,
> > >>>> mdname(mddev),
> > >>>> +               atomic_read(&mddev->openers), current->comm);
> > >>>>          mutex_unlock(&mddev->open_mutex);
> > >>>>
> > >>>>          disk_check_media_change(disk);
> > >>>> @@ -7807,6 +7813,8 @@ static void md_release(struct gendisk *disk)
> > >>>>
> > >>>>          BUG_ON(!mddev);
> > >>>>          atomic_dec(&mddev->openers);
> > >>>> +       pr_info("%s:%s openers-- = %d by %s\n", __func__,
> > >>>> mdname(mddev),
> > >>>> +               atomic_read(&mddev->openers), current->comm);
> > >>>>          mddev_put(mddev);
> > >>>>   }
> > >>> It's pretty strange that I can't reproduce the issue after applied
> > >>> the patch.
> > >>>
> > >>> I tried to figure out which part affect the issue and found when I
> > >>> comment out the pr_info() In md_release(), the issue could be
> > >>> reproduced.
> > >>>
From my previous testing, I don't think it's an issue in systemd, so I
did a simple test and found that the issue is gone.
You only need to add a small delay in md_release(), and then the issue
can't be reproduced.

diff --git a/drivers/md/md.c b/drivers/md/md.c
index 78be7811a89f..ef47e34c1af5 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -7805,6 +7805,7 @@ static void md_release(struct gendisk *disk)
{
       struct mddev *mddev = disk->private_data;

+       msleep(10);
       BUG_ON(!mddev);
       atomic_dec(&mddev->openers);
       mddev_put(mddev);

^ permalink raw reply related	[flat|nested] 30+ messages in thread

* Re: Infiniate systemd loop when power off the machine with multiple MD RAIDs
  2023-09-06  6:26                           ` AceLan Kao
@ 2023-09-06 10:27                             ` Mariusz Tkaczyk
  2023-09-07  2:04                               ` Yu Kuai
  0 siblings, 1 reply; 30+ messages in thread
From: Mariusz Tkaczyk @ 2023-09-06 10:27 UTC (permalink / raw)
  To: AceLan Kao
  Cc: Yu Kuai, Song Liu, Guoqing Jiang, Bagas Sanjaya,
	Christoph Hellwig, Linux Kernel Mailing List, Linux Regressions,
	Linux RAID, yukuai (C)

On Wed, 6 Sep 2023 14:26:30 +0800
AceLan Kao <acelan@gmail.com> wrote:

> From previous testing, I don't think it's an issue in systemd, so I
> did a simple test and found the issue is gone.
> You only need to add a small delay in md_release(), then the issue
> can't be reproduced.
> 
> diff --git a/drivers/md/md.c b/drivers/md/md.c
> index 78be7811a89f..ef47e34c1af5 100644
> --- a/drivers/md/md.c
> +++ b/drivers/md/md.c
> @@ -7805,6 +7805,7 @@ static void md_release(struct gendisk *disk)
> {
>        struct mddev *mddev = disk->private_data;
> 
> +       msleep(10);
>        BUG_ON(!mddev);
>        atomic_dec(&mddev->openers);
>        mddev_put(mddev);

I have a repro and I tested it on my setup. It is not working for me.
My setup may be more "advanced", to maximize the chance of reproduction:

# cat /proc/mdstat
Personalities : [raid1] [raid6] [raid5] [raid4] [raid10] [raid0]
md121 : active raid0 nvme2n1[1] nvme5n1[0]
      7126394880 blocks super external:/md127/0 128k chunks

md122 : active raid10 nvme6n1[3] nvme4n1[2] nvme1n1[1] nvme7n1[0]
      104857600 blocks super external:/md126/0 64K chunks 2 near-copies [4/4]
[UUUU]

md123 : active raid5 nvme6n1[3] nvme4n1[2] nvme1n1[1] nvme7n1[0]
      2655765504 blocks super external:/md126/1 level 5, 32k chunk, algorithm 0
[4/4] [UUUU]

md124 : active raid1 nvme0n1[1] nvme3n1[0]
      99614720 blocks super external:/md125/0 [2/2] [UU]

md125 : inactive nvme3n1[1](S) nvme0n1[0](S)
      10402 blocks super external:imsm

md126 : inactive nvme7n1[3](S) nvme1n1[2](S) nvme6n1[1](S) nvme4n1[0](S)
      20043 blocks super external:imsm

md127 : inactive nvme2n1[1](S) nvme5n1[0](S)
      10402 blocks super external:imsm

I have an almost 99% repro ratio; slowly moving forward...

It is an endless loop because systemd-shutdown sends the "stop_array" ioctl,
which is successful, but the array is not stopped. For that reason it sets
"changed = true".

Systemd-shutdown sees the change and retries, to check if there is something
else which can be stopped now, and again, and again...

I will check what is returned first; it could be 0 or it could be a positive
errno (nit?), because systemd only cares about "if (r < 0)".
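
To make it concrete, the logic boils down to something like the
following - a simplified sketch, not the literal code from the
shutdown.c linked earlier in the thread, and the device names are
made up:

#include <errno.h>
#include <fcntl.h>
#include <stdbool.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/major.h>
#include <linux/raid/md_u.h>	/* STOP_ARRAY */

/* Try to stop one array roughly the way systemd-shutdown does: open it
 * exclusively and issue STOP_ARRAY. */
static int md_stop_one(const char *node)
{
	int fd = open(node, O_RDONLY | O_CLOEXEC | O_EXCL);

	if (fd < 0)
		return -errno;	/* e.g. -EBUSY while something holds it */
	if (ioctl(fd, STOP_ARRAY, NULL) < 0) {
		int r = -errno;

		close(fd);
		return r;
	}
	close(fd);
	return 0;		/* "success", even if the device survives */
}

int main(void)
{
	const char *nodes[] = { "/dev/md126", "/dev/md127" };
	bool changed = true;

	/* The real code re-enumerates /sys on every pass; a 0 return is
	 * counted as progress, so an array that reports success without
	 * actually going away keeps this loop spinning forever. */
	while (changed) {
		changed = false;
		for (unsigned int i = 0; i < 2; i++)
			if (md_stop_one(nodes[i]) >= 0)
				changed = true;
	}
	return 0;
}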

Thanks,
Mariusz

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Infiniate systemd loop when power off the machine with multiple MD RAIDs
  2023-09-06 10:27                             ` Mariusz Tkaczyk
@ 2023-09-07  2:04                               ` Yu Kuai
  2023-09-07 10:18                                 ` Mariusz Tkaczyk
  0 siblings, 1 reply; 30+ messages in thread
From: Yu Kuai @ 2023-09-07  2:04 UTC (permalink / raw)
  To: Mariusz Tkaczyk, AceLan Kao
  Cc: Yu Kuai, Song Liu, Guoqing Jiang, Bagas Sanjaya,
	Christoph Hellwig, Linux Kernel Mailing List, Linux Regressions,
	Linux RAID, yukuai (C), yangerkun@huawei.com

Hi,

On 2023/09/06 18:27, Mariusz Tkaczyk wrote:
> On Wed, 6 Sep 2023 14:26:30 +0800
> AceLan Kao <acelan@gmail.com> wrote:
> 
>>  From previous testing, I don't think it's an issue in systemd, so I
>> did a simple test and found the issue is gone.
>> You only need to add a small delay in md_release(), then the issue
>> can't be reproduced.
>>
>> diff --git a/drivers/md/md.c b/drivers/md/md.c
>> index 78be7811a89f..ef47e34c1af5 100644
>> --- a/drivers/md/md.c
>> +++ b/drivers/md/md.c
>> @@ -7805,6 +7805,7 @@ static void md_release(struct gendisk *disk)
>> {
>>         struct mddev *mddev = disk->private_data;
>>
>> +       msleep(10);
>>         BUG_ON(!mddev);
>>         atomic_dec(&mddev->openers);
>>         mddev_put(mddev);
> 
> I have repro and I tested it on my setup. It is not working for me.
> My setup could be more "advanced" to maximalize chance of reproduction:
> 
> # cat /proc/mdstat
> Personalities : [raid1] [raid6] [raid5] [raid4] [raid10] [raid0]
> md121 : active raid0 nvme2n1[1] nvme5n1[0]
>        7126394880 blocks super external:/md127/0 128k chunks
> 
> md122 : active raid10 nvme6n1[3] nvme4n1[2] nvme1n1[1] nvme7n1[0]
>        104857600 blocks super external:/md126/0 64K chunks 2 near-copies [4/4]
> [UUUU]
> 
> md123 : active raid5 nvme6n1[3] nvme4n1[2] nvme1n1[1] nvme7n1[0]
>        2655765504 blocks super external:/md126/1 level 5, 32k chunk, algorithm 0
> [4/4] [UUUU]
> 
> md124 : active raid1 nvme0n1[1] nvme3n1[0]
>        99614720 blocks super external:/md125/0 [2/2] [UU]
> 
> md125 : inactive nvme3n1[1](S) nvme0n1[0](S)
>        10402 blocks super external:imsm
> 
> md126 : inactive nvme7n1[3](S) nvme1n1[2](S) nvme6n1[1](S) nvme4n1[0](S)
>        20043 blocks super external:imsm
> 
> md127 : inactive nvme2n1[1](S) nvme5n1[0](S)
>        10402 blocks super external:imsm
> 
> I have almost 99% repro ratio, slowly moving forward..
> 
> It is endless loop because systemd-shutdown sends ioctl "stop_array" which is
> successful but array is not stopped. For that reason it sets "changed = true".

How does systemd-shutdown judge whether an array is stopped? cat /proc/mdstat,
ls /dev/md*, or some other way?
> 
> Systemd-shutdown see the change and retries to check if there is something else
> which can be stopped now, and again, again...
> 
> I will check what is returned first, it could be 0 or it could be positive
> errno (nit?) because systemd cares "if(r < 0)".

I do notice that there are lots of log lines about md123 being stopped:

[ 1371.834034] md122:systemd-shutdow bd_prepare_to_claim return -16
[ 1371.840294] md122:systemd-shutdow blkdev_get_by_dev return -16
[ 1371.846845] md: md123 stopped.
[ 1371.850155] md122:systemd-shutdow bd_prepare_to_claim return -16
[ 1371.856411] md122:systemd-shutdow blkdev_get_by_dev return -16
[ 1371.862941] md: md123 stopped.

And md_ioctl->do_md_stop doesn't have an error path after printing this
log, hence 0 will be returned to the user.

The normal case is that:

open md123
ioctl STOP_ARRAY -> all rdev should be removed from array
close md123 -> mddev will finally be freed by:
	md_release
	 mddev_put
	  set_bit(MD_DELETED, &mddev->flags) -> user should not see this mddev
	  queue_work(md_misc_wq, &mddev->del_work)

	mddev_delayed_delete
	 kobject_put(&mddev->kobj)

	md_kobj_release
	 del_gendisk
	  md_free_disk
	   mddev_free
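
For context, mddev_put() here looks roughly like this (reconstructed
from a 6.5-era tree, so treat the details as approximate):

void mddev_put(struct mddev *mddev)
{
	/* Only the caller dropping the *last* reference takes the lock
	 * and may queue the deferred deletion. */
	if (!atomic_dec_and_lock(&mddev->active, &all_mddevs_lock))
		return;
	if (!mddev->raid_disks && list_empty(&mddev->disks) &&
	    mddev->ctime == 0 && !mddev->hold_active) {
		/* Array is not configured at all, and not held active,
		 * so destroy it */
		set_bit(MD_DELETED, &mddev->flags);

		/* Queue the work inside the spinlock so that a
		 * flush_workqueue() after mddev_find() can wait for it. */
		INIT_WORK(&mddev->del_work, mddev_delayed_delete);
		queue_work(md_misc_wq, &mddev->del_work);
	}
	spin_unlock(&all_mddevs_lock);
}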

Now that you can reproduce this problem 99% of the time, can you dig deeper
and find out what is wrong?

Thanks,
Kuai

> 
> Thanks,
> Mariusz
> 


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Infiniate systemd loop when power off the machine with multiple MD RAIDs
  2023-09-07  2:04                               ` Yu Kuai
@ 2023-09-07 10:18                                 ` Mariusz Tkaczyk
       [not found]                                   ` <cffca94f-5729-622d-9327-632b3ff2891a@huaweicloud.com>
  0 siblings, 1 reply; 30+ messages in thread
From: Mariusz Tkaczyk @ 2023-09-07 10:18 UTC (permalink / raw)
  To: Yu Kuai
  Cc: AceLan Kao, Song Liu, Guoqing Jiang, Bagas Sanjaya,
	Christoph Hellwig, Linux Kernel Mailing List, Linux Regressions,
	Linux RAID, yukuai (C), yangerkun@huawei.com

On Thu, 7 Sep 2023 10:04:11 +0800
Yu Kuai <yukuai1@huaweicloud.com> wrote:

> Hi,
> 
> On 2023/09/06 18:27, Mariusz Tkaczyk wrote:
> > On Wed, 6 Sep 2023 14:26:30 +0800
> > AceLan Kao <acelan@gmail.com> wrote:
> >   
> >>  From previous testing, I don't think it's an issue in systemd, so I
> >> did a simple test and found the issue is gone.
> >> You only need to add a small delay in md_release(), then the issue
> >> can't be reproduced.
> >>
> >> diff --git a/drivers/md/md.c b/drivers/md/md.c
> >> index 78be7811a89f..ef47e34c1af5 100644
> >> --- a/drivers/md/md.c
> >> +++ b/drivers/md/md.c
> >> @@ -7805,6 +7805,7 @@ static void md_release(struct gendisk *disk)
> >> {
> >>         struct mddev *mddev = disk->private_data;
> >>
> >> +       msleep(10);
> >>         BUG_ON(!mddev);
> >>         atomic_dec(&mddev->openers);
> >>         mddev_put(mddev);  
> > 
> > I have repro and I tested it on my setup. It is not working for me.
> > My setup could be more "advanced" to maximalize chance of reproduction:
> > 
> > # cat /proc/mdstat
> > Personalities : [raid1] [raid6] [raid5] [raid4] [raid10] [raid0]
> > md121 : active raid0 nvme2n1[1] nvme5n1[0]
> >        7126394880 blocks super external:/md127/0 128k chunks
> > 
> > md122 : active raid10 nvme6n1[3] nvme4n1[2] nvme1n1[1] nvme7n1[0]
> >        104857600 blocks super external:/md126/0 64K chunks 2 near-copies
> > [4/4] [UUUU]
> > 
> > md123 : active raid5 nvme6n1[3] nvme4n1[2] nvme1n1[1] nvme7n1[0]
> >        2655765504 blocks super external:/md126/1 level 5, 32k chunk,
> > algorithm 0 [4/4] [UUUU]
> > 
> > md124 : active raid1 nvme0n1[1] nvme3n1[0]
> >        99614720 blocks super external:/md125/0 [2/2] [UU]
> > 
> > md125 : inactive nvme3n1[1](S) nvme0n1[0](S)
> >        10402 blocks super external:imsm
> > 
> > md126 : inactive nvme7n1[3](S) nvme1n1[2](S) nvme6n1[1](S) nvme4n1[0](S)
> >        20043 blocks super external:imsm
> > 
> > md127 : inactive nvme2n1[1](S) nvme5n1[0](S)
> >        10402 blocks super external:imsm
> > 
> > I have almost 99% repro ratio, slowly moving forward..
> > 
> > It is endless loop because systemd-shutdown sends ioctl "stop_array" which
> > is successful but array is not stopped. For that reason it sets "changed =
> > true".  
> 
> How does systemd-shutdown judge if array is stopped? cat /proc/mdstat or
> ls /dev/md* or other way?

Hi Yu,

It trusts the return result; I confirmed that 0 is returned.
The weirdest part is that we are returning 0 but the array is still there, and
it is stopped again in the next systemd loop. I don't understand why yet...

> > Systemd-shutdown see the change and retries to check if there is something
> > else which can be stopped now, and again, again...
> > 
> > I will check what is returned first, it could be 0 or it could be positive
> > errno (nit?) because systemd cares "if(r < 0)".  
> 
> I do noticed that there are lots of log about md123 stopped:
> 
> [ 1371.834034] md122:systemd-shutdow bd_prepare_to_claim return -16
> [ 1371.840294] md122:systemd-shutdow blkdev_get_by_dev return -16
> [ 1371.846845] md: md123 stopped.
> [ 1371.850155] md122:systemd-shutdow bd_prepare_to_claim return -16
> [ 1371.856411] md122:systemd-shutdow blkdev_get_by_dev return -16
> [ 1371.862941] md: md123 stopped.
> 
> And md_ioctl->do_md_stop doesn't have error path after printing this
> log, hence 0 will be returned to user.
> 
> The normal case is that:
> 
> open md123
> ioctl STOP_ARRAY -> all rdev should be removed from array
> close md123 -> mddev will finally be freed by:
> 	md_release
> 	 mddev_put
> 	  set_bit(MD_DELETED, &mddev->flags) -> user shound not see this mddev
> 	  queue_work(md_misc_wq, &mddev->del_work)
> 
> 	mddev_delayed_delete
> 	 kobject_put(&mddev->kobj)
> 
> 	md_kobj_release
> 	 del_gendisk
> 	  md_free_disk
> 	   mddev_free
> 
OK, thanks. I understand that md_release is called when the descriptor is closed, right?


> Now that you can reporduce this problem 99%, can you dig deeper and find
> out what is wrong?

Yes, working on it!

My first idea was that mddev_get() and mddev_put() are missing on the
md_ioctl() path, but it doesn't help with the issue. My motivation here was
that md_attr_store() and md_attr_show() are using them.

Systemd regenerates the list of MD arrays on every loop and the array is
always there; systemd is able to open a file descriptor (maybe an inactive
one?). My fix with the mddev_get() and mddev_put() checks should address that,
so it seems to be something different.

BTW, I opened a discussion on the Slack proposed by Paul. I'm adding updates
there more frequently. If you are interested, please join. I can see that you
were invited as:
yukuai1@huaweicloud.com
Perhaps that is the reason you can't join. Let me know and I will add you.

Thanks,
Mariusz


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Infiniate systemd loop when power off the machine with multiple MD RAIDs
       [not found]                                     ` <3e7edf0c-cadd-59b0-4e10-dffdb86b93b7@huaweicloud.com>
@ 2023-09-07 12:41                                       ` Mariusz Tkaczyk
  2023-09-07 12:53                                         ` Yu Kuai
  0 siblings, 1 reply; 30+ messages in thread
From: Mariusz Tkaczyk @ 2023-09-07 12:41 UTC (permalink / raw)
  To: Yu Kuai
  Cc: AceLan Kao, Song Liu, Guoqing Jiang, Bagas Sanjaya,
	Christoph Hellwig, Linux Kernel Mailing List, Linux Regressions,
	Linux RAID, yangerkun@huawei.com, yukuai (C)

On Thu, 7 Sep 2023 20:14:03 +0800
Yu Kuai <yukuai1@huaweicloud.com> wrote:

> Hi,
> 
> On 2023/09/07 19:26, Yu Kuai wrote:
> > Hi,
> > 
> > On 2023/09/07 18:18, Mariusz Tkaczyk wrote:
> >> On Thu, 7 Sep 2023 10:04:11 +0800
> >> Yu Kuai <yukuai1@huaweicloud.com> wrote:
> >>  
> >>> Hi,
> >>>
> >>> On 2023/09/06 18:27, Mariusz Tkaczyk wrote:
> >>>> On Wed, 6 Sep 2023 14:26:30 +0800
> >>>> AceLan Kao <acelan@gmail.com> wrote:  
> >>>>>   From previous testing, I don't think it's an issue in systemd, so I
> >>>>> did a simple test and found the issue is gone.
> >>>>> You only need to add a small delay in md_release(), then the issue
> >>>>> can't be reproduced.
> >>>>>
> >>>>> diff --git a/drivers/md/md.c b/drivers/md/md.c
> >>>>> index 78be7811a89f..ef47e34c1af5 100644
> >>>>> --- a/drivers/md/md.c
> >>>>> +++ b/drivers/md/md.c
> >>>>> @@ -7805,6 +7805,7 @@ static void md_release(struct gendisk *disk)
> >>>>> {
> >>>>>          struct mddev *mddev = disk->private_data;
> >>>>>
> >>>>> +       msleep(10);
> >>>>>          BUG_ON(!mddev);
> >>>>>          atomic_dec(&mddev->openers);
> >>>>>          mddev_put(mddev);  
> >>>>
> >>>> I have repro and I tested it on my setup. It is not working for me.
> >>>> My setup could be more "advanced" to maximalize chance of reproduction:
> >>>>
> >>>> # cat /proc/mdstat
> >>>> Personalities : [raid1] [raid6] [raid5] [raid4] [raid10] [raid0]
> >>>> md121 : active raid0 nvme2n1[1] nvme5n1[0]
> >>>>         7126394880 blocks super external:/md127/0 128k chunks
> >>>>
> >>>> md122 : active raid10 nvme6n1[3] nvme4n1[2] nvme1n1[1] nvme7n1[0]
> >>>>         104857600 blocks super external:/md126/0 64K chunks 2 
> >>>> near-copies
> >>>> [4/4] [UUUU]
> >>>>
> >>>> md123 : active raid5 nvme6n1[3] nvme4n1[2] nvme1n1[1] nvme7n1[0]
> >>>>         2655765504 blocks super external:/md126/1 level 5, 32k chunk,
> >>>> algorithm 0 [4/4] [UUUU]
> >>>>
> >>>> md124 : active raid1 nvme0n1[1] nvme3n1[0]
> >>>>         99614720 blocks super external:/md125/0 [2/2] [UU]
> >>>>
> >>>> md125 : inactive nvme3n1[1](S) nvme0n1[0](S)
> >>>>         10402 blocks super external:imsm
> >>>>
> >>>> md126 : inactive nvme7n1[3](S) nvme1n1[2](S) nvme6n1[1](S) 
> >>>> nvme4n1[0](S)
> >>>>         20043 blocks super external:imsm
> >>>>
> >>>> md127 : inactive nvme2n1[1](S) nvme5n1[0](S)
> >>>>         10402 blocks super external:imsm
> >>>>
> >>>> I have almost 99% repro ratio, slowly moving forward..
> >>>>
> >>>> It is endless loop because systemd-shutdown sends ioctl "stop_array" 
> >>>> which
> >>>> is successful but array is not stopped. For that reason it sets 
> >>>> "changed =
> >>>> true".  
> >>>
> >>> How does systemd-shutdown judge if array is stopped? cat /proc/mdstat or
> >>> ls /dev/md* or other way?  
> >>
> >> Hi Yu,
> >>
> >> It trusts return result, I confirmed that 0 is returned.
> >> The most weird is we are returning 0 but array is still there, and it is
> >> stopped again in next systemd loop. I don't understand why yet..
> >>  
> >>>> Systemd-shutdown see the change and retries to check if there is 
> >>>> something
> >>>> else which can be stopped now, and again, again...
> >>>>
> >>>> I will check what is returned first, it could be 0 or it could be 
> >>>> positive
> >>>> errno (nit?) because systemd cares "if(r < 0)".  
> >>>
> >>> I do noticed that there are lots of log about md123 stopped:
> >>>
> >>> [ 1371.834034] md122:systemd-shutdow bd_prepare_to_claim return -16
> >>> [ 1371.840294] md122:systemd-shutdow blkdev_get_by_dev return -16
> >>> [ 1371.846845] md: md123 stopped.
> >>> [ 1371.850155] md122:systemd-shutdow bd_prepare_to_claim return -16
> >>> [ 1371.856411] md122:systemd-shutdow blkdev_get_by_dev return -16
> >>> [ 1371.862941] md: md123 stopped.
> >>>
> >>> And md_ioctl->do_md_stop doesn't have error path after printing this
> >>> log, hence 0 will be returned to user.
> >>>
> >>> The normal case is that:
> >>>
> >>> open md123
> >>> ioctl STOP_ARRAY -> all rdev should be removed from array
> >>> close md123 -> mddev will finally be freed by:
> >>>     md_release
> >>>      mddev_put
> >>>       set_bit(MD_DELETED, &mddev->flags) -> user shound not see this 
> >>> mddev
> >>>       queue_work(md_misc_wq, &mddev->del_work)
> >>>
> >>>     mddev_delayed_delete
> >>>      kobject_put(&mddev->kobj)
> >>>
> >>>     md_kobj_release
> >>>      del_gendisk
> >>>       md_free_disk
> >>>        mddev_free
> >>>  
> >> Ok thanks, I understand that md_release is called on descriptor 
> >> closing, right?
> >>  
> > 
> > Yes, normally close md123 should drop that last reference.  
> >>  
> >>> Now that you can reporduce this problem 99%, can you dig deeper and find
> >>> out what is wrong?  
> >>
> >> Yes, working on it!
> >>
> >> My first idea was that mddev_get and mddev_put are missing on 
> >> md_ioctl() path
> >> but it doesn't help for the issue. My motivation here was that 
> >> md_attr_store and
> >> md_attr_show are using them.
> >>
> >> Systemd regenerates list of MD arrays on every loop and it is always
> >> there, systemd is able to open file descriptor (maybe inactive?).  
> > 
> >>> md123 should not be opened again; ioctl(STOP_ARRAY) already set the
> >>> flag 'MD_CLOSING' to prevent that. Are you sure that systemd-shutdown does
> >>> open and close the array in each loop?
> 
> >> I realized that I'm wrong here. 'MD_CLOSING' is cleared before the ioctl
> >> returns, by commit 065e519e71b2 ("md: MD_CLOSING needs to be cleared after
> >> called md_set_readonly or do_md_stop").
> >>
> >> I'm confused here: the commit message said 'MD_CLOSING' should not be set
> >> for the case STOP_ARRAY_RO, but I don't understand why it's cleared for
> >> STOP_ARRAY as well.
> 
Not related to your findings, but:

I replaced the if (!atomic_dec_and_lock(&mddev->active, &all_mddevs_lock))
check, because that is the path that exits without queuing the work:

diff --git a/drivers/md/md.c b/drivers/md/md.c
index 0fe7ab6e8ab9..80bd7446be94 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -618,8 +618,7 @@ static void mddev_delayed_delete(struct work_struct *ws);

 

void mddev_put(struct mddev *mddev)
{
-       if (!atomic_dec_and_lock(&mddev->active, &all_mddevs_lock))
-               return;
+       spin_lock(&all_mddevs_lock);
        if (!mddev->raid_disks && list_empty(&mddev->disks) &&
            mddev->ctime == 0 && !mddev->hold_active) {
                /* Array is not configured at all, and not held active,
@@ -634,6 +633,7 @@ void mddev_put(struct mddev *mddev)
                INIT_WORK(&mddev->del_work, mddev_delayed_delete);
                queue_work(md_misc_wq, &mddev->del_work);
        }
+       atomic_dec(&mddev->active);
        spin_unlock(&all_mddevs_lock);
}

After that I got a kernel panic, but it seems that the workqueue is scheduled:

 51.535103] BUG: kernel NULL pointer dereference, address: 0000000000000008
[   51.539115] ------------[ cut here ]------------
[   51.543867] #PF: supervisor read access in kernel mode
1;[3 9 m S5t1a.r5tPF: error_code(0x0000) - not-present page
[   51.543875] PGD 0 P4D 0
.k[ 0 m5.1
 54ops: 0000 [#1] PREEMPT SMP NOPTI
[   51.552207] refcount_t: underflow; use-after-free.
[   51.556820] CPU: 19 PID: 368 Comm: kworker/19:1 Not tainted 6.5.0+ #57
[   51.556825] Hardware name: Intel Corporation WilsonCity/WilsonCity, BIOS
WLYDCRB1.SYS.0027.P82.2204080829 04/08/2022 [   51.561979] WARNING: CPU: 26
PID: 376 at lib/refcount.c:28 refcount_warn_saturate+0x99/0xe0 [   51.567273]
Workqueue:  mddev_delayed_delete [md_mod] [   51.569822] Modules linked in:
[   51.574351]  (events)
[   51.574353] RIP: 0010:process_one_work+0x10f/0x3d0
[   51.579155]  configfs

In my case, it seems to be the IMSM container device that is stopped in a loop,
and it is inactive from the start. It is not something I'm totally sure about,
but it could lead us to the root cause. As far as I know, the original reporter
uses IMSM arrays too.
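
One guess about the splat (not verified): with the lock taken
unconditionally, every caller - not only the one dropping the last
reference - can reach the deletion branch, so del_work may be queued
twice and the kobject then gets put twice. Keeping the dec-and-test
semantics while still always taking the lock would look roughly like
this (sketch only):

void mddev_put(struct mddev *mddev)
{
	spin_lock(&all_mddevs_lock);
	/* atomic_dec_and_test() is true only for the last reference, so
	 * the deletion work cannot be queued more than once. */
	if (atomic_dec_and_test(&mddev->active) &&
	    !mddev->raid_disks && list_empty(&mddev->disks) &&
	    mddev->ctime == 0 && !mddev->hold_active) {
		set_bit(MD_DELETED, &mddev->flags);
		INIT_WORK(&mddev->del_work, mddev_delayed_delete);
		queue_work(md_misc_wq, &mddev->del_work);
	}
	spin_unlock(&all_mddevs_lock);
}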

Thanks,
Mariusz

^ permalink raw reply related	[flat|nested] 30+ messages in thread

* Re: Infiniate systemd loop when power off the machine with multiple MD RAIDs
  2023-09-07 12:41                                       ` Mariusz Tkaczyk
@ 2023-09-07 12:53                                         ` Yu Kuai
  2023-09-07 15:09                                           ` Mariusz Tkaczyk
  0 siblings, 1 reply; 30+ messages in thread
From: Yu Kuai @ 2023-09-07 12:53 UTC (permalink / raw)
  To: Mariusz Tkaczyk, Yu Kuai
  Cc: AceLan Kao, Song Liu, Guoqing Jiang, Bagas Sanjaya,
	Christoph Hellwig, Linux Kernel Mailing List, Linux Regressions,
	Linux RAID, yangerkun@huawei.com, yukuai (C)

Hi,

On 2023/09/07 20:41, Mariusz Tkaczyk wrote:
> On Thu, 7 Sep 2023 20:14:03 +0800
> Yu Kuai <yukuai1@huaweicloud.com> wrote:
> 
>> Hi,
>>
>> On 2023/09/07 19:26, Yu Kuai wrote:
>>> Hi,
>>>
>>> On 2023/09/07 18:18, Mariusz Tkaczyk wrote:
>>>> On Thu, 7 Sep 2023 10:04:11 +0800
>>>> Yu Kuai <yukuai1@huaweicloud.com> wrote:
>>>>   
>>>>> Hi,
>>>>>
>>>>> On 2023/09/06 18:27, Mariusz Tkaczyk wrote:
>>>>>> On Wed, 6 Sep 2023 14:26:30 +0800
>>>>>> AceLan Kao <acelan@gmail.com> wrote:
>>>>>>>    From previous testing, I don't think it's an issue in systemd, so I
>>>>>>> did a simple test and found the issue is gone.
>>>>>>> You only need to add a small delay in md_release(), then the issue
>>>>>>> can't be reproduced.
>>>>>>>
>>>>>>> diff --git a/drivers/md/md.c b/drivers/md/md.c
>>>>>>> index 78be7811a89f..ef47e34c1af5 100644
>>>>>>> --- a/drivers/md/md.c
>>>>>>> +++ b/drivers/md/md.c
>>>>>>> @@ -7805,6 +7805,7 @@ static void md_release(struct gendisk *disk)
>>>>>>> {
>>>>>>>           struct mddev *mddev = disk->private_data;
>>>>>>>
>>>>>>> +       msleep(10);
>>>>>>>           BUG_ON(!mddev);
>>>>>>>           atomic_dec(&mddev->openers);
>>>>>>>           mddev_put(mddev);
>>>>>>
>>>>>> I have repro and I tested it on my setup. It is not working for me.
>>>>>> My setup could be more "advanced" to maximalize chance of reproduction:
>>>>>>
>>>>>> # cat /proc/mdstat
>>>>>> Personalities : [raid1] [raid6] [raid5] [raid4] [raid10] [raid0]
>>>>>> md121 : active raid0 nvme2n1[1] nvme5n1[0]
>>>>>>          7126394880 blocks super external:/md127/0 128k chunks
>>>>>>
>>>>>> md122 : active raid10 nvme6n1[3] nvme4n1[2] nvme1n1[1] nvme7n1[0]
>>>>>>          104857600 blocks super external:/md126/0 64K chunks 2
>>>>>> near-copies
>>>>>> [4/4] [UUUU]
>>>>>>
>>>>>> md123 : active raid5 nvme6n1[3] nvme4n1[2] nvme1n1[1] nvme7n1[0]
>>>>>>          2655765504 blocks super external:/md126/1 level 5, 32k chunk,
>>>>>> algorithm 0 [4/4] [UUUU]
>>>>>>
>>>>>> md124 : active raid1 nvme0n1[1] nvme3n1[0]
>>>>>>          99614720 blocks super external:/md125/0 [2/2] [UU]
>>>>>>
>>>>>> md125 : inactive nvme3n1[1](S) nvme0n1[0](S)
>>>>>>          10402 blocks super external:imsm
>>>>>>
>>>>>> md126 : inactive nvme7n1[3](S) nvme1n1[2](S) nvme6n1[1](S)
>>>>>> nvme4n1[0](S)
>>>>>>          20043 blocks super external:imsm
>>>>>>
>>>>>> md127 : inactive nvme2n1[1](S) nvme5n1[0](S)
>>>>>>          10402 blocks super external:imsm
>>>>>>
>>>>>> I have almost 99% repro ratio, slowly moving forward..
>>>>>>
>>>>>> It is endless loop because systemd-shutdown sends ioctl "stop_array"
>>>>>> which
>>>>>> is successful but array is not stopped. For that reason it sets
>>>>>> "changed =
>>>>>> true".
>>>>>
>>>>> How does systemd-shutdown judge if array is stopped? cat /proc/mdstat or
>>>>> ls /dev/md* or other way?
>>>>
>>>> Hi Yu,
>>>>
>>>> It trusts return result, I confirmed that 0 is returned.
>>>> The most weird is we are returning 0 but array is still there, and it is
>>>> stopped again in next systemd loop. I don't understand why yet..
>>>>   
>>>>>> Systemd-shutdown see the change and retries to check if there is
>>>>>> something
>>>>>> else which can be stopped now, and again, again...
>>>>>>
>>>>>> I will check what is returned first, it could be 0 or it could be
>>>>>> positive
>>>>>> errno (nit?) because systemd cares "if(r < 0)".
>>>>>
>>>>> I do noticed that there are lots of log about md123 stopped:
>>>>>
>>>>> [ 1371.834034] md122:systemd-shutdow bd_prepare_to_claim return -16
>>>>> [ 1371.840294] md122:systemd-shutdow blkdev_get_by_dev return -16
>>>>> [ 1371.846845] md: md123 stopped.
>>>>> [ 1371.850155] md122:systemd-shutdow bd_prepare_to_claim return -16
>>>>> [ 1371.856411] md122:systemd-shutdow blkdev_get_by_dev return -16
>>>>> [ 1371.862941] md: md123 stopped.
>>>>>
>>>>> And md_ioctl->do_md_stop doesn't have error path after printing this
>>>>> log, hence 0 will be returned to user.
>>>>>
>>>>> The normal case is that:
>>>>>
>>>>> open md123
>>>>> ioctl STOP_ARRAY -> all rdev should be removed from array
>>>>> close md123 -> mddev will finally be freed by:
>>>>>      md_release
>>>>>       mddev_put
>>>>>        set_bit(MD_DELETED, &mddev->flags) -> user shound not see this
>>>>> mddev
>>>>>        queue_work(md_misc_wq, &mddev->del_work)
>>>>>
>>>>>      mddev_delayed_delete
>>>>>       kobject_put(&mddev->kobj)
>>>>>
>>>>>      md_kobj_release
>>>>>       del_gendisk
>>>>>        md_free_disk
>>>>>         mddev_free
>>>>>   
>>>> Ok thanks, I understand that md_release is called on descriptor
>>>> closing, right?
>>>>   
>>>
>>> Yes, normally close md123 should drop that last reference.
>>>>   
>>>>> Now that you can reporduce this problem 99%, can you dig deeper and find
>>>>> out what is wrong?
>>>>
>>>> Yes, working on it!
>>>>
>>>> My first idea was that mddev_get and mddev_put are missing on
>>>> md_ioctl() path
>>>> but it doesn't help for the issue. My motivation here was that
>>>> md_attr_store and
>>>> md_attr_show are using them.
>>>>
>>>> Systemd regenerates list of MD arrays on every loop and it is always
>>>> there, systemd is able to open file descriptor (maybe inactive?).
>>>
>>> md123 should not be opended again, ioctl(STOP_ARRAY) already set the
>>> flag 'MD_CLOSING' to prevent that. Are you sure that systemd-shutdown do
>>> open and close the array in each loop?
>>
>> I realized that I'm wrong here. 'MD_CLOSING' is cleared before ioctl
>> return by commit 065e519e71b2 ("md: MD_CLOSING needs to be cleared after
>> called md_set_readonly or do_md_stop").
>>
>> I'm confused here, commit message said 'MD_CLOSING' shold not be set for
>> the case STOP_ARRAY_RO, but I don't understand why it's cleared for
>> STOP_ARRAY as well.
>>

Can you try the following patch?

diff --git a/drivers/md/md.c b/drivers/md/md.c
index 3afd57622a0b..31b9cec7e4c0 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -7668,7 +7668,8 @@ static int md_ioctl(struct block_device *bdev, fmode_t mode,
                         err = -EBUSY;
                         goto out;
                 }
-               did_set_md_closing = true;
+               if (cmd == STOP_ARRAY_RO)
+                       did_set_md_closing = true;
                 mutex_unlock(&mddev->open_mutex);
                 sync_blockdev(bdev);
         }

I think preventing the array from being opened again after STOP_ARRAY might
fix this.
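
For reference, the open side that this interacts with looks roughly
like this (again reconstructed from a 6.5-era tree, so approximate):
the MD_CLOSING test is what should keep a stopping array from being
re-opened.

static int md_open(struct gendisk *disk, blk_mode_t mode)
{
	struct mddev *mddev;
	int err;

	spin_lock(&all_mddevs_lock);
	mddev = mddev_get(disk->private_data);
	spin_unlock(&all_mddevs_lock);
	if (!mddev)
		return -ENODEV;

	err = mutex_lock_interruptible(&mddev->open_mutex);
	if (err)
		goto out;

	err = -ENODEV;
	if (test_bit(MD_CLOSING, &mddev->flags))
		goto out_unlock;	/* refuse opens while stopping */

	atomic_inc(&mddev->openers);
	mutex_unlock(&mddev->open_mutex);

	disk_check_media_change(disk);
	return 0;

out_unlock:
	mutex_unlock(&mddev->open_mutex);
out:
	mddev_put(mddev);
	return err;
}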

Thanks,
Kuai

> Not related with your findings but:
> 
> I replaced if (!atomic_dec_and_lock(&mddev->active, &all_mddevs_lock))
> because that is the way to exit without queuing work:
> 
> diff --git a/drivers/md/md.c b/drivers/md/md.c
> index 0fe7ab6e8ab9..80bd7446be94 100644
> --- a/drivers/md/md.c
> +++ b/drivers/md/md.c
> @@ -618,8 +618,7 @@ static void mddev_delayed_delete(struct work_struct *ws);
> 
>   
> 
> void mddev_put(struct mddev *mddev)
> {
> -       if (!atomic_dec_and_lock(&mddev->active, &all_mddevs_lock))
> -               return;
> +       spin_lock(&all_mddevs_lock);
>          if (!mddev->raid_disks && list_empty(&mddev->disks) &&
>              mddev->ctime == 0 && !mddev->hold_active) {
>                  /* Array is not configured at all, and not held active,
> @@ -634,6 +633,7 @@ void mddev_put(struct mddev *mddev)
>                  INIT_WORK(&mddev->del_work, mddev_delayed_delete);
>                  queue_work(md_misc_wq, &mddev->del_work);
>          }
> +       atomic_dec(&mddev->active);
>          spin_unlock(&all_mddevs_lock);
> }
> 
> After that I got kernel panic but it seems that workqueue is scheduled:
> 
>   51.535103] BUG: kernel NULL pointer dereference, address: 0000000000000008
> [   51.539115] ------------[ cut here ]------------
> [   51.543867] #PF: supervisor read access in kernel mode
> 1;[3 9 m S5t1a.r5tPF: error_code(0x0000) - not-present page
> [   51.543875] PGD 0 P4D 0
> .k[ 0 m5.1
>   54ops: 0000 [#1] PREEMPT SMP NOPTI
> [   51.552207] refcount_t: underflow; use-after-free.
> [   51.556820] CPU: 19 PID: 368 Comm: kworker/19:1 Not tainted 6.5.0+ #57
> [   51.556825] Hardware name: Intel Corporation WilsonCity/WilsonCity, BIOS
> WLYDCRB1.SYS.0027.P82.2204080829 04/08/2022 [   51.561979] WARNING: CPU: 26
> PID: 376 at lib/refcount.c:28 refcount_warn_saturate+0x99/0xe0 [   51.567273]
> Workqueue:  mddev_delayed_delete [md_mod] [   51.569822] Modules linked in:
> [   51.574351]  (events)
> [   51.574353] RIP: 0010:process_one_work+0x10f/0x3d0
> [   51.579155]  configfs
> 
> In my case, it seems to be IMSM container device is stopped in loop, which is an
> inactive from the start. It is not something I'm totally sure but it could lead
> us to the root cause. So far I know, the original reported uses IMSM arrays too.
> 
> Thanks,
> Mariusz
> 


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* Re: Infiniate systemd loop when power off the machine with multiple MD RAIDs
  2023-09-07 12:53                                         ` Yu Kuai
@ 2023-09-07 15:09                                           ` Mariusz Tkaczyk
  2023-09-08 20:25                                             ` Song Liu
  0 siblings, 1 reply; 30+ messages in thread
From: Mariusz Tkaczyk @ 2023-09-07 15:09 UTC (permalink / raw)
  To: Yu Kuai
  Cc: AceLan Kao, Song Liu, Guoqing Jiang, Bagas Sanjaya,
	Christoph Hellwig, Linux Kernel Mailing List, Linux Regressions,
	Linux RAID, yangerkun@huawei.com, yukuai (C)

On Thu, 7 Sep 2023 20:53:30 +0800
Yu Kuai <yukuai1@huaweicloud.com> wrote:

> Hi,
> 
> On 2023/09/07 20:41, Mariusz Tkaczyk wrote:
> > On Thu, 7 Sep 2023 20:14:03 +0800
> > Yu Kuai <yukuai1@huaweicloud.com> wrote:
> >   
> >> Hi,
> >>
> >> On 2023/09/07 19:26, Yu Kuai wrote:
> >>> Hi,
> >>>
> >>> On 2023/09/07 18:18, Mariusz Tkaczyk wrote:
> >>>> On Thu, 7 Sep 2023 10:04:11 +0800
> >>>> Yu Kuai <yukuai1@huaweicloud.com> wrote:
> >>>>     
> >>>>> Hi,
> >>>>>
> >>>>> On 2023/09/06 18:27, Mariusz Tkaczyk wrote:
> >>>>>> On Wed, 6 Sep 2023 14:26:30 +0800
> >>>>>> AceLan Kao <acelan@gmail.com> wrote:  
> >>>>>>>    From previous testing, I don't think it's an issue in systemd, so I
> >>>>>>> did a simple test and found the issue is gone.
> >>>>>>> You only need to add a small delay in md_release(), then the issue
> >>>>>>> can't be reproduced.
> >>>>>>>
> >>>>>>> diff --git a/drivers/md/md.c b/drivers/md/md.c
> >>>>>>> index 78be7811a89f..ef47e34c1af5 100644
> >>>>>>> --- a/drivers/md/md.c
> >>>>>>> +++ b/drivers/md/md.c
> >>>>>>> @@ -7805,6 +7805,7 @@ static void md_release(struct gendisk *disk)
> >>>>>>> {
> >>>>>>>           struct mddev *mddev = disk->private_data;
> >>>>>>>
> >>>>>>> +       msleep(10);
> >>>>>>>           BUG_ON(!mddev);
> >>>>>>>           atomic_dec(&mddev->openers);
> >>>>>>>           mddev_put(mddev);  
> >>>>>>
> >>>>>> I have repro and I tested it on my setup. It is not working for me.
> >>>>>> My setup could be more "advanced" to maximalize chance of reproduction:
> >>>>>>
> >>>>>> # cat /proc/mdstat
> >>>>>> Personalities : [raid1] [raid6] [raid5] [raid4] [raid10] [raid0]
> >>>>>> md121 : active raid0 nvme2n1[1] nvme5n1[0]
> >>>>>>          7126394880 blocks super external:/md127/0 128k chunks
> >>>>>>
> >>>>>> md122 : active raid10 nvme6n1[3] nvme4n1[2] nvme1n1[1] nvme7n1[0]
> >>>>>>          104857600 blocks super external:/md126/0 64K chunks 2
> >>>>>> near-copies
> >>>>>> [4/4] [UUUU]
> >>>>>>
> >>>>>> md123 : active raid5 nvme6n1[3] nvme4n1[2] nvme1n1[1] nvme7n1[0]
> >>>>>>          2655765504 blocks super external:/md126/1 level 5, 32k chunk,
> >>>>>> algorithm 0 [4/4] [UUUU]
> >>>>>>
> >>>>>> md124 : active raid1 nvme0n1[1] nvme3n1[0]
> >>>>>>          99614720 blocks super external:/md125/0 [2/2] [UU]
> >>>>>>
> >>>>>> md125 : inactive nvme3n1[1](S) nvme0n1[0](S)
> >>>>>>          10402 blocks super external:imsm
> >>>>>>
> >>>>>> md126 : inactive nvme7n1[3](S) nvme1n1[2](S) nvme6n1[1](S)
> >>>>>> nvme4n1[0](S)
> >>>>>>          20043 blocks super external:imsm
> >>>>>>
> >>>>>> md127 : inactive nvme2n1[1](S) nvme5n1[0](S)
> >>>>>>          10402 blocks super external:imsm
> >>>>>>
> >>>>>> I have almost 99% repro ratio, slowly moving forward..
> >>>>>>
> >>>>>> It is endless loop because systemd-shutdown sends ioctl "stop_array"
> >>>>>> which
> >>>>>> is successful but array is not stopped. For that reason it sets
> >>>>>> "changed =
> >>>>>> true".  
> >>>>>
> >>>>> How does systemd-shutdown judge if array is stopped? cat /proc/mdstat or
> >>>>> ls /dev/md* or other way?  
> >>>>
> >>>> Hi Yu,
> >>>>
> >>>> It trusts return result, I confirmed that 0 is returned.
> >>>> The most weird is we are returning 0 but array is still there, and it is
> >>>> stopped again in next systemd loop. I don't understand why yet..
> >>>>     
> >>>>>> Systemd-shutdown see the change and retries to check if there is
> >>>>>> something
> >>>>>> else which can be stopped now, and again, again...
> >>>>>>
> >>>>>> I will check what is returned first, it could be 0 or it could be
> >>>>>> positive
> >>>>>> errno (nit?) because systemd cares "if(r < 0)".  
> >>>>>
> >>>>> I do noticed that there are lots of log about md123 stopped:
> >>>>>
> >>>>> [ 1371.834034] md122:systemd-shutdow bd_prepare_to_claim return -16
> >>>>> [ 1371.840294] md122:systemd-shutdow blkdev_get_by_dev return -16
> >>>>> [ 1371.846845] md: md123 stopped.
> >>>>> [ 1371.850155] md122:systemd-shutdow bd_prepare_to_claim return -16
> >>>>> [ 1371.856411] md122:systemd-shutdow blkdev_get_by_dev return -16
> >>>>> [ 1371.862941] md: md123 stopped.
> >>>>>
> >>>>> And md_ioctl->do_md_stop doesn't have error path after printing this
> >>>>> log, hence 0 will be returned to user.
> >>>>>
> >>>>> The normal case is that:
> >>>>>
> >>>>> open md123
> >>>>> ioctl STOP_ARRAY -> all rdev should be removed from array
> >>>>> close md123 -> mddev will finally be freed by:
> >>>>>      md_release
> >>>>>       mddev_put
> >>>>>        set_bit(MD_DELETED, &mddev->flags) -> user shound not see this
> >>>>> mddev
> >>>>>        queue_work(md_misc_wq, &mddev->del_work)
> >>>>>
> >>>>>      mddev_delayed_delete
> >>>>>       kobject_put(&mddev->kobj)
> >>>>>
> >>>>>      md_kobj_release
> >>>>>       del_gendisk
> >>>>>        md_free_disk
> >>>>>         mddev_free
> >>>>>     
> >>>> Ok thanks, I understand that md_release is called on descriptor
> >>>> closing, right?
> >>>>     
> >>>
> >>> Yes, normally close md123 should drop that last reference.  
> >>>>     
> >>>>> Now that you can reporduce this problem 99%, can you dig deeper and find
> >>>>> out what is wrong?  
> >>>>
> >>>> Yes, working on it!
> >>>>
> >>>> My first idea was that mddev_get and mddev_put are missing on
> >>>> md_ioctl() path
> >>>> but it doesn't help for the issue. My motivation here was that
> >>>> md_attr_store and
> >>>> md_attr_show are using them.
> >>>>
> >>>> Systemd regenerates list of MD arrays on every loop and it is always
> >>>> there, systemd is able to open file descriptor (maybe inactive?).  
> >>>
> >>> md123 should not be opended again, ioctl(STOP_ARRAY) already set the
> >>> flag 'MD_CLOSING' to prevent that. Are you sure that systemd-shutdown do
> >>> open and close the array in each loop?  
> >>
> >> I realized that I'm wrong here. 'MD_CLOSING' is cleared before ioctl
> >> return by commit 065e519e71b2 ("md: MD_CLOSING needs to be cleared after
> >> called md_set_readonly or do_md_stop").
> >>
> >> I'm confused here, commit message said 'MD_CLOSING' shold not be set for
> >> the case STOP_ARRAY_RO, but I don't understand why it's cleared for
> >> STOP_ARRAY as well.
> >>  
> 
> Can you try the following patch?
> 
> diff --git a/drivers/md/md.c b/drivers/md/md.c
> index 3afd57622a0b..31b9cec7e4c0 100644
> --- a/drivers/md/md.c
> +++ b/drivers/md/md.c
> @@ -7668,7 +7668,8 @@ static int md_ioctl(struct block_device *bdev, 
> fmode_t mode,
>                          err = -EBUSY;
>                          goto out;
>                  }
> -               did_set_md_closing = true;
> +               if (cmd == STOP_ARRAY_RO)
> +                       did_set_md_closing = true;
>                  mutex_unlock(&mddev->open_mutex);
>                  sync_blockdev(bdev);
>          }
> 
> I think prevent array to be opened again after STOP_ARRAY might fix
> this.

It didn't help. I tried much more:
diff --git a/drivers/md/md.c b/drivers/md/md.c
index 0fe7ab6e8ab9..807387f37755 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -608,7 +608,7 @@ static inline struct mddev *mddev_get(struct mddev *mddev)
 {
        lockdep_assert_held(&all_mddevs_lock);

-       if (test_bit(MD_DELETED, &mddev->flags))
+       if (test_bit(MD_DELETED, &mddev->flags) || test_bit(MD_CLOSING, &mddev->flags))
                return NULL;
        atomic_inc(&mddev->active);
        return mddev;


The issue is still reproducible. I was shocked, so I removed the clearing of MD_CLOSING:
 out:
-       if(did_set_md_closing)
-               clear_bit(MD_CLOSING, &mddev->flags);
+       //if(did_set_md_closing)
+       //      clear_bit(MD_CLOSING, &mddev->flags);

Still reproducible. Then I found this:
@@ -6177,7 +6179,7 @@ static void md_clean(struct mddev *mddev)
        mddev->persistent = 0;
        mddev->level = LEVEL_NONE;
        mddev->clevel[0] = 0;
-       mddev->flags = 0;
+       //mddev->flags = 0;
        mddev->sb_flags = 0;
        mddev->ro = MD_RDWR;
        mddev->metadata_type[0] = 0;

The issue is not reproducible. I can see in the log that the attempt to stop
the device is not repeated.

Well, with the change of using a time-sensitive lock in mddev_put() we need to
ensure that we can still remove the mddev, so setting and preserving
MD_CLOSING is reasonable to me. I will combine this into patches tomorrow.
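
Concretely, I am thinking of something along these lines (a sketch of
the direction only, not the final patch):

static void md_clean(struct mddev *mddev)
{
	/* ... all the other field resets stay as they are ... */

	/* Keep MD_CLOSING: resetting mddev->flags to 0 here is what let
	 * the array be opened again after STOP_ARRAY had already
	 * succeeded. */
	mddev->flags &= BIT_ULL_MASK(MD_CLOSING);
	mddev->sb_flags = 0;

	/* ... */
}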

Yu, I really appreciate your help! Glad to have you here.

Thanks,
Mariusz

^ permalink raw reply related	[flat|nested] 30+ messages in thread

* Re: Infiniate systemd loop when power off the machine with multiple MD RAIDs
  2023-09-07 15:09                                           ` Mariusz Tkaczyk
@ 2023-09-08 20:25                                             ` Song Liu
  0 siblings, 0 replies; 30+ messages in thread
From: Song Liu @ 2023-09-08 20:25 UTC (permalink / raw)
  To: Mariusz Tkaczyk
  Cc: Yu Kuai, AceLan Kao, Guoqing Jiang, Bagas Sanjaya,
	Christoph Hellwig, Linux Kernel Mailing List, Linux Regressions,
	Linux RAID, yangerkun@huawei.com, yukuai (C)

On Thu, Sep 7, 2023 at 8:24 AM Mariusz Tkaczyk
<mariusz.tkaczyk@linux.intel.com> wrote:
>
> On Thu, 7 Sep 2023 20:53:30 +0800
> Yu Kuai <yukuai1@huaweicloud.com> wrote:
>
> > Hi,
> >
> > On 2023/09/07 20:41, Mariusz Tkaczyk wrote:
> > > On Thu, 7 Sep 2023 20:14:03 +0800
> > > Yu Kuai <yukuai1@huaweicloud.com> wrote:
> > >
> > >> Hi,
> > >>
> > >> On 2023/09/07 19:26, Yu Kuai wrote:
> > >>> Hi,
> > >>>
> > >>> On 2023/09/07 18:18, Mariusz Tkaczyk wrote:
> > >>>> On Thu, 7 Sep 2023 10:04:11 +0800
> > >>>> Yu Kuai <yukuai1@huaweicloud.com> wrote:
> > >>>>
> > >>>>> Hi,
> > >>>>>
> > >>>>> On 2023/09/06 18:27, Mariusz Tkaczyk wrote:
> > >>>>>> On Wed, 6 Sep 2023 14:26:30 +0800
> > >>>>>> AceLan Kao <acelan@gmail.com> wrote:
> > >>>>>>>    From previous testing, I don't think it's an issue in systemd, so I
> > >>>>>>> did a simple test and found the issue is gone.
> > >>>>>>> You only need to add a small delay in md_release(), then the issue
> > >>>>>>> can't be reproduced.
> > >>>>>>>
> > >>>>>>> diff --git a/drivers/md/md.c b/drivers/md/md.c
> > >>>>>>> index 78be7811a89f..ef47e34c1af5 100644
> > >>>>>>> --- a/drivers/md/md.c
> > >>>>>>> +++ b/drivers/md/md.c
> > >>>>>>> @@ -7805,6 +7805,7 @@ static void md_release(struct gendisk *disk)
> > >>>>>>> {
> > >>>>>>>           struct mddev *mddev = disk->private_data;
> > >>>>>>>
> > >>>>>>> +       msleep(10);
> > >>>>>>>           BUG_ON(!mddev);
> > >>>>>>>           atomic_dec(&mddev->openers);
> > >>>>>>>           mddev_put(mddev);
> > >>>>>>
> > >>>>>> I have repro and I tested it on my setup. It is not working for me.
> > >>>>>> My setup could be more "advanced" to maximalize chance of reproduction:
> > >>>>>>
> > >>>>>> # cat /proc/mdstat
> > >>>>>> Personalities : [raid1] [raid6] [raid5] [raid4] [raid10] [raid0]
> > >>>>>> md121 : active raid0 nvme2n1[1] nvme5n1[0]
> > >>>>>>          7126394880 blocks super external:/md127/0 128k chunks
> > >>>>>>
> > >>>>>> md122 : active raid10 nvme6n1[3] nvme4n1[2] nvme1n1[1] nvme7n1[0]
> > >>>>>>          104857600 blocks super external:/md126/0 64K chunks 2
> > >>>>>> near-copies
> > >>>>>> [4/4] [UUUU]
> > >>>>>>
> > >>>>>> md123 : active raid5 nvme6n1[3] nvme4n1[2] nvme1n1[1] nvme7n1[0]
> > >>>>>>          2655765504 blocks super external:/md126/1 level 5, 32k chunk,
> > >>>>>> algorithm 0 [4/4] [UUUU]
> > >>>>>>
> > >>>>>> md124 : active raid1 nvme0n1[1] nvme3n1[0]
> > >>>>>>          99614720 blocks super external:/md125/0 [2/2] [UU]
> > >>>>>>
> > >>>>>> md125 : inactive nvme3n1[1](S) nvme0n1[0](S)
> > >>>>>>          10402 blocks super external:imsm
> > >>>>>>
> > >>>>>> md126 : inactive nvme7n1[3](S) nvme1n1[2](S) nvme6n1[1](S)
> > >>>>>> nvme4n1[0](S)
> > >>>>>>          20043 blocks super external:imsm
> > >>>>>>
> > >>>>>> md127 : inactive nvme2n1[1](S) nvme5n1[0](S)
> > >>>>>>          10402 blocks super external:imsm
> > >>>>>>
> > >>>>>> I have almost 99% repro ratio, slowly moving forward..
> > >>>>>>
> > >>>>>> It is endless loop because systemd-shutdown sends ioctl "stop_array"
> > >>>>>> which
> > >>>>>> is successful but array is not stopped. For that reason it sets
> > >>>>>> "changed =
> > >>>>>> true".
> > >>>>>
> > >>>>> How does systemd-shutdown judge if array is stopped? cat /proc/mdstat or
> > >>>>> ls /dev/md* or other way?
> > >>>>
> > >>>> Hi Yu,
> > >>>>
> > >>>> It trusts return result, I confirmed that 0 is returned.
> > >>>> The most weird is we are returning 0 but array is still there, and it is
> > >>>> stopped again in next systemd loop. I don't understand why yet..
> > >>>>
> > >>>>>> Systemd-shutdown see the change and retries to check if there is
> > >>>>>> something
> > >>>>>> else which can be stopped now, and again, again...
> > >>>>>>
> > >>>>>> I will check what is returned first, it could be 0 or it could be
> > >>>>>> positive
> > >>>>>> errno (nit?) because systemd cares "if(r < 0)".
> > >>>>>
> > >>>>> I do noticed that there are lots of log about md123 stopped:
> > >>>>>
> > >>>>> [ 1371.834034] md122:systemd-shutdow bd_prepare_to_claim return -16
> > >>>>> [ 1371.840294] md122:systemd-shutdow blkdev_get_by_dev return -16
> > >>>>> [ 1371.846845] md: md123 stopped.
> > >>>>> [ 1371.850155] md122:systemd-shutdow bd_prepare_to_claim return -16
> > >>>>> [ 1371.856411] md122:systemd-shutdow blkdev_get_by_dev return -16
> > >>>>> [ 1371.862941] md: md123 stopped.
> > >>>>>
> > >>>>> And md_ioctl->do_md_stop doesn't have error path after printing this
> > >>>>> log, hence 0 will be returned to user.
> > >>>>>
> > >>>>> The normal case is that:
> > >>>>>
> > >>>>> open md123
> > >>>>> ioctl STOP_ARRAY -> all rdev should be removed from array
> > >>>>> close md123 -> mddev will finally be freed by:
> > >>>>>      md_release
> > >>>>>       mddev_put
> > >>>>>        set_bit(MD_DELETED, &mddev->flags) -> user shound not see this
> > >>>>> mddev
> > >>>>>        queue_work(md_misc_wq, &mddev->del_work)
> > >>>>>
> > >>>>>      mddev_delayed_delete
> > >>>>>       kobject_put(&mddev->kobj)
> > >>>>>
> > >>>>>      md_kobj_release
> > >>>>>       del_gendisk
> > >>>>>        md_free_disk
> > >>>>>         mddev_free
> > >>>>>
> > >>>> Ok thanks, I understand that md_release is called on descriptor
> > >>>> closing, right?
> > >>>>
> > >>>
> > >>> Yes, normally close md123 should drop that last reference.
> > >>>>
> > >>>>> Now that you can reporduce this problem 99%, can you dig deeper and find
> > >>>>> out what is wrong?
> > >>>>
> > >>>> Yes, working on it!
> > >>>>
> > >>>> My first idea was that mddev_get and mddev_put are missing on
> > >>>> md_ioctl() path
> > >>>> but it doesn't help for the issue. My motivation here was that
> > >>>> md_attr_store and
> > >>>> md_attr_show are using them.
> > >>>>
> > >>>> Systemd regenerates list of MD arrays on every loop and it is always
> > >>>> there, systemd is able to open file descriptor (maybe inactive?).
> > >>>
> > >>> md123 should not be opended again, ioctl(STOP_ARRAY) already set the
> > >>> flag 'MD_CLOSING' to prevent that. Are you sure that systemd-shutdown do
> > >>> open and close the array in each loop?
> > >>
> > >> I realized that I'm wrong here. 'MD_CLOSING' is cleared before ioctl
> > >> return by commit 065e519e71b2 ("md: MD_CLOSING needs to be cleared after
> > >> called md_set_readonly or do_md_stop").
> > >>
> > >> I'm confused here, commit message said 'MD_CLOSING' shold not be set for
> > >> the case STOP_ARRAY_RO, but I don't understand why it's cleared for
> > >> STOP_ARRAY as well.
> > >>
> >
> > Can you try the following patch?
> >
> > diff --git a/drivers/md/md.c b/drivers/md/md.c
> > index 3afd57622a0b..31b9cec7e4c0 100644
> > --- a/drivers/md/md.c
> > +++ b/drivers/md/md.c
> > @@ -7668,7 +7668,8 @@ static int md_ioctl(struct block_device *bdev, fmode_t mode,
> >                          err = -EBUSY;
> >                          goto out;
> >                  }
> > -               did_set_md_closing = true;
> > +               if (cmd == STOP_ARRAY_RO)
> > +                       did_set_md_closing = true;
> >                  mutex_unlock(&mddev->open_mutex);
> >                  sync_blockdev(bdev);
> >          }
> >
> > I think preventing the array from being opened again after STOP_ARRAY
> > might fix this.
>
> It didn't help. I tried much more:
> diff --git a/drivers/md/md.c b/drivers/md/md.c
> index 0fe7ab6e8ab9..807387f37755 100644
> --- a/drivers/md/md.c
> +++ b/drivers/md/md.c
> @@ -608,7 +608,7 @@ static inline struct mddev *mddev_get(struct mddev *mddev)
>  {
>         lockdep_assert_held(&all_mddevs_lock);
>
> -       if (test_bit(MD_DELETED, &mddev->flags))
> +       if (test_bit(MD_DELETED, &mddev->flags) || test_bit(MD_CLOSING, &mddev->flags))
>                 return NULL;
>         atomic_inc(&mddev->active);
>         return mddev;
>
>
> The issue is still reproducible. I was shocked, so I removed the
> clearing of MD_CLOSING:
>  out:
> -       if(did_set_md_closing)
> -               clear_bit(MD_CLOSING, &mddev->flags);
> +       //if(did_set_md_closing)
> +       //      clear_bit(MD_CLOSING, &mddev->flags);
>
> Still reproducible. Then I found this:
> @@ -6177,7 +6179,7 @@ static void md_clean(struct mddev *mddev)
>         mddev->persistent = 0;
>         mddev->level = LEVEL_NONE;
>         mddev->clevel[0] = 0;
> -       mddev->flags = 0;
> +       //mddev->flags = 0;
>         mddev->sb_flags = 0;
>         mddev->ro = MD_RDWR;
>         mddev->metadata_type[0] = 0;
>
> The issue is no longer reproducible. I can see in the log that the
> attempt to stop the device is not repeated.
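>
> A more targeted version of that change would keep only MD_CLOSING
> instead of preserving all flags, e.g. (sketch, untested):
>
> -       mddev->flags = 0;
> +       /* keep MD_CLOSING so the array cannot be re-opened (or
> +        * returned by mddev_get()) while it is being torn down */
> +       mddev->flags &= BIT_ULL_MASK(MD_CLOSING);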
>
> Well, with the change to use a time-sensitive lock in mddev_put() we
> need to ensure that we can still remove the mddev, so setting and
> preserving MD_CLOSING is reasonable to me. I will combine this into
> patches tomorrow.
>
> Yu, I really appreciate your help! Glad to have you here.

Thanks to everyone for looking into this tricky issue and running various tests!

Mariusz, please let me know if you need any help with the patch. Hopefully we
can fix this soon.

Thanks again,
Song

^ permalink raw reply	[flat|nested] 30+ messages in thread

end of thread, other threads:[~2023-09-08 20:25 UTC | newest]

Thread overview: 30+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-08-16  9:37 Fwd: Infiniate systemd loop when power off the machine with multiple MD RAIDs Bagas Sanjaya
2023-08-18  8:16 ` Mariusz Tkaczyk
2023-08-18  9:21   ` Hannes Reinecke
2023-08-21  3:23     ` AceLan Kao
2023-08-22  3:51   ` Guoqing Jiang
2023-08-22  6:17     ` Song Liu
2023-08-22  6:39       ` Mariusz Tkaczyk
2023-08-22  8:13         ` AceLan Kao
2023-08-22 12:41           ` Guoqing Jiang
2023-08-23  8:02             ` AceLan Kao
2023-08-23 13:25               ` Song Liu
     [not found]                 ` <CAMz9Wg9y52iuxJRSQFC2N5Katt72v-o=JvEjegJt-MwORmw9tQ@mail.gmail.com>
2023-08-28  5:20                   ` Song Liu
2023-08-28 10:48                     ` AceLan Kao
2023-08-28 13:50                     ` Yu Kuai
2023-08-31  2:28                       ` Yu Kuai
2023-08-31  6:50                         ` Mariusz Tkaczyk
2023-09-06  6:26                           ` AceLan Kao
2023-09-06 10:27                             ` Mariusz Tkaczyk
2023-09-07  2:04                               ` Yu Kuai
2023-09-07 10:18                                 ` Mariusz Tkaczyk
     [not found]                                   ` <cffca94f-5729-622d-9327-632b3ff2891a@huaweicloud.com>
     [not found]                                     ` <3e7edf0c-cadd-59b0-4e10-dffdb86b93b7@huaweicloud.com>
2023-09-07 12:41                                       ` Mariusz Tkaczyk
2023-09-07 12:53                                         ` Yu Kuai
2023-09-07 15:09                                           ` Mariusz Tkaczyk
2023-09-08 20:25                                             ` Song Liu
2023-08-21 13:18 ` Fwd: " Yu Kuai
2023-08-22  1:39   ` AceLan Kao
2023-08-22 18:56 ` Song Liu
2023-08-22 19:13   ` Carlos Carvalho
2023-08-23  1:28     ` Yu Kuai
2023-08-23  6:04       ` Hannes Reinecke
