From: Song Liu
Date: Fri, 8 Sep 2023 13:25:06 -0700
Subject: Re: Infinite systemd loop when powering off the machine with multiple MD RAIDs
To: Mariusz Tkaczyk
Cc: Yu Kuai, AceLan Kao, Guoqing Jiang, Bagas Sanjaya, Christoph Hellwig,
    Linux Kernel Mailing List, Linux Regressions, Linux RAID,
    "yangerkun@huawei.com", "yukuai (C)"
In-Reply-To: <20230907170940.00002fe4@linux.intel.com>
X-Mailing-List: linux-raid@vger.kernel.org

On Thu, Sep 7, 2023 at 8:24 AM Mariusz Tkaczyk wrote:
>
> On Thu, 7 Sep 2023 20:53:30 +0800
> Yu Kuai wrote:
>
> > Hi,
> >
> > On 2023/09/07 20:41, Mariusz Tkaczyk wrote:
> > > On Thu, 7 Sep 2023 20:14:03 +0800
> > > Yu Kuai wrote:
> > >
> > >> Hi,
> > >>
> > >> On 2023/09/07 19:26, Yu Kuai wrote:
> > >>> Hi,
> > >>>
> > >>> On 2023/09/07 18:18, Mariusz Tkaczyk wrote:
> > >>>> On Thu, 7 Sep 2023 10:04:11 +0800
> > >>>> Yu Kuai wrote:
> > >>>>
> > >>>>> Hi,
> > >>>>>
> > >>>>> On 2023/09/06 18:27, Mariusz Tkaczyk wrote:
> > >>>>>> On Wed, 6 Sep 2023 14:26:30 +0800
> > >>>>>> AceLan Kao wrote:
> > >>>>>>> From previous testing, I don't think it's an issue in systemd, so I
> > >>>>>>> did a simple test and found the issue is gone.
> > >>>>>>> You only need to add a small delay in md_release(), then the issue
> > >>>>>>> can't be reproduced.
> > >>>>>>>
> > >>>>>>> diff --git a/drivers/md/md.c b/drivers/md/md.c
> > >>>>>>> index 78be7811a89f..ef47e34c1af5 100644
> > >>>>>>> --- a/drivers/md/md.c
> > >>>>>>> +++ b/drivers/md/md.c
> > >>>>>>> @@ -7805,6 +7805,7 @@ static void md_release(struct gendisk *disk)
> > >>>>>>>  {
> > >>>>>>>         struct mddev *mddev = disk->private_data;
> > >>>>>>>
> > >>>>>>> +       msleep(10);
> > >>>>>>>         BUG_ON(!mddev);
> > >>>>>>>         atomic_dec(&mddev->openers);
> > >>>>>>>         mddev_put(mddev);
> > >>>>>>
> > >>>>>> I have a repro and I tested it on my setup. It is not working for me.
> > >>>>>> My setup could be more "advanced" to maximize the chance of reproduction:
> > >>>>>>
> > >>>>>> # cat /proc/mdstat
> > >>>>>> Personalities : [raid1] [raid6] [raid5] [raid4] [raid10] [raid0]
> > >>>>>> md121 : active raid0 nvme2n1[1] nvme5n1[0]
> > >>>>>>       7126394880 blocks super external:/md127/0 128k chunks
> > >>>>>>
> > >>>>>> md122 : active raid10 nvme6n1[3] nvme4n1[2] nvme1n1[1] nvme7n1[0]
> > >>>>>>       104857600 blocks super external:/md126/0 64K chunks 2 near-copies
> > >>>>>>       [4/4] [UUUU]
> > >>>>>>
> > >>>>>> md123 : active raid5 nvme6n1[3] nvme4n1[2] nvme1n1[1] nvme7n1[0]
> > >>>>>>       2655765504 blocks super external:/md126/1 level 5, 32k chunk,
> > >>>>>>       algorithm 0 [4/4] [UUUU]
> > >>>>>>
> > >>>>>> md124 : active raid1 nvme0n1[1] nvme3n1[0]
> > >>>>>>       99614720 blocks super external:/md125/0 [2/2] [UU]
> > >>>>>>
> > >>>>>> md125 : inactive nvme3n1[1](S) nvme0n1[0](S)
> > >>>>>>       10402 blocks super external:imsm
> > >>>>>>
> > >>>>>> md126 : inactive nvme7n1[3](S) nvme1n1[2](S) nvme6n1[1](S) nvme4n1[0](S)
> > >>>>>>       20043 blocks super external:imsm
> > >>>>>>
> > >>>>>> md127 : inactive nvme2n1[1](S) nvme5n1[0](S)
> > >>>>>>       10402 blocks super external:imsm
> > >>>>>>
> > >>>>>> I have an almost 99% repro ratio, slowly moving forward...
> > >>>>>>
> > >>>>>> It is an endless loop because systemd-shutdown sends the "stop_array"
> > >>>>>> ioctl, which is successful, but the array is not stopped. For that
> > >>>>>> reason it sets "changed = true".
> > >>>>>
> > >>>>> How does systemd-shutdown judge if the array is stopped? cat /proc/mdstat,
> > >>>>> ls /dev/md*, or some other way?
> > >>>>
> > >>>> Hi Yu,
> > >>>>
> > >>>> It trusts the return result; I confirmed that 0 is returned.
> > >>>> The weirdest part is that we return 0 but the array is still there, and it
> > >>>> is stopped again in the next systemd loop. I don't understand why yet...
> > >>>>
> > >>>>>> Systemd-shutdown sees the change and retries to check if there is
> > >>>>>> something else which can be stopped now, and again, and again...
> > >>>>>>
> > >>>>>> I will check what is returned first; it could be 0 or it could be a
> > >>>>>> positive errno (nit?) because systemd only cares about "if (r < 0)".
> > >>>>> I did notice that there are lots of logs about md123 being stopped:
> > >>>>>
> > >>>>> [ 1371.834034] md122:systemd-shutdow bd_prepare_to_claim return -16
> > >>>>> [ 1371.840294] md122:systemd-shutdow blkdev_get_by_dev return -16
> > >>>>> [ 1371.846845] md: md123 stopped.
> > >>>>> [ 1371.850155] md122:systemd-shutdow bd_prepare_to_claim return -16
> > >>>>> [ 1371.856411] md122:systemd-shutdow blkdev_get_by_dev return -16
> > >>>>> [ 1371.862941] md: md123 stopped.
> > >>>>>
> > >>>>> And md_ioctl->do_md_stop doesn't have an error path after printing this
> > >>>>> log, hence 0 will be returned to the user.
> > >>>>>
> > >>>>> The normal case is that:
> > >>>>>
> > >>>>> open md123
> > >>>>> ioctl STOP_ARRAY -> all rdevs should be removed from the array
> > >>>>> close md123 -> mddev will finally be freed by:
> > >>>>>  md_release
> > >>>>>   mddev_put
> > >>>>>    set_bit(MD_DELETED, &mddev->flags) -> user should not see this mddev
> > >>>>>    queue_work(md_misc_wq, &mddev->del_work)
> > >>>>>
> > >>>>> mddev_delayed_delete
> > >>>>>  kobject_put(&mddev->kobj)
> > >>>>>
> > >>>>> md_kobj_release
> > >>>>>  del_gendisk
> > >>>>>   md_free_disk
> > >>>>>    mddev_free
> > >>>>>
> > >>>> Ok thanks, I understand that md_release is called on descriptor
> > >>>> closing, right?
> > >>>>
> > >>>
> > >>> Yes, normally closing md123 should drop that last reference.
> > >>>>
> > >>>>> Now that you can reproduce this problem 99% of the time, can you dig
> > >>>>> deeper and find out what is wrong?
> > >>>>
> > >>>> Yes, working on it!
> > >>>>
> > >>>> My first idea was that mddev_get and mddev_put are missing on the
> > >>>> md_ioctl() path, but it doesn't help with the issue. My motivation here
> > >>>> was that md_attr_store and md_attr_show are using them.
> > >>>>
> > >>>> Systemd regenerates the list of MD arrays on every loop and it is always
> > >>>> there; systemd is able to open a file descriptor (maybe inactive?).
> > >>>
> > >>> md123 should not be opened again, ioctl(STOP_ARRAY) already set the
> > >>> flag 'MD_CLOSING' to prevent that. Are you sure that systemd-shutdown does
> > >>> open and close the array in each loop?
> > >>
> > >> I realized that I'm wrong here. 'MD_CLOSING' is cleared before the ioctl
> > >> returns, by commit 065e519e71b2 ("md: MD_CLOSING needs to be cleared after
> > >> called md_set_readonly or do_md_stop").
> > >>
> > >> I'm confused here: the commit message said 'MD_CLOSING' should not be set
> > >> for the STOP_ARRAY_RO case, but I don't understand why it's cleared for
> > >> STOP_ARRAY as well.
> > >>
> > >
> >
> > Can you try the following patch?
> >
> > diff --git a/drivers/md/md.c b/drivers/md/md.c
> > index 3afd57622a0b..31b9cec7e4c0 100644
> > --- a/drivers/md/md.c
> > +++ b/drivers/md/md.c
> > @@ -7668,7 +7668,8 @@ static int md_ioctl(struct block_device *bdev, fmode_t mode,
> >                         err = -EBUSY;
> >                         goto out;
> >                 }
> > -               did_set_md_closing = true;
> > +               if (cmd == STOP_ARRAY_RO)
> > +                       did_set_md_closing = true;
> >                 mutex_unlock(&mddev->open_mutex);
> >                 sync_blockdev(bdev);
> >         }
> >
> > I think preventing the array from being opened again after STOP_ARRAY might
> > fix this.
>
> It didn't help.
> I tried much more:
>
> diff --git a/drivers/md/md.c b/drivers/md/md.c
> index 0fe7ab6e8ab9..807387f37755 100644
> --- a/drivers/md/md.c
> +++ b/drivers/md/md.c
> @@ -608,7 +608,7 @@ static inline struct mddev *mddev_get(struct mddev *mddev)
>  {
>         lockdep_assert_held(&all_mddevs_lock);
>
> -       if (test_bit(MD_DELETED, &mddev->flags))
> +       if (test_bit(MD_DELETED, &mddev->flags) || test_bit(MD_CLOSING, &mddev->flags))
>                 return NULL;
>         atomic_inc(&mddev->active);
>         return mddev;
>
> The issue is still reproducible. I was shocked, so I removed the clearing of
> MD_CLOSING:
>
>  out:
> -       if(did_set_md_closing)
> -               clear_bit(MD_CLOSING, &mddev->flags);
> +       //if(did_set_md_closing)
> +       //      clear_bit(MD_CLOSING, &mddev->flags);
>
> Still reproducible. Then I found this:
>
> @@ -6177,7 +6179,7 @@ static void md_clean(struct mddev *mddev)
>         mddev->persistent = 0;
>         mddev->level = LEVEL_NONE;
>         mddev->clevel[0] = 0;
> -       mddev->flags = 0;
> +       //mddev->flags = 0;
>         mddev->sb_flags = 0;
>         mddev->ro = MD_RDWR;
>         mddev->metadata_type[0] = 0;
>
> With this, the issue is not reproducible. I can see in the log that the
> attempt to stop the device is not repeated.
>
> Well, with the change of using a time-sensitive lock in mddev_put, we need to
> ensure that we can still remove the md device, so setting and preserving
> MD_CLOSING is reasonable to me. I will combine this into patches tomorrow.
>
> Yu, I really appreciate your help! Glad to have you here.

Thanks to everyone for looking into this tricky issue and running various tests!

Mariusz, please let me know if you need any help with the patch. Hopefully we
can fix this soon.

Thanks again,
Song
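
To make the failure mode concrete, here is a minimal, hypothetical userspace
sketch of the stop-and-recheck pattern described in the thread. It is not
systemd-shutdown's actual code; the device path, the O_EXCL open, and the
stat() re-check are illustrative assumptions. Only the STOP_ARRAY ioctl itself
is the real interface being discussed: if it returns 0 while the array node
survives, a loop like this keeps setting "changed = true" and retries forever.

/*
 * Hypothetical sketch, not systemd's implementation: issue STOP_ARRAY on an
 * md device, then re-check whether the device is still present.
 */
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/stat.h>
#include <unistd.h>
#include <linux/major.h>        /* MD_MAJOR, needed by md_u.h */
#include <linux/raid/md_u.h>    /* STOP_ARRAY */

static int try_stop(const char *node)
{
        /* O_EXCL on a block device fails if something still claims it. */
        int fd = open(node, O_RDONLY | O_EXCL);
        if (fd < 0)
                return -errno;

        int r = ioctl(fd, STOP_ARRAY, NULL);
        int ret = r < 0 ? -errno : 0;

        close(fd);
        return ret;
}

int main(void)
{
        const char *node = "/dev/md123";        /* example array from the thread */
        struct stat st;
        int r = try_stop(node);

        printf("STOP_ARRAY on %s: %s\n", node, r < 0 ? strerror(-r) : "ok");

        /*
         * The reported loop: the ioctl returns 0, yet the array is still
         * there on the next scan, so the caller marks it as changed and
         * tries again on the next pass.
         */
        if (r == 0 && stat(node, &st) == 0)
                printf("%s still present after a successful STOP_ARRAY\n", node);

        return 0;
}

Under that reading, keeping MD_CLOSING set after STOP_ARRAY, so the array
cannot be re-opened and the final close can actually free the mddev, is what
breaks the cycle in Mariusz's experiments above.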