From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 1EFB0C6FD1D for ; Wed, 15 Mar 2023 10:11:46 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S232272AbjCOKLo (ORCPT ); Wed, 15 Mar 2023 06:11:44 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:40618 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S232269AbjCOKLK (ORCPT ); Wed, 15 Mar 2023 06:11:10 -0400 Received: from mga03.intel.com (mga03.intel.com [134.134.136.65]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 2CD2E62313 for ; Wed, 15 Mar 2023 03:10:40 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1678875040; x=1710411040; h=date:from:to:cc:subject:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=DkCgGbpULSOsDRjnAmAkZRpMoLHNpyciU1NBsJGDXqQ=; b=DCpHpLAQtmQCayCS8miOJgYERrX3PcihJMC+PBkWJF8fiepR6Cg9+kwA E5fvyRuQvkyiFhEEHj9PQEjEEx75bhGb3BKraOzOwaS65bo+tL2LN2dp0 WoXRNfLqqzAfP7mW9rX1UM5JHf1FUG9Fot1lCz3D6jVInWXZPf3+j4Y22 4eKKwkfahS9IflXy7hcXOyHNPy4UxDtssrGXnBPLeOIvOWMowXuNnyOJF jdAtDzbLQ8VLMgl/HmagcYFuApGpscFKnsdMdEq1rskGztVQ9gogOsqS7 DkDltOqap1XYwgwUVXwXEzp412hN9PoT+jJCbXZS7DET4MFK+TY2Vln8u A==; X-IronPort-AV: E=McAfee;i="6500,9779,10649"; a="340025728" X-IronPort-AV: E=Sophos;i="5.98,262,1673942400"; d="scan'208";a="340025728" Received: from fmsmga006.fm.intel.com ([10.253.24.20]) by orsmga103.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 15 Mar 2023 03:10:35 -0700 X-ExtLoop1: 1 X-IronPort-AV: E=McAfee;i="6500,9779,10649"; a="925285861" X-IronPort-AV: E=Sophos;i="5.98,262,1673942400"; d="scan'208";a="925285861" Received: from mtkaczyk-mobl.ger.corp.intel.com (HELO localhost) ([10.252.62.26]) by fmsmga006-auth.fm.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 15 Mar 2023 03:10:31 -0700 Date: Wed, 15 Mar 2023 11:10:27 +0100 From: Mariusz Tkaczyk To: Martin Wilck , Song Liu Cc: Li Xiao Keng , Jes Sorensen , Paul Menzel , Coly Li , linux-raid@vger.kernel.org, linfeilong , louhongxiang@huawei.com, "liuzhiqiang (I)" , miaoguanqin Subject: Re: [QUESTION] How to fix the race of "mdadm --add" and "mdadm mdadm --incremental --export" Message-ID: <20230315111027.0000372d@linux.intel.com> In-Reply-To: <04a4cc6aac10cd24d5bc0b3485d47f6ccb752eab.camel@suse.com> References: <252cdcda-afcd-ce76-00cf-c138136e70ab@huawei.com> <20230314165938.00003030@linux.intel.com> <04a4cc6aac10cd24d5bc0b3485d47f6ccb752eab.camel@suse.com> X-Mailer: Claws Mail 4.1.0 (GTK 3.24.33; x86_64-w64-mingw32) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: quoted-printable Precedence: bulk List-ID: X-Mailing-List: linux-raid@vger.kernel.org On Tue, 14 Mar 2023 17:11:06 +0100 Martin Wilck wrote: > On Tue, 2023-03-14 at 16:59 +0100, Mariusz Tkaczyk wrote: > > On Tue, 14 Mar 2023 16:04:23 +0100 > > Martin Wilck wrote: > > =20 > > > On Tue, 2023-03-14 at 22:58 +0800, Li Xiao Keng wrote: =20 > > > > Hi, > > > > =C2=A0=C2=A0 Here we meet a question. When we add a new disk to a r= aid, it > > > > may > > > > return > > > > -EBUSY. > > > > =C2=A0=C2=A0 The main process of --add=EF=BC=88for example md0, sdf= ): > > > > =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 1.dev_open(sdf) > > > > =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 2.add_to_super > > > > =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 3.write_init_super > > > > =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 4.fsync(fd) > > > > =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 5.close(fd) > > > > =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 6.ioctl(ADD_NEW_DISK). > > > > =C2=A0=C2=A0 However, there will be some udev(change of sdf) event = after > > > > step5. > > > > Then > > > > "/usr/sbin/mdadm --incremental --export $devnode --offroot > > > > $env{DEVLINKS}" > > > > will be run, and the sdf will be added to md0. After that, step6 > > > > will > > > > return > > > > -EBUSY. > > > > =C2=A0=C2=A0 It is a problem to user. First time adding disk does n= ot > > > > return > > > > success > > > > but disk is actually added. And I have no good idea to deal with > > > > it. > > > > Please > > > > give some great advice.=C2=A0 =20 > > >=20 > > > I haven't looked at the code in detail, but off the top of my head, > > > it > > > should help to execute step 5 after step 6. The close() in step 5 > > > triggers the uevent via inotify; doing it after the ioctl should > > > avoid > > > the above problem. =20 > > Hi, > > That will result in EBUSY in everytime. mdadm will always handle > > descriptor and kernel will refuse to add the drive. =20 >=20 > Why would it cause EBUSY? Please elaborate. My suggestion would avoid > the race described by Li Xiao Keng. I only suggested to postpone the > close(), nothing else. The fsync() would still be done before the > ioctl, so the metadata should be safely on disk when the ioctl is run. Because device will be claimed in userspace by mdadm. MD may check if device is not claimed. I checked bind_rdev_to_array() and I don't find an obvious answer, so I could be wrong here. Maybe someone more kernel experienced can speak here? Song, could you look? I think that the descriptor opened by incremental block kernel from adding = the device to the array but also the same incremental is able to add it later because metadata is on device. There is no retry in Manage_add() flow so it must be added by Incremental so the question is if it is already in array or disk is just claimed and that is the problem. Eventually, you can test your proposal that should gives us an answer. >=20 > This is a recurring pattern. Tools that manipulate block devices must > be aware that close-after-write triggers an uevent, and should make > sure that they don't close() such files prematurely. Agree. Mdadm has this problem, descriptors are opened and closed may times. >=20 > > >=20 > > > Another obvious workaround in mdadm would be to check the state of > > > the > > > array in the EBUSY case and find out that the disk had already been > > > added. > > >=20 > > > But again, this was just a high-level guess. > > >=20 > > > Martin > > > =20 > >=20 > > Hmm... I'm not a native expert but why we cannot write metadata after > > adding > > drive to array? Why kernel can't handle that? > >=20 Ok, there is an assumption in kernel that device MUST have metadata. > > Ideally, we should lock device and block udev- I know that there is > > flock > > based API to do that but I'm not sure if flock() won't cause the same > > problem. =20 >=20 > That doesn't work reliably. At least not in general. The mechanmism is > disabled for for dm devices (e.g. multipath), for example. See > https://github.com/systemd/systemd/blob/a5c0ad9a9a2964079a19a1db42f79570a= 3582bee/src/udev/udevd.c#L483 >=20 Yeah, I know but udev processes the disk, not MD device so the locking should work. But if we cannot trust it, we shouldn't follow this idea. > =20 > > There is also something like "udev-md-raid-creating.rules". Maybe we > > can reuse > > it? > > =20 >=20 > Unless I am mistaken, these rules are exactly those that cause the > issue we are discussing. Since these rules are also part of the mdadm > package, it might be possible to set some flag under /run that would > indicate to the rules that auto-assembly should be skipped. But that > might be racy, too. Yeah, bad idea. Agree. New day, new ideas: - why we cannot let udev to add the device? just write metadata and finish, udev should handle that because metadata is there. - incremental uses map_lock() to prevent concurrent updates, I seems to b missed for --add. That could be a key to prevent the behavior.Incremental= is checking if it can lock the map file. If not, it finishes gracefully. To = lock array we need to read metadata first, so it goes to the first question- i= s it a problem that incremental has opened descriptor? Thanks, Mariusz