From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 9E029CA0ED2 for ; Mon, 11 Sep 2023 22:00:32 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1355428AbjIKV7d (ORCPT ); Mon, 11 Sep 2023 17:59:33 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:35746 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S242868AbjIKQ10 (ORCPT ); Mon, 11 Sep 2023 12:27:26 -0400 Received: from mgamail.intel.com (mgamail.intel.com [134.134.136.24]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 63589CCD for ; Mon, 11 Sep 2023 09:27:21 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1694449641; x=1725985641; h=date:from:to:cc:subject:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=fLe1eVF0AyQjKFE2tRHJ1P6Lg3GSKkOCJmYP0mm4GKo=; b=ZlV/pDFwkxOEVp7I7+uu70+I5g+t+RlARjLclYzNdaNcN1nrzIn21SAF nbu9+kjkyR/uSf0wlALRI+Z4ohuHSGP5vVAxzPq/F0d9zAtcu5bG1pBfJ Z+SLc4NXLMHJzOyC29SM6sPOHXO6O341hJPY+aie94sHSP+oisO0SP2nD CdGL8CnXzJUOBzPrt+bOyVH/BajSaIksJKlcDojs2/cGxu0Q0YDE3B3HT jAWDRga19YiHw3YCR/PSHXkR+l1VOXkQGWDsptSy/gJRxRp8YeAgRyoXq o5AMuUVPPWBHCg5ru/LYCNU2usJxoECfYyHz88WZ+99DC8nAw9KiYmwxe A==; X-IronPort-AV: E=McAfee;i="6600,9927,10830"; a="380829242" X-IronPort-AV: E=Sophos;i="6.02,244,1688454000"; d="scan'208";a="380829242" Received: from fmsmga006.fm.intel.com ([10.253.24.20]) by orsmga102.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 11 Sep 2023 09:00:39 -0700 X-ExtLoop1: 1 X-IronPort-AV: E=McAfee;i="6600,9927,10830"; a="990139590" X-IronPort-AV: E=Sophos;i="6.02,244,1688454000"; d="scan'208";a="990139590" Received: from mtkaczyk-mobl.ger.corp.intel.com (HELO localhost) ([10.249.141.71]) by fmsmga006-auth.fm.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 11 Sep 2023 09:00:36 -0700 Date: Mon, 11 Sep 2023 18:00:32 +0200 From: Mariusz Tkaczyk To: Li Xiao Keng Cc: Jes Sorensen , Martin Wilck , Coly Li , , , miaoguanqin Subject: Re: [PATCH v3] Fix race of "mdadm --add" and "mdadm --incremental" Message-ID: <20230911180032.00007bbb@linux.intel.com> In-Reply-To: References: X-Mailer: Claws Mail 4.1.0 (GTK 3.24.33; x86_64-w64-mingw32) MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: quoted-printable Precedence: bulk List-ID: X-Mailing-List: linux-raid@vger.kernel.org On Thu, 7 Sep 2023 19:37:44 +0800 Li Xiao Keng wrote: > There is a raid1 with sda and sdb. And we add sdc to this raid, > it may return -EBUSY. >=20 > The main process of --add: > 1. dev_open=EF=BC=88sdc) in Manage_add > 2. store_super1(st, di->fd) in write_init_super1 > 3. fsync(fd) in store_super1 > 4. close(di->fd) in write_init_super1 > 5. ioctl(ADD_NEW_DISK) >=20 > Step 2 and 3 will add sdc to metadata of raid1. There will be > udev(change of sdc) event after step4. Then "/usr/sbin/mdadm > --incremental --export $devnode --offroot $env{DEVLINKS}" > will be run, and the sdc will be added to the raid1. Then > step 5 will return -EBUSY because it checks if device isn't > claimed in md_import_device()->lock_rdev()->blkdev_get_by_dev() > ->blkdev_get(). =20 >=20 > It will be confusing for users because sdc is added first time. > The "incremental" will get map_lock before add sdc to raid1. > So we add map_lock before write_init_super in "mdadm --add" > to fix the race of "add" and "incremental". >=20 > Signed-off-by: Li Xiao Keng > Signed-off-by: Guanqin Miao Hello Li Xiao Keng, Thanks for the patch, I'm trying to get my head around this. As I understand, you added a mapfile locking and you are failing when the l= og is already claimed, but you prefer to return -1 and print error instead of returning EBUSY, right? If yes, it has nothing to do with fixing. It is just a obscure hack to make= it, let say more reliable. We can either get EBUSY and know that is expected. I see no difference, your solution is same confusing as current implementatio= n. Also, you used map_lock(). I remember that initially I proposed - my bad. We also rejected udev locking, as a buggy. In fact- I don't see ready synchronization method to avoid this race in mdadm right now. The best way, I see is: 1. Write metadata. 2. Give udev chance to process the device. 3. Add it manually if udev failed (or D_NO_UDEV). If it does not work for you, we need to add synchronization, I know that th= ere is pidfile used for mdmon, perhaps we can reuse it? We don't need something fancy, just something to keep all synchronised and block incremental to led --add finish the action. We cannot use map_lock() and map_unlock() in this context. It claims lock on mapfile which describes every array in your system (base details, there is = no info about members), see /var/run/mdadm. Add is just a tiny action to add d= isk and we are not going to update the map. Taking this lock will make --add more vulnerable, for example it will fail because another array is assemble= d or created in the same time (I know that it will be hard to achieve but it is another race - we don't want it). And we cannot allow for partially done action, you added this lock after writing metadata (according to description), we are not taking this lock to complete the action, just to hide race. Lock should be taken before writing metadata, but we know that it will not hide an issue. In this for it cannot be merged I don't see a gain, we are returning -1 ins= tead -22, what is the difference? I don't see error message as more meaningful n= ow. Thanks, Mariusz