From: Zhao Heming <heming.zhao@suse.com>
To: linux-raid@vger.kernel.org, song@kernel.org, guoqing.jiang@cloud.ionos.com
Cc: Zhao Heming <heming.zhao@suse.com>, lidong.zhong@suse.com, xni@redhat.com,
 neilb@suse.de, colyli@suse.de
Subject: [PATCH 0/2] md-cluster bugs fix
Date: Sun, 8 Nov 2020 22:52:59 +0800
Message-Id: <1604847181-22086-1-git-send-email-heming.zhao@suse.com>
X-Mailer: git-send-email 1.8.3.1
MIME-Version: 1.0
Content-Type: text/plain
X-Mailing-List: linux-raid@vger.kernel.org

Hello List,

This series contains two patches that fix two different md-cluster bugs. For
easier reading, let us call them issue 1 (related to patch 1) and issue 2
(related to patch 2).

*** Test env ***

Node A & node B share 3 iSCSI LUNs: sdg/sdh/sdi. Each LUN is 1GB in size.
The larger the disks, the easier the issues are to trigger (longer resync
time gives more chances to hit the races).

*** Test script ***

Issue 1 & 2 can be triggered with the same test script:
```
ssh root@node2 "mdadm -S --scan"
mdadm -S --scan
for i in {g,h,i};do dd if=/dev/zero of=/dev/sd$i oflag=direct bs=1M \
    count=20; done
echo "mdadm create array"
mdadm -C /dev/md0 -b clustered -e 1.2 -n 2 -l mirror /dev/sdg /dev/sdh \
    --bitmap-chunk=1M
echo "set up array on node2"
ssh root@node2 "mdadm -A /dev/md0 /dev/sdg /dev/sdh"
sleep 5
mkfs.xfs /dev/md0
mdadm --manage --add /dev/md0 /dev/sdi
mdadm --wait /dev/md0
mdadm --grow --raid-devices=3 /dev/md0
mdadm /dev/md0 --fail /dev/sdg
mdadm /dev/md0 --remove /dev/sdg
#mdadm --wait /dev/md0
mdadm --grow --raid-devices=2 /dev/md0
```
There is a workaround: if the commented-out "--wait" before the second
"--grow" is enabled, issue 1 disappears.

*** error behavior ***

issue 1: the test script finishes every command, but the array status ends up
wrong.
issue 2: the test script hangs when executing "mdadm --remove".

Array status for issue 1 (partial output of "mdadm -D /dev/md0"):

Normal (expected) result:
```
    Number   Major   Minor   RaidDevice State
       1       8      112        0      active sync   /dev/sdh
       2       8      128        1      active sync   /dev/sdi
```
Error case: the "--fail" data still exists in the on-disk metadata area.
```
    Number   Major   Minor   RaidDevice State
       -       0        0        0      removed
       1       8      112        1      active sync   /dev/sdh
       2       8      128        2      active sync   /dev/sdi

       0       8       96        -      faulty   /dev/sdg
```
Error case: the "--remove" data still exists in the on-disk metadata area.
```
    Number   Major   Minor   RaidDevice State
       -       0        0        0      removed
       1       8      112        1      active sync   /dev/sdh
       2       8      128        2      active sync   /dev/sdi
```

Array status for issue 2 (hang info):
```
node1 # ps axj | grep mdadm
1  5423  5227  2231 ?  -1 D  0  0:00 mdadm /dev/md0 --remove /dev/sdg

node1 # cat /proc/mdstat
Personalities : [raid1]
md0 : active raid1 sdi[2] sdh[1] sdg[0](F)
      1046528 blocks super 1.2 [2/1] [_U]
      [>....................]  recovery =  0.0% (1/1046528) finish=354.0min speed=47K/sec
      bitmap: 1/1 pages [4KB], 1024KB chunk

unused devices: <none>

node2 # cat /proc/mdstat
Personalities : [raid1]
md0 : active raid1 sdi[2] sdg[0](F) sdh[1]
      1046528 blocks super 1.2 [2/1] [_U]
      bitmap: 1/1 pages [4KB], 1024KB chunk

unused devices: <none>

echo t > /proc/sysrq-trigger

md0_cluster_rec D    0  5329      2 0x80004000
Call Trace:
 __schedule+0x1f6/0x560
 ? _cond_resched+0x2d/0x40
 ? schedule+0x4a/0xb0
 ? process_metadata_update.isra.0+0xdb/0x140 [md_cluster]
 ? wait_woken+0x80/0x80
 ? process_recvd_msg+0x113/0x1d0 [md_cluster]
 ? recv_daemon+0x9e/0x120 [md_cluster]
 ? md_thread+0x94/0x160 [md_mod]
 ? wait_woken+0x80/0x80
 ? md_congested+0x30/0x30 [md_mod]
 ? kthread+0x115/0x140
 ? __kthread_bind_mask+0x60/0x60
 ? ret_from_fork+0x1f/0x40

mdadm           D    0  5423      1 0x00004004
Call Trace:
 __schedule+0x1f6/0x560
 ? __schedule+0x1fe/0x560
 ? schedule+0x4a/0xb0
 ? lock_comm.isra.0+0x7b/0xb0 [md_cluster]
 ? wait_woken+0x80/0x80
 ? remove_disk+0x4f/0x90 [md_cluster]
 ? hot_remove_disk+0xb1/0x1b0 [md_mod]
 ? md_ioctl+0x50c/0xba0 [md_mod]
 ? wait_woken+0x80/0x80
 ? blkdev_ioctl+0xa2/0x2a0
 ? block_ioctl+0x39/0x40
 ? ksys_ioctl+0x82/0xc0
 ? __x64_sys_ioctl+0x16/0x20
 ? do_syscall_64+0x5f/0x150
 ? entry_SYSCALL_64_after_hwframe+0x44/0xa9

md0_resync      D    0  5425      2 0x80004000
Call Trace:
 __schedule+0x1f6/0x560
 ? schedule+0x4a/0xb0
 ? dlm_lock_sync+0xa1/0xd0 [md_cluster]
 ? wait_woken+0x80/0x80
 ? lock_token+0x2d/0x90 [md_cluster]
 ? resync_info_update+0x95/0x100 [md_cluster]
 ? raid1_sync_request+0x7d3/0xa40 [raid1]
 ? md_do_sync.cold+0x737/0xc8f [md_mod]
 ? md_thread+0x94/0x160 [md_mod]
 ? md_congested+0x30/0x30 [md_mod]
 ? kthread+0x115/0x140
 ? __kthread_bind_mask+0x60/0x60
 ? ret_from_fork+0x1f/0x40
```

*** analysis ***

In a cluster env, every node can start the resync job, even if the resync cmd
was not executed on it. e.g. with two nodes A & B, the user executes
"mdadm --grow" on A, but sometimes it is B, not A, that starts the resync job.

Problem: the current update_raid_disks() only checks the local recovery
status, which is incomplete.

Issue 1 scenario:
```
node A                               node B
------------------------------       ---------------
mdadm --grow -n 3 md0
+ raid1_reshape
  mddev->raid_disks: 2=>3
                                      start resync job, it will block A's
                                      resync job
                                      mddev->raid_disks: 2=>3

mdadm md0 --fail sdg
+ update disk: array sb & bitmap sb
+ send METADATA_UPDATED
(resync job blocking)
                                      (B continues the "--grow -n 3" resync job)
                                      recv METADATA_UPDATED
                                      + read disk metadata
                                      + raid1_error
                                        + set MD_RECOVERY_INTR to break resync
                                      ... ...
                                      md_check_recovery
                                      + remove_and_add_spares returns 1
                                      + set MD_RECOVERY_RECOVER, later restart resync

mdadm md0 --remove sdg
+ md_cluster_ops->remove_disk
| + send REMOVE
+ md_kick_rdev_from_array
+ update disk: array sb & bitmap sb
(resync job blocking)
                                      (B continues the "--grow -n 3" resync job)
                                      recv REMOVE
                                      + process_remove_disk doesn't set
                                        mddev->sb_flags, so it doesn't update
                                        the on-disk sb & bitmap sb.
                                      ......
                                      md_check_recovery
                                      + md_kick_rdev_from_array

mdadm --grow -n 2 md0
+ raid1_reshape
| mddev->raid_disks: 3=>2
+ send METADATA_UPDATED
                                      (B continues the "--grow -n 3" resync job)
                                      recv METADATA_UPDATED
                                      + check_sb_changes
                                        update_raid_disks returns -EBUSY
                                        the update of mddev->raid_disks: 3=>2 fails

                                      (B never successfully updates
                                       mddev->raid_disks: 3=>2)
                                      when B finishes the "--grow -n 3" resync job
                                      + use mddev->raid_disks:3 to update
                                        array sb & bitmap sb
                                      + send METADATA_UPDATED
recv METADATA_UPDATED
+ read wrong raid_disks to update
  kernel data.
```
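Patch 1 follows from this analysis: the reshape path has to take the remote
resync state into account and return an error instead of silently proceeding.
As a rough sketch of the idea only (not the actual patch;
remote_resync_in_progress() is a hypothetical helper standing in for however
md-cluster tracks a remote node's resync state):

```c
	/*
	 * Sketch of the patch 1 idea, not the actual patch: next to the
	 * existing local-recovery check in update_raid_disks(), also refuse
	 * the reshape when the cluster layer knows another node is running
	 * a resync job. remote_resync_in_progress() is hypothetical.
	 */
	if (mddev_is_clustered(mddev) && remote_resync_in_progress(mddev))
		return -EBUSY;	/* a remote node owns the resync job */
```

With such a check, the second "--grow" on node A would presumably fail up
front, instead of leaving node B stuck with a stale raid_disks value.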
Now for issue 2. There is a similar deadlock that was handled in commit
0ba959774e93911caff596de6391f085fb640ac4, so let me explain commit
0ba959774e first.
```
(before commit 0ba959774e: deadlock)

nodeA                                 nodeB
--------------------                  --------------------
a.
send METADATA_UPDATED
held token_lockres:EX
                                      b.
                                      md_do_sync
                                       resync_info_update
                                         send RESYNCING
                                          + set MD_CLUSTER_SEND_LOCK
                                          + wait for holding token_lockres:EX

                                      c.
                                      update sb
                                      + held reconfig_mutex
                                      + metadata_update_start
                                        + wait_event(MD_CLUSTER_SEND_LOCK)
                                          //blocking: b owns MD_CLUSTER_SEND_LOCK

                                      d.
                                      recv_daemon //METADATA_UPDATED from A
                                       process_metadata_update
                                       + mddev_trylock(mddev) returns false
                                         //blocking: c holds reconfig_mutex

(after commit 0ba959774e: the deadlock is broken)

nodeA                                 nodeB
--------------------                  --------------------
a.
send METADATA_UPDATED
held token_lockres:EX
                                      b.
                                      md_do_sync
                                       resync_info_update
                                         send RESYNCING
                                          + set MD_CLUSTER_SEND_LOCK
                                          + wait for holding token_lockres:EX

                                      c.
                                      update sb
                                      + held reconfig_mutex
                                      + metadata_update_start
                                        + set MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD
                                          //for breaking the deadlock
                                        + wait_event(MD_CLUSTER_SEND_LOCK)

                                      d.
                                      recv_daemon //METADATA_UPDATED from A
                                       process_metadata_update
                                       + (mddev_trylock(mddev) ||
                                          MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD)
                                         //this does not block & breaks the deadlock
```
Issue 2 is very similar to the 0ba959774e case, except for step c:
```
nodeA                                 nodeB
--------------------                  --------------------
a.
send METADATA_UPDATED
held token_lockres:EX
                                      b.
                                      md_do_sync
                                       resync_info_update
                                         send RESYNCING
                                          + set MD_CLUSTER_SEND_LOCK
                                          + wait for holding token_lockres:EX

                                      c.
                                      mdadm /dev/md0 --remove /dev/sdg
                                      + held reconfig_mutex
                                      + send REMOVE
                                        + wait_event(MD_CLUSTER_SEND_LOCK)

                                      d.
                                      recv_daemon //METADATA_UPDATED from A
                                       process_metadata_update
                                       + (mddev_trylock(mddev) ||
                                          MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD)
                                         //this time, both return false forever
```
Commit 0ba959774e9391 uses MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD to break the
deadlock, but in issue 2 it doesn't help, because md_cluster_ops->remove_disk
is called from:
- state_store(), which does not hold reconfig_mutex
- hot_remove_disk(), which must hold reconfig_mutex

There are two methods to fix this:
1. Like commit 0ba959774e: hold reconfig_mutex in state_store(), and set
   MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD in remove_disk().
2. Like patch 2: change wait_event to wait_event_timeout in lock_comm() &
   process_metadata_update().

There are some reasons why I prefer method 2:
- I am not very familiar with all the scenarios in state_store(), so I am not
  sure whether holding reconfig_mutex there could introduce a new bug/issue.
- It looks like any sendmsg path could trigger issue 2. Currently we have
  found two cases (there may be others):
  - update sb (see commit 0ba959774)
  - mdadm --remove (issue 2)

So we should break the deadlock in the key code path
(wait_event => wait_event_timeout), as sketched below.
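For illustration only, a minimal sketch of what method 2 could look like at
the waiting site inside lock_comm(); the 5s timeout and the -EBUSY return are
placeholders, not the actual patch 2:

```c
	/*
	 * Sketch only (not the actual patch 2): bound the wait for
	 * MD_CLUSTER_SEND_LOCK so the caller can bail out instead of
	 * sleeping forever while recv_daemon is also stuck.
	 * wait_event_timeout() returns 0 when the timeout elapsed and the
	 * condition is still false.
	 */
	if (!wait_event_timeout(cinfo->wait,
				!test_and_set_bit(MD_CLUSTER_SEND_LOCK,
						  &cinfo->state),
				msecs_to_jiffies(5000)))
		return -EBUSY;	/* timed out: give up and report the failure */
```

process_metadata_update() would get the analogous treatment, so neither the
sender nor the recv daemon can be stuck forever.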
-------
v1:
- create patch set
-------

Zhao Heming (2):
  md/cluster: reshape should return error when remote is doing a resync job
  md/cluster: fix deadlock when doing reshape job

 drivers/md/md-cluster.c | 42 ++++++++++++++++++++++++++---------------
 drivers/md/md.c         |  8 ++++++--
 2 files changed, 33 insertions(+), 17 deletions(-)

--
2.27.0