From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id EA478C38142 for ; Thu, 19 Jan 2023 05:15:16 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S229379AbjASFPP (ORCPT ); Thu, 19 Jan 2023 00:15:15 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:50090 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S229812AbjASFOQ (ORCPT ); Thu, 19 Jan 2023 00:14:16 -0500 Received: from mail-pj1-x1029.google.com (mail-pj1-x1029.google.com [IPv6:2607:f8b0:4864:20::1029]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 18DDC172A for ; Wed, 18 Jan 2023 21:14:15 -0800 (PST) Received: by mail-pj1-x1029.google.com with SMTP id n20-20020a17090aab9400b00229ca6a4636so3890438pjq.0 for ; Wed, 18 Jan 2023 21:14:15 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=fromorbit-com.20210112.gappssmtp.com; s=20210112; h=in-reply-to:content-disposition:mime-version:references:message-id :subject:cc:to:from:date:from:to:cc:subject:date:message-id:reply-to; bh=Kx/0/eM5Aq1g63wRRP7qndmkF3eZriN905Vl4L2Ntj0=; b=K/zsRpcY8ajM/mTU/O+dTqXI+XR2tsFiTzvL3bYWpWzWQVzkiNfbdygXT2ouzbkie/ 5B6k8UNhRG1GidOF/+UJxEihcy3wGgQG9ThQ9bdnJqndYIYnuGTtcyR5IsOd8NgUSlJ8 SqXSQvWMlHgPtLxWwsl20EtUvIqSEvo2q1rZiRvkKRp2NpfKiCAjYom1oe7YJJjurcav s/hxVRCjsx/owyUfBWO4x7MD7a+7MY4uCb1C48XPULQ45zTYvBzx6kAui1/p/nyzB9RT iglL1fkfziIcNC9ukNOlteL6aPvm2tUWigWA68pIWEGAbNX7rCvx6bzUQonUSjlmHlh9 oqZA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=in-reply-to:content-disposition:mime-version:references:message-id :subject:cc:to:from:date:x-gm-message-state:from:to:cc:subject:date :message-id:reply-to; bh=Kx/0/eM5Aq1g63wRRP7qndmkF3eZriN905Vl4L2Ntj0=; b=KrU8jLhFh2i3HmbCniWJZWq7Bqaw9S8iYoTkt+aoglRKy17UYt7tSwB8UEXPTrfeIq eDfdL9sjGwlrotoVDDZwmCAg8jyXRvPueoK4XodVaey81Hlml8k2yvADF3JzBAOPA2+t GWI4agMB405PMP5kUbFpwIf6tznjcAo3pGkJ47ZnwCC/lSjX2SQOBguR3fibDDLnEL6P HUNDwZJaZv6uPnMrj+o0wkRnj/oNwRRl25zdiXmOE0WO+WzIfYn+kJ1QUt10iyfXkLr2 rmhkDxx7JpeutiPkk5CKy5n9Lxm0snMxUf6UK37+Ak5v83I2RanzNPE45e2ItF5V5IB5 iE2A== X-Gm-Message-State: AFqh2kqgoM23d0RbOiiHeMky3UAMlxKdLJ8f73aOHaio/Z6kLwAviEHs KBlKYRaj5RziPHB7+qEC6Foo9hP7iJEM8Ehu X-Google-Smtp-Source: AMrXdXvRrSlGNmocFMhYEOREOnBz9fxk2tIgJ/kp18dqr/+QxxSj1BQyZxZ0hZw5aC/yY5Qf19n58w== X-Received: by 2002:a17:902:7597:b0:194:afe4:3011 with SMTP id j23-20020a170902759700b00194afe43011mr7348554pll.52.1674105254552; Wed, 18 Jan 2023 21:14:14 -0800 (PST) Received: from dread.disaster.area (pa49-186-146-207.pa.vic.optusnet.com.au. [49.186.146.207]) by smtp.gmail.com with ESMTPSA id n7-20020a170902e54700b00194ab9a4febsm4111758plf.74.2023.01.18.21.14.13 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Wed, 18 Jan 2023 21:14:14 -0800 (PST) Received: from dave by dread.disaster.area with local (Exim 4.92.3) (envelope-from ) id 1pINFb-004pAp-9o; Thu, 19 Jan 2023 16:14:11 +1100 Date: Thu, 19 Jan 2023 16:14:11 +1100 From: Dave Chinner To: "Darrick J. Wong" Cc: linux-xfs@vger.kernel.org Subject: Re: [PATCH] xfs: recheck appropriateness of map_shared lock Message-ID: <20230119051411.GJ360264@dread.disaster.area> References: MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: Precedence: bulk List-ID: X-Mailing-List: linux-xfs@vger.kernel.org On Wed, Jan 18, 2023 at 05:24:58PM -0800, Darrick J. Wong wrote: > From: Darrick J. Wong > > While fuzzing the data fork extent count on a btree-format directory > with xfs/375, I observed the following (excerpted) splat: > > XFS: Assertion failed: xfs_isilocked(ip, XFS_ILOCK_EXCL), file: fs/xfs/libxfs/xfs_bmap.c, line: 1208 > ------------[ cut here ]------------ > WARNING: CPU: 0 PID: 43192 at fs/xfs/xfs_message.c:104 assfail+0x46/0x4a [xfs] > Call Trace: > > xfs_iread_extents+0x1af/0x210 [xfs 09f66509ece4938760fac7de64732a0cbd3e39cd] > xchk_dir_walk+0xb8/0x190 [xfs 09f66509ece4938760fac7de64732a0cbd3e39cd] > xchk_parent_count_parent_dentries+0x41/0x80 [xfs 09f66509ece4938760fac7de64732a0cbd3e39cd] > xchk_parent_validate+0x199/0x2e0 [xfs 09f66509ece4938760fac7de64732a0cbd3e39cd] > xchk_parent+0xdf/0x130 [xfs 09f66509ece4938760fac7de64732a0cbd3e39cd] > xfs_scrub_metadata+0x2b8/0x730 [xfs 09f66509ece4938760fac7de64732a0cbd3e39cd] > xfs_scrubv_metadata+0x38b/0x4d0 [xfs 09f66509ece4938760fac7de64732a0cbd3e39cd] > xfs_ioc_scrubv_metadata+0x111/0x160 [xfs 09f66509ece4938760fac7de64732a0cbd3e39cd] > xfs_file_ioctl+0x367/0xf50 [xfs 09f66509ece4938760fac7de64732a0cbd3e39cd] > __x64_sys_ioctl+0x82/0xa0 > do_syscall_64+0x2b/0x80 > entry_SYSCALL_64_after_hwframe+0x46/0xb0 > > The cause of this is a race condition in xfs_ilock_data_map_shared, > which performs an unlocked access to the data fork to guess which lock > mode it needs: > > Thread 0 Thread 1 > > xfs_need_iread_extents > > xfs_ilock(..., ILOCK_EXCL) > xfs_iread_extents > > > > match nextents> > xfs_need_iread_extents > > xfs_ilock(..., ILOCK_SHARED) > > xfs_iunlock(..., ILOCK_EXCL) > xfs_iread_extents > > > *BOOM* > > mitigate this race by having thread 1 to recheck xfs_need_iread_extents > after taking the shared ILOCK. If the iext tree isn't present, then we > need to upgrade to the exclusive ILOCK to try to load the bmbt. Yup, I see the problem - this check is failing: if (XFS_IS_CORRUPT(mp, ir.loaded != ifp->if_nextents)) { error = -EFSCORRUPTED; goto out; } and that results in calling xfs_iext_destroy() to tear down the extent tree. But we know the BMBT is corrupted and the extent list cannot be read until the corruption is fixed. IOWs, we can't access any data in the inode no matter how we lock it until the corruption is repaired. > > Signed-off-by: Darrick J. Wong > --- > fs/xfs/xfs_inode.c | 29 +++++++++++++++++++++++++++++ > 1 file changed, 29 insertions(+) > > diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c > index d354ea2b74f9..6ce1e0e9f256 100644 > --- a/fs/xfs/xfs_inode.c > +++ b/fs/xfs/xfs_inode.c > @@ -117,6 +117,20 @@ xfs_ilock_data_map_shared( > if (xfs_need_iread_extents(&ip->i_df)) > lock_mode = XFS_ILOCK_EXCL; > xfs_ilock(ip, lock_mode); > + > + /* > + * It's possible that the unlocked access of the data fork to determine > + * the lock mode could have raced with another thread that was failing > + * to load the bmbt but hadn't yet torn down the iext tree. Recheck > + * the lock mode and upgrade to an exclusive lock if we need to. > + */ > + if (lock_mode == XFS_ILOCK_SHARED && > + xfs_need_iread_extents(&ip->i_df)) { > + xfs_iunlock(ip, lock_mode); > + lock_mode = XFS_ILOCK_EXCL; > + xfs_ilock(ip, lock_mode); > + } .... and this makes me cringe. :/ If we hit this race condition, re-reading the extent list from disk isn't going to fix the corruption, so I don't see much point in papering over the problem just by changing the locking and failing to read in the extent list again and returning -EFSCORRUPTED to the operation. So.... shouldn't we mark the inode as sick when we detect the extent list corruption issue? i.e. before destroying the iext tree, calling xfs_inode_mark_sick(XFS_SICK_INO_BMBTD) (or BMBTA, depending on the fork being read) so that there is a record of the BMBT being corrupt? That would mean that this path simply becomes: if (ip->i_sick & XFS_SICK_INO_BMBTD) { xfs_iunlock(ip, lock_mode); return -EFSCORRUPTED; } Which is now pretty clear that we there's no point continuing because we can't read in the extent list, and in doing so we've removed the race condition caused by temporarily filling the in-core extent list. > + > return lock_mode; > } > > @@ -129,6 +143,21 @@ xfs_ilock_attr_map_shared( > if (xfs_inode_has_attr_fork(ip) && xfs_need_iread_extents(&ip->i_af)) > lock_mode = XFS_ILOCK_EXCL; > xfs_ilock(ip, lock_mode); > + > + /* > + * It's possible that the unlocked access of the attr fork to determine > + * the lock mode could have raced with another thread that was failing > + * to load the bmbt but hadn't yet torn down the iext tree. Recheck > + * the lock mode and upgrade to an exclusive lock if we need to. > + */ > + if (lock_mode == XFS_ILOCK_SHARED && > + xfs_inode_has_attr_fork(ip) && > + xfs_need_iread_extents(&ip->i_af)) { > + xfs_iunlock(ip, lock_mode); > + lock_mode = XFS_ILOCK_EXCL; > + xfs_ilock(ip, lock_mode); > + } And this can just check for XFS_SICK_INO_BMBTA instead... Cheers, Dave. -- Dave Chinner david@fromorbit.com