Date: Wed, 7 Oct 2020 15:24:16 -0700
From: "Darrick J. Wong"
To: Brian Foster
Cc: linux-xfs@vger.kernel.org
Subject: Re: [PATCH 3/3] xfs: rework quotaoff logging to avoid log deadlock on active fs
Message-ID: <20201007222416.GG6540@magnolia>
References: <20201001150310.141467-1-bfoster@redhat.com> <20201001150310.141467-4-bfoster@redhat.com>
In-Reply-To: <20201001150310.141467-4-bfoster@redhat.com>
List-ID: linux-xfs@vger.kernel.org

On Thu, Oct 01, 2020 at 11:03:10AM -0400, Brian Foster wrote:
> The quotaoff operation logs two log items. The start item is
> committed first, followed by the bulk of the in-core quotaoff
> processing, and then the quotaoff end item is committed to release
> the start item from the log. The problem with this mechanism is that
> quite a bit of processing can be required to release dquots from all
> in-core inodes and subsequently flush/purge all dquots in the
> system. This processing work doesn't generally generate much log
> traffic itself, but the start item pins the tail of the log.
> If an external workload consumes the remaining log space before the
> transaction for the end item is allocated, a log deadlock can occur.
>
> The purpose of the separate start and end log items is primarily to
> ensure that log recovery does not incorrectly recover dquot data
> after an fs crash where a quotaoff was in progress. If we only
> logged a single quotaoff item, for example, it could fall behind the
> tail of the log before the last dquot modification was made and
> incorrectly replay dquot changes that might have occurred after the
> start item committed but before quotaoff deactivated the quota.
>
> With that context, we can make some small changes to the quotaoff
> algorithm to provide the same general log ordering guarantee without
> such a large window to create a log deadlock vector. Rather than
> place a start item in the log for the duration of quotaoff
> processing, we can quiesce the transaction subsystem up front to
> guarantee that no further dquots are logged from that point forward.
> IOW, we pause the transaction subsystem, commit the quotaoff start
> and end items, deactivate the associated quota such that subsequent
> transactions no longer modify associated dquots, and resume the
> transaction subsystem. The transaction pause is somewhat of a
> heavyweight operation, but quotaoff is already a rare, slow and
> performance-disruptive operation, and the quiesce is only required
> for two small transactions.
>
> Altogether, this means that the dquot rele/purge sequence occurs
> after the quotaoff end item has committed and thus can technically
> fall off the end of the log. This is safe because the remaining
> processing is in-core work that doesn't involve logging dquots and
> we've guaranteed that no further dquots are modified by external
> transactions. This allows quotaoff to complete without risking log
> deadlock regardless of how much dquot processing is required.
>
> Suggested-by: Dave Chinner
> Signed-off-by: Brian Foster
> ---
>  fs/xfs/xfs_qm_syscalls.c | 133 +++++++++++++++++++--------------------
>  fs/xfs/xfs_trans_dquot.c |   2 +
>  2 files changed, 67 insertions(+), 68 deletions(-)
>
> diff --git a/fs/xfs/xfs_qm_syscalls.c b/fs/xfs/xfs_qm_syscalls.c
> index ca1b57d291dc..b8e55f4947bd 100644
> --- a/fs/xfs/xfs_qm_syscalls.c
> +++ b/fs/xfs/xfs_qm_syscalls.c
> @@ -29,7 +29,8 @@ xfs_qm_log_quotaoff(
>  	int			error;
>  	struct xfs_qoff_logitem	*qoffi;
>
> -	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_qm_quotaoff, 0, 0, 0, &tp);
> +	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_qm_quotaoff, 0, 0,
> +			XFS_TRANS_NO_WRITECOUNT, &tp);
>  	if (error)
>  		goto out;
>
> @@ -67,7 +68,8 @@ xfs_qm_log_quotaoff_end(
>  	int			error;
>  	struct xfs_qoff_logitem	*qoffi;
>
> -	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_qm_equotaoff, 0, 0, 0, &tp);
> +	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_qm_equotaoff, 0, 0,
> +			XFS_TRANS_NO_WRITECOUNT, &tp);
>  	if (error)
>  		return error;
>
> @@ -106,8 +108,8 @@ xfs_qm_scall_quotaoff(
>
>  	/*
>  	 * No file system can have quotas enabled on disk but not in core.
> -	 * Note that quota utilities (like quotaoff) _expect_
> -	 * errno == -EEXIST here.
> +	 * Note that quota utilities (like quotaoff) _expect_ errno == -EEXIST
> +	 * here.
>  	 */
>  	if ((mp->m_qflags & flags) == 0)
>  		return -EEXIST;
>
> @@ -116,17 +118,14 @@ xfs_qm_scall_quotaoff(
>  	flags &= (XFS_ALL_QUOTA_ACCT | XFS_ALL_QUOTA_ENFD);
>
>  	/*
> -	 * We don't want to deal with two quotaoffs messing up each other,
> -	 * so we're going to serialize it. quotaoff isn't exactly a performance
> +	 * We don't want to deal with two quotaoffs messing up each other, so
> +	 * we're going to serialize it. quotaoff isn't exactly a performance
>  	 * critical thing.
> -	 * If quotaoff, then we must be dealing with the root filesystem.
>  	 */
>  	ASSERT(q);
>  	mutex_lock(&q->qi_quotaofflock);
>
> -	/*
> -	 * If we're just turning off quota enforcement, change mp and go.
> -	 */
> +	/* if we're just turning off quota enforcement, change mp and go */
>  	if ((flags & XFS_ALL_QUOTA_ACCT) == 0) {
>  		mp->m_qflags &= ~(flags);
>
> @@ -142,9 +141,9 @@ xfs_qm_scall_quotaoff(
>  	dqtype = 0;
>  	inactivate_flags = 0;
>  	/*
> -	 * If accounting is off, we must turn enforcement off, clear the
> -	 * quota 'CHKD' certificate to make it known that we have to
> -	 * do a quotacheck the next time this quota is turned on.
> +	 * If accounting is off, we must turn enforcement off, clear the quota
> +	 * 'CHKD' certificate to make it known that we have to do a quotacheck
> +	 * the next time this quota is turned on.
>  	 */
>  	if (flags & XFS_UQUOTA_ACCT) {
>  		dqtype |= XFS_QMOPT_UQUOTA;
> @@ -163,89 +162,87 @@
>  	}
>
>  	/*
> -	 * Nothing to do? Don't complain. This happens when we're just
> -	 * turning off quota enforcement.
> +	 * Nothing to do? Don't complain. This happens when we're just turning
> +	 * off quota enforcement.
>  	 */
>  	if ((mp->m_qflags & flags) == 0)
>  		goto out_unlock;
>
>  	/*
> -	 * Write the LI_QUOTAOFF log record, and do SB changes atomically,
> -	 * and synchronously. If we fail to write, we should abort the
> -	 * operation as it cannot be recovered safely if we crash.
> +	 * Quotaoff must deactivate the associated quota mode(s), release dquots
> +	 * from inodes and purge them from the system all while the filesystem
> +	 * remains active. We have two quotaoff log records that traditionally
> +	 * bound the start and end of this sequence. This guarantees that no
> +	 * dquots are modified after the end item hits the log, but quotaoff can
> +	 * be time consuming and thus prone to deadlock because the start item
> +	 * pins the tail of the log in the meantime (and we can't hold the end
> +	 * transaction open across the dqrele scan).
> +	 *
> +	 * The critical aspect of correctly logging quotaoff is that no dquots
> +	 * are modified after the quotaoff end item hits the on-disk log.
> +	 * Otherwise the quotaoff can fall off the tail and log recovery can
> +	 * replay incorrect data. Instead of letting the start item sit in the
> +	 * log while quotaoff completes, we can provide the same guarantee via a
> +	 * runtime barrier for dquot modifications. Specifically, we pause all
> +	 * transactions on the system via the transaction subsystem lock, log
> +	 * both start and end items (via sync transactions, which drains the
> +	 * CIL), deactivate the quota, and then resume the transaction subsystem
> +	 * while quotaoff completes.
> +	 *
> +	 * This is safe because the remaining quotaoff work is in-core cleanup
> +	 * and all subsequent transactions should see the updated quota state
> +	 * due to memory ordering provided by the lock. We also avoid deadlock
> +	 * by committing both items sequentially with near exclusive access to
> +	 * the transaction subsystem.
>  	 */
> +	percpu_down_write(&mp->m_trans_rwsem);
> +
>  	error = xfs_qm_log_quotaoff(mp, &qoffstart, flags);
> -	if (error)
> +	if (error) {
> +		percpu_up_write(&mp->m_trans_rwsem);
>  		goto out_unlock;
> +	}
>
> -	/*
> -	 * Next we clear the XFS_MOUNT_*DQ_ACTIVE bit(s) in the mount struct
> -	 * to take care of the race between dqget and quotaoff. We don't take
> -	 * any special locks to reset these bits. All processes need to check
> -	 * these bits *after* taking inode lock(s) to see if the particular
> -	 * quota type is in the process of being turned off. If *ACTIVE, it is
> -	 * guaranteed that all dquot structures and all quotainode ptrs will all
> -	 * stay valid as long as that inode is kept locked.
> -	 *
> -	 * There is no turning back after this.
> -	 */
>  	mp->m_qflags &= ~inactivate_flags;
>
> +	error = xfs_qm_log_quotaoff_end(mp, &qoffstart, flags);
> +	if (error) {
> +		percpu_up_write(&mp->m_trans_rwsem);
> +		/* We're screwed now. Shutdown is the only option.
> +		 */
> +		xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_INCORE);
> +		goto out_unlock;
> +	}
> +
> +	percpu_up_write(&mp->m_trans_rwsem);

Hmm, so if I read this correctly, you're changing what gets written to
the log from:

    <quotaoff start>
    ...now the qflags get cleared...
    <quotaoff end>
    ...run around purging dquots...

to something that looks more like:

    <quotaoff start>
    <quotaoff end>
    ...now the qflags get cleared; no other transactions...
    ...run around purging dquots...

Is that right?

I guess that makes sense, though it's sort of a pity that we now make
every transaction grab a read lock. Though I guess there's not much that
can be done about that; it's better than pinning the log tail.

I don't even think there's a good way to relog that quotaoff-intent
item, is there? Since you'd have to, I dunno, do all the checks that
xfs_defer_relog() does to decide if it should relog an intent item?

--D

> +
>  	/*
> -	 * Give back all the dquot reference(s) held by inodes.
> -	 * Here we go thru every single incore inode in this file system, and
> -	 * do a dqrele on the i_udquot/i_gdquot that it may have.
> -	 * Essentially, as long as somebody has an inode locked, this guarantees
> -	 * that quotas will not be turned off. This is handy because in a
> -	 * transaction once we lock the inode(s) and check for quotaon, we can
> -	 * depend on the quota inodes (and other things) being valid as long as
> -	 * we keep the lock(s).
> +	 * Release dquot references held by inodes. Technically some contexts
> +	 * might not pick up the quota state change until the inode lock is
> +	 * cycled if there is no transaction. We don't care about that above
> +	 * because a dquot can't be logged without a transaction and we can't
> +	 * release/purge a dquot here until we've cycled the locks of all inodes
> +	 * that reference it.
>  	 */
>  	xfs_qm_dqrele_all_inodes(mp, flags);
>
>  	/*
>  	 * Next we make the changes in the quota flag in the mount struct.
> -	 * This isn't protected by a particular lock directly, because we
> -	 * don't want to take a mrlock every time we depend on quotas being on.
> +	 * This isn't protected by a particular lock directly, because we don't
> +	 * want to take a mrlock every time we depend on quotas being on.
>  	 */
>  	mp->m_qflags &= ~flags;
>
> -	/*
> -	 * Go through all the dquots of this file system and purge them,
> -	 * according to what was turned off.
> -	 */
> +	/* purge all deactivated dquots from the filesystem */
>  	xfs_qm_dqpurge_all(mp, dqtype);
>
> -	/*
> -	 * Transactions that had started before ACTIVE state bit was cleared
> -	 * could have logged many dquots, so they'd have higher LSNs than
> -	 * the first QUOTAOFF log record does. If we happen to crash when
> -	 * the tail of the log has gone past the QUOTAOFF record, but
> -	 * before the last dquot modification, those dquots __will__
> -	 * recover, and that's not good.
> -	 *
> -	 * So, we have QUOTAOFF start and end logitems; the start
> -	 * logitem won't get overwritten until the end logitem appears...
> -	 */
> -	error = xfs_qm_log_quotaoff_end(mp, &qoffstart, flags);
> -	if (error) {
> -		/* We're screwed now. Shutdown is the only option. */
> -		xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_INCORE);
> -		goto out_unlock;
> -	}
> -
> -	/*
> -	 * If all quotas are completely turned off, close shop.
> -	 */
> +	/* if all quotas are completely turned off, close shop */
>  	if (mp->m_qflags == 0) {
>  		mutex_unlock(&q->qi_quotaofflock);
>  		xfs_qm_destroy_quotainfo(mp);
>  		return 0;
>  	}
>
> -	/*
> -	 * Release our quotainode references if we don't need them anymore.
> -	 */
> +	/* release our quotainode references if we don't need them anymore */
>  	if ((dqtype & XFS_QMOPT_UQUOTA) && q->qi_uquotaip) {
>  		xfs_irele(q->qi_uquotaip);
>  		q->qi_uquotaip = NULL;
> diff --git a/fs/xfs/xfs_trans_dquot.c b/fs/xfs/xfs_trans_dquot.c
> index 547ba824542e..9839b83e732a 100644
> --- a/fs/xfs/xfs_trans_dquot.c
> +++ b/fs/xfs/xfs_trans_dquot.c
> @@ -52,6 +52,8 @@ xfs_trans_log_dquot(
>  	struct xfs_dquot	*dqp)
>  {
>  	ASSERT(XFS_DQ_IS_LOCKED(dqp));
> +	/* quotaoff expects no dquots logged after deactivation */
> +	ASSERT(xfs_this_quota_on(tp->t_mountp, xfs_dquot_type(dqp)));
>
>  	/* Upgrade the dquot to bigtime format if possible. */
>  	if (dqp->q_id != 0 &&
> --
> 2.25.4
>