Date: Tue, 2 Jun 2020 12:32:49 -0400
From: Brian Foster
To: Dave Chinner
Cc: linux-xfs@vger.kernel.org
Subject: Re: [PATCH 01/30] xfs: Don't allow logging of XFS_ISTALE inodes
Message-ID: <20200602163249.GA7967@bfoster>
References: <20200601214251.4167140-1-david@fromorbit.com>
 <20200601214251.4167140-2-david@fromorbit.com>
In-Reply-To: <20200601214251.4167140-2-david@fromorbit.com>

On Tue, Jun 02, 2020 at 07:42:22AM +1000, Dave Chinner wrote:
> From: Dave Chinner
>
> In tracking down a problem in this patchset, I discovered we are
> reclaiming dirty stale inodes. This wasn't discovered until inodes
> were always attached to the cluster buffer and then the rcu callback
> that freed inodes was assert failing because the inode still had an
> active pointer to the cluster buffer after it had been reclaimed.
>
> Debugging the issue indicated that this was a pre-existing problem
> resulting from the way the inodes are handled in xfs_inactive_ifree.
> When we free a cluster buffer from xfs_ifree_cluster, all the inodes
> in cache are marked XFS_ISTALE. Those that are clean have nothing
> else done to them and so eventually get cleaned up by background
> reclaim. i.e. it is assumed we'll never dirty/relog an inode marked
> XFS_ISTALE.
>
> On journal commit, dirty stale inodes are handled by both the
> buffer and inode log items: they are either run through
> xfs_istale_done() and removed from the AIL (buffer log item commit),
> or the inode log item simply unpins the inode because the buffer log
> item will clean it. What happens to any specific inode is entirely
> dependent on which log item wins the commit race, but the result is
> the same - stale inodes are clean, not attached to the cluster
> buffer, and not in the AIL. Hence inode reclaim can just free these
> inodes without further care.
>
> However, if the stale inode is relogged, it gets dirtied again and
> relogged into the CIL. Most of the time this isn't an issue, because
> relogging simply changes the inode's location in the current
> checkpoint. Problems arise, however, when the CIL checkpoints
> between two transactions in the xfs_inactive_ifree() deferops
> processing. This results in the XFS_ISTALE inode being redirtied
> and inserted into the CIL without any of the other stale cluster
> buffer infrastructure being in place.
>
> Hence on journal commit, it simply gets unpinned, so it remains
> dirty in memory. Everything in inode writeback avoids XFS_ISTALE
> inodes so it can't be written back, and it is not tracked in the AIL
> so there's not even a trigger to attempt to clean the inode. Hence
> the inode just sits dirty in memory until inode reclaim comes along,
> sees that it is XFS_ISTALE, and goes to reclaim it. This reclaiming
> of a dirty inode caused use-after-free bugs, list corruptions and
> other nasty issues later in this patchset.
>
> Hence this patch addresses a violation of the "never log XFS_ISTALE
> inodes" rule caused by the deferops processing rolling a transaction
> and relogging a stale inode in xfs_inactive_ifree(). It also adds a
> number of asserts to catch this problem in debug kernels so that we
> don't reintroduce it in the future.
>
> The reproducer for this issue was generic/558 on a v4 filesystem.
>
> Signed-off-by: Dave Chinner
> ---

Reviewed-by: Brian Foster

>  fs/xfs/libxfs/xfs_trans_inode.c |  2 ++
>  fs/xfs/xfs_icache.c             |  3 ++-
>  fs/xfs/xfs_inode.c              | 25 ++++++++++++++++++++++---
>  3 files changed, 26 insertions(+), 4 deletions(-)
>
> diff --git a/fs/xfs/libxfs/xfs_trans_inode.c b/fs/xfs/libxfs/xfs_trans_inode.c
> index b5dfb66548422..4504d215cd590 100644
> --- a/fs/xfs/libxfs/xfs_trans_inode.c
> +++ b/fs/xfs/libxfs/xfs_trans_inode.c
> @@ -36,6 +36,7 @@ xfs_trans_ijoin(
>
>  	ASSERT(iip->ili_lock_flags == 0);
>  	iip->ili_lock_flags = lock_flags;
> +	ASSERT(!xfs_iflags_test(ip, XFS_ISTALE));
>
>  	/*
>  	 * Get a log_item_desc to point at the new item.
> @@ -89,6 +90,7 @@ xfs_trans_log_inode(
>
>  	ASSERT(ip->i_itemp != NULL);
>  	ASSERT(xfs_isilocked(ip, XFS_ILOCK_EXCL));
> +	ASSERT(!xfs_iflags_test(ip, XFS_ISTALE));
>
>  	/*
>  	 * Don't bother with i_lock for the I_DIRTY_TIME check here, as races
> diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
> index 0a5ac6f9a5834..dbba4c1946386 100644
> --- a/fs/xfs/xfs_icache.c
> +++ b/fs/xfs/xfs_icache.c
> @@ -1141,7 +1141,7 @@ xfs_reclaim_inode(
>  			goto out_ifunlock;
>  		xfs_iunpin_wait(ip);
>  	}
> -	if (xfs_iflags_test(ip, XFS_ISTALE) || xfs_inode_clean(ip)) {
> +	if (xfs_inode_clean(ip)) {
>  		xfs_ifunlock(ip);
>  		goto reclaim;
>  	}
> @@ -1228,6 +1228,7 @@ xfs_reclaim_inode(
>  	xfs_ilock(ip, XFS_ILOCK_EXCL);
>  	xfs_qm_dqdetach(ip);
>  	xfs_iunlock(ip, XFS_ILOCK_EXCL);
> +	ASSERT(xfs_inode_clean(ip));
>
>  	__xfs_inode_free(ip);
>  	return error;
> diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
> index 64f5f9a440aed..53a1d64782c35 100644
> --- a/fs/xfs/xfs_inode.c
> +++ b/fs/xfs/xfs_inode.c
> @@ -1740,10 +1740,31 @@ xfs_inactive_ifree(
>  		return error;
>  	}
>
> +	/*
> +	 * We do not hold the inode locked across the entire rolling transaction
> +	 * here. We only need to hold it for the first transaction that
> +	 * xfs_ifree() builds, which may mark the inode XFS_ISTALE if the
> +	 * underlying cluster buffer is freed. Relogging an XFS_ISTALE inode
> +	 * here breaks the relationship between cluster buffer invalidation and
> +	 * stale inode invalidation on cluster buffer item journal commit
> +	 * completion, and can result in leaving dirty stale inodes hanging
> +	 * around in memory.
> +	 *
> +	 * We have no need for serialising this inode operation against other
> +	 * operations - we freed the inode and hence reallocation is required
> +	 * and that will serialise on reallocating the space the deferops need
> +	 * to free. Hence we can unlock the inode on the first commit of
> +	 * the transaction rather than roll it right through the deferops. This
> +	 * avoids relogging the XFS_ISTALE inode.
> +	 *
> +	 * We check that xfs_ifree() hasn't grown an internal transaction roll
> +	 * by asserting that the inode is still locked when it returns.
> +	 */
>  	xfs_ilock(ip, XFS_ILOCK_EXCL);
> -	xfs_trans_ijoin(tp, ip, 0);
> +	xfs_trans_ijoin(tp, ip, XFS_ILOCK_EXCL);
>
>  	error = xfs_ifree(tp, ip);
> +	ASSERT(xfs_isilocked(ip, XFS_ILOCK_EXCL));
>  	if (error) {
>  		/*
>  		 * If we fail to free the inode, shut down.  The cancel
> @@ -1756,7 +1777,6 @@ xfs_inactive_ifree(
>  		xfs_force_shutdown(mp, SHUTDOWN_META_IO_ERROR);
>  	}
>  	xfs_trans_cancel(tp);
> -	xfs_iunlock(ip, XFS_ILOCK_EXCL);
>  	return error;
>  }
>
> @@ -1774,7 +1794,6 @@ xfs_inactive_ifree(
>  	xfs_notice(mp, "%s: xfs_trans_commit returned error %d",
>  		__func__, error);
>
> -	xfs_iunlock(ip, XFS_ILOCK_EXCL);
>  	return 0;
>  }
>
> --
> 2.26.2.761.g0e0b3e54be
>
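
A side note for anyone following the xfs_inactive_ifree() hunk above: the
easy-to-miss part is the xfs_trans_ijoin() lock flags change. Passing
XFS_ILOCK_EXCL hands ILOCK ownership to the transaction, so the first
commit (or cancel) drops the lock and the inode is not held and relogged
across the later deferops rolls, which is what keeps the XFS_ISTALE inode
out of the CIL. A rough sketch of the two calling conventions, purely
illustrative (not the actual kernel code; error handling and the
xfs_ifree() call are omitted):

    /* Convention A: caller keeps ILOCK ownership across the transaction. */
    xfs_ilock(ip, XFS_ILOCK_EXCL);
    xfs_trans_ijoin(tp, ip, 0);               /* commit does not unlock ip */
    error = xfs_trans_commit(tp);
    xfs_iunlock(ip, XFS_ILOCK_EXCL);          /* caller unlocks explicitly */

    /* Convention B: transaction owns the ILOCK (the pattern used here). */
    xfs_ilock(ip, XFS_ILOCK_EXCL);
    xfs_trans_ijoin(tp, ip, XFS_ILOCK_EXCL);  /* commit/cancel unlocks ip */
    error = xfs_trans_commit(tp);
    /* No explicit xfs_iunlock() - the ILOCK is dropped at the first
     * commit, so the inode is not relogged through the deferops rolls. */

With convention B the new ASSERT(xfs_isilocked(ip, XFS_ILOCK_EXCL)) after
xfs_ifree() also documents that xfs_ifree() must not grow an internal
transaction roll, since the lock would already be gone by the time it
returned.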