All of lore.kernel.org
 help / color / mirror / Atom feed
From: Dave Chinner <david@fromorbit.com>
To: Brian Foster <bfoster@redhat.com>
Cc: kaixuxia <xiakaixu1987@gmail.com>,
	linux-xfs@vger.kernel.org, darrick.wong@oracle.com,
	newtongao@tencent.com, jasperwang@tencent.com
Subject: Re: [PATCH] xfs: Fix agi&agf ABBA deadlock when performing rename with RENAME_WHITEOUT flag
Date: Fri, 16 Aug 2019 09:10:01 +1000	[thread overview]
Message-ID: <20190815231001.GT6129@dread.disaster.area> (raw)
In-Reply-To: <20190813133614.GD37069@bfoster>

On Tue, Aug 13, 2019 at 09:36:14AM -0400, Brian Foster wrote:
> On Tue, Aug 13, 2019 at 07:17:33PM +0800, kaixuxia wrote:
> > When performing rename operation with RENAME_WHITEOUT flag, we will
> > hold AGF lock to allocate or free extents in manipulating the dirents
> > firstly, and then doing the xfs_iunlink_remove() call last to hold
> > AGI lock to modify the tmpfile info, so we the lock order AGI->AGF.
> > 
> 
> IIUC, the whiteout use case is that we're renaming a file, but the
> source dentry must be replaced with a magic whiteout inode rather than
> be removed. Therefore, xfs_rename() allocates the whiteout inode as a
> tmpfile first in a separate transaction, updates the target dentry with
> the source inode, replaces the source dentry to point to the whiteout
> inode and finally removes the whiteout inode from the unlinked list
> (since it is a tmpfile). This leads to the problem described below
> because the rename transaction ends up doing directory block allocs
> (locking the AGF) followed by the unlinked list remove (locking the
> AGI).
> 
> My understanding from reading the code is that this is primarly to
> cleanly handle error scenarios. If anything fails after we've allocated
> the whiteout tmpfile, it's simply left on the unlinked list and so the
> filesystem remains in a consistent/recoverable state. Given that, the
> solution here seems like overkill to me. For one, I thought background
> unlinked list removal was already on our roadmap (Darrick might have
> been looking at that and may already have a prototype as well). Also,
> unlinked list removal occurs at log recovery time already. That's
> somewhat of an existing purpose of the list, which makes a deferred
> unlinked list removal operation superfluous in more traditional cases
> where unlinked list removal doesn't require consistency with a directory
> operation.
> 
> Functional discussion aside.. from a complexity standpoint I'm wondering
> if we could do something much more simple like acquire the AGI lock for
> a whiteout inode earlier in xfs_rename(). For example, suppose we did
> something like:
> 
> 	/*
> 	 * Acquire the whiteout agi to preserve locking order in anticipation of
> 	 * unlinked list removal.
> 	 */
> 	if (wip)
> 		xfs_read_agi(mp, tp, XFS_INO_TO_AGNO(mp, wip->i_ino), &agibp);
> 
> ... after we allocate the transaction but before we do any directory ops
> that can result in block allocations. Would that prevent the problem
> you've observed?

I'd prefer that we just do things in an order that doesn't invert
the locking. For a whiteout, we only allocate blocks when modifying
the target directory, and we do a check to see if that will succeed
before actually doing the directory modification. That means the
directory modification will only fail due to an IO error or
corruption, both of which have a high probability of causing the
filesystem to be shut down. Any error after the directory mod will
cause a shutdown because the transaction is dirty.

Further, the operation that will lock the AGF is the target
directory modification if blocks need to be allocated, and the whole
point of the "check before execution" is to abort if ENOSPC would
occur as a result of trying to allocate blocks and we don't have a
space reservation for for those blocks because we are very, very
close to ENOSPC already.

If we fail the xfs_iunlink_remove() operation, we're shutting down
the filesystem. If we fail the xfs_dir_createname(target) call, we
are most likely going to be shutting down the filesystem. So the
premise that locating xfs_iunlink_remove() at the end to make error
handling easy is not really true - transaction cancel will clean
both of them up and shut the filesystem down.

Hence I think the right thing to do is to move the
xfs_iunlink_remove() call to between xfs_dir_canenter() and
xfs_dir_createname(). This means ENOSPC will abort with a clean
transaction and all is good, otherwise a failure is most likely
going to shut down the filesystem and it doesn't matter if we do
xfs_iunlink_remove() or xfs_dir_createname() first.

And by doing xfs_iunlink_remove() first, we remove the AGI/AGF
lock inversion problem....

I think this holds together, but I might have missed something in
the tangle of different rename operation cases. So it's worth
checking, but it looks to me like a better solution than having
a bare AGI lock in the middle of the function to work around error
handling logic we didn't clearly enough about at the time (hindsight
and all that jazz)....

Thoughts?

-Dave.
-- 
Dave Chinner
david@fromorbit.com

  parent reply	other threads:[~2019-08-15 23:11 UTC|newest]

Thread overview: 17+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2019-08-13 11:17 [PATCH] xfs: Fix agi&agf ABBA deadlock when performing rename with RENAME_WHITEOUT flag kaixuxia
2019-08-13 13:36 ` Brian Foster
2019-08-13 14:20   ` Darrick J. Wong
2019-08-13 14:57     ` Brian Foster
2019-08-14  2:32     ` kaixuxia
2019-08-15 23:10   ` Dave Chinner [this message]
2019-08-16 14:30     ` Brian Foster
2019-08-15 23:36 ` Dave Chinner
2019-08-16  8:09   ` kaixuxia
2019-08-16 14:53     ` Brian Foster
2019-08-17  1:40       ` Dave Chinner
2019-08-17 13:20         ` Brian Foster
2019-08-19 10:20           ` Dave Chinner
2019-08-19 14:28             ` Brian Foster
2019-08-20  1:04               ` Dave Chinner
2019-08-20 14:04                 ` Brian Foster
2019-08-19  7:49       ` kaixuxia

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20190815231001.GT6129@dread.disaster.area \
    --to=david@fromorbit.com \
    --cc=bfoster@redhat.com \
    --cc=darrick.wong@oracle.com \
    --cc=jasperwang@tencent.com \
    --cc=linux-xfs@vger.kernel.org \
    --cc=newtongao@tencent.com \
    --cc=xiakaixu1987@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.