Archive-only list for patches
 help / color / mirror / Atom feed
From: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
To: stable@vger.kernel.org
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>,
	patches@lists.linux.dev, Dave Chinner <dchinner@redhat.com>,
	Luis Chamberlain <mcgrof@kernel.org>,
	Christoph Hellwig <hch@lst.de>,
	"Darrick J. Wong" <djwong@kernel.org>,
	Chandan Babu R <chandanbabu@kernel.org>,
	Leah Rumancik <leah.rumancik@gmail.com>
Subject: [PATCH 6.1 51/73] xfs: fix unlink vs cluster buffer instantiation race
Date: Fri, 27 Sep 2024 14:24:02 +0200	[thread overview]
Message-ID: <20240927121721.977658809@linuxfoundation.org> (raw)
In-Reply-To: <20240927121719.897851549@linuxfoundation.org>

6.1-stable review patch.  If anyone has any objections, please let me know.

------------------

From: Dave Chinner <dchinner@redhat.com>

[ Upstream commit 348a1983cf4cf5099fc398438a968443af4c9f65 ]

Luis has been reporting an assert failure when freeing an inode
cluster during inode inactivation for a while. The assert looks
like:

 XFS: Assertion failed: bp->b_flags & XBF_DONE, file: fs/xfs/xfs_trans_buf.c, line: 241
 ------------[ cut here ]------------
 kernel BUG at fs/xfs/xfs_message.c:102!
 Oops: invalid opcode: 0000 [#1] PREEMPT SMP KASAN NOPTI
 CPU: 4 PID: 73 Comm: kworker/4:1 Not tainted 6.10.0-rc1 #4
 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
 Workqueue: xfs-inodegc/loop5 xfs_inodegc_worker [xfs]
 RIP: 0010:assfail (fs/xfs/xfs_message.c:102) xfs
 RSP: 0018:ffff88810188f7f0 EFLAGS: 00010202
 RAX: 0000000000000000 RBX: ffff88816e748250 RCX: 1ffffffff844b0e7
 RDX: 0000000000000004 RSI: ffff88810188f558 RDI: ffffffffc2431fa0
 RBP: 1ffff11020311f01 R08: 0000000042431f9f R09: ffffed1020311e9b
 R10: ffff88810188f4df R11: ffffffffac725d70 R12: ffff88817a3f4000
 R13: ffff88812182f000 R14: ffff88810188f998 R15: ffffffffc2423f80
 FS:  0000000000000000(0000) GS:ffff8881c8400000(0000) knlGS:0000000000000000
 CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 CR2: 000055fe9d0f109c CR3: 000000014426c002 CR4: 0000000000770ef0
 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
 DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: 0000000000000400
 PKRU: 55555554
 Call Trace:
  <TASK>
 xfs_trans_read_buf_map (fs/xfs/xfs_trans_buf.c:241 (discriminator 1)) xfs
 xfs_imap_to_bp (fs/xfs/xfs_trans.h:210 fs/xfs/libxfs/xfs_inode_buf.c:138) xfs
 xfs_inode_item_precommit (fs/xfs/xfs_inode_item.c:145) xfs
 xfs_trans_run_precommits (fs/xfs/xfs_trans.c:931) xfs
 __xfs_trans_commit (fs/xfs/xfs_trans.c:966) xfs
 xfs_inactive_ifree (fs/xfs/xfs_inode.c:1811) xfs
 xfs_inactive (fs/xfs/xfs_inode.c:2013) xfs
 xfs_inodegc_worker (fs/xfs/xfs_icache.c:1841 fs/xfs/xfs_icache.c:1886) xfs
 process_one_work (kernel/workqueue.c:3231)
 worker_thread (kernel/workqueue.c:3306 (discriminator 2) kernel/workqueue.c:3393 (discriminator 2))
 kthread (kernel/kthread.c:389)
 ret_from_fork (arch/x86/kernel/process.c:147)
 ret_from_fork_asm (arch/x86/entry/entry_64.S:257)
  </TASK>

And occurs when the the inode precommit handlers is attempt to look
up the inode cluster buffer to attach the inode for writeback.

The trail of logic that I can reconstruct is as follows.

	1. the inode is clean when inodegc runs, so it is not
	   attached to a cluster buffer when precommit runs.

	2. #1 implies the inode cluster buffer may be clean and not
	   pinned by dirty inodes when inodegc runs.

	3. #2 implies that the inode cluster buffer can be reclaimed
	   by memory pressure at any time.

	4. The assert failure implies that the cluster buffer was
	   attached to the transaction, but not marked done. It had
	   been accessed earlier in the transaction, but not marked
	   done.

	5. #4 implies the cluster buffer has been invalidated (i.e.
	   marked stale).

	6. #5 implies that the inode cluster buffer was instantiated
	   uninitialised in the transaction in xfs_ifree_cluster(),
	   which only instantiates the buffers to invalidate them
	   and never marks them as done.

Given factors 1-3, this issue is highly dependent on timing and
environmental factors. Hence the issue can be very difficult to
reproduce in some situations, but highly reliable in others. Luis
has an environment where it can be reproduced easily by g/531 but,
OTOH, I've reproduced it only once in ~2000 cycles of g/531.

I think the fix is to have xfs_ifree_cluster() set the XBF_DONE flag
on the cluster buffers, even though they may not be initialised. The
reasons why I think this is safe are:

	1. A buffer cache lookup hit on a XBF_STALE buffer will
	   clear the XBF_DONE flag. Hence all future users of the
	   buffer know they have to re-initialise the contents
	   before use and mark it done themselves.

	2. xfs_trans_binval() sets the XFS_BLI_STALE flag, which
	   means the buffer remains locked until the journal commit
	   completes and the buffer is unpinned. Hence once marked
	   XBF_STALE/XFS_BLI_STALE by xfs_ifree_cluster(), the only
	   context that can access the freed buffer is the currently
	   running transaction.

	3. #2 implies that future buffer lookups in the currently
	   running transaction will hit the transaction match code
	   and not the buffer cache. Hence XBF_STALE and
	   XFS_BLI_STALE will not be cleared unless the transaction
	   initialises and logs the buffer with valid contents
	   again. At which point, the buffer will be marked marked
	   XBF_DONE again, so having XBF_DONE already set on the
	   stale buffer is a moot point.

	4. #2 also implies that any concurrent access to that
	   cluster buffer will block waiting on the buffer lock
	   until the inode cluster has been fully freed and is no
	   longer an active inode cluster buffer.

	5. #4 + #1 means that any future user of the disk range of
	   that buffer will always see the range of disk blocks
	   covered by the cluster buffer as not done, and hence must
	   initialise the contents themselves.

	6. Setting XBF_DONE in xfs_ifree_cluster() then means the
	   unlinked inode precommit code will see a XBF_DONE buffer
	   from the transaction match as it expects. It can then
	   attach the stale but newly dirtied inode to the stale
	   but newly dirtied cluster buffer without unexpected
	   failures. The stale buffer will then sail through the
	   journal and do the right thing with the attached stale
	   inode during unpin.

Hence the fix is just one line of extra code. The explanation of
why we have to set XBF_DONE in xfs_ifree_cluster, OTOH, is long and
complex....

Fixes: 82842fee6e59 ("xfs: fix AGF vs inode cluster buffer deadlock")
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Tested-by: Luis Chamberlain <mcgrof@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
Signed-off-by: Leah Rumancik <leah.rumancik@gmail.com>
Acked-by: Chandan Babu R <chandanbabu@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
---
 fs/xfs/xfs_inode.c |   23 +++++++++++++++++++----
 1 file changed, 19 insertions(+), 4 deletions(-)

--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -2297,11 +2297,26 @@ xfs_ifree_cluster(
 		 * This buffer may not have been correctly initialised as we
 		 * didn't read it from disk. That's not important because we are
 		 * only using to mark the buffer as stale in the log, and to
-		 * attach stale cached inodes on it. That means it will never be
-		 * dispatched for IO. If it is, we want to know about it, and we
-		 * want it to fail. We can acheive this by adding a write
-		 * verifier to the buffer.
+		 * attach stale cached inodes on it.
+		 *
+		 * For the inode that triggered the cluster freeing, this
+		 * attachment may occur in xfs_inode_item_precommit() after we
+		 * have marked this buffer stale.  If this buffer was not in
+		 * memory before xfs_ifree_cluster() started, it will not be
+		 * marked XBF_DONE and this will cause problems later in
+		 * xfs_inode_item_precommit() when we trip over a (stale, !done)
+		 * buffer to attached to the transaction.
+		 *
+		 * Hence we have to mark the buffer as XFS_DONE here. This is
+		 * safe because we are also marking the buffer as XBF_STALE and
+		 * XFS_BLI_STALE. That means it will never be dispatched for
+		 * IO and it won't be unlocked until the cluster freeing has
+		 * been committed to the journal and the buffer unpinned. If it
+		 * is written, we want to know about it, and we want it to
+		 * fail. We can acheive this by adding a write verifier to the
+		 * buffer.
 		 */
+		bp->b_flags |= XBF_DONE;
 		bp->b_ops = &xfs_inode_buf_ops;
 
 		/*



  parent reply	other threads:[~2024-09-27 12:32 UTC|newest]

Thread overview: 83+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-09-27 12:23 [PATCH 6.1 00/73] 6.1.112-rc1 review Greg Kroah-Hartman
2024-09-27 12:23 ` [PATCH 6.1 01/73] ASoC: SOF: mediatek: Add missing board compatible Greg Kroah-Hartman
2024-09-27 12:23 ` [PATCH 6.1 02/73] ASoC: allow module autoloading for table db1200_pids Greg Kroah-Hartman
2024-09-27 12:23 ` [PATCH 6.1 03/73] ASoC: allow module autoloading for table board_ids Greg Kroah-Hartman
2024-09-27 12:23 ` [PATCH 6.1 04/73] ALSA: hda/realtek - Fixed ALC256 headphone no sound Greg Kroah-Hartman
2024-09-27 12:23 ` [PATCH 6.1 05/73] ALSA: hda/realtek - FIxed ALC285 " Greg Kroah-Hartman
2024-09-27 12:23 ` [PATCH 6.1 06/73] scsi: lpfc: Fix overflow build issue Greg Kroah-Hartman
2024-09-27 12:23 ` [PATCH 6.1 07/73] pinctrl: at91: make it work with current gpiolib Greg Kroah-Hartman
2024-09-27 12:23 ` [PATCH 6.1 08/73] hwmon: (asus-ec-sensors) remove VRM temp X570-E GAMING Greg Kroah-Hartman
2024-09-27 12:23 ` [PATCH 6.1 09/73] microblaze: dont treat zero reserved memory regions as error Greg Kroah-Hartman
2024-09-27 12:23 ` [PATCH 6.1 10/73] net: ftgmac100: Ensure tx descriptor updates are visible Greg Kroah-Hartman
2024-09-27 12:23 ` [PATCH 6.1 11/73] LoongArch: Define ARCH_IRQ_INIT_FLAGS as IRQ_NOPROBE Greg Kroah-Hartman
2024-09-27 12:23 ` [PATCH 6.1 12/73] wifi: iwlwifi: lower message level for FW buffer destination Greg Kroah-Hartman
2024-09-27 12:23 ` [PATCH 6.1 13/73] wifi: iwlwifi: mvm: fix iwl_mvm_scan_fits() calculation Greg Kroah-Hartman
2024-09-27 12:23 ` [PATCH 6.1 14/73] wifi: iwlwifi: mvm: pause TCM when the firmware is stopped Greg Kroah-Hartman
2024-09-27 12:23 ` [PATCH 6.1 15/73] wifi: iwlwifi: mvm: dont wait for tx queues if firmware is dead Greg Kroah-Hartman
2024-09-27 12:23 ` [PATCH 6.1 16/73] wifi: mac80211: free skb on error path in ieee80211_beacon_get_ap() Greg Kroah-Hartman
2024-09-27 12:23 ` [PATCH 6.1 17/73] wifi: iwlwifi: clear trans->state earlier upon error Greg Kroah-Hartman
2024-09-27 12:23 ` [PATCH 6.1 18/73] can: mcp251xfd: mcp251xfd_ring_init(): check TX-coalescing configuration Greg Kroah-Hartman
2024-09-27 12:23 ` [PATCH 6.1 19/73] ASoC: Intel: soc-acpi-cht: Make Lenovo Yoga Tab 3 X90F DMI match less strict Greg Kroah-Hartman
2024-09-27 12:23 ` [PATCH 6.1 20/73] ASoC: intel: fix module autoloading Greg Kroah-Hartman
2024-09-27 12:23 ` [PATCH 6.1 21/73] ASoC: tda7419: " Greg Kroah-Hartman
2024-09-27 12:23 ` [PATCH 6.1 22/73] spi: spidev: Add an entry for elgin,jg10309-01 Greg Kroah-Hartman
2024-09-27 12:23 ` [PATCH 6.1 23/73] drm: komeda: Fix an issue related to normalized zpos Greg Kroah-Hartman
2024-09-27 12:23 ` [PATCH 6.1 24/73] spi: bcm63xx: Enable module autoloading Greg Kroah-Hartman
2024-09-27 12:23 ` [PATCH 6.1 25/73] smb: client: fix hang in wait_for_response() for negproto Greg Kroah-Hartman
2024-09-27 12:23 ` [PATCH 6.1 26/73] x86/hyperv: Set X86_FEATURE_TSC_KNOWN_FREQ when Hyper-V provides frequency Greg Kroah-Hartman
2024-09-27 12:23 ` [PATCH 6.1 27/73] tools: hv: rm .*.cmd when make clean Greg Kroah-Hartman
2024-09-27 12:23 ` [PATCH 6.1 28/73] block: Fix where bio IO priority gets set Greg Kroah-Hartman
2024-09-27 12:23 ` [PATCH 6.1 29/73] spi: spidev: Add missing spi_device_id for jg10309-01 Greg Kroah-Hartman
2024-09-27 12:23 ` [PATCH 6.1 30/73] ocfs2: add bounds checking to ocfs2_xattr_find_entry() Greg Kroah-Hartman
2024-09-27 12:23 ` [PATCH 6.1 31/73] ocfs2: strict bound check before memcmp in ocfs2_xattr_find_entry() Greg Kroah-Hartman
2024-09-27 12:23 ` [PATCH 6.1 32/73] xfs: dquot shrinker doesnt check for XFS_DQFLAG_FREEING Greg Kroah-Hartman
2024-09-27 12:23 ` [PATCH 6.1 33/73] xfs: Fix deadlock on xfs_inodegc_worker Greg Kroah-Hartman
2024-09-27 12:23 ` [PATCH 6.1 34/73] xfs: fix extent busy updating Greg Kroah-Hartman
2024-09-27 12:23 ` [PATCH 6.1 35/73] xfs: dont use BMBT btree split workers for IO completion Greg Kroah-Hartman
2024-09-27 12:23 ` [PATCH 6.1 36/73] xfs: fix low space alloc deadlock Greg Kroah-Hartman
2024-09-27 12:23 ` [PATCH 6.1 37/73] xfs: prefer free inodes at ENOSPC over chunk allocation Greg Kroah-Hartman
2024-09-27 12:23 ` [PATCH 6.1 38/73] xfs: block reservation too large for minleft allocation Greg Kroah-Hartman
2024-09-27 12:23 ` [PATCH 6.1 39/73] xfs: fix uninitialized variable access Greg Kroah-Hartman
2024-09-27 12:23 ` [PATCH 6.1 40/73] xfs: quotacheck failure can race with background inode inactivation Greg Kroah-Hartman
2024-09-27 12:23 ` [PATCH 6.1 41/73] xfs: fix BUG_ON in xfs_getbmap() Greg Kroah-Hartman
2024-09-27 12:23 ` [PATCH 6.1 42/73] xfs: buffer pins need to hold a buffer reference Greg Kroah-Hartman
2024-09-27 12:23 ` [PATCH 6.1 43/73] xfs: defered work could create precommits Greg Kroah-Hartman
2024-09-27 12:23 ` [PATCH 6.1 44/73] xfs: fix AGF vs inode cluster buffer deadlock Greg Kroah-Hartman
2024-09-27 12:23 ` [PATCH 6.1 45/73] xfs: collect errors from inodegc for unlinked inode recovery Greg Kroah-Hartman
2024-09-27 12:23 ` [PATCH 6.1 46/73] xfs: fix ag count overflow during growfs Greg Kroah-Hartman
2024-09-27 12:23 ` [PATCH 6.1 47/73] xfs: remove WARN when dquot cache insertion fails Greg Kroah-Hartman
2024-09-27 12:23 ` [PATCH 6.1 48/73] xfs: fix the calculation for "end" and "length" Greg Kroah-Hartman
2024-09-27 12:24 ` [PATCH 6.1 49/73] xfs: load uncached unlinked inodes into memory on demand Greg Kroah-Hartman
2024-09-27 12:24 ` [PATCH 6.1 50/73] xfs: fix negative array access in xfs_getbmap Greg Kroah-Hartman
2024-09-27 12:24 ` Greg Kroah-Hartman [this message]
2024-09-27 12:24 ` [PATCH 6.1 52/73] xfs: correct calculation for agend and blockcount Greg Kroah-Hartman
2024-09-27 12:24 ` [PATCH 6.1 53/73] xfs: use i_prev_unlinked to distinguish inodes that are not on the unlinked list Greg Kroah-Hartman
2024-09-27 12:24 ` [PATCH 6.1 54/73] xfs: reload entire unlinked bucket lists Greg Kroah-Hartman
2024-09-27 12:24 ` [PATCH 6.1 55/73] xfs: make inode unlinked bucket recovery work with quotacheck Greg Kroah-Hartman
2024-09-27 12:24 ` [PATCH 6.1 56/73] xfs: fix reloading entire unlinked bucket lists Greg Kroah-Hartman
2024-09-27 12:24 ` [PATCH 6.1 57/73] xfs: set bnobt/cntbt numrecs correctly when formatting new AGs Greg Kroah-Hartman
2024-09-27 12:24 ` [PATCH 6.1 58/73] xfs: journal geometry is not properly bounds checked Greg Kroah-Hartman
2024-09-27 12:24 ` [PATCH 6.1 59/73] netfilter: nft_socket: make cgroupsv2 matching work with namespaces Greg Kroah-Hartman
2024-09-27 12:24 ` [PATCH 6.1 60/73] netfilter: nft_socket: Fix a NULL vs IS_ERR() bug in nft_socket_cgroup_subtree_level() Greg Kroah-Hartman
2024-09-27 12:24 ` [PATCH 6.1 61/73] netfilter: nft_set_pipapo: walk over current view on netlink dump Greg Kroah-Hartman
2024-09-27 12:24 ` [PATCH 6.1 62/73] netfilter: nf_tables: missing iterator type in lookup walk Greg Kroah-Hartman
2024-09-27 12:24 ` [PATCH 6.1 63/73] Revert "wifi: cfg80211: check wiphy mutex is held for wdev mutex" Greg Kroah-Hartman
2024-09-27 12:24 ` [PATCH 6.1 64/73] gpiolib: cdev: Ignore reconfiguration without direction Greg Kroah-Hartman
2024-09-27 12:24 ` [PATCH 6.1 65/73] gpio: prevent potential speculation leaks in gpio_device_get_desc() Greg Kroah-Hartman
2024-09-27 12:24 ` [PATCH 6.1 66/73] can: mcp251xfd: properly indent labels Greg Kroah-Hartman
2024-09-27 12:24 ` [PATCH 6.1 67/73] can: mcp251xfd: move mcp251xfd_timestamp_start()/stop() into mcp251xfd_chip_start/stop() Greg Kroah-Hartman
2024-09-27 12:24 ` [PATCH 6.1 68/73] selftests: mptcp: join: restrict fullmesh endp on 1st sf Greg Kroah-Hartman
2024-09-27 12:24 ` [PATCH 6.1 69/73] btrfs: calculate the right space for delayed refs when updating global reserve Greg Kroah-Hartman
2024-09-27 12:24 ` [PATCH 6.1 70/73] powercap: RAPL: fix invalid initialization for pl4_supported field Greg Kroah-Hartman
2024-09-27 12:24 ` [PATCH 6.1 71/73] x86/mm: Switch to new Intel CPU model defines Greg Kroah-Hartman
2024-09-27 12:24 ` [PATCH 6.1 72/73] USB: serial: pl2303: add device id for Macrosilicon MS3020 Greg Kroah-Hartman
2024-09-27 12:24 ` [PATCH 6.1 73/73] USB: usbtmc: prevent kernel-usb-infoleak Greg Kroah-Hartman
2024-09-27 15:17 ` [PATCH 6.1 00/73] 6.1.112-rc1 review Peter Schneider
2024-09-27 15:52 ` Allen
2024-09-27 18:35 ` Jon Hunter
2024-09-27 18:40 ` Florian Fainelli
2024-09-28 12:39 ` Naresh Kamboju
2024-09-28 17:13 ` Shuah Khan
2024-09-29  8:43 ` Ron Economos
2024-09-29 11:19 ` Muhammad Usama Anjum
2024-09-30  8:41 ` Pavel Machek

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20240927121721.977658809@linuxfoundation.org \
    --to=gregkh@linuxfoundation.org \
    --cc=chandanbabu@kernel.org \
    --cc=dchinner@redhat.com \
    --cc=djwong@kernel.org \
    --cc=hch@lst.de \
    --cc=leah.rumancik@gmail.com \
    --cc=mcgrof@kernel.org \
    --cc=patches@lists.linux.dev \
    --cc=stable@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox