* [PATCH v2 0/4] xfs: close crash window in attr dabtree inactivation
@ 2026-03-12 8:57 Long Li
2026-03-12 8:57 ` [PATCH v2 1/4] xfs: only assert new size for datafork during truncate extents Long Li
` (3 more replies)
0 siblings, 4 replies; 15+ messages in thread
From: Long Li @ 2026-03-12 8:57 UTC (permalink / raw)
To: djwong, cem
Cc: linux-xfs, david, yi.zhang, houtao1, leo.lilong, yangerkun,
lonuxli.64
Problem
-------
xfs_attr3_node_inactive() cancels all child blocks in the attr dabtree
via xfs_trans_binval(), relying on a subsequent full attr bmap truncation
in xfs_attr_inactive() to ensure log recovery never reaches the root
node. The child block cancellations and the bmap truncation are
committed in separate transactions, creating a narrow but real crash
window.
If the system shuts down after the cancellations commit but before the
bmap truncation commits, the attr bmap survives recovery intact and
xlog_recover_process_iunlinks() retries inactivation. It attempts to
read the root node via the attr bmap: if the root node was not replayed,
reading the unreplayed root block maybe triggers a metadata verification
failure immediately; if it was replayed, following its child pointers to
unreplayed child blocks triggers the same failure.
Zhang Yi previously attempted to solve this problem [1], but Dave gave
different advice. Following the approach suggested by Dave Chinner in
review.
[1] https://patchwork.kernel.org/project/xfs/patch/20230613030434.2944173-3-yi.zhang@huaweicloud.com/
Fix
---
The fix addresses the atomicity gap at two levels:
(1) In xfs_attr3_node_inactive(), each child block invalidation and the
removal of its reference from the parent node are placed in the same
transaction, so no parent node ever holds a pointer to a cancelled
block at any crash point. After all children are removed, the empty
root node is converted to an empty leaf in the same transaction.
This conversion is necessary because xfs_da3_node_verify() rejects
a node block with count == 0, whereas xfs_attr3_leaf_verify()
explicitly allows it during recovery to accommodate the transient
state from the shortform-to-leaf promotion path.
(2) In xfs_attr_inactive(), the attr fork is truncated in two phases.
First the child extents are removed. Then the root block is
invalidated and the attr bmap is truncated to zero in a single
transaction. These two operations in the second phase must be
atomic: as long as the attr bmap has any non-zero length, recovery
can follow it to the root block, so the root block invalidation must
commit together with the bmap-to-zero truncation.
Changes
-------
v1->v2:
1) Rename xfs_da3_node_entry_remove to xfs_attr3_node_entry_remove.
2) Initializing the structure value when declaring a structure in a function.
3) An assertion is added to xfs_attr3_leaf_init to check whether tp is not null.
4) Fixed syntax style issues.
Long Li (4):
xfs: only assert new size for datafork during truncate extents
xfs: factor out xfs_attr3_node_entry_remove
xfs: factor out xfs_attr3_leaf_init
xfs: close crash window in attr dabtree inactivation
fs/xfs/libxfs/xfs_attr_leaf.c | 22 +++++++++
fs/xfs/libxfs/xfs_attr_leaf.h | 3 ++
fs/xfs/libxfs/xfs_da_btree.c | 54 +++++++++++++++-----
fs/xfs/libxfs/xfs_da_btree.h | 2 +
fs/xfs/xfs_attr_inactive.c | 93 ++++++++++++++++++++---------------
fs/xfs/xfs_inode.c | 3 +-
6 files changed, 125 insertions(+), 52 deletions(-)
--
2.39.2
^ permalink raw reply [flat|nested] 15+ messages in thread
* [PATCH v2 1/4] xfs: only assert new size for datafork during truncate extents
2026-03-12 8:57 [PATCH v2 0/4] xfs: close crash window in attr dabtree inactivation Long Li
@ 2026-03-12 8:57 ` Long Li
2026-03-12 8:57 ` [PATCH v2 2/4] xfs: factor out xfs_attr3_node_entry_remove Long Li
` (2 subsequent siblings)
3 siblings, 0 replies; 15+ messages in thread
From: Long Li @ 2026-03-12 8:57 UTC (permalink / raw)
To: djwong, cem
Cc: linux-xfs, david, yi.zhang, houtao1, leo.lilong, yangerkun,
lonuxli.64
The assertion functions properly because we currently only truncate the
attr to a zero size. Any other new size of the attr is not preempted.
Make this assertion is specific to the datafork, preparing for
subsequent patches to truncate the attribute to a non-zero size.
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Long Li <leo.lilong@huawei.com>
---
fs/xfs/xfs_inode.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index 50c0404f9064..beaa26ec62da 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -1048,7 +1048,8 @@ xfs_itruncate_extents_flags(
xfs_assert_ilocked(ip, XFS_ILOCK_EXCL);
if (icount_read(VFS_I(ip)))
xfs_assert_ilocked(ip, XFS_IOLOCK_EXCL);
- ASSERT(new_size <= XFS_ISIZE(ip));
+ if (whichfork == XFS_DATA_FORK)
+ ASSERT(new_size <= XFS_ISIZE(ip));
ASSERT(tp->t_flags & XFS_TRANS_PERM_LOG_RES);
ASSERT(ip->i_itemp != NULL);
ASSERT(ip->i_itemp->ili_lock_flags == 0);
--
2.39.2
^ permalink raw reply related [flat|nested] 15+ messages in thread
* [PATCH v2 2/4] xfs: factor out xfs_attr3_node_entry_remove
2026-03-12 8:57 [PATCH v2 0/4] xfs: close crash window in attr dabtree inactivation Long Li
2026-03-12 8:57 ` [PATCH v2 1/4] xfs: only assert new size for datafork during truncate extents Long Li
@ 2026-03-12 8:57 ` Long Li
2026-03-13 15:01 ` Darrick J. Wong
2026-03-12 8:57 ` [PATCH v2 3/4] xfs: factor out xfs_attr3_leaf_init Long Li
2026-03-12 8:58 ` [PATCH v2 4/4] xfs: close crash window in attr dabtree inactivation Long Li
3 siblings, 1 reply; 15+ messages in thread
From: Long Li @ 2026-03-12 8:57 UTC (permalink / raw)
To: djwong, cem
Cc: linux-xfs, david, yi.zhang, houtao1, leo.lilong, yangerkun,
lonuxli.64
Factor out wrapper xfs_attr3_node_entry_remove function, which
exported for external use.
Signed-off-by: Long Li <leo.lilong@huawei.com>
---
fs/xfs/libxfs/xfs_da_btree.c | 54 ++++++++++++++++++++++++++++--------
fs/xfs/libxfs/xfs_da_btree.h | 2 ++
2 files changed, 45 insertions(+), 11 deletions(-)
diff --git a/fs/xfs/libxfs/xfs_da_btree.c b/fs/xfs/libxfs/xfs_da_btree.c
index 766631f0562e..0b9fe3e4370d 100644
--- a/fs/xfs/libxfs/xfs_da_btree.c
+++ b/fs/xfs/libxfs/xfs_da_btree.c
@@ -1506,21 +1506,20 @@ xfs_da3_fixhashpath(
}
/*
- * Remove an entry from an intermediate node.
+ * Internal implementation to remove an entry from an intermediate node.
*/
STATIC void
-xfs_da3_node_remove(
- struct xfs_da_state *state,
- struct xfs_da_state_blk *drop_blk)
+__xfs_da3_node_remove(
+ struct xfs_trans *tp,
+ struct xfs_inode *dp,
+ struct xfs_da_geometry *geo,
+ struct xfs_da_state_blk *drop_blk)
{
struct xfs_da_intnode *node;
struct xfs_da3_icnode_hdr nodehdr;
struct xfs_da_node_entry *btree;
int index;
int tmp;
- struct xfs_inode *dp = state->args->dp;
-
- trace_xfs_da_node_remove(state->args);
node = drop_blk->bp->b_addr;
xfs_da3_node_hdr_from_disk(dp->i_mount, &nodehdr, node);
@@ -1536,17 +1535,17 @@ xfs_da3_node_remove(
tmp = nodehdr.count - index - 1;
tmp *= (uint)sizeof(xfs_da_node_entry_t);
memmove(&btree[index], &btree[index + 1], tmp);
- xfs_trans_log_buf(state->args->trans, drop_blk->bp,
+ xfs_trans_log_buf(tp, drop_blk->bp,
XFS_DA_LOGRANGE(node, &btree[index], tmp));
index = nodehdr.count - 1;
}
memset(&btree[index], 0, sizeof(xfs_da_node_entry_t));
- xfs_trans_log_buf(state->args->trans, drop_blk->bp,
+ xfs_trans_log_buf(tp, drop_blk->bp,
XFS_DA_LOGRANGE(node, &btree[index], sizeof(btree[index])));
nodehdr.count -= 1;
xfs_da3_node_hdr_to_disk(dp->i_mount, node, &nodehdr);
- xfs_trans_log_buf(state->args->trans, drop_blk->bp,
- XFS_DA_LOGRANGE(node, &node->hdr, state->args->geo->node_hdr_size));
+ xfs_trans_log_buf(tp, drop_blk->bp,
+ XFS_DA_LOGRANGE(node, &node->hdr, geo->node_hdr_size));
/*
* Copy the last hash value from the block to propagate upwards.
@@ -1554,6 +1553,39 @@ xfs_da3_node_remove(
drop_blk->hashval = be32_to_cpu(btree[index - 1].hashval);
}
+/*
+ * Remove an entry from an intermediate node.
+ */
+STATIC void
+xfs_da3_node_remove(
+ struct xfs_da_state *state,
+ struct xfs_da_state_blk *drop_blk)
+{
+ trace_xfs_da_node_remove(state->args);
+
+ __xfs_da3_node_remove(state->args->trans, state->args->dp,
+ state->args->geo, drop_blk);
+}
+
+/*
+ * Remove an entry from a node at the specified index, this is an exported
+ * wrapper for removing entries from intermediate nodes.
+ */
+void
+xfs_attr3_node_entry_remove(
+ struct xfs_trans *tp,
+ struct xfs_inode *dp,
+ struct xfs_buf *bp,
+ int index)
+{
+ struct xfs_da_state_blk blk = {
+ .index = index,
+ .bp = bp,
+ };
+
+ __xfs_da3_node_remove(tp, dp, dp->i_mount->m_attr_geo, &blk);
+}
+
/*
* Unbalance the elements between two intermediate nodes,
* move all Btree elements from one node into another.
diff --git a/fs/xfs/libxfs/xfs_da_btree.h b/fs/xfs/libxfs/xfs_da_btree.h
index 354d5d65043e..afcf2d3c7a21 100644
--- a/fs/xfs/libxfs/xfs_da_btree.h
+++ b/fs/xfs/libxfs/xfs_da_btree.h
@@ -184,6 +184,8 @@ int xfs_da3_split(xfs_da_state_t *state);
int xfs_da3_join(xfs_da_state_t *state);
void xfs_da3_fixhashpath(struct xfs_da_state *state,
struct xfs_da_state_path *path_to_to_fix);
+void xfs_attr3_node_entry_remove(struct xfs_trans *tp, struct xfs_inode *dp,
+ struct xfs_buf *bp, int index);
/*
* Routines used for finding things in the Btree.
--
2.39.2
^ permalink raw reply related [flat|nested] 15+ messages in thread
* [PATCH v2 3/4] xfs: factor out xfs_attr3_leaf_init
2026-03-12 8:57 [PATCH v2 0/4] xfs: close crash window in attr dabtree inactivation Long Li
2026-03-12 8:57 ` [PATCH v2 1/4] xfs: only assert new size for datafork during truncate extents Long Li
2026-03-12 8:57 ` [PATCH v2 2/4] xfs: factor out xfs_attr3_node_entry_remove Long Li
@ 2026-03-12 8:57 ` Long Li
2026-03-13 15:04 ` Darrick J. Wong
2026-03-12 8:58 ` [PATCH v2 4/4] xfs: close crash window in attr dabtree inactivation Long Li
3 siblings, 1 reply; 15+ messages in thread
From: Long Li @ 2026-03-12 8:57 UTC (permalink / raw)
To: djwong, cem
Cc: linux-xfs, david, yi.zhang, houtao1, leo.lilong, yangerkun,
lonuxli.64
Factor out wrapper xfs_attr3_leaf_init function, which exported for
external use.
Signed-off-by: Long Li <leo.lilong@huawei.com>
---
fs/xfs/libxfs/xfs_attr_leaf.c | 22 ++++++++++++++++++++++
fs/xfs/libxfs/xfs_attr_leaf.h | 3 +++
2 files changed, 25 insertions(+)
diff --git a/fs/xfs/libxfs/xfs_attr_leaf.c b/fs/xfs/libxfs/xfs_attr_leaf.c
index 47f48ae555c0..96a65141e812 100644
--- a/fs/xfs/libxfs/xfs_attr_leaf.c
+++ b/fs/xfs/libxfs/xfs_attr_leaf.c
@@ -1415,6 +1415,28 @@ xfs_attr3_leaf_create(
return 0;
}
+/*
+ * Create and initialize an empty attr leaf block at blkno, and attach the
+ * buffer to tp.
+ */
+int
+xfs_attr3_leaf_init(
+ struct xfs_trans *tp,
+ struct xfs_inode *dp,
+ xfs_dablk_t blkno)
+{
+ struct xfs_buf *bp = NULL;
+ struct xfs_da_args args = {
+ .trans = tp,
+ .dp = dp,
+ .owner = dp->i_ino,
+ .geo = dp->i_mount->m_attr_geo,
+ };
+
+ ASSERT(tp != NULL);
+
+ return xfs_attr3_leaf_create(&args, blkno, &bp);
+}
/*
* Split the leaf node, rebalance, then add the new entry.
*
diff --git a/fs/xfs/libxfs/xfs_attr_leaf.h b/fs/xfs/libxfs/xfs_attr_leaf.h
index aca46da2bc50..72639efe6ac3 100644
--- a/fs/xfs/libxfs/xfs_attr_leaf.h
+++ b/fs/xfs/libxfs/xfs_attr_leaf.h
@@ -87,6 +87,9 @@ int xfs_attr3_leaf_list_int(struct xfs_buf *bp,
/*
* Routines used for shrinking the Btree.
*/
+
+int xfs_attr3_leaf_init(struct xfs_trans *tp, struct xfs_inode *dp,
+ xfs_dablk_t blkno);
int xfs_attr3_leaf_toosmall(struct xfs_da_state *state, int *retval);
void xfs_attr3_leaf_unbalance(struct xfs_da_state *state,
struct xfs_da_state_blk *drop_blk,
--
2.39.2
^ permalink raw reply related [flat|nested] 15+ messages in thread
* [PATCH v2 4/4] xfs: close crash window in attr dabtree inactivation
2026-03-12 8:57 [PATCH v2 0/4] xfs: close crash window in attr dabtree inactivation Long Li
` (2 preceding siblings ...)
2026-03-12 8:57 ` [PATCH v2 3/4] xfs: factor out xfs_attr3_leaf_init Long Li
@ 2026-03-12 8:58 ` Long Li
2026-03-14 0:18 ` Darrick J. Wong
3 siblings, 1 reply; 15+ messages in thread
From: Long Li @ 2026-03-12 8:58 UTC (permalink / raw)
To: djwong, cem
Cc: linux-xfs, david, yi.zhang, houtao1, leo.lilong, yangerkun,
lonuxli.64
When inactivating an inode with node-format extended attributes,
xfs_attr3_node_inactive() invalidates all child leaf/node blocks via
xfs_trans_binval(), but intentionally does not remove the corresponding
entries from their parent node blocks. The implicit assumption is that
xfs_attr_inactive() will truncate the entire attr fork to zero extents
afterwards, so log recovery will never reach the root node and follow
those stale pointers.
However, if a log shutdown occurs after the leaf/node block cancellations
commit but before the attr bmap truncation commits, this assumption
breaks. Recovery replays the attr bmap intact (the inode still has
attr fork extents), but suppresses replay of all cancelled leaf/node
blocks, maybe leaving them as stale data on disk. On the next mount,
xlog_recover_process_iunlinks() retries inactivation and attempts to
read the root node via the attr bmap. If the root node was not replayed,
reading the unreplayed root block triggers a metadata verification
failure immediately; if it was replayed, following its child pointers
to unreplayed child blocks triggers the same failure:
XFS (pmem0): Metadata corruption detected at
xfs_da3_node_read_verify+0x53/0x220, xfs_da3_node block 0x78
XFS (pmem0): Unmount and run xfs_repair
XFS (pmem0): First 128 bytes of corrupted metadata buffer:
00000000: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
00000010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
00000020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
00000030: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
00000040: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
00000050: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
00000060: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
00000070: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
XFS (pmem0): metadata I/O error in "xfs_da_read_buf+0x104/0x190" at daddr 0x78 len 8 error 117
Fix this in two places:
In xfs_attr3_node_inactive(), after calling xfs_trans_binval() on a
child block, immediately remove the entry that references it from the
parent node in the same transaction. This eliminates the window where
the parent holds a pointer to a cancelled block. Once all children are
removed, the now-empty root node is converted to a leaf block within the
same transaction. This node-to-leaf conversion is necessary for crash
safety. If the system shutdown after the empty node is written to the
log but before the second-phase bmap truncation commits, log recovery
will attempt to verify the root block on disk. xfs_da3_node_verify()
does not permit a node block with count == 0; such a block will fail
verification and trigger a metadata corruption shutdown. on the other
hand, leaf blocks are allowed to have this transient state.
In xfs_attr_inactive(), split the attr fork truncation into two explicit
phases. First, truncate all extents beyond the root block (the child
extents whose parent references have already been removed above).
Second, invalidate the root block and truncate the attr bmap to zero in
a single transaction. The two operations in the second phase must be
atomic: as long as the attr bmap has any non-zero length, recovery can
follow it to the root block, so the root block invalidation must commit
together with the bmap-to-zero truncation.
Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Cc: <stable@vger.kernel.org>
Signed-off-by: Long Li <leo.lilong@huawei.com>
---
fs/xfs/xfs_attr_inactive.c | 93 ++++++++++++++++++++++----------------
1 file changed, 53 insertions(+), 40 deletions(-)
diff --git a/fs/xfs/xfs_attr_inactive.c b/fs/xfs/xfs_attr_inactive.c
index 92331991f9fd..235f19a68fa2 100644
--- a/fs/xfs/xfs_attr_inactive.c
+++ b/fs/xfs/xfs_attr_inactive.c
@@ -140,7 +140,7 @@ xfs_attr3_node_inactive(
xfs_daddr_t parent_blkno, child_blkno;
struct xfs_buf *child_bp;
struct xfs_da3_icnode_hdr ichdr;
- int error, i;
+ int error;
/*
* Since this code is recursive (gasp!) we must protect ourselves.
@@ -152,7 +152,7 @@ xfs_attr3_node_inactive(
return -EFSCORRUPTED;
}
- xfs_da3_node_hdr_from_disk(dp->i_mount, &ichdr, bp->b_addr);
+ xfs_da3_node_hdr_from_disk(mp, &ichdr, bp->b_addr);
parent_blkno = xfs_buf_daddr(bp);
if (!ichdr.count) {
xfs_trans_brelse(*trans, bp);
@@ -167,7 +167,7 @@ xfs_attr3_node_inactive(
* over the leaves removing all of them. If this is higher up
* in the tree, recurse downward.
*/
- for (i = 0; i < ichdr.count; i++) {
+ while (ichdr.count > 0) {
/*
* Read the subsidiary block to see what we have to work with.
* Don't do this in a transaction. This is a depth-first
@@ -218,29 +218,32 @@ xfs_attr3_node_inactive(
xfs_trans_binval(*trans, child_bp);
child_bp = NULL;
- /*
- * If we're not done, re-read the parent to get the next
- * child block number.
- */
- if (i + 1 < ichdr.count) {
- struct xfs_da3_icnode_hdr phdr;
+ error = xfs_da3_node_read_mapped(*trans, dp,
+ parent_blkno, &bp, XFS_ATTR_FORK);
+ if (error)
+ return error;
- error = xfs_da3_node_read_mapped(*trans, dp,
- parent_blkno, &bp, XFS_ATTR_FORK);
+ /*
+ * Remove entry from parent node, prevents being indexed to.
+ */
+ xfs_attr3_node_entry_remove(*trans, dp, bp, 0);
+
+ xfs_da3_node_hdr_from_disk(mp, &ichdr, bp->b_addr);
+ bp = NULL;
+
+ if (ichdr.count > 0) {
+ /*
+ * If we're not done, get the next child block number.
+ */
+ child_fsb = be32_to_cpu(ichdr.btree[0].before);
+
+ /*
+ * Atomically commit the whole invalidate stuff.
+ */
+ error = xfs_trans_roll_inode(trans, dp);
if (error)
return error;
- xfs_da3_node_hdr_from_disk(dp->i_mount, &phdr,
- bp->b_addr);
- child_fsb = be32_to_cpu(phdr.btree[i + 1].before);
- xfs_trans_brelse(*trans, bp);
- bp = NULL;
}
- /*
- * Atomically commit the whole invalidate stuff.
- */
- error = xfs_trans_roll_inode(trans, dp);
- if (error)
- return error;
}
return 0;
@@ -257,10 +260,8 @@ xfs_attr3_root_inactive(
struct xfs_trans **trans,
struct xfs_inode *dp)
{
- struct xfs_mount *mp = dp->i_mount;
struct xfs_da_blkinfo *info;
struct xfs_buf *bp;
- xfs_daddr_t blkno;
int error;
/*
@@ -272,7 +273,6 @@ xfs_attr3_root_inactive(
error = xfs_da3_node_read(*trans, dp, 0, &bp, XFS_ATTR_FORK);
if (error)
return error;
- blkno = xfs_buf_daddr(bp);
/*
* Invalidate the tree, even if the "tree" is only a single leaf block.
@@ -283,6 +283,16 @@ xfs_attr3_root_inactive(
case cpu_to_be16(XFS_DA_NODE_MAGIC):
case cpu_to_be16(XFS_DA3_NODE_MAGIC):
error = xfs_attr3_node_inactive(trans, dp, bp, 1);
+ if (error)
+ return error;
+
+ /*
+ * Empty root node block are not allowed, convert it to leaf.
+ */
+ error = xfs_attr3_leaf_init(*trans, dp, 0);
+ if (error)
+ return error;
+ error = xfs_trans_roll_inode(trans, dp);
break;
case cpu_to_be16(XFS_ATTR_LEAF_MAGIC):
case cpu_to_be16(XFS_ATTR3_LEAF_MAGIC):
@@ -295,21 +305,6 @@ xfs_attr3_root_inactive(
xfs_trans_brelse(*trans, bp);
break;
}
- if (error)
- return error;
-
- /*
- * Invalidate the incore copy of the root block.
- */
- error = xfs_trans_get_buf(*trans, mp->m_ddev_targp, blkno,
- XFS_FSB_TO_BB(mp, mp->m_attr_geo->fsbcount), 0, &bp);
- if (error)
- return error;
- xfs_trans_binval(*trans, bp); /* remove from cache */
- /*
- * Commit the invalidate and start the next transaction.
- */
- error = xfs_trans_roll_inode(trans, dp);
return error;
}
@@ -328,6 +323,7 @@ xfs_attr_inactive(
{
struct xfs_trans *trans;
struct xfs_mount *mp;
+ struct xfs_buf *bp;
int lock_mode = XFS_ILOCK_SHARED;
int error = 0;
@@ -363,10 +359,27 @@ xfs_attr_inactive(
* removal below.
*/
if (dp->i_af.if_nextents > 0) {
+ /*
+ * Invalidate and truncate all blocks but leave the root block.
+ */
error = xfs_attr3_root_inactive(&trans, dp);
if (error)
goto out_cancel;
+ error = xfs_itruncate_extents(&trans, dp, XFS_ATTR_FORK,
+ XFS_FSB_TO_B(mp, mp->m_attr_geo->fsbcount));
+ if (error)
+ goto out_cancel;
+
+ /*
+ * Invalidate and truncate the root block and ensure that the
+ * operation is completed within a single transaction.
+ */
+ error = xfs_da_get_buf(trans, dp, 0, &bp, XFS_ATTR_FORK);
+ if (error)
+ goto out_cancel;
+
+ xfs_trans_binval(trans, bp);
error = xfs_itruncate_extents(&trans, dp, XFS_ATTR_FORK, 0);
if (error)
goto out_cancel;
--
2.39.2
^ permalink raw reply related [flat|nested] 15+ messages in thread
* Re: [PATCH v2 2/4] xfs: factor out xfs_attr3_node_entry_remove
2026-03-12 8:57 ` [PATCH v2 2/4] xfs: factor out xfs_attr3_node_entry_remove Long Li
@ 2026-03-13 15:01 ` Darrick J. Wong
2026-03-14 2:49 ` Long Li
0 siblings, 1 reply; 15+ messages in thread
From: Darrick J. Wong @ 2026-03-13 15:01 UTC (permalink / raw)
To: Long Li; +Cc: cem, linux-xfs, david, yi.zhang, houtao1, yangerkun, lonuxli.64
On Thu, Mar 12, 2026 at 04:57:58PM +0800, Long Li wrote:
> Factor out wrapper xfs_attr3_node_entry_remove function, which
> exported for external use.
>
> Signed-off-by: Long Li <leo.lilong@huawei.com>
> ---
> fs/xfs/libxfs/xfs_da_btree.c | 54 ++++++++++++++++++++++++++++--------
> fs/xfs/libxfs/xfs_da_btree.h | 2 ++
> 2 files changed, 45 insertions(+), 11 deletions(-)
>
> diff --git a/fs/xfs/libxfs/xfs_da_btree.c b/fs/xfs/libxfs/xfs_da_btree.c
> index 766631f0562e..0b9fe3e4370d 100644
> --- a/fs/xfs/libxfs/xfs_da_btree.c
> +++ b/fs/xfs/libxfs/xfs_da_btree.c
> @@ -1506,21 +1506,20 @@ xfs_da3_fixhashpath(
> }
>
> /*
> - * Remove an entry from an intermediate node.
> + * Internal implementation to remove an entry from an intermediate node.
> */
> STATIC void
> -xfs_da3_node_remove(
> - struct xfs_da_state *state,
> - struct xfs_da_state_blk *drop_blk)
> +__xfs_da3_node_remove(
> + struct xfs_trans *tp,
> + struct xfs_inode *dp,
> + struct xfs_da_geometry *geo,
> + struct xfs_da_state_blk *drop_blk)
> {
> struct xfs_da_intnode *node;
> struct xfs_da3_icnode_hdr nodehdr;
> struct xfs_da_node_entry *btree;
> int index;
> int tmp;
> - struct xfs_inode *dp = state->args->dp;
> -
> - trace_xfs_da_node_remove(state->args);
>
> node = drop_blk->bp->b_addr;
> xfs_da3_node_hdr_from_disk(dp->i_mount, &nodehdr, node);
> @@ -1536,17 +1535,17 @@ xfs_da3_node_remove(
> tmp = nodehdr.count - index - 1;
> tmp *= (uint)sizeof(xfs_da_node_entry_t);
> memmove(&btree[index], &btree[index + 1], tmp);
> - xfs_trans_log_buf(state->args->trans, drop_blk->bp,
> + xfs_trans_log_buf(tp, drop_blk->bp,
> XFS_DA_LOGRANGE(node, &btree[index], tmp));
> index = nodehdr.count - 1;
> }
> memset(&btree[index], 0, sizeof(xfs_da_node_entry_t));
> - xfs_trans_log_buf(state->args->trans, drop_blk->bp,
> + xfs_trans_log_buf(tp, drop_blk->bp,
> XFS_DA_LOGRANGE(node, &btree[index], sizeof(btree[index])));
> nodehdr.count -= 1;
> xfs_da3_node_hdr_to_disk(dp->i_mount, node, &nodehdr);
> - xfs_trans_log_buf(state->args->trans, drop_blk->bp,
> - XFS_DA_LOGRANGE(node, &node->hdr, state->args->geo->node_hdr_size));
> + xfs_trans_log_buf(tp, drop_blk->bp,
> + XFS_DA_LOGRANGE(node, &node->hdr, geo->node_hdr_size));
>
> /*
> * Copy the last hash value from the block to propagate upwards.
> @@ -1554,6 +1553,39 @@ xfs_da3_node_remove(
> drop_blk->hashval = be32_to_cpu(btree[index - 1].hashval);
> }
>
> +/*
> + * Remove an entry from an intermediate node.
> + */
> +STATIC void
> +xfs_da3_node_remove(
> + struct xfs_da_state *state,
> + struct xfs_da_state_blk *drop_blk)
> +{
> + trace_xfs_da_node_remove(state->args);
> +
> + __xfs_da3_node_remove(state->args->trans, state->args->dp,
> + state->args->geo, drop_blk);
> +}
> +
> +/*
> + * Remove an entry from a node at the specified index, this is an exported
> + * wrapper for removing entries from intermediate nodes.
"exported" is confusing since in kernel-land that usually implies
EXPORT_SYMBOL_GPL(), which this clearly isn't. You could trim this to
/*
* Remove an entry from an intermediate attr node at the specified
* index.
*/
With that touched up,
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
--D
> + */
> +void
> +xfs_attr3_node_entry_remove(
> + struct xfs_trans *tp,
> + struct xfs_inode *dp,
> + struct xfs_buf *bp,
> + int index)
> +{
> + struct xfs_da_state_blk blk = {
> + .index = index,
> + .bp = bp,
> + };
> +
> + __xfs_da3_node_remove(tp, dp, dp->i_mount->m_attr_geo, &blk);
> +}
> +
> /*
> * Unbalance the elements between two intermediate nodes,
> * move all Btree elements from one node into another.
> diff --git a/fs/xfs/libxfs/xfs_da_btree.h b/fs/xfs/libxfs/xfs_da_btree.h
> index 354d5d65043e..afcf2d3c7a21 100644
> --- a/fs/xfs/libxfs/xfs_da_btree.h
> +++ b/fs/xfs/libxfs/xfs_da_btree.h
> @@ -184,6 +184,8 @@ int xfs_da3_split(xfs_da_state_t *state);
> int xfs_da3_join(xfs_da_state_t *state);
> void xfs_da3_fixhashpath(struct xfs_da_state *state,
> struct xfs_da_state_path *path_to_to_fix);
> +void xfs_attr3_node_entry_remove(struct xfs_trans *tp, struct xfs_inode *dp,
> + struct xfs_buf *bp, int index);
>
> /*
> * Routines used for finding things in the Btree.
> --
> 2.39.2
>
>
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH v2 3/4] xfs: factor out xfs_attr3_leaf_init
2026-03-12 8:57 ` [PATCH v2 3/4] xfs: factor out xfs_attr3_leaf_init Long Li
@ 2026-03-13 15:04 ` Darrick J. Wong
2026-03-14 2:47 ` Long Li
0 siblings, 1 reply; 15+ messages in thread
From: Darrick J. Wong @ 2026-03-13 15:04 UTC (permalink / raw)
To: Long Li; +Cc: cem, linux-xfs, david, yi.zhang, houtao1, yangerkun, lonuxli.64
On Thu, Mar 12, 2026 at 04:57:59PM +0800, Long Li wrote:
> Factor out wrapper xfs_attr3_leaf_init function, which exported for
> external use.
>
> Signed-off-by: Long Li <leo.lilong@huawei.com>
> ---
> fs/xfs/libxfs/xfs_attr_leaf.c | 22 ++++++++++++++++++++++
> fs/xfs/libxfs/xfs_attr_leaf.h | 3 +++
> 2 files changed, 25 insertions(+)
>
> diff --git a/fs/xfs/libxfs/xfs_attr_leaf.c b/fs/xfs/libxfs/xfs_attr_leaf.c
> index 47f48ae555c0..96a65141e812 100644
> --- a/fs/xfs/libxfs/xfs_attr_leaf.c
> +++ b/fs/xfs/libxfs/xfs_attr_leaf.c
> @@ -1415,6 +1415,28 @@ xfs_attr3_leaf_create(
> return 0;
> }
>
> +/*
> + * Create and initialize an empty attr leaf block at blkno, and attach the
> + * buffer to tp.
Not sure why we "create and initialize" here -- there's already a block
mapped at @blkno and we're rewriting it with a leaf header, right?
/*
* Reinitialize an existing attr fork block as an empty leaf, and
* attach the buffer to tp.
*/
With that changed,
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
--D
> + */
> +int
> +xfs_attr3_leaf_init(
> + struct xfs_trans *tp,
> + struct xfs_inode *dp,
> + xfs_dablk_t blkno)
> +{
> + struct xfs_buf *bp = NULL;
> + struct xfs_da_args args = {
> + .trans = tp,
> + .dp = dp,
> + .owner = dp->i_ino,
> + .geo = dp->i_mount->m_attr_geo,
> + };
> +
> + ASSERT(tp != NULL);
> +
> + return xfs_attr3_leaf_create(&args, blkno, &bp);
> +}
> /*
> * Split the leaf node, rebalance, then add the new entry.
> *
> diff --git a/fs/xfs/libxfs/xfs_attr_leaf.h b/fs/xfs/libxfs/xfs_attr_leaf.h
> index aca46da2bc50..72639efe6ac3 100644
> --- a/fs/xfs/libxfs/xfs_attr_leaf.h
> +++ b/fs/xfs/libxfs/xfs_attr_leaf.h
> @@ -87,6 +87,9 @@ int xfs_attr3_leaf_list_int(struct xfs_buf *bp,
> /*
> * Routines used for shrinking the Btree.
> */
> +
> +int xfs_attr3_leaf_init(struct xfs_trans *tp, struct xfs_inode *dp,
> + xfs_dablk_t blkno);
> int xfs_attr3_leaf_toosmall(struct xfs_da_state *state, int *retval);
> void xfs_attr3_leaf_unbalance(struct xfs_da_state *state,
> struct xfs_da_state_blk *drop_blk,
> --
> 2.39.2
>
>
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH v2 4/4] xfs: close crash window in attr dabtree inactivation
2026-03-12 8:58 ` [PATCH v2 4/4] xfs: close crash window in attr dabtree inactivation Long Li
@ 2026-03-14 0:18 ` Darrick J. Wong
2026-03-16 4:56 ` Darrick J. Wong
0 siblings, 1 reply; 15+ messages in thread
From: Darrick J. Wong @ 2026-03-14 0:18 UTC (permalink / raw)
To: Long Li; +Cc: cem, linux-xfs, david, yi.zhang, houtao1, yangerkun, lonuxli.64
On Thu, Mar 12, 2026 at 04:58:00PM +0800, Long Li wrote:
> When inactivating an inode with node-format extended attributes,
> xfs_attr3_node_inactive() invalidates all child leaf/node blocks via
> xfs_trans_binval(), but intentionally does not remove the corresponding
> entries from their parent node blocks. The implicit assumption is that
> xfs_attr_inactive() will truncate the entire attr fork to zero extents
> afterwards, so log recovery will never reach the root node and follow
> those stale pointers.
>
> However, if a log shutdown occurs after the leaf/node block cancellations
> commit but before the attr bmap truncation commits, this assumption
> breaks. Recovery replays the attr bmap intact (the inode still has
> attr fork extents), but suppresses replay of all cancelled leaf/node
> blocks, maybe leaving them as stale data on disk. On the next mount,
> xlog_recover_process_iunlinks() retries inactivation and attempts to
> read the root node via the attr bmap. If the root node was not replayed,
> reading the unreplayed root block triggers a metadata verification
> failure immediately; if it was replayed, following its child pointers
> to unreplayed child blocks triggers the same failure:
>
> XFS (pmem0): Metadata corruption detected at
> xfs_da3_node_read_verify+0x53/0x220, xfs_da3_node block 0x78
> XFS (pmem0): Unmount and run xfs_repair
> XFS (pmem0): First 128 bytes of corrupted metadata buffer:
> 00000000: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
> 00000010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
> 00000020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
> 00000030: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
> 00000040: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
> 00000050: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
> 00000060: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
> 00000070: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
> XFS (pmem0): metadata I/O error in "xfs_da_read_buf+0x104/0x190" at daddr 0x78 len 8 error 117
>
> Fix this in two places:
>
> In xfs_attr3_node_inactive(), after calling xfs_trans_binval() on a
> child block, immediately remove the entry that references it from the
> parent node in the same transaction. This eliminates the window where
> the parent holds a pointer to a cancelled block. Once all children are
> removed, the now-empty root node is converted to a leaf block within the
> same transaction. This node-to-leaf conversion is necessary for crash
> safety. If the system shutdown after the empty node is written to the
> log but before the second-phase bmap truncation commits, log recovery
> will attempt to verify the root block on disk. xfs_da3_node_verify()
> does not permit a node block with count == 0; such a block will fail
> verification and trigger a metadata corruption shutdown. on the other
> hand, leaf blocks are allowed to have this transient state.
>
> In xfs_attr_inactive(), split the attr fork truncation into two explicit
> phases. First, truncate all extents beyond the root block (the child
> extents whose parent references have already been removed above).
> Second, invalidate the root block and truncate the attr bmap to zero in
> a single transaction. The two operations in the second phase must be
> atomic: as long as the attr bmap has any non-zero length, recovery can
> follow it to the root block, so the root block invalidation must commit
> together with the bmap-to-zero truncation.
>
> Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
> Cc: <stable@vger.kernel.org>
> Signed-off-by: Long Li <leo.lilong@huawei.com>
I think this looks ok, but let me run it through g753 over the weekend
to see if it fixes that problem too. Thanks for answering my questions
the last time through.
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
--D
> ---
> fs/xfs/xfs_attr_inactive.c | 93 ++++++++++++++++++++++----------------
> 1 file changed, 53 insertions(+), 40 deletions(-)
>
> diff --git a/fs/xfs/xfs_attr_inactive.c b/fs/xfs/xfs_attr_inactive.c
> index 92331991f9fd..235f19a68fa2 100644
> --- a/fs/xfs/xfs_attr_inactive.c
> +++ b/fs/xfs/xfs_attr_inactive.c
> @@ -140,7 +140,7 @@ xfs_attr3_node_inactive(
> xfs_daddr_t parent_blkno, child_blkno;
> struct xfs_buf *child_bp;
> struct xfs_da3_icnode_hdr ichdr;
> - int error, i;
> + int error;
>
> /*
> * Since this code is recursive (gasp!) we must protect ourselves.
> @@ -152,7 +152,7 @@ xfs_attr3_node_inactive(
> return -EFSCORRUPTED;
> }
>
> - xfs_da3_node_hdr_from_disk(dp->i_mount, &ichdr, bp->b_addr);
> + xfs_da3_node_hdr_from_disk(mp, &ichdr, bp->b_addr);
> parent_blkno = xfs_buf_daddr(bp);
> if (!ichdr.count) {
> xfs_trans_brelse(*trans, bp);
> @@ -167,7 +167,7 @@ xfs_attr3_node_inactive(
> * over the leaves removing all of them. If this is higher up
> * in the tree, recurse downward.
> */
> - for (i = 0; i < ichdr.count; i++) {
> + while (ichdr.count > 0) {
> /*
> * Read the subsidiary block to see what we have to work with.
> * Don't do this in a transaction. This is a depth-first
> @@ -218,29 +218,32 @@ xfs_attr3_node_inactive(
> xfs_trans_binval(*trans, child_bp);
> child_bp = NULL;
>
> - /*
> - * If we're not done, re-read the parent to get the next
> - * child block number.
> - */
> - if (i + 1 < ichdr.count) {
> - struct xfs_da3_icnode_hdr phdr;
> + error = xfs_da3_node_read_mapped(*trans, dp,
> + parent_blkno, &bp, XFS_ATTR_FORK);
> + if (error)
> + return error;
>
> - error = xfs_da3_node_read_mapped(*trans, dp,
> - parent_blkno, &bp, XFS_ATTR_FORK);
> + /*
> + * Remove entry from parent node, prevents being indexed to.
> + */
> + xfs_attr3_node_entry_remove(*trans, dp, bp, 0);
> +
> + xfs_da3_node_hdr_from_disk(mp, &ichdr, bp->b_addr);
> + bp = NULL;
> +
> + if (ichdr.count > 0) {
> + /*
> + * If we're not done, get the next child block number.
> + */
> + child_fsb = be32_to_cpu(ichdr.btree[0].before);
> +
> + /*
> + * Atomically commit the whole invalidate stuff.
> + */
> + error = xfs_trans_roll_inode(trans, dp);
> if (error)
> return error;
> - xfs_da3_node_hdr_from_disk(dp->i_mount, &phdr,
> - bp->b_addr);
> - child_fsb = be32_to_cpu(phdr.btree[i + 1].before);
> - xfs_trans_brelse(*trans, bp);
> - bp = NULL;
> }
> - /*
> - * Atomically commit the whole invalidate stuff.
> - */
> - error = xfs_trans_roll_inode(trans, dp);
> - if (error)
> - return error;
> }
>
> return 0;
> @@ -257,10 +260,8 @@ xfs_attr3_root_inactive(
> struct xfs_trans **trans,
> struct xfs_inode *dp)
> {
> - struct xfs_mount *mp = dp->i_mount;
> struct xfs_da_blkinfo *info;
> struct xfs_buf *bp;
> - xfs_daddr_t blkno;
> int error;
>
> /*
> @@ -272,7 +273,6 @@ xfs_attr3_root_inactive(
> error = xfs_da3_node_read(*trans, dp, 0, &bp, XFS_ATTR_FORK);
> if (error)
> return error;
> - blkno = xfs_buf_daddr(bp);
>
> /*
> * Invalidate the tree, even if the "tree" is only a single leaf block.
> @@ -283,6 +283,16 @@ xfs_attr3_root_inactive(
> case cpu_to_be16(XFS_DA_NODE_MAGIC):
> case cpu_to_be16(XFS_DA3_NODE_MAGIC):
> error = xfs_attr3_node_inactive(trans, dp, bp, 1);
> + if (error)
> + return error;
> +
> + /*
> + * Empty root node block are not allowed, convert it to leaf.
> + */
> + error = xfs_attr3_leaf_init(*trans, dp, 0);
> + if (error)
> + return error;
> + error = xfs_trans_roll_inode(trans, dp);
> break;
> case cpu_to_be16(XFS_ATTR_LEAF_MAGIC):
> case cpu_to_be16(XFS_ATTR3_LEAF_MAGIC):
> @@ -295,21 +305,6 @@ xfs_attr3_root_inactive(
> xfs_trans_brelse(*trans, bp);
> break;
> }
> - if (error)
> - return error;
> -
> - /*
> - * Invalidate the incore copy of the root block.
> - */
> - error = xfs_trans_get_buf(*trans, mp->m_ddev_targp, blkno,
> - XFS_FSB_TO_BB(mp, mp->m_attr_geo->fsbcount), 0, &bp);
> - if (error)
> - return error;
> - xfs_trans_binval(*trans, bp); /* remove from cache */
> - /*
> - * Commit the invalidate and start the next transaction.
> - */
> - error = xfs_trans_roll_inode(trans, dp);
>
> return error;
> }
> @@ -328,6 +323,7 @@ xfs_attr_inactive(
> {
> struct xfs_trans *trans;
> struct xfs_mount *mp;
> + struct xfs_buf *bp;
> int lock_mode = XFS_ILOCK_SHARED;
> int error = 0;
>
> @@ -363,10 +359,27 @@ xfs_attr_inactive(
> * removal below.
> */
> if (dp->i_af.if_nextents > 0) {
> + /*
> + * Invalidate and truncate all blocks but leave the root block.
> + */
> error = xfs_attr3_root_inactive(&trans, dp);
> if (error)
> goto out_cancel;
>
> + error = xfs_itruncate_extents(&trans, dp, XFS_ATTR_FORK,
> + XFS_FSB_TO_B(mp, mp->m_attr_geo->fsbcount));
> + if (error)
> + goto out_cancel;
> +
> + /*
> + * Invalidate and truncate the root block and ensure that the
> + * operation is completed within a single transaction.
> + */
> + error = xfs_da_get_buf(trans, dp, 0, &bp, XFS_ATTR_FORK);
> + if (error)
> + goto out_cancel;
> +
> + xfs_trans_binval(trans, bp);
> error = xfs_itruncate_extents(&trans, dp, XFS_ATTR_FORK, 0);
> if (error)
> goto out_cancel;
> --
> 2.39.2
>
>
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH v2 3/4] xfs: factor out xfs_attr3_leaf_init
2026-03-13 15:04 ` Darrick J. Wong
@ 2026-03-14 2:47 ` Long Li
0 siblings, 0 replies; 15+ messages in thread
From: Long Li @ 2026-03-14 2:47 UTC (permalink / raw)
To: Darrick J. Wong
Cc: cem, linux-xfs, david, yi.zhang, houtao1, yangerkun, lonuxli.64
On Fri, Mar 13, 2026 at 08:04:17AM -0700, Darrick J. Wong wrote:
> On Thu, Mar 12, 2026 at 04:57:59PM +0800, Long Li wrote:
> > Factor out wrapper xfs_attr3_leaf_init function, which exported for
> > external use.
> >
> > Signed-off-by: Long Li <leo.lilong@huawei.com>
> > ---
> > fs/xfs/libxfs/xfs_attr_leaf.c | 22 ++++++++++++++++++++++
> > fs/xfs/libxfs/xfs_attr_leaf.h | 3 +++
> > 2 files changed, 25 insertions(+)
> >
> > diff --git a/fs/xfs/libxfs/xfs_attr_leaf.c b/fs/xfs/libxfs/xfs_attr_leaf.c
> > index 47f48ae555c0..96a65141e812 100644
> > --- a/fs/xfs/libxfs/xfs_attr_leaf.c
> > +++ b/fs/xfs/libxfs/xfs_attr_leaf.c
> > @@ -1415,6 +1415,28 @@ xfs_attr3_leaf_create(
> > return 0;
> > }
> >
> > +/*
> > + * Create and initialize an empty attr leaf block at blkno, and attach the
> > + * buffer to tp.
>
> Not sure why we "create and initialize" here -- there's already a block
> mapped at @blkno and we're rewriting it with a leaf header, right?
>
> /*
> * Reinitialize an existing attr fork block as an empty leaf, and
> * attach the buffer to tp.
> */
>
> With that changed,
> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
>
> --D
Thank you for the review. I'll update the comment to use the clearer format
you suggested. and appreciate the guidance on improving the documentation
clarity.
Thanks,
Long Li
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH v2 2/4] xfs: factor out xfs_attr3_node_entry_remove
2026-03-13 15:01 ` Darrick J. Wong
@ 2026-03-14 2:49 ` Long Li
0 siblings, 0 replies; 15+ messages in thread
From: Long Li @ 2026-03-14 2:49 UTC (permalink / raw)
To: Darrick J. Wong
Cc: cem, linux-xfs, david, yi.zhang, houtao1, yangerkun, lonuxli.64
On Fri, Mar 13, 2026 at 08:01:12AM -0700, Darrick J. Wong wrote:
> On Thu, Mar 12, 2026 at 04:57:58PM +0800, Long Li wrote:
> > Factor out wrapper xfs_attr3_node_entry_remove function, which
> > exported for external use.
> >
> > Signed-off-by: Long Li <leo.lilong@huawei.com>
> > ---
> > fs/xfs/libxfs/xfs_da_btree.c | 54 ++++++++++++++++++++++++++++--------
> > fs/xfs/libxfs/xfs_da_btree.h | 2 ++
> > 2 files changed, 45 insertions(+), 11 deletions(-)
> >
> > diff --git a/fs/xfs/libxfs/xfs_da_btree.c b/fs/xfs/libxfs/xfs_da_btree.c
> > index 766631f0562e..0b9fe3e4370d 100644
> > --- a/fs/xfs/libxfs/xfs_da_btree.c
> > +++ b/fs/xfs/libxfs/xfs_da_btree.c
> > @@ -1506,21 +1506,20 @@ xfs_da3_fixhashpath(
> > }
> >
> > /*
> > - * Remove an entry from an intermediate node.
> > + * Internal implementation to remove an entry from an intermediate node.
> > */
> > STATIC void
> > -xfs_da3_node_remove(
> > - struct xfs_da_state *state,
> > - struct xfs_da_state_blk *drop_blk)
> > +__xfs_da3_node_remove(
> > + struct xfs_trans *tp,
> > + struct xfs_inode *dp,
> > + struct xfs_da_geometry *geo,
> > + struct xfs_da_state_blk *drop_blk)
> > {
> > struct xfs_da_intnode *node;
> > struct xfs_da3_icnode_hdr nodehdr;
> > struct xfs_da_node_entry *btree;
> > int index;
> > int tmp;
> > - struct xfs_inode *dp = state->args->dp;
> > -
> > - trace_xfs_da_node_remove(state->args);
> >
> > node = drop_blk->bp->b_addr;
> > xfs_da3_node_hdr_from_disk(dp->i_mount, &nodehdr, node);
> > @@ -1536,17 +1535,17 @@ xfs_da3_node_remove(
> > tmp = nodehdr.count - index - 1;
> > tmp *= (uint)sizeof(xfs_da_node_entry_t);
> > memmove(&btree[index], &btree[index + 1], tmp);
> > - xfs_trans_log_buf(state->args->trans, drop_blk->bp,
> > + xfs_trans_log_buf(tp, drop_blk->bp,
> > XFS_DA_LOGRANGE(node, &btree[index], tmp));
> > index = nodehdr.count - 1;
> > }
> > memset(&btree[index], 0, sizeof(xfs_da_node_entry_t));
> > - xfs_trans_log_buf(state->args->trans, drop_blk->bp,
> > + xfs_trans_log_buf(tp, drop_blk->bp,
> > XFS_DA_LOGRANGE(node, &btree[index], sizeof(btree[index])));
> > nodehdr.count -= 1;
> > xfs_da3_node_hdr_to_disk(dp->i_mount, node, &nodehdr);
> > - xfs_trans_log_buf(state->args->trans, drop_blk->bp,
> > - XFS_DA_LOGRANGE(node, &node->hdr, state->args->geo->node_hdr_size));
> > + xfs_trans_log_buf(tp, drop_blk->bp,
> > + XFS_DA_LOGRANGE(node, &node->hdr, geo->node_hdr_size));
> >
> > /*
> > * Copy the last hash value from the block to propagate upwards.
> > @@ -1554,6 +1553,39 @@ xfs_da3_node_remove(
> > drop_blk->hashval = be32_to_cpu(btree[index - 1].hashval);
> > }
> >
> > +/*
> > + * Remove an entry from an intermediate node.
> > + */
> > +STATIC void
> > +xfs_da3_node_remove(
> > + struct xfs_da_state *state,
> > + struct xfs_da_state_blk *drop_blk)
> > +{
> > + trace_xfs_da_node_remove(state->args);
> > +
> > + __xfs_da3_node_remove(state->args->trans, state->args->dp,
> > + state->args->geo, drop_blk);
> > +}
> > +
> > +/*
> > + * Remove an entry from a node at the specified index, this is an exported
> > + * wrapper for removing entries from intermediate nodes.
>
> "exported" is confusing since in kernel-land that usually implies
> EXPORT_SYMBOL_GPL(), which this clearly isn't. You could trim this to
>
> /*
> * Remove an entry from an intermediate attr node at the specified
> * index.
> */
>
> With that touched up,
> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
>
> --D
Thanks for your review, it will be update.
Long Li
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH v2 4/4] xfs: close crash window in attr dabtree inactivation
2026-03-14 0:18 ` Darrick J. Wong
@ 2026-03-16 4:56 ` Darrick J. Wong
2026-03-16 8:32 ` Long Li
0 siblings, 1 reply; 15+ messages in thread
From: Darrick J. Wong @ 2026-03-16 4:56 UTC (permalink / raw)
To: Long Li; +Cc: cem, linux-xfs, david, yi.zhang, houtao1, yangerkun, lonuxli.64
On Fri, Mar 13, 2026 at 05:18:56PM -0700, Darrick J. Wong wrote:
> On Thu, Mar 12, 2026 at 04:58:00PM +0800, Long Li wrote:
> > When inactivating an inode with node-format extended attributes,
> > xfs_attr3_node_inactive() invalidates all child leaf/node blocks via
> > xfs_trans_binval(), but intentionally does not remove the corresponding
> > entries from their parent node blocks. The implicit assumption is that
> > xfs_attr_inactive() will truncate the entire attr fork to zero extents
> > afterwards, so log recovery will never reach the root node and follow
> > those stale pointers.
> >
> > However, if a log shutdown occurs after the leaf/node block cancellations
> > commit but before the attr bmap truncation commits, this assumption
> > breaks. Recovery replays the attr bmap intact (the inode still has
> > attr fork extents), but suppresses replay of all cancelled leaf/node
> > blocks, maybe leaving them as stale data on disk. On the next mount,
> > xlog_recover_process_iunlinks() retries inactivation and attempts to
> > read the root node via the attr bmap. If the root node was not replayed,
> > reading the unreplayed root block triggers a metadata verification
> > failure immediately; if it was replayed, following its child pointers
> > to unreplayed child blocks triggers the same failure:
> >
> > XFS (pmem0): Metadata corruption detected at
> > xfs_da3_node_read_verify+0x53/0x220, xfs_da3_node block 0x78
> > XFS (pmem0): Unmount and run xfs_repair
> > XFS (pmem0): First 128 bytes of corrupted metadata buffer:
> > 00000000: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
> > 00000010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
> > 00000020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
> > 00000030: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
> > 00000040: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
> > 00000050: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
> > 00000060: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
> > 00000070: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
> > XFS (pmem0): metadata I/O error in "xfs_da_read_buf+0x104/0x190" at daddr 0x78 len 8 error 117
> >
> > Fix this in two places:
> >
> > In xfs_attr3_node_inactive(), after calling xfs_trans_binval() on a
> > child block, immediately remove the entry that references it from the
> > parent node in the same transaction. This eliminates the window where
> > the parent holds a pointer to a cancelled block. Once all children are
> > removed, the now-empty root node is converted to a leaf block within the
> > same transaction. This node-to-leaf conversion is necessary for crash
> > safety. If the system shutdown after the empty node is written to the
> > log but before the second-phase bmap truncation commits, log recovery
> > will attempt to verify the root block on disk. xfs_da3_node_verify()
> > does not permit a node block with count == 0; such a block will fail
> > verification and trigger a metadata corruption shutdown. on the other
> > hand, leaf blocks are allowed to have this transient state.
> >
> > In xfs_attr_inactive(), split the attr fork truncation into two explicit
> > phases. First, truncate all extents beyond the root block (the child
> > extents whose parent references have already been removed above).
> > Second, invalidate the root block and truncate the attr bmap to zero in
> > a single transaction. The two operations in the second phase must be
> > atomic: as long as the attr bmap has any non-zero length, recovery can
> > follow it to the root block, so the root block invalidation must commit
> > together with the bmap-to-zero truncation.
> >
> > Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
> > Cc: <stable@vger.kernel.org>
> > Signed-off-by: Long Li <leo.lilong@huawei.com>
>
> I think this looks ok, but let me run it through g753 over the weekend
> to see if it fixes that problem too. Thanks for answering my questions
> the last time through.
>
> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
In retesting generic/753, I noticed that this patchset *does* actually
fix the problem of xfs_repair complaining about random contents in attr
block 0. So, I think you've fixed that problem, yay!
However, I noticed an additional thing -- if attr block 0 happened to be
a leaf node (i.e. not a da3 node) AND that leaf happened to contain a
non-INCOMPLETE remote attr entry AND the fs went down after the first
bunmapi (that removes all but block0) but before the second bunmapi
(which eliminates block 0), that remote entry will point to blocks
that have been bunmapi'd, and xfs_repair will complain about that too.
So, I think this patch needs to reinit old leaf blocks too; see below.
Note that xfs_repair also needs a patch to teach it not to zap the
entire attr fork just because it found one incomplete attr entry in a
leaf block.
Thoughts?
--D
diff --git a/fs/xfs/xfs_attr_inactive.c b/fs/xfs/xfs_attr_inactive.c
index 235f19a68fa209..719837fb8340d5 100644
--- a/fs/xfs/xfs_attr_inactive.c
+++ b/fs/xfs/xfs_attr_inactive.c
@@ -297,6 +297,8 @@ xfs_attr3_root_inactive(
case cpu_to_be16(XFS_ATTR_LEAF_MAGIC):
case cpu_to_be16(XFS_ATTR3_LEAF_MAGIC):
error = xfs_attr3_leaf_inactive(trans, dp, bp);
+ if (!error)
+ error = xfs_attr3_leaf_init(*trans, dp, 0);
break;
default:
xfs_dirattr_mark_sick(dp, XFS_ATTR_FORK);
> > ---
> > fs/xfs/xfs_attr_inactive.c | 93 ++++++++++++++++++++++----------------
> > 1 file changed, 53 insertions(+), 40 deletions(-)
> >
> > diff --git a/fs/xfs/xfs_attr_inactive.c b/fs/xfs/xfs_attr_inactive.c
> > index 92331991f9fd..235f19a68fa2 100644
> > --- a/fs/xfs/xfs_attr_inactive.c
> > +++ b/fs/xfs/xfs_attr_inactive.c
> > @@ -140,7 +140,7 @@ xfs_attr3_node_inactive(
> > xfs_daddr_t parent_blkno, child_blkno;
> > struct xfs_buf *child_bp;
> > struct xfs_da3_icnode_hdr ichdr;
> > - int error, i;
> > + int error;
> >
> > /*
> > * Since this code is recursive (gasp!) we must protect ourselves.
> > @@ -152,7 +152,7 @@ xfs_attr3_node_inactive(
> > return -EFSCORRUPTED;
> > }
> >
> > - xfs_da3_node_hdr_from_disk(dp->i_mount, &ichdr, bp->b_addr);
> > + xfs_da3_node_hdr_from_disk(mp, &ichdr, bp->b_addr);
> > parent_blkno = xfs_buf_daddr(bp);
> > if (!ichdr.count) {
> > xfs_trans_brelse(*trans, bp);
> > @@ -167,7 +167,7 @@ xfs_attr3_node_inactive(
> > * over the leaves removing all of them. If this is higher up
> > * in the tree, recurse downward.
> > */
> > - for (i = 0; i < ichdr.count; i++) {
> > + while (ichdr.count > 0) {
> > /*
> > * Read the subsidiary block to see what we have to work with.
> > * Don't do this in a transaction. This is a depth-first
> > @@ -218,29 +218,32 @@ xfs_attr3_node_inactive(
> > xfs_trans_binval(*trans, child_bp);
> > child_bp = NULL;
> >
> > - /*
> > - * If we're not done, re-read the parent to get the next
> > - * child block number.
> > - */
> > - if (i + 1 < ichdr.count) {
> > - struct xfs_da3_icnode_hdr phdr;
> > + error = xfs_da3_node_read_mapped(*trans, dp,
> > + parent_blkno, &bp, XFS_ATTR_FORK);
> > + if (error)
> > + return error;
> >
> > - error = xfs_da3_node_read_mapped(*trans, dp,
> > - parent_blkno, &bp, XFS_ATTR_FORK);
> > + /*
> > + * Remove entry from parent node, prevents being indexed to.
> > + */
> > + xfs_attr3_node_entry_remove(*trans, dp, bp, 0);
> > +
> > + xfs_da3_node_hdr_from_disk(mp, &ichdr, bp->b_addr);
> > + bp = NULL;
> > +
> > + if (ichdr.count > 0) {
> > + /*
> > + * If we're not done, get the next child block number.
> > + */
> > + child_fsb = be32_to_cpu(ichdr.btree[0].before);
> > +
> > + /*
> > + * Atomically commit the whole invalidate stuff.
> > + */
> > + error = xfs_trans_roll_inode(trans, dp);
> > if (error)
> > return error;
> > - xfs_da3_node_hdr_from_disk(dp->i_mount, &phdr,
> > - bp->b_addr);
> > - child_fsb = be32_to_cpu(phdr.btree[i + 1].before);
> > - xfs_trans_brelse(*trans, bp);
> > - bp = NULL;
> > }
> > - /*
> > - * Atomically commit the whole invalidate stuff.
> > - */
> > - error = xfs_trans_roll_inode(trans, dp);
> > - if (error)
> > - return error;
> > }
> >
> > return 0;
> > @@ -257,10 +260,8 @@ xfs_attr3_root_inactive(
> > struct xfs_trans **trans,
> > struct xfs_inode *dp)
> > {
> > - struct xfs_mount *mp = dp->i_mount;
> > struct xfs_da_blkinfo *info;
> > struct xfs_buf *bp;
> > - xfs_daddr_t blkno;
> > int error;
> >
> > /*
> > @@ -272,7 +273,6 @@ xfs_attr3_root_inactive(
> > error = xfs_da3_node_read(*trans, dp, 0, &bp, XFS_ATTR_FORK);
> > if (error)
> > return error;
> > - blkno = xfs_buf_daddr(bp);
> >
> > /*
> > * Invalidate the tree, even if the "tree" is only a single leaf block.
> > @@ -283,6 +283,16 @@ xfs_attr3_root_inactive(
> > case cpu_to_be16(XFS_DA_NODE_MAGIC):
> > case cpu_to_be16(XFS_DA3_NODE_MAGIC):
> > error = xfs_attr3_node_inactive(trans, dp, bp, 1);
> > + if (error)
> > + return error;
> > +
> > + /*
> > + * Empty root node block are not allowed, convert it to leaf.
> > + */
> > + error = xfs_attr3_leaf_init(*trans, dp, 0);
> > + if (error)
> > + return error;
> > + error = xfs_trans_roll_inode(trans, dp);
> > break;
> > case cpu_to_be16(XFS_ATTR_LEAF_MAGIC):
> > case cpu_to_be16(XFS_ATTR3_LEAF_MAGIC):
> > @@ -295,21 +305,6 @@ xfs_attr3_root_inactive(
> > xfs_trans_brelse(*trans, bp);
> > break;
> > }
> > - if (error)
> > - return error;
> > -
> > - /*
> > - * Invalidate the incore copy of the root block.
> > - */
> > - error = xfs_trans_get_buf(*trans, mp->m_ddev_targp, blkno,
> > - XFS_FSB_TO_BB(mp, mp->m_attr_geo->fsbcount), 0, &bp);
> > - if (error)
> > - return error;
> > - xfs_trans_binval(*trans, bp); /* remove from cache */
> > - /*
> > - * Commit the invalidate and start the next transaction.
> > - */
> > - error = xfs_trans_roll_inode(trans, dp);
> >
> > return error;
> > }
> > @@ -328,6 +323,7 @@ xfs_attr_inactive(
> > {
> > struct xfs_trans *trans;
> > struct xfs_mount *mp;
> > + struct xfs_buf *bp;
> > int lock_mode = XFS_ILOCK_SHARED;
> > int error = 0;
> >
> > @@ -363,10 +359,27 @@ xfs_attr_inactive(
> > * removal below.
> > */
> > if (dp->i_af.if_nextents > 0) {
> > + /*
> > + * Invalidate and truncate all blocks but leave the root block.
> > + */
> > error = xfs_attr3_root_inactive(&trans, dp);
> > if (error)
> > goto out_cancel;
> >
> > + error = xfs_itruncate_extents(&trans, dp, XFS_ATTR_FORK,
> > + XFS_FSB_TO_B(mp, mp->m_attr_geo->fsbcount));
> > + if (error)
> > + goto out_cancel;
> > +
> > + /*
> > + * Invalidate and truncate the root block and ensure that the
> > + * operation is completed within a single transaction.
> > + */
> > + error = xfs_da_get_buf(trans, dp, 0, &bp, XFS_ATTR_FORK);
> > + if (error)
> > + goto out_cancel;
> > +
> > + xfs_trans_binval(trans, bp);
> > error = xfs_itruncate_extents(&trans, dp, XFS_ATTR_FORK, 0);
> > if (error)
> > goto out_cancel;
> > --
> > 2.39.2
> >
> >
>
^ permalink raw reply related [flat|nested] 15+ messages in thread
* Re: [PATCH v2 4/4] xfs: close crash window in attr dabtree inactivation
2026-03-16 4:56 ` Darrick J. Wong
@ 2026-03-16 8:32 ` Long Li
2026-03-16 13:04 ` Long Li
2026-03-16 16:29 ` Darrick J. Wong
0 siblings, 2 replies; 15+ messages in thread
From: Long Li @ 2026-03-16 8:32 UTC (permalink / raw)
To: Darrick J. Wong
Cc: cem, linux-xfs, david, yi.zhang, houtao1, yangerkun, lonuxli.64
On Sun, Mar 15, 2026 at 09:56:13PM -0700, Darrick J. Wong wrote:
> On Fri, Mar 13, 2026 at 05:18:56PM -0700, Darrick J. Wong wrote:
> > On Thu, Mar 12, 2026 at 04:58:00PM +0800, Long Li wrote:
> > > When inactivating an inode with node-format extended attributes,
> > > xfs_attr3_node_inactive() invalidates all child leaf/node blocks via
> > > xfs_trans_binval(), but intentionally does not remove the corresponding
> > > entries from their parent node blocks. The implicit assumption is that
> > > xfs_attr_inactive() will truncate the entire attr fork to zero extents
> > > afterwards, so log recovery will never reach the root node and follow
> > > those stale pointers.
> > >
> > > However, if a log shutdown occurs after the leaf/node block cancellations
> > > commit but before the attr bmap truncation commits, this assumption
> > > breaks. Recovery replays the attr bmap intact (the inode still has
> > > attr fork extents), but suppresses replay of all cancelled leaf/node
> > > blocks, maybe leaving them as stale data on disk. On the next mount,
> > > xlog_recover_process_iunlinks() retries inactivation and attempts to
> > > read the root node via the attr bmap. If the root node was not replayed,
> > > reading the unreplayed root block triggers a metadata verification
> > > failure immediately; if it was replayed, following its child pointers
> > > to unreplayed child blocks triggers the same failure:
> > >
> > > XFS (pmem0): Metadata corruption detected at
> > > xfs_da3_node_read_verify+0x53/0x220, xfs_da3_node block 0x78
> > > XFS (pmem0): Unmount and run xfs_repair
> > > XFS (pmem0): First 128 bytes of corrupted metadata buffer:
> > > 00000000: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
> > > 00000010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
> > > 00000020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
> > > 00000030: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
> > > 00000040: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
> > > 00000050: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
> > > 00000060: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
> > > 00000070: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
> > > XFS (pmem0): metadata I/O error in "xfs_da_read_buf+0x104/0x190" at daddr 0x78 len 8 error 117
> > >
> > > Fix this in two places:
> > >
> > > In xfs_attr3_node_inactive(), after calling xfs_trans_binval() on a
> > > child block, immediately remove the entry that references it from the
> > > parent node in the same transaction. This eliminates the window where
> > > the parent holds a pointer to a cancelled block. Once all children are
> > > removed, the now-empty root node is converted to a leaf block within the
> > > same transaction. This node-to-leaf conversion is necessary for crash
> > > safety. If the system shutdown after the empty node is written to the
> > > log but before the second-phase bmap truncation commits, log recovery
> > > will attempt to verify the root block on disk. xfs_da3_node_verify()
> > > does not permit a node block with count == 0; such a block will fail
> > > verification and trigger a metadata corruption shutdown. on the other
> > > hand, leaf blocks are allowed to have this transient state.
> > >
> > > In xfs_attr_inactive(), split the attr fork truncation into two explicit
> > > phases. First, truncate all extents beyond the root block (the child
> > > extents whose parent references have already been removed above).
> > > Second, invalidate the root block and truncate the attr bmap to zero in
> > > a single transaction. The two operations in the second phase must be
> > > atomic: as long as the attr bmap has any non-zero length, recovery can
> > > follow it to the root block, so the root block invalidation must commit
> > > together with the bmap-to-zero truncation.
> > >
> > > Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
> > > Cc: <stable@vger.kernel.org>
> > > Signed-off-by: Long Li <leo.lilong@huawei.com>
> >
> > I think this looks ok, but let me run it through g753 over the weekend
> > to see if it fixes that problem too. Thanks for answering my questions
> > the last time through.
> >
> > Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
>
> In retesting generic/753, I noticed that this patchset *does* actually
> fix the problem of xfs_repair complaining about random contents in attr
> block 0. So, I think you've fixed that problem, yay!
>
That sounds great, thank you for the test.
> However, I noticed an additional thing -- if attr block 0 happened to be
> a leaf node (i.e. not a da3 node) AND that leaf happened to contain a
> non-INCOMPLETE remote attr entry AND the fs went down after the first
> bunmapi (that removes all but block0) but before the second bunmapi
> (which eliminates block 0), that remote entry will point to blocks
> that have been bunmapi'd, and xfs_repair will complain about that too.
>
> So, I think this patch needs to reinit old leaf blocks too; see below.
Agree with you, thanks for pointint this out.
For non-root leaf blocks, the remote value block cancellation and the
removal of the leaf entry from its parent node are already in the same
transaction, so that case is safe.
For the root leaf block, it is reachable as long as the attr bmap is
non-zero, so the remote value block cancellations and the leaf reinit
must be atomic.
> Note that xfs_repair also needs a patch to teach it not to zap the
> entire attr fork just because it found one incomplete attr entry in a
> leaf block.
>
> Thoughts?
With the leaf reinit added, xfs_repair maybe encounter an empty leaf at
block 0 with no entries referencing the remote blocks. So I don't understand
under what circumstances there would be an "incomplete attr entry", perhaps
I missed something?
Thanks,
Long Li
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH v2 4/4] xfs: close crash window in attr dabtree inactivation
2026-03-16 8:32 ` Long Li
@ 2026-03-16 13:04 ` Long Li
2026-03-16 16:29 ` Darrick J. Wong
1 sibling, 0 replies; 15+ messages in thread
From: Long Li @ 2026-03-16 13:04 UTC (permalink / raw)
To: Darrick J. Wong
Cc: cem, linux-xfs, david, yi.zhang, houtao1, yangerkun, lonuxli.64
On Mon, Mar 16, 2026 at 04:32:36PM +0800, Long Li wrote:
> On Sun, Mar 15, 2026 at 09:56:13PM -0700, Darrick J. Wong wrote:
> > On Fri, Mar 13, 2026 at 05:18:56PM -0700, Darrick J. Wong wrote:
> > > On Thu, Mar 12, 2026 at 04:58:00PM +0800, Long Li wrote:
> > > > When inactivating an inode with node-format extended attributes,
> > > > xfs_attr3_node_inactive() invalidates all child leaf/node blocks via
> > > > xfs_trans_binval(), but intentionally does not remove the corresponding
> > > > entries from their parent node blocks. The implicit assumption is that
> > > > xfs_attr_inactive() will truncate the entire attr fork to zero extents
> > > > afterwards, so log recovery will never reach the root node and follow
> > > > those stale pointers.
> > > >
> > > > However, if a log shutdown occurs after the leaf/node block cancellations
> > > > commit but before the attr bmap truncation commits, this assumption
> > > > breaks. Recovery replays the attr bmap intact (the inode still has
> > > > attr fork extents), but suppresses replay of all cancelled leaf/node
> > > > blocks, maybe leaving them as stale data on disk. On the next mount,
> > > > xlog_recover_process_iunlinks() retries inactivation and attempts to
> > > > read the root node via the attr bmap. If the root node was not replayed,
> > > > reading the unreplayed root block triggers a metadata verification
> > > > failure immediately; if it was replayed, following its child pointers
> > > > to unreplayed child blocks triggers the same failure:
> > > >
> > > > XFS (pmem0): Metadata corruption detected at
> > > > xfs_da3_node_read_verify+0x53/0x220, xfs_da3_node block 0x78
> > > > XFS (pmem0): Unmount and run xfs_repair
> > > > XFS (pmem0): First 128 bytes of corrupted metadata buffer:
> > > > 00000000: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
> > > > 00000010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
> > > > 00000020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
> > > > 00000030: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
> > > > 00000040: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
> > > > 00000050: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
> > > > 00000060: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
> > > > 00000070: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
> > > > XFS (pmem0): metadata I/O error in "xfs_da_read_buf+0x104/0x190" at daddr 0x78 len 8 error 117
> > > >
> > > > Fix this in two places:
> > > >
> > > > In xfs_attr3_node_inactive(), after calling xfs_trans_binval() on a
> > > > child block, immediately remove the entry that references it from the
> > > > parent node in the same transaction. This eliminates the window where
> > > > the parent holds a pointer to a cancelled block. Once all children are
> > > > removed, the now-empty root node is converted to a leaf block within the
> > > > same transaction. This node-to-leaf conversion is necessary for crash
> > > > safety. If the system shutdown after the empty node is written to the
> > > > log but before the second-phase bmap truncation commits, log recovery
> > > > will attempt to verify the root block on disk. xfs_da3_node_verify()
> > > > does not permit a node block with count == 0; such a block will fail
> > > > verification and trigger a metadata corruption shutdown. on the other
> > > > hand, leaf blocks are allowed to have this transient state.
> > > >
> > > > In xfs_attr_inactive(), split the attr fork truncation into two explicit
> > > > phases. First, truncate all extents beyond the root block (the child
> > > > extents whose parent references have already been removed above).
> > > > Second, invalidate the root block and truncate the attr bmap to zero in
> > > > a single transaction. The two operations in the second phase must be
> > > > atomic: as long as the attr bmap has any non-zero length, recovery can
> > > > follow it to the root block, so the root block invalidation must commit
> > > > together with the bmap-to-zero truncation.
> > > >
> > > > Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
> > > > Cc: <stable@vger.kernel.org>
> > > > Signed-off-by: Long Li <leo.lilong@huawei.com>
> > >
> > > I think this looks ok, but let me run it through g753 over the weekend
> > > to see if it fixes that problem too. Thanks for answering my questions
> > > the last time through.
> > >
> > > Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
> >
> > In retesting generic/753, I noticed that this patchset *does* actually
> > fix the problem of xfs_repair complaining about random contents in attr
> > block 0. So, I think you've fixed that problem, yay!
> >
>
> That sounds great, thank you for the test.
>
> > However, I noticed an additional thing -- if attr block 0 happened to be
> > a leaf node (i.e. not a da3 node) AND that leaf happened to contain a
> > non-INCOMPLETE remote attr entry AND the fs went down after the first
> > bunmapi (that removes all but block0) but before the second bunmapi
> > (which eliminates block 0), that remote entry will point to blocks
> > that have been bunmapi'd, and xfs_repair will complain about that too.
> >
> > So, I think this patch needs to reinit old leaf blocks too; see below.
>
> Agree with you, thanks for pointint this out.
>
> For non-root leaf blocks, the remote value block cancellation and the
> removal of the leaf entry from its parent node are already in the same
> transaction, so that case is safe.
>
> For the root leaf block, it is reachable as long as the attr bmap is
> non-zero, so the remote value block cancellations and the leaf reinit
> must be atomic.
>
I was wrong about the remote value block cancellation. During the attr
inactive process, the remote value block is not canceled and recorded
in the log. However, what you said is correct.
Long Li
> > Note that xfs_repair also needs a patch to teach it not to zap the
> > entire attr fork just because it found one incomplete attr entry in a
> > leaf block.
> >
> > Thoughts?
>
> With the leaf reinit added, xfs_repair maybe encounter an empty leaf at
> block 0 with no entries referencing the remote blocks. So I don't understand
> under what circumstances there would be an "incomplete attr entry", perhaps
> I missed something?
>
> Thanks,
> Long Li
>
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH v2 4/4] xfs: close crash window in attr dabtree inactivation
2026-03-16 8:32 ` Long Li
2026-03-16 13:04 ` Long Li
@ 2026-03-16 16:29 ` Darrick J. Wong
2026-03-17 1:38 ` Long Li
1 sibling, 1 reply; 15+ messages in thread
From: Darrick J. Wong @ 2026-03-16 16:29 UTC (permalink / raw)
To: Long Li; +Cc: cem, linux-xfs, david, yi.zhang, houtao1, yangerkun, lonuxli.64
On Mon, Mar 16, 2026 at 04:32:35PM +0800, Long Li wrote:
> On Sun, Mar 15, 2026 at 09:56:13PM -0700, Darrick J. Wong wrote:
> > On Fri, Mar 13, 2026 at 05:18:56PM -0700, Darrick J. Wong wrote:
> > > On Thu, Mar 12, 2026 at 04:58:00PM +0800, Long Li wrote:
> > > > When inactivating an inode with node-format extended attributes,
> > > > xfs_attr3_node_inactive() invalidates all child leaf/node blocks via
> > > > xfs_trans_binval(), but intentionally does not remove the corresponding
> > > > entries from their parent node blocks. The implicit assumption is that
> > > > xfs_attr_inactive() will truncate the entire attr fork to zero extents
> > > > afterwards, so log recovery will never reach the root node and follow
> > > > those stale pointers.
> > > >
> > > > However, if a log shutdown occurs after the leaf/node block cancellations
> > > > commit but before the attr bmap truncation commits, this assumption
> > > > breaks. Recovery replays the attr bmap intact (the inode still has
> > > > attr fork extents), but suppresses replay of all cancelled leaf/node
> > > > blocks, maybe leaving them as stale data on disk. On the next mount,
> > > > xlog_recover_process_iunlinks() retries inactivation and attempts to
> > > > read the root node via the attr bmap. If the root node was not replayed,
> > > > reading the unreplayed root block triggers a metadata verification
> > > > failure immediately; if it was replayed, following its child pointers
> > > > to unreplayed child blocks triggers the same failure:
> > > >
> > > > XFS (pmem0): Metadata corruption detected at
> > > > xfs_da3_node_read_verify+0x53/0x220, xfs_da3_node block 0x78
> > > > XFS (pmem0): Unmount and run xfs_repair
> > > > XFS (pmem0): First 128 bytes of corrupted metadata buffer:
> > > > 00000000: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
> > > > 00000010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
> > > > 00000020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
> > > > 00000030: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
> > > > 00000040: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
> > > > 00000050: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
> > > > 00000060: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
> > > > 00000070: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
> > > > XFS (pmem0): metadata I/O error in "xfs_da_read_buf+0x104/0x190" at daddr 0x78 len 8 error 117
> > > >
> > > > Fix this in two places:
> > > >
> > > > In xfs_attr3_node_inactive(), after calling xfs_trans_binval() on a
> > > > child block, immediately remove the entry that references it from the
> > > > parent node in the same transaction. This eliminates the window where
> > > > the parent holds a pointer to a cancelled block. Once all children are
> > > > removed, the now-empty root node is converted to a leaf block within the
> > > > same transaction. This node-to-leaf conversion is necessary for crash
> > > > safety. If the system shutdown after the empty node is written to the
> > > > log but before the second-phase bmap truncation commits, log recovery
> > > > will attempt to verify the root block on disk. xfs_da3_node_verify()
> > > > does not permit a node block with count == 0; such a block will fail
> > > > verification and trigger a metadata corruption shutdown. on the other
> > > > hand, leaf blocks are allowed to have this transient state.
> > > >
> > > > In xfs_attr_inactive(), split the attr fork truncation into two explicit
> > > > phases. First, truncate all extents beyond the root block (the child
> > > > extents whose parent references have already been removed above).
> > > > Second, invalidate the root block and truncate the attr bmap to zero in
> > > > a single transaction. The two operations in the second phase must be
> > > > atomic: as long as the attr bmap has any non-zero length, recovery can
> > > > follow it to the root block, so the root block invalidation must commit
> > > > together with the bmap-to-zero truncation.
> > > >
> > > > Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
> > > > Cc: <stable@vger.kernel.org>
> > > > Signed-off-by: Long Li <leo.lilong@huawei.com>
> > >
> > > I think this looks ok, but let me run it through g753 over the weekend
> > > to see if it fixes that problem too. Thanks for answering my questions
> > > the last time through.
> > >
> > > Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
> >
> > In retesting generic/753, I noticed that this patchset *does* actually
> > fix the problem of xfs_repair complaining about random contents in attr
> > block 0. So, I think you've fixed that problem, yay!
> >
>
> That sounds great, thank you for the test.
>
> > However, I noticed an additional thing -- if attr block 0 happened to be
> > a leaf node (i.e. not a da3 node) AND that leaf happened to contain a
> > non-INCOMPLETE remote attr entry AND the fs went down after the first
> > bunmapi (that removes all but block0) but before the second bunmapi
> > (which eliminates block 0), that remote entry will point to blocks
> > that have been bunmapi'd, and xfs_repair will complain about that too.
> >
> > So, I think this patch needs to reinit old leaf blocks too; see below.
>
> Agree with you, thanks for pointint this out.
>
> For non-root leaf blocks, the remote value block cancellation and the
> removal of the leaf entry from its parent node are already in the same
> transaction, so that case is safe.
Agreed.
> For the root leaf block, it is reachable as long as the attr bmap is
> non-zero, so the remote value block cancellations and the leaf reinit
> must be atomic.
Agreed.
> > Note that xfs_repair also needs a patch to teach it not to zap the
> > entire attr fork just because it found one incomplete attr entry in a
> > leaf block.
> >
> > Thoughts?
>
> With the leaf reinit added, xfs_repair maybe encounter an empty leaf at
> block 0 with no entries referencing the remote blocks. So I don't understand
> under what circumstances there would be an "incomplete attr entry", perhaps
> I missed something?
Oh, sorry, I should've been more clear about what went where. There are
two scenarios that still caused problems over the weekend:
a) You create a file, start adding large attrs, and crash. After
recovery, the file will still be linked into the directory tree, but its
attr structure will have INCOMPLETE attrs.
In this case, I don't think the user wants us to destroy the entire attr
structure, so it's necessary to patch xfs_repair to be ok with
INCOMPLETE attrs.
b) You create a file with a few large attrs. After a long life without
drama, you start to delete the file, and crash after bunmapi'ing
everything *except* for block 0. IOWs, the old leaf root block is still
intact. If you then run xfs_repair, it will complain about the
non-INCOMPLETE remote attr value because valueblk points to an unmapped
region of the attr fork.
This is the case for adding the extra xfs_attr3_leaf_init call.
--D
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH v2 4/4] xfs: close crash window in attr dabtree inactivation
2026-03-16 16:29 ` Darrick J. Wong
@ 2026-03-17 1:38 ` Long Li
0 siblings, 0 replies; 15+ messages in thread
From: Long Li @ 2026-03-17 1:38 UTC (permalink / raw)
To: Darrick J. Wong
Cc: cem, linux-xfs, david, yi.zhang, houtao1, yangerkun, lonuxli.64
On Mon, Mar 16, 2026 at 09:29:55AM -0700, Darrick J. Wong wrote:
> On Mon, Mar 16, 2026 at 04:32:35PM +0800, Long Li wrote:
> > On Sun, Mar 15, 2026 at 09:56:13PM -0700, Darrick J. Wong wrote:
> > > On Fri, Mar 13, 2026 at 05:18:56PM -0700, Darrick J. Wong wrote:
> > > > On Thu, Mar 12, 2026 at 04:58:00PM +0800, Long Li wrote:
> > > > > When inactivating an inode with node-format extended attributes,
> > > > > xfs_attr3_node_inactive() invalidates all child leaf/node blocks via
> > > > > xfs_trans_binval(), but intentionally does not remove the corresponding
> > > > > entries from their parent node blocks. The implicit assumption is that
> > > > > xfs_attr_inactive() will truncate the entire attr fork to zero extents
> > > > > afterwards, so log recovery will never reach the root node and follow
> > > > > those stale pointers.
> > > > >
> > > > > However, if a log shutdown occurs after the leaf/node block cancellations
> > > > > commit but before the attr bmap truncation commits, this assumption
> > > > > breaks. Recovery replays the attr bmap intact (the inode still has
> > > > > attr fork extents), but suppresses replay of all cancelled leaf/node
> > > > > blocks, maybe leaving them as stale data on disk. On the next mount,
> > > > > xlog_recover_process_iunlinks() retries inactivation and attempts to
> > > > > read the root node via the attr bmap. If the root node was not replayed,
> > > > > reading the unreplayed root block triggers a metadata verification
> > > > > failure immediately; if it was replayed, following its child pointers
> > > > > to unreplayed child blocks triggers the same failure:
> > > > >
> > > > > XFS (pmem0): Metadata corruption detected at
> > > > > xfs_da3_node_read_verify+0x53/0x220, xfs_da3_node block 0x78
> > > > > XFS (pmem0): Unmount and run xfs_repair
> > > > > XFS (pmem0): First 128 bytes of corrupted metadata buffer:
> > > > > 00000000: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
> > > > > 00000010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
> > > > > 00000020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
> > > > > 00000030: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
> > > > > 00000040: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
> > > > > 00000050: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
> > > > > 00000060: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
> > > > > 00000070: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
> > > > > XFS (pmem0): metadata I/O error in "xfs_da_read_buf+0x104/0x190" at daddr 0x78 len 8 error 117
> > > > >
> > > > > Fix this in two places:
> > > > >
> > > > > In xfs_attr3_node_inactive(), after calling xfs_trans_binval() on a
> > > > > child block, immediately remove the entry that references it from the
> > > > > parent node in the same transaction. This eliminates the window where
> > > > > the parent holds a pointer to a cancelled block. Once all children are
> > > > > removed, the now-empty root node is converted to a leaf block within the
> > > > > same transaction. This node-to-leaf conversion is necessary for crash
> > > > > safety. If the system shutdown after the empty node is written to the
> > > > > log but before the second-phase bmap truncation commits, log recovery
> > > > > will attempt to verify the root block on disk. xfs_da3_node_verify()
> > > > > does not permit a node block with count == 0; such a block will fail
> > > > > verification and trigger a metadata corruption shutdown. on the other
> > > > > hand, leaf blocks are allowed to have this transient state.
> > > > >
> > > > > In xfs_attr_inactive(), split the attr fork truncation into two explicit
> > > > > phases. First, truncate all extents beyond the root block (the child
> > > > > extents whose parent references have already been removed above).
> > > > > Second, invalidate the root block and truncate the attr bmap to zero in
> > > > > a single transaction. The two operations in the second phase must be
> > > > > atomic: as long as the attr bmap has any non-zero length, recovery can
> > > > > follow it to the root block, so the root block invalidation must commit
> > > > > together with the bmap-to-zero truncation.
> > > > >
> > > > > Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
> > > > > Cc: <stable@vger.kernel.org>
> > > > > Signed-off-by: Long Li <leo.lilong@huawei.com>
> > > >
> > > > I think this looks ok, but let me run it through g753 over the weekend
> > > > to see if it fixes that problem too. Thanks for answering my questions
> > > > the last time through.
> > > >
> > > > Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
> > >
> > > In retesting generic/753, I noticed that this patchset *does* actually
> > > fix the problem of xfs_repair complaining about random contents in attr
> > > block 0. So, I think you've fixed that problem, yay!
> > >
> >
> > That sounds great, thank you for the test.
> >
> > > However, I noticed an additional thing -- if attr block 0 happened to be
> > > a leaf node (i.e. not a da3 node) AND that leaf happened to contain a
> > > non-INCOMPLETE remote attr entry AND the fs went down after the first
> > > bunmapi (that removes all but block0) but before the second bunmapi
> > > (which eliminates block 0), that remote entry will point to blocks
> > > that have been bunmapi'd, and xfs_repair will complain about that too.
> > >
> > > So, I think this patch needs to reinit old leaf blocks too; see below.
> >
> > Agree with you, thanks for pointint this out.
> >
> > For non-root leaf blocks, the remote value block cancellation and the
> > removal of the leaf entry from its parent node are already in the same
> > transaction, so that case is safe.
>
> Agreed.
>
> > For the root leaf block, it is reachable as long as the attr bmap is
> > non-zero, so the remote value block cancellations and the leaf reinit
> > must be atomic.
>
> Agreed.
>
> > > Note that xfs_repair also needs a patch to teach it not to zap the
> > > entire attr fork just because it found one incomplete attr entry in a
> > > leaf block.
> > >
> > > Thoughts?
> >
> > With the leaf reinit added, xfs_repair maybe encounter an empty leaf at
> > block 0 with no entries referencing the remote blocks. So I don't understand
> > under what circumstances there would be an "incomplete attr entry", perhaps
> > I missed something?
>
> Oh, sorry, I should've been more clear about what went where. There are
> two scenarios that still caused problems over the weekend:
>
> a) You create a file, start adding large attrs, and crash. After
> recovery, the file will still be linked into the directory tree, but its
> attr structure will have INCOMPLETE attrs.
>
> In this case, I don't think the user wants us to destroy the entire attr
> structure, so it's necessary to patch xfs_repair to be ok with
> INCOMPLETE attrs.
>
> b) You create a file with a few large attrs. After a long life without
> drama, you start to delete the file, and crash after bunmapi'ing
> everything *except* for block 0. IOWs, the old leaf root block is still
> intact. If you then run xfs_repair, it will complain about the
> non-INCOMPLETE remote attr value because valueblk points to an unmapped
> region of the attr fork.
>
> This is the case for adding the extra xfs_attr3_leaf_init call.
>
> --D
I understand now. Thank you for your detailed explanation.
Long Li
^ permalink raw reply [flat|nested] 15+ messages in thread
end of thread, other threads:[~2026-03-17 1:42 UTC | newest]
Thread overview: 15+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2026-03-12 8:57 [PATCH v2 0/4] xfs: close crash window in attr dabtree inactivation Long Li
2026-03-12 8:57 ` [PATCH v2 1/4] xfs: only assert new size for datafork during truncate extents Long Li
2026-03-12 8:57 ` [PATCH v2 2/4] xfs: factor out xfs_attr3_node_entry_remove Long Li
2026-03-13 15:01 ` Darrick J. Wong
2026-03-14 2:49 ` Long Li
2026-03-12 8:57 ` [PATCH v2 3/4] xfs: factor out xfs_attr3_leaf_init Long Li
2026-03-13 15:04 ` Darrick J. Wong
2026-03-14 2:47 ` Long Li
2026-03-12 8:58 ` [PATCH v2 4/4] xfs: close crash window in attr dabtree inactivation Long Li
2026-03-14 0:18 ` Darrick J. Wong
2026-03-16 4:56 ` Darrick J. Wong
2026-03-16 8:32 ` Long Li
2026-03-16 13:04 ` Long Li
2026-03-16 16:29 ` Darrick J. Wong
2026-03-17 1:38 ` Long Li
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox