* [PATCH 0/4] xfs: close crash window in attr dabtree inactivation
@ 2026-03-09 8:27 Long Li
2026-03-09 8:27 ` [PATCH 1/4] xfs: only assert new size for datafork during truncate extents Long Li
` (3 more replies)
0 siblings, 4 replies; 17+ messages in thread
From: Long Li @ 2026-03-09 8:27 UTC (permalink / raw)
To: djwong, cem
Cc: linux-xfs, david, yi.zhang, houtao1, leo.lilong, yangerkun,
lonuxli.64
Problem
-------
xfs_attr3_node_inactive() cancels all child blocks in the attr dabtree
via xfs_trans_binval(), relying on a subsequent full attr bmap truncation
in xfs_attr_inactive() to ensure log recovery never reaches the root
node. The child block cancellations and the bmap truncation are
committed in separate transactions, creating a narrow but real crash
window.
If the system shuts down after the cancellations commit but before the
bmap truncation commits, the attr bmap survives recovery intact and
xlog_recover_process_iunlinks() retries inactivation. It attempts to
read the root node via the attr bmap: if the root node was not replayed,
reading the unreplayed root block maybe triggers a metadata verification
failure immediately; if it was replayed, following its child pointers to
unreplayed child blocks triggers the same failure.
Zhang Yi previously attempted to solve this problem [1], but Dave gave
different advice. Following the approach suggested by Dave Chinner in
review.
[1] https://patchwork.kernel.org/project/xfs/patch/20230613030434.2944173-3-yi.zhang@huaweicloud.com/
Fix
---
The fix addresses the atomicity gap at two levels:
(1) In xfs_attr3_node_inactive(), each child block invalidation and the
removal of its reference from the parent node are placed in the same
transaction, so no parent node ever holds a pointer to a cancelled
block at any crash point. After all children are removed, the empty
root node is converted to an empty leaf in the same transaction.
This conversion is necessary because xfs_da3_node_verify() rejects
a node block with count == 0, whereas xfs_attr3_leaf_verify()
explicitly allows it during recovery to accommodate the transient
state from the shortform-to-leaf promotion path.
(2) In xfs_attr_inactive(), the attr fork is truncated in two phases.
First the child extents are removed. Then the root block is
invalidated and the attr bmap is truncated to zero in a single
transaction. These two operations in the second phase must be
atomic: as long as the attr bmap has any non-zero length, recovery
can follow it to the root block, so the root block invalidation must
commit together with the bmap-to-zero truncation.
Long Li (4):
xfs: only assert new size for datafork during truncate extents
xfs: factor out xfs_da3_node_entry_remove
xfs: factor out xfs_attr3_leaf_init
xfs: close crash window in attr dabtree inactivation
fs/xfs/libxfs/xfs_attr_leaf.c | 20 ++++++++
fs/xfs/libxfs/xfs_attr_leaf.h | 3 ++
fs/xfs/libxfs/xfs_da_btree.c | 54 +++++++++++++++----
fs/xfs/libxfs/xfs_da_btree.h | 2 +
fs/xfs/xfs_attr_inactive.c | 97 ++++++++++++++++++++---------------
fs/xfs/xfs_inode.c | 3 +-
6 files changed, 125 insertions(+), 54 deletions(-)
--
2.39.2
^ permalink raw reply [flat|nested] 17+ messages in thread
* [PATCH 1/4] xfs: only assert new size for datafork during truncate extents
2026-03-09 8:27 [PATCH 0/4] xfs: close crash window in attr dabtree inactivation Long Li
@ 2026-03-09 8:27 ` Long Li
2026-03-09 16:39 ` Darrick J. Wong
2026-03-09 8:27 ` [PATCH 2/4] xfs: factor out xfs_da3_node_entry_remove Long Li
` (2 subsequent siblings)
3 siblings, 1 reply; 17+ messages in thread
From: Long Li @ 2026-03-09 8:27 UTC (permalink / raw)
To: djwong, cem
Cc: linux-xfs, david, yi.zhang, houtao1, leo.lilong, yangerkun,
lonuxli.64
The assertion functions properly because we currently only truncate the
attr to a zero size. Any other new size of the attr is not preempted.
Make this assertion is specific to the datafork, preparing for
subsequent patches to truncate the attribute to a non-zero size.
Signed-off-by: Long Li <leo.lilong@huawei.com>
---
fs/xfs/xfs_inode.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index 50c0404f9064..beaa26ec62da 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -1048,7 +1048,8 @@ xfs_itruncate_extents_flags(
xfs_assert_ilocked(ip, XFS_ILOCK_EXCL);
if (icount_read(VFS_I(ip)))
xfs_assert_ilocked(ip, XFS_IOLOCK_EXCL);
- ASSERT(new_size <= XFS_ISIZE(ip));
+ if (whichfork == XFS_DATA_FORK)
+ ASSERT(new_size <= XFS_ISIZE(ip));
ASSERT(tp->t_flags & XFS_TRANS_PERM_LOG_RES);
ASSERT(ip->i_itemp != NULL);
ASSERT(ip->i_itemp->ili_lock_flags == 0);
--
2.39.2
^ permalink raw reply related [flat|nested] 17+ messages in thread
* [PATCH 2/4] xfs: factor out xfs_da3_node_entry_remove
2026-03-09 8:27 [PATCH 0/4] xfs: close crash window in attr dabtree inactivation Long Li
2026-03-09 8:27 ` [PATCH 1/4] xfs: only assert new size for datafork during truncate extents Long Li
@ 2026-03-09 8:27 ` Long Li
2026-03-09 16:42 ` Darrick J. Wong
2026-03-09 8:27 ` [PATCH 3/4] xfs: factor out xfs_attr3_leaf_init Long Li
2026-03-09 8:27 ` [PATCH 4/4] xfs: close crash window in attr dabtree inactivation Long Li
3 siblings, 1 reply; 17+ messages in thread
From: Long Li @ 2026-03-09 8:27 UTC (permalink / raw)
To: djwong, cem
Cc: linux-xfs, david, yi.zhang, houtao1, leo.lilong, yangerkun,
lonuxli.64
Factor out wrapper xfs_da3_node_entry_remove function, which
exported for external use.
Signed-off-by: Long Li <leo.lilong@huawei.com>
---
fs/xfs/libxfs/xfs_da_btree.c | 54 ++++++++++++++++++++++++++++--------
fs/xfs/libxfs/xfs_da_btree.h | 2 ++
2 files changed, 45 insertions(+), 11 deletions(-)
diff --git a/fs/xfs/libxfs/xfs_da_btree.c b/fs/xfs/libxfs/xfs_da_btree.c
index 766631f0562e..466c43098768 100644
--- a/fs/xfs/libxfs/xfs_da_btree.c
+++ b/fs/xfs/libxfs/xfs_da_btree.c
@@ -1506,21 +1506,20 @@ xfs_da3_fixhashpath(
}
/*
- * Remove an entry from an intermediate node.
+ * Internal implementation to remove an entry from an intermediate node.
*/
STATIC void
-xfs_da3_node_remove(
- struct xfs_da_state *state,
- struct xfs_da_state_blk *drop_blk)
+__xfs_da3_node_remove(
+ struct xfs_trans *tp,
+ struct xfs_inode *dp,
+ struct xfs_da_geometry *geo,
+ struct xfs_da_state_blk *drop_blk)
{
struct xfs_da_intnode *node;
struct xfs_da3_icnode_hdr nodehdr;
struct xfs_da_node_entry *btree;
int index;
int tmp;
- struct xfs_inode *dp = state->args->dp;
-
- trace_xfs_da_node_remove(state->args);
node = drop_blk->bp->b_addr;
xfs_da3_node_hdr_from_disk(dp->i_mount, &nodehdr, node);
@@ -1536,17 +1535,17 @@ xfs_da3_node_remove(
tmp = nodehdr.count - index - 1;
tmp *= (uint)sizeof(xfs_da_node_entry_t);
memmove(&btree[index], &btree[index + 1], tmp);
- xfs_trans_log_buf(state->args->trans, drop_blk->bp,
+ xfs_trans_log_buf(tp, drop_blk->bp,
XFS_DA_LOGRANGE(node, &btree[index], tmp));
index = nodehdr.count - 1;
}
memset(&btree[index], 0, sizeof(xfs_da_node_entry_t));
- xfs_trans_log_buf(state->args->trans, drop_blk->bp,
+ xfs_trans_log_buf(tp, drop_blk->bp,
XFS_DA_LOGRANGE(node, &btree[index], sizeof(btree[index])));
nodehdr.count -= 1;
xfs_da3_node_hdr_to_disk(dp->i_mount, node, &nodehdr);
- xfs_trans_log_buf(state->args->trans, drop_blk->bp,
- XFS_DA_LOGRANGE(node, &node->hdr, state->args->geo->node_hdr_size));
+ xfs_trans_log_buf(tp, drop_blk->bp,
+ XFS_DA_LOGRANGE(node, &node->hdr, geo->node_hdr_size));
/*
* Copy the last hash value from the block to propagate upwards.
@@ -1554,6 +1553,39 @@ xfs_da3_node_remove(
drop_blk->hashval = be32_to_cpu(btree[index - 1].hashval);
}
+/*
+ * Remove an entry from an intermediate node.
+ */
+STATIC void
+xfs_da3_node_remove(
+ struct xfs_da_state *state,
+ struct xfs_da_state_blk *drop_blk)
+{
+ trace_xfs_da_node_remove(state->args);
+ __xfs_da3_node_remove(state->args->trans, state->args->dp,
+ state->args->geo, drop_blk);
+}
+
+/*
+ * Remove an entry from a node at the specified index, this is an exported
+ * wrapper for removing entries from intermediate nodes.
+ */
+void
+xfs_da3_node_entry_remove(
+ struct xfs_trans *tp,
+ struct xfs_inode *dp,
+ struct xfs_buf *bp,
+ int index)
+{
+ struct xfs_da_state_blk blk;
+
+ memset(&blk, 0, sizeof(blk));
+ blk.index = index;
+ blk.bp = bp;
+
+ __xfs_da3_node_remove(tp, dp, dp->i_mount->m_attr_geo, &blk);
+}
+
/*
* Unbalance the elements between two intermediate nodes,
* move all Btree elements from one node into another.
diff --git a/fs/xfs/libxfs/xfs_da_btree.h b/fs/xfs/libxfs/xfs_da_btree.h
index 354d5d65043e..6cec4313c83c 100644
--- a/fs/xfs/libxfs/xfs_da_btree.h
+++ b/fs/xfs/libxfs/xfs_da_btree.h
@@ -184,6 +184,8 @@ int xfs_da3_split(xfs_da_state_t *state);
int xfs_da3_join(xfs_da_state_t *state);
void xfs_da3_fixhashpath(struct xfs_da_state *state,
struct xfs_da_state_path *path_to_to_fix);
+void xfs_da3_node_entry_remove(struct xfs_trans *tp, struct xfs_inode *dp,
+ struct xfs_buf *bp, int index);
/*
* Routines used for finding things in the Btree.
--
2.39.2
^ permalink raw reply related [flat|nested] 17+ messages in thread
* [PATCH 3/4] xfs: factor out xfs_attr3_leaf_init
2026-03-09 8:27 [PATCH 0/4] xfs: close crash window in attr dabtree inactivation Long Li
2026-03-09 8:27 ` [PATCH 1/4] xfs: only assert new size for datafork during truncate extents Long Li
2026-03-09 8:27 ` [PATCH 2/4] xfs: factor out xfs_da3_node_entry_remove Long Li
@ 2026-03-09 8:27 ` Long Li
2026-03-09 16:44 ` Darrick J. Wong
2026-03-09 8:27 ` [PATCH 4/4] xfs: close crash window in attr dabtree inactivation Long Li
3 siblings, 1 reply; 17+ messages in thread
From: Long Li @ 2026-03-09 8:27 UTC (permalink / raw)
To: djwong, cem
Cc: linux-xfs, david, yi.zhang, houtao1, leo.lilong, yangerkun,
lonuxli.64
Factor out wrapper xfs_attr3_leaf_init function, which exported for
external use.
Signed-off-by: Long Li <leo.lilong@huawei.com>
---
fs/xfs/libxfs/xfs_attr_leaf.c | 20 ++++++++++++++++++++
fs/xfs/libxfs/xfs_attr_leaf.h | 3 +++
2 files changed, 23 insertions(+)
diff --git a/fs/xfs/libxfs/xfs_attr_leaf.c b/fs/xfs/libxfs/xfs_attr_leaf.c
index 47f48ae555c0..6599b53f4822 100644
--- a/fs/xfs/libxfs/xfs_attr_leaf.c
+++ b/fs/xfs/libxfs/xfs_attr_leaf.c
@@ -1415,6 +1415,26 @@ xfs_attr3_leaf_create(
return 0;
}
+/*
+ * Wrapper function of initializing contents of a leaf, export for external use.
+ */
+int
+xfs_attr3_leaf_init(
+ struct xfs_trans *tp,
+ struct xfs_inode *dp,
+ xfs_dablk_t blkno)
+{
+ struct xfs_buf *bp = NULL;
+ struct xfs_da_args args;
+
+ memset(&args, 0, sizeof(args));
+ args.trans = tp;
+ args.dp = dp;
+ args.owner = dp->i_ino;
+ args.geo = dp->i_mount->m_attr_geo;
+
+ return xfs_attr3_leaf_create(&args, blkno, &bp);
+}
/*
* Split the leaf node, rebalance, then add the new entry.
*
diff --git a/fs/xfs/libxfs/xfs_attr_leaf.h b/fs/xfs/libxfs/xfs_attr_leaf.h
index aca46da2bc50..72639efe6ac3 100644
--- a/fs/xfs/libxfs/xfs_attr_leaf.h
+++ b/fs/xfs/libxfs/xfs_attr_leaf.h
@@ -87,6 +87,9 @@ int xfs_attr3_leaf_list_int(struct xfs_buf *bp,
/*
* Routines used for shrinking the Btree.
*/
+
+int xfs_attr3_leaf_init(struct xfs_trans *tp, struct xfs_inode *dp,
+ xfs_dablk_t blkno);
int xfs_attr3_leaf_toosmall(struct xfs_da_state *state, int *retval);
void xfs_attr3_leaf_unbalance(struct xfs_da_state *state,
struct xfs_da_state_blk *drop_blk,
--
2.39.2
^ permalink raw reply related [flat|nested] 17+ messages in thread
* [PATCH 4/4] xfs: close crash window in attr dabtree inactivation
2026-03-09 8:27 [PATCH 0/4] xfs: close crash window in attr dabtree inactivation Long Li
` (2 preceding siblings ...)
2026-03-09 8:27 ` [PATCH 3/4] xfs: factor out xfs_attr3_leaf_init Long Li
@ 2026-03-09 8:27 ` Long Li
2026-03-09 16:59 ` Darrick J. Wong
3 siblings, 1 reply; 17+ messages in thread
From: Long Li @ 2026-03-09 8:27 UTC (permalink / raw)
To: djwong, cem
Cc: linux-xfs, david, yi.zhang, houtao1, leo.lilong, yangerkun,
lonuxli.64
When inactivating an inode with node-format extended attributes,
xfs_attr3_node_inactive() invalidates all child leaf/node blocks via
xfs_trans_binval(), but intentionally does not remove the corresponding
entries from their parent node blocks. The implicit assumption is that
xfs_attr_inactive() will truncate the entire attr fork to zero extents
afterwards, so log recovery will never reach the root node and follow
those stale pointers.
However, if a log shutdown occurs after the child block cancellations
commit but before the attr bmap truncation commits, this assumption
breaks. Recovery replays the attr bmap intact (the inode still has
attr fork extents), but suppresses replay of all cancelled child
blocks, maybe leaving them as stale data on disk. On the next mount,
xlog_recover_process_iunlinks() retries inactivation and attempts to
read the root node via the attr bmap. If the root node was not replayed,
reading the unreplayed root block triggers a metadata verification
failure immediately; if it was replayed, following its child pointers
to unreplayed child blocks triggers the same failure:
XFS (pmem0): Metadata corruption detected at
xfs_da3_node_read_verify+0x53/0x220, xfs_da3_node block 0x78
XFS (pmem0): Unmount and run xfs_repair
XFS (pmem0): First 128 bytes of corrupted metadata buffer:
00000000: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
00000010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
00000020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
00000030: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
00000040: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
00000050: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
00000060: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
00000070: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
XFS (pmem0): metadata I/O error in "xfs_da_read_buf+0x104/0x190" at daddr 0x78 len 8 error 117
Fix this in two places:
In xfs_attr3_node_inactive(), after calling xfs_trans_binval() on a
child block, immediately remove the entry that references it from the
parent node in the same transaction. This eliminates the window where
the parent holds a pointer to a cancelled block. Once all children are
removed, the now-empty root node is converted to a leaf block within the
same transaction. This node-to-leaf conversion is necessary for crash
safety. If the system shutdown after the empty node is written to the
log but before the second-phase bmap truncation commits, log recovery
will attempt to verify the root block on disk. xfs_da3_node_verify()
does not permit a node block with count == 0; such a block will fail
verification and trigger a metadata corruption shutdown. on the other
hand, leaf blocks are allowed to have this transient state.
In xfs_attr_inactive(), split the attr fork truncation into two explicit
phases. First, truncate all extents beyond the root block (the child
extents whose parent references have already been removed above).
Second, invalidate the root block and truncate the attr bmap to zero in
a single transaction. The two operations in the second phase must be
atomic: as long as the attr bmap has any non-zero length, recovery can
follow it to the root block, so the root block invalidation must commit
together with the bmap-to-zero truncation.
Signed-off-by: Long Li <leo.lilong@huawei.com>
---
fs/xfs/xfs_attr_inactive.c | 97 +++++++++++++++++++++-----------------
1 file changed, 55 insertions(+), 42 deletions(-)
diff --git a/fs/xfs/xfs_attr_inactive.c b/fs/xfs/xfs_attr_inactive.c
index 92331991f9fd..2ffa6b51a356 100644
--- a/fs/xfs/xfs_attr_inactive.c
+++ b/fs/xfs/xfs_attr_inactive.c
@@ -140,7 +140,7 @@ xfs_attr3_node_inactive(
xfs_daddr_t parent_blkno, child_blkno;
struct xfs_buf *child_bp;
struct xfs_da3_icnode_hdr ichdr;
- int error, i;
+ int error;
/*
* Since this code is recursive (gasp!) we must protect ourselves.
@@ -167,7 +167,7 @@ xfs_attr3_node_inactive(
* over the leaves removing all of them. If this is higher up
* in the tree, recurse downward.
*/
- for (i = 0; i < ichdr.count; i++) {
+ while (ichdr.count > 0) {
/*
* Read the subsidiary block to see what we have to work with.
* Don't do this in a transaction. This is a depth-first
@@ -218,29 +218,32 @@ xfs_attr3_node_inactive(
xfs_trans_binval(*trans, child_bp);
child_bp = NULL;
- /*
- * If we're not done, re-read the parent to get the next
- * child block number.
- */
- if (i + 1 < ichdr.count) {
- struct xfs_da3_icnode_hdr phdr;
-
- error = xfs_da3_node_read_mapped(*trans, dp,
- parent_blkno, &bp, XFS_ATTR_FORK);
- if (error)
- return error;
- xfs_da3_node_hdr_from_disk(dp->i_mount, &phdr,
- bp->b_addr);
- child_fsb = be32_to_cpu(phdr.btree[i + 1].before);
- xfs_trans_brelse(*trans, bp);
- bp = NULL;
- }
- /*
- * Atomically commit the whole invalidate stuff.
- */
- error = xfs_trans_roll_inode(trans, dp);
+ error = xfs_da3_node_read_mapped(*trans, dp,
+ parent_blkno, &bp, XFS_ATTR_FORK);
if (error)
- return error;
+ return error;
+
+ /*
+ * Remove entry form parent node, prevents being indexed to.
+ */
+ xfs_da3_node_entry_remove(*trans, dp, bp, 0);
+
+ xfs_da3_node_hdr_from_disk(dp->i_mount, &ichdr, bp->b_addr);
+ bp = NULL;
+
+ if (ichdr.count > 0) {
+ /*
+ * If we're not done, get the next child block number.
+ */
+ child_fsb = be32_to_cpu(ichdr.btree[0].before);
+
+ /*
+ * Atomically commit the whole invalidate stuff.
+ */
+ error = xfs_trans_roll_inode(trans, dp);
+ if (error)
+ return error;
+ }
}
return 0;
@@ -257,10 +260,8 @@ xfs_attr3_root_inactive(
struct xfs_trans **trans,
struct xfs_inode *dp)
{
- struct xfs_mount *mp = dp->i_mount;
struct xfs_da_blkinfo *info;
struct xfs_buf *bp;
- xfs_daddr_t blkno;
int error;
/*
@@ -272,7 +273,6 @@ xfs_attr3_root_inactive(
error = xfs_da3_node_read(*trans, dp, 0, &bp, XFS_ATTR_FORK);
if (error)
return error;
- blkno = xfs_buf_daddr(bp);
/*
* Invalidate the tree, even if the "tree" is only a single leaf block.
@@ -283,6 +283,16 @@ xfs_attr3_root_inactive(
case cpu_to_be16(XFS_DA_NODE_MAGIC):
case cpu_to_be16(XFS_DA3_NODE_MAGIC):
error = xfs_attr3_node_inactive(trans, dp, bp, 1);
+ if (error)
+ return error;
+
+ /*
+ * Empty root node block are not allowed, convert it to leaf.
+ */
+ error = xfs_attr3_leaf_init(*trans, dp, 0);
+ if (error)
+ return error;
+ error = xfs_trans_roll_inode(trans, dp);
break;
case cpu_to_be16(XFS_ATTR_LEAF_MAGIC):
case cpu_to_be16(XFS_ATTR3_LEAF_MAGIC):
@@ -295,21 +305,6 @@ xfs_attr3_root_inactive(
xfs_trans_brelse(*trans, bp);
break;
}
- if (error)
- return error;
-
- /*
- * Invalidate the incore copy of the root block.
- */
- error = xfs_trans_get_buf(*trans, mp->m_ddev_targp, blkno,
- XFS_FSB_TO_BB(mp, mp->m_attr_geo->fsbcount), 0, &bp);
- if (error)
- return error;
- xfs_trans_binval(*trans, bp); /* remove from cache */
- /*
- * Commit the invalidate and start the next transaction.
- */
- error = xfs_trans_roll_inode(trans, dp);
return error;
}
@@ -328,6 +323,7 @@ xfs_attr_inactive(
{
struct xfs_trans *trans;
struct xfs_mount *mp;
+ struct xfs_buf *bp;
int lock_mode = XFS_ILOCK_SHARED;
int error = 0;
@@ -363,10 +359,27 @@ xfs_attr_inactive(
* removal below.
*/
if (dp->i_af.if_nextents > 0) {
+ /*
+ * Invalidate and truncate all blocks but leave the root block.
+ */
error = xfs_attr3_root_inactive(&trans, dp);
if (error)
goto out_cancel;
+ error = xfs_itruncate_extents(&trans, dp, XFS_ATTR_FORK,
+ XFS_FSB_TO_B(mp, mp->m_attr_geo->fsbcount));
+ if (error)
+ goto out_cancel;
+
+ /*
+ * Invalidate and truncate the root block and ensure that the
+ * operation is completed within a single transaction.
+ */
+ error = xfs_da_get_buf(trans, dp, 0, &bp, XFS_ATTR_FORK);
+ if (error)
+ goto out_cancel;
+
+ xfs_trans_binval(trans, bp);
error = xfs_itruncate_extents(&trans, dp, XFS_ATTR_FORK, 0);
if (error)
goto out_cancel;
--
2.39.2
^ permalink raw reply related [flat|nested] 17+ messages in thread
* Re: [PATCH 1/4] xfs: only assert new size for datafork during truncate extents
2026-03-09 8:27 ` [PATCH 1/4] xfs: only assert new size for datafork during truncate extents Long Li
@ 2026-03-09 16:39 ` Darrick J. Wong
0 siblings, 0 replies; 17+ messages in thread
From: Darrick J. Wong @ 2026-03-09 16:39 UTC (permalink / raw)
To: Long Li; +Cc: cem, linux-xfs, david, yi.zhang, houtao1, yangerkun, lonuxli.64
On Mon, Mar 09, 2026 at 04:27:49PM +0800, Long Li wrote:
> The assertion functions properly because we currently only truncate the
> attr to a zero size. Any other new size of the attr is not preempted.
> Make this assertion is specific to the datafork, preparing for
> subsequent patches to truncate the attribute to a non-zero size.
>
> Signed-off-by: Long Li <leo.lilong@huawei.com>
LGTM
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
--D
> ---
> fs/xfs/xfs_inode.c | 3 ++-
> 1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
> index 50c0404f9064..beaa26ec62da 100644
> --- a/fs/xfs/xfs_inode.c
> +++ b/fs/xfs/xfs_inode.c
> @@ -1048,7 +1048,8 @@ xfs_itruncate_extents_flags(
> xfs_assert_ilocked(ip, XFS_ILOCK_EXCL);
> if (icount_read(VFS_I(ip)))
> xfs_assert_ilocked(ip, XFS_IOLOCK_EXCL);
> - ASSERT(new_size <= XFS_ISIZE(ip));
> + if (whichfork == XFS_DATA_FORK)
> + ASSERT(new_size <= XFS_ISIZE(ip));
> ASSERT(tp->t_flags & XFS_TRANS_PERM_LOG_RES);
> ASSERT(ip->i_itemp != NULL);
> ASSERT(ip->i_itemp->ili_lock_flags == 0);
> --
> 2.39.2
>
>
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH 2/4] xfs: factor out xfs_da3_node_entry_remove
2026-03-09 8:27 ` [PATCH 2/4] xfs: factor out xfs_da3_node_entry_remove Long Li
@ 2026-03-09 16:42 ` Darrick J. Wong
2026-03-10 2:15 ` Long Li
2026-03-10 11:58 ` Long Li
0 siblings, 2 replies; 17+ messages in thread
From: Darrick J. Wong @ 2026-03-09 16:42 UTC (permalink / raw)
To: Long Li; +Cc: cem, linux-xfs, david, yi.zhang, houtao1, yangerkun, lonuxli.64
On Mon, Mar 09, 2026 at 04:27:50PM +0800, Long Li wrote:
> Factor out wrapper xfs_da3_node_entry_remove function, which
> exported for external use.
>
> Signed-off-by: Long Li <leo.lilong@huawei.com>
> ---
> fs/xfs/libxfs/xfs_da_btree.c | 54 ++++++++++++++++++++++++++++--------
> fs/xfs/libxfs/xfs_da_btree.h | 2 ++
> 2 files changed, 45 insertions(+), 11 deletions(-)
>
> diff --git a/fs/xfs/libxfs/xfs_da_btree.c b/fs/xfs/libxfs/xfs_da_btree.c
> index 766631f0562e..466c43098768 100644
> --- a/fs/xfs/libxfs/xfs_da_btree.c
> +++ b/fs/xfs/libxfs/xfs_da_btree.c
> @@ -1506,21 +1506,20 @@ xfs_da3_fixhashpath(
> }
>
> /*
> - * Remove an entry from an intermediate node.
> + * Internal implementation to remove an entry from an intermediate node.
> */
> STATIC void
> -xfs_da3_node_remove(
> - struct xfs_da_state *state,
> - struct xfs_da_state_blk *drop_blk)
> +__xfs_da3_node_remove(
> + struct xfs_trans *tp,
> + struct xfs_inode *dp,
> + struct xfs_da_geometry *geo,
> + struct xfs_da_state_blk *drop_blk)
> {
> struct xfs_da_intnode *node;
> struct xfs_da3_icnode_hdr nodehdr;
> struct xfs_da_node_entry *btree;
> int index;
> int tmp;
> - struct xfs_inode *dp = state->args->dp;
> -
> - trace_xfs_da_node_remove(state->args);
>
> node = drop_blk->bp->b_addr;
> xfs_da3_node_hdr_from_disk(dp->i_mount, &nodehdr, node);
> @@ -1536,17 +1535,17 @@ xfs_da3_node_remove(
> tmp = nodehdr.count - index - 1;
> tmp *= (uint)sizeof(xfs_da_node_entry_t);
> memmove(&btree[index], &btree[index + 1], tmp);
> - xfs_trans_log_buf(state->args->trans, drop_blk->bp,
> + xfs_trans_log_buf(tp, drop_blk->bp,
> XFS_DA_LOGRANGE(node, &btree[index], tmp));
> index = nodehdr.count - 1;
> }
> memset(&btree[index], 0, sizeof(xfs_da_node_entry_t));
> - xfs_trans_log_buf(state->args->trans, drop_blk->bp,
> + xfs_trans_log_buf(tp, drop_blk->bp,
> XFS_DA_LOGRANGE(node, &btree[index], sizeof(btree[index])));
> nodehdr.count -= 1;
> xfs_da3_node_hdr_to_disk(dp->i_mount, node, &nodehdr);
> - xfs_trans_log_buf(state->args->trans, drop_blk->bp,
> - XFS_DA_LOGRANGE(node, &node->hdr, state->args->geo->node_hdr_size));
> + xfs_trans_log_buf(tp, drop_blk->bp,
> + XFS_DA_LOGRANGE(node, &node->hdr, geo->node_hdr_size));
>
> /*
> * Copy the last hash value from the block to propagate upwards.
> @@ -1554,6 +1553,39 @@ xfs_da3_node_remove(
> drop_blk->hashval = be32_to_cpu(btree[index - 1].hashval);
> }
>
> +/*
> + * Remove an entry from an intermediate node.
> + */
> +STATIC void
> +xfs_da3_node_remove(
> + struct xfs_da_state *state,
> + struct xfs_da_state_blk *drop_blk)
> +{
> + trace_xfs_da_node_remove(state->args);
Style nit: blank line after the tracepoint.
> + __xfs_da3_node_remove(state->args->trans, state->args->dp,
> + state->args->geo, drop_blk);
> +}
> +
> +/*
> + * Remove an entry from a node at the specified index, this is an exported
> + * wrapper for removing entries from intermediate nodes.
> + */
> +void
> +xfs_da3_node_entry_remove(
This only applies to attr(ibute) structures, as evidenced by m_attr_geo
below. I think this ought to be named xfs_attr3_node_entry_remove.
> + struct xfs_trans *tp,
> + struct xfs_inode *dp,
> + struct xfs_buf *bp,
> + int index)
> +{
> + struct xfs_da_state_blk blk;
struct xfs_da_state_blk blk = {
.index = index,
.bp = bp,
};
Otherwise this looks ok to me.
--D
> +
> + memset(&blk, 0, sizeof(blk));
> + blk.index = index;
> + blk.bp = bp;
> +
> + __xfs_da3_node_remove(tp, dp, dp->i_mount->m_attr_geo, &blk);
> +}
> +
> /*
> * Unbalance the elements between two intermediate nodes,
> * move all Btree elements from one node into another.
> diff --git a/fs/xfs/libxfs/xfs_da_btree.h b/fs/xfs/libxfs/xfs_da_btree.h
> index 354d5d65043e..6cec4313c83c 100644
> --- a/fs/xfs/libxfs/xfs_da_btree.h
> +++ b/fs/xfs/libxfs/xfs_da_btree.h
> @@ -184,6 +184,8 @@ int xfs_da3_split(xfs_da_state_t *state);
> int xfs_da3_join(xfs_da_state_t *state);
> void xfs_da3_fixhashpath(struct xfs_da_state *state,
> struct xfs_da_state_path *path_to_to_fix);
> +void xfs_da3_node_entry_remove(struct xfs_trans *tp, struct xfs_inode *dp,
> + struct xfs_buf *bp, int index);
>
> /*
> * Routines used for finding things in the Btree.
> --
> 2.39.2
>
>
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH 3/4] xfs: factor out xfs_attr3_leaf_init
2026-03-09 8:27 ` [PATCH 3/4] xfs: factor out xfs_attr3_leaf_init Long Li
@ 2026-03-09 16:44 ` Darrick J. Wong
2026-03-10 7:42 ` Long Li
0 siblings, 1 reply; 17+ messages in thread
From: Darrick J. Wong @ 2026-03-09 16:44 UTC (permalink / raw)
To: Long Li; +Cc: cem, linux-xfs, david, yi.zhang, houtao1, yangerkun, lonuxli.64
On Mon, Mar 09, 2026 at 04:27:51PM +0800, Long Li wrote:
> Factor out wrapper xfs_attr3_leaf_init function, which exported for
> external use.
>
> Signed-off-by: Long Li <leo.lilong@huawei.com>
> ---
> fs/xfs/libxfs/xfs_attr_leaf.c | 20 ++++++++++++++++++++
> fs/xfs/libxfs/xfs_attr_leaf.h | 3 +++
> 2 files changed, 23 insertions(+)
>
> diff --git a/fs/xfs/libxfs/xfs_attr_leaf.c b/fs/xfs/libxfs/xfs_attr_leaf.c
> index 47f48ae555c0..6599b53f4822 100644
> --- a/fs/xfs/libxfs/xfs_attr_leaf.c
> +++ b/fs/xfs/libxfs/xfs_attr_leaf.c
> @@ -1415,6 +1415,26 @@ xfs_attr3_leaf_create(
> return 0;
> }
>
> +/*
> + * Wrapper function of initializing contents of a leaf, export for external use.
> + */
> +int
> +xfs_attr3_leaf_init(
> + struct xfs_trans *tp,
> + struct xfs_inode *dp,
> + xfs_dablk_t blkno)
> +{
> + struct xfs_buf *bp = NULL;
> + struct xfs_da_args args;
Same thing as previous patch:
struct xfs_da_args args = {
.trans = tp,
...
};
Also I think this should just go in whichever patch uses the new
function since this is pretty trivial.
> +
> + memset(&args, 0, sizeof(args));
> + args.trans = tp;
> + args.dp = dp;
> + args.owner = dp->i_ino;
> + args.geo = dp->i_mount->m_attr_geo;
> +
> + return xfs_attr3_leaf_create(&args, blkno, &bp);
Ummm ... who releases the returned xfs_buf pointer?
--D
> +}
> /*
> * Split the leaf node, rebalance, then add the new entry.
> *
> diff --git a/fs/xfs/libxfs/xfs_attr_leaf.h b/fs/xfs/libxfs/xfs_attr_leaf.h
> index aca46da2bc50..72639efe6ac3 100644
> --- a/fs/xfs/libxfs/xfs_attr_leaf.h
> +++ b/fs/xfs/libxfs/xfs_attr_leaf.h
> @@ -87,6 +87,9 @@ int xfs_attr3_leaf_list_int(struct xfs_buf *bp,
> /*
> * Routines used for shrinking the Btree.
> */
> +
> +int xfs_attr3_leaf_init(struct xfs_trans *tp, struct xfs_inode *dp,
> + xfs_dablk_t blkno);
> int xfs_attr3_leaf_toosmall(struct xfs_da_state *state, int *retval);
> void xfs_attr3_leaf_unbalance(struct xfs_da_state *state,
> struct xfs_da_state_blk *drop_blk,
> --
> 2.39.2
>
>
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH 4/4] xfs: close crash window in attr dabtree inactivation
2026-03-09 8:27 ` [PATCH 4/4] xfs: close crash window in attr dabtree inactivation Long Li
@ 2026-03-09 16:59 ` Darrick J. Wong
2026-03-10 8:19 ` Long Li
0 siblings, 1 reply; 17+ messages in thread
From: Darrick J. Wong @ 2026-03-09 16:59 UTC (permalink / raw)
To: Long Li; +Cc: cem, linux-xfs, david, yi.zhang, houtao1, yangerkun, lonuxli.64
On Mon, Mar 09, 2026 at 04:27:52PM +0800, Long Li wrote:
> When inactivating an inode with node-format extended attributes,
> xfs_attr3_node_inactive() invalidates all child leaf/node blocks via
> xfs_trans_binval(), but intentionally does not remove the corresponding
> entries from their parent node blocks. The implicit assumption is that
> xfs_attr_inactive() will truncate the entire attr fork to zero extents
> afterwards, so log recovery will never reach the root node and follow
> those stale pointers.
>
> However, if a log shutdown occurs after the child block cancellations
> commit but before the attr bmap truncation commits, this assumption
> breaks. Recovery replays the attr bmap intact (the inode still has
> attr fork extents), but suppresses replay of all cancelled child
> blocks, maybe leaving them as stale data on disk. On the next mount,
> xlog_recover_process_iunlinks() retries inactivation and attempts to
> read the root node via the attr bmap. If the root node was not replayed,
> reading the unreplayed root block triggers a metadata verification
> failure immediately; if it was replayed, following its child pointers
> to unreplayed child blocks triggers the same failure:
>
> XFS (pmem0): Metadata corruption detected at
> xfs_da3_node_read_verify+0x53/0x220, xfs_da3_node block 0x78
> XFS (pmem0): Unmount and run xfs_repair
> XFS (pmem0): First 128 bytes of corrupted metadata buffer:
> 00000000: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
> 00000010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
> 00000020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
> 00000030: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
> 00000040: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
> 00000050: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
> 00000060: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
> 00000070: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
> XFS (pmem0): metadata I/O error in "xfs_da_read_buf+0x104/0x190" at daddr 0x78 len 8 error 117
Did you hit this through a customer issue? Or is this the same "corrupt
block 0 of inode 25165954 attribute fork" problem exposed by generic/753
last week? Or possibly both?
https://lore.kernel.org/linux-xfs/CAF-d4Oscq=qaCd9dbbEZjG8dA5Q7erdWSszoxY1migM8j85eRw@mail.gmail.com/
> Fix this in two places:
>
> In xfs_attr3_node_inactive(), after calling xfs_trans_binval() on a
> child block, immediately remove the entry that references it from the
> parent node in the same transaction. This eliminates the window where
> the parent holds a pointer to a cancelled block. Once all children are
> removed, the now-empty root node is converted to a leaf block within the
> same transaction. This node-to-leaf conversion is necessary for crash
> safety. If the system shutdown after the empty node is written to the
> log but before the second-phase bmap truncation commits, log recovery
> will attempt to verify the root block on disk. xfs_da3_node_verify()
> does not permit a node block with count == 0; such a block will fail
> verification and trigger a metadata corruption shutdown. on the other
> hand, leaf blocks are allowed to have this transient state.
Hrmmm... this really does sound like the "corrupt block 0" problem
referenced above.
> In xfs_attr_inactive(), split the attr fork truncation into two explicit
> phases. First, truncate all extents beyond the root block (the child
> extents whose parent references have already been removed above).
> Second, invalidate the root block and truncate the attr bmap to zero in
> a single transaction. The two operations in the second phase must be
> atomic: as long as the attr bmap has any non-zero length, recovery can
> follow it to the root block, so the root block invalidation must commit
> together with the bmap-to-zero truncation.
>
> Signed-off-by: Long Li <leo.lilong@huawei.com>
> ---
> fs/xfs/xfs_attr_inactive.c | 97 +++++++++++++++++++++-----------------
> 1 file changed, 55 insertions(+), 42 deletions(-)
>
> diff --git a/fs/xfs/xfs_attr_inactive.c b/fs/xfs/xfs_attr_inactive.c
> index 92331991f9fd..2ffa6b51a356 100644
> --- a/fs/xfs/xfs_attr_inactive.c
> +++ b/fs/xfs/xfs_attr_inactive.c
> @@ -140,7 +140,7 @@ xfs_attr3_node_inactive(
> xfs_daddr_t parent_blkno, child_blkno;
> struct xfs_buf *child_bp;
> struct xfs_da3_icnode_hdr ichdr;
> - int error, i;
> + int error;
>
> /*
> * Since this code is recursive (gasp!) we must protect ourselves.
> @@ -167,7 +167,7 @@ xfs_attr3_node_inactive(
> * over the leaves removing all of them. If this is higher up
> * in the tree, recurse downward.
> */
> - for (i = 0; i < ichdr.count; i++) {
> + while (ichdr.count > 0) {
> /*
> * Read the subsidiary block to see what we have to work with.
> * Don't do this in a transaction. This is a depth-first
> @@ -218,29 +218,32 @@ xfs_attr3_node_inactive(
> xfs_trans_binval(*trans, child_bp);
> child_bp = NULL;
>
> - /*
> - * If we're not done, re-read the parent to get the next
> - * child block number.
> - */
> - if (i + 1 < ichdr.count) {
> - struct xfs_da3_icnode_hdr phdr;
> -
> - error = xfs_da3_node_read_mapped(*trans, dp,
> - parent_blkno, &bp, XFS_ATTR_FORK);
> - if (error)
> - return error;
> - xfs_da3_node_hdr_from_disk(dp->i_mount, &phdr,
> - bp->b_addr);
> - child_fsb = be32_to_cpu(phdr.btree[i + 1].before);
> - xfs_trans_brelse(*trans, bp);
> - bp = NULL;
> - }
> - /*
> - * Atomically commit the whole invalidate stuff.
> - */
> - error = xfs_trans_roll_inode(trans, dp);
> + error = xfs_da3_node_read_mapped(*trans, dp,
> + parent_blkno, &bp, XFS_ATTR_FORK);
> if (error)
> - return error;
> + return error;
> +
> + /*
> + * Remove entry form parent node, prevents being indexed to.
> + */
> + xfs_da3_node_entry_remove(*trans, dp, bp, 0);
> +
> + xfs_da3_node_hdr_from_disk(dp->i_mount, &ichdr, bp->b_addr);
> + bp = NULL;
> +
> + if (ichdr.count > 0) {
> + /*
> + * If we're not done, get the next child block number.
> + */
> + child_fsb = be32_to_cpu(ichdr.btree[0].before);
> +
> + /*
> + * Atomically commit the whole invalidate stuff.
> + */
> + error = xfs_trans_roll_inode(trans, dp);
> + if (error)
> + return error;
Don't move the double space down to ^^ here. ;)
> + }
> }
>
> return 0;
> @@ -257,10 +260,8 @@ xfs_attr3_root_inactive(
> struct xfs_trans **trans,
> struct xfs_inode *dp)
> {
> - struct xfs_mount *mp = dp->i_mount;
> struct xfs_da_blkinfo *info;
> struct xfs_buf *bp;
> - xfs_daddr_t blkno;
> int error;
>
> /*
> @@ -272,7 +273,6 @@ xfs_attr3_root_inactive(
> error = xfs_da3_node_read(*trans, dp, 0, &bp, XFS_ATTR_FORK);
> if (error)
> return error;
> - blkno = xfs_buf_daddr(bp);
>
> /*
> * Invalidate the tree, even if the "tree" is only a single leaf block.
> @@ -283,6 +283,16 @@ xfs_attr3_root_inactive(
> case cpu_to_be16(XFS_DA_NODE_MAGIC):
> case cpu_to_be16(XFS_DA3_NODE_MAGIC):
> error = xfs_attr3_node_inactive(trans, dp, bp, 1);
> + if (error)
> + return error;
> +
> + /*
> + * Empty root node block are not allowed, convert it to leaf.
> + */
> + error = xfs_attr3_leaf_init(*trans, dp, 0);
Responding to my own question: Ah, I see -- "leaf init" doesn't use the
bp anymore and it's attached to the transaction so it doesn't leak.
That's a little subtle since there's nothing preventing someone from
calling xfs_attr3_leaf_init(NULL, dp, 0).
> + if (error)
> + return error;
> + error = xfs_trans_roll_inode(trans, dp);
If we have an xattr structure with multiple levels of dabtree nodes, can
this lead to the somewhat odd situation where the tree levels are
uneven during deconstruction? For example
root
/ \
node empty_leaf
| \
| \
node node
| \
leaves more_leaves
Does this matter, or can the inactivation code already handle it? I
suppose since we're inactivating (either in inodegc or in recovery after
a crash) user programs will never see this so the window of confusion
might be pretty small.
--D
> break;
> case cpu_to_be16(XFS_ATTR_LEAF_MAGIC):
> case cpu_to_be16(XFS_ATTR3_LEAF_MAGIC):
> @@ -295,21 +305,6 @@ xfs_attr3_root_inactive(
> xfs_trans_brelse(*trans, bp);
> break;
> }
> - if (error)
> - return error;
> -
> - /*
> - * Invalidate the incore copy of the root block.
> - */
> - error = xfs_trans_get_buf(*trans, mp->m_ddev_targp, blkno,
> - XFS_FSB_TO_BB(mp, mp->m_attr_geo->fsbcount), 0, &bp);
> - if (error)
> - return error;
> - xfs_trans_binval(*trans, bp); /* remove from cache */
> - /*
> - * Commit the invalidate and start the next transaction.
> - */
> - error = xfs_trans_roll_inode(trans, dp);
>
> return error;
> }
> @@ -328,6 +323,7 @@ xfs_attr_inactive(
> {
> struct xfs_trans *trans;
> struct xfs_mount *mp;
> + struct xfs_buf *bp;
> int lock_mode = XFS_ILOCK_SHARED;
> int error = 0;
>
> @@ -363,10 +359,27 @@ xfs_attr_inactive(
> * removal below.
> */
> if (dp->i_af.if_nextents > 0) {
> + /*
> + * Invalidate and truncate all blocks but leave the root block.
> + */
> error = xfs_attr3_root_inactive(&trans, dp);
> if (error)
> goto out_cancel;
>
> + error = xfs_itruncate_extents(&trans, dp, XFS_ATTR_FORK,
> + XFS_FSB_TO_B(mp, mp->m_attr_geo->fsbcount));
> + if (error)
> + goto out_cancel;
> +
> + /*
> + * Invalidate and truncate the root block and ensure that the
> + * operation is completed within a single transaction.
> + */
> + error = xfs_da_get_buf(trans, dp, 0, &bp, XFS_ATTR_FORK);
> + if (error)
> + goto out_cancel;
> +
> + xfs_trans_binval(trans, bp);
> error = xfs_itruncate_extents(&trans, dp, XFS_ATTR_FORK, 0);
> if (error)
> goto out_cancel;
> --
> 2.39.2
>
>
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH 2/4] xfs: factor out xfs_da3_node_entry_remove
2026-03-09 16:42 ` Darrick J. Wong
@ 2026-03-10 2:15 ` Long Li
2026-03-10 11:58 ` Long Li
1 sibling, 0 replies; 17+ messages in thread
From: Long Li @ 2026-03-10 2:15 UTC (permalink / raw)
To: Darrick J. Wong
Cc: cem, linux-xfs, david, yi.zhang, houtao1, yangerkun, lonuxli.64
On Mon, Mar 09, 2026 at 09:42:29AM -0700, Darrick J. Wong wrote:
> On Mon, Mar 09, 2026 at 04:27:50PM +0800, Long Li wrote:
> > Factor out wrapper xfs_da3_node_entry_remove function, which
> > exported for external use.
> >
> > Signed-off-by: Long Li <leo.lilong@huawei.com>
> > ---
> > fs/xfs/libxfs/xfs_da_btree.c | 54 ++++++++++++++++++++++++++++--------
> > fs/xfs/libxfs/xfs_da_btree.h | 2 ++
> > 2 files changed, 45 insertions(+), 11 deletions(-)
> >
> > diff --git a/fs/xfs/libxfs/xfs_da_btree.c b/fs/xfs/libxfs/xfs_da_btree.c
> > index 766631f0562e..466c43098768 100644
> > --- a/fs/xfs/libxfs/xfs_da_btree.c
> > +++ b/fs/xfs/libxfs/xfs_da_btree.c
> > @@ -1506,21 +1506,20 @@ xfs_da3_fixhashpath(
> > }
> >
> > /*
> > - * Remove an entry from an intermediate node.
> > + * Internal implementation to remove an entry from an intermediate node.
> > */
> > STATIC void
> > -xfs_da3_node_remove(
> > - struct xfs_da_state *state,
> > - struct xfs_da_state_blk *drop_blk)
> > +__xfs_da3_node_remove(
> > + struct xfs_trans *tp,
> > + struct xfs_inode *dp,
> > + struct xfs_da_geometry *geo,
> > + struct xfs_da_state_blk *drop_blk)
> > {
> > struct xfs_da_intnode *node;
> > struct xfs_da3_icnode_hdr nodehdr;
> > struct xfs_da_node_entry *btree;
> > int index;
> > int tmp;
> > - struct xfs_inode *dp = state->args->dp;
> > -
> > - trace_xfs_da_node_remove(state->args);
> >
> > node = drop_blk->bp->b_addr;
> > xfs_da3_node_hdr_from_disk(dp->i_mount, &nodehdr, node);
> > @@ -1536,17 +1535,17 @@ xfs_da3_node_remove(
> > tmp = nodehdr.count - index - 1;
> > tmp *= (uint)sizeof(xfs_da_node_entry_t);
> > memmove(&btree[index], &btree[index + 1], tmp);
> > - xfs_trans_log_buf(state->args->trans, drop_blk->bp,
> > + xfs_trans_log_buf(tp, drop_blk->bp,
> > XFS_DA_LOGRANGE(node, &btree[index], tmp));
> > index = nodehdr.count - 1;
> > }
> > memset(&btree[index], 0, sizeof(xfs_da_node_entry_t));
> > - xfs_trans_log_buf(state->args->trans, drop_blk->bp,
> > + xfs_trans_log_buf(tp, drop_blk->bp,
> > XFS_DA_LOGRANGE(node, &btree[index], sizeof(btree[index])));
> > nodehdr.count -= 1;
> > xfs_da3_node_hdr_to_disk(dp->i_mount, node, &nodehdr);
> > - xfs_trans_log_buf(state->args->trans, drop_blk->bp,
> > - XFS_DA_LOGRANGE(node, &node->hdr, state->args->geo->node_hdr_size));
> > + xfs_trans_log_buf(tp, drop_blk->bp,
> > + XFS_DA_LOGRANGE(node, &node->hdr, geo->node_hdr_size));
> >
> > /*
> > * Copy the last hash value from the block to propagate upwards.
> > @@ -1554,6 +1553,39 @@ xfs_da3_node_remove(
> > drop_blk->hashval = be32_to_cpu(btree[index - 1].hashval);
> > }
> >
> > +/*
> > + * Remove an entry from an intermediate node.
> > + */
> > +STATIC void
> > +xfs_da3_node_remove(
> > + struct xfs_da_state *state,
> > + struct xfs_da_state_blk *drop_blk)
> > +{
> > + trace_xfs_da_node_remove(state->args);
>
> Style nit: blank line after the tracepoint.
>
> > + __xfs_da3_node_remove(state->args->trans, state->args->dp,
> > + state->args->geo, drop_blk);
> > +}
> > +
> > +/*
> > + * Remove an entry from a node at the specified index, this is an exported
> > + * wrapper for removing entries from intermediate nodes.
> > + */
> > +void
> > +xfs_da3_node_entry_remove(
>
> This only applies to attr(ibute) structures, as evidenced by m_attr_geo
> below. I think this ought to be named xfs_attr3_node_entry_remove.
>
Agree with you, I will update it in the next version.
Thanks,
Long Li
> > + struct xfs_trans *tp,
> > + struct xfs_inode *dp,
> > + struct xfs_buf *bp,
> > + int index)
> > +{
> > + struct xfs_da_state_blk blk;
>
> struct xfs_da_state_blk blk = {
> .index = index,
> .bp = bp,
> };
>
> Otherwise this looks ok to me.
>
> --D
>
> > +
> > + memset(&blk, 0, sizeof(blk));
> > + blk.index = index;
> > + blk.bp = bp;
> > +
> > + __xfs_da3_node_remove(tp, dp, dp->i_mount->m_attr_geo, &blk);
> > +}
> > +
> > /*
> > * Unbalance the elements between two intermediate nodes,
> > * move all Btree elements from one node into another.
> > diff --git a/fs/xfs/libxfs/xfs_da_btree.h b/fs/xfs/libxfs/xfs_da_btree.h
> > index 354d5d65043e..6cec4313c83c 100644
> > --- a/fs/xfs/libxfs/xfs_da_btree.h
> > +++ b/fs/xfs/libxfs/xfs_da_btree.h
> > @@ -184,6 +184,8 @@ int xfs_da3_split(xfs_da_state_t *state);
> > int xfs_da3_join(xfs_da_state_t *state);
> > void xfs_da3_fixhashpath(struct xfs_da_state *state,
> > struct xfs_da_state_path *path_to_to_fix);
> > +void xfs_da3_node_entry_remove(struct xfs_trans *tp, struct xfs_inode *dp,
> > + struct xfs_buf *bp, int index);
> >
> > /*
> > * Routines used for finding things in the Btree.
> > --
> > 2.39.2
> >
> >
>
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH 3/4] xfs: factor out xfs_attr3_leaf_init
2026-03-09 16:44 ` Darrick J. Wong
@ 2026-03-10 7:42 ` Long Li
0 siblings, 0 replies; 17+ messages in thread
From: Long Li @ 2026-03-10 7:42 UTC (permalink / raw)
To: Darrick J. Wong
Cc: cem, linux-xfs, david, yi.zhang, houtao1, yangerkun, lonuxli.64
On Mon, Mar 09, 2026 at 09:44:37AM -0700, Darrick J. Wong wrote:
> On Mon, Mar 09, 2026 at 04:27:51PM +0800, Long Li wrote:
> > Factor out wrapper xfs_attr3_leaf_init function, which exported for
> > external use.
> >
> > Signed-off-by: Long Li <leo.lilong@huawei.com>
> > ---
> > fs/xfs/libxfs/xfs_attr_leaf.c | 20 ++++++++++++++++++++
> > fs/xfs/libxfs/xfs_attr_leaf.h | 3 +++
> > 2 files changed, 23 insertions(+)
> >
> > diff --git a/fs/xfs/libxfs/xfs_attr_leaf.c b/fs/xfs/libxfs/xfs_attr_leaf.c
> > index 47f48ae555c0..6599b53f4822 100644
> > --- a/fs/xfs/libxfs/xfs_attr_leaf.c
> > +++ b/fs/xfs/libxfs/xfs_attr_leaf.c
> > @@ -1415,6 +1415,26 @@ xfs_attr3_leaf_create(
> > return 0;
> > }
> >
> > +/*
> > + * Wrapper function of initializing contents of a leaf, export for external use.
> > + */
> > +int
> > +xfs_attr3_leaf_init(
> > + struct xfs_trans *tp,
> > + struct xfs_inode *dp,
> > + xfs_dablk_t blkno)
> > +{
> > + struct xfs_buf *bp = NULL;
> > + struct xfs_da_args args;
>
> Same thing as previous patch:
>
> struct xfs_da_args args = {
> .trans = tp,
> ...
> };
>
> Also I think this should just go in whichever patch uses the new
> function since this is pretty trivial.
>
Thanks for point that, it will be update in next version.
Long Li
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH 4/4] xfs: close crash window in attr dabtree inactivation
2026-03-09 16:59 ` Darrick J. Wong
@ 2026-03-10 8:19 ` Long Li
2026-03-10 14:46 ` Darrick J. Wong
0 siblings, 1 reply; 17+ messages in thread
From: Long Li @ 2026-03-10 8:19 UTC (permalink / raw)
To: Darrick J. Wong
Cc: cem, linux-xfs, david, yi.zhang, houtao1, yangerkun, lonuxli.64
On Mon, Mar 09, 2026 at 09:59:33AM -0700, Darrick J. Wong wrote:
> On Mon, Mar 09, 2026 at 04:27:52PM +0800, Long Li wrote:
> > When inactivating an inode with node-format extended attributes,
> > xfs_attr3_node_inactive() invalidates all child leaf/node blocks via
> > xfs_trans_binval(), but intentionally does not remove the corresponding
> > entries from their parent node blocks. The implicit assumption is that
> > xfs_attr_inactive() will truncate the entire attr fork to zero extents
> > afterwards, so log recovery will never reach the root node and follow
> > those stale pointers.
> >
> > However, if a log shutdown occurs after the child block cancellations
> > commit but before the attr bmap truncation commits, this assumption
> > breaks. Recovery replays the attr bmap intact (the inode still has
> > attr fork extents), but suppresses replay of all cancelled child
> > blocks, maybe leaving them as stale data on disk. On the next mount,
> > xlog_recover_process_iunlinks() retries inactivation and attempts to
> > read the root node via the attr bmap. If the root node was not replayed,
> > reading the unreplayed root block triggers a metadata verification
> > failure immediately; if it was replayed, following its child pointers
> > to unreplayed child blocks triggers the same failure:
> >
> > XFS (pmem0): Metadata corruption detected at
> > xfs_da3_node_read_verify+0x53/0x220, xfs_da3_node block 0x78
> > XFS (pmem0): Unmount and run xfs_repair
> > XFS (pmem0): First 128 bytes of corrupted metadata buffer:
> > 00000000: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
> > 00000010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
> > 00000020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
> > 00000030: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
> > 00000040: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
> > 00000050: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
> > 00000060: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
> > 00000070: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
> > XFS (pmem0): metadata I/O error in "xfs_da_read_buf+0x104/0x190" at daddr 0x78 len 8 error 117
>
> Did you hit this through a customer issue? Or is this the same "corrupt
> block 0 of inode 25165954 attribute fork" problem exposed by generic/753
> last week? Or possibly both?
>
> https://lore.kernel.org/linux-xfs/CAF-d4Oscq=qaCd9dbbEZjG8dA5Q7erdWSszoxY1migM8j85eRw@mail.gmail.com/
We encountered this issue while performing disk fault injection tests,
rather than through the generic/753. When I construct the problem
and use xfs_repair to repair it, the error message "corrupt block 0" can
be reported as follows:
Metadata corruption detected at 0x452a9c, xfs_da3_node block 0x78/0x1000
corrupt block 0 of inode 131 attribute fork
problem with attribute contents in inode 131
clearing inode 131 attributes
correcting nblocks for inode 131, was 1 - counted 0
So the problem you encountered before might be this issue.
>
> > Fix this in two places:
> >
> > In xfs_attr3_node_inactive(), after calling xfs_trans_binval() on a
> > child block, immediately remove the entry that references it from the
> > parent node in the same transaction. This eliminates the window where
> > the parent holds a pointer to a cancelled block. Once all children are
> > removed, the now-empty root node is converted to a leaf block within the
> > same transaction. This node-to-leaf conversion is necessary for crash
> > safety. If the system shutdown after the empty node is written to the
> > log but before the second-phase bmap truncation commits, log recovery
> > will attempt to verify the root block on disk. xfs_da3_node_verify()
> > does not permit a node block with count == 0; such a block will fail
> > verification and trigger a metadata corruption shutdown. on the other
> > hand, leaf blocks are allowed to have this transient state.
>
> Hrmmm... this really does sound like the "corrupt block 0" problem
> referenced above.
>
> > In xfs_attr_inactive(), split the attr fork truncation into two explicit
> > phases. First, truncate all extents beyond the root block (the child
> > extents whose parent references have already been removed above).
> > Second, invalidate the root block and truncate the attr bmap to zero in
> > a single transaction. The two operations in the second phase must be
> > atomic: as long as the attr bmap has any non-zero length, recovery can
> > follow it to the root block, so the root block invalidation must commit
> > together with the bmap-to-zero truncation.
> >
> > Signed-off-by: Long Li <leo.lilong@huawei.com>
......
> > @@ -283,6 +283,16 @@ xfs_attr3_root_inactive(
> > case cpu_to_be16(XFS_DA_NODE_MAGIC):
> > case cpu_to_be16(XFS_DA3_NODE_MAGIC):
> > error = xfs_attr3_node_inactive(trans, dp, bp, 1);
> > + if (error)
> > + return error;
> > +
> > + /*
> > + * Empty root node block are not allowed, convert it to leaf.
> > + */
> > + error = xfs_attr3_leaf_init(*trans, dp, 0);
>
> Responding to my own question: Ah, I see -- "leaf init" doesn't use the
> bp anymore and it's attached to the transaction so it doesn't leak.
> That's a little subtle since there's nothing preventing someone from
> calling xfs_attr3_leaf_init(NULL, dp, 0).
Indeed, there should be an increase in explanatory comments and an empty
check for the tp.
>
> > + if (error)
> > + return error;
> > + error = xfs_trans_roll_inode(trans, dp);
>
> If we have an xattr structure with multiple levels of dabtree nodes, can
> this lead to the somewhat odd situation where the tree levels are
> uneven during deconstruction? For example
>
> root
> / \
> node empty_leaf
> | \
> | \
> node node
> | \
> leaves more_leaves
>
> Does this matter, or can the inactivation code already handle it? I
> suppose since we're inactivating (either in inodegc or in recovery after
> a crash) user programs will never see this so the window of confusion
> might be pretty small.
>
> --D
For the reason of simplicity and efficiency, the dead code does not consider
this tree imbalance scenario, and I understand that this would not cause
any practical issues.
Best regards,
Long Li
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH 2/4] xfs: factor out xfs_da3_node_entry_remove
2026-03-09 16:42 ` Darrick J. Wong
2026-03-10 2:15 ` Long Li
@ 2026-03-10 11:58 ` Long Li
2026-03-10 14:38 ` Darrick J. Wong
1 sibling, 1 reply; 17+ messages in thread
From: Long Li @ 2026-03-10 11:58 UTC (permalink / raw)
To: Darrick J. Wong
Cc: cem, linux-xfs, david, yi.zhang, houtao1, yangerkun, lonuxli.64
On Mon, Mar 09, 2026 at 09:42:29AM -0700, Darrick J. Wong wrote:
>
> > + __xfs_da3_node_remove(state->args->trans, state->args->dp,
> > + state->args->geo, drop_blk);
> > +}
> > +
> > +/*
> > + * Remove an entry from a node at the specified index, this is an exported
> > + * wrapper for removing entries from intermediate nodes.
> > + */
> > +void
> > +xfs_da3_node_entry_remove(
>
> This only applies to attr(ibute) structures, as evidenced by m_attr_geo
> below. I think this ought to be named xfs_attr3_node_entry_remove.
>
Considering that xfs_da_btree.c implements functions related to xfs_da*,
it might not be appropriate to place xfs_attr3_node_entry_remove here either.
I think we could add an additional `struct xfs_da_geometry *geo` parameter
to xfs_da3_node_entry_remove() and let it be specified externally, which
would increase the function's generality. What do you think?
Thanks,
Long Li
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH 2/4] xfs: factor out xfs_da3_node_entry_remove
2026-03-10 11:58 ` Long Li
@ 2026-03-10 14:38 ` Darrick J. Wong
0 siblings, 0 replies; 17+ messages in thread
From: Darrick J. Wong @ 2026-03-10 14:38 UTC (permalink / raw)
To: Long Li; +Cc: cem, linux-xfs, david, yi.zhang, houtao1, yangerkun, lonuxli.64
On Tue, Mar 10, 2026 at 07:58:23PM +0800, Long Li wrote:
> On Mon, Mar 09, 2026 at 09:42:29AM -0700, Darrick J. Wong wrote:
> >
> > > + __xfs_da3_node_remove(state->args->trans, state->args->dp,
> > > + state->args->geo, drop_blk);
> > > +}
> > > +
> > > +/*
> > > + * Remove an entry from a node at the specified index, this is an exported
> > > + * wrapper for removing entries from intermediate nodes.
> > > + */
> > > +void
> > > +xfs_da3_node_entry_remove(
> >
> > This only applies to attr(ibute) structures, as evidenced by m_attr_geo
> > below. I think this ought to be named xfs_attr3_node_entry_remove.
> >
>
> Considering that xfs_da_btree.c implements functions related to xfs_da*,
> it might not be appropriate to place xfs_attr3_node_entry_remove here either.
> I think we could add an additional `struct xfs_da_geometry *geo` parameter
> to xfs_da3_node_entry_remove() and let it be specified externally, which
> would increase the function's generality. What do you think?
Why does it need to be general? Don't go adding code for a user
(directories) that doesn't exist.
--D
>
> Thanks,
> Long Li
>
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH 4/4] xfs: close crash window in attr dabtree inactivation
2026-03-10 8:19 ` Long Li
@ 2026-03-10 14:46 ` Darrick J. Wong
2026-03-11 2:34 ` Long Li
0 siblings, 1 reply; 17+ messages in thread
From: Darrick J. Wong @ 2026-03-10 14:46 UTC (permalink / raw)
To: Long Li; +Cc: cem, linux-xfs, david, yi.zhang, houtao1, yangerkun, lonuxli.64
On Tue, Mar 10, 2026 at 04:19:33PM +0800, Long Li wrote:
> On Mon, Mar 09, 2026 at 09:59:33AM -0700, Darrick J. Wong wrote:
> > On Mon, Mar 09, 2026 at 04:27:52PM +0800, Long Li wrote:
> > > When inactivating an inode with node-format extended attributes,
> > > xfs_attr3_node_inactive() invalidates all child leaf/node blocks via
> > > xfs_trans_binval(), but intentionally does not remove the corresponding
> > > entries from their parent node blocks. The implicit assumption is that
> > > xfs_attr_inactive() will truncate the entire attr fork to zero extents
> > > afterwards, so log recovery will never reach the root node and follow
> > > those stale pointers.
> > >
> > > However, if a log shutdown occurs after the child block cancellations
> > > commit but before the attr bmap truncation commits, this assumption
> > > breaks. Recovery replays the attr bmap intact (the inode still has
> > > attr fork extents), but suppresses replay of all cancelled child
> > > blocks, maybe leaving them as stale data on disk. On the next mount,
> > > xlog_recover_process_iunlinks() retries inactivation and attempts to
> > > read the root node via the attr bmap. If the root node was not replayed,
> > > reading the unreplayed root block triggers a metadata verification
> > > failure immediately; if it was replayed, following its child pointers
> > > to unreplayed child blocks triggers the same failure:
> > >
> > > XFS (pmem0): Metadata corruption detected at
> > > xfs_da3_node_read_verify+0x53/0x220, xfs_da3_node block 0x78
> > > XFS (pmem0): Unmount and run xfs_repair
> > > XFS (pmem0): First 128 bytes of corrupted metadata buffer:
> > > 00000000: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
> > > 00000010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
> > > 00000020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
> > > 00000030: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
> > > 00000040: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
> > > 00000050: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
> > > 00000060: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
> > > 00000070: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
> > > XFS (pmem0): metadata I/O error in "xfs_da_read_buf+0x104/0x190" at daddr 0x78 len 8 error 117
> >
> > Did you hit this through a customer issue? Or is this the same "corrupt
> > block 0 of inode 25165954 attribute fork" problem exposed by generic/753
> > last week? Or possibly both?
> >
> > https://lore.kernel.org/linux-xfs/CAF-d4Oscq=qaCd9dbbEZjG8dA5Q7erdWSszoxY1migM8j85eRw@mail.gmail.com/
>
> We encountered this issue while performing disk fault injection tests,
> rather than through the generic/753. When I construct the problem
> and use xfs_repair to repair it, the error message "corrupt block 0" can
> be reported as follows:
>
> Metadata corruption detected at 0x452a9c, xfs_da3_node block 0x78/0x1000
> corrupt block 0 of inode 131 attribute fork
> problem with attribute contents in inode 131
> clearing inode 131 attributes
> correcting nblocks for inode 131, was 1 - counted 0
>
> So the problem you encountered before might be this issue.
>
> >
> > > Fix this in two places:
> > >
> > > In xfs_attr3_node_inactive(), after calling xfs_trans_binval() on a
> > > child block, immediately remove the entry that references it from the
> > > parent node in the same transaction. This eliminates the window where
> > > the parent holds a pointer to a cancelled block. Once all children are
> > > removed, the now-empty root node is converted to a leaf block within the
> > > same transaction. This node-to-leaf conversion is necessary for crash
> > > safety. If the system shutdown after the empty node is written to the
> > > log but before the second-phase bmap truncation commits, log recovery
> > > will attempt to verify the root block on disk. xfs_da3_node_verify()
> > > does not permit a node block with count == 0; such a block will fail
> > > verification and trigger a metadata corruption shutdown. on the other
> > > hand, leaf blocks are allowed to have this transient state.
> >
> > Hrmmm... this really does sound like the "corrupt block 0" problem
> > referenced above.
> >
> > > In xfs_attr_inactive(), split the attr fork truncation into two explicit
> > > phases. First, truncate all extents beyond the root block (the child
> > > extents whose parent references have already been removed above).
> > > Second, invalidate the root block and truncate the attr bmap to zero in
> > > a single transaction. The two operations in the second phase must be
> > > atomic: as long as the attr bmap has any non-zero length, recovery can
> > > follow it to the root block, so the root block invalidation must commit
> > > together with the bmap-to-zero truncation.
> > >
> > > Signed-off-by: Long Li <leo.lilong@huawei.com>
>
> ......
>
> > > @@ -283,6 +283,16 @@ xfs_attr3_root_inactive(
> > > case cpu_to_be16(XFS_DA_NODE_MAGIC):
> > > case cpu_to_be16(XFS_DA3_NODE_MAGIC):
> > > error = xfs_attr3_node_inactive(trans, dp, bp, 1);
> > > + if (error)
> > > + return error;
> > > +
> > > + /*
> > > + * Empty root node block are not allowed, convert it to leaf.
> > > + */
> > > + error = xfs_attr3_leaf_init(*trans, dp, 0);
> >
> > Responding to my own question: Ah, I see -- "leaf init" doesn't use the
> > bp anymore and it's attached to the transaction so it doesn't leak.
> > That's a little subtle since there's nothing preventing someone from
> > calling xfs_attr3_leaf_init(NULL, dp, 0).
>
> Indeed, there should be an increase in explanatory comments and an empty
> check for the tp.
>
> >
> > > + if (error)
> > > + return error;
> > > + error = xfs_trans_roll_inode(trans, dp);
> >
> > If we have an xattr structure with multiple levels of dabtree nodes, can
> > this lead to the somewhat odd situation where the tree levels are
> > uneven during deconstruction? For example
> >
> > root
> > / \
> > node empty_leaf
> > | \
> > | \
> > node node
> > | \
> > leaves more_leaves
> >
> > Does this matter, or can the inactivation code already handle it? I
> > suppose since we're inactivating (either in inodegc or in recovery after
> > a crash) user programs will never see this so the window of confusion
> > might be pretty small.
> >
> > --D
>
> For the reason of simplicity and efficiency, the dead code does not consider
> this tree imbalance scenario, and I understand that this would not cause
> any practical issues.
"dead" code? Did you mean to say that the code that the patch removes
did not consider this imbalanced tree scenario? "Dead" code usually
refers to code that's still in the codebase that nobody calls or
executes.
I /think/ what I'm hearing is that the the attr inactivation code can
handle the system going down midway through removing the attr structure,
so it doesn't matter if log recovery finds an imbalanced tree because
all it's really doing is walking the dabtree to each leaf and nuking the
path to that leaf; and now it'll also rewrite each emptied-out node with
an empty leaf to ensure that a second log recovery can resume if the
first recovery attempt fails due to system crash etc.
(Is that correct?)
--D
>
> Best regards,
> Long Li
>
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH 4/4] xfs: close crash window in attr dabtree inactivation
2026-03-10 14:46 ` Darrick J. Wong
@ 2026-03-11 2:34 ` Long Li
2026-03-13 14:46 ` Darrick J. Wong
0 siblings, 1 reply; 17+ messages in thread
From: Long Li @ 2026-03-11 2:34 UTC (permalink / raw)
To: Darrick J. Wong
Cc: cem, linux-xfs, david, yi.zhang, houtao1, yangerkun, lonuxli.64
On Tue, Mar 10, 2026 at 07:46:42AM -0700, Darrick J. Wong wrote:
> On Tue, Mar 10, 2026 at 04:19:33PM +0800, Long Li wrote:
> > On Mon, Mar 09, 2026 at 09:59:33AM -0700, Darrick J. Wong wrote:
> > > On Mon, Mar 09, 2026 at 04:27:52PM +0800, Long Li wrote:
> > > > When inactivating an inode with node-format extended attributes,
> > > > xfs_attr3_node_inactive() invalidates all child leaf/node blocks via
> > > > xfs_trans_binval(), but intentionally does not remove the corresponding
> > > > entries from their parent node blocks. The implicit assumption is that
> > > > xfs_attr_inactive() will truncate the entire attr fork to zero extents
> > > > afterwards, so log recovery will never reach the root node and follow
> > > > those stale pointers.
> > > >
> > > > However, if a log shutdown occurs after the child block cancellations
> > > > commit but before the attr bmap truncation commits, this assumption
> > > > breaks. Recovery replays the attr bmap intact (the inode still has
> > > > attr fork extents), but suppresses replay of all cancelled child
> > > > blocks, maybe leaving them as stale data on disk. On the next mount,
> > > > xlog_recover_process_iunlinks() retries inactivation and attempts to
> > > > read the root node via the attr bmap. If the root node was not replayed,
> > > > reading the unreplayed root block triggers a metadata verification
> > > > failure immediately; if it was replayed, following its child pointers
> > > > to unreplayed child blocks triggers the same failure:
> > > >
> > > > XFS (pmem0): Metadata corruption detected at
> > > > xfs_da3_node_read_verify+0x53/0x220, xfs_da3_node block 0x78
> > > > XFS (pmem0): Unmount and run xfs_repair
> > > > XFS (pmem0): First 128 bytes of corrupted metadata buffer:
> > > > 00000000: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
> > > > 00000010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
> > > > 00000020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
> > > > 00000030: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
> > > > 00000040: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
> > > > 00000050: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
> > > > 00000060: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
> > > > 00000070: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
> > > > XFS (pmem0): metadata I/O error in "xfs_da_read_buf+0x104/0x190" at daddr 0x78 len 8 error 117
> > >
> > > Did you hit this through a customer issue? Or is this the same "corrupt
> > > block 0 of inode 25165954 attribute fork" problem exposed by generic/753
> > > last week? Or possibly both?
> > >
> > > https://lore.kernel.org/linux-xfs/CAF-d4Oscq=qaCd9dbbEZjG8dA5Q7erdWSszoxY1migM8j85eRw@mail.gmail.com/
> >
> > We encountered this issue while performing disk fault injection tests,
> > rather than through the generic/753. When I construct the problem
> > and use xfs_repair to repair it, the error message "corrupt block 0" can
> > be reported as follows:
> >
> > Metadata corruption detected at 0x452a9c, xfs_da3_node block 0x78/0x1000
> > corrupt block 0 of inode 131 attribute fork
> > problem with attribute contents in inode 131
> > clearing inode 131 attributes
> > correcting nblocks for inode 131, was 1 - counted 0
> >
> > So the problem you encountered before might be this issue.
> >
> > >
> > > > Fix this in two places:
> > > >
> > > > In xfs_attr3_node_inactive(), after calling xfs_trans_binval() on a
> > > > child block, immediately remove the entry that references it from the
> > > > parent node in the same transaction. This eliminates the window where
> > > > the parent holds a pointer to a cancelled block. Once all children are
> > > > removed, the now-empty root node is converted to a leaf block within the
> > > > same transaction. This node-to-leaf conversion is necessary for crash
> > > > safety. If the system shutdown after the empty node is written to the
> > > > log but before the second-phase bmap truncation commits, log recovery
> > > > will attempt to verify the root block on disk. xfs_da3_node_verify()
> > > > does not permit a node block with count == 0; such a block will fail
> > > > verification and trigger a metadata corruption shutdown. on the other
> > > > hand, leaf blocks are allowed to have this transient state.
> > >
> > > Hrmmm... this really does sound like the "corrupt block 0" problem
> > > referenced above.
> > >
> > > > In xfs_attr_inactive(), split the attr fork truncation into two explicit
> > > > phases. First, truncate all extents beyond the root block (the child
> > > > extents whose parent references have already been removed above).
> > > > Second, invalidate the root block and truncate the attr bmap to zero in
> > > > a single transaction. The two operations in the second phase must be
> > > > atomic: as long as the attr bmap has any non-zero length, recovery can
> > > > follow it to the root block, so the root block invalidation must commit
> > > > together with the bmap-to-zero truncation.
> > > >
> > > > Signed-off-by: Long Li <leo.lilong@huawei.com>
> >
> > ......
> >
> > > > @@ -283,6 +283,16 @@ xfs_attr3_root_inactive(
> > > > case cpu_to_be16(XFS_DA_NODE_MAGIC):
> > > > case cpu_to_be16(XFS_DA3_NODE_MAGIC):
> > > > error = xfs_attr3_node_inactive(trans, dp, bp, 1);
> > > > + if (error)
> > > > + return error;
> > > > +
> > > > + /*
> > > > + * Empty root node block are not allowed, convert it to leaf.
> > > > + */
> > > > + error = xfs_attr3_leaf_init(*trans, dp, 0);
> > >
> > > Responding to my own question: Ah, I see -- "leaf init" doesn't use the
> > > bp anymore and it's attached to the transaction so it doesn't leak.
> > > That's a little subtle since there's nothing preventing someone from
> > > calling xfs_attr3_leaf_init(NULL, dp, 0).
> >
> > Indeed, there should be an increase in explanatory comments and an empty
> > check for the tp.
> >
> > >
> > > > + if (error)
> > > > + return error;
> > > > + error = xfs_trans_roll_inode(trans, dp);
> > >
> > > If we have an xattr structure with multiple levels of dabtree nodes, can
> > > this lead to the somewhat odd situation where the tree levels are
> > > uneven during deconstruction? For example
> > >
> > > root
> > > / \
> > > node empty_leaf
> > > | \
> > > | \
> > > node node
> > > | \
> > > leaves more_leaves
> > >
> > > Does this matter, or can the inactivation code already handle it? I
> > > suppose since we're inactivating (either in inodegc or in recovery after
> > > a crash) user programs will never see this so the window of confusion
> > > might be pretty small.
> > >
> > > --D
> >
> > For the reason of simplicity and efficiency, the dead code does not consider
> > this tree imbalance scenario, and I understand that this would not cause
> > any practical issues.
>
> "dead" code? Did you mean to say that the code that the patch removes
> did not consider this imbalanced tree scenario? "Dead" code usually
> refers to code that's still in the codebase that nobody calls or
> executes.
Sorry for the confusion — "dead code" was a mistaken word choice. I
meant the attr inactivation code.
>
> I /think/ what I'm hearing is that the the attr inactivation code can
> handle the system going down midway through removing the attr structure,
> so it doesn't matter if log recovery finds an imbalanced tree because
> all it's really doing is walking the dabtree to each leaf and nuking the
> path to that leaf; and now it'll also rewrite each emptied-out node with
> an empty leaf to ensure that a second log recovery can resume if the
> first recovery attempt fails due to system crash etc.
>
> (Is that correct?)
>
> --D
And yes, I agree with most of your understanding, but the approach is
slightly different from what you described. Only the empty root node
is rewritten as an empty leaf. For intermediate nodes, the handling
depends on whether the entry being removed is the last one in that node:
if it is not the last entry, the transaction is committed immediately
after the deletion; if it is the last entry, the entry deletion, the
cancellation of the now-empty node block, and the removal of the pointer
to that node from its parent are all placed in the same transaction.
This ensures that an empty intermediate node is never visible to recovery
without its parent pointer also being gone, avoiding the need to convert
every emptied intermediate node to a leaf.
The root node is a special case, which is essentially indexed by attr bmap.
Therefore, the root node can only be converted to an empty leaf to
prevent an empty node from being visible during the next recovery.
Thanks,
Long Li
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH 4/4] xfs: close crash window in attr dabtree inactivation
2026-03-11 2:34 ` Long Li
@ 2026-03-13 14:46 ` Darrick J. Wong
0 siblings, 0 replies; 17+ messages in thread
From: Darrick J. Wong @ 2026-03-13 14:46 UTC (permalink / raw)
To: Long Li; +Cc: cem, linux-xfs, david, yi.zhang, houtao1, yangerkun, lonuxli.64
On Wed, Mar 11, 2026 at 10:34:38AM +0800, Long Li wrote:
> On Tue, Mar 10, 2026 at 07:46:42AM -0700, Darrick J. Wong wrote:
> > On Tue, Mar 10, 2026 at 04:19:33PM +0800, Long Li wrote:
> > > On Mon, Mar 09, 2026 at 09:59:33AM -0700, Darrick J. Wong wrote:
> > > > On Mon, Mar 09, 2026 at 04:27:52PM +0800, Long Li wrote:
> > > > > When inactivating an inode with node-format extended attributes,
> > > > > xfs_attr3_node_inactive() invalidates all child leaf/node blocks via
> > > > > xfs_trans_binval(), but intentionally does not remove the corresponding
> > > > > entries from their parent node blocks. The implicit assumption is that
> > > > > xfs_attr_inactive() will truncate the entire attr fork to zero extents
> > > > > afterwards, so log recovery will never reach the root node and follow
> > > > > those stale pointers.
> > > > >
> > > > > However, if a log shutdown occurs after the child block cancellations
> > > > > commit but before the attr bmap truncation commits, this assumption
> > > > > breaks. Recovery replays the attr bmap intact (the inode still has
> > > > > attr fork extents), but suppresses replay of all cancelled child
> > > > > blocks, maybe leaving them as stale data on disk. On the next mount,
> > > > > xlog_recover_process_iunlinks() retries inactivation and attempts to
> > > > > read the root node via the attr bmap. If the root node was not replayed,
> > > > > reading the unreplayed root block triggers a metadata verification
> > > > > failure immediately; if it was replayed, following its child pointers
> > > > > to unreplayed child blocks triggers the same failure:
> > > > >
> > > > > XFS (pmem0): Metadata corruption detected at
> > > > > xfs_da3_node_read_verify+0x53/0x220, xfs_da3_node block 0x78
> > > > > XFS (pmem0): Unmount and run xfs_repair
> > > > > XFS (pmem0): First 128 bytes of corrupted metadata buffer:
> > > > > 00000000: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
> > > > > 00000010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
> > > > > 00000020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
> > > > > 00000030: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
> > > > > 00000040: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
> > > > > 00000050: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
> > > > > 00000060: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
> > > > > 00000070: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
> > > > > XFS (pmem0): metadata I/O error in "xfs_da_read_buf+0x104/0x190" at daddr 0x78 len 8 error 117
> > > >
> > > > Did you hit this through a customer issue? Or is this the same "corrupt
> > > > block 0 of inode 25165954 attribute fork" problem exposed by generic/753
> > > > last week? Or possibly both?
> > > >
> > > > https://lore.kernel.org/linux-xfs/CAF-d4Oscq=qaCd9dbbEZjG8dA5Q7erdWSszoxY1migM8j85eRw@mail.gmail.com/
> > >
> > > We encountered this issue while performing disk fault injection tests,
> > > rather than through the generic/753. When I construct the problem
> > > and use xfs_repair to repair it, the error message "corrupt block 0" can
> > > be reported as follows:
> > >
> > > Metadata corruption detected at 0x452a9c, xfs_da3_node block 0x78/0x1000
> > > corrupt block 0 of inode 131 attribute fork
> > > problem with attribute contents in inode 131
> > > clearing inode 131 attributes
> > > correcting nblocks for inode 131, was 1 - counted 0
> > >
> > > So the problem you encountered before might be this issue.
> > >
> > > >
> > > > > Fix this in two places:
> > > > >
> > > > > In xfs_attr3_node_inactive(), after calling xfs_trans_binval() on a
> > > > > child block, immediately remove the entry that references it from the
> > > > > parent node in the same transaction. This eliminates the window where
> > > > > the parent holds a pointer to a cancelled block. Once all children are
> > > > > removed, the now-empty root node is converted to a leaf block within the
> > > > > same transaction. This node-to-leaf conversion is necessary for crash
> > > > > safety. If the system shutdown after the empty node is written to the
> > > > > log but before the second-phase bmap truncation commits, log recovery
> > > > > will attempt to verify the root block on disk. xfs_da3_node_verify()
> > > > > does not permit a node block with count == 0; such a block will fail
> > > > > verification and trigger a metadata corruption shutdown. on the other
> > > > > hand, leaf blocks are allowed to have this transient state.
> > > >
> > > > Hrmmm... this really does sound like the "corrupt block 0" problem
> > > > referenced above.
> > > >
> > > > > In xfs_attr_inactive(), split the attr fork truncation into two explicit
> > > > > phases. First, truncate all extents beyond the root block (the child
> > > > > extents whose parent references have already been removed above).
> > > > > Second, invalidate the root block and truncate the attr bmap to zero in
> > > > > a single transaction. The two operations in the second phase must be
> > > > > atomic: as long as the attr bmap has any non-zero length, recovery can
> > > > > follow it to the root block, so the root block invalidation must commit
> > > > > together with the bmap-to-zero truncation.
> > > > >
> > > > > Signed-off-by: Long Li <leo.lilong@huawei.com>
> > >
> > > ......
> > >
> > > > > @@ -283,6 +283,16 @@ xfs_attr3_root_inactive(
> > > > > case cpu_to_be16(XFS_DA_NODE_MAGIC):
> > > > > case cpu_to_be16(XFS_DA3_NODE_MAGIC):
> > > > > error = xfs_attr3_node_inactive(trans, dp, bp, 1);
> > > > > + if (error)
> > > > > + return error;
> > > > > +
> > > > > + /*
> > > > > + * Empty root node block are not allowed, convert it to leaf.
> > > > > + */
> > > > > + error = xfs_attr3_leaf_init(*trans, dp, 0);
> > > >
> > > > Responding to my own question: Ah, I see -- "leaf init" doesn't use the
> > > > bp anymore and it's attached to the transaction so it doesn't leak.
> > > > That's a little subtle since there's nothing preventing someone from
> > > > calling xfs_attr3_leaf_init(NULL, dp, 0).
> > >
> > > Indeed, there should be an increase in explanatory comments and an empty
> > > check for the tp.
> > >
> > > >
> > > > > + if (error)
> > > > > + return error;
> > > > > + error = xfs_trans_roll_inode(trans, dp);
> > > >
> > > > If we have an xattr structure with multiple levels of dabtree nodes, can
> > > > this lead to the somewhat odd situation where the tree levels are
> > > > uneven during deconstruction? For example
> > > >
> > > > root
> > > > / \
> > > > node empty_leaf
> > > > | \
> > > > | \
> > > > node node
> > > > | \
> > > > leaves more_leaves
> > > >
> > > > Does this matter, or can the inactivation code already handle it? I
> > > > suppose since we're inactivating (either in inodegc or in recovery after
> > > > a crash) user programs will never see this so the window of confusion
> > > > might be pretty small.
> > > >
> > > > --D
> > >
> > > For the reason of simplicity and efficiency, the dead code does not consider
> > > this tree imbalance scenario, and I understand that this would not cause
> > > any practical issues.
> >
> > "dead" code? Did you mean to say that the code that the patch removes
> > did not consider this imbalanced tree scenario? "Dead" code usually
> > refers to code that's still in the codebase that nobody calls or
> > executes.
>
> Sorry for the confusion — "dead code" was a mistaken word choice. I
> meant the attr inactivation code.
>
> >
> > I /think/ what I'm hearing is that the the attr inactivation code can
> > handle the system going down midway through removing the attr structure,
> > so it doesn't matter if log recovery finds an imbalanced tree because
> > all it's really doing is walking the dabtree to each leaf and nuking the
> > path to that leaf; and now it'll also rewrite each emptied-out node with
> > an empty leaf to ensure that a second log recovery can resume if the
> > first recovery attempt fails due to system crash etc.
> >
> > (Is that correct?)
> >
> > --D
>
> And yes, I agree with most of your understanding, but the approach is
> slightly different from what you described. Only the empty root node
> is rewritten as an empty leaf. For intermediate nodes, the handling
> depends on whether the entry being removed is the last one in that node:
> if it is not the last entry, the transaction is committed immediately
> after the deletion; if it is the last entry, the entry deletion, the
> cancellation of the now-empty node block, and the removal of the pointer
> to that node from its parent are all placed in the same transaction.
> This ensures that an empty intermediate node is never visible to recovery
> without its parent pointer also being gone, avoiding the need to convert
> every emptied intermediate node to a leaf.
>
> The root node is a special case, which is essentially indexed by attr bmap.
> Therefore, the root node can only be converted to an empty leaf to
> prevent an empty node from being visible during the next recovery.
Ah, ok, thanks for the explanation. I'll go catch up with your latest
posting now. :)
--D
> Thanks,
> Long Li
>
>
^ permalink raw reply [flat|nested] 17+ messages in thread
end of thread, other threads:[~2026-03-13 14:46 UTC | newest]
Thread overview: 17+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2026-03-09 8:27 [PATCH 0/4] xfs: close crash window in attr dabtree inactivation Long Li
2026-03-09 8:27 ` [PATCH 1/4] xfs: only assert new size for datafork during truncate extents Long Li
2026-03-09 16:39 ` Darrick J. Wong
2026-03-09 8:27 ` [PATCH 2/4] xfs: factor out xfs_da3_node_entry_remove Long Li
2026-03-09 16:42 ` Darrick J. Wong
2026-03-10 2:15 ` Long Li
2026-03-10 11:58 ` Long Li
2026-03-10 14:38 ` Darrick J. Wong
2026-03-09 8:27 ` [PATCH 3/4] xfs: factor out xfs_attr3_leaf_init Long Li
2026-03-09 16:44 ` Darrick J. Wong
2026-03-10 7:42 ` Long Li
2026-03-09 8:27 ` [PATCH 4/4] xfs: close crash window in attr dabtree inactivation Long Li
2026-03-09 16:59 ` Darrick J. Wong
2026-03-10 8:19 ` Long Li
2026-03-10 14:46 ` Darrick J. Wong
2026-03-11 2:34 ` Long Li
2026-03-13 14:46 ` Darrick J. Wong
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox