* [PATCH] xfs: Fix file creation failure
@ 2024-06-04 7:11 Zizhi Wo
2024-06-04 15:56 ` Darrick J. Wong
2024-06-04 23:00 ` Dave Chinner
0 siblings, 2 replies; 8+ messages in thread
From: Zizhi Wo @ 2024-06-04 7:11 UTC (permalink / raw)
To: chandan.babu, djwong, dchinner, wozizhi
Cc: linux-xfs, linux-kernel, yangerkun
We have an xfs image that contains only 2 AGs, the first AG is full and
the second AG is empty, then a concurrent file creation and little writing
could unexpectedly return -ENOSPC error since there is a race window that
the allocator could get the wrong agf->agf_longest.
Write file process steps:
1) Find the entry that best meets the conditions, then calculate the start
address and length of the remaining part of the entry after allocation.
2) Delete this entry. Because the second AG is empty, the btree in its agf
has only one record, and agf->agf_longest will be set to 0 after deletion.
3) Insert the remaining unused parts of this entry based on the
calculations in 1), and update the agf->agf_longest.
Create file process steps:
1) Check whether there are free inodes in the inode chunk.
2) If there is no free inode, check whether there has space for creating
inode chunks, perform the no-lock judgment first.
3) If the judgment succeeds, the judgment is performed again with agf lock
held. Otherwire, an error is returned directly.
If the write process is in step 2) but not go to 3) yet, the create file
process goes to 2) at this time, it will be mistaken for no space,
resulting in the file system still has space but the file creation fails.
Direct write Create file
xfs_file_write_iter
...
xfs_direct_write_iomap_begin
xfs_iomap_write_direct
...
xfs_alloc_ag_vextent_near
xfs_alloc_cur_finish
xfs_alloc_fixup_trees
xfs_btree_delete
xfs_btree_delrec
xfs_allocbt_update_lastrec
// longest = 0 because numrec == 0.
agf->agf_longest = len = 0
xfs_create
...
xfs_dialloc
...
xfs_alloc_fix_freelist
xfs_alloc_space_available
-> as longest=0, it will return
false, no space for inode alloc.
Fix this issue by adding the bc_free_longest field to the xfs_btree_cur_t
structure to store the potential longest count that will be updated. The
assignment is done in xfs_alloc_fixup_trees() and xfs_free_ag_extent().
Reported by: Ye Bin <yebin10@huawei.com>
Signed-off-by: Zizhi Wo <wozizhi@huawei.com>
---
fs/xfs/libxfs/xfs_alloc.c | 14 ++++++++++++++
fs/xfs/libxfs/xfs_alloc_btree.c | 9 ++++++++-
fs/xfs/libxfs/xfs_btree.h | 1 +
3 files changed, 23 insertions(+), 1 deletion(-)
diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c
index 6c55a6e88eba..86ba873d57a8 100644
--- a/fs/xfs/libxfs/xfs_alloc.c
+++ b/fs/xfs/libxfs/xfs_alloc.c
@@ -577,6 +577,13 @@ xfs_alloc_fixup_trees(
nfbno2 = rbno + rlen;
nflen2 = (fbno + flen) - nfbno2;
}
+
+ /*
+ * Record the potential maximum free length in advance.
+ */
+ if (nfbno1 != NULLAGBLOCK || nfbno2 != NULLAGBLOCK)
+ cnt_cur->bc_ag.bc_free_longest = XFS_EXTLEN_MAX(nflen1, nflen2);
+
/*
* Delete the entry from the by-size btree.
*/
@@ -2044,6 +2051,13 @@ xfs_free_ag_extent(
* Now allocate and initialize a cursor for the by-size tree.
*/
cnt_cur = xfs_cntbt_init_cursor(mp, tp, agbp, pag);
+ /*
+ * Record the potential maximum free length in advance.
+ */
+ if (haveleft)
+ cnt_cur->bc_ag.bc_free_longest = ltlen;
+ if (haveright)
+ cnt_cur->bc_ag.bc_free_longest = gtlen;
/*
* Have both left and right contiguous neighbors.
* Merge all three into a single free block.
diff --git a/fs/xfs/libxfs/xfs_alloc_btree.c b/fs/xfs/libxfs/xfs_alloc_btree.c
index 6ef5ddd89600..8e7d1e0f1a63 100644
--- a/fs/xfs/libxfs/xfs_alloc_btree.c
+++ b/fs/xfs/libxfs/xfs_alloc_btree.c
@@ -161,7 +161,14 @@ xfs_allocbt_update_lastrec(
rrp = XFS_ALLOC_REC_ADDR(cur->bc_mp, block, numrecs);
len = rrp->ar_blockcount;
} else {
- len = 0;
+ /*
+ * Update in advance to prevent file creation failure
+ * for concurrent processes even though there is no
+ * numrec currently.
+ * And there's no need to worry as the value that no
+ * less than bc_free_longest will be inserted later.
+ */
+ len = cpu_to_be32(cur->bc_ag.bc_free_longest);
}
break;
diff --git a/fs/xfs/libxfs/xfs_btree.h b/fs/xfs/libxfs/xfs_btree.h
index f93374278aa1..985b1885a643 100644
--- a/fs/xfs/libxfs/xfs_btree.h
+++ b/fs/xfs/libxfs/xfs_btree.h
@@ -281,6 +281,7 @@ struct xfs_btree_cur
struct xfs_perag *pag;
struct xfs_buf *agbp;
struct xbtree_afakeroot *afake; /* for staging cursor */
+ xfs_extlen_t bc_free_longest; /* potential longest free space */
} bc_ag;
struct {
struct xfbtree *xfbtree;
--
2.39.2
^ permalink raw reply related [flat|nested] 8+ messages in thread* Re: [PATCH] xfs: Fix file creation failure 2024-06-04 7:11 [PATCH] xfs: Fix file creation failure Zizhi Wo @ 2024-06-04 15:56 ` Darrick J. Wong 2024-06-05 3:38 ` Zizhi Wo 2024-06-04 23:00 ` Dave Chinner 1 sibling, 1 reply; 8+ messages in thread From: Darrick J. Wong @ 2024-06-04 15:56 UTC (permalink / raw) To: Zizhi Wo; +Cc: chandan.babu, dchinner, linux-xfs, linux-kernel, yangerkun On Tue, Jun 04, 2024 at 03:11:21PM +0800, Zizhi Wo wrote: > We have an xfs image that contains only 2 AGs, the first AG is full and > the second AG is empty, then a concurrent file creation and little writing > could unexpectedly return -ENOSPC error since there is a race window that > the allocator could get the wrong agf->agf_longest. > > Write file process steps: > 1) Find the entry that best meets the conditions, then calculate the start > address and length of the remaining part of the entry after allocation. > 2) Delete this entry. Because the second AG is empty, the btree in its agf > has only one record, and agf->agf_longest will be set to 0 after deletion. > 3) Insert the remaining unused parts of this entry based on the > calculations in 1), and update the agf->agf_longest. > > Create file process steps: > 1) Check whether there are free inodes in the inode chunk. > 2) If there is no free inode, check whether there has space for creating > inode chunks, perform the no-lock judgment first. > 3) If the judgment succeeds, the judgment is performed again with agf lock > held. Otherwire, an error is returned directly. > > If the write process is in step 2) but not go to 3) yet, the create file > process goes to 2) at this time, it will be mistaken for no space, > resulting in the file system still has space but the file creation fails. > > Direct write Create file > xfs_file_write_iter > ... > xfs_direct_write_iomap_begin > xfs_iomap_write_direct > ... > xfs_alloc_ag_vextent_near > xfs_alloc_cur_finish > xfs_alloc_fixup_trees > xfs_btree_delete > xfs_btree_delrec > xfs_allocbt_update_lastrec > // longest = 0 because numrec == 0. > agf->agf_longest = len = 0 > xfs_create > ... > xfs_dialloc > ... > xfs_alloc_fix_freelist > xfs_alloc_space_available > -> as longest=0, it will return > false, no space for inode alloc. > > Fix this issue by adding the bc_free_longest field to the xfs_btree_cur_t > structure to store the potential longest count that will be updated. The > assignment is done in xfs_alloc_fixup_trees() and xfs_free_ag_extent(). This is going to be a reverse-order review due to the way that diff ordered the chunks, which means that the bulk of my questions are at the end. > Reported by: Ye Bin <yebin10@huawei.com> > Signed-off-by: Zizhi Wo <wozizhi@huawei.com> > --- > fs/xfs/libxfs/xfs_alloc.c | 14 ++++++++++++++ > fs/xfs/libxfs/xfs_alloc_btree.c | 9 ++++++++- > fs/xfs/libxfs/xfs_btree.h | 1 + > 3 files changed, 23 insertions(+), 1 deletion(-) > > diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c > index 6c55a6e88eba..86ba873d57a8 100644 > --- a/fs/xfs/libxfs/xfs_alloc.c > +++ b/fs/xfs/libxfs/xfs_alloc.c > @@ -577,6 +577,13 @@ xfs_alloc_fixup_trees( > nfbno2 = rbno + rlen; > nflen2 = (fbno + flen) - nfbno2; > } > + > + /* > + * Record the potential maximum free length in advance. > + */ > + if (nfbno1 != NULLAGBLOCK || nfbno2 != NULLAGBLOCK) > + cnt_cur->bc_ag.bc_free_longest = XFS_EXTLEN_MAX(nflen1, nflen2); Ok, so if we're allocating space then this sets bc_free_longest to the longer of the two remaining sections, if any. But if we just allocated the entirety of the longest extent in the cntbt, then we don't set bc_free_longest, which means its zero, right? I guess that's ok because that implies there's zero space left in the AG, so the longest freespace is indeed zero. If we just allocated the entirety of a non-longest extent in the cntbt then we don't call ->lastrec_update so the value of bc_free_longest doesn't matter? > + > /* > * Delete the entry from the by-size btree. > */ > @@ -2044,6 +2051,13 @@ xfs_free_ag_extent( > * Now allocate and initialize a cursor for the by-size tree. > */ > cnt_cur = xfs_cntbt_init_cursor(mp, tp, agbp, pag); > + /* > + * Record the potential maximum free length in advance. > + */ > + if (haveleft) > + cnt_cur->bc_ag.bc_free_longest = ltlen; > + if (haveright) > + cnt_cur->bc_ag.bc_free_longest = gtlen; What happens in the haveleft && haveright case? Shouldn't bc_free_longest be set to ltlen + len + gtlen? You could just push the setting of bc_free_longest into the haveleft/haveright code below. > /* > * Have both left and right contiguous neighbors. > * Merge all three into a single free block. > diff --git a/fs/xfs/libxfs/xfs_alloc_btree.c b/fs/xfs/libxfs/xfs_alloc_btree.c > index 6ef5ddd89600..8e7d1e0f1a63 100644 > --- a/fs/xfs/libxfs/xfs_alloc_btree.c > +++ b/fs/xfs/libxfs/xfs_alloc_btree.c > @@ -161,7 +161,14 @@ xfs_allocbt_update_lastrec( > rrp = XFS_ALLOC_REC_ADDR(cur->bc_mp, block, numrecs); > len = rrp->ar_blockcount; > } else { > - len = 0; > + /* > + * Update in advance to prevent file creation failure > + * for concurrent processes even though there is no > + * numrec currently. > + * And there's no need to worry as the value that no > + * less than bc_free_longest will be inserted later. > + */ > + len = cpu_to_be32(cur->bc_ag.bc_free_longest); Humm. In this case, we've called ->update_lastrec on the cntbt cursor having deleted all the records in this record block. Presumably that means that we're going to add rec->alloc.ar_blockcount blocks to the rightmost record in the left sibling of @block? Or already have? Ahh, right, the pagf_longest checks are done without holding AGF lock. The concurrent creat call sees this intermediate state (DELREC sets pagf_longest to zero, a moment later INSREC/UPDATE set it to the correct nonzero value) and decides to ENOSPC because "nobody" has sufficient free space. I think this phony zero never gets written to disk because although we're logging zero into the ondisk and incore agf_longest here, the next btree operation will reset it to the correct value. Right? Would it be simpler to handle this case by duplicating the cntbt cursor and walking one record leftward in the tree to find the longest extent, rather than using this "bc_free_longest" variable? > } > > break; > diff --git a/fs/xfs/libxfs/xfs_btree.h b/fs/xfs/libxfs/xfs_btree.h > index f93374278aa1..985b1885a643 100644 > --- a/fs/xfs/libxfs/xfs_btree.h > +++ b/fs/xfs/libxfs/xfs_btree.h > @@ -281,6 +281,7 @@ struct xfs_btree_cur > struct xfs_perag *pag; > struct xfs_buf *agbp; > struct xbtree_afakeroot *afake; /* for staging cursor */ > + xfs_extlen_t bc_free_longest; /* potential longest free space */ This is only used for bnobt/cntbt trees, put it in the per-format private data area, please. If the answer to the question about duplicating the btree cursor is "no" then I think this deserves a much longer comment that captures the fact that the variable exists to avoid setting pagf_longest to zero for benefit of the code that does unlocked scanning of AGs for free space. I also wonder if the unlocked ag scan should just note that it observed a zero pagf_longest and if no space can be found across all the AGs, to try again with locks? --D > } bc_ag; > struct { > struct xfbtree *xfbtree; > -- > 2.39.2 > > ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [PATCH] xfs: Fix file creation failure 2024-06-04 15:56 ` Darrick J. Wong @ 2024-06-05 3:38 ` Zizhi Wo 0 siblings, 0 replies; 8+ messages in thread From: Zizhi Wo @ 2024-06-05 3:38 UTC (permalink / raw) To: Darrick J. Wong Cc: chandan.babu, dchinner, linux-xfs, linux-kernel, yangerkun 在 2024/6/4 23:56, Darrick J. Wong 写道: > On Tue, Jun 04, 2024 at 03:11:21PM +0800, Zizhi Wo wrote: >> We have an xfs image that contains only 2 AGs, the first AG is full and >> the second AG is empty, then a concurrent file creation and little writing >> could unexpectedly return -ENOSPC error since there is a race window that >> the allocator could get the wrong agf->agf_longest. >> >> Write file process steps: >> 1) Find the entry that best meets the conditions, then calculate the start >> address and length of the remaining part of the entry after allocation. >> 2) Delete this entry. Because the second AG is empty, the btree in its agf >> has only one record, and agf->agf_longest will be set to 0 after deletion. >> 3) Insert the remaining unused parts of this entry based on the >> calculations in 1), and update the agf->agf_longest. >> >> Create file process steps: >> 1) Check whether there are free inodes in the inode chunk. >> 2) If there is no free inode, check whether there has space for creating >> inode chunks, perform the no-lock judgment first. >> 3) If the judgment succeeds, the judgment is performed again with agf lock >> held. Otherwire, an error is returned directly. >> >> If the write process is in step 2) but not go to 3) yet, the create file >> process goes to 2) at this time, it will be mistaken for no space, >> resulting in the file system still has space but the file creation fails. >> >> Direct write Create file >> xfs_file_write_iter >> ... >> xfs_direct_write_iomap_begin >> xfs_iomap_write_direct >> ... >> xfs_alloc_ag_vextent_near >> xfs_alloc_cur_finish >> xfs_alloc_fixup_trees >> xfs_btree_delete >> xfs_btree_delrec >> xfs_allocbt_update_lastrec >> // longest = 0 because numrec == 0. >> agf->agf_longest = len = 0 >> xfs_create >> ... >> xfs_dialloc >> ... >> xfs_alloc_fix_freelist >> xfs_alloc_space_available >> -> as longest=0, it will return >> false, no space for inode alloc. >> >> Fix this issue by adding the bc_free_longest field to the xfs_btree_cur_t >> structure to store the potential longest count that will be updated. The >> assignment is done in xfs_alloc_fixup_trees() and xfs_free_ag_extent(). > > This is going to be a reverse-order review due to the way that diff > ordered the chunks, which means that the bulk of my questions are at the > end. > >> Reported by: Ye Bin <yebin10@huawei.com> >> Signed-off-by: Zizhi Wo <wozizhi@huawei.com> >> --- >> fs/xfs/libxfs/xfs_alloc.c | 14 ++++++++++++++ >> fs/xfs/libxfs/xfs_alloc_btree.c | 9 ++++++++- >> fs/xfs/libxfs/xfs_btree.h | 1 + >> 3 files changed, 23 insertions(+), 1 deletion(-) >> >> diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c >> index 6c55a6e88eba..86ba873d57a8 100644 >> --- a/fs/xfs/libxfs/xfs_alloc.c >> +++ b/fs/xfs/libxfs/xfs_alloc.c >> @@ -577,6 +577,13 @@ xfs_alloc_fixup_trees( >> nfbno2 = rbno + rlen; >> nflen2 = (fbno + flen) - nfbno2; >> } >> + >> + /* >> + * Record the potential maximum free length in advance. >> + */ >> + if (nfbno1 != NULLAGBLOCK || nfbno2 != NULLAGBLOCK) >> + cnt_cur->bc_ag.bc_free_longest = XFS_EXTLEN_MAX(nflen1, nflen2); > > Ok, so if we're allocating space then this sets bc_free_longest to the > longer of the two remaining sections, if any. But if we just allocated > the entirety of the longest extent in the cntbt, then we don't set > bc_free_longest, which means its zero, right? I guess that's ok because > that implies there's zero space left in the AG, so the longest freespace > is indeed zero. > > If we just allocated the entirety of a non-longest extent in the cntbt > then we don't call ->lastrec_update so the value of bc_free_longest > doesn't matter? Yes, absolutely right! Thank you!φ(゜▽゜*)♪ > >> + >> /* >> * Delete the entry from the by-size btree. >> */ >> @@ -2044,6 +2051,13 @@ xfs_free_ag_extent( >> * Now allocate and initialize a cursor for the by-size tree. >> */ >> cnt_cur = xfs_cntbt_init_cursor(mp, tp, agbp, pag); >> + /* >> + * Record the potential maximum free length in advance. >> + */ >> + if (haveleft) >> + cnt_cur->bc_ag.bc_free_longest = ltlen; >> + if (haveright) >> + cnt_cur->bc_ag.bc_free_longest = gtlen; > > What happens in the haveleft && haveright case? Shouldn't > bc_free_longest be set to ltlen + len + gtlen? You could just push the > setting of bc_free_longest into the haveleft/haveright code below. Oh, as I wrote to Dave, the logic I considered here is that the record is less than or equal to the maximum value. And no need to worry about that because it will soon be updated in xfs_btree_insert in the problem triggering scenario. > >> /* >> * Have both left and right contiguous neighbors. >> * Merge all three into a single free block. >> diff --git a/fs/xfs/libxfs/xfs_alloc_btree.c b/fs/xfs/libxfs/xfs_alloc_btree.c >> index 6ef5ddd89600..8e7d1e0f1a63 100644 >> --- a/fs/xfs/libxfs/xfs_alloc_btree.c >> +++ b/fs/xfs/libxfs/xfs_alloc_btree.c >> @@ -161,7 +161,14 @@ xfs_allocbt_update_lastrec( >> rrp = XFS_ALLOC_REC_ADDR(cur->bc_mp, block, numrecs); >> len = rrp->ar_blockcount; >> } else { >> - len = 0; >> + /* >> + * Update in advance to prevent file creation failure >> + * for concurrent processes even though there is no >> + * numrec currently. >> + * And there's no need to worry as the value that no >> + * less than bc_free_longest will be inserted later. >> + */ >> + len = cpu_to_be32(cur->bc_ag.bc_free_longest); > > Humm. In this case, we've called ->update_lastrec on the cntbt cursor > having deleted all the records in this record block. Presumably that > means that we're going to add rec->alloc.ar_blockcount blocks to the > rightmost record in the left sibling of @block? Or already have? > In normal delete operations, cntbt will have a balancing process, moving data from other nodes or merging to ensure that numrecs >= get_minrecs. In this scenario, the cntbt is already an -empty- tree, and is in a temporary state, new values will be inserted later. > Ahh, right, the pagf_longest checks are done without holding AGF lock. > The concurrent creat call sees this intermediate state (DELREC sets > pagf_longest to zero, a moment later INSREC/UPDATE set it to the correct > nonzero value) and decides to ENOSPC because "nobody" has sufficient > free space. > > I think this phony zero never gets written to disk because although > we're logging zero into the ondisk and incore agf_longest here, the next > btree operation will reset it to the correct value. Right? Yes, this phony zero will not be recorded to disk. It is just a temporary condition. > > Would it be simpler to handle this case by duplicating the cntbt cursor > and walking one record leftward in the tree to find the longest extent, > rather than using this "bc_free_longest" variable? > In my opinion, this does not solve the problem. As mentioned above, at this point the cntbt is an -empty- tree with no records. >> } >> >> break; >> diff --git a/fs/xfs/libxfs/xfs_btree.h b/fs/xfs/libxfs/xfs_btree.h >> index f93374278aa1..985b1885a643 100644 >> --- a/fs/xfs/libxfs/xfs_btree.h >> +++ b/fs/xfs/libxfs/xfs_btree.h >> @@ -281,6 +281,7 @@ struct xfs_btree_cur >> struct xfs_perag *pag; >> struct xfs_buf *agbp; >> struct xbtree_afakeroot *afake; /* for staging cursor */ >> + xfs_extlen_t bc_free_longest; /* potential longest free space */ > > This is only used for bnobt/cntbt trees, put it in the per-format > private data area, please. > OK, I will modify it. More specifically, it is only used for cntbt, because currently only cntbt can set the XFS_BTGEO_LASTREC_UPDATE flag, and can call ->update_lastrec. > If the answer to the question about duplicating the btree cursor is "no" > then I think this deserves a much longer comment that captures the fact > that the variable exists to avoid setting pagf_longest to zero for > benefit of the code that does unlocked scanning of AGs for free space. > > I also wonder if the unlocked ag scan should just note that it observed > a zero pagf_longest and if no space can be found across all the AGs, to > try again with locks? > > --D Currently xfs checks the space using the "check-lock-check again" algorithm, which I understand to be more efficient. If there are a large number of AG's and there is no free space in front of them, the performance may be affected by checking the lock again. So I think targeting specific AG might be more effective. Although the current code process has a retry mechanism (in xfs_dialloc), it still can't completely solve the problem: for example, there is no space for the first scan, and the second scan has space but the longest is 0 in the temporary state and return -ENOSPC, etc... Thanks, Zizhi Wo > >> } bc_ag; >> struct { >> struct xfbtree *xfbtree; >> -- >> 2.39.2 >> >> ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [PATCH] xfs: Fix file creation failure 2024-06-04 7:11 [PATCH] xfs: Fix file creation failure Zizhi Wo 2024-06-04 15:56 ` Darrick J. Wong @ 2024-06-04 23:00 ` Dave Chinner 2024-06-04 23:53 ` Darrick J. Wong 2024-06-05 2:51 ` Zizhi Wo 1 sibling, 2 replies; 8+ messages in thread From: Dave Chinner @ 2024-06-04 23:00 UTC (permalink / raw) To: Zizhi Wo; +Cc: chandan.babu, djwong, dchinner, linux-xfs, linux-kernel, yangerkun On Tue, Jun 04, 2024 at 03:11:21PM +0800, Zizhi Wo wrote: > We have an xfs image that contains only 2 AGs, the first AG is full and > the second AG is empty, then a concurrent file creation and little writing > could unexpectedly return -ENOSPC error since there is a race window that > the allocator could get the wrong agf->agf_longest. > > Write file process steps: > 1) Find the entry that best meets the conditions, then calculate the start > address and length of the remaining part of the entry after allocation. > 2) Delete this entry. Because the second AG is empty, the btree in its agf > has only one record, and agf->agf_longest will be set to 0 after deletion. > 3) Insert the remaining unused parts of this entry based on the > calculations in 1), and update the agf->agf_longest. > > Create file process steps: > 1) Check whether there are free inodes in the inode chunk. > 2) If there is no free inode, check whether there has space for creating > inode chunks, perform the no-lock judgment first. > 3) If the judgment succeeds, the judgment is performed again with agf lock > held. Otherwire, an error is returned directly. > > If the write process is in step 2) but not go to 3) yet, the create file > process goes to 2) at this time, it will be mistaken for no space, > resulting in the file system still has space but the file creation fails. > > Direct write Create file > xfs_file_write_iter > ... > xfs_direct_write_iomap_begin > xfs_iomap_write_direct > ... > xfs_alloc_ag_vextent_near > xfs_alloc_cur_finish > xfs_alloc_fixup_trees > xfs_btree_delete > xfs_btree_delrec > xfs_allocbt_update_lastrec > // longest = 0 because numrec == 0. > agf->agf_longest = len = 0 > xfs_create > ... > xfs_dialloc > ... > xfs_alloc_fix_freelist > xfs_alloc_space_available > -> as longest=0, it will return > false, no space for inode alloc. Ok, so this is a another attempt to address the problem Ye Bin attempted to fix here: https://lore.kernel.org/linux-xfs/20240419061848.1032366-1-yebin10@huawei.com/ > Fix this issue by adding the bc_free_longest field to the xfs_btree_cur_t > structure to store the potential longest count that will be updated. The > assignment is done in xfs_alloc_fixup_trees() and xfs_free_ag_extent(). I outlined how this should be fixed in the above thread: https://lore.kernel.org/linux-xfs/ZiWgRGWVG4aK1165@dread.disaster.area/ This is what I said: | What we actually want is for pag->pagf_longest not to change | transiently to zero in xfs_alloc_fixup_trees(). If the delrec that | zeroes the pagf_longest field is going to follow it up with an | insrec that will set it back to a valid value, we really should not | be doing the zeroing in the first place. | | Further, the only btree that tracks the right edge of the btree is | the by-count allocbt. This isn't "generic" functionality, even | though it is implemented through the generic btree code. If we lift | ->update_lastrec from the generic code and do it directly in | xfs_alloc.c whenever we are finished with a by-count tree update | and the cursor points to a record in the right-most leaf of the | tree, then we run the lastrec update directly at that point. | By decoupling the lastrec updates from the individual record | manipulations, we can make the transients disappear completely. I'm not sure if this patch is an attempt to implement this - there is no reference in the commit description to this previous attempt to fix the issue, nor is the any discussion of why this particular solution was chosen. In future, when you are trying to fix an issue that has previously been discussed/presented on the list, please reference it and provide a link to the previous discussions in the changelog for the new version of the patchset fixing the issue. > Reported by: Ye Bin <yebin10@huawei.com> > Signed-off-by: Zizhi Wo <wozizhi@huawei.com> > --- > fs/xfs/libxfs/xfs_alloc.c | 14 ++++++++++++++ > fs/xfs/libxfs/xfs_alloc_btree.c | 9 ++++++++- > fs/xfs/libxfs/xfs_btree.h | 1 + > 3 files changed, 23 insertions(+), 1 deletion(-) > > diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c > index 6c55a6e88eba..86ba873d57a8 100644 > --- a/fs/xfs/libxfs/xfs_alloc.c > +++ b/fs/xfs/libxfs/xfs_alloc.c > @@ -577,6 +577,13 @@ xfs_alloc_fixup_trees( > nfbno2 = rbno + rlen; > nflen2 = (fbno + flen) - nfbno2; > } > + > + /* > + * Record the potential maximum free length in advance. > + */ > + if (nfbno1 != NULLAGBLOCK || nfbno2 != NULLAGBLOCK) > + cnt_cur->bc_ag.bc_free_longest = XFS_EXTLEN_MAX(nflen1, nflen2); > + Why do we store the length of a random extent being freed here? nflen1/2 almost always have nothing to do with the longest free space extent in the tree, they are just the new free space extents we are insering into a random location in the free space tree. > /* > * Delete the entry from the by-size btree. > */ > @@ -2044,6 +2051,13 @@ xfs_free_ag_extent( > * Now allocate and initialize a cursor for the by-size tree. > */ > cnt_cur = xfs_cntbt_init_cursor(mp, tp, agbp, pag); > + /* > + * Record the potential maximum free length in advance. > + */ > + if (haveleft) > + cnt_cur->bc_ag.bc_free_longest = ltlen; > + if (haveright) > + cnt_cur->bc_ag.bc_free_longest = gtlen; That doesn't look correct. At this point in the code, ltlen/gtlen are the sizes of the physically adjacent freespace extents that we are going to merge the newly freed extent with. i.e. the new freespace extent is going to have one of 4 possible values: no merge: len left merge: ltlen + len right merge: gtlen + len both merge: ltlen + gtlen + len So regardless of anything else, this code isn't setting the longest freespace extent in teh AGF to the lenght of the longest freespace extent in the filesystem. Which leads me to ask: how did you test this code? This bug should have been triggering verifier, repair and scrub failures quite quickly with fstests.... > /* > * Have both left and right contiguous neighbors. > * Merge all three into a single free block. > diff --git a/fs/xfs/libxfs/xfs_alloc_btree.c b/fs/xfs/libxfs/xfs_alloc_btree.c > index 6ef5ddd89600..8e7d1e0f1a63 100644 > --- a/fs/xfs/libxfs/xfs_alloc_btree.c > +++ b/fs/xfs/libxfs/xfs_alloc_btree.c > @@ -161,7 +161,14 @@ xfs_allocbt_update_lastrec( > rrp = XFS_ALLOC_REC_ADDR(cur->bc_mp, block, numrecs); > len = rrp->ar_blockcount; > } else { > - len = 0; > + /* > + * Update in advance to prevent file creation failure > + * for concurrent processes even though there is no > + * numrec currently. > + * And there's no need to worry as the value that no > + * less than bc_free_longest will be inserted later. > + */ > + len = cpu_to_be32(cur->bc_ag.bc_free_longest); > } So this is in the LASTREC_DELREC case when the last record is removed from the btree. This is what causes the transient state as we do this when deleting a record to trim it and then re-insert the remainder back into the by-count btree. Writing some random transient value into the AGF *and journalling it* means we creating a transient on-disk format structure corruption and potentially writing it to persistent storage (i.e. the journal). The structure is, at least, not consistent in memory because the free space tree is empty at this point in time, whilst the agf->longest field says it has a free space available. This could trip verifiers, be flagged as corruption by xfs_scrub/repair, etc. Now, this *might be safe* because we *may* clean it up later in the transaction, but if this really is the last extent being removed from the btree and a cursor has previously been used to do other insert and free operations that set this field, then we trip over this stale inforamtion and write a corrupt structure to disk. That's not good. As I said above, this "last record tracking" needs to be ripped out of the generic btree code because only the by-count btree uses it. Then it can be updated at the end of the by-count btree update process in the allocation code (i.e. after all record manipulations are done in the transaction) and that avoids this transient caused by updating the last record on every btree record update that is done. Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [PATCH] xfs: Fix file creation failure 2024-06-04 23:00 ` Dave Chinner @ 2024-06-04 23:53 ` Darrick J. Wong 2024-06-05 2:51 ` Zizhi Wo 1 sibling, 0 replies; 8+ messages in thread From: Darrick J. Wong @ 2024-06-04 23:53 UTC (permalink / raw) To: Dave Chinner Cc: Zizhi Wo, chandan.babu, dchinner, linux-xfs, linux-kernel, yangerkun On Wed, Jun 05, 2024 at 09:00:28AM +1000, Dave Chinner wrote: > On Tue, Jun 04, 2024 at 03:11:21PM +0800, Zizhi Wo wrote: > > We have an xfs image that contains only 2 AGs, the first AG is full and > > the second AG is empty, then a concurrent file creation and little writing > > could unexpectedly return -ENOSPC error since there is a race window that > > the allocator could get the wrong agf->agf_longest. > > > > Write file process steps: > > 1) Find the entry that best meets the conditions, then calculate the start > > address and length of the remaining part of the entry after allocation. > > 2) Delete this entry. Because the second AG is empty, the btree in its agf > > has only one record, and agf->agf_longest will be set to 0 after deletion. > > 3) Insert the remaining unused parts of this entry based on the > > calculations in 1), and update the agf->agf_longest. > > > > Create file process steps: > > 1) Check whether there are free inodes in the inode chunk. > > 2) If there is no free inode, check whether there has space for creating > > inode chunks, perform the no-lock judgment first. > > 3) If the judgment succeeds, the judgment is performed again with agf lock > > held. Otherwire, an error is returned directly. > > > > If the write process is in step 2) but not go to 3) yet, the create file > > process goes to 2) at this time, it will be mistaken for no space, > > resulting in the file system still has space but the file creation fails. > > > > Direct write Create file > > xfs_file_write_iter > > ... > > xfs_direct_write_iomap_begin > > xfs_iomap_write_direct > > ... > > xfs_alloc_ag_vextent_near > > xfs_alloc_cur_finish > > xfs_alloc_fixup_trees > > xfs_btree_delete > > xfs_btree_delrec > > xfs_allocbt_update_lastrec > > // longest = 0 because numrec == 0. > > agf->agf_longest = len = 0 > > xfs_create > > ... > > xfs_dialloc > > ... > > xfs_alloc_fix_freelist > > xfs_alloc_space_available > > -> as longest=0, it will return > > false, no space for inode alloc. > > Ok, so this is a another attempt to address the problem Ye Bin > attempted to fix here: > > https://lore.kernel.org/linux-xfs/20240419061848.1032366-1-yebin10@huawei.com/ > > > Fix this issue by adding the bc_free_longest field to the xfs_btree_cur_t > > structure to store the potential longest count that will be updated. The > > assignment is done in xfs_alloc_fixup_trees() and xfs_free_ag_extent(). > > I outlined how this should be fixed in the above thread: > > https://lore.kernel.org/linux-xfs/ZiWgRGWVG4aK1165@dread.disaster.area/ > > This is what I said: > > | What we actually want is for pag->pagf_longest not to change > | transiently to zero in xfs_alloc_fixup_trees(). If the delrec that > | zeroes the pagf_longest field is going to follow it up with an > | insrec that will set it back to a valid value, we really should not > | be doing the zeroing in the first place. > | > | Further, the only btree that tracks the right edge of the btree is > | the by-count allocbt. This isn't "generic" functionality, even > | though it is implemented through the generic btree code. If we lift > | ->update_lastrec from the generic code and do it directly in > | xfs_alloc.c whenever we are finished with a by-count tree update > | and the cursor points to a record in the right-most leaf of the > | tree, then we run the lastrec update directly at that point. > | By decoupling the lastrec updates from the individual record > | manipulations, we can make the transients disappear completely. > > I'm not sure if this patch is an attempt to implement this - there > is no reference in the commit description to this previous attempt > to fix the issue, nor is the any discussion of why this particular > solution was chosen. > > In future, when you are trying to fix an issue that has previously > been discussed/presented on the list, please reference it and > provide a link to the previous discussions in the changelog for the > new version of the patchset fixing the issue. Aha, that's why this seemed oddly familiar. --D > > Reported by: Ye Bin <yebin10@huawei.com> > > Signed-off-by: Zizhi Wo <wozizhi@huawei.com> > > --- > > fs/xfs/libxfs/xfs_alloc.c | 14 ++++++++++++++ > > fs/xfs/libxfs/xfs_alloc_btree.c | 9 ++++++++- > > fs/xfs/libxfs/xfs_btree.h | 1 + > > 3 files changed, 23 insertions(+), 1 deletion(-) > > > > diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c > > index 6c55a6e88eba..86ba873d57a8 100644 > > --- a/fs/xfs/libxfs/xfs_alloc.c > > +++ b/fs/xfs/libxfs/xfs_alloc.c > > @@ -577,6 +577,13 @@ xfs_alloc_fixup_trees( > > nfbno2 = rbno + rlen; > > nflen2 = (fbno + flen) - nfbno2; > > } > > + > > + /* > > + * Record the potential maximum free length in advance. > > + */ > > + if (nfbno1 != NULLAGBLOCK || nfbno2 != NULLAGBLOCK) > > + cnt_cur->bc_ag.bc_free_longest = XFS_EXTLEN_MAX(nflen1, nflen2); > > + > > Why do we store the length of a random extent being freed here? > nflen1/2 almost always have nothing to do with the longest free > space extent in the tree, they are just the new free space extents > we are insering into a random location in the free space tree. > > > /* > > * Delete the entry from the by-size btree. > > */ > > @@ -2044,6 +2051,13 @@ xfs_free_ag_extent( > > * Now allocate and initialize a cursor for the by-size tree. > > */ > > cnt_cur = xfs_cntbt_init_cursor(mp, tp, agbp, pag); > > + /* > > + * Record the potential maximum free length in advance. > > + */ > > + if (haveleft) > > + cnt_cur->bc_ag.bc_free_longest = ltlen; > > + if (haveright) > > + cnt_cur->bc_ag.bc_free_longest = gtlen; > > That doesn't look correct. At this point in the code, ltlen/gtlen > are the sizes of the physically adjacent freespace extents that we > are going to merge the newly freed extent with. i.e. the new > freespace extent is going to have one of 4 possible values: > > no merge: len > left merge: ltlen + len > right merge: gtlen + len > both merge: ltlen + gtlen + len > > So regardless of anything else, this code isn't setting the longest > freespace extent in teh AGF to the lenght of the longest freespace > extent in the filesystem. > > Which leads me to ask: how did you test this code? This bug should > have been triggering verifier, repair and scrub failures quite > quickly with fstests.... > > > /* > > * Have both left and right contiguous neighbors. > > * Merge all three into a single free block. > > diff --git a/fs/xfs/libxfs/xfs_alloc_btree.c b/fs/xfs/libxfs/xfs_alloc_btree.c > > index 6ef5ddd89600..8e7d1e0f1a63 100644 > > --- a/fs/xfs/libxfs/xfs_alloc_btree.c > > +++ b/fs/xfs/libxfs/xfs_alloc_btree.c > > @@ -161,7 +161,14 @@ xfs_allocbt_update_lastrec( > > rrp = XFS_ALLOC_REC_ADDR(cur->bc_mp, block, numrecs); > > len = rrp->ar_blockcount; > > } else { > > - len = 0; > > + /* > > + * Update in advance to prevent file creation failure > > + * for concurrent processes even though there is no > > + * numrec currently. > > + * And there's no need to worry as the value that no > > + * less than bc_free_longest will be inserted later. > > + */ > > + len = cpu_to_be32(cur->bc_ag.bc_free_longest); > > } > > So this is in the LASTREC_DELREC case when the last record is > removed from the btree. This is what causes the transient state > as we do this when deleting a record to trim it and then re-insert > the remainder back into the by-count btree. > > Writing some random transient value into the AGF *and journalling > it* means we creating a transient on-disk format structure > corruption and potentially writing it to persistent storage (i.e. > the journal). The structure is, at least, not consistent in memory > because the free space tree is empty at this point in time, whilst > the agf->longest field says it has a free space available. This > could trip verifiers, be flagged as corruption by xfs_scrub/repair, > etc. > > Now, this *might be safe* because we *may* clean it up later in the > transaction, but if this really is the last extent being removed > from the btree and a cursor has previously been used to do other > insert and free operations that set this field, then we trip over > this stale inforamtion and write a corrupt structure to disk. That's > not good. > > As I said above, this "last record tracking" needs to be ripped out > of the generic btree code because only the by-count btree uses it. > Then it can be updated at the end of the by-count btree update > process in the allocation code (i.e. after all record manipulations > are done in the transaction) and that avoids this transient caused > by updating the last record on every btree record update that is > done. > > Cheers, > > Dave. > -- > Dave Chinner > david@fromorbit.com > ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [PATCH] xfs: Fix file creation failure 2024-06-04 23:00 ` Dave Chinner 2024-06-04 23:53 ` Darrick J. Wong @ 2024-06-05 2:51 ` Zizhi Wo 2024-06-05 6:40 ` Dave Chinner 1 sibling, 1 reply; 8+ messages in thread From: Zizhi Wo @ 2024-06-05 2:51 UTC (permalink / raw) To: Dave Chinner Cc: chandan.babu, djwong, dchinner, linux-xfs, linux-kernel, yangerkun 在 2024/6/5 7:00, Dave Chinner 写道: > On Tue, Jun 04, 2024 at 03:11:21PM +0800, Zizhi Wo wrote: >> We have an xfs image that contains only 2 AGs, the first AG is full and >> the second AG is empty, then a concurrent file creation and little writing >> could unexpectedly return -ENOSPC error since there is a race window that >> the allocator could get the wrong agf->agf_longest. >> >> Write file process steps: >> 1) Find the entry that best meets the conditions, then calculate the start >> address and length of the remaining part of the entry after allocation. >> 2) Delete this entry. Because the second AG is empty, the btree in its agf >> has only one record, and agf->agf_longest will be set to 0 after deletion. >> 3) Insert the remaining unused parts of this entry based on the >> calculations in 1), and update the agf->agf_longest. >> >> Create file process steps: >> 1) Check whether there are free inodes in the inode chunk. >> 2) If there is no free inode, check whether there has space for creating >> inode chunks, perform the no-lock judgment first. >> 3) If the judgment succeeds, the judgment is performed again with agf lock >> held. Otherwire, an error is returned directly. >> >> If the write process is in step 2) but not go to 3) yet, the create file >> process goes to 2) at this time, it will be mistaken for no space, >> resulting in the file system still has space but the file creation fails. >> >> Direct write Create file >> xfs_file_write_iter >> ... >> xfs_direct_write_iomap_begin >> xfs_iomap_write_direct >> ... >> xfs_alloc_ag_vextent_near >> xfs_alloc_cur_finish >> xfs_alloc_fixup_trees >> xfs_btree_delete >> xfs_btree_delrec >> xfs_allocbt_update_lastrec >> // longest = 0 because numrec == 0. >> agf->agf_longest = len = 0 >> xfs_create >> ... >> xfs_dialloc >> ... >> xfs_alloc_fix_freelist >> xfs_alloc_space_available >> -> as longest=0, it will return >> false, no space for inode alloc. > > Ok, so this is a another attempt to address the problem Ye Bin > attempted to fix here: > > https://lore.kernel.org/linux-xfs/20240419061848.1032366-1-yebin10@huawei.com/ > >> Fix this issue by adding the bc_free_longest field to the xfs_btree_cur_t >> structure to store the potential longest count that will be updated. The >> assignment is done in xfs_alloc_fixup_trees() and xfs_free_ag_extent(). > > I outlined how this should be fixed in the above thread: > > https://lore.kernel.org/linux-xfs/ZiWgRGWVG4aK1165@dread.disaster.area/ > > This is what I said: > > | What we actually want is for pag->pagf_longest not to change > | transiently to zero in xfs_alloc_fixup_trees(). If the delrec that > | zeroes the pagf_longest field is going to follow it up with an > | insrec that will set it back to a valid value, we really should not > | be doing the zeroing in the first place. > | > | Further, the only btree that tracks the right edge of the btree is > | the by-count allocbt. This isn't "generic" functionality, even > | though it is implemented through the generic btree code. If we lift > | ->update_lastrec from the generic code and do it directly in > | xfs_alloc.c whenever we are finished with a by-count tree update > | and the cursor points to a record in the right-most leaf of the > | tree, then we run the lastrec update directly at that point. > | By decoupling the lastrec updates from the individual record > | manipulations, we can make the transients disappear completely. > > I'm not sure if this patch is an attempt to implement this - there > is no reference in the commit description to this previous attempt > to fix the issue, nor is the any discussion of why this particular > solution was chosen. > > In future, when you are trying to fix an issue that has previously > been discussed/presented on the list, please reference it and > provide a link to the previous discussions in the changelog for the > new version of the patchset fixing the issue. Oh, I'm sorry for the confusion I caused you. And I will reference it next time. > >> Reported by: Ye Bin <yebin10@huawei.com> >> Signed-off-by: Zizhi Wo <wozizhi@huawei.com> >> --- >> fs/xfs/libxfs/xfs_alloc.c | 14 ++++++++++++++ >> fs/xfs/libxfs/xfs_alloc_btree.c | 9 ++++++++- >> fs/xfs/libxfs/xfs_btree.h | 1 + >> 3 files changed, 23 insertions(+), 1 deletion(-) >> >> diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c >> index 6c55a6e88eba..86ba873d57a8 100644 >> --- a/fs/xfs/libxfs/xfs_alloc.c >> +++ b/fs/xfs/libxfs/xfs_alloc.c >> @@ -577,6 +577,13 @@ xfs_alloc_fixup_trees( >> nfbno2 = rbno + rlen; >> nflen2 = (fbno + flen) - nfbno2; >> } >> + >> + /* >> + * Record the potential maximum free length in advance. >> + */ >> + if (nfbno1 != NULLAGBLOCK || nfbno2 != NULLAGBLOCK) >> + cnt_cur->bc_ag.bc_free_longest = XFS_EXTLEN_MAX(nflen1, nflen2); >> + > > Why do we store the length of a random extent being freed here? > nflen1/2 almost always have nothing to do with the longest free > space extent in the tree, they are just the new free space extents > we are insering into a random location in the free space tree. > First of all, there may be ambiguity in the name of the bc_free_longest field. I'm sorry for that. Its only role is to give the longest non-0 in a particular scenario. Yes, nflen1/2 can't determine the subsequent operation, but they are used to update the longest record only if the numrec in cntbt is zero, the last has been deleted and a new record will be added soon (that is, there is still space left on the file system), and that is their only function. So at this time nflen1/2 are not random extent, they indicate the maximum value to be inserted later. If cntbt does not need to be updated longest or the numrec is not zero, then bc_free_longest will not be used to update the longest. >> /* >> * Delete the entry from the by-size btree. >> */ >> @@ -2044,6 +2051,13 @@ xfs_free_ag_extent( >> * Now allocate and initialize a cursor for the by-size tree. >> */ >> cnt_cur = xfs_cntbt_init_cursor(mp, tp, agbp, pag); >> + /* >> + * Record the potential maximum free length in advance. >> + */ >> + if (haveleft) >> + cnt_cur->bc_ag.bc_free_longest = ltlen; >> + if (haveright) >> + cnt_cur->bc_ag.bc_free_longest = gtlen; > > That doesn't look correct. At this point in the code, ltlen/gtlen > are the sizes of the physically adjacent freespace extents that we > are going to merge the newly freed extent with. i.e. the new > freespace extent is going to have one of 4 possible values: > > no merge: len > left merge: ltlen + len > right merge: gtlen + len > both merge: ltlen + gtlen + len > > So regardless of anything else, this code isn't setting the longest > freespace extent in teh AGF to the lenght of the longest freespace > extent in the filesystem. > Which leads me to ask: how did you test this code? This bug should > have been triggering verifier, repair and scrub failures quite > quickly with fstests.... > The logic I'm considering here is that the record is less than or equal to the maximum value that will be updated soon, as I wrote "potentially" in the comment. And consider the following two scenarios: 1) If it is no merge, then haveleft == 0 && haveright == 0, and bc_free_longest will not be assigned, and is no need to worry about the longest update at this time. 2) If it is in merge scenario, only updating the original values here, and the actual updates are put into the subsequent xfs_btree_insert(). There is no need to worry about atomicity, both are carried out in the same transaction. All we have to do is the longest non-0. As long as the fast path judgment without locking passes, the longest must be updated to the correct value during the second lock judgment. I tested this part of the code, passed xfstests, and local validation found no problems. >> /* >> * Have both left and right contiguous neighbors. >> * Merge all three into a single free block. >> diff --git a/fs/xfs/libxfs/xfs_alloc_btree.c b/fs/xfs/libxfs/xfs_alloc_btree.c >> index 6ef5ddd89600..8e7d1e0f1a63 100644 >> --- a/fs/xfs/libxfs/xfs_alloc_btree.c >> +++ b/fs/xfs/libxfs/xfs_alloc_btree.c >> @@ -161,7 +161,14 @@ xfs_allocbt_update_lastrec( >> rrp = XFS_ALLOC_REC_ADDR(cur->bc_mp, block, numrecs); >> len = rrp->ar_blockcount; >> } else { >> - len = 0; >> + /* >> + * Update in advance to prevent file creation failure >> + * for concurrent processes even though there is no >> + * numrec currently. >> + * And there's no need to worry as the value that no >> + * less than bc_free_longest will be inserted later. >> + */ >> + len = cpu_to_be32(cur->bc_ag.bc_free_longest); >> } > > So this is in the LASTREC_DELREC case when the last record is > removed from the btree. This is what causes the transient state > as we do this when deleting a record to trim it and then re-insert > the remainder back into the by-count btree. > > Writing some random transient value into the AGF *and journalling > it* means we creating a transient on-disk format structure > corruption and potentially writing it to persistent storage (i.e. > the journal). The structure is, at least, not consistent in memory > because the free space tree is empty at this point in time, whilst > the agf->longest field says it has a free space available. This > could trip verifiers, be flagged as corruption by xfs_scrub/repair, > etc. > I'm sorry, but I didn't find the problem during my own screening. In my opinion, because the trigger scenario for the current problem is only to delete the last node and be updated shortly, and bc_free_longest is used only in the following two scenarios: 1) cntbt has only one extent and remains after being used, so nflen 1/2 will be inserted later. 2) cntbt has only one extent and the released extent is adjacent to this record. This unique record will be deleted firstly, and then the two extents are merged and inserted. The above two scenarios are both within the same transaction, and both are guaranteed by a xfs_buf lock. When run xfs_trans_commit, code have gone through the delete and insert process, or merge and update process. So the longest value written to the disk is already the correct value and matches the cntbt state at this time. In other scenarios, the numrec of cntbt cannot be 0, so the longest cannot be updated through bc_free_longest. Thanks, Zizhi Wo > Now, this *might be safe* because we *may* clean it up later in the > transaction, but if this really is the last extent being removed > from the btree and a cursor has previously been used to do other > insert and free operations that set this field, then we trip over > this stale inforamtion and write a corrupt structure to disk. That's > not good. > > As I said above, this "last record tracking" needs to be ripped out > of the generic btree code because only the by-count btree uses it. > Then it can be updated at the end of the by-count btree update > process in the allocation code (i.e. after all record manipulations > are done in the transaction) and that avoids this transient caused > by updating the last record on every btree record update that is > done. > > Cheers, > > Dave. ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [PATCH] xfs: Fix file creation failure 2024-06-05 2:51 ` Zizhi Wo @ 2024-06-05 6:40 ` Dave Chinner 2024-06-05 8:46 ` Zizhi Wo 0 siblings, 1 reply; 8+ messages in thread From: Dave Chinner @ 2024-06-05 6:40 UTC (permalink / raw) To: Zizhi Wo; +Cc: chandan.babu, djwong, dchinner, linux-xfs, linux-kernel, yangerkun On Wed, Jun 05, 2024 at 10:51:31AM +0800, Zizhi Wo wrote: > 在 2024/6/5 7:00, Dave Chinner 写道: > > On Tue, Jun 04, 2024 at 03:11:21PM +0800, Zizhi Wo wrote: > > > We have an xfs image that contains only 2 AGs, the first AG is full and > > > the second AG is empty, then a concurrent file creation and little writing > > > could unexpectedly return -ENOSPC error since there is a race window that > > > the allocator could get the wrong agf->agf_longest. ..... > > > diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c > > > index 6c55a6e88eba..86ba873d57a8 100644 > > > --- a/fs/xfs/libxfs/xfs_alloc.c > > > +++ b/fs/xfs/libxfs/xfs_alloc.c > > > @@ -577,6 +577,13 @@ xfs_alloc_fixup_trees( > > > nfbno2 = rbno + rlen; > > > nflen2 = (fbno + flen) - nfbno2; > > > } > > > + > > > + /* > > > + * Record the potential maximum free length in advance. > > > + */ > > > + if (nfbno1 != NULLAGBLOCK || nfbno2 != NULLAGBLOCK) > > > + cnt_cur->bc_ag.bc_free_longest = XFS_EXTLEN_MAX(nflen1, nflen2); > > > + > > > > Why do we store the length of a random extent being freed here? > > nflen1/2 almost always have nothing to do with the longest free > > space extent in the tree, they are just the new free space extents > > we are insering into a random location in the free space tree. > > > > First of all, there may be ambiguity in the name of the bc_free_longest > field. I'm sorry for that. Its only role is to give the longest non-0 in > a particular scenario. > > Yes, nflen1/2 can't determine the subsequent operation, but they are > used to update the longest record only if the numrec in cntbt is zero, > the last has been deleted and a new record will be added soon (that is, > there is still space left on the file system), and that is their only > function. So at this time nflen1/2 are not random extent, they indicate > the maximum value to be inserted later. If cntbt does not need to be > updated longest or the numrec is not zero, then bc_free_longest will not > be used to update the longest. That's the comment you should have put in the code. Comments need to explain -why- the code exists, not tell us -what- the code does. We know the latter from reading the code, we cannot know the former from reading the code... > > > /* > > > * Delete the entry from the by-size btree. > > > */ > > > @@ -2044,6 +2051,13 @@ xfs_free_ag_extent( > > > * Now allocate and initialize a cursor for the by-size tree. > > > */ > > > cnt_cur = xfs_cntbt_init_cursor(mp, tp, agbp, pag); > > > + /* > > > + * Record the potential maximum free length in advance. > > > + */ > > > + if (haveleft) > > > + cnt_cur->bc_ag.bc_free_longest = ltlen; > > > + if (haveright) > > > + cnt_cur->bc_ag.bc_free_longest = gtlen; > > > > That doesn't look correct. At this point in the code, ltlen/gtlen > > are the sizes of the physically adjacent freespace extents that we > > are going to merge the newly freed extent with. i.e. the new > > freespace extent is going to have one of 4 possible values: > > > > no merge: len > > left merge: ltlen + len > > right merge: gtlen + len > > both merge: ltlen + gtlen + len > > > > So regardless of anything else, this code isn't setting the longest > > freespace extent in teh AGF to the lenght of the longest freespace > > extent in the filesystem. > > > Which leads me to ask: how did you test this code? This bug should > > have been triggering verifier, repair and scrub failures quite > > quickly with fstests.... > > > > The logic I'm considering here is that the record is less than or equal > to the maximum value that will be updated soon, as I wrote "potentially" > in the comment. And consider the following two scenarios: > 1) If it is no merge, then haveleft == 0 && haveright == 0, and > bc_free_longest will not be assigned, and is no need to worry about the > longest update at this time. > 2) If it is in merge scenario, only updating the original values here, > and the actual updates are put into the subsequent xfs_btree_insert(). > There is no need to worry about atomicity, both are carried out in the > same transaction. All we have to do is the longest non-0. As long as the > fast path judgment without locking passes, the longest must be updated > to the correct value during the second lock judgment. And therein lies the problem. We've learnt the had way not to do partial state updates that might end up on disk even within transactions. At some point in the future, we'll change the way the code works, not realising that there's an inconsistent transient state being logged, and some time after that we'll have occasional reports of weird failures with values on disk or in the journal we cannot explain. > I tested this part of the code, passed xfstests, and local validation > found no problems. yeah, fstests won't expose the inconsistent state *right now*; the problem is that we've just left a landmine for future developers to step on. It's also really difficult to follow - you've added non-obvious coupling between the free space btree updates and the AGF updates via a field held in a btree cursor. This essentially means that all this stuff has to occur within the context of a single btree cursor, and that btree cursor cannot be re-used for further operations because this field is not reset by things like new lookups. Then there is the question of what happens if we have duplicated the cursor for a specific btree record operation, and the field is set in the duplicated cursor. The core btree operations do this in several places because they want to retain a cursor to the original poistion, but the specific operation that needs to be performed will change the cursor position (e.g. shifts, splits, merges, etc). Hence there's no guarantee that a single cursor is used for all the operations in a single btree modification, and hence storing information in cursors that is supposed to persist until some other btree modification method is run is asking for trouble. Hence, IMO, coupling allocation btree coherency operations via the btree cursor is something we should avoid at all costs. That's why I keep saying the last record update for the by-count/AGF longest needs to be moved outside the generic btree code itself. The coherency and coupling needs to be directly controlled by the high level alloc code, not by trying to punch special data holes through the generic btree abstractions. > > > /* > > > * Have both left and right contiguous neighbors. > > > * Merge all three into a single free block. > > > diff --git a/fs/xfs/libxfs/xfs_alloc_btree.c b/fs/xfs/libxfs/xfs_alloc_btree.c > > > index 6ef5ddd89600..8e7d1e0f1a63 100644 > > > --- a/fs/xfs/libxfs/xfs_alloc_btree.c > > > +++ b/fs/xfs/libxfs/xfs_alloc_btree.c > > > @@ -161,7 +161,14 @@ xfs_allocbt_update_lastrec( > > > rrp = XFS_ALLOC_REC_ADDR(cur->bc_mp, block, numrecs); > > > len = rrp->ar_blockcount; > > > } else { > > > - len = 0; > > > + /* > > > + * Update in advance to prevent file creation failure > > > + * for concurrent processes even though there is no > > > + * numrec currently. > > > + * And there's no need to worry as the value that no > > > + * less than bc_free_longest will be inserted later. > > > + */ > > > + len = cpu_to_be32(cur->bc_ag.bc_free_longest); > > > } > > > > So this is in the LASTREC_DELREC case when the last record is > > removed from the btree. This is what causes the transient state > > as we do this when deleting a record to trim it and then re-insert > > the remainder back into the by-count btree. > > > > Writing some random transient value into the AGF *and journalling > > it* means we creating a transient on-disk format structure > > corruption and potentially writing it to persistent storage (i.e. > > the journal). The structure is, at least, not consistent in memory > > because the free space tree is empty at this point in time, whilst > > the agf->longest field says it has a free space available. This > > could trip verifiers, be flagged as corruption by xfs_scrub/repair, > > etc. > > > > I'm sorry, but I didn't find the problem during my own screening. In my > opinion, because the trigger scenario for the current problem is only to > delete the last node and be updated shortly, and bc_free_longest is used > only in the following two scenarios: > 1) cntbt has only one extent and remains after being used, so nflen 1/2 > will be inserted later. > 2) cntbt has only one extent and the released extent is adjacent to this > record. This unique record will be deleted firstly, and then the two > extents are merged and inserted. Yes, I understand what you've done. But I don't think it is the right way to fix the issue for the reasons I've given. I've attached a quick hack (not even compile tested!) to demonstrate the way I've been suggesting this issue should be fixed by the removal of the lastrec update code from the generic btree implementation. It updates pag->pagf_longest and agf->longest directly from the high level allocation code once the free space btree manipulations have been completed. We do this in a context where we control AGF, the perag and the btree cursors directly, and there is no need to temporarily store anything in the btree cursors at all. Feel free to use this code as the basis of future patches to address this issue. -Dave. -- Dave Chinner david@fromorbit.com xfs: avoid races with btree lastrec updates From: Dave Chinner <dchinner@redhat.com> When we modify the last record in a generic btree, the calls a function to indicate to the btree implementation that the operation is modifying the right-most record in the tree. This is only used by the by-size freespace btree, and it is used to update the agf->longest field that tracks the longest contiguous free extent in the AG. This, however, is racy with respect to external sampling of the agf->longest field. We may be doing multiple operations on the by-size freespace tree to make an update - an extent length change requires deleting the current extent, then inserting a new extent with the new size. When the current extent is the only free space extent in the AG (e.g. it is entirely empty), we end up with a transient state where there is no free space extents in the btree. By the time the modification completes, however, there is once again a free space extent in the tree, and the agf->longest value gets set to the correct size again. The issue is that external sampling of agf->longest during allocation AG selection will see the extent as having no space free, and so will skip it. This can lead to transient ENOSPC errors when there is an entire AG of free space available for use. To fix this, remove the lastrec update triggers from the generic btree code. Move the by-size last record update code to the end of the by-size btree modifications, such that it is only changed once during a free space extent allocation or freeing operation. This removes the externally visible transient empty state, and so avoids the transient, erroneous ENOSPC conditions that can currently occur. XXX: the detection of the last record update being required could probably be improved - just checking for "in the last block" works just fine, and we are already going to be logging the AGF for the free space counter updates, so it might just be that always doing the update when we land in the last block isn't such a big deal performance wise. --- fs/xfs/libxfs/xfs_alloc.c | 94 +++++++++++++++++++++++++++++++++++++++++ fs/xfs/libxfs/xfs_alloc_btree.c | 63 --------------------------- fs/xfs/libxfs/xfs_btree.c | 51 ---------------------- fs/xfs/libxfs/xfs_btree.h | 6 --- 4 files changed, 94 insertions(+), 120 deletions(-) diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c index 35fbd6b19682..f256b7724cef 100644 --- a/fs/xfs/libxfs/xfs_alloc.c +++ b/fs/xfs/libxfs/xfs_alloc.c @@ -463,6 +463,76 @@ xfs_alloc_fix_len( args->len = rlen; } +/* + * Determine if the cursor points to the block that contains the right-most + * block of records in the by-count btree. This block contains the largest + * contiguous free extent in the AG, so if we modify ia record in this block we + * need to call xfs_alloc_fixup_longest() once the modifications are done to + * ensure the agf->agf_longest field is kept up to date with the longest free + * extent tracked by the by-count btree. + */ +static bool +xfs_alloc_cursor_at_lastrec( + struct xfs_btree_cur *cnt_cur) +{ + struct xfs_btree_block *block; + union xfs_btree_ptr ptr; + struct xfs_buf *bp; + + block = xfs_btree_get_block(cnt_cur, 0, &bp); + + xfs_btree_get_sibling(cnt_cur, block, &ptr, XFS_BB_RIGHTSIB); + if (!xfs_btree_ptr_is_null(cnt_cur, &ptr)) + return false; + return true; +} + +/* + * Update the longest contiguous free extent in the AG from the by-count cursor + * that is passed to us. This should be done at the end of any allocation or + * freeing operation that touches the longest extent in the btree. + * + * Needing to update the longest extent can be determined by calling + * xfs_alloc_cursor_at_lastrec() after the cursor is positioned for record + * modification but before the modification begins. + */ +static int +xfs_alloc_fixup_longest( + struct xfs_btree_cur *cnt_cur) +{ + struct xfs_perag *pag = cnt_cur->bc_ag->pag; + struct xfs_agf *agf; + xfs_agblock_t bno; + xfs_extlen_t len; + int error; + int i; + + /* lookup last rec and update AGF */ + error = xfs_alloc_lookup_le(cnt_cur, 0, pag->pagf_longest, &i); + if (error) + return error; + if (i == 0) { + /* empty tree */ + pag->pagf_longest = 0; + } else { + error = xfs_alloc_get_rec(cnt_cur, &bno, &len, &i); + if (error) + return error; + if (XFS_IS_CORRUPT(mp, i != 1)) { + xfs_btree_mark_sick(cnt_cur); + return -EFSCORRUPTED; + } + pag->pagf_longest = nflen1; + } + + + agf = XFS_BUF_TO_AGF(cur->bc_ag.agbp); + agf->agf_longest = cpu_to_be32(pag->pagf_longest); + xfs_alloc_log_agf(cur->bc_tp, cur->bc_ag.agbp, XFS_AGF_LONGEST); + + return 0; +} + /* * Update the two btrees, logically removing from freespace the extent * starting at rbno, rlen blocks. The extent is contained within the @@ -487,6 +557,7 @@ xfs_alloc_fixup_trees( xfs_extlen_t nflen1=0; /* first new free length */ xfs_extlen_t nflen2=0; /* second new free length */ struct xfs_mount *mp; + bool fixup_longest = false; mp = cnt_cur->bc_mp; @@ -575,9 +646,13 @@ xfs_alloc_fixup_trees( nfbno2 = rbno + rlen; nflen2 = (fbno + flen) - nfbno2; } + /* * Delete the entry from the by-size btree. */ + if (xfs_alloc_cursor_at_lastrec(cnt_cur)) + fixup_longest = true; + if ((error = xfs_btree_delete(cnt_cur, &i))) return error; if (XFS_IS_CORRUPT(mp, i != 1)) { @@ -615,6 +690,7 @@ xfs_alloc_fixup_trees( return -EFSCORRUPTED; } } + /* * Fix up the by-block btree entry(s). */ @@ -652,6 +728,10 @@ xfs_alloc_fixup_trees( return -EFSCORRUPTED; } } + + if (fixup_longest) + return xfs_alloc_fixup_longest(cnt_cur); + return 0; } @@ -1954,6 +2034,7 @@ xfs_free_ag_extent( int i; int error; struct xfs_perag *pag = agbp->b_pag; + bool fixup_longest = false; bno_cur = cnt_cur = NULL; mp = tp->t_mountp; @@ -2217,8 +2298,13 @@ xfs_free_ag_extent( } xfs_btree_del_cursor(bno_cur, XFS_BTREE_NOERROR); bno_cur = NULL; + /* * In all cases we need to insert the new freespace in the by-size tree. + * + * If this new freespace is being inserted in the block that contains + * the largest free space in the btree, make sure we also fix up the + * agf->agf-longest tracker field. */ if ((error = xfs_alloc_lookup_eq(cnt_cur, nbno, nlen, &i))) goto error0; @@ -2227,6 +2313,8 @@ xfs_free_ag_extent( error = -EFSCORRUPTED; goto error0; } + if (xfs_alloc_cursor_at_lastrec(cnt_cur)) + fixup_longest = true; if ((error = xfs_btree_insert(cnt_cur, &i))) goto error0; if (XFS_IS_CORRUPT(mp, i != 1)) { @@ -2234,6 +2322,12 @@ xfs_free_ag_extent( error = -EFSCORRUPTED; goto error0; } + if (fixup_longest) { + error = xfs_alloc_fixup_longest(cnt_cur); + if (error) + goto error0; + } + xfs_btree_del_cursor(cnt_cur, XFS_BTREE_NOERROR); cnt_cur = NULL; diff --git a/fs/xfs/libxfs/xfs_alloc_btree.c b/fs/xfs/libxfs/xfs_alloc_btree.c index 6ef5ddd89600..866d822dcfe7 100644 --- a/fs/xfs/libxfs/xfs_alloc_btree.c +++ b/fs/xfs/libxfs/xfs_alloc_btree.c @@ -115,67 +115,6 @@ xfs_allocbt_free_block( return 0; } -/* - * Update the longest extent in the AGF - */ -STATIC void -xfs_allocbt_update_lastrec( - struct xfs_btree_cur *cur, - const struct xfs_btree_block *block, - const union xfs_btree_rec *rec, - int ptr, - int reason) -{ - struct xfs_agf *agf = cur->bc_ag.agbp->b_addr; - struct xfs_perag *pag; - __be32 len; - int numrecs; - - ASSERT(!xfs_btree_is_bno(cur->bc_ops)); - - switch (reason) { - case LASTREC_UPDATE: - /* - * If this is the last leaf block and it's the last record, - * then update the size of the longest extent in the AG. - */ - if (ptr != xfs_btree_get_numrecs(block)) - return; - len = rec->alloc.ar_blockcount; - break; - case LASTREC_INSREC: - if (be32_to_cpu(rec->alloc.ar_blockcount) <= - be32_to_cpu(agf->agf_longest)) - return; - len = rec->alloc.ar_blockcount; - break; - case LASTREC_DELREC: - numrecs = xfs_btree_get_numrecs(block); - if (ptr <= numrecs) - return; - ASSERT(ptr == numrecs + 1); - - if (numrecs) { - xfs_alloc_rec_t *rrp; - - rrp = XFS_ALLOC_REC_ADDR(cur->bc_mp, block, numrecs); - len = rrp->ar_blockcount; - } else { - len = 0; - } - - break; - default: - ASSERT(0); - return; - } - - agf->agf_longest = len; - pag = cur->bc_ag.agbp->b_pag; - pag->pagf_longest = be32_to_cpu(len); - xfs_alloc_log_agf(cur->bc_tp, cur->bc_ag.agbp, XFS_AGF_LONGEST); -} - STATIC int xfs_allocbt_get_minrecs( struct xfs_btree_cur *cur, @@ -493,7 +432,6 @@ const struct xfs_btree_ops xfs_bnobt_ops = { .set_root = xfs_allocbt_set_root, .alloc_block = xfs_allocbt_alloc_block, .free_block = xfs_allocbt_free_block, - .update_lastrec = xfs_allocbt_update_lastrec, .get_minrecs = xfs_allocbt_get_minrecs, .get_maxrecs = xfs_allocbt_get_maxrecs, .init_key_from_rec = xfs_allocbt_init_key_from_rec, @@ -525,7 +463,6 @@ const struct xfs_btree_ops xfs_cntbt_ops = { .set_root = xfs_allocbt_set_root, .alloc_block = xfs_allocbt_alloc_block, .free_block = xfs_allocbt_free_block, - .update_lastrec = xfs_allocbt_update_lastrec, .get_minrecs = xfs_allocbt_get_minrecs, .get_maxrecs = xfs_allocbt_get_maxrecs, .init_key_from_rec = xfs_allocbt_init_key_from_rec, diff --git a/fs/xfs/libxfs/xfs_btree.c b/fs/xfs/libxfs/xfs_btree.c index a119cbf71df1..c727dbcfebee 100644 --- a/fs/xfs/libxfs/xfs_btree.c +++ b/fs/xfs/libxfs/xfs_btree.c @@ -1331,30 +1331,6 @@ xfs_btree_init_block_cur( xfs_btree_owner(cur)); } -/* - * Return true if ptr is the last record in the btree and - * we need to track updates to this record. The decision - * will be further refined in the update_lastrec method. - */ -STATIC int -xfs_btree_is_lastrec( - struct xfs_btree_cur *cur, - struct xfs_btree_block *block, - int level) -{ - union xfs_btree_ptr ptr; - - if (level > 0) - return 0; - if (!(cur->bc_ops->geom_flags & XFS_BTGEO_LASTREC_UPDATE)) - return 0; - - xfs_btree_get_sibling(cur, block, &ptr, XFS_BB_RIGHTSIB); - if (!xfs_btree_ptr_is_null(cur, &ptr)) - return 0; - return 1; -} - STATIC void xfs_btree_buf_to_ptr( struct xfs_btree_cur *cur, @@ -2476,15 +2452,6 @@ xfs_btree_update( xfs_btree_copy_recs(cur, rp, rec, 1); xfs_btree_log_recs(cur, bp, ptr, ptr); - /* - * If we are tracking the last record in the tree and - * we are at the far right edge of the tree, update it. - */ - if (xfs_btree_is_lastrec(cur, block, 0)) { - cur->bc_ops->update_lastrec(cur, block, rec, - ptr, LASTREC_UPDATE); - } - /* Pass new key value up to our parent. */ if (xfs_btree_needs_key_update(cur, ptr)) { error = xfs_btree_update_keys(cur, 0); @@ -3786,15 +3753,6 @@ xfs_btree_insrec( goto error0; } - /* - * If we are tracking the last record in the tree and - * we are at the far right edge of the tree, update it. - */ - if (xfs_btree_is_lastrec(cur, block, level)) { - cur->bc_ops->update_lastrec(cur, block, rec, - ptr, LASTREC_INSREC); - } - /* * Return the new block number, if any. * If there is one, give back a record value and a cursor too. @@ -4152,15 +4110,6 @@ xfs_btree_delrec( xfs_btree_set_numrecs(block, --numrecs); xfs_btree_log_block(cur, bp, XFS_BB_NUMRECS); - /* - * If we are tracking the last record in the tree and - * we are at the far right edge of the tree, update it. - */ - if (xfs_btree_is_lastrec(cur, block, level)) { - cur->bc_ops->update_lastrec(cur, block, NULL, - ptr, LASTREC_DELREC); - } - /* * We're at the root level. First, shrink the root block in-memory. * Try to get rid of the next level down. If we can't then there's diff --git a/fs/xfs/libxfs/xfs_btree.h b/fs/xfs/libxfs/xfs_btree.h index f93374278aa1..99c1e4a02556 100644 --- a/fs/xfs/libxfs/xfs_btree.h +++ b/fs/xfs/libxfs/xfs_btree.h @@ -154,12 +154,6 @@ struct xfs_btree_ops { int *stat); int (*free_block)(struct xfs_btree_cur *cur, struct xfs_buf *bp); - /* update last record information */ - void (*update_lastrec)(struct xfs_btree_cur *cur, - const struct xfs_btree_block *block, - const union xfs_btree_rec *rec, - int ptr, int reason); - /* records in block/level */ int (*get_minrecs)(struct xfs_btree_cur *cur, int level); int (*get_maxrecs)(struct xfs_btree_cur *cur, int level); ^ permalink raw reply related [flat|nested] 8+ messages in thread
* Re: [PATCH] xfs: Fix file creation failure 2024-06-05 6:40 ` Dave Chinner @ 2024-06-05 8:46 ` Zizhi Wo 0 siblings, 0 replies; 8+ messages in thread From: Zizhi Wo @ 2024-06-05 8:46 UTC (permalink / raw) To: Dave Chinner Cc: chandan.babu, djwong, dchinner, linux-xfs, linux-kernel, yangerkun 在 2024/6/5 14:40, Dave Chinner 写道: > On Wed, Jun 05, 2024 at 10:51:31AM +0800, Zizhi Wo wrote: >> 在 2024/6/5 7:00, Dave Chinner 写道: >>> On Tue, Jun 04, 2024 at 03:11:21PM +0800, Zizhi Wo wrote: >>>> We have an xfs image that contains only 2 AGs, the first AG is full and >>>> the second AG is empty, then a concurrent file creation and little writing >>>> could unexpectedly return -ENOSPC error since there is a race window that >>>> the allocator could get the wrong agf->agf_longest. > ..... Yeah...I know it is amazing... >>>> diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c >>>> index 6c55a6e88eba..86ba873d57a8 100644 >>>> --- a/fs/xfs/libxfs/xfs_alloc.c >>>> +++ b/fs/xfs/libxfs/xfs_alloc.c >>>> @@ -577,6 +577,13 @@ xfs_alloc_fixup_trees( >>>> nfbno2 = rbno + rlen; >>>> nflen2 = (fbno + flen) - nfbno2; >>>> } >>>> + >>>> + /* >>>> + * Record the potential maximum free length in advance. >>>> + */ >>>> + if (nfbno1 != NULLAGBLOCK || nfbno2 != NULLAGBLOCK) >>>> + cnt_cur->bc_ag.bc_free_longest = XFS_EXTLEN_MAX(nflen1, nflen2); >>>> + >>> >>> Why do we store the length of a random extent being freed here? >>> nflen1/2 almost always have nothing to do with the longest free >>> space extent in the tree, they are just the new free space extents >>> we are insering into a random location in the free space tree. >>> >> >> First of all, there may be ambiguity in the name of the bc_free_longest >> field. I'm sorry for that. Its only role is to give the longest non-0 in >> a particular scenario. >> >> Yes, nflen1/2 can't determine the subsequent operation, but they are >> used to update the longest record only if the numrec in cntbt is zero, >> the last has been deleted and a new record will be added soon (that is, >> there is still space left on the file system), and that is their only >> function. So at this time nflen1/2 are not random extent, they indicate >> the maximum value to be inserted later. If cntbt does not need to be >> updated longest or the numrec is not zero, then bc_free_longest will not >> be used to update the longest. > > That's the comment you should have put in the code. > > Comments need to explain -why- the code exists, not tell us -what- > the code does. We know the latter from reading the code, we cannot > know the former from reading the code... > I am sorry for the trouble caused by my comments. I have indeed left out a lot of details, and I will correct it next time. /(ㄒoㄒ)/~~ >>>> /* >>>> * Delete the entry from the by-size btree. >>>> */ >>>> @@ -2044,6 +2051,13 @@ xfs_free_ag_extent( >>>> * Now allocate and initialize a cursor for the by-size tree. >>>> */ >>>> cnt_cur = xfs_cntbt_init_cursor(mp, tp, agbp, pag); >>>> + /* >>>> + * Record the potential maximum free length in advance. >>>> + */ >>>> + if (haveleft) >>>> + cnt_cur->bc_ag.bc_free_longest = ltlen; >>>> + if (haveright) >>>> + cnt_cur->bc_ag.bc_free_longest = gtlen; >>> >>> That doesn't look correct. At this point in the code, ltlen/gtlen >>> are the sizes of the physically adjacent freespace extents that we >>> are going to merge the newly freed extent with. i.e. the new >>> freespace extent is going to have one of 4 possible values: >>> >>> no merge: len >>> left merge: ltlen + len >>> right merge: gtlen + len >>> both merge: ltlen + gtlen + len >>> >>> So regardless of anything else, this code isn't setting the longest >>> freespace extent in teh AGF to the lenght of the longest freespace >>> extent in the filesystem. >> >>> Which leads me to ask: how did you test this code? This bug should >>> have been triggering verifier, repair and scrub failures quite >>> quickly with fstests.... >>> >> >> The logic I'm considering here is that the record is less than or equal >> to the maximum value that will be updated soon, as I wrote "potentially" >> in the comment. And consider the following two scenarios: >> 1) If it is no merge, then haveleft == 0 && haveright == 0, and >> bc_free_longest will not be assigned, and is no need to worry about the >> longest update at this time. >> 2) If it is in merge scenario, only updating the original values here, >> and the actual updates are put into the subsequent xfs_btree_insert(). >> There is no need to worry about atomicity, both are carried out in the >> same transaction. All we have to do is the longest non-0. As long as the >> fast path judgment without locking passes, the longest must be updated >> to the correct value during the second lock judgment. > > And therein lies the problem. We've learnt the had way not to do > partial state updates that might end up on disk even within > transactions. At some point in the future, we'll change the way the > code works, not realising that there's an inconsistent transient > state being logged, and some time after that we'll have occasional > reports of weird failures with values on disk or in the journal we > cannot explain. > >> I tested this part of the code, passed xfstests, and local validation >> found no problems. > > yeah, fstests won't expose the inconsistent state *right now*; the > problem is that we've just left a landmine for future developers to > step on. > > It's also really difficult to follow - you've added non-obvious > coupling between the free space btree updates and the AGF updates > via a field held in a btree cursor. This essentially means that all > this stuff has to occur within the context of a single btree cursor, > and that btree cursor cannot be re-used for further operations > because this field is not reset by things like new lookups. > > Then there is the question of what happens if we have duplicated the > cursor for a specific btree record operation, and the field is set > in the duplicated cursor. The core btree operations do this in > several places because they want to retain a cursor to the original > poistion, but the specific operation that needs to be performed will > change the cursor position (e.g. shifts, splits, merges, etc). Hence > there's no guarantee that a single cursor is used for all the > operations in a single btree modification, and hence storing > information in cursors that is supposed to persist until some other > btree modification method is run is asking for trouble. > > Hence, IMO, coupling allocation btree coherency operations via the > btree cursor is something we should avoid at all costs. That's why I > keep saying the last record update for the by-count/AGF longest > needs to be moved outside the generic btree code itself. The > coherency and coupling needs to be directly controlled by the high > level alloc code, not by trying to punch special data holes through > the generic btree abstractions. > Oh, I did not consider the problems you pointed out above, and the previous revision should be avoided. I fully agree with your opinion. >>>> /* >>>> * Have both left and right contiguous neighbors. >>>> * Merge all three into a single free block. >>>> diff --git a/fs/xfs/libxfs/xfs_alloc_btree.c b/fs/xfs/libxfs/xfs_alloc_btree.c >>>> index 6ef5ddd89600..8e7d1e0f1a63 100644 >>>> --- a/fs/xfs/libxfs/xfs_alloc_btree.c >>>> +++ b/fs/xfs/libxfs/xfs_alloc_btree.c >>>> @@ -161,7 +161,14 @@ xfs_allocbt_update_lastrec( >>>> rrp = XFS_ALLOC_REC_ADDR(cur->bc_mp, block, numrecs); >>>> len = rrp->ar_blockcount; >>>> } else { >>>> - len = 0; >>>> + /* >>>> + * Update in advance to prevent file creation failure >>>> + * for concurrent processes even though there is no >>>> + * numrec currently. >>>> + * And there's no need to worry as the value that no >>>> + * less than bc_free_longest will be inserted later. >>>> + */ >>>> + len = cpu_to_be32(cur->bc_ag.bc_free_longest); >>>> } >>> >>> So this is in the LASTREC_DELREC case when the last record is >>> removed from the btree. This is what causes the transient state >>> as we do this when deleting a record to trim it and then re-insert >>> the remainder back into the by-count btree. >>> >>> Writing some random transient value into the AGF *and journalling >>> it* means we creating a transient on-disk format structure >>> corruption and potentially writing it to persistent storage (i.e. >>> the journal). The structure is, at least, not consistent in memory >>> because the free space tree is empty at this point in time, whilst >>> the agf->longest field says it has a free space available. This >>> could trip verifiers, be flagged as corruption by xfs_scrub/repair, >>> etc. >>> >> >> I'm sorry, but I didn't find the problem during my own screening. In my >> opinion, because the trigger scenario for the current problem is only to >> delete the last node and be updated shortly, and bc_free_longest is used >> only in the following two scenarios: >> 1) cntbt has only one extent and remains after being used, so nflen 1/2 >> will be inserted later. >> 2) cntbt has only one extent and the released extent is adjacent to this >> record. This unique record will be deleted firstly, and then the two >> extents are merged and inserted. > > Yes, I understand what you've done. > > But I don't think it is the right way to fix the issue for the > reasons I've given. > > I've attached a quick hack (not even compile tested!) to > demonstrate the way I've been suggesting this issue should be fixed > by the removal of the lastrec update code from the generic > btree implementation. It updates pag->pagf_longest and > agf->longest directly from the high level allocation code once the > free space btree manipulations have been completed. We do this in a > context where we control AGF, the perag and the btree cursors > directly, and there is no need to temporarily store anything in the > btree cursors at all. > > Feel free to use this code as the basis of future patches to address > this issue. > > -Dave. That's amazing! Thanks!! I will read the code carefully and submit the correct fix for this issue again soon. Thanks, Zizhi Wo ^ permalink raw reply [flat|nested] 8+ messages in thread
end of thread, other threads:[~2024-06-05 8:46 UTC | newest] Thread overview: 8+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2024-06-04 7:11 [PATCH] xfs: Fix file creation failure Zizhi Wo 2024-06-04 15:56 ` Darrick J. Wong 2024-06-05 3:38 ` Zizhi Wo 2024-06-04 23:00 ` Dave Chinner 2024-06-04 23:53 ` Darrick J. Wong 2024-06-05 2:51 ` Zizhi Wo 2024-06-05 6:40 ` Dave Chinner 2024-06-05 8:46 ` Zizhi Wo
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox