* [RFC PATCH 0/4] bringing back the AGFL reserve
@ 2024-06-13 20:27 Krister Johansen
2024-06-13 20:27 ` [RFC PATCH 1/4] xfs: resurrect the AGFL reservation Krister Johansen
` (4 more replies)
0 siblings, 5 replies; 14+ messages in thread
From: Krister Johansen @ 2024-06-13 20:27 UTC (permalink / raw)
To: Chandan Babu R, Darrick J. Wong, Dave Chinner; +Cc: Gao Xiang, linux-xfs
Hi,
One of the teams that I work with hits WARNs in
xfs_bmap_extents_to_btree() on a database workload of theirs. The last
time the subject came up on linux-xfs, it was suggested[1] to try
building an AG reserve pool for the AGFL.
I managed to work out a reproducer for the problem. Debugging that, the
steps Gao outlined turned out to be essentially what was necessary to
get the problem to happen repeatably.
1. Allocate almost all of the space in an AG
2. Free and reallocate that space to fragment it so the freespace
b-trees are just about to split.
3. Allocate blocks in a file such that the next extent allocated for
that file will cause its bmbt to get converted from an inline extent to
a b-tree.
4. Free space such that the free-space btrees have a contiguous extent
with a busy portion on either end
5. Allocate the portion in the middle, splitting the extent and
triggering a b-tree split.
On older kernels this is all it takes. After the AG-aware allocator
changes I also need to start the allocation in the highest numbered AG
available while inducing lock contention in the lower numbered AGs.
In order to ensure that AGs have enough space to complete transactions
with multiple allocations, I've taken a stab at implementing an AGFL
reserve pool.
This patchset passes fstests without any regressions and also does not
trigger the reproducers I wrote for the case above. I've also run those
with tracing enabled to validate that it's got the accounting correct,
is rejecting allocations when there's no space in the reserve, and is
tapping the reserve when appropriate.
The first patch is the plumbing that re-establishes the reserve for the
AGFL. I'm happy to break this into something smaller, if it's too
large. The remaining patches add additional pieces needed to check how
much space the AGFL might need on a refill, and then to actually use the
reserve to permit or deny allocation requests, as the case may be.
I'm sending this as an RFC, since I still have a few outstanding
questions and would appreciate feedback.
Some of those questions are:
Patch 1 includes all freespace that is not allocated to the rmapbt in
its used / reserved accounting. It also borrows the heuristics from
rmapbt in terms of picking the initial size of the reservation. The
numbers I'm getting seem a bit large. Any suggestions about how to
improve this further?
Patches 3 and 4 use the allocation args structure to attempt to decide
whether an allocation is the first in a transaction, or if it's a
subsequent allocation. Are there any recommendations about a better way
to do this?
Thanks,
-K
[1] https://lore.kernel.org/linux-xfs/20221116025106.GB3600936@dread.disaster.area/
Krister Johansen (4):
xfs: resurrect the AGFL reservation
xfs: modify xfs_alloc_min_freelist to take an increment
xfs: let allocations tap the AGFL reserve
xfs: refuse allocations without agfl refill space
fs/xfs/libxfs/xfs_ag.h | 2 +
fs/xfs/libxfs/xfs_ag_resv.c | 54 ++++++++++++++-----
fs/xfs/libxfs/xfs_ag_resv.h | 4 ++
fs/xfs/libxfs/xfs_alloc.c | 94 +++++++++++++++++++++++++++++----
fs/xfs/libxfs/xfs_alloc.h | 5 +-
fs/xfs/libxfs/xfs_alloc_btree.c | 59 +++++++++++++++++++++
fs/xfs/libxfs/xfs_alloc_btree.h | 5 ++
fs/xfs/libxfs/xfs_bmap.c | 2 +-
fs/xfs/libxfs/xfs_ialloc.c | 2 +-
fs/xfs/libxfs/xfs_rmap_btree.c | 5 ++
fs/xfs/scrub/fscounters.c | 1 +
11 files changed, 207 insertions(+), 26 deletions(-)
base-commit: 58f880711f2ba53fd5e959875aff5b3bf6d5c32e
--
2.25.1
^ permalink raw reply [flat|nested] 14+ messages in thread
* [RFC PATCH 1/4] xfs: resurrect the AGFL reservation
2024-06-13 20:27 [RFC PATCH 0/4] bringing back the AGFL reserve Krister Johansen
@ 2024-06-13 20:27 ` Krister Johansen
2024-06-14 2:59 ` Dave Chinner
2024-06-13 20:27 ` [RFC PATCH 2/4] xfs: modify xfs_alloc_min_freelist to take an increment Krister Johansen
` (3 subsequent siblings)
4 siblings, 1 reply; 14+ messages in thread
From: Krister Johansen @ 2024-06-13 20:27 UTC (permalink / raw)
To: Chandan Babu R, Darrick J. Wong, Dave Chinner; +Cc: Gao Xiang, linux-xfs
Transactions that perform multiple allocations may inadvertently run out
of space after the first allocation selects an AG that appears to have
enough available space. The problem occurs when the first allocation in
the transaction splits freespace b-trees but the second allocation does
not have enough available space to refill the AGFL. This results in an
aborted transaction and a filesystem shutdown. In this author's case,
it's frequently encountered in the xfs_bmap_extents_to_btree path on a
write to an AG that's almost reached its limits.
The AGFL reservation allows us to save some blocks to refill the AGFL to
its minimum level in an Nth allocation, and to prevent allocations from
proceeding when there's not enough reserved space to accommodate the
refill.
This patch just brings back the reservation and does the plumbing. The
policy decisions about which allocations to allow will be in a
subsequent patch.
This implementation includes space for the bnobt and cnobt in the
reserve. This was done largely because the AGFL reserve stubs appeared
to already be doing it this way.
Signed-off-by: Krister Johansen <kjlx@templeofstupid.com>
---
fs/xfs/libxfs/xfs_ag.h | 2 ++
fs/xfs/libxfs/xfs_ag_resv.c | 54 ++++++++++++++++++++++--------
fs/xfs/libxfs/xfs_ag_resv.h | 4 +++
fs/xfs/libxfs/xfs_alloc.c | 43 +++++++++++++++++++++++-
fs/xfs/libxfs/xfs_alloc.h | 3 +-
fs/xfs/libxfs/xfs_alloc_btree.c | 59 +++++++++++++++++++++++++++++++++
fs/xfs/libxfs/xfs_alloc_btree.h | 5 +++
fs/xfs/libxfs/xfs_rmap_btree.c | 5 +++
fs/xfs/scrub/fscounters.c | 1 +
9 files changed, 161 insertions(+), 15 deletions(-)
diff --git a/fs/xfs/libxfs/xfs_ag.h b/fs/xfs/libxfs/xfs_ag.h
index 35de09a2516c..40bff82f2b7e 100644
--- a/fs/xfs/libxfs/xfs_ag.h
+++ b/fs/xfs/libxfs/xfs_ag.h
@@ -62,6 +62,8 @@ struct xfs_perag {
struct xfs_ag_resv pag_meta_resv;
/* Blocks reserved for the reverse mapping btree. */
struct xfs_ag_resv pag_rmapbt_resv;
+ /* Blocks reserved for the AGFL. */
+ struct xfs_ag_resv pag_agfl_resv;
/* for rcu-safe freeing */
struct rcu_head rcu_head;
diff --git a/fs/xfs/libxfs/xfs_ag_resv.c b/fs/xfs/libxfs/xfs_ag_resv.c
index 216423df939e..db1d416f6ac8 100644
--- a/fs/xfs/libxfs/xfs_ag_resv.c
+++ b/fs/xfs/libxfs/xfs_ag_resv.c
@@ -17,6 +17,7 @@
#include "xfs_trans.h"
#include "xfs_rmap_btree.h"
#include "xfs_btree.h"
+#include "xfs_alloc_btree.h"
#include "xfs_refcount_btree.h"
#include "xfs_ialloc_btree.h"
#include "xfs_ag.h"
@@ -75,12 +76,14 @@ xfs_ag_resv_critical(
switch (type) {
case XFS_AG_RESV_METADATA:
- avail = pag->pagf_freeblks - pag->pag_rmapbt_resv.ar_reserved;
+ avail = pag->pagf_freeblks - pag->pag_rmapbt_resv.ar_reserved -
+ pag->pag_agfl_resv.ar_reserved;
orig = pag->pag_meta_resv.ar_asked;
break;
case XFS_AG_RESV_RMAPBT:
avail = pag->pagf_freeblks + pag->pagf_flcount -
- pag->pag_meta_resv.ar_reserved;
+ pag->pag_meta_resv.ar_reserved -
+ pag->pag_agfl_resv.ar_reserved;
orig = pag->pag_rmapbt_resv.ar_asked;
break;
default:
@@ -107,10 +110,14 @@ xfs_ag_resv_needed(
{
xfs_extlen_t len;
- len = pag->pag_meta_resv.ar_reserved + pag->pag_rmapbt_resv.ar_reserved;
+ len = pag->pag_meta_resv.ar_reserved +
+ pag->pag_rmapbt_resv.ar_reserved +
+ pag->pag_agfl_resv.ar_reserved;
+
switch (type) {
case XFS_AG_RESV_METADATA:
case XFS_AG_RESV_RMAPBT:
+ case XFS_AG_RESV_AGFL:
len -= xfs_perag_resv(pag, type)->ar_reserved;
break;
case XFS_AG_RESV_NONE:
@@ -144,7 +151,7 @@ __xfs_ag_resv_free(
* considered "free", so whatever was reserved at mount time must be
* given back at umount.
*/
- if (type == XFS_AG_RESV_RMAPBT)
+ if (type == XFS_AG_RESV_RMAPBT || type == XFS_AG_RESV_AGFL)
oldresv = resv->ar_orig_reserved;
else
oldresv = resv->ar_reserved;
@@ -161,6 +168,7 @@ xfs_ag_resv_free(
{
__xfs_ag_resv_free(pag, XFS_AG_RESV_RMAPBT);
__xfs_ag_resv_free(pag, XFS_AG_RESV_METADATA);
+ __xfs_ag_resv_free(pag, XFS_AG_RESV_AGFL);
}
static int
@@ -180,11 +188,13 @@ __xfs_ag_resv_init(
switch (type) {
case XFS_AG_RESV_RMAPBT:
+ case XFS_AG_RESV_AGFL:
/*
- * Space taken by the rmapbt is not subtracted from fdblocks
- * because the rmapbt lives in the free space. Here we must
- * subtract the entire reservation from fdblocks so that we
- * always have blocks available for rmapbt expansion.
+ * Space taken by the rmapbt and agfl are not subtracted from
+ * fdblocks because they both live in the free space. Here we
+ * must subtract the entire reservation from fdblocks so that we
+ * always have blocks available for rmapbt expansion and agfl
+ * refilling.
*/
hidden_space = ask;
break;
@@ -299,6 +309,25 @@ xfs_ag_resv_init(
has_resv = true;
}
+ /* Create the AGFL reservation */
+ if (pag->pag_agfl_resv.ar_asked == 0) {
+ ask = used = 0;
+
+ error = xfs_allocbt_calc_reserves(mp, tp, pag, &ask, &used);
+ if (error)
+ goto out;
+
+ error = xfs_alloc_agfl_calc_reserves(mp, tp, pag, &ask, &used);
+ if (error)
+ goto out;
+
+ error = __xfs_ag_resv_init(pag, XFS_AG_RESV_AGFL, ask, used);
+ if (error)
+ goto out;
+ if (ask)
+ has_resv = true;
+ }
+
out:
/*
* Initialize the pagf if we have at least one active reservation on the
@@ -324,7 +353,8 @@ xfs_ag_resv_init(
*/
if (!error &&
xfs_perag_resv(pag, XFS_AG_RESV_METADATA)->ar_reserved +
- xfs_perag_resv(pag, XFS_AG_RESV_RMAPBT)->ar_reserved >
+ xfs_perag_resv(pag, XFS_AG_RESV_RMAPBT)->ar_reserved +
+ xfs_perag_resv(pag, XFS_AG_RESV_AGFL)->ar_reserved >
pag->pagf_freeblks + pag->pagf_flcount)
error = -ENOSPC;
}
@@ -347,7 +377,6 @@ xfs_ag_resv_alloc_extent(
switch (type) {
case XFS_AG_RESV_AGFL:
- return;
case XFS_AG_RESV_METADATA:
case XFS_AG_RESV_RMAPBT:
resv = xfs_perag_resv(pag, type);
@@ -364,7 +393,7 @@ xfs_ag_resv_alloc_extent(
len = min_t(xfs_extlen_t, args->len, resv->ar_reserved);
resv->ar_reserved -= len;
- if (type == XFS_AG_RESV_RMAPBT)
+ if (type == XFS_AG_RESV_RMAPBT || type == XFS_AG_RESV_AGFL)
return;
/* Allocations of reserved blocks only need on-disk sb updates... */
xfs_trans_mod_sb(args->tp, XFS_TRANS_SB_RES_FDBLOCKS, -(int64_t)len);
@@ -389,7 +418,6 @@ xfs_ag_resv_free_extent(
switch (type) {
case XFS_AG_RESV_AGFL:
- return;
case XFS_AG_RESV_METADATA:
case XFS_AG_RESV_RMAPBT:
resv = xfs_perag_resv(pag, type);
@@ -406,7 +434,7 @@ xfs_ag_resv_free_extent(
leftover = min_t(xfs_extlen_t, len, resv->ar_asked - resv->ar_reserved);
resv->ar_reserved += leftover;
- if (type == XFS_AG_RESV_RMAPBT)
+ if (type == XFS_AG_RESV_RMAPBT || type == XFS_AG_RESV_AGFL)
return;
/* Freeing into the reserved pool only requires on-disk update... */
xfs_trans_mod_sb(tp, XFS_TRANS_SB_RES_FDBLOCKS, len);
diff --git a/fs/xfs/libxfs/xfs_ag_resv.h b/fs/xfs/libxfs/xfs_ag_resv.h
index ff20ed93de77..ea2c16dfb843 100644
--- a/fs/xfs/libxfs/xfs_ag_resv.h
+++ b/fs/xfs/libxfs/xfs_ag_resv.h
@@ -28,6 +28,8 @@ xfs_perag_resv(
return &pag->pag_meta_resv;
case XFS_AG_RESV_RMAPBT:
return &pag->pag_rmapbt_resv;
+ case XFS_AG_RESV_AGFL:
+ return &pag->pag_agfl_resv;
default:
return NULL;
}
@@ -48,6 +50,8 @@ xfs_ag_resv_rmapbt_alloc(
args.len = 1;
pag = xfs_perag_get(mp, agno);
+ /* Transfer this reservation from the AGFL to RMAPBT */
+ xfs_ag_resv_free_extent(pag, XFS_AG_RESV_AGFL, NULL, 1);
xfs_ag_resv_alloc_extent(pag, XFS_AG_RESV_RMAPBT, &args);
xfs_perag_put(pag);
}
diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c
index 6c55a6e88eba..d70d027a8178 100644
--- a/fs/xfs/libxfs/xfs_alloc.c
+++ b/fs/xfs/libxfs/xfs_alloc.c
@@ -1176,12 +1176,14 @@ xfs_alloc_ag_vextent_small(
/*
* If we're feeding an AGFL block to something that doesn't live in the
- * free space, we need to clear out the OWN_AG rmap.
+ * free space, we need to clear out the OWN_AG rmap and remove it from
+ * the AGFL reservation.
*/
error = xfs_rmap_free(args->tp, args->agbp, args->pag, fbno, 1,
&XFS_RMAP_OINFO_AG);
if (error)
goto error;
+ xfs_ag_resv_free_extent(args->pag, XFS_AG_RESV_AGFL, args->tp, 1);
*stat = 0;
return 0;
@@ -2778,6 +2780,43 @@ xfs_exact_minlen_extent_available(
}
#endif
+/*
+ * Work out how many blocks to reserve for the AGFL as well as how many are in
+ * use currently.
+ */
+int
+xfs_alloc_agfl_calc_reserves(
+ struct xfs_mount *mp,
+ struct xfs_trans *tp,
+ struct xfs_perag *pag,
+ xfs_extlen_t *ask,
+ xfs_extlen_t *used)
+{
+ struct xfs_buf *agbp;
+ struct xfs_agf *agf;
+ xfs_extlen_t agfl_blocks;
+ xfs_extlen_t list_len;
+ int error;
+
+ error = xfs_alloc_read_agf(pag, tp, 0, &agbp);
+ if (error)
+ return error;
+
+ agf = agbp->b_addr;
+ agfl_blocks = xfs_alloc_min_freelist(mp, NULL);
+ list_len = be32_to_cpu(agf->agf_flcount);
+ xfs_trans_brelse(tp, agbp);
+
+ /*
+ * Reserve enough space to refill AGFL to minimum fullness if btrees are
+ * at maximum height.
+ */
+ *ask += agfl_blocks;
+ *used += list_len;
+
+ return error;
+}
+
/*
* Decide whether to use this allocation group for this allocation.
* If so, fix up the btree freelist's size.
@@ -2944,6 +2983,8 @@ xfs_alloc_fix_freelist(
if (error)
goto out_agflbp_relse;
+ xfs_ag_resv_alloc_extent(targs.pag, targs.resv, &targs);
+
/*
* Put each allocated block on the list.
*/
diff --git a/fs/xfs/libxfs/xfs_alloc.h b/fs/xfs/libxfs/xfs_alloc.h
index 0b956f8b9d5a..8cbdfb62ac14 100644
--- a/fs/xfs/libxfs/xfs_alloc.h
+++ b/fs/xfs/libxfs/xfs_alloc.h
@@ -80,7 +80,8 @@ int xfs_alloc_get_freelist(struct xfs_perag *pag, struct xfs_trans *tp,
int xfs_alloc_put_freelist(struct xfs_perag *pag, struct xfs_trans *tp,
struct xfs_buf *agfbp, struct xfs_buf *agflbp,
xfs_agblock_t bno, int btreeblk);
-
+int xfs_alloc_agfl_calc_reserves(struct xfs_mount *mp, struct xfs_trans *tp,
+ struct xfs_perag *pag, xfs_extlen_t *ask, xfs_extlen_t *used);
/*
* Compute and fill in value of m_alloc_maxlevels.
*/
diff --git a/fs/xfs/libxfs/xfs_alloc_btree.c b/fs/xfs/libxfs/xfs_alloc_btree.c
index 6ef5ddd89600..9c20f85a459d 100644
--- a/fs/xfs/libxfs/xfs_alloc_btree.c
+++ b/fs/xfs/libxfs/xfs_alloc_btree.c
@@ -671,6 +671,65 @@ xfs_allocbt_calc_size(
return xfs_btree_calc_size(mp->m_alloc_mnr, len);
}
+/*
+ * Calculate the maximum alloc btree size. This is for a single allocbt.
+ * Callers wishing to compute both the size of the bnobt and cnobt must double
+ * this result.
+ */
+xfs_extlen_t
+xfs_allocbt_max_size(
+ struct xfs_mount *mp,
+ xfs_agblock_t agblocks)
+{
+
+ /* Don't proceed if uninitialized. Can happen in mkfs. */
+ if (mp->m_alloc_mxr[0] == 0)
+ return 0;
+
+ return xfs_allocbt_calc_size(mp, agblocks);
+}
+
+/*
+ * Work out how many blocks to reserve for the bnobt and the cnobt as well as
+ * how many blocks are in use by these trees.
+ */
+int
+xfs_allocbt_calc_reserves(
+ struct xfs_mount *mp,
+ struct xfs_trans *tp,
+ struct xfs_perag *pag,
+ xfs_extlen_t *ask,
+ xfs_extlen_t *used)
+{
+ struct xfs_buf *agbp;
+ struct xfs_agf *agf;
+ xfs_agblock_t agblocks;
+ xfs_extlen_t tree_len;
+ int error;
+
+ error = xfs_alloc_read_agf(pag, tp, 0, &agbp);
+ if (error)
+ return error;
+
+ agf = agbp->b_addr;
+ agblocks = be32_to_cpu(agf->agf_length);
+ tree_len = be32_to_cpu(agf->agf_btreeblks);
+ xfs_trans_brelse(tp, agbp);
+
+ /*
+ * The log is permanently allocated. The space it occupies will never be
+ * available for btree expansion. Pretend the space is not there.
+ */
+ if (xfs_ag_contains_log(mp, pag->pag_agno))
+ agblocks -= mp->m_sb.sb_logblocks;
+
+ /* Reserve 1% of the AG or enough for one block per record per tree. */
+ *ask += max(agblocks / 100, 2 * xfs_allocbt_max_size(mp, agblocks));
+ *used += tree_len;
+
+ return error;
+}
+
int __init
xfs_allocbt_init_cur_cache(void)
{
diff --git a/fs/xfs/libxfs/xfs_alloc_btree.h b/fs/xfs/libxfs/xfs_alloc_btree.h
index 155b47f231ab..8334195e2462 100644
--- a/fs/xfs/libxfs/xfs_alloc_btree.h
+++ b/fs/xfs/libxfs/xfs_alloc_btree.h
@@ -56,6 +56,11 @@ struct xfs_btree_cur *xfs_cntbt_init_cursor(struct xfs_mount *mp,
extern int xfs_allocbt_maxrecs(struct xfs_mount *, int, int);
extern xfs_extlen_t xfs_allocbt_calc_size(struct xfs_mount *mp,
unsigned long long len);
+extern xfs_extlen_t xfs_allocbt_max_size(struct xfs_mount *mp,
+ xfs_agblock_t agblocks);
+
+extern int xfs_allocbt_calc_reserves(struct xfs_mount *mp, struct xfs_trans *tp,
+ struct xfs_perag *pag, xfs_extlen_t *ask, xfs_extlen_t *used);
void xfs_allocbt_commit_staged_btree(struct xfs_btree_cur *cur,
struct xfs_trans *tp, struct xfs_buf *agbp);
diff --git a/fs/xfs/libxfs/xfs_rmap_btree.c b/fs/xfs/libxfs/xfs_rmap_btree.c
index 9e759efa81cc..49b1652f715a 100644
--- a/fs/xfs/libxfs/xfs_rmap_btree.c
+++ b/fs/xfs/libxfs/xfs_rmap_btree.c
@@ -121,6 +121,7 @@ xfs_rmapbt_free_block(
struct xfs_buf *agbp = cur->bc_ag.agbp;
struct xfs_agf *agf = agbp->b_addr;
struct xfs_perag *pag = cur->bc_ag.pag;
+ struct xfs_alloc_arg args = { NULL };
xfs_agblock_t bno;
int error;
@@ -135,6 +136,10 @@ xfs_rmapbt_free_block(
XFS_EXTENT_BUSY_SKIP_DISCARD);
xfs_ag_resv_free_extent(pag, XFS_AG_RESV_RMAPBT, NULL, 1);
+ args.len = 1;
+ /* Transfer this reservation back to the AGFL. */
+ xfs_ag_resv_alloc_extent(pag, XFS_AG_RESV_AGFL, &args);
+
return 0;
}
diff --git a/fs/xfs/scrub/fscounters.c b/fs/xfs/scrub/fscounters.c
index 1d3e98346933..fec4aa13052a 100644
--- a/fs/xfs/scrub/fscounters.c
+++ b/fs/xfs/scrub/fscounters.c
@@ -338,6 +338,7 @@ xchk_fscount_aggregate_agcounts(
*/
fsc->fdblocks -= pag->pag_meta_resv.ar_reserved;
fsc->fdblocks -= pag->pag_rmapbt_resv.ar_orig_reserved;
+ fsc->fdblocks -= pag->pag_agfl_resv.ar_orig_reserved;
}
if (pag)
--
2.25.1
* [RFC PATCH 2/4] xfs: modify xfs_alloc_min_freelist to take an increment
2024-06-13 20:27 [RFC PATCH 0/4] bringing back the AGFL reserve Krister Johansen
2024-06-13 20:27 ` [RFC PATCH 1/4] xfs: resurrect the AGFL reservation Krister Johansen
@ 2024-06-13 20:27 ` Krister Johansen
2024-06-13 20:27 ` [RFC PATCH 3/4] xfs: let allocations tap the AGFL reserve Krister Johansen
` (2 subsequent siblings)
4 siblings, 0 replies; 14+ messages in thread
From: Krister Johansen @ 2024-06-13 20:27 UTC (permalink / raw)
To: Chandan Babu R, Darrick J. Wong, Dave Chinner; +Cc: Gao Xiang, linux-xfs
xfs_alloc_min_freelist is used to determine the minimum size for the
AGFL based upon the current height of the bnobt, cnobt, and rmapbt.
To determine how much space the AGFL reserve needs in order to permit a
transaction that may perform multiple allocations, it is necessary to
also know the minimum size of the AGFL after the trees are split and
their levels have incremented.
Let xfs_alloc_min_freelist take an increment so that callers may request
the minimum AGFL size based upon current + N.
This patch has no functional change. A subsequent patch will bring new
users.
Signed-off-by: Krister Johansen <kjlx@templeofstupid.com>
---
fs/xfs/libxfs/xfs_alloc.c | 21 +++++++++++++--------
fs/xfs/libxfs/xfs_alloc.h | 2 +-
fs/xfs/libxfs/xfs_bmap.c | 2 +-
fs/xfs/libxfs/xfs_ialloc.c | 2 +-
4 files changed, 16 insertions(+), 11 deletions(-)
diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c
index d70d027a8178..0b15414468cf 100644
--- a/fs/xfs/libxfs/xfs_alloc.c
+++ b/fs/xfs/libxfs/xfs_alloc.c
@@ -2325,17 +2325,22 @@ xfs_alloc_longest_free_extent(
/*
* Compute the minimum length of the AGFL in the given AG. If @pag is NULL,
- * return the largest possible minimum length.
+ * return the largest possible minimum length. If @level_inc is greater than
+ * zero, increment the level being computed by cur + level_inc.
*/
unsigned int
xfs_alloc_min_freelist(
struct xfs_mount *mp,
- struct xfs_perag *pag)
+ struct xfs_perag *pag,
+ unsigned int level_inc)
{
/* AG btrees have at least 1 level. */
- const unsigned int bno_level = pag ? pag->pagf_bno_level : 1;
- const unsigned int cnt_level = pag ? pag->pagf_cnt_level : 1;
- const unsigned int rmap_level = pag ? pag->pagf_rmap_level : 1;
+ const unsigned int bno_level =
+ pag ? pag->pagf_bno_level + level_inc : 1;
+ const unsigned int cnt_level =
+ pag ? pag->pagf_cnt_level + level_inc : 1;
+ const unsigned int rmap_level =
+ pag ? pag->pagf_rmap_level + level_inc : 1;
unsigned int min_free;
ASSERT(mp->m_alloc_maxlevels > 0);
@@ -2803,7 +2808,7 @@ xfs_alloc_agfl_calc_reserves(
return error;
agf = agbp->b_addr;
- agfl_blocks = xfs_alloc_min_freelist(mp, NULL);
+ agfl_blocks = xfs_alloc_min_freelist(mp, NULL, 0);
list_len = be32_to_cpu(agf->agf_flcount);
xfs_trans_brelse(tp, agbp);
@@ -2861,7 +2866,7 @@ xfs_alloc_fix_freelist(
goto out_agbp_relse;
}
- need = xfs_alloc_min_freelist(mp, pag);
+ need = xfs_alloc_min_freelist(mp, pag, 0);
if (!xfs_alloc_space_available(args, need, alloc_flags |
XFS_ALLOC_FLAG_CHECK))
goto out_agbp_relse;
@@ -2885,7 +2890,7 @@ xfs_alloc_fix_freelist(
xfs_agfl_reset(tp, agbp, pag);
/* If there isn't enough total space or single-extent, reject it. */
- need = xfs_alloc_min_freelist(mp, pag);
+ need = xfs_alloc_min_freelist(mp, pag, 0);
if (!xfs_alloc_space_available(args, need, alloc_flags))
goto out_agbp_relse;
diff --git a/fs/xfs/libxfs/xfs_alloc.h b/fs/xfs/libxfs/xfs_alloc.h
index 8cbdfb62ac14..77347d69f797 100644
--- a/fs/xfs/libxfs/xfs_alloc.h
+++ b/fs/xfs/libxfs/xfs_alloc.h
@@ -74,7 +74,7 @@ unsigned int xfs_alloc_ag_max_usable(struct xfs_mount *mp);
xfs_extlen_t xfs_alloc_longest_free_extent(struct xfs_perag *pag,
xfs_extlen_t need, xfs_extlen_t reserved);
unsigned int xfs_alloc_min_freelist(struct xfs_mount *mp,
- struct xfs_perag *pag);
+ struct xfs_perag *pag, unsigned int level_inc);
int xfs_alloc_get_freelist(struct xfs_perag *pag, struct xfs_trans *tp,
struct xfs_buf *agfbp, xfs_agblock_t *bnop, int btreeblk);
int xfs_alloc_put_freelist(struct xfs_perag *pag, struct xfs_trans *tp,
diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
index c101cf266bc4..742ec4142fb0 100644
--- a/fs/xfs/libxfs/xfs_bmap.c
+++ b/fs/xfs/libxfs/xfs_bmap.c
@@ -3270,7 +3270,7 @@ xfs_bmap_longest_free_extent(
}
longest = xfs_alloc_longest_free_extent(pag,
- xfs_alloc_min_freelist(pag->pag_mount, pag),
+ xfs_alloc_min_freelist(pag->pag_mount, pag, 0),
xfs_ag_resv_needed(pag, XFS_AG_RESV_NONE));
if (*blen < longest)
*blen = longest;
diff --git a/fs/xfs/libxfs/xfs_ialloc.c b/fs/xfs/libxfs/xfs_ialloc.c
index 14c81f227c5b..30690b7965af 100644
--- a/fs/xfs/libxfs/xfs_ialloc.c
+++ b/fs/xfs/libxfs/xfs_ialloc.c
@@ -3047,7 +3047,7 @@ xfs_ialloc_calc_rootino(
first_bno += 1;
/* ...the initial AGFL... */
- first_bno += xfs_alloc_min_freelist(mp, NULL);
+ first_bno += xfs_alloc_min_freelist(mp, NULL, 0);
/* ...the free inode btree root... */
if (xfs_has_finobt(mp))
--
2.25.1
* [RFC PATCH 3/4] xfs: let allocations tap the AGFL reserve
2024-06-13 20:27 [RFC PATCH 0/4] bringing back the AGFL reserve Krister Johansen
2024-06-13 20:27 ` [RFC PATCH 1/4] xfs: resurrect the AGFL reservation Krister Johansen
2024-06-13 20:27 ` [RFC PATCH 2/4] xfs: modify xfs_alloc_min_freelist to take an increment Krister Johansen
@ 2024-06-13 20:27 ` Krister Johansen
2024-06-13 20:27 ` [RFC PATCH 4/4] xfs: refuse allocations without agfl refill space Krister Johansen
2024-06-14 0:45 ` [RFC PATCH 0/4] bringing back the AGFL reserve Dave Chinner
4 siblings, 0 replies; 14+ messages in thread
From: Krister Johansen @ 2024-06-13 20:27 UTC (permalink / raw)
To: Chandan Babu R, Darrick J. Wong, Dave Chinner; +Cc: Gao Xiang, linux-xfs
When a transaction makes an allocation after its initial allocation,
allow that transaction to tap the AGFL reserve.
This is done in case a prior allocation depleted the freelists below
their minimum fill and reserve space is needed to finish the
transaction. This is predicated on the earlier phases of the allocation
enforcing that the AGFL reserve has sufficient space for the transaction
to refill the AGFL. That enforcement is in a separate commit.
Signed-off-by: Krister Johansen <kjlx@templeofstupid.com>
---
fs/xfs/libxfs/xfs_alloc.c | 15 ++++++++++++++-
1 file changed, 14 insertions(+), 1 deletion(-)
diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c
index 0b15414468cf..3fc8448e02d9 100644
--- a/fs/xfs/libxfs/xfs_alloc.c
+++ b/fs/xfs/libxfs/xfs_alloc.c
@@ -2389,6 +2389,7 @@ xfs_alloc_space_available(
int flags)
{
struct xfs_perag *pag = args->pag;
+ enum xfs_ag_resv_type resv_type = args->resv;
xfs_extlen_t alloc_len, longest;
xfs_extlen_t reservation; /* blocks that are still reserved */
int available;
@@ -2397,7 +2398,19 @@ xfs_alloc_space_available(
if (flags & XFS_ALLOC_FLAG_FREEING)
return true;
- reservation = xfs_ag_resv_needed(pag, args->resv);
+ /*
+ * If this allocation is a subsequent allocation in a transaction with
+ * multiple allocations, allow it to tap the AGFL reserve in order to
+ * ensure that it can refill its freelist if a prior transaction has
+ * depleted it. This is done to ensure that the available space checks
+ * below don't stop an allocation that's up against the reservation
+ * space limits when there is AGFL space still reserved and available.
+ */
+ if (args->tp->t_highest_agno != NULLAGNUMBER &&
+ args->resv == XFS_AG_RESV_NONE) {
+ resv_type = XFS_AG_RESV_AGFL;
+ }
+ reservation = xfs_ag_resv_needed(pag, resv_type);
/* do we have enough contiguous free space for the allocation? */
alloc_len = args->minlen + (args->alignment - 1) + args->minalignslop;
--
2.25.1
* [RFC PATCH 4/4] xfs: refuse allocations without agfl refill space
2024-06-13 20:27 [RFC PATCH 0/4] bringing back the AGFL reserve Krister Johansen
` (2 preceding siblings ...)
2024-06-13 20:27 ` [RFC PATCH 3/4] xfs: let allocations tap the AGFL reserve Krister Johansen
@ 2024-06-13 20:27 ` Krister Johansen
2024-06-14 0:45 ` [RFC PATCH 0/4] bringing back the AGFL reserve Dave Chinner
4 siblings, 0 replies; 14+ messages in thread
From: Krister Johansen @ 2024-06-13 20:27 UTC (permalink / raw)
To: Chandan Babu R, Darrick J. Wong, Dave Chinner; +Cc: Gao Xiang, linux-xfs
Ensure that an allocation that may be part of a multiple-allocation
transaction has enough space in the AGFL reserve to refill the AGFL in a
subsequent transaction. The AGFL reserve was established to guarantee
that multiple-allocation transactions have space available to refill
the AGFL when the AG is close to ENOSPC.
Check an allocation against this reserve and refuse to let it proceed if
the AGFL reserve cannot meet the needs of a subsequent allocation.
Without this, the second transaction may ENOSPC if the first uses all of
the AGFL blocks and the AG is close enough to the limits that it cannot
refill.
Signed-off-by: Krister Johansen <kjlx@templeofstupid.com>
---
fs/xfs/libxfs/xfs_alloc.c | 17 +++++++++++++++++
1 file changed, 17 insertions(+)
diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c
index 3fc8448e02d9..7b0302f7e960 100644
--- a/fs/xfs/libxfs/xfs_alloc.c
+++ b/fs/xfs/libxfs/xfs_alloc.c
@@ -2389,11 +2389,13 @@ xfs_alloc_space_available(
int flags)
{
struct xfs_perag *pag = args->pag;
+ struct xfs_ag_resv *resv;
enum xfs_ag_resv_type resv_type = args->resv;
xfs_extlen_t alloc_len, longest;
xfs_extlen_t reservation; /* blocks that are still reserved */
int available;
xfs_extlen_t agflcount;
+ unsigned int agfl_refill;
if (flags & XFS_ALLOC_FLAG_FREEING)
return true;
@@ -2429,6 +2431,21 @@ xfs_alloc_space_available(
if (available < (int)max(args->total, alloc_len))
return false;
+ /*
+ * If this is the first allocation in a transaction that may perform
+ * multiple allocations, check the AGFL reserve to see if it contains
+ * enough blocks to refill the AGFL if freespace b-trees split as part
+ * of the first allocation. This is done to ensure that subsequent
+ * allocations can utilize the reserve space instead of running out and
+ * triggering a shutdown.
+ */
+ if (args->tp->t_highest_agno == NULLAGNUMBER && args->minleft > 0) {
+ agfl_refill = xfs_alloc_min_freelist(args->mp, pag, 1);
+ resv = xfs_perag_resv(pag, XFS_AG_RESV_AGFL);
+ if (resv->ar_asked > 0 && agfl_refill > resv->ar_reserved)
+ return false;
+ }
+
/*
* Clamp maxlen to the amount of free space available for the actual
* extent allocation.
--
2.25.1
* Re: [RFC PATCH 0/4] bringing back the AGFL reserve
2024-06-13 20:27 [RFC PATCH 0/4] bringing back the AGFL reserve Krister Johansen
` (3 preceding siblings ...)
2024-06-13 20:27 ` [RFC PATCH 4/4] xfs: refuse allocations without agfl refill space Krister Johansen
@ 2024-06-14 0:45 ` Dave Chinner
2024-06-17 22:25 ` Krister Johansen
4 siblings, 1 reply; 14+ messages in thread
From: Dave Chinner @ 2024-06-14 0:45 UTC (permalink / raw)
To: Krister Johansen
Cc: Chandan Babu R, Darrick J. Wong, Dave Chinner, Gao Xiang,
linux-xfs
On Thu, Jun 13, 2024 at 01:27:09PM -0700, Krister Johansen wrote:
> Hi,
> One of the teams that I work with hits WARNs in
> xfs_bmap_extents_to_btree() on a database workload of theirs. The last
> time the subject came up on linux-xfs, it was suggested[1] to try
> building an AG reserve pool for the AGFL.
>
> I managed to work out a reproducer for the problem. Debugging that, the
> steps Gao outlined turned out to be essentially what was necessary to
> get the problem to happen repeatably.
>
> 1. Allocate almost all of the space in an AG
> 2. Free and reallocate that space to fragment it so the freespace
> b-trees are just about to split.
> 3. Allocate blocks in a file such that the next extent allocated for
> that file will cause its bmbt to get converted from an inline extent to
> a b-tree.
> 4. Free space such that the free-space btrees have a contiguous extent
> with a busy portion on either end
> 5. Allocate the portion in the middle, splitting the extent and
> triggering a b-tree split.
Do you have a script that sets up this precondition reliably?
It sounds like it can be done from a known filesystem config. If you
do have a script, can you share it? Or maybe even better, turn it
into an fstest?
> On older kernels this is all it takes. After the AG-aware allocator
> changes I also need to start the allocation in the highest numbered AG
> available while inducing lock contention in the lower numbered AGs.
Ah, so you have to perform a DoS on the lower AGFs so that the
attempts made by xfs_alloc_vextent_start_ag() to trylock the
lower AGFs once it finds it cannot allocate in the highest AG
anymore also fail.
That was one of the changes made in the perag aware allocator
rework; it added full-range AG iteration when XFS_ALLOC_FLAG_TRYLOCK
is set because we can't deadlock on reverse order AGF locking when
using trylocks.
However, if the trylock iteration fails, it then sets the restart AG
to the minimum AG be can wait for without deadlocking, removes the
trylock and restarts the iteration. Hence you've had to create AGF
lock contention to force the allocator back to being restricted by
the AGF locking orders.
Is this new behaviour sufficient to mitigate the problem being seen
with this database workload? Has it been tested with kernels that
have those changes, and if so did it have any impact on the
frequency of the issue occurring?
> In order to ensure that AGs have enough space to complete transactions
> with multiple allocations, I've taken a stab at implementing an AGFL
> reserve pool.
OK. I'll comment directly on the code from here, hopefully I'll
address your other questions in those comments.
-Dave.
--
Dave Chinner
david@fromorbit.com
* Re: [RFC PATCH 1/4] xfs: resurrect the AGFL reservation
2024-06-13 20:27 ` [RFC PATCH 1/4] xfs: resurrect the AGFL reservation Krister Johansen
@ 2024-06-14 2:59 ` Dave Chinner
2024-06-17 23:46 ` Krister Johansen
0 siblings, 1 reply; 14+ messages in thread
From: Dave Chinner @ 2024-06-14 2:59 UTC (permalink / raw)
To: Krister Johansen
Cc: Chandan Babu R, Darrick J. Wong, Dave Chinner, Gao Xiang,
linux-xfs
On Thu, Jun 13, 2024 at 01:27:26PM -0700, Krister Johansen wrote:
> Transactions that perform multiple allocations may inadvertently run out
> of space after the first allocation selects an AG that appears to have
> enough available space. The problem occurs when the allocation in the
> transaction splits freespace b-trees but the second allocation does not
> have enough available space to refill the AGFL. This results in an
> aborted transaction and a filesystem shutdown. In this author's case,
> it's frequently encountered in the xfs_bmap_extents_to_btree path on a
> write to an AG that's almost reached its limits.
>
> The AGFL reservation allows us to save some blocks to refill the AGFL to
> its minimum level in an Nth allocation, and to prevent allocations from
> proceeding when there's not enough reserved space to accommodate the
> refill.
>
> This patch just brings back the reservation and does the plumbing. The
> policy decisions about which allocations to allow will be in a
> subsequent patch.
>
> This implementation includes space for the bnobt and cnobt in the
> reserve. This was done largely because the AGFL reserve stubs appeared
> to already be doing it this way.
>
> Signed-off-by: Krister Johansen <kjlx@templeofstupid.com>
> ---
> fs/xfs/libxfs/xfs_ag.h | 2 ++
> fs/xfs/libxfs/xfs_ag_resv.c | 54 ++++++++++++++++++++++--------
> fs/xfs/libxfs/xfs_ag_resv.h | 4 +++
> fs/xfs/libxfs/xfs_alloc.c | 43 +++++++++++++++++++++++-
> fs/xfs/libxfs/xfs_alloc.h | 3 +-
> fs/xfs/libxfs/xfs_alloc_btree.c | 59 +++++++++++++++++++++++++++++++++
> fs/xfs/libxfs/xfs_alloc_btree.h | 5 +++
> fs/xfs/libxfs/xfs_rmap_btree.c | 5 +++
> fs/xfs/scrub/fscounters.c | 1 +
> 9 files changed, 161 insertions(+), 15 deletions(-)
> diff --git a/fs/xfs/libxfs/xfs_ag.h b/fs/xfs/libxfs/xfs_ag.h
> index 35de09a2516c..40bff82f2b7e 100644
> --- a/fs/xfs/libxfs/xfs_ag.h
> +++ b/fs/xfs/libxfs/xfs_ag.h
> @@ -62,6 +62,8 @@ struct xfs_perag {
> struct xfs_ag_resv pag_meta_resv;
> /* Blocks reserved for the reverse mapping btree. */
> struct xfs_ag_resv pag_rmapbt_resv;
> + /* Blocks reserved for the AGFL. */
> + struct xfs_ag_resv pag_agfl_resv;
Ok, so you're creating a new pool for the AGFL refills.
....
> @@ -180,11 +188,13 @@ __xfs_ag_resv_init(
>
> switch (type) {
> case XFS_AG_RESV_RMAPBT:
> + case XFS_AG_RESV_AGFL:
> /*
> - * Space taken by the rmapbt is not subtracted from fdblocks
> - * because the rmapbt lives in the free space. Here we must
> - * subtract the entire reservation from fdblocks so that we
> - * always have blocks available for rmapbt expansion.
> + * Space taken by the rmapbt and agfl are not subtracted from
> + * fdblocks because they both live in the free space. Here we
> + * must subtract the entire reservation from fdblocks so that we
> + * always have blocks available for rmapbt expansion and agfl
> + * refilling.
> */
> hidden_space = ask;
> break;
And the AGFL and RMAPBT pools have largely the same behaviour.
> @@ -299,6 +309,25 @@ xfs_ag_resv_init(
> has_resv = true;
> }
>
> + /* Create the AGFL reservation */
> + if (pag->pag_agfl_resv.ar_asked == 0) {
> + ask = used = 0;
> +
> + error = xfs_allocbt_calc_reserves(mp, tp, pag, &ask, &used);
> + if (error)
> + goto out;
I think this is wrong, more below.
> + error = xfs_alloc_agfl_calc_reserves(mp, tp, pag, &ask, &used);
> + if (error)
> + goto out;
> +
> + error = __xfs_ag_resv_init(pag, XFS_AG_RESV_AGFL, ask, used);
> + if (error)
> + goto out;
> + if (ask)
> + has_resv = true;
> + }
> +
> out:
> /*
> * Initialize the pagf if we have at least one active reservation on the
> @@ -324,7 +353,8 @@ xfs_ag_resv_init(
> */
> if (!error &&
> xfs_perag_resv(pag, XFS_AG_RESV_METADATA)->ar_reserved +
> - xfs_perag_resv(pag, XFS_AG_RESV_RMAPBT)->ar_reserved >
> + xfs_perag_resv(pag, XFS_AG_RESV_RMAPBT)->ar_reserved +
> + xfs_perag_resv(pag, XFS_AG_RESV_AGFL)->ar_reserved >
> pag->pagf_freeblks + pag->pagf_flcount)
> error = -ENOSPC;
> }
This is now quite hard to read. Perhaps a helper:
static xfs_extlen_t
xfs_ag_reserved_space(struct xfs_perag *pag)
{
	return pag->pag_meta_resv.ar_reserved +
	       pag->pag_rmapbt_resv.ar_reserved +
	       pag->pag_agfl_resv.ar_reserved;
}
And that can also be used in xfs_ag_resv_needed()....
> @@ -48,6 +50,8 @@ xfs_ag_resv_rmapbt_alloc(
>
> args.len = 1;
> pag = xfs_perag_get(mp, agno);
> + /* Transfer this reservation from the AGFL to RMAPBT */
> + xfs_ag_resv_free_extent(pag, XFS_AG_RESV_AGFL, NULL, 1);
> xfs_ag_resv_alloc_extent(pag, XFS_AG_RESV_RMAPBT, &args);
> xfs_perag_put(pag);
> }
I have no idea what this is accounting for or why it needs to be
done. Comments need to explain why, not repeat what the code is
doing.
> +/*
> + * Work out how many blocks to reserve for the AGFL as well as how many are in
> + * use currently.
> + */
> +int
> +xfs_alloc_agfl_calc_reserves(
> + struct xfs_mount *mp,
> + struct xfs_trans *tp,
> + struct xfs_perag *pag,
> + xfs_extlen_t *ask,
> + xfs_extlen_t *used)
> +{
> + struct xfs_buf *agbp;
> + struct xfs_agf *agf;
> + xfs_extlen_t agfl_blocks;
> + xfs_extlen_t list_len;
> + int error;
> +
> + error = xfs_alloc_read_agf(pag, tp, 0, &agbp);
> + if (error)
> + return error;
> +
> + agf = agbp->b_addr;
> + agfl_blocks = xfs_alloc_min_freelist(mp, NULL);
Why are you passing a NULL for the perag in here? This minimum AGFL
blocks we will reserve is for single level btrees.
When we are actually doing an allocation, this calculation returns
a block count based on the current height of the btrees, and so the
actual AGFL refill size is always going to be larger than the
reserves calculated here.
> + list_len = be32_to_cpu(agf->agf_flcount);
There is no relationship between the minimum AGFL block count and
the current number of blocks on the AGFL. The number of blocks on
the AGFL can be much larger than the minimum required to perform an
allocation (e.g. we just did a series of merges that moved lots of
blocks to the AGFL that was already at its required size).
> + xfs_trans_brelse(tp, agbp);
> +
> + /*
> + * Reserve enough space to refill AGFL to minimum fullness if btrees are
> + * at maximum height.
> + */
> + *ask += agfl_blocks;
> + *used += list_len;
Hence the above calculations can result in a situation where used >
ask, and ask is not large enough for runtime AGFL reservations.
> diff --git a/fs/xfs/libxfs/xfs_alloc_btree.c b/fs/xfs/libxfs/xfs_alloc_btree.c
> index 6ef5ddd89600..9c20f85a459d 100644
> --- a/fs/xfs/libxfs/xfs_alloc_btree.c
> +++ b/fs/xfs/libxfs/xfs_alloc_btree.c
> @@ -671,6 +671,65 @@ xfs_allocbt_calc_size(
> return xfs_btree_calc_size(mp->m_alloc_mnr, len);
> }
>
> +/*
> + * Calculate the maximum alloc btree size. This is for a single allocbt.
> + * Callers wishing to compute both the size of the bnobt and cnobt must double
> + * this result.
> + */
> +xfs_extlen_t
> +xfs_allocbt_max_size(
> + struct xfs_mount *mp,
> + xfs_agblock_t agblocks)
> +{
> +
> + /* Don't proceed if uninitialized. Can happen in mkfs. */
> + if (mp->m_alloc_mxr[0] == 0)
> + return 0;
> +
> + return xfs_allocbt_calc_size(mp, agblocks);
> +}
So this is using a block count as the number of records in the
btree? And it's returning the number of btree blocks that would be
needed to index that number of records?
I'm not sure what you are attempting to calculate here, because...
> +/*
> + * Work out how many blocks to reserve for the bnobt and the cnobt as well as
> + * how many blocks are in use by these trees.
> + */
> +int
> +xfs_allocbt_calc_reserves(
> + struct xfs_mount *mp,
> + struct xfs_trans *tp,
> + struct xfs_perag *pag,
> + xfs_extlen_t *ask,
> + xfs_extlen_t *used)
> +{
> + struct xfs_buf *agbp;
> + struct xfs_agf *agf;
> + xfs_agblock_t agblocks;
> + xfs_extlen_t tree_len;
> + int error;
> +
> + error = xfs_alloc_read_agf(pag, tp, 0, &agbp);
> + if (error)
> + return error;
> +
> + agf = agbp->b_addr;
> + agblocks = be32_to_cpu(agf->agf_length);
> + tree_len = be32_to_cpu(agf->agf_btreeblks);
> + xfs_trans_brelse(tp, agbp);
> +
> + /*
> + * The log is permanently allocated. The space it occupies will never be
> + * available for btree expansion. Pretend the space is not there.
> + */
> + if (xfs_ag_contains_log(mp, pag->pag_agno))
> + agblocks -= mp->m_sb.sb_logblocks;
> +
> + /* Reserve 1% of the AG or enough for one block per record per tree. */
> + *ask += max(agblocks / 100, 2 * xfs_allocbt_max_size(mp, agblocks));
> + *used += tree_len;
> +
> + return error;
> +}
... this is passing the AG size into xfs_allocbt_max_size(), but the
free space btree cannot have that many records in it. If all the
blocks are free, then there will be a single contiguous free space
extent record spanning the entire space. That's definitely not what
this calculation results in. :)
I can see why you might be trying to calculate the max number of
blocks in the btree - given worst case single block fragmentation,
half the space in the AG can be indexed by the free space btree.
That's a non-trivial number of records and a decent number of btree
blocks to index them, but even then I don't think it's a relevant
number to the AGFL reservation size.
The minimum AGFL size is tiny - it's just enough blocks to cover a
full height split of each btree that sources blocks from the AGFL.
If you point xfs_db at a filesystem and run the btheight command
with the length of the AG as the record count, you'll get an idea
of the worst case btree heights in the filesystem.
On the workstation I'm typing this on, /home is 1.8TB, has 32 AGs
and has reflink and rmap enabled.
The worst case btree heights for both the bno and cnt btrees is 4
levels, and the rmapbt is 5 levels at worst. Hence for any given
allocation the worst case AGFL length is going to be ~15 blocks.
Then we need to consider how many times we might have to refill the
AGFL in a series of dependent allocations that might be forced into
the same AG. i.e. the bmbt splitting during record insertion after a
data extent allocation (which I think is the only case where this matters)
The bmapbt for the above filesystem has a maximum height of 4
levels. Hence the worst case "data extent + bmbt split" allocation
is going to require 1 + ((max bmbt level - 1) + 1) block allocations
from the AG. i.e. (max bmbt level + 1) AGFL refills.
Hence the reserve needed to guarantee that we'd never run out of
blocks on an allocation requiring worst case btree splits on all
btrees on all allocations is ~75 blocks.
So I wouldn't expect the AGFL reserve pool to be any larger than a
couple of hundred blocks in the worst case.
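The sizing estimate above can be written out as a few lines of arithmetic. This is purely an illustration of the worked example (the 1.8TB, 32-AG filesystem with bmbt height 4 and a ~15-block worst-case AGFL length); it is not derived from the kernel's actual xfs_alloc_min_freelist() implementation:

```python
# Hedged sketch of the worst-case AGFL reserve arithmetic discussed above.
# The constants are the example values from the 1.8TB/32-AG filesystem, not
# values computed from the kernel's free list code.
bmbt_height = 4        # worst-case bmbt height on the example filesystem
agfl_refill = 15       # "~15 blocks": full-height splits of bno, cnt, rmap

# A "data extent + bmbt split" chain needs 1 + ((max bmbt level - 1) + 1)
# allocations, i.e. (max bmbt level + 1) AGFL refills.
refills = bmbt_height + 1

worst_case_reserve = refills * agfl_refill
print(worst_case_reserve)   # 75, matching the "~75 blocks" figure above
```

Scaling this by the filesystem geometry (btree heights grow with AG size) is what bounds the reserve at a couple of hundred blocks in the worst case.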
Which brings me to the next question: do we even need a reserve pool
for this? We know when we are doing such a dependent allocation
chain because args->minleft has to be set to ensure enough space
is left after the first allocation to allow all the remaining
allocations to succeed.
Perhaps we can to extend the minleft mechanism to indicate how many
dependent allocations are in this chain (we know the bmbt height,
too) and trigger the initial AGFL fill to handle multiple splits
when the AG reservations cause the AG to be "empty" when there's
actually still lots of free space in the AG?
Or maybe there's a simpler way? Just extend the RMAPBT reservation
slightly and use it for AGFL blocks whenever that reservation is
non-zero and causing ENOSPC for AGFL refills?
I'm not sure what the best approach is here - even though I
suggested making use of reserve pools some time ago, looking at the
context and the maths again I'm not really sure that it is the only
(or best) approach to solving the issue...
-Dave.
--
Dave Chinner
david@fromorbit.com
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [RFC PATCH 0/4] bringing back the AGFL reserve
2024-06-14 0:45 ` [RFC PATCH 0/4] bringing back the AGFL reserve Dave Chinner
@ 2024-06-17 22:25 ` Krister Johansen
2024-06-17 23:54 ` Dave Chinner
0 siblings, 1 reply; 14+ messages in thread
From: Krister Johansen @ 2024-06-17 22:25 UTC (permalink / raw)
To: Dave Chinner
Cc: Chandan Babu R, Darrick J. Wong, Dave Chinner, Gao Xiang,
linux-xfs
On Fri, Jun 14, 2024 at 10:45:37AM +1000, Dave Chinner wrote:
> On Thu, Jun 13, 2024 at 01:27:09PM -0700, Krister Johansen wrote:
> > I managed to work out a reproducer for the problem. Debugging that, the
> > steps Gao outlined turned out to be essentially what was necessary to
> > get the problem to happen repeatably.
> >
> > 1. Allocate almost all of the space in an AG
> > 2. Free and reallocate that space to fragment it so the freespace
> > b-trees are just about to split.
> > 3. Allocate blocks in a file such that the next extent allocated for
> > that file will cause its bmbt to get converted from an inline extent to
> > a b-tree.
> > 4. Free space such that the free-space btrees have a contiguous extent
> > with a busy portion on either end
> > 5. Allocate the portion in the middle, splitting the extent and
> > triggering a b-tree split.
>
> Do you have a script that sets up this precondition reliably?
> It sounds like it can be done from a known filesystem config. If you
> do have a script, can you share it? Or maybe even better, turn it
> into an fstest?
I do have a script that reproduces the problem. At the moment it is in
a pretty embarrassing state. I'm happy to clean it up a bit and share
it, or try to turn it into an fstest, or both. The script currently
creates small loop devices to generate a filesystem layout that's a
little easier to work with. Is it considered acceptable to have an
fstest create a filesystem with a particular geometry? (And would you
consider taking a patch to let mkfs.xfs --unsupported take both size and
agsize arguments so the overall filesystem size and the per-ag size
could be set by a test?)
> > On older kernels this is all it takes. After the AG-aware allocator
> > changes I also need to start the allocation in the highest numbered AG
> > available while inducing lock contention in the lower numbered AGs.
>
> Ah, so you have to perform a DoS on the lower AGFs so that the
> attempts made by the xfs_alloc_vextent_start_ag() to trylock the
> lower AGFs once it finds it cannot allocate in the highest AG
> anymore also fail.
>
> That was one of the changes made in the perag aware allocator
> rework; it added full-range AG iteration when XFS_ALLOC_FLAG_TRYLOCK
> is set because we can't deadlock on reverse order AGF locking when
> using trylocks.
>
> However, if the trylock iteration fails, it then sets the restart AG
> to the minimum AG we can wait for without deadlocking, removes the
> trylock and restarts the iteration. Hence you've had to create AGF
> lock contention to force the allocator back to being restricted by
> the AGF locking orders.
The other thing that I really appreciated here is that the patchset
cleaned up a bunch of the different allocation functions and made
everything easier to read and follow. Thanks for that as well.
> Is this new behaviour sufficient to mitigate the problem being seen
> with this database workload? Has it been tested with kernels that
> have those changes, and if so did it have any impact on the
> frequency of the issue occurring?
I don't have a good answer for this yet. The team is planning to start
migrating later in the year and this will probably run through to next
year. I'll have that information eventually and will share it when I
do, but don't know yet. Aside from the script, other synthetic
load tests have not been successful in reproducing the problem. That
may be the result of the databases that are spun up for load testing not
having filesystems that are as full and fragmented as the production ones.
> > In order to ensure that AGs have enough space to complete transactions
> > with multiple allocations, I've taken a stab at implementing an AGFL
> > reserve pool.
>
> OK. I'll comment directly on the code from here, hopefully I'll
> address your other questions in those comments.
Thanks, Dave. I appreciate you spending the time to review and provide
feedback.
-K
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [RFC PATCH 1/4] xfs: resurrect the AGFL reservation
2024-06-14 2:59 ` Dave Chinner
@ 2024-06-17 23:46 ` Krister Johansen
2024-07-09 2:20 ` Dave Chinner
0 siblings, 1 reply; 14+ messages in thread
From: Krister Johansen @ 2024-06-17 23:46 UTC (permalink / raw)
To: Dave Chinner
Cc: Chandan Babu R, Darrick J. Wong, Dave Chinner, Gao Xiang,
linux-xfs
On Fri, Jun 14, 2024 at 12:59:32PM +1000, Dave Chinner wrote:
> On Thu, Jun 13, 2024 at 01:27:26PM -0700, Krister Johansen wrote:
> > Transactions that perform multiple allocations may inadvertently run out
> > of space after the first allocation selects an AG that appears to have
> > enough available space. The problem occurs when the allocation in the
> > transaction splits freespace b-trees but the second allocation does not
> > have enough available space to refill the AGFL. This results in an
> > aborted transaction and a filesystem shutdown. In this author's case,
> > it's frequently encountered in the xfs_bmap_extents_to_btree path on a
> > write to an AG that's almost reached its limits.
> >
> > The AGFL reservation allows us to save some blocks to refill the AGFL to
> > its minimum level in an Nth allocation, and to prevent allocations from
> > proceeding when there's not enough reserved space to accommodate the
> > refill.
> >
> > This patch just brings back the reservation and does the plumbing. The
> > policy decisions about which allocations to allow will be in a
> > subsequent patch.
> >
> > This implementation includes space for the bnobt and cnobt in the
> > reserve. This was done largely because the AGFL reserve stubs appeared
> > to already be doing it this way.
> >
> > Signed-off-by: Krister Johansen <kjlx@templeofstupid.com>
> > ---
> > fs/xfs/libxfs/xfs_ag.h | 2 ++
> > fs/xfs/libxfs/xfs_ag_resv.c | 54 ++++++++++++++++++++++--------
> > fs/xfs/libxfs/xfs_ag_resv.h | 4 +++
> > fs/xfs/libxfs/xfs_alloc.c | 43 +++++++++++++++++++++++-
> > fs/xfs/libxfs/xfs_alloc.h | 3 +-
> > fs/xfs/libxfs/xfs_alloc_btree.c | 59 +++++++++++++++++++++++++++++++++
> > fs/xfs/libxfs/xfs_alloc_btree.h | 5 +++
> > fs/xfs/libxfs/xfs_rmap_btree.c | 5 +++
> > fs/xfs/scrub/fscounters.c | 1 +
> > 9 files changed, 161 insertions(+), 15 deletions(-)
> > diff --git a/fs/xfs/libxfs/xfs_ag.h b/fs/xfs/libxfs/xfs_ag.h
> > index 35de09a2516c..40bff82f2b7e 100644
> > --- a/fs/xfs/libxfs/xfs_ag.h
> > +++ b/fs/xfs/libxfs/xfs_ag.h
> > @@ -62,6 +62,8 @@ struct xfs_perag {
> > struct xfs_ag_resv pag_meta_resv;
> > /* Blocks reserved for the reverse mapping btree. */
> > struct xfs_ag_resv pag_rmapbt_resv;
> > + /* Blocks reserved for the AGFL. */
> > + struct xfs_ag_resv pag_agfl_resv;
>
> Ok, so you're creating a new pool for the AGFL refills.
Sorry, I should have been clearer about what I was trying to do. When I
reviewed the patches[1] that created the rmapbt reservation, it looked
like part of the justification for that work was that blocks from the
agfl moving to other data structures, like the rmapbt, caused accounting
inconsistencies. I worried that if I just accounted for the AGFL blocks
without also trying to track the agfl blocks in use by the bnobt and the
cnobt then I'd be re-creating a problem similar to what the previous
patch had tried to fix.
After reading the rest of your comments, it sounds like I didn't get
this right and we should probably discuss which approach is best to
take.
> > @@ -48,6 +50,8 @@ xfs_ag_resv_rmapbt_alloc(
> >
> > args.len = 1;
> > pag = xfs_perag_get(mp, agno);
> > + /* Transfer this reservation from the AGFL to RMAPBT */
> > + xfs_ag_resv_free_extent(pag, XFS_AG_RESV_AGFL, NULL, 1);
> > xfs_ag_resv_alloc_extent(pag, XFS_AG_RESV_RMAPBT, &args);
> > xfs_perag_put(pag);
> > }
>
> I have no idea what this is accounting for or why it needs to be
> done. Comments need to explain why, not repeat what the code is
> doing.
Thanks, would something along the lines of the following been better?
/*
 * The rmapbt has its own reservation separate from the AGFL. In cases
 * where an agfl block is destined for use in the rmapbt, ensure that it
 * is counted against the rmapbt reservation instead of the agfl's.
 */
> > +/*
> > + * Work out how many blocks to reserve for the AGFL as well as how many are in
> > + * use currently.
> > + */
> > +int
> > +xfs_alloc_agfl_calc_reserves(
> > + struct xfs_mount *mp,
> > + struct xfs_trans *tp,
> > + struct xfs_perag *pag,
> > + xfs_extlen_t *ask,
> > + xfs_extlen_t *used)
> > +{
> > + struct xfs_buf *agbp;
> > + struct xfs_agf *agf;
> > + xfs_extlen_t agfl_blocks;
> > + xfs_extlen_t list_len;
> > + int error;
> > +
> > + error = xfs_alloc_read_agf(pag, tp, 0, &agbp);
> > + if (error)
> > + return error;
> > +
> > + agf = agbp->b_addr;
> > + agfl_blocks = xfs_alloc_min_freelist(mp, NULL);
>
> Why are you passing a NULL for the perag in here? This minimum AGFL
> blocks we will reserve is for single level btrees.
Thanks for catching this. I let myself get fooled by the comment at the
top that said, "If @pag is NULL, return the largest possible minimum
length" which was what I thought I wanted.
> When we are actually doing an allocation, this calculation returns
> a block count based on the current height of the btrees, and so the
> actual AGFL refill size is always going to be larger than the
> reserves calculated here.
>
>
> > + list_len = be32_to_cpu(agf->agf_flcount);
>
> There is no relationship between the minimum AGFL block count and
> the current number of blocks on the AGFL. The number of blocks on
> the AGFL can be much larger than the minimum required to perform an
> allocation (e.g. we just did a series of merges that moved lots of
> blocks to the AGFL that was already at its required size).
>
> > + xfs_trans_brelse(tp, agbp);
> > +
> > + /*
> > + * Reserve enough space to refill AGFL to minimum fullness if btrees are
> > + * at maximum height.
> > + */
> > + *ask += agfl_blocks;
> > + *used += list_len;
>
> Hence the above calculations can result in a situation where used >
> ask, and ask is not large enough for runtime AGFL reservations.
It sounds like I need to work out some tests that involve generating a
really full agfl and/or free-space trees without the change and then
trying to use those with the change.
> > diff --git a/fs/xfs/libxfs/xfs_alloc_btree.c b/fs/xfs/libxfs/xfs_alloc_btree.c
> > index 6ef5ddd89600..9c20f85a459d 100644
> > --- a/fs/xfs/libxfs/xfs_alloc_btree.c
> > +++ b/fs/xfs/libxfs/xfs_alloc_btree.c
> > @@ -671,6 +671,65 @@ xfs_allocbt_calc_size(
> > return xfs_btree_calc_size(mp->m_alloc_mnr, len);
> > }
> >
> > +/*
> > + * Calculate the maximum alloc btree size. This is for a single allocbt.
> > + * Callers wishing to compute both the size of the bnobt and cnobt must double
> > + * this result.
> > + */
> > +xfs_extlen_t
> > +xfs_allocbt_max_size(
> > + struct xfs_mount *mp,
> > + xfs_agblock_t agblocks)
> > +{
> > +
> > + /* Don't proceed if uninitialized. Can happen in mkfs. */
> > + if (mp->m_alloc_mxr[0] == 0)
> > + return 0;
> > +
> > + return xfs_allocbt_calc_size(mp, agblocks);
> > +}
>
> So this is using a block count as the number of records in the
> btree? And it's returning the number of btree blocks that would be
> needed to index that number of records?
>
> I'm not sure what you are attempting to calculate here, because...
>
> > +/*
> > + * Work out how many blocks to reserve for the bnobt and the cnobt as well as
> > + * how many blocks are in use by these trees.
> > + */
> > +int
> > +xfs_allocbt_calc_reserves(
> > + struct xfs_mount *mp,
> > + struct xfs_trans *tp,
> > + struct xfs_perag *pag,
> > + xfs_extlen_t *ask,
> > + xfs_extlen_t *used)
> > +{
> > + struct xfs_buf *agbp;
> > + struct xfs_agf *agf;
> > + xfs_agblock_t agblocks;
> > + xfs_extlen_t tree_len;
> > + int error;
> > +
> > + error = xfs_alloc_read_agf(pag, tp, 0, &agbp);
> > + if (error)
> > + return error;
> > +
> > + agf = agbp->b_addr;
> > + agblocks = be32_to_cpu(agf->agf_length);
> > + tree_len = be32_to_cpu(agf->agf_btreeblks);
> > + xfs_trans_brelse(tp, agbp);
> > +
> > + /*
> > + * The log is permanently allocated. The space it occupies will never be
> > + * available for btree expansion. Pretend the space is not there.
> > + */
> > + if (xfs_ag_contains_log(mp, pag->pag_agno))
> > + agblocks -= mp->m_sb.sb_logblocks;
> > +
> > + /* Reserve 1% of the AG or enough for one block per record per tree. */
> > + *ask += max(agblocks / 100, 2 * xfs_allocbt_max_size(mp, agblocks));
> > + *used += tree_len;
> > +
> > + return error;
> > +}
>
> ... this is passing the AG size into xfs_allocbt_max_size(), but the
> free space btree cannot have that many records in it. If all the
> blocks are free, then there will be a single contiguous free space
> extent record spanning the entire space. That's definitely not what
> this calculation results in. :)
>
> I can see why you might be trying to calculate the max number of
> blocks in the btree - given worst case single block fragmentation,
> half the space in the AG can be indexed by the free space btree.
> That's a non-trivial number of records and a decent number of btree
> blocks to index them, but even then I don't think it's a relevant
> number to the AGFL reservation size.
>
> The minimum AGFL size is tiny - it's just enough blocks to cover a
> full height split of each btree that sources blocks from the AGFL.
> If you point xfs_db at a filesystem and run the btheight command
> with the length of the AG as the record count, you'll get an idea
> of the worst case btree heights in the filesystem.
>
> On the workstation I'm typing this on, /home is 1.8TB, has 32 AGs
> and has reflink and rmap enabled.
>
> The worst case btree heights for both the bno and cnt btrees is 4
> levels, and the rmapbt is 5 levels at worst. Hence for any given
> allocation the worst case AGFL length is going to be ~15 blocks.
>
> Then we need to consider how many times we might have to refill the
> AGFL in a series of dependent allocations that might be forced into
> the same AG. i.e. the bmbt splitting during record insertion after a
> data extent allocation (which I think is the only case where this matters)
>
> The bmapbt for the above filesystem has a maximum height of 4
> levels. Hence the worst case "data extent + bmbt split" allocation
> is going to require 1 + ((max bmbt level - 1) + 1) block allocations
> from the AG. i.e. (max bmbt level + 1) AGFL refills.
>
> Hence the reserve needed to guarantee that we'd never run out of
> blocks on an allocation requiring worst case btree splits on all
> btrees on all allocations is ~75 blocks.
Just to make certain I understand, for each allocation the correct way
to calculate the AGFL refill size would be to use the formula in
xfs_alloc_min_freelist(), and assume that in the worst-case the level
increments each time?
> So I wouldn't expect the AGFL reserve pool to be any larger than a
> couple of hundred blocks in the worst case.
>
> Which brings me to the next question: do we even need a reserve pool
> for this? We know when we are doing such a dependent allocation
> chain because args->minleft has to be set to ensure enough space
> is left after the first allocation to allow all the remaining
> allocations to succeed.
>
> Perhaps we can extend the minleft mechanism to indicate how many
> dependent allocations are in this chain (we know the bmbt height,
> too) and trigger the initial AGFL fill to handle multiple splits
> when the AG reservations cause the AG to be "empty" when there's
> actually still lots of free space in the AG?
>
> Or maybe there's a simpler way? Just extend the RMAPBT reservation
> slightly and use it for AGFL blocks whenever that reservation is
> non-zero and causing ENOSPC for AGFL refills?
>
> I'm not sure what the best approach is here - even though I
> suggested making use of reserve pools some time ago, looking at the
> context and the maths again I'm not really sure that it is the only
> (or best) approach to solving the issue...
I'd be happy to explore some of these alternate approaches.
One approach that I tried before the agfl reserved pool was to work out
how much space would be needed for the next agfl refill by allowing
xfs_alloc_min_freelist to tell me the need at level n+1. From there, I
required minleft > 0 allocations to have that additional space
available. It seemed like it was working, but broke the stripe
alignment tests in generic/336. The case where that test selects a
really large extent size was failing because it was falling back to a
minleft allocation and not getting proper stripe alignment.
This was because I was effectively reserving additional space, but not
subtracting that space from m_ag_max_usable. The ag_resv code already
does this, so I wondered if I might be acting like I was smarter than
the experts and should try following the advice that was given.
Let me try to summarize and make a few of these proposals more concrete
and then we can discuss the tradeoffs.
Based upon previous discussions, I was under the impression that a
correct solution should maintain the following invariant:
xfs_alloc_fix_freelist() should not select an AG for which there
is not enough space to satisfy all of the allocations in the
transaction.
I had pedantically turned that into a pair of checks:
a. Validate that enough space exists to satisfy the transaction at the
time the AG for the first allocation is selected.
b. Take additional steps in subsequent allocations in the same
transaction to ensure available space could be used.
The absolutely _most_ simplistic approach I can think of is the
following.
1. Allow AGFL refills to ignore the reservation if we're close to ENOSPC
and are performing a subsequent allocation in a transaction.
I think you originally suggested this idea[2] in the previous discussion.
It's a simplistic solution, and doesn't check ahead of time that enough
blocks are available to meet the demand. However, in all of the cases
where I've run into this problem, the filesystems that hit this case had
substantially more than 75 blocks (approx 20x more) in the reserve, and
were only doing a single bmbt, bnobt, and cnobt split. (Only 8 blocks
short.)
If the reserved block counter doesn't change as a result of
the "borrowing" of these blocks, then subsequent allocation attempts
will correctly get ENOSPC on that AG. Essentially,
xfs_alloc_space_available might let one of these through, but it's
unlikely to let any more through since the available count will now
always be less than the reserved count after one of these borrow edge
cases occurs.
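A minimal sketch of what option 1's check might look like. Every name here (the function and its parameters) is a hypothetical illustration rather than actual kernel API; in practice the decision would live somewhere near xfs_alloc_space_available(). The key property described above is that the reserve counter is never decremented when we borrow, so the next transaction's first allocation still sees the AG as ENOSPC:

```c
#include <assert.h>

/*
 * Hypothetical sketch of option 1 (illustrative names, not kernel code):
 * near ENOSPC, let an AGFL refill for a dependent allocation in an
 * already-dirty transaction borrow blocks from the per-AG reservation.
 */
static int
agfl_refill_may_borrow(
	unsigned int	free_blocks,	/* usable free blocks in the AG */
	unsigned int	reserved,	/* total per-AG reserved blocks */
	unsigned int	need,		/* blocks needed to refill the AGFL */
	int		dfops_dirty)	/* subsequent alloc in a dirty tx? */
{
	/* the normal case: space available over and above the reserve */
	if (free_blocks >= reserved + need)
		return 1;

	/*
	 * Only a dependent allocation may dip into the reserve; failing
	 * it would abort the transaction and shut down the filesystem.
	 * The reserve counter is deliberately not decremented, so the
	 * first allocation of the next transaction sees free < reserved
	 * and gets ENOSPC instead of starting a chain it cannot finish.
	 */
	return dfops_dirty && free_blocks >= need;
}
```

For example, with 80 reserved blocks and only 50 free, a first allocation is refused, but a dependent allocation needing a 10-block refill is still allowed through.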
2. Modify minleft to account for dependent allocations
Turn minleft into something like:
union xfs_minleft {
xfs_extlen_t blocksleft;
unsigned int ndepallocs;
};
AFAICS, xfs_bmapi_minleft() handles this calculation for bmap
allocations which just leaves the ialloc code using blocksleft as the
original minleft. The code in xfs_alloc_space_available would need to
know how to turn ndepallocs into a number of blocks, and then there's
some tracepoint code to touch up.
The ialloc code sets owner info that we might be able to use later to
determine whether to use blocksleft or ndepallocs, provided it's
acceptable to rely on the field for that.
This is appealing because it would let the allocation path perform one
allocation to fill the AGFL instead of multiple, and we'd know up front
whether we could get all the space needed for the transaction or not.
The downside might be having to adjust the freelist length more
frequently if we need to subsequently shrink it, or partially fill, hit
an error, and then have to clean up. The AGF modifications need to be
journaled, so there might be additional overhead on that path.
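If minleft carried a dependent-allocation count, the space check would need some way to turn it back into blocks. One possible conversion, with entirely hypothetical names (in a real implementation the per-refill cost would presumably come from xfs_alloc_min_freelist() at the current btree heights):

```c
#include <assert.h>

/*
 * Hypothetical sketch for option 2 (illustrative names, not kernel code):
 * convert a count of remaining dependent allocations into a minleft block
 * requirement. Each remaining allocation needs its own extent plus a
 * worst-case AGFL refill.
 */
static unsigned int
ndepallocs_to_minleft(
	unsigned int	ndepallocs,	/* dependent allocs left in chain */
	unsigned int	alloc_len,	/* blocks per dependent allocation */
	unsigned int	agfl_refill)	/* worst-case AGFL refill size */
{
	return ndepallocs * (alloc_len + agfl_refill);
}
```

With the worked example from earlier in the thread (bmbt height 4, so 5 single-block allocations with a ~15-block refill each), this would ask for 5 * 16 = 80 blocks of minleft.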
3. Extend the RMAPBT reservation
Older filesystems don't default to having rmapbt enabled so they may not
benefit from this adjustment. If we could extend both reservations,
that might be a compromise solution.
It seems like this solution would still want to look at the filesystem
geometry and use that to make an estimate of how many additional blocks
to reserve. That starts to look like the first step in creating a
separate AGFL reservation which leaves...
4. Implement a less ambitious AGFL reserve
The draft in this thread tried to track all AGFL blocks from their
allocation and continued to account for them during their lifetime in
the bnobt and cnobt. Instead of trying to build a reserve for all free
space management blocks that aren't rmapbt, reduce the scope of the AGFL
reserve to be just the number of blocks needed to satisfy a single
transaction with multiple allocations that occurs close to an AG's
ENOSPC.
Blocks would be subtracted from used on agfl block free, if used were
non-zero. Blocks would be added to used on agfl block allocation if a
non-reserve allocation failed to obtain the requested number of blocks.
The wanted size could be determined using the same calculation described
earlier: calculate the maximum number of bmbt splits in a transaction
and add up the worst-case number of blocks for each split.
The one part I'm not quite sure how to work out is how to re-derive the
amount of the AGFL reserve that's been taken and is currently in use on a
subsequent mount of the filesystem. I think option #3 suffers from this
problem as well. IOW, is there a sneaky way to re-compute this without
having to add another field to the on-disk AGF?
Do any of these approaches stand out as clearly better (or worse) to
you?
Thanks again for taking the time to read through these patches and
provide feedback. I appreciate all the comments, ideas, and
explanations.
-K
[1] https://lore.kernel.org/linux-xfs/20180205174601.51574-5-bfoster@redhat.com/
[2] https://lore.kernel.org/linux-xfs/20221116025106.GB3600936@dread.disaster.area/
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [RFC PATCH 0/4] bringing back the AGFL reserve
2024-06-17 22:25 ` Krister Johansen
@ 2024-06-17 23:54 ` Dave Chinner
0 siblings, 0 replies; 14+ messages in thread
From: Dave Chinner @ 2024-06-17 23:54 UTC (permalink / raw)
To: Krister Johansen
Cc: Chandan Babu R, Darrick J. Wong, Dave Chinner, Gao Xiang,
linux-xfs
On Mon, Jun 17, 2024 at 03:25:27PM -0700, Krister Johansen wrote:
> On Fri, Jun 14, 2024 at 10:45:37AM +1000, Dave Chinner wrote:
> > On Thu, Jun 13, 2024 at 01:27:09PM -0700, Krister Johansen wrote:
> > > I managed to work out a reproducer for the problem. Debugging that, the
> > > steps Gao outlined turned out to be essentially what was necessary to
> > > get the problem to happen repeatably.
> > >
> > > 1. Allocate almost all of the space in an AG
> > > 2. Free and reallocate that space to fragment it so the freespace
> > > b-trees are just about to split.
> > > 3. Allocate blocks in a file such that the next extent allocated for
> > > that file will cause its bmbt to get converted from an inline extent to
> > > a b-tree.
> > > 4. Free space such that the free-space btrees have a contiguous extent
> > > with a busy portion on either end
> > > 5. Allocate the portion in the middle, splitting the extent and
> > > triggering a b-tree split.
> >
> > Do you have a script that sets up this precondition reliably?
> > It sounds like it can be done from a known filesystem config. If you
> > do have a script, can you share it? Or maybe even better, turn it
> > into an fstest?
>
> I do have a script that reproduces the problem. At the moment it is in
> a pretty embarrassing state. I'm happy to clean it up a bit and share
> it, or try to turn it into an fstest, or both. The script currently
> creates small loop devices to generate a filesystem layout that's a
> little easier to work with. Is it considered acceptable to have an
> fstest create a filesystem with a particular geometry?
Absolutely.
We do that quite a bit in fstests - if you just need one filesystem
for this, then _scratch_mkfs_sized() can be used to make a
filesystem of a specific size, or you can use _scratch_mkfs and pass
the options you want directly to it. An example of this is xfs/227.
Otherwise you can create loop devices and use _mkfs_dev() to create
the filesystems you need on them. An example of this is xfs/074.
> (And would you
> consider taking a patch to let mkfs.xfs --unsupported take both size and
> agsize arguments so the overall filesystem size and the per-ag size
> could be set by a test?)
We can already do that, but it's been placed behind a strong
environmental guard to restrict such configurations to the fstests
environment:
# TEST_DIR=foo TEST_DEV=foo QA_CHECK_FS=foo mkfs.xfs -dagcount=2,size=32m,file,name=foo
will bypass the "300MB minimum size" checks and allow you to make a
filesystem image with 2x16MB AGs (i.e. a 32MB filesystem) ready to
have a loop device created over it. If this is run inside a fstest,
then you don't need the env variables because they are already set.
> > > On older kernels this is all it takes. After the AG-aware allocator
> > > changes I also need to start the allocation in the highest numbered AG
> > > available while inducing lock contention in the lower numbered AGs.
> >
> > Ah, so you have to perform a DOS on the lower AGFs so that the
> > attempts made by the xfs_alloc_vextent_start_ag() to trylock the
> > lower AGFs once it finds it cannot allocate in the highest AG
> > anymore also fail.
> >
> > That was one of the changes made in the perag aware allocator
> > rework; it added full-range AG iteration when XFS_ALLOC_FLAG_TRYLOCK
> > is set because we can't deadlock on reverse order AGF locking when
> > using trylocks.
> >
> > However, if the trylock iteration fails, it then sets the restart AG
> > to the minimum AG we can wait for without deadlocking, removes the
> > trylock and restarts the iteration. Hence you've had to create AGF
> > lock contention to force the allocator back to being restricted by
> > the AGF locking orders.
>
> The other thing that I really appreciated here is that the patchset
> cleaned up a bunch of the different allocation functions and made
> everything easier to read and follow. Thanks for that as well.
>
> > Is this new behaviour sufficient to mitigate the problem being seen
> > with this database workload? Has it been tested with kernels that
> > have those changes, and if so did it have any impact on the
> > frequency of the issue occurring?
>
> I don't have a good answer for this yet. The team is planning to start
> migrating later in the year and this will probably run through to next
> year. I'll have that information eventually and will share it when I
> do, but don't know yet.
OK, that's fine. We still need to fix the issue, I just wanted to
get an idea of the urgency of the situation.
> Aside from the script, other synthetic
> load-tests have not been successful in reproducing the problem. That
> may be the result of the databases that are spun up for load testing not
> having filesystems that are as full and fragmented as the production ones.
Yeah, that's typical when triaging these sorts of issues. We've had
issues like this in the past with ScyllaDB w.r.t. extent size hint
based allocation occasionally failing at ~70% filesystem capacity.
Situations like this require a very particular free space pattern to
trigger, and it tends to be rare that they are hit in either test or
production systems. Having a synthetic reproducer for such an issue
is like finding gold. :)
-Dave.
--
Dave Chinner
david@fromorbit.com
* Re: [RFC PATCH 1/4] xfs: resurrect the AGFL reservation
2024-06-17 23:46 ` Krister Johansen
@ 2024-07-09 2:20 ` Dave Chinner
2024-07-23 6:51 ` Krister Johansen
0 siblings, 1 reply; 14+ messages in thread
From: Dave Chinner @ 2024-07-09 2:20 UTC (permalink / raw)
To: Krister Johansen
Cc: Chandan Babu R, Darrick J. Wong, Dave Chinner, Gao Xiang,
linux-xfs
[ Sorry for the long delay - PTO and then catchup time so I'm only
just getting back to this now. This is long, and I disappeared down
a rabbit hole in getting to the bottom of some of the code before I
even read the summary you wrote. I think I came up with a solution
to your #4 proposal in the process... ]
On Mon, Jun 17, 2024 at 04:46:31PM -0700, Krister Johansen wrote:
> On Fri, Jun 14, 2024 at 12:59:32PM +1000, Dave Chinner wrote:
> > On Thu, Jun 13, 2024 at 01:27:26PM -0700, Krister Johansen wrote:
> > > Transactions that perform multiple allocations may inadvertently run out
> > > of space after the first allocation selects an AG that appears to have
> > > enough available space. The problem occurs when the allocation in the
> > > transaction splits freespace b-trees but the second allocation does not
> > > have enough available space to refill the AGFL. This results in an
> > > aborted transaction and a filesystem shutdown. In this author's case,
> > > it's frequently encountered in the xfs_bmap_extents_to_btree path on a
> > > write to an AG that's almost reached its limits.
> > >
> > > The AGFL reservation allows us to save some blocks to refill the AGFL to
> > > its minimum level in an Nth allocation, and to prevent allocations from
> > > proceeding when there's not enough reserved space to accommodate the
> > > refill.
> > >
> > > This patch just brings back the reservation and does the plumbing. The
> > > policy decisions about which allocations to allow will be in a
> > > subsequent patch.
> > >
> > > This implementation includes space for the bnobt and cnobt in the
> > > reserve. This was done largely because the AGFL reserve stubs appeared
> > > to already be doing it this way.
> > >
> > > Signed-off-by: Krister Johansen <kjlx@templeofstupid.com>
> > > ---
> > > fs/xfs/libxfs/xfs_ag.h | 2 ++
> > > fs/xfs/libxfs/xfs_ag_resv.c | 54 ++++++++++++++++++++++--------
> > > fs/xfs/libxfs/xfs_ag_resv.h | 4 +++
> > > fs/xfs/libxfs/xfs_alloc.c | 43 +++++++++++++++++++++++-
> > > fs/xfs/libxfs/xfs_alloc.h | 3 +-
> > > fs/xfs/libxfs/xfs_alloc_btree.c | 59 +++++++++++++++++++++++++++++++++
> > > fs/xfs/libxfs/xfs_alloc_btree.h | 5 +++
> > > fs/xfs/libxfs/xfs_rmap_btree.c | 5 +++
> > > fs/xfs/scrub/fscounters.c | 1 +
> > > 9 files changed, 161 insertions(+), 15 deletions(-)
> > > diff --git a/fs/xfs/libxfs/xfs_ag.h b/fs/xfs/libxfs/xfs_ag.h
> > > index 35de09a2516c..40bff82f2b7e 100644
> > > --- a/fs/xfs/libxfs/xfs_ag.h
> > > +++ b/fs/xfs/libxfs/xfs_ag.h
> > > @@ -62,6 +62,8 @@ struct xfs_perag {
> > >  	struct xfs_ag_resv	pag_meta_resv;
> > >  	/* Blocks reserved for the reverse mapping btree. */
> > >  	struct xfs_ag_resv	pag_rmapbt_resv;
> > > +	/* Blocks reserved for the AGFL. */
> > > +	struct xfs_ag_resv	pag_agfl_resv;
> >
> > Ok, so you're creating a new pool for the AGFL refills.
>
> Sorry, I should have been clearer about what I was trying to do. When I
> reviewed the patches[1] that created the rmapbt reservation, it looked
> like part of the justification for that work was that blocks from the
> agfl moving to other data structures, like the rmapbt, caused accounting
> inconsistencies. I worried that if I just accounted for the AGFL blocks
> without also trying to track the agfl blocks in use by the bnobt and the
> cnobt then I'd be re-creating a problem similar to what the previous
> patch had tried to fix.
Ah. There's a key difference between bno/cnt btree blocks and rmapbt
blocks: bnobt/cntbt blocks are considered free space whilst rmapbt
blocks are considered used space.
The reason for this is that the free space btrees have a special
characteristic that only becomes obvious when someone points it out.
When the AG reaches a full state, there is no free space and the
free space btrees are empty. Hence as we slowly consume space in the
AG, the free space trees are shrinking and the record count
decreases and the btree blocks indexing that free space get released
as the index shrinks. Hence we don't have to account for those btree
blocks as used space because they have always been released back to the
free space btree by the time we get near ENOSPC. When we are near
ENOSPC, the btree is reduced to a single root block that we never
free. By the time we empty all the records in that root block, the
only free space remaining in the AG is the blocks on the AGFL
itself.
In comparison, the rmapbt does not shrink as the filesystem fills
up; it continues to grow as we add more used space records. Hence we
have to account for the rmapbt records as used space, but to
guarantee we can always allocate a new rmapbt record near ENOSPC, we
have to guarantee that there are enough blocks to fill the AGFL for
any given rmapbt insertion. Hence we need a reserve pool of free
space that ensures the free space btree can refill the AGFL blocks
that the rmapbt can consume during growth near ENOSPC.
The fact that we don't account for bnobt/cntbt blocks as used space
is also reflected in the behaviour of xfs_alloc_space_available()
when we are freeing extents. It does:
	if (flags & XFS_ALLOC_FLAG_FREEING)
		return true;
And so always allows extent freeing regardless of how little space
is remaining in the AG and AGFL.
The reason for this is that freeing space when the AGFL is empty
will not require AGFL blocks to be consumed. The bnobt/cntbt root
blocks are completely empty, and so freeing the extent will simply
insert a new record into those blocks. No btree block allocation
will be required. Any AGFL refill after this point will have some
blocks that can be pulled from the free space btrees, hence the AGFL
being undersized does not matter and hence we never return ENOSPC
when freeing extents on a completely full AG.
The fundamental issue here is that the rmapbt and reflink
reservations mean ENOSPC (AG full) is now occurring while there are
still large amounts of free-but-reserved space, and so the
assumption that the bno/cnt trees will never split on allocation nor
grow when we free an extent right at ENOSPC no longer
holds true.
Hence the space we set aside for the AGFL for bno/cnt operation at
ENOSPC is no longer sufficient.....
> After reading the rest of your comments, it sounds like I didn't get
> this right and we should probably discuss which approach is best to
> take.
>
> > > @@ -48,6 +50,8 @@ xfs_ag_resv_rmapbt_alloc(
> > >
> > >  	args.len = 1;
> > >  	pag = xfs_perag_get(mp, agno);
> > > +	/* Transfer this reservation from the AGFL to RMAPBT */
> > > +	xfs_ag_resv_free_extent(pag, XFS_AG_RESV_AGFL, NULL, 1);
> > >  	xfs_ag_resv_alloc_extent(pag, XFS_AG_RESV_RMAPBT, &args);
> > >  	xfs_perag_put(pag);
> > >  }
> >
> > I have no idea what this is accounting for or why it needs to be
> > done. Comments need to explain why, not repeat what the code is
> > doing.
>
> Thanks, would something along the lines of the following been better?
>
> /*
> *
> * The rmapbt has its own reservation separate from the AGFL. In cases
> * where an agfl block is destined for use in the rmapbt, ensure that it
> * is counted against the rmapbt reservation instead of the agfl's.
> */
Ok, I understand what you are trying to do, but given the above I'm
not sure this correct....
> > I can see why you might be trying to calculate the max number of
> > blocks in the btree - given worst case single block fragmentation,
> > half the space in the AG can be indexed by the free space btree.
> > That's a non-trivial number of records and a decent number of btree
> > blocks to index them, but even then I don't think it's a relevant
> > number to the AGFL reservation size.
> >
> > The minimum AGFL size is tiny - it's just enough blocks to cover a
> > full height split of each btree that sources blocks from the AGFL.
> > If you point xfs_db at a filesystem and run the btheight command
> > with the length of the AG as the record count, you'll get an idea
> > of the worst case btree heights in the filesystem.
> >
> > On the workstation I'm typing this one, /home is 1.8TB, has 32 AGs
> > and has reflink and rmap enabled.
> >
> > The worst case btree heights for both the bno and cnt btrees is 4
> > levels, and the rmapbt is 5 levels at worst. Hence for any given
> > allocation the worst case AGFL length is going to be ~15 blocks.
> >
> > Then we need to consider how many times we might have to refill the
> > AGFL in a series of dependent allocations that might be forced into
> > the same AG. i.e. the bmbt splitting during record insertion after a
> > data extent allocation (which I think is the only case this matters)
> >
> > The bmapbt for the above filesystem has a maximum height of 4
> > levels. Hence the worst case "data extent + bmbt split" allocation
> > is going to require 1 + ((max bmbt level - 1) + 1) block allocations
> > from the AG. i.e. (max bmbt level + 1) AGFL refills.
> >
> > Hence to guarantee that we'd never run out of blocks on an
> > allocation requiring worst case btree splits on all btrees
> > on all allocations is ~75 blocks.
>
> Just to make certain I understand, for each allocation the correct way
> to calculate the AGFL refill size would be to use the formula in
> xfs_alloc_min_freelist(), and assume that in the worst-case the level
> increments each time?
Not quite. Btrees don't work that way - once you split the root
block, it can't be split again until the root block fills. That
means we have to split the level below the root block enough times
to fill the root block. And that requires splitting the level below
that enough times to fill and split the second level enough times to
fill the root block. The number of inserts needed to split the root
block increases exponentially with the height of the tree.
Hence if we've got a height of 2 and 10 records/ptrs per leaf/node,
a full tree looks like this:
Level 1                 0123456789
                        ||.......|
        /---------------/|.......\---------------\
        |                |                       |
Level 0  0123456789   0123456789   .....   0123456789
If we do an insert anywhere we'll get a full height split and a new
root at level 2. Lets insert record A in the leaf at the left edge:
Level 2                  01
                         ||
                /--------/\--------\
                |                  |
Level 1       0A1234              56789
              |||...              ....|
      /-------/|\------\      ....\------------\
      |        |       |                       |
Level 0    01234   A56789   0123456789 ..... 0123456789
It should be obvious now that a followup insertion anywhere in the
tree cannot split either the root node at level 2, nor the old root
node that was split in two at level 1.
Yes, we can still split at level 0, but that will only result in
nodes at level 1 getting updated, and only once they are full will
they start splitting at level 1. And then level 2 (the root) won't
split until level 1 has split enough to fill all the nodes in level
1.
What this means is that for any single insertion into a btree, the
worst case demand for new blocks is "current height + 1". In the
case of BMBT, INOBT and REFCOUNTBT, we're typically only doing one or
two adjacent record insertions in a given operation, so only one
split will occur during the operation. The free space trees are
different, however.
When the first allocation occurs, we split the tree as per above
and that consumes (start height + 1) blocks from AGFL. We then
get the next BMBT allocation request (because it's doing a
multi-level split), and the best free space we find is, this time,
in the leaf at the right side of the tree. Hence we split that
block, but we don't need to split its parent (level 1) block
because it has free space in it. Hence the second allocation has
a worst case block requirement of (start height - 1).
If we then do a third allocation (level 2 of the BMBT is splitting),
and that lands in the middle of the tree, we split that block. That
can be inserted into either of the level 1 blocks without causing a
split there, either. Hence a worst case block requirement for that
is also (start height - 1).
It's only once we've split enough blocks at (start height - 1) to
fill both blocks at start height that we'll get a larger split.
We had ten blocks at level 0 to begin with, 11 after the first full
height split, so we need, at minimum, another 9 splits at level 0 to
fill level 1. This means we then have all the leaf blocks at either
5 or 6 records at the time the level 1 nodes fill. To then get
another insert into level 1 to split that, we need to fill a minimum
of -3 sibling- leaf blocks.
The reason we need to fill 3 leaf blocks and not 1 is that we shift
records between left and right siblings instead of splitting. i.e.
we steal space from a sibling in preference to splitting and
inserting a new record into the parent. Hence we only insert into
the parent when the insert node and both siblings are
completely full.
When we look at a typical AG btree on a v5 filesystem, we have at
least 110 records/ptrs per block (1kB block size) so the minimum
number of record inserts needed before we need to split a parent
again after it was just split is at least 150 inserts.
Hence after a full height split, we are not going to split a node at
the old root level again until at least 150 inserts (1kB block size)
are done at the child level.
IOWs, we're never going to get two start height splits in a single
allocation chain. At most we are going to get one start height split
and then N (start height - 1) splits. That's the worst case AGFL we
need to consider, but such a split chain would be so infinitesimally
probable that always reserving that amount of space in the AGFL is
just a waste of space because it will never, ever be used.
So the worst case AGFL requirement for a data extent allocation or
free will be:
1. data extent - full free space tree split * 2 + rmapbt
2. BMBT leaf split - (free space height - 2) * 2 + rmapbt
3. BMBT node split - (free space height - 2) * 2 + rmapbt
....
N. BMBT root split - (free space height - 2) * 2 + rmapbt
So the maximum number of chained allocations is likely going to be
something like:
(max AGFL for one allocation) +
((max bmbt/refcountbt height) * (max agfl for dependent allocation))
Which is likely to be in the order of...
> > So I wouldn't expect the AGFL reserve pool to be any larger than a
> > couple of hundred blocks in the worst case.
... this.
That's a lot of space (close on 1MB at 4kB block size) per AG to
never be able to use for this exceedingly rare corner case
situation. However, let's get the calculation right first and then
see how it falls out in comparison to the amount of space we already
reserve for rmapbt, refcountbt and finobt. It may just be noise...
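Putting the per-allocation formulas above into one sketch (the helper
name, interface, and exact split costs are illustrative only, not a
proposed patch):

```c
/*
 * Illustrative estimate of the worst case AGFL reserve for a data
 * extent allocation followed by a chain of BMBT split allocations,
 * per the formulas above.  All names here are hypothetical.
 */
static unsigned int
agfl_chain_reserve(
	unsigned int	freesp_height,	/* worst case bno/cnt height */
	unsigned int	rmapbt_height,	/* worst case rmapbt height */
	unsigned int	max_bmbt_height)
{
	/* data extent: full free space tree split * 2 + rmapbt */
	unsigned int	first = 2 * (freesp_height + 1) + rmapbt_height;
	/* each BMBT split: (free space height - 2) * 2 + rmapbt */
	unsigned int	dep = 2 * (freesp_height - 2) + rmapbt_height;

	return first + max_bmbt_height * dep;
}
```

With the heights quoted earlier in the thread (4-level free space
trees, 5-level rmapbt, 4-level bmbt) this lands in the tens of blocks
per AG; the exact value depends on which split-cost assumption is used.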
> > Which brings me to the next question: do we even need a reserve pool
> > for this? We know when we are doing such a dependent allocation
> > chain because args->minleft has to be set to ensure enough space
> > is left after the first allocation to allow all the remaining
> > allocations to succeed.
> >
> > Perhaps we can to extend the minleft mechanism to indicate how many
> > dependent allocations are in this chain (we know the bmbt height,
> > too) and trigger the initial AGFL fill to handle multiple splits
> > when the AG reservations cause the AG to be "empty" when there's
> > actually still lots of free space in the AG?
> >
> > Or maybe there's a simpler way? Just extend the RMAPBT reservation
> > slightly and use it for AGFL blocks whenever that reservation is
> > non-zero and causing ENOSPC for AGFL refills?
> >
> > I'm not sure what the best approach is here - even though I
> > suggested making use of reserve pools some time ago, looking at the
> > context and the maths again I'm not really sure that it is the only
> > (or best) approach to solving the issue...
>
> I'd be happy to explore some of these alternate approaches.
>
> One approach that I tried before the agfl reserved pool was to work out
> how much space would be needed for the next agfl refill by allowing
> xfs_alloc_min_freelist to tell me the need at level n+1. From there, I
> required minleft > 0 allocations to have that additional space
> available.
Yes, that would mostly work, but the issue is that every allocation
that goes through xfs_bmapi_write() (essentially every data,
directory and xattr block allocation) have minleft set. And it
gets set on inode chunk allocation, too, because we need to have
space for inobt/finobt record insertion...
> It seemed like it was working, but broke the stripe
> alignment tests in generic/336. The case where that test selects a
> really large extent size was failing because it was falling back to a
> minleft allocation and not getting proper stripe alignment.
It would have been falling back to an *minlen* allocation, not a
"minleft" allocation. Two very different things - minlen and maxlen
define the allowable size of allocated extent, minleft defines how
many blocks must remain free in the AG after the first allocation
succeeds.
> This was because I was effectively reserving additional space, but not
> subtracting that space from m_ag_max_usable. The ag_resv code already
> does this, so I wondered if I might be acting like I was smarter than
> the experts and should try following the advice that was given.
Hmmmm. Subtracting space from mp->m_ag_max_usable might be a viable
option here - that's supposed to have an AGFL block reserve
component in it already.
[ And now we disappear down the rabbit hole. ]
Take a look at xfs_alloc_ag_max_usable(), which is used to set
mp->m_ag_max_usable:
	blocks = XFS_BB_TO_FSB(mp, XFS_FSS_TO_BB(mp, 4));	/* ag headers */
	blocks += XFS_ALLOCBT_AGFL_RESERVE;
	blocks += 3;			/* AGF, AGI btree root blocks */
	....
There's a number of blocks reserved for the AGFL that aren't usable
already. What is that value for?
/*
* The number of blocks per AG that we withhold from xfs_dec_fdblocks to
* guarantee that we can refill the AGFL prior to allocating space in a nearly
* full AG. Although the space described by the free space btrees, the
* blocks used by the freesp btrees themselves, and the blocks owned by the
* AGFL are counted in the ondisk fdblocks, it's a mistake to let the ondisk
* free space in the AG drop so low that the free space btrees cannot refill an
* empty AGFL up to the minimum level. Rather than grind through empty AGs
* until the fs goes down, we subtract this many AG blocks from the incore
* fdblocks to ensure user allocation does not overcommit the space the
* filesystem needs for the AGFLs. The rmap btree uses a per-AG reservation to
* withhold space from xfs_dec_fdblocks, so we do not account for that here.
*/
#define XFS_ALLOCBT_AGFL_RESERVE 4
What is this magic number of 4, then?
Remember what I said above about the free space btree blocks being
accounted as free space and that freeing extents are always allowed
to go ahead, even when the AG is full?
This AGFL reserve is the minimum space needed for an extent free to
successfully refill the AGFL at ENOSPC to allow the extent free to
proceed. That is, both trees have a single level, so a full height
split of either tree will require 2 blocks to be allocated. 2
freespace trees per AG, two blocks per tree, and so we need 4 blocks
per AG to ensure freeing extents at ENOSPC will always succeed.
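The arithmetic behind that magic number can be written down directly
(a sketch for illustration, not kernel code):

```c
/*
 * At true ENOSPC each free space btree is a single root block, so a
 * full height split of one tree needs (height + 1) = 2 new blocks,
 * and there are two trees (bnobt and cntbt) per AG.
 */
enum {
	FREESP_TREES_PER_AG	= 2,
	FREESP_HEIGHT_AT_ENOSPC	= 1,
	/* blocks consumed by a full height split of one tree */
	SPLIT_BLOCKS		= FREESP_HEIGHT_AT_ENOSPC + 1,
	AGFL_RESERVE_SKETCH	= FREESP_TREES_PER_AG * SPLIT_BLOCKS,
};
```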
This "magic 4" blocks isn't valid when we have rmapbt, refcountbt, or
finobt reservations in the AG anymore, because ENOSPC can occur when
we still have multi-level bno/cnt btrees in the AG. That's the first
thing we need to fix.
This AGFL set aside is taken out of the global free space pool via
this function:
/*
* Compute the number of blocks that we set aside to guarantee the ability to
* refill the AGFL and handle a full bmap btree split.
*
* In order to avoid ENOSPC-related deadlock caused by out-of-order locking of
* AGF buffer (PV 947395), we place constraints on the relationship among
* actual allocations for data blocks, freelist blocks, and potential file data
* bmap btree blocks. However, these restrictions may result in no actual space
* allocated for a delayed extent, for example, a data block in a certain AG is
* allocated but there is no additional block for the additional bmap btree
* block due to a split of the bmap btree of the file. The result of this may
* lead to an infinite loop when the file gets flushed to disk and all delayed
* extents need to be actually allocated. To get around this, we explicitly set
* aside a few blocks which will not be reserved in delayed allocation.
*
* For each AG, we need to reserve enough blocks to replenish a totally empty
* AGFL and 4 more to handle a potential split of the file's bmap btree.
*/
unsigned int
xfs_alloc_set_aside(
	struct xfs_mount	*mp)
{
	return mp->m_sb.sb_agcount * (XFS_ALLOCBT_AGFL_RESERVE + 4);
}
But where did that magic number of "4 more to handle a potential
split of the file's bmap btree" come from?
Ah, a fix I made for a ENOSPC regression back in 2006:
+ * ..... Considering the minimum number of
+ * needed freelist blocks is 4 fsbs _per AG_, a potential split of file's bmap
+ * btree requires 1 fsb, so we set the number of set-aside blocks
+ * to 4 + 4*agcount.
+ */
+#define XFS_ALLOC_SET_ASIDE(mp) (4 + ((mp)->m_sb.sb_agcount * 4))
That was fixing a regression caused by an ENOSPC deadlock bug fix
done for SGI PV 947395.
commit 96b336c71016a85cd193762199aa6f99f543b6fe
Author: Yingping Lu <yingping@sgi.com>
Date: Tue May 23 19:28:08 2006 +0000
In actual allocation of file system blocks and freeing extents, the transaction
within each such operation may involve multiple locking of AGF buffer. While the
freeing extent function has sorted the extents based on AGF number before entering
into transaction, however, when the file system space is very limited, the allocation
of space would try every AGF to get space allocated, this could potentially cause
out-of-order locking, thus deadlock could happen. This fix mitigates the scarce space
for allocation by setting aside a few blocks without reservation, and avoid deadlock
by maintaining ascending order of AGF locking.
Add a new flag XFS_ALLOC_FLAG_FREEING to indicate whether the caller is freeing or
allocating. Add a field firstblock in structure xfs_alloc_arg to indicate whether
this is the first allocation request.
Which included this:
+ * ...... Considering the minimum number of
+ * needed freelist blocks is 4 fsbs, a potential split of file's bmap
+ * btree requires 1 fsb, so we set the number of set-aside blocks to 8.
+*/
+#define SET_ASIDE_BLOCKS 8
<ding!>
Ah, now I get it. It's only taken me 18 years to really understand
how this value was originally derived and why it is wrong.
It assumed 4 blocks for the AGFL for the initial allocation to fill
an empty AGFL, and then 4 more blocks for the AGFL to allow a
refill for the subsequent BMBT block allocation to refill the AGFL.
IOWs, it wasn't reserving blocks for the BMBT itself, it was
reserving blocks for a second AGFL refill in the allocation chain....
Right, so I think what we need to change is XFS_ALLOCBT_AGFL_RESERVE
and xfs_alloc_set_aside() to reflect the worst case size of the free
space trees when we hit ENOSPC for user data, not a hard coded 4 + 4.
The worst case free space tree size is going to be single block free
records when all the per-ag reservations are at their maximum
values. So we need to calculate the maximum reservation sizes for the
AG, then calculate how many btree records are needed to index the
remaining space as single block free space extents, then calculate
the height of the btrees needed to index that. That gives us our
"free space tree height at ENOSPC", and then we can calculate how
many AGFL blocks we need at ENOSPC to handle a split of both trees.
That value replaces XFS_ALLOCBT_AGFL_RESERVE.
As for the BMBT split requiring more AGFL refills, we need to know
what the maximum BMBT height the fs supports is (I think we already
calculate that), and that then becomes the level we feed into the
"how many AGFL blocks do we need in a full height BMBT split
allocation chain" calculation. I did that math above, and that calculation
should replace the "+ 4" in xfs_alloc_set_aside(). I think that
ends up with the calculation being something like:
setaside = agcount * (number of dependent allocations *
AGFL blocks for refill at ENOSPC)
I think we also need to make mp->m_ag_max_usable also take into
account the BMBT split related AGFL refills as xfs_alloc_set_aside()
as it only considers XFS_ALLOCBT_AGFL_RESERVE right now. That will
ensure we always have those blocks available for the AGFL.
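A sketch of what the revised calculation might look like (hypothetical
helper and parameter names, not a proposed patch):

```c
/*
 * Hypothetical replacement for the hard-coded
 * (XFS_ALLOCBT_AGFL_RESERVE + 4): reserve one worst-case AGFL refill
 * per allocation in the dependent chain, where the chain length is
 * bounded by the maximum BMBT height (one data extent allocation
 * plus one allocation per BMBT level that can split).
 */
static unsigned int
set_aside_sketch(
	unsigned int	agcount,
	unsigned int	agfl_refill_at_enospc,	/* blocks per AGFL refill */
	unsigned int	max_bmbt_height)
{
	unsigned int	ndepallocs = max_bmbt_height + 1;

	return agcount * ndepallocs * agfl_refill_at_enospc;
}
```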
> Let me try to summarize and make a few of these proposals more concrete
> and then we can discuss the tradeoffs.
Fair summary. This is already long, and I think what I discovered
above kinda comes under:
> 4. Implement a less ambitious AGFL reserve
That's what we did back in 2006 - we just carved out the smallest
amount of blocks we needed from the free space pool to ensure the
failing case worked.
> The draft in this thread tried to track all AGFL blocks from their
> allocation and continued to account for them during their lifetime in
> the bnobt and cntbt. Instead of trying to build a reserve for all free
> space management blocks that aren't rmapbt, reduce the scope of the AGFL
> reserve to be just the number of blocks needed to satisfy a single
> transaction with multiple allocations that occurs close to an AG's
> ENOSPC.
Yes, and that's what I suggest we extend xfs_alloc_set_aside() and
mp->m_ag_max_usable to encompass. That way we don't actually need to
track anything, the AGs will always have that space available to use
because mp->m_ag_max_usable prevents them from being used by
anything else.
> Do any of these approaches stand out as clearly better (or worse) to
> you?
I think I've convinced myself that we already use option 4 and the
accounting just needs extending to reserve the correct amount of
blocks...
> Thanks again for taking the time to read through these patches and
> provide feedback. I appreciate all the comments, ideas, and
> explanations.
Don't thank me - I've got to thank you for taking the time and
effort to understand this complex code well enough to ask exactly
the right question. It's not often that answering a question on how
to do something causes a whole bunch of not quite connected floating
pieces in my head to suddenly fall and land in exactly the right
places... :)
-Dave.
--
Dave Chinner
david@fromorbit.com
* Re: [RFC PATCH 1/4] xfs: resurrect the AGFL reservation
2024-07-09 2:20 ` Dave Chinner
@ 2024-07-23 6:51 ` Krister Johansen
2024-08-01 0:53 ` Dave Chinner
0 siblings, 1 reply; 14+ messages in thread
From: Krister Johansen @ 2024-07-23 6:51 UTC (permalink / raw)
To: Dave Chinner
Cc: Chandan Babu R, Darrick J. Wong, Dave Chinner, Gao Xiang,
linux-xfs
[ My turn to apologize. I had a couple of short weeks at the beginning
of July, but have finally been able to get back to this. ]
On Tue, Jul 09, 2024 at 12:20:16PM +1000, Dave Chinner wrote:
> On Mon, Jun 17, 2024 at 04:46:31PM -0700, Krister Johansen wrote:
> > On Fri, Jun 14, 2024 at 12:59:32PM +1000, Dave Chinner wrote:
> > > On Thu, Jun 13, 2024 at 01:27:26PM -0700, Krister Johansen wrote:
> > > > Transactions that perform multiple allocations may inadvertently run out
> > > > of space after the first allocation selects an AG that appears to have
> > > > enough available space. The problem occurs when the allocation in the
> > > > transaction splits freespace b-trees but the second allocation does not
> > > > have enough available space to refill the AGFL. This results in an
> > > > aborted transaction and a filesystem shutdown. In this author's case,
> > > > it's frequently encountered in the xfs_bmap_extents_to_btree path on a
> > > > write to an AG that's almost reached its limits.
> > > >
> > > > The AGFL reservation allows us to save some blocks to refill the AGFL to
> > > > its minimum level in an Nth allocation, and to prevent allocations from
> > > > proceeding when there's not enough reserved space to accommodate the
> > > > refill.
> > > >
> > > > This patch just brings back the reservation and does the plumbing. The
> > > > policy decisions about which allocations to allow will be in a
> > > > subsequent patch.
> > > >
> > > > This implementation includes space for the bnobt and cnobt in the
> > > > reserve. This was done largely because the AGFL reserve stubs appeared
> > > > to already be doing it this way.
> > > >
> > > > Signed-off-by: Krister Johansen <kjlx@templeofstupid.com>
> > > > ---
> > > > fs/xfs/libxfs/xfs_ag.h | 2 ++
> > > > fs/xfs/libxfs/xfs_ag_resv.c | 54 ++++++++++++++++++++++--------
> > > > fs/xfs/libxfs/xfs_ag_resv.h | 4 +++
> > > > fs/xfs/libxfs/xfs_alloc.c | 43 +++++++++++++++++++++++-
> > > > fs/xfs/libxfs/xfs_alloc.h | 3 +-
> > > > fs/xfs/libxfs/xfs_alloc_btree.c | 59 +++++++++++++++++++++++++++++++++
> > > > fs/xfs/libxfs/xfs_alloc_btree.h | 5 +++
> > > > fs/xfs/libxfs/xfs_rmap_btree.c | 5 +++
> > > > fs/xfs/scrub/fscounters.c | 1 +
> > > > 9 files changed, 161 insertions(+), 15 deletions(-)
> > > > diff --git a/fs/xfs/libxfs/xfs_ag.h b/fs/xfs/libxfs/xfs_ag.h
> > > > index 35de09a2516c..40bff82f2b7e 100644
> > > > --- a/fs/xfs/libxfs/xfs_ag.h
> > > > +++ b/fs/xfs/libxfs/xfs_ag.h
> > > > @@ -62,6 +62,8 @@ struct xfs_perag {
> > > > struct xfs_ag_resv pag_meta_resv;
> > > > /* Blocks reserved for the reverse mapping btree. */
> > > > struct xfs_ag_resv pag_rmapbt_resv;
> > > > + /* Blocks reserved for the AGFL. */
> > > > + struct xfs_ag_resv pag_agfl_resv;
> > >
<snip>
> Not quite. Btrees don't work that way - once you split the root
> block, it can't be split again until the root block fills. That
> means we have to split the level below the root block enough times
> to fill the root block. And that requires splitting the level below
> that enough times to fill and split the second level enough times to
> fill the root block. The number of inserts needed to split the root
> block increases exponentially with the height of the tree.
>
> Hence if we've got a height of 2 and 10 records/ptrs per leaf/node,
> a full tree looks like this:
>
> Level 1 0123456789
> ||.......|
> /---------------/|.......\---------------\
> | | |
> Level 0 0123456789 0123456789 ..... 0123456789
>
> If we do an insert anywhere we'll get a full height split and a new
> root at level 2. Lets insert record A in the leaf at the left edge:
>
> Level 2 01
> ||
> /-------/\------\
> | |
> Level 1 0A1234 56789
> |||... ....|
> /-------/|\------\ ....\------------\
> | | | |
> Level 0 01234 A56789 0123456789 ..... 0123456789
>
> It should be obvious now that a followup insertion anywhere in the
> tree cannot split either the root node at level 2, nor the old root
> node that was split in two at level 1.
>
> Yes, we can still split at level 0, but that will only result in
> nodes at level 1 getting updated, and only once they are full will
> they start splitting at level 1. And then level 2 (the root) won't
> split until level 1 has split enough to fill all the nodes in level
> 1.
>
> What this means is that for any single insertion into a btree, the
> worst case demand for new blocks is "current height + 1". In the
> case of BMBT, INOBT and REFCOUNBT, we're typically only doing one or
> two adjacent record insertions in a given operation, so only one
> split will occur during the operation. The free space trees are
> different, however.
>
> When the first allocation occurs, we split the tree as per above
> and that consumes (start height + 1) blocks from AGFL. We then
> get the next BMBT allocation request (because it's doing a
> multi-level split), and the best free space we find is, this time,
> in the leaf at the right side of the tree. Hence we split that
> block, but we don't need to split its parent (level 1) block
> because it has free space in it. Hence the second allocation has
> a worst case block requirement of (start height - 1).
>
> If we then do a third allocation (level 2 of the BMBT is splitting),
> and that lands in the middle of the tree, we split that block. That
> can be inserted into either of the level 1 blocks without causing a
> split there, either. Hence a worst case block requirement for that
> is also (start height - 1).
>
> It's only once we've split enough blocks at (start height - 1) to
> fill both blocks at start height that we'll get a larger split.
> We had ten blocks at level 0 to begin with, 11 after the first full
> height split, so we need, at minimum, another 9 splits at level 0 to
> fill level 1. This means we then have all the leaf blocks at either
> 5 or 6 records at the time the level 1 nodes fill. To then get
> another insert into level 1 to split that, we need to fill a minimum
> of -3 sibling- leaf blocks.
>
> The reason we need to fill 3 leaf blocks and not 1 is that we shift
> records between left and right siblings instead of splitting. i.e.
> we steal space from a sibling in preference to splitting and
> inserting a new record into the parent. Hence we only insert into
> the parent when the insert node and both siblings are
> completely full.
>
> When we look at a typical AG btree on a v5 filesystem, we have at
> least 110 records/ptrs per block (1kB block size) so the minimum
> number of record inserts needed before we need to split a parent
> again after it was just split is at least 150 inserts.
>
> Hence after a full height split, we are not going to split a node at
> the old root level again until at least 150 inserts (1kB block size)
> are done at the child level.
>
> IOWs, we're never going to get two start height splits in a single
> allocation chain. At most we are going to get one start height split
> and then N (start height - 1) splits. That's the worst case AGFL we
> need to consider, but such a split chain would be so infinitesimally
> probable that always reserving that amount of space in the AGFL is
> just a waste of space because it will never, ever be used.
>
> So the worst case AGFL requirement for a data extent allocation or
> free will be:
>
> 1. data extent - full free space tree split * 2 + rmapbt
> 2. BMBT leaf split - (free space height - 2) * 2 + rmapbt
> 3. BMBT node split - (free space height - 2) * 2 + rmapbt
> ....
> N. BMBT root split - (free space height - 2) * 2 + rmapbt
>
> So the maximum number of chained allocations is likely going to be
> something like:
>
> (max AGFL for one allocation) +
> ((max bmbt/refcountbt height) * (max agfl for dependent allocation))
>
> Which is likely to be in the order of...
>
> > > So I wouldn't expect the AGFL reserve pool to be any larger than a
> > > couple of hundred blocks in the worst case.
>
> ... this.
Thanks for the detailed explanation. It helps.
I do have a few followup questions on this part.
What strategy would you recommend we use to calculate the 'rmapbt'
portion of this equation? I've defaulted to the same strategy of a
single full-space rmapbt tree split, plus N (height - 2) splits, but
instead of using the free space height I used the maxlevel in
m_rmap_maxlevels. Does that sound correct to you?
For trees that are short, like a free-space reservation with a max
height of 2, this math actually yields a smaller number of blocks than
the constants it might replace below. In the case of a b-tree with a
max height of 2, do we assume then that we'd only need 2 blocks to
handle the full split, and the remaining dependent allocations would not
consume any more space as records are added to the tree? (It sounds like
yes, but wanted to confirm)
> [ And now we disappear down the rabbit hole. ]
>
> Take a look at xfs_alloc_ag_max_usable(), which is used to set
> mp->m_ag_max_usable:
>
> blocks = XFS_BB_TO_FSB(mp, XFS_FSS_TO_BB(mp, 4)); /* ag headers */
> blocks += XFS_ALLOCBT_AGFL_RESERVE;
> blocks += 3; /* AGF, AGI btree root blocks */
> ....
>
> There's a number of blocks reserved for the AGFL that aren't usable
> already. What is that value for?
>
> /*
> * The number of blocks per AG that we withhold from xfs_dec_fdblocks to
> * guarantee that we can refill the AGFL prior to allocating space in a nearly
> * full AG. Although the space described by the free space btrees, the
> * blocks used by the freesp btrees themselves, and the blocks owned by the
> * AGFL are counted in the ondisk fdblocks, it's a mistake to let the ondisk
> * free space in the AG drop so low that the free space btrees cannot refill an
> * empty AGFL up to the minimum level. Rather than grind through empty AGs
> * until the fs goes down, we subtract this many AG blocks from the incore
> * fdblocks to ensure user allocation does not overcommit the space the
> * filesystem needs for the AGFLs. The rmap btree uses a per-AG reservation to
> * withhold space from xfs_dec_fdblocks, so we do not account for that here.
> */
> #define XFS_ALLOCBT_AGFL_RESERVE 4
>
> What is this magic number of 4, then?
>
> Remember what I said above about the free space btree blocks being
> accounted as free space and that freeing extents are always allowed
> to go ahead, even when the AG is full?
>
> This AGFL reserve is the minimum space needed for an extent free to
> successfully refill the AGFL at ENOSPC to allow the extent free to
> proceed. That is, both trees have a single level, so a full height
> split of either tree will require 2 blocks to be allocated. 2
> freespace trees per AG, two blocks per tree, and so we need 4 blocks
> per AG to ensure freeing extents at ENOSPC will always succeed.
>
> This "magic 4" blocks isn't valid when we have rmapbt, recountbt, or
> finobt reservations in the AG anymore, because ENOSPC can occur when
> we still have multi-level bno/cnt btrees in the AG. That's the first
> thing we need to fix.
>
> This AGFL set aside is taken out of the global free space pool via
> this function:
>
> /*
> * Compute the number of blocks that we set aside to guarantee the ability to
> * refill the AGFL and handle a full bmap btree split.
> *
> * In order to avoid ENOSPC-related deadlock caused by out-of-order locking of
> * AGF buffer (PV 947395), we place constraints on the relationship among
> * actual allocations for data blocks, freelist blocks, and potential file data
> * bmap btree blocks. However, these restrictions may result in no actual space
> * allocated for a delayed extent, for example, a data block in a certain AG is
> * allocated but there is no additional block for the additional bmap btree
> * block due to a split of the bmap btree of the file. The result of this may
> * lead to an infinite loop when the file gets flushed to disk and all delayed
> * extents need to be actually allocated. To get around this, we explicitly set
> * aside a few blocks which will not be reserved in delayed allocation.
> *
> * For each AG, we need to reserve enough blocks to replenish a totally empty
> * AGFL and 4 more to handle a potential split of the file's bmap btree.
> */
> unsigned int
> xfs_alloc_set_aside(
> struct xfs_mount *mp)
> {
> return mp->m_sb.sb_agcount * (XFS_ALLOCBT_AGFL_RESERVE + 4);
> }
>
> But where did that magic number of "4 more to handle a potential
> split of the file's bmap btree" come from?
>
> Ah, a fix I made for a ENOSPC regression back in 2006:
>
> + * ..... Considering the minimum number of
> + * needed freelist blocks is 4 fsbs _per AG_, a potential split of file's bmap
> + * btree requires 1 fsb, so we set the number of set-aside blocks
> + * to 4 + 4*agcount.
> + */
> +#define XFS_ALLOC_SET_ASIDE(mp) (4 + ((mp)->m_sb.sb_agcount * 4))
>
> That was fixing a regression caused by an ENOSPC deadlock bug fix
> done for SGI PV 947395.
>
> commit 96b336c71016a85cd193762199aa6f99f543b6fe
> Author: Yingping Lu <yingping@sgi.com>
> Date: Tue May 23 19:28:08 2006 +0000
>
> In actual allocation of file system blocks and freeing extents, the transaction
> within each such operation may involve multiple locking of AGF buffer. While the
> freeing extent function has sorted the extents based on AGF number before entering
> into transaction, however, when the file system space is very limited, the allocation
> of space would try every AGF to get space allocated, this could potentially cause
> out-of-order locking, thus deadlock could happen. This fix mitigates the scarce space
> for allocation by setting aside a few blocks without reservation, and avoid deadlock
> by maintaining ascending order of AGF locking.
> Add a new flag XFS_ALLOC_FLAG_FREEING to indicate whether the caller is freeing or
> allocating. Add a field firstblock in structure xfs_alloc_arg to indicate whether
> this is the first allocation request.
>
> Which included this:
>
> + * ...... Considering the minimum number of
> + * needed freelist blocks is 4 fsbs, a potential split of file's bmap
> + * btree requires 1 fsb, so we set the number of set-aside blocks to 8.
> +*/
> +#define SET_ASIDE_BLOCKS 8
>
> <ding!>
>
> Ah, now I get it. It's only taken me 18 years to really understand
> how this value was originally derived and why it is wrong.
>
> It assumed 4 blocks for the AGFL for the initial allocation to fill
> an empty AGFL, and then 4 more blocks to allow the AGFL to be
> refilled for the subsequent BMBT block allocation.
> IOWs, it wasn't reserving blocks for the BMBT itself, it was
> reserving blocks for a second AGFL refill in the allocation chain....
>
> Right, so I think what we need to change is XFS_ALLOCBT_AGFL_RESERVE
> and xfs_alloc_set_aside() to reflect the worst case size of the free
> space trees when we hit ENOSPC for user data, not a hard coded 4 + 4.
>
> The worst case free space tree size is going to be single block free
> records when all the per-ag reservations are at their maximum
> values. So we need to calculate maximum reservation sizes for the
> AG, then calculate how many btree blocks that will require to index
> as single block free space extents, then calculate the height of the
> btrees needed to index that. That gives us our "free space tree
> height at ENOSPC", and then we can calculate how many AGFL blocks we
> need at ENOSPC to handle a split of both trees. That value replaces
> XFS_ALLOCBT_AGFL_RESERVE.
Just to confirm, this would be
xfs_btree_compute_maxlevels(mp->m_alloc_mnr, n_perag_resvd_blocks),
correct?
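For readers following along, the height calculation being referred to works roughly like this simplified sketch. The real helper is xfs_btree_compute_maxlevels() in fs/xfs/libxfs/xfs_btree.c; the function name and exact shape below are illustrative.

```c
/*
 * Simplified sketch of the btree height calculation: limits[0] is
 * the minimum records per leaf block and limits[1] the minimum
 * keys/ptrs per node block, as in mp->m_alloc_mnr[].
 */
static unsigned int
compute_maxlevels_sketch(
	const unsigned int	*limits,
	unsigned long long	records)
{
	/* leaf blocks needed to index every record */
	unsigned long long	level_blocks =
		(records + limits[0] - 1) / limits[0];
	unsigned int		height = 1;

	/* add a level until a single root block covers everything */
	while (level_blocks > 1) {
		level_blocks = (level_blocks + limits[1] - 1) / limits[1];
		height++;
	}
	return height;
}
```

Feeding it the per-AG reservation expressed as worst-case single-block free extents yields the "free space tree height at ENOSPC" discussed above.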
> As for the BMBT split requiring more AGFL refills, we need to know
> what the maximum BMBT height the fs supports is (I think we already
> calculate that), and that then becomes the level we feed into the
> "how many AGFL blocks do we need in a full height BMBT split
> allocation chain". I did that math above, and that calculation
> should replace the "+ 4" in xfs_alloc_set_aside(). I think that
> ends up with the calculation being something like:
>
> setaside = agcount * (number of dependent allocations *
> AGFL blocks for refill at ENOSPC)
>
> I think we also need to make mp->m_ag_max_usable take into
> account the BMBT split related AGFL refills, as xfs_alloc_set_aside()
> only considers XFS_ALLOCBT_AGFL_RESERVE right now. That will
> ensure we always have those blocks available for the AGFL.
Thanks, this was convincing enough that I went off to prototype
something that tried to faithfully follow your advice.
> Fair summary. This is already long, and I think what I discovered
> above kinda comes under:
>
> > 4. Implement a less ambitious AGFL reserve
>
> That's what we did back in 2006 - we just carved out the smallest
> amount of blocks we needed from the free space pool to ensure the
> failing case worked.
>
> > The draft in this thread tried to track all AGFL blocks from their
> > allocation and continued to account for them during their lifetime in
> the bnobt and cntbt. Instead of trying to build a reserve for all free
> space management blocks that aren't rmapbt, reduce the scope of the AGFL
> reserve to be just the number of blocks needed to satisfy a single
> > transaction with multiple allocations that occurs close to an AG's
> > ENOSPC.
>
> Yes, and that's what I suggest we extend xfs_alloc_set_aside() and
> mp->m_ag_max_usable to encompass. That way we don't actually need to
> track anything, the AGs will always have that space available to use
> because mp->m_ag_max_usable prevents them from being used by
> anything else.
>
> > Do any of these approaches stand out as clearly better (or worse) to
> > you?
>
> I think I've convinced myself that we already use option 4 and the
> accounting just needs extending to reserve the correct amount of
> blocks...
I've got something that sets the mp->m_ag_max_usable and
alloc_set_aside() to the values discussed above. For demonstration
purposes, since I was ending up with numbers that were smaller than
XFS_ALLOCBT_AGFL_RESERVE + 4, I just picked some slightly different math
that inflated these numbers a bit to validate the assumptions.
I'm still able to run an AG low enough on space to hit the bug with these
modifications in place, partly because mp->m_ag_max_usable seems to place
an upper bound on large contiguous allocations in
xfs_alloc_space_available(). E.g. if I allocate in small chunks <
m_ag_max_usable, I can still run pagf_freeblks down to about the per-ag
reservation. The test is explicit about only using space from a single
AG, which defeats the global reservation checks, since enough space
remains free in the other AGs that we don't hit any of the
per-filesystem reservation limits.
It seems like we've got the right general concept, but may need to make
a few more tweaks to the implementation.
In terms of where to go next, I have a few ideas, but would certainly
appreciate additional input.
One was to investigate propagating the per-AG set-aside into
xfs_alloc_space_available() so that it's only eligible for consumption
by a dependent allocation. (Is the correct way to determine that by
checking tp->t_highest_agno != NULLAGNUMBER?)
The other was to determine if I needed to add the quantity from
xfs_alloc_min_freelist(free space height) to the set-aside. The reason
for this was that if the reserved free space can be indexed by a tree of
height 2, then the full split only needs 2 blocks, but once we've split
to a tree of that height, xfs_alloc_min_freelist() requires 12 blocks
free (with rmap off) to continue the allocation chain, even if we're
never going to allocate that many. It's that edge case that these tests
always manage to hit: run the pagf_freeblks within a block or two of the
per-ag reservation, and then split a free-space b-tree right before a
BMBT split. IOW, I need both the blocks for the b-tree expansion and
enough to satisfy the xfs_alloc_min_freelist() checks. If you explained
this earlier, and I'm reading the equations too pedantically, my
apologies.
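The condition being described, that a chain near ENOSPC needs blocks for the b-tree expansion itself plus whatever the post-split xfs_alloc_min_freelist() check will demand, can be summarized with this toy check. All names are illustrative and the inputs are assumed to be computed elsewhere; this is not a proposal for actual kernel code.

```c
/*
 * Toy space check, illustrative only: an allocation chain near
 * ENOSPC must leave enough free blocks for the pending btree
 * expansion *and* the minimum freelist requirement that applies
 * after the split has grown the free space trees.
 */
static int
chain_space_available_sketch(
	unsigned long long	freeblks,	/* e.g. pagf_freeblks */
	unsigned long long	perag_resv,	/* reserved, untouchable */
	unsigned int		split_blocks,	/* btree expansion cost */
	unsigned int		min_freelist_after_split)
{
	return freeblks >= perag_resv + split_blocks +
			   min_freelist_after_split;
}
```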
I also ran into a bit of a snag in the testing department after merging
with 673cd885bbbf ("xfs: honor init_xattrs in xfs_init_new_inode for !ATTR fs")
I understand broadly why this is needed, but it does modify when an
inode converts from extents to b-tree. I couldn't find a way to disable
this behavior via mkfs or a mount option for testing purposes. Are you
aware of a switch I may have missed?
> > Thanks again for taking the time to read through these patches and
> > provide feedback. I appreciate all the comments, ideas, and
> > explanations.
>
> Don't thank me - I've got to thank you for taking the time and
> effort to understand this complex code well enough to ask exactly
> the right question. It's not often that answering a question on how
> to do something causes a whole bunch of not quite connected floating
> pieces in my head to suddenly fall and land in exactly the right
> places... :)
I do very much appreciate the detailed explanations and the feedback, so
it seemed like an expression of gratitude was appropriate. Thanks for
your willingness to answer these questions in such detail.
-K
* Re: [RFC PATCH 1/4] xfs: resurrect the AGFL reservation
2024-07-23 6:51 ` Krister Johansen
@ 2024-08-01 0:53 ` Dave Chinner
2024-09-16 23:01 ` Krister Johansen
0 siblings, 1 reply; 14+ messages in thread
From: Dave Chinner @ 2024-08-01 0:53 UTC (permalink / raw)
To: Krister Johansen
Cc: Chandan Babu R, Darrick J. Wong, Dave Chinner, Gao Xiang,
linux-xfs
On Mon, Jul 22, 2024 at 11:51:25PM -0700, Krister Johansen wrote:
> [ My turn to apologize. I had a couple of short weeks at the beginning
> of July, but have finally been able to get back to this. ]
>
> On Tue, Jul 09, 2024 at 12:20:16PM +1000, Dave Chinner wrote:
> > So the worst case AGFL requirement for a data extent allocation or
> > free will be:
> >
> > 1. data extent - full free space tree split * 2 + rmapbt
> > 2. BMBT leaf split - (free space height - 2) * 2 + rmapbt
> > 3. BMBT node split - (free space height - 2) * 2 + rmapbt
> > ....
> > N. BMBT root split - (free space height - 2) * 2 + rmapbt
> >
> > So the maximum number of chained allocations is likely going to be
> > something like:
> >
> > (max AGFL for one allocation) +
> > ((max bmbt/refcountbt height) * (max agfl for dependent allocation))
> >
> > Which is likely to be in the order of...
> >
> > > > So I wouldn't expect the AGFL reserve pool to be any larger than a
> > > > couple of hundred blocks in the worst case.
> >
> > ... this.
>
> Thanks for the detailed explanation. It helps.
>
> I do have a few followup questions on this part.
>
> What strategy would you recommend we use to calculate the 'rmapbt'
> portion of this equation? I've defaulted to the same strategy of a
> single full-space rmapbt tree split, plus N (height - 2) splits, but
> instead of using the free space height I used the maxlevel in
> m_rmap_maxlevels. Does that sound correct to you?
The rmapbt already has space reservations set aside for it,
and because the rmapbt updates are run in a separate transaction
reservation context (i.e. they are done via deferred ops using
intent chaining), I don't think they need to be considered a
dependent allocation.
i.e. the issue we have right now is multiple allocations within a
single transaction context failing part way through due to not
having enough space to refill the AGFL in the middle of the
transaction. This is typically happening because the rmapbt
reservation is causing the AG to appear empty when it actually still
has lots of space free in it. When the rmapbt update occurs, all
its allocations and AGFL refills come from that reserved space that
the original allocation was not allowed to dip into.
Hence I don't think that the rmapbt space requirements actually
impact on the data+BMBT dependent allocation chain AGFL space
requirements.
> For trees that are short, like a free-space reservation with a max
> height of 2, this math actually yields a smaller number of blocks than
> the constants it might replace below. In the case of a b-tree with a
> max height of 2, do we assume then that we'd only need 2 blocks to
> handle the full split, and the remaining dependent allocations would not
> consume any more space as records are added to the tree? (It sounds like
> yes, but wanted to confirm)
If the max height of the btree is two, then it can only contain a
root block and leaf blocks. Hence once we've gone from a single
block to a root + 2 half full leaf blocks, the rest of the updates
are going to fit in either of the two leaf blocks. So, yes, I think
the math works out in that case, too.
> > <ding!>
> >
> > Ah, now I get it. It's only taken me 18 years to really understand
> > how this value was originally derived and why it is wrong.
> >
> > It assumed 4 blocks for the AGFL for the initial allocation to fill
> > an empty AGFL, and then 4 more blocks to allow the AGFL to be
> > refilled for the subsequent BMBT block allocation.
> > IOWs, it wasn't reserving blocks for the BMBT itself, it was
> > reserving blocks for a second AGFL refill in the allocation chain....
> >
> > Right, so I think what we need to change is XFS_ALLOCBT_AGFL_RESERVE
> > and xfs_alloc_set_aside() to reflect the worst case size of the free
> > space trees when we hit ENOSPC for user data, not a hard coded 4 + 4.
> >
> > The worst case free space tree size is going to be single block free
> > records when all the per-ag reservations are at their maximum
> > values. So we need to calculate maximum reservation sizes for the
> > AG, then calculate how many btree blocks that will require to index
> > as single block free space extents, then calculate the height of the
> > btrees needed to index that. That gives us our "free space tree
> > height at ENOSPC", and then we can calculate how many AGFL blocks we
> > need at ENOSPC to handle a split of both trees. That value replaces
> > XFS_ALLOCBT_AGFL_RESERVE.
>
> Just to confirm, this would be
> xfs_btree_compute_maxlevels(mp->m_alloc_mnr, n_perag_resvd_blocks),
> correct?
That sounds correct.
> > Fair summary. This is already long, and I think what I discovered
> > above kinda comes under:
> >
> > > 4. Implement a less ambitious AGFL reserve
> >
> > That's what we did back in 2006 - we just carved out the smallest
> > amount of blocks we needed from the free space pool to ensure the
> > failing case worked.
> >
> > > The draft in this thread tried to track all AGFL blocks from their
> > > allocation and continued to account for them during their lifetime in
> > the bnobt and cntbt. Instead of trying to build a reserve for all free
> > space management blocks that aren't rmapbt, reduce the scope of the AGFL
> > reserve to be just the number of blocks needed to satisfy a single
> > > transaction with multiple allocations that occurs close to an AG's
> > > ENOSPC.
> >
> > Yes, and that's what I suggest we extend xfs_alloc_set_aside() and
> > mp->m_ag_max_usable to encompass. That way we don't actually need to
> > track anything, the AGs will always have that space available to use
> > because mp->m_ag_max_usable prevents them from being used by
> > anything else.
> >
> > > Do any of these approaches stand out as clearly better (or worse) to
> > > you?
> >
> > I think I've convinced myself that we already use option 4 and the
> > accounting just needs extending to reserve the correct amount of
> > blocks...
>
> I've got something that sets the mp->m_ag_max_usable and
> alloc_set_aside() to the values discussed above. For demonstration
> purposes, since I was ending up with numbers that were smaller than
> XFS_ALLOCBT_AGFL_RESERVE + 4, I just picked some slightly different math
> that inflated these numbers a bit to validate the assumptions.
>
> I'm still able to run an AG low enough on space to hit the bug with these
> modifications in place, partly because mp->m_ag_max_usable seems to place
> an upper bound on large contiguous allocations in
> xfs_alloc_space_available(). E.g. if I allocate in small chunks <
> m_ag_max_usable, I can still run pagf_freeblks down to about the per-ag
> reservation. The test is explicit about only using space from a single
> AG, which defeats the global reservation checks, since enough space
> remains free in the other AGs that we don't hit any of the
> per-filesystem reservation limits.
>
> It seems like we've got the right general concept, but may need to make
> a few more tweaks to the implementation.
>
> In terms of where to go next, I have a few ideas, but would certainly
> appreciate additional input.
>
> One was to investigate propagating the per-AG set-aside into
> xfs_alloc_space_available() so that it's only eligible for consumption
> by a dependent allocation. (Is the correct way to determine that by
> checking tp->t_highest_agno != NULLAGNUMBER?)
Yes, I suspect that we aren't taking into account the correct
set-aside value when checking if the AG has enough space available
for the allocation to proceed. i.e. xfs_alloc_space_available()
isn't getting the reserved space correct - it may be that
xfs_ag_resv_needed() should be taking into account the set aside
amount for XFS_AG_RESV_NONE allocations.
> The other was to determine if I needed to add the quantity from
> xfs_alloc_min_freelist(free space height) to the set-aside. The reason
> for this was that if the reserved free space can be indexed by a tree of
> height 2, then the full split only needs 2 blocks, but once we've split
> to a tree of that height, xfs_alloc_min_freelist() requires 12 blocks
> free (with rmap off) to continue the allocation chain, even if we're
> never going to allocate that many.
Yes, that is definitely -one- of the problems here.
> It's that edge case that these tests
> always manage to hit: run the pagf_freeblks within a block or two of the
> per-ag reservation, and then split a free-space b-tree right before a
> BMBT split. IOW, I need both the blocks for the b-tree expansion and
> enough to satisfy the xfs_alloc_min_freelist() checks. If you explained
> this earlier, and I'm reading the equations too pedantically, my
> apologies.
No, I think you're getting deeper into the underlying issues that
we haven't really solved properly as this code has grown organically
to solve corner case issue after corner case issue.
Looking at it, and considering the concept of dependent allocation
chains and reserve pools for different types of allocations I've
previously explained, I think the calculation in
xfs_alloc_min_freelist() is wrong.
That is, xfs_alloc_min_freelist() looks to assume that all AGFL
blocks for a free space modification need to be reserved up front
and placed in the AGFL before we start. Historically speaking, this
was how allocation used to work. Then we introduced intent based
operations for freeing extents in 1998 so the BMBT modifications
were in a separate transaction to the extent freeing. Each now had
their own AGFL refill requirements in completely decoupled
transactions.
Then we added reflink and rmapbt as additional operations that are
decoupled from the BMBT allocations and transactions by intents.
They all run in their own allocation contexts and require their own
independent AGFL fill operations.
However, xfs_alloc_min_freelist() makes the assumption that the AGFL
must be large enough to handle both free space btree and rmapbt
splits for every allocation, even though the space for an rmapbt
split will only ever be needed when running a XFS_AG_RESV_RMAPBT
allocation. IOWs, it is over-calculating the required AGFL size for
most allocations.
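A context-aware version of the calculation might look like the sketch below. This is an illustrative model, not kernel code: the "level + 1, capped at maxlevels" terms mirror the shape of xfs_alloc_min_freelist(), while the resv-type parameter is a hypothetical way of passing allocation context so that the rmapbt term is only charged to XFS_AG_RESV_RMAPBT allocations.

```c
#include <stdbool.h>

/* Hypothetical sketch of a context-aware min-freelist calculation.
 * The enum and parameter names are invented for illustration. */
enum resv_type { RESV_NONE, RESV_METADATA, RESV_RMAPBT };

/* One AGFL block per btree level for a full split, capped at maxlevels. */
static unsigned int capped(unsigned int level, unsigned int maxlevels)
{
	return level + 1 < maxlevels ? level + 1 : maxlevels;
}

static unsigned int min_freelist(unsigned int bno_level, unsigned int cnt_level,
				 unsigned int rmap_level, bool has_rmapbt,
				 unsigned int alloc_maxlevels,
				 unsigned int rmap_maxlevels,
				 enum resv_type type)
{
	/* Full split of each free space btree. */
	unsigned int min_free = capped(bno_level, alloc_maxlevels) +
				capped(cnt_level, alloc_maxlevels);

	/* Only charge for an rmapbt split when this allocation context
	 * can actually modify the rmapbt. */
	if (has_rmapbt && type == RESV_RMAPBT)
		min_free += capped(rmap_level, rmap_maxlevels);
	return min_free;
}
```

The point of the sketch is the last branch: most allocations stop paying for an rmapbt split they can never trigger.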
Further, it is calculating this number for every allocation in a
transaction, without taking into account the "dependent allocation
chain" context. This is more difficult to do, because the required
AGFL size only goes down when a full split is done in the chain.
I think we actually need to track the number of blocks we may need
and have consumed for the AGFL within a transaction. We initialise
that in the first transaction and ensure that we have that amount of
space remaining in the AG, and then we always check that the
remaining reservation is greater than the minimum free list size
required for this allocation (it should always be).
Whenever we move blocks to/from the AGFL, we account for that in the
transaction. Hence we have a direct accounting of the AGFL
consumption across dependent allocations within that AG. When we
complete one of the dependent allocations in a chain, we can reduce
the reservation held by the transaction by the number of AGFL blocks
reserved but not consumed by that allocation, thereby allowing us to
use more of the remaining free space the further through the allocation
chain we get....
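A rough sketch of that transaction-scoped bookkeeping, with all names invented (nothing here is kernel code), might be:

```c
#include <stdbool.h>

/* Hypothetical transaction-scoped AGFL accounting, as described above.
 * Struct and function names are invented for illustration. */
struct tx_agfl_resv {
	unsigned int reserved; /* blocks set aside for AGFL refills */
	unsigned int consumed; /* blocks actually moved into the AGFL */
};

/* First allocation in the chain: reserve the worst case up front,
 * failing (ENOSPC) if the AG can't cover the whole dependent chain. */
static bool tx_agfl_init(struct tx_agfl_resv *r, unsigned int worst_case,
			 unsigned int ag_freeblks)
{
	if (ag_freeblks < worst_case)
		return false;
	r->reserved = worst_case;
	r->consumed = 0;
	return true;
}

static unsigned int tx_agfl_remaining(const struct tx_agfl_resv *r)
{
	return r->reserved - r->consumed;
}

/* Checked before each allocation in the chain; with an up-front
 * worst-case reservation this should always pass. */
static bool tx_agfl_check(const struct tx_agfl_resv *r, unsigned int min_free)
{
	return tx_agfl_remaining(r) >= min_free;
}

/* Account blocks moved into the AGFL against the reservation. */
static void tx_agfl_consume(struct tx_agfl_resv *r, unsigned int nblocks)
{
	r->consumed += nblocks;
}

/* One dependent allocation finished: release what it reserved but
 * did not consume, freeing that space for the rest of the chain. */
static void tx_agfl_done(struct tx_agfl_resv *r, unsigned int unused)
{
	r->reserved -= unused;
}
```

The sketch captures the two properties described above: the whole chain is reserved once at the start, and the reservation shrinks as each dependent allocation completes.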
IOWs, we need to track the total AGFL reservation/usage within the
AG across the transaction, not just individual allocations. We need
xfs_alloc_min_freelist() to be aware of allocation context and not
reserve RMAPBT space for allocations that aren't modifying the
rmapbt. We need xfs_alloc_space_available() to initialise and use
transaction context based AGFL block reservations for considering
whether there is space available or not. We need to account for AGFL
blocks consumed during an AGFL grow operation.
Lifting the AGFL reservation tracking to the transaction structure
for dependent allocation chains should alleviate these "but we don't
know what happened on the previous allocation" and "what might be
needed for the next allocation" issues. If we do this correctly,
then we should never need to switch AGs during a dependent
allocation chain because the first allocation in an AG will be rejected
with ENOSPC because there isn't enough space for the dependent chain
to complete....
> I also ran into a bit of a snag in the testing department after merging
> with 673cd885bbbf ("xfs: honor init_xattrs in xfs_init_new_inode for !ATTR fs")
> I understand broadly why this is needed, but it does modify when an
> inode converts from extents to b-tree. I couldn't find a way to disable
> this behavior via mkfs or a mount option for testing purposes. Are you
> aware of a switch I may have missed?
Just revert it for testing purposes.
-Dave.
--
Dave Chinner
david@fromorbit.com
* Re: [RFC PATCH 1/4] xfs: resurrect the AGFL reservation
2024-08-01 0:53 ` Dave Chinner
@ 2024-09-16 23:01 ` Krister Johansen
0 siblings, 0 replies; 14+ messages in thread
From: Krister Johansen @ 2024-09-16 23:01 UTC (permalink / raw)
To: Dave Chinner
Cc: Chandan Babu R, Darrick J. Wong, Dave Chinner, Gao Xiang,
linux-xfs
On Thu, Aug 01, 2024 at 10:53:04AM +1000, Dave Chinner wrote:
> On Mon, Jul 22, 2024 at 11:51:25PM -0700, Krister Johansen wrote:
> > I've got something that sets the mp->m_ag_max_usable and
> > alloc_set_aside() to the values discussed above. For demonstration
> > purposes, since I was ending up with numbers that were smaller than
> > XFS_ALLOCBT_AGFL_RESERVE + 4, I just picked some slightly different math
> > that inflated these numbers a bit to validate the assumptions.
> >
> > I'm still able to run AG low enough on space to hit the bug with these
> > modifications in place, partly because mp->m_ag_max_usable seems to place
> > an upper bound on large contiguous allocations in
> > xfs_alloc_space_available(). E.g. if I allocate in small chunks <
> > m_ag_max_usable, I can still run pagf_freeblks down to about the per-ag
> > reservation. The test is explicit about only using space from a single
> > AG, which defeats the global reservation checks, since enough space
> > remains free in the other AGs that we don't hit any of the
> > per-filesystem reservation limits.
> >
> > It seems like we've got the right general concept, but may need to make
> > a few more tweaks to the implementation.
> >
> > In terms of where to go next, I have a few ideas, but would certainly
> > appreciate additional input.
> >
> > One was to investigate propagating the per-AG set-aside into
> > xfs_alloc_space_available() so that it's only eligible for consumption
> > by a dependent allocation. (Is the correct way to determine that by
> > checking tp->t_highest_agno != NULLAGNUMBER?)
>
> Yes, I suspect that we aren't taking into account the correct
> set-aside value when checking if the AG has enough space available
> for the allocation to proceed. i.e. xfs_alloc_space_available()
> isn't getting the reserved space correct - it may be that
> xfs_ag_resv_needed() should be taking into account the set aside
> amount for XFS_AG_RESV_NONE allocations.
>
> > The other was to determine if I needed to add the quantity from
> > xfs_alloc_min_freelist(free space height) to the set-aside. The reason
> > for this was that if the reserved free space can be indexed by a tree of
> > height 2, then the full split only needs 2 blocks, but once we've split
> > to a tree of that height, xfs_alloc_min_freelist() requires 12 blocks
> > free (with rmap off) to continue the allocation chain, even if we're
> > never going to allocate that many.
>
> Yes, that is definitely -one- of the problems here.
>
> > It's that edge case that these tests
> > always manage to hit: run the pagf_freeblks within a block or two of the
> > per-ag reservation, and then split a free-space b-tree right before a
> > BMBT split. IOW, I need both the blocks for the b-tree expansion and
> > enough to satisfy the xfs_alloc_min_freelist() checks. If you explained
> > this earlier, and I'm reading the equations too pedantically, my
> > apologies.
>
> No, I think you're getting deeper into the underlying issues that
> we haven't really solved properly as this code has grown organically
> to solve corner case issue after corner case issue.
>
> Looking at it, and considering the concept of dependent allocation
> chains and reserve pools for different types of allocations I've
> previously explained, I think the calculation in
> xfs_alloc_min_freelist() is wrong.
>
> That is, xfs_alloc_min_freelist() looks to assume that all AGFL
> blocks for a free space modification need to be reserved up front
> and placed in the AGFL before we start. Historically speaking, this
> was how allocation used to work. Then we introduced intent based
> operations for freeing extents in 1998 so the BMBT modifications
> were in a separate transaction to the extent freeing. Each now had
> their own AGFL refill requirements in completely decoupled
> transactions.
>
> Then we added reflink and rmapbt as additional operations that are
> decoupled from the BMBT allocations and transactions by intents.
> They all run in their own allocation contexts and require their own
> independent AGFL fill operations.
>
> However, xfs_alloc_min_freelist() makes the assumption that the AGFL
> must be large enough to handle both free space btree and rmapbt
> splits for every allocation, even though the space for an rmapbt
> split will only ever be needed when running a XFS_AG_RESV_RMAPBT
> allocation. IOWs, it is over-calculating the required AGFL size for
> most allocations.
>
> Further, it is calculating this number for every allocation in a
> transaction, without taking into account the "dependent allocation
> chain" context. This is more difficult to do, because the required
> AGFL size only goes down when a full split is done in the chain.
>
> I think we actually need to track the number of blocks we may need
> and have consumed for the AGFL within a transaction. We initialise
> that in the first transaction and ensure that we have that amount of
> space remaining in the AG, and then we always check that the
> remaining reservation is greater than the minimum free list size
> required for this allocation (it should always be).
>
> Whenever we move blocks to/from the AGFL, we account for that in the
> transaction. Hence we have a direct accounting of the AGFL
> consumption across dependent allocations within that AG. When we
> complete one of the dependent allocations in a chain, we can reduce
> the reservation held by the transaction by the number of AGFL blocks
> reserved but not consumed by that allocation, thereby allowing us to
> use more of the remaining free space the further through the allocation
> chain we get....
>
> IOWs, we need to track the total AGFL reservation/usage within the
> AG across the transaction, not just individual allocations. We need
> xfs_alloc_min_freelist() to be aware of allocation context and not
> reserve RMAPBT space for allocations that aren't modifying the
> rmapbt. We need xfs_alloc_space_available() to initialise and use
> transaction context based AGFL block reservations for considering
> whether there is space available or not. We need to account for AGFL
> blocks consumed during an AGFL grow operation.
>
> Lifting the AGFL reservation tracking to the transaction structure
> for dependent allocation chains should alleviate these "but we don't
> know what happened on the previous allocation" and "what might be
> needed for the next allocation" issues. If we do this correctly,
> then we should never need to switch AGs during a dependent
> allocation chain because the first allocation in an AG will be rejected
> with ENOSPC because there isn't enough space for the dependent chain
> to complete....
I like the idea of moving the AGFL reservation tracking into the
transaction, though I'm not certain I understand all the particulars
yet.
In your mental model for how this might work, is the idea that the space
for the AGFL reservation will be added to the transaction once an AG is
selected for the allocations? Even with this approach, it will still be
necessary to make some worst-case estimates about how much space might
be used in a transaction chain, correct? E.g. the worst-case AGFL
equations from earlier in this thread still apply, it's just that the
space would be reserved _once_ at the beginning of the first
transaction.
For the rmapbt allocations, it doesn't look like any code calls into the
allocator with XFS_AG_RESV_RMAPBT set. The code that allocates rmapbt
blocks pulls a block from the AGFL and then deducts that from its
per-AG reservation. I tried walking myself through a few examples, but
it looks like it's possible for a dependent allocation in a transaction
to perform an rmap allocation even if the first allocation in the chain
does not allocate an rmap record.
To give a specific example: allocating bmbt blocks associates an rmap
entry with the inode for which the bmbt block is being allocated, but
the first allocation may well set XFS_RMAP_OINFO_SKIP_UPDATE, like
xfs_bmap_btalloc does.
It might be possible to use the oinfo in the xfs_alloc_arg structure to
determine if a particular allocation can allocate a rmap entry.
However, it also looks like allocating blocks to refill the AGFL can
itself create rmap entries. Is this a situation where we'd need to
audit the different operations and add specific attributes to their
existing xfs_trans_res to account for whether or not they'd allocate
rmap entries? Or is there a better way to do this that I'm overlooking?
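The oinfo idea above could be sketched like this. The flag, struct, and field names are invented for illustration; the real xfs_owner_info and XFS_RMAP_OINFO_SKIP_UPDATE details may differ.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical sketch: derive "will this allocation insert an rmap
 * record?" from the owner info attached to the allocation argument.
 * All names here are invented, not the kernel's. */
#define OI_SKIP_UPDATE (1u << 0)

struct owner_info {
	uint64_t     oi_owner;
	unsigned int oi_flags;
};

/* True when the allocation should be charged for a potential rmapbt
 * split; false when the rmap update is skipped entirely. */
static bool alloc_updates_rmap(bool has_rmapbt, const struct owner_info *oi)
{
	if (!has_rmapbt)
		return false;
	return !(oi->oi_flags & OI_SKIP_UPDATE);
}
```

A predicate of this shape would let the min-freelist calculation drop the rmapbt term for allocations that declare they skip the rmap update.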
Thanks,
-K
end of thread, other threads:[~2024-09-16 23:20 UTC | newest]
Thread overview: 14+ messages
2024-06-13 20:27 [RFC PATCH 0/4] bringing back the AGFL reserve Krister Johansen
2024-06-13 20:27 ` [RFC PATCH 1/4] xfs: resurrect the AGFL reservation Krister Johansen
2024-06-14 2:59 ` Dave Chinner
2024-06-17 23:46 ` Krister Johansen
2024-07-09 2:20 ` Dave Chinner
2024-07-23 6:51 ` Krister Johansen
2024-08-01 0:53 ` Dave Chinner
2024-09-16 23:01 ` Krister Johansen
2024-06-13 20:27 ` [RFC PATCH 2/4] xfs: modify xfs_alloc_min_freelist to take an increment Krister Johansen
2024-06-13 20:27 ` [RFC PATCH 3/4] xfs: let allocations tap the AGFL reserve Krister Johansen
2024-06-13 20:27 ` [RFC PATCH 4/4] xfs: refuse allocations without agfl refill space Krister Johansen
2024-06-14 0:45 ` [RFC PATCH 0/4] bringing back the AGFL reserve Dave Chinner
2024-06-17 22:25 ` Krister Johansen
2024-06-17 23:54 ` Dave Chinner