* fix recovery of extfree items just after a growfs
@ 2024-09-10 4:28 Christoph Hellwig
2024-09-10 4:28 ` [PATCH 1/4] xfs: pass the exact range to initialize to xfs_initialize_perag Christoph Hellwig
` (3 more replies)
0 siblings, 4 replies; 15+ messages in thread
From: Christoph Hellwig @ 2024-09-10 4:28 UTC (permalink / raw)
To: Chandan Babu R; +Cc: Darrick J. Wong, linux-xfs
Hi all,
auditing the perag code for the generic groups feature found an issue
where recovery of an extfree intent without a logged done entry will
fail when the log also contained the transaction that added the AG
to the extent is freed to because the perag structures are only created
after log recovery has finished.
I will also send out a reproducer for this issue.
Diffstat:
libxfs/xfs_ag.c | 63 +++++++++++------------------------------------
libxfs/xfs_ag.h | 10 +++----
libxfs/xfs_log_recover.h | 2 +
xfs_buf_item_recover.c | 16 +++++++++++
xfs_fsops.c | 20 ++++++--------
xfs_log_recover.c | 58 ++++++++++++++++++-------------------------
xfs_mount.c | 9 ++----
7 files changed, 76 insertions(+), 102 deletions(-)
^ permalink raw reply [flat|nested] 15+ messages in thread
* [PATCH 1/4] xfs: pass the exact range to initialize to xfs_initialize_perag
2024-09-10 4:28 fix recovery of extfree items just after a growfs Christoph Hellwig
@ 2024-09-10 4:28 ` Christoph Hellwig
2024-09-17 18:50 ` Darrick J. Wong
2024-09-10 4:28 ` [PATCH 2/4] xfs: merge the perag freeing helpers Christoph Hellwig
` (2 subsequent siblings)
3 siblings, 1 reply; 15+ messages in thread
From: Christoph Hellwig @ 2024-09-10 4:28 UTC (permalink / raw)
To: Chandan Babu R; +Cc: Darrick J. Wong, linux-xfs
Currently only the new agcount is passed to xfs_initialize_perag, which
requires lookups of existing AGs to skip them and complicates error
handling. Also pass the previous agcount so that the range that
xfs_initialize_perag operates on is exactly defined. That way the
extra lookups can be avoided, and error handling can clean up the
exact range from the old count to the last added perag structure.
Signed-off-by: Christoph Hellwig <hch@lst.de>
---
fs/xfs/libxfs/xfs_ag.c | 23 +++++------------------
fs/xfs/libxfs/xfs_ag.h | 5 +++--
fs/xfs/xfs_fsops.c | 18 ++++++++----------
fs/xfs/xfs_log_recover.c | 5 +++--
fs/xfs/xfs_mount.c | 4 ++--
5 files changed, 21 insertions(+), 34 deletions(-)
diff --git a/fs/xfs/libxfs/xfs_ag.c b/fs/xfs/libxfs/xfs_ag.c
index 5f0494702e0b55..5186735da5d45a 100644
--- a/fs/xfs/libxfs/xfs_ag.c
+++ b/fs/xfs/libxfs/xfs_ag.c
@@ -296,27 +296,19 @@ xfs_free_unused_perag_range(
int
xfs_initialize_perag(
struct xfs_mount *mp,
+ xfs_agnumber_t old_agcount,
xfs_agnumber_t agcount,
xfs_rfsblock_t dblocks,
xfs_agnumber_t *maxagi)
{
struct xfs_perag *pag;
xfs_agnumber_t index;
- xfs_agnumber_t first_initialised = NULLAGNUMBER;
int error;
- /*
- * Walk the current per-ag tree so we don't try to initialise AGs
- * that already exist (growfs case). Allocate and insert all the
- * AGs we don't find ready for initialisation.
- */
- for (index = 0; index < agcount; index++) {
- pag = xfs_perag_get(mp, index);
- if (pag) {
- xfs_perag_put(pag);
- continue;
- }
+ if (old_agcount >= agcount)
+ return 0;
+ for (index = old_agcount; index < agcount; index++) {
pag = kzalloc(sizeof(*pag), GFP_KERNEL | __GFP_RETRY_MAYFAIL);
if (!pag) {
error = -ENOMEM;
@@ -353,10 +345,6 @@ xfs_initialize_perag(
/* Active ref owned by mount indicates AG is online. */
atomic_set(&pag->pag_active_ref, 1);
- /* first new pag is fully initialized */
- if (first_initialised == NULLAGNUMBER)
- first_initialised = index;
-
/*
* Pre-calculated geometry
*/
@@ -381,8 +369,7 @@ xfs_initialize_perag(
out_free_pag:
kfree(pag);
out_unwind_new_pags:
- /* unwind any prior newly initialized pags */
- xfs_free_unused_perag_range(mp, first_initialised, agcount);
+ xfs_free_unused_perag_range(mp, old_agcount, index);
return error;
}
diff --git a/fs/xfs/libxfs/xfs_ag.h b/fs/xfs/libxfs/xfs_ag.h
index d9cccd093b60e0..69fc31e7b84728 100644
--- a/fs/xfs/libxfs/xfs_ag.h
+++ b/fs/xfs/libxfs/xfs_ag.h
@@ -146,8 +146,9 @@ __XFS_AG_OPSTATE(agfl_needs_reset, AGFL_NEEDS_RESET)
void xfs_free_unused_perag_range(struct xfs_mount *mp, xfs_agnumber_t agstart,
xfs_agnumber_t agend);
-int xfs_initialize_perag(struct xfs_mount *mp, xfs_agnumber_t agcount,
- xfs_rfsblock_t dcount, xfs_agnumber_t *maxagi);
+int xfs_initialize_perag(struct xfs_mount *mp, xfs_agnumber_t old_agcount,
+ xfs_agnumber_t agcount, xfs_rfsblock_t dcount,
+ xfs_agnumber_t *maxagi);
int xfs_initialize_perag_data(struct xfs_mount *mp, xfs_agnumber_t agno);
void xfs_free_perag(struct xfs_mount *mp);
diff --git a/fs/xfs/xfs_fsops.c b/fs/xfs/xfs_fsops.c
index 3643cc843f6271..de2bf0594cb474 100644
--- a/fs/xfs/xfs_fsops.c
+++ b/fs/xfs/xfs_fsops.c
@@ -87,6 +87,7 @@ xfs_growfs_data_private(
struct xfs_mount *mp, /* mount point for filesystem */
struct xfs_growfs_data *in) /* growfs data input struct */
{
+ xfs_agnumber_t oagcount = mp->m_sb.sb_agcount;
struct xfs_buf *bp;
int error;
xfs_agnumber_t nagcount;
@@ -94,7 +95,6 @@ xfs_growfs_data_private(
xfs_rfsblock_t nb, nb_div, nb_mod;
int64_t delta;
bool lastag_extended = false;
- xfs_agnumber_t oagcount;
struct xfs_trans *tp;
struct aghdr_init_data id = {};
struct xfs_perag *last_pag;
@@ -138,16 +138,14 @@ xfs_growfs_data_private(
if (delta == 0)
return 0;
- oagcount = mp->m_sb.sb_agcount;
- /* allocate the new per-ag structures */
- if (nagcount > oagcount) {
- error = xfs_initialize_perag(mp, nagcount, nb, &nagimax);
- if (error)
- return error;
- } else if (nagcount < oagcount) {
- /* TODO: shrinking the entire AGs hasn't yet completed */
+ /* TODO: shrinking the entire AGs hasn't yet completed */
+ if (nagcount < oagcount)
return -EINVAL;
- }
+
+ /* allocate the new per-ag structures */
+ error = xfs_initialize_perag(mp, oagcount, nagcount, nb, &nagimax);
+ if (error)
+ return error;
if (delta > 0)
error = xfs_trans_alloc(mp, &M_RES(mp)->tr_growdata,
diff --git a/fs/xfs/xfs_log_recover.c b/fs/xfs/xfs_log_recover.c
index 1a74fe22672e3e..2af02b32f419c2 100644
--- a/fs/xfs/xfs_log_recover.c
+++ b/fs/xfs/xfs_log_recover.c
@@ -3346,6 +3346,7 @@ xlog_do_recover(
struct xfs_mount *mp = log->l_mp;
struct xfs_buf *bp = mp->m_sb_bp;
struct xfs_sb *sbp = &mp->m_sb;
+ xfs_agnumber_t old_agcount = sbp->sb_agcount;
int error;
trace_xfs_log_recover(log, head_blk, tail_blk);
@@ -3393,8 +3394,8 @@ xlog_do_recover(
/* re-initialise in-core superblock and geometry structures */
mp->m_features |= xfs_sb_version_to_features(sbp);
xfs_reinit_percpu_counters(mp);
- error = xfs_initialize_perag(mp, sbp->sb_agcount, sbp->sb_dblocks,
- &mp->m_maxagi);
+ error = xfs_initialize_perag(mp, old_agcount, sbp->sb_agcount,
+ sbp->sb_dblocks, &mp->m_maxagi);
if (error) {
xfs_warn(mp, "Failed post-recovery per-ag init: %d", error);
return error;
diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c
index 460f93a9ce00d1..0f4f56a7f02d9a 100644
--- a/fs/xfs/xfs_mount.c
+++ b/fs/xfs/xfs_mount.c
@@ -806,8 +806,8 @@ xfs_mountfs(
/*
* Allocate and initialize the per-ag data.
*/
- error = xfs_initialize_perag(mp, sbp->sb_agcount, mp->m_sb.sb_dblocks,
- &mp->m_maxagi);
+ error = xfs_initialize_perag(mp, 0, sbp->sb_agcount,
+ mp->m_sb.sb_dblocks, &mp->m_maxagi);
if (error) {
xfs_warn(mp, "Failed per-ag init: %d", error);
goto out_free_dir;
--
2.45.2
^ permalink raw reply related [flat|nested] 15+ messages in thread
* [PATCH 2/4] xfs: merge the perag freeing helpers
2024-09-10 4:28 fix recovery of extfree items just after a growfs Christoph Hellwig
2024-09-10 4:28 ` [PATCH 1/4] xfs: pass the exact range to initialize to xfs_initialize_perag Christoph Hellwig
@ 2024-09-10 4:28 ` Christoph Hellwig
2024-09-17 18:55 ` Darrick J. Wong
2024-09-10 4:28 ` [PATCH 3/4] xfs: create perag structures as soon as possible during log recovery Christoph Hellwig
2024-09-10 4:28 ` [PATCH 4/4] xfs: don't use __GFP_RETRY_MAYFAIL Christoph Hellwig
3 siblings, 1 reply; 15+ messages in thread
From: Christoph Hellwig @ 2024-09-10 4:28 UTC (permalink / raw)
To: Chandan Babu R; +Cc: Darrick J. Wong, linux-xfs
There is no good reason to have two different routines for freeing perag
structures for the unmount and error cases. Add two arguments to specify
the range of AGs to free to xfs_free_perag, and use that to replace
xfs_free_unused_perag_range.
The addition RCU grace period for the error case is harmless, and the
extra check for the AG to actually exist is not required now that the
callers pass the exact known allocated range.
Signed-off-by: Christoph Hellwig <hch@lst.de>
---
fs/xfs/libxfs/xfs_ag.c | 40 ++++++++++------------------------------
fs/xfs/libxfs/xfs_ag.h | 5 ++---
fs/xfs/xfs_fsops.c | 2 +-
fs/xfs/xfs_mount.c | 5 ++---
4 files changed, 15 insertions(+), 37 deletions(-)
diff --git a/fs/xfs/libxfs/xfs_ag.c b/fs/xfs/libxfs/xfs_ag.c
index 5186735da5d45a..3f695100d7ab58 100644
--- a/fs/xfs/libxfs/xfs_ag.c
+++ b/fs/xfs/libxfs/xfs_ag.c
@@ -185,17 +185,20 @@ xfs_initialize_perag_data(
}
/*
- * Free up the per-ag resources associated with the mount structure.
+ * Free up the per-ag resources within the specified AG range.
*/
void
-xfs_free_perag(
- struct xfs_mount *mp)
+xfs_free_perag_range(
+ struct xfs_mount *mp,
+ xfs_agnumber_t first_agno,
+ xfs_agnumber_t end_agno)
+
{
- struct xfs_perag *pag;
xfs_agnumber_t agno;
- for (agno = 0; agno < mp->m_sb.sb_agcount; agno++) {
- pag = xa_erase(&mp->m_perags, agno);
+ for (agno = first_agno; agno < end_agno; agno++) {
+ struct xfs_perag *pag = xa_erase(&mp->m_perags, agno);
+
ASSERT(pag);
XFS_IS_CORRUPT(pag->pag_mount, atomic_read(&pag->pag_ref) != 0);
xfs_defer_drain_free(&pag->pag_intents_drain);
@@ -270,29 +273,6 @@ xfs_agino_range(
return __xfs_agino_range(mp, xfs_ag_block_count(mp, agno), first, last);
}
-/*
- * Free perag within the specified AG range, it is only used to free unused
- * perags under the error handling path.
- */
-void
-xfs_free_unused_perag_range(
- struct xfs_mount *mp,
- xfs_agnumber_t agstart,
- xfs_agnumber_t agend)
-{
- struct xfs_perag *pag;
- xfs_agnumber_t index;
-
- for (index = agstart; index < agend; index++) {
- pag = xa_erase(&mp->m_perags, index);
- if (!pag)
- break;
- xfs_buf_cache_destroy(&pag->pag_bcache);
- xfs_defer_drain_free(&pag->pag_intents_drain);
- kfree(pag);
- }
-}
-
int
xfs_initialize_perag(
struct xfs_mount *mp,
@@ -369,7 +349,7 @@ xfs_initialize_perag(
out_free_pag:
kfree(pag);
out_unwind_new_pags:
- xfs_free_unused_perag_range(mp, old_agcount, index);
+ xfs_free_perag_range(mp, old_agcount, index);
return error;
}
diff --git a/fs/xfs/libxfs/xfs_ag.h b/fs/xfs/libxfs/xfs_ag.h
index 69fc31e7b84728..6e68d6a3161a0f 100644
--- a/fs/xfs/libxfs/xfs_ag.h
+++ b/fs/xfs/libxfs/xfs_ag.h
@@ -144,13 +144,12 @@ __XFS_AG_OPSTATE(prefers_metadata, PREFERS_METADATA)
__XFS_AG_OPSTATE(allows_inodes, ALLOWS_INODES)
__XFS_AG_OPSTATE(agfl_needs_reset, AGFL_NEEDS_RESET)
-void xfs_free_unused_perag_range(struct xfs_mount *mp, xfs_agnumber_t agstart,
- xfs_agnumber_t agend);
int xfs_initialize_perag(struct xfs_mount *mp, xfs_agnumber_t old_agcount,
xfs_agnumber_t agcount, xfs_rfsblock_t dcount,
xfs_agnumber_t *maxagi);
+void xfs_free_perag_range(struct xfs_mount *mp, xfs_agnumber_t first_agno,
+ xfs_agnumber_t end_agno);
int xfs_initialize_perag_data(struct xfs_mount *mp, xfs_agnumber_t agno);
-void xfs_free_perag(struct xfs_mount *mp);
/* Passive AG references */
struct xfs_perag *xfs_perag_get(struct xfs_mount *mp, xfs_agnumber_t agno);
diff --git a/fs/xfs/xfs_fsops.c b/fs/xfs/xfs_fsops.c
index de2bf0594cb474..b247d895c276d2 100644
--- a/fs/xfs/xfs_fsops.c
+++ b/fs/xfs/xfs_fsops.c
@@ -229,7 +229,7 @@ xfs_growfs_data_private(
xfs_trans_cancel(tp);
out_free_unused_perag:
if (nagcount > oagcount)
- xfs_free_unused_perag_range(mp, oagcount, nagcount);
+ xfs_free_perag_range(mp, oagcount, nagcount);
return error;
}
diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c
index 0f4f56a7f02d9a..6671ee3849c239 100644
--- a/fs/xfs/xfs_mount.c
+++ b/fs/xfs/xfs_mount.c
@@ -1044,7 +1044,7 @@ xfs_mountfs(
xfs_buftarg_drain(mp->m_logdev_targp);
xfs_buftarg_drain(mp->m_ddev_targp);
out_free_perag:
- xfs_free_perag(mp);
+ xfs_free_perag_range(mp, 0, mp->m_sb.sb_agcount);
out_free_dir:
xfs_da_unmount(mp);
out_remove_uuid:
@@ -1125,8 +1125,7 @@ xfs_unmountfs(
xfs_errortag_clearall(mp);
#endif
shrinker_free(mp->m_inodegc_shrinker);
- xfs_free_perag(mp);
-
+ xfs_free_perag_range(mp, 0, mp->m_sb.sb_agcount);
xfs_errortag_del(mp);
xfs_error_sysfs_del(mp);
xchk_stats_unregister(mp->m_scrub_stats);
--
2.45.2
^ permalink raw reply related [flat|nested] 15+ messages in thread
* [PATCH 3/4] xfs: create perag structures as soon as possible during log recovery
2024-09-10 4:28 fix recovery of extfree items just after a growfs Christoph Hellwig
2024-09-10 4:28 ` [PATCH 1/4] xfs: pass the exact range to initialize to xfs_initialize_perag Christoph Hellwig
2024-09-10 4:28 ` [PATCH 2/4] xfs: merge the perag freeing helpers Christoph Hellwig
@ 2024-09-10 4:28 ` Christoph Hellwig
2024-09-16 1:28 ` Dave Chinner
2024-09-10 4:28 ` [PATCH 4/4] xfs: don't use __GFP_RETRY_MAYFAIL Christoph Hellwig
3 siblings, 1 reply; 15+ messages in thread
From: Christoph Hellwig @ 2024-09-10 4:28 UTC (permalink / raw)
To: Chandan Babu R; +Cc: Darrick J. Wong, linux-xfs
An unclean log can contain both the transaction that created a new
allocation group and the first transaction that is freeing space from
it, in which case the extent free item recovery requires the perag
structure to be present.
Currently the perag structures are only created after log recovery
has completed, leading a warning and file system shutdown for the
above case. Fix this by creating new perag structures and updating
the in-memory superblock fields as soon a buffer log item that covers
the primary super block is recovered.
Signed-off-by: Christoph Hellwig <hch@lst.de>
---
fs/xfs/libxfs/xfs_log_recover.h | 2 ++
fs/xfs/xfs_buf_item_recover.c | 16 +++++++++
fs/xfs/xfs_log_recover.c | 59 ++++++++++++++-------------------
3 files changed, 43 insertions(+), 34 deletions(-)
diff --git a/fs/xfs/libxfs/xfs_log_recover.h b/fs/xfs/libxfs/xfs_log_recover.h
index 521d327e4c89ed..d0e13c84422d0a 100644
--- a/fs/xfs/libxfs/xfs_log_recover.h
+++ b/fs/xfs/libxfs/xfs_log_recover.h
@@ -165,4 +165,6 @@ void xlog_recover_intent_item(struct xlog *log, struct xfs_log_item *lip,
int xlog_recover_finish_intent(struct xfs_trans *tp,
struct xfs_defer_pending *dfp);
+int xlog_recover_update_agcount(struct xfs_mount *mp, struct xfs_dsb *dsb);
+
#endif /* __XFS_LOG_RECOVER_H__ */
diff --git a/fs/xfs/xfs_buf_item_recover.c b/fs/xfs/xfs_buf_item_recover.c
index 09e893cf563cb9..033821a56b6ac6 100644
--- a/fs/xfs/xfs_buf_item_recover.c
+++ b/fs/xfs/xfs_buf_item_recover.c
@@ -969,6 +969,22 @@ xlog_recover_buf_commit_pass2(
goto out_release;
} else {
xlog_recover_do_reg_buffer(mp, item, bp, buf_f, current_lsn);
+
+ /*
+ * Update the in-memory superblock and perag structures from the
+ * primary SB buffer.
+ *
+ * This is required because transactions running after growf
+ * s may require in-memory structures like the perag right after
+ * committing the growfs transaction that created the underlying
+ * objects.
+ */
+ if ((xfs_blft_from_flags(buf_f) & XFS_BLFT_SB_BUF) &&
+ xfs_buf_daddr(bp) == 0) {
+ error = xlog_recover_update_agcount(mp, bp->b_addr);
+ if (error)
+ goto out_release;
+ }
}
/*
diff --git a/fs/xfs/xfs_log_recover.c b/fs/xfs/xfs_log_recover.c
index 2af02b32f419c2..7d7ab146cae758 100644
--- a/fs/xfs/xfs_log_recover.c
+++ b/fs/xfs/xfs_log_recover.c
@@ -3334,6 +3334,30 @@ xlog_do_log_recovery(
return error;
}
+int
+xlog_recover_update_agcount(
+ struct xfs_mount *mp,
+ struct xfs_dsb *dsb)
+{
+ xfs_agnumber_t old_agcount = mp->m_sb.sb_agcount;
+ int error;
+
+ xfs_sb_from_disk(&mp->m_sb, dsb);
+ if (mp->m_sb.sb_agcount < old_agcount) {
+ xfs_alert(mp, "Shrinking AG count in log recovery");
+ return -EFSCORRUPTED;
+ }
+ mp->m_features |= xfs_sb_version_to_features(&mp->m_sb);
+ error = xfs_initialize_perag(mp, old_agcount, mp->m_sb.sb_agcount,
+ mp->m_sb.sb_dblocks, &mp->m_maxagi);
+ if (error) {
+ xfs_warn(mp, "Failed recovery per-ag init: %d", error);
+ return error;
+ }
+ mp->m_alloc_set_aside = xfs_alloc_set_aside(mp);
+ return 0;
+}
+
/*
* Do the actual recovery
*/
@@ -3343,10 +3367,6 @@ xlog_do_recover(
xfs_daddr_t head_blk,
xfs_daddr_t tail_blk)
{
- struct xfs_mount *mp = log->l_mp;
- struct xfs_buf *bp = mp->m_sb_bp;
- struct xfs_sb *sbp = &mp->m_sb;
- xfs_agnumber_t old_agcount = sbp->sb_agcount;
int error;
trace_xfs_log_recover(log, head_blk, tail_blk);
@@ -3371,36 +3391,7 @@ xlog_do_recover(
*/
xfs_ail_assign_tail_lsn(log->l_ailp);
- /*
- * Now that we've finished replaying all buffer and inode updates,
- * re-read the superblock and reverify it.
- */
- xfs_buf_lock(bp);
- xfs_buf_hold(bp);
- error = _xfs_buf_read(bp, XBF_READ);
- if (error) {
- if (!xlog_is_shutdown(log)) {
- xfs_buf_ioerror_alert(bp, __this_address);
- ASSERT(0);
- }
- xfs_buf_relse(bp);
- return error;
- }
-
- /* Convert superblock from on-disk format */
- xfs_sb_from_disk(sbp, bp->b_addr);
- xfs_buf_relse(bp);
-
- /* re-initialise in-core superblock and geometry structures */
- mp->m_features |= xfs_sb_version_to_features(sbp);
- xfs_reinit_percpu_counters(mp);
- error = xfs_initialize_perag(mp, old_agcount, sbp->sb_agcount,
- sbp->sb_dblocks, &mp->m_maxagi);
- if (error) {
- xfs_warn(mp, "Failed post-recovery per-ag init: %d", error);
- return error;
- }
- mp->m_alloc_set_aside = xfs_alloc_set_aside(mp);
+ xfs_reinit_percpu_counters(log->l_mp);
/* Normal transactions can now occur */
clear_bit(XLOG_ACTIVE_RECOVERY, &log->l_opstate);
--
2.45.2
^ permalink raw reply related [flat|nested] 15+ messages in thread
* [PATCH 4/4] xfs: don't use __GFP_RETRY_MAYFAIL
2024-09-10 4:28 fix recovery of extfree items just after a growfs Christoph Hellwig
` (2 preceding siblings ...)
2024-09-10 4:28 ` [PATCH 3/4] xfs: create perag structures as soon as possible during log recovery Christoph Hellwig
@ 2024-09-10 4:28 ` Christoph Hellwig
2024-09-17 19:00 ` Darrick J. Wong
3 siblings, 1 reply; 15+ messages in thread
From: Christoph Hellwig @ 2024-09-10 4:28 UTC (permalink / raw)
To: Chandan Babu R; +Cc: Darrick J. Wong, linux-xfs
__GFP_RETRY_MAYFAIL increases the likelyhood of allocations to fail,
which isn't really helpful during log recovery. Remove the flag and
stick to the default GFP_KERNEL policies.
Signed-off-by: Christoph Hellwig <hch@lst.de>
---
fs/xfs/libxfs/xfs_ag.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/fs/xfs/libxfs/xfs_ag.c b/fs/xfs/libxfs/xfs_ag.c
index 3f695100d7ab58..f6c666a87dd393 100644
--- a/fs/xfs/libxfs/xfs_ag.c
+++ b/fs/xfs/libxfs/xfs_ag.c
@@ -289,7 +289,7 @@ xfs_initialize_perag(
return 0;
for (index = old_agcount; index < agcount; index++) {
- pag = kzalloc(sizeof(*pag), GFP_KERNEL | __GFP_RETRY_MAYFAIL);
+ pag = kzalloc(sizeof(*pag), GFP_KERNEL);
if (!pag) {
error = -ENOMEM;
goto out_unwind_new_pags;
--
2.45.2
^ permalink raw reply related [flat|nested] 15+ messages in thread
* Re: [PATCH 3/4] xfs: create perag structures as soon as possible during log recovery
2024-09-10 4:28 ` [PATCH 3/4] xfs: create perag structures as soon as possible during log recovery Christoph Hellwig
@ 2024-09-16 1:28 ` Dave Chinner
2024-09-18 6:11 ` Christoph Hellwig
0 siblings, 1 reply; 15+ messages in thread
From: Dave Chinner @ 2024-09-16 1:28 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: Chandan Babu R, Darrick J. Wong, linux-xfs
On Tue, Sep 10, 2024 at 07:28:46AM +0300, Christoph Hellwig wrote:
> An unclean log can contain both the transaction that created a new
> allocation group and the first transaction that is freeing space from
> it, in which case the extent free item recovery requires the perag
> structure to be present.
>
> Currently the perag structures are only created after log recovery
> has completed, leading a warning and file system shutdown for the
> above case.
I'm missing something - the intents aren't processed until the log
has been recovered - queuing an intent to be processed does
not require the per-ag to be present. We don't take per-ag
references until we are recovering the intent. i.e. we've completed
journal recovery and haven't found the corresponding EFD.
That leaves the EFI in the log->r_dfops, and we then run
->recover_work in the second phase of recovery. It is
xfs_extent_free_recover_work() that creates the
new transaction and runs the EFI processing that requires the
perag references, isn't it?
IOWs, I don't see where the initial EFI/EFD recovery during the
checkpoint processing requires the newly created perags to be
present in memory for processing incomplete EFIs before the journal
recovery phase has completed.
>
> Fix this by creating new perag structures and updating
> the in-memory superblock fields as soon a buffer log item that covers
> the primary super block is recovered.
>
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> ---
> fs/xfs/libxfs/xfs_log_recover.h | 2 ++
> fs/xfs/xfs_buf_item_recover.c | 16 +++++++++
> fs/xfs/xfs_log_recover.c | 59 ++++++++++++++-------------------
> 3 files changed, 43 insertions(+), 34 deletions(-)
>
> diff --git a/fs/xfs/libxfs/xfs_log_recover.h b/fs/xfs/libxfs/xfs_log_recover.h
> index 521d327e4c89ed..d0e13c84422d0a 100644
> --- a/fs/xfs/libxfs/xfs_log_recover.h
> +++ b/fs/xfs/libxfs/xfs_log_recover.h
> @@ -165,4 +165,6 @@ void xlog_recover_intent_item(struct xlog *log, struct xfs_log_item *lip,
> int xlog_recover_finish_intent(struct xfs_trans *tp,
> struct xfs_defer_pending *dfp);
>
> +int xlog_recover_update_agcount(struct xfs_mount *mp, struct xfs_dsb *dsb);
> +
> #endif /* __XFS_LOG_RECOVER_H__ */
> diff --git a/fs/xfs/xfs_buf_item_recover.c b/fs/xfs/xfs_buf_item_recover.c
> index 09e893cf563cb9..033821a56b6ac6 100644
> --- a/fs/xfs/xfs_buf_item_recover.c
> +++ b/fs/xfs/xfs_buf_item_recover.c
> @@ -969,6 +969,22 @@ xlog_recover_buf_commit_pass2(
> goto out_release;
> } else {
> xlog_recover_do_reg_buffer(mp, item, bp, buf_f, current_lsn);
> +
> + /*
> + * Update the in-memory superblock and perag structures from the
> + * primary SB buffer.
> + *
> + * This is required because transactions running after growf
> + * s may require in-memory structures like the perag right after
> + * committing the growfs transaction that created the underlying
> + * objects.
> + */
> + if ((xfs_blft_from_flags(buf_f) & XFS_BLFT_SB_BUF) &&
> + xfs_buf_daddr(bp) == 0) {
> + error = xlog_recover_update_agcount(mp, bp->b_addr);
> + if (error)
> + goto out_release;
> + }
> }
If we are going to keep this logic, can you do this as a separate
helper function? i.e.:
if (inode buffer) {
xlog_recover_do_inode_buffer();
} else if (dquot buffer) {
xlog_recover_do_dquot_buffer();
} else if (superblock buffer) {
xlog_recover_do_sb_buffer();
} else {
xlog_recover_do_reg_buffer();
}
and
xlog_recover_do_sb_buffer()
{
error = xlog_recover_do_reg_buffer()
if (error || xfs_buf_daddr(bp) != XFS_SB_ADDR)
return error;
return xlog_recover_update_agcount();
}
>
> /*
> diff --git a/fs/xfs/xfs_log_recover.c b/fs/xfs/xfs_log_recover.c
> index 2af02b32f419c2..7d7ab146cae758 100644
> --- a/fs/xfs/xfs_log_recover.c
> +++ b/fs/xfs/xfs_log_recover.c
> @@ -3334,6 +3334,30 @@ xlog_do_log_recovery(
> return error;
> }
>
> +int
> +xlog_recover_update_agcount(
> + struct xfs_mount *mp,
> + struct xfs_dsb *dsb)
> +{
> + xfs_agnumber_t old_agcount = mp->m_sb.sb_agcount;
> + int error;
> +
> + xfs_sb_from_disk(&mp->m_sb, dsb);
> + if (mp->m_sb.sb_agcount < old_agcount) {
> + xfs_alert(mp, "Shrinking AG count in log recovery");
> + return -EFSCORRUPTED;
> + }
> + mp->m_features |= xfs_sb_version_to_features(&mp->m_sb);
I'm not sure this is safe. The item order in the checkpoint recovery
isn't guaranteed to be exactly the same as when feature bits are
modified at runtime. Hence there could be items in the checkpoint
that haven't yet been recovered that are dependent on the original
sb feature mask being present. It may be OK to do this at the end
of the checkpoint being recovered.
I'm also not sure why this feature update code is being changed
because it's not mentioned at all in the commit message.
> + error = xfs_initialize_perag(mp, old_agcount, mp->m_sb.sb_agcount,
> + mp->m_sb.sb_dblocks, &mp->m_maxagi);
Why do this if sb_agcount has not changed? AFAICT it only iterates
the AGs already initialised and so skips them, then recalculates
inode32 and prealloc block parameters, which won't change. Hence
it's a total no-op for anything other than an actual ag count change
and should be skipped, right?
-Dave.
--
Dave Chinner
david@fromorbit.com
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH 1/4] xfs: pass the exact range to initialize to xfs_initialize_perag
2024-09-10 4:28 ` [PATCH 1/4] xfs: pass the exact range to initialize to xfs_initialize_perag Christoph Hellwig
@ 2024-09-17 18:50 ` Darrick J. Wong
2024-09-18 6:15 ` Christoph Hellwig
0 siblings, 1 reply; 15+ messages in thread
From: Darrick J. Wong @ 2024-09-17 18:50 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: Chandan Babu R, linux-xfs
On Tue, Sep 10, 2024 at 07:28:44AM +0300, Christoph Hellwig wrote:
> Currently only the new agcount is passed to xfs_initialize_perag, which
> requires lookups of existing AGs to skip them and complicates error
> handling. Also pass the previous agcount so that the range that
> xfs_initialize_perag operates on is exactly defined. That way the
> extra lookups can be avoided, and error handling can clean up the
> exact range from the old count to the last added perag structure.
>
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> ---
> fs/xfs/libxfs/xfs_ag.c | 23 +++++------------------
> fs/xfs/libxfs/xfs_ag.h | 5 +++--
> fs/xfs/xfs_fsops.c | 18 ++++++++----------
> fs/xfs/xfs_log_recover.c | 5 +++--
> fs/xfs/xfs_mount.c | 4 ++--
> 5 files changed, 21 insertions(+), 34 deletions(-)
>
> diff --git a/fs/xfs/libxfs/xfs_ag.c b/fs/xfs/libxfs/xfs_ag.c
> index 5f0494702e0b55..5186735da5d45a 100644
> --- a/fs/xfs/libxfs/xfs_ag.c
> +++ b/fs/xfs/libxfs/xfs_ag.c
> @@ -296,27 +296,19 @@ xfs_free_unused_perag_range(
> int
> xfs_initialize_perag(
> struct xfs_mount *mp,
> + xfs_agnumber_t old_agcount,
> xfs_agnumber_t agcount,
Might want to rename this new_agcount for clarity?
Otherwise looks pretty straightforward to me, so:
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
--D
> xfs_rfsblock_t dblocks,
> xfs_agnumber_t *maxagi)
> {
> struct xfs_perag *pag;
> xfs_agnumber_t index;
> - xfs_agnumber_t first_initialised = NULLAGNUMBER;
> int error;
>
> - /*
> - * Walk the current per-ag tree so we don't try to initialise AGs
> - * that already exist (growfs case). Allocate and insert all the
> - * AGs we don't find ready for initialisation.
> - */
> - for (index = 0; index < agcount; index++) {
> - pag = xfs_perag_get(mp, index);
> - if (pag) {
> - xfs_perag_put(pag);
> - continue;
> - }
> + if (old_agcount >= agcount)
> + return 0;
>
> + for (index = old_agcount; index < agcount; index++) {
> pag = kzalloc(sizeof(*pag), GFP_KERNEL | __GFP_RETRY_MAYFAIL);
> if (!pag) {
> error = -ENOMEM;
> @@ -353,10 +345,6 @@ xfs_initialize_perag(
> /* Active ref owned by mount indicates AG is online. */
> atomic_set(&pag->pag_active_ref, 1);
>
> - /* first new pag is fully initialized */
> - if (first_initialised == NULLAGNUMBER)
> - first_initialised = index;
> -
> /*
> * Pre-calculated geometry
> */
> @@ -381,8 +369,7 @@ xfs_initialize_perag(
> out_free_pag:
> kfree(pag);
> out_unwind_new_pags:
> - /* unwind any prior newly initialized pags */
> - xfs_free_unused_perag_range(mp, first_initialised, agcount);
> + xfs_free_unused_perag_range(mp, old_agcount, index);
> return error;
> }
>
> diff --git a/fs/xfs/libxfs/xfs_ag.h b/fs/xfs/libxfs/xfs_ag.h
> index d9cccd093b60e0..69fc31e7b84728 100644
> --- a/fs/xfs/libxfs/xfs_ag.h
> +++ b/fs/xfs/libxfs/xfs_ag.h
> @@ -146,8 +146,9 @@ __XFS_AG_OPSTATE(agfl_needs_reset, AGFL_NEEDS_RESET)
>
> void xfs_free_unused_perag_range(struct xfs_mount *mp, xfs_agnumber_t agstart,
> xfs_agnumber_t agend);
> -int xfs_initialize_perag(struct xfs_mount *mp, xfs_agnumber_t agcount,
> - xfs_rfsblock_t dcount, xfs_agnumber_t *maxagi);
> +int xfs_initialize_perag(struct xfs_mount *mp, xfs_agnumber_t old_agcount,
> + xfs_agnumber_t agcount, xfs_rfsblock_t dcount,
> + xfs_agnumber_t *maxagi);
> int xfs_initialize_perag_data(struct xfs_mount *mp, xfs_agnumber_t agno);
> void xfs_free_perag(struct xfs_mount *mp);
>
> diff --git a/fs/xfs/xfs_fsops.c b/fs/xfs/xfs_fsops.c
> index 3643cc843f6271..de2bf0594cb474 100644
> --- a/fs/xfs/xfs_fsops.c
> +++ b/fs/xfs/xfs_fsops.c
> @@ -87,6 +87,7 @@ xfs_growfs_data_private(
> struct xfs_mount *mp, /* mount point for filesystem */
> struct xfs_growfs_data *in) /* growfs data input struct */
> {
> + xfs_agnumber_t oagcount = mp->m_sb.sb_agcount;
> struct xfs_buf *bp;
> int error;
> xfs_agnumber_t nagcount;
> @@ -94,7 +95,6 @@ xfs_growfs_data_private(
> xfs_rfsblock_t nb, nb_div, nb_mod;
> int64_t delta;
> bool lastag_extended = false;
> - xfs_agnumber_t oagcount;
> struct xfs_trans *tp;
> struct aghdr_init_data id = {};
> struct xfs_perag *last_pag;
> @@ -138,16 +138,14 @@ xfs_growfs_data_private(
> if (delta == 0)
> return 0;
>
> - oagcount = mp->m_sb.sb_agcount;
> - /* allocate the new per-ag structures */
> - if (nagcount > oagcount) {
> - error = xfs_initialize_perag(mp, nagcount, nb, &nagimax);
> - if (error)
> - return error;
> - } else if (nagcount < oagcount) {
> - /* TODO: shrinking the entire AGs hasn't yet completed */
> + /* TODO: shrinking the entire AGs hasn't yet completed */
> + if (nagcount < oagcount)
> return -EINVAL;
> - }
> +
> + /* allocate the new per-ag structures */
> + error = xfs_initialize_perag(mp, oagcount, nagcount, nb, &nagimax);
> + if (error)
> + return error;
>
> if (delta > 0)
> error = xfs_trans_alloc(mp, &M_RES(mp)->tr_growdata,
> diff --git a/fs/xfs/xfs_log_recover.c b/fs/xfs/xfs_log_recover.c
> index 1a74fe22672e3e..2af02b32f419c2 100644
> --- a/fs/xfs/xfs_log_recover.c
> +++ b/fs/xfs/xfs_log_recover.c
> @@ -3346,6 +3346,7 @@ xlog_do_recover(
> struct xfs_mount *mp = log->l_mp;
> struct xfs_buf *bp = mp->m_sb_bp;
> struct xfs_sb *sbp = &mp->m_sb;
> + xfs_agnumber_t old_agcount = sbp->sb_agcount;
> int error;
>
> trace_xfs_log_recover(log, head_blk, tail_blk);
> @@ -3393,8 +3394,8 @@ xlog_do_recover(
> /* re-initialise in-core superblock and geometry structures */
> mp->m_features |= xfs_sb_version_to_features(sbp);
> xfs_reinit_percpu_counters(mp);
> - error = xfs_initialize_perag(mp, sbp->sb_agcount, sbp->sb_dblocks,
> - &mp->m_maxagi);
> + error = xfs_initialize_perag(mp, old_agcount, sbp->sb_agcount,
> + sbp->sb_dblocks, &mp->m_maxagi);
> if (error) {
> xfs_warn(mp, "Failed post-recovery per-ag init: %d", error);
> return error;
> diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c
> index 460f93a9ce00d1..0f4f56a7f02d9a 100644
> --- a/fs/xfs/xfs_mount.c
> +++ b/fs/xfs/xfs_mount.c
> @@ -806,8 +806,8 @@ xfs_mountfs(
> /*
> * Allocate and initialize the per-ag data.
> */
> - error = xfs_initialize_perag(mp, sbp->sb_agcount, mp->m_sb.sb_dblocks,
> - &mp->m_maxagi);
> + error = xfs_initialize_perag(mp, 0, sbp->sb_agcount,
> + mp->m_sb.sb_dblocks, &mp->m_maxagi);
> if (error) {
> xfs_warn(mp, "Failed per-ag init: %d", error);
> goto out_free_dir;
> --
> 2.45.2
>
>
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH 2/4] xfs: merge the perag freeing helpers
2024-09-10 4:28 ` [PATCH 2/4] xfs: merge the perag freeing helpers Christoph Hellwig
@ 2024-09-17 18:55 ` Darrick J. Wong
2024-09-18 6:15 ` Christoph Hellwig
0 siblings, 1 reply; 15+ messages in thread
From: Darrick J. Wong @ 2024-09-17 18:55 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: Chandan Babu R, linux-xfs
On Tue, Sep 10, 2024 at 07:28:45AM +0300, Christoph Hellwig wrote:
> There is no good reason to have two different routines for freeing perag
> structures for the unmount and error cases. Add two arguments to specify
> the range of AGs to free to xfs_free_perag, and use that to replace
> xfs_free_unused_perag_range.
>
> The addition RCU grace period for the error case is harmless, and the
> extra check for the AG to actually exist is not required now that the
> callers pass the exact known allocated range.
And I guess the extra xfs_perag_rele is ok because xfs_initialize_perag
sets it to 1 and the _unused free function didn't care what the active
refcount was?
If the answer is yes then I've understood this enough to say
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
--D
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> ---
> fs/xfs/libxfs/xfs_ag.c | 40 ++++++++++------------------------------
> fs/xfs/libxfs/xfs_ag.h | 5 ++---
> fs/xfs/xfs_fsops.c | 2 +-
> fs/xfs/xfs_mount.c | 5 ++---
> 4 files changed, 15 insertions(+), 37 deletions(-)
>
> diff --git a/fs/xfs/libxfs/xfs_ag.c b/fs/xfs/libxfs/xfs_ag.c
> index 5186735da5d45a..3f695100d7ab58 100644
> --- a/fs/xfs/libxfs/xfs_ag.c
> +++ b/fs/xfs/libxfs/xfs_ag.c
> @@ -185,17 +185,20 @@ xfs_initialize_perag_data(
> }
>
> /*
> - * Free up the per-ag resources associated with the mount structure.
> + * Free up the per-ag resources within the specified AG range.
> */
> void
> -xfs_free_perag(
> - struct xfs_mount *mp)
> +xfs_free_perag_range(
> + struct xfs_mount *mp,
> + xfs_agnumber_t first_agno,
> + xfs_agnumber_t end_agno)
> +
> {
> - struct xfs_perag *pag;
> xfs_agnumber_t agno;
>
> - for (agno = 0; agno < mp->m_sb.sb_agcount; agno++) {
> - pag = xa_erase(&mp->m_perags, agno);
> + for (agno = first_agno; agno < end_agno; agno++) {
> + struct xfs_perag *pag = xa_erase(&mp->m_perags, agno);
> +
> ASSERT(pag);
> XFS_IS_CORRUPT(pag->pag_mount, atomic_read(&pag->pag_ref) != 0);
> xfs_defer_drain_free(&pag->pag_intents_drain);
> @@ -270,29 +273,6 @@ xfs_agino_range(
> return __xfs_agino_range(mp, xfs_ag_block_count(mp, agno), first, last);
> }
>
> -/*
> - * Free perag within the specified AG range, it is only used to free unused
> - * perags under the error handling path.
> - */
> -void
> -xfs_free_unused_perag_range(
> - struct xfs_mount *mp,
> - xfs_agnumber_t agstart,
> - xfs_agnumber_t agend)
> -{
> - struct xfs_perag *pag;
> - xfs_agnumber_t index;
> -
> - for (index = agstart; index < agend; index++) {
> - pag = xa_erase(&mp->m_perags, index);
> - if (!pag)
> - break;
> - xfs_buf_cache_destroy(&pag->pag_bcache);
> - xfs_defer_drain_free(&pag->pag_intents_drain);
> - kfree(pag);
> - }
> -}
> -
> int
> xfs_initialize_perag(
> struct xfs_mount *mp,
> @@ -369,7 +349,7 @@ xfs_initialize_perag(
> out_free_pag:
> kfree(pag);
> out_unwind_new_pags:
> - xfs_free_unused_perag_range(mp, old_agcount, index);
> + xfs_free_perag_range(mp, old_agcount, index);
> return error;
> }
>
> diff --git a/fs/xfs/libxfs/xfs_ag.h b/fs/xfs/libxfs/xfs_ag.h
> index 69fc31e7b84728..6e68d6a3161a0f 100644
> --- a/fs/xfs/libxfs/xfs_ag.h
> +++ b/fs/xfs/libxfs/xfs_ag.h
> @@ -144,13 +144,12 @@ __XFS_AG_OPSTATE(prefers_metadata, PREFERS_METADATA)
> __XFS_AG_OPSTATE(allows_inodes, ALLOWS_INODES)
> __XFS_AG_OPSTATE(agfl_needs_reset, AGFL_NEEDS_RESET)
>
> -void xfs_free_unused_perag_range(struct xfs_mount *mp, xfs_agnumber_t agstart,
> - xfs_agnumber_t agend);
> int xfs_initialize_perag(struct xfs_mount *mp, xfs_agnumber_t old_agcount,
> xfs_agnumber_t agcount, xfs_rfsblock_t dcount,
> xfs_agnumber_t *maxagi);
> +void xfs_free_perag_range(struct xfs_mount *mp, xfs_agnumber_t first_agno,
> + xfs_agnumber_t end_agno);
> int xfs_initialize_perag_data(struct xfs_mount *mp, xfs_agnumber_t agno);
> -void xfs_free_perag(struct xfs_mount *mp);
>
> /* Passive AG references */
> struct xfs_perag *xfs_perag_get(struct xfs_mount *mp, xfs_agnumber_t agno);
> diff --git a/fs/xfs/xfs_fsops.c b/fs/xfs/xfs_fsops.c
> index de2bf0594cb474..b247d895c276d2 100644
> --- a/fs/xfs/xfs_fsops.c
> +++ b/fs/xfs/xfs_fsops.c
> @@ -229,7 +229,7 @@ xfs_growfs_data_private(
> xfs_trans_cancel(tp);
> out_free_unused_perag:
> if (nagcount > oagcount)
> - xfs_free_unused_perag_range(mp, oagcount, nagcount);
> + xfs_free_perag_range(mp, oagcount, nagcount);
> return error;
> }
>
> diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c
> index 0f4f56a7f02d9a..6671ee3849c239 100644
> --- a/fs/xfs/xfs_mount.c
> +++ b/fs/xfs/xfs_mount.c
> @@ -1044,7 +1044,7 @@ xfs_mountfs(
> xfs_buftarg_drain(mp->m_logdev_targp);
> xfs_buftarg_drain(mp->m_ddev_targp);
> out_free_perag:
> - xfs_free_perag(mp);
> + xfs_free_perag_range(mp, 0, mp->m_sb.sb_agcount);
> out_free_dir:
> xfs_da_unmount(mp);
> out_remove_uuid:
> @@ -1125,8 +1125,7 @@ xfs_unmountfs(
> xfs_errortag_clearall(mp);
> #endif
> shrinker_free(mp->m_inodegc_shrinker);
> - xfs_free_perag(mp);
> -
> + xfs_free_perag_range(mp, 0, mp->m_sb.sb_agcount);
> xfs_errortag_del(mp);
> xfs_error_sysfs_del(mp);
> xchk_stats_unregister(mp->m_scrub_stats);
> --
> 2.45.2
>
>
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH 4/4] xfs: don't use __GFP_RETRY_MAYFAIL
2024-09-10 4:28 ` [PATCH 4/4] xfs: don't use __GFP_RETRY_MAYFAIL Christoph Hellwig
@ 2024-09-17 19:00 ` Darrick J. Wong
0 siblings, 0 replies; 15+ messages in thread
From: Darrick J. Wong @ 2024-09-17 19:00 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: Chandan Babu R, linux-xfs
On Tue, Sep 10, 2024 at 07:28:47AM +0300, Christoph Hellwig wrote:
> __GFP_RETRY_MAYFAIL increases the likelyhood of allocations to fail,
> which isn't really helpful during log recovery. Remove the flag and
> stick to the default GFP_KERNEL policies.
>
> Signed-off-by: Christoph Hellwig <hch@lst.de>
Sounds fine to me
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
--D
> ---
> fs/xfs/libxfs/xfs_ag.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/fs/xfs/libxfs/xfs_ag.c b/fs/xfs/libxfs/xfs_ag.c
> index 3f695100d7ab58..f6c666a87dd393 100644
> --- a/fs/xfs/libxfs/xfs_ag.c
> +++ b/fs/xfs/libxfs/xfs_ag.c
> @@ -289,7 +289,7 @@ xfs_initialize_perag(
> return 0;
>
> for (index = old_agcount; index < agcount; index++) {
> - pag = kzalloc(sizeof(*pag), GFP_KERNEL | __GFP_RETRY_MAYFAIL);
> + pag = kzalloc(sizeof(*pag), GFP_KERNEL);
> if (!pag) {
> error = -ENOMEM;
> goto out_unwind_new_pags;
> --
> 2.45.2
>
>
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH 3/4] xfs: create perag structures as soon as possible during log recovery
2024-09-16 1:28 ` Dave Chinner
@ 2024-09-18 6:11 ` Christoph Hellwig
2024-09-19 1:09 ` Dave Chinner
0 siblings, 1 reply; 15+ messages in thread
From: Christoph Hellwig @ 2024-09-18 6:11 UTC (permalink / raw)
To: Dave Chinner
Cc: Christoph Hellwig, Chandan Babu R, Darrick J. Wong, linux-xfs
On Mon, Sep 16, 2024 at 11:28:26AM +1000, Dave Chinner wrote:
> I'm missing something - the intents aren't processed until the log
> has been recovered - queuing an intent to be processed does
> not require the per-ag to be present. We don't take per-ag
> references until we are recovering the intent. i.e. we've completed
> journal recovery and haven't found the corresponding EFD.
>
> That leaves the EFI in the log->r_dfops, and we then run
> ->recover_work in the second phase of recovery. It is
> xfs_extent_free_recover_work() that creates the
> new transaction and runs the EFI processing that requires the
> perag references, isn't it?
>
> IOWs, I don't see where the initial EFI/EFD recovery during the
> checkpoint processing requires the newly created perags to be
> present in memory for processing incomplete EFIs before the journal
> recovery phase has completed.
So my new test actually blows up before creating intents:
[ 81.695529] XFS (nvme1n1): Mounting V5 Filesystem 07057234-4bec-4f17-97c5-420c71c83292
[ 81.704541] XFS (nvme1n1): Starting recovery (logdev: internal)
[ 81.707260] XFS (nvme1n1): xfs_buf_map_verify: daddr 0x40003 out of range, EOFS 0x40000
[ 81.707974] ------------[ cut here ]------------
[ 81.708376] WARNING: CPU: 1 PID: 5004 at fs/xfs/xfs_buf.c:553 xfs_buf_get_map+0x8b4/0xb70
Because sb_dblocks hasn't been updated yet. I'd kinda assume we run
into the intents next, but maybe we don't. I can try how far just
fixing the sb would get us, but that will potentially gets us into
more problems late the more we actually use the pag structure.
> If we are going to keep this logic, can you do this as a separate
> helper function? i.e.:
I actually did that earlier, and it turned out to create a bit more
boilerplate than I liked, but I can revert to it if there is a strong
preference.
> > + xfs_sb_from_disk(&mp->m_sb, dsb);
> > + if (mp->m_sb.sb_agcount < old_agcount) {
> > + xfs_alert(mp, "Shrinking AG count in log recovery");
> > + return -EFSCORRUPTED;
> > + }
> > + mp->m_features |= xfs_sb_version_to_features(&mp->m_sb);
>
> I'm not sure this is safe. The item order in the checkpoint recovery
> isn't guaranteed to be exactly the same as when feature bits are
> modified at runtime. Hence there could be items in the checkpoint
> that haven't yet been recovered that are dependent on the original
> sb feature mask being present. It may be OK to do this at the end
> of the checkpoint being recovered.
>
> I'm also not sure why this feature update code is being changed
> because it's not mentioned at all in the commit message.
Mostly to keep the features in sync with the in-memory sb fields
updated above. I'll switch to keep this as-is, but I fail to see how
updating features only after the entire reocvery is done will be safe
for all cases either.
Where would we depend on the old feature setting?
>
> > + error = xfs_initialize_perag(mp, old_agcount, mp->m_sb.sb_agcount,
> > + mp->m_sb.sb_dblocks, &mp->m_maxagi);
>
> Why do this if sb_agcount has not changed? AFAICT it only iterates
> the AGs already initialised and so skips them, then recalculates
> inode32 and prealloc block parameters, which won't change. Hence
> it's a total no-op for anything other than an actual ag count change
> and should be skipped, right?
Yes, and the way how xfs_initialize_perag it is an entire no-op.
But I can add an extra explicit check to make that more clear.
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH 1/4] xfs: pass the exact range to initialize to xfs_initialize_perag
2024-09-17 18:50 ` Darrick J. Wong
@ 2024-09-18 6:15 ` Christoph Hellwig
0 siblings, 0 replies; 15+ messages in thread
From: Christoph Hellwig @ 2024-09-18 6:15 UTC (permalink / raw)
To: Darrick J. Wong; +Cc: Christoph Hellwig, Chandan Babu R, linux-xfs
On Tue, Sep 17, 2024 at 11:50:54AM -0700, Darrick J. Wong wrote:
> Might want to rename this new_agcount for clarity?
Sure.
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH 2/4] xfs: merge the perag freeing helpers
2024-09-17 18:55 ` Darrick J. Wong
@ 2024-09-18 6:15 ` Christoph Hellwig
0 siblings, 0 replies; 15+ messages in thread
From: Christoph Hellwig @ 2024-09-18 6:15 UTC (permalink / raw)
To: Darrick J. Wong; +Cc: Christoph Hellwig, Chandan Babu R, linux-xfs
On Tue, Sep 17, 2024 at 11:55:56AM -0700, Darrick J. Wong wrote:
> And I guess the extra xfs_perag_rele is ok because xfs_initialize_perag
> sets it to 1 and the _unused free function didn't care what the active
> refcount was?
Yes.
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH 3/4] xfs: create perag structures as soon as possible during log recovery
2024-09-18 6:11 ` Christoph Hellwig
@ 2024-09-19 1:09 ` Dave Chinner
2024-09-19 7:46 ` Christoph Hellwig
0 siblings, 1 reply; 15+ messages in thread
From: Dave Chinner @ 2024-09-19 1:09 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: Chandan Babu R, Darrick J. Wong, linux-xfs
On Wed, Sep 18, 2024 at 08:11:05AM +0200, Christoph Hellwig wrote:
> On Mon, Sep 16, 2024 at 11:28:26AM +1000, Dave Chinner wrote:
> > I'm missing something - the intents aren't processed until the log
> > has been recovered - queuing an intent to be processed does
> > not require the per-ag to be present. We don't take per-ag
> > references until we are recovering the intent. i.e. we've completed
> > journal recovery and haven't found the corresponding EFD.
> >
> > That leaves the EFI in the log->r_dfops, and we then run
> > ->recover_work in the second phase of recovery. It is
> > xfs_extent_free_recover_work() that creates the
> > new transaction and runs the EFI processing that requires the
> > perag references, isn't it?
> >
> > IOWs, I don't see where the initial EFI/EFD recovery during the
> > checkpoint processing requires the newly created perags to be
> > present in memory for processing incomplete EFIs before the journal
> > recovery phase has completed.
>
> So my new test actually blows up before creating intents:
>
> [ 81.695529] XFS (nvme1n1): Mounting V5 Filesystem 07057234-4bec-4f17-97c5-420c71c83292
> [ 81.704541] XFS (nvme1n1): Starting recovery (logdev: internal)
> [ 81.707260] XFS (nvme1n1): xfs_buf_map_verify: daddr 0x40003 out of range, EOFS 0x40000
> [ 81.707974] ------------[ cut here ]------------
> [ 81.708376] WARNING: CPU: 1 PID: 5004 at fs/xfs/xfs_buf.c:553 xfs_buf_get_map+0x8b4/0xb70
>
> Because sb_dblocks hasn't been updated yet.
Hmmmmm. Ok, I can see how this would be an issue, but it's not the
issue the commit message describes :)
....
Oh, this is a -much worse- problem that I thought.
This is a replay for a modification to a new AGFL, and that *must*
only be replayed after the superblock modifications have been
replayed.
Ideally, we should not be using the new AGs until *after* the growfs
transaction has hit stable storage (i.e. the journal has fully
commmitted the growfs transaction), not just committed to the CIL.
If anything can access the new AGs beyond the old EOFS *before* the
growfs transaction is stable in the journal, then we have a nasty
set of race condtions where we can be making modifications to
metadata that is beyond EOFS in the journal *before* we've replayed
the superblock growfs modification.
For example:
growfs task allocating task log worker
xfs_trans_set_sync
xfs_trans_commit
xlog_cil_commit
<items added to CIL>
xfs_trans_unreserve_and_mod_sb
mp->m_sb.sb_agcount += tp->t_agcount_delta;
->iop_committing()
xfs_buf_item_committing
xfs_buf_item_release
<superblock buffer unlocked>
<preempt>
xfs_bmap_btalloc
xfs_bmap_btalloc_best_length
for_each_perag_wrap(...)
<sees new, uncommitted mp->m_sb.sb_agcount>
<selects new AG beyond EOFS in journal>
<does allocation in new AG beyond EOFS in journal>
xfs_trans_commit()
xlog_cil_commit()
<items added to CIL>
....
<log state in XLOG_STATE_COVER_NEED>
xfs_sync_sb(wait)
locks sb buffer
xfs_trans_set_sync
xfs_trans_commit
xlog_cil_commit
<sb already in CIL>
<updates sb item order id>
xfs_log_force(SYNC)
<pushes CIL>
<runs again>
xfs_log_force(SYNC)
<pushes CIL>
This growfs vs allocating task race results in a checkpoint in the
journal where the new AGs are modified in the same checkpoint as the
growfs transaction. This will work if the superblock item is placed
in the journal before the items for the new AGs with your new code.
However...
.... when we add in the log worker (or some other transaction)
updating the superblock again before the CIL is pushed, we now have
a reordering problem. The CIL push will order all the items in the
CIL according to the order id attached to the log item, and this
causes the superblock item to be placed -after- the new AG items.
That will result in the exact recovery error you quoted above,
regardless of the fixes in this series.
<thinks about how to fix it>
I think the first step in avoiding this is ensuring we can't relog
the superblock until the growfs transaction is on disk. We can
do that by explicitly grabbing the superblock buffer and holding it
before we commit the transaction so the superblock buffer remains
locked until xfs_trans_commit() returns. That will prevent races
with other sb updates by blocking them on the sb buffer lock.
The second step is preventing allocations that are running from
seeing the mp->m_sb.sb_agcount update until after the transaction is
stable.
Ok, xfs_trans_mod_sb(XFS_TRANS_SB_AGCOUNT) is only used by the
grwofs code and the first step above means we have explicitly locked
the sb buffer. Hence the whole xfs_trans_mod_sb() -> delta in trans
-> xfs_trans_commit -> xfs_trans_apply_sb_deltas() -> lock sb buffer,
modify sb and log it -> xlog_cil_commit() song and dance routine can
go away.
i.e. We can modify the superblock buffer directly in the growfs
code, and then when xfs_trans_commit() returns we directly update
the in-memory superblock before unlocking the superblock buffer.
This means that nothing can update the superblock before the growfs
transaction is stable in the journal, and nothing at runtime will
see that there are new AGs or space available until after the on
disk state has changed.
This then allows recovery to be able to update the superblock and
perag state after checkpoint processing is complete. All future
checkpoint recoveries will now be guaranteed to see the correct
superblock state, whilst the checkpoint the growfs update is in
is guaranteed to only be dependent on the old superblock state...
I suspect that we might have to move all superblock updates to this
model - we don't update the in-memory state until after the on-disk
update is on stable storage in the journal - that will avoid the
possibility of racing/reordered feature bit update being replayed,
too.
The only issue with this is that I have a nagging memory of holding
the superblock buffer locked across a synchronous transaction could
deadlock the log force in some way. I can't find a reference to that
situation anywhere - maybe it wasn't a real issue, or someone else
remembers that situation better than I do....
> I'd kinda assume we run
> into the intents next, but maybe we don't.
I don't think intents are the problem because they are run after
the superblock and perags are updated.
> > > + xfs_sb_from_disk(&mp->m_sb, dsb);
> > > + if (mp->m_sb.sb_agcount < old_agcount) {
> > > + xfs_alert(mp, "Shrinking AG count in log recovery");
> > > + return -EFSCORRUPTED;
> > > + }
> > > + mp->m_features |= xfs_sb_version_to_features(&mp->m_sb);
> >
> > I'm not sure this is safe. The item order in the checkpoint recovery
> > isn't guaranteed to be exactly the same as when feature bits are
> > modified at runtime. Hence there could be items in the checkpoint
> > that haven't yet been recovered that are dependent on the original
> > sb feature mask being present. It may be OK to do this at the end
> > of the checkpoint being recovered.
> >
> > I'm also not sure why this feature update code is being changed
> > because it's not mentioned at all in the commit message.
>
> Mostly to keep the features in sync with the in-memory sb fields
> updated above. I'll switch to keep this as-is, but I fail to see how
> updating features only after the entire reocvery is done will be safe
> for all cases either.
>
> Where would we depend on the old feature setting?
One possibility is a similar reordering race to what I describe
above; another is simply that the superblock feature update is not
serialised at runtime with anything that checks that feature. Hence
the order of modifications in the journal may not reflect the order
in which the checks and modifications were done at runtime....
Hence I'm not convinced that it is safe to update superblock state
in the middle of a checkpoint - between checkpoints (i.e. after
checkpoint commit record processing) can be made safe (as per
above), but the relogging of items in the CIL makes mid-checkpoint
updates somewhat ... problematic.
-Dave.
--
Dave Chinner
david@fromorbit.com
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH 3/4] xfs: create perag structures as soon as possible during log recovery
2024-09-19 1:09 ` Dave Chinner
@ 2024-09-19 7:46 ` Christoph Hellwig
2024-09-19 21:50 ` Dave Chinner
0 siblings, 1 reply; 15+ messages in thread
From: Christoph Hellwig @ 2024-09-19 7:46 UTC (permalink / raw)
To: Dave Chinner
Cc: Christoph Hellwig, Chandan Babu R, Darrick J. Wong, linux-xfs
On Thu, Sep 19, 2024 at 11:09:40AM +1000, Dave Chinner wrote:
> Ideally, we should not be using the new AGs until *after* the growfs
> transaction has hit stable storage (i.e. the journal has fully
> commmitted the growfs transaction), not just committed to the CIL.
Yes. A crude version of that - freeze/unfreeze before setting the
AG live was my other initial idea, but Darrick wasn't exactly
excited about that..
> The second step is preventing allocations that are running from
> seeing the mp->m_sb.sb_agcount update until after the transaction is
> stable.
Or just not seeing the pag as active by not setting the initial
active reference until after the transaction is stable. Given
all the issues outlined by you about sb locking that might be the
easier approach.
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH 3/4] xfs: create perag structures as soon as possible during log recovery
2024-09-19 7:46 ` Christoph Hellwig
@ 2024-09-19 21:50 ` Dave Chinner
0 siblings, 0 replies; 15+ messages in thread
From: Dave Chinner @ 2024-09-19 21:50 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: Chandan Babu R, Darrick J. Wong, linux-xfs
On Thu, Sep 19, 2024 at 09:46:31AM +0200, Christoph Hellwig wrote:
> On Thu, Sep 19, 2024 at 11:09:40AM +1000, Dave Chinner wrote:
> > Ideally, we should not be using the new AGs until *after* the growfs
> > transaction has hit stable storage (i.e. the journal has fully
> > commmitted the growfs transaction), not just committed to the CIL.
>
> Yes. A crude version of that - freeze/unfreeze before setting the
> AG live was my other initial idea, but Darrick wasn't exactly
> excited about that..
I'm not exactly excited by that idea, either...
> > The second step is preventing allocations that are running from
> > seeing the mp->m_sb.sb_agcount update until after the transaction is
> > stable.
>
> Or just not seeing the pag as active by not setting the initial
> active reference until after the transaction is stable. Given
> all the issues outlined by you about sb locking that might be the
> easier approach.
Yeah, that's a good idea for avoiding perag references from
iterations before the growfs is stable.
However, my concern is whether that is sufficient. Whilst I didn't
mention it, changing sb_agcount and sb_dblocks before the grwofs is
stable affects things like size calculations for the old end runt
AG(*) because it is no longer considered a runt the moment we change
the in-memory size fields in the superblock. That will, at least,
affect ino/fsbno/agbno verification, as well as corruption checks
through the code.
An alternative is to delay the perag initialisation until after the
growfs is stable, because we don't want the old runt AG size to
visibly change until after the growfs is stable.
There may be more potential issues, but I haven't done a careful
code audit and that's kinda why I suggested simply delaying the
in-memory state update. Delaying the update removes the whole
in-memory transient state situation altogether...
-Dave.
(*) The precalculated AG length and inode min/max values we've added
to the perag (calculated at perag init time) should be used for
these calculations. That gets around this 'growfs changes sb values
dynamically' issue, but not all the places where we do these
verifications have a valid perag reference to pass these
functions.
--
Dave Chinner
david@fromorbit.com
^ permalink raw reply [flat|nested] 15+ messages in thread
end of thread, other threads:[~2024-09-19 21:50 UTC | newest]
Thread overview: 15+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-09-10 4:28 fix recovery of extfree items just after a growfs Christoph Hellwig
2024-09-10 4:28 ` [PATCH 1/4] xfs: pass the exact range to initialize to xfs_initialize_perag Christoph Hellwig
2024-09-17 18:50 ` Darrick J. Wong
2024-09-18 6:15 ` Christoph Hellwig
2024-09-10 4:28 ` [PATCH 2/4] xfs: merge the perag freeing helpers Christoph Hellwig
2024-09-17 18:55 ` Darrick J. Wong
2024-09-18 6:15 ` Christoph Hellwig
2024-09-10 4:28 ` [PATCH 3/4] xfs: create perag structures as soon as possible during log recovery Christoph Hellwig
2024-09-16 1:28 ` Dave Chinner
2024-09-18 6:11 ` Christoph Hellwig
2024-09-19 1:09 ` Dave Chinner
2024-09-19 7:46 ` Christoph Hellwig
2024-09-19 21:50 ` Dave Chinner
2024-09-10 4:28 ` [PATCH 4/4] xfs: don't use __GFP_RETRY_MAYFAIL Christoph Hellwig
2024-09-17 19:00 ` Darrick J. Wong
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox