* xfs: byte-based grant head reservation tracking v4
@ 2024-06-20 7:21 Christoph Hellwig
2024-06-20 7:21 ` [PATCH 01/11] xfs: fix the contact address for the sysfs ABI documentation Christoph Hellwig
` (10 more replies)
0 siblings, 11 replies; 24+ messages in thread
From: Christoph Hellwig @ 2024-06-20 7:21 UTC (permalink / raw)
To: Chandan Babu R; +Cc: Darrick J. Wong, Dave Chinner, linux-xfs
[Note: I've taken over from Dave on this to push it over the finish line]
One of the significant limitations of the log reservation code is
that it uses physical tracking of the reservation space to account
for both the space used in the journal as well as the reservations
held in memory by the CIL and active running transactions. Because
this in-memory reservation tracking requires byte-level granularity,
the "LSN" in which the grant head stores its location is split into 32
bits for the log cycle and 32 bits for the grant
head offset into the log.
Storing a byte count as the grant head offset into the log means
that we can only index 4GB of space with the grant head. This is one
of the primary limiting factors preventing us from increasing the
physical log size beyond 2GB. Hence to increase the physical log
size, we have to increase the space available for storing the grant
head.
Needing more physical space to store the grant head is an issue
because we use lockless atomic accounting for the grant head to
minimise the overhead of new incoming transaction reservations.
These have unbound concurrency, and hence any lock in the
reservation path will cause serious scalability issues. The lockless
accounting fast path was the solution to these scalability problems
that we had over a decade ago, and hence we know we cannot go back
to a lock based solution.
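For reference, the existing lockless fast path looks roughly like the
sketch below: the grant head packs a {cycle, byte offset} pair into a
single 64-bit value, so every update has to crack that value apart,
handle the cycle wrap and retry with cmpxchg until no other CPU has
raced with us. (Paraphrased from the current code purely for
illustration; it is not part of this series.)

static void
xlog_grant_add_space(
        struct xlog     *log,
        atomic64_t      *head,
        int             bytes)
{
        int64_t         head_val = atomic64_read(head);
        int64_t         new, old;

        do {
                int     cycle, space, tmp;

                /* split the packed value into cycle and byte offset */
                xlog_crack_grant_head_val(head_val, &cycle, &space);

                /* add the bytes, wrapping the offset into the next cycle */
                tmp = log->l_logsize - space;
                if (tmp > bytes)
                        space += bytes;
                else {
                        space = bytes - tmp;
                        cycle++;
                }

                old = head_val;
                new = xlog_assign_grant_head_val(cycle, space);
                head_val = atomic64_cmpxchg(head, old, new);
        } while (head_val != old);
}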
The simplest way I can describe how we track the log space is
as follows:
  l_tail_lsn           l_last_sync_lsn        grant head lsn
      |-----------------------|+++++++++++++++++++++|
      |    physical space     |   in memory space   |
      | - - - - - - xlog_space_left() - - - - - - - |
It is simple for the AIL to track the maximum LSN that has been
inserted into the AIL. If we do this, we no longer need to track
log->l_last_sync_lsn in the journal itself and we can always get the
physical space tracked by the journal directly from the AIL. The AIL
functions can calculate the "log tail space" dynamically when either
the log tail or the max LSN seen changes, thereby removing all need
for the log itself to track this state. Hence we now have:
  l_tail_lsn           ail_max_lsn_seen       grant head lsn
      |-----------------------|+++++++++++++++++++++|
      |   log->l_tail_space   |   in memory space   |
      | - - - - - - xlog_space_left() - - - - - - - |
And we've solved the problem of efficiently calculating the amount
of physical space the log is consuming. All that is left now is
calculating how much space we are consuming in memory.
Luckily for us, we've just added all the update hooks needed to do
this. From the above diagram, two things are obvious:
1. when the tail moves, only log->l_tail_space reduces
2. when the ail_max_lsn_seen increases, log->l_tail_space increases
and "in memory space" reduces by the same amount.
IOWs, we now have a mechanism that can transfer the in-memory
reservation space directly to the on-disk tail space accounting. At
this point, we can change the grant head from tracking physical
location to tracking a simple byte count:
  l_tail_lsn           ail_max_lsn_seen      grant head bytes
      |-----------------------|+++++++++++++++++++++|
      |   log->l_tail_space   |     grant space     |
      | - - - - - - xlog_space_left() - - - - - - - |
and xlog_space_left() simply changes to:
space left = log->l_logsize - grant space - log->l_tail_space;
All of the complex grant head cracking, combining and
compare/exchange code gets replaced by simple atomic add/sub
operations, and the grant heads can now track a full 64 bit byte
count. The fastpath reservation accounting is also much faster
because it is much simpler.
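As an illustrative sketch only (the final patches may shape the
helpers slightly differently, and l_tail_space is the new field
described above), the reservation fast path and the free space
calculation reduce to:

/* Reserving space is now a plain atomic add - no cracking,
 * combining or cmpxchg retry loop required. */
atomic64_add(bytes, &log->l_reserve_head.grant);

/* Free space is a simple subtraction of byte counts. */
static int64_t
xlog_space_left(
        struct xlog     *log,
        atomic64_t      *head)
{
        return log->l_logsize - log->l_tail_space -
                        atomic64_read(head);
}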
There's one little problem, though. The transaction reservation code
has to set the LSN target for the AIL to push to ensure that the log
tail keeps moving forward (xlog_grant_push_ail()), and the deferred
intent logging code also tries to keep abreast of the amount of
space available in the log via xlog_grant_push_threshold().
The AIL pushing problem is actually easy to solve - we don't need to
push the AIL from the transaction reservation code as the AIL
already tracks all the space used by the journal. All the
transaction reservation code does is try to keep 25% of the journal
physically free, and there's no reason why the AIL can't do that
itself.
Hence before we start changing any of the grant head accounting, we
remove all the AIL pushing hooks from the reservation code and let
the AIL determine the target it needs to push to itself. We also
allow the deferred intent logging code to determine if the AIL
should be tail pushing similar to how it currently checks if we are
running out of log space, so the intent relogging still works as it
should.
With these changes in place, there is no external code that is
dependent on the grant heads tracking physical space, and hence we
can then implement the change to pure in-memory reservation space
tracking in the grant heads.....
This all passes fstests for default and rmapbt enabled configs.
Performance tests also show good improvements where the transaction
accounting is the bottleneck.
Changes since v3:
- fix all review comments (Dave)
- add a new patch to skip flushing AIL items (Dave)
- rework XFS_AIL_OPSTATE_PUSH_ALL handling (Dave)
- misc checkpatch and minor coding style fixups (Christoph)
- clean up the grant head manipulation helpers (Christoph)
- rename the sysfs files so that xfstests can autodetect the new format
(Christoph)
- fix the contact address for xfs sysfs ABI entries (Christoph)
Changes since v2:
- rebase on 6.6-rc2 + linux-xfs/for-next
- cleaned up static warnings from build bot.
- fixed comment about minimum AIL push target.
- fixed whitespace problems in multiple patches.
Changes since v1:
- https://lore.kernel.org/linux-xfs/20220809230353.3353059-1-david@fromorbit.com/
- reorder moving xfs_trans_bulk_commit() patch to start of series
- fix failure to consider NULLCOMMITLSN push target in AIL
- grant space release based on ctx->start_lsn fails to release the space used in
the checkpoint that was just committed. Release needs to be based on
ctx->commit_lsn which is the end of the region that the checkpoint consumes in
the log
Diffstat:
Documentation/ABI/testing/sysfs-fs-xfs | 26 -
fs/xfs/libxfs/xfs_defer.c | 4
fs/xfs/xfs_inode.c | 1
fs/xfs/xfs_inode_item.c | 6
fs/xfs/xfs_log.c | 511 +++++++--------------------------
fs/xfs/xfs_log.h | 1
fs/xfs/xfs_log_cil.c | 177 +++++++++++
fs/xfs/xfs_log_priv.h | 61 +--
fs/xfs/xfs_log_recover.c | 23 -
fs/xfs/xfs_sysfs.c | 29 -
fs/xfs/xfs_trace.c | 1
fs/xfs/xfs_trace.h | 42 +-
fs/xfs/xfs_trans.c | 129 --------
fs/xfs/xfs_trans.h | 4
fs/xfs/xfs_trans_ail.c | 244 ++++++++-------
fs/xfs/xfs_trans_priv.h | 44 ++
16 files changed, 552 insertions(+), 751 deletions(-)
* [PATCH 01/11] xfs: fix the contact address for the sysfs ABI documentation
2024-06-20 7:21 xfs: byte-based grant head reservation tracking v4 Christoph Hellwig
@ 2024-06-20 7:21 ` Christoph Hellwig
2024-06-20 15:20 ` Darrick J. Wong
2024-06-20 7:21 ` [PATCH 02/11] xfs: move and rename xfs_trans_committed_bulk Christoph Hellwig
` (9 subsequent siblings)
10 siblings, 1 reply; 24+ messages in thread
From: Christoph Hellwig @ 2024-06-20 7:21 UTC (permalink / raw)
To: Chandan Babu R; +Cc: Darrick J. Wong, Dave Chinner, linux-xfs
oss.sgi.com is long dead, refer to the current linux-xfs list instead.
Signed-off-by: Christoph Hellwig <hch@lst.de>
---
Documentation/ABI/testing/sysfs-fs-xfs | 8 ++++----
1 file changed, 4 insertions(+), 4 deletions(-)
diff --git a/Documentation/ABI/testing/sysfs-fs-xfs b/Documentation/ABI/testing/sysfs-fs-xfs
index f704925f6fe93b..82d8e2f79834b5 100644
--- a/Documentation/ABI/testing/sysfs-fs-xfs
+++ b/Documentation/ABI/testing/sysfs-fs-xfs
@@ -1,7 +1,7 @@
What: /sys/fs/xfs/<disk>/log/log_head_lsn
Date: July 2014
KernelVersion: 3.17
-Contact: xfs@oss.sgi.com
+Contact: linux-xfs@vger.kernel.org
Description:
The log sequence number (LSN) of the current head of the
log. The LSN is exported in "cycle:basic block" format.
@@ -10,7 +10,7 @@ Users: xfstests
What: /sys/fs/xfs/<disk>/log/log_tail_lsn
Date: July 2014
KernelVersion: 3.17
-Contact: xfs@oss.sgi.com
+Contact: linux-xfs@vger.kernel.org
Description:
The log sequence number (LSN) of the current tail of the
log. The LSN is exported in "cycle:basic block" format.
@@ -18,7 +18,7 @@ Description:
What: /sys/fs/xfs/<disk>/log/reserve_grant_head
Date: July 2014
KernelVersion: 3.17
-Contact: xfs@oss.sgi.com
+Contact: linux-xfs@vger.kernel.org
Description:
The current state of the log reserve grant head. It
represents the total log reservation of all currently
@@ -29,7 +29,7 @@ Users: xfstests
What: /sys/fs/xfs/<disk>/log/write_grant_head
Date: July 2014
KernelVersion: 3.17
-Contact: xfs@oss.sgi.com
+Contact: linux-xfs@vger.kernel.org
Description:
The current state of the log write grant head. It
represents the total log reservation of all currently
--
2.43.0
* [PATCH 02/11] xfs: move and rename xfs_trans_committed_bulk
2024-06-20 7:21 xfs: byte-based grant head reservation tracking v4 Christoph Hellwig
2024-06-20 7:21 ` [PATCH 01/11] xfs: fix the contact address for the sysfs ABI documentation Christoph Hellwig
@ 2024-06-20 7:21 ` Christoph Hellwig
2024-06-20 7:21 ` [PATCH 03/11] xfs: AIL doesn't need manual pushing Christoph Hellwig
` (8 subsequent siblings)
10 siblings, 0 replies; 24+ messages in thread
From: Christoph Hellwig @ 2024-06-20 7:21 UTC (permalink / raw)
To: Chandan Babu R; +Cc: Darrick J. Wong, Dave Chinner, linux-xfs, Dave Chinner
From: Dave Chinner <dchinner@redhat.com>
Ever since the CIL and delayed logging were introduced,
xfs_trans_committed_bulk() has been a purely CIL checkpoint
completion function and not a transaction commit completion
function. Now that we are adding log specific updates to this
function, it really does not have anything to do with the
transaction subsystem - it is really log and log item level
functionality.
This should be part of the CIL code as it is the callback
that moves log items from the CIL checkpoint to the AIL. Move it
and rename it to xlog_cil_ail_insert().
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
---
fs/xfs/xfs_log_cil.c | 132 +++++++++++++++++++++++++++++++++++++++-
fs/xfs/xfs_trans.c | 129 ---------------------------------------
fs/xfs/xfs_trans_priv.h | 3 -
3 files changed, 131 insertions(+), 133 deletions(-)
diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
index f51cbc6405c15a..141bde08bd6e3c 100644
--- a/fs/xfs/xfs_log_cil.c
+++ b/fs/xfs/xfs_log_cil.c
@@ -694,6 +694,136 @@ xlog_cil_insert_items(
}
}
+static inline void
+xlog_cil_ail_insert_batch(
+ struct xfs_ail *ailp,
+ struct xfs_ail_cursor *cur,
+ struct xfs_log_item **log_items,
+ int nr_items,
+ xfs_lsn_t commit_lsn)
+{
+ int i;
+
+ spin_lock(&ailp->ail_lock);
+ /* xfs_trans_ail_update_bulk drops ailp->ail_lock */
+ xfs_trans_ail_update_bulk(ailp, cur, log_items, nr_items, commit_lsn);
+
+ for (i = 0; i < nr_items; i++) {
+ struct xfs_log_item *lip = log_items[i];
+
+ if (lip->li_ops->iop_unpin)
+ lip->li_ops->iop_unpin(lip, 0);
+ }
+}
+
+/*
+ * Take the checkpoint's log vector chain of items and insert the attached log
+ * items into the AIL. This uses bulk insertion techniques to minimise AIL lock
+ * traffic.
+ *
+ * If we are called with the aborted flag set, it is because a log write during
+ * a CIL checkpoint commit has failed. In this case, all the items in the
+ * checkpoint have already gone through iop_committed and iop_committing, which
+ * means that checkpoint commit abort handling is treated exactly the same as an
+ * iclog write error even though we haven't started any IO yet. Hence in this
+ * case all we need to do is iop_committed processing, followed by an
+ * iop_unpin(aborted) call.
+ *
+ * The AIL cursor is used to optimise the insert process. If commit_lsn is not
+ * at the end of the AIL, the insert cursor avoids the need to walk the AIL to
+ * find the insertion point on every xfs_log_item_batch_insert() call. This
+ * saves a lot of needless list walking and is a net win, even though it
+ * slightly increases that amount of AIL lock traffic to set it up and tear it
+ * down.
+ */
+static void
+xlog_cil_ail_insert(
+ struct xlog *log,
+ struct list_head *lv_chain,
+ xfs_lsn_t commit_lsn,
+ bool aborted)
+{
+#define LOG_ITEM_BATCH_SIZE 32
+ struct xfs_ail *ailp = log->l_ailp;
+ struct xfs_log_item *log_items[LOG_ITEM_BATCH_SIZE];
+ struct xfs_log_vec *lv;
+ struct xfs_ail_cursor cur;
+ int i = 0;
+
+ spin_lock(&ailp->ail_lock);
+ xfs_trans_ail_cursor_last(ailp, &cur, commit_lsn);
+ spin_unlock(&ailp->ail_lock);
+
+ /* unpin all the log items */
+ list_for_each_entry(lv, lv_chain, lv_list) {
+ struct xfs_log_item *lip = lv->lv_item;
+ xfs_lsn_t item_lsn;
+
+ if (aborted)
+ set_bit(XFS_LI_ABORTED, &lip->li_flags);
+
+ if (lip->li_ops->flags & XFS_ITEM_RELEASE_WHEN_COMMITTED) {
+ lip->li_ops->iop_release(lip);
+ continue;
+ }
+
+ if (lip->li_ops->iop_committed)
+ item_lsn = lip->li_ops->iop_committed(lip, commit_lsn);
+ else
+ item_lsn = commit_lsn;
+
+ /* item_lsn of -1 means the item needs no further processing */
+ if (XFS_LSN_CMP(item_lsn, (xfs_lsn_t)-1) == 0)
+ continue;
+
+ /*
+ * if we are aborting the operation, no point in inserting the
+ * object into the AIL as we are in a shutdown situation.
+ */
+ if (aborted) {
+ ASSERT(xlog_is_shutdown(ailp->ail_log));
+ if (lip->li_ops->iop_unpin)
+ lip->li_ops->iop_unpin(lip, 1);
+ continue;
+ }
+
+ if (item_lsn != commit_lsn) {
+
+ /*
+ * Not a bulk update option due to unusual item_lsn.
+ * Push into AIL immediately, rechecking the lsn once
+ * we have the ail lock. Then unpin the item. This does
+ * not affect the AIL cursor the bulk insert path is
+ * using.
+ */
+ spin_lock(&ailp->ail_lock);
+ if (XFS_LSN_CMP(item_lsn, lip->li_lsn) > 0)
+ xfs_trans_ail_update(ailp, lip, item_lsn);
+ else
+ spin_unlock(&ailp->ail_lock);
+ if (lip->li_ops->iop_unpin)
+ lip->li_ops->iop_unpin(lip, 0);
+ continue;
+ }
+
+ /* Item is a candidate for bulk AIL insert. */
+ log_items[i++] = lv->lv_item;
+ if (i >= LOG_ITEM_BATCH_SIZE) {
+ xlog_cil_ail_insert_batch(ailp, &cur, log_items,
+ LOG_ITEM_BATCH_SIZE, commit_lsn);
+ i = 0;
+ }
+ }
+
+ /* make sure we insert the remainder! */
+ if (i)
+ xlog_cil_ail_insert_batch(ailp, &cur, log_items, i, commit_lsn);
+
+ spin_lock(&ailp->ail_lock);
+ xfs_trans_ail_cursor_done(&cur);
+ spin_unlock(&ailp->ail_lock);
+}
+
static void
xlog_cil_free_logvec(
struct list_head *lv_chain)
@@ -733,7 +863,7 @@ xlog_cil_committed(
spin_unlock(&ctx->cil->xc_push_lock);
}
- xfs_trans_committed_bulk(ctx->cil->xc_log->l_ailp, &ctx->lv_chain,
+ xlog_cil_ail_insert(ctx->cil->xc_log, &ctx->lv_chain,
ctx->start_lsn, abort);
xfs_extent_busy_sort(&ctx->busy_extents.extent_list);
diff --git a/fs/xfs/xfs_trans.c b/fs/xfs/xfs_trans.c
index 828da4ac4316d6..bdf3704dc30118 100644
--- a/fs/xfs/xfs_trans.c
+++ b/fs/xfs/xfs_trans.c
@@ -725,135 +725,6 @@ xfs_trans_free_items(
}
}
-static inline void
-xfs_log_item_batch_insert(
- struct xfs_ail *ailp,
- struct xfs_ail_cursor *cur,
- struct xfs_log_item **log_items,
- int nr_items,
- xfs_lsn_t commit_lsn)
-{
- int i;
-
- spin_lock(&ailp->ail_lock);
- /* xfs_trans_ail_update_bulk drops ailp->ail_lock */
- xfs_trans_ail_update_bulk(ailp, cur, log_items, nr_items, commit_lsn);
-
- for (i = 0; i < nr_items; i++) {
- struct xfs_log_item *lip = log_items[i];
-
- if (lip->li_ops->iop_unpin)
- lip->li_ops->iop_unpin(lip, 0);
- }
-}
-
-/*
- * Bulk operation version of xfs_trans_committed that takes a log vector of
- * items to insert into the AIL. This uses bulk AIL insertion techniques to
- * minimise lock traffic.
- *
- * If we are called with the aborted flag set, it is because a log write during
- * a CIL checkpoint commit has failed. In this case, all the items in the
- * checkpoint have already gone through iop_committed and iop_committing, which
- * means that checkpoint commit abort handling is treated exactly the same
- * as an iclog write error even though we haven't started any IO yet. Hence in
- * this case all we need to do is iop_committed processing, followed by an
- * iop_unpin(aborted) call.
- *
- * The AIL cursor is used to optimise the insert process. If commit_lsn is not
- * at the end of the AIL, the insert cursor avoids the need to walk
- * the AIL to find the insertion point on every xfs_log_item_batch_insert()
- * call. This saves a lot of needless list walking and is a net win, even
- * though it slightly increases that amount of AIL lock traffic to set it up
- * and tear it down.
- */
-void
-xfs_trans_committed_bulk(
- struct xfs_ail *ailp,
- struct list_head *lv_chain,
- xfs_lsn_t commit_lsn,
- bool aborted)
-{
-#define LOG_ITEM_BATCH_SIZE 32
- struct xfs_log_item *log_items[LOG_ITEM_BATCH_SIZE];
- struct xfs_log_vec *lv;
- struct xfs_ail_cursor cur;
- int i = 0;
-
- spin_lock(&ailp->ail_lock);
- xfs_trans_ail_cursor_last(ailp, &cur, commit_lsn);
- spin_unlock(&ailp->ail_lock);
-
- /* unpin all the log items */
- list_for_each_entry(lv, lv_chain, lv_list) {
- struct xfs_log_item *lip = lv->lv_item;
- xfs_lsn_t item_lsn;
-
- if (aborted)
- set_bit(XFS_LI_ABORTED, &lip->li_flags);
-
- if (lip->li_ops->flags & XFS_ITEM_RELEASE_WHEN_COMMITTED) {
- lip->li_ops->iop_release(lip);
- continue;
- }
-
- if (lip->li_ops->iop_committed)
- item_lsn = lip->li_ops->iop_committed(lip, commit_lsn);
- else
- item_lsn = commit_lsn;
-
- /* item_lsn of -1 means the item needs no further processing */
- if (XFS_LSN_CMP(item_lsn, (xfs_lsn_t)-1) == 0)
- continue;
-
- /*
- * if we are aborting the operation, no point in inserting the
- * object into the AIL as we are in a shutdown situation.
- */
- if (aborted) {
- ASSERT(xlog_is_shutdown(ailp->ail_log));
- if (lip->li_ops->iop_unpin)
- lip->li_ops->iop_unpin(lip, 1);
- continue;
- }
-
- if (item_lsn != commit_lsn) {
-
- /*
- * Not a bulk update option due to unusual item_lsn.
- * Push into AIL immediately, rechecking the lsn once
- * we have the ail lock. Then unpin the item. This does
- * not affect the AIL cursor the bulk insert path is
- * using.
- */
- spin_lock(&ailp->ail_lock);
- if (XFS_LSN_CMP(item_lsn, lip->li_lsn) > 0)
- xfs_trans_ail_update(ailp, lip, item_lsn);
- else
- spin_unlock(&ailp->ail_lock);
- if (lip->li_ops->iop_unpin)
- lip->li_ops->iop_unpin(lip, 0);
- continue;
- }
-
- /* Item is a candidate for bulk AIL insert. */
- log_items[i++] = lv->lv_item;
- if (i >= LOG_ITEM_BATCH_SIZE) {
- xfs_log_item_batch_insert(ailp, &cur, log_items,
- LOG_ITEM_BATCH_SIZE, commit_lsn);
- i = 0;
- }
- }
-
- /* make sure we insert the remainder! */
- if (i)
- xfs_log_item_batch_insert(ailp, &cur, log_items, i, commit_lsn);
-
- spin_lock(&ailp->ail_lock);
- xfs_trans_ail_cursor_done(&cur);
- spin_unlock(&ailp->ail_lock);
-}
-
/*
* Sort transaction items prior to running precommit operations. This will
* attempt to order the items such that they will always be locked in the same
diff --git a/fs/xfs/xfs_trans_priv.h b/fs/xfs/xfs_trans_priv.h
index d5400150358e85..52a45f0a5ef173 100644
--- a/fs/xfs/xfs_trans_priv.h
+++ b/fs/xfs/xfs_trans_priv.h
@@ -19,9 +19,6 @@ void xfs_trans_add_item(struct xfs_trans *, struct xfs_log_item *);
void xfs_trans_del_item(struct xfs_log_item *);
void xfs_trans_unreserve_and_mod_sb(struct xfs_trans *tp);
-void xfs_trans_committed_bulk(struct xfs_ail *ailp,
- struct list_head *lv_chain,
- xfs_lsn_t commit_lsn, bool aborted);
/*
* AIL traversal cursor.
*
--
2.43.0
* [PATCH 03/11] xfs: AIL doesn't need manual pushing
2024-06-20 7:21 xfs: byte-based grant head reservation tracking v4 Christoph Hellwig
2024-06-20 7:21 ` [PATCH 01/11] xfs: fix the contact address for the sysfs ABI documentation Christoph Hellwig
2024-06-20 7:21 ` [PATCH 02/11] xfs: move and rename xfs_trans_committed_bulk Christoph Hellwig
@ 2024-06-20 7:21 ` Christoph Hellwig
2024-06-20 19:01 ` Darrick J. Wong
2024-06-20 7:21 ` [PATCH 04/11] xfs: background AIL push should target physical space Christoph Hellwig
` (7 subsequent siblings)
10 siblings, 1 reply; 24+ messages in thread
From: Christoph Hellwig @ 2024-06-20 7:21 UTC (permalink / raw)
To: Chandan Babu R; +Cc: Darrick J. Wong, Dave Chinner, linux-xfs, Dave Chinner
From: Dave Chinner <dchinner@redhat.com>
We have a mechanism that checks the amount of log space remaining
available every time we make a transaction reservation. If the
amount of space is below a threshold (25% free) we push on the AIL
to tell it to do more work. To do this, we end up calculating the
LSN that the AIL needs to push to on every reservation and updating
the push target for the AIL with that new target LSN.
This is silly and expensive. The AIL is perfectly capable of
calculating the push target itself, and it will always be running
when the AIL contains objects.
What the target does is determine if the AIL needs to do
any work before it goes back to sleep. If we haven't run out of
reservation space or memory (or hit some other push-all trigger), it
will simply go back to sleep for a while without doing anything if
there is more than 25% of the journal space free.
If there are items in the AIL at a lower LSN than the target, it
will try to push up to the target or to the point of getting stuck
before going back to sleep and trying again soon after.
Hence we can modify the AIL to calculate its own 25% push target
before it starts a push using the same reserve grant head based
calculation as is currently used, and remove all the places where we
ask the AIL to push to a new 25% free target. We can also drop the
minimum free space size of 256BBs from the calculation because
25% of a minimum sized log is *always* going to be larger than
256BBs.
This does still require a manual push in certain circumstances.
These circumstances arise when the AIL is not full, but the
reservation grants consume all of the free space in the log.
In this case, we still need to push on the AIL to free up space, so
when we hit this condition (i.e. reservation going to sleep to wait
on log space) we do a single push to tell the AIL it should empty
itself. This will keep the AIL moving as new reservations come in
and want more space, rather than keep queuing them and having to
push the AIL repeatedly.
The reason for using the "push all" when grant space runs out is
that we can run out of grant space when there is more than 25% of
the log free. Small logs are notorious for this, and we have a hack
in the log callback code (xlog_state_set_callback()) where we push
the AIL (because the *head* moved) to ensure that we kick the AIL
when we consume space in the log, because that can push us over the
"less than 25% available" threshold that starts tail pushing back up
again.
Hence when we run out of grant space and are going to sleep, we have
to consider that the grant space may be consuming almost all the log
space and there is almost nothing in the AIL. In this situation, the
AIL pins the tail and moving the tail forwards is the only way the
grant space will come available, so we have to force the AIL to push
everything to guarantee grant space will eventually be returned.
Hence triggering a "push all" just before sleeping removes all the
nasty corner cases we have in other parts of the code that work
around the "we didn't ask the AIL to push enough to free grant
space" condition that leads to log space hangs...
Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
fs/xfs/libxfs/xfs_defer.c | 4 +-
fs/xfs/xfs_log.c | 135 ++-----------------------------
fs/xfs/xfs_log.h | 1 -
fs/xfs/xfs_log_priv.h | 2 +
fs/xfs/xfs_trans_ail.c | 162 +++++++++++++++++---------------------
fs/xfs/xfs_trans_priv.h | 33 ++++++--
6 files changed, 108 insertions(+), 229 deletions(-)
diff --git a/fs/xfs/libxfs/xfs_defer.c b/fs/xfs/libxfs/xfs_defer.c
index 4a078e07e1a0a0..e2c8308d518b56 100644
--- a/fs/xfs/libxfs/xfs_defer.c
+++ b/fs/xfs/libxfs/xfs_defer.c
@@ -12,12 +12,14 @@
#include "xfs_mount.h"
#include "xfs_defer.h"
#include "xfs_trans.h"
+#include "xfs_trans_priv.h"
#include "xfs_buf_item.h"
#include "xfs_inode.h"
#include "xfs_inode_item.h"
#include "xfs_trace.h"
#include "xfs_icache.h"
#include "xfs_log.h"
+#include "xfs_log_priv.h"
#include "xfs_rmap.h"
#include "xfs_refcount.h"
#include "xfs_bmap.h"
@@ -556,7 +558,7 @@ xfs_defer_relog(
* the log threshold once per call.
*/
if (threshold_lsn == NULLCOMMITLSN) {
- threshold_lsn = xlog_grant_push_threshold(log, 0);
+ threshold_lsn = xfs_ail_push_target(log->l_ailp);
if (threshold_lsn == NULLCOMMITLSN)
break;
}
diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
index 416c154949832c..235fcf6dc4eeb5 100644
--- a/fs/xfs/xfs_log.c
+++ b/fs/xfs/xfs_log.c
@@ -30,10 +30,6 @@ xlog_alloc_log(
struct xfs_buftarg *log_target,
xfs_daddr_t blk_offset,
int num_bblks);
-STATIC int
-xlog_space_left(
- struct xlog *log,
- atomic64_t *head);
STATIC void
xlog_dealloc_log(
struct xlog *log);
@@ -51,10 +47,6 @@ xlog_state_get_iclog_space(
struct xlog_ticket *ticket,
int *logoffsetp);
STATIC void
-xlog_grant_push_ail(
- struct xlog *log,
- int need_bytes);
-STATIC void
xlog_sync(
struct xlog *log,
struct xlog_in_core *iclog,
@@ -242,42 +234,15 @@ xlog_grant_head_wake(
{
struct xlog_ticket *tic;
int need_bytes;
- bool woken_task = false;
list_for_each_entry(tic, &head->waiters, t_queue) {
-
- /*
- * There is a chance that the size of the CIL checkpoints in
- * progress at the last AIL push target calculation resulted in
- * limiting the target to the log head (l_last_sync_lsn) at the
- * time. This may not reflect where the log head is now as the
- * CIL checkpoints may have completed.
- *
- * Hence when we are woken here, it may be that the head of the
- * log that has moved rather than the tail. As the tail didn't
- * move, there still won't be space available for the
- * reservation we require. However, if the AIL has already
- * pushed to the target defined by the old log head location, we
- * will hang here waiting for something else to update the AIL
- * push target.
- *
- * Therefore, if there isn't space to wake the first waiter on
- * the grant head, we need to push the AIL again to ensure the
- * target reflects both the current log tail and log head
- * position before we wait for the tail to move again.
- */
-
need_bytes = xlog_ticket_reservation(log, head, tic);
- if (*free_bytes < need_bytes) {
- if (!woken_task)
- xlog_grant_push_ail(log, need_bytes);
+ if (*free_bytes < need_bytes)
return false;
- }
*free_bytes -= need_bytes;
trace_xfs_log_grant_wake_up(log, tic);
wake_up_process(tic->t_task);
- woken_task = true;
}
return true;
@@ -296,13 +261,15 @@ xlog_grant_head_wait(
do {
if (xlog_is_shutdown(log))
goto shutdown;
- xlog_grant_push_ail(log, need_bytes);
__set_current_state(TASK_UNINTERRUPTIBLE);
spin_unlock(&head->lock);
XFS_STATS_INC(log->l_mp, xs_sleep_logspace);
+ /* Push on the AIL to free up all the log space. */
+ xfs_ail_push_all(log->l_ailp);
+
trace_xfs_log_grant_sleep(log, tic);
schedule();
trace_xfs_log_grant_wake(log, tic);
@@ -418,9 +385,6 @@ xfs_log_regrant(
* of rolling transactions in the log easily.
*/
tic->t_tid++;
-
- xlog_grant_push_ail(log, tic->t_unit_res);
-
tic->t_curr_res = tic->t_unit_res;
if (tic->t_cnt > 0)
return 0;
@@ -477,12 +441,7 @@ xfs_log_reserve(
ASSERT(*ticp == NULL);
tic = xlog_ticket_alloc(log, unit_bytes, cnt, permanent);
*ticp = tic;
-
- xlog_grant_push_ail(log, tic->t_cnt ? tic->t_unit_res * tic->t_cnt
- : tic->t_unit_res);
-
trace_xfs_log_reserve(log, tic);
-
error = xlog_grant_head_check(log, &log->l_reserve_head, tic,
&need_bytes);
if (error)
@@ -1330,7 +1289,7 @@ xlog_assign_tail_lsn(
* shortcut invalidity asserts in this case so that we don't trigger them
* falsely.
*/
-STATIC int
+int
xlog_space_left(
struct xlog *log,
atomic64_t *head)
@@ -1667,89 +1626,6 @@ xlog_alloc_log(
return ERR_PTR(error);
} /* xlog_alloc_log */
-/*
- * Compute the LSN that we'd need to push the log tail towards in order to have
- * (a) enough on-disk log space to log the number of bytes specified, (b) at
- * least 25% of the log space free, and (c) at least 256 blocks free. If the
- * log free space already meets all three thresholds, this function returns
- * NULLCOMMITLSN.
- */
-xfs_lsn_t
-xlog_grant_push_threshold(
- struct xlog *log,
- int need_bytes)
-{
- xfs_lsn_t threshold_lsn = 0;
- xfs_lsn_t last_sync_lsn;
- int free_blocks;
- int free_bytes;
- int threshold_block;
- int threshold_cycle;
- int free_threshold;
-
- ASSERT(BTOBB(need_bytes) < log->l_logBBsize);
-
- free_bytes = xlog_space_left(log, &log->l_reserve_head.grant);
- free_blocks = BTOBBT(free_bytes);
-
- /*
- * Set the threshold for the minimum number of free blocks in the
- * log to the maximum of what the caller needs, one quarter of the
- * log, and 256 blocks.
- */
- free_threshold = BTOBB(need_bytes);
- free_threshold = max(free_threshold, (log->l_logBBsize >> 2));
- free_threshold = max(free_threshold, 256);
- if (free_blocks >= free_threshold)
- return NULLCOMMITLSN;
-
- xlog_crack_atomic_lsn(&log->l_tail_lsn, &threshold_cycle,
- &threshold_block);
- threshold_block += free_threshold;
- if (threshold_block >= log->l_logBBsize) {
- threshold_block -= log->l_logBBsize;
- threshold_cycle += 1;
- }
- threshold_lsn = xlog_assign_lsn(threshold_cycle,
- threshold_block);
- /*
- * Don't pass in an lsn greater than the lsn of the last
- * log record known to be on disk. Use a snapshot of the last sync lsn
- * so that it doesn't change between the compare and the set.
- */
- last_sync_lsn = atomic64_read(&log->l_last_sync_lsn);
- if (XFS_LSN_CMP(threshold_lsn, last_sync_lsn) > 0)
- threshold_lsn = last_sync_lsn;
-
- return threshold_lsn;
-}
-
-/*
- * Push the tail of the log if we need to do so to maintain the free log space
- * thresholds set out by xlog_grant_push_threshold. We may need to adopt a
- * policy which pushes on an lsn which is further along in the log once we
- * reach the high water mark. In this manner, we would be creating a low water
- * mark.
- */
-STATIC void
-xlog_grant_push_ail(
- struct xlog *log,
- int need_bytes)
-{
- xfs_lsn_t threshold_lsn;
-
- threshold_lsn = xlog_grant_push_threshold(log, need_bytes);
- if (threshold_lsn == NULLCOMMITLSN || xlog_is_shutdown(log))
- return;
-
- /*
- * Get the transaction layer to kick the dirty buffers out to
- * disk asynchronously. No point in trying to do this if
- * the filesystem is shutting down.
- */
- xfs_ail_push(log->l_ailp, threshold_lsn);
-}
-
/*
* Stamp cycle number in every block
*/
@@ -2712,7 +2588,6 @@ xlog_state_set_callback(
return;
atomic64_set(&log->l_last_sync_lsn, header_lsn);
- xlog_grant_push_ail(log, 0);
}
/*
diff --git a/fs/xfs/xfs_log.h b/fs/xfs/xfs_log.h
index d69acf881153d0..67c539cc9305c7 100644
--- a/fs/xfs/xfs_log.h
+++ b/fs/xfs/xfs_log.h
@@ -156,7 +156,6 @@ int xfs_log_quiesce(struct xfs_mount *mp);
void xfs_log_clean(struct xfs_mount *mp);
bool xfs_log_check_lsn(struct xfs_mount *, xfs_lsn_t);
-xfs_lsn_t xlog_grant_push_threshold(struct xlog *log, int need_bytes);
bool xlog_force_shutdown(struct xlog *log, uint32_t shutdown_flags);
int xfs_attr_use_log_assist(struct xfs_mount *mp);
diff --git a/fs/xfs/xfs_log_priv.h b/fs/xfs/xfs_log_priv.h
index 40e22ec0fbe69a..0482b11965e248 100644
--- a/fs/xfs/xfs_log_priv.h
+++ b/fs/xfs/xfs_log_priv.h
@@ -575,6 +575,8 @@ xlog_assign_grant_head(atomic64_t *head, int cycle, int space)
atomic64_set(head, xlog_assign_grant_head_val(cycle, space));
}
+int xlog_space_left(struct xlog *log, atomic64_t *head);
+
/*
* Committed Item List interfaces
*/
diff --git a/fs/xfs/xfs_trans_ail.c b/fs/xfs/xfs_trans_ail.c
index e4c343096f95a3..a6b6fca1d13852 100644
--- a/fs/xfs/xfs_trans_ail.c
+++ b/fs/xfs/xfs_trans_ail.c
@@ -134,25 +134,6 @@ xfs_ail_min_lsn(
return lsn;
}
-/*
- * Return the maximum lsn held in the AIL, or zero if the AIL is empty.
- */
-static xfs_lsn_t
-xfs_ail_max_lsn(
- struct xfs_ail *ailp)
-{
- xfs_lsn_t lsn = 0;
- struct xfs_log_item *lip;
-
- spin_lock(&ailp->ail_lock);
- lip = xfs_ail_max(ailp);
- if (lip)
- lsn = lip->li_lsn;
- spin_unlock(&ailp->ail_lock);
-
- return lsn;
-}
-
/*
* The cursor keeps track of where our current traversal is up to by tracking
* the next item in the list for us. However, for this to be safe, removing an
@@ -414,6 +395,56 @@ xfsaild_push_item(
return lip->li_ops->iop_push(lip, &ailp->ail_buf_list);
}
+/*
+ * Compute the LSN that we'd need to push the log tail towards in order to have
+ * at least 25% of the log space free. If the log free space already meets this
+ * threshold, this function returns NULLCOMMITLSN.
+ */
+xfs_lsn_t
+__xfs_ail_push_target(
+ struct xfs_ail *ailp)
+{
+ struct xlog *log = ailp->ail_log;
+ xfs_lsn_t threshold_lsn = 0;
+ xfs_lsn_t last_sync_lsn;
+ int free_blocks;
+ int free_bytes;
+ int threshold_block;
+ int threshold_cycle;
+ int free_threshold;
+
+ free_bytes = xlog_space_left(log, &log->l_reserve_head.grant);
+ free_blocks = BTOBBT(free_bytes);
+
+ /*
+ * The threshold for the minimum number of free blocks is one quarter of
+ * the entire log space.
+ */
+ free_threshold = log->l_logBBsize >> 2;
+ if (free_blocks >= free_threshold)
+ return NULLCOMMITLSN;
+
+ xlog_crack_atomic_lsn(&log->l_tail_lsn, &threshold_cycle,
+ &threshold_block);
+ threshold_block += free_threshold;
+ if (threshold_block >= log->l_logBBsize) {
+ threshold_block -= log->l_logBBsize;
+ threshold_cycle += 1;
+ }
+ threshold_lsn = xlog_assign_lsn(threshold_cycle,
+ threshold_block);
+ /*
+ * Don't pass in an lsn greater than the lsn of the last
+ * log record known to be on disk. Use a snapshot of the last sync lsn
+ * so that it doesn't change between the compare and the set.
+ */
+ last_sync_lsn = atomic64_read(&log->l_last_sync_lsn);
+ if (XFS_LSN_CMP(threshold_lsn, last_sync_lsn) > 0)
+ threshold_lsn = last_sync_lsn;
+
+ return threshold_lsn;
+}
+
static long
xfsaild_push(
struct xfs_ail *ailp)
@@ -454,21 +485,24 @@ xfsaild_push(
* capture updates that occur after the sync push waiter has gone to
* sleep.
*/
- if (waitqueue_active(&ailp->ail_empty)) {
+ if (test_bit(XFS_AIL_OPSTATE_PUSH_ALL, &ailp->ail_opstate) ||
+ waitqueue_active(&ailp->ail_empty)) {
lip = xfs_ail_max(ailp);
if (lip)
target = lip->li_lsn;
+ else
+ clear_bit(XFS_AIL_OPSTATE_PUSH_ALL, &ailp->ail_opstate);
} else {
- /* barrier matches the ail_target update in xfs_ail_push() */
- smp_rmb();
- target = ailp->ail_target;
- ailp->ail_target_prev = target;
+ target = __xfs_ail_push_target(ailp);
}
+ if (target == NULLCOMMITLSN)
+ goto out_done;
+
/* we're done if the AIL is empty or our push has reached the end */
lip = xfs_trans_ail_cursor_first(ailp, &cur, ailp->ail_last_pushed_lsn);
if (!lip)
- goto out_done;
+ goto out_done_cursor;
XFS_STATS_INC(mp, xs_push_ail);
@@ -553,8 +587,9 @@ xfsaild_push(
lsn = lip->li_lsn;
}
-out_done:
+out_done_cursor:
xfs_trans_ail_cursor_done(&cur);
+out_done:
spin_unlock(&ailp->ail_lock);
if (xfs_buf_delwri_submit_nowait(&ailp->ail_buf_list))
@@ -603,7 +638,7 @@ xfsaild(
set_freezable();
while (1) {
- if (tout && tout <= 20)
+ if (tout)
set_current_state(TASK_KILLABLE|TASK_FREEZABLE);
else
set_current_state(TASK_INTERRUPTIBLE|TASK_FREEZABLE);
@@ -639,21 +674,9 @@ xfsaild(
break;
}
+ /* Idle if the AIL is empty. */
spin_lock(&ailp->ail_lock);
-
- /*
- * Idle if the AIL is empty and we are not racing with a target
- * update. We check the AIL after we set the task to a sleep
- * state to guarantee that we either catch an ail_target update
- * or that a wake_up resets the state to TASK_RUNNING.
- * Otherwise, we run the risk of sleeping indefinitely.
- *
- * The barrier matches the ail_target update in xfs_ail_push().
- */
- smp_rmb();
- if (!xfs_ail_min(ailp) &&
- ailp->ail_target == ailp->ail_target_prev &&
- list_empty(&ailp->ail_buf_list)) {
+ if (!xfs_ail_min(ailp) && list_empty(&ailp->ail_buf_list)) {
spin_unlock(&ailp->ail_lock);
schedule();
tout = 0;
@@ -675,56 +698,6 @@ xfsaild(
return 0;
}
-/*
- * This routine is called to move the tail of the AIL forward. It does this by
- * trying to flush items in the AIL whose lsns are below the given
- * threshold_lsn.
- *
- * The push is run asynchronously in a workqueue, which means the caller needs
- * to handle waiting on the async flush for space to become available.
- * We don't want to interrupt any push that is in progress, hence we only queue
- * work if we set the pushing bit appropriately.
- *
- * We do this unlocked - we only need to know whether there is anything in the
- * AIL at the time we are called. We don't need to access the contents of
- * any of the objects, so the lock is not needed.
- */
-void
-xfs_ail_push(
- struct xfs_ail *ailp,
- xfs_lsn_t threshold_lsn)
-{
- struct xfs_log_item *lip;
-
- lip = xfs_ail_min(ailp);
- if (!lip || xlog_is_shutdown(ailp->ail_log) ||
- XFS_LSN_CMP(threshold_lsn, ailp->ail_target) <= 0)
- return;
-
- /*
- * Ensure that the new target is noticed in push code before it clears
- * the XFS_AIL_PUSHING_BIT.
- */
- smp_wmb();
- xfs_trans_ail_copy_lsn(ailp, &ailp->ail_target, &threshold_lsn);
- smp_wmb();
-
- wake_up_process(ailp->ail_task);
-}
-
-/*
- * Push out all items in the AIL immediately
- */
-void
-xfs_ail_push_all(
- struct xfs_ail *ailp)
-{
- xfs_lsn_t threshold_lsn = xfs_ail_max_lsn(ailp);
-
- if (threshold_lsn)
- xfs_ail_push(ailp, threshold_lsn);
-}
-
/*
* Push out all items in the AIL immediately and wait until the AIL is empty.
*/
@@ -829,6 +802,13 @@ xfs_trans_ail_update_bulk(
if (!list_empty(&tmp))
xfs_ail_splice(ailp, cur, &tmp, lsn);
+ /*
+ * If this is the first insert, wake up the push daemon so it can
+ * actively scan for items to push.
+ */
+ if (!mlip)
+ wake_up_process(ailp->ail_task);
+
xfs_ail_update_finish(ailp, tail_lsn);
}
diff --git a/fs/xfs/xfs_trans_priv.h b/fs/xfs/xfs_trans_priv.h
index 52a45f0a5ef173..9a131e7fae9467 100644
--- a/fs/xfs/xfs_trans_priv.h
+++ b/fs/xfs/xfs_trans_priv.h
@@ -52,16 +52,18 @@ struct xfs_ail {
struct xlog *ail_log;
struct task_struct *ail_task;
struct list_head ail_head;
- xfs_lsn_t ail_target;
- xfs_lsn_t ail_target_prev;
struct list_head ail_cursors;
spinlock_t ail_lock;
xfs_lsn_t ail_last_pushed_lsn;
int ail_log_flush;
+ unsigned long ail_opstate;
struct list_head ail_buf_list;
wait_queue_head_t ail_empty;
};
+/* Push all items out of the AIL immediately. */
+#define XFS_AIL_OPSTATE_PUSH_ALL 0u
+
/*
* From xfs_trans_ail.c
*/
@@ -98,10 +100,29 @@ void xfs_ail_update_finish(struct xfs_ail *ailp, xfs_lsn_t old_lsn)
__releases(ailp->ail_lock);
void xfs_trans_ail_delete(struct xfs_log_item *lip, int shutdown_type);
-void xfs_ail_push(struct xfs_ail *, xfs_lsn_t);
-void xfs_ail_push_all(struct xfs_ail *);
-void xfs_ail_push_all_sync(struct xfs_ail *);
-struct xfs_log_item *xfs_ail_min(struct xfs_ail *ailp);
+static inline void xfs_ail_push(struct xfs_ail *ailp)
+{
+ wake_up_process(ailp->ail_task);
+}
+
+static inline void xfs_ail_push_all(struct xfs_ail *ailp)
+{
+ if (!test_and_set_bit(XFS_AIL_OPSTATE_PUSH_ALL, &ailp->ail_opstate))
+ xfs_ail_push(ailp);
+}
+
+xfs_lsn_t __xfs_ail_push_target(struct xfs_ail *ailp);
+static inline xfs_lsn_t xfs_ail_push_target(struct xfs_ail *ailp)
+{
+ xfs_lsn_t lsn;
+
+ spin_lock(&ailp->ail_lock);
+ lsn = __xfs_ail_push_target(ailp);
+ spin_unlock(&ailp->ail_lock);
+ return lsn;
+}
+
+void xfs_ail_push_all_sync(struct xfs_ail *ailp);
xfs_lsn_t xfs_ail_min_lsn(struct xfs_ail *ailp);
struct xfs_log_item * xfs_trans_ail_cursor_first(struct xfs_ail *ailp,
--
2.43.0
* [PATCH 04/11] xfs: background AIL push should target physical space
2024-06-20 7:21 xfs: byte-based grant head reservation tracking v4 Christoph Hellwig
` (2 preceding siblings ...)
2024-06-20 7:21 ` [PATCH 03/11] xfs: AIL doesn't need manual pushing Christoph Hellwig
@ 2024-06-20 7:21 ` Christoph Hellwig
2024-06-20 19:42 ` Darrick J. Wong
2024-06-20 7:21 ` [PATCH 05/11] xfs: ensure log tail is always up to date Christoph Hellwig
` (6 subsequent siblings)
10 siblings, 1 reply; 24+ messages in thread
From: Christoph Hellwig @ 2024-06-20 7:21 UTC (permalink / raw)
To: Chandan Babu R; +Cc: Darrick J. Wong, Dave Chinner, linux-xfs, Dave Chinner
From: Dave Chinner <dchinner@redhat.com>
Currently the AIL attempts to keep 25% of the "log space" free,
where the current used space is tracked by the reserve grant head.
That is, it tracks both the physical space used and the amount reserved
by transactions in progress.
When we start tail pushing, we are trying to make space for new
reservations by writing back older metadata. The log is generally
physically full of dirty metadata, and reservations for modifications
in flight take up whatever space the AIL can physically free up.
Hence we don't really need to take into account the reservation
space that has been used - we just need to keep the log tail moving
as fast as we can to free up space for more reservations to be made.
We know exactly how much physical space the journal is consuming in
the AIL (i.e. max LSN - min LSN) so we can base push thresholds
directly on this state rather than have to look at grant head
reservations to determine how much to physically push out of the
log.
This also allows code that needs to know if log items in the current
transaction need to be pushed or re-logged to simply sample the
current target - it doesn't need to calculate the current target
itself. This avoids the need for any locking when doing such
checks.
Further, moving to a physical target means we don't need "push all
until empty semantics" like were introduced in the previous patch.
We can now test and clear the "push all" as a one-shot command to
set the target to the current head of the AIL. This allows the
xfsaild to maximise the use of log space right up to the point where
conditions indicate that the xfsaild is not keeping up with load and
it needs to work harder, and as soon as those constraints go away
(i.e. external code no longer needs everything pushed) the xfsaild
will return to maintaining the normal 25% free space thresholds.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
fs/xfs/libxfs/xfs_defer.c | 2 +-
fs/xfs/xfs_log_priv.h | 18 ++++++
fs/xfs/xfs_trans_ail.c | 116 +++++++++++++++++++-------------------
fs/xfs/xfs_trans_priv.h | 11 +---
4 files changed, 80 insertions(+), 67 deletions(-)
diff --git a/fs/xfs/libxfs/xfs_defer.c b/fs/xfs/libxfs/xfs_defer.c
index e2c8308d518b56..40021849b42f0a 100644
--- a/fs/xfs/libxfs/xfs_defer.c
+++ b/fs/xfs/libxfs/xfs_defer.c
@@ -558,7 +558,7 @@ xfs_defer_relog(
* the log threshold once per call.
*/
if (threshold_lsn == NULLCOMMITLSN) {
- threshold_lsn = xfs_ail_push_target(log->l_ailp);
+ threshold_lsn = xfs_ail_get_push_target(log->l_ailp);
if (threshold_lsn == NULLCOMMITLSN)
break;
}
diff --git a/fs/xfs/xfs_log_priv.h b/fs/xfs/xfs_log_priv.h
index 0482b11965e248..971871b84d8436 100644
--- a/fs/xfs/xfs_log_priv.h
+++ b/fs/xfs/xfs_log_priv.h
@@ -625,6 +625,24 @@ xlog_wait(
int xlog_wait_on_iclog(struct xlog_in_core *iclog)
__releases(iclog->ic_log->l_icloglock);
+/* Calculate the distance between two LSNs in bytes */
+static inline uint64_t
+xlog_lsn_sub(
+ struct xlog *log,
+ xfs_lsn_t high,
+ xfs_lsn_t low)
+{
+ uint32_t hi_cycle = CYCLE_LSN(high);
+ uint32_t hi_block = BLOCK_LSN(high);
+ uint32_t lo_cycle = CYCLE_LSN(low);
+ uint32_t lo_block = BLOCK_LSN(low);
+
+ if (hi_cycle == lo_cycle)
+ return BBTOB(hi_block - lo_block);
+ ASSERT((hi_cycle == lo_cycle + 1) || xlog_is_shutdown(log));
+ return (uint64_t)log->l_logsize - BBTOB(lo_block - hi_block);
+}
+
/*
* The LSN is valid so long as it is behind the current LSN. If it isn't, this
* means that the next log record that includes this metadata could have a
diff --git a/fs/xfs/xfs_trans_ail.c b/fs/xfs/xfs_trans_ail.c
index a6b6fca1d13852..26d4d9b3e35789 100644
--- a/fs/xfs/xfs_trans_ail.c
+++ b/fs/xfs/xfs_trans_ail.c
@@ -398,51 +398,69 @@ xfsaild_push_item(
/*
* Compute the LSN that we'd need to push the log tail towards in order to have
* at least 25% of the log space free. If the log free space already meets this
- * threshold, this function returns NULLCOMMITLSN.
+ * threshold, this function returns the lowest LSN in the AIL to slowly keep
+ * writeback ticking over and the tail of the log moving forward.
*/
-xfs_lsn_t
-__xfs_ail_push_target(
+static xfs_lsn_t
+xfs_ail_calc_push_target(
struct xfs_ail *ailp)
{
- struct xlog *log = ailp->ail_log;
- xfs_lsn_t threshold_lsn = 0;
- xfs_lsn_t last_sync_lsn;
- int free_blocks;
- int free_bytes;
- int threshold_block;
- int threshold_cycle;
- int free_threshold;
-
- free_bytes = xlog_space_left(log, &log->l_reserve_head.grant);
- free_blocks = BTOBBT(free_bytes);
+ struct xlog *log = ailp->ail_log;
+ struct xfs_log_item *lip;
+ xfs_lsn_t target_lsn;
+ xfs_lsn_t max_lsn;
+ xfs_lsn_t min_lsn;
+ int32_t free_bytes;
+ uint32_t target_block;
+ uint32_t target_cycle;
+
+ lockdep_assert_held(&ailp->ail_lock);
+
+ lip = xfs_ail_max(ailp);
+ if (!lip)
+ return NULLCOMMITLSN;
+
+ max_lsn = lip->li_lsn;
+ min_lsn = __xfs_ail_min_lsn(ailp);
/*
- * The threshold for the minimum number of free blocks is one quarter of
- * the entire log space.
+ * If we are supposed to push all the items in the AIL, we want to push
+ * to the current head. We then clear the push flag so that we don't
+ * keep pushing newly queued items beyond where the push all command was
+ * run. If the push waiter wants to empty the ail, it should queue
+ * itself on the ail_empty wait queue.
*/
- free_threshold = log->l_logBBsize >> 2;
- if (free_blocks >= free_threshold)
- return NULLCOMMITLSN;
+ if (test_and_clear_bit(XFS_AIL_OPSTATE_PUSH_ALL, &ailp->ail_opstate))
+ return max_lsn;
+
+ /* If someone wants the AIL empty, keep pushing everything we have. */
+ if (waitqueue_active(&ailp->ail_empty))
+ return max_lsn;
- xlog_crack_atomic_lsn(&log->l_tail_lsn, &threshold_cycle,
- &threshold_block);
- threshold_block += free_threshold;
- if (threshold_block >= log->l_logBBsize) {
- threshold_block -= log->l_logBBsize;
- threshold_cycle += 1;
- }
- threshold_lsn = xlog_assign_lsn(threshold_cycle,
- threshold_block);
/*
- * Don't pass in an lsn greater than the lsn of the last
- * log record known to be on disk. Use a snapshot of the last sync lsn
- * so that it doesn't change between the compare and the set.
+ * Background pushing - attempt to keep 25% of the log free and if we
+ * have that much free retain the existing target.
*/
- last_sync_lsn = atomic64_read(&log->l_last_sync_lsn);
- if (XFS_LSN_CMP(threshold_lsn, last_sync_lsn) > 0)
- threshold_lsn = last_sync_lsn;
+ free_bytes = log->l_logsize - xlog_lsn_sub(log, max_lsn, min_lsn);
+ if (free_bytes >= log->l_logsize >> 2)
+ return ailp->ail_target;
+
+ target_cycle = CYCLE_LSN(min_lsn);
+ target_block = BLOCK_LSN(min_lsn) + (log->l_logBBsize >> 2);
+ if (target_block >= log->l_logBBsize) {
+ target_block -= log->l_logBBsize;
+ target_cycle += 1;
+ }
+ target_lsn = xlog_assign_lsn(target_cycle, target_block);
+
+ /* Cap the target to the highest LSN known to be in the AIL. */
+ if (XFS_LSN_CMP(target_lsn, max_lsn) > 0)
+ return max_lsn;
- return threshold_lsn;
+ /* If the existing target is higher than the new target, keep it. */
+ if (XFS_LSN_CMP(ailp->ail_target, target_lsn) >= 0)
+ return ailp->ail_target;
+ return target_lsn;
}
static long
@@ -453,7 +471,6 @@ xfsaild_push(
struct xfs_ail_cursor cur;
struct xfs_log_item *lip;
xfs_lsn_t lsn;
- xfs_lsn_t target = NULLCOMMITLSN;
long tout;
int stuck = 0;
int flushing = 0;
@@ -478,25 +495,8 @@ xfsaild_push(
}
spin_lock(&ailp->ail_lock);
-
- /*
- * If we have a sync push waiter, we always have to push till the AIL is
- * empty. Update the target to point to the end of the AIL so that
- * capture updates that occur after the sync push waiter has gone to
- * sleep.
- */
- if (test_bit(XFS_AIL_OPSTATE_PUSH_ALL, &ailp->ail_opstate) ||
- waitqueue_active(&ailp->ail_empty)) {
- lip = xfs_ail_max(ailp);
- if (lip)
- target = lip->li_lsn;
- else
- clear_bit(XFS_AIL_OPSTATE_PUSH_ALL, &ailp->ail_opstate);
- } else {
- target = __xfs_ail_push_target(ailp);
- }
-
- if (target == NULLCOMMITLSN)
+ WRITE_ONCE(ailp->ail_target, xfs_ail_calc_push_target(ailp));
+ if (ailp->ail_target == NULLCOMMITLSN)
goto out_done;
/* we're done if the AIL is empty or our push has reached the end */
@@ -506,10 +506,10 @@ xfsaild_push(
XFS_STATS_INC(mp, xs_push_ail);
- ASSERT(target != NULLCOMMITLSN);
+ ASSERT(ailp->ail_target != NULLCOMMITLSN);
lsn = lip->li_lsn;
- while ((XFS_LSN_CMP(lip->li_lsn, target) <= 0)) {
+ while ((XFS_LSN_CMP(lip->li_lsn, ailp->ail_target) <= 0)) {
int lock_result;
/*
@@ -595,7 +595,7 @@ xfsaild_push(
if (xfs_buf_delwri_submit_nowait(&ailp->ail_buf_list))
ailp->ail_log_flush++;
- if (!count || XFS_LSN_CMP(lsn, target) >= 0) {
+ if (!count || XFS_LSN_CMP(lsn, ailp->ail_target) >= 0) {
/*
* We reached the target or the AIL is empty, so wait a bit
* longer for I/O to complete and remove pushed items from the
diff --git a/fs/xfs/xfs_trans_priv.h b/fs/xfs/xfs_trans_priv.h
index 9a131e7fae9467..60b4707c3a6583 100644
--- a/fs/xfs/xfs_trans_priv.h
+++ b/fs/xfs/xfs_trans_priv.h
@@ -59,6 +59,7 @@ struct xfs_ail {
unsigned long ail_opstate;
struct list_head ail_buf_list;
wait_queue_head_t ail_empty;
+ xfs_lsn_t ail_target;
};
/* Push all items out of the AIL immediately. */
@@ -111,15 +112,9 @@ static inline void xfs_ail_push_all(struct xfs_ail *ailp)
xfs_ail_push(ailp);
}
-xfs_lsn_t __xfs_ail_push_target(struct xfs_ail *ailp);
-static inline xfs_lsn_t xfs_ail_push_target(struct xfs_ail *ailp)
+static inline xfs_lsn_t xfs_ail_get_push_target(struct xfs_ail *ailp)
{
- xfs_lsn_t lsn;
-
- spin_lock(&ailp->ail_lock);
- lsn = __xfs_ail_push_target(ailp);
- spin_unlock(&ailp->ail_lock);
- return lsn;
+ return READ_ONCE(ailp->ail_target);
}
void xfs_ail_push_all_sync(struct xfs_ail *ailp);
--
2.43.0
* [PATCH 05/11] xfs: ensure log tail is always up to date
2024-06-20 7:21 xfs: byte-based grant head reservation tracking v4 Christoph Hellwig
` (3 preceding siblings ...)
2024-06-20 7:21 ` [PATCH 04/11] xfs: background AIL push should target physical space Christoph Hellwig
@ 2024-06-20 7:21 ` Christoph Hellwig
2024-06-20 7:21 ` [PATCH 06/11] xfs: l_last_sync_lsn is really AIL state Christoph Hellwig
` (5 subsequent siblings)
10 siblings, 0 replies; 24+ messages in thread
From: Christoph Hellwig @ 2024-06-20 7:21 UTC (permalink / raw)
To: Chandan Babu R; +Cc: Darrick J. Wong, Dave Chinner, linux-xfs, Dave Chinner
From: Dave Chinner <dchinner@redhat.com>
Whenever we write an iclog, we call xlog_assign_tail_lsn() to update
the current tail before we write it into the iclog header. This
means we have to take the AIL lock on every iclog write just to
check if the tail of the log has moved.
This doesn't avoid races with log tail updates - the log tail could
move immediately after we assign the tail to the iclog header and
hence by the time the iclog reaches stable storage the tail LSN has
moved forward in memory. Hence the log tail LSN in the iclog header
is really just a point in time snapshot of the current state of the
AIL.
With this in mind, if we simply update the in memory log->l_tail_lsn
every time it changes in the AIL, there is no need to update the in
memory value when we are writing it into an iclog - it will already
be up-to-date in memory and checking the AIL again will not change
this. Hence xlog_state_release_iclog() does not need to check the
AIL to update the tail lsn and can just sample it directly without
needing to take the AIL lock.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
fs/xfs/xfs_log.c | 5 ++---
fs/xfs/xfs_trans_ail.c | 17 +++++++++++++++--
2 files changed, 17 insertions(+), 5 deletions(-)
diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
index 235fcf6dc4eeb5..ae22f361627fe4 100644
--- a/fs/xfs/xfs_log.c
+++ b/fs/xfs/xfs_log.c
@@ -530,7 +530,6 @@ xlog_state_release_iclog(
struct xlog_in_core *iclog,
struct xlog_ticket *ticket)
{
- xfs_lsn_t tail_lsn;
bool last_ref;
lockdep_assert_held(&log->l_icloglock);
@@ -545,8 +544,8 @@ xlog_state_release_iclog(
if ((iclog->ic_state == XLOG_STATE_WANT_SYNC ||
(iclog->ic_flags & XLOG_ICL_NEED_FUA)) &&
!iclog->ic_header.h_tail_lsn) {
- tail_lsn = xlog_assign_tail_lsn(log->l_mp);
- iclog->ic_header.h_tail_lsn = cpu_to_be64(tail_lsn);
+ iclog->ic_header.h_tail_lsn =
+ cpu_to_be64(atomic64_read(&log->l_tail_lsn));
}
last_ref = atomic_dec_and_test(&iclog->ic_refcnt);
diff --git a/fs/xfs/xfs_trans_ail.c b/fs/xfs/xfs_trans_ail.c
index 26d4d9b3e35789..7d6ccd21aae2e5 100644
--- a/fs/xfs/xfs_trans_ail.c
+++ b/fs/xfs/xfs_trans_ail.c
@@ -720,6 +720,13 @@ xfs_ail_push_all_sync(
finish_wait(&ailp->ail_empty, &wait);
}
+/*
+ * Callers should pass the original tail lsn so that we can detect if the tail
+ * has moved as a result of the operation that was performed. If the caller
+ * needs to force a tail LSN update, it should pass NULLCOMMITLSN to bypass the
+ * "did the tail LSN change?" checks. If the caller wants to avoid a tail update
+ * (e.g. it knows the tail did not change) it should pass an @old_lsn of 0.
+ */
void
xfs_ail_update_finish(
struct xfs_ail *ailp,
@@ -804,10 +811,16 @@ xfs_trans_ail_update_bulk(
/*
* If this is the first insert, wake up the push daemon so it can
- * actively scan for items to push.
+ * actively scan for items to push. We also need to do a log tail
+ * LSN update to ensure that it is correctly tracked by the log, so
+ * set the tail_lsn to NULLCOMMITLSN so that xfs_ail_update_finish()
+ * will see that the tail lsn has changed and will update the tail
+ * appropriately.
*/
- if (!mlip)
+ if (!mlip) {
wake_up_process(ailp->ail_task);
+ tail_lsn = NULLCOMMITLSN;
+ }
xfs_ail_update_finish(ailp, tail_lsn);
}
--
2.43.0
* [PATCH 06/11] xfs: l_last_sync_lsn is really AIL state
2024-06-20 7:21 xfs: byte-based grant head reservation tracking v4 Christoph Hellwig
` (4 preceding siblings ...)
2024-06-20 7:21 ` [PATCH 05/11] xfs: ensure log tail is always up to date Christoph Hellwig
@ 2024-06-20 7:21 ` Christoph Hellwig
2024-06-20 7:21 ` [PATCH 07/11] xfs: collapse xlog_state_set_callback in caller Christoph Hellwig
` (4 subsequent siblings)
10 siblings, 0 replies; 24+ messages in thread
From: Christoph Hellwig @ 2024-06-20 7:21 UTC (permalink / raw)
To: Chandan Babu R; +Cc: Darrick J. Wong, Dave Chinner, linux-xfs, Dave Chinner
From: Dave Chinner <dchinner@redhat.com>
The current implementation of xlog_assign_tail_lsn() assumes that
when the AIL is empty, the log tail matches the LSN of the last
written commit record. This is recorded in xlog_state_set_callback()
as log->l_last_sync_lsn when the iclog state changes to
XLOG_STATE_CALLBACK. This change is then immediately followed by
running the callbacks on the iclog which then insert the log items
into the AIL at the "commit lsn" of that checkpoint.
The AIL tracks log items via the start record LSN of the checkpoint,
not the commit record LSN. This is because we can pipeline multiple
checkpoints, and so the start record of checkpoint N+1 can be
written before the commit record of checkpoint N. i.e:
     start N                  commit N
       +-------------+------------+----------------+
                 start N+1                    commit N+1
The tail of the log cannot be moved to the LSN of commit N when all
the items of that checkpoint are written back, because then the
start record for N+1 is no longer in the active portion of the log
and recovery will fail/corrupt the filesystem.
Hence when all the log items in checkpoint N are written back, the
tail of the log must now only move as far forwards as the start LSN
of checkpoint N+1.
Hence we cannot use the maximum start record LSN the AIL sees as a
replacement for the pointer to the current head of the on-disk log
records. However, we currently only use the l_last_sync_lsn when the
AIL is empty - when there is no start LSN remaining, the tail of the
log moves to the LSN of the last commit record as this is where
recovery needs to start searching for recoverable records. The next
checkpoint will have a start record LSN that is higher than
l_last_sync_lsn, and so everything still works correctly when new
checkpoints are written to an otherwise empty log.
l_last_sync_lsn is an atomic variable because it is currently
updated when an iclog with callbacks attached moves to the CALLBACK
state. While we hold the icloglock at this point, we don't hold the
AIL lock. When we assign the log tail, we hold the AIL lock, not the
icloglock because we have to look up the AIL. Hence it is an atomic
variable so it's not bound to a specific lock context.
However, the iclog callbacks are only used for CIL checkpoints. We
don't use callbacks with unmount record writes, so the
l_last_sync_lsn variable only gets updated when we are processing
CIL checkpoint callbacks. And those callbacks run under AIL lock
contexts, not icloglock context. The CIL checkpoint already knows the
LSN of the iclog its commit record was written to (obtained when the
commit record was written into the iclog before submission), so we can
update the l_last_sync_lsn under the AIL lock in this callback. No other
iclog callbacks will run until the currently executing one
completes, and hence we can update the l_last_sync_lsn under the AIL
lock safely.
This means l_last_sync_lsn can move to the AIL as the "ail_head_lsn"
and it can be used to replace the atomic l_last_sync_lsn in the
iclog code. This makes tracking the log tail belong entirely to the
AIL, rather than being smeared across log, iclog and AIL state and
locking.
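To illustrate, a minimal sketch of the resulting tail selection as a
standalone userspace model; ail_min_lsn and ail_head_lsn here are
stand-ins for the AIL state this patch introduces, not the real
structure layout:

	#include <stdint.h>

	struct ail_model {
		uint64_t ail_min_lsn;	/* oldest item in the AIL, 0 if empty */
		uint64_t ail_head_lsn;	/* commit LSN of the last checkpoint inserted */
	};

	/* The log tail is now derived purely from AIL state. */
	static uint64_t model_tail_lsn(const struct ail_model *ail)
	{
		if (ail->ail_min_lsn)
			return ail->ail_min_lsn;
		/* Empty AIL: the tail catches up to the last commit record seen. */
		return ail->ail_head_lsn;
	}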
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
---
fs/xfs/xfs_log.c | 81 +++++-----------------------------------
fs/xfs/xfs_log_cil.c | 54 ++++++++++++++++++++-------
fs/xfs/xfs_log_priv.h | 9 ++---
fs/xfs/xfs_log_recover.c | 19 +++++-----
fs/xfs/xfs_trace.c | 1 +
fs/xfs/xfs_trace.h | 8 ++--
fs/xfs/xfs_trans_ail.c | 26 +++++++++++--
fs/xfs/xfs_trans_priv.h | 13 +++++++
8 files changed, 102 insertions(+), 109 deletions(-)
diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
index ae22f361627fe4..1977afecd385d5 100644
--- a/fs/xfs/xfs_log.c
+++ b/fs/xfs/xfs_log.c
@@ -1230,47 +1230,6 @@ xfs_log_cover(
return error;
}
-/*
- * We may be holding the log iclog lock upon entering this routine.
- */
-xfs_lsn_t
-xlog_assign_tail_lsn_locked(
- struct xfs_mount *mp)
-{
- struct xlog *log = mp->m_log;
- struct xfs_log_item *lip;
- xfs_lsn_t tail_lsn;
-
- assert_spin_locked(&mp->m_ail->ail_lock);
-
- /*
- * To make sure we always have a valid LSN for the log tail we keep
- * track of the last LSN which was committed in log->l_last_sync_lsn,
- * and use that when the AIL was empty.
- */
- lip = xfs_ail_min(mp->m_ail);
- if (lip)
- tail_lsn = lip->li_lsn;
- else
- tail_lsn = atomic64_read(&log->l_last_sync_lsn);
- trace_xfs_log_assign_tail_lsn(log, tail_lsn);
- atomic64_set(&log->l_tail_lsn, tail_lsn);
- return tail_lsn;
-}
-
-xfs_lsn_t
-xlog_assign_tail_lsn(
- struct xfs_mount *mp)
-{
- xfs_lsn_t tail_lsn;
-
- spin_lock(&mp->m_ail->ail_lock);
- tail_lsn = xlog_assign_tail_lsn_locked(mp);
- spin_unlock(&mp->m_ail->ail_lock);
-
- return tail_lsn;
-}
-
/*
* Return the space in the log between the tail and the head. The head
* is passed in the cycle/bytes formal parms. In the special case where
@@ -1501,7 +1460,6 @@ xlog_alloc_log(
log->l_prev_block = -1;
/* log->l_tail_lsn = 0x100000000LL; cycle = 1; current block = 0 */
xlog_assign_atomic_lsn(&log->l_tail_lsn, 1, 0);
- xlog_assign_atomic_lsn(&log->l_last_sync_lsn, 1, 0);
log->l_curr_cycle = 1; /* 0 is bad since this is initial value */
if (xfs_has_logv2(mp) && mp->m_sb.sb_logsunit > 1)
@@ -2549,44 +2507,23 @@ xlog_get_lowest_lsn(
return lowest_lsn;
}
-/*
- * Completion of a iclog IO does not imply that a transaction has completed, as
- * transactions can be large enough to span many iclogs. We cannot change the
- * tail of the log half way through a transaction as this may be the only
- * transaction in the log and moving the tail to point to the middle of it
- * will prevent recovery from finding the start of the transaction. Hence we
- * should only update the last_sync_lsn if this iclog contains transaction
- * completion callbacks on it.
- *
- * We have to do this before we drop the icloglock to ensure we are the only one
- * that can update it.
- *
- * If we are moving the last_sync_lsn forwards, we also need to ensure we kick
- * the reservation grant head pushing. This is due to the fact that the push
- * target is bound by the current last_sync_lsn value. Hence if we have a large
- * amount of log space bound up in this committing transaction then the
- * last_sync_lsn value may be the limiting factor preventing tail pushing from
- * freeing space in the log. Hence once we've updated the last_sync_lsn we
- * should push the AIL to ensure the push target (and hence the grant head) is
- * no longer bound by the old log head location and can move forwards and make
- * progress again.
- */
static void
xlog_state_set_callback(
struct xlog *log,
struct xlog_in_core *iclog,
xfs_lsn_t header_lsn)
{
+ /*
+ * If there are no callbacks on this iclog, we can mark it clean
+ * immediately and return. Otherwise we need to run the
+ * callbacks.
+ */
+ if (list_empty(&iclog->ic_callbacks)) {
+ xlog_state_clean_iclog(log, iclog);
+ return;
+ }
trace_xlog_iclog_callback(iclog, _RET_IP_);
iclog->ic_state = XLOG_STATE_CALLBACK;
-
- ASSERT(XFS_LSN_CMP(atomic64_read(&log->l_last_sync_lsn),
- header_lsn) <= 0);
-
- if (list_empty_careful(&iclog->ic_callbacks))
- return;
-
- atomic64_set(&log->l_last_sync_lsn, header_lsn);
}
/*
diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
index 141bde08bd6e3c..482955f1fa1f9f 100644
--- a/fs/xfs/xfs_log_cil.c
+++ b/fs/xfs/xfs_log_cil.c
@@ -721,6 +721,24 @@ xlog_cil_ail_insert_batch(
* items into the AIL. This uses bulk insertion techniques to minimise AIL lock
* traffic.
*
+ * The AIL tracks log items via the start record LSN of the checkpoint,
+ * not the commit record LSN. This is because we can pipeline multiple
+ * checkpoints, and so the start record of checkpoint N+1 can be
+ * written before the commit record of checkpoint N. i.e:
+ *
+ *   start N                    commit N
+ *     +-------------+------------+----------------+
+ *                start N+1                     commit N+1
+ *
+ * The tail of the log cannot be moved to the LSN of commit N when all
+ * the items of that checkpoint are written back, because then the
+ * start record for N+1 is no longer in the active portion of the log
+ * and recovery will fail/corrupt the filesystem.
+ *
+ * Hence when all the log items in checkpoint N are written back, the
+ * tail of the log must now only move as far forwards as the start LSN
+ * of checkpoint N+1.
+ *
* If we are called with the aborted flag set, it is because a log write during
* a CIL checkpoint commit has failed. In this case, all the items in the
* checkpoint have already gone through iop_committed and iop_committing, which
@@ -738,24 +756,33 @@ xlog_cil_ail_insert_batch(
*/
static void
xlog_cil_ail_insert(
- struct xlog *log,
- struct list_head *lv_chain,
- xfs_lsn_t commit_lsn,
+ struct xfs_cil_ctx *ctx,
bool aborted)
{
#define LOG_ITEM_BATCH_SIZE 32
- struct xfs_ail *ailp = log->l_ailp;
+ struct xfs_ail *ailp = ctx->cil->xc_log->l_ailp;
struct xfs_log_item *log_items[LOG_ITEM_BATCH_SIZE];
struct xfs_log_vec *lv;
struct xfs_ail_cursor cur;
int i = 0;
+ /*
+ * Update the AIL head LSN with the commit record LSN of this
+ * checkpoint. As iclogs are always completed in order, this should
+ * always be the same (as iclogs can contain multiple commit records) or
+ * higher LSN than the current head. We do this before insertion of the
+ * items so that log space checks during insertion will reflect the
+ * space that this checkpoint has already consumed.
+ */
+ ASSERT(XFS_LSN_CMP(ctx->commit_lsn, ailp->ail_head_lsn) >= 0 ||
+ aborted);
spin_lock(&ailp->ail_lock);
- xfs_trans_ail_cursor_last(ailp, &cur, commit_lsn);
+ ailp->ail_head_lsn = ctx->commit_lsn;
+ xfs_trans_ail_cursor_last(ailp, &cur, ctx->start_lsn);
spin_unlock(&ailp->ail_lock);
/* unpin all the log items */
- list_for_each_entry(lv, lv_chain, lv_list) {
+ list_for_each_entry(lv, &ctx->lv_chain, lv_list) {
struct xfs_log_item *lip = lv->lv_item;
xfs_lsn_t item_lsn;
@@ -768,9 +795,10 @@ xlog_cil_ail_insert(
}
if (lip->li_ops->iop_committed)
- item_lsn = lip->li_ops->iop_committed(lip, commit_lsn);
+ item_lsn = lip->li_ops->iop_committed(lip,
+ ctx->start_lsn);
else
- item_lsn = commit_lsn;
+ item_lsn = ctx->start_lsn;
/* item_lsn of -1 means the item needs no further processing */
if (XFS_LSN_CMP(item_lsn, (xfs_lsn_t)-1) == 0)
@@ -787,7 +815,7 @@ xlog_cil_ail_insert(
continue;
}
- if (item_lsn != commit_lsn) {
+ if (item_lsn != ctx->start_lsn) {
/*
* Not a bulk update option due to unusual item_lsn.
@@ -810,14 +838,15 @@ xlog_cil_ail_insert(
log_items[i++] = lv->lv_item;
if (i >= LOG_ITEM_BATCH_SIZE) {
xlog_cil_ail_insert_batch(ailp, &cur, log_items,
- LOG_ITEM_BATCH_SIZE, commit_lsn);
+ LOG_ITEM_BATCH_SIZE, ctx->start_lsn);
i = 0;
}
}
/* make sure we insert the remainder! */
if (i)
- xlog_cil_ail_insert_batch(ailp, &cur, log_items, i, commit_lsn);
+ xlog_cil_ail_insert_batch(ailp, &cur, log_items, i,
+ ctx->start_lsn);
spin_lock(&ailp->ail_lock);
xfs_trans_ail_cursor_done(&cur);
@@ -863,8 +892,7 @@ xlog_cil_committed(
spin_unlock(&ctx->cil->xc_push_lock);
}
- xlog_cil_ail_insert(ctx->cil->xc_log, &ctx->lv_chain,
- ctx->start_lsn, abort);
+ xlog_cil_ail_insert(ctx, abort);
xfs_extent_busy_sort(&ctx->busy_extents.extent_list);
xfs_extent_busy_clear(mp, &ctx->busy_extents.extent_list,
diff --git a/fs/xfs/xfs_log_priv.h b/fs/xfs/xfs_log_priv.h
index 971871b84d8436..4b8ef926044599 100644
--- a/fs/xfs/xfs_log_priv.h
+++ b/fs/xfs/xfs_log_priv.h
@@ -431,13 +431,10 @@ struct xlog {
int l_prev_block; /* previous logical log block */
/*
- * l_last_sync_lsn and l_tail_lsn are atomics so they can be set and
- * read without needing to hold specific locks. To avoid operations
- * contending with other hot objects, place each of them on a separate
- * cacheline.
+ * l_tail_lsn is atomic so it can be set and read without needing to
+ * hold specific locks. To avoid operations contending with other hot
+ * objects, place it on a separate cacheline.
*/
- /* lsn of last LR on disk */
- atomic64_t l_last_sync_lsn ____cacheline_aligned_in_smp;
/* lsn of 1st LR with unflushed * buffers */
atomic64_t l_tail_lsn ____cacheline_aligned_in_smp;
diff --git a/fs/xfs/xfs_log_recover.c b/fs/xfs/xfs_log_recover.c
index 4fe627991e8653..63f667f92c322e 100644
--- a/fs/xfs/xfs_log_recover.c
+++ b/fs/xfs/xfs_log_recover.c
@@ -1177,8 +1177,8 @@ xlog_check_unmount_rec(
*/
xlog_assign_atomic_lsn(&log->l_tail_lsn,
log->l_curr_cycle, after_umount_blk);
- xlog_assign_atomic_lsn(&log->l_last_sync_lsn,
- log->l_curr_cycle, after_umount_blk);
+ log->l_ailp->ail_head_lsn =
+ atomic64_read(&log->l_tail_lsn);
*tail_blk = after_umount_blk;
*clean = true;
@@ -1212,7 +1212,7 @@ xlog_set_state(
if (bump_cycle)
log->l_curr_cycle++;
atomic64_set(&log->l_tail_lsn, be64_to_cpu(rhead->h_tail_lsn));
- atomic64_set(&log->l_last_sync_lsn, be64_to_cpu(rhead->h_lsn));
+ log->l_ailp->ail_head_lsn = be64_to_cpu(rhead->h_lsn);
xlog_assign_grant_head(&log->l_reserve_head.grant, log->l_curr_cycle,
BBTOB(log->l_curr_block));
xlog_assign_grant_head(&log->l_write_head.grant, log->l_curr_cycle,
@@ -3363,14 +3363,13 @@ xlog_do_recover(
/*
* We now update the tail_lsn since much of the recovery has completed
- * and there may be space available to use. If there were no extent
- * or iunlinks, we can free up the entire log and set the tail_lsn to
- * be the last_sync_lsn. This was set in xlog_find_tail to be the
- * lsn of the last known good LR on disk. If there are extent frees
- * or iunlinks they will have some entries in the AIL; so we look at
- * the AIL to determine how to set the tail_lsn.
+ * and there may be space available to use. If there were no extent or
+ * iunlinks, we can free up the entire log. This was set in
+ * xlog_find_tail to be the lsn of the last known good LR on disk. If
+ * there are extent frees or iunlinks they will have some entries in the
+ * AIL; so we look at the AIL to determine how to set the tail_lsn.
*/
- xlog_assign_tail_lsn(mp);
+ xfs_ail_assign_tail_lsn(log->l_ailp);
/*
* Now that we've finished replaying all buffer and inode updates,
diff --git a/fs/xfs/xfs_trace.c b/fs/xfs/xfs_trace.c
index 9c7fbaae2717dd..1aa013fdc36fcf 100644
--- a/fs/xfs/xfs_trace.c
+++ b/fs/xfs/xfs_trace.c
@@ -22,6 +22,7 @@
#include "xfs_trans.h"
#include "xfs_log.h"
#include "xfs_log_priv.h"
+#include "xfs_trans_priv.h"
#include "xfs_buf_item.h"
#include "xfs_quota.h"
#include "xfs_dquot_item.h"
diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
index 25ff6fe1eb6c8a..13f6e6cab572ae 100644
--- a/fs/xfs/xfs_trace.h
+++ b/fs/xfs/xfs_trace.h
@@ -1404,19 +1404,19 @@ TRACE_EVENT(xfs_log_assign_tail_lsn,
__field(dev_t, dev)
__field(xfs_lsn_t, new_lsn)
__field(xfs_lsn_t, old_lsn)
- __field(xfs_lsn_t, last_sync_lsn)
+ __field(xfs_lsn_t, head_lsn)
),
TP_fast_assign(
__entry->dev = log->l_mp->m_super->s_dev;
__entry->new_lsn = new_lsn;
__entry->old_lsn = atomic64_read(&log->l_tail_lsn);
- __entry->last_sync_lsn = atomic64_read(&log->l_last_sync_lsn);
+ __entry->head_lsn = log->l_ailp->ail_head_lsn;
),
- TP_printk("dev %d:%d new tail lsn %d/%d, old lsn %d/%d, last sync %d/%d",
+ TP_printk("dev %d:%d new tail lsn %d/%d, old lsn %d/%d, head lsn %d/%d",
MAJOR(__entry->dev), MINOR(__entry->dev),
CYCLE_LSN(__entry->new_lsn), BLOCK_LSN(__entry->new_lsn),
CYCLE_LSN(__entry->old_lsn), BLOCK_LSN(__entry->old_lsn),
- CYCLE_LSN(__entry->last_sync_lsn), BLOCK_LSN(__entry->last_sync_lsn))
+ CYCLE_LSN(__entry->head_lsn), BLOCK_LSN(__entry->head_lsn))
)
DECLARE_EVENT_CLASS(xfs_file_class,
diff --git a/fs/xfs/xfs_trans_ail.c b/fs/xfs/xfs_trans_ail.c
index 7d6ccd21aae2e5..5f03f82c46838e 100644
--- a/fs/xfs/xfs_trans_ail.c
+++ b/fs/xfs/xfs_trans_ail.c
@@ -720,6 +720,26 @@ xfs_ail_push_all_sync(
finish_wait(&ailp->ail_empty, &wait);
}
+void
+__xfs_ail_assign_tail_lsn(
+ struct xfs_ail *ailp)
+{
+ struct xlog *log = ailp->ail_log;
+ xfs_lsn_t tail_lsn;
+
+ assert_spin_locked(&ailp->ail_lock);
+
+ if (xlog_is_shutdown(log))
+ return;
+
+ tail_lsn = __xfs_ail_min_lsn(ailp);
+ if (!tail_lsn)
+ tail_lsn = ailp->ail_head_lsn;
+
+ trace_xfs_log_assign_tail_lsn(log, tail_lsn);
+ atomic64_set(&log->l_tail_lsn, tail_lsn);
+}
+
/*
* Callers should pass the original tail lsn so that we can detect if the tail
* has moved as a result of the operation that was performed. If the caller
@@ -734,15 +754,13 @@ xfs_ail_update_finish(
{
struct xlog *log = ailp->ail_log;
- /* if the tail lsn hasn't changed, don't do updates or wakeups. */
+ /* If the tail lsn hasn't changed, don't do updates or wakeups. */
if (!old_lsn || old_lsn == __xfs_ail_min_lsn(ailp)) {
spin_unlock(&ailp->ail_lock);
return;
}
- if (!xlog_is_shutdown(log))
- xlog_assign_tail_lsn_locked(log->l_mp);
-
+ __xfs_ail_assign_tail_lsn(ailp);
if (list_empty(&ailp->ail_head))
wake_up_all(&ailp->ail_empty);
spin_unlock(&ailp->ail_lock);
diff --git a/fs/xfs/xfs_trans_priv.h b/fs/xfs/xfs_trans_priv.h
index 60b4707c3a6583..bd841df93021ff 100644
--- a/fs/xfs/xfs_trans_priv.h
+++ b/fs/xfs/xfs_trans_priv.h
@@ -55,6 +55,7 @@ struct xfs_ail {
struct list_head ail_cursors;
spinlock_t ail_lock;
xfs_lsn_t ail_last_pushed_lsn;
+ xfs_lsn_t ail_head_lsn;
int ail_log_flush;
unsigned long ail_opstate;
struct list_head ail_buf_list;
@@ -130,6 +131,18 @@ struct xfs_log_item * xfs_trans_ail_cursor_next(struct xfs_ail *ailp,
struct xfs_ail_cursor *cur);
void xfs_trans_ail_cursor_done(struct xfs_ail_cursor *cur);
+void __xfs_ail_assign_tail_lsn(struct xfs_ail *ailp);
+
+static inline void
+xfs_ail_assign_tail_lsn(
+ struct xfs_ail *ailp)
+{
+
+ spin_lock(&ailp->ail_lock);
+ __xfs_ail_assign_tail_lsn(ailp);
+ spin_unlock(&ailp->ail_lock);
+}
+
#if BITS_PER_LONG != 64
static inline void
xfs_trans_ail_copy_lsn(
--
2.43.0
^ permalink raw reply related [flat|nested] 24+ messages in thread
* [PATCH 07/11] xfs: collapse xlog_state_set_callback in caller
2024-06-20 7:21 xfs: byte-based grant head reservation tracking v4 Christoph Hellwig
` (5 preceding siblings ...)
2024-06-20 7:21 ` [PATCH 06/11] xfs: l_last_sync_lsn is really AIL state Christoph Hellwig
@ 2024-06-20 7:21 ` Christoph Hellwig
2024-06-20 7:21 ` [PATCH 08/11] xfs: track log space pinned by the AIL Christoph Hellwig
` (3 subsequent siblings)
10 siblings, 0 replies; 24+ messages in thread
From: Christoph Hellwig @ 2024-06-20 7:21 UTC (permalink / raw)
To: Chandan Babu R; +Cc: Darrick J. Wong, Dave Chinner, linux-xfs, Dave Chinner
From: Dave Chinner <dchinner@redhat.com>
The function is called from a single place, and it isn't just
setting the iclog state to XLOG_STATE_CALLBACK - it can mark iclogs
clean, which moves them to states after CALLBACK. Hence the function
is now badly named, and should just be folded into the caller where
the iclog completion logic makes a whole lot more sense.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
fs/xfs/xfs_log.c | 31 +++++++++++--------------------
1 file changed, 11 insertions(+), 20 deletions(-)
diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
index 1977afecd385d5..381d6143a78777 100644
--- a/fs/xfs/xfs_log.c
+++ b/fs/xfs/xfs_log.c
@@ -2507,25 +2507,6 @@ xlog_get_lowest_lsn(
return lowest_lsn;
}
-static void
-xlog_state_set_callback(
- struct xlog *log,
- struct xlog_in_core *iclog,
- xfs_lsn_t header_lsn)
-{
- /*
- * If there are no callbacks on this iclog, we can mark it clean
- * immediately and return. Otherwise we need to run the
- * callbacks.
- */
- if (list_empty(&iclog->ic_callbacks)) {
- xlog_state_clean_iclog(log, iclog);
- return;
- }
- trace_xlog_iclog_callback(iclog, _RET_IP_);
- iclog->ic_state = XLOG_STATE_CALLBACK;
-}
-
/*
* Return true if we need to stop processing, false to continue to the next
* iclog. The caller will need to run callbacks if the iclog is returned in the
@@ -2557,7 +2538,17 @@ xlog_state_iodone_process_iclog(
lowest_lsn = xlog_get_lowest_lsn(log);
if (lowest_lsn && XFS_LSN_CMP(lowest_lsn, header_lsn) < 0)
return false;
- xlog_state_set_callback(log, iclog, header_lsn);
+ /*
+ * If there are no callbacks on this iclog, we can mark it clean
+ * immediately and return. Otherwise we need to run the
+ * callbacks.
+ */
+ if (list_empty(&iclog->ic_callbacks)) {
+ xlog_state_clean_iclog(log, iclog);
+ return false;
+ }
+ trace_xlog_iclog_callback(iclog, _RET_IP_);
+ iclog->ic_state = XLOG_STATE_CALLBACK;
return false;
default:
/*
--
2.43.0
^ permalink raw reply related [flat|nested] 24+ messages in thread
* [PATCH 08/11] xfs: track log space pinned by the AIL
2024-06-20 7:21 xfs: byte-based grant head reservation tracking v4 Christoph Hellwig
` (6 preceding siblings ...)
2024-06-20 7:21 ` [PATCH 07/11] xfs: collapse xlog_state_set_callback in caller Christoph Hellwig
@ 2024-06-20 7:21 ` Christoph Hellwig
2024-06-20 7:21 ` [PATCH 09/11] xfs: pass the full grant head to accounting functions Christoph Hellwig
` (2 subsequent siblings)
10 siblings, 0 replies; 24+ messages in thread
From: Christoph Hellwig @ 2024-06-20 7:21 UTC (permalink / raw)
To: Chandan Babu R; +Cc: Darrick J. Wong, Dave Chinner, linux-xfs, Dave Chinner
From: Dave Chinner <dchinner@redhat.com>
Currently we track space used in the log by grant heads.
These store the reserved space as a physical log location and
combine both the space reserved for future use and the space already
used in the log in a single variable. The amount of space consumed in the
log is then calculated as the distance between the log tail and
the grant head.
The problem with tracking the grant head as a physical location
comes from the fact that it tracks both log cycle count and offset
into the log in bytes in a single 64 bit variable. Because the cycle
count on disk is a 32 bit number, this also limits the offset into
the log to 32 bits. And because that is in bytes, we are limited to
being able to track only 2GB of log space in the grant head.
Hence to support larger physical logs, we need to track used space
differently in the grant head. We no longer use the grant head for
guiding AIL pushing, so the only thing it is now used for is
determining if we've run out of reservation space via the
calculation in xlog_space_left().
What we really need to do is move the grant heads away from tracking
physical space in the log. The issue here is that space consumed in
the log is not directly tracked by the current mechanism - the
space consumed in the log by grant head reservations gets returned
to the free pool by the tail of the log moving forward. i.e. the
space isn't directly tracked or calculated, but the used grant space
gets "freed" as the physical limits of the log are updated without
actually needing to update the grant heads.
Hence to move away from implicit, zero-update log space tracking we
need to explicitly track the amount of physical space the log
actually consumes separately to the in-memory reservations for
operations that will be committed to the journal. Luckily, we
already track the information we need to calculate this in the AIL
itself.
That is, the space currently consumed by the journal is the maximum
LSN that the AIL has seen minus the current log tail. As we update
both of these items dynamically as the head and tail of the log
moves, we always know exactly how much space the journal consumes.
This means that we also know exactly how much space the currently
active reservations require, and exactly how much free space we have
remaining for new reservations to be made. Most importantly, we know
what these spaces are independently of the physical locations of
the head and tail of the log.
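Roughly, this is just the byte distance from the tail LSN forward to
the highest LSN the AIL has seen, modulo the log size. A standalone
sketch of that calculation, assuming 512-byte basic blocks and that
the head and tail are at most one cycle apart (the kernel side uses
the xlog_lsn_sub() helper for this):

	#include <stdint.h>

	#define BBSIZE	512ULL	/* bytes per basic block */

	static uint64_t model_tail_space(uint64_t logsize, uint64_t head_lsn,
			uint64_t tail_lsn)
	{
		uint32_t head_cycle = head_lsn >> 32, head_block = (uint32_t)head_lsn;
		uint32_t tail_cycle = tail_lsn >> 32, tail_block = (uint32_t)tail_lsn;

		if (head_cycle == tail_cycle)
			return (uint64_t)(head_block - tail_block) * BBSIZE;
		/* The head has wrapped into the next cycle. */
		return logsize - (uint64_t)(tail_block - head_block) * BBSIZE;
	}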
Hence by separating out the physical space consumed by the journal,
we can now track reservations in the grant heads purely as a byte
count, and the log can be considered full when the tail space +
reservation space exceeds the size of the log. This means we can use
the full 64 bits of grant head space for reservation space,
completely removing the 32 bit byte count limitation on log size
that they impose.
Hence the first step in this conversion is to track and update the
"log tail space" every time the AIL tail or maximum seen LSN
changes.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
fs/xfs/xfs_log_cil.c | 9 ++++++---
fs/xfs/xfs_log_priv.h | 1 +
fs/xfs/xfs_trans_ail.c | 9 ++++++---
3 files changed, 13 insertions(+), 6 deletions(-)
diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
index 482955f1fa1f9f..92ccac7f905448 100644
--- a/fs/xfs/xfs_log_cil.c
+++ b/fs/xfs/xfs_log_cil.c
@@ -772,14 +772,17 @@ xlog_cil_ail_insert(
* always be the same (as iclogs can contain multiple commit records) or
* higher LSN than the current head. We do this before insertion of the
* items so that log space checks during insertion will reflect the
- * space that this checkpoint has already consumed.
+ * space that this checkpoint has already consumed. We call
+ * xfs_ail_update_finish() so that tail space and space-based wakeups
+ * will be recalculated appropriately.
*/
ASSERT(XFS_LSN_CMP(ctx->commit_lsn, ailp->ail_head_lsn) >= 0 ||
aborted);
spin_lock(&ailp->ail_lock);
- ailp->ail_head_lsn = ctx->commit_lsn;
xfs_trans_ail_cursor_last(ailp, &cur, ctx->start_lsn);
- spin_unlock(&ailp->ail_lock);
+ ailp->ail_head_lsn = ctx->commit_lsn;
+ /* xfs_ail_update_finish() drops the ail_lock */
+ xfs_ail_update_finish(ailp, NULLCOMMITLSN);
/* unpin all the log items */
list_for_each_entry(lv, &ctx->lv_chain, lv_list) {
diff --git a/fs/xfs/xfs_log_priv.h b/fs/xfs/xfs_log_priv.h
index 4b8ef926044599..2896745989795d 100644
--- a/fs/xfs/xfs_log_priv.h
+++ b/fs/xfs/xfs_log_priv.h
@@ -440,6 +440,7 @@ struct xlog {
struct xlog_grant_head l_reserve_head;
struct xlog_grant_head l_write_head;
+ uint64_t l_tail_space;
struct xfs_kobj l_kobj;
diff --git a/fs/xfs/xfs_trans_ail.c b/fs/xfs/xfs_trans_ail.c
index 5f03f82c46838e..6a106a05fae017 100644
--- a/fs/xfs/xfs_trans_ail.c
+++ b/fs/xfs/xfs_trans_ail.c
@@ -736,6 +736,8 @@ __xfs_ail_assign_tail_lsn(
if (!tail_lsn)
tail_lsn = ailp->ail_head_lsn;
+ WRITE_ONCE(log->l_tail_space,
+ xlog_lsn_sub(log, ailp->ail_head_lsn, tail_lsn));
trace_xfs_log_assign_tail_lsn(log, tail_lsn);
atomic64_set(&log->l_tail_lsn, tail_lsn);
}
@@ -743,9 +745,10 @@ __xfs_ail_assign_tail_lsn(
/*
* Callers should pass the original tail lsn so that we can detect if the tail
* has moved as a result of the operation that was performed. If the caller
- * needs to force a tail LSN update, it should pass NULLCOMMITLSN to bypass the
- * "did the tail LSN change?" checks. If the caller wants to avoid a tail update
- * (e.g. it knows the tail did not change) it should pass an @old_lsn of 0.
+ * needs to force a tail space update, it should pass NULLCOMMITLSN to bypass
+ * the "did the tail LSN change?" checks. If the caller wants to avoid a tail
+ * update (e.g. it knows the tail did not change) it should pass an @old_lsn of
+ * 0.
*/
void
xfs_ail_update_finish(
--
2.43.0
^ permalink raw reply related [flat|nested] 24+ messages in thread
* [PATCH 09/11] xfs: pass the full grant head to accounting functions
2024-06-20 7:21 xfs: byte-based grant head reservation tracking v4 Christoph Hellwig
` (7 preceding siblings ...)
2024-06-20 7:21 ` [PATCH 08/11] xfs: track log space pinned by the AIL Christoph Hellwig
@ 2024-06-20 7:21 ` Christoph Hellwig
2024-06-20 7:21 ` [PATCH 10/11] xfs: grant heads track byte counts, not LSNs Christoph Hellwig
2024-06-20 7:21 ` [PATCH 11/11] xfs: skip flushing log items during push Christoph Hellwig
10 siblings, 0 replies; 24+ messages in thread
From: Christoph Hellwig @ 2024-06-20 7:21 UTC (permalink / raw)
To: Chandan Babu R; +Cc: Darrick J. Wong, Dave Chinner, linux-xfs, Dave Chinner
From: Dave Chinner <dchinner@redhat.com>
Because we are going to need them soon. API change only, no logic
changes.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
fs/xfs/xfs_log.c | 157 +++++++++++++++++++++---------------------
fs/xfs/xfs_log_priv.h | 2 -
2 files changed, 77 insertions(+), 82 deletions(-)
diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
index 381d6143a78777..0e50b370f0e4c7 100644
--- a/fs/xfs/xfs_log.c
+++ b/fs/xfs/xfs_log.c
@@ -136,10 +136,10 @@ xlog_prepare_iovec(
static void
xlog_grant_sub_space(
struct xlog *log,
- atomic64_t *head,
+ struct xlog_grant_head *head,
int bytes)
{
- int64_t head_val = atomic64_read(head);
+ int64_t head_val = atomic64_read(&head->grant);
int64_t new, old;
do {
@@ -155,17 +155,17 @@ xlog_grant_sub_space(
old = head_val;
new = xlog_assign_grant_head_val(cycle, space);
- head_val = atomic64_cmpxchg(head, old, new);
+ head_val = atomic64_cmpxchg(&head->grant, old, new);
} while (head_val != old);
}
static void
xlog_grant_add_space(
struct xlog *log,
- atomic64_t *head,
+ struct xlog_grant_head *head,
int bytes)
{
- int64_t head_val = atomic64_read(head);
+ int64_t head_val = atomic64_read(&head->grant);
int64_t new, old;
do {
@@ -184,7 +184,7 @@ xlog_grant_add_space(
old = head_val;
new = xlog_assign_grant_head_val(cycle, space);
- head_val = atomic64_cmpxchg(head, old, new);
+ head_val = atomic64_cmpxchg(&head->grant, old, new);
} while (head_val != old);
}
@@ -197,6 +197,63 @@ xlog_grant_head_init(
spin_lock_init(&head->lock);
}
+/*
+ * Return the space in the log between the tail and the head. The head
+ * is passed in the cycle/bytes formal parms. In the special case where
+ * the reserve head has wrapped passed the tail, this calculation is no
+ * longer valid. In this case, just return 0 which means there is no space
+ * in the log. This works for all places where this function is called
+ * with the reserve head. Of course, if the write head were to ever
+ * wrap the tail, we should blow up. Rather than catch this case here,
+ * we depend on other ASSERTions in other parts of the code. XXXmiken
+ *
+ * If reservation head is behind the tail, we have a problem. Warn about it,
+ * but then treat it as if the log is empty.
+ *
+ * If the log is shut down, the head and tail may be invalid or out of whack, so
+ * shortcut invalidity asserts in this case so that we don't trigger them
+ * falsely.
+ */
+static int
+xlog_grant_space_left(
+ struct xlog *log,
+ struct xlog_grant_head *head)
+{
+ int tail_bytes;
+ int tail_cycle;
+ int head_cycle;
+ int head_bytes;
+
+ xlog_crack_grant_head(&head->grant, &head_cycle, &head_bytes);
+ xlog_crack_atomic_lsn(&log->l_tail_lsn, &tail_cycle, &tail_bytes);
+ tail_bytes = BBTOB(tail_bytes);
+ if (tail_cycle == head_cycle && head_bytes >= tail_bytes)
+ return log->l_logsize - (head_bytes - tail_bytes);
+ if (tail_cycle + 1 < head_cycle)
+ return 0;
+
+ /* Ignore potential inconsistency when shutdown. */
+ if (xlog_is_shutdown(log))
+ return log->l_logsize;
+
+ if (tail_cycle < head_cycle) {
+ ASSERT(tail_cycle == (head_cycle - 1));
+ return tail_bytes - head_bytes;
+ }
+
+ /*
+ * The reservation head is behind the tail. In this case we just want to
+ * return the size of the log as the amount of space left.
+ */
+ xfs_alert(log->l_mp, "xlog_grant_space_left: head behind tail");
+ xfs_alert(log->l_mp, " tail_cycle = %d, tail_bytes = %d",
+ tail_cycle, tail_bytes);
+ xfs_alert(log->l_mp, " GH cycle = %d, GH bytes = %d",
+ head_cycle, head_bytes);
+ ASSERT(0);
+ return log->l_logsize;
+}
+
STATIC void
xlog_grant_head_wake_all(
struct xlog_grant_head *head)
@@ -277,7 +334,7 @@ xlog_grant_head_wait(
spin_lock(&head->lock);
if (xlog_is_shutdown(log))
goto shutdown;
- } while (xlog_space_left(log, &head->grant) < need_bytes);
+ } while (xlog_grant_space_left(log, head) < need_bytes);
list_del_init(&tic->t_queue);
return 0;
@@ -322,7 +379,7 @@ xlog_grant_head_check(
* otherwise try to get some space for this transaction.
*/
*need_bytes = xlog_ticket_reservation(log, head, tic);
- free_bytes = xlog_space_left(log, &head->grant);
+ free_bytes = xlog_grant_space_left(log, head);
if (!list_empty_careful(&head->waiters)) {
spin_lock(&head->lock);
if (!xlog_grant_head_wake(log, head, &free_bytes) ||
@@ -396,7 +453,7 @@ xfs_log_regrant(
if (error)
goto out_error;
- xlog_grant_add_space(log, &log->l_write_head.grant, need_bytes);
+ xlog_grant_add_space(log, &log->l_write_head, need_bytes);
trace_xfs_log_regrant_exit(log, tic);
xlog_verify_grant_tail(log);
return 0;
@@ -447,8 +504,8 @@ xfs_log_reserve(
if (error)
goto out_error;
- xlog_grant_add_space(log, &log->l_reserve_head.grant, need_bytes);
- xlog_grant_add_space(log, &log->l_write_head.grant, need_bytes);
+ xlog_grant_add_space(log, &log->l_reserve_head, need_bytes);
+ xlog_grant_add_space(log, &log->l_write_head, need_bytes);
trace_xfs_log_reserve_exit(log, tic);
xlog_verify_grant_tail(log);
return 0;
@@ -1107,7 +1164,7 @@ xfs_log_space_wake(
ASSERT(!xlog_in_recovery(log));
spin_lock(&log->l_write_head.lock);
- free_bytes = xlog_space_left(log, &log->l_write_head.grant);
+ free_bytes = xlog_grant_space_left(log, &log->l_write_head);
xlog_grant_head_wake(log, &log->l_write_head, &free_bytes);
spin_unlock(&log->l_write_head.lock);
}
@@ -1116,7 +1173,7 @@ xfs_log_space_wake(
ASSERT(!xlog_in_recovery(log));
spin_lock(&log->l_reserve_head.lock);
- free_bytes = xlog_space_left(log, &log->l_reserve_head.grant);
+ free_bytes = xlog_grant_space_left(log, &log->l_reserve_head);
xlog_grant_head_wake(log, &log->l_reserve_head, &free_bytes);
spin_unlock(&log->l_reserve_head.lock);
}
@@ -1230,64 +1287,6 @@ xfs_log_cover(
return error;
}
-/*
- * Return the space in the log between the tail and the head. The head
- * is passed in the cycle/bytes formal parms. In the special case where
- * the reserve head has wrapped passed the tail, this calculation is no
- * longer valid. In this case, just return 0 which means there is no space
- * in the log. This works for all places where this function is called
- * with the reserve head. Of course, if the write head were to ever
- * wrap the tail, we should blow up. Rather than catch this case here,
- * we depend on other ASSERTions in other parts of the code. XXXmiken
- *
- * If reservation head is behind the tail, we have a problem. Warn about it,
- * but then treat it as if the log is empty.
- *
- * If the log is shut down, the head and tail may be invalid or out of whack, so
- * shortcut invalidity asserts in this case so that we don't trigger them
- * falsely.
- */
-int
-xlog_space_left(
- struct xlog *log,
- atomic64_t *head)
-{
- int tail_bytes;
- int tail_cycle;
- int head_cycle;
- int head_bytes;
-
- xlog_crack_grant_head(head, &head_cycle, &head_bytes);
- xlog_crack_atomic_lsn(&log->l_tail_lsn, &tail_cycle, &tail_bytes);
- tail_bytes = BBTOB(tail_bytes);
- if (tail_cycle == head_cycle && head_bytes >= tail_bytes)
- return log->l_logsize - (head_bytes - tail_bytes);
- if (tail_cycle + 1 < head_cycle)
- return 0;
-
- /* Ignore potential inconsistency when shutdown. */
- if (xlog_is_shutdown(log))
- return log->l_logsize;
-
- if (tail_cycle < head_cycle) {
- ASSERT(tail_cycle == (head_cycle - 1));
- return tail_bytes - head_bytes;
- }
-
- /*
- * The reservation head is behind the tail. In this case we just want to
- * return the size of the log as the amount of space left.
- */
- xfs_alert(log->l_mp, "xlog_space_left: head behind tail");
- xfs_alert(log->l_mp, " tail_cycle = %d, tail_bytes = %d",
- tail_cycle, tail_bytes);
- xfs_alert(log->l_mp, " GH cycle = %d, GH bytes = %d",
- head_cycle, head_bytes);
- ASSERT(0);
- return log->l_logsize;
-}
-
-
static void
xlog_ioend_work(
struct work_struct *work)
@@ -1881,8 +1880,8 @@ xlog_sync(
if (ticket) {
ticket->t_curr_res -= roundoff;
} else {
- xlog_grant_add_space(log, &log->l_reserve_head.grant, roundoff);
- xlog_grant_add_space(log, &log->l_write_head.grant, roundoff);
+ xlog_grant_add_space(log, &log->l_reserve_head, roundoff);
+ xlog_grant_add_space(log, &log->l_write_head, roundoff);
}
/* put cycle number in every block */
@@ -2802,17 +2801,15 @@ xfs_log_ticket_regrant(
if (ticket->t_cnt > 0)
ticket->t_cnt--;
- xlog_grant_sub_space(log, &log->l_reserve_head.grant,
- ticket->t_curr_res);
- xlog_grant_sub_space(log, &log->l_write_head.grant,
- ticket->t_curr_res);
+ xlog_grant_sub_space(log, &log->l_reserve_head, ticket->t_curr_res);
+ xlog_grant_sub_space(log, &log->l_write_head, ticket->t_curr_res);
ticket->t_curr_res = ticket->t_unit_res;
trace_xfs_log_ticket_regrant_sub(log, ticket);
/* just return if we still have some of the pre-reserved space */
if (!ticket->t_cnt) {
- xlog_grant_add_space(log, &log->l_reserve_head.grant,
+ xlog_grant_add_space(log, &log->l_reserve_head,
ticket->t_unit_res);
trace_xfs_log_ticket_regrant_exit(log, ticket);
@@ -2860,8 +2857,8 @@ xfs_log_ticket_ungrant(
bytes += ticket->t_unit_res*ticket->t_cnt;
}
- xlog_grant_sub_space(log, &log->l_reserve_head.grant, bytes);
- xlog_grant_sub_space(log, &log->l_write_head.grant, bytes);
+ xlog_grant_sub_space(log, &log->l_reserve_head, bytes);
+ xlog_grant_sub_space(log, &log->l_write_head, bytes);
trace_xfs_log_ticket_ungrant_exit(log, ticket);
diff --git a/fs/xfs/xfs_log_priv.h b/fs/xfs/xfs_log_priv.h
index 2896745989795d..0838c57ca8ac22 100644
--- a/fs/xfs/xfs_log_priv.h
+++ b/fs/xfs/xfs_log_priv.h
@@ -573,8 +573,6 @@ xlog_assign_grant_head(atomic64_t *head, int cycle, int space)
atomic64_set(head, xlog_assign_grant_head_val(cycle, space));
}
-int xlog_space_left(struct xlog *log, atomic64_t *head);
-
/*
* Committed Item List interfaces
*/
--
2.43.0
^ permalink raw reply related [flat|nested] 24+ messages in thread
* [PATCH 10/11] xfs: grant heads track byte counts, not LSNs
2024-06-20 7:21 xfs: byte-based grant head reservation tracking v4 Christoph Hellwig
` (8 preceding siblings ...)
2024-06-20 7:21 ` [PATCH 09/11] xfs: pass the full grant head to accounting functions Christoph Hellwig
@ 2024-06-20 7:21 ` Christoph Hellwig
2024-07-01 0:59 ` Dave Chinner
2024-06-20 7:21 ` [PATCH 11/11] xfs: skip flushing log items during push Christoph Hellwig
10 siblings, 1 reply; 24+ messages in thread
From: Christoph Hellwig @ 2024-06-20 7:21 UTC (permalink / raw)
To: Chandan Babu R; +Cc: Darrick J. Wong, Dave Chinner, linux-xfs, Dave Chinner
From: Dave Chinner <dchinner@redhat.com>
The grant heads in the log track the space reserved in the log for
running transactions. They do this by tracking how far ahead of the
tail that the reservation has reached, and the units for doing this
are {cycle,bytes} for the reserve head rather than {cycle,blocks}
which are normally used by LSNs.
This is annoyingly complex because we have to split, crack and
combine these tuples for any calculation we do to determine log
space and targets. This is computationally expensive and difficult to
do atomically and locklessly, and it also limits the size of the log
to 2^32 bytes.
Really, though, all the grant heads are tracking is how much space
is currently available for use in the log. We can track this as a
simple byte count - we just don't care what the actual physical
location in the log the head and tail are at, just how much space we
have remaining before the head and tail overlap.
So, convert the grant heads to track the byte reservations that are
active rather than the current (cycle, offset) tuples. This means an
empty log has zero bytes consumed, and a full log is when the
reservations reach the size of the log minus the space consumed by
the AIL.
This greatly simplifies the accounting and checks for whether there
is space available. We no longer need to crack or combine LSNs to
determine how much space the log has left, nor do we need to look at
the head or tail of the log to determine how close to full we are.
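With the tail space tracked explicitly, the space check collapses to
plain byte arithmetic. A standalone sketch with made-up numbers (the
real check is the xlog_grant_space_left() rework in this patch):

	#include <inttypes.h>
	#include <stdint.h>
	#include <stdio.h>

	int main(void)
	{
		int64_t logsize    = 2048LL << 20;	/* hypothetical 2GB log */
		int64_t tail_space = 512LL << 20;	/* space pinned by the AIL */
		int64_t grant      = 1200LL << 20;	/* active byte reservations */
		int64_t free_bytes = logsize - tail_space - grant;

		/* The log is full once tail space + reservations reach logsize. */
		printf("free: %" PRId64 " bytes\n", free_bytes > 0 ? free_bytes : 0);
		return 0;
	}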
There is, however, a complexity that needs to be handled. We know
how much space is being tracked in the AIL now via log->l_tail_space
and the log tickets track active reservations and return the unused
portions to the grant heads when ungranted. Unfortunately, we don't
track the used portion of the grant, so when we transfer log items
from the CIL to the AIL, the space accounted to the grant heads is
transferred to the log tail space. Hence when we move the AIL head
forwards on item insert, we have to remove that space from the grant
heads.
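A minimal sketch of that hand-off, with hypothetical types (delta is
the byte distance the AIL head moved forward, which is what the
xlog_grant_return_space() helper added by this patch computes):

	#include <stdint.h>

	struct grant_model {
		int64_t reserve_head;	/* bytes held by the reserve grant head */
		int64_t write_head;	/* bytes held by the write grant head */
	};

	/*
	 * Space the AIL head just absorbed into the log tail space is
	 * handed back out of both grant heads so it isn't counted twice.
	 */
	static void model_return_space(struct grant_model *g, int64_t delta)
	{
		g->reserve_head -= delta;
		g->write_head -= delta;
	}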
We also remove the xlog_verify_grant_tail() debug function as it is
no longer useful. The check it performs has been racy since delayed
logging was introduced, but now it is clearly only detecting false
positives so remove it.
The result of this substantially simpler accounting algorithm is an
increase in sustained transaction rate from ~1.3 million
transactions/s to ~1.9 million transactions/s with no increase in
CPU usage. We also remove the 32 bit space limitation on the grant
heads, which will allow us to increase the journal size beyond 2GB
in future.
Note that this renames the sysfs files exposing the log grant space
now that the values are exported in bytes. This allows xfstests
to auto-detect the old or new ABI.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
[hch: move xlog_grant_sub_space out of line,
update the xlog_grant_{add,sub}_space prototypes,
rename the sysfs files to allow auto-detection in xfstests]
Signed-off-by: Christoph Hellwig <hch@lst.de>
---
Documentation/ABI/testing/sysfs-fs-xfs | 18 +-
fs/xfs/xfs_log.c | 246 +++++++++----------------
fs/xfs/xfs_log_cil.c | 12 ++
fs/xfs/xfs_log_priv.h | 33 +---
fs/xfs/xfs_log_recover.c | 4 -
fs/xfs/xfs_sysfs.c | 29 +--
fs/xfs/xfs_trace.h | 34 ++--
7 files changed, 138 insertions(+), 238 deletions(-)
diff --git a/Documentation/ABI/testing/sysfs-fs-xfs b/Documentation/ABI/testing/sysfs-fs-xfs
index 82d8e2f79834b5..7da4de948b46e7 100644
--- a/Documentation/ABI/testing/sysfs-fs-xfs
+++ b/Documentation/ABI/testing/sysfs-fs-xfs
@@ -15,25 +15,23 @@ Description:
The log sequence number (LSN) of the current tail of the
log. The LSN is exported in "cycle:basic block" format.
-What: /sys/fs/xfs/<disk>/log/reserve_grant_head
-Date: July 2014
-KernelVersion: 3.17
+What: /sys/fs/xfs/<disk>/log/reserve_grant_head_bytes
+Date: June 2024
+KernelVersion: 6.11
Contact: linux-xfs@vger.kernel.org
Description:
The current state of the log reserve grant head. It
represents the total log reservation of all currently
- outstanding transactions. The grant head is exported in
- "cycle:bytes" format.
+ outstanding transactions in bytes.
Users: xfstests
-What: /sys/fs/xfs/<disk>/log/write_grant_head
-Date: July 2014
-KernelVersion: 3.17
+What: /sys/fs/xfs/<disk>/log/write_grant_head_bytes
+Date: June 2024
+KernelVersion: 6.11
Contact: linux-xfs@vger.kernel.org
Description:
The current state of the log write grant head. It
represents the total log reservation of all currently
outstanding transactions, including regrants due to
- rolling transactions. The grant head is exported in
- "cycle:bytes" format.
+ rolling transactions in bytes.
Users: xfstests
diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
index 0e50b370f0e4c7..817ea7e0a8ab54 100644
--- a/fs/xfs/xfs_log.c
+++ b/fs/xfs/xfs_log.c
@@ -53,9 +53,6 @@ xlog_sync(
struct xlog_ticket *ticket);
#if defined(DEBUG)
STATIC void
-xlog_verify_grant_tail(
- struct xlog *log);
-STATIC void
xlog_verify_iclog(
struct xlog *log,
struct xlog_in_core *iclog,
@@ -65,7 +62,6 @@ xlog_verify_tail_lsn(
struct xlog *log,
struct xlog_in_core *iclog);
#else
-#define xlog_verify_grant_tail(a)
#define xlog_verify_iclog(a,b,c)
#define xlog_verify_tail_lsn(a,b)
#endif
@@ -133,125 +129,64 @@ xlog_prepare_iovec(
return buf;
}
-static void
+static inline void
xlog_grant_sub_space(
- struct xlog *log,
struct xlog_grant_head *head,
- int bytes)
+ int64_t bytes)
{
- int64_t head_val = atomic64_read(&head->grant);
- int64_t new, old;
-
- do {
- int cycle, space;
-
- xlog_crack_grant_head_val(head_val, &cycle, &space);
-
- space -= bytes;
- if (space < 0) {
- space += log->l_logsize;
- cycle--;
- }
-
- old = head_val;
- new = xlog_assign_grant_head_val(cycle, space);
- head_val = atomic64_cmpxchg(&head->grant, old, new);
- } while (head_val != old);
+ atomic64_sub(bytes, &head->grant);
}
-static void
+static inline void
xlog_grant_add_space(
- struct xlog *log,
struct xlog_grant_head *head,
- int bytes)
+ int64_t bytes)
{
- int64_t head_val = atomic64_read(&head->grant);
- int64_t new, old;
-
- do {
- int tmp;
- int cycle, space;
-
- xlog_crack_grant_head_val(head_val, &cycle, &space);
-
- tmp = log->l_logsize - space;
- if (tmp > bytes)
- space += bytes;
- else {
- space = bytes - tmp;
- cycle++;
- }
-
- old = head_val;
- new = xlog_assign_grant_head_val(cycle, space);
- head_val = atomic64_cmpxchg(&head->grant, old, new);
- } while (head_val != old);
+ atomic64_add(bytes, &head->grant);
}
-STATIC void
+static void
xlog_grant_head_init(
struct xlog_grant_head *head)
{
- xlog_assign_grant_head(&head->grant, 1, 0);
+ atomic64_set(&head->grant, 0);
INIT_LIST_HEAD(&head->waiters);
spin_lock_init(&head->lock);
}
+void
+xlog_grant_return_space(
+ struct xlog *log,
+ xfs_lsn_t old_head,
+ xfs_lsn_t new_head)
+{
+ int64_t diff = xlog_lsn_sub(log, new_head, old_head);
+
+ xlog_grant_sub_space(&log->l_reserve_head, diff);
+ xlog_grant_sub_space(&log->l_write_head, diff);
+}
+
/*
- * Return the space in the log between the tail and the head. The head
- * is passed in the cycle/bytes formal parms. In the special case where
- * the reserve head has wrapped passed the tail, this calculation is no
- * longer valid. In this case, just return 0 which means there is no space
- * in the log. This works for all places where this function is called
- * with the reserve head. Of course, if the write head were to ever
- * wrap the tail, we should blow up. Rather than catch this case here,
- * we depend on other ASSERTions in other parts of the code. XXXmiken
- *
- * If reservation head is behind the tail, we have a problem. Warn about it,
- * but then treat it as if the log is empty.
- *
- * If the log is shut down, the head and tail may be invalid or out of whack, so
- * shortcut invalidity asserts in this case so that we don't trigger them
- * falsely.
+ * Return the space in the log between the tail and the head. In the case where
+ * we have overrun available reservation space, return 0. The memory barrier
+ * pairs with the smp_wmb() in xlog_cil_ail_insert() to ensure that grant head
+ * vs tail space updates are seen in the correct order and hence avoid
+ * transients as space is transferred from the grant heads to the AIL on commit
+ * completion.
*/
-static int
+static uint64_t
xlog_grant_space_left(
struct xlog *log,
struct xlog_grant_head *head)
{
- int tail_bytes;
- int tail_cycle;
- int head_cycle;
- int head_bytes;
-
- xlog_crack_grant_head(&head->grant, &head_cycle, &head_bytes);
- xlog_crack_atomic_lsn(&log->l_tail_lsn, &tail_cycle, &tail_bytes);
- tail_bytes = BBTOB(tail_bytes);
- if (tail_cycle == head_cycle && head_bytes >= tail_bytes)
- return log->l_logsize - (head_bytes - tail_bytes);
- if (tail_cycle + 1 < head_cycle)
- return 0;
-
- /* Ignore potential inconsistency when shutdown. */
- if (xlog_is_shutdown(log))
- return log->l_logsize;
-
- if (tail_cycle < head_cycle) {
- ASSERT(tail_cycle == (head_cycle - 1));
- return tail_bytes - head_bytes;
- }
+ int64_t free_bytes;
- /*
- * The reservation head is behind the tail. In this case we just want to
- * return the size of the log as the amount of space left.
- */
- xfs_alert(log->l_mp, "xlog_grant_space_left: head behind tail");
- xfs_alert(log->l_mp, " tail_cycle = %d, tail_bytes = %d",
- tail_cycle, tail_bytes);
- xfs_alert(log->l_mp, " GH cycle = %d, GH bytes = %d",
- head_cycle, head_bytes);
- ASSERT(0);
- return log->l_logsize;
+ smp_rmb(); /* paired with smp_wmb in xlog_cil_ail_insert() */
+ free_bytes = log->l_logsize - READ_ONCE(log->l_tail_space) -
+ atomic64_read(&head->grant);
+ if (free_bytes > 0)
+ return free_bytes;
+ return 0;
}
STATIC void
@@ -453,9 +388,8 @@ xfs_log_regrant(
if (error)
goto out_error;
- xlog_grant_add_space(log, &log->l_write_head, need_bytes);
+ xlog_grant_add_space(&log->l_write_head, need_bytes);
trace_xfs_log_regrant_exit(log, tic);
- xlog_verify_grant_tail(log);
return 0;
out_error:
@@ -504,10 +438,9 @@ xfs_log_reserve(
if (error)
goto out_error;
- xlog_grant_add_space(log, &log->l_reserve_head, need_bytes);
- xlog_grant_add_space(log, &log->l_write_head, need_bytes);
+ xlog_grant_add_space(&log->l_reserve_head, need_bytes);
+ xlog_grant_add_space(&log->l_write_head, need_bytes);
trace_xfs_log_reserve_exit(log, tic);
- xlog_verify_grant_tail(log);
return 0;
out_error:
@@ -1880,8 +1813,8 @@ xlog_sync(
if (ticket) {
ticket->t_curr_res -= roundoff;
} else {
- xlog_grant_add_space(log, &log->l_reserve_head, roundoff);
- xlog_grant_add_space(log, &log->l_write_head, roundoff);
+ xlog_grant_add_space(&log->l_reserve_head, roundoff);
+ xlog_grant_add_space(&log->l_write_head, roundoff);
}
/* put cycle number in every block */
@@ -2801,16 +2734,15 @@ xfs_log_ticket_regrant(
if (ticket->t_cnt > 0)
ticket->t_cnt--;
- xlog_grant_sub_space(log, &log->l_reserve_head, ticket->t_curr_res);
- xlog_grant_sub_space(log, &log->l_write_head, ticket->t_curr_res);
+ xlog_grant_sub_space(&log->l_reserve_head, ticket->t_curr_res);
+ xlog_grant_sub_space(&log->l_write_head, ticket->t_curr_res);
ticket->t_curr_res = ticket->t_unit_res;
trace_xfs_log_ticket_regrant_sub(log, ticket);
/* just return if we still have some of the pre-reserved space */
if (!ticket->t_cnt) {
- xlog_grant_add_space(log, &log->l_reserve_head,
- ticket->t_unit_res);
+ xlog_grant_add_space(&log->l_reserve_head, ticket->t_unit_res);
trace_xfs_log_ticket_regrant_exit(log, ticket);
ticket->t_curr_res = ticket->t_unit_res;
@@ -2857,8 +2789,8 @@ xfs_log_ticket_ungrant(
bytes += ticket->t_unit_res*ticket->t_cnt;
}
- xlog_grant_sub_space(log, &log->l_reserve_head, bytes);
- xlog_grant_sub_space(log, &log->l_write_head, bytes);
+ xlog_grant_sub_space(&log->l_reserve_head, bytes);
+ xlog_grant_sub_space(&log->l_write_head, bytes);
trace_xfs_log_ticket_ungrant_exit(log, ticket);
@@ -3331,42 +3263,27 @@ xlog_ticket_alloc(
}
#if defined(DEBUG)
-/*
- * Check to make sure the grant write head didn't just over lap the tail. If
- * the cycles are the same, we can't be overlapping. Otherwise, make sure that
- * the cycles differ by exactly one and check the byte count.
- *
- * This check is run unlocked, so can give false positives. Rather than assert
- * on failures, use a warn-once flag and a panic tag to allow the admin to
- * determine if they want to panic the machine when such an error occurs. For
- * debug kernels this will have the same effect as using an assert but, unlinke
- * an assert, it can be turned off at runtime.
- */
-STATIC void
-xlog_verify_grant_tail(
- struct xlog *log)
+static void
+xlog_verify_dump_tail(
+ struct xlog *log,
+ struct xlog_in_core *iclog)
{
- int tail_cycle, tail_blocks;
- int cycle, space;
-
- xlog_crack_grant_head(&log->l_write_head.grant, &cycle, &space);
- xlog_crack_atomic_lsn(&log->l_tail_lsn, &tail_cycle, &tail_blocks);
- if (tail_cycle != cycle) {
- if (cycle - 1 != tail_cycle &&
- !test_and_set_bit(XLOG_TAIL_WARN, &log->l_opstate)) {
- xfs_alert_tag(log->l_mp, XFS_PTAG_LOGRES,
- "%s: cycle - 1 != tail_cycle", __func__);
- }
-
- if (space > BBTOB(tail_blocks) &&
- !test_and_set_bit(XLOG_TAIL_WARN, &log->l_opstate)) {
- xfs_alert_tag(log->l_mp, XFS_PTAG_LOGRES,
- "%s: space > BBTOB(tail_blocks)", __func__);
- }
- }
-}
-
-/* check if it will fit */
+ xfs_alert(log->l_mp,
+"ran out of log space tail 0x%llx/0x%llx, head lsn 0x%llx, head 0x%x/0x%x, prev head 0x%x/0x%x",
+ iclog ? be64_to_cpu(iclog->ic_header.h_tail_lsn) : -1,
+ atomic64_read(&log->l_tail_lsn),
+ log->l_ailp->ail_head_lsn,
+ log->l_curr_cycle, log->l_curr_block,
+ log->l_prev_cycle, log->l_prev_block);
+ xfs_alert(log->l_mp,
+"write grant 0x%llx, reserve grant 0x%llx, tail_space 0x%llx, size 0x%x, iclog flags 0x%x",
+ atomic64_read(&log->l_write_head.grant),
+ atomic64_read(&log->l_reserve_head.grant),
+ log->l_tail_space, log->l_logsize,
+ iclog ? iclog->ic_flags : -1);
+}
+
+/* Check if the new iclog will fit in the log. */
STATIC void
xlog_verify_tail_lsn(
struct xlog *log,
@@ -3375,21 +3292,34 @@ xlog_verify_tail_lsn(
xfs_lsn_t tail_lsn = be64_to_cpu(iclog->ic_header.h_tail_lsn);
int blocks;
- if (CYCLE_LSN(tail_lsn) == log->l_prev_cycle) {
- blocks =
- log->l_logBBsize - (log->l_prev_block - BLOCK_LSN(tail_lsn));
- if (blocks < BTOBB(iclog->ic_offset)+BTOBB(log->l_iclog_hsize))
- xfs_emerg(log->l_mp, "%s: ran out of log space", __func__);
- } else {
- ASSERT(CYCLE_LSN(tail_lsn)+1 == log->l_prev_cycle);
+ if (CYCLE_LSN(tail_lsn) == log->l_prev_cycle) {
+ blocks = log->l_logBBsize -
+ (log->l_prev_block - BLOCK_LSN(tail_lsn));
+ if (blocks < BTOBB(iclog->ic_offset) +
+ BTOBB(log->l_iclog_hsize)) {
+ xfs_emerg(log->l_mp,
+ "%s: ran out of log space", __func__);
+ xlog_verify_dump_tail(log, iclog);
+ }
+ return;
+ }
- if (BLOCK_LSN(tail_lsn) == log->l_prev_block)
+ if (CYCLE_LSN(tail_lsn) + 1 != log->l_prev_cycle) {
+ xfs_emerg(log->l_mp, "%s: head has wrapped tail.", __func__);
+ xlog_verify_dump_tail(log, iclog);
+ return;
+ }
+ if (BLOCK_LSN(tail_lsn) == log->l_prev_block) {
xfs_emerg(log->l_mp, "%s: tail wrapped", __func__);
+ xlog_verify_dump_tail(log, iclog);
+ return;
+ }
blocks = BLOCK_LSN(tail_lsn) - log->l_prev_block;
- if (blocks < BTOBB(iclog->ic_offset) + 1)
- xfs_emerg(log->l_mp, "%s: ran out of log space", __func__);
- }
+ if (blocks < BTOBB(iclog->ic_offset) + 1) {
+ xfs_emerg(log->l_mp, "%s: ran out of iclog space", __func__);
+ xlog_verify_dump_tail(log, iclog);
+ }
}
/*
diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
index 92ccac7f905448..391a938d690c59 100644
--- a/fs/xfs/xfs_log_cil.c
+++ b/fs/xfs/xfs_log_cil.c
@@ -764,6 +764,7 @@ xlog_cil_ail_insert(
struct xfs_log_item *log_items[LOG_ITEM_BATCH_SIZE];
struct xfs_log_vec *lv;
struct xfs_ail_cursor cur;
+ xfs_lsn_t old_head;
int i = 0;
/*
@@ -780,10 +781,21 @@ xlog_cil_ail_insert(
aborted);
spin_lock(&ailp->ail_lock);
xfs_trans_ail_cursor_last(ailp, &cur, ctx->start_lsn);
+ old_head = ailp->ail_head_lsn;
ailp->ail_head_lsn = ctx->commit_lsn;
/* xfs_ail_update_finish() drops the ail_lock */
xfs_ail_update_finish(ailp, NULLCOMMITLSN);
+ /*
+ * We move the AIL head forwards to account for the space used in the
+ * log before we remove that space from the grant heads. This prevents a
+ * transient condition where reservation space appears to become
+ * available on return, only for it to disappear again immediately as
+ * the AIL head update accounts in the log tail space.
+ */
+ smp_wmb(); /* paired with smp_rmb in xlog_grant_space_left */
+ xlog_grant_return_space(ailp->ail_log, old_head, ailp->ail_head_lsn);
+
/* unpin all the log items */
list_for_each_entry(lv, &ctx->lv_chain, lv_list) {
struct xfs_log_item *lip = lv->lv_item;
diff --git a/fs/xfs/xfs_log_priv.h b/fs/xfs/xfs_log_priv.h
index 0838c57ca8ac22..b8778a4fd6b64e 100644
--- a/fs/xfs/xfs_log_priv.h
+++ b/fs/xfs/xfs_log_priv.h
@@ -543,36 +543,6 @@ xlog_assign_atomic_lsn(atomic64_t *lsn, uint cycle, uint block)
atomic64_set(lsn, xlog_assign_lsn(cycle, block));
}
-/*
- * When we crack the grant head, we sample it first so that the value will not
- * change while we are cracking it into the component values. This means we
- * will always get consistent component values to work from.
- */
-static inline void
-xlog_crack_grant_head_val(int64_t val, int *cycle, int *space)
-{
- *cycle = val >> 32;
- *space = val & 0xffffffff;
-}
-
-static inline void
-xlog_crack_grant_head(atomic64_t *head, int *cycle, int *space)
-{
- xlog_crack_grant_head_val(atomic64_read(head), cycle, space);
-}
-
-static inline int64_t
-xlog_assign_grant_head_val(int cycle, int space)
-{
- return ((int64_t)cycle << 32) | space;
-}
-
-static inline void
-xlog_assign_grant_head(atomic64_t *head, int cycle, int space)
-{
- atomic64_set(head, xlog_assign_grant_head_val(cycle, space));
-}
-
/*
* Committed Item List interfaces
*/
@@ -639,6 +609,9 @@ xlog_lsn_sub(
return (uint64_t)log->l_logsize - BBTOB(lo_block - hi_block);
}
+void xlog_grant_return_space(struct xlog *log, xfs_lsn_t old_head,
+ xfs_lsn_t new_head);
+
/*
* The LSN is valid so long as it is behind the current LSN. If it isn't, this
* means that the next log record that includes this metadata could have a
diff --git a/fs/xfs/xfs_log_recover.c b/fs/xfs/xfs_log_recover.c
index 63f667f92c322e..32c6d7070871dc 100644
--- a/fs/xfs/xfs_log_recover.c
+++ b/fs/xfs/xfs_log_recover.c
@@ -1213,10 +1213,6 @@ xlog_set_state(
log->l_curr_cycle++;
atomic64_set(&log->l_tail_lsn, be64_to_cpu(rhead->h_tail_lsn));
log->l_ailp->ail_head_lsn = be64_to_cpu(rhead->h_lsn);
- xlog_assign_grant_head(&log->l_reserve_head.grant, log->l_curr_cycle,
- BBTOB(log->l_curr_block));
- xlog_assign_grant_head(&log->l_write_head.grant, log->l_curr_cycle,
- BBTOB(log->l_curr_block));
}
/*
diff --git a/fs/xfs/xfs_sysfs.c b/fs/xfs/xfs_sysfs.c
index d2391eec37fe9d..60cb5318fdae3c 100644
--- a/fs/xfs/xfs_sysfs.c
+++ b/fs/xfs/xfs_sysfs.c
@@ -432,39 +432,30 @@ log_tail_lsn_show(
XFS_SYSFS_ATTR_RO(log_tail_lsn);
STATIC ssize_t
-reserve_grant_head_show(
+reserve_grant_head_bytes_show(
struct kobject *kobject,
char *buf)
-
{
- int cycle;
- int bytes;
- struct xlog *log = to_xlog(kobject);
-
- xlog_crack_grant_head(&log->l_reserve_head.grant, &cycle, &bytes);
- return sysfs_emit(buf, "%d:%d\n", cycle, bytes);
+ return sysfs_emit(buf, "%lld\n",
+ atomic64_read(&to_xlog(kobject)->l_reserve_head.grant));
}
-XFS_SYSFS_ATTR_RO(reserve_grant_head);
+XFS_SYSFS_ATTR_RO(reserve_grant_head_bytes);
STATIC ssize_t
-write_grant_head_show(
+write_grant_head_bytes_show(
struct kobject *kobject,
char *buf)
{
- int cycle;
- int bytes;
- struct xlog *log = to_xlog(kobject);
-
- xlog_crack_grant_head(&log->l_write_head.grant, &cycle, &bytes);
- return sysfs_emit(buf, "%d:%d\n", cycle, bytes);
+ return sysfs_emit(buf, "%lld\n",
+ atomic64_read(&to_xlog(kobject)->l_write_head.grant));
}
-XFS_SYSFS_ATTR_RO(write_grant_head);
+XFS_SYSFS_ATTR_RO(write_grant_head_bytes);
static struct attribute *xfs_log_attrs[] = {
ATTR_LIST(log_head_lsn),
ATTR_LIST(log_tail_lsn),
- ATTR_LIST(reserve_grant_head),
- ATTR_LIST(write_grant_head),
+ ATTR_LIST(reserve_grant_head_bytes),
+ ATTR_LIST(write_grant_head_bytes),
NULL,
};
ATTRIBUTE_GROUPS(xfs_log);
diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
index 13f6e6cab572ae..a7ff0c7f6800a0 100644
--- a/fs/xfs/xfs_trace.h
+++ b/fs/xfs/xfs_trace.h
@@ -1227,6 +1227,7 @@ DECLARE_EVENT_CLASS(xfs_loggrant_class,
TP_ARGS(log, tic),
TP_STRUCT__entry(
__field(dev_t, dev)
+ __field(unsigned long, tic)
__field(char, ocnt)
__field(char, cnt)
__field(int, curr_res)
@@ -1234,16 +1235,16 @@ DECLARE_EVENT_CLASS(xfs_loggrant_class,
__field(unsigned int, flags)
__field(int, reserveq)
__field(int, writeq)
- __field(int, grant_reserve_cycle)
- __field(int, grant_reserve_bytes)
- __field(int, grant_write_cycle)
- __field(int, grant_write_bytes)
+ __field(uint64_t, grant_reserve_bytes)
+ __field(uint64_t, grant_write_bytes)
+ __field(uint64_t, tail_space)
__field(int, curr_cycle)
__field(int, curr_block)
__field(xfs_lsn_t, tail_lsn)
),
TP_fast_assign(
__entry->dev = log->l_mp->m_super->s_dev;
+ __entry->tic = (unsigned long)tic;
__entry->ocnt = tic->t_ocnt;
__entry->cnt = tic->t_cnt;
__entry->curr_res = tic->t_curr_res;
@@ -1251,23 +1252,22 @@ DECLARE_EVENT_CLASS(xfs_loggrant_class,
__entry->flags = tic->t_flags;
__entry->reserveq = list_empty(&log->l_reserve_head.waiters);
__entry->writeq = list_empty(&log->l_write_head.waiters);
- xlog_crack_grant_head(&log->l_reserve_head.grant,
- &__entry->grant_reserve_cycle,
- &__entry->grant_reserve_bytes);
- xlog_crack_grant_head(&log->l_write_head.grant,
- &__entry->grant_write_cycle,
- &__entry->grant_write_bytes);
+ __entry->tail_space = READ_ONCE(log->l_tail_space);
+ __entry->grant_reserve_bytes = __entry->tail_space +
+ atomic64_read(&log->l_reserve_head.grant);
+ __entry->grant_write_bytes = __entry->tail_space +
+ atomic64_read(&log->l_write_head.grant);
__entry->curr_cycle = log->l_curr_cycle;
__entry->curr_block = log->l_curr_block;
__entry->tail_lsn = atomic64_read(&log->l_tail_lsn);
),
- TP_printk("dev %d:%d t_ocnt %u t_cnt %u t_curr_res %u "
- "t_unit_res %u t_flags %s reserveq %s "
- "writeq %s grant_reserve_cycle %d "
- "grant_reserve_bytes %d grant_write_cycle %d "
- "grant_write_bytes %d curr_cycle %d curr_block %d "
+ TP_printk("dev %d:%d tic 0x%lx t_ocnt %u t_cnt %u t_curr_res %u "
+ "t_unit_res %u t_flags %s reserveq %s writeq %s "
+ "tail space %llu grant_reserve_bytes %llu "
+ "grant_write_bytes %llu curr_cycle %d curr_block %d "
"tail_cycle %d tail_block %d",
MAJOR(__entry->dev), MINOR(__entry->dev),
+ __entry->tic,
__entry->ocnt,
__entry->cnt,
__entry->curr_res,
@@ -1275,9 +1275,8 @@ DECLARE_EVENT_CLASS(xfs_loggrant_class,
__print_flags(__entry->flags, "|", XLOG_TIC_FLAGS),
__entry->reserveq ? "empty" : "active",
__entry->writeq ? "empty" : "active",
- __entry->grant_reserve_cycle,
+ __entry->tail_space,
__entry->grant_reserve_bytes,
- __entry->grant_write_cycle,
__entry->grant_write_bytes,
__entry->curr_cycle,
__entry->curr_block,
@@ -1305,6 +1304,7 @@ DEFINE_LOGGRANT_EVENT(xfs_log_ticket_ungrant);
DEFINE_LOGGRANT_EVENT(xfs_log_ticket_ungrant_sub);
DEFINE_LOGGRANT_EVENT(xfs_log_ticket_ungrant_exit);
DEFINE_LOGGRANT_EVENT(xfs_log_cil_wait);
+DEFINE_LOGGRANT_EVENT(xfs_log_cil_return);
DECLARE_EVENT_CLASS(xfs_log_item_class,
TP_PROTO(struct xfs_log_item *lip),
--
2.43.0
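An illustration of the sysfs ABI change above (not part of the patch; the
values in the comments are made-up examples): the old grant head packed
{cycle, byte offset} into one 64-bit value - which is what the
xlog_crack/assign_grant_head helpers removed at the top of this patch
unpacked - so the old files printed a "cycle:bytes" pair, while the
renamed files print a plain count of reserved bytes.

#include <stdint.h>
#include <stdio.h>

static void show_old_grant_head(int64_t packed, char *buf, size_t len)
{
	int	cycle = packed >> 32;
	int	bytes = packed & 0xffffffff;

	/* old reserve_grant_head format, e.g. "3:524288" */
	snprintf(buf, len, "%d:%d\n", cycle, bytes);
}

static void show_new_grant_head(int64_t grant_bytes, char *buf, size_t len)
{
	/* new reserve_grant_head_bytes format, e.g. "524288" */
	snprintf(buf, len, "%lld\n", (long long)grant_bytes);
}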
^ permalink raw reply related [flat|nested] 24+ messages in thread
* [PATCH 11/11] xfs: skip flushing log items during push
2024-06-20 7:21 xfs: byte-based grant head reservation tracking v4 Christoph Hellwig
` (9 preceding siblings ...)
2024-06-20 7:21 ` [PATCH 10/11] xfs: grant heads track byte counts, not LSNs Christoph Hellwig
@ 2024-06-20 7:21 ` Christoph Hellwig
2024-06-20 19:51 ` Darrick J. Wong
10 siblings, 1 reply; 24+ messages in thread
From: Christoph Hellwig @ 2024-06-20 7:21 UTC (permalink / raw)
To: Chandan Babu R; +Cc: Darrick J. Wong, Dave Chinner, linux-xfs, Dave Chinner
From: Dave Chinner <dchinner@redhat.com>
The AIL pushing code spends a huge amount of time skipping over
items that are already marked as flushing. It is not uncommon to
see hundreds of thousands of items skipped every second due to inode
clustering marking all the inodes in a cluster as flushing when the
first one is flushed.
However, to discover an item is already flushing and should be
skipped we have to call the iop_push() method for it to try to flush
the item. For inodes (where this matters most), we first have to
check that the inode is flushable.
We can optimise this overhead away by tracking whether the log item
is flushing internally. This allows xfsaild_push() to check the log
item directly for flushing state and immediately skip it. Whilst this
doesn't remove the CPU cache misses for loading the log item, it does
avoid the overhead of an indirect function call and the cache misses
involved in accessing the inode and backing cluster buffer structures
to determine flushing state. When trying to flush hundreds of thousands
of inodes each second, this CPU overhead saving adds up quickly.
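As a rough sketch of the shape of the resulting fast path (a userspace
toy model for illustration only, not the kernel code in the diff below;
the constant and structure here are simplified stand-ins):

#include <stdbool.h>

/*
 * Mirrors the XFS_LI_FLUSHING bit added below: set when the inode is
 * attached to its cluster buffer for writeback, cleared again on I/O
 * completion, I/O failure or flush abort.
 */
#define TOY_LI_FLUSHING		(1u << 5)

struct toy_log_item {
	unsigned long	flags;
};

static bool toy_should_skip(const struct toy_log_item *lip)
{
	/* one flag test instead of an indirect iop_push() call */
	return lip->flags & TOY_LI_FLUSHING;
}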
It's so noticeable that the biggest issue with pushing on the AIL on
fast storage becomes the 10ms back-off wait when we hit enough
pinned buffers to break out of the push loop but not enough for the
AIL pushing to be considered stuck. This limits the xfsaild to about
70% total CPU usage, and on fast storage this isn't enough to keep
the storage 100% busy.
The xfsaild will block on IO submission on slow storage and so is
self-throttling - it does not need a backoff in the case where we
are really just breaking out of the walk to submit the IO we have
gathered.
Further, with no backoff we don't need to gather huge delwri lists to
mitigate the impact of backoffs, so we can submit IO more frequently
and reduce the time log items spend in the flushing state by breaking
out of the item push loop once we've gathered enough IO to batch
submission effectively.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
fs/xfs/xfs_inode.c | 1 +
fs/xfs/xfs_inode_item.c | 6 +++++-
fs/xfs/xfs_trans.h | 4 +++-
fs/xfs/xfs_trans_ail.c | 8 +++++++-
4 files changed, 16 insertions(+), 3 deletions(-)
diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index 58fb7a5062e1e6..da611ec56f1be0 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -3713,6 +3713,7 @@ xfs_iflush(
iip->ili_last_fields = iip->ili_fields;
iip->ili_fields = 0;
iip->ili_fsync_fields = 0;
+ set_bit(XFS_LI_FLUSHING, &iip->ili_item.li_flags);
spin_unlock(&iip->ili_lock);
/*
diff --git a/fs/xfs/xfs_inode_item.c b/fs/xfs/xfs_inode_item.c
index f28d653300d124..ba49e56820f0a7 100644
--- a/fs/xfs/xfs_inode_item.c
+++ b/fs/xfs/xfs_inode_item.c
@@ -933,6 +933,7 @@ xfs_iflush_finish(
}
iip->ili_last_fields = 0;
iip->ili_flush_lsn = 0;
+ clear_bit(XFS_LI_FLUSHING, &lip->li_flags);
spin_unlock(&iip->ili_lock);
xfs_iflags_clear(iip->ili_inode, XFS_IFLUSHING);
if (drop_buffer)
@@ -991,8 +992,10 @@ xfs_buf_inode_io_fail(
{
struct xfs_log_item *lip;
- list_for_each_entry(lip, &bp->b_li_list, li_bio_list)
+ list_for_each_entry(lip, &bp->b_li_list, li_bio_list) {
set_bit(XFS_LI_FAILED, &lip->li_flags);
+ clear_bit(XFS_LI_FLUSHING, &lip->li_flags);
+ }
}
/*
@@ -1011,6 +1014,7 @@ xfs_iflush_abort_clean(
iip->ili_flush_lsn = 0;
iip->ili_item.li_buf = NULL;
list_del_init(&iip->ili_item.li_bio_list);
+ clear_bit(XFS_LI_FLUSHING, &iip->ili_item.li_flags);
}
/*
diff --git a/fs/xfs/xfs_trans.h b/fs/xfs/xfs_trans.h
index 1636663707dc04..20eb6ea7ebaa04 100644
--- a/fs/xfs/xfs_trans.h
+++ b/fs/xfs/xfs_trans.h
@@ -58,13 +58,15 @@ struct xfs_log_item {
#define XFS_LI_FAILED 2
#define XFS_LI_DIRTY 3
#define XFS_LI_WHITEOUT 4
+#define XFS_LI_FLUSHING 5
#define XFS_LI_FLAGS \
{ (1u << XFS_LI_IN_AIL), "IN_AIL" }, \
{ (1u << XFS_LI_ABORTED), "ABORTED" }, \
{ (1u << XFS_LI_FAILED), "FAILED" }, \
{ (1u << XFS_LI_DIRTY), "DIRTY" }, \
- { (1u << XFS_LI_WHITEOUT), "WHITEOUT" }
+ { (1u << XFS_LI_WHITEOUT), "WHITEOUT" }, \
+ { (1u << XFS_LI_FLUSHING), "FLUSHING" }
struct xfs_item_ops {
unsigned flags;
diff --git a/fs/xfs/xfs_trans_ail.c b/fs/xfs/xfs_trans_ail.c
index 6a106a05fae017..0fafcc9f3dbe44 100644
--- a/fs/xfs/xfs_trans_ail.c
+++ b/fs/xfs/xfs_trans_ail.c
@@ -512,6 +512,9 @@ xfsaild_push(
while ((XFS_LSN_CMP(lip->li_lsn, ailp->ail_target) <= 0)) {
int lock_result;
+ if (test_bit(XFS_LI_FLUSHING, &lip->li_flags))
+ goto next_item;
+
/*
* Note that iop_push may unlock and reacquire the AIL lock. We
* rely on the AIL cursor implementation to be able to deal with
@@ -581,9 +584,12 @@ xfsaild_push(
if (stuck > 100)
break;
+next_item:
lip = xfs_trans_ail_cursor_next(ailp, &cur);
if (lip == NULL)
break;
+ if (lip->li_lsn != lsn && count > 1000)
+ break;
lsn = lip->li_lsn;
}
@@ -620,7 +626,7 @@ xfsaild_push(
/*
* Assume we have more work to do in a short while.
*/
- tout = 10;
+ tout = 0;
}
return tout;
--
2.43.0
^ permalink raw reply related [flat|nested] 24+ messages in thread
* Re: [PATCH 01/11] xfs: fix the contact address for the sysfs ABI documentation
2024-06-20 7:21 ` [PATCH 01/11] xfs: fix the contact address for the sysfs ABI documentation Christoph Hellwig
@ 2024-06-20 15:20 ` Darrick J. Wong
0 siblings, 0 replies; 24+ messages in thread
From: Darrick J. Wong @ 2024-06-20 15:20 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: Chandan Babu R, Dave Chinner, linux-xfs
On Thu, Jun 20, 2024 at 09:21:18AM +0200, Christoph Hellwig wrote:
> oss.sgi.com is long dead, refer to the current linux-xfs list instead.
>
> Signed-off-by: Christoph Hellwig <hch@lst.de>
Yikes.
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
--D
> ---
> Documentation/ABI/testing/sysfs-fs-xfs | 8 ++++----
> 1 file changed, 4 insertions(+), 4 deletions(-)
>
> diff --git a/Documentation/ABI/testing/sysfs-fs-xfs b/Documentation/ABI/testing/sysfs-fs-xfs
> index f704925f6fe93b..82d8e2f79834b5 100644
> --- a/Documentation/ABI/testing/sysfs-fs-xfs
> +++ b/Documentation/ABI/testing/sysfs-fs-xfs
> @@ -1,7 +1,7 @@
> What: /sys/fs/xfs/<disk>/log/log_head_lsn
> Date: July 2014
> KernelVersion: 3.17
> -Contact: xfs@oss.sgi.com
> +Contact: linux-xfs@vger.kernel.org
> Description:
> The log sequence number (LSN) of the current head of the
> log. The LSN is exported in "cycle:basic block" format.
> @@ -10,7 +10,7 @@ Users: xfstests
> What: /sys/fs/xfs/<disk>/log/log_tail_lsn
> Date: July 2014
> KernelVersion: 3.17
> -Contact: xfs@oss.sgi.com
> +Contact: linux-xfs@vger.kernel.org
> Description:
> The log sequence number (LSN) of the current tail of the
> log. The LSN is exported in "cycle:basic block" format.
> @@ -18,7 +18,7 @@ Description:
> What: /sys/fs/xfs/<disk>/log/reserve_grant_head
> Date: July 2014
> KernelVersion: 3.17
> -Contact: xfs@oss.sgi.com
> +Contact: linux-xfs@vger.kernel.org
> Description:
> The current state of the log reserve grant head. It
> represents the total log reservation of all currently
> @@ -29,7 +29,7 @@ Users: xfstests
> What: /sys/fs/xfs/<disk>/log/write_grant_head
> Date: July 2014
> KernelVersion: 3.17
> -Contact: xfs@oss.sgi.com
> +Contact: linux-xfs@vger.kernel.org
> Description:
> The current state of the log write grant head. It
> represents the total log reservation of all currently
> --
> 2.43.0
>
>
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH 03/11] xfs: AIL doesn't need manual pushing
2024-06-20 7:21 ` [PATCH 03/11] xfs: AIL doesn't need manual pushing Christoph Hellwig
@ 2024-06-20 19:01 ` Darrick J. Wong
0 siblings, 0 replies; 24+ messages in thread
From: Darrick J. Wong @ 2024-06-20 19:01 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: Chandan Babu R, Dave Chinner, linux-xfs, Dave Chinner
On Thu, Jun 20, 2024 at 09:21:20AM +0200, Christoph Hellwig wrote:
> From: Dave Chinner <dchinner@redhat.com>
>
> We have a mechanism that checks the amount of log space remaining
> available every time we make a transaction reservation. If the
> amount of space is below a threshold (25% free) we push on the AIL
> to tell it to do more work. To do this, we end up calculating the
> LSN that the AIL needs to push to on every reservation and updating
> the push target for the AIL with that new target LSN.
>
> This is silly and expensive. The AIL is perfectly capable of
> calculating the push target itself, and it will always be running
> when the AIL contains objects.
>
> What the target does is determine if the AIL needs to do
> any work before it goes back to sleep. If we haven't run out of
> reservation space or memory (or some other push all trigger), it
> will simply go back to sleep for a while if there is more than 25%
> of the journal space free without doing anything.
>
> If there are items in the AIL at a lower LSN than the target, it
> will try to push up to the target or to the point of getting stuck
> before going back to sleep and trying again soon after.
>
> Hence we can modify the AIL to calculate its own 25% push target
> before it starts a push using the same reserve grant head based
> calculation as is currently used, and remove all the places where we
> ask the AIL to push to a new 25% free target. We can also drop the
> minimum free space size of 256BBs from the calculation because
> 25% of a minimum sized log is *always* going to be larger than
> 256BBs.
>
> This does still require a manual push in certain circumstances.
> These circumstances arise when the AIL is not full, but the
> reservation grants consume all of the free space in the log.
> In this case, we still need to push on the AIL to free up space, so
> when we hit this condition (i.e. reservation going to sleep to wait
> on log space) we do a single push to tell the AIL it should empty
> itself. This will keep the AIL moving as new reservations come in
> and want more space, rather than keep queuing them and having to
> push the AIL repeatedly.
>
> The reason for using the "push all" when grant space runs out is
> that we can run out of grant space when there is more than 25% of
> the log free. Small logs are notorious for this, and we have a hack
> in the log callback code (xlog_state_set_callback()) where we push
> the AIL because the *head* moved, to ensure that we kick the AIL
> when we consume space in it because that can push us over the "less
> than 25% available" threshold that starts tail pushing back up
> again.
>
> Hence when we run out of grant space and are going to sleep, we have
> to consider that the grant space may be consuming almost all the log
> space and there is almost nothing in the AIL. In this situation, the
> AIL pins the tail and moving the tail forwards is the only way the
> grant space will come available, so we have to force the AIL to push
> everything to guarantee grant space will eventually be returned.
> Hence triggering a "push all" just before sleeping removes all the
> nasty corner cases we have in other parts of the code that work
> around the "we didn't ask the AIL to push enough to free grant
> space" condition that leads to log space hangs...
Thanks for adding the extra explanation from the last time this patch
was sent to the list!
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
--D
>
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
> fs/xfs/libxfs/xfs_defer.c | 4 +-
> fs/xfs/xfs_log.c | 135 ++-----------------------------
> fs/xfs/xfs_log.h | 1 -
> fs/xfs/xfs_log_priv.h | 2 +
> fs/xfs/xfs_trans_ail.c | 162 +++++++++++++++++---------------------
> fs/xfs/xfs_trans_priv.h | 33 ++++++--
> 6 files changed, 108 insertions(+), 229 deletions(-)
>
> diff --git a/fs/xfs/libxfs/xfs_defer.c b/fs/xfs/libxfs/xfs_defer.c
> index 4a078e07e1a0a0..e2c8308d518b56 100644
> --- a/fs/xfs/libxfs/xfs_defer.c
> +++ b/fs/xfs/libxfs/xfs_defer.c
> @@ -12,12 +12,14 @@
> #include "xfs_mount.h"
> #include "xfs_defer.h"
> #include "xfs_trans.h"
> +#include "xfs_trans_priv.h"
> #include "xfs_buf_item.h"
> #include "xfs_inode.h"
> #include "xfs_inode_item.h"
> #include "xfs_trace.h"
> #include "xfs_icache.h"
> #include "xfs_log.h"
> +#include "xfs_log_priv.h"
> #include "xfs_rmap.h"
> #include "xfs_refcount.h"
> #include "xfs_bmap.h"
> @@ -556,7 +558,7 @@ xfs_defer_relog(
> * the log threshold once per call.
> */
> if (threshold_lsn == NULLCOMMITLSN) {
> - threshold_lsn = xlog_grant_push_threshold(log, 0);
> + threshold_lsn = xfs_ail_push_target(log->l_ailp);
> if (threshold_lsn == NULLCOMMITLSN)
> break;
> }
> diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
> index 416c154949832c..235fcf6dc4eeb5 100644
> --- a/fs/xfs/xfs_log.c
> +++ b/fs/xfs/xfs_log.c
> @@ -30,10 +30,6 @@ xlog_alloc_log(
> struct xfs_buftarg *log_target,
> xfs_daddr_t blk_offset,
> int num_bblks);
> -STATIC int
> -xlog_space_left(
> - struct xlog *log,
> - atomic64_t *head);
> STATIC void
> xlog_dealloc_log(
> struct xlog *log);
> @@ -51,10 +47,6 @@ xlog_state_get_iclog_space(
> struct xlog_ticket *ticket,
> int *logoffsetp);
> STATIC void
> -xlog_grant_push_ail(
> - struct xlog *log,
> - int need_bytes);
> -STATIC void
> xlog_sync(
> struct xlog *log,
> struct xlog_in_core *iclog,
> @@ -242,42 +234,15 @@ xlog_grant_head_wake(
> {
> struct xlog_ticket *tic;
> int need_bytes;
> - bool woken_task = false;
>
> list_for_each_entry(tic, &head->waiters, t_queue) {
> -
> - /*
> - * There is a chance that the size of the CIL checkpoints in
> - * progress at the last AIL push target calculation resulted in
> - * limiting the target to the log head (l_last_sync_lsn) at the
> - * time. This may not reflect where the log head is now as the
> - * CIL checkpoints may have completed.
> - *
> - * Hence when we are woken here, it may be that the head of the
> - * log that has moved rather than the tail. As the tail didn't
> - * move, there still won't be space available for the
> - * reservation we require. However, if the AIL has already
> - * pushed to the target defined by the old log head location, we
> - * will hang here waiting for something else to update the AIL
> - * push target.
> - *
> - * Therefore, if there isn't space to wake the first waiter on
> - * the grant head, we need to push the AIL again to ensure the
> - * target reflects both the current log tail and log head
> - * position before we wait for the tail to move again.
> - */
> -
> need_bytes = xlog_ticket_reservation(log, head, tic);
> - if (*free_bytes < need_bytes) {
> - if (!woken_task)
> - xlog_grant_push_ail(log, need_bytes);
> + if (*free_bytes < need_bytes)
> return false;
> - }
>
> *free_bytes -= need_bytes;
> trace_xfs_log_grant_wake_up(log, tic);
> wake_up_process(tic->t_task);
> - woken_task = true;
> }
>
> return true;
> @@ -296,13 +261,15 @@ xlog_grant_head_wait(
> do {
> if (xlog_is_shutdown(log))
> goto shutdown;
> - xlog_grant_push_ail(log, need_bytes);
>
> __set_current_state(TASK_UNINTERRUPTIBLE);
> spin_unlock(&head->lock);
>
> XFS_STATS_INC(log->l_mp, xs_sleep_logspace);
>
> + /* Push on the AIL to free up all the log space. */
> + xfs_ail_push_all(log->l_ailp);
> +
> trace_xfs_log_grant_sleep(log, tic);
> schedule();
> trace_xfs_log_grant_wake(log, tic);
> @@ -418,9 +385,6 @@ xfs_log_regrant(
> * of rolling transactions in the log easily.
> */
> tic->t_tid++;
> -
> - xlog_grant_push_ail(log, tic->t_unit_res);
> -
> tic->t_curr_res = tic->t_unit_res;
> if (tic->t_cnt > 0)
> return 0;
> @@ -477,12 +441,7 @@ xfs_log_reserve(
> ASSERT(*ticp == NULL);
> tic = xlog_ticket_alloc(log, unit_bytes, cnt, permanent);
> *ticp = tic;
> -
> - xlog_grant_push_ail(log, tic->t_cnt ? tic->t_unit_res * tic->t_cnt
> - : tic->t_unit_res);
> -
> trace_xfs_log_reserve(log, tic);
> -
> error = xlog_grant_head_check(log, &log->l_reserve_head, tic,
> &need_bytes);
> if (error)
> @@ -1330,7 +1289,7 @@ xlog_assign_tail_lsn(
> * shortcut invalidity asserts in this case so that we don't trigger them
> * falsely.
> */
> -STATIC int
> +int
> xlog_space_left(
> struct xlog *log,
> atomic64_t *head)
> @@ -1667,89 +1626,6 @@ xlog_alloc_log(
> return ERR_PTR(error);
> } /* xlog_alloc_log */
>
> -/*
> - * Compute the LSN that we'd need to push the log tail towards in order to have
> - * (a) enough on-disk log space to log the number of bytes specified, (b) at
> - * least 25% of the log space free, and (c) at least 256 blocks free. If the
> - * log free space already meets all three thresholds, this function returns
> - * NULLCOMMITLSN.
> - */
> -xfs_lsn_t
> -xlog_grant_push_threshold(
> - struct xlog *log,
> - int need_bytes)
> -{
> - xfs_lsn_t threshold_lsn = 0;
> - xfs_lsn_t last_sync_lsn;
> - int free_blocks;
> - int free_bytes;
> - int threshold_block;
> - int threshold_cycle;
> - int free_threshold;
> -
> - ASSERT(BTOBB(need_bytes) < log->l_logBBsize);
> -
> - free_bytes = xlog_space_left(log, &log->l_reserve_head.grant);
> - free_blocks = BTOBBT(free_bytes);
> -
> - /*
> - * Set the threshold for the minimum number of free blocks in the
> - * log to the maximum of what the caller needs, one quarter of the
> - * log, and 256 blocks.
> - */
> - free_threshold = BTOBB(need_bytes);
> - free_threshold = max(free_threshold, (log->l_logBBsize >> 2));
> - free_threshold = max(free_threshold, 256);
> - if (free_blocks >= free_threshold)
> - return NULLCOMMITLSN;
> -
> - xlog_crack_atomic_lsn(&log->l_tail_lsn, &threshold_cycle,
> - &threshold_block);
> - threshold_block += free_threshold;
> - if (threshold_block >= log->l_logBBsize) {
> - threshold_block -= log->l_logBBsize;
> - threshold_cycle += 1;
> - }
> - threshold_lsn = xlog_assign_lsn(threshold_cycle,
> - threshold_block);
> - /*
> - * Don't pass in an lsn greater than the lsn of the last
> - * log record known to be on disk. Use a snapshot of the last sync lsn
> - * so that it doesn't change between the compare and the set.
> - */
> - last_sync_lsn = atomic64_read(&log->l_last_sync_lsn);
> - if (XFS_LSN_CMP(threshold_lsn, last_sync_lsn) > 0)
> - threshold_lsn = last_sync_lsn;
> -
> - return threshold_lsn;
> -}
> -
> -/*
> - * Push the tail of the log if we need to do so to maintain the free log space
> - * thresholds set out by xlog_grant_push_threshold. We may need to adopt a
> - * policy which pushes on an lsn which is further along in the log once we
> - * reach the high water mark. In this manner, we would be creating a low water
> - * mark.
> - */
> -STATIC void
> -xlog_grant_push_ail(
> - struct xlog *log,
> - int need_bytes)
> -{
> - xfs_lsn_t threshold_lsn;
> -
> - threshold_lsn = xlog_grant_push_threshold(log, need_bytes);
> - if (threshold_lsn == NULLCOMMITLSN || xlog_is_shutdown(log))
> - return;
> -
> - /*
> - * Get the transaction layer to kick the dirty buffers out to
> - * disk asynchronously. No point in trying to do this if
> - * the filesystem is shutting down.
> - */
> - xfs_ail_push(log->l_ailp, threshold_lsn);
> -}
> -
> /*
> * Stamp cycle number in every block
> */
> @@ -2712,7 +2588,6 @@ xlog_state_set_callback(
> return;
>
> atomic64_set(&log->l_last_sync_lsn, header_lsn);
> - xlog_grant_push_ail(log, 0);
> }
>
> /*
> diff --git a/fs/xfs/xfs_log.h b/fs/xfs/xfs_log.h
> index d69acf881153d0..67c539cc9305c7 100644
> --- a/fs/xfs/xfs_log.h
> +++ b/fs/xfs/xfs_log.h
> @@ -156,7 +156,6 @@ int xfs_log_quiesce(struct xfs_mount *mp);
> void xfs_log_clean(struct xfs_mount *mp);
> bool xfs_log_check_lsn(struct xfs_mount *, xfs_lsn_t);
>
> -xfs_lsn_t xlog_grant_push_threshold(struct xlog *log, int need_bytes);
> bool xlog_force_shutdown(struct xlog *log, uint32_t shutdown_flags);
>
> int xfs_attr_use_log_assist(struct xfs_mount *mp);
> diff --git a/fs/xfs/xfs_log_priv.h b/fs/xfs/xfs_log_priv.h
> index 40e22ec0fbe69a..0482b11965e248 100644
> --- a/fs/xfs/xfs_log_priv.h
> +++ b/fs/xfs/xfs_log_priv.h
> @@ -575,6 +575,8 @@ xlog_assign_grant_head(atomic64_t *head, int cycle, int space)
> atomic64_set(head, xlog_assign_grant_head_val(cycle, space));
> }
>
> +int xlog_space_left(struct xlog *log, atomic64_t *head);
> +
> /*
> * Committed Item List interfaces
> */
> diff --git a/fs/xfs/xfs_trans_ail.c b/fs/xfs/xfs_trans_ail.c
> index e4c343096f95a3..a6b6fca1d13852 100644
> --- a/fs/xfs/xfs_trans_ail.c
> +++ b/fs/xfs/xfs_trans_ail.c
> @@ -134,25 +134,6 @@ xfs_ail_min_lsn(
> return lsn;
> }
>
> -/*
> - * Return the maximum lsn held in the AIL, or zero if the AIL is empty.
> - */
> -static xfs_lsn_t
> -xfs_ail_max_lsn(
> - struct xfs_ail *ailp)
> -{
> - xfs_lsn_t lsn = 0;
> - struct xfs_log_item *lip;
> -
> - spin_lock(&ailp->ail_lock);
> - lip = xfs_ail_max(ailp);
> - if (lip)
> - lsn = lip->li_lsn;
> - spin_unlock(&ailp->ail_lock);
> -
> - return lsn;
> -}
> -
> /*
> * The cursor keeps track of where our current traversal is up to by tracking
> * the next item in the list for us. However, for this to be safe, removing an
> @@ -414,6 +395,56 @@ xfsaild_push_item(
> return lip->li_ops->iop_push(lip, &ailp->ail_buf_list);
> }
>
> +/*
> + * Compute the LSN that we'd need to push the log tail towards in order to have
> + * at least 25% of the log space free. If the log free space already meets this
> + * threshold, this function returns NULLCOMMITLSN.
> + */
> +xfs_lsn_t
> +__xfs_ail_push_target(
> + struct xfs_ail *ailp)
> +{
> + struct xlog *log = ailp->ail_log;
> + xfs_lsn_t threshold_lsn = 0;
> + xfs_lsn_t last_sync_lsn;
> + int free_blocks;
> + int free_bytes;
> + int threshold_block;
> + int threshold_cycle;
> + int free_threshold;
> +
> + free_bytes = xlog_space_left(log, &log->l_reserve_head.grant);
> + free_blocks = BTOBBT(free_bytes);
> +
> + /*
> + * The threshold for the minimum number of free blocks is one quarter of
> + * the entire log space.
> + */
> + free_threshold = log->l_logBBsize >> 2;
> + if (free_blocks >= free_threshold)
> + return NULLCOMMITLSN;
> +
> + xlog_crack_atomic_lsn(&log->l_tail_lsn, &threshold_cycle,
> + &threshold_block);
> + threshold_block += free_threshold;
> + if (threshold_block >= log->l_logBBsize) {
> + threshold_block -= log->l_logBBsize;
> + threshold_cycle += 1;
> + }
> + threshold_lsn = xlog_assign_lsn(threshold_cycle,
> + threshold_block);
> + /*
> + * Don't pass in an lsn greater than the lsn of the last
> + * log record known to be on disk. Use a snapshot of the last sync lsn
> + * so that it doesn't change between the compare and the set.
> + */
> + last_sync_lsn = atomic64_read(&log->l_last_sync_lsn);
> + if (XFS_LSN_CMP(threshold_lsn, last_sync_lsn) > 0)
> + threshold_lsn = last_sync_lsn;
> +
> + return threshold_lsn;
> +}
> +
> static long
> xfsaild_push(
> struct xfs_ail *ailp)
> @@ -454,21 +485,24 @@ xfsaild_push(
> * capture updates that occur after the sync push waiter has gone to
> * sleep.
> */
> - if (waitqueue_active(&ailp->ail_empty)) {
> + if (test_bit(XFS_AIL_OPSTATE_PUSH_ALL, &ailp->ail_opstate) ||
> + waitqueue_active(&ailp->ail_empty)) {
> lip = xfs_ail_max(ailp);
> if (lip)
> target = lip->li_lsn;
> + else
> + clear_bit(XFS_AIL_OPSTATE_PUSH_ALL, &ailp->ail_opstate);
> } else {
> - /* barrier matches the ail_target update in xfs_ail_push() */
> - smp_rmb();
> - target = ailp->ail_target;
> - ailp->ail_target_prev = target;
> + target = __xfs_ail_push_target(ailp);
> }
>
> + if (target == NULLCOMMITLSN)
> + goto out_done;
> +
> /* we're done if the AIL is empty or our push has reached the end */
> lip = xfs_trans_ail_cursor_first(ailp, &cur, ailp->ail_last_pushed_lsn);
> if (!lip)
> - goto out_done;
> + goto out_done_cursor;
>
> XFS_STATS_INC(mp, xs_push_ail);
>
> @@ -553,8 +587,9 @@ xfsaild_push(
> lsn = lip->li_lsn;
> }
>
> -out_done:
> +out_done_cursor:
> xfs_trans_ail_cursor_done(&cur);
> +out_done:
> spin_unlock(&ailp->ail_lock);
>
> if (xfs_buf_delwri_submit_nowait(&ailp->ail_buf_list))
> @@ -603,7 +638,7 @@ xfsaild(
> set_freezable();
>
> while (1) {
> - if (tout && tout <= 20)
> + if (tout)
> set_current_state(TASK_KILLABLE|TASK_FREEZABLE);
> else
> set_current_state(TASK_INTERRUPTIBLE|TASK_FREEZABLE);
> @@ -639,21 +674,9 @@ xfsaild(
> break;
> }
>
> + /* Idle if the AIL is empty. */
> spin_lock(&ailp->ail_lock);
> -
> - /*
> - * Idle if the AIL is empty and we are not racing with a target
> - * update. We check the AIL after we set the task to a sleep
> - * state to guarantee that we either catch an ail_target update
> - * or that a wake_up resets the state to TASK_RUNNING.
> - * Otherwise, we run the risk of sleeping indefinitely.
> - *
> - * The barrier matches the ail_target update in xfs_ail_push().
> - */
> - smp_rmb();
> - if (!xfs_ail_min(ailp) &&
> - ailp->ail_target == ailp->ail_target_prev &&
> - list_empty(&ailp->ail_buf_list)) {
> + if (!xfs_ail_min(ailp) && list_empty(&ailp->ail_buf_list)) {
> spin_unlock(&ailp->ail_lock);
> schedule();
> tout = 0;
> @@ -675,56 +698,6 @@ xfsaild(
> return 0;
> }
>
> -/*
> - * This routine is called to move the tail of the AIL forward. It does this by
> - * trying to flush items in the AIL whose lsns are below the given
> - * threshold_lsn.
> - *
> - * The push is run asynchronously in a workqueue, which means the caller needs
> - * to handle waiting on the async flush for space to become available.
> - * We don't want to interrupt any push that is in progress, hence we only queue
> - * work if we set the pushing bit appropriately.
> - *
> - * We do this unlocked - we only need to know whether there is anything in the
> - * AIL at the time we are called. We don't need to access the contents of
> - * any of the objects, so the lock is not needed.
> - */
> -void
> -xfs_ail_push(
> - struct xfs_ail *ailp,
> - xfs_lsn_t threshold_lsn)
> -{
> - struct xfs_log_item *lip;
> -
> - lip = xfs_ail_min(ailp);
> - if (!lip || xlog_is_shutdown(ailp->ail_log) ||
> - XFS_LSN_CMP(threshold_lsn, ailp->ail_target) <= 0)
> - return;
> -
> - /*
> - * Ensure that the new target is noticed in push code before it clears
> - * the XFS_AIL_PUSHING_BIT.
> - */
> - smp_wmb();
> - xfs_trans_ail_copy_lsn(ailp, &ailp->ail_target, &threshold_lsn);
> - smp_wmb();
> -
> - wake_up_process(ailp->ail_task);
> -}
> -
> -/*
> - * Push out all items in the AIL immediately
> - */
> -void
> -xfs_ail_push_all(
> - struct xfs_ail *ailp)
> -{
> - xfs_lsn_t threshold_lsn = xfs_ail_max_lsn(ailp);
> -
> - if (threshold_lsn)
> - xfs_ail_push(ailp, threshold_lsn);
> -}
> -
> /*
> * Push out all items in the AIL immediately and wait until the AIL is empty.
> */
> @@ -829,6 +802,13 @@ xfs_trans_ail_update_bulk(
> if (!list_empty(&tmp))
> xfs_ail_splice(ailp, cur, &tmp, lsn);
>
> + /*
> + * If this is the first insert, wake up the push daemon so it can
> + * actively scan for items to push.
> + */
> + if (!mlip)
> + wake_up_process(ailp->ail_task);
> +
> xfs_ail_update_finish(ailp, tail_lsn);
> }
>
> diff --git a/fs/xfs/xfs_trans_priv.h b/fs/xfs/xfs_trans_priv.h
> index 52a45f0a5ef173..9a131e7fae9467 100644
> --- a/fs/xfs/xfs_trans_priv.h
> +++ b/fs/xfs/xfs_trans_priv.h
> @@ -52,16 +52,18 @@ struct xfs_ail {
> struct xlog *ail_log;
> struct task_struct *ail_task;
> struct list_head ail_head;
> - xfs_lsn_t ail_target;
> - xfs_lsn_t ail_target_prev;
> struct list_head ail_cursors;
> spinlock_t ail_lock;
> xfs_lsn_t ail_last_pushed_lsn;
> int ail_log_flush;
> + unsigned long ail_opstate;
> struct list_head ail_buf_list;
> wait_queue_head_t ail_empty;
> };
>
> +/* Push all items out of the AIL immediately. */
> +#define XFS_AIL_OPSTATE_PUSH_ALL 0u
> +
> /*
> * From xfs_trans_ail.c
> */
> @@ -98,10 +100,29 @@ void xfs_ail_update_finish(struct xfs_ail *ailp, xfs_lsn_t old_lsn)
> __releases(ailp->ail_lock);
> void xfs_trans_ail_delete(struct xfs_log_item *lip, int shutdown_type);
>
> -void xfs_ail_push(struct xfs_ail *, xfs_lsn_t);
> -void xfs_ail_push_all(struct xfs_ail *);
> -void xfs_ail_push_all_sync(struct xfs_ail *);
> -struct xfs_log_item *xfs_ail_min(struct xfs_ail *ailp);
> +static inline void xfs_ail_push(struct xfs_ail *ailp)
> +{
> + wake_up_process(ailp->ail_task);
> +}
> +
> +static inline void xfs_ail_push_all(struct xfs_ail *ailp)
> +{
> + if (!test_and_set_bit(XFS_AIL_OPSTATE_PUSH_ALL, &ailp->ail_opstate))
> + xfs_ail_push(ailp);
> +}
> +
> +xfs_lsn_t __xfs_ail_push_target(struct xfs_ail *ailp);
> +static inline xfs_lsn_t xfs_ail_push_target(struct xfs_ail *ailp)
> +{
> + xfs_lsn_t lsn;
> +
> + spin_lock(&ailp->ail_lock);
> + lsn = __xfs_ail_push_target(ailp);
> + spin_unlock(&ailp->ail_lock);
> + return lsn;
> +}
> +
> +void xfs_ail_push_all_sync(struct xfs_ail *ailp);
> xfs_lsn_t xfs_ail_min_lsn(struct xfs_ail *ailp);
>
> struct xfs_log_item * xfs_trans_ail_cursor_first(struct xfs_ail *ailp,
> --
> 2.43.0
>
>
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH 04/11] xfs: background AIL push should target physical space
2024-06-20 7:21 ` [PATCH 04/11] xfs: background AIL push should target physical space Christoph Hellwig
@ 2024-06-20 19:42 ` Darrick J. Wong
2024-06-21 5:37 ` Christoph Hellwig
0 siblings, 1 reply; 24+ messages in thread
From: Darrick J. Wong @ 2024-06-20 19:42 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: Chandan Babu R, Dave Chinner, linux-xfs, Dave Chinner
On Thu, Jun 20, 2024 at 09:21:21AM +0200, Christoph Hellwig wrote:
> From: Dave Chinner <dchinner@redhat.com>
>
> Currently the AIL attempts to keep 25% of the "log space" free,
> where the current used space is tracked by the reserve grant head.
> That is, it tracks both physical space used plus the amount reserved
> by transactions in progress.
>
> When we start tail pushing, we are trying to make space for new
> reservations by writing back older metadata and the log is generally
> physically full of dirty metadata, and reservations for modifications
> in flight take up whatever space the AIL can physically free up.
>
> Hence we don't really need to take into account the reservation
> space that has been used - we just need to keep the log tail moving
> as fast as we can to free up space for more reservations to be made.
> We know exactly how much physical space the journal is consuming in
> the AIL (i.e. max LSN - min LSN) so we can base push thresholds
> directly on this state rather than have to look at grant head
> reservations to determine how much to physically push out of the
> log.
Hmm. Right up to here, this looks like
https://lore.kernel.org/linux-xfs/20230921014844.582667-4-david@fromorbit.com/
which I previously reviewed. But this next part I've never seen before:
> This also allows code that needs to know if log items in the current
> transaction need to be pushed or re-logged to simply sample the
> current target - they don't need to calculate the current target
> themselves. This avoids the need for any locking when doing such
> checks.
Ok, so I guess now the AIL kthread maintains the push target in
ailp->ail_target all of the time?
> Further, moving to a physical target means we don't need the "push
> all until empty" semantics introduced in the previous patch.
> We can now test and clear the "push all" as a one-shot command to
> set the target to the current head of the AIL. This allows the
> xfsaild to maximise the use of log space right up to the point where
> conditions indicate that the xfsaild is not keeping up with load and
> it needs to work harder, and as soon as those constraints go away
> (i.e. external code no longer needs everything pushed) the xfsaild
> will return to maintaining the normal 25% free space thresholds.
I think this makes sense to me, but I'm wondering why I've never seen
it on the list before. This is a later revision of the earlier posting,
right?
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
> fs/xfs/libxfs/xfs_defer.c | 2 +-
> fs/xfs/xfs_log_priv.h | 18 ++++++
> fs/xfs/xfs_trans_ail.c | 116 +++++++++++++++++++-------------------
> fs/xfs/xfs_trans_priv.h | 11 +---
> 4 files changed, 80 insertions(+), 67 deletions(-)
>
> diff --git a/fs/xfs/libxfs/xfs_defer.c b/fs/xfs/libxfs/xfs_defer.c
> index e2c8308d518b56..40021849b42f0a 100644
> --- a/fs/xfs/libxfs/xfs_defer.c
> +++ b/fs/xfs/libxfs/xfs_defer.c
> @@ -558,7 +558,7 @@ xfs_defer_relog(
> * the log threshold once per call.
> */
> if (threshold_lsn == NULLCOMMITLSN) {
> - threshold_lsn = xfs_ail_push_target(log->l_ailp);
> + threshold_lsn = xfs_ail_get_push_target(log->l_ailp);
Hmm, ok, so now we do a quick read of the AIL's push target to decide if
we need to relog this intent item, instead of taking locks and computing
the push target again. Just like what it says on the tin.
Why is this second part (actually moving to physical space) not a
separate patch from the old patch that simply dropped the grant head
from the computation?
--D
> if (threshold_lsn == NULLCOMMITLSN)
> break;
> }
> diff --git a/fs/xfs/xfs_log_priv.h b/fs/xfs/xfs_log_priv.h
> index 0482b11965e248..971871b84d8436 100644
> --- a/fs/xfs/xfs_log_priv.h
> +++ b/fs/xfs/xfs_log_priv.h
> @@ -625,6 +625,24 @@ xlog_wait(
> int xlog_wait_on_iclog(struct xlog_in_core *iclog)
> __releases(iclog->ic_log->l_icloglock);
>
> +/* Calculate the distance between two LSNs in bytes */
> +static inline uint64_t
> +xlog_lsn_sub(
> + struct xlog *log,
> + xfs_lsn_t high,
> + xfs_lsn_t low)
> +{
> + uint32_t hi_cycle = CYCLE_LSN(high);
> + uint32_t hi_block = BLOCK_LSN(high);
> + uint32_t lo_cycle = CYCLE_LSN(low);
> + uint32_t lo_block = BLOCK_LSN(low);
> +
> + if (hi_cycle == lo_cycle)
> + return BBTOB(hi_block - lo_block);
> + ASSERT((hi_cycle == lo_cycle + 1) || xlog_is_shutdown(log));
> + return (uint64_t)log->l_logsize - BBTOB(lo_block - hi_block);
> +}
> +
> /*
> * The LSN is valid so long as it is behind the current LSN. If it isn't, this
> * means that the next log record that includes this metadata could have a
> diff --git a/fs/xfs/xfs_trans_ail.c b/fs/xfs/xfs_trans_ail.c
> index a6b6fca1d13852..26d4d9b3e35789 100644
> --- a/fs/xfs/xfs_trans_ail.c
> +++ b/fs/xfs/xfs_trans_ail.c
> @@ -398,51 +398,69 @@ xfsaild_push_item(
> /*
> * Compute the LSN that we'd need to push the log tail towards in order to have
> * at least 25% of the log space free. If the log free space already meets this
> - * threshold, this function returns NULLCOMMITLSN.
> + * threshold, this function returns the lowest LSN in the AIL to slowly keep
> + * writeback ticking over and the tail of the log moving forward.
> */
> -xfs_lsn_t
> -__xfs_ail_push_target(
> +static xfs_lsn_t
> +xfs_ail_calc_push_target(
> struct xfs_ail *ailp)
> {
> - struct xlog *log = ailp->ail_log;
> - xfs_lsn_t threshold_lsn = 0;
> - xfs_lsn_t last_sync_lsn;
> - int free_blocks;
> - int free_bytes;
> - int threshold_block;
> - int threshold_cycle;
> - int free_threshold;
> -
> - free_bytes = xlog_space_left(log, &log->l_reserve_head.grant);
> - free_blocks = BTOBBT(free_bytes);
> + struct xlog *log = ailp->ail_log;
> + struct xfs_log_item *lip;
> + xfs_lsn_t target_lsn;
> + xfs_lsn_t max_lsn;
> + xfs_lsn_t min_lsn;
> + int32_t free_bytes;
> + uint32_t target_block;
> + uint32_t target_cycle;
> +
> + lockdep_assert_held(&ailp->ail_lock);
> +
> + lip = xfs_ail_max(ailp);
> + if (!lip)
> + return NULLCOMMITLSN;
> +
> + max_lsn = lip->li_lsn;
> + min_lsn = __xfs_ail_min_lsn(ailp);
>
> /*
> - * The threshold for the minimum number of free blocks is one quarter of
> - * the entire log space.
> + * If we are supposed to push all the items in the AIL, we want to push
> + * to the current head. We then clear the push flag so that we don't
> + * keep pushing newly queued items beyond where the push all command was
> + * run. If the push waiter wants to empty the ail, it should queue
> + * itself on the ail_empty wait queue.
> */
> - free_threshold = log->l_logBBsize >> 2;
> - if (free_blocks >= free_threshold)
> - return NULLCOMMITLSN;
> + if (test_and_clear_bit(XFS_AIL_OPSTATE_PUSH_ALL, &ailp->ail_opstate))
> + return max_lsn;
> +
> + /* If someone wants the AIL empty, keep pushing everything we have. */
> + if (waitqueue_active(&ailp->ail_empty))
> + return max_lsn;
>
> - xlog_crack_atomic_lsn(&log->l_tail_lsn, &threshold_cycle,
> - &threshold_block);
> - threshold_block += free_threshold;
> - if (threshold_block >= log->l_logBBsize) {
> - threshold_block -= log->l_logBBsize;
> - threshold_cycle += 1;
> - }
> - threshold_lsn = xlog_assign_lsn(threshold_cycle,
> - threshold_block);
> /*
> - * Don't pass in an lsn greater than the lsn of the last
> - * log record known to be on disk. Use a snapshot of the last sync lsn
> - * so that it doesn't change between the compare and the set.
> + * Background pushing - attempt to keep 25% of the log free and if we
> + * have that much free retain the existing target.
> */
> - last_sync_lsn = atomic64_read(&log->l_last_sync_lsn);
> - if (XFS_LSN_CMP(threshold_lsn, last_sync_lsn) > 0)
> - threshold_lsn = last_sync_lsn;
> + free_bytes = log->l_logsize - xlog_lsn_sub(log, max_lsn, min_lsn);
> + if (free_bytes >= log->l_logsize >> 2)
> + return ailp->ail_target;
> +
> + target_cycle = CYCLE_LSN(min_lsn);
> + target_block = BLOCK_LSN(min_lsn) + (log->l_logBBsize >> 2);
> + if (target_block >= log->l_logBBsize) {
> + target_block -= log->l_logBBsize;
> + target_cycle += 1;
> + }
> + target_lsn = xlog_assign_lsn(target_cycle, target_block);
> +
> + /* Cap the target to the highest LSN known to be in the AIL. */
> + if (XFS_LSN_CMP(target_lsn, max_lsn) > 0)
> + return max_lsn;
>
> - return threshold_lsn;
> + /* If the existing target is higher than the new target, keep it. */
> + if (XFS_LSN_CMP(ailp->ail_target, target_lsn) >= 0)
> + return ailp->ail_target;
> + return target_lsn;
> }
>
> static long
> @@ -453,7 +471,6 @@ xfsaild_push(
> struct xfs_ail_cursor cur;
> struct xfs_log_item *lip;
> xfs_lsn_t lsn;
> - xfs_lsn_t target = NULLCOMMITLSN;
> long tout;
> int stuck = 0;
> int flushing = 0;
> @@ -478,25 +495,8 @@ xfsaild_push(
> }
>
> spin_lock(&ailp->ail_lock);
> -
> - /*
> - * If we have a sync push waiter, we always have to push till the AIL is
> - * empty. Update the target to point to the end of the AIL so that
> - * capture updates that occur after the sync push waiter has gone to
> - * sleep.
> - */
> - if (test_bit(XFS_AIL_OPSTATE_PUSH_ALL, &ailp->ail_opstate) ||
> - waitqueue_active(&ailp->ail_empty)) {
> - lip = xfs_ail_max(ailp);
> - if (lip)
> - target = lip->li_lsn;
> - else
> - clear_bit(XFS_AIL_OPSTATE_PUSH_ALL, &ailp->ail_opstate);
> - } else {
> - target = __xfs_ail_push_target(ailp);
> - }
> -
> - if (target == NULLCOMMITLSN)
> + WRITE_ONCE(ailp->ail_target, xfs_ail_calc_push_target(ailp));
> + if (ailp->ail_target == NULLCOMMITLSN)
> goto out_done;
>
> /* we're done if the AIL is empty or our push has reached the end */
> @@ -506,10 +506,10 @@ xfsaild_push(
>
> XFS_STATS_INC(mp, xs_push_ail);
>
> - ASSERT(target != NULLCOMMITLSN);
> + ASSERT(ailp->ail_target != NULLCOMMITLSN);
>
> lsn = lip->li_lsn;
> - while ((XFS_LSN_CMP(lip->li_lsn, target) <= 0)) {
> + while ((XFS_LSN_CMP(lip->li_lsn, ailp->ail_target) <= 0)) {
> int lock_result;
>
> /*
> @@ -595,7 +595,7 @@ xfsaild_push(
> if (xfs_buf_delwri_submit_nowait(&ailp->ail_buf_list))
> ailp->ail_log_flush++;
>
> - if (!count || XFS_LSN_CMP(lsn, target) >= 0) {
> + if (!count || XFS_LSN_CMP(lsn, ailp->ail_target) >= 0) {
> /*
> * We reached the target or the AIL is empty, so wait a bit
> * longer for I/O to complete and remove pushed items from the
> diff --git a/fs/xfs/xfs_trans_priv.h b/fs/xfs/xfs_trans_priv.h
> index 9a131e7fae9467..60b4707c3a6583 100644
> --- a/fs/xfs/xfs_trans_priv.h
> +++ b/fs/xfs/xfs_trans_priv.h
> @@ -59,6 +59,7 @@ struct xfs_ail {
> unsigned long ail_opstate;
> struct list_head ail_buf_list;
> wait_queue_head_t ail_empty;
> + xfs_lsn_t ail_target;
> };
>
> /* Push all items out of the AIL immediately. */
> @@ -111,15 +112,9 @@ static inline void xfs_ail_push_all(struct xfs_ail *ailp)
> xfs_ail_push(ailp);
> }
>
> -xfs_lsn_t __xfs_ail_push_target(struct xfs_ail *ailp);
> -static inline xfs_lsn_t xfs_ail_push_target(struct xfs_ail *ailp)
> +static inline xfs_lsn_t xfs_ail_get_push_target(struct xfs_ail *ailp)
> {
> - xfs_lsn_t lsn;
> -
> - spin_lock(&ailp->ail_lock);
> - lsn = __xfs_ail_push_target(ailp);
> - spin_unlock(&ailp->ail_lock);
> - return lsn;
> + return READ_ONCE(ailp->ail_target);
> }
>
> void xfs_ail_push_all_sync(struct xfs_ail *ailp);
> --
> 2.43.0
>
>
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH 11/11] xfs: skip flushing log items during push
2024-06-20 7:21 ` [PATCH 11/11] xfs: skip flushing log items during push Christoph Hellwig
@ 2024-06-20 19:51 ` Darrick J. Wong
2024-06-21 5:48 ` Christoph Hellwig
0 siblings, 1 reply; 24+ messages in thread
From: Darrick J. Wong @ 2024-06-20 19:51 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: Chandan Babu R, Dave Chinner, linux-xfs, Dave Chinner
On Thu, Jun 20, 2024 at 09:21:28AM +0200, Christoph Hellwig wrote:
> From: Dave Chinner <dchinner@redhat.com>
>
> The AIL pushing code spends a huge amount of time skipping over
> items that are already marked as flushing. It is not uncommon to
> see hundreds of thousands of items skipped every second due to inode
> clustering marking all the inodes in a cluster as flushing when the
> first one is flushed.
>
> However, to discover an item is already flushing and should be
> skipped we have to call the iop_push() method for it to try to flush
> the item. For inodes (where this matters most), we first have to
> check that the inode is flushable.
>
> We can optimise this overhead away by tracking whether the log item
> is flushing internally. This allows xfsaild_push() to check the log
> item directly for flushing state and immediately skip the log item.
> Whilst this doesn't remove the CPU cache misses for loading the log
> item, it does avoid the overhead of an indirect function call
> and the cache misses involved in accessing inode and
> backing cluster buffer structures to determine flushing state. When
> trying to flush hundreds of thousands of inodes each second, this
> CPU overhead saving adds up quickly.
>
> It's so noticeable that the biggest issue with pushing on the AIL on
> fast storage becomes the 10ms back-off wait when we hit enough
> pinned buffers to break out of the push loop but not enough for the
> AIL pushing to be considered stuck. This limits the xfsaild to about
> 70% total CPU usage, and on fast storage this isn't enough to keep
> the storage 100% busy.
>
> The xfsaild will block on IO submission on slow storage and so is
> self throttling - it does not need a backoff in the case where we
> are really just breaking out of the walk to submit the IO we have
> gathered.
>
> Further with no backoff we don't need to gather huge delwri lists to
> mitigate the impact of backoffs, so we can submit IO more frequently
> and reduce the time log items spend in flushing state by breaking
> out of the item push loop once we've gathered enough IO to batch
> submission effectively.
Is that what the new count > 1000 branch does?
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
> fs/xfs/xfs_inode.c | 1 +
> fs/xfs/xfs_inode_item.c | 6 +++++-
Does it make sense to do this for buffer or dquot items too?
Modulo my questions, the rest of the changes in this series look sensible.
--D
> fs/xfs/xfs_trans.h | 4 +++-
> fs/xfs/xfs_trans_ail.c | 8 +++++++-
> 4 files changed, 16 insertions(+), 3 deletions(-)
>
> diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
> index 58fb7a5062e1e6..da611ec56f1be0 100644
> --- a/fs/xfs/xfs_inode.c
> +++ b/fs/xfs/xfs_inode.c
> @@ -3713,6 +3713,7 @@ xfs_iflush(
> iip->ili_last_fields = iip->ili_fields;
> iip->ili_fields = 0;
> iip->ili_fsync_fields = 0;
> + set_bit(XFS_LI_FLUSHING, &iip->ili_item.li_flags);
> spin_unlock(&iip->ili_lock);
>
> /*
> diff --git a/fs/xfs/xfs_inode_item.c b/fs/xfs/xfs_inode_item.c
> index f28d653300d124..ba49e56820f0a7 100644
> --- a/fs/xfs/xfs_inode_item.c
> +++ b/fs/xfs/xfs_inode_item.c
> @@ -933,6 +933,7 @@ xfs_iflush_finish(
> }
> iip->ili_last_fields = 0;
> iip->ili_flush_lsn = 0;
> + clear_bit(XFS_LI_FLUSHING, &lip->li_flags);
> spin_unlock(&iip->ili_lock);
> xfs_iflags_clear(iip->ili_inode, XFS_IFLUSHING);
> if (drop_buffer)
> @@ -991,8 +992,10 @@ xfs_buf_inode_io_fail(
> {
> struct xfs_log_item *lip;
>
> - list_for_each_entry(lip, &bp->b_li_list, li_bio_list)
> + list_for_each_entry(lip, &bp->b_li_list, li_bio_list) {
> set_bit(XFS_LI_FAILED, &lip->li_flags);
> + clear_bit(XFS_LI_FLUSHING, &lip->li_flags);
> + }
> }
>
> /*
> @@ -1011,6 +1014,7 @@ xfs_iflush_abort_clean(
> iip->ili_flush_lsn = 0;
> iip->ili_item.li_buf = NULL;
> list_del_init(&iip->ili_item.li_bio_list);
> + clear_bit(XFS_LI_FLUSHING, &iip->ili_item.li_flags);
> }
>
> /*
> diff --git a/fs/xfs/xfs_trans.h b/fs/xfs/xfs_trans.h
> index 1636663707dc04..20eb6ea7ebaa04 100644
> --- a/fs/xfs/xfs_trans.h
> +++ b/fs/xfs/xfs_trans.h
> @@ -58,13 +58,15 @@ struct xfs_log_item {
> #define XFS_LI_FAILED 2
> #define XFS_LI_DIRTY 3
> #define XFS_LI_WHITEOUT 4
> +#define XFS_LI_FLUSHING 5
>
> #define XFS_LI_FLAGS \
> { (1u << XFS_LI_IN_AIL), "IN_AIL" }, \
> { (1u << XFS_LI_ABORTED), "ABORTED" }, \
> { (1u << XFS_LI_FAILED), "FAILED" }, \
> { (1u << XFS_LI_DIRTY), "DIRTY" }, \
> - { (1u << XFS_LI_WHITEOUT), "WHITEOUT" }
> + { (1u << XFS_LI_WHITEOUT), "WHITEOUT" }, \
> + { (1u << XFS_LI_FLUSHING), "FLUSHING" }
>
> struct xfs_item_ops {
> unsigned flags;
> diff --git a/fs/xfs/xfs_trans_ail.c b/fs/xfs/xfs_trans_ail.c
> index 6a106a05fae017..0fafcc9f3dbe44 100644
> --- a/fs/xfs/xfs_trans_ail.c
> +++ b/fs/xfs/xfs_trans_ail.c
> @@ -512,6 +512,9 @@ xfsaild_push(
> while ((XFS_LSN_CMP(lip->li_lsn, ailp->ail_target) <= 0)) {
> int lock_result;
>
> + if (test_bit(XFS_LI_FLUSHING, &lip->li_flags))
> + goto next_item;
> +
> /*
> * Note that iop_push may unlock and reacquire the AIL lock. We
> * rely on the AIL cursor implementation to be able to deal with
> @@ -581,9 +584,12 @@ xfsaild_push(
> if (stuck > 100)
> break;
>
> +next_item:
> lip = xfs_trans_ail_cursor_next(ailp, &cur);
> if (lip == NULL)
> break;
> + if (lip->li_lsn != lsn && count > 1000)
> + break;
> lsn = lip->li_lsn;
> }
>
> @@ -620,7 +626,7 @@ xfsaild_push(
> /*
> * Assume we have more work to do in a short while.
> */
> - tout = 10;
> + tout = 0;
> }
>
> return tout;
> --
> 2.43.0
>
>
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH 04/11] xfs: background AIL push should target physical space
2024-06-20 19:42 ` Darrick J. Wong
@ 2024-06-21 5:37 ` Christoph Hellwig
2024-06-21 16:18 ` Darrick J. Wong
0 siblings, 1 reply; 24+ messages in thread
From: Christoph Hellwig @ 2024-06-21 5:37 UTC (permalink / raw)
To: Darrick J. Wong
Cc: Christoph Hellwig, Chandan Babu R, Dave Chinner, linux-xfs,
Dave Chinner
On Thu, Jun 20, 2024 at 12:42:26PM -0700, Darrick J. Wong wrote:
> > transaction need to be pushed or re-logged to simply sample the
> > current target - they don't need to calculate the current target
> > themselves. This avoids the need for any locking when doing such
> > checks.
>
> Ok, so I guess now the AIL kthread maintains the push target in
> ailp->ail_target all of the time?
Yes.
> I think this makes sense to me, but I'm wondering that I've never seen
> it on the list before. This is a later revision to the earlier posting,
> right?
I started dusting off the last posted version and then Dave gave me
his latest working copy, which hadn't been posted so far.
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH 11/11] xfs: skip flushing log items during push
2024-06-20 19:51 ` Darrick J. Wong
@ 2024-06-21 5:48 ` Christoph Hellwig
2024-06-21 17:46 ` Darrick J. Wong
0 siblings, 1 reply; 24+ messages in thread
From: Christoph Hellwig @ 2024-06-21 5:48 UTC (permalink / raw)
To: Darrick J. Wong
Cc: Christoph Hellwig, Chandan Babu R, Dave Chinner, linux-xfs,
Dave Chinner
On Thu, Jun 20, 2024 at 12:51:42PM -0700, Darrick J. Wong wrote:
> > Further with no backoff we don't need to gather huge delwri lists to
> > mitigate the impact of backoffs, so we can submit IO more frequently
> > and reduce the time log items spend in flushing state by breaking
> > out of the item push loop once we've gathered enough IO to batch
> > submission effectively.
>
> Is that what the new count > 1000 branch does?
That's my interpretation anyway. I'll let Dave chime in if he disagrees.
>
> > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > ---
> > fs/xfs/xfs_inode.c | 1 +
> > fs/xfs/xfs_inode_item.c | 6 +++++-
>
> Does it make sense to do this for buffer or dquot items too?
Not having written this, here are my 2 unqualified cents:
For dquots it looks like it could be easily ported over, but I guess no
one has been bothering with dquot performance work for a while as it's
also missing a bunch of other things we did to the inode. But given that
according to Dave's commit log the inode cluster flushing is a big part
of this, dquots probably aren't as affected anyway as we flush them
individually (and there generally are a lot fewer dquot items in the AIL
anyway).
For buf items the buffers are queued up on the on-stack delwri list
and written when we flush them. So we won't ever find already
flushing items.
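For reference, the hunk Darrick asked about, with that batching reading
spelled out (the comments here illustrate the interpretation above and
are not part of the patch):

	/*
	 * Once ~1000 items have been pushed, stop gathering - but only
	 * at an LSN boundary, so a run of items logged at the same LSN
	 * is not split across two delwri submissions. With tout now 0,
	 * xfsaild comes straight back for the next batch rather than
	 * sleeping for 10ms.
	 */
	if (lip->li_lsn != lsn && count > 1000)
		break;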
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH 04/11] xfs: background AIL push should target physical space
2024-06-21 5:37 ` Christoph Hellwig
@ 2024-06-21 16:18 ` Darrick J. Wong
0 siblings, 0 replies; 24+ messages in thread
From: Darrick J. Wong @ 2024-06-21 16:18 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: Chandan Babu R, Dave Chinner, linux-xfs, Dave Chinner
On Fri, Jun 21, 2024 at 07:37:26AM +0200, Christoph Hellwig wrote:
> On Thu, Jun 20, 2024 at 12:42:26PM -0700, Darrick J. Wong wrote:
> > > transaction need to be pushed or re-logged to simply sample the
> > > current target - they don't need to calculate the current target
> > > themselves. This avoids the need for any locking when doing such
> > > checks.
> >
> > Ok, so I guess now the AIL kthread maintains the push target in
> > ailp->ail_target all of the time?
>
> Yes.
>
> > I think this makes sense to me, but I'm wondering that I've never seen
> > it on the list before. This is a later revision to the earlier posting,
> > right?
>
> I started dusting off the last posted version and then Dave gave me
> his latest working copy, which hadn't been posted so far.
Ah, ok then. Thanks for answering my question; let's move this along
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
--D
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH 11/11] xfs: skip flushing log items during push
2024-06-21 5:48 ` Christoph Hellwig
@ 2024-06-21 17:46 ` Darrick J. Wong
2024-07-02 18:51 ` Darrick J. Wong
0 siblings, 1 reply; 24+ messages in thread
From: Darrick J. Wong @ 2024-06-21 17:46 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: Chandan Babu R, Dave Chinner, linux-xfs, Dave Chinner
On Fri, Jun 21, 2024 at 07:48:08AM +0200, Christoph Hellwig wrote:
> On Thu, Jun 20, 2024 at 12:51:42PM -0700, Darrick J. Wong wrote:
> > > Further with no backoff we don't need to gather huge delwri lists to
> > > mitigate the impact of backoffs, so we can submit IO more frequently
> > > and reduce the time log items spend in flushing state by breaking
> > > out of the item push loop once we've gathered enough IO to batch
> > > submission effectively.
> >
> > Is that what the new count > 1000 branch does?
>
> That's my interpretation anyway. I'll let Dave chime in if he disagrees.
<nod> I'll await a response on this...
> >
> > > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > > ---
> > > fs/xfs/xfs_inode.c | 1 +
> > > fs/xfs/xfs_inode_item.c | 6 +++++-
> >
> > Does it make sense to do this for buffer or dquot items too?
>
> Not having written this, here are my 2 unqualified cents:
>
> For dquots it looks like it could be easily ported over, but I guess no
> one has been bothering with dquot performance work for a while as it's
> also missing a bunch of other things we did to the inode. But given that
> according to Dave's commit log the inode cluster flushing is a big part
> of this, dquots probably aren't as affected anyway as we flush them
> individually (and there generally are a lot fewer dquot items in the AIL
> anyway).
It probably helps that dquot "clusters" are single fsblocks too.
> For buf items the buffers are queued up on the on-stack delwri list
> and written when we flush them. So we won't ever find already
> flushing items.
Oh right, because only the AIL flushes logged buffers to disk.
--D
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH 10/11] xfs: grant heads track byte counts, not LSNs
2024-06-20 7:21 ` [PATCH 10/11] xfs: grant heads track byte counts, not LSNs Christoph Hellwig
@ 2024-07-01 0:59 ` Dave Chinner
0 siblings, 0 replies; 24+ messages in thread
From: Dave Chinner @ 2024-07-01 0:59 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Chandan Babu R, Darrick J. Wong, linux-xfs, Dave Chinner
On Thu, Jun 20, 2024 at 09:21:27AM +0200, Christoph Hellwig wrote:
> From: Dave Chinner <dchinner@redhat.com>
>
> The grant heads in the log track the space reserved in the log for
> running transactions. They do this by tracking how far ahead of the
> tail that the reservation has reached, and the units for doing this
> are {cycle,bytes} for the reserve head rather than {cycle,blocks}
> which are normally used by LSNs.
>
> This is annoyingly complex because we have to split, crack and
> combine these tuples for any calculation we do to determine log
> space and targets. This is computationally expensive as well as
> difficult to do atomically and locklessly, as well as limiting the
> size of the log to 2^32 bytes.
>
> Really, though, all the grant heads are tracking is how much space
> is currently available for use in the log. We can track this as a
> simple byte count - we just don't care what the actual physical
> location in the log the head and tail are at, just how much space we
> have remaining before the head and tail overlap.
>
> So, convert the grant heads to track the byte reservations that are
> active rather than the current (cycle, offset) tuples. This means an
> empty log has zero bytes consumed, and a full log is when the
> reservations reach the size of the log minus the space consumed by
> the AIL.
>
> This greatly simplifies the accounting and checks for whether there
> is space available. We no longer need to crack or combine LSNs to
> determine how much space the log has left, nor do we need to look at
> the head or tail of the log to determine how close to full we are.
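As an illustration of the accounting described above, here is a minimal
userspace sketch using C11 atomics in place of the kernel atomic64 API;
the helper names are made up for the example and this is not the actual
fs/xfs code:

	#include <stdatomic.h>
	#include <stdint.h>

	/* bytes currently handed out to reservations on this grant head */
	static _Atomic int64_t grant_space;

	static void grant_add_space(int64_t bytes)
	{
		atomic_fetch_add(&grant_space, bytes);
	}

	static void grant_sub_space(int64_t bytes)
	{
		atomic_fetch_sub(&grant_space, bytes);
	}

	/*
	 * Free space is the log size minus the space pinned by the AIL
	 * and the bytes reserved by active transactions - no cycle/offset
	 * cracking needed.
	 */
	static int64_t log_space_left(int64_t log_size, int64_t tail_space)
	{
		int64_t free;

		free = log_size - tail_space - atomic_load(&grant_space);
		return free > 0 ? free : 0;
	}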
>
> There is, however, a complexity that needs to be handled. We know
> how much space is being tracked in the AIL now via log->l_tail_space,
> and the log tickets track active reservations and return the unused
> portions to the grant heads when ungranted. Unfortunately, we don't
> track the used portion of the grant, so when we transfer log items
> from the CIL to the AIL, the space accounted to the grant heads is
> transferred to the log tail space. Hence when we move the AIL head
> forwards on item insert, we have to remove that space from the grant
> heads.
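Building on the sketch above, the transfer described here amounts to
something like the following (again illustrative only, not the real
AIL insert path):

	/*
	 * When the AIL head advances on item insert, the bytes those
	 * items consume become physical tail space, so the same amount
	 * has to come off the grant head to avoid counting it twice.
	 */
	static void ail_head_advance(int64_t *tail_space, int64_t bytes_pinned)
	{
		*tail_space += bytes_pinned;
		grant_sub_space(bytes_pinned);
	}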
>
> We also remove the xlog_verify_grant_tail() debug function as it is
> no longer useful. The check it performs has been racy since delayed
> logging was introduced, and with this change it would clearly only
> detect false positives.
>
> The result of this substantially simpler accounting algorithm is an
> increase in sustained transaction rate from ~1.3 million
> transactions/s to ~1.9 million transactions/s with no increase in
> CPU usage. We also remove the 32 bit space limitation on the grant
> heads, which will allow us to increase the journal size beyond 2GB
> in future.
>
> Note that this renames the sysfs files exposing the log grant space
> now that the values are exported in bytes. This allows xfstests
> to auto-detect the old or new ABI.
>
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
> [hch: move xlog_grant_sub_space out of line,
> update the xlog_grant_{add,sub}_space prototypes,
> rename the sysfs files to allow auto-detection in xfstests]
> Signed-off-by: Christoph Hellwig <hch@lst.de>
Updates look fine.
Reviewed-by: Dave Chinner <dchinner@redhat.com>
--
Dave Chinner
david@fromorbit.com
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH 11/11] xfs: skip flushing log items during push
2024-06-21 17:46 ` Darrick J. Wong
@ 2024-07-02 18:51 ` Darrick J. Wong
2024-07-02 23:29 ` Dave Chinner
0 siblings, 1 reply; 24+ messages in thread
From: Darrick J. Wong @ 2024-07-02 18:51 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: Chandan Babu R, Dave Chinner, linux-xfs, Dave Chinner
On Fri, Jun 21, 2024 at 10:46:45AM -0700, Darrick J. Wong wrote:
> On Fri, Jun 21, 2024 at 07:48:08AM +0200, Christoph Hellwig wrote:
> > On Thu, Jun 20, 2024 at 12:51:42PM -0700, Darrick J. Wong wrote:
> > > > Further with no backoff we don't need to gather huge delwri lists to
> > > > mitigate the impact of backoffs, so we can submit IO more frequently
> > > > and reduce the time log items spend in flushing state by breaking
> > > > out of the item push loop once we've gathered enough IO to batch
> > > > submission effectively.
> > >
> > > Is that what the new count > 1000 branch does?
> >
> > That's my interpretation anyway. I'll let Dave chime in if he disagrees.
>
> <nod> I'll await a response on this...
<shrug> No response after 11 days, I'll not hold this up further over a
minor point.
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
--D
> > >
> > > > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > > > ---
> > > > fs/xfs/xfs_inode.c | 1 +
> > > > fs/xfs/xfs_inode_item.c | 6 +++++-
> > >
> > > Does it make sense to do this for buffer or dquot items too?
> >
> > Not having written this here is my 2 unqualified cents:
> >
> > For dquots it looks like it could be easily ported over, but I guess no
> > one has been bothering with dquot performance work for a while as it's
> > also missing a bunch of other things we did to the inode. But given that
> > according to Dave's commit log the inode cluster flushing is a big part
> > of this, dquots probably aren't as affected anyway as we flush them
> > individually (and there generally are a lot fewer dquot items in the AIL
> > anyway).
>
> It probably helps that dquot "clusters" are single fsblocks too.
>
> > For buf items the buffers are queued up on the on-stack delwri list
> > and written when we flush them. So we won't ever find already
> > flushing items.
>
> Oh right, because only the AIL flushes logged buffers to disk.
>
> --D
>
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH 11/11] xfs: skip flushing log items during push
2024-07-02 18:51 ` Darrick J. Wong
@ 2024-07-02 23:29 ` Dave Chinner
2024-07-03 5:10 ` Darrick J. Wong
0 siblings, 1 reply; 24+ messages in thread
From: Dave Chinner @ 2024-07-02 23:29 UTC (permalink / raw)
To: Darrick J. Wong
Cc: Christoph Hellwig, Chandan Babu R, linux-xfs, Dave Chinner
On Tue, Jul 02, 2024 at 11:51:20AM -0700, Darrick J. Wong wrote:
> On Fri, Jun 21, 2024 at 10:46:45AM -0700, Darrick J. Wong wrote:
> > On Fri, Jun 21, 2024 at 07:48:08AM +0200, Christoph Hellwig wrote:
> > > On Thu, Jun 20, 2024 at 12:51:42PM -0700, Darrick J. Wong wrote:
> > > > > Further with no backoff we don't need to gather huge delwri lists to
> > > > > mitigate the impact of backoffs, so we can submit IO more frequently
> > > > > and reduce the time log items spend in flushing state by breaking
> > > > > out of the item push loop once we've gathered enough IO to batch
> > > > > submission effectively.
> > > >
> > > > Is that what the new count > 1000 branch does?
> > >
> > > That's my interpretation anyway. I'll let Dave chime in if he disagrees.
Yes, that's correct. I didn't finish this patch - I never wrote the
comments in the code to explain this because I don't bother doing
that until I've validated the heuristic and know it mostly works
as desired. I simply hadn't closed the loop.
Please add comments to the code to explain what the magic "1000"
is...
> > <nod> I'll await a response on this...
>
> <shrug> No response after 11 days, I'll not hold this up further over a
> minor point.
I've been on PTO for the last couple of weeks, and I'm still
catching up on email. You could have just pinged me on #xfs asking
if I'd seen this, just like jlayton did about the mgtime stuff last
week. I answered even though I was on PTO. You always used to do
this when you wanted an answer to a question - I'm curious as to why
you have stopped using #xfs to ask questions about code, bugs and
patch reviews.
-Dave.
--
Dave Chinner
david@fromorbit.com
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH 11/11] xfs: skip flushing log items during push
2024-07-02 23:29 ` Dave Chinner
@ 2024-07-03 5:10 ` Darrick J. Wong
0 siblings, 0 replies; 24+ messages in thread
From: Darrick J. Wong @ 2024-07-03 5:10 UTC (permalink / raw)
To: Dave Chinner; +Cc: Christoph Hellwig, Chandan Babu R, linux-xfs, Dave Chinner
On Wed, Jul 03, 2024 at 09:29:47AM +1000, Dave Chinner wrote:
> On Tue, Jul 02, 2024 at 11:51:20AM -0700, Darrick J. Wong wrote:
> > On Fri, Jun 21, 2024 at 10:46:45AM -0700, Darrick J. Wong wrote:
> > > On Fri, Jun 21, 2024 at 07:48:08AM +0200, Christoph Hellwig wrote:
> > > > On Thu, Jun 20, 2024 at 12:51:42PM -0700, Darrick J. Wong wrote:
> > > > > > Further with no backoff we don't need to gather huge delwri lists to
> > > > > > mitigate the impact of backoffs, so we can submit IO more frequently
> > > > > > and reduce the time log items spend in flushing state by breaking
> > > > > > out of the item push loop once we've gathered enough IO to batch
> > > > > > submission effectively.
> > > > >
> > > > > Is that what the new count > 1000 branch does?
> > > >
> > > > That's my interpretation anyway. I'll let Dave chime in if he disagrees.
>
> Yes, that's correct. I didn't finish this patch - I never wrote the
> comments in the code to explain this because I don't bother doing
> that until I've validated the heuristic and know it mostly works
> as desired. I simply hadn't closed the loop.
>
> Please add comments to the code to explain what the magic "1000"
> is...
Something along the lines of:

	/*
	 * Submit IO more frequently and reduce the time log items spend
	 * in flushing state by breaking out of the item push loop once
	 * we've gathered enough IO to batch submission effectively.
	 */
	if (lip->li_lsn != lsn && count > 1000)
		break;
Maybe?
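For anyone following along, here is a self-contained sketch of the
shape of such a push loop; the types and helpers are made up for
illustration and this is not the actual xfsaild code:

	#include <stddef.h>
	#include <stdint.h>

	struct log_item {
		uint64_t	li_lsn;
		struct log_item	*li_next;
	};

	/* hypothetical stand-ins for the AIL cursor walk and delwri batching */
	static struct log_item *next_ail_item(struct log_item **cur)
	{
		struct log_item *lip = *cur;

		if (lip)
			*cur = lip->li_next;
		return lip;
	}

	static void push_item(struct log_item *lip) { (void)lip; }
	static void submit_batch(void) { }

	/*
	 * Push items in LSN order and stop early once a large batch has
	 * been gathered, but only at an LSN boundary so that items
	 * sharing an LSN are not split across submissions.
	 */
	static void push_loop(struct log_item *head)
	{
		struct log_item *cur = head;
		struct log_item *lip;
		uint64_t lsn = 0;
		int count = 0;

		while ((lip = next_ail_item(&cur)) != NULL) {
			push_item(lip);
			count++;

			if (lip->li_lsn != lsn && count > 1000)
				break;
			lsn = lip->li_lsn;
		}
		submit_batch();
	}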
> > > <nod> I'll await a response on this...
> >
> > <shrug> No response after 11 days, I'll not hold this up further over a
> > minor point.
>
> I've been on PTO for the last couple of weeks, and I'm still
> catching up on email. You could have just pinged me on #xfs asking
> if I'd seen this, just like jlayton did about the mgtime stuff last
> week. I answered even though I was on PTO. You always used to do
> this when you wanted an answer to a question - I'm curious as to why
> have you stopped using #xfs to ask questions about code, bugs and
> patch reviews?
I think you told me you had some PTO coming up the month after LSF, so I
chose not to bother you during your time off with something that
didn't seem all /that/ urgent. At worst, we merge it, try to tweak it,
and either make it better or revert it.
--D
> -Dave.
> --
> Dave Chinner
> david@fromorbit.com
>
^ permalink raw reply [flat|nested] 24+ messages in thread
end of thread, other threads:[~2024-07-03 5:10 UTC | newest]
Thread overview: 24+ messages
2024-06-20 7:21 xfs: byte-based grant head reservation tracking v4 Christoph Hellwig
2024-06-20 7:21 ` [PATCH 01/11] xfs: fix the contact address for the sysfs ABI documentation Christoph Hellwig
2024-06-20 15:20 ` Darrick J. Wong
2024-06-20 7:21 ` [PATCH 02/11] xfs: move and rename xfs_trans_committed_bulk Christoph Hellwig
2024-06-20 7:21 ` [PATCH 03/11] xfs: AIL doesn't need manual pushing Christoph Hellwig
2024-06-20 19:01 ` Darrick J. Wong
2024-06-20 7:21 ` [PATCH 04/11] xfs: background AIL push should target physical space Christoph Hellwig
2024-06-20 19:42 ` Darrick J. Wong
2024-06-21 5:37 ` Christoph Hellwig
2024-06-21 16:18 ` Darrick J. Wong
2024-06-20 7:21 ` [PATCH 05/11] xfs: ensure log tail is always up to date Christoph Hellwig
2024-06-20 7:21 ` [PATCH 06/11] xfs: l_last_sync_lsn is really AIL state Christoph Hellwig
2024-06-20 7:21 ` [PATCH 07/11] xfs: collapse xlog_state_set_callback in caller Christoph Hellwig
2024-06-20 7:21 ` [PATCH 08/11] xfs: track log space pinned by the AIL Christoph Hellwig
2024-06-20 7:21 ` [PATCH 09/11] xfs: pass the full grant head to accounting functions Christoph Hellwig
2024-06-20 7:21 ` [PATCH 10/11] xfs: grant heads track byte counts, not LSNs Christoph Hellwig
2024-07-01 0:59 ` Dave Chinner
2024-06-20 7:21 ` [PATCH 11/11] xfs: skip flushing log items during push Christoph Hellwig
2024-06-20 19:51 ` Darrick J. Wong
2024-06-21 5:48 ` Christoph Hellwig
2024-06-21 17:46 ` Darrick J. Wong
2024-07-02 18:51 ` Darrick J. Wong
2024-07-02 23:29 ` Dave Chinner
2024-07-03 5:10 ` Darrick J. Wong