From: "Darrick J. Wong" <djwong@kernel.org>
To: John Garry <john.g.garry@oracle.com>
Cc: brauner@kernel.org, hch@lst.de, viro@zeniv.linux.org.uk,
jack@suse.cz, cem@kernel.org, linux-fsdevel@vger.kernel.org,
dchinner@redhat.com, linux-xfs@vger.kernel.org,
linux-kernel@vger.kernel.org, ojaswin@linux.ibm.com,
ritesh.list@gmail.com, martin.petersen@oracle.com,
linux-ext4@vger.kernel.org, linux-block@vger.kernel.org,
catherine.hoang@oracle.com
Subject: Re: [PATCH v6 11/12] xfs: add xfs_compute_atomic_write_unit_max()
Date: Tue, 8 Apr 2025 14:28:35 -0700
Message-ID: <20250408212835.GH6283@frogsfrogsfrogs>
In-Reply-To: <20250408104209.1852036-12-john.g.garry@oracle.com>

On Tue, Apr 08, 2025 at 10:42:08AM +0000, John Garry wrote:
> Now that CoW-based atomic writes are supported, update the max size of an
> atomic write for the data device.
>
> The limit of a CoW-based atomic write will be the limit of the number of
> logitems which can fit into a single transaction.
>
> In addition, the max atomic write size needs to be aligned to the agsize.
> Limit the size of atomic writes to the greatest power-of-two factor of the
> agsize so that allocations for an atomic write will always be aligned
> compatibly with the alignment requirements of the storage.
>
> rtvol is not commonly used, so it is not very important to support large
> atomic writes there initially.
>
> Furthermore, adding large atomic writes for rtvol would be complicated due
> to alignment already offered by rtextsize and also the limitation of
> reflink support only being possible when rtextsize is a power-of-2.
>
> Function xfs_atomic_write_logitems() is added to find the limit on the
> number of log items which can fit in a single transaction.
>
> Darrick Wong contributed the changes in xfs_atomic_write_logitems()
> originally, but they may now be outdated by [0].
>
> [0] https://lore.kernel.org/linux-xfs/20250406172227.GC6307@frogsfrogsfrogs/
>
> Signed-off-by: John Garry <john.g.garry@oracle.com>
> ---
> fs/xfs/xfs_mount.c | 36 ++++++++++++++++++++++++++++++++++++
> fs/xfs/xfs_mount.h | 5 +++++
> fs/xfs/xfs_super.c | 22 ++++++++++++++++++++++
> fs/xfs/xfs_super.h | 1 +
> 4 files changed, 64 insertions(+)
>
> diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c
> index 00b53f479ece..27a737202637 100644
> --- a/fs/xfs/xfs_mount.c
> +++ b/fs/xfs/xfs_mount.c
> @@ -666,6 +666,37 @@ xfs_agbtree_compute_maxlevels(
> mp->m_agbtree_maxlevels = max(levels, mp->m_refc_maxlevels);
> }
>
> +static inline void
> +xfs_compute_atomic_write_unit_max(
> +	struct xfs_mount	*mp)
> +{
> +	xfs_agblock_t		agsize = mp->m_sb.sb_agblocks;
> +	unsigned int		max_extents_logitems;
> +	unsigned int		max_agsize;
> +
> +	if (!xfs_has_reflink(mp)) {
> +		mp->m_atomic_write_unit_max = 1;
> +		return;
> +	}
> +
> +	/*
> +	 * Find limit according to logitems.
> +	 */
> +	max_extents_logitems = xfs_atomic_write_logitems(mp);
> +
> +	/*
> +	 * Also limit the size of atomic writes to the greatest power-of-two
> +	 * factor of the agsize so that allocations for an atomic write will
> +	 * always be aligned compatibly with the alignment requirements of the
> +	 * storage.
> +	 * The greatest power-of-two is the value according to the lowest bit
> +	 * set.
> +	 */
> +	max_agsize = 1 << (ffs(agsize) - 1);
> +
> +	mp->m_atomic_write_unit_max = min(max_extents_logitems, max_agsize);
> +}
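
As an aside, the lowest-set-bit computation in the hunk above is easy to
check in userspace. What follows is only a minimal sketch, assuming a
hypothetical helper name largest_pow2_factor() and the userspace ffs()
from <strings.h> instead of the kernel's; it is not code from the patch:

```c
#include <assert.h>
#include <strings.h>	/* userspace ffs() */

/*
 * Hypothetical userspace helper (not part of the patch): return the
 * largest power of two that divides agsize evenly.  This mirrors the
 * "1 << (ffs(agsize) - 1)" expression quoted above.  agsize must be
 * nonzero, because ffs(0) returns 0 and the shift would be invalid.
 */
static unsigned int
largest_pow2_factor(
	unsigned int		agsize)
{
	assert(agsize != 0);

	/* ffs() returns the 1-based index of the lowest set bit. */
	return 1U << (ffs((int)agsize) - 1);
}
```

For an AG of 1000 blocks (8 * 125) this yields 8, so an atomic write
limit derived from it could never exceed 8 blocks, whatever the log item
limit works out to. An AG size that is already a power of two is returned
unchanged.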
> +
> /* Compute maximum possible height for realtime btree types for this fs. */
> static inline void
> xfs_rtbtree_compute_maxlevels(
> @@ -842,6 +873,11 @@ xfs_mountfs(
> */
> xfs_trans_init(mp);
>
> +	/*
> +	 * Pre-calculate atomic write unit max.
> +	 */
> +	xfs_compute_atomic_write_unit_max(mp);
> +
> /*
> * Allocate and initialize the per-ag data.
> */
> diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
> index 799b84220ebb..4462bffbf0ff 100644
> --- a/fs/xfs/xfs_mount.h
> +++ b/fs/xfs/xfs_mount.h
> @@ -230,6 +230,11 @@ typedef struct xfs_mount {
> bool m_update_sb; /* sb needs update in mount */
> unsigned int m_max_open_zones;
>
> +	/*
> +	 * data device max atomic write.
> +	 */
> +	xfs_extlen_t		m_atomic_write_unit_max;
> +
> /*
> * Bitsets of per-fs metadata that have been checked and/or are sick.
> * Callers must hold m_sb_lock to access these two fields.
> diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
> index b2dd0c0bf509..42b2b7540507 100644
> --- a/fs/xfs/xfs_super.c
> +++ b/fs/xfs/xfs_super.c
> @@ -615,6 +615,28 @@ xfs_init_mount_workqueues(
> return -ENOMEM;
> }
>
> +unsigned int
> +xfs_atomic_write_logitems(
> +	struct xfs_mount	*mp)
> +{
> +	unsigned int		efi = xfs_efi_item_overhead(1);
> +	unsigned int		rui = xfs_rui_item_overhead(1);
> +	unsigned int		cui = xfs_cui_item_overhead(1);
> +	unsigned int		bui = xfs_bui_item_overhead(1);

Intent items can be relogged during a transaction roll, so you need to
add the done item overhead too, e.g.

	const unsigned int	efi = xfs_efi_item_overhead(1) +
				      xfs_efd_item_overhead(1);
	const unsigned int	rui = xfs_rui_item_overhead(1) +
				      xfs_rud_item_overhead();
	const unsigned int	cui = xfs_cui_item_overhead(1) +
				      xfs_cud_item_overhead();
	const unsigned int	bui = xfs_bui_item_overhead(1) +
				      xfs_bud_item_overhead();

> +	unsigned int		logres = M_RES(mp)->tr_write.tr_logres;
> +
> +	/*
> +	 * Maximum overhead to complete an atomic write ioend in software:
> +	 * remove data fork extent + remove cow fork extent +
> +	 * map extent into data fork
> +	 */
> +	unsigned int		atomic_logitems =
> +		(bui + cui + rui + efi) + (cui + rui) + (bui + rui);

You still have to leave enough space to finish at least one step of the
intent types that can be attached to the untorn cow ioend.  Assuming
that you have functions to compute the reservation needed to finish one
step of each of the four intent item types, the worst case reservation
to finish one item is:

	/* Overhead to finish one step of each intent item type */
	const unsigned int	f1 = xfs_calc_finish_efi_reservation(mp, 1);
	const unsigned int	f2 = xfs_calc_finish_rui_reservation(mp, 1);
	const unsigned int	f3 = xfs_calc_finish_cui_reservation(mp, 1);
	const unsigned int	f4 = xfs_calc_finish_bui_reservation(mp, 1);

	/* We only finish one item per transaction in a chain */
	const unsigned int	step_size = max(f4, max3(f1, f2, f3));

So the worst case limit on the number of loops through
xfs_reflink_remap_extent is:

	return rounddown_pow_of_two((logres - step_size) /
			atomic_logitems);

and that's the maximum software untorn write unit.  On my system that
gets you 128 blocks, but YMMV.  Those xfs_calc_finish_*_reservation
helpers look something like this:

/*
 * Finishing an EFI can free the blocks and bmap blocks (t2):
 *    the agf for each of the ags: nr * sector size
 *    the agfl for each of the ags: nr * sector size
 *    the super block to reflect the freed blocks: sector size
 *    worst case split in allocation btrees per extent assuming nr extents:
 *		nr exts * 2 trees * (2 * max depth - 1) * block size
 */
inline unsigned int
xfs_calc_finish_efi_reservation(
	struct xfs_mount	*mp,
	unsigned int		nr)
{
	return xfs_calc_buf_res((2 * nr) + 1, mp->m_sb.sb_sectsize) +
	       xfs_calc_buf_res(xfs_allocfree_block_count(mp, nr),
			       mp->m_sb.sb_blocksize);
}

/*
 * Finishing an RUI is the same as an EFI.  We can split the rmap btree twice
 * on each end of the record, and that can cause the AGFL to be refilled or
 * emptied out.
 */
inline unsigned int
xfs_calc_finish_rui_reservation(
	struct xfs_mount	*mp,
	unsigned int		nr)
{
	if (!xfs_has_rmapbt(mp))
		return 0;
	return xfs_calc_finish_efi_reservation(mp, nr);
}

/*
 * In finishing a BUI, we can modify:
 *    the inode being truncated: inode size
 *    dquots
 *    the inode's bmap btree: (max depth + 1) * block size
 */
inline unsigned int
xfs_calc_finish_bui_reservation(
	struct xfs_mount	*mp,
	unsigned int		nr)
{
	return xfs_calc_inode_res(mp, 1) + XFS_DQUOT_LOGRES +
	       xfs_calc_buf_res(XFS_BM_MAXLEVELS(mp, XFS_DATA_FORK) + 1,
			       mp->m_sb.sb_blocksize);
}

/*
 * Finishing a data device refcount updates (t1):
 *    the agfs of the ags containing the blocks: nr_ops * sector size
 *    the refcount btrees: nr_ops * 1 trees * (2 * max depth - 1) * block size
 */
inline unsigned int
xfs_calc_finish_cui_reservation(
	struct xfs_mount	*mp,
	unsigned int		nr_ops)
{
	if (!xfs_has_reflink(mp))
		return 0;
	return xfs_calc_buf_res(nr_ops, mp->m_sb.sb_sectsize) +
	       xfs_calc_buf_res(xfs_refcountbt_block_count(mp, nr_ops),
			       mp->m_sb.sb_blocksize);
}

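
To make the arithmetic concrete, here is a rough userspace sketch of the
whole calculation: intent plus done overhead per item type, the per-loop
log item cost, the worst-case step size, and the final round down.
Everything in it is an assumption for illustration only: struct
intent_cost, max_u(), rounddown_pow2() and untorn_write_unit_max() are
hypothetical names, byte counts passed in are placeholders rather than
real XFS reservations, and returning 0 when nothing fits is my own
choice, not something the patch does.

```c
#include <assert.h>

/* Hypothetical per-intent-type costs, in bytes of log reservation. */
struct intent_cost {
	unsigned int	intent;	/* overhead of logging the intent item */
	unsigned int	done;	/* overhead of logging the done item */
	unsigned int	finish;	/* reservation to finish one step */
};

static unsigned int
max_u(unsigned int a, unsigned int b)
{
	return a > b ? a : b;
}

/* Userspace stand-in for the kernel's rounddown_pow_of_two(); n > 0. */
static unsigned int
rounddown_pow2(unsigned int n)
{
	unsigned int	p = 1;

	assert(n != 0);
	while (p <= n / 2)
		p *= 2;
	return p;
}

/*
 * Sketch of the limit described above:
 *   rounddown_pow_of_two((logres - step_size) / atomic_logitems)
 * Returns 0 (an assumption of this sketch) if not even one loop fits.
 */
static unsigned int
untorn_write_unit_max(
	unsigned int			logres,
	const struct intent_cost	*efi,
	const struct intent_cost	*rui,
	const struct intent_cost	*cui,
	const struct intent_cost	*bui)
{
	/* Intents can be relogged, so count the done item as well. */
	const unsigned int	e = efi->intent + efi->done;
	const unsigned int	r = rui->intent + rui->done;
	const unsigned int	c = cui->intent + cui->done;
	const unsigned int	b = bui->intent + bui->done;

	/* remove data fork extent + remove cow fork extent + map extent */
	const unsigned int	per_loop =
		(b + c + r + e) + (c + r) + (b + r);

	/* Only one item is finished per transaction in a chain. */
	const unsigned int	step_size =
		max_u(max_u(efi->finish, rui->finish),
		      max_u(cui->finish, bui->finish));

	if (per_loop == 0 || logres <= step_size ||
	    logres - step_size < per_loop)
		return 0;

	return rounddown_pow2((logres - step_size) / per_loop);
}
```

With made-up costs of {100, 50, 4000} for EFI, {80, 40, 4000} for RUI,
{80, 40, 3000} for CUI and {120, 60, 5000} for BUI, the per-loop cost is
1110 bytes and the step size is 5000 bytes.  A logres of 150000 then
leaves room for 130 loops, which rounds down to 128.  Those inputs were
picked arbitrarily and say nothing about what a real filesystem reports.
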
--D
> +
> +	/* atomic write limits are always a power-of-2 */
> +	return rounddown_pow_of_two(logres / (2 * atomic_logitems));
> +}
> +
> STATIC void
> xfs_destroy_mount_workqueues(
> struct xfs_mount *mp)
> diff --git a/fs/xfs/xfs_super.h b/fs/xfs/xfs_super.h
> index c0e85c1e42f2..e0f82be9093a 100644
> --- a/fs/xfs/xfs_super.h
> +++ b/fs/xfs/xfs_super.h
> @@ -100,5 +100,6 @@ extern struct workqueue_struct *xfs_discard_wq;
> #define XFS_M(sb) ((struct xfs_mount *)((sb)->s_fs_info))
>
> struct dentry *xfs_debugfs_mkdir(const char *name, struct dentry *parent);
> +unsigned int xfs_atomic_write_logitems(struct xfs_mount *mp);
>
> #endif /* __XFS_SUPER_H__ */
> --
> 2.31.1
>
>