* [PATCH v5 00/10] large atomic writes for xfs with CoW
@ 2025-03-10 18:39 John Garry
2025-03-10 18:39 ` [PATCH v5 01/10] xfs: Pass flags to xfs_reflink_allocate_cow() John Garry
` (9 more replies)
0 siblings, 10 replies; 63+ messages in thread
From: John Garry @ 2025-03-10 18:39 UTC (permalink / raw)
To: brauner, djwong, cem
Cc: linux-xfs, linux-fsdevel, linux-kernel, ojaswin, ritesh.list,
martin.petersen, John Garry
Currently atomic write support for xfs is limited to writing a single
block, as we have no way to guarantee alignment or that the write covers
a single extent.
This series introduces a software-emulated method for issuing atomic
writes.
The software-emulated method is used as a fallback when attempting to
issue an atomic write over misaligned or multiple extents.
For XFS, this support is based on CoW.
The basic idea of this CoW method is to allocate a range in the CoW
fork, write the data, and atomically update the mapping.
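At a high level, the fallback flow looks like this (illustrative
pseudo-code only, not the actual call chain):

	/* stage new blocks in the CoW fork */
	xfs_reflink_allocate_cow(ip, &imap, &cmap, ...);
	/* issue the data write to the staged blocks */
	iomap_dio_rw(...);
	/* on I/O completion, remap all extents in one transaction */
	xfs_reflink_end_atomic_cow(ip, offset, size);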
Initial MySQL performance testing has shown this method to perform
acceptably. However, that testing only used 16K atomic writes (with a 4K
block size), so typically - and thankfully - this software fallback
method won't be used often.
For other FSes which want large atomic writes and don't support CoW, I
think that they can follow the example in [0].
Based on 32f6987f9384 (xfs/for-next) Merge branch 'xfs-6.15-merge' into for-next
[0] https://lore.kernel.org/linux-xfs/20250102140411.14617-1-john.g.garry@oracle.com/
Differences to v4:
- Omit iomap patches which have already been queued
- Add () in xfs_bmap_compute_alignments() (Dave)
- Rename awu_max -> m_awu_max (Carlos)
- Add RFC to change IOMAP flag names
- Rebase
Differences to v3:
- Error !reflink in xfs_atomic_write_sw_iomap_begin() (Darrick)
- Fix unused variable (kbuild bot)
- Add RB tags from Darrick (Thanks!)
Differences to v2:
(all from Darrick)
- Add dedicated function for xfs iomap sw-based atomic write
- Don't ignore xfs_reflink_end_atomic_cow() -> xfs_trans_commit() return
value
- Pass flags for reflink alloc functions
- Rename IOMAP_ATOMIC_COW -> IOMAP_ATOMIC_SW
- Coding style corrections and comment improvements
- Add RB tags (thanks!)
John Garry (10):
xfs: Pass flags to xfs_reflink_allocate_cow()
xfs: Switch atomic write size check in xfs_file_write_iter()
xfs: Refactor xfs_reflink_end_cow_extent()
xfs: Reflink CoW-based atomic write support
xfs: Iomap SW-based atomic write support
xfs: Add xfs_file_dio_write_atomic()
xfs: Commit CoW-based atomic writes atomically
xfs: Update atomic write max size
xfs: Allow block allocator to take an alignment hint
iomap: Rename ATOMIC flags again
.../filesystems/iomap/operations.rst | 6 +-
fs/ext4/inode.c | 2 +-
fs/iomap/direct-io.c | 18 +--
fs/iomap/trace.h | 3 +-
fs/xfs/libxfs/xfs_bmap.c | 4 +
fs/xfs/libxfs/xfs_bmap.h | 6 +-
fs/xfs/xfs_file.c | 61 +++++++-
fs/xfs/xfs_iomap.c | 144 ++++++++++++++++-
fs/xfs/xfs_iomap.h | 1 +
fs/xfs/xfs_iops.c | 32 +++-
fs/xfs/xfs_iops.h | 2 +
fs/xfs/xfs_mount.c | 28 ++++
fs/xfs/xfs_mount.h | 1 +
fs/xfs/xfs_reflink.c | 145 +++++++++++++-----
fs/xfs/xfs_reflink.h | 11 +-
include/linux/iomap.h | 10 +-
16 files changed, 400 insertions(+), 74 deletions(-)
--
2.31.1
* [PATCH v5 01/10] xfs: Pass flags to xfs_reflink_allocate_cow()
2025-03-10 18:39 [PATCH v5 00/10] large atomic writes for xfs with CoW John Garry
@ 2025-03-10 18:39 ` John Garry
2025-03-12 7:15 ` Christoph Hellwig
2025-03-10 18:39 ` [PATCH v5 02/10] xfs: Switch atomic write size check in xfs_file_write_iter() John Garry
` (8 subsequent siblings)
9 siblings, 1 reply; 63+ messages in thread
From: John Garry @ 2025-03-10 18:39 UTC (permalink / raw)
To: brauner, djwong, cem
Cc: linux-xfs, linux-fsdevel, linux-kernel, ojaswin, ritesh.list,
martin.petersen, John Garry
In the future we will want more boolean options for
xfs_reflink_allocate_cow(), so just prepare for this by passing a flags
argument in place of @convert_now.
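As an example (the diff below passes the flags via a local variable), a
call which previously passed true for @convert_now would now look like:

	error = xfs_reflink_allocate_cow(ip, &imap, &cmap, &shared,
			&lockmode, XFS_REFLINK_CONVERT);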
Suggested-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: John Garry <john.g.garry@oracle.com>
---
fs/xfs/xfs_iomap.c | 7 +++++--
fs/xfs/xfs_reflink.c | 10 ++++++----
fs/xfs/xfs_reflink.h | 7 ++++++-
3 files changed, 17 insertions(+), 7 deletions(-)
diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
index 30e257f683bb..f3a6ec2d3a40 100644
--- a/fs/xfs/xfs_iomap.c
+++ b/fs/xfs/xfs_iomap.c
@@ -813,6 +813,7 @@ xfs_direct_write_iomap_begin(
xfs_fileoff_t offset_fsb = XFS_B_TO_FSBT(mp, offset);
xfs_fileoff_t end_fsb = xfs_iomap_end_fsb(mp, offset, length);
int nimaps = 1, error = 0;
+ unsigned int reflink_flags = 0;
bool shared = false;
u16 iomap_flags = 0;
unsigned int lockmode;
@@ -823,6 +824,9 @@ xfs_direct_write_iomap_begin(
if (xfs_is_shutdown(mp))
return -EIO;
+ if (flags & IOMAP_DIRECT || IS_DAX(inode))
+ reflink_flags |= XFS_REFLINK_CONVERT;
+
/*
* Writes that span EOF might trigger an IO size update on completion,
* so consider them to be dirty for the purposes of O_DSYNC even if
@@ -867,8 +871,7 @@ xfs_direct_write_iomap_begin(
/* may drop and re-acquire the ilock */
error = xfs_reflink_allocate_cow(ip, &imap, &cmap, &shared,
- &lockmode,
- (flags & IOMAP_DIRECT) || IS_DAX(inode));
+ &lockmode, reflink_flags);
if (error)
goto out_unlock;
if (shared)
diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
index cc3b4df88110..e9791e567bdf 100644
--- a/fs/xfs/xfs_reflink.c
+++ b/fs/xfs/xfs_reflink.c
@@ -435,7 +435,7 @@ xfs_reflink_fill_cow_hole(
struct xfs_bmbt_irec *cmap,
bool *shared,
uint *lockmode,
- bool convert_now)
+ unsigned int flags)
{
struct xfs_mount *mp = ip->i_mount;
struct xfs_trans *tp;
@@ -488,7 +488,8 @@ xfs_reflink_fill_cow_hole(
return error;
convert:
- return xfs_reflink_convert_unwritten(ip, imap, cmap, convert_now);
+ return xfs_reflink_convert_unwritten(ip, imap, cmap,
+ flags & XFS_REFLINK_CONVERT);
out_trans_cancel:
xfs_trans_cancel(tp);
@@ -566,10 +567,11 @@ xfs_reflink_allocate_cow(
struct xfs_bmbt_irec *cmap,
bool *shared,
uint *lockmode,
- bool convert_now)
+ unsigned int flags)
{
int error;
bool found;
+ bool convert_now = flags & XFS_REFLINK_CONVERT;
xfs_assert_ilocked(ip, XFS_ILOCK_EXCL);
if (!ip->i_cowfp) {
@@ -592,7 +594,7 @@ xfs_reflink_allocate_cow(
*/
if (cmap->br_startoff > imap->br_startoff)
return xfs_reflink_fill_cow_hole(ip, imap, cmap, shared,
- lockmode, convert_now);
+ lockmode, flags);
/*
* CoW fork has a delalloc reservation. Replace it with a real extent.
diff --git a/fs/xfs/xfs_reflink.h b/fs/xfs/xfs_reflink.h
index cc4e92278279..cdbd73d58822 100644
--- a/fs/xfs/xfs_reflink.h
+++ b/fs/xfs/xfs_reflink.h
@@ -6,6 +6,11 @@
#ifndef __XFS_REFLINK_H
#define __XFS_REFLINK_H 1
+/*
+ * Flags for xfs_reflink_allocate_cow()
+ */
+#define XFS_REFLINK_CONVERT (1u << 0) /* convert unwritten extents now */
+
/*
* Check whether it is safe to free COW fork blocks from an inode. It is unsafe
* to do so when an inode has dirty cache or I/O in-flight, even if no shared
@@ -32,7 +37,7 @@ int xfs_bmap_trim_cow(struct xfs_inode *ip, struct xfs_bmbt_irec *imap,
int xfs_reflink_allocate_cow(struct xfs_inode *ip, struct xfs_bmbt_irec *imap,
struct xfs_bmbt_irec *cmap, bool *shared, uint *lockmode,
- bool convert_now);
+ unsigned int flags);
extern int xfs_reflink_convert_cow(struct xfs_inode *ip, xfs_off_t offset,
xfs_off_t count);
--
2.31.1
* [PATCH v5 02/10] xfs: Switch atomic write size check in xfs_file_write_iter()
2025-03-10 18:39 [PATCH v5 00/10] large atomic writes for xfs with CoW John Garry
2025-03-10 18:39 ` [PATCH v5 01/10] xfs: Pass flags to xfs_reflink_allocate_cow() John Garry
@ 2025-03-10 18:39 ` John Garry
2025-03-12 7:17 ` Christoph Hellwig
2025-03-10 18:39 ` [PATCH v5 03/10] xfs: Refactor xfs_reflink_end_cow_extent() John Garry
` (7 subsequent siblings)
9 siblings, 1 reply; 63+ messages in thread
From: John Garry @ 2025-03-10 18:39 UTC (permalink / raw)
To: brauner, djwong, cem
Cc: linux-xfs, linux-fsdevel, linux-kernel, ojaswin, ritesh.list,
martin.petersen, John Garry
Currently the size of atomic write allowed is fixed at the blocksize.
To start to lift this restriction, refactor xfs_report_atomic_write()
into a helper - xfs_get_atomic_write_attr() - and use that helper to
find the per-inode atomic write limits and check against them.
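For reference, applications can query the resulting per-inode limits
via statx(2), e.g.:

	struct statx stx;

	statx(AT_FDCWD, "file", 0, STATX_WRITE_ATOMIC, &stx);
	/* stx.stx_atomic_write_unit_min/_max hold the reported limits */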
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: John Garry <john.g.garry@oracle.com>
---
fs/xfs/xfs_file.c | 12 +++++-------
fs/xfs/xfs_iops.c | 20 +++++++++++++++++---
fs/xfs/xfs_iops.h | 2 ++
3 files changed, 24 insertions(+), 10 deletions(-)
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index fe8cf9d96eb0..75a6d7e75bf8 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -1032,14 +1032,12 @@ xfs_file_write_iter(
return xfs_file_dax_write(iocb, from);
if (iocb->ki_flags & IOCB_ATOMIC) {
- /*
- * Currently only atomic writing of a single FS block is
- * supported. It would be possible to atomic write smaller than
- * a FS block, but there is no requirement to support this.
- * Note that iomap also does not support this yet.
- */
- if (ocount != ip->i_mount->m_sb.sb_blocksize)
+ unsigned int unit_min, unit_max;
+
+ xfs_get_atomic_write_attr(ip, &unit_min, &unit_max);
+ if (ocount < unit_min || ocount > unit_max)
return -EINVAL;
+
ret = generic_atomic_write_valid(iocb, from);
if (ret)
return ret;
diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
index 444193f543ef..de065cc2e7cf 100644
--- a/fs/xfs/xfs_iops.c
+++ b/fs/xfs/xfs_iops.c
@@ -601,15 +601,29 @@ xfs_report_dioalign(
stat->dio_offset_align = stat->dio_read_offset_align;
}
+void
+xfs_get_atomic_write_attr(
+ struct xfs_inode *ip,
+ unsigned int *unit_min,
+ unsigned int *unit_max)
+{
+ if (!xfs_inode_can_atomicwrite(ip)) {
+ *unit_min = *unit_max = 0;
+ return;
+ }
+
+ *unit_min = *unit_max = ip->i_mount->m_sb.sb_blocksize;
+}
+
static void
xfs_report_atomic_write(
struct xfs_inode *ip,
struct kstat *stat)
{
- unsigned int unit_min = 0, unit_max = 0;
+ unsigned int unit_min, unit_max;
+
+ xfs_get_atomic_write_attr(ip, &unit_min, &unit_max);
- if (xfs_inode_can_atomicwrite(ip))
- unit_min = unit_max = ip->i_mount->m_sb.sb_blocksize;
generic_fill_statx_atomic_writes(stat, unit_min, unit_max);
}
diff --git a/fs/xfs/xfs_iops.h b/fs/xfs/xfs_iops.h
index 3c1a2605ffd2..d95a543f3ab0 100644
--- a/fs/xfs/xfs_iops.h
+++ b/fs/xfs/xfs_iops.h
@@ -19,5 +19,7 @@ int xfs_inode_init_security(struct inode *inode, struct inode *dir,
extern void xfs_setup_inode(struct xfs_inode *ip);
extern void xfs_setup_iops(struct xfs_inode *ip);
extern void xfs_diflags_to_iflags(struct xfs_inode *ip, bool init);
+void xfs_get_atomic_write_attr(struct xfs_inode *ip,
+ unsigned int *unit_min, unsigned int *unit_max);
#endif /* __XFS_IOPS_H__ */
--
2.31.1
* [PATCH v5 03/10] xfs: Refactor xfs_reflink_end_cow_extent()
2025-03-10 18:39 [PATCH v5 00/10] large atomic writes for xfs with CoW John Garry
2025-03-10 18:39 ` [PATCH v5 01/10] xfs: Pass flags to xfs_reflink_allocate_cow() John Garry
2025-03-10 18:39 ` [PATCH v5 02/10] xfs: Switch atomic write size check in xfs_file_write_iter() John Garry
@ 2025-03-10 18:39 ` John Garry
2025-03-12 7:24 ` Christoph Hellwig
2025-03-10 18:39 ` [PATCH v5 04/10] xfs: Reflink CoW-based atomic write support John Garry
` (6 subsequent siblings)
9 siblings, 1 reply; 63+ messages in thread
From: John Garry @ 2025-03-10 18:39 UTC (permalink / raw)
To: brauner, djwong, cem
Cc: linux-xfs, linux-fsdevel, linux-kernel, ojaswin, ritesh.list,
martin.petersen, John Garry
Refactor xfs_reflink_end_cow_extent() into separate parts which process
the CoW range and commit the transaction.
This refactoring will be used in the future when it is required to
commit a range of extents as a single transaction, similar to how it was
done pre-commit d6f215f359637.
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: John Garry <john.g.garry@oracle.com>
---
fs/xfs/xfs_reflink.c | 73 ++++++++++++++++++++++++++------------------
1 file changed, 43 insertions(+), 30 deletions(-)
diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
index e9791e567bdf..e3e594c65745 100644
--- a/fs/xfs/xfs_reflink.c
+++ b/fs/xfs/xfs_reflink.c
@@ -788,35 +788,19 @@ xfs_reflink_update_quota(
* requirements as low as possible.
*/
STATIC int
-xfs_reflink_end_cow_extent(
+xfs_reflink_end_cow_extent_locked(
+ struct xfs_trans *tp,
struct xfs_inode *ip,
xfs_fileoff_t *offset_fsb,
xfs_fileoff_t end_fsb)
{
struct xfs_iext_cursor icur;
struct xfs_bmbt_irec got, del, data;
- struct xfs_mount *mp = ip->i_mount;
- struct xfs_trans *tp;
struct xfs_ifork *ifp = xfs_ifork_ptr(ip, XFS_COW_FORK);
- unsigned int resblks;
int nmaps;
bool isrt = XFS_IS_REALTIME_INODE(ip);
int error;
- resblks = XFS_EXTENTADD_SPACE_RES(mp, XFS_DATA_FORK);
- error = xfs_trans_alloc(mp, &M_RES(mp)->tr_write, resblks, 0,
- XFS_TRANS_RESERVE, &tp);
- if (error)
- return error;
-
- /*
- * Lock the inode. We have to ijoin without automatic unlock because
- * the lead transaction is the refcountbt record deletion; the data
- * fork update follows as a deferred log item.
- */
- xfs_ilock(ip, XFS_ILOCK_EXCL);
- xfs_trans_ijoin(tp, ip, 0);
-
/*
* In case of racing, overlapping AIO writes no COW extents might be
* left by the time I/O completes for the loser of the race. In that
@@ -825,7 +809,7 @@ xfs_reflink_end_cow_extent(
if (!xfs_iext_lookup_extent(ip, ifp, *offset_fsb, &icur, &got) ||
got.br_startoff >= end_fsb) {
*offset_fsb = end_fsb;
- goto out_cancel;
+ return 0;
}
/*
@@ -839,7 +823,7 @@ xfs_reflink_end_cow_extent(
if (!xfs_iext_next_extent(ifp, &icur, &got) ||
got.br_startoff >= end_fsb) {
*offset_fsb = end_fsb;
- goto out_cancel;
+ return 0;
}
}
del = got;
@@ -848,14 +832,14 @@ xfs_reflink_end_cow_extent(
error = xfs_iext_count_extend(tp, ip, XFS_DATA_FORK,
XFS_IEXT_REFLINK_END_COW_CNT);
if (error)
- goto out_cancel;
+ return error;
/* Grab the corresponding mapping in the data fork. */
nmaps = 1;
error = xfs_bmapi_read(ip, del.br_startoff, del.br_blockcount, &data,
&nmaps, 0);
if (error)
- goto out_cancel;
+ return error;
/* We can only remap the smaller of the two extent sizes. */
data.br_blockcount = min(data.br_blockcount, del.br_blockcount);
@@ -884,7 +868,7 @@ xfs_reflink_end_cow_extent(
error = xfs_bunmapi(NULL, ip, data.br_startoff,
data.br_blockcount, 0, 1, &done);
if (error)
- goto out_cancel;
+ return error;
ASSERT(done);
}
@@ -901,17 +885,46 @@ xfs_reflink_end_cow_extent(
/* Remove the mapping from the CoW fork. */
xfs_bmap_del_extent_cow(ip, &icur, &got, &del);
- error = xfs_trans_commit(tp);
- xfs_iunlock(ip, XFS_ILOCK_EXCL);
- if (error)
- return error;
-
/* Update the caller about how much progress we made. */
*offset_fsb = del.br_startoff + del.br_blockcount;
return 0;
+}
-out_cancel:
- xfs_trans_cancel(tp);
+
+/*
+ * Remap part of the CoW fork into the data fork.
+ *
+ * We aim to remap the range starting at @offset_fsb and ending at @end_fsb
+ * into the data fork; this function will remap what it can (at the end of the
+ * range) and update @end_fsb appropriately. Each remap gets its own
+ * transaction because we can end up merging and splitting bmbt blocks for
+ * every remap operation and we'd like to keep the block reservation
+ * requirements as low as possible.
+ */
+STATIC int
+xfs_reflink_end_cow_extent(
+ struct xfs_inode *ip,
+ xfs_fileoff_t *offset_fsb,
+ xfs_fileoff_t end_fsb)
+{
+ struct xfs_mount *mp = ip->i_mount;
+ struct xfs_trans *tp;
+ unsigned int resblks;
+ int error;
+
+ resblks = XFS_EXTENTADD_SPACE_RES(mp, XFS_DATA_FORK);
+ error = xfs_trans_alloc(mp, &M_RES(mp)->tr_write, resblks, 0,
+ XFS_TRANS_RESERVE, &tp);
+ if (error)
+ return error;
+ xfs_ilock(ip, XFS_ILOCK_EXCL);
+ xfs_trans_ijoin(tp, ip, 0);
+
+ error = xfs_reflink_end_cow_extent_locked(tp, ip, offset_fsb, end_fsb);
+ if (error)
+ xfs_trans_cancel(tp);
+ else
+ error = xfs_trans_commit(tp);
xfs_iunlock(ip, XFS_ILOCK_EXCL);
return error;
}
--
2.31.1
* [PATCH v5 04/10] xfs: Reflink CoW-based atomic write support
2025-03-10 18:39 [PATCH v5 00/10] large atomic writes for xfs with CoW John Garry
` (2 preceding siblings ...)
2025-03-10 18:39 ` [PATCH v5 03/10] xfs: Refactor xfs_reflink_end_cow_extent() John Garry
@ 2025-03-10 18:39 ` John Garry
2025-03-12 7:27 ` Christoph Hellwig
2025-03-10 18:39 ` [PATCH v5 05/10] xfs: Iomap SW-based " John Garry
` (5 subsequent siblings)
9 siblings, 1 reply; 63+ messages in thread
From: John Garry @ 2025-03-10 18:39 UTC (permalink / raw)
To: brauner, djwong, cem
Cc: linux-xfs, linux-fsdevel, linux-kernel, ojaswin, ritesh.list,
martin.petersen, John Garry
Base SW-based atomic writes on CoW.
For SW-based atomic write support, always allocate a cow hole in
xfs_reflink_allocate_cow() to write the new data.
The semantics are that if @atomic_sw is set and no error is returned,
we will be passed a CoW fork extent mapping.
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: John Garry <john.g.garry@oracle.com>
---
fs/xfs/xfs_reflink.c | 5 +++--
fs/xfs/xfs_reflink.h | 1 +
2 files changed, 4 insertions(+), 2 deletions(-)
diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
index e3e594c65745..0949d6ba2b3b 100644
--- a/fs/xfs/xfs_reflink.c
+++ b/fs/xfs/xfs_reflink.c
@@ -444,6 +444,7 @@ xfs_reflink_fill_cow_hole(
int nimaps;
int error;
bool found;
+ bool atomic_sw = flags & XFS_REFLINK_ATOMIC_SW;
resaligned = xfs_aligned_fsb_count(imap->br_startoff,
imap->br_blockcount, xfs_get_cowextsz_hint(ip));
@@ -466,7 +467,7 @@ xfs_reflink_fill_cow_hole(
*lockmode = XFS_ILOCK_EXCL;
error = xfs_find_trim_cow_extent(ip, imap, cmap, shared, &found);
- if (error || !*shared)
+ if (error || (!*shared && !atomic_sw))
goto out_trans_cancel;
if (found) {
@@ -580,7 +581,7 @@ xfs_reflink_allocate_cow(
}
error = xfs_find_trim_cow_extent(ip, imap, cmap, shared, &found);
- if (error || !*shared)
+ if (error || (!*shared && !(flags & XFS_REFLINK_ATOMIC_SW)))
return error;
/* CoW fork has a real extent */
diff --git a/fs/xfs/xfs_reflink.h b/fs/xfs/xfs_reflink.h
index cdbd73d58822..dfd94e51e2b4 100644
--- a/fs/xfs/xfs_reflink.h
+++ b/fs/xfs/xfs_reflink.h
@@ -10,6 +10,7 @@
* Flags for xfs_reflink_allocate_cow()
*/
#define XFS_REFLINK_CONVERT (1u << 0) /* convert unwritten extents now */
+#define XFS_REFLINK_ATOMIC_SW (1u << 1) /* alloc for SW-based atomic write */
/*
* Check whether it is safe to free COW fork blocks from an inode. It is unsafe
--
2.31.1
* [PATCH v5 05/10] xfs: Iomap SW-based atomic write support
2025-03-10 18:39 [PATCH v5 00/10] large atomic writes for xfs with CoW John Garry
` (3 preceding siblings ...)
2025-03-10 18:39 ` [PATCH v5 04/10] xfs: Reflink CoW-based atomic write support John Garry
@ 2025-03-10 18:39 ` John Garry
2025-03-12 7:37 ` Christoph Hellwig
2025-03-10 18:39 ` [PATCH v5 06/10] xfs: Add xfs_file_dio_write_atomic() John Garry
` (4 subsequent siblings)
9 siblings, 1 reply; 63+ messages in thread
From: John Garry @ 2025-03-10 18:39 UTC (permalink / raw)
To: brauner, djwong, cem
Cc: linux-xfs, linux-fsdevel, linux-kernel, ojaswin, ritesh.list,
martin.petersen, John Garry
When an atomic write occurs for misaligned or discontiguous disk
blocks, we will use a CoW-based method to issue the atomic write.
So, for that case, return -EAGAIN to request that the write be issued in
CoW atomic write mode. The dio write path should detect this, similar to
how misaligned regular DIO writes are handled.
For normal HW-based mode, when the range which we are atomically
writing covers a shared data extent, try to allocate new blocks in the
CoW fork. However, if we find that what we allocated does not meet the
atomic write requirements in terms of length and alignment, then fall
back to the CoW-based mode for the atomic write.
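The expected caller pattern is along these lines (a sketch only; the
actual handling is added in a following patch):

	ret = iomap_dio_rw(iocb, from, &xfs_atomic_write_iomap_ops, ...);
	if (ret == -EAGAIN && !(iocb->ki_flags & IOCB_NOWAIT)) {
		/* reissue in CoW-based atomic write mode */
	}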
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: John Garry <john.g.garry@oracle.com>
---
fs/xfs/xfs_iomap.c | 137 ++++++++++++++++++++++++++++++++++++++++++++-
fs/xfs/xfs_iomap.h | 1 +
2 files changed, 136 insertions(+), 2 deletions(-)
diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
index f3a6ec2d3a40..6c963786530d 100644
--- a/fs/xfs/xfs_iomap.c
+++ b/fs/xfs/xfs_iomap.c
@@ -798,6 +798,23 @@ imap_spans_range(
return true;
}
+static bool
+xfs_bmap_valid_for_atomic_write(
+ struct xfs_bmbt_irec *imap,
+ xfs_fileoff_t offset_fsb,
+ xfs_fileoff_t end_fsb)
+{
+ /* Misaligned start block wrt size */
+ if (!IS_ALIGNED(imap->br_startblock, imap->br_blockcount))
+ return false;
+
+ /* Discontiguous or mixed extents */
+ if (!imap_spans_range(imap, offset_fsb, end_fsb))
+ return false;
+
+ return true;
+}
+
static int
xfs_direct_write_iomap_begin(
struct inode *inode,
@@ -812,10 +829,13 @@ xfs_direct_write_iomap_begin(
struct xfs_bmbt_irec imap, cmap;
xfs_fileoff_t offset_fsb = XFS_B_TO_FSBT(mp, offset);
xfs_fileoff_t end_fsb = xfs_iomap_end_fsb(mp, offset, length);
+ xfs_fileoff_t orig_end_fsb = end_fsb;
+ bool atomic_hw = flags & IOMAP_ATOMIC_HW;
int nimaps = 1, error = 0;
unsigned int reflink_flags = 0;
bool shared = false;
u16 iomap_flags = 0;
+ bool needs_alloc;
unsigned int lockmode;
u64 seq;
@@ -874,13 +894,37 @@ xfs_direct_write_iomap_begin(
&lockmode, reflink_flags);
if (error)
goto out_unlock;
- if (shared)
+ if (shared) {
+ if (atomic_hw &&
+ !xfs_bmap_valid_for_atomic_write(&cmap,
+ offset_fsb, end_fsb)) {
+ error = -EAGAIN;
+ goto out_unlock;
+ }
goto out_found_cow;
+ }
end_fsb = imap.br_startoff + imap.br_blockcount;
length = XFS_FSB_TO_B(mp, end_fsb) - offset;
}
- if (imap_needs_alloc(inode, flags, &imap, nimaps))
+ needs_alloc = imap_needs_alloc(inode, flags, &imap, nimaps);
+
+ if (atomic_hw) {
+ error = -EAGAIN;
+ /*
+ * Use CoW method for when we need to alloc > 1 block,
+ * otherwise we might allocate less than what we need here and
+ * have multiple mappings.
+ */
+ if (needs_alloc && orig_end_fsb - offset_fsb > 1)
+ goto out_unlock;
+
+ if (!xfs_bmap_valid_for_atomic_write(&imap, offset_fsb,
+ orig_end_fsb))
+ goto out_unlock;
+ }
+
+ if (needs_alloc)
goto allocate_blocks;
/*
@@ -1021,6 +1065,95 @@ const struct iomap_ops xfs_zoned_direct_write_iomap_ops = {
};
#endif /* CONFIG_XFS_RT */
+static int
+xfs_atomic_write_sw_iomap_begin(
+ struct inode *inode,
+ loff_t offset,
+ loff_t length,
+ unsigned flags,
+ struct iomap *iomap,
+ struct iomap *srcmap)
+{
+ struct xfs_inode *ip = XFS_I(inode);
+ struct xfs_mount *mp = ip->i_mount;
+ struct xfs_bmbt_irec imap, cmap;
+ xfs_fileoff_t offset_fsb = XFS_B_TO_FSBT(mp, offset);
+ xfs_fileoff_t end_fsb = xfs_iomap_end_fsb(mp, offset, length);
+ int nimaps = 1, error;
+ bool shared = false;
+ unsigned int lockmode = XFS_ILOCK_EXCL;
+ u64 seq;
+
+ if (xfs_is_shutdown(mp))
+ return -EIO;
+
+ if (!xfs_has_reflink(mp))
+ return -EINVAL;
+
+ error = xfs_ilock_for_iomap(ip, flags, &lockmode);
+ if (error)
+ return error;
+
+ error = xfs_bmapi_read(ip, offset_fsb, end_fsb - offset_fsb, &imap,
+ &nimaps, 0);
+ if (error)
+ goto out_unlock;
+
+ error = xfs_reflink_allocate_cow(ip, &imap, &cmap, &shared,
+ &lockmode, XFS_REFLINK_CONVERT |
+ XFS_REFLINK_ATOMIC_SW);
+ /*
+ * Don't check @shared. For atomic writes, we should error when
+ * we don't get a COW mapping
+ */
+ if (error)
+ goto out_unlock;
+
+ end_fsb = imap.br_startoff + imap.br_blockcount;
+
+ length = XFS_FSB_TO_B(mp, cmap.br_startoff + cmap.br_blockcount);
+ trace_xfs_iomap_found(ip, offset, length - offset, XFS_COW_FORK, &cmap);
+ if (imap.br_startblock != HOLESTARTBLOCK) {
+ seq = xfs_iomap_inode_sequence(ip, 0);
+ error = xfs_bmbt_to_iomap(ip, srcmap, &imap, flags, 0, seq);
+ if (error)
+ goto out_unlock;
+ }
+ seq = xfs_iomap_inode_sequence(ip, IOMAP_F_SHARED);
+ xfs_iunlock(ip, lockmode);
+ return xfs_bmbt_to_iomap(ip, iomap, &cmap, flags, IOMAP_F_SHARED, seq);
+
+out_unlock:
+ if (lockmode)
+ xfs_iunlock(ip, lockmode);
+ return error;
+}
+
+static int
+xfs_atomic_write_iomap_begin(
+ struct inode *inode,
+ loff_t offset,
+ loff_t length,
+ unsigned flags,
+ struct iomap *iomap,
+ struct iomap *srcmap)
+{
+ ASSERT(flags & IOMAP_WRITE);
+ ASSERT(flags & IOMAP_DIRECT);
+
+ if (flags & IOMAP_ATOMIC_SW)
+ return xfs_atomic_write_sw_iomap_begin(inode, offset, length,
+ flags, iomap, srcmap);
+
+ ASSERT(flags & IOMAP_ATOMIC_HW);
+ return xfs_direct_write_iomap_begin(inode, offset, length, flags,
+ iomap, srcmap);
+}
+
+const struct iomap_ops xfs_atomic_write_iomap_ops = {
+ .iomap_begin = xfs_atomic_write_iomap_begin,
+};
+
static int
xfs_dax_write_iomap_end(
struct inode *inode,
diff --git a/fs/xfs/xfs_iomap.h b/fs/xfs/xfs_iomap.h
index d330c4a581b1..5272cf9ec9d3 100644
--- a/fs/xfs/xfs_iomap.h
+++ b/fs/xfs/xfs_iomap.h
@@ -56,5 +56,6 @@ extern const struct iomap_ops xfs_read_iomap_ops;
extern const struct iomap_ops xfs_seek_iomap_ops;
extern const struct iomap_ops xfs_xattr_iomap_ops;
extern const struct iomap_ops xfs_dax_write_iomap_ops;
+extern const struct iomap_ops xfs_atomic_write_iomap_ops;
#endif /* __XFS_IOMAP_H__*/
--
2.31.1
* [PATCH v5 06/10] xfs: Add xfs_file_dio_write_atomic()
2025-03-10 18:39 [PATCH v5 00/10] large atomic writes for xfs with CoW John Garry
` (4 preceding siblings ...)
2025-03-10 18:39 ` [PATCH v5 05/10] xfs: Iomap SW-based " John Garry
@ 2025-03-10 18:39 ` John Garry
2025-03-10 18:39 ` [PATCH v5 07/10] xfs: Commit CoW-based atomic writes atomically John Garry
` (3 subsequent siblings)
9 siblings, 0 replies; 63+ messages in thread
From: John Garry @ 2025-03-10 18:39 UTC (permalink / raw)
To: brauner, djwong, cem
Cc: linux-xfs, linux-fsdevel, linux-kernel, ojaswin, ritesh.list,
martin.petersen, John Garry
Add xfs_file_dio_write_atomic() for dedicated handling of atomic writes.
If -EAGAIN is returned from iomap_dio_rw(), reissue the write in
CoW-based atomic write mode.
For CoW-based mode, ensure that we have no outstanding IOs which we
may trample on.
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: John Garry <john.g.garry@oracle.com>
---
fs/xfs/xfs_file.c | 44 ++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 44 insertions(+)
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 75a6d7e75bf8..ddcf95ce741e 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -725,6 +725,46 @@ xfs_file_dio_write_zoned(
return ret;
}
+static noinline ssize_t
+xfs_file_dio_write_atomic(
+ struct xfs_inode *ip,
+ struct kiocb *iocb,
+ struct iov_iter *from)
+{
+ unsigned int iolock = XFS_IOLOCK_SHARED;
+ unsigned int dio_flags = 0;
+ ssize_t ret;
+
+retry:
+ ret = xfs_ilock_iocb_for_write(iocb, &iolock);
+ if (ret)
+ return ret;
+
+ ret = xfs_file_write_checks(iocb, from, &iolock, NULL);
+ if (ret)
+ goto out_unlock;
+
+ if (dio_flags & IOMAP_DIO_FORCE_WAIT)
+ inode_dio_wait(VFS_I(ip));
+
+ trace_xfs_file_direct_write(iocb, from);
+ ret = iomap_dio_rw(iocb, from, &xfs_atomic_write_iomap_ops,
+ &xfs_dio_write_ops, dio_flags, NULL, 0);
+
+ if (ret == -EAGAIN && !(iocb->ki_flags & IOCB_NOWAIT) &&
+ !(dio_flags & IOMAP_DIO_ATOMIC_SW)) {
+ xfs_iunlock(ip, iolock);
+ dio_flags = IOMAP_DIO_ATOMIC_SW | IOMAP_DIO_FORCE_WAIT;
+ iolock = XFS_IOLOCK_EXCL;
+ goto retry;
+ }
+
+out_unlock:
+ if (iolock)
+ xfs_iunlock(ip, iolock);
+ return ret;
+}
+
/*
* Handle block unaligned direct I/O writes
*
@@ -840,6 +880,10 @@ xfs_file_dio_write(
return xfs_file_dio_write_unaligned(ip, iocb, from);
if (xfs_is_zoned_inode(ip))
return xfs_file_dio_write_zoned(ip, iocb, from);
+
+ if (iocb->ki_flags & IOCB_ATOMIC)
+ return xfs_file_dio_write_atomic(ip, iocb, from);
+
return xfs_file_dio_write_aligned(ip, iocb, from,
&xfs_direct_write_iomap_ops, &xfs_dio_write_ops, NULL);
}
--
2.31.1
* [PATCH v5 07/10] xfs: Commit CoW-based atomic writes atomically
2025-03-10 18:39 [PATCH v5 00/10] large atomic writes for xfs with CoW John Garry
` (5 preceding siblings ...)
2025-03-10 18:39 ` [PATCH v5 06/10] xfs: Add xfs_file_dio_write_atomic() John Garry
@ 2025-03-10 18:39 ` John Garry
2025-03-12 7:39 ` Christoph Hellwig
2025-03-10 18:39 ` [PATCH v5 08/10] xfs: Update atomic write max size John Garry
` (2 subsequent siblings)
9 siblings, 1 reply; 63+ messages in thread
From: John Garry @ 2025-03-10 18:39 UTC (permalink / raw)
To: brauner, djwong, cem
Cc: linux-xfs, linux-fsdevel, linux-kernel, ojaswin, ritesh.list,
martin.petersen, John Garry
When completing a CoW-based write, each extent range mapping update is
covered by a separate transaction.
For a CoW-based atomic write, all mappings must be changed at once, so
use a single transaction instead.
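As a worked example (illustrative), a 16K atomic write on a 4K block
size filesystem spans four blocks, so the single transaction reserves
four blocks' worth of extent-add space up front:

	/* end_fsb - offset_fsb == 4 for a 16K write on 4K blocks */
	resblks = (end_fsb - offset_fsb) *
			XFS_NEXTENTADD_SPACE_RES(mp, 1, XFS_DATA_FORK);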
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: John Garry <john.g.garry@oracle.com>
---
fs/xfs/xfs_file.c | 5 ++++-
fs/xfs/xfs_reflink.c | 49 ++++++++++++++++++++++++++++++++++++++++++++
fs/xfs/xfs_reflink.h | 3 +++
3 files changed, 56 insertions(+), 1 deletion(-)
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index ddcf95ce741e..16739c408af3 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -576,7 +576,10 @@ xfs_dio_write_end_io(
nofs_flag = memalloc_nofs_save();
if (flags & IOMAP_DIO_COW) {
- error = xfs_reflink_end_cow(ip, offset, size);
+ if (iocb->ki_flags & IOCB_ATOMIC)
+ error = xfs_reflink_end_atomic_cow(ip, offset, size);
+ else
+ error = xfs_reflink_end_cow(ip, offset, size);
if (error)
goto out;
}
diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
index 0949d6ba2b3b..ce1fd58dff35 100644
--- a/fs/xfs/xfs_reflink.c
+++ b/fs/xfs/xfs_reflink.c
@@ -987,6 +987,55 @@ xfs_reflink_end_cow(
trace_xfs_reflink_end_cow_error(ip, error, _RET_IP_);
return error;
}
+int
+xfs_reflink_end_atomic_cow(
+ struct xfs_inode *ip,
+ xfs_off_t offset,
+ xfs_off_t count)
+{
+ xfs_fileoff_t offset_fsb;
+ xfs_fileoff_t end_fsb;
+ int error = 0;
+ struct xfs_mount *mp = ip->i_mount;
+ struct xfs_trans *tp;
+ unsigned int resblks;
+
+ trace_xfs_reflink_end_cow(ip, offset, count);
+
+ offset_fsb = XFS_B_TO_FSBT(mp, offset);
+ end_fsb = XFS_B_TO_FSB(mp, offset + count);
+
+ /*
+ * Each remapping operation could cause a btree split, so in the worst
+ * case that's one for each block.
+ */
+ resblks = (end_fsb - offset_fsb) *
+ XFS_NEXTENTADD_SPACE_RES(mp, 1, XFS_DATA_FORK);
+
+ error = xfs_trans_alloc(mp, &M_RES(mp)->tr_write, resblks, 0,
+ XFS_TRANS_RESERVE, &tp);
+ if (error)
+ return error;
+
+ xfs_ilock(ip, XFS_ILOCK_EXCL);
+ xfs_trans_ijoin(tp, ip, 0);
+
+ while (end_fsb > offset_fsb && !error) {
+ error = xfs_reflink_end_cow_extent_locked(tp, ip, &offset_fsb,
+ end_fsb);
+ }
+ if (error) {
+ trace_xfs_reflink_end_cow_error(ip, error, _RET_IP_);
+ goto out_cancel;
+ }
+ error = xfs_trans_commit(tp);
+ xfs_iunlock(ip, XFS_ILOCK_EXCL);
+ return error;
+out_cancel:
+ xfs_trans_cancel(tp);
+ xfs_iunlock(ip, XFS_ILOCK_EXCL);
+ return error;
+}
/*
* Free all CoW staging blocks that are still referenced by the ondisk refcount
diff --git a/fs/xfs/xfs_reflink.h b/fs/xfs/xfs_reflink.h
index dfd94e51e2b4..4cb2ee53cd8d 100644
--- a/fs/xfs/xfs_reflink.h
+++ b/fs/xfs/xfs_reflink.h
@@ -49,6 +49,9 @@ extern int xfs_reflink_cancel_cow_range(struct xfs_inode *ip, xfs_off_t offset,
xfs_off_t count, bool cancel_real);
extern int xfs_reflink_end_cow(struct xfs_inode *ip, xfs_off_t offset,
xfs_off_t count);
+int
+xfs_reflink_end_atomic_cow(struct xfs_inode *ip, xfs_off_t offset,
+		xfs_off_t count);
extern int xfs_reflink_recover_cow(struct xfs_mount *mp);
extern loff_t xfs_reflink_remap_range(struct file *file_in, loff_t pos_in,
struct file *file_out, loff_t pos_out, loff_t len,
--
2.31.1
* [PATCH v5 08/10] xfs: Update atomic write max size
2025-03-10 18:39 [PATCH v5 00/10] large atomic writes for xfs with CoW John Garry
` (6 preceding siblings ...)
2025-03-10 18:39 ` [PATCH v5 07/10] xfs: Commit CoW-based atomic writes atomically John Garry
@ 2025-03-10 18:39 ` John Garry
2025-03-11 14:40 ` Carlos Maiolino
2025-03-12 7:41 ` Christoph Hellwig
2025-03-10 18:39 ` [PATCH v5 09/10] xfs: Allow block allocator to take an alignment hint John Garry
2025-03-10 18:39 ` [PATCH RFC v5 10/10] iomap: Rename ATOMIC flags again John Garry
9 siblings, 2 replies; 63+ messages in thread
From: John Garry @ 2025-03-10 18:39 UTC (permalink / raw)
To: brauner, djwong, cem
Cc: linux-xfs, linux-fsdevel, linux-kernel, ojaswin, ritesh.list,
martin.petersen, John Garry
Now that CoW-based atomic writes are supported, update the max size of an
atomic write.
For simplicity, limit to the max of what the mounted bdev can support in
terms of atomic write limits. Maybe in future we will have a better way
to advertise this optimised limit.
In addition, the max atomic write size needs to be aligned to the agsize.
Limit the size of atomic writes to the greatest power-of-two factor of the
agsize so that allocations for an atomic write will always be aligned
compatibly with the alignment requirements of the storage.
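As a worked example (illustrative): with 4096-byte blocks and an AG of
10240 blocks (2^11 * 5), the greatest power-of-two factor of the agsize
is 2048 blocks, so the max atomic write size would be 8 MiB.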
For RT inode, just limit to 1x block, even though larger can be supported
in future.
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: John Garry <john.g.garry@oracle.com>
---
fs/xfs/xfs_iops.c | 14 +++++++++++++-
fs/xfs/xfs_mount.c | 28 ++++++++++++++++++++++++++++
fs/xfs/xfs_mount.h | 1 +
3 files changed, 42 insertions(+), 1 deletion(-)
diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
index de065cc2e7cf..16a1f9541690 100644
--- a/fs/xfs/xfs_iops.c
+++ b/fs/xfs/xfs_iops.c
@@ -607,12 +607,24 @@ xfs_get_atomic_write_attr(
unsigned int *unit_min,
unsigned int *unit_max)
{
+ struct xfs_buftarg *target = xfs_inode_buftarg(ip);
+ struct xfs_mount *mp = ip->i_mount;
+
if (!xfs_inode_can_atomicwrite(ip)) {
*unit_min = *unit_max = 0;
return;
}
- *unit_min = *unit_max = ip->i_mount->m_sb.sb_blocksize;
+ *unit_min = ip->i_mount->m_sb.sb_blocksize;
+
+ if (XFS_IS_REALTIME_INODE(ip)) {
+ /* For now, set limit at 1x block */
+ *unit_max = ip->i_mount->m_sb.sb_blocksize;
+ } else {
+ *unit_max = min_t(unsigned int,
+ XFS_FSB_TO_B(mp, mp->m_awu_max),
+ target->bt_bdev_awu_max);
+ }
}
static void
diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c
index e65a659901d5..414adfb944b9 100644
--- a/fs/xfs/xfs_mount.c
+++ b/fs/xfs/xfs_mount.c
@@ -665,6 +665,32 @@ xfs_agbtree_compute_maxlevels(
levels = max(levels, mp->m_rmap_maxlevels);
mp->m_agbtree_maxlevels = max(levels, mp->m_refc_maxlevels);
}
+static inline void
+xfs_compute_awu_max(
+ struct xfs_mount *mp)
+{
+ xfs_agblock_t agsize = mp->m_sb.sb_agblocks;
+ xfs_agblock_t awu_max;
+
+ if (!xfs_has_reflink(mp)) {
+ mp->m_awu_max = 1;
+ return;
+ }
+
+ /*
+ * Find highest power-of-2 evenly divisible into agsize and which
+ * also fits into an unsigned int field.
+ */
+ awu_max = 1;
+ while (1) {
+ if (agsize % (awu_max * 2))
+ break;
+ if (XFS_FSB_TO_B(mp, awu_max * 2) > UINT_MAX)
+ break;
+ awu_max *= 2;
+ }
+ mp->m_awu_max = awu_max;
+}
/* Compute maximum possible height for realtime btree types for this fs. */
static inline void
@@ -751,6 +777,8 @@ xfs_mountfs(
xfs_agbtree_compute_maxlevels(mp);
xfs_rtbtree_compute_maxlevels(mp);
+ xfs_compute_awu_max(mp);
+
/*
* Check if sb_agblocks is aligned at stripe boundary. If sb_agblocks
* is NOT aligned turn off m_dalign since allocator alignment is within
diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
index 799b84220ebb..1b0136da2aec 100644
--- a/fs/xfs/xfs_mount.h
+++ b/fs/xfs/xfs_mount.h
@@ -229,6 +229,7 @@ typedef struct xfs_mount {
bool m_finobt_nores; /* no per-AG finobt resv. */
bool m_update_sb; /* sb needs update in mount */
unsigned int m_max_open_zones;
+ xfs_extlen_t m_awu_max; /* data device max atomic write */
/*
* Bitsets of per-fs metadata that have been checked and/or are sick.
--
2.31.1
* [PATCH v5 09/10] xfs: Allow block allocator to take an alignment hint
2025-03-10 18:39 [PATCH v5 00/10] large atomic writes for xfs with CoW John Garry
` (7 preceding siblings ...)
2025-03-10 18:39 ` [PATCH v5 08/10] xfs: Update atomic write max size John Garry
@ 2025-03-10 18:39 ` John Garry
2025-03-12 7:42 ` Christoph Hellwig
2025-03-10 18:39 ` [PATCH RFC v5 10/10] iomap: Rename ATOMIC flags again John Garry
9 siblings, 1 reply; 63+ messages in thread
From: John Garry @ 2025-03-10 18:39 UTC (permalink / raw)
To: brauner, djwong, cem
Cc: linux-xfs, linux-fsdevel, linux-kernel, ojaswin, ritesh.list,
martin.petersen, John Garry
When issuing an atomic write by the CoW method, give the block allocator a
hint to align to the extszhint.
This means that we have a better chance of issuing the atomic write via
HW offload next time.
It does mean that the inode extszhint should be set appropriately for the
expected atomic write size.
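For reference, the extszhint can be set from userspace on a file
without data extents, e.g. for a hypothetical 16K atomic write size:

	struct fsxattr fsx;

	ioctl(fd, FS_IOC_FSGETXATTR, &fsx);
	fsx.fsx_xflags |= FS_XFLAG_EXTSIZE;
	fsx.fsx_extsize = 16384;	/* match expected atomic write size */
	ioctl(fd, FS_IOC_FSSETXATTR, &fsx);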
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: John Garry <john.g.garry@oracle.com>
---
fs/xfs/libxfs/xfs_bmap.c | 4 ++++
fs/xfs/libxfs/xfs_bmap.h | 6 +++++-
fs/xfs/xfs_reflink.c | 8 ++++++--
3 files changed, 15 insertions(+), 3 deletions(-)
diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
index 63255820b58a..5ef5e4476209 100644
--- a/fs/xfs/libxfs/xfs_bmap.c
+++ b/fs/xfs/libxfs/xfs_bmap.c
@@ -3312,6 +3312,10 @@ xfs_bmap_compute_alignments(
align = xfs_get_cowextsz_hint(ap->ip);
else if (ap->datatype & XFS_ALLOC_USERDATA)
align = xfs_get_extsz_hint(ap->ip);
+
+ if (align > 1 && (ap->flags & XFS_BMAPI_EXTSZALIGN))
+ args->alignment = align;
+
if (align) {
if (xfs_bmap_extsize_align(mp, &ap->got, &ap->prev, align, 0,
ap->eof, 0, ap->conv, &ap->offset,
diff --git a/fs/xfs/libxfs/xfs_bmap.h b/fs/xfs/libxfs/xfs_bmap.h
index b4d9c6e0f3f9..d5f2729305fa 100644
--- a/fs/xfs/libxfs/xfs_bmap.h
+++ b/fs/xfs/libxfs/xfs_bmap.h
@@ -87,6 +87,9 @@ struct xfs_bmalloca {
/* Do not update the rmap btree. Used for reconstructing bmbt from rmapbt. */
#define XFS_BMAPI_NORMAP (1u << 10)
+/* Try to align allocations to the extent size hint */
+#define XFS_BMAPI_EXTSZALIGN (1u << 11)
+
#define XFS_BMAPI_FLAGS \
{ XFS_BMAPI_ENTIRE, "ENTIRE" }, \
{ XFS_BMAPI_METADATA, "METADATA" }, \
@@ -98,7 +101,8 @@ struct xfs_bmalloca {
{ XFS_BMAPI_REMAP, "REMAP" }, \
{ XFS_BMAPI_COWFORK, "COWFORK" }, \
{ XFS_BMAPI_NODISCARD, "NODISCARD" }, \
- { XFS_BMAPI_NORMAP, "NORMAP" }
+ { XFS_BMAPI_NORMAP, "NORMAP" },\
+ { XFS_BMAPI_EXTSZALIGN, "EXTSZALIGN" }
static inline int xfs_bmapi_aflag(int w)
diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
index ce1fd58dff35..d577db42baab 100644
--- a/fs/xfs/xfs_reflink.c
+++ b/fs/xfs/xfs_reflink.c
@@ -445,6 +445,11 @@ xfs_reflink_fill_cow_hole(
int error;
bool found;
bool atomic_sw = flags & XFS_REFLINK_ATOMIC_SW;
+ uint32_t bmapi_flags = XFS_BMAPI_COWFORK |
+ XFS_BMAPI_PREALLOC;
+
+ if (atomic_sw)
+ bmapi_flags |= XFS_BMAPI_EXTSZALIGN;
resaligned = xfs_aligned_fsb_count(imap->br_startoff,
imap->br_blockcount, xfs_get_cowextsz_hint(ip));
@@ -478,8 +483,7 @@ xfs_reflink_fill_cow_hole(
/* Allocate the entire reservation as unwritten blocks. */
nimaps = 1;
error = xfs_bmapi_write(tp, ip, imap->br_startoff, imap->br_blockcount,
- XFS_BMAPI_COWFORK | XFS_BMAPI_PREALLOC, 0, cmap,
- &nimaps);
+ bmapi_flags, 0, cmap, &nimaps);
if (error)
goto out_trans_cancel;
--
2.31.1
* [PATCH RFC v5 10/10] iomap: Rename ATOMIC flags again
2025-03-10 18:39 [PATCH v5 00/10] large atomic writes for xfs with CoW John Garry
` (8 preceding siblings ...)
2025-03-10 18:39 ` [PATCH v5 09/10] xfs: Allow block allocator to take an alignment hint John Garry
@ 2025-03-10 18:39 ` John Garry
2025-03-12 7:13 ` Christoph Hellwig
9 siblings, 1 reply; 63+ messages in thread
From: John Garry @ 2025-03-10 18:39 UTC (permalink / raw)
To: brauner, djwong, cem
Cc: linux-xfs, linux-fsdevel, linux-kernel, ojaswin, ritesh.list,
martin.petersen, John Garry
Dave Chinner thought that the names IOMAP_ATOMIC_HW and IOMAP_ATOMIC_SW
were not appropriate, specifically because IOMAP_ATOMIC_HW could
actually be realised with a SW-based method in the block or md/dm
layers.
So rename to IOMAP_ATOMIC_BIO and IOMAP_ATOMIC_FS.
Also renumber the flags so that the atomic flags are adjacent.
Signed-off-by: John Garry <john.g.garry@oracle.com>
---
Documentation/filesystems/iomap/operations.rst | 6 +++---
fs/ext4/inode.c | 2 +-
fs/iomap/direct-io.c | 18 +++++++++---------
fs/iomap/trace.h | 3 ++-
fs/xfs/xfs_file.c | 4 ++--
fs/xfs/xfs_iomap.c | 10 +++++-----
include/linux/iomap.h | 10 +++++-----
7 files changed, 27 insertions(+), 26 deletions(-)
diff --git a/Documentation/filesystems/iomap/operations.rst b/Documentation/filesystems/iomap/operations.rst
index b08a79d11d9f..f1d9aa767d30 100644
--- a/Documentation/filesystems/iomap/operations.rst
+++ b/Documentation/filesystems/iomap/operations.rst
@@ -514,7 +514,7 @@ IOMAP_WRITE`` with any combination of the following enhancements:
if the mapping is unwritten and the filesystem cannot handle zeroing
the unaligned regions without exposing stale contents.
- * ``IOMAP_ATOMIC_HW``: This write is being issued with torn-write
+ * ``IOMAP_BIO_ATOMIC``: This write is being issued with torn-write
protection based on HW-offload support.
Only a single bio can be created for the write, and the write must
not be split into multiple I/O requests, i.e. flag REQ_ATOMIC must be
@@ -530,10 +530,10 @@ IOMAP_WRITE`` with any combination of the following enhancements:
the mapping start disk block must have at least the same alignment as
the write offset.
- * ``IOMAP_ATOMIC_SW``: This write is being issued with torn-write
+ * ``IOMAP_FS_ATOMIC``: This write is being issued with torn-write
protection via a software mechanism provided by the filesystem.
All the disk block alignment and single bio restrictions which apply
- to IOMAP_ATOMIC_HW do not apply here.
+ to IOMAP_BIO_ATOMIC do not apply here.
SW-based untorn writes would typically be used as a fallback when
HW-based untorn writes may not be issued, e.g. the range of the write
covers multiple extents, meaning that it is not possible to issue
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index ba2f1e3db7c7..da385862c1b5 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3467,7 +3467,7 @@ static inline bool ext4_want_directio_fallback(unsigned flags, ssize_t written)
return false;
/* atomic writes are all-or-nothing */
- if (flags & IOMAP_ATOMIC_HW)
+ if (flags & IOMAP_BIO_ATOMIC)
return false;
/* can only try again if we wrote nothing */
diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
index 5299f70428ef..d728d894bd90 100644
--- a/fs/iomap/direct-io.c
+++ b/fs/iomap/direct-io.c
@@ -317,7 +317,7 @@ static int iomap_dio_zero(const struct iomap_iter *iter, struct iomap_dio *dio,
* clearing the WRITE_THROUGH flag in the dio request.
*/
static inline blk_opf_t iomap_dio_bio_opflags(struct iomap_dio *dio,
- const struct iomap *iomap, bool use_fua, bool atomic_hw)
+ const struct iomap *iomap, bool use_fua, bool bio_atomic)
{
blk_opf_t opflags = REQ_SYNC | REQ_IDLE;
@@ -329,7 +329,7 @@ static inline blk_opf_t iomap_dio_bio_opflags(struct iomap_dio *dio,
opflags |= REQ_FUA;
else
dio->flags &= ~IOMAP_DIO_WRITE_THROUGH;
- if (atomic_hw)
+ if (bio_atomic)
opflags |= REQ_ATOMIC;
return opflags;
@@ -340,7 +340,7 @@ static int iomap_dio_bio_iter(struct iomap_iter *iter, struct iomap_dio *dio)
const struct iomap *iomap = &iter->iomap;
struct inode *inode = iter->inode;
unsigned int fs_block_size = i_blocksize(inode), pad;
- bool atomic_hw = iter->flags & IOMAP_ATOMIC_HW;
+ bool bio_atomic = iter->flags & IOMAP_BIO_ATOMIC;
const loff_t length = iomap_length(iter);
loff_t pos = iter->pos;
blk_opf_t bio_opf;
@@ -351,7 +351,7 @@ static int iomap_dio_bio_iter(struct iomap_iter *iter, struct iomap_dio *dio)
u64 copied = 0;
size_t orig_count;
- if (atomic_hw && length != iter->len)
+ if (bio_atomic && length != iter->len)
return -EINVAL;
if ((pos | length) & (bdev_logical_block_size(iomap->bdev) - 1) ||
@@ -428,7 +428,7 @@ static int iomap_dio_bio_iter(struct iomap_iter *iter, struct iomap_dio *dio)
goto out;
}
- bio_opf = iomap_dio_bio_opflags(dio, iomap, use_fua, atomic_hw);
+ bio_opf = iomap_dio_bio_opflags(dio, iomap, use_fua, bio_atomic);
nr_pages = bio_iov_vecs_to_alloc(dio->submit.iter, BIO_MAX_VECS);
do {
@@ -461,7 +461,7 @@ static int iomap_dio_bio_iter(struct iomap_iter *iter, struct iomap_dio *dio)
}
n = bio->bi_iter.bi_size;
- if (WARN_ON_ONCE(atomic_hw && n != length)) {
+ if (WARN_ON_ONCE(bio_atomic && n != length)) {
/*
* This bio should have covered the complete length,
* which it doesn't, so error. We may need to zero out
@@ -686,10 +686,10 @@ __iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
iomi.flags |= IOMAP_OVERWRITE_ONLY;
}
- if (dio_flags & IOMAP_DIO_ATOMIC_SW)
- iomi.flags |= IOMAP_ATOMIC_SW;
+ if (dio_flags & IOMAP_DIO_FS_ATOMIC)
+ iomi.flags |= IOMAP_FS_ATOMIC;
else if (iocb->ki_flags & IOCB_ATOMIC)
- iomi.flags |= IOMAP_ATOMIC_HW;
+ iomi.flags |= IOMAP_BIO_ATOMIC;
/* for data sync or sync, we need sync completion processing */
if (iocb_is_dsync(iocb)) {
diff --git a/fs/iomap/trace.h b/fs/iomap/trace.h
index 69af89044ebd..4b71f1711b69 100644
--- a/fs/iomap/trace.h
+++ b/fs/iomap/trace.h
@@ -99,7 +99,8 @@ DEFINE_RANGE_EVENT(iomap_dio_rw_queued);
{ IOMAP_FAULT, "FAULT" }, \
{ IOMAP_DIRECT, "DIRECT" }, \
{ IOMAP_NOWAIT, "NOWAIT" }, \
- { IOMAP_ATOMIC_HW, "ATOMIC_HW" }
+ { IOMAP_BIO_ATOMIC, "ATOMIC_BIO" }, \
+ { IOMAP_FS_ATOMIC, "ATOMIC_FS" }
#define IOMAP_F_FLAGS_STRINGS \
{ IOMAP_F_NEW, "NEW" }, \
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 16739c408af3..4ad80179173a 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -755,9 +755,9 @@ xfs_file_dio_write_atomic(
&xfs_dio_write_ops, dio_flags, NULL, 0);
if (ret == -EAGAIN && !(iocb->ki_flags & IOCB_NOWAIT) &&
- !(dio_flags & IOMAP_DIO_ATOMIC_SW)) {
+ !(dio_flags & IOMAP_DIO_FS_ATOMIC)) {
xfs_iunlock(ip, iolock);
- dio_flags = IOMAP_DIO_ATOMIC_SW | IOMAP_DIO_FORCE_WAIT;
+ dio_flags = IOMAP_DIO_FS_ATOMIC | IOMAP_DIO_FORCE_WAIT;
iolock = XFS_IOLOCK_EXCL;
goto retry;
}
diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
index 6c963786530d..71ddd5979091 100644
--- a/fs/xfs/xfs_iomap.c
+++ b/fs/xfs/xfs_iomap.c
@@ -830,7 +830,7 @@ xfs_direct_write_iomap_begin(
xfs_fileoff_t offset_fsb = XFS_B_TO_FSBT(mp, offset);
xfs_fileoff_t end_fsb = xfs_iomap_end_fsb(mp, offset, length);
xfs_fileoff_t orig_end_fsb = end_fsb;
- bool atomic_hw = flags & IOMAP_ATOMIC_HW;
+ bool atomic_bio = flags & IOMAP_BIO_ATOMIC;
int nimaps = 1, error = 0;
unsigned int reflink_flags = 0;
bool shared = false;
@@ -895,7 +895,7 @@ xfs_direct_write_iomap_begin(
if (error)
goto out_unlock;
if (shared) {
- if (atomic_hw &&
+ if (atomic_bio &&
!xfs_bmap_valid_for_atomic_write(&cmap,
offset_fsb, end_fsb)) {
error = -EAGAIN;
@@ -909,7 +909,7 @@ xfs_direct_write_iomap_begin(
needs_alloc = imap_needs_alloc(inode, flags, &imap, nimaps);
- if (atomic_hw) {
+ if (atomic_bio) {
error = -EAGAIN;
/*
* Use CoW method for when we need to alloc > 1 block,
@@ -1141,11 +1141,11 @@ xfs_atomic_write_iomap_begin(
ASSERT(flags & IOMAP_WRITE);
ASSERT(flags & IOMAP_DIRECT);
- if (flags & IOMAP_ATOMIC_SW)
+ if (flags & IOMAP_FS_ATOMIC)
return xfs_atomic_write_sw_iomap_begin(inode, offset, length,
flags, iomap, srcmap);
- ASSERT(flags & IOMAP_ATOMIC_HW);
+ ASSERT(flags & IOMAP_BIO_ATOMIC);
return xfs_direct_write_iomap_begin(inode, offset, length, flags,
iomap, srcmap);
}
diff --git a/include/linux/iomap.h b/include/linux/iomap.h
index 9cd93530013c..5e44ca17a64a 100644
--- a/include/linux/iomap.h
+++ b/include/linux/iomap.h
@@ -189,9 +189,9 @@ struct iomap_folio_ops {
#else
#define IOMAP_DAX 0
#endif /* CONFIG_FS_DAX */
-#define IOMAP_ATOMIC_HW (1 << 9) /* HW-based torn-write protection */
-#define IOMAP_DONTCACHE (1 << 10)
-#define IOMAP_ATOMIC_SW (1 << 11)/* SW-based torn-write protection */
+#define IOMAP_DONTCACHE (1 << 9)
+#define IOMAP_BIO_ATOMIC (1 << 10) /* Use REQ_ATOMIC on single bio */
+#define IOMAP_FS_ATOMIC (1 << 11) /* FS-based torn-write protection */
struct iomap_ops {
/*
@@ -504,9 +504,9 @@ struct iomap_dio_ops {
#define IOMAP_DIO_PARTIAL (1 << 2)
/*
- * Use software-based torn-write protection.
+ * Use FS-based torn-write protection.
*/
-#define IOMAP_DIO_ATOMIC_SW (1 << 3)
+#define IOMAP_DIO_FS_ATOMIC (1 << 3)
ssize_t iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
const struct iomap_ops *ops, const struct iomap_dio_ops *dops,
--
2.31.1
* Re: [PATCH v5 08/10] xfs: Update atomic write max size
2025-03-10 18:39 ` [PATCH v5 08/10] xfs: Update atomic write max size John Garry
@ 2025-03-11 14:40 ` Carlos Maiolino
2025-03-12 7:41 ` Christoph Hellwig
1 sibling, 0 replies; 63+ messages in thread
From: Carlos Maiolino @ 2025-03-11 14:40 UTC (permalink / raw)
To: John Garry
Cc: brauner, djwong, linux-xfs, linux-fsdevel, linux-kernel, ojaswin,
ritesh.list, martin.petersen
Thanks for updating it, John.
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
On Mon, Mar 10, 2025 at 06:39:44PM +0000, John Garry wrote:
> Now that CoW-based atomic writes are supported, update the max size of an
> atomic write.
>
> For simplicity, limit to the max of what the mounted bdev can support in
> terms of atomic write limits. Maybe in future we will have a better way
> to advertise this optimised limit.
>
> In addition, the max atomic write size needs to be aligned to the agsize.
> Limit the size of atomic writes to the greatest power-of-two factor of the
> agsize so that allocations for an atomic write will always be aligned
> compatibly with the alignment requirements of the storage.
>
> For RT inode, just limit to 1x block, even though larger can be supported
> in future.
>
> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
> Signed-off-by: John Garry <john.g.garry@oracle.com>
> ---
> fs/xfs/xfs_iops.c | 14 +++++++++++++-
> fs/xfs/xfs_mount.c | 28 ++++++++++++++++++++++++++++
> fs/xfs/xfs_mount.h | 1 +
> 3 files changed, 42 insertions(+), 1 deletion(-)
>
> diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
> index de065cc2e7cf..16a1f9541690 100644
> --- a/fs/xfs/xfs_iops.c
> +++ b/fs/xfs/xfs_iops.c
> @@ -607,12 +607,24 @@ xfs_get_atomic_write_attr(
> unsigned int *unit_min,
> unsigned int *unit_max)
> {
> + struct xfs_buftarg *target = xfs_inode_buftarg(ip);
> + struct xfs_mount *mp = ip->i_mount;
> +
> if (!xfs_inode_can_atomicwrite(ip)) {
> *unit_min = *unit_max = 0;
> return;
> }
>
> - *unit_min = *unit_max = ip->i_mount->m_sb.sb_blocksize;
> + *unit_min = ip->i_mount->m_sb.sb_blocksize;
> +
> + if (XFS_IS_REALTIME_INODE(ip)) {
> + /* For now, set limit at 1x block */
> + *unit_max = ip->i_mount->m_sb.sb_blocksize;
> + } else {
> + *unit_max = min_t(unsigned int,
> + XFS_FSB_TO_B(mp, mp->m_awu_max),
> + target->bt_bdev_awu_max);
> + }
> }
>
> static void
> diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c
> index e65a659901d5..414adfb944b9 100644
> --- a/fs/xfs/xfs_mount.c
> +++ b/fs/xfs/xfs_mount.c
> @@ -665,6 +665,32 @@ xfs_agbtree_compute_maxlevels(
> levels = max(levels, mp->m_rmap_maxlevels);
> mp->m_agbtree_maxlevels = max(levels, mp->m_refc_maxlevels);
> }
> +static inline void
> +xfs_compute_awu_max(
> + struct xfs_mount *mp)
> +{
> + xfs_agblock_t agsize = mp->m_sb.sb_agblocks;
> + xfs_agblock_t awu_max;
> +
> + if (!xfs_has_reflink(mp)) {
> + mp->m_awu_max = 1;
> + return;
> + }
> +
> + /*
> + * Find highest power-of-2 evenly divisible into agsize and which
> + * also fits into an unsigned int field.
> + */
> + awu_max = 1;
> + while (1) {
> + if (agsize % (awu_max * 2))
> + break;
> + if (XFS_FSB_TO_B(mp, awu_max * 2) > UINT_MAX)
> + break;
> + awu_max *= 2;
> + }
> + mp->m_awu_max = awu_max;
> +}
>
> /* Compute maximum possible height for realtime btree types for this fs. */
> static inline void
> @@ -751,6 +777,8 @@ xfs_mountfs(
> xfs_agbtree_compute_maxlevels(mp);
> xfs_rtbtree_compute_maxlevels(mp);
>
> + xfs_compute_awu_max(mp);
> +
> /*
> * Check if sb_agblocks is aligned at stripe boundary. If sb_agblocks
> * is NOT aligned turn off m_dalign since allocator alignment is within
> diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
> index 799b84220ebb..1b0136da2aec 100644
> --- a/fs/xfs/xfs_mount.h
> +++ b/fs/xfs/xfs_mount.h
> @@ -229,6 +229,7 @@ typedef struct xfs_mount {
> bool m_finobt_nores; /* no per-AG finobt resv. */
> bool m_update_sb; /* sb needs update in mount */
> unsigned int m_max_open_zones;
> + xfs_extlen_t m_awu_max; /* data device max atomic write */
>
> /*
> * Bitsets of per-fs metadata that have been checked and/or are sick.
> --
> 2.31.1
>
* Re: [PATCH RFC v5 10/10] iomap: Rename ATOMIC flags again
2025-03-10 18:39 ` [PATCH RFC v5 10/10] iomap: Rename ATOMIC flags again John Garry
@ 2025-03-12 7:13 ` Christoph Hellwig
2025-03-12 23:59 ` Dave Chinner
0 siblings, 1 reply; 63+ messages in thread
From: Christoph Hellwig @ 2025-03-12 7:13 UTC (permalink / raw)
To: John Garry
Cc: brauner, djwong, cem, linux-xfs, linux-fsdevel, linux-kernel,
ojaswin, ritesh.list, martin.petersen
On Mon, Mar 10, 2025 at 06:39:46PM +0000, John Garry wrote:
> Dave Chinner thought that the names IOMAP_ATOMIC_HW and IOMAP_ATOMIC_SW
> were not appropriate, specifically because IOMAP_ATOMIC_HW could
> actually be realised with a SW-based method in the block or md/dm
> layers.
>
> So rename to IOMAP_ATOMIC_BIO and IOMAP_ATOMIC_FS.
Looking over the entire series and the already merged iomap code:
there should be no reason at all for having IOMAP_ATOMIC_FS.
The only thing it does is to branch out to
xfs_atomic_write_sw_iomap_begin from xfs_atomic_write_iomap_begin.
You can do that in a much simpler and nicer way by just having
different iomap_ops for the bio based vs file system based atomics.
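For example (a sketch only; the ops names here are made up):

	const struct iomap_ops xfs_atomic_write_hw_iomap_ops = {
		.iomap_begin	= xfs_direct_write_iomap_begin,
	};

	const struct iomap_ops xfs_atomic_write_sw_iomap_ops = {
		.iomap_begin	= xfs_atomic_write_sw_iomap_begin,
	};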
I agree with Dave that bio is a better term for the bio-based
atomic, but please use the IOMAP_ATOMIC_BIO name above instead
of the IOMAP_BIO_ATOMIC actually used in the code if you change
it.
> */
> static inline blk_opf_t iomap_dio_bio_opflags(struct iomap_dio *dio,
> - const struct iomap *iomap, bool use_fua, bool atomic_hw)
> + const struct iomap *iomap, bool use_fua, bool bio_atomic)
Not new here, but these two bools are pretty ugly.
I'd rather have a
blk_opf_t extra_flags;
in the caller that gets REQ_FUA and REQ_ATOMIC assigned as needed,
and then just clear IOMAP_DIO_WRITE_THROUGH when FUA is not used.
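Something along these lines (a sketch of the suggestion, assuming the
flag selection moves into iomap_dio_bio_iter() and that
iomap_dio_bio_opflags() loses its two bool parameters):

	blk_opf_t extra_flags = 0;

	if (use_fua)
		extra_flags |= REQ_FUA;
	else
		dio->flags &= ~IOMAP_DIO_WRITE_THROUGH;
	if (bio_atomic)
		extra_flags |= REQ_ATOMIC;

	bio_opf = iomap_dio_bio_opflags(dio, iomap) | extra_flags;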
>
> - if (atomic_hw && length != iter->len)
> + if (bio_atomic && length != iter->len)
> return -EINVAL;
This could really use a comment explaining what the check is for.
> - if (WARN_ON_ONCE(atomic_hw && n != length)) {
> + if (WARN_ON_ONCE(bio_atomic && n != length)) {
Same.
> -#define IOMAP_ATOMIC_HW (1 << 9) /* HW-based torn-write protection */
> -#define IOMAP_DONTCACHE (1 << 10)
> -#define IOMAP_ATOMIC_SW (1 << 11)/* SW-based torn-write protection */
> +#define IOMAP_DONTCACHE (1 << 9)
> +#define IOMAP_BIO_ATOMIC (1 << 10) /* Use REQ_ATOMIC on single bio */
> +#define IOMAP_FS_ATOMIC (1 << 11) /* FS-based torn-write protection */
Please also fix the overly long lines here by moving the comments
above the definitions. That should also give you enough space to
expand the comment into a full sentence describing the flag fully.
* Re: [PATCH v5 01/10] xfs: Pass flags to xfs_reflink_allocate_cow()
2025-03-10 18:39 ` [PATCH v5 01/10] xfs: Pass flags to xfs_reflink_allocate_cow() John Garry
@ 2025-03-12 7:15 ` Christoph Hellwig
2025-03-12 8:19 ` John Garry
0 siblings, 1 reply; 63+ messages in thread
From: Christoph Hellwig @ 2025-03-12 7:15 UTC (permalink / raw)
To: John Garry
Cc: brauner, djwong, cem, linux-xfs, linux-fsdevel, linux-kernel,
ojaswin, ritesh.list, martin.petersen
All the patches (but not the cover letter) capitalize the first
word after the prefix. While some subsystems do that, xfs does not.
Please be consistent with the other commit messages.
> +/*
> + * Flags for xfs_reflink_allocate_cow()
> + */
> +#define XFS_REFLINK_CONVERT (1u << 0) /* convert unwritten extents now */
Please add a _UNWRITTEN suffix to avoid any confusion with delalloc
conversions.
* Re: [PATCH v5 02/10] xfs: Switch atomic write size check in xfs_file_write_iter()
2025-03-10 18:39 ` [PATCH v5 02/10] xfs: Switch atomic write size check in xfs_file_write_iter() John Garry
@ 2025-03-12 7:17 ` Christoph Hellwig
2025-03-12 8:21 ` John Garry
0 siblings, 1 reply; 63+ messages in thread
From: Christoph Hellwig @ 2025-03-12 7:17 UTC (permalink / raw)
To: John Garry
Cc: brauner, djwong, cem, linux-xfs, linux-fsdevel, linux-kernel,
ojaswin, ritesh.list, martin.petersen
> +void
> +xfs_get_atomic_write_attr(
> + struct xfs_inode *ip,
> + unsigned int *unit_min,
> + unsigned int *unit_max)
> +{
> + if (!xfs_inode_can_atomicwrite(ip)) {
> + *unit_min = *unit_max = 0;
> + return;
> + }
> +
> + *unit_min = *unit_max = ip->i_mount->m_sb.sb_blocksize;
This is a rather ugly interface. Why not have separate helpers for
the min and max values that can just return the actual values?
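E.g. something like this sketch (names and shape illustrative only):

	unsigned int
	xfs_get_atomic_write_min(
		struct xfs_inode	*ip)
	{
		if (!xfs_inode_can_atomicwrite(ip))
			return 0;
		return ip->i_mount->m_sb.sb_blocksize;
	}

	unsigned int
	xfs_get_atomic_write_max(
		struct xfs_inode	*ip)
	{
		if (!xfs_inode_can_atomicwrite(ip))
			return 0;
		return ip->i_mount->m_sb.sb_blocksize;
	}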
^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [PATCH v5 03/10] xfs: Refactor xfs_reflink_end_cow_extent()
2025-03-10 18:39 ` [PATCH v5 03/10] xfs: Refactor xfs_reflink_end_cow_extent() John Garry
@ 2025-03-12 7:24 ` Christoph Hellwig
2025-03-12 8:27 ` John Garry
0 siblings, 1 reply; 63+ messages in thread
From: Christoph Hellwig @ 2025-03-12 7:24 UTC (permalink / raw)
To: John Garry
Cc: brauner, djwong, cem, linux-xfs, linux-fsdevel, linux-kernel,
ojaswin, ritesh.list, martin.petersen
On Mon, Mar 10, 2025 at 06:39:39PM +0000, John Garry wrote:
> Refactor xfs_reflink_end_cow_extent() into separate parts which process
> the CoW range and commit the transaction.
>
> This refactoring will be used in future for when it is required to commit
> a range of extents as a single transaction, similar to how it was done
> pre-commit d6f215f359637.
Darrick pointed out that if you do more than just a tiny number
of extents per transaction you run out of log reservations very
quickly here:
https://lore.kernel.org/all/20240329162936.GI6390@frogsfrogsfrogs/
how does your scheme deal with that?
^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [PATCH v5 04/10] xfs: Reflink CoW-based atomic write support
2025-03-10 18:39 ` [PATCH v5 04/10] xfs: Reflink CoW-based atomic write support John Garry
@ 2025-03-12 7:27 ` Christoph Hellwig
2025-03-12 9:13 ` John Garry
0 siblings, 1 reply; 63+ messages in thread
From: Christoph Hellwig @ 2025-03-12 7:27 UTC (permalink / raw)
To: John Garry
Cc: brauner, djwong, cem, linux-xfs, linux-fsdevel, linux-kernel,
ojaswin, ritesh.list, martin.petersen
On Mon, Mar 10, 2025 at 06:39:40PM +0000, John Garry wrote:
> Base SW-based atomic writes on CoW.
>
> For SW-based atomic write support, always allocate a cow hole in
> xfs_reflink_allocate_cow() to write the new data.
What is a "COW hole"?
> The semantics are that if @atomic_sw is set, we will be passed a CoW fork
> extent mapping when no error is returned.
This commit log feels extremely sparse for a brand new feature with
data integrity impact. Can you expand on it a little?
> + bool atomic_sw = flags & XFS_REFLINK_ATOMIC_SW;
atomic_sw is not a very descriptive variable name.
>
> resaligned = xfs_aligned_fsb_count(imap->br_startoff,
> imap->br_blockcount, xfs_get_cowextsz_hint(ip));
> @@ -466,7 +467,7 @@ xfs_reflink_fill_cow_hole(
> *lockmode = XFS_ILOCK_EXCL;
>
> error = xfs_find_trim_cow_extent(ip, imap, cmap, shared, &found);
> - if (error || !*shared)
> + if (error || (!*shared && !atomic_sw))
And it's only used once. Basically it is used to force COW, right?
Maybe use that fact, as it describes the semantics at this level
instead of the very high-level intent?
> @@ -10,6 +10,7 @@
> * Flags for xfs_reflink_allocate_cow()
> */
> #define XFS_REFLINK_CONVERT (1u << 0) /* convert unwritten extents now */
> +#define XFS_REFLINK_ATOMIC_SW (1u << 1) /* alloc for SW-based atomic write */
Please expand on what this actually means at the xfs_reflink_allocate_cow()
level. Or if it is just a force flag, as I suspect, spell that out. And
move the comment up to avoid the overly long line, as well as giving
you space to actually spell the semantics out.
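E.g. (a sketch; the comment wording is only a suggestion):

	/*
	 * Allocate the CoW fork mapping for a SW-based atomic write.
	 * This forces a CoW mapping to be allocated even when the
	 * existing blocks are not shared, so that the new data can be
	 * committed by an atomic remap.
	 */
	#define XFS_REFLINK_ATOMIC_SW	(1u << 1)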
^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [PATCH v5 05/10] xfs: Iomap SW-based atomic write support
2025-03-10 18:39 ` [PATCH v5 05/10] xfs: Iomap SW-based " John Garry
@ 2025-03-12 7:37 ` Christoph Hellwig
2025-03-12 9:00 ` John Garry
0 siblings, 1 reply; 63+ messages in thread
From: Christoph Hellwig @ 2025-03-12 7:37 UTC (permalink / raw)
To: John Garry
Cc: brauner, djwong, cem, linux-xfs, linux-fsdevel, linux-kernel,
ojaswin, ritesh.list, martin.petersen
On Mon, Mar 10, 2025 at 06:39:41PM +0000, John Garry wrote:
> In cases where an atomic write occurs for misaligned or discontiguous disk
> blocks, we will use a CoW-based method to issue the atomic write.
>
> So, for that case, return -EAGAIN to request that the write be issued in
> CoW atomic write mode. The dio write path should detect this, similar to
> how misaligned regular DIO writes are handled.
How is -EAGAIN going to work here given that it is also used to defer
non-blocking requests to the caller's blocking context?
What is the problem with only setting the flag that causes REQ_ATOMIC
to be set from the file system instead of forcing it when calling
iomap_dio_rw?
Also, how do you ensure this -EAGAIN only happens on the first extent
mapped and doesn't cause double writes?
> + bool atomic_hw = flags & IOMAP_ATOMIC_HW;
Again, atomic_hw is not a very useful variable name. But the
whole idea of using a non-descriptive bool variable for a flags
field feels like an antipattern to me.
> -	if (shared)
> +	if (shared) {
> +		if (atomic_hw &&
> +		    !xfs_bmap_valid_for_atomic_write(&cmap,
> +				offset_fsb, end_fsb)) {
> +			error = -EAGAIN;
> +			goto out_unlock;
> +		}
> 		goto out_found_cow
This needs a big fat comment explaining why bailing out here is
fine and how it works.
> + /*
> + * Use CoW method for when we need to alloc > 1 block,
> + * otherwise we might allocate less than what we need here and
> + * have multiple mappings.
> + */
Describe why this is done, not just what is done.
^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [PATCH v5 07/10] xfs: Commit CoW-based atomic writes atomically
2025-03-10 18:39 ` [PATCH v5 07/10] xfs: Commit CoW-based atomic writes atomically John Garry
@ 2025-03-12 7:39 ` Christoph Hellwig
2025-03-12 9:04 ` John Garry
0 siblings, 1 reply; 63+ messages in thread
From: Christoph Hellwig @ 2025-03-12 7:39 UTC (permalink / raw)
To: John Garry
Cc: brauner, djwong, cem, linux-xfs, linux-fsdevel, linux-kernel,
ojaswin, ritesh.list, martin.petersen
On Mon, Mar 10, 2025 at 06:39:43PM +0000, John Garry wrote:
> When completing a CoW-based write, each extent range mapping update is
> covered by a separate transaction.
>
> For a CoW-based atomic write, all mappings must be changed at once, so
> change to use a single transaction.
As already mentioned in a previous reply: "all" might be too much.
The code can only support a (relatively low) number of extents
in a single transaction safely.
> +int
> +xfs_reflink_end_atomic_cow(
> + struct xfs_inode *ip,
> + xfs_off_t offset,
> + xfs_off_t count)
Assuming we could actually do the multi-extent per transaction
commit safely, what would be the reason to not always do it?
^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [PATCH v5 08/10] xfs: Update atomic write max size
2025-03-10 18:39 ` [PATCH v5 08/10] xfs: Update atomic write max size John Garry
2025-03-11 14:40 ` Carlos Maiolino
@ 2025-03-12 7:41 ` Christoph Hellwig
2025-03-12 8:09 ` John Garry
1 sibling, 1 reply; 63+ messages in thread
From: Christoph Hellwig @ 2025-03-12 7:41 UTC (permalink / raw)
To: John Garry
Cc: brauner, djwong, cem, linux-xfs, linux-fsdevel, linux-kernel,
ojaswin, ritesh.list, martin.petersen
On Mon, Mar 10, 2025 at 06:39:44PM +0000, John Garry wrote:
> For RT inode, just limit to 1x block, even though larger can be supported
> in future.
Why?
> + if (XFS_IS_REALTIME_INODE(ip)) {
> + /* For now, set limit at 1x block */
Why? It's clearly obvious that you do that from the code, but comments
are supposed to explain why something non-obvious is done.
> + *unit_max = ip->i_mount->m_sb.sb_blocksize;
> + } else {
> + *unit_max = min_t(unsigned int,
double whitespace before the min.
> +++ b/fs/xfs/xfs_mount.c
> @@ -665,6 +665,32 @@ xfs_agbtree_compute_maxlevels(
> levels = max(levels, mp->m_rmap_maxlevels);
> mp->m_agbtree_maxlevels = max(levels, mp->m_refc_maxlevels);
> }
> +static inline void
Missing empty line after the previous function.
^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [PATCH v5 09/10] xfs: Allow block allocator to take an alignment hint
2025-03-10 18:39 ` [PATCH v5 09/10] xfs: Allow block allocator to take an alignment hint John Garry
@ 2025-03-12 7:42 ` Christoph Hellwig
2025-03-12 8:05 ` John Garry
0 siblings, 1 reply; 63+ messages in thread
From: Christoph Hellwig @ 2025-03-12 7:42 UTC (permalink / raw)
To: John Garry
Cc: brauner, djwong, cem, linux-xfs, linux-fsdevel, linux-kernel,
ojaswin, ritesh.list, martin.petersen
> else if (ap->datatype & XFS_ALLOC_USERDATA)
> align = xfs_get_extsz_hint(ap->ip);
> +
> + if (align > 1 && (ap->flags & XFS_BMAPI_EXTSZALIGN))
> + args->alignment = align;
> +
Add a comment please.
> +/* Try to align allocations to the extent size hint */
> +#define XFS_BMAPI_EXTSZALIGN (1u << 11)
Shouldn't we be doing this by default for any extent size hint
based allocations?
> bool found;
> bool atomic_sw = flags & XFS_REFLINK_ATOMIC_SW;
> + uint32_t bmapi_flags = XFS_BMAPI_COWFORK |
> + XFS_BMAPI_PREALLOC;
> +
> + if (atomic_sw)
> + bmapi_flags |= XFS_BMAPI_EXTSZALIGN;
Please add a comment why you are doing this.
^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [PATCH v5 09/10] xfs: Allow block allocator to take an alignment hint
2025-03-12 7:42 ` Christoph Hellwig
@ 2025-03-12 8:05 ` John Garry
2025-03-12 13:45 ` Christoph Hellwig
0 siblings, 1 reply; 63+ messages in thread
From: John Garry @ 2025-03-12 8:05 UTC (permalink / raw)
To: Christoph Hellwig
Cc: brauner, djwong, cem, linux-xfs, linux-fsdevel, linux-kernel,
ojaswin, ritesh.list, martin.petersen
On 12/03/2025 07:42, Christoph Hellwig wrote:
>> else if (ap->datatype & XFS_ALLOC_USERDATA)
>> align = xfs_get_extsz_hint(ap->ip);
>> +
>> + if (align > 1 && (ap->flags & XFS_BMAPI_EXTSZALIGN))
>> + args->alignment = align;
>> +
>
> Add a comment please.
ok
>
>> +/* Try to align allocations to the extent size hint */
>> +#define XFS_BMAPI_EXTSZALIGN (1u << 11)
>
> Shouldn't we be doing this by default for any extent size hint
> based allocations?
I'm not sure.
I think that currently users just expect extszhint to hint at the
granularity only.
Maybe users don't require alignment and adding an alignment requirement
just leads to more fragmentation.
>
>> bool found;
>> bool atomic_sw = flags & XFS_REFLINK_ATOMIC_SW;
>> + uint32_t bmapi_flags = XFS_BMAPI_COWFORK |
>> + XFS_BMAPI_PREALLOC;
>> +
>> + if (atomic_sw)
>> + bmapi_flags |= XFS_BMAPI_EXTSZALIGN;
>
> Please add a comment why you are doing this.
>
Sure
Thanks,
John
^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [PATCH v5 08/10] xfs: Update atomic write max size
2025-03-12 7:41 ` Christoph Hellwig
@ 2025-03-12 8:09 ` John Garry
2025-03-12 8:13 ` Christoph Hellwig
0 siblings, 1 reply; 63+ messages in thread
From: John Garry @ 2025-03-12 8:09 UTC (permalink / raw)
To: Christoph Hellwig
Cc: brauner, djwong, cem, linux-xfs, linux-fsdevel, linux-kernel,
ojaswin, ritesh.list, martin.petersen
On 12/03/2025 07:41, Christoph Hellwig wrote:
> On Mon, Mar 10, 2025 at 06:39:44PM +0000, John Garry wrote:
>> For RT inode, just limit to 1x block, even though larger can be supported
>> in future.
>
> Why?
Adding RT support adds even more complexity upfront, and RT is
uncommonly used.
In addition, it will be limited to using power-of-2 rtextsize, so a
slightly restricted feature.
>
>> + if (XFS_IS_REALTIME_INODE(ip)) {
>> + /* For now, set limit at 1x block */
>
> Why? It's clearly obvious that you do that from the code, but comments
> are supposed to explain why something non-obvious is done.
ok
>
>> + *unit_max = ip->i_mount->m_sb.sb_blocksize;
>> + } else {
>> + *unit_max = min_t(unsigned int,
>
> double whitespace before the min.
will fix
>
>> +++ b/fs/xfs/xfs_mount.c
>> @@ -665,6 +665,32 @@ xfs_agbtree_compute_maxlevels(
>> levels = max(levels, mp->m_rmap_maxlevels);
>> mp->m_agbtree_maxlevels = max(levels, mp->m_refc_maxlevels);
>> }
>> +static inline void
>
> Missing empty line after the previous function.
Will fix.
Thanks,
John
^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [PATCH v5 08/10] xfs: Update atomic write max size
2025-03-12 8:09 ` John Garry
@ 2025-03-12 8:13 ` Christoph Hellwig
2025-03-12 8:14 ` John Garry
0 siblings, 1 reply; 63+ messages in thread
From: Christoph Hellwig @ 2025-03-12 8:13 UTC (permalink / raw)
To: John Garry
Cc: Christoph Hellwig, brauner, djwong, cem, linux-xfs, linux-fsdevel,
linux-kernel, ojaswin, ritesh.list, martin.petersen
On Wed, Mar 12, 2025 at 08:09:47AM +0000, John Garry wrote:
> On 12/03/2025 07:41, Christoph Hellwig wrote:
> > On Mon, Mar 10, 2025 at 06:39:44PM +0000, John Garry wrote:
> > > For RT inode, just limit to 1x block, even though larger can be supported
> > > in future.
> >
> > Why?
>
> Adding RT support adds even more complexity upfront, and RT is uncommonly
> used.
>
> In addition, it will be limited to using power-of-2 rtextsize, so a slightly
> restricted feature.
Please spell that out in the commit message.
^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [PATCH v5 08/10] xfs: Update atomic write max size
2025-03-12 8:13 ` Christoph Hellwig
@ 2025-03-12 8:14 ` John Garry
0 siblings, 0 replies; 63+ messages in thread
From: John Garry @ 2025-03-12 8:14 UTC (permalink / raw)
To: Christoph Hellwig
Cc: brauner, djwong, cem, linux-xfs, linux-fsdevel, linux-kernel,
ojaswin, ritesh.list, martin.petersen
On 12/03/2025 08:13, Christoph Hellwig wrote:
> On Wed, Mar 12, 2025 at 08:09:47AM +0000, John Garry wrote:
>> On 12/03/2025 07:41, Christoph Hellwig wrote:
>>> On Mon, Mar 10, 2025 at 06:39:44PM +0000, John Garry wrote:
>>>> For RT inode, just limit to 1x block, even though larger can be supported
>>>> in future.
>>> Why?
>> Adding RT support adds even more complexity upfront, and RT is uncommonly
>> used.
>>
>> In addition, it will be limited to using power-of-2 rtextsize, so a slightly
>> restricted feature.
> Please spell that out in the commit message.
ok, fine.
^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [PATCH v5 01/10] xfs: Pass flags to xfs_reflink_allocate_cow()
2025-03-12 7:15 ` Christoph Hellwig
@ 2025-03-12 8:19 ` John Garry
0 siblings, 0 replies; 63+ messages in thread
From: John Garry @ 2025-03-12 8:19 UTC (permalink / raw)
To: Christoph Hellwig
Cc: brauner, djwong, cem, linux-xfs, linux-fsdevel, linux-kernel,
ojaswin, ritesh.list, martin.petersen
On 12/03/2025 07:15, Christoph Hellwig wrote:
> All the patches (but not the cover letter) capitalize the first
> word after the prefix. While some subsystems do that, xfs does not.
> Please be consistent with the other commit messages.
>
>> +/*
>> + * Flags for xfs_reflink_allocate_cow()
>> + */
>> +#define XFS_REFLINK_CONVERT (1u << 0) /* convert unwritten extents now */
>
> Please add a _UNWRITTEN prefix to avoid any confusion with delalloc
> conversions.
>
ok, fine
Thanks,
John
^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [PATCH v5 02/10] xfs: Switch atomic write size check in xfs_file_write_iter()
2025-03-12 7:17 ` Christoph Hellwig
@ 2025-03-12 8:21 ` John Garry
0 siblings, 0 replies; 63+ messages in thread
From: John Garry @ 2025-03-12 8:21 UTC (permalink / raw)
To: Christoph Hellwig
Cc: brauner, djwong, cem, linux-xfs, linux-fsdevel, linux-kernel,
ojaswin, ritesh.list, martin.petersen
On 12/03/2025 07:17, Christoph Hellwig wrote:
>> +void
>> +xfs_get_atomic_write_attr(
>> +	struct xfs_inode	*ip,
>> +	unsigned int		*unit_min,
>> +	unsigned int		*unit_max)
>> +{
>> +	if (!xfs_inode_can_atomicwrite(ip)) {
>> +		*unit_min = *unit_max = 0;
>> +		return;
>> +	}
>> +
>> +	*unit_min = *unit_max = ip->i_mount->m_sb.sb_blocksize;
>
> This is a rather ugly interface. Why not have separate helpers for
> the min and max values that can just return the actual values?
>
ok, I can make that change.
Thanks,
John
^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [PATCH v5 03/10] xfs: Refactor xfs_reflink_end_cow_extent()
2025-03-12 7:24 ` Christoph Hellwig
@ 2025-03-12 8:27 ` John Garry
2025-03-12 8:35 ` Christoph Hellwig
0 siblings, 1 reply; 63+ messages in thread
From: John Garry @ 2025-03-12 8:27 UTC (permalink / raw)
To: Christoph Hellwig
Cc: brauner, djwong, cem, linux-xfs, linux-fsdevel, linux-kernel,
ojaswin, ritesh.list, martin.petersen
On 12/03/2025 07:24, Christoph Hellwig wrote:
> On Mon, Mar 10, 2025 at 06:39:39PM +0000, John Garry wrote:
>> Refactor xfs_reflink_end_cow_extent() into separate parts which process
>> the CoW range and commit the transaction.
>>
>> This refactoring will be used in future for when it is required to commit
>> a range of extents as a single transaction, similar to how it was done
>> pre-commit d6f215f359637.
>
> Darrick pointed out that if you do more than just a tiny number
> of extents per transaction you run out of log reservations very
> quickly here:
>
> https://lore.kernel.org/all/20240329162936.GI6390@frogsfrogsfrogs/
>
> how does your scheme deal with that?
>
The resblks calculation in xfs_reflink_end_atomic_cow() takes care of
this, right? Or does the log reservation have a hard size limit,
regardless of that calculation?
Thanks,
John
^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [PATCH v5 03/10] xfs: Refactor xfs_reflink_end_cow_extent()
2025-03-12 8:27 ` John Garry
@ 2025-03-12 8:35 ` Christoph Hellwig
2025-03-12 15:46 ` Darrick J. Wong
0 siblings, 1 reply; 63+ messages in thread
From: Christoph Hellwig @ 2025-03-12 8:35 UTC (permalink / raw)
To: John Garry
Cc: Christoph Hellwig, brauner, djwong, cem, linux-xfs, linux-fsdevel,
linux-kernel, ojaswin, ritesh.list, martin.petersen
On Wed, Mar 12, 2025 at 08:27:05AM +0000, John Garry wrote:
> On 12/03/2025 07:24, Christoph Hellwig wrote:
> > On Mon, Mar 10, 2025 at 06:39:39PM +0000, John Garry wrote:
> > > Refactor xfs_reflink_end_cow_extent() into separate parts which process
> > > the CoW range and commit the transaction.
> > >
> > > This refactoring will be used in future for when it is required to commit
> > > a range of extents as a single transaction, similar to how it was done
> > > pre-commit d6f215f359637.
> >
> > Darrick pointed out that if you do more than just a tiny number
> > of extents per transaction you run out of log reservations very
> > quickly here:
> >
> > https://lore.kernel.org/all/20240329162936.GI6390@frogsfrogsfrogs/
> >
> > how does your scheme deal with that?
> >
> The resblks calculation in xfs_reflink_end_atomic_cow() takes care of this,
> right? Or does the log reservation have a hard size limit, regardless of
> that calculation?
The resblks calculated there are the reserved disk blocks and have
nothing to do with the log reservations, which come from the
tr_write field passed in. There is some kind of upper limit to it,
obviously set by the log size, although I'm not sure if we've formalized
that somewhere. Dave might be the right person to ask about that.
^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [PATCH v5 05/10] xfs: Iomap SW-based atomic write support
2025-03-12 7:37 ` Christoph Hellwig
@ 2025-03-12 9:00 ` John Garry
2025-03-12 13:52 ` Christoph Hellwig
0 siblings, 1 reply; 63+ messages in thread
From: John Garry @ 2025-03-12 9:00 UTC (permalink / raw)
To: Christoph Hellwig
Cc: brauner, djwong, cem, linux-xfs, linux-fsdevel, linux-kernel,
ojaswin, ritesh.list, martin.petersen
On 12/03/2025 07:37, Christoph Hellwig wrote:
> On Mon, Mar 10, 2025 at 06:39:41PM +0000, John Garry wrote:
>> In cases where an atomic write occurs for misaligned or discontiguous disk
>> blocks, we will use a CoW-based method to issue the atomic write.
>>
>> So, for that case, return -EAGAIN to request that the write be issued in
>> CoW atomic write mode. The dio write path should detect this, similar to
>> how misaligned regular DIO writes are handled.
>
> How is -EAGAIN going to work here given that it is also used to defer
> non-blocking requests to the caller's blocking context?
You are talking about IOMAP_NOWAIT handling, right? If so, we handle
that in xfs_file_dio_write_atomic(), similar to
xfs_file_dio_write_unaligned(), i.e. if IOMAP_NOWAIT is set and we get
-EAGAIN, then we will return -EAGAIN directly to the caller.
>
> What is the problem with only setting the flag that causes REQ_ATOMIC
> to be set from the file system instead of forcing it when calling
> iomap_dio_rw?
We have this in __iomap_dio_rw():
	if (dio_flags & IOMAP_DIO_ATOMIC_SW)
		iomi.flags |= IOMAP_ATOMIC_SW;
	else if (iocb->ki_flags & IOCB_ATOMIC)
		iomi.flags |= IOMAP_ATOMIC_HW;
I do admit that the checks are a bit uneven, i.e. check vs
IOMAP_DIO_ATOMIC_SW and IOCB_ATOMIC
If we want a flag to set REQ_ATOMIC from the FS then we need
IOMAP_DIO_BIO_ATOMIC, and that would set IOMAP_BIO_ATOMIC. Is that better?
>
> Also, how do you ensure this -EAGAIN only happens on the first extent
> mapped and doesn't cause double writes?
When we find that a mapping does not suit REQ_ATOMIC-based atomic write,
then we immediately bail and retry with FS-based atomic write. And that
check should cover all requirements for a REQ_ATOMIC-based atomic write:
- aligned
- contiguous blocks, i.e. the mapping covers the full write
And we also have the check in iomap_dio_bio_iter() to ensure that the
mapping covers the full write (for REQ_ATOMIC-based atomic write).
>
>> + bool atomic_hw = flags & IOMAP_ATOMIC_HW;
>
> Again, atomic_hw is not a very useful variable name. But the
> whole idea of using a non-descriptive bool variable for a flags
> field feels like an antipattern to me.
>
>> -	if (shared)
>> +	if (shared) {
>> +		if (atomic_hw &&
>> +		    !xfs_bmap_valid_for_atomic_write(&cmap,
>> +				offset_fsb, end_fsb)) {
>> +			error = -EAGAIN;
>> +			goto out_unlock;
>> +		}
>> 		goto out_found_cow
>
> This needs a big fat comment explaining why bailing out here is
> fine and how it works.
ok
>
>> + /*
>> + * Use CoW method for when we need to alloc > 1 block,
>> + * otherwise we might allocate less than what we need here and
>> + * have multiple mappings.
>> + */
>
> Describe why this is done, not just what is done.
I did say that we may get multiple mappings, which obviously is not useful
for REQ_ATOMIC-based atomic write. But I can add a bit more detail.
Thanks,
John
^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [PATCH v5 07/10] xfs: Commit CoW-based atomic writes atomically
2025-03-12 7:39 ` Christoph Hellwig
@ 2025-03-12 9:04 ` John Garry
2025-03-12 13:54 ` Christoph Hellwig
0 siblings, 1 reply; 63+ messages in thread
From: John Garry @ 2025-03-12 9:04 UTC (permalink / raw)
To: Christoph Hellwig
Cc: brauner, djwong, cem, linux-xfs, linux-fsdevel, linux-kernel,
ojaswin, ritesh.list, martin.petersen
On 12/03/2025 07:39, Christoph Hellwig wrote:
> On Mon, Mar 10, 2025 at 06:39:43PM +0000, John Garry wrote:
>> When completing a CoW-based write, each extent range mapping update is
>> covered by a separate transaction.
>>
>> For a CoW-based atomic write, all mappings must be changed at once, so
>> change to use a single transaction.
>
> As already mentioned in a previous reply: "all" might be too much.
> The code can only support a (relatively low) number of extents
> in a single transaction safely.
Then we would need to limit the awu max to whatever can be guaranteed
(to fit).
>
>> +int
>> +xfs_reflink_end_atomic_cow(
>> + struct xfs_inode *ip,
>> + xfs_off_t offset,
>> + xfs_off_t count)
>
> Assuming we could actually do the multi-extent per transaction
> commit safely, what would be the reason to not always do it?
>
Yes, I suppose that it could always be used. I would suggest that as a
later improvement, if you agree.
Thanks,
John
^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [PATCH v5 04/10] xfs: Reflink CoW-based atomic write support
2025-03-12 7:27 ` Christoph Hellwig
@ 2025-03-12 9:13 ` John Garry
2025-03-12 13:45 ` Christoph Hellwig
0 siblings, 1 reply; 63+ messages in thread
From: John Garry @ 2025-03-12 9:13 UTC (permalink / raw)
To: Christoph Hellwig
Cc: brauner, djwong, cem, linux-xfs, linux-fsdevel, linux-kernel,
ojaswin, ritesh.list, martin.petersen
On 12/03/2025 07:27, Christoph Hellwig wrote:
> On Mon, Mar 10, 2025 at 06:39:40PM +0000, John Garry wrote:
>> Base SW-based atomic writes on CoW.
>>
>> For SW-based atomic write support, always allocate a cow hole in
>> xfs_reflink_allocate_cow() to write the new data.
>
> What is a "COW hole"?
I really mean a cow mapping. I can reword that.
>
>> The semantics are that if @atomic_sw is set, we will be passed a CoW fork
>> extent mapping when no error is returned.
>
> This commit log feels extremely sparse for a brand new feature with
> data integrity impact. Can you expand on it a little?
Sure, will do
>
>> + bool atomic_sw = flags & XFS_REFLINK_ATOMIC_SW;
>
> atomic_sw is not a very descriptive variable name.
ack
>
>>
>> resaligned = xfs_aligned_fsb_count(imap->br_startoff,
>> imap->br_blockcount, xfs_get_cowextsz_hint(ip));
>> @@ -466,7 +467,7 @@ xfs_reflink_fill_cow_hole(
>> *lockmode = XFS_ILOCK_EXCL;
>>
>> error = xfs_find_trim_cow_extent(ip, imap, cmap, shared, &found);
>> - if (error || !*shared)
>> + if (error || (!*shared && !atomic_sw))
>
> And it's only used once. Basically it is used to force COW, right?
Yes, we force it. Indeed, I think that is the term you used a long time ago
in your RFC for atomic file updates.
But that flag is being used to set XFS_BMAPI_EXTSZALIGN, so it feels like a
bit of a disconnect as to why we would set XFS_BMAPI_EXTSZALIGN for "forced
cow". I would need to spell that out.
> Maybe use that fact, as it describes the semantics at this level
> instead of the very high-level intent?
ok, fine
>
>> @@ -10,6 +10,7 @@
>> * Flags for xfs_reflink_allocate_cow()
>> */
>> #define XFS_REFLINK_CONVERT (1u << 0) /* convert unwritten extents now */
>> +#define XFS_REFLINK_ATOMIC_SW (1u << 1) /* alloc for SW-based atomic write */
>
>> Please expand on what this actually means at the xfs_reflink_allocate_cow()
>> level. Or if it is just a force flag, as I suspect, spell that out. And
>> move the comment up to avoid the overly long line, as well as giving
>> you space to actually spell the semantics out.
OK, I can do that, especially since XFS_REFLINK_CONVERT is going to be
renamed.
Thanks,
John
^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [PATCH v5 09/10] xfs: Allow block allocator to take an alignment hint
2025-03-12 8:05 ` John Garry
@ 2025-03-12 13:45 ` Christoph Hellwig
2025-03-12 14:47 ` John Garry
2025-03-12 16:00 ` Darrick J. Wong
0 siblings, 2 replies; 63+ messages in thread
From: Christoph Hellwig @ 2025-03-12 13:45 UTC (permalink / raw)
To: John Garry
Cc: Christoph Hellwig, brauner, djwong, cem, linux-xfs, linux-fsdevel,
linux-kernel, ojaswin, ritesh.list, martin.petersen
On Wed, Mar 12, 2025 at 08:05:14AM +0000, John Garry wrote:
> > Shouldn't we be doing this by default for any extent size hint
> > based allocations?
>
> I'm not sure.
>
> I think that currently users just expect extszhint to hint at the
> granularity only.
>
> Maybe users don't require alignment and adding an alignment requirement just
> leads to more fragmentation.
But does it? Once an extsize hint is set I'd expect that we keep
getting more allocations with it. And keeping them aligned is the concept
of a buddy allocator, which reduces fragmentation. Because of that I
wonder why we aren't doing that by default.
^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [PATCH v5 04/10] xfs: Reflink CoW-based atomic write support
2025-03-12 9:13 ` John Garry
@ 2025-03-12 13:45 ` Christoph Hellwig
2025-03-12 14:48 ` John Garry
0 siblings, 1 reply; 63+ messages in thread
From: Christoph Hellwig @ 2025-03-12 13:45 UTC (permalink / raw)
To: John Garry
Cc: Christoph Hellwig, brauner, djwong, cem, linux-xfs, linux-fsdevel,
linux-kernel, ojaswin, ritesh.list, martin.petersen
On Wed, Mar 12, 2025 at 09:13:45AM +0000, John Garry wrote:
> > > + if (error || (!*shared && !atomic_sw))
> >
> > And it's only used once. Basically it is used to force COW, right?
>
> Yes, we force it. Indeed, I think that is the term you used a long time ago in
> your RFC for atomic file updates.
>
> But that flag is being used to set XFS_BMAPI_EXTSZALIGN, so it feels like a bit
> of a disconnect as to why we would set XFS_BMAPI_EXTSZALIGN for "forced cow". I
> would need to spell that out.
Maybe use two flags for that even if they currently are set together?
Note that this would go away if we'd always align extsize hinted
allocations, which I suspect is a good idea (even if I'm not 100% sure
about it).
^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [PATCH v5 05/10] xfs: Iomap SW-based atomic write support
2025-03-12 9:00 ` John Garry
@ 2025-03-12 13:52 ` Christoph Hellwig
2025-03-12 14:57 ` John Garry
0 siblings, 1 reply; 63+ messages in thread
From: Christoph Hellwig @ 2025-03-12 13:52 UTC (permalink / raw)
To: John Garry
Cc: Christoph Hellwig, brauner, djwong, cem, linux-xfs, linux-fsdevel,
linux-kernel, ojaswin, ritesh.list, martin.petersen
On Wed, Mar 12, 2025 at 09:00:52AM +0000, John Garry wrote:
> > How is -EAGAIN going to work here given that it is also used to defer
> > non-blocking requests to the caller's blocking context?
>
> You are talking about IOMAP_NOWAIT handling, right?
Yes.
> If so, we handle that in
> xfs_file_dio_write_atomic(), similar to xfs_file_dio_write_unaligned(), i.e.
> if IOMAP_NOWAIT is set and we get -EAGAIN, then we will return -EAGAIN
> directly to the caller.
Can you document this including the interaction between the different
cases of -EAGAIN somewhere?
> > What is the problem with only setting the flag that causes REQ_ATOMIC
> > to be set from the file system instead of forcing it when calling
> > iomap_dio_rw?
>
> We have this in __iomap_dio_rw():
>
> 	if (dio_flags & IOMAP_DIO_ATOMIC_SW)
> 		iomi.flags |= IOMAP_ATOMIC_SW;
> 	else if (iocb->ki_flags & IOCB_ATOMIC)
> 		iomi.flags |= IOMAP_ATOMIC_HW;
>
> I do admit that the checks are a bit uneven, i.e. check vs
> IOMAP_DIO_ATOMIC_SW and IOCB_ATOMIC
>
> If we want a flag to set REQ_ATOMIC from the FS then we need
> IOMAP_DIO_BIO_ATOMIC, and that would set IOMAP_BIO_ATOMIC. Is that better?
My expectation from a very cursory view is that iomap would be that
there is a IOMAP_F_REQ_ATOMIC that is set in ->iomap_begin and which
would make the core iomap code set REQ_ATOMIC on the bio for that
iteration.
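Something of this shape, say (IOMAP_F_REQ_ATOMIC is only a proposed
name at this point, and the qualifying condition is pseudocode):

	/* fs ->iomap_begin: opt in per mapping */
	if ((flags & IOMAP_ATOMIC) && mapping_usable_for_untorn_bio)
		iomap->flags |= IOMAP_F_REQ_ATOMIC;

	/* iomap dio core: act on it per bio */
	if (iomap->flags & IOMAP_F_REQ_ATOMIC)
		bio->bi_opf |= REQ_ATOMIC;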
> > Also, how do you ensure this -EAGAIN only happens on the first extent
> > mapped and doesn't cause double writes?
>
> When we find that a mapping does not suit REQ_ATOMIC-based atomic write,
> then we immediately bail and retry with FS-based atomic write. And that
> check should cover all requirements for a REQ_ATOMIC-based atomic write:
> - aligned
> - contiguous blocks, i.e. the mapping covers the full write
>
> And we also have the check in iomap_dio_bit_iter() to ensure that the
> mapping covers the full write (for REQ_ATOMIC-based atomic write).
Ah, I guess that's the

	if (bio_atomic && length != iter->len)
		return -EINVAL;

So yes, please add a comment there that this is about a single iteration
covering the entire write.
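For instance (the comment wording is just a suggestion):

	/*
	 * An atomic bio must cover the complete write in a single
	 * iteration; bail out rather than splitting the write across
	 * multiple bios/mappings.
	 */
	if (bio_atomic && length != iter->len)
		return -EINVAL;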
^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [PATCH v5 07/10] xfs: Commit CoW-based atomic writes atomically
2025-03-12 9:04 ` John Garry
@ 2025-03-12 13:54 ` Christoph Hellwig
2025-03-12 15:01 ` John Garry
0 siblings, 1 reply; 63+ messages in thread
From: Christoph Hellwig @ 2025-03-12 13:54 UTC (permalink / raw)
To: John Garry
Cc: Christoph Hellwig, brauner, djwong, cem, linux-xfs, linux-fsdevel,
linux-kernel, ojaswin, ritesh.list, martin.petersen
On Wed, Mar 12, 2025 at 09:04:07AM +0000, John Garry wrote:
> > As already mentioned in a previous reply: "all" might be to much.
> > The code can only support a (relatively low) number of extents
> > in a single transaction safely.
>
> Then we would need to limit the awu max to whatever can be guaranteed
> (to fit).
Yes. And please add a testcase that creates a badly fragmented file
and verifies that we can handle the worst case for this limit.
(although being able to reproduce the worst case btree splits might
be hard, but at least the worst case fragmentation should be doable)
> > Assuming we could actually do the multi-extent per transaction
> > commit safely, what would be the reason to not always do it?
> >
>
> Yes, I suppose that it could always be used. I would suggest that as a later
> improvement, if you agree.
I remember running into some problems with my earlier version, but I'd
have to dig into it. Maybe it will resurface with the above testing,
or it was due to my optimizations for the extent lookups.
^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [PATCH v5 09/10] xfs: Allow block allocator to take an alignment hint
2025-03-12 13:45 ` Christoph Hellwig
@ 2025-03-12 14:47 ` John Garry
2025-03-12 16:00 ` Darrick J. Wong
1 sibling, 0 replies; 63+ messages in thread
From: John Garry @ 2025-03-12 14:47 UTC (permalink / raw)
To: Christoph Hellwig
Cc: brauner, djwong, cem, linux-xfs, linux-fsdevel, linux-kernel,
ojaswin, ritesh.list, martin.petersen
On 12/03/2025 13:45, Christoph Hellwig wrote:
> On Wed, Mar 12, 2025 at 08:05:14AM +0000, John Garry wrote:
>>> Shouldn't we be doing this by default for any extent size hint
>>> based allocations?
>>
>> I'm not sure.
>>
>> I think that currently users just expect extszhint to hint at the
>> granularity only.
>>
>> Maybe users don't require alignment and adding an alignment requirement just
>> leads to more fragmentation.
>
> But does it? Once an extsize hint is set I'd expect that we keep
> getting more allocations with it.
Sure, but that value is configurable per inode (so not all inodes may
have it set)...but then it is also inherited.
> And keeping them aligned is the concept
> of a buddy allocator, which reduces fragmentation. Because of that I
> wonder why we aren't doing that by default.
>
As I see it, we just pay attention to stripe alignment. Dave also
questioned this - Dave, any further idea on why this is?
To me it could make sense to use extszhint as the alignment requirement,
but only if there is no stripe alignment. We also need to ensure that we
ignore this in the low space allocator.
Thanks,
John
^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [PATCH v5 04/10] xfs: Reflink CoW-based atomic write support
2025-03-12 13:45 ` Christoph Hellwig
@ 2025-03-12 14:48 ` John Garry
0 siblings, 0 replies; 63+ messages in thread
From: John Garry @ 2025-03-12 14:48 UTC (permalink / raw)
To: Christoph Hellwig
Cc: brauner, djwong, cem, linux-xfs, linux-fsdevel, linux-kernel,
ojaswin, ritesh.list, martin.petersen
On 12/03/2025 13:45, Christoph Hellwig wrote:
> On Wed, Mar 12, 2025 at 09:13:45AM +0000, John Garry wrote:
>>>> + if (error || (!*shared && !atomic_sw))
>>>
>>> And it's only used once. Basically it is used to force COW, right?
>>
>> Yes, we force it. Indeed, I think that is the term you used a long time ago in
>> your RFC for atomic file updates.
>>
>> But that flag is being used to set XFS_BMAPI_EXTSZALIGN, so it feels like a bit
>> of a disconnect as to why we would set XFS_BMAPI_EXTSZALIGN for "forced cow". I
>> would need to spell that out.
>
> Maybe use two flags for that even if they currently are set together?
ok, fine. Then it makes explaining why we set XFS_BMAPI_EXTSZALIGN for
"force cow" a lot simpler in the code.
> Note that this would go away if we'd always align extsize hinted
> allocations, which I suspect is a good idea (even if I'm not 100% sure
> about it).
>
Sure
Cheers,
John
^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [PATCH v5 05/10] xfs: Iomap SW-based atomic write support
2025-03-12 13:52 ` Christoph Hellwig
@ 2025-03-12 14:57 ` John Garry
2025-03-12 15:55 ` Christoph Hellwig
0 siblings, 1 reply; 63+ messages in thread
From: John Garry @ 2025-03-12 14:57 UTC (permalink / raw)
To: Christoph Hellwig
Cc: brauner, djwong, cem, linux-xfs, linux-fsdevel, linux-kernel,
ojaswin, ritesh.list, martin.petersen
On 12/03/2025 13:52, Christoph Hellwig wrote:
> On Wed, Mar 12, 2025 at 09:00:52AM +0000, John Garry wrote:
>>> How is -EAGAIN going to work here given that it is also used to defer
>>> non-blocking requests to the caller's blocking context?
>>
>> You are talking about IOMAP_NOWAIT handling, right?
>
> Yes.
>
>> If so, we handle that in
>> xfs_file_dio_write_atomic(), similar to xfs_file_dio_write_unaligned(), i.e.
>> if IOMAP_NOWAIT is set and we get -EAGAIN, then we will return -EAGAIN
>> directly to the caller.
>
> Can you document this including the interaction between the different
> cases of -EAGAIN somewhere?
Sure
We do the same for dio write unaligned - but that already has a big
comment explaining the retry mechanism.
>
>>> What is the problem with only setting the flag that causes REQ_ATOMIC
>>> to be set from the file system instead of forcing it when calling
>>> iomap_dio_rw?
>>
>> We have this in __iomap_dio_rw():
>>
>> 	if (dio_flags & IOMAP_DIO_ATOMIC_SW)
>> 		iomi.flags |= IOMAP_ATOMIC_SW;
>> 	else if (iocb->ki_flags & IOCB_ATOMIC)
>> 		iomi.flags |= IOMAP_ATOMIC_HW;
>>
>> I do admit that the checks are a bit uneven, i.e. check vs
>> IOMAP_DIO_ATOMIC_SW and IOCB_ATOMIC
>>
>> If we want a flag to set REQ_ATOMIC from the FS then we need
>> IOMAP_DIO_BIO_ATOMIC, and that would set IOMAP_BIO_ATOMIC. Is that better?
>
> My expectation from a very cursory view is that iomap would be that
> there is a IOMAP_F_REQ_ATOMIC that is set in ->iomap_begin and which
> would make the core iomap code set REQ_ATOMIC on the bio for that
> iteration.
but we still need to tell ->iomap_begin about IOCB_ATOMIC, hence
IOMAP_DIO_BIO_ATOMIC, which sets IOMAP_BIO_ATOMIC.
We can't allow __iomap_dio_rw() to check IOCB_ATOMIC only (and set
IOMAP_BIO_ATOMIC), as this is the common path for COW and regular atomic
writes.
>
>>> Also, how do you ensure this -EAGAIN only happens on the first extent
>>> mapped and doesn't cause double writes?
>>
>> When we find that a mapping does not suit REQ_ATOMIC-based atomic write,
>> then we immediately bail and retry with FS-based atomic write. And that
>> check should cover all requirements for a REQ_ATOMIC-based atomic write:
>> - aligned
>> - contiguous blocks, i.e. the mapping covers the full write
>>
> > And we also have the check in iomap_dio_bio_iter() to ensure that the
>> mapping covers the full write (for REQ_ATOMIC-based atomic write).
>
> Ah, I guess that's the
>
> 	if (bio_atomic && length != iter->len)
> 		return -EINVAL;
>
> So yes, please add a comment there that this is about a single iteration
> covering the entire write.
ok, fine.
Thanks,
John
>
^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [PATCH v5 07/10] xfs: Commit CoW-based atomic writes atomically
2025-03-12 13:54 ` Christoph Hellwig
@ 2025-03-12 15:01 ` John Garry
0 siblings, 0 replies; 63+ messages in thread
From: John Garry @ 2025-03-12 15:01 UTC (permalink / raw)
To: Christoph Hellwig
Cc: brauner, djwong, cem, linux-xfs, linux-fsdevel, linux-kernel,
ojaswin, ritesh.list, martin.petersen, catherine.hoang
On 12/03/2025 13:54, Christoph Hellwig wrote:
> On Wed, Mar 12, 2025 at 09:04:07AM +0000, John Garry wrote:
>>> As already mentioned in a previous reply: "all" might be too much.
>>> The code can only support a (relatively low) number of extents
>>> in a single transaction safely.
>>
>> Then we would need to limit the awu max to whatever can be guaranteed
>> (to fit).
>
> Yes. And please add a testcase that creates a badly fragmented file
> and verifies that we can handle the worst case for this limit.
ok, we can do that.
I have my own stress test for atomic writes on fragmented FSes, but I
have not encountered such a problem. We need to formalize
something, though.
>
> (although being able to reproduce the worst case btree splits might
> be hard, but at least the worst case fragmentation should be doable)
>
>>> Assuming we could actually do the multi-extent per transaction
>>> commit safely, what would be the reason to not always do it?
>>>
>>
>> Yes, I suppose that it could always be used. I would suggest that as a later
>> improvement, if you agree.
>
> I remember running into some problems with my earlier version, but I'd
> have to dig into it. Maybe it will resurface with the above testing,
> or it was due to my optimizations for the extent lookups.
>
ok.
Thanks,
John
^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [PATCH v5 03/10] xfs: Refactor xfs_reflink_end_cow_extent()
2025-03-12 8:35 ` Christoph Hellwig
@ 2025-03-12 15:46 ` Darrick J. Wong
2025-03-12 22:06 ` John Garry
2025-03-13 1:25 ` Dave Chinner
0 siblings, 2 replies; 63+ messages in thread
From: Darrick J. Wong @ 2025-03-12 15:46 UTC (permalink / raw)
To: Christoph Hellwig
Cc: John Garry, brauner, cem, linux-xfs, linux-fsdevel, linux-kernel,
ojaswin, ritesh.list, martin.petersen
On Wed, Mar 12, 2025 at 01:35:23AM -0700, Christoph Hellwig wrote:
> On Wed, Mar 12, 2025 at 08:27:05AM +0000, John Garry wrote:
> > On 12/03/2025 07:24, Christoph Hellwig wrote:
> > > On Mon, Mar 10, 2025 at 06:39:39PM +0000, John Garry wrote:
> > > > Refactor xfs_reflink_end_cow_extent() into separate parts which process
> > > > the CoW range and commit the transaction.
> > > >
> > > > This refactoring will be used in future for when it is required to commit
> > > > a range of extents as a single transaction, similar to how it was done
> > > > pre-commit d6f215f359637.
> > >
> > > Darrick pointed out that if you do more than just a tiny number
> > > of extents per transaction you run out of log reservations very
> > > quickly here:
> > >
> > > https://lore.kernel.org/all/20240329162936.GI6390@frogsfrogsfrogs/
> > >
> > > how does your scheme deal with that?
> > >
> > The resblks calculation in xfs_reflink_end_atomic_cow() takes care of this,
> > right? Or does the log reservation have a hard size limit, regardless of
> > that calculation?
>
> The resblks calculated there are the reserved disk blocks and have
> nothing to do with the log reservations, which come from the
> tr_write field passed in. There is some kind of upper limit to it,
> obviously set by the log size, although I'm not sure if we've formalized
> that somewhere. Dave might be the right person to ask about that.
The (very very rough) upper limit for how many intent items you can
attach to a tr_write transaction is:
per_extent_cost = (cui_size + rui_size + bui_size + efi_size + ili_size)
max_blocks = tr_write::tr_logres / per_extent_cost
(ili_size is the inode log item size)
((I would halve that for the sake of paranoia))
since you have to commit all those intent items into the first
transaction in the chain. The difficulty we've always had is computing
the size of an intent item in the ondisk log, since that's a (somewhat
minor) layering violation -- it's xfs_cui_log_format_sizeof() for a CUI,
but then there could be overhead for the ondisk log headers themselves.
Maybe we ought to formalize the computation of that since reap.c also
has a handwavy XREAP_MAX_DEFER_CHAIN that it uses to roll the scrub
transaction periodically... because I'd prefer we not add another
hardcoded limit. My guess is that the software fallback can probably
support any awu_max that the hardware wants to throw at us, but let's
actually figure out the min(sw, hw) that we can support and cap it at
that.
--D
^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [PATCH v5 05/10] xfs: Iomap SW-based atomic write support
2025-03-12 14:57 ` John Garry
@ 2025-03-12 15:55 ` Christoph Hellwig
2025-03-12 16:11 ` John Garry
0 siblings, 1 reply; 63+ messages in thread
From: Christoph Hellwig @ 2025-03-12 15:55 UTC (permalink / raw)
To: John Garry
Cc: Christoph Hellwig, brauner, djwong, cem, linux-xfs, linux-fsdevel,
linux-kernel, ojaswin, ritesh.list, martin.petersen
On Wed, Mar 12, 2025 at 02:57:38PM +0000, John Garry wrote:
> > > I do admit that the checks are a bit uneven, i.e. check vs
> > > IOMAP_DIO_ATOMIC_SW and IOCB_ATOMIC
> > >
> > > If we want a flag to set REQ_ATOMIC from the FS then we need
> > > IOMAP_DIO_BIO_ATOMIC, and that would set IOMAP_BIO_ATOMIC. Is that better?
> >
> > My expectation from a very cursory view is that iomap would be that
> > there is a IOMAP_F_REQ_ATOMIC that is set in ->iomap_begin and which
> > would make the core iomap code set REQ_ATOMIC on the bio for that
> > iteration.
>
> but we still need to tell ->iomap_begin about IOCB_ATOMIC, hence
Yeah, ->iomap_begin can't directly look at the iocb.
> IOMAP_DIO_BIO_ATOMIC, which sets IOMAP_BIO_ATOMIC.
>
> We can't allow __iomap_dio_rw() to check IOCB_ATOMIC only (and set
> IOMAP_BIO_ATOMIC), as this is the common path for COW and regular atomic
> writes.
Well, I'd imagine __iomap_dio_rw just sets IOMAP_ATOMIC from IOCB_ATOMIC
and then it's up to file system internal state if it wants to set
IOMAP_F_REQ_ATOMIC based on that, i.e. the actual setting of
IOMAP_F_REQ_ATOMIC is fully controlled by the file system and not
by the iomap core.
^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [PATCH v5 09/10] xfs: Allow block allocator to take an alignment hint
2025-03-12 13:45 ` Christoph Hellwig
2025-03-12 14:47 ` John Garry
@ 2025-03-12 16:00 ` Darrick J. Wong
2025-03-12 16:28 ` John Garry
1 sibling, 1 reply; 63+ messages in thread
From: Darrick J. Wong @ 2025-03-12 16:00 UTC (permalink / raw)
To: Christoph Hellwig
Cc: John Garry, brauner, cem, linux-xfs, linux-fsdevel, linux-kernel,
ojaswin, ritesh.list, martin.petersen
On Wed, Mar 12, 2025 at 06:45:12AM -0700, Christoph Hellwig wrote:
> On Wed, Mar 12, 2025 at 08:05:14AM +0000, John Garry wrote:
> > > Shouldn't we be doing this by default for any extent size hint
> > > based allocations?
> >
> > I'm not sure.
> >
> > I think that currently users just expect extszhint to hint at the
> > granularity only.
Yes, the current behavior is that extszhint only affects the granularity
of the file range that's passed into the allocator. To align the actual
space, you have to set the raid stripe parameters.
I can see how that sorta made sense in the old days -- the fs could get
moved between raid arrays (or the raid array gets reconfigured), so you
want the actual allocations to be aligned to whatever the current
hardware config advertises. The extent size hint is merely a means to
amortize the cost of allocation/second-guess the delalloc machinery.
> > Maybe users don't require alignment and adding an alignment requirement just
> > leads to more fragmentation.
>
> But does it? Once an extsize hint is set I'd expect that we keep
> getting more allocations with it. And keeping them aligned is the concept
> of a buddy allocator, which reduces fragmentation. Because of that I
> wonder why we aren't doing that by default.
Histerical raisins?
We /could/ let extszhint influence allocation alignment by default, but
then anyone who had (say) a 8k hint on a 32k raid stripe might be
surprised when the allocator behavior changes.
What do you say about logic like this?
	if (software_atomic) {
		/*
		 * align things so we can use hw atomic on the next
		 * overwrite, no matter what hw says
		 */
		args->alignment = ip->i_extsize;
	} else if (raid_stripe) {
		/* otherwise try to align for better raid performance */
		args->alignment = mp->m_dalign;
	} else if (ip->i_extsize) {
		/* if no raid, align to the hint provided */
		args->alignment = ip->i_extsize;
	} else {
		args->alignment = 1;
	}
Hm? (I'm probably forgetting something...)
--D
^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [PATCH v5 05/10] xfs: Iomap SW-based atomic write support
2025-03-12 15:55 ` Christoph Hellwig
@ 2025-03-12 16:11 ` John Garry
0 siblings, 0 replies; 63+ messages in thread
From: John Garry @ 2025-03-12 16:11 UTC (permalink / raw)
To: Christoph Hellwig
Cc: brauner, djwong, cem, linux-xfs, linux-fsdevel, linux-kernel,
ojaswin, ritesh.list, martin.petersen
On 12/03/2025 15:55, Christoph Hellwig wrote:
>>> would make the core iomap code set REQ_ATOMIC on the bio for that
>>> iteration.
>> but we still need to tell ->iomap_begin about IOCB_ATOMIC, hence
> Yeah, ->iomap_begin can't directly look at the iocb.
>
>> IOMAP_DIO_BIO_ATOMIC, which sets IOMAP_BIO_ATOMIC.
>> We can't allow __iomap_dio_rw() to check IOCB_ATOMIC only (and set
>> IOMAP_BIO_ATOMIC), as this is the common path for COW and regular atomic
>> writes.
> Well, I'd imagine __iomap_dio_rw just sets IOMAP_ATOMIC from IOCB_ATOMIC
> and then it's up to file system internal state if it wants to set
> IOMAP_F_REQ_ATOMIC based on that, i.e. the actual setting of
> IOMAP_F_REQ_ATOMIC is fully controlled by the file system and not
> by the iomap core.
ok, fine. But I will also need to change ext4 to do the same (in terms
of setting IOMAP_F_REQ_ATOMIC).
And I am thinking of IOMAP_F_ATOMIC_BIO as the name, as we mentioned
earlier that it is not so nice to with clash block layer names
Thanks,
John
^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [PATCH v5 09/10] xfs: Allow block allocator to take an alignment hint
2025-03-12 16:00 ` Darrick J. Wong
@ 2025-03-12 16:28 ` John Garry
0 siblings, 0 replies; 63+ messages in thread
From: John Garry @ 2025-03-12 16:28 UTC (permalink / raw)
To: Darrick J. Wong, Christoph Hellwig
Cc: brauner, cem, linux-xfs, linux-fsdevel, linux-kernel, ojaswin,
ritesh.list, martin.petersen
On 12/03/2025 16:00, Darrick J. Wong wrote:
> On Wed, Mar 12, 2025 at 06:45:12AM -0700, Christoph Hellwig wrote:
>> On Wed, Mar 12, 2025 at 08:05:14AM +0000, John Garry wrote:
>>>> Shouldn't we be doing this by default for any extent size hint
>>>> based allocations?
>>>
>>> I'm not sure.
>>>
>>> I think that currently users just expect extszhint to hint at the
>>> granularity only.
>
> Yes, the current behavior is that extszhint only affects the granularity
> of the file range that's passed into the allocator. To align the actual
> space, you have to set the raid stripe parameters.
>
> I can see how that sorta made sense in the old days -- the fs could get
> moved between raid arrays (or the raid array gets reconfigured), so you
> want the actual allocations to be aligned to whatever the current
> hardware config advertises. The extent size hint is merely a means to
> amortize the cost of allocation/second-guess the delalloc machinery.
>
>>> Maybe users don't require alignment and adding an alignment requirement just
>>> leads to more fragmentation.
>>
>> But does it? Once an extsize hint is set I'd expect that we keep
>> getting more allocations with it. And keeping them aligned is the concept
>> of a buddy allocator, which reduces fragmentation. Because of that I
>> wonder why we aren't doing that by default.
>
> Histerical raisins?
>
> We /could/ let extszhint influence allocation alignment by default, but
> then anyone who had (say) a 8k hint on a 32k raid stripe might be
> surprised when the allocator behavior changes.
>
> What do you say about logic like this?
>
> 	if (software_atomic) {
> 		/*
> 		 * align things so we can use hw atomic on the next
> 		 * overwrite, no matter what hw says
> 		 */
> 		args->alignment = ip->i_extsize;
> 	} else if (raid_stripe) {
> 		/* otherwise try to align for better raid performance */
> 		args->alignment = mp->m_dalign;
> 	} else if (ip->i_extsize) {
> 		/* if no raid, align to the hint provided */
> 		args->alignment = ip->i_extsize;
> 	} else {
> 		args->alignment = 1;
> 	}
>
> Hm? (I'm probably forgetting something...)
>
note that for forcealign support, there was a prep patch to do something
similar to this:
https://lore.kernel.org/linux-xfs/20240429174746.2132161-5-john.g.garry@oracle.com/
> --D
^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [PATCH v5 03/10] xfs: Refactor xfs_reflink_end_cow_extent()
2025-03-12 15:46 ` Darrick J. Wong
@ 2025-03-12 22:06 ` John Garry
2025-03-12 23:22 ` Darrick J. Wong
2025-03-13 1:25 ` Dave Chinner
1 sibling, 1 reply; 63+ messages in thread
From: John Garry @ 2025-03-12 22:06 UTC (permalink / raw)
To: Darrick J. Wong, Christoph Hellwig
Cc: brauner, cem, linux-xfs, linux-fsdevel, linux-kernel, ojaswin,
ritesh.list, martin.petersen
On 12/03/2025 15:46, Darrick J. Wong wrote:
> On Wed, Mar 12, 2025 at 01:35:23AM -0700, Christoph Hellwig wrote:
>> On Wed, Mar 12, 2025 at 08:27:05AM +0000, John Garry wrote:
>>> On 12/03/2025 07:24, Christoph Hellwig wrote:
>>>> On Mon, Mar 10, 2025 at 06:39:39PM +0000, John Garry wrote:
>>>>> Refactor xfs_reflink_end_cow_extent() into separate parts which process
>>>>> the CoW range and commit the transaction.
>>>>>
>>>>> This refactoring will be used in future for when it is required to commit
>>>>> a range of extents as a single transaction, similar to how it was done
>>>>> pre-commit d6f215f359637.
>>>>
>>>> Darrick pointed out that if you do more than just a tiny number
> > > > of extents per transaction you run out of log reservations very
>>>> quickly here:
>>>>
> > > > https://lore.kernel.org/all/20240329162936.GI6390@frogsfrogsfrogs/
>>>>
>>>> how does your scheme deal with that?
>>>>
>>> The resblks calculation in xfs_reflink_end_atomic_cow() takes care of this,
>>> right? Or does the log reservation have a hard size limit, regardless of
>>> that calculation?
>>
>> The resblks calculated there are the reserved disk blocks and have
>> nothing to do with the log reservations, which comes from the
>> tr_write field passed in. There is some kind of upper limited to it
>> obviously by the log size, although I'm not sure if we've formalized
>> that somewhere. Dave might be the right person to ask about that.
>
> The (very very rough) upper limit for how many intent items you can
> attach to a tr_write transaction is:
>
> per_extent_cost = (cui_size + rui_size + bui_size + efi_size + ili_size)
> max_blocks = tr_write::tr_logres / per_extent_cost
>
> (ili_size is the inode log item size)
So will it be something like this:
static size_t
xfs_compute_awu_max_extents(
	struct xfs_mount	*mp)
{
	struct xfs_trans_res	*resp = &M_RES(mp)->tr_write;
	size_t			logtotal = xfs_bui_log_format_sizeof(1) +
				   xfs_cui_log_format_sizeof(1) +
				   xfs_efi_log_format_sizeof(1) +
				   xfs_rui_log_format_sizeof(1) +
				   sizeof(struct xfs_inode_log_format);

	return rounddown_pow_of_two(resp->tr_logres / logtotal);
}

static inline void
xfs_compute_awu_max(
	struct xfs_mount	*mp, int jjcount)
{
	....
	mp->m_awu_max = min_t(unsigned int, awu_max,
			xfs_compute_awu_max_extents(mp));
}
>
> ((I would halve that for the sake of paranoia))
>
> since you have to commit all those intent items into the first
> transaction in the chain. The difficulty we've always had is computing
> the size of an intent item in the ondisk log, since that's a (somewhat
> minor) layering violation -- it's xfs_cui_log_format_sizeof() for a CUI,
> > but then there could be overhead for the ondisk log headers themselves.
>
> Maybe we ought to formalize the computation of that since reap.c also
> has a handwavy XREAP_MAX_DEFER_CHAIN that it uses to roll the scrub
> transaction periodically... because I'd prefer we not add another
> hardcoded limit. My guess is that the software fallback can probably
> > support any awu_max that the hardware wants to throw at us, but let's
> actually figure out the min(sw, hw) that we can support and cap it at
> that.
>
> --D
^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [PATCH v5 03/10] xfs: Refactor xfs_reflink_end_cow_extent()
2025-03-12 22:06 ` John Garry
@ 2025-03-12 23:22 ` Darrick J. Wong
0 siblings, 0 replies; 63+ messages in thread
From: Darrick J. Wong @ 2025-03-12 23:22 UTC (permalink / raw)
To: John Garry
Cc: Christoph Hellwig, brauner, cem, linux-xfs, linux-fsdevel,
linux-kernel, ojaswin, ritesh.list, martin.petersen
On Wed, Mar 12, 2025 at 10:06:11PM +0000, John Garry wrote:
> On 12/03/2025 15:46, Darrick J. Wong wrote:
> > On Wed, Mar 12, 2025 at 01:35:23AM -0700, Christoph Hellwig wrote:
> > > On Wed, Mar 12, 2025 at 08:27:05AM +0000, John Garry wrote:
> > > > On 12/03/2025 07:24, Christoph Hellwig wrote:
> > > > > On Mon, Mar 10, 2025 at 06:39:39PM +0000, John Garry wrote:
> > > > > > Refactor xfs_reflink_end_cow_extent() into separate parts which process
> > > > > > the CoW range and commit the transaction.
> > > > > >
> > > > > > This refactoring will be used in future for when it is required to commit
> > > > > > a range of extents as a single transaction, similar to how it was done
> > > > > > pre-commit d6f215f359637.
> > > > >
> > > > > Darrick pointed out that if you do more than just a tiny number
> > > > > of extents per transaction you run out of log reservations very
> > > > > quickly here:
> > > > >
> > > > > https://lore.kernel.org/all/20240329162936.GI6390@frogsfrogsfrogs/
> > > > >
> > > > > how does your scheme deal with that?
> > > > >
> > > > The resblks calculation in xfs_reflink_end_atomic_cow() takes care of this,
> > > > right? Or does the log reservation have a hard size limit, regardless of
> > > > that calculation?
> > >
> > > The resblks calculated there are the reserved disk blocks and have
> > > nothing to do with the log reservations, which come from the
> > > tr_write field passed in. There is some kind of upper limit to it,
> > > obviously imposed by the log size, although I'm not sure if we've formalized
> > > that somewhere. Dave might be the right person to ask about that.
> >
> > The (very very rough) upper limit for how many intent items you can
> > attach to a tr_write transaction is:
> >
> > per_extent_cost = (cui_size + rui_size + bui_size + efi_size + ili_size)
> > max_blocks = tr_write::tr_logres / per_extent_cost
> >
> > (ili_size is the inode log item size)
>
> So will it be something like this:
>
> static size_t
> xfs_compute_awu_max_extents(
> struct xfs_mount *mp)
> {
> struct xfs_trans_res *resp = &M_RES(mp)->tr_write;
> size_t logtotal = xfs_bui_log_format_sizeof(1) +
Might want to call it "per_extent_logres" since that's what it is.
> xfs_cui_log_format_sizeof(1) +
> xfs_efi_log_format_sizeof(1) +
> xfs_rui_log_format_sizeof(1) +
> sizeof(struct xfs_inode_log_format);
Something like that, yeah. You should probably add
xfs_log_dinode_size(ip->i_mount) to that.
What you're really doing is summing the *nbytes output of the
->iop_size() call for each possible log item. For the four log intent
items it's the xfs_FOO_log_format_sizeof() function like you have above.
For inode items it's:
*nbytes += sizeof(struct xfs_inode_log_format) +
xfs_log_dinode_size(ip->i_mount);
> return rounddown_pow_of_two(resp->tr_logres / logtotal);
and like I said earlier, you should double logtotal to be on the safe
side with a 2x safety margin:
/* 100% safety margin for safety's sake */
return rounddown_pow_of_two(resp->tr_logres /
(2 * per_extent_logres));
I'm curious what number you get back from this function? Hopefully it's
at least a few hundred blocks.
Thanks for putting that together. :)
--D
> }
>
> static inline void
> xfs_compute_awu_max(
> struct xfs_mount *mp, int jjcount)
> {
> ....
> mp->m_awu_max =
> min_t(unsigned int, awu_max, xfs_compute_awu_max_extents(mp));
> }
>
> >
> > ((I would halve that for the sake of paranoia))
> >
> > since you have to commit all those intent items into the first
> > transaction in the chain. The difficulty we've always had is computing
> > the size of an intent item in the ondisk log, since that's a (somewhat
> > minor) layering violation -- it's xfs_cui_log_format_sizeof() for a CUI,
> > but then there could be overhead for the ondisk log headers themselves.
> >
> > Maybe we ought to formalize the computation of that since reap.c also
> > has a handwavy XREAP_MAX_DEFER_CHAIN that it uses to roll the scrub
> > transaction periodically... because I'd prefer we not add another
> > hardcoded limit. My guess is that the software fallback can probably
> > support any awu_max that the hardware wants to throw at us, but let's
> > actually figure out the min(sw, hw) that we can support and cap it at
> > that.
> >
> > --D
>
>
^ permalink raw reply [flat|nested] 63+ messages in thread
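Folding Darrick's suggestions back into John's sketch gives something like
the following. This is only a sketch of the computation being discussed,
not the committed implementation; it assumes the xfs_*_log_format_sizeof()
helpers and xfs_log_dinode_size() mentioned above:
static size_t
xfs_compute_awu_max_extents(
	struct xfs_mount	*mp)
{
	struct xfs_trans_res	*resp = &M_RES(mp)->tr_write;
	/* sum of the ->iop_size() contributions for one remapped extent */
	size_t			per_extent_logres =
					xfs_bui_log_format_sizeof(1) +
					xfs_cui_log_format_sizeof(1) +
					xfs_efi_log_format_sizeof(1) +
					xfs_rui_log_format_sizeof(1) +
					sizeof(struct xfs_inode_log_format) +
					xfs_log_dinode_size(mp);

	/* 100% safety margin for safety's sake */
	return rounddown_pow_of_two(resp->tr_logres /
			(2 * per_extent_logres));
}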
* Re: [PATCH RFC v5 10/10] iomap: Rename ATOMIC flags again
2025-03-12 7:13 ` Christoph Hellwig
@ 2025-03-12 23:59 ` Dave Chinner
2025-03-13 6:28 ` John Garry
0 siblings, 1 reply; 63+ messages in thread
From: Dave Chinner @ 2025-03-12 23:59 UTC (permalink / raw)
To: Christoph Hellwig
Cc: John Garry, brauner, djwong, cem, linux-xfs, linux-fsdevel,
linux-kernel, ojaswin, ritesh.list, martin.petersen
On Wed, Mar 12, 2025 at 12:13:42AM -0700, Christoph Hellwig wrote:
> On Mon, Mar 10, 2025 at 06:39:46PM +0000, John Garry wrote:
> > Dave Chinner thought that the names IOMAP_ATOMIC_HW and IOMAP_ATOMIC_SW
> > were not appropriate. Specifically because IOMAP_ATOMIC_HW could actually be
> > realised with a SW-based method in the block or md/dm layers.
> >
> > So rename to IOMAP_ATOMIC_BIO and IOMAP_ATOMIC_FS.
>
> Looking over the entire series and the already merged iomap code:
> there should be no reason at all for having IOMAP_ATOMIC_FS.
> The only thing it does is to branch out to
> xfs_atomic_write_sw_iomap_begin from xfs_atomic_write_iomap_begin.
>
> You can do that in a much simpler and nicer way by just having
> different iomap_ops for the bio based vs file system based atomics.
Agreed - I was going to suggest that, but got distracted by
something else and then forgot about it when I got back to writing
the email...
> I agree with Dave that bio is a better term for the bio-based
> atomic, but please use the IOMAP_ATOMIC_BIO name above instead
> of the IOMAP_BIO_ATOMIC actually used in the code if you change
> it.
Works for me.
> > */
> > static inline blk_opf_t iomap_dio_bio_opflags(struct iomap_dio *dio,
> > - const struct iomap *iomap, bool use_fua, bool atomic_hw)
> > + const struct iomap *iomap, bool use_fua, bool bio_atomic)
>
> Not new here, but these two bools are pretty ugly.
>
> I'd rather have a
>
> blk_opf_t extra_flags;
>
> in the caller that gets REQ_FUA and REQ_ATOMIC assigned as needed,
> and then just clear
Yep, that is cleaner..
-Dave.
--
Dave Chinner
david@fromorbit.com
^ permalink raw reply [flat|nested] 63+ messages in thread
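A sketch of the two-ops-table approach Christoph is suggesting might look
like this (illustrative only: xfs_atomic_write_bio_iomap_begin() is a
made-up name here, while xfs_atomic_write_sw_iomap_begin() is the helper
named above):
/* one iomap_ops per atomic write method, chosen by the caller */
static const struct iomap_ops xfs_atomic_write_bio_iomap_ops = {
	.iomap_begin	= xfs_atomic_write_bio_iomap_begin,
};

static const struct iomap_ops xfs_atomic_write_fs_iomap_ops = {
	.iomap_begin	= xfs_atomic_write_sw_iomap_begin,
};
The file code would then pick the ops table up front instead of passing an
IOMAP_ATOMIC_FS flag down through iomap.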
* Re: [PATCH v5 03/10] xfs: Refactor xfs_reflink_end_cow_extent()
2025-03-12 15:46 ` Darrick J. Wong
2025-03-12 22:06 ` John Garry
@ 2025-03-13 1:25 ` Dave Chinner
2025-03-13 4:51 ` Darrick J. Wong
1 sibling, 1 reply; 63+ messages in thread
From: Dave Chinner @ 2025-03-13 1:25 UTC (permalink / raw)
To: Darrick J. Wong
Cc: Christoph Hellwig, John Garry, brauner, cem, linux-xfs,
linux-fsdevel, linux-kernel, ojaswin, ritesh.list,
martin.petersen
On Wed, Mar 12, 2025 at 08:46:36AM -0700, Darrick J. Wong wrote:
> On Wed, Mar 12, 2025 at 01:35:23AM -0700, Christoph Hellwig wrote:
> > On Wed, Mar 12, 2025 at 08:27:05AM +0000, John Garry wrote:
> > > On 12/03/2025 07:24, Christoph Hellwig wrote:
> > > > On Mon, Mar 10, 2025 at 06:39:39PM +0000, John Garry wrote:
> > > > > Refactor xfs_reflink_end_cow_extent() into separate parts which process
> > > > > the CoW range and commit the transaction.
> > > > >
> > > > > This refactoring will be used in future for when it is required to commit
> > > > > a range of extents as a single transaction, similar to how it was done
> > > > > pre-commit d6f215f359637.
> > > >
> > > > Darrick pointed out that if you do more than just a tiny number
> > > > of extents per transactions you run out of log reservations very
> > > > quickly here:
> > > >
> > > > https://lore.kernel.org/all/20240329162936.GI6390@frogsfrogsfrogs/
> > > >
> > > > how does your scheme deal with that?
> > > >
> > > The resblks calculation in xfs_reflink_end_atomic_cow() takes care of this,
> > > right? Or does the log reservation have a hard size limit, regardless of
> > > that calculation?
> >
> > The resblks calculated there are the reserved disk blocks
Used for btree block allocations that might be needed during the
processing of the transaction.
> > and have
> > nothing to do with the log reservations, which come from the
> > tr_write field passed in. There is some kind of upper limit to it,
> > obviously imposed by the log size, although I'm not sure if we've formalized
> > that somewhere. Dave might be the right person to ask about that.
>
> The (very very rough) upper limit for how many intent items you can
> attach to a tr_write transaction is:
>
> per_extent_cost = (cui_size + rui_size + bui_size + efi_size + ili_size)
> max_blocks = tr_write::tr_logres / per_extent_cost
>
> (ili_size is the inode log item size)
That doesn't sound right. The number of intents we can log is not
dependent on the aggregated size of all intent types. We do not log
all those intent types in a single transaction, nor do we process
more than one type of intent in a given transaction. Also, we only
log the inode once per transaction, so that is not a per-extent
overhead.
Realistically, the tr_write transaction is going to be at least
100kB because it has to be big enough to log full splits of multiple
btrees (e.g. BMBT + both free space trees). Yeah, a small 4kB
filesystem spits out:
xfs_trans_resv_calc: dev 7:0 type 0 logres 193528 logcount 5 flags 0x4
About 190kB.
However, intents are typically very small - around 32 bytes in size
plus another 12 bytes for the log region ophdr.
This implies that we can fit thousands of individual intents in a
single tr_write log reservation on any given filesystem, and the
number of loop iterations in a transaction is therefore dependent
largely on how many intents are logged per iteration.
Hence if we are walking a range of extents in the BMBT to unmap
them, then we should only be generating 2 intents per loop - a BUI
for the BMBT removal and a CUI for the shared refcount decrease.
That means we should be able to run at least a thousand iterations
of that loop per transaction without getting anywhere near the
transaction reservation limits.
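(For concreteness, using the numbers above: 193528 bytes of tr_write
logres / (2 intents per iteration * ~44 bytes per logged intent) works
out to roughly 2200 iterations.)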
*However!*
We have to relog every intent we haven't processed in the deferred
batch every-so-often to prevent the outstanding intents from pinning
the tail of the log. Hence the larger the number of intents in the
initial batch, the more work we have to do later on (and the more
overall log space and bandwidth they will consume) to relog them
over and over again until they pop to the head of the
processing queue.
Hence there is no real performance advantage to creating massive intent
batches because we end up doing more work later on to relog those
intents to prevent journal space deadlocks. It also doesn't speed up
processing, because we still process the intent chains one at a time
from start to completion before moving on to the next high level
intent chain that needs to be processed.
Further, after the first couple of intent chains have been
processed, the initial log space reservation will have run out, and
we are now asking for a new reservation on every transaction roll we
do, i.e. we are now doing a log space reservation on every
transaction roll in the processing chain instead of only doing it
once per high level intent chain.
Hence from a log space accounting perspective (the hottest code path
in the journal), it is far more efficient to perform a single high
level transaction per extent unmap operation than it is to batch
intents into a single high level transaction.
My advice is this: we should never batch high level iterative
intent-based operations into a single transaction because it's a
false optimisation. It might look like it is an efficiency
improvement from the high level, but it ends up hammering the hot,
performance critical paths in the transaction subsystem much, much
harder and so will end up being slower than the single transaction
per intent-based operation algorithm when it matters most....
-Dave.
--
Dave Chinner
david@fromorbit.com
^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [PATCH v5 03/10] xfs: Refactor xfs_reflink_end_cow_extent()
2025-03-13 1:25 ` Dave Chinner
@ 2025-03-13 4:51 ` Darrick J. Wong
2025-03-13 6:11 ` John Garry
2025-03-13 7:21 ` Dave Chinner
0 siblings, 2 replies; 63+ messages in thread
From: Darrick J. Wong @ 2025-03-13 4:51 UTC (permalink / raw)
To: Dave Chinner
Cc: Christoph Hellwig, John Garry, brauner, cem, linux-xfs,
linux-fsdevel, linux-kernel, ojaswin, ritesh.list,
martin.petersen
On Thu, Mar 13, 2025 at 12:25:21PM +1100, Dave Chinner wrote:
> On Wed, Mar 12, 2025 at 08:46:36AM -0700, Darrick J. Wong wrote:
> > On Wed, Mar 12, 2025 at 01:35:23AM -0700, Christoph Hellwig wrote:
> > > On Wed, Mar 12, 2025 at 08:27:05AM +0000, John Garry wrote:
> > > > On 12/03/2025 07:24, Christoph Hellwig wrote:
> > > > > On Mon, Mar 10, 2025 at 06:39:39PM +0000, John Garry wrote:
> > > > > > Refactor xfs_reflink_end_cow_extent() into separate parts which process
> > > > > > the CoW range and commit the transaction.
> > > > > >
> > > > > > This refactoring will be used in future for when it is required to commit
> > > > > > a range of extents as a single transaction, similar to how it was done
> > > > > > pre-commit d6f215f359637.
> > > > >
> > > > > Darrick pointed out that if you do more than just a tiny number
> > > > > of extents per transactions you run out of log reservations very
> > > > > quickly here:
> > > > >
> > > > > https://lore.kernel.org/all/20240329162936.GI6390@frogsfrogsfrogs/
> > > > >
> > > > > how does your scheme deal with that?
> > > > >
> > > > The resblks calculation in xfs_reflink_end_atomic_cow() takes care of this,
> > > > right? Or does the log reservation have a hard size limit, regardless of
> > > > that calculation?
> > >
> > > The resblks calculated there are the reserved disk blocks
>
> Used for btree block allocations that might be needed during the
> processing of the transaction.
>
> > > and have
> > > > nothing to do with the log reservations, which come from the
> > > > tr_write field passed in. There is some kind of upper limit to it,
> > > > obviously imposed by the log size, although I'm not sure if we've formalized
> > > that somewhere. Dave might be the right person to ask about that.
> >
> > The (very very rough) upper limit for how many intent items you can
> > attach to a tr_write transaction is:
> >
> > per_extent_cost = (cui_size + rui_size + bui_size + efi_size + ili_size)
> > max_blocks = tr_write::tr_logres / per_extent_cost
> >
> > (ili_size is the inode log item size)
>
> That doesn't sound right. The number of intents we can log is not
> dependent on the aggregated size of all intent types. We do not log
> all those intent types in a single transaction, nor do we process
> more than one type of intent in a given transaction. Also, we only
> log the inode once per transaction, so that is not a per-extent
> overhead.
>
> Realistically, the tr_write transaction is going to be at least
> 100kB because it has to be big enough to log full splits of multiple
> btrees (e.g. BMBT + both free space trees). Yeah, a small 4kB
> filesystem spits out:
>
> xfs_trans_resv_calc: dev 7:0 type 0 logres 193528 logcount 5 flags 0x4
>
> About 190kB.
>
> However, intents are typically very small - around 32 bytes in size
> plus another 12 bytes for the log region ophdr.
>
> This implies that we can fit thousands of individual intents in a
> single tr_write log reservation on any given filesystem, and the
> number of loop iterations in a transaction is therefore dependent
> largely on how many intents are logged per iteration.
>
> Hence if we are walking a range of extents in the BMBT to unmap
> them, then we should only be generating 2 intents per loop - a BUI
> for the BMBT removal and a CUI for the shared refcount decrease.
> That means we should be able to run at least a thousand iterations
> of that loop per transaction without getting anywhere near the
> transaction reservation limits.
>
> *However!*
>
> We have to relog every intent we haven't processed in the deferred
> batch every-so-often to prevent the outstanding intents from pinning
> the tail of the log. Hence the larger the number of intents in the
> initial batch, the more work we have to do later on (and the more
> overall log space and bandwidth they will consume) to relog them
> over and over again until they pop to the head of the
> processing queue.
>
> Hence there is no real performance advantage to creating massive intent
> batches because we end up doing more work later on to relog those
> intents to prevent journal space deadlocks. It also doesn't speed up
> processing, because we still process the intent chains one at a time
> from start to completion before moving on to the next high level
> intent chain that needs to be processed.
>
> Further, after the first couple of intent chains have been
> processed, the initial log space reservation will have run out, and
> we are now asking for a new reservation on every transaction roll we
> do, i.e. we are now doing a log space reservation on every
> transaction roll in the processing chain instead of only doing it
> once per high level intent chain.
>
> Hence from a log space accounting perspective (the hottest code path
> in the journal), it is far more efficient to perform a single high
> level transaction per extent unmap operation than it is to batch
> intents into a single high level transaction.
>
> My advice is this: we should never batch high level iterative
> intent-based operations into a single transaction because it's a
> false optimisation. It might look like it is an efficiency
> improvement from the high level, but it ends up hammering the hot,
> performance critical paths in the transaction subsystem much, much
> harder and so will end up being slower than the single transaction
> per intent-based operation algorithm when it matters most....
How specifically do you propose remapping all the extents in a file
range after an untorn write? The regular cow ioend does a single
transaction per extent across the entire ioend range and cannot deliver
untorn writes. This latest proposal does, but now you've torn that idea
down too.
At this point I have run out of ideas and conclude that I can only submit
to your superior intellect.
--D
> -Dave.
> --
> Dave Chinner
> david@fromorbit.com
>
^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [PATCH v5 03/10] xfs: Refactor xfs_reflink_end_cow_extent()
2025-03-13 4:51 ` Darrick J. Wong
@ 2025-03-13 6:11 ` John Garry
2025-03-18 0:43 ` Dave Chinner
2025-03-13 7:21 ` Dave Chinner
1 sibling, 1 reply; 63+ messages in thread
From: John Garry @ 2025-03-13 6:11 UTC (permalink / raw)
To: Darrick J. Wong, Dave Chinner
Cc: Christoph Hellwig, brauner, cem, linux-xfs, linux-fsdevel,
linux-kernel, ojaswin, ritesh.list, martin.petersen
On 13/03/2025 04:51, Darrick J. Wong wrote:
>> Hence if we are walking a range of extents in the BMBT to unmap
>> them, then we should only be generating 2 intents per loop - a BUI
>> for the BMBT removal and a CUI for the shared refcount decrease.
>> That means we should be able to run at least a thousand iterations
>> of that loop per transaction without getting anywhere near the
>> transaction reservation limits.
>>
>> *However!*
>>
>> We have to relog every intent we haven't processed in the deferred
>> batch every-so-often to prevent the outstanding intents from pinning
>> the tail of the log. Hence the larger the number of intents in the
>> initial batch, the more work we have to do later on (and the more
>> overall log space and bandwidth they will consume) to relog them
>> over and over again until they pop to the head of the
>> processing queue.
>>
>> Hence there is no real performance advantage to creating massive intent
>> batches because we end up doing more work later on to relog those
>> intents to prevent journal space deadlocks. It also doesn't speed up
>> processing, because we still process the intent chains one at a time
>> from start to completion before moving on to the next high level
>> intent chain that needs to be processed.
>>
>> Further, after the first couple of intent chains have been
>> processed, the initial log space reservation will have run out, and
>> we are now asking for a new reservation on every transaction roll we
>> do, i.e. we are now doing a log space reservation on every
>> transaction roll in the processing chain instead of only doing it
>> once per high level intent chain.
>>
>> Hence from a log space accounting perspective (the hottest code path
>> in the journal), it is far more efficient to perform a single high
>> level transaction per extent unmap operation than it is to batch
>> intents into a single high level transaction.
>>
>> My advice is this: we should never batch high level iterative
>> intent-based operations into a single transaction because it's a
>> false optimisation. It might look like it is an efficiency
>> improvement from the high level, but it ends up hammering the hot,
>> performance critical paths in the transaction subsystem much, much
>> harder and so will end up being slower than the single transaction
>> per intent-based operation algorithm when it matters most....
> How specifically do you propose remapping all the extents in a file
> range after an untorn write? The regular cow ioend does a single
> transaction per extent across the entire ioend range and cannot deliver
> untorn writes. This latest proposal does, but now you've torn that idea
> down too.
>
> At this point I have run out of ideas and conclude that I can only submit
> to your superior intellect.
>
> --D
I'm hearing that we can fit thousands without getting anywhere near the
limits - this is good.
But then batching is also not optimal in terms of performance, right?
Performance is not so important here. This is for a software fallback,
which we should not frequently hit. And even if we do, we're still
typically not going to have many extents.
For our specific purpose, we want 16KB atomic writes - that is a max of 4
extents. So this does not really sound like something to be concerned
with for these atomic write sizes.
We can add some arbitrary FS awu max, like 64KB, if that makes people
feel more comfortable.
Thanks,
John
^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [PATCH RFC v5 10/10] iomap: Rename ATOMIC flags again
2025-03-12 23:59 ` Dave Chinner
@ 2025-03-13 6:28 ` John Garry
2025-03-13 7:02 ` Christoph Hellwig
0 siblings, 1 reply; 63+ messages in thread
From: John Garry @ 2025-03-13 6:28 UTC (permalink / raw)
To: Dave Chinner, Christoph Hellwig
Cc: brauner, djwong, cem, linux-xfs, linux-fsdevel, linux-kernel,
ojaswin, ritesh.list, martin.petersen
On 12/03/2025 23:59, Dave Chinner wrote:
>>> */
>>> static inline blk_opf_t iomap_dio_bio_opflags(struct iomap_dio *dio,
>>> - const struct iomap *iomap, bool use_fua, bool atomic_hw)
>>> + const struct iomap *iomap, bool use_fua, bool bio_atomic)
>> Not new here, but these two bools are pretty ugly.
>>
>> I'd rather have a
>>
>> blk_opf_t extra_flags;
>>
>> in the caller that gets REQ_FUA and REQ_ATOMIC assigned as needed,
>> and then just clear
> Yep, that is cleaner..
This suggestion is not clear to me.
Is it that iomap_dio_bio_iter() [the only caller of
iomap_dio_bio_opflags()] sets REQ_FUA and REQ_ATOMIC in extra_flags, and
then we extra_flags | bio_opf?
Note that iomap_dio_bio_opflags() does still use use_fua for clearing
IOMAP_DIO_WRITE_THROUGH.
And to me it seems nicer to set all the REQ_ flags in one place.
^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [PATCH RFC v5 10/10] iomap: Rename ATOMIC flags again
2025-03-13 6:28 ` John Garry
@ 2025-03-13 7:02 ` Christoph Hellwig
2025-03-13 7:41 ` John Garry
0 siblings, 1 reply; 63+ messages in thread
From: Christoph Hellwig @ 2025-03-13 7:02 UTC (permalink / raw)
To: John Garry
Cc: Dave Chinner, Christoph Hellwig, brauner, djwong, cem, linux-xfs,
linux-fsdevel, linux-kernel, ojaswin, ritesh.list,
martin.petersen
On Thu, Mar 13, 2025 at 06:28:03AM +0000, John Garry wrote:
> > > I'd rather have a
> > >
> > > blk_opf_t extra_flags;
> > >
> > > in the caller that gets REQ_FUA and REQ_ATOMIC assigned as needed,
> > > and then just clear
> > Yep, that is cleaner..
>
> This suggestion is not clear to me.
>
> Is it that iomap_dio_bio_iter() [the only caller of iomap_dio_bio_opflags()]
> sets REQ_FUA and REQ_ATOMIC in extra_flags, and then we extra_flags |
> bio_opf?
Yes.
> Note that iomap_dio_bio_opflags() does still use use_fua for clearing
> IOMAP_DIO_WRITE_THROUGH.
You can check for REQ_FUA in extra_flags (or the entire op).
> And to me it seems nicer to set all the REQ_ flags in one place.
Passing multiple bool arguments just loses way too much context. But
if you really want everything in one place you could probably build
the entire blk_opf_t in iomap_dio_bio_iter, and avoid having to
recalculate it for every bio.
^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [PATCH v5 03/10] xfs: Refactor xfs_reflink_end_cow_extent()
2025-03-13 4:51 ` Darrick J. Wong
2025-03-13 6:11 ` John Garry
@ 2025-03-13 7:21 ` Dave Chinner
2025-03-22 5:19 ` Darrick J. Wong
1 sibling, 1 reply; 63+ messages in thread
From: Dave Chinner @ 2025-03-13 7:21 UTC (permalink / raw)
To: Darrick J. Wong
Cc: Christoph Hellwig, John Garry, brauner, cem, linux-xfs,
linux-fsdevel, linux-kernel, ojaswin, ritesh.list,
martin.petersen
On Wed, Mar 12, 2025 at 09:51:21PM -0700, Darrick J. Wong wrote:
> On Thu, Mar 13, 2025 at 12:25:21PM +1100, Dave Chinner wrote:
> > On Wed, Mar 12, 2025 at 08:46:36AM -0700, Darrick J. Wong wrote:
> > > > > > On Mon, Mar 10, 2025 at 06:39:39PM +0000, John Garry wrote:
> > > > > > > Refactor xfs_reflink_end_cow_extent() into separate parts which process
> > > > > > > the CoW range and commit the transaction.
> > > > > > >
> > > > > > > This refactoring will be used in future for when it is required to commit
> > > > > > > a range of extents as a single transaction, similar to how it was done
> > > > > > > pre-commit d6f215f359637.
> > > > > >
> > > > > > Darrick pointed out that if you do more than just a tiny number
> > > > > > of extents per transactions you run out of log reservations very
> > > > > > quickly here:
> > > > > >
> > > > > > https://lore.kernel.org/all/20240329162936.GI6390@frogsfrogsfrogs/
> > > > > >
> > > > > > how does your scheme deal with that?
> > > > > >
> > > > > The resblks calculation in xfs_reflink_end_atomic_cow() takes care of this,
> > > > > right? Or does the log reservation have a hard size limit, regardless of
> > > > > that calculation?
> > > >
> > > > The resblks calculated there are the reserved disk blocks
> >
> > Used for btree block allocations that might be needed during the
> > processing of the transaction.
> >
> > > > and have
> > > > nothing to do with the log reservations, which come from the
> > > > tr_write field passed in. There is some kind of upper limit to it,
> > > > obviously imposed by the log size, although I'm not sure if we've formalized
> > > > that somewhere. Dave might be the right person to ask about that.
> > >
> > > The (very very rough) upper limit for how many intent items you can
> > > attach to a tr_write transaction is:
> > >
> > > per_extent_cost = (cui_size + rui_size + bui_size + efi_size + ili_size)
> > > max_blocks = tr_write::tr_logres / per_extent_cost
> > >
> > > (ili_size is the inode log item size)
> >
> > That doesn't sound right. The number of intents we can log is not
> > dependent on the aggregated size of all intent types. We do not log
> > all those intent types in a single transaction, nor do we process
> > more than one type of intent in a given transaction. Also, we only
> > log the inode once per transaction, so that is not a per-extent
> > overhead.
> >
> > Realistically, the tr_write transaction is going to be at least
> > 100kB because it has to be big enough to log full splits of multiple
> > btrees (e.g. BMBT + both free space trees). Yeah, a small 4kB
> > filesystem spits out:
> >
> > xfs_trans_resv_calc: dev 7:0 type 0 logres 193528 logcount 5 flags 0x4
> >
> > About 190kB.
> >
> > However, intents are typically very small - around 32 bytes in size
> > plus another 12 bytes for the log region ophdr.
> >
> > This implies that we can fit thousands of individual intents in a
> > single tr_write log reservation on any given filesystem, and the
> > number of loop iterations in a transaction is therefore dependent
> > largely on how many intents are logged per iteration.
> >
> > Hence if we are walking a range of extents in the BMBT to unmap
> > them, then we should only be generating 2 intents per loop - a BUI
> > for the BMBT removal and a CUI for the shared refcount decrease.
> > That means we should be able to run at least a thousand iterations
> > of that loop per transaction without getting anywhere near the
> > transaction reservation limits.
> >
> > *However!*
> >
> > We have to relog every intent we haven't processed in the deferred
> > batch every-so-often to prevent the outstanding intents from pinning
> > the tail of the log. Hence the larger the number of intents in the
> > initial batch, the more work we have to do later on (and the more
> > overall log space and bandwidth they will consume) to relog them
> > them over and over again until they pop to the head of the
> > processing queue.
> >
> > Hence there is no real performance advantage to creating massive intent
> > batches because we end up doing more work later on to relog those
> > intents to prevent journal space deadlocks. It also doesn't speed up
> > processing, because we still process the intent chains one at a time
> > from start to completion before moving on to the next high level
> > intent chain that needs to be processed.
> >
> > Further, after the first couple of intent chains have been
> > processed, the initial log space reservation will have run out, and
> > we are now asking for a new reservation on every transaction roll we
> > do, i.e. we are now doing a log space reservation on every
> > transaction roll in the processing chain instead of only doing it
> > once per high level intent chain.
> >
> > Hence from a log space accounting perspective (the hottest code path
> > in the journal), it is far more efficient to perform a single high
> > level transaction per extent unmap operation than it is to batch
> > intents into a single high level transaction.
> >
> > My advice is this: we should never batch high level iterative
> > intent-based operations into a single transaction because it's a
> > false optimisation. It might look like it is an efficiency
> > improvement from the high level, but it ends up hammering the hot,
> > performance critical paths in the transaction subsystem much, much
> > harder and so will end up being slower than the single transaction
> > per intent-based operation algorithm when it matters most....
>
> How specifically do you propose remapping all the extents in a file
> range after an untorn write?
Sorry, I didn't realise that was the context of the question that
was asked - there was not enough context in the email I replied to
to indicate this important detail. Hence it just looked like a
question about "how many intents can we batch into a single write
transaction reservation".
I gave that answer (thousands) and then recommended against doing
batching like this as an optimisation. Batching operations into a
single context is normally done as an optimisation, so that is what
I assumed was being talked about here....
> The regular cow ioend does a single
> transaction per extent across the entire ioend range and cannot deliver
> untorn writes. This latest proposal does, but now you've torn that idea
> down too.
>
> At this point I have run out of ideas and conclude that I can only submit
> to your superior intellect.
I think you're jumping to incorrect conclusions, and then making
needless personal attacks. This is unacceptable behaviour, Darrick,
and if you keep it up you are going to end up having to explain
yourself to the CoC committee....
Anyway....
Now that I understand the assumed context was batching for atomicity and
not optimisation, I'll address that aspect of the suggested
approach: overloading the existing write reservation with a special
case like this is the wrong way to define a new atomic operation.
That is: trying to infer limits of special-case behaviour from a
heuristic calculation based on the write log reservation
is poor design.
The write reservation varies according to the current filesystem's
geometry (filesystem block size, AG size, capacity, etc) and kernel
version. Hence the batch size supported for atomic writes would then
vary unpredictably from filesystem to filesystem and even within the
same filesystem over time.
Given that we have to abort atomic writes up front if the mapping
covering the atomic write range is more fragmented than this
calculated value, this unpredictable limit will be directly exposed
to userspace via the errors it generates.
We do not want anything even vaguely related to transaction
reservation sizes exposed to userspace. They are, and should always
be, entirely internal filesystem implementation details.
A much better way to define an atomic unmap operation is to set a
fixed maximum number of extents that a batched atomic unmap
operation needs to support. With a fixed extent count, we can then
base the transaction reservation size required for the operation on
this number.
We know that the write length is never going to be larger than 2GB
due to MAX_RW_COUNT bounding of user IO. Hence there is a maximum
number of extents that a single write can map to. Whether that's the
size we try to support is a separate discussion, but I point it out as
a hard upper maximum.
Regardless of what we define as the maximum number of extents for
the atomic unmap, we can now calculate the exact log space
reservation that the unmap intents will require. We can then max()
that with the log space reservation that any of those specific
intents will require to process. Now we have an exact log
reservation for an atomic unmap of a known, fixed number of extents.
Then we can look at what the common unmap behaviour is (e.g. through
tracing various test cases), and determine how many extents we
are typically going to convert in a single atomic unmap operation.
That then guides us to an optimal log count for the atomic unmap
reservation.
This gives us a single log space reservation that can handle a known,
fixed number of extents on any filesystem.
It gives us an optimised log count to minimise the number of log
space reservations the common case code is going to need.
It gives us a reservation size that will contribute to defining the
minimum supported log size for the features enabled in the
filesystem.
It gives us a consistent behaviour that we can directly exercise
from userspace with relatively simple test code based on constant
values.
It means the high level unmap code doesn't rely on heuristics to
prevent transaction reservation overflows.
It means the high level code can use bounded loops and compile-time
constants without having to worry that they will ever overrun the
capability of the underlying filesystem or kernel.
Simple, huh?
-Dave.
--
Dave Chinner
david@fromorbit.com
^ permalink raw reply [flat|nested] 63+ messages in thread
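To make that concrete, a sketch of such a fixed-size reservation might
look like the following (illustrative only: XFS_MAX_ATOMIC_UNMAP_EXTENTS
and xfs_calc_atomic_unmap_reservation() are made-up names, and the real
numbers would have to come from the reservation calculations in
xfs_trans_resv.c):
/* fixed, filesystem-independent limit: e.g. 4MB of 4kB blocks */
#define XFS_MAX_ATOMIC_UNMAP_EXTENTS	1024

static void
xfs_calc_atomic_unmap_reservation(
	struct xfs_mount	*mp,
	struct xfs_trans_res	*resp)
{
	/* log space to record a BUI + CUI intent pair for every extent */
	unsigned int	intent_res = XFS_MAX_ATOMIC_UNMAP_EXTENTS *
			(xfs_bui_log_format_sizeof(1) +
			 xfs_cui_log_format_sizeof(1));
	/* log space to process any one of those intents */
	unsigned int	process_res = M_RES(mp)->tr_write.tr_logres;

	resp->tr_logres = max(intent_res, process_res);
	/* optimal log count to be derived from tracing common cases */
	resp->tr_logcount = XFS_DEFAULT_PERM_LOG_COUNT;
	resp->tr_logflags = XFS_TRANS_PERM_LOG_RES;
}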
* Re: [PATCH RFC v5 10/10] iomap: Rename ATOMIC flags again
2025-03-13 7:02 ` Christoph Hellwig
@ 2025-03-13 7:41 ` John Garry
2025-03-13 7:49 ` Christoph Hellwig
0 siblings, 1 reply; 63+ messages in thread
From: John Garry @ 2025-03-13 7:41 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Dave Chinner, brauner, djwong, cem, linux-xfs, linux-fsdevel,
linux-kernel, ojaswin, ritesh.list, martin.petersen
On 13/03/2025 07:02, Christoph Hellwig wrote:
> On Thu, Mar 13, 2025 at 06:28:03AM +0000, John Garry wrote:
>>>> I'd rather have a
>>>>
>>>> blk_opf_t extra_flags;
>>>>
>>>> in the caller that gets REQ_FUA and REQ_ATOMIC assigned as needed,
>>>> and then just clear
>>> Yep, that is cleaner..
>>
>> This suggestion is not clear to me.
>>
>> Is it that iomap_dio_bio_iter() [the only caller of iomap_dio_bio_opflags()]
>> sets REQ_FUA and REQ_ATOMIC in extra_flags, and then we extra_flags |
>> bio_opf?
>
> Yes.
>
>> Note that iomap_dio_bio_opflags() does still use use_fua for clearing
>> IOMAP_DIO_WRITE_THROUGH.
>
> You can check for REQ_FUA in extra_flags (or the entire op).
> > And to me it seems nicer to set all the REQ_ flags in one place.
>
> Passing multiple bool arguments just loses way too much context. But
> if you really want everything in one place you could probably build
> the entire blk_opf_t in iomap_dio_bio_iter, and avoid having to
> recalculate it for every bio.
>
Yeah, once we take the use_fua and bio_atomic args out of
iomap_dio_bio_opflags(), it becomes a shell of a function.
So how about this (I would re-add the write through comment):
--- a/fs/iomap/direct-io.c
+++ b/fs/iomap/direct-io.c
@@ -311,30 +311,6 @@ static int iomap_dio_zero(const struct iomap_iter *iter, struct iomap_dio *dio,
return 0;
}
-/*
- * Figure out the bio's operation flags from the dio request, the
- * mapping, and whether or not we want FUA. Note that we can end up
- * clearing the WRITE_THROUGH flag in the dio request.
- */
-static inline blk_opf_t iomap_dio_bio_opflags(struct iomap_dio *dio,
- const struct iomap *iomap, bool use_fua, bool atomic_hw)
-{
- blk_opf_t opflags = REQ_SYNC | REQ_IDLE;
-
- if (!(dio->flags & IOMAP_DIO_WRITE))
- return REQ_OP_READ;
-
- opflags |= REQ_OP_WRITE;
- if (use_fua)
- opflags |= REQ_FUA;
- else
- dio->flags &= ~IOMAP_DIO_WRITE_THROUGH;
- if (atomic_hw)
- opflags |= REQ_ATOMIC;
-
- return opflags;
-}
-
static int iomap_dio_bio_iter(struct iomap_iter *iter, struct iomap_dio *dio)
{
const struct iomap *iomap = &iter->iomap;
@@ -346,13 +322,20 @@ static int iomap_dio_bio_iter(struct iomap_iter *iter, struct iomap_dio *dio)
blk_opf_t bio_opf;
struct bio *bio;
bool need_zeroout = false;
- bool use_fua = false;
int nr_pages, ret = 0;
u64 copied = 0;
size_t orig_count;
- if (atomic_hw && length != iter->len)
- return -EINVAL;
+ if (dio->flags & IOMAP_DIO_WRITE) {
+ bio_opf = REQ_OP_WRITE;
+ if (atomic_hw) {
+ if (length != iter->len)
+ return -EINVAL;
+ bio_opf |= REQ_ATOMIC;
+ }
+ } else {
+ bio_opf = REQ_OP_READ;
+ }
if ((pos | length) & (bdev_logical_block_size(iomap->bdev) - 1) ||
!bdev_iter_is_aligned(iomap->bdev, dio->submit.iter))
@@ -382,10 +365,12 @@ static int iomap_dio_bio_iter(struct iomap_iter *iter, struct iomap_dio *dio)
*/
if (!(iomap->flags & (IOMAP_F_SHARED|IOMAP_F_DIRTY)) &&
(dio->flags & IOMAP_DIO_WRITE_THROUGH) &&
- (bdev_fua(iomap->bdev) || !bdev_write_cache(iomap->bdev)))
- use_fua = true;
- else if (dio->flags & IOMAP_DIO_NEED_SYNC)
+ (bdev_fua(iomap->bdev) || !bdev_write_cache(iomap->bdev))) {
+ bio_opf |= REQ_FUA; //reads as well?
+ dio->flags &= ~IOMAP_DIO_WRITE_THROUGH;
+ } else if (dio->flags & IOMAP_DIO_NEED_SYNC) {
dio->flags &= ~IOMAP_DIO_CALLER_COMP;
+ }
}
/*
@@ -407,7 +392,7 @@ static int iomap_dio_bio_iter(struct iomap_iter *iter, struct iomap_dio *dio)
* during completion processing.
*/
if (need_zeroout ||
- ((dio->flags & IOMAP_DIO_NEED_SYNC) && !use_fua) ||
+ ((dio->flags & IOMAP_DIO_NEED_SYNC) && !(bio_opf & REQ_FUA)) ||
((dio->flags & IOMAP_DIO_WRITE) && pos >= i_size_read(inode)))
dio->flags &= ~IOMAP_DIO_CALLER_COMP;
@@ -428,8 +413,6 @@ static int iomap_dio_bio_iter(struct iomap_iter *iter, struct iomap_dio *dio)
goto out;
}
- bio_opf = iomap_dio_bio_opflags(dio, iomap, use_fua, atomic_hw);
-
nr_pages = bio_iov_vecs_to_alloc(dio->submit.iter, BIO_MAX_VECS);
do {
size_t n;
--
^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [PATCH RFC v5 10/10] iomap: Rename ATOMIC flags again
2025-03-13 7:41 ` John Garry
@ 2025-03-13 7:49 ` Christoph Hellwig
2025-03-13 7:53 ` John Garry
0 siblings, 1 reply; 63+ messages in thread
From: Christoph Hellwig @ 2025-03-13 7:49 UTC (permalink / raw)
To: John Garry
Cc: Christoph Hellwig, Dave Chinner, brauner, djwong, cem, linux-xfs,
linux-fsdevel, linux-kernel, ojaswin, ritesh.list,
martin.petersen
On Thu, Mar 13, 2025 at 07:41:11AM +0000, John Garry wrote:
> So how about this (I would re-add the write through comment):
This looks roughly sane. You'd probably want to turn the
iomap_dio_bio_opflags removal into a prep patch, though.
> - blk_opf_t opflags = REQ_SYNC | REQ_IDLE;
This got lost and should move to the bio_opf declaration now.
> + (bdev_fua(iomap->bdev) || !bdev_write_cache(iomap->bdev))) {
> + bio_opf |= REQ_FUA; //reads as well?
REQ_FUA is not defined for reads in Linux. Some of the storage standards
define it for reads, but the semantics are pretty nonsensical.
^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [PATCH RFC v5 10/10] iomap: Rename ATOMIC flags again
2025-03-13 7:49 ` Christoph Hellwig
@ 2025-03-13 7:53 ` John Garry
2025-03-13 8:09 ` Christoph Hellwig
0 siblings, 1 reply; 63+ messages in thread
From: John Garry @ 2025-03-13 7:53 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Dave Chinner, brauner, djwong, cem, linux-xfs, linux-fsdevel,
linux-kernel, ojaswin, ritesh.list, martin.petersen
On 13/03/2025 07:49, Christoph Hellwig wrote:
> On Thu, Mar 13, 2025 at 07:41:11AM +0000, John Garry wrote:
>> So how about this (I would re-add the write through comment):
>
> This looks roughly sane. You'd probably want to turn the
> iomap_dio_bio_opflags removal into a prep patch, though.
Sure
>
>> - blk_opf_t opflags = REQ_SYNC | REQ_IDLE;
>
> This got lost and should move to the bio_opf declaration now.
ok
>
>> + (bdev_fua(iomap->bdev) || !bdev_write_cache(iomap->bdev))) {
>> + bio_opf |= REQ_FUA; //reads as well?
>
> REQ_FUA is not defined for reads in Linux. Some of the storage standards
> define it for reads, but the semantics are pretty nonsensical.
>
ok, so I will need to check for writes when setting that (as it was
previously)
^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [PATCH RFC v5 10/10] iomap: Rename ATOMIC flags again
2025-03-13 7:53 ` John Garry
@ 2025-03-13 8:09 ` Christoph Hellwig
2025-03-13 8:18 ` Christoph Hellwig
0 siblings, 1 reply; 63+ messages in thread
From: Christoph Hellwig @ 2025-03-13 8:09 UTC (permalink / raw)
To: John Garry
Cc: Christoph Hellwig, Dave Chinner, brauner, djwong, cem, linux-xfs,
linux-fsdevel, linux-kernel, ojaswin, ritesh.list,
martin.petersen
On Thu, Mar 13, 2025 at 07:53:22AM +0000, John Garry wrote:
> > REQ_FUA is not defined for reads in Linux. Some of the storage standards
> > define it for reads, but the semantics are pretty nonsensical.
> >
>
> ok, so I will need to check for writes when setting that (as it was
> previously)
Yes. That entire IOMAP_MAPPED branch should just grow an additional
IOMAP_DIO_WRITE check, as all the code in it is only relevant to writes.
In fact that also includes most of the other checks before,
so this could use more refactoring if we want to.
^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [PATCH RFC v5 10/10] iomap: Rename ATOMIC flags again
2025-03-13 8:09 ` Christoph Hellwig
@ 2025-03-13 8:18 ` Christoph Hellwig
2025-03-13 8:24 ` John Garry
2025-03-13 8:28 ` Christoph Hellwig
0 siblings, 2 replies; 63+ messages in thread
From: Christoph Hellwig @ 2025-03-13 8:18 UTC (permalink / raw)
To: John Garry
Cc: Christoph Hellwig, Dave Chinner, brauner, djwong, cem, linux-xfs,
linux-fsdevel, linux-kernel, ojaswin, ritesh.list,
martin.petersen
Something like this (untested):
diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
index 5299f70428ef..3fa21445906a 100644
--- a/fs/iomap/direct-io.c
+++ b/fs/iomap/direct-io.c
@@ -312,27 +312,20 @@ static int iomap_dio_zero(const struct iomap_iter *iter, struct iomap_dio *dio,
}
/*
- * Figure out the bio's operation flags from the dio request, the
- * mapping, and whether or not we want FUA. Note that we can end up
- * clearing the WRITE_THROUGH flag in the dio request.
+ * Use a FUA write if we need datasync semantics and this is a pure data I/O
+ * that doesn't require any metadata updates (including after I/O completion
+ * such as unwritten extent conversion) and the underlying device either
+ * doesn't have a volatile write cache or supports FUA.
+ * This allows us to avoid cache flushes on I/O completion.
*/
-static inline blk_opf_t iomap_dio_bio_opflags(struct iomap_dio *dio,
- const struct iomap *iomap, bool use_fua, bool atomic_hw)
+static inline bool iomap_dio_can_use_fua(const struct iomap *iomap,
+ struct iomap_dio *dio)
{
- blk_opf_t opflags = REQ_SYNC | REQ_IDLE;
-
- if (!(dio->flags & IOMAP_DIO_WRITE))
- return REQ_OP_READ;
-
- opflags |= REQ_OP_WRITE;
- if (use_fua)
- opflags |= REQ_FUA;
- else
- dio->flags &= ~IOMAP_DIO_WRITE_THROUGH;
- if (atomic_hw)
- opflags |= REQ_ATOMIC;
-
- return opflags;
+ if (iomap->flags & (IOMAP_F_SHARED | IOMAP_F_DIRTY))
+ return false;
+ if (!(dio->flags & IOMAP_DIO_WRITE_THROUGH))
+ return false;
+ return !bdev_write_cache(iomap->bdev) || bdev_fua(iomap->bdev);
}
static int iomap_dio_bio_iter(struct iomap_iter *iter, struct iomap_dio *dio)
@@ -343,49 +336,53 @@ static int iomap_dio_bio_iter(struct iomap_iter *iter, struct iomap_dio *dio)
bool atomic_hw = iter->flags & IOMAP_ATOMIC_HW;
const loff_t length = iomap_length(iter);
loff_t pos = iter->pos;
- blk_opf_t bio_opf;
+ blk_opf_t bio_opf = REQ_SYNC | REQ_IDLE;
struct bio *bio;
bool need_zeroout = false;
- bool use_fua = false;
int nr_pages, ret = 0;
u64 copied = 0;
size_t orig_count;
- if (atomic_hw && length != iter->len)
- return -EINVAL;
-
if ((pos | length) & (bdev_logical_block_size(iomap->bdev) - 1) ||
!bdev_iter_is_aligned(iomap->bdev, dio->submit.iter))
return -EINVAL;
- if (iomap->type == IOMAP_UNWRITTEN) {
- dio->flags |= IOMAP_DIO_UNWRITTEN;
- need_zeroout = true;
- }
+ if (dio->flags & IOMAP_DIO_WRITE) {
+ bio_opf |= REQ_OP_WRITE;
+
+ if (atomic_hw) {
+ if (length != iter->len)
+ return -EINVAL;
+ bio_opf |= REQ_ATOMIC;
+ }
- if (iomap->flags & IOMAP_F_SHARED)
- dio->flags |= IOMAP_DIO_COW;
+ if (iomap->type == IOMAP_UNWRITTEN) {
+ dio->flags |= IOMAP_DIO_UNWRITTEN;
+ need_zeroout = true;
+ }
- if (iomap->flags & IOMAP_F_NEW) {
- need_zeroout = true;
- } else if (iomap->type == IOMAP_MAPPED) {
- /*
- * Use a FUA write if we need datasync semantics, this is a pure
- * data IO that doesn't require any metadata updates (including
- * after IO completion such as unwritten extent conversion) and
- * the underlying device either supports FUA or doesn't have
- * a volatile write cache. This allows us to avoid cache flushes
- * on IO completion. If we can't use writethrough and need to
- * sync, disable in-task completions as dio completion will
- * need to call generic_write_sync() which will do a blocking
- * fsync / cache flush call.
- */
- if (!(iomap->flags & (IOMAP_F_SHARED|IOMAP_F_DIRTY)) &&
- (dio->flags & IOMAP_DIO_WRITE_THROUGH) &&
- (bdev_fua(iomap->bdev) || !bdev_write_cache(iomap->bdev)))
- use_fua = true;
- else if (dio->flags & IOMAP_DIO_NEED_SYNC)
- dio->flags &= ~IOMAP_DIO_CALLER_COMP;
+ if (iomap->flags & IOMAP_F_SHARED)
+ dio->flags |= IOMAP_DIO_COW;
+
+ if (iomap->flags & IOMAP_F_NEW) {
+ need_zeroout = true;
+ } else if (iomap->type == IOMAP_MAPPED) {
+ if (iomap_dio_can_use_fua(iomap, dio)) {
+ bio_opf |= REQ_FUA;
+ } else {
+ dio->flags &= ~IOMAP_DIO_WRITE_THROUGH;
+ /*
+ * Disable in-task completions if we can't use
+ * writethrough and need to sync as the I/O
+ * completion handler has to force a (blocking)
+ * cache flush.
+ */
+ if (dio->flags & IOMAP_DIO_NEED_SYNC)
+ dio->flags &= ~IOMAP_DIO_CALLER_COMP;
+ }
+ }
+ } else {
+ bio_opf |= REQ_OP_READ;
}
/*
@@ -407,7 +404,7 @@ static int iomap_dio_bio_iter(struct iomap_iter *iter, struct iomap_dio *dio)
* during completion processing.
*/
if (need_zeroout ||
- ((dio->flags & IOMAP_DIO_NEED_SYNC) && !use_fua) ||
+ ((dio->flags & IOMAP_DIO_NEED_SYNC) && !(bio_opf & REQ_FUA)) ||
((dio->flags & IOMAP_DIO_WRITE) && pos >= i_size_read(inode)))
dio->flags &= ~IOMAP_DIO_CALLER_COMP;
@@ -428,8 +425,6 @@ static int iomap_dio_bio_iter(struct iomap_iter *iter, struct iomap_dio *dio)
goto out;
}
- bio_opf = iomap_dio_bio_opflags(dio, iomap, use_fua, atomic_hw);
-
nr_pages = bio_iov_vecs_to_alloc(dio->submit.iter, BIO_MAX_VECS);
do {
size_t n;
^ permalink raw reply related [flat|nested] 63+ messages in thread
* Re: [PATCH RFC v5 10/10] iomap: Rename ATOMIC flags again
2025-03-13 8:18 ` Christoph Hellwig
@ 2025-03-13 8:24 ` John Garry
2025-03-13 8:28 ` Christoph Hellwig
1 sibling, 0 replies; 63+ messages in thread
From: John Garry @ 2025-03-13 8:24 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Dave Chinner, brauner, djwong, cem, linux-xfs, linux-fsdevel,
linux-kernel, ojaswin, ritesh.list, martin.petersen
On 13/03/2025 08:18, Christoph Hellwig wrote:
> Something like this (untested):
looks sane, I'll check it further and pick it up.
Thanks
^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [PATCH RFC v5 10/10] iomap: Rename ATOMIC flags again
2025-03-13 8:18 ` Christoph Hellwig
2025-03-13 8:24 ` John Garry
@ 2025-03-13 8:28 ` Christoph Hellwig
1 sibling, 0 replies; 63+ messages in thread
From: Christoph Hellwig @ 2025-03-13 8:28 UTC (permalink / raw)
To: John Garry
Cc: Christoph Hellwig, Dave Chinner, brauner, djwong, cem, linux-xfs,
linux-fsdevel, linux-kernel, ojaswin, ritesh.list,
martin.petersen
On Thu, Mar 13, 2025 at 01:18:26AM -0700, Christoph Hellwig wrote:
> Something like this (untested):
In fact this can be further simplified, as the clearing of
IOMAP_DIO_CALLER_COMP has duplicate logic, and only applies to writes.
So something like this would do even better:
diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
index 5299f70428ef..e9b03b9dae9e 100644
--- a/fs/iomap/direct-io.c
+++ b/fs/iomap/direct-io.c
@@ -312,27 +312,20 @@ static int iomap_dio_zero(const struct iomap_iter *iter, struct iomap_dio *dio,
}
/*
- * Figure out the bio's operation flags from the dio request, the
- * mapping, and whether or not we want FUA. Note that we can end up
- * clearing the WRITE_THROUGH flag in the dio request.
+ * Use a FUA write if we need datasync semantics and this is a pure data I/O
+ * that doesn't require any metadata updates (including after I/O completion
+ * such as unwritten extent conversion) and the underlying device either
+ * doesn't have a volatile write cache or supports FUA.
+ * This allows us to avoid cache flushes on I/O completion.
*/
-static inline blk_opf_t iomap_dio_bio_opflags(struct iomap_dio *dio,
- const struct iomap *iomap, bool use_fua, bool atomic_hw)
+static inline bool iomap_dio_can_use_fua(const struct iomap *iomap,
+ struct iomap_dio *dio)
{
- blk_opf_t opflags = REQ_SYNC | REQ_IDLE;
-
- if (!(dio->flags & IOMAP_DIO_WRITE))
- return REQ_OP_READ;
-
- opflags |= REQ_OP_WRITE;
- if (use_fua)
- opflags |= REQ_FUA;
- else
- dio->flags &= ~IOMAP_DIO_WRITE_THROUGH;
- if (atomic_hw)
- opflags |= REQ_ATOMIC;
-
- return opflags;
+ if (iomap->flags & (IOMAP_F_SHARED | IOMAP_F_DIRTY))
+ return false;
+ if (!(dio->flags & IOMAP_DIO_WRITE_THROUGH))
+ return false;
+ return !bdev_write_cache(iomap->bdev) || bdev_fua(iomap->bdev);
}
static int iomap_dio_bio_iter(struct iomap_iter *iter, struct iomap_dio *dio)
@@ -340,52 +333,59 @@ static int iomap_dio_bio_iter(struct iomap_iter *iter, struct iomap_dio *dio)
const struct iomap *iomap = &iter->iomap;
struct inode *inode = iter->inode;
unsigned int fs_block_size = i_blocksize(inode), pad;
- bool atomic_hw = iter->flags & IOMAP_ATOMIC_HW;
const loff_t length = iomap_length(iter);
loff_t pos = iter->pos;
- blk_opf_t bio_opf;
+ blk_opf_t bio_opf = REQ_SYNC | REQ_IDLE;
struct bio *bio;
bool need_zeroout = false;
- bool use_fua = false;
int nr_pages, ret = 0;
u64 copied = 0;
size_t orig_count;
- if (atomic_hw && length != iter->len)
- return -EINVAL;
-
if ((pos | length) & (bdev_logical_block_size(iomap->bdev) - 1) ||
!bdev_iter_is_aligned(iomap->bdev, dio->submit.iter))
return -EINVAL;
- if (iomap->type == IOMAP_UNWRITTEN) {
- dio->flags |= IOMAP_DIO_UNWRITTEN;
- need_zeroout = true;
- }
+ if (dio->flags & IOMAP_DIO_WRITE) {
+ bio_opf |= REQ_OP_WRITE;
+
+ if (iter->flags & IOMAP_ATOMIC_HW) {
+ if (length != iter->len)
+ return -EINVAL;
+ bio_opf |= REQ_ATOMIC;
+ }
+
+ if (iomap->type == IOMAP_UNWRITTEN) {
+ dio->flags |= IOMAP_DIO_UNWRITTEN;
+ need_zeroout = true;
+ }
- if (iomap->flags & IOMAP_F_SHARED)
- dio->flags |= IOMAP_DIO_COW;
+ if (iomap->flags & IOMAP_F_SHARED)
+ dio->flags |= IOMAP_DIO_COW;
- if (iomap->flags & IOMAP_F_NEW) {
- need_zeroout = true;
- } else if (iomap->type == IOMAP_MAPPED) {
+ if (iomap->flags & IOMAP_F_NEW) {
+ need_zeroout = true;
+ } else if (iomap->type == IOMAP_MAPPED) {
+ if (iomap_dio_can_use_fua(iomap, dio))
+ bio_opf |= REQ_FUA;
+ else
+ dio->flags &= ~IOMAP_DIO_WRITE_THROUGH;
+ }
+
/*
- * Use a FUA write if we need datasync semantics, this is a pure
- * data IO that doesn't require any metadata updates (including
- * after IO completion such as unwritten extent conversion) and
- * the underlying device either supports FUA or doesn't have
- * a volatile write cache. This allows us to avoid cache flushes
- * on IO completion. If we can't use writethrough and need to
- * sync, disable in-task completions as dio completion will
- * need to call generic_write_sync() which will do a blocking
- * fsync / cache flush call.
+ * We can only do deferred completion for pure overwrites that
+ * don't require additional I/O at completion time.
+ *
+ * This rules out writes that need zeroing or extent conversion,
+ * extend the file size, or issue metadata I/O or cache flushes
+ * during completion processing.
*/
- if (!(iomap->flags & (IOMAP_F_SHARED|IOMAP_F_DIRTY)) &&
- (dio->flags & IOMAP_DIO_WRITE_THROUGH) &&
- (bdev_fua(iomap->bdev) || !bdev_write_cache(iomap->bdev)))
- use_fua = true;
- else if (dio->flags & IOMAP_DIO_NEED_SYNC)
+ if (need_zeroout || (pos >= i_size_read(inode)) ||
+ ((dio->flags & IOMAP_DIO_NEED_SYNC) &&
+ !(bio_opf & REQ_FUA)))
dio->flags &= ~IOMAP_DIO_CALLER_COMP;
+ } else {
+ bio_opf |= REQ_OP_READ;
}
/*
@@ -399,18 +399,6 @@ static int iomap_dio_bio_iter(struct iomap_iter *iter, struct iomap_dio *dio)
if (!iov_iter_count(dio->submit.iter))
goto out;
- /*
- * We can only do deferred completion for pure overwrites that
- * don't require additional IO at completion. This rules out
- * writes that need zeroing or extent conversion, extend
- * the file size, or issue journal IO or cache flushes
- * during completion processing.
- */
- if (need_zeroout ||
- ((dio->flags & IOMAP_DIO_NEED_SYNC) && !use_fua) ||
- ((dio->flags & IOMAP_DIO_WRITE) && pos >= i_size_read(inode)))
- dio->flags &= ~IOMAP_DIO_CALLER_COMP;
-
/*
* The rules for polled IO completions follow the guidelines as the
* ones we set for inline and deferred completions. If none of those
@@ -428,8 +416,6 @@ static int iomap_dio_bio_iter(struct iomap_iter *iter, struct iomap_dio *dio)
goto out;
}
- bio_opf = iomap_dio_bio_opflags(dio, iomap, use_fua, atomic_hw);
-
nr_pages = bio_iov_vecs_to_alloc(dio->submit.iter, BIO_MAX_VECS);
do {
size_t n;
@@ -461,7 +447,7 @@ static int iomap_dio_bio_iter(struct iomap_iter *iter, struct iomap_dio *dio)
}
n = bio->bi_iter.bi_size;
- if (WARN_ON_ONCE(atomic_hw && n != length)) {
+ if (WARN_ON_ONCE((bio_opf & REQ_ATOMIC) && n != length)) {
/*
* This bio should have covered the complete length,
* which it doesn't, so error. We may need to zero out
^ permalink raw reply related [flat|nested] 63+ messages in thread
* Re: [PATCH v5 03/10] xfs: Refactor xfs_reflink_end_cow_extent()
2025-03-13 6:11 ` John Garry
@ 2025-03-18 0:43 ` Dave Chinner
0 siblings, 0 replies; 63+ messages in thread
From: Dave Chinner @ 2025-03-18 0:43 UTC (permalink / raw)
To: John Garry
Cc: Darrick J. Wong, Christoph Hellwig, brauner, cem, linux-xfs,
linux-fsdevel, linux-kernel, ojaswin, ritesh.list,
martin.petersen
On Thu, Mar 13, 2025 at 06:11:12AM +0000, John Garry wrote:
> On 13/03/2025 04:51, Darrick J. Wong wrote:
> > > Hence if we are walking a range of extents in the BMBT to unmap
> > > them, then we should only be generating 2 intents per loop - a BUI
> > > for the BMBT removal and a CUI for the shared refcount decrease.
> > > That means we should be able to run at least a thousand iterations
> > > of that loop per transaction without getting anywhere near the
> > > transaction reservation limits.
> > >
> > > *However!*
> > >
> > > We have to relog every intent we haven't processed in the deferred
> > > batch every-so-often to prevent the outstanding intents from pinning
> > > the tail of the log. Hence the larger the number of intents in the
> > > initial batch, the more work we have to do later on (and the more
> > > overall log space and bandwidth they will consume) to relog them
> > > over and over again until they pop to the head of the
> > > processing queue.
> > >
> > > Hence there is no real performance advantage to creating massive intent
> > > batches because we end up doing more work later on to relog those
> > > intents to prevent journal space deadlocks. It also doesn't speed up
> > > processing, because we still process the intent chains one at a time
> > > from start to completion before moving on to the next high level
> > > intent chain that needs to be processed.
> > >
> > > Further, after the first couple of intent chains have been
> > > processed, the initial log space reservation will have run out, and
> > > we are now asking for a new reservation on every transaction roll we
> > > do. i.e. we are now doing a log space reservation on every
> > > transaction roll in the processing chain instead of only doing it
> > > once per high level intent chain.
> > >
> > > Hence from a log space accounting perspective (the hottest code path
> > > in the journal), it is far more efficient to perform a single high
> > > level transaction per extent unmap operation than it is to batch
> > > intents into a single high level transaction.
> > >
> > > My advice is this: we should never batch high level iterative
> > > intent-based operations into a single transaction because it's a
> > > false optimisation. It might look like it is an efficiency
> > > improvement from the high level, but it ends up hammering the hot,
> > > performance critical paths in the transaction subsystem much, much
> > > harder and so will end up being slower than the single transaction
> > > per intent-based operation algorithm when it matters most....
> > How specifically do you propose remapping all the extents in a file
> > range after an untorn write? The regular cow ioend does a single
> > transaction per extent across the entire ioend range and cannot deliver
> > untorn writes. This latest proposal does, but now you've torn that idea
> > down too.
> >
> > At this point I have run out of ideas and conclude that I can only submit
> > to your superior intellect.
> >
> > --D
>
> I'm hearing that we can fit thousands without getting anywhere near the limits -
> this is good.
>
> But then batching is also not optimal in terms of performance, right?
> Performance is not so important here. This is for a software fallback, which
> we should not frequently hit. And even if we do, we're still typically not
> going to have many extents.
>
> For our specific purpose, we want 16KB atomic writes - that is a max of 4
> extents. So this does not really sound like something to be concerned with
> for these atomic write sizes.
Apart from the fact that we should not be overloading some other
transaction reservation definition for this special case? Saying
"it should work" does not justify not thinking about constraints,
layered design, application exposure to error cases, overruns, etc.
i.e. the whole point of the software fallback is to make atomic
writes largely generic. If we say "limit them to 16kB", it's not
really generic, is it?
> We can add some arbitrary FS awu max, like 64KB, if that makes people feel
> more comfortable.
I was thinking more like 4-16MB as a usable maximum size for atomic
writes. i.e. allow for whole-file atomic overwrites for small- to
medium-sized files, and decent IO sizes for performance when overwriting
large files.
If we set the max at 4MB, that's 1024 extents on a 4kB
block size filesystem. That gives us 2048 intents in a single unmap
operation, for which we can directly calculate the transaction
reservation size needed. We need to do this as two separate
reservation steps with a max() calculation, because the processing
reservation size reduces with filesystem block size but the extent
unmap intent overhead goes up as the block count increases with
decreasing block size. i.e. the two components of the transaction
reservation scale in different directions.
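[Illustrative aside] A back-of-the-envelope sketch of that
two-component max() calculation; every name and constant below is an
assumption for the example, not existing kernel API:

	/*
	 * Illustrative sketch only. Per the discussion above, each
	 * unmapped extent logs two intents (a BUI and a CUI), each
	 * roughly 32 bytes of payload plus a 12-byte log region ophdr.
	 */
	#define INTENT_BYTES		32
	#define LOG_OPHDR_BYTES		12
	#define INTENTS_PER_EXTENT	2

	static unsigned int
	atomic_unmap_intent_res(unsigned int max_extents)
	{
		return max_extents * INTENTS_PER_EXTENT *
			(INTENT_BYTES + LOG_OPHDR_BYTES);
	}

	/*
	 * The overall reservation takes the max() of the intent
	 * component (grows with extent count, i.e. with decreasing
	 * block size) and the processing component (a function of
	 * block size, passed in here as a precomputed value).
	 */
	static unsigned int
	atomic_unmap_res(unsigned int max_extents, unsigned int proc_res)
	{
		unsigned int intent_res = atomic_unmap_intent_res(max_extents);

		return intent_res > proc_res ? intent_res : proc_res;
	}

For the 4MB-on-4kB case above, atomic_unmap_intent_res(1024) comes to
2048 * 44 bytes, i.e. roughly 88kB - comfortably within the ~190kB
tr_write logres quoted earlier in the thread.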
If we are adding a new atomic transaction, we really need to design
it properly from the ground up, not hack around an existing
transaction reservation whilst handwaving about how "it should be
enough".
This "maximum size" is going to be exposed directly to userspace,
hence we need to think carefully about what the maximum supported
limits are going to be and how we can support them with a minimum of
effort long into the future. Hacking around the existing write
transaction isn't the way to do this, especially as that may change
in future as we modify internal allocation behaviour over time.
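[Illustrative aside] The userspace side of that contract is the statx
atomic write geometry. A minimal query sketch, assuming a kernel and
glibc whose headers already carry the STATX_WRITE_ATOMIC plumbing:

	#define _GNU_SOURCE
	#include <stdio.h>
	#include <fcntl.h>	/* AT_FDCWD */
	#include <sys/stat.h>	/* statx() */

	int main(int argc, char **argv)
	{
		struct statx stx;

		if (argc < 2)
			return 1;

		/* Ask for the atomic write limits the filesystem advertises. */
		if (statx(AT_FDCWD, argv[1], 0, STATX_WRITE_ATOMIC, &stx) != 0) {
			perror("statx");
			return 1;
		}

		printf("awu min: %u\n", stx.stx_atomic_write_unit_min);
		printf("awu max: %u\n", stx.stx_atomic_write_unit_max);
		return 0;
	}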
Saying "we only need 16kB right now, so that's all we should
support" isn't the right approach to take here....
-Dave.
--
Dave Chinner
david@fromorbit.com
^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [PATCH v5 03/10] xfs: Refactor xfs_reflink_end_cow_extent()
2025-03-13 7:21 ` Dave Chinner
@ 2025-03-22 5:19 ` Darrick J. Wong
0 siblings, 0 replies; 63+ messages in thread
From: Darrick J. Wong @ 2025-03-22 5:19 UTC (permalink / raw)
To: Dave Chinner
Cc: Christoph Hellwig, John Garry, brauner, cem, linux-xfs,
linux-fsdevel, linux-kernel, ojaswin, ritesh.list,
martin.petersen
On Thu, Mar 13, 2025 at 06:21:29PM +1100, Dave Chinner wrote:
> On Wed, Mar 12, 2025 at 09:51:21PM -0700, Darrick J. Wong wrote:
> > On Thu, Mar 13, 2025 at 12:25:21PM +1100, Dave Chinner wrote:
> > > On Wed, Mar 12, 2025 at 08:46:36AM -0700, Darrick J. Wong wrote:
> > > > > > > On Mon, Mar 10, 2025 at 06:39:39PM +0000, John Garry wrote:
> > > > > > > > Refactor xfs_reflink_end_cow_extent() into separate parts which process
> > > > > > > > the CoW range and commit the transaction.
> > > > > > > >
> > > > > > > > This refactoring will be used in future for when it is required to commit
> > > > > > > > a range of extents as a single transaction, similar to how it was done
> > > > > > > > pre-commit d6f215f359637.
> > > > > > >
> > > > > > > Darrick pointed out that if you do more than just a tiny number
> > > > > > > of extents per transactions you run out of log reservations very
> > > > > > > quickly here:
> > > > > > >
> > > > > > > https://urldefense.com/v3/__https://lore.kernel.org/all/20240329162936.GI6390@frogsfrogsfrogs/__;!!ACWV5N9M2RV99hQ!PWLcBof1tKimKUObvCj4vOhljWjFmjtzVHLx9apcU5Rah1xZnmp_3PIq6eSwx6TdEXzMLYYyBfmZLgvj$
> > > > > > >
> > > > > > > how does your scheme deal with that?
> > > > > > >
> > > > > > The resblks calculation in xfs_reflink_end_atomic_cow() takes care of this,
> > > > > > right? Or does the log reservation have a hard size limit, regardless of
> > > > > > that calculation?
> > > > >
> > > > > The resblks calculated there are the reserved disk blocks
> > >
> > > Used for btree block allocations that might be needed during the
> > > processing of the transaction.
> > >
> > > > > and have
> > > > > nothing to do with the log reservations, which come from the
> > > > > tr_write field passed in. There is some kind of upper limit to it
> > > > > obviously by the log size, although I'm not sure if we've formalized
> > > > > that somewhere. Dave might be the right person to ask about that.
> > > >
> > > > The (very very rough) upper limit for how many intent items you can
> > > > attach to a tr_write transaction is:
> > > >
> > > > per_extent_cost = (cui_size + rui_size + bui_size + efi_size + ili_size)
> > > > max_blocks = tr_write::tr_logres / per_extent_cost
> > > >
> > > > (ili_size is the inode log item size)
> > >
> > > That doesn't sound right. The number of intents we can log is not
> > > dependent on the aggregated size of all intent types. We do not log
> > > all those intent types in a single transaction, nor do we process
> > > more than one type of intent in a given transaction. Also, we only
> > > log the inode once per transaction, so that is not a per-extent
> > > overhead.
> > >
> > > Realistically, the tr_write transaction is going to be at least a
> > > 100kB because it has to be big enough to log full splits of multiple
> > > btrees (e.g. BMBT + both free space trees). Yeah, a small 4kB
> > > filesystem spits out:
> > >
> > > xfs_trans_resv_calc: dev 7:0 type 0 logres 193528 logcount 5 flags 0x4
> > >
> > > About 190kB.
> > >
> > > However, intents are typically very small - around 32 bytes in size
> > > plus another 12 bytes for the log region ophdr.
> > >
> > > This implies that we can fit thousands of individual intents in a
> > > single tr_write log reservation on any given filesystem, and the
> > > number of loop iterations in a transaction is therefore dependent
> > > largely on how many intents are logged per iteration.
> > >
> > > Hence if we are walking a range of extents in the BMBT to unmap
> > > them, then we should only be generating 2 intents per loop - a BUI
> > > for the BMBT removal and a CUI for the shared refcount decrease.
> > > That means we should be able to run at least a thousand iterations
> > > of that loop per transaction without getting anywhere near the
> > > transaction reservation limits.
> > >
> > > *However!*
> > >
> > > We have to relog every intent we haven't processed in the deferred
> > > batch every-so-often to prevent the outstanding intents from pinning
> > > the tail of the log. Hence the larger the number of intents in the
> > > initial batch, the more work we have to do later on (and the more
> > > overall log space and bandwidth they will consume) to relog them
> > > over and over again until they pop to the head of the
> > > processing queue.
> > >
> > > Hence there is no real performance advantage to creating massive intent
> > > batches because we end up doing more work later on to relog those
> > > intents to prevent journal space deadlocks. It also doesn't speed up
> > > processing, because we still process the intent chains one at a time
> > > from start to completion before moving on to the next high level
> > > intent chain that needs to be processed.
> > >
> > > Further, after the first couple of intent chains have been
> > > processed, the initial log space reservation will have run out, and
> > > we are now asking for a new reservation on every transaction roll we
> > > do. i.e. we are now doing a log space reservation on every
> > > transaction roll in the processing chain instead of only doing it
> > > once per high level intent chain.
> > >
> > > Hence from a log space accounting perspective (the hottest code path
> > > in the journal), it is far more efficient to perform a single high
> > > level transaction per extent unmap operation than it is to batch
> > > intents into a single high level transaction.
> > >
> > > My advice is this: we should never batch high level iterative
> > > intent-based operations into a single transaction because it's a
> > > false optimisation. It might look like it is an efficiency
> > > improvement from the high level, but it ends up hammering the hot,
> > > performance critical paths in the transaction subsystem much, much
> > > harder and so will end up being slower than the single transaction
> > > per intent-based operation algorithm when it matters most....
> >
> > How specifically do you propose remapping all the extents in a file
> > range after an untorn write?
>
> Sorry, I didn't realise that was the context of the question that
> was asked - there was not enough context in the email I replied to
> to indicate this important detail. Hence it just looked like a
> question about "how many intents can we batch into a single write
> transaction reservation".
>
> I gave that answer (thousands) and then recommended against doing
> batching like this as an optimisation. Batching operations into a
> single context is normally done as an optimisation, so that is what
> I assumed was being talked about here....
>
> > The regular cow ioend does a single
> > transaction per extent across the entire ioend range and cannot deliver
> > untorn writes. This latest proposal does, but now you've torn that idea
> > down too.
> >
> > At this point I have run out of ideas and conclude that I can only submit
> > to your superior intellect.
>
> I think you're jumping to incorrect conclusions, and then making
> needless personal attacks. This is unacceptable behaviour, Darrick,
> and if you keep it up you are going to end up having to explain
> yourself to the CoC committee....
I apologize to everyone here for using sarcasm. Anyone should be able
to participate in the FOSS community without fear of being snarked at
or having their choices undermined. Senior developers ought to enable
collaboration within community norms in any way they can. I will take
this opportunity to remind myself of how I would like to interact with
other people.
--D
^ permalink raw reply [flat|nested] 63+ messages in thread
end of thread, other threads:[~2025-03-22 5:19 UTC | newest]
Thread overview: 63+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-03-10 18:39 [PATCH v5 00/10] large atomic writes for xfs with CoW John Garry
2025-03-10 18:39 ` [PATCH v5 01/10] xfs: Pass flags to xfs_reflink_allocate_cow() John Garry
2025-03-12 7:15 ` Christoph Hellwig
2025-03-12 8:19 ` John Garry
2025-03-10 18:39 ` [PATCH v5 02/10] xfs: Switch atomic write size check in xfs_file_write_iter() John Garry
2025-03-12 7:17 ` Christoph Hellwig
2025-03-12 8:21 ` John Garry
2025-03-10 18:39 ` [PATCH v5 03/10] xfs: Refactor xfs_reflink_end_cow_extent() John Garry
2025-03-12 7:24 ` Christoph Hellwig
2025-03-12 8:27 ` John Garry
2025-03-12 8:35 ` Christoph Hellwig
2025-03-12 15:46 ` Darrick J. Wong
2025-03-12 22:06 ` John Garry
2025-03-12 23:22 ` Darrick J. Wong
2025-03-13 1:25 ` Dave Chinner
2025-03-13 4:51 ` Darrick J. Wong
2025-03-13 6:11 ` John Garry
2025-03-18 0:43 ` Dave Chinner
2025-03-13 7:21 ` Dave Chinner
2025-03-22 5:19 ` Darrick J. Wong
2025-03-10 18:39 ` [PATCH v5 04/10] xfs: Reflink CoW-based atomic write support John Garry
2025-03-12 7:27 ` Christoph Hellwig
2025-03-12 9:13 ` John Garry
2025-03-12 13:45 ` Christoph Hellwig
2025-03-12 14:48 ` John Garry
2025-03-10 18:39 ` [PATCH v5 05/10] xfs: Iomap SW-based " John Garry
2025-03-12 7:37 ` Christoph Hellwig
2025-03-12 9:00 ` John Garry
2025-03-12 13:52 ` Christoph Hellwig
2025-03-12 14:57 ` John Garry
2025-03-12 15:55 ` Christoph Hellwig
2025-03-12 16:11 ` John Garry
2025-03-10 18:39 ` [PATCH v5 06/10] xfs: Add xfs_file_dio_write_atomic() John Garry
2025-03-10 18:39 ` [PATCH v5 07/10] xfs: Commit CoW-based atomic writes atomically John Garry
2025-03-12 7:39 ` Christoph Hellwig
2025-03-12 9:04 ` John Garry
2025-03-12 13:54 ` Christoph Hellwig
2025-03-12 15:01 ` John Garry
2025-03-10 18:39 ` [PATCH v5 08/10] xfs: Update atomic write max size John Garry
2025-03-11 14:40 ` Carlos Maiolino
2025-03-12 7:41 ` Christoph Hellwig
2025-03-12 8:09 ` John Garry
2025-03-12 8:13 ` Christoph Hellwig
2025-03-12 8:14 ` John Garry
2025-03-10 18:39 ` [PATCH v5 09/10] xfs: Allow block allocator to take an alignment hint John Garry
2025-03-12 7:42 ` Christoph Hellwig
2025-03-12 8:05 ` John Garry
2025-03-12 13:45 ` Christoph Hellwig
2025-03-12 14:47 ` John Garry
2025-03-12 16:00 ` Darrick J. Wong
2025-03-12 16:28 ` John Garry
2025-03-10 18:39 ` [PATCH RFC v5 10/10] iomap: Rename ATOMIC flags again John Garry
2025-03-12 7:13 ` Christoph Hellwig
2025-03-12 23:59 ` Dave Chinner
2025-03-13 6:28 ` John Garry
2025-03-13 7:02 ` Christoph Hellwig
2025-03-13 7:41 ` John Garry
2025-03-13 7:49 ` Christoph Hellwig
2025-03-13 7:53 ` John Garry
2025-03-13 8:09 ` Christoph Hellwig
2025-03-13 8:18 ` Christoph Hellwig
2025-03-13 8:24 ` John Garry
2025-03-13 8:28 ` Christoph Hellwig