* [PATCH v2 01/14] block: Add blk_validate_atomic_write_op_size()
2024-03-04 13:04 [PATCH v2 00/14] block atomic writes for XFS John Garry
@ 2024-03-04 13:04 ` John Garry
2024-03-04 13:04 ` [PATCH v2 02/14] fs: xfs: Don't use low-space allocator for alignment > 1 John Garry
` (12 subsequent siblings)
13 siblings, 0 replies; 53+ messages in thread
From: John Garry @ 2024-03-04 13:04 UTC (permalink / raw)
To: djwong, hch, viro, brauner, jack, chandan.babu, david, axboe
Cc: martin.petersen, linux-kernel, linux-xfs, linux-fsdevel, tytso,
jbongio, ojaswin, ritesh.list, linux-block, John Garry
FSes will not check if the BIO submitted for an atomic write adheres to the
request_queue limits, so check in the BIO submission path.
Signed-off-by: John Garry <john.g.garry@oracle.com>
---
We really should return -EINVAL to userspce, so can be have BLK_STS_INVAL
or similar?
block/blk-core.c | 17 +++++++++++++++++
1 file changed, 17 insertions(+)
diff --git a/block/blk-core.c b/block/blk-core.c
index de771093b526..a1b095a68498 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -718,6 +718,18 @@ void submit_bio_noacct_nocheck(struct bio *bio)
__submit_bio_noacct(bio);
}
+static blk_status_t blk_validate_atomic_write_op_size(struct request_queue *q,
+ struct bio *bio)
+{
+ if (bio->bi_iter.bi_size > queue_atomic_write_unit_max_bytes(q))
+ return BLK_STS_IOERR;
+
+ if (bio->bi_iter.bi_size % queue_atomic_write_unit_min_bytes(q))
+ return BLK_STS_IOERR;
+
+ return BLK_STS_OK;
+}
+
/**
* submit_bio_noacct - re-submit a bio to the block device layer for I/O
* @bio: The bio describing the location in memory and on the device.
@@ -775,6 +787,11 @@ void submit_bio_noacct(struct bio *bio)
switch (bio_op(bio)) {
case REQ_OP_READ:
case REQ_OP_WRITE:
+ if (bio->bi_opf & REQ_ATOMIC) {
+ status = blk_validate_atomic_write_op_size(q, bio);
+ if (status != BLK_STS_OK)
+ goto end_io;
+ }
break;
case REQ_OP_FLUSH:
/*
--
2.31.1
^ permalink raw reply related [flat|nested] 53+ messages in thread* [PATCH v2 02/14] fs: xfs: Don't use low-space allocator for alignment > 1
2024-03-04 13:04 [PATCH v2 00/14] block atomic writes for XFS John Garry
2024-03-04 13:04 ` [PATCH v2 01/14] block: Add blk_validate_atomic_write_op_size() John Garry
@ 2024-03-04 13:04 ` John Garry
2024-03-04 22:15 ` Dave Chinner
2024-03-04 13:04 ` [PATCH v2 03/14] fs: xfs: Introduce FORCEALIGN inode flag John Garry
` (11 subsequent siblings)
13 siblings, 1 reply; 53+ messages in thread
From: John Garry @ 2024-03-04 13:04 UTC (permalink / raw)
To: djwong, hch, viro, brauner, jack, chandan.babu, david, axboe
Cc: martin.petersen, linux-kernel, linux-xfs, linux-fsdevel, tytso,
jbongio, ojaswin, ritesh.list, linux-block, John Garry
The low-space allocator doesn't honour the alignment requirement, so don't
attempt to even use it (when we have an alignment requirement).
Signed-off-by: John Garry <john.g.garry@oracle.com>
---
fs/xfs/libxfs/xfs_bmap.c | 4 ++++
1 file changed, 4 insertions(+)
diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
index f362345467fa..60d100134280 100644
--- a/fs/xfs/libxfs/xfs_bmap.c
+++ b/fs/xfs/libxfs/xfs_bmap.c
@@ -3584,6 +3584,10 @@ xfs_bmap_btalloc_low_space(
{
int error;
+ /* The allocator doesn't honour args->alignment */
+ if (args->alignment > 1)
+ return 0;
+
if (args->minlen > ap->minlen) {
args->minlen = ap->minlen;
error = xfs_alloc_vextent_start_ag(args, ap->blkno);
--
2.31.1
^ permalink raw reply related [flat|nested] 53+ messages in thread* Re: [PATCH v2 02/14] fs: xfs: Don't use low-space allocator for alignment > 1
2024-03-04 13:04 ` [PATCH v2 02/14] fs: xfs: Don't use low-space allocator for alignment > 1 John Garry
@ 2024-03-04 22:15 ` Dave Chinner
2024-03-05 13:36 ` John Garry
0 siblings, 1 reply; 53+ messages in thread
From: Dave Chinner @ 2024-03-04 22:15 UTC (permalink / raw)
To: John Garry
Cc: djwong, hch, viro, brauner, jack, chandan.babu, axboe,
martin.petersen, linux-kernel, linux-xfs, linux-fsdevel, tytso,
jbongio, ojaswin, ritesh.list, linux-block
On Mon, Mar 04, 2024 at 01:04:16PM +0000, John Garry wrote:
> The low-space allocator doesn't honour the alignment requirement, so don't
> attempt to even use it (when we have an alignment requirement).
>
> Signed-off-by: John Garry <john.g.garry@oracle.com>
> ---
> fs/xfs/libxfs/xfs_bmap.c | 4 ++++
> 1 file changed, 4 insertions(+)
>
> diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
> index f362345467fa..60d100134280 100644
> --- a/fs/xfs/libxfs/xfs_bmap.c
> +++ b/fs/xfs/libxfs/xfs_bmap.c
> @@ -3584,6 +3584,10 @@ xfs_bmap_btalloc_low_space(
> {
> int error;
>
> + /* The allocator doesn't honour args->alignment */
> + if (args->alignment > 1)
> + return 0;
I think that's wrong.
The alignment argument here is purely a best effort consideration -
we ignore it several different allocation situations, not just low
space.
e.g. xfs_bmap_btalloc_at_eof() will try exact block
allocation regardless of whether an alignment parameter is set. It
will then fall back to stripe alignment if exact block fails.
If stripe aligned allocation fails, it will then set args->alignment
= 1 and try a full filesystem allocation scan without alignment.
And if that fails, then we finally get to the low space allocator
with args->alignment = 1 even though we might be trying to allocate
an aligned extent for an atomic IO....
IOWs, I think this indicates deeper surgery is needed to ensure
aligned allocations fail immediately and don't fall back to
unaligned allocations and set XFS_TRANS_LOW_MODE...
-Dave.
--
Dave Chinner
david@fromorbit.com
^ permalink raw reply [flat|nested] 53+ messages in thread* Re: [PATCH v2 02/14] fs: xfs: Don't use low-space allocator for alignment > 1
2024-03-04 22:15 ` Dave Chinner
@ 2024-03-05 13:36 ` John Garry
0 siblings, 0 replies; 53+ messages in thread
From: John Garry @ 2024-03-05 13:36 UTC (permalink / raw)
To: Dave Chinner
Cc: djwong, hch, viro, brauner, jack, chandan.babu, axboe,
martin.petersen, linux-kernel, linux-xfs, linux-fsdevel, tytso,
jbongio, ojaswin, ritesh.list, linux-block
On 04/03/2024 22:15, Dave Chinner wrote:
> On Mon, Mar 04, 2024 at 01:04:16PM +0000, John Garry wrote:
>> The low-space allocator doesn't honour the alignment requirement, so don't
>> attempt to even use it (when we have an alignment requirement).
>>
>> Signed-off-by: John Garry <john.g.garry@oracle.com>
>> ---
>> fs/xfs/libxfs/xfs_bmap.c | 4 ++++
>> 1 file changed, 4 insertions(+)
>>
>> diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
>> index f362345467fa..60d100134280 100644
>> --- a/fs/xfs/libxfs/xfs_bmap.c
>> +++ b/fs/xfs/libxfs/xfs_bmap.c
>> @@ -3584,6 +3584,10 @@ xfs_bmap_btalloc_low_space(
>> {
>> int error;
>>
>> + /* The allocator doesn't honour args->alignment */
>> + if (args->alignment > 1)
>> + return 0;
>
> I think that's wrong.
>
> The alignment argument here is purely a best effort consideration -
> we ignore it several different allocation situations, not just low
> space.
Sure, but I am simply addressing the low-space allocator here.
In this series I am /we are effectively trying to conflate
args->alignment > 1 with forcealign. I thought that args->alignment was
guaranteed to be honoured, with some caveats. For forcealign, we
obviously require a guarantee.
>
> e.g. xfs_bmap_btalloc_at_eof() will try exact block
> allocation regardless of whether an alignment parameter is set.
For this specific issue, I think that we are ok, as:
- in xfs_bmap_compute_alignments(), stripe_align is aligned with
args->alignment for forcealign
- xfs_bmap_btalloc_at_eof() has the optimisation to alloc according to
stripe alignment
But obviously we should not be relying on optimisations.
Please also note that I have a modification later in this series to
always have EOF aligned for forcealign.
> It
> will then fall back to stripe alignment if exact block fails.
>
> If stripe aligned allocation fails, it will then set args->alignment
> = 1 and try a full filesystem allocation scan without alignment.
>
> And if that fails, then we finally get to the low space allocator
> with args->alignment = 1 even though we might be trying to allocate
> an aligned extent for an atomic IO....
>
> IOWs, I think this indicates deeper surgery is needed to ensure
> aligned allocations fail immediately and don't fall back to
> unaligned allocations and set XFS_TRANS_LOW_MODE...
>
ok, I'll look at what you write about all of this in the later patch review.
Thanks,
John
^ permalink raw reply [flat|nested] 53+ messages in thread
* [PATCH v2 03/14] fs: xfs: Introduce FORCEALIGN inode flag
2024-03-04 13:04 [PATCH v2 00/14] block atomic writes for XFS John Garry
2024-03-04 13:04 ` [PATCH v2 01/14] block: Add blk_validate_atomic_write_op_size() John Garry
2024-03-04 13:04 ` [PATCH v2 02/14] fs: xfs: Don't use low-space allocator for alignment > 1 John Garry
@ 2024-03-04 13:04 ` John Garry
2024-03-04 13:04 ` [PATCH v2 04/14] fs: xfs: Make file data allocations observe the 'forcealign' flag John Garry
` (10 subsequent siblings)
13 siblings, 0 replies; 53+ messages in thread
From: John Garry @ 2024-03-04 13:04 UTC (permalink / raw)
To: djwong, hch, viro, brauner, jack, chandan.babu, david, axboe
Cc: martin.petersen, linux-kernel, linux-xfs, linux-fsdevel, tytso,
jbongio, ojaswin, ritesh.list, linux-block, John Garry
From: "Darrick J. Wong" <djwong@kernel.org>
Add a new inode flag to require that all file data extent mappings must
be aligned (both the file offset range and the allocated space itself)
to the extent size hint. Having a separate COW extent size hint is no
longer allowed.
The goal here is to enable sysadmins and users to mandate that all space
mappings in a file must have a startoff/blockcount that are aligned to
(say) a 2MB alignment and that the startblock/blockcount will follow the
same alignment.
jpg: Enforce extsize is a power-of-2 for forcealign
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Co-developed-by: John Garry <john.g.garry@oracle.com>
Signed-off-by: John Garry <john.g.garry@oracle.com>
---
fs/xfs/libxfs/xfs_format.h | 6 +++++-
fs/xfs/libxfs/xfs_inode_buf.c | 40 +++++++++++++++++++++++++++++++++++
fs/xfs/libxfs/xfs_inode_buf.h | 3 +++
fs/xfs/libxfs/xfs_sb.c | 2 ++
fs/xfs/xfs_inode.c | 12 +++++++++++
fs/xfs/xfs_inode.h | 5 +++++
fs/xfs/xfs_ioctl.c | 34 ++++++++++++++++++++++++++++-
fs/xfs/xfs_mount.h | 2 ++
fs/xfs/xfs_super.c | 4 ++++
include/uapi/linux/fs.h | 2 ++
10 files changed, 108 insertions(+), 2 deletions(-)
diff --git a/fs/xfs/libxfs/xfs_format.h b/fs/xfs/libxfs/xfs_format.h
index 382ab1e71c0b..db2113cf6e47 100644
--- a/fs/xfs/libxfs/xfs_format.h
+++ b/fs/xfs/libxfs/xfs_format.h
@@ -353,6 +353,7 @@ xfs_sb_has_compat_feature(
#define XFS_SB_FEAT_RO_COMPAT_RMAPBT (1 << 1) /* reverse map btree */
#define XFS_SB_FEAT_RO_COMPAT_REFLINK (1 << 2) /* reflinked files */
#define XFS_SB_FEAT_RO_COMPAT_INOBTCNT (1 << 3) /* inobt block counts */
+#define XFS_SB_FEAT_RO_COMPAT_FORCEALIGN (1 << 30) /* aligned file data extents */
#define XFS_SB_FEAT_RO_COMPAT_ALL \
(XFS_SB_FEAT_RO_COMPAT_FINOBT | \
XFS_SB_FEAT_RO_COMPAT_RMAPBT | \
@@ -1085,16 +1086,19 @@ static inline void xfs_dinode_put_rdev(struct xfs_dinode *dip, xfs_dev_t rdev)
#define XFS_DIFLAG2_COWEXTSIZE_BIT 2 /* copy on write extent size hint */
#define XFS_DIFLAG2_BIGTIME_BIT 3 /* big timestamps */
#define XFS_DIFLAG2_NREXT64_BIT 4 /* large extent counters */
+/* data extent mappings for regular files must be aligned to extent size hint */
+#define XFS_DIFLAG2_FORCEALIGN_BIT 5
#define XFS_DIFLAG2_DAX (1 << XFS_DIFLAG2_DAX_BIT)
#define XFS_DIFLAG2_REFLINK (1 << XFS_DIFLAG2_REFLINK_BIT)
#define XFS_DIFLAG2_COWEXTSIZE (1 << XFS_DIFLAG2_COWEXTSIZE_BIT)
#define XFS_DIFLAG2_BIGTIME (1 << XFS_DIFLAG2_BIGTIME_BIT)
#define XFS_DIFLAG2_NREXT64 (1 << XFS_DIFLAG2_NREXT64_BIT)
+#define XFS_DIFLAG2_FORCEALIGN (1 << XFS_DIFLAG2_FORCEALIGN_BIT)
#define XFS_DIFLAG2_ANY \
(XFS_DIFLAG2_DAX | XFS_DIFLAG2_REFLINK | XFS_DIFLAG2_COWEXTSIZE | \
- XFS_DIFLAG2_BIGTIME | XFS_DIFLAG2_NREXT64)
+ XFS_DIFLAG2_BIGTIME | XFS_DIFLAG2_NREXT64 | XFS_DIFLAG2_FORCEALIGN)
static inline bool xfs_dinode_has_bigtime(const struct xfs_dinode *dip)
{
diff --git a/fs/xfs/libxfs/xfs_inode_buf.c b/fs/xfs/libxfs/xfs_inode_buf.c
index 137a65bda95d..61cc12cd54db 100644
--- a/fs/xfs/libxfs/xfs_inode_buf.c
+++ b/fs/xfs/libxfs/xfs_inode_buf.c
@@ -610,6 +610,14 @@ xfs_dinode_verify(
!xfs_has_bigtime(mp))
return __this_address;
+ if (flags2 & XFS_DIFLAG2_FORCEALIGN) {
+ fa = xfs_inode_validate_forcealign(mp, mode, flags,
+ be32_to_cpu(dip->di_extsize),
+ be32_to_cpu(dip->di_cowextsize));
+ if (fa)
+ return fa;
+ }
+
return NULL;
}
@@ -777,3 +785,35 @@ xfs_inode_validate_cowextsize(
return NULL;
}
+
+/* Validate the forcealign inode flag */
+xfs_failaddr_t
+xfs_inode_validate_forcealign(
+ struct xfs_mount *mp,
+ uint16_t mode,
+ uint16_t flags,
+ uint32_t extsize,
+ uint32_t cowextsize)
+{
+ /* superblock rocompat feature flag */
+ if (!xfs_has_forcealign(mp))
+ return __this_address;
+
+ /* Only regular files and directories */
+ if (!S_ISDIR(mode) && !S_ISREG(mode))
+ return __this_address;
+
+ /* Doesn't apply to realtime files */
+ if (flags & XFS_DIFLAG_REALTIME)
+ return __this_address;
+
+ /* Requires a non-zero power-of-2 extent size hint */
+ if (extsize == 0 || !is_power_of_2(extsize))
+ return __this_address;
+
+ /* Requires no cow extent size hint */
+ if (cowextsize != 0)
+ return __this_address;
+
+ return NULL;
+}
diff --git a/fs/xfs/libxfs/xfs_inode_buf.h b/fs/xfs/libxfs/xfs_inode_buf.h
index 585ed5a110af..50db17d22b68 100644
--- a/fs/xfs/libxfs/xfs_inode_buf.h
+++ b/fs/xfs/libxfs/xfs_inode_buf.h
@@ -33,6 +33,9 @@ xfs_failaddr_t xfs_inode_validate_extsize(struct xfs_mount *mp,
xfs_failaddr_t xfs_inode_validate_cowextsize(struct xfs_mount *mp,
uint32_t cowextsize, uint16_t mode, uint16_t flags,
uint64_t flags2);
+xfs_failaddr_t xfs_inode_validate_forcealign(struct xfs_mount *mp,
+ uint16_t mode, uint16_t flags, uint32_t extsize,
+ uint32_t cowextsize);
static inline uint64_t xfs_inode_encode_bigtime(struct timespec64 tv)
{
diff --git a/fs/xfs/libxfs/xfs_sb.c b/fs/xfs/libxfs/xfs_sb.c
index 5bb6e2bd6dee..f2c16a028fae 100644
--- a/fs/xfs/libxfs/xfs_sb.c
+++ b/fs/xfs/libxfs/xfs_sb.c
@@ -163,6 +163,8 @@ xfs_sb_version_to_features(
features |= XFS_FEAT_REFLINK;
if (sbp->sb_features_ro_compat & XFS_SB_FEAT_RO_COMPAT_INOBTCNT)
features |= XFS_FEAT_INOBTCNT;
+ if (sbp->sb_features_ro_compat & XFS_SB_FEAT_RO_COMPAT_FORCEALIGN)
+ features |= XFS_FEAT_FORCEALIGN;
if (sbp->sb_features_incompat & XFS_SB_FEAT_INCOMPAT_FTYPE)
features |= XFS_FEAT_FTYPE;
if (sbp->sb_features_incompat & XFS_SB_FEAT_INCOMPAT_SPINODES)
diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index 1fd94958aa97..2c439df8c47f 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -629,6 +629,8 @@ xfs_ip2xflags(
flags |= FS_XFLAG_DAX;
if (ip->i_diflags2 & XFS_DIFLAG2_COWEXTSIZE)
flags |= FS_XFLAG_COWEXTSIZE;
+ if (ip->i_diflags2 & XFS_DIFLAG2_FORCEALIGN)
+ flags |= FS_XFLAG_FORCEALIGN;
}
if (xfs_inode_has_attr_fork(ip))
@@ -758,6 +760,8 @@ xfs_inode_inherit_flags2(
}
if (pip->i_diflags2 & XFS_DIFLAG2_DAX)
ip->i_diflags2 |= XFS_DIFLAG2_DAX;
+ if (pip->i_diflags2 & XFS_DIFLAG2_FORCEALIGN)
+ ip->i_diflags2 |= XFS_DIFLAG2_FORCEALIGN;
/* Don't let invalid cowextsize hints propagate. */
failaddr = xfs_inode_validate_cowextsize(ip->i_mount, ip->i_cowextsize,
@@ -766,6 +770,14 @@ xfs_inode_inherit_flags2(
ip->i_diflags2 &= ~XFS_DIFLAG2_COWEXTSIZE;
ip->i_cowextsize = 0;
}
+
+ if (ip->i_diflags2 & XFS_DIFLAG2_FORCEALIGN) {
+ failaddr = xfs_inode_validate_forcealign(ip->i_mount,
+ VFS_I(ip)->i_mode, ip->i_diflags, ip->i_extsize,
+ ip->i_cowextsize);
+ if (failaddr)
+ ip->i_diflags2 &= ~XFS_DIFLAG2_FORCEALIGN;
+ }
}
/*
diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
index 97f63bacd4c2..82e2838f6d64 100644
--- a/fs/xfs/xfs_inode.h
+++ b/fs/xfs/xfs_inode.h
@@ -305,6 +305,11 @@ static inline bool xfs_inode_has_large_extent_counts(struct xfs_inode *ip)
return ip->i_diflags2 & XFS_DIFLAG2_NREXT64;
}
+static inline bool xfs_inode_forcealign(struct xfs_inode *ip)
+{
+ return ip->i_diflags2 & XFS_DIFLAG2_FORCEALIGN;
+}
+
/*
* Return the buftarg used for data allocations on a given inode.
*/
diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c
index f02b6e558af5..867d8d51a3d0 100644
--- a/fs/xfs/xfs_ioctl.c
+++ b/fs/xfs/xfs_ioctl.c
@@ -1110,6 +1110,8 @@ xfs_flags2diflags2(
di_flags2 |= XFS_DIFLAG2_DAX;
if (xflags & FS_XFLAG_COWEXTSIZE)
di_flags2 |= XFS_DIFLAG2_COWEXTSIZE;
+ if (xflags & FS_XFLAG_FORCEALIGN)
+ di_flags2 |= XFS_DIFLAG2_FORCEALIGN;
return di_flags2;
}
@@ -1146,6 +1148,22 @@ xfs_ioctl_setattr_xflags(
if (i_flags2 && !xfs_has_v3inodes(mp))
return -EINVAL;
+ /*
+ * Force-align requires a nonzero extent size hint and a zero cow
+ * extent size hint. It doesn't apply to realtime files.
+ */
+ if (fa->fsx_xflags & FS_XFLAG_FORCEALIGN) {
+ if (!xfs_has_forcealign(mp))
+ return -EINVAL;
+ if (fa->fsx_xflags & FS_XFLAG_COWEXTSIZE)
+ return -EINVAL;
+ if (!(fa->fsx_xflags & (FS_XFLAG_EXTSIZE |
+ FS_XFLAG_EXTSZINHERIT)))
+ return -EINVAL;
+ if (fa->fsx_xflags & FS_XFLAG_REALTIME)
+ return -EINVAL;
+ }
+
ip->i_diflags = xfs_flags2diflags(ip, fa->fsx_xflags);
ip->i_diflags2 = i_flags2;
@@ -1232,6 +1250,7 @@ xfs_ioctl_setattr_check_extsize(
struct xfs_mount *mp = ip->i_mount;
xfs_failaddr_t failaddr;
uint16_t new_diflags;
+ uint16_t new_diflags2;
if (!fa->fsx_valid)
return 0;
@@ -1244,6 +1263,7 @@ xfs_ioctl_setattr_check_extsize(
return -EINVAL;
new_diflags = xfs_flags2diflags(ip, fa->fsx_xflags);
+ new_diflags2 = xfs_flags2diflags2(ip, fa->fsx_xflags);
/*
* Inode verifiers do not check that the extent size hint is an integer
@@ -1263,7 +1283,19 @@ xfs_ioctl_setattr_check_extsize(
failaddr = xfs_inode_validate_extsize(ip->i_mount,
XFS_B_TO_FSB(mp, fa->fsx_extsize),
VFS_I(ip)->i_mode, new_diflags);
- return failaddr != NULL ? -EINVAL : 0;
+ if (failaddr)
+ return -EINVAL;
+
+ if (new_diflags2 & XFS_DIFLAG2_FORCEALIGN) {
+ failaddr = xfs_inode_validate_forcealign(ip->i_mount,
+ VFS_I(ip)->i_mode, new_diflags,
+ XFS_B_TO_FSB(mp, fa->fsx_extsize),
+ XFS_B_TO_FSB(mp, fa->fsx_cowextsize));
+ if (failaddr)
+ return -EINVAL;
+ }
+
+ return 0;
}
static int
diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
index 503fe3c7edbf..e1ef31675db3 100644
--- a/fs/xfs/xfs_mount.h
+++ b/fs/xfs/xfs_mount.h
@@ -289,6 +289,7 @@ typedef struct xfs_mount {
#define XFS_FEAT_BIGTIME (1ULL << 24) /* large timestamps */
#define XFS_FEAT_NEEDSREPAIR (1ULL << 25) /* needs xfs_repair */
#define XFS_FEAT_NREXT64 (1ULL << 26) /* large extent counters */
+#define XFS_FEAT_FORCEALIGN (1ULL << 27) /* aligned file data extents */
/* Mount features */
#define XFS_FEAT_NOATTR2 (1ULL << 48) /* disable attr2 creation */
@@ -352,6 +353,7 @@ __XFS_HAS_FEAT(inobtcounts, INOBTCNT)
__XFS_HAS_FEAT(bigtime, BIGTIME)
__XFS_HAS_FEAT(needsrepair, NEEDSREPAIR)
__XFS_HAS_FEAT(large_extent_counts, NREXT64)
+__XFS_HAS_FEAT(forcealign, FORCEALIGN)
/*
* Mount features
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index 5a2512d20bd0..74dcafddf6a9 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -1708,6 +1708,10 @@ xfs_fs_fill_super(
mp->m_features &= ~XFS_FEAT_DISCARD;
}
+ if (xfs_has_forcealign(mp))
+ xfs_warn(mp,
+"EXPERIMENTAL forced data extent alignment feature in use. Use at your own risk!");
+
if (xfs_has_reflink(mp)) {
if (mp->m_sb.sb_rblocks) {
xfs_alert(mp,
diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
index a0975ae81e64..8828822331bf 100644
--- a/include/uapi/linux/fs.h
+++ b/include/uapi/linux/fs.h
@@ -140,6 +140,8 @@ struct fsxattr {
#define FS_XFLAG_FILESTREAM 0x00004000 /* use filestream allocator */
#define FS_XFLAG_DAX 0x00008000 /* use DAX for IO */
#define FS_XFLAG_COWEXTSIZE 0x00010000 /* CoW extent size allocator hint */
+/* data extent mappings for regular files must be aligned to extent size hint */
+#define FS_XFLAG_FORCEALIGN 0x00020000
#define FS_XFLAG_HASATTR 0x80000000 /* no DIFLAG for this */
/* the read-only stuff doesn't really belong here, but any other place is
--
2.31.1
^ permalink raw reply related [flat|nested] 53+ messages in thread* [PATCH v2 04/14] fs: xfs: Make file data allocations observe the 'forcealign' flag
2024-03-04 13:04 [PATCH v2 00/14] block atomic writes for XFS John Garry
` (2 preceding siblings ...)
2024-03-04 13:04 ` [PATCH v2 03/14] fs: xfs: Introduce FORCEALIGN inode flag John Garry
@ 2024-03-04 13:04 ` John Garry
2024-03-05 0:44 ` Dave Chinner
2024-03-04 13:04 ` [PATCH v2 05/14] fs: xfs: Enable file data forcealign feature John Garry
` (9 subsequent siblings)
13 siblings, 1 reply; 53+ messages in thread
From: John Garry @ 2024-03-04 13:04 UTC (permalink / raw)
To: djwong, hch, viro, brauner, jack, chandan.babu, david, axboe
Cc: martin.petersen, linux-kernel, linux-xfs, linux-fsdevel, tytso,
jbongio, ojaswin, ritesh.list, linux-block, John Garry
From: "Darrick J. Wong" <djwong@kernel.org>
The existing extsize hint code already did the work of expanding file
range mapping requests so that the range is aligned to the hint value.
Now add the code we need to guarantee that the space allocations are
also always aligned.
XXX: still need to check all this with reflink
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Co-developed-by: John Garry <john.g.garry@oracle.com>
Signed-off-by: John Garry <john.g.garry@oracle.com>
---
fs/xfs/libxfs/xfs_bmap.c | 22 +++++++++++++++++-----
fs/xfs/xfs_iomap.c | 4 +++-
2 files changed, 20 insertions(+), 6 deletions(-)
diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
index 60d100134280..8dee60795cf4 100644
--- a/fs/xfs/libxfs/xfs_bmap.c
+++ b/fs/xfs/libxfs/xfs_bmap.c
@@ -3343,6 +3343,19 @@ xfs_bmap_compute_alignments(
align = xfs_get_cowextsz_hint(ap->ip);
else if (ap->datatype & XFS_ALLOC_USERDATA)
align = xfs_get_extsz_hint(ap->ip);
+
+ /*
+ * xfs_get_cowextsz_hint() returns extsz_hint for when forcealign is
+ * set as forcealign and cowextsz_hint are mutually exclusive
+ */
+ if (xfs_inode_forcealign(ap->ip) && align) {
+ args->alignment = align;
+ if (stripe_align % align)
+ stripe_align = align;
+ } else {
+ args->alignment = 1;
+ }
+
if (align) {
if (xfs_bmap_extsize_align(mp, &ap->got, &ap->prev, align, 0,
ap->eof, 0, ap->conv, &ap->offset,
@@ -3438,7 +3451,6 @@ xfs_bmap_exact_minlen_extent_alloc(
args.minlen = args.maxlen = ap->minlen;
args.total = ap->total;
- args.alignment = 1;
args.minalignslop = 0;
args.minleft = ap->minleft;
@@ -3484,6 +3496,7 @@ xfs_bmap_btalloc_at_eof(
{
struct xfs_mount *mp = args->mp;
struct xfs_perag *caller_pag = args->pag;
+ int orig_alignment = args->alignment;
int error;
/*
@@ -3558,10 +3571,10 @@ xfs_bmap_btalloc_at_eof(
/*
* Allocation failed, so turn return the allocation args to their
- * original non-aligned state so the caller can proceed on allocation
- * failure as if this function was never called.
+ * original state so the caller can proceed on allocation failure as
+ * if this function was never called.
*/
- args->alignment = 1;
+ args->alignment = orig_alignment;
return 0;
}
@@ -3709,7 +3722,6 @@ xfs_bmap_btalloc(
.wasdel = ap->wasdel,
.resv = XFS_AG_RESV_NONE,
.datatype = ap->datatype,
- .alignment = 1,
.minalignslop = 0,
};
xfs_fileoff_t orig_offset;
diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
index 18c8f168b153..70fe873951f3 100644
--- a/fs/xfs/xfs_iomap.c
+++ b/fs/xfs/xfs_iomap.c
@@ -181,7 +181,9 @@ xfs_eof_alignment(
* If mounted with the "-o swalloc" option the alignment is
* increased from the strip unit size to the stripe width.
*/
- if (mp->m_swidth && xfs_has_swalloc(mp))
+ if (xfs_inode_forcealign(ip))
+ align = xfs_get_extsz_hint(ip);
+ else if (mp->m_swidth && xfs_has_swalloc(mp))
align = mp->m_swidth;
else if (mp->m_dalign)
align = mp->m_dalign;
--
2.31.1
^ permalink raw reply related [flat|nested] 53+ messages in thread* Re: [PATCH v2 04/14] fs: xfs: Make file data allocations observe the 'forcealign' flag
2024-03-04 13:04 ` [PATCH v2 04/14] fs: xfs: Make file data allocations observe the 'forcealign' flag John Garry
@ 2024-03-05 0:44 ` Dave Chinner
2024-03-05 15:22 ` John Garry
0 siblings, 1 reply; 53+ messages in thread
From: Dave Chinner @ 2024-03-05 0:44 UTC (permalink / raw)
To: John Garry
Cc: djwong, hch, viro, brauner, jack, chandan.babu, axboe,
martin.petersen, linux-kernel, linux-xfs, linux-fsdevel, tytso,
jbongio, ojaswin, ritesh.list, linux-block
On Mon, Mar 04, 2024 at 01:04:18PM +0000, John Garry wrote:
> From: "Darrick J. Wong" <djwong@kernel.org>
>
> The existing extsize hint code already did the work of expanding file
> range mapping requests so that the range is aligned to the hint value.
> Now add the code we need to guarantee that the space allocations are
> also always aligned.
>
> XXX: still need to check all this with reflink
>
> Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
> Co-developed-by: John Garry <john.g.garry@oracle.com>
> Signed-off-by: John Garry <john.g.garry@oracle.com>
> ---
> fs/xfs/libxfs/xfs_bmap.c | 22 +++++++++++++++++-----
> fs/xfs/xfs_iomap.c | 4 +++-
> 2 files changed, 20 insertions(+), 6 deletions(-)
>
> diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
> index 60d100134280..8dee60795cf4 100644
> --- a/fs/xfs/libxfs/xfs_bmap.c
> +++ b/fs/xfs/libxfs/xfs_bmap.c
> @@ -3343,6 +3343,19 @@ xfs_bmap_compute_alignments(
> align = xfs_get_cowextsz_hint(ap->ip);
> else if (ap->datatype & XFS_ALLOC_USERDATA)
> align = xfs_get_extsz_hint(ap->ip);
> +
> + /*
> + * xfs_get_cowextsz_hint() returns extsz_hint for when forcealign is
> + * set as forcealign and cowextsz_hint are mutually exclusive
> + */
> + if (xfs_inode_forcealign(ap->ip) && align) {
> + args->alignment = align;
> + if (stripe_align % align)
> + stripe_align = align;
This kinda does the right thing, but it strikes me as the wrong
thing to be doing - it seems to conflates two different physical
alignment properties in potentially bad ways, and we should never
get this far into the allocator to discover these issues.
Stripe alignment is alignment to physical disks in a RAID array.
Inode forced alignment is file offset alignment to specificly
defined LBA address ranges. The stripe unit/width is set at mkfs
time, the inode extent size hint at runtime.
These can only work together in specific situations:
1. max sized RWF_ATOMIC IO must be the same or smaller size
than the stripe unit. Alternatively: stripe unit must be
an integer (power of 2?) multiple of max atomic IO size.
IOWs, stripe unit vs atomic IO constraints must be
resolved in a compatible manner at mkfs time. If they
can't be resolved (e.g. non-power of 2 stripe unit) then
atomic IO cannot be done on the filesystem at all. Stripe
unit constraints always win over atomic IO.
2. min sized RWF_ATOMIC IO must be an integer divider of
stripe unit so that small atomic IOs are always correctly
aligned to the stripe unit.
3. Inode forced alignment constraints set at runtime (i.e.
extsize hints) must fit within the above stripe unit vs
min/max atomic IO requirements.
i.e. extent size hint will typically need to an integer
multiple of stripe unit size which will always be
compatible with stripe unit and atomic IO size and
alignments...
IOWs, these are static geometry constraints and so should be checked
and rejected at the point where alignments are specified (i.e.
mkfs, mount and ioctls). Then the allocator can simply assume that
forced inode alignments are always stripe alignment compatible and
we don't need separate handling of two possibly incompatible
alignments.
Regardless, I think the code as it stands won't work correctly when
a stripe unit is not set.
The correct functioning of this code appears to be dependent on
stripe_align being set >= the extent size hint. If stripe align is
not set, then what does (0 % align) return? If my checks are
correct, this will return 0, and so the above code will end up with
stripe_align = 0.
Now, consider that args->alignment is used to signal to the
allocator what the -start alignment- of the allocation is supposed
to be, whilst args->prod/args->mod are used to tell the allocator
what the tail adjustment is supposed to be.
Stripe alignment wants start alignment, extent size hints want tail
adjustment and force aligned allocation wants both start alignment
and tail adjustment.
However, the allocator overrides these depending on what it is
doing. e.g. exact block EOF allocation turns off start alignment but
has to ensure space is reserved for start alignment if it fails.
This reserves space for stripe_align in args->minalignslop, but if
force alignment has a start alignment requirement larger than stripe
unit alignment, this code fails to reserve the necessary alignment
slop for the force alignment code.
If we aren't doing exact block EOF allocation, then we do stripe
unit alignment. Again, if this fails and the forced alignment is
larger than stripe alignment, this code does not reserve space for
the larger alignment in args->minalignslop, so it will also do the
wrong when we fall back to an args->alignment > 1 allocation.
Failing to correctly account for minalignslop is known to cause
accounting problems when the AG is very near empty, and the usual
result of that is cancelling of a dirty allocation transaction and a
forced shutdown. So this is something we need to get right.
More below....
> + } else {
> + args->alignment = 1;
> + }
Just initialise the allocation args structure with a value of 1 like
we already do?
> +
> if (align) {
> if (xfs_bmap_extsize_align(mp, &ap->got, &ap->prev, align, 0,
> ap->eof, 0, ap->conv, &ap->offset,
> @@ -3438,7 +3451,6 @@ xfs_bmap_exact_minlen_extent_alloc(
> args.minlen = args.maxlen = ap->minlen;
> args.total = ap->total;
>
> - args.alignment = 1;
> args.minalignslop = 0;
This likely breaks the debug code. This isn't intended for testing
atomic write capability, it's for exercising low space allocation
algorithms with worst case fragmentation patterns. This is only
called when error injection is turned on to trigger this path, so we
shouldn't need to support forced IO alignment here.
>
> args.minleft = ap->minleft;
> @@ -3484,6 +3496,7 @@ xfs_bmap_btalloc_at_eof(
> {
> struct xfs_mount *mp = args->mp;
> struct xfs_perag *caller_pag = args->pag;
> + int orig_alignment = args->alignment;
> int error;
>
> /*
> @@ -3558,10 +3571,10 @@ xfs_bmap_btalloc_at_eof(
>
> /*
> * Allocation failed, so turn return the allocation args to their
> - * original non-aligned state so the caller can proceed on allocation
> - * failure as if this function was never called.
> + * original state so the caller can proceed on allocation failure as
> + * if this function was never called.
> */
> - args->alignment = 1;
> + args->alignment = orig_alignment;
> return 0;
> }
As I said above, we can't set an alignment of > 1 here if we haven't
accounted for that alignment in args->minalignslop above. This leads
to unexpected ENOSPC conditions and filesystem shutdowns.
I suspect what we need to do is get rid of the separate stripe_align
variable altogether and always just set args->alignment to what we
need the extent start alignment to be, regardless of whether it is
from stripe alignment or forced alignment.
Then the code in xfs_bmap_btalloc_at_eof() doesn't need to know what
'stripe_align' is - the exact EOF block allocation can simply save
and restore the args->alignment value and use it for minalignslop
calculations for the initial exact block allocation.
Then, if xfs_bmap_btalloc_at_eof() fails and xfs_inode_forcealign()
is true, we can abort allocation immediately, and not bother to fall
back on further aligned/unaligned attempts that will also fail or do
the wrong them.
Similarly, if we aren't doing EOF allocation, having args->alignment
set means it will do the right thing for the first allocation
attempt. Again, if that fails, we can check if
xfs_inode_forcealign() is true and fail the aligned allocation
instead of running the low space algorithm. This now makes it clear
that we're failing the allocation because of the forced alignment
requirement, and now the low space allocation code can explicitly
turn off start alignment as it isn't required...
> diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
> index 18c8f168b153..70fe873951f3 100644
> --- a/fs/xfs/xfs_iomap.c
> +++ b/fs/xfs/xfs_iomap.c
> @@ -181,7 +181,9 @@ xfs_eof_alignment(
> * If mounted with the "-o swalloc" option the alignment is
> * increased from the strip unit size to the stripe width.
> */
> - if (mp->m_swidth && xfs_has_swalloc(mp))
> + if (xfs_inode_forcealign(ip))
> + align = xfs_get_extsz_hint(ip);
> + else if (mp->m_swidth && xfs_has_swalloc(mp))
> align = mp->m_swidth;
> else if (mp->m_dalign)
> align = mp->m_dalign;
This doesn't work for files with a current size of less than
"align". The next line of code does:
if (align && XFS_ISIZE(ip) < XFS_FSB_TO_B(mp, align))
align = 0;
IOWs, the first aligned write allocation will occur with a file size
of zero, and that will result in this function returning 0. i.e.
speculative post EOF delalloc doesn't turn on and align the EOF
until writes up to offset == align have already been completed.
Essentially, this code wants to be:
if (xfs_inode_forcealign(ip))
return xfs_get_extsz_hint(ip);
So that any write to the a force aligned inode will always trigger
extent size hint EOF alignment.
-Dave.
--
Dave Chinner
david@fromorbit.com
^ permalink raw reply [flat|nested] 53+ messages in thread* Re: [PATCH v2 04/14] fs: xfs: Make file data allocations observe the 'forcealign' flag
2024-03-05 0:44 ` Dave Chinner
@ 2024-03-05 15:22 ` John Garry
2024-03-05 22:18 ` Dave Chinner
0 siblings, 1 reply; 53+ messages in thread
From: John Garry @ 2024-03-05 15:22 UTC (permalink / raw)
To: Dave Chinner
Cc: djwong, hch, viro, brauner, jack, chandan.babu, axboe,
martin.petersen, linux-kernel, linux-xfs, linux-fsdevel, tytso,
jbongio, ojaswin, ritesh.list, linux-block
On 05/03/2024 00:44, Dave Chinner wrote:
> On Mon, Mar 04, 2024 at 01:04:18PM +0000, John Garry wrote:
>> From: "Darrick J. Wong" <djwong@kernel.org>
>>
>> The existing extsize hint code already did the work of expanding file
>> range mapping requests so that the range is aligned to the hint value.
>> Now add the code we need to guarantee that the space allocations are
>> also always aligned.
>>
>> XXX: still need to check all this with reflink
>>
>> Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
>> Co-developed-by: John Garry <john.g.garry@oracle.com>
>> Signed-off-by: John Garry <john.g.garry@oracle.com>
>> ---
>> fs/xfs/libxfs/xfs_bmap.c | 22 +++++++++++++++++-----
>> fs/xfs/xfs_iomap.c | 4 +++-
>> 2 files changed, 20 insertions(+), 6 deletions(-)
>>
>> diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
>> index 60d100134280..8dee60795cf4 100644
>> --- a/fs/xfs/libxfs/xfs_bmap.c
>> +++ b/fs/xfs/libxfs/xfs_bmap.c
>> @@ -3343,6 +3343,19 @@ xfs_bmap_compute_alignments(
>> align = xfs_get_cowextsz_hint(ap->ip);
>> else if (ap->datatype & XFS_ALLOC_USERDATA)
>> align = xfs_get_extsz_hint(ap->ip);
>> +
>> + /*
>> + * xfs_get_cowextsz_hint() returns extsz_hint for when forcealign is
>> + * set as forcealign and cowextsz_hint are mutually exclusive
>> + */
>> + if (xfs_inode_forcealign(ap->ip) && align) {
>> + args->alignment = align;
>> + if (stripe_align % align)
>> + stripe_align = align;
>
> This kinda does the right thing, but it strikes me as the wrong
> thing to be doing - it seems to conflates two different physical
> alignment properties in potentially bad ways, and we should never
> get this far into the allocator to discover these issues.
>
> Stripe alignment is alignment to physical disks in a RAID array.
> Inode forced alignment is file offset alignment to specificly
> defined LBA address ranges. The stripe unit/width is set at mkfs
> time, the inode extent size hint at runtime.
>
> These can only work together in specific situations:
>
> 1. max sized RWF_ATOMIC IO must be the same or smaller size
> than the stripe unit. Alternatively: stripe unit must be
> an integer (power of 2?) multiple of max atomic IO size.
Sure, it would not make sense to have max sized RWF_ATOMIC IO > stripe
alignment.
>
> IOWs, stripe unit vs atomic IO constraints must be
> resolved in a compatible manner at mkfs time. If they
> can't be resolved (e.g. non-power of 2 stripe unit) then
> atomic IO cannot be done on the filesystem at all. Stripe
> unit constraints always win over atomic IO.
We can could do that. Indeed, I thought our xfsprogs was doing that, but
we have had a few versions now for forcealign ...
>
> 2. min sized RWF_ATOMIC IO must be an integer divider of
> stripe unit so that small atomic IOs are always correctly
> aligned to the stripe unit.
That's a given from atomic write rules and point 1.
>
> 3. Inode forced alignment constraints set at runtime (i.e.
> extsize hints) must fit within the above stripe unit vs
> min/max atomic IO requirements.
> > i.e. extent size hint will typically need to an integer
> multiple of stripe unit size which will always be
> compatible with stripe unit and atomic IO size and
> alignments...
I think that any extsize hint for forcealign needs to comply with stripe
unit, same as point 1.
>
> IOWs, these are static geometry constraints and so should be checked
> and rejected at the point where alignments are specified (i.e.
> mkfs, mount and ioctls). Then the allocator can simply assume that
> forced inode alignments are always stripe alignment compatible and
> we don't need separate handling of two possibly incompatible
> alignments.
ok, makes sense.
Please note in case missed, I am mandating extsize hint for forcealign
needs to be a power-of-2. It just makes life easier for all the
sub-extent zero'ing later on.
Also we need to enforce that the AG count to be compliant with the
extsize hint for forcealign; but since the extsize hint for forcealign
needs to be compliant with stripe unit, above, and stripe unit would be
compliant wth AG count (right?), then this would be a given.
>
> Regardless, I think the code as it stands won't work correctly when
> a stripe unit is not set.
>
> The correct functioning of this code appears to be dependent on
> stripe_align being set >= the extent size hint. If stripe align is
> not set, then what does (0 % align) return? If my checks are
> correct, this will return 0,
Correct
> and so the above code will end up with
> stripe_align = 0.
>
> Now, consider that args->alignment is used to signal to the
> allocator what the -start alignment- of the allocation is supposed
> to be, whilst args->prod/args->mod are used to tell the allocator
> what the tail adjustment is supposed to be.
>
> Stripe alignment wants start alignment, extent size hints want tail
> adjustment and force aligned allocation wants both start alignment
> and tail adjustment.
Right
>
> However, the allocator overrides these depending on what it is
> doing. e.g. exact block EOF allocation turns off start alignment but
> has to ensure space is reserved for start alignment if it fails.
> This reserves space for stripe_align in args->minalignslop, but if
> force alignment has a start alignment requirement larger than stripe
> unit alignment, this code fails to reserve the necessary alignment
> slop for the force alignment code.
>
> If we aren't doing exact block EOF allocation, then we do stripe
> unit alignment. Again, if this fails and the forced alignment is
> larger than stripe alignment, this code does not reserve space for
> the larger alignment in args->minalignslop, so it will also do the
> wrong when we fall back to an args->alignment > 1 allocation.
>
> Failing to correctly account for minalignslop is known to cause
> accounting problems when the AG is very near empty, and the usual
> result of that is cancelling of a dirty allocation transaction and a
> forced shutdown. So this is something we need to get right.
For sure
>
> More below....
>
>> + } else {
>> + args->alignment = 1;
>> + }
>
> Just initialise the allocation args structure with a value of 1 like
> we already do?
It was being done in this way to have just a single place where the
value is initialised. It can easily be kept as is.
>
>> +
>> if (align) {
>> if (xfs_bmap_extsize_align(mp, &ap->got, &ap->prev, align, 0,
>> ap->eof, 0, ap->conv, &ap->offset,
>> @@ -3438,7 +3451,6 @@ xfs_bmap_exact_minlen_extent_alloc(
>> args.minlen = args.maxlen = ap->minlen;
>> args.total = ap->total;
>>
>> - args.alignment = 1;
>> args.minalignslop = 0;
>
> This likely breaks the debug code. This isn't intended for testing
> atomic write capability, it's for exercising low space allocation
> algorithms with worst case fragmentation patterns. This is only
> called when error injection is turned on to trigger this path, so we
> shouldn't need to support forced IO alignment here.
ok, it can be left untouched.
>
>>
>> args.minleft = ap->minleft;
>> @@ -3484,6 +3496,7 @@ xfs_bmap_btalloc_at_eof(
>> {
>> struct xfs_mount *mp = args->mp;
>> struct xfs_perag *caller_pag = args->pag;
>> + int orig_alignment = args->alignment;
>> int error;
>>
>> /*
>> @@ -3558,10 +3571,10 @@ xfs_bmap_btalloc_at_eof(
>>
>> /*
>> * Allocation failed, so turn return the allocation args to their
>> - * original non-aligned state so the caller can proceed on allocation
>> - * failure as if this function was never called.
>> + * original state so the caller can proceed on allocation failure as
>> + * if this function was never called.
>> */
>> - args->alignment = 1;
>> + args->alignment = orig_alignment;
>> return 0;
>> }
>
> As I said above, we can't set an alignment of > 1 here if we haven't
> accounted for that alignment in args->minalignslop above. This leads
> to unexpected ENOSPC conditions and filesystem shutdowns.
>
> I suspect what we need to do is get rid of the separate stripe_align
> variable altogether and always just set args->alignment to what we
> need the extent start alignment to be, regardless of whether it is
> from stripe alignment or forced alignment.
ok, it sounds a bit simpler at least
>
> Then the code in xfs_bmap_btalloc_at_eof() doesn't need to know what
> 'stripe_align' is - the exact EOF block allocation can simply save
> and restore the args->alignment value and use it for minalignslop
> calculations for the initial exact block allocation.
>
> Then, if xfs_bmap_btalloc_at_eof() fails and xfs_inode_forcealign()
> is true, we can abort allocation immediately, and not bother to fall
> back on further aligned/unaligned attempts that will also fail or do
> the wrong them.
ok
>
> Similarly, if we aren't doing EOF allocation, having args->alignment
> set means it will do the right thing for the first allocation
> attempt. Again, if that fails, we can check if
> xfs_inode_forcealign() is true and fail the aligned allocation
> instead of running the low space algorithm. This now makes it clear
> that we're failing the allocation because of the forced alignment
> requirement, and now the low space allocation code can explicitly
> turn off start alignment as it isn't required...
are you saying that low-space allocator can set args->alignment = 1 to
be explicit?
>
>
>> diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
>> index 18c8f168b153..70fe873951f3 100644
>> --- a/fs/xfs/xfs_iomap.c
>> +++ b/fs/xfs/xfs_iomap.c
>> @@ -181,7 +181,9 @@ xfs_eof_alignment(
>> * If mounted with the "-o swalloc" option the alignment is
>> * increased from the strip unit size to the stripe width.
>> */
>> - if (mp->m_swidth && xfs_has_swalloc(mp))
>> + if (xfs_inode_forcealign(ip))
I actually thought that I dropped this chunk as it was causing alignment
issues. I need to check that again.
>> + align = xfs_get_extsz_hint(ip);
>> + else if (mp->m_swidth && xfs_has_swalloc(mp))
>> align = mp->m_swidth;
>> else if (mp->m_dalign)
>> align = mp->m_dalign;
>
> This doesn't work for files with a current size of less than
> "align". The next line of code does:
>
> if (align && XFS_ISIZE(ip) < XFS_FSB_TO_B(mp, align))
> align = 0;
>
> IOWs, the first aligned write allocation will occur with a file size
> of zero, and that will result in this function returning 0. i.e.
> speculative post EOF delalloc doesn't turn on and align the EOF
> until writes up to offset == align have already been completed.
Maybe it was working as I have the change to keep EOF aligned for
forcealign. I'll check that.
>
> Essentially, this code wants to be:
>
> if (xfs_inode_forcealign(ip))
> return xfs_get_extsz_hint(ip);
>
> So that any write to the a force aligned inode will always trigger
> extent size hint EOF alignment.
>
ok, sounds reasonable.
Thanks,
John
^ permalink raw reply [flat|nested] 53+ messages in thread* Re: [PATCH v2 04/14] fs: xfs: Make file data allocations observe the 'forcealign' flag
2024-03-05 15:22 ` John Garry
@ 2024-03-05 22:18 ` Dave Chinner
2024-03-06 5:20 ` [RFC PATCH 0/3] xfs: forced extent alignment Dave Chinner
2024-03-06 9:41 ` [PATCH v2 04/14] fs: xfs: Make file data allocations observe the 'forcealign' flag John Garry
0 siblings, 2 replies; 53+ messages in thread
From: Dave Chinner @ 2024-03-05 22:18 UTC (permalink / raw)
To: John Garry
Cc: djwong, hch, viro, brauner, jack, chandan.babu, axboe,
martin.petersen, linux-kernel, linux-xfs, linux-fsdevel, tytso,
jbongio, ojaswin, ritesh.list, linux-block
On Tue, Mar 05, 2024 at 03:22:52PM +0000, John Garry wrote:
> On 05/03/2024 00:44, Dave Chinner wrote:
> > On Mon, Mar 04, 2024 at 01:04:18PM +0000, John Garry wrote:
....
> > IOWs, these are static geometry constraints and so should be checked
> > and rejected at the point where alignments are specified (i.e.
> > mkfs, mount and ioctls). Then the allocator can simply assume that
> > forced inode alignments are always stripe alignment compatible and
> > we don't need separate handling of two possibly incompatible
> > alignments.
>
> ok, makes sense.
>
> Please note in case missed, I am mandating extsize hint for forcealign needs
> to be a power-of-2. It just makes life easier for all the sub-extent
> zero'ing later on.
That's fine - that will need to be documented in the xfsctl man
page...
> Also we need to enforce that the AG count to be compliant with the extsize
^^^^^ size?
> hint for forcealign; but since the extsize hint for forcealign needs to be
> compliant with stripe unit, above, and stripe unit would be compliant wth AG
> count (right?), then this would be a given.
We already align AG size to stripe unit when a stripe unit is set,
and ensure that we don't place all the AG headers on the same stripe
unit.
However, if there is no stripe unit we don't align the AG to
anything. So, yes, AG sizing by mkfs will need to ensure that all
AGs are correctly aligned to the underlying storage (integer
multiple of the max atomic write size, right?)...
> > More below....
> >
> > > + } else {
> > > + args->alignment = 1;
> > > + }
> >
> > Just initialise the allocation args structure with a value of 1 like
> > we already do?
>
> It was being done in this way to have just a single place where the value is
> initialised. It can easily be kept as is.
I'd prefer it as is, because then the value is always initialised
correctly and we only override in the special cases....
> > > args.minleft = ap->minleft;
> > > @@ -3484,6 +3496,7 @@ xfs_bmap_btalloc_at_eof(
> > > {
> > > struct xfs_mount *mp = args->mp;
> > > struct xfs_perag *caller_pag = args->pag;
> > > + int orig_alignment = args->alignment;
> > > int error;
> > > /*
> > > @@ -3558,10 +3571,10 @@ xfs_bmap_btalloc_at_eof(
> > > /*
> > > * Allocation failed, so turn return the allocation args to their
> > > - * original non-aligned state so the caller can proceed on allocation
> > > - * failure as if this function was never called.
> > > + * original state so the caller can proceed on allocation failure as
> > > + * if this function was never called.
> > > */
> > > - args->alignment = 1;
> > > + args->alignment = orig_alignment;
> > > return 0;
> > > }
> >
> > As I said above, we can't set an alignment of > 1 here if we haven't
> > accounted for that alignment in args->minalignslop above. This leads
> > to unexpected ENOSPC conditions and filesystem shutdowns.
> >
> > I suspect what we need to do is get rid of the separate stripe_align
> > variable altogether and always just set args->alignment to what we
> > need the extent start alignment to be, regardless of whether it is
> > from stripe alignment or forced alignment.
>
> ok, it sounds a bit simpler at least
>
> >
> > Then the code in xfs_bmap_btalloc_at_eof() doesn't need to know what
> > 'stripe_align' is - the exact EOF block allocation can simply save
> > and restore the args->alignment value and use it for minalignslop
> > calculations for the initial exact block allocation.
> >
> > Then, if xfs_bmap_btalloc_at_eof() fails and xfs_inode_forcealign()
> > is true, we can abort allocation immediately, and not bother to fall
> > back on further aligned/unaligned attempts that will also fail or do
> > the wrong them.
>
> ok
>
> >
> > Similarly, if we aren't doing EOF allocation, having args->alignment
> > set means it will do the right thing for the first allocation
> > attempt. Again, if that fails, we can check if
> > xfs_inode_forcealign() is true and fail the aligned allocation
> > instead of running the low space algorithm. This now makes it clear
> > that we're failing the allocation because of the forced alignment
> > requirement, and now the low space allocation code can explicitly
> > turn off start alignment as it isn't required...
>
> are you saying that low-space allocator can set args->alignment = 1 to be
> explicit?
Yes.
-Dave.
--
Dave Chinner
david@fromorbit.com
^ permalink raw reply [flat|nested] 53+ messages in thread* [RFC PATCH 0/3] xfs: forced extent alignment
2024-03-05 22:18 ` Dave Chinner
@ 2024-03-06 5:20 ` Dave Chinner
2024-03-06 5:20 ` [PATCH 1/3] xfs: simplify extent allocation alignment Dave Chinner
` (4 more replies)
2024-03-06 9:41 ` [PATCH v2 04/14] fs: xfs: Make file data allocations observe the 'forcealign' flag John Garry
1 sibling, 5 replies; 53+ messages in thread
From: Dave Chinner @ 2024-03-06 5:20 UTC (permalink / raw)
To: linux-xfs; +Cc: john.g.garry, ojaswin, ritesh.list
Hi Garry,
I figured that it was simpler just to write the forced extent
alignment allocator patches that to make you struggle through them
and require lots of round trips to understand all the weird corner
cases.
The following 3 patches:
- rework the setup and extent allocation logic a bit to make force
aligned allocation much easier to implement and understand
- move all the alignment adjustments into the setup logic
- rework the alignment slop calculations and greatly simplify the
the exact EOF block allocation case
- add a XFS_ALLOC_FORCEALIGN flag so that the inode config only
needs to be checked once at setup. This also allows other
allocation types (e.g. inode clusters) use forced alignment
allocation semantics in future.
- clearly document when we are turning off allocation alignment and
abort FORCEALIGN allocation at that point rather than doing
unaligned allocation.
I've run this through fstests once so it doesn't let the smoke out,
but I haven't actually tested it against a stripe aligned filesystem
config yet, nor tested the forcealign functionality so it may not be
exactly right yet.
Is this sufficiently complete for you to take from here into the
forcealign series?
Cheers,
Dave.
^ permalink raw reply [flat|nested] 53+ messages in thread* [PATCH 1/3] xfs: simplify extent allocation alignment
2024-03-06 5:20 ` [RFC PATCH 0/3] xfs: forced extent alignment Dave Chinner
@ 2024-03-06 5:20 ` Dave Chinner
2024-03-13 11:03 ` John Garry
2024-03-06 5:20 ` [PATCH 2/3] xfs: make EOF allocation simpler Dave Chinner
` (3 subsequent siblings)
4 siblings, 1 reply; 53+ messages in thread
From: Dave Chinner @ 2024-03-06 5:20 UTC (permalink / raw)
To: linux-xfs; +Cc: john.g.garry, ojaswin, ritesh.list
From: Dave Chinner <dchinner@redhat.com>
We currently align extent allocation to stripe unit or stripe width.
That is specified by an external parameter to the allocation code,
which then manipulates the xfs_alloc_args alignment configuration in
interesting ways.
The args->alignment field specifies extent start alignment, but
because we may be attempting non-aligned allocation first there are
also slop variables that allow for those allocation attempts to
account for aligned allocation if they fail.
This gets much more complex as we introduce forced allocation
alignment, where extent size hints are used to generate the extent
start alignment. extent size hints currently only affect extent
lengths (via args->prod and args->mod) and so with this change we
will have two different start alignment conditions.
Avoid this complexity by always using args->alignment to indicate
extent start alignment, and always using args->prod/mod to indicate
extent length adjustment.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
fs/xfs/libxfs/xfs_alloc.c | 2 +-
fs/xfs/libxfs/xfs_alloc.h | 2 +-
fs/xfs/libxfs/xfs_bmap.c | 96 +++++++++++++++++----------------------
3 files changed, 44 insertions(+), 56 deletions(-)
diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c
index 9da52e92172a..5ed9c8a96e32 100644
--- a/fs/xfs/libxfs/xfs_alloc.c
+++ b/fs/xfs/libxfs/xfs_alloc.c
@@ -2395,7 +2395,7 @@ xfs_alloc_space_available(
reservation = xfs_ag_resv_needed(pag, args->resv);
/* do we have enough contiguous free space for the allocation? */
- alloc_len = args->minlen + (args->alignment - 1) + args->minalignslop;
+ alloc_len = args->minlen + (args->alignment - 1) + args->alignslop;
longest = xfs_alloc_longest_free_extent(pag, min_free, reservation);
if (longest < alloc_len)
return false;
diff --git a/fs/xfs/libxfs/xfs_alloc.h b/fs/xfs/libxfs/xfs_alloc.h
index 0b956f8b9d5a..aa2c103d98f0 100644
--- a/fs/xfs/libxfs/xfs_alloc.h
+++ b/fs/xfs/libxfs/xfs_alloc.h
@@ -46,7 +46,7 @@ typedef struct xfs_alloc_arg {
xfs_extlen_t minleft; /* min blocks must be left after us */
xfs_extlen_t total; /* total blocks needed in xaction */
xfs_extlen_t alignment; /* align answer to multiple of this */
- xfs_extlen_t minalignslop; /* slop for minlen+alignment calcs */
+ xfs_extlen_t alignslop; /* slop for alignment calcs */
xfs_agblock_t min_agbno; /* set an agbno range for NEAR allocs */
xfs_agblock_t max_agbno; /* ... */
xfs_extlen_t len; /* output: actual size of extent */
diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
index 656c95a22f2e..d56c82c07505 100644
--- a/fs/xfs/libxfs/xfs_bmap.c
+++ b/fs/xfs/libxfs/xfs_bmap.c
@@ -3295,6 +3295,10 @@ xfs_bmap_select_minlen(
xfs_extlen_t blen)
{
+ /* Adjust best length for extent start alignment. */
+ if (blen > args->alignment)
+ blen -= args->alignment;
+
/*
* Since we used XFS_ALLOC_FLAG_TRYLOCK in _longest_free_extent(), it is
* possible that there is enough contiguous free space for this request.
@@ -3310,6 +3314,7 @@ xfs_bmap_select_minlen(
if (blen < args->maxlen)
return blen;
return args->maxlen;
+
}
static int
@@ -3403,35 +3408,43 @@ xfs_bmap_alloc_account(
xfs_trans_mod_dquot_byino(ap->tp, ap->ip, fld, ap->length);
}
-static int
+/*
+ * Calculate the extent start alignment and the extent length adjustments that
+ * constrain this allocation.
+ *
+ * Extent start alignment is currently determined by stripe configuration and is
+ * carried in args->alignment, whilst extent length adjustment is determined by
+ * extent size hints and is carried by args->prod and args->mod.
+ *
+ * Low level allocation code is free to either ignore or override these values
+ * as required.
+ */
+static void
xfs_bmap_compute_alignments(
struct xfs_bmalloca *ap,
struct xfs_alloc_arg *args)
{
struct xfs_mount *mp = args->mp;
xfs_extlen_t align = 0; /* minimum allocation alignment */
- int stripe_align = 0;
/* stripe alignment for allocation is determined by mount parameters */
if (mp->m_swidth && xfs_has_swalloc(mp))
- stripe_align = mp->m_swidth;
+ args->alignment = mp->m_swidth;
else if (mp->m_dalign)
- stripe_align = mp->m_dalign;
+ args->alignment = mp->m_dalign;
if (ap->flags & XFS_BMAPI_COWFORK)
align = xfs_get_cowextsz_hint(ap->ip);
else if (ap->datatype & XFS_ALLOC_USERDATA)
align = xfs_get_extsz_hint(ap->ip);
+
if (align) {
if (xfs_bmap_extsize_align(mp, &ap->got, &ap->prev, align, 0,
ap->eof, 0, ap->conv, &ap->offset,
&ap->length))
ASSERT(0);
ASSERT(ap->length);
- }
- /* apply extent size hints if obtained earlier */
- if (align) {
args->prod = align;
div_u64_rem(ap->offset, args->prod, &args->mod);
if (args->mod)
@@ -3446,7 +3459,6 @@ xfs_bmap_compute_alignments(
args->mod = args->prod - args->mod;
}
- return stripe_align;
}
static void
@@ -3518,7 +3530,7 @@ xfs_bmap_exact_minlen_extent_alloc(
args.total = ap->total;
args.alignment = 1;
- args.minalignslop = 0;
+ args.alignslop = 0;
args.minleft = ap->minleft;
args.wasdel = ap->wasdel;
@@ -3558,7 +3570,6 @@ xfs_bmap_btalloc_at_eof(
struct xfs_bmalloca *ap,
struct xfs_alloc_arg *args,
xfs_extlen_t blen,
- int stripe_align,
bool ag_only)
{
struct xfs_mount *mp = args->mp;
@@ -3572,23 +3583,15 @@ xfs_bmap_btalloc_at_eof(
* allocation.
*/
if (ap->offset) {
- xfs_extlen_t nextminlen = 0;
+ xfs_extlen_t alignment = args->alignment;
/*
- * Compute the minlen+alignment for the next case. Set slop so
- * that the value of minlen+alignment+slop doesn't go up between
- * the calls.
+ * Compute the alignment slop for the fallback path so we ensure
+ * we account for the potential alignemnt space required by the
+ * fallback paths before we modify the AGF and AGFL here.
*/
args->alignment = 1;
- if (blen > stripe_align && blen <= args->maxlen)
- nextminlen = blen - stripe_align;
- else
- nextminlen = args->minlen;
- if (nextminlen + stripe_align > args->minlen + 1)
- args->minalignslop = nextminlen + stripe_align -
- args->minlen - 1;
- else
- args->minalignslop = 0;
+ args->alignslop = alignment - args->alignment;
if (!caller_pag)
args->pag = xfs_perag_get(mp, XFS_FSB_TO_AGNO(mp, ap->blkno));
@@ -3606,19 +3609,8 @@ xfs_bmap_btalloc_at_eof(
* Exact allocation failed. Reset to try an aligned allocation
* according to the original allocation specification.
*/
- args->alignment = stripe_align;
- args->minlen = nextminlen;
- args->minalignslop = 0;
- } else {
- /*
- * Adjust minlen to try and preserve alignment if we
- * can't guarantee an aligned maxlen extent.
- */
- args->alignment = stripe_align;
- if (blen > args->alignment &&
- blen <= args->maxlen + args->alignment)
- args->minlen = blen - args->alignment;
- args->minalignslop = 0;
+ args->alignment = alignment;
+ args->alignslop = 0;
}
if (ag_only) {
@@ -3636,9 +3628,8 @@ xfs_bmap_btalloc_at_eof(
return 0;
/*
- * Allocation failed, so turn return the allocation args to their
- * original non-aligned state so the caller can proceed on allocation
- * failure as if this function was never called.
+ * Aligned allocation failed, so all fallback paths from here drop the
+ * start alignment requirement as we know it will not succeed.
*/
args->alignment = 1;
return 0;
@@ -3646,7 +3637,9 @@ xfs_bmap_btalloc_at_eof(
/*
* We have failed multiple allocation attempts so now are in a low space
- * allocation situation. Try a locality first full filesystem minimum length
+ * allocation situation. We give up on any attempt at aligned allocation here.
+ *
+ * Try a locality first full filesystem minimum length
* allocation whilst still maintaining necessary total block reservation
* requirements.
*
@@ -3663,6 +3656,7 @@ xfs_bmap_btalloc_low_space(
{
int error;
+ args->alignment = 1;
if (args->minlen > ap->minlen) {
args->minlen = ap->minlen;
error = xfs_alloc_vextent_start_ag(args, ap->blkno);
@@ -3682,13 +3676,11 @@ xfs_bmap_btalloc_low_space(
static int
xfs_bmap_btalloc_filestreams(
struct xfs_bmalloca *ap,
- struct xfs_alloc_arg *args,
- int stripe_align)
+ struct xfs_alloc_arg *args)
{
xfs_extlen_t blen = 0;
int error = 0;
-
error = xfs_filestream_select_ag(ap, args, &blen);
if (error)
return error;
@@ -3707,8 +3699,7 @@ xfs_bmap_btalloc_filestreams(
args->minlen = xfs_bmap_select_minlen(ap, args, blen);
if (ap->aeof)
- error = xfs_bmap_btalloc_at_eof(ap, args, blen, stripe_align,
- true);
+ error = xfs_bmap_btalloc_at_eof(ap, args, blen, true);
if (!error && args->fsbno == NULLFSBLOCK)
error = xfs_alloc_vextent_near_bno(args, ap->blkno);
@@ -3732,8 +3723,7 @@ xfs_bmap_btalloc_filestreams(
static int
xfs_bmap_btalloc_best_length(
struct xfs_bmalloca *ap,
- struct xfs_alloc_arg *args,
- int stripe_align)
+ struct xfs_alloc_arg *args)
{
xfs_extlen_t blen = 0;
int error;
@@ -3757,8 +3747,7 @@ xfs_bmap_btalloc_best_length(
* trying.
*/
if (ap->aeof && !(ap->tp->t_flags & XFS_TRANS_LOWMODE)) {
- error = xfs_bmap_btalloc_at_eof(ap, args, blen, stripe_align,
- false);
+ error = xfs_bmap_btalloc_at_eof(ap, args, blen, false);
if (error || args->fsbno != NULLFSBLOCK)
return error;
}
@@ -3785,27 +3774,26 @@ xfs_bmap_btalloc(
.resv = XFS_AG_RESV_NONE,
.datatype = ap->datatype,
.alignment = 1,
- .minalignslop = 0,
+ .alignslop = 0,
};
xfs_fileoff_t orig_offset;
xfs_extlen_t orig_length;
int error;
- int stripe_align;
ASSERT(ap->length);
orig_offset = ap->offset;
orig_length = ap->length;
- stripe_align = xfs_bmap_compute_alignments(ap, &args);
+ xfs_bmap_compute_alignments(ap, &args);
/* Trim the allocation back to the maximum an AG can fit. */
args.maxlen = min(ap->length, mp->m_ag_max_usable);
if ((ap->datatype & XFS_ALLOC_USERDATA) &&
xfs_inode_is_filestream(ap->ip))
- error = xfs_bmap_btalloc_filestreams(ap, &args, stripe_align);
+ error = xfs_bmap_btalloc_filestreams(ap, &args);
else
- error = xfs_bmap_btalloc_best_length(ap, &args, stripe_align);
+ error = xfs_bmap_btalloc_best_length(ap, &args);
if (error)
return error;
--
2.43.0
^ permalink raw reply related [flat|nested] 53+ messages in thread* Re: [PATCH 1/3] xfs: simplify extent allocation alignment
2024-03-06 5:20 ` [PATCH 1/3] xfs: simplify extent allocation alignment Dave Chinner
@ 2024-03-13 11:03 ` John Garry
2024-03-20 4:35 ` Dave Chinner
0 siblings, 1 reply; 53+ messages in thread
From: John Garry @ 2024-03-13 11:03 UTC (permalink / raw)
To: Dave Chinner, linux-xfs; +Cc: ojaswin, ritesh.list
On 06/03/2024 05:20, Dave Chinner wrote:
> return false;
> diff --git a/fs/xfs/libxfs/xfs_alloc.h b/fs/xfs/libxfs/xfs_alloc.h
> index 0b956f8b9d5a..aa2c103d98f0 100644
> --- a/fs/xfs/libxfs/xfs_alloc.h
> +++ b/fs/xfs/libxfs/xfs_alloc.h
> @@ -46,7 +46,7 @@ typedef struct xfs_alloc_arg {
> xfs_extlen_t minleft; /* min blocks must be left after us */
> xfs_extlen_t total; /* total blocks needed in xaction */
> xfs_extlen_t alignment; /* align answer to multiple of this */
> - xfs_extlen_t minalignslop; /* slop for minlen+alignment calcs */
> + xfs_extlen_t alignslop; /* slop for alignment calcs */
> xfs_agblock_t min_agbno; /* set an agbno range for NEAR allocs */
> xfs_agblock_t max_agbno; /* ... */
> xfs_extlen_t len; /* output: actual size of extent */
> diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
> index 656c95a22f2e..d56c82c07505 100644
> --- a/fs/xfs/libxfs/xfs_bmap.c
> +++ b/fs/xfs/libxfs/xfs_bmap.c
> @@ -3295,6 +3295,10 @@ xfs_bmap_select_minlen(
> xfs_extlen_t blen)
Hi Dave,
> {
>
> + /* Adjust best length for extent start alignment. */
> + if (blen > args->alignment)
> + blen -= args->alignment;
> +
This change seems to be causing or exposing some issue, in that I find
that I am being allocated an extent which is aligned to but not a
multiple of args->alignment.
For my test, I have forcealign=16KB and initially I write 1756 * 4096 =
7192576B to the file, so I have this:
EXT: FILE-OFFSET BLOCK-RANGE AG AG-OFFSET TOTAL
0: [0..14079]: 42432..56511 0 (42432..56511) 14080
That is 1760 FSBs for extent #0.
Then I write 340992B from offset 7195648, and I find this:
EXT: FILE-OFFSET BLOCK-RANGE AG AG-OFFSET TOTAL
0: [0..14079]: 42432..56511 0 (42432..56511) 14080
1: [14080..14711]: 177344..177975 0 (177344..177975) 632
2: [14712..14719]: 350720..350727 1 (171520..171527) 8
extent #1 is 79 FSBs, which is not a multiple of 16KB.
In this case, in xfs_bmap_select_minlen() I find initially blen=80
args->alignment=4, ->minlen=0, ->maxlen=80. Subsequently blen is reduced
to 76 and args->minlen is set to 76, and then
xfs_bmap_btalloc_best_length() -> xfs_alloc_vextent_start_ag() happens
to find an extent of length 79.
Removing the specific change to modify blen seems to make things ok again.
I will also note something strange which could be the issue, that being
that xfs_alloc_fix_len() does not fix this up - I thought that was its
job. Firstly, in this same scenario, in xfs_alloc_space_available() we
calculate alloc_len = args->minlen + (args->alignment - 1) +
args->alignslop = 76 + (4 - 1) + 0 = 79, and then args->maxlen = 79.
Then xfs_alloc_fix_len() allows this as args->len == args->maxlen (=79),
even though args->prod, mod = 4, 0. To me, that (args->alignment - 1)
component in calculating alloc_len is odd. I assume it is done as
default args->alignment == 1.
Anyway, let me know what you think.
Cheers,
John
> /*
> * Since we used XFS_ALLOC_FLAG_TRYLOCK in _longest_free_extent(), it is
> * possible that there is enough contiguous free space for this request.
> @@ -3310,6 +3314,7 @@ xfs_bmap_select_minlen(
> if (blen < args->maxlen)
> return blen;
> return args->maxlen;
> +
^ permalink raw reply [flat|nested] 53+ messages in thread* Re: [PATCH 1/3] xfs: simplify extent allocation alignment
2024-03-13 11:03 ` John Garry
@ 2024-03-20 4:35 ` Dave Chinner
2024-03-26 16:08 ` John Garry
0 siblings, 1 reply; 53+ messages in thread
From: Dave Chinner @ 2024-03-20 4:35 UTC (permalink / raw)
To: John Garry; +Cc: linux-xfs, ojaswin, ritesh.list
On Wed, Mar 13, 2024 at 11:03:18AM +0000, John Garry wrote:
> On 06/03/2024 05:20, Dave Chinner wrote:
> > return false;
> > diff --git a/fs/xfs/libxfs/xfs_alloc.h b/fs/xfs/libxfs/xfs_alloc.h
> > index 0b956f8b9d5a..aa2c103d98f0 100644
> > --- a/fs/xfs/libxfs/xfs_alloc.h
> > +++ b/fs/xfs/libxfs/xfs_alloc.h
> > @@ -46,7 +46,7 @@ typedef struct xfs_alloc_arg {
> > xfs_extlen_t minleft; /* min blocks must be left after us */
> > xfs_extlen_t total; /* total blocks needed in xaction */
> > xfs_extlen_t alignment; /* align answer to multiple of this */
> > - xfs_extlen_t minalignslop; /* slop for minlen+alignment calcs */
> > + xfs_extlen_t alignslop; /* slop for alignment calcs */
> > xfs_agblock_t min_agbno; /* set an agbno range for NEAR allocs */
> > xfs_agblock_t max_agbno; /* ... */
> > xfs_extlen_t len; /* output: actual size of extent */
> > diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
> > index 656c95a22f2e..d56c82c07505 100644
> > --- a/fs/xfs/libxfs/xfs_bmap.c
> > +++ b/fs/xfs/libxfs/xfs_bmap.c
> > @@ -3295,6 +3295,10 @@ xfs_bmap_select_minlen(
> > xfs_extlen_t blen)
>
> Hi Dave,
>
> > {
> > + /* Adjust best length for extent start alignment. */
> > + if (blen > args->alignment)
> > + blen -= args->alignment;
> > +
>
> This change seems to be causing or exposing some issue, in that I find that
> I am being allocated an extent which is aligned to but not a multiple of
> args->alignment.
Entirely possible the logic isn't correct ;)
IIRC, I added this because I thought that blen ends up influencing
args->maxlen and nothing else. The alignment isn't taken out of
"blen", it's supposed to be added to args->maxlen.
> For my test, I have forcealign=16KB and initially I write 1756 * 4096 =
> 7192576B to the file, so I have this:
>
> EXT: FILE-OFFSET BLOCK-RANGE AG AG-OFFSET TOTAL
> 0: [0..14079]: 42432..56511 0 (42432..56511) 14080
>
> That is 1760 FSBs for extent #0.
>
> Then I write 340992B from offset 7195648, and I find this:
> EXT: FILE-OFFSET BLOCK-RANGE AG AG-OFFSET TOTAL
> 0: [0..14079]: 42432..56511 0 (42432..56511) 14080
> 1: [14080..14711]: 177344..177975 0 (177344..177975) 632
> 2: [14712..14719]: 350720..350727 1 (171520..171527) 8
>
> extent #1 is 79 FSBs, which is not a multiple of 16KB.
> In this case, in xfs_bmap_select_minlen() I find initially blen=80
Ah, so you've hit the corner case of the largest free space being
exactly 80 in length and args->maxlen = 80.
> args->alignment=4, ->minlen=0, ->maxlen=80. Subsequently blen is reduced to
> 76 and args->minlen is set to 76, and then xfs_bmap_btalloc_best_length() ->
> xfs_alloc_vextent_start_ag() happens to find an extent of length 79.
So there's nothing wrong here. We've asked for any extent that
is between 76 and 80 blocks in length to be allocated, and we found
one that is 79 blocks in length.
Finding a 79 block extent instead of an 80 block extent like blen
indicated means that there was either:
- a block moved to the AGFL from that 80 block free extent prior to
doing the free extent search.
- a block was busy and so was trimmed out via
xfs_alloc_compute_aligned()
- the front edge wasn't aligned so it took a block off the front of
the free space to align it. It's this condition that code I added
above takes into account - an exact match on size does not imply
that aligned allocation of exactly that size can be done.
Given the front edge is aligned, I'd say it was the latter that
occurred. The question is this: why wasn't the tail edge aligned
down to make the extent length 76 blocks?
> Removing the specific change to modify blen seems to make things ok again.
Right, because now the allocation ends up being set up with
args->minlen = args->maxlen = 80 and the allocation is unable to
align the extent and meet the args->minlen requirement from that
same unaligned 80 block free space. Hence that allocation fails and
we fall back to a different allocation strategy that searches other
AGs for a matching aligned allocation.
IOWs, removing the 'blen -= args->alignment' code simply kicks the
problem down the road until all AGs run out of 80 block contiguous
extents.
This really smells like a tail alignment bug, not a problem with the
allocation setup. Returning an extent that is 76 blocks in length
fulfils the 4 block alignment requirement, so why did tail alignment
fail?
> I will also note something strange which could be the issue, that being that
> xfs_alloc_fix_len() does not fix this up - I thought that was its job.
Yes, it should fix this up.
> Firstly, in this same scenario, in xfs_alloc_space_available() we calculate
> alloc_len = args->minlen + (args->alignment - 1) + args->alignslop = 76 + (4
> - 1) + 0 = 79, and then args->maxlen = 79.
That seems OK, we're doing aligned allocation and this is an ENOSPC
corner case so the aligned allocation should get rounded down in
xfs_alloc_fix_len() or rejected.
One thought I just had is that the args->maxlen adjustment shouldn't
be to "available space" - it should probably be set to args->minlen
because that's the aligned 'alloc_len' we checked available space
against. That would fix this, because then we'd have args->minlen =
args->maxlen = 76.
However, that only addresses this specific case, not the general
case of xfs_alloc_fix_len() failing to tail align the allocated
extent.
> Then xfs_alloc_fix_len() allows
> this as args->len == args->maxlen (=79), even though args->prod, mod = 4, 0.
Yeah, that smells wrong.
I'd suggest that we've never noticed this until now because we
have never guaranteed extent alignment. Hence the occasional
short/unaligned extent being allocated in dark ENOSPC corners was
never an issue for anyone.
However, introducing a new alignment guarantee turns these sorts of
latent non-issues into bugs that need to be fixed. i.e. This is
exactly the sort of rare corner case behaviour I expected to be
flushed out by guaranteeing and then extensively testing allocation
alignments.
If you drop the rlen == args->maxlen check from
xfs_alloc_space_available(), the problem should go away and the
extent gets trimmed to 76 blocks. This shouldn't affect anything
else because maxlen allocations should already be properly aligned -
it's only when something like ENOSPC causes args->maxlen to be
modified to an unaligned value that this issue arises.
In the end, I suspect we'll want to make both changes....
> To me, that (args->alignment - 1) component in calculating alloc_len is odd.
> I assume it is done as default args->alignment == 1.
No, it's done because guaranteeing aligned allocation requires
selecting an aligned region from an unaligned free space. i.e. when
alignment is 4, then we can need up to 3 additional blocks to
guarantee front alignment for a given length extent.
i.e. we have to over-allocate to guarantee we can trim up
to alignment at the front edge and still guarantee that the extent
is as long as required by args->minlen/maxlen.
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
^ permalink raw reply [flat|nested] 53+ messages in thread* Re: [PATCH 1/3] xfs: simplify extent allocation alignment
2024-03-20 4:35 ` Dave Chinner
@ 2024-03-26 16:08 ` John Garry
2024-04-02 5:58 ` Dave Chinner
0 siblings, 1 reply; 53+ messages in thread
From: John Garry @ 2024-03-26 16:08 UTC (permalink / raw)
To: Dave Chinner; +Cc: linux-xfs, ojaswin, ritesh.list
On 20/03/2024 04:35, Dave Chinner wrote:
For some reason I never received this mail. I just noticed it on
lore.kernel.org today by chance.
> On Wed, Mar 13, 2024 at 11:03:18AM +0000, John Garry wrote:
>> On 06/03/2024 05:20, Dave Chinner wrote:
>>> return false;
>>> diff --git a/fs/xfs/libxfs/xfs_alloc.h b/fs/xfs/libxfs/xfs_alloc.h
>>> index 0b956f8b9d5a..aa2c103d98f0 100644
>>> --- a/fs/xfs/libxfs/xfs_alloc.h
>>> +++ b/fs/xfs/libxfs/xfs_alloc.h
>>> @@ -46,7 +46,7 @@ typedef struct xfs_alloc_arg {
>>> xfs_extlen_t minleft; /* min blocks must be left after us */
>>> xfs_extlen_t total; /* total blocks needed in xaction */
>>> xfs_extlen_t alignment; /* align answer to multiple of this */
>>> - xfs_extlen_t minalignslop; /* slop for minlen+alignment calcs */
>>> + xfs_extlen_t alignslop; /* slop for alignment calcs */
>>> xfs_agblock_t min_agbno; /* set an agbno range for NEAR allocs */
>>> xfs_agblock_t max_agbno; /* ... */
>>> xfs_extlen_t len; /* output: actual size of extent */
>>> diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
>>> index 656c95a22f2e..d56c82c07505 100644
>>> --- a/fs/xfs/libxfs/xfs_bmap.c
>>> +++ b/fs/xfs/libxfs/xfs_bmap.c
>>> @@ -3295,6 +3295,10 @@ xfs_bmap_select_minlen(
>>> xfs_extlen_t blen)
>>
>> Hi Dave,
>>
>>> {
>>> + /* Adjust best length for extent start alignment. */
>>> + if (blen > args->alignment)
>>> + blen -= args->alignment;
>>> +
>>
>> This change seems to be causing or exposing some issue, in that I find that
>> I am being allocated an extent which is aligned to but not a multiple of
>> args->alignment.
>
> Entirely possible the logic isn't correct ;)
Out of curiosity, how do you guys normally test all this sort of logic?
I found this issue with the small program which I wrote to generate
traffic. I could not find anything similar.
>
> IIRC, I added this because I thought that blen ends up influencing
> args->maxlen and nothing else. The alignment isn't taken out of
> "blen", it's supposed to be added to args->maxlen.
>
>> For my test, I have forcealign=16KB and initially I write 1756 * 4096 =
>> 7192576B to the file, so I have this:
>>
>> EXT: FILE-OFFSET BLOCK-RANGE AG AG-OFFSET TOTAL
>> 0: [0..14079]: 42432..56511 0 (42432..56511) 14080
>>
>> That is 1760 FSBs for extent #0.
>>
>> Then I write 340992B from offset 7195648, and I find this:
>> EXT: FILE-OFFSET BLOCK-RANGE AG AG-OFFSET TOTAL
>> 0: [0..14079]: 42432..56511 0 (42432..56511) 14080
>> 1: [14080..14711]: 177344..177975 0 (177344..177975) 632
>> 2: [14712..14719]: 350720..350727 1 (171520..171527) 8
>>
>> extent #1 is 79 FSBs, which is not a multiple of 16KB.
>
>> In this case, in xfs_bmap_select_minlen() I find initially blen=80
>
> Ah, so you've hit the corner case of the largest free space being
> exactly 80 in length and args->maxlen = 80.
>
>> args->alignment=4, ->minlen=0, ->maxlen=80. Subsequently blen is reduced to
>> 76 and args->minlen is set to 76, and then xfs_bmap_btalloc_best_length() ->
>> xfs_alloc_vextent_start_ag() happens to find an extent of length 79.
>
> So there's nothing wrong here. We've asked for any extent that
> is between 76 and 80 blocks in length to be allocated, and we found
> one that is 79 blocks in length.
Sure
>
> Finding a 79 block extent instead of an 80 block extent like blen
> indicated means that there was either:
>
> - a block moved to the AGFL from that 80 block free extent prior to
> doing the free extent search.
> - a block was busy and so was trimmed out via
> xfs_alloc_compute_aligned()
> - the front edge wasn't aligned so it took a block off the front of
> the free space to align it. It's this condition that code I added
> above takes into account - an exact match on size does not imply
> that aligned allocation of exactly that size can be done.
>
> Given the front edge is aligned, I'd say it was the latter that
> occurred. The question is this: why wasn't the tail edge aligned
> down to make the extent length 76 blocks?
>
>> Removing the specific change to modify blen seems to make things ok again.
>
> Right, because now the allocation ends up being set up with
> args->minlen = args->maxlen = 80 and the allocation is unable to
> align the extent and meet the args->minlen requirement from that
> same unaligned 80 block free space. Hence that allocation fails and
> we fall back to a different allocation strategy that searches other
> AGs for a matching aligned allocation.
ok
>
> IOWs, removing the 'blen -= args->alignment' code simply kicks the
> problem down the road until all AGs run out of 80 block contiguous
> extents.
right
>
> This really smells like a tail alignment bug, not a problem with the
> allocation setup. Returning an extent that is 76 blocks in length
> fulfils the 4 block alignment requirement, so why did tail alignment
> fail?
>
>> I will also note something strange which could be the issue, that being that
>> xfs_alloc_fix_len() does not fix this up - I thought that was its job.
>
> Yes, it should fix this up.
As below, xfs_alloc_fix_len() does nothing (useful).
>
>> Firstly, in this same scenario, in xfs_alloc_space_available() we calculate
>> alloc_len = args->minlen + (args->alignment - 1) + args->alignslop = 76 + (4
>> - 1) + 0 = 79, and then args->maxlen = 79.
>
> That seems OK, we're doing aligned allocation and this is an ENOSPC
> corner case so the aligned allocation should get rounded down in
> xfs_alloc_fix_len() or rejected.
>
> One thought I just had is that the args->maxlen adjustment shouldn't
> be to "available space" - it should probably be set to args->minlen
> because that's the aligned 'alloc_len' we checked available space
> against. That would fix this, because then we'd have args->minlen =
> args->maxlen = 76.
>
> However, that only addresses this specific case, not the general
> case of xfs_alloc_fix_len() failing to tail align the allocated
> extent.
>
>> Then xfs_alloc_fix_len() allows
>> this as args->len == args->maxlen (=79), even though args->prod, mod = 4, 0.
>
> Yeah, that smells wrong.
Would it be worth adding a debug assert for prod and mod being honoured
from the allocator? xfs_alloc_fix_len() does have an assert later on and
it does not help here.
>
> I'd suggest that we've never noticed this until now because we
> have never guaranteed extent alignment. Hence the occasional
> short/unaligned extent being allocated in dark ENOSPC corners was
> never an issue for anyone.
>
> However, introducing a new alignment guarantee turns these sorts of
> latent non-issues into bugs that need to be fixed. i.e. This is
> exactly the sort of rare corner case behaviour I expected to be
> flushed out by guaranteeing and then extensively testing allocation
> alignments.
>
> If you drop the rlen == args->maxlen check from
> xfs_alloc_space_available(),
I assume that you mean xfs_alloc_fix_len()
> the problem should go away and the
> extent gets trimmed to 76 blocks.
..if so, then, yes, it does. We end up with this:
0: [0..14079]: 42432..56511 0 (42432..56511) 14080
1: [14080..14687]: 177344..177951 0 (177344..177951) 608
2: [14688..14719]: 350720..350751 1 (171520..171551) 32
> This shouldn't affect anything
> else because maxlen allocations should already be properly aligned -
> it's only when something like ENOSPC causes args->maxlen to be
> modified to an unaligned value that this issue arises.
>
> In the end, I suspect we'll want to make both changes....
>
>> To me, that (args->alignment - 1) component in calculating alloc_len is odd.
>> I assume it is done as default args->alignment == 1.
>
> No, it's done because guaranteeing aligned allocation requires
> selecting an aligned region from an unaligned free space. i.e. when
> alignment is 4, then we can need up to 3 additional blocks to
> guarantee front alignment for a given length extent.
> i.e. we have to over-allocate to guarantee we can trim up
> to alignment at the front edge and still guarantee that the extent
> is as long as required by args->minlen/maxlen.
>
ok, understood.
Thanks,
John
^ permalink raw reply [flat|nested] 53+ messages in thread* Re: [PATCH 1/3] xfs: simplify extent allocation alignment
2024-03-26 16:08 ` John Garry
@ 2024-04-02 5:58 ` Dave Chinner
2024-04-02 7:49 ` John Garry
0 siblings, 1 reply; 53+ messages in thread
From: Dave Chinner @ 2024-04-02 5:58 UTC (permalink / raw)
To: John Garry; +Cc: linux-xfs, ojaswin, ritesh.list
On Tue, Mar 26, 2024 at 04:08:04PM +0000, John Garry wrote:
> On 20/03/2024 04:35, Dave Chinner wrote:
>
> For some reason I never received this mail. I just noticed it on
> lore.kernel.org today by chance.
>
> > On Wed, Mar 13, 2024 at 11:03:18AM +0000, John Garry wrote:
> > > On 06/03/2024 05:20, Dave Chinner wrote:
> > > > return false;
> > > > diff --git a/fs/xfs/libxfs/xfs_alloc.h b/fs/xfs/libxfs/xfs_alloc.h
> > > > index 0b956f8b9d5a..aa2c103d98f0 100644
> > > > --- a/fs/xfs/libxfs/xfs_alloc.h
> > > > +++ b/fs/xfs/libxfs/xfs_alloc.h
> > > > @@ -46,7 +46,7 @@ typedef struct xfs_alloc_arg {
> > > > xfs_extlen_t minleft; /* min blocks must be left after us */
> > > > xfs_extlen_t total; /* total blocks needed in xaction */
> > > > xfs_extlen_t alignment; /* align answer to multiple of this */
> > > > - xfs_extlen_t minalignslop; /* slop for minlen+alignment calcs */
> > > > + xfs_extlen_t alignslop; /* slop for alignment calcs */
> > > > xfs_agblock_t min_agbno; /* set an agbno range for NEAR allocs */
> > > > xfs_agblock_t max_agbno; /* ... */
> > > > xfs_extlen_t len; /* output: actual size of extent */
> > > > diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
> > > > index 656c95a22f2e..d56c82c07505 100644
> > > > --- a/fs/xfs/libxfs/xfs_bmap.c
> > > > +++ b/fs/xfs/libxfs/xfs_bmap.c
> > > > @@ -3295,6 +3295,10 @@ xfs_bmap_select_minlen(
> > > > xfs_extlen_t blen)
> > >
> > > Hi Dave,
> > >
> > > > {
> > > > + /* Adjust best length for extent start alignment. */
> > > > + if (blen > args->alignment)
> > > > + blen -= args->alignment;
> > > > +
> > >
> > > This change seems to be causing or exposing some issue, in that I find that
> > > I am being allocated an extent which is aligned to but not a multiple of
> > > args->alignment.
> >
> > Entirely possible the logic isn't correct ;)
>
> Out of curiosity, how do you guys normally test all this sort of logic?
With difficulty.
Exercising all the weird corner cases is really hard because the
combinatory explosion that occurs when you have 20 control
parameters, up to 5 different failure fallback strategies,
behavioural variations with delayed allocation, ENOSPC and AGFL
refilling accounting variations, etc, means it's basically
impossible to enumerate and iterate the behaviour space fully.
And then we have filesystem geometry and application concurrency
to consider, too.
All of the behaviours up to this point in time are best effort - we
don't guarantee allocation policy is followed when there is not
enough free space to execute the preferred policy - we slowly fall
back to mechanisms that are further from the policy but more likely
to succeed. i.e. as we approach ENOSPC, the allocation policies get
"looser" - they are less restrictive and more variable and don't
give as good results as when there is plenty of free space for the
allocation policy to make good decisions from.
As such, I only check that macro-level behaviour when there is lots
of free space is largely correct. e.g. by doing something like
copying a kernel tree onto a new filesystem, then checking inode
locality follows directories, block locality follows inodes, large
files are stripe aligned, extent size hint based inodes appear to
have the correct extent sizes, etc.
I then rely on the ENOSPC tests in fstests to find regressions that
might occur when the filesystem is stressed with little free space
available. These are a whole lot better than they used to be; root
cause analysis of ENOSPC corner case bugs has consumed months of my
working life over the past 20 years....
> I found this issue with the small program which I wrote to generate traffic.
> I could not find anything similar.
That's because it's largely impossible to write a test that is
deterministic and works on all possible test configurations. Even
changing the size of the filesystem even slightly can result in
vastly different but still 100% correct allocation
behaviour....
> > > Firstly, in this same scenario, in xfs_alloc_space_available() we calculate
> > > alloc_len = args->minlen + (args->alignment - 1) + args->alignslop = 76 + (4
> > > - 1) + 0 = 79, and then args->maxlen = 79.
> >
> > That seems OK, we're doing aligned allocation and this is an ENOSPC
> > corner case so the aligned allocation should get rounded down in
> > xfs_alloc_fix_len() or rejected.
> >
> > One thought I just had is that the args->maxlen adjustment shouldn't
> > be to "available space" - it should probably be set to args->minlen
> > because that's the aligned 'alloc_len' we checked available space
> > against. That would fix this, because then we'd have args->minlen =
> > args->maxlen = 76.
> >
> > However, that only addresses this specific case, not the general
> > case of xfs_alloc_fix_len() failing to tail align the allocated
> > extent.
> >
> > > Then xfs_alloc_fix_len() allows
> > > this as args->len == args->maxlen (=79), even though args->prod, mod = 4, 0.
> >
> > Yeah, that smells wrong.
>
> Would it be worth adding a debug assert for prod and mod being honoured from
> the allocator? xfs_alloc_fix_len() does have an assert later on and it does
> not help here.
I don't see any value in that because it's not actually a "fatal"
issue. See above about trading off policy strictness for allocation
success.
Again, this force alignment stuff is a fundamental change in this
behaviour - it wants "hard failure" rather than "trade off" and so
there isn't a general case for asserting that allocation must be
mod/prod aligned. Extent size hints are a -hint-, not a requirement,
and I don't want random assert failures in test systems because
debug kernels start treating hints as "must not fail" requirements.
> > I'd suggest that we've never noticed this until now because we
> > have never guaranteed extent alignment. Hence the occasional
> > short/unaligned extent being allocated in dark ENOSPC corners was
> > never an issue for anyone.
> >
> > However, introducing a new alignment guarantee turns these sorts of
> > latent non-issues into bugs that need to be fixed. i.e. This is
> > exactly the sort of rare corner case behaviour I expected to be
> > flushed out by guaranteeing and then extensively testing allocation
> > alignments.
> >
> > If you drop the rlen == args->maxlen check from
> > xfs_alloc_space_available(),
>
> I assume that you mean xfs_alloc_fix_len()
Yes.
> > the problem should go away and the
> > extent gets trimmed to 76 blocks.
>
> ..if so, then, yes, it does. We end up with this:
>
> 0: [0..14079]: 42432..56511 0 (42432..56511) 14080
> 1: [14080..14687]: 177344..177951 0 (177344..177951) 608
> 2: [14688..14719]: 350720..350751 1 (171520..171551) 32
Good, that's how it should work. :)
I'll update the patchset I have with these fixes.
-Dave.
--
Dave Chinner
david@fromorbit.com
^ permalink raw reply [flat|nested] 53+ messages in thread* Re: [PATCH 1/3] xfs: simplify extent allocation alignment
2024-04-02 5:58 ` Dave Chinner
@ 2024-04-02 7:49 ` John Garry
2024-04-02 15:11 ` John Garry
2024-04-02 23:44 ` Dave Chinner
0 siblings, 2 replies; 53+ messages in thread
From: John Garry @ 2024-04-02 7:49 UTC (permalink / raw)
To: Dave Chinner; +Cc: linux-xfs, ojaswin, ritesh.list
>>> the problem should go away and the
>>> extent gets trimmed to 76 blocks.
>> ..if so, then, yes, it does. We end up with this:
>>
>> 0: [0..14079]: 42432..56511 0 (42432..56511) 14080
>> 1: [14080..14687]: 177344..177951 0 (177344..177951) 608
>> 2: [14688..14719]: 350720..350751 1 (171520..171551) 32
> Good, that's how it should work. 🙂
>
> I'll update the patchset I have with these fixes.
ok, thanks
Update:
So I have some more patches from trying to support both truncate and
fallocate + punch/insert/collapse for forcealign.
I seem to have at least 2x problems:
- unexpected -ENOSPC in some case
- sometimes misaligned extents from ordered combo of punch, truncate, write
I don't know if it is a good use of time for me to try to debug, as I
guess you could spot problems in the changes almost immediately.
Next steps:
I would like to send out a new series for XFS support for atomic writes
soon, which so far included forcealign support.
Please advise on your preference for what I should do, like wait for
your forcealign update and then combine into the XFS support for atomic
write series. Or just post the doubtful patches now, as above, and go
from there?
Thanks,
John
^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [PATCH 1/3] xfs: simplify extent allocation alignment
2024-04-02 7:49 ` John Garry
@ 2024-04-02 15:11 ` John Garry
2024-04-02 21:26 ` Dave Chinner
2024-04-02 23:44 ` Dave Chinner
1 sibling, 1 reply; 53+ messages in thread
From: John Garry @ 2024-04-02 15:11 UTC (permalink / raw)
To: Dave Chinner; +Cc: linux-xfs, ojaswin, ritesh.list
On 02/04/2024 08:49, John Garry wrote:
> Update:
> So I have some more patches from trying to support both truncate and
> fallocate + punch/insert/collapse for forcealign.
>
> I seem to have at least 2x problems:
> - unexpected -ENOSPC in some case
This -ENOSPC seems related to xfs_bmap_select_minlen() again.
I find that it occurs when calling xfs_bmap_select_minlen() and blen ==
maxlen again, like:
blen=64 args->alignment=16, minlen=0, maxlen=64
And then this gives:
args->minlen=48 blen=64
But xfs_alloc_vextent_start_ag() -> xfs_alloc_vextent_iterate_ags() does
not seem to find something suitable.
I'm continuing to look...
^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [PATCH 1/3] xfs: simplify extent allocation alignment
2024-04-02 15:11 ` John Garry
@ 2024-04-02 21:26 ` Dave Chinner
2024-04-03 8:49 ` John Garry
0 siblings, 1 reply; 53+ messages in thread
From: Dave Chinner @ 2024-04-02 21:26 UTC (permalink / raw)
To: John Garry; +Cc: linux-xfs, ojaswin, ritesh.list
On Tue, Apr 02, 2024 at 04:11:20PM +0100, John Garry wrote:
> On 02/04/2024 08:49, John Garry wrote:
> > Update:
> > So I have some more patches from trying to support both truncate and
> > fallocate + punch/insert/collapse for forcealign.
> >
> > I seem to have at least 2x problems:
> > - unexpected -ENOSPC in some case
>
> This -ENOSPC seems related to xfs_bmap_select_minlen() again.
>
> I find that it occurs when calling xfs_bmap_select_minlen() and blen ==
> maxlen again, like:
> blen=64 args->alignment=16, minlen=0, maxlen=64
>
> And then this gives:
> args->minlen=48 blen=64
Which is perfectly reasonable - it fits the bounds specified just
fine, and we'll get a 64 block allocation if that free space is
exactly aligned. Otherwise we'll get a 48 block allocation.
> But xfs_alloc_vextent_start_ag() -> xfs_alloc_vextent_iterate_ags() does not
> seem to find something suitable.
Entirely possible. The AGFL might have needed refilling, so there
really wasn't enough blocks remaining for an aligned allocation to
take place after doing that. That's a real ENOSPC condition, despite
the best length sampling indicating that the allocation could be
done.
Remember, bestlen sampling is racy - it does not hold the AGF
locked from the point of sampling to the point of allocation. Hence
when we finally go to do the allocation after setting it all up, we
might have raced with another allocation that took the free space
sampled during the bestlen pass and so then the allocation fails
despite the setup saying it should succeed.
Fundamentally, if the filesystem's best free space length is the
same size as the requested allocation, *failure is expected* and the
fallback attempts progressively remove restrictions (such as
alignment) to allow sub-optimal extents to be allocated down to
minlen in size. Forced alignment turns off these fallbacks, so you
are going to see hard ENOSPC errors the moment the filesystem runs
out of contiguous free space extents large enough to hold aligned
allocations.
This can happen a -long- way away from a real enospc condition -
depending on alignment constraints, you can start seeing this sort
of behaviour (IIRC my math correctly) at around 70% full. The larger
the alignment and the older the filesystem, the earlier this sort of
ENOSPC event will occur.
Use `xfs_spaceman -c 'freesp'` to dump the free space extent size
histogram. That will quickly tell you if there is no remaining free
extents large enough for an aligned 48 block allocation to succeed.
With an alignment of 16 blocks, this requires at least a 63 block
free space extent to succeed.
IOWs, we should expect ENOSPC to occur long before the filesystem
reports that it is out of space when we are doing forced alignment
allocation.
-Dave.
--
Dave Chinner
david@fromorbit.com
^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [PATCH 1/3] xfs: simplify extent allocation alignment
2024-04-02 21:26 ` Dave Chinner
@ 2024-04-03 8:49 ` John Garry
0 siblings, 0 replies; 53+ messages in thread
From: John Garry @ 2024-04-03 8:49 UTC (permalink / raw)
To: Dave Chinner; +Cc: linux-xfs, ojaswin, ritesh.list
On 02/04/2024 22:26, Dave Chinner wrote:
>> And then this gives:
>> args->minlen=48 blen=64
> Which is perfectly reasonable - it fits the bounds specified just
> fine, and we'll get a 64 block allocation if that free space is
> exactly aligned. Otherwise we'll get a 48 block allocation.
>
>> But xfs_alloc_vextent_start_ag() -> xfs_alloc_vextent_iterate_ags() does not
>> seem to find something suitable.
> Entirely possible. The AGFL might have needed refilling, so there
> really wasn't enough blocks remaining for an aligned allocation to
> take place after doing that. That's a real ENOSPC condition, despite
> the best length sampling indicating that the allocation could be
> done.
>
> Remember, bestlen sampling is racy - it does not hold the AGF
> locked from the point of sampling to the point of allocation.
ok, I assumed that some lock was held.
> Hence
> when we finally go to do the allocation after setting it all up, we
> might have raced with another allocation that took the free space
> sampled during the bestlen pass and so then the allocation fails
> despite the setup saying it should succeed.
My test is single threaded, so I did not think that would be an issue.
>
> Fundamentally, if the filesystem's best free space length is the
> same size as the requested allocation,*failure is expected* and the
> fallback attempts progressively remove restrictions (such as
> alignment) to allow sub-optimal extents to be allocated down to
> minlen in size. Forced alignment turns off these fallbacks, so you
> are going to see hard ENOSPC errors the moment the filesystem runs
> out of contiguous free space extents large enough to hold aligned
> allocations.
>
> This can happen a -long- way away from a real enospc condition -
> depending on alignment constraints, you can start seeing this sort
> of behaviour (IIRC my math correctly) at around 70% full. The larger
> the alignment and the older the filesystem, the earlier this sort of
> ENOSPC event will occur.
For this scenario, statvfs returns - as a sample - f_blocks=73216,
f_bfree=19950, f_bavail=19950
So ~27% free.
>
> Use `xfs_spaceman -c 'freesp'` to dump the free space extent size
> histogram. That will quickly tell you if there is no remaining free
> extents large enough for an aligned 48 block allocation to succeed.
> With an alignment of 16 blocks, this requires at least a 63 block
> free space extent to succeed.
# xfs_spaceman -c 'freesp' /root/mnt2/
from to extents blocks pct
4 7 4 25 0.10
16 31 90 1440 5.77
32 63 12 400 1.60
64 127 1 64 0.26
512 1023 1 640 2.56
16384 22400 1 22390 89.71
>
> IOWs, we should expect ENOSPC to occur long before the filesystem
> reports that it is out of space when we are doing forced alignment
> allocation.
For my test, once ENOSPC occurs and statvfs tells us more than 10% space
free, we exit as something seems wrong. As you say, deducing an error
for this condition may not be the proper thing to do.
I do also note that if I then manually attempt to write the same data to
that same empty file after the test exits, it succeeds. So something
racy. I also notice that the FSB range we scan in
xfs_alloc_ag_vextent_size() -> xfs_alloc_compute_aligned() ->
xfs_extent_busy_trim() returns busy=1 when ENOSPC occurs - I have not
checked that further.
Thanks,
John
^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [PATCH 1/3] xfs: simplify extent allocation alignment
2024-04-02 7:49 ` John Garry
2024-04-02 15:11 ` John Garry
@ 2024-04-02 23:44 ` Dave Chinner
2024-04-03 11:30 ` John Garry
1 sibling, 1 reply; 53+ messages in thread
From: Dave Chinner @ 2024-04-02 23:44 UTC (permalink / raw)
To: John Garry; +Cc: linux-xfs, ojaswin, ritesh.list
On Tue, Apr 02, 2024 at 08:49:09AM +0100, John Garry wrote:
>
> > > > the problem should go away and the
> > > > extent gets trimmed to 76 blocks.
> > > ..if so, then, yes, it does. We end up with this:
> > >
> > > 0: [0..14079]: 42432..56511 0 (42432..56511) 14080
> > > 1: [14080..14687]: 177344..177951 0 (177344..177951) 608
> > > 2: [14688..14719]: 350720..350751 1 (171520..171551) 32
> > Good, that's how it should work. 🙂
> >
> > I'll update the patchset I have with these fixes.
>
> ok, thanks
>
> Update:
> So I have some more patches from trying to support both truncate and
> fallocate + punch/insert/collapse for forcealign.
>
> I seem to have at least 2x problems:
> - unexpected -ENOSPC in some case
> - sometimes misaligned extents from ordered combo of punch, truncate, write
Punch and truncate do not enforce extent alignment and never have.
They will trim extents to whatever the new EOF block is supposed to
be. This is by design - they are intended to be able to remove
extents beyond EOF that may have been done due to extent size hints
and/or speculative delayed allocation to minimise wasted space.
Again, forced alignment is introducing new constraints, so anything
that is truncating EOF blocks (punch, truncate, eof block freeing
during inactivation or blockgc) is going to need to take forced
extent alignment constraints into account.
This is likely something that needs to be done in
xfs_itruncate_extents_flags() for the truncate/inactivation/blockgc
cases (i.e. correct calculation of first_unmap_block). Punching and
insert/collapse are a bit more complex - they'll need their
start/end offsets to be aligned, the chunk sizes they work in to be
aligned, and the rounding in xfs_flush_unmap_range() to be forced
alignment aware.
> I don't know if it is a good use of time for me to try to debug, as I guess
> you could spot problems in the changes almost immediately.
>
> Next steps:
> I would like to send out a new series for XFS support for atomic writes
> soon, which so far included forcealign support.
>
> Please advise on your preference for what I should do, like wait for your
> forcealign update and then combine into the XFS support for atomic write
> series. Or just post the doubtful patches now, as above, and go from there?
I just sent out the forced allocation alignment series for review.
Forced truncate/punch extent alignment will need to be implemented
and reviewed as a separate patch set...
-Dave.
--
Dave Chinner
david@fromorbit.com
^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [PATCH 1/3] xfs: simplify extent allocation alignment
2024-04-02 23:44 ` Dave Chinner
@ 2024-04-03 11:30 ` John Garry
0 siblings, 0 replies; 53+ messages in thread
From: John Garry @ 2024-04-03 11:30 UTC (permalink / raw)
To: Dave Chinner; +Cc: linux-xfs, ojaswin, ritesh.list
On 03/04/2024 00:44, Dave Chinner wrote:
>> I seem to have at least 2x problems:
>> - unexpected -ENOSPC in some case
>> - sometimes misaligned extents from ordered combo of punch, truncate, write
> Punch and truncate do not enforce extent alignment and never have.
> They will trim extents to whatever the new EOF block is supposed to
> be. This is by design - they are intended to be able to remove
> extents beyond EOF that may have been done due to extent size hints
> and/or speculative delayed allocation to minimise wasted space.
>
> Again, forced alignment is introducing new constraints, so anything
> that is truncating EOF blocks (punch, truncate, eof block freeing
> during inactivation or blockgc) is going to need to take forced
> extent alignment constraints into account.
Sure
>
> This is likely something that needs to be done in
> xfs_itruncate_extents_flags() for the truncate/inactivation/blockgc
> cases (i.e. correct calculation of first_unmap_block). Punching and
> insert/collapse are a bit more complex - they'll need their
> start/end offsets to be aligned, the chunk sizes they work in to be
> aligned, and the rounding in xfs_flush_unmap_range() to be forced
> alignment aware.
Ack
>
>> I don't know if it is a good use of time for me to try to debug, as I guess
>> you could spot problems in the changes almost immediately.
>>
>> Next steps:
>> I would like to send out a new series for XFS support for atomic writes
>> soon, which so far included forcealign support.
>>
>> Please advise on your preference for what I should do, like wait for your
>> forcealign update and then combine into the XFS support for atomic write
>> series. Or just post the doubtful patches now, as above, and go from there?
> I just sent out the forced allocation alignment series for review.
cheers, I'll give them a spin.
> Forced truncate/punch extent alignment will need to be implemented
> and reviewed as a separate patch set...
Below are my changes.
I think that the xfs_is_falloc_aligned() change is sound. As for the
other two, I'm really not sure.
There is also
https://lore.kernel.org/linux-xfs/ZeeaKrmVEkcXYjbK@dread.disaster.area/T/#m73430d56d96e60f2908bab9ce3e6a0d56d27929c,
which still is still a candidate change.
Please check them, below.
Thanks,
John
------>8-------
From ec86dd3add7153062a612cb1141f36544f34e0cd Mon Sep 17 00:00:00 2001
From: John Garry <john.g.garry@oracle.com>
Date: Wed, 13 Mar 2024 18:14:37 +0000
Subject: [PATCH 1/3] fs: xfs: Update xfs_is_falloc_aligned() mask forcealign
For when forcealign is enabled, we want the alignment mask to cover an
aligned extent, similar to rtvol.
Signed-off-by: John Garry <john.g.garry@oracle.com>
---
fs/xfs/xfs_file.c | 5 ++++-
1 file changed, 4 insertions(+), 1 deletion(-)
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 632653e00906..e81e01e6b22b 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -61,7 +61,10 @@ xfs_is_falloc_aligned(
}
mask = XFS_FSB_TO_B(mp, mp->m_sb.sb_rextsize) - 1;
} else {
- mask = mp->m_sb.sb_blocksize - 1;
+ if (xfs_inode_has_forcealign(ip) && ip->i_extsize > 1)
+ mask = (mp->m_sb.sb_blocksize * ip->i_extsize) - 1;
+ else
+ mask = mp->m_sb.sb_blocksize - 1;
}
return !((pos | len) & mask);
------8<--------
From c0866d2a5753f1c487872ef3add4e08c03c22d00 Mon Sep 17 00:00:00 2001
From: John Garry <john.g.garry@oracle.com>
Date: Fri, 22 Mar 2024 11:24:45 +0000
Subject: [PATCH 2/3] fs: xfs: Unmap blocks according to forcealign
For when forcealign is enabled, blocks in an inode need to be unmapped
according to extent alignment, like what is already done for rtvol.
Signed-off-by: John Garry <john.g.garry@oracle.com>
---
fs/xfs/libxfs/xfs_bmap.c | 41 ++++++++++++++++++++++++++++++++++------
fs/xfs/xfs_inode.h | 5 +++++
2 files changed, 40 insertions(+), 6 deletions(-)
diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
index 7a0ef0900097..5dd7a62625db 100644
--- a/fs/xfs/libxfs/xfs_bmap.c
+++ b/fs/xfs/libxfs/xfs_bmap.c
@@ -5322,6 +5322,15 @@ xfs_bmap_del_extent_real(
return 0;
}
+/* Return the offset of an block number within an extent for forcealign. */
+static xfs_extlen_t
+xfs_forcealign_extent_offset(
+ struct xfs_inode *ip,
+ xfs_fsblock_t bno)
+{
+ return bno & (ip->i_extsize - 1);
+}
+
/*
* Unmap (remove) blocks from a file.
* If nexts is nonzero then the number of extents to remove is limited to
@@ -5344,6 +5353,7 @@ __xfs_bunmapi(
struct xfs_bmbt_irec got; /* current extent record */
struct xfs_ifork *ifp; /* inode fork pointer */
int isrt; /* freeing in rt area */
+ int isforcealign; /* freeing for file inode with forcealign */
int logflags; /* transaction logging flags */
xfs_extlen_t mod; /* rt extent offset */
struct xfs_mount *mp = ip->i_mount;
@@ -5380,7 +5390,10 @@ __xfs_bunmapi(
return 0;
}
XFS_STATS_INC(mp, xs_blk_unmap);
- isrt = xfs_ifork_is_realtime(ip, whichfork);
+ isrt = (whichfork == XFS_DATA_FORK) && XFS_IS_REALTIME_INODE(ip);
+ isforcealign = (whichfork == XFS_DATA_FORK) &&
+ xfs_inode_has_forcealign(ip) &&
+ xfs_inode_has_extsize(ip) && ip->i_extsize > 1;
end = start + len;
if (!xfs_iext_lookup_extent_before(ip, ifp, &end, &icur, &got)) {
@@ -5442,11 +5455,17 @@ __xfs_bunmapi(
if (del.br_startoff + del.br_blockcount > end + 1)
del.br_blockcount = end + 1 - del.br_startoff;
- if (!isrt || (flags & XFS_BMAPI_REMAP))
+ if ((!isrt && !isforcealign) || (flags & XFS_BMAPI_REMAP))
goto delete;
- mod = xfs_rtb_to_rtxoff(mp,
- del.br_startblock + del.br_blockcount);
+ if (isrt)
+ mod = xfs_rtb_to_rtxoff(mp,
+ del.br_startblock + del.br_blockcount);
+ else if (isforcealign)
+ mod = xfs_forcealign_extent_offset(ip,
+ del.br_startblock + del.br_blockcount);
if (mod) {
/*
* Realtime extent not lined up at the end.
@@ -5494,9 +5513,19 @@ __xfs_bunmapi(
goto nodelete;
}
- mod = xfs_rtb_to_rtxoff(mp, del.br_startblock);
+ if (isrt)
+ mod = xfs_rtb_to_rtxoff(mp, del.br_startblock);
+ else if (isforcealign)
+ mod = xfs_forcealign_extent_offset(ip,
+ del.br_startblock);
+
if (mod) {
- xfs_extlen_t off = mp->m_sb.sb_rextsize - mod;
+ xfs_extlen_t off;
+ if (isrt)
+ off = mp->m_sb.sb_rextsize - mod;
+ else if (isforcealign) {
+ off = ip->i_extsize - mod;
+ }
/*
* Realtime extent is lined up at the end but not
diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
index 065028789473..3f13943ab3a3 100644
--- a/fs/xfs/xfs_inode.h
+++ b/fs/xfs/xfs_inode.h
@@ -316,6 +316,11 @@ static inline bool xfs_inode_has_forcealign(struct
xfs_inode *ip)
return ip->i_diflags2 & XFS_DIFLAG2_FORCEALIGN;
}
+static inline bool xfs_inode_has_extsize(struct xfs_inode *ip)
+{
+ return ip->i_diflags & XFS_DIFLAG_EXTSIZE;
+}
+
/*
* Return the buftarg used for data allocations on a given inode.
------>8-------
From 8cb2b61fa419b961d22609e12f2d941ce976d0f0 Mon Sep 17 00:00:00 2001
From: John Garry <john.g.garry@oracle.com>
Date: Wed, 20 Mar 2024 13:05:39 +0000
Subject: [PATCH 3/3] fs: xfs: Only free full extents for forcealign
Like we already do for rtvol, only free full extents for forcealign in
xfs_free_file_space().
Signed-off-by: John Garry <john.g.garry@oracle.com>
---
fs/xfs/xfs_bmap_util.c | 7 +++++--
1 file changed, 5 insertions(+), 2 deletions(-)
diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
index 178a4865d1ed..e3e6e27a33bf 100644
--- a/fs/xfs/xfs_bmap_util.c
+++ b/fs/xfs/xfs_bmap_util.c
@@ -848,8 +848,11 @@ xfs_free_file_space(
startoffset_fsb = XFS_B_TO_FSB(mp, offset);
endoffset_fsb = XFS_B_TO_FSBT(mp, offset + len);
- /* We can only free complete realtime extents. */
- if (XFS_IS_REALTIME_INODE(ip) && mp->m_sb.sb_rextsize > 1) {
+ /* Free only complete extents. */
+ if (xfs_inode_has_forcealign(ip) && ip->i_extsize > 1) {
+ startoffset_fsb = roundup_64(startoffset_fsb, ip->i_extsize);
+ endoffset_fsb = rounddown_64(endoffset_fsb, ip->i_extsize);
+ } else if (XFS_IS_REALTIME_INODE(ip) && mp->m_sb.sb_rextsize > 1) {
startoffset_fsb = xfs_rtb_roundup_rtx(mp, startoffset_fsb);
endoffset_fsb = xfs_rtb_rounddown_rtx(mp, endoffset_fsb);
}
------8<--------
EOM
^ permalink raw reply related [flat|nested] 53+ messages in thread
* [PATCH 2/3] xfs: make EOF allocation simpler
2024-03-06 5:20 ` [RFC PATCH 0/3] xfs: forced extent alignment Dave Chinner
2024-03-06 5:20 ` [PATCH 1/3] xfs: simplify extent allocation alignment Dave Chinner
@ 2024-03-06 5:20 ` Dave Chinner
2024-03-06 5:20 ` [PATCH 3/3] xfs: introduce forced allocation alignment Dave Chinner
` (2 subsequent siblings)
4 siblings, 0 replies; 53+ messages in thread
From: Dave Chinner @ 2024-03-06 5:20 UTC (permalink / raw)
To: linux-xfs; +Cc: john.g.garry, ojaswin, ritesh.list
From: Dave Chinner <dchinner@redhat.com>
Currently the allocation at EOF is broken into two cases - when the
offset is zero and when the offset is non-zero. When the offset is
non-zero, we try to do exact block allocation for contiguous
extent allocation. When the offset is zero, the allocation is simply
an aligned allocation.
We want aligned allocation as the fallback when exact block
allocation fails, but that complicates the EOF allocation in that it
now has to handle two different allocation cases. The
caller also has to handle allocation when not at EOF, and for the
upcoming forced alignment changes we need that to also be aligned
allocation.
To simplify all this, pull the aligned allocation cases back into
the callers and leave the EOF allocation path for exact block
allocation only. This means that the EOF exact block allocation
fallback path is the normal aligned allocation path and that ends up
making things a lot simpler when forced alignment is introduced.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
fs/xfs/libxfs/xfs_bmap.c | 131 +++++++++++++++----------------------
fs/xfs/libxfs/xfs_ialloc.c | 8 +--
fs/xfs/xfs_trace.h | 8 +--
3 files changed, 62 insertions(+), 85 deletions(-)
diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
index d56c82c07505..c2ddf1875e52 100644
--- a/fs/xfs/libxfs/xfs_bmap.c
+++ b/fs/xfs/libxfs/xfs_bmap.c
@@ -3320,12 +3320,12 @@ xfs_bmap_select_minlen(
static int
xfs_bmap_btalloc_select_lengths(
struct xfs_bmalloca *ap,
- struct xfs_alloc_arg *args,
- xfs_extlen_t *blen)
+ struct xfs_alloc_arg *args)
{
struct xfs_mount *mp = args->mp;
struct xfs_perag *pag;
xfs_agnumber_t agno, startag;
+ xfs_extlen_t blen = 0;
int error = 0;
if (ap->tp->t_flags & XFS_TRANS_LOWMODE) {
@@ -3339,19 +3339,18 @@ xfs_bmap_btalloc_select_lengths(
if (startag == NULLAGNUMBER)
startag = 0;
- *blen = 0;
for_each_perag_wrap(mp, startag, agno, pag) {
- error = xfs_bmap_longest_free_extent(pag, args->tp, blen);
+ error = xfs_bmap_longest_free_extent(pag, args->tp, &blen);
if (error && error != -EAGAIN)
break;
error = 0;
- if (*blen >= args->maxlen)
+ if (blen >= args->maxlen)
break;
}
if (pag)
xfs_perag_rele(pag);
- args->minlen = xfs_bmap_select_minlen(ap, args, *blen);
+ args->minlen = xfs_bmap_select_minlen(ap, args, blen);
return error;
}
@@ -3561,78 +3560,40 @@ xfs_bmap_exact_minlen_extent_alloc(
* If we are not low on available data blocks and we are allocating at
* EOF, optimise allocation for contiguous file extension and/or stripe
* alignment of the new extent.
- *
- * NOTE: ap->aeof is only set if the allocation length is >= the
- * stripe unit and the allocation offset is at the end of file.
*/
static int
xfs_bmap_btalloc_at_eof(
struct xfs_bmalloca *ap,
- struct xfs_alloc_arg *args,
- xfs_extlen_t blen,
- bool ag_only)
+ struct xfs_alloc_arg *args)
{
struct xfs_mount *mp = args->mp;
struct xfs_perag *caller_pag = args->pag;
+ xfs_extlen_t alignment = args->alignment;
int error;
+ ASSERT(ap->aeof && ap->offset);
+ ASSERT(args->alignment >= 1);
+
/*
- * If there are already extents in the file, try an exact EOF block
- * allocation to extend the file as a contiguous extent. If that fails,
- * or it's the first allocation in a file, just try for a stripe aligned
- * allocation.
+ * Compute the alignment slop for the fallback path so we ensure
+ * we account for the potential alignemnt space required by the
+ * fallback paths before we modify the AGF and AGFL here.
*/
- if (ap->offset) {
- xfs_extlen_t alignment = args->alignment;
-
- /*
- * Compute the alignment slop for the fallback path so we ensure
- * we account for the potential alignemnt space required by the
- * fallback paths before we modify the AGF and AGFL here.
- */
- args->alignment = 1;
- args->alignslop = alignment - args->alignment;
-
- if (!caller_pag)
- args->pag = xfs_perag_get(mp, XFS_FSB_TO_AGNO(mp, ap->blkno));
- error = xfs_alloc_vextent_exact_bno(args, ap->blkno);
- if (!caller_pag) {
- xfs_perag_put(args->pag);
- args->pag = NULL;
- }
- if (error)
- return error;
-
- if (args->fsbno != NULLFSBLOCK)
- return 0;
- /*
- * Exact allocation failed. Reset to try an aligned allocation
- * according to the original allocation specification.
- */
- args->alignment = alignment;
- args->alignslop = 0;
- }
-
- if (ag_only) {
- error = xfs_alloc_vextent_near_bno(args, ap->blkno);
- } else {
+ args->alignment = 1;
+ args->alignslop = alignment - args->alignment;
+
+ if (!caller_pag)
+ args->pag = xfs_perag_get(mp, XFS_FSB_TO_AGNO(mp, ap->blkno));
+ error = xfs_alloc_vextent_exact_bno(args, ap->blkno);
+ if (!caller_pag) {
+ xfs_perag_put(args->pag);
args->pag = NULL;
- error = xfs_alloc_vextent_start_ag(args, ap->blkno);
- ASSERT(args->pag == NULL);
- args->pag = caller_pag;
}
- if (error)
- return error;
- if (args->fsbno != NULLFSBLOCK)
- return 0;
-
- /*
- * Aligned allocation failed, so all fallback paths from here drop the
- * start alignment requirement as we know it will not succeed.
- */
- args->alignment = 1;
- return 0;
+ /* Reset alignment to original specifications. */
+ args->alignment = alignment;
+ args->alignslop = 0;
+ return error;
}
/*
@@ -3698,12 +3659,19 @@ xfs_bmap_btalloc_filestreams(
}
args->minlen = xfs_bmap_select_minlen(ap, args, blen);
- if (ap->aeof)
- error = xfs_bmap_btalloc_at_eof(ap, args, blen, true);
+ if (ap->aeof && ap->offset)
+ error = xfs_bmap_btalloc_at_eof(ap, args);
+ /* This may be an aligned allocation attempt. */
if (!error && args->fsbno == NULLFSBLOCK)
error = xfs_alloc_vextent_near_bno(args, ap->blkno);
+ /* Attempt non-aligned allocation if we haven't already. */
+ if (!error && args->fsbno == NULLFSBLOCK && args->alignment > 1) {
+ args->alignment = 1;
+ error = xfs_alloc_vextent_near_bno(args, ap->blkno);
+ }
+
out_low_space:
/*
* We are now done with the perag reference for the filestreams
@@ -3725,7 +3693,6 @@ xfs_bmap_btalloc_best_length(
struct xfs_bmalloca *ap,
struct xfs_alloc_arg *args)
{
- xfs_extlen_t blen = 0;
int error;
ap->blkno = XFS_INO_TO_FSB(args->mp, ap->ip->i_ino);
@@ -3736,23 +3703,33 @@ xfs_bmap_btalloc_best_length(
* the request. If one isn't found, then adjust the minimum allocation
* size to the largest space found.
*/
- error = xfs_bmap_btalloc_select_lengths(ap, args, &blen);
+ error = xfs_bmap_btalloc_select_lengths(ap, args);
if (error)
return error;
/*
- * Don't attempt optimal EOF allocation if previous allocations barely
- * succeeded due to being near ENOSPC. It is highly unlikely we'll get
- * optimal or even aligned allocations in this case, so don't waste time
- * trying.
+ * If we are in low space mode, then optimal allocation will fail so
+ * prepare for minimal allocation and run the low space algorithm
+ * immediately.
*/
- if (ap->aeof && !(ap->tp->t_flags & XFS_TRANS_LOWMODE)) {
- error = xfs_bmap_btalloc_at_eof(ap, args, blen, false);
- if (error || args->fsbno != NULLFSBLOCK)
- return error;
+ if (ap->tp->t_flags & XFS_TRANS_LOWMODE) {
+ ASSERT(args->fsbno == NULLFSBLOCK);
+ return xfs_bmap_btalloc_low_space(ap, args);
+ }
+
+ if (ap->aeof && ap->offset)
+ error = xfs_bmap_btalloc_at_eof(ap, args);
+
+ /* This may be an aligned allocation attempt. */
+ if (!error && args->fsbno == NULLFSBLOCK)
+ error = xfs_alloc_vextent_start_ag(args, ap->blkno);
+
+ /* Attempt non-aligned allocation if we haven't already. */
+ if (!error && args->fsbno == NULLFSBLOCK && args->alignment > 1) {
+ args->alignment = 1;
+ error = xfs_alloc_vextent_start_ag(args, ap->blkno);
}
- error = xfs_alloc_vextent_start_ag(args, ap->blkno);
if (error || args->fsbno != NULLFSBLOCK)
return error;
diff --git a/fs/xfs/libxfs/xfs_ialloc.c b/fs/xfs/libxfs/xfs_ialloc.c
index fa27a50f96ac..fc19d93a9d69 100644
--- a/fs/xfs/libxfs/xfs_ialloc.c
+++ b/fs/xfs/libxfs/xfs_ialloc.c
@@ -758,12 +758,12 @@ xfs_ialloc_ag_alloc(
*
* For an exact allocation, alignment must be 1,
* however we need to take cluster alignment into account when
- * fixing up the freelist. Use the minalignslop field to
+ * fixing up the freelist. Use the alignslop field to
* indicate that extra blocks might be required for alignment,
* but not to use them in the actual exact allocation.
*/
args.alignment = 1;
- args.minalignslop = igeo->cluster_align - 1;
+ args.alignslop = igeo->cluster_align - 1;
/* Allow space for the inode btree to split. */
args.minleft = igeo->inobt_maxlevels;
@@ -780,10 +780,10 @@ xfs_ialloc_ag_alloc(
* the exact agbno requirement and increase the alignment
* instead. It is critical that the total size of the request
* (len + alignment + slop) does not increase from this point
- * on, so reset minalignslop to ensure it is not included in
+ * on, so reset alignslop to ensure it is not included in
* subsequent requests.
*/
- args.minalignslop = 0;
+ args.alignslop = 0;
}
if (unlikely(args.fsbno == NULLFSBLOCK)) {
diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
index 56b07d8ed431..0b4898b39e30 100644
--- a/fs/xfs/xfs_trace.h
+++ b/fs/xfs/xfs_trace.h
@@ -1800,7 +1800,7 @@ DECLARE_EVENT_CLASS(xfs_alloc_class,
__field(xfs_extlen_t, minleft)
__field(xfs_extlen_t, total)
__field(xfs_extlen_t, alignment)
- __field(xfs_extlen_t, minalignslop)
+ __field(xfs_extlen_t, alignslop)
__field(xfs_extlen_t, len)
__field(char, wasdel)
__field(char, wasfromfl)
@@ -1819,7 +1819,7 @@ DECLARE_EVENT_CLASS(xfs_alloc_class,
__entry->minleft = args->minleft;
__entry->total = args->total;
__entry->alignment = args->alignment;
- __entry->minalignslop = args->minalignslop;
+ __entry->alignslop = args->alignslop;
__entry->len = args->len;
__entry->wasdel = args->wasdel;
__entry->wasfromfl = args->wasfromfl;
@@ -1828,7 +1828,7 @@ DECLARE_EVENT_CLASS(xfs_alloc_class,
__entry->highest_agno = args->tp->t_highest_agno;
),
TP_printk("dev %d:%d agno 0x%x agbno 0x%x minlen %u maxlen %u mod %u "
- "prod %u minleft %u total %u alignment %u minalignslop %u "
+ "prod %u minleft %u total %u alignment %u alignslop %u "
"len %u wasdel %d wasfromfl %d resv %d "
"datatype 0x%x highest_agno 0x%x",
MAJOR(__entry->dev), MINOR(__entry->dev),
@@ -1841,7 +1841,7 @@ DECLARE_EVENT_CLASS(xfs_alloc_class,
__entry->minleft,
__entry->total,
__entry->alignment,
- __entry->minalignslop,
+ __entry->alignslop,
__entry->len,
__entry->wasdel,
__entry->wasfromfl,
--
2.43.0
^ permalink raw reply related [flat|nested] 53+ messages in thread* [PATCH 3/3] xfs: introduce forced allocation alignment
2024-03-06 5:20 ` [RFC PATCH 0/3] xfs: forced extent alignment Dave Chinner
2024-03-06 5:20 ` [PATCH 1/3] xfs: simplify extent allocation alignment Dave Chinner
2024-03-06 5:20 ` [PATCH 2/3] xfs: make EOF allocation simpler Dave Chinner
@ 2024-03-06 5:20 ` Dave Chinner
2024-03-06 11:46 ` [RFC PATCH 0/3] xfs: forced extent alignment John Garry
2024-03-13 18:32 ` John Garry
4 siblings, 0 replies; 53+ messages in thread
From: Dave Chinner @ 2024-03-06 5:20 UTC (permalink / raw)
To: linux-xfs; +Cc: john.g.garry, ojaswin, ritesh.list
From: Dave Chinner <dchinner@redhat.com>
When forced allocation alignment is specified, the extent will
be aligned to the extent size hint size rather than stripe
alignment. If aligned allocation cannot be done, then the allocation
is failed rather than attempting non-aligned fallbacks.
Note: none of the per-inode force align configuration is present
yet, so this just triggers off an "always false" wrapper function
for the moment.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
fs/xfs/libxfs/xfs_alloc.h | 1 +
fs/xfs/libxfs/xfs_bmap.c | 29 +++++++++++++++++++++++------
fs/xfs/xfs_inode.h | 5 +++++
3 files changed, 29 insertions(+), 6 deletions(-)
diff --git a/fs/xfs/libxfs/xfs_alloc.h b/fs/xfs/libxfs/xfs_alloc.h
index aa2c103d98f0..7de2e6f64882 100644
--- a/fs/xfs/libxfs/xfs_alloc.h
+++ b/fs/xfs/libxfs/xfs_alloc.h
@@ -66,6 +66,7 @@ typedef struct xfs_alloc_arg {
#define XFS_ALLOC_USERDATA (1 << 0)/* allocation is for user data*/
#define XFS_ALLOC_INITIAL_USER_DATA (1 << 1)/* special case start of file */
#define XFS_ALLOC_NOBUSY (1 << 2)/* Busy extents not allowed */
+#define XFS_ALLOC_FORCEALIGN (1 << 3)/* forced extent alignment */
/* freespace limit calculations */
unsigned int xfs_alloc_set_aside(struct xfs_mount *mp);
diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
index c2ddf1875e52..7a0ef0900097 100644
--- a/fs/xfs/libxfs/xfs_bmap.c
+++ b/fs/xfs/libxfs/xfs_bmap.c
@@ -3411,9 +3411,10 @@ xfs_bmap_alloc_account(
* Calculate the extent start alignment and the extent length adjustments that
* constrain this allocation.
*
- * Extent start alignment is currently determined by stripe configuration and is
- * carried in args->alignment, whilst extent length adjustment is determined by
- * extent size hints and is carried by args->prod and args->mod.
+ * Extent start alignment is currently determined by forced inode alignment or
+ * stripe configuration and is carried in args->alignment, whilst extent length
+ * adjustment is determined by extent size hints and is carried by args->prod
+ * and args->mod.
*
* Low level allocation code is free to either ignore or override these values
* as required.
@@ -3426,11 +3427,18 @@ xfs_bmap_compute_alignments(
struct xfs_mount *mp = args->mp;
xfs_extlen_t align = 0; /* minimum allocation alignment */
- /* stripe alignment for allocation is determined by mount parameters */
- if (mp->m_swidth && xfs_has_swalloc(mp))
+ /*
+ * Forced inode alignment takes preference over stripe alignment.
+ * Stripe alignment for allocation is determined by mount parameters.
+ */
+ if (xfs_inode_has_forcealign(ap->ip)) {
+ args->alignment = xfs_get_extsz_hint(ap->ip);
+ args->datatype |= XFS_ALLOC_FORCEALIGN;
+ } else if (mp->m_swidth && xfs_has_swalloc(mp)) {
args->alignment = mp->m_swidth;
- else if (mp->m_dalign)
+ } else if (mp->m_dalign) {
args->alignment = mp->m_dalign;
+ }
if (ap->flags & XFS_BMAPI_COWFORK)
align = xfs_get_cowextsz_hint(ap->ip);
@@ -3617,6 +3625,11 @@ xfs_bmap_btalloc_low_space(
{
int error;
+ if (args->alignment > 1 && (args->datatype & XFS_ALLOC_FORCEALIGN)) {
+ args->fsbno = NULLFSBLOCK;
+ return 0;
+ }
+
args->alignment = 1;
if (args->minlen > ap->minlen) {
args->minlen = ap->minlen;
@@ -3668,6 +3681,8 @@ xfs_bmap_btalloc_filestreams(
/* Attempt non-aligned allocation if we haven't already. */
if (!error && args->fsbno == NULLFSBLOCK && args->alignment > 1) {
+ if (args->datatype & XFS_ALLOC_FORCEALIGN)
+ return error;
args->alignment = 1;
error = xfs_alloc_vextent_near_bno(args, ap->blkno);
}
@@ -3726,6 +3741,8 @@ xfs_bmap_btalloc_best_length(
/* Attempt non-aligned allocation if we haven't already. */
if (!error && args->fsbno == NULLFSBLOCK && args->alignment > 1) {
+ if (args->datatype & XFS_ALLOC_FORCEALIGN)
+ return error;
args->alignment = 1;
error = xfs_alloc_vextent_start_ag(args, ap->blkno);
}
diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
index 0f9d32cbae72..94fa79ae1591 100644
--- a/fs/xfs/xfs_inode.h
+++ b/fs/xfs/xfs_inode.h
@@ -312,6 +312,11 @@ static inline bool xfs_inode_has_large_extent_counts(struct xfs_inode *ip)
return ip->i_diflags2 & XFS_DIFLAG2_NREXT64;
}
+static inline bool xfs_inode_has_forcealign(struct xfs_inode *ip)
+{
+ return false;
+}
+
/*
* Return the buftarg used for data allocations on a given inode.
*/
--
2.43.0
^ permalink raw reply related [flat|nested] 53+ messages in thread* Re: [RFC PATCH 0/3] xfs: forced extent alignment
2024-03-06 5:20 ` [RFC PATCH 0/3] xfs: forced extent alignment Dave Chinner
` (2 preceding siblings ...)
2024-03-06 5:20 ` [PATCH 3/3] xfs: introduce forced allocation alignment Dave Chinner
@ 2024-03-06 11:46 ` John Garry
2024-03-06 17:52 ` John Garry
2024-03-06 20:54 ` Dave Chinner
2024-03-13 18:32 ` John Garry
4 siblings, 2 replies; 53+ messages in thread
From: John Garry @ 2024-03-06 11:46 UTC (permalink / raw)
To: Dave Chinner, linux-xfs; +Cc: ojaswin, ritesh.list
On 06/03/2024 05:20, Dave Chinner wrote:
> Hi Garry,
>
> I figured that it was simpler just to write the forced extent
> alignment allocator patches that to make you struggle through them
> and require lots of round trips to understand all the weird corner
> cases.
I appreciate that.
>
> The following 3 patches:
>
> - rework the setup and extent allocation logic a bit to make force
> aligned allocation much easier to implement and understand
> - move all the alignment adjustments into the setup logic
> - rework the alignment slop calculations and greatly simplify the
> the exact EOF block allocation case
> - add a XFS_ALLOC_FORCEALIGN flag so that the inode config only
> needs to be checked once at setup. This also allows other
> allocation types (e.g. inode clusters) use forced alignment
> allocation semantics in future.
> - clearly document when we are turning off allocation alignment and
> abort FORCEALIGN allocation at that point rather than doing
> unaligned allocation.
>
> I've run this through fstests once so it doesn't let the smoke out,
> but I haven't actually tested it against a stripe aligned filesystem
> config yet, nor tested the forcealign functionality so it may not be
> exactly right yet.
>
> Is this sufficiently complete for you to take from here into the
> forcealign series?
>
I'll try it out.
What baseline are these against? Mine were against v6.8-rc5, but I guess
that you develop against an XFS integration tree. Maybe they apply and
build cleanly against v6.8-rc5 ...
Cheers,
John
^ permalink raw reply [flat|nested] 53+ messages in thread* Re: [RFC PATCH 0/3] xfs: forced extent alignment
2024-03-06 11:46 ` [RFC PATCH 0/3] xfs: forced extent alignment John Garry
@ 2024-03-06 17:52 ` John Garry
2024-03-06 20:54 ` Dave Chinner
1 sibling, 0 replies; 53+ messages in thread
From: John Garry @ 2024-03-06 17:52 UTC (permalink / raw)
To: Dave Chinner, linux-xfs; +Cc: ojaswin, ritesh.list
On 06/03/2024 11:46, John Garry wrote:
>> The following 3 patches:
>>
>> - rework the setup and extent allocation logic a bit to make force
>> aligned allocation much easier to implement and understand
>> - move all the alignment adjustments into the setup logic
>> - rework the alignment slop calculations and greatly simplify the
>> the exact EOF block allocation case
>> - add a XFS_ALLOC_FORCEALIGN flag so that the inode config only
>> needs to be checked once at setup. This also allows other
>> allocation types (e.g. inode clusters) use forced alignment
>> allocation semantics in future.
>> - clearly document when we are turning off allocation alignment and
>> abort FORCEALIGN allocation at that point rather than doing
>> unaligned allocation.
>>
>> I've run this through fstests once so it doesn't let the smoke out,
>> but I haven't actually tested it against a stripe aligned filesystem
>> config yet, nor tested the forcealign functionality so it may not be
>> exactly right yet.
>>
>> Is this sufficiently complete for you to take from here into the
>> forcealign series?
>>
>
> I'll try it out.
Update:
I replaced your patches with the relevant forcealign patches which I
send out in "[PATCH v2 00/14] block atomic writes for XFS".
So far they seem to work fine. An updated branch is here, if you want to
check what else I am using:
https://github.com/johnpgarry/linux/commits/atomic-writes-v6.8-v5-fs-v2-dchinner/
I haven't tried any stripe unit testing yet - I'll look at that now.
Thanks,
John
^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [RFC PATCH 0/3] xfs: forced extent alignment
2024-03-06 11:46 ` [RFC PATCH 0/3] xfs: forced extent alignment John Garry
2024-03-06 17:52 ` John Garry
@ 2024-03-06 20:54 ` Dave Chinner
1 sibling, 0 replies; 53+ messages in thread
From: Dave Chinner @ 2024-03-06 20:54 UTC (permalink / raw)
To: John Garry; +Cc: linux-xfs, ojaswin, ritesh.list
On Wed, Mar 06, 2024 at 11:46:38AM +0000, John Garry wrote:
> On 06/03/2024 05:20, Dave Chinner wrote:
> > Hi Garry,
> >
> > I figured that it was simpler just to write the forced extent
> > alignment allocator patches that to make you struggle through them
> > and require lots of round trips to understand all the weird corner
> > cases.
>
> I appreciate that.
>
> >
> > The following 3 patches:
> >
> > - rework the setup and extent allocation logic a bit to make force
> > aligned allocation much easier to implement and understand
> > - move all the alignment adjustments into the setup logic
> > - rework the alignment slop calculations and greatly simplify the
> > the exact EOF block allocation case
> > - add a XFS_ALLOC_FORCEALIGN flag so that the inode config only
> > needs to be checked once at setup. This also allows other
> > allocation types (e.g. inode clusters) use forced alignment
> > allocation semantics in future.
> > - clearly document when we are turning off allocation alignment and
> > abort FORCEALIGN allocation at that point rather than doing
> > unaligned allocation.
> >
> > I've run this through fstests once so it doesn't let the smoke out,
> > but I haven't actually tested it against a stripe aligned filesystem
> > config yet, nor tested the forcealign functionality so it may not be
> > exactly right yet.
> >
> > Is this sufficiently complete for you to take from here into the
> > forcealign series?
> >
>
> I'll try it out.
>
> What baseline are these against? Mine were against v6.8-rc5, but I guess
> that you develop against an XFS integration tree. Maybe they apply and build
> cleanly against v6.8-rc5 ...
6.8-rc7 + linux-xfs/for-next + some other local patches to other
parts of XFS that shouldn't overlap with this patch set. I don't
think there's anything in for-next overlapping this code, so it
might just apply cleanly to your tree....
-Dave.
--
Dave Chinner
david@fromorbit.com
^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [RFC PATCH 0/3] xfs: forced extent alignment
2024-03-06 5:20 ` [RFC PATCH 0/3] xfs: forced extent alignment Dave Chinner
` (3 preceding siblings ...)
2024-03-06 11:46 ` [RFC PATCH 0/3] xfs: forced extent alignment John Garry
@ 2024-03-13 18:32 ` John Garry
4 siblings, 0 replies; 53+ messages in thread
From: John Garry @ 2024-03-13 18:32 UTC (permalink / raw)
To: Dave Chinner, linux-xfs; +Cc: ojaswin, ritesh.list
On 06/03/2024 05:20, Dave Chinner wrote:
> Hi Garry,
>
> I figured that it was simpler just to write the forced extent
> alignment allocator patches that to make you struggle through them
> and require lots of round trips to understand all the weird corner
> cases.
>
> The following 3 patches:
>
> - rework the setup and extent allocation logic a bit to make force
> aligned allocation much easier to implement and understand
> - move all the alignment adjustments into the setup logic
> - rework the alignment slop calculations and greatly simplify the
> the exact EOF block allocation case
> - add a XFS_ALLOC_FORCEALIGN flag so that the inode config only
> needs to be checked once at setup. This also allows other
> allocation types (e.g. inode clusters) use forced alignment
> allocation semantics in future.
> - clearly document when we are turning off allocation alignment and
> abort FORCEALIGN allocation at that point rather than doing
> unaligned allocation.
>
> I've run this through fstests once so it doesn't let the smoke out,
> but I haven't actually tested it against a stripe aligned filesystem
> config yet, nor tested the forcealign functionality so it may not be
> exactly right yet.
>
JFYI, I started to test fallocate for FALLOCATE_COLLAPSE. I think that
we need something like this:
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -61,7 +61,11 @@ xfs_is_falloc_aligned(
}
mask = XFS_FSB_TO_B(mp, mp->m_sb.sb_rextsize) - 1;
} else {
- mask = mp->m_sb.sb_blocksize - 1;
+ if (xfs_inode_has_forcealign(ip))
+ mask = (mp->m_sb.sb_blocksize * ip->i_extsize) - 1;
+ else
+ mask = mp->m_sb.sb_blocksize - 1;
}
return !((pos | len) & mask);
I think that we also need to fix up __xfs_bunmapi() to do similar as
isrt (for forcealign).
Thanks,
John
^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [PATCH v2 04/14] fs: xfs: Make file data allocations observe the 'forcealign' flag
2024-03-05 22:18 ` Dave Chinner
2024-03-06 5:20 ` [RFC PATCH 0/3] xfs: forced extent alignment Dave Chinner
@ 2024-03-06 9:41 ` John Garry
1 sibling, 0 replies; 53+ messages in thread
From: John Garry @ 2024-03-06 9:41 UTC (permalink / raw)
To: Dave Chinner
Cc: djwong, hch, viro, brauner, jack, chandan.babu, axboe,
martin.petersen, linux-kernel, linux-xfs, linux-fsdevel, tytso,
jbongio, ojaswin, ritesh.list, linux-block
>> Please note in case missed, I am mandating extsize hint for forcealign needs
>> to be a power-of-2. It just makes life easier for all the sub-extent
>> zero'ing later on.
>
> That's fine - that will need to be documented in the xfsctl man
> page...
>
>> Also we need to enforce that the AG count to be compliant with the extsize
> ^^^^^ size?
Yes
>
>> hint for forcealign; but since the extsize hint for forcealign needs to be
>> compliant with stripe unit, above, and stripe unit would be compliant wth AG
>> count (right?), then this would be a given.
>
> We already align AG size to stripe unit when a stripe unit is set,
> and ensure that we don't place all the AG headers on the same stripe
> unit.
>
> However, if there is no stripe unit we don't align the AG to
> anything.
> So, yes, AG sizing by mkfs will need to ensure that all
> AGs are correctly aligned to the underlying storage (integer
> multiple of the max atomic write size, right?)...
right, this is really important
>
>>> More below....
>>>
>>>> + } else {
>>>> + args->alignment = 1;
>>>> + }
>>>
>>> Just initialise the allocation args structure with a value of 1 like
>>> we already do?
>>
>> It was being done in this way to have just a single place where the value is
>> initialised. It can easily be kept as is.
>
> I'd prefer it as is, because then the value is always initialised
> correctly and we only override in the special cases....
ok
>>
>> are you saying that low-space allocator can set args->alignment = 1 to be
>> explicit?
>
> Yes.
ok
Thanks,
John
^ permalink raw reply [flat|nested] 53+ messages in thread
* [PATCH v2 05/14] fs: xfs: Enable file data forcealign feature
2024-03-04 13:04 [PATCH v2 00/14] block atomic writes for XFS John Garry
` (3 preceding siblings ...)
2024-03-04 13:04 ` [PATCH v2 04/14] fs: xfs: Make file data allocations observe the 'forcealign' flag John Garry
@ 2024-03-04 13:04 ` John Garry
2024-03-04 13:04 ` [PATCH v2 06/14] fs: xfs: Do not free EOF blocks for forcealign John Garry
` (8 subsequent siblings)
13 siblings, 0 replies; 53+ messages in thread
From: John Garry @ 2024-03-04 13:04 UTC (permalink / raw)
To: djwong, hch, viro, brauner, jack, chandan.babu, david, axboe
Cc: martin.petersen, linux-kernel, linux-xfs, linux-fsdevel, tytso,
jbongio, ojaswin, ritesh.list, linux-block, John Garry
From: "Darrick J. Wong" <djwong@kernel.org>
Enable this feature.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: John Garry <john.g.garry@oracle.com>
---
fs/xfs/libxfs/xfs_format.h | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/fs/xfs/libxfs/xfs_format.h b/fs/xfs/libxfs/xfs_format.h
index db2113cf6e47..2d9f5430efc3 100644
--- a/fs/xfs/libxfs/xfs_format.h
+++ b/fs/xfs/libxfs/xfs_format.h
@@ -358,7 +358,8 @@ xfs_sb_has_compat_feature(
(XFS_SB_FEAT_RO_COMPAT_FINOBT | \
XFS_SB_FEAT_RO_COMPAT_RMAPBT | \
XFS_SB_FEAT_RO_COMPAT_REFLINK| \
- XFS_SB_FEAT_RO_COMPAT_INOBTCNT)
+ XFS_SB_FEAT_RO_COMPAT_INOBTCNT | \
+ XFS_SB_FEAT_RO_COMPAT_FORCEALIGN)
#define XFS_SB_FEAT_RO_COMPAT_UNKNOWN ~XFS_SB_FEAT_RO_COMPAT_ALL
static inline bool
xfs_sb_has_ro_compat_feature(
--
2.31.1
^ permalink raw reply related [flat|nested] 53+ messages in thread* [PATCH v2 06/14] fs: xfs: Do not free EOF blocks for forcealign
2024-03-04 13:04 [PATCH v2 00/14] block atomic writes for XFS John Garry
` (4 preceding siblings ...)
2024-03-04 13:04 ` [PATCH v2 05/14] fs: xfs: Enable file data forcealign feature John Garry
@ 2024-03-04 13:04 ` John Garry
2024-03-06 21:07 ` Dave Chinner
2024-03-04 13:04 ` [PATCH v2 07/14] fs: iomap: Sub-extent zeroing John Garry
` (7 subsequent siblings)
13 siblings, 1 reply; 53+ messages in thread
From: John Garry @ 2024-03-04 13:04 UTC (permalink / raw)
To: djwong, hch, viro, brauner, jack, chandan.babu, david, axboe
Cc: martin.petersen, linux-kernel, linux-xfs, linux-fsdevel, tytso,
jbongio, ojaswin, ritesh.list, linux-block, John Garry
For when forcealign is enabled, we want the EOF to be aligned as well, so
do not free EOF blocks.
Add helper function xfs_get_extsz() to get the guaranteed extent size
alignment for forcealign enabled. Since this code is only relevant to
forcealign and forcealign is not possible for RT yet, ignore RT.
Signed-off-by: John Garry <john.g.garry@oracle.com>
---
fs/xfs/xfs_bmap_util.c | 7 ++++++-
fs/xfs/xfs_inode.c | 14 ++++++++++++++
fs/xfs/xfs_inode.h | 1 +
3 files changed, 21 insertions(+), 1 deletion(-)
diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
index c2531c28905c..07bfb03c671a 100644
--- a/fs/xfs/xfs_bmap_util.c
+++ b/fs/xfs/xfs_bmap_util.c
@@ -542,8 +542,13 @@ xfs_can_free_eofblocks(
* forever.
*/
end_fsb = XFS_B_TO_FSB(mp, (xfs_ufsize_t)XFS_ISIZE(ip));
- if (XFS_IS_REALTIME_INODE(ip) && mp->m_sb.sb_rextsize > 1)
+
+ /* Do not free blocks when forcing extent sizes */
+ if (xfs_get_extsz(ip) > 1)
+ end_fsb = roundup_64(end_fsb, xfs_get_extsz(ip));
+ else if (XFS_IS_REALTIME_INODE(ip) && mp->m_sb.sb_rextsize > 1)
end_fsb = xfs_rtb_roundup_rtx(mp, end_fsb);
+
last_fsb = XFS_B_TO_FSB(mp, mp->m_super->s_maxbytes);
if (last_fsb <= end_fsb)
return false;
diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index 2c439df8c47f..bbb8886f1d32 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -65,6 +65,20 @@ xfs_get_extsz_hint(
return 0;
}
+/*
+ * Helper function to extract extent size. It will return a power-of-2,
+ * as forcealign requires this.
+ */
+xfs_extlen_t
+xfs_get_extsz(
+ struct xfs_inode *ip)
+{
+ if (xfs_inode_forcealign(ip) && ip->i_extsize)
+ return ip->i_extsize;
+
+ return 1;
+}
+
/*
* Helper function to extract CoW extent size hint from inode.
* Between the extent size hint and the CoW extent size hint, we
diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
index 82e2838f6d64..b6c42c27943e 100644
--- a/fs/xfs/xfs_inode.h
+++ b/fs/xfs/xfs_inode.h
@@ -547,6 +547,7 @@ void xfs_lock_two_inodes(struct xfs_inode *ip0, uint ip0_mode,
struct xfs_inode *ip1, uint ip1_mode);
xfs_extlen_t xfs_get_extsz_hint(struct xfs_inode *ip);
+xfs_extlen_t xfs_get_extsz(struct xfs_inode *ip);
xfs_extlen_t xfs_get_cowextsz_hint(struct xfs_inode *ip);
int xfs_init_new_inode(struct mnt_idmap *idmap, struct xfs_trans *tp,
--
2.31.1
^ permalink raw reply related [flat|nested] 53+ messages in thread* Re: [PATCH v2 06/14] fs: xfs: Do not free EOF blocks for forcealign
2024-03-04 13:04 ` [PATCH v2 06/14] fs: xfs: Do not free EOF blocks for forcealign John Garry
@ 2024-03-06 21:07 ` Dave Chinner
2024-03-07 11:38 ` John Garry
0 siblings, 1 reply; 53+ messages in thread
From: Dave Chinner @ 2024-03-06 21:07 UTC (permalink / raw)
To: John Garry
Cc: djwong, hch, viro, brauner, jack, chandan.babu, axboe,
martin.petersen, linux-kernel, linux-xfs, linux-fsdevel, tytso,
jbongio, ojaswin, ritesh.list, linux-block
On Mon, Mar 04, 2024 at 01:04:20PM +0000, John Garry wrote:
> For when forcealign is enabled, we want the EOF to be aligned as well, so
> do not free EOF blocks.
>
> Add helper function xfs_get_extsz() to get the guaranteed extent size
> alignment for forcealign enabled. Since this code is only relevant to
> forcealign and forcealign is not possible for RT yet, ignore RT.
>
> Signed-off-by: John Garry <john.g.garry@oracle.com>
> ---
> fs/xfs/xfs_bmap_util.c | 7 ++++++-
> fs/xfs/xfs_inode.c | 14 ++++++++++++++
> fs/xfs/xfs_inode.h | 1 +
> 3 files changed, 21 insertions(+), 1 deletion(-)
>
> diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
> index c2531c28905c..07bfb03c671a 100644
> --- a/fs/xfs/xfs_bmap_util.c
> +++ b/fs/xfs/xfs_bmap_util.c
> @@ -542,8 +542,13 @@ xfs_can_free_eofblocks(
> * forever.
> */
> end_fsb = XFS_B_TO_FSB(mp, (xfs_ufsize_t)XFS_ISIZE(ip));
> - if (XFS_IS_REALTIME_INODE(ip) && mp->m_sb.sb_rextsize > 1)
> +
> + /* Do not free blocks when forcing extent sizes */
That comment seems wrong - this just rounds up where we start
trimming from to be aligned...
> + if (xfs_get_extsz(ip) > 1)
> + end_fsb = roundup_64(end_fsb, xfs_get_extsz(ip));
> + else if (XFS_IS_REALTIME_INODE(ip) && mp->m_sb.sb_rextsize > 1)
> end_fsb = xfs_rtb_roundup_rtx(mp, end_fsb);
I think this would be better written as:
/*
* Forced extent alignment requires us to round up where we
* start trimming from so that result is correctly aligned.
*/
if (xfs_inode_forcealign(ip)) {
if (ip->i_extsize > 1)
end_fsb = roundup_64(end_fsb, ip->i_extsize);
else if (XFS_IS_REALTIME_INODE(ip) && mp->m_sb.sb_rextsize > 1)
end_fsb = xfs_rtb_roundup_rtx(mp, end_fsb);
}
Because we only want end-fsb alignment when forced alignment is
required.
Which then makes me wonder: truncate needs this, too, doesn't it?
And the various fallocate() ops like hole punching and extent
shifting?
> diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
> index 2c439df8c47f..bbb8886f1d32 100644
> --- a/fs/xfs/xfs_inode.c
> +++ b/fs/xfs/xfs_inode.c
> @@ -65,6 +65,20 @@ xfs_get_extsz_hint(
> return 0;
> }
>
> +/*
> + * Helper function to extract extent size. It will return a power-of-2,
> + * as forcealign requires this.
> + */
> +xfs_extlen_t
> +xfs_get_extsz(
> + struct xfs_inode *ip)
> +{
> + if (xfs_inode_forcealign(ip) && ip->i_extsize)
> + return ip->i_extsize;
> +
> + return 1;
> +}
This can go away - if it is needed elsewhere, then I think it would
be better open coded because it better documents what the code is
doing...
-Dave.
--
Dave Chinner
david@fromorbit.com
^ permalink raw reply [flat|nested] 53+ messages in thread* Re: [PATCH v2 06/14] fs: xfs: Do not free EOF blocks for forcealign
2024-03-06 21:07 ` Dave Chinner
@ 2024-03-07 11:38 ` John Garry
0 siblings, 0 replies; 53+ messages in thread
From: John Garry @ 2024-03-07 11:38 UTC (permalink / raw)
To: Dave Chinner
Cc: djwong, hch, viro, brauner, jack, chandan.babu, axboe,
martin.petersen, linux-kernel, linux-xfs, linux-fsdevel, tytso,
jbongio, ojaswin, ritesh.list, linux-block
>> */
>> end_fsb = XFS_B_TO_FSB(mp, (xfs_ufsize_t)XFS_ISIZE(ip));
>> - if (XFS_IS_REALTIME_INODE(ip) && mp->m_sb.sb_rextsize > 1)
>> +
>> + /* Do not free blocks when forcing extent sizes */
>
> That comment seems wrong - this just rounds up where we start
> trimming from to be aligned...
ok
>
>> + if (xfs_get_extsz(ip) > 1)
>> + end_fsb = roundup_64(end_fsb, xfs_get_extsz(ip));
>> + else if (XFS_IS_REALTIME_INODE(ip) && mp->m_sb.sb_rextsize > 1)
>> end_fsb = xfs_rtb_roundup_rtx(mp, end_fsb);
>
> I think this would be better written as:
>
> /*
> * Forced extent alignment requires us to round up where we
> * start trimming from so that result is correctly aligned.
> */
> if (xfs_inode_forcealign(ip)) {
> if (ip->i_extsize > 1)
> end_fsb = roundup_64(end_fsb, ip->i_extsize);
> else if (XFS_IS_REALTIME_INODE(ip) && mp->m_sb.sb_rextsize > 1)
> end_fsb = xfs_rtb_roundup_rtx(mp, end_fsb);
> }
>
> Because we only want end-fsb alignment when forced alignment is
> required.
But why change current rtvol behaviour?
>
> Which then makes me wonder: truncate needs this, too, doesn't it?
> And the various fallocate() ops like hole punching and extent
> shifting?
>
Yes, I would think so. I quickly checked rtvol for truncate and it does
the round up. I would need to check the relevant code for truncate and
fallocate for forcealign now.
I do also wonder if we could factor out this rounding up code for
truncate, facallocate, and whatever else.
>> diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
>> index 2c439df8c47f..bbb8886f1d32 100644
>> --- a/fs/xfs/xfs_inode.c
>> +++ b/fs/xfs/xfs_inode.c
>> @@ -65,6 +65,20 @@ xfs_get_extsz_hint(
>> return 0;
>> }
>>
>> +/*
>> + * Helper function to extract extent size. It will return a power-of-2,
>> + * as forcealign requires this.
>> + */
>> +xfs_extlen_t
>> +xfs_get_extsz(
>> + struct xfs_inode *ip)
>> +{
>> + if (xfs_inode_forcealign(ip) && ip->i_extsize)
>> + return ip->i_extsize;
>> +
>> + return 1;
>> +}
>
> This can go away - if it is needed elsewhere, then I think it would
> be better open coded because it better documents what the code is
> doing...
>
I would rather get rid of xfs_get_extsz() for sure.
Thanks,
John
^ permalink raw reply [flat|nested] 53+ messages in thread
* [PATCH v2 07/14] fs: iomap: Sub-extent zeroing
2024-03-04 13:04 [PATCH v2 00/14] block atomic writes for XFS John Garry
` (5 preceding siblings ...)
2024-03-04 13:04 ` [PATCH v2 06/14] fs: xfs: Do not free EOF blocks for forcealign John Garry
@ 2024-03-04 13:04 ` John Garry
2024-03-06 21:14 ` Dave Chinner
2024-03-04 13:04 ` [PATCH v2 08/14] fs: xfs: " John Garry
` (6 subsequent siblings)
13 siblings, 1 reply; 53+ messages in thread
From: John Garry @ 2024-03-04 13:04 UTC (permalink / raw)
To: djwong, hch, viro, brauner, jack, chandan.babu, david, axboe
Cc: martin.petersen, linux-kernel, linux-xfs, linux-fsdevel, tytso,
jbongio, ojaswin, ritesh.list, linux-block, John Garry
For FS_XFLAG_FORCEALIGN support, we want to treat any sub-extent IO like
sub-fsblock DIO, in that we will zero the sub-extent when the mapping is
unwritten.
This will be important for atomic writes support, in that atomically
writing over a partially written extent would mean that we would need to
do the unwritten extent conversion write separately, and the write could
no longer be atomic.
It is the task of the FS to set iomap.extent_shift per iter to indicate
sub-extent zeroing required.
Maybe a macro like i_blocksize() should be introduced for extent sizes,
instead of using extent_shift. It would also eliminate excessive use
of xfs_get_extss() for XFS in future.
Signed-off-by: John Garry <john.g.garry@oracle.com>
---
fs/iomap/direct-io.c | 14 ++++++++------
include/linux/iomap.h | 1 +
2 files changed, 9 insertions(+), 6 deletions(-)
diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
index bcd3f8cf5ea4..733f83f839b6 100644
--- a/fs/iomap/direct-io.c
+++ b/fs/iomap/direct-io.c
@@ -277,7 +277,7 @@ static loff_t iomap_dio_bio_iter(const struct iomap_iter *iter,
{
const struct iomap *iomap = &iter->iomap;
struct inode *inode = iter->inode;
- unsigned int fs_block_size = i_blocksize(inode), pad;
+ unsigned int zeroing_size, pad;
loff_t length = iomap_length(iter);
loff_t pos = iter->pos;
blk_opf_t bio_opf;
@@ -288,6 +288,8 @@ static loff_t iomap_dio_bio_iter(const struct iomap_iter *iter,
size_t copied = 0;
size_t orig_count;
+ zeroing_size = i_blocksize(inode) << iomap->extent_shift;
+
if ((pos | length) & (bdev_logical_block_size(iomap->bdev) - 1) ||
!bdev_iter_is_aligned(iomap->bdev, dio->submit.iter))
return -EINVAL;
@@ -354,8 +356,8 @@ static loff_t iomap_dio_bio_iter(const struct iomap_iter *iter,
dio->iocb->ki_flags &= ~IOCB_HIPRI;
if (need_zeroout) {
- /* zero out from the start of the block to the write offset */
- pad = pos & (fs_block_size - 1);
+ /* zero out from the start of the region to the write offset */
+ pad = pos & (zeroing_size - 1);
if (pad)
iomap_dio_zero(iter, dio, pos - pad, pad);
}
@@ -427,10 +429,10 @@ static loff_t iomap_dio_bio_iter(const struct iomap_iter *iter,
zero_tail:
if (need_zeroout ||
((dio->flags & IOMAP_DIO_WRITE) && pos >= i_size_read(inode))) {
- /* zero out from the end of the write to the end of the block */
- pad = pos & (fs_block_size - 1);
+ /* zero out from the end of the write to the end of the region */
+ pad = pos & (zeroing_size - 1);
if (pad)
- iomap_dio_zero(iter, dio, pos, fs_block_size - pad);
+ iomap_dio_zero(iter, dio, pos, zeroing_size - pad);
}
out:
/* Undo iter limitation to current extent */
diff --git a/include/linux/iomap.h b/include/linux/iomap.h
index 96dd0acbba44..89cd3dcbb8ec 100644
--- a/include/linux/iomap.h
+++ b/include/linux/iomap.h
@@ -97,6 +97,7 @@ struct iomap {
u64 length; /* length of mapping, bytes */
u16 type; /* type of mapping */
u16 flags; /* flags for mapping */
+ unsigned int extent_shift;
struct block_device *bdev; /* block device for I/O */
struct dax_device *dax_dev; /* dax_dev for dax operations */
void *inline_data;
--
2.31.1
^ permalink raw reply related [flat|nested] 53+ messages in thread* Re: [PATCH v2 07/14] fs: iomap: Sub-extent zeroing
2024-03-04 13:04 ` [PATCH v2 07/14] fs: iomap: Sub-extent zeroing John Garry
@ 2024-03-06 21:14 ` Dave Chinner
2024-03-07 11:51 ` John Garry
0 siblings, 1 reply; 53+ messages in thread
From: Dave Chinner @ 2024-03-06 21:14 UTC (permalink / raw)
To: John Garry
Cc: djwong, hch, viro, brauner, jack, chandan.babu, axboe,
martin.petersen, linux-kernel, linux-xfs, linux-fsdevel, tytso,
jbongio, ojaswin, ritesh.list, linux-block
On Mon, Mar 04, 2024 at 01:04:21PM +0000, John Garry wrote:
> For FS_XFLAG_FORCEALIGN support, we want to treat any sub-extent IO like
> sub-fsblock DIO, in that we will zero the sub-extent when the mapping is
> unwritten.
>
> This will be important for atomic writes support, in that atomically
> writing over a partially written extent would mean that we would need to
> do the unwritten extent conversion write separately, and the write could
> no longer be atomic.
>
> It is the task of the FS to set iomap.extent_shift per iter to indicate
> sub-extent zeroing required.
>
> Maybe a macro like i_blocksize() should be introduced for extent sizes,
> instead of using extent_shift. It would also eliminate excessive use
> of xfs_get_extss() for XFS in future.
>
> Signed-off-by: John Garry <john.g.garry@oracle.com>
> ---
> fs/iomap/direct-io.c | 14 ++++++++------
> include/linux/iomap.h | 1 +
> 2 files changed, 9 insertions(+), 6 deletions(-)
>
> diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
> index bcd3f8cf5ea4..733f83f839b6 100644
> --- a/fs/iomap/direct-io.c
> +++ b/fs/iomap/direct-io.c
> @@ -277,7 +277,7 @@ static loff_t iomap_dio_bio_iter(const struct iomap_iter *iter,
> {
> const struct iomap *iomap = &iter->iomap;
> struct inode *inode = iter->inode;
> - unsigned int fs_block_size = i_blocksize(inode), pad;
> + unsigned int zeroing_size, pad;
> loff_t length = iomap_length(iter);
> loff_t pos = iter->pos;
> blk_opf_t bio_opf;
> @@ -288,6 +288,8 @@ static loff_t iomap_dio_bio_iter(const struct iomap_iter *iter,
> size_t copied = 0;
> size_t orig_count;
>
> + zeroing_size = i_blocksize(inode) << iomap->extent_shift;
The iomap interfaces use units of bytes for offsets, sizes, ranges,
etc. Using shifts to define a granularity value seems like a
throwback to decades old XFS code and just a bit weird nowdays. Can
we just pass this as a byte count? i.e.:
zeroing_size = i_blocksize(inode);
if (iomap->extent_size)
zeroing_size = iomap->extent_size;
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
^ permalink raw reply [flat|nested] 53+ messages in thread* Re: [PATCH v2 07/14] fs: iomap: Sub-extent zeroing
2024-03-06 21:14 ` Dave Chinner
@ 2024-03-07 11:51 ` John Garry
0 siblings, 0 replies; 53+ messages in thread
From: John Garry @ 2024-03-07 11:51 UTC (permalink / raw)
To: Dave Chinner
Cc: djwong, hch, viro, brauner, jack, chandan.babu, axboe,
martin.petersen, linux-kernel, linux-xfs, linux-fsdevel, tytso,
jbongio, ojaswin, ritesh.list, linux-block
On 06/03/2024 21:14, Dave Chinner wrote:
>> diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
>> index bcd3f8cf5ea4..733f83f839b6 100644
>> --- a/fs/iomap/direct-io.c
>> +++ b/fs/iomap/direct-io.c
>> @@ -277,7 +277,7 @@ static loff_t iomap_dio_bio_iter(const struct iomap_iter *iter,
>> {
>> const struct iomap *iomap = &iter->iomap;
>> struct inode *inode = iter->inode;
>> - unsigned int fs_block_size = i_blocksize(inode), pad;
>> + unsigned int zeroing_size, pad;
>> loff_t length = iomap_length(iter);
>> loff_t pos = iter->pos;
>> blk_opf_t bio_opf;
>> @@ -288,6 +288,8 @@ static loff_t iomap_dio_bio_iter(const struct iomap_iter *iter,
>> size_t copied = 0;
>> size_t orig_count;
>>
>> + zeroing_size = i_blocksize(inode) << iomap->extent_shift;
> The iomap interfaces use units of bytes for offsets, sizes, ranges,
> etc. Using shifts to define a granularity value seems like a
> throwback to decades old XFS code and just a bit weird nowdays. Can
> we just pass this as a byte count? i.e.:
>
> zeroing_size = i_blocksize(inode);
> if (iomap->extent_size)
> zeroing_size = iomap->extent_size;
Fine
I was also thinking of something like i_extentsize() for vfs, which
would save requiring adding iomap->extent_size, but decided to limit vfs
changes.
Thanks,
John
^ permalink raw reply [flat|nested] 53+ messages in thread
* [PATCH v2 08/14] fs: xfs: iomap: Sub-extent zeroing
2024-03-04 13:04 [PATCH v2 00/14] block atomic writes for XFS John Garry
` (6 preceding siblings ...)
2024-03-04 13:04 ` [PATCH v2 07/14] fs: iomap: Sub-extent zeroing John Garry
@ 2024-03-04 13:04 ` John Garry
2024-03-06 22:00 ` Dave Chinner
2024-03-04 13:04 ` [PATCH v2 09/14] fs: Add FS_XFLAG_ATOMICWRITES flag John Garry
` (5 subsequent siblings)
13 siblings, 1 reply; 53+ messages in thread
From: John Garry @ 2024-03-04 13:04 UTC (permalink / raw)
To: djwong, hch, viro, brauner, jack, chandan.babu, david, axboe
Cc: martin.petersen, linux-kernel, linux-xfs, linux-fsdevel, tytso,
jbongio, ojaswin, ritesh.list, linux-block, John Garry
Set iomap->extent_shift when sub-extent zeroing is required.
We treat a sub-extent write same as an unaligned write, so we can leverage
the existing sub-FSblock unaligned write support, i.e. try a shared lock
with IOMAP_DIO_OVERWRITE_ONLY flag, if this fails then try the exclusive
lock.
In xfs_iomap_write_unwritten(), FSB calcs are now based on the extsize.
Signed-off-by: John Garry <john.g.garry@oracle.com>
---
fs/xfs/xfs_file.c | 28 +++++++++++++++-------------
fs/xfs/xfs_iomap.c | 15 +++++++++++++--
2 files changed, 28 insertions(+), 15 deletions(-)
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index e33e5e13b95f..d0bd9d5f596c 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -617,18 +617,19 @@ xfs_file_dio_write_aligned(
* Handle block unaligned direct I/O writes
*
* In most cases direct I/O writes will be done holding IOLOCK_SHARED, allowing
- * them to be done in parallel with reads and other direct I/O writes. However,
- * if the I/O is not aligned to filesystem blocks, the direct I/O layer may need
- * to do sub-block zeroing and that requires serialisation against other direct
- * I/O to the same block. In this case we need to serialise the submission of
- * the unaligned I/O so that we don't get racing block zeroing in the dio layer.
- * In the case where sub-block zeroing is not required, we can do concurrent
- * sub-block dios to the same block successfully.
+ * them to be done in parallel with reads and other direct I/O writes.
+ * However if the I/O is not aligned to filesystem blocks/extent, the direct
+ * I/O layer may need to do sub-block/extent zeroing and that requires
+ * serialisation against other direct I/O to the same block/extent. In this
+ * case we need to serialise the submission of the unaligned I/O so that we
+ * don't get racing block/extent zeroing in the dio layer.
+ * In the case where sub-block/extent zeroing is not required, we can do
+ * concurrent sub-block/extent dios to the same block/extent successfully.
*
* Optimistically submit the I/O using the shared lock first, but use the
* IOMAP_DIO_OVERWRITE_ONLY flag to tell the lower layers to return -EAGAIN
- * if block allocation or partial block zeroing would be required. In that case
- * we try again with the exclusive lock.
+ * if block/extent allocation or partial block/extent zeroing would be
+ * required. In that case we try again with the exclusive lock.
*/
static noinline ssize_t
xfs_file_dio_write_unaligned(
@@ -643,9 +644,9 @@ xfs_file_dio_write_unaligned(
ssize_t ret;
/*
- * Extending writes need exclusivity because of the sub-block zeroing
- * that the DIO code always does for partial tail blocks beyond EOF, so
- * don't even bother trying the fast path in this case.
+ * Extending writes need exclusivity because of the sub-block/extent
+ * zeroing that the DIO code always does for partial tail blocks
+ * beyond EOF, so don't even bother trying the fast path in this case.
*/
if (iocb->ki_pos > isize || iocb->ki_pos + count >= isize) {
if (iocb->ki_flags & IOCB_NOWAIT)
@@ -709,13 +710,14 @@ xfs_file_dio_write(
struct iov_iter *from)
{
struct xfs_inode *ip = XFS_I(file_inode(iocb->ki_filp));
+ struct xfs_mount *mp = ip->i_mount;
struct xfs_buftarg *target = xfs_inode_buftarg(ip);
size_t count = iov_iter_count(from);
/* direct I/O must be aligned to device logical sector size */
if ((iocb->ki_pos | count) & target->bt_logical_sectormask)
return -EINVAL;
- if ((iocb->ki_pos | count) & ip->i_mount->m_blockmask)
+ if ((iocb->ki_pos | count) & (XFS_FSB_TO_B(mp, xfs_get_extsz(ip)) - 1))
return xfs_file_dio_write_unaligned(ip, iocb, from);
return xfs_file_dio_write_aligned(ip, iocb, from);
}
diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
index 70fe873951f3..88cc20bb19c9 100644
--- a/fs/xfs/xfs_iomap.c
+++ b/fs/xfs/xfs_iomap.c
@@ -98,6 +98,7 @@ xfs_bmbt_to_iomap(
{
struct xfs_mount *mp = ip->i_mount;
struct xfs_buftarg *target = xfs_inode_buftarg(ip);
+ xfs_extlen_t extsz = xfs_get_extsz(ip);
if (unlikely(!xfs_valid_startblock(ip, imap->br_startblock)))
return xfs_alert_fsblock_zero(ip, imap);
@@ -134,6 +135,8 @@ xfs_bmbt_to_iomap(
iomap->validity_cookie = sequence_cookie;
iomap->folio_ops = &xfs_iomap_folio_ops;
+ if (extsz > 1)
+ iomap->extent_shift = ffs(extsz) - 1;
return 0;
}
@@ -563,11 +566,19 @@ xfs_iomap_write_unwritten(
xfs_fsize_t i_size;
uint resblks;
int error;
+ xfs_extlen_t extsz = xfs_get_extsz(ip);
trace_xfs_unwritten_convert(ip, offset, count);
- offset_fsb = XFS_B_TO_FSBT(mp, offset);
- count_fsb = XFS_B_TO_FSB(mp, (xfs_ufsize_t)offset + count);
+ if (extsz > 1) {
+ xfs_extlen_t extsize_bytes = XFS_FSB_TO_B(mp, extsz);
+
+ offset_fsb = XFS_B_TO_FSBT(mp, round_down(offset, extsize_bytes));
+ count_fsb = XFS_B_TO_FSB(mp, round_up(offset + count, extsize_bytes));
+ } else {
+ offset_fsb = XFS_B_TO_FSBT(mp, offset);
+ count_fsb = XFS_B_TO_FSB(mp, (xfs_ufsize_t)offset + count);
+ }
count_fsb = (xfs_filblks_t)(count_fsb - offset_fsb);
/*
--
2.31.1
^ permalink raw reply related [flat|nested] 53+ messages in thread* Re: [PATCH v2 08/14] fs: xfs: iomap: Sub-extent zeroing
2024-03-04 13:04 ` [PATCH v2 08/14] fs: xfs: " John Garry
@ 2024-03-06 22:00 ` Dave Chinner
2024-03-07 12:57 ` John Garry
0 siblings, 1 reply; 53+ messages in thread
From: Dave Chinner @ 2024-03-06 22:00 UTC (permalink / raw)
To: John Garry
Cc: djwong, hch, viro, brauner, jack, chandan.babu, axboe,
martin.petersen, linux-kernel, linux-xfs, linux-fsdevel, tytso,
jbongio, ojaswin, ritesh.list, linux-block
On Mon, Mar 04, 2024 at 01:04:22PM +0000, John Garry wrote:
> Set iomap->extent_shift when sub-extent zeroing is required.
>
> We treat a sub-extent write same as an unaligned write, so we can leverage
> the existing sub-FSblock unaligned write support, i.e. try a shared lock
> with IOMAP_DIO_OVERWRITE_ONLY flag, if this fails then try the exclusive
> lock.
>
> In xfs_iomap_write_unwritten(), FSB calcs are now based on the extsize.
>
> Signed-off-by: John Garry <john.g.garry@oracle.com>
> ---
> fs/xfs/xfs_file.c | 28 +++++++++++++++-------------
> fs/xfs/xfs_iomap.c | 15 +++++++++++++--
> 2 files changed, 28 insertions(+), 15 deletions(-)
>
> diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> index e33e5e13b95f..d0bd9d5f596c 100644
> --- a/fs/xfs/xfs_file.c
> +++ b/fs/xfs/xfs_file.c
> @@ -617,18 +617,19 @@ xfs_file_dio_write_aligned(
> * Handle block unaligned direct I/O writes
> *
> * In most cases direct I/O writes will be done holding IOLOCK_SHARED, allowing
> - * them to be done in parallel with reads and other direct I/O writes. However,
> - * if the I/O is not aligned to filesystem blocks, the direct I/O layer may need
> - * to do sub-block zeroing and that requires serialisation against other direct
> - * I/O to the same block. In this case we need to serialise the submission of
> - * the unaligned I/O so that we don't get racing block zeroing in the dio layer.
> - * In the case where sub-block zeroing is not required, we can do concurrent
> - * sub-block dios to the same block successfully.
> + * them to be done in parallel with reads and other direct I/O writes.
> + * However if the I/O is not aligned to filesystem blocks/extent, the direct
> + * I/O layer may need to do sub-block/extent zeroing and that requires
> + * serialisation against other direct I/O to the same block/extent. In this
> + * case we need to serialise the submission of the unaligned I/O so that we
> + * don't get racing block/extent zeroing in the dio layer.
> + * In the case where sub-block/extent zeroing is not required, we can do
> + * concurrent sub-block/extent dios to the same block/extent successfully.
> *
> * Optimistically submit the I/O using the shared lock first, but use the
> * IOMAP_DIO_OVERWRITE_ONLY flag to tell the lower layers to return -EAGAIN
> - * if block allocation or partial block zeroing would be required. In that case
> - * we try again with the exclusive lock.
> + * if block/extent allocation or partial block/extent zeroing would be
> + * required. In that case we try again with the exclusive lock.
> */
> static noinline ssize_t
> xfs_file_dio_write_unaligned(
> @@ -643,9 +644,9 @@ xfs_file_dio_write_unaligned(
> ssize_t ret;
>
> /*
> - * Extending writes need exclusivity because of the sub-block zeroing
> - * that the DIO code always does for partial tail blocks beyond EOF, so
> - * don't even bother trying the fast path in this case.
> + * Extending writes need exclusivity because of the sub-block/extent
> + * zeroing that the DIO code always does for partial tail blocks
> + * beyond EOF, so don't even bother trying the fast path in this case.
> */
> if (iocb->ki_pos > isize || iocb->ki_pos + count >= isize) {
> if (iocb->ki_flags & IOCB_NOWAIT)
> @@ -709,13 +710,14 @@ xfs_file_dio_write(
> struct iov_iter *from)
> {
> struct xfs_inode *ip = XFS_I(file_inode(iocb->ki_filp));
> + struct xfs_mount *mp = ip->i_mount;
> struct xfs_buftarg *target = xfs_inode_buftarg(ip);
> size_t count = iov_iter_count(from);
>
> /* direct I/O must be aligned to device logical sector size */
> if ((iocb->ki_pos | count) & target->bt_logical_sectormask)
> return -EINVAL;
> - if ((iocb->ki_pos | count) & ip->i_mount->m_blockmask)
> + if ((iocb->ki_pos | count) & (XFS_FSB_TO_B(mp, xfs_get_extsz(ip)) - 1))
> return xfs_file_dio_write_unaligned(ip, iocb, from);
> return xfs_file_dio_write_aligned(ip, iocb, from);
> }
> diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
> index 70fe873951f3..88cc20bb19c9 100644
> --- a/fs/xfs/xfs_iomap.c
> +++ b/fs/xfs/xfs_iomap.c
> @@ -98,6 +98,7 @@ xfs_bmbt_to_iomap(
> {
> struct xfs_mount *mp = ip->i_mount;
> struct xfs_buftarg *target = xfs_inode_buftarg(ip);
> + xfs_extlen_t extsz = xfs_get_extsz(ip);
>
> if (unlikely(!xfs_valid_startblock(ip, imap->br_startblock)))
> return xfs_alert_fsblock_zero(ip, imap);
> @@ -134,6 +135,8 @@ xfs_bmbt_to_iomap(
>
> iomap->validity_cookie = sequence_cookie;
> iomap->folio_ops = &xfs_iomap_folio_ops;
> + if (extsz > 1)
> + iomap->extent_shift = ffs(extsz) - 1;
iomap->extent_size = mp->m_bsize;
if (xfs_inode_has_force_align(ip))
iomap->extent_size *= ip->i_extsize;
> @@ -563,11 +566,19 @@ xfs_iomap_write_unwritten(
> xfs_fsize_t i_size;
> uint resblks;
> int error;
> + xfs_extlen_t extsz = xfs_get_extsz(ip);
>
> trace_xfs_unwritten_convert(ip, offset, count);
>
> - offset_fsb = XFS_B_TO_FSBT(mp, offset);
> - count_fsb = XFS_B_TO_FSB(mp, (xfs_ufsize_t)offset + count);
> + if (extsz > 1) {
> + xfs_extlen_t extsize_bytes = XFS_FSB_TO_B(mp, extsz);
> +
> + offset_fsb = XFS_B_TO_FSBT(mp, round_down(offset, extsize_bytes));
> + count_fsb = XFS_B_TO_FSB(mp, round_up(offset + count, extsize_bytes));
> + } else {
> + offset_fsb = XFS_B_TO_FSBT(mp, offset);
> + count_fsb = XFS_B_TO_FSB(mp, (xfs_ufsize_t)offset + count);
> + }
I don't think this is correct. We should only be converting the
extent when the entire range has had data written to it. If we are
doing unaligned writes, we end up running 3 separate unwritten
conversion transactions - the leading zeroing, the data written and
the trailing zeroing.
This will end up converting the entire range to written when the
leading zeroing completes, exposing stale data until the data and
trailing zeroing completes.
Concurrent reads (both DIO and buffered) can see this stale data
while the write is in progress, leading to a mechanism where a user
can issue sub-atomic write range IO and concurrent overlapping reads
to read arbitrary stale data from the disk just before it is
overwritten.
I suspect the only way to fix this for sub-force-aligned DIo writes
if for iomap_dio_bio_iter() to chain the zeroing and data bios so
the entire range gets a single completion run on it instead of three
separate sub-aligned extent IO completions. We only need to do this
in the zeroing case - this is already the DIo slow path, so
additional submission overhead is not an issue. It would, however,
reduce completion overhead and latency, as we only need to run a
single extent conversion instead of 3, so chaining the bios on
aligned writes may well be a net win...
Thoughts? Christoph probably needs to weigh in on this one...
-Dave.
--
Dave Chinner
david@fromorbit.com
^ permalink raw reply [flat|nested] 53+ messages in thread* Re: [PATCH v2 08/14] fs: xfs: iomap: Sub-extent zeroing
2024-03-06 22:00 ` Dave Chinner
@ 2024-03-07 12:57 ` John Garry
0 siblings, 0 replies; 53+ messages in thread
From: John Garry @ 2024-03-07 12:57 UTC (permalink / raw)
To: Dave Chinner
Cc: djwong, hch, viro, brauner, jack, chandan.babu, axboe,
martin.petersen, linux-kernel, linux-xfs, linux-fsdevel, tytso,
jbongio, ojaswin, ritesh.list, linux-block
>>
>> if (unlikely(!xfs_valid_startblock(ip, imap->br_startblock)))
>> return xfs_alert_fsblock_zero(ip, imap);
>> @@ -134,6 +135,8 @@ xfs_bmbt_to_iomap(
>>
>> iomap->validity_cookie = sequence_cookie;
>> iomap->folio_ops = &xfs_iomap_folio_ops;
>> + if (extsz > 1)
>> + iomap->extent_shift = ffs(extsz) - 1;
>
> iomap->extent_size = mp->m_bsize;
> if (xfs_inode_has_force_align(ip))
> iomap->extent_size *= ip->i_extsize;
ok, fine
>
>> @@ -563,11 +566,19 @@ xfs_iomap_write_unwritten(
>> xfs_fsize_t i_size;
>> uint resblks;
>> int error;
>> + xfs_extlen_t extsz = xfs_get_extsz(ip);
>>
>> trace_xfs_unwritten_convert(ip, offset, count);
>>
>> - offset_fsb = XFS_B_TO_FSBT(mp, offset);
>> - count_fsb = XFS_B_TO_FSB(mp, (xfs_ufsize_t)offset + count);
>> + if (extsz > 1) {
>> + xfs_extlen_t extsize_bytes = XFS_FSB_TO_B(mp, extsz);
>> +
>> + offset_fsb = XFS_B_TO_FSBT(mp, round_down(offset, extsize_bytes));
>> + count_fsb = XFS_B_TO_FSB(mp, round_up(offset + count, extsize_bytes));
>> + } else {
>> + offset_fsb = XFS_B_TO_FSBT(mp, offset);
>> + count_fsb = XFS_B_TO_FSB(mp, (xfs_ufsize_t)offset + count);
>> + }
>
> I don't think this is correct. We should only be converting the
> extent when the entire range has had data written to it. If we are
> doing unaligned writes, we end up running 3 separate unwritten
> conversion transactions - the leading zeroing, the data written and
> the trailing zeroing.
Then I missed that in the code.
For sub-FS block conversion, I thought that this was doing the complete
FS blocks conversion, including for the head and tail zeros. And now for
sub-extent writes, we would be similarly doing the full extent
conversion, including head and tail zeros.
>
> This will end up converting the entire range to written when the
> leading zeroing completes, exposing stale data until the data and
> trailing zeroing completes.
That would not be good.
>
> Concurrent reads (both DIO and buffered) can see this stale data
> while the write is in progress, leading to a mechanism where a user
> can issue sub-atomic write range IO and concurrent overlapping reads
> to read arbitrary stale data from the disk just before it is
> overwritten.
>
> I suspect the only way to fix this for sub-force-aligned DIo writes
> if for iomap_dio_bio_iter() to chain the zeroing and data bios so
> the entire range gets a single completion run on it instead of three
> separate sub-aligned extent IO completions. We only need to do this
> in the zeroing case - this is already the DIo slow path, so
> additional submission overhead is not an issue. It would, however,
> reduce completion overhead and latency, as we only need to run a
> single extent conversion instead of 3, so chaining the bios on
> aligned writes may well be a net win...
>
ok, I'll check that idea.
> Thoughts? Christoph probably needs to weigh in on this one...
>
ok
Cheers,
John
^ permalink raw reply [flat|nested] 53+ messages in thread
* [PATCH v2 09/14] fs: Add FS_XFLAG_ATOMICWRITES flag
2024-03-04 13:04 [PATCH v2 00/14] block atomic writes for XFS John Garry
` (7 preceding siblings ...)
2024-03-04 13:04 ` [PATCH v2 08/14] fs: xfs: " John Garry
@ 2024-03-04 13:04 ` John Garry
2024-03-04 13:04 ` [PATCH v2 10/14] fs: iomap: Atomic write support John Garry
` (4 subsequent siblings)
13 siblings, 0 replies; 53+ messages in thread
From: John Garry @ 2024-03-04 13:04 UTC (permalink / raw)
To: djwong, hch, viro, brauner, jack, chandan.babu, david, axboe
Cc: martin.petersen, linux-kernel, linux-xfs, linux-fsdevel, tytso,
jbongio, ojaswin, ritesh.list, linux-block, John Garry
Add a flag indicating that a regular file is enabled for atomic writes.
This is a file attribute that mirrors an ondisk inode flag. Actual support
for untorn file writes (for now) depends on both the iflag and the
underlying storage devices, which we can only really check at statx and
pwritev2() time. This is the same story as FS_XFLAG_DAX, which signals to
the fs that we should try to enable the fsdax IO path on the file (instead
of the regular page cache), but applications have to query STAT_ATTR_DAX
to find out if they really got that IO path.
Signed-off-by: John Garry <john.g.garry@oracle.com>
---
include/uapi/linux/fs.h | 1 +
1 file changed, 1 insertion(+)
diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
index 8828822331bf..aacf54381718 100644
--- a/include/uapi/linux/fs.h
+++ b/include/uapi/linux/fs.h
@@ -142,6 +142,7 @@ struct fsxattr {
#define FS_XFLAG_COWEXTSIZE 0x00010000 /* CoW extent size allocator hint */
/* data extent mappings for regular files must be aligned to extent size hint */
#define FS_XFLAG_FORCEALIGN 0x00020000
+#define FS_XFLAG_ATOMICWRITES 0x00040000 /* atomic writes enabled */
#define FS_XFLAG_HASATTR 0x80000000 /* no DIFLAG for this */
/* the read-only stuff doesn't really belong here, but any other place is
--
2.31.1
^ permalink raw reply related [flat|nested] 53+ messages in thread* [PATCH v2 10/14] fs: iomap: Atomic write support
2024-03-04 13:04 [PATCH v2 00/14] block atomic writes for XFS John Garry
` (8 preceding siblings ...)
2024-03-04 13:04 ` [PATCH v2 09/14] fs: Add FS_XFLAG_ATOMICWRITES flag John Garry
@ 2024-03-04 13:04 ` John Garry
2024-03-04 13:04 ` [PATCH v2 11/14] fs: xfs: Support FS_XFLAG_ATOMICWRITES for forcealign John Garry
` (3 subsequent siblings)
13 siblings, 0 replies; 53+ messages in thread
From: John Garry @ 2024-03-04 13:04 UTC (permalink / raw)
To: djwong, hch, viro, brauner, jack, chandan.babu, david, axboe
Cc: martin.petersen, linux-kernel, linux-xfs, linux-fsdevel, tytso,
jbongio, ojaswin, ritesh.list, linux-block, John Garry
Support atomic writes by producing a single BIO with REQ_ATOMIC flag set.
We rely on the FS to guarantee extent alignment, such that an atomic write
should never straddle two or more extents. The FS should also check for
validity of an atomic write length/alignment.
Signed-off-by: John Garry <john.g.garry@oracle.com>
---
fs/iomap/direct-io.c | 10 ++++++++++
1 file changed, 10 insertions(+)
diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
index 733f83f839b6..197f1bb6a261 100644
--- a/fs/iomap/direct-io.c
+++ b/fs/iomap/direct-io.c
@@ -275,6 +275,7 @@ static inline blk_opf_t iomap_dio_bio_opflags(struct iomap_dio *dio,
static loff_t iomap_dio_bio_iter(const struct iomap_iter *iter,
struct iomap_dio *dio)
{
+ bool is_atomic = dio->iocb->ki_flags & IOCB_ATOMIC;
const struct iomap *iomap = &iter->iomap;
struct inode *inode = iter->inode;
unsigned int zeroing_size, pad;
@@ -383,6 +384,9 @@ static loff_t iomap_dio_bio_iter(const struct iomap_iter *iter,
GFP_KERNEL);
bio->bi_iter.bi_sector = iomap_sector(iomap, pos);
bio->bi_ioprio = dio->iocb->ki_ioprio;
+ if (is_atomic)
+ bio->bi_opf |= REQ_ATOMIC;
+
bio->bi_private = dio;
bio->bi_end_io = iomap_dio_bio_end_io;
@@ -399,6 +403,12 @@ static loff_t iomap_dio_bio_iter(const struct iomap_iter *iter,
}
n = bio->bi_iter.bi_size;
+ if (is_atomic && (n != orig_count)) {
+ /* This bio should have covered the complete length */
+ ret = -EINVAL;
+ bio_put(bio);
+ goto out;
+ }
if (dio->flags & IOMAP_DIO_WRITE) {
task_io_account_write(n);
} else {
--
2.31.1
^ permalink raw reply related [flat|nested] 53+ messages in thread* [PATCH v2 11/14] fs: xfs: Support FS_XFLAG_ATOMICWRITES for forcealign
2024-03-04 13:04 [PATCH v2 00/14] block atomic writes for XFS John Garry
` (9 preceding siblings ...)
2024-03-04 13:04 ` [PATCH v2 10/14] fs: iomap: Atomic write support John Garry
@ 2024-03-04 13:04 ` John Garry
2024-03-06 21:43 ` Dave Chinner
2024-03-04 13:04 ` [PATCH v2 12/14] fs: xfs: Support atomic write for statx John Garry
` (2 subsequent siblings)
13 siblings, 1 reply; 53+ messages in thread
From: John Garry @ 2024-03-04 13:04 UTC (permalink / raw)
To: djwong, hch, viro, brauner, jack, chandan.babu, david, axboe
Cc: martin.petersen, linux-kernel, linux-xfs, linux-fsdevel, tytso,
jbongio, ojaswin, ritesh.list, linux-block, John Garry
Add initial support for FS_XFLAG_ATOMICWRITES for forcealign enabled.
Current kernel support for atomic writes is based on HW support (for atomic
writes). As such, it is required to ensure extent alignment with
atomic_write_unit_max so that an atomic write can result in a single
HW-compliant IO operation.
rtvol also guarantees extent alignment, but we are basing support initially
on forcealign, which is not supported for rtvol yet.
Signed-off-by: John Garry <john.g.garry@oracle.com>
---
fs/xfs/libxfs/xfs_format.h | 13 ++++++++++---
fs/xfs/libxfs/xfs_sb.c | 2 ++
fs/xfs/xfs_inode.c | 2 ++
fs/xfs/xfs_inode.h | 5 +++++
fs/xfs/xfs_ioctl.c | 15 +++++++++++++--
fs/xfs/xfs_mount.h | 2 ++
fs/xfs/xfs_super.c | 4 ++++
7 files changed, 38 insertions(+), 5 deletions(-)
diff --git a/fs/xfs/libxfs/xfs_format.h b/fs/xfs/libxfs/xfs_format.h
index 2d9f5430efc3..5f54f9b3755e 100644
--- a/fs/xfs/libxfs/xfs_format.h
+++ b/fs/xfs/libxfs/xfs_format.h
@@ -354,12 +354,16 @@ xfs_sb_has_compat_feature(
#define XFS_SB_FEAT_RO_COMPAT_REFLINK (1 << 2) /* reflinked files */
#define XFS_SB_FEAT_RO_COMPAT_INOBTCNT (1 << 3) /* inobt block counts */
#define XFS_SB_FEAT_RO_COMPAT_FORCEALIGN (1 << 30) /* aligned file data extents */
+#define XFS_SB_FEAT_RO_COMPAT_ATOMICWRITES (1 << 31) /* atomicwrites enabled */
+
#define XFS_SB_FEAT_RO_COMPAT_ALL \
(XFS_SB_FEAT_RO_COMPAT_FINOBT | \
XFS_SB_FEAT_RO_COMPAT_RMAPBT | \
XFS_SB_FEAT_RO_COMPAT_REFLINK| \
- XFS_SB_FEAT_RO_COMPAT_INOBTCNT | \
- XFS_SB_FEAT_RO_COMPAT_FORCEALIGN)
+ XFS_SB_FEAT_RO_COMPAT_INOBTCNT| \
+ XFS_SB_FEAT_RO_COMPAT_FORCEALIGN| \
+ XFS_SB_FEAT_RO_COMPAT_ATOMICWRITES)
+
#define XFS_SB_FEAT_RO_COMPAT_UNKNOWN ~XFS_SB_FEAT_RO_COMPAT_ALL
static inline bool
xfs_sb_has_ro_compat_feature(
@@ -1089,6 +1093,7 @@ static inline void xfs_dinode_put_rdev(struct xfs_dinode *dip, xfs_dev_t rdev)
#define XFS_DIFLAG2_NREXT64_BIT 4 /* large extent counters */
/* data extent mappings for regular files must be aligned to extent size hint */
#define XFS_DIFLAG2_FORCEALIGN_BIT 5
+#define XFS_DIFLAG2_ATOMICWRITES_BIT 6
#define XFS_DIFLAG2_DAX (1 << XFS_DIFLAG2_DAX_BIT)
#define XFS_DIFLAG2_REFLINK (1 << XFS_DIFLAG2_REFLINK_BIT)
@@ -1096,10 +1101,12 @@ static inline void xfs_dinode_put_rdev(struct xfs_dinode *dip, xfs_dev_t rdev)
#define XFS_DIFLAG2_BIGTIME (1 << XFS_DIFLAG2_BIGTIME_BIT)
#define XFS_DIFLAG2_NREXT64 (1 << XFS_DIFLAG2_NREXT64_BIT)
#define XFS_DIFLAG2_FORCEALIGN (1 << XFS_DIFLAG2_FORCEALIGN_BIT)
+#define XFS_DIFLAG2_ATOMICWRITES (1 << XFS_DIFLAG2_ATOMICWRITES_BIT)
#define XFS_DIFLAG2_ANY \
(XFS_DIFLAG2_DAX | XFS_DIFLAG2_REFLINK | XFS_DIFLAG2_COWEXTSIZE | \
- XFS_DIFLAG2_BIGTIME | XFS_DIFLAG2_NREXT64 | XFS_DIFLAG2_FORCEALIGN)
+ XFS_DIFLAG2_BIGTIME | XFS_DIFLAG2_NREXT64 | XFS_DIFLAG2_FORCEALIGN | \
+ XFS_DIFLAG2_ATOMICWRITES)
static inline bool xfs_dinode_has_bigtime(const struct xfs_dinode *dip)
{
diff --git a/fs/xfs/libxfs/xfs_sb.c b/fs/xfs/libxfs/xfs_sb.c
index f2c16a028fae..d7bb3e34dd69 100644
--- a/fs/xfs/libxfs/xfs_sb.c
+++ b/fs/xfs/libxfs/xfs_sb.c
@@ -165,6 +165,8 @@ xfs_sb_version_to_features(
features |= XFS_FEAT_INOBTCNT;
if (sbp->sb_features_ro_compat & XFS_SB_FEAT_RO_COMPAT_FORCEALIGN)
features |= XFS_FEAT_FORCEALIGN;
+ if (sbp->sb_features_ro_compat & XFS_SB_FEAT_RO_COMPAT_ATOMICWRITES)
+ features |= XFS_FEAT_ATOMICWRITES;
if (sbp->sb_features_incompat & XFS_SB_FEAT_INCOMPAT_FTYPE)
features |= XFS_FEAT_FTYPE;
if (sbp->sb_features_incompat & XFS_SB_FEAT_INCOMPAT_SPINODES)
diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index bbb8886f1d32..14020ab1450c 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -645,6 +645,8 @@ xfs_ip2xflags(
flags |= FS_XFLAG_COWEXTSIZE;
if (ip->i_diflags2 & XFS_DIFLAG2_FORCEALIGN)
flags |= FS_XFLAG_FORCEALIGN;
+ if (ip->i_diflags2 & XFS_DIFLAG2_ATOMICWRITES)
+ flags |= FS_XFLAG_ATOMICWRITES;
}
if (xfs_inode_has_attr_fork(ip))
diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
index b6c42c27943e..f56bdbb74ad7 100644
--- a/fs/xfs/xfs_inode.h
+++ b/fs/xfs/xfs_inode.h
@@ -310,6 +310,11 @@ static inline bool xfs_inode_forcealign(struct xfs_inode *ip)
return ip->i_diflags2 & XFS_DIFLAG2_FORCEALIGN;
}
+static inline bool xfs_inode_atomicwrites(struct xfs_inode *ip)
+{
+ return ip->i_diflags2 & XFS_DIFLAG2_ATOMICWRITES;
+}
+
/*
* Return the buftarg used for data allocations on a given inode.
*/
diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c
index 867d8d51a3d0..f118a1ae39b5 100644
--- a/fs/xfs/xfs_ioctl.c
+++ b/fs/xfs/xfs_ioctl.c
@@ -1112,6 +1112,8 @@ xfs_flags2diflags2(
di_flags2 |= XFS_DIFLAG2_COWEXTSIZE;
if (xflags & FS_XFLAG_FORCEALIGN)
di_flags2 |= XFS_DIFLAG2_FORCEALIGN;
+ if (xflags & FS_XFLAG_ATOMICWRITES)
+ di_flags2 |= XFS_DIFLAG2_ATOMICWRITES;
return di_flags2;
}
@@ -1124,10 +1126,12 @@ xfs_ioctl_setattr_xflags(
{
struct xfs_mount *mp = ip->i_mount;
bool rtflag = (fa->fsx_xflags & FS_XFLAG_REALTIME);
+ bool atomic_writes = fa->fsx_xflags & FS_XFLAG_ATOMICWRITES;
uint64_t i_flags2;
- if (rtflag != XFS_IS_REALTIME_INODE(ip)) {
- /* Can't change realtime flag if any extents are allocated. */
+ /* Can't change RT or atomic flags if any extents are allocated. */
+ if (rtflag != XFS_IS_REALTIME_INODE(ip) ||
+ atomic_writes != xfs_inode_atomicwrites(ip)) {
if (ip->i_df.if_nextents || ip->i_delayed_blks)
return -EINVAL;
}
@@ -1164,6 +1168,13 @@ xfs_ioctl_setattr_xflags(
return -EINVAL;
}
+ if (atomic_writes) {
+ if (!xfs_has_atomicwrites(mp))
+ return -EINVAL;
+ if (!(fa->fsx_xflags & FS_XFLAG_FORCEALIGN))
+ return -EINVAL;
+ }
+
ip->i_diflags = xfs_flags2diflags(ip, fa->fsx_xflags);
ip->i_diflags2 = i_flags2;
diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
index e1ef31675db3..3b60d8a1d396 100644
--- a/fs/xfs/xfs_mount.h
+++ b/fs/xfs/xfs_mount.h
@@ -290,6 +290,7 @@ typedef struct xfs_mount {
#define XFS_FEAT_NEEDSREPAIR (1ULL << 25) /* needs xfs_repair */
#define XFS_FEAT_NREXT64 (1ULL << 26) /* large extent counters */
#define XFS_FEAT_FORCEALIGN (1ULL << 27) /* aligned file data extents */
+#define XFS_FEAT_ATOMICWRITES (1ULL << 28) /* atomic writes support */
/* Mount features */
#define XFS_FEAT_NOATTR2 (1ULL << 48) /* disable attr2 creation */
@@ -354,6 +355,7 @@ __XFS_HAS_FEAT(bigtime, BIGTIME)
__XFS_HAS_FEAT(needsrepair, NEEDSREPAIR)
__XFS_HAS_FEAT(large_extent_counts, NREXT64)
__XFS_HAS_FEAT(forcealign, FORCEALIGN)
+__XFS_HAS_FEAT(atomicwrites, ATOMICWRITES)
/*
* Mount features
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index 74dcafddf6a9..efe4b4234b2e 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -1712,6 +1712,10 @@ xfs_fs_fill_super(
xfs_warn(mp,
"EXPERIMENTAL forced data extent alignment feature in use. Use at your own risk!");
+ if (xfs_has_atomicwrites(mp))
+ xfs_warn(mp,
+"EXPERIMENTAL atomicwrites feature in use. Use at your own risk!");
+
if (xfs_has_reflink(mp)) {
if (mp->m_sb.sb_rblocks) {
xfs_alert(mp,
--
2.31.1
^ permalink raw reply related [flat|nested] 53+ messages in thread* Re: [PATCH v2 11/14] fs: xfs: Support FS_XFLAG_ATOMICWRITES for forcealign
2024-03-04 13:04 ` [PATCH v2 11/14] fs: xfs: Support FS_XFLAG_ATOMICWRITES for forcealign John Garry
@ 2024-03-06 21:43 ` Dave Chinner
2024-03-07 12:42 ` John Garry
0 siblings, 1 reply; 53+ messages in thread
From: Dave Chinner @ 2024-03-06 21:43 UTC (permalink / raw)
To: John Garry
Cc: djwong, hch, viro, brauner, jack, chandan.babu, axboe,
martin.petersen, linux-kernel, linux-xfs, linux-fsdevel, tytso,
jbongio, ojaswin, ritesh.list, linux-block
On Mon, Mar 04, 2024 at 01:04:25PM +0000, John Garry wrote:
> Add initial support for FS_XFLAG_ATOMICWRITES for forcealign enabled.
>
> Current kernel support for atomic writes is based on HW support (for atomic
> writes). As such, it is required to ensure extent alignment with
> atomic_write_unit_max so that an atomic write can result in a single
> HW-compliant IO operation.
>
> rtvol also guarantees extent alignment, but we are basing support initially
> on forcealign, which is not supported for rtvol yet.
>
> Signed-off-by: John Garry <john.g.garry@oracle.com>
....
> diff --git a/fs/xfs/libxfs/xfs_format.h b/fs/xfs/libxfs/xfs_format.h
> index 2d9f5430efc3..5f54f9b3755e 100644
> --- a/fs/xfs/libxfs/xfs_format.h
> +++ b/fs/xfs/libxfs/xfs_format.h
> @@ -354,12 +354,16 @@ xfs_sb_has_compat_feature(
> #define XFS_SB_FEAT_RO_COMPAT_REFLINK (1 << 2) /* reflinked files */
> #define XFS_SB_FEAT_RO_COMPAT_INOBTCNT (1 << 3) /* inobt block counts */
> #define XFS_SB_FEAT_RO_COMPAT_FORCEALIGN (1 << 30) /* aligned file data extents */
> +#define XFS_SB_FEAT_RO_COMPAT_ATOMICWRITES (1 << 31) /* atomicwrites enabled */
> +
> #define XFS_SB_FEAT_RO_COMPAT_ALL \
> (XFS_SB_FEAT_RO_COMPAT_FINOBT | \
> XFS_SB_FEAT_RO_COMPAT_RMAPBT | \
> XFS_SB_FEAT_RO_COMPAT_REFLINK| \
> - XFS_SB_FEAT_RO_COMPAT_INOBTCNT | \
> - XFS_SB_FEAT_RO_COMPAT_FORCEALIGN)
> + XFS_SB_FEAT_RO_COMPAT_INOBTCNT| \
> + XFS_SB_FEAT_RO_COMPAT_FORCEALIGN| \
Please leave a spave between the feature name and the '| \'.
> + XFS_SB_FEAT_RO_COMPAT_ATOMICWRITES)
> +
> #define XFS_SB_FEAT_RO_COMPAT_UNKNOWN ~XFS_SB_FEAT_RO_COMPAT_ALL
> static inline bool
> xfs_sb_has_ro_compat_feature(
> @@ -1089,6 +1093,7 @@ static inline void xfs_dinode_put_rdev(struct xfs_dinode *dip, xfs_dev_t rdev)
> #define XFS_DIFLAG2_NREXT64_BIT 4 /* large extent counters */
> /* data extent mappings for regular files must be aligned to extent size hint */
> #define XFS_DIFLAG2_FORCEALIGN_BIT 5
> +#define XFS_DIFLAG2_ATOMICWRITES_BIT 6
>
> #define XFS_DIFLAG2_DAX (1 << XFS_DIFLAG2_DAX_BIT)
> #define XFS_DIFLAG2_REFLINK (1 << XFS_DIFLAG2_REFLINK_BIT)
> @@ -1096,10 +1101,12 @@ static inline void xfs_dinode_put_rdev(struct xfs_dinode *dip, xfs_dev_t rdev)
> #define XFS_DIFLAG2_BIGTIME (1 << XFS_DIFLAG2_BIGTIME_BIT)
> #define XFS_DIFLAG2_NREXT64 (1 << XFS_DIFLAG2_NREXT64_BIT)
> #define XFS_DIFLAG2_FORCEALIGN (1 << XFS_DIFLAG2_FORCEALIGN_BIT)
> +#define XFS_DIFLAG2_ATOMICWRITES (1 << XFS_DIFLAG2_ATOMICWRITES_BIT)
>
> #define XFS_DIFLAG2_ANY \
> (XFS_DIFLAG2_DAX | XFS_DIFLAG2_REFLINK | XFS_DIFLAG2_COWEXTSIZE | \
> - XFS_DIFLAG2_BIGTIME | XFS_DIFLAG2_NREXT64 | XFS_DIFLAG2_FORCEALIGN)
> + XFS_DIFLAG2_BIGTIME | XFS_DIFLAG2_NREXT64 | XFS_DIFLAG2_FORCEALIGN | \
> + XFS_DIFLAG2_ATOMICWRITES)
>
> static inline bool xfs_dinode_has_bigtime(const struct xfs_dinode *dip)
> {
> diff --git a/fs/xfs/libxfs/xfs_sb.c b/fs/xfs/libxfs/xfs_sb.c
> index f2c16a028fae..d7bb3e34dd69 100644
> --- a/fs/xfs/libxfs/xfs_sb.c
> +++ b/fs/xfs/libxfs/xfs_sb.c
> @@ -165,6 +165,8 @@ xfs_sb_version_to_features(
> features |= XFS_FEAT_INOBTCNT;
> if (sbp->sb_features_ro_compat & XFS_SB_FEAT_RO_COMPAT_FORCEALIGN)
> features |= XFS_FEAT_FORCEALIGN;
> + if (sbp->sb_features_ro_compat & XFS_SB_FEAT_RO_COMPAT_ATOMICWRITES)
> + features |= XFS_FEAT_ATOMICWRITES;
> if (sbp->sb_features_incompat & XFS_SB_FEAT_INCOMPAT_FTYPE)
> features |= XFS_FEAT_FTYPE;
> if (sbp->sb_features_incompat & XFS_SB_FEAT_INCOMPAT_SPINODES)
> diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
> index bbb8886f1d32..14020ab1450c 100644
> --- a/fs/xfs/xfs_inode.c
> +++ b/fs/xfs/xfs_inode.c
> @@ -645,6 +645,8 @@ xfs_ip2xflags(
> flags |= FS_XFLAG_COWEXTSIZE;
> if (ip->i_diflags2 & XFS_DIFLAG2_FORCEALIGN)
> flags |= FS_XFLAG_FORCEALIGN;
> + if (ip->i_diflags2 & XFS_DIFLAG2_ATOMICWRITES)
> + flags |= FS_XFLAG_ATOMICWRITES;
> }
>
> if (xfs_inode_has_attr_fork(ip))
> diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
> index b6c42c27943e..f56bdbb74ad7 100644
> --- a/fs/xfs/xfs_inode.h
> +++ b/fs/xfs/xfs_inode.h
> @@ -310,6 +310,11 @@ static inline bool xfs_inode_forcealign(struct xfs_inode *ip)
> return ip->i_diflags2 & XFS_DIFLAG2_FORCEALIGN;
> }
>
> +static inline bool xfs_inode_atomicwrites(struct xfs_inode *ip)
> +{
> + return ip->i_diflags2 & XFS_DIFLAG2_ATOMICWRITES;
> +}
I'd really like this to be more readable:
xfs_inode_has_atomic_writes().
Same for the force align check, now that I notice it:
xfs_inode_has_force_align().
> +
> /*
> * Return the buftarg used for data allocations on a given inode.
> */
> diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c
> index 867d8d51a3d0..f118a1ae39b5 100644
> --- a/fs/xfs/xfs_ioctl.c
> +++ b/fs/xfs/xfs_ioctl.c
> @@ -1112,6 +1112,8 @@ xfs_flags2diflags2(
> di_flags2 |= XFS_DIFLAG2_COWEXTSIZE;
> if (xflags & FS_XFLAG_FORCEALIGN)
> di_flags2 |= XFS_DIFLAG2_FORCEALIGN;
> + if (xflags & FS_XFLAG_ATOMICWRITES)
> + di_flags2 |= XFS_DIFLAG2_ATOMICWRITES;
>
> return di_flags2;
> }
> @@ -1124,10 +1126,12 @@ xfs_ioctl_setattr_xflags(
> {
> struct xfs_mount *mp = ip->i_mount;
> bool rtflag = (fa->fsx_xflags & FS_XFLAG_REALTIME);
> + bool atomic_writes = fa->fsx_xflags & FS_XFLAG_ATOMICWRITES;
> uint64_t i_flags2;
>
> - if (rtflag != XFS_IS_REALTIME_INODE(ip)) {
> - /* Can't change realtime flag if any extents are allocated. */
> + /* Can't change RT or atomic flags if any extents are allocated. */
> + if (rtflag != XFS_IS_REALTIME_INODE(ip) ||
> + atomic_writes != xfs_inode_atomicwrites(ip)) {
> if (ip->i_df.if_nextents || ip->i_delayed_blks)
> return -EINVAL;
> }
> @@ -1164,6 +1168,13 @@ xfs_ioctl_setattr_xflags(
> return -EINVAL;
> }
>
> + if (atomic_writes) {
> + if (!xfs_has_atomicwrites(mp))
> + return -EINVAL;
That looks wrong - if we are trying to turn on atomic writes, then
shouldn't this be returning an error if atomic writes are already
configured?
> + if (!(fa->fsx_xflags & FS_XFLAG_FORCEALIGN))
> + return -EINVAL;
Where's the check for xfs_has_atomicwrites(mp) here? We can't allow
this inode flag to be set if the superblock does not have the
feature bit that says it's a known feature bit set.
Which reminds me: both the forcealign and the atomicwrite inode flag
need explicit checking in the inode verifier. i.e. checking that if
the inode flag bit is set, the relevant superblock feature bit is
set.
....
> diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
> index 74dcafddf6a9..efe4b4234b2e 100644
> --- a/fs/xfs/xfs_super.c
> +++ b/fs/xfs/xfs_super.c
> @@ -1712,6 +1712,10 @@ xfs_fs_fill_super(
> xfs_warn(mp,
> "EXPERIMENTAL forced data extent alignment feature in use. Use at your own risk!");
>
> + if (xfs_has_atomicwrites(mp))
> + xfs_warn(mp,
> +"EXPERIMENTAL atomicwrites feature in use. Use at your own risk!");
"EXPERIMENTAL atomic write IO feature is in use. Use at your own risk!");
-Dave.
--
Dave Chinner
david@fromorbit.com
^ permalink raw reply [flat|nested] 53+ messages in thread* Re: [PATCH v2 11/14] fs: xfs: Support FS_XFLAG_ATOMICWRITES for forcealign
2024-03-06 21:43 ` Dave Chinner
@ 2024-03-07 12:42 ` John Garry
0 siblings, 0 replies; 53+ messages in thread
From: John Garry @ 2024-03-07 12:42 UTC (permalink / raw)
To: Dave Chinner
Cc: djwong, hch, viro, brauner, jack, chandan.babu, axboe,
martin.petersen, linux-kernel, linux-xfs, linux-fsdevel, tytso,
jbongio, ojaswin, ritesh.list, linux-block
>> #define XFS_SB_FEAT_RO_COMPAT_ALL \
>> (XFS_SB_FEAT_RO_COMPAT_FINOBT | \
>> XFS_SB_FEAT_RO_COMPAT_RMAPBT | \
>> XFS_SB_FEAT_RO_COMPAT_REFLINK| \
>> - XFS_SB_FEAT_RO_COMPAT_INOBTCNT | \
>> - XFS_SB_FEAT_RO_COMPAT_FORCEALIGN)
>> + XFS_SB_FEAT_RO_COMPAT_INOBTCNT| \
>> + XFS_SB_FEAT_RO_COMPAT_FORCEALIGN| \
>
> Please leave a spave between the feature name and the '| \'.
>
ok
>> + XFS_SB_FEAT_RO_COMPAT_ATOMICWRITES)
>> +
...
>> }
>>
>> +static inline bool xfs_inode_atomicwrites(struct xfs_inode *ip)
>> +{
>> + return ip->i_diflags2 & XFS_DIFLAG2_ATOMICWRITES;
>> +}
>
> I'd really like this to be more readable:
> xfs_inode_has_atomic_writes().
>
> Same for the force align check, now that I notice it:
> xfs_inode_has_force_align().
ok, will change
>
>> +
>> /*
>> * Return the buftarg used for data allocations on a given inode.
>> */
>> diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c
>> index 867d8d51a3d0..f118a1ae39b5 100644
>> --- a/fs/xfs/xfs_ioctl.c
>> +++ b/fs/xfs/xfs_ioctl.c
>> @@ -1112,6 +1112,8 @@ xfs_flags2diflags2(
>> di_flags2 |= XFS_DIFLAG2_COWEXTSIZE;
>> if (xflags & FS_XFLAG_FORCEALIGN)
>> di_flags2 |= XFS_DIFLAG2_FORCEALIGN;
>> + if (xflags & FS_XFLAG_ATOMICWRITES)
>> + di_flags2 |= XFS_DIFLAG2_ATOMICWRITES;
>>
>> return di_flags2;
>> }
>> @@ -1124,10 +1126,12 @@ xfs_ioctl_setattr_xflags(
>> {
>> struct xfs_mount *mp = ip->i_mount;
>> bool rtflag = (fa->fsx_xflags & FS_XFLAG_REALTIME);
>> + bool atomic_writes = fa->fsx_xflags & FS_XFLAG_ATOMICWRITES;
>> uint64_t i_flags2;
>>
>> - if (rtflag != XFS_IS_REALTIME_INODE(ip)) {
>> - /* Can't change realtime flag if any extents are allocated. */
>> + /* Can't change RT or atomic flags if any extents are allocated. */
>> + if (rtflag != XFS_IS_REALTIME_INODE(ip) ||
>> + atomic_writes != xfs_inode_atomicwrites(ip)) {
>> if (ip->i_df.if_nextents || ip->i_delayed_blks)
>> return -EINVAL;
>> }
>> @@ -1164,6 +1168,13 @@ xfs_ioctl_setattr_xflags(
>> return -EINVAL;
>> }
>>
>> + if (atomic_writes) {
>> + if (!xfs_has_atomicwrites(mp))
>> + return -EINVAL;
>
> That looks wrong - if we are trying to turn on atomic writes, then
> shouldn't this be returning an error if atomic writes are already
> configured?
I think that you are talking about a xfs_inode_atomicwrites() check.
>
>> + if (!(fa->fsx_xflags & FS_XFLAG_FORCEALIGN))
>> + return -EINVAL;
>
> Where's the check for xfs_has_atomicwrites(mp) here?
please see above
> We can't allow
> this inode flag to be set if the superblock does not have the
> feature bit that says it's a known feature bit set.
>
> Which reminds me: both the forcealign and the atomicwrite inode flag
> need explicit checking in the inode verifier. i.e. checking that if
> the inode flag bit is set, the relevant superblock feature bit is
> set.
We do have that in the xfs_has_atomicwrites() and xfs_has_forcealign()
checks - is that ok?
>
> ....
>> diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
>> index 74dcafddf6a9..efe4b4234b2e 100644
>> --- a/fs/xfs/xfs_super.c
>> +++ b/fs/xfs/xfs_super.c
>> @@ -1712,6 +1712,10 @@ xfs_fs_fill_super(
>> xfs_warn(mp,
>> "EXPERIMENTAL forced data extent alignment feature in use. Use at your own risk!");
>>
>> + if (xfs_has_atomicwrites(mp))
>> + xfs_warn(mp,
>> +"EXPERIMENTAL atomicwrites feature in use. Use at your own risk!");
>
> "EXPERIMENTAL atomic write IO feature is in use. Use at your own risk!");
>
ok
Thanks,
John
^ permalink raw reply [flat|nested] 53+ messages in thread
* [PATCH v2 12/14] fs: xfs: Support atomic write for statx
2024-03-04 13:04 [PATCH v2 00/14] block atomic writes for XFS John Garry
` (10 preceding siblings ...)
2024-03-04 13:04 ` [PATCH v2 11/14] fs: xfs: Support FS_XFLAG_ATOMICWRITES for forcealign John Garry
@ 2024-03-04 13:04 ` John Garry
2024-03-06 21:31 ` Dave Chinner
2024-03-04 13:04 ` [PATCH v2 13/14] fs: xfs: Validate atomic writes John Garry
2024-03-04 13:04 ` [PATCH v2 14/14] fs: xfs: Support setting FMODE_CAN_ATOMIC_WRITE John Garry
13 siblings, 1 reply; 53+ messages in thread
From: John Garry @ 2024-03-04 13:04 UTC (permalink / raw)
To: djwong, hch, viro, brauner, jack, chandan.babu, david, axboe
Cc: martin.petersen, linux-kernel, linux-xfs, linux-fsdevel, tytso,
jbongio, ojaswin, ritesh.list, linux-block, John Garry
Support providing info on atomic write unit min and max for an inode.
For simplicity, currently we limit the min at the FS block size, but a
lower limit could be supported in future. This is required by iomap
DIO.
The atomic write unit min and max is limited by the guaranteed extent
alignment for the inode.
Signed-off-by: John Garry <john.g.garry@oracle.com>
---
fs/xfs/xfs_iops.c | 38 ++++++++++++++++++++++++++++++++++++++
1 file changed, 38 insertions(+)
diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
index a0d77f5f512e..6316448083d2 100644
--- a/fs/xfs/xfs_iops.c
+++ b/fs/xfs/xfs_iops.c
@@ -546,6 +546,37 @@ xfs_stat_blksize(
return PAGE_SIZE;
}
+static void
+xfs_get_atomic_write_attr(
+ struct xfs_inode *ip,
+ unsigned int *unit_min,
+ unsigned int *unit_max)
+{
+ xfs_extlen_t extsz = xfs_get_extsz(ip);
+ struct xfs_buftarg *target = xfs_inode_buftarg(ip);
+ struct block_device *bdev = target->bt_bdev;
+ struct request_queue *q = bdev->bd_queue;
+ struct xfs_mount *mp = ip->i_mount;
+ struct xfs_sb *sbp = &mp->m_sb;
+ unsigned int awu_min, awu_max;
+ unsigned int extsz_bytes = XFS_FSB_TO_B(mp, extsz);
+
+ awu_min = queue_atomic_write_unit_min_bytes(q);
+ awu_max = queue_atomic_write_unit_max_bytes(q);
+
+ if (sbp->sb_blocksize > awu_max || awu_min > sbp->sb_blocksize ||
+ !xfs_inode_atomicwrites(ip)) {
+ *unit_min = 0;
+ *unit_max = 0;
+ return;
+ }
+
+ /* Floor at FS block size */
+ *unit_min = max(sbp->sb_blocksize, awu_min);
+
+ *unit_max = min(extsz_bytes, awu_max);
+}
+
STATIC int
xfs_vn_getattr(
struct mnt_idmap *idmap,
@@ -619,6 +650,13 @@ xfs_vn_getattr(
stat->dio_mem_align = bdev_dma_alignment(bdev) + 1;
stat->dio_offset_align = bdev_logical_block_size(bdev);
}
+ if (request_mask & STATX_WRITE_ATOMIC) {
+ unsigned int unit_min, unit_max;
+
+ xfs_get_atomic_write_attr(ip, &unit_min, &unit_max);
+ generic_fill_statx_atomic_writes(stat,
+ unit_min, unit_max);
+ }
fallthrough;
default:
stat->blksize = xfs_stat_blksize(ip);
--
2.31.1
^ permalink raw reply related [flat|nested] 53+ messages in thread* Re: [PATCH v2 12/14] fs: xfs: Support atomic write for statx
2024-03-04 13:04 ` [PATCH v2 12/14] fs: xfs: Support atomic write for statx John Garry
@ 2024-03-06 21:31 ` Dave Chinner
2024-03-07 10:35 ` John Garry
0 siblings, 1 reply; 53+ messages in thread
From: Dave Chinner @ 2024-03-06 21:31 UTC (permalink / raw)
To: John Garry
Cc: djwong, hch, viro, brauner, jack, chandan.babu, axboe,
martin.petersen, linux-kernel, linux-xfs, linux-fsdevel, tytso,
jbongio, ojaswin, ritesh.list, linux-block
On Mon, Mar 04, 2024 at 01:04:26PM +0000, John Garry wrote:
> Support providing info on atomic write unit min and max for an inode.
>
> For simplicity, currently we limit the min at the FS block size, but a
> lower limit could be supported in future. This is required by iomap
> DIO.
>
> The atomic write unit min and max is limited by the guaranteed extent
> alignment for the inode.
>
> Signed-off-by: John Garry <john.g.garry@oracle.com>
> ---
> fs/xfs/xfs_iops.c | 38 ++++++++++++++++++++++++++++++++++++++
> 1 file changed, 38 insertions(+)
>
> diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
> index a0d77f5f512e..6316448083d2 100644
> --- a/fs/xfs/xfs_iops.c
> +++ b/fs/xfs/xfs_iops.c
> @@ -546,6 +546,37 @@ xfs_stat_blksize(
> return PAGE_SIZE;
> }
>
> +static void
> +xfs_get_atomic_write_attr(
> + struct xfs_inode *ip,
> + unsigned int *unit_min,
> + unsigned int *unit_max)
> +{
> + xfs_extlen_t extsz = xfs_get_extsz(ip);
> + struct xfs_buftarg *target = xfs_inode_buftarg(ip);
> + struct block_device *bdev = target->bt_bdev;
> + struct request_queue *q = bdev->bd_queue;
> + struct xfs_mount *mp = ip->i_mount;
> + struct xfs_sb *sbp = &mp->m_sb;
> + unsigned int awu_min, awu_max;
> + unsigned int extsz_bytes = XFS_FSB_TO_B(mp, extsz);
> +
> + awu_min = queue_atomic_write_unit_min_bytes(q);
> + awu_max = queue_atomic_write_unit_max_bytes(q);
We really should be storing these in the xfs_buftarg at mount time,
like we do logical and physical sector sizes. Similar to sector
sizes, they *must not change* once the filesystem has been created
on the device, let alone during an active mount. The whole point of
the xfs_buftarg is to store the information the filesystem
needs to do IO to the underlying block device so we don't have to
chase pointers deep into the block device whenever we need to use
static geometry information.....
> + if (sbp->sb_blocksize > awu_max || awu_min > sbp->sb_blocksize ||
> + !xfs_inode_atomicwrites(ip)) {
> + *unit_min = 0;
> + *unit_max = 0;
> + return;
> + }
Again, this is comparing static geometry - if the block size doesn't
allow atomic writes, then the inode flag should never be set. i.e.
geometry is checked when configuring atomic writes, not in every
place we need to check if atomic writes are supported. Hence this
should simply be:
if (!xfs_inode_has_atomic_writes(ip)) {
*unit_min = 0;
*unit_max = 0;
return;
}
before we even look at the xfs_buftarg to get the supported min/max
values for the given device.
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
^ permalink raw reply [flat|nested] 53+ messages in thread* Re: [PATCH v2 12/14] fs: xfs: Support atomic write for statx
2024-03-06 21:31 ` Dave Chinner
@ 2024-03-07 10:35 ` John Garry
0 siblings, 0 replies; 53+ messages in thread
From: John Garry @ 2024-03-07 10:35 UTC (permalink / raw)
To: Dave Chinner
Cc: djwong, hch, viro, brauner, jack, chandan.babu, axboe,
martin.petersen, linux-kernel, linux-xfs, linux-fsdevel, tytso,
jbongio, ojaswin, ritesh.list, linux-block
On 06/03/2024 21:31, Dave Chinner wrote:
>> + xfs_extlen_t extsz = xfs_get_extsz(ip);
>> + struct xfs_buftarg *target = xfs_inode_buftarg(ip);
>> + struct block_device *bdev = target->bt_bdev;
>> + struct request_queue *q = bdev->bd_queue;
>> + struct xfs_mount *mp = ip->i_mount;
>> + struct xfs_sb *sbp = &mp->m_sb;
>> + unsigned int awu_min, awu_max;
>> + unsigned int extsz_bytes = XFS_FSB_TO_B(mp, extsz);
>> +
>> + awu_min = queue_atomic_write_unit_min_bytes(q);
>> + awu_max = queue_atomic_write_unit_max_bytes(q);
> We really should be storing these in the xfs_buftarg at mount time,
> like we do logical and physical sector sizes.
This has been mentioned previously, and Darrick thought that it was not
safe. Please see first response in:
https://lore.kernel.org/linux-xfs/20231003161029.GG21298@frogsfrogsfrogs/#t
So if this really is true, then I'll stick with something like what I
have here and add a comment on that.
However, in this series the block layer does check for out-of-range
atomic write BIOs in 1/14. So we could store the values in xfs_buftarg,
as you suggest for the lookup here. If the bdev geometry does really
change beneath us, worse thing that happens is that we may report
incorrect values for statx.
> Similar to sector
> sizes, they*must not change* once the filesystem has been created
> on the device, let alone during an active mount. The whole point of
> the xfs_buftarg is to store the information the filesystem
> needs to do IO to the underlying block device so we don't have to
> chase pointers deep into the block device whenever we need to use
> static geometry information.....
>
>> + if (sbp->sb_blocksize > awu_max || awu_min > sbp->sb_blocksize ||
>> + !xfs_inode_atomicwrites(ip)) {
>> + *unit_min = 0;
>> + *unit_max = 0;
>> + return;
>> + }
> Again, this is comparing static geometry - if the block size doesn't
> allow atomic writes, then the inode flag should never be set. i.e.
> geometry is checked when configuring atomic writes, not in every
> place we need to check if atomic writes are supported. Hence this
> should simply be:
>
> if (!xfs_inode_has_atomic_writes(ip)) {
> *unit_min = 0;
> *unit_max = 0;
> return;
> } >
> before we even look at the xfs_buftarg to get the supported min/max
> values for the given device.
Thanks,
John
^ permalink raw reply [flat|nested] 53+ messages in thread
* [PATCH v2 13/14] fs: xfs: Validate atomic writes
2024-03-04 13:04 [PATCH v2 00/14] block atomic writes for XFS John Garry
` (11 preceding siblings ...)
2024-03-04 13:04 ` [PATCH v2 12/14] fs: xfs: Support atomic write for statx John Garry
@ 2024-03-04 13:04 ` John Garry
2024-03-06 21:22 ` Dave Chinner
2024-03-04 13:04 ` [PATCH v2 14/14] fs: xfs: Support setting FMODE_CAN_ATOMIC_WRITE John Garry
13 siblings, 1 reply; 53+ messages in thread
From: John Garry @ 2024-03-04 13:04 UTC (permalink / raw)
To: djwong, hch, viro, brauner, jack, chandan.babu, david, axboe
Cc: martin.petersen, linux-kernel, linux-xfs, linux-fsdevel, tytso,
jbongio, ojaswin, ritesh.list, linux-block, John Garry
Validate that an atomic write adheres to length/offset rules. Since we
require extent alignment for atomic writes, this effectively also enforces
that the BIO which iomap produces is aligned.
Signed-off-by: John Garry <john.g.garry@oracle.com>
---
fs/xfs/xfs_file.c | 11 ++++++++++-
1 file changed, 10 insertions(+), 1 deletion(-)
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index d0bd9d5f596c..cecc5428fd7c 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -709,11 +709,20 @@ xfs_file_dio_write(
struct kiocb *iocb,
struct iov_iter *from)
{
- struct xfs_inode *ip = XFS_I(file_inode(iocb->ki_filp));
+ struct inode *inode = file_inode(iocb->ki_filp);
+ struct xfs_inode *ip = XFS_I(inode);
struct xfs_mount *mp = ip->i_mount;
struct xfs_buftarg *target = xfs_inode_buftarg(ip);
size_t count = iov_iter_count(from);
+ if (iocb->ki_flags & IOCB_ATOMIC) {
+ if (!generic_atomic_write_valid(iocb->ki_pos, from,
+ i_blocksize(inode),
+ XFS_FSB_TO_B(mp, xfs_get_extsz(ip)))) {
+ return -EINVAL;
+ }
+ }
+
/* direct I/O must be aligned to device logical sector size */
if ((iocb->ki_pos | count) & target->bt_logical_sectormask)
return -EINVAL;
--
2.31.1
^ permalink raw reply related [flat|nested] 53+ messages in thread* Re: [PATCH v2 13/14] fs: xfs: Validate atomic writes
2024-03-04 13:04 ` [PATCH v2 13/14] fs: xfs: Validate atomic writes John Garry
@ 2024-03-06 21:22 ` Dave Chinner
2024-03-07 10:19 ` John Garry
0 siblings, 1 reply; 53+ messages in thread
From: Dave Chinner @ 2024-03-06 21:22 UTC (permalink / raw)
To: John Garry
Cc: djwong, hch, viro, brauner, jack, chandan.babu, axboe,
martin.petersen, linux-kernel, linux-xfs, linux-fsdevel, tytso,
jbongio, ojaswin, ritesh.list, linux-block
On Mon, Mar 04, 2024 at 01:04:27PM +0000, John Garry wrote:
> Validate that an atomic write adheres to length/offset rules. Since we
> require extent alignment for atomic writes, this effectively also enforces
> that the BIO which iomap produces is aligned.
>
> Signed-off-by: John Garry <john.g.garry@oracle.com>
> ---
> fs/xfs/xfs_file.c | 11 ++++++++++-
> 1 file changed, 10 insertions(+), 1 deletion(-)
>
> diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> index d0bd9d5f596c..cecc5428fd7c 100644
> --- a/fs/xfs/xfs_file.c
> +++ b/fs/xfs/xfs_file.c
> @@ -709,11 +709,20 @@ xfs_file_dio_write(
> struct kiocb *iocb,
> struct iov_iter *from)
> {
> - struct xfs_inode *ip = XFS_I(file_inode(iocb->ki_filp));
> + struct inode *inode = file_inode(iocb->ki_filp);
> + struct xfs_inode *ip = XFS_I(inode);
> struct xfs_mount *mp = ip->i_mount;
> struct xfs_buftarg *target = xfs_inode_buftarg(ip);
> size_t count = iov_iter_count(from);
>
> + if (iocb->ki_flags & IOCB_ATOMIC) {
> + if (!generic_atomic_write_valid(iocb->ki_pos, from,
> + i_blocksize(inode),
a.k.a. mp->m_bsize. If you use that here, then the need for the VFS
inode goes away, too.
> + XFS_FSB_TO_B(mp, xfs_get_extsz(ip)))) {
> + return -EINVAL;
> + }
> + }
Also, I think the checks are the wrong way around here. We can only
do IOCB_ATOMIC IO on force aligned/atomic write inodes, so shouldn't
we be checking that first, then basing the rest of the checks on the
assumption that atomic writes are allowed and have been set up
correctly on the inode? i.e.
if (iocb->ki_flags & IOCB_ATOMIC) {
if (!xfs_inode_has_atomicwrites(ip))
return -EINVAL;
if (!generic_atomic_write_valid(iocb->ki_pos, from,
mp->m_bsize, ip->i_extsize))
return -EINVAL;
}
because xfs_inode_has_atomicwrites() implies ip->i_extsize has been
set to the required atomic IO size?
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
^ permalink raw reply [flat|nested] 53+ messages in thread* Re: [PATCH v2 13/14] fs: xfs: Validate atomic writes
2024-03-06 21:22 ` Dave Chinner
@ 2024-03-07 10:19 ` John Garry
0 siblings, 0 replies; 53+ messages in thread
From: John Garry @ 2024-03-07 10:19 UTC (permalink / raw)
To: Dave Chinner
Cc: djwong, hch, viro, brauner, jack, chandan.babu, axboe,
martin.petersen, linux-kernel, linux-xfs, linux-fsdevel, tytso,
jbongio, ojaswin, ritesh.list, linux-block
On 06/03/2024 21:22, Dave Chinner wrote:
> On Mon, Mar 04, 2024 at 01:04:27PM +0000, John Garry wrote:
>> Validate that an atomic write adheres to length/offset rules. Since we
>> require extent alignment for atomic writes, this effectively also enforces
>> that the BIO which iomap produces is aligned.
>>
>> Signed-off-by: John Garry<john.g.garry@oracle.com>
>> ---
>> fs/xfs/xfs_file.c | 11 ++++++++++-
>> 1 file changed, 10 insertions(+), 1 deletion(-)
>>
>> diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
>> index d0bd9d5f596c..cecc5428fd7c 100644
>> --- a/fs/xfs/xfs_file.c
>> +++ b/fs/xfs/xfs_file.c
>> @@ -709,11 +709,20 @@ xfs_file_dio_write(
>> struct kiocb *iocb,
>> struct iov_iter *from)
>> {
>> - struct xfs_inode *ip = XFS_I(file_inode(iocb->ki_filp));
>> + struct inode *inode = file_inode(iocb->ki_filp);
>> + struct xfs_inode *ip = XFS_I(inode);
>> struct xfs_mount *mp = ip->i_mount;
>> struct xfs_buftarg *target = xfs_inode_buftarg(ip);
>> size_t count = iov_iter_count(from);
>>
>> + if (iocb->ki_flags & IOCB_ATOMIC) {
>> + if (!generic_atomic_write_valid(iocb->ki_pos, from,
>> + i_blocksize(inode),
> a.k.a. mp->m_bsize. If you use that here, then the need for the VFS
> inode goes away, too.
ok
>
>> + XFS_FSB_TO_B(mp, xfs_get_extsz(ip)))) {
>> + return -EINVAL;
>> + }
>> + }
> Also, I think the checks are the wrong way around here. We can only
> do IOCB_ATOMIC IO on force aligned/atomic write inodes, so shouldn't
> we be checking that first,
We are checking that, but not here.
In 14/14, we only set FMODE_CAN_ATOMIC_WRITE for when
xfs_inode_has_atomicwrites() is true, and only when
FMODE_CAN_ATOMIC_WRITE is set can we get this far.
I don't see a point in duplicating this xfs_inode_has_atomicwrites()
check, so I will make the commit message clearer on this - ok? Or add a
comment.
> then basing the rest of the checks on the
> assumption that atomic writes are allowed and have been set up
> correctly on the inode? i.e.
>
> if (iocb->ki_flags & IOCB_ATOMIC) {
> if (!xfs_inode_has_atomicwrites(ip))
> return -EINVAL;
> if (!generic_atomic_write_valid(iocb->ki_pos, from,
> mp->m_bsize, ip->i_extsize))
> return -EINVAL;
> }
>
> because xfs_inode_has_atomicwrites() implies ip->i_extsize has been
> set to the required atomic IO size?
I was not too comfortable using ip->i_extsize, as this can be set
without forcealign being set. I know that we would not get this far
without forcealign (being set).
Having said that, I don't like all the xfs_get_extsz() calls, so
something better is required. Let me know you you think.
Thanks,
John
^ permalink raw reply [flat|nested] 53+ messages in thread
* [PATCH v2 14/14] fs: xfs: Support setting FMODE_CAN_ATOMIC_WRITE
2024-03-04 13:04 [PATCH v2 00/14] block atomic writes for XFS John Garry
` (12 preceding siblings ...)
2024-03-04 13:04 ` [PATCH v2 13/14] fs: xfs: Validate atomic writes John Garry
@ 2024-03-04 13:04 ` John Garry
2024-03-06 21:33 ` Dave Chinner
13 siblings, 1 reply; 53+ messages in thread
From: John Garry @ 2024-03-04 13:04 UTC (permalink / raw)
To: djwong, hch, viro, brauner, jack, chandan.babu, david, axboe
Cc: martin.petersen, linux-kernel, linux-xfs, linux-fsdevel, tytso,
jbongio, ojaswin, ritesh.list, linux-block, John Garry
For when an inode is enabled for atomic writes, set FMODE_CAN_ATOMIC_WRITE
flag. We check for direct I/O and also check that the bdev can actually
support atomic writes.
We rely on the block layer to reject atomic writes which exceed the bdev
request_queue limits, so don't bother checking any such thing here.
Signed-off-by: John Garry <john.g.garry@oracle.com>
---
fs/xfs/xfs_file.c | 21 +++++++++++++++++++++
1 file changed, 21 insertions(+)
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index cecc5428fd7c..e63851be6c15 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -1234,6 +1234,25 @@ xfs_file_remap_range(
return remapped > 0 ? remapped : ret;
}
+static bool xfs_file_open_can_atomicwrite(
+ struct inode *inode,
+ struct file *file)
+{
+ struct xfs_inode *ip = XFS_I(inode);
+ struct xfs_buftarg *target = xfs_inode_buftarg(ip);
+
+ if (!(file->f_flags & O_DIRECT))
+ return false;
+
+ if (!xfs_inode_atomicwrites(ip))
+ return false;
+
+ if (!bdev_can_atomic_write(target->bt_bdev))
+ return false;
+
+ return true;
+}
+
STATIC int
xfs_file_open(
struct inode *inode,
@@ -1243,6 +1262,8 @@ xfs_file_open(
return -EIO;
file->f_mode |= FMODE_NOWAIT | FMODE_BUF_RASYNC | FMODE_BUF_WASYNC |
FMODE_DIO_PARALLEL_WRITE | FMODE_CAN_ODIRECT;
+ if (xfs_file_open_can_atomicwrite(inode, file))
+ file->f_mode |= FMODE_CAN_ATOMIC_WRITE;
return generic_file_open(inode, file);
}
--
2.31.1
^ permalink raw reply related [flat|nested] 53+ messages in thread* Re: [PATCH v2 14/14] fs: xfs: Support setting FMODE_CAN_ATOMIC_WRITE
2024-03-04 13:04 ` [PATCH v2 14/14] fs: xfs: Support setting FMODE_CAN_ATOMIC_WRITE John Garry
@ 2024-03-06 21:33 ` Dave Chinner
2024-03-07 11:55 ` John Garry
0 siblings, 1 reply; 53+ messages in thread
From: Dave Chinner @ 2024-03-06 21:33 UTC (permalink / raw)
To: John Garry
Cc: djwong, hch, viro, brauner, jack, chandan.babu, axboe,
martin.petersen, linux-kernel, linux-xfs, linux-fsdevel, tytso,
jbongio, ojaswin, ritesh.list, linux-block
On Mon, Mar 04, 2024 at 01:04:28PM +0000, John Garry wrote:
> For when an inode is enabled for atomic writes, set FMODE_CAN_ATOMIC_WRITE
> flag. We check for direct I/O and also check that the bdev can actually
> support atomic writes.
>
> We rely on the block layer to reject atomic writes which exceed the bdev
> request_queue limits, so don't bother checking any such thing here.
>
> Signed-off-by: John Garry <john.g.garry@oracle.com>
> ---
> fs/xfs/xfs_file.c | 21 +++++++++++++++++++++
> 1 file changed, 21 insertions(+)
>
> diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> index cecc5428fd7c..e63851be6c15 100644
> --- a/fs/xfs/xfs_file.c
> +++ b/fs/xfs/xfs_file.c
> @@ -1234,6 +1234,25 @@ xfs_file_remap_range(
> return remapped > 0 ? remapped : ret;
> }
>
> +static bool xfs_file_open_can_atomicwrite(
> + struct inode *inode,
> + struct file *file)
> +{
> + struct xfs_inode *ip = XFS_I(inode);
> + struct xfs_buftarg *target = xfs_inode_buftarg(ip);
> +
> + if (!(file->f_flags & O_DIRECT))
> + return false;
> +
> + if (!xfs_inode_atomicwrites(ip))
> + return false;
> +
> + if (!bdev_can_atomic_write(target->bt_bdev))
> + return false;
Again, this is static blockdev information - the inode atomic write
flag should not be set if the bdev cannot do atomic writes. It
should be checked at mount time - the filesystem probably should
only mount read-only if it is configured to allow atomic writes and
the underlying blockdev does not support atomic writes...
-Dave.
--
Dave Chinner
david@fromorbit.com
^ permalink raw reply [flat|nested] 53+ messages in thread* Re: [PATCH v2 14/14] fs: xfs: Support setting FMODE_CAN_ATOMIC_WRITE
2024-03-06 21:33 ` Dave Chinner
@ 2024-03-07 11:55 ` John Garry
0 siblings, 0 replies; 53+ messages in thread
From: John Garry @ 2024-03-07 11:55 UTC (permalink / raw)
To: Dave Chinner
Cc: djwong, hch, viro, brauner, jack, chandan.babu, axboe,
martin.petersen, linux-kernel, linux-xfs, linux-fsdevel, tytso,
jbongio, ojaswin, ritesh.list, linux-block
On 06/03/2024 21:33, Dave Chinner wrote:
>> +static bool xfs_file_open_can_atomicwrite(
>> + struct inode *inode,
>> + struct file *file)
>> +{
>> + struct xfs_inode *ip = XFS_I(inode);
>> + struct xfs_buftarg *target = xfs_inode_buftarg(ip);
>> +
>> + if (!(file->f_flags & O_DIRECT))
>> + return false;
>> +
>> + if (!xfs_inode_atomicwrites(ip))
>> + return false;
>> +
>> + if (!bdev_can_atomic_write(target->bt_bdev))
>> + return false;
> Again, this is static blockdev information - the inode atomic write
> flag should not be set if the bdev cannot do atomic writes. It
> should be checked at mount time
ok
> - the filesystem probably should
> only mount read-only if it is configured to allow atomic writes and
> the underlying blockdev does not support atomic writes...
Let me know if you really would like to see that change also. It does
seem a bit drastic, considering we can just disallow atomic writes.
Thanks,
John
^ permalink raw reply [flat|nested] 53+ messages in thread