[PATCH v4 00/14] forcealign for xfs

linux-fsdevel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCH v4 00/14] forcealign for xfs
@ 2024-08-13 16:36 John Garry
  2024-08-13 16:36 ` [PATCH v4 01/14] xfs: only allow minlen allocations when near ENOSPC John Garry
                   ` (14 more replies)
  0 siblings, 15 replies; 58+ messages in thread
From: John Garry @ 2024-08-13 16:36 UTC (permalink / raw)
  To: chandan.babu, djwong, dchinner, hch
  Cc: viro, brauner, jack, linux-xfs, linux-kernel, linux-fsdevel,
	catherine.hoang, martin.petersen, John Garry

This series is being spun off the block atomic writes for xfs series at
[0].

That series got too big.

The actual forcealign patches are roughly the same in this series.

Why forcealign?
In some scenarios to may be required to guarantee extent alignment and
granularity.

For example, for atomic writes, the maximum atomic write unit size would
be limited at the extent alignment and granularity, guaranteeing that an
atomic write would not span data present in multiple extents.

forcealign may be useful as a performance tuning optimization in other
scenarios.

I decided not to support forcealign for RT devices here. Initially I
thought that it would be quite simple of implement. However, I discovered
through much testing and subsequent debug that this was not true, so I
decided to defer support to later.

Early development xfsprogs support is at:
https://github.com/johnpgarry/xfsprogs-dev/commits/atomic-writes/

Differences to v3:
- Add more RB tags (thanks)
- Change round-in/out API to up/down
- Change unmap blocks patch to use alloc_fsb and rework helper
  (Darrick)

Differences to v2:
- Add rounding to alloc unit helpers
- Update xfs_setattr_size()
- Disallow mount for forcealign and reflink
- Remove forcealign and RT/reflink inode checks
- Relocate setting of XFS_ALLOC_FORCEALIGN

Differences to v1:
- Add Darricks RB tags (thanks)
- Disallow mount for forcealign and RT
- Disallow cp --reflink from forcealign inode
- Comments improvements (Darrick)
- Coding style improvements (Darrick)
- Fix xfs_inode_alloc_unitsize() (Darrick)

Baseline:
7bf888fa26e8 (tag: xfs-6.11-fixes-1, xfs/xfs-6.11-fixes, xfs/for-next)
xfs: convert comma to semicolon

[0] https://lore.kernel.org/linux-xfs/20240607143919.2622319-1-john.g.garry@oracle.com/
[1] https://lore.kernel.org/linux-block/20240620125359.2684798-1-john.g.garry@oracle.com/

Darrick J. Wong (2):
  xfs: Introduce FORCEALIGN inode flag
  xfs: Enable file data forcealign feature

Dave Chinner (6):
  xfs: only allow minlen allocations when near ENOSPC
  xfs: always tail align maxlen allocations
  xfs: simplify extent allocation alignment
  xfs: make EOF allocation simpler
  xfs: introduce forced allocation alignment
  xfs: align args->minlen for forced allocation alignment

John Garry (6):
  xfs: Update xfs_inode_alloc_unitsize() for forcealign
  xfs: Update xfs_setattr_size() for forcealign
  xfs: Do not free EOF blocks for forcealign
  xfs: Only free full extents for forcealign
  xfs: Unmap blocks according to forcealign
  xfs: Don't revert allocated offset for forcealign

 fs/xfs/libxfs/xfs_alloc.c      |  33 ++--
 fs/xfs/libxfs/xfs_alloc.h      |   3 +-
 fs/xfs/libxfs/xfs_bmap.c       | 320 ++++++++++++++++++---------------
 fs/xfs/libxfs/xfs_format.h     |   9 +-
 fs/xfs/libxfs/xfs_ialloc.c     |  12 +-
 fs/xfs/libxfs/xfs_inode_buf.c  |  46 +++++
 fs/xfs/libxfs/xfs_inode_buf.h  |   3 +
 fs/xfs/libxfs/xfs_inode_util.c |  14 ++
 fs/xfs/libxfs/xfs_sb.c         |   2 +
 fs/xfs/xfs_bmap_util.c         |  14 +-
 fs/xfs/xfs_inode.c             |  41 ++++-
 fs/xfs/xfs_inode.h             |  16 ++
 fs/xfs/xfs_ioctl.c             |  46 ++++-
 fs/xfs/xfs_iops.c              |   4 +-
 fs/xfs/xfs_mount.h             |   2 +
 fs/xfs/xfs_reflink.c           |   5 +-
 fs/xfs/xfs_super.c             |  18 ++
 fs/xfs/xfs_trace.h             |   8 +-
 include/uapi/linux/fs.h        |   2 +
 19 files changed, 405 insertions(+), 193 deletions(-)

-- 
2.31.1


^ permalink raw reply	[flat|nested] 58+ messages in thread

* [PATCH v4 01/14] xfs: only allow minlen allocations when near ENOSPC
  2024-08-13 16:36 [PATCH v4 00/14] forcealign for xfs John Garry
@ 2024-08-13 16:36 ` John Garry
  2024-08-23 16:28   ` Darrick J. Wong
  2024-08-13 16:36 ` [PATCH v4 02/14] xfs: always tail align maxlen allocations John Garry
                   ` (13 subsequent siblings)
  14 siblings, 1 reply; 58+ messages in thread
From: John Garry @ 2024-08-13 16:36 UTC (permalink / raw)
  To: chandan.babu, djwong, dchinner, hch
  Cc: viro, brauner, jack, linux-xfs, linux-kernel, linux-fsdevel,
	catherine.hoang, martin.petersen, John Garry

From: Dave Chinner <dchinner@redhat.com>

When we are near ENOSPC and don't have enough free
space for an args->maxlen allocation, xfs_alloc_space_available()
will trim args->maxlen to equal the available space. However, this
function has only checked that there is enough contiguous free space
for an aligned args->minlen allocation to succeed. Hence there is no
guarantee that an args->maxlen allocation will succeed, nor that the
available space will allow for correct alignment of an args->maxlen
allocation.

Further, by trimming args->maxlen arbitrarily, it breaks an
assumption made in xfs_alloc_fix_len() that if the caller wants
aligned allocation, then args->maxlen will be set to an aligned
value. It then skips the tail alignment and so we end up with
extents that aren't aligned to extent size hint boundaries as we
approach ENOSPC.

To avoid this problem, don't reduce args->maxlen by some random,
arbitrary amount. If args->maxlen is too large for the available
space, reduce the allocation to a minlen allocation as we know we
have contiguous free space available for this to succeed and always
be correctly aligned.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: John Garry <john.g.garry@oracle.com>
---
 fs/xfs/libxfs/xfs_alloc.c | 19 ++++++++++++++-----
 1 file changed, 14 insertions(+), 5 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c
index 59326f84f6a5..d559d992c6ef 100644
--- a/fs/xfs/libxfs/xfs_alloc.c
+++ b/fs/xfs/libxfs/xfs_alloc.c
@@ -2524,14 +2524,23 @@ xfs_alloc_space_available(
 	if (available < (int)max(args->total, alloc_len))
 		return false;

+	if (flags & XFS_ALLOC_FLAG_CHECK)
+		return true;
+
 	/*
-	 * Clamp maxlen to the amount of free space available for the actual
-	 * extent allocation.
+	 * If we can't do a maxlen allocation, then we must reduce the size of
+	 * the allocation to match the available free space. We know how big
+	 * the largest contiguous free space we can allocate is, so that's our
+	 * upper bound. However, we don't exaclty know what alignment/size
+	 * constraints have been placed on the allocation, so we can't
+	 * arbitrarily select some new max size. Hence make this a minlen
+	 * allocation as we know that will definitely succeed and match the
+	 * callers alignment constraints.
 	 */
-	if (available < (int)args->maxlen && !(flags & XFS_ALLOC_FLAG_CHECK)) {
-		args->maxlen = available;
+	alloc_len = args->maxlen + (args->alignment - 1) + args->minalignslop;
+	if (longest < alloc_len) {
+		args->maxlen = args->minlen;
 		ASSERT(args->maxlen > 0);
-		ASSERT(args->maxlen >= args->minlen);
 	}

 	return true;
-- 
2.31.1

^ permalink raw reply related	[flat|nested] 58+ messages in thread

* Re: [PATCH v4 01/14] xfs: only allow minlen allocations when near ENOSPC
  2024-08-13 16:36 ` [PATCH v4 01/14] xfs: only allow minlen allocations when near ENOSPC John Garry
@ 2024-08-23 16:28   ` Darrick J. Wong
  0 siblings, 0 replies; 58+ messages in thread
From: Darrick J. Wong @ 2024-08-23 16:28 UTC (permalink / raw)
  To: John Garry
  Cc: chandan.babu, dchinner, hch, viro, brauner, jack, linux-xfs,
	linux-kernel, linux-fsdevel, catherine.hoang, martin.petersen

On Tue, Aug 13, 2024 at 04:36:25PM +0000, John Garry wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> When we are near ENOSPC and don't have enough free
> space for an args->maxlen allocation, xfs_alloc_space_available()
> will trim args->maxlen to equal the available space. However, this
> function has only checked that there is enough contiguous free space
> for an aligned args->minlen allocation to succeed. Hence there is no
> guarantee that an args->maxlen allocation will succeed, nor that the
> available space will allow for correct alignment of an args->maxlen
> allocation.
> 
> Further, by trimming args->maxlen arbitrarily, it breaks an
> assumption made in xfs_alloc_fix_len() that if the caller wants
> aligned allocation, then args->maxlen will be set to an aligned
> value. It then skips the tail alignment and so we end up with
> extents that aren't aligned to extent size hint boundaries as we
> approach ENOSPC.
> 
> To avoid this problem, don't reduce args->maxlen by some random,
> arbitrary amount. If args->maxlen is too large for the available
> space, reduce the allocation to a minlen allocation as we know we
> have contiguous free space available for this to succeed and always
> be correctly aligned.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> Signed-off-by: John Garry <john.g.garry@oracle.com>

Looks ok.  I still have some misgivings about going straight to minlen,
but that's a common tactic elsewhere in the allocator and at worst the
performance is suboptimal.

Reviewed-by: Darrick J. Wong <djwong@kernel.org>

--D

> ---
>  fs/xfs/libxfs/xfs_alloc.c | 19 ++++++++++++++-----
>  1 file changed, 14 insertions(+), 5 deletions(-)
> 
> diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c
> index 59326f84f6a5..d559d992c6ef 100644
> --- a/fs/xfs/libxfs/xfs_alloc.c
> +++ b/fs/xfs/libxfs/xfs_alloc.c
> @@ -2524,14 +2524,23 @@ xfs_alloc_space_available(
>  	if (available < (int)max(args->total, alloc_len))
>  		return false;
>  
> +	if (flags & XFS_ALLOC_FLAG_CHECK)
> +		return true;
> +
>  	/*
> -	 * Clamp maxlen to the amount of free space available for the actual
> -	 * extent allocation.
> +	 * If we can't do a maxlen allocation, then we must reduce the size of
> +	 * the allocation to match the available free space. We know how big
> +	 * the largest contiguous free space we can allocate is, so that's our
> +	 * upper bound. However, we don't exaclty know what alignment/size
> +	 * constraints have been placed on the allocation, so we can't
> +	 * arbitrarily select some new max size. Hence make this a minlen
> +	 * allocation as we know that will definitely succeed and match the
> +	 * callers alignment constraints.
>  	 */
> -	if (available < (int)args->maxlen && !(flags & XFS_ALLOC_FLAG_CHECK)) {
> -		args->maxlen = available;
> +	alloc_len = args->maxlen + (args->alignment - 1) + args->minalignslop;
> +	if (longest < alloc_len) {
> +		args->maxlen = args->minlen;
>  		ASSERT(args->maxlen > 0);
> -		ASSERT(args->maxlen >= args->minlen);
>  	}
>  
>  	return true;
> -- 
> 2.31.1
> 
> 

^ permalink raw reply	[flat|nested] 58+ messages in thread

* [PATCH v4 02/14] xfs: always tail align maxlen allocations
  2024-08-13 16:36 [PATCH v4 00/14] forcealign for xfs John Garry
  2024-08-13 16:36 ` [PATCH v4 01/14] xfs: only allow minlen allocations when near ENOSPC John Garry
@ 2024-08-13 16:36 ` John Garry
  2024-08-23 16:31   ` Darrick J. Wong
  2024-08-13 16:36 ` [PATCH v4 03/14] xfs: simplify extent allocation alignment John Garry
                   ` (12 subsequent siblings)
  14 siblings, 1 reply; 58+ messages in thread
From: John Garry @ 2024-08-13 16:36 UTC (permalink / raw)
  To: chandan.babu, djwong, dchinner, hch
  Cc: viro, brauner, jack, linux-xfs, linux-kernel, linux-fsdevel,
	catherine.hoang, martin.petersen, John Garry

From: Dave Chinner <dchinner@redhat.com>

When we do a large allocation, the core free space allocation code
assumes that args->maxlen is aligned to args->prod/args->mod. hence
if we get a maximum sized extent allocated, it does not do tail
alignment of the extent.

However, this assumes that nothing modifies args->maxlen between the
original allocation context setup and trimming the selected free
space extent to size. This assumption has recently been found to be
invalid - xfs_alloc_space_available() modifies args->maxlen in low
space situations - and there may be more situations we haven't yet
found like this.

Force aligned allocation introduces the requirement that extents are
correctly tail aligned, resulting in this occasional latent
alignment failure to be reclassified from an unimportant curiousity
to a must-fix bug.

Removing the assumption about args->maxlen allocations always being
tail aligned is trivial, and should not impact anything because
args->maxlen for inodes with extent size hints configured are
already aligned. Hence all this change does it avoid weird corner
cases that would have resulted in unaligned extent sizes by always
trimming the extent down to an aligned size.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> [provisional on v1 series comment]
Signed-off-by: John Garry <john.g.garry@oracle.com>
---
 fs/xfs/libxfs/xfs_alloc.c | 12 +++++-------
 1 file changed, 5 insertions(+), 7 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c
index d559d992c6ef..bf08b9e9d9ac 100644
--- a/fs/xfs/libxfs/xfs_alloc.c
+++ b/fs/xfs/libxfs/xfs_alloc.c
@@ -433,20 +433,18 @@ xfs_alloc_compute_diff(
  * Fix up the length, based on mod and prod.
  * len should be k * prod + mod for some k.
  * If len is too small it is returned unchanged.
- * If len hits maxlen it is left alone.
  */
-STATIC void
+static void
 xfs_alloc_fix_len(
-	xfs_alloc_arg_t	*args)		/* allocation argument structure */
+	struct xfs_alloc_arg	*args)
 {
-	xfs_extlen_t	k;
-	xfs_extlen_t	rlen;
+	xfs_extlen_t		k;
+	xfs_extlen_t		rlen = args->len;

 	ASSERT(args->mod < args->prod);
-	rlen = args->len;
 	ASSERT(rlen >= args->minlen);
 	ASSERT(rlen <= args->maxlen);
-	if (args->prod <= 1 || rlen < args->mod || rlen == args->maxlen ||
+	if (args->prod <= 1 || rlen < args->mod ||
 	    (args->mod == 0 && rlen < args->prod))
 		return;
 	k = rlen % args->prod;
-- 
2.31.1

^ permalink raw reply related	[flat|nested] 58+ messages in thread

* Re: [PATCH v4 02/14] xfs: always tail align maxlen allocations
  2024-08-13 16:36 ` [PATCH v4 02/14] xfs: always tail align maxlen allocations John Garry
@ 2024-08-23 16:31   ` Darrick J. Wong
  2024-08-29 17:58     ` John Garry
  0 siblings, 1 reply; 58+ messages in thread
From: Darrick J. Wong @ 2024-08-23 16:31 UTC (permalink / raw)
  To: John Garry
  Cc: chandan.babu, dchinner, hch, viro, brauner, jack, linux-xfs,
	linux-kernel, linux-fsdevel, catherine.hoang, martin.petersen

On Tue, Aug 13, 2024 at 04:36:26PM +0000, John Garry wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> When we do a large allocation, the core free space allocation code
> assumes that args->maxlen is aligned to args->prod/args->mod. hence
> if we get a maximum sized extent allocated, it does not do tail
> alignment of the extent.
> 
> However, this assumes that nothing modifies args->maxlen between the
> original allocation context setup and trimming the selected free
> space extent to size. This assumption has recently been found to be
> invalid - xfs_alloc_space_available() modifies args->maxlen in low
> space situations - and there may be more situations we haven't yet
> found like this.
> 
> Force aligned allocation introduces the requirement that extents are
> correctly tail aligned, resulting in this occasional latent
> alignment failure to be reclassified from an unimportant curiousity
> to a must-fix bug.
> 
> Removing the assumption about args->maxlen allocations always being
> tail aligned is trivial, and should not impact anything because
> args->maxlen for inodes with extent size hints configured are
> already aligned. Hence all this change does it avoid weird corner
> cases that would have resulted in unaligned extent sizes by always
> trimming the extent down to an aligned size.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> [provisional on v1 series comment]

Still provisional -- neither the original patch author nor the submitter
have answered my question from June:

IOWs, we always trim rlen, unless there is no alignment (prod==1) or
rlen is less than mod.  For a forcealign file, it should never be the
case that minlen < mod because we'll have returned ENOSPC, right?

--D

> Signed-off-by: John Garry <john.g.garry@oracle.com>
> ---
>  fs/xfs/libxfs/xfs_alloc.c | 12 +++++-------
>  1 file changed, 5 insertions(+), 7 deletions(-)
> 
> diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c
> index d559d992c6ef..bf08b9e9d9ac 100644
> --- a/fs/xfs/libxfs/xfs_alloc.c
> +++ b/fs/xfs/libxfs/xfs_alloc.c
> @@ -433,20 +433,18 @@ xfs_alloc_compute_diff(
>   * Fix up the length, based on mod and prod.
>   * len should be k * prod + mod for some k.
>   * If len is too small it is returned unchanged.
> - * If len hits maxlen it is left alone.
>   */
> -STATIC void
> +static void
>  xfs_alloc_fix_len(
> -	xfs_alloc_arg_t	*args)		/* allocation argument structure */
> +	struct xfs_alloc_arg	*args)
>  {
> -	xfs_extlen_t	k;
> -	xfs_extlen_t	rlen;
> +	xfs_extlen_t		k;
> +	xfs_extlen_t		rlen = args->len;
>  
>  	ASSERT(args->mod < args->prod);
> -	rlen = args->len;
>  	ASSERT(rlen >= args->minlen);
>  	ASSERT(rlen <= args->maxlen);
> -	if (args->prod <= 1 || rlen < args->mod || rlen == args->maxlen ||
> +	if (args->prod <= 1 || rlen < args->mod ||
>  	    (args->mod == 0 && rlen < args->prod))
>  		return;
>  	k = rlen % args->prod;
> -- 
> 2.31.1
> 
> 

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v4 02/14] xfs: always tail align maxlen allocations
  2024-08-23 16:31   ` Darrick J. Wong
@ 2024-08-29 17:58     ` John Garry
  2024-08-29 21:34       ` Darrick J. Wong
  0 siblings, 1 reply; 58+ messages in thread
From: John Garry @ 2024-08-29 17:58 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: chandan.babu, dchinner, hch, viro, brauner, jack, linux-xfs,
	linux-kernel, linux-fsdevel, catherine.hoang, martin.petersen

On 23/08/2024 17:31, Darrick J. Wong wrote:

sorry for the slow reply...

> On Tue, Aug 13, 2024 at 04:36:26PM +0000, John Garry wrote:
>> From: Dave Chinner <dchinner@redhat.com>
>>>> When we do a large allocation, the core free space allocation code
>> assumes that args->maxlen is aligned to args->prod/args->mod. hence
>> if we get a maximum sized extent allocated, it does not do tail
>> alignment of the extent.
>>
>> However, this assumes that nothing modifies args->maxlen between the
>> original allocation context setup and trimming the selected free
>> space extent to size. This assumption has recently been found to be
>> invalid - xfs_alloc_space_available() modifies args->maxlen in low
>> space situations - and there may be more situations we haven't yet
>> found like this.
>>
>> Force aligned allocation introduces the requirement that extents are
>> correctly tail aligned, resulting in this occasional latent
>> alignment failure to be reclassified from an unimportant curiousity
>> to a must-fix bug.
>>
>> Removing the assumption about args->maxlen allocations always being
>> tail aligned is trivial, and should not impact anything because
>> args->maxlen for inodes with extent size hints configured are
>> already aligned. Hence all this change does it avoid weird corner
>> cases that would have resulted in unaligned extent sizes by always
>> trimming the extent down to an aligned size.
>>
>> Signed-off-by: Dave Chinner <dchinner@redhat.com>
>> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> [provisional on v1 series comment]
> 
> Still provisional -- neither the original patch author nor the submitter
> have answered my question from June:
> 
> IOWs, we always trim rlen, unless there is no alignment (prod==1) or
> rlen is less than mod.  For a forcealign file, it should never be the
> case that minlen < mod because we'll have returned ENOSPC, right?

For forcealign, mod == 0, so naturally that (minlen < mod) would not 
happen. We want to alloc a multiple of align only, which is in prod.

If we consider minlen < prod, then that should not happen either as we 
would have returned ENOSPC. In xfs_bmap_select_minlen() we rounddown 
blen by args->alignment, and if that is less than the ap->minlen (1), 
i.e. if after rounddown we have 0, then we return ENOSPC for forcealign. 
So then minlen would not be less than prod after selecting minlen in 
xfs_bmap_select_minlen().

I hope that I am answering the question asked...

Thanks,
John

> 
> --D
> 
>> Signed-off-by: John Garry <john.g.garry@oracle.com>
>> ---
>>   fs/xfs/libxfs/xfs_alloc.c | 12 +++++-------
>>   1 file changed, 5 insertions(+), 7 deletions(-)
>>
>> diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c
>> index d559d992c6ef..bf08b9e9d9ac 100644
>> --- a/fs/xfs/libxfs/xfs_alloc.c
>> +++ b/fs/xfs/libxfs/xfs_alloc.c
>> @@ -433,20 +433,18 @@ xfs_alloc_compute_diff(
>>    * Fix up the length, based on mod and prod.
>>    * len should be k * prod + mod for some k.
>>    * If len is too small it is returned unchanged.
>> - * If len hits maxlen it is left alone.
>>    */
>> -STATIC void
>> +static void
>>   xfs_alloc_fix_len(
>> -	xfs_alloc_arg_t	*args)		/* allocation argument structure */
>> +	struct xfs_alloc_arg	*args)
>>   {
>> -	xfs_extlen_t	k;
>> -	xfs_extlen_t	rlen;
>> +	xfs_extlen_t		k;
>> +	xfs_extlen_t		rlen = args->len;
>>   
>>   	ASSERT(args->mod < args->prod);
>> -	rlen = args->len;
>>   	ASSERT(rlen >= args->minlen);
>>   	ASSERT(rlen <= args->maxlen);
>> -	if (args->prod <= 1 || rlen < args->mod || rlen == args->maxlen ||
>> +	if (args->prod <= 1 || rlen < args->mod ||
>>   	    (args->mod == 0 && rlen < args->prod))
>>   		return;
>>   	k = rlen % args->prod;
>> -- 
>> 2.31.1
>>
>>


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v4 02/14] xfs: always tail align maxlen allocations
  2024-08-29 17:58     ` John Garry
@ 2024-08-29 21:34       ` Darrick J. Wong
  0 siblings, 0 replies; 58+ messages in thread
From: Darrick J. Wong @ 2024-08-29 21:34 UTC (permalink / raw)
  To: John Garry
  Cc: chandan.babu, dchinner, hch, viro, brauner, jack, linux-xfs,
	linux-kernel, linux-fsdevel, catherine.hoang, martin.petersen

On Thu, Aug 29, 2024 at 06:58:02PM +0100, John Garry wrote:
> On 23/08/2024 17:31, Darrick J. Wong wrote:
> 
> sorry for the slow reply...
> 
> > On Tue, Aug 13, 2024 at 04:36:26PM +0000, John Garry wrote:
> > > From: Dave Chinner <dchinner@redhat.com>
> > > > > When we do a large allocation, the core free space allocation code
> > > assumes that args->maxlen is aligned to args->prod/args->mod. hence
> > > if we get a maximum sized extent allocated, it does not do tail
> > > alignment of the extent.
> > > 
> > > However, this assumes that nothing modifies args->maxlen between the
> > > original allocation context setup and trimming the selected free
> > > space extent to size. This assumption has recently been found to be
> > > invalid - xfs_alloc_space_available() modifies args->maxlen in low
> > > space situations - and there may be more situations we haven't yet
> > > found like this.
> > > 
> > > Force aligned allocation introduces the requirement that extents are
> > > correctly tail aligned, resulting in this occasional latent
> > > alignment failure to be reclassified from an unimportant curiousity
> > > to a must-fix bug.
> > > 
> > > Removing the assumption about args->maxlen allocations always being
> > > tail aligned is trivial, and should not impact anything because
> > > args->maxlen for inodes with extent size hints configured are
> > > already aligned. Hence all this change does it avoid weird corner
> > > cases that would have resulted in unaligned extent sizes by always
> > > trimming the extent down to an aligned size.
> > > 
> > > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > > Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> [provisional on v1 series comment]
> > 
> > Still provisional -- neither the original patch author nor the submitter
> > have answered my question from June:
> > 
> > IOWs, we always trim rlen, unless there is no alignment (prod==1) or
> > rlen is less than mod.  For a forcealign file, it should never be the
> > case that minlen < mod because we'll have returned ENOSPC, right?
> 
> For forcealign, mod == 0, so naturally that (minlen < mod) would not happen.
> We want to alloc a multiple of align only, which is in prod.
> 
> If we consider minlen < prod, then that should not happen either as we would
> have returned ENOSPC. In xfs_bmap_select_minlen() we rounddown blen by
> args->alignment, and if that is less than the ap->minlen (1), i.e. if after
> rounddown we have 0, then we return ENOSPC for forcealign. So then minlen
> would not be less than prod after selecting minlen in
> xfs_bmap_select_minlen().
> 
> I hope that I am answering the question asked...

Yep, that satisfies my curiosity!  Thanks for getting back to me,
Reviewed-by: Darrick J. Wong <djwong@kernel.org>

--D

> 
> Thanks,
> John
> 
> > 
> > --D
> > 
> > > Signed-off-by: John Garry <john.g.garry@oracle.com>
> > > ---
> > >   fs/xfs/libxfs/xfs_alloc.c | 12 +++++-------
> > >   1 file changed, 5 insertions(+), 7 deletions(-)
> > > 
> > > diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c
> > > index d559d992c6ef..bf08b9e9d9ac 100644
> > > --- a/fs/xfs/libxfs/xfs_alloc.c
> > > +++ b/fs/xfs/libxfs/xfs_alloc.c
> > > @@ -433,20 +433,18 @@ xfs_alloc_compute_diff(
> > >    * Fix up the length, based on mod and prod.
> > >    * len should be k * prod + mod for some k.
> > >    * If len is too small it is returned unchanged.
> > > - * If len hits maxlen it is left alone.
> > >    */
> > > -STATIC void
> > > +static void
> > >   xfs_alloc_fix_len(
> > > -	xfs_alloc_arg_t	*args)		/* allocation argument structure */
> > > +	struct xfs_alloc_arg	*args)
> > >   {
> > > -	xfs_extlen_t	k;
> > > -	xfs_extlen_t	rlen;
> > > +	xfs_extlen_t		k;
> > > +	xfs_extlen_t		rlen = args->len;
> > >   	ASSERT(args->mod < args->prod);
> > > -	rlen = args->len;
> > >   	ASSERT(rlen >= args->minlen);
> > >   	ASSERT(rlen <= args->maxlen);
> > > -	if (args->prod <= 1 || rlen < args->mod || rlen == args->maxlen ||
> > > +	if (args->prod <= 1 || rlen < args->mod ||
> > >   	    (args->mod == 0 && rlen < args->prod))
> > >   		return;
> > >   	k = rlen % args->prod;
> > > -- 
> > > 2.31.1
> > > 
> > > 
> 
> 

^ permalink raw reply	[flat|nested] 58+ messages in thread

* [PATCH v4 03/14] xfs: simplify extent allocation alignment
  2024-08-13 16:36 [PATCH v4 00/14] forcealign for xfs John Garry
  2024-08-13 16:36 ` [PATCH v4 01/14] xfs: only allow minlen allocations when near ENOSPC John Garry
  2024-08-13 16:36 ` [PATCH v4 02/14] xfs: always tail align maxlen allocations John Garry
@ 2024-08-13 16:36 ` John Garry
  2024-08-13 16:36 ` [PATCH v4 04/14] xfs: make EOF allocation simpler John Garry
                   ` (11 subsequent siblings)
  14 siblings, 0 replies; 58+ messages in thread
From: John Garry @ 2024-08-13 16:36 UTC (permalink / raw)
  To: chandan.babu, djwong, dchinner, hch
  Cc: viro, brauner, jack, linux-xfs, linux-kernel, linux-fsdevel,
	catherine.hoang, martin.petersen, John Garry

From: Dave Chinner <dchinner@redhat.com>

We currently align extent allocation to stripe unit or stripe width.
That is specified by an external parameter to the allocation code,
which then manipulates the xfs_alloc_args alignment configuration in
interesting ways.

The args->alignment field specifies extent start alignment, but
because we may be attempting non-aligned allocation first there are
also slop variables that allow for those allocation attempts to
account for aligned allocation if they fail.

This gets much more complex as we introduce forced allocation
alignment, where extent size hints are used to generate the extent
start alignment. extent size hints currently only affect extent
lengths (via args->prod and args->mod) and so with this change we
will have two different start alignment conditions.

Avoid this complexity by always using args->alignment to indicate
extent start alignment, and always using args->prod/mod to indicate
extent length adjustment.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
[jpg: fixup alignslop references in xfs_trace.h and xfs_ialloc.c]
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: John Garry <john.g.garry@oracle.com>
---
 fs/xfs/libxfs/xfs_alloc.c  |  4 +-
 fs/xfs/libxfs/xfs_alloc.h  |  2 +-
 fs/xfs/libxfs/xfs_bmap.c   | 95 ++++++++++++++++----------------------
 fs/xfs/libxfs/xfs_ialloc.c | 10 ++--
 fs/xfs/xfs_trace.h         |  8 ++--
 5 files changed, 53 insertions(+), 66 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c
index bf08b9e9d9ac..a9ab7d71c558 100644
--- a/fs/xfs/libxfs/xfs_alloc.c
+++ b/fs/xfs/libxfs/xfs_alloc.c
@@ -2506,7 +2506,7 @@ xfs_alloc_space_available(
 	reservation = xfs_ag_resv_needed(pag, args->resv);
 
 	/* do we have enough contiguous free space for the allocation? */
-	alloc_len = args->minlen + (args->alignment - 1) + args->minalignslop;
+	alloc_len = args->minlen + (args->alignment - 1) + args->alignslop;
 	longest = xfs_alloc_longest_free_extent(pag, min_free, reservation);
 	if (longest < alloc_len)
 		return false;
@@ -2535,7 +2535,7 @@ xfs_alloc_space_available(
 	 * allocation as we know that will definitely succeed and match the
 	 * callers alignment constraints.
 	 */
-	alloc_len = args->maxlen + (args->alignment - 1) + args->minalignslop;
+	alloc_len = args->maxlen + (args->alignment - 1) + args->alignslop;
 	if (longest < alloc_len) {
 		args->maxlen = args->minlen;
 		ASSERT(args->maxlen > 0);
diff --git a/fs/xfs/libxfs/xfs_alloc.h b/fs/xfs/libxfs/xfs_alloc.h
index fae170825be0..473822a5d4e9 100644
--- a/fs/xfs/libxfs/xfs_alloc.h
+++ b/fs/xfs/libxfs/xfs_alloc.h
@@ -46,7 +46,7 @@ typedef struct xfs_alloc_arg {
 	xfs_extlen_t	minleft;	/* min blocks must be left after us */
 	xfs_extlen_t	total;		/* total blocks needed in xaction */
 	xfs_extlen_t	alignment;	/* align answer to multiple of this */
-	xfs_extlen_t	minalignslop;	/* slop for minlen+alignment calcs */
+	xfs_extlen_t	alignslop;	/* slop for alignment calcs */
 	xfs_agblock_t	min_agbno;	/* set an agbno range for NEAR allocs */
 	xfs_agblock_t	max_agbno;	/* ... */
 	xfs_extlen_t	len;		/* output: actual size of extent */
diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
index 7df74c35d9f9..25a87e1154bb 100644
--- a/fs/xfs/libxfs/xfs_bmap.c
+++ b/fs/xfs/libxfs/xfs_bmap.c
@@ -3286,6 +3286,10 @@ xfs_bmap_select_minlen(
 	xfs_extlen_t		blen)
 {
 
+	/* Adjust best length for extent start alignment. */
+	if (blen > args->alignment)
+		blen -= args->alignment;
+
 	/*
 	 * Since we used XFS_ALLOC_FLAG_TRYLOCK in _longest_free_extent(), it is
 	 * possible that there is enough contiguous free space for this request.
@@ -3394,35 +3398,43 @@ xfs_bmap_alloc_account(
 	xfs_trans_mod_dquot_byino(ap->tp, ap->ip, fld, ap->length);
 }
 
-static int
+/*
+ * Calculate the extent start alignment and the extent length adjustments that
+ * constrain this allocation.
+ *
+ * Extent start alignment is currently determined by stripe configuration and is
+ * carried in args->alignment, whilst extent length adjustment is determined by
+ * extent size hints and is carried by args->prod and args->mod.
+ *
+ * Low level allocation code is free to either ignore or override these values
+ * as required.
+ */
+static void
 xfs_bmap_compute_alignments(
 	struct xfs_bmalloca	*ap,
 	struct xfs_alloc_arg	*args)
 {
 	struct xfs_mount	*mp = args->mp;
 	xfs_extlen_t		align = 0; /* minimum allocation alignment */
-	int			stripe_align = 0;
 
 	/* stripe alignment for allocation is determined by mount parameters */
 	if (mp->m_swidth && xfs_has_swalloc(mp))
-		stripe_align = mp->m_swidth;
+		args->alignment = mp->m_swidth;
 	else if (mp->m_dalign)
-		stripe_align = mp->m_dalign;
+		args->alignment = mp->m_dalign;
 
 	if (ap->flags & XFS_BMAPI_COWFORK)
 		align = xfs_get_cowextsz_hint(ap->ip);
 	else if (ap->datatype & XFS_ALLOC_USERDATA)
 		align = xfs_get_extsz_hint(ap->ip);
+
 	if (align) {
 		if (xfs_bmap_extsize_align(mp, &ap->got, &ap->prev, align, 0,
 					ap->eof, 0, ap->conv, &ap->offset,
 					&ap->length))
 			ASSERT(0);
 		ASSERT(ap->length);
-	}
 
-	/* apply extent size hints if obtained earlier */
-	if (align) {
 		args->prod = align;
 		div_u64_rem(ap->offset, args->prod, &args->mod);
 		if (args->mod)
@@ -3437,7 +3449,6 @@ xfs_bmap_compute_alignments(
 			args->mod = args->prod - args->mod;
 	}
 
-	return stripe_align;
 }
 
 static void
@@ -3509,7 +3520,7 @@ xfs_bmap_exact_minlen_extent_alloc(
 	args.total = ap->total;
 
 	args.alignment = 1;
-	args.minalignslop = 0;
+	args.alignslop = 0;
 
 	args.minleft = ap->minleft;
 	args.wasdel = ap->wasdel;
@@ -3549,7 +3560,6 @@ xfs_bmap_btalloc_at_eof(
 	struct xfs_bmalloca	*ap,
 	struct xfs_alloc_arg	*args,
 	xfs_extlen_t		blen,
-	int			stripe_align,
 	bool			ag_only)
 {
 	struct xfs_mount	*mp = args->mp;
@@ -3563,23 +3573,15 @@ xfs_bmap_btalloc_at_eof(
 	 * allocation.
 	 */
 	if (ap->offset) {
-		xfs_extlen_t	nextminlen = 0;
+		xfs_extlen_t	alignment = args->alignment;
 
 		/*
-		 * Compute the minlen+alignment for the next case.  Set slop so
-		 * that the value of minlen+alignment+slop doesn't go up between
-		 * the calls.
+		 * Compute the alignment slop for the fallback path so we ensure
+		 * we account for the potential alignment space required by the
+		 * fallback paths before we modify the AGF and AGFL here.
 		 */
 		args->alignment = 1;
-		if (blen > stripe_align && blen <= args->maxlen)
-			nextminlen = blen - stripe_align;
-		else
-			nextminlen = args->minlen;
-		if (nextminlen + stripe_align > args->minlen + 1)
-			args->minalignslop = nextminlen + stripe_align -
-					args->minlen - 1;
-		else
-			args->minalignslop = 0;
+		args->alignslop = alignment - args->alignment;
 
 		if (!caller_pag)
 			args->pag = xfs_perag_get(mp, XFS_FSB_TO_AGNO(mp, ap->blkno));
@@ -3597,19 +3599,8 @@ xfs_bmap_btalloc_at_eof(
 		 * Exact allocation failed. Reset to try an aligned allocation
 		 * according to the original allocation specification.
 		 */
-		args->alignment = stripe_align;
-		args->minlen = nextminlen;
-		args->minalignslop = 0;
-	} else {
-		/*
-		 * Adjust minlen to try and preserve alignment if we
-		 * can't guarantee an aligned maxlen extent.
-		 */
-		args->alignment = stripe_align;
-		if (blen > args->alignment &&
-		    blen <= args->maxlen + args->alignment)
-			args->minlen = blen - args->alignment;
-		args->minalignslop = 0;
+		args->alignment = alignment;
+		args->alignslop = 0;
 	}
 
 	if (ag_only) {
@@ -3627,9 +3618,8 @@ xfs_bmap_btalloc_at_eof(
 		return 0;
 
 	/*
-	 * Allocation failed, so turn return the allocation args to their
-	 * original non-aligned state so the caller can proceed on allocation
-	 * failure as if this function was never called.
+	 * Aligned allocation failed, so all fallback paths from here drop the
+	 * start alignment requirement as we know it will not succeed.
 	 */
 	args->alignment = 1;
 	return 0;
@@ -3637,7 +3627,9 @@ xfs_bmap_btalloc_at_eof(
 
 /*
  * We have failed multiple allocation attempts so now are in a low space
- * allocation situation. Try a locality first full filesystem minimum length
+ * allocation situation. We give up on any attempt at aligned allocation here.
+ *
+ * Try a locality first full filesystem minimum length
  * allocation whilst still maintaining necessary total block reservation
  * requirements.
  *
@@ -3654,6 +3646,7 @@ xfs_bmap_btalloc_low_space(
 {
 	int			error;
 
+	args->alignment = 1;
 	if (args->minlen > ap->minlen) {
 		args->minlen = ap->minlen;
 		error = xfs_alloc_vextent_start_ag(args, ap->blkno);
@@ -3673,13 +3666,11 @@ xfs_bmap_btalloc_low_space(
 static int
 xfs_bmap_btalloc_filestreams(
 	struct xfs_bmalloca	*ap,
-	struct xfs_alloc_arg	*args,
-	int			stripe_align)
+	struct xfs_alloc_arg	*args)
 {
 	xfs_extlen_t		blen = 0;
 	int			error = 0;
 
-
 	error = xfs_filestream_select_ag(ap, args, &blen);
 	if (error)
 		return error;
@@ -3698,8 +3689,7 @@ xfs_bmap_btalloc_filestreams(
 
 	args->minlen = xfs_bmap_select_minlen(ap, args, blen);
 	if (ap->aeof)
-		error = xfs_bmap_btalloc_at_eof(ap, args, blen, stripe_align,
-				true);
+		error = xfs_bmap_btalloc_at_eof(ap, args, blen, true);
 
 	if (!error && args->fsbno == NULLFSBLOCK)
 		error = xfs_alloc_vextent_near_bno(args, ap->blkno);
@@ -3723,8 +3713,7 @@ xfs_bmap_btalloc_filestreams(
 static int
 xfs_bmap_btalloc_best_length(
 	struct xfs_bmalloca	*ap,
-	struct xfs_alloc_arg	*args,
-	int			stripe_align)
+	struct xfs_alloc_arg	*args)
 {
 	xfs_extlen_t		blen = 0;
 	int			error;
@@ -3748,8 +3737,7 @@ xfs_bmap_btalloc_best_length(
 	 * trying.
 	 */
 	if (ap->aeof && !(ap->tp->t_flags & XFS_TRANS_LOWMODE)) {
-		error = xfs_bmap_btalloc_at_eof(ap, args, blen, stripe_align,
-				false);
+		error = xfs_bmap_btalloc_at_eof(ap, args, blen, false);
 		if (error || args->fsbno != NULLFSBLOCK)
 			return error;
 	}
@@ -3776,27 +3764,26 @@ xfs_bmap_btalloc(
 		.resv		= XFS_AG_RESV_NONE,
 		.datatype	= ap->datatype,
 		.alignment	= 1,
-		.minalignslop	= 0,
+		.alignslop	= 0,
 	};
 	xfs_fileoff_t		orig_offset;
 	xfs_extlen_t		orig_length;
 	int			error;
-	int			stripe_align;
 
 	ASSERT(ap->length);
 	orig_offset = ap->offset;
 	orig_length = ap->length;
 
-	stripe_align = xfs_bmap_compute_alignments(ap, &args);
+	xfs_bmap_compute_alignments(ap, &args);
 
 	/* Trim the allocation back to the maximum an AG can fit. */
 	args.maxlen = min(ap->length, mp->m_ag_max_usable);
 
 	if ((ap->datatype & XFS_ALLOC_USERDATA) &&
 	    xfs_inode_is_filestream(ap->ip))
-		error = xfs_bmap_btalloc_filestreams(ap, &args, stripe_align);
+		error = xfs_bmap_btalloc_filestreams(ap, &args);
 	else
-		error = xfs_bmap_btalloc_best_length(ap, &args, stripe_align);
+		error = xfs_bmap_btalloc_best_length(ap, &args);
 	if (error)
 		return error;
 
diff --git a/fs/xfs/libxfs/xfs_ialloc.c b/fs/xfs/libxfs/xfs_ialloc.c
index 0af5b7a33d05..2fa29d2f004e 100644
--- a/fs/xfs/libxfs/xfs_ialloc.c
+++ b/fs/xfs/libxfs/xfs_ialloc.c
@@ -758,12 +758,12 @@ xfs_ialloc_ag_alloc(
 		 *
 		 * For an exact allocation, alignment must be 1,
 		 * however we need to take cluster alignment into account when
-		 * fixing up the freelist. Use the minalignslop field to
-		 * indicate that extra blocks might be required for alignment,
-		 * but not to use them in the actual exact allocation.
+		 * fixing up the freelist. Use the alignslop field to indicate
+		 * that extra blocks might be required for alignment, but not
+		 * to use them in the actual exact allocation.
 		 */
 		args.alignment = 1;
-		args.minalignslop = igeo->cluster_align - 1;
+		args.alignslop = igeo->cluster_align - 1;
 
 		/* Allow space for the inode btree to split. */
 		args.minleft = igeo->inobt_maxlevels;
@@ -783,7 +783,7 @@ xfs_ialloc_ag_alloc(
 		 * on, so reset minalignslop to ensure it is not included in
 		 * subsequent requests.
 		 */
-		args.minalignslop = 0;
+		args.alignslop = 0;
 	}
 
 	if (unlikely(args.fsbno == NULLFSBLOCK)) {
diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
index 180ce697305a..52a517a2c8e1 100644
--- a/fs/xfs/xfs_trace.h
+++ b/fs/xfs/xfs_trace.h
@@ -1811,7 +1811,7 @@ DECLARE_EVENT_CLASS(xfs_alloc_class,
 		__field(xfs_extlen_t, minleft)
 		__field(xfs_extlen_t, total)
 		__field(xfs_extlen_t, alignment)
-		__field(xfs_extlen_t, minalignslop)
+		__field(xfs_extlen_t, alignslop)
 		__field(xfs_extlen_t, len)
 		__field(char, wasdel)
 		__field(char, wasfromfl)
@@ -1830,7 +1830,7 @@ DECLARE_EVENT_CLASS(xfs_alloc_class,
 		__entry->minleft = args->minleft;
 		__entry->total = args->total;
 		__entry->alignment = args->alignment;
-		__entry->minalignslop = args->minalignslop;
+		__entry->alignslop = args->alignslop;
 		__entry->len = args->len;
 		__entry->wasdel = args->wasdel;
 		__entry->wasfromfl = args->wasfromfl;
@@ -1839,7 +1839,7 @@ DECLARE_EVENT_CLASS(xfs_alloc_class,
 		__entry->highest_agno = args->tp->t_highest_agno;
 	),
 	TP_printk("dev %d:%d agno 0x%x agbno 0x%x minlen %u maxlen %u mod %u "
-		  "prod %u minleft %u total %u alignment %u minalignslop %u "
+		  "prod %u minleft %u total %u alignment %u alignslop %u "
 		  "len %u wasdel %d wasfromfl %d resv %d "
 		  "datatype 0x%x highest_agno 0x%x",
 		  MAJOR(__entry->dev), MINOR(__entry->dev),
@@ -1852,7 +1852,7 @@ DECLARE_EVENT_CLASS(xfs_alloc_class,
 		  __entry->minleft,
 		  __entry->total,
 		  __entry->alignment,
-		  __entry->minalignslop,
+		  __entry->alignslop,
 		  __entry->len,
 		  __entry->wasdel,
 		  __entry->wasfromfl,
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [PATCH v4 04/14] xfs: make EOF allocation simpler
  2024-08-13 16:36 [PATCH v4 00/14] forcealign for xfs John Garry
                   ` (2 preceding siblings ...)
  2024-08-13 16:36 ` [PATCH v4 03/14] xfs: simplify extent allocation alignment John Garry
@ 2024-08-13 16:36 ` John Garry
  2024-09-04 18:25   ` Ritesh Harjani
  2024-08-13 16:36 ` [PATCH v4 05/14] xfs: introduce forced allocation alignment John Garry
                   ` (10 subsequent siblings)
  14 siblings, 1 reply; 58+ messages in thread
From: John Garry @ 2024-08-13 16:36 UTC (permalink / raw)
  To: chandan.babu, djwong, dchinner, hch
  Cc: viro, brauner, jack, linux-xfs, linux-kernel, linux-fsdevel,
	catherine.hoang, martin.petersen, John Garry

From: Dave Chinner <dchinner@redhat.com>

Currently the allocation at EOF is broken into two cases - when the
offset is zero and when the offset is non-zero. When the offset is
non-zero, we try to do exact block allocation for contiguous
extent allocation. When the offset is zero, the allocation is simply
an aligned allocation.

We want aligned allocation as the fallback when exact block
allocation fails, but that complicates the EOF allocation in that it
now has to handle two different allocation cases. The
caller also has to handle allocation when not at EOF, and for the
upcoming forced alignment changes we need that to also be aligned
allocation.

To simplify all this, pull the aligned allocation cases back into
the callers and leave the EOF allocation path for exact block
allocation only. This means that the EOF exact block allocation
fallback path is the normal aligned allocation path and that ends up
making things a lot simpler when forced alignment is introduced.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: John Garry <john.g.garry@oracle.com>
---
 fs/xfs/libxfs/xfs_bmap.c   | 129 +++++++++++++++----------------------
 fs/xfs/libxfs/xfs_ialloc.c |   2 +-
 2 files changed, 54 insertions(+), 77 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
index 25a87e1154bb..42a75d257b35 100644
--- a/fs/xfs/libxfs/xfs_bmap.c
+++ b/fs/xfs/libxfs/xfs_bmap.c
@@ -3310,12 +3310,12 @@ xfs_bmap_select_minlen(
 static int
 xfs_bmap_btalloc_select_lengths(
 	struct xfs_bmalloca	*ap,
-	struct xfs_alloc_arg	*args,
-	xfs_extlen_t		*blen)
+	struct xfs_alloc_arg	*args)
 {
 	struct xfs_mount	*mp = args->mp;
 	struct xfs_perag	*pag;
 	xfs_agnumber_t		agno, startag;
+	xfs_extlen_t		blen = 0;
 	int			error = 0;
 
 	if (ap->tp->t_flags & XFS_TRANS_LOWMODE) {
@@ -3329,19 +3329,18 @@ xfs_bmap_btalloc_select_lengths(
 	if (startag == NULLAGNUMBER)
 		startag = 0;
 
-	*blen = 0;
 	for_each_perag_wrap(mp, startag, agno, pag) {
-		error = xfs_bmap_longest_free_extent(pag, args->tp, blen);
+		error = xfs_bmap_longest_free_extent(pag, args->tp, &blen);
 		if (error && error != -EAGAIN)
 			break;
 		error = 0;
-		if (*blen >= args->maxlen)
+		if (blen >= args->maxlen)
 			break;
 	}
 	if (pag)
 		xfs_perag_rele(pag);
 
-	args->minlen = xfs_bmap_select_minlen(ap, args, *blen);
+	args->minlen = xfs_bmap_select_minlen(ap, args, blen);
 	return error;
 }
 
@@ -3551,78 +3550,40 @@ xfs_bmap_exact_minlen_extent_alloc(
  * If we are not low on available data blocks and we are allocating at
  * EOF, optimise allocation for contiguous file extension and/or stripe
  * alignment of the new extent.
- *
- * NOTE: ap->aeof is only set if the allocation length is >= the
- * stripe unit and the allocation offset is at the end of file.
  */
 static int
 xfs_bmap_btalloc_at_eof(
 	struct xfs_bmalloca	*ap,
-	struct xfs_alloc_arg	*args,
-	xfs_extlen_t		blen,
-	bool			ag_only)
+	struct xfs_alloc_arg	*args)
 {
 	struct xfs_mount	*mp = args->mp;
 	struct xfs_perag	*caller_pag = args->pag;
+	xfs_extlen_t		alignment = args->alignment;
 	int			error;
 
+	ASSERT(ap->aeof && ap->offset);
+	ASSERT(args->alignment >= 1);
+
 	/*
-	 * If there are already extents in the file, try an exact EOF block
-	 * allocation to extend the file as a contiguous extent. If that fails,
-	 * or it's the first allocation in a file, just try for a stripe aligned
-	 * allocation.
+	 * Compute the alignment slop for the fallback path so we ensure
+	 * we account for the potential alignemnt space required by the
+	 * fallback paths before we modify the AGF and AGFL here.
 	 */
-	if (ap->offset) {
-		xfs_extlen_t	alignment = args->alignment;
-
-		/*
-		 * Compute the alignment slop for the fallback path so we ensure
-		 * we account for the potential alignment space required by the
-		 * fallback paths before we modify the AGF and AGFL here.
-		 */
-		args->alignment = 1;
-		args->alignslop = alignment - args->alignment;
-
-		if (!caller_pag)
-			args->pag = xfs_perag_get(mp, XFS_FSB_TO_AGNO(mp, ap->blkno));
-		error = xfs_alloc_vextent_exact_bno(args, ap->blkno);
-		if (!caller_pag) {
-			xfs_perag_put(args->pag);
-			args->pag = NULL;
-		}
-		if (error)
-			return error;
-
-		if (args->fsbno != NULLFSBLOCK)
-			return 0;
-		/*
-		 * Exact allocation failed. Reset to try an aligned allocation
-		 * according to the original allocation specification.
-		 */
-		args->alignment = alignment;
-		args->alignslop = 0;
-	}
+	args->alignment = 1;
+	args->alignslop = alignment - args->alignment;
 
-	if (ag_only) {
-		error = xfs_alloc_vextent_near_bno(args, ap->blkno);
-	} else {
+	if (!caller_pag)
+		args->pag = xfs_perag_get(mp, XFS_FSB_TO_AGNO(mp, ap->blkno));
+	error = xfs_alloc_vextent_exact_bno(args, ap->blkno);
+	if (!caller_pag) {
+		xfs_perag_put(args->pag);
 		args->pag = NULL;
-		error = xfs_alloc_vextent_start_ag(args, ap->blkno);
-		ASSERT(args->pag == NULL);
-		args->pag = caller_pag;
 	}
-	if (error)
-		return error;
 
-	if (args->fsbno != NULLFSBLOCK)
-		return 0;
-
-	/*
-	 * Aligned allocation failed, so all fallback paths from here drop the
-	 * start alignment requirement as we know it will not succeed.
-	 */
-	args->alignment = 1;
-	return 0;
+	/* Reset alignment to original specifications.  */
+	args->alignment = alignment;
+	args->alignslop = 0;
+	return error;
 }
 
 /*
@@ -3688,12 +3649,19 @@ xfs_bmap_btalloc_filestreams(
 	}
 
 	args->minlen = xfs_bmap_select_minlen(ap, args, blen);
-	if (ap->aeof)
-		error = xfs_bmap_btalloc_at_eof(ap, args, blen, true);
+	if (ap->aeof && ap->offset)
+		error = xfs_bmap_btalloc_at_eof(ap, args);
 
+	/* This may be an aligned allocation attempt. */
 	if (!error && args->fsbno == NULLFSBLOCK)
 		error = xfs_alloc_vextent_near_bno(args, ap->blkno);
 
+	/* Attempt non-aligned allocation if we haven't already. */
+	if (!error && args->fsbno == NULLFSBLOCK && args->alignment > 1)  {
+		args->alignment = 1;
+		error = xfs_alloc_vextent_near_bno(args, ap->blkno);
+	}
+
 out_low_space:
 	/*
 	 * We are now done with the perag reference for the filestreams
@@ -3715,7 +3683,6 @@ xfs_bmap_btalloc_best_length(
 	struct xfs_bmalloca	*ap,
 	struct xfs_alloc_arg	*args)
 {
-	xfs_extlen_t		blen = 0;
 	int			error;
 
 	ap->blkno = XFS_INO_TO_FSB(args->mp, ap->ip->i_ino);
@@ -3726,23 +3693,33 @@ xfs_bmap_btalloc_best_length(
 	 * the request.  If one isn't found, then adjust the minimum allocation
 	 * size to the largest space found.
 	 */
-	error = xfs_bmap_btalloc_select_lengths(ap, args, &blen);
+	error = xfs_bmap_btalloc_select_lengths(ap, args);
 	if (error)
 		return error;
 
 	/*
-	 * Don't attempt optimal EOF allocation if previous allocations barely
-	 * succeeded due to being near ENOSPC. It is highly unlikely we'll get
-	 * optimal or even aligned allocations in this case, so don't waste time
-	 * trying.
+	 * If we are in low space mode, then optimal allocation will fail so
+	 * prepare for minimal allocation and run the low space algorithm
+	 * immediately.
 	 */
-	if (ap->aeof && !(ap->tp->t_flags & XFS_TRANS_LOWMODE)) {
-		error = xfs_bmap_btalloc_at_eof(ap, args, blen, false);
-		if (error || args->fsbno != NULLFSBLOCK)
-			return error;
+	if (ap->tp->t_flags & XFS_TRANS_LOWMODE) {
+		ASSERT(args->fsbno == NULLFSBLOCK);
+		return xfs_bmap_btalloc_low_space(ap, args);
+	}
+
+	if (ap->aeof && ap->offset)
+		error = xfs_bmap_btalloc_at_eof(ap, args);
+
+	/* This may be an aligned allocation attempt. */
+	if (!error && args->fsbno == NULLFSBLOCK)
+		error = xfs_alloc_vextent_start_ag(args, ap->blkno);
+
+	/* Attempt non-aligned allocation if we haven't already. */
+	if (!error && args->fsbno == NULLFSBLOCK && args->alignment > 1)  {
+		args->alignment = 1;
+		error = xfs_alloc_vextent_start_ag(args, ap->blkno);
 	}
 
-	error = xfs_alloc_vextent_start_ag(args, ap->blkno);
 	if (error || args->fsbno != NULLFSBLOCK)
 		return error;
 
diff --git a/fs/xfs/libxfs/xfs_ialloc.c b/fs/xfs/libxfs/xfs_ialloc.c
index 2fa29d2f004e..c5d220d51757 100644
--- a/fs/xfs/libxfs/xfs_ialloc.c
+++ b/fs/xfs/libxfs/xfs_ialloc.c
@@ -780,7 +780,7 @@ xfs_ialloc_ag_alloc(
 		 * the exact agbno requirement and increase the alignment
 		 * instead. It is critical that the total size of the request
 		 * (len + alignment + slop) does not increase from this point
-		 * on, so reset minalignslop to ensure it is not included in
+		 * on, so reset alignslop to ensure it is not included in
 		 * subsequent requests.
 		 */
 		args.alignslop = 0;
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 58+ messages in thread

* Re: [PATCH v4 04/14] xfs: make EOF allocation simpler
  2024-08-13 16:36 ` [PATCH v4 04/14] xfs: make EOF allocation simpler John Garry
@ 2024-09-04 18:25   ` Ritesh Harjani
  2024-09-05  7:51     ` John Garry
  0 siblings, 1 reply; 58+ messages in thread
From: Ritesh Harjani @ 2024-09-04 18:25 UTC (permalink / raw)
  To: John Garry, chandan.babu, djwong, dchinner, hch
  Cc: viro, brauner, jack, linux-xfs, linux-kernel, linux-fsdevel,
	catherine.hoang, martin.petersen, John Garry

John Garry <john.g.garry@oracle.com> writes:

> From: Dave Chinner <dchinner@redhat.com>
>
> Currently the allocation at EOF is broken into two cases - when the
> offset is zero and when the offset is non-zero. When the offset is
> non-zero, we try to do exact block allocation for contiguous
> extent allocation. When the offset is zero, the allocation is simply
> an aligned allocation.
>
> We want aligned allocation as the fallback when exact block
> allocation fails, but that complicates the EOF allocation in that it
> now has to handle two different allocation cases. The
> caller also has to handle allocation when not at EOF, and for the
> upcoming forced alignment changes we need that to also be aligned
> allocation.
>
> To simplify all this, pull the aligned allocation cases back into
> the callers and leave the EOF allocation path for exact block
> allocation only. This means that the EOF exact block allocation
> fallback path is the normal aligned allocation path and that ends up
> making things a lot simpler when forced alignment is introduced.
>
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
> Signed-off-by: John Garry <john.g.garry@oracle.com>
> ---
>  fs/xfs/libxfs/xfs_bmap.c   | 129 +++++++++++++++----------------------
>  fs/xfs/libxfs/xfs_ialloc.c |   2 +-
>  2 files changed, 54 insertions(+), 77 deletions(-)
>
<..>

> diff --git a/fs/xfs/libxfs/xfs_ialloc.c b/fs/xfs/libxfs/xfs_ialloc.c
> index 2fa29d2f004e..c5d220d51757 100644
> --- a/fs/xfs/libxfs/xfs_ialloc.c
> +++ b/fs/xfs/libxfs/xfs_ialloc.c
> @@ -780,7 +780,7 @@ xfs_ialloc_ag_alloc(
>  		 * the exact agbno requirement and increase the alignment
>  		 * instead. It is critical that the total size of the request
>  		 * (len + alignment + slop) does not increase from this point
> -		 * on, so reset minalignslop to ensure it is not included in
> +		 * on, so reset alignslop to ensure it is not included in
>  		 * subsequent requests.
>  		 */
>  		args.alignslop = 0;

minor comment: Looks like this diff got leftover from previous patch
where we cleanup minalignslop/alignslop.

-ritesh

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v4 04/14] xfs: make EOF allocation simpler
  2024-09-04 18:25   ` Ritesh Harjani
@ 2024-09-05  7:51     ` John Garry
  0 siblings, 0 replies; 58+ messages in thread
From: John Garry @ 2024-09-05  7:51 UTC (permalink / raw)
  To: Ritesh Harjani (IBM), chandan.babu, djwong, dchinner, hch
  Cc: viro, brauner, jack, linux-xfs, linux-kernel, linux-fsdevel,
	catherine.hoang, martin.petersen

On 04/09/2024 19:25, Ritesh Harjani (IBM) wrote:
>> diff --git a/fs/xfs/libxfs/xfs_ialloc.c b/fs/xfs/libxfs/xfs_ialloc.c
>> index 2fa29d2f004e..c5d220d51757 100644
>> --- a/fs/xfs/libxfs/xfs_ialloc.c
>> +++ b/fs/xfs/libxfs/xfs_ialloc.c
>> @@ -780,7 +780,7 @@ xfs_ialloc_ag_alloc(
>>   		 * the exact agbno requirement and increase the alignment
>>   		 * instead. It is critical that the total size of the request
>>   		 * (len + alignment + slop) does not increase from this point
>> -		 * on, so reset minalignslop to ensure it is not included in
>> +		 * on, so reset alignslop to ensure it is not included in
>>   		 * subsequent requests.
>>   		 */
>>   		args.alignslop = 0;
> minor comment: Looks like this diff got leftover from previous patch
> where we cleanup minalignslop/alignslop.

Right, that comment modification belongs in the previous patch - I will 
relocate.

Thanks,
John

^ permalink raw reply	[flat|nested] 58+ messages in thread

* [PATCH v4 05/14] xfs: introduce forced allocation alignment
  2024-08-13 16:36 [PATCH v4 00/14] forcealign for xfs John Garry
                   ` (3 preceding siblings ...)
  2024-08-13 16:36 ` [PATCH v4 04/14] xfs: make EOF allocation simpler John Garry
@ 2024-08-13 16:36 ` John Garry
  2024-08-13 16:36 ` [PATCH v4 06/14] xfs: align args->minlen for " John Garry
                   ` (9 subsequent siblings)
  14 siblings, 0 replies; 58+ messages in thread
From: John Garry @ 2024-08-13 16:36 UTC (permalink / raw)
  To: chandan.babu, djwong, dchinner, hch
  Cc: viro, brauner, jack, linux-xfs, linux-kernel, linux-fsdevel,
	catherine.hoang, martin.petersen, John Garry

From: Dave Chinner <dchinner@redhat.com>

When forced allocation alignment is specified, the extent will
be aligned to the extent size hint size rather than stripe
alignment. If aligned allocation cannot be done, then the allocation
is failed rather than attempting non-aligned fallbacks.

Note: none of the per-inode force align configuration is present
yet, so this just triggers off an "always false" wrapper function
for the moment.

[jpg: Set XFS_ALLOC_FORCEALIGN in xfs_bmap_alloc_userdata(), rather than
      xfs_bmap_compute_alignments()]

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: John Garry <john.g.garry@oracle.com>
---
 fs/xfs/libxfs/xfs_alloc.h |  1 +
 fs/xfs/libxfs/xfs_bmap.c  | 28 +++++++++++++++++++++++-----
 fs/xfs/xfs_inode.h        |  5 +++++
 3 files changed, 29 insertions(+), 5 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_alloc.h b/fs/xfs/libxfs/xfs_alloc.h
index 473822a5d4e9..17b062e8b925 100644
--- a/fs/xfs/libxfs/xfs_alloc.h
+++ b/fs/xfs/libxfs/xfs_alloc.h
@@ -66,6 +66,7 @@ typedef struct xfs_alloc_arg {
 #define XFS_ALLOC_USERDATA		(1 << 0)/* allocation is for user data*/
 #define XFS_ALLOC_INITIAL_USER_DATA	(1 << 1)/* special case start of file */
 #define XFS_ALLOC_NOBUSY		(1 << 2)/* Busy extents not allowed */
+#define XFS_ALLOC_FORCEALIGN		(1 << 3)/* forced extent alignment */
 
 /* freespace limit calculations */
 unsigned int xfs_alloc_set_aside(struct xfs_mount *mp);
diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
index 42a75d257b35..602a5a50bcca 100644
--- a/fs/xfs/libxfs/xfs_bmap.c
+++ b/fs/xfs/libxfs/xfs_bmap.c
@@ -3401,9 +3401,10 @@ xfs_bmap_alloc_account(
  * Calculate the extent start alignment and the extent length adjustments that
  * constrain this allocation.
  *
- * Extent start alignment is currently determined by stripe configuration and is
- * carried in args->alignment, whilst extent length adjustment is determined by
- * extent size hints and is carried by args->prod and args->mod.
+ * Extent start alignment is currently determined by forced inode alignment or
+ * stripe configuration and is carried in args->alignment, whilst extent length
+ * adjustment is determined by extent size hints and is carried by args->prod
+ * and args->mod.
  *
  * Low level allocation code is free to either ignore or override these values
  * as required.
@@ -3416,8 +3417,13 @@ xfs_bmap_compute_alignments(
 	struct xfs_mount	*mp = args->mp;
 	xfs_extlen_t		align = 0; /* minimum allocation alignment */
 
-	/* stripe alignment for allocation is determined by mount parameters */
-	if (mp->m_swidth && xfs_has_swalloc(mp))
+	/*
+	 * Forced inode alignment takes preference over stripe alignment.
+	 * Stripe alignment for allocation is determined by mount parameters.
+	 */
+	if (xfs_inode_has_forcealign(ap->ip))
+		args->alignment = xfs_get_extsz_hint(ap->ip);
+	else if (mp->m_swidth && xfs_has_swalloc(mp))
 		args->alignment = mp->m_swidth;
 	else if (mp->m_dalign)
 		args->alignment = mp->m_dalign;
@@ -3607,6 +3613,11 @@ xfs_bmap_btalloc_low_space(
 {
 	int			error;
 
+	if (args->alignment > 1 && (args->datatype & XFS_ALLOC_FORCEALIGN)) {
+		args->fsbno = NULLFSBLOCK;
+		return 0;
+	}
+
 	args->alignment = 1;
 	if (args->minlen > ap->minlen) {
 		args->minlen = ap->minlen;
@@ -3658,6 +3669,8 @@ xfs_bmap_btalloc_filestreams(
 
 	/* Attempt non-aligned allocation if we haven't already. */
 	if (!error && args->fsbno == NULLFSBLOCK && args->alignment > 1)  {
+		if (args->datatype & XFS_ALLOC_FORCEALIGN)
+			return error;
 		args->alignment = 1;
 		error = xfs_alloc_vextent_near_bno(args, ap->blkno);
 	}
@@ -3716,6 +3729,8 @@ xfs_bmap_btalloc_best_length(
 
 	/* Attempt non-aligned allocation if we haven't already. */
 	if (!error && args->fsbno == NULLFSBLOCK && args->alignment > 1)  {
+		if (args->datatype & XFS_ALLOC_FORCEALIGN)
+			return error;
 		args->alignment = 1;
 		error = xfs_alloc_vextent_start_ag(args, ap->blkno);
 	}
@@ -4151,6 +4166,9 @@ xfs_bmap_alloc_userdata(
 		if (bma->offset == 0)
 			bma->datatype |= XFS_ALLOC_INITIAL_USER_DATA;
 
+		if (xfs_inode_has_forcealign(bma->ip))
+			bma->datatype |= XFS_ALLOC_FORCEALIGN;
+
 		if (mp->m_dalign && bma->length >= mp->m_dalign) {
 			error = xfs_bmap_isaeof(bma, whichfork);
 			if (error)
diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
index 51defdebef30..bf0f4f8b9e64 100644
--- a/fs/xfs/xfs_inode.h
+++ b/fs/xfs/xfs_inode.h
@@ -310,6 +310,11 @@ static inline bool xfs_inode_has_large_extent_counts(struct xfs_inode *ip)
 	return ip->i_diflags2 & XFS_DIFLAG2_NREXT64;
 }
 
+static inline bool xfs_inode_has_forcealign(struct xfs_inode *ip)
+{
+	return false;
+}
+
 /*
  * Decide if this file is a realtime file whose data allocation unit is larger
  * than a single filesystem block.
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [PATCH v4 06/14] xfs: align args->minlen for forced allocation alignment
  2024-08-13 16:36 [PATCH v4 00/14] forcealign for xfs John Garry
                   ` (4 preceding siblings ...)
  2024-08-13 16:36 ` [PATCH v4 05/14] xfs: introduce forced allocation alignment John Garry
@ 2024-08-13 16:36 ` John Garry
  2024-08-13 16:36 ` [PATCH v4 07/14] xfs: Introduce FORCEALIGN inode flag John Garry
                   ` (8 subsequent siblings)
  14 siblings, 0 replies; 58+ messages in thread
From: John Garry @ 2024-08-13 16:36 UTC (permalink / raw)
  To: chandan.babu, djwong, dchinner, hch
  Cc: viro, brauner, jack, linux-xfs, linux-kernel, linux-fsdevel,
	catherine.hoang, martin.petersen, John Garry

From: Dave Chinner <dchinner@redhat.com>

If args->minlen is not aligned to the constraints of forced
alignment, we may do minlen allocations that are not aligned when we
approach ENOSPC. Avoid this by always aligning args->minlen
appropriately. If alignment of minlen results in a value smaller
than the alignment constraint, fail the allocation immediately.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: John Garry <john.g.garry@oracle.com>
---
 fs/xfs/libxfs/xfs_bmap.c | 44 ++++++++++++++++++++++++++++------------
 1 file changed, 31 insertions(+), 13 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
index 602a5a50bcca..0c3df8c71c6d 100644
--- a/fs/xfs/libxfs/xfs_bmap.c
+++ b/fs/xfs/libxfs/xfs_bmap.c
@@ -3279,32 +3279,48 @@ xfs_bmap_longest_free_extent(
 	return 0;
 }
 
-static xfs_extlen_t
+static int
 xfs_bmap_select_minlen(
 	struct xfs_bmalloca	*ap,
 	struct xfs_alloc_arg	*args,
 	xfs_extlen_t		blen)
 {
-
 	/* Adjust best length for extent start alignment. */
 	if (blen > args->alignment)
 		blen -= args->alignment;
 
 	/*
 	 * Since we used XFS_ALLOC_FLAG_TRYLOCK in _longest_free_extent(), it is
-	 * possible that there is enough contiguous free space for this request.
+	 * possible that there is enough contiguous free space for this request
+	 * even if best length is less that the minimum length we need.
+	 *
+	 * If the best length won't satisfy the maximum length we requested,
+	 * then use it as the minimum length so we get as large an allocation
+	 * as possible.
 	 */
 	if (blen < ap->minlen)
-		return ap->minlen;
+		blen = ap->minlen;
+	else if (blen > args->maxlen)
+		blen = args->maxlen;
 
 	/*
-	 * If the best seen length is less than the request length,
-	 * use the best as the minimum, otherwise we've got the maxlen we
-	 * were asked for.
+	 * If we have alignment constraints, round the minlen down to match the
+	 * constraint so that alignment will be attempted. This may reduce the
+	 * allocation to smaller than was requested, so clamp the minimum to
+	 * ap->minlen to allow unaligned allocation to succeed. If we are forced
+	 * to align the allocation, return ENOSPC at this point because we don't
+	 * have enough contiguous free space to guarantee aligned allocation.
 	 */
-	if (blen < args->maxlen)
-		return blen;
-	return args->maxlen;
+	if (args->alignment > 1) {
+		blen = rounddown(blen, args->alignment);
+		if (blen < ap->minlen) {
+			if (args->datatype & XFS_ALLOC_FORCEALIGN)
+				return -ENOSPC;
+			blen = ap->minlen;
+		}
+	}
+	args->minlen = blen;
+	return 0;
 }
 
 static int
@@ -3340,8 +3356,7 @@ xfs_bmap_btalloc_select_lengths(
 	if (pag)
 		xfs_perag_rele(pag);
 
-	args->minlen = xfs_bmap_select_minlen(ap, args, blen);
-	return error;
+	return xfs_bmap_select_minlen(ap, args, blen);
 }
 
 /* Update all inode and quota accounting for the allocation we just did. */
@@ -3659,7 +3674,10 @@ xfs_bmap_btalloc_filestreams(
 		goto out_low_space;
 	}
 
-	args->minlen = xfs_bmap_select_minlen(ap, args, blen);
+	error = xfs_bmap_select_minlen(ap, args, blen);
+	if (error)
+		goto out_low_space;
+
 	if (ap->aeof && ap->offset)
 		error = xfs_bmap_btalloc_at_eof(ap, args);
 
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [PATCH v4 07/14] xfs: Introduce FORCEALIGN inode flag
  2024-08-13 16:36 [PATCH v4 00/14] forcealign for xfs John Garry
                   ` (5 preceding siblings ...)
  2024-08-13 16:36 ` [PATCH v4 06/14] xfs: align args->minlen for " John Garry
@ 2024-08-13 16:36 ` John Garry
  2024-08-13 16:36 ` [PATCH v4 08/14] xfs: Update xfs_inode_alloc_unitsize() for forcealign John Garry
                   ` (7 subsequent siblings)
  14 siblings, 0 replies; 58+ messages in thread
From: John Garry @ 2024-08-13 16:36 UTC (permalink / raw)
  To: chandan.babu, djwong, dchinner, hch
  Cc: viro, brauner, jack, linux-xfs, linux-kernel, linux-fsdevel,
	catherine.hoang, martin.petersen, John Garry

From: "Darrick J. Wong" <djwong@kernel.org>

Add a new inode flag to require that all file data extent mappings must
be aligned (both the file offset range and the allocated space itself)
to the extent size hint.  Having a separate COW extent size hint is no
longer allowed.

The goal here is to enable sysadmins and users to mandate that all space
mappings in a file must have a startoff/blockcount that are aligned to
(say) a 2MB alignment and that the startblock/blockcount will follow the
same alignment.

Allocated space will be aligned to start of the AG, and not necessarily
aligned with disk blocks. The upcoming atomic writes feature will rely and
forcealign and will also require allocated space will also be aligned to
disk blocks.

reflink will not be supported for forcealign yet, so disallow a mount under
this condition. This is because we have the limitation of pageache
writeback not knowing how to writeback an entire allocation unut, so
reject a mount with relink.

RT vol will not be supported for forcealign yet, so disallow a mount under
this condition. It will be possible to support RT vol and forcealign in
future. For this, the inode extsize must be a multiple of rtextsize - this
is enforced already in xfs_ioctl_setattr_check_extsize() and
xfs_inode_validate_extsize().

[jpg: many changes from orig, including forcealign inode verification
 rework, ioctl setattr rework disallow reflink a forcealign inode,
 disallow mount for forcealign + reflink or rt]

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Co-developed-by: John Garry <john.g.garry@oracle.com>
Reviewed-by: John Garry <john.g.garry@oracle.com>
Signed-off-by: John Garry <john.g.garry@oracle.com>
---
 fs/xfs/libxfs/xfs_format.h     |  6 ++++-
 fs/xfs/libxfs/xfs_inode_buf.c  | 46 ++++++++++++++++++++++++++++++++++
 fs/xfs/libxfs/xfs_inode_buf.h  |  3 +++
 fs/xfs/libxfs/xfs_inode_util.c | 14 +++++++++++
 fs/xfs/libxfs/xfs_sb.c         |  2 ++
 fs/xfs/xfs_inode.h             |  8 +++++-
 fs/xfs/xfs_ioctl.c             | 46 ++++++++++++++++++++++++++++++++--
 fs/xfs/xfs_mount.h             |  2 ++
 fs/xfs/xfs_reflink.c           |  5 ++--
 fs/xfs/xfs_super.c             | 18 +++++++++++++
 include/uapi/linux/fs.h        |  2 ++
 11 files changed, 146 insertions(+), 6 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_format.h b/fs/xfs/libxfs/xfs_format.h
index e1bfee0c3b1a..95f5259c4255 100644
--- a/fs/xfs/libxfs/xfs_format.h
+++ b/fs/xfs/libxfs/xfs_format.h
@@ -352,6 +352,7 @@ xfs_sb_has_compat_feature(
 #define XFS_SB_FEAT_RO_COMPAT_RMAPBT   (1 << 1)		/* reverse map btree */
 #define XFS_SB_FEAT_RO_COMPAT_REFLINK  (1 << 2)		/* reflinked files */
 #define XFS_SB_FEAT_RO_COMPAT_INOBTCNT (1 << 3)		/* inobt block counts */
+#define XFS_SB_FEAT_RO_COMPAT_FORCEALIGN (1 << 30)	/* aligned file data extents */
 #define XFS_SB_FEAT_RO_COMPAT_ALL \
 		(XFS_SB_FEAT_RO_COMPAT_FINOBT | \
 		 XFS_SB_FEAT_RO_COMPAT_RMAPBT | \
@@ -1093,16 +1094,19 @@ static inline void xfs_dinode_put_rdev(struct xfs_dinode *dip, xfs_dev_t rdev)
 #define XFS_DIFLAG2_COWEXTSIZE_BIT   2  /* copy on write extent size hint */
 #define XFS_DIFLAG2_BIGTIME_BIT	3	/* big timestamps */
 #define XFS_DIFLAG2_NREXT64_BIT 4	/* large extent counters */
+/* data extent mappings for regular files must be aligned to extent size hint */
+#define XFS_DIFLAG2_FORCEALIGN_BIT 5
 
 #define XFS_DIFLAG2_DAX		(1 << XFS_DIFLAG2_DAX_BIT)
 #define XFS_DIFLAG2_REFLINK     (1 << XFS_DIFLAG2_REFLINK_BIT)
 #define XFS_DIFLAG2_COWEXTSIZE  (1 << XFS_DIFLAG2_COWEXTSIZE_BIT)
 #define XFS_DIFLAG2_BIGTIME	(1 << XFS_DIFLAG2_BIGTIME_BIT)
 #define XFS_DIFLAG2_NREXT64	(1 << XFS_DIFLAG2_NREXT64_BIT)
+#define XFS_DIFLAG2_FORCEALIGN	(1 << XFS_DIFLAG2_FORCEALIGN_BIT)
 
 #define XFS_DIFLAG2_ANY \
 	(XFS_DIFLAG2_DAX | XFS_DIFLAG2_REFLINK | XFS_DIFLAG2_COWEXTSIZE | \
-	 XFS_DIFLAG2_BIGTIME | XFS_DIFLAG2_NREXT64)
+	 XFS_DIFLAG2_BIGTIME | XFS_DIFLAG2_NREXT64 | XFS_DIFLAG2_FORCEALIGN)
 
 static inline bool xfs_dinode_has_bigtime(const struct xfs_dinode *dip)
 {
diff --git a/fs/xfs/libxfs/xfs_inode_buf.c b/fs/xfs/libxfs/xfs_inode_buf.c
index 513b50da6215..1c59891fa9e2 100644
--- a/fs/xfs/libxfs/xfs_inode_buf.c
+++ b/fs/xfs/libxfs/xfs_inode_buf.c
@@ -657,6 +657,15 @@ xfs_dinode_verify(
 	    !xfs_has_bigtime(mp))
 		return __this_address;
 
+	if (flags2 & XFS_DIFLAG2_FORCEALIGN) {
+		fa = xfs_inode_validate_forcealign(mp,
+				be32_to_cpu(dip->di_extsize),
+				be32_to_cpu(dip->di_cowextsize),
+				mode, flags, flags2);
+		if (fa)
+			return fa;
+	}
+
 	return NULL;
 }
 
@@ -824,3 +833,40 @@ xfs_inode_validate_cowextsize(
 
 	return NULL;
 }
+
+/* Validate the forcealign inode flag */
+xfs_failaddr_t
+xfs_inode_validate_forcealign(
+	struct xfs_mount	*mp,
+	uint32_t		extsize,
+	uint32_t		cowextsize,
+	uint16_t		mode,
+	uint16_t		flags,
+	uint64_t		flags2)
+{
+	/* superblock rocompat feature flag */
+	if (!xfs_has_forcealign(mp))
+		return __this_address;
+
+	/* Only regular files and directories */
+	if (!S_ISDIR(mode) && !S_ISREG(mode))
+		return __this_address;
+
+	/* We require EXTSIZE or EXTSZINHERIT */
+	if (!(flags & (XFS_DIFLAG_EXTSIZE | XFS_DIFLAG_EXTSZINHERIT)))
+		return __this_address;
+
+	/* We require a non-zero extsize */
+	if (!extsize)
+		return __this_address;
+
+	/* COW extsize disallowed */
+	if (flags2 & XFS_DIFLAG2_COWEXTSIZE)
+		return __this_address;
+
+	/* cowextsize must be zero */
+	if (cowextsize)
+		return __this_address;
+
+	return NULL;
+}
diff --git a/fs/xfs/libxfs/xfs_inode_buf.h b/fs/xfs/libxfs/xfs_inode_buf.h
index 585ed5a110af..b8b65287b037 100644
--- a/fs/xfs/libxfs/xfs_inode_buf.h
+++ b/fs/xfs/libxfs/xfs_inode_buf.h
@@ -33,6 +33,9 @@ xfs_failaddr_t xfs_inode_validate_extsize(struct xfs_mount *mp,
 xfs_failaddr_t xfs_inode_validate_cowextsize(struct xfs_mount *mp,
 		uint32_t cowextsize, uint16_t mode, uint16_t flags,
 		uint64_t flags2);
+xfs_failaddr_t xfs_inode_validate_forcealign(struct xfs_mount *mp,
+		uint32_t extsize, uint32_t cowextsize, uint16_t mode,
+		uint16_t flags, uint64_t flags2);
 
 static inline uint64_t xfs_inode_encode_bigtime(struct timespec64 tv)
 {
diff --git a/fs/xfs/libxfs/xfs_inode_util.c b/fs/xfs/libxfs/xfs_inode_util.c
index 032333289113..b264939d8855 100644
--- a/fs/xfs/libxfs/xfs_inode_util.c
+++ b/fs/xfs/libxfs/xfs_inode_util.c
@@ -80,6 +80,8 @@ xfs_flags2diflags2(
 		di_flags2 |= XFS_DIFLAG2_DAX;
 	if (xflags & FS_XFLAG_COWEXTSIZE)
 		di_flags2 |= XFS_DIFLAG2_COWEXTSIZE;
+	if (xflags & FS_XFLAG_FORCEALIGN)
+		di_flags2 |= XFS_DIFLAG2_FORCEALIGN;
 
 	return di_flags2;
 }
@@ -126,6 +128,8 @@ xfs_ip2xflags(
 			flags |= FS_XFLAG_DAX;
 		if (ip->i_diflags2 & XFS_DIFLAG2_COWEXTSIZE)
 			flags |= FS_XFLAG_COWEXTSIZE;
+		if (ip->i_diflags2 & XFS_DIFLAG2_FORCEALIGN)
+			flags |= FS_XFLAG_FORCEALIGN;
 	}
 
 	if (xfs_inode_has_attr_fork(ip))
@@ -224,6 +228,8 @@ xfs_inode_inherit_flags2(
 	}
 	if (pip->i_diflags2 & XFS_DIFLAG2_DAX)
 		ip->i_diflags2 |= XFS_DIFLAG2_DAX;
+	if (pip->i_diflags2 & XFS_DIFLAG2_FORCEALIGN)
+		ip->i_diflags2 |= XFS_DIFLAG2_FORCEALIGN;
 
 	/* Don't let invalid cowextsize hints propagate. */
 	failaddr = xfs_inode_validate_cowextsize(ip->i_mount, ip->i_cowextsize,
@@ -232,6 +238,14 @@ xfs_inode_inherit_flags2(
 		ip->i_diflags2 &= ~XFS_DIFLAG2_COWEXTSIZE;
 		ip->i_cowextsize = 0;
 	}
+	if (ip->i_diflags2 & XFS_DIFLAG2_FORCEALIGN) {
+		failaddr = xfs_inode_validate_forcealign(ip->i_mount,
+				ip->i_extsize, ip->i_cowextsize,
+				VFS_I(ip)->i_mode, ip->i_diflags,
+				ip->i_diflags2);
+		if (failaddr)
+			ip->i_diflags2 &= ~XFS_DIFLAG2_FORCEALIGN;
+	}
 }
 
 /*
diff --git a/fs/xfs/libxfs/xfs_sb.c b/fs/xfs/libxfs/xfs_sb.c
index 6b56f0f6d4c1..e56911553edd 100644
--- a/fs/xfs/libxfs/xfs_sb.c
+++ b/fs/xfs/libxfs/xfs_sb.c
@@ -164,6 +164,8 @@ xfs_sb_version_to_features(
 		features |= XFS_FEAT_REFLINK;
 	if (sbp->sb_features_ro_compat & XFS_SB_FEAT_RO_COMPAT_INOBTCNT)
 		features |= XFS_FEAT_INOBTCNT;
+	if (sbp->sb_features_ro_compat & XFS_SB_FEAT_RO_COMPAT_FORCEALIGN)
+		features |= XFS_FEAT_FORCEALIGN;
 	if (sbp->sb_features_incompat & XFS_SB_FEAT_INCOMPAT_FTYPE)
 		features |= XFS_FEAT_FTYPE;
 	if (sbp->sb_features_incompat & XFS_SB_FEAT_INCOMPAT_SPINODES)
diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
index bf0f4f8b9e64..3e7664ec4d6c 100644
--- a/fs/xfs/xfs_inode.h
+++ b/fs/xfs/xfs_inode.h
@@ -312,7 +312,13 @@ static inline bool xfs_inode_has_large_extent_counts(struct xfs_inode *ip)
 
 static inline bool xfs_inode_has_forcealign(struct xfs_inode *ip)
 {
-	return false;
+	if (!(ip->i_diflags & XFS_DIFLAG_EXTSIZE))
+		return false;
+	if (ip->i_extsize <= 1)
+		return false;
+	if (xfs_is_cow_inode(ip))
+		return false;
+	return ip->i_diflags2 & XFS_DIFLAG2_FORCEALIGN;
 }
 
 /*
diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c
index 4e933db75b12..7a6757a4d2bd 100644
--- a/fs/xfs/xfs_ioctl.c
+++ b/fs/xfs/xfs_ioctl.c
@@ -469,6 +469,39 @@ xfs_fileattr_get(
 	return 0;
 }
 
+/*
+ * Forcealign requires a non-zero extent size hint and a zero cow
+ * extent size hint.
+ */
+static int
+xfs_ioctl_setattr_forcealign(
+	struct xfs_inode	*ip,
+	struct fileattr		*fa)
+{
+	struct xfs_mount	*mp = ip->i_mount;
+
+	if (!xfs_has_forcealign(mp))
+		return -EINVAL;
+
+	if (xfs_is_reflink_inode(ip))
+		return -EINVAL;
+
+	if (!(fa->fsx_xflags & (FS_XFLAG_EXTSIZE |
+				FS_XFLAG_EXTSZINHERIT)))
+		return -EINVAL;
+
+	if (fa->fsx_xflags & FS_XFLAG_COWEXTSIZE)
+		return -EINVAL;
+
+	if (!fa->fsx_extsize)
+		return -EINVAL;
+
+	if (fa->fsx_cowextsize)
+		return -EINVAL;
+
+	return 0;
+}
+
 static int
 xfs_ioctl_setattr_xflags(
 	struct xfs_trans	*tp,
@@ -477,10 +510,13 @@ xfs_ioctl_setattr_xflags(
 {
 	struct xfs_mount	*mp = ip->i_mount;
 	bool			rtflag = (fa->fsx_xflags & FS_XFLAG_REALTIME);
+	bool			forcealign = fa->fsx_xflags & FS_XFLAG_FORCEALIGN;
 	uint64_t		i_flags2;
+	int			error;
 
-	if (rtflag != XFS_IS_REALTIME_INODE(ip)) {
-		/* Can't change realtime flag if any extents are allocated. */
+	/* Can't change RT or forcealign flags if any extents are allocated. */
+	if (rtflag != XFS_IS_REALTIME_INODE(ip) ||
+	    forcealign != xfs_inode_has_forcealign(ip)) {
 		if (ip->i_df.if_nextents || ip->i_delayed_blks)
 			return -EINVAL;
 	}
@@ -501,6 +537,12 @@ xfs_ioctl_setattr_xflags(
 	if (i_flags2 && !xfs_has_v3inodes(mp))
 		return -EINVAL;
 
+	if (forcealign) {
+		error = xfs_ioctl_setattr_forcealign(ip, fa);
+		if (error)
+			return error;
+	}
+
 	ip->i_diflags = xfs_flags2diflags(ip, fa->fsx_xflags);
 	ip->i_diflags2 = i_flags2;
 
diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
index d0567dfbc036..30228fea908d 100644
--- a/fs/xfs/xfs_mount.h
+++ b/fs/xfs/xfs_mount.h
@@ -299,6 +299,7 @@ typedef struct xfs_mount {
 #define XFS_FEAT_NEEDSREPAIR	(1ULL << 25)	/* needs xfs_repair */
 #define XFS_FEAT_NREXT64	(1ULL << 26)	/* large extent counters */
 #define XFS_FEAT_EXCHANGE_RANGE	(1ULL << 27)	/* exchange range */
+#define XFS_FEAT_FORCEALIGN	(1ULL << 28)	/* aligned file data extents */
 
 /* Mount features */
 #define XFS_FEAT_NOATTR2	(1ULL << 48)	/* disable attr2 creation */
@@ -385,6 +386,7 @@ __XFS_ADD_V4_FEAT(projid32, PROJID32)
 __XFS_HAS_V4_FEAT(v3inodes, V3INODES)
 __XFS_HAS_V4_FEAT(crc, CRC)
 __XFS_HAS_V4_FEAT(pquotino, PQUOTINO)
+__XFS_HAS_FEAT(forcealign, FORCEALIGN)
 
 /*
  * Mount features
diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
index 6fde6ec8092f..a836bfec7878 100644
--- a/fs/xfs/xfs_reflink.c
+++ b/fs/xfs/xfs_reflink.c
@@ -1467,8 +1467,9 @@ xfs_reflink_remap_prep(
 
 	/* Check file eligibility and prepare for block sharing. */
 	ret = -EINVAL;
-	/* Don't reflink realtime inodes */
-	if (XFS_IS_REALTIME_INODE(src) || XFS_IS_REALTIME_INODE(dest))
+	/* Don't reflink realtime or forcealign inodes */
+	if (XFS_IS_REALTIME_INODE(src) || XFS_IS_REALTIME_INODE(dest) ||
+	    xfs_inode_has_forcealign(src) || xfs_inode_has_forcealign(dest))
 		goto out_unlock;
 
 	/* Don't share DAX file data with non-DAX file. */
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index 27e9f749c4c7..b52a01b50387 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -1729,12 +1729,30 @@ xfs_fs_fill_super(
 			goto out_filestream_unmount;
 		}
 
+		if (xfs_has_forcealign(mp)) {
+			xfs_alert(mp,
+	"reflink not compatible with forcealign!");
+			error = -EINVAL;
+			goto out_filestream_unmount;
+		}
+
 		if (xfs_globals.always_cow) {
 			xfs_info(mp, "using DEBUG-only always_cow mode.");
 			mp->m_always_cow = true;
 		}
 	}
 
+	if (xfs_has_forcealign(mp)) {
+		if (xfs_has_realtime(mp)) {
+			xfs_alert(mp,
+		"forcealign not supported for realtime device!");
+			error = -EINVAL;
+			goto out_filestream_unmount;
+		}
+		xfs_warn(mp,
+"EXPERIMENTAL forced data extent alignment feature in use. Use at your own risk!");
+	}
+
 	if (xfs_has_rmapbt(mp) && mp->m_sb.sb_rblocks) {
 		xfs_alert(mp,
 	"reverse mapping btree not compatible with realtime device!");
diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
index 753971770733..f55d650f904a 100644
--- a/include/uapi/linux/fs.h
+++ b/include/uapi/linux/fs.h
@@ -158,6 +158,8 @@ struct fsxattr {
 #define FS_XFLAG_FILESTREAM	0x00004000	/* use filestream allocator */
 #define FS_XFLAG_DAX		0x00008000	/* use DAX for IO */
 #define FS_XFLAG_COWEXTSIZE	0x00010000	/* CoW extent size allocator hint */
+/* data extent mappings for regular files must be aligned to extent size hint */
+#define FS_XFLAG_FORCEALIGN	0x00020000
 #define FS_XFLAG_HASATTR	0x80000000	/* no DIFLAG for this	*/
 
 /* the read-only stuff doesn't really belong here, but any other place is
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [PATCH v4 08/14] xfs: Update xfs_inode_alloc_unitsize() for forcealign
  2024-08-13 16:36 [PATCH v4 00/14] forcealign for xfs John Garry
                   ` (6 preceding siblings ...)
  2024-08-13 16:36 ` [PATCH v4 07/14] xfs: Introduce FORCEALIGN inode flag John Garry
@ 2024-08-13 16:36 ` John Garry
  2024-08-13 16:36 ` [PATCH v4 09/14] xfs: Update xfs_setattr_size() " John Garry
                   ` (6 subsequent siblings)
  14 siblings, 0 replies; 58+ messages in thread
From: John Garry @ 2024-08-13 16:36 UTC (permalink / raw)
  To: chandan.babu, djwong, dchinner, hch
  Cc: viro, brauner, jack, linux-xfs, linux-kernel, linux-fsdevel,
	catherine.hoang, martin.petersen, John Garry

For forcealign enabled, the allocation unit size is in ip->i_extsize, and
this must never be zero.

Add helper xfs_inode_alloc_fsbsize() to return alloc unit in FSBs, and use
it in xfs_inode_alloc_unitsize().

Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: John Garry <john.g.garry@oracle.com>
---
 fs/xfs/xfs_inode.c | 17 +++++++++++++----
 fs/xfs/xfs_inode.h |  1 +
 2 files changed, 14 insertions(+), 4 deletions(-)

diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index 7dc6f326936c..5af12f35062d 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -3107,17 +3107,26 @@ xfs_break_layouts(
 	return error;
 }
 
-/* Returns the size of fundamental allocation unit for a file, in bytes. */
 unsigned int
-xfs_inode_alloc_unitsize(
+xfs_inode_alloc_fsbsize(
 	struct xfs_inode	*ip)
 {
 	unsigned int		blocks = 1;
 
-	if (XFS_IS_REALTIME_INODE(ip))
+	if (xfs_inode_has_forcealign(ip))
+		blocks = ip->i_extsize;
+	else if (XFS_IS_REALTIME_INODE(ip))
 		blocks = ip->i_mount->m_sb.sb_rextsize;
 
-	return XFS_FSB_TO_B(ip->i_mount, blocks);
+	return blocks;
+}
+
+/* Returns the size of fundamental allocation unit for a file, in bytes. */
+unsigned int
+xfs_inode_alloc_unitsize(
+	struct xfs_inode	*ip)
+{
+	return XFS_FSB_TO_B(ip->i_mount, xfs_inode_alloc_fsbsize(ip));
 }
 
 /* Should we always be using copy on write for file writes? */
diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
index 3e7664ec4d6c..158afad8c7a4 100644
--- a/fs/xfs/xfs_inode.h
+++ b/fs/xfs/xfs_inode.h
@@ -641,6 +641,7 @@ int xfs_inode_reload_unlinked(struct xfs_inode *ip);
 bool xfs_ifork_zapped(const struct xfs_inode *ip, int whichfork);
 void xfs_inode_count_blocks(struct xfs_trans *tp, struct xfs_inode *ip,
 		xfs_filblks_t *dblocks, xfs_filblks_t *rblocks);
+unsigned int xfs_inode_alloc_fsbsize(struct xfs_inode *ip);
 unsigned int xfs_inode_alloc_unitsize(struct xfs_inode *ip);
 
 int xfs_icreate_dqalloc(const struct xfs_icreate_args *args,
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [PATCH v4 09/14] xfs: Update xfs_setattr_size() for forcealign
  2024-08-13 16:36 [PATCH v4 00/14] forcealign for xfs John Garry
                   ` (7 preceding siblings ...)
  2024-08-13 16:36 ` [PATCH v4 08/14] xfs: Update xfs_inode_alloc_unitsize() for forcealign John Garry
@ 2024-08-13 16:36 ` John Garry
  2024-08-13 16:36 ` [PATCH v4 10/14] xfs: Do not free EOF blocks " John Garry
                   ` (5 subsequent siblings)
  14 siblings, 0 replies; 58+ messages in thread
From: John Garry @ 2024-08-13 16:36 UTC (permalink / raw)
  To: chandan.babu, djwong, dchinner, hch
  Cc: viro, brauner, jack, linux-xfs, linux-kernel, linux-fsdevel,
	catherine.hoang, martin.petersen, John Garry

For when an inode has forcealign, reserve blocks for same reason which we
were doing for big RT alloc.

Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: John Garry <john.g.garry@oracle.com>
---
 fs/xfs/xfs_iops.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
index 1cdc8034f54d..6e017aa6f61d 100644
--- a/fs/xfs/xfs_iops.c
+++ b/fs/xfs/xfs_iops.c
@@ -926,12 +926,12 @@ xfs_setattr_size(
 	}
 
 	/*
-	 * For realtime inode with more than one block rtextsize, we need the
+	 * For inodes with more than one block alloc unitsize, we need the
 	 * block reservation for bmap btree block allocations/splits that can
 	 * happen since it could split the tail written extent and convert the
 	 * right beyond EOF one to unwritten.
 	 */
-	if (xfs_inode_has_bigrtalloc(ip))
+	if (xfs_inode_alloc_fsbsize(ip) > 1)
 		resblks = XFS_DIOSTRAT_SPACE_RES(mp, 0);
 
 	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_itruncate, resblks,
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [PATCH v4 10/14] xfs: Do not free EOF blocks for forcealign
  2024-08-13 16:36 [PATCH v4 00/14] forcealign for xfs John Garry
                   ` (8 preceding siblings ...)
  2024-08-13 16:36 ` [PATCH v4 09/14] xfs: Update xfs_setattr_size() " John Garry
@ 2024-08-13 16:36 ` John Garry
  2024-08-13 16:36 ` [PATCH v4 11/14] xfs: Only free full extents " John Garry
                   ` (4 subsequent siblings)
  14 siblings, 0 replies; 58+ messages in thread
From: John Garry @ 2024-08-13 16:36 UTC (permalink / raw)
  To: chandan.babu, djwong, dchinner, hch
  Cc: viro, brauner, jack, linux-xfs, linux-kernel, linux-fsdevel,
	catherine.hoang, martin.petersen, John Garry

For when forcealign is enabled, we want the EOF to be aligned as well, so
do not free EOF blocks.

Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> #earlier version
Signed-off-by: John Garry <john.g.garry@oracle.com>
---
 fs/xfs/xfs_bmap_util.c |  6 +++---
 fs/xfs/xfs_inode.c     | 12 ++++++++++++
 fs/xfs/xfs_inode.h     |  2 ++
 3 files changed, 17 insertions(+), 3 deletions(-)

diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
index fe2e2c930975..7a51859eaf84 100644
--- a/fs/xfs/xfs_bmap_util.c
+++ b/fs/xfs/xfs_bmap_util.c
@@ -536,9 +536,9 @@ xfs_can_free_eofblocks(
 	 * range supported by the page cache, because the truncation will loop
 	 * forever.
 	 */
-	end_fsb = XFS_B_TO_FSB(mp, (xfs_ufsize_t)XFS_ISIZE(ip));
-	if (xfs_inode_has_bigrtalloc(ip))
-		end_fsb = xfs_rtb_roundup_rtx(mp, end_fsb);
+	end_fsb = xfs_inode_roundup_alloc_unit(ip,
+			XFS_B_TO_FSB(mp, (xfs_ufsize_t)XFS_ISIZE(ip)));
+
 	last_fsb = XFS_B_TO_FSB(mp, mp->m_super->s_maxbytes);
 	if (last_fsb <= end_fsb)
 		return false;
diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index 5af12f35062d..94ab3f4d6cef 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -3129,6 +3129,18 @@ xfs_inode_alloc_unitsize(
 	return XFS_FSB_TO_B(ip->i_mount, xfs_inode_alloc_fsbsize(ip));
 }
 
+xfs_fileoff_t
+xfs_inode_roundup_alloc_unit(
+	struct xfs_inode	*ip,
+	xfs_fileoff_t		offset)
+{
+	unsigned int		rounding = xfs_inode_alloc_fsbsize(ip);
+
+	if (rounding == 1)
+		return offset;
+	return roundup_64(offset, rounding);
+}
+
 /* Should we always be using copy on write for file writes? */
 bool
 xfs_is_always_cow_inode(
diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
index 158afad8c7a4..71acddb8061d 100644
--- a/fs/xfs/xfs_inode.h
+++ b/fs/xfs/xfs_inode.h
@@ -643,6 +643,8 @@ void xfs_inode_count_blocks(struct xfs_trans *tp, struct xfs_inode *ip,
 		xfs_filblks_t *dblocks, xfs_filblks_t *rblocks);
 unsigned int xfs_inode_alloc_fsbsize(struct xfs_inode *ip);
 unsigned int xfs_inode_alloc_unitsize(struct xfs_inode *ip);
+xfs_fileoff_t xfs_inode_roundup_alloc_unit(struct xfs_inode *ip,
+		xfs_fileoff_t offset);
 
 int xfs_icreate_dqalloc(const struct xfs_icreate_args *args,
 		struct xfs_dquot **udqpp, struct xfs_dquot **gdqpp,
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [PATCH v4 11/14] xfs: Only free full extents for forcealign
  2024-08-13 16:36 [PATCH v4 00/14] forcealign for xfs John Garry
                   ` (9 preceding siblings ...)
  2024-08-13 16:36 ` [PATCH v4 10/14] xfs: Do not free EOF blocks " John Garry
@ 2024-08-13 16:36 ` John Garry
  2024-08-13 16:36 ` [PATCH v4 12/14] xfs: Unmap blocks according to forcealign John Garry
                   ` (3 subsequent siblings)
  14 siblings, 0 replies; 58+ messages in thread
From: John Garry @ 2024-08-13 16:36 UTC (permalink / raw)
  To: chandan.babu, djwong, dchinner, hch
  Cc: viro, brauner, jack, linux-xfs, linux-kernel, linux-fsdevel,
	catherine.hoang, martin.petersen, John Garry

Like we already do for rtvol, only free full extents for forcealign in
xfs_free_file_space().

Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> #earlier version
Signed-off-by: John Garry <john.g.garry@oracle.com>
---
 fs/xfs/xfs_bmap_util.c |  8 +++-----
 fs/xfs/xfs_inode.c     | 12 ++++++++++++
 fs/xfs/xfs_inode.h     |  2 ++
 3 files changed, 17 insertions(+), 5 deletions(-)

diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
index 7a51859eaf84..317ce8947e8d 100644
--- a/fs/xfs/xfs_bmap_util.c
+++ b/fs/xfs/xfs_bmap_util.c
@@ -851,11 +851,9 @@ xfs_free_file_space(
 	startoffset_fsb = XFS_B_TO_FSB(mp, offset);
 	endoffset_fsb = XFS_B_TO_FSBT(mp, offset + len);
 
-	/* We can only free complete realtime extents. */
-	if (xfs_inode_has_bigrtalloc(ip)) {
-		startoffset_fsb = xfs_rtb_roundup_rtx(mp, startoffset_fsb);
-		endoffset_fsb = xfs_rtb_rounddown_rtx(mp, endoffset_fsb);
-	}
+	/* Free only complete extents. */
+	startoffset_fsb = xfs_inode_roundup_alloc_unit(ip, startoffset_fsb);
+	endoffset_fsb = xfs_inode_rounddown_alloc_unit(ip, endoffset_fsb);
 
 	/*
 	 * Need to zero the stuff we're not freeing, on disk.
diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index 94ab3f4d6cef..73562c6f9a1d 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -3141,6 +3141,18 @@ xfs_inode_roundup_alloc_unit(
 	return roundup_64(offset, rounding);
 }
 
+xfs_fileoff_t
+xfs_inode_rounddown_alloc_unit(
+	struct xfs_inode	*ip,
+	xfs_fileoff_t		offset)
+{
+	unsigned int		rounding = xfs_inode_alloc_fsbsize(ip);
+
+	if (rounding == 1)
+		return offset;
+	return rounddown_64(offset, rounding);
+}
+
 /* Should we always be using copy on write for file writes? */
 bool
 xfs_is_always_cow_inode(
diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
index 71acddb8061d..336124105c47 100644
--- a/fs/xfs/xfs_inode.h
+++ b/fs/xfs/xfs_inode.h
@@ -645,6 +645,8 @@ unsigned int xfs_inode_alloc_fsbsize(struct xfs_inode *ip);
 unsigned int xfs_inode_alloc_unitsize(struct xfs_inode *ip);
 xfs_fileoff_t xfs_inode_roundup_alloc_unit(struct xfs_inode *ip,
 		xfs_fileoff_t offset);
+xfs_fileoff_t xfs_inode_rounddown_alloc_unit(struct xfs_inode *ip,
+		xfs_fileoff_t offset);
 
 int xfs_icreate_dqalloc(const struct xfs_icreate_args *args,
 		struct xfs_dquot **udqpp, struct xfs_dquot **gdqpp,
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [PATCH v4 12/14] xfs: Unmap blocks according to forcealign
  2024-08-13 16:36 [PATCH v4 00/14] forcealign for xfs John Garry
                   ` (10 preceding siblings ...)
  2024-08-13 16:36 ` [PATCH v4 11/14] xfs: Only free full extents " John Garry
@ 2024-08-13 16:36 ` John Garry
  2024-08-23 16:35   ` Darrick J. Wong
  2024-08-13 16:36 ` [PATCH v4 13/14] xfs: Don't revert allocated offset for forcealign John Garry
                   ` (2 subsequent siblings)
  14 siblings, 1 reply; 58+ messages in thread
From: John Garry @ 2024-08-13 16:36 UTC (permalink / raw)
  To: chandan.babu, djwong, dchinner, hch
  Cc: viro, brauner, jack, linux-xfs, linux-kernel, linux-fsdevel,
	catherine.hoang, martin.petersen, John Garry

For when forcealign is enabled, blocks in an inode need to be unmapped
according to extent alignment, like what is already done for rtvol.

Generalize the code by replacing variable isrt with a value to hold the
FSB alloc size for the inode, which works for forcealign.

Signed-off-by: John Garry <john.g.garry@oracle.com>
---
 fs/xfs/libxfs/xfs_bmap.c | 46 ++++++++++++++++++++++++++++------------
 1 file changed, 32 insertions(+), 14 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
index 0c3df8c71c6d..3ab2cecf09d2 100644
--- a/fs/xfs/libxfs/xfs_bmap.c
+++ b/fs/xfs/libxfs/xfs_bmap.c
@@ -5409,6 +5409,25 @@ xfs_bmap_del_extent_real(
 	return 0;
 }
 
+static xfs_extlen_t
+xfs_bmap_alloc_unit_offset(
+	struct xfs_inode	*ip,
+	unsigned int		alloc_fsb,
+	xfs_fsblock_t		fsbno)
+{
+	xfs_agblock_t		agbno;
+
+	if (XFS_IS_REALTIME_INODE(ip))
+		return do_div(fsbno, alloc_fsb);
+	/*
+	 * The agbno for the fsbno is aligned to extsize, but the fsbno itself
+	 * is not necessarily aligned (to extsize), so use agbno to determine
+	 * mod to the alloc unit boundary.
+	 */
+	agbno = XFS_FSB_TO_AGBNO(ip->i_mount, fsbno);
+	return agbno % alloc_fsb;
+}
+
 /*
  * Unmap (remove) blocks from a file.
  * If nexts is nonzero then the number of extents to remove is limited to
@@ -5430,7 +5449,6 @@ __xfs_bunmapi(
 	xfs_extnum_t		extno;		/* extent number in list */
 	struct xfs_bmbt_irec	got;		/* current extent record */
 	struct xfs_ifork	*ifp;		/* inode fork pointer */
-	int			isrt;		/* freeing in rt area */
 	int			logflags;	/* transaction logging flags */
 	xfs_extlen_t		mod;		/* rt extent offset */
 	struct xfs_mount	*mp = ip->i_mount;
@@ -5441,6 +5459,7 @@ __xfs_bunmapi(
 	xfs_fileoff_t		end;
 	struct xfs_iext_cursor	icur;
 	bool			done = false;
+	unsigned int		alloc_fsb = xfs_inode_alloc_fsbsize(ip);
 
 	trace_xfs_bunmap(ip, start, len, flags, _RET_IP_);
 
@@ -5467,7 +5486,6 @@ __xfs_bunmapi(
 		return 0;
 	}
 	XFS_STATS_INC(mp, xs_blk_unmap);
-	isrt = xfs_ifork_is_realtime(ip, whichfork);
 	end = start + len;
 
 	if (!xfs_iext_lookup_extent_before(ip, ifp, &end, &icur, &got)) {
@@ -5519,18 +5537,18 @@ __xfs_bunmapi(
 		if (del.br_startoff + del.br_blockcount > end + 1)
 			del.br_blockcount = end + 1 - del.br_startoff;
 
-		if (!isrt || (flags & XFS_BMAPI_REMAP))
+		if (alloc_fsb == 1 || (flags & XFS_BMAPI_REMAP))
 			goto delete;
 
-		mod = xfs_rtb_to_rtxoff(mp,
+		mod = xfs_bmap_alloc_unit_offset(ip, alloc_fsb,
 				del.br_startblock + del.br_blockcount);
 		if (mod) {
 			/*
-			 * Realtime extent not lined up at the end.
+			 * Not aligned to allocation unit on the end.
 			 * The extent could have been split into written
 			 * and unwritten pieces, or we could just be
 			 * unmapping part of it.  But we can't really
-			 * get rid of part of a realtime extent.
+			 * get rid of part of an extent.
 			 */
 			if (del.br_state == XFS_EXT_UNWRITTEN) {
 				/*
@@ -5554,8 +5572,8 @@ __xfs_bunmapi(
 			ASSERT(del.br_state == XFS_EXT_NORM);
 			ASSERT(tp->t_blk_res > 0);
 			/*
-			 * If this spans a realtime extent boundary,
-			 * chop it back to the start of the one we end at.
+			 * If this spans an extent boundary, chop it back to
+			 * the start of the one we end at.
 			 */
 			if (del.br_blockcount > mod) {
 				del.br_startoff += del.br_blockcount - mod;
@@ -5571,14 +5589,14 @@ __xfs_bunmapi(
 			goto nodelete;
 		}
 
-		mod = xfs_rtb_to_rtxoff(mp, del.br_startblock);
+		mod = xfs_bmap_alloc_unit_offset(ip, alloc_fsb,
+					del.br_startblock);
 		if (mod) {
-			xfs_extlen_t off = mp->m_sb.sb_rextsize - mod;
-
+			xfs_extlen_t off = alloc_fsb - mod;
 			/*
-			 * Realtime extent is lined up at the end but not
-			 * at the front.  We'll get rid of full extents if
-			 * we can.
+			 * Extent is lined up to the allocation unit at the
+			 * end but not at the front.  We'll get rid of full
+			 * extents if we can.
 			 */
 			if (del.br_blockcount > off) {
 				del.br_blockcount -= off;
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 58+ messages in thread

* Re: [PATCH v4 12/14] xfs: Unmap blocks according to forcealign
  2024-08-13 16:36 ` [PATCH v4 12/14] xfs: Unmap blocks according to forcealign John Garry
@ 2024-08-23 16:35   ` Darrick J. Wong
  0 siblings, 0 replies; 58+ messages in thread
From: Darrick J. Wong @ 2024-08-23 16:35 UTC (permalink / raw)
  To: John Garry
  Cc: chandan.babu, dchinner, hch, viro, brauner, jack, linux-xfs,
	linux-kernel, linux-fsdevel, catherine.hoang, martin.petersen

On Tue, Aug 13, 2024 at 04:36:36PM +0000, John Garry wrote:
> For when forcealign is enabled, blocks in an inode need to be unmapped
> according to extent alignment, like what is already done for rtvol.
> 
> Generalize the code by replacing variable isrt with a value to hold the
> FSB alloc size for the inode, which works for forcealign.
> 
> Signed-off-by: John Garry <john.g.garry@oracle.com>

Looks good to me now,
Reviewed-by: Darrick J. Wong <djwong@kernel.org>

--D

> ---
>  fs/xfs/libxfs/xfs_bmap.c | 46 ++++++++++++++++++++++++++++------------
>  1 file changed, 32 insertions(+), 14 deletions(-)
> 
> diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
> index 0c3df8c71c6d..3ab2cecf09d2 100644
> --- a/fs/xfs/libxfs/xfs_bmap.c
> +++ b/fs/xfs/libxfs/xfs_bmap.c
> @@ -5409,6 +5409,25 @@ xfs_bmap_del_extent_real(
>  	return 0;
>  }
>  
> +static xfs_extlen_t
> +xfs_bmap_alloc_unit_offset(
> +	struct xfs_inode	*ip,
> +	unsigned int		alloc_fsb,
> +	xfs_fsblock_t		fsbno)
> +{
> +	xfs_agblock_t		agbno;
> +
> +	if (XFS_IS_REALTIME_INODE(ip))
> +		return do_div(fsbno, alloc_fsb);
> +	/*
> +	 * The agbno for the fsbno is aligned to extsize, but the fsbno itself
> +	 * is not necessarily aligned (to extsize), so use agbno to determine
> +	 * mod to the alloc unit boundary.
> +	 */
> +	agbno = XFS_FSB_TO_AGBNO(ip->i_mount, fsbno);
> +	return agbno % alloc_fsb;
> +}
> +
>  /*
>   * Unmap (remove) blocks from a file.
>   * If nexts is nonzero then the number of extents to remove is limited to
> @@ -5430,7 +5449,6 @@ __xfs_bunmapi(
>  	xfs_extnum_t		extno;		/* extent number in list */
>  	struct xfs_bmbt_irec	got;		/* current extent record */
>  	struct xfs_ifork	*ifp;		/* inode fork pointer */
> -	int			isrt;		/* freeing in rt area */
>  	int			logflags;	/* transaction logging flags */
>  	xfs_extlen_t		mod;		/* rt extent offset */
>  	struct xfs_mount	*mp = ip->i_mount;
> @@ -5441,6 +5459,7 @@ __xfs_bunmapi(
>  	xfs_fileoff_t		end;
>  	struct xfs_iext_cursor	icur;
>  	bool			done = false;
> +	unsigned int		alloc_fsb = xfs_inode_alloc_fsbsize(ip);
>  
>  	trace_xfs_bunmap(ip, start, len, flags, _RET_IP_);
>  
> @@ -5467,7 +5486,6 @@ __xfs_bunmapi(
>  		return 0;
>  	}
>  	XFS_STATS_INC(mp, xs_blk_unmap);
> -	isrt = xfs_ifork_is_realtime(ip, whichfork);
>  	end = start + len;
>  
>  	if (!xfs_iext_lookup_extent_before(ip, ifp, &end, &icur, &got)) {
> @@ -5519,18 +5537,18 @@ __xfs_bunmapi(
>  		if (del.br_startoff + del.br_blockcount > end + 1)
>  			del.br_blockcount = end + 1 - del.br_startoff;
>  
> -		if (!isrt || (flags & XFS_BMAPI_REMAP))
> +		if (alloc_fsb == 1 || (flags & XFS_BMAPI_REMAP))
>  			goto delete;
>  
> -		mod = xfs_rtb_to_rtxoff(mp,
> +		mod = xfs_bmap_alloc_unit_offset(ip, alloc_fsb,
>  				del.br_startblock + del.br_blockcount);
>  		if (mod) {
>  			/*
> -			 * Realtime extent not lined up at the end.
> +			 * Not aligned to allocation unit on the end.
>  			 * The extent could have been split into written
>  			 * and unwritten pieces, or we could just be
>  			 * unmapping part of it.  But we can't really
> -			 * get rid of part of a realtime extent.
> +			 * get rid of part of an extent.
>  			 */
>  			if (del.br_state == XFS_EXT_UNWRITTEN) {
>  				/*
> @@ -5554,8 +5572,8 @@ __xfs_bunmapi(
>  			ASSERT(del.br_state == XFS_EXT_NORM);
>  			ASSERT(tp->t_blk_res > 0);
>  			/*
> -			 * If this spans a realtime extent boundary,
> -			 * chop it back to the start of the one we end at.
> +			 * If this spans an extent boundary, chop it back to
> +			 * the start of the one we end at.
>  			 */
>  			if (del.br_blockcount > mod) {
>  				del.br_startoff += del.br_blockcount - mod;
> @@ -5571,14 +5589,14 @@ __xfs_bunmapi(
>  			goto nodelete;
>  		}
>  
> -		mod = xfs_rtb_to_rtxoff(mp, del.br_startblock);
> +		mod = xfs_bmap_alloc_unit_offset(ip, alloc_fsb,
> +					del.br_startblock);
>  		if (mod) {
> -			xfs_extlen_t off = mp->m_sb.sb_rextsize - mod;
> -
> +			xfs_extlen_t off = alloc_fsb - mod;
>  			/*
> -			 * Realtime extent is lined up at the end but not
> -			 * at the front.  We'll get rid of full extents if
> -			 * we can.
> +			 * Extent is lined up to the allocation unit at the
> +			 * end but not at the front.  We'll get rid of full
> +			 * extents if we can.
>  			 */
>  			if (del.br_blockcount > off) {
>  				del.br_blockcount -= off;
> -- 
> 2.31.1
> 
> 

^ permalink raw reply	[flat|nested] 58+ messages in thread

* [PATCH v4 13/14] xfs: Don't revert allocated offset for forcealign
  2024-08-13 16:36 [PATCH v4 00/14] forcealign for xfs John Garry
                   ` (11 preceding siblings ...)
  2024-08-13 16:36 ` [PATCH v4 12/14] xfs: Unmap blocks according to forcealign John Garry
@ 2024-08-13 16:36 ` John Garry
  2024-08-13 16:36 ` [PATCH v4 14/14] xfs: Enable file data forcealign feature John Garry
  2024-09-04 18:14 ` [PATCH v4 00/14] forcealign for xfs Ritesh Harjani
  14 siblings, 0 replies; 58+ messages in thread
From: John Garry @ 2024-08-13 16:36 UTC (permalink / raw)
  To: chandan.babu, djwong, dchinner, hch
  Cc: viro, brauner, jack, linux-xfs, linux-kernel, linux-fsdevel,
	catherine.hoang, martin.petersen, John Garry

In xfs_bmap_process_allocated_extent(), for when we found that we could not
provide the requested length completely, the mapping is moved so that we
can provide as much as possible for the original request.

For forcealign, this would mean ignoring alignment guaranteed, so don't do
this.

Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: John Garry <john.g.garry@oracle.com>
---
 fs/xfs/libxfs/xfs_bmap.c | 12 ++++++++----
 1 file changed, 8 insertions(+), 4 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
index 3ab2cecf09d2..a9e98f83e57e 100644
--- a/fs/xfs/libxfs/xfs_bmap.c
+++ b/fs/xfs/libxfs/xfs_bmap.c
@@ -3491,11 +3491,15 @@ xfs_bmap_process_allocated_extent(
 	 * original request as possible.  Free space is apparently
 	 * very fragmented so we're unlikely to be able to satisfy the
 	 * hints anyway.
+	 * However, for an inode with forcealign, continue with the
+	 * found offset as we need to honour the alignment hint.
 	 */
-	if (ap->length <= orig_length)
-		ap->offset = orig_offset;
-	else if (ap->offset + ap->length < orig_offset + orig_length)
-		ap->offset = orig_offset + orig_length - ap->length;
+	if (!xfs_inode_has_forcealign(ap->ip)) {
+		if (ap->length <= orig_length)
+			ap->offset = orig_offset;
+		else if (ap->offset + ap->length < orig_offset + orig_length)
+			ap->offset = orig_offset + orig_length - ap->length;
+	}
 	xfs_bmap_alloc_account(ap);
 }
 
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [PATCH v4 14/14] xfs: Enable file data forcealign feature
  2024-08-13 16:36 [PATCH v4 00/14] forcealign for xfs John Garry
                   ` (12 preceding siblings ...)
  2024-08-13 16:36 ` [PATCH v4 13/14] xfs: Don't revert allocated offset for forcealign John Garry
@ 2024-08-13 16:36 ` John Garry
  2024-09-04 18:14 ` [PATCH v4 00/14] forcealign for xfs Ritesh Harjani
  14 siblings, 0 replies; 58+ messages in thread
From: John Garry @ 2024-08-13 16:36 UTC (permalink / raw)
  To: chandan.babu, djwong, dchinner, hch
  Cc: viro, brauner, jack, linux-xfs, linux-kernel, linux-fsdevel,
	catherine.hoang, martin.petersen, John Garry

From: "Darrick J. Wong" <djwong@kernel.org>

Enable this feature.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: John Garry <john.g.garry@oracle.com>
Signed-off-by: John Garry <john.g.garry@oracle.com>
---
 fs/xfs/libxfs/xfs_format.h | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/xfs/libxfs/xfs_format.h b/fs/xfs/libxfs/xfs_format.h
index 95f5259c4255..04c6cbc943c2 100644
--- a/fs/xfs/libxfs/xfs_format.h
+++ b/fs/xfs/libxfs/xfs_format.h
@@ -357,7 +357,8 @@ xfs_sb_has_compat_feature(
 		(XFS_SB_FEAT_RO_COMPAT_FINOBT | \
 		 XFS_SB_FEAT_RO_COMPAT_RMAPBT | \
 		 XFS_SB_FEAT_RO_COMPAT_REFLINK| \
-		 XFS_SB_FEAT_RO_COMPAT_INOBTCNT)
+		 XFS_SB_FEAT_RO_COMPAT_INOBTCNT | \
+		 XFS_SB_FEAT_RO_COMPAT_FORCEALIGN)
 #define XFS_SB_FEAT_RO_COMPAT_UNKNOWN	~XFS_SB_FEAT_RO_COMPAT_ALL
 static inline bool
 xfs_sb_has_ro_compat_feature(
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 58+ messages in thread

* Re: [PATCH v4 00/14] forcealign for xfs
  2024-08-13 16:36 [PATCH v4 00/14] forcealign for xfs John Garry
                   ` (13 preceding siblings ...)
  2024-08-13 16:36 ` [PATCH v4 14/14] xfs: Enable file data forcealign feature John Garry
@ 2024-09-04 18:14 ` Ritesh Harjani
  2024-09-04 23:20   ` Dave Chinner
  14 siblings, 1 reply; 58+ messages in thread
From: Ritesh Harjani @ 2024-09-04 18:14 UTC (permalink / raw)
  To: John Garry, chandan.babu, djwong, dchinner, hch
  Cc: viro, brauner, jack, linux-xfs, linux-kernel, linux-fsdevel,
	catherine.hoang, martin.petersen, John Garry

John Garry <john.g.garry@oracle.com> writes:

> This series is being spun off the block atomic writes for xfs series at
> [0].
>
> That series got too big.
>
> The actual forcealign patches are roughly the same in this series.
>
> Why forcealign?
> In some scenarios to may be required to guarantee extent alignment and
> granularity.
>
> For example, for atomic writes, the maximum atomic write unit size would
> be limited at the extent alignment and granularity, guaranteeing that an
> atomic write would not span data present in multiple extents.
>
> forcealign may be useful as a performance tuning optimization in other
> scenarios.
>
> I decided not to support forcealign for RT devices here. Initially I
> thought that it would be quite simple of implement. However, I discovered
> through much testing and subsequent debug that this was not true, so I
> decided to defer support to later.
>
> Early development xfsprogs support is at:
> https://github.com/johnpgarry/xfsprogs-dev/commits/atomic-writes/
>

Hi John,

Thanks for your continued work on atomic write.
I went over the XFS patch series and this is my understanding + some queries. Could you please help with these.

1. As I understand XFS untorn atomic write support is built on top of FORCEALIGN feature (which this series is adding) which in turn uses extsize hint feature underneath.
   Now extsize hint mainly controls the alignment of both "physical start" & "logical start" offset and extent length, correct?
   This is done using args->alignment for start aand args->prod/mode variables for extent length. Correct?

   - If say we are not able to allocate an aligned physical start? Then since extsize is just a hint we go ahead with whatever best available extent is right?
   - also extsize looks to be only providing allocation side of hints. (not de-allocation). Correct?

2. If say there is an append write i.e. the allocation is needed to be done at EOF. Then we try for an exact bno (from eof block) and aligned extent length, right?
   i.e. xfs_bmap_btalloc_filestreams() -> xfs_bmap_btalloc_at_eof(ap, args);
   If it is not available then we try for nearby bno xfs_alloc_vextent_near_bno(args, target) and similar...

3. It is the FORCEALIGN feature which _mandates_ both allocation (by using extsize hint) and de-allocation to happen _only_ in extsize chunks.
   i.e. forcealign mandates -
   - the logical and physical start offset should be aligned as per args->alignment
   - extent length be aligned as per args->prod/mod.
     If above two cannot be satisfied then return -ENOSPC.

   - Does the unmapping of extents also only happens in extsize chunks (with forcealign)?
     If the start or end of the extent which needs unmapping is unaligned then we convert that extent to unwritten and skip, is it? (__xfs_bunmapi())
     This is a bit unclear to me. Maybe I need to look more deeper into the __xfs_bunmapi() while loop.

My knowledge about this is still limited so please ignore any silly questions.

-ritesh

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v4 00/14] forcealign for xfs
  2024-09-04 18:14 ` [PATCH v4 00/14] forcealign for xfs Ritesh Harjani
@ 2024-09-04 23:20   ` Dave Chinner
  2024-09-05  3:56     ` Ritesh Harjani
  2024-09-05 10:15     ` John Garry
  0 siblings, 2 replies; 58+ messages in thread
From: Dave Chinner @ 2024-09-04 23:20 UTC (permalink / raw)
  To: Ritesh Harjani
  Cc: John Garry, chandan.babu, djwong, dchinner, hch, viro, brauner,
	jack, linux-xfs, linux-kernel, linux-fsdevel, catherine.hoang,
	martin.petersen

On Wed, Sep 04, 2024 at 11:44:29PM +0530, Ritesh Harjani wrote:
> John Garry <john.g.garry@oracle.com> writes:
> 
> > This series is being spun off the block atomic writes for xfs
> > series at [0].
> >
> > That series got too big.
> >
> > The actual forcealign patches are roughly the same in this
> > series.
> >
> > Why forcealign?  In some scenarios to may be required to
> > guarantee extent alignment and granularity.
> >
> > For example, for atomic writes, the maximum atomic write unit
> > size would be limited at the extent alignment and granularity,
> > guaranteeing that an atomic write would not span data present in
> > multiple extents.
> >
> > forcealign may be useful as a performance tuning optimization in
> > other scenarios.
> >
> > I decided not to support forcealign for RT devices here.
> > Initially I thought that it would be quite simple of implement.
> > However, I discovered through much testing and subsequent debug
> > that this was not true, so I decided to defer support to
> > later.
> >
> > Early development xfsprogs support is at:
> > https://github.com/johnpgarry/xfsprogs-dev/commits/atomic-writes/
> >
> 
> Hi John,
> 
> Thanks for your continued work on atomic write.  I went over the
> XFS patch series and this is my understanding + some queries.
> Could you please help with these.

Hi Ritesh - to make it easier for everyone to read and reply to you
emails, can you please word wrap the text at 72 columns?

> 1. As I understand XFS untorn atomic write support is built on top
> of FORCEALIGN feature (which this series is adding) which in turn
> uses extsize hint feature underneath.

Yes.

>    Now extsize hint mainly controls the alignment of both
>    "physical start" & "logical start" offset and extent length,
>    correct?

Yes.

>    This is done using args->alignment for start aand
>    args->prod/mode variables for extent length. Correct?

Yes.

>    - If say we are not able to allocate an aligned physical start?
>    Then since extsize is just a hint we go ahead with whatever
>    best available extent is right?

No. The definition of "forced alignment" is that we guarantee
aligned allocation to the extent size hint. i.e the extent size hint
is not a hint anymore - it defines the alignment we are guaranteeing
allocation will achieve.

hence if we can't align the extent to the alignment provided, we
fail the alignment.

>    - also extsize looks to be only providing allocation side of hints. (not de-allocation). Correct?

No. See the use of xfs_inode_alloc_unitsize() in all the places
where we free space. Forced alignment extends this function to
return the extent size, not the block size.

> 2. If say there is an append write i.e. the allocation is needed
> to be done at EOF. Then we try for an exact bno (from eof block)
> and aligned extent length, right?

Yes. This works because the previous extent is exactly aligned,
hence a contiguous allocation will continue to be correctly aligned
due to the forced alignment constraints.

>    i.e. xfs_bmap_btalloc_filestreams() ->
>    xfs_bmap_btalloc_at_eof(ap, args); If it is not available then
>    we try for nearby bno xfs_alloc_vextent_near_bno(args, target)
>    and similar...

yes, that's just the normal aligned allocation fallback path when
exact allocation fails.

> 3. It is the FORCEALIGN feature which _mandates_ both allocation
> (by using extsize hint) and de-allocation to happen _only_ in
> extsize chunks.
>
>    i.e. forcealign mandates -
>    - the logical and physical start offset should be aligned as
>    per args->alignment
>    - extent length be aligned as per args->prod/mod.
>      If above two cannot be satisfied then return -ENOSPC.

Yes.

> 
>    - Does the unmapping of extents also only happens in extsize
>    chunks (with forcealign)?

Yes, via use of xfs_inode_alloc_unitsize() in the high level code
aligning the fsbno ranges to be unmapped.

Remember, force align requires both logical file offset and
physical block number to be correctly aligned, so unmap alignment
has to be set up correctly at file offset level before we even know
what extents underly the file range we need to unmap....

>      If the start or end of the extent which needs unmapping is
>      unaligned then we convert that extent to unwritten and skip,
>      is it? (__xfs_bunmapi())

The high level code should be aligning the start and end of the
file range to be removed via xfs_inode_alloc_unitsize(). Hence 
the low level __xfs_bunmapi() code shouldn't ever be encountering
unaligned unmaps on force-aligned inodes.

-Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v4 00/14] forcealign for xfs
  2024-09-04 23:20   ` Dave Chinner
@ 2024-09-05  3:56     ` Ritesh Harjani
  2024-09-05  6:33       ` Dave Chinner
  2024-09-05 10:15     ` John Garry
  1 sibling, 1 reply; 58+ messages in thread
From: Ritesh Harjani @ 2024-09-05  3:56 UTC (permalink / raw)
  To: Dave Chinner
  Cc: John Garry, chandan.babu, djwong, dchinner, hch, viro, brauner,
	jack, linux-xfs, linux-kernel, linux-fsdevel, catherine.hoang,
	martin.petersen


Thanks Dave for quick response. 

Dave Chinner <david@fromorbit.com> writes:

> On Wed, Sep 04, 2024 at 11:44:29PM +0530, Ritesh Harjani wrote:
>> John Garry <john.g.garry@oracle.com> writes:
>> 
>> > This series is being spun off the block atomic writes for xfs
>> > series at [0].
>> >
>> > That series got too big.
>> >
>> > The actual forcealign patches are roughly the same in this
>> > series.
>> >
>> > Why forcealign?  In some scenarios to may be required to
>> > guarantee extent alignment and granularity.
>> >
>> > For example, for atomic writes, the maximum atomic write unit
>> > size would be limited at the extent alignment and granularity,
>> > guaranteeing that an atomic write would not span data present in
>> > multiple extents.
>> >
>> > forcealign may be useful as a performance tuning optimization in
>> > other scenarios.
>> >
>> > I decided not to support forcealign for RT devices here.
>> > Initially I thought that it would be quite simple of implement.
>> > However, I discovered through much testing and subsequent debug
>> > that this was not true, so I decided to defer support to
>> > later.
>> >
>> > Early development xfsprogs support is at:
>> > https://github.com/johnpgarry/xfsprogs-dev/commits/atomic-writes/
>> >
>> 
>> Hi John,
>> 
>> Thanks for your continued work on atomic write.  I went over the
>> XFS patch series and this is my understanding + some queries.
>> Could you please help with these.
>
> Hi Ritesh - to make it easier for everyone to read and reply to you
> emails, can you please word wrap the text at 72 columns?
>

argh! Sorry about that. I had formed my queries in a separate notes
application and copy pasted it here. Hopefully this time it will be ok.

>> 1. As I understand XFS untorn atomic write support is built on top
>> of FORCEALIGN feature (which this series is adding) which in turn
>> uses extsize hint feature underneath.
>
> Yes.
>
>>    Now extsize hint mainly controls the alignment of both
>>    "physical start" & "logical start" offset and extent length,
>>    correct?
>
> Yes.
>
>>    This is done using args->alignment for start aand
>>    args->prod/mode variables for extent length. Correct?
>
> Yes.
>
>>    - If say we are not able to allocate an aligned physical start?
>>    Then since extsize is just a hint we go ahead with whatever
>>    best available extent is right?
>
> No. The definition of "forced alignment" is that we guarantee
> aligned allocation to the extent size hint. i.e the extent size hint
> is not a hint anymore - it defines the alignment we are guaranteeing
> allocation will achieve.
>
> hence if we can't align the extent to the alignment provided, we
> fail the alignment.
>
>>    - also extsize looks to be only providing allocation side of hints. (not de-allocation). Correct?
>
> No. See the use of xfs_inode_alloc_unitsize() in all the places
> where we free space. Forced alignment extends this function to
> return the extent size, not the block size.
>

Sorry for not being explicit. For queries in point 1. above, I was
referring to extent size hint feature w/o FORCEALIGN. But I got the
gist from your response. Thanks!

>> 2. If say there is an append write i.e. the allocation is needed
>> to be done at EOF. Then we try for an exact bno (from eof block)
>> and aligned extent length, right?
>
> Yes. This works because the previous extent is exactly aligned,
> hence a contiguous allocation will continue to be correctly aligned
> due to the forced alignment constraints.
>
>>    i.e. xfs_bmap_btalloc_filestreams() ->
>>    xfs_bmap_btalloc_at_eof(ap, args); If it is not available then
>>    we try for nearby bno xfs_alloc_vextent_near_bno(args, target)
>>    and similar...
>
> yes, that's just the normal aligned allocation fallback path when
> exact allocation fails.
>
>> 3. It is the FORCEALIGN feature which _mandates_ both allocation
>> (by using extsize hint) and de-allocation to happen _only_ in
>> extsize chunks.
>>
>>    i.e. forcealign mandates -
>>    - the logical and physical start offset should be aligned as
>>    per args->alignment
>>    - extent length be aligned as per args->prod/mod.
>>      If above two cannot be satisfied then return -ENOSPC.
>
> Yes.
>
>> 
>>    - Does the unmapping of extents also only happens in extsize
>>    chunks (with forcealign)?
>
> Yes, via use of xfs_inode_alloc_unitsize() in the high level code
> aligning the fsbno ranges to be unmapped.
>
> Remember, force align requires both logical file offset and
> physical block number to be correctly aligned,

This is where I would like to double confirm it again. Even the
extsize hint feature (w/o FORCEALIGN) will try to allocate aligned
physical start and logical start file offset and length right?

(Or does extsize hint only restricts alignment to logical start file
offset + length and not the physical start?)


Also it looks like there is no difference with ATOMIC_WRITE AND
FORCEALIGN feature with XFS, correct? (except that ATOMIC_WRITE is
adding additional natural alignment restrictions on pos and len). 
So why maintain 2 separate on disk inode flags for FORCEALIGN AND
ATOMIC_WRITE?
- Do you foresee FORCEALIGN to be also used at other places w/o
ATOMIC_WRITE where feature differentiation between the two on an
inode is required?

- Does the same reasoning will hold for XFS_SB_FEAT_RO_COMPAT_FORCEALIGN
& XFS_SB_FEAT_RO_COMPAT_ATOMICWRITES too?

- But why ro_compact for ATOMICWRITES? There aren't any on disk metadata
changes within XFS filesystem to support atomic writes, right? 
Is it something to just prevent users from destroying their own data
by not allowing a rw mount from an older kernel where users could do
unaligned writes to files marked for atomic writes?
Or is there any other reasoning to prevent XFS filesystem from becoming
inconsistent if an older kernel does a rw mount here.



> so unmap alignment
> has to be set up correctly at file offset level before we even know
> what extents underly the file range we need to unmap....
>
>>      If the start or end of the extent which needs unmapping is
>>      unaligned then we convert that extent to unwritten and skip,
>>      is it? (__xfs_bunmapi())
>
> The high level code should be aligning the start and end of the
> file range to be removed via xfs_inode_alloc_unitsize(). Hence 
> the low level __xfs_bunmapi() code shouldn't ever be encountering
> unaligned unmaps on force-aligned inodes.
>

Yes, but isn't this code snippet trying to handle a case when it finds an
unaligned extent during unmap? And what we are essentially trying to 
do here is leave the unwritten extent as is and if the found extent is
written then convert to unwritten and skip it (goto nodelete). This
means with forcealign if we encounter an unaligned extent then the file
will have this space reserved as is with extent marked unwritten. 

Is this understanding correct?

__xfs_bunmapi(...) 
{
    unsigned int alloc_fsb = xfs_inode_alloc_fsbsize(ip);
    <...>
    while (end != (xfs_fileoff_t)-1 && end >= start &
          (nexts == 0 || extno < nexts)) {
          <...>

          if (alloc_fsb == 1 || (flags & XFS_BMAPI_REMAP))
                goto delete;

         mod = xfs_bmap_alloc_unit_offset(ip, alloc_fsb,
                         del.br_startblock + del.br_blockcount);
         if (mod) {
                 /*
                  * Not aligned to allocation unit on the end.
                  * The extent could have been split into written
                  * and unwritten pieces, or we could just be
                  * unmapping part of it.  But we can't really
                  * get rid of part of an extent.
                  */
                 if (del.br_state == XFS_EXT_UNWRITTEN) {
                         /*
                          * This piece is unwritten, or we're not
                          * using unwritten extents.  Skip over it.
                          */
                         ASSERT((flags & XFS_BMAPI_REMAP) || end >= mod);
                         end -= mod > del.br_blockcount ?
                                 del.br_blockcount : mod;
                         if (end < got.br_startoff &&
                             !xfs_iext_prev_extent(ifp, &icur, &got)) {
                                 done = true;
                                 break;
                         }
                         continue;
                 }
                 /*
                  * It's written, turn it unwritten.
                  * This is better than zeroing it.
                  */
                 ASSERT(del.br_state == XFS_EXT_NORM);
                 ASSERT(tp->t_blk_res > 0);
                 /*
                  * If this spans an extent boundary, chop it back to
                  * the start of the one we end at.
                  */
                 if (del.br_blockcount > mod) {
                         del.br_startoff += del.br_blockcount - mod;
                         del.br_startblock += del.br_blockcount - mod;
                         del.br_blockcount = mod;
                 }
                 del.br_state = XFS_EXT_UNWRITTEN;
                 error = xfs_bmap_add_extent_unwritten_real(tp, ip,
                                 whichfork, &icur, &cur, &del,
                                 &logflags);
                 if (error)
                         goto error0;
                 goto nodelete;
         }

         mod = xfs_bmap_alloc_unit_offset(ip, alloc_fsb,
                                 del.br_startblock);
         if (mod) {
            // handle it for unaligned start block
            <...>
         }
    }
}

> -Dave.


Thanks a lot for answering the queries.

-ritesh

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v4 00/14] forcealign for xfs
  2024-09-05  3:56     ` Ritesh Harjani
@ 2024-09-05  6:33       ` Dave Chinner
  2024-09-10  2:51         ` Ritesh Harjani
  2024-09-10 12:33         ` Ritesh Harjani
  0 siblings, 2 replies; 58+ messages in thread
From: Dave Chinner @ 2024-09-05  6:33 UTC (permalink / raw)
  To: Ritesh Harjani
  Cc: John Garry, chandan.babu, djwong, dchinner, hch, viro, brauner,
	jack, linux-xfs, linux-kernel, linux-fsdevel, catherine.hoang,
	martin.petersen

On Thu, Sep 05, 2024 at 09:26:25AM +0530, Ritesh Harjani wrote:
> Dave Chinner <david@fromorbit.com> writes:
> > On Wed, Sep 04, 2024 at 11:44:29PM +0530, Ritesh Harjani wrote:
> >> 3. It is the FORCEALIGN feature which _mandates_ both allocation
> >> (by using extsize hint) and de-allocation to happen _only_ in
> >> extsize chunks.
> >>
> >>    i.e. forcealign mandates -
> >>    - the logical and physical start offset should be aligned as
> >>    per args->alignment
> >>    - extent length be aligned as per args->prod/mod.
> >>      If above two cannot be satisfied then return -ENOSPC.
> >
> > Yes.
> >
> >> 
> >>    - Does the unmapping of extents also only happens in extsize
> >>    chunks (with forcealign)?
> >
> > Yes, via use of xfs_inode_alloc_unitsize() in the high level code
> > aligning the fsbno ranges to be unmapped.
> >
> > Remember, force align requires both logical file offset and
> > physical block number to be correctly aligned,
> 
> This is where I would like to double confirm it again. Even the
> extsize hint feature (w/o FORCEALIGN) will try to allocate aligned
> physical start and logical start file offset and length right?

No.

> (Or does extsize hint only restricts alignment to logical start file
> offset + length and not the physical start?)

Neither.

extsize hint by itself (i.e. existing behaviour) has no alignment
effect at all. All it affects is -size- of the extent. i.e. once
the extent start is chosen, extent size hints will trim the length
of the extent to a multiple of the extent size hint. Alignment is
not considered at all.

> Also it looks like there is no difference with ATOMIC_WRITE AND
> FORCEALIGN feature with XFS, correct? (except that ATOMIC_WRITE is
> adding additional natural alignment restrictions on pos and len). 

Atomic write requires additional hardware support, and it restricts
the valid sizes of extent size hints that can be set. Only atomic
writes can be done on files marked as configured for atomic writes;
force alignment can be done on any file...

> So why maintain 2 separate on disk inode flags for FORCEALIGN AND
> ATOMIC_WRITE?

the atomic write flag indicates that a file has been set up
correctly for atomic writes to be able to issues reliably. force
alignment doesn't guarantee that - it's just a mechanism that tells
the allocator to behave a specific way.

> - Do you foresee FORCEALIGN to be also used at other places w/o
> ATOMIC_WRITE where feature differentiation between the two on an
> inode is required?

The already exist. For example, reliably allocating huge page
mappings on DAX filesystems requires 2MB forced alignment. 

> - Does the same reasoning will hold for XFS_SB_FEAT_RO_COMPAT_FORCEALIGN
> & XFS_SB_FEAT_RO_COMPAT_ATOMICWRITES too?

Same as above.

> - But why ro_compact for ATOMICWRITES? There aren't any on disk metadata
> changes within XFS filesystem to support atomic writes, right? 

Because if you downgrade the kernel to something that doesn't
support atomic writes, then non-atomic sized/aligned data can be
written to the file and/or torn writes can occur.

Worse, extent size hints that don't match the underlying hardware
support could be set up for inodes, and when the kernel is upgraded
again then atomic writes will fail on inodes that have atomic write
flags set on them....

> Is it something to just prevent users from destroying their own data
> by not allowing a rw mount from an older kernel where users could do
> unaligned writes to files marked for atomic writes?
> Or is there any other reasoning to prevent XFS filesystem from becoming
> inconsistent if an older kernel does a rw mount here.

The older kernel does not know what the unknown inode flag means
(i.e. atomic writes) and so, by definition, we cannot allow it to
modify metadata or file data because it may not modify it in the
correct way for that flag being set on the inode.

Kernels that don't understand feature flags need to treat the
filesystem as read-only, no matter how trivial the feature addition
might seem.

> > so unmap alignment
> > has to be set up correctly at file offset level before we even know
> > what extents underly the file range we need to unmap....
> >
> >>      If the start or end of the extent which needs unmapping is
> >>      unaligned then we convert that extent to unwritten and skip,
> >>      is it? (__xfs_bunmapi())
> >
> > The high level code should be aligning the start and end of the
> > file range to be removed via xfs_inode_alloc_unitsize(). Hence 
> > the low level __xfs_bunmapi() code shouldn't ever be encountering
> > unaligned unmaps on force-aligned inodes.
> >
> 
> Yes, but isn't this code snippet trying to handle a case when it finds an
> unaligned extent during unmap?

It does exactly that.

> And what we are essentially trying to 
> do here is leave the unwritten extent as is and if the found extent is
> written then convert to unwritten and skip it (goto nodelete). This
> means with forcealign if we encounter an unaligned extent then the file
> will have this space reserved as is with extent marked unwritten. 
> 
> Is this understanding correct?

Yes, except for the fact that this code is not triggered by force
alignment.

This code is preexisting realtime file functionality. It is used
when the rtextent size is larger than a single filesytem block.

In these configurations, we do allocation in rtextsize units, but we
track written/unwritten extents in the BMBT on filesystem block
granularity. Hence we can have unaligned written/unwritten extent
boundaries, and if we aren't unmapping a whole rtextent then we
simply mark the unused portion of it as unwritten in the BMBT to
indicate it contains zeroes.

IOWs, existing RT files have different IO alignment and
written/unwritten extent conversion behaviour to the new forced
alignment feature. The implementation code is shared in many places,
but the higher level forced alignment functionality ensures there
are never unaligned unwritten extents created or unaligned
unmappings asked for. Hence this code does not trigger for the
forced alignment cases.

i.e. we have multiple seperate allocation alignment behaviours that
share an implementation, but they do not all behave exactly the same
way or provide the same alignment guarantees....

-Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v4 00/14] forcealign for xfs
  2024-09-05  6:33       ` Dave Chinner
@ 2024-09-10  2:51         ` Ritesh Harjani
  2024-09-16  6:33           ` Dave Chinner
  2024-09-10 12:33         ` Ritesh Harjani
  1 sibling, 1 reply; 58+ messages in thread
From: Ritesh Harjani @ 2024-09-10  2:51 UTC (permalink / raw)
  To: Dave Chinner
  Cc: John Garry, chandan.babu, djwong, dchinner, hch, viro, brauner,
	jack, linux-xfs, linux-kernel, linux-fsdevel, catherine.hoang,
	martin.petersen

Dave Chinner <david@fromorbit.com> writes:

> On Thu, Sep 05, 2024 at 09:26:25AM +0530, Ritesh Harjani wrote:
>> Dave Chinner <david@fromorbit.com> writes:
>> > On Wed, Sep 04, 2024 at 11:44:29PM +0530, Ritesh Harjani wrote:
>> >> 3. It is the FORCEALIGN feature which _mandates_ both allocation
>> >> (by using extsize hint) and de-allocation to happen _only_ in
>> >> extsize chunks.
>> >>
>> >>    i.e. forcealign mandates -
>> >>    - the logical and physical start offset should be aligned as
>> >>    per args->alignment
>> >>    - extent length be aligned as per args->prod/mod.
>> >>      If above two cannot be satisfied then return -ENOSPC.
>> >
>> > Yes.
>> >
>> >> 
>> >>    - Does the unmapping of extents also only happens in extsize
>> >>    chunks (with forcealign)?
>> >
>> > Yes, via use of xfs_inode_alloc_unitsize() in the high level code
>> > aligning the fsbno ranges to be unmapped.
>> >
>> > Remember, force align requires both logical file offset and
>> > physical block number to be correctly aligned,
>> 
>> This is where I would like to double confirm it again. Even the
>> extsize hint feature (w/o FORCEALIGN) will try to allocate aligned
>> physical start and logical start file offset and length right?
>
> No.
>
>> (Or does extsize hint only restricts alignment to logical start file
>> offset + length and not the physical start?)
>
> Neither.
>

Yes, thanks for the correction. Indeed extsize hint does not take care
of the physical start alignment at all.

> extsize hint by itself (i.e. existing behaviour) has no alignment
> effect at all. All it affects is -size- of the extent. i.e. once
> the extent start is chosen, extent size hints will trim the length
> of the extent to a multiple of the extent size hint. Alignment is
> not considered at all.
>

Please correct me I wrong here... but XFS considers aligning the logical
start and the length of the allocated extent (for extsize) as per below
code right? 

i.e.
1) xfs_direct_write_iomap_begin()
{
    <...>
    if (offset + length > XFS_ISIZE(ip))
		end_fsb = xfs_iomap_eof_align_last_fsb(ip, end_fsb);
                  => xfs_fileoff_t aligned_end_fsb = roundup_64(end_fsb, align);
                     return aligned_end_fsb
}

2) xfs_bmap_compute_alignments()
{
    <...>
    	else if (ap->datatype & XFS_ALLOC_USERDATA)
		     align = xfs_get_extsz_hint(ap->ip);

        if (align) {
            if (xfs_bmap_extsize_align(mp, &ap->got, &ap->prev, align, 0,
                        ap->eof, 0, ap->conv, &ap->offset,
                        &ap->length))
                ASSERT(0);
            ASSERT(ap->length);

            args->prod = align;
            div_u64_rem(ap->offset, args->prod, &args->mod);
            if (args->mod)
                args->mod = args->prod - args->mod;
        }
        <...>
}

So args->prod and args->mod... aren't they use to align the logical
start and the length of the extent?


However, I do notice that when the file is closed XFS trims the length
allocated beyond EOF boundary (for extsize but not for forcealign from
the new forcealign series) i.e.

xfs_file_release() -> xfs_release() -> xfs_free_eofblocks()

I guess that is because xfs_can_free_eofblocks() does not consider
alignment for extsize in this function 

xfs_can_free_eofblocks()
{
<...>
	end_fsb = xfs_inode_roundup_alloc_unit(ip,
			XFS_B_TO_FSB(mp, (xfs_ufsize_t)XFS_ISIZE(ip)));
<...>
}




>> Also it looks like there is no difference with ATOMIC_WRITE AND
>> FORCEALIGN feature with XFS, correct? (except that ATOMIC_WRITE is
>> adding additional natural alignment restrictions on pos and len). 
>
> Atomic write requires additional hardware support, and it restricts
> the valid sizes of extent size hints that can be set. Only atomic
> writes can be done on files marked as configured for atomic writes;
> force alignment can be done on any file...
>
>> So why maintain 2 separate on disk inode flags for FORCEALIGN AND
>> ATOMIC_WRITE?
>
> the atomic write flag indicates that a file has been set up
> correctly for atomic writes to be able to issues reliably. force
> alignment doesn't guarantee that - it's just a mechanism that tells
> the allocator to behave a specific way.
>
>> - Do you foresee FORCEALIGN to be also used at other places w/o
>> ATOMIC_WRITE where feature differentiation between the two on an
>> inode is required?
>
> The already exist. For example, reliably allocating huge page
> mappings on DAX filesystems requires 2MB forced alignment. 
>
>> - Does the same reasoning will hold for XFS_SB_FEAT_RO_COMPAT_FORCEALIGN
>> & XFS_SB_FEAT_RO_COMPAT_ATOMICWRITES too?
>
> Same as above.
>
>> - But why ro_compact for ATOMICWRITES? There aren't any on disk metadata
>> changes within XFS filesystem to support atomic writes, right? 
>
> Because if you downgrade the kernel to something that doesn't
> support atomic writes, then non-atomic sized/aligned data can be
> written to the file and/or torn writes can occur.
>
> Worse, extent size hints that don't match the underlying hardware
> support could be set up for inodes, and when the kernel is upgraded
> again then atomic writes will fail on inodes that have atomic write
> flags set on them....
>
>> Is it something to just prevent users from destroying their own data
>> by not allowing a rw mount from an older kernel where users could do
>> unaligned writes to files marked for atomic writes?
>> Or is there any other reasoning to prevent XFS filesystem from becoming
>> inconsistent if an older kernel does a rw mount here.
>
> The older kernel does not know what the unknown inode flag means
> (i.e. atomic writes) and so, by definition, we cannot allow it to
> modify metadata or file data because it may not modify it in the
> correct way for that flag being set on the inode.
>
> Kernels that don't understand feature flags need to treat the
> filesystem as read-only, no matter how trivial the feature addition
> might seem.
>
>> > so unmap alignment
>> > has to be set up correctly at file offset level before we even know
>> > what extents underly the file range we need to unmap....
>> >
>> >>      If the start or end of the extent which needs unmapping is
>> >>      unaligned then we convert that extent to unwritten and skip,
>> >>      is it? (__xfs_bunmapi())
>> >
>> > The high level code should be aligning the start and end of the
>> > file range to be removed via xfs_inode_alloc_unitsize(). Hence 
>> > the low level __xfs_bunmapi() code shouldn't ever be encountering
>> > unaligned unmaps on force-aligned inodes.
>> >
>> 
>> Yes, but isn't this code snippet trying to handle a case when it finds an
>> unaligned extent during unmap?
>
> It does exactly that.
>
>> And what we are essentially trying to 
>> do here is leave the unwritten extent as is and if the found extent is
>> written then convert to unwritten and skip it (goto nodelete). This
>> means with forcealign if we encounter an unaligned extent then the file
>> will have this space reserved as is with extent marked unwritten. 
>> 
>> Is this understanding correct?
>
> Yes, except for the fact that this code is not triggered by force
> alignment.
>
> This code is preexisting realtime file functionality. It is used
> when the rtextent size is larger than a single filesytem block.
>
> In these configurations, we do allocation in rtextsize units, but we
> track written/unwritten extents in the BMBT on filesystem block
> granularity. Hence we can have unaligned written/unwritten extent
> boundaries, and if we aren't unmapping a whole rtextent then we
> simply mark the unused portion of it as unwritten in the BMBT to
> indicate it contains zeroes.
>
> IOWs, existing RT files have different IO alignment and
> written/unwritten extent conversion behaviour to the new forced
> alignment feature. The implementation code is shared in many places,
> but the higher level forced alignment functionality ensures there
> are never unaligned unwritten extents created or unaligned
> unmappings asked for. Hence this code does not trigger for the
> forced alignment cases.
>
> i.e. we have multiple seperate allocation alignment behaviours that
> share an implementation, but they do not all behave exactly the same
> way or provide the same alignment guarantees....
>

Thanks for taking time and explaining this. 

-ritesh

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v4 00/14] forcealign for xfs
  2024-09-10  2:51         ` Ritesh Harjani
@ 2024-09-16  6:33           ` Dave Chinner
  0 siblings, 0 replies; 58+ messages in thread
From: Dave Chinner @ 2024-09-16  6:33 UTC (permalink / raw)
  To: Ritesh Harjani
  Cc: John Garry, chandan.babu, djwong, dchinner, hch, viro, brauner,
	jack, linux-xfs, linux-kernel, linux-fsdevel, catherine.hoang,
	martin.petersen

On Tue, Sep 10, 2024 at 08:21:55AM +0530, Ritesh Harjani wrote:
> Dave Chinner <david@fromorbit.com> writes:
> 
> > On Thu, Sep 05, 2024 at 09:26:25AM +0530, Ritesh Harjani wrote:
> >> Dave Chinner <david@fromorbit.com> writes:
> >> > On Wed, Sep 04, 2024 at 11:44:29PM +0530, Ritesh Harjani wrote:
> >> >> 3. It is the FORCEALIGN feature which _mandates_ both allocation
> >> >> (by using extsize hint) and de-allocation to happen _only_ in
> >> >> extsize chunks.
> >> >>
> >> >>    i.e. forcealign mandates -
> >> >>    - the logical and physical start offset should be aligned as
> >> >>    per args->alignment
> >> >>    - extent length be aligned as per args->prod/mod.
> >> >>      If above two cannot be satisfied then return -ENOSPC.
> >> >
> >> > Yes.
> >> >
> >> >> 
> >> >>    - Does the unmapping of extents also only happens in extsize
> >> >>    chunks (with forcealign)?
> >> >
> >> > Yes, via use of xfs_inode_alloc_unitsize() in the high level code
> >> > aligning the fsbno ranges to be unmapped.
> >> >
> >> > Remember, force align requires both logical file offset and
> >> > physical block number to be correctly aligned,
> >> 
> >> This is where I would like to double confirm it again. Even the
> >> extsize hint feature (w/o FORCEALIGN) will try to allocate aligned
> >> physical start and logical start file offset and length right?
> >
> > No.
> >
> >> (Or does extsize hint only restricts alignment to logical start file
> >> offset + length and not the physical start?)
> >
> > Neither.
> >
> 
> Yes, thanks for the correction. Indeed extsize hint does not take care
> of the physical start alignment at all.
> 
> > extsize hint by itself (i.e. existing behaviour) has no alignment
> > effect at all. All it affects is -size- of the extent. i.e. once
> > the extent start is chosen, extent size hints will trim the length
> > of the extent to a multiple of the extent size hint. Alignment is
> > not considered at all.
> >
> 
> Please correct me I wrong here... but XFS considers aligning the logical
> start and the length of the allocated extent (for extsize) as per below
> code right? 

Sorry, I was talking about physical alignment, not logical file
offset alignment. The logical file offset alignment that is done
for extent size hints is much more convoluted and dependent on
certain preconditions existing for it to function as forced
alignment/atomic writes require.

> 
> i.e.
> 1) xfs_direct_write_iomap_begin()
> {
>     <...>
>     if (offset + length > XFS_ISIZE(ip))
> 		end_fsb = xfs_iomap_eof_align_last_fsb(ip, end_fsb);
>                   => xfs_fileoff_t aligned_end_fsb = roundup_64(end_fsb, align);
>                      return aligned_end_fsb
> }

That's calculating the file offset of the end of the extent for an
extending write. It's not really an alignment - it's simply
calculating the file offset the allocation needs to cover to allow
for aligned allocation. This length needs to be fed into the
transaction reservation (i.e. ENOSPC checks) before we start the
allocation, so we have to have some idea of the extent size we are
going to allocate here...

> 2) xfs_bmap_compute_alignments()
> {
>     <...>
>     	else if (ap->datatype & XFS_ALLOC_USERDATA)
> 		     align = xfs_get_extsz_hint(ap->ip);
> 
>         if (align) {
>             if (xfs_bmap_extsize_align(mp, &ap->got, &ap->prev, align, 0,
>                         ap->eof, 0, ap->conv, &ap->offset,
>                         &ap->length))
>                 ASSERT(0);
>             ASSERT(ap->length);
> 
>             args->prod = align;
>             div_u64_rem(ap->offset, args->prod, &args->mod);
>             if (args->mod)
>                 args->mod = args->prod - args->mod;
>         }
>         <...>
> }
> 
> So args->prod and args->mod... aren't they use to align the logical
> start and the length of the extent?

Nope. They are only used way down in xfs_alloc_fix_len(), which
trims the length of the selected *physical* extent to the required
length.

Look further up - ap->offset is the logical file offset the
allocation needs to cover.  Logical alignment of the offset (i.e.
determining where in the file the physical extent will be placed) is
done in xfs_bmap_extsize_align(). As i said above, it's not purely
an extent size alignment calculation....

> However, I do notice that when the file is closed XFS trims the length
> allocated beyond EOF boundary (for extsize but not for forcealign from
> the new forcealign series) i.e.
> 
> xfs_file_release() -> xfs_release() -> xfs_free_eofblocks()
> 
> I guess that is because xfs_can_free_eofblocks() does not consider
> alignment for extsize in this function 

Of course - who wants large chunks of space allocated beyond EOF
when you are never going to write to the file again?

i.e. If you have large extsize hints then the post-eof tail can
consume a -lot- of space that won't otherwise get freed. This can
lead to rapid, unexpected ENOSPC, and it's not clear to users what
the cause is.

Hence we don't care if extsz is set on the inode or not when we
decide to remove post-eof blocks - reclaiming the unused space is
much more important that an occasional unaligned or small extent.

Forcealign changes that equation, but if you choose forcealign you
are doing it for a specific reason and likely not applying it to the
entire filesystem.....

-Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v4 00/14] forcealign for xfs
  2024-09-05  6:33       ` Dave Chinner
  2024-09-10  2:51         ` Ritesh Harjani
@ 2024-09-10 12:33         ` Ritesh Harjani
  2024-09-16  7:03           ` Dave Chinner
  1 sibling, 1 reply; 58+ messages in thread
From: Ritesh Harjani @ 2024-09-10 12:33 UTC (permalink / raw)
  To: Dave Chinner, John Garry
  Cc: John Garry, chandan.babu, djwong, dchinner, hch, viro, brauner,
	jack, linux-xfs, linux-kernel, linux-fsdevel, catherine.hoang,
	martin.petersen

Dave Chinner <david@fromorbit.com> writes:

> On Thu, Sep 05, 2024 at 09:26:25AM +0530, Ritesh Harjani wrote:
>> Dave Chinner <david@fromorbit.com> writes:
>> > On Wed, Sep 04, 2024 at 11:44:29PM +0530, Ritesh Harjani wrote:
>> >> 3. It is the FORCEALIGN feature which _mandates_ both allocation
>> >> (by using extsize hint) and de-allocation to happen _only_ in
>> >> extsize chunks.
>> >>
>> >>    i.e. forcealign mandates -
>> >>    - the logical and physical start offset should be aligned as
>> >>    per args->alignment
>> >>    - extent length be aligned as per args->prod/mod.
>> >>      If above two cannot be satisfied then return -ENOSPC.
>> >
>> > Yes.
>> >
>> >> 
>> >>    - Does the unmapping of extents also only happens in extsize
>> >>    chunks (with forcealign)?
>> >
>> > Yes, via use of xfs_inode_alloc_unitsize() in the high level code
>> > aligning the fsbno ranges to be unmapped.
>> >
>> > Remember, force align requires both logical file offset and
>> > physical block number to be correctly aligned,
>> 
>> This is where I would like to double confirm it again. Even the
>> extsize hint feature (w/o FORCEALIGN) will try to allocate aligned
>> physical start and logical start file offset and length right?
>
> No.
>
>> (Or does extsize hint only restricts alignment to logical start file
>> offset + length and not the physical start?)
>
> Neither.
>
> extsize hint by itself (i.e. existing behaviour) has no alignment
> effect at all. All it affects is -size- of the extent. i.e. once
> the extent start is chosen, extent size hints will trim the length
> of the extent to a multiple of the extent size hint. Alignment is
> not considered at all.
>
>> Also it looks like there is no difference with ATOMIC_WRITE AND
>> FORCEALIGN feature with XFS, correct? (except that ATOMIC_WRITE is
>> adding additional natural alignment restrictions on pos and len). 
>
> Atomic write requires additional hardware support, and it restricts
> the valid sizes of extent size hints that can be set. Only atomic
> writes can be done on files marked as configured for atomic writes;
> force alignment can be done on any file...
>
>> So why maintain 2 separate on disk inode flags for FORCEALIGN AND
>> ATOMIC_WRITE?
>
> the atomic write flag indicates that a file has been set up
> correctly for atomic writes to be able to issues reliably. force
> alignment doesn't guarantee that - it's just a mechanism that tells
> the allocator to behave a specific way.
>
>> - Do you foresee FORCEALIGN to be also used at other places w/o
>> ATOMIC_WRITE where feature differentiation between the two on an
>> inode is required?
>
> The already exist. For example, reliably allocating huge page
> mappings on DAX filesystems requires 2MB forced alignment. 
>
>> - Does the same reasoning will hold for XFS_SB_FEAT_RO_COMPAT_FORCEALIGN
>> & XFS_SB_FEAT_RO_COMPAT_ATOMICWRITES too?
>
> Same as above.
>
>> - But why ro_compact for ATOMICWRITES? There aren't any on disk metadata
>> changes within XFS filesystem to support atomic writes, right? 
>
> Because if you downgrade the kernel to something that doesn't
> support atomic writes, then non-atomic sized/aligned data can be
> written to the file and/or torn writes can occur.
>
> Worse, extent size hints that don't match the underlying hardware
> support could be set up for inodes, and when the kernel is upgraded
> again then atomic writes will fail on inodes that have atomic write
> flags set on them....
>
>> Is it something to just prevent users from destroying their own data
>> by not allowing a rw mount from an older kernel where users could do
>> unaligned writes to files marked for atomic writes?
>> Or is there any other reasoning to prevent XFS filesystem from becoming
>> inconsistent if an older kernel does a rw mount here.
>
> The older kernel does not know what the unknown inode flag means
> (i.e. atomic writes) and so, by definition, we cannot allow it to
> modify metadata or file data because it may not modify it in the
> correct way for that flag being set on the inode.
>
> Kernels that don't understand feature flags need to treat the
> filesystem as read-only, no matter how trivial the feature addition
> might seem.
>

1. Will it require a fresh formatting of filesystem with mkfs.xfs for
enabling atomic writes (/forcealign) on XFS?
  a. Is that because reflink is not support with atomic writes
  (/forcealign) today?

As I understand for setting forcealign attr on any inode it checks for
whether xfs_has_forcealign(mp). That means forcealign can _only_ be
enabled during mkfs time and it also needs reflink to be disabled with
-m reflink=0. Right?

-ritesh

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v4 00/14] forcealign for xfs
  2024-09-10 12:33         ` Ritesh Harjani
@ 2024-09-16  7:03           ` Dave Chinner
  2024-09-16 10:24             ` John Garry
  0 siblings, 1 reply; 58+ messages in thread
From: Dave Chinner @ 2024-09-16  7:03 UTC (permalink / raw)
  To: Ritesh Harjani
  Cc: John Garry, chandan.babu, djwong, dchinner, hch, viro, brauner,
	jack, linux-xfs, linux-kernel, linux-fsdevel, catherine.hoang,
	martin.petersen

On Tue, Sep 10, 2024 at 06:03:12PM +0530, Ritesh Harjani wrote:
> >> Is it something to just prevent users from destroying their own data
> >> by not allowing a rw mount from an older kernel where users could do
> >> unaligned writes to files marked for atomic writes?
> >> Or is there any other reasoning to prevent XFS filesystem from becoming
> >> inconsistent if an older kernel does a rw mount here.
> >
> > The older kernel does not know what the unknown inode flag means
> > (i.e. atomic writes) and so, by definition, we cannot allow it to
> > modify metadata or file data because it may not modify it in the
> > correct way for that flag being set on the inode.
> >
> > Kernels that don't understand feature flags need to treat the
> > filesystem as read-only, no matter how trivial the feature addition
> > might seem.
> >
> 
> 1. Will it require a fresh formatting of filesystem with mkfs.xfs for
> enabling atomic writes (/forcealign) on XFS?

Initially, yes.

>   a. Is that because reflink is not support with atomic writes
>   (/forcealign) today?

It's much more complex than that.

e.g. How does force-align and COW interact, especially w.r.t.
sub-alloc unit overwrites, cowextsz based preallocation and
unwritten extents in the COW fork?

> As I understand for setting forcealign attr on any inode it checks for
> whether xfs_has_forcealign(mp). That means forcealign can _only_ be
> enabled during mkfs time and it also needs reflink to be disabled with
> -m reflink=0. Right?

forcealign doesn't need to be completely turned off when reflink is
enabled and/or vice versa. Both can co-exist in the filesytsem at
the same time, but the current implementation does not allow
forcealign and reflink to be used on the same inode at the same
time.

It was decided that the best way to handle the lack of reflink
support initially was to make the two feature bits incompatible at
mount time. Hence we currently have to make a non-reflink filesystem
to test forcealign based functionality.

However, doing it this way means that when we fix the implementation
to support reflink and forcealign together, we just remove the mount
time check and all existing reflink filesystems can be immediately
upgraded to support forcealign.

OTOH, we can't do this with atomic writes. Atomic writes require
some mkfs help because they require explicit physical alignment of
the filesystem to the underlying storage. Hence we'll eventually end
up with atomic writes needing to be enabled at mkfs time, but force
align will be an upgradeable feature flag.

-Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v4 00/14] forcealign for xfs
  2024-09-16  7:03           ` Dave Chinner
@ 2024-09-16 10:24             ` John Garry
  2024-09-17 20:54               ` Darrick J. Wong
  2024-09-17 22:12               ` Dave Chinner
  0 siblings, 2 replies; 58+ messages in thread
From: John Garry @ 2024-09-16 10:24 UTC (permalink / raw)
  To: Dave Chinner, Ritesh Harjani
  Cc: chandan.babu, djwong, dchinner, hch, viro, brauner, jack,
	linux-xfs, linux-kernel, linux-fsdevel, catherine.hoang,
	martin.petersen

On 16/09/2024 08:03, Dave Chinner wrote:
> OTOH, we can't do this with atomic writes. Atomic writes require
> some mkfs help because they require explicit physical alignment of
> the filesystem to the underlying storage. 

If we are enabling atomic writes at mkfs time, then we can ensure agsize 
% extsize == 0. That provides the physical alignment guarantee. It also 
makes sense to ensure extsize is a power-of-2.

However, extsize is re-configurble per inode. So, for an inode enabled 
for atomic writes, we must still ensure agsize % new extsize == 0 (and 
also new extsize is a power-of-2)

> Hence we'll eventually end
> up with atomic writes needing to be enabled at mkfs time, but force
> align will be an upgradeable feature flag.

Could atomic writes also be an upgradeable feature? We just need to 
ensure that agsize % extsize == 0 for an inode enabled for atomic 
writes. Valid extsize values may be quite limited, though, depending on 
the value of agsize.

Thanks,
John

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v4 00/14] forcealign for xfs
  2024-09-16 10:24             ` John Garry
@ 2024-09-17 20:54               ` Darrick J. Wong
  2024-09-17 23:34                 ` Dave Chinner
  2024-09-17 22:12               ` Dave Chinner
  1 sibling, 1 reply; 58+ messages in thread
From: Darrick J. Wong @ 2024-09-17 20:54 UTC (permalink / raw)
  To: John Garry
  Cc: Dave Chinner, Ritesh Harjani, chandan.babu, dchinner, hch, viro,
	brauner, jack, linux-xfs, linux-kernel, linux-fsdevel,
	catherine.hoang, martin.petersen

On Mon, Sep 16, 2024 at 11:24:56AM +0100, John Garry wrote:
> On 16/09/2024 08:03, Dave Chinner wrote:
> > OTOH, we can't do this with atomic writes. Atomic writes require
> > some mkfs help because they require explicit physical alignment of
> > the filesystem to the underlying storage.

Forcealign requires agsize%extsize==0, it's atomicwrites that adds the
requirement that extsize be a power of 2...

> If we are enabling atomic writes at mkfs time, then we can ensure agsize %
> extsize == 0. That provides the physical alignment guarantee. It also makes
> sense to ensure extsize is a power-of-2.
> 
> However, extsize is re-configurble per inode. So, for an inode enabled for
> atomic writes, we must still ensure agsize % new extsize == 0 (and also new
> extsize is a power-of-2)

...so yes.

> > Hence we'll eventually end
> > up with atomic writes needing to be enabled at mkfs time, but force
> > align will be an upgradeable feature flag.
> 
> Could atomic writes also be an upgradeable feature? We just need to ensure
> that agsize % extsize == 0 for an inode enabled for atomic writes. Valid
> extsize values may be quite limited, though, depending on the value of
> agsize.

I don't see why forcealign and atomicwrites can't be added to the sb
featureset after the fact, though you're right that callers of xfs_io
chattr might be hard pressed to find an fsx_extsize value that fits.

--D

> Thanks,
> John
> 
> 

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v4 00/14] forcealign for xfs
  2024-09-17 20:54               ` Darrick J. Wong
@ 2024-09-17 23:34                 ` Dave Chinner
  0 siblings, 0 replies; 58+ messages in thread
From: Dave Chinner @ 2024-09-17 23:34 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: John Garry, Ritesh Harjani, chandan.babu, dchinner, hch, viro,
	brauner, jack, linux-xfs, linux-kernel, linux-fsdevel,
	catherine.hoang, martin.petersen

On Tue, Sep 17, 2024 at 01:54:20PM -0700, Darrick J. Wong wrote:
> On Mon, Sep 16, 2024 at 11:24:56AM +0100, John Garry wrote:
> > On 16/09/2024 08:03, Dave Chinner wrote:
> > > OTOH, we can't do this with atomic writes. Atomic writes require
> > > some mkfs help because they require explicit physical alignment of
> > > the filesystem to the underlying storage.
> 
> Forcealign requires agsize%extsize==0,

No it doesn't.

AG size is irrelevant when aligning extents - all
that matters is that we can find a free extent that can be trimmed
to the alignment defined by extsize. 

> it's atomicwrites that adds the
> requirement that extsize be a power of 2...

Only by explicit implementation constraint.

Atomic writes do not require power of two extsize - they only
require correctly aligned physical extents.

e.g an 8kB atomic write is always guaranteed to succeed if
we have an extsize of 24kB for laying out the physical extents
because a 24kB physical extent is always 8kB aligned and is an exact
multiple of 8kB. This meets the requirements for 8kB atomic writes to
always succeed, and hence there is no fundamental requirement for
extsize to be a power of 2.

We have *chosen* to simplify the implementation by only allowing
a single aligned atomic write to be issued at a time. This means
the alignment and size of atomic writes is always the minimum size
the hardware advertises, and that is (at present) always a power of
2.

Hence the "extsize needs to be a power of 2" comes from the
constraints exposed from the block device configuration (i.e.
minimum atomic write unit), not from a filesystem design or
implementation constraint.

At the filesystem level, we have further simplified things by only
allowing extsize = atomic write size. Hence the initial
implementation ends up only support power of 2 extsize values. This
is not a hard design or implementation constraint, however.

.....

hmmmmm.

.....

In writing this I've had further thoughts on force-align and the
sub-alloc-unit unwritten extent issue we've been discussing here.
i.e. I've stopped and considered the existing design constraints
given what I wrote above and considered what is needed for
supporting large extsize for small atomic writes.

I think we need to support large extsize with small atomic write
size for two reasons:

1. extsize is still going to be needed for preventing excessive
fragmentation with atomic writes. It's the small DIO write workloads
that see lots of fragmentation, and applications using atomic writes
are essentially being forced down the path of being small DIO write
workloads.

2. we can allow force-align w/o atomic writes behaviour to match the
existing rtvol sb_rextsize > 1 fsb behaviour without impacting
atomic write behaviour. (i.e. less behavioural differences, more
common code, better performance, etc).

To do this (and I think we do want to do this), then we have to do
two things:

1. force-align needs to add a "unwritten align" inode parameter to
allow sub-extsize unwritten extent boundaries to exist in the BMBT.
(i.e.  similar to how rt files w/ sb_rextsize > 1 fsb currently
behave.)

This is purely an in-memory value - for pure "force-align" we can
set it 1 fsb and then the behaviour will match existing RT
behaviour.  We can abstract this behaviour by replacing the hard
coded 1 block alignment for unwritten conversion with an "unwritten
align" value which would initially be set to 1.

We can also abstract this code away from being "rt specific" and
make it entirely dependent on "alloc-unit" configuration. This means
rt, force-align and atomic write will all be running the same code,
which makes testing a lot easier..

2. inodes with atomic write enabled must set the unwritten align
value to the atomic write size exposed by the hardware, and the
extsize must be an exact integer multiple of the unwritten align
size.

The initial settings of unwritten align == extsize gives the current
behaviour of all allocation and extent conversion being aligned to
atomic write constraints.

The separation of unwritten conversion from the extsize then allows
allows the situation I described above with 8kB atomic writes
and 24kB extsize. Because unwritten conversion is aligned to atomic
wriet boundaries, we can use sub-alloc-unit unwritten extents
without violating atomic write boundaries.

This would allow us to use extsize for atomic writes in the same
manner we use it for now - enable large contiguous allocations to
prevent fragmentation when doing lots of concurrent small
"immediate write" operations across many files.

I think this can all be added on top of the existing patchset - it's
not really a fundamental change to any of it. It's a little bit more
abstraction and unification, but it enables a lot more flexibility
for optimising atomic write functionality in the future.

Thoughts?

-Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v4 00/14] forcealign for xfs
  2024-09-16 10:24             ` John Garry
  2024-09-17 20:54               ` Darrick J. Wong
@ 2024-09-17 22:12               ` Dave Chinner
  2024-09-18  7:59                 ` John Garry
  1 sibling, 1 reply; 58+ messages in thread
From: Dave Chinner @ 2024-09-17 22:12 UTC (permalink / raw)
  To: John Garry
  Cc: Ritesh Harjani, chandan.babu, djwong, dchinner, hch, viro,
	brauner, jack, linux-xfs, linux-kernel, linux-fsdevel,
	catherine.hoang, martin.petersen

On Mon, Sep 16, 2024 at 11:24:56AM +0100, John Garry wrote:
> On 16/09/2024 08:03, Dave Chinner wrote:
> > OTOH, we can't do this with atomic writes. Atomic writes require
> > some mkfs help because they require explicit physical alignment of
> > the filesystem to the underlying storage.
> 
> If we are enabling atomic writes at mkfs time, then we can ensure agsize %
> extsize == 0. That provides the physical alignment guarantee. It also makes
> sense to ensure extsize is a power-of-2.

No, mkfs does not want to align anything to "extsize". It needs to
align the filesystem geometry to be compatible with the underlying
block device atomic write alignment parameters.

We just don't care if extsize is not an exact multiple of agsize.
As long as extsize is aligned to the atomic write boundaries and the
start of the AG is aligned to atomic write boundaries, we can
allocate hardware aligned extsize sized extents from the AG.

AGs are always going to contain lots of non-aligned, randomly sized
extents for other stuff like metadata and unaligned file data.
Aligned allocation is all about finding extsized aligned free space
within the AG and has nothing to do with the size of the AG itself.

> However, extsize is re-configurble per inode. So, for an inode enabled for
> atomic writes, we must still ensure agsize % new extsize == 0 (and also new
> extsize is a power-of-2)

Ensuring that the extsize is aligned to the hardware atomic write
limits is a kernel runtime check when enabling atomic writes on an
inode.

In this case, we do not care what the AG size is - it is completely
irrelevant to these per-inode runtime checks because mkfs has
already guaranteed that the AG is correctly aligned to the
underlying hardware. That means is extsize is also aligned to the
underlying hardware, physical extent layout is guaranteed to be
compatible with the hardware constraints for atomic writes...

> > Hence we'll eventually end
> > up with atomic writes needing to be enabled at mkfs time, but force
> > align will be an upgradeable feature flag.
> 
> Could atomic writes also be an upgradeable feature? We just need to ensure
> that agsize % extsize == 0 for an inode enabled for atomic writes.

To turn the superblock feature bit on, we have to check the AGs are
correctly aligned to the *underlying hardware*. If they aren't
correctly aligned (and there is a good chance they will not be)
then we can't enable atomic writes at all. The only way to change
this is to physically move AGs around in the block device (i.e. via
xfs_expand tool I proposed).

i.e. the mkfs dependency on having the AGs aligned to the underlying
atomic write capabilities of the block device never goes away, even
if we want to make the feature dynamically enabled.

IOWs, yes, an existing filesystem -could- be upgradeable, but there
is no guarantee that is will be.

Quite frankly, we aren't going to see block devices that filesystems
already exist on suddenly sprout support for atomic writes mid-life.
Hence if mkfs detects atomic write support in the underlying device,
it should *always* modify the geometry to be compatible with atomic
writes and enable atomic write support.

Yes, that means the "incompat with reflink" issue needs to be fixed
before we take atomic writes out of experimental (i.e. we consistently
apply the same "full support" criteria we applied to DAX).

Hence by the time atomic writes are a fully supported feature, we're
going to be able to enable them by default at mkfs time for any
hardware that supports them...

> Valid
> extsize values may be quite limited, though, depending on the value of
> agsize.

No. The only limit agsize puts on extsize is that a single aligned
extent can't be larger than half the AG size. Forced alignment and
atomic writes don't change that.

-Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v4 00/14] forcealign for xfs
  2024-09-17 22:12               ` Dave Chinner
@ 2024-09-18  7:59                 ` John Garry
  2024-09-23  2:57                   ` Dave Chinner
  0 siblings, 1 reply; 58+ messages in thread
From: John Garry @ 2024-09-18  7:59 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Ritesh Harjani, chandan.babu, djwong, dchinner, hch, viro,
	brauner, jack, linux-xfs, linux-kernel, linux-fsdevel,
	catherine.hoang, martin.petersen

On 17/09/2024 23:12, Dave Chinner wrote:
> On Mon, Sep 16, 2024 at 11:24:56AM +0100, John Garry wrote:
>> On 16/09/2024 08:03, Dave Chinner wrote:
>>> OTOH, we can't do this with atomic writes. Atomic writes require
>>> some mkfs help because they require explicit physical alignment of
>>> the filesystem to the underlying storage.
>>
>> If we are enabling atomic writes at mkfs time, then we can ensure agsize %
>> extsize == 0. That provides the physical alignment guarantee. It also makes
>> sense to ensure extsize is a power-of-2.
> 
> No, mkfs does not want to align anything to "extsize". It needs to
> align the filesystem geometry to be compatible with the underlying
> block device atomic write alignment parameters.
> 
> We just don't care if extsize is not an exact multiple of agsize.
> As long as extsize is aligned to the atomic write boundaries and the
> start of the AG is aligned to atomic write boundaries, we can
> allocate hardware aligned extsize sized extents from the AG.
> 
> AGs are always going to contain lots of non-aligned, randomly sized
> extents for other stuff like metadata and unaligned file data.
> Aligned allocation is all about finding extsized aligned free space
> within the AG and has nothing to do with the size of the AG itself.

Fine, we can go the way of aligning the agsize to the atomic write unit 
max for mkfs.

> 
>> However, extsize is re-configurble per inode. So, for an inode enabled for
>> atomic writes, we must still ensure agsize % new extsize == 0 (and also new
>> extsize is a power-of-2)
> 
> Ensuring that the extsize is aligned to the hardware atomic write
> limits is a kernel runtime check when enabling atomic writes on an
> inode.
> 
> In this case, we do not care what the AG size is - it is completely
> irrelevant to these per-inode runtime checks because mkfs has
> already guaranteed that the AG is correctly aligned to the
> underlying hardware. That means is extsize is also aligned to the
> underlying hardware, physical extent layout is guaranteed to be
> compatible with the hardware constraints for atomic writes...

Sure, we would just need to enforce that extsize is a power-of-2 then.

> 
>>> Hence we'll eventually end
>>> up with atomic writes needing to be enabled at mkfs time, but force
>>> align will be an upgradeable feature flag.
>>
>> Could atomic writes also be an upgradeable feature? We just need to ensure
>> that agsize % extsize == 0 for an inode enabled for atomic writes.
> 
> To turn the superblock feature bit on, we have to check the AGs are
> correctly aligned to the *underlying hardware*. If they aren't
> correctly aligned (and there is a good chance they will not be)
> then we can't enable atomic writes at all. The only way to change
> this is to physically move AGs around in the block device (i.e. via
> xfs_expand tool I proposed).
 > > i.e. the mkfs dependency on having the AGs aligned to the underlying
> atomic write capabilities of the block device never goes away, even
> if we want to make the feature dynamically enabled.
> 
> IOWs, yes, an existing filesystem -could- be upgradeable, but there
> is no guarantee that is will be.
> 
> Quite frankly, we aren't going to see block devices that filesystems
> already exist on suddenly sprout support for atomic writes mid-life.

I would not be so sure. Some SCSI devices used in production which I 
know implicitly write 32KB atomically. And we would like to use them for 
atomic writes. 32KB is small and I guess that there is a small chance of 
pre-existing AGs not being 32KB aligned. I would need to check if there 
is even a min alignment for AGs...

> Hence if mkfs detects atomic write support in the underlying device,
> it should *always* modify the geometry to be compatible with atomic
> writes and enable atomic write support.

The current solution is to enable via commandline.

> 
> Yes, that means the "incompat with reflink" issue needs to be fixed
> before we take atomic writes out of experimental (i.e. we consistently
> apply the same "full support" criteria we applied to DAX).

In the meantime, if mkfs auto-enables atomic writes (when the HW 
supports), what will it do to reflink feature (in terms of enabling)?

> 
> Hence by the time atomic writes are a fully supported feature, we're
> going to be able to enable them by default at mkfs time for any
> hardware that supports them...
> 
>> Valid
>> extsize values may be quite limited, though, depending on the value of
>> agsize.
> 
> No. The only limit agsize puts on extsize is that a single aligned
> extent can't be larger than half the AG size. Forced alignment and
> atomic writes don't change that.
> 

ok

Thanks,
John


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v4 00/14] forcealign for xfs
  2024-09-18  7:59                 ` John Garry
@ 2024-09-23  2:57                   ` Dave Chinner
  2024-09-23  3:33                     ` Christoph Hellwig
  2024-09-23  8:00                     ` John Garry
  0 siblings, 2 replies; 58+ messages in thread
From: Dave Chinner @ 2024-09-23  2:57 UTC (permalink / raw)
  To: John Garry
  Cc: Ritesh Harjani, chandan.babu, djwong, dchinner, hch, viro,
	brauner, jack, linux-xfs, linux-kernel, linux-fsdevel,
	catherine.hoang, martin.petersen

On Wed, Sep 18, 2024 at 08:59:41AM +0100, John Garry wrote:
> On 17/09/2024 23:12, Dave Chinner wrote:
> > On Mon, Sep 16, 2024 at 11:24:56AM +0100, John Garry wrote:
> > > > Hence we'll eventually end
> > > > up with atomic writes needing to be enabled at mkfs time, but force
> > > > align will be an upgradeable feature flag.
> > > 
> > > Could atomic writes also be an upgradeable feature? We just need to ensure
> > > that agsize % extsize == 0 for an inode enabled for atomic writes.
> > 
> > To turn the superblock feature bit on, we have to check the AGs are
> > correctly aligned to the *underlying hardware*. If they aren't
> > correctly aligned (and there is a good chance they will not be)
> > then we can't enable atomic writes at all. The only way to change
> > this is to physically move AGs around in the block device (i.e. via
> > xfs_expand tool I proposed).
> > > i.e. the mkfs dependency on having the AGs aligned to the underlying
> > atomic write capabilities of the block device never goes away, even
> > if we want to make the feature dynamically enabled.
> > 
> > IOWs, yes, an existing filesystem -could- be upgradeable, but there
> > is no guarantee that is will be.
> > 
> > Quite frankly, we aren't going to see block devices that filesystems
> > already exist on suddenly sprout support for atomic writes mid-life.
> 
> I would not be so sure. Some SCSI devices used in production which I know
> implicitly write 32KB atomically. And we would like to use them for atomic
> writes.

Ok, but that's not going to be widespread. Very little storage
hardware out there supports atomic writes - the vast majority of
deployments will be new hardware that will have mkfs run on it.

A better argument for dynamic upgrade is turning on atomic writes
on reflink enabled filesystems once the kernel implementation has
been updates to allow the two features to co-exist.

> 32KB is small and I guess that there is a small chance of
> pre-existing AGs not being 32KB aligned. I would need to check if there is
> even a min alignment for AGs...

There is no default alignment for AGs unless there is a stripe unit
set. Then it will align AGs to the stripe unit. There is also no
guarantee that stripe units are aligned to powers of two or atomic
write units...

> > Hence if mkfs detects atomic write support in the underlying device,
> > it should *always* modify the geometry to be compatible with atomic
> > writes and enable atomic write support.
> 
> The current solution is to enable via commandline.

Yes, that's the current proposal.  What I'm saying is that this
isn't a future proof solution, nor how we want this functionality to
work in the future.

We should be looking at the block device capabilities (like we do for
stripe unit, etc) and then *do the right thing automatically*. If
the block device advertises atomic write support, then we should
automatically align the filesystem to atomic write constraints, even
if atomic writes can not be immediately enabled (because reflink).

I'm trying to describe how we want things to work once atomic write
support is ubiquitous. It needs to be simple for users and admins,
and it should work (or be reliably upgradeable) out of the box on
all new hardware that supports this functionality.

> > Yes, that means the "incompat with reflink" issue needs to be fixed
> > before we take atomic writes out of experimental (i.e. we consistently
> > apply the same "full support" criteria we applied to DAX).
> 
> In the meantime, if mkfs auto-enables atomic writes (when the HW supports),
> what will it do to reflink feature (in terms of enabling)?

I didn't say we should always "auto-enable atomic writes".

I said if the hardware is atomic write capable, then mkfs should
always *align the filesystem* to atomic write constraints.  A kernel
upgrade will eventually allow reflink and atomic writes to co-exist,
but only if the filesystem is correctly aligned to the hardware
constrains for atomic writes. We need to ensure we leave that
upgrade path open....

.... and only once we have full support can we make "mkfs
auto-enable atomic writes".

-Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v4 00/14] forcealign for xfs
  2024-09-23  2:57                   ` Dave Chinner
@ 2024-09-23  3:33                     ` Christoph Hellwig
  2024-09-23  8:16                       ` John Garry
  2024-09-23  8:00                     ` John Garry
  1 sibling, 1 reply; 58+ messages in thread
From: Christoph Hellwig @ 2024-09-23  3:33 UTC (permalink / raw)
  To: Dave Chinner
  Cc: John Garry, Ritesh Harjani, chandan.babu, djwong, dchinner, hch,
	viro, brauner, jack, linux-xfs, linux-kernel, linux-fsdevel,
	catherine.hoang, martin.petersen

On Mon, Sep 23, 2024 at 12:57:32PM +1000, Dave Chinner wrote:
> Ok, but that's not going to be widespread. Very little storage
> hardware out there supports atomic writes - the vast majority of
> deployments will be new hardware that will have mkfs run on it.

Just about every enterprise NVMe SSD supports atomic write size
larger than a single LBA, because it is completely natural fallout
from FTL deѕign.  That beeing said to support those SSDs a block
size of 16 or 32k would be a lot more natural than all the forcealign
madness.


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v4 00/14] forcealign for xfs
  2024-09-23  3:33                     ` Christoph Hellwig
@ 2024-09-23  8:16                       ` John Garry
  2024-09-23 12:07                         ` Christoph Hellwig
  0 siblings, 1 reply; 58+ messages in thread
From: John Garry @ 2024-09-23  8:16 UTC (permalink / raw)
  To: Christoph Hellwig, Dave Chinner
  Cc: Ritesh Harjani, chandan.babu, djwong, dchinner, viro, brauner,
	jack, linux-xfs, linux-kernel, linux-fsdevel, catherine.hoang,
	martin.petersen

On 23/09/2024 04:33, Christoph Hellwig wrote:
> On Mon, Sep 23, 2024 at 12:57:32PM +1000, Dave Chinner wrote:
>> Ok, but that's not going to be widespread. Very little storage
>> hardware out there supports atomic writes - the vast majority of
>> deployments will be new hardware that will have mkfs run on it.
> 
> Just about every enterprise NVMe SSD supports atomic write size
> larger than a single LBA, because it is completely natural fallout
> from FTL deѕign.  That beeing said to support those SSDs a block
> size of 16 or 32k would be a lot more natural than all the forcealign
> madness.
> 

Outside the block allocator changes, most changes for forcealign are 
just refactoring the RT big alloc unit checks. So - as you have said 
previously - this so-called madness is already there. How can the sanity 
be improved?

To me, yes, there are so many "if (RT)" checks and special cases in the 
code, which makes a maintenance headache.

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v4 00/14] forcealign for xfs
  2024-09-23  8:16                       ` John Garry
@ 2024-09-23 12:07                         ` Christoph Hellwig
  2024-09-23 12:33                           ` John Garry
  0 siblings, 1 reply; 58+ messages in thread
From: Christoph Hellwig @ 2024-09-23 12:07 UTC (permalink / raw)
  To: John Garry
  Cc: Christoph Hellwig, Dave Chinner, Ritesh Harjani, chandan.babu,
	djwong, dchinner, viro, brauner, jack, linux-xfs, linux-kernel,
	linux-fsdevel, catherine.hoang, martin.petersen

On Mon, Sep 23, 2024 at 09:16:22AM +0100, John Garry wrote:
> Outside the block allocator changes, most changes for forcealign are just 
> refactoring the RT big alloc unit checks. So - as you have said previously 
> - this so-called madness is already there. How can the sanity be improved?

As a first step by not making it worse, and that not only means not
spreading the rtextent stuff further, but more importantly not introducing
additional complexities by requiring to be able to write over the
written/unwritten boundaries created by either rtextentsize > 1 or
the forcealign stuff if you actually want atomic writes.

> To me, yes, there are so many "if (RT)" checks and special cases in the 
> code, which makes a maintenance headache.

Replacing them with a different condition doesn't really make that
any better.

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v4 00/14] forcealign for xfs
  2024-09-23 12:07                         ` Christoph Hellwig
@ 2024-09-23 12:33                           ` John Garry
  2024-09-24  6:17                             ` Christoph Hellwig
  0 siblings, 1 reply; 58+ messages in thread
From: John Garry @ 2024-09-23 12:33 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Dave Chinner, Ritesh Harjani, chandan.babu, djwong, dchinner,
	viro, brauner, jack, linux-xfs, linux-kernel, linux-fsdevel,
	catherine.hoang, martin.petersen

On 23/09/2024 13:07, Christoph Hellwig wrote:
> On Mon, Sep 23, 2024 at 09:16:22AM +0100, John Garry wrote:
>> Outside the block allocator changes, most changes for forcealign are just
>> refactoring the RT big alloc unit checks. So - as you have said previously
>> - this so-called madness is already there. How can the sanity be improved?
> 
> As a first step by not making it worse, and that not only means not
> spreading the rtextent stuff further,

I assume that refactoring rtextent into "big alloc unit" is spreading 
(rtextent stuff), right? If so, what other solution? CoW?

> but more importantly not introducing
> additional complexities by requiring to be able to write over the
> written/unwritten boundaries created by either rtextentsize > 1 or
> the forcealign stuff if you actually want atomic writes.

The very original solution required a single mapping and in written 
state for atomic writes. Reverting to that would save a lot of hassle in 
the kernel. It just means that the user needs to manually pre-zero.

> 
>> To me, yes, there are so many "if (RT)" checks and special cases in the
>> code, which makes a maintenance headache.
> 
> Replacing them with a different condition doesn't really make that
> any better.

I am just saying that the rtextent stuff is not nice, but it is not 
going away. I suppose a tiny perk is that "big alloc unit" checks are 
better than "if (rt)" checks, as it makes the condition slightly more 
obvious.

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v4 00/14] forcealign for xfs
  2024-09-23 12:33                           ` John Garry
@ 2024-09-24  6:17                             ` Christoph Hellwig
  2024-09-24  9:48                               ` John Garry
  0 siblings, 1 reply; 58+ messages in thread
From: Christoph Hellwig @ 2024-09-24  6:17 UTC (permalink / raw)
  To: John Garry
  Cc: Christoph Hellwig, Dave Chinner, Ritesh Harjani, chandan.babu,
	djwong, dchinner, viro, brauner, jack, linux-xfs, linux-kernel,
	linux-fsdevel, catherine.hoang, martin.petersen

On Mon, Sep 23, 2024 at 01:33:12PM +0100, John Garry wrote:
>> As a first step by not making it worse, and that not only means not
>> spreading the rtextent stuff further,
>
> I assume that refactoring rtextent into "big alloc unit" is spreading 
> (rtextent stuff), right? If so, what other solution? CoW?

Well, if you look at the force align series you'd agree that it
spreads the thing out into the btree allocator.  Or do I misread it?

>
>> but more importantly not introducing
>> additional complexities by requiring to be able to write over the
>> written/unwritten boundaries created by either rtextentsize > 1 or
>> the forcealign stuff if you actually want atomic writes.
>
> The very original solution required a single mapping and in written state 
> for atomic writes. Reverting to that would save a lot of hassle in the 
> kernel. It just means that the user needs to manually pre-zero.

What atomic I/O sizes do your users require?  Would they fit into
a large sector size now supported by XFS (i.e. 32k for now).


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v4 00/14] forcealign for xfs
  2024-09-24  6:17                             ` Christoph Hellwig
@ 2024-09-24  9:48                               ` John Garry
  2024-11-29 11:36                                 ` John Garry
  0 siblings, 1 reply; 58+ messages in thread
From: John Garry @ 2024-09-24  9:48 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Dave Chinner, Ritesh Harjani, chandan.babu, djwong, dchinner,
	viro, brauner, jack, linux-xfs, linux-kernel, linux-fsdevel,
	catherine.hoang, martin.petersen

On 24/09/2024 07:17, Christoph Hellwig wrote:
> On Mon, Sep 23, 2024 at 01:33:12PM +0100, John Garry wrote:
>>> As a first step by not making it worse, and that not only means not
>>> spreading the rtextent stuff further,
>>
>> I assume that refactoring rtextent into "big alloc unit" is spreading
>> (rtextent stuff), right? If so, what other solution? CoW?
> 
> Well, if you look at the force align series you'd agree that it
> spreads the thing out into the btree allocator.  Or do I misread it?

Yes, there are more changes than just refactoring "big alloc unit" 
stuff. There are btree allocator changes.

About those btree allocator changes, strictly speaking there are just a 
couple of changes to provide forcealign support - the rest are prep 
patches. And those forcealign changes build on pre-existing allocator 
features, like extent alignment and length specifiers.

> 
>>
>>> but more importantly not introducing
>>> additional complexities by requiring to be able to write over the
>>> written/unwritten boundaries created by either rtextentsize > 1 or
>>> the forcealign stuff if you actually want atomic writes.
>>
>> The very original solution required a single mapping and in written state
>> for atomic writes. Reverting to that would save a lot of hassle in the
>> kernel. It just means that the user needs to manually pre-zero.
> 
> What atomic I/O sizes do your users require?  Would they fit into
> a large sector size now supported by XFS (i.e. 32k for now).
> 

It could be used, but then we have 16KB filesystem block size, which 
some just may not want. And we just don't want 16KB sector size, but I 
don't think that we require that if we use RWF_ATOMIC.

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v4 00/14] forcealign for xfs
  2024-09-24  9:48                               ` John Garry
@ 2024-11-29 11:36                                 ` John Garry
  0 siblings, 0 replies; 58+ messages in thread
From: John Garry @ 2024-11-29 11:36 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Dave Chinner, Ritesh Harjani, chandan.babu, djwong, dchinner,
	viro, brauner, jack, linux-xfs, linux-kernel, linux-fsdevel,
	catherine.hoang, martin.petersen

On 24/09/2024 10:48, John Garry wrote:
>>
>>>
>>>> but more importantly not introducing
>>>> additional complexities by requiring to be able to write over the
>>>> written/unwritten boundaries created by either rtextentsize > 1 or
>>>> the forcealign stuff if you actually want atomic writes.
>>>
>>> The very original solution required a single mapping and in written 
>>> state
>>> for atomic writes. Reverting to that would save a lot of hassle in the
>>> kernel. It just means that the user needs to manually pre-zero.
>>
>> What atomic I/O sizes do your users require?  Would they fit into
>> a large sector size now supported by XFS (i.e. 32k for now).
>>
> 
> It could be used, but then we have 16KB filesystem block size, which 
> some just may not want. And we just don't want 16KB sector size, but I 
> don't think that we require that if we use RWF_ATOMIC.

Hi Christoph,

I want to come back to this topic of forcealign.

We have been doing much MySQL workload testing with following 
configurations:
a. 4k FS blocksize and RT 16K rextsize
b. 4k FS blocksize and forcealign 16K extsize
c. 16K FS blocksize

a. and b. show comparable performance, with b. slightly better overall. 
Generally c. shows significantly slower performance at lower thread 
count/lower load testing. We put that down to MySQL REDO log write 
amplification from larger FS blocksize. At higher loads, performance is 
comparable.

So we tried:
d. 4K FS blocksize for REDO log on 1x partition and 16K FS blocksize for 
DB pagesize atomic writes on 1x partition

For d., performance was good and comparable to a. and b., if not overall 
a bit better.

Unfortunately d. does not allow us to do a single FS snapshot, so not of 
much value for production.

I was talking to Martin on this log write topic, and he considers that 
there are many other scenarios where a larger FS blocksize can affect 
log writes latency, so quite undesirable (to have a large FS blocksize).

So we would still like to try for forcealign.

However, enabling large atomic writes for rtvol is quite simple and 
there would be overlap with enabling large atomic writes for forcealign 
- see 
https://github.com/johnpgarry/linux/commits/atomic-write-large-atomics-pre-v6.13/ 
- so I am thinking of trying for that first.

Let me know what you think.

Thanks,
John

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v4 00/14] forcealign for xfs
  2024-09-23  2:57                   ` Dave Chinner
  2024-09-23  3:33                     ` Christoph Hellwig
@ 2024-09-23  8:00                     ` John Garry
  1 sibling, 0 replies; 58+ messages in thread
From: John Garry @ 2024-09-23  8:00 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Ritesh Harjani, chandan.babu, djwong, dchinner, hch, viro,
	brauner, jack, linux-xfs, linux-kernel, linux-fsdevel,
	catherine.hoang, martin.petersen

On 23/09/2024 03:57, Dave Chinner wrote:
>> In the meantime, if mkfs auto-enables atomic writes (when the HW supports),
>> what will it do to reflink feature (in terms of enabling)?
> I didn't say we should always "auto-enable atomic writes".
> 
> I said if the hardware is atomic write capable, then mkfs should
> always*align the filesystem* to atomic write constraints.  A kernel
> upgrade will eventually allow reflink and atomic writes to co-exist,
> but only if the filesystem is correctly aligned to the hardware
> constrains for atomic writes. We need to ensure we leave that
> upgrade path open....
> 
> .... and only once we have full support can we make "mkfs
> auto-enable atomic writes".

ok, fine. The current maximum value of atomic write unit max is 512KB 
(assuming 4K PAGE_SIZE and 512B sector size), so that should not be too 
needlessly inefficient for laying out the AGs. However, for 16KB+ 
PAGE_SIZE, that value could naturally be larger. However having HW which 
supports such large atomics would be very unlikely.

Thanks,
John


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v4 00/14] forcealign for xfs
  2024-09-04 23:20   ` Dave Chinner
  2024-09-05  3:56     ` Ritesh Harjani
@ 2024-09-05 10:15     ` John Garry
  2024-09-05 21:47       ` Dave Chinner
  1 sibling, 1 reply; 58+ messages in thread
From: John Garry @ 2024-09-05 10:15 UTC (permalink / raw)
  To: Dave Chinner, Ritesh Harjani
  Cc: chandan.babu, djwong, dchinner, hch, viro, brauner, jack,
	linux-xfs, linux-kernel, linux-fsdevel, catherine.hoang,
	martin.petersen


> 
>> 1. As I understand XFS untorn atomic write support is built on top
>> of FORCEALIGN feature (which this series is adding) which in turn
>> uses extsize hint feature underneath.
> 
> Yes.
> 
>>     Now extsize hint mainly controls the alignment of both
>>     "physical start" & "logical start" offset and extent length,
>>     correct?
> 
> Yes.

To be clear, only for atomic writes do we require physical block alignment.

> 
>>     This is done using args->alignment for start aand
>>     args->prod/mode variables for extent length. Correct?
> 
> Yes.
> 
>>     - If say we are not able to allocate an aligned physical start?
>>     Then since extsize is just a hint we go ahead with whatever
>>     best available extent is right?

---

> 
>>
>>     - Does the unmapping of extents also only happens in extsize
>>     chunks (with forcealign)?
> 
> Yes, via use of xfs_inode_alloc_unitsize() in the high level code
> aligning the fsbno ranges to be unmapped.
> 
> Remember, force align requires both logical file offset and
> physical block number to be correctly aligned, so unmap alignment
> has to be set up correctly at file offset level before we even know
> what extents underly the file range we need to unmap....
> 
>>       If the start or end of the extent which needs unmapping is
>>       unaligned then we convert that extent to unwritten and skip,
>>       is it? (__xfs_bunmapi())
> 
> The high level code should be aligning the start and end of the
> file range to be removed via xfs_inode_alloc_unitsize(). 

Is that the case for something like truncate? There we just say what is 
the end block which we want to truncate to in 
xfs_itruncate_extents_flags(new_size)  -> 
xfs_bunmapi_range(XFS_B_TO_FSB(new_size)), and that may not be alloc 
unit aligned.

> Hence
> the low level __xfs_bunmapi() code shouldn't ever be encountering
> unaligned unmaps on force-aligned inodes.


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v4 00/14] forcealign for xfs
  2024-09-05 10:15     ` John Garry
@ 2024-09-05 21:47       ` Dave Chinner
  2024-09-06 14:31         ` John Garry
  0 siblings, 1 reply; 58+ messages in thread
From: Dave Chinner @ 2024-09-05 21:47 UTC (permalink / raw)
  To: John Garry
  Cc: Ritesh Harjani, chandan.babu, djwong, dchinner, hch, viro,
	brauner, jack, linux-xfs, linux-kernel, linux-fsdevel,
	catherine.hoang, martin.petersen

On Thu, Sep 05, 2024 at 11:15:41AM +0100, John Garry wrote:
> > >     - Does the unmapping of extents also only happens in extsize
> > >     chunks (with forcealign)?
> > 
> > Yes, via use of xfs_inode_alloc_unitsize() in the high level code
> > aligning the fsbno ranges to be unmapped.
> > 
> > Remember, force align requires both logical file offset and
> > physical block number to be correctly aligned, so unmap alignment
> > has to be set up correctly at file offset level before we even know
> > what extents underly the file range we need to unmap....
> > 
> > >       If the start or end of the extent which needs unmapping is
> > >       unaligned then we convert that extent to unwritten and skip,
> > >       is it? (__xfs_bunmapi())
> > 
> > The high level code should be aligning the start and end of the
> > file range to be removed via xfs_inode_alloc_unitsize().
> 
> Is that the case for something like truncate? There we just say what is the
> end block which we want to truncate to in
> xfs_itruncate_extents_flags(new_size)  ->
> xfs_bunmapi_range(XFS_B_TO_FSB(new_size)), and that may not be alloc unit
> aligned.

Ah, I thought we had that alignment in xfs_itruncate_extents_flags()
already, but if we don't then that's a bug that needs to be fixed.

We change the space reservation in xfs-setattr_size() for this case
(patch 9) but then don't do any alignment there - it relies on
xfs_itruncate_extents_flags() to do the right thing w.r.t. extent
removal alignment w.r.t. the new EOF.

i.e. The xfs_setattr_size() code takes care of EOF block zeroing and
page cache removal so the user doesn't see old data beyond EOF,
whilst xfs_itruncate_extents_flags() is supposed to take care of the
extent removal and the details of that operation (e.g. alignment).

Patch 10 also modifies xfs_can_free_eofblocks() to take alignment
into account for the post-eof block removal, but doesn't change
xfs_free_eofblocks() at all. i.e  it also relies on
xfs_itruncate_extents_flags() to do the right thing for force
aligned inodes.

In this case, we are removing post-eof speculative preallocation
that that has been allocated by delalloc conversion during
writeback.  These post-eof extents will already be unwritten extents
because delalloc conversion uses unwritten extents to avoid
stale data exposure if we crash between allocation and the data
being written to the extents. Hence there should be no extents to
convert to unwritten in the majority of cases here.

The only case where we might get written extents beyond EOF is if
the file has been truncated down, but in that case we don't really
care because truncate should have already taken care of post-eof
extent alignment for us. xfs_can_free_eofblocks() will see this
extent alignment and so we'll skip xfs_free_eofblocks() in this case
altogether....

Hence xfs_free_eofblocks() should never need to convert a partial
unaligned extent range to unwritten when force-align is enabled
because the post-eof extents should already be unwritten. We also
want to leave the inode in the most optimal state for future
extension, which means we want the post-eof extent to be correctly
aligned.

Hence there are multiple reasons that xfs_itruncate_extents_flags()
should be aligning the post-EOF block it is starting the unmapping
at for force aligned allocation contexts. And in doing so, we remove
the weird corner case where we can have an unaligned extent state
boundary at EOF for atomic writes....

-Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v4 00/14] forcealign for xfs
  2024-09-05 21:47       ` Dave Chinner
@ 2024-09-06 14:31         ` John Garry
  2024-09-08 22:49           ` Dave Chinner
  0 siblings, 1 reply; 58+ messages in thread
From: John Garry @ 2024-09-06 14:31 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Ritesh Harjani, chandan.babu, djwong, dchinner, hch, viro,
	brauner, jack, linux-xfs, linux-kernel, linux-fsdevel,
	catherine.hoang, martin.petersen

On 05/09/2024 22:47, Dave Chinner wrote:
>>>>   If the start or end of the extent which needs unmapping is
>>>>        unaligned then we convert that extent to unwritten and skip,
>>>>        is it? (__xfs_bunmapi())
>>> The high level code should be aligning the start and end of the
>>> file range to be removed via xfs_inode_alloc_unitsize().
>> Is that the case for something like truncate? There we just say what is the
>> end block which we want to truncate to in
>> xfs_itruncate_extents_flags(new_size)  ->
>> xfs_bunmapi_range(XFS_B_TO_FSB(new_size)), and that may not be alloc unit
>> aligned.
> Ah, I thought we had that alignment in xfs_itruncate_extents_flags()
> already, but if we don't then that's a bug that needs to be fixed.

AFAICS, forcealign behaviour is same as RT, so then this would be a 
mainline bug, right?

 > > We change the space reservation in xfs-setattr_size() for this case
> (patch 9) but then don't do any alignment there - it relies on
> xfs_itruncate_extents_flags() to do the right thing w.r.t. extent
> removal alignment w.r.t. the new EOF.
> 
> i.e. The xfs_setattr_size() code takes care of EOF block zeroing and
> page cache removal so the user doesn't see old data beyond EOF,
> whilst xfs_itruncate_extents_flags() is supposed to take care of the
> extent removal and the details of that operation (e.g. alignment).

So we should roundup the unmap block to the alloc unit, correct? I have 
my doubts about that, and thought that xfs_bunmapi_range() takes care of 
any alignment handling.

> 
> Patch 10 also modifies xfs_can_free_eofblocks() to take alignment
> into account for the post-eof block removal, but doesn't change
> xfs_free_eofblocks() at all. i.e  it also relies on
> xfs_itruncate_extents_flags() to do the right thing for force
> aligned inodes.

What state should the blocks post-EOF blocks be? A simple example of 
partially truncating an alloc unit is:

$xfs_io -c "extsize" mnt/file
[16384] mnt/file


$xfs_bmap -vvp mnt/file
mnt/file:
  EXT: FILE-OFFSET      BLOCK-RANGE      AG AG-OFFSET        TOTAL FLAGS
    0: [0..20479]:      192..20671        0 (192..20671)     20480 000000


$truncate -s 10461184 mnt/file # 10M - 6FSB

$xfs_bmap -vvp mnt/file
mnt/file:
  EXT: FILE-OFFSET      BLOCK-RANGE      AG AG-OFFSET        TOTAL FLAGS
    0: [0..20431]:      192..20623        0 (192..20623)     20432 000000
    1: [20432..20447]:  20624..20639      0 (20624..20639)      16 010000
  FLAG Values:
     0010000 Unwritten preallocated extent

Is that incorrect state?

> 
> In this case, we are removing post-eof speculative preallocation
> that that has been allocated by delalloc conversion during
> writeback.  These post-eof extents will already be unwritten extents
> because delalloc conversion uses unwritten extents to avoid
> stale data exposure if we crash between allocation and the data
> being written to the extents. Hence there should be no extents to
> convert to unwritten in the majority of cases here.
> 
> The only case where we might get written extents beyond EOF is if
> the file has been truncated down, but in that case we don't really
> care because truncate should have already taken care of post-eof
> extent alignment for us. xfs_can_free_eofblocks() will see this
> extent alignment and so we'll skip xfs_free_eofblocks() in this case
> altogether....
> 
> Hence xfs_free_eofblocks() should never need to convert a partial
> unaligned extent range to unwritten when force-align is enabled
> because the post-eof extents should already be unwritten. We also
> want to leave the inode in the most optimal state for future
> extension, which means we want the post-eof extent to be correctly
> aligned.
> 
> Hence there are multiple reasons that xfs_itruncate_extents_flags()
> should be aligning the post-EOF block it is starting the unmapping
> at for force aligned allocation contexts. And in doing so, we remove
> the weird corner case where we can have an unaligned extent state
> boundary at EOF for atomic writes....

Yeah, I don't think that sub-alloc unit extent zeroing would help us 
there, as we not be dealing with a new extent (for zeroing to occur).

Thanks,
John


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v4 00/14] forcealign for xfs
  2024-09-06 14:31         ` John Garry
@ 2024-09-08 22:49           ` Dave Chinner
  2024-09-09 16:18             ` John Garry
  0 siblings, 1 reply; 58+ messages in thread
From: Dave Chinner @ 2024-09-08 22:49 UTC (permalink / raw)
  To: John Garry
  Cc: Ritesh Harjani, chandan.babu, djwong, dchinner, hch, viro,
	brauner, jack, linux-xfs, linux-kernel, linux-fsdevel,
	catherine.hoang, martin.petersen

On Fri, Sep 06, 2024 at 03:31:43PM +0100, John Garry wrote:
> On 05/09/2024 22:47, Dave Chinner wrote:
> > > > >   If the start or end of the extent which needs unmapping is
> > > > >        unaligned then we convert that extent to unwritten and skip,
> > > > >        is it? (__xfs_bunmapi())
> > > > The high level code should be aligning the start and end of the
> > > > file range to be removed via xfs_inode_alloc_unitsize().
> > > Is that the case for something like truncate? There we just say what is the
> > > end block which we want to truncate to in
> > > xfs_itruncate_extents_flags(new_size)  ->
> > > xfs_bunmapi_range(XFS_B_TO_FSB(new_size)), and that may not be alloc unit
> > > aligned.
> > Ah, I thought we had that alignment in xfs_itruncate_extents_flags()
> > already, but if we don't then that's a bug that needs to be fixed.
> 
> AFAICS, forcealign behaviour is same as RT, so then this would be a mainline
> bug, right?

No, I don't think so. I think this is a case where forcealign and RT
behaviours differ, primarily because RT doesn't actually care about
file offset -> physical offset alignment and can do unaligned IO
whenever it wants. Hence having an unaligned written->unwritten
extent state transition doesnt' affect anything for rt files...

> 
> > > We change the space reservation in xfs-setattr_size() for this case
> > (patch 9) but then don't do any alignment there - it relies on
> > xfs_itruncate_extents_flags() to do the right thing w.r.t. extent
> > removal alignment w.r.t. the new EOF.
> > 
> > i.e. The xfs_setattr_size() code takes care of EOF block zeroing and
> > page cache removal so the user doesn't see old data beyond EOF,
> > whilst xfs_itruncate_extents_flags() is supposed to take care of the
> > extent removal and the details of that operation (e.g. alignment).
> 
> So we should roundup the unmap block to the alloc unit, correct? I have my
> doubts about that, and thought that xfs_bunmapi_range() takes care of any
> alignment handling.

The start should round up, yes. And, no, xfs_bunmapi_range() isn't
specifically "taking care" of alignment. It's just marking a partial
alloc unit range as unwritten, which means we now have -unaligned
extents- in the BMBT. 

> 
> > 
> > Patch 10 also modifies xfs_can_free_eofblocks() to take alignment
> > into account for the post-eof block removal, but doesn't change
> > xfs_free_eofblocks() at all. i.e  it also relies on
> > xfs_itruncate_extents_flags() to do the right thing for force
> > aligned inodes.
> 
> What state should the blocks post-EOF blocks be? A simple example of
> partially truncating an alloc unit is:
> 
> $xfs_io -c "extsize" mnt/file
> [16384] mnt/file
> 
> 
> $xfs_bmap -vvp mnt/file
> mnt/file:
>  EXT: FILE-OFFSET      BLOCK-RANGE      AG AG-OFFSET        TOTAL FLAGS
>    0: [0..20479]:      192..20671        0 (192..20671)     20480 000000
> 
> 
> $truncate -s 10461184 mnt/file # 10M - 6FSB
> 
> $xfs_bmap -vvp mnt/file
> mnt/file:
>  EXT: FILE-OFFSET      BLOCK-RANGE      AG AG-OFFSET        TOTAL FLAGS
>    0: [0..20431]:      192..20623        0 (192..20623)     20432 000000
>    1: [20432..20447]:  20624..20639      0 (20624..20639)      16 010000
>  FLAG Values:
>     0010000 Unwritten preallocated extent
> 
> Is that incorrect state?

Think about it: what happens if you now truncate it back up to 10MB
(i.e. aligned length) and then do an aligned atomic write on it.

First: What happens when you truncate up?

......

Yes, iomap_zero_range() will see the unwritten extent and skip it.
i.e. The unwritten extent stays as an unwritten extent, it's now
within EOF. That written->unwritten extent boundary is not on an
aligned file offset.

Second: What happens when you do a correctly aligned atomic write
that spans this range now?

......

Iomap only maps a single extent at a time, so it will only map the
written range from the start of the IO (aligned) to the start of the
unwritten extent (unaligned).  Hence the atomic write will be
rejected because we can't do the atomic write to such an unaligned
extent.

That's not a bug in the atomic write path - this failure occurs
because of the truncate behaviour doing post-eof unwritten extent
conversion....

Yes, I agree that the entire -physical- extent is still correctly
aligned on disk so you could argue that the unwritten conversion
that xfs_bunmapi_range is doing is valid forced alignment behaviour.
However, the fact is that breaking the aligned physical extent into
two unaligned contiguous extents in different states in the BMBT
means that they are treated as two seperate unaligned extents, not
one contiguous aligned physical extent.

This extent state discontiunity is results in breaking physical IO
across the extent state boundary. Hence such an unaligned extent
state change violates the physical IO alignment guarantees that
forced alignment is supposed to provide atomic writes...

This is the reason why operations like hole punching round to extent
size for forced alignment at the highest level. i.e. We cannot allow
xfs_bunmapi_range() to "take care" of alignment handling because
managing partial extent unmappings with sub-alloc-unit unwritten
extents does not work for forced alignment.

Again, this is different to the traditional RT file behaviour - it
can use unwritten extents for sub-alloc-unit alignment unmaps
because the RT device can align file offset to any physical offset,
and issue unaligned sector sized IO without any restrictions. Forced
alignment does not have this freedom, and when we extend forced
alignment to RT files, it will not have the freedom to use
unwritten extents for sub-alloc-unit unmapping, either.

-Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v4 00/14] forcealign for xfs
  2024-09-08 22:49           ` Dave Chinner
@ 2024-09-09 16:18             ` John Garry
  2024-09-16  5:25               ` Dave Chinner
  0 siblings, 1 reply; 58+ messages in thread
From: John Garry @ 2024-09-09 16:18 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Ritesh Harjani, chandan.babu, djwong, dchinner, hch, viro,
	brauner, jack, linux-xfs, linux-kernel, linux-fsdevel,
	catherine.hoang, martin.petersen

>>
>> AFAICS, forcealign behaviour is same as RT, so then this would be a mainline
>> bug, right?
> 
> No, I don't think so. I think this is a case where forcealign and RT
> behaviours differ, primarily because RT doesn't actually care about
> file offset -> physical offset alignment and can do unaligned IO
> whenever it wants. Hence having an unaligned written->unwritten
> extent state transition doesnt' affect anything for rt files...

ok

> 
>>
>>>> We change the space reservation in xfs-setattr_size() for this case
>>> (patch 9) but then don't do any alignment there - it relies on
>>> xfs_itruncate_extents_flags() to do the right thing w.r.t. extent
>>> removal alignment w.r.t. the new EOF.
>>>
>>> i.e. The xfs_setattr_size() code takes care of EOF block zeroing and
>>> page cache removal so the user doesn't see old data beyond EOF,
>>> whilst xfs_itruncate_extents_flags() is supposed to take care of the
>>> extent removal and the details of that operation (e.g. alignment).
>>
>> So we should roundup the unmap block to the alloc unit, correct? I have my
>> doubts about that, and thought that xfs_bunmapi_range() takes care of any
>> alignment handling.
> 
> The start should round up, yes. And, no, xfs_bunmapi_range() isn't
> specifically "taking care" of alignment. It's just marking a partial
> alloc unit range as unwritten, which means we now have -unaligned
> extents- in the BMBT.

Yes, I have noticed this previously.

> 
>>
>>>
>>> Patch 10 also modifies xfs_can_free_eofblocks() to take alignment
>>> into account for the post-eof block removal, but doesn't change
>>> xfs_free_eofblocks() at all. i.e  it also relies on
>>> xfs_itruncate_extents_flags() to do the right thing for force
>>> aligned inodes.
>>
>> What state should the blocks post-EOF blocks be? A simple example of
>> partially truncating an alloc unit is:
>>
>> $xfs_io -c "extsize" mnt/file
>> [16384] mnt/file
>>
>>
>> $xfs_bmap -vvp mnt/file
>> mnt/file:
>>   EXT: FILE-OFFSET      BLOCK-RANGE      AG AG-OFFSET        TOTAL FLAGS
>>     0: [0..20479]:      192..20671        0 (192..20671)     20480 000000
>>
>>
>> $truncate -s 10461184 mnt/file # 10M - 6FSB
>>
>> $xfs_bmap -vvp mnt/file
>> mnt/file:
>>   EXT: FILE-OFFSET      BLOCK-RANGE      AG AG-OFFSET        TOTAL FLAGS
>>     0: [0..20431]:      192..20623        0 (192..20623)     20432 000000
>>     1: [20432..20447]:  20624..20639      0 (20624..20639)      16 010000
>>   FLAG Values:
>>      0010000 Unwritten preallocated extent
>>
>> Is that incorrect state?
> 
> Think about it: what happens if you now truncate it back up to 10MB
> (i.e. aligned length) and then do an aligned atomic write on it.
> 
> First: What happens when you truncate up?
> 
> ......
> 
> Yes, iomap_zero_range() will see the unwritten extent and skip it.
> i.e. The unwritten extent stays as an unwritten extent, it's now
> within EOF. That written->unwritten extent boundary is not on an
> aligned file offset.

Right

> 
> Second: What happens when you do a correctly aligned atomic write
> that spans this range now?
> 
> ......
> 
> Iomap only maps a single extent at a time, so it will only map the
> written range from the start of the IO (aligned) to the start of the
> unwritten extent (unaligned).  Hence the atomic write will be
> rejected because we can't do the atomic write to such an unaligned
> extent.

It was being considered to change this handling for atomic writes. More 
below at *.

> 
> That's not a bug in the atomic write path - this failure occurs
> because of the truncate behaviour doing post-eof unwritten extent
> conversion....
> 
> Yes, I agree that the entire -physical- extent is still correctly
> aligned on disk so you could argue that the unwritten conversion
> that xfs_bunmapi_range is doing is valid forced alignment behaviour.
> However, the fact is that breaking the aligned physical extent into
> two unaligned contiguous extents in different states in the BMBT
> means that they are treated as two seperate unaligned extents, not
> one contiguous aligned physical extent.

Right, this is problematic.

* I guess that you had not been following the recent discussion on this 
topic in the latest xfs atomic writes series @ 
https://lore.kernel.org/linux-xfs/20240817094800.776408-1-john.g.garry@oracle.com/ 
and also mentioned earlier in 
https://lore.kernel.org/linux-xfs/20240726171358.GA27612@lst.de/

There I dropped the sub-alloc unit zeroing. The concept to iter for a 
single bio seems sane, but as Darrick mentioned, we have issue of 
non-atomically committing all the extent conversions.

FWIW, this is how I think that the change according to Darrick's idea 
would look:

---->8----

diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
index 72c981e3dc92..ec76d6a8750d 100644
--- a/fs/xfs/xfs_iomap.c
+++ b/fs/xfs/xfs_iomap.c
@@ -244,7 +244,8 @@ xfs_iomap_write_direct(
  	xfs_fileoff_t		count_fsb,
  	unsigned int		flags,
  	struct xfs_bmbt_irec	*imap,
-	u64			*seq)
+	u64			*seq,
+	bool			zeroing)
  {
  	struct xfs_mount	*mp = ip->i_mount;
  	struct xfs_trans	*tp;
@@ -285,7 +286,7 @@ xfs_iomap_write_direct(
  	 * the reserve block pool for bmbt block allocation if there is no space
  	 * left but we need to do unwritten extent conversion.
  	 */
+	if (zeroing)
+		bmapi_flags = XFS_BMAPI_CONVERT | XFS_BMAPI_ZERO;
	if (flags & IOMAP_DAX) {
  		bmapi_flags = XFS_BMAPI_CONVERT | XFS_BMAPI_ZERO;
  		if (imap->br_state == XFS_EXT_UNWRITTEN) {
  			force = true;
@@ -780,6 +781,11 @@ xfs_direct_write_iomap_begin(
  	u16			iomap_flags = 0;
  	unsigned int		lockmode;
  	u64			seq;
+	bool			atomic = flags & IOMAP_ATOMIC;
+	struct xfs_bmbt_irec	imap2[XFS_BMAP_MAX_NMAP];
+	xfs_fileoff_t		_offset_fsb = xfs_inode_rounddown_alloc_unit(ip, 
offset_fsb);
+	xfs_fileoff_t		_end_fsb = xfs_inode_roundup_alloc_unit(ip, end_fsb);
+	int loop_index;

  	ASSERT(flags & (IOMAP_WRITE | IOMAP_ZERO));

@@ -843,6 +849,9 @@ xfs_direct_write_iomap_begin(
  	if (imap_needs_alloc(inode, flags, &imap, nimaps))
  		goto allocate_blocks;

+	if (atomic && !imap_spans_range(&imap, offset_fsb, end_fsb))
+		goto out_atomic;
+
  	/*
  	 * NOWAIT and OVERWRITE I/O needs to span the entire requested I/O with
  	 * a single map so that we avoid partial IO failures due to the rest of
@@ -897,7 +906,7 @@ xfs_direct_write_iomap_begin(
  	xfs_iunlock(ip, lockmode);

  	error = xfs_iomap_write_direct(ip, offset_fsb, end_fsb - offset_fsb,
-			flags, &imap, &seq);
+			flags, &imap, &seq, false);
  	if (error)
  		return error;

@@ -918,6 +927,28 @@ xfs_direct_write_iomap_begin(
  	xfs_iunlock(ip, lockmode);
  	return xfs_bmbt_to_iomap(ip, iomap, &cmap, flags, IOMAP_F_SHARED, seq);

+out_atomic:
+	nimaps = XFS_BMAP_MAX_NMAP;
+
+	error = xfs_bmapi_read(ip, _offset_fsb, _end_fsb - _offset_fsb, &imap2[0],
+			       &nimaps, 0);
+	for (loop_index = 0; loop_index < nimaps; loop_index++) {
+		struct xfs_bmbt_irec *_imap2 = &imap2[loop_index];
+
+		if (_imap2->br_state == XFS_EXT_UNWRITTEN) {
+
+			xfs_iunlock(ip, lockmode);
+
+			error = xfs_iomap_write_direct(ip, _imap2->br_startoff, 
_imap2->br_blockcount,
+					flags, &imap, &seq, true);
+			if (error)
+				return error;
+			goto relock;
+		}
+	}
+	/* We should not reach this, but TODO better handling */
+	error = -EINVAL;
+
  out_unlock:
  	if (lockmode)
  		xfs_iunlock(ip, lockmode);

-----8<----

But I have my doubts about using XFS_BMAPI_ZERO, even if it is just for 
pre-zeroing unwritten parts of the alloc unit for an atomic write.

> 
> This extent state discontiunity is results in breaking physical IO
> across the extent state boundary. Hence such an unaligned extent
> state change violates the physical IO alignment guarantees that
> forced alignment is supposed to provide atomic writes...

Yes

> 
> This is the reason why operations like hole punching round to extent
> size for forced alignment at the highest level. i.e. We cannot allow
> xfs_bunmapi_range() to "take care" of alignment handling because
> managing partial extent unmappings with sub-alloc-unit unwritten
> extents does not work for forced alignment.
> 
> Again, this is different to the traditional RT file behaviour - it
> can use unwritten extents for sub-alloc-unit alignment unmaps
> because the RT device can align file offset to any physical offset,
> and issue unaligned sector sized IO without any restrictions. Forced
> alignment does not have this freedom, and when we extend forced
> alignment to RT files, it will not have the freedom to use
> unwritten extents for sub-alloc-unit unmapping, either.
> 
So how do you think that we should actually implement 
xfs_itruncate_extents_flags() properly for forcealign? Would it simply 
be like:

--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -1050,7 +1050,7 @@ xfs_itruncate_extents_flags(
                 WARN_ON_ONCE(first_unmap_block > XFS_MAX_FILEOFF);
                 return 0;
         }
+	if (xfs_inode_has_forcealign(ip))
+	       first_unmap_block = xfs_inode_roundup_alloc_unit(ip, 
first_unmap_block);
         error = xfs_bunmapi_range(&tp, ip, flags, first_unmap_block,

Thanks,
John


^ permalink raw reply related	[flat|nested] 58+ messages in thread

* Re: [PATCH v4 00/14] forcealign for xfs
  2024-09-09 16:18             ` John Garry
@ 2024-09-16  5:25               ` Dave Chinner
  2024-09-16  9:44                 ` John Garry
  0 siblings, 1 reply; 58+ messages in thread
From: Dave Chinner @ 2024-09-16  5:25 UTC (permalink / raw)
  To: John Garry
  Cc: Ritesh Harjani, chandan.babu, djwong, dchinner, hch, viro,
	brauner, jack, linux-xfs, linux-kernel, linux-fsdevel,
	catherine.hoang, martin.petersen

On Mon, Sep 09, 2024 at 05:18:43PM +0100, John Garry wrote:
> > > > Patch 10 also modifies xfs_can_free_eofblocks() to take alignment
> > > > into account for the post-eof block removal, but doesn't change
> > > > xfs_free_eofblocks() at all. i.e  it also relies on
> > > > xfs_itruncate_extents_flags() to do the right thing for force
> > > > aligned inodes.
> > > 
> > > What state should the blocks post-EOF blocks be? A simple example of
> > > partially truncating an alloc unit is:
> > > 
> > > $xfs_io -c "extsize" mnt/file
> > > [16384] mnt/file
> > > 
> > > 
> > > $xfs_bmap -vvp mnt/file
> > > mnt/file:
> > >   EXT: FILE-OFFSET      BLOCK-RANGE      AG AG-OFFSET        TOTAL FLAGS
> > >     0: [0..20479]:      192..20671        0 (192..20671)     20480 000000
> > > 
> > > 
> > > $truncate -s 10461184 mnt/file # 10M - 6FSB
> > > 
> > > $xfs_bmap -vvp mnt/file
> > > mnt/file:
> > >   EXT: FILE-OFFSET      BLOCK-RANGE      AG AG-OFFSET        TOTAL FLAGS
> > >     0: [0..20431]:      192..20623        0 (192..20623)     20432 000000
> > >     1: [20432..20447]:  20624..20639      0 (20624..20639)      16 010000
> > >   FLAG Values:
> > >      0010000 Unwritten preallocated extent
> > > 
> > > Is that incorrect state?
> > 
> > Think about it: what happens if you now truncate it back up to 10MB
> > (i.e. aligned length) and then do an aligned atomic write on it.
> > 
> > First: What happens when you truncate up?
> > 
> > ......
> > 
> > Yes, iomap_zero_range() will see the unwritten extent and skip it.
> > i.e. The unwritten extent stays as an unwritten extent, it's now
> > within EOF. That written->unwritten extent boundary is not on an
> > aligned file offset.
> 
> Right
> 
> > 
> > Second: What happens when you do a correctly aligned atomic write
> > that spans this range now?
> > 
> > ......
> > 
> > Iomap only maps a single extent at a time, so it will only map the
> > written range from the start of the IO (aligned) to the start of the
> > unwritten extent (unaligned).  Hence the atomic write will be
> > rejected because we can't do the atomic write to such an unaligned
> > extent.
> 
> It was being considered to change this handling for atomic writes. More
> below at *.

I don't think that this is something specific to atomic writes -
forced alignment means -alignment is guaranteed- regardless of what
ends up using it.

Yes, we can track unwritten extents on an -unaligned- boundary, but
that doesn't mean that we should allow it when we are trying to
guarantee logical and physical alignment of the file offset and
extent boundaries. i.e. The definition of forced alignment behaviour
is that all file offsets and extents in the file are aligned to the
same alignment.

I don't see an exception that allows for unaligned unwritten
extents in that definition.


> > That's not a bug in the atomic write path - this failure occurs
> > because of the truncate behaviour doing post-eof unwritten extent
> > conversion....
> > 
> > Yes, I agree that the entire -physical- extent is still correctly
> > aligned on disk so you could argue that the unwritten conversion
> > that xfs_bunmapi_range is doing is valid forced alignment behaviour.
> > However, the fact is that breaking the aligned physical extent into
> > two unaligned contiguous extents in different states in the BMBT
> > means that they are treated as two seperate unaligned extents, not
> > one contiguous aligned physical extent.
> 
> Right, this is problematic.
> 
> * I guess that you had not been following the recent discussion on this
> topic in the latest xfs atomic writes series @ https://lore.kernel.org/linux-xfs/20240817094800.776408-1-john.g.garry@oracle.com/
> and also mentioned earlier in
> https://lore.kernel.org/linux-xfs/20240726171358.GA27612@lst.de/
> 
> There I dropped the sub-alloc unit zeroing. The concept to iter for a single
> bio seems sane, but as Darrick mentioned, we have issue of non-atomically
> committing all the extent conversions.

Yes, I understand these problems exist.  My entire point is that the
forced alignment implemention should never allow such unaligned
extent patterns to be created in the first place. If we avoid
creating such situations in the first place, then we never have to
care about about unaligned unwritten extent conversion breaking
atomic IO.

FWIW, I also understand things are different if we are doing 128kB
atomic writes on 16kB force aligned files. However, in this
situation we are treating the 128kB atomic IO as eight individual
16kB atomic IOs that are physically contiguous. Hence in this
situation it doesn't matter if we have a mix of 16kB aligned
written/unwritten/hole extents as each 16kB chunks is independent of
the others.

What matters is that each indivudal 16kB chunk shows either the old
data or the new data - we are not guaranteeing that the entire 128kB
write is atomic. Hence in this situation we can both submit and
process each 16kB shunk as independent IOs with independent IO
compeltion transactions. All that matters is that we don't signal
completion to userspace until all the IO is complete, and we already
do that for fragmented DIO writes...

> > Again, this is different to the traditional RT file behaviour - it
> > can use unwritten extents for sub-alloc-unit alignment unmaps
> > because the RT device can align file offset to any physical offset,
> > and issue unaligned sector sized IO without any restrictions. Forced
> > alignment does not have this freedom, and when we extend forced
> > alignment to RT files, it will not have the freedom to use
> > unwritten extents for sub-alloc-unit unmapping, either.
> > 
> So how do you think that we should actually implement
> xfs_itruncate_extents_flags() properly for forcealign? Would it simply be
> like:
> 
> --- a/fs/xfs/xfs_inode.c
> +++ b/fs/xfs/xfs_inode.c
> @@ -1050,7 +1050,7 @@ xfs_itruncate_extents_flags(
>                 WARN_ON_ONCE(first_unmap_block > XFS_MAX_FILEOFF);
>                 return 0;
>         }
> +	if (xfs_inode_has_forcealign(ip))
> +	       first_unmap_block = xfs_inode_roundup_alloc_unit(ip,
> first_unmap_block);
>         error = xfs_bunmapi_range(&tp, ip, flags, first_unmap_block,

Yes, it would be something like that, except it would have to be
done before first_unmap_block is verified.

-Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v4 00/14] forcealign for xfs
  2024-09-16  5:25               ` Dave Chinner
@ 2024-09-16  9:44                 ` John Garry
  2024-09-17 22:27                   ` Dave Chinner
  0 siblings, 1 reply; 58+ messages in thread
From: John Garry @ 2024-09-16  9:44 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Ritesh Harjani, chandan.babu, djwong, dchinner, hch, viro,
	brauner, jack, linux-xfs, linux-kernel, linux-fsdevel,
	catherine.hoang, martin.petersen


>> * I guess that you had not been following the recent discussion on this
>> topic in the latest xfs atomic writes series @ https://urldefense.com/v3/__https://lore.kernel.org/linux-xfs/20240817094800.776408-1-john.g.garry@oracle.com/__;!!ACWV5N9M2RV99hQ!JIzCbkyp3JuPyzBx1n80WAdog5rLxMRB65FYrs1sFf3ei-wOdqrU_DZBE5zwrJXhrj949HSE0TwOEV0ciu8$
>> and also mentioned earlier in
>> https://urldefense.com/v3/__https://lore.kernel.org/linux-xfs/20240726171358.GA27612@lst.de/__;!!ACWV5N9M2RV99hQ!JIzCbkyp3JuPyzBx1n80WAdog5rLxMRB65FYrs1sFf3ei-wOdqrU_DZBE5zwrJXhrj949HSE0TwOiiEnYSk$
>>
>> There I dropped the sub-alloc unit zeroing. The concept to iter for a single
>> bio seems sane, but as Darrick mentioned, we have issue of non-atomically
>> committing all the extent conversions.
> 
> Yes, I understand these problems exist.  My entire point is that the
> forced alignment implemention should never allow such unaligned
> extent patterns to be created in the first place. If we avoid
> creating such situations in the first place, then we never have to
> care about about unaligned unwritten extent conversion breaking
> atomic IO.

OK, but what about this situation with non-EOF unaligned extents:

# xfs_io -c "lsattr -v" mnt/file
[extsize, has-xattr, force-align] mnt/file
# xfs_io -c "extsize" mnt/file
[65536] mnt/file
#
# xfs_io  -d -c "pwrite 64k 64k" mnt/file
# xfs_io  -d -c "pwrite 8k 8k" mnt/file
# xfs_bmap -vvp  mnt/file
mnt/file:
EXT: FILE-OFFSET      BLOCK-RANGE      AG AG-OFFSET        TOTAL FLAGS
   0: [0..15]:         384..399          0 (384..399)          16 010000
   1: [16..31]:        400..415          0 (400..415)          16 000000
   2: [32..127]:       416..511          0 (416..511)          96 010000
   3: [128..255]:      256..383          0 (256..383)         128 000000
FLAG Values:
    0010000 Unwritten preallocated extent

Here we have unaligned extents wrt extsize.

The sub-alloc unit zeroing would solve that - is that what you would 
still advocate (to solve that issue)?

> 
> FWIW, I also understand things are different if we are doing 128kB
> atomic writes on 16kB force aligned files. However, in this
> situation we are treating the 128kB atomic IO as eight individual
> 16kB atomic IOs that are physically contiguous.

Yes, if 16kB force aligned, userspace can only issue 16KB atomic writes.

> Hence in this
> situation it doesn't matter if we have a mix of 16kB aligned
> written/unwritten/hole extents as each 16kB chunks is independent of
> the others.

Sure

> 
> What matters is that each indivudal 16kB chunk shows either the old
> data or the new data - we are not guaranteeing that the entire 128kB
> write is atomic. Hence in this situation we can both submit and
> process each 16kB shunk as independent IOs with independent IO
> compeltion transactions. All that matters is that we don't signal
> completion to userspace until all the IO is complete, and we already
> do that for fragmented DIO writes...
> 
>>> Again, this is different to the traditional RT file behaviour - it
>>> can use unwritten extents for sub-alloc-unit alignment unmaps
>>> because the RT device can align file offset to any physical offset,
>>> and issue unaligned sector sized IO without any restrictions. Forced
>>> alignment does not have this freedom, and when we extend forced
>>> alignment to RT files, it will not have the freedom to use
>>> unwritten extents for sub-alloc-unit unmapping, either.
>>>
>> So how do you think that we should actually implement
>> xfs_itruncate_extents_flags() properly for forcealign? Would it simply be
>> like:
>>
>> --- a/fs/xfs/xfs_inode.c
>> +++ b/fs/xfs/xfs_inode.c
>> @@ -1050,7 +1050,7 @@ xfs_itruncate_extents_flags(
>>                  WARN_ON_ONCE(first_unmap_block > XFS_MAX_FILEOFF);
>>                  return 0;
>>          }
>> +	if (xfs_inode_has_forcealign(ip))
>> +	       first_unmap_block = xfs_inode_roundup_alloc_unit(ip,
>> first_unmap_block);
>>          error = xfs_bunmapi_range(&tp, ip, flags, first_unmap_block,
> 
> Yes, it would be something like that, except it would have to be
> done before first_unmap_block is verified.
> 

ok, and are you still of the opinion that this does not apply to rtvol?

Thanks,
John



^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v4 00/14] forcealign for xfs
  2024-09-16  9:44                 ` John Garry
@ 2024-09-17 22:27                   ` Dave Chinner
  2024-09-18 10:12                     ` John Garry
  0 siblings, 1 reply; 58+ messages in thread
From: Dave Chinner @ 2024-09-17 22:27 UTC (permalink / raw)
  To: John Garry
  Cc: Ritesh Harjani, chandan.babu, djwong, dchinner, hch, viro,
	brauner, jack, linux-xfs, linux-kernel, linux-fsdevel,
	catherine.hoang, martin.petersen

On Mon, Sep 16, 2024 at 10:44:38AM +0100, John Garry wrote:
> 
> > > * I guess that you had not been following the recent discussion on this
> > > topic in the latest xfs atomic writes series @ https://urldefense.com/v3/__https://lore.kernel.org/linux-xfs/20240817094800.776408-1-john.g.garry@oracle.com/__;!!ACWV5N9M2RV99hQ!JIzCbkyp3JuPyzBx1n80WAdog5rLxMRB65FYrs1sFf3ei-wOdqrU_DZBE5zwrJXhrj949HSE0TwOEV0ciu8$
> > > and also mentioned earlier in
> > > https://urldefense.com/v3/__https://lore.kernel.org/linux-xfs/20240726171358.GA27612@lst.de/__;!!ACWV5N9M2RV99hQ!JIzCbkyp3JuPyzBx1n80WAdog5rLxMRB65FYrs1sFf3ei-wOdqrU_DZBE5zwrJXhrj949HSE0TwOiiEnYSk$
> > > 
> > > There I dropped the sub-alloc unit zeroing. The concept to iter for a single
> > > bio seems sane, but as Darrick mentioned, we have issue of non-atomically
> > > committing all the extent conversions.
> > 
> > Yes, I understand these problems exist.  My entire point is that the
> > forced alignment implemention should never allow such unaligned
> > extent patterns to be created in the first place. If we avoid
> > creating such situations in the first place, then we never have to
> > care about about unaligned unwritten extent conversion breaking
> > atomic IO.
> 
> OK, but what about this situation with non-EOF unaligned extents:
> 
> # xfs_io -c "lsattr -v" mnt/file
> [extsize, has-xattr, force-align] mnt/file
> # xfs_io -c "extsize" mnt/file
> [65536] mnt/file
> #
> # xfs_io  -d -c "pwrite 64k 64k" mnt/file
> # xfs_io  -d -c "pwrite 8k 8k" mnt/file
> # xfs_bmap -vvp  mnt/file
> mnt/file:
> EXT: FILE-OFFSET      BLOCK-RANGE      AG AG-OFFSET        TOTAL FLAGS
>   0: [0..15]:         384..399          0 (384..399)          16 010000
>   1: [16..31]:        400..415          0 (400..415)          16 000000
>   2: [32..127]:       416..511          0 (416..511)          96 010000
>   3: [128..255]:      256..383          0 (256..383)         128 000000
> FLAG Values:
>    0010000 Unwritten preallocated extent
> 
> Here we have unaligned extents wrt extsize.
> 
> The sub-alloc unit zeroing would solve that - is that what you would still
> advocate (to solve that issue)?

Yes, I thought that was already implemented for force-align with the
DIO code via the extsize zero-around changes in the iomap code. Why
isn't that zero-around code ensuring the correct extent layout here?

> > FWIW, I also understand things are different if we are doing 128kB
> > atomic writes on 16kB force aligned files. However, in this
> > situation we are treating the 128kB atomic IO as eight individual
> > 16kB atomic IOs that are physically contiguous.
> 
> Yes, if 16kB force aligned, userspace can only issue 16KB atomic writes.

Right, but the eventual goal (given the statx parameters) is to be
able to do 8x16kB sequential atomic writes as a single 128kB IO, yes?

> > > > Again, this is different to the traditional RT file behaviour - it
> > > > can use unwritten extents for sub-alloc-unit alignment unmaps
> > > > because the RT device can align file offset to any physical offset,
> > > > and issue unaligned sector sized IO without any restrictions. Forced
> > > > alignment does not have this freedom, and when we extend forced
> > > > alignment to RT files, it will not have the freedom to use
> > > > unwritten extents for sub-alloc-unit unmapping, either.
> > > > 
> > > So how do you think that we should actually implement
> > > xfs_itruncate_extents_flags() properly for forcealign? Would it simply be
> > > like:
> > > 
> > > --- a/fs/xfs/xfs_inode.c
> > > +++ b/fs/xfs/xfs_inode.c
> > > @@ -1050,7 +1050,7 @@ xfs_itruncate_extents_flags(
> > >                  WARN_ON_ONCE(first_unmap_block > XFS_MAX_FILEOFF);
> > >                  return 0;
> > >          }
> > > +	if (xfs_inode_has_forcealign(ip))
> > > +	       first_unmap_block = xfs_inode_roundup_alloc_unit(ip,
> > > first_unmap_block);
> > >          error = xfs_bunmapi_range(&tp, ip, flags, first_unmap_block,
> > 
> > Yes, it would be something like that, except it would have to be
> > done before first_unmap_block is verified.
> > 
> 
> ok, and are you still of the opinion that this does not apply to rtvol?

The rtvol is *not* force-aligned. It -may- have some aligned
allocation requirements that are similar (i.e. sb_rextsize > 1 fsb)
but it does *not* force-align extents, written or unwritten.

The moment we add force-align support to RT files (as is the plan),
then the force-aligned inodes on the rtvol will need to behave as
force aligned inodes, not "rtvol" inodes.

-Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v4 00/14] forcealign for xfs
  2024-09-17 22:27                   ` Dave Chinner
@ 2024-09-18 10:12                     ` John Garry
  2024-11-14 12:48                       ` Long Li
  0 siblings, 1 reply; 58+ messages in thread
From: John Garry @ 2024-09-18 10:12 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Ritesh Harjani, chandan.babu, djwong, dchinner, hch, viro,
	brauner, jack, linux-xfs, linux-kernel, linux-fsdevel,
	catherine.hoang, martin.petersen

On 17/09/2024 23:27, Dave Chinner wrote:
>> # xfs_bmap -vvp  mnt/file
>> mnt/file:
>> EXT: FILE-OFFSET      BLOCK-RANGE      AG AG-OFFSET        TOTAL FLAGS
>>    0: [0..15]:         384..399          0 (384..399)          16 010000
>>    1: [16..31]:        400..415          0 (400..415)          16 000000
>>    2: [32..127]:       416..511          0 (416..511)          96 010000
>>    3: [128..255]:      256..383          0 (256..383)         128 000000
>> FLAG Values:
>>     0010000 Unwritten preallocated extent
>>
>> Here we have unaligned extents wrt extsize.
>>
>> The sub-alloc unit zeroing would solve that - is that what you would still
>> advocate (to solve that issue)?
> Yes, I thought that was already implemented for force-align with the
> DIO code via the extsize zero-around changes in the iomap code. Why
> isn't that zero-around code ensuring the correct extent layout here?

I just have not included the extsize zero-around changes here. They were 
just grouped with the atomic writes support, as they were added 
specifically for the atomic writes support. Indeed - to me at least - it 
is strange that the DIO code changes are required for XFS forcealign 
implementation. And, even if we use extsize zero-around changes for DIO 
path, what about buffered IO?

BTW, I still have concern with this extsize zero-around change which I 
was making:

xfs_iomap_write_unwritten()
{
	unsigned int rounding;

	/* when converting anything unwritten, we must be spanning an 	alloc 
unit, so round up/down */
	if (rounding > 1) {
		offset_fsb = rounddown(rounding);
		count_fsb = roundup(rounding);
	}

	...
	do {
		xfs_bmapi_write();
		...
		xfs_trans_commit();
	} while ();
}

As mentioned elsewhere, it's a bit of a bodge (to do this rounding).

> 
>>> FWIW, I also understand things are different if we are doing 128kB
>>> atomic writes on 16kB force aligned files. However, in this
>>> situation we are treating the 128kB atomic IO as eight individual
>>> 16kB atomic IOs that are physically contiguous.
>> Yes, if 16kB force aligned, userspace can only issue 16KB atomic writes.
> Right, but the eventual goal (given the statx parameters) is to be
> able to do 8x16kB sequential atomic writes as a single 128kB IO, yes?

No, if atomic write unit max is 16KB, then userspace can only issue a 
single 16KB atomic write.

However, some things to consider:
a. the block layer may merge those 16KB atomic writes
b. userspace may also merge 16KB atomic writes and issue a larger atomic 
write (if atomic write unit max is > 16KB)

I had been wondering if there is any value in a lib for helping with b.

> 
>>>>> Again, this is different to the traditional RT file behaviour - it
>>>>> can use unwritten extents for sub-alloc-unit alignment unmaps
>>>>> because the RT device can align file offset to any physical offset,
>>>>> and issue unaligned sector sized IO without any restrictions. Forced
>>>>> alignment does not have this freedom, and when we extend forced
>>>>> alignment to RT files, it will not have the freedom to use
>>>>> unwritten extents for sub-alloc-unit unmapping, either.
>>>>>
>>>> So how do you think that we should actually implement
>>>> xfs_itruncate_extents_flags() properly for forcealign? Would it simply be
>>>> like:
>>>>
>>>> --- a/fs/xfs/xfs_inode.c
>>>> +++ b/fs/xfs/xfs_inode.c
>>>> @@ -1050,7 +1050,7 @@ xfs_itruncate_extents_flags(
>>>>                   WARN_ON_ONCE(first_unmap_block > XFS_MAX_FILEOFF);
>>>>                   return 0;
>>>>           }
>>>> +	if (xfs_inode_has_forcealign(ip))
>>>> +	       first_unmap_block = xfs_inode_roundup_alloc_unit(ip,
>>>> first_unmap_block);
>>>>           error = xfs_bunmapi_range(&tp, ip, flags, first_unmap_block,
>>> Yes, it would be something like that, except it would have to be
>>> done before first_unmap_block is verified.
>>>
>> ok, and are you still of the opinion that this does not apply to rtvol?
> The rtvol is*not* force-aligned. It -may- have some aligned
> allocation requirements that are similar (i.e. sb_rextsize > 1 fsb)
> but it does*not* force-align extents, written or unwritten.
> 
> The moment we add force-align support to RT files (as is the plan),
> then the force-aligned inodes on the rtvol will need to behave as
> force aligned inodes, not "rtvol" inodes.

ok, fine

Thanks,
John



^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v4 00/14] forcealign for xfs
  2024-09-18 10:12                     ` John Garry
@ 2024-11-14 12:48                       ` Long Li
  2024-11-14 16:22                         ` John Garry
  2024-11-14 20:07                         ` Dave Chinner
  0 siblings, 2 replies; 58+ messages in thread
From: Long Li @ 2024-11-14 12:48 UTC (permalink / raw)
  To: John Garry, Dave Chinner
  Cc: Ritesh Harjani, chandan.babu, djwong, dchinner, hch, viro,
	brauner, jack, linux-xfs, linux-kernel, linux-fsdevel,
	catherine.hoang, martin.petersen

On Wed, Sep 18, 2024 at 11:12:47AM +0100, John Garry wrote:
> On 17/09/2024 23:27, Dave Chinner wrote:
> > > # xfs_bmap -vvp  mnt/file
> > > mnt/file:
> > > EXT: FILE-OFFSET      BLOCK-RANGE      AG AG-OFFSET        TOTAL FLAGS
> > >    0: [0..15]:         384..399          0 (384..399)          16 010000
> > >    1: [16..31]:        400..415          0 (400..415)          16 000000
> > >    2: [32..127]:       416..511          0 (416..511)          96 010000
> > >    3: [128..255]:      256..383          0 (256..383)         128 000000
> > > FLAG Values:
> > >     0010000 Unwritten preallocated extent
> > > 
> > > Here we have unaligned extents wrt extsize.
> > > 
> > > The sub-alloc unit zeroing would solve that - is that what you would still
> > > advocate (to solve that issue)?
> > Yes, I thought that was already implemented for force-align with the
> > DIO code via the extsize zero-around changes in the iomap code. Why
> > isn't that zero-around code ensuring the correct extent layout here?
> 
> I just have not included the extsize zero-around changes here. They were
> just grouped with the atomic writes support, as they were added specifically
> for the atomic writes support. Indeed - to me at least - it is strange that
> the DIO code changes are required for XFS forcealign implementation. And,
> even if we use extsize zero-around changes for DIO path, what about buffered
> IO?


I've been reviewing and testing the XFS atomic write patch series. Since
there haven't been any new responses to the previous discussions on this
issue, I'd like to inquire about the buffered IO problem with force-aligned
files, which is a scenario we might encounter.

Consider a case where the file supports force-alignment with a 64K extent size,
and the system page size is 4K. Take the following commands as an example:

xfs_io  -c "pwrite 64k 64k" mnt/file
xfs_io  -c "pwrite 8k 8k" mnt/file

If unaligned unwritten extents are not permitted, we need to zero out the
sub-allocation units for ranges [0, 8K] and [16K, 64K] to prevent stale
data. While this can be handled relatively easily in direct I/O scenarios,
it presents significant challenges in buffered I/O operations. The main
difficulty arises because the extent size (64K) is larger than the page
size (4K), and our current code base has substantial limitations in handling
such cases.

Any thoughts on this?

Thanks,
Long Li

> 
> BTW, I still have concern with this extsize zero-around change which I was
> making:
> 
> xfs_iomap_write_unwritten()
> {
> 	unsigned int rounding;
> 
> 	/* when converting anything unwritten, we must be spanning an 	alloc unit,
> so round up/down */
> 	if (rounding > 1) {
> 		offset_fsb = rounddown(rounding);
> 		count_fsb = roundup(rounding);
> 	}
> 
> 	...
> 	do {
> 		xfs_bmapi_write();
> 		...
> 		xfs_trans_commit();
> 	} while ();
> }
> 
> As mentioned elsewhere, it's a bit of a bodge (to do this rounding).
> 
> > 
> > > > FWIW, I also understand things are different if we are doing 128kB
> > > > atomic writes on 16kB force aligned files. However, in this
> > > > situation we are treating the 128kB atomic IO as eight individual
> > > > 16kB atomic IOs that are physically contiguous.
> > > Yes, if 16kB force aligned, userspace can only issue 16KB atomic writes.
> > Right, but the eventual goal (given the statx parameters) is to be
> > able to do 8x16kB sequential atomic writes as a single 128kB IO, yes?
> 
> No, if atomic write unit max is 16KB, then userspace can only issue a single
> 16KB atomic write.
> 
> However, some things to consider:
> a. the block layer may merge those 16KB atomic writes
> b. userspace may also merge 16KB atomic writes and issue a larger atomic
> write (if atomic write unit max is > 16KB)
> 
> I had been wondering if there is any value in a lib for helping with b.
> 
> > 
> > > > > > Again, this is different to the traditional RT file behaviour - it
> > > > > > can use unwritten extents for sub-alloc-unit alignment unmaps
> > > > > > because the RT device can align file offset to any physical offset,
> > > > > > and issue unaligned sector sized IO without any restrictions. Forced
> > > > > > alignment does not have this freedom, and when we extend forced
> > > > > > alignment to RT files, it will not have the freedom to use
> > > > > > unwritten extents for sub-alloc-unit unmapping, either.
> > > > > > 
> > > > > So how do you think that we should actually implement
> > > > > xfs_itruncate_extents_flags() properly for forcealign? Would it simply be
> > > > > like:
> > > > > 
> > > > > --- a/fs/xfs/xfs_inode.c
> > > > > +++ b/fs/xfs/xfs_inode.c
> > > > > @@ -1050,7 +1050,7 @@ xfs_itruncate_extents_flags(
> > > > >                   WARN_ON_ONCE(first_unmap_block > XFS_MAX_FILEOFF);
> > > > >                   return 0;
> > > > >           }
> > > > > +	if (xfs_inode_has_forcealign(ip))
> > > > > +	       first_unmap_block = xfs_inode_roundup_alloc_unit(ip,
> > > > > first_unmap_block);
> > > > >           error = xfs_bunmapi_range(&tp, ip, flags, first_unmap_block,
> > > > Yes, it would be something like that, except it would have to be
> > > > done before first_unmap_block is verified.
> > > > 
> > > ok, and are you still of the opinion that this does not apply to rtvol?
> > The rtvol is*not* force-aligned. It -may- have some aligned
> > allocation requirements that are similar (i.e. sb_rextsize > 1 fsb)
> > but it does*not* force-align extents, written or unwritten.
> > 
> > The moment we add force-align support to RT files (as is the plan),
> > then the force-aligned inodes on the rtvol will need to behave as
> > force aligned inodes, not "rtvol" inodes.
> 
> ok, fine
> 
> Thanks,
> John
> 
> 
> 

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v4 00/14] forcealign for xfs
  2024-11-14 12:48                       ` Long Li
@ 2024-11-14 16:22                         ` John Garry
  2024-11-14 20:07                         ` Dave Chinner
  1 sibling, 0 replies; 58+ messages in thread
From: John Garry @ 2024-11-14 16:22 UTC (permalink / raw)
  To: Long Li, Dave Chinner
  Cc: Ritesh Harjani, chandan.babu, djwong, dchinner, hch, viro,
	brauner, jack, linux-xfs, linux-kernel, linux-fsdevel,
	catherine.hoang, martin.petersen

On 14/11/2024 12:48, Long Li wrote:
> On Wed, Sep 18, 2024 at 11:12:47AM +0100, John Garry wrote:
>> On 17/09/2024 23:27, Dave Chinner wrote:
>>>> # xfs_bmap -vvp  mnt/file
>>>> mnt/file:
>>>> EXT: FILE-OFFSET      BLOCK-RANGE      AG AG-OFFSET        TOTAL FLAGS
>>>>     0: [0..15]:         384..399          0 (384..399)          16 010000
>>>>     1: [16..31]:        400..415          0 (400..415)          16 000000
>>>>     2: [32..127]:       416..511          0 (416..511)          96 010000
>>>>     3: [128..255]:      256..383          0 (256..383)         128 000000
>>>> FLAG Values:
>>>>      0010000 Unwritten preallocated extent
>>>>
>>>> Here we have unaligned extents wrt extsize.
>>>>
>>>> The sub-alloc unit zeroing would solve that - is that what you would still
>>>> advocate (to solve that issue)?
>>> Yes, I thought that was already implemented for force-align with the
>>> DIO code via the extsize zero-around changes in the iomap code. Why
>>> isn't that zero-around code ensuring the correct extent layout here?
>> I just have not included the extsize zero-around changes here. They were
>> just grouped with the atomic writes support, as they were added specifically
>> for the atomic writes support. Indeed - to me at least - it is strange that
>> the DIO code changes are required for XFS forcealign implementation. And,
>> even if we use extsize zero-around changes for DIO path, what about buffered
>> IO?
> 
> I've been reviewing and testing the XFS atomic write patch series. Since
> there haven't been any new responses to the previous discussions on this
> issue, I'd like to inquire about the buffered IO problem with force-aligned
> files, which is a scenario we might encounter.
> 
> Consider a case where the file supports force-alignment with a 64K extent size,
> and the system page size is 4K. Take the following commands as an example:
> 
> xfs_io  -c "pwrite 64k 64k" mnt/file
> xfs_io  -c "pwrite 8k 8k" mnt/file
> 
> If unaligned unwritten extents are not permitted, we need to zero out the
> sub-allocation units for ranges [0, 8K] and [16K, 64K] to prevent stale
> data. 

How does this prevent stale data? Just zeroing will ensure aligned 
extents. Unless iomap is provided a mapping for the fully aligned extent.

> While this can be handled relatively easily in direct I/O scenarios,
> it presents significant challenges in buffered I/O operations. The main
> difficulty arises because the extent size (64K) is larger than the page
> size (4K), and our current code base has substantial limitations in handling
> such cases.

What is the limitation exactly?

> 
> Any thoughts on this?

TBH, the buffered IO case has not been considered too much.

The sub-extent zeroing was intended for atomic writes > 1x FSB and we 
only care about DIO there.

Thanks,
John

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v4 00/14] forcealign for xfs
  2024-11-14 12:48                       ` Long Li
  2024-11-14 16:22                         ` John Garry
@ 2024-11-14 20:07                         ` Dave Chinner
  2024-11-15  8:14                           ` John Garry
  2024-11-15 11:20                           ` Long Li
  1 sibling, 2 replies; 58+ messages in thread
From: Dave Chinner @ 2024-11-14 20:07 UTC (permalink / raw)
  To: Long Li
  Cc: John Garry, Dave Chinner, Ritesh Harjani, chandan.babu, djwong,
	hch, viro, brauner, jack, linux-xfs, linux-kernel, linux-fsdevel,
	catherine.hoang, martin.petersen

On Thu, Nov 14, 2024 at 08:48:21PM +0800, Long Li wrote:
> On Wed, Sep 18, 2024 at 11:12:47AM +0100, John Garry wrote:
> > On 17/09/2024 23:27, Dave Chinner wrote:
> > > > # xfs_bmap -vvp  mnt/file
> > > > mnt/file:
> > > > EXT: FILE-OFFSET      BLOCK-RANGE      AG AG-OFFSET        TOTAL FLAGS
> > > >    0: [0..15]:         384..399          0 (384..399)          16 010000
> > > >    1: [16..31]:        400..415          0 (400..415)          16 000000
> > > >    2: [32..127]:       416..511          0 (416..511)          96 010000
> > > >    3: [128..255]:      256..383          0 (256..383)         128 000000
> > > > FLAG Values:
> > > >     0010000 Unwritten preallocated extent
> > > > 
> > > > Here we have unaligned extents wrt extsize.
> > > > 
> > > > The sub-alloc unit zeroing would solve that - is that what you would still
> > > > advocate (to solve that issue)?
> > > Yes, I thought that was already implemented for force-align with the
> > > DIO code via the extsize zero-around changes in the iomap code. Why
> > > isn't that zero-around code ensuring the correct extent layout here?
> > 
> > I just have not included the extsize zero-around changes here. They were
> > just grouped with the atomic writes support, as they were added specifically
> > for the atomic writes support. Indeed - to me at least - it is strange that
> > the DIO code changes are required for XFS forcealign implementation. And,
> > even if we use extsize zero-around changes for DIO path, what about buffered
> > IO?
> 
> 
> I've been reviewing and testing the XFS atomic write patch series. Since
> there haven't been any new responses to the previous discussions on this
> issue, I'd like to inquire about the buffered IO problem with force-aligned
> files, which is a scenario we might encounter.
> 
> Consider a case where the file supports force-alignment with a 64K extent size,
> and the system page size is 4K. Take the following commands as an example:
> 
> xfs_io  -c "pwrite 64k 64k" mnt/file
> xfs_io  -c "pwrite 8k 8k" mnt/file
> 
> If unaligned unwritten extents are not permitted, we need to zero out the
> sub-allocation units for ranges [0, 8K] and [16K, 64K] to prevent stale
> data. While this can be handled relatively easily in direct I/O scenarios,
> it presents significant challenges in buffered I/O operations. The main
> difficulty arises because the extent size (64K) is larger than the page
> size (4K), and our current code base has substantial limitations in handling
> such cases.
> 
> Any thoughts on this?

Large folios in the page cache solve this problem. i.e. it's the
same problem that block size > page size support had to solve.

-Dave.
-- 
Dave Chinner
dchinner@redhat.com


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v4 00/14] forcealign for xfs
  2024-11-14 20:07                         ` Dave Chinner
@ 2024-11-15  8:14                           ` John Garry
  2024-11-15 11:20                           ` Long Li
  1 sibling, 0 replies; 58+ messages in thread
From: John Garry @ 2024-11-15  8:14 UTC (permalink / raw)
  To: Dave Chinner, Long Li
  Cc: Dave Chinner, Ritesh Harjani, chandan.babu, djwong, hch, viro,
	brauner, jack, linux-xfs, linux-kernel, linux-fsdevel,
	catherine.hoang, martin.petersen

On 14/11/2024 20:07, Dave Chinner wrote:
>> xfs_io  -c "pwrite 64k 64k" mnt/file
>> xfs_io  -c "pwrite 8k 8k" mnt/file
>>
>> If unaligned unwritten extents are not permitted, we need to zero out the
>> sub-allocation units for ranges [0, 8K] and [16K, 64K] to prevent stale
>> data. While this can be handled relatively easily in direct I/O scenarios,
>> it presents significant challenges in buffered I/O operations. The main
>> difficulty arises because the extent size (64K) is larger than the page
>> size (4K), and our current code base has substantial limitations in handling
>> such cases.
>>
>> Any thoughts on this?
> Large folios in the page cache solve this problem. i.e. it's the
> same problem that block size > page size support had to solve.

Would that work for an extsize which is not a power-of-2?

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v4 00/14] forcealign for xfs
  2024-11-14 20:07                         ` Dave Chinner
  2024-11-15  8:14                           ` John Garry
@ 2024-11-15 11:20                           ` Long Li
  1 sibling, 0 replies; 58+ messages in thread
From: Long Li @ 2024-11-15 11:20 UTC (permalink / raw)
  To: Dave Chinner
  Cc: John Garry, Dave Chinner, Ritesh Harjani, chandan.babu, djwong,
	hch, viro, brauner, jack, linux-xfs, linux-kernel, linux-fsdevel,
	catherine.hoang, martin.petersen

On Fri, Nov 15, 2024 at 07:07:53AM +1100, Dave Chinner wrote:
> On Thu, Nov 14, 2024 at 08:48:21PM +0800, Long Li wrote:
> > On Wed, Sep 18, 2024 at 11:12:47AM +0100, John Garry wrote:
> > > On 17/09/2024 23:27, Dave Chinner wrote:
> > > > > # xfs_bmap -vvp  mnt/file
> > > > > mnt/file:
> > > > > EXT: FILE-OFFSET      BLOCK-RANGE      AG AG-OFFSET        TOTAL FLAGS
> > > > >    0: [0..15]:         384..399          0 (384..399)          16 010000
> > > > >    1: [16..31]:        400..415          0 (400..415)          16 000000
> > > > >    2: [32..127]:       416..511          0 (416..511)          96 010000
> > > > >    3: [128..255]:      256..383          0 (256..383)         128 000000
> > > > > FLAG Values:
> > > > >     0010000 Unwritten preallocated extent
> > > > > 
> > > > > Here we have unaligned extents wrt extsize.
> > > > > 
> > > > > The sub-alloc unit zeroing would solve that - is that what you would still
> > > > > advocate (to solve that issue)?
> > > > Yes, I thought that was already implemented for force-align with the
> > > > DIO code via the extsize zero-around changes in the iomap code. Why
> > > > isn't that zero-around code ensuring the correct extent layout here?
> > > 
> > > I just have not included the extsize zero-around changes here. They were
> > > just grouped with the atomic writes support, as they were added specifically
> > > for the atomic writes support. Indeed - to me at least - it is strange that
> > > the DIO code changes are required for XFS forcealign implementation. And,
> > > even if we use extsize zero-around changes for DIO path, what about buffered
> > > IO?
> > 
> > 
> > I've been reviewing and testing the XFS atomic write patch series. Since
> > there haven't been any new responses to the previous discussions on this
> > issue, I'd like to inquire about the buffered IO problem with force-aligned
> > files, which is a scenario we might encounter.
> > 
> > Consider a case where the file supports force-alignment with a 64K extent size,
> > and the system page size is 4K. Take the following commands as an example:
> > 
> > xfs_io  -c "pwrite 64k 64k" mnt/file
> > xfs_io  -c "pwrite 8k 8k" mnt/file
> > 
> > If unaligned unwritten extents are not permitted, we need to zero out the
> > sub-allocation units for ranges [0, 8K] and [16K, 64K] to prevent stale
> > data. While this can be handled relatively easily in direct I/O scenarios,
> > it presents significant challenges in buffered I/O operations. The main
> > difficulty arises because the extent size (64K) is larger than the page
> > size (4K), and our current code base has substantial limitations in handling
> > such cases.
> > 
> > Any thoughts on this?
> 
> Large folios in the page cache solve this problem. i.e. it's the
> same problem that block size > page size support had to solve.
> 
> 

Thanks for your reply, it cleared up my confusion. So maybe we need
to set a minimum folio order for force-aligned inodes, just
like Large block sizes (LBS).

Thanks,
Long Li

^ permalink raw reply	[flat|nested] 58+ messages in thread

end of thread, other threads:[~2024-11-29 11:36 UTC | newest]

Thread overview: 58+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-08-13 16:36 [PATCH v4 00/14] forcealign for xfs John Garry
2024-08-13 16:36 ` [PATCH v4 01/14] xfs: only allow minlen allocations when near ENOSPC John Garry
2024-08-23 16:28   ` Darrick J. Wong
2024-08-13 16:36 ` [PATCH v4 02/14] xfs: always tail align maxlen allocations John Garry
2024-08-23 16:31   ` Darrick J. Wong
2024-08-29 17:58     ` John Garry
2024-08-29 21:34       ` Darrick J. Wong
2024-08-13 16:36 ` [PATCH v4 03/14] xfs: simplify extent allocation alignment John Garry
2024-08-13 16:36 ` [PATCH v4 04/14] xfs: make EOF allocation simpler John Garry
2024-09-04 18:25   ` Ritesh Harjani
2024-09-05  7:51     ` John Garry
2024-08-13 16:36 ` [PATCH v4 05/14] xfs: introduce forced allocation alignment John Garry
2024-08-13 16:36 ` [PATCH v4 06/14] xfs: align args->minlen for " John Garry
2024-08-13 16:36 ` [PATCH v4 07/14] xfs: Introduce FORCEALIGN inode flag John Garry
2024-08-13 16:36 ` [PATCH v4 08/14] xfs: Update xfs_inode_alloc_unitsize() for forcealign John Garry
2024-08-13 16:36 ` [PATCH v4 09/14] xfs: Update xfs_setattr_size() " John Garry
2024-08-13 16:36 ` [PATCH v4 10/14] xfs: Do not free EOF blocks " John Garry
2024-08-13 16:36 ` [PATCH v4 11/14] xfs: Only free full extents " John Garry
2024-08-13 16:36 ` [PATCH v4 12/14] xfs: Unmap blocks according to forcealign John Garry
2024-08-23 16:35   ` Darrick J. Wong
2024-08-13 16:36 ` [PATCH v4 13/14] xfs: Don't revert allocated offset for forcealign John Garry
2024-08-13 16:36 ` [PATCH v4 14/14] xfs: Enable file data forcealign feature John Garry
2024-09-04 18:14 ` [PATCH v4 00/14] forcealign for xfs Ritesh Harjani
2024-09-04 23:20   ` Dave Chinner
2024-09-05  3:56     ` Ritesh Harjani
2024-09-05  6:33       ` Dave Chinner
2024-09-10  2:51         ` Ritesh Harjani
2024-09-16  6:33           ` Dave Chinner
2024-09-10 12:33         ` Ritesh Harjani
2024-09-16  7:03           ` Dave Chinner
2024-09-16 10:24             ` John Garry
2024-09-17 20:54               ` Darrick J. Wong
2024-09-17 23:34                 ` Dave Chinner
2024-09-17 22:12               ` Dave Chinner
2024-09-18  7:59                 ` John Garry
2024-09-23  2:57                   ` Dave Chinner
2024-09-23  3:33                     ` Christoph Hellwig
2024-09-23  8:16                       ` John Garry
2024-09-23 12:07                         ` Christoph Hellwig
2024-09-23 12:33                           ` John Garry
2024-09-24  6:17                             ` Christoph Hellwig
2024-09-24  9:48                               ` John Garry
2024-11-29 11:36                                 ` John Garry
2024-09-23  8:00                     ` John Garry
2024-09-05 10:15     ` John Garry
2024-09-05 21:47       ` Dave Chinner
2024-09-06 14:31         ` John Garry
2024-09-08 22:49           ` Dave Chinner
2024-09-09 16:18             ` John Garry
2024-09-16  5:25               ` Dave Chinner
2024-09-16  9:44                 ` John Garry
2024-09-17 22:27                   ` Dave Chinner
2024-09-18 10:12                     ` John Garry
2024-11-14 12:48                       ` Long Li
2024-11-14 16:22                         ` John Garry
2024-11-14 20:07                         ` Dave Chinner
2024-11-15  8:14                           ` John Garry
2024-11-15 11:20                           ` Long Li

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).