Re: [PATCH v4 00/14] forcealign for xfs

All of lore.kernel.org
 help / color / mirror / Atom feed

From: John Garry <john.g.garry@oracle.com>
To: Dave Chinner <david@fromorbit.com>
Cc: Ritesh Harjani <ritesh.list@gmail.com>,
	chandan.babu@oracle.com, djwong@kernel.org, dchinner@redhat.com,
	hch@lst.de, viro@zeniv.linux.org.uk, brauner@kernel.org,
	jack@suse.cz, linux-xfs@vger.kernel.org,
	linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org,
	catherine.hoang@oracle.com, martin.petersen@oracle.com
Subject: Re: [PATCH v4 00/14] forcealign for xfs
Date: Mon, 9 Sep 2024 17:18:43 +0100	[thread overview]
Message-ID: <84b68068-e159-4e28-bf06-767ea7858d79@oracle.com> (raw)
In-Reply-To: <Zt4qCLL6gBQ1kOFj@dread.disaster.area>

>>
>> AFAICS, forcealign behaviour is same as RT, so then this would be a mainline
>> bug, right?
> 
> No, I don't think so. I think this is a case where forcealign and RT
> behaviours differ, primarily because RT doesn't actually care about
> file offset -> physical offset alignment and can do unaligned IO
> whenever it wants. Hence having an unaligned written->unwritten
> extent state transition doesnt' affect anything for rt files...

ok

> 
>>
>>>> We change the space reservation in xfs-setattr_size() for this case
>>> (patch 9) but then don't do any alignment there - it relies on
>>> xfs_itruncate_extents_flags() to do the right thing w.r.t. extent
>>> removal alignment w.r.t. the new EOF.
>>>
>>> i.e. The xfs_setattr_size() code takes care of EOF block zeroing and
>>> page cache removal so the user doesn't see old data beyond EOF,
>>> whilst xfs_itruncate_extents_flags() is supposed to take care of the
>>> extent removal and the details of that operation (e.g. alignment).
>>
>> So we should roundup the unmap block to the alloc unit, correct? I have my
>> doubts about that, and thought that xfs_bunmapi_range() takes care of any
>> alignment handling.
> 
> The start should round up, yes. And, no, xfs_bunmapi_range() isn't
> specifically "taking care" of alignment. It's just marking a partial
> alloc unit range as unwritten, which means we now have -unaligned
> extents- in the BMBT.

Yes, I have noticed this previously.

> 
>>
>>>
>>> Patch 10 also modifies xfs_can_free_eofblocks() to take alignment
>>> into account for the post-eof block removal, but doesn't change
>>> xfs_free_eofblocks() at all. i.e  it also relies on
>>> xfs_itruncate_extents_flags() to do the right thing for force
>>> aligned inodes.
>>
>> What state should the blocks post-EOF blocks be? A simple example of
>> partially truncating an alloc unit is:
>>
>> $xfs_io -c "extsize" mnt/file
>> [16384] mnt/file
>>
>>
>> $xfs_bmap -vvp mnt/file
>> mnt/file:
>>   EXT: FILE-OFFSET      BLOCK-RANGE      AG AG-OFFSET        TOTAL FLAGS
>>     0: [0..20479]:      192..20671        0 (192..20671)     20480 000000
>>
>>
>> $truncate -s 10461184 mnt/file # 10M - 6FSB
>>
>> $xfs_bmap -vvp mnt/file
>> mnt/file:
>>   EXT: FILE-OFFSET      BLOCK-RANGE      AG AG-OFFSET        TOTAL FLAGS
>>     0: [0..20431]:      192..20623        0 (192..20623)     20432 000000
>>     1: [20432..20447]:  20624..20639      0 (20624..20639)      16 010000
>>   FLAG Values:
>>      0010000 Unwritten preallocated extent
>>
>> Is that incorrect state?
> 
> Think about it: what happens if you now truncate it back up to 10MB
> (i.e. aligned length) and then do an aligned atomic write on it.
> 
> First: What happens when you truncate up?
> 
> ......
> 
> Yes, iomap_zero_range() will see the unwritten extent and skip it.
> i.e. The unwritten extent stays as an unwritten extent, it's now
> within EOF. That written->unwritten extent boundary is not on an
> aligned file offset.

Right

> 
> Second: What happens when you do a correctly aligned atomic write
> that spans this range now?
> 
> ......
> 
> Iomap only maps a single extent at a time, so it will only map the
> written range from the start of the IO (aligned) to the start of the
> unwritten extent (unaligned).  Hence the atomic write will be
> rejected because we can't do the atomic write to such an unaligned
> extent.

It was being considered to change this handling for atomic writes. More 
below at *.

> 
> That's not a bug in the atomic write path - this failure occurs
> because of the truncate behaviour doing post-eof unwritten extent
> conversion....
> 
> Yes, I agree that the entire -physical- extent is still correctly
> aligned on disk so you could argue that the unwritten conversion
> that xfs_bunmapi_range is doing is valid forced alignment behaviour.
> However, the fact is that breaking the aligned physical extent into
> two unaligned contiguous extents in different states in the BMBT
> means that they are treated as two seperate unaligned extents, not
> one contiguous aligned physical extent.

Right, this is problematic.

* I guess that you had not been following the recent discussion on this 
topic in the latest xfs atomic writes series @ 
https://lore.kernel.org/linux-xfs/20240817094800.776408-1-john.g.garry@oracle.com/ 
and also mentioned earlier in 
https://lore.kernel.org/linux-xfs/20240726171358.GA27612@lst.de/

There I dropped the sub-alloc unit zeroing. The concept to iter for a 
single bio seems sane, but as Darrick mentioned, we have issue of 
non-atomically committing all the extent conversions.

FWIW, this is how I think that the change according to Darrick's idea 
would look:

---->8----

diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
index 72c981e3dc92..ec76d6a8750d 100644
--- a/fs/xfs/xfs_iomap.c
+++ b/fs/xfs/xfs_iomap.c
@@ -244,7 +244,8 @@ xfs_iomap_write_direct(
  	xfs_fileoff_t		count_fsb,
  	unsigned int		flags,
  	struct xfs_bmbt_irec	*imap,
-	u64			*seq)
+	u64			*seq,
+	bool			zeroing)
  {
  	struct xfs_mount	*mp = ip->i_mount;
  	struct xfs_trans	*tp;
@@ -285,7 +286,7 @@ xfs_iomap_write_direct(
  	 * the reserve block pool for bmbt block allocation if there is no space
  	 * left but we need to do unwritten extent conversion.
  	 */
+	if (zeroing)
+		bmapi_flags = XFS_BMAPI_CONVERT | XFS_BMAPI_ZERO;
	if (flags & IOMAP_DAX) {
  		bmapi_flags = XFS_BMAPI_CONVERT | XFS_BMAPI_ZERO;
  		if (imap->br_state == XFS_EXT_UNWRITTEN) {
  			force = true;
@@ -780,6 +781,11 @@ xfs_direct_write_iomap_begin(
  	u16			iomap_flags = 0;
  	unsigned int		lockmode;
  	u64			seq;
+	bool			atomic = flags & IOMAP_ATOMIC;
+	struct xfs_bmbt_irec	imap2[XFS_BMAP_MAX_NMAP];
+	xfs_fileoff_t		_offset_fsb = xfs_inode_rounddown_alloc_unit(ip, 
offset_fsb);
+	xfs_fileoff_t		_end_fsb = xfs_inode_roundup_alloc_unit(ip, end_fsb);
+	int loop_index;

  	ASSERT(flags & (IOMAP_WRITE | IOMAP_ZERO));

@@ -843,6 +849,9 @@ xfs_direct_write_iomap_begin(
  	if (imap_needs_alloc(inode, flags, &imap, nimaps))
  		goto allocate_blocks;

+	if (atomic && !imap_spans_range(&imap, offset_fsb, end_fsb))
+		goto out_atomic;
+
  	/*
  	 * NOWAIT and OVERWRITE I/O needs to span the entire requested I/O with
  	 * a single map so that we avoid partial IO failures due to the rest of
@@ -897,7 +906,7 @@ xfs_direct_write_iomap_begin(
  	xfs_iunlock(ip, lockmode);

  	error = xfs_iomap_write_direct(ip, offset_fsb, end_fsb - offset_fsb,
-			flags, &imap, &seq);
+			flags, &imap, &seq, false);
  	if (error)
  		return error;

@@ -918,6 +927,28 @@ xfs_direct_write_iomap_begin(
  	xfs_iunlock(ip, lockmode);
  	return xfs_bmbt_to_iomap(ip, iomap, &cmap, flags, IOMAP_F_SHARED, seq);

+out_atomic:
+	nimaps = XFS_BMAP_MAX_NMAP;
+
+	error = xfs_bmapi_read(ip, _offset_fsb, _end_fsb - _offset_fsb, &imap2[0],
+			       &nimaps, 0);
+	for (loop_index = 0; loop_index < nimaps; loop_index++) {
+		struct xfs_bmbt_irec *_imap2 = &imap2[loop_index];
+
+		if (_imap2->br_state == XFS_EXT_UNWRITTEN) {
+
+			xfs_iunlock(ip, lockmode);
+
+			error = xfs_iomap_write_direct(ip, _imap2->br_startoff, 
_imap2->br_blockcount,
+					flags, &imap, &seq, true);
+			if (error)
+				return error;
+			goto relock;
+		}
+	}
+	/* We should not reach this, but TODO better handling */
+	error = -EINVAL;
+
  out_unlock:
  	if (lockmode)
  		xfs_iunlock(ip, lockmode);

-----8<----

But I have my doubts about using XFS_BMAPI_ZERO, even if it is just for 
pre-zeroing unwritten parts of the alloc unit for an atomic write.

> 
> This extent state discontiunity is results in breaking physical IO
> across the extent state boundary. Hence such an unaligned extent
> state change violates the physical IO alignment guarantees that
> forced alignment is supposed to provide atomic writes...

Yes

> 
> This is the reason why operations like hole punching round to extent
> size for forced alignment at the highest level. i.e. We cannot allow
> xfs_bunmapi_range() to "take care" of alignment handling because
> managing partial extent unmappings with sub-alloc-unit unwritten
> extents does not work for forced alignment.
> 
> Again, this is different to the traditional RT file behaviour - it
> can use unwritten extents for sub-alloc-unit alignment unmaps
> because the RT device can align file offset to any physical offset,
> and issue unaligned sector sized IO without any restrictions. Forced
> alignment does not have this freedom, and when we extend forced
> alignment to RT files, it will not have the freedom to use
> unwritten extents for sub-alloc-unit unmapping, either.
> 
So how do you think that we should actually implement 
xfs_itruncate_extents_flags() properly for forcealign? Would it simply 
be like:

--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -1050,7 +1050,7 @@ xfs_itruncate_extents_flags(
                 WARN_ON_ONCE(first_unmap_block > XFS_MAX_FILEOFF);
                 return 0;
         }
+	if (xfs_inode_has_forcealign(ip))
+	       first_unmap_block = xfs_inode_roundup_alloc_unit(ip, 
first_unmap_block);
         error = xfs_bunmapi_range(&tp, ip, flags, first_unmap_block,

Thanks,
John

next prev parent reply	other threads:[~2024-09-09 16:19 UTC|newest]

Thread overview: 58+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-08-13 16:36 [PATCH v4 00/14] forcealign for xfs John Garry
2024-08-13 16:36 ` [PATCH v4 01/14] xfs: only allow minlen allocations when near ENOSPC John Garry
2024-08-23 16:28   ` Darrick J. Wong
2024-08-13 16:36 ` [PATCH v4 02/14] xfs: always tail align maxlen allocations John Garry
2024-08-23 16:31   ` Darrick J. Wong
2024-08-29 17:58     ` John Garry
2024-08-29 21:34       ` Darrick J. Wong
2024-08-13 16:36 ` [PATCH v4 03/14] xfs: simplify extent allocation alignment John Garry
2024-08-13 16:36 ` [PATCH v4 04/14] xfs: make EOF allocation simpler John Garry
2024-09-04 18:25   ` Ritesh Harjani
2024-09-05  7:51     ` John Garry
2024-08-13 16:36 ` [PATCH v4 05/14] xfs: introduce forced allocation alignment John Garry
2024-08-13 16:36 ` [PATCH v4 06/14] xfs: align args->minlen for " John Garry
2024-08-13 16:36 ` [PATCH v4 07/14] xfs: Introduce FORCEALIGN inode flag John Garry
2024-08-13 16:36 ` [PATCH v4 08/14] xfs: Update xfs_inode_alloc_unitsize() for forcealign John Garry
2024-08-13 16:36 ` [PATCH v4 09/14] xfs: Update xfs_setattr_size() " John Garry
2024-08-13 16:36 ` [PATCH v4 10/14] xfs: Do not free EOF blocks " John Garry
2024-08-13 16:36 ` [PATCH v4 11/14] xfs: Only free full extents " John Garry
2024-08-13 16:36 ` [PATCH v4 12/14] xfs: Unmap blocks according to forcealign John Garry
2024-08-23 16:35   ` Darrick J. Wong
2024-08-13 16:36 ` [PATCH v4 13/14] xfs: Don't revert allocated offset for forcealign John Garry
2024-08-13 16:36 ` [PATCH v4 14/14] xfs: Enable file data forcealign feature John Garry
2024-09-04 18:14 ` [PATCH v4 00/14] forcealign for xfs Ritesh Harjani
2024-09-04 23:20   ` Dave Chinner
2024-09-05  3:56     ` Ritesh Harjani
2024-09-05  6:33       ` Dave Chinner
2024-09-10  2:51         ` Ritesh Harjani
2024-09-16  6:33           ` Dave Chinner
2024-09-10 12:33         ` Ritesh Harjani
2024-09-16  7:03           ` Dave Chinner
2024-09-16 10:24             ` John Garry
2024-09-17 20:54               ` Darrick J. Wong
2024-09-17 23:34                 ` Dave Chinner
2024-09-17 22:12               ` Dave Chinner
2024-09-18  7:59                 ` John Garry
2024-09-23  2:57                   ` Dave Chinner
2024-09-23  3:33                     ` Christoph Hellwig
2024-09-23  8:16                       ` John Garry
2024-09-23 12:07                         ` Christoph Hellwig
2024-09-23 12:33                           ` John Garry
2024-09-24  6:17                             ` Christoph Hellwig
2024-09-24  9:48                               ` John Garry
2024-11-29 11:36                                 ` John Garry
2024-09-23  8:00                     ` John Garry
2024-09-05 10:15     ` John Garry
2024-09-05 21:47       ` Dave Chinner
2024-09-06 14:31         ` John Garry
2024-09-08 22:49           ` Dave Chinner
2024-09-09 16:18             ` John Garry [this message]
2024-09-16  5:25               ` Dave Chinner
2024-09-16  9:44                 ` John Garry
2024-09-17 22:27                   ` Dave Chinner
2024-09-18 10:12                     ` John Garry
2024-11-14 12:48                       ` Long Li
2024-11-14 16:22                         ` John Garry
2024-11-14 20:07                         ` Dave Chinner
2024-11-15  8:14                           ` John Garry
2024-11-15 11:20                           ` Long Li

find likely ancestor, descendant, or conflicting patches for this message:
( dfblob:72c981e3dc9 dfblob:ec76d6a8750 )
 OR (
bs:"Re: [PATCH v4 00/14] forcealign for xfs" )
	(help)

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=84b68068-e159-4e28-bf06-767ea7858d79@oracle.com \
    --to=john.g.garry@oracle.com \
    --cc=brauner@kernel.org \
    --cc=catherine.hoang@oracle.com \
    --cc=chandan.babu@oracle.com \
    --cc=david@fromorbit.com \
    --cc=dchinner@redhat.com \
    --cc=djwong@kernel.org \
    --cc=hch@lst.de \
    --cc=jack@suse.cz \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-xfs@vger.kernel.org \
    --cc=martin.petersen@oracle.com \
    --cc=ritesh.list@gmail.com \
    --cc=viro@zeniv.linux.org.uk \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.