linux-fsdevel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: John Garry <john.g.garry@oracle.com>
To: Dave Chinner <david@fromorbit.com>
Cc: Ritesh Harjani <ritesh.list@gmail.com>,
	chandan.babu@oracle.com, djwong@kernel.org, dchinner@redhat.com,
	hch@lst.de, viro@zeniv.linux.org.uk, brauner@kernel.org,
	jack@suse.cz, linux-xfs@vger.kernel.org,
	linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org,
	catherine.hoang@oracle.com, martin.petersen@oracle.com
Subject: Re: [PATCH v4 00/14] forcealign for xfs
Date: Mon, 9 Sep 2024 17:18:43 +0100	[thread overview]
Message-ID: <84b68068-e159-4e28-bf06-767ea7858d79@oracle.com> (raw)
In-Reply-To: <Zt4qCLL6gBQ1kOFj@dread.disaster.area>

>>
>> AFAICS, forcealign behaviour is same as RT, so then this would be a mainline
>> bug, right?
> 
> No, I don't think so. I think this is a case where forcealign and RT
> behaviours differ, primarily because RT doesn't actually care about
> file offset -> physical offset alignment and can do unaligned IO
> whenever it wants. Hence having an unaligned written->unwritten
> extent state transition doesnt' affect anything for rt files...

ok

> 
>>
>>>> We change the space reservation in xfs-setattr_size() for this case
>>> (patch 9) but then don't do any alignment there - it relies on
>>> xfs_itruncate_extents_flags() to do the right thing w.r.t. extent
>>> removal alignment w.r.t. the new EOF.
>>>
>>> i.e. The xfs_setattr_size() code takes care of EOF block zeroing and
>>> page cache removal so the user doesn't see old data beyond EOF,
>>> whilst xfs_itruncate_extents_flags() is supposed to take care of the
>>> extent removal and the details of that operation (e.g. alignment).
>>
>> So we should roundup the unmap block to the alloc unit, correct? I have my
>> doubts about that, and thought that xfs_bunmapi_range() takes care of any
>> alignment handling.
> 
> The start should round up, yes. And, no, xfs_bunmapi_range() isn't
> specifically "taking care" of alignment. It's just marking a partial
> alloc unit range as unwritten, which means we now have -unaligned
> extents- in the BMBT.

Yes, I have noticed this previously.

> 
>>
>>>
>>> Patch 10 also modifies xfs_can_free_eofblocks() to take alignment
>>> into account for the post-eof block removal, but doesn't change
>>> xfs_free_eofblocks() at all. i.e  it also relies on
>>> xfs_itruncate_extents_flags() to do the right thing for force
>>> aligned inodes.
>>
>> What state should the blocks post-EOF blocks be? A simple example of
>> partially truncating an alloc unit is:
>>
>> $xfs_io -c "extsize" mnt/file
>> [16384] mnt/file
>>
>>
>> $xfs_bmap -vvp mnt/file
>> mnt/file:
>>   EXT: FILE-OFFSET      BLOCK-RANGE      AG AG-OFFSET        TOTAL FLAGS
>>     0: [0..20479]:      192..20671        0 (192..20671)     20480 000000
>>
>>
>> $truncate -s 10461184 mnt/file # 10M - 6FSB
>>
>> $xfs_bmap -vvp mnt/file
>> mnt/file:
>>   EXT: FILE-OFFSET      BLOCK-RANGE      AG AG-OFFSET        TOTAL FLAGS
>>     0: [0..20431]:      192..20623        0 (192..20623)     20432 000000
>>     1: [20432..20447]:  20624..20639      0 (20624..20639)      16 010000
>>   FLAG Values:
>>      0010000 Unwritten preallocated extent
>>
>> Is that incorrect state?
> 
> Think about it: what happens if you now truncate it back up to 10MB
> (i.e. aligned length) and then do an aligned atomic write on it.
> 
> First: What happens when you truncate up?
> 
> ......
> 
> Yes, iomap_zero_range() will see the unwritten extent and skip it.
> i.e. The unwritten extent stays as an unwritten extent, it's now
> within EOF. That written->unwritten extent boundary is not on an
> aligned file offset.

Right

> 
> Second: What happens when you do a correctly aligned atomic write
> that spans this range now?
> 
> ......
> 
> Iomap only maps a single extent at a time, so it will only map the
> written range from the start of the IO (aligned) to the start of the
> unwritten extent (unaligned).  Hence the atomic write will be
> rejected because we can't do the atomic write to such an unaligned
> extent.

It was being considered to change this handling for atomic writes. More 
below at *.

> 
> That's not a bug in the atomic write path - this failure occurs
> because of the truncate behaviour doing post-eof unwritten extent
> conversion....
> 
> Yes, I agree that the entire -physical- extent is still correctly
> aligned on disk so you could argue that the unwritten conversion
> that xfs_bunmapi_range is doing is valid forced alignment behaviour.
> However, the fact is that breaking the aligned physical extent into
> two unaligned contiguous extents in different states in the BMBT
> means that they are treated as two seperate unaligned extents, not
> one contiguous aligned physical extent.

Right, this is problematic.

* I guess that you had not been following the recent discussion on this 
topic in the latest xfs atomic writes series @ 
https://lore.kernel.org/linux-xfs/20240817094800.776408-1-john.g.garry@oracle.com/ 
and also mentioned earlier in 
https://lore.kernel.org/linux-xfs/20240726171358.GA27612@lst.de/

There I dropped the sub-alloc unit zeroing. The concept to iter for a 
single bio seems sane, but as Darrick mentioned, we have issue of 
non-atomically committing all the extent conversions.

FWIW, this is how I think that the change according to Darrick's idea 
would look:

---->8----

diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
index 72c981e3dc92..ec76d6a8750d 100644
--- a/fs/xfs/xfs_iomap.c
+++ b/fs/xfs/xfs_iomap.c
@@ -244,7 +244,8 @@ xfs_iomap_write_direct(
  	xfs_fileoff_t		count_fsb,
  	unsigned int		flags,
  	struct xfs_bmbt_irec	*imap,
-	u64			*seq)
+	u64			*seq,
+	bool			zeroing)
  {
  	struct xfs_mount	*mp = ip->i_mount;
  	struct xfs_trans	*tp;
@@ -285,7 +286,7 @@ xfs_iomap_write_direct(
  	 * the reserve block pool for bmbt block allocation if there is no space
  	 * left but we need to do unwritten extent conversion.
  	 */
+	if (zeroing)
+		bmapi_flags = XFS_BMAPI_CONVERT | XFS_BMAPI_ZERO;
	if (flags & IOMAP_DAX) {
  		bmapi_flags = XFS_BMAPI_CONVERT | XFS_BMAPI_ZERO;
  		if (imap->br_state == XFS_EXT_UNWRITTEN) {
  			force = true;
@@ -780,6 +781,11 @@ xfs_direct_write_iomap_begin(
  	u16			iomap_flags = 0;
  	unsigned int		lockmode;
  	u64			seq;
+	bool			atomic = flags & IOMAP_ATOMIC;
+	struct xfs_bmbt_irec	imap2[XFS_BMAP_MAX_NMAP];
+	xfs_fileoff_t		_offset_fsb = xfs_inode_rounddown_alloc_unit(ip, 
offset_fsb);
+	xfs_fileoff_t		_end_fsb = xfs_inode_roundup_alloc_unit(ip, end_fsb);
+	int loop_index;

  	ASSERT(flags & (IOMAP_WRITE | IOMAP_ZERO));

@@ -843,6 +849,9 @@ xfs_direct_write_iomap_begin(
  	if (imap_needs_alloc(inode, flags, &imap, nimaps))
  		goto allocate_blocks;

+	if (atomic && !imap_spans_range(&imap, offset_fsb, end_fsb))
+		goto out_atomic;
+
  	/*
  	 * NOWAIT and OVERWRITE I/O needs to span the entire requested I/O with
  	 * a single map so that we avoid partial IO failures due to the rest of
@@ -897,7 +906,7 @@ xfs_direct_write_iomap_begin(
  	xfs_iunlock(ip, lockmode);

  	error = xfs_iomap_write_direct(ip, offset_fsb, end_fsb - offset_fsb,
-			flags, &imap, &seq);
+			flags, &imap, &seq, false);
  	if (error)
  		return error;

@@ -918,6 +927,28 @@ xfs_direct_write_iomap_begin(
  	xfs_iunlock(ip, lockmode);
  	return xfs_bmbt_to_iomap(ip, iomap, &cmap, flags, IOMAP_F_SHARED, seq);

+out_atomic:
+	nimaps = XFS_BMAP_MAX_NMAP;
+
+	error = xfs_bmapi_read(ip, _offset_fsb, _end_fsb - _offset_fsb, &imap2[0],
+			       &nimaps, 0);
+	for (loop_index = 0; loop_index < nimaps; loop_index++) {
+		struct xfs_bmbt_irec *_imap2 = &imap2[loop_index];
+
+		if (_imap2->br_state == XFS_EXT_UNWRITTEN) {
+
+			xfs_iunlock(ip, lockmode);
+
+			error = xfs_iomap_write_direct(ip, _imap2->br_startoff, 
_imap2->br_blockcount,
+					flags, &imap, &seq, true);
+			if (error)
+				return error;
+			goto relock;
+		}
+	}
+	/* We should not reach this, but TODO better handling */
+	error = -EINVAL;
+
  out_unlock:
  	if (lockmode)
  		xfs_iunlock(ip, lockmode);

-----8<----

But I have my doubts about using XFS_BMAPI_ZERO, even if it is just for 
pre-zeroing unwritten parts of the alloc unit for an atomic write.

> 
> This extent state discontiunity is results in breaking physical IO
> across the extent state boundary. Hence such an unaligned extent
> state change violates the physical IO alignment guarantees that
> forced alignment is supposed to provide atomic writes...

Yes

> 
> This is the reason why operations like hole punching round to extent
> size for forced alignment at the highest level. i.e. We cannot allow
> xfs_bunmapi_range() to "take care" of alignment handling because
> managing partial extent unmappings with sub-alloc-unit unwritten
> extents does not work for forced alignment.
> 
> Again, this is different to the traditional RT file behaviour - it
> can use unwritten extents for sub-alloc-unit alignment unmaps
> because the RT device can align file offset to any physical offset,
> and issue unaligned sector sized IO without any restrictions. Forced
> alignment does not have this freedom, and when we extend forced
> alignment to RT files, it will not have the freedom to use
> unwritten extents for sub-alloc-unit unmapping, either.
> 
So how do you think that we should actually implement 
xfs_itruncate_extents_flags() properly for forcealign? Would it simply 
be like:

--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -1050,7 +1050,7 @@ xfs_itruncate_extents_flags(
                 WARN_ON_ONCE(first_unmap_block > XFS_MAX_FILEOFF);
                 return 0;
         }
+	if (xfs_inode_has_forcealign(ip))
+	       first_unmap_block = xfs_inode_roundup_alloc_unit(ip, 
first_unmap_block);
         error = xfs_bunmapi_range(&tp, ip, flags, first_unmap_block,

Thanks,
John


  reply	other threads:[~2024-09-09 16:19 UTC|newest]

Thread overview: 58+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-08-13 16:36 [PATCH v4 00/14] forcealign for xfs John Garry
2024-08-13 16:36 ` [PATCH v4 01/14] xfs: only allow minlen allocations when near ENOSPC John Garry
2024-08-23 16:28   ` Darrick J. Wong
2024-08-13 16:36 ` [PATCH v4 02/14] xfs: always tail align maxlen allocations John Garry
2024-08-23 16:31   ` Darrick J. Wong
2024-08-29 17:58     ` John Garry
2024-08-29 21:34       ` Darrick J. Wong
2024-08-13 16:36 ` [PATCH v4 03/14] xfs: simplify extent allocation alignment John Garry
2024-08-13 16:36 ` [PATCH v4 04/14] xfs: make EOF allocation simpler John Garry
2024-09-04 18:25   ` Ritesh Harjani
2024-09-05  7:51     ` John Garry
2024-08-13 16:36 ` [PATCH v4 05/14] xfs: introduce forced allocation alignment John Garry
2024-08-13 16:36 ` [PATCH v4 06/14] xfs: align args->minlen for " John Garry
2024-08-13 16:36 ` [PATCH v4 07/14] xfs: Introduce FORCEALIGN inode flag John Garry
2024-08-13 16:36 ` [PATCH v4 08/14] xfs: Update xfs_inode_alloc_unitsize() for forcealign John Garry
2024-08-13 16:36 ` [PATCH v4 09/14] xfs: Update xfs_setattr_size() " John Garry
2024-08-13 16:36 ` [PATCH v4 10/14] xfs: Do not free EOF blocks " John Garry
2024-08-13 16:36 ` [PATCH v4 11/14] xfs: Only free full extents " John Garry
2024-08-13 16:36 ` [PATCH v4 12/14] xfs: Unmap blocks according to forcealign John Garry
2024-08-23 16:35   ` Darrick J. Wong
2024-08-13 16:36 ` [PATCH v4 13/14] xfs: Don't revert allocated offset for forcealign John Garry
2024-08-13 16:36 ` [PATCH v4 14/14] xfs: Enable file data forcealign feature John Garry
2024-09-04 18:14 ` [PATCH v4 00/14] forcealign for xfs Ritesh Harjani
2024-09-04 23:20   ` Dave Chinner
2024-09-05  3:56     ` Ritesh Harjani
2024-09-05  6:33       ` Dave Chinner
2024-09-10  2:51         ` Ritesh Harjani
2024-09-16  6:33           ` Dave Chinner
2024-09-10 12:33         ` Ritesh Harjani
2024-09-16  7:03           ` Dave Chinner
2024-09-16 10:24             ` John Garry
2024-09-17 20:54               ` Darrick J. Wong
2024-09-17 23:34                 ` Dave Chinner
2024-09-17 22:12               ` Dave Chinner
2024-09-18  7:59                 ` John Garry
2024-09-23  2:57                   ` Dave Chinner
2024-09-23  3:33                     ` Christoph Hellwig
2024-09-23  8:16                       ` John Garry
2024-09-23 12:07                         ` Christoph Hellwig
2024-09-23 12:33                           ` John Garry
2024-09-24  6:17                             ` Christoph Hellwig
2024-09-24  9:48                               ` John Garry
2024-11-29 11:36                                 ` John Garry
2024-09-23  8:00                     ` John Garry
2024-09-05 10:15     ` John Garry
2024-09-05 21:47       ` Dave Chinner
2024-09-06 14:31         ` John Garry
2024-09-08 22:49           ` Dave Chinner
2024-09-09 16:18             ` John Garry [this message]
2024-09-16  5:25               ` Dave Chinner
2024-09-16  9:44                 ` John Garry
2024-09-17 22:27                   ` Dave Chinner
2024-09-18 10:12                     ` John Garry
2024-11-14 12:48                       ` Long Li
2024-11-14 16:22                         ` John Garry
2024-11-14 20:07                         ` Dave Chinner
2024-11-15  8:14                           ` John Garry
2024-11-15 11:20                           ` Long Li

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=84b68068-e159-4e28-bf06-767ea7858d79@oracle.com \
    --to=john.g.garry@oracle.com \
    --cc=brauner@kernel.org \
    --cc=catherine.hoang@oracle.com \
    --cc=chandan.babu@oracle.com \
    --cc=david@fromorbit.com \
    --cc=dchinner@redhat.com \
    --cc=djwong@kernel.org \
    --cc=hch@lst.de \
    --cc=jack@suse.cz \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-xfs@vger.kernel.org \
    --cc=martin.petersen@oracle.com \
    --cc=ritesh.list@gmail.com \
    --cc=viro@zeniv.linux.org.uk \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).