public inbox for linux-xfs@vger.kernel.org
* [PATCHSET 0/3] xfs: fixes for 6.2
@ 2022-11-24 16:59 Darrick J. Wong
  2022-11-24 16:59 ` [PATCH 1/3] xfs: invalidate block device page cache during unmount Darrick J. Wong
                   ` (3 more replies)
  0 siblings, 4 replies; 18+ messages in thread
From: Darrick J. Wong @ 2022-11-24 16:59 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs

Hi all,

Bug fixes for XFS for 6.2.  The first one fixes stale bdev pagecache
contents after unmount, and the other two resolve gcc warnings.

If you're going to start using this mess, you probably ought to just
pull from my git trees, which are linked below.

This is an extraordinary way to destroy everything.  Enjoy!
Comments and questions are, as always, welcome.

--D

kernel git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfs-linux.git/log/?h=xfs-6.2-fixes
---
 fs/xfs/xfs_buf.c       |    1 +
 fs/xfs/xfs_trans_ail.c |    4 +++-
 fs/xfs/xfs_xattr.c     |    2 +-
 3 files changed, 5 insertions(+), 2 deletions(-)


^ permalink raw reply	[flat|nested] 18+ messages in thread

* [PATCH 1/3] xfs: invalidate block device page cache during unmount
  2022-11-24 16:59 [PATCHSET 0/3] xfs: fixes for 6.2 Darrick J. Wong
@ 2022-11-24 16:59 ` Darrick J. Wong
  2022-11-29  2:36   ` Gao Xiang
  2022-11-29  5:23   ` Dave Chinner
  2022-11-24 16:59 ` [PATCH 2/3] xfs: use memcpy, not strncpy, to format the attr prefix during listxattr Darrick J. Wong
                   ` (2 subsequent siblings)
  3 siblings, 2 replies; 18+ messages in thread
From: Darrick J. Wong @ 2022-11-24 16:59 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Every now and then I see fstests failures on aarch64 (64k pages) that
trigger on the following sequence:

mkfs.xfs $dev
mount $dev $mnt
touch $mnt/a
umount $mnt
xfs_db -c 'path /a' -c 'print' $dev

99% of the time this succeeds, but every now and then xfs_db cannot find
/a and fails.  This turns out to be a race involving udev/blkid, the
page cache for the block device, and the xfs_db process.

udev is triggered whenever anyone closes a block device or unmounts it.
The default udev rules invoke blkid to read the fs super and create
symlinks to the bdev under /dev/disk.  For this, it uses buffered reads
through the page cache.

xfs_db also uses buffered reads to examine metadata.  There is no
coordination between xfs_db and udev, which means that they can run
concurrently.  Note there is no coordination between the kernel and
blkid either.

On a system with 64k pages, the page cache can cache the superblock and
the root inode (and hence the root dir) with the same 64k page.  If
udev spawns blkid after the mkfs and the system is busy enough that it
is still running when xfs_db starts up, they'll both read from the same
page in the pagecache.

The unmount writes updated inode metadata to disk directly.  The XFS
buffer cache does not use the bdev pagecache, nor does it invalidate the
pagecache on umount.  If the above scenario occurs, the pagecache no
longer reflects what's on disk, xfs_db reads the stale metadata, and
fails to find /a.  Most of the time this succeeds because closing a bdev
invalidates the page cache, but when processes race, everyone loses.

Fix the problem by invalidating the bdev pagecache after flushing the
bdev, so that xfs_db will see up to date metadata.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/xfs_buf.c |    1 +
 1 file changed, 1 insertion(+)


diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
index dde346450952..54c774af6e1c 100644
--- a/fs/xfs/xfs_buf.c
+++ b/fs/xfs/xfs_buf.c
@@ -1945,6 +1945,7 @@ xfs_free_buftarg(
 	list_lru_destroy(&btp->bt_lru);
 
 	blkdev_issue_flush(btp->bt_bdev);
+	invalidate_bdev(btp->bt_bdev);
 	fs_put_dax(btp->bt_daxdev, btp->bt_mount);
 
 	kmem_free(btp);



* [PATCH 2/3] xfs: use memcpy, not strncpy, to format the attr prefix during listxattr
  2022-11-24 16:59 [PATCHSET 0/3] xfs: fixes for 6.2 Darrick J. Wong
  2022-11-24 16:59 ` [PATCH 1/3] xfs: invalidate block device page cache during unmount Darrick J. Wong
@ 2022-11-24 16:59 ` Darrick J. Wong
  2022-11-29  2:37   ` Gao Xiang
  2022-11-29  5:26   ` Dave Chinner
  2022-11-24 16:59 ` [PATCH 3/3] xfs: shut up -Wuninitialized in xfsaild_push Darrick J. Wong
  2022-11-27 18:36 ` [PATCH 4/3] xfs: attach dquots to inode before reading data/cow fork mappings Darrick J. Wong
  3 siblings, 2 replies; 18+ messages in thread
From: Darrick J. Wong @ 2022-11-24 16:59 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

When -Wstringop-truncation is enabled, the compiler complains about
truncation of the null byte at the end of the xattr name prefix.  This
is intentional, since we're concatenating the two strings together and
do _not_ want a null byte in the middle of the name.

We've already ensured that the name buffer is long enough to handle
prefix and name, and the prefix_len is supposed to be the length of the
prefix string without the null byte, so use memcpy here instead.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/xfs_xattr.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)


diff --git a/fs/xfs/xfs_xattr.c b/fs/xfs/xfs_xattr.c
index c325a28b89a8..10aa1fd39d2b 100644
--- a/fs/xfs/xfs_xattr.c
+++ b/fs/xfs/xfs_xattr.c
@@ -210,7 +210,7 @@ __xfs_xattr_put_listent(
 		return;
 	}
 	offset = context->buffer + context->count;
-	strncpy(offset, prefix, prefix_len);
+	memcpy(offset, prefix, prefix_len);
 	offset += prefix_len;
 	strncpy(offset, (char *)name, namelen);			/* real name */
 	offset += namelen;



* [PATCH 3/3] xfs: shut up -Wuninitialized in xfsaild_push
  2022-11-24 16:59 [PATCHSET 0/3] xfs: fixes for 6.2 Darrick J. Wong
  2022-11-24 16:59 ` [PATCH 1/3] xfs: invalidate block device page cache during unmount Darrick J. Wong
  2022-11-24 16:59 ` [PATCH 2/3] xfs: use memcpy, not strncpy, to format the attr prefix during listxattr Darrick J. Wong
@ 2022-11-24 16:59 ` Darrick J. Wong
  2022-11-29  3:00   ` Gao Xiang
  2022-11-29  5:36   ` Dave Chinner
  2022-11-27 18:36 ` [PATCH 4/3] xfs: attach dquots to inode before reading data/cow fork mappings Darrick J. Wong
  3 siblings, 2 replies; 18+ messages in thread
From: Darrick J. Wong @ 2022-11-24 16:59 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

-Wuninitialized complains about @target in xfsaild_push being
uninitialized in the case where the waitqueue is active but there is no
last item in the AIL to wait for.  I /think/ it should never be the case
that the subsequent xfs_trans_ail_cursor_first returns a log item and
hence we'll never end up at XFS_LSN_CMP, but let's make this explicit.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/xfs_trans_ail.c |    4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)


diff --git a/fs/xfs/xfs_trans_ail.c b/fs/xfs/xfs_trans_ail.c
index f51df7d94ef7..7d4109af193e 100644
--- a/fs/xfs/xfs_trans_ail.c
+++ b/fs/xfs/xfs_trans_ail.c
@@ -422,7 +422,7 @@ xfsaild_push(
 	struct xfs_ail_cursor	cur;
 	struct xfs_log_item	*lip;
 	xfs_lsn_t		lsn;
-	xfs_lsn_t		target;
+	xfs_lsn_t		target = NULLCOMMITLSN;
 	long			tout;
 	int			stuck = 0;
 	int			flushing = 0;
@@ -472,6 +472,8 @@ xfsaild_push(
 
 	XFS_STATS_INC(mp, xs_push_ail);
 
+	ASSERT(target != NULLCOMMITLSN);
+
 	lsn = lip->li_lsn;
 	while ((XFS_LSN_CMP(lip->li_lsn, target) <= 0)) {
 		int	lock_result;



* [PATCH 4/3] xfs: attach dquots to inode before reading data/cow fork mappings
  2022-11-24 16:59 [PATCHSET 0/3] xfs: fixes for 6.2 Darrick J. Wong
                   ` (2 preceding siblings ...)
  2022-11-24 16:59 ` [PATCH 3/3] xfs: shut up -Wuninitialized in xfsaild_push Darrick J. Wong
@ 2022-11-27 18:36 ` Darrick J. Wong
  2022-11-29  6:31   ` Dave Chinner
  2022-11-29 21:05   ` [PATCH v2 " Darrick J. Wong
  3 siblings, 2 replies; 18+ messages in thread
From: Darrick J. Wong @ 2022-11-27 18:36 UTC (permalink / raw)
  To: linux-xfs; +Cc: Dave Chinner

From: Darrick J. Wong <djwong@kernel.org>

I've been running near-continuous integration testing of online fsck,
and I've noticed that once a day, one of the ARM VMs will fail the test
with out of order records in the data fork.

xfs/804 races fsstress with online scrub (aka scan but do not change
anything), so I think this might be a bug in the core xfs code.  This
also only seems to trigger if one runs the test for more than ~6 minutes
via TIME_FACTOR=13 or something.
https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfstests-dev.git/tree/tests/xfs/804?h=djwong-wtf

I added a debugging patch to the kernel to check the data fork extents
after taking the ILOCK, before dropping ILOCK, and before and after each
bmapping operation.  So far I've narrowed it down to the delalloc code
inserting a record in the wrong place in the iext tree:

xfs_bmap_add_extent_hole_delay, near line 2691:

	case 0:
		/*
		 * New allocation is not contiguous with another
		 * delayed allocation.
		 * Insert a new entry.
		 */
		oldlen = newlen = 0;
		xfs_iunlock_check_datafork(ip);		<-- ok here
		xfs_iext_insert(ip, icur, new, state);
		xfs_iunlock_check_datafork(ip);		<-- bad here
		break;
	}

I recorded the state of the data fork mappings and iext cursor state
when a corrupt data fork is detected immediately after the
xfs_bmap_add_extent_hole_delay call in xfs_bmapi_reserve_delalloc:

ino 0x140bb3 func xfs_bmapi_reserve_delalloc line 4164 data fork:
    ino 0x140bb3 nr 0x0 nr_real 0x0 offset 0xb9 blockcount 0x1f startblock 0x935de2 state 1
    ino 0x140bb3 nr 0x1 nr_real 0x1 offset 0xe6 blockcount 0xa startblock 0xffffffffe0007 state 0
    ino 0x140bb3 nr 0x2 nr_real 0x1 offset 0xd8 blockcount 0xe startblock 0x935e01 state 0

Here we see that a delalloc extent was inserted into the wrong position
in the iext leaf, same as all the other times.  The extra trace data I
collected are as follows:

ino 0x140bb3 fork 0 oldoff 0xe6 oldlen 0x4 oldprealloc 0x6 isize 0xe6000
    ino 0x140bb3 oldgotoff 0xea oldgotstart 0xfffffffffffffffe oldgotcount 0x0 oldgotstate 0
    ino 0x140bb3 crapgotoff 0x0 crapgotstart 0x0 crapgotcount 0x0 crapgotstate 0
    ino 0x140bb3 freshgotoff 0xd8 freshgotstart 0x935e01 freshgotcount 0xe freshgotstate 0
    ino 0x140bb3 nowgotoff 0xe6 nowgotstart 0xffffffffe0007 nowgotcount 0xa nowgotstate 0
    ino 0x140bb3 oldicurpos 1 oldleafnr 2 oldleaf 0xfffffc00f0609a00
    ino 0x140bb3 crapicurpos 2 crapleafnr 2 crapleaf 0xfffffc00f0609a00
    ino 0x140bb3 freshicurpos 1 freshleafnr 2 freshleaf 0xfffffc00f0609a00
    ino 0x140bb3 newicurpos 1 newleafnr 3 newleaf 0xfffffc00f0609a00

The first line shows that xfs_bmapi_reserve_delalloc was called with
whichfork=XFS_DATA_FORK, off=0xe6, len=0x4, prealloc=6.

The second line ("oldgot") shows the contents of @got at the beginning
of the call, which are the results of the first iext lookup in
xfs_buffered_write_iomap_begin.

Line 3 ("crapgot") is the result of duplicating the cursor at the start
of the body of xfs_bmapi_reserve_delalloc and performing a fresh lookup
at @off.

Line 4 ("freshgot") is the result of a new xfs_iext_get_extent right
before the call to xfs_bmap_add_extent_hole_delay.  Totally garbage.

Line 5 ("nowgot") is contents of @got after the
xfs_bmap_add_extent_hole_delay call.

Line 6 is the contents of @icur at the beginning of the call.  Lines 7-9
are the contents of the iext cursors at the point where the block
mappings were sampled.

I think @oldgot is a HOLESTARTBLOCK extent because the first lookup
didn't find anything, so we filled in imap with "fake hole until the
end".  At the time of the first lookup, I suspect that there's only one
32-block unwritten extent in the mapping (hence oldicurpos==1) but by
the time we get to recording crapgot, crapicurpos==2.

Dave then added:

Ok, that's much simpler to reason about, and implies the smoke is
coming from xfs_buffered_write_iomap_begin() or
xfs_bmapi_reserve_delalloc(). I suspect the former - it does a lot
of stuff with the ILOCK_EXCL held.....

.... including calling xfs_qm_dqattach_locked().

xfs_buffered_write_iomap_begin
  ILOCK_EXCL
  look up icur
  xfs_qm_dqattach_locked
    xfs_qm_dqattach_one
      xfs_qm_dqget_inode
        dquot cache miss
        xfs_iunlock(ip, XFS_ILOCK_EXCL);
        error = xfs_qm_dqread(mp, id, type, can_alloc, &dqp);
        xfs_ilock(ip, XFS_ILOCK_EXCL);
  ....
  xfs_bmapi_reserve_delalloc(icur)

Yup, that's what is letting the magic smoke out -
xfs_qm_dqattach_locked() can cycle the ILOCK. If that happens, we
can pass a stale icur to xfs_bmapi_reserve_delalloc() and it all
goes downhill from there.

So.  Fix this by moving the dqattach_locked call up, and adding a
comment about how we must attach the dquots *before* sampling the
data/cow fork contents.

Fixes: a526c85c2236 ("xfs: move xfs_file_iomap_begin_delay around") # goes further back than this
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/xfs_iomap.c |   12 ++++++++----
 1 file changed, 8 insertions(+), 4 deletions(-)

diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
index 1bdd7afc1010..d903f0586490 100644
--- a/fs/xfs/xfs_iomap.c
+++ b/fs/xfs/xfs_iomap.c
@@ -984,6 +984,14 @@ xfs_buffered_write_iomap_begin(
 	if (error)
 		goto out_unlock;
 
+	/*
+	 * Attach dquots before we access the data/cow fork mappings, because
+	 * this function can cycle the ILOCK.
+	 */
+	error = xfs_qm_dqattach_locked(ip, false);
+	if (error)
+		goto out_unlock;
+
 	/*
 	 * Search the data fork first to look up our source mapping.  We
 	 * always need the data fork map, as we have to return it to the
@@ -1071,10 +1079,6 @@ xfs_buffered_write_iomap_begin(
 			allocfork = XFS_COW_FORK;
 	}
 
-	error = xfs_qm_dqattach_locked(ip, false);
-	if (error)
-		goto out_unlock;
-
 	if (eof && offset + count > XFS_ISIZE(ip)) {
 		/*
 		 * Determine the initial size of the preallocation.


* Re: [PATCH 1/3] xfs: invalidate block device page cache during unmount
  2022-11-24 16:59 ` [PATCH 1/3] xfs: invalidate block device page cache during unmount Darrick J. Wong
@ 2022-11-29  2:36   ` Gao Xiang
  2022-11-29  5:23   ` Dave Chinner
  1 sibling, 0 replies; 18+ messages in thread
From: Gao Xiang @ 2022-11-29  2:36 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Thu, Nov 24, 2022 at 08:59:24AM -0800, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
> 
> Every now and then I see fstests failures on aarch64 (64k pages) that
> trigger on the following sequence:
> 
> mkfs.xfs $dev
> mount $dev $mnt
> touch $mnt/a
> umount $mnt
> xfs_db -c 'path /a' -c 'print' $dev
> 
> 99% of the time this succeeds, but every now and then xfs_db cannot find
> /a and fails.  This turns out to be a race involving udev/blkid, the
> page cache for the block device, and the xfs_db process.
> 
> udev is triggered whenever anyone closes a block device or unmounts it.
> The default udev rules invoke blkid to read the fs super and create
> symlinks to the bdev under /dev/disk.  For this, it uses buffered reads
> through the page cache.
> 
> xfs_db also uses buffered reads to examine metadata.  There is no
> coordination between xfs_db and udev, which means that they can run
> concurrently.  Note there is no coordination between the kernel and
> blkid either.
> 
> On a system with 64k pages, the page cache can cache the superblock and
> the root inode (and hence the root dir) with the same 64k page.  If
> udev spawns blkid after the mkfs and the system is busy enough that it
> is still running when xfs_db starts up, they'll both read from the same
> page in the pagecache.
> 
> The unmount writes updated inode metadata to disk directly.  The XFS
> buffer cache does not use the bdev pagecache, nor does it invalidate the
> pagecache on umount.  If the above scenario occurs, the pagecache no
> longer reflects what's on disk, xfs_db reads the stale metadata, and
> fails to find /a.  Most of the time this succeeds because closing a bdev
> invalidates the page cache, but when processes race, everyone loses.
> 
> Fix the problem by invalidating the bdev pagecache after flushing the
> bdev, so that xfs_db will see up to date metadata.
> 
> Signed-off-by: Darrick J. Wong <djwong@kernel.org>

Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com>

Thanks,
Gao Xiang

> ---
>  fs/xfs/xfs_buf.c |    1 +
>  1 file changed, 1 insertion(+)
> 
> 
> diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
> index dde346450952..54c774af6e1c 100644
> --- a/fs/xfs/xfs_buf.c
> +++ b/fs/xfs/xfs_buf.c
> @@ -1945,6 +1945,7 @@ xfs_free_buftarg(
>  	list_lru_destroy(&btp->bt_lru);
>  
>  	blkdev_issue_flush(btp->bt_bdev);
> +	invalidate_bdev(btp->bt_bdev);
>  	fs_put_dax(btp->bt_daxdev, btp->bt_mount);
>  
>  	kmem_free(btp);


* Re: [PATCH 2/3] xfs: use memcpy, not strncpy, to format the attr prefix during listxattr
  2022-11-24 16:59 ` [PATCH 2/3] xfs: use memcpy, not strncpy, to format the attr prefix during listxattr Darrick J. Wong
@ 2022-11-29  2:37   ` Gao Xiang
  2022-11-29  5:26   ` Dave Chinner
  1 sibling, 0 replies; 18+ messages in thread
From: Gao Xiang @ 2022-11-29  2:37 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Thu, Nov 24, 2022 at 08:59:29AM -0800, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
> 
> When -Wstringop-truncation is enabled, the compiler complains about
> truncation of the null byte at the end of the xattr name prefix.  This
> is intentional, since we're concatenating the two strings together and
> do _not_ want a null byte in the middle of the name.
> 
> We've already ensured that the name buffer is long enough to handle
> prefix and name, and the prefix_len is supposed to be the length of the
> prefix string without the null byte, so use memcpy here instead.
> 
> Signed-off-by: Darrick J. Wong <djwong@kernel.org>

Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com>

Thanks,
Gao Xiang

> ---
>  fs/xfs/xfs_xattr.c |    2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> 
> diff --git a/fs/xfs/xfs_xattr.c b/fs/xfs/xfs_xattr.c
> index c325a28b89a8..10aa1fd39d2b 100644
> --- a/fs/xfs/xfs_xattr.c
> +++ b/fs/xfs/xfs_xattr.c
> @@ -210,7 +210,7 @@ __xfs_xattr_put_listent(
>  		return;
>  	}
>  	offset = context->buffer + context->count;
> -	strncpy(offset, prefix, prefix_len);
> +	memcpy(offset, prefix, prefix_len);
>  	offset += prefix_len;
>  	strncpy(offset, (char *)name, namelen);			/* real name */
>  	offset += namelen;


* Re: [PATCH 3/3] xfs: shut up -Wuninitialized in xfsaild_push
  2022-11-24 16:59 ` [PATCH 3/3] xfs: shut up -Wuninitialized in xfsaild_push Darrick J. Wong
@ 2022-11-29  3:00   ` Gao Xiang
  2022-11-29  5:36   ` Dave Chinner
  1 sibling, 0 replies; 18+ messages in thread
From: Gao Xiang @ 2022-11-29  3:00 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Thu, Nov 24, 2022 at 08:59:35AM -0800, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
> 
> -Wuninitialized complains about @target in xfsaild_push being
> uninitialized in the case where the waitqueue is active but there is no
> last item in the AIL to wait for.  I /think/ it should never be the case
> that the subsequent xfs_trans_ail_cursor_first returns a log item and
> hence we'll never end up at XFS_LSN_CMP, but let's make this explicit.
> 
> Signed-off-by: Darrick J. Wong <djwong@kernel.org>

As far as I understand, I don't think this can happen as well.

Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com>

Thanks,
Gao Xiang


> ---
>  fs/xfs/xfs_trans_ail.c |    4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
> 
> 
> diff --git a/fs/xfs/xfs_trans_ail.c b/fs/xfs/xfs_trans_ail.c
> index f51df7d94ef7..7d4109af193e 100644
> --- a/fs/xfs/xfs_trans_ail.c
> +++ b/fs/xfs/xfs_trans_ail.c
> @@ -422,7 +422,7 @@ xfsaild_push(
>  	struct xfs_ail_cursor	cur;
>  	struct xfs_log_item	*lip;
>  	xfs_lsn_t		lsn;
> -	xfs_lsn_t		target;
> +	xfs_lsn_t		target = NULLCOMMITLSN;
>  	long			tout;
>  	int			stuck = 0;
>  	int			flushing = 0;
> @@ -472,6 +472,8 @@ xfsaild_push(
>  
>  	XFS_STATS_INC(mp, xs_push_ail);
>  
> +	ASSERT(target != NULLCOMMITLSN);
> +
>  	lsn = lip->li_lsn;
>  	while ((XFS_LSN_CMP(lip->li_lsn, target) <= 0)) {
>  		int	lock_result;


* Re: [PATCH 1/3] xfs: invalidate block device page cache during unmount
  2022-11-24 16:59 ` [PATCH 1/3] xfs: invalidate block device page cache during unmount Darrick J. Wong
  2022-11-29  2:36   ` Gao Xiang
@ 2022-11-29  5:23   ` Dave Chinner
  2022-11-29  5:59     ` Darrick J. Wong
  1 sibling, 1 reply; 18+ messages in thread
From: Dave Chinner @ 2022-11-29  5:23 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Thu, Nov 24, 2022 at 08:59:24AM -0800, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
> 
> Every now and then I see fstests failures on aarch64 (64k pages) that
> trigger on the following sequence:
> 
> mkfs.xfs $dev
> mount $dev $mnt
> touch $mnt/a
> umount $mnt
> xfs_db -c 'path /a' -c 'print' $dev
> 
> 99% of the time this succeeds, but every now and then xfs_db cannot find
> /a and fails.  This turns out to be a race involving udev/blkid, the
> page cache for the block device, and the xfs_db process.
> 
> udev is triggered whenever anyone closes a block device or unmounts it.
> The default udev rules invoke blkid to read the fs super and create
> symlinks to the bdev under /dev/disk.  For this, it uses buffered reads
> through the page cache.
> 
> xfs_db also uses buffered reads to examine metadata.  There is no
> coordination between xfs_db and udev, which means that they can run
> concurrently.  Note there is no coordination between the kernel and
> blkid either.
> 
> On a system with 64k pages, the page cache can cache the superblock and
> the root inode (and hence the root dir) with the same 64k page.  If
> udev spawns blkid after the mkfs and the system is busy enough that it
> is still running when xfs_db starts up, they'll both read from the same
> page in the pagecache.
> 
> The unmount writes updated inode metadata to disk directly.  The XFS
> buffer cache does not use the bdev pagecache, nor does it invalidate the
> pagecache on umount.  If the above scenario occurs, the pagecache no
> longer reflects what's on disk, xfs_db reads the stale metadata, and
> fails to find /a.  Most of the time this succeeds because closing a bdev
> invalidates the page cache, but when processes race, everyone loses.
> 
> Fix the problem by invalidating the bdev pagecache after flushing the
> bdev, so that xfs_db will see up to date metadata.
> 
> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> ---
>  fs/xfs/xfs_buf.c |    1 +
>  1 file changed, 1 insertion(+)
> 
> 
> diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
> index dde346450952..54c774af6e1c 100644
> --- a/fs/xfs/xfs_buf.c
> +++ b/fs/xfs/xfs_buf.c
> @@ -1945,6 +1945,7 @@ xfs_free_buftarg(
>  	list_lru_destroy(&btp->bt_lru);
>  
>  	blkdev_issue_flush(btp->bt_bdev);
> +	invalidate_bdev(btp->bt_bdev);
>  	fs_put_dax(btp->bt_daxdev, btp->bt_mount);
>  
>  	kmem_free(btp);

Looks OK and because XFS has multiple block devices we have to do
this invalidation for each bdev.

Reviewed-by: Dave Chinner <dchinner@redhat.com>

However: this does not look to be an XFS specific problem.  If we
look at reconfigure_super(), when it completes a remount-ro
operation it calls invalidate_bdev() because:

       /*
         * Some filesystems modify their metadata via some other path than the
         * bdev buffer cache (eg. use a private mapping, or directories in
         * pagecache, etc). Also file data modifications go via their own
         * mappings. So If we try to mount readonly then copy the filesystem
         * from bdev, we could get stale data, so invalidate it to give a best
         * effort at coherency.
         */
        if (remount_ro && sb->s_bdev)
                invalidate_bdev(sb->s_bdev);

This is pretty much the same problem as this patch avoids for XFS in
the unmount path, yes? Shouldn't we be adding a call to
invalidate_bdev(sb->s_bdev) after the fs->kill_sb() call in
deactivate_locked_super() so that this problem goes away for all
filesystems?

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH 2/3] xfs: use memcpy, not strncpy, to format the attr prefix during listxattr
  2022-11-24 16:59 ` [PATCH 2/3] xfs: use memcpy, not strncpy, to format the attr prefix during listxattr Darrick J. Wong
  2022-11-29  2:37   ` Gao Xiang
@ 2022-11-29  5:26   ` Dave Chinner
  1 sibling, 0 replies; 18+ messages in thread
From: Dave Chinner @ 2022-11-29  5:26 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Thu, Nov 24, 2022 at 08:59:29AM -0800, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
> 
> When -Wstringop-truncation is enabled, the compiler complains about
> truncation of the null byte at the end of the xattr name prefix.  This
> is intentional, since we're concatenating the two strings together and
> do _not_ want a null byte in the middle of the name.
> 
> We've already ensured that the name buffer is long enough to handle
> prefix and name, and the prefix_len is supposed to be the length of the
> prefix string without the null byte, so use memcpy here instead.
> 
> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> ---
>  fs/xfs/xfs_xattr.c |    2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> 
> diff --git a/fs/xfs/xfs_xattr.c b/fs/xfs/xfs_xattr.c
> index c325a28b89a8..10aa1fd39d2b 100644
> --- a/fs/xfs/xfs_xattr.c
> +++ b/fs/xfs/xfs_xattr.c
> @@ -210,7 +210,7 @@ __xfs_xattr_put_listent(
>  		return;
>  	}
>  	offset = context->buffer + context->count;
> -	strncpy(offset, prefix, prefix_len);
> +	memcpy(offset, prefix, prefix_len);
>  	offset += prefix_len;
>  	strncpy(offset, (char *)name, namelen);			/* real name */
>  	offset += namelen;

Looks fine.

Reviewed-by: Dave Chinner <dchinner@redhat.com>

-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH 3/3] xfs: shut up -Wuninitialized in xfsaild_push
  2022-11-24 16:59 ` [PATCH 3/3] xfs: shut up -Wuninitialized in xfsaild_push Darrick J. Wong
  2022-11-29  3:00   ` Gao Xiang
@ 2022-11-29  5:36   ` Dave Chinner
  1 sibling, 0 replies; 18+ messages in thread
From: Dave Chinner @ 2022-11-29  5:36 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Thu, Nov 24, 2022 at 08:59:35AM -0800, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
> 
> -Wuninitialized complains about @target in xfsaild_push being
> uninitialized in the case where the waitqueue is active but there is no
> last item in the AIL to wait for.  I /think/ it should never be the case
> that the subsequent xfs_trans_ail_cursor_first returns a log item and
> hence we'll never end up at XFS_LSN_CMP, but let's make this explicit.

If xfs_ail_max() returns NULL, then xfs_trans_ail_cursor_first()
must return NULL as the AIL is empty. So we always jump out of the
code in that case, and never use an uninitialised target value.
Older compilers (gcc-11) don't complain about target being used
uninitialised, only newer, "smarter" versions.

FWIW, the patchset I have that reworks the AIL push
target/wakeup/grant head accounting completely reworks this target
code[1], so in the mean time doing this to shut up the compiler
warnings is fine.

Reviewed-by: Dave Chinner <dchinner@redhat.com>

[1] https://lore.kernel.org/linux-xfs/20220809230353.3353059-1-david@fromorbit.com/

-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH 1/3] xfs: invalidate block device page cache during unmount
  2022-11-29  5:23   ` Dave Chinner
@ 2022-11-29  5:59     ` Darrick J. Wong
  0 siblings, 0 replies; 18+ messages in thread
From: Darrick J. Wong @ 2022-11-29  5:59 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Tue, Nov 29, 2022 at 04:23:22PM +1100, Dave Chinner wrote:
> On Thu, Nov 24, 2022 at 08:59:24AM -0800, Darrick J. Wong wrote:
> > From: Darrick J. Wong <djwong@kernel.org>
> > 
> > Every now and then I see fstests failures on aarch64 (64k pages) that
> > trigger on the following sequence:
> > 
> > mkfs.xfs $dev
> > mount $dev $mnt
> > touch $mnt/a
> > umount $mnt
> > xfs_db -c 'path /a' -c 'print' $dev
> > 
> > 99% of the time this succeeds, but every now and then xfs_db cannot find
> > /a and fails.  This turns out to be a race involving udev/blkid, the
> > page cache for the block device, and the xfs_db process.
> > 
> > udev is triggered whenever anyone closes a block device or unmounts it.
> > The default udev rules invoke blkid to read the fs super and create
> > symlinks to the bdev under /dev/disk.  For this, it uses buffered reads
> > through the page cache.
> > 
> > xfs_db also uses buffered reads to examine metadata.  There is no
> > coordination between xfs_db and udev, which means that they can run
> > concurrently.  Note there is no coordination between the kernel and
> > blkid either.
> > 
> > On a system with 64k pages, the page cache can cache the superblock and
> > the root inode (and hence the root dir) with the same 64k page.  If
> > udev spawns blkid after the mkfs and the system is busy enough that it
> > is still running when xfs_db starts up, they'll both read from the same
> > page in the pagecache.
> > 
> > The unmount writes updated inode metadata to disk directly.  The XFS
> > buffer cache does not use the bdev pagecache, nor does it invalidate the
> > pagecache on umount.  If the above scenario occurs, the pagecache no
> > longer reflects what's on disk, xfs_db reads the stale metadata, and
> > fails to find /a.  Most of the time this succeeds because closing a bdev
> > invalidates the page cache, but when processes race, everyone loses.
> > 
> > Fix the problem by invalidating the bdev pagecache after flushing the
> > bdev, so that xfs_db will see up to date metadata.
> > 
> > Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> > ---
> >  fs/xfs/xfs_buf.c |    1 +
> >  1 file changed, 1 insertion(+)
> > 
> > 
> > diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
> > index dde346450952..54c774af6e1c 100644
> > --- a/fs/xfs/xfs_buf.c
> > +++ b/fs/xfs/xfs_buf.c
> > @@ -1945,6 +1945,7 @@ xfs_free_buftarg(
> >  	list_lru_destroy(&btp->bt_lru);
> >  
> >  	blkdev_issue_flush(btp->bt_bdev);
> > +	invalidate_bdev(btp->bt_bdev);
> >  	fs_put_dax(btp->bt_daxdev, btp->bt_mount);
> >  
> >  	kmem_free(btp);
> 
> Looks OK and because XFS has multiple block devices we have to do
> this invalidation for each bdev.
> 
> Reviewed-by: Dave Chinner <dchinner@redhat.com>
> 
> However: this does not look to be an XFS specific problem.  If we
> look at reconfigure_super(), when it completes a remount-ro
> operation it calls invalidate_bdev() because:
> 
>        /*
>          * Some filesystems modify their metadata via some other path than the
>          * bdev buffer cache (eg. use a private mapping, or directories in
>          * pagecache, etc). Also file data modifications go via their own
>          * mappings. So If we try to mount readonly then copy the filesystem
>          * from bdev, we could get stale data, so invalidate it to give a best
>          * effort at coherency.
>          */
>         if (remount_ro && sb->s_bdev)
>                 invalidate_bdev(sb->s_bdev);
> 
> This is pretty much the same problem as this patch avoids for XFS in
> the unmount path, yes? Shouldn't we be adding a call to
> invalidate_bdev(sb->s_bdev) after the fs->kill_sb() call in
> deactivate_locked_super() so that this problem goes away for all
> filesystems?

I'm not sure this applies to everyone -- AFAICT, ext2/4 still write
everything through the bdev page cache, which means that the
invalidation isn't necessary there, except for perhaps the MMP block.

Years ago I remember Andreas rolling his eyes at how the kernel would
usually drop the whole pagecache between umount and e2fsck starting.
But I guess that's *usually* what we get anyways, so adding an
invalidation everywhere for the long tail of simple bdev filesystems
wouldn't hurt much.  Hmm.  Ok, I'm more convinced now.

I'll ask on the ext4 concall this week, and in the meantime try to
figure out what's the deal with btrfs.
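The race being discussed can be reduced to a toy model: one writer that bypasses the pagecache (the XFS buffer cache writing straight to disk) and one reader that trusts it (blkid/xfs_db). The following is an illustrative userspace sketch, not kernel code; every name in it is invented for the illustration.

```c
#include <assert.h>
#include <string.h>

/* Toy model of the race: "disk" holds the truth, "pagecache" may be stale.
 * XFS writes metadata through its own buffer cache (straight to "disk"),
 * so the bdev pagecache is only correct if it is invalidated on unmount. */

struct bdev_model {
    char disk[16];       /* what is actually on disk */
    char pagecache[16];  /* what a pagecache reader (xfs_db/blkid) sees */
    int cached;          /* does the pagecache hold a (possibly stale) copy? */
};

/* a udev/blkid read populates the pagecache from disk once */
static void pagecache_read(struct bdev_model *b)
{
    if (!b->cached) {
        memcpy(b->pagecache, b->disk, sizeof(b->disk));
        b->cached = 1;
    }
}

/* XFS metadata write: goes through the private buffer cache, not pagecache */
static void xfs_buffered_metadata_write(struct bdev_model *b, const char *data)
{
    strncpy(b->disk, data, sizeof(b->disk) - 1);
}

/* the fix: flushing is not enough, the cached copy must also be dropped */
static void unmount_bdev(struct bdev_model *b, int invalidate)
{
    /* blkdev_issue_flush(): disk already holds the data in this model */
    if (invalidate)
        b->cached = 0;   /* invalidate_bdev() */
}

/* what xfs_db sees after mount/touch/umount */
static const char *db_read(struct bdev_model *b)
{
    pagecache_read(b);
    return b->pagecache;
}
```

With `invalidate` set to 0 the model reproduces the fstests failure (xfs_db reads the stale copy); with it set to 1 the reader sees the post-unmount disk contents, matching what the one-line `invalidate_bdev()` call achieves.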

--D

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 4/3] xfs: attach dquots to inode before reading data/cow fork mappings
  2022-11-27 18:36 ` [PATCH 4/3] xfs: attach dquots to inode before reading data/cow fork mappings Darrick J. Wong
@ 2022-11-29  6:31   ` Dave Chinner
  2022-11-29  6:50     ` Darrick J. Wong
  2022-11-29 21:05   ` [PATCH v2 " Darrick J. Wong
  1 sibling, 1 reply; 18+ messages in thread
From: Dave Chinner @ 2022-11-29  6:31 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Sun, Nov 27, 2022 at 10:36:29AM -0800, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
> 
> I've been running near-continuous integration testing of online fsck,
> and I've noticed that once a day, one of the ARM VMs will fail the test
> with out of order records in the data fork.
> 
> xfs/804 races fsstress with online scrub (aka scan but do not change
> anything), so I think this might be a bug in the core xfs code.  This
> also only seems to trigger if one runs the test for more than ~6 minutes
> via TIME_FACTOR=13 or something.
> https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfstests-dev.git/tree/tests/xfs/804?h=djwong-wtf
.....
> So.  Fix this by moving the dqattach_locked call up, and add a comment
> about how we must attach the dquots *before* sampling the data/cow fork
> contents.
> 
> Fixes: a526c85c2236 ("xfs: move xfs_file_iomap_begin_delay around") # goes further back than this
> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> ---
>  fs/xfs/xfs_iomap.c |   12 ++++++++----
>  1 file changed, 8 insertions(+), 4 deletions(-)
> 
> diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
> index 1bdd7afc1010..d903f0586490 100644
> --- a/fs/xfs/xfs_iomap.c
> +++ b/fs/xfs/xfs_iomap.c
> @@ -984,6 +984,14 @@ xfs_buffered_write_iomap_begin(
>  	if (error)
>  		goto out_unlock;
>  
> +	/*
> +	 * Attach dquots before we access the data/cow fork mappings, because
> +	 * this function can cycle the ILOCK.
> +	 */
> +	error = xfs_qm_dqattach_locked(ip, false);
> +	if (error)
> +		goto out_unlock;
> +
>  	/*
>  	 * Search the data fork first to look up our source mapping.  We
>  	 * always need the data fork map, as we have to return it to the
> @@ -1071,10 +1079,6 @@ xfs_buffered_write_iomap_begin(
>  			allocfork = XFS_COW_FORK;
>  	}
>  
> -	error = xfs_qm_dqattach_locked(ip, false);
> -	if (error)
> -		goto out_unlock;
> -
>  	if (eof && offset + count > XFS_ISIZE(ip)) {
>  		/*
>  		 * Determine the initial size of the preallocation.
> 

Why not attach the dquots before we call xfs_ilock_for_iomap()?
That way we can just call xfs_qm_dqattach(ip, false) and just return
on failure immediately. That's exactly what we do in the
xfs_iomap_write_direct() path, and it avoids the need to mention
anything about lock cycling because we just don't care
about cycling the ILOCK to read in or allocate dquots before we
start the real work that needs to be done...

Hmmmmm - this means there's a potential problem with IOCB_NOWAIT
here - if the dquots are not in memory, we're going to drop and then
retake the ILOCK_EXCL without trylocks, potentially blocking a task
that should not get blocked. That's a separate problem, though, and
we probably need to plumb NOWAIT through to the dquot lookup cache
miss case to solve that.
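The IOCB_NOWAIT hazard described above boils down to: a caller who asked for trylock semantics must never re-enter a blocking lock acquisition. A minimal userspace sketch of that rule follows; it is a model, not kernel code, and all names in it are invented.

```c
#include <assert.h>

/* Toy model of the IOCB_NOWAIT hazard: a dquot cache miss drops the ILOCK
 * and retakes it with a plain (blocking) acquire, even though the caller
 * only ever agreed to trylock semantics. */

struct ilock_model {
    int held;
    int blocking_acquires;  /* how often the lock was taken without trylock */
};

static void ilock(struct ilock_model *l)
{
    l->held = 1;
    l->blocking_acquires++;
}

static int ilock_try(struct ilock_model *l)
{
    if (l->held)
        return 0;
    l->held = 1;
    return 1;
}

static void iunlock(struct ilock_model *l)
{
    l->held = 0;
}

/* current shape: a dquot cache miss cycles the lock with a blocking acquire */
static void dqattach_cache_miss(struct ilock_model *l)
{
    iunlock(l);      /* xfs_iunlock(ip, XFS_ILOCK_EXCL) */
    /* ... read the dquot from disk ... */
    ilock(l);        /* xfs_ilock(ip, XFS_ILOCK_EXCL): blocks, NOWAIT or not */
}

/* returns the number of blocking acquires a NOWAIT write would suffer */
static int nowait_write_with_cold_dquots(struct ilock_model *l)
{
    int before = l->blocking_acquires;

    if (!ilock_try(l))           /* the iomap entry point honours NOWAIT */
        return -1;
    dqattach_cache_miss(l);      /* ...but the dquot attach path does not */
    iunlock(l);
    return l->blocking_acquires - before;
}
```

In the model, a NOWAIT write over cold dquots still pays one blocking acquire, which is exactly the behaviour that plumbing NOWAIT through to the dquot cache-miss path would have to eliminate.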

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 4/3] xfs: attach dquots to inode before reading data/cow fork mappings
  2022-11-29  6:31   ` Dave Chinner
@ 2022-11-29  6:50     ` Darrick J. Wong
  2022-11-29  8:04       ` Dave Chinner
  0 siblings, 1 reply; 18+ messages in thread
From: Darrick J. Wong @ 2022-11-29  6:50 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Tue, Nov 29, 2022 at 05:31:04PM +1100, Dave Chinner wrote:
> On Sun, Nov 27, 2022 at 10:36:29AM -0800, Darrick J. Wong wrote:
> > From: Darrick J. Wong <djwong@kernel.org>
> > 
> > I've been running near-continuous integration testing of online fsck,
> > and I've noticed that once a day, one of the ARM VMs will fail the test
> > with out of order records in the data fork.
> > 
> > xfs/804 races fsstress with online scrub (aka scan but do not change
> > anything), so I think this might be a bug in the core xfs code.  This
> > also only seems to trigger if one runs the test for more than ~6 minutes
> > via TIME_FACTOR=13 or something.
> > https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfstests-dev.git/tree/tests/xfs/804?h=djwong-wtf
> .....
> > So.  Fix this by moving the dqattach_locked call up, and add a comment
> > about how we must attach the dquots *before* sampling the data/cow fork
> > contents.
> > 
> > Fixes: a526c85c2236 ("xfs: move xfs_file_iomap_begin_delay around") # goes further back than this
> > Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> > ---
> >  fs/xfs/xfs_iomap.c |   12 ++++++++----
> >  1 file changed, 8 insertions(+), 4 deletions(-)
> > 
> > diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
> > index 1bdd7afc1010..d903f0586490 100644
> > --- a/fs/xfs/xfs_iomap.c
> > +++ b/fs/xfs/xfs_iomap.c
> > @@ -984,6 +984,14 @@ xfs_buffered_write_iomap_begin(
> >  	if (error)
> >  		goto out_unlock;
> >  
> > +	/*
> > +	 * Attach dquots before we access the data/cow fork mappings, because
> > +	 * this function can cycle the ILOCK.
> > +	 */
> > +	error = xfs_qm_dqattach_locked(ip, false);
> > +	if (error)
> > +		goto out_unlock;
> > +
> >  	/*
> >  	 * Search the data fork first to look up our source mapping.  We
> >  	 * always need the data fork map, as we have to return it to the
> > @@ -1071,10 +1079,6 @@ xfs_buffered_write_iomap_begin(
> >  			allocfork = XFS_COW_FORK;
> >  	}
> >  
> > -	error = xfs_qm_dqattach_locked(ip, false);
> > -	if (error)
> > -		goto out_unlock;
> > -
> >  	if (eof && offset + count > XFS_ISIZE(ip)) {
> >  		/*
> >  		 * Determine the initial size of the preallocation.
> > 
> 
> Why not attach the dquots before we call xfs_ilock_for_iomap()?

I wanted to minimize the number of xfs_ilock calls -- under the scheme
you outline, xfs_qm_dqattach will lock it once; a dquot cache miss
will drop and retake it; and then xfs_ilock_for_iomap would take it yet
again.  That's one more ilock song-and-dance than this patch does...

> That way we can just call xfs_qm_dqattach(ip, false) and just return
> on failure immediately. That's exactly what we do in the
> xfs_iomap_write_direct() path, and it avoids the need to mention
> anything about lock cycling because we just don't care
> about cycling the ILOCK to read in or allocate dquots before we
> start the real work that needs to be done...

...but I guess it's cleaner once you start assuming that dqattach has
grown its own NOWAIT flag.  I'd sorta prefer to commit this corruption
fix as it is and rearrange dqget with NOWAIT as a separate series since
Linus has already warned us[1] to get things done sooner than later.

[1] https://lore.kernel.org/lkml/CAHk-=wgUZwX8Sbb8Zvm7FxWVfX6CGuE7x+E16VKoqL7Ok9vv7g@mail.gmail.com/

(OTOH it's already 6pm your time so I may very well be done with all
the quota nowait changes before you wake up :P)

> Hmmmmm - this means there's a potential problem with IOCB_NOWAIT
> here - if the dquots are not in memory, we're going to drop and then
> retake the ILOCK_EXCL without trylocks, potentially blocking a task
> that should not get blocked. That's a separate problem, though, and
> we probably need to plumb NOWAIT through to the dquot lookup cache
> miss case to solve that.

It wouldn't be that hard to turn that second parameter into the usual
uint flags argument, but I agree that's a separate patch.

How much you wanna bet the FB people have never turned on quota and
hence have not yet played whackanowait with that subsystem?

--D

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 4/3] xfs: attach dquots to inode before reading data/cow fork mappings
  2022-11-29  6:50     ` Darrick J. Wong
@ 2022-11-29  8:04       ` Dave Chinner
  2022-11-29 21:03         ` Darrick J. Wong
  0 siblings, 1 reply; 18+ messages in thread
From: Dave Chinner @ 2022-11-29  8:04 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Mon, Nov 28, 2022 at 10:50:40PM -0800, Darrick J. Wong wrote:
> On Tue, Nov 29, 2022 at 05:31:04PM +1100, Dave Chinner wrote:
> > On Sun, Nov 27, 2022 at 10:36:29AM -0800, Darrick J. Wong wrote:
> > > From: Darrick J. Wong <djwong@kernel.org>
> > > 
> > > I've been running near-continuous integration testing of online fsck,
> > > and I've noticed that once a day, one of the ARM VMs will fail the test
> > > with out of order records in the data fork.
> > > 
> > > xfs/804 races fsstress with online scrub (aka scan but do not change
> > > anything), so I think this might be a bug in the core xfs code.  This
> > > also only seems to trigger if one runs the test for more than ~6 minutes
> > > via TIME_FACTOR=13 or something.
> > > https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfstests-dev.git/tree/tests/xfs/804?h=djwong-wtf
> > .....
> > > So.  Fix this by moving the dqattach_locked call up, and add a comment
> > > about how we must attach the dquots *before* sampling the data/cow fork
> > > contents.
> > > 
> > > Fixes: a526c85c2236 ("xfs: move xfs_file_iomap_begin_delay around") # goes further back than this
> > > Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> > > ---
> > >  fs/xfs/xfs_iomap.c |   12 ++++++++----
> > >  1 file changed, 8 insertions(+), 4 deletions(-)
> > > 
> > > diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
> > > index 1bdd7afc1010..d903f0586490 100644
> > > --- a/fs/xfs/xfs_iomap.c
> > > +++ b/fs/xfs/xfs_iomap.c
> > > @@ -984,6 +984,14 @@ xfs_buffered_write_iomap_begin(
> > >  	if (error)
> > >  		goto out_unlock;
> > >  
> > > +	/*
> > > +	 * Attach dquots before we access the data/cow fork mappings, because
> > > +	 * this function can cycle the ILOCK.
> > > +	 */
> > > +	error = xfs_qm_dqattach_locked(ip, false);
> > > +	if (error)
> > > +		goto out_unlock;
> > > +
> > >  	/*
> > >  	 * Search the data fork first to look up our source mapping.  We
> > >  	 * always need the data fork map, as we have to return it to the
> > > @@ -1071,10 +1079,6 @@ xfs_buffered_write_iomap_begin(
> > >  			allocfork = XFS_COW_FORK;
> > >  	}
> > >  
> > > -	error = xfs_qm_dqattach_locked(ip, false);
> > > -	if (error)
> > > -		goto out_unlock;
> > > -
> > >  	if (eof && offset + count > XFS_ISIZE(ip)) {
> > >  		/*
> > >  		 * Determine the initial size of the preallocation.
> > > 
> > 
> > Why not attach the dquots before we call xfs_ilock_for_iomap()?
> 
> I wanted to minimize the number of xfs_ilock calls -- under the scheme
> you outline, xfs_qm_dqattach will lock it once; a dquot cache miss
> will drop and retake it; and then xfs_ilock_for_iomap would take it yet
> again.  That's one more ilock song-and-dance than this patch does...

True, but we don't have an extra lock cycle if the dquots are
already attached to the inode - xfs_qm_dqattach() checks for
attached dquots before it takes the ILOCK to attach them. Hence if
we are doing lots of small writes to a file, we only take this extra
lock cycle for the first delalloc reservation that we make, not
every single one....

We have to do it this way for anything that runs an actual
transaction (like the direct IO write path we take if an extent size
hint is set) as we can't cycle the ILOCK within a transaction
context, so the code is already optimised for the "dquots already
attached" case....
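The fast path Dave describes can be sketched as a double-checked attach: only the first reservation against a file pays the extra lock cycle, and every subsequent small write hits the unlocked "already attached" check. An illustrative userspace model with invented names:

```c
#include <assert.h>

/* Toy model of xfs_qm_dqattach()'s fast path: if the dquots are already
 * attached, no ILOCK cycle happens at all, so only the first of many
 * small writes to a file pays for the attach. */

struct inode_model {
    int dquots_attached;
    int lock_cycles;     /* ILOCK round trips dqattach has cost this inode */
};

static void dqattach(struct inode_model *ip)
{
    if (ip->dquots_attached)     /* unlocked check, the common case */
        return;
    ip->lock_cycles++;           /* xfs_ilock/xfs_iunlock around the attach */
    ip->dquots_attached = 1;
}

/* n small buffered writes; returns total lock cycles spent attaching */
static int do_small_writes(struct inode_model *ip, int n)
{
    for (int i = 0; i < n; i++)
        dqattach(ip);
    return ip->lock_cycles;
}
```

However many writes are issued, the attach cost in this model stays at one lock cycle, which is why calling dqattach unconditionally at the top of the function is cheap in steady state.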

> > That way we can just call xfs_qm_dqattach(ip, false) and just return
> > on failure immediately. That's exactly what we do in the
> > xfs_iomap_write_direct() path, and it avoids the need to mention
> > anything about lock cycling because we just don't care
> > about cycling the ILOCK to read in or allocate dquots before we
> > start the real work that needs to be done...
> 
> ...but I guess it's cleaner once you start assuming that dqattach has
> grown its own NOWAIT flag.  I'd sorta prefer to commit this corruption
> fix as it is and rearrange dqget with NOWAIT as a separate series since
> Linus has already warned us[1] to get things done sooner than later.
> 
> [1] https://lore.kernel.org/lkml/CAHk-=wgUZwX8Sbb8Zvm7FxWVfX6CGuE7x+E16VKoqL7Ok9vv7g@mail.gmail.com/

<shrug>

If that's your concern, then

Reviewed-by: Dave Chinner <dchinner@redhat.com>

However, as maintainer I was never concerned about being "too late
in the cycle". I'd just push it into the for-next tree with a stable
tag and when it gets merged in a couple of weeks the stable
maintainers should notice it and backport it appropriately
automatically....

For distro backports, merging into the XFS tree is good enough to be
considered upstream, as it's pretty much guaranteed to end up in the
mainline tree once it's been merged by the maintainer....

> (OTOH it's already 6pm your time so I may very well be done with all
> the quota nowait changes before you wake up :P)

NOWAIT changes are definitely next cycle stuff :)

> > Hmmmmm - this means there's a potential problem with IOCB_NOWAIT
> > here - if the dquots are not in memory, we're going to drop and then
> > retake the ILOCK_EXCL without trylocks, potentially blocking a task
> > that should not get blocked. That's a separate problem, though, and
> > we probably need to plumb NOWAIT through to the dquot lookup cache
> > miss case to solve that.
> 
> It wouldn't be that hard to turn that second parameter into the usual
> uint flags argument, but I agree that's a separate patch.

*nod*

> How much you wanna bet the FB people have never turned on quota and
> hence have not yet played whackanowait with that subsystem?

No bet, we both know the odds. :/

Indeed, set an extent size hint on a file and then run io_uring
async buffered writes and watch all the massive long tail latencies
that occur on the transaction reservations and btree block IO and
locking in the allocation path....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 4/3] xfs: attach dquots to inode before reading data/cow fork mappings
  2022-11-29  8:04       ` Dave Chinner
@ 2022-11-29 21:03         ` Darrick J. Wong
  0 siblings, 0 replies; 18+ messages in thread
From: Darrick J. Wong @ 2022-11-29 21:03 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Tue, Nov 29, 2022 at 07:04:50PM +1100, Dave Chinner wrote:
> On Mon, Nov 28, 2022 at 10:50:40PM -0800, Darrick J. Wong wrote:
> > On Tue, Nov 29, 2022 at 05:31:04PM +1100, Dave Chinner wrote:
> > > On Sun, Nov 27, 2022 at 10:36:29AM -0800, Darrick J. Wong wrote:
> > > > From: Darrick J. Wong <djwong@kernel.org>
> > > > 
> > > > I've been running near-continuous integration testing of online fsck,
> > > > and I've noticed that once a day, one of the ARM VMs will fail the test
> > > > with out of order records in the data fork.
> > > > 
> > > > xfs/804 races fsstress with online scrub (aka scan but do not change
> > > > anything), so I think this might be a bug in the core xfs code.  This
> > > > also only seems to trigger if one runs the test for more than ~6 minutes
> > > > via TIME_FACTOR=13 or something.
> > > > https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfstests-dev.git/tree/tests/xfs/804?h=djwong-wtf
> > > .....
> > > > So.  Fix this by moving the dqattach_locked call up, and add a comment
> > > > about how we must attach the dquots *before* sampling the data/cow fork
> > > > contents.
> > > > 
> > > > Fixes: a526c85c2236 ("xfs: move xfs_file_iomap_begin_delay around") # goes further back than this
> > > > Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> > > > ---
> > > >  fs/xfs/xfs_iomap.c |   12 ++++++++----
> > > >  1 file changed, 8 insertions(+), 4 deletions(-)
> > > > 
> > > > diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
> > > > index 1bdd7afc1010..d903f0586490 100644
> > > > --- a/fs/xfs/xfs_iomap.c
> > > > +++ b/fs/xfs/xfs_iomap.c
> > > > @@ -984,6 +984,14 @@ xfs_buffered_write_iomap_begin(
> > > >  	if (error)
> > > >  		goto out_unlock;
> > > >  
> > > > +	/*
> > > > +	 * Attach dquots before we access the data/cow fork mappings, because
> > > > +	 * this function can cycle the ILOCK.
> > > > +	 */
> > > > +	error = xfs_qm_dqattach_locked(ip, false);
> > > > +	if (error)
> > > > +		goto out_unlock;
> > > > +
> > > >  	/*
> > > >  	 * Search the data fork first to look up our source mapping.  We
> > > >  	 * always need the data fork map, as we have to return it to the
> > > > @@ -1071,10 +1079,6 @@ xfs_buffered_write_iomap_begin(
> > > >  			allocfork = XFS_COW_FORK;
> > > >  	}
> > > >  
> > > > -	error = xfs_qm_dqattach_locked(ip, false);
> > > > -	if (error)
> > > > -		goto out_unlock;
> > > > -
> > > >  	if (eof && offset + count > XFS_ISIZE(ip)) {
> > > >  		/*
> > > >  		 * Determine the initial size of the preallocation.
> > > > 
> > > 
> > > Why not attach the dquots before we call xfs_ilock_for_iomap()?
> > 
> > I wanted to minimize the number of xfs_ilock calls -- under the scheme
> > you outline, xfs_qm_dqattach will lock it once; a dquot cache miss
> > will drop and retake it; and then xfs_ilock_for_iomap would take it yet
> > again.  That's one more ilock song-and-dance than this patch does...
> 
> True, but we don't have an extra lock cycle if the dquots are
> already attached to the inode - xfs_qm_dqattach() checks for
> attached dquots before it takes the ILOCK to attach them. Hence if
> we are doing lots of small writes to a file, we only take this extra
> lock cycle for the first delalloc reservation that we make, not
> every single one....
> 
> We have to do it this way for anything that runs an actual
> transaction (like the direct IO write path we take if an extent size
> hint is set) as we can't cycle the ILOCK within a transaction
> context, so the code is already optimised for the "dquots already
> attached" case....

<nod> In the end, I decided to rewrite the patch to call xfs_qm_dqattach
at the start of xfs_buffered_write_iomap_begin.  I'll send that shortly.

> > > That way we can just call xfs_qm_dqattach(ip, false) and just return
> > > on failure immediately. That's exactly what we do in the
> > > xfs_iomap_write_direct() path, and it avoids the need to mention
> > > anything about lock cycling because we just don't care
> > > about cycling the ILOCK to read in or allocate dquots before we
> > > start the real work that needs to be done...
> > 
> > ...but I guess it's cleaner once you start assuming that dqattach has
> > grown its own NOWAIT flag.  I'd sorta prefer to commit this corruption
> > fix as it is and rearrange dqget with NOWAIT as a separate series since
> > Linus has already warned us[1] to get things done sooner than later.
> > 
> > [1] https://lore.kernel.org/lkml/CAHk-=wgUZwX8Sbb8Zvm7FxWVfX6CGuE7x+E16VKoqL7Ok9vv7g@mail.gmail.com/
> 
> <shrug>
> 
> If that's your concern, then
> 
> Reviewed-by: Dave Chinner <dchinner@redhat.com>

Thanks! ;)

> However, as maintainer I was never concerned about being "too late
> in the cycle". I'd just push it into the for next tree with a stable
> tag and when it gets merged in a couple of weeks the stable
> maintainers should notice it and backport it appropriately
> automatically....

<nod> Normally I wouldn't care about timing since it's a bugfix, but I
kinda want to get all these sharp ends wrapped up, to minimize the
number of fixes that we still have to work on for -rc1+ in January.

> For distro backports, merging into the XFS tree is good enough to be
> considered upstream, as it's pretty much guaranteed to end up in the
> mainline tree once it's been merged by the maintainer....
> 
> > (OTOH it's already 6pm your time so I may very well be done with all
> > the quota nowait changes before you wake up :P)
> 
> NOWAIT changes are definitely next cycle stuff :)
> 
> > > Hmmmmm - this means there's a potential problem with IOCB_NOWAIT
> > > here - if the dquots are not in memory, we're going to drop and then
> > > retake the ILOCK_EXCL without trylocks, potentially blocking a task
> > > that should not get blocked. That's a separate problem, though, and
> > > we probably need to plumb NOWAIT through to the dquot lookup cache
> > > miss case to solve that.
> > 
> > It wouldn't be that hard to turn that second parameter into the usual
> > uint flags argument, but I agree that's a separate patch.
> 
> *nod*
> 
> > How much you wanna bet the FB people have never turned on quota and
> > hence have not yet played whackanowait with that subsystem?
> 
> No bet, we both know the odds. :/
> 
> Indeed, set an extent size hint on a file and then run io_uring
> async buffered writes and watch all the massive long tail latencies
> that occur on the transaction reservations and btree block IO and
> locking in the allocation path....

Granted, I wonder what would

--D

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

^ permalink raw reply	[flat|nested] 18+ messages in thread

* [PATCH v2 4/3] xfs: attach dquots to inode before reading data/cow fork mappings
  2022-11-27 18:36 ` [PATCH 4/3] xfs: attach dquots to inode before reading data/cow fork mappings Darrick J. Wong
  2022-11-29  6:31   ` Dave Chinner
@ 2022-11-29 21:05   ` Darrick J. Wong
  2022-11-29 21:38     ` Dave Chinner
  1 sibling, 1 reply; 18+ messages in thread
From: Darrick J. Wong @ 2022-11-29 21:05 UTC (permalink / raw)
  To: linux-xfs; +Cc: Dave Chinner

From: Darrick J. Wong <djwong@kernel.org>

I've been running near-continuous integration testing of online fsck,
and I've noticed that once a day, one of the ARM VMs will fail the test
with out of order records in the data fork.

xfs/804 races fsstress with online scrub (aka scan but do not change
anything), so I think this might be a bug in the core xfs code.  This
also only seems to trigger if one runs the test for more than ~6 minutes
via TIME_FACTOR=13 or something.
https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfstests-dev.git/tree/tests/xfs/804?h=djwong-wtf

I added a debugging patch to the kernel to check the data fork extents
after taking the ILOCK, before dropping ILOCK, and before and after each
bmapping operation.  So far I've narrowed it down to the delalloc code
inserting a record in the wrong place in the iext tree:

xfs_bmap_add_extent_hole_delay, near line 2691:

	case 0:
		/*
		 * New allocation is not contiguous with another
		 * delayed allocation.
		 * Insert a new entry.
		 */
		oldlen = newlen = 0;
		xfs_iunlock_check_datafork(ip);		<-- ok here
		xfs_iext_insert(ip, icur, new, state);
		xfs_iunlock_check_datafork(ip);		<-- bad here
		break;
	}

I recorded the state of the data fork mappings and iext cursor state
when a corrupt data fork is detected immediately after the
xfs_bmap_add_extent_hole_delay call in xfs_bmapi_reserve_delalloc:

ino 0x140bb3 func xfs_bmapi_reserve_delalloc line 4164 data fork:
    ino 0x140bb3 nr 0x0 nr_real 0x0 offset 0xb9 blockcount 0x1f startblock 0x935de2 state 1
    ino 0x140bb3 nr 0x1 nr_real 0x1 offset 0xe6 blockcount 0xa startblock 0xffffffffe0007 state 0
    ino 0x140bb3 nr 0x2 nr_real 0x1 offset 0xd8 blockcount 0xe startblock 0x935e01 state 0

Here we see that a delalloc extent was inserted into the wrong position
in the iext leaf, same as all the other times.  The extra trace data I
collected are as follows:

ino 0x140bb3 fork 0 oldoff 0xe6 oldlen 0x4 oldprealloc 0x6 isize 0xe6000
    ino 0x140bb3 oldgotoff 0xea oldgotstart 0xfffffffffffffffe oldgotcount 0x0 oldgotstate 0
    ino 0x140bb3 crapgotoff 0x0 crapgotstart 0x0 crapgotcount 0x0 crapgotstate 0
    ino 0x140bb3 freshgotoff 0xd8 freshgotstart 0x935e01 freshgotcount 0xe freshgotstate 0
    ino 0x140bb3 nowgotoff 0xe6 nowgotstart 0xffffffffe0007 nowgotcount 0xa nowgotstate 0
    ino 0x140bb3 oldicurpos 1 oldleafnr 2 oldleaf 0xfffffc00f0609a00
    ino 0x140bb3 crapicurpos 2 crapleafnr 2 crapleaf 0xfffffc00f0609a00
    ino 0x140bb3 freshicurpos 1 freshleafnr 2 freshleaf 0xfffffc00f0609a00
    ino 0x140bb3 newicurpos 1 newleafnr 3 newleaf 0xfffffc00f0609a00

The first line shows that xfs_bmapi_reserve_delalloc was called with
whichfork=XFS_DATA_FORK, off=0xe6, len=0x4, prealloc=6.

The second line ("oldgot") shows the contents of @got at the beginning
of the call, which are the results of the first iext lookup in
xfs_buffered_write_iomap_begin.

Line 3 ("crapgot") is the result of duplicating the cursor at the start
of the body of xfs_bmapi_reserve_delalloc and performing a fresh lookup
at @off.

Line 4 ("freshgot") is the result of a new xfs_iext_get_extent right
before the call to xfs_bmap_add_extent_hole_delay.  Totally garbage.

Line 5 ("nowgot") is contents of @got after the
xfs_bmap_add_extent_hole_delay call.

Line 6 is the contents of @icur at the beginning of the call.  Lines 7-9
are the contents of the iext cursors at the point where the block
mappings were sampled.

I think @oldgot is a HOLESTARTBLOCK extent because the first lookup
didn't find anything, so we filled in imap with "fake hole until the
end".  At the time of the first lookup, I suspect that there's only one
32-block unwritten extent in the mapping (hence oldicurpos==1) but by
the time we get to recording crapgot, crapicurpos==2.

Dave then added:

Ok, that's much simpler to reason about, and implies the smoke is
coming from xfs_buffered_write_iomap_begin() or
xfs_bmapi_reserve_delalloc(). I suspect the former - it does a lot
of stuff with the ILOCK_EXCL held.....

.... including calling xfs_qm_dqattach_locked().

xfs_buffered_write_iomap_begin
  ILOCK_EXCL
  look up icur
  xfs_qm_dqattach_locked
    xfs_qm_dqattach_one
      xfs_qm_dqget_inode
        dquot cache miss
        xfs_iunlock(ip, XFS_ILOCK_EXCL);
        error = xfs_qm_dqread(mp, id, type, can_alloc, &dqp);
        xfs_ilock(ip, XFS_ILOCK_EXCL);
  ....
  xfs_bmapi_reserve_delalloc(icur)

Yup, that's what is letting the magic smoke out -
xfs_qm_dqattach_locked() can cycle the ILOCK. If that happens, we
can pass a stale icur to xfs_bmapi_reserve_delalloc() and it all
goes downhill from there.
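The failure mode above is a classic stale-cursor bug: a position into an ordered structure is cached, the protecting lock is dropped and retaken, the structure shifts, and the cached position is now one slot off. A minimal userspace model (a sorted array standing in for the iext tree, offsets borrowed from the trace above, all function names invented) shows the mechanism:

```c
#include <assert.h>

/* Toy model of the stale icur bug: the iext tree is a sorted array of
 * extent start offsets and the "cursor" is an index into it.  Cycling
 * the lock lets a racer insert, shifting every later index by one. */

#define MAXEXT 8

struct fork_model {
    long start[MAXEXT];  /* sorted extent start offsets */
    int  nr;
};

/* insertion point for @off, i.e. what the iext lookup returns */
static int lookup_cursor(const struct fork_model *f, long off)
{
    int i = 0;

    while (i < f->nr && f->start[i] < off)
        i++;
    return i;
}

static void insert_at(struct fork_model *f, int cur, long off)
{
    for (int i = f->nr; i > cur; i--)
        f->start[i] = f->start[i - 1];
    f->start[cur] = off;
    f->nr++;
}

/* are the records still in order after an insert? */
static int fork_is_sorted(const struct fork_model *f)
{
    for (int i = 1; i < f->nr; i++)
        if (f->start[i - 1] >= f->start[i])
            return 0;
    return 1;
}
```

Inserting through a cursor that was looked up before the racing insert lands the new record out of order, just like the 0xe6-before-0xd8 leaf in the trace; re-doing the lookup after the lock cycle (or, as the fix does, never cycling the lock between lookup and insert) keeps the fork sorted.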

Back to Darrick now:

So.  Fix this by moving the dqattach_locked call up before we take the
ILOCK, like all the other callers in that file.

Fixes: a526c85c2236 ("xfs: move xfs_file_iomap_begin_delay around") # goes further back than this
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
---
v2: just do a regular dqattach, and tweak the commit message to make it
clearer whether it's Dave or me talking
---
 fs/xfs/xfs_iomap.c |    8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
index 1005f1e36545..68436370927d 100644
--- a/fs/xfs/xfs_iomap.c
+++ b/fs/xfs/xfs_iomap.c
@@ -978,6 +978,10 @@ xfs_buffered_write_iomap_begin(
 
 	ASSERT(!XFS_IS_REALTIME_INODE(ip));
 
+	error = xfs_qm_dqattach(ip);
+	if (error)
+		return error;
+
 	error = xfs_ilock_for_iomap(ip, flags, &lockmode);
 	if (error)
 		return error;
@@ -1081,10 +1085,6 @@ xfs_buffered_write_iomap_begin(
 			allocfork = XFS_COW_FORK;
 	}
 
-	error = xfs_qm_dqattach_locked(ip, false);
-	if (error)
-		goto out_unlock;
-
 	if (eof && offset + count > XFS_ISIZE(ip)) {
 		/*
 		 * Determine the initial size of the preallocation.

^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: [PATCH v2 4/3] xfs: attach dquots to inode before reading data/cow fork mappings
  2022-11-29 21:05   ` [PATCH v2 " Darrick J. Wong
@ 2022-11-29 21:38     ` Dave Chinner
  0 siblings, 0 replies; 18+ messages in thread
From: Dave Chinner @ 2022-11-29 21:38 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Tue, Nov 29, 2022 at 01:05:24PM -0800, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
> 
> I've been running near-continuous integration testing of online fsck,
> and I've noticed that once a day, one of the ARM VMs will fail the test
> with out of order records in the data fork.
> 
> xfs/804 races fsstress with online scrub (aka scan but do not change
> anything), so I think this might be a bug in the core xfs code.  This
> also only seems to trigger if one runs the test for more than ~6 minutes
> via TIME_FACTOR=13 or something.
> https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfstests-dev.git/tree/tests/xfs/804?h=djwong-wtf
> 
> I added a debugging patch to the kernel to check the data fork extents
> after taking the ILOCK, before dropping ILOCK, and before and after each
> bmapping operation.  So far I've narrowed it down to the delalloc code
> inserting a record in the wrong place in the iext tree:
.....
> 
> So.  Fix this by moving the dqattach_locked call up before we take the
> ILOCK, like all the other callers in that file.
> 
> Fixes: a526c85c2236 ("xfs: move xfs_file_iomap_begin_delay around") # goes further back than this
> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> Reviewed-by: Dave Chinner <dchinner@redhat.com>
> ---
> v2: just do a regular dqattach, and tweak the commit message to make it
> clearer if it's dave or me talking

All looks good, thanks for doing the updates :)

-Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2022-11-29 21:38 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-11-24 16:59 [PATCHSET 0/3] xfs: fixes for 6.2 Darrick J. Wong
2022-11-24 16:59 ` [PATCH 1/3] xfs: invalidate block device page cache during unmount Darrick J. Wong
2022-11-29  2:36   ` Gao Xiang
2022-11-29  5:23   ` Dave Chinner
2022-11-29  5:59     ` Darrick J. Wong
2022-11-24 16:59 ` [PATCH 2/3] xfs: use memcpy, not strncpy, to format the attr prefix during listxattr Darrick J. Wong
2022-11-29  2:37   ` Gao Xiang
2022-11-29  5:26   ` Dave Chinner
2022-11-24 16:59 ` [PATCH 3/3] xfs: shut up -Wuninitialized in xfsaild_push Darrick J. Wong
2022-11-29  3:00   ` Gao Xiang
2022-11-29  5:36   ` Dave Chinner
2022-11-27 18:36 ` [PATCH 4/3] xfs: attach dquots to inode before reading data/cow fork mappings Darrick J. Wong
2022-11-29  6:31   ` Dave Chinner
2022-11-29  6:50     ` Darrick J. Wong
2022-11-29  8:04       ` Dave Chinner
2022-11-29 21:03         ` Darrick J. Wong
2022-11-29 21:05   ` [PATCH v2 " Darrick J. Wong
2022-11-29 21:38     ` Dave Chinner

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox