[PATCHSET v2 0/4] xfs: bug fixes for 6.4-rc1

linux-xfs.vger.kernel.org archive mirror

* [PATCHSET v2 0/4] xfs: bug fixes for 6.4-rc1
@ 2023-05-01 18:26 Darrick J. Wong
  2023-05-01 18:26 ` [PATCH 1/4] xfs: don't unconditionally null args->pag in xfs_bmap_btalloc_at_eof Darrick J. Wong
                   ` (4 more replies)
  0 siblings, 5 replies; 9+ messages in thread
From: Darrick J. Wong @ 2023-05-01 18:26 UTC (permalink / raw)
  To: david, djwong; +Cc: Dave Chinner, linux-xfs

Hi all,

Here are some assorted bug fixes for 6.4:

 * A regression fix for the allocator refactoring that we did in 6.3.
 * Fix a bug that occurs when formatting an internal log with a stripe
   alignment such that there's free space before the start of the log
   but not after.
 * Make scrub actually take the MMAPLOCK (to lock out page faults) when
   scrubbing the COW fork.
 * If we call FUNSHARE on a hole in the data fork, don't create a
   delalloc reservation in the cow fork for that hole.

v2: fix some comments that fell out of sync with the code

If you're going to start using this mess, you probably ought to just
pull from my git trees, which are linked below.

This is an extraordinary way to destroy everything.  Enjoy!
Comments and questions are, as always, welcome.

--D

kernel git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfs-linux.git/log/?h=kernel-fixes-6.4
---
 fs/xfs/libxfs/xfs_ag.c   |   19 +++++++++----------
 fs/xfs/libxfs/xfs_bmap.c |    5 +++--
 fs/xfs/scrub/bmap.c      |    4 ++--
 fs/xfs/xfs_iomap.c       |    5 +++--
 4 files changed, 17 insertions(+), 16 deletions(-)


* [PATCH 1/4] xfs: don't unconditionally null args->pag in xfs_bmap_btalloc_at_eof
  2023-05-01 18:26 [PATCHSET v2 0/4] xfs: bug fixes for 6.4-rc1 Darrick J. Wong
@ 2023-05-01 18:26 ` Darrick J. Wong
  2023-05-01 18:27 ` [PATCH 2/4] xfs: set bnobt/cntbt numrecs correctly when formatting new AGs Darrick J. Wong
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 9+ messages in thread
From: Darrick J. Wong @ 2023-05-01 18:26 UTC (permalink / raw)
  To: david, djwong; +Cc: Dave Chinner, linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

xfs/170 on a filesystem with su=128k,sw=4 produces this splat:

BUG: kernel NULL pointer dereference, address: 0000000000000010
#PF: supervisor write access in kernel mode
#PF: error_code(0x0002) - not-present page
PGD 0 P4D 0
Oops: 0002 [#1] PREEMPT SMP
CPU: 1 PID: 4022907 Comm: dd Tainted: G        W          6.3.0-xfsx #2 6ebeeffbe9577d32
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ?-20171121_152543-x86-ol7-bu
RIP: 0010:xfs_perag_rele+0x10/0x70 [xfs]
RSP: 0018:ffffc90001e43858 EFLAGS: 00010217
RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000100
RDX: ffffffffa054e717 RSI: 0000000000000005 RDI: 0000000000000000
RBP: ffff888194eea000 R08: 0000000000000000 R09: 0000000000000037
R10: ffff888100ac1cb0 R11: 0000000000000018 R12: 0000000000000000
R13: ffffc90001e43a38 R14: ffff888194eea000 R15: ffff888194eea000
FS:  00007f93d1a0e740(0000) GS:ffff88843fc80000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000010 CR3: 000000018a34f000 CR4: 00000000003506e0
Call Trace:
 <TASK>
 xfs_bmap_btalloc+0x1a7/0x5d0 [xfs f85291d6841cbb3dc740083f1f331c0327394518]
 xfs_bmapi_allocate+0xee/0x470 [xfs f85291d6841cbb3dc740083f1f331c0327394518]
 xfs_bmapi_write+0x539/0x9e0 [xfs f85291d6841cbb3dc740083f1f331c0327394518]
 xfs_iomap_write_direct+0x1bb/0x2b0 [xfs f85291d6841cbb3dc740083f1f331c0327394518]
 xfs_direct_write_iomap_begin+0x51c/0x710 [xfs f85291d6841cbb3dc740083f1f331c0327394518]
 iomap_iter+0x132/0x2f0
 __iomap_dio_rw+0x2f8/0x840
 iomap_dio_rw+0xe/0x30
 xfs_file_dio_write_aligned+0xad/0x180 [xfs f85291d6841cbb3dc740083f1f331c0327394518]
 xfs_file_write_iter+0xfb/0x190 [xfs f85291d6841cbb3dc740083f1f331c0327394518]
 vfs_write+0x2eb/0x410
 ksys_write+0x65/0xe0
 do_syscall_64+0x2b/0x80

This crash occurs under the "out_low_space" label.  We grabbed a perag
reference, passed it via args->pag into xfs_bmap_btalloc_at_eof, and
afterwards args->pag is NULL.  Fix the second function not to clobber
args->pag if the caller had passed one in.
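The ownership rule described above can be sketched in isolation. This is a minimal model, not the kernel code: `struct pag`, `struct alloc_args`, `pag_get`, `pag_put` and `alloc_at_eof` are hypothetical stand-ins, and the allocation attempt itself is elided. The only point it illustrates is that the callee clears `args->pag` only when it took the reference itself.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a refcounted per-AG structure. */
struct pag {
	int refcount;
};

/* Hypothetical stand-in for the allocation arguments. */
struct alloc_args {
	struct pag *pag;
};

static void pag_get(struct pag *p) { p->refcount++; }
static void pag_put(struct pag *p) { p->refcount--; }

/*
 * Borrow the caller's reference if one was passed in; otherwise take a
 * temporary reference on 'fallback' and release it before returning.
 */
static void alloc_at_eof(struct alloc_args *args, struct pag *fallback)
{
	int caller_pag = (args->pag != NULL);

	if (!caller_pag) {
		args->pag = fallback;
		pag_get(args->pag);
	}

	/* ... the exact-block allocation attempt would happen here ... */

	if (!caller_pag) {
		pag_put(args->pag);
		args->pag = NULL;	/* only clear what we set ourselves */
	}
}
```

With this shape, a caller that passed in its own reference still finds `args->pag` intact afterwards and can release it on its own error path.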

Fixes: 85843327094f ("xfs: factor xfs_bmap_btalloc()")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/libxfs/xfs_bmap.c |    5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)


diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
index b512de0540d5..cd8870a16fd1 100644
--- a/fs/xfs/libxfs/xfs_bmap.c
+++ b/fs/xfs/libxfs/xfs_bmap.c
@@ -3494,8 +3494,10 @@ xfs_bmap_btalloc_at_eof(
 		if (!caller_pag)
 			args->pag = xfs_perag_get(mp, XFS_FSB_TO_AGNO(mp, ap->blkno));
 		error = xfs_alloc_vextent_exact_bno(args, ap->blkno);
-		if (!caller_pag)
+		if (!caller_pag) {
 			xfs_perag_put(args->pag);
+			args->pag = NULL;
+		}
 		if (error)
 			return error;
 
@@ -3505,7 +3507,6 @@ xfs_bmap_btalloc_at_eof(
 		 * Exact allocation failed. Reset to try an aligned allocation
 		 * according to the original allocation specification.
 		 */
-		args->pag = NULL;
 		args->alignment = stripe_align;
 		args->minlen = nextminlen;
 		args->minalignslop = 0;



* [PATCH 2/4] xfs: set bnobt/cntbt numrecs correctly when formatting new AGs
  2023-05-01 18:26 [PATCHSET v2 0/4] xfs: bug fixes for 6.4-rc1 Darrick J. Wong
  2023-05-01 18:26 ` [PATCH 1/4] xfs: don't unconditionally null args->pag in xfs_bmap_btalloc_at_eof Darrick J. Wong
@ 2023-05-01 18:27 ` Darrick J. Wong
  2023-05-01 23:05   ` Dave Chinner
  2023-05-01 18:27 ` [PATCH 3/4] xfs: flush dirty data and drain directios before scrubbing cow fork Darrick J. Wong
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 9+ messages in thread
From: Darrick J. Wong @ 2023-05-01 18:27 UTC (permalink / raw)
  To: david, djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Through generic/300, I discovered that mkfs.xfs creates corrupt
filesystems when given these parameters:

# mkfs.xfs -d size=512M /dev/sda -f -d su=128k,sw=4 --unsupported
Filesystems formatted with --unsupported are not supported!!
meta-data=/dev/sda               isize=512    agcount=8, agsize=16352 blks
         =                       sectsz=512   attr=2, projid32bit=1
         =                       crc=1        finobt=1, sparse=1, rmapbt=1
         =                       reflink=1    bigtime=1 inobtcount=1 nrext64=1
data     =                       bsize=4096   blocks=130816, imaxpct=25
         =                       sunit=32     swidth=128 blks
naming   =version 2              bsize=4096   ascii-ci=0, ftype=1
log      =internal log           bsize=4096   blocks=8192, version=2
         =                       sectsz=512   sunit=32 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0
         =                       rgcount=0    rgsize=0 blks
Discarding blocks...Done.
# xfs_repair -n /dev/sda
Phase 1 - find and verify superblock...
        - reporting progress in intervals of 15 minutes
Phase 2 - using internal log
        - zero log...
        - 16:30:50: zeroing log - 16320 of 16320 blocks done
        - scan filesystem freespace and inode maps...
agf_freeblks 25, counted 0 in ag 4
sb_fdblocks 8823, counted 8798

The root cause of this problem is the numrecs handling in
xfs_freesp_init_recs, which is used to initialize a new AG.  Prior to
calling the function, we set up the new bnobt block with numrecs == 1
and rely on _freesp_init_recs to format that new record.  If the last
record created has a blockcount of zero, then it sets numrecs = 0.

That last bit isn't correct if the AG contains the log, the start of the
log is not immediately after the initial blocks due to stripe alignment,
and the end of the log is perfectly aligned with the end of the AG.  For
this case, we actually formatted a single bnobt record to handle the
free space before the start of the (stripe aligned) log, and incremented
arec to try to format a second record.  That second record turned out to
be unnecessary, so what we really want is to leave numrecs at 1.

The numrecs handling itself is overly complicated because a different
function sets numrecs == 1.  Change the bnobt creation code to start
with numrecs set to zero and only increment it after successfully
formatting a free space extent into the btree block.
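To make the difference concrete, here is a small userspace model of the two counting schemes. It is a sketch under assumptions: `struct freesp_rec`, `numrecs_old` and `numrecs_new` are hypothetical names, the AG is assumed to contain the log, and the block numbers in the usage note are illustrative rather than taken from the mkfs output above.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, simplified free space record (units: AG blocks). */
struct freesp_rec {
	uint32_t start;
	uint32_t count;
};

/* Old scheme: caller presets numrecs to 1; zero it if the last record is empty. */
static unsigned int numrecs_old(uint32_t agsize, uint32_t prealloc,
				uint32_t log_start, uint32_t log_len)
{
	struct freesp_rec recs[2] = { { prealloc, 0 }, { 0, 0 } };
	unsigned int numrecs = 1;
	int i = 0;

	if (log_start != prealloc) {
		/* Free space between the preallocated blocks and the log. */
		recs[0].count = log_start - prealloc;
		recs[1].start = recs[0].start + recs[0].count;
		i = 1;
		numrecs++;
	}
	recs[i].start += log_len;		/* skip over the log */
	recs[i].count = agsize - recs[i].start;
	if (!recs[i].count)
		numrecs = 0;	/* also throws away a valid pre-log record */
	return numrecs;
}

/* New scheme: start at zero and count each record only once it is formatted. */
static unsigned int numrecs_new(uint32_t agsize, uint32_t prealloc,
				uint32_t log_start, uint32_t log_len)
{
	struct freesp_rec recs[2] = { { prealloc, 0 }, { 0, 0 } };
	unsigned int numrecs = 0;
	int i = 0;

	if (log_start != prealloc) {
		recs[0].count = log_start - prealloc;
		numrecs++;
		recs[1].start = recs[0].start + recs[0].count;
		i = 1;
	}
	recs[i].start += log_len;
	recs[i].count = agsize - recs[i].start;
	if (recs[i].count)
		numrecs++;
	return numrecs;
}
```

For example, with a 1000-block AG, 10 preallocated blocks, and a 968-block log starting at block 32 (so the log ends exactly at the end of the AG), `numrecs_old` returns 0 while `numrecs_new` returns 1, which matches the single record covering the 22 free blocks before the log.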

Fixes: f327a00745ff ("xfs: account for log space when formatting new AGs")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/libxfs/xfs_ag.c |   19 +++++++++----------
 1 file changed, 9 insertions(+), 10 deletions(-)


diff --git a/fs/xfs/libxfs/xfs_ag.c b/fs/xfs/libxfs/xfs_ag.c
index 1b078bbbf225..9b373a0c7aaf 100644
--- a/fs/xfs/libxfs/xfs_ag.c
+++ b/fs/xfs/libxfs/xfs_ag.c
@@ -495,10 +495,12 @@ xfs_freesp_init_recs(
 		ASSERT(start >= mp->m_ag_prealloc_blocks);
 		if (start != mp->m_ag_prealloc_blocks) {
 			/*
-			 * Modify first record to pad stripe align of log
+			 * Modify first record to pad stripe align of log and
+			 * bump the record count.
 			 */
 			arec->ar_blockcount = cpu_to_be32(start -
 						mp->m_ag_prealloc_blocks);
+			be16_add_cpu(&block->bb_numrecs, 1);
 			nrec = arec + 1;
 
 			/*
@@ -509,7 +511,6 @@ xfs_freesp_init_recs(
 					be32_to_cpu(arec->ar_startblock) +
 					be32_to_cpu(arec->ar_blockcount));
 			arec = nrec;
-			be16_add_cpu(&block->bb_numrecs, 1);
 		}
 		/*
 		 * Change record start to after the internal log
@@ -518,15 +519,13 @@ xfs_freesp_init_recs(
 	}
 
 	/*
-	 * Calculate the record block count and check for the case where
-	 * the log might have consumed all available space in the AG. If
-	 * so, reset the record count to 0 to avoid exposure of an invalid
-	 * record start block.
+	 * Calculate the block count of this record; if it is nonzero,
+	 * increment the record count.
 	 */
 	arec->ar_blockcount = cpu_to_be32(id->agsize -
 					  be32_to_cpu(arec->ar_startblock));
-	if (!arec->ar_blockcount)
-		block->bb_numrecs = 0;
+	if (arec->ar_blockcount)
+		be16_add_cpu(&block->bb_numrecs, 1);
 }
 
 /*
@@ -538,7 +537,7 @@ xfs_bnoroot_init(
 	struct xfs_buf		*bp,
 	struct aghdr_init_data	*id)
 {
-	xfs_btree_init_block(mp, bp, XFS_BTNUM_BNO, 0, 1, id->agno);
+	xfs_btree_init_block(mp, bp, XFS_BTNUM_BNO, 0, 0, id->agno);
 	xfs_freesp_init_recs(mp, bp, id);
 }
 
@@ -548,7 +547,7 @@ xfs_cntroot_init(
 	struct xfs_buf		*bp,
 	struct aghdr_init_data	*id)
 {
-	xfs_btree_init_block(mp, bp, XFS_BTNUM_CNT, 0, 1, id->agno);
+	xfs_btree_init_block(mp, bp, XFS_BTNUM_CNT, 0, 0, id->agno);
 	xfs_freesp_init_recs(mp, bp, id);
 }
 



* [PATCH 3/4] xfs: flush dirty data and drain directios before scrubbing cow fork
  2023-05-01 18:26 [PATCHSET v2 0/4] xfs: bug fixes for 6.4-rc1 Darrick J. Wong
  2023-05-01 18:26 ` [PATCH 1/4] xfs: don't unconditionally null args->pag in xfs_bmap_btalloc_at_eof Darrick J. Wong
  2023-05-01 18:27 ` [PATCH 2/4] xfs: set bnobt/cntbt numrecs correctly when formatting new AGs Darrick J. Wong
@ 2023-05-01 18:27 ` Darrick J. Wong
  2023-05-01 18:27 ` [PATCH 4/4] xfs: don't allocate into the data fork for an unshare request Darrick J. Wong
  2023-05-01 21:24 ` [PATCH 5/4] xfs: fix negative array access in xfs_getbmap Darrick J. Wong
  4 siblings, 0 replies; 9+ messages in thread
From: Darrick J. Wong @ 2023-05-01 18:27 UTC (permalink / raw)
  To: david, djwong; +Cc: Dave Chinner, linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

When we're scrubbing the COW fork, we need to take MMAPLOCK_EXCL to
prevent page_mkwrite from modifying any inode state.  The ILOCK should
suffice to avoid confusing online fsck, but let's take the same locks
that we do everywhere else.
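The change in the condition can be modelled as a pair of predicates. This is only a sketch: the enum and function names are hypothetical stand-ins, and it assumes the bmap scrubber handles exactly three fork types (data, attr, and cow).

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the three bmap scrub types. */
enum scrub_type {
	SCRUB_BMBTD,	/* data fork */
	SCRUB_BMBTA,	/* attr fork */
	SCRUB_BMBTC,	/* cow fork */
};

/* Before: only data fork scrubs of regular files flush and take MMAPLOCK. */
static bool needs_flush_old(bool is_regular, enum scrub_type type)
{
	return is_regular && type == SCRUB_BMBTD;
}

/* After: everything except attr fork scrubs does, so the cow fork is covered. */
static bool needs_flush_new(bool is_regular, enum scrub_type type)
{
	return is_regular && type != SCRUB_BMBTA;
}
```

The two predicates differ only for the cow fork of a regular file.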

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/scrub/bmap.c |    4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)


diff --git a/fs/xfs/scrub/bmap.c b/fs/xfs/scrub/bmap.c
index 87ab9f95a487..69bc89d0fc68 100644
--- a/fs/xfs/scrub/bmap.c
+++ b/fs/xfs/scrub/bmap.c
@@ -42,12 +42,12 @@ xchk_setup_inode_bmap(
 	xfs_ilock(sc->ip, XFS_IOLOCK_EXCL);
 
 	/*
-	 * We don't want any ephemeral data fork updates sitting around
+	 * We don't want any ephemeral data/cow fork updates sitting around
 	 * while we inspect block mappings, so wait for directio to finish
 	 * and flush dirty data if we have delalloc reservations.
 	 */
 	if (S_ISREG(VFS_I(sc->ip)->i_mode) &&
-	    sc->sm->sm_type == XFS_SCRUB_TYPE_BMBTD) {
+	    sc->sm->sm_type != XFS_SCRUB_TYPE_BMBTA) {
 		struct address_space	*mapping = VFS_I(sc->ip)->i_mapping;
 
 		sc->ilock_flags |= XFS_MMAPLOCK_EXCL;



* [PATCH 4/4] xfs: don't allocate into the data fork for an unshare request
  2023-05-01 18:26 [PATCHSET v2 0/4] xfs: bug fixes for 6.4-rc1 Darrick J. Wong
                   ` (2 preceding siblings ...)
  2023-05-01 18:27 ` [PATCH 3/4] xfs: flush dirty data and drain directios before scrubbing cow fork Darrick J. Wong
@ 2023-05-01 18:27 ` Darrick J. Wong
  2023-05-01 21:24 ` [PATCH 5/4] xfs: fix negative array access in xfs_getbmap Darrick J. Wong
  4 siblings, 0 replies; 9+ messages in thread
From: Darrick J. Wong @ 2023-05-01 18:27 UTC (permalink / raw)
  To: david, djwong; +Cc: Dave Chinner, linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

For an unshare request, we only have to take action if the data fork has
a shared mapping.  We don't care if someone else set up a cow operation.
If we find nothing in the data fork, return a hole to avoid allocating
space.

Note that fallocate will replace the delalloc reservation with an
unwritten extent anyway, so this has no user-visible effects outside of
avoiding unnecessary updates.
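The hole check can be sketched as a standalone predicate. The flag values and the `report_hole` helper below are hypothetical and exist only for illustration; the real flags live in the iomap headers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical flag values; not the kernel's definitions. */
#define DEMO_IOMAP_ZERO		(1u << 0)
#define DEMO_IOMAP_UNSHARE	(1u << 1)
#define DEMO_IOMAP_WRITE	(1u << 2)

/*
 * Return true if the request should be answered with a hole: it is a
 * zeroing or unshare request and the next data fork mapping starts
 * beyond the requested offset.
 */
static bool report_hole(unsigned int flags, uint64_t map_startoff,
			uint64_t offset_fsb)
{
	return (flags & (DEMO_IOMAP_UNSHARE | DEMO_IOMAP_ZERO)) &&
	       map_startoff > offset_fsb;
}
```

An ordinary write over the same range still falls through to the allocation path, because neither flag is set.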

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_iomap.c |    5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)


diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
index 285885c308bd..18c8f168b153 100644
--- a/fs/xfs/xfs_iomap.c
+++ b/fs/xfs/xfs_iomap.c
@@ -1006,8 +1006,9 @@ xfs_buffered_write_iomap_begin(
 	if (eof)
 		imap.br_startoff = end_fsb; /* fake hole until the end */
 
-	/* We never need to allocate blocks for zeroing a hole. */
-	if ((flags & IOMAP_ZERO) && imap.br_startoff > offset_fsb) {
+	/* We never need to allocate blocks for zeroing or unsharing a hole. */
+	if ((flags & (IOMAP_UNSHARE | IOMAP_ZERO)) &&
+	    imap.br_startoff > offset_fsb) {
 		xfs_hole_to_iomap(ip, iomap, offset_fsb, imap.br_startoff);
 		goto out_unlock;
 	}



* [PATCH 5/4] xfs: fix negative array access in xfs_getbmap
  2023-05-01 18:26 [PATCHSET v2 0/4] xfs: bug fixes for 6.4-rc1 Darrick J. Wong
                   ` (3 preceding siblings ...)
  2023-05-01 18:27 ` [PATCH 4/4] xfs: don't allocate into the data fork for an unshare request Darrick J. Wong
@ 2023-05-01 21:24 ` Darrick J. Wong
  2023-05-01 23:09   ` Dave Chinner
  2023-05-04 12:43   ` yebin (H)
  4 siblings, 2 replies; 9+ messages in thread
From: Darrick J. Wong @ 2023-05-01 21:24 UTC (permalink / raw)
  To: david; +Cc: Dave Chinner, linux-xfs, yebin10

From: Darrick J. Wong <djwong@kernel.org>

In commit 8ee81ed581ff, Ye Bin complained about an ASSERT in the bmapx
code that trips if we encounter a delalloc extent after flushing the
pagecache to disk.  The ioctl code does not hold MMAPLOCK so it's
entirely possible that a racing write page fault can create a delalloc
extent after the file has been flushed.  The proposed solution was to
replace the assertion with an early return that avoids filling out the
bmap recordset with a delalloc entry if the caller didn't ask for it.

At the time, I recall thinking that the forward logic sounded ok, but
felt hesitant because I suspected that changing this code would cause
something /else/ to burst loose due to some other subtlety.

syzbot of course found that subtlety.  If all the extent mappings found
after the flush are delalloc mappings, we'll reach the end of the data
fork without ever incrementing bmv->bmv_entries.  This is new, since
before we'd have emitted the delalloc mappings even though the caller
didn't ask for them.  Once we reach the end, we'll try to set
BMV_OF_LAST on the -1st entry (because bmv_entries is zero) and go
corrupt something else in memory.  Yay.
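The failure mode and the guard can be shown with a tiny model. The names here (`struct demo_rec`, `DEMO_OF_LAST`, `mark_last`) are hypothetical; the sketch only demonstrates that the "last entry" flag must not be set when no entries were emitted.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_OF_LAST	0x1u	/* hypothetical flag value */

/* Hypothetical, simplified output record. */
struct demo_rec {
	unsigned int oflags;
};

/*
 * Flag the final emitted record as the last one. With zero entries there
 * is no final record, and indexing entries - 1 would land outside the
 * array, so do nothing in that case.
 */
static void mark_last(struct demo_rec *out, size_t entries)
{
	if (entries > 0)
		out[entries - 1].oflags |= DEMO_OF_LAST;
}
```

Calling `mark_last(out, 0)` leaves the array untouched, whereas the unguarded form would write to the element before `out[0]`.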

I really dislike all these stupid patches that fiddle around with debug
code and break things that otherwise worked well enough.  Nobody was
complaining that calling XFS_IOC_BMAPX without BMV_IF_DELALLOC would
return BMV_OF_DELALLOC records, and now we've gone from "weird behavior
that nobody cared about" to "bad behavior that must be addressed
immediately".

Maybe I'll just ignore anything from Huawei from now on for my own sake.

Reported-by: syzbot+c103d3808a0de5faaf80@syzkaller.appspotmail.com
Link: https://lore.kernel.org/linux-xfs/20230412024907.GP360889@frogsfrogsfrogs/
Fixes: 8ee81ed581ff ("xfs: fix BUG_ON in xfs_getbmap()")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/xfs_bmap_util.c |    4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
index f032d3a4b727..fbb675563208 100644
--- a/fs/xfs/xfs_bmap_util.c
+++ b/fs/xfs/xfs_bmap_util.c
@@ -558,7 +558,9 @@ xfs_getbmap(
 		if (!xfs_iext_next_extent(ifp, &icur, &got)) {
 			xfs_fileoff_t	end = XFS_B_TO_FSB(mp, XFS_ISIZE(ip));

-			out[bmv->bmv_entries - 1].bmv_oflags |= BMV_OF_LAST;
+			if (bmv->bmv_entries > 0)
+				out[bmv->bmv_entries - 1].bmv_oflags |=
+								BMV_OF_LAST;

 			if (whichfork != XFS_ATTR_FORK && bno < end &&
 			    !xfs_getbmap_full(bmv)) {


* Re: [PATCH 2/4] xfs: set bnobt/cntbt numrecs correctly when formatting new AGs
  2023-05-01 18:27 ` [PATCH 2/4] xfs: set bnobt/cntbt numrecs correctly when formatting new AGs Darrick J. Wong
@ 2023-05-01 23:05   ` Dave Chinner
  0 siblings, 0 replies; 9+ messages in thread
From: Dave Chinner @ 2023-05-01 23:05 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Mon, May 01, 2023 at 11:27:04AM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
> 
> Through generic/300, I discovered that mkfs.xfs creates corrupt
> filesystems when given these parameters:
> 
> # mkfs.xfs -d size=512M /dev/sda -f -d su=128k,sw=4 --unsupported
> Filesystems formatted with --unsupported are not supported!!
> meta-data=/dev/sda               isize=512    agcount=8, agsize=16352 blks
>          =                       sectsz=512   attr=2, projid32bit=1
>          =                       crc=1        finobt=1, sparse=1, rmapbt=1
>          =                       reflink=1    bigtime=1 inobtcount=1 nrext64=1
> data     =                       bsize=4096   blocks=130816, imaxpct=25
>          =                       sunit=32     swidth=128 blks
> naming   =version 2              bsize=4096   ascii-ci=0, ftype=1
> log      =internal log           bsize=4096   blocks=8192, version=2
>          =                       sectsz=512   sunit=32 blks, lazy-count=1
> realtime =none                   extsz=4096   blocks=0, rtextents=0
>          =                       rgcount=0    rgsize=0 blks
> Discarding blocks...Done.
> # xfs_repair -n /dev/sda
> Phase 1 - find and verify superblock...
>         - reporting progress in intervals of 15 minutes
> Phase 2 - using internal log
>         - zero log...
>         - 16:30:50: zeroing log - 16320 of 16320 blocks done
>         - scan filesystem freespace and inode maps...
> agf_freeblks 25, counted 0 in ag 4
> sb_fdblocks 8823, counted 8798
> 
> The root cause of this problem is the numrecs handling in
> xfs_freesp_init_recs, which is used to initialize a new AG.  Prior to
> calling the function, we set up the new bnobt block with numrecs == 1
> and rely on _freesp_init_recs to format that new record.  If the last
> record created has a blockcount of zero, then it sets numrecs = 0.
> 
> That last bit isn't correct if the AG contains the log, the start of the
> log is not immediately after the initial blocks due to stripe alignment,
> and the end of the log is perfectly aligned with the end of the AG.  For
> this case, we actually formatted a single bnobt record to handle the
> free space before the start of the (stripe aligned) log, and incremented
> arec to try to format a second record.  That second record turned out to
> be unnecessary, so what we really want is to leave numrecs at 1.
> 
> The numrecs handling itself is overly complicated because a different
> function sets numrecs == 1.  Change the bnobt creation code to start
> with numrecs set to zero and only increment it after successfully
> formatting a free space extent into the btree block.
> 
> Fixes: f327a00745ff ("xfs: account for log space when formatting new AGs")
> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> ---
>  fs/xfs/libxfs/xfs_ag.c |   19 +++++++++----------
>  1 file changed, 9 insertions(+), 10 deletions(-)

Looks fine.

Reviewed-by: Dave Chinner <dchinner@redhat.com>
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH 5/4] xfs: fix negative array access in xfs_getbmap
  2023-05-01 21:24 ` [PATCH 5/4] xfs: fix negative array access in xfs_getbmap Darrick J. Wong
@ 2023-05-01 23:09   ` Dave Chinner
  2023-05-04 12:43   ` yebin (H)
  1 sibling, 0 replies; 9+ messages in thread
From: Dave Chinner @ 2023-05-01 23:09 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: Dave Chinner, linux-xfs, yebin10

On Mon, May 01, 2023 at 02:24:34PM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
> 
> In commit 8ee81ed581ff, Ye Bin complained about an ASSERT in the bmapx
> code that trips if we encounter a delalloc extent after flushing the
> pagecache to disk.  The ioctl code does not hold MMAPLOCK so it's
> entirely possible that a racing write page fault can create a delalloc
> extent after the file has been flushed.  The proposed solution was to
> replace the assertion with an early return that avoids filling out the
> bmap recordset with a delalloc entry if the caller didn't ask for it.
> 
> At the time, I recall thinking that the forward logic sounded ok, but
> felt hesitant because I suspected that changing this code would cause
> something /else/ to burst loose due to some other subtlety.
> 
> syzbot of course found that subtlety.  If all the extent mappings found
> after the flush are delalloc mappings, we'll reach the end of the data
> fork without ever incrementing bmv->bmv_entries.  This is new, since
> before we'd have emitted the delalloc mappings even though the caller
> didn't ask for them.  Once we reach the end, we'll try to set
> BMV_OF_LAST on the -1st entry (because bmv_entries is zero) and go
> corrupt something else in memory.  Yay.
> 
> I really dislike all these stupid patches that fiddle around with debug
> code and break things that otherwise worked well enough.  Nobody was
> complaining that calling XFS_IOC_BMAPX without BMV_IF_DELALLOC would
> return BMV_OF_DELALLOC records, and now we've gone from "weird behavior
> that nobody cared about" to "bad behavior that must be addressed
> immediately".
> 
> Maybe I'll just ignore anything from Huawei from now on for my own sake.
> 
> Reported-by: syzbot+c103d3808a0de5faaf80@syzkaller.appspotmail.com
> Link: https://lore.kernel.org/linux-xfs/20230412024907.GP360889@frogsfrogsfrogs/
> Fixes: 8ee81ed581ff ("xfs: fix BUG_ON in xfs_getbmap()")
> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> ---
>  fs/xfs/xfs_bmap_util.c |    4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)

Ugh. Yet again we add weight to the approach of "if it ain't broke,
don't fix it" for maintaining code that has not changed for a long
time...

Reviewed-by: Dave Chinner <dchinner@redhat.com>

-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH 5/4] xfs: fix negative array access in xfs_getbmap
  2023-05-01 21:24 ` [PATCH 5/4] xfs: fix negative array access in xfs_getbmap Darrick J. Wong
  2023-05-01 23:09   ` Dave Chinner
@ 2023-05-04 12:43   ` yebin (H)
  1 sibling, 0 replies; 9+ messages in thread
From: yebin (H) @ 2023-05-04 12:43 UTC (permalink / raw)
  To: Darrick J. Wong, david; +Cc: Dave Chinner, linux-xfs



On 2023/5/2 5:24, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
>
> In commit 8ee81ed581ff, Ye Bin complained about an ASSERT in the bmapx
> code that trips if we encounter a delalloc extent after flushing the
> pagecache to disk.  The ioctl code does not hold MMAPLOCK so it's
> entirely possible that a racing write page fault can create a delalloc
> extent after the file has been flushed.  The proposed solution was to
> replace the assertion with an early return that avoids filling out the
> bmap recordset with a delalloc entry if the caller didn't ask for it.
>
> At the time, I recall thinking that the forward logic sounded ok, but
> felt hesitant because I suspected that changing this code would cause
> something /else/ to burst loose due to some other subtlety.
>
> syzbot of course found that subtlety.  If all the extent mappings found
> after the flush are delalloc mappings, we'll reach the end of the data
> fork without ever incrementing bmv->bmv_entries.  This is new, since
> before we'd have emitted the delalloc mappings even though the caller
> didn't ask for them.  Once we reach the end, we'll try to set
> BMV_OF_LAST on the -1st entry (because bmv_entries is zero) and go
> corrupt something else in memory.  Yay.
>
> I really dislike all these stupid patches that fiddle around with debug
> code and break things that otherwise worked well enough.  Nobody was
> complaining that calling XFS_IOC_BMAPX without BMV_IF_DELALLOC would
> return BMV_OF_DELALLOC records, and now we've gone from "weird behavior
> that nobody cared about" to "bad behavior that must be addressed
> immediately".
>
> Maybe I'll just ignore anything from Huawei from now on for my own sake.
I am very sorry for introducing a new issue and causing you inconvenience.
The issue fixed by commit 8ee81ed581ff was triggered by our syzkaller
testing, and my intention was to fix the issue, without any malice or
intent to offend.

I fully agree with you that we should be more cautious when modifying code
that was originally working well. I will do more code self-review and
testing before sending patches upstream.
> Reported-by: syzbot+c103d3808a0de5faaf80@syzkaller.appspotmail.com
> Link: https://lore.kernel.org/linux-xfs/20230412024907.GP360889@frogsfrogsfrogs/
> Fixes: 8ee81ed581ff ("xfs: fix BUG_ON in xfs_getbmap()")
> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> ---
>   fs/xfs/xfs_bmap_util.c |    4 +++-
>   1 file changed, 3 insertions(+), 1 deletion(-)
>
> diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
> index f032d3a4b727..fbb675563208 100644
> --- a/fs/xfs/xfs_bmap_util.c
> +++ b/fs/xfs/xfs_bmap_util.c
> @@ -558,7 +558,9 @@ xfs_getbmap(
>   		if (!xfs_iext_next_extent(ifp, &icur, &got)) {
>   			xfs_fileoff_t	end = XFS_B_TO_FSB(mp, XFS_ISIZE(ip));
>   
> -			out[bmv->bmv_entries - 1].bmv_oflags |= BMV_OF_LAST;
> +			if (bmv->bmv_entries > 0)
> +				out[bmv->bmv_entries - 1].bmv_oflags |=
> +								BMV_OF_LAST;
>   
>   			if (whichfork != XFS_ATTR_FORK && bno < end &&
>   			    !xfs_getbmap_full(bmv)) {
> .
>

