* [PATCH 0/4] xfs: fixes for 6.9-rcX
@ 2024-04-02 21:38 Dave Chinner
2024-04-02 21:38 ` [PATCH 1/4] xfs: use kvmalloc for xattr buffers Dave Chinner
` (3 more replies)
0 siblings, 4 replies; 25+ messages in thread
From: Dave Chinner @ 2024-04-02 21:38 UTC (permalink / raw)
To: linux-xfs; +Cc: chandanbabu
Hi Chandan,
The following patches are fixes I consider necessary for the current
-rc cycle.
The first patch addresses a potential regression when allocating
temporary memory buffers for xattr block manipulations. It was found
during BS > PS development. While I don't think it can manifest on
existing kernels (because xattr block size is limited to page size),
it's better to fix it straight away so we don't leave a landmine
that will get stepped on sooner rather than later.
The second patch addresses a regression from the 6.6 kernels. It
removed ENOSPC detection from xfs_alloc_file_space() and replaced it
with an endless loop. This patch essentially reverts the commit
that caused the issue because it allows xfs_alloc_file_space() to
fall into an unkillable loop where it will endlessly retrying an
allocation that keeps failing. It will not making forwards progress,
and there is no way to break out of the loop. A reboot is needed to
fix the situation. This was found whilst triaging a bug that
corrupted free space accounting.
This was trying to address an issue with RT delalloc and the whacky
"delalloc does partial allocation but reports no allocation at all"
xfs_bmapi_write() semantics that was causing fallocate() to fail on
delalloc extents on RT devices. This was the wrong way to fix the
issue - we need to fix the whacky "partial progress was made but
tell the caller ENOSPC" behaviour so that callers can distinguish
between a hard ENOSPC error and a "keep allocating because we made
partial progress" condition.
The third addresses another problem found during triage of the same
free space accounting issue yesterday. A fchown() was run, leading
to a new dquot cluster being allocated. The transaction reservation
succeeded (because free space counter issues) and then the
allocation failed. This code does not handle allocation failure at
all - it has an ASSERT() that allocation succeeded - leading to it
trying to allocate a buffer over a disk address that was not
initialised and "access beyond EOFS" warnings from the buffer cache.
This patch handles the allocation failure in a manner that prevents
a shutdown from occurring and ensures -ENOSPC is returned to the
caller.
The last patch fixes the free space accounting issue that uncovered
the bugs fixes by the previous two patches. This can be triggered
from userspace, though it requires CAP_SYS_ADMIN and so requires
trust/privilege to be gained before it can be abused.
These have all passed through several fstests runs overnight without
introducing new regressions.
-Dave.
^ permalink raw reply [flat|nested] 25+ messages in thread
* [PATCH 1/4] xfs: use kvmalloc for xattr buffers
2024-04-02 21:38 [PATCH 0/4] xfs: fixes for 6.9-rcX Dave Chinner
@ 2024-04-02 21:38 ` Dave Chinner
2024-04-03 3:43 ` Darrick J. Wong
` (2 more replies)
2024-04-02 21:38 ` [PATCH 2/4] xfs: xfs_alloc_file_space() fails to detect ENOSPC Dave Chinner
` (2 subsequent siblings)
3 siblings, 3 replies; 25+ messages in thread
From: Dave Chinner @ 2024-04-02 21:38 UTC (permalink / raw)
To: linux-xfs; +Cc: chandanbabu
From: Dave Chinner <dchinner@redhat.com>
Pankaj Raghav reported that when filesystem block size is larger
than page size, the xattr code can use kmalloc() for high order
allocations. This triggers a useless warning in the allocator as it
is a __GFP_NOFAIL allocation here:
static inline
struct page *rmqueue(struct zone *preferred_zone,
struct zone *zone, unsigned int order,
gfp_t gfp_flags, unsigned int alloc_flags,
int migratetype)
{
struct page *page;
/*
* We most definitely don't want callers attempting to
* allocate greater than order-1 page units with __GFP_NOFAIL.
*/
>>>> WARN_ON_ONCE((gfp_flags & __GFP_NOFAIL) && (order > 1));
...
Fix this by changing all these call sites to use kvmalloc(), which
will strip the NOFAIL from the kmalloc attempt and if that fails
will do a __GFP_NOFAIL vmalloc().
This is not an issue that productions systems will see as
filesystems with block size > page size cannot be mounted by the
kernel; Pankaj is developing this functionality right now.
Reported-by: Pankaj Raghav <kernel@pankajraghav.com>
Fixes: f078d4ea8276 ("xfs: convert kmem_alloc() to kmalloc()")
Signed-off-be: Dave Chinner <dchinner@redhat.com>
---
fs/xfs/libxfs/xfs_attr_leaf.c | 15 ++++++---------
1 file changed, 6 insertions(+), 9 deletions(-)
diff --git a/fs/xfs/libxfs/xfs_attr_leaf.c b/fs/xfs/libxfs/xfs_attr_leaf.c
index ac904cc1a97b..969abc6efd70 100644
--- a/fs/xfs/libxfs/xfs_attr_leaf.c
+++ b/fs/xfs/libxfs/xfs_attr_leaf.c
@@ -1059,10 +1059,7 @@ xfs_attr3_leaf_to_shortform(
trace_xfs_attr_leaf_to_sf(args);
- tmpbuffer = kmalloc(args->geo->blksize, GFP_KERNEL | __GFP_NOFAIL);
- if (!tmpbuffer)
- return -ENOMEM;
-
+ tmpbuffer = kvmalloc(args->geo->blksize, GFP_KERNEL | __GFP_NOFAIL);
memcpy(tmpbuffer, bp->b_addr, args->geo->blksize);
leaf = (xfs_attr_leafblock_t *)tmpbuffer;
@@ -1125,7 +1122,7 @@ xfs_attr3_leaf_to_shortform(
error = 0;
out:
- kfree(tmpbuffer);
+ kvfree(tmpbuffer);
return error;
}
@@ -1533,7 +1530,7 @@ xfs_attr3_leaf_compact(
trace_xfs_attr_leaf_compact(args);
- tmpbuffer = kmalloc(args->geo->blksize, GFP_KERNEL | __GFP_NOFAIL);
+ tmpbuffer = kvmalloc(args->geo->blksize, GFP_KERNEL | __GFP_NOFAIL);
memcpy(tmpbuffer, bp->b_addr, args->geo->blksize);
memset(bp->b_addr, 0, args->geo->blksize);
leaf_src = (xfs_attr_leafblock_t *)tmpbuffer;
@@ -1571,7 +1568,7 @@ xfs_attr3_leaf_compact(
*/
xfs_trans_log_buf(trans, bp, 0, args->geo->blksize - 1);
- kfree(tmpbuffer);
+ kvfree(tmpbuffer);
}
/*
@@ -2250,7 +2247,7 @@ xfs_attr3_leaf_unbalance(
struct xfs_attr_leafblock *tmp_leaf;
struct xfs_attr3_icleaf_hdr tmphdr;
- tmp_leaf = kzalloc(state->args->geo->blksize,
+ tmp_leaf = kvzalloc(state->args->geo->blksize,
GFP_KERNEL | __GFP_NOFAIL);
/*
@@ -2291,7 +2288,7 @@ xfs_attr3_leaf_unbalance(
}
memcpy(save_leaf, tmp_leaf, state->args->geo->blksize);
savehdr = tmphdr; /* struct copy */
- kfree(tmp_leaf);
+ kvfree(tmp_leaf);
}
xfs_attr3_leaf_hdr_to_disk(state->args->geo, save_leaf, &savehdr);
--
2.43.0
^ permalink raw reply related [flat|nested] 25+ messages in thread
* [PATCH 2/4] xfs: xfs_alloc_file_space() fails to detect ENOSPC
2024-04-02 21:38 [PATCH 0/4] xfs: fixes for 6.9-rcX Dave Chinner
2024-04-02 21:38 ` [PATCH 1/4] xfs: use kvmalloc for xattr buffers Dave Chinner
@ 2024-04-02 21:38 ` Dave Chinner
2024-04-03 3:46 ` Darrick J. Wong
2024-04-03 4:40 ` Christoph Hellwig
2024-04-02 21:38 ` [PATCH 3/4] xfs: handle allocation failure in xfs_dquot_disk_alloc() Dave Chinner
2024-04-02 21:38 ` [PATCH 4/4] xfs: validate block count for XFS_IOC_SET_RESBLKS Dave Chinner
3 siblings, 2 replies; 25+ messages in thread
From: Dave Chinner @ 2024-04-02 21:38 UTC (permalink / raw)
To: linux-xfs; +Cc: chandanbabu
From: Dave Chinner <dchinner@redhat.com>
xfs_alloc_file_space ends up in an endless loop when
xfs_bmapi_write() returns nimaps == 0 at ENOSPC. The process is
unkillable, and so just runs around in a tight circle burning CPU
until the system is rebooted.
This is a regression introduced by commit 35dc55b9e80c ("xfs: handle
nimaps=0 from xfs_bmapi_write in xfs_alloc_file_space") which
specifically removed ENOSPC detection from xfs_alloc_file_space()
and replaces it with an endless loop. This attempts to fix an issue
converting a delalloc extent when not enough contiguous free space
is available to convert the entire delalloc extent.
Right now just revert the change as it only manifested on code under
development and isn't currently a real-world problem.
Fixes: 35dc55b9e80c ("xfs: handle nimaps=0 from xfs_bmapi_write in xfs_alloc_file_space")
Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
fs/xfs/xfs_bmap_util.c | 18 ++++++------------
1 file changed, 6 insertions(+), 12 deletions(-)
diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
index 19e11d1da660..262557735d4d 100644
--- a/fs/xfs/xfs_bmap_util.c
+++ b/fs/xfs/xfs_bmap_util.c
@@ -735,19 +735,13 @@ xfs_alloc_file_space(
if (error)
break;
- /*
- * If the allocator cannot find a single free extent large
- * enough to cover the start block of the requested range,
- * xfs_bmapi_write will return 0 but leave *nimaps set to 0.
- *
- * In that case we simply need to keep looping with the same
- * startoffset_fsb so that one of the following allocations
- * will eventually reach the requested range.
- */
- if (nimaps) {
- startoffset_fsb += imapp->br_blockcount;
- allocatesize_fsb -= imapp->br_blockcount;
+ if (nimaps == 0) {
+ error = ENOSPC;
+ break;
}
+
+ startoffset_fsb += imapp->br_blockcount;
+ allocatesize_fsb -= imapp->br_blockcount;
}
return error;
--
2.43.0
^ permalink raw reply related [flat|nested] 25+ messages in thread
* [PATCH 3/4] xfs: handle allocation failure in xfs_dquot_disk_alloc()
2024-04-02 21:38 [PATCH 0/4] xfs: fixes for 6.9-rcX Dave Chinner
2024-04-02 21:38 ` [PATCH 1/4] xfs: use kvmalloc for xattr buffers Dave Chinner
2024-04-02 21:38 ` [PATCH 2/4] xfs: xfs_alloc_file_space() fails to detect ENOSPC Dave Chinner
@ 2024-04-02 21:38 ` Dave Chinner
2024-04-03 3:48 ` Darrick J. Wong
` (2 more replies)
2024-04-02 21:38 ` [PATCH 4/4] xfs: validate block count for XFS_IOC_SET_RESBLKS Dave Chinner
3 siblings, 3 replies; 25+ messages in thread
From: Dave Chinner @ 2024-04-02 21:38 UTC (permalink / raw)
To: linux-xfs; +Cc: chandanbabu
From: Dave Chinner <dchinner@redhat.com>
If free space accounting is screwed up, then dquot allocation may go
ahead when there is no space available. xfs_dquot_disk_alloc() does
not handle allocation failure - it expects that it will not get
called when there isn't space available to allocate dquots.
Because fuzzers have been screwing up the free space accounting, we
are seeing failures in dquot allocation, and they aren't being
caught on produciton kernels. Debug kernels will assert fail in this
case, so turn that assert fail into more robust error handling to
avoid these issues in future.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
fs/xfs/xfs_dquot.c | 17 +++++++++++++++++
1 file changed, 17 insertions(+)
diff --git a/fs/xfs/xfs_dquot.c b/fs/xfs/xfs_dquot.c
index c98cb468c357..a2652e3d5164 100644
--- a/fs/xfs/xfs_dquot.c
+++ b/fs/xfs/xfs_dquot.c
@@ -356,6 +356,23 @@ xfs_dquot_disk_alloc(
if (error)
goto err_cancel;
+ if (nmaps == 0) {
+ /*
+ * Unexpected ENOSPC - the transaction reservation should have
+ * guaranteed that this allocation will succeed. We don't know
+ * why this happened, so just back out gracefully.
+ *
+ * We commit the transaction instead of cancelling it as it may
+ * be dirty due to extent count upgrade. This avoids a potential
+ * filesystem shutdown when this happens. We ignore any error
+ * from the transaction commit - we always return -ENOSPC to the
+ * caller here so we really don't care if the commit fails for
+ * some unknown reason...
+ */
+ xfs_trans_commit(tp);
+ return -ENOSPC;
+ }
+
ASSERT(map.br_blockcount == XFS_DQUOT_CLUSTER_SIZE_FSB);
ASSERT(nmaps == 1);
ASSERT((map.br_startblock != DELAYSTARTBLOCK) &&
--
2.43.0
^ permalink raw reply related [flat|nested] 25+ messages in thread
* [PATCH 4/4] xfs: validate block count for XFS_IOC_SET_RESBLKS
2024-04-02 21:38 [PATCH 0/4] xfs: fixes for 6.9-rcX Dave Chinner
` (2 preceding siblings ...)
2024-04-02 21:38 ` [PATCH 3/4] xfs: handle allocation failure in xfs_dquot_disk_alloc() Dave Chinner
@ 2024-04-02 21:38 ` Dave Chinner
2024-04-03 3:53 ` Darrick J. Wong
2024-04-03 4:43 ` Christoph Hellwig
3 siblings, 2 replies; 25+ messages in thread
From: Dave Chinner @ 2024-04-02 21:38 UTC (permalink / raw)
To: linux-xfs; +Cc: chandanbabu
From: Dave Chinner <dchinner@redhat.com>
Userspace can pass anything it wants in the reserved block count
and we simply pass that to the reservation code. If a value that is
far too large is passed, we can overflow the free space counter
and df reports things like:
Filesystem Size Used Avail Use% Mounted on
/dev/loop0 14M -27Z 27Z - /home/dave/bugs/file0
As reserving space requires CAP_SYS_ADMIN, this is not a problem
that will ever been seen in production systems. However, fuzzers are
running with CAP_SYS_ADMIN, and so they able to run filesystem code
with out-of-band free space accounting.
Stop the fuzzers ifrom being able to do this by validating that the
count is within the bounds of the filesystem size and reject
anything outside those bounds as invalid.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
fs/xfs/xfs_ioctl.c | 3 +++
1 file changed, 3 insertions(+)
diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c
index d0e2cec6210d..18a225d884dd 100644
--- a/fs/xfs/xfs_ioctl.c
+++ b/fs/xfs/xfs_ioctl.c
@@ -1892,6 +1892,9 @@ xfs_ioctl_getset_resblocks(
if (copy_from_user(&fsop, arg, sizeof(fsop)))
return -EFAULT;
+ if (fsop.resblks >= mp->m_sb.sb_dblocks)
+ return -EINVAL;
+
error = mnt_want_write_file(filp);
if (error)
return error;
--
2.43.0
^ permalink raw reply related [flat|nested] 25+ messages in thread
* Re: [PATCH 1/4] xfs: use kvmalloc for xattr buffers
2024-04-02 21:38 ` [PATCH 1/4] xfs: use kvmalloc for xattr buffers Dave Chinner
@ 2024-04-03 3:43 ` Darrick J. Wong
2024-04-03 4:39 ` Christoph Hellwig
2024-04-17 14:35 ` Pankaj Raghav (Samsung)
2 siblings, 0 replies; 25+ messages in thread
From: Darrick J. Wong @ 2024-04-03 3:43 UTC (permalink / raw)
To: Dave Chinner; +Cc: linux-xfs, chandanbabu
On Wed, Apr 03, 2024 at 08:38:16AM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
>
> Pankaj Raghav reported that when filesystem block size is larger
> than page size, the xattr code can use kmalloc() for high order
> allocations. This triggers a useless warning in the allocator as it
> is a __GFP_NOFAIL allocation here:
>
> static inline
> struct page *rmqueue(struct zone *preferred_zone,
> struct zone *zone, unsigned int order,
> gfp_t gfp_flags, unsigned int alloc_flags,
> int migratetype)
> {
> struct page *page;
>
> /*
> * We most definitely don't want callers attempting to
> * allocate greater than order-1 page units with __GFP_NOFAIL.
> */
> >>>> WARN_ON_ONCE((gfp_flags & __GFP_NOFAIL) && (order > 1));
> ...
>
> Fix this by changing all these call sites to use kvmalloc(), which
> will strip the NOFAIL from the kmalloc attempt and if that fails
> will do a __GFP_NOFAIL vmalloc().
>
> This is not an issue that productions systems will see as
> filesystems with block size > page size cannot be mounted by the
> kernel; Pankaj is developing this functionality right now.
>
> Reported-by: Pankaj Raghav <kernel@pankajraghav.com>
> Fixes: f078d4ea8276 ("xfs: convert kmem_alloc() to kmalloc()")
> Signed-off-be: Dave Chinner <dchinner@redhat.com>
Looks fine to me,
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
--D
> ---
> fs/xfs/libxfs/xfs_attr_leaf.c | 15 ++++++---------
> 1 file changed, 6 insertions(+), 9 deletions(-)
>
> diff --git a/fs/xfs/libxfs/xfs_attr_leaf.c b/fs/xfs/libxfs/xfs_attr_leaf.c
> index ac904cc1a97b..969abc6efd70 100644
> --- a/fs/xfs/libxfs/xfs_attr_leaf.c
> +++ b/fs/xfs/libxfs/xfs_attr_leaf.c
> @@ -1059,10 +1059,7 @@ xfs_attr3_leaf_to_shortform(
>
> trace_xfs_attr_leaf_to_sf(args);
>
> - tmpbuffer = kmalloc(args->geo->blksize, GFP_KERNEL | __GFP_NOFAIL);
> - if (!tmpbuffer)
> - return -ENOMEM;
> -
> + tmpbuffer = kvmalloc(args->geo->blksize, GFP_KERNEL | __GFP_NOFAIL);
> memcpy(tmpbuffer, bp->b_addr, args->geo->blksize);
>
> leaf = (xfs_attr_leafblock_t *)tmpbuffer;
> @@ -1125,7 +1122,7 @@ xfs_attr3_leaf_to_shortform(
> error = 0;
>
> out:
> - kfree(tmpbuffer);
> + kvfree(tmpbuffer);
> return error;
> }
>
> @@ -1533,7 +1530,7 @@ xfs_attr3_leaf_compact(
>
> trace_xfs_attr_leaf_compact(args);
>
> - tmpbuffer = kmalloc(args->geo->blksize, GFP_KERNEL | __GFP_NOFAIL);
> + tmpbuffer = kvmalloc(args->geo->blksize, GFP_KERNEL | __GFP_NOFAIL);
> memcpy(tmpbuffer, bp->b_addr, args->geo->blksize);
> memset(bp->b_addr, 0, args->geo->blksize);
> leaf_src = (xfs_attr_leafblock_t *)tmpbuffer;
> @@ -1571,7 +1568,7 @@ xfs_attr3_leaf_compact(
> */
> xfs_trans_log_buf(trans, bp, 0, args->geo->blksize - 1);
>
> - kfree(tmpbuffer);
> + kvfree(tmpbuffer);
> }
>
> /*
> @@ -2250,7 +2247,7 @@ xfs_attr3_leaf_unbalance(
> struct xfs_attr_leafblock *tmp_leaf;
> struct xfs_attr3_icleaf_hdr tmphdr;
>
> - tmp_leaf = kzalloc(state->args->geo->blksize,
> + tmp_leaf = kvzalloc(state->args->geo->blksize,
> GFP_KERNEL | __GFP_NOFAIL);
>
> /*
> @@ -2291,7 +2288,7 @@ xfs_attr3_leaf_unbalance(
> }
> memcpy(save_leaf, tmp_leaf, state->args->geo->blksize);
> savehdr = tmphdr; /* struct copy */
> - kfree(tmp_leaf);
> + kvfree(tmp_leaf);
> }
>
> xfs_attr3_leaf_hdr_to_disk(state->args->geo, save_leaf, &savehdr);
> --
> 2.43.0
>
>
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH 2/4] xfs: xfs_alloc_file_space() fails to detect ENOSPC
2024-04-02 21:38 ` [PATCH 2/4] xfs: xfs_alloc_file_space() fails to detect ENOSPC Dave Chinner
@ 2024-04-03 3:46 ` Darrick J. Wong
2024-04-03 4:40 ` Christoph Hellwig
1 sibling, 0 replies; 25+ messages in thread
From: Darrick J. Wong @ 2024-04-03 3:46 UTC (permalink / raw)
To: Dave Chinner; +Cc: linux-xfs, chandanbabu, Christoph Hellwig
[explicitly cc hch]
On Wed, Apr 03, 2024 at 08:38:17AM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
>
> xfs_alloc_file_space ends up in an endless loop when
> xfs_bmapi_write() returns nimaps == 0 at ENOSPC. The process is
> unkillable, and so just runs around in a tight circle burning CPU
> until the system is rebooted.
>
> This is a regression introduced by commit 35dc55b9e80c ("xfs: handle
> nimaps=0 from xfs_bmapi_write in xfs_alloc_file_space") which
> specifically removed ENOSPC detection from xfs_alloc_file_space()
> and replaces it with an endless loop. This attempts to fix an issue
> converting a delalloc extent when not enough contiguous free space
> is available to convert the entire delalloc extent.
>
> Right now just revert the change as it only manifested on code under
> development and isn't currently a real-world problem.
>
> Fixes: 35dc55b9e80c ("xfs: handle nimaps=0 from xfs_bmapi_write in xfs_alloc_file_space")
Shouldn't Christoph be cc'd if you're reverting his patch?
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
> fs/xfs/xfs_bmap_util.c | 18 ++++++------------
> 1 file changed, 6 insertions(+), 12 deletions(-)
>
> diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
> index 19e11d1da660..262557735d4d 100644
> --- a/fs/xfs/xfs_bmap_util.c
> +++ b/fs/xfs/xfs_bmap_util.c
> @@ -735,19 +735,13 @@ xfs_alloc_file_space(
> if (error)
> break;
>
> - /*
> - * If the allocator cannot find a single free extent large
> - * enough to cover the start block of the requested range,
> - * xfs_bmapi_write will return 0 but leave *nimaps set to 0.
> - *
> - * In that case we simply need to keep looping with the same
> - * startoffset_fsb so that one of the following allocations
> - * will eventually reach the requested range.
> - */
> - if (nimaps) {
> - startoffset_fsb += imapp->br_blockcount;
> - allocatesize_fsb -= imapp->br_blockcount;
> + if (nimaps == 0) {
> + error = ENOSPC;
-ENOSPC.
--D
> + break;
> }
> +
> + startoffset_fsb += imapp->br_blockcount;
> + allocatesize_fsb -= imapp->br_blockcount;
> }
>
> return error;
> --
> 2.43.0
>
>
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH 3/4] xfs: handle allocation failure in xfs_dquot_disk_alloc()
2024-04-02 21:38 ` [PATCH 3/4] xfs: handle allocation failure in xfs_dquot_disk_alloc() Dave Chinner
@ 2024-04-03 3:48 ` Darrick J. Wong
2024-04-03 4:41 ` Christoph Hellwig
2024-04-03 14:06 ` Christoph Hellwig
2 siblings, 0 replies; 25+ messages in thread
From: Darrick J. Wong @ 2024-04-03 3:48 UTC (permalink / raw)
To: Dave Chinner; +Cc: linux-xfs, chandanbabu
On Wed, Apr 03, 2024 at 08:38:18AM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
>
> If free space accounting is screwed up, then dquot allocation may go
> ahead when there is no space available. xfs_dquot_disk_alloc() does
> not handle allocation failure - it expects that it will not get
> called when there isn't space available to allocate dquots.
>
> Because fuzzers have been screwing up the free space accounting, we
> are seeing failures in dquot allocation, and they aren't being
> caught on produciton kernels. Debug kernels will assert fail in this
> case, so turn that assert fail into more robust error handling to
> avoid these issues in future.
>
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
Sounds fine to me! It'll be interesting to see what happens the next
time one of my VMs trips this.
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
--D
> ---
> fs/xfs/xfs_dquot.c | 17 +++++++++++++++++
> 1 file changed, 17 insertions(+)
>
> diff --git a/fs/xfs/xfs_dquot.c b/fs/xfs/xfs_dquot.c
> index c98cb468c357..a2652e3d5164 100644
> --- a/fs/xfs/xfs_dquot.c
> +++ b/fs/xfs/xfs_dquot.c
> @@ -356,6 +356,23 @@ xfs_dquot_disk_alloc(
> if (error)
> goto err_cancel;
>
> + if (nmaps == 0) {
> + /*
> + * Unexpected ENOSPC - the transaction reservation should have
> + * guaranteed that this allocation will succeed. We don't know
> + * why this happened, so just back out gracefully.
> + *
> + * We commit the transaction instead of cancelling it as it may
> + * be dirty due to extent count upgrade. This avoids a potential
> + * filesystem shutdown when this happens. We ignore any error
> + * from the transaction commit - we always return -ENOSPC to the
> + * caller here so we really don't care if the commit fails for
> + * some unknown reason...
> + */
> + xfs_trans_commit(tp);
> + return -ENOSPC;
> + }
> +
> ASSERT(map.br_blockcount == XFS_DQUOT_CLUSTER_SIZE_FSB);
> ASSERT(nmaps == 1);
> ASSERT((map.br_startblock != DELAYSTARTBLOCK) &&
> --
> 2.43.0
>
>
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH 4/4] xfs: validate block count for XFS_IOC_SET_RESBLKS
2024-04-02 21:38 ` [PATCH 4/4] xfs: validate block count for XFS_IOC_SET_RESBLKS Dave Chinner
@ 2024-04-03 3:53 ` Darrick J. Wong
2024-04-03 6:55 ` Dave Chinner
2024-04-03 4:43 ` Christoph Hellwig
1 sibling, 1 reply; 25+ messages in thread
From: Darrick J. Wong @ 2024-04-03 3:53 UTC (permalink / raw)
To: Dave Chinner; +Cc: linux-xfs, chandanbabu
On Wed, Apr 03, 2024 at 08:38:19AM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
>
> Userspace can pass anything it wants in the reserved block count
> and we simply pass that to the reservation code. If a value that is
> far too large is passed, we can overflow the free space counter
> and df reports things like:
>
> Filesystem Size Used Avail Use% Mounted on
> /dev/loop0 14M -27Z 27Z - /home/dave/bugs/file0
>
> As reserving space requires CAP_SYS_ADMIN, this is not a problem
> that will ever been seen in production systems. However, fuzzers are
> running with CAP_SYS_ADMIN, and so they able to run filesystem code
> with out-of-band free space accounting.
>
> Stop the fuzzers ifrom being able to do this by validating that the
> count is within the bounds of the filesystem size and reject
> anything outside those bounds as invalid.
>
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
> fs/xfs/xfs_ioctl.c | 3 +++
> 1 file changed, 3 insertions(+)
>
> diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c
> index d0e2cec6210d..18a225d884dd 100644
> --- a/fs/xfs/xfs_ioctl.c
> +++ b/fs/xfs/xfs_ioctl.c
> @@ -1892,6 +1892,9 @@ xfs_ioctl_getset_resblocks(
> if (copy_from_user(&fsop, arg, sizeof(fsop)))
> return -EFAULT;
>
> + if (fsop.resblks >= mp->m_sb.sb_dblocks)
> + return -EINVAL;
Why isn't xfs_reserve_blocks catching this? Is this due to the odd
behavior that a failed xfs_mod_fdblocks is undone and m_resblks simply
allowed to remain?
Also why wouldn't we limit m_resblks to something smaller, like 10% of
the fs or half an AG or something like that?
--D
> +
> error = mnt_want_write_file(filp);
> if (error)
> return error;
> --
> 2.43.0
>
>
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH 1/4] xfs: use kvmalloc for xattr buffers
2024-04-02 21:38 ` [PATCH 1/4] xfs: use kvmalloc for xattr buffers Dave Chinner
2024-04-03 3:43 ` Darrick J. Wong
@ 2024-04-03 4:39 ` Christoph Hellwig
2024-04-03 6:16 ` Dave Chinner
2024-04-17 14:35 ` Pankaj Raghav (Samsung)
2 siblings, 1 reply; 25+ messages in thread
From: Christoph Hellwig @ 2024-04-03 4:39 UTC (permalink / raw)
To: Dave Chinner; +Cc: linux-xfs, chandanbabu
On Wed, Apr 03, 2024 at 08:38:16AM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
>
> Pankaj Raghav reported that when filesystem block size is larger
> than page size, the xattr code can use kmalloc() for high order
> allocations. This triggers a useless warning in the allocator as it
> is a __GFP_NOFAIL allocation here:
Can we just get the warning fixed in the MM code?
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH 2/4] xfs: xfs_alloc_file_space() fails to detect ENOSPC
2024-04-02 21:38 ` [PATCH 2/4] xfs: xfs_alloc_file_space() fails to detect ENOSPC Dave Chinner
2024-04-03 3:46 ` Darrick J. Wong
@ 2024-04-03 4:40 ` Christoph Hellwig
2024-04-03 6:34 ` Dave Chinner
1 sibling, 1 reply; 25+ messages in thread
From: Christoph Hellwig @ 2024-04-03 4:40 UTC (permalink / raw)
To: Dave Chinner; +Cc: linux-xfs, chandanbabu
On Wed, Apr 03, 2024 at 08:38:17AM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
>
> xfs_alloc_file_space ends up in an endless loop when
> xfs_bmapi_write() returns nimaps == 0 at ENOSPC. The process is
> unkillable, and so just runs around in a tight circle burning CPU
> until the system is rebooted.
What is your reproducer? Let's just fix this for real.
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH 3/4] xfs: handle allocation failure in xfs_dquot_disk_alloc()
2024-04-02 21:38 ` [PATCH 3/4] xfs: handle allocation failure in xfs_dquot_disk_alloc() Dave Chinner
2024-04-03 3:48 ` Darrick J. Wong
@ 2024-04-03 4:41 ` Christoph Hellwig
2024-04-03 4:54 ` Darrick J. Wong
2024-04-03 14:06 ` Christoph Hellwig
2 siblings, 1 reply; 25+ messages in thread
From: Christoph Hellwig @ 2024-04-03 4:41 UTC (permalink / raw)
To: Dave Chinner; +Cc: linux-xfs, chandanbabu
> + if (nmaps == 0) {
> + /*
> + * Unexpected ENOSPC - the transaction reservation should have
> + * guaranteed that this allocation will succeed. We don't know
> + * why this happened, so just back out gracefully.
> + *
> + * We commit the transaction instead of cancelling it as it may
> + * be dirty due to extent count upgrade. This avoids a potential
> + * filesystem shutdown when this happens. We ignore any error
> + * from the transaction commit - we always return -ENOSPC to the
> + * caller here so we really don't care if the commit fails for
> + * some unknown reason...
> + */
> + xfs_trans_commit(tp);
> + return -ENOSPC;
A cancel and thus shutdown does seem like the right behavior for a trap
for an unknown bug..
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH 4/4] xfs: validate block count for XFS_IOC_SET_RESBLKS
2024-04-02 21:38 ` [PATCH 4/4] xfs: validate block count for XFS_IOC_SET_RESBLKS Dave Chinner
2024-04-03 3:53 ` Darrick J. Wong
@ 2024-04-03 4:43 ` Christoph Hellwig
1 sibling, 0 replies; 25+ messages in thread
From: Christoph Hellwig @ 2024-04-03 4:43 UTC (permalink / raw)
To: Dave Chinner; +Cc: linux-xfs, chandanbabu
On Wed, Apr 03, 2024 at 08:38:19AM +1100, Dave Chinner wrote:
> Stop the fuzzers ifrom being able to do this by validating that the
s/ifrom/from/
> + if (fsop.resblks >= mp->m_sb.sb_dblocks)
Even all of dblocks seems too much. A quarter of all blocks still feels
very generous.
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH 3/4] xfs: handle allocation failure in xfs_dquot_disk_alloc()
2024-04-03 4:41 ` Christoph Hellwig
@ 2024-04-03 4:54 ` Darrick J. Wong
2024-04-03 4:56 ` Christoph Hellwig
0 siblings, 1 reply; 25+ messages in thread
From: Darrick J. Wong @ 2024-04-03 4:54 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: Dave Chinner, linux-xfs, chandanbabu
On Tue, Apr 02, 2024 at 09:41:56PM -0700, Christoph Hellwig wrote:
> > + if (nmaps == 0) {
> > + /*
> > + * Unexpected ENOSPC - the transaction reservation should have
> > + * guaranteed that this allocation will succeed. We don't know
> > + * why this happened, so just back out gracefully.
> > + *
> > + * We commit the transaction instead of cancelling it as it may
> > + * be dirty due to extent count upgrade. This avoids a potential
> > + * filesystem shutdown when this happens. We ignore any error
> > + * from the transaction commit - we always return -ENOSPC to the
> > + * caller here so we really don't care if the commit fails for
> > + * some unknown reason...
> > + */
> > + xfs_trans_commit(tp);
> > + return -ENOSPC;
>
> A cancel and thus shutdown does seem like the right behavior for a trap
> for an unknown bug..
Usually this will result in the file write erroring out, right?
--D
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH 3/4] xfs: handle allocation failure in xfs_dquot_disk_alloc()
2024-04-03 4:54 ` Darrick J. Wong
@ 2024-04-03 4:56 ` Christoph Hellwig
2024-04-03 5:04 ` Darrick J. Wong
0 siblings, 1 reply; 25+ messages in thread
From: Christoph Hellwig @ 2024-04-03 4:56 UTC (permalink / raw)
To: Darrick J. Wong; +Cc: Christoph Hellwig, Dave Chinner, linux-xfs, chandanbabu
On Tue, Apr 02, 2024 at 09:54:56PM -0700, Darrick J. Wong wrote:
> Usually this will result in the file write erroring out, right?
quota file allocations usually come originally from a file write or
inode creation. But I'm not entirely sure if that was the question..
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH 3/4] xfs: handle allocation failure in xfs_dquot_disk_alloc()
2024-04-03 4:56 ` Christoph Hellwig
@ 2024-04-03 5:04 ` Darrick J. Wong
2024-04-03 6:41 ` Dave Chinner
0 siblings, 1 reply; 25+ messages in thread
From: Darrick J. Wong @ 2024-04-03 5:04 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: Dave Chinner, linux-xfs, chandanbabu
On Tue, Apr 02, 2024 at 09:56:24PM -0700, Christoph Hellwig wrote:
> On Tue, Apr 02, 2024 at 09:54:56PM -0700, Darrick J. Wong wrote:
> > Usually this will result in the file write erroring out, right?
>
> quota file allocations usually come originally from a file write or
> inode creation. But I'm not entirely sure if that was the question..
Heh, and the question was based on a misreading of your comment. 8-)
AFAICT this can result in dqattach erroring out, which seems mostly
benign.
--D
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH 1/4] xfs: use kvmalloc for xattr buffers
2024-04-03 4:39 ` Christoph Hellwig
@ 2024-04-03 6:16 ` Dave Chinner
2024-04-03 6:19 ` Christoph Hellwig
0 siblings, 1 reply; 25+ messages in thread
From: Dave Chinner @ 2024-04-03 6:16 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: linux-xfs, chandanbabu
On Tue, Apr 02, 2024 at 09:39:47PM -0700, Christoph Hellwig wrote:
> On Wed, Apr 03, 2024 at 08:38:16AM +1100, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> >
> > Pankaj Raghav reported that when filesystem block size is larger
> > than page size, the xattr code can use kmalloc() for high order
> > allocations. This triggers a useless warning in the allocator as it
> > is a __GFP_NOFAIL allocation here:
>
> Can we just get the warning fixed in the MM code?
I'd love that, but until the MM developers actually agree to
supporting __GFP_NOFAIL as normal, guaranteed allocation policy this
isn't going to change. I don't want to hold up the LBS support work
by gating it on mm policy changes....
-Dave.
--
Dave Chinner
david@fromorbit.com
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH 1/4] xfs: use kvmalloc for xattr buffers
2024-04-03 6:16 ` Dave Chinner
@ 2024-04-03 6:19 ` Christoph Hellwig
0 siblings, 0 replies; 25+ messages in thread
From: Christoph Hellwig @ 2024-04-03 6:19 UTC (permalink / raw)
To: Dave Chinner; +Cc: Christoph Hellwig, linux-xfs, chandanbabu
On Wed, Apr 03, 2024 at 05:16:58PM +1100, Dave Chinner wrote:
> > Can we just get the warning fixed in the MM code?
>
> I'd love that, but until the MM developers actually agree to
> supporting __GFP_NOFAIL as normal, guaranteed allocation policy this
> isn't going to change. I don't want to hold up the LBS support work
> by gating it on mm policy changes....
Well, let's give them at least a little more time and only add this
patches to the large block size series for now.
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH 2/4] xfs: xfs_alloc_file_space() fails to detect ENOSPC
2024-04-03 4:40 ` Christoph Hellwig
@ 2024-04-03 6:34 ` Dave Chinner
2024-04-03 18:23 ` Christoph Hellwig
0 siblings, 1 reply; 25+ messages in thread
From: Dave Chinner @ 2024-04-03 6:34 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: linux-xfs, chandanbabu
On Tue, Apr 02, 2024 at 09:40:21PM -0700, Christoph Hellwig wrote:
> On Wed, Apr 03, 2024 at 08:38:17AM +1100, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> >
> > xfs_alloc_file_space ends up in an endless loop when
> > xfs_bmapi_write() returns nimaps == 0 at ENOSPC. The process is
> > unkillable, and so just runs around in a tight circle burning CPU
> > until the system is rebooted.
>
> What is your reproducer? Let's just fix this for real.
Run the reproducer in this bug report on a TOT kernel, and the
XFS_IOC_RESVSP call will livelock:
https://lore.kernel.org/linux-xfs/CAEJPjCvT3Uag-pMTYuigEjWZHn1sGMZ0GCjVVCv29tNHK76Cgg@mail.gmail.com/
That has nothing to do with delalloc - free space accounting was
screwed up by a reserve blocks ioctl, and so when allocation fails
it just runs around in a tight circle and cannot be broken out of.
Regardless of the reproducer that corrupts free space accounting,
there is no guarantee that allocations will succeed even if there is
free space available. Hence this loop must have a way to break out
when allocation fails. This becomes even more apparent with the
forced alignment feature - as soon as we run out of contiguous free
space for aligned allocation, allocations will persistently fail
when there is plenty of free space still available.
Given that the fix was for something that doesn't currently exist
(RT delalloc) the only sane thing to do right now is revert the fix
and push that revert back to the stable kernels that are susceptible
to this livelock.
I don't know exactly how the orginal delalloc issue was triggered,
let alone had the time to time to understand how to actually
fix it properly. The code as it stands contains a regression and so
the first thing we need to do is revert the change so we can
backport it....
-Dave.
--
Dave Chinner
david@fromorbit.com
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH 3/4] xfs: handle allocation failure in xfs_dquot_disk_alloc()
2024-04-03 5:04 ` Darrick J. Wong
@ 2024-04-03 6:41 ` Dave Chinner
0 siblings, 0 replies; 25+ messages in thread
From: Dave Chinner @ 2024-04-03 6:41 UTC (permalink / raw)
To: Darrick J. Wong; +Cc: Christoph Hellwig, linux-xfs, chandanbabu
On Tue, Apr 02, 2024 at 10:04:30PM -0700, Darrick J. Wong wrote:
> On Tue, Apr 02, 2024 at 09:56:24PM -0700, Christoph Hellwig wrote:
> > On Tue, Apr 02, 2024 at 09:54:56PM -0700, Darrick J. Wong wrote:
> > > Usually this will result in the file write erroring out, right?
> >
> > quota file allocations usually come originally from a file write or
> > inode creation. But I'm not entirely sure if that was the question..
>
> Heh, and the question was based on a misreading of your comment. 8-)
>
> AFAICT this can result in dqattach erroring out, which seems mostly
> benign.
Right - this propagates the ENOSPC error back to the caller without
a shutdown being required. If we get a corruption detected, then
the allocation will return an error, not nmaps == 0. That error will
cause a corruption. But an unexpected allocation failure right at
ENOSPC can occur without there being corruption because of, say, one
of the many, many near ENOSPC accounting bugs we've had to fix
over the past 20 years, and if the allocation fails we should just
clean up and return -ENOSPC without shutting down the filesystem.
We're right at ENOSPC, so there's every chance that the next
operation after the dquot was attached would fail with ENOSPC
anyway....
So, yeah, I don't see any reason to shut the filesytsem down because
we ended up with a transient ENOSPC error or an off-by one in the
free space accounting somewhere...
-Dave.
--
Dave Chinner
david@fromorbit.com
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH 4/4] xfs: validate block count for XFS_IOC_SET_RESBLKS
2024-04-03 3:53 ` Darrick J. Wong
@ 2024-04-03 6:55 ` Dave Chinner
0 siblings, 0 replies; 25+ messages in thread
From: Dave Chinner @ 2024-04-03 6:55 UTC (permalink / raw)
To: Darrick J. Wong; +Cc: linux-xfs, chandanbabu
On Tue, Apr 02, 2024 at 08:53:14PM -0700, Darrick J. Wong wrote:
> On Wed, Apr 03, 2024 at 08:38:19AM +1100, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> >
> > Userspace can pass anything it wants in the reserved block count
> > and we simply pass that to the reservation code. If a value that is
> > far too large is passed, we can overflow the free space counter
> > and df reports things like:
> >
> > Filesystem Size Used Avail Use% Mounted on
> > /dev/loop0 14M -27Z 27Z - /home/dave/bugs/file0
> >
> > As reserving space requires CAP_SYS_ADMIN, this is not a problem
> > that will ever been seen in production systems. However, fuzzers are
> > running with CAP_SYS_ADMIN, and so they able to run filesystem code
> > with out-of-band free space accounting.
> >
> > Stop the fuzzers ifrom being able to do this by validating that the
> > count is within the bounds of the filesystem size and reject
> > anything outside those bounds as invalid.
> >
> > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > ---
> > fs/xfs/xfs_ioctl.c | 3 +++
> > 1 file changed, 3 insertions(+)
> >
> > diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c
> > index d0e2cec6210d..18a225d884dd 100644
> > --- a/fs/xfs/xfs_ioctl.c
> > +++ b/fs/xfs/xfs_ioctl.c
> > @@ -1892,6 +1892,9 @@ xfs_ioctl_getset_resblocks(
> > if (copy_from_user(&fsop, arg, sizeof(fsop)))
> > return -EFAULT;
> >
> > + if (fsop.resblks >= mp->m_sb.sb_dblocks)
> > + return -EINVAL;
>
> Why isn't xfs_reserve_blocks catching this?
xfs_reserve_blocks() assumes that values have already been bounds
checked and are valid, so calculations won't overflow. Nothing in
the internal calls to xfs_reserve_blocks() will pass an out-of-range
value, but the value coming from userspace via the ioctl is
completely unbounded. Bounding checks should be done in the code
processing the unbounded input, not the internal functions.
> Is this due to the odd
> behavior that a failed xfs_mod_fdblocks is undone and m_resblks simply
> allowed to remain?
I don't know - I couldn't work out where the overflow was occurring
from reading the code, and once I realised that all the internal
calls are using sanitised values and the ioctl didn't sanitise the
user input, the fix was obvious....
> Also why wouldn't we limit m_resblks to something smaller, like 10% of
> the fs or half an AG or something like that?
Because now we bikeshed over what is a useful limit for userspace to be
setting, rather than focussing on fixing the bug. i.e. this bug fix
doesn't limit the actual usable range that userspace can reserve,
but it prevents the overflow from occurring.i
If you want to set a limit on this value and update the code,
manpages, etc to document the new behaviour and then have to change
it when someone inevitably says "I have a workflow that temporarily
reserves 75% of the disk space because ....", then do it as a
separate "new feature" patchset. In the mean time, we just need the
overflow to go away....
-Dave.
--
Dave Chinner
david@fromorbit.com
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH 3/4] xfs: handle allocation failure in xfs_dquot_disk_alloc()
2024-04-02 21:38 ` [PATCH 3/4] xfs: handle allocation failure in xfs_dquot_disk_alloc() Dave Chinner
2024-04-03 3:48 ` Darrick J. Wong
2024-04-03 4:41 ` Christoph Hellwig
@ 2024-04-03 14:06 ` Christoph Hellwig
2024-04-03 21:49 ` Dave Chinner
2 siblings, 1 reply; 25+ messages in thread
From: Christoph Hellwig @ 2024-04-03 14:06 UTC (permalink / raw)
To: Dave Chinner; +Cc: linux-xfs, chandanbabu
On Wed, Apr 03, 2024 at 08:38:18AM +1100, Dave Chinner wrote:
> @@ -356,6 +356,23 @@ xfs_dquot_disk_alloc(
> if (error)
> goto err_cancel;
>
> + if (nmaps == 0) {
> + /*
> + * Unexpected ENOSPC - the transaction reservation should have
> + * guaranteed that this allocation will succeed. We don't know
> + * why this happened, so just back out gracefully.
So looking at this code, xfs_dquot_disk_alloc allocates it's own
transaction, and does so without a space reservation. In other words:
an ENOSPC is entirely expected here in the current form. The code, just
like many other callers of xfs_bmapi_write, just fails to handle
the weird 0 return value and zero nmaps convention properly.
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH 2/4] xfs: xfs_alloc_file_space() fails to detect ENOSPC
2024-04-03 6:34 ` Dave Chinner
@ 2024-04-03 18:23 ` Christoph Hellwig
0 siblings, 0 replies; 25+ messages in thread
From: Christoph Hellwig @ 2024-04-03 18:23 UTC (permalink / raw)
To: Dave Chinner; +Cc: Christoph Hellwig, linux-xfs, chandanbabu
On Wed, Apr 03, 2024 at 05:34:09PM +1100, Dave Chinner wrote:
> Given that the fix was for something that doesn't currently exist
> (RT delalloc) the only sane thing to do right now is revert the fix
> and push that revert back to the stable kernels that are susceptible
> to this livelock.
That was how I originally found it. I reproduced the problem with
the normal allocator and very small AGs since then.
Anyway, here is what I came up with today. TL;DR: xfs_bmapi_write
calling conventions are complete fucked up and I'm amazed we've survived
this far.
---
From 80bb655563fad9757c141df4c47ce9701ed4926a Mon Sep 17 00:00:00 2001
From: Christoph Hellwig <hch@lst.de>
Date: Wed, 3 Apr 2024 11:04:07 +0200
Subject: xfs: fix error returns from xfs_bmapi_write
xfs_bmapi_write can return 0 without actually returning a mapping in
mval in two different cases:
1) when there is absolutely no space available to do an allocation
2) when converting delalloc space, and the allocation is so small
that it only covers parts of the delalloc extent before the
range requested by the caller
Callers at best can handle one of these cases, but in many cases can't
cope with either one. Switch xfs_bmapi_write to always return a
mapping or return an error code instead. For case 1) above ENOSPC is
the obvious choice which is very much what the callers expect anyway.
For case 2) there is no really good error code, so pick a funky one
from the SysV streams portfolio.
This fixes the reproducer here:
https://lore.kernel.org/linux-xfs/CAEJPjCvT3Uag-pMTYuigEjWZHn1sGMZ0GCjVVCv29tNHK76Cgg@mail.gmail.com0/
which uses reserved blocks to create file systems that are gravely
out of space and thus cause at least xfs_file_alloc_space to hang
and trigger the lack of ENOSPC handling in xfs_dquot_disk_alloc.
Signed-off-by: Christoph Hellwig <hch@lst.de>
---
fs/xfs/libxfs/xfs_attr_remote.c | 1 -
fs/xfs/libxfs/xfs_bmap.c | 43 ++++++++++++++++++++++++++++-----
fs/xfs/libxfs/xfs_da_btree.c | 20 ++++-----------
fs/xfs/scrub/quota_repair.c | 6 -----
fs/xfs/scrub/rtbitmap_repair.c | 2 --
fs/xfs/xfs_bmap_util.c | 31 ++++++++++++------------
fs/xfs/xfs_dquot.c | 1 -
fs/xfs/xfs_iomap.c | 8 ------
fs/xfs/xfs_reflink.c | 14 -----------
fs/xfs/xfs_rtalloc.c | 2 --
10 files changed, 57 insertions(+), 71 deletions(-)
diff --git a/fs/xfs/libxfs/xfs_attr_remote.c b/fs/xfs/libxfs/xfs_attr_remote.c
index ff04128287720a..41a02dcc2541b0 100644
--- a/fs/xfs/libxfs/xfs_attr_remote.c
+++ b/fs/xfs/libxfs/xfs_attr_remote.c
@@ -626,7 +626,6 @@ xfs_attr_rmtval_set_blk(
if (error)
return error;
- ASSERT(nmap == 1);
ASSERT((map->br_startblock != DELAYSTARTBLOCK) &&
(map->br_startblock != HOLESTARTBLOCK));
diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
index 656c95a22f2e6d..631fd454a832cd 100644
--- a/fs/xfs/libxfs/xfs_bmap.c
+++ b/fs/xfs/libxfs/xfs_bmap.c
@@ -4226,8 +4226,10 @@ xfs_bmapi_allocate(
} else {
error = xfs_bmap_alloc_userdata(bma);
}
- if (error || bma->blkno == NULLFSBLOCK)
+ if (error)
return error;
+ if (bma->blkno == NULLFSBLOCK)
+ return -ENOSPC;
if (bma->flags & XFS_BMAPI_ZERO) {
error = xfs_zero_extent(bma->ip, bma->blkno, bma->length);
@@ -4406,6 +4408,15 @@ xfs_bmapi_finish(
* extent state if necessary. Details behaviour is controlled by the flags
* parameter. Only allocates blocks from a single allocation group, to avoid
* locking problems.
+ *
+ * Returns 0 on success and places the extent mapings in mval. nmaps is used as
+ * an input/output parameter where the caller specifies the maximum number
+ * before calling xfs_bmapi_write, and xfs_bmapi_write passes the number of
+ * mappings (including existing mappings) it found.
+ *
+ * Returns a negative error code on failure, including -ENOSPC when it could not
+ * allocate any blocks and -ENOSR when it did allocated blocks to convert a
+ * delalloc range, but those blocks were before the passed in range.
*/
int
xfs_bmapi_write(
@@ -4534,10 +4545,16 @@ xfs_bmapi_write(
ASSERT(len > 0);
ASSERT(bma.length > 0);
error = xfs_bmapi_allocate(&bma);
- if (error)
+ if (error) {
+ /*
+ * If we already allocated space in a previous
+ * iteration return what we go so far when
+ * running out of space.
+ */
+ if (error == -ENOSPC && bma.nallocs)
+ break;
goto error0;
- if (bma.blkno == NULLFSBLOCK)
- break;
+ }
/*
* If this is a CoW allocation, record the data in
@@ -4575,7 +4592,6 @@ xfs_bmapi_write(
if (!xfs_iext_next_extent(ifp, &bma.icur, &bma.got))
eof = true;
}
- *nmap = n;
error = xfs_bmap_btree_to_extents(tp, ip, bma.cur, &bma.logflags,
whichfork);
@@ -4586,7 +4602,22 @@ xfs_bmapi_write(
ifp->if_nextents > XFS_IFORK_MAXEXT(ip, whichfork));
xfs_bmapi_finish(&bma, whichfork, 0);
xfs_bmap_validate_ret(orig_bno, orig_len, orig_flags, orig_mval,
- orig_nmap, *nmap);
+ orig_nmap, n);
+
+ /*
+ * When converting delayed allocations, xfs_bmapi_allocate ignores
+ * the passed in bno and always converts from the start of the found
+ * delalloc extent.
+ *
+ * To avoid a successful return with *nmap set to 0, return the magic
+ * -ENOSR error code for this particular case so that the caller can
+ * handle it.
+ */
+ if (!n) {
+ ASSERT(bma.nallocs >= *nmap);
+ return -ENOSR;
+ }
+ *nmap = n;
return 0;
error0:
xfs_bmapi_finish(&bma, whichfork, error);
diff --git a/fs/xfs/libxfs/xfs_da_btree.c b/fs/xfs/libxfs/xfs_da_btree.c
index 718d071bb21ae3..276c710548b5a7 100644
--- a/fs/xfs/libxfs/xfs_da_btree.c
+++ b/fs/xfs/libxfs/xfs_da_btree.c
@@ -2167,8 +2167,8 @@ xfs_da_grow_inode_int(
struct xfs_inode *dp = args->dp;
int w = args->whichfork;
xfs_rfsblock_t nblks = dp->i_nblocks;
- struct xfs_bmbt_irec map, *mapp;
- int nmap, error, got, i, mapi;
+ struct xfs_bmbt_irec map, *mapp = ↦
+ int nmap, error, got, i, mapi = 1;
/*
* Find a spot in the file space to put the new block.
@@ -2184,14 +2184,7 @@ xfs_da_grow_inode_int(
error = xfs_bmapi_write(tp, dp, *bno, count,
xfs_bmapi_aflag(w)|XFS_BMAPI_METADATA|XFS_BMAPI_CONTIG,
args->total, &map, &nmap);
- if (error)
- return error;
-
- ASSERT(nmap <= 1);
- if (nmap == 1) {
- mapp = ↦
- mapi = 1;
- } else if (nmap == 0 && count > 1) {
+ if (error == -ENOSPC && count > 1) {
xfs_fileoff_t b;
int c;
@@ -2209,16 +2202,13 @@ xfs_da_grow_inode_int(
args->total, &mapp[mapi], &nmap);
if (error)
goto out_free_map;
- if (nmap < 1)
- break;
mapi += nmap;
b = mapp[mapi - 1].br_startoff +
mapp[mapi - 1].br_blockcount;
}
- } else {
- mapi = 0;
- mapp = NULL;
}
+ if (error)
+ goto out_free_map;
/*
* Count the blocks we got, make sure it matches the total.
diff --git a/fs/xfs/scrub/quota_repair.c b/fs/xfs/scrub/quota_repair.c
index 0bab4c30cb85ab..90cd1512bba961 100644
--- a/fs/xfs/scrub/quota_repair.c
+++ b/fs/xfs/scrub/quota_repair.c
@@ -77,8 +77,6 @@ xrep_quota_item_fill_bmap_hole(
irec, &nmaps);
if (error)
return error;
- if (nmaps != 1)
- return -ENOSPC;
dq->q_blkno = XFS_FSB_TO_DADDR(mp, irec->br_startblock);
@@ -444,10 +442,6 @@ xrep_quota_data_fork(
XFS_BMAPI_CONVERT, 0, &nrec, &nmap);
if (error)
goto out;
- if (nmap != 1) {
- error = -ENOSPC;
- goto out;
- }
ASSERT(nrec.br_startoff == irec.br_startoff);
ASSERT(nrec.br_blockcount == irec.br_blockcount);
diff --git a/fs/xfs/scrub/rtbitmap_repair.c b/fs/xfs/scrub/rtbitmap_repair.c
index 46f5d5f605c915..0fef98e9f83409 100644
--- a/fs/xfs/scrub/rtbitmap_repair.c
+++ b/fs/xfs/scrub/rtbitmap_repair.c
@@ -108,8 +108,6 @@ xrep_rtbitmap_data_mappings(
0, &map, &nmaps);
if (error)
return error;
- if (nmaps != 1)
- return -EFSCORRUPTED;
/* Commit new extent and all deferred work. */
error = xrep_defer_finish(sc);
diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
index 19e11d1da66074..fbca94170cd386 100644
--- a/fs/xfs/xfs_bmap_util.c
+++ b/fs/xfs/xfs_bmap_util.c
@@ -721,33 +721,32 @@ xfs_alloc_file_space(
if (error)
goto error;
- error = xfs_bmapi_write(tp, ip, startoffset_fsb,
- allocatesize_fsb, XFS_BMAPI_PREALLOC, 0, imapp,
- &nimaps);
- if (error)
- goto error;
-
- ip->i_diflags |= XFS_DIFLAG_PREALLOC;
- xfs_trans_log_inode(tp, ip, XFS_ILOG_CORE);
-
- error = xfs_trans_commit(tp);
- xfs_iunlock(ip, XFS_ILOCK_EXCL);
- if (error)
- break;
-
/*
* If the allocator cannot find a single free extent large
* enough to cover the start block of the requested range,
- * xfs_bmapi_write will return 0 but leave *nimaps set to 0.
+ * xfs_bmapi_write will return -ENOSR.
*
* In that case we simply need to keep looping with the same
* startoffset_fsb so that one of the following allocations
* will eventually reach the requested range.
*/
- if (nimaps) {
+ error = xfs_bmapi_write(tp, ip, startoffset_fsb,
+ allocatesize_fsb, XFS_BMAPI_PREALLOC, 0, imapp,
+ &nimaps);
+ if (error) {
+ if (error != -ENOSR)
+ goto error;
+ error = 0;
+ } else {
startoffset_fsb += imapp->br_blockcount;
allocatesize_fsb -= imapp->br_blockcount;
}
+
+ ip->i_diflags |= XFS_DIFLAG_PREALLOC;
+ xfs_trans_log_inode(tp, ip, XFS_ILOG_CORE);
+
+ error = xfs_trans_commit(tp);
+ xfs_iunlock(ip, XFS_ILOCK_EXCL);
}
return error;
diff --git a/fs/xfs/xfs_dquot.c b/fs/xfs/xfs_dquot.c
index c98cb468c35780..0c9eb8fdeec082 100644
--- a/fs/xfs/xfs_dquot.c
+++ b/fs/xfs/xfs_dquot.c
@@ -357,7 +357,6 @@ xfs_dquot_disk_alloc(
goto err_cancel;
ASSERT(map.br_blockcount == XFS_DQUOT_CLUSTER_SIZE_FSB);
- ASSERT(nmaps == 1);
ASSERT((map.br_startblock != DELAYSTARTBLOCK) &&
(map.br_startblock != HOLESTARTBLOCK));
diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
index 4087af7f3c9f3f..42155cedefb77d 100644
--- a/fs/xfs/xfs_iomap.c
+++ b/fs/xfs/xfs_iomap.c
@@ -321,14 +321,6 @@ xfs_iomap_write_direct(
if (error)
goto out_unlock;
- /*
- * Copy any maps to caller's array and return any error.
- */
- if (nimaps == 0) {
- error = -ENOSPC;
- goto out_unlock;
- }
-
if (unlikely(!xfs_valid_startblock(ip, imap->br_startblock))) {
xfs_bmap_mark_sick(ip, XFS_DATA_FORK);
error = xfs_alert_fsblock_zero(ip, imap);
diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
index 7da0e8f961d351..5ecb52a234becc 100644
--- a/fs/xfs/xfs_reflink.c
+++ b/fs/xfs/xfs_reflink.c
@@ -430,13 +430,6 @@ xfs_reflink_fill_cow_hole(
if (error)
return error;
- /*
- * Allocation succeeded but the requested range was not even partially
- * satisfied? Bail out!
- */
- if (nimaps == 0)
- return -ENOSPC;
-
convert:
return xfs_reflink_convert_unwritten(ip, imap, cmap, convert_now);
@@ -499,13 +492,6 @@ xfs_reflink_fill_delalloc(
error = xfs_trans_commit(tp);
if (error)
return error;
-
- /*
- * Allocation succeeded but the requested range was not even
- * partially satisfied? Bail out!
- */
- if (nimaps == 0)
- return -ENOSPC;
} while (cmap->br_startoff + cmap->br_blockcount <= imap->br_startoff);
return xfs_reflink_convert_unwritten(ip, imap, cmap, convert_now);
diff --git a/fs/xfs/xfs_rtalloc.c b/fs/xfs/xfs_rtalloc.c
index e66f9bd5de5cff..46ee8093f797a6 100644
--- a/fs/xfs/xfs_rtalloc.c
+++ b/fs/xfs/xfs_rtalloc.c
@@ -709,8 +709,6 @@ xfs_growfs_rt_alloc(
nmap = 1;
error = xfs_bmapi_write(tp, ip, oblocks, nblocks - oblocks,
XFS_BMAPI_METADATA, 0, &map, &nmap);
- if (!error && nmap < 1)
- error = -ENOSPC;
if (error)
goto out_trans_cancel;
/*
--
2.39.2
^ permalink raw reply related [flat|nested] 25+ messages in thread
* Re: [PATCH 3/4] xfs: handle allocation failure in xfs_dquot_disk_alloc()
2024-04-03 14:06 ` Christoph Hellwig
@ 2024-04-03 21:49 ` Dave Chinner
0 siblings, 0 replies; 25+ messages in thread
From: Dave Chinner @ 2024-04-03 21:49 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: linux-xfs, chandanbabu
On Wed, Apr 03, 2024 at 07:06:48AM -0700, Christoph Hellwig wrote:
> On Wed, Apr 03, 2024 at 08:38:18AM +1100, Dave Chinner wrote:
> > @@ -356,6 +356,23 @@ xfs_dquot_disk_alloc(
> > if (error)
> > goto err_cancel;
> >
> > + if (nmaps == 0) {
> > + /*
> > + * Unexpected ENOSPC - the transaction reservation should have
> > + * guaranteed that this allocation will succeed. We don't know
> > + * why this happened, so just back out gracefully.
>
> So looking at this code, xfs_dquot_disk_alloc allocates it's own
> transaction, and does so without a space reservation.
It does have a valid space reservation:
error = xfs_trans_alloc(mp, &M_RES(mp)->tr_qm_dqalloc,
XFS_QM_DQALLOC_SPACE_RES(mp), 0, 0, &tp);
The space reservation is XFS_QM_DQALLOC_SPACE_RES(mp):
#define XFS_QM_DQALLOC_SPACE_RES(mp) \
(XFS_EXTENTADD_SPACE_RES(mp, XFS_DATA_FORK) + \
XFS_DQUOT_CLUSTER_SIZE_FSB)
which is correct for allocating a single XFS_DQUOT_CLUSTER_SIZE_FSB
sized extent.
> In other words:
> an ENOSPC is entirely expected here in the current form.
I disagree with that assessment. :)
> The code, just
> like many other callers of xfs_bmapi_write, just fails to handle
> the weird 0 return value and zero nmaps convention properly.
Yes, that's exactly what this patch is fixing regardless of the
cause of the failure. It's the right thing to do - error handling by
assumption (i.e. ASSERT()) is simply poor code....
-Dave.
--
Dave Chinner
david@fromorbit.com
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH 1/4] xfs: use kvmalloc for xattr buffers
2024-04-02 21:38 ` [PATCH 1/4] xfs: use kvmalloc for xattr buffers Dave Chinner
2024-04-03 3:43 ` Darrick J. Wong
2024-04-03 4:39 ` Christoph Hellwig
@ 2024-04-17 14:35 ` Pankaj Raghav (Samsung)
2 siblings, 0 replies; 25+ messages in thread
From: Pankaj Raghav (Samsung) @ 2024-04-17 14:35 UTC (permalink / raw)
To: david; +Cc: chandanbabu, linux-xfs, mcgrof, p.raghav
> Pankaj Raghav reported that when filesystem block size is larger
> than page size, the xattr code can use kmalloc() for high order
> allocations. This triggers a useless warning in the allocator as it
> is a __GFP_NOFAIL allocation here:
>
> static inline
> struct page *rmqueue(struct zone *preferred_zone,
> struct zone *zone, unsigned int order,
> gfp_t gfp_flags, unsigned int alloc_flags,
> int migratetype)
> {
> struct page *page;
>
> /*
> * We most definitely don't want callers attempting to
> * allocate greater than order-1 page units with __GFP_NOFAIL.
> */
> >>>> WARN_ON_ONCE((gfp_flags & __GFP_NOFAIL) && (order > 1));
> ...
>
> Fix this by changing all these call sites to use kvmalloc(), which
> will strip the NOFAIL from the kmalloc attempt and if that fails
> will do a __GFP_NOFAIL vmalloc().
>
> This is not an issue that productions systems will see as
> filesystems with block size > page size cannot be mounted by the
> kernel; Pankaj is developing this functionality right now.
>
> Reported-by: Pankaj Raghav <kernel@pankajraghav.com>
> Fixes: f078d4ea8276 ("xfs: convert kmem_alloc() to kmalloc()")
> Signed-off-be: Dave Chinner <dchinner@redhat.com>
Thanks. I tested this patch in my LBS branch and it fixes the warning.
Reviewed-by: Pankaj Raghav <p.raghav@samsung.com>
For now, I will add it to my LBS branch as I don't see it yet land on
6.9-rcs.
> ---
> fs/xfs/libxfs/xfs_attr_leaf.c | 15 ++++++---------
> 1 file changed, 6 insertions(+), 9 deletions(-)
--
Pankaj
^ permalink raw reply [flat|nested] 25+ messages in thread
end of thread, other threads:[~2024-04-17 14:35 UTC | newest]
Thread overview: 25+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-04-02 21:38 [PATCH 0/4] xfs: fixes for 6.9-rcX Dave Chinner
2024-04-02 21:38 ` [PATCH 1/4] xfs: use kvmalloc for xattr buffers Dave Chinner
2024-04-03 3:43 ` Darrick J. Wong
2024-04-03 4:39 ` Christoph Hellwig
2024-04-03 6:16 ` Dave Chinner
2024-04-03 6:19 ` Christoph Hellwig
2024-04-17 14:35 ` Pankaj Raghav (Samsung)
2024-04-02 21:38 ` [PATCH 2/4] xfs: xfs_alloc_file_space() fails to detect ENOSPC Dave Chinner
2024-04-03 3:46 ` Darrick J. Wong
2024-04-03 4:40 ` Christoph Hellwig
2024-04-03 6:34 ` Dave Chinner
2024-04-03 18:23 ` Christoph Hellwig
2024-04-02 21:38 ` [PATCH 3/4] xfs: handle allocation failure in xfs_dquot_disk_alloc() Dave Chinner
2024-04-03 3:48 ` Darrick J. Wong
2024-04-03 4:41 ` Christoph Hellwig
2024-04-03 4:54 ` Darrick J. Wong
2024-04-03 4:56 ` Christoph Hellwig
2024-04-03 5:04 ` Darrick J. Wong
2024-04-03 6:41 ` Dave Chinner
2024-04-03 14:06 ` Christoph Hellwig
2024-04-03 21:49 ` Dave Chinner
2024-04-02 21:38 ` [PATCH 4/4] xfs: validate block count for XFS_IOC_SET_RESBLKS Dave Chinner
2024-04-03 3:53 ` Darrick J. Wong
2024-04-03 6:55 ` Dave Chinner
2024-04-03 4:43 ` Christoph Hellwig
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox