Linux XFS filesystem development

Linux XFS filesystem development
 help / color / mirror / Atom feed

* Re: [PATCH] xfs/333: test zoned realtime writeback EIO shutdown
From: Zorro Lang @ 2026-06-14 17:19 UTC (permalink / raw)
  To: Yao Sang; +Cc: fstests, linux-xfs, hch, djwong
In-Reply-To: <ai7LiV1TrX5D_E2J@zlang-mailbox>

On Mon, Jun 15, 2026 at 12:19:09AM +0800, Zorro Lang wrote:
> On Fri, Jun 12, 2026 at 03:42:31PM +0800, Yao Sang wrote:
> > Create a zoned realtime filesystem with a single user data open zone,
> > dirty one zone worth of data, and inject a writeback error on the
> > realtime device.
> > 
> > After disabling fault injection, fsync a second write.  The filesystem
> > must already be shut down so that the second fsync fails quickly instead
> > of waiting for zoned allocation progress.
> > 
> > Signed-off-by: Yao Sang <sangyao@kylinos.cn>
> > ---
> > This is a test for the XFS zoned writeback shutdown fix posted as:
> > https://lore.kernel.org/all/20260611015305.1583003-1-sangyao@kylinos.cn/
> > 
> > Tested on a ZNS realtime XFS setup with fail_make_request enabled.  The
> > test passes with the v2 XFS fix applied.  On a kernel without that fix,
> > the second fsync waits behind the consumed open zone and reproduces the
> > hung writer condition.
> > 
> >  tests/xfs/333     | 69 +++++++++++++++++++++++++++++++++++++++++++++++
> >  tests/xfs/333.out |  1 +
> >  2 files changed, 70 insertions(+)
> >  create mode 100755 tests/xfs/333
> >  create mode 100644 tests/xfs/333.out
> > 
> > diff --git a/tests/xfs/333 b/tests/xfs/333
> > new file mode 100755
> > index 00000000..aafee4c3
> > --- /dev/null
> > +++ b/tests/xfs/333
> > @@ -0,0 +1,69 @@
> > +#! /bin/bash
> > +# SPDX-License-Identifier: GPL-2.0
> > +# Copyright (c) 2026 Kylin Software.
> > +#
> > +# FS QA Test No. 333
> > +#
> > +# Check that an unrecoverable writeback error on a zoned realtime device shuts
> > +# down the filesystem.
> > +#
> > +. ./common/preamble
> > +_begin_fstest auto quick rw zone eio
> > +
> > +. ./common/filter
> > +. ./common/fail_make_request
> > +. ./common/zoned
> > +
> > +_require_debugfs
> > +_require_scratch_nocheck
> > +_require_realtime
> > +_require_block_device $SCRATCH_RTDEV
> > +_require_zoned_device $SCRATCH_RTDEV
> > +_require_command "$BLKZONE_PROG" blkzone
> > +_require_fail_make_request
> > +
> > +_cleanup()
> > +{
> > +	[ -n "$SCRATCH_RTDEV" ] && \
> > +		_bdev_fail_make_request $SCRATCH_RTDEV 0 > /dev/null 2>&1
> > +	_disallow_fail_make_request > /dev/null 2>&1
> > +	_scratch_unmount > /dev/null 2>&1
> > +	cd /
> > +	rm -r -f $tmp.*
> > +}
> > +_register_cleanup _cleanup
>    ^^^^^^^^^^^^^^^^^
> This line isn't necessary. Since "_register_cleanup _cleanup" is called in
> _begin_fstest, it only registers the function name (not function instance).
> The latest definition will be called automatically, so there's no need to
> worry.
> 
> > +
> > +zone_capacity=$(_zone_capacity 0 $SCRATCH_RTDEV)
> > +echo "zone capacity: $zone_capacity" >> $seqres.full
> > +
> > +_scratch_mkfs_sized $((256 * 1024 * 1024)) >> $seqres.full 2>&1
> > +
> > +# Leave only one open zone for user data.  If writeback fails after consuming
> > +# it, further writers must either see shutdown or wait forever for zone space.
> > +export MOUNT_OPTIONS="$MOUNT_OPTIONS -o max_open_zones=2"
> > +_try_scratch_mount || _notrun "mount option not supported"
> > +_require_xfs_scratch_zoned 1
> > +
> > +_prepare_for_eio_shutdown $SCRATCH_DEV
>    ^^^^^^^^^^^^^^^^^^^^^^^^^
> This line is already called in _try_scratch_mount, don't need to call it again,
> except you want to run it on another device.
> 
> > +
> > +testfile=$SCRATCH_MNT/writeback-error
> > +waitfile=$SCRATCH_MNT/wait-for-zone
> > +
> > +$XFS_IO_PROG -f -c "pwrite -S 0x58 0 $zone_capacity" $testfile \
> > +	>> $seqres.full 2>&1 || _fail "failed to dirty test file"
> > +
> > +_allow_fail_make_request 100 100000 0 > /dev/null
> > +_bdev_fail_make_request $SCRATCH_RTDEV 1 > /dev/null
> > +
> > +$XFS_IO_PROG -c "fsync" $testfile >> $seqres.full 2>&1 && \
> > +	_fail "fsync succeeded despite realtime device error"
> > +
> > +_bdev_fail_make_request $SCRATCH_RTDEV 0 > /dev/null
> > +_disallow_fail_make_request > /dev/null
> > +
> > +$XFS_IO_PROG -f -c "pwrite -S 0x59 0 4k" -c "fsync" $waitfile \
> > +	>> $seqres.full 2>&1 && \
> > +	_fail "filesystem did not shut down after zoned writeback error"
> 
> OK, others looks good to me.
> 
> Since above changes are straightforward, if you don't have any other updates
> that require a V2, I can just remove those two redundant lines for you when
> merging.
> 
> Reviewed-by: Zorro Lang <zlang@kernel.org>
> 
> Thanks,
> Zorro
> 
> > +
> > +status=0
> > +exit

And,

echo "Silence is golden"
_exit 0

> > diff --git a/tests/xfs/333.out b/tests/xfs/333.out
> > new file mode 100644
> > index 00000000..3e11d156
> > --- /dev/null
> > +++ b/tests/xfs/333.out
> > @@ -0,0 +1 @@
> > +QA output created by 333

Silence is golden

I'll help to make all these changes when I merge it.

Thanks,
Zorro

> > -- 
> > 2.25.1
> > 
> 

^ permalink raw reply

* Re: [PATCH] xfs/333: test zoned realtime writeback EIO shutdown
From: Zorro Lang @ 2026-06-14 16:19 UTC (permalink / raw)
  To: Yao Sang; +Cc: fstests, linux-xfs, hch, djwong
In-Reply-To: <20260612074231.1109213-1-sangyao@kylinos.cn>

On Fri, Jun 12, 2026 at 03:42:31PM +0800, Yao Sang wrote:
> Create a zoned realtime filesystem with a single user data open zone,
> dirty one zone worth of data, and inject a writeback error on the
> realtime device.
> 
> After disabling fault injection, fsync a second write.  The filesystem
> must already be shut down so that the second fsync fails quickly instead
> of waiting for zoned allocation progress.
> 
> Signed-off-by: Yao Sang <sangyao@kylinos.cn>
> ---
> This is a test for the XFS zoned writeback shutdown fix posted as:
> https://lore.kernel.org/all/20260611015305.1583003-1-sangyao@kylinos.cn/
> 
> Tested on a ZNS realtime XFS setup with fail_make_request enabled.  The
> test passes with the v2 XFS fix applied.  On a kernel without that fix,
> the second fsync waits behind the consumed open zone and reproduces the
> hung writer condition.
> 
>  tests/xfs/333     | 69 +++++++++++++++++++++++++++++++++++++++++++++++
>  tests/xfs/333.out |  1 +
>  2 files changed, 70 insertions(+)
>  create mode 100755 tests/xfs/333
>  create mode 100644 tests/xfs/333.out
> 
> diff --git a/tests/xfs/333 b/tests/xfs/333
> new file mode 100755
> index 00000000..aafee4c3
> --- /dev/null
> +++ b/tests/xfs/333
> @@ -0,0 +1,69 @@
> +#! /bin/bash
> +# SPDX-License-Identifier: GPL-2.0
> +# Copyright (c) 2026 Kylin Software.
> +#
> +# FS QA Test No. 333
> +#
> +# Check that an unrecoverable writeback error on a zoned realtime device shuts
> +# down the filesystem.
> +#
> +. ./common/preamble
> +_begin_fstest auto quick rw zone eio
> +
> +. ./common/filter
> +. ./common/fail_make_request
> +. ./common/zoned
> +
> +_require_debugfs
> +_require_scratch_nocheck
> +_require_realtime
> +_require_block_device $SCRATCH_RTDEV
> +_require_zoned_device $SCRATCH_RTDEV
> +_require_command "$BLKZONE_PROG" blkzone
> +_require_fail_make_request
> +
> +_cleanup()
> +{
> +	[ -n "$SCRATCH_RTDEV" ] && \
> +		_bdev_fail_make_request $SCRATCH_RTDEV 0 > /dev/null 2>&1
> +	_disallow_fail_make_request > /dev/null 2>&1
> +	_scratch_unmount > /dev/null 2>&1
> +	cd /
> +	rm -r -f $tmp.*
> +}
> +_register_cleanup _cleanup
   ^^^^^^^^^^^^^^^^^
This line isn't necessary. Since "_register_cleanup _cleanup" is called in
_begin_fstest, it only registers the function name (not function instance).
The latest definition will be called automatically, so there's no need to
worry.

> +
> +zone_capacity=$(_zone_capacity 0 $SCRATCH_RTDEV)
> +echo "zone capacity: $zone_capacity" >> $seqres.full
> +
> +_scratch_mkfs_sized $((256 * 1024 * 1024)) >> $seqres.full 2>&1
> +
> +# Leave only one open zone for user data.  If writeback fails after consuming
> +# it, further writers must either see shutdown or wait forever for zone space.
> +export MOUNT_OPTIONS="$MOUNT_OPTIONS -o max_open_zones=2"
> +_try_scratch_mount || _notrun "mount option not supported"
> +_require_xfs_scratch_zoned 1
> +
> +_prepare_for_eio_shutdown $SCRATCH_DEV
   ^^^^^^^^^^^^^^^^^^^^^^^^^
This line is already called in _try_scratch_mount, don't need to call it again,
except you want to run it on another device.

> +
> +testfile=$SCRATCH_MNT/writeback-error
> +waitfile=$SCRATCH_MNT/wait-for-zone
> +
> +$XFS_IO_PROG -f -c "pwrite -S 0x58 0 $zone_capacity" $testfile \
> +	>> $seqres.full 2>&1 || _fail "failed to dirty test file"
> +
> +_allow_fail_make_request 100 100000 0 > /dev/null
> +_bdev_fail_make_request $SCRATCH_RTDEV 1 > /dev/null
> +
> +$XFS_IO_PROG -c "fsync" $testfile >> $seqres.full 2>&1 && \
> +	_fail "fsync succeeded despite realtime device error"
> +
> +_bdev_fail_make_request $SCRATCH_RTDEV 0 > /dev/null
> +_disallow_fail_make_request > /dev/null
> +
> +$XFS_IO_PROG -f -c "pwrite -S 0x59 0 4k" -c "fsync" $waitfile \
> +	>> $seqres.full 2>&1 && \
> +	_fail "filesystem did not shut down after zoned writeback error"

OK, others looks good to me.

Since above changes are straightforward, if you don't have any other updates
that require a V2, I can just remove those two redundant lines for you when
merging.

Reviewed-by: Zorro Lang <zlang@kernel.org>

Thanks,
Zorro

> +
> +status=0
> +exit
> diff --git a/tests/xfs/333.out b/tests/xfs/333.out
> new file mode 100644
> index 00000000..3e11d156
> --- /dev/null
> +++ b/tests/xfs/333.out
> @@ -0,0 +1 @@
> +QA output created by 333
> -- 
> 2.25.1
> 

^ permalink raw reply

* Re: [PATCH v5 2/2] xfs: shut down the filesystem on a failed mount
From: Zhang Cen @ 2026-06-13  8:13 UTC (permalink / raw)
  To: Mikhail Lobanov
  Cc: Carlos Maiolino, Darrick J . Wong, Dave Chinner,
	Christoph Hellwig, Christoph Hellwig, linux-xfs, linux-kernel,
	lvc-project, Zhang Cen
In-Reply-To: <20260610191904.12533-2-m.lobanov@rosa.ru>

Hi Mikhail and Christoph,

Sorry for the delayed follow-up. I was tied up with other work recently and
did not get a chance to continue following up on this issue.

This series appears to address the same mount-failure inodegc/quota teardown
race that I reported earlier here:

https://lore.kernel.org/linux-xfs/20260528053554.3047829-1-rollkingzzc@gmail.com/

Could you please carry the following tags?

Reported-by: Zhang Cen <rollkingzzc@gmail.com>
Closes: https://lore.kernel.org/linux-xfs/20260528053554.3047829-1-rollkingzzc@gmail.com/

Thanks,
Zhang Cen

^ permalink raw reply

* Re: [PATCH 6.12.y] iomap: don't revert iov_iter on partially completed buffered writes
From: Sasha Levin @ 2026-06-13  0:20 UTC (permalink / raw)
  To: stable
  Cc: Sasha Levin, linux-fsdevel, linux-xfs, Gregg Leventhal,
	Eric Hagberg, Brian Foster
In-Reply-To: <20260612121047.397754-1-bfoster@redhat.com>

On Fri, Jun 12, 2026 at 08:10:47AM -0400, Brian Foster wrote:
> Finally, note that I'm not intimately familiar with -stable process so
> I'm just sending a 6.12.y version here. Earlier branches can either also
> include this, revert 18e419f6e80a directly, or I can post targeted
> patches if needed. Thoughts, reviews, flames appreciated.

Queued for 6.12.y, 6.6.y and 6.1.y. Thanks!

--
Thanks,
Sasha

^ permalink raw reply

* Re: [PATCH] iomap: enforce DIO alignment check in iomap
From: Carlos Maiolino @ 2026-06-12 13:23 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Keith Busch, brauner, linux-block, linux-fsdevel, linux-ext4,
	linux-xfs, Hannes Reinecke, Martin K. Petersen, Jens Axboe
In-Reply-To: <20260612052831.GA9010@lst.de>

On Fri, Jun 12, 2026 at 07:28:31AM +0200, Christoph Hellwig wrote:
> On Thu, Jun 11, 2026 at 05:47:07PM +0200, Carlos Maiolino wrote:
> > On Thu, Jun 11, 2026 at 03:38:33PM +0200, Christoph Hellwig wrote:
> > > On Thu, Jun 11, 2026 at 06:57:47AM -0600, Keith Busch wrote:
> > > > It's entirely possible a device supports byte aligned addresses. The
> > > > block layer just doesn't let a driver report that. So either it really
> > > > was successful because you found a bug that skips the alignment checks,
> > > > or your device silently corrupted your payload.
> > 
> > I tried this on different hardware, I find it hard to say all those
> > devices were corrupting the payload.
> 
> I think in the other thread we agreed that we are currently missing
> the alignment check for fast-path bios not hitting the splitting code,
> so maybe that is something you see.  Additionally we're missing the
> checks for purely bio based drivers not calling the splitting helper
> at all, but I don't think that applies here.
> 
> > > > Anyway, my earlier suggestion should work. Ming thinks it may go to far,
> > > > though, in not taking the optimization when it was possible. So here's
> > > > an alternative suggestion that should get things working as expected:
> > > 
> > > The fix below looks like it is addressing a real bug.  I'm not sure if
> > > Carlos is hitting it, but we were missing the alignment checks for
> > > single-bvec fast path bios so far indeed.
> > 
> > You left context out so I'm assuming by the fix you meant Keith's patch.
> 
> Yes.

The fix indeed seems to fix the behavior I'm seeing. Keith could you Cc
me if you end up sending an official version?

^ permalink raw reply

* Re: [PATCH v2 1/2] xfs: add a XFS_TRANS_WRITECOUNT_TRYLOCK flag
From: Aditya Prakash Srivastava @ 2026-06-12 12:48 UTC (permalink / raw)
  To: Carlos Maiolino; +Cc: Christoph Hellwig, linux-xfs, linux-kernel
In-Reply-To: <20260612114917.2192-2-aditya.ansh182@gmail.com>

Apologies for the mailing list noise in v2; a local stale patch file
collision caused the wrong files and version headers to be sent.
Please ignore this version v2 completely and take a look at the
clean v3 series.

Thanks
Aditya Prakash Srivastava

On Fri, Jun 12, 2026 at 5:20 PM Aditya Srivastava
<aditya.ansh182@gmail.com> wrote:
>
> From: Aditya Prakash Srivastava <aditya.ansh182@gmail.com>
>
> Introduce a new transaction allocation flag, XFS_TRANS_WRITECOUNT_TRYLOCK.
> When this flag is specified, __xfs_trans_alloc() attempts to obtain
> freeze protection using sb_start_intwrite_trylock() instead of blocking
> indefinitely on sb_start_intwrite().
>
> If the trylock fails, the allocation is aborted gracefully: the freshly
> allocated transaction handle is freed, and the function returns the
> appropriate error pointer ERR_PTR(-EAGAIN), which is then propagated
> to the caller by xfs_trans_alloc().
>
> Also add an assertion in __xfs_trans_alloc() to ensure that both
> XFS_TRANS_NO_WRITECOUNT and XFS_TRANS_WRITECOUNT_TRYLOCK are never
> specified at the same time, as they are mutually exclusive.
>
> Suggested-by: Christoph Hellwig <hch@infradead.org>
> Signed-off-by: Aditya Prakash Srivastava <aditya.ansh182@gmail.com>
> ---
>  fs/xfs/libxfs/xfs_shared.h |  3 +++
>  fs/xfs/xfs_trans.c         | 12 +++++++++++-
>  2 files changed, 14 insertions(+), 1 deletion(-)
>
> diff --git a/fs/xfs/libxfs/xfs_shared.h b/fs/xfs/libxfs/xfs_shared.h
> index b1e0d9bc1f7d..68d22b6cddd3 100644
> --- a/fs/xfs/libxfs/xfs_shared.h
> +++ b/fs/xfs/libxfs/xfs_shared.h
> @@ -164,6 +164,9 @@ void        xfs_log_get_max_trans_res(struct xfs_mount *mp,
>  /* Transaction has locked the rtbitmap and rtsum inodes */
>  #define XFS_TRANS_RTBITMAP_LOCKED      (1u << 9)
>
> +/* Try lock filesystem superblock for freeze protection */
> +#define XFS_TRANS_WRITECOUNT_TRYLOCK   (1u << 10)
> +
>  /*
>   * Field values for xfs_trans_mod_sb.
>   */
> diff --git a/fs/xfs/xfs_trans.c b/fs/xfs/xfs_trans.c
> index 148cc32449c1..3860e44d6439 100644
> --- a/fs/xfs/xfs_trans.c
> +++ b/fs/xfs/xfs_trans.c
> @@ -216,10 +216,18 @@ __xfs_trans_alloc(
>         struct xfs_trans        *tp;
>
>         ASSERT(!(flags & XFS_TRANS_RES_FDBLKS) || xfs_has_lazysbcount(mp));
> +       ASSERT(!((flags & XFS_TRANS_NO_WRITECOUNT) &&
> +                (flags & XFS_TRANS_WRITECOUNT_TRYLOCK)));
>
>         tp = kmem_cache_zalloc(xfs_trans_cache, GFP_KERNEL | __GFP_NOFAIL);
> -       if (!(flags & XFS_TRANS_NO_WRITECOUNT))
> +       if (flags & XFS_TRANS_WRITECOUNT_TRYLOCK) {
> +               if (!sb_start_intwrite_trylock(mp->m_super)) {
> +                       kmem_cache_free(xfs_trans_cache, tp);
> +                       return ERR_PTR(-EAGAIN);
> +               }
> +       } else if (!(flags & XFS_TRANS_NO_WRITECOUNT)) {
>                 sb_start_intwrite(mp->m_super);
> +       }
>         xfs_trans_set_context(tp);
>         tp->t_flags = flags;
>         tp->t_mountp = mp;
> @@ -252,6 +260,8 @@ xfs_trans_alloc(
>          */
>  retry:
>         tp = __xfs_trans_alloc(mp, flags);
> +       if (IS_ERR(tp))
> +               return PTR_ERR(tp);
>         WARN_ON(mp->m_super->s_writers.frozen == SB_FREEZE_COMPLETE);
>         error = xfs_trans_reserve(tp, resp, blocks, rtextents);
>         if (error == -ENOSPC && want_retry) {
> --
> 2.47.3
>

^ permalink raw reply

* Re: [PATCH v2 2/2] xfs: prevent close() from hanging on frozen filesystems
From: Aditya Prakash Srivastava @ 2026-06-12 12:47 UTC (permalink / raw)
  To: Carlos Maiolino; +Cc: Christoph Hellwig, linux-xfs, linux-kernel
In-Reply-To: <20260612114917.2192-3-aditya.ansh182@gmail.com>

Apologies for the mailing list noise in v2; a local stale patch file
collision caused the wrong files and version headers to be sent.
Please ignore this version v2 completely and take a look at the
clean v3 series.

Thanks
Aditya Prakash Srivastava

On Fri, Jun 12, 2026 at 5:20 PM Aditya Srivastava
<aditya.ansh182@gmail.com> wrote:
>
> From: Aditya Prakash Srivastava <aditya.ansh182@gmail.com>
>
> When a file with active speculative post-EOF preallocations is closed,
> xfs_file_release() synchronously triggers xfs_free_eofblocks() to clean
> them up. This requires allocating a write transaction (xfs_trans_alloc),
> which blocks indefinitely if the filesystem is currently frozen or in the
> process of freezing, as it waits to acquire the superblock's write lock.
>
> As a result, a close() system call on a read-write file descriptor can
> hang indefinitely in percpu_rwsem_wait() until the filesystem is thawed,
> even if the file is closed by a non-writer process or after all writing
> activity has already ceased.
>
> To fix this properly and avoid any potential race conditions where a freeze
> might come in immediately after a writable check, pass the new
> XFS_TRANS_WRITECOUNT_TRYLOCK flag to xfs_trans_alloc() when freeing
> speculative preallocations in xfs_file_release().
>
> If xfs_free_eofblocks() returns -EAGAIN on a trylock failure, we cleanly
> bypass setting XFS_EOFBLOCKS_RELEASED on the inode, ensuring subsequent
> releases or the background blockgc garbage collector can successfully retry
> the cleanup once the filesystem thaws.
>
> Also, rename the flags parameter in xfs_free_eofblocks() to trans_flags as
> suggested to make its usage stand out, and update existing callers to
> pass 0 to preserve standard blocking paths.
>
> Suggested-by: Christoph Hellwig <hch@infradead.org>
> Signed-off-by: Aditya Prakash Srivastava <aditya.ansh182@gmail.com>
> ---
>  fs/xfs/xfs_bmap_util.c | 9 +++++----
>  fs/xfs/xfs_bmap_util.h | 2 +-
>  fs/xfs/xfs_file.c      | 8 +++++---
>  fs/xfs/xfs_icache.c    | 2 +-
>  fs/xfs/xfs_inode.c     | 2 +-
>  5 files changed, 13 insertions(+), 10 deletions(-)
>
> diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
> index 0ab00615f1ad..faf0630717dc 100644
> --- a/fs/xfs/xfs_bmap_util.c
> +++ b/fs/xfs/xfs_bmap_util.c
> @@ -574,7 +574,8 @@ xfs_can_free_eofblocks(
>   */
>  int
>  xfs_free_eofblocks(
> -       struct xfs_inode        *ip)
> +       struct xfs_inode        *ip,
> +       uint                    trans_flags)
>  {
>         struct xfs_trans        *tp;
>         struct xfs_mount        *mp = ip->i_mount;
> @@ -604,9 +605,9 @@ xfs_free_eofblocks(
>                 return 0;
>         }
>
> -       error = xfs_trans_alloc(mp, &M_RES(mp)->tr_itruncate, 0, 0, 0, &tp);
> +       error = xfs_trans_alloc(mp, &M_RES(mp)->tr_itruncate, 0, 0, trans_flags, &tp);
>         if (error) {
> -               ASSERT(xfs_is_shutdown(mp));
> +               ASSERT(error == -EAGAIN || xfs_is_shutdown(mp));
>                 return error;
>         }
>
> @@ -928,7 +929,7 @@ xfs_prepare_shift(
>          * into the accessible region of the file.
>          */
>         if (xfs_can_free_eofblocks(ip)) {
> -               error = xfs_free_eofblocks(ip);
> +               error = xfs_free_eofblocks(ip, 0);
>                 if (error)
>                         return error;
>         }
> diff --git a/fs/xfs/xfs_bmap_util.h b/fs/xfs/xfs_bmap_util.h
> index c477b3361630..c13774aa0892 100644
> --- a/fs/xfs/xfs_bmap_util.h
> +++ b/fs/xfs/xfs_bmap_util.h
> @@ -66,7 +66,7 @@ int   xfs_insert_file_space(struct xfs_inode *, xfs_off_t offset,
>
>  /* EOF block manipulation functions */
>  bool   xfs_can_free_eofblocks(struct xfs_inode *ip);
> -int    xfs_free_eofblocks(struct xfs_inode *ip);
> +int    xfs_free_eofblocks(struct xfs_inode *ip, uint trans_flags);
>
>  int    xfs_swap_extents(struct xfs_inode *ip, struct xfs_inode *tip,
>                          struct xfs_swapext *sx);
> diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> index 845a97c9b063..76c9b2fe7c51 100644
> --- a/fs/xfs/xfs_file.c
> +++ b/fs/xfs/xfs_file.c
> @@ -1806,9 +1806,11 @@ xfs_file_release(
>          */
>         if (!xfs_iflags_test(ip, XFS_EOFBLOCKS_RELEASED) &&
>             xfs_ilock_nowait(ip, XFS_IOLOCK_EXCL)) {
> -               if (xfs_can_free_eofblocks(ip) &&
> -                   !xfs_iflags_test_and_set(ip, XFS_EOFBLOCKS_RELEASED))
> -                       xfs_free_eofblocks(ip);
> +               if (!xfs_iflags_test(ip, XFS_EOFBLOCKS_RELEASED) &&
> +                   xfs_can_free_eofblocks(ip) &&
> +                   !xfs_free_eofblocks(ip, XFS_TRANS_WRITECOUNT_TRYLOCK))
> +                       xfs_iflags_set(ip, XFS_EOFBLOCKS_RELEASED);
> +
>                 xfs_iunlock(ip, XFS_IOLOCK_EXCL);
>         }
>
> diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
> index 2040a9292ee6..c575b4acb24c 100644
> --- a/fs/xfs/xfs_icache.c
> +++ b/fs/xfs/xfs_icache.c
> @@ -1259,7 +1259,7 @@ xfs_inode_free_eofblocks(
>         *lockflags |= XFS_IOLOCK_EXCL;
>
>         if (xfs_can_free_eofblocks(ip))
> -               return xfs_free_eofblocks(ip);
> +               return xfs_free_eofblocks(ip, 0);
>
>         /* inode could be preallocated */
>         trace_xfs_inode_free_eofblocks_invalid(ip);
> diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
> index 9978ac1422fc..1c75f51af4d7 100644
> --- a/fs/xfs/xfs_inode.c
> +++ b/fs/xfs/xfs_inode.c
> @@ -1423,7 +1423,7 @@ xfs_inactive(
>                  * reference to the inode at this point anyways.
>                  */
>                 if (xfs_can_free_eofblocks(ip))
> -                       error = xfs_free_eofblocks(ip);
> +                       error = xfs_free_eofblocks(ip, 0);
>
>                 goto out;
>         }
> --
> 2.47.3
>

^ permalink raw reply

* [PATCH] xfs/842: test close() deadlocks on frozen filesystems
From: Aditya Srivastava @ 2026-06-12 12:37 UTC (permalink / raw)
  To: Zorro Lang
  Cc: Christoph Hellwig, Darrick J . Wong, fstests, linux-xfs,
	Aditya Prakash Srivastava

From: Aditya Prakash Srivastava <aditya.ansh182@gmail.com>

When a file with active speculative post-EOF preallocations is closed,
XFS synchronously attempts to free them via xfs_free_eofblocks().
This requires allocating a write transaction, which blocks indefinitely
if the filesystem is frozen, causing close() to hang.

Add a regression test using a helper C program to verify that close()
returns cleanly (with -EAGAIN if using XFS_TRANS_WRITECOUNT_TRYLOCK)
without blocking the close system call.

This is a regression test for the corresponding kernel patch:
"xfs: prevent close() from hanging on frozen filesystems"

On a successfully patched kernel, the close system call returns cleanly
and the test helper outputs 'close() completed immediately!', matching
the golden output. On an unpatched kernel, the close system call blocks,
the helper times out and outputs 'close() hung under freeze!', cleanly
flagging the deadlock to fstests.

Link: https://bugzilla.kernel.org/show_bug.cgi?id=205833
Link: https://bugzilla.redhat.com/show_bug.cgi?id=1474726
Signed-off-by: Aditya Prakash Srivastava <aditya.ansh182@gmail.com>
---
 src/Makefile                 |   1 +
 src/xfs_freeze_close_repro.c | 127 +++++++++++++++++++++++++++++++++++
 tests/xfs/842                |  32 +++++++++
 tests/xfs/842.out            |   2 +
 4 files changed, 162 insertions(+)
 create mode 100644 src/xfs_freeze_close_repro.c
 create mode 100755 tests/xfs/842
 create mode 100644 tests/xfs/842.out

diff --git a/src/Makefile b/src/Makefile
index 31ac43b2..9553e7c9 100644
--- a/src/Makefile
+++ b/src/Makefile
@@ -68,6 +68,7 @@ LCFLAGS += -DHAVE_FILE_GETATTR
 endif
 
 ifeq ($(PKG_PLATFORM),linux)
+LINUX_TARGETS += xfs_freeze_close_repro
 TARGETS += $(LINUX_TARGETS)
 endif
 
diff --git a/src/xfs_freeze_close_repro.c b/src/xfs_freeze_close_repro.c
new file mode 100644
index 00000000..075f2d18
--- /dev/null
+++ b/src/xfs_freeze_close_repro.c
@@ -0,0 +1,127 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (c) 2026 Aditya Prakash Srivastava. All Rights Reserved.
+ *
+ * xfstests helper to reproduce close() hang on frozen XFS filesystems.
+ */
+#define _GNU_SOURCE
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <unistd.h>
+#include <fcntl.h>
+#include <pthread.h>
+#include <sys/ioctl.h>
+#include <sys/vfs.h>
+#include <linux/fs.h>
+#include <libgen.h>
+
+int close_started;
+int close_completed;
+
+void *
+close_thread(void *arg)
+{
+	int fd = *(int *)arg;
+
+	close_started = 1;
+	close(fd);
+	close_completed = 1;
+	return NULL;
+}
+
+int
+main(int argc, char *argv[])
+{
+	struct statfs sfs;
+	char *dir_buf;
+	char *parent_dir;
+	int freeze_fd;
+	int write_fd;
+	char buf[65536];
+	pthread_t thread;
+	int hung = 1;
+	int i;
+
+	if (argc < 2) {
+		fprintf(stderr, "Usage: %s <file_on_xfs>\n", argv[0]);
+		return 1;
+	}
+
+	if (statfs(argv[1], &sfs) < 0) {
+		dir_buf = strdup(argv[1]);
+		parent_dir = dirname(dir_buf);
+		if (statfs(parent_dir, &sfs) < 0) {
+			perror("statfs");
+			free(dir_buf);
+			return 1;
+		}
+		free(dir_buf);
+	}
+	if (sfs.f_type != 0x58465342) {
+		fprintf(stderr, "Not on XFS\n");
+		return 1;
+	}
+
+	dir_buf = strdup(argv[1]);
+	freeze_fd = open(dirname(dir_buf), O_RDONLY);
+	if (freeze_fd < 0) {
+		perror("open dir");
+		free(dir_buf);
+		return 1;
+	}
+	free(dir_buf);
+
+	write_fd = open(argv[1], O_WRONLY | O_CREAT | O_TRUNC, 0644);
+	if (write_fd < 0) {
+		perror("open file");
+		close(freeze_fd);
+		return 1;
+	}
+
+	memset(buf, 'A', sizeof(buf));
+	for (i = 0; i < 320; i++) {
+		if (write(write_fd, buf, sizeof(buf)) < 0) {
+			perror("write");
+			close(write_fd);
+			close(freeze_fd);
+			return 1;
+		}
+	}
+
+	if (ioctl(freeze_fd, FIFREEZE, 0) < 0) {
+		perror("ioctl FIFREEZE");
+		close(write_fd);
+		close(freeze_fd);
+		return 1;
+	}
+
+	if (pthread_create(&thread, NULL, close_thread, &write_fd) != 0) {
+		perror("pthread_create");
+		ioctl(freeze_fd, FITHAW, 0);
+		close(write_fd);
+		close(freeze_fd);
+		return 1;
+	}
+
+	while (!close_started)
+		usleep(1000);
+
+	for (i = 0; i < 30; i++) {
+		usleep(100000);
+		if (close_completed) {
+			hung = 0;
+			break;
+		}
+	}
+
+	if (hung)
+		printf("close() hung under freeze!\n");
+	else
+		printf("close() completed immediately!\n");
+
+	ioctl(freeze_fd, FITHAW, 0);
+	pthread_join(thread, NULL);
+	close(freeze_fd);
+	return hung ? 0 : 1;
+}
diff --git a/tests/xfs/842 b/tests/xfs/842
new file mode 100755
index 00000000..8b225ea2
--- /dev/null
+++ b/tests/xfs/842
@@ -0,0 +1,32 @@
+#! /bin/bash
+# SPDX-License-Identifier: GPL-2.0
+# Copyright (c) 2026 Aditya Prakash Srivastava.  All Rights Reserved.
+#
+# FS QA Test No. 842
+#
+# Verify that closing a file with active speculative post-EOF preallocations
+# on a frozen XFS filesystem does not deadlock inside close().
+#
+. ./common/preamble
+_begin_fstest auto quick freeze
+
+# Import helper functions
+. ./common/rc
+. ./common/filter
+
+# Real-world configs
+_supported_fs xfs
+_require_scratch
+_require_test_program "xfs_freeze_close_repro"
+
+_scratch_mkfs >> $seqres.full 2>&1
+_scratch_mount
+
+TEST_FILE=$SCRATCH_MNT/test_eofblocks_freeze
+
+# Execute the reproducer helper on scratch mount
+$here/src/xfs_freeze_close_repro $TEST_FILE
+
+status=$?
+_scratch_unmount
+_exit
diff --git a/tests/xfs/842.out b/tests/xfs/842.out
new file mode 100644
index 00000000..386fb9cc
--- /dev/null
+++ b/tests/xfs/842.out
@@ -0,0 +1,2 @@
+QA output created by 842
+close() completed immediately!
-- 
2.47.3


^ permalink raw reply related

* [PATCH 6.12.y] iomap: don't revert iov_iter on partially completed buffered writes
From: Brian Foster @ 2026-06-12 12:10 UTC (permalink / raw)
  To: stable; +Cc: linux-fsdevel, linux-xfs, Gregg Leventhal, Eric Hagberg

Gregg reports that the iomap retry behavior for nonblocking (nowait)
append writes is broken. The problem occurs when an append write is
first submitted in non-blocking mode (i.e. via io_uring), partially
completes before hitting -EAGAIN, and then is resubmitted from
blocking context.

The specific problem is that at least one iteration of the loop in
iomap_write_iter() completes in non-blocking context and thus has
bumped i_size. The next iteration hits -EAGAIN, reverts the iov_iter
and returns. io_uring retries the entire append write from blocking
context, but since i_size has already been increased, the data that
was partially written on the first attempt is rewritten at the new
i_size. This is essentially an intra-write data corruption since the
data written to the file does not reflect the write from userspace.

This problem is already fixed on master as of commit 1a1a3b574b97
("iomap: advance the iter directly on buffered writes"). That commit
was primarily intended to clean up iomap iter state tracking, but it
also happened to remove the iov_iter revert and thus accidentally
fix this problem as well. Without the revert, iomap will commit
partial progress internally and loop once more before it more than
likely hits -EAGAIN and returns partial progress consistent with the
inode updates. This means the blocking retry from io_uring will pick
up where the first attempt left off at the current i_size and
perform the remainder of the write correctly.

Cc: <stable@vger.kernel.org>
Fixes: 18e419f6e80a ("iomap: Return -EAGAIN from iomap_write_iter()")
Reported-by: Gregg Leventhal <gleventhal@janestreet.com>
Reported-by: Eric Hagberg <ehagberg@janestreet.com>
Signed-off-by: Brian Foster <bfoster@redhat.com>
---

Hi all,

This relates to the discussion here[1]. Refer to the link for further
details and the custom reproducer. Since this is a stable-only patch,
I'd like to see at least one ack if possible from an iomap developer.

Note that this patch introduces an extra iomap iteration in the write
iter path before an -EAGAIN will return, but I went this route because
this is current upstream behavior and I didn't want to introduce novel
behavior in -stable, as trivial as it might be. This was initially
developed as a custom/selective backport of 1a1a3b574b97, but as it
turns out this is also effectively a revert of commit 18e419f6e80a. So
FWIW, this is somewhat historical behavior as well.

I plan to float a patch upstream to fix that loop wart soon. That would
seem overkill to fix in stable to me, but if it does prove necessary we
can revisit something like the custom version posted in [1] as a
stable-worthy variant. I'd just prefer to only do that if/after that
change proves acceptable upstream.

Finally, note that I'm not intimately familiar with -stable process so
I'm just sending a 6.12.y version here. Earlier branches can either also
include this, revert 18e419f6e80a directly, or I can post targeted
patches if needed. Thoughts, reviews, flames appreciated.

Brian

[1] https://lore.kernel.org/linux-fsdevel/CAFN_u7FrgM4Dzie2jjkLwWV8P0dvUG_Wwy3Q9B3-2HnnWiDu8w@mail.gmail.com/

 fs/iomap/buffered-io.c | 4 ----
 1 file changed, 4 deletions(-)

diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 0178292c1864..5f885286b2f4 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -1037,10 +1037,6 @@ static loff_t iomap_write_iter(struct iomap_iter *iter, struct iov_iter *i)
 		}
 	} while (iov_iter_count(i) && length);

-	if (status == -EAGAIN) {
-		iov_iter_revert(i, total_written);
-		return -EAGAIN;
-	}
 	return total_written ? total_written : status;
 }

-- 
2.54.0

^ permalink raw reply related

* [PATCH v3 2/2] xfs: prevent close() from hanging on frozen filesystems
From: Aditya Srivastava @ 2026-06-12 12:05 UTC (permalink / raw)
  To: Carlos Maiolino
  Cc: Christoph Hellwig, linux-xfs, linux-kernel,
	Aditya Prakash Srivastava

From: Aditya Prakash Srivastava <aditya.ansh182@gmail.com>

When a file with active speculative post-EOF preallocations is closed,
xfs_file_release() synchronously triggers xfs_free_eofblocks() to clean
them up. This requires allocating a write transaction (xfs_trans_alloc),
which blocks indefinitely if the filesystem is currently frozen or in the
process of freezing, as it waits to acquire the superblock's write lock.

As a result, a close() system call on a read-write file descriptor can
hang indefinitely in percpu_rwsem_wait() until the filesystem is thawed,
even if the file is closed by a non-writer process or after all writing
activity has already ceased.

To fix this properly and avoid any potential race conditions where a freeze
might come in immediately after a writable check, pass the new
XFS_TRANS_WRITECOUNT_TRYLOCK flag to xfs_trans_alloc() when freeing
speculative preallocations in xfs_file_release().

If xfs_free_eofblocks() returns -EAGAIN on a trylock failure, we cleanly
bypass setting XFS_EOFBLOCKS_RELEASED on the inode, ensuring subsequent
releases or the background blockgc garbage collector can successfully retry
the cleanup once the filesystem thaws.

Also, rename the flags parameter in xfs_free_eofblocks() to trans_flags as
suggested to make its usage stand out, and update existing callers to
pass 0 to preserve standard blocking paths.

Suggested-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Aditya Prakash Srivastava <aditya.ansh182@gmail.com>
---
 fs/xfs/xfs_bmap_util.c | 9 +++++----
 fs/xfs/xfs_bmap_util.h | 2 +-
 fs/xfs/xfs_file.c      | 8 +++++---
 fs/xfs/xfs_icache.c    | 2 +-
 fs/xfs/xfs_inode.c     | 2 +-
 5 files changed, 13 insertions(+), 10 deletions(-)

diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
index 0ab00615f1ad..faf0630717dc 100644
--- a/fs/xfs/xfs_bmap_util.c
+++ b/fs/xfs/xfs_bmap_util.c
@@ -574,7 +574,8 @@ xfs_can_free_eofblocks(
  */
 int
 xfs_free_eofblocks(
-	struct xfs_inode	*ip)
+	struct xfs_inode	*ip,
+	uint			trans_flags)
 {
 	struct xfs_trans	*tp;
 	struct xfs_mount	*mp = ip->i_mount;
@@ -604,9 +605,9 @@ xfs_free_eofblocks(
 		return 0;
 	}
 
-	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_itruncate, 0, 0, 0, &tp);
+	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_itruncate, 0, 0, trans_flags, &tp);
 	if (error) {
-		ASSERT(xfs_is_shutdown(mp));
+		ASSERT(error == -EAGAIN || xfs_is_shutdown(mp));
 		return error;
 	}
 
@@ -928,7 +929,7 @@ xfs_prepare_shift(
 	 * into the accessible region of the file.
 	 */
 	if (xfs_can_free_eofblocks(ip)) {
-		error = xfs_free_eofblocks(ip);
+		error = xfs_free_eofblocks(ip, 0);
 		if (error)
 			return error;
 	}
diff --git a/fs/xfs/xfs_bmap_util.h b/fs/xfs/xfs_bmap_util.h
index c477b3361630..c13774aa0892 100644
--- a/fs/xfs/xfs_bmap_util.h
+++ b/fs/xfs/xfs_bmap_util.h
@@ -66,7 +66,7 @@ int	xfs_insert_file_space(struct xfs_inode *, xfs_off_t offset,
 
 /* EOF block manipulation functions */
 bool	xfs_can_free_eofblocks(struct xfs_inode *ip);
-int	xfs_free_eofblocks(struct xfs_inode *ip);
+int	xfs_free_eofblocks(struct xfs_inode *ip, uint trans_flags);
 
 int	xfs_swap_extents(struct xfs_inode *ip, struct xfs_inode *tip,
 			 struct xfs_swapext *sx);
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 845a97c9b063..76c9b2fe7c51 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -1806,9 +1806,11 @@ xfs_file_release(
 	 */
 	if (!xfs_iflags_test(ip, XFS_EOFBLOCKS_RELEASED) &&
 	    xfs_ilock_nowait(ip, XFS_IOLOCK_EXCL)) {
-		if (xfs_can_free_eofblocks(ip) &&
-		    !xfs_iflags_test_and_set(ip, XFS_EOFBLOCKS_RELEASED))
-			xfs_free_eofblocks(ip);
+		if (!xfs_iflags_test(ip, XFS_EOFBLOCKS_RELEASED) &&
+		    xfs_can_free_eofblocks(ip) &&
+		    !xfs_free_eofblocks(ip, XFS_TRANS_WRITECOUNT_TRYLOCK))
+			xfs_iflags_set(ip, XFS_EOFBLOCKS_RELEASED);
+
 		xfs_iunlock(ip, XFS_IOLOCK_EXCL);
 	}
 
diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
index 2040a9292ee6..c575b4acb24c 100644
--- a/fs/xfs/xfs_icache.c
+++ b/fs/xfs/xfs_icache.c
@@ -1259,7 +1259,7 @@ xfs_inode_free_eofblocks(
 	*lockflags |= XFS_IOLOCK_EXCL;
 
 	if (xfs_can_free_eofblocks(ip))
-		return xfs_free_eofblocks(ip);
+		return xfs_free_eofblocks(ip, 0);
 
 	/* inode could be preallocated */
 	trace_xfs_inode_free_eofblocks_invalid(ip);
diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index 9978ac1422fc..1c75f51af4d7 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -1423,7 +1423,7 @@ xfs_inactive(
 		 * reference to the inode at this point anyways.
 		 */
 		if (xfs_can_free_eofblocks(ip))
-			error = xfs_free_eofblocks(ip);
+			error = xfs_free_eofblocks(ip, 0);
 
 		goto out;
 	}
-- 
2.47.3


^ permalink raw reply related

* [PATCH v3 1/2] xfs: add a XFS_TRANS_WRITECOUNT_TRYLOCK flag
From: Aditya Srivastava @ 2026-06-12 12:05 UTC (permalink / raw)
  To: Carlos Maiolino
  Cc: Christoph Hellwig, linux-xfs, linux-kernel,
	Aditya Prakash Srivastava

From: Aditya Prakash Srivastava <aditya.ansh182@gmail.com>

Introduce a new transaction allocation flag, XFS_TRANS_WRITECOUNT_TRYLOCK.
When this flag is specified, __xfs_trans_alloc() attempts to obtain
freeze protection using sb_start_intwrite_trylock() instead of blocking
indefinitely on sb_start_intwrite().

If the trylock fails, the allocation is aborted gracefully: the freshly
allocated transaction handle is freed, and the function returns the
appropriate error pointer ERR_PTR(-EAGAIN), which is then propagated
to the caller by xfs_trans_alloc().

Also add an assertion in __xfs_trans_alloc() to ensure that both
XFS_TRANS_NO_WRITECOUNT and XFS_TRANS_WRITECOUNT_TRYLOCK are never
specified at the same time, as they are mutually exclusive.

Suggested-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Aditya Prakash Srivastava <aditya.ansh182@gmail.com>
---
 fs/xfs/libxfs/xfs_shared.h |  3 +++
 fs/xfs/xfs_trans.c         | 12 +++++++++++-
 2 files changed, 14 insertions(+), 1 deletion(-)

diff --git a/fs/xfs/libxfs/xfs_shared.h b/fs/xfs/libxfs/xfs_shared.h
index b1e0d9bc1f7d..68d22b6cddd3 100644
--- a/fs/xfs/libxfs/xfs_shared.h
+++ b/fs/xfs/libxfs/xfs_shared.h
@@ -164,6 +164,9 @@ void	xfs_log_get_max_trans_res(struct xfs_mount *mp,
 /* Transaction has locked the rtbitmap and rtsum inodes */
 #define XFS_TRANS_RTBITMAP_LOCKED	(1u << 9)
 
+/* Try lock filesystem superblock for freeze protection */
+#define XFS_TRANS_WRITECOUNT_TRYLOCK	(1u << 10)
+
 /*
  * Field values for xfs_trans_mod_sb.
  */
diff --git a/fs/xfs/xfs_trans.c b/fs/xfs/xfs_trans.c
index 148cc32449c1..3860e44d6439 100644
--- a/fs/xfs/xfs_trans.c
+++ b/fs/xfs/xfs_trans.c
@@ -216,10 +216,18 @@ __xfs_trans_alloc(
 	struct xfs_trans	*tp;
 
 	ASSERT(!(flags & XFS_TRANS_RES_FDBLKS) || xfs_has_lazysbcount(mp));
+	ASSERT(!((flags & XFS_TRANS_NO_WRITECOUNT) &&
+		 (flags & XFS_TRANS_WRITECOUNT_TRYLOCK)));
 
 	tp = kmem_cache_zalloc(xfs_trans_cache, GFP_KERNEL | __GFP_NOFAIL);
-	if (!(flags & XFS_TRANS_NO_WRITECOUNT))
+	if (flags & XFS_TRANS_WRITECOUNT_TRYLOCK) {
+		if (!sb_start_intwrite_trylock(mp->m_super)) {
+			kmem_cache_free(xfs_trans_cache, tp);
+			return ERR_PTR(-EAGAIN);
+		}
+	} else if (!(flags & XFS_TRANS_NO_WRITECOUNT)) {
 		sb_start_intwrite(mp->m_super);
+	}
 	xfs_trans_set_context(tp);
 	tp->t_flags = flags;
 	tp->t_mountp = mp;
@@ -252,6 +260,8 @@ xfs_trans_alloc(
 	 */
 retry:
 	tp = __xfs_trans_alloc(mp, flags);
+	if (IS_ERR(tp))
+		return PTR_ERR(tp);
 	WARN_ON(mp->m_super->s_writers.frozen == SB_FREEZE_COMPLETE);
 	error = xfs_trans_reserve(tp, resp, blocks, rtextents);
 	if (error == -ENOSPC && want_retry) {
-- 
2.47.3


^ permalink raw reply related

* [PATCH v3 0/2] xfs: resolve close() deadlocks on frozen filesystems
From: Aditya Srivastava @ 2026-06-12 12:04 UTC (permalink / raw)
  To: Carlos Maiolino
  Cc: Christoph Hellwig, linux-xfs, linux-kernel,
	Aditya Prakash Srivastava

From: Aditya Prakash Srivastava <aditya.ansh182@gmail.com>

Hi Carlos and Christoph,

Apologies for the mailing list noise in v2; a local stale patch file
collision caused the wrong files and version headers to be sent in the
previous posting. This v3 series contains the cleanly regenerated,
correct files with a distinct cover letter subject to prevent mail
thread merging.

This is version 3 of the patch series addressing the close() system call
hanging indefinitely on frozen XFS filesystems (Bugzilla #205833).

Based on Christoph's feedback, I have made the following improvements:
- Split the changes into a clean, 2-patch series separating the new
  trylock flag implementation from the deadlock fix itself.
- Extracted the case history, reproducer code, and discussion from the
  commit logs into this cover letter to keep commit logs surgically
  focused.
- Adjusted the parameter name in xfs_free_eofblocks() to trans_flags to
  make its usage stand out.
- Implemented a much cleaner, race-free state preservation logic in
  xfs_file_release() by omitting pre-setting XFS_EOFBLOCKS_RELEASED
  on the inode, setting it only upon successful truncation.
- Simplified the transaction allocation block, renamed the flag to
  XFS_TRANS_WRITECOUNT_TRYLOCK, and added the requested mutual-exclusivity
  assertion in __xfs_trans_alloc().
- Set up this series to be posted standalone (not threaded as a reply)
  per the standard kernel guidelines.

THE REAL-WORLD IMPACT (BUGZILLA & DOWNSTREAM CASES)
==================================================

When speculative post-EOF blocks are closed on XFS, the release path
synchronously attempts to free them via xfs_free_eofblocks(). This
allocates a write transaction (xfs_trans_alloc) which blocks
indefinitely on the superblock freeze write lock (sb_start_intwrite)
under fsfreeze.

This behavior has a long history of causing severe system disruption:
- Downstream Red Hat Bugzilla 1474726 (dating back to 2017) details
  complete system hangs during system backups when rsync and fsfreeze
  are used. Even seemingly harmless read-only commands like
  'cat /var/log/messages' would hang on close() in __sb_start_write
  via xfs_free_eofblocks, requiring a hard reboot.
- Downstream LeApp integration test scenarios consistently hit this hang.

Hanging on close() frequently triggers container healthcheck failures,
systemd service timeouts, and cluster failover cascades, which is
disruptive to user-space applications that view close() as resource
reclamation. No other major Linux filesystem (ext4, btrfs, etc.)
synchronously allocates write transactions during close() system calls.

THE SOLUTION: NON-BLOCKING SUPERBLOCK TRYLOCK
=============================================

Instead of performing racy pre-checks, this series introduces
XFS_TRANS_WRITECOUNT_TRYLOCK. When specified, __xfs_trans_alloc()
attempts to obtain freeze protection using sb_start_intwrite_trylock().
If that fails, it aborts allocation gracefully and returns -EAGAIN.

We then pass XFS_TRANS_WRITECOUNT_TRYLOCK during xfs_file_release(). If
the truncation fails due to a frozen filesystem (-EAGAIN), we cleanly
bypass setting XFS_EOFBLOCKS_RELEASED on the inode, ensuring subsequent
releases or the background blockgc garbage collector can successfully
clean them up once thawed.

REPRODUCER DETAILS (GPLV2 LICENSED)
===================================

As requested, I have added a GPLv2-compatible license to the C
reproducer provided below, and I will be working on wiring up this
reproducer into the official xfstests suite in the near future.

Compile with -pthread:

    /*
     * GPLv2-compatible XFS freeze close() hang reproducer.
     * Copyright (c) 2026 Aditya Prakash Srivastava. All Rights Reserved.
     */
    #define _GNU_SOURCE
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>
    #include <fcntl.h>
    #include <pthread.h>
    #include <sys/ioctl.h>
    #include <sys/vfs.h>
    #include <linux/fs.h>
    #include <libgen.h>

    volatile int close_started = 0;
    volatile int close_completed = 0;

    void *close_thread(void *arg) {
        int fd = *(int *)arg;
        close_started = 1;
        close(fd);
        close_completed = 1;
        return NULL;
    }

    int main(int argc, char *argv[]) {
        struct statfs sfs;
        if (statfs(argv[1], &sfs) < 0) {
            char *dir_buf = strdup(argv[1]);
            char *parent_dir = dirname(dir_buf);
            if (statfs(parent_dir, &sfs) < 0) {
                perror("statfs");
                free(dir_buf);
                return 1;
            }
            free(dir_buf);
        }
        if (sfs.f_type != 0x58465342) return 1;

        int freeze_fd = open(dirname(strdup(argv[1])), O_RDONLY);
        int write_fd = open(argv[1], O_WRONLY | O_CREAT | O_TRUNC, 0644);
        char buf[65536] = {0};
        for (int i = 0; i < 320; i++) write(write_fd, buf, sizeof(buf));

        ioctl(freeze_fd, FIFREEZE, 0);
        pthread_t thread;
        pthread_create(&thread, NULL, close_thread, &write_fd);
        while (!close_started) usleep(1000);
        usleep(1000000); // Wait 1s
        if (!close_completed) printf("SUCCESS: close() hung!\n");
        ioctl(freeze_fd, FITHAW, 0);
        pthread_join(thread, NULL);
        unlink(argv[1]);
        return 0;
    }

Aditya Prakash Srivastava (2):
  xfs: add a XFS_TRANS_WRITECOUNT_TRYLOCK flag
  xfs: prevent close() from hanging on frozen filesystems

 fs/xfs/libxfs/xfs_shared.h |  3 +++
 fs/xfs/xfs_bmap_util.c     |  9 +++++----
 fs/xfs/xfs_bmap_util.h     |  2 +-
 fs/xfs/xfs_file.c          |  8 +++++---
 fs/xfs/xfs_icache.c        |  2 +-
 fs/xfs/xfs_inode.c         |  2 +-
 fs/xfs/xfs_trans.c         | 12 +++++++++++-
 7 files changed, 27 insertions(+), 11 deletions(-)

-- 
2.47.3

^ permalink raw reply

* [PATCH v2 2/2] xfs: prevent close() from hanging on frozen filesystems
From: Aditya Srivastava @ 2026-06-12 11:49 UTC (permalink / raw)
  To: Carlos Maiolino
  Cc: Christoph Hellwig, linux-xfs, linux-kernel,
	Aditya Prakash Srivastava
In-Reply-To: <20260612114917.2192-1-aditya.ansh182@gmail.com>

From: Aditya Prakash Srivastava <aditya.ansh182@gmail.com>

When a file with active speculative post-EOF preallocations is closed,
xfs_file_release() synchronously triggers xfs_free_eofblocks() to clean
them up. This requires allocating a write transaction (xfs_trans_alloc),
which blocks indefinitely if the filesystem is currently frozen or in the
process of freezing, as it waits to acquire the superblock's write lock.

As a result, a close() system call on a read-write file descriptor can
hang indefinitely in percpu_rwsem_wait() until the filesystem is thawed,
even if the file is closed by a non-writer process or after all writing
activity has already ceased.

To fix this properly and avoid any potential race conditions where a freeze
might come in immediately after a writable check, pass the new
XFS_TRANS_WRITECOUNT_TRYLOCK flag to xfs_trans_alloc() when freeing
speculative preallocations in xfs_file_release().

If xfs_free_eofblocks() returns -EAGAIN on a trylock failure, we cleanly
bypass setting XFS_EOFBLOCKS_RELEASED on the inode, ensuring subsequent
releases or the background blockgc garbage collector can successfully retry
the cleanup once the filesystem thaws.

Also, rename the flags parameter in xfs_free_eofblocks() to trans_flags as
suggested to make its usage stand out, and update existing callers to
pass 0 to preserve standard blocking paths.

Suggested-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Aditya Prakash Srivastava <aditya.ansh182@gmail.com>
---
 fs/xfs/xfs_bmap_util.c | 9 +++++----
 fs/xfs/xfs_bmap_util.h | 2 +-
 fs/xfs/xfs_file.c      | 8 +++++---
 fs/xfs/xfs_icache.c    | 2 +-
 fs/xfs/xfs_inode.c     | 2 +-
 5 files changed, 13 insertions(+), 10 deletions(-)

diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
index 0ab00615f1ad..faf0630717dc 100644
--- a/fs/xfs/xfs_bmap_util.c
+++ b/fs/xfs/xfs_bmap_util.c
@@ -574,7 +574,8 @@ xfs_can_free_eofblocks(
  */
 int
 xfs_free_eofblocks(
-	struct xfs_inode	*ip)
+	struct xfs_inode	*ip,
+	uint			trans_flags)
 {
 	struct xfs_trans	*tp;
 	struct xfs_mount	*mp = ip->i_mount;
@@ -604,9 +605,9 @@ xfs_free_eofblocks(
 		return 0;
 	}
 
-	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_itruncate, 0, 0, 0, &tp);
+	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_itruncate, 0, 0, trans_flags, &tp);
 	if (error) {
-		ASSERT(xfs_is_shutdown(mp));
+		ASSERT(error == -EAGAIN || xfs_is_shutdown(mp));
 		return error;
 	}
 
@@ -928,7 +929,7 @@ xfs_prepare_shift(
 	 * into the accessible region of the file.
 	 */
 	if (xfs_can_free_eofblocks(ip)) {
-		error = xfs_free_eofblocks(ip);
+		error = xfs_free_eofblocks(ip, 0);
 		if (error)
 			return error;
 	}
diff --git a/fs/xfs/xfs_bmap_util.h b/fs/xfs/xfs_bmap_util.h
index c477b3361630..c13774aa0892 100644
--- a/fs/xfs/xfs_bmap_util.h
+++ b/fs/xfs/xfs_bmap_util.h
@@ -66,7 +66,7 @@ int	xfs_insert_file_space(struct xfs_inode *, xfs_off_t offset,
 
 /* EOF block manipulation functions */
 bool	xfs_can_free_eofblocks(struct xfs_inode *ip);
-int	xfs_free_eofblocks(struct xfs_inode *ip);
+int	xfs_free_eofblocks(struct xfs_inode *ip, uint trans_flags);
 
 int	xfs_swap_extents(struct xfs_inode *ip, struct xfs_inode *tip,
 			 struct xfs_swapext *sx);
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 845a97c9b063..76c9b2fe7c51 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -1806,9 +1806,11 @@ xfs_file_release(
 	 */
 	if (!xfs_iflags_test(ip, XFS_EOFBLOCKS_RELEASED) &&
 	    xfs_ilock_nowait(ip, XFS_IOLOCK_EXCL)) {
-		if (xfs_can_free_eofblocks(ip) &&
-		    !xfs_iflags_test_and_set(ip, XFS_EOFBLOCKS_RELEASED))
-			xfs_free_eofblocks(ip);
+		if (!xfs_iflags_test(ip, XFS_EOFBLOCKS_RELEASED) &&
+		    xfs_can_free_eofblocks(ip) &&
+		    !xfs_free_eofblocks(ip, XFS_TRANS_WRITECOUNT_TRYLOCK))
+			xfs_iflags_set(ip, XFS_EOFBLOCKS_RELEASED);
+
 		xfs_iunlock(ip, XFS_IOLOCK_EXCL);
 	}
 
diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
index 2040a9292ee6..c575b4acb24c 100644
--- a/fs/xfs/xfs_icache.c
+++ b/fs/xfs/xfs_icache.c
@@ -1259,7 +1259,7 @@ xfs_inode_free_eofblocks(
 	*lockflags |= XFS_IOLOCK_EXCL;
 
 	if (xfs_can_free_eofblocks(ip))
-		return xfs_free_eofblocks(ip);
+		return xfs_free_eofblocks(ip, 0);
 
 	/* inode could be preallocated */
 	trace_xfs_inode_free_eofblocks_invalid(ip);
diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index 9978ac1422fc..1c75f51af4d7 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -1423,7 +1423,7 @@ xfs_inactive(
 		 * reference to the inode at this point anyways.
 		 */
 		if (xfs_can_free_eofblocks(ip))
-			error = xfs_free_eofblocks(ip);
+			error = xfs_free_eofblocks(ip, 0);
 
 		goto out;
 	}
-- 
2.47.3


^ permalink raw reply related

* [PATCH v2 1/2] xfs: add a XFS_TRANS_WRITECOUNT_TRYLOCK flag
From: Aditya Srivastava @ 2026-06-12 11:49 UTC (permalink / raw)
  To: Carlos Maiolino
  Cc: Christoph Hellwig, linux-xfs, linux-kernel,
	Aditya Prakash Srivastava
In-Reply-To: <20260612114917.2192-1-aditya.ansh182@gmail.com>

From: Aditya Prakash Srivastava <aditya.ansh182@gmail.com>

Introduce a new transaction allocation flag, XFS_TRANS_WRITECOUNT_TRYLOCK.
When this flag is specified, __xfs_trans_alloc() attempts to obtain
freeze protection using sb_start_intwrite_trylock() instead of blocking
indefinitely on sb_start_intwrite().

If the trylock fails, the allocation is aborted gracefully: the freshly
allocated transaction handle is freed, and the function returns the
appropriate error pointer ERR_PTR(-EAGAIN), which is then propagated
to the caller by xfs_trans_alloc().

Also add an assertion in __xfs_trans_alloc() to ensure that both
XFS_TRANS_NO_WRITECOUNT and XFS_TRANS_WRITECOUNT_TRYLOCK are never
specified at the same time, as they are mutually exclusive.

Suggested-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Aditya Prakash Srivastava <aditya.ansh182@gmail.com>
---
 fs/xfs/libxfs/xfs_shared.h |  3 +++
 fs/xfs/xfs_trans.c         | 12 +++++++++++-
 2 files changed, 14 insertions(+), 1 deletion(-)

diff --git a/fs/xfs/libxfs/xfs_shared.h b/fs/xfs/libxfs/xfs_shared.h
index b1e0d9bc1f7d..68d22b6cddd3 100644
--- a/fs/xfs/libxfs/xfs_shared.h
+++ b/fs/xfs/libxfs/xfs_shared.h
@@ -164,6 +164,9 @@ void	xfs_log_get_max_trans_res(struct xfs_mount *mp,
 /* Transaction has locked the rtbitmap and rtsum inodes */
 #define XFS_TRANS_RTBITMAP_LOCKED	(1u << 9)
 
+/* Try lock filesystem superblock for freeze protection */
+#define XFS_TRANS_WRITECOUNT_TRYLOCK	(1u << 10)
+
 /*
  * Field values for xfs_trans_mod_sb.
  */
diff --git a/fs/xfs/xfs_trans.c b/fs/xfs/xfs_trans.c
index 148cc32449c1..3860e44d6439 100644
--- a/fs/xfs/xfs_trans.c
+++ b/fs/xfs/xfs_trans.c
@@ -216,10 +216,18 @@ __xfs_trans_alloc(
 	struct xfs_trans	*tp;
 
 	ASSERT(!(flags & XFS_TRANS_RES_FDBLKS) || xfs_has_lazysbcount(mp));
+	ASSERT(!((flags & XFS_TRANS_NO_WRITECOUNT) &&
+		 (flags & XFS_TRANS_WRITECOUNT_TRYLOCK)));
 
 	tp = kmem_cache_zalloc(xfs_trans_cache, GFP_KERNEL | __GFP_NOFAIL);
-	if (!(flags & XFS_TRANS_NO_WRITECOUNT))
+	if (flags & XFS_TRANS_WRITECOUNT_TRYLOCK) {
+		if (!sb_start_intwrite_trylock(mp->m_super)) {
+			kmem_cache_free(xfs_trans_cache, tp);
+			return ERR_PTR(-EAGAIN);
+		}
+	} else if (!(flags & XFS_TRANS_NO_WRITECOUNT)) {
 		sb_start_intwrite(mp->m_super);
+	}
 	xfs_trans_set_context(tp);
 	tp->t_flags = flags;
 	tp->t_mountp = mp;
@@ -252,6 +260,8 @@ xfs_trans_alloc(
 	 */
 retry:
 	tp = __xfs_trans_alloc(mp, flags);
+	if (IS_ERR(tp))
+		return PTR_ERR(tp);
 	WARN_ON(mp->m_super->s_writers.frozen == SB_FREEZE_COMPLETE);
 	error = xfs_trans_reserve(tp, resp, blocks, rtextents);
 	if (error == -ENOSPC && want_retry) {
-- 
2.47.3


^ permalink raw reply related

* [PATCH v2 0/2] xfs: prevent close() from hanging on frozen filesystems
From: Aditya Srivastava @ 2026-06-12 11:49 UTC (permalink / raw)
  To: Carlos Maiolino
  Cc: Christoph Hellwig, linux-xfs, linux-kernel,
	Aditya Prakash Srivastava

From: Aditya Prakash Srivastava <aditya.ansh182@gmail.com>

Hi Carlos and Christoph,

This is version 2 of the patch series addressing the close() system call
hanging indefinitely on frozen XFS filesystems (Bugzilla #205833).

Based on Christoph's feedback, I have made the following improvements:
- Split the changes into a clean, 2-patch series separating the new
  trylock flag implementation from the deadlock fix itself.
- Extracted the case history, reproducer code, and discussion from the
  commit logs into this cover letter to keep commit logs surgically
  focused.
- Adjusted the parameter name in xfs_free_eofblocks() to trans_flags to
  make its usage stand out.
- Implemented a much cleaner, race-free state preservation logic in
  xfs_file_release() by omitting pre-setting XFS_EOFBLOCKS_RELEASED
  on the inode, setting it only upon successful truncation.
- Simplified the transaction allocation block, renamed the flag to
  XFS_TRANS_WRITECOUNT_TRYLOCK, and added the requested mutual-exclusivity
  assertion in __xfs_trans_alloc().
- Set up this series to be posted standalone (not threaded as a reply)
  per the standard kernel guidelines.

THE REAL-WORLD IMPACT (BUGZILLA & DOWNSTREAM CASES)
==================================================

When speculative post-EOF blocks are closed on XFS, the release path
synchronously attempts to free them via xfs_free_eofblocks(). This
allocates a write transaction (xfs_trans_alloc) which blocks
indefinitely on the superblock freeze write lock (sb_start_intwrite)
under fsfreeze.

This behavior has a long history of causing severe system disruption:
- Downstream Red Hat Bugzilla 1474726 (dating back to 2017) details
  complete system hangs during system backups when rsync and fsfreeze
  are used. Even seemingly harmless read-only commands like
  'cat /var/log/messages' would hang on close() in __sb_start_write
  via xfs_free_eofblocks, requiring a hard reboot.
- Downstream LeApp integration test scenarios consistently hit this hang.

Hanging on close() frequently triggers container healthcheck failures,
systemd service timeouts, and cluster failover cascades, which is
disruptive to user-space applications that view close() as resource
reclamation. No other major Linux filesystem (ext4, btrfs, etc.)
synchronously allocates write transactions during close() system calls.

THE SOLUTION: NON-BLOCKING SUPERBLOCK TRYLOCK
=============================================

Instead of performing racy pre-checks, this series introduces
XFS_TRANS_WRITECOUNT_TRYLOCK. When specified, __xfs_trans_alloc()
attempts to obtain freeze protection using sb_start_intwrite_trylock().
If that fails, it aborts allocation gracefully and returns -EAGAIN.

We then pass XFS_TRANS_WRITECOUNT_TRYLOCK during xfs_file_release(). If
the truncation fails due to a frozen filesystem (-EAGAIN), we cleanly
bypass setting XFS_EOFBLOCKS_RELEASED on the inode, ensuring subsequent
releases or the background blockgc garbage collector can successfully
clean them up once thawed.

REPRODUCER DETAILS (GPLV2 LICENSED)
===================================

As requested, I have added a GPLv2-compatible license to the C
reproducer provided below, and I will be working on wiring up this
reproducer into the official xfstests suite in the near future.

Compile with -pthread:

    /*
     * GPLv2-compatible XFS freeze close() hang reproducer.
     * Copyright (c) 2026 Aditya Prakash Srivastava. All Rights Reserved.
     */
    #define _GNU_SOURCE
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>
    #include <fcntl.h>
    #include <pthread.h>
    #include <sys/ioctl.h>
    #include <sys/vfs.h>
    #include <linux/fs.h>
    #include <libgen.h>

    volatile int close_started = 0;
    volatile int close_completed = 0;

    void *close_thread(void *arg) {
        int fd = *(int *)arg;
        close_started = 1;
        close(fd);
        close_completed = 1;
        return NULL;
    }

    int main(int argc, char *argv[]) {
        struct statfs sfs;
        if (statfs(argv[1], &sfs) < 0) {
            char *dir_buf = strdup(argv[1]);
            char *parent_dir = dirname(dir_buf);
            if (statfs(parent_dir, &sfs) < 0) {
                perror("statfs");
                free(dir_buf);
                return 1;
            }
            free(dir_buf);
        }
        if (sfs.f_type != 0x58465342) return 1;

        int freeze_fd = open(dirname(strdup(argv[1])), O_RDONLY);
        int write_fd = open(argv[1], O_WRONLY | O_CREAT | O_TRUNC, 0644);
        char buf[65536] = {0};
        for (int i = 0; i < 320; i++) write(write_fd, buf, sizeof(buf));

        ioctl(freeze_fd, FIFREEZE, 0);
        pthread_t thread;
        pthread_create(&thread, NULL, close_thread, &write_fd);
        while (!close_started) usleep(1000);
        usleep(1000000); // Wait 1s
        if (!close_completed) printf("SUCCESS: close() hung!\n");
        ioctl(freeze_fd, FITHAW, 0);
        pthread_join(thread, NULL);
        unlink(argv[1]);
        return 0;
    }

Aditya Prakash Srivastava (2):
  xfs: add a XFS_TRANS_WRITECOUNT_TRYLOCK flag
  xfs: prevent close() from hanging on frozen filesystems

 fs/xfs/libxfs/xfs_shared.h |  3 +++
 fs/xfs/xfs_bmap_util.c     |  9 +++++----
 fs/xfs/xfs_bmap_util.h     |  2 +-
 fs/xfs/xfs_file.c          |  8 +++++---
 fs/xfs/xfs_icache.c        |  2 +-
 fs/xfs/xfs_inode.c         |  2 +-
 fs/xfs/xfs_trans.c         | 12 +++++++++++-
 7 files changed, 27 insertions(+), 11 deletions(-)

-- 
2.47.3


^ permalink raw reply

* Re: [PATCH 1/2] xfs: add a XFS_TRANS_WRITECOUNT_TRYLOCK flag
From: Aditya Prakash Srivastava @ 2026-06-12 11:37 UTC (permalink / raw)
  To: Carlos Maiolino; +Cc: Christoph Hellwig, linux-xfs, linux-kernel
In-Reply-To: <20260612112252.1697-2-aditya.ansh182@gmail.com>

The sequencing is wrong, Please ignore

Will send a fresh series

On Fri, Jun 12, 2026 at 4:53 PM Aditya Srivastava
<aditya.ansh182@gmail.com> wrote:
>
> From: Aditya Prakash Srivastava <aditya.ansh182@gmail.com>
>
> Introduce a new transaction allocation flag, XFS_TRANS_WRITECOUNT_TRYLOCK.
> When this flag is specified, __xfs_trans_alloc() attempts to obtain
> freeze protection using sb_start_intwrite_trylock() instead of blocking
> indefinitely on sb_start_intwrite().
>
> If the trylock fails, the allocation is aborted gracefully: the freshly
> allocated transaction handle is freed, and the function returns the
> appropriate error pointer ERR_PTR(-EAGAIN), which is then propagated
> to the caller by xfs_trans_alloc().
>
> Also add an assertion in __xfs_trans_alloc() to ensure that both
> XFS_TRANS_NO_WRITECOUNT and XFS_TRANS_WRITECOUNT_TRYLOCK are never
> specified at the same time, as they are mutually exclusive.
>
> Suggested-by: Christoph Hellwig <hch@infradead.org>
> Signed-off-by: Aditya Prakash Srivastava <aditya.ansh182@gmail.com>
> ---
>  fs/xfs/libxfs/xfs_shared.h |  3 +++
>  fs/xfs/xfs_trans.c         | 12 +++++++++++-
>  2 files changed, 14 insertions(+), 1 deletion(-)
>
> diff --git a/fs/xfs/libxfs/xfs_shared.h b/fs/xfs/libxfs/xfs_shared.h
> index b1e0d9bc1f7d..68d22b6cddd3 100644
> --- a/fs/xfs/libxfs/xfs_shared.h
> +++ b/fs/xfs/libxfs/xfs_shared.h
> @@ -164,6 +164,9 @@ void        xfs_log_get_max_trans_res(struct xfs_mount *mp,
>  /* Transaction has locked the rtbitmap and rtsum inodes */
>  #define XFS_TRANS_RTBITMAP_LOCKED      (1u << 9)
>
> +/* Try lock filesystem superblock for freeze protection */
> +#define XFS_TRANS_WRITECOUNT_TRYLOCK   (1u << 10)
> +
>  /*
>   * Field values for xfs_trans_mod_sb.
>   */
> diff --git a/fs/xfs/xfs_trans.c b/fs/xfs/xfs_trans.c
> index 148cc32449c1..3860e44d6439 100644
> --- a/fs/xfs/xfs_trans.c
> +++ b/fs/xfs/xfs_trans.c
> @@ -216,10 +216,18 @@ __xfs_trans_alloc(
>         struct xfs_trans        *tp;
>
>         ASSERT(!(flags & XFS_TRANS_RES_FDBLKS) || xfs_has_lazysbcount(mp));
> +       ASSERT(!((flags & XFS_TRANS_NO_WRITECOUNT) &&
> +                (flags & XFS_TRANS_WRITECOUNT_TRYLOCK)));
>
>         tp = kmem_cache_zalloc(xfs_trans_cache, GFP_KERNEL | __GFP_NOFAIL);
> -       if (!(flags & XFS_TRANS_NO_WRITECOUNT))
> +       if (flags & XFS_TRANS_WRITECOUNT_TRYLOCK) {
> +               if (!sb_start_intwrite_trylock(mp->m_super)) {
> +                       kmem_cache_free(xfs_trans_cache, tp);
> +                       return ERR_PTR(-EAGAIN);
> +               }
> +       } else if (!(flags & XFS_TRANS_NO_WRITECOUNT)) {
>                 sb_start_intwrite(mp->m_super);
> +       }
>         xfs_trans_set_context(tp);
>         tp->t_flags = flags;
>         tp->t_mountp = mp;
> @@ -252,6 +260,8 @@ xfs_trans_alloc(
>          */
>  retry:
>         tp = __xfs_trans_alloc(mp, flags);
> +       if (IS_ERR(tp))
> +               return PTR_ERR(tp);
>         WARN_ON(mp->m_super->s_writers.frozen == SB_FREEZE_COMPLETE);
>         error = xfs_trans_reserve(tp, resp, blocks, rtextents);
>         if (error == -ENOSPC && want_retry) {
> --
> 2.47.3
>

^ permalink raw reply

* Re: [PATCH 2/2] xfs: prevent close() from hanging on frozen filesystems
From: Aditya Prakash Srivastava @ 2026-06-12 11:37 UTC (permalink / raw)
  To: Carlos Maiolino; +Cc: Christoph Hellwig, linux-xfs, linux-kernel
In-Reply-To: <20260612112252.1697-3-aditya.ansh182@gmail.com>

The sequencing is wrong, Please ignore

Will send a fresh series

On Fri, Jun 12, 2026 at 4:53 PM Aditya Srivastava
<aditya.ansh182@gmail.com> wrote:
>
> From: Aditya Prakash Srivastava <aditya.ansh182@gmail.com>
>
> When a file with active speculative post-EOF preallocations is closed,
> xfs_file_release() synchronously triggers xfs_free_eofblocks() to clean
> them up. This requires allocating a write transaction (xfs_trans_alloc),
> which blocks indefinitely if the filesystem is currently frozen or in the
> process of freezing, as it waits to acquire the superblock's write lock.
>
> As a result, a close() system call on a read-write file descriptor can
> hang indefinitely in percpu_rwsem_wait() until the filesystem is thawed,
> even if the file is closed by a non-writer process or after all writing
> activity has already ceased.
>
> To fix this properly and avoid any potential race conditions where a freeze
> might come in immediately after a writable check, pass the new
> XFS_TRANS_WRITECOUNT_TRYLOCK flag to xfs_trans_alloc() when freeing
> speculative preallocations in xfs_file_release().
>
> If xfs_free_eofblocks() returns -EAGAIN on a trylock failure, we cleanly
> bypass setting XFS_EOFBLOCKS_RELEASED on the inode, ensuring subsequent
> releases or the background blockgc garbage collector can successfully retry
> the cleanup once the filesystem thaws.
>
> Also, rename the flags parameter in xfs_free_eofblocks() to trans_flags as
> suggested to make its usage stand out, and update existing callers to
> pass 0 to preserve standard blocking paths.
>
> Suggested-by: Christoph Hellwig <hch@infradead.org>
> Signed-off-by: Aditya Prakash Srivastava <aditya.ansh182@gmail.com>
> ---
>  fs/xfs/xfs_bmap_util.c | 9 +++++----
>  fs/xfs/xfs_bmap_util.h | 2 +-
>  fs/xfs/xfs_file.c      | 8 +++++---
>  fs/xfs/xfs_icache.c    | 2 +-
>  fs/xfs/xfs_inode.c     | 2 +-
>  5 files changed, 13 insertions(+), 10 deletions(-)
>
> diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
> index 0ab00615f1ad..faf0630717dc 100644
> --- a/fs/xfs/xfs_bmap_util.c
> +++ b/fs/xfs/xfs_bmap_util.c
> @@ -574,7 +574,8 @@ xfs_can_free_eofblocks(
>   */
>  int
>  xfs_free_eofblocks(
> -       struct xfs_inode        *ip)
> +       struct xfs_inode        *ip,
> +       uint                    trans_flags)
>  {
>         struct xfs_trans        *tp;
>         struct xfs_mount        *mp = ip->i_mount;
> @@ -604,9 +605,9 @@ xfs_free_eofblocks(
>                 return 0;
>         }
>
> -       error = xfs_trans_alloc(mp, &M_RES(mp)->tr_itruncate, 0, 0, 0, &tp);
> +       error = xfs_trans_alloc(mp, &M_RES(mp)->tr_itruncate, 0, 0, trans_flags, &tp);
>         if (error) {
> -               ASSERT(xfs_is_shutdown(mp));
> +               ASSERT(error == -EAGAIN || xfs_is_shutdown(mp));
>                 return error;
>         }
>
> @@ -928,7 +929,7 @@ xfs_prepare_shift(
>          * into the accessible region of the file.
>          */
>         if (xfs_can_free_eofblocks(ip)) {
> -               error = xfs_free_eofblocks(ip);
> +               error = xfs_free_eofblocks(ip, 0);
>                 if (error)
>                         return error;
>         }
> diff --git a/fs/xfs/xfs_bmap_util.h b/fs/xfs/xfs_bmap_util.h
> index c477b3361630..c13774aa0892 100644
> --- a/fs/xfs/xfs_bmap_util.h
> +++ b/fs/xfs/xfs_bmap_util.h
> @@ -66,7 +66,7 @@ int   xfs_insert_file_space(struct xfs_inode *, xfs_off_t offset,
>
>  /* EOF block manipulation functions */
>  bool   xfs_can_free_eofblocks(struct xfs_inode *ip);
> -int    xfs_free_eofblocks(struct xfs_inode *ip);
> +int    xfs_free_eofblocks(struct xfs_inode *ip, uint trans_flags);
>
>  int    xfs_swap_extents(struct xfs_inode *ip, struct xfs_inode *tip,
>                          struct xfs_swapext *sx);
> diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> index 845a97c9b063..76c9b2fe7c51 100644
> --- a/fs/xfs/xfs_file.c
> +++ b/fs/xfs/xfs_file.c
> @@ -1806,9 +1806,11 @@ xfs_file_release(
>          */
>         if (!xfs_iflags_test(ip, XFS_EOFBLOCKS_RELEASED) &&
>             xfs_ilock_nowait(ip, XFS_IOLOCK_EXCL)) {
> -               if (xfs_can_free_eofblocks(ip) &&
> -                   !xfs_iflags_test_and_set(ip, XFS_EOFBLOCKS_RELEASED))
> -                       xfs_free_eofblocks(ip);
> +               if (!xfs_iflags_test(ip, XFS_EOFBLOCKS_RELEASED) &&
> +                   xfs_can_free_eofblocks(ip) &&
> +                   !xfs_free_eofblocks(ip, XFS_TRANS_WRITECOUNT_TRYLOCK))
> +                       xfs_iflags_set(ip, XFS_EOFBLOCKS_RELEASED);
> +
>                 xfs_iunlock(ip, XFS_IOLOCK_EXCL);
>         }
>
> diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
> index 2040a9292ee6..c575b4acb24c 100644
> --- a/fs/xfs/xfs_icache.c
> +++ b/fs/xfs/xfs_icache.c
> @@ -1259,7 +1259,7 @@ xfs_inode_free_eofblocks(
>         *lockflags |= XFS_IOLOCK_EXCL;
>
>         if (xfs_can_free_eofblocks(ip))
> -               return xfs_free_eofblocks(ip);
> +               return xfs_free_eofblocks(ip, 0);
>
>         /* inode could be preallocated */
>         trace_xfs_inode_free_eofblocks_invalid(ip);
> diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
> index 9978ac1422fc..1c75f51af4d7 100644
> --- a/fs/xfs/xfs_inode.c
> +++ b/fs/xfs/xfs_inode.c
> @@ -1423,7 +1423,7 @@ xfs_inactive(
>                  * reference to the inode at this point anyways.
>                  */
>                 if (xfs_can_free_eofblocks(ip))
> -                       error = xfs_free_eofblocks(ip);
> +                       error = xfs_free_eofblocks(ip, 0);
>
>                 goto out;
>         }
> --
> 2.47.3
>

^ permalink raw reply

* [PATCH 2/2] xfs: prevent close() from hanging on frozen filesystems
From: Aditya Srivastava @ 2026-06-12 11:22 UTC (permalink / raw)
  To: Carlos Maiolino
  Cc: Christoph Hellwig, linux-xfs, linux-kernel,
	Aditya Prakash Srivastava
In-Reply-To: <20260612112252.1697-1-aditya.ansh182@gmail.com>

From: Aditya Prakash Srivastava <aditya.ansh182@gmail.com>

When a file with active speculative post-EOF preallocations is closed,
xfs_file_release() synchronously triggers xfs_free_eofblocks() to clean
them up. This requires allocating a write transaction (xfs_trans_alloc),
which blocks indefinitely if the filesystem is currently frozen or in the
process of freezing, as it waits to acquire the superblock's write lock.

As a result, a close() system call on a read-write file descriptor can
hang indefinitely in percpu_rwsem_wait() until the filesystem is thawed,
even if the file is closed by a non-writer process or after all writing
activity has already ceased.

To fix this properly and avoid any potential race conditions where a freeze
might come in immediately after a writable check, pass the new
XFS_TRANS_WRITECOUNT_TRYLOCK flag to xfs_trans_alloc() when freeing
speculative preallocations in xfs_file_release().

If xfs_free_eofblocks() returns -EAGAIN on a trylock failure, we cleanly
bypass setting XFS_EOFBLOCKS_RELEASED on the inode, ensuring subsequent
releases or the background blockgc garbage collector can successfully retry
the cleanup once the filesystem thaws.

Also, rename the flags parameter in xfs_free_eofblocks() to trans_flags as
suggested to make its usage stand out, and update existing callers to
pass 0 to preserve standard blocking paths.

Suggested-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Aditya Prakash Srivastava <aditya.ansh182@gmail.com>
---
 fs/xfs/xfs_bmap_util.c | 9 +++++----
 fs/xfs/xfs_bmap_util.h | 2 +-
 fs/xfs/xfs_file.c      | 8 +++++---
 fs/xfs/xfs_icache.c    | 2 +-
 fs/xfs/xfs_inode.c     | 2 +-
 5 files changed, 13 insertions(+), 10 deletions(-)

diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
index 0ab00615f1ad..faf0630717dc 100644
--- a/fs/xfs/xfs_bmap_util.c
+++ b/fs/xfs/xfs_bmap_util.c
@@ -574,7 +574,8 @@ xfs_can_free_eofblocks(
  */
 int
 xfs_free_eofblocks(
-	struct xfs_inode	*ip)
+	struct xfs_inode	*ip,
+	uint			trans_flags)
 {
 	struct xfs_trans	*tp;
 	struct xfs_mount	*mp = ip->i_mount;
@@ -604,9 +605,9 @@ xfs_free_eofblocks(
 		return 0;
 	}
 
-	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_itruncate, 0, 0, 0, &tp);
+	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_itruncate, 0, 0, trans_flags, &tp);
 	if (error) {
-		ASSERT(xfs_is_shutdown(mp));
+		ASSERT(error == -EAGAIN || xfs_is_shutdown(mp));
 		return error;
 	}
 
@@ -928,7 +929,7 @@ xfs_prepare_shift(
 	 * into the accessible region of the file.
 	 */
 	if (xfs_can_free_eofblocks(ip)) {
-		error = xfs_free_eofblocks(ip);
+		error = xfs_free_eofblocks(ip, 0);
 		if (error)
 			return error;
 	}
diff --git a/fs/xfs/xfs_bmap_util.h b/fs/xfs/xfs_bmap_util.h
index c477b3361630..c13774aa0892 100644
--- a/fs/xfs/xfs_bmap_util.h
+++ b/fs/xfs/xfs_bmap_util.h
@@ -66,7 +66,7 @@ int	xfs_insert_file_space(struct xfs_inode *, xfs_off_t offset,
 
 /* EOF block manipulation functions */
 bool	xfs_can_free_eofblocks(struct xfs_inode *ip);
-int	xfs_free_eofblocks(struct xfs_inode *ip);
+int	xfs_free_eofblocks(struct xfs_inode *ip, uint trans_flags);
 
 int	xfs_swap_extents(struct xfs_inode *ip, struct xfs_inode *tip,
 			 struct xfs_swapext *sx);
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 845a97c9b063..76c9b2fe7c51 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -1806,9 +1806,11 @@ xfs_file_release(
 	 */
 	if (!xfs_iflags_test(ip, XFS_EOFBLOCKS_RELEASED) &&
 	    xfs_ilock_nowait(ip, XFS_IOLOCK_EXCL)) {
-		if (xfs_can_free_eofblocks(ip) &&
-		    !xfs_iflags_test_and_set(ip, XFS_EOFBLOCKS_RELEASED))
-			xfs_free_eofblocks(ip);
+		if (!xfs_iflags_test(ip, XFS_EOFBLOCKS_RELEASED) &&
+		    xfs_can_free_eofblocks(ip) &&
+		    !xfs_free_eofblocks(ip, XFS_TRANS_WRITECOUNT_TRYLOCK))
+			xfs_iflags_set(ip, XFS_EOFBLOCKS_RELEASED);
+
 		xfs_iunlock(ip, XFS_IOLOCK_EXCL);
 	}
 
diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
index 2040a9292ee6..c575b4acb24c 100644
--- a/fs/xfs/xfs_icache.c
+++ b/fs/xfs/xfs_icache.c
@@ -1259,7 +1259,7 @@ xfs_inode_free_eofblocks(
 	*lockflags |= XFS_IOLOCK_EXCL;
 
 	if (xfs_can_free_eofblocks(ip))
-		return xfs_free_eofblocks(ip);
+		return xfs_free_eofblocks(ip, 0);
 
 	/* inode could be preallocated */
 	trace_xfs_inode_free_eofblocks_invalid(ip);
diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index 9978ac1422fc..1c75f51af4d7 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -1423,7 +1423,7 @@ xfs_inactive(
 		 * reference to the inode at this point anyways.
 		 */
 		if (xfs_can_free_eofblocks(ip))
-			error = xfs_free_eofblocks(ip);
+			error = xfs_free_eofblocks(ip, 0);
 
 		goto out;
 	}
-- 
2.47.3


^ permalink raw reply related

* [PATCH 1/2] xfs: add a XFS_TRANS_WRITECOUNT_TRYLOCK flag
From: Aditya Srivastava @ 2026-06-12 11:22 UTC (permalink / raw)
  To: Carlos Maiolino
  Cc: Christoph Hellwig, linux-xfs, linux-kernel,
	Aditya Prakash Srivastava
In-Reply-To: <20260612112252.1697-1-aditya.ansh182@gmail.com>

From: Aditya Prakash Srivastava <aditya.ansh182@gmail.com>

Introduce a new transaction allocation flag, XFS_TRANS_WRITECOUNT_TRYLOCK.
When this flag is specified, __xfs_trans_alloc() attempts to obtain
freeze protection using sb_start_intwrite_trylock() instead of blocking
indefinitely on sb_start_intwrite().

If the trylock fails, the allocation is aborted gracefully: the freshly
allocated transaction handle is freed, and the function returns the
appropriate error pointer ERR_PTR(-EAGAIN), which is then propagated
to the caller by xfs_trans_alloc().

Also add an assertion in __xfs_trans_alloc() to ensure that both
XFS_TRANS_NO_WRITECOUNT and XFS_TRANS_WRITECOUNT_TRYLOCK are never
specified at the same time, as they are mutually exclusive.

Suggested-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Aditya Prakash Srivastava <aditya.ansh182@gmail.com>
---
 fs/xfs/libxfs/xfs_shared.h |  3 +++
 fs/xfs/xfs_trans.c         | 12 +++++++++++-
 2 files changed, 14 insertions(+), 1 deletion(-)

diff --git a/fs/xfs/libxfs/xfs_shared.h b/fs/xfs/libxfs/xfs_shared.h
index b1e0d9bc1f7d..68d22b6cddd3 100644
--- a/fs/xfs/libxfs/xfs_shared.h
+++ b/fs/xfs/libxfs/xfs_shared.h
@@ -164,6 +164,9 @@ void	xfs_log_get_max_trans_res(struct xfs_mount *mp,
 /* Transaction has locked the rtbitmap and rtsum inodes */
 #define XFS_TRANS_RTBITMAP_LOCKED	(1u << 9)
 
+/* Try lock filesystem superblock for freeze protection */
+#define XFS_TRANS_WRITECOUNT_TRYLOCK	(1u << 10)
+
 /*
  * Field values for xfs_trans_mod_sb.
  */
diff --git a/fs/xfs/xfs_trans.c b/fs/xfs/xfs_trans.c
index 148cc32449c1..3860e44d6439 100644
--- a/fs/xfs/xfs_trans.c
+++ b/fs/xfs/xfs_trans.c
@@ -216,10 +216,18 @@ __xfs_trans_alloc(
 	struct xfs_trans	*tp;
 
 	ASSERT(!(flags & XFS_TRANS_RES_FDBLKS) || xfs_has_lazysbcount(mp));
+	ASSERT(!((flags & XFS_TRANS_NO_WRITECOUNT) &&
+		 (flags & XFS_TRANS_WRITECOUNT_TRYLOCK)));
 
 	tp = kmem_cache_zalloc(xfs_trans_cache, GFP_KERNEL | __GFP_NOFAIL);
-	if (!(flags & XFS_TRANS_NO_WRITECOUNT))
+	if (flags & XFS_TRANS_WRITECOUNT_TRYLOCK) {
+		if (!sb_start_intwrite_trylock(mp->m_super)) {
+			kmem_cache_free(xfs_trans_cache, tp);
+			return ERR_PTR(-EAGAIN);
+		}
+	} else if (!(flags & XFS_TRANS_NO_WRITECOUNT)) {
 		sb_start_intwrite(mp->m_super);
+	}
 	xfs_trans_set_context(tp);
 	tp->t_flags = flags;
 	tp->t_mountp = mp;
@@ -252,6 +260,8 @@ xfs_trans_alloc(
 	 */
 retry:
 	tp = __xfs_trans_alloc(mp, flags);
+	if (IS_ERR(tp))
+		return PTR_ERR(tp);
 	WARN_ON(mp->m_super->s_writers.frozen == SB_FREEZE_COMPLETE);
 	error = xfs_trans_reserve(tp, resp, blocks, rtextents);
 	if (error == -ENOSPC && want_retry) {
-- 
2.47.3


^ permalink raw reply related

* [PATCH v2 0/2] xfs: prevent close() from hanging on frozen filesystems
From: Aditya Srivastava @ 2026-06-12 11:22 UTC (permalink / raw)
  To: Carlos Maiolino
  Cc: Christoph Hellwig, linux-xfs, linux-kernel,
	Aditya Prakash Srivastava

From: Aditya Prakash Srivastava <aditya.ansh182@gmail.com>

Hi Carlos and Christoph,

This is version 2 of the patch series addressing the close() system call
hanging indefinitely on frozen XFS filesystems (Bugzilla #205833).

Based on Christoph's feedback, we have made the following improvements:
- Split the changes into a clean, 2-patch series separating the new
  trylock flag implementation from the deadlock fix itself.
- Extracted the case history, reproducer code, and discussion from the
  commit logs into this cover letter to keep commit logs surgically
  focused.
- Adjusted the parameter name in xfs_free_eofblocks() to trans_flags to
  make its usage stand out.
- Implemented a much cleaner, race-free state preservation logic in
  xfs_file_release() by omitting pre-setting XFS_EOFBLOCKS_RELEASED
  on the inode, setting it only upon successful truncation.
- Simplified the transaction allocation block, renamed the flag to
  XFS_TRANS_WRITECOUNT_TRYLOCK, and added the requested mutual-exclusivity
  assertion in __xfs_trans_alloc().
- Set up this series to be posted standalone (not threaded as a reply)
  per the standard kernel guidelines.

### The Real-world Impact (Bugzilla & Downstream Cases)
When speculative post-EOF blocks are closed on XFS, the release path
synchronously attempts to free them via xfs_free_eofblocks(). This allocates
a write transaction (xfs_trans_alloc) which blocks indefinitely on the
superblock freeze write lock (sb_start_intwrite) under fsfreeze.

This behavior has a long history of causing severe system disruption:
- Downstream Red Hat Bugzilla 1474726 (dating back to 2017) details
  complete system hangs during system backups when rsync and fsfreeze
  are used. Even seemingly harmless read-only commands like
  'cat /var/log/messages' would hang on close() in __sb_start_write
  via xfs_free_eofblocks, requiring a hard reboot.
- Downstream LeApp integration test scenarios consistently hit this hang.

Hanging on close() frequently triggers container healthcheck failures,
systemd service timeouts, and cluster failover cascades, which is
disruptive to user-space applications that view close() as resource
reclamation. No other major Linux filesystem (ext4, btrfs, etc.)
synchronously allocates write transactions during close() system calls.

### The Solution: Non-blocking Superblock Trylock
Instead of performing racy pre-checks, this series introduces
XFS_TRANS_WRITECOUNT_TRYLOCK. When specified, __xfs_trans_alloc() attempts
to obtain freeze protection using sb_start_intwrite_trylock(). If that fails,
it aborts allocation gracefully and returns -EAGAIN.

We then pass XFS_TRANS_WRITECOUNT_TRYLOCK during xfs_file_release(). If
the truncation fails due to a frozen filesystem (-EAGAIN), we cleanly bypass
setting XFS_EOFBLOCKS_RELEASED on the inode, ensuring subsequent releases
or the background blockgc garbage collector can successfully clean them up
once thawed.

### Reproducer Details (GPLv2 Licensed)
As requested, I have added a GPLv2-compatible license to the C reproducer
provided below. I will also be working to wire up this reproducer into the
official xfstests suite in the near future.

Compile with -pthread:

    /*
     * GPLv2-compatible XFS freeze close() hang reproducer.
     * Copyright (c) 2026 Aditya Prakash Srivastava. All Rights Reserved.
     */
    #define _GNU_SOURCE
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>
    #include <fcntl.h>
    #include <pthread.h>
    #include <sys/ioctl.h>
    #include <sys/vfs.h>
    #include <linux/fs.h>
    #include <libgen.h>

    volatile int close_started = 0;
    volatile int close_completed = 0;

    void *close_thread(void *arg) {
        int fd = *(int *)arg;
        close_started = 1;
        close(fd);
        close_completed = 1;
        return NULL;
    }

    int main(int argc, char *argv[]) {
        struct statfs sfs;
        if (statfs(argv[1], &sfs) < 0) {
            char *dir_buf = strdup(argv[1]);
            char *parent_dir = dirname(dir_buf);
            if (statfs(parent_dir, &sfs) < 0) {
                perror("statfs");
                free(dir_buf);
                return 1;
            }
            free(dir_buf);
        }
        if (sfs.f_type != 0x58465342) return 1;

        int freeze_fd = open(dirname(strdup(argv[1])), O_RDONLY);
        int write_fd = open(argv[1], O_WRONLY | O_CREAT | O_TRUNC, 0644);
        char buf[65536] = {0};
        for (int i = 0; i < 320; i++) write(write_fd, buf, sizeof(buf));

        ioctl(freeze_fd, FIFREEZE, 0);
        pthread_t thread;
        pthread_create(&thread, NULL, close_thread, &write_fd);
        while (!close_started) usleep(1000);
        usleep(1000000); // Wait 1s
        if (!close_completed) printf("SUCCESS: close() hung!\n");
        ioctl(freeze_fd, FITHAW, 0);
        pthread_join(thread, NULL);
        unlink(argv[1]);
        return 0;
    }

Aditya Prakash Srivastava (2):
  xfs: add a XFS_TRANS_WRITECOUNT_TRYLOCK flag
  xfs: prevent close() from hanging on frozen filesystems

 fs/xfs/libxfs/xfs_shared.h |  3 +++
 fs/xfs/xfs_bmap_util.c     |  7 ++++---
 fs/xfs/xfs_bmap_util.h     |  2 +-
 fs/xfs/xfs_file.c          | 12 ++++++------
 fs/xfs/xfs_icache.c        |  2 +-
 fs/xfs/xfs_inode.c         |  2 +-
 fs/xfs/xfs_trans.c         | 14 ++++++++++++--
 7 files changed, 28 insertions(+), 11 deletions(-)

-- 
2.43.0

^ permalink raw reply

* Re: [PATCH v2] xfs: prevent close() from hanging on frozen filesystems
From: Aditya Prakash Srivastava @ 2026-06-12 11:07 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Carlos Maiolino, linux-xfs, linux-kernel
In-Reply-To: <aiueFE1yWmLJGIFA@infradead.org>

Hi Christoph,

Thank you for the detailed review and feedback.

The procedural guidance is helpful. I'll split the changes into
separate patches, move the reproducer and additional background into
the cover letter, and resend as a standalone patch series.

I'll incorporate the comments on the EOFBLOCKS_RELEASED handling,
transaction flag naming, comment formatting, and the transaction
allocation path in the next revision.

Regarding xfstests, I plan to convert it into an xfstests regression
test in the near future so this scenario can be validated automatically
going forward.

Thanks again for taking the time to review this.

Regards,
Aditya Prakash Srivastava

On Fri, Jun 12, 2026 at 11:20 AM Christoph Hellwig <hch@infradead.org> wrote:
>
> Hi Aditya,
>
> this looks mosltly good.  A few procedural comments first, I'll then
> move on to the code nitpicks.
>
>  - please don't respond to the previous patch, always post standalone.
>    git₋send-email gets this right.
>  - a lot of the history in this commit log belongs into a cover letter,
>    not the commit log itself.  Same for the reproducer.
>  - please split this up into multiple patches dong one thing a a time,
>    e.g.
>
> xfs: add a XFS_TRANS_TRYLOCK flag
> xfs: prevent close() from hanging on frozen filesystems
>
> > A simple GPLv2-licensed C reproducer demonstrating the hang (compile
> > with -pthread):
>
> Can you also wire this up to xfstests?
>
> > diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
> > index 0ab00615f1ad..56c7919bf1e5 100644
> > --- a/fs/xfs/xfs_bmap_util.c
> > +++ b/fs/xfs/xfs_bmap_util.c
> > @@ -574,7 +574,8 @@ xfs_can_free_eofblocks(
> >   */
> >  int
> >  xfs_free_eofblocks(
> > -     struct xfs_inode        *ip)
> > +     struct xfs_inode        *ip,
> > +     uint                    flags)
>
> Maybe name this trans_flags so that the usage sticks out?
>
> > @@ -1807,8 +1807,17 @@ xfs_file_release(
> >       if (!xfs_iflags_test(ip, XFS_EOFBLOCKS_RELEASED) &&
> >           xfs_ilock_nowait(ip, XFS_IOLOCK_EXCL)) {
> >               if (xfs_can_free_eofblocks(ip) &&
> > -                 !xfs_iflags_test_and_set(ip, XFS_EOFBLOCKS_RELEASED))
> > -                     xfs_free_eofblocks(ip);
> > +                 !xfs_iflags_test_and_set(ip, XFS_EOFBLOCKS_RELEASED)) {
> > +                     int error = xfs_free_eofblocks(ip, XFS_TRANS_TRYLOCK);
> > +
> > +                     /*
> > +                      * If transaction allocation fails due to a frozen/freezing
> > +                      * filesystem, clear the released flag so that subsequent
> > +                      * releases or background blockgc can retry post-thaw.
> > +                      */
> > +                     if (error == -EAGAIN)
> > +                             xfs_iflags_clear(ip, XFS_EOFBLOCKS_RELEASED);
>
> The comment has two overly long lines.
>
> Also the resetting the flag on error is a bit odd.  As far as I can tell
> there is no need to clear it before the action, so I think we can just
> move to clearing it only on successful return, e.g.:
>
>         if (!xfs_iflags_test(ip, XFS_EOFBLOCKS_RELEASED) &&
>             xfs_ilock_nowait(ip, XFS_IOLOCK_EXCL)) {
>                 if (!xfs_iflags_test(ip, XFS_EOFBLOCKS_RELEASED) &&
>                     xfs_can_free_eofblocks(ip) &&
>                     !xfs_free_eofblocks(ip, XFS_TRANS_TRYLOCK))
>                         xfs_iflags_set(ip, XFS_EOFBLOCKS_RELEASED);
>
>
> > +     if (!(flags & XFS_TRANS_NO_WRITECOUNT)) {
> > +             /*
> > +              * If TRYLOCK is specified, attempt a non-blocking lock to
> > +              * avoid blocking indefinitely on frozen/freezing filesystems.
> > +              */
>
> This would be a bit cleaner as:
>
>         if (flags & XFS_TRANS_TRYLOCK) {
>                 if (!sb_start_intwrite_trylock(mp->m_super)) {
>                         kmem_cache_free(xfs_trans_cache, tp);
>                         return ERR_PTR(-EAGAIN);
>                 }
>         } else if (!(flags & XFS_TRANS_NO_WRITECOUNT)) {
>                 sb_start_intwrite(mp->m_super);
>         }
>
> And maybe the flag name would better match XFS_TRANS_NO_WRITECOUNT
> a bit, e.g. XFS_TRANS_WRITECOUNT_TRYLOCK?
>
> Maybe also add an assert that XFS_TRANS_NO_WRITECOUNT and
> XFS_TRANS_NO_WRITECOUNT are not specified at the same time.
>

^ permalink raw reply

* [PATCH v3 3/4] iomap: reject NOWAIT and BOUNCE direct IOs
From: Qu Wenruo @ 2026-06-12  9:51 UTC (permalink / raw)
  To: linux-btrfs, linux-block, linux-fsdevel, linux-xfs
In-Reply-To: <cover.1781253428.git.wqu@suse.com>

If a direct IO requires bounced pages for stable buffer, it will always
allocate memory, and both bio_iov_iter_bounce_write() and
bio_iov_iter_bounce_read() are allocating pages using GFP_KERNEL, which
can sleep and break NOWAIT requirement.

So we need to reject such NOWAIT and BOUNCE direct IO in
iomap_dio_bio_iter().

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/iomap/direct-io.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
index b36ee619cdcd..d1601122f0b5 100644
--- a/fs/iomap/direct-io.c
+++ b/fs/iomap/direct-io.c
@@ -412,6 +412,10 @@ static int iomap_dio_bio_iter(struct iomap_iter *iter, struct iomap_dio *dio)
 	unsigned int alignment;
 	ssize_t ret = 0;
 
+	/* Bounced direct IO will need to allocate memory, breaking NOWAIT flag. */
+	if (unlikely(iter->flags & IOMAP_NOWAIT && dio->flags & IOMAP_DIO_BOUNCE))
+		return -EAGAIN;
+
 	/*
 	 * File systems that write out of place and always allocate new blocks
 	 * need each bio to be block aligned as that's the unit of allocation.
-- 
2.54.0


^ permalink raw reply related

* [PATCH v3 2/4] block: respect iov_iter::nofault flag in bio_iov_iter_bounce_write()
From: Qu Wenruo @ 2026-06-12  9:51 UTC (permalink / raw)
  To: linux-btrfs, linux-block, linux-fsdevel, linux-xfs
In-Reply-To: <cover.1781253428.git.wqu@suse.com>

For the incoming usage of IOMAP_DIO_BOUNCE in btrfs, btrfs has set
iov_iter::nofault to prevent deadlock when a page fault is needed to
read out the buffer.

However bio_iov_iter_bounce_write() doesn't respect iov_iter::nofault
flag, and just call a plain copy_from_iter() so it can still trigger
page fault and cause deadlock in btrfs.

Fix it by utilizing copy_folio_from_iter_atomic() if nofault flag is
set, otherwise use copy_folio_from_iter().

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 block/bio.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/block/bio.c b/block/bio.c
index b33ff69bb722..01bb76d9717c 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -1335,7 +1335,10 @@ static int bio_iov_iter_bounce_write(struct bio *bio, struct iov_iter *iter,
 			break;
 		bio_add_folio_nofail(bio, folio, this_len, 0);
 
-		copied = copy_from_iter(folio_address(folio), this_len, iter);
+		if (iter->nofault)
+			copied = copy_folio_from_iter_atomic(folio, 0, this_len, iter);
+		else
+			copied = copy_folio_from_iter(folio, 0, this_len, iter);
 		if (copied < this_len) {
 			iov_iter_revert(iter, bio->bi_iter.bi_size - this_len + copied);
 			bio_free_folios(bio);
-- 
2.54.0


^ permalink raw reply related

* [PATCH v3 4/4] btrfs: use IOMAP_DIO_BOUNCE flag instead of falling back to buffered IO
From: Qu Wenruo @ 2026-06-12  9:51 UTC (permalink / raw)
  To: linux-btrfs, linux-block, linux-fsdevel, linux-xfs
In-Reply-To: <cover.1781253428.git.wqu@suse.com>

Previously btrfs forces direct writes to fall back to buffered ones if the
inode has data checksum or the profile has duplication.

That fallback is to avoid the content being modified that the final
content may mismatch with the checksum or the other mirrors.

That brings a pretty huge performance cost, which already caused some
concern at that time.

But later upstream commit c9d114846b38 ("iomap: add a flag to bounce
buffer direct I/O") introduced a new method by copying the content into
new pages, and do all the operations based on the newly allocated pages.

So let btrfs to utilize the new flag for direct writes if we require
stable folios.

There is a quick benchmark, using the following fio setup:

 fio --name=randwrite --filename $mnt/foobar --ioengine=libaio --size=4G \
     --rw=randwrite --iodepth=64 --runtime=60 --time_based --direct=1 \
     --bs=$blocksize

Unit is MiB/s.

 Blocksize | Zero-copy (*) | Buffered |   Bounce
-----------+---------------+----------+-----------
        4K |          35.1 |     17.1 |      33.8
       64K |           522 |      251 |       492

*: This is done by reverting the commit 968f19c5b1b7 ("btrfs: always
   fallback to buffered write if the inode requires checksum")

Although with page bouncing the performance is only around 95% of
true-zero copy, it's still almost double the performance of buffered
fallback.

There will be a small change in behavior, since we're using
IOMAP_DIO_BOUNCE flag to allocate new folios, NOWAIT flag will
immediately fail.

So for NOWAIT direct IOs, NODATASUM and RAID0/SINGLE profiles are still
required.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/direct-io.c | 53 ++++++++++++++++++++------------------------
 1 file changed, 24 insertions(+), 29 deletions(-)

diff --git a/fs/btrfs/direct-io.c b/fs/btrfs/direct-io.c
index e566a60b0ce5..bbf94056a874 100644
--- a/fs/btrfs/direct-io.c
+++ b/fs/btrfs/direct-io.c
@@ -818,13 +818,36 @@ static ssize_t btrfs_dio_read(struct kiocb *iocb, struct iov_iter *iter,
 			    IOMAP_DIO_PARTIAL | IOMAP_DIO_FSBLOCK_ALIGNED, &data, done_before);
 }
 
+static bool need_stable_write(struct btrfs_inode *inode)
+{
+	const u64 data_profile = btrfs_data_alloc_profile(inode->root->fs_info) &
+				 BTRFS_BLOCK_GROUP_PROFILE_MASK;
+
+	/* Data checksum requires stable buffer. */
+	if (!(inode->flags & BTRFS_INODE_NODATASUM))
+		return true;
+	/*
+	 * Any profile with mirror/parity will require stable buffer.
+	 * Otherwise the mirror may differ from each other.
+	 *
+	 * Thus only SINGLE and RAID0 doesn't require stable buffer.
+	 */
+	if (data_profile != 0 && data_profile != BTRFS_BLOCK_GROUP_RAID0)
+		return true;
+	return false;
+}
+
 static struct iomap_dio *btrfs_dio_write(struct kiocb *iocb, struct iov_iter *iter,
 					 size_t done_before)
 {
 	struct btrfs_dio_data data = { 0 };
+	unsigned int dio_flags = IOMAP_DIO_PARTIAL | IOMAP_DIO_FSBLOCK_ALIGNED;
+
+	if (need_stable_write(BTRFS_I(file_inode(iocb->ki_filp))))
+		dio_flags |= IOMAP_DIO_BOUNCE;
 
 	return __iomap_dio_rw(iocb, iter, &btrfs_dio_iomap_ops, &btrfs_dio_ops,
-			    IOMAP_DIO_PARTIAL | IOMAP_DIO_FSBLOCK_ALIGNED, &data, done_before);
+			      dio_flags, &data, done_before);
 }
 
 static ssize_t check_direct_IO(struct btrfs_fs_info *fs_info,
@@ -853,8 +876,6 @@ ssize_t btrfs_direct_write(struct kiocb *iocb, struct iov_iter *from)
 	ssize_t ret;
 	unsigned int ilock_flags = 0;
 	struct iomap_dio *dio;
-	const u64 data_profile = btrfs_data_alloc_profile(fs_info) &
-				 BTRFS_BLOCK_GROUP_PROFILE_MASK;
 
 	if (iocb->ki_flags & IOCB_NOWAIT)
 		ilock_flags |= BTRFS_ILOCK_TRY;
@@ -868,16 +889,6 @@ ssize_t btrfs_direct_write(struct kiocb *iocb, struct iov_iter *from)
 	if (iocb->ki_pos + iov_iter_count(from) <= i_size_read(inode) && IS_NOSEC(inode))
 		ilock_flags |= BTRFS_ILOCK_SHARED;
 
-	/*
-	 * If our data profile has duplication (either extra mirrors or RAID56),
-	 * we can not trust the direct IO buffer, the content may change during
-	 * writeback and cause different contents written to different mirrors.
-	 *
-	 * Thus only RAID0 and SINGLE can go true zero-copy direct IO.
-	 */
-	if (data_profile != BTRFS_BLOCK_GROUP_RAID0 && data_profile != 0)
-		goto buffered;
-
 relock:
 	ret = btrfs_inode_lock(BTRFS_I(inode), ilock_flags);
 	if (ret < 0)
@@ -918,22 +929,6 @@ ssize_t btrfs_direct_write(struct kiocb *iocb, struct iov_iter *from)
 		btrfs_inode_unlock(BTRFS_I(inode), ilock_flags);
 		goto buffered;
 	}
-	/*
-	 * We can't control the folios being passed in, applications can write
-	 * to them while a direct IO write is in progress.  This means the
-	 * content might change after we calculated the data checksum.
-	 * Therefore we can end up storing a checksum that doesn't match the
-	 * persisted data.
-	 *
-	 * To be extra safe and avoid false data checksum mismatch, if the
-	 * inode requires data checksum, just fallback to buffered IO.
-	 * For buffered IO we have full control of page cache and can ensure
-	 * no one is modifying the content during writeback.
-	 */
-	if (!(BTRFS_I(inode)->flags & BTRFS_INODE_NODATASUM)) {
-		btrfs_inode_unlock(BTRFS_I(inode), ilock_flags);
-		goto buffered;
-	}
 
 	/*
 	 * The iov_iter can be mapped to the same file range we are writing to.
-- 
2.54.0


^ permalink raw reply related

* [PATCH v3 1/4] block: revert the iov_iter after a short copy in bio_iov_iter_bounce_write()
From: Qu Wenruo @ 2026-06-12  9:51 UTC (permalink / raw)
  To: linux-btrfs, linux-block, linux-fsdevel, linux-xfs
In-Reply-To: <cover.1781253428.git.wqu@suse.com>

For the incoming IOMAP_DIO_BOUNCE flag usage inside btrfs, it's pretty
easy to hit short copy inside bio_iov_iter_bounce_write().

This is because btrfs has disabled page fault to avoid certain deadlock
during direct writes, and instead btrfs manually fault in the pages then
retry.

And inside bio_iov_iter_bounce_write(), if we hit a short write, we
didn't revert the iov_iter, which can cause problems like unexpected
garbage for the next retry.

Revert the iov_iter after a short copy.

One thing to note is that, the folio is allocated then immediately
queued into the bio, so the proper revert size should be
(bi_size - this_len + copied).

Fixes: 8dd5e7c75d7b ("block: add helpers to bounce buffer an iov_iter into bios")
Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 block/bio.c | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/block/bio.c b/block/bio.c
index 5f10900b3f42..b33ff69bb722 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -1321,6 +1321,7 @@ static int bio_iov_iter_bounce_write(struct bio *bio, struct iov_iter *iter,

 	do {
 		size_t this_len = min(total_len, SZ_1M);
+		size_t copied;
 		struct folio *folio;

 		if (this_len > minsize * 2)
@@ -1334,12 +1335,12 @@ static int bio_iov_iter_bounce_write(struct bio *bio, struct iov_iter *iter,
 			break;
 		bio_add_folio_nofail(bio, folio, this_len, 0);

-		if (copy_from_iter(folio_address(folio), this_len, iter) !=
-				this_len) {
+		copied = copy_from_iter(folio_address(folio), this_len, iter);
+		if (copied < this_len) {
+			iov_iter_revert(iter, bio->bi_iter.bi_size - this_len + copied);
 			bio_free_folios(bio);
 			return -EFAULT;
 		}
-
 		total_len -= this_len;
 	} while (total_len && bio->bi_vcnt < bio->bi_max_vecs);

-- 
2.54.0

^ permalink raw reply related

page: next (older)
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox