From: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
To: Dave Chinner <dgc@kernel.org>
Cc: Matthew Wilcox <willy@infradead.org>, linux-xfs@vger.kernel.org
Subject: Re: Hang with xfs/285 on 2026-03-02 kernel
Date: Mon, 06 Apr 2026 05:57:06 +0530
Message-ID: <y0j1kk6d.ritesh.list@gmail.com>
In-Reply-To: <adLfJwoi1lZhnbjn@dread>
Thanks, Dave, for your inputs. I have a few more data points on this; it
would be nice to know your thoughts.
Dave Chinner <dgc@kernel.org> writes:
> On Sun, Apr 05, 2026 at 06:33:59AM +0530, Ritesh Harjani wrote:
>> Dave Chinner <dgc@kernel.org> writes:
>>
>> > On Fri, Apr 03, 2026 at 04:35:46PM +0100, Matthew Wilcox wrote:
>> >> This is with commit 5619b098e2fb so after 7.0-rc6
>> >> INFO: task fsstress:3762792 blocked on a semaphore likely last held by task fsstress:3762793
>> >> task:fsstress state:D stack:0 pid:3762793 tgid:3762793 ppid:3762783 task_flags:0x440140 flags:0x00080800
>> >> Call Trace:
>> >> <TASK>
>> >> __schedule+0x560/0xfc0
>> >> schedule+0x3e/0x140
>> >> schedule_timeout+0x84/0x110
>> >> ? __pfx_process_timeout+0x10/0x10
>> >> io_schedule_timeout+0x5b/0x80
>> >> xfs_buf_alloc+0x793/0x7d0
>> >
>> > -ENOMEM.
>> >
>> > It'll be looping here:
>> >
>> > fallback:
>> > 	for (;;) {
>> > 		bp->b_addr = __vmalloc(size, gfp_mask);
>> > 		if (bp->b_addr)
>> > 			break;
>> > 		if (flags & XBF_READ_AHEAD)
>> > 			return -ENOMEM;
>> > 		XFS_STATS_INC(bp->b_mount, xb_page_retries);
>> > 		memalloc_retry_wait(gfp_mask);
>> > 	}
>> >
>> > If it is looping here long enough to trigger the hang check timer,
>> > then the MM subsystem is not making progress reclaiming memory. This
>>
>> Hi Dave,
>>
>> If that's the case and we expect the MM subsystem to do memory
>> reclaim, shouldn't we be passing the __GFP_DIRECT_RECLAIM flag to our
>> fallback loop? I see that we would have cleared this flag and also set
>> __GFP_NORETRY in the if condition above when the allocation size
>> is >PAGE_SIZE.
>>
>> So shouldn't we do?
>>
>>  	if (size > PAGE_SIZE) {
>>  		if (!is_power_of_2(size))
>>  			goto fallback;
>> -		gfp_mask &= ~__GFP_DIRECT_RECLAIM;
>> -		gfp_mask |= __GFP_NORETRY;
>> +		gfp_t alloc_gfp = (gfp_mask & ~__GFP_DIRECT_RECLAIM) | __GFP_NORETRY;
>> +		folio = folio_alloc(alloc_gfp, get_order(size));
>> +	} else {
>> +		folio = folio_alloc(gfp_mask, get_order(size));
>>  	}
>> -	folio = folio_alloc(gfp_mask, get_order(size));
>>  	if (!folio) {
>>  		if (size <= PAGE_SIZE)
>>  			return -ENOMEM;
>>  		trace_xfs_buf_backing_fallback(bp, _RET_IP_);
>>  		goto fallback;
>>  	}
>
> Possibly.
>
> That said, we really don't want stuff like compaction to
> run here -ever- because of how expensive it is for hot paths when
> memory is low, and the only knob we have to control that is
> __GFP_DIRECT_RECLAIM.
>
Looking at __alloc_pages_direct_compact(), it returns immediately for
order-0 allocations, and the vmalloc fallback only ever asks for
single-page (order-0) folios, so compaction shouldn't get involved here
even with direct reclaim allowed.
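For reference, this is the early return I mean; paraphrasing from
mm/page_alloc.c, so the exact surrounding code may differ across kernel
versions:

__alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order, ...)
{
	...
	/* (paraphrased) compaction is pointless for order-0 requests */
	if (!order)
		return NULL;
	...
}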
> However, turning off direct reclaim should make no difference in
> the long run because vmalloc is only trying to allocate a batch of
> single page folios.
>
> If we are in low memory situations where single page folios are
> not available, then even for a NORETRY/no direct reclaim allocation
> the expectation is that the failed allocation attempt would be
> kicking kswapd to perform background memory reclaim.
>
> This is especially true when the allocation is GFP_NOFS/GFP_NOIO
> even with direct reclaim turned on - if all the memory is held in
> shrinkable fs/vfs caches then direct reclaim cannot reclaim anything
> filesystem/IO related.
>
So, looking at the logs from Matthew, I think this case might have
benefited from __GFP_DIRECT_RECLAIM, because there are plenty of clean
inactive file pages. Theoretically, IMO, direct reclaim should be able
to reclaim and reuse one of those clean file pages:

  nr_zone_inactive_file 62769
  nr_zone_write_pending 0
> i.e. background reclaim making forwards progress is absolutely
> necessary for any sort of "nofail" allocation loop to succeed
> regardless of whether direct reclaim is enabled or not.
>
> Hence if background memory reclaim is making progress, this
> allocation loop should eventually succeed. If the allocation is not
> succeeding, then it implies that some critical resource in the
> allocation path is not being refilled either on allocation failure
> or by background reclaim, and hence the allocation failure persists
> because nothing alleviates the resource shortage that is triggering
> the ENOMEM issue.
I agree, background memory reclaim / the kswapd thread should have made
forward progress. I am not sure why we are hitting hung task warnings in
this case, then. It could be because multiple fsstress threads are
running in parallel (going by the ps -eax output), and some other
process ends up using the pages reclaimed by background kswapd (just a
theory).
>
> So the question is: where in the __vmalloc allocation path is the
> ENOMEM error being generated from, and is it the same place every
> time?
>
Although I can't say for sure, in this case, after looking at the code
and knowing that we are not passing __GFP_DIRECT_RECLAIM, it might be
returning from here (after get_page_from_freelist() couldn't get a free
page):
__alloc_pages_slowpath() {
	...
	/* Caller is not willing to reclaim, we can't balance anything */
	if (!can_direct_reclaim)
		goto nopage;
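For context, can_direct_reclaim comes straight from the gfp flags near
the top of the same function (paraphrasing again; exact code may vary by
kernel version):

	const bool can_direct_reclaim = gfp_mask & __GFP_DIRECT_RECLAIM;

So without __GFP_DIRECT_RECLAIM in gfp_mask we take the nopage path as
soon as get_page_from_freelist() fails.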
So, with the above data, I think passing __GFP_DIRECT_RECLAIM in the
vmalloc fallback path might help in this case. And either way, until we
have the memory allocated we retry indefinitely, so we may as well pass
the __GFP_DIRECT_RECLAIM flag, right?
fallback:
	for (;;) {
		bp->b_addr = __vmalloc(size, gfp_mask);
		if (bp->b_addr)
			break;
		if (flags & XBF_READ_AHEAD)
			return -ENOMEM;
		XFS_STATS_INC(bp->b_mount, xb_page_retries);
		memalloc_retry_wait(gfp_mask);
	}
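Just to make that concrete, here is a completely untested sketch on top
of the loop above (an alternative to the earlier diff, which instead
keeps __GFP_DIRECT_RECLAIM in gfp_mask by using a local alloc_gfp):

 fallback:
+	/*
+	 * Untested: we retry this allocation forever below, so allow
+	 * direct reclaim for the vmalloc fallback even though the large
+	 * folio attempt above dropped it from gfp_mask.
+	 */
+	gfp_mask |= __GFP_DIRECT_RECLAIM;
 	for (;;) {
 		bp->b_addr = __vmalloc(size, gfp_mask);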
Thoughts?
I am not sure how easily this issue is reproducible at Matthew's end.
But let me also set up a kvm guest with the same kernel version and see
whether I can replicate this at my end with an overnight run of xfs/285
in a loop.
-ritesh