Re: Hang with xfs/285 on 2026-03-02 kernel

From: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
To: Dave Chinner <dgc@kernel.org>
Cc: Matthew Wilcox <willy@infradead.org>, linux-xfs@vger.kernel.org
Subject: Re: Hang with xfs/285 on 2026-03-02 kernel
Date: Mon, 06 Apr 2026 05:57:06 +0530
Message-ID: <y0j1kk6d.ritesh.list@gmail.com>
In-Reply-To: <adLfJwoi1lZhnbjn@dread>


Thanks Dave for your inputs. I have a few more data points on this, and
it would be nice to know your thoughts on them.

Dave Chinner <dgc@kernel.org> writes:

> On Sun, Apr 05, 2026 at 06:33:59AM +0530, Ritesh Harjani wrote:
>> Dave Chinner <dgc@kernel.org> writes:
>> 
>> > On Fri, Apr 03, 2026 at 04:35:46PM +0100, Matthew Wilcox wrote:
>> >> This is with commit 5619b098e2fb so after 7.0-rc6
>> >> INFO: task fsstress:3762792 blocked on a semaphore likely last held by task fsstress:3762793
>> >> task:fsstress        state:D stack:0     pid:3762793 tgid:3762793 ppid:3762783 task_flags:0x440140 flags:0x00080800
>> >> Call Trace:
>> >>  <TASK>
>> >>  __schedule+0x560/0xfc0
>> >>  schedule+0x3e/0x140
>> >>  schedule_timeout+0x84/0x110
>> >>  ? __pfx_process_timeout+0x10/0x10
>> >>  io_schedule_timeout+0x5b/0x80
>> >>  xfs_buf_alloc+0x793/0x7d0
>> >
>> > -ENOMEM.
>> >
>> > It'll be looping here:
>> >
>> > fallback:
>> >         for (;;) {
>> >                 bp->b_addr = __vmalloc(size, gfp_mask);
>> >                 if (bp->b_addr)
>> >                         break;
>> >                 if (flags & XBF_READ_AHEAD)
>> >                         return -ENOMEM;
>> >                 XFS_STATS_INC(bp->b_mount, xb_page_retries);
>> >                 memalloc_retry_wait(gfp_mask);
>> >         }
>> >
>> > If it is looping here long enough to trigger the hang check timer,
>> > then the MM subsystem is not making progress reclaiming memory. This
>> 
>> Hi Dave,
>> 
>> If that's the case and if we expect the MM subsystem to do memory
>> reclaim, shouldn't we be passing the __GFP_DIRECT_RECLAIM flag to our
>> fallback loop? I see that we might have cleared this flag and also set
>> __GFP_NORETRY, in the above if condition if allocation size is >PAGE_SIZE.
>> 
>> So shouldn't we do?
>> 
>>         if (size > PAGE_SIZE) {
>>                 if (!is_power_of_2(size))
>>                         goto fallback;
>> -               gfp_mask &= ~__GFP_DIRECT_RECLAIM;
>> -               gfp_mask |= __GFP_NORETRY;
>> +               gfp_t alloc_gfp = (gfp_mask & ~__GFP_DIRECT_RECLAIM) | __GFP_NORETRY;
>> +               folio = folio_alloc(alloc_gfp, get_order(size));
>> +       } else {
>> +               folio = folio_alloc(gfp_mask, get_order(size));
>>         }
>> -       folio = folio_alloc(gfp_mask, get_order(size));
>>         if (!folio) {
>>                 if (size <= PAGE_SIZE)
>>                         return -ENOMEM;
>>                 trace_xfs_buf_backing_fallback(bp, _RET_IP_);
>>                 goto fallback;
>>         }
>
> Possibly.
>
> That said, we really don't want stuff like compaction to
> run here -ever- because of how expensive it is for hot paths when
> memory is low, and the only knob we have to control that is
> __GFP_DIRECT_RECLAIM.
>

Looking at __alloc_pages_direct_compact(), it returns immediately for
order-0 allocations, which is what the vmalloc fallback path issues.


> However, turning off direct reclaim should make no difference in
> the long run because vmalloc is only trying to allocate a batch of
> single page folios.
>
> If we are in low memory situations where no single page folios are
> not available, then even for a NORETRY/no direct reclaim allocation
> the expectation is that the failed allocation attempt would be
> kicking kswapd to perform background memory reclaim.
>
> This is especially true when the allocation is GFP_NOFS/GFP_NOIO
> even with direct reclaim turned on - if all the memory is held in
> shrinkable fs/vfs caches then direct reclaim cannot reclaim anything
> filesystem/IO related.
>

So, looking at the logs from Matthew, I think this case might have
benefited from __GFP_DIRECT_RECLAIM, because we have many clean
inactive file pages. So theoretically, IMO, direct reclaim should be
able to reclaim and then use one of those clean file pages:

      nr_zone_inactive_file 62769
      nr_zone_write_pending 0


> i.e. background reclaim making forwards progress is absolutely
> necessary for any sort of "nofail" allocation loop to succeed
> regardless of whether direct reclaim is enabled or not.
>
> Hence if background memory reclaim is making progress, this
> allocation loop should eventually succeed. If the allocation is not
> succeeding, then it implies that some critical resource in the
> allocation path is not being refilled either on allocation failure
> or by background reclaim, and hence the allocation failure persists
> because nothing alleviates the resource shortage that is triggering
> the ENOMEM issue.

I agree, background memory reclaim / the kswapd thread should have made
forward progress.

I am not sure why we are hitting hung task issues in this case, then.
It could be because multiple fsstress threads are running in parallel
(from the ps -eax output), and maybe some other process ends up using
the pages reclaimed by background kswapd (just a theory).

>
> So the question is: where in the __vmalloc allocation path is the
> ENOMEM error being generated from, and is it the same place every
> time?
>

I can't say for sure, but after looking at the code, and knowing that
we are not passing __GFP_DIRECT_RECLAIM, it might be returning from
here (after get_page_from_freelist() couldn't get a free page):

__alloc_pages_slowpath() {
	...
	/* Caller is not willing to reclaim, we can't balance anything */
	if (!can_direct_reclaim)
		goto nopage;
	...
}
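
To spell out the decision I am referring to, here is a small userspace
model of that check. It is only a sketch, not kernel code: the model_*
names and the MODEL_GFP_DIRECT_RECLAIM bit value are made up for
illustration, and the real slowpath does much more around this point.

```c
#include <assert.h>
#include <stdbool.h>

typedef unsigned int gfp_t;

/* Hypothetical bit value for illustration; not the kernel's encoding. */
#define MODEL_GFP_DIRECT_RECLAIM 0x1u

enum model_outcome {
	MODEL_GOT_PAGE,		/* the freelist had a page to hand out */
	MODEL_NOPAGE,		/* gave up: caller did not allow direct reclaim */
	MODEL_TRY_RECLAIM,	/* would go on to attempt direct reclaim */
};

/*
 * Simplified model of the early exit quoted above: when the freelist is
 * empty, the allocation can only make progress by itself if the caller
 * allowed direct reclaim in the gfp mask.
 */
enum model_outcome model_slowpath(gfp_t gfp_mask, bool freelist_has_page)
{
	bool can_direct_reclaim = gfp_mask & MODEL_GFP_DIRECT_RECLAIM;

	if (freelist_has_page)
		return MODEL_GOT_PAGE;
	if (!can_direct_reclaim)
		return MODEL_NOPAGE;
	return MODEL_TRY_RECLAIM;
}
```

In this model, a mask without the direct reclaim bit and an empty
freelist always ends in MODEL_NOPAGE, no matter how many clean file
pages could have been reclaimed.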


So, with the above data, I think passing __GFP_DIRECT_RECLAIM in the
vmalloc fallback path might help in this case. And either way, we retry
indefinitely until a page is allocated, so we may as well pass the
__GFP_DIRECT_RECLAIM flag to it, right?

fallback:
	for (;;) {
		bp->b_addr = __vmalloc(size, gfp_mask);
		if (bp->b_addr)
			break;
		if (flags & XBF_READ_AHEAD)
			return -ENOMEM;
		XFS_STATS_INC(bp->b_mount, xb_page_retries);
		memalloc_retry_wait(gfp_mask);
	}
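
To make the flag flow concrete, here is a userspace model of which mask
reaches this loop, with and without the change from the diff quoted
earlier in this mail. It is a sketch under assumptions: the model_*
helpers and the MODEL_GFP_* bit values are hypothetical, and it only
models the mask handling, not the allocations themselves.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef unsigned int gfp_t;

/* Hypothetical bit values for illustration; not the kernel's encoding. */
#define MODEL_GFP_DIRECT_RECLAIM 0x1u
#define MODEL_GFP_NORETRY        0x2u

bool model_is_power_of_2(size_t n)
{
	return n != 0 && (n & (n - 1)) == 0;
}

/*
 * Mask seen by the fallback loop today: for a power-of-two size larger
 * than a page, gfp_mask is edited in place before the folio attempt, so
 * the loop inherits the restricted mask. Other sizes above a page jump
 * to the fallback with the mask untouched.
 */
gfp_t model_fallback_mask_current(gfp_t gfp_mask, size_t size, size_t page_size)
{
	assert(model_is_power_of_2(page_size));
	if (size > page_size && model_is_power_of_2(size)) {
		gfp_mask &= ~MODEL_GFP_DIRECT_RECLAIM;
		gfp_mask |= MODEL_GFP_NORETRY;
	}
	return gfp_mask;
}

/*
 * Mask seen by the fallback loop with the suggested change: the
 * restricted flags go into a local copy used only for the folio
 * attempt, so the loop keeps the caller's original mask.
 */
gfp_t model_fallback_mask_proposed(gfp_t gfp_mask, size_t size, size_t page_size)
{
	assert(model_is_power_of_2(page_size));
	if (size > page_size && model_is_power_of_2(size)) {
		gfp_t alloc_gfp = (gfp_mask & ~MODEL_GFP_DIRECT_RECLAIM) |
				  MODEL_GFP_NORETRY;
		(void)alloc_gfp;	/* would be passed to folio_alloc() only */
	}
	return gfp_mask;
}
```

For example, with a 4096-byte page and a 65536-byte buffer, the current
variant returns a mask with the direct reclaim bit cleared, while the
proposed variant returns the caller's mask unchanged.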

Thoughts?

I am not sure how easily this issue reproduces at Matthew's end. But
let me also set up a KVM guest with the same kernel version and see
whether I can replicate it at my end with an overnight run of xfs/285
in a loop.


-ritesh

Thread overview: 9+ messages
2026-04-03 15:35 Hang with xfs/285 on 2026-03-02 kernel Matthew Wilcox
2026-04-04 11:42 ` Dave Chinner
2026-04-04 20:40   ` Matthew Wilcox
2026-04-05 22:29     ` Dave Chinner
2026-04-05  1:03   ` Ritesh Harjani
2026-04-05 22:16     ` Dave Chinner
2026-04-06  0:27       ` Ritesh Harjani [this message]
2026-04-06 21:45         ` Dave Chinner
2026-04-07  5:41 ` Christoph Hellwig
