public inbox for linux-xfs@vger.kernel.org
From: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
To: Dave Chinner <dgc@kernel.org>
Cc: Matthew Wilcox <willy@infradead.org>, linux-xfs@vger.kernel.org
Subject: Re: Hang with xfs/285 on 2026-03-02 kernel
Date: Mon, 06 Apr 2026 05:57:06 +0530	[thread overview]
Message-ID: <y0j1kk6d.ritesh.list@gmail.com> (raw)
In-Reply-To: <adLfJwoi1lZhnbjn@dread>


Thanks, Dave, for your inputs. I have a few more data points on this;
it would be nice to know your thoughts.

Dave Chinner <dgc@kernel.org> writes:

> On Sun, Apr 05, 2026 at 06:33:59AM +0530, Ritesh Harjani wrote:
>> Dave Chinner <dgc@kernel.org> writes:
>> 
>> > On Fri, Apr 03, 2026 at 04:35:46PM +0100, Matthew Wilcox wrote:
>> >> This is with commit 5619b098e2fb so after 7.0-rc6
>> >> INFO: task fsstress:3762792 blocked on a semaphore likely last held by task fsstress:3762793
>> >> task:fsstress        state:D stack:0     pid:3762793 tgid:3762793 ppid:3762783 task_flags:0x440140 flags:0x00080800
>> >> Call Trace:
>> >>  <TASK>
>> >>  __schedule+0x560/0xfc0
>> >>  schedule+0x3e/0x140
>> >>  schedule_timeout+0x84/0x110
>> >>  ? __pfx_process_timeout+0x10/0x10
>> >>  io_schedule_timeout+0x5b/0x80
>> >>  xfs_buf_alloc+0x793/0x7d0
>> >
>> > -ENOMEM.
>> >
>> > It'll be looping here:
>> >
>> > fallback:
>> >         for (;;) {
>> >                 bp->b_addr = __vmalloc(size, gfp_mask);
>> >                 if (bp->b_addr)
>> >                         break;
>> >                 if (flags & XBF_READ_AHEAD)
>> >                         return -ENOMEM;
>> >                 XFS_STATS_INC(bp->b_mount, xb_page_retries);
>> >                 memalloc_retry_wait(gfp_mask);
>> >         }
>> >
>> > If it is looping here long enough to trigger the hang check timer,
>> > then the MM subsystem is not making progress reclaiming memory. This
>> 
>> Hi Dave,
>> 
>> If that's the case and if we expect the MM subsystem to do memory
>> reclaim, shouldn't we be passing the __GFP_DIRECT_RECLAIM flag to our
>> fallback loop? I see that we might have cleared this flag and also set
>> __GFP_NORETRY, in the above if condition if allocation size is >PAGE_SIZE.
>> 
>> So shouldn't we do?
>> 
>>         if (size > PAGE_SIZE) {
>>                 if (!is_power_of_2(size))
>>                         goto fallback;
>> -               gfp_mask &= ~__GFP_DIRECT_RECLAIM;
>> -               gfp_mask |= __GFP_NORETRY;
>> +               gfp_t alloc_gfp = (gfp_mask & ~__GFP_DIRECT_RECLAIM) | __GFP_NORETRY;
>> +               folio = folio_alloc(alloc_gfp, get_order(size));
>> +       } else {
>> +               folio = folio_alloc(gfp_mask, get_order(size));
>>         }
>> -       folio = folio_alloc(gfp_mask, get_order(size));
>>         if (!folio) {
>>                 if (size <= PAGE_SIZE)
>>                         return -ENOMEM;
>>                 trace_xfs_buf_backing_fallback(bp, _RET_IP_);
>>                 goto fallback;
>>         }
>
> Possibly.
>
> That said, we really don't want stuff like compaction to
> run here -ever- because of how expensive it is for hot paths when
> memory is low, and the only knob we have to control that is
> __GFP_DIRECT_RECLAIM.
>

Looking at __alloc_pages_direct_compact(), I see it bails out
immediately for order-0 allocations, so enabling direct reclaim for the
single-page vmalloc fallback should not pull in any compaction cost.


> However, turning off direct reclaim should make no difference in
> the long run because vmalloc is only trying to allocate a batch of
> single page folios.
>
> If we are in low memory situations where no single page folios are
> available, then even for a NORETRY/no direct reclaim allocation
> the expectation is that the failed allocation attempt would be
> kicking kswapd to perform background memory reclaim.
>
> This is especially true when the allocation is GFP_NOFS/GFP_NOIO
> even with direct reclaim turned on - if all the memory is held in
> shrinkable fs/vfs caches then direct reclaim cannot reclaim anything
> filesystem/IO related.
>

So, looking at the logs from Matthew, I think this case might have
benefited from __GFP_DIRECT_RECLAIM, because we have many clean
inactive file pages. Theoretically, IMO, direct reclaim should be able
to free and then reuse one of those clean file pages:

      nr_zone_inactive_file 62769
      nr_zone_write_pending 0


> i.e. background reclaim making forwards progress is absolutely
> necessary for any sort of "nofail" allocation loop to succeed
> regardless of whether direct reclaim is enabled or not.
>
> Hence if background memory reclaim is making progress, this
> allocation loop should eventually succeed. If the allocation is not
> succeeding, then it implies that some critical resource in the
> allocation path is not being refilled either on allocation failure
> or by background reclaim, and hence the allocation failure persists
> because nothing alleviates the resource shortage that is triggering
> the ENOMEM issue.

I agree, background memory reclaim / the kswapd thread should have made
forward progress.

I am not sure why we are hitting hung task timeouts in this case, then.
It could be because multiple fsstress threads are running in parallel
(per the ps -eax output), and some other process keeps winning the race
for the pages that background kswapd reclaims (just a theory).

>
> So the question is: where in the __vmalloc allocation path is the
> ENOMEM error being generated from, and is it the same place every
> time?
>

I can't say for sure, but in this case, after looking at the code and
knowing that we are not passing __GFP_DIRECT_RECLAIM, it is likely
returning from here (after get_page_from_freelist() fails to find a
free page):

__alloc_pages_slowpath() {
	...
	/* Caller is not willing to reclaim, we can't balance anything */
	if (!can_direct_reclaim)
		goto nopage;
	...
}


So, with the above data, I think passing __GFP_DIRECT_RECLAIM in the
vmalloc fallback path might help in this case. Either way, we already
retry indefinitely until a page is allocated, so we may as well pass
the __GFP_DIRECT_RECLAIM flag here, right?

fallback:
	for (;;) {
		bp->b_addr = __vmalloc(size, gfp_mask);
		if (bp->b_addr)
			break;
		if (flags & XBF_READ_AHEAD)
			return -ENOMEM;
		XFS_STATS_INC(bp->b_mount, xb_page_retries);
		memalloc_retry_wait(gfp_mask);
	}

Thoughts?

I am not sure how easily this issue reproduces at Matthew's end, but
let me also set up a kvm guest with the same kernel version and see if
I can replicate it here with an overnight run of xfs/285 in a loop.


-ritesh

Thread overview: 9+ messages
2026-04-03 15:35 Hang with xfs/285 on 2026-03-02 kernel Matthew Wilcox
2026-04-04 11:42 ` Dave Chinner
2026-04-04 20:40   ` Matthew Wilcox
2026-04-05 22:29     ` Dave Chinner
2026-04-05  1:03   ` Ritesh Harjani
2026-04-05 22:16     ` Dave Chinner
2026-04-06  0:27       ` Ritesh Harjani [this message]
2026-04-06 21:45         ` Dave Chinner
2026-04-07  5:41 ` Christoph Hellwig
