From: Kiryl Shutsemau <kirill@shutemov.name>
To: Chris Arges <carges@cloudflare.com>
Cc: Matthew Wilcox <willy@infradead.org>,
akpm@linux-foundation.org, william.kucharski@oracle.com,
linux-fsdevel@vger.kernel.org, linux-mm@kvack.org,
linux-kernel@vger.kernel.org, kernel-team@cloudflare.com
Subject: Re: [PATCH RFC 1/1] mm/filemap: handle large folio split race in page cache lookups
Date: Mon, 30 Mar 2026 11:02:31 +0000 [thread overview]
Message-ID: <acpX8iUEjNT6Kz25@thinkstation> (raw)
In-Reply-To: <acFr4JJgaXsD9gtQ@20HS2G4>
On Mon, Mar 23, 2026 at 11:35:44AM -0500, Chris Arges wrote:
> On 2026-03-06 20:21:59, Kiryl Shutsemau wrote:
> > On Fri, Mar 06, 2026 at 02:11:22PM -0600, Chris Arges wrote:
> > > On 2026-03-06 16:28:19, Matthew Wilcox wrote:
> > > > On Fri, Mar 06, 2026 at 02:13:26PM +0000, Kiryl Shutsemau wrote:
> > > > > On Thu, Mar 05, 2026 at 07:24:38PM +0000, Matthew Wilcox wrote:
> > > > > > folio_split() needs to be sure that it's the only one holding a reference
> > > > > > to the folio. To that end, it calculates the expected refcount of the
> > > > > > folio, and freezes it (sets the refcount to 0 if the refcount is the
> > > > > > expected value). Once filemap_get_entry() has incremented the refcount,
> > > > > > freezing will fail.
> > > > > >
> > > > > > But of course, we can race. filemap_get_entry() can load a folio first,
> > > > > > the entire folio_split can happen, then it calls folio_try_get() and
> > > > > > succeeds, but it no longer covers the index we were looking for. That's
> > > > > > what the xas_reload() is trying to prevent -- if the index is for a
> > > > > > folio which has changed, then the xas_reload() should come back with a
> > > > > > different folio and we goto repeat.
> > > > > >
> > > > > > So how did we get through this with a reference to the wrong folio?
> > > > >
> > > > > What would xas_reload() return if we raced with split and index pointed
> > > > > to a tail page before the split?
> > > > >
> > > > > Wouldn't it return the folio that was a head and check will pass?
> > > >
> > > > It's not supposed to return the head in this case. But, check the code:
> > > >
> > > > if (!node)
> > > > return xa_head(xas->xa);
> > > > if (IS_ENABLED(CONFIG_XARRAY_MULTI)) {
> > > > offset = (xas->xa_index >> node->shift) & XA_CHUNK_MASK;
> > > > entry = xa_entry(xas->xa, node, offset);
> > > > if (!xa_is_sibling(entry))
> > > > return entry;
> > > > offset = xa_to_sibling(entry);
> > > > }
> > > > return xa_entry(xas->xa, node, offset);
> > > >
> > > > (obviously CONFIG_XARRAY_MULTI is enabled)
> > > >
> > > Yes we have this CONFIG enabled.
> > >
> > > Also FWIW, happy to run some additional experiments or more debugging. We _can_
> > > reproduce this, as a machine hits this about every day on a sample of ~128
> > > machines. We also do get crashdumps so we can poke around there as needed.
> > >
> > > I was going to deploy this patch onto a subset of machines, but reading through
> > > this thread I'm a bit concerned if a retry doesn't actually fix the problem,
> > > then we will just loop on this condition and hang.
> >
> > I would be useful to know if the condition is persistent or if retry
> > "fixes" the problem.
>
> I was able to deploy my patch into a set of machines and test from March 11th
> until now. So far it seems like this patch addresses this issue. While removing
> the BUG_ON means that we will no longer see the call trace messages, I looked
> for any lockups that would be related folio/filesystem activities and did not
> find any.
>
> Let me know what else would be useful here, I am happy to re-propose my patch
> without the RFC, unless more verification/analysis is needed.
I wounder if 577a1f495fd7 ("mm/huge_memory: fix a folio_split() race
condition with folio_try_get()") is relevant here.
Do you have it applied on the tree where the problem triggers?
--
Kiryl Shutsemau / Kirill A. Shutemov
prev parent reply other threads:[~2026-03-30 11:02 UTC|newest]
Thread overview: 13+ messages / expand[flat|nested] mbox.gz Atom feed top
2026-03-05 18:34 [PATCH RFC 0/1] fix for large folio split race in page cache Chris J Arges
2026-03-05 18:34 ` [PATCH RFC 1/1] mm/filemap: handle large folio split race in page cache lookups Chris J Arges
2026-03-05 19:24 ` Matthew Wilcox
2026-03-06 14:13 ` Kiryl Shutsemau
2026-03-06 16:28 ` Matthew Wilcox
2026-03-06 18:36 ` Kiryl Shutsemau
2026-03-06 18:41 ` Matthew Wilcox
2026-03-06 20:20 ` Kiryl Shutsemau
2026-03-06 20:11 ` Chris Arges
2026-03-06 20:21 ` Kiryl Shutsemau
2026-03-06 20:58 ` Chris Arges
2026-03-23 16:35 ` Chris Arges
2026-03-30 11:02 ` Kiryl Shutsemau [this message]
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=acpX8iUEjNT6Kz25@thinkstation \
--to=kirill@shutemov.name \
--cc=akpm@linux-foundation.org \
--cc=carges@cloudflare.com \
--cc=kernel-team@cloudflare.com \
--cc=linux-fsdevel@vger.kernel.org \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-mm@kvack.org \
--cc=william.kucharski@oracle.com \
--cc=willy@infradead.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.