* Large folios and filemap_get_folios_contig()
From: Qu Wenruo @ 2025-04-03 9:36 UTC
To: Linux Memory Management List, linux-fsdevel@vger.kernel.org,
linux-btrfs
Cc: vivek.kasireddy, Andrew Morton
Hi,
Recently I hit a bug while developing large folio support for btrfs.
We call filemap_get_folios_contig(), then lock each returned folio.
(We also have a case where we unlock each returned folio.)
However, since a large folio can be returned several times in the batch,
this causes a deadlock, as btrfs tries to lock the same folio more than
once.
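Roughly, the locking pattern on our side looks like this (a minimal
sketch of the idea, not the actual btrfs lock_delalloc_folios() code;
variable names and error handling are simplified):

    struct folio_batch fbatch;
    unsigned int i, found;

    folio_batch_init(&fbatch);
    while (index <= end_index) {
            found = filemap_get_folios_contig(mapping, &index,
                                              end_index, &fbatch);
            if (!found)
                    break;
            for (i = 0; i < found; i++) {
                    /*
                     * A large folio can occupy several batch slots,
                     * so the second folio_lock() on the same folio
                     * here deadlocks.
                     */
                    folio_lock(fbatch.folios[i]);
            }
            folio_batch_release(&fbatch);
    }
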
Then I looked into the caller of filemap_get_folios_contig() inside
mm/gup, and it indeed does the correct skip.
This makes me wonder: since we have large folios, why do we still call
filemap_get_folios_contig() and then have to skip duplicated large folios?
Isn't the purpose of large folios to handle a much larger range in just
one go, without going through multiple pages?
And there are only 3 call sites; two of them are nilfs and ramfs,
neither of which supports large folios. The only caller with large
folio support is memfd_pin_folios(), which skips duplicated folios
manually.
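From memory, that manual skip is along these lines (again a simplified
sketch, not the exact mm/gup code):

    struct folio *prev = NULL;

    for (i = 0; i < found; i++) {
            struct folio *folio = fbatch.folios[i];

            /*
             * A large folio fills several consecutive batch slots;
             * only process each distinct folio once.
             */
            if (folio == prev)
                    continue;
            prev = folio;
            /* ... pin/process the folio ... */
    }
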
I'm wondering if it's possible to make filemap_get_folios_contig()
avoid filling the batch with duplicated folios entirely?
Thanks,
Qu
* Re: Large folios and filemap_get_folios_contig()
From: Matthew Wilcox @ 2025-04-03 12:35 UTC
To: Qu Wenruo
Cc: Linux Memory Management List, linux-fsdevel@vger.kernel.org,
linux-btrfs, vivek.kasireddy, Andrew Morton
On Thu, Apr 03, 2025 at 08:06:53PM +1030, Qu Wenruo wrote:
> Recently I hit a bug while developing large folio support for btrfs.
>
> We call filemap_get_folios_contig(), then lock each returned folio.
> (We also have a case where we unlock each returned folio.)
>
> However, since a large folio can be returned several times in the batch,
> this causes a deadlock, as btrfs tries to lock the same folio more than
> once.
Sorry, what? A large folio should only be returned once. xas_next()
moves to the next folio. How is it possible that
filemap_get_folios_contig() returns the same folio more than once?
> Then I looked into the caller of filemap_get_folios_contig() inside
> mm/gup, and it indeed does the correct skip.
... that code looks wrong to me.
* Re: Large folios and filemap_get_folios_contig()
From: Qu Wenruo @ 2025-04-03 21:16 UTC
To: Matthew Wilcox, Qu Wenruo
Cc: Linux Memory Management List, linux-fsdevel@vger.kernel.org,
linux-btrfs, vivek.kasireddy, Andrew Morton
On 2025/4/3 23:05, Matthew Wilcox wrote:
> On Thu, Apr 03, 2025 at 08:06:53PM +1030, Qu Wenruo wrote:
>> Recently I hit a bug while developing large folio support for btrfs.
>>
>> We call filemap_get_folios_contig(), then lock each returned folio.
>> (We also have a case where we unlock each returned folio.)
>>
>> However, since a large folio can be returned several times in the batch,
>> this causes a deadlock, as btrfs tries to lock the same folio more than
>> once.
>
> Sorry, what? A large folio should only be returned once. xas_next()
> moves to the next folio. How is it possible that
> filemap_get_folios_contig() returns the same folio more than once?
But that's exactly what I got from filemap_get_folios_contig():
lock_delalloc_folios: r/i=5/260 locked_folio=720896(65536) start=782336
end=819199(36864)
lock_delalloc_folios: r/i=5/260 found_folios=1
lock_delalloc_folios: r/i=5/260 i=0 folio=720896(65536)
lock_delalloc_folios: r/i=5/260 found_folios=8
lock_delalloc_folios: r/i=5/260 i=0 folio=786432(262144)
lock_delalloc_folios: r/i=5/260 i=1 folio=786432(262144)
lock_delalloc_folios: r/i=5/260 i=2 folio=786432(262144)
lock_delalloc_folios: r/i=5/260 i=3 folio=786432(262144)
lock_delalloc_folios: r/i=5/260 i=4 folio=786432(262144)
lock_delalloc_folios: r/i=5/260 i=5 folio=786432(262144)
lock_delalloc_folios: r/i=5/260 i=6 folio=786432(262144)
lock_delalloc_folios: r/i=5/260 i=7 folio=786432(262144)
r/i are the root and inode numbers from btrfs; you can ignore them.
@locked_folio is the folio we're already holding a lock on; the value
inside the brackets is the folio size.
@start and @end are the range we're searching; the value inside the
brackets is the search range length.
The first iteration returns the currently locked folio, and since the
range inside that folio is only 4K, it's returned only once.
The next 8 slots are all inside the same large folio at 786432,
resulting in duplicated entries.
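Translating the byte offsets into page indexes makes the duplication
obvious: the search range 782336..819199 covers indexes 191..199.
The locked 64K folio at 720896 spans indexes 176..191, so only index
191 matches it and it is returned once; the 256K folio at 786432 spans
indexes 192..255, so all eight remaining indexes (192..199) resolve to
it, which is exactly the eight duplicated slots above.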
>
>> Then I looked into the caller of filemap_get_folios_contig() inside
>> mm/gup, and it indeed does the correct skip.
>
> ... that code looks wrong to me.
It looks like xas_find() does the correct skip by calling
xas_next_offset() -> xas_move_index() to skip to the next entry.
But filemap_get_folios_contig() only calls xas_next(), which increments
the index instead of skipping to the next folio.
Although I could be totally wrong, as I'm not familiar with the xarray
internals at all.
However, I totally agree the duplication behavior (and the extra
handling of duplicated entries) looks very wrong.
Thanks,
Qu
* Re: Large folios and filemap_get_folios_contig()
From: Vishal Moola (Oracle) @ 2025-04-04 0:50 UTC
To: Qu Wenruo
Cc: Matthew Wilcox, Qu Wenruo, Linux Memory Management List,
linux-fsdevel@vger.kernel.org, linux-btrfs, vivek.kasireddy,
Andrew Morton
On Fri, Apr 04, 2025 at 07:46:59AM +1030, Qu Wenruo wrote:
>
>
> On 2025/4/3 23:05, Matthew Wilcox wrote:
> > On Thu, Apr 03, 2025 at 08:06:53PM +1030, Qu Wenruo wrote:
> > > Recently I hit a bug while developing large folio support for btrfs.
> > >
> > > We call filemap_get_folios_contig(), then lock each returned folio.
> > > (We also have a case where we unlock each returned folio.)
> > >
> > > However, since a large folio can be returned several times in the batch,
> > > this causes a deadlock, as btrfs tries to lock the same folio more than
> > > once.
> >
> > Sorry, what? A large folio should only be returned once. xas_next()
> > moves to the next folio. How is it possible that
> > filemap_get_folios_contig() returns the same folio more than once?
>
> But that's exactly what I got from filemap_get_folios_contig():
>
> lock_delalloc_folios: r/i=5/260 locked_folio=720896(65536) start=782336
> end=819199(36864)
> lock_delalloc_folios: r/i=5/260 found_folios=1
> lock_delalloc_folios: r/i=5/260 i=0 folio=720896(65536)
> lock_delalloc_folios: r/i=5/260 found_folios=8
> lock_delalloc_folios: r/i=5/260 i=0 folio=786432(262144)
> lock_delalloc_folios: r/i=5/260 i=1 folio=786432(262144)
> lock_delalloc_folios: r/i=5/260 i=2 folio=786432(262144)
> lock_delalloc_folios: r/i=5/260 i=3 folio=786432(262144)
> lock_delalloc_folios: r/i=5/260 i=4 folio=786432(262144)
> lock_delalloc_folios: r/i=5/260 i=5 folio=786432(262144)
> lock_delalloc_folios: r/i=5/260 i=6 folio=786432(262144)
> lock_delalloc_folios: r/i=5/260 i=7 folio=786432(262144)
>
> r/i are the root and inode numbers from btrfs; you can ignore them.
>
> @locked_folio is the folio we're already holding a lock on; the value
> inside the brackets is the folio size.
>
> @start and @end are the range we're searching; the value inside the
> brackets is the search range length.
>
> The first iteration returns the currently locked folio, and since the
> range inside that folio is only 4K, it's returned only once.
>
> The next 8 slots are all inside the same large folio at 786432,
> resulting in duplicated entries.
>
> >
> > > Then I looked into the caller of filemap_get_folios_contig() inside
> > > mm/gup, and it indeed does the correct skip.
> >
> > ... that code looks wrong to me.
>
> It looks like xas_find() does the correct skip by calling
> xas_next_offset() -> xas_move_index() to skip to the next entry.
>
> But filemap_get_folios_contig() only calls xas_next(), which increments
> the index instead of skipping to the next folio.
>
> Although I could be totally wrong, as I'm not familiar with the xarray
> internals at all.
Thanks for bringing this to my attention; it looks like this is due to a
mistake in my folio conversion of this function. The xas_next() call
moves to the next index, but if that index is part of a multi-index
entry, things get awkward unless we manually account for the index
shift of a large folio (which I missed).
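In other words, the loop needs to reposition the xa_state after a folio
is added to the batch. This is the one-liner the attached patch adds,
shown here just to illustrate the intent:

    /*
     * Position the xa_state at the last page index of this (possibly
     * large) folio, so the loop's xas_next() lands on the first index
     * after it instead of on another sub-index of the same
     * multi-index entry.
     */
    xas_advance(&xas, folio_next_index(folio) - 1);
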
Can you please try out this attached patch and see if it gets rid of
the duplicate problem?
> However, I totally agree the duplication behavior (and the extra
> handling of duplicated entries) looks very wrong.
>
> Thanks,
> Qu
[-- Attachment #2: 0001-Fix-filemap_get_folios_contig-returning-batches-of-i.patch --]
From 91e8353cee38b1624fc13587f6db5058d764d983 Mon Sep 17 00:00:00 2001
From: "Vishal Moola (Oracle)" <vishal.moola@gmail.com>
Date: Thu, 3 Apr 2025 16:54:17 -0700
Subject: [PATCH] Fix filemap_get_folios_contig returning batches of identical
folios
filemap_get_folios_contig() is supposed to return distinct folios
found within [start, end]. Large folios in the Xarray become multi-index
entries. xas_next() can iterate through the sub-indexes before finding
a sibling entry and breaking out of the loop.
This can result in a returned folio_batch containing an indeterminate
number of duplicate folios, which forces the callers to skeptically
handle the returned batch. This is inefficient and incurs a large
maintenance overhead.
We can fix this by calling xas_advance() after successfully adding a
folio to the batch, to ensure our Xarray is positioned such that it
will correctly find the next folio - similar to
filemap_get_read_batch().
Fixes: 35b471467f88 ("filemap: add filemap_get_folios_contig()")
Signed-off-by: Vishal Moola (Oracle) <vishal.moola@gmail.com>
Cc: <stable@vger.kernel.org>
---
mm/filemap.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/mm/filemap.c b/mm/filemap.c
index cc69f174f76b..bc7b28dfba3c 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2233,6 +2233,7 @@ unsigned filemap_get_folios_contig(struct address_space *mapping,
*start = folio->index + nr;
goto out;
}
+ xas_advance(&xas, folio_next_index(folio) - 1);
continue;
put_folio:
folio_put(folio);
--
2.48.1
* Re: Large folios and filemap_get_folios_contig()
From: Qu Wenruo @ 2025-04-04 4:15 UTC
To: Vishal Moola (Oracle)
Cc: Matthew Wilcox, Qu Wenruo, Linux Memory Management List,
linux-fsdevel@vger.kernel.org, linux-btrfs, vivek.kasireddy,
Andrew Morton
On 2025/4/4 11:20, Vishal Moola (Oracle) wrote:
> On Fri, Apr 04, 2025 at 07:46:59AM +1030, Qu Wenruo wrote:
>>
[...]
>>>> Then I looked into the caller of filemap_get_folios_contig() inside
>>>> mm/gup, and it indeed does the correct skip.
>>>
>>> ... that code looks wrong to me.
>>
>> It looks like xas_find() does the correct skip by calling
>> xas_next_offset() -> xas_move_index() to skip to the next entry.
>>
>> But filemap_get_folios_contig() only calls xas_next(), which increments
>> the index instead of skipping to the next folio.
>>
>> Although I could be totally wrong, as I'm not familiar with the xarray
>> internals at all.
>
> Thanks for bringing this to my attention; it looks like this is due to a
> mistake in my folio conversion of this function. The xas_next() call
> moves to the next index, but if that index is part of a multi-index
> entry, things get awkward unless we manually account for the index
> shift of a large folio (which I missed).
>
> Can you please try out this attached patch and see if it gets rid of
> the duplicate problem?
My reproducer no longer hangs, and the trace events indeed show each
folio returned only once.
Thanks for the quick fix,
Qu
>
>> However, I totally agree the duplication behavior (and the extra
>> handling of duplicated entries) looks very wrong.
>>
>> Thanks,
>> Qu