* Re: 6.9/BUG: Bad page state in process kswapd0 pfn:d6e840
       [not found] ` <209ff705-fe6e-4d6d-9d08-201afba7d74b@redhat.com>
@ 2024-05-29  6:57 ` David Hildenbrand
  2024-05-29 19:00   ` David Sterba
  2024-05-29 22:37   ` Qu Wenruo

  0 siblings, 2 replies; 4+ messages in thread
From: David Hildenbrand @ 2024-05-29  6:57 UTC (permalink / raw)
  To: Mikhail Gavrilov, Chris Mason, Josef Bacik, David Sterba
  Cc: Linux List Kernel Mailing, Linux Memory Management List,
      Matthew Wilcox, linux-btrfs

On 28.05.24 16:24, David Hildenbrand wrote:
> On 28.05.24 at 15:57, David Hildenbrand wrote:
>> On 28.05.24 at 08:05, Mikhail Gavrilov wrote:
>>> On Thu, May 23, 2024 at 12:05 PM Mikhail Gavrilov
>>> <mikhail.v.gavrilov@gmail.com> wrote:
>>>>
>>>> On Thu, May 9, 2024 at 10:50 PM David Hildenbrand <david@redhat.com> wrote:
>>>>
>>>> The only known workload that causes this is updating a large
>>>> container. Unfortunately, not every container update reproduces the
>>>> problem.
>>>
>>> Is it possible to add more debugging information to make it clearer
>>> what's going on?
>>
>> If we knew who originally allocated that problematic page, that might
>> help. Maybe page_owner could give some hints?
>>
>>> BUG: Bad page state in process kcompactd0  pfn:605811
>>> page: refcount:0 mapcount:0 mapping:0000000082d91e3e index:0x1045efc4f pfn:0x605811
>>> aops:btree_aops ino:1
>>> flags: 0x17ffffc600020c(referenced|uptodate|workingset|node=0|zone=2|lastcpupid=0x1fffff)
>>> raw: 0017ffffc600020c dead000000000100 dead000000000122 ffff888159075220
>>> raw: 00000001045efc4f 0000000000000000 00000000ffffffff 0000000000000000
>>> page dumped because: non-NULL mapping
>>
>> Seems to be an order-0 page, otherwise we would have another
>> "head: ..." report.
>>
>> It's not an anon/ksm/non-lru migration folio, because we clear the
>> page->mapping field for them manually on the page freeing path.
>> Likely it's a pagecache folio.
>>
>> So one option is that something fails to properly set folio->mapping
>> to NULL. But wouldn't that problem then also show up without page
>> migration? Hmm.
>>
>>> Hardware name: ASUS System Product Name/ROG STRIX B650E-I GAMING WIFI,
>>> BIOS 2611 04/07/2024
>>> Call Trace:
>>>  <TASK>
>>>  dump_stack_lvl+0x84/0xd0
>>>  bad_page.cold+0xbe/0xe0
>>>  ? __pfx_bad_page+0x10/0x10
>>>  ? page_bad_reason+0x9d/0x1f0
>>>  free_unref_page+0x838/0x10e0
>>>  __folio_put+0x1ba/0x2b0
>>>  ? __pfx___folio_put+0x10/0x10
>>>  ? __pfx___might_resched+0x10/0x10
>>
>> I suspect we come via
>> migrate_pages_batch()->migrate_folio_unmap()->migrate_folio_done().
>>
>> Maybe this is the "Folio was freed from under us. So we are done."
>> path taken when "folio_ref_count(src) == 1".
>>
>> Alternatively, we might come via
>> migrate_pages_batch()->migrate_folio_move()->migrate_folio_done().
>>
>> For ordinary migration, move_to_new_folio() will clear src->mapping
>> if the folio was migrated successfully. That's the very first thing
>> migrate_folio_move() does, so I doubt that is the problem.
>>
>> So I suspect we are in the migrate_folio_unmap() path. But for a
>> !anon folio, who would be freeing the folio concurrently (without
>> clearing folio->mapping)? After all, we have to hold the folio lock
>> while migrating.
>>
>> In khugepaged:collapse_file() we manually set folio->mapping = NULL
>> before dropping the reference.
>>
>> Something to try might be (to see if the problem goes away):
>>
>> diff --git a/mm/migrate.c b/mm/migrate.c
>> index dd04f578c19c..45e92e14c904 100644
>> --- a/mm/migrate.c
>> +++ b/mm/migrate.c
>> @@ -1124,6 +1124,13 @@ static int migrate_folio_unmap(new_folio_t get_new_folio,
>>                 /* Folio was freed from under us. So we are done. */
>>                 folio_clear_active(src);
>>                 folio_clear_unevictable(src);
>> +               /*
>> +                * Anonymous and movable src->mapping will be cleared
>> +                * by free_pages_prepare(), so don't reset it here;
>> +                * that keeps type checks such as PageAnon() working.
>> +                */
>> +               if (!folio_mapping_flags(src))
>> +                       src->mapping = NULL;
>>                 /* free_pages_prepare() will clear PG_isolated. */
>>                 list_del(&src->lru);
>>                 migrate_folio_done(src, reason);
>>
>> But it does feel weird: who freed the page concurrently and didn't
>> clear folio->mapping ...
>>
>> We don't hold the folio lock of src, though, but have the only
>> reference. So another possibility might be folio refcount
>> mis-counting: folio_ref_count() == 1, but there are other references
>> (e.g., from the pagecache).
>
> Hmm, your original report mentions kswapd, so I'm getting the feeling
> someone does one folio_put() too much and we are freeing a pagecache
> folio that is still in the pagecache and, therefore, has
> folio->mapping set ... bisecting would really help.

A little bird just told me that I missed an important piece in the
dmesg output: "aops:btree_aops ino:1" from dump_mapping():

This is btrfs, i_ino is 1, and we don't have a dentry. Is that
BTRFS_BTREE_INODE_OBJECTID?

Summarizing what we know so far:

(1) Freeing an order-0 btrfs folio where folio->mapping is still set.
(2) Triggered by kswapd and kcompactd; not triggered by other means of
    page freeing so far.

Possible theories:

(A) folio->mapping is not cleared when freeing the folio. But
    shouldn't this also happen on other freeing paths? Or are we
    simply lucky to never trigger it for that folio?
(B) Messed-up refcounting: freeing a folio that is still in use (and
    therefore has folio->mapping still set).

I was briefly wondering if large folio splitting could be involved.

CCing btrfs maintainers.

-- 
Cheers,

David / dhildenb
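The folio_mapping_flags() test in the patch above keys off the low
bits of folio->mapping, which encode the folio type. A condensed
sketch of that encoding, paraphrased from include/linux/page-flags.h
(illustrative rather than a verbatim copy of the header):

        /*
         * folio->mapping is at least 4-byte aligned, so the two low
         * bits are free to carry type information instead.
         */
        #define PAGE_MAPPING_ANON      0x1     /* points to an anon_vma */
        #define PAGE_MAPPING_MOVABLE   0x2     /* non-lru movable folio */
        #define PAGE_MAPPING_KSM       (PAGE_MAPPING_ANON | PAGE_MAPPING_MOVABLE)
        #define PAGE_MAPPING_FLAGS     (PAGE_MAPPING_ANON | PAGE_MAPPING_MOVABLE)

        /*
         * True for anon/ksm/movable folios, whose "mapping" is not an
         * address_space and is cleared by the freeing path itself;
         * false for pagecache folios, whose mapping points to a real
         * struct address_space and must be NULL by the time the folio
         * is freed -- hence the "non-NULL mapping" report above.
         */
        static __always_inline bool folio_mapping_flags(const struct folio *folio)
        {
                return ((unsigned long)folio->mapping & PAGE_MAPPING_FLAGS) != 0;
        }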
* Re: 6.9/BUG: Bad page state in process kswapd0 pfn:d6e840
  2024-05-29  6:57 ` 6.9/BUG: Bad page state in process kswapd0 pfn:d6e840 David Hildenbrand
@ 2024-05-29 19:00   ` David Sterba
  2024-05-29 22:37   ` Qu Wenruo

  1 sibling, 0 replies; 4+ messages in thread
From: David Sterba @ 2024-05-29 19:00 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Mikhail Gavrilov, Chris Mason, Josef Bacik, David Sterba,
      Linux List Kernel Mailing, Linux Memory Management List,
      Matthew Wilcox, linux-btrfs

On Wed, May 29, 2024 at 08:57:48AM +0200, David Hildenbrand wrote:
> On 28.05.24 16:24, David Hildenbrand wrote:
> > Hmm, your original report mentions kswapd, so I'm getting the
> > feeling someone does one folio_put() too much and we are freeing a
> > pagecache folio that is still in the pagecache and, therefore, has
> > folio->mapping set ... bisecting would really help.
>
> A little bird just told me that I missed an important piece in the
> dmesg output: "aops:btree_aops ino:1" from dump_mapping():
>
> This is btrfs, i_ino is 1, and we don't have a dentry. Is that
> BTRFS_BTREE_INODE_OBJECTID?

Yes, that's right: inode number 1 represents the metadata.

> Summarizing what we know so far:
>
> (1) Freeing an order-0 btrfs folio where folio->mapping is still set.
> (2) Triggered by kswapd and kcompactd; not triggered by other means
>     of page freeing so far.
>
> Possible theories:
>
> (A) folio->mapping is not cleared when freeing the folio. But
>     shouldn't this also happen on other freeing paths? Or are we
>     simply lucky to never trigger it for that folio?
> (B) Messed-up refcounting: freeing a folio that is still in use (and
>     therefore has folio->mapping still set).
>
> I was briefly wondering if large folio splitting could be involved.

We do not have large folios enabled for btrfs; the conversion from
pages to folios is still ongoing. Given the increased number of
strange reports, whether from syzbot or from others, it seems that
something went wrong in the 6.10-rc update, or maybe earlier.
* Re: 6.9/BUG: Bad page state in process kswapd0 pfn:d6e840
  2024-05-29  6:57 ` 6.9/BUG: Bad page state in process kswapd0 pfn:d6e840 David Hildenbrand
  2024-05-29 19:00   ` David Sterba
@ 2024-05-29 22:37   ` Qu Wenruo
  2024-05-30  5:26     ` Qu Wenruo

  1 sibling, 1 reply; 4+ messages in thread
From: Qu Wenruo @ 2024-05-29 22:37 UTC (permalink / raw)
  To: David Hildenbrand, Mikhail Gavrilov, Chris Mason, Josef Bacik,
      David Sterba
  Cc: Linux List Kernel Mailing, Linux Memory Management List,
      Matthew Wilcox, linux-btrfs

On 2024/5/29 16:27, David Hildenbrand wrote:
> On 28.05.24 16:24, David Hildenbrand wrote:

[...]

>> Hmm, your original report mentions kswapd, so I'm getting the
>> feeling someone does one folio_put() too much and we are freeing a
>> pagecache folio that is still in the pagecache and, therefore, has
>> folio->mapping set ... bisecting would really help.
>
> A little bird just told me that I missed an important piece in the
> dmesg output: "aops:btree_aops ino:1" from dump_mapping():
>
> This is btrfs, i_ino is 1, and we don't have a dentry. Is that
> BTRFS_BTREE_INODE_OBJECTID?
>
> Summarizing what we know so far:
>
> (1) Freeing an order-0 btrfs folio where folio->mapping is still set.
> (2) Triggered by kswapd and kcompactd; not triggered by other means
>     of page freeing so far.

From the implementation of filemap_migrate_folio() (and the older
migrate_page_move_mapping()), it looks like migration only involves:

- migrating the mapping
- copying the page private value
- copying the contents (if needed)
- copying all the page flags

The most recent change to the migration code is from v6.0, which I do
not believe is the cause at all.

> Possible theories:
>
> (A) folio->mapping is not cleared when freeing the folio. But
>     shouldn't this also happen on other freeing paths? Or are we
>     simply lucky to never trigger it for that folio?

Yeah, in fact we never manually clear folio->mapping inside btrfs,
thus I'm not sure that is the case.

> (B) Messed-up refcounting: freeing a folio that is still in use (and
>     therefore has folio->mapping still set).
>
> I was briefly wondering if large folio splitting could be involved.

Although we have all the metadata support for larger folios, we do not
yet enable them.

My current guess is: could it be some race with this commit?

  09e6cef19c9f ("btrfs: refactor alloc_extent_buffer() to
  allocate-then-attach method")

For example, while we're allocating an extent buffer (btrfs' metadata
structure) and one page is already attached to the page cache, could
that page be migrated before the remaining pages are attached?

The commit first appeared in v6.8, matching the earliest report. But
that patch is not easy to revert.

Do you have a more reliable reproducer, or an extra way to debug the
lifespan of that specific page? Or is there any way to temporarily
disable migration?

Thanks,
Qu

>
> CCing btrfs maintainers.
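To make the suspected window concrete, here is a minimal sketch of the
allocate-then-attach shape described above. It is a hypothetical
simplification, not the actual alloc_extent_buffer() code:
attach_eb_folios() and its exact loop structure are illustrative.

        /*
         * Hypothetical simplification of the allocate-then-attach
         * pattern under suspicion -- not the real alloc_extent_buffer().
         * An extent buffer spans several order-0 pages, each attached
         * to the btree inode's page cache individually.
         */
        static int attach_eb_folios(struct extent_buffer *eb,
                                    struct address_space *mapping)
        {
                pgoff_t index = eb->start >> PAGE_SHIFT;
                int i, ret;

                for (i = 0; i < num_extent_folios(eb); i++) {
                        /* Folio i becomes visible in the page cache here... */
                        ret = filemap_add_folio(mapping, eb->folios[i],
                                                index + i, GFP_NOFS);
                        if (ret)
                                return ret;
                        /*
                         * ...but the later folios are not attached yet.
                         * If kswapd/kcompactd picks folio i up in this
                         * window and the error/cleanup path drops one
                         * reference too many, the folio can reach the
                         * allocator with folio->mapping still pointing
                         * at the btree inode -- matching the "non-NULL
                         * mapping" report.
                         */
                }
                return 0;
        }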
* Re: 6.9/BUG: Bad page state in process kswapd0 pfn:d6e840
  2024-05-29 22:37   ` Qu Wenruo
@ 2024-05-30  5:26     ` Qu Wenruo

  0 siblings, 0 replies; 4+ messages in thread
From: Qu Wenruo @ 2024-05-30  5:26 UTC (permalink / raw)
  To: David Hildenbrand, Mikhail Gavrilov, Chris Mason, Josef Bacik,
      David Sterba
  Cc: Linux List Kernel Mailing, Linux Memory Management List,
      Matthew Wilcox, linux-btrfs

On 2024/5/30 08:07, Qu Wenruo wrote:
> On 2024/5/29 16:27, David Hildenbrand wrote:
>>
>> A little bird just told me that I missed an important piece in the
>> dmesg output: "aops:btree_aops ino:1" from dump_mapping():
>>
>> This is btrfs, i_ino is 1, and we don't have a dentry. Is that
>> BTRFS_BTREE_INODE_OBJECTID?
>>
>> Summarizing what we know so far:
>>
>> (1) Freeing an order-0 btrfs folio where folio->mapping is still set.
>> (2) Triggered by kswapd and kcompactd; not triggered by other means
>>     of page freeing so far.
>
> From the implementation of filemap_migrate_folio() (and the older
> migrate_page_move_mapping()), it looks like migration only involves:
>
> - migrating the mapping
> - copying the page private value
> - copying the contents (if needed)
> - copying all the page flags
>
> The most recent change to the migration code is from v6.0, which I do
> not believe is the cause at all.
>
>> Possible theories:
>>
>> (A) folio->mapping is not cleared when freeing the folio. But
>>     shouldn't this also happen on other freeing paths? Or are we
>>     simply lucky to never trigger it for that folio?
>
> Yeah, in fact we never manually clear folio->mapping inside btrfs,
> thus I'm not sure that is the case.
>
>> (B) Messed-up refcounting: freeing a folio that is still in use (and
>>     therefore has folio->mapping still set).
>>
>> I was briefly wondering if large folio splitting could be involved.
>
> Although we have all the metadata support for larger folios, we do
> not yet enable them.

After some extra code digging and tons of trace_printk(), it indeed
looks like btrfs is underflowing the folio refcount.

During the lifespan of an extent buffer (btrfs' metadata), it should
hold at least 3 refs after being attached to the address space:

1) folio_alloc() inside btrfs_alloc_folio_array()
2) folio_ref_add() inside __filemap_add_folio()
3) folio_add_lru() inside filemap_add_folio()

Even when btrfs wants to release the folio of an eb, we only:

- detach folio::private
- call folio_put()

So even if an eb got released, as long as it is not yet detached from
the filemap, its refcount should still be >= 2.

Thus the warning is indeed correct: somehow btrfs called an extra
folio_put() on an eb page that is still attached to the btree inode.

I'll continue digging around the eb folio refs inside btrfs; meanwhile
I will also test some extra checks on the refcount of eb folios.

Thanks,
Qu
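An extra check of the kind mentioned at the end could look roughly
like the following. This is a sketch under the refcount assumptions
listed above, not a posted patch; check_eb_folio_refs() is a
hypothetical name, and the exact threshold is illustrative.

        /*
         * Sketch of an extra refcount check along the lines described
         * above -- hypothetical, not a posted patch. While an eb folio
         * is still attached to the btree inode's mapping, the page
         * cache holds a reference, and the attached eb
         * (folio->private) accounts for another; a lower count means
         * someone already dropped a reference they did not own.
         */
        static void check_eb_folio_refs(struct folio *folio)
        {
                int expected = 1;       /* the page cache reference */

                if (folio_test_private(folio))
                        expected++;     /* the attached extent buffer */

                WARN_ON_ONCE(folio->mapping &&
                             folio_ref_count(folio) < expected);
        }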