* [BUG Report] 6.15-rc1 RIP: 0010:__lruvec_stat_mod_folio+0x7e/0x250
@ 2025-04-09 0:20 Alison Schofield
2025-04-09 8:40 ` David Hildenbrand
0 siblings, 1 reply; 9+ messages in thread
From: Alison Schofield @ 2025-04-09 0:20 UTC (permalink / raw)
To: David Hildenbrand, Alistair Popple; +Cc: linux-mm, nvdimm
Hi David, because this bisected to a patch you posted
Hi Alistair, because vmf_insert_page_mkwrite() is in the path
A DAX unit test began failing on 6.15-rc1. I chased it as described below, but
need XFS and/or your Folio/tail page accounting knowledge to take it further.
A DAX XFS mapping that is SHARED and R/W fails when the folio is
unexpectedly NULL. Note that XFS PRIVATE always succeeds and XFS SHARED,
READ_ONLY works fine. Also note that all combinations work with EXT4.
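For reference, the failing access pattern boils down to roughly the sketch
below (illustrative only; the mount point, file name, and size are
placeholders, not the actual unit test):

/*
 * Minimal sketch: write-fault a MAP_SHARED, PROT_WRITE mapping of a file
 * on a DAX-mounted XFS filesystem. The memset() takes the
 * xfs_write_fault() -> dax_iomap_fault() path seen in the trace.
 */
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	size_t len = 2UL << 20;                     /* placeholder size */
	int fd = open("/mnt/dax/testfile", O_RDWR); /* hypothetical path */
	char *p;

	if (fd < 0 || ftruncate(fd, len))
		return 1;

	p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (p == MAP_FAILED)
		return 1;

	memset(p, 1, len);	/* write fault on the shared, writable DAX mapping */

	munmap(p, len);
	close(fd);
	return 0;
}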
[ 417.796271] BUG: kernel NULL pointer dereference, address: 0000000000000b00
[ 417.796982] #PF: supervisor read access in kernel mode
[ 417.797540] #PF: error_code(0x0000) - not-present page
[ 417.798123] PGD 2a5c5067 P4D 2a5c5067 PUD 2a5c6067 PMD 0
[ 417.798690] Oops: Oops: 0000 [#1] SMP NOPTI
[ 417.799178] CPU: 5 UID: 0 PID: 1515 Comm: mmap Tainted: G O 6.15.0-rc1-dirty #158 PREEMPT(voluntary)
[ 417.800150] Tainted: [O]=OOT_MODULE
[ 417.800583] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
[ 417.801358] RIP: 0010:__lruvec_stat_mod_folio+0x7e/0x250
[ 417.801948] Code: 85 97 00 00 00 48 8b 43 38 48 89 c3 48 83 e3 f8 a8 02 0f 85 1a 01 00 00 48 85 db 0f 84 28 01 00 00 66 90 49 63 86 80 3e 00 00 <48> 8b 9c c3 00 09 00 00 48 83 c3 40 4c 3b b3 c0 00 00 00 0f 85 68
[ 417.803662] RSP: 0000:ffffc90002be3a08 EFLAGS: 00010206
[ 417.804234] RAX: 0000000000000000 RBX: 0000000000000200 RCX: 0000000000000002
[ 417.804984] RDX: ffffffff815652d7 RSI: 0000000000000000 RDI: ffffffff82a2beae
[ 417.805689] RBP: ffffc90002be3a28 R08: 0000000000000000 R09: 0000000000000000
[ 417.806384] R10: ffffea0007000040 R11: ffff888376ffe000 R12: 0000000000000001
[ 417.807099] R13: 0000000000000012 R14: ffff88807fe4ab40 R15: ffff888029210580
[ 417.807801] FS: 00007f339fa7a740(0000) GS:ffff8881fa9b9000(0000) knlGS:0000000000000000
[ 417.808570] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 417.809193] CR2: 0000000000000b00 CR3: 000000002a4f0004 CR4: 0000000000370ef0
[ 417.809925] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 417.810622] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 417.811353] Call Trace:
[ 417.811709] <TASK>
[ 417.812038] folio_add_file_rmap_ptes+0x143/0x230
[ 417.812566] insert_page_into_pte_locked+0x1ee/0x3c0
[ 417.813132] insert_page+0x78/0xf0
[ 417.813558] vmf_insert_page_mkwrite+0x55/0xa0
[ 417.814088] dax_fault_iter+0x484/0x7b0
[ 417.814542] dax_iomap_pte_fault+0x1ca/0x620
[ 417.815055] dax_iomap_fault+0x39/0x40
[ 417.815499] __xfs_write_fault+0x139/0x380
[ 417.815995] ? __handle_mm_fault+0x5e5/0x1a60
[ 417.816483] xfs_write_fault+0x41/0x50
[ 417.816966] xfs_filemap_fault+0x3b/0xe0
[ 417.817424] __do_fault+0x31/0x180
[ 417.817859] __handle_mm_fault+0xee1/0x1a60
[ 417.818325] ? debug_smp_processor_id+0x17/0x20
[ 417.818844] handle_mm_fault+0xe1/0x2b0
[ 417.819286] do_user_addr_fault+0x217/0x630
[ 417.819747] ? rcu_is_watching+0x11/0x50
[ 417.820185] exc_page_fault+0x6c/0x210
[ 417.820599] asm_exc_page_fault+0x27/0x30
[ 417.821080] RIP: 0033:0x40130c
[ 417.821461] Code: 89 7d d8 48 89 75 d0 e8 94 ff ff ff 48 c7 45 f8 00 00 00 00 48 8b 45 d8 48 89 45 f0 eb 18 48 8b 45 f0 48 8d 50 08 48 89 55 f0 <48> c7 00 01 00 00 00 48 83 45 f8 01 48 8b 45 d0 48 c1 e8 03 48 39
[ 417.823156] RSP: 002b:00007ffcc82a8cb0 EFLAGS: 00010287
[ 417.823703] RAX: 00007f336f5f5000 RBX: 00007ffcc82a8f08 RCX: 0000000067f5a1da
[ 417.824382] RDX: 00007f336f5f5008 RSI: 0000000000000000 RDI: 0000000000036a98
[ 417.825096] RBP: 00007ffcc82a8ce0 R08: 00007f339fa84000 R09: 00000000004040b0
[ 417.825769] R10: 00007f339fa8a200 R11: 00007f339fa8a7b0 R12: 0000000000000000
[ 417.826438] R13: 00007ffcc82a8f28 R14: 0000000000403e18 R15: 00007f339fac3000
[ 417.827148] </TASK>
[ 417.827461] Modules linked in: nd_pmem(O) dax_pmem(O) nd_btt(O) nfit(O) nd_e820(O) libnvdimm(O) nfit_test_iomap(O)
[ 417.828404] CR2: 0000000000000b00
[ 417.828807] ---[ end trace 0000000000000000 ]---
[ 417.829293] RIP: 0010:__lruvec_stat_mod_folio+0x7e/0x250
And then, looking at the page passed to vmf_insert_page_mkwrite():
[ 55.468109] flags: 0x300000000002009(locked|uptodate|reserved|node=0|zone=3)
[ 55.468674] raw: 0300000000002009 ffff888028c27b20 00000000ffffffff ffff888033b69b88
[ 55.469270] raw: 000000000000fff5 0000000000000000 00000001ffffffff 0000000000000200
[ 55.469835] page dumped because: ALISON dump locked & uptodate pages
^ That's different: locked|uptodate. The other pages arriving here do not
have locked|uptodate set.
Git bisect says this is the first bad patch (6.14 --> 6.15-rc1):
4996fc547f5b ("mm: let _folio_nr_pages overlay memcg_data in first tail page")
Experimenting a bit with the patch, UN-defining NR_PAGES_IN_LARGE_FOLIO
avoids the problem.
The way that patch reuses memory in tail pages, and the fact that it only
fails on XFS (not ext4), suggests that XFS depends on tail pages in a way
that ext4 does not.
And that's as far as I've gotten.
-- Alison
* Re: [BUG Report] 6.15-rc1 RIP: 0010:__lruvec_stat_mod_folio+0x7e/0x250
2025-04-09 0:20 [BUG Report] 6.15-rc1 RIP: 0010:__lruvec_stat_mod_folio+0x7e/0x250 Alison Schofield
@ 2025-04-09 8:40 ` David Hildenbrand
2025-04-09 8:55 ` David Hildenbrand
2025-04-09 19:03 ` Dan Williams
0 siblings, 2 replies; 9+ messages in thread
From: David Hildenbrand @ 2025-04-09 8:40 UTC (permalink / raw)
To: Alison Schofield, Alistair Popple; +Cc: linux-mm, nvdimm
On 09.04.25 02:20, Alison Schofield wrote:
> Hi David, because this bisected to a patch you posted
> Hi Alistair, because vmf_insert_page_mkwrite() is in the path
Hi!
>
> A DAX unit test began failing on 6.15-rc1. I chased it as described below, but
> need XFS and/or your Folio/tail page accounting knowledge to take it further.
>
> A DAX XFS mappings that is SHARED and R/W fails when the folio is
> unexpectedly NULL. Note that XFS PRIVATE always succeeds and XFS SHARED,
> READ_ONLY works fine. Also note that it works all the ways with EXT4.
>
Huh, but why is the folio NULL?
insert_page_into_pte_locked() does "folio = page_folio(page)" and then
even calls folio_get(folio) before calling folio_add_file_rmap_pte().
folio_add_file_rmap_ptes()->__folio_add_file_rmap() just passes the
folio pointer along.
The RIP seems to be in __lruvec_stat_mod_folio(), so I assume we end up
in __folio_mod_stat()->__lruvec_stat_mod_folio().
There, we call folio_memcg(folio). Likely we're not getting NULL back,
which we could handle, but rather something bogus that leads to the fault
at "0000000000000b00".
So maybe it is the memcg we get that is "almost NULL", and not the folio?
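Roughly, and only as a simplified sketch (not the literal upstream code),
the lookup there is:

/*
 * Simplified sketch of folio_memcg(): the mem_cgroup pointer is derived
 * from folio->memcg_data by masking off the low flag bits. A stale
 * non-zero memcg_data therefore turns into an "almost NULL" pointer.
 */
static inline struct mem_cgroup *folio_memcg_sketch(struct folio *folio)
{
	unsigned long memcg_data = READ_ONCE(folio->memcg_data);

	return (struct mem_cgroup *)(memcg_data & ~0x7UL); /* mask simplified */
}

FWIW, the faulting instruction below appears to decode to a load from
rbx+rax*8+0x900 with RBX=0x200 and RAX=0, which lands exactly on
CR2=0xb00 -- consistent with a stale value of 0x200 being used as a
mem_cgroup pointer.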
> [ 417.796271] BUG: kernel NULL pointer dereference, address: 0000000000000b00
> [ 417.796982] #PF: supervisor read access in kernel mode
> [ 417.797540] #PF: error_code(0x0000) - not-present page
> [ 417.798123] PGD 2a5c5067 P4D 2a5c5067 PUD 2a5c6067 PMD 0
> [ 417.798690] Oops: Oops: 0000 [#1] SMP NOPTI
> [ 417.799178] CPU: 5 UID: 0 PID: 1515 Comm: mmap Tainted: G O 6.15.0-rc1-dirty #158 PREEMPT(voluntary)
> [ 417.800150] Tainted: [O]=OOT_MODULE
> [ 417.800583] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
> [ 417.801358] RIP: 0010:__lruvec_stat_mod_folio+0x7e/0x250
> [ 417.801948] Code: 85 97 00 00 00 48 8b 43 38 48 89 c3 48 83 e3 f8 a8 02 0f 85 1a 01 00 00 48 85 db 0f 84 28 01 00 00 66 90 49 63 86 80 3e 00 00 <48> 8b 9c c3 00 09 00 00 48 83 c3 40 4c 3b b3 c0 00 00 00 0f 85 68
> [ 417.803662] RSP: 0000:ffffc90002be3a08 EFLAGS: 00010206
> [ 417.804234] RAX: 0000000000000000 RBX: 0000000000000200 RCX: 0000000000000002
> [ 417.804984] RDX: ffffffff815652d7 RSI: 0000000000000000 RDI: ffffffff82a2beae
> [ 417.805689] RBP: ffffc90002be3a28 R08: 0000000000000000 R09: 0000000000000000
> [ 417.806384] R10: ffffea0007000040 R11: ffff888376ffe000 R12: 0000000000000001
> [ 417.807099] R13: 0000000000000012 R14: ffff88807fe4ab40 R15: ffff888029210580
> [ 417.807801] FS: 00007f339fa7a740(0000) GS:ffff8881fa9b9000(0000) knlGS:0000000000000000
> [ 417.808570] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 417.809193] CR2: 0000000000000b00 CR3: 000000002a4f0004 CR4: 0000000000370ef0
> [ 417.809925] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [ 417.810622] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> [ 417.811353] Call Trace:
> [ 417.811709] <TASK>
> [ 417.812038] folio_add_file_rmap_ptes+0x143/0x230
> [ 417.812566] insert_page_into_pte_locked+0x1ee/0x3c0
> [ 417.813132] insert_page+0x78/0xf0
> [ 417.813558] vmf_insert_page_mkwrite+0x55/0xa0
> [ 417.814088] dax_fault_iter+0x484/0x7b0
> [ 417.814542] dax_iomap_pte_fault+0x1ca/0x620
> [ 417.815055] dax_iomap_fault+0x39/0x40
> [ 417.815499] __xfs_write_fault+0x139/0x380
> [ 417.815995] ? __handle_mm_fault+0x5e5/0x1a60
> [ 417.816483] xfs_write_fault+0x41/0x50
> [ 417.816966] xfs_filemap_fault+0x3b/0xe0
> [ 417.817424] __do_fault+0x31/0x180
> [ 417.817859] __handle_mm_fault+0xee1/0x1a60
> [ 417.818325] ? debug_smp_processor_id+0x17/0x20
> [ 417.818844] handle_mm_fault+0xe1/0x2b0
> [ 417.819286] do_user_addr_fault+0x217/0x630
> [ 417.819747] ? rcu_is_watching+0x11/0x50
> [ 417.820185] exc_page_fault+0x6c/0x210
> [ 417.820599] asm_exc_page_fault+0x27/0x30
> [ 417.821080] RIP: 0033:0x40130c
> [ 417.821461] Code: 89 7d d8 48 89 75 d0 e8 94 ff ff ff 48 c7 45 f8 00 00 00 00 48 8b 45 d8 48 89 45 f0 eb 18 48 8b 45 f0 48 8d 50 08 48 89 55 f0 <48> c7 00 01 00 00 00 48 83 45 f8 01 48 8b 45 d0 48 c1 e8 03 48 39
> [ 417.823156] RSP: 002b:00007ffcc82a8cb0 EFLAGS: 00010287
> [ 417.823703] RAX: 00007f336f5f5000 RBX: 00007ffcc82a8f08 RCX: 0000000067f5a1da
> [ 417.824382] RDX: 00007f336f5f5008 RSI: 0000000000000000 RDI: 0000000000036a98
> [ 417.825096] RBP: 00007ffcc82a8ce0 R08: 00007f339fa84000 R09: 00000000004040b0
> [ 417.825769] R10: 00007f339fa8a200 R11: 00007f339fa8a7b0 R12: 0000000000000000
> [ 417.826438] R13: 00007ffcc82a8f28 R14: 0000000000403e18 R15: 00007f339fac3000
> [ 417.827148] </TASK>
> [ 417.827461] Modules linked in: nd_pmem(O) dax_pmem(O) nd_btt(O) nfit(O) nd_e820(O) libnvdimm(O) nfit_test_iomap(O)
> [ 417.828404] CR2: 0000000000000b00
> [ 417.828807] ---[ end trace 0000000000000000 ]---
> [ 417.829293] RIP: 0010:__lruvec_stat_mod_folio+0x7e/0x250
>
>
> And then, looking at the page passed to vmf_insert_page_mkwrite():
>
> [ 55.468109] flags: 0x300000000002009(locked|uptodate|reserved|node=0|zone=3)
reserved might indicate ZONE_DEVICE. But zone=3 might or might not be
ZONE_DEVICE (depending on the kernel config).
> [ 55.468674] raw: 0300000000002009 ffff888028c27b20 00000000ffffffff ffff888033b69b88
> [ 55.469270] raw: 000000000000fff5 0000000000000000 00000001ffffffff 0000000000000200
> [ 55.469835] page dumped because: ALISON dump locked & uptodate pages
Do you have the other (earlier) output from __dump_page(), especially
whether this page is part of a large folio etc.?
Trying to decipher:
0300000000002009 -> "unsigned long flags"
ffff888028c27b20 -> big union
As the big union overlays "unsigned long compound_head", and the last
bit is not set, this should be a *small folio*.
That would mean that "0000000000000200" would be "unsigned long memcg_data".
0x200 might have been the folio_nr_pages before the large folio was
split. Likely, we are not clearing that when splitting the large folio,
resulting in a false-positive "memcg_data" after the split.
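In other words (field names and widths simplified, not the exact struct
layout):

/*
 * Simplified sketch of the overlay from 4996fc547f5b: in a large folio,
 * the page count is kept in the first tail page, in storage that an
 * order-0 folio interprets as memcg_data.
 */
struct first_tail_page_sketch {
	union {
		unsigned long memcg_data;	/* meaning as an order-0 folio */
		unsigned long _nr_pages;	/* meaning inside a large folio */
	};
};

So if a 0x200-page (2M) folio gets split without resetting _nr_pages, the
former first tail page comes out the other side reporting memcg_data ==
0x200.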
>
> ^ That's different: locked|uptodate. Other page flags arriving here are
> not locked | uptodate.
>
> Git bisect says this is first bad patch (6.14 --> 6.15-rc1)
> 4996fc547f5b ("mm: let _folio_nr_pages overlay memcg_data in first tail page")
>
> Experimenting a bit with the patch, UN-defining NR_PAGES_IN_LARGE_FOLIO,
> avoids the problem.
>
> The way that patch is reusing memory in tail pages and the fact that it
> only fails in XFS (not ext4) suggests the XFS is depending on tail pages
> in a way that ext4 does not.
IIRC, XFS supports large folios but ext4 does not. But I don't really
know how that interacts with DAX (if the same thing applies). Ordinary
XFS large folio tests seem to work just fine, so the question is what
DAX-specific behavior is happening here.
When we free large folios back to the buddy, we set "folio->_nr_pages =
0", to make the "page->memcg_data" check in page_bad_reason() happy.
Also, just before the large folio split for ordinary large folios, we
set "folio->_nr_pages = 0".
Maybe there is something missing in ZONE_DEVICE freeing/splitting code
of large folios, where we should do the same, to make sure that all
page->memcg_data is actually 0?
I assume so. Let me dig.
--
Cheers,
David / dhildenb
* Re: [BUG Report] 6.15-rc1 RIP: 0010:__lruvec_stat_mod_folio+0x7e/0x250
2025-04-09 8:40 ` David Hildenbrand
@ 2025-04-09 8:55 ` David Hildenbrand
2025-04-09 20:08 ` Dan Williams
2025-04-09 19:03 ` Dan Williams
1 sibling, 1 reply; 9+ messages in thread
From: David Hildenbrand @ 2025-04-09 8:55 UTC (permalink / raw)
To: Alison Schofield, Alistair Popple; +Cc: linux-mm, nvdimm
On 09.04.25 10:40, David Hildenbrand wrote:
> On 09.04.25 02:20, Alison Schofield wrote:
>> Hi David, because this bisected to a patch you posted
>> Hi Alistair, because vmf_insert_page_mkwrite() is in the path
>
> Hi!
>
>>
>> A DAX unit test began failing on 6.15-rc1. I chased it as described below, but
>> need XFS and/or your Folio/tail page accounting knowledge to take it further.
>>
>> A DAX XFS mappings that is SHARED and R/W fails when the folio is
>> unexpectedly NULL. Note that XFS PRIVATE always succeeds and XFS SHARED,
>> READ_ONLY works fine. Also note that it works all the ways with EXT4.
>>
>
> Huh, but why is the folio NULL?
>
> insert_page_into_pte_locked() does "folio = page_folio(page)" and then
> even calls folio_get(folio) before calling folio_add_file_rmap_pte().
>
> folio_add_file_rmap_ptes()->__folio_add_file_rmap() just passes the
> folio pointer along.
>
> The RIP seems to be in __lruvec_stat_mod_folio(), so I assume we end up
> in __folio_mod_stat()->__lruvec_stat_mod_folio().
>
> There, we call folio_memcg(folio). Likely we're not getting NULL back,
> which we could handle, but instead "0000000000000b00"
>
> So maybe the memcg we get is "almost NULL", and not the folio ?
>
>> [ 417.796271] BUG: kernel NULL pointer dereference, address: 0000000000000b00
>> [ 417.796982] #PF: supervisor read access in kernel mode
>> [ 417.797540] #PF: error_code(0x0000) - not-present page
>> [ 417.798123] PGD 2a5c5067 P4D 2a5c5067 PUD 2a5c6067 PMD 0
>> [ 417.798690] Oops: Oops: 0000 [#1] SMP NOPTI
>> [ 417.799178] CPU: 5 UID: 0 PID: 1515 Comm: mmap Tainted: G O 6.15.0-rc1-dirty #158 PREEMPT(voluntary)
>> [ 417.800150] Tainted: [O]=OOT_MODULE
>> [ 417.800583] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
>> [ 417.801358] RIP: 0010:__lruvec_stat_mod_folio+0x7e/0x250
>> [ 417.801948] Code: 85 97 00 00 00 48 8b 43 38 48 89 c3 48 83 e3 f8 a8 02 0f 85 1a 01 00 00 48 85 db 0f 84 28 01 00 00 66 90 49 63 86 80 3e 00 00 <48> 8b 9c c3 00 09 00 00 48 83 c3 40 4c 3b b3 c0 00 00 00 0f 85 68
>> [ 417.803662] RSP: 0000:ffffc90002be3a08 EFLAGS: 00010206
>> [ 417.804234] RAX: 0000000000000000 RBX: 0000000000000200 RCX: 0000000000000002
>> [ 417.804984] RDX: ffffffff815652d7 RSI: 0000000000000000 RDI: ffffffff82a2beae
>> [ 417.805689] RBP: ffffc90002be3a28 R08: 0000000000000000 R09: 0000000000000000
>> [ 417.806384] R10: ffffea0007000040 R11: ffff888376ffe000 R12: 0000000000000001
>> [ 417.807099] R13: 0000000000000012 R14: ffff88807fe4ab40 R15: ffff888029210580
>> [ 417.807801] FS: 00007f339fa7a740(0000) GS:ffff8881fa9b9000(0000) knlGS:0000000000000000
>> [ 417.808570] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> [ 417.809193] CR2: 0000000000000b00 CR3: 000000002a4f0004 CR4: 0000000000370ef0
>> [ 417.809925] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>> [ 417.810622] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
>> [ 417.811353] Call Trace:
>> [ 417.811709] <TASK>
>> [ 417.812038] folio_add_file_rmap_ptes+0x143/0x230
>> [ 417.812566] insert_page_into_pte_locked+0x1ee/0x3c0
>> [ 417.813132] insert_page+0x78/0xf0
>> [ 417.813558] vmf_insert_page_mkwrite+0x55/0xa0
>> [ 417.814088] dax_fault_iter+0x484/0x7b0
>> [ 417.814542] dax_iomap_pte_fault+0x1ca/0x620
>> [ 417.815055] dax_iomap_fault+0x39/0x40
>> [ 417.815499] __xfs_write_fault+0x139/0x380
>> [ 417.815995] ? __handle_mm_fault+0x5e5/0x1a60
>> [ 417.816483] xfs_write_fault+0x41/0x50
>> [ 417.816966] xfs_filemap_fault+0x3b/0xe0
>> [ 417.817424] __do_fault+0x31/0x180
>> [ 417.817859] __handle_mm_fault+0xee1/0x1a60
>> [ 417.818325] ? debug_smp_processor_id+0x17/0x20
>> [ 417.818844] handle_mm_fault+0xe1/0x2b0
>> [ 417.819286] do_user_addr_fault+0x217/0x630
>> [ 417.819747] ? rcu_is_watching+0x11/0x50
>> [ 417.820185] exc_page_fault+0x6c/0x210
>> [ 417.820599] asm_exc_page_fault+0x27/0x30
>> [ 417.821080] RIP: 0033:0x40130c
>> [ 417.821461] Code: 89 7d d8 48 89 75 d0 e8 94 ff ff ff 48 c7 45 f8 00 00 00 00 48 8b 45 d8 48 89 45 f0 eb 18 48 8b 45 f0 48 8d 50 08 48 89 55 f0 <48> c7 00 01 00 00 00 48 83 45 f8 01 48 8b 45 d0 48 c1 e8 03 48 39
>> [ 417.823156] RSP: 002b:00007ffcc82a8cb0 EFLAGS: 00010287
>> [ 417.823703] RAX: 00007f336f5f5000 RBX: 00007ffcc82a8f08 RCX: 0000000067f5a1da
>> [ 417.824382] RDX: 00007f336f5f5008 RSI: 0000000000000000 RDI: 0000000000036a98
>> [ 417.825096] RBP: 00007ffcc82a8ce0 R08: 00007f339fa84000 R09: 00000000004040b0
>> [ 417.825769] R10: 00007f339fa8a200 R11: 00007f339fa8a7b0 R12: 0000000000000000
>> [ 417.826438] R13: 00007ffcc82a8f28 R14: 0000000000403e18 R15: 00007f339fac3000
>> [ 417.827148] </TASK>
>> [ 417.827461] Modules linked in: nd_pmem(O) dax_pmem(O) nd_btt(O) nfit(O) nd_e820(O) libnvdimm(O) nfit_test_iomap(O)
>> [ 417.828404] CR2: 0000000000000b00
>> [ 417.828807] ---[ end trace 0000000000000000 ]---
>> [ 417.829293] RIP: 0010:__lruvec_stat_mod_folio+0x7e/0x250
>>
>>
>> And then, looking at the page passed to vmf_insert_page_mkwrite():
>>
>> [ 55.468109] flags: 0x300000000002009(locked|uptodate|reserved|node=0|zone=3)
>
> reserved might indicate ZONE_DEVICE. But zone=3 might or might not be
> ZONE_DEVICE (depending on the kernel config).
>
>> [ 55.468674] raw: 0300000000002009 ffff888028c27b20 00000000ffffffff ffff888033b69b88
>> [ 55.469270] raw: 000000000000fff5 0000000000000000 00000001ffffffff 0000000000000200
>> [ 55.469835] page dumped because: ALISON dump locked & uptodate pages
>
> Do you have the other (earlier) output from __dump_page(), especially if
> this page is part of a large folio etc?
>
> Trying to decipher:
>
> 0300000000002009 -> "unsigned long flags"
> ffff888028c27b20 -> big union
>
> As the big union overlays "unsigned long compound_head", and the last
> bit is not set, this should be a *small folio*.
>
> That would mean that "0000000000000200" would be "unsigned long memcg_data".
>
> 0x200 might have been the folio_nr_pages before the large folio was
> split. Likely, we are not clearing that when splitting the large folio,
> resulting in a false-positive "memcg_data" after the split.
>
>>
>> ^ That's different: locked|uptodate. Other page flags arriving here are
>> not locked | uptodate.
>>
>> Git bisect says this is first bad patch (6.14 --> 6.15-rc1)
>> 4996fc547f5b ("mm: let _folio_nr_pages overlay memcg_data in first tail page")
>>
>> Experimenting a bit with the patch, UN-defining NR_PAGES_IN_LARGE_FOLIO,
>> avoids the problem.
>>
>> The way that patch is reusing memory in tail pages and the fact that it
>> only fails in XFS (not ext4) suggests the XFS is depending on tail pages
>> in a way that ext4 does not.
>
> IIRC, XFS supports large folios but ext4 does not. But I don't really
> know how that interacts with DAX (if the same thing applies). Ordinary
> XFS large folio tests seem to work just fine, so the question is what
> DAX-specific is happening here.
>
> When we free large folios back to the buddy, we set "folio->_nr_pages =
> 0", to make the "page->memcg_data" check in page_bad_reason() happy.
> Also, just before the large folio split for ordinary large folios, we
> set "folio->_nr_pages = 0".
>
> Maybe there is something missing in ZONE_DEVICE freeing/splitting code
> of large folios, where we should do the same, to make sure that all
> page->memcg_data is actually 0?
>
> I assume so. Let me dig.
>
I suspect this should do the trick:
diff --git a/fs/dax.c b/fs/dax.c
index af5045b0f476e..8dffffef70d21 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -397,6 +397,10 @@ static inline unsigned long dax_folio_put(struct folio *folio)
if (!order)
return 0;
+#ifdef NR_PAGES_IN_LARGE_FOLIO
+ folio->_nr_pages = 0;
+#endif
+
for (i = 0; i < (1UL << order); i++) {
struct dev_pagemap *pgmap = page_pgmap(&folio->page);
struct page *page = folio_page(folio, i);
Alternatively (in the style of fa23a338de93aa03eb0b6146a0440f5762309f85)
diff --git a/fs/dax.c b/fs/dax.c
index af5045b0f476e..a1e354b748522 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -412,6 +412,9 @@ static inline unsigned long dax_folio_put(struct folio *folio)
*/
new_folio->pgmap = pgmap;
new_folio->share = 0;
+#ifdef CONFIG_MEMCG
+ new_folio->memcg_data = 0;
+#endif
WARN_ON_ONCE(folio_ref_count(new_folio));
}
--
Cheers,
David / dhildenb
* Re: [BUG Report] 6.15-rc1 RIP: 0010:__lruvec_stat_mod_folio+0x7e/0x250
2025-04-09 8:40 ` David Hildenbrand
2025-04-09 8:55 ` David Hildenbrand
@ 2025-04-09 19:03 ` Dan Williams
1 sibling, 0 replies; 9+ messages in thread
From: Dan Williams @ 2025-04-09 19:03 UTC (permalink / raw)
To: David Hildenbrand, Alison Schofield, Alistair Popple; +Cc: linux-mm, nvdimm
David Hildenbrand wrote:
[..]
> > ^ That's different: locked|uptodate. Other page flags arriving here are
> > not locked | uptodate.
> >
> > Git bisect says this is first bad patch (6.14 --> 6.15-rc1)
> > 4996fc547f5b ("mm: let _folio_nr_pages overlay memcg_data in first tail page")
> >
> > Experimenting a bit with the patch, UN-defining NR_PAGES_IN_LARGE_FOLIO,
> > avoids the problem.
> >
> > The way that patch is reusing memory in tail pages and the fact that it
> > only fails in XFS (not ext4) suggests the XFS is depending on tail pages
> > in a way that ext4 does not.
>
> IIRC, XFS supports large folios but ext4 does not. But I don't really
> know how that interacts with DAX (if the same thing applies). Ordinary
> XFS large folio tests seem to work just fine, so the question is what
> DAX-specific is happening here.
So with fsdax, large folios come from large extents. I.e. you can have
large fsdax folios regardless of whether the filesystem supports large
folios for page-cache mappings. The dax unit tests have an easier time
getting XFS to create large extents than ext4.
> When we free large folios back to the buddy, we set "folio->_nr_pages =
> 0", to make the "page->memcg_data" check in page_bad_reason() happy.
> Also, just before the large folio split for ordinary large folios, we
> set "folio->_nr_pages = 0".
Ah, yes, that is definitely missing in the fsdax case.
* Re: [BUG Report] 6.15-rc1 RIP: 0010:__lruvec_stat_mod_folio+0x7e/0x250
2025-04-09 8:55 ` David Hildenbrand
@ 2025-04-09 20:08 ` Dan Williams
2025-04-09 20:25 ` David Hildenbrand
0 siblings, 1 reply; 9+ messages in thread
From: Dan Williams @ 2025-04-09 20:08 UTC (permalink / raw)
To: David Hildenbrand, Alison Schofield, Alistair Popple; +Cc: linux-mm, nvdimm
David Hildenbrand wrote:
[..]
> > Maybe there is something missing in ZONE_DEVICE freeing/splitting code
> > of large folios, where we should do the same, to make sure that all
> > page->memcg_data is actually 0?
> >
> > I assume so. Let me dig.
> >
>
> I suspect this should do the trick:
>
> diff --git a/fs/dax.c b/fs/dax.c
> index af5045b0f476e..8dffffef70d21 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -397,6 +397,10 @@ static inline unsigned long dax_folio_put(struct folio *folio)
> if (!order)
> return 0;
>
> +#ifdef NR_PAGES_IN_LARGE_FOLIO
> + folio->_nr_pages = 0;
> +#endif
I assume this new fs/dax.c instance of this pattern motivates a
folio_set_nr_pages() helper to hide the ifdef?
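Something like the below is what I have in mind (purely hypothetical,
sketched only to illustrate the suggestion; no such helper exists today):

/* Hypothetical helper to hide the NR_PAGES_IN_LARGE_FOLIO ifdef from
 * callers that need to reset the field, e.g. before a split. */
static inline void folio_set_nr_pages(struct folio *folio, long nr_pages)
{
#ifdef NR_PAGES_IN_LARGE_FOLIO
	folio->_nr_pages = nr_pages;
#endif
}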
While it is concerning that fs/dax.c misses common expectations like
this, I think that is the nature of bypassing the page allocator to
get folios.
However, this raises the question of whether fixing it here is sufficient for other
ZONE_DEVICE folio cases. I did not immediately find a place where other
ZONE_DEVICE users might be calling prep_compound_page() and leaving
stale tail page metadata lying around. Alistair?
> +
> for (i = 0; i < (1UL << order); i++) {
> struct dev_pagemap *pgmap = page_pgmap(&folio->page);
> struct page *page = folio_page(folio, i);
>
>
> Alternatively (in the style of fa23a338de93aa03eb0b6146a0440f5762309f85)
>
> diff --git a/fs/dax.c b/fs/dax.c
> index af5045b0f476e..a1e354b748522 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -412,6 +412,9 @@ static inline unsigned long dax_folio_put(struct folio *folio)
> */
> new_folio->pgmap = pgmap;
> new_folio->share = 0;
> +#ifdef CONFIG_MEMCG
> + new_folio->memcg_data = 0;
> +#endif
This looks correct, but I like the first option because I would never
expect a dax-page to need to worry about being part of a memcg.
> WARN_ON_ONCE(folio_ref_count(new_folio));
> }
>
>
>
> --
> Cheers,
>
> David / dhildenb
Thanks for the help, David!
* Re: [BUG Report] 6.15-rc1 RIP: 0010:__lruvec_stat_mod_folio+0x7e/0x250
2025-04-09 20:08 ` Dan Williams
@ 2025-04-09 20:25 ` David Hildenbrand
2025-04-09 21:13 ` Alison Schofield
2025-04-09 21:41 ` Dan Williams
0 siblings, 2 replies; 9+ messages in thread
From: David Hildenbrand @ 2025-04-09 20:25 UTC (permalink / raw)
To: Dan Williams, Alison Schofield, Alistair Popple; +Cc: linux-mm, nvdimm
On 09.04.25 22:08, Dan Williams wrote:
> David Hildenbrand wrote:
> [..]
>>> Maybe there is something missing in ZONE_DEVICE freeing/splitting code
>>> of large folios, where we should do the same, to make sure that all
>>> page->memcg_data is actually 0?
>>>
>>> I assume so. Let me dig.
>>>
>>
>> I suspect this should do the trick:
>>
>> diff --git a/fs/dax.c b/fs/dax.c
>> index af5045b0f476e..8dffffef70d21 100644
>> --- a/fs/dax.c
>> +++ b/fs/dax.c
>> @@ -397,6 +397,10 @@ static inline unsigned long dax_folio_put(struct folio *folio)
>> if (!order)
>> return 0;
>>
>> +#ifdef NR_PAGES_IN_LARGE_FOLIO
>> + folio->_nr_pages = 0;
>> +#endif
>
> I assume this new fs/dax.c instance of this pattern motivates a
> folio_set_nr_pages() helper to hide the ifdef?
Hm, not sure. We do have folio_set_order(), but we WARN on order=0 for
good reasons ... and having a folio_set_nr_pages() that doesn't set the
order is also weird ...
In the THP case we handle it now by propagating the folio->memcg_data to
all new_folio->memcg_data.
Maybe we should simply allow setting order=0 for folio_set_order(),
adding a comment that it is for resetting before a split.
Let me think about that.
>
> While it is concerning that fs/dax.c misses common expectations like
> this, but I think that is the nature of bypassing the page allocator to
> get folios().
It was a bit unfortunate that Alistair's work and my work went into
mm-unstable and upstream shortly after each other.
>
> However, raises the question if fixing it here is sufficient for other
> ZONE_DEVICE folio cases. I did not immediately find a place where other
> ZONE_DEVICE users might be calling prep_compound_page() and leaving
> stale tail page metadata lying around. Alistair?
We only have to consider this when splitting folios (putting buddy
freeing aside). clear_compound_head() is what to search for.
We don't need it in mm/hugetlb.c because we'll only demote large folios
to smaller-large folios and effectively reset the order/nr_pages for all
involved folios.
Let me send an official patch tomorrow; maybe Alison can confirm in the
meantime whether that fixes the issue.
>> diff --git a/fs/dax.c b/fs/dax.c
>> index af5045b0f476e..a1e354b748522 100644
>> --- a/fs/dax.c
>> +++ b/fs/dax.c
>> @@ -412,6 +412,9 @@ static inline unsigned long dax_folio_put(struct folio *folio)
>> */
>> new_folio->pgmap = pgmap;
>> new_folio->share = 0;
>> +#ifdef CONFIG_MEMCG
>> + new_folio->memcg_data = 0;
>> +#endif
>
> This looks correct, but I like the first option because I would never
> expect a dax-page to need to worry about being part of a memcg.
Right.
--
Cheers,
David / dhildenb
* Re: [BUG Report] 6.15-rc1 RIP: 0010:__lruvec_stat_mod_folio+0x7e/0x250
2025-04-09 20:25 ` David Hildenbrand
@ 2025-04-09 21:13 ` Alison Schofield
2025-04-09 21:41 ` Dan Williams
1 sibling, 0 replies; 9+ messages in thread
From: Alison Schofield @ 2025-04-09 21:13 UTC (permalink / raw)
To: David Hildenbrand; +Cc: Dan Williams, Alistair Popple, linux-mm, nvdimm
On Wed, Apr 09, 2025 at 10:25:18PM +0200, David Hildenbrand wrote:
> On 09.04.25 22:08, Dan Williams wrote:
> > David Hildenbrand wrote:
> > [..]
snip
>
>
> Let me send an official patch tomorrow; maybe Alison can comment until then
> if that fixes the issue.
Either of the proposed #ifdefs resolves the issue.
--Alison
* Re: [BUG Report] 6.15-rc1 RIP: 0010:__lruvec_stat_mod_folio+0x7e/0x250
2025-04-09 20:25 ` David Hildenbrand
2025-04-09 21:13 ` Alison Schofield
@ 2025-04-09 21:41 ` Dan Williams
2025-04-10 8:48 ` Christoph Hellwig
1 sibling, 1 reply; 9+ messages in thread
From: Dan Williams @ 2025-04-09 21:41 UTC (permalink / raw)
To: David Hildenbrand, Dan Williams, Alison Schofield,
Alistair Popple
Cc: linux-mm, nvdimm
David Hildenbrand wrote:
[..]
> > However, raises the question if fixing it here is sufficient for other
> > ZONE_DEVICE folio cases. I did not immediately find a place where other
> > ZONE_DEVICE users might be calling prep_compound_page() and leaving
> > stale tail page metadata lying around. Alistair?
>
> We only have to consider this when splitting folios (putting buddy
> freeing aside). clear_compound_head() is what to search for.
So I do not think there is a problem for the DEVICE_PRIVATE case since
that hits this comment in free_zone_device_folio()
/*
* Note: we don't expect anonymous compound pages yet. Once supported
* and we could PTE-map them similar to THP, we'd have to clear
* PG_anon_exclusive on all tail pages.
*/
The p2p-dma use case does not map into userspace, and the device-dax
case has static folio order for all potential folios. So I think this
fix is only needed for fsdax.
> We don't need it in mm/hugetlb.c because we'll only demote large folios
> to smaller-large folios and effectively reset the order/nr_pages for all
> involved folios.
I also now feel better about a local fs/dax.c fix because clearing
_nr_pages in free_zone_device_folio() would require static folio
metadata cases like device-dax to start re-initializing that field.
I.e. this seems to be the only ZONE_DEVICE case doing this demote to
order-0.
* Re: [BUG Report] 6.15-rc1 RIP: 0010:__lruvec_stat_mod_folio+0x7e/0x250
2025-04-09 21:41 ` Dan Williams
@ 2025-04-10 8:48 ` Christoph Hellwig
0 siblings, 0 replies; 9+ messages in thread
From: Christoph Hellwig @ 2025-04-10 8:48 UTC (permalink / raw)
To: Dan Williams
Cc: David Hildenbrand, Alison Schofield, Alistair Popple, linux-mm,
nvdimm
On Wed, Apr 09, 2025 at 02:41:14PM -0700, Dan Williams wrote:
> The p2p-dma use case does not map into userspace, and the device-dax
> case has static folio order for all potential folios. So I think this
> fix is only needed for fsdax.
p2pdma pages can be mapped to userspace. Or do you mean something else?