linux-mm.kvack.org archive mirror
From: David Hildenbrand <david@redhat.com>
To: Alison Schofield <alison.schofield@intel.com>,
	Alistair Popple <apopple@nvidia.com>
Cc: linux-mm@kvack.org, nvdimm@lists.linux.dev
Subject: Re: [BUG Report] 6.15-rc1 RIP: 0010:__lruvec_stat_mod_folio+0x7e/0x250
Date: Wed, 9 Apr 2025 10:55:11 +0200
Message-ID: <89c869fe-6552-4c7b-ae32-f8179628cade@redhat.com>
In-Reply-To: <322e93d6-3fe2-48e9-84a9-c387cef41013@redhat.com>

On 09.04.25 10:40, David Hildenbrand wrote:
> On 09.04.25 02:20, Alison Schofield wrote:
>> Hi David, because this bisected to a patch you posted
>> Hi Alistair,  because vmf_insert_page_mkwrite() is in the path
> 
> Hi!
> 
>>
>> A DAX unit test began failing on 6.15-rc1. I chased it as described below, but
>> need XFS and/or your Folio/tail page accounting knowledge to take it further.
>>
>> A DAX XFS mapping that is SHARED and R/W fails when the folio is
>> unexpectedly NULL. Note that XFS PRIVATE always succeeds, and XFS SHARED,
>> READ_ONLY works fine. Also note that it works in all cases with EXT4.
>>
> 
> Huh, but why is the folio NULL?
> 
> insert_page_into_pte_locked() does "folio = page_folio(page)" and then
> even calls folio_get(folio) before calling folio_add_file_rmap_pte().
> 
> folio_add_file_rmap_ptes()->__folio_add_file_rmap() just passes the
> folio pointer along.
> 
> The RIP seems to be in __lruvec_stat_mod_folio(), so I assume we end up
> in __folio_mod_stat()->__lruvec_stat_mod_folio().
> 
> There, we call folio_memcg(folio). Likely we're not getting NULL back,
> which we could handle, but instead something that makes us fault at
> "0000000000000b00".
> 
> So maybe the memcg we get is "almost NULL", and not the folio?
> 
>> [  417.796271] BUG: kernel NULL pointer dereference, address: 0000000000000b00
>> [  417.796982] #PF: supervisor read access in kernel mode
>> [  417.797540] #PF: error_code(0x0000) - not-present page
>> [  417.798123] PGD 2a5c5067 P4D 2a5c5067 PUD 2a5c6067 PMD 0
>> [  417.798690] Oops: Oops: 0000 [#1] SMP NOPTI
>> [  417.799178] CPU: 5 UID: 0 PID: 1515 Comm: mmap Tainted: G           O        6.15.0-rc1-dirty #158 PREEMPT(voluntary)
>> [  417.800150] Tainted: [O]=OOT_MODULE
>> [  417.800583] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
>> [  417.801358] RIP: 0010:__lruvec_stat_mod_folio+0x7e/0x250
>> [  417.801948] Code: 85 97 00 00 00 48 8b 43 38 48 89 c3 48 83 e3 f8 a8 02 0f 85 1a 01 00 00 48 85 db 0f 84 28 01 00 00 66 90 49 63 86 80 3e 00 00 <48> 8b 9c c3 00 09 00 00 48 83 c3 40 4c 3b b3 c0 00 00 00 0f 85 68
>> [  417.803662] RSP: 0000:ffffc90002be3a08 EFLAGS: 00010206
>> [  417.804234] RAX: 0000000000000000 RBX: 0000000000000200 RCX: 0000000000000002
>> [  417.804984] RDX: ffffffff815652d7 RSI: 0000000000000000 RDI: ffffffff82a2beae
>> [  417.805689] RBP: ffffc90002be3a28 R08: 0000000000000000 R09: 0000000000000000
>> [  417.806384] R10: ffffea0007000040 R11: ffff888376ffe000 R12: 0000000000000001
>> [  417.807099] R13: 0000000000000012 R14: ffff88807fe4ab40 R15: ffff888029210580
>> [  417.807801] FS:  00007f339fa7a740(0000) GS:ffff8881fa9b9000(0000) knlGS:0000000000000000
>> [  417.808570] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> [  417.809193] CR2: 0000000000000b00 CR3: 000000002a4f0004 CR4: 0000000000370ef0
>> [  417.809925] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>> [  417.810622] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
>> [  417.811353] Call Trace:
>> [  417.811709]  <TASK>
>> [  417.812038]  folio_add_file_rmap_ptes+0x143/0x230
>> [  417.812566]  insert_page_into_pte_locked+0x1ee/0x3c0
>> [  417.813132]  insert_page+0x78/0xf0
>> [  417.813558]  vmf_insert_page_mkwrite+0x55/0xa0
>> [  417.814088]  dax_fault_iter+0x484/0x7b0
>> [  417.814542]  dax_iomap_pte_fault+0x1ca/0x620
>> [  417.815055]  dax_iomap_fault+0x39/0x40
>> [  417.815499]  __xfs_write_fault+0x139/0x380
>> [  417.815995]  ? __handle_mm_fault+0x5e5/0x1a60
>> [  417.816483]  xfs_write_fault+0x41/0x50
>> [  417.816966]  xfs_filemap_fault+0x3b/0xe0
>> [  417.817424]  __do_fault+0x31/0x180
>> [  417.817859]  __handle_mm_fault+0xee1/0x1a60
>> [  417.818325]  ? debug_smp_processor_id+0x17/0x20
>> [  417.818844]  handle_mm_fault+0xe1/0x2b0
>> [  417.819286]  do_user_addr_fault+0x217/0x630
>> [  417.819747]  ? rcu_is_watching+0x11/0x50
>> [  417.820185]  exc_page_fault+0x6c/0x210
>> [  417.820599]  asm_exc_page_fault+0x27/0x30
>> [  417.821080] RIP: 0033:0x40130c
>> [  417.821461] Code: 89 7d d8 48 89 75 d0 e8 94 ff ff ff 48 c7 45 f8 00 00 00 00 48 8b 45 d8 48 89 45 f0 eb 18 48 8b 45 f0 48 8d 50 08 48 89 55 f0 <48> c7 00 01 00 00 00 48 83 45 f8 01 48 8b 45 d0 48 c1 e8 03 48 39
>> [  417.823156] RSP: 002b:00007ffcc82a8cb0 EFLAGS: 00010287
>> [  417.823703] RAX: 00007f336f5f5000 RBX: 00007ffcc82a8f08 RCX: 0000000067f5a1da
>> [  417.824382] RDX: 00007f336f5f5008 RSI: 0000000000000000 RDI: 0000000000036a98
>> [  417.825096] RBP: 00007ffcc82a8ce0 R08: 00007f339fa84000 R09: 00000000004040b0
>> [  417.825769] R10: 00007f339fa8a200 R11: 00007f339fa8a7b0 R12: 0000000000000000
>> [  417.826438] R13: 00007ffcc82a8f28 R14: 0000000000403e18 R15: 00007f339fac3000
>> [  417.827148]  </TASK>
>> [  417.827461] Modules linked in: nd_pmem(O) dax_pmem(O) nd_btt(O) nfit(O) nd_e820(O) libnvdimm(O) nfit_test_iomap(O)
>> [  417.828404] CR2: 0000000000000b00
>> [  417.828807] ---[ end trace 0000000000000000 ]---
>> [  417.829293] RIP: 0010:__lruvec_stat_mod_folio+0x7e/0x250
>>
>>
>> And then, looking at the page passed to vmf_insert_page_mkwrite():
>>
>> [   55.468109] flags: 0x300000000002009(locked|uptodate|reserved|node=0|zone=3)
> 
> reserved might indicate ZONE_DEVICE. But zone=3 might or might not be
> ZONE_DEVICE (depending on the kernel config).
> 
>> [   55.468674] raw: 0300000000002009 ffff888028c27b20 00000000ffffffff ffff888033b69b88
>> [   55.469270] raw: 000000000000fff5 0000000000000000 00000001ffffffff 0000000000000200
>> [   55.469835] page dumped because: ALISON dump locked & uptodate pages
> 
> Do you have the other (earlier) output from __dump_page(), especially if
> this page is part of a large folio etc?
> 
> Trying to decipher:
> 
> 0300000000002009 -> "unsigned long flags"
> ffff888028c27b20 -> big union
> 
> As the big union overlays "unsigned long compound_head", and the last
> bit is not set, this should be a *small folio*.
> 
> That would mean that "0000000000000200" would be "unsigned long memcg_data".
> 
> 0x200 might have been the folio_nr_pages before the large folio was
> split. Likely, we are not clearing that when splitting the large folio,
> resulting in a false-positive "memcg_data" after the split.
> 
>>
>> ^ That's different:  locked|uptodate. Other page flags arriving here are
>> not locked | uptodate.
>>
>> Git bisect says this is first bad patch (6.14 --> 6.15-rc1)
>> 4996fc547f5b ("mm: let _folio_nr_pages overlay memcg_data in first tail page")
>>
>> Experimenting a bit with the patch, un-defining NR_PAGES_IN_LARGE_FOLIO
>> avoids the problem.
>>
>> The way that patch reuses memory in tail pages, and the fact that it
>> only fails on XFS (not ext4), suggest that XFS depends on tail pages
>> in a way that ext4 does not.
> 
> IIRC, XFS supports large folios but ext4 does not. But I don't really
> know how that interacts with DAX (if the same thing applies). Ordinary
> XFS large folio tests seem to work just fine, so the question is what
> DAX-specific thing is happening here.
> 
> When we free large folios back to the buddy, we set "folio->_nr_pages =
> 0", to make the "page->memcg_data" check in page_bad_reason() happy.
> Also, just before the large folio split for ordinary large folios, we
> set "folio->_nr_pages = 0".
> 
> Maybe there is something missing in ZONE_DEVICE freeing/splitting code
> of large folios, where we should do the same, to make sure that all
> page->memcg_data is actually 0?
> 
> I assume so. Let me dig.
> 

I suspect this should do the trick:

diff --git a/fs/dax.c b/fs/dax.c
index af5045b0f476e..8dffffef70d21 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -397,6 +397,10 @@ static inline unsigned long dax_folio_put(struct folio *folio)
         if (!order)
                 return 0;
  
+#ifdef NR_PAGES_IN_LARGE_FOLIO
+       folio->_nr_pages = 0;
+#endif
+
         for (i = 0; i < (1UL << order); i++) {
                 struct dev_pagemap *pgmap = page_pgmap(&folio->page);
                 struct page *page = folio_page(folio, i);


Alternatively (in the style of fa23a338de93aa03eb0b6146a0440f5762309f85):

diff --git a/fs/dax.c b/fs/dax.c
index af5045b0f476e..a1e354b748522 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -412,6 +412,9 @@ static inline unsigned long dax_folio_put(struct folio *folio)
                  */
                 new_folio->pgmap = pgmap;
                 new_folio->share = 0;
+#ifdef CONFIG_MEMCG
+               new_folio->memcg_data = 0;
+#endif
                 WARN_ON_ONCE(folio_ref_count(new_folio));
         }
  


-- 
Cheers,

David / dhildenb



Thread overview: 9+ messages
2025-04-09  0:20 [BUG Report] 6.15-rc1 RIP: 0010:__lruvec_stat_mod_folio+0x7e/0x250 Alison Schofield
2025-04-09  8:40 ` David Hildenbrand
2025-04-09  8:55   ` David Hildenbrand [this message]
2025-04-09 20:08     ` Dan Williams
2025-04-09 20:25       ` David Hildenbrand
2025-04-09 21:13         ` Alison Schofield
2025-04-09 21:41         ` Dan Williams
2025-04-10  8:48           ` Christoph Hellwig
2025-04-09 19:03   ` Dan Williams
