linux-mm.kvack.org archive mirror
From: "David Wang" <00107082@163.com>
To: "Suren Baghdasaryan" <surenb@google.com>
Cc: kent.overstreet@linux.dev, linux-mm@kvack.org,
	linux-kernel@vger.kernel.org
Subject: Re: BUG: unable to handle page fault for address
Date: Sun, 18 May 2025 17:55:38 +0800 (CST)
Message-ID: <489a2474.19ea.196e2d20b87.Coremail.00107082@163.com>
In-Reply-To: <551cd408.515.196e11108a5.Coremail.00107082@163.com>


>>>
>>> I do notice there are places where counters are referenced "after" free_module, but the logs I attached
>>> happened "during" free_module():
>>>
>>>  [Fri May 16 12:05:41 2025] BUG: unable to handle page fault for address: ffff9d28984c3000
>>>  [Fri May 16 12:05:41 2025] #PF: supervisor read access in kernel mode
>>> [Fri May 16 12:05:41 2025] #PF: error_code(0x0000) - not-present page
>>> ...
>>>  [Fri May 16 12:05:41 2025] RIP: 0010:release_module_tags+0x103/0x1b0
>>> ...
>>>  [Fri May 16 12:05:41 2025] Call Trace:
>>>  [Fri May 16 12:05:41 2025]  <TASK>
>>>  [Fri May 16 12:05:41 2025]  codetag_unload_module+0x135/0x160
>>> [Fri May 16 12:05:41 2025]  free_module+0x19/0x1a0
>>>
>>> The call chain is the same as you mentioned above. 
>>
>>Is this failure happening before or after my fix? With my fix, percpu
>>data should not be freed at all if tags are still used. Please
>>clarify.
>
>It is before your fix.  Your patch does fix the issue.
>  
>In my reproduce procedure:
>1. enter recovery mode
>2. install nvidia driver 570.144, failed with Unknown symbol drm_client_setup
>3. modprobe drm_client_lib
>4. install nvidia driver 570.144
>5. install nvidia driver 550.144.03
>6. reboot and repeat from step 1
>
>The error happened in step 4,  and the failure in step2 is crucial,  if 'modprobe drm_client_lib' at the beginning, no error could be observed.
>
>There may be something off about how kernel handles data.percpu section.
>Good thing is that It can be reproduced,  I can add debug messages to clear or confirm  suspicions, 
>Any suggestion?
>
>
>Thanks
>David
>
>
After poking around and logging memory addresses, I think I finally understand what is happening here.

1. codetag_alloc_module_section() allocates memory for the codetag section when the module is loaded.
2. The module load fails because of an undefined symbol.
3. The codetag section memory from step 1 is not freed.
4. The module is loaded again, and its address happens to reuse the address used previously.
5. codetag_alloc_module_section() allocates another codetag section.
6. The percpu section is allocated, and the relocated addresses are then written to the section allocated in step 5.
7. When the module is unloaded, the search through the maple tree finds the codetag memory from step 1,
which never had any relocated addresses populated.
8. A page fault follows, because tag->counters is 0.
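
The sequence above can be replayed in a small user-space sketch. Everything in it is hypothetical (struct tag_area, area_alloc(), area_find(), the 0x1000 address), and it assumes, as steps 4 and 7 suggest, that the unload path picks the first area recorded for the module's address. It is not the kernel's maple tree code, only a model of why the stale area wins.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for one code tag area; not the kernel's layout. */
struct tag_area {
    uintptr_t mod_addr; /* address of the module the area was reserved for */
    void *counters;     /* stays NULL until relocations are applied */
    int in_use;
};

#define MAX_AREAS 8
static struct tag_area areas[MAX_AREAS];

/* Steps 1 and 5: reserve the first free area for a module being loaded. */
static struct tag_area *area_alloc(uintptr_t mod_addr)
{
    for (size_t i = 0; i < MAX_AREAS; i++) {
        if (!areas[i].in_use) {
            areas[i].mod_addr = mod_addr;
            areas[i].counters = NULL;
            areas[i].in_use = 1;
            return &areas[i];
        }
    }
    return NULL;
}

/* Step 7: return the first in-use area recorded for this module address. */
static struct tag_area *area_find(uintptr_t mod_addr)
{
    for (size_t i = 0; i < MAX_AREAS; i++) {
        if (areas[i].in_use && areas[i].mod_addr == mod_addr)
            return &areas[i];
    }
    return NULL;
}

/* Replays steps 1-7 and returns the area the unload path would pick. */
static struct tag_area *replay_failed_then_successful_load(void)
{
    static int fake_counters;          /* stands in for the percpu data */
    const uintptr_t mod_addr = 0x1000; /* made-up, reused module address */
    struct tag_area *first, *second;

    memset(areas, 0, sizeof(areas));
    first = area_alloc(mod_addr);      /* step 1 */
    assert(first != NULL);
    /* steps 2-3: the load fails and the area is never released */
    second = area_alloc(mod_addr);     /* steps 4-5 */
    assert(second != NULL && second != first);
    second->counters = &fake_counters; /* step 6 */
    return area_find(mod_addr);        /* step 7 */
}
```

In this model replay_failed_then_successful_load() returns the area left over from the failed load, whose counters pointer is still NULL, while the area that actually received the relocation sits behind it and is never reached.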

I used the following change to log the addresses:

--- a/lib/alloc_tag.c
+++ b/lib/alloc_tag.c
@@ -575,6 +575,11 @@ static void release_module_tags(struct module *mod, bool used)
        if (!used)
                goto release_area;
 
+       struct alloc_tag *ptag = (struct alloc_tag *)(module_tags.start_addr + mas.index);
+       pr_info("percpu 0: 0x%llx(0x%llx)\n",
+                       (long long)per_cpu_ptr(ptag->counters, 0),
+                       (long long)ptag->counters
+                       );


And got the following; the offending address is the last line, marked with the arrow:
[Sun May 18 16:25:47 2025] percpu 0: 0xffff8edb6ee41030(0xffffffffbc57e030)
[Sun May 18 16:25:47 2025] percpu 0: 0xffff8edb6ee410e0(0xffffffffbc57e0e0)
[Sun May 18 16:25:47 2025] percpu 0: 0xffff8edb6ee40fa0(0xffffffffbc57dfa0)
[Sun May 18 16:26:43 2025] percpu 0: 0xffff8edbb28c3000(0x0)   <------
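
The logged pairs can be cross-checked with plain arithmetic. The sketch below assumes that per_cpu_ptr() on this configuration amounts to adding a fixed CPU 0 offset to the counters value, which is how the numbers behave; struct sample, cpu0_offset() and offsets_agree() are names made up for this check.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One logged pair: per_cpu_ptr(counters, 0) and the raw counters value. */
struct sample {
    uint64_t cpu0_addr;
    uint64_t counters;
};

/* Values copied from the log lines above. */
static const struct sample samples[] = {
    { 0xffff8edb6ee41030ULL, 0xffffffffbc57e030ULL },
    { 0xffff8edb6ee410e0ULL, 0xffffffffbc57e0e0ULL },
    { 0xffff8edb6ee40fa0ULL, 0xffffffffbc57dfa0ULL },
    { 0xffff8edbb28c3000ULL, 0x0ULL },
};

#define NR_SAMPLES (sizeof(samples) / sizeof(samples[0]))

/* Offset added to counters to reach the CPU 0 address (wraps modulo 2^64). */
static uint64_t cpu0_offset(const struct sample *s)
{
    assert(s != NULL);
    return s->cpu0_addr - s->counters;
}

/* Returns 1 if every logged pair implies the same CPU 0 offset, else 0. */
static int offsets_agree(void)
{
    uint64_t first = cpu0_offset(&samples[0]);

    for (size_t i = 1; i < NR_SAMPLES; i++) {
        if (cpu0_offset(&samples[i]) != first)
            return 0;
    }
    return 1;
}
```

All four pairs give the same offset, 0xffff8edbb28c3000, so the address on the marked line is simply that offset added to a counters value of 0. This is consistent with the stale tag never having been relocated.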


I think we have spotted two issues in this thread:

1. When a module load fails after its codetag section has been allocated, that memory leaks.
2. The counters may still need to be referenced even after the module is unloaded.

#2 has already been addressed by your patch. I will send a simple patch to fix #1.
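
For #1, the general shape of such a fix is to release the section on the failure path. The following is only a user-space sketch of that pattern, not the patch itself: load_sketch(), section_alloc(), section_free() and the live_sections counter are hypothetical, and symbols_ok == 0 stands in for the "Unknown symbol" failure.

```c
#include <assert.h>
#include <stdlib.h>

static int live_sections; /* section buffers currently allocated */

static void *section_alloc(size_t size)
{
    void *p = malloc(size);

    if (p)
        live_sections++;
    return p;
}

static void section_free(void *p)
{
    if (p) {
        free(p);
        live_sections--;
    }
}

/*
 * Hypothetical loader. On success, ownership of the section passes to
 * *out and 0 is returned. On failure, the section is released before
 * returning -1, so nothing stale is left behind for a later lookup.
 */
static int load_sketch(int symbols_ok, void **out)
{
    void *section;

    assert(out != NULL);
    section = section_alloc(64);
    if (!section)
        return -1;
    if (!symbols_ok)
        goto free_section;

    *out = section;
    return 0;

free_section:
    section_free(section);
    return -1;
}
```

With this structure a failed load_sketch(0, &sec) leaves live_sections at 0, whereas the leak described in #1 corresponds to returning -1 without the section_free() call.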

(I feel so relieved to finally reach a conclusion; I hope there are no silly mistakes here :)


Thanks
David

