BUG: unable to handle page fault for address

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

* BUG: unable to handle page fault for address
@ 2025-05-16 13:12 David Wang
  2025-05-16 17:03 ` Suren Baghdasaryan
  0 siblings, 1 reply; 12+ messages in thread
From: David Wang @ 2025-05-16 13:12 UTC (permalink / raw)
  To: surenb, kent.overstreet; +Cc: linux-mm, linux-kernel

Hi,

I caught a page fault when I was changing my nvidia driver:
(This happens randomly, I can reproduce it with about 1/3 probability)

[Fri May 16 12:05:41 2025] BUG: unable to handle page fault for address: ffff9d28984c3000
[Fri May 16 12:05:41 2025] #PF: supervisor read access in kernel mode
[Fri May 16 12:05:41 2025] #PF: error_code(0x0000) - not-present page
...
[Fri May 16 12:05:41 2025] RIP: 0010:release_module_tags+0x103/0x1b0
...
[Fri May 16 12:05:41 2025] Call Trace:
[Fri May 16 12:05:41 2025]  <TASK>
[Fri May 16 12:05:41 2025]  codetag_unload_module+0x135/0x160
[Fri May 16 12:05:41 2025]  free_module+0x19/0x1a0
...
(full kernel logs are pasted at the end.)

Using a image with DEBUG_INFO, the calltrack parses as:

RIP: 0010:release_module_tags (./include/linux/alloc_tag.h:134 lib/alloc_tag.c:352 lib/alloc_tag.c:573) 
[Fri May 16 12:05:41 2025] codetag_unload_module (lib/codetag.c:355) 
[Fri May 16 12:05:41 2025] free_module (kernel/module/main.c:1305) 
[Fri May 16 12:05:41 2025] __do_sys_delete_module (kernel/module/main.c:795)

The offending lines in my codebase:
	126 static inline struct alloc_tag_counters alloc_tag_read(struct alloc_tag *tag)
	127 {
	...
	131 
	132         for_each_possible_cpu(cpu) {
	133                 counter = per_cpu_ptr(tag->counters, cpu);
>>>>	134                 v.bytes += counter->bytes;   <--------------here
	135                 v.calls += counter->calls;


Nvidia drivers are out-tree... there could be some strange behavior in it causes this.. but,
when I check the code, I got concerned about lifecycle of tag->counters.
Based on following defination:
	108 #define DEFINE_ALLOC_TAG(_alloc_tag)                                            \
	109         static DEFINE_PER_CPU(struct alloc_tag_counters, _alloc_tag_cntr);      \
	110         static struct alloc_tag _alloc_tag __used __aligned(8)                  \
	111         __section(ALLOC_TAG_SECTION_NAME) = {                                   \
	112                 .ct = CODE_TAG_INIT,                                            \
	113                 .counters = &_alloc_tag_cntr };
	114 
_alloc_tag_cntr is the data referenced by tag->counters, but they are in different section,
and alloc_tag only prepare storage for section ALLOC_TAG_SECTION_NAME.
right?
Then what happens to those ".data..percpu" section when module is unloaded?
Is it safe to keep using those ".data..percpu" section after module unloaded,
or even during module is unloading?

I was expecting page fault when I make following experiments:

1.Load module A
2.Load module B
3.module B alloc memory, and handover the memory to A
4.unload module B
(memory profiling report module B has memory not freed)
... after a while....
5.unload module A, where A free the memory. (when A kfree the memory,
the counters used to be in module B's ".data..percpu" section should
be referenced, it that section is gone, a pagefault should happen).

But, after several trials, not page fault reported....
Would kernel keeps ".data..percpu" since ALLOC_TAG_SECTION_NAME has reference to it,
or I just need wait longer for kernel to "purge" those sections.


Full logs:
[Fri May 16 12:02:28 2025] nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms  550.144.03  Mon Dec 30 17:10:10 UTC 2024
[Fri May 16 12:02:28 2025] [drm] [nvidia-drm] [GPU ID 0x00002600] Loading driver
[Fri May 16 12:02:28 2025] [drm] Initialized nvidia-drm 0.0.0 for 0000:26:00.0 on minor 0
[Fri May 16 12:03:45 2025] [drm] [nvidia-drm] [GPU ID 0x00002600] Unloading driver
[Fri May 16 12:03:46 2025] nvidia-modeset: Unloading
[Fri May 16 12:03:46 2025] nvidia-nvlink: Unregistered Nvlink Core, major device number 240
[Fri May 16 12:04:38 2025] VFIO - User Level meta-driver version: 0.3
[Fri May 16 12:04:38 2025] nvidia-nvlink: Nvlink Core is being initialized, major device number 239

[Fri May 16 12:04:38 2025] nvidia 0000:26:00.0: vgaarb: VGA decodes changed: olddecodes=none,decodes=none:owns=none
[Fri May 16 12:04:38 2025] NVRM: loading NVIDIA UNIX x86_64 Kernel Module  570.144  Thu Apr 10 20:33:29 UTC 2025
[Fri May 16 12:04:38 2025] nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms  570.144  Thu Apr 10 20:03:03 UTC 2025
[Fri May 16 12:04:38 2025] nvidia_drm: Unknown symbol drm_client_setup (err -2)
[Fri May 16 12:04:40 2025] nvidia-modeset: Unloading
[Fri May 16 12:04:40 2025] nvidia-nvlink: Unregistered Nvlink Core, major device number 239
[Fri May 16 12:05:40 2025] VFIO - User Level meta-driver version: 0.3
[Fri May 16 12:05:40 2025] nvidia-nvlink: Nvlink Core is being initialized, major device number 239

[Fri May 16 12:05:40 2025] nvidia 0000:26:00.0: vgaarb: VGA decodes changed: olddecodes=none,decodes=none:owns=none
[Fri May 16 12:05:40 2025] NVRM: loading NVIDIA UNIX x86_64 Kernel Module  570.144  Thu Apr 10 20:33:29 UTC 2025
[Fri May 16 12:05:40 2025] nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms  570.144  Thu Apr 10 20:03:03 UTC 2025
[Fri May 16 12:05:40 2025] [drm] [nvidia-drm] [GPU ID 0x00002600] Loading driver
[Fri May 16 12:05:40 2025] [drm] Initialized nvidia-drm 0.0.0 for 0000:26:00.0 on minor 0
[Fri May 16 12:05:40 2025] [drm] [nvidia-drm] [GPU ID 0x00002600] Unloading driver
[Fri May 16 12:05:40 2025] nvidia-modeset: Unloading
[Fri May 16 12:05:41 2025] nvidia-nvlink: Unregistered Nvlink Core, major device number 239
[Fri May 16 12:05:41 2025] BUG: unable to handle page fault for address: ffff9d28984c3000
[Fri May 16 12:05:41 2025] #PF: supervisor read access in kernel mode
[Fri May 16 12:05:41 2025] #PF: error_code(0x0000) - not-present page
[Fri May 16 12:05:41 2025] PGD 163001067 P4D 163001067 PUD 0 
[Fri May 16 12:05:41 2025] Oops: Oops: 0000 [#1] SMP NOPTI
[Fri May 16 12:05:41 2025] CPU: 0 UID: 0 PID: 35898 Comm: modprobe Tainted: G           OE       6.15.0-rc6-linan-0 #587 PREEMPT(voluntary) 
[Fri May 16 12:05:41 2025] Tainted: [O]=OOT_MODULE, [E]=UNSIGNED_MODULE
[Fri May 16 12:05:41 2025] Hardware name: Micro-Star International Co., Ltd. MS-7B89/B450M MORTAR MAX (MS-7B89), BIOS 2.L0 07/18/2024
[Fri May 16 12:05:41 2025] RIP: 0010:release_module_tags+0x103/0x1b0
[Fri May 16 12:05:41 2025] Code: e3 4c 03 64 24 08 4c 39 e3 72 bf 8b 0d 86 02 f6 00 31 ed 31 c0 eb 17 48 63 f0 49 8b 54 24 20 83 c0 01 48 8b 34 f5 40 e6 d4 95 <48> 03 2c 16 89 ce 48 63 d0 48 c7 c7 c0 fd d4 95 e8 78 3f f5 ff 8b
[Fri May 16 12:05:41 2025] RSP: 0018:ffffb2eb40bdfe00 EFLAGS: 00010202
[Fri May 16 12:05:41 2025] RAX: 0000000000000001 RBX: ffffffffc04609d7 RCX: 0000000000000008
[Fri May 16 12:05:41 2025] RDX: 0000000000000000 RSI: ffff9d28984c3000 RDI: ffffffff95d4fdc0
[Fri May 16 12:05:41 2025] RBP: 0000000000000000 R08: ffff9d25035aed08 R09: 000000000000fcf8
[Fri May 16 12:05:41 2025] R10: ffffb2eb40bdfe00 R11: 0000000000000001 R12: ffffffffc04609b0
[Fri May 16 12:05:41 2025] R13: ffffffffc19ef000 R14: ffff9d2502ab6870 R15: 0000000000000000
[Fri May 16 12:05:41 2025] FS:  00007fe9ee051040(0000) GS:ffff9d28984c3000(0000) knlGS:0000000000000000
[Fri May 16 12:05:41 2025] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[Fri May 16 12:05:41 2025] CR2: ffff9d28984c3000 CR3: 000000011278d000 CR4: 0000000000350ef0
[Fri May 16 12:05:41 2025] Call Trace:
[Fri May 16 12:05:41 2025]  <TASK>
[Fri May 16 12:05:41 2025]  codetag_unload_module+0x135/0x160
[Fri May 16 12:05:41 2025]  free_module+0x19/0x1a0
[Fri May 16 12:05:41 2025]  __do_sys_delete_module+0x274/0x310
[Fri May 16 12:05:41 2025]  ? srso_return_thunk+0x5/0x5f
[Fri May 16 12:05:41 2025]  ? fpregs_assert_state_consistent+0x21/0x50
[Fri May 16 12:05:41 2025]  ? srso_return_thunk+0x5/0x5f
[Fri May 16 12:05:41 2025]  do_syscall_64+0x4b/0x120
[Fri May 16 12:05:41 2025]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
[Fri May 16 12:05:41 2025] RIP: 0033:0x7fe9ed928a67
[Fri May 16 12:05:41 2025] Code: 73 01 c3 48 8b 0d 99 83 0c 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 b8 b0 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 69 83 0c 00 f7 d8 64 89 01 48
[Fri May 16 12:05:41 2025] RSP: 002b:00007ffe17759568 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0
[Fri May 16 12:05:41 2025] RAX: ffffffffffffffda RBX: 0000564f850b7e70 RCX: 00007fe9ed928a67
[Fri May 16 12:05:41 2025] RDX: 0000000000000000 RSI: 0000000000000800 RDI: 0000564f850b7ed8
[Fri May 16 12:05:41 2025] RBP: 0000000000000000 R08: 1999999999999999 R09: 0000000000000000
[Fri May 16 12:05:41 2025] R10: 00007fe9ed99bac0 R11: 0000000000000206 R12: 0000000000000000
[Fri May 16 12:05:41 2025] R13: 0000000000000000 R14: 00007ffe177595a0 R15: 00007ffe1775aa60
[Fri May 16 12:05:41 2025]  </TASK>
[Fri May 16 12:05:41 2025] Modules linked in: vfio_pci_core(E-) vfio(E) drm_client_lib(E) amd_atl(E) intel_rapl_msr(E) intel_rapl_common(E) binfmt_misc(E) edac_mce_amd(E) kvm_amd(E) snd_hda_codec_realtek(E) nls_ascii(E) iwlmvm(E) nls_cp437(E) snd_hda_codec_generic(E) snd_hda_codec_hdmi(E) kvm(E) snd_hda_scodec_component(E) mac80211(E) irqbypass(E) uvcvideo(E) vfat(E) snd_hda_intel(E) libarc4(E) ghash_clmulni_intel(E) fat(E) videobuf2_vmalloc(E) snd_intel_dspcfg(E) sha256_ssse3(E) uvc(E) snd_usb_audio(E) sha1_ssse3(E) snd_hda_codec(E) videobuf2_memops(E) iwlwifi(E) drm_ttm_helper(E) videobuf2_v4l2(E) aesni_intel(E) snd_usbmidi_lib(E) snd_hda_core(E) ttm(E) snd_rawmidi(E) gf128mul(E) snd_pcsp(E) snd_seq_device(E) ppdev(E) snd_hwdep(E) crypto_simd(E) drm_kms_helper(E) videodev(E) cryptd(E) cfg80211(E) snd_pcm(E) videobuf2_common(E) rapl(E) snd_timer(E) sp5100_tco(E) drm(E) mc(E) wmi_bmof(E) k10temp(E) tpm_crb(E) rfkill(E) snd(E) sha3_generic(E) video(E) soundcore(E) jitterentropy_rng(E) sha512_ssse3(E) tpm_tis(E) ccp(E)
[Fri May 16 12:05:41 2025]  sha512_generic(E) tpm_tis_core(E) joydev(E) drbg(E) tpm(E) ansi_cprng(E) libaescfb(E) ecdh_generic(E) ecc(E) parport_pc(E) parport(E) sg(E) rng_core(E) evdev(E) msr(E) loop(E) fuse(E) efi_pstore(E) dm_mod(E) configfs(E) ip_tables(E) x_tables(E) autofs4(E) ext4(E) crc16(E) mbcache(E) jbd2(E) btrfs(E) blake2b_generic(E) efivarfs(E) raid10(E) raid456(E) async_raid6_recov(E) async_memcpy(E) async_pq(E) async_xor(E) async_tx(E) xor(E) raid6_pq(E) raid1(E) raid0(E) md_mod(E) hid_generic(E) usbhid(E) hid(E) sd_mod(E) ahci(E) xhci_pci(E) libahci(E) nvme(E) xhci_hcd(E) libata(E) i2c_piix4(E) i2c_smbus(E) r8169(E) nvme_core(E) realtek(E) usbcore(E) scsi_mod(E) scsi_common(E) usb_common(E) wmi(E) gpio_amdpt(E) gpio_generic(E) button(E) [last unloaded: vfio_pci_core(E)]
[Fri May 16 12:05:41 2025] CR2: ffff9d28984c3000
[Fri May 16 12:05:41 2025] ---[ end trace 0000000000000000 ]---
[Fri May 16 12:05:41 2025] RIP: 0010:release_module_tags+0x103/0x1b0
[Fri May 16 12:05:41 2025] Code: e3 4c 03 64 24 08 4c 39 e3 72 bf 8b 0d 86 02 f6 00 31 ed 31 c0 eb 17 48 63 f0 49 8b 54 24 20 83 c0 01 48 8b 34 f5 40 e6 d4 95 <48> 03 2c 16 89 ce 48 63 d0 48 c7 c7 c0 fd d4 95 e8 78 3f f5 ff 8b
[Fri May 16 12:05:41 2025] RSP: 0018:ffffb2eb40bdfe00 EFLAGS: 00010202
[Fri May 16 12:05:41 2025] RAX: 0000000000000001 RBX: ffffffffc04609d7 RCX: 0000000000000008
[Fri May 16 12:05:41 2025] RDX: 0000000000000000 RSI: ffff9d28984c3000 RDI: ffffffff95d4fdc0
[Fri May 16 12:05:41 2025] RBP: 0000000000000000 R08: ffff9d25035aed08 R09: 000000000000fcf8
[Fri May 16 12:05:41 2025] R10: ffffb2eb40bdfe00 R11: 0000000000000001 R12: ffffffffc04609b0
[Fri May 16 12:05:41 2025] R13: ffffffffc19ef000 R14: ffff9d2502ab6870 R15: 0000000000000000
[Fri May 16 12:05:41 2025] FS:  00007fe9ee051040(0000) GS:ffff9d28984c3000(0000) knlGS:0000000000000000
[Fri May 16 12:05:41 2025] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[Fri May 16 12:05:41 2025] CR2: ffff9d28984c3000 CR3: 000000011278d000 CR4: 0000000000350ef0
[Fri May 16 12:05:41 2025] note: modprobe[35898] exited with irqs disabled
[Fri May 16 12:05:41 2025] note: modprobe[35898] exited with preempt_count 1




Thanks
David



^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: BUG: unable to handle page fault for address
  2025-05-16 13:12 BUG: unable to handle page fault for address David Wang
@ 2025-05-16 17:03 ` Suren Baghdasaryan
  2025-05-17  0:11   ` Suren Baghdasaryan
  0 siblings, 1 reply; 12+ messages in thread
From: Suren Baghdasaryan @ 2025-05-16 17:03 UTC (permalink / raw)
  To: David Wang; +Cc: kent.overstreet, linux-mm, linux-kernel

Hi David,

On Fri, May 16, 2025 at 6:13 AM David Wang <00107082@163.com> wrote:
>
> Hi,
>
> I caught a page fault when I was changing my nvidia driver:
> (This happens randomly, I can reproduce it with about 1/3 probability)
>
> [Fri May 16 12:05:41 2025] BUG: unable to handle page fault for address: ffff9d28984c3000
> [Fri May 16 12:05:41 2025] #PF: supervisor read access in kernel mode
> [Fri May 16 12:05:41 2025] #PF: error_code(0x0000) - not-present page
> ...
> [Fri May 16 12:05:41 2025] RIP: 0010:release_module_tags+0x103/0x1b0
> ...
> [Fri May 16 12:05:41 2025] Call Trace:
> [Fri May 16 12:05:41 2025]  <TASK>
> [Fri May 16 12:05:41 2025]  codetag_unload_module+0x135/0x160
> [Fri May 16 12:05:41 2025]  free_module+0x19/0x1a0
> ...
> (full kernel logs are pasted at the end.)
>
> Using a image with DEBUG_INFO, the calltrack parses as:
>
> RIP: 0010:release_module_tags (./include/linux/alloc_tag.h:134 lib/alloc_tag.c:352 lib/alloc_tag.c:573)
> [Fri May 16 12:05:41 2025] codetag_unload_module (lib/codetag.c:355)
> [Fri May 16 12:05:41 2025] free_module (kernel/module/main.c:1305)
> [Fri May 16 12:05:41 2025] __do_sys_delete_module (kernel/module/main.c:795)
>
> The offending lines in my codebase:
>         126 static inline struct alloc_tag_counters alloc_tag_read(struct alloc_tag *tag)
>         127 {
>         ...
>         131
>         132         for_each_possible_cpu(cpu) {
>         133                 counter = per_cpu_ptr(tag->counters, cpu);
> >>>>    134                 v.bytes += counter->bytes;   <--------------here
>         135                 v.calls += counter->calls;
>
>
> Nvidia drivers are out-tree... there could be some strange behavior in it causes this.. but,
> when I check the code, I got concerned about lifecycle of tag->counters.
> Based on following defination:
>         108 #define DEFINE_ALLOC_TAG(_alloc_tag)                                            \
>         109         static DEFINE_PER_CPU(struct alloc_tag_counters, _alloc_tag_cntr);      \
>         110         static struct alloc_tag _alloc_tag __used __aligned(8)                  \
>         111         __section(ALLOC_TAG_SECTION_NAME) = {                                   \
>         112                 .ct = CODE_TAG_INIT,                                            \
>         113                 .counters = &_alloc_tag_cntr };
>         114
> _alloc_tag_cntr is the data referenced by tag->counters, but they are in different section,
> and alloc_tag only prepare storage for section ALLOC_TAG_SECTION_NAME.
> right?
> Then what happens to those ".data..percpu" section when module is unloaded?
> Is it safe to keep using those ".data..percpu" section after module unloaded,
> or even during module is unloading?

Yes, I think you are right, free_module() calls percpu_modfree() which
would free the per-cpu memory allocated for the module. Before
0db6f8d7820a ("alloc_tag: load module tags into separate contiguous
memory") we would not unload the module if there were tags which were
still in use. After that change we load module tags into separate
memory, so I expected this to work but due to this external reference
it indeed should lead to UAF.
I think the simplest way to fix this would be to bypass
percpu_modfree() inside free_module() when there are module tags still
referenced, store mod->percpu inside alloc_tag_module_section and free
it inside clean_unused_module_areas_locked() once we know the counters
are not used anymore. I'll take a stab at it and will send a patch for
testing today.
Thanks,
Suren.

>
> I was expecting page fault when I make following experiments:
>
> 1.Load module A
> 2.Load module B
> 3.module B alloc memory, and handover the memory to A
> 4.unload module B
> (memory profiling report module B has memory not freed)
> ... after a while....
> 5.unload module A, where A free the memory. (when A kfree the memory,
> the counters used to be in module B's ".data..percpu" section should
> be referenced, it that section is gone, a pagefault should happen).
>
> But, after several trials, not page fault reported....
> Would kernel keeps ".data..percpu" since ALLOC_TAG_SECTION_NAME has reference to it,
> or I just need wait longer for kernel to "purge" those sections.
>
>
> Full logs:
> [Fri May 16 12:02:28 2025] nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms  550.144.03  Mon Dec 30 17:10:10 UTC 2024
> [Fri May 16 12:02:28 2025] [drm] [nvidia-drm] [GPU ID 0x00002600] Loading driver
> [Fri May 16 12:02:28 2025] [drm] Initialized nvidia-drm 0.0.0 for 0000:26:00.0 on minor 0
> [Fri May 16 12:03:45 2025] [drm] [nvidia-drm] [GPU ID 0x00002600] Unloading driver
> [Fri May 16 12:03:46 2025] nvidia-modeset: Unloading
> [Fri May 16 12:03:46 2025] nvidia-nvlink: Unregistered Nvlink Core, major device number 240
> [Fri May 16 12:04:38 2025] VFIO - User Level meta-driver version: 0.3
> [Fri May 16 12:04:38 2025] nvidia-nvlink: Nvlink Core is being initialized, major device number 239
>
> [Fri May 16 12:04:38 2025] nvidia 0000:26:00.0: vgaarb: VGA decodes changed: olddecodes=none,decodes=none:owns=none
> [Fri May 16 12:04:38 2025] NVRM: loading NVIDIA UNIX x86_64 Kernel Module  570.144  Thu Apr 10 20:33:29 UTC 2025
> [Fri May 16 12:04:38 2025] nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms  570.144  Thu Apr 10 20:03:03 UTC 2025
> [Fri May 16 12:04:38 2025] nvidia_drm: Unknown symbol drm_client_setup (err -2)
> [Fri May 16 12:04:40 2025] nvidia-modeset: Unloading
> [Fri May 16 12:04:40 2025] nvidia-nvlink: Unregistered Nvlink Core, major device number 239
> [Fri May 16 12:05:40 2025] VFIO - User Level meta-driver version: 0.3
> [Fri May 16 12:05:40 2025] nvidia-nvlink: Nvlink Core is being initialized, major device number 239
>
> [Fri May 16 12:05:40 2025] nvidia 0000:26:00.0: vgaarb: VGA decodes changed: olddecodes=none,decodes=none:owns=none
> [Fri May 16 12:05:40 2025] NVRM: loading NVIDIA UNIX x86_64 Kernel Module  570.144  Thu Apr 10 20:33:29 UTC 2025
> [Fri May 16 12:05:40 2025] nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms  570.144  Thu Apr 10 20:03:03 UTC 2025
> [Fri May 16 12:05:40 2025] [drm] [nvidia-drm] [GPU ID 0x00002600] Loading driver
> [Fri May 16 12:05:40 2025] [drm] Initialized nvidia-drm 0.0.0 for 0000:26:00.0 on minor 0
> [Fri May 16 12:05:40 2025] [drm] [nvidia-drm] [GPU ID 0x00002600] Unloading driver
> [Fri May 16 12:05:40 2025] nvidia-modeset: Unloading
> [Fri May 16 12:05:41 2025] nvidia-nvlink: Unregistered Nvlink Core, major device number 239
> [Fri May 16 12:05:41 2025] BUG: unable to handle page fault for address: ffff9d28984c3000
> [Fri May 16 12:05:41 2025] #PF: supervisor read access in kernel mode
> [Fri May 16 12:05:41 2025] #PF: error_code(0x0000) - not-present page
> [Fri May 16 12:05:41 2025] PGD 163001067 P4D 163001067 PUD 0
> [Fri May 16 12:05:41 2025] Oops: Oops: 0000 [#1] SMP NOPTI
> [Fri May 16 12:05:41 2025] CPU: 0 UID: 0 PID: 35898 Comm: modprobe Tainted: G           OE       6.15.0-rc6-linan-0 #587 PREEMPT(voluntary)
> [Fri May 16 12:05:41 2025] Tainted: [O]=OOT_MODULE, [E]=UNSIGNED_MODULE
> [Fri May 16 12:05:41 2025] Hardware name: Micro-Star International Co., Ltd. MS-7B89/B450M MORTAR MAX (MS-7B89), BIOS 2.L0 07/18/2024
> [Fri May 16 12:05:41 2025] RIP: 0010:release_module_tags+0x103/0x1b0
> [Fri May 16 12:05:41 2025] Code: e3 4c 03 64 24 08 4c 39 e3 72 bf 8b 0d 86 02 f6 00 31 ed 31 c0 eb 17 48 63 f0 49 8b 54 24 20 83 c0 01 48 8b 34 f5 40 e6 d4 95 <48> 03 2c 16 89 ce 48 63 d0 48 c7 c7 c0 fd d4 95 e8 78 3f f5 ff 8b
> [Fri May 16 12:05:41 2025] RSP: 0018:ffffb2eb40bdfe00 EFLAGS: 00010202
> [Fri May 16 12:05:41 2025] RAX: 0000000000000001 RBX: ffffffffc04609d7 RCX: 0000000000000008
> [Fri May 16 12:05:41 2025] RDX: 0000000000000000 RSI: ffff9d28984c3000 RDI: ffffffff95d4fdc0
> [Fri May 16 12:05:41 2025] RBP: 0000000000000000 R08: ffff9d25035aed08 R09: 000000000000fcf8
> [Fri May 16 12:05:41 2025] R10: ffffb2eb40bdfe00 R11: 0000000000000001 R12: ffffffffc04609b0
> [Fri May 16 12:05:41 2025] R13: ffffffffc19ef000 R14: ffff9d2502ab6870 R15: 0000000000000000
> [Fri May 16 12:05:41 2025] FS:  00007fe9ee051040(0000) GS:ffff9d28984c3000(0000) knlGS:0000000000000000
> [Fri May 16 12:05:41 2025] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [Fri May 16 12:05:41 2025] CR2: ffff9d28984c3000 CR3: 000000011278d000 CR4: 0000000000350ef0
> [Fri May 16 12:05:41 2025] Call Trace:
> [Fri May 16 12:05:41 2025]  <TASK>
> [Fri May 16 12:05:41 2025]  codetag_unload_module+0x135/0x160
> [Fri May 16 12:05:41 2025]  free_module+0x19/0x1a0
> [Fri May 16 12:05:41 2025]  __do_sys_delete_module+0x274/0x310
> [Fri May 16 12:05:41 2025]  ? srso_return_thunk+0x5/0x5f
> [Fri May 16 12:05:41 2025]  ? fpregs_assert_state_consistent+0x21/0x50
> [Fri May 16 12:05:41 2025]  ? srso_return_thunk+0x5/0x5f
> [Fri May 16 12:05:41 2025]  do_syscall_64+0x4b/0x120
> [Fri May 16 12:05:41 2025]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
> [Fri May 16 12:05:41 2025] RIP: 0033:0x7fe9ed928a67
> [Fri May 16 12:05:41 2025] Code: 73 01 c3 48 8b 0d 99 83 0c 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 b8 b0 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 69 83 0c 00 f7 d8 64 89 01 48
> [Fri May 16 12:05:41 2025] RSP: 002b:00007ffe17759568 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0
> [Fri May 16 12:05:41 2025] RAX: ffffffffffffffda RBX: 0000564f850b7e70 RCX: 00007fe9ed928a67
> [Fri May 16 12:05:41 2025] RDX: 0000000000000000 RSI: 0000000000000800 RDI: 0000564f850b7ed8
> [Fri May 16 12:05:41 2025] RBP: 0000000000000000 R08: 1999999999999999 R09: 0000000000000000
> [Fri May 16 12:05:41 2025] R10: 00007fe9ed99bac0 R11: 0000000000000206 R12: 0000000000000000
> [Fri May 16 12:05:41 2025] R13: 0000000000000000 R14: 00007ffe177595a0 R15: 00007ffe1775aa60
> [Fri May 16 12:05:41 2025]  </TASK>
> [Fri May 16 12:05:41 2025] Modules linked in: vfio_pci_core(E-) vfio(E) drm_client_lib(E) amd_atl(E) intel_rapl_msr(E) intel_rapl_common(E) binfmt_misc(E) edac_mce_amd(E) kvm_amd(E) snd_hda_codec_realtek(E) nls_ascii(E) iwlmvm(E) nls_cp437(E) snd_hda_codec_generic(E) snd_hda_codec_hdmi(E) kvm(E) snd_hda_scodec_component(E) mac80211(E) irqbypass(E) uvcvideo(E) vfat(E) snd_hda_intel(E) libarc4(E) ghash_clmulni_intel(E) fat(E) videobuf2_vmalloc(E) snd_intel_dspcfg(E) sha256_ssse3(E) uvc(E) snd_usb_audio(E) sha1_ssse3(E) snd_hda_codec(E) videobuf2_memops(E) iwlwifi(E) drm_ttm_helper(E) videobuf2_v4l2(E) aesni_intel(E) snd_usbmidi_lib(E) snd_hda_core(E) ttm(E) snd_rawmidi(E) gf128mul(E) snd_pcsp(E) snd_seq_device(E) ppdev(E) snd_hwdep(E) crypto_simd(E) drm_kms_helper(E) videodev(E) cryptd(E) cfg80211(E) snd_pcm(E) videobuf2_common(E) rapl(E) snd_timer(E) sp5100_tco(E) drm(E) mc(E) wmi_bmof(E) k10temp(E) tpm_crb(E) rfkill(E) snd(E) sha3_generic(E) video(E) soundcore(E) jitterentropy_rng(E) sha512_ssse3(E) tpm_tis(E) ccp(E)
> [Fri May 16 12:05:41 2025]  sha512_generic(E) tpm_tis_core(E) joydev(E) drbg(E) tpm(E) ansi_cprng(E) libaescfb(E) ecdh_generic(E) ecc(E) parport_pc(E) parport(E) sg(E) rng_core(E) evdev(E) msr(E) loop(E) fuse(E) efi_pstore(E) dm_mod(E) configfs(E) ip_tables(E) x_tables(E) autofs4(E) ext4(E) crc16(E) mbcache(E) jbd2(E) btrfs(E) blake2b_generic(E) efivarfs(E) raid10(E) raid456(E) async_raid6_recov(E) async_memcpy(E) async_pq(E) async_xor(E) async_tx(E) xor(E) raid6_pq(E) raid1(E) raid0(E) md_mod(E) hid_generic(E) usbhid(E) hid(E) sd_mod(E) ahci(E) xhci_pci(E) libahci(E) nvme(E) xhci_hcd(E) libata(E) i2c_piix4(E) i2c_smbus(E) r8169(E) nvme_core(E) realtek(E) usbcore(E) scsi_mod(E) scsi_common(E) usb_common(E) wmi(E) gpio_amdpt(E) gpio_generic(E) button(E) [last unloaded: vfio_pci_core(E)]
> [Fri May 16 12:05:41 2025] CR2: ffff9d28984c3000
> [Fri May 16 12:05:41 2025] ---[ end trace 0000000000000000 ]---
> [Fri May 16 12:05:41 2025] RIP: 0010:release_module_tags+0x103/0x1b0
> [Fri May 16 12:05:41 2025] Code: e3 4c 03 64 24 08 4c 39 e3 72 bf 8b 0d 86 02 f6 00 31 ed 31 c0 eb 17 48 63 f0 49 8b 54 24 20 83 c0 01 48 8b 34 f5 40 e6 d4 95 <48> 03 2c 16 89 ce 48 63 d0 48 c7 c7 c0 fd d4 95 e8 78 3f f5 ff 8b
> [Fri May 16 12:05:41 2025] RSP: 0018:ffffb2eb40bdfe00 EFLAGS: 00010202
> [Fri May 16 12:05:41 2025] RAX: 0000000000000001 RBX: ffffffffc04609d7 RCX: 0000000000000008
> [Fri May 16 12:05:41 2025] RDX: 0000000000000000 RSI: ffff9d28984c3000 RDI: ffffffff95d4fdc0
> [Fri May 16 12:05:41 2025] RBP: 0000000000000000 R08: ffff9d25035aed08 R09: 000000000000fcf8
> [Fri May 16 12:05:41 2025] R10: ffffb2eb40bdfe00 R11: 0000000000000001 R12: ffffffffc04609b0
> [Fri May 16 12:05:41 2025] R13: ffffffffc19ef000 R14: ffff9d2502ab6870 R15: 0000000000000000
> [Fri May 16 12:05:41 2025] FS:  00007fe9ee051040(0000) GS:ffff9d28984c3000(0000) knlGS:0000000000000000
> [Fri May 16 12:05:41 2025] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [Fri May 16 12:05:41 2025] CR2: ffff9d28984c3000 CR3: 000000011278d000 CR4: 0000000000350ef0
> [Fri May 16 12:05:41 2025] note: modprobe[35898] exited with irqs disabled
> [Fri May 16 12:05:41 2025] note: modprobe[35898] exited with preempt_count 1
>
>
>
>
> Thanks
> David
>


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: BUG: unable to handle page fault for address
  2025-05-16 17:03 ` Suren Baghdasaryan
@ 2025-05-17  0:11   ` Suren Baghdasaryan
  2025-05-17  7:02     ` David Wang
  0 siblings, 1 reply; 12+ messages in thread
From: Suren Baghdasaryan @ 2025-05-17  0:11 UTC (permalink / raw)
  To: David Wang; +Cc: kent.overstreet, linux-mm, linux-kernel

On Fri, May 16, 2025 at 10:03 AM Suren Baghdasaryan <surenb@google.com> wrote:
>
> Hi David,
>
> On Fri, May 16, 2025 at 6:13 AM David Wang <00107082@163.com> wrote:
> >
> > Hi,
> >
> > I caught a page fault when I was changing my nvidia driver:
> > (This happens randomly, I can reproduce it with about 1/3 probability)
> >
> > [Fri May 16 12:05:41 2025] BUG: unable to handle page fault for address: ffff9d28984c3000
> > [Fri May 16 12:05:41 2025] #PF: supervisor read access in kernel mode
> > [Fri May 16 12:05:41 2025] #PF: error_code(0x0000) - not-present page
> > ...
> > [Fri May 16 12:05:41 2025] RIP: 0010:release_module_tags+0x103/0x1b0
> > ...
> > [Fri May 16 12:05:41 2025] Call Trace:
> > [Fri May 16 12:05:41 2025]  <TASK>
> > [Fri May 16 12:05:41 2025]  codetag_unload_module+0x135/0x160
> > [Fri May 16 12:05:41 2025]  free_module+0x19/0x1a0
> > ...
> > (full kernel logs are pasted at the end.)
> >
> > Using a image with DEBUG_INFO, the calltrack parses as:
> >
> > RIP: 0010:release_module_tags (./include/linux/alloc_tag.h:134 lib/alloc_tag.c:352 lib/alloc_tag.c:573)
> > [Fri May 16 12:05:41 2025] codetag_unload_module (lib/codetag.c:355)
> > [Fri May 16 12:05:41 2025] free_module (kernel/module/main.c:1305)
> > [Fri May 16 12:05:41 2025] __do_sys_delete_module (kernel/module/main.c:795)
> >
> > The offending lines in my codebase:
> >         126 static inline struct alloc_tag_counters alloc_tag_read(struct alloc_tag *tag)
> >         127 {
> >         ...
> >         131
> >         132         for_each_possible_cpu(cpu) {
> >         133                 counter = per_cpu_ptr(tag->counters, cpu);
> > >>>>    134                 v.bytes += counter->bytes;   <--------------here
> >         135                 v.calls += counter->calls;
> >
> >
> > Nvidia drivers are out-tree... there could be some strange behavior in it causes this.. but,
> > when I check the code, I got concerned about lifecycle of tag->counters.
> > Based on following defination:
> >         108 #define DEFINE_ALLOC_TAG(_alloc_tag)                                            \
> >         109         static DEFINE_PER_CPU(struct alloc_tag_counters, _alloc_tag_cntr);      \
> >         110         static struct alloc_tag _alloc_tag __used __aligned(8)                  \
> >         111         __section(ALLOC_TAG_SECTION_NAME) = {                                   \
> >         112                 .ct = CODE_TAG_INIT,                                            \
> >         113                 .counters = &_alloc_tag_cntr };
> >         114
> > _alloc_tag_cntr is the data referenced by tag->counters, but they are in different section,
> > and alloc_tag only prepare storage for section ALLOC_TAG_SECTION_NAME.
> > right?
> > Then what happens to those ".data..percpu" section when module is unloaded?
> > Is it safe to keep using those ".data..percpu" section after module unloaded,
> > or even during module is unloading?
>
> Yes, I think you are right, free_module() calls percpu_modfree() which
> would free the per-cpu memory allocated for the module. Before
> 0db6f8d7820a ("alloc_tag: load module tags into separate contiguous
> memory") we would not unload the module if there were tags which were
> still in use. After that change we load module tags into separate
> memory, so I expected this to work but due to this external reference
> it indeed should lead to UAF.
> I think the simplest way to fix this would be to bypass
> percpu_modfree() inside free_module() when there are module tags still
> referenced, store mod->percpu inside alloc_tag_module_section and free
> it inside clean_unused_module_areas_locked() once we know the counters
> are not used anymore. I'll take a stab at it and will send a patch for
> testing today.

Ok, I went with another implementation, instead dynamically allocating
percpu memory for modules at the module load time. This has another
advantage of not needing extra PERCPU_MODULE_RESERVE currently
required for memory allocation tagging to work.
David, the patch is posted at [1]. Please give it a try and let me
know if the fix works for you.
Thanks,
Suren.

[1] https://lore.kernel.org/all/20250517000739.5930-1-surenb@google.com/


> Thanks,
> Suren.
>
> >
> > I was expecting page fault when I make following experiments:
> >
> > 1.Load module A
> > 2.Load module B
> > 3.module B alloc memory, and handover the memory to A
> > 4.unload module B
> > (memory profiling report module B has memory not freed)
> > ... after a while....
> > 5.unload module A, where A free the memory. (when A kfree the memory,
> > the counters used to be in module B's ".data..percpu" section should
> > be referenced, it that section is gone, a pagefault should happen).
> >
> > But, after several trials, not page fault reported....
> > Would kernel keeps ".data..percpu" since ALLOC_TAG_SECTION_NAME has reference to it,
> > or I just need wait longer for kernel to "purge" those sections.
> >
> >
> > Full logs:
> > [Fri May 16 12:02:28 2025] nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms  550.144.03  Mon Dec 30 17:10:10 UTC 2024
> > [Fri May 16 12:02:28 2025] [drm] [nvidia-drm] [GPU ID 0x00002600] Loading driver
> > [Fri May 16 12:02:28 2025] [drm] Initialized nvidia-drm 0.0.0 for 0000:26:00.0 on minor 0
> > [Fri May 16 12:03:45 2025] [drm] [nvidia-drm] [GPU ID 0x00002600] Unloading driver
> > [Fri May 16 12:03:46 2025] nvidia-modeset: Unloading
> > [Fri May 16 12:03:46 2025] nvidia-nvlink: Unregistered Nvlink Core, major device number 240
> > [Fri May 16 12:04:38 2025] VFIO - User Level meta-driver version: 0.3
> > [Fri May 16 12:04:38 2025] nvidia-nvlink: Nvlink Core is being initialized, major device number 239
> >
> > [Fri May 16 12:04:38 2025] nvidia 0000:26:00.0: vgaarb: VGA decodes changed: olddecodes=none,decodes=none:owns=none
> > [Fri May 16 12:04:38 2025] NVRM: loading NVIDIA UNIX x86_64 Kernel Module  570.144  Thu Apr 10 20:33:29 UTC 2025
> > [Fri May 16 12:04:38 2025] nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms  570.144  Thu Apr 10 20:03:03 UTC 2025
> > [Fri May 16 12:04:38 2025] nvidia_drm: Unknown symbol drm_client_setup (err -2)
> > [Fri May 16 12:04:40 2025] nvidia-modeset: Unloading
> > [Fri May 16 12:04:40 2025] nvidia-nvlink: Unregistered Nvlink Core, major device number 239
> > [Fri May 16 12:05:40 2025] VFIO - User Level meta-driver version: 0.3
> > [Fri May 16 12:05:40 2025] nvidia-nvlink: Nvlink Core is being initialized, major device number 239
> >
> > [Fri May 16 12:05:40 2025] nvidia 0000:26:00.0: vgaarb: VGA decodes changed: olddecodes=none,decodes=none:owns=none
> > [Fri May 16 12:05:40 2025] NVRM: loading NVIDIA UNIX x86_64 Kernel Module  570.144  Thu Apr 10 20:33:29 UTC 2025
> > [Fri May 16 12:05:40 2025] nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms  570.144  Thu Apr 10 20:03:03 UTC 2025
> > [Fri May 16 12:05:40 2025] [drm] [nvidia-drm] [GPU ID 0x00002600] Loading driver
> > [Fri May 16 12:05:40 2025] [drm] Initialized nvidia-drm 0.0.0 for 0000:26:00.0 on minor 0
> > [Fri May 16 12:05:40 2025] [drm] [nvidia-drm] [GPU ID 0x00002600] Unloading driver
> > [Fri May 16 12:05:40 2025] nvidia-modeset: Unloading
> > [Fri May 16 12:05:41 2025] nvidia-nvlink: Unregistered Nvlink Core, major device number 239
> > [Fri May 16 12:05:41 2025] BUG: unable to handle page fault for address: ffff9d28984c3000
> > [Fri May 16 12:05:41 2025] #PF: supervisor read access in kernel mode
> > [Fri May 16 12:05:41 2025] #PF: error_code(0x0000) - not-present page
> > [Fri May 16 12:05:41 2025] PGD 163001067 P4D 163001067 PUD 0
> > [Fri May 16 12:05:41 2025] Oops: Oops: 0000 [#1] SMP NOPTI
> > [Fri May 16 12:05:41 2025] CPU: 0 UID: 0 PID: 35898 Comm: modprobe Tainted: G           OE       6.15.0-rc6-linan-0 #587 PREEMPT(voluntary)
> > [Fri May 16 12:05:41 2025] Tainted: [O]=OOT_MODULE, [E]=UNSIGNED_MODULE
> > [Fri May 16 12:05:41 2025] Hardware name: Micro-Star International Co., Ltd. MS-7B89/B450M MORTAR MAX (MS-7B89), BIOS 2.L0 07/18/2024
> > [Fri May 16 12:05:41 2025] RIP: 0010:release_module_tags+0x103/0x1b0
> > [Fri May 16 12:05:41 2025] Code: e3 4c 03 64 24 08 4c 39 e3 72 bf 8b 0d 86 02 f6 00 31 ed 31 c0 eb 17 48 63 f0 49 8b 54 24 20 83 c0 01 48 8b 34 f5 40 e6 d4 95 <48> 03 2c 16 89 ce 48 63 d0 48 c7 c7 c0 fd d4 95 e8 78 3f f5 ff 8b
> > [Fri May 16 12:05:41 2025] RSP: 0018:ffffb2eb40bdfe00 EFLAGS: 00010202
> > [Fri May 16 12:05:41 2025] RAX: 0000000000000001 RBX: ffffffffc04609d7 RCX: 0000000000000008
> > [Fri May 16 12:05:41 2025] RDX: 0000000000000000 RSI: ffff9d28984c3000 RDI: ffffffff95d4fdc0
> > [Fri May 16 12:05:41 2025] RBP: 0000000000000000 R08: ffff9d25035aed08 R09: 000000000000fcf8
> > [Fri May 16 12:05:41 2025] R10: ffffb2eb40bdfe00 R11: 0000000000000001 R12: ffffffffc04609b0
> > [Fri May 16 12:05:41 2025] R13: ffffffffc19ef000 R14: ffff9d2502ab6870 R15: 0000000000000000
> > [Fri May 16 12:05:41 2025] FS:  00007fe9ee051040(0000) GS:ffff9d28984c3000(0000) knlGS:0000000000000000
> > [Fri May 16 12:05:41 2025] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > [Fri May 16 12:05:41 2025] CR2: ffff9d28984c3000 CR3: 000000011278d000 CR4: 0000000000350ef0
> > [Fri May 16 12:05:41 2025] Call Trace:
> > [Fri May 16 12:05:41 2025]  <TASK>
> > [Fri May 16 12:05:41 2025]  codetag_unload_module+0x135/0x160
> > [Fri May 16 12:05:41 2025]  free_module+0x19/0x1a0
> > [Fri May 16 12:05:41 2025]  __do_sys_delete_module+0x274/0x310
> > [Fri May 16 12:05:41 2025]  ? srso_return_thunk+0x5/0x5f
> > [Fri May 16 12:05:41 2025]  ? fpregs_assert_state_consistent+0x21/0x50
> > [Fri May 16 12:05:41 2025]  ? srso_return_thunk+0x5/0x5f
> > [Fri May 16 12:05:41 2025]  do_syscall_64+0x4b/0x120
> > [Fri May 16 12:05:41 2025]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
> > [Fri May 16 12:05:41 2025] RIP: 0033:0x7fe9ed928a67
> > [Fri May 16 12:05:41 2025] Code: 73 01 c3 48 8b 0d 99 83 0c 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 b8 b0 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 69 83 0c 00 f7 d8 64 89 01 48
> > [Fri May 16 12:05:41 2025] RSP: 002b:00007ffe17759568 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0
> > [Fri May 16 12:05:41 2025] RAX: ffffffffffffffda RBX: 0000564f850b7e70 RCX: 00007fe9ed928a67
> > [Fri May 16 12:05:41 2025] RDX: 0000000000000000 RSI: 0000000000000800 RDI: 0000564f850b7ed8
> > [Fri May 16 12:05:41 2025] RBP: 0000000000000000 R08: 1999999999999999 R09: 0000000000000000
> > [Fri May 16 12:05:41 2025] R10: 00007fe9ed99bac0 R11: 0000000000000206 R12: 0000000000000000
> > [Fri May 16 12:05:41 2025] R13: 0000000000000000 R14: 00007ffe177595a0 R15: 00007ffe1775aa60
> > [Fri May 16 12:05:41 2025]  </TASK>
> > [Fri May 16 12:05:41 2025] Modules linked in: vfio_pci_core(E-) vfio(E) drm_client_lib(E) amd_atl(E) intel_rapl_msr(E) intel_rapl_common(E) binfmt_misc(E) edac_mce_amd(E) kvm_amd(E) snd_hda_codec_realtek(E) nls_ascii(E) iwlmvm(E) nls_cp437(E) snd_hda_codec_generic(E) snd_hda_codec_hdmi(E) kvm(E) snd_hda_scodec_component(E) mac80211(E) irqbypass(E) uvcvideo(E) vfat(E) snd_hda_intel(E) libarc4(E) ghash_clmulni_intel(E) fat(E) videobuf2_vmalloc(E) snd_intel_dspcfg(E) sha256_ssse3(E) uvc(E) snd_usb_audio(E) sha1_ssse3(E) snd_hda_codec(E) videobuf2_memops(E) iwlwifi(E) drm_ttm_helper(E) videobuf2_v4l2(E) aesni_intel(E) snd_usbmidi_lib(E) snd_hda_core(E) ttm(E) snd_rawmidi(E) gf128mul(E) snd_pcsp(E) snd_seq_device(E) ppdev(E) snd_hwdep(E) crypto_simd(E) drm_kms_helper(E) videodev(E) cryptd(E) cfg80211(E) snd_pcm(E) videobuf2_common(E) rapl(E) snd_timer(E) sp5100_tco(E) drm(E) mc(E) wmi_bmof(E) k10temp(E) tpm_crb(E) rfkill(E) snd(E) sha3_generic(E) video(E) soundcore(E) jitterentropy_rng(E) sha512_ssse3(E) tpm_tis(E) ccp(E)
> > [Fri May 16 12:05:41 2025]  sha512_generic(E) tpm_tis_core(E) joydev(E) drbg(E) tpm(E) ansi_cprng(E) libaescfb(E) ecdh_generic(E) ecc(E) parport_pc(E) parport(E) sg(E) rng_core(E) evdev(E) msr(E) loop(E) fuse(E) efi_pstore(E) dm_mod(E) configfs(E) ip_tables(E) x_tables(E) autofs4(E) ext4(E) crc16(E) mbcache(E) jbd2(E) btrfs(E) blake2b_generic(E) efivarfs(E) raid10(E) raid456(E) async_raid6_recov(E) async_memcpy(E) async_pq(E) async_xor(E) async_tx(E) xor(E) raid6_pq(E) raid1(E) raid0(E) md_mod(E) hid_generic(E) usbhid(E) hid(E) sd_mod(E) ahci(E) xhci_pci(E) libahci(E) nvme(E) xhci_hcd(E) libata(E) i2c_piix4(E) i2c_smbus(E) r8169(E) nvme_core(E) realtek(E) usbcore(E) scsi_mod(E) scsi_common(E) usb_common(E) wmi(E) gpio_amdpt(E) gpio_generic(E) button(E) [last unloaded: vfio_pci_core(E)]
> > [Fri May 16 12:05:41 2025] CR2: ffff9d28984c3000
> > [Fri May 16 12:05:41 2025] ---[ end trace 0000000000000000 ]---
> > [Fri May 16 12:05:41 2025] RIP: 0010:release_module_tags+0x103/0x1b0
> > [Fri May 16 12:05:41 2025] Code: e3 4c 03 64 24 08 4c 39 e3 72 bf 8b 0d 86 02 f6 00 31 ed 31 c0 eb 17 48 63 f0 49 8b 54 24 20 83 c0 01 48 8b 34 f5 40 e6 d4 95 <48> 03 2c 16 89 ce 48 63 d0 48 c7 c7 c0 fd d4 95 e8 78 3f f5 ff 8b
> > [Fri May 16 12:05:41 2025] RSP: 0018:ffffb2eb40bdfe00 EFLAGS: 00010202
> > [Fri May 16 12:05:41 2025] RAX: 0000000000000001 RBX: ffffffffc04609d7 RCX: 0000000000000008
> > [Fri May 16 12:05:41 2025] RDX: 0000000000000000 RSI: ffff9d28984c3000 RDI: ffffffff95d4fdc0
> > [Fri May 16 12:05:41 2025] RBP: 0000000000000000 R08: ffff9d25035aed08 R09: 000000000000fcf8
> > [Fri May 16 12:05:41 2025] R10: ffffb2eb40bdfe00 R11: 0000000000000001 R12: ffffffffc04609b0
> > [Fri May 16 12:05:41 2025] R13: ffffffffc19ef000 R14: ffff9d2502ab6870 R15: 0000000000000000
> > [Fri May 16 12:05:41 2025] FS:  00007fe9ee051040(0000) GS:ffff9d28984c3000(0000) knlGS:0000000000000000
> > [Fri May 16 12:05:41 2025] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > [Fri May 16 12:05:41 2025] CR2: ffff9d28984c3000 CR3: 000000011278d000 CR4: 0000000000350ef0
> > [Fri May 16 12:05:41 2025] note: modprobe[35898] exited with irqs disabled
> > [Fri May 16 12:05:41 2025] note: modprobe[35898] exited with preempt_count 1
> >
> >
> >
> >
> > Thanks
> > David
> >


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: BUG: unable to handle page fault for address
  2025-05-17  0:11   ` Suren Baghdasaryan
@ 2025-05-17  7:02     ` David Wang
  2025-05-17 16:39       ` Suren Baghdasaryan
  0 siblings, 1 reply; 12+ messages in thread
From: David Wang @ 2025-05-17  7:02 UTC (permalink / raw)
  To: Suren Baghdasaryan; +Cc: kent.overstreet, linux-mm, linux-kernel


At 2025-05-17 08:11:24, "Suren Baghdasaryan" <surenb@google.com> wrote:
>On Fri, May 16, 2025 at 10:03 AM Suren Baghdasaryan <surenb@google.com> wrote:
>>
>> Hi David,
>>
>> On Fri, May 16, 2025 at 6:13 AM David Wang <00107082@163.com> wrote:
>> >
>> > Hi,
>> >
>> > I caught a page fault when I was changing my nvidia driver:
>> > (This happens randomly, I can reproduce it with about 1/3 probability)
>> >
>> > [Fri May 16 12:05:41 2025] BUG: unable to handle page fault for address: ffff9d28984c3000
>> > [Fri May 16 12:05:41 2025] #PF: supervisor read access in kernel mode
>> > [Fri May 16 12:05:41 2025] #PF: error_code(0x0000) - not-present page
>> > ...
>> > [Fri May 16 12:05:41 2025] RIP: 0010:release_module_tags+0x103/0x1b0
>> > ...
>> > [Fri May 16 12:05:41 2025] Call Trace:
>> > [Fri May 16 12:05:41 2025]  <TASK>
>> > [Fri May 16 12:05:41 2025]  codetag_unload_module+0x135/0x160
>> > [Fri May 16 12:05:41 2025]  free_module+0x19/0x1a0
>> > ...
>> > (full kernel logs are pasted at the end.)
>> >
>> > Using a image with DEBUG_INFO, the calltrack parses as:
>> >
>> > RIP: 0010:release_module_tags (./include/linux/alloc_tag.h:134 lib/alloc_tag.c:352 lib/alloc_tag.c:573)
>> > [Fri May 16 12:05:41 2025] codetag_unload_module (lib/codetag.c:355)
>> > [Fri May 16 12:05:41 2025] free_module (kernel/module/main.c:1305)
>> > [Fri May 16 12:05:41 2025] __do_sys_delete_module (kernel/module/main.c:795)
>> >
>> > The offending lines in my codebase:
>> >         126 static inline struct alloc_tag_counters alloc_tag_read(struct alloc_tag *tag)
>> >         127 {
>> >         ...
>> >         131
>> >         132         for_each_possible_cpu(cpu) {
>> >         133                 counter = per_cpu_ptr(tag->counters, cpu);
>> > >>>>    134                 v.bytes += counter->bytes;   <--------------here
>> >         135                 v.calls += counter->calls;
>> >
>> >
>> > Nvidia drivers are out-tree... there could be some strange behavior in it causes this.. but,
>> > when I check the code, I got concerned about lifecycle of tag->counters.
>> > Based on following defination:
>> >         108 #define DEFINE_ALLOC_TAG(_alloc_tag)                                            \
>> >         109         static DEFINE_PER_CPU(struct alloc_tag_counters, _alloc_tag_cntr);      \
>> >         110         static struct alloc_tag _alloc_tag __used __aligned(8)                  \
>> >         111         __section(ALLOC_TAG_SECTION_NAME) = {                                   \
>> >         112                 .ct = CODE_TAG_INIT,                                            \
>> >         113                 .counters = &_alloc_tag_cntr };
>> >         114
>> > _alloc_tag_cntr is the data referenced by tag->counters, but they are in different section,
>> > and alloc_tag only prepare storage for section ALLOC_TAG_SECTION_NAME.
>> > right?
>> > Then what happens to those ".data..percpu" section when module is unloaded?
>> > Is it safe to keep using those ".data..percpu" section after module unloaded,
>> > or even during module is unloading?
>>
>> Yes, I think you are right, free_module() calls percpu_modfree() which
>> would free the per-cpu memory allocated for the module. Before
>> 0db6f8d7820a ("alloc_tag: load module tags into separate contiguous
>> memory") we would not unload the module if there were tags which were
>> still in use. After that change we load module tags into separate
>> memory, so I expected this to work but due to this external reference
>> it indeed should lead to UAF.
>> I think the simplest way to fix this would be to bypass
>> percpu_modfree() inside free_module() when there are module tags still
>> referenced, store mod->percpu inside alloc_tag_module_section and free
>> it inside clean_unused_module_areas_locked() once we know the counters
>> are not used anymore. I'll take a stab at it and will send a patch for
>> testing today.
>
>Ok, I went with another implementation, instead dynamically allocating
>percpu memory for modules at the module load time. This has another
>advantage of not needing extra PERCPU_MODULE_RESERVE currently
>required for memory allocation tagging to work.
>David, the patch is posted at [1]. Please give it a try and let me
>know if the fix works for you.
>Thanks,
>Suren.
>
>[1] https://lore.kernel.org/all/20250517000739.5930-1-surenb@google.com/
>
>
>> Thanks,
>> Suren.
>>

Hi, the patch does fix my issue.
I now have another similar concern about modules RO data, 
The codetag defined as
 24 struct codetag {
 25         unsigned int flags; /* used in later patches */
 26         unsigned int lineno;
 27         const char *modname;
 28         const char *function;
 29         const char *filename;
 30 } __aligned(8);

Those modname/function/filename would refer to RO data section, right?
When module unloaded, its RO data section would be released at some point.
My question is is it safe to use RO data during module unload? because these
lines seems to access those data:

+			pr_info("%s:%u module %s func:%s has %llu allocated at module unload\n",
+				tag->ct.filename, tag->ct.lineno, tag->ct.modname,
+				tag->ct.function, counter.bytes);



Thanks
David



^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: BUG: unable to handle page fault for address
  2025-05-17  7:02     ` David Wang
@ 2025-05-17 16:39       ` Suren Baghdasaryan
  2025-05-17 16:51         ` David Wang
  0 siblings, 1 reply; 12+ messages in thread
From: Suren Baghdasaryan @ 2025-05-17 16:39 UTC (permalink / raw)
  To: David Wang; +Cc: kent.overstreet, linux-mm, linux-kernel

On Sat, May 17, 2025 at 12:02 AM David Wang <00107082@163.com> wrote:
>
>
> At 2025-05-17 08:11:24, "Suren Baghdasaryan" <surenb@google.com> wrote:
> >On Fri, May 16, 2025 at 10:03 AM Suren Baghdasaryan <surenb@google.com> wrote:
> >>
> >> Hi David,
> >>
> >> On Fri, May 16, 2025 at 6:13 AM David Wang <00107082@163.com> wrote:
> >> >
> >> > Hi,
> >> >
> >> > I caught a page fault when I was changing my nvidia driver:
> >> > (This happens randomly, I can reproduce it with about 1/3 probability)
> >> >
> >> > [Fri May 16 12:05:41 2025] BUG: unable to handle page fault for address: ffff9d28984c3000
> >> > [Fri May 16 12:05:41 2025] #PF: supervisor read access in kernel mode
> >> > [Fri May 16 12:05:41 2025] #PF: error_code(0x0000) - not-present page
> >> > ...
> >> > [Fri May 16 12:05:41 2025] RIP: 0010:release_module_tags+0x103/0x1b0
> >> > ...
> >> > [Fri May 16 12:05:41 2025] Call Trace:
> >> > [Fri May 16 12:05:41 2025]  <TASK>
> >> > [Fri May 16 12:05:41 2025]  codetag_unload_module+0x135/0x160
> >> > [Fri May 16 12:05:41 2025]  free_module+0x19/0x1a0
> >> > ...
> >> > (full kernel logs are pasted at the end.)
> >> >
> >> > Using a image with DEBUG_INFO, the calltrack parses as:
> >> >
> >> > RIP: 0010:release_module_tags (./include/linux/alloc_tag.h:134 lib/alloc_tag.c:352 lib/alloc_tag.c:573)
> >> > [Fri May 16 12:05:41 2025] codetag_unload_module (lib/codetag.c:355)
> >> > [Fri May 16 12:05:41 2025] free_module (kernel/module/main.c:1305)
> >> > [Fri May 16 12:05:41 2025] __do_sys_delete_module (kernel/module/main.c:795)
> >> >
> >> > The offending lines in my codebase:
> >> >         126 static inline struct alloc_tag_counters alloc_tag_read(struct alloc_tag *tag)
> >> >         127 {
> >> >         ...
> >> >         131
> >> >         132         for_each_possible_cpu(cpu) {
> >> >         133                 counter = per_cpu_ptr(tag->counters, cpu);
> >> > >>>>    134                 v.bytes += counter->bytes;   <--------------here
> >> >         135                 v.calls += counter->calls;
> >> >
> >> >
> >> > Nvidia drivers are out-tree... there could be some strange behavior in it causes this.. but,
> >> > when I check the code, I got concerned about lifecycle of tag->counters.
> >> > Based on following defination:
> >> >         108 #define DEFINE_ALLOC_TAG(_alloc_tag)                                            \
> >> >         109         static DEFINE_PER_CPU(struct alloc_tag_counters, _alloc_tag_cntr);      \
> >> >         110         static struct alloc_tag _alloc_tag __used __aligned(8)                  \
> >> >         111         __section(ALLOC_TAG_SECTION_NAME) = {                                   \
> >> >         112                 .ct = CODE_TAG_INIT,                                            \
> >> >         113                 .counters = &_alloc_tag_cntr };
> >> >         114
> >> > _alloc_tag_cntr is the data referenced by tag->counters, but they are in different section,
> >> > and alloc_tag only prepare storage for section ALLOC_TAG_SECTION_NAME.
> >> > right?
> >> > Then what happens to those ".data..percpu" section when module is unloaded?
> >> > Is it safe to keep using those ".data..percpu" section after module unloaded,
> >> > or even during module is unloading?
> >>
> >> Yes, I think you are right, free_module() calls percpu_modfree() which
> >> would free the per-cpu memory allocated for the module. Before
> >> 0db6f8d7820a ("alloc_tag: load module tags into separate contiguous
> >> memory") we would not unload the module if there were tags which were
> >> still in use. After that change we load module tags into separate
> >> memory, so I expected this to work but due to this external reference
> >> it indeed should lead to UAF.
> >> I think the simplest way to fix this would be to bypass
> >> percpu_modfree() inside free_module() when there are module tags still
> >> referenced, store mod->percpu inside alloc_tag_module_section and free
> >> it inside clean_unused_module_areas_locked() once we know the counters
> >> are not used anymore. I'll take a stab at it and will send a patch for
> >> testing today.
> >
> >Ok, I went with another implementation, instead dynamically allocating
> >percpu memory for modules at the module load time. This has another
> >advantage of not needing extra PERCPU_MODULE_RESERVE currently
> >required for memory allocation tagging to work.
> >David, the patch is posted at [1]. Please give it a try and let me
> >know if the fix works for you.
> >Thanks,
> >Suren.
> >
> >[1] https://lore.kernel.org/all/20250517000739.5930-1-surenb@google.com/
> >
> >
> >> Thanks,
> >> Suren.
> >>
>
> Hi, the patch does fix my issue.
> I now have another similar concern about modules RO data,
> The codetag defined as
>  24 struct codetag {
>  25         unsigned int flags; /* used in later patches */
>  26         unsigned int lineno;
>  27         const char *modname;
>  28         const char *function;
>  29         const char *filename;
>  30 } __aligned(8);
>
> Those modname/function/filename would refer to RO data section, right?
> When module unloaded, its RO data section would be released at some point.
> My question is is it safe to use RO data during module unload? because these
> lines seems to access those data:
>
> +                       pr_info("%s:%u module %s func:%s has %llu allocated at module unload\n",
> +                               tag->ct.filename, tag->ct.lineno, tag->ct.modname,
> +                               tag->ct.function, counter.bytes);

These lines are called from release_module_tags() using this call chain:

delete_module()
  free_module()
    codetag_unload_module()
      release_module_tags()

and codetag_unload_module() is called at the very beginning of
free_module(), when no other module memory has been freed yet. So,
from what I understand, this should be safe.
After we unload the module these pointers inside the tags will be
dandling but we should not be using them anymore since we do not
report unloaded modules. Do you see some usage that I missed?

>
>
>
> Thanks
> David
>
>


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: BUG: unable to handle page fault for address
  2025-05-17 16:39       ` Suren Baghdasaryan
@ 2025-05-17 16:51         ` David Wang
  2025-05-17 17:29           ` Suren Baghdasaryan
  0 siblings, 1 reply; 12+ messages in thread
From: David Wang @ 2025-05-17 16:51 UTC (permalink / raw)
  To: Suren Baghdasaryan; +Cc: kent.overstreet, linux-mm, linux-kernel


At 2025-05-18 00:39:30, "Suren Baghdasaryan" <surenb@google.com> wrote:
>On Sat, May 17, 2025 at 12:02 AM David Wang <00107082@163.com> wrote:
>>
>>
>> At 2025-05-17 08:11:24, "Suren Baghdasaryan" <surenb@google.com> wrote:
>> >On Fri, May 16, 2025 at 10:03 AM Suren Baghdasaryan <surenb@google.com> wrote:
>> >>
>> >> Hi David,
>> >>
>> >> On Fri, May 16, 2025 at 6:13 AM David Wang <00107082@163.com> wrote:
>> >> >
>> >> > Hi,
>> >> >
>> >> > I caught a page fault when I was changing my nvidia driver:
>> >> > (This happens randomly, I can reproduce it with about 1/3 probability)
>> >> >
>> >> > [Fri May 16 12:05:41 2025] BUG: unable to handle page fault for address: ffff9d28984c3000
>> >> > [Fri May 16 12:05:41 2025] #PF: supervisor read access in kernel mode
>> >> > [Fri May 16 12:05:41 2025] #PF: error_code(0x0000) - not-present page
>> >> > ...
>> >> > [Fri May 16 12:05:41 2025] RIP: 0010:release_module_tags+0x103/0x1b0
>> >> > ...
>> >> > [Fri May 16 12:05:41 2025] Call Trace:
>> >> > [Fri May 16 12:05:41 2025]  <TASK>
>> >> > [Fri May 16 12:05:41 2025]  codetag_unload_module+0x135/0x160
>> >> > [Fri May 16 12:05:41 2025]  free_module+0x19/0x1a0
>> >> > ...
>> >> > (full kernel logs are pasted at the end.)
>> >> >
>> >> > Using a image with DEBUG_INFO, the calltrack parses as:
>> >> >
>> >> > RIP: 0010:release_module_tags (./include/linux/alloc_tag.h:134 lib/alloc_tag.c:352 lib/alloc_tag.c:573)
>> >> > [Fri May 16 12:05:41 2025] codetag_unload_module (lib/codetag.c:355)
>> >> > [Fri May 16 12:05:41 2025] free_module (kernel/module/main.c:1305)
>> >> > [Fri May 16 12:05:41 2025] __do_sys_delete_module (kernel/module/main.c:795)
>> >> >
>> >> > The offending lines in my codebase:
>> >> >         126 static inline struct alloc_tag_counters alloc_tag_read(struct alloc_tag *tag)
>> >> >         127 {
>> >> >         ...
>> >> >         131
>> >> >         132         for_each_possible_cpu(cpu) {
>> >> >         133                 counter = per_cpu_ptr(tag->counters, cpu);
>> >> > >>>>    134                 v.bytes += counter->bytes;   <--------------here
>> >> >         135                 v.calls += counter->calls;
>> >> >
>> >> >
>> >> > Nvidia drivers are out-tree... there could be some strange behavior in it causes this.. but,
>> >> > when I check the code, I got concerned about lifecycle of tag->counters.
>> >> > Based on following defination:
>> >> >         108 #define DEFINE_ALLOC_TAG(_alloc_tag)                                            \
>> >> >         109         static DEFINE_PER_CPU(struct alloc_tag_counters, _alloc_tag_cntr);      \
>> >> >         110         static struct alloc_tag _alloc_tag __used __aligned(8)                  \
>> >> >         111         __section(ALLOC_TAG_SECTION_NAME) = {                                   \
>> >> >         112                 .ct = CODE_TAG_INIT,                                            \
>> >> >         113                 .counters = &_alloc_tag_cntr };
>> >> >         114
>> >> > _alloc_tag_cntr is the data referenced by tag->counters, but they are in different section,
>> >> > and alloc_tag only prepare storage for section ALLOC_TAG_SECTION_NAME.
>> >> > right?
>> >> > Then what happens to those ".data..percpu" section when module is unloaded?
>> >> > Is it safe to keep using those ".data..percpu" section after module unloaded,
>> >> > or even during module is unloading?
>> >>
>> >> Yes, I think you are right, free_module() calls percpu_modfree() which
>> >> would free the per-cpu memory allocated for the module. Before
>> >> 0db6f8d7820a ("alloc_tag: load module tags into separate contiguous
>> >> memory") we would not unload the module if there were tags which were
>> >> still in use. After that change we load module tags into separate
>> >> memory, so I expected this to work but due to this external reference
>> >> it indeed should lead to UAF.
>> >> I think the simplest way to fix this would be to bypass
>> >> percpu_modfree() inside free_module() when there are module tags still
>> >> referenced, store mod->percpu inside alloc_tag_module_section and free
>> >> it inside clean_unused_module_areas_locked() once we know the counters
>> >> are not used anymore. I'll take a stab at it and will send a patch for
>> >> testing today.
>> >
>> >Ok, I went with another implementation, instead dynamically allocating
>> >percpu memory for modules at the module load time. This has another
>> >advantage of not needing extra PERCPU_MODULE_RESERVE currently
>> >required for memory allocation tagging to work.
>> >David, the patch is posted at [1]. Please give it a try and let me
>> >know if the fix works for you.
>> >Thanks,
>> >Suren.
>> >
>> >[1] https://lore.kernel.org/all/20250517000739.5930-1-surenb@google.com/
>> >
>> >
>> >> Thanks,
>> >> Suren.
>> >>
>>
>> Hi, the patch does fix my issue.
>> I now have another similar concern about modules RO data,
>> The codetag defined as
>>  24 struct codetag {
>>  25         unsigned int flags; /* used in later patches */
>>  26         unsigned int lineno;
>>  27         const char *modname;
>>  28         const char *function;
>>  29         const char *filename;
>>  30 } __aligned(8);
>>
>> Those modname/function/filename would refer to RO data section, right?
>> When module unloaded, its RO data section would be released at some point.
>> My question is is it safe to use RO data during module unload? because these
>> lines seems to access those data:
>>
>> +                       pr_info("%s:%u module %s func:%s has %llu allocated at module unload\n",
>> +                               tag->ct.filename, tag->ct.lineno, tag->ct.modname,
>> +                               tag->ct.function, counter.bytes);
>
>These lines are called from release_module_tags() using this call chain:
>
>delete_module()
>  free_module()
>    codetag_unload_module()
>      release_module_tags()
>
>and codetag_unload_module() is called at the very beginning of
>free_module(), when no other module memory has been freed yet. So,
>from what I understand, this should be safe.
>After we unload the module these pointers inside the tags will be
>dandling but we should not be using them anymore since we do not
>report unloaded modules. Do you see some usage that I missed?

Why data..percpu. is different. The page fault error caught when I reinstall nvidia drivers is also
raised from release_module_tags().

Is data..percpu. section released earlier? 

Thanks
David



^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: BUG: unable to handle page fault for address
  2025-05-17 16:51         ` David Wang
@ 2025-05-17 17:29           ` Suren Baghdasaryan
  2025-05-17 17:57             ` David Wang
  0 siblings, 1 reply; 12+ messages in thread
From: Suren Baghdasaryan @ 2025-05-17 17:29 UTC (permalink / raw)
  To: David Wang; +Cc: kent.overstreet, linux-mm, linux-kernel

On Sat, May 17, 2025 at 9:51 AM David Wang <00107082@163.com> wrote:
>
>
> At 2025-05-18 00:39:30, "Suren Baghdasaryan" <surenb@google.com> wrote:
> >On Sat, May 17, 2025 at 12:02 AM David Wang <00107082@163.com> wrote:
> >>
> >>
> >> At 2025-05-17 08:11:24, "Suren Baghdasaryan" <surenb@google.com> wrote:
> >> >On Fri, May 16, 2025 at 10:03 AM Suren Baghdasaryan <surenb@google.com> wrote:
> >> >>
> >> >> Hi David,
> >> >>
> >> >> On Fri, May 16, 2025 at 6:13 AM David Wang <00107082@163.com> wrote:
> >> >> >
> >> >> > Hi,
> >> >> >
> >> >> > I caught a page fault when I was changing my nvidia driver:
> >> >> > (This happens randomly, I can reproduce it with about 1/3 probability)
> >> >> >
> >> >> > [Fri May 16 12:05:41 2025] BUG: unable to handle page fault for address: ffff9d28984c3000
> >> >> > [Fri May 16 12:05:41 2025] #PF: supervisor read access in kernel mode
> >> >> > [Fri May 16 12:05:41 2025] #PF: error_code(0x0000) - not-present page
> >> >> > ...
> >> >> > [Fri May 16 12:05:41 2025] RIP: 0010:release_module_tags+0x103/0x1b0
> >> >> > ...
> >> >> > [Fri May 16 12:05:41 2025] Call Trace:
> >> >> > [Fri May 16 12:05:41 2025]  <TASK>
> >> >> > [Fri May 16 12:05:41 2025]  codetag_unload_module+0x135/0x160
> >> >> > [Fri May 16 12:05:41 2025]  free_module+0x19/0x1a0
> >> >> > ...
> >> >> > (full kernel logs are pasted at the end.)
> >> >> >
> >> >> > Using a image with DEBUG_INFO, the calltrack parses as:
> >> >> >
> >> >> > RIP: 0010:release_module_tags (./include/linux/alloc_tag.h:134 lib/alloc_tag.c:352 lib/alloc_tag.c:573)
> >> >> > [Fri May 16 12:05:41 2025] codetag_unload_module (lib/codetag.c:355)
> >> >> > [Fri May 16 12:05:41 2025] free_module (kernel/module/main.c:1305)
> >> >> > [Fri May 16 12:05:41 2025] __do_sys_delete_module (kernel/module/main.c:795)
> >> >> >
> >> >> > The offending lines in my codebase:
> >> >> >         126 static inline struct alloc_tag_counters alloc_tag_read(struct alloc_tag *tag)
> >> >> >         127 {
> >> >> >         ...
> >> >> >         131
> >> >> >         132         for_each_possible_cpu(cpu) {
> >> >> >         133                 counter = per_cpu_ptr(tag->counters, cpu);
> >> >> > >>>>    134                 v.bytes += counter->bytes;   <--------------here
> >> >> >         135                 v.calls += counter->calls;
> >> >> >
> >> >> >
> >> >> > Nvidia drivers are out-tree... there could be some strange behavior in it causes this.. but,
> >> >> > when I check the code, I got concerned about lifecycle of tag->counters.
> >> >> > Based on following defination:
> >> >> >         108 #define DEFINE_ALLOC_TAG(_alloc_tag)                                            \
> >> >> >         109         static DEFINE_PER_CPU(struct alloc_tag_counters, _alloc_tag_cntr);      \
> >> >> >         110         static struct alloc_tag _alloc_tag __used __aligned(8)                  \
> >> >> >         111         __section(ALLOC_TAG_SECTION_NAME) = {                                   \
> >> >> >         112                 .ct = CODE_TAG_INIT,                                            \
> >> >> >         113                 .counters = &_alloc_tag_cntr };
> >> >> >         114
> >> >> > _alloc_tag_cntr is the data referenced by tag->counters, but they are in different section,
> >> >> > and alloc_tag only prepare storage for section ALLOC_TAG_SECTION_NAME.
> >> >> > right?
> >> >> > Then what happens to those ".data..percpu" section when module is unloaded?
> >> >> > Is it safe to keep using those ".data..percpu" section after module unloaded,
> >> >> > or even during module is unloading?
> >> >>
> >> >> Yes, I think you are right, free_module() calls percpu_modfree() which
> >> >> would free the per-cpu memory allocated for the module. Before
> >> >> 0db6f8d7820a ("alloc_tag: load module tags into separate contiguous
> >> >> memory") we would not unload the module if there were tags which were
> >> >> still in use. After that change we load module tags into separate
> >> >> memory, so I expected this to work but due to this external reference
> >> >> it indeed should lead to UAF.
> >> >> I think the simplest way to fix this would be to bypass
> >> >> percpu_modfree() inside free_module() when there are module tags still
> >> >> referenced, store mod->percpu inside alloc_tag_module_section and free
> >> >> it inside clean_unused_module_areas_locked() once we know the counters
> >> >> are not used anymore. I'll take a stab at it and will send a patch for
> >> >> testing today.
> >> >
> >> >Ok, I went with another implementation, instead dynamically allocating
> >> >percpu memory for modules at the module load time. This has another
> >> >advantage of not needing extra PERCPU_MODULE_RESERVE currently
> >> >required for memory allocation tagging to work.
> >> >David, the patch is posted at [1]. Please give it a try and let me
> >> >know if the fix works for you.
> >> >Thanks,
> >> >Suren.
> >> >
> >> >[1] https://lore.kernel.org/all/20250517000739.5930-1-surenb@google.com/
> >> >
> >> >
> >> >> Thanks,
> >> >> Suren.
> >> >>
> >>
> >> Hi, the patch does fix my issue.
> >> I now have another similar concern about modules RO data,
> >> The codetag defined as
> >>  24 struct codetag {
> >>  25         unsigned int flags; /* used in later patches */
> >>  26         unsigned int lineno;
> >>  27         const char *modname;
> >>  28         const char *function;
> >>  29         const char *filename;
> >>  30 } __aligned(8);
> >>
> >> Those modname/function/filename would refer to RO data section, right?
> >> When module unloaded, its RO data section would be released at some point.
> >> My question is is it safe to use RO data during module unload? because these
> >> lines seems to access those data:
> >>
> >> +                       pr_info("%s:%u module %s func:%s has %llu allocated at module unload\n",
> >> +                               tag->ct.filename, tag->ct.lineno, tag->ct.modname,
> >> +                               tag->ct.function, counter.bytes);
> >
> >These lines are called from release_module_tags() using this call chain:
> >
> >delete_module()
> >  free_module()
> >    codetag_unload_module()
> >      release_module_tags()
> >
> >and codetag_unload_module() is called at the very beginning of
> >free_module(), when no other module memory has been freed yet. So,
> >from what I understand, this should be safe.
> >After we unload the module these pointers inside the tags will be
> >dandling but we should not be using them anymore since we do not
> >report unloaded modules. Do you see some usage that I missed?
>
> Why data..percpu. is different. The page fault error caught when I reinstall nvidia drivers is also
> raised from release_module_tags().
>
> Is data..percpu. section released earlier?

No but counters are different because the allocations that still
reference these tags from unloaded modules will be decrementing them
when they are freed. That's where UAF is coming from. So, the counters
might be accessed after the module is unloaded, other fields should
not.

>
> Thanks
> David
>
>


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: BUG: unable to handle page fault for address
  2025-05-17 17:29           ` Suren Baghdasaryan
@ 2025-05-17 17:57             ` David Wang
  2025-05-17 19:02               ` Suren Baghdasaryan
  0 siblings, 1 reply; 12+ messages in thread
From: David Wang @ 2025-05-17 17:57 UTC (permalink / raw)
  To: Suren Baghdasaryan; +Cc: kent.overstreet, linux-mm, linux-kernel



At 2025-05-18 01:29:35, "Suren Baghdasaryan" <surenb@google.com> wrote:
>On Sat, May 17, 2025 at 9:51 AM David Wang <00107082@163.com> wrote:
>>
>>
>> At 2025-05-18 00:39:30, "Suren Baghdasaryan" <surenb@google.com> wrote:
>> >On Sat, May 17, 2025 at 12:02 AM David Wang <00107082@163.com> wrote:
>> >>
>> >>
>> >> At 2025-05-17 08:11:24, "Suren Baghdasaryan" <surenb@google.com> wrote:
>> >> >On Fri, May 16, 2025 at 10:03 AM Suren Baghdasaryan <surenb@google.com> wrote:
>> >> >>
>> >> >> Hi David,
>> >> >>
>> >> >> On Fri, May 16, 2025 at 6:13 AM David Wang <00107082@163.com> wrote:
>> >> >> >
>> >> >> > Hi,
>> >> >> >
>> >> >> > I caught a page fault when I was changing my nvidia driver:
>> >> >> > (This happens randomly, I can reproduce it with about 1/3 probability)
>> >> >> >
>> >> >> > [Fri May 16 12:05:41 2025] BUG: unable to handle page fault for address: ffff9d28984c3000
>> >> >> > [Fri May 16 12:05:41 2025] #PF: supervisor read access in kernel mode
>> >> >> > [Fri May 16 12:05:41 2025] #PF: error_code(0x0000) - not-present page
>> >> >> > ...
>> >> >> > [Fri May 16 12:05:41 2025] RIP: 0010:release_module_tags+0x103/0x1b0
>> >> >> > ...
>> >> >> > [Fri May 16 12:05:41 2025] Call Trace:
>> >> >> > [Fri May 16 12:05:41 2025]  <TASK>
>> >> >> > [Fri May 16 12:05:41 2025]  codetag_unload_module+0x135/0x160
>> >> >> > [Fri May 16 12:05:41 2025]  free_module+0x19/0x1a0
>> >> >> > ...
>> >> >> > (full kernel logs are pasted at the end.)
>> >> >> >
>> >> >> > Using a image with DEBUG_INFO, the calltrack parses as:
>> >> >> >
>> >> >> > RIP: 0010:release_module_tags (./include/linux/alloc_tag.h:134 lib/alloc_tag.c:352 lib/alloc_tag.c:573)
>> >> >> > [Fri May 16 12:05:41 2025] codetag_unload_module (lib/codetag.c:355)
>> >> >> > [Fri May 16 12:05:41 2025] free_module (kernel/module/main.c:1305)
>> >> >> > [Fri May 16 12:05:41 2025] __do_sys_delete_module (kernel/module/main.c:795)
>> >> >> >
>> >> >> > The offending lines in my codebase:
>> >> >> >         126 static inline struct alloc_tag_counters alloc_tag_read(struct alloc_tag *tag)
>> >> >> >         127 {
>> >> >> >         ...
>> >> >> >         131
>> >> >> >         132         for_each_possible_cpu(cpu) {
>> >> >> >         133                 counter = per_cpu_ptr(tag->counters, cpu);
>> >> >> > >>>>    134                 v.bytes += counter->bytes;   <--------------here
>> >> >> >         135                 v.calls += counter->calls;
>> >> >> >
>> >> >> >
>> >> >> > Nvidia drivers are out-tree... there could be some strange behavior in it causes this.. but,
>> >> >> > when I check the code, I got concerned about lifecycle of tag->counters.
>> >> >> > Based on following defination:
>> >> >> >         108 #define DEFINE_ALLOC_TAG(_alloc_tag)                                            \
>> >> >> >         109         static DEFINE_PER_CPU(struct alloc_tag_counters, _alloc_tag_cntr);      \
>> >> >> >         110         static struct alloc_tag _alloc_tag __used __aligned(8)                  \
>> >> >> >         111         __section(ALLOC_TAG_SECTION_NAME) = {                                   \
>> >> >> >         112                 .ct = CODE_TAG_INIT,                                            \
>> >> >> >         113                 .counters = &_alloc_tag_cntr };
>> >> >> >         114
>> >> >> > _alloc_tag_cntr is the data referenced by tag->counters, but they are in different section,
>> >> >> > and alloc_tag only prepare storage for section ALLOC_TAG_SECTION_NAME.
>> >> >> > right?
>> >> >> > Then what happens to those ".data..percpu" section when module is unloaded?
>> >> >> > Is it safe to keep using those ".data..percpu" section after module unloaded,
>> >> >> > or even during module is unloading?
>> >> >>
>> >> >> Yes, I think you are right, free_module() calls percpu_modfree() which
>> >> >> would free the per-cpu memory allocated for the module. Before
>> >> >> 0db6f8d7820a ("alloc_tag: load module tags into separate contiguous
>> >> >> memory") we would not unload the module if there were tags which were
>> >> >> still in use. After that change we load module tags into separate
>> >> >> memory, so I expected this to work but due to this external reference
>> >> >> it indeed should lead to UAF.
>> >> >> I think the simplest way to fix this would be to bypass
>> >> >> percpu_modfree() inside free_module() when there are module tags still
>> >> >> referenced, store mod->percpu inside alloc_tag_module_section and free
>> >> >> it inside clean_unused_module_areas_locked() once we know the counters
>> >> >> are not used anymore. I'll take a stab at it and will send a patch for
>> >> >> testing today.
>> >> >
>> >> >Ok, I went with another implementation, instead dynamically allocating
>> >> >percpu memory for modules at the module load time. This has another
>> >> >advantage of not needing extra PERCPU_MODULE_RESERVE currently
>> >> >required for memory allocation tagging to work.
>> >> >David, the patch is posted at [1]. Please give it a try and let me
>> >> >know if the fix works for you.
>> >> >Thanks,
>> >> >Suren.
>> >> >
>> >> >[1] https://lore.kernel.org/all/20250517000739.5930-1-surenb@google.com/
>> >> >
>> >> >
>> >> >> Thanks,
>> >> >> Suren.
>> >> >>
>> >>
>> >> Hi, the patch does fix my issue.
>> >> I now have another similar concern about modules RO data,
>> >> The codetag defined as
>> >>  24 struct codetag {
>> >>  25         unsigned int flags; /* used in later patches */
>> >>  26         unsigned int lineno;
>> >>  27         const char *modname;
>> >>  28         const char *function;
>> >>  29         const char *filename;
>> >>  30 } __aligned(8);
>> >>
>> >> Those modname/function/filename would refer to RO data section, right?
>> >> When module unloaded, its RO data section would be released at some point.
>> >> My question is is it safe to use RO data during module unload? because these
>> >> lines seems to access those data:
>> >>
>> >> +                       pr_info("%s:%u module %s func:%s has %llu allocated at module unload\n",
>> >> +                               tag->ct.filename, tag->ct.lineno, tag->ct.modname,
>> >> +                               tag->ct.function, counter.bytes);
>> >
>> >These lines are called from release_module_tags() using this call chain:
>> >
>> >delete_module()
>> >  free_module()
>> >    codetag_unload_module()
>> >      release_module_tags()
>> >
>> >and codetag_unload_module() is called at the very beginning of
>> >free_module(), when no other module memory has been freed yet. So,
>> >from what I understand, this should be safe.
>> >After we unload the module these pointers inside the tags will be
>> >dandling but we should not be using them anymore since we do not
>> >report unloaded modules. Do you see some usage that I missed?
>>
>> Why data..percpu. is different. The page fault error caught when I reinstall nvidia drivers is also
>> raised from release_module_tags().
>>
>> Is data..percpu. section released earlier?
>
>No but counters are different because the allocations that still
>reference these tags from unloaded modules will be decrementing them
>when they are freed. That's where UAF is coming from. So, the counters
>might be accessed after the module is unloaded, other fields should
>not.
>

I do notice there are places where counters are referenced "after" free_module, but the logs I attached
happened "during" free_module():

 [Fri May 16 12:05:41 2025] BUG: unable to handle page fault for address: ffff9d28984c3000
 [Fri May 16 12:05:41 2025] #PF: supervisor read access in kernel mode
[Fri May 16 12:05:41 2025] #PF: error_code(0x0000) - not-present page
...
 [Fri May 16 12:05:41 2025] RIP: 0010:release_module_tags+0x103/0x1b0
...
 [Fri May 16 12:05:41 2025] Call Trace:
 [Fri May 16 12:05:41 2025]  <TASK>
 [Fri May 16 12:05:41 2025]  codetag_unload_module+0x135/0x160
[Fri May 16 12:05:41 2025]  free_module+0x19/0x1a0

The call chain is the same as you mentioned above.
This part confusing me a lot. The log indicates during free_module,  ..data..percpu access failed,
I doubted those section would be released that quick.  

The only guess left  I feel reasonable is the ..data_percpu was not paged in at all,  probably  because no access to it,
and when the section is accessed during free_module, somehow the access is refused. Just guessing.....

Or,  do I missing something here?


>>
>> Thanks
>> David
>>
>>

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: BUG: unable to handle page fault for address
  2025-05-17 17:57             ` David Wang
@ 2025-05-17 19:02               ` Suren Baghdasaryan
  2025-05-18  1:45                 ` David Wang
  0 siblings, 1 reply; 12+ messages in thread
From: Suren Baghdasaryan @ 2025-05-17 19:02 UTC (permalink / raw)
  To: David Wang; +Cc: kent.overstreet, linux-mm, linux-kernel

On Sat, May 17, 2025 at 10:57 AM David Wang <00107082@163.com> wrote:
>
>
>
> At 2025-05-18 01:29:35, "Suren Baghdasaryan" <surenb@google.com> wrote:
> >On Sat, May 17, 2025 at 9:51 AM David Wang <00107082@163.com> wrote:
> >>
> >>
> >> At 2025-05-18 00:39:30, "Suren Baghdasaryan" <surenb@google.com> wrote:
> >> >On Sat, May 17, 2025 at 12:02 AM David Wang <00107082@163.com> wrote:
> >> >>
> >> >>
> >> >> At 2025-05-17 08:11:24, "Suren Baghdasaryan" <surenb@google.com> wrote:
> >> >> >On Fri, May 16, 2025 at 10:03 AM Suren Baghdasaryan <surenb@google.com> wrote:
> >> >> >>
> >> >> >> Hi David,
> >> >> >>
> >> >> >> On Fri, May 16, 2025 at 6:13 AM David Wang <00107082@163.com> wrote:
> >> >> >> >
> >> >> >> > Hi,
> >> >> >> >
> >> >> >> > I caught a page fault when I was changing my nvidia driver:
> >> >> >> > (This happens randomly, I can reproduce it with about 1/3 probability)
> >> >> >> >
> >> >> >> > [Fri May 16 12:05:41 2025] BUG: unable to handle page fault for address: ffff9d28984c3000
> >> >> >> > [Fri May 16 12:05:41 2025] #PF: supervisor read access in kernel mode
> >> >> >> > [Fri May 16 12:05:41 2025] #PF: error_code(0x0000) - not-present page
> >> >> >> > ...
> >> >> >> > [Fri May 16 12:05:41 2025] RIP: 0010:release_module_tags+0x103/0x1b0
> >> >> >> > ...
> >> >> >> > [Fri May 16 12:05:41 2025] Call Trace:
> >> >> >> > [Fri May 16 12:05:41 2025]  <TASK>
> >> >> >> > [Fri May 16 12:05:41 2025]  codetag_unload_module+0x135/0x160
> >> >> >> > [Fri May 16 12:05:41 2025]  free_module+0x19/0x1a0
> >> >> >> > ...
> >> >> >> > (full kernel logs are pasted at the end.)
> >> >> >> >
> >> >> >> > Using a image with DEBUG_INFO, the calltrack parses as:
> >> >> >> >
> >> >> >> > RIP: 0010:release_module_tags (./include/linux/alloc_tag.h:134 lib/alloc_tag.c:352 lib/alloc_tag.c:573)
> >> >> >> > [Fri May 16 12:05:41 2025] codetag_unload_module (lib/codetag.c:355)
> >> >> >> > [Fri May 16 12:05:41 2025] free_module (kernel/module/main.c:1305)
> >> >> >> > [Fri May 16 12:05:41 2025] __do_sys_delete_module (kernel/module/main.c:795)
> >> >> >> >
> >> >> >> > The offending lines in my codebase:
> >> >> >> >         126 static inline struct alloc_tag_counters alloc_tag_read(struct alloc_tag *tag)
> >> >> >> >         127 {
> >> >> >> >         ...
> >> >> >> >         131
> >> >> >> >         132         for_each_possible_cpu(cpu) {
> >> >> >> >         133                 counter = per_cpu_ptr(tag->counters, cpu);
> >> >> >> > >>>>    134                 v.bytes += counter->bytes;   <--------------here
> >> >> >> >         135                 v.calls += counter->calls;
> >> >> >> >
> >> >> >> >
> >> >> >> > Nvidia drivers are out-tree... there could be some strange behavior in it causes this.. but,
> >> >> >> > when I check the code, I got concerned about lifecycle of tag->counters.
> >> >> >> > Based on following defination:
> >> >> >> >         108 #define DEFINE_ALLOC_TAG(_alloc_tag)                                            \
> >> >> >> >         109         static DEFINE_PER_CPU(struct alloc_tag_counters, _alloc_tag_cntr);      \
> >> >> >> >         110         static struct alloc_tag _alloc_tag __used __aligned(8)                  \
> >> >> >> >         111         __section(ALLOC_TAG_SECTION_NAME) = {                                   \
> >> >> >> >         112                 .ct = CODE_TAG_INIT,                                            \
> >> >> >> >         113                 .counters = &_alloc_tag_cntr };
> >> >> >> >         114
> >> >> >> > _alloc_tag_cntr is the data referenced by tag->counters, but they are in different section,
> >> >> >> > and alloc_tag only prepare storage for section ALLOC_TAG_SECTION_NAME.
> >> >> >> > right?
> >> >> >> > Then what happens to those ".data..percpu" section when module is unloaded?
> >> >> >> > Is it safe to keep using those ".data..percpu" section after module unloaded,
> >> >> >> > or even during module is unloading?
> >> >> >>
> >> >> >> Yes, I think you are right, free_module() calls percpu_modfree() which
> >> >> >> would free the per-cpu memory allocated for the module. Before
> >> >> >> 0db6f8d7820a ("alloc_tag: load module tags into separate contiguous
> >> >> >> memory") we would not unload the module if there were tags which were
> >> >> >> still in use. After that change we load module tags into separate
> >> >> >> memory, so I expected this to work but due to this external reference
> >> >> >> it indeed should lead to UAF.
> >> >> >> I think the simplest way to fix this would be to bypass
> >> >> >> percpu_modfree() inside free_module() when there are module tags still
> >> >> >> referenced, store mod->percpu inside alloc_tag_module_section and free
> >> >> >> it inside clean_unused_module_areas_locked() once we know the counters
> >> >> >> are not used anymore. I'll take a stab at it and will send a patch for
> >> >> >> testing today.
> >> >> >
> >> >> >Ok, I went with another implementation, instead dynamically allocating
> >> >> >percpu memory for modules at the module load time. This has another
> >> >> >advantage of not needing extra PERCPU_MODULE_RESERVE currently
> >> >> >required for memory allocation tagging to work.
> >> >> >David, the patch is posted at [1]. Please give it a try and let me
> >> >> >know if the fix works for you.
> >> >> >Thanks,
> >> >> >Suren.
> >> >> >
> >> >> >[1] https://lore.kernel.org/all/20250517000739.5930-1-surenb@google.com/
> >> >> >
> >> >> >
> >> >> >> Thanks,
> >> >> >> Suren.
> >> >> >>
> >> >>
> >> >> Hi, the patch does fix my issue.
> >> >> I now have another similar concern about modules RO data,
> >> >> The codetag defined as
> >> >>  24 struct codetag {
> >> >>  25         unsigned int flags; /* used in later patches */
> >> >>  26         unsigned int lineno;
> >> >>  27         const char *modname;
> >> >>  28         const char *function;
> >> >>  29         const char *filename;
> >> >>  30 } __aligned(8);
> >> >>
> >> >> Those modname/function/filename would refer to RO data section, right?
> >> >> When module unloaded, its RO data section would be released at some point.
> >> >> My question is is it safe to use RO data during module unload? because these
> >> >> lines seems to access those data:
> >> >>
> >> >> +                       pr_info("%s:%u module %s func:%s has %llu allocated at module unload\n",
> >> >> +                               tag->ct.filename, tag->ct.lineno, tag->ct.modname,
> >> >> +                               tag->ct.function, counter.bytes);
> >> >
> >> >These lines are called from release_module_tags() using this call chain:
> >> >
> >> >delete_module()
> >> >  free_module()
> >> >    codetag_unload_module()
> >> >      release_module_tags()
> >> >
> >> >and codetag_unload_module() is called at the very beginning of
> >> >free_module(), when no other module memory has been freed yet. So,
> >> >from what I understand, this should be safe.
> >> >After we unload the module these pointers inside the tags will be
> >> >dandling but we should not be using them anymore since we do not
> >> >report unloaded modules. Do you see some usage that I missed?
> >>
> >> Why data..percpu. is different. The page fault error caught when I reinstall nvidia drivers is also
> >> raised from release_module_tags().
> >>
> >> Is data..percpu. section released earlier?
> >
> >No but counters are different because the allocations that still
> >reference these tags from unloaded modules will be decrementing them
> >when they are freed. That's where UAF is coming from. So, the counters
> >might be accessed after the module is unloaded, other fields should
> >not.
> >
>
> I do notice there are places where counters are referenced "after" free_module, but the logs I attached
> happened "during" free_module():
>
>  [Fri May 16 12:05:41 2025] BUG: unable to handle page fault for address: ffff9d28984c3000
>  [Fri May 16 12:05:41 2025] #PF: supervisor read access in kernel mode
> [Fri May 16 12:05:41 2025] #PF: error_code(0x0000) - not-present page
> ...
>  [Fri May 16 12:05:41 2025] RIP: 0010:release_module_tags+0x103/0x1b0
> ...
>  [Fri May 16 12:05:41 2025] Call Trace:
>  [Fri May 16 12:05:41 2025]  <TASK>
>  [Fri May 16 12:05:41 2025]  codetag_unload_module+0x135/0x160
> [Fri May 16 12:05:41 2025]  free_module+0x19/0x1a0
>
> The call chain is the same as you mentioned above.

Is this failure happening before or after my fix? With my fix, percpu
data should not be freed at all if tags are still used. Please
clarify.

> This part confusing me a lot. The log indicates during free_module,  ..data..percpu access failed,
> I doubted those section would be released that quick.
>
> The only guess left  I feel reasonable is the ..data_percpu was not paged in at all,  probably  because no access to it,
> and when the section is accessed during free_module, somehow the access is refused. Just guessing.....
>
> Or,  do I missing something here?
>
>
> >>
> >> Thanks
> >> David
> >>
> >>


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: BUG: unable to handle page fault for address
  2025-05-17 19:02               ` Suren Baghdasaryan
@ 2025-05-18  1:45                 ` David Wang
  2025-05-18  9:55                   ` David Wang
  0 siblings, 1 reply; 12+ messages in thread
From: David Wang @ 2025-05-18  1:45 UTC (permalink / raw)
  To: Suren Baghdasaryan; +Cc: kent.overstreet, linux-mm, linux-kernel


At 2025-05-18 03:02:37, "Suren Baghdasaryan" <surenb@google.com> wrote:
>On Sat, May 17, 2025 at 10:57 AM David Wang <00107082@163.com> wrote:
>>
>>
>>
>> At 2025-05-18 01:29:35, "Suren Baghdasaryan" <surenb@google.com> wrote:
>> >On Sat, May 17, 2025 at 9:51 AM David Wang <00107082@163.com> wrote:
>> >>
>> >>
>> >> At 2025-05-18 00:39:30, "Suren Baghdasaryan" <surenb@google.com> wrote:
>> >> >On Sat, May 17, 2025 at 12:02 AM David Wang <00107082@163.com> wrote:
>> >> >>
>> >> >>
>> >> >> At 2025-05-17 08:11:24, "Suren Baghdasaryan" <surenb@google.com> wrote:
>> >> >> >On Fri, May 16, 2025 at 10:03 AM Suren Baghdasaryan <surenb@google.com> wrote:
>> >> >> >>
>> >> >> >> Hi David,
>> >> >> >>
>> >> >> >> On Fri, May 16, 2025 at 6:13 AM David Wang <00107082@163.com> wrote:
>> >> >> >> >
>> >> >> >> > Hi,
>> >> >> >> >
>> >> >> >> > I caught a page fault when I was changing my nvidia driver:
>> >> >> >> > (This happens randomly, I can reproduce it with about 1/3 probability)
>> >> >> >> >
>> >> >> >> > [Fri May 16 12:05:41 2025] BUG: unable to handle page fault for address: ffff9d28984c3000
>> >> >> >> > [Fri May 16 12:05:41 2025] #PF: supervisor read access in kernel mode
>> >> >> >> > [Fri May 16 12:05:41 2025] #PF: error_code(0x0000) - not-present page
>> >> >> >> > ...
>> >> >> >> > [Fri May 16 12:05:41 2025] RIP: 0010:release_module_tags+0x103/0x1b0
>> >> >> >> > ...
>> >> >> >> > [Fri May 16 12:05:41 2025] Call Trace:
>> >> >> >> > [Fri May 16 12:05:41 2025]  <TASK>
>> >> >> >> > [Fri May 16 12:05:41 2025]  codetag_unload_module+0x135/0x160
>> >> >> >> > [Fri May 16 12:05:41 2025]  free_module+0x19/0x1a0
>> >> >> >> > ...
>> >> >> >> > (full kernel logs are pasted at the end.)
>> >> >> >> >
>> >> >> >> > Using a image with DEBUG_INFO, the calltrack parses as:
>> >> >> >> >
>> >> >> >> > RIP: 0010:release_module_tags (./include/linux/alloc_tag.h:134 lib/alloc_tag.c:352 lib/alloc_tag.c:573)
>> >> >> >> > [Fri May 16 12:05:41 2025] codetag_unload_module (lib/codetag.c:355)
>> >> >> >> > [Fri May 16 12:05:41 2025] free_module (kernel/module/main.c:1305)
>> >> >> >> > [Fri May 16 12:05:41 2025] __do_sys_delete_module (kernel/module/main.c:795)
>> >> >> >> >
>> >> >> >> > The offending lines in my codebase:
>> >> >> >> >         126 static inline struct alloc_tag_counters alloc_tag_read(struct alloc_tag *tag)
>> >> >> >> >         127 {
>> >> >> >> >         ...
>> >> >> >> >         131
>> >> >> >> >         132         for_each_possible_cpu(cpu) {
>> >> >> >> >         133                 counter = per_cpu_ptr(tag->counters, cpu);
>> >> >> >> > >>>>    134                 v.bytes += counter->bytes;   <--------------here
>> >> >> >> >         135                 v.calls += counter->calls;
>> >> >> >> >
>> >> >> >> >
>> >> >> >> > Nvidia drivers are out-tree... there could be some strange behavior in it causes this.. but,
>> >> >> >> > when I check the code, I got concerned about lifecycle of tag->counters.
>> >> >> >> > Based on following defination:
>> >> >> >> >         108 #define DEFINE_ALLOC_TAG(_alloc_tag)                                            \
>> >> >> >> >         109         static DEFINE_PER_CPU(struct alloc_tag_counters, _alloc_tag_cntr);      \
>> >> >> >> >         110         static struct alloc_tag _alloc_tag __used __aligned(8)                  \
>> >> >> >> >         111         __section(ALLOC_TAG_SECTION_NAME) = {                                   \
>> >> >> >> >         112                 .ct = CODE_TAG_INIT,                                            \
>> >> >> >> >         113                 .counters = &_alloc_tag_cntr };
>> >> >> >> >         114
>> >> >> >> > _alloc_tag_cntr is the data referenced by tag->counters, but they are in different section,
>> >> >> >> > and alloc_tag only prepare storage for section ALLOC_TAG_SECTION_NAME.
>> >> >> >> > right?
>> >> >> >> > Then what happens to those ".data..percpu" section when module is unloaded?
>> >> >> >> > Is it safe to keep using those ".data..percpu" section after module unloaded,
>> >> >> >> > or even during module is unloading?
>> >> >> >>
>> >> >> >> Yes, I think you are right, free_module() calls percpu_modfree() which
>> >> >> >> would free the per-cpu memory allocated for the module. Before
>> >> >> >> 0db6f8d7820a ("alloc_tag: load module tags into separate contiguous
>> >> >> >> memory") we would not unload the module if there were tags which were
>> >> >> >> still in use. After that change we load module tags into separate
>> >> >> >> memory, so I expected this to work but due to this external reference
>> >> >> >> it indeed should lead to UAF.
>> >> >> >> I think the simplest way to fix this would be to bypass
>> >> >> >> percpu_modfree() inside free_module() when there are module tags still
>> >> >> >> referenced, store mod->percpu inside alloc_tag_module_section and free
>> >> >> >> it inside clean_unused_module_areas_locked() once we know the counters
>> >> >> >> are not used anymore. I'll take a stab at it and will send a patch for
>> >> >> >> testing today.
>> >> >> >
>> >> >> >Ok, I went with another implementation, instead dynamically allocating
>> >> >> >percpu memory for modules at the module load time. This has another
>> >> >> >advantage of not needing extra PERCPU_MODULE_RESERVE currently
>> >> >> >required for memory allocation tagging to work.
>> >> >> >David, the patch is posted at [1]. Please give it a try and let me
>> >> >> >know if the fix works for you.
>> >> >> >Thanks,
>> >> >> >Suren.
>> >> >> >
>> >> >> >[1] https://lore.kernel.org/all/20250517000739.5930-1-surenb@google.com/
>> >> >> >
>> >> >> >
>> >> >> >> Thanks,
>> >> >> >> Suren.
>> >> >> >>
>> >> >>
>> >> >> Hi, the patch does fix my issue.
>> >> >> I now have another similar concern about modules RO data,
>> >> >> The codetag defined as
>> >> >>  24 struct codetag {
>> >> >>  25         unsigned int flags; /* used in later patches */
>> >> >>  26         unsigned int lineno;
>> >> >>  27         const char *modname;
>> >> >>  28         const char *function;
>> >> >>  29         const char *filename;
>> >> >>  30 } __aligned(8);
>> >> >>
>> >> >> Those modname/function/filename would refer to RO data section, right?
>> >> >> When module unloaded, its RO data section would be released at some point.
>> >> >> My question is is it safe to use RO data during module unload? because these
>> >> >> lines seems to access those data:
>> >> >>
>> >> >> +                       pr_info("%s:%u module %s func:%s has %llu allocated at module unload\n",
>> >> >> +                               tag->ct.filename, tag->ct.lineno, tag->ct.modname,
>> >> >> +                               tag->ct.function, counter.bytes);
>> >> >
>> >> >These lines are called from release_module_tags() using this call chain:
>> >> >
>> >> >delete_module()
>> >> >  free_module()
>> >> >    codetag_unload_module()
>> >> >      release_module_tags()
>> >> >
>> >> >and codetag_unload_module() is called at the very beginning of
>> >> >free_module(), when no other module memory has been freed yet. So,
>> >> >from what I understand, this should be safe.
>> >> >After we unload the module these pointers inside the tags will be
>> >> >dandling but we should not be using them anymore since we do not
>> >> >report unloaded modules. Do you see some usage that I missed?
>> >>
>> >> Why data..percpu. is different. The page fault error caught when I reinstall nvidia drivers is also
>> >> raised from release_module_tags().
>> >>
>> >> Is data..percpu. section released earlier?
>> >
>> >No but counters are different because the allocations that still
>> >reference these tags from unloaded modules will be decrementing them
>> >when they are freed. That's where UAF is coming from. So, the counters
>> >might be accessed after the module is unloaded, other fields should
>> >not.
>> >
>>
>> I do notice there are places where counters are referenced "after" free_module, but the logs I attached
>> happened "during" free_module():
>>
>>  [Fri May 16 12:05:41 2025] BUG: unable to handle page fault for address: ffff9d28984c3000
>>  [Fri May 16 12:05:41 2025] #PF: supervisor read access in kernel mode
>> [Fri May 16 12:05:41 2025] #PF: error_code(0x0000) - not-present page
>> ...
>>  [Fri May 16 12:05:41 2025] RIP: 0010:release_module_tags+0x103/0x1b0
>> ...
>>  [Fri May 16 12:05:41 2025] Call Trace:
>>  [Fri May 16 12:05:41 2025]  <TASK>
>>  [Fri May 16 12:05:41 2025]  codetag_unload_module+0x135/0x160
>> [Fri May 16 12:05:41 2025]  free_module+0x19/0x1a0
>>
>> The call chain is the same as you mentioned above. 
>
>Is this failure happening before or after my fix? With my fix, percpu
>data should not be freed at all if tags are still used. Please
>clarify.

It is before your fix.  Your patch does fix the issue.
  
In my reproduce procedure:
1. enter recovery mode
2. install nvidia driver 570.144, failed with Unknown symbol drm_client_setup
3. modprobe drm_client_lib
4. install nvidia driver 570.144
5. install nvidia driver 550.144.03
6. reboot and repeat from step 1

The error happened in step 4,  and the failure in step2 is crucial,  if 'modprobe drm_client_lib' at the beginning, no error could be observed.

There may be something off about how kernel handles data.percpu section.
Good thing is that It can be reproduced,  I can add debug messages to clear or confirm  suspicions, 
Any suggestion?


Thanks
David


>
>> This part confusing me a lot. The log indicates during free_module,  ..data..percpu access failed,
>> I doubted those section would be released that quick.
>>
>> The only guess left  I feel reasonable is the ..data_percpu was not paged in at all,  probably  because no access to it,
>> and when the section is accessed during free_module, somehow the access is refused. Just guessing.....
>>
>> Or,  do I missing something here?
>>
>>
>> >>
>> >> Thanks
>> >> David
>> >>
>> >>

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: BUG: unable to handle page fault for address
  2025-05-18  1:45                 ` David Wang
@ 2025-05-18  9:55                   ` David Wang
  2025-05-19 16:00                     ` Suren Baghdasaryan
  0 siblings, 1 reply; 12+ messages in thread
From: David Wang @ 2025-05-18  9:55 UTC (permalink / raw)
  To: Suren Baghdasaryan; +Cc: kent.overstreet, linux-mm, linux-kernel


>>>
>>> I do notice there are places where counters are referenced "after" free_module, but the logs I attached
>>> happened "during" free_module():
>>>
>>>  [Fri May 16 12:05:41 2025] BUG: unable to handle page fault for address: ffff9d28984c3000
>>>  [Fri May 16 12:05:41 2025] #PF: supervisor read access in kernel mode
>>> [Fri May 16 12:05:41 2025] #PF: error_code(0x0000) - not-present page
>>> ...
>>>  [Fri May 16 12:05:41 2025] RIP: 0010:release_module_tags+0x103/0x1b0
>>> ...
>>>  [Fri May 16 12:05:41 2025] Call Trace:
>>>  [Fri May 16 12:05:41 2025]  <TASK>
>>>  [Fri May 16 12:05:41 2025]  codetag_unload_module+0x135/0x160
>>> [Fri May 16 12:05:41 2025]  free_module+0x19/0x1a0
>>>
>>> The call chain is the same as you mentioned above. 
>>
>>Is this failure happening before or after my fix? With my fix, percpu
>>data should not be freed at all if tags are still used. Please
>>clarify.
>
>It is before your fix.  Your patch does fix the issue.
>  
>In my reproduce procedure:
>1. enter recovery mode
>2. install nvidia driver 570.144, failed with Unknown symbol drm_client_setup
>3. modprobe drm_client_lib
>4. install nvidia driver 570.144
>5. install nvidia driver 550.144.03
>6. reboot and repeat from step 1
>
>The error happened in step 4,  and the failure in step2 is crucial,  if 'modprobe drm_client_lib' at the beginning, no error could be observed.
>
>There may be something off about how kernel handles data.percpu section.
>Good thing is that It can be reproduced,  I can add debug messages to clear or confirm  suspicions, 
>Any suggestion?
>
>
>Thanks
>David
>
>
After poking around logging memory addresses, I think I finally understand what is happening here.

1. codetag_alloc_module_section alloc memory when loading module
2. module load failed, due to undefined symbol
3. codetag section memory not freed
4. module load, and module's address happens to reuse the address previous used
5. another codetag_alloc_module_section
6. percup section allocation and then relocation address changes made to codetag_alloc_module_section
7. unload module, when searching through maple tree, the code tag memory in step 1 is used, 
which has no relocation address populated at all.
8. page fault error, because tag->counters is 0

I use following changes to log the address, 


The offending address is 
--- a/lib/alloc_tag.c
+++ b/lib/alloc_tag.c
@@ -575,6 +575,11 @@ static void release_module_tags(struct module *mod, bool used)
        if (!used)
                goto release_area;
 
+       struct alloc_tag *ptag = (struct alloc_tag *)(module_tags.start_addr + mas.index);
+       pr_info("percpu 0: 0x%llx(0x%llx)\n",
+                       (long long)per_cpu_ptr(ptag->counters, 0),
+                       (long long)ptag->counters
+                       );


And got following:
[Sun May 18 16:25:47 2025] percpu 0: 0xffff8edb6ee41030(0xffffffffbc57e030)
[Sun May 18 16:25:47 2025] percpu 0: 0xffff8edb6ee410e0(0xffffffffbc57e0e0)
[Sun May 18 16:25:47 2025] percpu 0: 0xffff8edb6ee40fa0(0xffffffffbc57dfa0)
[Sun May 18 16:26:43 2025] percpu 0: 0xffff8edbb28c3000(0x0)   <------


I think, we spot two issues in this thread:

1. when module load failed after codetag section alloced, the memory would leak.
2. counters may needs reference even after module is unloaded.

#2 has already been addressed by your patch. I will send a simple patch to fix #1

(Feel so released to finally draw a conclusion, hope no silly mistakes here  :)


Thanks
David


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: BUG: unable to handle page fault for address
  2025-05-18  9:55                   ` David Wang
@ 2025-05-19 16:00                     ` Suren Baghdasaryan
  0 siblings, 0 replies; 12+ messages in thread
From: Suren Baghdasaryan @ 2025-05-19 16:00 UTC (permalink / raw)
  To: David Wang; +Cc: kent.overstreet, linux-mm, linux-kernel

On Sun, May 18, 2025 at 2:55 AM David Wang <00107082@163.com> wrote:
>
>
> >>>
> >>> I do notice there are places where counters are referenced "after" free_module, but the logs I attached
> >>> happened "during" free_module():
> >>>
> >>>  [Fri May 16 12:05:41 2025] BUG: unable to handle page fault for address: ffff9d28984c3000
> >>>  [Fri May 16 12:05:41 2025] #PF: supervisor read access in kernel mode
> >>> [Fri May 16 12:05:41 2025] #PF: error_code(0x0000) - not-present page
> >>> ...
> >>>  [Fri May 16 12:05:41 2025] RIP: 0010:release_module_tags+0x103/0x1b0
> >>> ...
> >>>  [Fri May 16 12:05:41 2025] Call Trace:
> >>>  [Fri May 16 12:05:41 2025]  <TASK>
> >>>  [Fri May 16 12:05:41 2025]  codetag_unload_module+0x135/0x160
> >>> [Fri May 16 12:05:41 2025]  free_module+0x19/0x1a0
> >>>
> >>> The call chain is the same as you mentioned above.
> >>
> >>Is this failure happening before or after my fix? With my fix, percpu
> >>data should not be freed at all if tags are still used. Please
> >>clarify.
> >
> >It is before your fix.  Your patch does fix the issue.
> >
> >In my reproduce procedure:
> >1. enter recovery mode
> >2. install nvidia driver 570.144, failed with Unknown symbol drm_client_setup
> >3. modprobe drm_client_lib
> >4. install nvidia driver 570.144
> >5. install nvidia driver 550.144.03
> >6. reboot and repeat from step 1
> >
> >The error happened in step 4,  and the failure in step2 is crucial,  if 'modprobe drm_client_lib' at the beginning, no error could be observed.
> >
> >There may be something off about how kernel handles data.percpu section.
> >Good thing is that It can be reproduced,  I can add debug messages to clear or confirm  suspicions,
> >Any suggestion?
> >
> >
> >Thanks
> >David
> >
> >
> After poking around logging memory addresses, I think I finally understand what is happening here.
>
> 1. codetag_alloc_module_section alloc memory when loading module
> 2. module load failed, due to undefined symbol
> 3. codetag section memory not freed
> 4. module load, and module's address happens to reuse the address previous used
> 5. another codetag_alloc_module_section
> 6. percup section allocation and then relocation address changes made to codetag_alloc_module_section
> 7. unload module, when searching through maple tree, the code tag memory in step 1 is used,
> which has no relocation address populated at all.
> 8. page fault error, because tag->counters is 0
>
> I use following changes to log the address,
>
>
> The offending address is
> --- a/lib/alloc_tag.c
> +++ b/lib/alloc_tag.c
> @@ -575,6 +575,11 @@ static void release_module_tags(struct module *mod, bool used)
>         if (!used)
>                 goto release_area;
>
> +       struct alloc_tag *ptag = (struct alloc_tag *)(module_tags.start_addr + mas.index);
> +       pr_info("percpu 0: 0x%llx(0x%llx)\n",
> +                       (long long)per_cpu_ptr(ptag->counters, 0),
> +                       (long long)ptag->counters
> +                       );
>
>
> And got following:
> [Sun May 18 16:25:47 2025] percpu 0: 0xffff8edb6ee41030(0xffffffffbc57e030)
> [Sun May 18 16:25:47 2025] percpu 0: 0xffff8edb6ee410e0(0xffffffffbc57e0e0)
> [Sun May 18 16:25:47 2025] percpu 0: 0xffff8edb6ee40fa0(0xffffffffbc57dfa0)
> [Sun May 18 16:26:43 2025] percpu 0: 0xffff8edbb28c3000(0x0)   <------
>
>
> I think, we spot two issues in this thread:
>
> 1. when module load failed after codetag section alloced, the memory would leak.
> 2. counters may needs reference even after module is unloaded.
>
> #2 has already been addressed by your patch. I will send a simple patch to fix #1
>
> (Feel so released to finally draw a conclusion, hope no silly mistakes here  :)

I see. So, layout_and_allocate() succeeds in allocating the codetag
memory but during a later failure we fail to free it. Makes sense and
your patch looks good to me. Thanks!

>
>
> Thanks
> David
>


^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2025-05-19 16:01 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-05-16 13:12 BUG: unable to handle page fault for address David Wang
2025-05-16 17:03 ` Suren Baghdasaryan
2025-05-17  0:11   ` Suren Baghdasaryan
2025-05-17  7:02     ` David Wang
2025-05-17 16:39       ` Suren Baghdasaryan
2025-05-17 16:51         ` David Wang
2025-05-17 17:29           ` Suren Baghdasaryan
2025-05-17 17:57             ` David Wang
2025-05-17 19:02               ` Suren Baghdasaryan
2025-05-18  1:45                 ` David Wang
2025-05-18  9:55                   ` David Wang
2025-05-19 16:00                     ` Suren Baghdasaryan

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).