All of lore.kernel.org
 help / color / mirror / Atom feed
* NULL pointer dereference in xenbus_thread->...
@ 2023-08-28 21:50 Marek Marczykowski-Górecki
  2023-10-22 14:14 ` Marek Marczykowski-Górecki
  0 siblings, 1 reply; 13+ messages in thread
From: Marek Marczykowski-Górecki @ 2023-08-28 21:50 UTC (permalink / raw)
  To: xen-devel

[-- Attachment #1: Type: text/plain, Size: 5037 bytes --]

Hi,

I've noticed in Qubes's CI failure like this:

[  871.271292] BUG: kernel NULL pointer dereference, address: 0000000000000000
[  871.275290] #PF: supervisor read access in kernel mode
[  871.277282] #PF: error_code(0x0000) - not-present page
[  871.279182] PGD 106fdb067 P4D 106fdb067 PUD 106fdc067 PMD 0 
[  871.281071] Oops: 0000 [#1] PREEMPT SMP NOPTI
[  871.282698] CPU: 1 PID: 28 Comm: xenbus Not tainted 6.1.43-1.qubes.fc37.x86_64 #1
[  871.285222] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.0-0-gd239552-rebuilt.opensuse.org 04/01/2014
[  871.288883] RIP: e030:__wake_up_common+0x4c/0x180
[  871.292838] Code: 24 0c 89 4c 24 08 4d 85 c9 74 0a 41 f6 01 04 0f 85 a3 00 00 00 48 8b 43 08 4c 8d 40 e8 48 83 c3 08 49 8d 40 18 48 39 c3 74 5b <49> 8b 40 18 31 ed 4c 8d 70 e8 45 8b 28 41 f6 c5 04 75 5f 49 8b 40
[  871.299776] RSP: e02b:ffffc900400f7e10 EFLAGS: 00010082
[  871.301656] RAX: 0000000000000000 RBX: ffff88810541ce98 RCX: 0000000000000000
[  871.304255] RDX: 0000000000000001 RSI: 0000000000000003 RDI: ffff88810541ce90
[  871.306714] RBP: ffffc900400f0280 R08: ffffffffffffffe8 R09: ffffc900400f7e68
[  871.309937] R10: 0000000000007ff0 R11: ffff888100ad3000 R12: ffffc900400f7e68
[  871.312326] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[  871.314647] FS:  0000000000000000(0000) GS:ffff88813ff00000(0000) knlGS:0000000000000000
[  871.317677] CS:  10000e030 DS: 0000 ES: 0000 CR0: 0000000080050033
[  871.319644] CR2: 0000000000000000 CR3: 00000001067fe000 CR4: 0000000000040660
[  871.321973] Call Trace:
[  871.322782]  <TASK>
[  871.323494]  ? show_trace_log_lvl+0x1d3/0x2ef
[  871.324901]  ? show_trace_log_lvl+0x1d3/0x2ef
[  871.326310]  ? show_trace_log_lvl+0x1d3/0x2ef
[  871.327721]  ? __wake_up_common_lock+0x82/0xd0
[  871.329147]  ? __die_body.cold+0x8/0xd
[  871.330378]  ? page_fault_oops+0x163/0x1a0
[  871.331691]  ? exc_page_fault+0x70/0x170
[  871.332946]  ? asm_exc_page_fault+0x22/0x30
[  871.334454]  ? __wake_up_common+0x4c/0x180
[  871.335777]  __wake_up_common_lock+0x82/0xd0
[  871.337183]  ? process_writes+0x240/0x240
[  871.338461]  process_msg+0x18e/0x2f0
[  871.339627]  xenbus_thread+0x165/0x1c0
[  871.340830]  ? cpuusage_read+0x10/0x10
[  871.342032]  kthread+0xe9/0x110
[  871.343317]  ? kthread_complete_and_exit+0x20/0x20
[  871.345020]  ret_from_fork+0x22/0x30
[  871.346239]  </TASK>
[  871.347060] Modules linked in: snd_hda_codec_generic ledtrig_audio snd_hda_intel snd_intel_dspcfg snd_intel_sdw_acpi snd_hda_codec snd_hda_core snd_hwdep snd_seq snd_seq_device joydev snd_pcm intel_rapl_msr ppdev intel_rapl_common snd_timer pcspkr e1000e snd soundcore i2c_piix4 parport_pc parport loop fuse xenfs dm_crypt crct10dif_pclmul crc32_pclmul crc32c_intel polyval_clmulni polyval_generic floppy ghash_clmulni_intel sha512_ssse3 serio_raw virtio_scsi virtio_console bochs xhci_pci xhci_pci_renesas xhci_hcd qemu_fw_cfg drm_vram_helper drm_ttm_helper ttm ata_generic pata_acpi xen_privcmd xen_pciback xen_blkback xen_gntalloc xen_gntdev xen_evtchn scsi_dh_rdac scsi_dh_emc scsi_dh_alua uinput dm_multipath
[  871.368892] CR2: 0000000000000000
[  871.370160] ---[ end trace 0000000000000000 ]---
[  871.371719] RIP: e030:__wake_up_common+0x4c/0x180
[  871.373273] Code: 24 0c 89 4c 24 08 4d 85 c9 74 0a 41 f6 01 04 0f 85 a3 00 00 00 48 8b 43 08 4c 8d 40 e8 48 83 c3 08 49 8d 40 18 48 39 c3 74 5b <49> 8b 40 18 31 ed 4c 8d 70 e8 45 8b 28 41 f6 c5 04 75 5f 49 8b 40
[  871.379866] RSP: e02b:ffffc900400f7e10 EFLAGS: 00010082
[  871.381689] RAX: 0000000000000000 RBX: ffff88810541ce98 RCX: 0000000000000000
[  871.383971] RDX: 0000000000000001 RSI: 0000000000000003 RDI: ffff88810541ce90
[  871.386235] RBP: ffffc900400f0280 R08: ffffffffffffffe8 R09: ffffc900400f7e68
[  871.388521] R10: 0000000000007ff0 R11: ffff888100ad3000 R12: ffffc900400f7e68
[  871.390789] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[  871.393101] FS:  0000000000000000(0000) GS:ffff88813ff00000(0000) knlGS:0000000000000000
[  871.395671] CS:  10000e030 DS: 0000 ES: 0000 CR0: 0000000080050033
[  871.397863] CR2: 0000000000000000 CR3: 00000001067fe000 CR4: 0000000000040660
[  871.400441] Kernel panic - not syncing: Fatal exception
[  871.402171] Kernel Offset: disabled
(XEN) Hardware Dom0 crashed: rebooting machine in 5 seconds.

It isn't the first time I see similar crash, but I can't really
reproduce it reliably. Restarted test usually passes.
Note this is Xen nested in KVM, so it could very well be some oddity
about nested virt, although looking at the stack trace, it's unlikely
and more likely some race condition hit only on slower system.

Unfortunately I don't have symbols for this kernel handy, but there is a
single wake_up() call in process_writes(), so it shouldn't be an issue.

Any ideas?

Full log at https://openqa.qubes-os.org/tests/80779/logfile?filename=serial0.txt

-- 
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: NULL pointer dereference in xenbus_thread->...
  2023-08-28 21:50 NULL pointer dereference in xenbus_thread-> Marek Marczykowski-Górecki
@ 2023-10-22 14:14 ` Marek Marczykowski-Górecki
  2024-03-25 16:17   ` Marek Marczykowski-Górecki
  0 siblings, 1 reply; 13+ messages in thread
From: Marek Marczykowski-Górecki @ 2023-10-22 14:14 UTC (permalink / raw)
  To: xen-devel

[-- Attachment #1: Type: text/plain, Size: 5566 bytes --]

On Mon, Aug 28, 2023 at 11:50:36PM +0200, Marek Marczykowski-Górecki wrote:
> Hi,
> 
> I've noticed in Qubes's CI failure like this:
> 
> [  871.271292] BUG: kernel NULL pointer dereference, address: 0000000000000000
> [  871.275290] #PF: supervisor read access in kernel mode
> [  871.277282] #PF: error_code(0x0000) - not-present page
> [  871.279182] PGD 106fdb067 P4D 106fdb067 PUD 106fdc067 PMD 0 
> [  871.281071] Oops: 0000 [#1] PREEMPT SMP NOPTI
> [  871.282698] CPU: 1 PID: 28 Comm: xenbus Not tainted 6.1.43-1.qubes.fc37.x86_64 #1
> [  871.285222] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.0-0-gd239552-rebuilt.opensuse.org 04/01/2014
> [  871.288883] RIP: e030:__wake_up_common+0x4c/0x180
> [  871.292838] Code: 24 0c 89 4c 24 08 4d 85 c9 74 0a 41 f6 01 04 0f 85 a3 00 00 00 48 8b 43 08 4c 8d 40 e8 48 83 c3 08 49 8d 40 18 48 39 c3 74 5b <49> 8b 40 18 31 ed 4c 8d 70 e8 45 8b 28 41 f6 c5 04 75 5f 49 8b 40
> [  871.299776] RSP: e02b:ffffc900400f7e10 EFLAGS: 00010082
> [  871.301656] RAX: 0000000000000000 RBX: ffff88810541ce98 RCX: 0000000000000000
> [  871.304255] RDX: 0000000000000001 RSI: 0000000000000003 RDI: ffff88810541ce90
> [  871.306714] RBP: ffffc900400f0280 R08: ffffffffffffffe8 R09: ffffc900400f7e68
> [  871.309937] R10: 0000000000007ff0 R11: ffff888100ad3000 R12: ffffc900400f7e68
> [  871.312326] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
> [  871.314647] FS:  0000000000000000(0000) GS:ffff88813ff00000(0000) knlGS:0000000000000000
> [  871.317677] CS:  10000e030 DS: 0000 ES: 0000 CR0: 0000000080050033
> [  871.319644] CR2: 0000000000000000 CR3: 00000001067fe000 CR4: 0000000000040660
> [  871.321973] Call Trace:
> [  871.322782]  <TASK>
> [  871.323494]  ? show_trace_log_lvl+0x1d3/0x2ef
> [  871.324901]  ? show_trace_log_lvl+0x1d3/0x2ef
> [  871.326310]  ? show_trace_log_lvl+0x1d3/0x2ef
> [  871.327721]  ? __wake_up_common_lock+0x82/0xd0
> [  871.329147]  ? __die_body.cold+0x8/0xd
> [  871.330378]  ? page_fault_oops+0x163/0x1a0
> [  871.331691]  ? exc_page_fault+0x70/0x170
> [  871.332946]  ? asm_exc_page_fault+0x22/0x30
> [  871.334454]  ? __wake_up_common+0x4c/0x180
> [  871.335777]  __wake_up_common_lock+0x82/0xd0
> [  871.337183]  ? process_writes+0x240/0x240
> [  871.338461]  process_msg+0x18e/0x2f0
> [  871.339627]  xenbus_thread+0x165/0x1c0
> [  871.340830]  ? cpuusage_read+0x10/0x10
> [  871.342032]  kthread+0xe9/0x110
> [  871.343317]  ? kthread_complete_and_exit+0x20/0x20
> [  871.345020]  ret_from_fork+0x22/0x30
> [  871.346239]  </TASK>
> [  871.347060] Modules linked in: snd_hda_codec_generic ledtrig_audio snd_hda_intel snd_intel_dspcfg snd_intel_sdw_acpi snd_hda_codec snd_hda_core snd_hwdep snd_seq snd_seq_device joydev snd_pcm intel_rapl_msr ppdev intel_rapl_common snd_timer pcspkr e1000e snd soundcore i2c_piix4 parport_pc parport loop fuse xenfs dm_crypt crct10dif_pclmul crc32_pclmul crc32c_intel polyval_clmulni polyval_generic floppy ghash_clmulni_intel sha512_ssse3 serio_raw virtio_scsi virtio_console bochs xhci_pci xhci_pci_renesas xhci_hcd qemu_fw_cfg drm_vram_helper drm_ttm_helper ttm ata_generic pata_acpi xen_privcmd xen_pciback xen_blkback xen_gntalloc xen_gntdev xen_evtchn scsi_dh_rdac scsi_dh_emc scsi_dh_alua uinput dm_multipath
> [  871.368892] CR2: 0000000000000000
> [  871.370160] ---[ end trace 0000000000000000 ]---
> [  871.371719] RIP: e030:__wake_up_common+0x4c/0x180
> [  871.373273] Code: 24 0c 89 4c 24 08 4d 85 c9 74 0a 41 f6 01 04 0f 85 a3 00 00 00 48 8b 43 08 4c 8d 40 e8 48 83 c3 08 49 8d 40 18 48 39 c3 74 5b <49> 8b 40 18 31 ed 4c 8d 70 e8 45 8b 28 41 f6 c5 04 75 5f 49 8b 40
> [  871.379866] RSP: e02b:ffffc900400f7e10 EFLAGS: 00010082
> [  871.381689] RAX: 0000000000000000 RBX: ffff88810541ce98 RCX: 0000000000000000
> [  871.383971] RDX: 0000000000000001 RSI: 0000000000000003 RDI: ffff88810541ce90
> [  871.386235] RBP: ffffc900400f0280 R08: ffffffffffffffe8 R09: ffffc900400f7e68
> [  871.388521] R10: 0000000000007ff0 R11: ffff888100ad3000 R12: ffffc900400f7e68
> [  871.390789] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
> [  871.393101] FS:  0000000000000000(0000) GS:ffff88813ff00000(0000) knlGS:0000000000000000
> [  871.395671] CS:  10000e030 DS: 0000 ES: 0000 CR0: 0000000080050033
> [  871.397863] CR2: 0000000000000000 CR3: 00000001067fe000 CR4: 0000000000040660
> [  871.400441] Kernel panic - not syncing: Fatal exception
> [  871.402171] Kernel Offset: disabled
> (XEN) Hardware Dom0 crashed: rebooting machine in 5 seconds.
> 
> It isn't the first time I see similar crash, but I can't really
> reproduce it reliably. Restarted test usually passes.
> Note this is Xen nested in KVM, so it could very well be some oddity
> about nested virt, although looking at the stack trace, it's unlikely
> and more likely some race condition hit only on slower system.

Recently I've got the same crash on a real system in domU too. And also
on nested on newer kernel 6.1.57 (here it happened in dom0). So, this is
still an issue and affects not only nested case :/

> Unfortunately I don't have symbols for this kernel handy, but there is a
> single wake_up() call in process_writes(), so it shouldn't be an issue.
> 
> Any ideas?
> 
> Full log at https://openqa.qubes-os.org/tests/80779/logfile?filename=serial0.txt

More links at https://github.com/QubesOS/qubes-issues/issues/8638,
including more recent stack trace.


-- 
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: NULL pointer dereference in xenbus_thread->...
  2023-10-22 14:14 ` Marek Marczykowski-Górecki
@ 2024-03-25 16:17   ` Marek Marczykowski-Górecki
  2024-03-26 11:00     ` Julien Grall
  0 siblings, 1 reply; 13+ messages in thread
From: Marek Marczykowski-Górecki @ 2024-03-25 16:17 UTC (permalink / raw)
  To: xen-devel

[-- Attachment #1: Type: text/plain, Size: 5902 bytes --]

On Sun, Oct 22, 2023 at 04:14:30PM +0200, Marek Marczykowski-Górecki wrote:
> On Mon, Aug 28, 2023 at 11:50:36PM +0200, Marek Marczykowski-Górecki wrote:
> > Hi,
> > 
> > I've noticed in Qubes's CI failure like this:
> > 
> > [  871.271292] BUG: kernel NULL pointer dereference, address: 0000000000000000
> > [  871.275290] #PF: supervisor read access in kernel mode
> > [  871.277282] #PF: error_code(0x0000) - not-present page
> > [  871.279182] PGD 106fdb067 P4D 106fdb067 PUD 106fdc067 PMD 0 
> > [  871.281071] Oops: 0000 [#1] PREEMPT SMP NOPTI
> > [  871.282698] CPU: 1 PID: 28 Comm: xenbus Not tainted 6.1.43-1.qubes.fc37.x86_64 #1
> > [  871.285222] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.0-0-gd239552-rebuilt.opensuse.org 04/01/2014
> > [  871.288883] RIP: e030:__wake_up_common+0x4c/0x180
> > [  871.292838] Code: 24 0c 89 4c 24 08 4d 85 c9 74 0a 41 f6 01 04 0f 85 a3 00 00 00 48 8b 43 08 4c 8d 40 e8 48 83 c3 08 49 8d 40 18 48 39 c3 74 5b <49> 8b 40 18 31 ed 4c 8d 70 e8 45 8b 28 41 f6 c5 04 75 5f 49 8b 40
> > [  871.299776] RSP: e02b:ffffc900400f7e10 EFLAGS: 00010082
> > [  871.301656] RAX: 0000000000000000 RBX: ffff88810541ce98 RCX: 0000000000000000
> > [  871.304255] RDX: 0000000000000001 RSI: 0000000000000003 RDI: ffff88810541ce90
> > [  871.306714] RBP: ffffc900400f0280 R08: ffffffffffffffe8 R09: ffffc900400f7e68
> > [  871.309937] R10: 0000000000007ff0 R11: ffff888100ad3000 R12: ffffc900400f7e68
> > [  871.312326] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
> > [  871.314647] FS:  0000000000000000(0000) GS:ffff88813ff00000(0000) knlGS:0000000000000000
> > [  871.317677] CS:  10000e030 DS: 0000 ES: 0000 CR0: 0000000080050033
> > [  871.319644] CR2: 0000000000000000 CR3: 00000001067fe000 CR4: 0000000000040660
> > [  871.321973] Call Trace:
> > [  871.322782]  <TASK>
> > [  871.323494]  ? show_trace_log_lvl+0x1d3/0x2ef
> > [  871.324901]  ? show_trace_log_lvl+0x1d3/0x2ef
> > [  871.326310]  ? show_trace_log_lvl+0x1d3/0x2ef
> > [  871.327721]  ? __wake_up_common_lock+0x82/0xd0
> > [  871.329147]  ? __die_body.cold+0x8/0xd
> > [  871.330378]  ? page_fault_oops+0x163/0x1a0
> > [  871.331691]  ? exc_page_fault+0x70/0x170
> > [  871.332946]  ? asm_exc_page_fault+0x22/0x30
> > [  871.334454]  ? __wake_up_common+0x4c/0x180
> > [  871.335777]  __wake_up_common_lock+0x82/0xd0
> > [  871.337183]  ? process_writes+0x240/0x240
> > [  871.338461]  process_msg+0x18e/0x2f0
> > [  871.339627]  xenbus_thread+0x165/0x1c0
> > [  871.340830]  ? cpuusage_read+0x10/0x10
> > [  871.342032]  kthread+0xe9/0x110
> > [  871.343317]  ? kthread_complete_and_exit+0x20/0x20
> > [  871.345020]  ret_from_fork+0x22/0x30
> > [  871.346239]  </TASK>
> > [  871.347060] Modules linked in: snd_hda_codec_generic ledtrig_audio snd_hda_intel snd_intel_dspcfg snd_intel_sdw_acpi snd_hda_codec snd_hda_core snd_hwdep snd_seq snd_seq_device joydev snd_pcm intel_rapl_msr ppdev intel_rapl_common snd_timer pcspkr e1000e snd soundcore i2c_piix4 parport_pc parport loop fuse xenfs dm_crypt crct10dif_pclmul crc32_pclmul crc32c_intel polyval_clmulni polyval_generic floppy ghash_clmulni_intel sha512_ssse3 serio_raw virtio_scsi virtio_console bochs xhci_pci xhci_pci_renesas xhci_hcd qemu_fw_cfg drm_vram_helper drm_ttm_helper ttm ata_generic pata_acpi xen_privcmd xen_pciback xen_blkback xen_gntalloc xen_gntdev xen_evtchn scsi_dh_rdac scsi_dh_emc scsi_dh_alua uinput dm_multipath
> > [  871.368892] CR2: 0000000000000000
> > [  871.370160] ---[ end trace 0000000000000000 ]---
> > [  871.371719] RIP: e030:__wake_up_common+0x4c/0x180
> > [  871.373273] Code: 24 0c 89 4c 24 08 4d 85 c9 74 0a 41 f6 01 04 0f 85 a3 00 00 00 48 8b 43 08 4c 8d 40 e8 48 83 c3 08 49 8d 40 18 48 39 c3 74 5b <49> 8b 40 18 31 ed 4c 8d 70 e8 45 8b 28 41 f6 c5 04 75 5f 49 8b 40
> > [  871.379866] RSP: e02b:ffffc900400f7e10 EFLAGS: 00010082
> > [  871.381689] RAX: 0000000000000000 RBX: ffff88810541ce98 RCX: 0000000000000000
> > [  871.383971] RDX: 0000000000000001 RSI: 0000000000000003 RDI: ffff88810541ce90
> > [  871.386235] RBP: ffffc900400f0280 R08: ffffffffffffffe8 R09: ffffc900400f7e68
> > [  871.388521] R10: 0000000000007ff0 R11: ffff888100ad3000 R12: ffffc900400f7e68
> > [  871.390789] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
> > [  871.393101] FS:  0000000000000000(0000) GS:ffff88813ff00000(0000) knlGS:0000000000000000
> > [  871.395671] CS:  10000e030 DS: 0000 ES: 0000 CR0: 0000000080050033
> > [  871.397863] CR2: 0000000000000000 CR3: 00000001067fe000 CR4: 0000000000040660
> > [  871.400441] Kernel panic - not syncing: Fatal exception
> > [  871.402171] Kernel Offset: disabled
> > (XEN) Hardware Dom0 crashed: rebooting machine in 5 seconds.
> > 
> > It isn't the first time I see similar crash, but I can't really
> > reproduce it reliably. Restarted test usually passes.
> > Note this is Xen nested in KVM, so it could very well be some oddity
> > about nested virt, although looking at the stack trace, it's unlikely
> > and more likely some race condition hit only on slower system.
> 
> Recently I've got the same crash on a real system in domU too. And also
> on nested on newer kernel 6.1.57 (here it happened in dom0). So, this is
> still an issue and affects not only nested case :/
> 
> > Unfortunately I don't have symbols for this kernel handy, but there is a
> > single wake_up() call in process_writes(), so it shouldn't be an issue.
> > 
> > Any ideas?
> > 
> > Full log at https://openqa.qubes-os.org/tests/80779/logfile?filename=serial0.txt
> 
> More links at https://github.com/QubesOS/qubes-issues/issues/8638,
> including more recent stack trace.

Happens on 6.1.75 too (new stack trace I've added to the issue above,
but it's pretty similar).

-- 
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: NULL pointer dereference in xenbus_thread->...
  2024-03-25 16:17   ` Marek Marczykowski-Górecki
@ 2024-03-26 11:00     ` Julien Grall
  2024-05-31 22:48       ` Marek Marczykowski-Górecki
  0 siblings, 1 reply; 13+ messages in thread
From: Julien Grall @ 2024-03-26 11:00 UTC (permalink / raw)
  To: Marek Marczykowski-Górecki, xen-devel; +Cc: Juergen Gross

Hi Marek,

+Juergen for visibility

When sending a bug report, I would suggest to CC relevant people as 
otherwise it can get lost (not may people monitors Xen devel if they are 
not CCed).

Cheers,

On 25/03/2024 16:17, Marek Marczykowski-Górecki wrote:
> On Sun, Oct 22, 2023 at 04:14:30PM +0200, Marek Marczykowski-Górecki wrote:
>> On Mon, Aug 28, 2023 at 11:50:36PM +0200, Marek Marczykowski-Górecki wrote:
>>> Hi,
>>>
>>> I've noticed in Qubes's CI failure like this:
>>>
>>> [  871.271292] BUG: kernel NULL pointer dereference, address: 0000000000000000
>>> [  871.275290] #PF: supervisor read access in kernel mode
>>> [  871.277282] #PF: error_code(0x0000) - not-present page
>>> [  871.279182] PGD 106fdb067 P4D 106fdb067 PUD 106fdc067 PMD 0
>>> [  871.281071] Oops: 0000 [#1] PREEMPT SMP NOPTI
>>> [  871.282698] CPU: 1 PID: 28 Comm: xenbus Not tainted 6.1.43-1.qubes.fc37.x86_64 #1
>>> [  871.285222] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.0-0-gd239552-rebuilt.opensuse.org 04/01/2014
>>> [  871.288883] RIP: e030:__wake_up_common+0x4c/0x180
>>> [  871.292838] Code: 24 0c 89 4c 24 08 4d 85 c9 74 0a 41 f6 01 04 0f 85 a3 00 00 00 48 8b 43 08 4c 8d 40 e8 48 83 c3 08 49 8d 40 18 48 39 c3 74 5b <49> 8b 40 18 31 ed 4c 8d 70 e8 45 8b 28 41 f6 c5 04 75 5f 49 8b 40
>>> [  871.299776] RSP: e02b:ffffc900400f7e10 EFLAGS: 00010082
>>> [  871.301656] RAX: 0000000000000000 RBX: ffff88810541ce98 RCX: 0000000000000000
>>> [  871.304255] RDX: 0000000000000001 RSI: 0000000000000003 RDI: ffff88810541ce90
>>> [  871.306714] RBP: ffffc900400f0280 R08: ffffffffffffffe8 R09: ffffc900400f7e68
>>> [  871.309937] R10: 0000000000007ff0 R11: ffff888100ad3000 R12: ffffc900400f7e68
>>> [  871.312326] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
>>> [  871.314647] FS:  0000000000000000(0000) GS:ffff88813ff00000(0000) knlGS:0000000000000000
>>> [  871.317677] CS:  10000e030 DS: 0000 ES: 0000 CR0: 0000000080050033
>>> [  871.319644] CR2: 0000000000000000 CR3: 00000001067fe000 CR4: 0000000000040660
>>> [  871.321973] Call Trace:
>>> [  871.322782]  <TASK>
>>> [  871.323494]  ? show_trace_log_lvl+0x1d3/0x2ef
>>> [  871.324901]  ? show_trace_log_lvl+0x1d3/0x2ef
>>> [  871.326310]  ? show_trace_log_lvl+0x1d3/0x2ef
>>> [  871.327721]  ? __wake_up_common_lock+0x82/0xd0
>>> [  871.329147]  ? __die_body.cold+0x8/0xd
>>> [  871.330378]  ? page_fault_oops+0x163/0x1a0
>>> [  871.331691]  ? exc_page_fault+0x70/0x170
>>> [  871.332946]  ? asm_exc_page_fault+0x22/0x30
>>> [  871.334454]  ? __wake_up_common+0x4c/0x180
>>> [  871.335777]  __wake_up_common_lock+0x82/0xd0
>>> [  871.337183]  ? process_writes+0x240/0x240
>>> [  871.338461]  process_msg+0x18e/0x2f0
>>> [  871.339627]  xenbus_thread+0x165/0x1c0
>>> [  871.340830]  ? cpuusage_read+0x10/0x10
>>> [  871.342032]  kthread+0xe9/0x110
>>> [  871.343317]  ? kthread_complete_and_exit+0x20/0x20
>>> [  871.345020]  ret_from_fork+0x22/0x30
>>> [  871.346239]  </TASK>
>>> [  871.347060] Modules linked in: snd_hda_codec_generic ledtrig_audio snd_hda_intel snd_intel_dspcfg snd_intel_sdw_acpi snd_hda_codec snd_hda_core snd_hwdep snd_seq snd_seq_device joydev snd_pcm intel_rapl_msr ppdev intel_rapl_common snd_timer pcspkr e1000e snd soundcore i2c_piix4 parport_pc parport loop fuse xenfs dm_crypt crct10dif_pclmul crc32_pclmul crc32c_intel polyval_clmulni polyval_generic floppy ghash_clmulni_intel sha512_ssse3 serio_raw virtio_scsi virtio_console bochs xhci_pci xhci_pci_renesas xhci_hcd qemu_fw_cfg drm_vram_helper drm_ttm_helper ttm ata_generic pata_acpi xen_privcmd xen_pciback xen_blkback xen_gntalloc xen_gntdev xen_evtchn scsi_dh_rdac scsi_dh_emc scsi_dh_alua uinput dm_multipath
>>> [  871.368892] CR2: 0000000000000000
>>> [  871.370160] ---[ end trace 0000000000000000 ]---
>>> [  871.371719] RIP: e030:__wake_up_common+0x4c/0x180
>>> [  871.373273] Code: 24 0c 89 4c 24 08 4d 85 c9 74 0a 41 f6 01 04 0f 85 a3 00 00 00 48 8b 43 08 4c 8d 40 e8 48 83 c3 08 49 8d 40 18 48 39 c3 74 5b <49> 8b 40 18 31 ed 4c 8d 70 e8 45 8b 28 41 f6 c5 04 75 5f 49 8b 40
>>> [  871.379866] RSP: e02b:ffffc900400f7e10 EFLAGS: 00010082
>>> [  871.381689] RAX: 0000000000000000 RBX: ffff88810541ce98 RCX: 0000000000000000
>>> [  871.383971] RDX: 0000000000000001 RSI: 0000000000000003 RDI: ffff88810541ce90
>>> [  871.386235] RBP: ffffc900400f0280 R08: ffffffffffffffe8 R09: ffffc900400f7e68
>>> [  871.388521] R10: 0000000000007ff0 R11: ffff888100ad3000 R12: ffffc900400f7e68
>>> [  871.390789] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
>>> [  871.393101] FS:  0000000000000000(0000) GS:ffff88813ff00000(0000) knlGS:0000000000000000
>>> [  871.395671] CS:  10000e030 DS: 0000 ES: 0000 CR0: 0000000080050033
>>> [  871.397863] CR2: 0000000000000000 CR3: 00000001067fe000 CR4: 0000000000040660
>>> [  871.400441] Kernel panic - not syncing: Fatal exception
>>> [  871.402171] Kernel Offset: disabled
>>> (XEN) Hardware Dom0 crashed: rebooting machine in 5 seconds.
>>>
>>> It isn't the first time I see similar crash, but I can't really
>>> reproduce it reliably. Restarted test usually passes.
>>> Note this is Xen nested in KVM, so it could very well be some oddity
>>> about nested virt, although looking at the stack trace, it's unlikely
>>> and more likely some race condition hit only on slower system.
>>
>> Recently I've got the same crash on a real system in domU too. And also
>> on nested on newer kernel 6.1.57 (here it happened in dom0). So, this is
>> still an issue and affects not only nested case :/
>>
>>> Unfortunately I don't have symbols for this kernel handy, but there is a
>>> single wake_up() call in process_writes(), so it shouldn't be an issue.
>>>
>>> Any ideas?
>>>
>>> Full log at https://openqa.qubes-os.org/tests/80779/logfile?filename=serial0.txt
>>
>> More links at https://github.com/QubesOS/qubes-issues/issues/8638,
>> including more recent stack trace.
> 
> Happens on 6.1.75 too (new stack trace I've added to the issue above,
> but it's pretty similar).
> 

-- 
Julien Grall


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: NULL pointer dereference in xenbus_thread->...
  2024-03-26 11:00     ` Julien Grall
@ 2024-05-31 22:48       ` Marek Marczykowski-Górecki
  2025-04-23 12:41         ` Marek Marczykowski-Górecki
  0 siblings, 1 reply; 13+ messages in thread
From: Marek Marczykowski-Górecki @ 2024-05-31 22:48 UTC (permalink / raw)
  To: Julien Grall; +Cc: xen-devel, Juergen Gross

[-- Attachment #1: Type: text/plain, Size: 7304 bytes --]

On Tue, Mar 26, 2024 at 11:00:50AM +0000, Julien Grall wrote:
> Hi Marek,
> 
> +Juergen for visibility
> 
> When sending a bug report, I would suggest to CC relevant people as
> otherwise it can get lost (not may people monitors Xen devel if they are not
> CCed).
> 
> Cheers,
> 
> On 25/03/2024 16:17, Marek Marczykowski-Górecki wrote:
> > On Sun, Oct 22, 2023 at 04:14:30PM +0200, Marek Marczykowski-Górecki wrote:
> > > On Mon, Aug 28, 2023 at 11:50:36PM +0200, Marek Marczykowski-Górecki wrote:
> > > > Hi,
> > > > 
> > > > I've noticed in Qubes's CI failure like this:
> > > > 
> > > > [  871.271292] BUG: kernel NULL pointer dereference, address: 0000000000000000
> > > > [  871.275290] #PF: supervisor read access in kernel mode
> > > > [  871.277282] #PF: error_code(0x0000) - not-present page
> > > > [  871.279182] PGD 106fdb067 P4D 106fdb067 PUD 106fdc067 PMD 0
> > > > [  871.281071] Oops: 0000 [#1] PREEMPT SMP NOPTI
> > > > [  871.282698] CPU: 1 PID: 28 Comm: xenbus Not tainted 6.1.43-1.qubes.fc37.x86_64 #1
> > > > [  871.285222] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.0-0-gd239552-rebuilt.opensuse.org 04/01/2014
> > > > [  871.288883] RIP: e030:__wake_up_common+0x4c/0x180
> > > > [  871.292838] Code: 24 0c 89 4c 24 08 4d 85 c9 74 0a 41 f6 01 04 0f 85 a3 00 00 00 48 8b 43 08 4c 8d 40 e8 48 83 c3 08 49 8d 40 18 48 39 c3 74 5b <49> 8b 40 18 31 ed 4c 8d 70 e8 45 8b 28 41 f6 c5 04 75 5f 49 8b 40
> > > > [  871.299776] RSP: e02b:ffffc900400f7e10 EFLAGS: 00010082
> > > > [  871.301656] RAX: 0000000000000000 RBX: ffff88810541ce98 RCX: 0000000000000000
> > > > [  871.304255] RDX: 0000000000000001 RSI: 0000000000000003 RDI: ffff88810541ce90
> > > > [  871.306714] RBP: ffffc900400f0280 R08: ffffffffffffffe8 R09: ffffc900400f7e68
> > > > [  871.309937] R10: 0000000000007ff0 R11: ffff888100ad3000 R12: ffffc900400f7e68
> > > > [  871.312326] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
> > > > [  871.314647] FS:  0000000000000000(0000) GS:ffff88813ff00000(0000) knlGS:0000000000000000
> > > > [  871.317677] CS:  10000e030 DS: 0000 ES: 0000 CR0: 0000000080050033
> > > > [  871.319644] CR2: 0000000000000000 CR3: 00000001067fe000 CR4: 0000000000040660
> > > > [  871.321973] Call Trace:
> > > > [  871.322782]  <TASK>
> > > > [  871.323494]  ? show_trace_log_lvl+0x1d3/0x2ef
> > > > [  871.324901]  ? show_trace_log_lvl+0x1d3/0x2ef
> > > > [  871.326310]  ? show_trace_log_lvl+0x1d3/0x2ef
> > > > [  871.327721]  ? __wake_up_common_lock+0x82/0xd0
> > > > [  871.329147]  ? __die_body.cold+0x8/0xd
> > > > [  871.330378]  ? page_fault_oops+0x163/0x1a0
> > > > [  871.331691]  ? exc_page_fault+0x70/0x170
> > > > [  871.332946]  ? asm_exc_page_fault+0x22/0x30
> > > > [  871.334454]  ? __wake_up_common+0x4c/0x180
> > > > [  871.335777]  __wake_up_common_lock+0x82/0xd0
> > > > [  871.337183]  ? process_writes+0x240/0x240
> > > > [  871.338461]  process_msg+0x18e/0x2f0
> > > > [  871.339627]  xenbus_thread+0x165/0x1c0
> > > > [  871.340830]  ? cpuusage_read+0x10/0x10
> > > > [  871.342032]  kthread+0xe9/0x110
> > > > [  871.343317]  ? kthread_complete_and_exit+0x20/0x20
> > > > [  871.345020]  ret_from_fork+0x22/0x30
> > > > [  871.346239]  </TASK>
> > > > [  871.347060] Modules linked in: snd_hda_codec_generic ledtrig_audio snd_hda_intel snd_intel_dspcfg snd_intel_sdw_acpi snd_hda_codec snd_hda_core snd_hwdep snd_seq snd_seq_device joydev snd_pcm intel_rapl_msr ppdev intel_rapl_common snd_timer pcspkr e1000e snd soundcore i2c_piix4 parport_pc parport loop fuse xenfs dm_crypt crct10dif_pclmul crc32_pclmul crc32c_intel polyval_clmulni polyval_generic floppy ghash_clmulni_intel sha512_ssse3 serio_raw virtio_scsi virtio_console bochs xhci_pci xhci_pci_renesas xhci_hcd qemu_fw_cfg drm_vram_helper drm_ttm_helper ttm ata_generic pata_acpi xen_privcmd xen_pciback xen_blkback xen_gntalloc xen_gntdev xen_evtchn scsi_dh_rdac scsi_dh_emc scsi_dh_alua uinput dm_multipath
> > > > [  871.368892] CR2: 0000000000000000
> > > > [  871.370160] ---[ end trace 0000000000000000 ]---
> > > > [  871.371719] RIP: e030:__wake_up_common+0x4c/0x180
> > > > [  871.373273] Code: 24 0c 89 4c 24 08 4d 85 c9 74 0a 41 f6 01 04 0f 85 a3 00 00 00 48 8b 43 08 4c 8d 40 e8 48 83 c3 08 49 8d 40 18 48 39 c3 74 5b <49> 8b 40 18 31 ed 4c 8d 70 e8 45 8b 28 41 f6 c5 04 75 5f 49 8b 40
> > > > [  871.379866] RSP: e02b:ffffc900400f7e10 EFLAGS: 00010082
> > > > [  871.381689] RAX: 0000000000000000 RBX: ffff88810541ce98 RCX: 0000000000000000
> > > > [  871.383971] RDX: 0000000000000001 RSI: 0000000000000003 RDI: ffff88810541ce90
> > > > [  871.386235] RBP: ffffc900400f0280 R08: ffffffffffffffe8 R09: ffffc900400f7e68
> > > > [  871.388521] R10: 0000000000007ff0 R11: ffff888100ad3000 R12: ffffc900400f7e68
> > > > [  871.390789] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
> > > > [  871.393101] FS:  0000000000000000(0000) GS:ffff88813ff00000(0000) knlGS:0000000000000000
> > > > [  871.395671] CS:  10000e030 DS: 0000 ES: 0000 CR0: 0000000080050033
> > > > [  871.397863] CR2: 0000000000000000 CR3: 00000001067fe000 CR4: 0000000000040660
> > > > [  871.400441] Kernel panic - not syncing: Fatal exception
> > > > [  871.402171] Kernel Offset: disabled
> > > > (XEN) Hardware Dom0 crashed: rebooting machine in 5 seconds.
> > > > 
> > > > It isn't the first time I see similar crash, but I can't really
> > > > reproduce it reliably. Restarted test usually passes.
> > > > Note this is Xen nested in KVM, so it could very well be some oddity
> > > > about nested virt, although looking at the stack trace, it's unlikely
> > > > and more likely some race condition hit only on slower system.
> > > 
> > > Recently I've got the same crash on a real system in domU too. And also
> > > on nested on newer kernel 6.1.57 (here it happened in dom0). So, this is
> > > still an issue and affects not only nested case :/
> > > 
> > > > Unfortunately I don't have symbols for this kernel handy, but there is a
> > > > single wake_up() call in process_writes(), so it shouldn't be an issue.
> > > > 
> > > > Any ideas?
> > > > 
> > > > Full log at https://openqa.qubes-os.org/tests/80779/logfile?filename=serial0.txt
> > > 
> > > More links at https://github.com/QubesOS/qubes-issues/issues/8638,
> > > including more recent stack trace.
> > 
> > Happens on 6.1.75 too (new stack trace I've added to the issue above,
> > but it's pretty similar).

Recently I've got a report from another user about similar issue, on
6.6.29 this time. I also still encounter this issue once a month or so,
but the user claims they get it much more often:
https://github.com/QubesOS/qubes-issues/issues/8638#issuecomment-2135419896
The extra conditions reported by the user are:
- old AMD system (KGPE-D16 with Opteron 6282 SE) requiring
  `spec-ctrl=ibpb-entry=no-pv` to remain usable 
- Whonix domU, which has a bunch of sysctl parameters changed, listed
  at:
  - https://github.com/Kicksecure/security-misc
  - https://github.com/Kicksecure/security-misc/blob/master/usr/lib/sysctl.d/990-security-misc.conf
  (unsure which are relevant, maybe `vm.swappiness=1`?)


-- 
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: NULL pointer dereference in xenbus_thread->...
  2024-05-31 22:48       ` Marek Marczykowski-Górecki
@ 2025-04-23 12:41         ` Marek Marczykowski-Górecki
  2025-04-30  0:59           ` Jason Andryuk
  0 siblings, 1 reply; 13+ messages in thread
From: Marek Marczykowski-Górecki @ 2025-04-23 12:41 UTC (permalink / raw)
  To: Julien Grall; +Cc: xen-devel, Juergen Gross

[-- Attachment #1: Type: text/plain, Size: 7773 bytes --]

On Sat, Jun 01, 2024 at 12:48:33AM +0200, Marek Marczykowski-Górecki wrote:
> On Tue, Mar 26, 2024 at 11:00:50AM +0000, Julien Grall wrote:
> > Hi Marek,
> > 
> > +Juergen for visibility
> > 
> > When sending a bug report, I would suggest to CC relevant people as
> > otherwise it can get lost (not may people monitors Xen devel if they are not
> > CCed).
> > 
> > Cheers,
> > 
> > On 25/03/2024 16:17, Marek Marczykowski-Górecki wrote:
> > > On Sun, Oct 22, 2023 at 04:14:30PM +0200, Marek Marczykowski-Górecki wrote:
> > > > On Mon, Aug 28, 2023 at 11:50:36PM +0200, Marek Marczykowski-Górecki wrote:
> > > > > Hi,
> > > > > 
> > > > > I've noticed in Qubes's CI failure like this:
> > > > > 
> > > > > [  871.271292] BUG: kernel NULL pointer dereference, address: 0000000000000000
> > > > > [  871.275290] #PF: supervisor read access in kernel mode
> > > > > [  871.277282] #PF: error_code(0x0000) - not-present page
> > > > > [  871.279182] PGD 106fdb067 P4D 106fdb067 PUD 106fdc067 PMD 0
> > > > > [  871.281071] Oops: 0000 [#1] PREEMPT SMP NOPTI
> > > > > [  871.282698] CPU: 1 PID: 28 Comm: xenbus Not tainted 6.1.43-1.qubes.fc37.x86_64 #1
> > > > > [  871.285222] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.0-0-gd239552-rebuilt.opensuse.org 04/01/2014
> > > > > [  871.288883] RIP: e030:__wake_up_common+0x4c/0x180
> > > > > [  871.292838] Code: 24 0c 89 4c 24 08 4d 85 c9 74 0a 41 f6 01 04 0f 85 a3 00 00 00 48 8b 43 08 4c 8d 40 e8 48 83 c3 08 49 8d 40 18 48 39 c3 74 5b <49> 8b 40 18 31 ed 4c 8d 70 e8 45 8b 28 41 f6 c5 04 75 5f 49 8b 40
> > > > > [  871.299776] RSP: e02b:ffffc900400f7e10 EFLAGS: 00010082
> > > > > [  871.301656] RAX: 0000000000000000 RBX: ffff88810541ce98 RCX: 0000000000000000
> > > > > [  871.304255] RDX: 0000000000000001 RSI: 0000000000000003 RDI: ffff88810541ce90
> > > > > [  871.306714] RBP: ffffc900400f0280 R08: ffffffffffffffe8 R09: ffffc900400f7e68
> > > > > [  871.309937] R10: 0000000000007ff0 R11: ffff888100ad3000 R12: ffffc900400f7e68
> > > > > [  871.312326] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
> > > > > [  871.314647] FS:  0000000000000000(0000) GS:ffff88813ff00000(0000) knlGS:0000000000000000
> > > > > [  871.317677] CS:  10000e030 DS: 0000 ES: 0000 CR0: 0000000080050033
> > > > > [  871.319644] CR2: 0000000000000000 CR3: 00000001067fe000 CR4: 0000000000040660
> > > > > [  871.321973] Call Trace:
> > > > > [  871.322782]  <TASK>
> > > > > [  871.323494]  ? show_trace_log_lvl+0x1d3/0x2ef
> > > > > [  871.324901]  ? show_trace_log_lvl+0x1d3/0x2ef
> > > > > [  871.326310]  ? show_trace_log_lvl+0x1d3/0x2ef
> > > > > [  871.327721]  ? __wake_up_common_lock+0x82/0xd0
> > > > > [  871.329147]  ? __die_body.cold+0x8/0xd
> > > > > [  871.330378]  ? page_fault_oops+0x163/0x1a0
> > > > > [  871.331691]  ? exc_page_fault+0x70/0x170
> > > > > [  871.332946]  ? asm_exc_page_fault+0x22/0x30
> > > > > [  871.334454]  ? __wake_up_common+0x4c/0x180
> > > > > [  871.335777]  __wake_up_common_lock+0x82/0xd0
> > > > > [  871.337183]  ? process_writes+0x240/0x240
> > > > > [  871.338461]  process_msg+0x18e/0x2f0
> > > > > [  871.339627]  xenbus_thread+0x165/0x1c0
> > > > > [  871.340830]  ? cpuusage_read+0x10/0x10
> > > > > [  871.342032]  kthread+0xe9/0x110
> > > > > [  871.343317]  ? kthread_complete_and_exit+0x20/0x20
> > > > > [  871.345020]  ret_from_fork+0x22/0x30
> > > > > [  871.346239]  </TASK>
> > > > > [  871.347060] Modules linked in: snd_hda_codec_generic ledtrig_audio snd_hda_intel snd_intel_dspcfg snd_intel_sdw_acpi snd_hda_codec snd_hda_core snd_hwdep snd_seq snd_seq_device joydev snd_pcm intel_rapl_msr ppdev intel_rapl_common snd_timer pcspkr e1000e snd soundcore i2c_piix4 parport_pc parport loop fuse xenfs dm_crypt crct10dif_pclmul crc32_pclmul crc32c_intel polyval_clmulni polyval_generic floppy ghash_clmulni_intel sha512_ssse3 serio_raw virtio_scsi virtio_console bochs xhci_pci xhci_pci_renesas xhci_hcd qemu_fw_cfg drm_vram_helper drm_ttm_helper ttm ata_generic pata_acpi xen_privcmd xen_pciback xen_blkback xen_gntalloc xen_gntdev xen_evtchn scsi_dh_rdac scsi_dh_emc scsi_dh_alua uinput dm_multipath
> > > > > [  871.368892] CR2: 0000000000000000
> > > > > [  871.370160] ---[ end trace 0000000000000000 ]---
> > > > > [  871.371719] RIP: e030:__wake_up_common+0x4c/0x180
> > > > > [  871.373273] Code: 24 0c 89 4c 24 08 4d 85 c9 74 0a 41 f6 01 04 0f 85 a3 00 00 00 48 8b 43 08 4c 8d 40 e8 48 83 c3 08 49 8d 40 18 48 39 c3 74 5b <49> 8b 40 18 31 ed 4c 8d 70 e8 45 8b 28 41 f6 c5 04 75 5f 49 8b 40
> > > > > [  871.379866] RSP: e02b:ffffc900400f7e10 EFLAGS: 00010082
> > > > > [  871.381689] RAX: 0000000000000000 RBX: ffff88810541ce98 RCX: 0000000000000000
> > > > > [  871.383971] RDX: 0000000000000001 RSI: 0000000000000003 RDI: ffff88810541ce90
> > > > > [  871.386235] RBP: ffffc900400f0280 R08: ffffffffffffffe8 R09: ffffc900400f7e68
> > > > > [  871.388521] R10: 0000000000007ff0 R11: ffff888100ad3000 R12: ffffc900400f7e68
> > > > > [  871.390789] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
> > > > > [  871.393101] FS:  0000000000000000(0000) GS:ffff88813ff00000(0000) knlGS:0000000000000000
> > > > > [  871.395671] CS:  10000e030 DS: 0000 ES: 0000 CR0: 0000000080050033
> > > > > [  871.397863] CR2: 0000000000000000 CR3: 00000001067fe000 CR4: 0000000000040660
> > > > > [  871.400441] Kernel panic - not syncing: Fatal exception
> > > > > [  871.402171] Kernel Offset: disabled
> > > > > (XEN) Hardware Dom0 crashed: rebooting machine in 5 seconds.
> > > > > 
> > > > > It isn't the first time I see similar crash, but I can't really
> > > > > reproduce it reliably. Restarted test usually passes.
> > > > > Note this is Xen nested in KVM, so it could very well be some oddity
> > > > > about nested virt, although looking at the stack trace, it's unlikely
> > > > > and more likely some race condition hit only on slower system.
> > > > 
> > > > Recently I've got the same crash on a real system in domU too. And also
> > > > on nested on newer kernel 6.1.57 (here it happened in dom0). So, this is
> > > > still an issue and affects not only nested case :/
> > > > 
> > > > > Unfortunately I don't have symbols for this kernel handy, but there is a
> > > > > single wake_up() call in process_writes(), so it shouldn't be an issue.
> > > > > 
> > > > > Any ideas?
> > > > > 
> > > > > Full log at https://openqa.qubes-os.org/tests/80779/logfile?filename=serial0.txt
> > > > 
> > > > More links at https://github.com/QubesOS/qubes-issues/issues/8638,
> > > > including more recent stack trace.
> > > 
> > > Happens on 6.1.75 too (new stack trace I've added to the issue above,
> > > but it's pretty similar).
> 
> Recently I've got a report from another user about similar issue, on
> 6.6.29 this time. I also still encounter this issue once a month or so,
> but the user claims they get it much more often:
> https://github.com/QubesOS/qubes-issues/issues/8638#issuecomment-2135419896
> The extra conditions reported by the user are:
> - old AMD system (KGPE-D16 with Opteron 6282 SE) requiring
>   `spec-ctrl=ibpb-entry=no-pv` to remain usable 
> - Whonix domU, which has a bunch of sysctl parameters changed, listed
>   at:
>   - https://github.com/Kicksecure/security-misc
>   - https://github.com/Kicksecure/security-misc/blob/master/usr/lib/sysctl.d/990-security-misc.conf
>   (unsure which are relevant, maybe `vm.swappiness=1`?)

I've got some more report confirming it's still happening on Linux
6.12.18. Is there anything I can do to help fixing this? Maybe ask users
to enable some extra logging?

-- 
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: NULL pointer dereference in xenbus_thread->...
  2025-04-23 12:41         ` Marek Marczykowski-Górecki
@ 2025-04-30  0:59           ` Jason Andryuk
  2025-04-30 10:56             ` Marek Marczykowski-Górecki
  2025-05-02  0:01             ` Marek Marczykowski-Górecki
  0 siblings, 2 replies; 13+ messages in thread
From: Jason Andryuk @ 2025-04-30  0:59 UTC (permalink / raw)
  To: Marek Marczykowski-Górecki; +Cc: Julien Grall, xen-devel, Juergen Gross

Hi Marek,

On Wed, Apr 23, 2025 at 8:42 AM Marek Marczykowski-Górecki
<marmarek@invisiblethingslab.com> wrote:
>
> On Sat, Jun 01, 2024 at 12:48:33AM +0200, Marek Marczykowski-Górecki wrote:
> > On Tue, Mar 26, 2024 at 11:00:50AM +0000, Julien Grall wrote:
> > > Hi Marek,
> > >
> > > +Juergen for visibility
> > >
> > > When sending a bug report, I would suggest to CC relevant people as
> > > otherwise it can get lost (not may people monitors Xen devel if they are not
> > > CCed).
> > >
> > > Cheers,
> > >
> > > On 25/03/2024 16:17, Marek Marczykowski-Górecki wrote:
> > > > On Sun, Oct 22, 2023 at 04:14:30PM +0200, Marek Marczykowski-Górecki wrote:
> > > > > On Mon, Aug 28, 2023 at 11:50:36PM +0200, Marek Marczykowski-Górecki wrote:
> > > > > > Hi,
> > > > > >
> > > > > > I've noticed in Qubes's CI failure like this:
> > > > > >
> > > > > > [  871.271292] BUG: kernel NULL pointer dereference, address: 0000000000000000
> > > > > > [  871.275290] #PF: supervisor read access in kernel mode
> > > > > > [  871.277282] #PF: error_code(0x0000) - not-present page
> > > > > > [  871.279182] PGD 106fdb067 P4D 106fdb067 PUD 106fdc067 PMD 0
> > > > > > [  871.281071] Oops: 0000 [#1] PREEMPT SMP NOPTI
> > > > > > [  871.282698] CPU: 1 PID: 28 Comm: xenbus Not tainted 6.1.43-1.qubes.fc37.x86_64 #1
> > > > > > [  871.285222] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.0-0-gd239552-rebuilt.opensuse.org 04/01/2014
> > > > > > [  871.288883] RIP: e030:__wake_up_common+0x4c/0x180
> > > > > > [  871.292838] Code: 24 0c 89 4c 24 08 4d 85 c9 74 0a 41 f6 01 04 0f 85 a3 00 00 00 48 8b 43 08 4c 8d 40 e8 48 83 c3 08 49 8d 40 18 48 39 c3 74 5b <49> 8b 40 18 31 ed 4c 8d 70 e8 45 8b 28 41 f6 c5 04 75 5f 49 8b 40
> > > > > > [  871.299776] RSP: e02b:ffffc900400f7e10 EFLAGS: 00010082
> > > > > > [  871.301656] RAX: 0000000000000000 RBX: ffff88810541ce98 RCX: 0000000000000000
> > > > > > [  871.304255] RDX: 0000000000000001 RSI: 0000000000000003 RDI: ffff88810541ce90
> > > > > > [  871.306714] RBP: ffffc900400f0280 R08: ffffffffffffffe8 R09: ffffc900400f7e68
> > > > > > [  871.309937] R10: 0000000000007ff0 R11: ffff888100ad3000 R12: ffffc900400f7e68
> > > > > > [  871.312326] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
> > > > > > [  871.314647] FS:  0000000000000000(0000) GS:ffff88813ff00000(0000) knlGS:0000000000000000
> > > > > > [  871.317677] CS:  10000e030 DS: 0000 ES: 0000 CR0: 0000000080050033
> > > > > > [  871.319644] CR2: 0000000000000000 CR3: 00000001067fe000 CR4: 0000000000040660
> > > > > > [  871.321973] Call Trace:
> > > > > > [  871.322782]  <TASK>
> > > > > > [  871.323494]  ? show_trace_log_lvl+0x1d3/0x2ef
> > > > > > [  871.324901]  ? show_trace_log_lvl+0x1d3/0x2ef
> > > > > > [  871.326310]  ? show_trace_log_lvl+0x1d3/0x2ef
> > > > > > [  871.327721]  ? __wake_up_common_lock+0x82/0xd0
> > > > > > [  871.329147]  ? __die_body.cold+0x8/0xd
> > > > > > [  871.330378]  ? page_fault_oops+0x163/0x1a0
> > > > > > [  871.331691]  ? exc_page_fault+0x70/0x170
> > > > > > [  871.332946]  ? asm_exc_page_fault+0x22/0x30
> > > > > > [  871.334454]  ? __wake_up_common+0x4c/0x180
> > > > > > [  871.335777]  __wake_up_common_lock+0x82/0xd0
> > > > > > [  871.337183]  ? process_writes+0x240/0x240
> > > > > > [  871.338461]  process_msg+0x18e/0x2f0
> > > > > > [  871.339627]  xenbus_thread+0x165/0x1c0
> > > > > > [  871.340830]  ? cpuusage_read+0x10/0x10
> > > > > > [  871.342032]  kthread+0xe9/0x110
> > > > > > [  871.343317]  ? kthread_complete_and_exit+0x20/0x20
> > > > > > [  871.345020]  ret_from_fork+0x22/0x30
> > > > > > [  871.346239]  </TASK>
> > > > > > [  871.347060] Modules linked in: snd_hda_codec_generic ledtrig_audio snd_hda_intel snd_intel_dspcfg snd_intel_sdw_acpi snd_hda_codec snd_hda_core snd_hwdep snd_seq snd_seq_device joydev snd_pcm intel_rapl_msr ppdev intel_rapl_common snd_timer pcspkr e1000e snd soundcore i2c_piix4 parport_pc parport loop fuse xenfs dm_crypt crct10dif_pclmul crc32_pclmul crc32c_intel polyval_clmulni polyval_generic floppy ghash_clmulni_intel sha512_ssse3 serio_raw virtio_scsi virtio_console bochs xhci_pci xhci_pci_renesas xhci_hcd qemu_fw_cfg drm_vram_helper drm_ttm_helper ttm ata_generic pata_acpi xen_privcmd xen_pciback xen_blkback xen_gntalloc xen_gntdev xen_evtchn scsi_dh_rdac scsi_dh_emc scsi_dh_alua uinput dm_multipath
> > > > > > [  871.368892] CR2: 0000000000000000
> > > > > > [  871.370160] ---[ end trace 0000000000000000 ]---
> > > > > > [  871.371719] RIP: e030:__wake_up_common+0x4c/0x180
> > > > > > [  871.373273] Code: 24 0c 89 4c 24 08 4d 85 c9 74 0a 41 f6 01 04 0f 85 a3 00 00 00 48 8b 43 08 4c 8d 40 e8 48 83 c3 08 49 8d 40 18 48 39 c3 74 5b <49> 8b 40 18 31 ed 4c 8d 70 e8 45 8b 28 41 f6 c5 04 75 5f 49 8b 40
> > > > > > [  871.379866] RSP: e02b:ffffc900400f7e10 EFLAGS: 00010082
> > > > > > [  871.381689] RAX: 0000000000000000 RBX: ffff88810541ce98 RCX: 0000000000000000
> > > > > > [  871.383971] RDX: 0000000000000001 RSI: 0000000000000003 RDI: ffff88810541ce90
> > > > > > [  871.386235] RBP: ffffc900400f0280 R08: ffffffffffffffe8 R09: ffffc900400f7e68
> > > > > > [  871.388521] R10: 0000000000007ff0 R11: ffff888100ad3000 R12: ffffc900400f7e68
> > > > > > [  871.390789] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
> > > > > > [  871.393101] FS:  0000000000000000(0000) GS:ffff88813ff00000(0000) knlGS:0000000000000000
> > > > > > [  871.395671] CS:  10000e030 DS: 0000 ES: 0000 CR0: 0000000080050033
> > > > > > [  871.397863] CR2: 0000000000000000 CR3: 00000001067fe000 CR4: 0000000000040660
> > > > > > [  871.400441] Kernel panic - not syncing: Fatal exception
> > > > > > [  871.402171] Kernel Offset: disabled
> > > > > > (XEN) Hardware Dom0 crashed: rebooting machine in 5 seconds.
> > > > > >
> > > > > > It isn't the first time I see similar crash, but I can't really
> > > > > > reproduce it reliably. Restarted test usually passes.
> > > > > > Note this is Xen nested in KVM, so it could very well be some oddity
> > > > > > about nested virt, although looking at the stack trace, it's unlikely
> > > > > > and more likely some race condition hit only on slower system.
> > > > >
> > > > > Recently I've got the same crash on a real system in domU too. And also
> > > > > on nested on newer kernel 6.1.57 (here it happened in dom0). So, this is
> > > > > still an issue and affects not only nested case :/
> > > > >
> > > > > > Unfortunately I don't have symbols for this kernel handy, but there is a
> > > > > > single wake_up() call in process_writes(), so it shouldn't be an issue.
> > > > > >
> > > > > > Any ideas?
> > > > > >
> > > > > > Full log at https://openqa.qubes-os.org/tests/80779/logfile?filename=serial0.txt
> > > > >
> > > > > More links at https://github.com/QubesOS/qubes-issues/issues/8638,
> > > > > including more recent stack trace.
> > > >
> > > > Happens on 6.1.75 too (new stack trace I've added to the issue above,
> > > > but it's pretty similar).
> >
> > Recently I've got a report from another user about similar issue, on
> > 6.6.29 this time. I also still encounter this issue once a month or so,
> > but the user claims they get it much more often:
> > https://github.com/QubesOS/qubes-issues/issues/8638#issuecomment-2135419896
> > The extra conditions reported by the user are:
> > - old AMD system (KGPE-D16 with Opteron 6282 SE) requiring
> >   `spec-ctrl=ibpb-entry=no-pv` to remain usable
> > - Whonix domU, which has a bunch of sysctl parameters changed, listed
> >   at:
> >   - https://github.com/Kicksecure/security-misc
> >   - https://github.com/Kicksecure/security-misc/blob/master/usr/lib/sysctl.d/990-security-misc.conf
> >   (unsure which are relevant, maybe `vm.swappiness=1`?)
>
> I've got some more report confirming it's still happening on Linux
> 6.12.18. Is there anything I can do to help fixing this? Maybe ask users
> to enable some extra logging?

Have you been able to capture a crash with debug symbols and run it
through scripts/decode_stacktrace.sh?

I'm curious what process_msg+0x18e/0x2f0 is.  process_writes() has a
direct call to wake_up(), but process_msg() calling req->cb(req) may
be xs_wake_up() which is a thin wrapper over wake_up().

They make me wonder if req has been free()ed and at least partially
zero-ed, but it still has wake_up() called.  The call stack here is
reminiscent of the one here
https://lore.kernel.org/xen-devel/Z_lJTyVipJJEpWg2@mail-itl/ and the
unexpected value there is 0.

I haven't actually figured out a scenario where req would be kfree()ed
early.  But the handling of kfree(req) being split between
process_msg/writes() and xs_wait_for_reply() makes me a little uneasy.

Regards,
Jason


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: NULL pointer dereference in xenbus_thread->...
  2025-04-30  0:59           ` Jason Andryuk
@ 2025-04-30 10:56             ` Marek Marczykowski-Górecki
  2025-04-30 14:29               ` Jason Andryuk
  2025-05-02  0:01             ` Marek Marczykowski-Górecki
  1 sibling, 1 reply; 13+ messages in thread
From: Marek Marczykowski-Górecki @ 2025-04-30 10:56 UTC (permalink / raw)
  To: Jason Andryuk; +Cc: Julien Grall, xen-devel, Juergen Gross

[-- Attachment #1: Type: text/plain, Size: 9647 bytes --]

On Tue, Apr 29, 2025 at 08:59:45PM -0400, Jason Andryuk wrote:
> Hi Marek,
> 
> On Wed, Apr 23, 2025 at 8:42 AM Marek Marczykowski-Górecki
> <marmarek@invisiblethingslab.com> wrote:
> >
> > On Sat, Jun 01, 2024 at 12:48:33AM +0200, Marek Marczykowski-Górecki wrote:
> > > On Tue, Mar 26, 2024 at 11:00:50AM +0000, Julien Grall wrote:
> > > > Hi Marek,
> > > >
> > > > +Juergen for visibility
> > > >
> > > > When sending a bug report, I would suggest to CC relevant people as
> > > > otherwise it can get lost (not may people monitors Xen devel if they are not
> > > > CCed).
> > > >
> > > > Cheers,
> > > >
> > > > On 25/03/2024 16:17, Marek Marczykowski-Górecki wrote:
> > > > > On Sun, Oct 22, 2023 at 04:14:30PM +0200, Marek Marczykowski-Górecki wrote:
> > > > > > On Mon, Aug 28, 2023 at 11:50:36PM +0200, Marek Marczykowski-Górecki wrote:
> > > > > > > Hi,
> > > > > > >
> > > > > > > I've noticed in Qubes's CI failure like this:
> > > > > > >
> > > > > > > [  871.271292] BUG: kernel NULL pointer dereference, address: 0000000000000000
> > > > > > > [  871.275290] #PF: supervisor read access in kernel mode
> > > > > > > [  871.277282] #PF: error_code(0x0000) - not-present page
> > > > > > > [  871.279182] PGD 106fdb067 P4D 106fdb067 PUD 106fdc067 PMD 0
> > > > > > > [  871.281071] Oops: 0000 [#1] PREEMPT SMP NOPTI
> > > > > > > [  871.282698] CPU: 1 PID: 28 Comm: xenbus Not tainted 6.1.43-1.qubes.fc37.x86_64 #1
> > > > > > > [  871.285222] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.0-0-gd239552-rebuilt.opensuse.org 04/01/2014
> > > > > > > [  871.288883] RIP: e030:__wake_up_common+0x4c/0x180
> > > > > > > [  871.292838] Code: 24 0c 89 4c 24 08 4d 85 c9 74 0a 41 f6 01 04 0f 85 a3 00 00 00 48 8b 43 08 4c 8d 40 e8 48 83 c3 08 49 8d 40 18 48 39 c3 74 5b <49> 8b 40 18 31 ed 4c 8d 70 e8 45 8b 28 41 f6 c5 04 75 5f 49 8b 40
> > > > > > > [  871.299776] RSP: e02b:ffffc900400f7e10 EFLAGS: 00010082
> > > > > > > [  871.301656] RAX: 0000000000000000 RBX: ffff88810541ce98 RCX: 0000000000000000
> > > > > > > [  871.304255] RDX: 0000000000000001 RSI: 0000000000000003 RDI: ffff88810541ce90
> > > > > > > [  871.306714] RBP: ffffc900400f0280 R08: ffffffffffffffe8 R09: ffffc900400f7e68
> > > > > > > [  871.309937] R10: 0000000000007ff0 R11: ffff888100ad3000 R12: ffffc900400f7e68
> > > > > > > [  871.312326] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
> > > > > > > [  871.314647] FS:  0000000000000000(0000) GS:ffff88813ff00000(0000) knlGS:0000000000000000
> > > > > > > [  871.317677] CS:  10000e030 DS: 0000 ES: 0000 CR0: 0000000080050033
> > > > > > > [  871.319644] CR2: 0000000000000000 CR3: 00000001067fe000 CR4: 0000000000040660
> > > > > > > [  871.321973] Call Trace:
> > > > > > > [  871.322782]  <TASK>
> > > > > > > [  871.323494]  ? show_trace_log_lvl+0x1d3/0x2ef
> > > > > > > [  871.324901]  ? show_trace_log_lvl+0x1d3/0x2ef
> > > > > > > [  871.326310]  ? show_trace_log_lvl+0x1d3/0x2ef
> > > > > > > [  871.327721]  ? __wake_up_common_lock+0x82/0xd0
> > > > > > > [  871.329147]  ? __die_body.cold+0x8/0xd
> > > > > > > [  871.330378]  ? page_fault_oops+0x163/0x1a0
> > > > > > > [  871.331691]  ? exc_page_fault+0x70/0x170
> > > > > > > [  871.332946]  ? asm_exc_page_fault+0x22/0x30
> > > > > > > [  871.334454]  ? __wake_up_common+0x4c/0x180
> > > > > > > [  871.335777]  __wake_up_common_lock+0x82/0xd0
> > > > > > > [  871.337183]  ? process_writes+0x240/0x240
> > > > > > > [  871.338461]  process_msg+0x18e/0x2f0
> > > > > > > [  871.339627]  xenbus_thread+0x165/0x1c0
> > > > > > > [  871.340830]  ? cpuusage_read+0x10/0x10
> > > > > > > [  871.342032]  kthread+0xe9/0x110
> > > > > > > [  871.343317]  ? kthread_complete_and_exit+0x20/0x20
> > > > > > > [  871.345020]  ret_from_fork+0x22/0x30
> > > > > > > [  871.346239]  </TASK>
> > > > > > > [  871.347060] Modules linked in: snd_hda_codec_generic ledtrig_audio snd_hda_intel snd_intel_dspcfg snd_intel_sdw_acpi snd_hda_codec snd_hda_core snd_hwdep snd_seq snd_seq_device joydev snd_pcm intel_rapl_msr ppdev intel_rapl_common snd_timer pcspkr e1000e snd soundcore i2c_piix4 parport_pc parport loop fuse xenfs dm_crypt crct10dif_pclmul crc32_pclmul crc32c_intel polyval_clmulni polyval_generic floppy ghash_clmulni_intel sha512_ssse3 serio_raw virtio_scsi virtio_console bochs xhci_pci xhci_pci_renesas xhci_hcd qemu_fw_cfg drm_vram_helper drm_ttm_helper ttm ata_generic pata_acpi xen_privcmd xen_pciback xen_blkback xen_gntalloc xen_gntdev xen_evtchn scsi_dh_rdac scsi_dh_emc scsi_dh_alua uinput dm_multipath
> > > > > > > [  871.368892] CR2: 0000000000000000
> > > > > > > [  871.370160] ---[ end trace 0000000000000000 ]---
> > > > > > > [  871.371719] RIP: e030:__wake_up_common+0x4c/0x180
> > > > > > > [  871.373273] Code: 24 0c 89 4c 24 08 4d 85 c9 74 0a 41 f6 01 04 0f 85 a3 00 00 00 48 8b 43 08 4c 8d 40 e8 48 83 c3 08 49 8d 40 18 48 39 c3 74 5b <49> 8b 40 18 31 ed 4c 8d 70 e8 45 8b 28 41 f6 c5 04 75 5f 49 8b 40
> > > > > > > [  871.379866] RSP: e02b:ffffc900400f7e10 EFLAGS: 00010082
> > > > > > > [  871.381689] RAX: 0000000000000000 RBX: ffff88810541ce98 RCX: 0000000000000000
> > > > > > > [  871.383971] RDX: 0000000000000001 RSI: 0000000000000003 RDI: ffff88810541ce90
> > > > > > > [  871.386235] RBP: ffffc900400f0280 R08: ffffffffffffffe8 R09: ffffc900400f7e68
> > > > > > > [  871.388521] R10: 0000000000007ff0 R11: ffff888100ad3000 R12: ffffc900400f7e68
> > > > > > > [  871.390789] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
> > > > > > > [  871.393101] FS:  0000000000000000(0000) GS:ffff88813ff00000(0000) knlGS:0000000000000000
> > > > > > > [  871.395671] CS:  10000e030 DS: 0000 ES: 0000 CR0: 0000000080050033
> > > > > > > [  871.397863] CR2: 0000000000000000 CR3: 00000001067fe000 CR4: 0000000000040660
> > > > > > > [  871.400441] Kernel panic - not syncing: Fatal exception
> > > > > > > [  871.402171] Kernel Offset: disabled
> > > > > > > (XEN) Hardware Dom0 crashed: rebooting machine in 5 seconds.
> > > > > > >
> > > > > > > It isn't the first time I see similar crash, but I can't really
> > > > > > > reproduce it reliably. Restarted test usually passes.
> > > > > > > Note this is Xen nested in KVM, so it could very well be some oddity
> > > > > > > about nested virt, although looking at the stack trace, it's unlikely
> > > > > > > and more likely some race condition hit only on slower system.
> > > > > >
> > > > > > Recently I've got the same crash on a real system in domU too. And also
> > > > > > on nested on newer kernel 6.1.57 (here it happened in dom0). So, this is
> > > > > > still an issue and affects not only nested case :/
> > > > > >
> > > > > > > Unfortunately I don't have symbols for this kernel handy, but there is a
> > > > > > > single wake_up() call in process_writes(), so it shouldn't be an issue.
> > > > > > >
> > > > > > > Any ideas?
> > > > > > >
> > > > > > > Full log at https://openqa.qubes-os.org/tests/80779/logfile?filename=serial0.txt
> > > > > >
> > > > > > More links at https://github.com/QubesOS/qubes-issues/issues/8638,
> > > > > > including more recent stack trace.
> > > > >
> > > > > Happens on 6.1.75 too (new stack trace I've added to the issue above,
> > > > > but it's pretty similar).
> > >
> > > Recently I've got a report from another user about similar issue, on
> > > 6.6.29 this time. I also still encounter this issue once a month or so,
> > > but the user claims they get it much more often:
> > > https://github.com/QubesOS/qubes-issues/issues/8638#issuecomment-2135419896
> > > The extra conditions reported by the user are:
> > > - old AMD system (KGPE-D16 with Opteron 6282 SE) requiring
> > >   `spec-ctrl=ibpb-entry=no-pv` to remain usable
> > > - Whonix domU, which has a bunch of sysctl parameters changed, listed
> > >   at:
> > >   - https://github.com/Kicksecure/security-misc
> > >   - https://github.com/Kicksecure/security-misc/blob/master/usr/lib/sysctl.d/990-security-misc.conf
> > >   (unsure which are relevant, maybe `vm.swappiness=1`?)
> >
> > I've got some more report confirming it's still happening on Linux
> > 6.12.18. Is there anything I can do to help fixing this? Maybe ask users
> > to enable some extra logging?
> 
> Have you been able to capture a crash with debug symbols and run it
> through scripts/decode_stacktrace.sh?

Not really, as I don't have debug symbols for this kernel. And I can't
reliably reproduce it myself (for me it happens about once in a
month...). I can try reproducing debug symbols, theoretically I should
have all ingredients for it.

> I'm curious what process_msg+0x18e/0x2f0 is.  process_writes() has a
> direct call to wake_up(), but process_msg() calling req->cb(req) may
> be xs_wake_up() which is a thin wrapper over wake_up().

There is a code dump in the crash message, does it help?

> They make me wonder if req has been free()ed and at least partially
> zero-ed, but it still has wake_up() called.  The call stack here is
> reminiscent of the one here
> https://lore.kernel.org/xen-devel/Z_lJTyVipJJEpWg2@mail-itl/ and the
> unexpected value there is 0.

That's interesting idea, the one above I've seen only on 6.15-rc1 (and
no latter rc). But maybe?

> I haven't actually figured out a scenario where req would be kfree()ed
> early.  But the handling of kfree(req) being split between
> process_msg/writes() and xs_wait_for_reply() makes me a little uneasy.
> 
> Regards,
> Jason

-- 
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: NULL pointer dereference in xenbus_thread->...
  2025-04-30 10:56             ` Marek Marczykowski-Górecki
@ 2025-04-30 14:29               ` Jason Andryuk
  2025-04-30 15:43                 ` Jürgen Groß
  0 siblings, 1 reply; 13+ messages in thread
From: Jason Andryuk @ 2025-04-30 14:29 UTC (permalink / raw)
  To: Marek Marczykowski-Górecki, Jason Andryuk
  Cc: Julien Grall, xen-devel, Juergen Gross

On 2025-04-30 06:56, Marek Marczykowski-Górecki wrote:
> On Tue, Apr 29, 2025 at 08:59:45PM -0400, Jason Andryuk wrote:
>> Hi Marek,
>>
>> On Wed, Apr 23, 2025 at 8:42 AM Marek Marczykowski-Górecki
>> <marmarek@invisiblethingslab.com> wrote:

>>> I've got some more report confirming it's still happening on Linux
>>> 6.12.18. Is there anything I can do to help fixing this? Maybe ask users
>>> to enable some extra logging?
>>
>> Have you been able to capture a crash with debug symbols and run it
>> through scripts/decode_stacktrace.sh?
> 
> Not really, as I don't have debug symbols for this kernel. And I can't
> reliably reproduce it myself (for me it happens about once in a
> month...). I can try reproducing debug symbols, theoretically I should
> have all ingredients for it.
> 
>> I'm curious what process_msg+0x18e/0x2f0 is.  process_writes() has a
>> direct call to wake_up(), but process_msg() calling req->cb(req) may
>> be xs_wake_up() which is a thin wrapper over wake_up().
> 
> There is a code dump in the crash message, does it help?

That's a little deeper in the call chain.  If you have a vmlinux or 
bzImage with a matching stacktrace, that would work to look up the 
address in the disassembly.  So if you don't have a matching pair, maybe 
try to catch it the next time.

>> They make me wonder if req has been free()ed and at least partially
>> zero-ed, but it still has wake_up() called.  The call stack here is
>> reminiscent of the one here
>> https://lore.kernel.org/xen-devel/Z_lJTyVipJJEpWg2@mail-itl/ and the
>> unexpected value there is 0.
> 
> That's interesting idea, the one above I've seen only on 6.15-rc1 (and
> no latter rc). But maybe?

I am guessing, so I could be wrong.  NULL pointer and unexpected zero 
value are both 0 at least.  Also Whonix looks like it may use 
init_on_free=1 to zero memory at free time.

Regards,
Jason


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: NULL pointer dereference in xenbus_thread->...
  2025-04-30 14:29               ` Jason Andryuk
@ 2025-04-30 15:43                 ` Jürgen Groß
  2025-05-03 15:08                   ` Jason Andryuk
  0 siblings, 1 reply; 13+ messages in thread
From: Jürgen Groß @ 2025-04-30 15:43 UTC (permalink / raw)
  To: Jason Andryuk, Marek Marczykowski-Górecki, Jason Andryuk
  Cc: Julien Grall, xen-devel


[-- Attachment #1.1.1: Type: text/plain, Size: 3473 bytes --]

On 30.04.25 16:29, Jason Andryuk wrote:
> On 2025-04-30 06:56, Marek Marczykowski-Górecki wrote:
>> On Tue, Apr 29, 2025 at 08:59:45PM -0400, Jason Andryuk wrote:
>>> Hi Marek,
>>>
>>> On Wed, Apr 23, 2025 at 8:42 AM Marek Marczykowski-Górecki
>>> <marmarek@invisiblethingslab.com> wrote:
> 
>>>> I've got some more report confirming it's still happening on Linux
>>>> 6.12.18. Is there anything I can do to help fixing this? Maybe ask users
>>>> to enable some extra logging?
>>>
>>> Have you been able to capture a crash with debug symbols and run it
>>> through scripts/decode_stacktrace.sh?
>>
>> Not really, as I don't have debug symbols for this kernel. And I can't
>> reliably reproduce it myself (for me it happens about once in a
>> month...). I can try reproducing debug symbols, theoretically I should
>> have all ingredients for it.
>>
>>> I'm curious what process_msg+0x18e/0x2f0 is.  process_writes() has a
>>> direct call to wake_up(), but process_msg() calling req->cb(req) may
>>> be xs_wake_up() which is a thin wrapper over wake_up().
>>
>> There is a code dump in the crash message, does it help?
> 
> That's a little deeper in the call chain.  If you have a vmlinux or bzImage with 
> a matching stacktrace, that would work to look up the address in the 
> disassembly.  So if you don't have a matching pair, maybe try to catch it the 
> next time.
> 
>>> They make me wonder if req has been free()ed and at least partially
>>> zero-ed, but it still has wake_up() called.  The call stack here is
>>> reminiscent of the one here
>>> https://lore.kernel.org/xen-devel/Z_lJTyVipJJEpWg2@mail-itl/ and the
>>> unexpected value there is 0.
>>
>> That's interesting idea, the one above I've seen only on 6.15-rc1 (and
>> no latter rc). But maybe?
> 
> I am guessing, so I could be wrong.  NULL pointer and unexpected zero value are 
> both 0 at least.  Also Whonix looks like it may use init_on_free=1 to zero 
> memory at free time.

I have looked at this issue multiple times now.

Just some remarks what IMO could go wrong (I didn't find any proof that
this really happened, though), in case someone wants to double check:

The most probably candidate for something going wrong is a use-after-free
of a struct xb_req_data element (normally named "req" in the related code).

Some words about the not really obvious locking scheme used for those
elements: A "req" is owned by a thread as long as it isn't in any of the
lists it can live in (xs_reply_list or xb_write_list). Putting it into one
of the lists or removing it again requires to hold the xb_write_mutex.

A "req" needs to be in a certain state when either in one of the lists or
when being owned by a worker thread.

I'm wondering whether it could happen that a thread waiting for a "req"
could be woken up and the "req" is being freed before the waiting thread
can react. Normally this shouldn't be possible, but "never say never".
What catched my eye today is the test of req->state == xb_req_state_wait_reply
in process_msg() just after dropping the xb_write_mutex. This looks a little
bit fishy, but OTOH the request has been just removed from the xs_reply_list,
so no mutex should be needed for that test.

Possible candidates for such an "impossible" scenario include a wrap of
xs_request_id (not very probable, though, as having 4 billion Xenstore
requests "in flight" is rather unlikely IMHO).


Juergen

[-- Attachment #1.1.2: OpenPGP public key --]
[-- Type: application/pgp-keys, Size: 3743 bytes --]

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 495 bytes --]

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: NULL pointer dereference in xenbus_thread->...
  2025-04-30  0:59           ` Jason Andryuk
  2025-04-30 10:56             ` Marek Marczykowski-Górecki
@ 2025-05-02  0:01             ` Marek Marczykowski-Górecki
  2025-05-03 14:59               ` Jason Andryuk
  1 sibling, 1 reply; 13+ messages in thread
From: Marek Marczykowski-Górecki @ 2025-05-02  0:01 UTC (permalink / raw)
  To: Jason Andryuk; +Cc: Julien Grall, xen-devel, Juergen Gross

[-- Attachment #1: Type: text/plain, Size: 13614 bytes --]

On Tue, Apr 29, 2025 at 08:59:45PM -0400, Jason Andryuk wrote:
> Hi Marek,
> 
> On Wed, Apr 23, 2025 at 8:42 AM Marek Marczykowski-Górecki
> <marmarek@invisiblethingslab.com> wrote:
> >
> > On Sat, Jun 01, 2024 at 12:48:33AM +0200, Marek Marczykowski-Górecki wrote:
> > > On Tue, Mar 26, 2024 at 11:00:50AM +0000, Julien Grall wrote:
> > > > Hi Marek,
> > > >
> > > > +Juergen for visibility
> > > >
> > > > When sending a bug report, I would suggest to CC relevant people as
> > > > otherwise it can get lost (not may people monitors Xen devel if they are not
> > > > CCed).
> > > >
> > > > Cheers,
> > > >
> > > > On 25/03/2024 16:17, Marek Marczykowski-Górecki wrote:
> > > > > On Sun, Oct 22, 2023 at 04:14:30PM +0200, Marek Marczykowski-Górecki wrote:
> > > > > > On Mon, Aug 28, 2023 at 11:50:36PM +0200, Marek Marczykowski-Górecki wrote:
> > > > > > > Hi,
> > > > > > >
> > > > > > > I've noticed in Qubes's CI failure like this:
> > > > > > >
> > > > > > > [  871.271292] BUG: kernel NULL pointer dereference, address: 0000000000000000
> > > > > > > [  871.275290] #PF: supervisor read access in kernel mode
> > > > > > > [  871.277282] #PF: error_code(0x0000) - not-present page
> > > > > > > [  871.279182] PGD 106fdb067 P4D 106fdb067 PUD 106fdc067 PMD 0
> > > > > > > [  871.281071] Oops: 0000 [#1] PREEMPT SMP NOPTI
> > > > > > > [  871.282698] CPU: 1 PID: 28 Comm: xenbus Not tainted 6.1.43-1.qubes.fc37.x86_64 #1
> > > > > > > [  871.285222] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.0-0-gd239552-rebuilt.opensuse.org 04/01/2014
> > > > > > > [  871.288883] RIP: e030:__wake_up_common+0x4c/0x180
> > > > > > > [  871.292838] Code: 24 0c 89 4c 24 08 4d 85 c9 74 0a 41 f6 01 04 0f 85 a3 00 00 00 48 8b 43 08 4c 8d 40 e8 48 83 c3 08 49 8d 40 18 48 39 c3 74 5b <49> 8b 40 18 31 ed 4c 8d 70 e8 45 8b 28 41 f6 c5 04 75 5f 49 8b 40
> > > > > > > [  871.299776] RSP: e02b:ffffc900400f7e10 EFLAGS: 00010082
> > > > > > > [  871.301656] RAX: 0000000000000000 RBX: ffff88810541ce98 RCX: 0000000000000000
> > > > > > > [  871.304255] RDX: 0000000000000001 RSI: 0000000000000003 RDI: ffff88810541ce90
> > > > > > > [  871.306714] RBP: ffffc900400f0280 R08: ffffffffffffffe8 R09: ffffc900400f7e68
> > > > > > > [  871.309937] R10: 0000000000007ff0 R11: ffff888100ad3000 R12: ffffc900400f7e68
> > > > > > > [  871.312326] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
> > > > > > > [  871.314647] FS:  0000000000000000(0000) GS:ffff88813ff00000(0000) knlGS:0000000000000000
> > > > > > > [  871.317677] CS:  10000e030 DS: 0000 ES: 0000 CR0: 0000000080050033
> > > > > > > [  871.319644] CR2: 0000000000000000 CR3: 00000001067fe000 CR4: 0000000000040660
> > > > > > > [  871.321973] Call Trace:
> > > > > > > [  871.322782]  <TASK>
> > > > > > > [  871.323494]  ? show_trace_log_lvl+0x1d3/0x2ef
> > > > > > > [  871.324901]  ? show_trace_log_lvl+0x1d3/0x2ef
> > > > > > > [  871.326310]  ? show_trace_log_lvl+0x1d3/0x2ef
> > > > > > > [  871.327721]  ? __wake_up_common_lock+0x82/0xd0
> > > > > > > [  871.329147]  ? __die_body.cold+0x8/0xd
> > > > > > > [  871.330378]  ? page_fault_oops+0x163/0x1a0
> > > > > > > [  871.331691]  ? exc_page_fault+0x70/0x170
> > > > > > > [  871.332946]  ? asm_exc_page_fault+0x22/0x30
> > > > > > > [  871.334454]  ? __wake_up_common+0x4c/0x180
> > > > > > > [  871.335777]  __wake_up_common_lock+0x82/0xd0
> > > > > > > [  871.337183]  ? process_writes+0x240/0x240
> > > > > > > [  871.338461]  process_msg+0x18e/0x2f0
> > > > > > > [  871.339627]  xenbus_thread+0x165/0x1c0
> > > > > > > [  871.340830]  ? cpuusage_read+0x10/0x10
> > > > > > > [  871.342032]  kthread+0xe9/0x110
> > > > > > > [  871.343317]  ? kthread_complete_and_exit+0x20/0x20
> > > > > > > [  871.345020]  ret_from_fork+0x22/0x30
> > > > > > > [  871.346239]  </TASK>
> > > > > > > [  871.347060] Modules linked in: snd_hda_codec_generic ledtrig_audio snd_hda_intel snd_intel_dspcfg snd_intel_sdw_acpi snd_hda_codec snd_hda_core snd_hwdep snd_seq snd_seq_device joydev snd_pcm intel_rapl_msr ppdev intel_rapl_common snd_timer pcspkr e1000e snd soundcore i2c_piix4 parport_pc parport loop fuse xenfs dm_crypt crct10dif_pclmul crc32_pclmul crc32c_intel polyval_clmulni polyval_generic floppy ghash_clmulni_intel sha512_ssse3 serio_raw virtio_scsi virtio_console bochs xhci_pci xhci_pci_renesas xhci_hcd qemu_fw_cfg drm_vram_helper drm_ttm_helper ttm ata_generic pata_acpi xen_privcmd xen_pciback xen_blkback xen_gntalloc xen_gntdev xen_evtchn scsi_dh_rdac scsi_dh_emc scsi_dh_alua uinput dm_multipath
> > > > > > > [  871.368892] CR2: 0000000000000000
> > > > > > > [  871.370160] ---[ end trace 0000000000000000 ]---
> > > > > > > [  871.371719] RIP: e030:__wake_up_common+0x4c/0x180
> > > > > > > [  871.373273] Code: 24 0c 89 4c 24 08 4d 85 c9 74 0a 41 f6 01 04 0f 85 a3 00 00 00 48 8b 43 08 4c 8d 40 e8 48 83 c3 08 49 8d 40 18 48 39 c3 74 5b <49> 8b 40 18 31 ed 4c 8d 70 e8 45 8b 28 41 f6 c5 04 75 5f 49 8b 40
> > > > > > > [  871.379866] RSP: e02b:ffffc900400f7e10 EFLAGS: 00010082
> > > > > > > [  871.381689] RAX: 0000000000000000 RBX: ffff88810541ce98 RCX: 0000000000000000
> > > > > > > [  871.383971] RDX: 0000000000000001 RSI: 0000000000000003 RDI: ffff88810541ce90
> > > > > > > [  871.386235] RBP: ffffc900400f0280 R08: ffffffffffffffe8 R09: ffffc900400f7e68
> > > > > > > [  871.388521] R10: 0000000000007ff0 R11: ffff888100ad3000 R12: ffffc900400f7e68
> > > > > > > [  871.390789] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
> > > > > > > [  871.393101] FS:  0000000000000000(0000) GS:ffff88813ff00000(0000) knlGS:0000000000000000
> > > > > > > [  871.395671] CS:  10000e030 DS: 0000 ES: 0000 CR0: 0000000080050033
> > > > > > > [  871.397863] CR2: 0000000000000000 CR3: 00000001067fe000 CR4: 0000000000040660
> > > > > > > [  871.400441] Kernel panic - not syncing: Fatal exception
> > > > > > > [  871.402171] Kernel Offset: disabled
> > > > > > > (XEN) Hardware Dom0 crashed: rebooting machine in 5 seconds.
> > > > > > >
> > > > > > > It isn't the first time I see similar crash, but I can't really
> > > > > > > reproduce it reliably. Restarted test usually passes.
> > > > > > > Note this is Xen nested in KVM, so it could very well be some oddity
> > > > > > > about nested virt, although looking at the stack trace, it's unlikely
> > > > > > > and more likely some race condition hit only on slower system.
> > > > > >
> > > > > > Recently I've got the same crash on a real system in domU too. And also
> > > > > > on nested on newer kernel 6.1.57 (here it happened in dom0). So, this is
> > > > > > still an issue and affects not only nested case :/
> > > > > >
> > > > > > > Unfortunately I don't have symbols for this kernel handy, but there is a
> > > > > > > single wake_up() call in process_writes(), so it shouldn't be an issue.
> > > > > > >
> > > > > > > Any ideas?
> > > > > > >
> > > > > > > Full log at https://openqa.qubes-os.org/tests/80779/logfile?filename=serial0.txt
> > > > > >
> > > > > > More links at https://github.com/QubesOS/qubes-issues/issues/8638,
> > > > > > including more recent stack trace.
> > > > >
> > > > > Happens on 6.1.75 too (new stack trace I've added to the issue above,
> > > > > but it's pretty similar).
> > >
> > > Recently I've got a report from another user about similar issue, on
> > > 6.6.29 this time. I also still encounter this issue once a month or so,
> > > but the user claims they get it much more often:
> > > https://github.com/QubesOS/qubes-issues/issues/8638#issuecomment-2135419896
> > > The extra conditions reported by the user are:
> > > - old AMD system (KGPE-D16 with Opteron 6282 SE) requiring
> > >   `spec-ctrl=ibpb-entry=no-pv` to remain usable
> > > - Whonix domU, which has a bunch of sysctl parameters changed, listed
> > >   at:
> > >   - https://github.com/Kicksecure/security-misc
> > >   - https://github.com/Kicksecure/security-misc/blob/master/usr/lib/sysctl.d/990-security-misc.conf
> > >   (unsure which are relevant, maybe `vm.swappiness=1`?)
> >
> > I've got some more report confirming it's still happening on Linux
> > 6.12.18. Is there anything I can do to help fixing this? Maybe ask users
> > to enable some extra logging?
> 
> Have you been able to capture a crash with debug symbols and run it
> through scripts/decode_stacktrace.sh?

This worked:

    BUG: kernel NULL pointer dereference, address: 0000000000000000
    #PF: supervisor read access in kernel mode
    #PF: error_code(0x0000) - not-present page
    PGD 0 P4D 0
    Oops: Oops: 0000 [#1] PREEMPT SMP NOPTI
    RIP: 0010:__wake_up_common (kernel/sched/wait.c:85) 
    Code: 0f 1f 44 00 00 41 57 41 56 41 55 41 54 55 53 48 89 fb 48 83 c3 08 48 83 ec 08 48 8b 47 08 89 54 24 04 48 39 c3 74 55 4d 89 c7 <4c> 8b 00 41 89 f5 41 89 ce 48 8d 78 e8 4d 8d 60 e8 eb 27 74 0c 83
    All code
    ========
       0:	0f 1f 44 00 00       	nopl   0x0(%rax,%rax,1)
       5:	41 57                	push   %r15
       7:	41 56                	push   %r14
       9:	41 55                	push   %r13
       b:	41 54                	push   %r12
       d:	55                   	push   %rbp
       e:	53                   	push   %rbx
       f:	48 89 fb             	mov    %rdi,%rbx
      12:	48 83 c3 08          	add    $0x8,%rbx
      16:	48 83 ec 08          	sub    $0x8,%rsp
      1a:	48 8b 47 08          	mov    0x8(%rdi),%rax
      1e:	89 54 24 04          	mov    %edx,0x4(%rsp)
      22:	48 39 c3             	cmp    %rax,%rbx
      25:	74 55                	je     0x7c
      27:	4d 89 c7             	mov    %r8,%r15
      2a:*	4c 8b 00             	mov    (%rax),%r8		<-- trapping instruction
      2d:	41 89 f5             	mov    %esi,%r13d
      30:	41 89 ce             	mov    %ecx,%r14d
      33:	48 8d 78 e8          	lea    -0x18(%rax),%rdi
      37:	4d 8d 60 e8          	lea    -0x18(%r8),%r12
      3b:	eb 27                	jmp    0x64
      3d:	74 0c                	je     0x4b
      3f:	83                   	.byte 0x83

    Code starting with the faulting instruction
    ===========================================
       0:	4c 8b 00             	mov    (%rax),%r8
       3:	41 89 f5             	mov    %esi,%r13d
       6:	41 89 ce             	mov    %ecx,%r14d
       9:	48 8d 78 e8          	lea    -0x18(%rax),%rdi
       d:	4d 8d 60 e8          	lea    -0x18(%r8),%r12
      11:	eb 27                	jmp    0x3a
      13:	74 0c                	je     0x21
      15:	83                   	.byte 0x83
    RSP: 0018:ffffc900009f7e40 EFLAGS: 00010082
    RAX: 0000000000000000 RBX: ffff8880109c0798 RCX: 0000000000000000
    RDX: 0000000000000001 RSI: 0000000000000003 RDI: ffff8880109c0790
    RBP: ffff8880109c0790 R08: 0000000000000000 R09: 0000000000000003
    R10: ffffc900009f7eb0 R11: ffffc9000003d000 R12: 0000000000000003
    R13: 0000000000000246 R14: 0000000000000000 R15: 0000000000000000
    FS:  0000000000000000(0000) GS:ffff888018500000(0000) knlGS:0000000000000000
    CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 0000000000000000 CR3: 00000000432a0000 CR4: 00000000000406f0
    Call Trace:
    <TASK>
    ? __die (arch/x86/kernel/dumpstack.c:421 arch/x86/kernel/dumpstack.c:434) 
    ? page_fault_oops (arch/x86/mm/fault.c:715) 
    ? exc_page_fault (arch/x86/include/asm/paravirt.h:693 arch/x86/mm/fault.c:1489 arch/x86/mm/fault.c:1539) 
    ? asm_exc_page_fault (arch/x86/include/asm/idtentry.h:623) 
    ? __wake_up_common (kernel/sched/wait.c:85) 
    ? __pfx_xenbus_thread (drivers/xen/xenbus/xenbus_comms.c:411) 
    __wake_up (include/linux/spinlock.h:406 kernel/sched/wait.c:108 kernel/sched/wait.c:127) 
    process_msg (drivers/xen/xenbus/xenbus_comms.c:311) 
    xenbus_thread (drivers/xen/xenbus/xenbus_comms.c:418) 
    ? __pfx_autoremove_wake_function (kernel/sched/wait.c:383) 
    kthread (kernel/kthread.c:389) 
    ? __pfx_kthread (kernel/kthread.c:342) 
    ret_from_fork (arch/x86/kernel/process.c:153) 
    ? __pfx_kthread (kernel/kthread.c:342) 
    ret_from_fork_asm (arch/x86/entry/entry_64.S:257) 
    </TASK>
    Modules linked in: snd_seq_dummy snd_hrtimer snd_seq snd_seq_device snd_timer snd soundcore xenfs nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 nf_tables nfnetlink binfmt_misc crct10dif_pclmul crc32_pclmul crc32c_intel polyval_clmulni polyval_generic ghash_clmulni_intel sha512_ssse3 sha256_ssse3 xen_netfront sha1_ssse3 xen_privcmd xen_gntdev xen_gntalloc xen_blkback xen_evtchn loop fuse ip_tables overlay xen_blkfront
    CR2: 0000000000000000
    ---[ end trace 0000000000000000 ]---

> I'm curious what process_msg+0x18e/0x2f0 is.  process_writes() has a
> direct call to wake_up(), but process_msg() calling req->cb(req) may
> be xs_wake_up() which is a thin wrapper over wake_up().

So, it's req->cb(req).

> They make me wonder if req has been free()ed and at least partially
> zero-ed, but it still has wake_up() called.  The call stack here is
> reminiscent of the one here
> https://lore.kernel.org/xen-devel/Z_lJTyVipJJEpWg2@mail-itl/ and the
> unexpected value there is 0.
> 
> I haven't actually figured out a scenario where req would be kfree()ed
> early.  But the handling of kfree(req) being split between
> process_msg/writes() and xs_wait_for_reply() makes me a little uneasy.
> 
> Regards,
> Jason

-- 
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: NULL pointer dereference in xenbus_thread->...
  2025-05-02  0:01             ` Marek Marczykowski-Górecki
@ 2025-05-03 14:59               ` Jason Andryuk
  0 siblings, 0 replies; 13+ messages in thread
From: Jason Andryuk @ 2025-05-03 14:59 UTC (permalink / raw)
  To: Marek Marczykowski-Górecki; +Cc: Julien Grall, xen-devel, Juergen Gross

On Thu, May 1, 2025 at 8:01 PM Marek Marczykowski-Górecki
<marmarek@invisiblethingslab.com> wrote:

> > I'm curious what process_msg+0x18e/0x2f0 is.  process_writes() has a
> > direct call to wake_up(), but process_msg() calling req->cb(req) may
> > be xs_wake_up() which is a thin wrapper over wake_up().
>
> So, it's req->cb(req).

Thanks for confirming.

Regards,
Jason


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: NULL pointer dereference in xenbus_thread->...
  2025-04-30 15:43                 ` Jürgen Groß
@ 2025-05-03 15:08                   ` Jason Andryuk
  0 siblings, 0 replies; 13+ messages in thread
From: Jason Andryuk @ 2025-05-03 15:08 UTC (permalink / raw)
  To: Jürgen Groß
  Cc: Jason Andryuk, Marek Marczykowski-Górecki, Julien Grall,
	xen-devel

On Wed, Apr 30, 2025 at 11:44 AM Jürgen Groß <jgross@suse.com> wrote:
> I have looked at this issue multiple times now.
>
> Just some remarks what IMO could go wrong (I didn't find any proof that
> this really happened, though), in case someone wants to double check:
>
> The most probably candidate for something going wrong is a use-after-free
> of a struct xb_req_data element (normally named "req" in the related code).
>
> Some words about the not really obvious locking scheme used for those
> elements: A "req" is owned by a thread as long as it isn't in any of the
> lists it can live in (xs_reply_list or xb_write_list). Putting it into one
> of the lists or removing it again requires to hold the xb_write_mutex.

While I see how the ownership is transferred between lists and the
static variables in process_writes(), xs_wait_for_reply()/read_reply()
is looking at req, too.

> A "req" needs to be in a certain state when either in one of the lists or
> when being owned by a worker thread.
>
> I'm wondering whether it could happen that a thread waiting for a "req"
> could be woken up and the "req" is being freed before the waiting thread
> can react. Normally this shouldn't be possible, but "never say never".
> What catched my eye today is the test of req->state == xb_req_state_wait_reply
> in process_msg() just after dropping the xb_write_mutex. This looks a little
> bit fishy, but OTOH the request has been just removed from the xs_reply_list,
> so no mutex should be needed for that test.
>
> Possible candidates for such an "impossible" scenario include a wrap of
> xs_request_id (not very probable, though, as having 4 billion Xenstore
> requests "in flight" is rather unlikely IMHO).

If read_reply() exits because req->err is set, and req->state is not
queued or wait_reply, the req will be freed in xs_wait_for_reply(),
but not removed from the list.  If it is on the xs_reply_list,
req->cb(req) could be called, but that would need req->state ==
wait_reply.  I don't see how this could happen, but it came to mind.

Put another way, kfree() in xs_wait_for_reply() doesn't know if the
req is on the list or not.  I wonder if it would be better to just
have
if (req->state == xb_req_state_got_reply)
    kfree(req);
else
    req->state = xb_req_state_aborted;

Maybe also some sort of assertion that req is not on a list, too.
This would at least make it clear exactly when xs_wait_for_reply() is
supposed to free req.

Given these states:
enum xb_req_state {
xb_req_state_queued,
xb_req_state_wait_reply,
xb_req_state_got_reply,
xb_req_state_aborted
};

Being explicit on who frees each state would be good.

Or if all list manipulations have to be protected by xb_write_mutex,
then just let the "caller" xs_wait_for_reply() always unlink and
kfree()?

Yet it's unclear how req->cb(req) leads to a use-after-free.
req->cb() was valid to make the call, but then the waitqueue was
(probably) zeroed?  That seems like a race.  ... or is wake_up()
faulting because xs_wait_for_reply() woke up and freed the req before
wake_up() finished with it?

__wake_up_common (kernel/sched/wait.c:85) is:

list_for_each_entry_safe_from(curr, next, &wq_head->head, entry) {

I'l need to look more into the wake_up() implementation...

Regards,
Jason


^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2025-05-03 15:09 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2023-08-28 21:50 NULL pointer dereference in xenbus_thread-> Marek Marczykowski-Górecki
2023-10-22 14:14 ` Marek Marczykowski-Górecki
2024-03-25 16:17   ` Marek Marczykowski-Górecki
2024-03-26 11:00     ` Julien Grall
2024-05-31 22:48       ` Marek Marczykowski-Górecki
2025-04-23 12:41         ` Marek Marczykowski-Górecki
2025-04-30  0:59           ` Jason Andryuk
2025-04-30 10:56             ` Marek Marczykowski-Górecki
2025-04-30 14:29               ` Jason Andryuk
2025-04-30 15:43                 ` Jürgen Groß
2025-05-03 15:08                   ` Jason Andryuk
2025-05-02  0:01             ` Marek Marczykowski-Górecki
2025-05-03 14:59               ` Jason Andryuk

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.